canu-1.6/.github/ISSUE_TEMPLATE.md

**Remove this text before submitting your issue!**

Include the canu command used, or at least tell us what options you've set.

Include the output of `canu -version` if it isn't in any outputs.  It's
reported at the start of the logging, and just before any crash report.

Include what system you're running on.  MacOS, Linux, or other?  In a
virtual machine?  On a grid?

FORMATTING TIPS:

Use `single backticks` to highlight words in text.

```
Use triple backticks surrounding any pasted-in text.
This preserves any bizarre formatting
```

Use the `Preview` button just above this space to see what the issue will
look like.

**Remove this text before submitting your issue!**

canu-1.6/README.citation

Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
Genome Research 2017; doi: https://doi.org/10.1101/gr.215087.116

canu-1.6/README.license.GPL

                    GNU GENERAL PUBLIC LICENSE
                       Version 2, June 1991

 Copyright (C) 1989, 1991 Free Software Foundation, Inc.
 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.

                            Preamble

  The licenses for most software are designed to take away your
freedom to share and change it.  By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users.
This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. Finally, any free program is threatened constantly by software patents. 
We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all. The precise terms and conditions for copying, distribution and modification follow. GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you". Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does. 1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program. 
You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License. c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. 
But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program. In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following: a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) The source code for a work means the preferred form of the work for making modifications to it. 
For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code. 4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it. 6. 
Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. 7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. 
Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 9. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation. 10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. 
Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. 
Copyright (C) This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA Also add information on how to contact you by electronic and paper mail. If the program is interactive, make it output a short notice like this when it starts in an interactive mode: Gnomovision version 69, Copyright (C) year name of author Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program. You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the program, if necessary. Here is a sample; alter the names: Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision' (which makes passes at compilers) written by James Hacker. , 1 April 1989 Ty Coon, President of Vice This General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. 
If this is what you want to do, use the GNU Library General Public
License instead of this License.

canu-1.6/README.licenses

This software constitutes a joint work and the contributions of individual
authors are subject to different licenses.  Contributions and licenses are
listed in the applicable source files, with specific details on each
individual contribution captured in the revision control system.

This software is based on RELEASE_1-3_2004-03-17 of the 'Celera Assembler'
(http://wgs-assembler.sourceforge.net) as distributed by Applera Corporation
under the GNU General Public License, version 2.  Canu branched from Celera
Assembler at revision 4587.

This software is based on the 'kmer package' (http://kmer.sourceforge.net)
as distributed by Applera Corporation under the GNU General Public License,
version 2.  Canu branched from the kmer project at revision 1994.

This software is based on 'pbutgcns' (https://github.com/pbjd/pbutgcns) and
'FALCON' (https://github.com/PacificBiosciences/FALCON) as released by
Pacific Biosciences of California, Inc.

For the full software distribution and modifications made under the GNU
General Public License version 2:

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.
You should have received (README.license.GPL) a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA For modifications made by the Battelle National Biodefense Institute under the BSD 3-Clause License: This Software was prepared for the Department of Homeland Security (DHS) by the Battelle National Biodefense Institute, LLC (BNBI) as part of contract HSHQDC-07-C-00020 to manage and operate the National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and Development Center. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of the Battelle National Biodefense Institute nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. For modifications in the public domain: PUBLIC DOMAIN NOTICE This software is "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties for the United States Government and thus cannot be copyrighted. This software is freely available to the public for use without a copyright notice. Restrictions cannot be placed on its present or future use. Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and associated data, the National Human Genome Research Institute (NHGRI), National Institutes of Health (NIH) and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. NHGRI, NIH and the U.S. Government disclaim all warranties as to performance, merchantability or fitness for any particular purpose. Please cite the authors in any work or product based on this material. For software released by Pacific Biosciences of California, Inc.: Copyright (c) 2011-2015, Pacific Biosciences of California, Inc. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted (subject to the limitations in the disclaimer below) provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 
* Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.

* Neither the name of Pacific Biosciences nor the names of its contributors
  may be used to endorse or promote products derived from this software
  without specific prior written permission.

NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY
THIS LICENSE.  THIS SOFTWARE IS PROVIDED BY PACIFIC BIOSCIENCES AND ITS
CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT
NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL PACIFIC BIOSCIENCES
OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

canu-1.6/README.md

# Canu

Canu is a fork of the [Celera Assembler](http://wgs-assembler.sourceforge.net/wiki/index.php?title=Main_Page),
designed for high-noise single-molecule sequencing (such as the
[PacBio](http://www.pacb.com) [RS II](http://www.pacb.com/products-and-services/pacbio-systems/rsii/)/[Sequel](http://www.pacb.com/products-and-services/pacbio-systems/sequel/)
or [Oxford Nanopore](https://www.nanoporetech.com/) [MinION](https://nanoporetech.com/products)).

Canu is a hierarchical assembly pipeline which runs in four steps:

* Detect overlaps in high-noise sequences using [MHAP](https://github.com/marbl/MHAP)
* Generate corrected sequence consensus
* Trim corrected sequences
* Assemble trimmed corrected sequences

## Install:

The easiest way to get started is to download a [release](http://github.com/marbl/canu/releases).

Alternatively, you can build the latest unreleased version from GitHub:

    git clone https://github.com/marbl/canu.git
    cd canu/src
    make -j

## Learn:

The [quick start](http://canu.readthedocs.io/en/latest/quick-start.html) will
get you assembling quickly, while the
[tutorial](http://canu.readthedocs.io/en/latest/tutorial.html) explains things
in more detail.

## Run:

Brief command line help:

    ..//bin/canu

Full list of parameters:

    ..//bin/canu -options

## Citation:

- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
  [Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation](https://doi.org/10.1101/gr.215087.116).
  Genome Research. (2017).

canu-1.6/addCopyrights-BuildData.pl

#!/usr/local/bin/perl

use strict;

my @dateStrings = ( "???", "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                    "JUL", "AUG", "SEP", "OCT", "NOV", "DEC" );

my %dateStrings;

$dateStrings{"Jan"} = "01";
$dateStrings{"Feb"} = "02";
$dateStrings{"Mar"} = "03";
$dateStrings{"Apr"} = "04";
$dateStrings{"May"} = "05";
$dateStrings{"Jun"} = "06";
$dateStrings{"Jul"} = "07";
$dateStrings{"Aug"} = "08";
$dateStrings{"Sep"} = "09";
$dateStrings{"Oct"} = "10";
$dateStrings{"Nov"} = "11";
$dateStrings{"Dec"} = "12";

if (! -e "logs") {
    system("git log --name-status > logs");
}

#  Update this after each copyright update commit, please.  Best method is to
#  commit the copyright changes (none of the addCopyrights files), update
#  this file, and then commit the addCopyrights files.
my %stoppingCommits;

$stoppingCommits{"6950cb74e302a97673a5ba482b3b8992eea72c37"} = 1;  #  20 AUG 2015 - Initial copyright addition.
$stoppingCommits{"72c27c95d61cb8f37e859c4039456eb2acc5c55b"} = 1;  #  19 NOV 2015 - Second copyright addition.
$stoppingCommits{"b2df5790f77d38cc31fe77a7f65360e02389f92e"} = 1;  #  04 MAR 2016
$stoppingCommits{"1ef335952342ef06ad1651a888f09c312f54dab8"} = 1;  #  18 MAY 2016
$stoppingCommits{"bbbdcd063560e5f86006ee6b8b96d2d7b80bb750"} = 1;  #  21 NOV 2016
$stoppingCommits{"64459fe33f97f6d23fe036ba1395743d0cdd03e4"} = 1;  #  17 APR 2017
$stoppingCommits{"9e9bd674b705f89817b07ff30067210c2d180f42"} = 1;  #  14 AUG 2017

open(F, "< logs") or die "Failed to open 'logs': $!\n";

$_ = <F>;  chomp;

my $author;
my $date;

while (!eof(F)) {
    my $commit;

    if (m/^commit\s+(\w+)$/) {
        $commit = $1;
    } else {
        die "Expected commit line, got '$_'\n";
    }

    last if (exists($stoppingCommits{$commit}));

    $_ = <F>;  chomp;

    if (m/^Merge/) {
        #  Merge commits include an extra line here: "Merge: a75ed40 717c0b1"
        #  No files change, and we can just process the rest of the entry as normal.
        $_ = <F>;  chomp;
    }

    if (m/walenz/i) {
        $author = "Brian P. Walenz";
    } elsif (m/koren/i) {
        $author = "Sergey Koren";
    } else {
        print STDERR "Skipping commit from '$_'\n";
        $author = undef;
    }

    $_ = <F>;  chomp;

    if (m/Date:\s+\w+\s+(\w+)\s+(\d+)\s+\d+:\d+:\d+\s+(\d+)/) {
        my $day  = substr("00$2", -2);
        my $mo   = $dateStrings{$1};
        my $year = $3;

        die "Invalid month '$1'\n" if (! defined($mo));

        $date = "$year$mo$day";
    } else {
        die "Failed to match date in '$_'\n";
    }

    #print STDERR "$commit -- $date -- $author\n";

    $_ = <F>;  chomp;

    while (! m/^commit/) {
        next if (m/^$/);    #  Blank line
        next if (m/^\s+/);  #  Comment line

        if ($_ =~ m/M\s+(\S+)$/) {
            print "A $1 nihh$date$author\n" if (defined($author));
        } elsif ($_ =~ m/A\s+(\S+)$/) {
            #  New file, treat as normal.
            print "A $1 nihh$date$author\n" if (defined($author));
        } elsif ($_ =~ m/D\s+(\S+)$/) {
            #  Deleted file, do nothing.
        } else {
            print STDERR "$_\n";
        }
    } continue {
        $_ = <F>;  chomp;
    }
}

close(F);

canu-1.6/addCopyrights.dat

A kmer/libutil/bzipBuffer.H jcvi20140411Brian P. Walenz A kmer/libutil/bzipBuffer.H jcvi20060623Brian P. Walenz A kmer/libutil/unaryEncoding.h jcvi20140411Brian P. Walenz A kmer/libutil/unaryEncoding.h none20040427Brian P. Walenz A kmer/libutil/palloc.c jcvi20110110Brian P. Walenz A kmer/libutil/palloc.c jcvi20080130Brian P. Walenz A kmer/libutil/palloc.c jcvi20071029Brian P. Walenz A kmer/libutil/palloc.c jcvi20060228Brian P. Walenz A kmer/libutil/palloc.c jcvi20050714Brian P. Walenz A kmer/libutil/palloc.c jcvi20050523Brian P. Walenz A kmer/libutil/palloc.c jcvi20050412Brian P. Walenz A kmer/libutil/palloc.c jcvi20050305Brian P. Walenz A kmer/libutil/palloc.c none20041010Brian P. Walenz A kmer/libutil/palloc.c none20040511Brian P. Walenz A kmer/libutil/palloc.c craa20040506Brian P. Walenz A kmer/libutil/palloc.c craa20040506Brian P. Walenz A kmer/libutil/palloc.c none20040430Brian P. Walenz A kmer/libutil/palloc.c none20040421Brian P. Walenz A kmer/libutil/palloc.c none20040113Brian P. Walenz A kmer/libutil/palloc.c craa20030506Brian P. Walenz A kmer/libutil/palloc.c craa20030102Brian P. Walenz A kmer/libutil/recordFile.C jcvi20140411Brian P. Walenz A kmer/libutil/recordFile.C jcvi20080826Brian P. Walenz A kmer/libutil/recordFile.C jcvi20080821Brian P. Walenz A kmer/libutil/recordFile.C jcvi20080820Brian P. Walenz A kmer/libutil/recordFile.C jcvi20080817Brian P. Walenz A kmer/libutil/recordFile.C jcvi20080814Brian P. Walenz A kmer/libutil/recordFile.C jcvi20080805Brian P. Walenz A kmer/libutil/recordFile.C jcvi20080724Brian P. Walenz A kmer/libutil/recordFile.C jcvi20080724Brian P. Walenz A kmer/libutil/recordFile.C jcvi20080708Brian P. Walenz A kmer/libutil/fibonacciEncoding.h jcvi20140411Brian P. Walenz A kmer/libutil/fibonacciEncoding.h jcvi20080609Brian P.
Walenz A kmer/libutil/fibonacciEncoding.h none20040427Brian P. Walenz A kmer/libutil/util.c jcvi20140411Brian P. Walenz A kmer/libutil/util.c jcvi20100919Brian P. Walenz A kmer/libutil/util.c jcvi20080912Brian P. Walenz A kmer/libutil/util++.H bnbi20141205Brian P. Walenz A kmer/libutil/util++.H jcvi20140411Brian P. Walenz A kmer/libutil/util++.H jcvi20080708Brian P. Walenz A kmer/libutil/util++.H jcvi20060706Brian P. Walenz A kmer/libutil/util++.H jcvi20060624Brian P. Walenz A kmer/libutil/util++.H jcvi20060613Brian P. Walenz A kmer/libutil/util++.H jcvi20060425Brian P. Walenz A kmer/libutil/util++.H jcvi20051127Brian P. Walenz A kmer/libutil/util++.H jcvi20050712Brian P. Walenz A kmer/libutil/util++.H jcvi20050506Brian P. Walenz A kmer/libutil/util++.H jcvi20050401Brian P. Walenz A kmer/libutil/util++.H jcvi20050314Brian P. Walenz A kmer/libutil/util++.H jcvi20050310Brian P. Walenz A kmer/libutil/util++.H jcvi20050309Brian P. Walenz A kmer/libutil/util++.H jcvi20050209Brian P. Walenz A kmer/libutil/util++.H jcvi20050207Brian P. Walenz A kmer/libutil/util++.H none20041010Brian P. Walenz A kmer/libutil/util++.H none20040818Brian P. Walenz A kmer/libutil/util++.H craa20040802Brian P. Walenz A kmer/libutil/util++.H none20040610Brian P. Walenz A kmer/libutil/util++.H none20040525Brian P. Walenz A kmer/libutil/util++.H none20040524Brian P. Walenz A kmer/libutil/util++.H none20040514Brian P. Walenz A kmer/libutil/util++.H none20040511Brian P. Walenz A kmer/libutil/util++.H none20040511Brian P. Walenz A kmer/libutil/util++.H craa20040506Brian P. Walenz A kmer/libutil/util++.H none20040430Brian P. Walenz A kmer/libutil/util++.H none20040427Brian P. Walenz A kmer/libutil/util++.H none20040421Brian P. Walenz A kmer/libutil/logMsg.H jcvi20140411Brian P. Walenz A kmer/libutil/logMsg.H jcvi20130329Brian P. Walenz A kmer/libutil/logMsg.H jcvi20120510Brian P. Walenz A kmer/libutil/logMsg.H jcvi20120508Brian P. Walenz A kmer/libutil/logMsg.H jcvi20070102Brian P. 
Walenz A kmer/libutil/logMsg.H jcvi20061108Brian P. Walenz A kmer/libutil/logMsg.H jcvi20060706Brian P. Walenz A kmer/libutil/logMsg.H jcvi20060706Brian P. Walenz A kmer/libutil/logMsg.H jcvi20050712Brian P. Walenz A kmer/libutil/logMsg.H jcvi20050305Brian P. Walenz A kmer/libutil/logMsg.H none20041010Brian P. Walenz A kmer/libutil/logMsg.H none20040614Brian P. Walenz A kmer/libutil/logMsg.H none20040610Brian P. Walenz A kmer/libutil/endianess.H jcvi20140411Brian P. Walenz A kmer/libutil/endianess.H jcvi20071226Brian P. Walenz A kmer/libutil/endianess.H jcvi20060724Brian P. Walenz A kmer/libutil/endianess.H jcvi20060512Brian P. Walenz A kmer/libutil/endianess.H jcvi20051127Brian P. Walenz A kmer/libutil/eliasGammaEncoding.h jcvi20140411Brian P. Walenz A kmer/libutil/eliasGammaEncoding.h none20040427Brian P. Walenz A kmer/libutil/fibonacciNumbers.C jcvi20140411Brian P. Walenz A kmer/libutil/fibonacciNumbers.C none20041010Brian P. Walenz A kmer/libutil/fibonacciNumbers.C none20040427Brian P. Walenz A kmer/libutil/uint32List.H jcvi20140411Brian P. Walenz A kmer/libutil/uint32List.H jcvi20060613Brian P. Walenz D kmer/libutil/uint32List.H kmer/libutil/u32bitList.H A kmer/libutil/qsort_mt.c jcvi20080724Brian P. Walenz A kmer/libutil/qsort_mt.c jcvi20080609Brian P. Walenz A kmer/libutil/util.h jcvi20140411Brian P. Walenz A kmer/libutil/util.h jcvi20120118Brian P. Walenz A kmer/libutil/util.h jcvi20111229Brian P. Walenz A kmer/libutil/util.h jcvi20110110Brian P. Walenz A kmer/libutil/util.h jcvi20100919Brian P. Walenz A kmer/libutil/util.h jcvi20080912Brian P. Walenz A kmer/libutil/util.h jcvi20080828Brian P. Walenz A kmer/libutil/util.h jcvi20080817Brian P. Walenz A kmer/libutil/util.h jcvi20080814Brian P. Walenz A kmer/libutil/util.h jcvi20080609Brian P. Walenz A kmer/libutil/util.h jcvi20080514Brian P. Walenz A kmer/libutil/util.h jcvi20080130Brian P. Walenz A kmer/libutil/util.h jcvi20071029Brian P. Walenz A kmer/libutil/util.h jcvi20070823Brian P. 
Walenz A kmer/libutil/util.h jcvi20070320Brian P. Walenz A kmer/libutil/util.h jcvi20060623Brian P. Walenz A kmer/libutil/util.h jcvi20060621Brian P. Walenz A kmer/libutil/util.h jcvi20060308Brian P. Walenz A kmer/libutil/util.h jcvi20060228Brian P. Walenz A kmer/libutil/util.h jcvi20051127Brian P. Walenz A kmer/libutil/util.h jcvi20050808Brian P. Walenz A kmer/libutil/util.h jcvi20050712Brian P. Walenz A kmer/libutil/util.h jcvi20050712Brian P. Walenz A kmer/libutil/util.h jcvi20050523Brian P. Walenz A kmer/libutil/util.h jcvi20050506Brian P. Walenz A kmer/libutil/util.h jcvi20050412Brian P. Walenz A kmer/libutil/util.h jcvi20050320Brian P. Walenz A kmer/libutil/util.h jcvi20050313Brian P. Walenz A kmer/libutil/util.h jcvi20050310Brian P. Walenz A kmer/libutil/util.h jcvi20050309Brian P. Walenz A kmer/libutil/util.h jcvi20050207Brian P. Walenz A kmer/libutil/util.h none20041010Brian P. Walenz A kmer/libutil/util.h none20040818Brian P. Walenz A kmer/libutil/util.h craa20040802Brian P. Walenz A kmer/libutil/util.h none20040528Brian P. Walenz A kmer/libutil/util.h none20040511Brian P. Walenz A kmer/libutil/util.h craa20040506Brian P. Walenz A kmer/libutil/util.h craa20040506Brian P. Walenz A kmer/libutil/util.h none20040421Brian P. Walenz A kmer/libutil/util.h none20040421Brian P. Walenz A kmer/libutil/recordFile.H jcvi20140411Brian P. Walenz A kmer/libutil/recordFile.H jcvi20080820Brian P. Walenz A kmer/libutil/recordFile.H jcvi20080818Brian P. Walenz A kmer/libutil/recordFile.H jcvi20080817Brian P. Walenz A kmer/libutil/recordFile.H jcvi20080814Brian P. Walenz A kmer/libutil/recordFile.H jcvi20080805Brian P. Walenz A kmer/libutil/recordFile.H jcvi20080708Brian P. Walenz A kmer/libutil/unaryEncodingTester.C jcvi20140411Brian P. Walenz A kmer/libutil/unaryEncodingTester.C jcvi20080129Brian P. Walenz A kmer/libutil/unaryEncodingTester.C none20041010Brian P. Walenz A kmer/libutil/unaryEncodingTester.C none20040427Brian P. 
Walenz A kmer/libutil/bzipBuffer.C jcvi20140411Brian P. Walenz A kmer/libutil/bzipBuffer.C jcvi20060623Brian P. Walenz A kmer/libutil/file.c jcvi20140411Brian P. Walenz A kmer/libutil/file.c jcvi20121015Brian P. Walenz A kmer/libutil/file.c jcvi20111229Brian P. Walenz A kmer/libutil/file.c jcvi20080318Brian P. Walenz A kmer/libutil/file.c jcvi20070320Brian P. Walenz A kmer/libutil/file.c jcvi20060308Brian P. Walenz A kmer/libutil/file.c jcvi20060228Brian P. Walenz A kmer/libutil/file.c jcvi20050712Brian P. Walenz A kmer/libutil/file.c jcvi20050309Brian P. Walenz A kmer/libutil/file.c none20041010Brian P. Walenz A kmer/libutil/file.c none20040813Brian P. Walenz A kmer/libutil/file.c none20040812Brian P. Walenz A kmer/libutil/file.c craa20040802Brian P. Walenz A kmer/libutil/file.c none20040430Brian P. Walenz A kmer/libutil/file.c none20040421Brian P. Walenz A kmer/libutil/file.c craa20040401Brian P. Walenz A kmer/libutil/file.c none20040330Brian P. Walenz A kmer/libutil/file.c none20040330Brian P. Walenz A kmer/libutil/file.c craa20040218Clark Mobarry A kmer/libutil/file.c none20040120Brian P. Walenz A kmer/libutil/file.c none20040112Brian P. Walenz A kmer/libutil/file.c craa20040106Brian P. Walenz A kmer/libutil/file.c craa20030908Brian P. Walenz A kmer/libutil/file.c craa20030721Brian P. Walenz A kmer/libutil/file.c craa20030624Brian P. Walenz A kmer/libutil/file.c craa20030507Brian P. Walenz A kmer/libutil/file.c craa20030506Brian P. Walenz A kmer/libutil/eliasDeltaEncoding.h jcvi20140411Brian P. Walenz A kmer/libutil/eliasDeltaEncoding.h none20040427Brian P. Walenz A kmer/libutil/test/test-palloc.c none20041010Brian P. Walenz A kmer/libutil/test/test-palloc.c craa20040506Brian P. Walenz A kmer/libutil/test/test-palloc.c craa20040506Brian P. Walenz A kmer/libutil/test/test-intervalList.C bnbi20141014Brian P. Walenz A kmer/libutil/test/test-intervalList.C jcvi20140411Brian P. Walenz A kmer/libutil/test/test-intervalList.C jcvi20061027Brian P. 
Walenz A kmer/libutil/test/test-intervalList.C jcvi20061023Brian P. Walenz A kmer/libutil/test/test-intervalList.C jcvi20061023Brian P. Walenz A kmer/libutil/test/test-intervalList.C none20041010Brian P. Walenz A kmer/libutil/test/test-intervalList.C craa20040506Brian P. Walenz A kmer/libutil/test/test-recordFile.C jcvi20140411Brian P. Walenz A kmer/libutil/test/test-recordFile.C jcvi20080909Brian P. Walenz A kmer/libutil/test/test-recordFile.C jcvi20080724Brian P. Walenz A kmer/libutil/test/endianess.c jcvi20140411Brian P. Walenz A kmer/libutil/test/endianess.c jcvi20060724Brian P. Walenz A kmer/libutil/test/atomic.C jcvi20101102Brian P. Walenz A kmer/libutil/test/atomic.C jcvi20060724Brian P. Walenz A kmer/libutil/test/test-bitPackedFile.C jcvi20140411Brian P. Walenz A kmer/libutil/test/test-bitPackedFile.C jcvi20070320Brian P. Walenz A kmer/libutil/test/test-bitPackedFile.C jcvi20061022Brian P. Walenz A kmer/libutil/test/test-bitPackedFile.C none20041010Brian P. Walenz A kmer/libutil/test/test-bitPackedFile.C craa20040506Brian P. Walenz A kmer/libutil/test/test-bitPackedFile.C craa20040506Brian P. Walenz A kmer/libutil/test/test-mmap.c jcvi20140411Brian P. Walenz A kmer/libutil/test/test-mmap.c none20041010Brian P. Walenz A kmer/libutil/test/test-mmap.c craa20040506Brian P. Walenz A kmer/libutil/test/order.C jcvi20140411Brian P. Walenz A kmer/libutil/test/order.C jcvi20051204Brian P. Walenz A kmer/libutil/test/test-bitPacking.C jcvi20140411Brian P. Walenz A kmer/libutil/test/test-bitPacking.C jcvi20071205Brian P. Walenz A kmer/libutil/test/test-bitPacking.C none20041010Brian P. Walenz A kmer/libutil/test/test-bitPacking.C none20040524Brian P. Walenz A kmer/libutil/test/test-bitPacking.C craa20040506Brian P. Walenz A kmer/libutil/test/test-bitPacking.C craa20040506Brian P. Walenz A kmer/libutil/test/tcat.C jcvi20080909Brian P. Walenz A kmer/libutil/test/tcat.C jcvi20070102Brian P. Walenz A kmer/libutil/test/tcat.C jcvi20060308Brian P. 
Walenz A kmer/libutil/test/tcat.C jcvi20060303Brian P. Walenz A kmer/libutil/test/tcat.C jcvi20060302Brian P. Walenz A kmer/libutil/test/test-bigQueue.C jcvi20050309Brian P. Walenz A kmer/libutil/test/test-bitPackedArray.C jcvi20140411Brian P. Walenz A kmer/libutil/test/test-bitPackedArray.C jcvi20070320Brian P. Walenz A kmer/libutil/test/test-bitPackedArray.C jcvi20061026Brian P. Walenz A kmer/libutil/test/test-bitPackedArray.C jcvi20061022Brian P. Walenz A kmer/libutil/test/test-bitPackedArray.C jcvi20050207Brian P. Walenz A kmer/libutil/test/test-types.c jcvi20140411Brian P. Walenz A kmer/libutil/test/test-types.c none20041010Brian P. Walenz A kmer/libutil/test/test-types.c craa20040506Brian P. Walenz A kmer/libutil/test/test-types.c craa20040506Brian P. Walenz A kmer/libutil/test/test-bzipBuffer.C jcvi20140411Brian P. Walenz A kmer/libutil/test/test-bzipBuffer.C jcvi20060623Brian P. Walenz A kmer/libutil/test/test-freeDiskSpace.c none20041010Brian P. Walenz A kmer/libutil/test/test-freeDiskSpace.c craa20040802Brian P. Walenz A kmer/libutil/test/test-readBuffer.C jcvi20140411Brian P. Walenz A kmer/libutil/test/test-readBuffer.C jcvi20080912Brian P. Walenz A kmer/libutil/test/test-readBuffer.C jcvi20080911Brian P. Walenz A kmer/libutil/test/test-readBuffer.C jcvi20080909Brian P. Walenz A kmer/libutil/test/test-readBuffer.C none20041010Brian P. Walenz A kmer/libutil/test/test-readBuffer.C craa20040506Brian P. Walenz A kmer/libutil/test/test-logMsg.C jcvi20060706Brian P. Walenz A kmer/libutil/test/test-logMsg.C none20040610Brian P. Walenz A kmer/libutil/generalizedUnaryEncoding.h jcvi20140411Brian P. Walenz A kmer/libutil/generalizedUnaryEncoding.h jcvi20070102Brian P. Walenz A kmer/libutil/generalizedUnaryEncoding.h none20040427Brian P. Walenz A kmer/libkmer/positionDB-access.C jcvi20140411Brian P. Walenz A kmer/libkmer/positionDB-access.C jcvi20081016Brian P. Walenz A kmer/libkmer/positionDB-access.C jcvi20080828Brian P. 
Walenz A kmer/libkmer/positionDB-access.C jcvi20080808Brian P. Walenz A kmer/libkmer/positionDB-access.C jcvi20080303Brian P. Walenz A kmer/libkmer/positionDB-access.C jcvi20080214Brian P. Walenz A kmer/libkmer/positionDB-access.C jcvi20071217Brian P. Walenz A kmer/libkmer/positionDB-access.C jcvi20071211Brian P. Walenz A kmer/libkmer/positionDB-access.C jcvi20071111Brian P. Walenz A kmer/libkmer/positionDB-access.C jcvi20061211Brian P. Walenz A kmer/libkmer/positionDB-access.C jcvi20060710Brian P. Walenz A kmer/libkmer/positionDB-access.C jcvi20050207Brian P. Walenz A kmer/libkmer/positionDB-access.C none20041010Brian P. Walenz A kmer/libkmer/positionDB-access.C none20040715Brian P. Walenz A kmer/libkmer/positionDB-access.C none20040511Brian P. Walenz A kmer/libkmer/positionDB-access.C none20040421Brian P. Walenz A kmer/libkmer/positionDB-access.C craa20030814Brian P. Walenz A kmer/libkmer/positionDB-access.C craa20030506Brian P. Walenz A kmer/libkmer/positionDB-access.C craa20030102Brian P. Walenz A kmer/libkmer/positionDB-dump.C jcvi20140411Brian P. Walenz A kmer/libkmer/positionDB-dump.C jcvi20081016Brian P. Walenz A kmer/libkmer/positionDB-dump.C jcvi20080828Brian P. Walenz A kmer/libkmer/positionDB-dump.C jcvi20071211Brian P. Walenz A kmer/libkmer/positionDB-dump.C none20041010Brian P. Walenz A kmer/libkmer/positionDB-dump.C none20040421Brian P. Walenz A kmer/libkmer/positionDB-dump.C craa20030814Brian P. Walenz A kmer/libkmer/positionDB-dump.C craa20030506Brian P. Walenz A kmer/libkmer/positionDB-dump.C craa20030102Brian P. Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20140411Brian P. Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20080828Brian P. Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20080818Brian P. Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20080814Brian P. Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20080814Brian P. Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20080718Brian P. 
Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20080707Brian P. Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20080627Brian P. Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20080625Brian P. Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20080623Brian P. Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20080303Brian P. Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20080214Brian P. Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20080130Brian P. Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20071211Brian P. Walenz A kmer/libkmer/positionDB-mismatch.C jcvi20071111Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20140411Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20130502Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20130103Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20130103Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20120510Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20110902Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20100828Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20071010Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20070607Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20070326Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20051204Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20051127Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20051127Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20050523Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C jcvi20050320Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C none20040412Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C craa20031020Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C craa20030915Brian P. Walenz A kmer/libkmer/existDB-create-from-meryl.C craa20030908Brian P. 
Walenz A kmer/libkmer/existDB-create-from-meryl.C craa20030908Brian P. Walenz A kmer/libkmer/existDB.H jcvi20140411Brian P. Walenz A kmer/libkmer/existDB.H jcvi20130502Brian P. Walenz A kmer/libkmer/existDB.H jcvi20130329Brian P. Walenz A kmer/libkmer/existDB.H jcvi20130308Brian P. Walenz A kmer/libkmer/existDB.H jcvi20130103Brian P. Walenz A kmer/libkmer/existDB.H jcvi20120510Brian P. Walenz A kmer/libkmer/existDB.H jcvi20120508Brian P. Walenz A kmer/libkmer/existDB.H none20101206Liliana Florea A kmer/libkmer/existDB.H jcvi20071029Brian P. Walenz A kmer/libkmer/existDB.H jcvi20071010Brian P. Walenz A kmer/libkmer/existDB.H jcvi20070607Brian P. Walenz A kmer/libkmer/existDB.H jcvi20070326Brian P. Walenz A kmer/libkmer/existDB.H jcvi20051204Brian P. Walenz A kmer/libkmer/existDB.H jcvi20051127Brian P. Walenz A kmer/libkmer/existDB.H jcvi20051127Brian P. Walenz A kmer/libkmer/existDB.H jcvi20050329Brian P. Walenz A kmer/libkmer/existDB.H jcvi20050320Brian P. Walenz A kmer/libkmer/existDB.H none20041010Brian P. Walenz A kmer/libkmer/existDB.H none20040511Brian P. Walenz A kmer/libkmer/existDB.H none20040421Brian P. Walenz A kmer/libkmer/existDB.H craa20031020Brian P. Walenz A kmer/libkmer/existDB.H craa20030908Brian P. Walenz A kmer/libkmer/existDB.H craa20030805Brian P. Walenz A kmer/libkmer/existDB.H craa20030220Brian P. Walenz A kmer/libkmer/existDB.H craa20030213Brian P. Walenz A kmer/libkmer/existDB.H craa20030102Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20140411Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20110330Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20080925Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20080912Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20080909Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20080902Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20080814Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20080814Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20080303Brian P. 
Walenz A kmer/libkmer/driver-posDB.C jcvi20080214Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20080130Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20071208Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20071028Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20071010Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20070913Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20070823Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20070817Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20070326Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20051204Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20050913Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20050911Brian P. Walenz A kmer/libkmer/driver-posDB.C jcvi20050519Brian P. Walenz A kmer/libkmer/driver-posDB.C none20041010Brian P. Walenz A kmer/libkmer/driver-posDB.C none20040524Brian P. Walenz A kmer/libkmer/driver-posDB.C none20040514Brian P. Walenz A kmer/libkmer/driver-posDB.C none20040511Brian P. Walenz A kmer/libkmer/driver-posDB.C none20040430Brian P. Walenz A kmer/libkmer/driver-posDB.C none20040430Brian P. Walenz A kmer/libkmer/driver-posDB.C craa20030918Brian P. Walenz A kmer/libkmer/driver-posDB.C craa20030915Brian P. Walenz A kmer/libkmer/driver-posDB.C craa20030814Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20140411Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20081016Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20080909Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20080902Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20080828Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20080814Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20080814Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20080808Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20080627Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20080625Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20080623Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20080303Brian P. 
Walenz A kmer/libkmer/positionDB.H jcvi20080214Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20071217Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20071211Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20071208Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20071111Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20071010Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20070326Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20061211Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20060707Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20051204Brian P. Walenz A kmer/libkmer/positionDB.H jcvi20050913Brian P. Walenz A kmer/libkmer/positionDB.H none20041010Brian P. Walenz A kmer/libkmer/positionDB.H none20040715Brian P. Walenz A kmer/libkmer/positionDB.H none20040511Brian P. Walenz A kmer/libkmer/positionDB.H none20040511Brian P. Walenz A kmer/libkmer/positionDB.H none20040421Brian P. Walenz A kmer/libkmer/positionDB.H craa20031021Brian P. Walenz A kmer/libkmer/positionDB.H craa20030814Brian P. Walenz A kmer/libkmer/positionDB.H craa20030506Brian P. Walenz A kmer/libkmer/positionDB.H craa20030415Brian P. Walenz A kmer/libkmer/positionDB.H craa20030402Brian P. Walenz A kmer/libkmer/positionDB.H craa20030102Brian P. Walenz A kmer/libkmer/existDB-state.C jcvi20140411Brian P. Walenz A kmer/libkmer/existDB-state.C jcvi20130103Brian P. Walenz A kmer/libkmer/existDB-state.C jcvi20070607Brian P. Walenz A kmer/libkmer/existDB-state.C jcvi20070326Brian P. Walenz A kmer/libkmer/existDB-state.C jcvi20050320Brian P. Walenz A kmer/libkmer/existDB-state.C none20041010Brian P. Walenz A kmer/libkmer/existDB-state.C none20040421Brian P. Walenz A kmer/libkmer/existDB-state.C none20040412Brian P. Walenz A kmer/libkmer/existDB-state.C craa20030908Brian P. Walenz A kmer/libkmer/existDB-state.C craa20030908Brian P. Walenz A kmer/libkmer/existDB-create-from-sequence.C bnbi20140831Brian P. Walenz A kmer/libkmer/existDB-create-from-sequence.C jcvi20140411Brian P. 
Walenz A kmer/libkmer/existDB-create-from-sequence.C jcvi20130103Brian P. Walenz A kmer/libkmer/existDB-create-from-sequence.C jcvi20130103Brian P. Walenz A kmer/libkmer/existDB-create-from-sequence.C jcvi20120510Brian P. Walenz A kmer/libkmer/existDB-create-from-sequence.C jcvi20120508Brian P. Walenz A kmer/libkmer/driver-existDB.C jcvi20140411Brian P. Walenz A kmer/libkmer/driver-existDB.C jcvi20130103Brian P. Walenz A kmer/libkmer/driver-existDB.C jcvi20111229Brian P. Walenz A kmer/libkmer/driver-existDB.C jcvi20080909Brian P. Walenz A kmer/libkmer/driver-existDB.C jcvi20071029Brian P. Walenz A kmer/libkmer/driver-existDB.C jcvi20070913Brian P. Walenz A kmer/libkmer/driver-existDB.C jcvi20070823Brian P. Walenz A kmer/libkmer/driver-existDB.C jcvi20070817Brian P. Walenz A kmer/libkmer/driver-existDB.C jcvi20070326Brian P. Walenz A kmer/libkmer/driver-existDB.C jcvi20051204Brian P. Walenz A kmer/libkmer/driver-existDB.C none20041010Brian P. Walenz A kmer/libkmer/driver-existDB.C none20040524Brian P. Walenz A kmer/libkmer/driver-existDB.C none20040511Brian P. Walenz A kmer/libkmer/driver-existDB.C none20040430Brian P. Walenz A kmer/libkmer/driver-existDB.C craa20031020Brian P. Walenz A kmer/libkmer/driver-existDB.C craa20031016Brian P. Walenz A kmer/libkmer/driver-existDB.C craa20030908Brian P. Walenz A kmer/libkmer/driver-existDB.C craa20030805Brian P. Walenz A kmer/libkmer/driver-existDB.C craa20030220Brian P. Walenz A kmer/libkmer/positionDB-sort.C jcvi20140411Brian P. Walenz A kmer/libkmer/positionDB-sort.C jcvi20111229Brian P. Walenz A kmer/libkmer/positionDB-sort.C jcvi20080312Brian P. Walenz A kmer/libkmer/positionDB-sort.C jcvi20080129Brian P. Walenz A kmer/libkmer/positionDB-sort.C jcvi20071217Brian P. Walenz A kmer/libkmer/positionDB-sort.C jcvi20071211Brian P. Walenz A kmer/libkmer/positionDB-sort.C jcvi20071208Brian P. Walenz A kmer/libkmer/positionDB-sort.C jcvi20060707Brian P. Walenz A kmer/libkmer/positionDB-sort.C none20041010Brian P. 
Walenz A kmer/libkmer/positionDB-sort.C none20040421Brian P. Walenz A kmer/libkmer/positionDB-sort.C craa20030506Brian P. Walenz A kmer/libkmer/positionDB-sort.C craa20030102Brian P. Walenz A kmer/libkmer/positionDB-file.C jcvi20140411Brian P. Walenz A kmer/libkmer/positionDB-file.C jcvi20080828Brian P. Walenz A kmer/libkmer/positionDB-file.C jcvi20080818Brian P. Walenz A kmer/libkmer/positionDB-file.C jcvi20080817Brian P. Walenz A kmer/libkmer/positionDB-file.C jcvi20080814Brian P. Walenz A kmer/libkmer/positionDB-file.C jcvi20071211Brian P. Walenz A kmer/libkmer/positionDB-file.C jcvi20051204Brian P. Walenz A kmer/libkmer/positionDB-file.C jcvi20050712Brian P. Walenz A kmer/libkmer/positionDB-file.C jcvi20050519Brian P. Walenz A kmer/libkmer/positionDB-file.C jcvi20050427Brian P. Walenz A kmer/libkmer/positionDB-file.C jcvi20050316Brian P. Walenz A kmer/libkmer/positionDB-file.C jcvi20050312Brian P. Walenz A kmer/libkmer/positionDB-file.C none20040430Brian P. Walenz A kmer/libkmer/positionDB-file.C none20040430Brian P. Walenz A kmer/libkmer/positionDB-file.C craa20040401Brian P. Walenz A kmer/libkmer/positionDB-file.C craa20030814Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20140411Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20121015Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20081001Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20080909Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20080902Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20080828Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20080818Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20080814Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20080814Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20080707Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20080625Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20080318Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20080312Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20080303Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20080214Brian P. 
Walenz A kmer/libkmer/positionDB.C jcvi20080129Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20071217Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20071211Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20071208Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20071208Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20071111Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20071028Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20071016Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20071010Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20071009Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20070326Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20070320Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20060914Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20060707Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20060706Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20060425Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20060407Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20051206Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20051204Brian P. Walenz A kmer/libkmer/positionDB.C jcvi20050519Brian P. Walenz A kmer/libkmer/positionDB.C none20041010Brian P. Walenz A kmer/libkmer/positionDB.C none20040715Brian P. Walenz A kmer/libkmer/positionDB.C none20040623Brian P. Walenz A kmer/libkmer/positionDB.C none20040524Brian P. Walenz A kmer/libkmer/positionDB.C none20040514Brian P. Walenz A kmer/libkmer/positionDB.C none20040511Brian P. Walenz A kmer/libkmer/positionDB.C none20040430Brian P. Walenz A kmer/libkmer/positionDB.C none20040430Brian P. Walenz A kmer/libkmer/positionDB.C none20040421Brian P. Walenz A kmer/libkmer/positionDB.C craa20040405Brian P. Walenz A kmer/libkmer/positionDB.C craa20040212Clark Mobarry A kmer/libkmer/positionDB.C none20040121Brian P. Walenz A kmer/libkmer/positionDB.C craa20031021Brian P. Walenz A kmer/libkmer/positionDB.C craa20031021Brian P. Walenz A kmer/libkmer/positionDB.C craa20030915Brian P. 
Walenz A kmer/libkmer/positionDB.C craa20030814Brian P. Walenz A kmer/libkmer/positionDB.C craa20030507Brian P. Walenz A kmer/libkmer/positionDB.C craa20030506Brian P. Walenz A kmer/libkmer/positionDB.C craa20030506Brian P. Walenz A kmer/libkmer/positionDB.C craa20030418Brian P. Walenz A kmer/libkmer/positionDB.C craa20030416Brian P. Walenz A kmer/libkmer/positionDB.C craa20030415Brian P. Walenz A kmer/libkmer/positionDB.C craa20030402Brian P. Walenz A kmer/libkmer/positionDB.C craa20030102Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C bnbi20140831Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20140411Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20130103Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20130103Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20120510Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20080912Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20080909Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20071010Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20070913Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20070817Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20070607Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20070326Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20051204Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20051127Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20051127Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C jcvi20050320Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C none20041010Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C none20040524Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C none20040421Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C craa20031020Brian P. Walenz A kmer/libkmer/existDB-create-from-fasta.C craa20030908Brian P. 
Walenz A kmer/libkmer/existDB-create-from-fasta.C craa20030908Brian P. Walenz A kmer/libkmer/percentCovered.C bnbi20141007Brian P. Walenz A kmer/libkmer/percentCovered.C jcvi20140411Brian P. Walenz A kmer/libkmer/percentCovered.C jcvi20080911Brian P. Walenz A kmer/libkmer/percentCovered.C jcvi20080909Brian P. Walenz A kmer/libkmer/percentCovered.C jcvi20071029Brian P. Walenz A kmer/libkmer/percentCovered.C jcvi20070913Brian P. Walenz A kmer/libkmer/percentCovered.C jcvi20070326Brian P. Walenz A kmer/libkmer/percentCovered.C jcvi20070320Brian P. Walenz A kmer/libkmer/percentCovered.C jcvi20060509Brian P. Walenz A kmer/libkmer/existDB.C jcvi20140411Brian P. Walenz A kmer/libkmer/existDB.C jcvi20130502Brian P. Walenz A kmer/libkmer/existDB.C jcvi20130329Brian P. Walenz A kmer/libkmer/existDB.C jcvi20130103Brian P. Walenz A kmer/libkmer/existDB.C jcvi20120510Brian P. Walenz A kmer/libkmer/existDB.C jcvi20120508Brian P. Walenz A kmer/libkmer/existDB.C jcvi20080214Brian P. Walenz A kmer/libkmer/existDB.C jcvi20071029Brian P. Walenz A kmer/libkmer/existDB.C jcvi20071010Brian P. Walenz A kmer/libkmer/existDB.C jcvi20070608Brian P. Walenz A kmer/libkmer/existDB.C jcvi20070607Brian P. Walenz A kmer/libkmer/existDB.C jcvi20070326Brian P. Walenz A kmer/libkmer/existDB.C jcvi20060718Brian P. Walenz A kmer/libkmer/existDB.C jcvi20051204Brian P. Walenz A kmer/libkmer/existDB.C jcvi20051127Brian P. Walenz A kmer/libkmer/existDB.C jcvi20051127Brian P. Walenz A kmer/libkmer/existDB.C jcvi20050320Brian P. Walenz A kmer/libkmer/existDB.C none20041010Brian P. Walenz A kmer/libkmer/existDB.C none20040421Brian P. Walenz A kmer/libkmer/existDB.C none20040412Brian P. Walenz A kmer/libkmer/existDB.C craa20031020Brian P. Walenz A kmer/libkmer/existDB.C craa20031016Brian P. Walenz A kmer/libkmer/existDB.C craa20030908Brian P. Walenz A kmer/libkmer/existDB.C craa20030805Brian P. Walenz A kmer/libkmer/existDB.C craa20030220Brian P. Walenz A kmer/libkmer/existDB.C craa20030213Brian P. 
Walenz A kmer/libkmer/existDB.C craa20030213Brian P. Walenz A kmer/libkmer/existDB.C craa20030102Brian P. Walenz A src/overlapInCore-analysis/filterTrue.pl nihh20151012Brian P. Walenz A src/overlapInCore-analysis/find-missed-true-overlaps.pl nihh20151012Brian P. Walenz A src/overlapInCore-analysis/infer-ovl-from-genomic-blasr.pl nihh20151012Brian P. Walenz A src/overlapInCore-analysis/check-ordered-reads-for-missed-overlaps.pl nihh20151012Brian P. Walenz A src/overlapInCore-analysis/infer-obt-from-genomic-blasr.pl nihh20151012Brian P. Walenz A src/overlapInCore-analysis/analyze-true-vs-test.pl nihh20151012Brian P. Walenz A src/overlapInCore-analysis/infer-olaps-from-pairwise-coords.pl nihh20151012Brian P. Walenz A src/overlapInCore-analysis/infer-olaps-from-pairwise-blasr.pl nihh20151012Brian P. Walenz A src/overlapInCore-analysis/infer-olaps-from-genomic-coords.pl nihh20151012Brian P. Walenz A src/bogart/AS_BAT_RepeatJunctionEvidence.H bnbi20141222Brian P. Walenz A src/bogart/AS_BAT_RepeatJunctionEvidence.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_RepeatJunctionEvidence.H jcvi20130905Brian P. Walenz A src/bogart/AS_BAT_RepeatJunctionEvidence.H jcvi20130905Brian P. Walenz A src/bogart/AS_BAT_RepeatJunctionEvidence.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_RepeatJunctionEvidence.H jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_RepeatJunctionEvidence.H jcvi20120214Brian P. Walenz D src/bogart/AS_BAT_RepeatJunctionEvidence.H src/AS_BAT/AS_BAT_RepeatJunctionEvidence.H A src/bogart/AS_BAT_InsertSizes.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_InsertSizes.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_InsertSizes.H jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_InsertSizes.H jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_InsertSizes.H src/AS_BAT/AS_BAT_InsertSizes.H A src/bogart/AS_BAT_Unitig_AddAndPlaceFrag.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_Unitig_AddAndPlaceFrag.C jcvi20130801Brian P. 
Walenz A src/bogart/AS_BAT_Unitig_AddAndPlaceFrag.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_Unitig_AddAndPlaceFrag.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_Unitig_AddAndPlaceFrag.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_Unitig_AddAndPlaceFrag.C src/AS_BAT/AS_BAT_Unitig_AddAndPlaceFrag.C A src/bogart/AS_BAT_PlaceContains.H bnbi20150603Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.H bnbi20141222Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.H jcvi20120105Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.H jcvi20110215Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.H jcvi20101206Brian P. Walenz D src/bogart/AS_BAT_PlaceContains.H src/AS_BAT/AS_BAT_PlaceContains.H A src/bogart/AS_BAT_Joining.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_Joining.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_Joining.C jcvi20130430Brian P. Walenz A src/bogart/AS_BAT_Joining.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_Joining.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_Joining.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_Joining.C src/AS_BAT/AS_BAT_Joining.C A src/bogart/AS_BAT_Datatypes.H bnbi20150616Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H bnbi20150127Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H bnbi20141223Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H bnbi20141222Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H bnbi20141117Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H bnbi20140811Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20140128Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20130617Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20121211Brian P. 
Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20120727Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20120321Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20120222Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20120105Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20111229Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20111205Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20111007Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20110901Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20110404Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20110215Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20110215Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20101217Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_Datatypes.H src/AS_BAT/AS_BAT_Datatypes.H A src/bogart/AS_BAT_SetParentAndHang.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.H jcvi20101206Brian P. Walenz D src/bogart/AS_BAT_SetParentAndHang.H src/AS_BAT/AS_BAT_SetParentAndHang.H A src/bogart/AS_BAT_Breaking.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_Breaking.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_Breaking.H jcvi20120728Brian P. Walenz A src/bogart/AS_BAT_Breaking.H jcvi20110317Brian P. Walenz A src/bogart/AS_BAT_Breaking.H jcvi20101206Brian P. Walenz D src/bogart/AS_BAT_Breaking.H src/AS_BAT/AS_BAT_Breaking.H A src/bogart/AS_BAT_PlaceZombies.C bnbi20150424Brian P. Walenz A src/bogart/AS_BAT_PlaceZombies.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_PlaceZombies.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_PlaceZombies.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_PlaceZombies.C jcvi20120728Brian P. 
Walenz A src/bogart/AS_BAT_PlaceZombies.C jcvi20120222Brian P. Walenz A src/bogart/AS_BAT_PlaceZombies.C jcvi20110215Brian P. Walenz A src/bogart/AS_BAT_PlaceZombies.C jcvi20110118Brian P. Walenz A src/bogart/AS_BAT_PlaceZombies.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_PlaceZombies.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_PlaceZombies.C src/AS_BAT/AS_BAT_PlaceZombies.C A src/bogart/AS_BAT_MoveContains.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_MoveContains.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_MoveContains.H jcvi20101206Brian P. Walenz D src/bogart/AS_BAT_MoveContains.H src/AS_BAT/AS_BAT_MoveContains.H A src/bogart/AS_BAT_SplitDiscontinuous.C bnbi20150303Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C jcvi20120728Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C jcvi20120313Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C jcvi20120105Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C jcvi20111218Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C jcvi20110102Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_SplitDiscontinuous.C src/AS_BAT/AS_BAT_SplitDiscontinuous.C A src/bogart/AS_BAT_MateChecker.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_MateChecker.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_MateChecker.H jcvi20110317Brian P. Walenz A src/bogart/AS_BAT_MateChecker.H jcvi20101206Brian P. Walenz D src/bogart/AS_BAT_MateChecker.H src/AS_BAT/AS_BAT_MateChecker.H A src/bogart/addReadsToUnitigs.C bnbi20150410Brian P. Walenz A src/bogart/addReadsToUnitigs.C bnbi20141009Brian P. Walenz A src/bogart/addReadsToUnitigs.C jcvi20140422Brian P. 
Walenz A src/bogart/addReadsToUnitigs.C jcvi20140331Brian P. Walenz A src/bogart/addReadsToUnitigs.C jcvi20130824Brian P. Walenz A src/bogart/addReadsToUnitigs.C jcvi20130824Brian P. Walenz D src/bogart/addReadsToUnitigs.C src/AS_CNS/addReadsToUnitigs.C A src/bogart/AS_BAT_Outputs.H bnbi20141223Brian P. Walenz A src/bogart/AS_BAT_Outputs.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_Outputs.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_Outputs.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_Outputs.H jcvi20120115Brian P. Walenz A src/bogart/AS_BAT_Outputs.H jcvi20111229Brian P. Walenz A src/bogart/AS_BAT_Outputs.H jcvi20101206Brian P. Walenz D src/bogart/AS_BAT_Outputs.H src/AS_BAT/AS_BAT_Outputs.H A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C bnbi20150805Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C bnbi20150603Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C bnbi20150520Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C bnbi20141222Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C bnbi20141021Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C bnbi20141009Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20130908Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20130627Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20130627Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20130429Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20120728Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20120214Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20111205Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20110408Brian P. 
Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20110406Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20110404Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20110317Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20110215Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20101217Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20101215Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20101204Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_PlaceFragUsingOverlaps.C src/AS_BAT/AS_BAT_PlaceFragUsingOverlaps.C A src/bogart/AS_BAT_Logging.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_Logging.H bnbi20140811Brian P. Walenz A src/bogart/AS_BAT_Logging.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_Logging.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_Logging.H jcvi20130430Brian P. Walenz A src/bogart/AS_BAT_Logging.H jcvi20120729Brian P. Walenz D src/bogart/AS_BAT_Logging.H src/AS_BAT/AS_BAT_Logging.H A src/bogart/AS_BAT_BestOverlapGraph.C bnbi20150814Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C bnbi20150729Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C bnbi20150603Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C bnbi20150520Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C bnbi20150514Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C bnbi20150424Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C bnbi20150420Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C bnbi20150303Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C bnbi20141223Brian P. 
Walenz A src/bogart/AS_BAT_BestOverlapGraph.C bnbi20141222Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C bnbi20141021Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C bnbi20141009Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20140129Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20140128Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20130627Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20130626Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20130430Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20121120Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20120730Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20120214Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20120105Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20111218Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20111212Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20111206Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20110215Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_BestOverlapGraph.C src/AS_BAT/AS_BAT_BestOverlapGraph.C A src/bogart/AS_BAT_ChunkGraph.C bnbi20141222Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C jcvi20120105Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C jcvi20101123Brian P. 
Walenz D src/bogart/AS_BAT_ChunkGraph.C src/AS_BAT/AS_BAT_ChunkGraph.C A src/bogart/AS_BAT_EvaluateMates.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_EvaluateMates.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_EvaluateMates.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_EvaluateMates.C jcvi20111229Brian P. Walenz A src/bogart/AS_BAT_EvaluateMates.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_EvaluateMates.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_EvaluateMates.C src/AS_BAT/AS_BAT_EvaluateMates.C A src/bogart/AS_BAT_MateLocation.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_MateLocation.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_MateLocation.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_MateLocation.C jcvi20120105Brian P. Walenz A src/bogart/AS_BAT_MateLocation.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_MateLocation.C jcvi20101202Brian P. Walenz A src/bogart/AS_BAT_MateLocation.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_MateLocation.C src/AS_BAT/AS_BAT_MateLocation.C A src/bogart/AS_BAT_IntersectSplit.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_IntersectSplit.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_IntersectSplit.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_IntersectSplit.C jcvi20120728Brian P. Walenz A src/bogart/AS_BAT_IntersectSplit.C jcvi20110317Brian P. Walenz A src/bogart/AS_BAT_IntersectSplit.C jcvi20110215Brian P. Walenz A src/bogart/AS_BAT_IntersectSplit.C jcvi20101213Brian P. Walenz A src/bogart/AS_BAT_IntersectSplit.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_IntersectSplit.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_IntersectSplit.C src/AS_BAT/AS_BAT_IntersectSplit.C A src/bogart/AS_BAT_Unitig.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_Unitig.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_Unitig.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_Unitig.C jcvi20120728Brian P. Walenz A src/bogart/AS_BAT_Unitig.C jcvi20120115Brian P. 
Walenz A src/bogart/AS_BAT_Unitig.C jcvi20111229Brian P. Walenz A src/bogart/AS_BAT_Unitig.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_Unitig.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_Unitig.C src/AS_BAT/AS_BAT_Unitig.C A src/bogart/AS_BAT_PopulateUnitig.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.H jcvi20101206Brian P. Walenz D src/bogart/AS_BAT_PopulateUnitig.H src/AS_BAT/AS_BAT_PopulateUnitig.H A src/bogart/AS_BAT_Instrumentation.C bnbi20141223Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C jcvi20130827Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C jcvi20120105Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C jcvi20110728Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_Instrumentation.C src/AS_BAT/AS_BAT_Instrumentation.C A src/bogart/AS_BAT_OverlapCache.H bnbi20150616Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H bnbi20150603Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H bnbi20150520Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H bnbi20150514Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H bnbi20150424Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H bnbi20150303Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H bnbi20141223Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H bnbi20141222Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H bnbi20141021Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H jcvi20130828Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H jcvi20130626Brian P. 
Walenz A src/bogart/AS_BAT_OverlapCache.H jcvi20120728Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H jcvi20120222Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H jcvi20120215Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H jcvi20111212Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H jcvi20111209Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H jcvi20111205Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H jcvi20111106Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H bnbi20111013Sergey Koren A src/bogart/AS_BAT_OverlapCache.H bnbi20111013Sergey Koren A src/bogart/AS_BAT_OverlapCache.H jcvi20110803Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H jcvi20110404Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H jcvi20110215Brian P. Walenz D src/bogart/AS_BAT_OverlapCache.H src/AS_BAT/AS_BAT_OverlapCache.H A src/bogart/AS_BAT_MergeSplitJoin.H bnbi20150603Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.H bnbi20150424Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.H bnbi20150303Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.H jcvi20120222Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.H jcvi20120105Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.H jcvi20111218Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.H jcvi20110215Brian P. Walenz D src/bogart/AS_BAT_MergeSplitJoin.H src/AS_BAT/AS_BAT_MergeSplitJoin.H A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C bnbi20150306Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C bnbi20150127Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C jcvi20130626Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C jcvi20130626Brian P. 
Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C jcvi20130429Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C jcvi20130429Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C jcvi20111213Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C jcvi20110317Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C jcvi20101204Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C src/AS_BAT/AS_BAT_Unitig_PlaceFragUsingEdges.C A src/bogart/AS_BAT_IntersectBubble.H bnbi20150603Brian P. Walenz A src/bogart/AS_BAT_IntersectBubble.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_IntersectBubble.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_IntersectBubble.H jcvi20110215Brian P. Walenz A src/bogart/AS_BAT_IntersectBubble.H jcvi20101206Brian P. Walenz D src/bogart/AS_BAT_IntersectBubble.H src/AS_BAT/AS_BAT_IntersectBubble.H A src/bogart/AS_BAT_PlaceZombies.H bnbi20150424Brian P. Walenz A src/bogart/AS_BAT_PlaceZombies.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_PlaceZombies.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_PlaceZombies.H jcvi20110215Brian P. Walenz A src/bogart/AS_BAT_PlaceZombies.H jcvi20110118Brian P. Walenz A src/bogart/AS_BAT_PlaceZombies.H jcvi20101206Brian P. Walenz D src/bogart/AS_BAT_PlaceZombies.H src/AS_BAT/AS_BAT_PlaceZombies.H A src/bogart/AS_BAT_MoveContains.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_MoveContains.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_MoveContains.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_MoveContains.C jcvi20120728Brian P. Walenz A src/bogart/AS_BAT_MoveContains.C jcvi20120105Brian P. Walenz A src/bogart/AS_BAT_MoveContains.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_MoveContains.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_MoveContains.C jcvi20101123Brian P. 
Walenz D src/bogart/AS_BAT_MoveContains.C src/AS_BAT/AS_BAT_MoveContains.C A src/bogart/AS_BAT_SplitDiscontinuous.H bnbi20150303Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.H jcvi20101206Brian P. Walenz D src/bogart/AS_BAT_SplitDiscontinuous.H src/AS_BAT/AS_BAT_SplitDiscontinuous.H A src/bogart/AS_BAT_MateChecker.C nihh20151012Brian P. Walenz A src/bogart/AS_BAT_MateChecker.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_MateChecker.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_MateChecker.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_MateChecker.C jcvi20110317Brian P. Walenz A src/bogart/AS_BAT_MateChecker.C bnbi20110308Sergey Koren A src/bogart/AS_BAT_MateChecker.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_MateChecker.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_MateChecker.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_MateChecker.C src/AS_BAT/AS_BAT_MateChecker.C A src/bogart/AS_BAT_Joining.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_Joining.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_Joining.H jcvi20101206Brian P. Walenz D src/bogart/AS_BAT_Joining.H src/AS_BAT/AS_BAT_Joining.H A src/bogart/AS_BAT_SetParentAndHang.C bnbi20150805Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C jcvi20120105Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C jcvi20101217Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_SetParentAndHang.C src/AS_BAT/AS_BAT_SetParentAndHang.C A src/bogart/AS_BAT_Breaking.C bnbi20141219Brian P. 
Walenz A src/bogart/AS_BAT_Breaking.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_Breaking.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_Breaking.C jcvi20120728Brian P. Walenz A src/bogart/AS_BAT_Breaking.C jcvi20120214Brian P. Walenz A src/bogart/AS_BAT_Breaking.C jcvi20110506Brian P. Walenz A src/bogart/AS_BAT_Breaking.C jcvi20110317Brian P. Walenz A src/bogart/AS_BAT_Breaking.C jcvi20110102Brian P. Walenz A src/bogart/AS_BAT_Breaking.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_Breaking.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_Breaking.C src/AS_BAT/AS_BAT_Breaking.C A src/bogart/bogart.C bnbi20150807Brian P. Walenz A src/bogart/bogart.C bnbi20150616Brian P. Walenz A src/bogart/bogart.C bnbi20150603Brian P. Walenz A src/bogart/bogart.C bnbi20150520Brian P. Walenz A src/bogart/bogart.C bnbi20150424Brian P. Walenz A src/bogart/bogart.C bnbi20150303Brian P. Walenz A src/bogart/bogart.C bnbi20141223Brian P. Walenz A src/bogart/bogart.C bnbi20141219Brian P. Walenz A src/bogart/bogart.C bnbi20141021Brian P. Walenz A src/bogart/bogart.C jcvi20140129Brian P. Walenz A src/bogart/bogart.C jcvi20131014Brian P. Walenz A src/bogart/bogart.C jcvi20130905Brian P. Walenz A src/bogart/bogart.C jcvi20130828Brian P. Walenz A src/bogart/bogart.C jcvi20130828Brian P. Walenz A src/bogart/bogart.C jcvi20130801Brian P. Walenz A src/bogart/bogart.C jcvi20130627Brian P. Walenz A src/bogart/bogart.C jcvi20130617Brian P. Walenz A src/bogart/bogart.C jcvi20130430Brian P. Walenz A src/bogart/bogart.C jcvi20120730Brian P. Walenz A src/bogart/bogart.C jcvi20120730Brian P. Walenz A src/bogart/bogart.C jcvi20120729Brian P. Walenz A src/bogart/bogart.C jcvi20120728Brian P. Walenz A src/bogart/bogart.C jcvi20120502Brian P. Walenz A src/bogart/bogart.C jcvi20120313Brian P. Walenz A src/bogart/bogart.C jcvi20120222Brian P. Walenz A src/bogart/bogart.C jcvi20120215Brian P. Walenz A src/bogart/bogart.C jcvi20120215Brian P. Walenz A src/bogart/bogart.C jcvi20120214Brian P. 
Walenz A src/bogart/bogart.C jcvi20120115Brian P. Walenz A src/bogart/bogart.C jcvi20120105Brian P. Walenz A src/bogart/bogart.C jcvi20111229Brian P. Walenz A src/bogart/bogart.C jcvi20111218Brian P. Walenz A src/bogart/bogart.C jcvi20111212Brian P. Walenz A src/bogart/bogart.C jcvi20111205Brian P. Walenz A src/bogart/bogart.C jcvi20110404Brian P. Walenz A src/bogart/bogart.C jcvi20110317Brian P. Walenz A src/bogart/bogart.C jcvi20110215Brian P. Walenz A src/bogart/bogart.C jcvi20110118Brian P. Walenz A src/bogart/bogart.C jcvi20110102Brian P. Walenz A src/bogart/bogart.C jcvi20101217Brian P. Walenz A src/bogart/bogart.C jcvi20101206Brian P. Walenz A src/bogart/bogart.C jcvi20101204Brian P. Walenz A src/bogart/bogart.C jcvi20101123Brian P. Walenz D src/bogart/bogart.C src/AS_BAT/bogart.C A src/bogart/AS_BAT_PlaceContains.C bnbi20150603Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C bnbi20150127Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C bnbi20141222Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C jcvi20140129Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C jcvi20120105Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C jcvi20110417Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C jcvi20110215Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_PlaceContains.C src/AS_BAT/AS_BAT_PlaceContains.C A src/bogart/AS_BAT_InsertSizes.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_InsertSizes.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_InsertSizes.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_InsertSizes.C jcvi20111216Brian P. Walenz A src/bogart/AS_BAT_InsertSizes.C jcvi20101206Brian P. 
Walenz A src/bogart/AS_BAT_InsertSizes.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_InsertSizes.C src/AS_BAT/AS_BAT_InsertSizes.C A src/bogart/AS_BAT_MateBubble.C bnbi20141223Brian P. Walenz A src/bogart/AS_BAT_MateBubble.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_MateBubble.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_MateBubble.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_MateBubble.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_MateBubble.C jcvi20110215Brian P. Walenz A src/bogart/AS_BAT_MateBubble.C jcvi20101206Brian P. Walenz A src/bogart/AS_BAT_MateBubble.C jcvi20101123Brian P. Walenz D src/bogart/AS_BAT_MateBubble.C src/AS_BAT/AS_BAT_MateBubble.C A src/bogart/AS_BAT_Instrumentation.H bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.H jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.H jcvi20101206Brian P. Walenz D src/bogart/AS_BAT_Instrumentation.H src/AS_BAT/AS_BAT_Instrumentation.H A src/bogart/AS_BAT_OverlapCache.C bnbi20150625Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C bnbi20150616Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C bnbi20150603Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C bnbi20150520Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C bnbi20150514Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C bnbi20150424Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C bnbi20150303Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C bnbi20150204Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C bnbi20150113Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C bnbi20141223Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C bnbi20141222Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C bnbi20141021Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C bnbi20140806Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20131014Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20130929Brian P. 
Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20130905Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20130828Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20130626Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20120728Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20120727Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20120412Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20120321Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20120222Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20120218Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20120215Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20120215Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20120114Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C bnbi20120111Sergey Koren A src/bogart/AS_BAT_OverlapCache.C jcvi20111229Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20111229Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20111212Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20111209Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20111209Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20111206Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20111205Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20111205Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20110831Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20110417Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20110404Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C jcvi20110215Brian P. Walenz D src/bogart/AS_BAT_OverlapCache.C src/AS_BAT/AS_BAT_OverlapCache.C A src/bogart/AS_BAT_ExtendByMates.C bnbi20150424Brian P. Walenz A src/bogart/AS_BAT_ExtendByMates.C bnbi20141223Brian P. Walenz A src/bogart/AS_BAT_ExtendByMates.C bnbi20141222Brian P. 
Walenz A src/bogart/AS_BAT_ExtendByMates.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_ExtendByMates.C jcvi20140128Brian P. Walenz A src/bogart/AS_BAT_ExtendByMates.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_ExtendByMates.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_ExtendByMates.C jcvi20120105Brian P. Walenz D src/bogart/AS_BAT_ExtendByMates.C src/AS_BAT/AS_BAT_ExtendByMates.C A src/bogart/AS_BAT_MergeSplitJoin.C bnbi20150805Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C bnbi20150603Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C bnbi20150303Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C bnbi20141223Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C bnbi20141222Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C bnbi20141219Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C bnbi20141014Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C bnbi20141009Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20140503Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20140304Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20130908Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20130905Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20130828Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20130801Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20130430Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20130429Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20120806Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20120730Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20120730Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20120729Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20120728Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20120313Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20120222Brian P. Walenz A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20120214Brian P. 
Walenz
A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20120105Brian P. Walenz
A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20111218Brian P. Walenz
A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20111213Brian P. Walenz
A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20110524Brian P. Walenz
A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20110506Brian P. Walenz
A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20110417Brian P. Walenz
A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20110408Brian P. Walenz
A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20110406Brian P. Walenz
A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20110404Brian P. Walenz
A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20110317Brian P. Walenz
A src/bogart/AS_BAT_MergeSplitJoin.C jcvi20110215Brian P. Walenz
D src/bogart/AS_BAT_MergeSplitJoin.C src/AS_BAT/AS_BAT_MergeSplitJoin.C
A src/bogart/AS_BAT_IntersectBubble.C bnbi20150603Brian P. Walenz
A src/bogart/AS_BAT_IntersectBubble.C bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_IntersectBubble.C jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_IntersectBubble.C jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_IntersectBubble.C jcvi20120729Brian P. Walenz
A src/bogart/AS_BAT_IntersectBubble.C jcvi20110404Brian P. Walenz
A src/bogart/AS_BAT_IntersectBubble.C jcvi20110215Brian P. Walenz
A src/bogart/AS_BAT_IntersectBubble.C jcvi20110118Brian P. Walenz
A src/bogart/AS_BAT_IntersectBubble.C jcvi20110106Brian P. Walenz
A src/bogart/AS_BAT_IntersectBubble.C jcvi20101215Brian P. Walenz
A src/bogart/AS_BAT_IntersectBubble.C jcvi20101206Brian P. Walenz
A src/bogart/AS_BAT_IntersectBubble.C jcvi20101206Brian P. Walenz
A src/bogart/AS_BAT_IntersectBubble.C jcvi20101204Brian P. Walenz
A src/bogart/AS_BAT_IntersectBubble.C jcvi20101123Brian P. Walenz
D src/bogart/AS_BAT_IntersectBubble.C src/AS_BAT/AS_BAT_IntersectBubble.C
A src/bogart/AS_BAT_EvaluateMates.H bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_EvaluateMates.H jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_EvaluateMates.H jcvi20101206Brian P. Walenz
D src/bogart/AS_BAT_EvaluateMates.H src/AS_BAT/AS_BAT_EvaluateMates.H
A src/bogart/AS_BAT_MateLocation.H bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_MateLocation.H jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_MateLocation.H jcvi20101206Brian P. Walenz
A src/bogart/AS_BAT_MateLocation.H jcvi20101123Brian P. Walenz
D src/bogart/AS_BAT_MateLocation.H src/AS_BAT/AS_BAT_MateLocation.H
A src/bogart/AS_BAT_IntersectSplit.H bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_IntersectSplit.H jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_IntersectSplit.H jcvi20110215Brian P. Walenz
A src/bogart/AS_BAT_IntersectSplit.H jcvi20101206Brian P. Walenz
D src/bogart/AS_BAT_IntersectSplit.H src/AS_BAT/AS_BAT_IntersectSplit.H
A src/bogart/AS_BAT_Unitig.H bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_Unitig.H jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_Unitig.H jcvi20130626Brian P. Walenz
A src/bogart/AS_BAT_Unitig.H jcvi20130318Brian P. Walenz
A src/bogart/AS_BAT_Unitig.H jcvi20120729Brian P. Walenz
A src/bogart/AS_BAT_Unitig.H jcvi20120728Brian P. Walenz
A src/bogart/AS_BAT_Unitig.H jcvi20120214Brian P. Walenz
A src/bogart/AS_BAT_Unitig.H jcvi20120115Brian P. Walenz
A src/bogart/AS_BAT_Unitig.H jcvi20111229Brian P. Walenz
A src/bogart/AS_BAT_Unitig.H jcvi20101206Brian P. Walenz
A src/bogart/AS_BAT_Unitig.H jcvi20101123Brian P. Walenz
D src/bogart/AS_BAT_Unitig.H src/AS_BAT/AS_BAT_Unitig.H
A src/bogart/AS_BAT_PopulateUnitig.C bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_PopulateUnitig.C jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_PopulateUnitig.C jcvi20120729Brian P. Walenz
A src/bogart/AS_BAT_PopulateUnitig.C jcvi20120728Brian P. Walenz
A src/bogart/AS_BAT_PopulateUnitig.C jcvi20120105Brian P. Walenz
A src/bogart/AS_BAT_PopulateUnitig.C jcvi20110901Brian P. Walenz
A src/bogart/AS_BAT_PopulateUnitig.C jcvi20101206Brian P. Walenz
A src/bogart/AS_BAT_PopulateUnitig.C jcvi20101123Brian P. Walenz
D src/bogart/AS_BAT_PopulateUnitig.C src/AS_BAT/AS_BAT_PopulateUnitig.C
A src/bogart/AS_BAT_PromoteToSingleton.C bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_PromoteToSingleton.C jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_PromoteToSingleton.C jcvi20120729Brian P. Walenz
A src/bogart/AS_BAT_PromoteToSingleton.C jcvi20120728Brian P. Walenz
A src/bogart/AS_BAT_PromoteToSingleton.C jcvi20120105Brian P. Walenz
D src/bogart/AS_BAT_PromoteToSingleton.C src/AS_BAT/AS_BAT_PromoteToSingleton.C
A src/bogart/AS_BAT_Unitig_AddFrag.C bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_Unitig_AddFrag.C jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_Unitig_AddFrag.C jcvi20130626Brian P. Walenz
A src/bogart/AS_BAT_Unitig_AddFrag.C jcvi20120729Brian P. Walenz
A src/bogart/AS_BAT_Unitig_AddFrag.C jcvi20120214Brian P. Walenz
A src/bogart/AS_BAT_Unitig_AddFrag.C jcvi20101206Brian P. Walenz
A src/bogart/AS_BAT_Unitig_AddFrag.C jcvi20101123Brian P. Walenz
D src/bogart/AS_BAT_Unitig_AddFrag.C src/AS_BAT/AS_BAT_Unitig_AddFrag.C
A src/bogart/AS_BAT_FragmentInfo.C bnbi20150616Brian P. Walenz
A src/bogart/AS_BAT_FragmentInfo.C bnbi20150529Brian P. Walenz
A src/bogart/AS_BAT_FragmentInfo.C bnbi20150317Brian P. Walenz
A src/bogart/AS_BAT_FragmentInfo.C bnbi20141223Brian P. Walenz
A src/bogart/AS_BAT_FragmentInfo.C bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_FragmentInfo.C jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_FragmentInfo.C jcvi20130617Brian P. Walenz
A src/bogart/AS_BAT_FragmentInfo.C jcvi20120806Brian P. Walenz
A src/bogart/AS_BAT_FragmentInfo.C jcvi20120729Brian P. Walenz
A src/bogart/AS_BAT_FragmentInfo.C jcvi20110417Brian P. Walenz
A src/bogart/AS_BAT_FragmentInfo.C jcvi20101206Brian P. Walenz
A src/bogart/AS_BAT_FragmentInfo.C jcvi20101123Brian P. Walenz
D src/bogart/AS_BAT_FragmentInfo.C src/AS_BAT/AS_BAT_FragmentInfo.C
A src/bogart/AS_BAT_findEdges.C bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_findEdges.C bnbi20141114Brian P. Walenz
D src/bogart/AS_BAT_findEdges.C src/AS_BAT/AS_BAT_findEdges.C
A src/bogart/AS_BAT_Outputs.C bnbi20150605Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C bnbi20150603Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C bnbi20150409Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C bnbi20150127Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C bnbi20150113Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C bnbi20150113Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C bnbi20141223Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C bnbi20141117Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C jcvi20140331Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C jcvi20120115Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C jcvi20111229Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C jcvi20101206Brian P. Walenz
A src/bogart/AS_BAT_Outputs.C jcvi20101123Brian P. Walenz
D src/bogart/AS_BAT_Outputs.C src/AS_BAT/AS_BAT_Outputs.C
A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H bnbi20150603Brian P. Walenz
A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H bnbi20141222Brian P. Walenz
A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H jcvi20120214Brian P. Walenz
A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H jcvi20110408Brian P. Walenz
A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H jcvi20110406Brian P. Walenz
A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H jcvi20110404Brian P. Walenz
A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H jcvi20110317Brian P. Walenz
A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H jcvi20110215Brian P. Walenz
A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H jcvi20110118Brian P. Walenz
A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H jcvi20101215Brian P. Walenz
A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H jcvi20101206Brian P. Walenz
D src/bogart/AS_BAT_PlaceFragUsingOverlaps.H src/AS_BAT/AS_BAT_PlaceFragUsingOverlaps.H
A src/bogart/AS_BAT_Logging.C bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_Logging.C jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_Logging.C jcvi20130617Brian P. Walenz
A src/bogart/AS_BAT_Logging.C jcvi20130430Brian P. Walenz
A src/bogart/AS_BAT_Logging.C jcvi20120729Brian P. Walenz
D src/bogart/AS_BAT_Logging.C src/AS_BAT/AS_BAT_Logging.C
A src/bogart/AS_BAT_BestOverlapGraph.H bnbi20150603Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H bnbi20150424Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H bnbi20141222Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H bnbi20141021Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H jcvi20140129Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H jcvi20140128Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H jcvi20130430Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H jcvi20120214Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H jcvi20120105Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H jcvi20111205Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H jcvi20110215Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H jcvi20101206Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H jcvi20101123Brian P. Walenz
D src/bogart/AS_BAT_BestOverlapGraph.H src/AS_BAT/AS_BAT_BestOverlapGraph.H
A src/bogart/AS_BAT_ChunkGraph.H bnbi20141222Brian P. Walenz
A src/bogart/AS_BAT_ChunkGraph.H bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_ChunkGraph.H jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_ChunkGraph.H jcvi20120105Brian P. Walenz
A src/bogart/AS_BAT_ChunkGraph.H jcvi20101123Brian P. Walenz
D src/bogart/AS_BAT_ChunkGraph.H src/AS_BAT/AS_BAT_ChunkGraph.H
A src/bogart/AS_BAT_ReconstructRepeats.C bnbi20150424Brian P. Walenz
A src/bogart/AS_BAT_ReconstructRepeats.C bnbi20141222Brian P. Walenz
A src/bogart/AS_BAT_ReconstructRepeats.C bnbi20141219Brian P. Walenz
A src/bogart/AS_BAT_ReconstructRepeats.C jcvi20130801Brian P. Walenz
A src/bogart/AS_BAT_ReconstructRepeats.C jcvi20120729Brian P. Walenz
A src/bogart/AS_BAT_ReconstructRepeats.C jcvi20120214Brian P. Walenz
A src/bogart/AS_BAT_ReconstructRepeats.C jcvi20120105Brian P. Walenz
D src/bogart/AS_BAT_ReconstructRepeats.C src/AS_BAT/AS_BAT_ReconstructRepeats.C
A src/bogart/analyzeBest.C bnbi20150410Brian P. Walenz
A src/bogart/analyzeBest.C bnbi20141121Brian P. Walenz
A src/bogart/analyzeBest.C bnbi20141009Brian P. Walenz
A src/bogart/analyzeBest.C jcvi20130801Brian P. Walenz
A src/bogart/analyzeBest.C jcvi20130801Brian P. Walenz
A src/bogart/analyzeBest.C jcvi20130603Brian P. Walenz
A src/bogart/analyzeBest.C jcvi20111229Brian P. Walenz
A src/bogart/analyzeBest.C jcvi20101009Brian P. Walenz
D src/bogart/analyzeBest.C src/AS_BOG/analyzeBest.C
A src/overlapErrorAdjustment/findErrors.C bnbi20150701Brian P. Walenz
A src/overlapErrorAdjustment/findErrors.C bnbi20150616Brian P. Walenz
A src/overlapErrorAdjustment/findErrors.C bnbi20150603Brian P. Walenz
A src/overlapErrorAdjustment/findErrors.C bnbi20150529Brian P. Walenz
A src/overlapErrorAdjustment/analyzeAlignment.H bnbi20150701Brian P. Walenz
A src/overlapErrorAdjustment/analyzeAlignment.H bnbi20150624Brian P. Walenz
A src/overlapErrorAdjustment/analyzeAlignment.H bnbi20150623Brian P. Walenz
A src/overlapErrorAdjustment/analyzeAlignment.H bnbi20150618Brian P. Walenz
A src/overlapErrorAdjustment/correctOverlaps.C bnbi20150616Brian P. Walenz
A src/overlapErrorAdjustment/findErrors-Read_Olaps.C bnbi20150625Brian P. Walenz
A src/overlapErrorAdjustment/findErrors-Read_Olaps.C bnbi20150616Brian P. Walenz
A src/overlapErrorAdjustment/findErrors-Read_Olaps.C bnbi20150616Brian P. Walenz
A src/overlapErrorAdjustment/findErrors-Read_Frags.C bnbi20150603Brian P. Walenz
A src/overlapErrorAdjustment/findErrors-Read_Frags.C bnbi20150529Brian P. Walenz
A src/overlapErrorAdjustment/findErrors-Read_Frags.C bnbi20150505Brian P. Walenz
A src/overlapErrorAdjustment/findErrors-Analyze_Alignment.C bnbi20150618Brian P. Walenz
A src/overlapErrorAdjustment/analyzeAlignment.C bnbi20150701Brian P. Walenz
A src/overlapErrorAdjustment/analyzeAlignment.C bnbi20150701Brian P. Walenz
A src/overlapErrorAdjustment/analyzeAlignment.C bnbi20150624Brian P. Walenz
A src/overlapErrorAdjustment/analyzeAlignment.C bnbi20150623Brian P. Walenz
A src/overlapErrorAdjustment/analyzeAlignment.C bnbi20150618Brian P. Walenz
A src/overlapErrorAdjustment/correctOverlaps-Redo_Olaps.C bnbi20150603Brian P. Walenz
A src/overlapErrorAdjustment/correctOverlaps-Redo_Olaps.C bnbi20150529Brian P. Walenz
A src/overlapErrorAdjustment/correctOverlaps-Redo_Olaps.C bnbi20150520Brian P. Walenz
A src/overlapErrorAdjustment/correctOverlaps-Redo_Olaps.C bnbi20150514Brian P. Walenz
A src/overlapErrorAdjustment/correctOverlaps-Read_Olaps.C bnbi20150625Brian P. Walenz
A src/overlapErrorAdjustment/correctOverlaps-Read_Olaps.C bnbi20150616Brian P. Walenz
A src/overlapErrorAdjustment/correctOverlaps-Read_Olaps.C bnbi20150616Brian P. Walenz
A src/overlapErrorAdjustment/correctOverlaps-Correct_Frags.C bnbi20150603Brian P. Walenz
A src/overlapErrorAdjustment/correctOverlaps-Correct_Frags.C bnbi20150529Brian P. Walenz
A src/overlapErrorAdjustment/correctOverlaps-Correct_Frags.C bnbi20150520Brian P. Walenz
A src/overlapInCore/overlapImport.C bnbi20150625Brian P. Walenz
A src/overlapInCore/overlapImport.C bnbi20150616Brian P. Walenz
A src/overlapInCore/overlapImport.C bnbi20150616Brian P. Walenz
A src/overlapInCore/overlapImport.C bnbi20150529Brian P. Walenz
A src/overlapInCore/overlapImport.C bnbi20150514Brian P. Walenz
A src/overlapInCore/overlapReadCache.H bnbi20150623Brian P. Walenz
A src/overlapInCore/overlapReadCache.H bnbi20150618Brian P. Walenz
A src/overlapInCore/overlapReadCache.H bnbi20150616Brian P. Walenz
A src/overlapInCore/overlapReadCache.H bnbi20150616Brian P. Walenz
D src/overlapInCore/overlapReadCache.H src/overlapInCore/overlapPair-readCache.H
A src/overlapInCore/overlapInCore-Output.C nihh20151026Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C bnbi20150616Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C bnbi20150409Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C bnbi20150321Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C bnbi20150317Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C bnbi20150211Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C bnbi20150113Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C bnbi20141215Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C bnbi20141117Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20130801Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20130501Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20111213Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20110803Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20110803Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20110801Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20110729Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20110729Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C bnbi20110617Sergey Koren
A src/overlapInCore/overlapInCore-Output.C bnbi20110308Sergey Koren
A src/overlapInCore/overlapInCore-Output.C jcvi20100830Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20100819Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20100402Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20100330Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20100216Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20091028Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20091026Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20090730Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20090610Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20090520Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20090202Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20090116Sergey Koren
A src/overlapInCore/overlapInCore-Output.C jcvi20081008Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20081008Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20081007Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20080627Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20080616Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20080616Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20080515Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20080227Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20080225Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20071120Eli Venter
A src/overlapInCore/overlapInCore-Output.C jcvi20071109Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20071108Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070827Sergey Koren
A src/overlapInCore/overlapInCore-Output.C jcvi20070810Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070803Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070801Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070529Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070416Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070404Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070329Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070305Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070225Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070223Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070218Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070214Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070213Art Delcher
A src/overlapInCore/overlapInCore-Output.C jcvi20070212Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070211Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070208Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070129Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20070128Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20060926Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20060925Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20060821Aaron Halpern
A src/overlapInCore/overlapInCore-Output.C jcvi20060605Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20060327Aaron Halpern
A src/overlapInCore/overlapInCore-Output.C jcvi20050915Eli Venter
A src/overlapInCore/overlapInCore-Output.C jcvi20050824Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20050805Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20050804Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20050801Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20050715Eli Venter
A src/overlapInCore/overlapInCore-Output.C jcvi20050713Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20050713Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20050713Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20050711Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C jcvi20050616Brian P. Walenz
A src/overlapInCore/overlapInCore-Output.C tigr20050322Jason Miller
A src/overlapInCore/overlapInCore-Output.C tigr20050322Jason Miller
A src/overlapInCore/overlapInCore-Output.C tigr20040923Michael Schatz
D src/overlapInCore/overlapInCore-Output.C src/AS_OVL/AS_OVL_overlap_common.h
D src/overlapInCore/overlapInCore-Output.C src/AS_OVM/overlapInCore-Output.C
A src/overlapInCore/overlapInCore.H nihh20151027Brian P. Walenz
A src/overlapInCore/overlapInCore.H bnbi20150825Brian P. Walenz
A src/overlapInCore/overlapInCore.H bnbi20150616Brian P. Walenz
A src/overlapInCore/overlapInCore.H bnbi20150410Brian P. Walenz
A src/overlapInCore/overlapInCore.H bnbi20150317Brian P. Walenz
A src/overlapInCore/overlapInCore.H bnbi20150303Brian P. Walenz
A src/overlapInCore/overlapInCore.H bnbi20150211Brian P. Walenz
A src/overlapInCore/overlapInCore.H bnbi20150209Brian P. Walenz
A src/overlapInCore/overlapInCore.H bnbi20150206Brian P. Walenz
A src/overlapInCore/overlapInCore.H bnbi20150206Brian P. Walenz
A src/overlapInCore/overlapInCore.H bnbi20141219Brian P. Walenz
A src/overlapInCore/overlapInCore.H bnbi20141215Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20130801Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20130801Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20130617Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20130501Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20130501Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20130111Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20120203Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20111213Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20110803Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20110801Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20110801Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20110729Brian P. Walenz
A src/overlapInCore/overlapInCore.H bnbi20110308Sergey Koren
A src/overlapInCore/overlapInCore.H jcvi20100819Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20100330Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20100329Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20091026Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20090610Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20090202Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20081111Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20081008Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20081007Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20080627Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20080616Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20071108Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20070803Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20070801Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20070723Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20070404Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20070404Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20070329Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20070223Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20070218Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20070213Art Delcher
A src/overlapInCore/overlapInCore.H jcvi20070212Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20070211Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20070129Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20070128Brian P. Walenz
A src/overlapInCore/overlapInCore.H jcvi20060823Aaron Halpern
A src/overlapInCore/overlapInCore.H jcvi20060822Aaron Halpern
A src/overlapInCore/overlapInCore.H jcvi20060821Aaron Halpern
A src/overlapInCore/overlapInCore.H jcvi20060327Aaron Halpern
A src/overlapInCore/overlapInCore.H jcvi20050929Aaron Halpern
A src/overlapInCore/overlapInCore.H jcvi20050616Brian P. Walenz
A src/overlapInCore/overlapInCore.H tigr20050322Jason Miller
A src/overlapInCore/overlapInCore.H tigr20050322Jason Miller
A src/overlapInCore/overlapInCore.H tigr20040923Michael Schatz
D src/overlapInCore/overlapInCore.H src/AS_OVL/AS_OVL_overlap.h
D src/overlapInCore/overlapInCore.H src/AS_OVM/overlapInCore.H
A src/overlapInCore/overlapInCorePartition.C bnbi20150825Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C bnbi20150616Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C bnbi20150529Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C bnbi20150317Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C bnbi20150227Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C bnbi20150223Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C bnbi20150206Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C bnbi20141121Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C bnbi20131202Sergey Koren
A src/overlapInCore/overlapInCorePartition.C bnbi20131022Sergey Koren
A src/overlapInCore/overlapInCorePartition.C jcvi20130801Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C jcvi20130801Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C bnbi20130627Sergey Koren
A src/overlapInCore/overlapInCorePartition.C bnbi20120729Sergey Koren
A src/overlapInCore/overlapInCorePartition.C jcvi20120714Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C jcvi20120220Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C jcvi20120212Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C jcvi20120211Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C jcvi20120206Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C jcvi20120206Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C jcvi20120130Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C jcvi20110706Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C jcvi20110705Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C jcvi20110627Brian P. Walenz
A src/overlapInCore/overlapInCorePartition.C jcvi20110612Brian P. Walenz
D src/overlapInCore/overlapInCorePartition.C src/AS_OVL/overlap_partition.C
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C nihh20151029Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C bnbi20150720Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C bnbi20150720Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C bnbi20150708Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C bnbi20150410Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C bnbi20150331Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C bnbi20150211Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C bnbi20141215Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20130801Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20130501Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20130111Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20111213Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20110803Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20110803Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20110801Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20110729Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20110729Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C bnbi20110617Sergey Koren
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C bnbi20110308Sergey Koren
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20100830Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20100819Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20100402Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20100330Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20100216Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20091028Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20091026Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20090730Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20090610Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20090520Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20090202Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20090116Sergey Koren
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20081008Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20081008Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20081007Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20080627Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20080616Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20080616Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20080515Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20080227Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20080225Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20071120Eli Venter
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20071109Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20071108Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070827Sergey Koren
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070810Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070803Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070801Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070529Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070416Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070404Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070329Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070305Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070225Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070223Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070218Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070214Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070213Art Delcher
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070212Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070211Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070208Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070129Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20070128Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20060926Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20060925Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20060821Aaron Halpern
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20060605Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20060327Aaron Halpern
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20050915Eli Venter
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20050824Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20050805Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20050804Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20050801Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20050715Eli Venter
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20050713Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20050713Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20050713Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20050711Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C jcvi20050616Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C tigr20050322Jason Miller
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C tigr20050322Jason Miller
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C tigr20040923Michael Schatz
D src/overlapInCore/overlapInCore-Process_String_Overlaps.C src/AS_OVL/AS_OVL_overlap_common.h
D src/overlapInCore/overlapInCore-Process_String_Overlaps.C src/AS_OVM/overlapInCore-Process_String_Overlaps.C
A src/overlapInCore/overlapPair.C bnbi20150720Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150717Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150709Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150708Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150701Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150701Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150625Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150623Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150623Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150618Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150617Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150616Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150616Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150603Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150410Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150409Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150331Brian P. Walenz
A src/overlapInCore/overlapPair.C bnbi20150327Brian P. Walenz
A src/overlapInCore/overlapReadCache.C bnbi20150701Brian P. Walenz
A src/overlapInCore/overlapReadCache.C bnbi20150625Brian P. Walenz
A src/overlapInCore/overlapConvert.C bnbi20150625Brian P. Walenz
A src/overlapInCore/overlapConvert.C bnbi20150616Brian P. Walenz
A src/overlapInCore/overlapConvert.C bnbi20150616Brian P. Walenz
A src/overlapInCore/overlapConvert.C bnbi20150605Brian P. Walenz
A src/overlapInCore/overlapConvert.C bnbi20150514Brian P. Walenz
A src/overlapInCore/overlapConvert.C bnbi20150429Brian P. Walenz
A src/overlapInCore/overlapConvert.C bnbi20150409Brian P. Walenz
A src/overlapInCore/overlapConvert.C bnbi20150321Brian P. Walenz
A src/overlapInCore/overlapConvert.C bnbi20150317Brian P. Walenz
A src/overlapInCore/overlapConvert.C bnbi20150211Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C bnbi20150825Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C bnbi20150720Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C bnbi20150720Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C bnbi20150701Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C bnbi20150603Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C bnbi20150529Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C bnbi20150317Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C bnbi20150227Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C bnbi20141219Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C bnbi20141215Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C bnbi20141215Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20130801Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20130501Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20120130Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20111229Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20111213Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20110803Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20110803Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20110803Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20110801Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20110801Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20110729Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20110729Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C bnbi20110617Sergey Koren
A src/overlapInCore/overlapInCore-Build_Hash_Index.C bnbi20110308Sergey Koren
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20100830Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20100819Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20100402Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20100330Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20100216Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20091028Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20091026Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20090730Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20090610Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20090520Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20090202Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20090116Sergey Koren
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20081008Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20081008Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20081007Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20080627Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20080616Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20080616Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20080515Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20080227Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20080225Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20071120Eli Venter
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20071109Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20071108Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070827Sergey Koren
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070810Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070803Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070801Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070529Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070416Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070404Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070329Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070305Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070225Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070223Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070218Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070214Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070213Art Delcher
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070212Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070211Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070208Brian P. Walenz
A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070129Brian P.
Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20070128Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20060926Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20060925Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20060821Aaron Halpern A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20060605Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20060327Aaron Halpern A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20050915Eli Venter A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20050824Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20050805Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20050804Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20050801Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20050715Eli Venter A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20050713Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20050713Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20050713Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20050711Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C jcvi20050616Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C tigr20050322Jason Miller A src/overlapInCore/overlapInCore-Build_Hash_Index.C tigr20050322Jason Miller A src/overlapInCore/overlapInCore-Build_Hash_Index.C tigr20040923Michael Schatz D src/overlapInCore/overlapInCore-Build_Hash_Index.C src/AS_OVL/AS_OVL_overlap_common.h D src/overlapInCore/overlapInCore-Build_Hash_Index.C src/AS_OVM/overlapInCore-Build_Hash_Index.C A src/overlapInCore/overlapInCore-Find_Overlaps.C bnbi20141215Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20130801Brian P. 
Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20130617Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20130501Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20111213Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20110803Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20110803Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20110801Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20110729Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20110729Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C bnbi20110617Sergey Koren A src/overlapInCore/overlapInCore-Find_Overlaps.C bnbi20110308Sergey Koren A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20100830Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20100819Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20100402Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20100330Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20100216Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20091028Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20091026Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20090730Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20090610Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20090520Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20090202Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20090116Sergey Koren A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20081008Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20081008Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20081007Brian P. 
Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20080627Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20080616Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20080616Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20080515Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20080227Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20080225Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20071120Eli Venter A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20071109Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20071108Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070827Sergey Koren A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070810Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070803Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070801Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070529Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070416Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070404Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070329Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070305Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070225Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070223Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070218Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070214Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070213Art Delcher A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070212Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070211Brian P. 
Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070208Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070129Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20070128Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20060926Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20060925Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20060821Aaron Halpern A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20060605Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20060327Aaron Halpern A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20050915Eli Venter A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20050824Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20050805Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20050804Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20050801Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20050715Eli Venter A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20050713Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20050713Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20050713Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20050711Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C jcvi20050616Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C tigr20050322Jason Miller A src/overlapInCore/overlapInCore-Find_Overlaps.C tigr20050322Jason Miller A src/overlapInCore/overlapInCore-Find_Overlaps.C tigr20040923Michael Schatz D src/overlapInCore/overlapInCore-Find_Overlaps.C src/AS_OVL/AS_OVL_overlap_common.h D src/overlapInCore/overlapInCore-Find_Overlaps.C src/AS_OVM/overlapInCore-Find_Overlaps.C A src/overlapInCore/overlapInCore.C nihh20151027Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20150825Brian P. 
Walenz A src/overlapInCore/overlapInCore.C bnbi20150720Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20150720Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20150708Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20150701Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20150625Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20150616Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20150603Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20150602Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20150317Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20150303Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20150211Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20150209Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20150206Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20150131Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20141219Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20141215Brian P. Walenz A src/overlapInCore/overlapInCore.C bnbi20140811Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20130801Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20130619Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20130617Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20130501Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20130111Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20120312Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20120212Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20120203Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20120130Brian P. Walenz A src/overlapInCore/overlapInCore.C tigr20120126Michael Schatz A src/overlapInCore/overlapInCore.C jcvi20111229Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20111213Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20111213Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20111111Brian P. 
Walenz A src/overlapInCore/overlapInCore.C jcvi20110803Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20110801Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20110801Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20110729Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20090610Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20081008Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20081008Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20080627Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20080616Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20080616Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20071108Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070812Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070812Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070514Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070504Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070329Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070305Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070222Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070218Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070214Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070212Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070211Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070208Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070129Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070129Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070128Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20070126Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20060926Brian P. Walenz A src/overlapInCore/overlapInCore.C jcvi20050616Brian P. 
Walenz A src/overlapInCore/overlapInCore.C tigr20050322Jason Miller A src/overlapInCore/overlapInCore.C tigr20050322Jason Miller A src/overlapInCore/overlapInCore.C tigr20040923Michael Schatz D src/overlapInCore/overlapInCore.C src/AS_OVL/AS_OVL_driver_common.h D src/overlapInCore/overlapInCore.C src/AS_OVM/overlapInCore.C A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1500.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1500.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1200.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1200.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2900.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2900.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2700.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2700.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2000.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2000.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3100.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3100.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/Binomial_Bound.C nihh20151012Brian P. Walenz A src/overlapInCore/liboverlap/Binomial_Bound.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/Binomial_Bound.C bnbi20150209Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3600.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3600.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/Display_Alignment.C bnbi20150708Brian P. 
Walenz A src/overlapInCore/liboverlap/Display_Alignment.C bnbi20150625Brian P. Walenz A src/overlapInCore/liboverlap/Display_Alignment.C bnbi20150624Brian P. Walenz A src/overlapInCore/liboverlap/Display_Alignment.C bnbi20150209Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3800.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3800.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4400.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4400.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4300.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4300.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0300.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0300.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0400.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0400.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimitGenerate.C bnbi20150814Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimitGenerate.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2100.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2100.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2600.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2600.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2800.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2800.C bnbi20150602Brian P. 
Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H nihh20151029Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H nihh20151026Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150720Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150720Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150717Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150715Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150715Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150715Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150713Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150710Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150710Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150708Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150605Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150410Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150303Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150303Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H bnbi20150211Brian P. Walenz D src/overlapInCore/liboverlap/prefixEditDistance.H src/overlapInCore/prefixEditDistance.H A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150720Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150720Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150717Brian P. 
Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150715Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150715Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150715Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150715Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150713Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150710Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150710Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150710Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150709Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150709Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150708Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150708Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150701Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C bnbi20150211Brian P. Walenz D src/overlapInCore/liboverlap/prefixEditDistance-forward.C src/overlapInCore/prefixEditDistance-forward.C A src/overlapInCore/liboverlap/prefixEditDistance-allocateMoreSpace.C bnbi20150720Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-allocateMoreSpace.C bnbi20150715Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-allocateMoreSpace.C bnbi20150710Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-allocateMoreSpace.C bnbi20150602Brian P. 
Walenz A src/overlapInCore/liboverlap/prefixEditDistance-allocateMoreSpace.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-allocateMoreSpace.C bnbi20150221Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-allocateMoreSpace.C bnbi20150211Brian P. Walenz D src/overlapInCore/liboverlap/prefixEditDistance-allocateMoreSpace.C src/overlapInCore/prefixEditDistance-allocateMoreSpace.C A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1300.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1300.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1400.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1400.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0500.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0500.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0200.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0200.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C nihh20151029Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150720Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150720Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150715Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150715Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150713Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150708Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150708Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150701Brian P. 
Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150624Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150617Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150410Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150331Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150211Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150211Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20150206Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20141215Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20130801Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20130501Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20130111Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20111213Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20110807Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20110803Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20110803Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20110801Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20110729Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20110729Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20110617Sergey Koren A src/overlapInCore/liboverlap/prefixEditDistance-extend.C bnbi20110308Sergey Koren A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20100830Brian P. 
Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20100819Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20100402Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20100330Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20100216Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20091028Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20091026Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20090730Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20090610Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20090520Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20090202Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20090116Sergey Koren A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20081008Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20081008Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20081007Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20080627Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20080616Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20080616Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20080515Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20080227Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20080225Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20071120Eli Venter A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20071109Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20071108Brian P. 
Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070827Sergey Koren A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070810Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070803Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070801Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070529Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070416Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070404Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070329Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070305Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070225Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070223Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070218Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070214Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070213Art Delcher A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070212Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070211Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070208Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070129Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20070128Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20060926Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20060925Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20060821Aaron Halpern A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20060605Brian P. 
Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20060327Aaron Halpern A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20050915Eli Venter A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20050824Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20050805Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20050804Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20050801Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20050715Eli Venter A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20050713Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20050713Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20050713Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20050711Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C jcvi20050616Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C tigr20050322Jason Miller A src/overlapInCore/liboverlap/prefixEditDistance-extend.C tigr20050322Jason Miller A src/overlapInCore/liboverlap/prefixEditDistance-extend.C tigr20040923Michael Schatz D src/overlapInCore/liboverlap/prefixEditDistance-extend.C src/AS_OVL/AS_OVL_overlap_common.h D src/overlapInCore/liboverlap/prefixEditDistance-extend.C src/AS_OVM/overlapInCore-Extend_Alignment.C D src/overlapInCore/liboverlap/prefixEditDistance-extend.C src/overlapInCore/overlapInCore-Extend_Alignment.C D src/overlapInCore/liboverlap/prefixEditDistance-extend.C src/overlapInCore/prefixEditDistance-extend.C A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4200.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4200.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3900.C bnbi20150603Brian P. 
Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3900.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4500.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4500.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3700.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3700.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3000.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3000.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0100.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0100.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0600.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0600.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150720Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150720Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150717Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150715Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150715Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150715Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150715Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150713Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150710Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150710Brian P. 
Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150710Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150709Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150709Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150708Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150708Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150701Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C bnbi20150211Brian P. Walenz D src/overlapInCore/liboverlap/prefixEditDistance-reverse.C src/overlapInCore/prefixEditDistance-reverse.C A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0800.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0800.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/Binomial_Bound.H bnbi20150209Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4600.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4600.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4100.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4100.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/Display_Alignment.H bnbi20150708Brian P. Walenz A src/overlapInCore/liboverlap/Display_Alignment.H bnbi20150624Brian P. Walenz A src/overlapInCore/liboverlap/Display_Alignment.H bnbi20150209Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3300.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3300.C bnbi20150602Brian P. 
Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3400.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3400.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4800.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4800.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2500.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2500.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2200.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2200.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-5000.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-5000.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1900.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1900.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1700.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1700.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1000.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1000.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3500.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3500.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4900.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4900.C bnbi20150602Brian P. 
Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3200.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-3200.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4000.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4000.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4700.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-4700.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0900.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0900.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0700.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-0700.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1100.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1100.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1600.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1600.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1800.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-1800.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2300.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2300.C bnbi20150602Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.C bnbi20150921Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.C bnbi20150720Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.C bnbi20150715Brian P. 
Walenz A src/overlapInCore/liboverlap/prefixEditDistance.C bnbi20150713Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.C bnbi20150710Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.C bnbi20150710Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.C bnbi20150708Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.C bnbi20150605Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2400.C bnbi20150603Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-matchLimit-2400.C bnbi20150602Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C bnbi20150825Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C bnbi20150824Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C bnbi20150821Sergey Koren A src/overlapInCore/overlapInCore-Process_Overlaps.C bnbi20150824Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C bnbi20150720Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C bnbi20150720Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C bnbi20150603Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C bnbi20150529Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C bnbi20150317Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C bnbi20150113Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C bnbi20141219Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C bnbi20141215Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20130801Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20111213Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20110803Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20110803Brian P. 
Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20110801Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20110729Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20110729Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C bnbi20110617Sergey Koren A src/overlapInCore/overlapInCore-Process_Overlaps.C bnbi20110308Sergey Koren A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20100830Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20100819Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20100402Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20100330Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20100216Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20091028Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20091026Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20090730Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20090610Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20090520Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20090202Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20090116Sergey Koren A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20081008Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20081008Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20081007Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20080627Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20080616Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20080616Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20080515Brian P. 
Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20080227Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20080225Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20071120Eli Venter A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20071109Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20071108Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070827Sergey Koren A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070810Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070803Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070801Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070529Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070416Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070404Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070329Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070305Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070225Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070223Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070218Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070214Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070213Art Delcher A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070212Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070211Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070208Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070129Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20070128Brian P. 
Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20060926Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20060925Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20060821Aaron Halpern A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20060605Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20060327Aaron Halpern A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20050915Eli Venter A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20050824Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20050805Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20050804Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20050801Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20050715Eli Venter A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20050713Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20050713Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20050713Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20050711Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C jcvi20050616Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C tigr20050322Jason Miller A src/overlapInCore/overlapInCore-Process_Overlaps.C tigr20050322Jason Miller A src/overlapInCore/overlapInCore-Process_Overlaps.C tigr20040923Michael Schatz D src/overlapInCore/overlapInCore-Process_Overlaps.C src/AS_OVL/AS_OVL_overlap_common.h D src/overlapInCore/overlapInCore-Process_Overlaps.C src/AS_OVM/overlapInCore-Process_Overlaps.C A src/AS_global.C nihh20151108Brian P. Walenz A src/AS_global.C bnbi20150303Brian P. Walenz A src/AS_global.C jcvi20130801Brian P. Walenz A src/AS_global.C jcvi20130801Brian P. Walenz A src/AS_global.C jcvi20130801Brian P. Walenz A src/AS_global.C jcvi20130417Brian P. 
Walenz A src/AS_global.C jcvi20130314Brian P. Walenz A src/AS_global.C jcvi20120529Brian P. Walenz A src/AS_global.C jcvi20120508Brian P. Walenz A src/AS_global.C jcvi20120508Brian P. Walenz A src/AS_global.C jcvi20120321Brian P. Walenz A src/AS_global.C jcvi20110125Brian P. Walenz A src/AS_global.C jcvi20091125Brian P. Walenz A src/AS_global.C jcvi20090501Brian P. Walenz A src/AS_global.C jcvi20090408Brian P. Walenz A src/AS_global.C jcvi20090306Sergey Koren A src/AS_global.C jcvi20090222Brian P. Walenz A src/AS_global.C jcvi20081113Brian P. Walenz A src/AS_global.C jcvi20081106Brian P. Walenz A src/AS_global.C jcvi20081009Brian P. Walenz A src/AS_global.C jcvi20081008Brian P. Walenz A src/AS_global.C jcvi20081008Brian P. Walenz A src/AS_global.C jcvi20080716Brian P. Walenz A src/AS_global.C jcvi20080716Brian P. Walenz A src/AS_global.C jcvi20080716Brian P. Walenz A src/AS_global.C jcvi20080627Brian P. Walenz A src/AS_global.C jcvi20070809Brian P. Walenz A src/AS_global.C jcvi20070804Brian P. Walenz A src/AS_global.C jcvi20070803Brian P. Walenz A src/mercy/mercy.C bnbi20141205Brian P. Walenz A src/mercy/mercy.C jcvi20140411Brian P. Walenz A src/mercy/mercy.C jcvi20130801Brian P. Walenz A src/mercy/mercy.C jcvi20130801Brian P. Walenz A src/mercy/mercy.C jcvi20110314Brian P. Walenz A src/mercy/mercy.C jcvi20090512Brian P. Walenz A src/mercy/mercy.C jcvi20081009Brian P. Walenz A src/mercy/mercy.C jcvi20081008Brian P. Walenz A src/mercy/mercy.C jcvi20080627Brian P. Walenz A src/mercy/mercy.C jcvi20070319Brian P. Walenz D src/mercy/mercy.C src/AS_MER/mercy.C A src/mercy/mercy-regions.C bnbi20141205Brian P. Walenz A src/mercy/mercy-regions.C bnbi20141014Brian P. Walenz A src/mercy/mercy-regions.C bnbi20141009Brian P. Walenz A src/mercy/mercy-regions.C jcvi20130801Brian P. Walenz A src/mercy/mercy-regions.C jcvi20130801Brian P. Walenz A src/mercy/mercy-regions.C jcvi20111229Brian P. Walenz A src/mercy/mercy-regions.C jcvi20101004Brian P. 
Walenz A src/mercy/mercy-regions.C jcvi20081009Brian P. Walenz A src/mercy/mercy-regions.C jcvi20081008Brian P. Walenz A src/mercy/mercy-regions.C jcvi20080627Brian P. Walenz A src/mercy/mercy-regions.C jcvi20070319Brian P. Walenz D src/mercy/mercy-regions.C src/AS_MER/mercy-regions.C A src/utgcns/utgcns.C nihh20151030Brian P. Walenz A src/utgcns/utgcns.C nihh20151012Brian P. Walenz A src/utgcns/utgcns.C nihh20151009Brian P. Walenz A src/utgcns/utgcns.C bnbi20150807Brian P. Walenz A src/utgcns/utgcns.C bnbi20150730Brian P. Walenz A src/utgcns/utgcns.C bnbi20150717Brian P. Walenz A src/utgcns/utgcns.C bnbi20150708Brian P. Walenz A src/utgcns/utgcns.C bnbi20150528Brian P. Walenz A src/utgcns/utgcns.C bnbi20150507Brian P. Walenz A src/utgcns/utgcns.C bnbi20150507Brian P. Walenz A src/utgcns/utgcns.C bnbi20150409Brian P. Walenz A src/utgcns/utgcns.C bnbi20150306Brian P. Walenz A src/utgcns/utgcns.C bnbi20150306Brian P. Walenz A src/utgcns/utgcns.C bnbi20150303Brian P. Walenz A src/utgcns/utgcns.C bnbi20150131Brian P. Walenz A src/utgcns/utgcns.C bnbi20150127Brian P. Walenz A src/utgcns/utgcns.C bnbi20150127Brian P. Walenz A src/utgcns/utgcns.C bnbi20150127Brian P. Walenz A src/utgcns/utgcns.C bnbi20150123Brian P. Walenz A src/utgcns/utgcns.C bnbi20150121Brian P. Walenz A src/utgcns/utgcns.C bnbi20150113Brian P. Walenz A src/utgcns/utgcns.C bnbi20150110Brian P. Walenz A src/utgcns/utgcns.C bnbi20141230Brian P. Walenz A src/utgcns/utgcns.C jcvi20140331Brian P. Walenz A src/utgcns/utgcns.C jcvi20140304Brian P. Walenz A src/utgcns/utgcns.C jcvi20140206Brian P. Walenz A src/utgcns/utgcns.C jcvi20131104Brian P. Walenz A src/utgcns/utgcns.C jcvi20131031Brian P. Walenz A src/utgcns/utgcns.C jcvi20131030Brian P. Walenz A src/utgcns/utgcns.C jcvi20131028Brian P. Walenz A src/utgcns/utgcns.C jcvi20131011Brian P. Walenz A src/utgcns/utgcns.C jcvi20130918Brian P. Walenz A src/utgcns/utgcns.C jcvi20130909Brian P. Walenz A src/utgcns/utgcns.C jcvi20130801Brian P. 
Walenz A src/utgcns/utgcns.C jcvi20130801Brian P. Walenz A src/utgcns/utgcns.C jcvi20120829Brian P. Walenz A src/utgcns/utgcns.C jcvi20120810Brian P. Walenz A src/utgcns/utgcns.C jcvi20120806Brian P. Walenz A src/utgcns/utgcns.C jcvi20111204Brian P. Walenz A src/utgcns/utgcns.C jcvi20111129Brian P. Walenz A src/utgcns/utgcns.C jcvi20110102Brian P. Walenz A src/utgcns/utgcns.C jcvi20100923Brian P. Walenz A src/utgcns/utgcns.C jcvi20091209Brian P. Walenz A src/utgcns/utgcns.C jcvi20091209Brian P. Walenz A src/utgcns/utgcns.C jcvi20091130Brian P. Walenz A src/utgcns/utgcns.C jcvi20091104Brian P. Walenz A src/utgcns/utgcns.C jcvi20091012Brian P. Walenz A src/utgcns/utgcns.C jcvi20091010Brian P. Walenz A src/utgcns/utgcns.C jcvi20091007Brian P. Walenz A src/utgcns/utgcns.C jcvi20091007Brian P. Walenz A src/utgcns/utgcns.C jcvi20091005Brian P. Walenz D src/utgcns/utgcns.C src/AS_CNS/utgcns.C A src/utgcns/libNNalign/NNalign.H bnbi20150810Brian P. Walenz A src/utgcns/libNNalign/NNalign.H bnbi20150617Brian P. Walenz A src/utgcns/libNNalign/NNalign.H bnbi20150410Brian P. Walenz A src/utgcns/libNNalign/NNalign.H bnbi20150113Brian P. Walenz A src/utgcns/libNNalign/NNalign.H bnbi20150110Brian P. Walenz A src/utgcns/libNNalign/NNalign.H bnbi20141230Brian P. Walenz A src/utgcns/libNNalign/NNalign.H bnbi20141223Brian P. Walenz A src/utgcns/libNNalign/NNalign.H bnbi20141223Brian P. Walenz A src/utgcns/libNNalign/NNalign.H bnbi20141223Brian P. Walenz A src/utgcns/libNNalign/NNalign.H bnbi20141223Brian P. Walenz D src/utgcns/libNNalign/NNalign.H src/alignment/AS_ALN_aligners.H D src/utgcns/libNNalign/NNalign.H src/alignment/aligners.H A src/utgcns/libNNalign/NNalgorithm.C bnbi20150810Brian P. Walenz A src/utgcns/libNNalign/NNalgorithm.C bnbi20150603Brian P. Walenz A src/utgcns/libNNalign/NNalgorithm.C bnbi20141223Brian P. Walenz A src/utgcns/libNNalign/NNalgorithm.C bnbi20141223Brian P. Walenz A src/utgcns/libNNalign/NNalgorithm.C bnbi20141223Brian P. 
Walenz D src/utgcns/libNNalign/NNalgorithm.C src/alignment/AS_ALN_bruteforcedp.C D src/utgcns/libNNalign/NNalgorithm.C src/alignment/brute-force-dp.C A src/utgcns/libNNalign/NNalign.C bnbi20150810Brian P. Walenz A src/utgcns/libNNalign/NNalign.C bnbi20150617Brian P. Walenz A src/utgcns/libNNalign/NNalign.C bnbi20150603Brian P. Walenz A src/utgcns/libNNalign/NNalign.C bnbi20150303Brian P. Walenz A src/utgcns/libNNalign/NNalign.C bnbi20150121Brian P. Walenz A src/utgcns/libNNalign/NNalign.C bnbi20150113Brian P. Walenz A src/utgcns/libNNalign/NNalign.C bnbi20141223Brian P. Walenz A src/utgcns/libNNalign/NNalign.C bnbi20141223Brian P. Walenz D src/utgcns/libNNalign/NNalign.C src/alignment/AS_ALN_forcns.C D src/utgcns/libNNalign/NNalign.C src/alignment/alignment-drivers.C A src/utgcns/stashContains.H bnbi20150509Brian P. Walenz A src/utgcns/stashContains.H bnbi20150409Brian P. Walenz A src/utgcns/libcns/abIDs.H bnbi20150127Brian P. Walenz A src/utgcns/libcns/abIDs.H bnbi20150112Brian P. Walenz A src/utgcns/libcns/abIDs.H bnbi20150110Brian P. Walenz A src/utgcns/libcns/abIDs.H bnbi20150107Brian P. Walenz A src/utgcns/libcns/abIDs.H bnbi20150107Brian P. Walenz A src/utgcns/libcns/abIDs.H bnbi20141230Brian P. Walenz A src/utgcns/libcns/abIDs.H bnbi20141223Brian P. Walenz A src/utgcns/libcns/abIDs.H bnbi20141117Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20130801Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20130801Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20130801Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20130107Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20111215Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20111208Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20111206Brian P. 
Walenz A src/utgcns/libcns/abIDs.H jcvi20111204Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20111204Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20111129Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20111117Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20111115Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20110921Jason Miller A src/utgcns/libcns/abIDs.H jcvi20110102Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20110102Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20091026Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20091005Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20091005Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090925Sergey Koren A src/utgcns/libcns/abIDs.H jcvi20090924Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090924Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090914Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090907Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090804Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090710Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090622Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090610Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090610Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090529Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090529Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090515Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090515Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090421Sergey Koren A src/utgcns/libcns/abIDs.H jcvi20090417Sergey Koren A src/utgcns/libcns/abIDs.H jcvi20090415Sergey Koren A src/utgcns/libcns/abIDs.H jcvi20090108Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20090105Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20081230Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20081218Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20081211Sergey Koren A src/utgcns/libcns/abIDs.H jcvi20081107Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20081029Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20081029Brian P. 
Walenz A src/utgcns/libcns/abIDs.H jcvi20081029Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20081008Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20081008Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20080925Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20080925Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20080924Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20080627Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20080609Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20080425Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20080213Eli Venter A src/utgcns/libcns/abIDs.H jcvi20080128Sergey Koren A src/utgcns/libcns/abIDs.H jcvi20071108Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20071025Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20070925Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20070918Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20070905Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20070514Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20070316Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20070225Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20070214Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20070212Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20070208Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20061005Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20061002Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20060923Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20060918Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20060906Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20060612Brian P. 
Walenz A src/utgcns/libcns/abIDs.H jcvi20060213Eli Venter A src/utgcns/libcns/abIDs.H jcvi20051215Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20051127Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20051120Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20050829Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20050818Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20050805Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20050801Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20050801Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20050708Brian P. Walenz A src/utgcns/libcns/abIDs.H jcvi20050629Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20050614Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20050612Gennady Denisov A src/utgcns/libcns/abIDs.H jcvi20050523Gennady Denisov D src/utgcns/libcns/abIDs.H src/AS_CNS/MultiAlignment_CNS.h D src/utgcns/libcns/abIDs.H src/AS_CNS/MultiAlignment_CNS_private.H D src/utgcns/libcns/abIDs.H src/AS_CNS/MultiAlignment_CNS_private.h D src/utgcns/libcns/abIDs.H src/utgcns/libcns/MultiAlignment_CNS_private.H A src/utgcns/libcns/unitigConsensus.H nihh20151019Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H nihh20151013Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H nihh20151009Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H bnbi20150805Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H bnbi20150723Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H bnbi20150720Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H bnbi20150715Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H bnbi20150708Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H bnbi20150303Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H bnbi20150127Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H bnbi20150121Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H bnbi20150113Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H bnbi20150112Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H bnbi20150110Brian P. 
[Per-file modification-history records, run together in the original archive, one per revision, in the form "A <path> <institution><YYYYMMDD><author>" (A = file addition/change) or "D <path> <prior path>" (D = derived from a prior file). The records cover src/utgcns/libcns/unitigConsensus.H, src/utgcns/libcns/abMultiAlign.H, src/utgcns/libcns/abAbacus-refine.C, and src/utgcns/libcns/abAbacus.C, with prior names including src/AS_CNS/MultiAlignment_CNS.h, src/AS_CNS/MultiAlignment_CNS_private.H/.h, src/utgcns/libcns/MultiAlignment_CNS_private.H, src/AS_CNS/AbacusRefine.C/.c, src/AS_CNS/MultiAlignment_CNS.c, and src/utgcns/libcns/AbacusRefine.C. Revisions span 2004-09-23 through 2015-11-13 and are attributed to Brian P. Walenz, Sergey Koren, Gennady Denisov, Eli Venter, Aaron Halpern, Jason Miller, and Michael Schatz, at institutions abbreviated tigr, jcvi, bnbi, and nihh.]
Walenz A src/utgcns/libcns/abAbacus.C jcvi20070518Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070428Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070426Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070419Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20070416Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070402Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20070319Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070316Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20070310Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20070305Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070304Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070226Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070225Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070220Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070220Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070218Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070214Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070212Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070208Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070204Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070203Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070129Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070129Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20070128Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20061212Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20061117Brian P. 
Walenz A src/utgcns/libcns/abAbacus.C jcvi20061114Eli Venter A src/utgcns/libcns/abAbacus.C jcvi20061114Eli Venter A src/utgcns/libcns/abAbacus.C jcvi20061029Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061029Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061025Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061025Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061024Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061023Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061023Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061021Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061020Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061017Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20061017Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061017Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061016Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061016Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20061015Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061013Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061010Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061009Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061008Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20061005Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061003Aaron Halpern A src/utgcns/libcns/abAbacus.C jcvi20061002Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20061002Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060926Brian P. 
Walenz A src/utgcns/libcns/abAbacus.C jcvi20060924Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060922Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060918Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060913Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060909Aaron Halpern A src/utgcns/libcns/abAbacus.C jcvi20060908Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060908Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060907Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060906Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060905Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060829Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20060825Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060825Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060821Aaron Halpern A src/utgcns/libcns/abAbacus.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20060818Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060818Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060804Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060612Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20060602Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20060601Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20060524Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060521Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060520Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060519Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060503Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060406Brian P. 
Walenz A src/utgcns/libcns/abAbacus.C jcvi20060328Aaron Halpern A src/utgcns/libcns/abAbacus.C jcvi20060324Eli Venter A src/utgcns/libcns/abAbacus.C jcvi20060323Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060322Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060317Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060316Eli Venter A src/utgcns/libcns/abAbacus.C jcvi20060308Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20060302Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20060214Eli Venter A src/utgcns/libcns/abAbacus.C jcvi20060130Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20060112Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20051216Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20051215Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20051213Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20051212Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20051212Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20051127Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20051120Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20051001Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050930Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050930Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050929Aaron Halpern A src/utgcns/libcns/abAbacus.C jcvi20050929Aaron Halpern A src/utgcns/libcns/abAbacus.C jcvi20050926Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20050920Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20050918Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050917Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050914Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050912Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20050911Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050827Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050819Brian P. 
Walenz A src/utgcns/libcns/abAbacus.C jcvi20050819Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050818Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050812Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050810Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050808Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050805Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050727Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050708Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20050630Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050629Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050623Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050621Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050616Brian P. Walenz A src/utgcns/libcns/abAbacus.C jcvi20050614Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050612Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050523Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050509Gennady Denisov A src/utgcns/libcns/abAbacus.C jcvi20050502Eli Venter A src/utgcns/libcns/abAbacus.C jcvi20050330Eli Venter A src/utgcns/libcns/abAbacus.C tigr20050322Jason Miller A src/utgcns/libcns/abAbacus.C tigr20050322Jason Miller A src/utgcns/libcns/abAbacus.C tigr20040923Michael Schatz D src/utgcns/libcns/abAbacus.C src/AS_CNS/MultiAlignment_CNS.C D src/utgcns/libcns/abAbacus.C src/AS_CNS/MultiAlignment_CNS.c D src/utgcns/libcns/abAbacus.C src/utgcns/libcns/MultiAlignment_CNS.C A src/utgcns/libcns/abColumn.C bnbi20150121Brian P. Walenz A src/utgcns/libcns/abColumn.C bnbi20150113Brian P. Walenz A src/utgcns/libcns/abColumn.C bnbi20150107Brian P. Walenz A src/utgcns/libcns/abColumn.C bnbi20150107Brian P. 
Walenz A src/utgcns/libcns/abColumn.C bnbi20141230Brian P. Walenz A src/utgcns/libcns/abColumn.C bnbi20141117Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20130617Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20111215Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20111212Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20111209Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20111207Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20111207Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20111206Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20111204Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20111117Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20111115Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20111110Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20110102Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20100923Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20091026Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20091010Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20091005Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090929Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090924Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090914Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090804Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090730Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090710Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090622Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090622Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090610Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090529Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090527Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090527Brian P. 
Walenz A src/utgcns/libcns/abColumn.C jcvi20090522Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090520Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090520Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090519Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090516Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090515Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090514Sergey Koren A src/utgcns/libcns/abColumn.C jcvi20090421Sergey Koren A src/utgcns/libcns/abColumn.C jcvi20090417Sergey Koren A src/utgcns/libcns/abColumn.C jcvi20090415Sergey Koren A src/utgcns/libcns/abColumn.C jcvi20090223Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090126Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090116Sergey Koren A src/utgcns/libcns/abColumn.C jcvi20090113Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090108Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090107Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090105Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090105Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090105Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090104Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20090102Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081231Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081231Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081230Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081230Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081229Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081229Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081219Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081218Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081218Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081205Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081113Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081112Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081107Brian P. 
Walenz A src/utgcns/libcns/abColumn.C jcvi20081105Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081029Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081029Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081011Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081011Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081008Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081007Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081001Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20081001Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080925Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080811Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080811Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080731Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080627Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080606Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20080425Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080404Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080404Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080318Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080227Sergey Koren A src/utgcns/libcns/abColumn.C jcvi20080213Eli Venter A src/utgcns/libcns/abColumn.C jcvi20080204Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20080124Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20080101Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20071220Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20071218Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20071218Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20071217Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20071208Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20071108Brian P. 
Walenz A src/utgcns/libcns/abColumn.C jcvi20071025Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20071017Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20071017Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20071003Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20071002Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20070925Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070918Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070905Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070902Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070902Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070902Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070901Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070901Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070831Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20070830Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20070830Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20070829Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070829Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20070807Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070804Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070803Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070617Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20070612Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070610Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20070601Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20070601Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070601Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070529Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070518Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070518Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070428Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070426Brian P. 
Walenz A src/utgcns/libcns/abColumn.C jcvi20070419Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20070416Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070402Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20070319Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070316Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20070310Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20070305Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070304Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070226Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070225Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070220Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070220Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070218Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070214Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070212Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070208Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070204Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070203Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070129Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070129Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20070128Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20061212Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20061117Brian P. 
Walenz A src/utgcns/libcns/abColumn.C jcvi20061114Eli Venter A src/utgcns/libcns/abColumn.C jcvi20061114Eli Venter A src/utgcns/libcns/abColumn.C jcvi20061029Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061029Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061025Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061025Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061024Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061023Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061023Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061021Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061020Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061017Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20061017Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061017Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061016Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061016Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20061015Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061013Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061010Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061009Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061008Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20061005Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061003Aaron Halpern A src/utgcns/libcns/abColumn.C jcvi20061002Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20061002Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060926Brian P. 
Walenz A src/utgcns/libcns/abColumn.C jcvi20060924Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060922Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060918Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060913Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060909Aaron Halpern A src/utgcns/libcns/abColumn.C jcvi20060908Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060908Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060907Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060906Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060905Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060829Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20060825Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060825Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060821Aaron Halpern A src/utgcns/libcns/abColumn.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20060818Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060818Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060804Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060612Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20060602Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20060601Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20060524Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060521Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060520Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060519Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060503Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060406Brian P. 
Walenz A src/utgcns/libcns/abColumn.C jcvi20060328Aaron Halpern A src/utgcns/libcns/abColumn.C jcvi20060324Eli Venter A src/utgcns/libcns/abColumn.C jcvi20060323Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060322Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060317Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060316Eli Venter A src/utgcns/libcns/abColumn.C jcvi20060308Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20060302Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20060214Eli Venter A src/utgcns/libcns/abColumn.C jcvi20060130Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20060112Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20051216Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20051215Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20051213Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20051212Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20051212Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20051127Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20051120Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20051001Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050930Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050930Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050929Aaron Halpern A src/utgcns/libcns/abColumn.C jcvi20050929Aaron Halpern A src/utgcns/libcns/abColumn.C jcvi20050926Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20050920Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20050918Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050917Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050914Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050912Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20050911Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050827Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050819Brian P. 
Walenz A src/utgcns/libcns/abColumn.C jcvi20050819Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050818Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050812Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050810Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050808Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050805Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050727Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050708Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20050630Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050629Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050623Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050621Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050616Brian P. Walenz A src/utgcns/libcns/abColumn.C jcvi20050614Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050612Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050523Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050509Gennady Denisov A src/utgcns/libcns/abColumn.C jcvi20050502Eli Venter A src/utgcns/libcns/abColumn.C jcvi20050330Eli Venter A src/utgcns/libcns/abColumn.C tigr20050322Jason Miller A src/utgcns/libcns/abColumn.C tigr20050322Jason Miller A src/utgcns/libcns/abColumn.C tigr20040923Michael Schatz D src/utgcns/libcns/abColumn.C src/AS_CNS/MultiAlignment_CNS.C D src/utgcns/libcns/abColumn.C src/AS_CNS/MultiAlignment_CNS.c D src/utgcns/libcns/abColumn.C src/utgcns/libcns/MultiAlignment_CNS.C A src/utgcns/libcns/abBead.H bnbi20150127Brian P. Walenz A src/utgcns/libcns/abBead.H bnbi20150121Brian P. Walenz A src/utgcns/libcns/abBead.H bnbi20150108Brian P. Walenz A src/utgcns/libcns/abBead.H bnbi20150107Brian P. Walenz A src/utgcns/libcns/abBead.H bnbi20150107Brian P. 
Walenz
A src/utgcns/libcns/abBead.H bnbi 20141230 Brian P. Walenz
A src/utgcns/libcns/abBead.H bnbi 20141223 Brian P. Walenz
A src/utgcns/libcns/abBead.H bnbi 20141117 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20130801 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20130801 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20130801 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20130107 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20111215 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20111208 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20111207 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20111207 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20111207 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20111207 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20111206 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20111204 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20111204 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20111129 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20111117 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20111115 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20110921 Jason Miller
A src/utgcns/libcns/abBead.H jcvi 20110102 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20110102 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20091026 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20091005 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20091005 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090925 Sergey Koren
A src/utgcns/libcns/abBead.H jcvi 20090924 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090924 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090914 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090907 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090804 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090710 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090622 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090610 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090610 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090529 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090529 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090515 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090515 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090421 Sergey Koren
A src/utgcns/libcns/abBead.H jcvi 20090417 Sergey Koren
A src/utgcns/libcns/abBead.H jcvi 20090415 Sergey Koren
A src/utgcns/libcns/abBead.H jcvi 20090108 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20090105 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20081230 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20081218 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20081211 Sergey Koren
A src/utgcns/libcns/abBead.H jcvi 20081107 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20081029 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20081029 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20081029 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20081008 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20081008 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20080925 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20080925 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20080924 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20080627 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20080609 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20080425 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20080213 Eli Venter
A src/utgcns/libcns/abBead.H jcvi 20080128 Sergey Koren
A src/utgcns/libcns/abBead.H jcvi 20071108 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20071025 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20070925 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20070918 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20070905 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20070514 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20070316 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20070225 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20070214 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20070212 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20070208 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20061005 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20061002 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20060923 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20060918 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20060906 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20060612 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20060213 Eli Venter
A src/utgcns/libcns/abBead.H jcvi 20051215 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20051127 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20051120 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20050829 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20050818 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20050805 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20050801 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20050801 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20050708 Brian P. Walenz
A src/utgcns/libcns/abBead.H jcvi 20050629 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20050614 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20050612 Gennady Denisov
A src/utgcns/libcns/abBead.H jcvi 20050523 Gennady Denisov
D src/utgcns/libcns/abBead.H src/AS_CNS/MultiAlignment_CNS.h
D src/utgcns/libcns/abBead.H src/AS_CNS/MultiAlignment_CNS_private.H
D src/utgcns/libcns/abBead.H src/AS_CNS/MultiAlignment_CNS_private.h
D src/utgcns/libcns/abBead.H src/utgcns/libcns/MultiAlignment_CNS_private.H
A src/utgcns/libcns/abAbacus-baseCall.C bnbi 20150728 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C bnbi 20150131 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C bnbi 20150131 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C bnbi 20150127 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C bnbi 20150127 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C bnbi 20150123 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C bnbi 20150121 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C bnbi 20150121 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C bnbi 20150110 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C bnbi 20150107 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C bnbi 20141230 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C bnbi 20141117 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20130801 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20130801 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20130801 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20130801 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20130617 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111221 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111215 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111215 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111212 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111209 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111209 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111208 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111207 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111207 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111207 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111207 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111206 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111204 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111117 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111115 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20111110 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20110102 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20110102 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20100923 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20100819 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20091026 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20091010 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20091005 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090929 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090924 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090914 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090804 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090730 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090710 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090709 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090622 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090622 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090610 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090529 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090529 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090527 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090527 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090522 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090520 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090520 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090519 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090516 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090515 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090514 Sergey Koren
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090421 Sergey Koren
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090417 Sergey Koren
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090415 Sergey Koren
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090223 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090126 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090116 Sergey Koren
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090113 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090108 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090107 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090105 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090105 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090105 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090104 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20090102 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081231 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081231 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081230 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081230 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081229 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081229 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081219 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081218 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081218 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081205 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081113 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081112 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081107 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081105 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081029 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081029 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081011 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081011 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081008 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081007 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081001 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20081001 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080925 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080924 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080924 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080924 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080924 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080811 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080811 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080731 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080627 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080606 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080425 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080404 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080404 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080318 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080227 Sergey Koren
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080213 Eli Venter
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080204 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080124 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20080101 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20071220 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20071218 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20071218 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20071217 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20071208 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20071108 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20071025 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20071017 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20071017 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20071003 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20071002 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070925 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070918 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070905 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070902 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070902 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070902 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070901 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070901 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070831 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070830 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070830 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070829 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070829 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070807 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070804 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070803 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070617 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070612 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070610 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070601 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070601 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070601 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070529 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070518 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070518 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070428 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070426 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070419 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070416 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070402 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070319 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070316 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070310 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070305 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070304 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070226 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070225 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070220 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070220 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070218 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070214 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070212 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070208 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070204 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070203 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070129 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070129 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20070128 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061212 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061117 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061114 Eli Venter
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061114 Eli Venter
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061029 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061029 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061025 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061025 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061024 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061023 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061023 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061021 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061020 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061017 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061017 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061017 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061016 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061016 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061015 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061013 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061010 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061009 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061008 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061005 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061003 Aaron Halpern
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061002 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20061002 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060926 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060924 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060922 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060918 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060913 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060909 Aaron Halpern
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060908 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060908 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060907 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060906 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060905 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060829 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060825 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060825 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060821 Aaron Halpern
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060821 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060821 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060821 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060818 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060818 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060804 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060612 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060602 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060601 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060524 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060521 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060520 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060519 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060503 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060406 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060328 Aaron Halpern
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060324 Eli Venter
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060323 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060322 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060317 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060316 Eli Venter
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060308 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060302 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060214 Eli Venter
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060130 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20060112 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20051216 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20051215 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20051213 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20051212 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20051212 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20051127 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20051120 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20051001 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050930 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050930 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050929 Aaron Halpern
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050929 Aaron Halpern
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050926 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050920 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050918 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050917 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050914 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050912 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050911 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050827 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050819 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050819 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050818 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050812 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050810 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050808 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050805 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050801 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050801 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050801 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050801 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050727 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050708 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050630 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050629 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050623 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050621 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050616 Brian P. Walenz
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050614 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050612 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050523 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050509 Gennady Denisov
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050502 Eli Venter
A src/utgcns/libcns/abAbacus-baseCall.C jcvi 20050330 Eli Venter
A src/utgcns/libcns/abAbacus-baseCall.C tigr 20050322 Jason Miller
A src/utgcns/libcns/abAbacus-baseCall.C tigr 20050322 Jason Miller
A src/utgcns/libcns/abAbacus-baseCall.C tigr 20040923 Michael Schatz
D src/utgcns/libcns/abAbacus-baseCall.C src/AS_CNS/BaseCall.C
D src/utgcns/libcns/abAbacus-baseCall.C src/AS_CNS/BaseCall.c
D src/utgcns/libcns/abAbacus-baseCall.C src/AS_CNS/MultiAlignment_CNS.c
D src/utgcns/libcns/abAbacus-baseCall.C src/utgcns/libcns/BaseCall.C
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C nihh 20151014 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C bnbi 20150306 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C bnbi 20150127 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C bnbi 20150123 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C bnbi 20150110 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C bnbi 20150109 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C bnbi 20150107 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C bnbi 20141230 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C bnbi 20141117 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20130801 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20130801 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20130801 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20130617 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20111215 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20111212 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20111212 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20111209 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20111208 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20111207 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20111207 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20111207 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20111206 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20111204 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20111117 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20111115 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20111115 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20111110 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20110102 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20110102 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20100923 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20100923 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20091026 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20091010 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20091005 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090929 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090929 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090925 Sergey Koren
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090924 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090914 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090804 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090730 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090710 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090622 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090622 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090610 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090610 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090529 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090529 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090527 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090527 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090522 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090520 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090520 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090519 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090516 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090515 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090514 Sergey Koren
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090421 Sergey Koren
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090417 Sergey Koren
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090415 Sergey Koren
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090223 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090126 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090116 Sergey Koren
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090113 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090108 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090107 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090105 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090105 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090105 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090104 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20090102 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081231 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081231 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081230 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081230 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081229 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081229 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081219 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081218 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081218 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081205 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081113 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081112 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081107 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081105 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081029 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081029 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081011 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081011 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081008 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081007 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081001 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20081001 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080925 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080924 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080924 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080924 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080924 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080811 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080811 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080731 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080627 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080606 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080425 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080404 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080404 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080318 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080227 Sergey Koren
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080213 Eli Venter
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080204 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080124 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20080101 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20071220 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20071218 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20071218 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20071217 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20071208 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20071108 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20071025 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20071017 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20071017 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20071003 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20071002 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070925 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070918 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070905 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070902 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070902 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070902 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070901 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070901 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070831 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070830 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070830 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070829 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070829 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070807 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070804 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070803 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070617 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070612 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070610 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070601 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070601 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070601 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070529 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070518 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070518 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070428 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070426 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070419 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070416 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070402 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070319 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070316 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070310 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070305 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070304 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070226 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070225 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070220 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070220 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070218 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070214 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070212 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070208 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070204 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070203 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070129 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070129 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20070128 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061212 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061117 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061114 Eli Venter
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061114 Eli Venter
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061029 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061029 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061025 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061025 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061024 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061023 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061023 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061021 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061020 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061017 Brian P. Walenz
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061017 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061017 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061016 Gennady Denisov
A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi 20061016 Brian P.
Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20061015Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20061013Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20061010Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20061009Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20061008Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20061005Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20061003Aaron Halpern A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20061002Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20061002Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060926Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060924Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060922Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060918Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060913Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060909Aaron Halpern A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060908Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060908Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060907Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060906Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060905Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060829Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060825Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060825Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060821Aaron Halpern A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060821Brian P. 
Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060818Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060818Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060804Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060612Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060602Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060601Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060524Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060521Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060520Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060519Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060503Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060406Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060328Aaron Halpern A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060324Eli Venter A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060323Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060322Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060317Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060316Eli Venter A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060308Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060302Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060214Eli Venter A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060130Brian P. 
Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20060112Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20051216Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20051215Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20051213Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20051212Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20051212Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20051127Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20051120Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20051001Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050930Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050930Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050929Aaron Halpern A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050929Aaron Halpern A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050926Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050920Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050918Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050917Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050914Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050912Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050911Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050827Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050819Brian P. 
Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050819Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050818Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050812Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050810Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050808Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050805Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050727Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050708Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050630Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050629Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050623Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050621Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050616Brian P. 
Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050614Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050612Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050523Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050509Gennady Denisov A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050502Eli Venter A src/utgcns/libcns/abAbacus-refreshMultiAlign.C jcvi20050330Eli Venter A src/utgcns/libcns/abAbacus-refreshMultiAlign.C tigr20050322Jason Miller A src/utgcns/libcns/abAbacus-refreshMultiAlign.C tigr20050322Jason Miller A src/utgcns/libcns/abAbacus-refreshMultiAlign.C tigr20040923Michael Schatz D src/utgcns/libcns/abAbacus-refreshMultiAlign.C src/AS_CNS/MultiAlignment_CNS.c D src/utgcns/libcns/abAbacus-refreshMultiAlign.C src/AS_CNS/RefreshMANode.C D src/utgcns/libcns/abAbacus-refreshMultiAlign.C src/AS_CNS/RefreshMANode.c D src/utgcns/libcns/abAbacus-refreshMultiAlign.C src/utgcns/libcns/RefreshMANode.C A src/utgcns/libcns/abAbacus.H nihh20151113Brian P. Walenz A src/utgcns/libcns/abAbacus.H nihh20151113Brian P. Walenz A src/utgcns/libcns/abAbacus.H nihh20151113Brian P. Walenz A src/utgcns/libcns/abAbacus.H nihh20151113Brian P. Walenz A src/utgcns/libcns/abAbacus.H nihh20151014Brian P. Walenz A src/utgcns/libcns/abAbacus.H nihh20151010Brian P. Walenz A src/utgcns/libcns/abAbacus.H nihh20151009Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150723Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150708Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150528Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150204Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150204Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150131Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150131Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150127Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150123Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150121Brian P. 
Walenz A src/utgcns/libcns/abAbacus.H bnbi20150121Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150113Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150112Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150110Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150109Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150108Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150108Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150107Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20150107Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20141230Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20141223Brian P. Walenz A src/utgcns/libcns/abAbacus.H bnbi20141117Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20130801Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20130801Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20130801Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20130107Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20111215Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20111208Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20111206Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20111204Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20111204Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20111129Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20111117Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20111115Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20110921Jason Miller A src/utgcns/libcns/abAbacus.H jcvi20110102Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20110102Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20091026Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20091005Brian P. 
Walenz A src/utgcns/libcns/abAbacus.H jcvi20091005Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090925Sergey Koren A src/utgcns/libcns/abAbacus.H jcvi20090924Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090924Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090914Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090907Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090804Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090710Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090622Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090610Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090610Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090529Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090529Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090515Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090515Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090421Sergey Koren A src/utgcns/libcns/abAbacus.H jcvi20090417Sergey Koren A src/utgcns/libcns/abAbacus.H jcvi20090415Sergey Koren A src/utgcns/libcns/abAbacus.H jcvi20090108Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20090105Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20081230Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20081218Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20081211Sergey Koren A src/utgcns/libcns/abAbacus.H jcvi20081107Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20081029Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20081029Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20081029Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20081008Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20081008Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20080925Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20080925Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20080924Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20080627Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20080609Brian P. 
Walenz A src/utgcns/libcns/abAbacus.H jcvi20080425Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20080213Eli Venter A src/utgcns/libcns/abAbacus.H jcvi20080128Sergey Koren A src/utgcns/libcns/abAbacus.H jcvi20071108Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20071025Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20070925Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20070918Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20070905Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20070514Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20070316Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20070225Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20070214Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20070212Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20070208Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20061005Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20061002Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20060923Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20060918Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20060906Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20060612Brian P. Walenz A src/utgcns/libcns/abAbacus.H jcvi20060213Eli Venter A src/utgcns/libcns/abAbacus.H jcvi20051215Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20051127Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20051120Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20050829Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20050818Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20050805Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20050801Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20050801Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20050708Brian P. 
Walenz A src/utgcns/libcns/abAbacus.H jcvi20050629Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20050614Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20050612Gennady Denisov A src/utgcns/libcns/abAbacus.H jcvi20050523Gennady Denisov D src/utgcns/libcns/abAbacus.H src/AS_CNS/MultiAlignment_CNS.h D src/utgcns/libcns/abAbacus.H src/AS_CNS/MultiAlignment_CNS_private.H D src/utgcns/libcns/abAbacus.H src/AS_CNS/MultiAlignment_CNS_private.h D src/utgcns/libcns/abAbacus.H src/utgcns/libcns/MultiAlignment_CNS_private.H A src/utgcns/libcns/abMultiAlign.C bnbi20150811Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150730Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150708Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150702Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150507Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150306Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150131Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150128Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150127Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150127Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150127Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150123Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150113Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150110Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150107Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20150107Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20141230Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C bnbi20141117Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20130617Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20111215Brian P. 
Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20111212Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20111209Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20111207Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20111207Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20111206Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20111204Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20111117Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20111115Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20111110Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20110102Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20100923Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20091026Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20091010Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20091005Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090929Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090924Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090914Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090804Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090730Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090710Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090622Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090622Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090610Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090529Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090527Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090527Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090522Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090520Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090520Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090519Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090516Brian P. 
Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090515Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090514Sergey Koren A src/utgcns/libcns/abMultiAlign.C jcvi20090421Sergey Koren A src/utgcns/libcns/abMultiAlign.C jcvi20090417Sergey Koren A src/utgcns/libcns/abMultiAlign.C jcvi20090415Sergey Koren A src/utgcns/libcns/abMultiAlign.C jcvi20090223Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090126Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090116Sergey Koren A src/utgcns/libcns/abMultiAlign.C jcvi20090113Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090108Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090107Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090105Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090105Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090105Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090104Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20090102Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081231Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081231Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081230Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081230Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081229Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081229Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081219Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081218Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081218Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081205Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081113Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081112Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081107Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081105Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081029Brian P. 
Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081029Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081011Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081011Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081008Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081007Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081001Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20081001Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080925Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080811Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080811Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080731Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080627Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080606Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20080425Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080404Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080404Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080318Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080227Sergey Koren A src/utgcns/libcns/abMultiAlign.C jcvi20080213Eli Venter A src/utgcns/libcns/abMultiAlign.C jcvi20080204Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20080124Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20080101Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20071220Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20071218Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20071218Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20071217Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20071208Brian P. 
Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20071108Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20071025Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20071017Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20071017Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20071003Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20071002Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20070925Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070918Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070905Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070902Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070902Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070902Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070901Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070901Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070831Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20070830Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20070830Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20070829Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070829Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20070807Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070804Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070803Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070617Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20070612Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070610Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20070601Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20070601Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070601Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070529Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070518Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070518Brian P. 
Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070428Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070426Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070419Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20070416Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070402Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20070319Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070316Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20070310Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20070305Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070304Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070226Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070225Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070220Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070220Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070218Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070214Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070212Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070208Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070204Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070203Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070129Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070129Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20070128Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20061212Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20061117Brian P. 
Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20061114Eli Venter A src/utgcns/libcns/abMultiAlign.C jcvi20061114Eli Venter A src/utgcns/libcns/abMultiAlign.C jcvi20061029Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061029Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061025Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061025Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061024Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061023Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061023Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061021Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061020Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061017Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20061017Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061017Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061016Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061016Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20061015Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061013Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061010Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061009Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061008Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20061005Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061003Aaron Halpern A src/utgcns/libcns/abMultiAlign.C jcvi20061002Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20061002Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060926Brian P. 
Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20060924Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060922Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060918Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060913Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060909Aaron Halpern A src/utgcns/libcns/abMultiAlign.C jcvi20060908Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060908Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060907Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060906Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060905Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060829Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20060825Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060825Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060821Aaron Halpern A src/utgcns/libcns/abMultiAlign.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20060818Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060818Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060804Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060612Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20060602Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20060601Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20060524Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060521Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060520Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060519Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060503Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060406Brian P. 
Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20060328Aaron Halpern A src/utgcns/libcns/abMultiAlign.C jcvi20060324Eli Venter A src/utgcns/libcns/abMultiAlign.C jcvi20060323Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060322Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060317Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060316Eli Venter A src/utgcns/libcns/abMultiAlign.C jcvi20060308Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20060302Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20060214Eli Venter A src/utgcns/libcns/abMultiAlign.C jcvi20060130Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20060112Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20051216Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20051215Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20051213Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20051212Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20051212Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20051127Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20051120Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20051001Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050930Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050930Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050929Aaron Halpern A src/utgcns/libcns/abMultiAlign.C jcvi20050929Aaron Halpern A src/utgcns/libcns/abMultiAlign.C jcvi20050926Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20050920Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20050918Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050917Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050914Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050912Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20050911Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050827Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050819Brian P. 
Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20050819Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050818Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050812Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050810Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050808Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050805Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050727Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050708Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20050630Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050629Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050623Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050621Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050616Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C jcvi20050614Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050612Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050523Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050509Gennady Denisov A src/utgcns/libcns/abMultiAlign.C jcvi20050502Eli Venter A src/utgcns/libcns/abMultiAlign.C jcvi20050330Eli Venter A src/utgcns/libcns/abMultiAlign.C tigr20050322Jason Miller A src/utgcns/libcns/abMultiAlign.C tigr20050322Jason Miller A src/utgcns/libcns/abMultiAlign.C tigr20040923Michael Schatz D src/utgcns/libcns/abMultiAlign.C src/AS_CNS/MultiAlignment_CNS.C D src/utgcns/libcns/abMultiAlign.C src/AS_CNS/MultiAlignment_CNS.c D src/utgcns/libcns/abMultiAlign.C src/utgcns/libcns/MultiAlignment_CNS.C A src/utgcns/libcns/abAbacus-mergeRefine.C nihh20151014Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C bnbi20150306Brian P. 
Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C bnbi20150127Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C bnbi20150123Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C bnbi20150110Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C bnbi20150108Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C bnbi20141230Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C bnbi20141117Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20130617Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20111215Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20111212Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20111209Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20111207Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20111207Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20111206Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20111204Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20111117Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20111115Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20111110Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20110102Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20110102Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20100923Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20091026Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20091010Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20091005Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20091005Brian P. 
Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090929Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090924Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090914Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090804Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090730Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090710Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090622Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090622Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090610Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090529Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090529Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090527Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090527Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090522Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090520Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090520Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090519Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090516Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090515Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090514Sergey Koren A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090421Sergey Koren A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090417Sergey Koren A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090415Sergey Koren A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090223Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090126Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090116Sergey Koren A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090113Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090108Brian P. 
Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090107Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090105Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090105Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090105Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090104Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20090102Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081231Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081231Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081230Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081230Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081229Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081229Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081219Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081218Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081218Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081205Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081113Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081112Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081107Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081105Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081029Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081029Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081011Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081011Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081008Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081007Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081001Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20081001Brian P. 
Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080925Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080811Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080811Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080731Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080627Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080606Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080425Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080404Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080404Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080318Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080227Sergey Koren A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080213Eli Venter A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080204Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080124Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20080101Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20071220Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20071218Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20071218Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20071217Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20071208Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20071108Brian P. 
Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20071025Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20071017Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20071017Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20071003Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20071002Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070925Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070918Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070905Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070902Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070902Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070902Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070901Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070901Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070831Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070830Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070830Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070829Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070829Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070807Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070804Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070803Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070617Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070612Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070610Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070601Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070601Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070601Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070529Brian P. 
Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070518Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070518Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070428Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070426Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070419Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070416Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070402Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070319Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070316Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070310Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070305Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070304Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070226Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070225Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070220Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070220Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070218Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070214Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070212Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070208Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070204Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070203Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070129Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070129Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20070128Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061212Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061117Brian P. 
Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061114Eli Venter A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061114Eli Venter A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061029Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061029Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061025Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061025Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061024Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061023Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061023Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061021Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061020Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061017Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061017Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061017Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061016Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061016Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061015Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061013Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061010Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061009Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061008Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061005Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061003Aaron Halpern A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061002Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20061002Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060926Brian P. 
Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060924Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060922Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060918Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060913Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060909Aaron Halpern A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060908Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060908Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060907Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060906Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060905Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060829Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060825Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060825Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060821Aaron Halpern A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060818Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060818Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060804Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060612Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060602Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060601Brian P. 
Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060524Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060521Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060520Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060519Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060503Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060406Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060328Aaron Halpern A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060324Eli Venter A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060323Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060322Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060317Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060316Eli Venter A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060308Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060302Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060214Eli Venter A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060130Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20060112Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20051216Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20051215Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20051213Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20051212Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20051212Brian P. 
Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20051127Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20051120Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20051001Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050930Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050930Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050929Aaron Halpern A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050929Aaron Halpern A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050926Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050920Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050918Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050917Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050914Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050912Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050911Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050827Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050819Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050819Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050818Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050812Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050810Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050808Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050805Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050801Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050727Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050708Brian P. 
Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050630Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050629Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050623Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050621Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050616Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050614Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050612Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050523Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050509Gennady Denisov A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050502Eli Venter A src/utgcns/libcns/abAbacus-mergeRefine.C jcvi20050330Eli Venter A src/utgcns/libcns/abAbacus-mergeRefine.C tigr20050322Jason Miller A src/utgcns/libcns/abAbacus-mergeRefine.C tigr20050322Jason Miller A src/utgcns/libcns/abAbacus-mergeRefine.C tigr20040923Michael Schatz D src/utgcns/libcns/abAbacus-mergeRefine.C src/AS_CNS/MergeRefine.C D src/utgcns/libcns/abAbacus-mergeRefine.C src/AS_CNS/MergeRefine.c D src/utgcns/libcns/abAbacus-mergeRefine.C src/AS_CNS/MultiAlignment_CNS.c D src/utgcns/libcns/abAbacus-mergeRefine.C src/utgcns/libcns/MergeRefine.C A src/utgcns/libcns/abSequence.H bnbi20150701Brian P. Walenz A src/utgcns/libcns/abSequence.H bnbi20150127Brian P. Walenz A src/utgcns/libcns/abSequence.H bnbi20150121Brian P. Walenz A src/utgcns/libcns/abSequence.H bnbi20150112Brian P. Walenz A src/utgcns/libcns/abSequence.H bnbi20150110Brian P. Walenz A src/utgcns/libcns/abSequence.H bnbi20150108Brian P. Walenz A src/utgcns/libcns/abSequence.H bnbi20150107Brian P. Walenz A src/utgcns/libcns/abSequence.H bnbi20150107Brian P. Walenz A src/utgcns/libcns/abSequence.H bnbi20141230Brian P. Walenz A src/utgcns/libcns/abSequence.H bnbi20141223Brian P. Walenz A src/utgcns/libcns/abSequence.H bnbi20141117Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20130801Brian P. 
Walenz A src/utgcns/libcns/abSequence.H jcvi20130801Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20130801Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20130107Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20111215Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20111208Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20111206Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20111204Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20111204Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20111129Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20111117Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20111115Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20110921Jason Miller A src/utgcns/libcns/abSequence.H jcvi20110102Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20110102Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20091026Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20091005Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20091005Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20090925Sergey Koren A src/utgcns/libcns/abSequence.H jcvi20090924Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20090924Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20090914Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20090907Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20090804Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20090710Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20090622Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20090610Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20090610Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20090529Brian P. 
Walenz A src/utgcns/libcns/abSequence.H jcvi20090529Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20090515Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20090515Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20090421Sergey Koren A src/utgcns/libcns/abSequence.H jcvi20090417Sergey Koren A src/utgcns/libcns/abSequence.H jcvi20090415Sergey Koren A src/utgcns/libcns/abSequence.H jcvi20090108Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20090105Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20081230Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20081218Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20081211Sergey Koren A src/utgcns/libcns/abSequence.H jcvi20081107Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20081029Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20081029Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20081029Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20081008Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20081008Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20080925Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20080925Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20080924Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20080627Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20080609Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20080425Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20080213Eli Venter A src/utgcns/libcns/abSequence.H jcvi20080128Sergey Koren A src/utgcns/libcns/abSequence.H jcvi20071108Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20071025Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20070925Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20070918Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20070905Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20070514Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20070316Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20070225Brian P. 
Walenz A src/utgcns/libcns/abSequence.H jcvi20070214Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20070212Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20070208Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20061005Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20061002Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20060923Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20060918Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20060906Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20060612Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20060213Eli Venter A src/utgcns/libcns/abSequence.H jcvi20051215Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20051127Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20051120Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20050829Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20050818Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20050805Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20050801Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20050801Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20050708Brian P. Walenz A src/utgcns/libcns/abSequence.H jcvi20050629Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20050614Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20050612Gennady Denisov A src/utgcns/libcns/abSequence.H jcvi20050523Gennady Denisov D src/utgcns/libcns/abSequence.H src/AS_CNS/MultiAlignment_CNS.h D src/utgcns/libcns/abSequence.H src/AS_CNS/MultiAlignment_CNS_private.H D src/utgcns/libcns/abSequence.H src/AS_CNS/MultiAlignment_CNS_private.h D src/utgcns/libcns/abSequence.H src/utgcns/libcns/MultiAlignment_CNS_private.H A src/utgcns/libcns/unitigConsensus.C nihh20151113Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20151020Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20151013Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20151009Brian P. 
Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150811Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150810Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150805Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150730Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150728Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150726Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150723Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150722Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150720Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150720Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150717Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150715Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150715Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150715Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150715Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150713Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150708Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150708Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150623Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150528Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150306Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150303Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150204Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150131Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150128Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150127Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150127Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150123Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150121Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150113Brian P. 
Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150112Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150110Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20150107Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20141230Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20141117Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20131004Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20130918Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20130918Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20130806Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20130801Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20130730Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20130703Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20130617Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20130617Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20120726Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111229Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111218Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111215Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111215Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111215Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111214Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111212Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111209Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111207Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111207Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111207Brian P. 
Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111206Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111204Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111204Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111129Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111123Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111123Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111117Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111117Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111115Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111115Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111115Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111114Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111111Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20111110Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C bnbi20111027Sergey Koren A src/utgcns/libcns/unitigConsensus.C jcvi20110728Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20110224Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20110104Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20110102Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20110102Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20100923Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20100923Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20100812Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20100423Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20100318Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20100316Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20100302Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20091209Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20091130Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20091104Brian P. 
Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20091026Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20091026Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20091010Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20091005Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20091005Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090929Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090924Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090924Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090914Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090811Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090807Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090804Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090804Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090731Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090730Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090726Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090715Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090715Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090713Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090710Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090710Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090707Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090707Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090629Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090624Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090624Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090624Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090622Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090622Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090622Brian P. 
Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090622Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090610Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090610Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090605Sergey Koren A src/utgcns/libcns/unitigConsensus.C jcvi20090529Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090529Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090527Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090527Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090522Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090520Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090520Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090519Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090516Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090515Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090514Sergey Koren A src/utgcns/libcns/unitigConsensus.C jcvi20090421Sergey Koren A src/utgcns/libcns/unitigConsensus.C jcvi20090417Sergey Koren A src/utgcns/libcns/unitigConsensus.C jcvi20090415Sergey Koren A src/utgcns/libcns/unitigConsensus.C jcvi20090223Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090126Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090116Sergey Koren A src/utgcns/libcns/unitigConsensus.C jcvi20090113Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090108Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090107Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090105Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090105Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090105Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090104Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20090102Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081231Brian P. 
Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081231Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081230Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081230Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081229Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081229Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081219Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081218Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081218Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081205Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081113Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081112Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081107Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081105Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081029Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081029Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081011Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081011Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081008Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081007Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081001Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20081001Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080925Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080924Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080811Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080811Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080731Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080627Brian P. 
Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080606Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20080425Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080404Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080404Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080318Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080227Sergey Koren A src/utgcns/libcns/unitigConsensus.C jcvi20080213Eli Venter A src/utgcns/libcns/unitigConsensus.C jcvi20080204Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20080124Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20080101Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20071220Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20071218Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20071218Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20071217Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20071208Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20071108Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20071025Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20071017Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20071017Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20071003Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20071002Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20070925Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070918Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070905Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070902Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070902Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070902Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070901Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070901Brian P. 
Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070831Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20070830Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20070830Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20070829Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070829Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20070807Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070804Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070803Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070617Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20070612Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070610Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20070601Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20070601Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070601Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070529Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070518Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070518Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070428Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070426Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070419Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20070416Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070402Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20070319Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070316Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20070310Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20070305Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070304Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070226Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070225Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070220Brian P. 
Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070220Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070218Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070214Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070212Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070208Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070204Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070203Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070129Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070129Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20070128Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20061212Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20061117Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20061114Eli Venter A src/utgcns/libcns/unitigConsensus.C jcvi20061114Eli Venter A src/utgcns/libcns/unitigConsensus.C jcvi20061029Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061029Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061025Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061025Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061024Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061023Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061023Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061021Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061020Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061017Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20061017Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061017Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061016Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061016Brian P. 
Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20061015Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061013Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061010Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061009Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061008Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20061005Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061003Aaron Halpern A src/utgcns/libcns/unitigConsensus.C jcvi20061002Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20061002Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060926Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20060924Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060922Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060918Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060913Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060909Aaron Halpern A src/utgcns/libcns/unitigConsensus.C jcvi20060908Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060908Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060907Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060906Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060905Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060829Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20060825Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060825Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060821Aaron Halpern A src/utgcns/libcns/unitigConsensus.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20060821Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20060821Brian P. 
Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20060818Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060818Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060804Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060612Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20060602Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20060601Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20060524Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060521Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060520Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060519Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060503Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060406Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20060328Aaron Halpern A src/utgcns/libcns/unitigConsensus.C jcvi20060324Eli Venter A src/utgcns/libcns/unitigConsensus.C jcvi20060323Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060322Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060317Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060316Eli Venter A src/utgcns/libcns/unitigConsensus.C jcvi20060308Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20060302Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20060214Eli Venter A src/utgcns/libcns/unitigConsensus.C jcvi20060130Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20060112Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20051216Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20051215Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20051213Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20051212Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20051212Brian P. 
Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20051127Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20051120Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20051001Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050930Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050930Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050929Aaron Halpern A src/utgcns/libcns/unitigConsensus.C jcvi20050929Aaron Halpern A src/utgcns/libcns/unitigConsensus.C jcvi20050926Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20050920Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20050918Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050917Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050914Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050912Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20050911Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050827Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050819Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20050819Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050818Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050812Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050810Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050808Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050805Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050801Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050801Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050801Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050801Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050727Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050708Brian P. 
Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20050630Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050629Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050623Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050621Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050616Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C jcvi20050614Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050612Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050523Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050509Gennady Denisov A src/utgcns/libcns/unitigConsensus.C jcvi20050502Eli Venter A src/utgcns/libcns/unitigConsensus.C jcvi20050330Eli Venter A src/utgcns/libcns/unitigConsensus.C tigr20050322Jason Miller A src/utgcns/libcns/unitigConsensus.C tigr20050322Jason Miller A src/utgcns/libcns/unitigConsensus.C tigr20040923Michael Schatz D src/utgcns/libcns/unitigConsensus.C src/AS_CNS/MultiAlignUnitig.C D src/utgcns/libcns/unitigConsensus.C src/AS_CNS/MultiAlignUnitig.c D src/utgcns/libcns/unitigConsensus.C src/AS_CNS/MultiAlignment_CNS.c D src/utgcns/libcns/unitigConsensus.C src/utgcns/libcns/MultiAlignUnitig.C A src/utgcns/libcns/abIterators.H bnbi20150131Brian P. Walenz A src/utgcns/libcns/abIterators.H bnbi20150113Brian P. Walenz A src/utgcns/libcns/abIterators.H bnbi20150107Brian P. Walenz A src/utgcns/libcns/abIterators.H bnbi20150107Brian P. Walenz A src/utgcns/libcns/abIterators.H bnbi20141230Brian P. Walenz A src/utgcns/libcns/abIterators.H bnbi20141223Brian P. Walenz A src/utgcns/libcns/abIterators.H bnbi20141117Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20130801Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20130801Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20130801Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20130107Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20111215Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20111208Brian P. 
Walenz A src/utgcns/libcns/abIterators.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20111206Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20111204Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20111204Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20111129Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20111117Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20111115Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20110921Jason Miller A src/utgcns/libcns/abIterators.H jcvi20110102Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20110102Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20091026Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20091005Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20091005Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20090925Sergey Koren A src/utgcns/libcns/abIterators.H jcvi20090924Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20090924Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20090914Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20090907Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20090804Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20090710Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20090622Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20090610Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20090610Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20090529Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20090529Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20090515Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20090515Brian P. 
Walenz A src/utgcns/libcns/abIterators.H jcvi20090421Sergey Koren A src/utgcns/libcns/abIterators.H jcvi20090417Sergey Koren A src/utgcns/libcns/abIterators.H jcvi20090415Sergey Koren A src/utgcns/libcns/abIterators.H jcvi20090108Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20090105Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20081230Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20081218Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20081211Sergey Koren A src/utgcns/libcns/abIterators.H jcvi20081107Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20081029Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20081029Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20081029Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20081008Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20081008Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20080925Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20080925Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20080924Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20080627Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20080609Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20080425Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20080213Eli Venter A src/utgcns/libcns/abIterators.H jcvi20080128Sergey Koren A src/utgcns/libcns/abIterators.H jcvi20071108Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20071025Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20070925Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20070918Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20070905Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20070514Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20070316Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20070225Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20070214Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20070212Brian P. 
Walenz A src/utgcns/libcns/abIterators.H jcvi20070208Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20061005Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20061002Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20060923Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20060918Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20060906Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20060612Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20060213Eli Venter A src/utgcns/libcns/abIterators.H jcvi20051215Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20051127Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20051120Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20050829Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20050818Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20050805Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20050801Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20050801Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20050708Brian P. Walenz A src/utgcns/libcns/abIterators.H jcvi20050629Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20050614Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20050612Gennady Denisov A src/utgcns/libcns/abIterators.H jcvi20050523Gennady Denisov D src/utgcns/libcns/abIterators.H src/AS_CNS/MultiAlignment_CNS.h D src/utgcns/libcns/abIterators.H src/AS_CNS/MultiAlignment_CNS_private.H D src/utgcns/libcns/abIterators.H src/AS_CNS/MultiAlignment_CNS_private.h D src/utgcns/libcns/abIterators.H src/utgcns/libcns/MultiAlignment_CNS_private.H A src/utgcns/libcns/abAbacus-applyAlignment.C nihh20151113Brian P. Walenz A src/utgcns/libcns/abAbacus-applyAlignment.C nihh20151113Brian P. Walenz A src/utgcns/libcns/abAbacus-applyAlignment.C nihh20151029Brian P. Walenz A src/utgcns/libcns/abAbacus-applyAlignment.C nihh20151014Brian P. Walenz A src/utgcns/libcns/abAbacus-applyAlignment.C nihh20151013Brian P. 
A src/utgcns/libcns/abAbacus-applyAlignment.C — modified 2004-09-23 through 2015-07-28 (tigr, jcvi, bnbi) by Michael Schatz, Jason Miller, Eli Venter, Aaron Halpern, Gennady Denisov, Sergey Koren, and Brian P. Walenz
D src/utgcns/libcns/abAbacus-applyAlignment.C — derived from src/AS_CNS/ApplyAlignment.C, src/AS_CNS/ApplyAlignment.c, src/AS_CNS/MultiAlignment_CNS.c, and src/utgcns/libcns/ApplyAlignment.C
A src/utgcns/libcns/abAbacus-addRead.C — modified 2005-07-08 through 2015-10-09 (jcvi, bnbi, nihh) by Eli Venter, Aaron Halpern, Gennady Denisov, Sergey Koren, and Brian P. Walenz
Walenz A src/utgcns/libcns/abAbacus-addRead.C jcvi20050630Gennady Denisov A src/utgcns/libcns/abAbacus-addRead.C jcvi20050629Gennady Denisov A src/utgcns/libcns/abAbacus-addRead.C jcvi20050623Gennady Denisov A src/utgcns/libcns/abAbacus-addRead.C jcvi20050621Gennady Denisov A src/utgcns/libcns/abAbacus-addRead.C jcvi20050616Brian P. Walenz A src/utgcns/libcns/abAbacus-addRead.C jcvi20050614Gennady Denisov A src/utgcns/libcns/abAbacus-addRead.C jcvi20050612Gennady Denisov A src/utgcns/libcns/abAbacus-addRead.C jcvi20050523Gennady Denisov A src/utgcns/libcns/abAbacus-addRead.C jcvi20050509Gennady Denisov A src/utgcns/libcns/abAbacus-addRead.C jcvi20050502Eli Venter A src/utgcns/libcns/abAbacus-addRead.C jcvi20050330Eli Venter A src/utgcns/libcns/abAbacus-addRead.C tigr20050322Jason Miller A src/utgcns/libcns/abAbacus-addRead.C tigr20050322Jason Miller A src/utgcns/libcns/abAbacus-addRead.C tigr20040923Michael Schatz D src/utgcns/libcns/abAbacus-addRead.C src/AS_CNS/MultiAlignment_CNS.C D src/utgcns/libcns/abAbacus-addRead.C src/AS_CNS/MultiAlignment_CNS.c D src/utgcns/libcns/abAbacus-addRead.C src/utgcns/libcns/MultiAlignment_CNS.C D src/utgcns/libcns/abAbacus-addRead.C src/utgcns/libcns/abacus-addRead.C A src/utgcns/libcns/abColumn.H bnbi20150701Brian P. Walenz A src/utgcns/libcns/abColumn.H bnbi20150127Brian P. Walenz A src/utgcns/libcns/abColumn.H bnbi20150121Brian P. Walenz A src/utgcns/libcns/abColumn.H bnbi20150121Brian P. Walenz A src/utgcns/libcns/abColumn.H bnbi20150113Brian P. Walenz A src/utgcns/libcns/abColumn.H bnbi20150112Brian P. Walenz A src/utgcns/libcns/abColumn.H bnbi20150110Brian P. Walenz A src/utgcns/libcns/abColumn.H bnbi20150109Brian P. Walenz A src/utgcns/libcns/abColumn.H bnbi20150107Brian P. Walenz A src/utgcns/libcns/abColumn.H bnbi20150107Brian P. Walenz A src/utgcns/libcns/abColumn.H bnbi20141230Brian P. Walenz A src/utgcns/libcns/abColumn.H bnbi20141223Brian P. Walenz A src/utgcns/libcns/abColumn.H bnbi20141117Brian P. 
Walenz A src/utgcns/libcns/abColumn.H jcvi20130801Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20130801Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20130801Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20130107Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20111215Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20111208Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20111207Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20111206Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20111204Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20111204Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20111129Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20111117Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20111115Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20110921Jason Miller A src/utgcns/libcns/abColumn.H jcvi20110102Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20110102Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20091026Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20091005Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20091005Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20090925Sergey Koren A src/utgcns/libcns/abColumn.H jcvi20090924Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20090924Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20090914Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20090907Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20090804Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20090710Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20090622Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20090610Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20090610Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20090529Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20090529Brian P. 
Walenz A src/utgcns/libcns/abColumn.H jcvi20090515Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20090515Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20090421Sergey Koren A src/utgcns/libcns/abColumn.H jcvi20090417Sergey Koren A src/utgcns/libcns/abColumn.H jcvi20090415Sergey Koren A src/utgcns/libcns/abColumn.H jcvi20090108Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20090105Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20081230Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20081218Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20081211Sergey Koren A src/utgcns/libcns/abColumn.H jcvi20081107Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20081029Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20081029Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20081029Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20081008Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20081008Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20080925Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20080925Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20080924Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20080627Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20080609Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20080425Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20080213Eli Venter A src/utgcns/libcns/abColumn.H jcvi20080128Sergey Koren A src/utgcns/libcns/abColumn.H jcvi20071108Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20071025Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20070925Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20070918Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20070905Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20070514Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20070316Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20070225Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20070214Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20070212Brian P. 
Walenz A src/utgcns/libcns/abColumn.H jcvi20070208Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20061005Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20061002Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20060923Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20060918Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20060906Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20060612Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20060213Eli Venter A src/utgcns/libcns/abColumn.H jcvi20051215Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20051127Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20051120Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20050829Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20050818Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20050805Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20050801Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20050801Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20050708Brian P. Walenz A src/utgcns/libcns/abColumn.H jcvi20050629Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20050614Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20050612Gennady Denisov A src/utgcns/libcns/abColumn.H jcvi20050523Gennady Denisov D src/utgcns/libcns/abColumn.H src/AS_CNS/MultiAlignment_CNS.h D src/utgcns/libcns/abColumn.H src/AS_CNS/MultiAlignment_CNS_private.H D src/utgcns/libcns/abColumn.H src/AS_CNS/MultiAlignment_CNS_private.h D src/utgcns/libcns/abColumn.H src/utgcns/libcns/MultiAlignment_CNS_private.H A src/utgcns/libNDalign/NDalgorithm-reverse.C nihh20151029Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-reverse.C nihh20151014Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-reverse.C nihh20151013Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-reverse.C bnbi20150805Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-reverse.C bnbi20150728Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-reverse.C bnbi20150726Brian P. 
Walenz A src/utgcns/libNDalign/NDalgorithm-reverse.C bnbi20150723Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-reverse.C bnbi20150722Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-reverse.C bnbi20150720Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-reverse.C bnbi20150720Brian P. Walenz D src/utgcns/libNDalign/NDalgorithm-reverse.C src/utgcns/libNDalign/prefixEditDistance-reverse.C A src/utgcns/libNDalign/NDalgorithm.C bnbi20150805Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm.C bnbi20150720Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm.C bnbi20150720Brian P. Walenz D src/utgcns/libNDalign/NDalgorithm.C src/utgcns/libNDalign/prefixEditDistance.C A src/utgcns/libNDalign/NDalign.C nihh20151113Brian P. Walenz A src/utgcns/libNDalign/NDalign.C nihh20151029Brian P. Walenz A src/utgcns/libNDalign/NDalign.C nihh20151019Brian P. Walenz A src/utgcns/libNDalign/NDalign.C nihh20151013Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150825Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150730Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150728Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150726Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150723Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150720Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150720Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150720Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150717Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150715Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150715Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150713Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150708Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150708Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150701Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150624Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150623Brian P. 
Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150623Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150617Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150617Brian P. Walenz A src/utgcns/libNDalign/NDalign.C bnbi20150617Brian P. Walenz D src/utgcns/libNDalign/NDalign.C src/overlapInCore/overlapAlign.C A src/utgcns/libNDalign/NDalgorithm-allocateMoreSpace.C bnbi20150805Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-allocateMoreSpace.C bnbi20150720Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-allocateMoreSpace.C bnbi20150720Brian P. Walenz D src/utgcns/libNDalign/NDalgorithm-allocateMoreSpace.C src/utgcns/libNDalign/prefixEditDistance-allocateMoreSpace.C A src/utgcns/libNDalign/NDalgorithm.H nihh20151029Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm.H nihh20151020Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm.H nihh20151019Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm.H bnbi20150805Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm.H bnbi20150726Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm.H bnbi20150722Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm.H bnbi20150720Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm.H bnbi20150720Brian P. Walenz D src/utgcns/libNDalign/NDalgorithm.H src/utgcns/libNDalign/prefixEditDistance.H A src/utgcns/libNDalign/NDalign.H bnbi20150825Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150728Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150726Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150723Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150720Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150720Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150720Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150717Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150708Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150708Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150701Brian P. 
Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150625Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150624Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150623Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150623Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150617Brian P. Walenz A src/utgcns/libNDalign/NDalign.H bnbi20150617Brian P. Walenz D src/utgcns/libNDalign/NDalign.H src/overlapInCore/overlapAlign.H A src/utgcns/libNDalign/NDalgorithm-forward.C nihh20151029Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-forward.C nihh20151013Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-forward.C bnbi20150805Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-forward.C bnbi20150726Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-forward.C bnbi20150723Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-forward.C bnbi20150722Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-forward.C bnbi20150720Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-forward.C bnbi20150720Brian P. Walenz D src/utgcns/libNDalign/NDalgorithm-forward.C src/utgcns/libNDalign/prefixEditDistance-forward.C A src/utgcns/libNDalign/NDalgorithm-extend.C nihh20151013Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-extend.C bnbi20150728Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-extend.C bnbi20150726Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-extend.C bnbi20150722Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-extend.C bnbi20150720Brian P. Walenz A src/utgcns/libNDalign/NDalgorithm-extend.C bnbi20150720Brian P. Walenz D src/utgcns/libNDalign/NDalgorithm-extend.C src/utgcns/libNDalign/prefixEditDistance-extend.C A src/utgcns/libNDalign/Binomial_Bound.C nihh20151012Brian P. Walenz A src/utgcns/stashContains.C bnbi20150509Brian P. Walenz A src/utgcns/stashContains.C bnbi20150409Brian P. Walenz A src/utgcns/utgcnsfix.C bnbi20150306Brian P. Walenz A src/utgcns/utgcnsfix.C bnbi20141230Brian P. 
Walenz A src/utgcns/utgcnsfix.C jcvi20131004Brian P. Walenz A src/utgcns/utgcnsfix.C jcvi20130918Brian P. Walenz A src/utgcns/utgcnsfix.C jcvi20130917Brian P. Walenz A src/utgcns/utgcnsfix.C jcvi20130801Brian P. Walenz A src/utgcns/utgcnsfix.C jcvi20130801Brian P. Walenz A src/utgcns/utgcnsfix.C jcvi20130308Brian P. Walenz A src/utgcns/utgcnsfix.C jcvi20120726Brian P. Walenz A src/utgcns/utgcnsfix.C jcvi20111229Brian P. Walenz A src/utgcns/utgcnsfix.C jcvi20111214Brian P. Walenz A src/utgcns/utgcnsfix.C jcvi20111212Brian P. Walenz A src/utgcns/utgcnsfix.C jcvi20111207Brian P. Walenz A src/utgcns/utgcnsfix.C jcvi20111204Brian P. Walenz A src/utgcns/utgcnsfix.C jcvi20111129Brian P. Walenz D src/utgcns/utgcnsfix.C src/AS_CNS/utgcnsfix.C A src/pipelines/canu.pl nihh20151119Brian P. Walenz A src/pipelines/canu.pl nihh20151116Brian P. Walenz A src/pipelines/canu.pl nihh20151109Brian P. Walenz A src/pipelines/canu.pl nihh20151108Brian P. Walenz A src/pipelines/canu.pl nihh20151108Brian P. Walenz A src/pipelines/canu.pl nihh20151104Brian P. Walenz A src/pipelines/canu.pl nihh20151103Brian P. Walenz A src/pipelines/canu.pl bnbi20150826Brian P. Walenz A src/pipelines/canu.pl bnbi20150826Brian P. Walenz A src/pipelines/canu.pl bnbi20150826Brian P. Walenz A src/pipelines/canu.pl bnbi20150825Brian P. Walenz A src/pipelines/canu.pl bnbi20150814Brian P. Walenz A src/pipelines/canu.pl bnbi20150814Brian P. Walenz A src/pipelines/canu.pl bnbi20150616Brian P. Walenz A src/pipelines/canu.pl bnbi20150616Brian P. Walenz A src/pipelines/canu.pl bnbi20150603Brian P. Walenz A src/pipelines/canu.pl bnbi20150528Brian P. Walenz A src/pipelines/canu.pl bnbi20150526Brian P. Walenz A src/pipelines/canu.pl bnbi20150514Brian P. Walenz A src/pipelines/canu.pl bnbi20150507Brian P. Walenz A src/pipelines/canu.pl bnbi20150429Brian P. Walenz A src/pipelines/canu.pl bnbi20150424Brian P. Walenz A src/pipelines/canu.pl bnbi20150421Brian P. Walenz A src/pipelines/canu.pl bnbi20150415Brian P. 
Walenz A src/pipelines/canu.pl bnbi20150415Brian P. Walenz A src/pipelines/canu.pl bnbi20150410Brian P. Walenz A src/pipelines/canu.pl bnbi20150409Brian P. Walenz A src/pipelines/canu.pl bnbi20150409Brian P. Walenz A src/pipelines/canu.pl bnbi20150327Brian P. Walenz A src/pipelines/canu.pl bnbi20150327Brian P. Walenz A src/pipelines/canu.pl bnbi20150327Brian P. Walenz A src/pipelines/canu.pl bnbi20150325Brian P. Walenz A src/pipelines/canu.pl bnbi20150321Brian P. Walenz A src/pipelines/canu.pl bnbi20150317Brian P. Walenz A src/pipelines/canu.pl bnbi20150316Brian P. Walenz A src/pipelines/canu.pl bnbi20150310Brian P. Walenz A src/pipelines/canu.pl bnbi20150306Brian P. Walenz A src/pipelines/canu.pl bnbi20150306Brian P. Walenz A src/pipelines/canu.pl bnbi20150303Brian P. Walenz A src/pipelines/canu.pl bnbi20150227Brian P. Walenz D src/pipelines/canu.pl src/pipelines/ca3g.pl A src/pipelines/canu/OverlapInCore.pm nihh20151119Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20151109Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20151108Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20151104Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20151103Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20151027Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20151019Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150825Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150825Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150616Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150616Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150514Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150507Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150424Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150421Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150420Brian P. 
Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150409Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150327Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150327Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150317Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150310Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150306Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150303Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm bnbi20150227Brian P. Walenz D src/pipelines/canu/OverlapInCore.pm src/pipelines/ca3g/OverlapInCore.pm A src/pipelines/canu/HTML.pm nihh20151118Brian P. Walenz A src/pipelines/canu/HTML.pm nihh20151109Brian P. Walenz A src/pipelines/canu/HTML.pm nihh20151109Brian P. Walenz A src/pipelines/canu/HTML.pm nihh20151108Brian P. Walenz A src/pipelines/canu/Output.pm nihh20151118Brian P. Walenz A src/pipelines/canu/Output.pm nihh20151109Brian P. Walenz A src/pipelines/canu/Output.pm nihh20151108Brian P. Walenz A src/pipelines/canu/Output.pm nihh20151105Brian P. Walenz A src/pipelines/canu/Output.pm nihh20151104Brian P. Walenz A src/pipelines/canu/Output.pm nihh20151104Brian P. Walenz A src/pipelines/canu/Output.pm nihh20151103Brian P. Walenz A src/pipelines/canu/Output.pm nihh20151102Brian P. Walenz A src/pipelines/canu/Output.pm bnbi20150825Brian P. Walenz A src/pipelines/canu/Output.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/Output.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/Output.pm bnbi20150507Brian P. Walenz A src/pipelines/canu/Output.pm bnbi20150316Brian P. Walenz D src/pipelines/canu/Output.pm src/pipelines/ca3g/Output.pm A src/pipelines/canu/Defaults.pm nihh20151119Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151118Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151113Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151108Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151104Brian P. 
Walenz A src/pipelines/canu/Defaults.pm nihh20151103Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151022Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151021Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150921Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150911Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150903Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150826Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150825Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150825Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150809Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150729Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150701Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150616Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150615Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150610Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150603Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150529Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150529Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150529Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150528Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150526Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150514Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150514Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150514Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150509Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150508Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150507Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150429Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150424Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150424Brian P. 
Walenz A src/pipelines/canu/Defaults.pm bnbi20150421Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150420Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150410Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150409Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150327Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150327Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150327Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150325Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150317Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150316Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150310Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150306Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150303Brian P. Walenz A src/pipelines/canu/Defaults.pm bnbi20150227Brian P. Walenz D src/pipelines/canu/Defaults.pm src/pipelines/ca3g/Defaults.pm A src/pipelines/canu/OverlapStore.pm nihh20151119Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20151118Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20151113Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20151109Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20151108Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20151104Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20151104Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20151103Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20151029Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20151010Brian P. Walenz A src/pipelines/canu/OverlapStore.pm bnbi20150921Brian P. Walenz A src/pipelines/canu/OverlapStore.pm bnbi20150825Brian P. Walenz A src/pipelines/canu/OverlapStore.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/OverlapStore.pm bnbi20150616Brian P. Walenz A src/pipelines/canu/OverlapStore.pm bnbi20150507Brian P. Walenz A src/pipelines/canu/OverlapStore.pm bnbi20150424Brian P. 
Walenz A src/pipelines/canu/OverlapStore.pm bnbi20150409Brian P. Walenz A src/pipelines/canu/OverlapStore.pm bnbi20150317Brian P. Walenz A src/pipelines/canu/OverlapStore.pm bnbi20150310Brian P. Walenz A src/pipelines/canu/OverlapStore.pm bnbi20150310Brian P. Walenz A src/pipelines/canu/OverlapStore.pm bnbi20150227Brian P. Walenz D src/pipelines/canu/OverlapStore.pm src/pipelines/ca3g/OverlapStore.pm A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20151119Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20151113Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20151109Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20151108Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20151105Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20151104Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20151103Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150921Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150825Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150616Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150528Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150526Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150514Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150507Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150424Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150420Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150409Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150310Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150306Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150306Brian P. 
Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150303Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150303Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150303Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm bnbi20150227Brian P. Walenz D src/pipelines/canu/OverlapErrorAdjustment.pm src/pipelines/ca3g/OverlapErrorAdjustment.pm A src/pipelines/canu/OverlapBasedTrimming.pm nihh20151113Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm nihh20151109Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm nihh20151108Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm nihh20151104Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150825Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150720Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150526Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150514Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150507Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150421Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150420Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150415Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150415Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150415Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150409Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150325Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150321Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150317Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150316Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm bnbi20150316Brian P. 
Walenz D src/pipelines/canu/OverlapBasedTrimming.pm src/pipelines/ca3g/OverlapBasedTrimming.pm A src/pipelines/canu/OverlapMhap.pm nihh20151119Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20151109Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20151108Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20151104Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20151103Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150921Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150825Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150616Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150615Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150526Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150514Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150507Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150429Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150424Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150421Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150410Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150409Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150409Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150409Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150327Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm bnbi20150327Brian P. Walenz D src/pipelines/canu/OverlapMhap.pm src/pipelines/ca3g/OverlapMhap.pm A src/pipelines/canu/Execution.pm nihh20151119Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20151119Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20151113Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20151113Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20151113Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20151108Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20151104Brian P. 
Walenz A src/pipelines/canu/Execution.pm nihh20151103Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20151103Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150911Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150828Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150723Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150616Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150616Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150615Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150529Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150526Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150514Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150514Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150507Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150424Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150424Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150421Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150420Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150409Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150327Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150317Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150310Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150303Brian P. Walenz A src/pipelines/canu/Execution.pm bnbi20150227Brian P. Walenz A src/pipelines/canu/Execution.pm jcvi20111228Brian P. Walenz A src/pipelines/canu/Execution.pm jcvi20090606Brian P. Walenz A src/pipelines/canu/Execution.pm jcvi20080925Brian P. Walenz A src/pipelines/canu/Execution.pm jcvi20060407Brian P. Walenz A src/pipelines/canu/Execution.pm none20040322Brian P. Walenz A src/pipelines/canu/Execution.pm none20040322Brian P. Walenz A src/pipelines/canu/Execution.pm none20040322Brian P. 
Walenz A src/pipelines/canu/Execution.pm craa20031111Brian P. Walenz A src/pipelines/canu/Execution.pm craa20030807Brian P. Walenz A src/pipelines/canu/Execution.pm craa20030729Brian P. Walenz A src/pipelines/canu/Execution.pm craa20030103Brian P. Walenz D src/pipelines/canu/Execution.pm kmer/ESTmapper/scheduler.pm D src/pipelines/canu/Execution.pm kmer/scripts/libBri.pm D src/pipelines/canu/Execution.pm kmer/scripts/scheduler.pm D src/pipelines/canu/Execution.pm src/pipelines/ca3g/Execution.pm A src/pipelines/canu/Meryl.pm nihh20151119Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20151109Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20151109Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20151108Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20151104Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20151103Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150825Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150616Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150615Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150610Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150529Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150514Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150507Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150424Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150421Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150410Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150409Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150327Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150310Brian P. Walenz A src/pipelines/canu/Meryl.pm bnbi20150227Brian P. Walenz D src/pipelines/canu/Meryl.pm src/pipelines/ca3g/Meryl.pm A src/pipelines/canu/Consensus.pm nihh20151119Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20151113Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20151113Brian P. 
Walenz A src/pipelines/canu/Consensus.pm nihh20151109Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20151108Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20151105Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20151104Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20151104Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20151103Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20151103Brian P. Walenz A src/pipelines/canu/Consensus.pm bnbi20150825Brian P. Walenz A src/pipelines/canu/Consensus.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/Consensus.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/Consensus.pm bnbi20150728Brian P. Walenz A src/pipelines/canu/Consensus.pm bnbi20150617Brian P. Walenz A src/pipelines/canu/Consensus.pm bnbi20150616Brian P. Walenz A src/pipelines/canu/Consensus.pm bnbi20150616Brian P. Walenz A src/pipelines/canu/Consensus.pm bnbi20150507Brian P. Walenz A src/pipelines/canu/Consensus.pm bnbi20150429Brian P. Walenz A src/pipelines/canu/Consensus.pm bnbi20150424Brian P. Walenz A src/pipelines/canu/Consensus.pm bnbi20150409Brian P. Walenz A src/pipelines/canu/Consensus.pm bnbi20150310Brian P. Walenz A src/pipelines/canu/Consensus.pm bnbi20150306Brian P. Walenz A src/pipelines/canu/Consensus.pm bnbi20150306Brian P. Walenz D src/pipelines/canu/Consensus.pm src/pipelines/ca3g/Consensus.pm A src/pipelines/canu/Unitig.pm nihh20151119Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20151109Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20151108Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20151105Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20151104Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20151102Brian P. Walenz A src/pipelines/canu/Unitig.pm bnbi20150826Brian P. Walenz A src/pipelines/canu/Unitig.pm bnbi20150825Brian P. Walenz A src/pipelines/canu/Unitig.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/Unitig.pm bnbi20150814Brian P. 
Walenz A src/pipelines/canu/Unitig.pm bnbi20150811Brian P. Walenz A src/pipelines/canu/Unitig.pm bnbi20150807Brian P. Walenz A src/pipelines/canu/Unitig.pm bnbi20150805Brian P. Walenz A src/pipelines/canu/Unitig.pm bnbi20150803Brian P. Walenz A src/pipelines/canu/Unitig.pm bnbi20150616Brian P. Walenz A src/pipelines/canu/Unitig.pm bnbi20150507Brian P. Walenz A src/pipelines/canu/Unitig.pm bnbi20150429Brian P. Walenz A src/pipelines/canu/Unitig.pm bnbi20150227Brian P. Walenz D src/pipelines/canu/Unitig.pm src/pipelines/ca3g/Unitig.pm A src/pipelines/canu/CorrectReads.pm nihh20151119Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20151117Sergey Koren A src/pipelines/canu/CorrectReads.pm nihh20151117Sergey Koren A src/pipelines/canu/CorrectReads.pm nihh20151117Sergey Koren A src/pipelines/canu/CorrectReads.pm nihh20151109Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20151108Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20151104Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20151103Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20151103Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20151021Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20151019Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150903Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150903Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150825Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150809Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150729Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150616Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150615Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150610Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150605Brian P. 
Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150528Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150526Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150514Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150509Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150507Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150429Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150424Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150421Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150420Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150415Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150410Brian P. Walenz A src/pipelines/canu/CorrectReads.pm bnbi20150409Brian P. Walenz D src/pipelines/canu/CorrectReads.pm src/pipelines/ca3g/CorrectReads.pm A src/pipelines/canu/Gatekeeper.pm nihh20151109Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20151109Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20151108Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20151104Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20151027Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm bnbi20150921Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm bnbi20150825Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm bnbi20150814Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm bnbi20150616Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm bnbi20150507Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm bnbi20150429Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm bnbi20150420Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm bnbi20150410Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm bnbi20150409Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm bnbi20150227Brian P. Walenz D src/pipelines/canu/Gatekeeper.pm src/pipelines/ca3g/Gatekeeper.pm A src/pipelines/sanity/sanity-update-reference.pl nihh20151012Brian P. 
Walenz A src/pipelines/sanity/sanity-asm-done.pl nihh20151012Brian P. Walenz A src/pipelines/sanity/sanity-all-done.pl nihh20151012Brian P. Walenz A src/pipelines/sanity/build-all-wgs-revisions.pl nihh20151012Brian P. Walenz A src/pipelines/sanity/build-all-kmer-revisions.pl nihh20151012Brian P. Walenz A src/pipelines/sanity/sanity-purge-old.pl nihh20151012Brian P. Walenz A src/pipelines/sanity/compile-all-wgs-revisions.pl nihh20151012Brian P. Walenz A src/pipelines/sanity/sanity.pl nihh20151012Brian P. Walenz A src/pipelines/sanity/sanity-merge-qc.pl nihh20151012Brian P. Walenz A src/pipelines/sanity/sanity-get-next-date.pl nihh20151012Brian P. Walenz A src/meryl/compare-counts.C bnbi20141205Brian P. Walenz A src/meryl/compare-counts.C jcvi20140411Brian P. Walenz A src/meryl/compare-counts.C jcvi20121113Brian P. Walenz A src/meryl/compare-counts.C jcvi20120118Brian P. Walenz A src/meryl/compare-counts.C jcvi20100828Brian P. Walenz D src/meryl/compare-counts.C kmer/meryl/compare-counts.C A src/meryl/meryl-binaryOp.C bnbi20141205Brian P. Walenz A src/meryl/meryl-binaryOp.C bnbi20141205Brian P. Walenz A src/meryl/meryl-binaryOp.C jcvi20140411Brian P. Walenz A src/meryl/meryl-binaryOp.C jcvi20090804Brian P. Walenz A src/meryl/meryl-binaryOp.C jcvi20080609Brian P. Walenz A src/meryl/meryl-binaryOp.C jcvi20071012Brian P. Walenz A src/meryl/meryl-binaryOp.C jcvi20050523Brian P. Walenz A src/meryl/meryl-binaryOp.C none20041010Brian P. Walenz A src/meryl/meryl-binaryOp.C craa20040407Brian P. Walenz A src/meryl/meryl-binaryOp.C craa20040405Brian P. Walenz A src/meryl/meryl-binaryOp.C craa20030131Brian P. Walenz A src/meryl/meryl-binaryOp.C craa20030102Brian P. Walenz D src/meryl/meryl-binaryOp.C kmer/meryl/binaryOp.C A src/meryl/leaff-partition.C bnbi20141208Brian P. Walenz A src/meryl/leaff-partition.C jcvi20140411Brian P. Walenz A src/meryl/leaff-partition.C jcvi20100831Brian P. Walenz A src/meryl/leaff-partition.C jcvi20090207Brian P. 
Walenz D src/meryl/leaff-partition.C kmer/leaff/partition.C A src/meryl/mapMers.C bnbi20150722Brian P. Walenz A src/meryl/mapMers.C bnbi20141222Brian P. Walenz A src/meryl/mapMers.C bnbi20141205Brian P. Walenz A src/meryl/mapMers.C jcvi20140411Brian P. Walenz A src/meryl/mapMers.C jcvi20130329Brian P. Walenz A src/meryl/mapMers.C jcvi20130308Brian P. Walenz A src/meryl/mapMers.C jcvi20080911Brian P. Walenz A src/meryl/mapMers.C jcvi20080909Brian P. Walenz A src/meryl/mapMers.C jcvi20071029Brian P. Walenz A src/meryl/mapMers.C jcvi20071010Brian P. Walenz A src/meryl/mapMers.C jcvi20070913Brian P. Walenz A src/meryl/mapMers.C jcvi20070608Brian P. Walenz A src/meryl/mapMers.C jcvi20070607Brian P. Walenz A src/meryl/mapMers.C jcvi20070326Brian P. Walenz A src/meryl/mapMers.C jcvi20070320Brian P. Walenz A src/meryl/mapMers.C jcvi20061023Brian P. Walenz A src/meryl/mapMers.C jcvi20061018Brian P. Walenz D src/meryl/mapMers.C kmer/meryl/mapMers.C A src/meryl/libmeryl.H bnbi20150616Brian P. Walenz A src/meryl/libmeryl.H bnbi20141205Brian P. Walenz A src/meryl/libmeryl.H bnbi20141205Brian P. Walenz A src/meryl/libmeryl.H jcvi20140411Brian P. Walenz A src/meryl/libmeryl.H jcvi20120227Brian P. Walenz A src/meryl/libmeryl.H jcvi20080609Brian P. Walenz A src/meryl/libmeryl.H jcvi20080331Brian P. Walenz A src/meryl/libmeryl.H jcvi20080318Brian P. Walenz A src/meryl/libmeryl.H jcvi20080229Brian P. Walenz A src/meryl/libmeryl.H jcvi20080214Brian P. Walenz A src/meryl/libmeryl.H jcvi20071012Brian P. Walenz A src/meryl/libmeryl.H jcvi20051204Brian P. Walenz A src/meryl/libmeryl.H jcvi20050531Brian P. Walenz A src/meryl/libmeryl.H jcvi20050531Brian P. Walenz A src/meryl/libmeryl.H jcvi20050523Brian P. Walenz A src/meryl/libmeryl.H none20041010Brian P. Walenz A src/meryl/libmeryl.H none20040511Brian P. Walenz A src/meryl/libmeryl.H none20040430Brian P. Walenz A src/meryl/libmeryl.H none20040421Brian P. Walenz A src/meryl/libmeryl.H craa20040408Brian P. 
Walenz A src/meryl/libmeryl.H craa20040407Brian P. Walenz A src/meryl/libmeryl.H craa20040106Brian P. Walenz A src/meryl/libmeryl.H craa20031021Brian P. Walenz A src/meryl/libmeryl.H craa20030908Brian P. Walenz A src/meryl/libmeryl.H craa20030908Brian P. Walenz D src/meryl/libmeryl.H kmer/libmeryl/libmeryl.H A src/meryl/estimate-mer-threshold.C bnbi20141208Brian P. Walenz A src/meryl/estimate-mer-threshold.C bnbi20141205Brian P. Walenz A src/meryl/estimate-mer-threshold.C jcvi20140411Brian P. Walenz A src/meryl/estimate-mer-threshold.C jcvi20130801Brian P. Walenz A src/meryl/estimate-mer-threshold.C jcvi20130801Brian P. Walenz A src/meryl/estimate-mer-threshold.C jcvi20120306Brian P. Walenz A src/meryl/estimate-mer-threshold.C jcvi20100831Brian P. Walenz A src/meryl/estimate-mer-threshold.C jcvi20090610Brian P. Walenz A src/meryl/estimate-mer-threshold.C jcvi20081211Brian P. Walenz D src/meryl/estimate-mer-threshold.C src/AS_MER/estimate-mer-threshold.C A src/meryl/meryl-merge.C bnbi20141208Brian P. Walenz A src/meryl/meryl-merge.C bnbi20141205Brian P. Walenz A src/meryl/meryl-merge.C jcvi20140411Brian P. Walenz A src/meryl/meryl-merge.C jcvi20090807Brian P. Walenz A src/meryl/meryl-merge.C jcvi20090807Brian P. Walenz A src/meryl/meryl-merge.C jcvi20080609Brian P. Walenz A src/meryl/meryl-merge.C jcvi20080331Brian P. Walenz A src/meryl/meryl-merge.C jcvi20071012Brian P. Walenz A src/meryl/meryl-merge.C jcvi20070913Brian P. Walenz A src/meryl/meryl-merge.C jcvi20060514Brian P. Walenz A src/meryl/meryl-merge.C jcvi20050531Brian P. Walenz A src/meryl/meryl-merge.C jcvi20050523Brian P. Walenz A src/meryl/meryl-merge.C none20041010Brian P. Walenz A src/meryl/meryl-merge.C none20040729Brian P. Walenz A src/meryl/meryl-merge.C none20040726Brian P. Walenz A src/meryl/meryl-merge.C none20040409Brian P. Walenz A src/meryl/meryl-merge.C craa20040408Brian P. Walenz A src/meryl/meryl-merge.C craa20040407Brian P. Walenz A src/meryl/meryl-merge.C craa20040405Brian P. 
Walenz A src/meryl/meryl-merge.C craa20030131Brian P. Walenz A src/meryl/meryl-merge.C craa20030102Brian P. Walenz D src/meryl/meryl-merge.C kmer/meryl/merge.C A src/meryl/meryl.C bnbi20141208Brian P. Walenz A src/meryl/meryl.C bnbi20141205Brian P. Walenz A src/meryl/meryl.C jcvi20090807Brian P. Walenz A src/meryl/meryl.C jcvi20090804Brian P. Walenz A src/meryl/meryl.C jcvi20080909Brian P. Walenz A src/meryl/meryl.C jcvi20080609Brian P. Walenz A src/meryl/meryl.C jcvi20080318Brian P. Walenz A src/meryl/meryl.C jcvi20070608Brian P. Walenz A src/meryl/meryl.C jcvi20050523Brian P. Walenz A src/meryl/meryl.C none20041010Brian P. Walenz A src/meryl/meryl.C none20040726Brian P. Walenz A src/meryl/meryl.C none20040421Brian P. Walenz A src/meryl/meryl.C none20040414Brian P. Walenz A src/meryl/meryl.C none20040413Brian P. Walenz A src/meryl/meryl.C none20040413Brian P. Walenz A src/meryl/meryl.C craa20040407Brian P. Walenz A src/meryl/meryl.C craa20040405Brian P. Walenz A src/meryl/meryl.C none20040325Brian P. Walenz A src/meryl/meryl.C craa20031104Brian P. Walenz A src/meryl/meryl.C craa20030908Brian P. Walenz A src/meryl/meryl.C craa20030106Brian P. Walenz A src/meryl/meryl.C craa20030102Brian P. Walenz D src/meryl/meryl.C kmer/meryl/meryl.C A src/meryl/leaff-duplicates.C bnbi20141208Brian P. Walenz A src/meryl/leaff-duplicates.C jcvi20140411Brian P. Walenz A src/meryl/leaff-duplicates.C jcvi20090207Brian P. Walenz D src/meryl/leaff-duplicates.C kmer/leaff/dups.C A src/meryl/meryl-estimate.C bnbi20150529Brian P. Walenz A src/meryl/meryl-estimate.C bnbi20141208Brian P. Walenz A src/meryl/meryl-estimate.C bnbi20141205Brian P. Walenz A src/meryl/meryl-estimate.C jcvi20140411Brian P. Walenz A src/meryl/meryl-estimate.C jcvi20080909Brian P. Walenz A src/meryl/meryl-estimate.C jcvi20080609Brian P. Walenz A src/meryl/meryl-estimate.C jcvi20080401Brian P. Walenz A src/meryl/meryl-estimate.C jcvi20071012Brian P. Walenz A src/meryl/meryl-estimate.C jcvi20070913Brian P. 
Walenz A src/meryl/meryl-estimate.C jcvi20070817Brian P. Walenz A src/meryl/meryl-estimate.C jcvi20070326Brian P. Walenz A src/meryl/meryl-estimate.C jcvi20060515Brian P. Walenz A src/meryl/meryl-estimate.C jcvi20050531Brian P. Walenz A src/meryl/meryl-estimate.C jcvi20050523Brian P. Walenz A src/meryl/meryl-estimate.C jcvi20050316Brian P. Walenz A src/meryl/meryl-estimate.C none20041010Brian P. Walenz A src/meryl/meryl-estimate.C none20040524Brian P. Walenz A src/meryl/meryl-estimate.C none20040511Brian P. Walenz A src/meryl/meryl-estimate.C none20040421Brian P. Walenz A src/meryl/meryl-estimate.C none20040413Brian P. Walenz A src/meryl/meryl-estimate.C craa20040408Brian P. Walenz A src/meryl/meryl-estimate.C craa20040407Brian P. Walenz A src/meryl/meryl-estimate.C craa20040405Brian P. Walenz A src/meryl/meryl-estimate.C none20040325Brian P. Walenz A src/meryl/meryl-estimate.C none20040323Brian P. Walenz A src/meryl/meryl-estimate.C craa20040212Clark Mobarry A src/meryl/meryl-estimate.C craa20030506Brian P. Walenz A src/meryl/meryl-estimate.C craa20030102Brian P. Walenz D src/meryl/meryl-estimate.C kmer/meryl/estimate.C A src/meryl/meryl-dump.C bnbi20141208Brian P. Walenz A src/meryl/meryl-dump.C bnbi20141205Brian P. Walenz A src/meryl/meryl-dump.C jcvi20140411Brian P. Walenz A src/meryl/meryl-dump.C jcvi20080609Brian P. Walenz A src/meryl/meryl-dump.C jcvi20080331Brian P. Walenz A src/meryl/meryl-dump.C jcvi20080318Brian P. Walenz A src/meryl/meryl-dump.C jcvi20080229Brian P. Walenz A src/meryl/meryl-dump.C jcvi20070608Brian P. Walenz A src/meryl/meryl-dump.C jcvi20070608Brian P. Walenz A src/meryl/meryl-dump.C jcvi20070606Brian P. Walenz A src/meryl/meryl-dump.C jcvi20051204Brian P. Walenz A src/meryl/meryl-dump.C jcvi20050712Brian P. Walenz A src/meryl/meryl-dump.C jcvi20050523Brian P. Walenz A src/meryl/meryl-dump.C none20041010Brian P. Walenz A src/meryl/meryl-dump.C none20040412Brian P. Walenz A src/meryl/meryl-dump.C craa20040407Brian P. 
Walenz A src/meryl/meryl-dump.C craa20040405Brian P. Walenz A src/meryl/meryl-dump.C craa20031021Brian P. Walenz A src/meryl/meryl-dump.C craa20030908Brian P. Walenz A src/meryl/meryl-dump.C craa20030618Brian P. Walenz A src/meryl/meryl-dump.C craa20030131Brian P. Walenz A src/meryl/meryl-dump.C craa20030102Brian P. Walenz D src/meryl/meryl-dump.C kmer/meryl/dump.C A src/meryl/leaff-statistics.C bnbi20150321Brian P. Walenz A src/meryl/leaff-statistics.C bnbi20141208Brian P. Walenz A src/meryl/leaff-statistics.C jcvi20140411Brian P. Walenz A src/meryl/leaff-statistics.C jcvi20130308Brian P. Walenz A src/meryl/leaff-statistics.C jcvi20120718Brian P. Walenz A src/meryl/leaff-statistics.C jcvi20090207Brian P. Walenz D src/meryl/leaff-statistics.C kmer/leaff/stats.C A src/meryl/libleaff/seqStore.C nihh20151029Brian P. Walenz A src/meryl/libleaff/seqStore.C bnbi20141208Brian P. Walenz A src/meryl/libleaff/seqFactory.C bnbi20150204Brian P. Walenz A src/meryl/libleaff/fastqStdin.C nihh20151029Brian P. Walenz A src/meryl/libleaff/fastqStdin.C bnbi20141208Brian P. Walenz A src/meryl/libleaff/fastaStdin.H bnbi20141208Brian P. Walenz A src/meryl/libleaff/fastaFile.C bnbi20141208Brian P. Walenz A src/meryl/libleaff/seqCache.H bnbi20141208Brian P. Walenz A src/meryl/libleaff/gkStoreFile.H nihh20151029Brian P. Walenz A src/meryl/libleaff/gkStoreFile.H bnbi20150317Brian P. Walenz A src/meryl/libleaff/gkStoreFile.H bnbi20150204Brian P. Walenz A src/meryl/libleaff/fastqFile.H bnbi20141208Brian P. Walenz A src/meryl/libleaff/seqStream.C bnbi20141208Brian P. Walenz A src/meryl/libleaff/seqFile.H nihh20151029Brian P. Walenz A src/meryl/libleaff/seqFile.H bnbi20141208Brian P. Walenz A src/meryl/libleaff/sffFile.C bnbi20141208Brian P. Walenz A src/meryl/libleaff/fastqFile.C bnbi20141208Brian P. Walenz A src/meryl/libleaff/seqCache.C bnbi20141222Brian P. Walenz A src/meryl/libleaff/seqCache.C bnbi20141208Brian P. Walenz A src/meryl/libleaff/gkStoreFile.C bnbi20150814Brian P. 
Walenz A src/meryl/libleaff/gkStoreFile.C bnbi20150317Brian P. Walenz A src/meryl/libleaff/gkStoreFile.C bnbi20150227Brian P. Walenz A src/meryl/libleaff/gkStoreFile.C bnbi20150204Brian P. Walenz A src/meryl/libleaff/fastaStdin.C bnbi20141208Brian P. Walenz A src/meryl/libleaff/fastaFile.H bnbi20141208Brian P. Walenz A src/meryl/libleaff/seqStore.H bnbi20141208Brian P. Walenz A src/meryl/libleaff/seqFactory.H bnbi20141208Brian P. Walenz A src/meryl/libleaff/merStream.H bnbi20141208Brian P. Walenz A src/meryl/libleaff/fastqStdin.H bnbi20141208Brian P. Walenz A src/meryl/libleaff/sffFile.H bnbi20141208Brian P. Walenz A src/meryl/libleaff/seqStream.H bnbi20141208Brian P. Walenz A src/meryl/meryl-merge-qsort.C bnbi20141205Brian P. Walenz A src/meryl/meryl-merge-qsort.C jcvi20140411Brian P. Walenz A src/meryl/meryl-merge-qsort.C jcvi20080620Brian P. Walenz D src/meryl/meryl-merge-qsort.C kmer/meryl/merge.qsort.C A src/meryl/meryl-build-threads.C bnbi20141205Brian P. Walenz A src/meryl/meryl-build-threads.C bnbi20141205Brian P. Walenz A src/meryl/meryl-build-threads.C jcvi20140411Brian P. Walenz A src/meryl/meryl-build-threads.C jcvi20061109Brian P. Walenz A src/meryl/meryl-build-threads.C jcvi20060514Brian P. Walenz A src/meryl/meryl-build-threads.C none20041010Brian P. Walenz A src/meryl/meryl-build-threads.C none20040421Brian P. Walenz A src/meryl/meryl-build-threads.C none20040413Brian P. Walenz A src/meryl/meryl-build-threads.C none20040413Brian P. Walenz D src/meryl/meryl-build-threads.C kmer/meryl/build-threads.C A src/meryl/leaff.C bnbi20150113Brian P. Walenz A src/meryl/leaff.C bnbi20141222Brian P. Walenz A src/meryl/leaff.C bnbi20141208Brian P. Walenz A src/meryl/leaff.C bnbi20141007Brian P. Walenz A src/meryl/leaff.C bnbi20140822Brian P. Walenz A src/meryl/leaff.C jcvi20140411Brian P. Walenz A src/meryl/leaff.C jcvi20120718Brian P. Walenz A src/meryl/leaff.C none20111116Liliana Florea A src/meryl/leaff.C jcvi20110330Brian P. 
Walenz A src/meryl/leaff.C jcvi20090707Brian P. Walenz A src/meryl/leaff.C jcvi20090613Brian P. Walenz A src/meryl/leaff.C jcvi20090416Brian P. Walenz A src/meryl/leaff.C jcvi20090413Brian P. Walenz A src/meryl/leaff.C jcvi20090207Brian P. Walenz A src/meryl/leaff.C jcvi20090128Brian P. Walenz A src/meryl/leaff.C jcvi20081023Brian P. Walenz A src/meryl/leaff.C jcvi20081023Brian P. Walenz A src/meryl/leaff.C jcvi20080922Brian P. Walenz A src/meryl/leaff.C jcvi20080912Brian P. Walenz A src/meryl/leaff.C jcvi20080909Brian P. Walenz A src/meryl/leaff.C jcvi20080626Brian P. Walenz A src/meryl/leaff.C jcvi20080508Brian P. Walenz A src/meryl/leaff.C jcvi20071220Brian P. Walenz A src/meryl/leaff.C jcvi20070407Brian P. Walenz A src/meryl/leaff.C jcvi20070326Brian P. Walenz A src/meryl/leaff.C jcvi20070320Brian P. Walenz A src/meryl/leaff.C jcvi20070320Brian P. Walenz A src/meryl/leaff.C jcvi20061220Brian P. Walenz A src/meryl/leaff.C jcvi20061211Brian P. Walenz A src/meryl/leaff.C jcvi20061022Brian P. Walenz A src/meryl/leaff.C jcvi20061012Brian P. Walenz A src/meryl/leaff.C jcvi20060808Brian P. Walenz A src/meryl/leaff.C jcvi20060728Brian P. Walenz A src/meryl/leaff.C jcvi20060718Brian P. Walenz A src/meryl/leaff.C jcvi20060706Brian P. Walenz A src/meryl/leaff.C jcvi20060407Brian P. Walenz A src/meryl/leaff.C jcvi20060228Brian P. Walenz A src/meryl/leaff.C jcvi20060207Brian P. Walenz A src/meryl/leaff.C jcvi20051211Brian P. Walenz A src/meryl/leaff.C jcvi20051012Brian P. Walenz A src/meryl/leaff.C jcvi20050916Brian P. Walenz A src/meryl/leaff.C jcvi20050727Brian P. Walenz A src/meryl/leaff.C jcvi20050522Brian P. Walenz A src/meryl/leaff.C jcvi20050506Brian P. Walenz A src/meryl/leaff.C jcvi20050323Brian P. Walenz A src/meryl/leaff.C jcvi20050306Brian P. Walenz A src/meryl/leaff.C none20041010Brian P. Walenz A src/meryl/leaff.C none20041003Brian P. Walenz A src/meryl/leaff.C none20040715Brian P. Walenz A src/meryl/leaff.C none20040624Brian P. 
Walenz A src/meryl/leaff.C none20040524Brian P. Walenz A src/meryl/leaff.C none20040421Brian P. Walenz A src/meryl/leaff.C none20040421Brian P. Walenz A src/meryl/leaff.C none20040312Brian P. Walenz A src/meryl/leaff.C none20040311Brian P. Walenz A src/meryl/leaff.C none20040220Brian P. Walenz A src/meryl/leaff.C craa20031014Brian P. Walenz A src/meryl/leaff.C craa20030926Brian P. Walenz A src/meryl/leaff.C craa20030805Brian P. Walenz A src/meryl/leaff.C craa20030803Brian P. Walenz A src/meryl/leaff.C craa20030721Brian P. Walenz A src/meryl/leaff.C craa20030711Brian P. Walenz A src/meryl/leaff.C craa20030710Brian P. Walenz A src/meryl/leaff.C craa20030709Brian P. Walenz A src/meryl/leaff.C craa20030514Brian P. Walenz A src/meryl/leaff.C craa20030508Brian P. Walenz A src/meryl/leaff.C craa20030502Brian P. Walenz A src/meryl/leaff.C craa20030422Brian P. Walenz A src/meryl/leaff.C craa20030415Brian P. Walenz A src/meryl/leaff.C craa20030326Brian P. Walenz A src/meryl/leaff.C craa20030310Brian P. Walenz A src/meryl/leaff.C craa20030106Brian P. Walenz A src/meryl/leaff.C craa20030102Brian P. Walenz D src/meryl/leaff.C kmer/leaff/leaff.C A src/meryl/leaff-blocks.C bnbi20141208Brian P. Walenz A src/meryl/leaff-blocks.C jcvi20140411Brian P. Walenz A src/meryl/leaff-blocks.C jcvi20090207Brian P. Walenz D src/meryl/leaff-blocks.C kmer/leaff/blocks.C A src/meryl/meryl.H bnbi20150529Brian P. Walenz A src/meryl/meryl.H bnbi20141208Brian P. Walenz A src/meryl/meryl.H bnbi20141205Brian P. Walenz A src/meryl/meryl.H bnbi20141205Brian P. Walenz A src/meryl/meryl.H jcvi20140411Brian P. Walenz A src/meryl/meryl.H jcvi20110330Brian P. Walenz A src/meryl/meryl.H jcvi20100115Brian P. Walenz A src/meryl/meryl.H jcvi20090807Brian P. Walenz A src/meryl/meryl.H jcvi20090807Brian P. Walenz A src/meryl/meryl.H jcvi20090804Brian P. Walenz A src/meryl/meryl.H jcvi20080909Brian P. Walenz A src/meryl/meryl.H jcvi20080609Brian P. Walenz A src/meryl/meryl.H jcvi20080401Brian P. 
Walenz A src/meryl/meryl.H jcvi20080331Brian P. Walenz A src/meryl/meryl.H jcvi20080318Brian P. Walenz A src/meryl/meryl.H jcvi20071012Brian P. Walenz A src/meryl/meryl.H jcvi20061023Brian P. Walenz A src/meryl/meryl.H jcvi20060515Brian P. Walenz A src/meryl/meryl.H jcvi20060514Brian P. Walenz A src/meryl/meryl.H jcvi20050531Brian P. Walenz A src/meryl/meryl.H jcvi20050523Brian P. Walenz A src/meryl/meryl.H none20041010Brian P. Walenz A src/meryl/meryl.H none20040726Brian P. Walenz A src/meryl/meryl.H none20040421Brian P. Walenz A src/meryl/meryl.H none20040414Brian P. Walenz A src/meryl/meryl.H none20040413Brian P. Walenz A src/meryl/meryl.H craa20040407Brian P. Walenz A src/meryl/meryl.H craa20040405Brian P. Walenz A src/meryl/meryl.H none20040325Brian P. Walenz A src/meryl/meryl.H craa20030908Brian P. Walenz A src/meryl/meryl.H craa20030102Brian P. Walenz D src/meryl/meryl.H kmer/meryl/meryl.H A src/meryl/leaff-simulate.C nihh20151012Brian P. Walenz A src/meryl/leaff-simulate.C bnbi20141208Brian P. Walenz A src/meryl/leaff-simulate.C jcvi20090613Brian P. Walenz A src/meryl/leaff-simulate.C jcvi20080922Brian P. Walenz A src/meryl/leaff-simulate.C none20041010Brian P. Walenz A src/meryl/leaff-simulate.C none20040715Brian P. Walenz A src/meryl/leaff-simulate.C none20040629Brian P. Walenz A src/meryl/leaff-simulate.C none20040624Brian P. Walenz D src/meryl/leaff-simulate.C kmer/leaff/simseq.C A src/meryl/gkrpt.pl nihh20151012Brian P. Walenz A src/meryl/gkrpt.pl nihh20151012Brian P. Walenz A src/meryl/gkrpt.pl bnbi20150410Brian P. Walenz A src/meryl/gkrpt.pl jcvi20130801Brian P. Walenz A src/meryl/gkrpt.pl jcvi20080627Brian P. Walenz A src/meryl/gkrpt.pl jcvi20080103Brian P. Walenz D src/meryl/gkrpt.pl src/AS_MER/gkrpt.pl A src/meryl/meryl-args.C bnbi20150529Brian P. Walenz A src/meryl/meryl-args.C bnbi20141222Brian P. Walenz A src/meryl/meryl-args.C bnbi20141205Brian P. Walenz A src/meryl/meryl-args.C bnbi20141205Brian P. 
Walenz A src/meryl/meryl-args.C jcvi20140411Brian P. Walenz A src/meryl/meryl-args.C jcvi20110330Brian P. Walenz A src/meryl/meryl-args.C jcvi20100331Brian P. Walenz A src/meryl/meryl-args.C jcvi20100115Brian P. Walenz A src/meryl/meryl-args.C jcvi20090807Brian P. Walenz A src/meryl/meryl-args.C jcvi20090804Brian P. Walenz A src/meryl/meryl-args.C jcvi20090724Brian P. Walenz A src/meryl/meryl-args.C jcvi20080925Brian P. Walenz A src/meryl/meryl-args.C jcvi20080909Brian P. Walenz A src/meryl/meryl-args.C jcvi20080609Brian P. Walenz A src/meryl/meryl-args.C jcvi20080401Brian P. Walenz A src/meryl/meryl-args.C jcvi20080331Brian P. Walenz A src/meryl/meryl-args.C jcvi20080318Brian P. Walenz A src/meryl/meryl-args.C jcvi20071012Brian P. Walenz A src/meryl/meryl-args.C jcvi20070608Brian P. Walenz A src/meryl/meryl-args.C jcvi20070407Brian P. Walenz A src/meryl/meryl-args.C jcvi20061218Brian P. Walenz A src/meryl/meryl-args.C jcvi20061109Brian P. Walenz A src/meryl/meryl-args.C jcvi20061105Brian P. Walenz A src/meryl/meryl-args.C jcvi20060527Brian P. Walenz A src/meryl/meryl-args.C jcvi20060515Brian P. Walenz A src/meryl/meryl-args.C jcvi20060514Brian P. Walenz A src/meryl/meryl-args.C jcvi20050531Brian P. Walenz A src/meryl/meryl-args.C jcvi20050523Brian P. Walenz A src/meryl/meryl-args.C jcvi20050320Brian P. Walenz A src/meryl/meryl-args.C none20041010Brian P. Walenz A src/meryl/meryl-args.C none20040726Brian P. Walenz A src/meryl/meryl-args.C craa20040701Brian P. Walenz A src/meryl/meryl-args.C none20040421Brian P. Walenz A src/meryl/meryl-args.C none20040414Brian P. Walenz A src/meryl/meryl-args.C none20040413Brian P. Walenz A src/meryl/meryl-args.C none20040412Brian P. Walenz A src/meryl/meryl-args.C craa20040407Brian P. Walenz A src/meryl/meryl-args.C craa20040405Brian P. Walenz A src/meryl/meryl-args.C none20040331Brian P. Walenz D src/meryl/meryl-args.C kmer/meryl/args.C A src/meryl/libmeryl.C nihh20151029Brian P. Walenz A src/meryl/libmeryl.C bnbi20150701Brian P. 
Walenz A src/meryl/libmeryl.C bnbi20150616Brian P. Walenz A src/meryl/libmeryl.C bnbi20141205Brian P. Walenz A src/meryl/libmeryl.C bnbi20141205Brian P. Walenz A src/meryl/libmeryl.C jcvi20140411Brian P. Walenz A src/meryl/libmeryl.C jcvi20120227Brian P. Walenz A src/meryl/libmeryl.C jcvi20080609Brian P. Walenz A src/meryl/libmeryl.C jcvi20080331Brian P. Walenz A src/meryl/libmeryl.C jcvi20080318Brian P. Walenz A src/meryl/libmeryl.C jcvi20080229Brian P. Walenz A src/meryl/libmeryl.C jcvi20071012Brian P. Walenz A src/meryl/libmeryl.C jcvi20070827Brian P. Walenz A src/meryl/libmeryl.C jcvi20060524Brian P. Walenz A src/meryl/libmeryl.C jcvi20050531Brian P. Walenz A src/meryl/libmeryl.C jcvi20050531Brian P. Walenz A src/meryl/libmeryl.C jcvi20050523Brian P. Walenz A src/meryl/libmeryl.C none20040430Brian P. Walenz A src/meryl/libmeryl.C none20040430Brian P. Walenz A src/meryl/libmeryl.C craa20040408Brian P. Walenz A src/meryl/libmeryl.C craa20040407Brian P. Walenz A src/meryl/libmeryl.C craa20040405Brian P. Walenz A src/meryl/libmeryl.C craa20040106Brian P. Walenz A src/meryl/libmeryl.C craa20031104Brian P. Walenz A src/meryl/libmeryl.C craa20030908Brian P. Walenz D src/meryl/libmeryl.C kmer/libmeryl/libmeryl.C A src/meryl/maskMers.C bnbi20141205Brian P. Walenz A src/meryl/maskMers.C jcvi20140411Brian P. Walenz A src/meryl/maskMers.C jcvi20080909Brian P. Walenz A src/meryl/maskMers.C jcvi20080430Brian P. Walenz A src/meryl/maskMers.C jcvi20080429Brian P. Walenz A src/meryl/maskMers.C jcvi20080429Brian P. Walenz A src/meryl/maskMers.C jcvi20080429Brian P. Walenz A src/meryl/maskMers.C jcvi20080429Brian P. Walenz A src/meryl/maskMers.C jcvi20080428Brian P. Walenz A src/meryl/maskMers.C jcvi20080421Brian P. Walenz A src/meryl/maskMers.C jcvi20080402Brian P. Walenz A src/meryl/maskMers.C jcvi20080402Brian P. Walenz A src/meryl/maskMers.C jcvi20080401Brian P. Walenz A src/meryl/maskMers.C jcvi20080331Brian P. Walenz A src/meryl/maskMers.C jcvi20080331Brian P. 
Walenz D src/meryl/maskMers.C kmer/meryl/maskMers.C A src/meryl/leaff-gc.C bnbi20141208Brian P. Walenz A src/meryl/leaff-gc.C jcvi20140411Brian P. Walenz A src/meryl/leaff-gc.C jcvi20090207Brian P. Walenz D src/meryl/leaff-gc.C kmer/leaff/gc.C A src/meryl/meryl-build.C bnbi20150701Brian P. Walenz A src/meryl/meryl-build.C bnbi20150529Brian P. Walenz A src/meryl/meryl-build.C bnbi20141208Brian P. Walenz A src/meryl/meryl-build.C bnbi20141205Brian P. Walenz A src/meryl/meryl-build.C bnbi20141205Brian P. Walenz A src/meryl/meryl-build.C jcvi20140411Brian P. Walenz A src/meryl/meryl-build.C jcvi20121015Brian P. Walenz A src/meryl/meryl-build.C jcvi20111214Brian P. Walenz A src/meryl/meryl-build.C jcvi20110330Brian P. Walenz A src/meryl/meryl-build.C jcvi20090724Brian P. Walenz A src/meryl/meryl-build.C jcvi20090721Brian P. Walenz A src/meryl/meryl-build.C jcvi20081012Brian P. Walenz A src/meryl/meryl-build.C jcvi20081012Brian P. Walenz A src/meryl/meryl-build.C jcvi20080917Brian P. Walenz A src/meryl/meryl-build.C jcvi20080909Brian P. Walenz A src/meryl/meryl-build.C jcvi20080609Brian P. Walenz A src/meryl/meryl-build.C jcvi20080401Brian P. Walenz A src/meryl/meryl-build.C jcvi20080331Brian P. Walenz A src/meryl/meryl-build.C jcvi20080318Brian P. Walenz A src/meryl/meryl-build.C jcvi20071208Brian P. Walenz A src/meryl/meryl-build.C jcvi20071023Brian P. Walenz A src/meryl/meryl-build.C jcvi20071012Brian P. Walenz A src/meryl/meryl-build.C jcvi20070913Brian P. Walenz A src/meryl/meryl-build.C jcvi20070827Brian P. Walenz A src/meryl/meryl-build.C jcvi20070823Brian P. Walenz A src/meryl/meryl-build.C jcvi20070823Brian P. Walenz A src/meryl/meryl-build.C jcvi20070606Brian P. Walenz A src/meryl/meryl-build.C jcvi20070330Brian P. Walenz A src/meryl/meryl-build.C jcvi20070326Brian P. Walenz A src/meryl/meryl-build.C jcvi20061109Brian P. Walenz A src/meryl/meryl-build.C jcvi20060527Brian P. Walenz A src/meryl/meryl-build.C jcvi20060515Brian P. 
Walenz A src/meryl/meryl-build.C jcvi20060514Brian P. Walenz A src/meryl/meryl-build.C jcvi20050911Brian P. Walenz A src/meryl/meryl-build.C jcvi20050531Brian P. Walenz A src/meryl/meryl-build.C jcvi20050531Brian P. Walenz A src/meryl/meryl-build.C jcvi20050523Brian P. Walenz A src/meryl/meryl-build.C jcvi20050320Brian P. Walenz A src/meryl/meryl-build.C jcvi20050312Brian P. Walenz A src/meryl/meryl-build.C none20041010Brian P. Walenz A src/meryl/meryl-build.C none20040812Brian P. Walenz A src/meryl/meryl-build.C none20040623Brian P. Walenz A src/meryl/meryl-build.C none20040421Brian P. Walenz A src/meryl/meryl-build.C none20040413Brian P. Walenz A src/meryl/meryl-build.C none20040413Brian P. Walenz A src/meryl/meryl-build.C none20040413Brian P. Walenz A src/meryl/meryl-build.C none20040409Brian P. Walenz A src/meryl/meryl-build.C craa20040408Brian P. Walenz A src/meryl/meryl-build.C craa20040407Brian P. Walenz A src/meryl/meryl-build.C craa20040405Brian P. Walenz A src/meryl/meryl-build.C none20040325Brian P. Walenz A src/meryl/meryl-build.C none20040323Brian P. Walenz A src/meryl/meryl-build.C craa20040212Clark Mobarry A src/meryl/meryl-build.C craa20030506Brian P. Walenz A src/meryl/meryl-build.C craa20030131Brian P. Walenz A src/meryl/meryl-build.C craa20030102Brian P. Walenz D src/meryl/meryl-build.C kmer/meryl/build.C A src/meryl/simple.C bnbi20150424Brian P. Walenz A src/meryl/simple.C bnbi20150421Brian P. Walenz A src/meryl/simple.C bnbi20141205Brian P. Walenz A src/meryl/simple.C jcvi20140411Brian P. Walenz A src/meryl/simple.C jcvi20110330Brian P. Walenz A src/meryl/simple.C jcvi20100831Brian P. Walenz D src/meryl/simple.C kmer/meryl/simple.C A src/meryl/mapMers-depth.C bnbi20141222Brian P. Walenz A src/meryl/mapMers-depth.C bnbi20141205Brian P. Walenz A src/meryl/mapMers-depth.C bnbi20141014Brian P. Walenz A src/meryl/mapMers-depth.C bnbi20141007Brian P. Walenz A src/meryl/mapMers-depth.C jcvi20140411Brian P. 
Walenz A src/meryl/mapMers-depth.C jcvi20080911Brian P. Walenz A src/meryl/mapMers-depth.C jcvi20080909Brian P. Walenz A src/meryl/mapMers-depth.C jcvi20071029Brian P. Walenz A src/meryl/mapMers-depth.C jcvi20071010Brian P. Walenz A src/meryl/mapMers-depth.C jcvi20070913Brian P. Walenz A src/meryl/mapMers-depth.C jcvi20070619Brian P. Walenz A src/meryl/mapMers-depth.C jcvi20070608Brian P. Walenz D src/meryl/mapMers-depth.C kmer/meryl/mapMers-depth.C A src/meryl/meryl-merge-listmerge.C bnbi20141205Brian P. Walenz A src/meryl/meryl-merge-listmerge.C jcvi20140411Brian P. Walenz A src/meryl/meryl-merge-listmerge.C jcvi20080620Brian P. Walenz D src/meryl/meryl-merge-listmerge.C kmer/meryl/merge.listmerge.C A src/meryl/meryl-unaryOp.C bnbi20141205Brian P. Walenz A src/meryl/meryl-unaryOp.C jcvi20080609Brian P. Walenz A src/meryl/meryl-unaryOp.C jcvi20080331Brian P. Walenz A src/meryl/meryl-unaryOp.C jcvi20071012Brian P. Walenz A src/meryl/meryl-unaryOp.C none20041010Brian P. Walenz A src/meryl/meryl-unaryOp.C craa20040407Brian P. Walenz A src/meryl/meryl-unaryOp.C craa20040405Brian P. Walenz A src/meryl/meryl-unaryOp.C craa20030131Brian P. Walenz A src/meryl/meryl-unaryOp.C craa20030102Brian P. Walenz D src/meryl/meryl-unaryOp.C kmer/meryl/unaryOp.C A src/mhap/mhapConvert.C bnbi20150625Brian P. Walenz A src/mhap/mhapConvert.C bnbi20150616Brian P. Walenz A src/mhap/mhapConvert.C bnbi20150409Brian P. Walenz A src/mhap/mhapConvert.C bnbi20150327Brian P. Walenz A src/stores/tgTig.H nihh20151113Brian P. Walenz A src/stores/tgTig.H nihh20151030Brian P. Walenz A src/stores/tgTig.H bnbi20150814Brian P. Walenz A src/stores/tgTig.H bnbi20150811Brian P. Walenz A src/stores/tgTig.H bnbi20150807Brian P. Walenz A src/stores/tgTig.H bnbi20150708Brian P. Walenz A src/stores/tgTig.H bnbi20150701Brian P. Walenz A src/stores/tgTig.H bnbi20150623Brian P. Walenz A src/stores/tgTig.H bnbi20150603Brian P. Walenz A src/stores/tgTig.H bnbi20150507Brian P. 
Walenz A src/stores/tgTig.H bnbi20150409Brian P. Walenz A src/stores/tgTig.H bnbi20150303Brian P. Walenz A src/stores/tgTig.H bnbi20150127Brian P. Walenz A src/stores/tgTig.H bnbi20150127Brian P. Walenz A src/stores/tgTig.H bnbi20150127Brian P. Walenz A src/stores/tgTig.H bnbi20150121Brian P. Walenz A src/stores/tgTig.H bnbi20150113Brian P. Walenz A src/stores/tgTig.H bnbi20150113Brian P. Walenz A src/stores/tgTig.H bnbi20150110Brian P. Walenz A src/stores/tgTig.H bnbi20141222Brian P. Walenz A src/stores/gkStore.H nihh20151029Brian P. Walenz A src/stores/gkStore.H nihh20151009Brian P. Walenz A src/stores/gkStore.H bnbi20150814Brian P. Walenz A src/stores/gkStore.H bnbi20150810Brian P. Walenz A src/stores/gkStore.H bnbi20150810Brian P. Walenz A src/stores/gkStore.H bnbi20150701Brian P. Walenz A src/stores/gkStore.H bnbi20150605Brian P. Walenz A src/stores/gkStore.H bnbi20150603Brian P. Walenz A src/stores/gkStore.H bnbi20150529Brian P. Walenz A src/stores/gkStore.H bnbi20150529Brian P. Walenz A src/stores/gkStore.H bnbi20150514Brian P. Walenz A src/stores/gkStore.H bnbi20150508Brian P. Walenz A src/stores/gkStore.H bnbi20150318Brian P. Walenz A src/stores/gkStore.H bnbi20150317Brian P. Walenz A src/stores/gkStore.H bnbi20150317Brian P. Walenz A src/stores/gkStore.H bnbi20150306Brian P. Walenz A src/stores/gkStore.H bnbi20150227Brian P. Walenz A src/stores/gkStore.H bnbi20150227Brian P. Walenz A src/stores/gkStore.H bnbi20150224Brian P. Walenz A src/stores/gkStore.H bnbi20150113Brian P. Walenz A src/stores/gkStore.H bnbi20141223Brian P. Walenz A src/stores/gkStore.H bnbi20141215Brian P. Walenz A src/stores/gkStore.H bnbi20141202Brian P. Walenz A src/stores/gkStore.H bnbi20141127Brian P. Walenz A src/stores/gkStore.H bnbi20141126Brian P. Walenz A src/stores/ovStore.C nihh20151012Brian P. Walenz A src/stores/ovStore.C nihh20151012Brian P. Walenz A src/stores/ovStore.C bnbi20150814Brian P. Walenz A src/stores/ovStore.C bnbi20150701Brian P. 
Walenz A src/stores/ovStore.C bnbi20150625Brian P. Walenz A src/stores/ovStore.C bnbi20150616Brian P. Walenz A src/stores/ovStore.C bnbi20150616Brian P. Walenz A src/stores/ovStore.C bnbi20150605Brian P. Walenz A src/stores/ovStore.C bnbi20150520Brian P. Walenz A src/stores/ovStore.C bnbi20150514Brian P. Walenz A src/stores/ovStore.C bnbi20150413Brian P. Walenz A src/stores/ovStore.C bnbi20150409Brian P. Walenz A src/stores/ovStore.C bnbi20150321Brian P. Walenz A src/stores/ovStore.C bnbi20150317Brian P. Walenz A src/stores/ovStore.C bnbi20150312Brian P. Walenz A src/stores/ovStore.C bnbi20150306Brian P. Walenz A src/stores/ovStore.C bnbi20150227Brian P. Walenz A src/stores/ovStore.C bnbi20150113Brian P. Walenz A src/stores/ovStore.C bnbi20141223Brian P. Walenz A src/stores/ovStore.C bnbi20141215Brian P. Walenz A src/stores/ovStore.C bnbi20141209Brian P. Walenz A src/stores/ovStore.C jcvi20130801Brian P. Walenz A src/stores/ovStore.C jcvi20130801Brian P. Walenz A src/stores/ovStore.C jcvi20130801Brian P. Walenz A src/stores/ovStore.C jcvi20130702Brian P. Walenz A src/stores/ovStore.C jcvi20130702Brian P. Walenz A src/stores/ovStore.C jcvi20130111Brian P. Walenz A src/stores/ovStore.C jcvi20120424Brian P. Walenz A src/stores/ovStore.C jcvi20120321Brian P. Walenz A src/stores/ovStore.C jcvi20120214Gregory Sims A src/stores/ovStore.C jcvi20120208Gregory Sims A src/stores/ovStore.C jcvi20120203Gregory Sims A src/stores/ovStore.C jcvi20120201Gregory Sims A src/stores/ovStore.C jcvi20111205Brian P. Walenz A src/stores/ovStore.C jcvi20110603Brian P. Walenz A src/stores/ovStore.C bnbi20110603Sergey Koren A src/stores/ovStore.C bnbi20110602Sergey Koren A src/stores/ovStore.C jcvi20110211Brian P. Walenz A src/stores/ovStore.C jcvi20101215Brian P. Walenz A src/stores/ovStore.C jcvi20091201Brian P. Walenz A src/stores/ovStore.C jcvi20091026Brian P. Walenz A src/stores/ovStore.C jcvi20091026Brian P. Walenz A src/stores/ovStore.C jcvi20090712Brian P. 
Walenz A src/stores/ovStore.C jcvi20090610Brian P. Walenz A src/stores/ovStore.C jcvi20081008Brian P. Walenz A src/stores/ovStore.C jcvi20080627Brian P. Walenz A src/stores/ovStore.C jcvi20080103Brian P. Walenz A src/stores/ovStore.C jcvi20071127Brian P. Walenz A src/stores/ovStore.C jcvi20071120Brian P. Walenz A src/stores/ovStore.C jcvi20071119Brian P. Walenz A src/stores/ovStore.C jcvi20070603Brian P. Walenz A src/stores/ovStore.C jcvi20070510Brian P. Walenz A src/stores/ovStore.C jcvi20070508Sergey Koren A src/stores/ovStore.C jcvi20070416Brian P. Walenz A src/stores/ovStore.C jcvi20070402Brian P. Walenz A src/stores/ovStore.C jcvi20070316Brian P. Walenz A src/stores/ovStore.C jcvi20070313Brian P. Walenz A src/stores/ovStore.C jcvi20070313Brian P. Walenz A src/stores/ovStore.C jcvi20070312Brian P. Walenz A src/stores/ovStore.C jcvi20070309Brian P. Walenz A src/stores/ovStore.C jcvi20070309Brian P. Walenz A src/stores/ovStore.C jcvi20070308Brian P. Walenz A src/stores/ovStore.C jcvi20070308Brian P. Walenz D src/stores/ovStore.C src/AS_OVS/AS_OVS_overlapStore.C D src/stores/ovStore.C src/AS_OVS/AS_OVS_overlapStore.c A src/stores/gatekeeperDumpMetaData.C bnbi20150921Brian P. Walenz A src/stores/gatekeeperDumpMetaData.C bnbi20150701Brian P. Walenz A src/stores/gatekeeperDumpMetaData.C bnbi20150603Brian P. Walenz A src/stores/gatekeeperDumpMetaData.C bnbi20150529Brian P. Walenz A src/stores/gatekeeperDumpMetaData.C bnbi20150507Brian P. Walenz A src/stores/gatekeeperDumpMetaData.C bnbi20150321Brian P. Walenz A src/stores/gatekeeperDumpMetaData.C bnbi20150318Brian P. Walenz A src/stores/ovStoreBucketizer.C nihh20151108Brian P. Walenz A src/stores/ovStoreBucketizer.C bnbi20150921Brian P. Walenz A src/stores/ovStoreBucketizer.C bnbi20150814Brian P. Walenz A src/stores/ovStoreBucketizer.C bnbi20150625Brian P. Walenz A src/stores/ovStoreBucketizer.C bnbi20150616Brian P. Walenz A src/stores/ovStoreBucketizer.C bnbi20150514Brian P. 
Walenz A src/stores/ovStoreBucketizer.C bnbi20150410Brian P. Walenz A src/stores/ovStoreBucketizer.C bnbi20150321Brian P. Walenz A src/stores/ovStoreBucketizer.C bnbi20150317Brian P. Walenz A src/stores/ovStoreBucketizer.C bnbi20150310Brian P. Walenz A src/stores/ovStoreBucketizer.C bnbi20150224Brian P. Walenz A src/stores/ovStoreBucketizer.C bnbi20141215Brian P. Walenz A src/stores/ovStoreBucketizer.C bnbi20141121Brian P. Walenz A src/stores/ovStoreBucketizer.C bnbi20140822Brian P. Walenz A src/stores/ovStoreBucketizer.C jcvi20130801Brian P. Walenz A src/stores/ovStoreBucketizer.C jcvi20130801Brian P. Walenz A src/stores/ovStoreBucketizer.C jcvi20120602Brian P. Walenz A src/stores/ovStoreBucketizer.C jcvi20120510Brian P. Walenz A src/stores/ovStoreBucketizer.C jcvi20120508Brian P. Walenz A src/stores/ovStoreBucketizer.C jcvi20120417Brian P. Walenz A src/stores/ovStoreBucketizer.C jcvi20120417Brian P. Walenz A src/stores/ovStoreBucketizer.C jcvi20120403Brian P. Walenz A src/stores/ovStoreBucketizer.C jcvi20120402Brian P. Walenz A src/stores/ovStoreBucketizer.C jcvi20120402Brian P. Walenz D src/stores/ovStoreBucketizer.C src/AS_OVS/overlapStoreBucketizer.C A src/stores/ovStoreDump.C nihh20151021Brian P. Walenz A src/stores/ovStoreDump.C bnbi20150625Brian P. Walenz A src/stores/ovStoreDump.C bnbi20150616Brian P. Walenz A src/stores/ovStoreDump.C bnbi20150616Brian P. Walenz A src/stores/ovStoreDump.C bnbi20150605Brian P. Walenz A src/stores/ovStoreDump.C bnbi20150526Brian P. Walenz A src/stores/ovStoreDump.C bnbi20150520Brian P. Walenz A src/stores/ovStoreDump.C bnbi20150514Brian P. Walenz A src/stores/ovStoreDump.C bnbi20150429Brian P. Walenz A src/stores/ovStoreDump.C bnbi20150409Brian P. Walenz A src/stores/ovStoreDump.C bnbi20150321Brian P. Walenz A src/stores/ovStoreDump.C bnbi20150317Brian P. Walenz A src/stores/ovStoreDump.C bnbi20150317Brian P. Walenz A src/stores/ovStoreDump.C bnbi20141215Brian P. Walenz A src/stores/ovStoreDump.C bnbi20141210Brian P. 
Walenz A src/stores/ovStoreDump.C bnbi20141021Brian P. Walenz A src/stores/ovStoreDump.C bnbi20140822Brian P. Walenz A src/stores/ovStoreDump.C jcvi20130801Brian P. Walenz A src/stores/ovStoreDump.C jcvi20130801Brian P. Walenz A src/stores/ovStoreDump.C jcvi20130801Brian P. Walenz A src/stores/ovStoreDump.C jcvi20120508Brian P. Walenz A src/stores/ovStoreDump.C jcvi20120402Brian P. Walenz A src/stores/ovStoreDump.C jcvi20120314Gregory Sims A src/stores/ovStoreDump.C jcvi20120203Gregory Sims A src/stores/ovStoreDump.C jcvi20120201Gregory Sims A src/stores/ovStoreDump.C jcvi20111229Brian P. Walenz A src/stores/ovStoreDump.C jcvi20110824Brian P. Walenz A src/stores/ovStoreDump.C jcvi20110102Brian P. Walenz A src/stores/ovStoreDump.C jcvi20101001Brian P. Walenz A src/stores/ovStoreDump.C jcvi20100928Brian P. Walenz A src/stores/ovStoreDump.C jcvi20100416Brian P. Walenz A src/stores/ovStoreDump.C jcvi20100129Brian P. Walenz A src/stores/ovStoreDump.C jcvi20091104Brian P. Walenz A src/stores/ovStoreDump.C jcvi20091026Brian P. Walenz A src/stores/ovStoreDump.C jcvi20090814Sergey Koren A src/stores/ovStoreDump.C jcvi20090706Brian P. Walenz A src/stores/ovStoreDump.C jcvi20090514Brian P. Walenz A src/stores/ovStoreDump.C jcvi20090225Brian P. Walenz A src/stores/ovStoreDump.C jcvi20081216Sergey Koren A src/stores/ovStoreDump.C jcvi20081113Brian P. Walenz A src/stores/ovStoreDump.C jcvi20081013Brian P. Walenz A src/stores/ovStoreDump.C jcvi20081008Brian P. Walenz A src/stores/ovStoreDump.C jcvi20080627Brian P. Walenz A src/stores/ovStoreDump.C jcvi20071119Brian P. Walenz A src/stores/ovStoreDump.C jcvi20071108Brian P. Walenz A src/stores/ovStoreDump.C jcvi20070802Brian P. Walenz A src/stores/ovStoreDump.C jcvi20070723Brian P. Walenz A src/stores/ovStoreDump.C jcvi20070723Brian P. Walenz A src/stores/ovStoreDump.C jcvi20070412Brian P. Walenz A src/stores/ovStoreDump.C jcvi20070403Brian P. Walenz A src/stores/ovStoreDump.C jcvi20070313Brian P. 
Walenz A src/stores/ovStoreDump.C jcvi20070312Brian P. Walenz A src/stores/ovStoreDump.C jcvi20070309Brian P. Walenz A src/stores/ovStoreDump.C jcvi20070309Brian P. Walenz A src/stores/ovStoreDump.C jcvi20070308Brian P. Walenz A src/stores/ovStoreDump.C jcvi20070308Brian P. Walenz D src/stores/ovStoreDump.C src/AS_OVS/overlapStore.C D src/stores/ovStoreDump.C src/AS_OVS/overlapStore.c A src/stores/ovStoreIndexer.C bnbi20150921Brian P. Walenz A src/stores/ovStoreIndexer.C bnbi20150312Brian P. Walenz A src/stores/ovStoreIndexer.C bnbi20150310Brian P. Walenz A src/stores/ovStoreIndexer.C bnbi20141215Brian P. Walenz A src/stores/ovStoreIndexer.C jcvi20130801Brian P. Walenz A src/stores/ovStoreIndexer.C jcvi20130801Brian P. Walenz A src/stores/ovStoreIndexer.C jcvi20120602Brian P. Walenz A src/stores/ovStoreIndexer.C jcvi20120516Brian P. Walenz A src/stores/ovStoreIndexer.C jcvi20120508Brian P. Walenz A src/stores/ovStoreIndexer.C jcvi20120424Brian P. Walenz A src/stores/ovStoreIndexer.C jcvi20120417Brian P. Walenz A src/stores/ovStoreIndexer.C jcvi20120403Brian P. Walenz A src/stores/ovStoreIndexer.C jcvi20120402Brian P. Walenz D src/stores/ovStoreIndexer.C src/AS_OVS/overlapStoreIndexer.C A src/stores/ovStoreSorter.C bnbi20150921Brian P. Walenz A src/stores/ovStoreSorter.C bnbi20150625Brian P. Walenz A src/stores/ovStoreSorter.C bnbi20150616Brian P. Walenz A src/stores/ovStoreSorter.C bnbi20150312Brian P. Walenz A src/stores/ovStoreSorter.C bnbi20150310Brian P. Walenz A src/stores/ovStoreSorter.C bnbi20141215Brian P. Walenz A src/stores/ovStoreSorter.C jcvi20130909Brian P. Walenz A src/stores/ovStoreSorter.C jcvi20130801Brian P. Walenz A src/stores/ovStoreSorter.C jcvi20130801Brian P. Walenz A src/stores/ovStoreSorter.C jcvi20120819Brian P. Walenz A src/stores/ovStoreSorter.C jcvi20120602Brian P. Walenz A src/stores/ovStoreSorter.C jcvi20120508Brian P. Walenz A src/stores/ovStoreSorter.C jcvi20120417Brian P. Walenz A src/stores/ovStoreSorter.C jcvi20120417Brian P. 
Walenz A src/stores/ovStoreSorter.C jcvi20120417Brian P. Walenz A src/stores/ovStoreSorter.C jcvi20120417Brian P. Walenz A src/stores/ovStoreSorter.C jcvi20120403Brian P. Walenz A src/stores/ovStoreSorter.C jcvi20120402Brian P. Walenz A src/stores/ovStoreSorter.C jcvi20120402Brian P. Walenz D src/stores/ovStoreSorter.C src/AS_OVS/overlapStoreSorter.C A src/stores/gatekeeperDumpFASTQ.C bnbi20150603Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C bnbi20150415Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C bnbi20150410Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C bnbi20150327Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C bnbi20150321Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C bnbi20150318Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C bnbi20150317Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C bnbi20150317Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C bnbi20150227Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C bnbi20141215Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C bnbi20141215Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C bnbi20141202Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C bnbi20141127Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C bnbi20141121Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C jcvi20130801Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C jcvi20130801Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C jcvi20120206Brian P. Walenz D src/stores/gatekeeperDumpFASTQ.C src/AS_GKP/gkpStoreDumpFASTQ.C A src/stores/gatekeeperPartition.C bnbi20150317Brian P. Walenz A src/stores/gatekeeperPartition.C bnbi20150306Brian P. Walenz A src/stores/gatekeeperPartition.C bnbi20150306Brian P. Walenz A src/stores/gatekeeperPartition.C bnbi20150305Brian P. Walenz A src/stores/gatekeeperPartition.C bnbi20141223Brian P. Walenz A src/stores/tgStore.H nihh20151030Brian P. Walenz A src/stores/tgStore.H nihh20151029Brian P. Walenz A src/stores/tgStore.H bnbi20150811Brian P. 
Walenz A src/stores/tgStore.H bnbi20150701Brian P. Walenz A src/stores/tgStore.H bnbi20150113Brian P. Walenz A src/stores/tgStore.H bnbi20150113Brian P. Walenz A src/stores/tgStore.H bnbi20141222Brian P. Walenz A src/stores/tgStore.H jcvi20140331Brian P. Walenz A src/stores/tgStore.H jcvi20130918Brian P. Walenz A src/stores/tgStore.H jcvi20130801Brian P. Walenz A src/stores/tgStore.H jcvi20130801Brian P. Walenz A src/stores/tgStore.H jcvi20130801Brian P. Walenz A src/stores/tgStore.H jcvi20121114Brian P. Walenz A src/stores/tgStore.H jcvi20120412Brian P. Walenz A src/stores/tgStore.H jcvi20120328Brian P. Walenz A src/stores/tgStore.H jcvi20110102Brian P. Walenz A src/stores/tgStore.H jcvi20100923Brian P. Walenz A src/stores/tgStore.H jcvi20091212Brian P. Walenz A src/stores/tgStore.H jcvi20091210Brian P. Walenz A src/stores/tgStore.H jcvi20091209Brian P. Walenz A src/stores/tgStore.H jcvi20091119Brian P. Walenz A src/stores/tgStore.H jcvi20091007Brian P. Walenz A src/stores/tgStore.H jcvi20091005Brian P. Walenz A src/stores/tgStore.H jcvi20091005Brian P. Walenz D src/stores/tgStore.H src/AS_CNS/MultiAlignStore.H D src/stores/tgStore.H src/AS_CNS/MultiAlignStore.h A src/stores/tgStoreLoad.C bnbi20150814Brian P. Walenz A src/stores/tgStoreLoad.C bnbi20150811Brian P. Walenz A src/stores/tgStoreLoad.C bnbi20150807Brian P. Walenz A src/stores/tgTigSizeAnalysis.C nihh20151030Brian P. Walenz A src/stores/tgTigSizeAnalysis.C bnbi20141222Brian P. Walenz A src/stores/tgTigSizeAnalysis.C jcvi20131024Brian P. Walenz A src/stores/tgTigSizeAnalysis.C jcvi20130801Brian P. Walenz A src/stores/tgTigSizeAnalysis.C jcvi20120626Brian P. Walenz A src/stores/tgTigSizeAnalysis.C jcvi20120326Brian P. Walenz D src/stores/tgTigSizeAnalysis.C src/AS_CNS/MultiAlignSizeAnalysis.C A src/stores/gatekeeperCreate.C nihh20151108Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20150810Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20150729Brian P. 
Walenz A src/stores/gatekeeperCreate.C bnbi20150715Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20150701Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20150603Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20150603Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20150514Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20150429Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20150325Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20150317Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20150303Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20150227Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20150204Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20150204Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20150130Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20141223Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20141215Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20141202Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20141127Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20141121Brian P. Walenz A src/stores/gatekeeperCreate.C bnbi20141009Brian P. Walenz A src/stores/gatekeeperCreate.C jcvi20130801Brian P. Walenz A src/stores/gatekeeperCreate.C jcvi20130801Brian P. Walenz A src/stores/gatekeeperCreate.C jcvi20120206Brian P. Walenz D src/stores/gatekeeperCreate.C src/AS_GKP/gkpStoreCreate.C A src/stores/ovStoreStats.C nihh20151104Brian P. Walenz A src/stores/ovStoreStats.C nihh20151029Brian P. Walenz A src/stores/ovStoreStats.C nihh20151027Brian P. Walenz A src/stores/ovStoreStats.C nihh20151027Brian P. Walenz A src/stores/ovStore.H nihh20151027Brian P. Walenz A src/stores/ovStore.H nihh20151027Brian P. Walenz A src/stores/ovStore.H nihh20151026Brian P. Walenz A src/stores/ovStore.H nihh20151012Brian P. Walenz A src/stores/ovStore.H bnbi20150701Brian P. Walenz A src/stores/ovStore.H bnbi20150625Brian P. Walenz A src/stores/ovStore.H bnbi20150616Brian P. 
Walenz A src/stores/ovStore.H bnbi20150616Brian P. Walenz A src/stores/ovStore.H bnbi20150605Brian P. Walenz A src/stores/ovStore.H bnbi20150529Brian P. Walenz A src/stores/ovStore.H bnbi20150528Brian P. Walenz A src/stores/ovStore.H bnbi20150514Brian P. Walenz A src/stores/ovStore.H bnbi20150413Brian P. Walenz A src/stores/ovStore.H bnbi20150409Brian P. Walenz A src/stores/ovStore.H bnbi20150321Brian P. Walenz A src/stores/ovStore.H bnbi20150317Brian P. Walenz A src/stores/ovStore.H bnbi20150317Brian P. Walenz A src/stores/ovStore.H bnbi20150306Brian P. Walenz A src/stores/ovStore.H bnbi20150211Brian P. Walenz A src/stores/ovStore.H bnbi20150113Brian P. Walenz A src/stores/ovStore.H bnbi20141230Brian P. Walenz A src/stores/ovStore.H bnbi20141223Brian P. Walenz A src/stores/ovStore.H bnbi20141215Brian P. Walenz A src/stores/ovStore.H bnbi20141209Brian P. Walenz A src/stores/tgStoreDump.C nihh20151108Brian P. Walenz A src/stores/tgStoreDump.C nihh20151104Brian P. Walenz A src/stores/tgStoreDump.C nihh20151103Brian P. Walenz A src/stores/tgStoreDump.C nihh20151102Brian P. Walenz A src/stores/tgStoreDump.C nihh20151030Brian P. Walenz A src/stores/tgStoreDump.C nihh20151012Brian P. Walenz A src/stores/tgStoreDump.C bnbi20150814Brian P. Walenz A src/stores/tgStoreDump.C bnbi20150811Brian P. Walenz A src/stores/tgStoreDump.C bnbi20150807Brian P. Walenz A src/stores/tgStoreDump.C bnbi20150701Brian P. Walenz A src/stores/tgStoreDump.C bnbi20150603Brian P. Walenz A src/stores/tgStoreDump.C bnbi20150526Brian P. Walenz A src/stores/tgStoreDump.C bnbi20150509Brian P. Walenz A src/stores/tgStoreDump.C bnbi20150409Brian P. Walenz A src/stores/tgStoreDump.C bnbi20150113Brian P. Walenz A src/stores/tgStoreDump.C bnbi20150113Brian P. Walenz A src/stores/tgStoreDump.C bnbi20141222Brian P. Walenz A src/stores/tgStoreDump.C bnbi20141014Brian P. Walenz A src/stores/tgStoreDump.C bnbi20141009Brian P. 
Walenz A src/stores/tgStoreDump.C bnbi20140413Sergey Koren A src/stores/tgStoreDump.C jcvi20140331Brian P. Walenz A src/stores/tgStoreDump.C jcvi20140225Brian P. Walenz A src/stores/tgStoreDump.C jcvi20140225Brian P. Walenz A src/stores/tgStoreDump.C jcvi20140220Brian P. Walenz A src/stores/tgStoreDump.C jcvi20140218Brian P. Walenz A src/stores/tgStoreDump.C jcvi20140123Brian P. Walenz A src/stores/tgStoreDump.C jcvi20130918Brian P. Walenz A src/stores/tgStoreDump.C jcvi20130917Brian P. Walenz A src/stores/tgStoreDump.C jcvi20130917Brian P. Walenz A src/stores/tgStoreDump.C jcvi20130916Brian P. Walenz A src/stores/tgStoreDump.C jcvi20130801Brian P. Walenz A src/stores/tgStoreDump.C jcvi20130801Brian P. Walenz A src/stores/tgStoreDump.C jcvi20130503Brian P. Walenz A src/stores/tgStoreDump.C jcvi20130131Brian P. Walenz A src/stores/tgStoreDump.C jcvi20120910Brian P. Walenz A src/stores/tgStoreDump.C jcvi20120810Brian P. Walenz A src/stores/tgStoreDump.C jcvi20120806Brian P. Walenz A src/stores/tgStoreDump.C jcvi20120417Brian P. Walenz A src/stores/tgStoreDump.C jcvi20120412Brian P. Walenz A src/stores/tgStoreDump.C jcvi20120328Brian P. Walenz A src/stores/tgStoreDump.C jcvi20120326Brian P. Walenz A src/stores/tgStoreDump.C jcvi20111218Brian P. Walenz A src/stores/tgStoreDump.C jcvi20111218Brian P. Walenz A src/stores/tgStoreDump.C jcvi20111215Brian P. Walenz A src/stores/tgStoreDump.C jcvi20111208Brian P. Walenz A src/stores/tgStoreDump.C jcvi20111207Brian P. Walenz A src/stores/tgStoreDump.C jcvi20110815Brian P. Walenz A src/stores/tgStoreDump.C jcvi20110102Brian P. Walenz A src/stores/tgStoreDump.C jcvi20100920Brian P. Walenz A src/stores/tgStoreDump.C tigr20100517Michael Schatz A src/stores/tgStoreDump.C jcvi20100423Brian P. Walenz A src/stores/tgStoreDump.C jcvi20100402Brian P. Walenz A src/stores/tgStoreDump.C jcvi20100329Brian P. Walenz A src/stores/tgStoreDump.C tigr20100227Michael Schatz A src/stores/tgStoreDump.C jcvi20091212Brian P. 
Walenz A src/stores/tgStoreDump.C jcvi20091209Brian P. Walenz A src/stores/tgStoreDump.C jcvi20091104Brian P. Walenz A src/stores/tgStoreDump.C jcvi20091007Brian P. Walenz A src/stores/tgStoreDump.C jcvi20091005Brian P. Walenz D src/stores/tgStoreDump.C src/AS_CNS/tigStore.C A src/stores/tgTig.C nihh20151113Brian P. Walenz A src/stores/tgTig.C nihh20151030Brian P. Walenz A src/stores/tgTig.C nihh20151030Brian P. Walenz A src/stores/tgTig.C bnbi20150811Brian P. Walenz A src/stores/tgTig.C bnbi20150807Brian P. Walenz A src/stores/tgTig.C bnbi20150701Brian P. Walenz A src/stores/tgTig.C bnbi20150603Brian P. Walenz A src/stores/tgTig.C bnbi20150507Brian P. Walenz A src/stores/tgTig.C bnbi20150507Brian P. Walenz A src/stores/tgTig.C bnbi20150409Brian P. Walenz A src/stores/tgTig.C bnbi20150306Brian P. Walenz A src/stores/tgTig.C bnbi20150127Brian P. Walenz A src/stores/tgTig.C bnbi20150127Brian P. Walenz A src/stores/tgTig.C bnbi20150126Brian P. Walenz A src/stores/tgTig.C bnbi20150121Brian P. Walenz A src/stores/tgTig.C bnbi20150113Brian P. Walenz A src/stores/tgTig.C bnbi20150110Brian P. Walenz A src/stores/tgTig.C bnbi20141222Brian P. Walenz A src/stores/tgStoreFilter.C nihh20151104Brian P. Walenz A src/stores/tgStoreFilter.C nihh20151030Brian P. Walenz A src/stores/tgStoreFilter.C bnbi20150814Brian P. Walenz A src/stores/tgStoreFilter.C bnbi20141014Brian P. Walenz A src/stores/tgStoreFilter.C bnbi20141009Brian P. Walenz A src/stores/tgStoreFilter.C jcvi20140415Brian P. Walenz A src/stores/tgStoreFilter.C jcvi20140415Brian P. Walenz A src/stores/tgStoreFilter.C bnbi20140414Sergey Koren A src/stores/tgStoreFilter.C jcvi20140331Brian P. Walenz D src/stores/tgStoreFilter.C src/AS_BAT/markRepeatUnique.C A src/stores/tgTigDisplay.C bnbi20150701Brian P. Walenz A src/stores/tgTigDisplay.C bnbi20150409Brian P. Walenz A src/stores/tgTigDisplay.C bnbi20150126Brian P. Walenz A src/stores/gkStore.C nihh20151029Brian P. Walenz A src/stores/gkStore.C nihh20151021Brian P. 
Walenz
A  src/stores/gkStore.C  nihh20151009  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150810  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150810  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150729  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150701  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150529  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150526  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150507  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150429  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150327  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150317  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150317  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150306  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150306  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150303  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150227  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150227  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20150224  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20141223  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20141219  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20141215  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20141215  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20141202  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20141127  Brian P. Walenz
A  src/stores/gkStore.C  bnbi20141126  Brian P. Walenz
A  src/stores/ovOverlap.C  bnbi20150701  Brian P. Walenz
A  src/stores/ovOverlap.C  bnbi20150616  Brian P. Walenz
A  src/stores/ovOverlap.C  bnbi20150616  Brian P. Walenz
A  src/stores/ovOverlap.C  bnbi20150605  Brian P. Walenz
A  src/stores/ovOverlap.C  bnbi20150520  Brian P. Walenz
A  src/stores/ovOverlap.C  bnbi20150514  Brian P. Walenz
A  src/stores/ovOverlap.C  bnbi20150409  Brian P. Walenz
A  src/stores/ovOverlap.C  bnbi20150321  Brian P. Walenz
A  src/stores/ovOverlap.C  bnbi20150127  Brian P. Walenz
A  src/stores/ovOverlap.C  bnbi20141215  Brian P. Walenz
A  src/stores/tgStore.C  nihh20151030  Brian P. Walenz
A  src/stores/tgStore.C  nihh20151029  Brian P. Walenz
A  src/stores/tgStore.C  bnbi20150811  Brian P. Walenz
A  src/stores/tgStore.C  bnbi20150701  Brian P. Walenz
A  src/stores/tgStore.C  bnbi20150409  Brian P. Walenz
A  src/stores/tgStore.C  bnbi20150127  Brian P. Walenz
A  src/stores/tgStore.C  bnbi20150126  Brian P. Walenz
A  src/stores/tgStore.C  bnbi20150113  Brian P. Walenz
A  src/stores/tgStore.C  bnbi20150113  Brian P. Walenz
A  src/stores/tgStore.C  bnbi20141222  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20140331  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20131018  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20130918  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20130905  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20130801  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20130801  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20121116  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20121114  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20121112  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20120914  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20120913  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20120829  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20120810  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20120412  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20120328  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20110506  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20100923  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20100824  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20100414  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20100205  Sergey Koren
A  src/stores/tgStore.C  jcvi20100121  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20100116  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20091212  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20091210  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20091210  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20091209  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20091209  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20091203  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20091202  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20091119  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20091012  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20091007  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20091007  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20091007  Brian P. Walenz
A  src/stores/tgStore.C  jcvi20091005  Brian P. Walenz
D  src/stores/tgStore.C  src/AS_CNS/MultiAlignStore.C
A  src/stores/tgTigSizeAnalysis.H  nihh20151030  Brian P. Walenz
A  src/stores/tgTigSizeAnalysis.H  bnbi20141222  Brian P. Walenz
A  src/stores/tgTigSizeAnalysis.H  jcvi20130801  Brian P. Walenz
A  src/stores/tgTigSizeAnalysis.H  jcvi20130801  Brian P. Walenz
A  src/stores/tgTigSizeAnalysis.H  jcvi20120626  Brian P. Walenz
A  src/stores/tgTigSizeAnalysis.H  jcvi20120326  Brian P. Walenz
D  src/stores/tgTigSizeAnalysis.H  src/AS_CNS/MultiAlignSizeAnalysis.H
A  src/stores/tgTigMultiAlignDisplay.C  bnbi20150603  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  bnbi20150409  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  bnbi20150317  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  bnbi20150131  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  bnbi20150128  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  bnbi20150127  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  bnbi20150127  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  bnbi20150126  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  bnbi20150113  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  bnbi20141121  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20130801  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20130801  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20130801  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20111208  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20111207  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20111206  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20111110  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20110102  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20100806  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20091005  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20090907  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20090610  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20081008  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20080627  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20071108  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  jcvi20070905  Brian P. Walenz
D  src/stores/tgTigMultiAlignDisplay.C  src/AS_CNS/MultiAlignPrint.C
D  src/stores/tgTigMultiAlignDisplay.C  src/AS_CNS/MultiAlignPrint.c
A  src/stores/ovStoreBuild.C  nihh20151118  Brian P. Walenz
A  src/stores/ovStoreBuild.C  nihh20151108  Brian P. Walenz
A  src/stores/ovStoreBuild.C  nihh20151103  Brian P. Walenz
A  src/stores/ovStoreBuild.C  bnbi20150921  Brian P. Walenz
A  src/stores/ovStoreBuild.C  bnbi20150814  Brian P. Walenz
A  src/stores/ovStoreBuild.C  bnbi20150625  Brian P. Walenz
A  src/stores/ovStoreBuild.C  bnbi20150616  Brian P. Walenz
A  src/stores/ovStoreBuild.C  bnbi20150616  Brian P. Walenz
A  src/stores/ovStoreBuild.C  bnbi20150514  Brian P. Walenz
A  src/stores/ovStoreBuild.C  bnbi20150321  Brian P. Walenz
A  src/stores/ovStoreBuild.C  bnbi20150317  Brian P. Walenz
A  src/stores/ovStoreBuild.C  bnbi20150306  Brian P. Walenz
A  src/stores/ovStoreBuild.C  bnbi20150303  Brian P. Walenz
A  src/stores/ovStoreBuild.C  bnbi20150224  Brian P. Walenz
A  src/stores/ovStoreBuild.C  bnbi20141215  Brian P. Walenz
A  src/stores/ovStoreBuild.C  bnbi20141121  Brian P. Walenz
A  src/stores/ovStoreBuild.C  bnbi20140731  Brian P. Walenz
A  src/stores/ovStoreBuild.C  jcvi20130801  Brian P. Walenz
A  src/stores/ovStoreBuild.C  jcvi20130801  Brian P. Walenz
A  src/stores/ovStoreBuild.C  jcvi20120819  Brian P. Walenz
A  src/stores/ovStoreBuild.C  jcvi20120602  Brian P. Walenz
A  src/stores/ovStoreBuild.C  jcvi20120515  Brian P. Walenz
A  src/stores/ovStoreBuild.C  jcvi20120508  Brian P. Walenz
A  src/stores/ovStoreBuild.C  jcvi20120403  Brian P. Walenz
A  src/stores/ovStoreBuild.C  jcvi20120402  Brian P. Walenz
D  src/stores/ovStoreBuild.C  src/AS_OVS/overlapStoreBuild.C
A  src/stores/tgStoreCoverageStat.C  nihh20151104  Brian P. Walenz
A  src/stores/tgStoreCoverageStat.C  nihh20151103  Brian P. Walenz
A  src/stores/tgStoreCoverageStat.C  nihh20151030  Brian P. Walenz
A  src/stores/tgStoreCoverageStat.C  nihh20151012  Brian P. Walenz
A  src/stores/tgStoreCoverageStat.C  bnbi20150814  Brian P. Walenz
A  src/stores/tgStoreCoverageStat.C  jcvi20130801  Brian P. Walenz
A  src/stores/tgStoreCoverageStat.C  jcvi20130801  Brian P. Walenz
A  src/stores/tgStoreCoverageStat.C  jcvi20120829  Brian P. Walenz
A  src/stores/tgStoreCoverageStat.C  jcvi20120718  Jason Miller
A  src/stores/tgStoreCoverageStat.C  jcvi20120327  Brian P. Walenz
A  src/stores/tgStoreCoverageStat.C  jcvi20120313  Brian P. Walenz
A  src/stores/tgStoreCoverageStat.C  jcvi20120115  Brian P. Walenz
A  src/stores/tgStoreCoverageStat.C  jcvi20111230  Brian P. Walenz
A  src/stores/tgStoreCoverageStat.C  jcvi20111229  Brian P. Walenz
A  src/stores/tgStoreCoverageStat.C  jcvi20111220  Brian P. Walenz
D  src/stores/tgStoreCoverageStat.C  src/AS_BAT/computeCoverageStat.C
A  src/stores/ovStoreFile.C  bnbi20150616  Brian P. Walenz
A  src/stores/ovStoreFile.C  bnbi20150529  Brian P. Walenz
A  src/stores/ovStoreFile.C  bnbi20150409  Brian P. Walenz
A  src/stores/ovStoreFile.C  bnbi20150306  Brian P. Walenz
A  src/stores/ovStoreFile.C  bnbi20150211  Brian P. Walenz
A  src/stores/ovStoreFile.C  bnbi20141215  Brian P. Walenz
A  src/stores/ovStoreFile.C  bnbi20141215  Brian P. Walenz
A  src/stores/ovStoreFile.C  bnbi20141209  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20130922  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20130801  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20130801  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20130801  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20120602  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20120508  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20120201  Gregory Sims
A  src/stores/ovStoreFile.C  jcvi20110401  Brian P. Walenz
A  src/stores/ovStoreFile.C  bnbi20110331  Sergey Koren
A  src/stores/ovStoreFile.C  jcvi20091026  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20090410  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20090222  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20081013  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20081008  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20080627  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20070723  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20070603  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20070320  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20070313  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20070312  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20070309  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20070309  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20070308  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20070308  Brian P. Walenz
A  src/stores/ovStoreFile.C  jcvi20070228  Brian P. Walenz
D  src/stores/ovStoreFile.C  src/AS_OVS/AS_OVS_overlapFile.C
D  src/stores/ovStoreFile.C  src/AS_OVS/AS_OVS_overlapFile.c
A  src/bogart-analysis/show-false-best-edges-from-mapping.pl  nihh20151012  Brian P. Walenz
A  src/bogart-analysis/locate-read-in-unitig-based-on-olaps.pl  nihh20151012  Brian P. Walenz
A  src/bogart-analysis/examine-mapping.pl  nihh20151012  Brian P. Walenz
A  src/bogart-analysis/count-bubbles.pl  nihh20151012  Brian P. Walenz
A  src/bogart-analysis/examine-mapping-ideal.pl  nihh20151012  Brian P. Walenz
A  src/bogart-analysis/bogart-build-fasta.pl  nihh20151012  Brian P. Walenz
A  src/bogart-analysis/analyze-nucmer-gaps.pl  nihh20151012  Brian P. Walenz
A  src/bogart-analysis/analyze-mapped-unitigs-for-joins.pl  nihh20151012  Brian P. Walenz
A  src/AS_global.H  nihh20151029  Brian P. Walenz
A  src/AS_global.H  bnbi20150331  Brian P. Walenz
A  src/AS_global.H  bnbi20150303  Brian P. Walenz
A  src/AS_global.H  bnbi20150227  Brian P. Walenz
A  src/AS_global.H  bnbi20141222  Brian P. Walenz
A  src/AS_global.H  bnbi20141209  Brian P. Walenz
A  src/AS_global.H  bnbi20141208  Brian P. Walenz
A  src/AS_global.H  bnbi20141205  Brian P. Walenz
A  src/AS_global.H  bnbi20141126  Brian P. Walenz
A  src/AS_global.H  bnbi20141121  Brian P. Walenz
A  src/AS_global.H  bnbi20141117  Brian P. Walenz
A  src/AS_global.H  bnbi20140811  Brian P. Walenz
A  src/AS_global.H  jcvi20140305  Brian P. Walenz
A  src/AS_global.H  jcvi20131106  Brian P. Walenz
A  src/AS_global.H  jcvi20130801  Brian P. Walenz
A  src/AS_global.H  jcvi20130801  Brian P. Walenz
A  src/AS_global.H  jcvi20130801  Brian P. Walenz
A  src/AS_global.H  jcvi20130501  Brian P. Walenz
A  src/AS_global.H  jcvi20130222  Brian P. Walenz
A  src/AS_global.H  jcvi20120721  Brian P. Walenz
A  src/AS_global.H  jcvi20120508  Brian P. Walenz
A  src/AS_global.H  jcvi20120203  Brian P. Walenz
A  src/AS_global.H  jcvi20111229  Brian P. Walenz
A  src/AS_global.H  jcvi20111204  Brian P. Walenz
A  src/AS_global.H  bnbi20111011  Sergey Koren
A  src/AS_global.H  bnbi20110502  Sergey Koren
A  src/AS_global.H  jcvi20110125  Brian P. Walenz
A  src/AS_global.H  jcvi20110106  Brian P. Walenz
A  src/AS_global.H  jcvi20101025  Brian P. Walenz
A  src/AS_global.H  jcvi20091027  Brian P. Walenz
A  src/AS_global.H  jcvi20091026  Brian P. Walenz
A  src/AS_global.H  jcvi20090610  Brian P. Walenz
A  src/AS_global.H  jcvi20090202  Brian P. Walenz
A  src/AS_global.H  jcvi20081111  Brian P. Walenz
A  src/AS_global.H  jcvi20081009  Brian P. Walenz
A  src/AS_global.H  jcvi20081008  Brian P. Walenz
A  src/AS_global.H  jcvi20080627  Brian P. Walenz
A  src/AS_global.H  jcvi20080514  Brian P. Walenz
A  src/AS_global.H  jcvi20080514  Brian P. Walenz
A  src/AS_global.H  jcvi20071108  Brian P. Walenz
A  src/AS_global.H  jcvi20071024  Brian P. Walenz
A  src/AS_global.H  jcvi20071024  Brian P. Walenz
A  src/AS_global.H  jcvi20070810  Brian P. Walenz
A  src/AS_global.H  jcvi20070804  Brian P. Walenz
A  src/AS_global.H  jcvi20070803  Brian P. Walenz
A  src/AS_global.H  jcvi20070725  Brian P. Walenz
A  src/AS_global.H  jcvi20070723  Brian P. Walenz
A  src/AS_global.H  jcvi20070723  Brian P. Walenz
A  src/AS_global.H  jcvi20070603  Brian P. Walenz
A  src/AS_global.H  jcvi20070514  Brian P. Walenz
A  src/AS_global.H  jcvi20070416  Brian P. Walenz
A  src/AS_global.H  jcvi20070319  Brian P. Walenz
A  src/AS_global.H  jcvi20070318  Brian P. Walenz
A  src/AS_global.H  jcvi20070308  Brian P. Walenz
A  src/AS_global.H  jcvi20070225  Brian P. Walenz
A  src/AS_global.H  jcvi20070224  Brian P. Walenz
A  src/AS_global.H  jcvi20070223  Brian P. Walenz
A  src/AS_global.H  jcvi20061218  Brian P. Walenz
A  src/AS_global.H  jcvi20061218  Brian P. Walenz
A  src/AS_global.H  jcvi20061114  Eli Venter
A  src/AS_global.H  jcvi20061114  Eli Venter
A  src/AS_global.H  jcvi20061008  Brian P. Walenz
A  src/AS_global.H  jcvi20061008  Brian P. Walenz
A  src/merTrim/merTrimApply.C  bnbi20141205  Brian P. Walenz
A  src/merTrim/merTrimApply.C  bnbi20141009  Brian P. Walenz
A  src/merTrim/merTrimApply.C  jcvi20130801  Brian P. Walenz
A  src/merTrim/merTrimApply.C  jcvi20130801  Brian P. Walenz
A  src/merTrim/merTrimApply.C  jcvi20111229  Brian P. Walenz
A  src/merTrim/merTrimApply.C  jcvi20110822  Brian P. Walenz
A  src/merTrim/merTrimApply.C  jcvi20110603  Brian P. Walenz
A  src/merTrim/merTrimApply.C  jcvi20110106  Brian P. Walenz
A  src/merTrim/merTrimApply.C  jcvi20110104  Brian P. Walenz
D  src/merTrim/merTrimApply.C  src/AS_MER/merTrimApply.C
A  src/merTrim/merTrim-compare-logs.pl  nihh20151012  Brian P. Walenz
A  src/merTrim/merTrim-compare-logs.pl  bnbi20141205  Brian P. Walenz
A  src/merTrim/merTrim-compare-logs.pl  bnbi20141115  Brian P. Walenz
D  src/merTrim/merTrim-compare-logs.pl  src/AS_MER/merTrim-compare-logs.pl
A  src/merTrim/merTrim.C  bnbi20141205  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20140411  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20131009  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20130922  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20130801  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20130801  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20130620  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20130508  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20130402  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20130329  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20130308  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20121119  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20121114  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20121101  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20121023  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120726  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120529  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120521  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120510  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120323  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120227  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120227  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120226  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120212  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120202  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120131  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120131  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120131  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120131  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120124  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120123  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120123  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120118  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20120117  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20111229  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20110822  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20110728  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20110721  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20110708  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20110623  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20110603  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20110104  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20101102  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20100712  Sergey Koren
A  src/merTrim/merTrim.C  jcvi20100325  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20100322  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20100321  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20100316  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20100301  Brian P. Walenz
A  src/merTrim/merTrim.C  jcvi20100222  Brian P. Walenz
D  src/merTrim/merTrim.C  src/AS_MER/merTrim.C
A  src/merTrim/merTrimAdapter.C  bnbi20141205  Brian P. Walenz
A  src/merTrim/merTrimAdapter.C  jcvi20130801  Brian P. Walenz
A  src/merTrim/merTrimAdapter.C  jcvi20130801  Brian P. Walenz
A  src/merTrim/merTrimAdapter.C  jcvi20120521  Brian P. Walenz
A  src/merTrim/merTrimAdapter.C  jcvi20120510  Brian P. Walenz
D  src/merTrim/merTrimAdapter.C  src/AS_MER/merTrimAdapter.C
A  src/merTrim/merTrimResult.H  bnbi20141205  Brian P. Walenz
A  src/merTrim/merTrimResult.H  jcvi20130801  Brian P. Walenz
A  src/merTrim/merTrimResult.H  jcvi20130801  Brian P. Walenz
A  src/merTrim/merTrimResult.H  jcvi20110822  Brian P. Walenz
D  src/merTrim/merTrimResult.H  src/AS_MER/merTrimResult.H
A  src/bogus/bogusness.C  bnbi20141223  Brian P. Walenz
A  src/bogus/bogusness.C  bnbi20141219  Brian P. Walenz
A  src/bogus/bogusness.C  jcvi20130801  Brian P. Walenz
A  src/bogus/bogusness.C  jcvi20110815  Brian P. Walenz
A  src/bogus/bogusness.C  jcvi20110506  Brian P. Walenz
A  src/bogus/bogusness.C  jcvi20110404  Brian P. Walenz
A  src/bogus/bogusness.C  jcvi20110210  Brian P. Walenz
A  src/bogus/bogusness.C  jcvi20101221  Brian P. Walenz
A  src/bogus/bogusness.C  jcvi20101217  Brian P. Walenz
A  src/bogus/bogusness.C  jcvi20101213  Brian P. Walenz
A  src/bogus/bogusness.C  jcvi20101206  Brian P. Walenz
A  src/bogus/bogusness.C  jcvi20101202  Brian P. Walenz
A  src/bogus/bogusness.C  jcvi20101123  Brian P. Walenz
D  src/bogus/bogusness.C  src/AS_BAT/bogusness.C
A  src/bogus/bogusUtil.C  bnbi20141223  Brian P. Walenz
A  src/bogus/bogusUtil.C  bnbi20141219  Brian P. Walenz
A  src/bogus/bogusUtil.C  bnbi20141219  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20140123  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20130801  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20130417  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20130329  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20130315  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20110815  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20110627  Jason Miller
A  src/bogus/bogusUtil.C  jcvi20110404  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20110210  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20101221  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20101217  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20101213  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20101213  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20101206  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20101206  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20101202  Brian P. Walenz
A  src/bogus/bogusUtil.C  jcvi20101123  Brian P. Walenz
D  src/bogus/bogusUtil.C  src/AS_BAT/AS_BAT_bogusUtil.C
D  src/bogus/bogusUtil.C  src/bogart/AS_BAT_bogusUtil.C
A  src/bogus/bogusness-run.pl  nihh20151012  Brian P. Walenz
A  src/bogus/bogusness-run.pl  bnbi20141219  Brian P. Walenz
A  src/bogus/bogusness-run.pl  jcvi20130801  Brian P. Walenz
A  src/bogus/bogusness-run.pl  jcvi20110606  Brian P. Walenz
A  src/bogus/bogusness-run.pl  jcvi20110601  Brian P. Walenz
A  src/bogus/bogusness-run.pl  jcvi20110506  Brian P. Walenz
A  src/bogus/bogusness-run.pl  jcvi20110404  Brian P. Walenz
A  src/bogus/bogusness-run.pl  jcvi20110317  Brian P. Walenz
A  src/bogus/bogusness-run.pl  jcvi20101213  Brian P. Walenz
A  src/bogus/bogusness-run.pl  jcvi20101202  Brian P. Walenz
D  src/bogus/bogusness-run.pl  src/AS_BAT/bogusness-run.pl
A  src/bogus/bogus.C  bnbi20141223  Brian P. Walenz
A  src/bogus/bogus.C  bnbi20141219  Brian P. Walenz
A  src/bogus/bogus.C  bnbi20141014  Brian P. Walenz
A  src/bogus/bogus.C  bnbi20141009  Brian P. Walenz
A  src/bogus/bogus.C  jcvi20130801  Brian P. Walenz
A  src/bogus/bogus.C  jcvi20130417  Brian P. Walenz
A  src/bogus/bogus.C  jcvi20111229  Brian P. Walenz
A  src/bogus/bogus.C  jcvi20110815  Brian P. Walenz
A  src/bogus/bogus.C  jcvi20110726  Brian P. Walenz
A  src/bogus/bogus.C  jcvi20110404  Brian P. Walenz
A  src/bogus/bogus.C  jcvi20110210  Brian P. Walenz
A  src/bogus/bogus.C  jcvi20101221  Brian P. Walenz
A  src/bogus/bogus.C  jcvi20101217  Brian P. Walenz
A  src/bogus/bogus.C  jcvi20101213  Brian P. Walenz
A  src/bogus/bogus.C  jcvi20101206  Brian P. Walenz
A  src/bogus/bogus.C  jcvi20101123  Brian P. Walenz
D  src/bogus/bogus.C  src/AS_BAT/bogus.C
A  src/bogus/bogusUtil.H  bnbi20141223  Brian P. Walenz
A  src/bogus/bogusUtil.H  bnbi20141219  Brian P. Walenz
A  src/bogus/bogusUtil.H  bnbi20141219  Brian P. Walenz
A  src/bogus/bogusUtil.H  bnbi20141009  Brian P. Walenz
A  src/bogus/bogusUtil.H  jcvi20130801  Brian P. Walenz
A  src/bogus/bogusUtil.H  jcvi20130801  Brian P. Walenz
A  src/bogus/bogusUtil.H  jcvi20111229  Brian P. Walenz
A  src/bogus/bogusUtil.H  jcvi20110815  Brian P. Walenz
A  src/bogus/bogusUtil.H  jcvi20110404  Brian P. Walenz
A  src/bogus/bogusUtil.H  jcvi20110210  Brian P. Walenz
A  src/bogus/bogusUtil.H  jcvi20101221  Brian P. Walenz
A  src/bogus/bogusUtil.H  jcvi20101217  Brian P. Walenz
A  src/bogus/bogusUtil.H  jcvi20101213  Brian P. Walenz
A  src/bogus/bogusUtil.H  jcvi20101206  Brian P. Walenz
A  src/bogus/bogusUtil.H  jcvi20101123  Brian P. Walenz
D  src/bogus/bogusUtil.H  src/AS_BAT/AS_BAT_bogusUtil.H
D  src/bogus/bogusUtil.H  src/bogart/AS_BAT_bogusUtil.H
A  src/AS_RUN/replaceIIDwithName-overlapDump.pl  nihh20151012  Brian P. Walenz
A  src/AS_RUN/replaceIIDwithName-overlapDump.pl  jcvi20130922  Brian P. Walenz
A  src/AS_RUN/replaceUIDwithName-fastq.pl  nihh20151012  Brian P. Walenz
A  src/AS_RUN/replaceUIDwithName-fastq.pl  bnbi20141001  Brian P. Walenz
A  src/AS_RUN/replaceUIDwithName-fastq.pl  bnbi20141001  Brian P. Walenz
A  src/AS_RUN/replaceUIDwithName-fastq.pl  jcvi20130823  Brian P. Walenz
A  src/AS_RUN/replaceUIDwithName-fastq.pl  jcvi20130801  Brian P. Walenz
A  src/AS_RUN/replaceUIDwithName-fastq.pl  jcvi20130308  Brian P. Walenz
A  src/AS_RUN/replaceUIDwithName-fastq.pl  jcvi20130131  Brian P. Walenz
A  src/AS_RUN/replaceUIDwithName-fastq.pl  jcvi20121213  Brian P. Walenz
D  src/AS_RUN/replaceUIDwithName-fastq.pl  src/AS_RUN/replaceUIDwithName.pl
A  src/AS_RUN/fragmentDepth.C  bnbi20141121  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20130801  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20130801  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20130801  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20120812  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20120508  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20100322  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20090814  Sergey Koren
A  src/AS_RUN/fragmentDepth.C  jcvi20090605  Sergey Koren
A  src/AS_RUN/fragmentDepth.C  jcvi20081008  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20081002  Sergey Koren
A  src/AS_RUN/fragmentDepth.C  jcvi20080627  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20071128  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20071128  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20071108  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20070904  Sergey Koren
A  src/AS_RUN/fragmentDepth.C  jcvi20070427  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20070403  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20070403  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20070329  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  jcvi20070328  Brian P. Walenz
A  src/AS_RUN/replaceUIDwithName-posmap.pl  nihh20151012  Brian P. Walenz
A  src/AS_RUN/replaceUIDwithName-posmap.pl  bnbi20141001  Brian P. Walenz
A  src/AS_RUN/replaceUIDwithName-posmap.pl  bnbi20141001  Brian P. Walenz
A  src/AS_RUN/replaceUIDwithName-posmap.pl  bnbi20141001  Brian P. Walenz
A  src/AS_RUN/replaceUIDwithName-posmap.pl  jcvi20130922  Brian P. Walenz
D  src/AS_RUN/replaceUIDwithName-posmap.pl  src/AS_RUN/replaceUIDwithName-simple.pl
A  src/falcon_sense/outputFalcon.H  bnbi20150420  Brian P. Walenz
A  src/falcon_sense/outputFalcon.C  bnbi20150520  Brian P. Walenz
A  src/falcon_sense/outputFalcon.C  bnbi20150420  Brian P. Walenz
A  src/falcon_sense/createFalconSenseInputs.C  bnbi20150814  Brian P. Walenz
A  src/falcon_sense/createFalconSenseInputs.C  bnbi20150701  Brian P. Walenz
A  src/falcon_sense/createFalconSenseInputs.C  bnbi20150603  Brian P. Walenz
A  src/falcon_sense/createFalconSenseInputs.C  bnbi20150514  Brian P. Walenz
A  src/falcon_sense/createFalconSenseInputs.C  bnbi20150420  Brian P. Walenz
A  src/falcon_sense/createFalconSenseInputs.C  bnbi20150415  Brian P. Walenz
A  src/falcon_sense/createFalconSenseInputs.C  bnbi20150409  Brian P. Walenz
A  src/falcon_sense/createFalconSenseInputs.C  bnbi20150409  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  bnbi20150224  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  bnbi20150113  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  bnbi20141001  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  bnbi20140806  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  jcvi20130801  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  jcvi20130801  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  jcvi20121023  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  jcvi20120529  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  jcvi20111229  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  jcvi20110819  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  jcvi20110728  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  jcvi20110719  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  jcvi20110719  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  jcvi20110104  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  jcvi20100402  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  jcvi20100222  Brian P. Walenz
D  src/fastq-utilities/fastqSample.C  src/AS_GKP/fastqSample.C
A  src/fastq-utilities/fastqSimulate-sort.C  bnbi20150113  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate-sort.C  jcvi20130801  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate-sort.C  jcvi20130801  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate-sort.C  jcvi20130321  Brian P. Walenz
D  src/fastq-utilities/fastqSimulate-sort.C  src/AS_GKP/fastqSimulate-sort.C
A  src/fastq-utilities/fastqSimulate-perfectSep.pl  nihh20151012  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate-perfectSep.pl  bnbi20150113  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate-perfectSep.pl  jcvi20130801  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate-perfectSep.pl  jcvi20111228  Brian P. Walenz
D  src/fastq-utilities/fastqSimulate-perfectSep.pl  src/AS_GKP/fastqSimulate-perfectSep.pl
A  src/fastq-utilities/fastqSimulate.C  bnbi20150204  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  bnbi20150202  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  bnbi20150131  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  bnbi20150131  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  bnbi20150113  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  bnbi20140806  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20140331  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20140304  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20140220  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20140129  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20130801  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20130801  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20130318  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20120709  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20120313  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20120118  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110907  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110906  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110829  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110815  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110803  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110728  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110705  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110627  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110623  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110623  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110603  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110421  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110224  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110222  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110222  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  jcvi20110221  Brian P. Walenz
D  src/fastq-utilities/fastqSimulate.C  src/AS_GKP/fastqSimulate.C
A  src/fastq-utilities/fastqSimulate-checkCoverage.pl  nihh20151012  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate-checkCoverage.pl  bnbi20150113  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate-checkCoverage.pl  jcvi20140220  Brian P. Walenz
D  src/fastq-utilities/fastqSimulate-checkCoverage.pl  src/AS_GKP/fastqSimulate-checkCoverage.pl
A  src/fastq-utilities/fastqAnalyze.C  bnbi20150113  Brian P. Walenz
A  src/fastq-utilities/fastqAnalyze.C  jcvi20130801  Brian P. Walenz
A  src/fastq-utilities/fastqAnalyze.C  jcvi20130801  Brian P. Walenz
A  src/fastq-utilities/fastqAnalyze.C  jcvi20130215  Brian P. Walenz
A  src/fastq-utilities/fastqAnalyze.C  jcvi20130215  Brian P. Walenz
A  src/fastq-utilities/fastqAnalyze.C  jcvi20130107  Brian P. Walenz
A  src/fastq-utilities/fastqAnalyze.C  jcvi20121213  Brian P. Walenz
A  src/fastq-utilities/fastqAnalyze.C  jcvi20121212  Brian P. Walenz
A  src/fastq-utilities/fastqAnalyze.C  jcvi20120625  Brian P. Walenz
A  src/fastq-utilities/fastqAnalyze.C  jcvi20120321  Brian P. Walenz
A  src/fastq-utilities/fastqAnalyze.C  jcvi20120225  Brian P. Walenz
A  src/fastq-utilities/fastqAnalyze.C  jcvi20120224  Brian P. Walenz
D  src/fastq-utilities/fastqAnalyze.C  src/AS_GKP/fastqAnalyze.C
A  src/AS_UTL/bitOperations.H  bnbi20141208  Brian P. Walenz
A  src/AS_UTL/bitOperations.H  bnbi20141205  Brian P. Walenz
A  src/AS_UTL/bitOperations.H  jcvi20140411  Brian P. Walenz
A  src/AS_UTL/bitOperations.H  jcvi20080707  Brian P. Walenz
A  src/AS_UTL/bitOperations.H  jcvi20060728  Brian P. Walenz
A  src/AS_UTL/bitOperations.H  jcvi20050310  Brian P. Walenz
A  src/AS_UTL/bitOperations.H  jcvi20050223  Brian P. Walenz
A  src/AS_UTL/bitOperations.H  none20040427  Brian P. Walenz
D  src/AS_UTL/bitOperations.H  kmer/libutil/bitOperations.h
A  src/AS_UTL/intervalList.H  nihh20151030  Brian P. Walenz
A  src/AS_UTL/intervalList.H  bnbi20141205  Brian P. Walenz
A  src/AS_UTL/intervalList.H  bnbi20141022  Brian P. Walenz
A  src/AS_UTL/intervalList.H  bnbi20141021  Brian P. Walenz
A  src/AS_UTL/intervalList.H  bnbi20141014  Brian P. Walenz
A  src/AS_UTL/intervalList.H  bnbi20141007  Brian P. Walenz
A  src/AS_UTL/intervalList.H  jcvi20140411  Brian P. Walenz
A  src/AS_UTL/intervalList.H  jcvi20130508  Brian P. Walenz
A  src/AS_UTL/intervalList.H  jcvi20130429  Brian P. Walenz
A  src/AS_UTL/intervalList.H  jcvi20130404  Brian P. Walenz
A  src/AS_UTL/intervalList.H  jcvi20101201  Brian P. Walenz
A  src/AS_UTL/intervalList.H  jcvi20070608  Brian P. Walenz
A  src/AS_UTL/intervalList.H  jcvi20061212  Brian P. Walenz
A  src/AS_UTL/intervalList.H  jcvi20061023  Brian P. Walenz
A  src/AS_UTL/intervalList.H  jcvi20060825  Brian P. Walenz
A  src/AS_UTL/intervalList.H  jcvi20060808  Brian P. Walenz
A  src/AS_UTL/intervalList.H  jcvi20060407  Brian P. Walenz
A  src/AS_UTL/intervalList.H  jcvi20050712  Brian P. Walenz
A  src/AS_UTL/intervalList.H  none20040421  Brian P. Walenz
A  src/AS_UTL/intervalList.H  craa20030813  Brian P. Walenz
A  src/AS_UTL/intervalList.H  craa20030811  Brian P. Walenz
D  src/AS_UTL/intervalList.H  kmer/libutil/intervalList.H
A  src/AS_UTL/AS_UTL_reverseComplement.H  jcvi20130801  Brian P. Walenz
A  src/AS_UTL/AS_UTL_reverseComplement.H  jcvi20130801  Brian P. Walenz
A  src/AS_UTL/AS_UTL_reverseComplement.H  jcvi20130801  Brian P. Walenz
A  src/AS_UTL/AS_UTL_reverseComplement.H  jcvi20081008  Brian P. Walenz
A  src/AS_UTL/AS_UTL_reverseComplement.H  jcvi20080627  Brian P. Walenz
A  src/AS_UTL/AS_UTL_reverseComplement.H  jcvi20080515  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  bnbi20150624  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  bnbi20141208  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  bnbi20141205  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  bnbi20141205  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20140411  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20130103  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20121015  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20110816  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20110119  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20110110  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20100919  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20100331  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20081001  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20080925  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20080904  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20080903  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20080902  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20080828  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20080820  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20080819  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20080818  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20080818  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20080814  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20060512  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20060501  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20060407  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20060308  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20060303  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20060302  Brian P. Walenz
A  src/AS_UTL/sweatShop.C  jcvi20060302  Brian P. Walenz
D  src/AS_UTL/sweatShop.C  kmer/libutil/sweatShop.C
A  src/AS_UTL/decodeBooleanString.C  bnbi20141126  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  bnbi20141208  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  jcvi20140411  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  jcvi20101112  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  jcvi20100919  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  jcvi20080911  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  jcvi20080909  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  jcvi20071226  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  jcvi20070320  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  jcvi20060624  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  none20040524  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  none20040421  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  none20040329  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  none20040317  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  craa20040106  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  craa20030929  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  craa20030915  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  craa20030708  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  craa20030612  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  craa20030506  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  craa20030416  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  craa20030415  Brian P. Walenz
A  src/AS_UTL/readBuffer.H  craa20030415  Brian P. Walenz
D  src/AS_UTL/readBuffer.H  kmer/libutil/readBuffer.H
A  src/AS_UTL/AS_UTL_fileIO.C  bnbi20150814  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  bnbi20150514  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  bnbi20150325  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20130927  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20130922  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20130801  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20130801  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20130801  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20121114  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20120402  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20120306  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20111229  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20091202  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20091107  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20090610  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20090511  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20081008  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20080627  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20070929  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20070902  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20070827  Eli Venter
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20070824  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20070723  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20070723  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20070603  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20070416  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20070402  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20070313  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20070301  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20070228  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20070218  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20061008  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20050928  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  jcvi20050926  Brian P. Walenz
A  src/AS_UTL/kMerTiny.H  nihh20151029  Brian P. Walenz
A  src/AS_UTL/kMerTiny.H  bnbi20141205  Brian P. Walenz
A  src/AS_UTL/kMerTiny.H  bnbi20141205  Brian P. Walenz
A  src/AS_UTL/kMerTiny.H  bnbi20141205  Brian P. Walenz
A  src/AS_UTL/kMerTiny.H  jcvi20140411  Brian P. Walenz
A  src/AS_UTL/kMerTiny.H  jcvi20100915  Brian P. Walenz
A  src/AS_UTL/kMerTiny.H  jcvi20080808  Brian P. Walenz
A  src/AS_UTL/kMerTiny.H  jcvi20080728  Brian P. Walenz
A  src/AS_UTL/kMerTiny.H  jcvi20080609  Brian P. Walenz
A  src/AS_UTL/kMerTiny.H  jcvi20080318  Brian P. Walenz
A  src/AS_UTL/kMerTiny.H  jcvi20071208  Brian P. Walenz
A  src/AS_UTL/kMerTiny.H  jcvi20071023  Brian P. Walenz
A  src/AS_UTL/kMerTiny.H  jcvi20070913  Brian P. Walenz
A  src/AS_UTL/kMerTiny.H  jcvi20070913  Brian P. Walenz
D  src/AS_UTL/kMerTiny.H  kmer/libbio/kmertiny.H
A  src/AS_UTL/bitPackedFile.H  bnbi20141205  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  bnbi20141205  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  jcvi20140411  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  jcvi20080708  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  jcvi20070827  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  jcvi20070823  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  jcvi20070320  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  jcvi20070102  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  jcvi20061026  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  jcvi20061022  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  jcvi20060706  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  jcvi20060621  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  jcvi20051127  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  jcvi20050712  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  none20040421  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  none20040331  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  none20040329  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  craa20031020  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  craa20030909  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  craa20030506  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  craa20030220  Brian P. Walenz
A  src/AS_UTL/bitPackedFile.H  craa20030102  Brian P. Walenz
D  src/AS_UTL/bitPackedFile.H  kmer/libutil/bitPackedFile.H
A  src/AS_UTL/speedCounter.C  bnbi20141208  Brian P. Walenz
A  src/AS_UTL/speedCounter.C  bnbi20141205  Brian P. Walenz
A  src/AS_UTL/speedCounter.C  jcvi20140411  Brian P. Walenz
A  src/AS_UTL/speedCounter.C  jcvi20061022  Brian P. Walenz
D  src/AS_UTL/speedCounter.C  kmer/libutil/speedCounter.C
A  src/AS_UTL/testRand.C  nihh20151012  Brian P. Walenz
A  src/AS_UTL/testRand.C  jcvi20130801  Brian P. Walenz
A  src/AS_UTL/testRand.C  jcvi20130801  Brian P. Walenz
A  src/AS_UTL/testRand.C  jcvi20130801  Brian P. Walenz
A  src/AS_UTL/testRand.C  jcvi20081008  Brian P. Walenz
A  src/AS_UTL/testRand.C  jcvi20080627  Brian P. Walenz
A  src/AS_UTL/bitEncodings.C  bnbi20141208  Brian P. Walenz
A  src/AS_UTL/stddev.H  nihh20151027  Brian P. Walenz
A  src/AS_UTL/stddev.H  nihh20151027  Brian P. Walenz
A  src/AS_UTL/stddev.H  bnbi20150818  Brian P. Walenz
A  src/AS_UTL/stddev.H  bnbi20150726  Brian P. Walenz
A  src/AS_UTL/stddev.H  bnbi20150723  Brian P. Walenz
A  src/AS_UTL/dnaAlphabets.C  bnbi20141208  Brian P. Walenz
A  src/AS_UTL/AS_UTL_decodeRange.C  bnbi20150528  Brian P. Walenz
A  src/AS_UTL/AS_UTL_decodeRange.C  jcvi20131011  Brian P. Walenz
A  src/AS_UTL/AS_UTL_decodeRange.C  jcvi20130801  Brian P.
Walenz A src/AS_UTL/AS_UTL_decodeRange.C jcvi20120213Brian P. Walenz A src/AS_UTL/AS_UTL_decodeRange.C jcvi20120212Brian P. Walenz A src/AS_UTL/AS_UTL_stackTraceTest.C jcvi20131018Brian P. Walenz A src/AS_UTL/bitPackedArray.H bnbi20141208Brian P. Walenz A src/AS_UTL/bitPackedArray.H bnbi20141205Brian P. Walenz A src/AS_UTL/bitPackedArray.H bnbi20141205Brian P. Walenz A src/AS_UTL/bitPackedArray.H jcvi20140411Brian P. Walenz A src/AS_UTL/bitPackedArray.H jcvi20061026Brian P. Walenz A src/AS_UTL/bitPackedArray.H jcvi20060706Brian P. Walenz A src/AS_UTL/bitPackedArray.H jcvi20060621Brian P. Walenz A src/AS_UTL/bitPackedArray.H jcvi20050712Brian P. Walenz D src/AS_UTL/bitPackedArray.H kmer/libutil/bitPackedArray.H A src/AS_UTL/AS_UTL_alloc.C nihh20151008Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20120508Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20110719Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20081008Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20080627Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20070830Eli Venter A src/AS_UTL/AS_UTL_alloc.C jcvi20070218Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20070215Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20070214Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20070213Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20061106Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20061008Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20060926Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20050825Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20050825Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C jcvi20050824Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.H jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.H jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.H jcvi20130801Brian P. 
Walenz A src/AS_UTL/AS_UTL_stackTrace.H jcvi20130314Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.H jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.H jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.H jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.H jcvi20120508Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.H jcvi20100322Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.H jcvi20090331Sergey Koren A src/AS_UTL/AS_UTL_fasta.H jcvi20081029Sergey Koren A src/AS_UTL/AS_UTL_fasta.H jcvi20081008Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.H jcvi20080627Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.H jcvi20080515Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.H jcvi20071102Brian P. Walenz A src/AS_UTL/timeAndSize.C bnbi20141208Brian P. Walenz A src/AS_UTL/kMer.C nihh20151029Brian P. Walenz A src/AS_UTL/kMer.C bnbi20141205Brian P. Walenz A src/AS_UTL/kMer.C bnbi20141205Brian P. Walenz A src/AS_UTL/kMer.C bnbi20141205Brian P. Walenz A src/AS_UTL/kMer.C jcvi20140411Brian P. Walenz A src/AS_UTL/kMer.C jcvi20111214Brian P. Walenz A src/AS_UTL/kMer.C jcvi20110330Brian P. Walenz D src/AS_UTL/kMer.C kmer/libbio/kmer.C A src/AS_UTL/bitPackedFile.C nihh20151029Brian P. Walenz A src/AS_UTL/bitPackedFile.C bnbi20141208Brian P. Walenz A src/AS_UTL/bitPackedFile.C bnbi20141205Brian P. Walenz A src/AS_UTL/bitPackedFile.C bnbi20141205Brian P. Walenz A src/AS_UTL/bitPackedFile.C jcvi20140411Brian P. Walenz A src/AS_UTL/bitPackedFile.C jcvi20120227Brian P. Walenz A src/AS_UTL/bitPackedFile.C jcvi20080708Brian P. Walenz A src/AS_UTL/bitPackedFile.C jcvi20070827Brian P. Walenz A src/AS_UTL/bitPackedFile.C jcvi20070102Brian P. Walenz A src/AS_UTL/bitPackedFile.C jcvi20061026Brian P. Walenz A src/AS_UTL/bitPackedFile.C jcvi20061022Brian P. Walenz A src/AS_UTL/bitPackedFile.C jcvi20060706Brian P. Walenz A src/AS_UTL/bitPackedFile.C jcvi20060621Brian P. Walenz A src/AS_UTL/bitPackedFile.C jcvi20051127Brian P. Walenz A src/AS_UTL/bitPackedFile.C jcvi20050712Brian P. 
Walenz A src/AS_UTL/bitPackedFile.C jcvi20050316Brian P. Walenz A src/AS_UTL/bitPackedFile.C none20041010Brian P. Walenz A src/AS_UTL/bitPackedFile.C none20040514Brian P. Walenz A src/AS_UTL/bitPackedFile.C none20040430Brian P. Walenz A src/AS_UTL/bitPackedFile.C none20040427Brian P. Walenz A src/AS_UTL/bitPackedFile.C none20040421Brian P. Walenz A src/AS_UTL/bitPackedFile.C craa20040401Brian P. Walenz A src/AS_UTL/bitPackedFile.C none20040331Brian P. Walenz A src/AS_UTL/bitPackedFile.C none20040329Brian P. Walenz A src/AS_UTL/bitPackedFile.C craa20031020Brian P. Walenz A src/AS_UTL/bitPackedFile.C craa20031009Brian P. Walenz A src/AS_UTL/bitPackedFile.C craa20030909Brian P. Walenz A src/AS_UTL/bitPackedFile.C craa20030506Brian P. Walenz A src/AS_UTL/bitPackedFile.C craa20030220Brian P. Walenz A src/AS_UTL/bitPackedFile.C craa20030131Brian P. Walenz A src/AS_UTL/bitPackedFile.C craa20030102Brian P. Walenz D src/AS_UTL/bitPackedFile.C kmer/libutil/bitPackedFile.C A src/AS_UTL/speedCounter.H bnbi20141208Brian P. Walenz A src/AS_UTL/speedCounter.H bnbi20141205Brian P. Walenz A src/AS_UTL/speedCounter.H jcvi20140411Brian P. Walenz A src/AS_UTL/speedCounter.H jcvi20130103Brian P. Walenz A src/AS_UTL/speedCounter.H jcvi20120229Brian P. Walenz A src/AS_UTL/speedCounter.H jcvi20061012Brian P. Walenz A src/AS_UTL/speedCounter.H jcvi20050712Brian P. Walenz A src/AS_UTL/speedCounter.H craa20030506Brian P. Walenz A src/AS_UTL/speedCounter.H craa20030102Brian P. Walenz D src/AS_UTL/speedCounter.H kmer/libutil/speedCounter.H A src/AS_UTL/bitEncodings.H bnbi20141208Brian P. Walenz A src/AS_UTL/memoryMappedFile.H bnbi20150421Brian P. Walenz A src/AS_UTL/memoryMappedFile.H bnbi20150306Brian P. Walenz A src/AS_UTL/memoryMappedFile.H bnbi20150211Brian P. Walenz A src/AS_UTL/memoryMappedFile.H bnbi20150204Brian P. Walenz A src/AS_UTL/memoryMappedFile.H bnbi20141223Brian P. Walenz A src/AS_UTL/memoryMappedFile.H bnbi20141209Brian P. 
Walenz A src/AS_UTL/memoryMappedFile.H bnbi20141208Brian P. Walenz A src/AS_UTL/memoryMappedFile.H bnbi20141127Brian P. Walenz A src/AS_UTL/memoryMappedFile.H bnbi20141126Brian P. Walenz A src/AS_UTL/memoryMappedFile.H jcvi20130801Brian P. Walenz A src/AS_UTL/memoryMappedFile.H jcvi20120219Brian P. Walenz A src/AS_UTL/memoryMappedFile.H jcvi20120218Brian P. Walenz A src/AS_UTL/memoryMappedFile.H jcvi20120216Brian P. Walenz D src/AS_UTL/memoryMappedFile.H src/AS_BAT/memoryMappedFile.H A src/AS_UTL/AS_UTL_reverseComplement.C bnbi20141205Brian P. Walenz A src/AS_UTL/AS_UTL_reverseComplement.C jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_reverseComplement.C jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_reverseComplement.C jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_reverseComplement.C jcvi20090529Brian P. Walenz A src/AS_UTL/AS_UTL_reverseComplement.C jcvi20081008Brian P. Walenz A src/AS_UTL/AS_UTL_reverseComplement.C jcvi20080627Brian P. Walenz A src/AS_UTL/AS_UTL_reverseComplement.C jcvi20080515Brian P. Walenz A src/AS_UTL/sweatShop.H bnbi20141208Brian P. Walenz A src/AS_UTL/sweatShop.H bnbi20141205Brian P. Walenz A src/AS_UTL/sweatShop.H jcvi20140411Brian P. Walenz A src/AS_UTL/sweatShop.H jcvi20110816Brian P. Walenz A src/AS_UTL/sweatShop.H jcvi20101102Brian P. Walenz A src/AS_UTL/sweatShop.H jcvi20100331Brian P. Walenz A src/AS_UTL/sweatShop.H jcvi20080904Brian P. Walenz A src/AS_UTL/sweatShop.H jcvi20080902Brian P. Walenz A src/AS_UTL/sweatShop.H jcvi20080820Brian P. Walenz A src/AS_UTL/sweatShop.H jcvi20060501Brian P. Walenz A src/AS_UTL/sweatShop.H jcvi20060308Brian P. Walenz A src/AS_UTL/sweatShop.H jcvi20060302Brian P. Walenz D src/AS_UTL/sweatShop.H kmer/libutil/sweatShop.H A src/AS_UTL/readBuffer.C nihh20151029Brian P. Walenz A src/AS_UTL/readBuffer.C bnbi20141208Brian P. Walenz A src/AS_UTL/readBuffer.C bnbi20140822Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20140411Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20101112Brian P. 
Walenz A src/AS_UTL/readBuffer.C jcvi20100919Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20090724Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20080925Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20080912Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20080911Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20080909Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20080129Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20071226Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20060914Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20060620Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20050712Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20050523Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20050519Brian P. Walenz A src/AS_UTL/readBuffer.C jcvi20050401Brian P. Walenz A src/AS_UTL/readBuffer.C none20041010Brian P. Walenz A src/AS_UTL/readBuffer.C none20040511Brian P. Walenz A src/AS_UTL/readBuffer.C craa20040506Brian P. Walenz D src/AS_UTL/readBuffer.C kmer/libutil/readBuffer.C A src/AS_UTL/AS_UTL_fileIO.H bnbi20150814Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H bnbi20141208Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H bnbi20141205Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20130927Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20130922Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20120508Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20091202Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20091107Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20090610Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20081008Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20080627Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20070603Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20070416Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20070228Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20070218Brian P. 
Walenz A src/AS_UTL/AS_UTL_fileIO.H jcvi20050926Brian P. Walenz A src/AS_UTL/splitToWords.H bnbi20150811Brian P. Walenz A src/AS_UTL/splitToWords.H bnbi20141205Brian P. Walenz A src/AS_UTL/splitToWords.H jcvi20140411Brian P. Walenz A src/AS_UTL/splitToWords.H jcvi20120718Brian P. Walenz A src/AS_UTL/splitToWords.H jcvi20060718Brian P. Walenz A src/AS_UTL/splitToWords.H jcvi20050712Brian P. Walenz D src/AS_UTL/splitToWords.H kmer/libutil/splitToWords.H A src/AS_UTL/findKeyAndValue.H bnbi20150130Brian P. Walenz A src/AS_UTL/findKeyAndValue.H bnbi20141202Brian P. Walenz A src/AS_UTL/findKeyAndValue.H bnbi20141127Brian P. Walenz A src/AS_UTL/findKeyAndValue.H bnbi20141121Brian P. Walenz A src/AS_UTL/kMer.H bnbi20141205Brian P. Walenz A src/AS_UTL/kMer.H bnbi20141205Brian P. Walenz A src/AS_UTL/kMer.H bnbi20141205Brian P. Walenz A src/AS_UTL/kMer.H jcvi20140411Brian P. Walenz A src/AS_UTL/kMer.H jcvi20110330Brian P. Walenz A src/AS_UTL/kMer.H jcvi20100831Brian P. Walenz A src/AS_UTL/kMer.H jcvi20100225Brian P. Walenz A src/AS_UTL/kMer.H jcvi20081008Brian P. Walenz A src/AS_UTL/kMer.H jcvi20080808Brian P. Walenz A src/AS_UTL/kMer.H jcvi20080728Brian P. Walenz A src/AS_UTL/kMer.H jcvi20080318Brian P. Walenz A src/AS_UTL/kMer.H jcvi20080318Brian P. Walenz A src/AS_UTL/kMer.H jcvi20071028Brian P. Walenz A src/AS_UTL/kMer.H jcvi20071011Brian P. Walenz A src/AS_UTL/kMer.H jcvi20071009Brian P. Walenz A src/AS_UTL/kMer.H jcvi20071009Brian P. Walenz A src/AS_UTL/kMer.H jcvi20070926Brian P. Walenz A src/AS_UTL/kMer.H jcvi20070913Brian P. Walenz A src/AS_UTL/kMer.H jcvi20070913Brian P. Walenz A src/AS_UTL/kMer.H jcvi20070326Brian P. Walenz A src/AS_UTL/kMer.H jcvi20060621Brian P. Walenz A src/AS_UTL/kMer.H jcvi20060512Brian P. Walenz A src/AS_UTL/kMer.H jcvi20051127Brian P. Walenz A src/AS_UTL/kMer.H jcvi20051012Brian P. Walenz A src/AS_UTL/kMer.H jcvi20050910Brian P. Walenz A src/AS_UTL/kMer.H jcvi20050712Brian P. Walenz A src/AS_UTL/kMer.H jcvi20050531Brian P. 
Walenz A src/AS_UTL/kMer.H jcvi20050531Brian P. Walenz A src/AS_UTL/kMer.H jcvi20050523Brian P. Walenz A src/AS_UTL/kMer.H jcvi20050519Brian P. Walenz D src/AS_UTL/kMer.H kmer/libbio/kmer.H A src/AS_UTL/bitPacking.H bnbi20141205Brian P. Walenz A src/AS_UTL/bitPacking.H jcvi20140411Brian P. Walenz A src/AS_UTL/bitPacking.H jcvi20080912Brian P. Walenz A src/AS_UTL/bitPacking.H jcvi20080707Brian P. Walenz A src/AS_UTL/bitPacking.H jcvi20071211Brian P. Walenz A src/AS_UTL/bitPacking.H jcvi20071205Brian P. Walenz A src/AS_UTL/bitPacking.H none20040623Brian P. Walenz A src/AS_UTL/bitPacking.H none20040622Brian P. Walenz A src/AS_UTL/bitPacking.H craa20040506Brian P. Walenz A src/AS_UTL/bitPacking.H none20040427Brian P. Walenz D src/AS_UTL/bitPacking.H kmer/libutil/bitPacking.h A src/AS_UTL/timeAndSize.H bnbi20141208Brian P. Walenz A src/AS_UTL/bitPackedArray.C bnbi20141208Brian P. Walenz A src/AS_UTL/bitPackedArray.C bnbi20141205Brian P. Walenz A src/AS_UTL/bitPackedArray.C bnbi20141205Brian P. Walenz A src/AS_UTL/bitPackedArray.C jcvi20140411Brian P. Walenz A src/AS_UTL/bitPackedArray.C jcvi20121015Brian P. Walenz A src/AS_UTL/bitPackedArray.C jcvi20061026Brian P. Walenz A src/AS_UTL/bitPackedArray.C jcvi20060621Brian P. Walenz A src/AS_UTL/bitPackedArray.C jcvi20050207Brian P. Walenz D src/AS_UTL/bitPackedArray.C kmer/libutil/bitPackedArray.C A src/AS_UTL/AS_UTL_alloc.H nihh20151029Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H nihh20151008Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H bnbi20150131Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H bnbi20150113Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H bnbi20150112Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H bnbi20150107Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H bnbi20141222Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H bnbi20140811Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H jcvi20130801Brian P. 
Walenz A src/AS_UTL/AS_UTL_alloc.H jcvi20081008Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H jcvi20080627Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H jcvi20070214Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H jcvi20070214Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H jcvi20061106Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H jcvi20060926Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H jcvi20050824Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.C jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.C jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.C jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.C jcvi20130417Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.C jcvi20130329Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.C jcvi20130320Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.C jcvi20130314Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.C bnbi20150227Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.C jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.C jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.C jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.C bnbi20130320Sergey Koren A src/AS_UTL/AS_UTL_fasta.C jcvi20120508Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.C jcvi20100322Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.C jcvi20090331Sergey Koren A src/AS_UTL/AS_UTL_fasta.C jcvi20081029Sergey Koren A src/AS_UTL/AS_UTL_fasta.C jcvi20081008Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.C jcvi20080627Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.C jcvi20080604Sergey Koren A src/AS_UTL/AS_UTL_fasta.C jcvi20080515Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.C jcvi20071102Brian P. Walenz A src/AS_UTL/stddev.C bnbi20150818Brian P. Walenz A src/AS_UTL/stddev.C bnbi20150726Brian P. Walenz A src/AS_UTL/stddev.C bnbi20150723Brian P. Walenz A src/AS_UTL/stddev.C bnbi20150410Brian P. Walenz A src/AS_UTL/stddev.C jcvi20130923Brian P. Walenz A src/AS_UTL/stddev.C jcvi20130801Brian P. Walenz A src/AS_UTL/stddev.C jcvi20121218Brian P. Walenz A src/AS_UTL/stddev.C jcvi20121213Brian P. 
Walenz A src/AS_UTL/stddev.C jcvi20121205Brian P. Walenz A src/AS_UTL/stddev.C jcvi20121204Brian P. Walenz D src/AS_UTL/stddev.C src/AS_TER/analyzePosMap-libraryFate.C A src/AS_UTL/dnaAlphabets.H bnbi20141205Brian P. Walenz A src/AS_UTL/dnaAlphabets.H bnbi20141205Brian P. Walenz A src/AS_UTL/dnaAlphabets.H bnbi20141205Brian P. Walenz A src/AS_UTL/dnaAlphabets.H bnbi20141205Brian P. Walenz A src/AS_UTL/dnaAlphabets.H jcvi20080728Brian P. Walenz A src/AS_UTL/dnaAlphabets.H jcvi20080303Brian P. Walenz A src/AS_UTL/dnaAlphabets.H jcvi20071208Brian P. Walenz A src/AS_UTL/dnaAlphabets.H jcvi20060407Brian P. Walenz A src/AS_UTL/dnaAlphabets.H jcvi20050305Brian P. Walenz A src/AS_UTL/dnaAlphabets.H none20040528Brian P. Walenz A src/AS_UTL/dnaAlphabets.H craa20031021Brian P. Walenz A src/AS_UTL/dnaAlphabets.H craa20031016Brian P. Walenz A src/AS_UTL/dnaAlphabets.H craa20030506Brian P. Walenz D src/AS_UTL/dnaAlphabets.H kmer/libbio/alphabet-generate.c D src/AS_UTL/dnaAlphabets.H kmer/libbio/dnaAlphabets.H A src/AS_UTL/kMerHuge.H bnbi20141205Brian P. Walenz A src/AS_UTL/kMerHuge.H bnbi20141205Brian P. Walenz A src/AS_UTL/kMerHuge.H bnbi20141205Brian P. Walenz A src/AS_UTL/kMerHuge.H jcvi20140411Brian P. Walenz A src/AS_UTL/kMerHuge.H jcvi20080808Brian P. Walenz A src/AS_UTL/kMerHuge.H jcvi20080609Brian P. Walenz A src/AS_UTL/kMerHuge.H jcvi20080318Brian P. Walenz A src/AS_UTL/kMerHuge.H jcvi20071208Brian P. Walenz A src/AS_UTL/kMerHuge.H jcvi20071023Brian P. Walenz A src/AS_UTL/kMerHuge.H jcvi20070913Brian P. Walenz D src/AS_UTL/kMerHuge.H kmer/libbio/kmerhuge.H A src/AS_UTL/AS_UTL_decodeRange.H bnbi20150528Brian P. Walenz A src/AS_UTL/AS_UTL_decodeRange.H jcvi20131011Brian P. Walenz A src/AS_UTL/AS_UTL_decodeRange.H jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_decodeRange.H jcvi20130801Brian P. Walenz A src/AS_UTL/AS_UTL_decodeRange.H jcvi20120220Brian P. Walenz A src/AS_UTL/AS_UTL_decodeRange.H jcvi20120212Brian P. 
Walenz A src/AS_UTL/memoryMappedFileTest.C bnbi20141127Brian P. Walenz A src/AS_UTL/memoryMappedFileTest.C bnbi20141126Brian P. Walenz D src/AS_UTL/memoryMappedFileTest.C src/AS_BAT/memoryMappedFileTest.C A src/correction/readConsensus.C nihh20151029Brian P. Walenz A src/correction/readConsensus.C bnbi20150726Brian P. Walenz A src/correction/readConsensus.C bnbi20150720Brian P. Walenz A src/correction/readConsensus.C bnbi20150708Brian P. Walenz A src/correction/readConsensus.C bnbi20150701Brian P. Walenz A src/correction/readConsensus.C bnbi20150701Brian P. Walenz A src/correction/readConsensus.C bnbi20150624Brian P. Walenz A src/correction/readConsensus.C bnbi20150623Brian P. Walenz A src/correction/filterCorrectionOverlaps.C nihh20151029Brian P. Walenz A src/correction/filterCorrectionOverlaps.C bnbi20150625Brian P. Walenz A src/correction/filterCorrectionOverlaps.C bnbi20150616Brian P. Walenz A src/correction/filterCorrectionOverlaps.C bnbi20150616Brian P. Walenz A src/correction/filterCorrectionOverlaps.C bnbi20150615Brian P. Walenz A src/correction/filterCorrectionOverlaps.C bnbi20150605Brian P. Walenz A src/correction/filterCorrectionOverlaps.C bnbi20150528Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150921Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150625Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150616Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150616Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150615Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150612Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150610Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150605Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150603Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150528Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150526Brian P. 
Walenz A src/correction/generateCorrectionLayouts.C bnbi20150520Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150514Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150509Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150420Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150415Brian P. Walenz A src/correction/generateCorrectionLayouts.C bnbi20150409Brian P. Walenz A src/overlapBasedTrimming/generate-random-chimeric-fragments.pl nihh20151012Brian P. Walenz A src/overlapBasedTrimming/trimReads.C bnbi20150625Brian P. Walenz A src/overlapBasedTrimming/trimReads.C bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/trimReads.C bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/trimReads.C bnbi20150603Brian P. Walenz A src/overlapBasedTrimming/trimReads.C bnbi20150528Brian P. Walenz A src/overlapBasedTrimming/trimReads.C bnbi20150514Brian P. Walenz A src/overlapBasedTrimming/trimReads.C bnbi20150420Brian P. Walenz A src/overlapBasedTrimming/trimReads.C bnbi20150415Brian P. Walenz A src/overlapBasedTrimming/finalTrim-stats.pl nihh20151012Brian P. Walenz A src/overlapBasedTrimming/splitReads.H bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/splitReads.H bnbi20150420Brian P. Walenz A src/overlapBasedTrimming/splitReads-workUnit.C bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/splitReads-workUnit.C bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/splitReads-workUnit.C bnbi20150420Brian P. Walenz A src/overlapBasedTrimming/splitReads-workUnit.C bnbi20150415Brian P. Walenz A src/overlapBasedTrimming/adjustOverlaps.H bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/detect-unsplit-subreads-in-overlaps.pl nihh20151012Brian P. Walenz A src/overlapBasedTrimming/trimReads-bestEdge.C bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/trimReads-bestEdge.C bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/trimReads-bestEdge.C bnbi20150528Brian P. 
Walenz A src/overlapBasedTrimming/trimReads-largestCovered.C bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/trimReads-largestCovered.C bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/trimReads-largestCovered.C bnbi20150528Brian P. Walenz A src/overlapBasedTrimming/test-random-chimeric-fragments.pl nihh20151012Brian P. Walenz A src/overlapBasedTrimming/adjustFlipped.C bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/adjustFlipped.C bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/adjustFlipped.C bnbi20150415Brian P. Walenz A src/overlapBasedTrimming/splitReads.C bnbi20150625Brian P. Walenz A src/overlapBasedTrimming/splitReads.C bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/splitReads.C bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/splitReads.C bnbi20150420Brian P. Walenz A src/overlapBasedTrimming/splitReads.C bnbi20150415Brian P. Walenz A src/overlapBasedTrimming/trimReads.H bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/trimReads.H bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/adjustNormal.C bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/adjustNormal.C bnbi20150616Brian P. Walenz A src/overlapBasedTrimming/adjustNormal.C bnbi20150415Brian P. Walenz A src/overlapBasedTrimming/clearRangeFile.H bnbi20150415Brian P. Walenz A src/erateEstimate/erate-estimate-plot-per-base-estimate.pl nihh20151012Brian P. Walenz A src/erateEstimate/erate-estimate-plot-per-base-estimate.pl bnbi20150807Brian P. Walenz A src/erateEstimate/erate-estimate-plot-per-base-estimate.pl bnbi20141115Brian P. Walenz D src/erateEstimate/erate-estimate-plot-per-base-estimate.pl src/AS_BAT/erate-estimate-plot-per-base-estimate.pl A src/erateEstimate/erate-estimate-test-based-on-mapping.pl nihh20151012Brian P. Walenz A src/erateEstimate/erate-estimate-test-based-on-mapping.pl bnbi20150807Brian P. Walenz A src/erateEstimate/erate-estimate-test-based-on-mapping.pl bnbi20141115Brian P. 
Walenz D src/erateEstimate/erate-estimate-test-based-on-mapping.pl src/AS_BAT/erate-estimate-test-based-on-mapping.pl A src/erateEstimate/erateEstimate.C bnbi20150729Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20150701Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20150625Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20150616Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20150616Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20150529Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20150514Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20150317Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20150303Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20150227Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20141223Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20141223Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20141223Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20141219Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20141117Brian P. Walenz A src/erateEstimate/erateEstimate.C bnbi20141021Brian P. Walenz D src/erateEstimate/erateEstimate.C src/AS_BAT/erate-estimate.C A src/AS_UTL/md5.c jcvi20140411Brian P. Walenz A src/AS_UTL/md5.c jcvi20080724Brian P. Walenz A src/AS_UTL/md5.c jcvi20050701Brian P. Walenz A src/AS_UTL/md5.c jcvi20050302Brian P. Walenz A src/AS_UTL/md5.c none20041010Brian P. Walenz A src/AS_UTL/md5.c craa20040430Brian P. Walenz A src/AS_UTL/md5.c craa20040421Brian P. Walenz A src/AS_UTL/md5.c craa20040220Brian P. Walenz A src/AS_UTL/md5.c craa20030729Brian P. Walenz A src/AS_UTL/md5.c craa20030721Brian P. Walenz A src/AS_UTL/md5.c jcvi20140411Brian P. Walenz A src/AS_UTL/md5.c jcvi20080724Brian P. Walenz A src/AS_UTL/md5.c jcvi20050712Brian P. Walenz A src/AS_UTL/md5.c none20040302Brian P. Walenz A src/AS_UTL/md5.c none20041010Brian P. Walenz A src/AS_UTL/md5.c craa20040430Brian P. Walenz A src/AS_UTL/md5.c craa20040421Brian P. 
Walenz A src/AS_UTL/md5.c craa20040220Brian P. Walenz A src/AS_UTL/md5.c craa20030721Brian P. Walenz D src/AS_UTL/md5.H kmer/libutil/md5.h D src/AS_UTL/md5.C kmer/libutil/md5.c A documentation/source/quick-start.rst nihh20160303Sergey Koren A documentation/source/commands/updateDocs.sh nihh20160303Sergey Koren A documentation/source/command-reference.rst nihh20160303Sergey Koren A documentation/source/commands/bogart.rst nihh20160303Sergey Koren A documentation/source/commands/bogus.rst nihh20160303Sergey Koren A documentation/source/commands/canu.rst nihh20160303Sergey Koren A documentation/source/commands/correctOverlaps.rst nihh20160303Sergey Koren A documentation/source/commands/createFalconSenseInputs.rst nihh20160303Sergey Koren A documentation/source/commands/erateEstimate.rst nihh20160303Sergey Koren A documentation/source/commands/estimate-mer-threshold.rst nihh20160303Sergey Koren A documentation/source/commands/fastqAnalyze.rst nihh20160303Sergey Koren A documentation/source/commands/fastqSample.rst nihh20160303Sergey Koren A documentation/source/commands/fastqSimulate-sort.rst nihh20160303Sergey Koren A documentation/source/commands/fastqSimulate.rst nihh20160303Sergey Koren A documentation/source/commands/filterCorrectionOverlaps.rst nihh20160303Sergey Koren A documentation/source/commands/findErrors.rst nihh20160303Sergey Koren A documentation/source/commands/gatekeeperCreate.rst nihh20160303Sergey Koren A documentation/source/commands/gatekeeperDumpFASTQ.rst nihh20160303Sergey Koren A documentation/source/commands/gatekeeperDumpMetaData.rst nihh20160303Sergey Koren A documentation/source/commands/gatekeeperPartition.rst nihh20160303Sergey Koren A documentation/source/commands/generateCorrectionLayouts.rst nihh20160303Sergey Koren A documentation/source/commands/leaff.rst nihh20160303Sergey Koren A documentation/source/commands/meryl.rst nihh20160303Sergey Koren A documentation/source/commands/mhapConvert.rst nihh20160303Sergey Koren A 
documentation/source/commands/ovStoreBucketizer.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/ovStoreBuild.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/ovStoreDump.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/ovStoreIndexer.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/ovStoreSorter.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/overlapConvert.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/overlapImport.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/overlapInCore.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/overlapInCorePartition.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/overlapPair.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/prefixEditDistance-matchLimitGenerate.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/readConsensus.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/splitReads.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/tgStoreCoverageStat.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/tgStoreDump.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/tgStoreFilter.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/tgStoreLoad.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/tgTigDisplay.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/trimReads.rst  nihh  20160303  Sergey Koren
A  documentation/source/commands/updateDocs.sh  nihh  20160303  Sergey Koren
A  documentation/source/commands/utgcns.rst  nihh  20160303  Sergey Koren
A  documentation/source/history.rst  nihh  20160303  Sergey Koren
A  documentation/source/index.rst  nihh  20160303  Sergey Koren
A  documentation/source/parameter-reference.rst  nihh  20160303  Sergey Koren
A  documentation/source/history.rst  nihh  20160303  Sergey Koren
A  documentation/source/history.rst  nihh  20160303  Sergey Koren
A  documentation/source/quick-start.rst  nihh  20160303  Sergey Koren
A  documentation/source/history.rst  nihh  20160303  Sergey Koren
A  documentation/source/index.rst  nihh  20160303  Sergey Koren
A  README.citation  nihh  20160303  Sergey Koren
A  documentation/source/history.rst  nihh  20160303  Sergey Koren
A  src/pipelines/canu/OverlapMhap.pm  nihh  20160303  Sergey Koren
A  README.md  nihh  20160302  Sergey Koren
A  documentation/source/conf.py  nihh  20160302  Sergey Koren
A  src/mhap/mhap-2.0.tar  nihh  20160302  Sergey Koren
A  documentation/source/quick-start.rst  nihh  20160302  Sergey Koren
A  src/bogart/AS_BAT_Logging.C  nihh  20160302  Brian P. Walenz
A  src/stores/ovStoreBuild.C  nihh  20160302  Sergey Koren
A  documentation/source/quick-start.rst  nihh  20160302  Sergey Koren
A  documentation/source/tutorial.rst  nihh  20160302  Sergey Koren
A  src/pipelines/canu/Configure.pm  nihh  20160302  Sergey Koren
A  src/pipelines/canu/OverlapStore.pm  nihh  20160302  Sergey Koren
A  src/stores/gatekeeperDumpFASTQ.C  nihh  20160301  Brian P. Walenz
A  README.md  nihh  20160301  Sergey Koren
A  src/falcon_sense/libfalcon/falcon.C  nihh  20160301  Sergey Koren
A  src/stores/ovStoreBucketizer.C  nihh  20160229  Sergey Koren
A  src/pipelines/canu.pl  nihh  20160229  Sergey Koren
A  src/pipelines/canu/CorrectReads.pm  nihh  20160229  Sergey Koren
A  src/pipelines/canu/Defaults.pm  nihh  20160229  Sergey Koren
A  src/pipelines/canu/Gatekeeper.pm  nihh  20160229  Sergey Koren
A  src/pipelines/canu/Unitig.pm  nihh  20160229  Sergey Koren
A  src/pipelines/canu/Configure.pm  nihh  20160229  Sergey Koren
A  src/pipelines/canu/Defaults.pm  nihh  20160229  Sergey Koren
A  src/pipelines/canu/OverlapStore.pm  nihh  20160229  Sergey Koren
A  src/stores/ovStoreBuild.C  nihh  20160229  Sergey Koren
A  src/pipelines/canu/OverlapMhap.pm  nihh  20160228  Sergey Koren
A  src/falcon_sense/createFalconSenseInputs.C  nihh  20160225  Sergey Koren
A  src/pipelines/canu/CorrectReads.pm  nihh  20160225  Sergey Koren
A  src/falcon_sense/createFalconSenseInputs.C  nihh  20160225  Sergey Koren
A  src/correction/filterCorrectionOverlaps.C  nihh  20160225  Sergey Koren
A  src/correction/generateCorrectionLayouts.C  nihh  20160225  Sergey Koren
A  src/AS_UTL/AS_UTL_fileIO.C  nihh  20160225  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  nihh  20160225  Brian P. Walenz
A  src/falcon_sense/falcon_sense.C  nihh  20160225  Sergey Koren
A  src/falcon_sense/libfalcon/falcon.C  nihh  20160225  Sergey Koren
A  src/falcon_sense/libfalcon/kmer_lookup.C  nihh  20160225  Sergey Koren
A  src/bogart/AS_BAT_BestOverlapGraph.C  nihh  20160225  Brian P. Walenz
A  src/AS_global.C  nihh  20160225  Brian P. Walenz
A  src/AS_UTL/splitToWords.H  nihh  20160225  Brian P. Walenz
A  src/stores/gkStore.C  nihh  20160225  Brian P. Walenz
A  src/stores/gkStore.H  nihh  20160225  Brian P. Walenz
A  src/stores/ovStore.C  nihh  20160225  Brian P. Walenz
A  src/overlapErrorAdjustment/correctOverlaps-Redo_Olaps.C  nihh  20160225  Brian P. Walenz
A  src/fastq-utilities/fastqAnalyze.C  nihh  20160225  Brian P. Walenz
A  src/overlapInCore/overlapInCore.C  nihh  20160225  Brian P. Walenz
A  src/overlapInCore/liboverlap/prefixEditDistance.C  nihh  20160225  Brian P. Walenz
A  src/utgcns/libNDalign/NDalgorithm.C  nihh  20160225  Brian P. Walenz
A  src/meryl/meryl-dump.C  nihh  20160225  Brian P. Walenz
A  src/stores/gkStore.H  nihh  20160225  Brian P. Walenz
A  src/bogart/AS_BAT_Instrumentation.C  nihh  20160225  Brian P. Walenz
A  src/AS_UTL/intervalList.H  nihh  20160225  Brian P. Walenz
A  src/pipelines/canu/OverlapMMap.pm  nihh  20160224  Sergey Koren
A  src/mhap/mhap-2.0.tar  nihh  20160224  Sergey Koren
A  src/mhap/mhap.mk  nihh  20160224  Sergey Koren
A  src/mhap/mhapConvert.C  nihh  20160224  Sergey Koren
A  src/pipelines/canu/Defaults.pm  nihh  20160224  Sergey Koren
A  src/pipelines/canu/OverlapMhap.pm  nihh  20160224  Sergey Koren
A  src/correction/filterCorrectionOverlaps.C  nihh  20160224  Sergey Koren
A  src/correction/generateCorrectionLayouts.C  nihh  20160224  Sergey Koren
A  src/pipelines/canu/CorrectReads.pm  nihh  20160224  Sergey Koren
A  src/pipelines/canu/Defaults.pm  nihh  20160224  Sergey Koren
A  src/Makefile  nihh  20160224  Sergey Koren
A  src/main.mk  nihh  20160224  Sergey Koren
A  src/minimap/mmapConvert.C  nihh  20160224  Sergey Koren
A  src/minimap/mmapConvert.mk  nihh  20160224  Sergey Koren
A  src/pipelines/canu.pl  nihh  20160224  Sergey Koren
A  src/pipelines/canu/Configure.pm  nihh  20160224  Sergey Koren
A  src/pipelines/canu/Defaults.pm  nihh  20160224  Sergey Koren
A  src/pipelines/canu/Meryl.pm  nihh  20160224  Sergey Koren
A  src/pipelines/canu/OverlapMMap.pm  nihh  20160224  Sergey Koren
A  src/pipelines/canu/OverlapMhap.pm  nihh  20160224  Sergey Koren
A  src/Makefile  nihh  20160224  Sergey Koren
A  src/falcon_sense/falcon_sense.C  nihh  20160224  Sergey Koren
A  src/falcon_sense/falcon_sense.mk  nihh  20160224  Sergey Koren
A  src/falcon_sense/libfalcon/falcon.C  nihh  20160224  Sergey Koren
A  src/falcon_sense/libfalcon/falcon.H  nihh  20160224  Sergey Koren
A  src/falcon_sense/libfalcon/kmer_lookup.C  nihh  20160224  Sergey Koren
A  src/main.mk  nihh  20160224  Sergey Koren
A  src/pipelines/canu/CorrectReads.pm  nihh  20160224  Sergey Koren
A  src/stores/ovStore.H  nihh  20160222  Brian P. Walenz
A  src/stores/ovStoreBucketizer.C  nihh  20160222  Brian P. Walenz
A  src/stores/ovStoreBuild.C  nihh  20160222  Brian P. Walenz
A  src/stores/ovStoreFile.C  nihh  20160222  Brian P. Walenz
A  src/pipelines/canu/Execution.pm  nihh  20160222  Brian P. Walenz
A  src/pipelines/canu/Execution.pm  nihh  20160222  Brian P. Walenz
A  src/pipelines/canu/Execution.pm  nihh  20160222  Brian P. Walenz
A  src/pipelines/canu/Defaults.pm  nihh  20160222  Brian P. Walenz
A  src/pipelines/canu/Gatekeeper.pm  nihh  20160222  Brian P. Walenz
A  src/stores/tgStoreDump.C  nihh  20160222  Sergey Koren
A  src/pipelines/canu/Gatekeeper.pm  nihh  20160222  Brian P. Walenz
A  src/Makefile  nihh  20160222  Brian P. Walenz
A  src/main.mk  nihh  20160222  Brian P. Walenz
A  src/pipelines/install-perl-libraries.sh  nihh  20160222  Brian P. Walenz
A  src/Makefile  nihh  20160222  Brian P. Walenz
A  src/pipelines/canu/Gatekeeper.pm  nihh  20160222  Brian P. Walenz
A  src/pipelines/canu.pl  nihh  20160222  Brian P. Walenz
A  src/pipelines/canu/Gatekeeper.pm  nihh  20160222  Brian P. Walenz
A  src/AS_UTL/findKeyAndValue.H  nihh  20160222  Brian P. Walenz
A  src/stores/gatekeeperCreate.C  nihh  20160222  Brian P. Walenz
A  src/overlapInCore/overlapPair.C  nihh  20160220  Sergey Koren
A  src/stores/ovStoreBuild.C  nihh  20160220  Sergey Koren
A  src/bogart/AS_BAT_BestOverlapGraph.C  nihh  20160219  Brian P. Walenz
A  src/pipelines/canu/Consensus.pm  nihh  20160219  Brian P. Walenz
A  src/utgcns/utgcns.C  nihh  20160219  Brian P. Walenz
A  src/pipelines/canu/Consensus.pm  nihh  20160219  Brian P. Walenz
A  src/overlapInCore/overlapPair.C  nihh  20160219  Sergey Koren
A  src/stores/ovStoreBuild.C  nihh  20160219  Brian P. Walenz
A  src/stores/ovStoreBuild.C  nihh  20160219  Brian P. Walenz
A  src/stores/ovStoreSorter.C  nihh  20160219  Brian P. Walenz
A  src/stores/ovStoreBuild.C  nihh  20160218  Brian P. Walenz
A  src/overlapInCore/overlapImport.C  nihh  20160218  Brian P. Walenz
A  src/pipelines/canu.pl  nihh  20160218  Brian P. Walenz
A  src/pipelines/canu/Configure.pm  nihh  20160218  Brian P. Walenz
A  src/pipelines/canu/Defaults.pm  nihh  20160218  Brian P. Walenz
A  src/pipelines/canu/Execution.pm  nihh  20160218  Brian P. Walenz
A  src/pipelines/canu/OverlapStore.pm  nihh  20160218  Brian P. Walenz
A  src/stores/ovStore.H  nihh  20160218  Brian P. Walenz
A  src/stores/ovStoreBucketizer.C  nihh  20160218  Brian P. Walenz
A  src/stores/ovStoreBuild.C  nihh  20160218  Brian P. Walenz
A  src/stores/ovStoreFile.C  nihh  20160218  Brian P. Walenz
A  src/stores/ovStoreSorter.C  nihh  20160218  Brian P. Walenz
A  src/AS_UTL/AS_UTL_alloc.H  nihh  20160218  Brian P. Walenz
A  src/overlapInCore/overlapPair.C  nihh  20160217  Sergey Koren
A  src/overlapInCore/overlapImport.C  nihh  20160217  Sergey Koren
A  src/AS_global.H  nihh  20160215  Sergey Koren
A  src/Makefile  nihh  20160215  Sergey Koren
A  src/overlapInCore/liboverlap/prefixEditDistance-matchLimitGenerate.C  nihh  20160215  Sergey Koren
A  src/correction/generateCorrectionLayouts.C  nihh  20160212  Sergey Koren
A  src/overlapInCore/overlapPair.C  nihh  20160212  Sergey Koren
A  src/overlapInCore/overlapPair.mk  nihh  20160212  Sergey Koren
A  src/pipelines/canu/Meryl.pm  nihh  20160212  Sergey Koren
A  src/pipelines/canu/OverlapMhap.pm  nihh  20160212  Sergey Koren
A  src/utgcns/libNDFalcon/dw.C  nihh  20160212  Sergey Koren
A  src/bogart/AS_BAT_Outputs.C  nihh  20160212  Brian P. Walenz
A  src/bogart/AS_BAT_Unitig.H  nihh  20160212  Brian P. Walenz
A  src/bogart/buildGraph.C  nihh  20160212  Brian P. Walenz
A  src/bogart/buildGraph.mk  nihh  20160212  Brian P. Walenz
A  src/main.mk  nihh  20160212  Brian P. Walenz
A  src/pipelines/canu/Output.pm  nihh  20160212  Brian P. Walenz
A  src/stores/tgTig.H  nihh  20160212  Brian P. Walenz
A  src/pipelines/canu/CorrectReads.pm  nihh  20160211  Brian P. Walenz
A  src/overlapBasedTrimming/trimReads.C  nihh  20160211  Brian P. Walenz
A  src/pipelines/canu/CorrectReads.pm  nihh  20160211  Brian P. Walenz
A  src/pipelines/canu/Defaults.pm  nihh  20160211  Brian P. Walenz
A  src/stores/gatekeeperCreate.C  nihh  20160211  Brian P. Walenz
A  src/stores/gatekeeperDumpMetaData.C  nihh  20160211  Brian P. Walenz
A  src/stores/gkStore.C  nihh  20160211  Brian P. Walenz
A  src/stores/gkStore.H  nihh  20160211  Brian P. Walenz
A  src/stores/ovStore.H  nihh  20160211  Brian P. Walenz
A  src/stores/ovStoreBucketizer.C  nihh  20160211  Brian P. Walenz
A  src/pipelines/canu/Meryl.pm  nihh  20160211  Brian P. Walenz
A  src/pipelines/canu/CorrectReads.pm  nihh  20160211  Brian P. Walenz
A  src/pipelines/canu/Gatekeeper.pm  nihh  20160211  Brian P. Walenz
A  src/pipelines/canu/CorrectReads.pm  nihh  20160210  Brian P. Walenz
A  src/Makefile  nihh  20160209  Brian P. Walenz
A  src/pipelines/canu/Meryl.pm  nihh  20160209  Brian P. Walenz
A  src/Makefile  nihh  20160209  Brian P. Walenz
A  src/Makefile  nihh  20160209  Brian P. Walenz
A  src/pipelines/canu.pl  nihh  20160209  Brian P. Walenz
A  src/pipelines/canu/Grid_SGE.pm  nihh  20160208  Brian P. Walenz
A  src/pipelines/canu/Grid_SGE.pm  nihh  20160208  Brian P. Walenz
A  src/pipelines/canu/Grid_SGE.pm  nihh  20160208  Brian P. Walenz
A  src/pipelines/canu.pl  nihh  20160208  Brian P. Walenz
A  src/pipelines/canu/Consensus.pm  nihh  20160208  Brian P. Walenz
A  src/pipelines/canu/CorrectReads.pm  nihh  20160208  Brian P. Walenz
A  src/pipelines/canu/Execution.pm  nihh  20160208  Brian P. Walenz
A  src/pipelines/canu/Gatekeeper.pm  nihh  20160208  Brian P. Walenz
A  src/pipelines/canu/Meryl.pm  nihh  20160208  Brian P. Walenz
A  src/pipelines/canu/Execution.pm  nihh  20160208  Brian P. Walenz
A  src/pipelines/canu/OverlapErrorAdjustment.pm  nihh  20160208  Brian P. Walenz
A  src/overlapInCore/liboverlap/Binomial_Bound.C  nihh  20160208  Brian P. Walenz
A  src/main.mk  nihh  20160205  Brian P. Walenz
A  src/overlapErrorAdjustment/correctOverlaps.C  nihh  20160205  Brian P. Walenz
A  src/overlapErrorAdjustment/findErrors.C  nihh  20160205  Brian P. Walenz
A  src/overlapInCore/liboverlap/Binomial_Bound.C  nihh  20160205  Brian P. Walenz
A  src/overlapInCore/liboverlap/Binomial_Bound.H  nihh  20160205  Brian P. Walenz
A  src/overlapInCore/liboverlap/prefixEditDistance-matchLimitGenerate.C  nihh  20160205  Brian P. Walenz
A  src/overlapInCore/liboverlap/prefixEditDistance.C  nihh  20160205  Brian P. Walenz
A  src/overlapInCore/liboverlap/prefixEditDistance.H  nihh  20160205  Brian P. Walenz
A  src/stores/gkStore.H  nihh  20160205  Brian P. Walenz
A  src/main.mk  nihh  20160204  Brian P. Walenz
A  src/stores/gatekeeperCreate.C  nihh  20160204  Brian P. Walenz
A  src/bogart/AS_BAT_OverlapCache.C  nihh  20160201  Brian P. Walenz
A  src/overlapInCore/overlapPair.C  nihh  20160201  Brian P. Walenz
A  src/stores/ovStore.C  nihh  20160201  Brian P. Walenz
A  src/stores/ovStore.H  nihh  20160201  Brian P. Walenz
A  src/stores/ovStoreStats.C  nihh  20160201  Brian P. Walenz
A  src/bogart/AS_BAT_OverlapCache.C  nihh  20160201  Brian P. Walenz
A  src/bogart/AS_BAT_OverlapCache.H  nihh  20160201  Brian P. Walenz
A  src/meryl/meryl-build.C  nihh  20160129  Brian P. Walenz
A  src/meryl/meryl-estimate.C  nihh  20160129  Brian P. Walenz
A  src/meryl/meryl.H  nihh  20160129  Brian P. Walenz
A  src/pipelines/canu/CorrectReads.pm  nihh  20160129  Brian P. Walenz
A  src/Makefile  nihh  20160129  Brian P. Walenz
A  src/stores/ovStore.H  nihh  20160129  Brian P. Walenz
A  src/bogart/AS_BAT_Outputs.C  nihh  20160128  Brian P. Walenz
A  src/bogart/AS_BAT_Outputs.C  nihh  20160128  Brian P. Walenz
A  src/bogart/AS_BAT_BestOverlapGraph.C  nihh  20160128  Brian P. Walenz
A  src/bogart/AS_BAT_BestOverlapGraph.H  nihh  20160128  Brian P. Walenz
A  src/bogart/AS_BAT_Instrumentation.C  nihh  20160128  Brian P. Walenz
A  src/bogart/AS_BAT_Instrumentation.H  nihh  20160128  Brian P. Walenz
A  src/bogart/AS_BAT_MergeSplitJoin.C  nihh  20160128  Brian P. Walenz
A  src/bogart/AS_BAT_MergeSplitJoin.H  nihh  20160128  Brian P. Walenz
A  src/bogart/bogart.C  nihh  20160128  Brian P. Walenz
A  src/pipelines/canu/Unitig.pm  nihh  20160128  Brian P. Walenz
A  src/pipelines/canu/OverlapMhap.pm  nihh  20160128  Sergey Koren
A  src/pipelines/canu.pl  nihh  20160127  Brian P. Walenz
A  src/pipelines/canu/Consensus.pm  nihh  20160127  Brian P. Walenz
A  src/pipelines/canu/Defaults.pm  nihh  20160127  Brian P. Walenz
A  src/pipelines/canu/Unitig.pm  nihh  20160127  Brian P. Walenz
A  src/stores/tgStoreFilter.C  nihh  20160127  Brian P. Walenz
A  src/bogart/bogart.C  nihh  20160127  Brian P. Walenz
A  src/bogart/AS_BAT_Outputs.C  nihh  20160127  Brian P. Walenz
A  src/bogart/AS_BAT_Instrumentation.C  nihh  20160127  Brian P. Walenz
A  src/bogart/AS_BAT_Instrumentation.C  nihh  20160127  Brian P. Walenz
A  src/pipelines/canu/Configure.pm  nihh  20160127  Brian P. Walenz
A  src/pipelines/canu/Unitig.pm  nihh  20160127  Brian P. Walenz
A  src/stores/tgTigSizeAnalysis.C  nihh  20160127  Brian P. Walenz
A  src/bogart/AS_BAT_Instrumentation.C  nihh  20160127  Brian P. Walenz
A  src/bogart/AS_BAT_Instrumentation.H  nihh  20160127  Brian P. Walenz
A  src/bogart/AS_BAT_MergeSplitJoin.C  nihh  20160127  Brian P. Walenz
A  src/bogart/AS_BAT_MergeSplitJoin.H  nihh  20160127  Brian P. Walenz
A  src/bogart/AS_BAT_Outputs.C  nihh  20160127  Brian P. Walenz
A  src/bogart/AS_BAT_PromoteToSingleton.C  nihh  20160127  Brian P. Walenz
A  src/bogart/AS_BAT_Unitig.H  nihh  20160127  Brian P. Walenz
A  src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C  nihh  20160127  Brian P. Walenz
A  src/bogart/bogart.C  nihh  20160127  Brian P. Walenz
A  src/correction/generateCorrectionLayouts.C  nihh  20160127  Brian P. Walenz
A  src/pipelines/canu/Output.pm  nihh  20160127  Brian P. Walenz
A  src/pipelines/canu/Unitig.pm  nihh  20160127  Brian P. Walenz
A  src/stores/gkStore.H  nihh  20160127  Brian P. Walenz
A  src/stores/tgStore.H  nihh  20160127  Brian P. Walenz
A  src/stores/tgStoreDump.C  nihh  20160127  Brian P. Walenz
A  src/stores/tgStoreFilter.C  nihh  20160127  Brian P. Walenz
A  src/stores/tgTig.C  nihh  20160127  Brian P. Walenz
A  src/stores/tgTig.H  nihh  20160127  Brian P. Walenz
A  src/stores/tgTigSizeAnalysis.C  nihh  20160127  Brian P. Walenz
A  src/stores/tgTigSizeAnalysis.H  nihh  20160127  Brian P. Walenz
A  src/pipelines/canu/Grid_SGE.pm  nihh  20160122  Brian P. Walenz
A  src/canu_version_update.pl  nihh  20160122  Brian P. Walenz
A  src/pipelines/canu/Output.pm  nihh  20160122  Brian P. Walenz
A  documentation/source/parameter-reference.rst  nihh  20160120  Brian P. Walenz
A  src/pipelines/canu/Grid_SGE.pm  nihh  20160120  Brian P. Walenz
A  src/pipelines/canu/Execution.pm  nihh  20160120  Brian P. Walenz
A  src/pipelines/canu.pl  nihh  20160120  Brian P. Walenz
A  src/pipelines/canu/Consensus.pm  nihh  20160120  Brian P. Walenz
A  src/pipelines/canu/CorrectReads.pm  nihh  20160120  Brian P. Walenz
A  src/pipelines/canu/Execution.pm  nihh  20160120  Brian P. Walenz
A  src/pipelines/canu/Gatekeeper.pm  nihh  20160120  Brian P. Walenz
A  src/pipelines/canu/Meryl.pm  nihh  20160120  Brian P. Walenz
A  src/pipelines/canu/OverlapErrorAdjustment.pm  nihh  20160120  Brian P. Walenz
A  src/pipelines/canu/OverlapInCore.pm  nihh  20160120  Brian P. Walenz
A  src/pipelines/canu/OverlapMhap.pm  nihh  20160120  Brian P. Walenz
A  src/pipelines/canu/OverlapStore.pm  nihh  20160120  Brian P. Walenz
A  src/pipelines/canu/Unitig.pm  nihh  20160120  Brian P. Walenz
A  src/overlapBasedTrimming/splitReads-trimBad.C  nihh  20160119  Brian P. Walenz
A  src/overlapBasedTrimming/splitReads.C  nihh  20160119  Brian P. Walenz
A  src/pipelines/canu/HTML.pm  nihh  20160119  Brian P. Walenz
A  src/overlapBasedTrimming/trimReads.C  nihh  20160119  Brian P. Walenz
A  src/overlapBasedTrimming/trimReads.C  nihh  20160119  Brian P. Walenz
A  src/overlapBasedTrimming/trimStat.H  nihh  20160119  Brian P. Walenz
A  src/pipelines/canu/Gatekeeper.pm  nihh  20160119  Brian P. Walenz
A  src/pipelines/canu/Execution.pm  nihh  20160119  Brian P. Walenz
A  src/pipelines/canu.pl  nihh  20160117  Brian P. Walenz
A  src/overlapBasedTrimming/trimReads.C  nihh  20160117  Brian P. Walenz
A  src/pipelines/canu/HTML.pm  nihh  20160117  Brian P. Walenz
A  src/pipelines/canu/HTML.pm  nihh  20160117  Brian P. Walenz
A  src/pipelines/canu/OverlapBasedTrimming.pm  nihh  20160117  Brian P. Walenz
A  src/pipelines/canu.pl  nihh  20160117  Brian P. Walenz
A  src/pipelines/canu.pl  nihh  20160115  Brian P. Walenz
A  src/pipelines/canu/Defaults.pm  nihh  20160115  Brian P. Walenz
A  src/Makefile  nihh  20160115  Brian P. Walenz
A  src/utgcns/libcns/abBead.H  nihh  20160114  Brian P. Walenz
A  src/pipelines/canu/CorrectReads.pm  nihh  20160114  Brian P. Walenz
A  src/pipelines/canu/HTML.pm  nihh  20160114  Brian P. Walenz
A  src/pipelines/canu.pl  nihh  20160114  Brian P. Walenz
A  src/pipelines/canu/Defaults.pm  nihh  20160114  Brian P. Walenz
A  src/pipelines/canu/Grid.pm  nihh  20160114  Brian P. Walenz
A  src/pipelines/canu/Gatekeeper.pm  nihh  20160114  Brian P. Walenz
A  src/pipelines/canu/Execution.pm  nihh  20160114  Brian P. Walenz
A  src/pipelines/canu/Meryl.pm  nihh  20160114  Brian P. Walenz
A  src/pipelines/canu/CorrectReads.pm  nihh  20160112  Brian P. Walenz
A  src/pipelines/canu/Defaults.pm  nihh  20160112  Brian P. Walenz
A  src/pipelines/canu.pl  nihh  20160112  Brian P. Walenz
A  src/pipelines/canu/CorrectReads.pm  nihh  20160112  Brian P. Walenz
A  src/pipelines/canu/Gatekeeper.pm  nihh  20160112  Brian P. Walenz
A  src/pipelines/canu/CorrectReads.pm  nihh  20160112  Brian P. Walenz
A  src/pipelines/canu.pl  nihh  20160112  Brian P. Walenz
A  src/AS_global.C  nihh  20160112  Brian P. Walenz
A  src/correction/filterCorrectionOverlaps.C  nihh  20160112  Brian P. Walenz
A  src/pipelines/canu/CorrectReads.pm  nihh  20160112  Brian P. Walenz
A  src/pipelines/canu/HTML.pm  nihh  20160112  Brian P. Walenz
A  src/correction/filterCorrectionOverlaps.C  nihh  20160112  Brian P. Walenz
A  src/correction/generateCorrectionLayouts.C  nihh  20160112  Brian P. Walenz
A  src/pipelines/canu/Defaults.pm  nihh  20160111  Brian P. Walenz
A  src/pipelines/canu/CorrectReads.pm  nihh  20160111  Brian P. Walenz
A  src/AS_RUN/fragmentDepth.C  nihh  20160111  Brian P. Walenz
A  src/AS_UTL/AS_UTL_alloc.C  nihh  20160111  Brian P. Walenz
A  src/AS_UTL/AS_UTL_alloc.H  nihh  20160111  Brian P. Walenz
A  src/AS_UTL/AS_UTL_decodeRange.C  nihh  20160111  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.C  nihh  20160111  Brian P. Walenz
A  src/AS_UTL/AS_UTL_fileIO.H  nihh  20160111  Brian P. Walenz
A  src/AS_UTL/AS_UTL_reverseComplement.C  nihh  20160111  Brian P. Walenz
A  src/AS_UTL/AS_UTL_reverseComplement.H  nihh  20160111  Brian P. Walenz
A  src/AS_UTL/AS_UTL_stackTrace.C  nihh  20160111  Brian P. Walenz
A  src/AS_UTL/AS_UTL_stackTrace.H  nihh  20160111  Brian P. Walenz
A  src/AS_UTL/memoryMappedFile.H  nihh  20160111  Brian P. Walenz
A  src/AS_UTL/stddev.C  nihh  20160111  Brian P. Walenz
A  src/AS_UTL/testRand.C  nihh  20160111  Brian P. Walenz
A  src/AS_global.C  nihh  20160111  Brian P. Walenz
A  src/AS_global.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_BestOverlapGraph.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_BestOverlapGraph.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Breaking.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Breaking.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_ChunkGraph.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_ChunkGraph.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Datatypes.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_FragmentInfo.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Instrumentation.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Instrumentation.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_IntersectBubble.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_IntersectBubble.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_IntersectSplit.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_IntersectSplit.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Joining.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Joining.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Logging.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Logging.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_MergeSplitJoin.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_MergeSplitJoin.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Outputs.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Outputs.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_OverlapCache.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_OverlapCache.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_PlaceContains.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_PlaceContains.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_PlaceFragUsingOverlaps.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_PlaceFragUsingOverlaps.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_PlaceZombies.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_PlaceZombies.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_PopulateUnitig.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_PopulateUnitig.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_PromoteToSingleton.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_ReconstructRepeats.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_SetParentAndHang.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_SetParentAndHang.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_SplitDiscontinuous.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_SplitDiscontinuous.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Unitig.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Unitig.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Unitig_AddAndPlaceFrag.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Unitig_AddFrag.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C  nihh  20160111  Brian P. Walenz
A  src/bogart/addReadsToUnitigs.C  nihh  20160111  Brian P. Walenz
A  src/bogart/analyzeBest.C  nihh  20160111  Brian P. Walenz
A  src/bogart/bogart.C  nihh  20160111  Brian P. Walenz
A  src/bogus/bogus.C  nihh  20160111  Brian P. Walenz
A  src/bogus/bogusUtil.C  nihh  20160111  Brian P. Walenz
A  src/bogus/bogusUtil.H  nihh  20160111  Brian P. Walenz
A  src/bogus/bogusness.C  nihh  20160111  Brian P. Walenz
A  src/correction/generateCorrectionLayouts.C  nihh  20160111  Brian P. Walenz
A  src/correction/readConsensus.C  nihh  20160111  Brian P. Walenz
A  src/erateEstimate/erateEstimate.C  nihh  20160111  Brian P. Walenz
A  src/falcon_sense/createFalconSenseInputs.C  nihh  20160111  Brian P. Walenz
A  src/falcon_sense/outputFalcon.C  nihh  20160111  Brian P. Walenz
A  src/fastq-utilities/fastqAnalyze.C  nihh  20160111  Brian P. Walenz
A  src/fastq-utilities/fastqSample.C  nihh  20160111  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate-sort.C  nihh  20160111  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.C  nihh  20160111  Brian P. Walenz
A  src/merTrim/merTrim.C  nihh  20160111  Brian P. Walenz
A  src/merTrim/merTrimAdapter.C  nihh  20160111  Brian P. Walenz
A  src/merTrim/merTrimApply.C  nihh  20160111  Brian P. Walenz
A  src/merTrim/merTrimResult.H  nihh  20160111  Brian P. Walenz
A  src/mercy/mercy-regions.C  nihh  20160111  Brian P. Walenz
A  src/mercy/mercy.C  nihh  20160111  Brian P. Walenz
A  src/meryl/estimate-mer-threshold.C  nihh  20160111  Brian P. Walenz
A  src/overlapBasedTrimming/adjustFlipped.C  nihh  20160111  Brian P. Walenz
A  src/overlapBasedTrimming/adjustNormal.C  nihh  20160111  Brian P. Walenz
A  src/overlapBasedTrimming/splitReads-subReads.C  nihh  20160111  Brian P. Walenz
A  src/overlapBasedTrimming/splitReads.C  nihh  20160111  Brian P. Walenz
A  src/overlapBasedTrimming/trimReads-bestEdge.C  nihh  20160111  Brian P. Walenz
A  src/overlapBasedTrimming/trimReads-largestCovered.C  nihh  20160111  Brian P. Walenz
A  src/overlapBasedTrimming/trimReads-quality.C  nihh  20160111  Brian P. Walenz
A  src/overlapBasedTrimming/trimReads.C  nihh  20160111  Brian P. Walenz
A  src/overlapBasedTrimming/trimReads.H  nihh  20160111  Brian P. Walenz
A  src/overlapErrorAdjustment/correctOverlaps.C  nihh  20160111  Brian P. Walenz
A  src/overlapErrorAdjustment/findErrors.C  nihh  20160111  Brian P. Walenz
A  src/overlapInCore/liboverlap/prefixEditDistance-extend.C  nihh  20160111  Brian P. Walenz
A  src/overlapInCore/overlapInCore-Build_Hash_Index.C  nihh  20160111  Brian P. Walenz
A  src/overlapInCore/overlapInCore-Find_Overlaps.C  nihh  20160111  Brian P. Walenz
A  src/overlapInCore/overlapInCore-Output.C  nihh  20160111  Brian P. Walenz
A  src/overlapInCore/overlapInCore-Process_Overlaps.C  nihh  20160111  Brian P. Walenz
A  src/overlapInCore/overlapInCore-Process_String_Overlaps.C  nihh  20160111  Brian P. Walenz
A  src/overlapInCore/overlapInCore.C  nihh  20160111  Brian P. Walenz
A  src/overlapInCore/overlapInCore.H  nihh  20160111  Brian P. Walenz
A  src/overlapInCore/overlapInCorePartition.C  nihh  20160111  Brian P. Walenz
A  src/overlapInCore/overlapPair.C  nihh  20160111  Brian P. Walenz
A  src/stores/gatekeeperCreate.C  nihh  20160111  Brian P. Walenz
A  src/stores/gatekeeperDumpFASTQ.C  nihh  20160111  Brian P. Walenz
A  src/stores/gatekeeperDumpMetaData.C  nihh  20160111  Brian P. Walenz
A  src/stores/gatekeeperPartition.C  nihh  20160111  Brian P. Walenz
A  src/stores/ovStore.C  nihh  20160111  Brian P. Walenz
A  src/stores/ovStore.H  nihh  20160111  Brian P. Walenz
A  src/stores/ovStoreBucketizer.C  nihh  20160111  Brian P. Walenz
A  src/stores/ovStoreBuild.C  nihh  20160111  Brian P. Walenz
A  src/stores/ovStoreDump.C  nihh  20160111  Brian P. Walenz
A  src/stores/ovStoreFile.C  nihh  20160111  Brian P. Walenz
A  src/stores/ovStoreIndexer.C  nihh  20160111  Brian P. Walenz
A  src/stores/ovStoreSorter.C  nihh  20160111  Brian P. Walenz
A  src/stores/ovStoreStats.C  nihh  20160111  Brian P. Walenz
A  src/stores/tgStore.C  nihh  20160111  Brian P. Walenz
A  src/stores/tgStore.H  nihh  20160111  Brian P. Walenz
A  src/stores/tgStoreCoverageStat.C  nihh  20160111  Brian P. Walenz
A  src/stores/tgStoreDump.C  nihh  20160111  Brian P. Walenz
A  src/stores/tgStoreFilter.C  nihh  20160111  Brian P. Walenz
A  src/stores/tgStoreLoad.C  nihh  20160111  Brian P. Walenz
A  src/stores/tgTig.H  nihh  20160111  Brian P. Walenz
A  src/stores/tgTigDisplay.C  nihh  20160111  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  nihh  20160111  Brian P. Walenz
A  src/stores/tgTigSizeAnalysis.C  nihh  20160111  Brian P. Walenz
A  src/stores/tgTigSizeAnalysis.H  nihh  20160111  Brian P. Walenz
A  src/utgcns/libNDalign/NDalgorithm-extend.C  nihh  20160111  Brian P. Walenz
A  src/utgcns/libNNalign/NNalgorithm.C  nihh  20160111  Brian P. Walenz
A  src/utgcns/libNNalign/NNalign.C  nihh  20160111  Brian P. Walenz
A  src/utgcns/libNNalign/NNalign.H  nihh  20160111  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-applyAlignment.C  nihh  20160111  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-baseCall.C  nihh  20160111  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-mergeRefine.C  nihh  20160111  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-refine.C  nihh  20160111  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-refreshMultiAlign.C  nihh  20160111  Brian P. Walenz
A  src/utgcns/libcns/abAbacus.H  nihh  20160111  Brian P. Walenz
A  src/utgcns/libcns/unitigConsensus.C  nihh  20160111  Brian P. Walenz
A  src/utgcns/libcns/unitigConsensus.H  nihh  20160111  Brian P. Walenz
A  src/utgcns/stashContains.C  nihh  20160111  Brian P. Walenz
A  src/utgcns/utgcns.C  nihh  20160111  Brian P. Walenz
A  src/AS_global.C  nihh  20160111  Brian P. Walenz
A  src/Makefile  nihh  20160111  Brian P. Walenz
A  src/canu_version_update.pl  nihh  20160111  Brian P. Walenz
A  src/pipelines/canu.pl  nihh  20160111  Brian P. Walenz
A  src/pipelines/canu/Defaults.pm  nihh  20160111  Brian P. Walenz
A  src/pipelines/canu/Execution.pm  nihh  20160111  Brian P. Walenz
A  src/pipelines/canu/Execution.pm  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_BestOverlapGraph.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Datatypes.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_FragmentInfo.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_IntersectBubble.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_IntersectBubble.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Logging.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_Logging.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_MergeSplitJoin.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_MergeSplitJoin.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_PlaceContains.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_PlaceContains.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_PlaceFragUsingOverlaps.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_ReconstructRepeats.C  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_RepeatJunctionEvidence.H  nihh  20160111  Brian P. Walenz
A  src/bogart/AS_BAT_SplitDiscontinuous.C  nihh  20160111  Brian P. Walenz
A  src/bogart/bogart.C  nihh  20160111  Brian P. Walenz
A  src/bogart/bogart.mk  nihh  20160111  Brian P. Walenz
A  src/stores/tgStoreFilter.C  nihh  20160108  Brian P. Walenz
A  src/pipelines/canu/Configure.pm  nihh  20160107  Brian P. Walenz
A  src/pipelines/canu/Defaults.pm  nihh  20160107  Brian P. Walenz
A  src/pipelines/canu/Defaults.pm  nihh  20160107  Brian P. Walenz
A  src/pipelines/canu/Consensus.pm  nihh  20160107  Brian P. Walenz
A  src/pipelines/canu/Meryl.pm  nihh  20160107  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-mergeRefine.C  nihh  20160107  Brian P. Walenz
A  src/pipelines/canu/Execution.pm  nihh  20160107  Brian P. Walenz
A  src/pipelines/canu/Configure.pm  nihh  20160107  Brian P. Walenz
A  src/pipelines/canu.pl  nihh  20160106  Sergey Koren
A  src/utgcns/stashContains.C  nihh  20160106  Brian P. Walenz
A  src/utgcns/stashContains.H  nihh  20160106  Brian P. Walenz
A  src/utgcns/libcns/abAbacus.H  nihh  20160105  Brian P. Walenz
A  src/utgcns/libcns/unitigConsensus.C  nihh  20160105  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-applyAlignment.C  nihh  20160105  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-baseCall.C  nihh  20160105  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-mergeRefine.C  nihh  20160105  Brian P. Walenz
A  src/utgcns/libcns/unitigConsensus.C  nihh  20160105  Brian P. Walenz
A  src/utgcns/libcns/unitigConsensus.C  nihh  20160105  Brian P. Walenz
A  src/utgcns/libNDalign/NDalign.C  nihh  20160105  Brian P. Walenz
A  src/stores/tgTig.H  nihh  20160105  Brian P. Walenz
A  src/bogart/AS_BAT_Outputs.C  nihh  20160105  Brian P. Walenz
A  documentation/source/quick-start.rst  nihh  20160105  Sergey Koren
A  src/pipelines/canu/Consensus.pm  nihh  20160105  Brian P. Walenz
A  src/pipelines/canu/Defaults.pm  nihh  20160105  Brian P. Walenz
A  src/utgcns/utgcns.C  nihh  20160105  Brian P. Walenz
A  src/bogart/AS_BAT_Outputs.C  nihh  20160104  Brian P. Walenz
A  src/AS_global.H  nihh  20160104  Brian P. Walenz
A  src/stores/tgStoreFilter.C  nihh  20160104  Brian P. Walenz
A  src/utgcns/libNDFalcon/dw.H  nihh  20160104  Brian P. Walenz
A  src/utgcns/libpbutgcns/Alignment.H  nihh  20160104  Brian P. Walenz
A  src/utgcns/libpbutgcns/Alignment.H  nihh  20160104  Sergey Koren
A  documentation/source/quick-start.rst  nihh  20151230  Sergey Koren
A  documentation/source/tutorial.rst  nihh  20151230  Sergey Koren
A  documentation/source/tutorial.rst  nihh  20151230  Sergey Koren
A  documentation/source/quick-start.rst  nihh  20151229  Sergey Koren
A  src/utgcns/libpbutgcns/LICENSE  nihh  20151229  Sergey Koren
A  documentation/source/quick-start.rst  nihh  20151228  Sergey Koren
A  documentation/source/quick-start.rst  nihh  20151228  Sergey Koren
A  documentation/source/quick-start.rst  nihh  20151228  Sergey Koren
A  src/pipelines/canu/Consensus.pm  nihh  20151228  Sergey Koren
A  src/pipelines/canu/Defaults.pm  nihh  20151228  Sergey Koren
A  src/utgcns/libcns/unitigConsensus.C  nihh  20151228  Sergey Koren
A  src/utgcns/libpbutgcns/SimpleAligner.C  nihh  20151228  Sergey Koren
A  src/utgcns/libpbutgcns/SimpleAligner.H  nihh  20151228  Sergey Koren
A  src/main.mk  nihh  20151228  Sergey Koren
A  src/pipelines/canu.pl  nihh  20151228  Sergey Koren
A  src/pipelines/canu/Configure.pm  nihh  20151228  Sergey Koren
A  src/pipelines/canu/Consensus.pm  nihh  20151228  Sergey Koren
A  src/pipelines/canu/Defaults.pm  nihh  20151228  Sergey Koren
A  src/utgcns/libNDFalcon/LICENSE  nihh  20151228  Sergey Koren
A  src/utgcns/libNDFalcon/dw.C  nihh  20151228  Sergey Koren
A  src/utgcns/libNDFalcon/dw.H  nihh  20151228  Sergey Koren
A  src/utgcns/libcns/unitigConsensus.C  nihh  20151228  Sergey Koren
A  src/utgcns/libcns/unitigConsensus.H  nihh  20151228  Sergey Koren
A  src/utgcns/libpbutgcns/.gitignore  nihh  20151228  Sergey Koren
A  src/utgcns/libpbutgcns/Alignment.C  nihh  20151228  Sergey Koren
A  src/utgcns/libpbutgcns/Alignment.H  nihh  20151228  Sergey Koren
A  src/utgcns/libpbutgcns/AlnGraphBoost.C  nihh  20151228  Sergey Koren
A  src/utgcns/libpbutgcns/AlnGraphBoost.H  nihh  20151228  Sergey Koren
A  src/utgcns/libpbutgcns/LICENSE  nihh  20151228  Sergey Koren
A  src/utgcns/libpbutgcns/README.md  nihh  20151228  Sergey Koren
A  src/utgcns/libpbutgcns/SimpleAligner.C  nihh  20151228  Sergey Koren
A  src/utgcns/libpbutgcns/SimpleAligner.H  nihh  20151228  Sergey Koren
A  src/utgcns/utgcns.C  nihh  20151228  Sergey Koren
A  src/utgcns/utgcns.mk  nihh  20151228  Sergey Koren
A  src/utgcns/libcns/abAbacus-addRead.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-applyAlignment.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-mergeRefine.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-refine.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abAbacus.H  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abMultiAlign.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abSequence.H  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/unitigConsensus.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/unitigConsensus.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-applyAlignment.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-mergeRefine.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-refine.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abBead.H  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abColumn.H  nihh  20151223  Brian P. Walenz
A  src/utgcns/utgcns.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abAbacus.H  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abSequence.H  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-applyAlignment.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/unitigConsensus.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/unitigConsensus.H  nihh  20151223  Brian P. Walenz
A  src/utgcns/utgcns.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-refreshMultiAlign.C  nihh  20151223  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-applyAlignment.C  nihh  20151222  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-baseCall.C  nihh  20151222  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-applyAlignment.C  nihh  20151222  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-baseCall.C  nihh  20151222  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-mergeRefine.C  nihh  20151222  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-refreshMultiAlign.C  nihh  20151222  Brian P. Walenz
A  src/utgcns/libcns/abAbacus.H  nihh  20151222  Brian P. Walenz
A  src/utgcns/libcns/abBead.H  nihh  20151222  Brian P. Walenz
A  src/utgcns/libcns/abColumn.H  nihh  20151222  Brian P. Walenz
A  src/utgcns/libcns/abMultiAlign.C  nihh  20151222  Brian P. Walenz
A  src/stores/tgTigMultiAlignDisplay.C  nihh  20151222  Brian P. Walenz
A  src/pipelines/canu.pl  nihh  20151222  Sergey Koren
A  src/pipelines/canu/Consensus.pm  nihh  20151222  Sergey Koren
A  src/pipelines/canu/Defaults.pm  nihh  20151222  Sergey Koren
A  src/utgcns/libcns/unitigConsensus.C  nihh  20151222  Sergey Koren
A  src/utgcns/libcns/abAbacus-addRead.C  nihh  20151221  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-appendBases.C  nihh  20151221  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-applyAlignment.C  nihh  20151221  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-baseCall.C  nihh  20151221  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-mergeRefine.C  nihh  20151221  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-refine.C  nihh  20151221  Brian P. Walenz
A  src/utgcns/libcns/abAbacus-refreshMultiAlign.C  nihh  20151221  Brian P. Walenz
A  src/utgcns/libcns/abAbacus.C  nihh  20151221  Brian P. Walenz
A  src/utgcns/libcns/abAbacus.H  nihh  20151221  Brian P. Walenz
A  src/utgcns/libcns/abBead.H  nihh  20151221  Brian P. Walenz
A  src/utgcns/libcns/abColumn.H  nihh  20151221  Brian P. Walenz
A  src/utgcns/libcns/abMultiAlign.C  nihh  20151221  Brian P. Walenz
A  src/utgcns/libcns/abSequence.H  nihh  20151221  Brian P. Walenz
A  src/utgcns/libcns/unitigConsensus.C  nihh  20151221  Brian P. Walenz
A  src/utgcns/libcns/unitigConsensus.H  nihh  20151221  Brian P. Walenz
A  src/utgcns/utgcns.C  nihh  20151221  Brian P. Walenz
A  src/pipelines/canu/OverlapErrorAdjustment.pm  nihh  20151221  Brian P. Walenz
A  src/bogart/bogart.mk  nihh  20151221  Brian P. Walenz
A  src/bogus/bogus.mk  nihh  20151221  Brian P. Walenz
A  src/bogus/bogusness.mk  nihh  20151221  Brian P. Walenz
A  src/correction/filterCorrectionOverlaps.mk  nihh  20151221  Brian P. Walenz
A  src/correction/generateCorrectionLayouts.mk  nihh  20151221  Brian P. Walenz
A  src/correction/readConsensus.mk  nihh  20151221  Brian P. Walenz
A  src/erateEstimate/erateEstimate.mk  nihh  20151221  Brian P. Walenz
A  src/falcon_sense/createFalconSenseInputs.mk  nihh  20151221  Brian P. Walenz
A  src/fastq-utilities/fastqAnalyze.mk  nihh  20151221  Brian P. Walenz
A  src/fastq-utilities/fastqSample.mk  nihh  20151221  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate-sort.mk  nihh  20151221  Brian P. Walenz
A  src/fastq-utilities/fastqSimulate.mk  nihh  20151221  Brian P. Walenz
A  src/main.mk  nihh  20151221  Brian P. Walenz
A  src/meryl/estimate-mer-threshold.mk  nihh  20151221  Brian P. Walenz
A  src/meryl/leaff.mk  nihh  20151221  Brian P. Walenz
A  src/meryl/meryl.mk  nihh  20151221  Brian P. Walenz
A  src/meryl/simple.mk  nihh  20151221  Brian P. Walenz
A  src/mhap/mhapConvert.mk  nihh  20151221  Brian P. Walenz
A  src/overlapBasedTrimming/splitReads.mk  nihh  20151221  Brian P. Walenz
A  src/overlapBasedTrimming/trimReads.mk  nihh  20151221  Brian P. Walenz
A  src/overlapErrorAdjustment/correctOverlaps.mk  nihh  20151221  Brian P. Walenz
A  src/overlapErrorAdjustment/findErrors.mk  nihh  20151221  Brian P. Walenz
A  src/overlapInCore/liboverlap/prefixEditDistance-matchLimitGenerate.mk  nihh  20151221  Brian P. Walenz
A  src/overlapInCore/overlapConvert.mk  nihh  20151221  Brian P. Walenz
A  src/overlapInCore/overlapImport.mk  nihh  20151221  Brian P. Walenz
A  src/overlapInCore/overlapInCore.mk  nihh  20151221  Brian P. Walenz
A  src/overlapInCore/overlapInCorePartition.mk  nihh  20151221  Brian P. Walenz
A  src/overlapInCore/overlapPair.mk  nihh  20151221  Brian P. Walenz
A  src/stores/gatekeeperCreate.mk  nihh  20151221  Brian P. Walenz
A  src/stores/gatekeeperDumpFASTQ.mk  nihh  20151221  Brian P. Walenz
A  src/stores/gatekeeperDumpMetaData.mk  nihh  20151221  Brian P. Walenz
A  src/stores/gatekeeperPartition.mk  nihh  20151221  Brian P. Walenz
A  src/stores/ovStoreBucketizer.mk  nihh  20151221  Brian P. Walenz
A  src/stores/ovStoreBuild.mk  nihh  20151221  Brian P. Walenz
A  src/stores/ovStoreDump.mk  nihh  20151221  Brian P. Walenz
A  src/stores/ovStoreIndexer.mk  nihh  20151221  Brian P.
Walenz A src/stores/ovStoreSorter.mk nihh20151221Brian P. Walenz A src/stores/ovStoreStats.mk nihh20151221Brian P. Walenz A src/stores/tgStoreCoverageStat.mk nihh20151221Brian P. Walenz A src/stores/tgStoreDump.mk nihh20151221Brian P. Walenz A src/stores/tgStoreFilter.mk nihh20151221Brian P. Walenz A src/stores/tgStoreLoad.mk nihh20151221Brian P. Walenz A src/stores/tgTigDisplay.mk nihh20151221Brian P. Walenz A src/utgcns/utgcns.mk nihh20151221Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm nihh20151221Brian P. Walenz A src/utgcns/libcns/abAbacus-addRead.C nihh20151218Brian P. Walenz A src/utgcns/libcns/abAbacus-appendBases.C nihh20151218Brian P. Walenz A src/utgcns/libcns/abAbacus-applyAlignment.C nihh20151218Brian P. Walenz A src/utgcns/libcns/abAbacus-baseCall.C nihh20151218Brian P. Walenz A src/utgcns/libcns/abAbacus-mergeRefine.C nihh20151218Brian P. Walenz A src/utgcns/libcns/abAbacus-refine.C nihh20151218Brian P. Walenz A src/utgcns/libcns/abAbacus-refreshMultiAlign.C nihh20151218Brian P. Walenz A src/utgcns/libcns/abAbacus.C nihh20151218Brian P. Walenz A src/utgcns/libcns/abAbacus.H nihh20151218Brian P. Walenz A src/utgcns/libcns/abBead.H nihh20151218Brian P. Walenz A src/utgcns/libcns/abColumn.C nihh20151218Brian P. Walenz A src/utgcns/libcns/abColumn.H nihh20151218Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C nihh20151218Brian P. Walenz A src/utgcns/libcns/abSequence.H nihh20151218Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20151218Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H nihh20151218Brian P. Walenz A src/utgcns/utgcns.C nihh20151218Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20151217Sergey Koren A src/pipelines/canu/OverlapStore.pm nihh20151217Sergey Koren A documentation/source/quick-start.rst nihh20151217Sergey Koren A src/utgcns/libcns/unitigConsensus.C nihh20151217Sergey Koren A src/pipelines/canu/Configure.pm nihh20151217Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20151216Brian P. 
Walenz A src/pipelines/canu/OverlapStore.pm nihh20151216Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20151216Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20151216Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20151216Sergey Koren A src/pipelines/canu/Unitig.pm nihh20151216Sergey Koren A src/pipelines/canu/Consensus.pm nihh20151216Sergey Koren A src/utgcns/libcns/abAbacus-applyAlignment.C nihh20151215Brian P. Walenz A src/pipelines/canu.pl nihh20151215Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20151215Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20151215Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151215Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20151215Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20151215Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20151215Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20151215Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20151215Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20151215Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20151215Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20151215Sergey Koren A src/AS_UTL/memoryMappedFile.H nihh20151215Sergey Koren A src/stores/ovStore.C nihh20151215Sergey Koren A src/pipelines/canu/Execution.pm nihh20151214Sergey Koren A src/pipelines/canu/Grid_PBSTorque.pm nihh20151214Sergey Koren A src/pipelines/canu/Grid_Slurm.pm nihh20151214Sergey Koren A documentation/source/quick-start.rst nihh20151211Sergey Koren A src/utgcns/libcns/abAbacus-addRead.C nihh20151211Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20151209Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20151209Sergey Koren A src/pipelines/canu/Configure.pm nihh20151209Brian P. 
Walenz A documentation/source/command-reference.rst nihh20151209Sergey Koren A documentation/source/commands/bogart.rst nihh20151209Sergey Koren A documentation/source/commands/bogus.rst nihh20151209Sergey Koren A documentation/source/commands/canu.rst nihh20151209Sergey Koren A documentation/source/commands/correctOverlaps.rst nihh20151209Sergey Koren A documentation/source/commands/createFalconSenseInputs.rst nihh20151209Sergey Koren A documentation/source/commands/erateEstimate.rst nihh20151209Sergey Koren A documentation/source/commands/estimate-mer-threshold.rst nihh20151209Sergey Koren A documentation/source/commands/fastqAnalyze.rst nihh20151209Sergey Koren A documentation/source/commands/fastqSample.rst nihh20151209Sergey Koren A documentation/source/commands/fastqSimulate-sort.rst nihh20151209Sergey Koren A documentation/source/commands/fastqSimulate.rst nihh20151209Sergey Koren A documentation/source/commands/filterCorrectionOverlaps.rst nihh20151209Sergey Koren A documentation/source/commands/findErrors.rst nihh20151209Sergey Koren A documentation/source/commands/gatekeeperCreate.rst nihh20151209Sergey Koren A documentation/source/commands/gatekeeperDumpFASTQ.rst nihh20151209Sergey Koren A documentation/source/commands/gatekeeperDumpMetaData.rst nihh20151209Sergey Koren A documentation/source/commands/gatekeeperPartition.rst nihh20151209Sergey Koren A documentation/source/commands/generateCorrectionLayouts.rst nihh20151209Sergey Koren A documentation/source/commands/leaff.rst nihh20151209Sergey Koren A documentation/source/commands/meryl.rst nihh20151209Sergey Koren A documentation/source/commands/mhapConvert.rst nihh20151209Sergey Koren A documentation/source/commands/ovStoreBucketizer.rst nihh20151209Sergey Koren A documentation/source/commands/ovStoreBuild.rst nihh20151209Sergey Koren A documentation/source/commands/ovStoreDump.rst nihh20151209Sergey Koren A documentation/source/commands/ovStoreIndexer.rst nihh20151209Sergey Koren A 
documentation/source/commands/ovStoreSorter.rst nihh20151209Sergey Koren A documentation/source/commands/overlapConvert.rst nihh20151209Sergey Koren A documentation/source/commands/overlapImport.rst nihh20151209Sergey Koren A documentation/source/commands/overlapInCore.rst nihh20151209Sergey Koren A documentation/source/commands/overlapInCorePartition.rst nihh20151209Sergey Koren A documentation/source/commands/overlapPair.rst nihh20151209Sergey Koren A documentation/source/commands/prefixEditDistance-matchLimitGenerate.rst nihh20151209Sergey Koren A documentation/source/commands/readConsensus.rst nihh20151209Sergey Koren A documentation/source/commands/simple.rst nihh20151209Sergey Koren A documentation/source/commands/splitReads.rst nihh20151209Sergey Koren A documentation/source/commands/tgStoreCoverageStat.rst nihh20151209Sergey Koren A documentation/source/commands/tgStoreDump.rst nihh20151209Sergey Koren A documentation/source/commands/tgStoreFilter.rst nihh20151209Sergey Koren A documentation/source/commands/tgStoreLoad.rst nihh20151209Sergey Koren A documentation/source/commands/tgTigDisplay.rst nihh20151209Sergey Koren A documentation/source/commands/trimReads.rst nihh20151209Sergey Koren A documentation/source/commands/utgcns.rst nihh20151209Sergey Koren A documentation/source/conf.py nihh20151209Sergey Koren A documentation/source/index.rst nihh20151209Sergey Koren A documentation/source/parameter-reference.rst nihh20151209Sergey Koren A documentation/source/quick-start.rst nihh20151209Sergey Koren A documentation/source/tutorial.rst nihh20151209Sergey Koren A src/stores/gkStore.C nihh20151209Sergey Koren A src/stores/gatekeeperDumpFASTQ.C nihh20151208Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20151208Brian P. Walenz A src/stores/gkStore.C nihh20151208Brian P. Walenz A src/utgcns/libcns/abAbacus-applyAlignment.C nihh20151208Brian P. 
Walenz A src/pipelines/canu/OverlapStore.pm nihh20151208Sergey Koren A src/pipelines/canu/CorrectReads.pm nihh20151208Brian P. Walenz A src/falcon_sense/falcon_sense.py nihh20151208Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20151207Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20151207Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20151207Brian P. Walenz A src/pipelines/canu.pl nihh20151207Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151207Brian P. Walenz A src/pipelines/canu/Grid.pm nihh20151207Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151207Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20151207Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20151207Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20151207Brian P. Walenz A src/pipelines/canu/Grid_LSF.pm nihh20151207Brian P. Walenz A src/pipelines/canu/Grid_PBSTorque.pm nihh20151207Brian P. Walenz A src/pipelines/canu/Grid_SGE.pm nihh20151207Brian P. Walenz A src/pipelines/canu/Grid_Slurm.pm nihh20151207Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20151207Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20151207Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20151207Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20151207Brian P. Walenz A src/bogart/addReadsToUnitigs.C nihh20151207Brian P. Walenz A src/bogart/analyzeBest.C nihh20151207Brian P. Walenz A src/bogart/bogart.C nihh20151207Brian P. Walenz A src/correction/filterCorrectionOverlaps.C nihh20151207Brian P. Walenz A src/correction/generateCorrectionLayouts.C nihh20151207Brian P. Walenz A src/correction/readConsensus.C nihh20151207Brian P. Walenz A src/erateEstimate/erateEstimate.C nihh20151207Brian P. Walenz A src/falcon_sense/createFalconSenseInputs.C nihh20151207Brian P. Walenz A src/merTrim/merTrim.C nihh20151207Brian P. Walenz A src/merTrim/merTrimApply.C nihh20151207Brian P. Walenz A src/meryl/leaff-simulate.C nihh20151207Brian P. 
Walenz A src/meryl/leaff.C nihh20151207Brian P. Walenz A src/meryl/libleaff/gkStoreFile.C nihh20151207Brian P. Walenz A src/overlapBasedTrimming/splitReads.C nihh20151207Brian P. Walenz A src/overlapBasedTrimming/trimReads.C nihh20151207Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps.C nihh20151207Brian P. Walenz A src/overlapErrorAdjustment/findErrors.C nihh20151207Brian P. Walenz A src/overlapInCore/overlapConvert.C nihh20151207Brian P. Walenz A src/overlapInCore/overlapImport.C nihh20151207Brian P. Walenz A src/overlapInCore/overlapInCore.C nihh20151207Brian P. Walenz A src/overlapInCore/overlapInCorePartition.C nihh20151207Brian P. Walenz A src/overlapInCore/overlapPair.C nihh20151207Brian P. Walenz A src/stores/gatekeeperCreate.C nihh20151207Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C nihh20151207Brian P. Walenz A src/stores/gatekeeperDumpMetaData.C nihh20151207Brian P. Walenz A src/stores/gatekeeperPartition.C nihh20151207Brian P. Walenz A src/stores/gkStore.C nihh20151207Brian P. Walenz A src/stores/gkStore.H nihh20151207Brian P. Walenz A src/stores/ovStoreBucketizer.C nihh20151207Brian P. Walenz A src/stores/ovStoreBuild.C nihh20151207Brian P. Walenz A src/stores/ovStoreDump.C nihh20151207Brian P. Walenz A src/stores/ovStoreStats.C nihh20151207Brian P. Walenz A src/stores/tgStoreCoverageStat.C nihh20151207Brian P. Walenz A src/stores/tgStoreDump.C nihh20151207Brian P. Walenz A src/stores/tgStoreFilter.C nihh20151207Brian P. Walenz A src/stores/tgStoreLoad.C nihh20151207Brian P. Walenz A src/stores/tgTigDisplay.C nihh20151207Brian P. Walenz A src/utgcns/utgcns.C nihh20151207Brian P. Walenz A src/stores/gatekeeperCreate.C nihh20151207Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C nihh20151207Brian P. Walenz A src/stores/gkStore.C nihh20151207Brian P. Walenz A src/stores/gkStore.H nihh20151207Brian P. Walenz A src/stores/gkStoreEncode.C nihh20151207Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.C nihh20151207Brian P. 
Walenz A src/pipelines/canu/Execution.pm nihh20151204Sergey Koren A src/pipelines/canu/Meryl.pm nihh20151204Sergey Koren A src/pipelines/canu/Execution.pm nihh20151204Sergey Koren A src/stores/gkStore.C nihh20151204Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20151204Sergey Koren A src/pipelines/canu/Execution.pm nihh20151204Sergey Koren A src/utgcns/libcns/abAbacus-applyAlignment.C nihh20151203Brian P. Walenz A src/utgcns/libcns/abAbacus.H nihh20151203Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20151203Brian P. Walenz A src/stores/tgTigMultiAlignDisplay.C nihh20151203Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C nihh20151203Brian P. Walenz A src/main.mk nihh20151203Brian P. Walenz A src/stores/gatekeeperCreate.C nihh20151203Brian P. Walenz A src/stores/gkStore.C nihh20151203Brian P. Walenz A src/stores/gkStore.H nihh20151203Brian P. Walenz A src/stores/gkStoreEncode.C nihh20151203Brian P. Walenz A src/fastq-utilities/fastqAnalyze.C nihh20151203Brian P. Walenz A src/pipelines/canu/Output.pm nihh20151202Sergey Koren A documentation/source/quick-start.rst nihh20151202Sergey Koren A documentation/source/quick-start.rst nihh20151202Sergey Koren A documentation/source/quick-start.rst nihh20151202Sergey Koren A documentation/source/quick-start.rst nihh20151202Sergey Koren A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20151202Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20151202Sergey Koren A src/pipelines/canu/CorrectReads.pm nihh20151202Sergey Koren A src/pipelines/canu.pl nihh20151201Sergey Koren A src/pipelines/canu.pl nihh20151201Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20151201Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20151201Brian P. Walenz A src/pipelines/canu.pl nihh20151201Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151201Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20151201Brian P. Walenz A src/pipelines/canu.pl nihh20151201Brian P. 
Walenz A src/pipelines/canu/Execution.pm nihh20151201Brian P. Walenz A src/stores/tgTig.H nihh20151201Brian P. Walenz A src/utgcns/libcns/abMultiAlign.C nihh20151201Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20151201Brian P. Walenz A src/utgcns/stashContains.C nihh20151201Brian P. Walenz A src/utgcns/utgcns.C nihh20151201Brian P. Walenz A src/stores/tgTig.C nihh20151130Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20151130Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151130Brian P. Walenz A src/pipelines/canu/Grid_LSF.pm nihh20151130Sergey Koren A src/pipelines/canu/Grid_PBSTorque.pm nihh20151130Sergey Koren A src/pipelines/canu/Grid_SGE.pm nihh20151130Sergey Koren A src/pipelines/canu/Grid_Slurm.pm nihh20151130Sergey Koren A src/pipelines/canu/Defaults.pm nihh20151130Sergey Koren A src/pipelines/canu/Execution.pm nihh20151130Sergey Koren A src/pipelines/canu/Grid_LSF.pm nihh20151130Sergey Koren A src/pipelines/canu/Defaults.pm nihh20151130Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20151130Sergey Koren A src/pipelines/canu/Meryl.pm nihh20151130Sergey Koren A src/pipelines/canu/Grid_Slurm.pm nihh20151130Sergey Koren A src/pipelines/canu.pl nihh20151127Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20151127Brian P. Walenz A src/pipelines/canu/Grid_SGE.pm nihh20151127Brian P. Walenz A src/pipelines/canu/Grid_LSF.pm nihh20151127Brian P. Walenz A src/pipelines/canu/Grid_PBSTorque.pm nihh20151127Brian P. Walenz A src/pipelines/canu/Grid_Slurm.pm nihh20151127Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151127Sergey Koren A src/pipelines/canu/Meryl.pm nihh20151127Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20151127Sergey Koren A src/Makefile nihh20151127Brian P. Walenz A src/pipelines/canu.pl nihh20151127Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20151127Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151127Brian P. Walenz A src/pipelines/canu/Grid.pm nihh20151127Brian P. 
Walenz A src/pipelines/canu/Grid_LSF.pm nihh20151127Brian P. Walenz A src/pipelines/canu/Grid_PBSTorque.pm nihh20151127Brian P. Walenz A src/pipelines/canu/Grid_SGE.pm nihh20151127Brian P. Walenz A src/pipelines/canu/Grid_Slurm.pm nihh20151127Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20151127Brian P. Walenz A src/pipelines/canu.pl nihh20151127Sergey Koren A src/correction/generateCorrectionLayouts.C nihh20151127Brian P. Walenz A src/pipelines/canu.pl nihh20151127Sergey Koren A src/pipelines/canu/Meryl.pm nihh20151127Sergey Koren A src/pipelines/canu/Unitig.pm nihh20151127Sergey Koren A src/pipelines/canu/Defaults.pm nihh20151127Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20151127Brian P. Walenz A src/stores/tgTig.C nihh20151125Sergey Koren A src/pipelines/canu/Defaults.pm nihh20151125Sergey Koren A src/pipelines/canu/CorrectReads.pm nihh20151125Sergey Koren A src/pipelines/canu/Defaults.pm nihh20151125Sergey Koren A src/pipelines/canu/CorrectReads.pm nihh20151125Sergey Koren A src/pipelines/canu/Defaults.pm nihh20151125Sergey Koren A src/pipelines/canu/Defaults.pm nihh20151125Sergey Koren A README.md nihh20151125Sergey Koren A src/pipelines/canu/Defaults.pm nihh20151125Sergey Koren A src/pipelines/canu/Execution.pm nihh20151125Sergey Koren A src/utgcns/libcns/abAbacus-appendBases.C nihh20151125Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20151124Sergey Koren A src/utgcns/utgcns.C nihh20151124Brian P. Walenz A src/stores/tgStoreDump.C nihh20151124Brian P. Walenz A src/stores/gatekeeperDumpMetaData.C nihh20151124Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151124Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151124Brian P. Walenz A src/stores/gatekeeperCreate.C nihh20151123Brian P. Walenz A src/stores/gkStore.C nihh20151123Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.C nihh20151123Brian P. Walenz A src/AS_UTL/AS_UTL_fasta.H nihh20151123Brian P. Walenz A src/merTrim/merTrim.C nihh20151123Brian P. 
Walenz A src/overlapBasedTrimming/trimReads-quality.C nihh20151123Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C nihh20151123Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C nihh20151123Brian P. Walenz A src/overlapInCore/overlapInCore.H nihh20151123Brian P. Walenz A src/stores/gatekeeperCreate.C nihh20151123Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C nihh20151123Brian P. Walenz A src/stores/gkStore.C nihh20151123Brian P. Walenz A src/stores/gkStore.H nihh20151123Brian P. Walenz A src/stores/tgStoreDump.C nihh20151123Brian P. Walenz A src/stores/tgTig.C nihh20151123Brian P. Walenz A src/stores/tgTig.H nihh20151123Brian P. Walenz A src/utgcns/libcns/abAbacus-addRead.C nihh20151123Brian P. Walenz A src/utgcns/libcns/abAbacus-baseCall.C nihh20151123Brian P. Walenz A src/utgcns/libcns/abAbacus.C nihh20151123Brian P. Walenz A src/utgcns/libcns/abAbacus.H nihh20151123Brian P. Walenz A src/utgcns/utgcns.C nihh20151123Brian P. Walenz A src/main.mk nihh20151123Brian P. Walenz A src/utgcns/libcns/abAbacus-appendBases.C nihh20151123Brian P. Walenz A src/utgcns/libcns/abAbacus.H nihh20151123Brian P. Walenz A src/utgcns/libcns/abIDs.H nihh20151123Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20151123Brian P. Walenz A src/utgcns/libcns/abAbacus-applyAlignment.C nihh20151123Brian P. Walenz A src/utgcns/libcns/abSequence.H nihh20151123Brian P. Walenz A src/stores/tgTig.H nihh20151123Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20151121Sergey Koren A src/pipelines/canu.pl nihh20151121Sergey Koren A src/pipelines/canu/Defaults.pm nihh20151121Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20151120Sergey Koren A src/pipelines/canu/Defaults.pm nihh20151120Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20151120Sergey Koren A src/pipelines/canu.pl nihh20151120Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20151120Brian P. 
Walenz A src/overlapInCore/overlapInCore.C nihh20151120Sergey Koren A src/overlapInCore/overlapInCore.H nihh20151120Sergey Koren A src/pipelines/canu/Defaults.pm nihh20151120Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151120Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20151120Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20151120Brian P. Walenz A src/utgcns/libcns/abAbacus-applyAlignment.C nihh20151120Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20151120Brian P. Walenz A src/utgcns/libcns/unitigConsensus.H nihh20151120Brian P. Walenz A src/utgcns/utgcns.C nihh20151120Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20151119Sergey Koren A src/pipelines/canu.pl nihh20151119Sergey Koren A src/pipelines/canu/CorrectReads.pm nihh20151119Sergey Koren A src/pipelines/canu/Defaults.pm nihh20151119Sergey Koren A src/pipelines/canu/Meryl.pm nihh20151119Sergey Koren A src/pipelines/canu/Defaults.pm nihh20151119Sergey Koren D src/bogart/AS_BAT_PromoteToSingleton.H bogart/AS_BAT_BreakRepeats.H D src/stores/tgStoreCompress.C src/stores/tgStore.C D src/bogart/AS_BAT_FragmentInfo.H src/bogart/AS_BAT_Datatypes.H A src/overlapErrorAdjustment/findErrors-Analyze_Alignment.C nihh20160518Brian P. Walenz A src/utgcns/libcns/NOTES nihh20160518Brian P. Walenz A src/bogart/AS_BAT_MergeUnitigs.C nihh20160517Brian P. Walenz A src/bogart/AS_BAT_MergeUnitigs.H nihh20160517Brian P. Walenz A src/bogart/bogart.C nihh20160517Brian P. Walenz A src/bogart/bogart.mk nihh20160517Brian P. Walenz A src/Makefile nihh20160516Sergey Koren A src/correction/errorEstimate.C nihh20160516Sergey Koren A src/correction/errorEstimate.mk nihh20160516Sergey Koren A src/main.mk nihh20160516Sergey Koren A src/pipelines/canu.pl nihh20160516Sergey Koren A src/pipelines/canu/Defaults.pm nihh20160516Sergey Koren A src/pipelines/canu/ErrorEstimate.pm nihh20160516Sergey Koren A src/pipelines/canu/Execution.pm nihh20160516Brian P. 
Walenz A src/pipelines/canu/Grid_LSF.pm nihh20160516Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160516Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20160516Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20160516Brian P. Walenz A src/stores/ovStoreBuild.C nihh20160516Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160516Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160513Brian P. Walenz A src/pipelines/canu/Grid_Slurm.pm nihh20160513Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C nihh20160513Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160513Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160513Brian P. Walenz A src/bogart/bogart.C nihh20160513Brian P. Walenz A src/bogart/bogart.mk nihh20160513Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160513Brian P. Walenz A src/bogart/AS_BAT_Outputs.H nihh20160513Brian P. Walenz A src/bogart/bogart.C nihh20160513Brian P. Walenz A src/bogart/buildGraph.C nihh20160513Brian P. Walenz A src/pipelines/canu/Output.pm nihh20160513Brian P. Walenz A documentation/source/faq.rst nihh20160512Sergey Koren A src/bogart/AS_BAT_Unitig.H nihh20160510Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160510Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160510Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160510Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H nihh20160510Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160510Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20160510Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20160510Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C nihh20160509Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H nihh20160509Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20160509Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20160509Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20160509Brian P. 
Walenz A documentation/source/parameter-reference.rst nihh20160506Sergey Koren A documentation/source/conf.py nihh20160506Sergey Koren A documentation/source/faq.rst nihh20160506Sergey Koren A documentation/source/index.rst nihh20160506Sergey Koren A documentation/source/parameter-reference.rst nihh20160506Sergey Koren A documentation/source/commands/bogart.rst nihh20160506Sergey Koren A documentation/source/commands/gatekeeperDumpFASTQ.rst nihh20160506Sergey Koren A documentation/source/commands/ovStoreDump.rst nihh20160506Sergey Koren A documentation/source/index.rst nihh20160506Sergey Koren A documentation/source/parameter-reference.rst nihh20160506Sergey Koren A documentation/source/tutorial.rst nihh20160506Sergey Koren A documentation/source/quick-start.rst nihh20160506Sergey Koren A src/pipelines/canu/CorrectReads.pm nihh20160504Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20160504Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20160504Brian P. Walenz A src/stores/tgStoreCoverageStat.C nihh20160504Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160504Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160504Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160504Brian P. Walenz A src/stores/tgStoreCoverageStat.C nihh20160503Brian P. Walenz A src/pipelines/canu.pl nihh20160502Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps-Correct_Frags.C nihh20160502Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps-Prefix_Edit_Distance.C nihh20160502Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps-Read_Olaps.C nihh20160502Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps-Redo_Olaps.C nihh20160502Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps.C nihh20160502Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps.H nihh20160502Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20160502Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20160502Brian P. 
Walenz A src/pipelines/canu/Configure.pm nihh20160502Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20160502Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20160502Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20160502Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20160502Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20160502Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20160502Brian P. Walenz A README.md nihh20160428Brian P. Walenz A README.md nihh20160428Brian P. Walenz A src/stores/ovStoreDump.C nihh20160428Brian P. Walenz A src/stores/ovStoreDump.C nihh20160428Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20160428Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20160428Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.H nihh20160428Brian P. Walenz A src/bogart/bogart.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20160428Brian P. Walenz A addCopyrights.dat nihh20160428Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160428Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.H nihh20160428Brian P. Walenz A src/bogart/AS_BAT_FragmentInfo.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_FragmentInfo.H nihh20160428Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_Joining.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20160428Brian P. 
Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.H nihh20160428Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H nihh20160428Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.H nihh20160428Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_PromoteToSingleton.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_ReconstructRepeats.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.H nihh20160428Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160428Brian P. Walenz A src/bogart/AS_BAT_UnitigVector.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_UnitigVector.H nihh20160428Brian P. Walenz A src/bogart/AS_BAT_Unitig_AddFrag.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C nihh20160428Brian P. Walenz A src/bogart/bogart.C nihh20160428Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160427Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160427Brian P. Walenz A src/bogart/AS_BAT_Outputs.H nihh20160427Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C nihh20160427Brian P. Walenz A src/bogart/bogart.C nihh20160427Brian P. 
Walenz A src/pipelines/canu/OverlapStore.pm nihh20160427Sergey Koren A src/AS_UTL/AS_UTL_stackTrace.C nihh20160426Sergey Koren A src/Makefile nihh20160426Sergey Koren A src/bogart/AS_BAT_OverlapCache.C nihh20160426Sergey Koren A src/falcon_sense/falcon_sense.C nihh20160426Sergey Koren A src/pipelines/canu/Defaults.pm nihh20160426Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20160426Sergey Koren A src/utgcns/utgcns.C nihh20160426Sergey Koren A documentation/source/quick-start.rst nihh20160426Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160413Brian P. Walenz A src/AS_global.H nihh20160426Brian P. Walenz A src/pipelines/canu.pl nihh20160426Brian P. Walenz A src/AS_UTL/stddev.H nihh20160426Brian P. Walenz A src/AS_UTL/stddev.H nihh20160426Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20160426Sergey Koren A documentation/source/parameter-reference.rst nihh20160426Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20160426Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160426Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20160426Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20160426Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160426Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20160426Brian P. Walenz A src/stores/gkStore.C nihh20160426Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20160426Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160426Brian P. Walenz A src/pipelines/canu/Grid_LSF.pm nihh20160426Brian P. Walenz A src/pipelines/canu/Grid_PBSTorque.pm nihh20160426Brian P. Walenz A src/pipelines/canu/Grid_SGE.pm nihh20160426Brian P. Walenz A src/pipelines/canu/Grid_Slurm.pm nihh20160426Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm nihh20160426Brian P. Walenz A src/pipelines/canu.pl nihh20160426Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C nihh20160426Brian P. Walenz A documentation/source/quick-start.rst nihh20160425Brian P. 
Walenz A src/pipelines/canu/Defaults.pm nihh20160422Sergey Koren A src/pipelines/canu/Unitig.pm nihh20160422Sergey Koren A src/pipelines/canu/CorrectReads.pm nihh20160422Sergey Koren A src/pipelines/canu/Defaults.pm nihh20160422Sergey Koren A src/pipelines/canu/OverlapMMap.pm nihh20160422Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20160422Sergey Koren A src/pipelines/canu.pl nihh20160422Sergey Koren A src/pipelines/canu/Defaults.pm nihh20160422Sergey Koren A src/pipelines/canu/Unitig.pm nihh20160422Sergey Koren A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160422Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160422Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160422Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160422Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.H nihh20160422Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160422Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.H nihh20160422Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160422Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.H nihh20160422Brian P. Walenz A src/bogart/AS_BAT_ReconstructRepeats.C nihh20160422Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160422Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160422Brian P. Walenz A src/bogart/AS_BAT_UnitigVector.C nihh20160422Brian P. Walenz A src/bogart/AS_BAT_UnitigVector.H nihh20160422Brian P. Walenz A src/bogart/bogart.C nihh20160422Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20160422Sergey Koren A src/pipelines/canu/HTML.pm nihh20160420Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160420Sergey Koren A src/overlapInCore/overlapPair.C nihh20160420Sergey Koren A src/stores/ovStoreStats.C nihh20160420Sergey Koren A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160420Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160419Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160419Brian P. 
Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160419Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160418Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160418Brian P. Walenz A src/bogart/AS_BAT_Logging.C nihh20160418Brian P. Walenz A src/bogart/AS_BAT_Logging.H nihh20160418Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160418Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160418Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160418Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.C nihh20160418Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C nihh20160418Brian P. Walenz A src/bogart/bogart.C nihh20160418Brian P. Walenz A src/stores/ovStoreDump.C nihh20160418Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160418Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160418Brian P. Walenz A addCopyrights.dat nihh20160418Brian P. Walenz A src/main.mk nihh20160418Brian P. Walenz A src/stores/tgStoreCompress.C nihh20160418Brian P. Walenz A src/stores/tgStoreCompress.mk nihh20160418Brian P. Walenz A src/pipelines/canu/CorrectReads.txt nihh20160418Brian P. Walenz A src/pipelines/canu/Execution.txt nihh20160418Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160418Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160418Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.txt nihh20160418Brian P. Walenz A src/AS_UTL/stddevTest.C nihh20160418Brian P. Walenz A documentation/source/quick-start.rst nihh20160415Sergey Koren A src/pipelines/canu/Unitig.pm nihh20160413Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160413Brian P. Walenz A addCopyrights.dat nihh20160413Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160413Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.H nihh20160413Brian P. Walenz A src/bogart/AS_BAT_PromoteToSingleton.H nihh20160413Brian P. Walenz A src/bogart/bogart.C nihh20160413Brian P. Walenz A src/bogart/bogart.mk nihh20160413Brian P. 
Walenz A src/bogart/AS_BAT_Outputs.C nihh20160413Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160413Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160413Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160413Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160413Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160413Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H nihh20160413Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160413Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20160413Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C nihh20160413Brian P. Walenz A src/bogart/bogart.mk nihh20160413Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160412Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160412Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160412Brian P. Walenz A src/bogart/bogart.C nihh20160412Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160412Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160411Sergey Koren A src/stores/tgStoreLoad.C nihh20160411Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160408Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160408Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160408Brian P. Walenz A src/bogart/bogart.C nihh20160408Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20160408Sergey Koren A src/pipelines/canu/OverlapMMap.pm nihh20160408Sergey Koren A src/pipelines/canu/Execution.pm nihh20160408Sergey Koren A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160408Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C nihh20160407Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C nihh20160407Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160407Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H nihh20160407Brian P. Walenz A src/Makefile nihh20160406Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160406Brian P. 
Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160406Brian P. Walenz A src/bogart/AS_BAT_Logging.C nihh20160406Brian P. Walenz A src/bogart/AS_BAT_Logging.H nihh20160406Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160406Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.H nihh20160406Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160406Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.H nihh20160406Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C nihh20160406Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160406Brian P. Walenz A src/bogart/AS_BAT_ReconstructRepeats.C nihh20160406Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160406Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160406Brian P. Walenz A src/bogart/AS_BAT_UnitigVector.C nihh20160406Brian P. Walenz A src/bogart/AS_BAT_UnitigVector.H nihh20160406Brian P. Walenz A src/bogart/bogart.C nihh20160406Brian P. Walenz A src/bogart/bogart.mk nihh20160406Brian P. Walenz A src/AS_UTL/stddev.H nihh20160406Brian P. Walenz A src/AS_UTL/stddev.H nihh20160331Sergey Koren A src/stores/ovStoreStats.C nihh20160331Sergey Koren A src/pipelines/canu/Defaults.pm nihh20160330Sergey Koren A src/pipelines/canu.pl nihh20160330Sergey Koren A src/bogart/AS_BAT_Outputs.C nihh20160330Sergey Koren A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20160330Sergey Koren A src/overlapErrorAdjustment/correctOverlaps.C nihh20160330Sergey Koren A src/pipelines/canu/Configure.pm nihh20160330Sergey Koren A src/AS_UTL/stddev.H nihh20160330Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160330Brian P. Walenz A src/main.mk nihh20160330Brian P. 
Walenz A src/pipelines/canu/Unitig.pm nihh20160329Sergey Koren A src/overlapInCore/overlapInCore-Process_String_Overlaps.C nihh20160323Sergey Koren A src/overlapInCore/overlapInCore.C nihh20160323Sergey Koren A src/overlapInCore/overlapInCore.H nihh20160323Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20160327Sergey Koren A src/pipelines/canu.pl nihh20160327Sergey Koren A src/pipelines/canu/Configure.pm nihh20160327Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20160327Sergey Koren A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20160327Sergey Koren A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160324Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160324Brian P. Walenz A src/bogart/AS_BAT_IntersectBubble.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_PromoteToSingleton.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160324Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160324Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.H nihh20160324Brian P. Walenz A src/bogart/AS_BAT_ReconstructRepeats.C nihh20160324Brian P. Walenz A src/bogart/bogart.C nihh20160324Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C nihh20160324Brian P. 
Walenz A src/falcon_sense/falcon_sense.C nihh20160324Sergey Koren A src/falcon_sense/falcon_sense.C nihh20160324Sergey Koren A src/falcon_sense/libfalcon/falcon.C nihh20160324Sergey Koren A src/falcon_sense/libfalcon/falcon.H nihh20160324Sergey Koren A src/pipelines/canu/CorrectReads.pm nihh20160324Sergey Koren A src/pipelines/canu/Gatekeeper.pm nihh20160324Sergey Koren A src/pipelines/canu/OverlapStore.pm nihh20160323Sergey Koren A src/falcon_sense/libfalcon/falcon.C nihh20160323Sergey Koren A src/overlapInCore/overlapInCore-Process_String_Overlaps.C nihh20160323Sergey Koren A src/overlapInCore/overlapInCore.C nihh20160323Sergey Koren A src/overlapInCore/overlapInCore.H nihh20160323Sergey Koren A src/overlapErrorAdjustment/findErrors-Analyze_Alignment.C nihh20160322Sergey Koren A src/falcon_sense/libfalcon/falcon.C nihh20160322Sergey Koren A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160322Sergey Koren A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160322Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160322Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H nihh20160322Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160322Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.H nihh20160322Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C nihh20160322Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160322Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.C nihh20160322Brian P. Walenz A src/bogart/AS_BAT_ReconstructRepeats.C nihh20160322Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20160322Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160322Brian P. Walenz A src/bogart/AS_BAT_Unitig_AddFrag.C nihh20160322Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C nihh20160322Brian P. Walenz A src/bogart/bogart.C nihh20160322Brian P. Walenz A src/bogart/bogart.mk nihh20160322Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H nihh20160321Brian P. Walenz A src/bogart/AS_BAT_FragmentInfo.C nihh20160321Brian P. 
Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160321Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160321Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160317Brian P. Walenz A src/AS_UTL/stddev.H nihh20160317Brian P. Walenz A src/AS_UTL/stddev.H nihh20160317Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160317Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160317Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160316Brian P. Walenz A src/bogart/bogart.C nihh20160316Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160316Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160316Brian P. Walenz A src/bogart/AS_BAT_Datatypes.H nihh20160316Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20160316Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20160316Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.C nihh20160316Brian P. Walenz A src/bogart/AS_BAT_Logging.C nihh20160316Brian P. Walenz A src/AS_UTL/memoryMappedFile.H nihh20160316Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20160315Sergey Koren A src/bogart/bogart.C nihh20160315Sergey Koren A src/pipelines/canu/Unitig.pm nihh20160315Sergey Koren A src/AS_UTL/memoryMappedFile.H nihh20160315Sergey Koren A src/pipelines/canu.pl nihh20160315Sergey Koren A src/AS_UTL/memoryMappedFile.H nihh20160313Brian P. 
Walenz A src/stores/ovOverlap.C nihh20160311Sergey Koren A src/stores/ovStore.H nihh20160311Sergey Koren A src/stores/ovStoreDump.C nihh20160311Sergey Koren A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160311Sergey Koren A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160311Sergey Koren A src/bogart/AS_BAT_MergeSplitJoin.C nihh20160311Sergey Koren A src/bogart/AS_BAT_MergeSplitJoin.H nihh20160311Sergey Koren A src/bogart/bogart.C nihh20160311Sergey Koren A src/bogart/bogart.mk nihh20160311Sergey Koren A src/stores/ovOverlap.C nihh20160311Sergey Koren A src/stores/ovStore.H nihh20160311Sergey Koren A src/stores/ovStoreDump.C nihh20160311Sergey Koren A src/pipelines/canu.pl nihh20160311Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160311Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160311Brian P. Walenz A src/bogart/AS_BAT_BreakRepeats.C nihh20160311Brian P. Walenz A src/bogart/AS_BAT_BreakRepeats.H nihh20160311Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160311Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.H nihh20160311Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160311Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.H nihh20160311Brian P. Walenz A src/bogart/bogart.C nihh20160311Brian P. Walenz A src/bogart/bogart.mk nihh20160311Brian P. Walenz A src/pipelines/bogart-sweep.pl nihh20160310Brian P. Walenz A src/AS_UTL/intervalListTest.C nihh20160310Brian P. Walenz A src/AS_UTL/stddev.H nihh20160310Brian P. Walenz A src/AS_UTL/stddevTest.C nihh20160310Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.C nihh20160310Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20160310Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160310Brian P. Walenz A src/bogart/bogart.C nihh20160310Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160310Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160310Brian P. Walenz A src/stores/ovStore.H nihh20160310Brian P. 
Walenz A src/AS_UTL/intervalList.H nihh20160310Brian P. Walenz A src/AS_UTL/intervalList.H nihh20160310Brian P. Walenz A src/AS_UTL/memoryMappedFile.H nihh20160310Brian P. Walenz A documentation/source/quick-start.rst nihh20160309Sergey Koren A src/pipelines/canu/CorrectReads.pm nihh20160309Sergey Koren A documentation/source/quick-start.rst nihh20160307Sergey Koren D src/bogart/TigVector.C src/bogart/UnitigVector.C D src/bogart/TigVector.H src/bogart/UnitigVector.H D src/bogart/AS_BAT_PlaceReadUsingOverlaps.C src/bogart/AS_BAT_PlaceFragUsingOverlaps.C D src/bogart/AS_BAT_PlaceReadUsingOverlaps.H src/bogart/AS_BAT_PlaceFragUsingOverlaps.H D src/bogart/AS_BAT_Unitig_AddRead.C src/bogart/AS_BAT_Unitig_AddFrag.C D src/bogart/AS_BAT_Unitig_PlaceReadUsingEdges.C src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C D src/stores/ovOverlap.H src/stores/ovStore.H D src/stores/ovStoreFile.H src/stores/ovStore.H D src/stores/ovStoreFilter.C src/stores/ovStore.C D src/stores/ovStoreFilter.H src/stores/ovStore.H D src/stores/ovStoreWriter.C src/stores/ovStore.C A src/bogart/AS_BAT_TigGraph.C nihh20161121Brian P. Walenz A src/stores/gatekeeperCreate.C nihh20161121Brian P. Walenz A src/stores/gkStore.C nihh20161121Brian P. Walenz A src/stores/gkStore.H nihh20161121Brian P. Walenz A src/stores/gkStore.C nihh20161118Brian P. Walenz A src/stores/gkStore.H nihh20161118Brian P. Walenz A src/AS_UTL/writeBuffer.H nihh20161118Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20161118Brian P. Walenz A src/bogart/bogart.C nihh20161118Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161117Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20161117Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20161117Brian P. Walenz A src/bogart/bogart.C nihh20161117Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20161117Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20161117Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20161117Brian P. 
Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20161117Brian P. Walenz A src/bogart/bogart.C nihh20161117Brian P. Walenz A src/bogart/bogart.C nihh20161117Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161116Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161116Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161116Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20161116Brian P. Walenz A src/stores/ovStoreBucketizer.C nihh20161110Brian P. Walenz A src/stores/ovStoreWriter.C nihh20161110Brian P. Walenz A src/pipelines/sanity/sanity.pl nihh20161108Brian P. Walenz A src/pipelines/sanity/sanity.sh nihh20161108Brian P. Walenz A src/pipelines/canu.pl nihh20161108Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20161108Brian P. Walenz A documentation/source/parameter-reference.rst nihh20161107Brian P. Walenz A src/pipelines/canu.pl nihh20161107Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20161107Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20161107Brian P. Walenz A src/pipelines/canu/Output.pm nihh20161107Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20161107Brian P. Walenz A src/AS_global.C nihh20161105Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20161105Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20161105Brian P. Walenz A src/bogart/bogart.C nihh20161105Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20161104Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20161104Brian P. Walenz A src/bogart/AS_BAT_ReadInfo.C nihh20161104Brian P. Walenz A src/bogart/AS_BAT_ReadInfo.H nihh20161104Brian P. Walenz A src/stores/ovStoreHistogram.H nihh20161104Brian P. Walenz A src/stores/ovStoreDump.C nihh20161031Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.C nihh20161031Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H nihh20161031Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20161031Brian P. Walenz A src/meryl/libmeryl.C nihh20161031Brian P. Walenz A src/stores/gkStore.C nihh20161031Brian P. 
Walenz A src/AS_UTL/AS_UTL_alloc.H nihh20161031Brian P. Walenz A addCopyrights.dat nihh20161028Brian P. Walenz A src/erateEstimate/erateEstimate.C nihh20161028Brian P. Walenz A src/main.mk nihh20161028Brian P. Walenz A src/overlapInCore/overlapImport.C nihh20161028Brian P. Walenz A src/overlapInCore/overlapPair.C nihh20161028Brian P. Walenz A src/stores/ovStore.C nihh20161028Brian P. Walenz A src/stores/ovStore.H nihh20161028Brian P. Walenz A src/stores/ovStoreBucketizer.C nihh20161028Brian P. Walenz A src/stores/ovStoreBuild.C nihh20161028Brian P. Walenz A src/stores/ovStoreFile.C nihh20161028Brian P. Walenz A src/stores/ovStoreFile.H nihh20161028Brian P. Walenz A src/stores/ovStoreFilter.C nihh20161028Brian P. Walenz A src/stores/ovStoreFilter.H nihh20161028Brian P. Walenz A src/stores/ovStoreHistogram.C nihh20161028Brian P. Walenz A src/stores/ovStoreHistogram.H nihh20161028Brian P. Walenz A src/stores/ovStoreIndexer.C nihh20161028Brian P. Walenz A src/stores/ovStoreSorter.C nihh20161028Brian P. Walenz A src/stores/ovStoreWriter.C nihh20161028Brian P. Walenz A src/stores/ovStore.C nihh20161026Brian P. Walenz A src/stores/ovStore.H nihh20161026Brian P. Walenz A src/main.mk nihh20161025Brian P. Walenz A src/stores/ovStore.C nihh20161025Brian P. Walenz A src/stores/ovStore.H nihh20161025Brian P. Walenz A src/stores/ovStoreBuild.C nihh20161025Brian P. Walenz A src/stores/ovStoreDump.C nihh20161025Brian P. Walenz A src/stores/ovStoreFile.C nihh20161025Brian P. Walenz A src/stores/ovStoreFile.H nihh20161025Brian P. Walenz A src/stores/ovStoreHistogram.C nihh20161025Brian P. Walenz A src/stores/ovStoreHistogram.H nihh20161025Brian P. Walenz A src/stores/ovStoreIndexer.C nihh20161025Brian P. Walenz A src/stores/ovStoreBuild.C nihh20161025Brian P. Walenz A src/Makefile nihh20161025Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.C nihh20161025Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H nihh20161025Brian P. Walenz A src/AS_UTL/AS_UTL_alloc.H nihh20161025Brian P. 
Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20161024Brian P. Walenz A addCopyrights.dat nihh20161024Brian P. Walenz A src/stores/ovOverlap.H nihh20161024Brian P. Walenz A src/stores/ovStore.H nihh20161024Brian P. Walenz A src/stores/ovStoreFile.H nihh20161024Brian P. Walenz A src/mhap/mhapConvert.C nihh20161024Brian P. Walenz A src/minimap/mmapConvert.C nihh20161024Brian P. Walenz A src/overlapInCore/overlapConvert.C nihh20161024Brian P. Walenz A src/overlapInCore/overlapImport.C nihh20161024Brian P. Walenz A src/overlapInCore/overlapInCore.C nihh20161024Brian P. Walenz A src/overlapInCore/overlapPair.C nihh20161024Brian P. Walenz A src/stores/ovStore.C nihh20161024Brian P. Walenz A src/stores/ovStore.H nihh20161024Brian P. Walenz A src/stores/ovStoreBucketizer.C nihh20161024Brian P. Walenz A src/stores/ovStoreBuild.C nihh20161024Brian P. Walenz A src/stores/ovStoreFile.C nihh20161024Brian P. Walenz A src/stores/ovStoreSorter.C nihh20161024Brian P. Walenz A src/pipelines/canu/Output.pm nihh20161021Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20161021Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20161021Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20161021Brian P. Walenz A src/pipelines/canu/HTML.pm nihh20161021Brian P. Walenz A documentation/source/faq.rst nihh20161018Brian P. Walenz A documentation/source/faq.rst nihh20161018Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20161018Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20161018Brian P. Walenz A src/stores/ovStoreDump.C nihh20161018Brian P. Walenz A src/stores/ovStoreDump.C nihh20161018Brian P. Walenz A documentation/source/overlap_transformations.svg nihh20161014Brian P. Walenz A documentation/source/overlaps.svg nihh20161014Brian P. Walenz A documentation/source/repeat-spanned.svg nihh20161014Brian P. Walenz A documentation/source/repeat-unspanned.svg nihh20161014Brian P. Walenz A src/AS_RUN/fragmentDepth.C nihh20161017Brian P. 
Walenz A src/AS_UTL/AS_UTL_fileIO.C nihh20161017Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.C nihh20161017Brian P. Walenz A src/AS_UTL/bitPackedArray.C nihh20161017Brian P. Walenz A src/AS_UTL/bitPackedArray.H nihh20161017Brian P. Walenz A src/AS_UTL/bitPackedFile.C nihh20161017Brian P. Walenz A src/AS_UTL/bitPackedFile.H nihh20161017Brian P. Walenz A src/AS_UTL/kMer.C nihh20161017Brian P. Walenz A src/AS_UTL/kMerHuge.H nihh20161017Brian P. Walenz A src/AS_UTL/memoryMappedFile.H nihh20161017Brian P. Walenz A src/AS_UTL/readBuffer.C nihh20161017Brian P. Walenz A src/AS_UTL/stddev.H nihh20161017Brian P. Walenz A src/AS_UTL/sweatShop.C nihh20161017Brian P. Walenz A src/AS_global.C nihh20161017Brian P. Walenz A src/AS_global.H nihh20161017Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20161017Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20161017Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20161017Brian P. Walenz A src/bogart/AS_BAT_Logging.C nihh20161017Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20161017Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20161017Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20161017Brian P. Walenz A src/bogart/AS_BAT_PromoteToSingleton.C nihh20161017Brian P. Walenz A src/bogart/AS_BAT_ReadInfo.C nihh20161017Brian P. Walenz A src/bogart/AS_BAT_ReadInfo.H nihh20161017Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20161017Brian P. Walenz A src/bogart/AS_BAT_TigVector.C nihh20161017Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20161017Brian P. Walenz A src/bogart/addReadsToUnitigs.C nihh20161017Brian P. Walenz A src/bogart/analyzeBest.C nihh20161017Brian P. Walenz A src/bogart/buildGraph.C nihh20161017Brian P. Walenz A src/bogus/bogus.C nihh20161017Brian P. Walenz A src/correction/filterCorrectionOverlaps.C nihh20161017Brian P. Walenz A src/correction/generateCorrectionLayouts.C nihh20161017Brian P. 
Walenz A src/erateEstimate/erateEstimate.C nihh20161017Brian P. Walenz A src/falcon_sense/createFalconSenseInputs.C nihh20161017Brian P. Walenz A src/falcon_sense/outputFalcon.C nihh20161017Brian P. Walenz A src/fastq-utilities/fastqAnalyze.C nihh20161017Brian P. Walenz A src/fastq-utilities/fastqSample.C nihh20161017Brian P. Walenz A src/fastq-utilities/fastqSimulate-sort.C nihh20161017Brian P. Walenz A src/fastq-utilities/fastqSimulate.C nihh20161017Brian P. Walenz A src/merTrim/merTrim.C nihh20161017Brian P. Walenz A src/merTrim/merTrimResult.H nihh20161017Brian P. Walenz A src/mercy/mercy-regions.C nihh20161017Brian P. Walenz A src/mercy/mercy.C nihh20161017Brian P. Walenz A src/meryl/estimate-mer-threshold.C nihh20161017Brian P. Walenz A src/meryl/leaff-blocks.C nihh20161017Brian P. Walenz A src/meryl/leaff-duplicates.C nihh20161017Brian P. Walenz A src/meryl/leaff-gc.C nihh20161017Brian P. Walenz A src/meryl/leaff-partition.C nihh20161017Brian P. Walenz A src/meryl/leaff-statistics.C nihh20161017Brian P. Walenz A src/meryl/leaff.C nihh20161017Brian P. Walenz A src/meryl/libleaff/fastaFile.C nihh20161017Brian P. Walenz A src/meryl/libleaff/fastaStdin.C nihh20161017Brian P. Walenz A src/meryl/libleaff/fastqFile.C nihh20161017Brian P. Walenz A src/meryl/libleaff/fastqStdin.C nihh20161017Brian P. Walenz A src/meryl/libleaff/seqCache.C nihh20161017Brian P. Walenz A src/meryl/libleaff/seqStore.C nihh20161017Brian P. Walenz A src/meryl/libleaff/seqStream.C nihh20161017Brian P. Walenz A src/meryl/libmeryl.C nihh20161017Brian P. Walenz A src/meryl/meryl-args.C nihh20161017Brian P. Walenz A src/meryl/meryl-binaryOp.C nihh20161017Brian P. Walenz A src/meryl/meryl-build.C nihh20161017Brian P. Walenz A src/meryl/meryl-dump.C nihh20161017Brian P. Walenz A src/meryl/meryl-estimate.C nihh20161017Brian P. Walenz A src/meryl/meryl-merge.C nihh20161017Brian P. Walenz A src/meryl/simple.C nihh20161017Brian P. 
Walenz A src/overlapBasedTrimming/splitReads-subReads.C nihh20161017Brian P. Walenz A src/overlapBasedTrimming/splitReads.C nihh20161017Brian P. Walenz A src/overlapBasedTrimming/trimReads-largestCovered.C nihh20161017Brian P. Walenz A src/overlapBasedTrimming/trimReads-quality.C nihh20161017Brian P. Walenz A src/overlapBasedTrimming/trimReads.C nihh20161017Brian P. Walenz A src/overlapErrorAdjustment/analyzeAlignment.C nihh20161017Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps-Correct_Frags.C nihh20161017Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps-Read_Olaps.C nihh20161017Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps-Redo_Olaps.C nihh20161017Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps.C nihh20161017Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps.H nihh20161017Brian P. Walenz A src/overlapErrorAdjustment/findErrors-Read_Frags.C nihh20161017Brian P. Walenz A src/overlapErrorAdjustment/findErrors-Read_Olaps.C nihh20161017Brian P. Walenz A src/overlapErrorAdjustment/findErrors.C nihh20161017Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C nihh20161017Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C nihh20161017Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C nihh20161017Brian P. Walenz A src/overlapInCore/overlapInCore.C nihh20161017Brian P. Walenz A src/overlapInCore/overlapInCorePartition.C nihh20161017Brian P. Walenz A src/overlapInCore/overlapReadCache.C nihh20161017Brian P. Walenz A src/stores/gatekeeperCreate.C nihh20161017Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C nihh20161017Brian P. Walenz A src/stores/gatekeeperDumpMetaData.C nihh20161017Brian P. Walenz A src/stores/gkStore.C nihh20161017Brian P. Walenz A src/stores/gkStore.H nihh20161017Brian P. Walenz A src/stores/ovOverlap.C nihh20161017Brian P. Walenz A src/stores/ovStore.C nihh20161017Brian P. Walenz A src/stores/ovStore.H nihh20161017Brian P. 
Walenz A src/stores/ovStoreBucketizer.C nihh20161017Brian P. Walenz A src/stores/ovStoreBuild.C nihh20161017Brian P. Walenz A src/stores/ovStoreDump.C nihh20161017Brian P. Walenz A src/stores/ovStoreFile.C nihh20161017Brian P. Walenz A src/stores/ovStoreIndexer.C nihh20161017Brian P. Walenz A src/stores/ovStoreSorter.C nihh20161017Brian P. Walenz A src/stores/ovStoreStats.C nihh20161017Brian P. Walenz A src/stores/tgStore.C nihh20161017Brian P. Walenz A src/stores/tgStoreCompress.C nihh20161017Brian P. Walenz A src/stores/tgStoreCoverageStat.C nihh20161017Brian P. Walenz A src/stores/tgStoreDump.C nihh20161017Brian P. Walenz A src/stores/tgStoreFilter.C nihh20161017Brian P. Walenz A src/stores/tgTig.C nihh20161017Brian P. Walenz A src/stores/tgTigSizeAnalysis.C nihh20161017Brian P. Walenz A src/utgcns/libNDalign/NDalign.C nihh20161017Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20161017Brian P. Walenz A src/utgcns/stashContains.C nihh20161017Brian P. Walenz A src/utgcns/stashContains.H nihh20161017Brian P. Walenz A src/utgcns/utgcns.C nihh20161017Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20161014Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20161014Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20161014Sergey Koren A src/bogart/AS_BAT_CreateUnitigs.C nihh20161014Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161014Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20161014Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20161014Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20161014Brian P. Walenz A src/bogart/AS_BAT_TigVector.C nihh20161013Brian P. Walenz A src/pipelines/canu/Output.pm nihh20161013Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C nihh20161013Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161012Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161012Brian P. Walenz A src/stores/ovStoreBucketizer.C nihh20161011Brian P. 
Walenz A src/pipelines/canu/Grid_PBSTorque.pm nihh20161011Brian P. Walenz A src/pipelines/canu.pl nihh20161011Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20161011Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20161011Sergey Koren A src/pipelines/canu.pl nihh20161011Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20161011Sergey Koren A src/pipelines/canu/Defaults.pm nihh20161011Sergey Koren A src/pipelines/canu/Unitig.pm nihh20161011Sergey Koren A src/mhap/mhap-2.1.2.tar nihh20161011Sergey Koren A src/mhap/mhap.mk nihh20161011Sergey Koren A src/pipelines/canu/Defaults.pm nihh20161011Sergey Koren A src/overlapInCore/libedlib/edlib.C nihh20161007Sergey Koren A src/overlapInCore/libedlib/edlib.H nihh20161007Sergey Koren A src/overlapInCore/overlapPair.C nihh20161007Sergey Koren A src/meryl/libleaff/fastaFile.C nihh20161006Brian P. Walenz A src/meryl/libleaff/fastqFile.C nihh20161006Brian P. Walenz A src/meryl/libleaff/sffFile.C nihh20161006Brian P. Walenz A src/meryl/libleaff/seqFactory.C nihh20161006Brian P. Walenz A src/meryl/leaff.C nihh20161006Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20161006Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20161006Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20161006Brian P. Walenz A addCopyrights.pl nihh20161006Brian P. Walenz A kmer/libutil/qsort_mt.c nihh20161006Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161005Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20161005Brian P. Walenz A documentation/source/faq.rst nihh20161005Sergey Koren A src/overlapInCore/overlapPair.C nihh20161004Sergey Koren A src/bogart/AS_BAT_Unitig.C nihh20161004Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20161004Brian P. Walenz A src/AS_UTL/stddev.H nihh20161004Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20161003Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.H nihh20161003Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20161003Brian P. 
Walenz A src/bogart/AS_BAT_Outputs.C nihh20161003Brian P. Walenz A src/bogart/AS_BAT_Outputs.H nihh20161003Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C nihh20161003Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161003Brian P. Walenz A src/bogart/AS_BAT_TigGraph.H nihh20161003Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20161003Brian P. Walenz A src/bogart/bogart.C nihh20161003Brian P. Walenz A src/bogart/bogart.mk nihh20161003Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20161003Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20161003Brian P. Walenz A src/pipelines/canu/Output.pm nihh20161003Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20161003Brian P. Walenz A src/stores/gatekeeperPartition.C nihh20161003Brian P. Walenz A src/utgcns/utgcns.C nihh20161003Brian P. Walenz A src/stores/gkStore.C nihh20161003Brian P. Walenz A src/stores/gkStore.C nihh20160930Brian P. Walenz A src/stores/gkStore.H nihh20160930Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.C nihh20160930Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H nihh20160930Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160930Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160930Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160930Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160930Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20160930Brian P. Walenz A src/bogart/bogart.C nihh20160930Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20160930Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160930Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.H nihh20160930Brian P. Walenz A src/bogart/bogart.C nihh20160930Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160929Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160929Brian P. Walenz A src/bogart/AS_BAT_Outputs.H nihh20160929Brian P. Walenz A src/bogart/bogart.C nihh20160929Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20160929Brian P. 
Walenz A src/pipelines/canu/Meryl.pm nihh20160929Brian P. Walenz A src/pipelines/canu/Output.pm nihh20160929Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20160929Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20160929Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20160929Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160929Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160929Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160929Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160929Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160929Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20160929Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160929Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160929Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160929Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160929Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160929Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20160929Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20160929Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20160929Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160929Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C nihh20160929Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160929Brian P. Walenz A documentation/source/tutorial.rst nihh20160926Brian P. Walenz A documentation/source/canu-overlaps.svg nihh20160926Brian P. Walenz A documentation/source/canu-pipeline.svg nihh20160926Brian P. Walenz A documentation/source/pipeline.rst nihh20160926Brian P. Walenz A src/overlapInCore/libedlib/edlib.C nihh20160923Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160923Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160923Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160923Brian P. Walenz A addCopyrights.dat nihh20160923Brian P. 
Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160921Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.H nihh20160921Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20160921Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20160921Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20160921Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20160921Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20160921Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20160921Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20160921Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20160921Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20160921Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20160921Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160919Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160919Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160919Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160919Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160919Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20160919Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160919Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.C nihh20160919Brian P. Walenz A src/bogart/AS_BAT_PromoteToSingleton.C nihh20160919Brian P. Walenz A src/bogart/AS_BAT_TigVector.C nihh20160919Brian P. Walenz A src/bogart/AS_BAT_TigVector.H nihh20160919Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160919Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160919Brian P. Walenz A src/bogart/AS_BAT_Unitig_AddRead.C nihh20160919Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceReadUsingEdges.C nihh20160919Brian P. Walenz A src/bogart/bogart.C nihh20160919Brian P. Walenz A src/overlapInCore/overlapInCore.C nihh20160917Brian P. Walenz A src/Makefile nihh20160917Brian P. Walenz A src/correction/generateCorrectionLayouts.C nihh20160917Brian P. 
Walenz A src/overlapErrorAdjustment/correctOverlaps-Prefix_Edit_Distance.C nihh20160917Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C nihh20160917Brian P. Walenz A src/stores/ovStoreBuild.C nihh20160917Brian P. Walenz A documentation/source/index.rst nihh20160920Sergey Koren A documentation/source/index.rst nihh20160920Sergey Koren A src/pipelines/canu/Execution.pm nihh20160916Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20160916Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20160916Brian P. Walenz A src/stores/gkStore.C nihh20160915Brian P. Walenz A src/stores/gkStore.H nihh20160915Brian P. Walenz A src/overlapInCore/overlapInCore-Output.C nihh20160915Brian P. Walenz A src/overlapInCore/overlapInCore-Process_Overlaps.C nihh20160915Brian P. Walenz A src/overlapInCore/overlapInCore.C nihh20160915Brian P. Walenz A src/overlapInCore/overlapInCore.H nihh20160915Brian P. Walenz A src/meryl/meryl-args.C nihh20160914Brian P. Walenz A src/meryl/meryl-build.C nihh20160914Brian P. Walenz A src/meryl/meryl.H nihh20160914Brian P. Walenz A src/meryl/meryl.mk nihh20160914Brian P. Walenz A src/meryl/libleaff/gkStoreFile.C nihh20160913Brian P. Walenz A src/stores/gatekeeperCreate.C nihh20160913Brian P. Walenz A src/stores/gkStore.C nihh20160913Brian P. Walenz A src/stores/gkStore.H nihh20160913Brian P. Walenz A src/stores/gkStore.C nihh20160913Brian P. Walenz A src/AS_UTL/hexDump.C nihh20160915Brian P. Walenz A src/AS_UTL/hexDump.H nihh20160915Brian P. Walenz A src/main.mk nihh20160915Brian P. Walenz A src/AS_UTL/mt19937arTest.C nihh20160909Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20160909Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20160909Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20160909Brian P. Walenz A src/overlapInCore/overlapInCorePartition.C nihh20160909Brian P. Walenz A src/stores/ovStoreBuild.C nihh20160909Brian P. 
Walenz A src/overlapInCore/overlapImport.C nihh20160909Brian P. Walenz A src/AS_UTL/mt19937ar.C nihh20160909Brian P. Walenz A src/AS_UTL/mt19937ar.H nihh20160909Brian P. Walenz A src/stores/ovStoreBuild.C nihh20160909Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20160906Sergey Koren A src/pipelines/canu/OverlapInCore.pm nihh20160902Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20160902Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20160902Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160902Brian P. Walenz A src/pipelines/canu/Grid_PBSTorque.pm nihh20160902Brian P. Walenz A src/pipelines/canu/Grid_PBSTorque.pm nihh20160902Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160902Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20160901Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20160901Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20160901Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160901Brian P. Walenz A src/pipelines/canu/Grid_PBSTorque.pm nihh20160901Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20160831Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20160831Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20160831Brian P. Walenz A src/pipelines/canu/HTML.pm nihh20160831Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20160831Brian P. Walenz A documentation/source/parameter-reference.rst nihh20160831Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20160831Brian P. Walenz A src/overlapInCore/overlapConvert.C nihh20160831Brian P. Walenz A src/overlapInCore/overlapImport.C nihh20160831Brian P. Walenz A src/stores/ovStore.H nihh20160831Brian P. Walenz A src/stores/ovStoreFile.C nihh20160831Brian P. Walenz A src/main.mk nihh20160830Brian P. Walenz A src/stores/libsnappy/snappy-internal.h nihh20160830Brian P. Walenz A src/stores/libsnappy/snappy-sinksource.cc nihh20160830Brian P. Walenz A src/stores/libsnappy/snappy-sinksource.h nihh20160830Brian P. 
Walenz A src/stores/libsnappy/snappy-stubs-internal.cc nihh20160830Brian P. Walenz A src/stores/libsnappy/snappy-stubs-internal.h nihh20160830Brian P. Walenz A src/stores/libsnappy/snappy-stubs-public.h nihh20160830Brian P. Walenz A src/stores/libsnappy/snappy.cc nihh20160830Brian P. Walenz A src/stores/libsnappy/snappy.h nihh20160830Brian P. Walenz A src/stores/ovStore.H nihh20160830Brian P. Walenz A src/stores/ovStoreFile.C nihh20160830Brian P. Walenz A src/main.mk nihh20160830Brian P. Walenz A src/stores/ovStoreBucketizer.C nihh20160829Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.C nihh20160829Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H nihh20160829Brian P. Walenz A src/stores/ovStore.H nihh20160829Brian P. Walenz A src/stores/ovStoreFile.C nihh20160829Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H nihh20160829Brian P. Walenz A src/stores/ovStore.H nihh20160829Brian P. Walenz A src/stores/ovStoreFile.C nihh20160829Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160829Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160829Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160829Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160829Brian P. Walenz A src/AS_global.H nihh20160829Brian P. Walenz A src/pipelines/canu/HTML.pm nihh20160829Brian P. Walenz A src/bogart/bogart.C nihh20160826Brian P. Walenz A src/overlapInCore/overlapPair.mk nihh20160830Sergey Koren A src/main.mk nihh20160830Sergey Koren A src/overlapInCore/libedlib/edlib.C nihh20160830Sergey Koren A src/overlapInCore/libedlib/edlib.H nihh20160830Sergey Koren A src/overlapInCore/overlapPair.C nihh20160830Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20160830Sergey Koren A src/pipelines/canu/Defaults.pm nihh20160826Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160825Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160825Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160825Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160825Brian P. 
Walenz A README.md nihh20160824Sergey Koren A README.citation nihh20160824Sergey Koren A src/bogart/AS_BAT_PopBubbles.C nihh20160822Brian P. Walenz A src/bogart/bogart.C nihh20160821Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160821Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160821Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160821Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160821Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160821Brian P. Walenz A src/bogart/AS_BAT_TigVector.C nihh20160821Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160821Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20160821Brian P. Walenz A src/bogart/bogart.C nihh20160821Brian P. Walenz A src/bogart/bogart.C nihh20160821Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160821Brian P. Walenz A src/bogart/bogart.C nihh20160821Brian P. Walenz A src/bogart/AS_BAT_PromoteToSingleton.C nihh20160821Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20160821Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceReadUsingEdges.C nihh20160821Brian P. Walenz A src/bogart/AS_BAT_TigVector.C nihh20160821Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160819Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20160819Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160819Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160819Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.H nihh20160819Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160818Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20160817Brian P. Walenz A src/pipelines/canu/Grid.pm nihh20160817Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20160816Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.H nihh20160816Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20160816Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20160816Brian P. 
Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PromoteToSingleton.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_ReadInfo.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_ReadInfo.H nihh20160812Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Unitig_AddRead.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceReadUsingEdges.C nihh20160812Brian P. Walenz A src/bogart/bogart.C nihh20160812Brian P. Walenz A src/bogart/bogart.mk nihh20160812Brian P. Walenz A addCopyrights.dat nihh20160812Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.H nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.txt nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Unitig_AddRead.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceReadUsingEdges.C nihh20160812Brian P. 
Walenz A src/bogart/bogart.mk nihh20160812Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160812Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.H nihh20160812Brian P. Walenz A src/bogart/AS_BAT_FragmentInfo.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_FragmentInfo.H nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Logging.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Logging.H nihh20160812Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.H nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PromoteToSingleton.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Unitig_AddFrag.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C nihh20160812Brian P. Walenz A src/bogart/addReadsToUnitigs.C nihh20160812Brian P. Walenz A src/bogart/analyzeBest.C nihh20160812Brian P. Walenz A src/bogart/bogart.C nihh20160812Brian P. 
Walenz A src/bogart/buildGraph.C nihh20160812Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20160812Brian P. Walenz A src/stores/gkStore.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20160812Brian P. Walenz A src/bogart/bogart.C nihh20160812Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160810Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_Logging.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_Logging.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_Outputs.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PromoteToSingleton.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PromoteToSingleton.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.H nihh20160809Brian P. 
Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_TigVector.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_TigVector.H nihh20160809Brian P. Walenz A src/bogart/addReadsToUnitigs.C nihh20160809Brian P. Walenz A src/bogart/bogart.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160809Brian P. Walenz A src/bogart/bogart.C nihh20160809Brian P. Walenz A src/bogart/bogart.mk nihh20160809Brian P. Walenz A src/bogart/bogart.C nihh20160809Brian P. Walenz A src/bogart/bogart.mk nihh20160809Brian P. Walenz A addCopyrights.dat nihh20160809Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_MergeUnitigs.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_MergeUnitigs.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_Outputs.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PopulateUnitig.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_PromoteToSingleton.C nihh20160809Brian P. 
Walenz A src/bogart/AS_BAT_PromoteToSingleton.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_ReconstructRepeats.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_SetParentAndHang.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_TigVector.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_TigVector.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160809Brian P. Walenz A src/bogart/bogart.C nihh20160809Brian P. Walenz A src/bogart/bogart.mk nihh20160809Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20160809Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160809Brian P. Walenz A src/mhap/mhapConvert.C nihh20160809Brian P. Walenz A src/bogart/bogart.C nihh20160809Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160808Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160808Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160805Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160805Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160805Brian P. Walenz A src/bogart/findOverlappingReads.pl nihh20160805Brian P. Walenz A src/stores/tgStoreDump.C nihh20160805Brian P. Walenz A src/stores/tgTig.C nihh20160805Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160805Brian P. Walenz A src/bogart/bogart.C nihh20160805Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20160805Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160805Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20160805Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C nihh20160805Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160805Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H nihh20160805Brian P. 
Walenz A src/bogart/AS_BAT_Unitig.C nihh20160805Brian P. Walenz A src/bogart/bogart.C nihh20160805Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160801Brian P. Walenz A src/bogart/AS_BAT_MergeUnitigs.C nihh20160801Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160801Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C nihh20160801Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H nihh20160801Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160801Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160801Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20160721Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.H nihh20160721Brian P. Walenz A src/bogart/bogart.C nihh20160721Brian P. Walenz A src/bogart/bogart.mk nihh20160721Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.C nihh20160721Brian P. Walenz A src/bogart/AS_BAT_PlaceFragUsingOverlaps.H nihh20160721Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160721Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160721Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160721Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C nihh20160721Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160721Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20160721Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20160721Brian P. Walenz A src/pipelines/canu/Grid_PBSTorque.pm nihh20160803Brian P. Walenz A src/meryl/libmeryl.C nihh20160724Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160719Brian P. Walenz A src/main.mk nihh20160719Brian P. Walenz A src/utgcns/libNDFalcon/dw.C nihh20160719Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20160719Brian P. Walenz A src/stores/gkStore.H nihh20160707Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20160706Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20160706Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20160706Brian P. 
Walenz A README.md nihh20160718Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20160718Brian P. Walenz A src/Makefile nihh20160718Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20160718Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160711Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160709Sergey Koren A src/pipelines/canu/Defaults.pm nihh20160709Sergey Koren A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160706Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160706Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20160706Brian P. Walenz A src/bogart/bogart.C nihh20160706Brian P. Walenz A src/pipelines/canu.pl nihh20160706Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20160706Brian P. Walenz A src/pipelines/canu/Grid_Slurm.pm nihh20160629Brian P. Walenz A src/bogart/buildGraph.C nihh20160629Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20160629Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20160628Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20160628Brian P. Walenz A src/overlapErrorAdjustment/findErrors.C nihh20160627Brian P. Walenz A src/overlapErrorAdjustment/findErrors.C nihh20160627Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps.C nihh20160627Brian P. Walenz A src/overlapErrorAdjustment/findErrors-Process_Olap.C nihh20160627Brian P. Walenz A src/overlapErrorAdjustment/findErrors.C nihh20160627Brian P. Walenz A src/overlapErrorAdjustment/findErrors.H nihh20160627Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20160627Brian P. Walenz A README.md nihh20160624Sergey Koren A src/pipelines/canu/Execution.pm nihh20160621Brian P. Walenz A src/pipelines/canu/Grid.pm nihh20160621Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160620Sergey Koren A src/pipelines/canu/Grid.pm nihh20160620Sergey Koren A src/pipelines/canu/Consensus.pm nihh20160618Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20160618Brian P. 
Walenz A src/pipelines/canu/Execution.pm nihh20160618Brian P. Walenz A src/pipelines/canu/Grid_Slurm.pm nihh20160618Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20160618Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20160618Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20160618Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20160618Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20160618Brian P. Walenz A src/AS_UTL/intervalList.H nihh20160615Brian P. Walenz A src/AS_UTL/intervalListTest.C nihh20160615Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20160615Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20160615Brian P. Walenz A src/pipelines/canu/HTML.pm nihh20160614Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160614Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C nihh20160614Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.H nihh20160614Brian P. Walenz A src/bogart/AS_BAT_FragmentInfo.C nihh20160614Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160614Brian P. Walenz A src/bogart/AS_BAT_Logging.C nihh20160614Brian P. Walenz A src/bogart/AS_BAT_Logging.H nihh20160614Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20160614Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20160614Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160614Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20160614Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160614Brian P. Walenz A src/bogart/AS_BAT_UnitigVector.C nihh20160614Brian P. Walenz A src/bogart/bogart.C nihh20160614Brian P. Walenz A src/bogart/AS_BAT_FragmentInfo.C nihh20160613Brian P. Walenz A src/bogart/AS_BAT_FragmentInfo.C nihh20160613Brian P. Walenz A src/bogart/AS_BAT_FragmentInfo.H nihh20160613Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20160613Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160613Brian P. 
Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160613Brian P. Walenz A src/stores/ovStore.C nihh20160610Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160603Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20160603Brian P. Walenz A src/stores/tgStoreDump.C nihh20160603Brian P. Walenz A src/stores/tgTig.C nihh20160603Brian P. Walenz A src/stores/tgTig.H nihh20160603Brian P. Walenz A src/falcon_sense/falcon_sense.C nihh20160610Sergey Koren A src/utgcns/utgcns.C nihh20160610Sergey Koren A src/bogart/bogart.C nihh20160608Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C nihh20160608Brian P. Walenz A src/stores/gkStore.C nihh20160608Brian P. Walenz A src/stores/gkStore.H nihh20160608Brian P. Walenz A src/overlapInCore/overlapInCore-Find_Overlaps.C nihh20160608Sergey Koren A src/overlapInCore/overlapInCore-Process_Overlaps.C nihh20160608Sergey Koren A src/overlapInCore/overlapInCore-Process_String_Overlaps.C nihh20160608Sergey Koren A src/overlapInCore/overlapInCore.C nihh20160608Sergey Koren A src/overlapInCore/overlapInCore.H nihh20160608Sergey Koren A src/pipelines/canu/Defaults.pm nihh20160608Sergey Koren A src/pipelines/canu/OverlapInCore.pm nihh20160608Sergey Koren A src/pipelines/canu/Defaults.pm nihh20160608Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20160608Sergey Koren A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160608Brian P. Walenz A src/stores/ovStoreBuild.C nihh20160607Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160606Brian P. Walenz A src/bogart/AS_BAT_Logging.C nihh20160606Brian P. Walenz A src/bogart/AS_BAT_Logging.H nihh20160606Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160606Brian P. Walenz A src/bogart/bogart.C nihh20160606Brian P. Walenz A src/bogart/plotErrorProfile.pl nihh20160606Brian P. Walenz A src/AS_UTL/stddev.H nihh20160606Brian P. Walenz A README.md nihh20160603Sergey Koren A src/bogart/AS_BAT_Unitig.C nihh20160531Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20160531Brian P. 
Walenz A src/bogart/AS_BAT_Unitig.H nihh20160531Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20160531Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20160531Brian P. Walenz A src/AS_UTL/stddev.H nihh20160531Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20160531Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20160531Sergey Koren A documentation/source/faq.rst nihh20160531Sergey Koren A documentation/source/quick-start.rst nihh20160531Sergey Koren A documentation/source/quick-start.rst nihh20160531Sergey Koren A documentation/source/faq.rst nihh20160527Sergey Koren A documentation/source/faq.rst nihh20160527Sergey Koren A documentation/source/faq.rst nihh20160527Sergey Koren A src/pipelines/canu/HTML.pm nihh20160526Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20160526Brian P. Walenz A documentation/source/quick-start.rst nihh20160526Sergey Koren A src/stores/gatekeeperCreate.C nihh20160526Brian P. Walenz A src/stores/gkStore.C nihh20160526Brian P. Walenz A src/stores/gkStoreEncode.C nihh20160526Brian P. Walenz A src/overlapInCore/overlapInCore.C nihh20160524Sergey Koren A src/pipelines/canu.pl nihh20160524Sergey Koren A src/pipelines/canu/ErrorEstimate.pm nihh20160524Sergey Koren A src/pipelines/canu/CorrectReads.pm nihh20160524Sergey Koren A src/mhap/mhap-2.1.tar nihh20160523Sergey Koren A src/pipelines/canu/ErrorEstimate.pm nihh20160523Sergey Koren A src/bogart/AS_BAT_BestOverlapGraph.C nihh20160523Brian P. Walenz A src/mhap/mhap-2.1.tar nihh20160523Sergey Koren A src/pipelines/canu/Defaults.pm nihh20160523Sergey Koren A src/pipelines/canu/ErrorEstimate.pm nihh20160523Sergey Koren A src/pipelines/canu/Meryl.pm nihh20160523Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20160523Sergey Koren A src/Makefile nihh20160523Brian P. Walenz A src/meryl/meryl-build.C nihh20160523Brian P. Walenz A src/meryl/libleaff/gkStoreFile.C nihh20160523Brian P. Walenz A src/stores/gkStore.H nihh20160523Brian P. 
Walenz A src/overlapErrorAdjustment/findErrors-Analyze_Alignment.C nihh20160520Brian P. Walenz A src/main.mk nihh20160520Brian P. Walenz A src/overlapErrorAdjustment/findErrors-Dump.C nihh20160520Brian P. Walenz A src/overlapErrorAdjustment/findErrors-Dump.mk nihh20160520Brian P. Walenz A src/overlapErrorAdjustment/findErrors.C nihh20160520Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20160519Brian P. Walenz A src/meryl/libmeryl.C nihh20160519Brian P. Walenz A src/meryl/libmeryl.H nihh20160519Brian P. Walenz A src/meryl/meryl-dump.C nihh20160519Brian P. Walenz A src/meryl/libleaff/fastaFile.C nihh20160519Brian P. Walenz A src/meryl/libleaff/fastaStdin.C nihh20160519Brian P. Walenz A src/meryl/libleaff/fastqFile.C nihh20160519Brian P. Walenz A src/meryl/libleaff/fastqStdin.C nihh20160519Brian P. Walenz A src/meryl/libleaff/seqCache.C nihh20160519Brian P. Walenz A src/meryl/libleaff/seqFile.H nihh20160519Brian P. Walenz A src/correction/errorEstimate.C nihh20160518Sergey Koren A src/pipelines/canu/Defaults.pm nihh20160518Sergey Koren A src/pipelines/canu/ErrorEstimate.pm nihh20160518Sergey Koren A src/pipelines/canu.pl nihh20160518Sergey Koren A src/pipelines/canu/CorrectReads.pm nihh20160518Sergey Koren A src/mhap/mhap.mk nihh20160518Sergey Koren A src/pipelines/canu/Meryl.pm nihh20160518Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20160518Sergey Koren A src/pipelines/canu/Defaults.pm nihh20160518Sergey Koren A src/correction/errorEstimate.C nihh20160518Sergey Koren A src/pipelines/canu/ErrorEstimate.pm nihh20160518Sergey Koren A addCopyrights-BuildData.pl nihh20160518Brian P. Walenz A addCopyrights.dat nihh20160518Brian P. Walenz A addCopyrights.pl nihh20160518Brian P. 
Walenz D src/bogart/AS_BAT_MergeOrphans.C src/bogart/AS_BAT_PopBubbles.C D src/bogart/AS_BAT_MergeOrphans.H src/bogart/AS_BAT_PopBubbles.H D src/libkmer/driver-existDB.C kmer/libkmer/driver-existDB.C D src/libkmer/driver-posDB.C kmer/libkmer/driver-posDB.C D src/libkmer/existDB-create-from-fasta.C kmer/libkmer/existDB-create-from-fasta.C D src/libkmer/existDB-create-from-meryl.C kmer/libkmer/existDB-create-from-meryl.C D src/libkmer/existDB-create-from-sequence.C kmer/libkmer/existDB-create-from-sequence.C D src/libkmer/existDB-state.C kmer/libkmer/existDB-state.C D src/libkmer/existDB.C kmer/libkmer/existDB.C D src/libkmer/existDB.H kmer/libkmer/existDB.H D src/libkmer/existDB.mk kmer/libkmer/existDB.mk D src/libkmer/main.mk kmer/libkmer/main.mk D src/libkmer/percentCovered.C kmer/libkmer/percentCovered.C D src/libkmer/percentCovered.mk kmer/libkmer/percentCovered.mk D src/libkmer/posDB.mk kmer/libkmer/posDB.mk D src/libkmer/positionDB-access.C kmer/libkmer/positionDB-access.C D src/libkmer/positionDB-dump.C kmer/libkmer/positionDB-dump.C D src/libkmer/positionDB-file.C kmer/libkmer/positionDB-file.C D src/libkmer/positionDB-mismatch.C kmer/libkmer/positionDB-mismatch.C D src/libkmer/positionDB-sort.C kmer/libkmer/positionDB-sort.C D src/libkmer/positionDB.C kmer/libkmer/positionDB.C D src/libkmer/positionDB.H kmer/libkmer/positionDB.H A src/stores/ovStore.C nihh20170417Brian P. Walenz A src/stores/ovStore.C nihh20170412Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20170411Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20170411Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170411Brian P. Walenz A src/AS_global.C nihh20170411Brian P. Walenz A src/canu_version_update.pl nihh20170411Brian P. Walenz A src/pipelines/canu/Grid_SGE.pm nihh20170407Brian P. Walenz A documentation/source/parameter-reference.rst nihh20170407Brian P. Walenz A documentation/source/parameter-reference.rst nihh20170407Brian P. 
Walenz A documentation/source/tutorial.rst nihh20170407Brian P. Walenz A documentation/source/parameter-reference.rst nihh20170407Brian P. Walenz A documentation/source/parameter-reference.rst nihh20170407Brian P. Walenz A documentation/source/parameter-reference.rst nihh20170407Brian P. Walenz A src/pipelines/canu.pl nihh20170407Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20170407Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20170407Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170407Brian P. Walenz A src/pipelines/canu/Output.pm nihh20170407Brian P. Walenz A src/gfa/alignGFA.C nihh20170407Brian P. Walenz A src/stores/ovStore.C nihh20170406Brian P. Walenz A src/stores/ovStore.H nihh20170406Brian P. Walenz A documentation/source/faq.rst nihh20170405Brian P. Walenz A src/stores/ovStoreHistogram.C nihh20170405Brian P. Walenz A src/meryl/estimate-mer-threshold.C nihh20170405Brian P. Walenz A src/gfa/alignGFA.C nihh20170404Brian P. Walenz A src/gfa/gfa.C nihh20170404Brian P. Walenz A src/gfa/gfa.H nihh20170404Brian P. Walenz A src/gfa/alignGFA.C nihh20170404Brian P. Walenz A src/gfa/gfa.C nihh20170404Brian P. Walenz A src/gfa/gfa.H nihh20170404Brian P. Walenz A src/main.mk nihh20170404Brian P. Walenz A src/gfa/alignGFA.C nihh20170404Brian P. Walenz A src/gfa/alignGFA.mk nihh20170404Brian P. Walenz A src/main.mk nihh20170404Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.C nihh20170404Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H nihh20170404Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20170403Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20170330Brian P. Walenz A documentation/source/faq.rst nihh20170330Brian P. Walenz A src/stores/gatekeeperPartition.C nihh20170330Brian P. Walenz A src/pipelines/canu.pl nihh20170329Brian P. Walenz A src/stores/gatekeeperPartition.C nihh20170329Brian P. Walenz A src/stores/gkStore.C nihh20170329Brian P. Walenz A src/stores/gkStore.H nihh20170329Brian P. 
Walenz A documentation/source/parameter-reference.rst nihh20170329Brian P. Walenz A src/canu_version_update.pl nihh20170329Brian P. Walenz A src/main.mk nihh20170329Brian P. Walenz A src/utgcns/alignGFA.C nihh20170329Brian P. Walenz A src/utgcns/alignGFA.mk nihh20170329Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20170329Brian P. Walenz A documentation/source/faq.rst nihh20170328Sergey Koren A documentation/source/conf.py nihh20170328Sergey Koren A src/pipelines/canu/Gatekeeper.pm nihh20170327Brian P. Walenz A src/stores/tgStoreDump.C nihh20170327Brian P. Walenz A src/stores/tgTig.C nihh20170327Brian P. Walenz A src/stores/tgTig.H nihh20170327Brian P. Walenz A src/stores/tgTigMultiAlignDisplay.C nihh20170327Brian P. Walenz A src/pipelines/canu.pl nihh20170324Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.C nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/LICENSE nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/README nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/atomic.c nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/backtrace-supported.h nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/backtrace.c nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/backtrace.h nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/config.h nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/dwarf.c nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/elf.c nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/fileline.c nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/internal.h nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/make.out nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/make.sh nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/mmap.c nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/mmapio.c nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/posix.c nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/print.c nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/simple.c nihh20170322Brian P. 
Walenz A src/AS_UTL/libbacktrace/sort.c nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/state.c nihh20170322Brian P. Walenz A src/AS_UTL/libbacktrace/unknown.c nihh20170322Brian P. Walenz A src/Makefile nihh20170322Brian P. Walenz A src/main.mk nihh20170322Brian P. Walenz A src/Makefile nihh20170322Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.C nihh20170322Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.H nihh20170322Brian P. Walenz A src/AS_global.C nihh20170322Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.C nihh20170322Brian P. Walenz A src/pipelines/canu.pl nihh20170321Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170321Brian P. Walenz A src/pipelines/canu.pl nihh20170321Brian P. Walenz A src/pipelines/canu.pl nihh20170321Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170321Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170321Brian P. Walenz A src/pipelines/canu.pl nihh20170320Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170320Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170320Brian P. Walenz A src/AS_global.C nihh20170320Brian P. Walenz A src/AS_global.C nihh20170320Brian P. Walenz A documentation/source/parameter-reference.rst nihh20170320Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170320Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20170320Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170320Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20170320Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170320Brian P. Walenz A src/pipelines/canu.pl nihh20170320Brian P. Walenz A src/pipelines/canu/Report.pm nihh20170320Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20170320Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20170320Brian P. Walenz A src/Makefile nihh20170320Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20170320Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170320Brian P. 
Walenz A src/pipelines/canu/Gatekeeper.pm nihh20170320Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20170320Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170320Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170320Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20170320Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170320Brian P. Walenz A src/pipelines/canu/Report.pm nihh20170320Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20170320Brian P. Walenz A src/pipelines/canu/Grid_SGE.pm nihh20170316Brian P. Walenz A documentation/source/quick-start.rst nihh20170316Brian P. Walenz A documentation/source/faq.rst nihh20170316Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20170315Sergey Koren A src/pipelines/canu/OverlapMhap.pm nihh20170315Sergey Koren A src/pipelines/sanity/sanity.sh nihh20170314Brian P. Walenz A src/pipelines/sanity/medium.arabidopsis_thaliana.pacbio.p4c2.spec nihh20170314Brian P. Walenz A src/pipelines/sanity/medium.arabidopsis_thaliana.pacbio.p5c3.spec nihh20170314Brian P. Walenz A src/pipelines/sanity/medium.caenorhabditis_elegans.pacbio.p6c4.spec nihh20170314Brian P. Walenz A src/pipelines/sanity/medium.drosophila_melanogaster.pacbio.p5c3.spec nihh20170314Brian P. Walenz A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.h5-1000.spec nihh20170314Brian P. Walenz A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.h5-5000.spec nihh20170314Brian P. Walenz A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.sra-3000.spec nihh20170314Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_k12.pacbio.p6.spec nihh20170314Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_ne92.pacbio.p4.spec nihh20170314Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_ne92.pacbio.p5.spec nihh20170314Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_o157_h7_str_f8092b.pacbio.p4c2.average.spec nihh20170314Brian P. 
Walenz A src/pipelines/sanity/small.escherichia_coli_o157_h7_str_f8092b.pacbio.p4c2.long.spec nihh20170314Brian P. Walenz A src/pipelines/sanity/small.francisella_tularensis.pacbio.spec nihh20170314Brian P. Walenz A src/pipelines/sanity/small.saccharomyces_cerevisiae_glbrcy22-3.pacbio.spec nihh20170314Brian P. Walenz A src/pipelines/sanity/small.saccharomyces_cerevisiae_glbrcy22-3.pacbio.sra.spec nihh20170314Brian P. Walenz A src/pipelines/sanity/small.saccharomyces_cerevisiae_s288c.pacbio.spec nihh20170314Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170314Brian P. Walenz A src/pipelines/canu-object-store.pl nihh20170314Brian P. Walenz A src/overlapInCore/overlapPair.C nihh20170310Brian P. Walenz A src/overlapInCore/libedlib/edlib.C nihh20170310Brian P. Walenz A src/overlapInCore/libedlib/edlib.H nihh20170310Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20170310Brian P. Walenz A src/pipelines/sanity/medium.caenorhabditis_elegans.pacbio.p6c4.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/sanity.NOTES nihh20170310Brian P. Walenz A src/pipelines/sanity/sanity.sh nihh20170310Brian P. Walenz A src/pipelines/sanity/small.bacillus_anthracis_sterne.nanopore.34F2_NBI0483991.poretools.2D.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.all.2d.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-1.2d.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-2.2d.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-pcr-1.2d.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-pcr-2.2d.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.r9.4.superlong.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.r9.SpotOn.1d.spec nihh20170310Brian P. 
Walenz A src/pipelines/sanity/small.escherichia_coli_k12.pacbio.p6.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_ne92.pacbio.p4.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_ne92.pacbio.p5.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_o157_h7_str_f8092b.pacbio.p4c2.long.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/small.saccharomyces_cerevisiae_s288c.nanopore.r7.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/small.saccharomyces_cerevisiae_s288c.nanopore.r9.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/small.saccharomyces_cerevisiae_w303.nanopore.poretools.2D.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/small.yersinia_pestis.nanopore.NBI0499872.poretools.2D.spec nihh20170310Brian P. Walenz A src/pipelines/sanity/success.caenorhabditis_elegans.sh nihh20170310Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20170309Sergey Koren A src/pipelines/canu/Defaults.pm nihh20170309Sergey Koren A src/pipelines/canu/Meryl.pm nihh20170307Sergey Koren A src/pipelines/canu/OverlapMMap.pm nihh20170307Sergey Koren A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170303Sergey Koren A src/pipelines/canu/Defaults.pm nihh20170302Brian P. Walenz A documentation/source/parameter-reference.rst nihh20170223Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170302Sergey Koren A src/pipelines/canu/OverlapStore.pm nihh20170227Brian P. Walenz A src/main.mk nihh20170224Brian P. Walenz A src/merTrim/merTrim.C nihh20170224Brian P. Walenz A src/merTrim/merTrim.mk nihh20170224Brian P. Walenz A src/merTrim/merTrimResult.H nihh20170224Brian P. Walenz A src/main.mk nihh20170224Brian P. Walenz A src/meryl/existDB.C nihh20170224Brian P. Walenz A src/meryl/existDB.mk nihh20170224Brian P. Walenz A src/meryl/positionDB.C nihh20170224Brian P. Walenz A src/meryl/positionDB.mk nihh20170224Brian P. Walenz A src/main.mk nihh20170224Brian P. 
Walenz A src/meryl/libkmer/existDB-create-from-fasta.C nihh20170224Brian P. Walenz A src/meryl/libkmer/existDB-create-from-meryl.C nihh20170224Brian P. Walenz A src/meryl/libkmer/existDB-create-from-sequence.C nihh20170224Brian P. Walenz A src/meryl/libkmer/existDB-state.C nihh20170224Brian P. Walenz A src/meryl/libkmer/existDB.C nihh20170224Brian P. Walenz A src/meryl/libkmer/existDB.H nihh20170224Brian P. Walenz A src/meryl/libkmer/positionDB-access.C nihh20170224Brian P. Walenz A src/meryl/libkmer/positionDB-dump.C nihh20170224Brian P. Walenz A src/meryl/libkmer/positionDB-file.C nihh20170224Brian P. Walenz A src/meryl/libkmer/positionDB-mismatch.C nihh20170224Brian P. Walenz A src/meryl/libkmer/positionDB-sort.C nihh20170224Brian P. Walenz A src/meryl/libkmer/positionDB.C nihh20170224Brian P. Walenz A src/meryl/libkmer/positionDB.H nihh20170224Brian P. Walenz A addCopyrights.dat nihh20170224Brian P. Walenz A src/meryl/libkmer/driver-existDB.C nihh20170224Brian P. Walenz A src/meryl/libkmer/driver-posDB.C nihh20170224Brian P. Walenz A src/meryl/libkmer/existDB-create-from-fasta.C nihh20170224Brian P. Walenz A src/meryl/libkmer/existDB-create-from-meryl.C nihh20170224Brian P. Walenz A src/meryl/libkmer/existDB-create-from-sequence.C nihh20170224Brian P. Walenz A src/meryl/libkmer/existDB-state.C nihh20170224Brian P. Walenz A src/meryl/libkmer/existDB.C nihh20170224Brian P. Walenz A src/meryl/libkmer/existDB.H nihh20170224Brian P. Walenz A src/meryl/libkmer/existDB.mk nihh20170224Brian P. Walenz A src/meryl/libkmer/main.mk nihh20170224Brian P. Walenz A src/meryl/libkmer/percentCovered.C nihh20170224Brian P. Walenz A src/meryl/libkmer/percentCovered.mk nihh20170224Brian P. Walenz A src/meryl/libkmer/posDB.mk nihh20170224Brian P. Walenz A src/meryl/libkmer/positionDB-access.C nihh20170224Brian P. Walenz A src/meryl/libkmer/positionDB-dump.C nihh20170224Brian P. Walenz A src/meryl/libkmer/positionDB-file.C nihh20170224Brian P. 
Walenz A src/meryl/libkmer/positionDB-mismatch.C nihh20170224Brian P. Walenz A src/meryl/libkmer/positionDB-sort.C nihh20170224Brian P. Walenz A src/meryl/libkmer/positionDB.C nihh20170224Brian P. Walenz A src/meryl/libkmer/positionDB.H nihh20170224Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170224Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20170223Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170223Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170223Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170222Brian P. Walenz A src/pipelines/canu/Grid_Cloud.pm nihh20170222Brian P. Walenz A src/stores/tgStoreCoverageStat.C nihh20170222Brian P. Walenz A src/pipelines/canu.pl nihh20170222Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170222Brian P. Walenz A src/pipelines/canu/Grid_Cloud.pm nihh20170222Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170222Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170221Brian P. Walenz A src/pipelines/canu/Grid_DNANexus.pm nihh20170221Brian P. Walenz A documentation/source/parameter-reference.rst nihh20170221Brian P. Walenz A src/stores/ovStoreBucketizer.C nihh20170221Brian P. Walenz A src/stores/ovStoreBuild.C nihh20170221Brian P. Walenz A src/stores/ovStoreWriter.C nihh20170221Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170221Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170221Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20170221Brian P. Walenz A documentation/reST-markup-hints nihh20170221Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170221Brian P. Walenz A src/pipelines/canu/Grid_PBSTorque.pm nihh20170221Brian P. Walenz A src/pipelines/canu/Grid_Cloud.pm nihh20170221Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170220Brian P. Walenz A src/pipelines/canu/HTML.pm nihh20170220Brian P. Walenz A src/pipelines/canu/Grid_Cloud.pm nihh20170215Brian P. 
Walenz A src/pipelines/canu/Execution.pm nihh20170215Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170215Brian P. Walenz A src/Makefile nihh20170215Brian P. Walenz A src/pipelines/canu.pl nihh20170215Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20170215Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20170215Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170215Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170215Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20170215Brian P. Walenz A src/pipelines/canu/Grid_Cloud.pm nihh20170215Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20170215Brian P. Walenz A src/pipelines/canu/Output.pm nihh20170215Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170215Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170215Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20170215Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20170215Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170215Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170215Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20170215Brian P. Walenz A src/pipelines/canu.pl nihh20170213Brian P. Walenz A src/Makefile nihh20170211Brian P. Walenz A src/pipelines/canu.pl nihh20170211Brian P. Walenz A src/pipelines/canu/Grid_DNANexus.pm nihh20170211Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20170204Brian P. Walenz A src/utgcns/libNDFalcon/dw.C nihh20170214Sergey Koren A src/utgcns/libcns/unitigConsensus.C nihh20170214Sergey Koren A src/utgcns/libcns/unitigConsensus.C nihh20170214Sergey Koren A src/utgcns/libcns/unitigConsensus.C nihh20170214Sergey Koren A src/pipelines/canu.pl nihh20170213Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20170213Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20170213Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170213Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170213Brian P. 
Walenz A src/pipelines/canu/Meryl.pm nihh20170213Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170213Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20170213Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20170213Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170213Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170213Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20170213Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170213Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20170209Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20170209Sergey Koren A src/pipelines/canu/Consensus.pm nihh20170203Sergey Koren A src/utgcns/libcns/unitigConsensus.C nihh20170203Sergey Koren A src/utgcns/libcns/unitigConsensus.H nihh20170203Sergey Koren A src/utgcns/utgcns.C nihh20170203Sergey Koren A src/overlapInCore/liboverlap/prefixEditDistance-allocateMoreSpace.C nihh20170202Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-extend.C nihh20170202Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-forward.C nihh20170202Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C nihh20170202Brian P. Walenz A src/overlapInCore/liboverlap/prefixEditDistance.H nihh20170202Brian P. Walenz A src/overlapInCore/overlapInCore-Process_String_Overlaps.C nihh20170202Brian P. Walenz A src/pipelines/canu.pl nihh20170202Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20170202Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20170202Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170202Brian P. Walenz A src/pipelines/canu/ErrorEstimate.pm nihh20170202Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170202Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20170202Brian P. Walenz A src/pipelines/canu/HTML.pm nihh20170202Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20170202Brian P. 
Walenz A src/pipelines/canu/Output.pm nihh20170202Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170202Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170202Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20170202Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20170202Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170202Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170202Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20170202Brian P. Walenz A documentation/source/parameter-reference.rst nihh20170202Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20170202Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20170202Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170202Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170202Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20170202Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20170202Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170202Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170202Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20170202Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20170202Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170202Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170202Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20170202Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20170202Brian P. Walenz A src/stores/ovStoreBuild.C nihh20170201Brian P. Walenz A src/stores/ovStoreHistogram.C nihh20170201Brian P. Walenz A src/stores/ovStoreHistogram.H nihh20170201Brian P. Walenz A documentation/source/parameter-reference.rst nihh20170201Brian P. Walenz A documentation/source/tutorial.rst nihh20170201Brian P. Walenz A src/bogart/bogart.C nihh20170127Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170124Brian P. Walenz A documentation/source/parameter-reference.rst nihh20170124Brian P. 
Walenz A src/pipelines/canu/CorrectReads.pm nihh20170124Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170124Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20170117Brian P. Walenz A src/pipelines/canu/ErrorEstimate.pm nihh20170117Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20170117Brian P. Walenz A src/pipelines/canu/HTML.pm nihh20170117Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20170117Brian P. Walenz A src/pipelines/canu/Output.pm nihh20170117Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170117Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170117Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20170117Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20170117Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170117Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170117Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20170117Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170130Brian P. Walenz A documentation/source/parameter-reference.rst nihh20170127Brian P. Walenz A documentation/source/quick-start.rst nihh20170127Brian P. Walenz A documentation/source/tutorial.rst nihh20170127Brian P. Walenz A src/pipelines/canu.pl nihh20170127Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170127Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20170127Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20170127Brian P. Walenz A src/pipelines/canu/ErrorEstimate.pm nihh20170127Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20170127Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170127Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20170127Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170127Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170126Brian P. Walenz A src/pipelines/canu.pl nihh20170126Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170126Brian P. 
Walenz A src/pipelines/canu/OverlapInCore.pm nihh20170126Brian P. Walenz A src/falcon_sense/libfalcon/falcon.C nihh20170124Brian P. Walenz A src/falcon_sense/libfalcon/falcon.H nihh20170124Brian P. Walenz A src/falcon_sense/falcon_sense.C nihh20170124Brian P. Walenz A src/falcon_sense/libfalcon/falcon.C nihh20170124Brian P. Walenz A src/falcon_sense/libfalcon/falcon.H nihh20170124Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170123Brian P. Walenz A src/Makefile nihh20170123Brian P. Walenz A src/Makefile nihh20170123Brian P. Walenz A src/stores/ovStoreHistogram.C nihh20170123Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20170120Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20170120Brian P. Walenz A src/pipelines/canu.pl nihh20170120Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20170120Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170120Brian P. Walenz A src/pipelines/canu.pl nihh20170119Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20170119Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20170119Brian P. Walenz A src/pipelines/canu/ErrorEstimate.pm nihh20170119Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170119Brian P. Walenz A src/pipelines/canu/HTML.pm nihh20170119Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20170119Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170119Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170119Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20170119Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20170119Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170119Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20170119Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20170116Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C nihh20170113Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20170112Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20170112Brian P. 
Walenz A src/bogart/bogart.C nihh20170112Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20170110Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170110Brian P. Walenz A src/stores/ovStore.H nihh20170110Brian P. Walenz A src/stores/ovStoreHistogram.C nihh20170110Brian P. Walenz A src/stores/ovStoreHistogram.H nihh20170110Brian P. Walenz A src/mhap/mhapConvert.C nihh20170110Brian P. Walenz A src/minimap/mmapConvert.C nihh20170110Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20170110Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170110Brian P. Walenz A src/pipelines/parallel-ovl-store-test.sh nihh20170109Brian P. Walenz A documentation/reST-markup-hints nihh20170109Brian P. Walenz A src/pipelines/simple-repeat-test.pl nihh20170109Brian P. Walenz A documentation/source/parameter-reference.rst nihh20170109Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170107Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170107Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170107Brian P. Walenz A src/pipelines/canu/Grid_LSF.pm nihh20170107Brian P. Walenz A src/pipelines/canu/Grid_PBSTorque.pm nihh20170107Brian P. Walenz A src/pipelines/canu/Grid_SGE.pm nihh20170107Brian P. Walenz A src/pipelines/canu/Grid_Slurm.pm nihh20170107Brian P. Walenz A src/fastq-utilities/fastqSimulate.C nihh20170106Brian P. Walenz A src/falcon_sense/falcon_sense.mk nihh20170106Sergey Koren A src/falcon_sense/libfalcon/falcon.C nihh20170106Sergey Koren A src/falcon_sense/libfalcon/falcon.H nihh20170106Sergey Koren A src/main.mk nihh20170106Sergey Koren A src/overlapInCore/libedlib/edlib.C nihh20170106Sergey Koren A src/overlapInCore/libedlib/edlib.H nihh20170106Sergey Koren A documentation/source/parameter-reference.rst nihh20170106Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20170106Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170106Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170106Brian P. 
Walenz A src/pipelines/canu.pl nihh20170106Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170106Brian P. Walenz A src/pipelines/sanity/sanity.pl nihh20170104Brian P. Walenz A src/Makefile nihh20170104Brian P. Walenz A src/stores/ovOverlap.C nihh20170104Brian P. Walenz A src/stores/ovOverlap.H nihh20170104Brian P. Walenz A src/stores/ovStoreFile.C nihh20170104Brian P. Walenz A src/stores/ovStore.H nihh20170104Brian P. Walenz A src/correction/generateCorrectionLayouts.C nihh20170103Brian P. Walenz A src/stores/ovStoreBucketizer.C nihh20170103Brian P. Walenz A src/stores/ovStoreBuild.C nihh20170103Brian P. Walenz A src/stores/ovStoreFilter.C nihh20170103Brian P. Walenz A src/pipelines/sanity/sanity.sh nihh20161225Brian P. Walenz A src/pipelines/sanity/success.arabidopsis_thaliana.sh nihh20161225Brian P. Walenz A src/pipelines/sanity/success.bacillus_anthracis_sterne.sh nihh20161225Brian P. Walenz A src/pipelines/sanity/success.bibersteinia_trehalosi.sh nihh20161225Brian P. Walenz A src/pipelines/sanity/success.drosophila_melanogaster.sh nihh20161225Brian P. Walenz A src/pipelines/sanity/success.escherichia_coli_k12.sh nihh20161225Brian P. Walenz A src/pipelines/sanity/success.escherichia_coli_ne92.sh nihh20161225Brian P. Walenz A src/pipelines/sanity/success.escherichia_coli_o157_h7_str_f8092b.sh nihh20161225Brian P. Walenz A src/pipelines/sanity/success.francisella_tularensis.sh nihh20161225Brian P. Walenz A src/pipelines/sanity/success.yersinia_pestis_i195.sh nihh20161225Brian P. Walenz A src/pipelines/sanity/small.bacillus_anthracis_sterne.nanopore.34F2_NBI0483991.poretools.2D.spec nihh20161225Brian P. Walenz A src/pipelines/sanity/sanity.sh nihh20161225Brian P. Walenz A src/pipelines/sanity/small.saccharomyces_cerevisiae_glbrcy22-3.pacbio.spec nihh20161225Brian P. Walenz A src/pipelines/sanity/small.saccharomyces_cerevisiae_glbrcy22-3.pacbio.sra.spec nihh20161225Brian P. 
Walenz A src/pipelines/sanity/small.saccharomyces_cerevisiae_s288c.pacbio.spec nihh20161225Brian P. Walenz A src/pipelines/sanity/success.saccharomyces_cerevisiae_s288c.sh nihh20161225Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20161225Brian P. Walenz A src/pipelines/sanity/sanity.sh nihh20161222Brian P. Walenz A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.h5-1000.spec nihh20161222Brian P. Walenz A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.h5-5000.spec nihh20161222Brian P. Walenz A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.sra-3000.spec nihh20161222Brian P. Walenz A src/pipelines/sanity/medium.arabidopsis_thaliana.pacbio.p4c2.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/medium.arabidopsis_thaliana.pacbio.p5c3.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/medium.drosophila_melanogaster.pacbio.p5c3.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/sanity.sh nihh20161220Brian P. Walenz A src/pipelines/sanity/small.bacillus_anthracis_sterne.nanopore.34F2_NBI0483991.poretools.2D.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.h5-1000.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.h5-5000.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.sra-1000.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.all.2d.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-1.2d.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-2.2d.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-pcr-1.2d.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-pcr-2.2d.spec nihh20161220Brian P. 
Walenz A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.r9.SpotOn.1d.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_k12.pacbio.p6.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_ne92.pacbio.p4.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_ne92.pacbio.p5.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_o157_h7_str_f8092b.pacbio.p4c2.average.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.escherichia_coli_o157_h7_str_f8092b.pacbio.p4c2.long.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.francisella_tularensis.pacbio.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/small.yersinia_pestis.nanopore.NBI0499872.poretools.2D.spec nihh20161220Brian P. Walenz A src/pipelines/sanity/success.arabidopsis_thaliana.sh nihh20161220Brian P. Walenz A src/pipelines/sanity/success.bacillus_anthracis_sterne.sh nihh20161220Brian P. Walenz A src/pipelines/sanity/success.bibersteinia_trehalosi.sh nihh20161220Brian P. Walenz A src/pipelines/sanity/success.drosophila_melanogaster.sh nihh20161220Brian P. Walenz A src/pipelines/sanity/success.escherichia_coli_k12.sh nihh20161220Brian P. Walenz A src/pipelines/sanity/success.escherichia_coli_ne92.sh nihh20161220Brian P. Walenz A src/pipelines/sanity/success.escherichia_coli_o157_h7_str_f8092b.sh nihh20161220Brian P. Walenz A src/pipelines/sanity/success.francisella_tularensis.sh nihh20161220Brian P. Walenz A src/pipelines/sanity/success.yersinia_pestis_i195.sh nihh20161220Brian P. Walenz A src/pipelines/sanity/sanity.pl nihh20161220Brian P. Walenz A src/stores/tgStoreDump.C nihh20161219Brian P. Walenz A src/stores/tgStoreLoad.C nihh20161219Brian P. Walenz A src/stores/tgStore.C nihh20161219Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.C nihh20161219Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H nihh20161219Brian P. Walenz A src/AS_UTL/bitPackedFile.C nihh20161219Brian P. 
Walenz A src/pipelines/canu/Configure.pm nihh20161219Brian P. Walenz A src/meryl/estimate-mer-threshold.C nihh20161216Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20161216Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20161214Brian P. Walenz A src/pipelines/canu/Output.pm nihh20161214Brian P. Walenz A src/meryl/meryl-build.C nihh20161214Brian P. Walenz A src/correction/generateCorrectionLayouts.C nihh20161214Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20161214Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20161214Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20161214Brian P. Walenz A buildRelease.sh nihh20161213Brian P. Walenz A buildRelease.sh nihh20161213Brian P. Walenz A README.licenses nihh20161213Brian P. Walenz A src/utgcns/libpbutgcns/LICENSE nihh20161213Brian P. Walenz A buildRelease.sh nihh20161213Brian P. Walenz A src/pipelines/canu.pl nihh20161212Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20161212Brian P. Walenz A src/pipelines/canu/ErrorEstimate.pm nihh20161212Brian P. Walenz A src/stores/ovStoreDump.C nihh20161209Brian P. Walenz A src/stores/ovOverlap.C nihh20161208Brian P. Walenz A src/stores/ovOverlap.H nihh20161208Brian P. Walenz A src/stores/ovStoreFile.C nihh20161208Brian P. Walenz A addCopyrights.dat nihh20161207Brian P. Walenz A addCopyrights.pl nihh20161207Brian P. Walenz A src/bogart/AS_BAT_MergeOrphans.C nihh20161207Brian P. Walenz A src/bogart/AS_BAT_MergeOrphans.H nihh20161207Brian P. Walenz A src/bogart/bogart.C nihh20161207Brian P. Walenz A src/bogart/bogart.mk nihh20161207Brian P. Walenz A src/main.mk nihh20161206Brian P. Walenz A src/meryl/maskMers.C nihh20161206Brian P. Walenz A src/meryl/maskMers.mk nihh20161206Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20161206Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20161206Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20161206Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20161206Brian P. 
Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20161206Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20161206Brian P. Walenz A src/bogart/bogart.C nihh20161206Brian P. Walenz A src/bogart/AS_BAT_ReadInfo.C nihh20161206Brian P. Walenz A src/bogart/AS_BAT_ReadInfo.H nihh20161206Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20161205Brian P. Walenz A src/stores/ovStoreStats.C nihh20161205Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.H nihh20161202Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.H nihh20161202Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_TigGraph.H nihh20161202Brian P. Walenz A src/bogart/bogart.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.H nihh20161202Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.H nihh20161202Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_TigGraph.H nihh20161202Brian P. Walenz A src/bogart/bogart.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161201Brian P. Walenz A src/bogart/bogart.C nihh20161201Brian P. Walenz A src/pipelines/canu/Output.pm nihh20161201Brian P. Walenz A src/stores/gkStore.C nihh20161130Brian P. Walenz A src/overlapErrorAdjustment/findErrors.C nihh20161130Brian P. Walenz A src/stores/gkStore.C nihh20161130Brian P. Walenz A src/stores/ovStore.C nihh20161130Brian P. Walenz A src/fastq-utilities/fastqSimulate.C nihh20161130Brian P. Walenz A src/overlapInCore/overlapInCorePartition.C nihh20161130Brian P. Walenz A src/overlapInCore/overlapInCorePartition.C nihh20161130Brian P. Walenz A src/stores/tgStoreDump.C nihh20161130Brian P. 
Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C nihh20161130Brian P. Walenz A src/overlapInCore/overlapInCore.C nihh20161130Brian P. Walenz A src/overlapInCore/overlapInCore.H nihh20161130Brian P. Walenz A src/stores/gkStore.C nihh20161130Brian P. Walenz A src/stores/ovOverlap.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20161130Brian P. Walenz A src/overlapInCore/overlapImport.C nihh20161130Brian P. Walenz A src/stores/tgStoreDump.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161130Brian P. Walenz A src/erateEstimate/erateEstimate.C nihh20161130Brian P. Walenz A src/stores/ovStore.H nihh20161130Brian P. Walenz A src/stores/ovStoreWriter.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20161130Brian P. Walenz A src/stores/ovStore.H nihh20161130Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C nihh20161130Brian P. Walenz A src/stores/ovStore.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20161129Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20161129Brian P. Walenz A src/fastq-utilities/fastqSample.C nihh20161129Brian P. Walenz A src/fastq-utilities/fastqSimulate.C nihh20161129Brian P. Walenz A src/meryl/leaff-partition.C nihh20161129Brian P. Walenz A src/overlapErrorAdjustment/findErrors-Read_Frags.C nihh20161129Brian P. Walenz A src/stores/gatekeeperCreate.C nihh20161129Brian P. Walenz A src/stores/tgStoreDump.C nihh20161129Brian P. Walenz A src/utgcns/stashContains.C nihh20161129Brian P. Walenz A src/stores/tgStoreDump.C nihh20161129Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps.H nihh20161129Brian P. 
Walenz A src/overlapErrorAdjustment/findErrors.H nihh20161129Brian P. Walenz A src/utgcns/libNDFalcon/dw.H nihh20161129Brian P. Walenz A src/overlapErrorAdjustment/findErrors.H nihh20161129Brian P. Walenz A src/main.mk nihh20161129Brian P. Walenz A src/stores/gkStore.H nihh20161129Brian P. Walenz A src/stores/gkStore.H nihh20161129Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C nihh20161129Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20161129Brian P. Walenz A src/overlapInCore/overlapInCore.C nihh20161129Brian P. Walenz A src/overlapInCore/overlapInCore.H nihh20161129Brian P. Walenz A src/bogart/bogart.C nihh20161129Brian P. Walenz A src/stores/ovStoreDump.C nihh20161129Brian P. Walenz A src/stores/ovStoreWriter.C nihh20161129Brian P. Walenz A src/stores/ovStoreBucketizer.C nihh20161129Brian P. Walenz A src/stores/ovStoreIndexer.C nihh20161129Brian P. Walenz A src/stores/ovStoreSorter.C nihh20161129Brian P. Walenz A src/overlapInCore/overlapImport.C nihh20161129Brian P. Walenz A src/falcon_sense/falcon_sense.C nihh20161128Brian P. Walenz A src/meryl/libmeryl.C nihh20161123Brian P. Walenz A src/meryl/meryl-build.C nihh20161122Brian P. Walenz A src/stores/tgTig.C nihh20161122Brian P. Walenz A src/stores/tgStoreDump.C nihh20161122Brian P. Walenz A src/overlapInCore/overlapInCore-Output.C nihh20161122Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20161122Brian P. Walenz A src/meryl/leaff-statistics.C nihh20161122Brian P. Walenz A src/stores/ovStoreDump.C nihh20161122Brian P. Walenz A src/meryl/libleaff/gkStoreFile.C nihh20161122Brian P. Walenz A src/meryl/libleaff/fastqStdin.C nihh20161122Brian P. Walenz A src/stores/tgTigMultiAlignDisplay.C nihh20161122Brian P. Walenz A src/stores/ovOverlap.H nihh20161122Brian P. Walenz A src/overlapErrorAdjustment/findErrors.C nihh20161122Brian P. Walenz A src/overlapErrorAdjustment/findErrors.H nihh20161122Brian P. 
Walenz A src/utgcns/libcns/abAbacus-refine.C nihh20161122Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.C nihh20161122Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.C nihh20161122Brian P. Walenz A src/AS_global.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_Logging.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_TigVector.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20161122Brian P. Walenz A src/bogart/bogart.C nihh20161122Brian P. Walenz A src/bogart/buildGraph.C nihh20161122Brian P. Walenz A src/bogus/bogus.C nihh20161122Brian P. Walenz A src/bogus/bogusness.C nihh20161122Brian P. Walenz A src/correction/filterCorrectionOverlaps.C nihh20161122Brian P. Walenz A src/correction/generateCorrectionLayouts.C nihh20161122Brian P. Walenz A src/erateEstimate/erateEstimate.C nihh20161122Brian P. Walenz A src/falcon_sense/createFalconSenseInputs.C nihh20161122Brian P. Walenz A src/fastq-utilities/fastqSample.C nihh20161122Brian P. Walenz A src/fastq-utilities/fastqSimulate.C nihh20161122Brian P. Walenz A src/merTrim/merTrim.C nihh20161122Brian P. Walenz A src/mercy/mercy.C nihh20161122Brian P. Walenz A src/meryl/compare-counts.C nihh20161122Brian P. Walenz A src/meryl/leaff-partition.C nihh20161122Brian P. Walenz A src/meryl/leaff.C nihh20161122Brian P. Walenz A src/meryl/libmeryl.C nihh20161122Brian P. Walenz A src/meryl/maskMers.C nihh20161122Brian P. 
Walenz A src/meryl/meryl-args.C nihh20161122Brian P. Walenz A src/meryl/meryl-build.C nihh20161122Brian P. Walenz A src/overlapBasedTrimming/splitReads.C nihh20161122Brian P. Walenz A src/overlapBasedTrimming/trimReads-bestEdge.C nihh20161122Brian P. Walenz A src/overlapBasedTrimming/trimReads.C nihh20161122Brian P. Walenz A src/overlapBasedTrimming/trimStat.H nihh20161122Brian P. Walenz A src/overlapInCore/overlapInCorePartition.C nihh20161122Brian P. Walenz A src/stores/gatekeeperCreate.C nihh20161122Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C nihh20161122Brian P. Walenz A src/stores/gatekeeperPartition.C nihh20161122Brian P. Walenz A src/stores/gkStore.C nihh20161122Brian P. Walenz A src/stores/ovStore.C nihh20161122Brian P. Walenz A src/stores/ovStore.H nihh20161122Brian P. Walenz A src/stores/ovStoreBucketizer.C nihh20161122Brian P. Walenz A src/stores/ovStoreBuild.C nihh20161122Brian P. Walenz A src/stores/ovStoreDump.C nihh20161122Brian P. Walenz A src/stores/ovStoreHistogram.C nihh20161122Brian P. Walenz A src/stores/ovStoreSorter.C nihh20161122Brian P. Walenz A src/stores/ovStoreStats.C nihh20161122Brian P. Walenz A src/stores/ovStoreWriter.C nihh20161122Brian P. Walenz A src/stores/tgStore.C nihh20161122Brian P. Walenz A src/stores/tgStore.H nihh20161122Brian P. Walenz A src/stores/tgStoreCoverageStat.C nihh20161122Brian P. Walenz A src/stores/tgStoreDump.C nihh20161122Brian P. Walenz A src/stores/tgStoreFilter.C nihh20161122Brian P. Walenz A src/stores/tgTig.C nihh20161122Brian P. Walenz A src/stores/tgTigMultiAlignDisplay.C nihh20161122Brian P. Walenz A src/utgcns/utgcns.C nihh20161122Brian P. Walenz A src/stores/tgTig.C nihh20161122Brian P. Walenz A src/stores/tgTig.C nihh20161122Brian P. Walenz A src/AS_UTL/bitPackedFile.C nihh20161122Brian P. Walenz A src/AS_UTL/bitPackedFile.C nihh20161122Brian P. Walenz A src/AS_UTL/bitPackedFile.C nihh20161122Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.C nihh20161122Brian P. 
Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20161121Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20161121Brian P. Walenz A addCopyrights-BuildData.pl nihh20161121Brian P. Walenz A addCopyrights.dat nihh20161121Brian P. Walenz A addCopyrights.pl nihh20161121Brian P. Walenz D src/bogart/AS_BAT_OptimizePositions.C src/bogart/AS_BAT_Unitig.C A src/pipelines/canu/Consensus.pm nihh20170811Brian P. Walenz A src/overlapBasedTrimming/splitReads.C nihh20170811Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20170810Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20170810Brian P. Walenz A src/bogart/bogart.C nihh20170810Brian P. Walenz A src/AS_UTL/timeAndSize.C nihh20170810Brian P. Walenz A src/AS_UTL/timeAndSize.H nihh20170810Brian P. Walenz A src/AS_global.C nihh20170810Brian P. Walenz A documentation/source/faq.rst nihh20170809Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.C nihh20170809Brian P. Walenz A src/bogart/AS_BAT_OptimizePositions.C nihh20170809Brian P. Walenz A src/bogart/AS_BAT_Logging.C nihh20170808Brian P. Walenz A src/bogart/AS_BAT_Logging.H nihh20170808Brian P. Walenz A src/bogart/AS_BAT_MergeOrphans.C nihh20170808Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20170808Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20170808Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20170808Brian P. Walenz A src/bogart/bogart.C nihh20170808Brian P. Walenz A documentation/source/faq.rst nihh20170808Brian P. Walenz A documentation/source/parameter-reference.rst nihh20170808Brian P. Walenz A documentation/source/quick-start.rst nihh20170808Brian P. Walenz A documentation/source/tutorial.rst nihh20170808Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170808Brian P. Walenz A src/overlapBasedTrimming/splitReads-workUnit.C nihh20170808Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170808Brian P. Walenz A src/overlapInCore/overlapInCorePartition.C nihh20170804Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170802Brian P. 
Walenz A src/pipelines/canu.pl nihh20170802Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170802Brian P. Walenz A documentation/source/tutorial.rst nihh20170802Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20170802Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20170801Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20170801Brian P. Walenz A README.md nihh20170731Sergey Koren A src/bogart/AS_BAT_OverlapCache.C nihh20170731Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20170731Brian P. Walenz A src/bogart/bogart.C nihh20170728Brian P. Walenz A src/stores/gatekeeperPartition.C nihh20170728Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170728Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20170728Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170728Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20170728Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20170728Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170728Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20170728Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170728Brian P. Walenz A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170728Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20170728Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20170728Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20170728Brian P. Walenz A src/stores/gkStore.C nihh20170728Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.C nihh20170728Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.H nihh20170728Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20170728Brian P. Walenz A src/pipelines/canu/Output.pm nihh20170728Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20170727Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20170727Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20170727Brian P. Walenz A src/bogart/AS_BAT_ReadInfo.H nihh20170727Brian P. 
Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20170727Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.H nihh20170727Brian P. Walenz A src/bogart/AS_BAT_ReadInfo.C nihh20170727Brian P. Walenz A src/bogart/AS_BAT_ReadInfo.H nihh20170727Brian P. Walenz A src/bogart/bogart.C nihh20170727Brian P. Walenz A documentation/source/quick-start.rst nihh20170727Sergey Koren A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170727Sergey Koren A src/bogart/AS_BAT_CreateUnitigs.C nihh20170727Brian P. Walenz A src/bogart/AS_BAT_TigVector.C nihh20170727Brian P. Walenz A src/gfa/alignGFA.C nihh20170725Brian P. Walenz A src/minimap/mmapConvert.C nihh20170723Sergey Koren A src/pipelines/canu/OverlapMMap.pm nihh20170723Sergey Koren A src/bogart/AS_BAT_OptimizePositions.C nihh20170718Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20170718Brian P. Walenz A src/utgcns/libcns/unitigConsensus.C nihh20170718Brian P. Walenz A src/bogart/AS_BAT_OptimizePositions.C nihh20170717Brian P. Walenz A src/bogart/AS_BAT_TigVector.C nihh20170717Brian P. Walenz A addCopyrights.dat nihh20170717Brian P. Walenz A src/bogart/AS_BAT_OptimizePositions.C nihh20170717Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20170717Brian P. Walenz A src/bogart/bogart.mk nihh20170717Brian P. Walenz A src/pipelines/canu/Configure.pm nihh20170717Brian P. Walenz A src/bogart/bogart.C nihh20170717Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170717Brian P. Walenz A src/bogart/AS_BAT_Logging.C nihh20170717Brian P. Walenz A src/bogart/bogart.C nihh20170714Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20170714Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20170714Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20170714Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20170714Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20170714Brian P. Walenz A src/bogart/bogart.C nihh20170714Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20170713Brian P. 
Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20170713Brian P. Walenz A src/pipelines/canu.pl nihh20170713Brian P. Walenz A src/pipelines/canu.pl nihh20170713Brian P. Walenz A src/utgcns/utgcns.C nihh20170712Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20170712Brian P. Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20170712Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20170712Brian P. Walenz A src/bogart/AS_BAT_DropDeadEnds.C nihh20170712Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20170712Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20170712Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20170712Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20170712Brian P. Walenz A src/bogart/AS_BAT_PromoteToSingleton.C nihh20170712Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20170708Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20170707Brian P. Walenz A src/stores/ovStoreDump.C nihh20170707Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20170706Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20170706Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20170706Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.H nihh20170706Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20170706Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20170706Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20170705Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20170705Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20170705Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20170705Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20170703Brian P. Walenz A src/bogart/bogart.C nihh20170703Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20170703Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20170703Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20170703Brian P. 
Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20170703Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20170702Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20170629Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20170629Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20170629Brian P. Walenz A src/bogart/AS_BAT_MergeOrphans.C nihh20170629Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20170629Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20170629Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20170629Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20170629Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20170629Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20170629Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20170629Brian P. Walenz A src/pipelines/canu/Consensus.pm nihh20170629Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20170629Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20170629Brian P. Walenz A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170629Brian P. Walenz A src/pipelines/canu/OverlapInCore.pm nihh20170629Brian P. Walenz A src/pipelines/canu/OverlapMMap.pm nihh20170629Brian P. Walenz A src/pipelines/canu/OverlapMhap.pm nihh20170629Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170629Brian P. Walenz A src/pipelines/canu/Unitig.pm nihh20170629Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170629Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170629Brian P. Walenz A src/overlapErrorAdjustment/findErrors.H nihh20170629Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170629Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170629Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170629Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20170629Brian P. Walenz A src/stores/ovStoreDump.C nihh20170629Brian P. Walenz A src/AS_UTL/bitPackedFile.C nihh20170628Brian P. 
Walenz A src/meryl/libmeryl.C nihh20170627Brian P. Walenz A src/AS_UTL/bitPackedFile.C nihh20170627Brian P. Walenz A src/pipelines/canu/Gatekeeper.pm nihh20170627Brian P. Walenz A src/pipelines/canu.pl nihh20170627Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170627Brian P. Walenz A src/pipelines/canu/Grid_PBSTorque.pm nihh20170626Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170626Brian P. Walenz A src/stores/ovStoreSorter.C nihh20170626Brian P. Walenz A src/stores/ovStoreSorter.C nihh20170626Brian P. Walenz A src/stores/ovStoreWriter.C nihh20170626Brian P. Walenz A src/stores/ovStoreBucketizer.C nihh20170626Brian P. Walenz A src/stores/ovStoreIndexer.C nihh20170626Brian P. Walenz A src/stores/ovStoreSorter.C nihh20170626Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20170626Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20170626Brian P. Walenz A src/gfa/alignGFA.C nihh20170624Brian P. Walenz A src/AS_UTL/AS_UTL_reverseComplement.C nihh20170624Brian P. Walenz A src/AS_UTL/AS_UTL_reverseComplement.H nihh20170624Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20170623Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20170623Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.H nihh20170623Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20170623Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20170623Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20170623Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20170622Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20170622Brian P. Walenz A src/bogart/AS_BAT_Unitig_PlaceReadUsingEdges.C nihh20170621Brian P. Walenz A src/bogart/AS_BAT_TigVector.C nihh20170621Brian P. Walenz A src/bogart/AS_BAT_TigVector.H nihh20170621Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20170621Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20170621Brian P. Walenz A src/bogart/bogart.C nihh20170621Brian P. 
Walenz
A src/bogart/AS_BAT_PlaceReadUsingOverlaps.H nihh20170621Brian P. Walenz
A src/bogart/AS_BAT_CreateUnitigs.C nihh20170620Brian P. Walenz
A src/bogart/AS_BAT_Unitig.H nihh20170620Brian P. Walenz
A src/bogart/AS_BAT_TigGraph.C nihh20170615Brian P. Walenz
A src/bogart/AS_BAT_TigGraph.C nihh20170615Brian P. Walenz
A src/bogart/bogart.C nihh20170615Brian P. Walenz
A src/bogart/AS_BAT_CreateUnitigs.C nihh20170614Brian P. Walenz
A src/bogart/AS_BAT_CreateUnitigs.H nihh20170614Brian P. Walenz
A src/bogart/AS_BAT_MarkRepeatReads.C nihh20170614Brian P. Walenz
A src/bogart/AS_BAT_MarkRepeatReads.H nihh20170614Brian P. Walenz
A src/bogart/bogart.C nihh20170614Brian P. Walenz
A src/bogart/bogart.C nihh20170614Brian P. Walenz
A src/pipelines/canu/Unitig.pm nihh20170614Brian P. Walenz
A src/pipelines/canu/Unitig.pm nihh20170614Brian P. Walenz
A src/bogart/AS_BAT_DropDeadEnds.C nihh20170612Brian P. Walenz
A src/bogart/AS_BAT_DropDeadEnds.C nihh20170612Brian P. Walenz
A src/bogart/bogart.C nihh20170609Brian P. Walenz
A src/pipelines/canu.pl nihh20170609Brian P. Walenz
A src/overlapBasedTrimming/splitReads-trimBad.C nihh20170613Sergey Koren
A src/fastq-utilities/fastqSample.C nihh20170613Sergey Koren
A src/utgcns/libcns/unitigConsensus.C nihh20170612Brian P. Walenz
A src/utgcns/utgcns.C nihh20170612Brian P. Walenz
A src/gfa/alignGFA.C nihh20170612Brian P. Walenz
A src/gfa/alignGFA.C nihh20170612Brian P. Walenz
A src/falcon_sense/libfalcon/falcon.C nihh20170609Sergey Koren
A src/overlapInCore/libedlib/edlib.C nihh20170609Brian P. Walenz
A src/overlapInCore/libedlib/edlib.H nihh20170609Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170608Brian P. Walenz
A src/bogart/AS_BAT_CreateUnitigs.C nihh20170607Brian P. Walenz
A src/bogart/AS_BAT_CreateUnitigs.H nihh20170607Brian P. Walenz
A src/bogart/bogart.C nihh20170607Brian P. Walenz
A src/bogart/bogart.C nihh20170607Brian P. Walenz
A src/utgcns/libcns/unitigConsensus.C nihh20170607Brian P. Walenz
A src/utgcns/libcns/unitigConsensus.C nihh20170607Brian P. Walenz
A src/utgcns/libcns/unitigConsensus.C nihh20170607Brian P. Walenz
A src/utgcns/libcns/unitigConsensus.H nihh20170607Brian P. Walenz
A src/utgcns/utgcns.C nihh20170607Brian P. Walenz
A src/utgcns/libboost/boost/config/compiler/intel.hpp nihh20170606Sergey Koren
A src/bogart/AS_BAT_DropDeadEnds.C nihh20170531Brian P. Walenz
A src/bogart/AS_BAT_DropDeadEnds.H nihh20170531Brian P. Walenz
A src/bogart/bogart.C nihh20170531Brian P. Walenz
A src/bogart/bogart.mk nihh20170531Brian P. Walenz
A src/bogart/bogart.C nihh20170530Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170530Brian P. Walenz
A src/pipelines/canu/OverlapStore.pm nihh20170530Brian P. Walenz
A src/bogart/AS_BAT_AssemblyGraph.C nihh20170525Brian P. Walenz
A src/bogart/AS_BAT_Unitig.H nihh20170525Brian P. Walenz
A src/pipelines/canu.pl nihh20170522Brian P. Walenz
A src/pipelines/canu.pl nihh20170522Brian P. Walenz
A src/pipelines/canu/Consensus.pm nihh20170519Brian P. Walenz
A src/bogart/AS_BAT_CreateUnitigs.C nihh20170518Brian P. Walenz
A src/bogart/AS_BAT_CreateUnitigs.C nihh20170518Brian P. Walenz
A documentation/source/faq.rst nihh20170518Sergey Koren
A src/AS_UTL/writeBuffer.H nihh20170517Sergey Koren
A src/stores/gkStore.C nihh20170517Sergey Koren
A .github/ISSUE_TEMPLATE.md nihh20170517Brian P. Walenz
A src/pipelines/canu.pl nihh20170516Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170516Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170516Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170516Brian P. Walenz
A src/pipelines/canu/Configure.pm nihh20170516Brian P. Walenz
A src/utgcns/libcns/unitigConsensus.C nihh20170515Brian P. Walenz
A src/pipelines/canu/Consensus.pm nihh20170512Brian P. Walenz
A src/pipelines/canu/Output.pm nihh20170512Brian P. Walenz
A src/gfa/alignGFA.C nihh20170512Brian P. Walenz
A src/gfa/alignGFA.C nihh20170512Brian P. Walenz
A src/gfa/gfa.C nihh20170512Brian P. Walenz
A src/gfa/gfa.H nihh20170512Brian P. Walenz
A src/gfa/gfa.C nihh20170512Brian P. Walenz
A src/gfa/bed.C nihh20170512Brian P. Walenz
A src/gfa/bed.H nihh20170512Brian P. Walenz
A src/main.mk nihh20170512Brian P. Walenz
A src/bogart/AS_BAT_TigGraph.C nihh20170512Brian P. Walenz
A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20170512Brian P. Walenz
A src/pipelines/canu/Configure.pm nihh20170512Sergey Koren
A src/canu_version_update.pl nihh20170511Brian P. Walenz
A src/Makefile nihh20170511Brian P. Walenz
A src/pipelines/canu/Grid_PBSTorque.pm nihh20170510Brian P. Walenz
A src/stores/tgStore.C nihh20170510Brian P. Walenz
A src/stores/tgTig.C nihh20170510Brian P. Walenz
A src/bogart/AS_BAT_MergeOrphans.C nihh20170510Brian P. Walenz
A src/bogart/AS_BAT_OverlapCache.H nihh20170509Brian P. Walenz
A src/bogart/AS_BAT_ReadInfo.H nihh20170509Brian P. Walenz
A src/overlapInCore/liboverlap/Binomial_Bound.C nihh20170509Brian P. Walenz
A src/main.mk nihh20170509Brian P. Walenz
A src/utgcns/libcns/unitigConsensus.C nihh20170509Brian P. Walenz
A src/utgcns/libcns/unitigConsensus.H nihh20170509Brian P. Walenz
A src/utgcns/libpbutgcns/Alignment.H nihh20170509Brian P. Walenz
A src/utgcns/libpbutgcns/AlnGraphBoost.C nihh20170509Brian P. Walenz
A src/utgcns/libpbutgcns/AlnGraphBoost.H nihh20170509Brian P. Walenz
A src/utgcns/utgcns.C nihh20170509Brian P. Walenz
A src/overlapInCore/libedlib/edlib.C nihh20170508Brian P. Walenz
A src/AS_UTL/AS_UTL_fileIO.C nihh20170508Brian P. Walenz
A src/utgcns/utgcns.C nihh20170508Brian P. Walenz
A src/utgcns/utgcns.C nihh20170508Brian P. Walenz
A src/utgcns/utgcns.C nihh20170508Brian P. Walenz
A src/pipelines/canu/Consensus.pm nihh20170509Brian P. Walenz
A src/falcon_sense/falcon_sense.C nihh20170503Sergey Koren
A src/pipelines/canu/Consensus.pm nihh20170425Brian P. Walenz
A src/pipelines/canu/CorrectReads.pm nihh20170425Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170425Brian P. Walenz
A src/pipelines/canu/Meryl.pm nihh20170425Brian P. Walenz
A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170425Brian P. Walenz
A src/pipelines/canu/OverlapInCore.pm nihh20170425Brian P. Walenz
A src/pipelines/canu/OverlapMMap.pm nihh20170425Brian P. Walenz
A src/pipelines/canu/OverlapMhap.pm nihh20170425Brian P. Walenz
A src/pipelines/canu/OverlapStore.pm nihh20170425Brian P. Walenz
A src/pipelines/canu/Unitig.pm nihh20170425Brian P. Walenz
A documentation/source/index.rst nihh20170422Sergey Koren
A README.md nihh20170420Sergey Koren
A README.md nihh20170420Sergey Koren
A README.citation nihh20170420Sergey Koren
A buildRelease.sh nihh20170420Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170420Sergey Koren
A src/pipelines/canu/Grid_LSF.pm nihh20170420Sergey Koren
A src/pipelines/canu.pl nihh20170420Sergey Koren
A src/pipelines/canu/Consensus.pm nihh20170420Sergey Koren
A src/pipelines/canu/OverlapMhap.pm nihh20170420Sergey Koren
A documentation/source/faq.rst nihh20170417Sergey Koren
A src/canu_version_update.pl nihh20170417Brian P. Walenz
A addCopyrights-BuildData.pl nihh20170417Brian P. Walenz
A addCopyrights.dat nihh20170417Brian P. Walenz
A addCopyrights.pl nihh20170417Brian P. Walenz
A src/AS_UTL/AS_UTL_stackTrace.C nihh20170417Brian P. Walenz
A src/bogart/AS_BAT_CreateUnitigs.C nihh20170417Brian P. Walenz
A src/bogart/AS_BAT_MergeOrphans.H nihh20170417Brian P. Walenz
A src/falcon_sense/falcon_sense.C nihh20170417Brian P. Walenz
A src/gfa/alignGFA.C nihh20170417Brian P. Walenz
A src/gfa/gfa.C nihh20170417Brian P. Walenz
A src/gfa/gfa.H nihh20170417Brian P. Walenz
A src/meryl/compare-counts.C nihh20170417Brian P. Walenz
A src/meryl/maskMers.C nihh20170417Brian P. Walenz
A src/overlapInCore/liboverlap/prefixEditDistance-allocateMoreSpace.C nihh20170417Brian P. Walenz
A src/overlapInCore/liboverlap/prefixEditDistance-forward.C nihh20170417Brian P. Walenz
A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C nihh20170417Brian P. Walenz
A src/overlapInCore/overlapPair.C nihh20170417Brian P. Walenz
A src/pipelines/canu-object-store.pl nihh20170417Brian P. Walenz
A src/pipelines/canu/Configure.pm nihh20170417Brian P. Walenz
A src/pipelines/canu/ErrorEstimate.pm nihh20170417Brian P. Walenz
A src/pipelines/canu/Grid_Cloud.pm nihh20170417Brian P. Walenz
A src/pipelines/canu/Grid_DNANexus.pm nihh20170417Brian P. Walenz
A src/pipelines/canu/Meryl.pm nihh20170417Brian P. Walenz
A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170417Brian P. Walenz
A src/pipelines/canu/OverlapMMap.pm nihh20170417Brian P. Walenz
A src/pipelines/canu/Report.pm nihh20170417Brian P. Walenz
A src/pipelines/simple-repeat-test.pl nihh20170417Brian P. Walenz
A src/utgcns/libcns/unitigConsensus.C nihh20170417Brian P. Walenz
A src/pipelines/simple-repeat-test.pl nihh20170417Brian P. Walenz
A src/stores/ovStore.C nihh20170417Brian P. Walenz
A documentation/source/faq.rst nihh20170413Sergey Koren
A documentation/source/conf.py nihh20170413Sergey Koren
A documentation/source/quick-start.rst nihh20170413Sergey Koren
A src/stores/ovStore.C nihh20170412Brian P. Walenz
A src/pipelines/canu/Configure.pm nihh20170411Brian P. Walenz
A src/pipelines/canu/Consensus.pm nihh20170411Brian P. Walenz
A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170411Brian P. Walenz
A src/AS_global.C nihh20170411Brian P. Walenz
A src/canu_version_update.pl nihh20170411Brian P. Walenz
A src/pipelines/canu/Grid_SGE.pm nihh20170407Brian P. Walenz
A documentation/source/parameter-reference.rst nihh20170407Brian P. Walenz
A documentation/source/parameter-reference.rst nihh20170407Brian P. Walenz
A documentation/source/tutorial.rst nihh20170407Brian P. Walenz
A documentation/source/parameter-reference.rst nihh20170407Brian P. Walenz
A documentation/source/parameter-reference.rst nihh20170407Brian P. Walenz
A documentation/source/parameter-reference.rst nihh20170407Brian P. Walenz
A src/pipelines/canu.pl nihh20170407Brian P. Walenz
A src/pipelines/canu/Configure.pm nihh20170407Brian P. Walenz
A src/pipelines/canu/Consensus.pm nihh20170407Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170407Brian P. Walenz
A src/pipelines/canu/Output.pm nihh20170407Brian P. Walenz
A src/gfa/alignGFA.C nihh20170407Brian P. Walenz
A src/stores/ovStore.C nihh20170406Brian P. Walenz
A src/stores/ovStore.H nihh20170406Brian P. Walenz
A documentation/source/faq.rst nihh20170405Brian P. Walenz
A src/stores/ovStoreHistogram.C nihh20170405Brian P. Walenz
A src/meryl/estimate-mer-threshold.C nihh20170405Brian P. Walenz
A src/gfa/alignGFA.C nihh20170404Brian P. Walenz
A src/gfa/gfa.C nihh20170404Brian P. Walenz
A src/gfa/gfa.H nihh20170404Brian P. Walenz
A src/gfa/alignGFA.C nihh20170404Brian P. Walenz
A src/gfa/gfa.C nihh20170404Brian P. Walenz
A src/gfa/gfa.H nihh20170404Brian P. Walenz
A src/main.mk nihh20170404Brian P. Walenz
A src/main.mk nihh20170404Brian P. Walenz
A src/AS_UTL/AS_UTL_fileIO.C nihh20170404Brian P. Walenz
A src/AS_UTL/AS_UTL_fileIO.H nihh20170404Brian P. Walenz
A src/pipelines/canu/Gatekeeper.pm nihh20170403Brian P. Walenz
A src/pipelines/canu/Gatekeeper.pm nihh20170330Brian P. Walenz
A documentation/source/faq.rst nihh20170330Brian P. Walenz
A src/stores/gatekeeperPartition.C nihh20170330Brian P. Walenz
A src/pipelines/canu.pl nihh20170329Brian P. Walenz
A src/stores/gatekeeperPartition.C nihh20170329Brian P. Walenz
A src/stores/gkStore.C nihh20170329Brian P. Walenz
A src/stores/gkStore.H nihh20170329Brian P. Walenz
A documentation/source/parameter-reference.rst nihh20170329Brian P. Walenz
A src/canu_version_update.pl nihh20170329Brian P. Walenz
A src/main.mk nihh20170329Brian P. Walenz
A src/utgcns/alignGFA.C nihh20170329Brian P. Walenz
A src/utgcns/alignGFA.mk nihh20170329Brian P. Walenz
A src/bogart/AS_BAT_TigGraph.C nihh20170329Brian P. Walenz
A documentation/source/faq.rst nihh20170328Sergey Koren
A documentation/source/conf.py nihh20170328Sergey Koren
A src/pipelines/canu/Gatekeeper.pm nihh20170327Brian P. Walenz
A src/stores/tgStoreDump.C nihh20170327Brian P. Walenz
A src/stores/tgTig.C nihh20170327Brian P. Walenz
A src/stores/tgTig.H nihh20170327Brian P. Walenz
A src/stores/tgTigMultiAlignDisplay.C nihh20170327Brian P. Walenz
A src/pipelines/canu.pl nihh20170324Brian P. Walenz
A src/AS_UTL/AS_UTL_stackTrace.C nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/LICENSE nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/README nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/atomic.c nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/backtrace-supported.h nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/backtrace.c nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/backtrace.h nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/config.h nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/dwarf.c nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/elf.c nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/fileline.c nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/internal.h nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/make.out nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/make.sh nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/mmap.c nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/mmapio.c nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/posix.c nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/print.c nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/simple.c nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/sort.c nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/state.c nihh20170322Brian P. Walenz
A src/AS_UTL/libbacktrace/unknown.c nihh20170322Brian P. Walenz
A src/Makefile nihh20170322Brian P. Walenz
A src/main.mk nihh20170322Brian P. Walenz
A src/Makefile nihh20170322Brian P. Walenz
A src/AS_UTL/AS_UTL_stackTrace.C nihh20170322Brian P. Walenz
A src/AS_UTL/AS_UTL_stackTrace.H nihh20170322Brian P. Walenz
A src/AS_global.C nihh20170322Brian P. Walenz
A src/AS_UTL/AS_UTL_stackTrace.C nihh20170322Brian P. Walenz
A src/pipelines/canu.pl nihh20170321Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170321Brian P. Walenz
A src/pipelines/canu.pl nihh20170321Brian P. Walenz
A src/pipelines/canu.pl nihh20170321Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170321Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170321Brian P. Walenz
A src/pipelines/canu.pl nihh20170320Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170320Brian P. Walenz
A src/AS_global.C nihh20170320Brian P. Walenz
A src/AS_global.C nihh20170320Brian P. Walenz
A documentation/source/parameter-reference.rst nihh20170320Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/OverlapMMap.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/OverlapMhap.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/CorrectReads.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170320Brian P. Walenz
A src/pipelines/canu.pl nihh20170320Brian P. Walenz
A src/pipelines/canu/Report.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/Gatekeeper.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/Meryl.pm nihh20170320Brian P. Walenz
A src/Makefile nihh20170320Brian P. Walenz
A src/pipelines/canu/CorrectReads.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/Gatekeeper.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/Meryl.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/OverlapInCore.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/OverlapStore.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/Report.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/Unitig.pm nihh20170320Brian P. Walenz
A src/pipelines/canu/Grid_SGE.pm nihh20170316Brian P. Walenz
A documentation/source/quick-start.rst nihh20170316Brian P. Walenz
A documentation/source/faq.rst nihh20170316Brian P. Walenz
A src/pipelines/canu/CorrectReads.pm nihh20170315Sergey Koren
A src/pipelines/canu/OverlapMhap.pm nihh20170315Sergey Koren
A src/pipelines/sanity/sanity.sh nihh20170314Brian P. Walenz
A src/pipelines/sanity/medium.arabidopsis_thaliana.pacbio.p4c2.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/medium.arabidopsis_thaliana.pacbio.p5c3.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/medium.caenorhabditis_elegans.pacbio.p6c4.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/medium.drosophila_melanogaster.pacbio.p5c3.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.h5-1000.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.h5-5000.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.sra-3000.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.pacbio.p6.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_ne92.pacbio.p4.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_ne92.pacbio.p5.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_o157_h7_str_f8092b.pacbio.p4c2.average.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_o157_h7_str_f8092b.pacbio.p4c2.long.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/small.francisella_tularensis.pacbio.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/small.saccharomyces_cerevisiae_glbrcy22-3.pacbio.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/small.saccharomyces_cerevisiae_glbrcy22-3.pacbio.sra.spec nihh20170314Brian P. Walenz
A src/pipelines/sanity/small.saccharomyces_cerevisiae_s288c.pacbio.spec nihh20170314Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170314Brian P. Walenz
A src/pipelines/canu-object-store.pl nihh20170314Brian P. Walenz
A src/overlapInCore/overlapPair.C nihh20170310Brian P. Walenz
A src/overlapInCore/libedlib/edlib.C nihh20170310Brian P. Walenz
A src/overlapInCore/libedlib/edlib.H nihh20170310Brian P. Walenz
A src/pipelines/canu/Meryl.pm nihh20170310Brian P. Walenz
A src/pipelines/sanity/medium.caenorhabditis_elegans.pacbio.p6c4.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/sanity.NOTES nihh20170310Brian P. Walenz
A src/pipelines/sanity/sanity.sh nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.bacillus_anthracis_sterne.nanopore.34F2_NBI0483991.poretools.2D.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.all.2d.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-1.2d.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-2.2d.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-pcr-1.2d.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-pcr-2.2d.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.r9.4.superlong.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.r9.SpotOn.1d.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.pacbio.p6.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_ne92.pacbio.p4.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_ne92.pacbio.p5.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_o157_h7_str_f8092b.pacbio.p4c2.long.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.saccharomyces_cerevisiae_s288c.nanopore.r7.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.saccharomyces_cerevisiae_s288c.nanopore.r9.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.saccharomyces_cerevisiae_w303.nanopore.poretools.2D.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/small.yersinia_pestis.nanopore.NBI0499872.poretools.2D.spec nihh20170310Brian P. Walenz
A src/pipelines/sanity/success.caenorhabditis_elegans.sh nihh20170310Brian P. Walenz
A src/pipelines/canu/Configure.pm nihh20170309Sergey Koren
A src/pipelines/canu/Defaults.pm nihh20170309Sergey Koren
A src/pipelines/canu/Meryl.pm nihh20170307Sergey Koren
A src/pipelines/canu/OverlapMMap.pm nihh20170307Sergey Koren
A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170303Sergey Koren
A src/pipelines/canu/Defaults.pm nihh20170302Brian P. Walenz
A documentation/source/parameter-reference.rst nihh20170223Brian P. Walenz
A src/pipelines/canu/OverlapMhap.pm nihh20170302Sergey Koren
A src/pipelines/canu/OverlapStore.pm nihh20170227Brian P. Walenz
A src/main.mk nihh20170224Brian P. Walenz
A src/merTrim/merTrim.C nihh20170224Brian P. Walenz
A src/merTrim/merTrim.mk nihh20170224Brian P. Walenz
A src/merTrim/merTrimResult.H nihh20170224Brian P. Walenz
A src/main.mk nihh20170224Brian P. Walenz
A src/meryl/existDB.mk nihh20170224Brian P. Walenz
A src/meryl/positionDB.mk nihh20170224Brian P. Walenz
A src/main.mk nihh20170224Brian P. Walenz
A src/meryl/libkmer/existDB-create-from-fasta.C nihh20170224Brian P. Walenz
A src/meryl/libkmer/existDB-create-from-meryl.C nihh20170224Brian P. Walenz
A src/meryl/libkmer/existDB-create-from-sequence.C nihh20170224Brian P. Walenz
A src/meryl/libkmer/existDB-state.C nihh20170224Brian P. Walenz
A src/meryl/libkmer/existDB.C nihh20170224Brian P. Walenz
A src/meryl/libkmer/existDB.H nihh20170224Brian P. Walenz
A src/meryl/libkmer/positionDB-access.C nihh20170224Brian P. Walenz
A src/meryl/libkmer/positionDB-dump.C nihh20170224Brian P. Walenz
A src/meryl/libkmer/positionDB-file.C nihh20170224Brian P. Walenz
A src/meryl/libkmer/positionDB-mismatch.C nihh20170224Brian P. Walenz
A src/meryl/libkmer/positionDB-sort.C nihh20170224Brian P. Walenz
A src/meryl/libkmer/positionDB.C nihh20170224Brian P. Walenz
A src/meryl/libkmer/positionDB.H nihh20170224Brian P. Walenz
A addCopyrights.dat nihh20170224Brian P. Walenz
A src/pipelines/canu/OverlapStore.pm nihh20170224Brian P. Walenz
A src/pipelines/canu/OverlapMMap.pm nihh20170223Brian P. Walenz
A src/pipelines/canu/OverlapMhap.pm nihh20170223Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170223Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170222Brian P. Walenz
A src/pipelines/canu/Grid_Cloud.pm nihh20170222Brian P. Walenz
A src/stores/tgStoreCoverageStat.C nihh20170222Brian P. Walenz
A src/pipelines/canu.pl nihh20170222Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170222Brian P. Walenz
A src/pipelines/canu/Grid_Cloud.pm nihh20170222Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170222Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170221Brian P. Walenz
A src/pipelines/canu/Grid_DNANexus.pm nihh20170221Brian P. Walenz
A documentation/source/parameter-reference.rst nihh20170221Brian P. Walenz
A src/stores/ovStoreBucketizer.C nihh20170221Brian P. Walenz
A src/stores/ovStoreBuild.C nihh20170221Brian P. Walenz
A src/stores/ovStoreWriter.C nihh20170221Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170221Brian P. Walenz
A src/pipelines/canu/OverlapStore.pm nihh20170221Brian P. Walenz
A src/pipelines/canu/Configure.pm nihh20170221Brian P. Walenz
A documentation/reST-markup-hints nihh20170221Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170221Brian P. Walenz
A src/pipelines/canu/Grid_PBSTorque.pm nihh20170221Brian P. Walenz
A src/pipelines/canu/Grid_Cloud.pm nihh20170221Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170220Brian P. Walenz
A src/pipelines/canu/HTML.pm nihh20170220Brian P. Walenz
A src/pipelines/canu/Grid_Cloud.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170215Brian P. Walenz
A src/Makefile nihh20170215Brian P. Walenz
A src/pipelines/canu.pl nihh20170215Brian P. Walenz
A src/pipelines/canu/Consensus.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/CorrectReads.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/Gatekeeper.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/Grid_Cloud.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/Meryl.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/Output.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/OverlapInCore.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/OverlapMMap.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/OverlapMhap.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/OverlapStore.pm nihh20170215Brian P. Walenz
A src/pipelines/canu/Unitig.pm nihh20170215Brian P. Walenz
A src/pipelines/canu.pl nihh20170213Brian P. Walenz
A src/Makefile nihh20170211Brian P. Walenz
A src/pipelines/canu.pl nihh20170211Brian P. Walenz
A src/pipelines/canu/Grid_DNANexus.pm nihh20170211Brian P. Walenz
A src/pipelines/canu/Gatekeeper.pm nihh20170204Brian P. Walenz
A src/utgcns/libNDFalcon/dw.C nihh20170214Sergey Koren
A src/utgcns/libcns/unitigConsensus.C nihh20170214Sergey Koren
A src/utgcns/libcns/unitigConsensus.C nihh20170214Sergey Koren
A src/utgcns/libcns/unitigConsensus.C nihh20170214Sergey Koren
A src/pipelines/canu.pl nihh20170213Brian P. Walenz
A src/pipelines/canu/Consensus.pm nihh20170213Brian P. Walenz
A src/pipelines/canu/CorrectReads.pm nihh20170213Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170213Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170213Brian P. Walenz
A src/pipelines/canu/Meryl.pm nihh20170213Brian P. Walenz
A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170213Brian P. Walenz
A src/pipelines/canu/OverlapInCore.pm nihh20170213Brian P. Walenz
A src/pipelines/canu/OverlapMMap.pm nihh20170213Brian P. Walenz
A src/pipelines/canu/OverlapMhap.pm nihh20170213Brian P. Walenz
A src/pipelines/canu/OverlapStore.pm nihh20170213Brian P. Walenz
A src/pipelines/canu/Unitig.pm nihh20170213Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170213Brian P. Walenz
A src/pipelines/canu/Configure.pm nihh20170209Brian P. Walenz
A src/utgcns/libcns/unitigConsensus.C nihh20170209Sergey Koren
A src/pipelines/canu/Consensus.pm nihh20170203Sergey Koren
A src/utgcns/libcns/unitigConsensus.C nihh20170203Sergey Koren
A src/utgcns/libcns/unitigConsensus.H nihh20170203Sergey Koren
A src/utgcns/utgcns.C nihh20170203Sergey Koren
A src/overlapInCore/liboverlap/prefixEditDistance-allocateMoreSpace.C nihh20170202Brian P. Walenz
A src/overlapInCore/liboverlap/prefixEditDistance-extend.C nihh20170202Brian P. Walenz
A src/overlapInCore/liboverlap/prefixEditDistance-forward.C nihh20170202Brian P. Walenz
A src/overlapInCore/liboverlap/prefixEditDistance-reverse.C nihh20170202Brian P. Walenz
A src/overlapInCore/liboverlap/prefixEditDistance.H nihh20170202Brian P. Walenz
A src/overlapInCore/overlapInCore-Process_String_Overlaps.C nihh20170202Brian P. Walenz
A src/pipelines/canu.pl nihh20170202Brian P. Walenz
A src/pipelines/canu/Consensus.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/CorrectReads.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/ErrorEstimate.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/Gatekeeper.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/HTML.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/Meryl.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/Output.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/OverlapInCore.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/OverlapMMap.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/OverlapMhap.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/OverlapStore.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/Unitig.pm nihh20170202Brian P. Walenz
A documentation/source/parameter-reference.rst nihh20170202Brian P. Walenz
A src/pipelines/canu/Consensus.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/CorrectReads.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/Gatekeeper.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/Meryl.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/OverlapInCore.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/OverlapMMap.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/OverlapMhap.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/OverlapStore.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/Unitig.pm nihh20170202Brian P. Walenz
A src/pipelines/canu/Meryl.pm nihh20170202Brian P. Walenz
A src/stores/ovStoreBuild.C nihh20170201Brian P. Walenz
A src/stores/ovStoreHistogram.C nihh20170201Brian P. Walenz
A src/stores/ovStoreHistogram.H nihh20170201Brian P. Walenz
A documentation/source/parameter-reference.rst nihh20170201Brian P. Walenz
A documentation/source/tutorial.rst nihh20170201Brian P. Walenz
A src/bogart/bogart.C nihh20170127Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170124Brian P. Walenz
A documentation/source/parameter-reference.rst nihh20170124Brian P. Walenz
A src/pipelines/canu/CorrectReads.pm nihh20170124Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170124Brian P. Walenz
A src/pipelines/canu/CorrectReads.pm nihh20170117Brian P. Walenz
A src/pipelines/canu/ErrorEstimate.pm nihh20170117Brian P. Walenz
A src/pipelines/canu/Gatekeeper.pm nihh20170117Brian P. Walenz
A src/pipelines/canu/HTML.pm nihh20170117Brian P. Walenz
A src/pipelines/canu/Meryl.pm nihh20170117Brian P. Walenz
A src/pipelines/canu/Output.pm nihh20170117Brian P. Walenz
A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170117Brian P. Walenz
A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170117Brian P. Walenz
A src/pipelines/canu/OverlapInCore.pm nihh20170117Brian P. Walenz
A src/pipelines/canu/OverlapMMap.pm nihh20170117Brian P. Walenz
A src/pipelines/canu/OverlapMhap.pm nihh20170117Brian P. Walenz
A src/pipelines/canu/OverlapStore.pm nihh20170117Brian P. Walenz
A src/pipelines/canu/Unitig.pm nihh20170117Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170130Brian P. Walenz
A documentation/source/parameter-reference.rst nihh20170127Brian P. Walenz
A documentation/source/quick-start.rst nihh20170127Brian P. Walenz
A documentation/source/tutorial.rst nihh20170127Brian P. Walenz
A src/pipelines/canu.pl nihh20170127Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170127Brian P. Walenz
A src/pipelines/canu/Unitig.pm nihh20170127Brian P. Walenz
A src/pipelines/canu/Configure.pm nihh20170127Brian P. Walenz
A src/pipelines/canu/ErrorEstimate.pm nihh20170127Brian P. Walenz
A src/pipelines/canu/OverlapMMap.pm nihh20170127Brian P. Walenz
A src/pipelines/canu/OverlapMhap.pm nihh20170127Brian P. Walenz
A src/pipelines/canu/OverlapInCore.pm nihh20170127Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170127Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170126Brian P. Walenz
A src/pipelines/canu.pl nihh20170126Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170126Brian P. Walenz
A src/pipelines/canu/OverlapInCore.pm nihh20170126Brian P. Walenz
A src/falcon_sense/libfalcon/falcon.C nihh20170124Brian P. Walenz
A src/falcon_sense/libfalcon/falcon.H nihh20170124Brian P. Walenz
A src/falcon_sense/falcon_sense.C nihh20170124Brian P. Walenz
A src/falcon_sense/libfalcon/falcon.C nihh20170124Brian P. Walenz
A src/falcon_sense/libfalcon/falcon.H nihh20170124Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170123Brian P. Walenz
A src/Makefile nihh20170123Brian P. Walenz
A src/Makefile nihh20170123Brian P. Walenz
A src/stores/ovStoreHistogram.C nihh20170123Brian P. Walenz
A src/bogart/AS_BAT_OverlapCache.C nihh20170120Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.C nihh20170120Brian P. Walenz
A src/pipelines/canu.pl nihh20170120Brian P. Walenz
A src/pipelines/canu/Configure.pm nihh20170120Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170120Brian P. Walenz
A src/pipelines/canu.pl nihh20170119Brian P. Walenz
A src/pipelines/canu/Consensus.pm nihh20170119Brian P. Walenz
A src/pipelines/canu/CorrectReads.pm nihh20170119Brian P. Walenz
A src/pipelines/canu/ErrorEstimate.pm nihh20170119Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170119Brian P. Walenz
A src/pipelines/canu/HTML.pm nihh20170119Brian P. Walenz
A src/pipelines/canu/Meryl.pm nihh20170119Brian P. Walenz
A src/pipelines/canu/OverlapBasedTrimming.pm nihh20170119Brian P. Walenz
A src/pipelines/canu/OverlapErrorAdjustment.pm nihh20170119Brian P. Walenz
A src/pipelines/canu/OverlapInCore.pm nihh20170119Brian P. Walenz
A src/pipelines/canu/OverlapMMap.pm nihh20170119Brian P. Walenz
A src/pipelines/canu/OverlapMhap.pm nihh20170119Brian P. Walenz
A src/pipelines/canu/Unitig.pm nihh20170119Brian P. Walenz
A src/utgcns/libcns/unitigConsensus.C nihh20170116Brian P. Walenz
A src/bogart/AS_BAT_ChunkGraph.C nihh20170113Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.C nihh20170112Brian P. Walenz
A src/bogart/AS_BAT_BestOverlapGraph.H nihh20170112Brian P. Walenz
A src/bogart/bogart.C nihh20170112Brian P. Walenz
A src/pipelines/canu/OverlapMMap.pm nihh20170110Brian P. Walenz
A src/pipelines/canu/OverlapMhap.pm nihh20170110Brian P. Walenz
A src/stores/ovStore.H nihh20170110Brian P. Walenz
A src/stores/ovStoreHistogram.C nihh20170110Brian P. Walenz
A src/stores/ovStoreHistogram.H nihh20170110Brian P. Walenz
A src/mhap/mhapConvert.C nihh20170110Brian P. Walenz
A src/minimap/mmapConvert.C nihh20170110Brian P. Walenz
A src/pipelines/canu/OverlapMMap.pm nihh20170110Brian P. Walenz
A src/pipelines/canu/OverlapMhap.pm nihh20170110Brian P. Walenz
A src/pipelines/parallel-ovl-store-test.sh nihh20170109Brian P. Walenz
A documentation/reST-markup-hints nihh20170109Brian P. Walenz
A src/pipelines/simple-repeat-test.pl nihh20170109Brian P. Walenz
A documentation/source/parameter-reference.rst nihh20170109Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170107Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170107Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170107Brian P. Walenz
A src/pipelines/canu/Grid_LSF.pm nihh20170107Brian P. Walenz
A src/pipelines/canu/Grid_PBSTorque.pm nihh20170107Brian P. Walenz
A src/pipelines/canu/Grid_SGE.pm nihh20170107Brian P. Walenz
A src/pipelines/canu/Grid_Slurm.pm nihh20170107Brian P. Walenz
A src/fastq-utilities/fastqSimulate.C nihh20170106Brian P. Walenz
A src/falcon_sense/falcon_sense.mk nihh20170106Sergey Koren
A src/falcon_sense/libfalcon/falcon.C nihh20170106Sergey Koren
A src/falcon_sense/libfalcon/falcon.H nihh20170106Sergey Koren
A src/main.mk nihh20170106Sergey Koren
A src/overlapInCore/libedlib/edlib.C nihh20170106Sergey Koren
A src/overlapInCore/libedlib/edlib.H nihh20170106Sergey Koren
A documentation/source/parameter-reference.rst nihh20170106Brian P. Walenz
A src/pipelines/canu/CorrectReads.pm nihh20170106Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170106Brian P. Walenz
A src/pipelines/canu/Execution.pm nihh20170106Brian P. Walenz
A src/pipelines/canu.pl nihh20170106Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20170106Brian P. Walenz
A src/pipelines/sanity/sanity.pl nihh20170104Brian P. Walenz
A src/Makefile nihh20170104Brian P. Walenz
A src/stores/ovOverlap.C nihh20170104Brian P. Walenz
A src/stores/ovOverlap.H nihh20170104Brian P. Walenz
A src/stores/ovStoreFile.C nihh20170104Brian P. Walenz
A src/stores/ovStore.H nihh20170104Brian P. Walenz
A src/correction/generateCorrectionLayouts.C nihh20170103Brian P. Walenz
A src/stores/ovStoreBucketizer.C nihh20170103Brian P. Walenz
A src/stores/ovStoreBuild.C nihh20170103Brian P. Walenz
A src/stores/ovStoreFilter.C nihh20170103Brian P. Walenz
A src/pipelines/sanity/sanity.sh nihh20161225Brian P. Walenz
A src/pipelines/sanity/success.arabidopsis_thaliana.sh nihh20161225Brian P. Walenz
A src/pipelines/sanity/success.bacillus_anthracis_sterne.sh nihh20161225Brian P. Walenz
A src/pipelines/sanity/success.bibersteinia_trehalosi.sh nihh20161225Brian P. Walenz
A src/pipelines/sanity/success.drosophila_melanogaster.sh nihh20161225Brian P. Walenz
A src/pipelines/sanity/success.escherichia_coli_k12.sh nihh20161225Brian P. Walenz
A src/pipelines/sanity/success.escherichia_coli_ne92.sh nihh20161225Brian P. Walenz
A src/pipelines/sanity/success.escherichia_coli_o157_h7_str_f8092b.sh nihh20161225Brian P. Walenz
A src/pipelines/sanity/success.francisella_tularensis.sh nihh20161225Brian P. Walenz
A src/pipelines/sanity/success.yersinia_pestis_i195.sh nihh20161225Brian P. Walenz
A src/pipelines/sanity/small.bacillus_anthracis_sterne.nanopore.34F2_NBI0483991.poretools.2D.spec nihh20161225Brian P. Walenz
A src/pipelines/sanity/sanity.sh nihh20161225Brian P. Walenz
A src/pipelines/sanity/small.saccharomyces_cerevisiae_glbrcy22-3.pacbio.spec nihh20161225Brian P. Walenz
A src/pipelines/sanity/small.saccharomyces_cerevisiae_glbrcy22-3.pacbio.sra.spec nihh20161225Brian P. Walenz
A src/pipelines/sanity/small.saccharomyces_cerevisiae_s288c.pacbio.spec nihh20161225Brian P. Walenz
A src/pipelines/sanity/success.saccharomyces_cerevisiae_s288c.sh nihh20161225Brian P. Walenz
A src/pipelines/canu/Defaults.pm nihh20161225Brian P. Walenz
A src/pipelines/sanity/sanity.sh nihh20161222Brian P. Walenz
A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.h5-1000.spec nihh20161222Brian P. Walenz
A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.h5-5000.spec nihh20161222Brian P. Walenz
A src/pipelines/sanity/medium.arabidopsis_thaliana.pacbio.p4c2.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/medium.arabidopsis_thaliana.pacbio.p5c3.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/medium.drosophila_melanogaster.pacbio.p5c3.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/sanity.sh nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.bacillus_anthracis_sterne.nanopore.34F2_NBI0483991.poretools.2D.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.h5-1000.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.h5-5000.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.bibersteinia_trehalosi.pacbio.sra-1000.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.all.2d.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-1.2d.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-2.2d.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-pcr-1.2d.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.map006-pcr-2.2d.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.nanopore.r9.SpotOn.1d.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_k12.pacbio.p6.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_ne92.pacbio.p4.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_ne92.pacbio.p5.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_o157_h7_str_f8092b.pacbio.p4c2.average.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.escherichia_coli_o157_h7_str_f8092b.pacbio.p4c2.long.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.francisella_tularensis.pacbio.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/small.yersinia_pestis.nanopore.NBI0499872.poretools.2D.spec nihh20161220Brian P. Walenz
A src/pipelines/sanity/success.arabidopsis_thaliana.sh nihh20161220Brian P. Walenz
A src/pipelines/sanity/success.bacillus_anthracis_sterne.sh nihh20161220Brian P. Walenz
A src/pipelines/sanity/success.bibersteinia_trehalosi.sh nihh20161220Brian P. Walenz
A src/pipelines/sanity/success.drosophila_melanogaster.sh nihh20161220Brian P. Walenz
A src/pipelines/sanity/success.escherichia_coli_k12.sh nihh20161220Brian P. Walenz
A src/pipelines/sanity/success.escherichia_coli_ne92.sh nihh20161220Brian P. Walenz
A src/pipelines/sanity/success.escherichia_coli_o157_h7_str_f8092b.sh nihh20161220Brian P. Walenz
A src/pipelines/sanity/success.francisella_tularensis.sh nihh20161220Brian P. Walenz
A src/pipelines/sanity/success.yersinia_pestis_i195.sh nihh20161220Brian P. Walenz
A src/pipelines/sanity/sanity.pl nihh20161220Brian P. Walenz
A src/stores/tgStoreDump.C nihh20161219Brian P. Walenz
A src/stores/tgStoreLoad.C nihh20161219Brian P. Walenz
A src/stores/tgStore.C nihh20161219Brian P. Walenz
A src/AS_UTL/AS_UTL_fileIO.C nihh20161219Brian P. Walenz
A src/AS_UTL/AS_UTL_fileIO.H nihh20161219Brian P. Walenz
A src/AS_UTL/bitPackedFile.C nihh20161219Brian P. Walenz
A src/pipelines/canu/Configure.pm nihh20161219Brian P.
Walenz A src/meryl/estimate-mer-threshold.C nihh20161216Brian P. Walenz A src/pipelines/canu/Meryl.pm nihh20161216Brian P. Walenz A src/pipelines/canu/Execution.pm nihh20161214Brian P. Walenz A src/pipelines/canu/Output.pm nihh20161214Brian P. Walenz A src/meryl/meryl-build.C nihh20161214Brian P. Walenz A src/correction/generateCorrectionLayouts.C nihh20161214Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20161214Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20161214Brian P. Walenz A src/pipelines/canu/CorrectReads.pm nihh20161214Brian P. Walenz A buildRelease.sh nihh20161213Brian P. Walenz A buildRelease.sh nihh20161213Brian P. Walenz A README.licenses nihh20161213Brian P. Walenz A src/utgcns/libpbutgcns/LICENSE nihh20161213Brian P. Walenz A buildRelease.sh nihh20161213Brian P. Walenz A src/pipelines/canu.pl nihh20161212Brian P. Walenz A src/pipelines/canu/Defaults.pm nihh20161212Brian P. Walenz A src/pipelines/canu/ErrorEstimate.pm nihh20161212Brian P. Walenz A src/stores/ovStoreDump.C nihh20161209Brian P. Walenz A src/stores/ovOverlap.C nihh20161208Brian P. Walenz A src/stores/ovOverlap.H nihh20161208Brian P. Walenz A src/stores/ovStoreFile.C nihh20161208Brian P. Walenz A addCopyrights.dat nihh20161207Brian P. Walenz A addCopyrights.pl nihh20161207Brian P. Walenz A src/bogart/bogart.C nihh20161207Brian P. Walenz A src/bogart/bogart.mk nihh20161207Brian P. Walenz A src/main.mk nihh20161206Brian P. Walenz A src/meryl/maskMers.C nihh20161206Brian P. Walenz A src/meryl/maskMers.mk nihh20161206Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20161206Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20161206Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20161206Brian P. Walenz A src/bogart/AS_BAT_Unitig.H nihh20161206Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20161206Brian P. Walenz A src/bogart/AS_BAT_PlaceContains.C nihh20161206Brian P. Walenz A src/bogart/bogart.C nihh20161206Brian P. 
Walenz A src/bogart/AS_BAT_ReadInfo.C nihh20161206Brian P. Walenz A src/bogart/AS_BAT_ReadInfo.H nihh20161206Brian P. Walenz A src/pipelines/canu/OverlapStore.pm nihh20161205Brian P. Walenz A src/stores/ovStoreStats.C nihh20161205Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.H nihh20161202Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.H nihh20161202Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_TigGraph.H nihh20161202Brian P. Walenz A src/bogart/bogart.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.H nihh20161202Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.H nihh20161202Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_TigGraph.H nihh20161202Brian P. Walenz A src/bogart/bogart.C nihh20161202Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161201Brian P. Walenz A src/bogart/bogart.C nihh20161201Brian P. Walenz A src/pipelines/canu/Output.pm nihh20161201Brian P. Walenz A src/stores/gkStore.C nihh20161130Brian P. Walenz A src/overlapErrorAdjustment/findErrors.C nihh20161130Brian P. Walenz A src/stores/gkStore.C nihh20161130Brian P. Walenz A src/stores/ovStore.C nihh20161130Brian P. Walenz A src/fastq-utilities/fastqSimulate.C nihh20161130Brian P. Walenz A src/overlapInCore/overlapInCorePartition.C nihh20161130Brian P. Walenz A src/overlapInCore/overlapInCorePartition.C nihh20161130Brian P. Walenz A src/stores/tgStoreDump.C nihh20161130Brian P. Walenz A src/overlapInCore/overlapInCore-Build_Hash_Index.C nihh20161130Brian P. Walenz A src/overlapInCore/overlapInCore.C nihh20161130Brian P. Walenz A src/overlapInCore/overlapInCore.H nihh20161130Brian P. 
Walenz A src/stores/gkStore.C nihh20161130Brian P. Walenz A src/stores/ovOverlap.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20161130Brian P. Walenz A src/overlapInCore/overlapImport.C nihh20161130Brian P. Walenz A src/stores/tgStoreDump.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_SplitDiscontinuous.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161130Brian P. Walenz A src/erateEstimate/erateEstimate.C nihh20161130Brian P. Walenz A src/stores/ovStore.H nihh20161130Brian P. Walenz A src/stores/ovStoreWriter.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_PlaceReadUsingOverlaps.C nihh20161130Brian P. Walenz A src/stores/ovStore.H nihh20161130Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C nihh20161130Brian P. Walenz A src/stores/ovStore.C nihh20161130Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20161129Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20161129Brian P. Walenz A src/fastq-utilities/fastqSample.C nihh20161129Brian P. Walenz A src/fastq-utilities/fastqSimulate.C nihh20161129Brian P. Walenz A src/meryl/leaff-partition.C nihh20161129Brian P. Walenz A src/overlapErrorAdjustment/findErrors-Read_Frags.C nihh20161129Brian P. Walenz A src/stores/gatekeeperCreate.C nihh20161129Brian P. Walenz A src/stores/tgStoreDump.C nihh20161129Brian P. Walenz A src/utgcns/stashContains.C nihh20161129Brian P. Walenz A src/stores/tgStoreDump.C nihh20161129Brian P. Walenz A src/overlapErrorAdjustment/correctOverlaps.H nihh20161129Brian P. Walenz A src/overlapErrorAdjustment/findErrors.H nihh20161129Brian P. Walenz A src/utgcns/libNDFalcon/dw.H nihh20161129Brian P. Walenz A src/overlapErrorAdjustment/findErrors.H nihh20161129Brian P. Walenz A src/main.mk nihh20161129Brian P. 
Walenz A src/stores/gkStore.H nihh20161129Brian P. Walenz A src/stores/gkStore.H nihh20161129Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C nihh20161129Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20161129Brian P. Walenz A src/overlapInCore/overlapInCore.C nihh20161129Brian P. Walenz A src/overlapInCore/overlapInCore.H nihh20161129Brian P. Walenz A src/bogart/bogart.C nihh20161129Brian P. Walenz A src/stores/ovStoreDump.C nihh20161129Brian P. Walenz A src/stores/ovStoreWriter.C nihh20161129Brian P. Walenz A src/stores/ovStoreBucketizer.C nihh20161129Brian P. Walenz A src/stores/ovStoreIndexer.C nihh20161129Brian P. Walenz A src/stores/ovStoreSorter.C nihh20161129Brian P. Walenz A src/overlapInCore/overlapImport.C nihh20161129Brian P. Walenz A src/falcon_sense/falcon_sense.C nihh20161128Brian P. Walenz A src/meryl/libmeryl.C nihh20161123Brian P. Walenz A src/meryl/meryl-build.C nihh20161122Brian P. Walenz A src/stores/tgTig.C nihh20161122Brian P. Walenz A src/stores/tgStoreDump.C nihh20161122Brian P. Walenz A src/overlapInCore/overlapInCore-Output.C nihh20161122Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20161122Brian P. Walenz A src/meryl/leaff-statistics.C nihh20161122Brian P. Walenz A src/stores/ovStoreDump.C nihh20161122Brian P. Walenz A src/meryl/libleaff/gkStoreFile.C nihh20161122Brian P. Walenz A src/meryl/libleaff/fastqStdin.C nihh20161122Brian P. Walenz A src/stores/tgTigMultiAlignDisplay.C nihh20161122Brian P. Walenz A src/stores/ovOverlap.H nihh20161122Brian P. Walenz A src/overlapErrorAdjustment/findErrors.C nihh20161122Brian P. Walenz A src/overlapErrorAdjustment/findErrors.H nihh20161122Brian P. Walenz A src/utgcns/libcns/abAbacus-refine.C nihh20161122Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.C nihh20161122Brian P. Walenz A src/AS_UTL/AS_UTL_stackTrace.C nihh20161122Brian P. Walenz A src/AS_global.C nihh20161122Brian P. 
Walenz A src/bogart/AS_BAT_AssemblyGraph.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_BestOverlapGraph.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_ChunkGraph.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_Instrumentation.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_Logging.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_MarkRepeatReads.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_Outputs.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_OverlapCache.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_PopBubbles.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_TigGraph.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_TigVector.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20161122Brian P. Walenz A src/bogart/bogart.C nihh20161122Brian P. Walenz A src/bogart/buildGraph.C nihh20161122Brian P. Walenz A src/bogus/bogus.C nihh20161122Brian P. Walenz A src/bogus/bogusness.C nihh20161122Brian P. Walenz A src/correction/filterCorrectionOverlaps.C nihh20161122Brian P. Walenz A src/correction/generateCorrectionLayouts.C nihh20161122Brian P. Walenz A src/erateEstimate/erateEstimate.C nihh20161122Brian P. Walenz A src/falcon_sense/createFalconSenseInputs.C nihh20161122Brian P. Walenz A src/fastq-utilities/fastqSample.C nihh20161122Brian P. Walenz A src/fastq-utilities/fastqSimulate.C nihh20161122Brian P. Walenz A src/merTrim/merTrim.C nihh20161122Brian P. Walenz A src/mercy/mercy.C nihh20161122Brian P. Walenz A src/meryl/compare-counts.C nihh20161122Brian P. Walenz A src/meryl/leaff-partition.C nihh20161122Brian P. Walenz A src/meryl/leaff.C nihh20161122Brian P. Walenz A src/meryl/libmeryl.C nihh20161122Brian P. Walenz A src/meryl/maskMers.C nihh20161122Brian P. Walenz A src/meryl/meryl-args.C nihh20161122Brian P. Walenz A src/meryl/meryl-build.C nihh20161122Brian P. Walenz A src/overlapBasedTrimming/splitReads.C nihh20161122Brian P. Walenz A src/overlapBasedTrimming/trimReads-bestEdge.C nihh20161122Brian P. 
Walenz A src/overlapBasedTrimming/trimReads.C nihh20161122Brian P. Walenz A src/overlapBasedTrimming/trimStat.H nihh20161122Brian P. Walenz A src/overlapInCore/overlapInCorePartition.C nihh20161122Brian P. Walenz A src/stores/gatekeeperCreate.C nihh20161122Brian P. Walenz A src/stores/gatekeeperDumpFASTQ.C nihh20161122Brian P. Walenz A src/stores/gatekeeperPartition.C nihh20161122Brian P. Walenz A src/stores/gkStore.C nihh20161122Brian P. Walenz A src/stores/ovStore.C nihh20161122Brian P. Walenz A src/stores/ovStore.H nihh20161122Brian P. Walenz A src/stores/ovStoreBucketizer.C nihh20161122Brian P. Walenz A src/stores/ovStoreBuild.C nihh20161122Brian P. Walenz A src/stores/ovStoreDump.C nihh20161122Brian P. Walenz A src/stores/ovStoreHistogram.C nihh20161122Brian P. Walenz A src/stores/ovStoreSorter.C nihh20161122Brian P. Walenz A src/stores/ovStoreStats.C nihh20161122Brian P. Walenz A src/stores/ovStoreWriter.C nihh20161122Brian P. Walenz A src/stores/tgStore.C nihh20161122Brian P. Walenz A src/stores/tgStore.H nihh20161122Brian P. Walenz A src/stores/tgStoreCoverageStat.C nihh20161122Brian P. Walenz A src/stores/tgStoreDump.C nihh20161122Brian P. Walenz A src/stores/tgStoreFilter.C nihh20161122Brian P. Walenz A src/stores/tgTig.C nihh20161122Brian P. Walenz A src/stores/tgTigMultiAlignDisplay.C nihh20161122Brian P. Walenz A src/utgcns/utgcns.C nihh20161122Brian P. Walenz A src/stores/tgTig.C nihh20161122Brian P. Walenz A src/stores/tgTig.C nihh20161122Brian P. Walenz A src/AS_UTL/bitPackedFile.C nihh20161122Brian P. Walenz A src/AS_UTL/bitPackedFile.C nihh20161122Brian P. Walenz A src/AS_UTL/bitPackedFile.C nihh20161122Brian P. Walenz A src/AS_UTL/AS_UTL_fileIO.C nihh20161122Brian P. Walenz A src/bogart/AS_BAT_CreateUnitigs.C nihh20161121Brian P. Walenz A src/bogart/AS_BAT_Unitig.C nihh20161121Brian P. Walenz A addCopyrights-BuildData.pl nihh20161121Brian P. Walenz A addCopyrights.dat nihh20161121Brian P. Walenz A addCopyrights.pl nihh20161121Brian P. 
Walenz canu-1.6/addCopyrights.pl000066400000000000000000000304641314437614700154700ustar00rootroot00000000000000#!/usr/local/bin/perl use strict; # To run this: # # Update the copyright data file by appending info on new commits: # perl addCopyrights-BuildData.pl >> addCopyrights.dat # # Update copyright on each file, writing to new files: # perl addCopyrights.pl -test # # Update copyright on specific files by listing them at the end: # perl addCopyrights.pl -test src/bogart/bogart.C # # All files get rewritten, even if there are no changes. If not running in 'test' mode # you can use git to see what changed, and to verify the changes look sane. # # Once source files are updated, update addCopyrights-BuildData.pl with the last # commit hash and commit those changes (both the dat and pl). # # # If set, rewrite files in place with updated copyright text. # If not, create new name.MODIFIED files with updated copyright text. # my $doForReal = 1; if ($ARGV[0] eq "-test") { shift @ARGV; $doForReal = 0; } # # The change data 'addCopyrights.dat' contains lines of two types: # # A -- a committed change to this file # D -- 'file' used to be called 'oldfile' # # The 'D' lines DO NOT map existing A lines. They just emit 'this file derived from' lines in the # copyright text. # # When a file is added to the repo, nothing special needs to be done; an 'A' line # is emitted for the new file. If a file comes about because of a copy (of an existing file), # a 'D' line needs to be added. # # If a file is renamed, a 'D' line needs to be added, and all previous mentions # of the original name need to be updated to the new name.
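For concreteness, the two record shapes are shown below. This example block is editorial, not part of the original script: the `A` record is copied from the change data earlier in this archive, while the `D` record is hypothetical, with made-up paths, since no `D` line appears in this excerpt. Note that the trailing field of an `A` line fuses a four-letter organization code, an eight-digit date, and the author name; splitAC() later unpacks it with `m/^(....)(\d\d\d\d\d\d\d\d)(.*)$/`.

```
A  src/bogart/bogart.C  nihh20170112Brian P. Walenz
D  src/example/new-name.C  src/example/old-name.C
```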
# sub toList (@) { my @all = sort { $a <=> $b } @_; my $ret; $ret = substr($all[0], 0, 4); shift @all; foreach my $a (@all) { $a = substr($a, 0, 4); if ($ret =~ /^(\d+)$/) { if ($1 == $a) { } elsif ($1 + 1 == $a) { $ret = "$1-$a"; } else { $ret = "$1,$a"; } } if ($ret =~ /^(.*)-(\d+)$/) { if ($2 == $a) { } elsif ($2 + 1 == $a) { $ret = "$1-$a"; } else { $ret = "$1-$2,$a"; } } if ($ret =~ /^(.*),(\d+)$/) { if ($2 == $a) { } elsif ($2 + 1 == $a) { $ret = "$1,$2-$a"; } else { $ret = "$1,$2,$a"; } } } return($ret); } sub splitAC ($@) { my $cc = shift @_; my @AC = @_; my @AClist; my @dateStrings = ( "???", "JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC" ); my %dates; foreach my $ac (@AC) { if ($ac =~ m/^(....)(\d\d\d\d\d\d\d\d)(.*)$/) { $dates{"$1$3"} .= "$2\n"; } else { die "$ac failed\n"; } } foreach my $ac (keys %dates) { my @dates = split '\n', $dates{$ac}; @dates = sort { $a <=> $b } @dates; my $years = toList(@dates); my $ord = $dates[0]; my $bgn = $dates[0]; my $end = $dates[ scalar(@dates)-1 ]; if ($bgn =~ m/^(\d\d\d\d)(\d\d)(\d\d)$/) { $bgn = "$1-$dateStrings[$2]-$3"; } else { die "bgn date $bgn\n"; } if ($end =~ m/^(\d\d\d\d)(\d\d)(\d\d)$/) { $end = "$1-$dateStrings[$2]-$3"; } else { die "bgn date $end\n"; } my $org; my $nam; if ($ac =~ m/^(....)(.*)$/) { $org = $1; $nam = $2; } else { die "$ac match\n"; } my $dates = "from $bgn to $end"; if ($bgn eq $end) { $dates = "on $bgn"; } if ($org eq "nihh") { $dates = "beginning on $bgn"; } my $str; if ($org eq "craa") { $str .= " $cc $nam $dates\n"; $str .= " $cc are Copyright $years Applera Corporation, and\n"; $str .= " $cc are subject to the GNU General Public License version 2\n"; $str .= " $cc\n"; } elsif ($org eq "tigr") { $str .= " $cc $nam $dates\n"; $str .= " $cc are Copyright $years The Institute for Genomics Research, and\n"; $str .= " $cc are subject to the GNU General Public License version 2\n"; $str .= " $cc\n"; } elsif ($org eq "jcvi") { $str .= " $cc $nam $dates\n"; 
$str .= " $cc are Copyright $years J. Craig Venter Institute, and\n"; $str .= " $cc are subject to the GNU General Public License version 2\n"; $str .= " $cc\n"; } elsif ($org eq "bnbi") { $str .= " $cc $nam $dates\n"; $str .= " $cc are Copyright $years Battelle National Biodefense Institute, and\n"; $str .= " $cc are subject to the BSD 3-Clause License\n"; $str .= " $cc\n"; } elsif ($org eq "nihh") { $str .= " $cc $nam $dates\n"; $str .= " $cc are a 'United States Government Work', and\n"; $str .= " $cc are released in the public domain\n"; $str .= " $cc\n"; } elsif ($org eq "none") { $str .= " $cc $nam $dates\n"; $str .= " $cc are Copyright $years $nam, and\n"; $str .= " $cc are subject to the GNU General Public License version 2\n"; $str .= " $cc\n"; } else { die "$ac org\n"; } push @AClist, "$ord\0$str"; } @AClist = sort { $a <=> $b } @AClist; foreach my $a (@AClist) { (undef, $a) = split '\0', $a; } return(@AClist); } # Load the previously generated change data my %authcopy; my %derived; { open(F, "< addCopyrights.dat") or die "Failed to open 'addCopyrights.dat' for reading: $!\n"; while (<F>) { chomp; if (m/^A\s+(\S+)\s+(.*)$/) { $authcopy{$1} .= "$2\n"; } elsif (m/^D\s+(\S+)\s+(\S+)$/) { $derived{$1} .= "$2\n"; } else { die "invalid addCopyrights.dat line '$_'\n"; } } close(F); } # Process each file. #open(OUT, "> addCopyrights.dat.new") or die "Failed to open 'addCopyrights.dat.new' for writing: $!\n"; my @filesToProcess = @ARGV; if (scalar(@filesToProcess) == 0) { open(FIN, "find kmer src -type f -print |") or die "Failed to launch 'find'\n"; while (<FIN>) { chomp; s/^\.\/(.*)$/$1/; # Remove leading ./ added by find.
push @filesToProcess, $_; } close(FIN); } foreach my $file (@filesToProcess) { next if ($file =~ m/\.mk$/); next if ($file =~ m/Makefile/); next if ($file =~ m/\.sh$/); next if ($file =~ m/\.py$/); next if ($file =~ m/\.jar$/); next if ($file =~ m/\.tar$/); next if ($file =~ m/\.bin$/); # falcon_sense next if ($file =~ m/\.gz$/); next if ($file =~ m/\.json$/); next if ($file =~ m/\.json.README$/); next if ($file =~ m/\.css$/); next if ($file =~ m/\.fasta$/); # meryl test next if ($file =~ m/\.dat$/); # src/overlapInCore/liboverlap/prefixEditDistance-matchLimitData/prefixEditDistance-matchLimit-*.dat next if ($file =~ m/md5/); next if ($file =~ m/mt19937ar/); next if ($file =~ m/\.jpg$/); next if ($file =~ m/README/); next if ($file =~ m/\.dat$/); next if ($file =~ m/libboost/); next if ($file =~ m/libedlib/); next if ($file =~ m/libfalcon/); next if ($file =~ m/libNDFalcon/); next if ($file =~ m/libbacktrace/); next if ($file =~ m/qsort_mt.c$/); my $cb = "/"; my $cc = "*"; my $ce = "/"; if ($file =~ m/\.p[lm]$/) { $cb = "#"; $cc = "#"; $ce = "#"; } my $iskmer = 0; $iskmer = 1 if ($file =~ m/^kmer/); #$iskmer = 1 if ($file =~ m/libmeryl/); $iskmer = 1 if ($file =~ m/src\/meryl\/libkmer/); $iskmer = 1 if ($file =~ m/src\/meryl\/existDB.C/); $iskmer = 1 if ($file =~ m/src\/meryl\/positionDB.C/); if ($iskmer) { print STDERR "Won't process: '$file' - kmer copyrights screwed up\n"; next; } if (($file !~ m/\.[CHch]$/) && ($file !~ m/\.p[lm]/)) { print STDERR "Won't process: '$file'\n"; next; } my @AC = split '\n', $authcopy{$file}; #foreach my $ac (@AC) { # print OUT "A\t$file\t$ac\n"; #} my @AClist = splitAC($cc, @AC); my %DElist; my @DElist = split '\n', $derived{$file}; foreach my $d (@DElist) { next if ($d eq ""); next if (lc $d eq lc $file); $DElist{$d}++; } undef @DElist; if (scalar(keys %DElist) > 0) { foreach my $d (keys %DElist) { push @DElist, " $cc $d\n"; } @DElist = sort @DElist; #foreach my $d (@DElist) { # if ($d =~ m/^\s*\S\s*(\S+)$/) { # print OUT 
"D\t$file\t$1\n"; # } else { # die "Failed to match DElist line '$d'\n"; # } #} unshift @DElist, " $cc\n"; unshift @DElist, " $cc This file is derived from:\n"; push @DElist, " $cc\n"; } my @lines; push @lines, "#!/usr/bin/env perl\n" if ($file =~ m/\.pl$/); push @lines, "\n"; push @lines, "$cb" . $cc x 78 . "\n"; push @lines, " $cc\n"; push @lines, " $cc This file is part of canu, a software program that assembles whole-genome\n"; push @lines, " $cc sequencing reads into contigs.\n"; push @lines, " $cc\n"; push @lines, " $cc This software is based on:\n"; #push @lines, " $cc RELEASE_1-3_2004-03-17 of the 'Celera Assembler' (http://wgs-assembler.sourceforge.net)\n"; push @lines, " $cc 'Celera Assembler' (http://wgs-assembler.sourceforge.net)\n"; push @lines, " $cc the 'kmer package' (http://kmer.sourceforge.net)\n"; push @lines, " $cc both originally distributed by Applera Corporation under the GNU General\n"; push @lines, " $cc Public License, version 2.\n"; push @lines, " $cc\n"; push @lines, " $cc Canu branched from Celera Assembler at its revision 4587.\n"; push @lines, " $cc Canu branched from the kmer project at its revision 1994.\n"; push @lines, " $cc\n"; push @lines, @DElist; push @lines, " $cc Modifications by:\n"; push @lines, " $cc\n"; push @lines, @AClist; push @lines, " $cc File 'README.licenses' in the root directory of this distribution contains\n"; push @lines, " $cc full conditions and disclaimers for each license.\n"; push @lines, " $cc$ce\n"; push @lines, "\n"; my $start = 1; # To skip comment lines at the start of the file (the previous copyright block). open(F, "< $file") or die "Failed to open '$file' for reading: $!\n"; while (<F>) { s/\s+$//; # Remove trailing spaces because they bug me. # If a single "/*" at the start of the line, assume this is NOT an old copyright block, # but a copyright block (or just a comment) from some third party code.
if ($_ eq "/*") { print STDERR "Foreign code found: '$file'\n"; $start = 0; } # If not at the start, add the line. if ($start == 0) { push @lines, "$_\n"; next; } # Else, we're at the start; if blank or a comment, skip it. if (($_ eq "") || # Blank lines ($_ =~ m/^[\/\s]\*/) || # C-style comment (the old copyright block) ($_ =~ m/^\s*##/) || # Perl comment, at least two #'s ($_ =~ m/^\s*#$/) || # Perl comment, exactly one # ($_ =~ m/^\s*#\s/) || # Perl comment, a single # followed by a space (so we don't catch #! lines) ($_ =~ m/^\s*#\!/)) { # #! line. I guess we do want to skip them now next; } # Else, add the line, and declare that we're no longer at the start. push @lines, "$_\n"; $start = 0; } close(F); if ($doForReal) { my $perms = `stat -f %p $file`; chomp $perms; $perms = substr($perms, -3); open(F, "> $file") or die "Failed to open '$file' for writing: $!\n"; print F @lines; close(F); system("chmod $perms $file"); } else { open(F, "> $file.MODIFIED") or die "Failed to open '$file.MODIFIED' for writing: $!\n"; print F @lines; close(F); } } close(FIN); canu-1.6/buildRelease.sh000066400000000000000000000030461314437614700152570ustar00rootroot00000000000000#!/bin/sh version=$1 if [ x$version = x ] ; then echo usage: $0 numeric-version exit fi # From the tarball if [ ! -e canu-$version.tar.gz ] ; then echo Fetch. curl -L -R -o canu-$version.tar.gz https://github.com/marbl/canu/archive/v$version.tar.gz fi if [ ! -d canu-$version ] ; then echo Unpack. gzip -dc canu-$version.tar.gz | tar -xf - fi cd canu-$version # From the repo #git clone git@github.com:marbl/canu.git #mv canu canu-$version #cd canu-$version #git tag v$version #git checkout v$version echo Build MacOS. cd src gmake -j 12 > ../Darwin-amd64.out 2>&1 cd ../..
rm -f canu-$version/linux.sh echo >> canu-$version/linux.sh \#\!/bin/bash #echo >> canu-$version/linux.sh yum install -y git echo >> canu-$version/linux.sh cd /build/canu-$version/src echo >> canu-$version/linux.sh gmake -j 12 \> ../Linux-amd64.out 2\>\&1 echo >> canu-$version/linux.sh cd ../.. echo >> canu-$version/linux.sh rm -rf canu-$version/Darwin-amd64/obj echo >> canu-$version/linux.sh rm -rf canu-$version/Linux-amd64/obj echo >> canu-$version/linux.sh tar -cf canu-$version.Darwin-amd64.tar canu-$version/README* canu-$version/Darwin-amd64 echo >> canu-$version/linux.sh tar -cf canu-$version.Linux-amd64.tar canu-$version/README* canu-$version/Linux-amd64 chmod 755 canu-$version/linux.sh echo Build Linux and make tarballs. docker run -v `pwd`:/build -t -i --rm phusion/holy-build-box-64:latest /hbb_exe/activate-exec bash /build/canu-$version/linux.sh echo Compress. xz -9v canu-$version.Darwin-amd64.tar xz -9v canu-$version.Linux-amd64.tar exit canu-1.6/documentation/000077500000000000000000000000001314437614700151715ustar00rootroot00000000000000canu-1.6/documentation/Makefile000066400000000000000000000151721314437614700166370ustar00rootroot00000000000000# Makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD = sphinx-build PAPER = BUILDDIR = build # User-friendly check for sphinx-build ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) endif # Internal variables. 
PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source # the i18n builder cannot share the environment and doctrees with the others I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source .PHONY: all help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext all: html help: @echo "Please use \`make <target>' where <target> is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " singlehtml to make a single large HTML file" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " htmlhelp to make HTML files and a HTML help project" @echo " qthelp to make HTML files and a qthelp project" @echo " devhelp to make HTML files and a Devhelp project" @echo " epub to make an epub" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" @echo " text to make text files" @echo " man to make manual pages" @echo " texinfo to make Texinfo files" @echo " info to make Texinfo files and run them through makeinfo" @echo " gettext to make PO message catalogs" @echo " changes to make an overview of all changed/added/deprecated items" @echo " xml to make Docutils-native XML files" @echo " pseudoxml to make pseudoxml-XML files for display purposes" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the documentation (if enabled)" clean: rm -rf $(BUILDDIR)/* html: $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished.
The HTML pages are in $(BUILDDIR)/dirhtml." singlehtml: $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml @echo @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files." htmlhelp: $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp @echo @echo "Build finished; now you can run HTML Help Workshop with the" \ ".hhp project file in $(BUILDDIR)/htmlhelp." qthelp: $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp @echo @echo "Build finished; now you can run "qcollectiongenerator" with the" \ ".qhcp project file in $(BUILDDIR)/qthelp, like this:" @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/canu.qhcp" @echo "To view the help file:" @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/canu.qhc" devhelp: $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp @echo @echo "Build finished." @echo "To view the help file:" @echo "# mkdir -p $$HOME/.local/share/devhelp/canu" @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/canu" @echo "# devhelp" epub: $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub @echo @echo "Build finished. The epub file is in $(BUILDDIR)/epub." latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." $(MAKE) -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." 
latexpdfja: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through platex and dvipdfmx..." $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." text: $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text @echo @echo "Build finished. The text files are in $(BUILDDIR)/text." man: $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man @echo @echo "Build finished. The manual pages are in $(BUILDDIR)/man." texinfo: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." @echo "Run \`make' in that directory to run these through makeinfo" \ "(use \`make info' here to do that automatically)." info: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo "Running Texinfo files through makeinfo..." make -C $(BUILDDIR)/texinfo info @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." gettext: $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale @echo @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." xml: $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml @echo @echo "Build finished. The XML files are in $(BUILDDIR)/xml." pseudoxml: $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml @echo @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml." 
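As a quick sanity check of the targets above, the `html` and `man` rules reduce to plain sphinx-build invocations. The sketch below only prints those command lines (a dry run; it assumes `SPHINXBUILD` defaults to `sphinx-build` and `BUILDDIR` to `build`, which are set earlier in the Makefile and not shown here):

```shell
# Dry-run sketch of the 'html' and 'man' targets above: assemble and print
# the sphinx-build command lines without executing them. BUILDDIR and the
# 'source' directory follow the Makefile's assumed defaults.
BUILDDIR=build
html_cmd="sphinx-build -b html -d $BUILDDIR/doctrees source $BUILDDIR/html"
man_cmd="sphinx-build -b man -d $BUILDDIR/doctrees source $BUILDDIR/man"
echo "$html_cmd"
echo "$man_cmd"
```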
canu-1.6/documentation/reST-markup-hints000066400000000000000000000005561314437614700204170ustar00rootroot00000000000000 http://docutils.sourceforge.net/rst.html http://docutils.sourceforge.net/docs/user/rst/quickstart.html http://docutils.sourceforge.net/docs/user/rst/quickref.html <- most useful http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#definition-lists http://rest-sphinx-memo.readthedocs.io/en/latest/ReST.html - A very useful page with examples. canu-1.6/documentation/source/000077500000000000000000000000001314437614700164715ustar00rootroot00000000000000canu-1.6/documentation/source/canu-overlaps.svg000066400000000000000000003763501314437614700220070ustar00rootroot00000000000000 image/svg+xml compute minhashtables count k-mers frequentk-mers place overlapsin buckets sort buckets Raw Reads Create read database gkpStore ovlStore compute overlapswith overlapInCore compute overlapswith MHAP minHashes overlaps overlaps unsortedbuckets CORRECTED READS UNCORRECTED READS canu-1.6/documentation/source/canu-pipeline.svg000066400000000000000000005767761314437614700217770ustar00rootroot00000000000000 image/svg+xml contigsequences assemblygraph readlayouts Raw Reads generate correctedread consensus choose overlapsfor correction globalscores estimate correctedread lengths read IDsto correct correctedreads Build read andoverlapdatabases output reads trim reads split reads trimmedreads detect errors in reads recompute overlapalignments errorsin reads adjustederror rates construct contigs(bogart) generate contigconsensus generate outputs Build read andoverlapdatabases Build read andoverlapdatabases gkpStore ovlStore ovlStore ovlStore tigStore gkpStore gkpStore Correct Trim Assemble canu-1.6/documentation/source/command-reference.rst000066400000000000000000000127021314437614700225770ustar00rootroot00000000000000 .. _command-reference: Canu Command Reference ====================== Every command, even the useless ones. 
Commands marked as 'just usage' were automagically generated from the command line usage summary. Yes, some of them even crashed. :doc:`commands/bogart` (just usage) The unitig construction algorithm. BOG stands for Best Overlap Graph; we haven't figured out what ART stands for. :doc:`commands/bogus` (just usage) A unitig construction algorithm simulator. Given reads mapped to a reference, returns the largest unitigs possible. :doc:`commands/canu` (just usage) The executive in charge! Coordinates all these commands to make an assembler. :doc:`commands/correctOverlaps` (just usage) Part of Overlap Error Adjustment, recomputes overlaps given a set of read corrections. :doc:`commands/estimate-mer-threshold` (just usage) Decides on a k-mer threshold for overlapInCore seeds. :doc:`commands/fastqAnalyze` (just usage) Analyzes a FASTQ file and reports the best guess of the QV encoding. Can also rewrite the FASTQ to be in Sanger QV format. :doc:`commands/fastqSample` (just usage) Extracts random reads from a single or mated FASTQ file. Extracts based on desired coverage, desired number of reads/pairs, desired fraction of the total, or desired total length. :doc:`commands/fastqSimulate` (just usage) Creates reads with unbiased errors from a FASTA sequence. :doc:`commands/fastqSimulate-sort` (just usage) Given input from fastqSimulate, sorts the reads by position in the reference. :doc:`commands/filterCorrectionOverlaps` (just usage) Part of Read Correction, filters overlaps that shouldn't be used for correcting reads. :doc:`commands/findErrors` (just usage) Part of Overlap Error Adjustment, generates a multialignment for each read and outputs a list of suspected errors in the read. :doc:`commands/gatekeeperCreate` (just usage) Loads FASTA or FASTQ reads into the canu read database, gkpStore. :doc:`commands/gatekeeperDumpFASTQ` (just usage) Outputs FASTQ reads from the canu read database, gkpStore. 
:doc:`commands/gatekeeperDumpMetaData` (just usage) Outputs read and library metadata from the canu read database, gkpStore. :doc:`commands/gatekeeperPartition` (just usage) Part of Consensus, rearranges the canu read database, gkpStore, to localize reads to unitigs. :doc:`commands/generateCorrectionLayouts` (just usage) Part of Read Correction, generates the multialignment layout used to correct reads. :doc:`commands/leaff` (just usage) Not actually part of canu, but it came along with meryl. Provides random access to FASTA, FASTQ and gkpStore. Also does some analysis tasks. Handy Swiss Army knife type of tool. :doc:`commands/meryl` (just usage) Counts k-mer occurrences in FASTA, FASTQ and gkpStore. Performs mathematical and logical operations on the resulting k-mer databases. :doc:`commands/mhapConvert` (just usage) Converts mhap output to overlap output. :doc:`commands/ovStoreBucketizer` (just usage) Part of the parallel overlap store building pipeline, loads raw overlaps from the overlapper into the store. :doc:`commands/ovStoreBuild` (just usage) Sequentially builds an overlap store from raw overlaps. Simplest to run, but slow on large datasets. :doc:`commands/ovStoreDump` (just usage) Dumps overlaps from the overlap store, ovlStore. :doc:`commands/ovStoreIndexer` (just usage) Part of the parallel overlap store building pipeline, finalizes the store after sorting with ovStoreSorter. :doc:`commands/ovStoreSorter` (just usage) Part of the parallel overlap store building pipeline, sorts overlaps loaded into the store by ovStoreBucketizer. :doc:`commands/overlapConvert` (just usage) Reads raw overlapper output, writes overlaps as ASCII. The reverse of overlapImport. :doc:`commands/overlapImport` (just usage) Reads ASCII overlaps in a few different formats, writes either 'raw overlapper output' or creates an ovlStore. :doc:`commands/overlapInCore` (just usage) The classic overlapper algorithm. 
:doc:`commands/overlapInCorePartition` (just usage) Generates the partitioning used to run overlapInCore jobs in parallel. :doc:`commands/overlapPair` (just usage) An *experimental* algorithm to recompute overlaps and output the alignments. :doc:`commands/prefixEditDistance-matchLimitGenerate` (just usage) Generates source code files with data representing the minimum length of a good overlap given some number of errors. :doc:`commands/splitReads` (just usage) Part of Overlap Based Trimming, splits reads based on overlaps, specifically looking for PacBio hairpin adapter signatures. :doc:`commands/tgStoreCoverageStat` (just usage) Analyzes tigs in the tigStore, computes the classic `arrival rate statistic `_. :doc:`commands/tgStoreDump` (just usage) Analyzes and outputs tigs from the tigStore, in various formats (FASTQ, layouts, multialignments, etc.). :doc:`commands/tgStoreFilter` (just usage) Analyzes tigs in the tigStore, marks those that appear to be spurious 'degenerate' tigs. :doc:`commands/tgStoreLoad` (just usage) Loads tigs into a tigStore. :doc:`commands/tgTigDisplay` (just usage) Displays the tig contained in a binary multialignment file, as output by utgcns. :doc:`commands/trimReads` (just usage) Part of Overlap Based Trimming, trims reads based on overlaps. :doc:`commands/utgcns` (just usage) Generates a multialignment for a tig, based on the layout stored in tigStore. Outputs FASTQ, layouts and binary multialignment files. canu-1.6/documentation/source/commands/000077500000000000000000000000001314437614700202725ustar00rootroot00000000000000canu-1.6/documentation/source/commands/bogart.rst000066400000000000000000000046631314437614700223070ustar00rootroot00000000000000bogart ~~~~~~ :: usage: bogart -o outputName -O ovlStore -G gkpStore -T tigStore -O Mandatory path to an ovlStore. -G Mandatory path to a gkpStore. -T Mandatory path to a tigStore (can exist or not). 
-o prefix Mandatory name for the output files -B b Target number of fragments per tigStore (consensus) partition Algorithm Options -gs Genome size in bases. -J Join promiscuous unitigs using unused best edges. -SR Shatter repeats, don't rebuild. -R Shatter repeats (-SR), then rebuild them -RL len Force reads below 'len' bases to be singletons. This WILL cause CGW to fail; diagnostic only. -threads N Use N compute threads during repeat detection. 0 - use OpenMP default (default) 1 - use one thread Overlap Selection - an overlap will be considered for use in a unitig under the following conditions: When constructing the Best Overlap Graph and Promiscuous Unitigs ('g'raph): -eg 0.020 no more than 0.020 fraction (2.0%) error ** DEPRECATED ** When loading overlaps, an inflated maximum (to allow reruns with different error rates): -eM 0.05 no more than 0.05 fraction (5.0%) error in any overlap loaded into bogart the maximum used will ALWAYS be at least the maximum of the four error rates For all, the lower limit on overlap length -el 500 no shorter than 40 bases Overlap Storage -M gb Use at most 'gb' gigabytes of memory for storing overlaps. -N num Load at most 'num' overlaps per read. -create Only create the overlap graph, save to disk and quit. -save Save the overlap graph to disk, and continue. Debugging and Logging -D enable logging/debugging for a specific component. -d disable logging/debugging for a specific component. overlapScoring allBestEdges chunkGraph buildUnitig placeUnplaced bubbles splitDiscontinuous intermediateUnitigs setParentAndHang stderr No output prefix name (-o option) supplied. No gatekeeper store (-G option) supplied. No overlap store (-O option) supplied. No output tigStore (-T option) supplied. canu-1.6/documentation/source/commands/bogus.rst000066400000000000000000000002671314437614700221500ustar00rootroot00000000000000bogus ~~~~~~ :: ERROR: No input matches supplied (either -nucmer or -snapper). 
ERROR: No reference sequence supplied (-reference). ERROR: No output prefix supplied (-output). canu-1.6/documentation/source/commands/canu.rst000066400000000000000000000035171314437614700217600ustar00rootroot00000000000000canu ~~~~~~ :: -- Detected Java(TM) Runtime Environment '1.8.0_60' (from 'java'). usage: canu [-correct | -trim | -assemble] \ [-s ] \ -p \ -d \ genomeSize=[g|m|k] \ errorRate=0.X \ [other-options] \ [-pacbio-raw | -pacbio-corrected | -nanopore-raw | -nanopore-corrected] *fastq By default, all three stages (correct, trim, assemble) are computed. To compute only a single stage, use: -correct - generate corrected reads -trim - generate trimmed reads -assemble - generate an assembly The assembly is computed in the (created) -d , with most files named using the -p . The genome size is your best guess of the genome size of what is being assembled. It is used mostly to compute coverage in reads. Fractional values are allowed: '4.7m' is the same as '4700k' and '4700000' The errorRate is not used correctly (we're working on it). Don't set it. If you want to change the defaults, use the various utg*ErrorRate options. A full list of options can be printed with '-options'. All options can be supplied in an optional spec file. Reads can be either FASTA or FASTQ format, uncompressed, or compressed with gz, bz2 or xz. Reads are specified by the technology they were generated with: -pacbio-raw -pacbio-corrected -nanopore-raw -nanopore-corrected Complete documentation at http://canu.readthedocs.org/en/latest/ ERROR: Assembly name prefix not supplied with -p. ERROR: Directory not supplied with -d. ERROR: Required parameter 'genomeSize' is not set canu-1.6/documentation/source/commands/correctOverlaps.rst000066400000000000000000000020451314437614700242020ustar00rootroot00000000000000correctOverlaps ~~~~~~ :: ERROR: no input gatekeeper store (-G) supplied. ERROR: no input overlap store (-O) supplied. ERROR: no input read corrections file (-c) supplied. 
ERROR: no output erates file (-o) supplied. USAGE: correctOverlaps [-d ] [-o ] [-q ] [-x ] [-F OlapFile] [-S OlapStore] [-c ] [-e Recalculates overlaps for frags .. in using corrections in Options: -e specifies binary file to dump corrected erates to for later updating of olap store by update-erates -F specify file of sorted overlaps to use (in the format produced by get-olaps -o specifies name of file to which OVL messages go -q overlaps less than this error rate are automatically output -S specify the binary overlap store containing overlaps to use canu-1.6/documentation/source/commands/createFalconSenseInputs.rst000066400000000000000000000000451314437614700256120ustar00rootroot00000000000000createFalconSenseInputs ~~~~~~ :: canu-1.6/documentation/source/commands/erateEstimate.rst000066400000000000000000000000551314437614700236200ustar00rootroot00000000000000erateEstimate ~~~~~~ :: Opening '(null)' canu-1.6/documentation/source/commands/estimate-mer-threshold.rst000066400000000000000000000001661314437614700254150ustar00rootroot00000000000000estimate-mer-threshold ~~~~~~ :: usage: estimate-mer-threshold -m mercounts -m mercounts file of mercounts canu-1.6/documentation/source/commands/fastqAnalyze.rst000066400000000000000000000013221314437614700234640ustar00rootroot00000000000000fastqAnalyze ~~~~~~ :: usage: fastqAnalyze [-stats] [-o output.fastq] input.fastq If no options are given, input.fastq is analyzed and a best guess for the QV encoding is output. Otherwise, the QV encoding is converted to Sanger-style using this guess. In some cases, the encoding cannot be determined. When this occurs, no guess is output. For conversion, you can force the input QV type with: -solexa input QV is solexa -illumina input QV is illumina -sanger input QV is sanger -o sanger-style-output.fastq If -stats is supplied, no QV analysis or conversion is performed, but some simple statistics are computed and output to stdout. 
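The fastqAnalyze usage above supports three distinct modes: guess the QV encoding, convert it, or report statistics. A sketch of typical invocations (hypothetical file names; the script only assembles and prints the command lines rather than running fastqAnalyze):

```shell
# Hypothetical fastqAnalyze invocations, printed rather than executed.
IN=reads.fastq                                     # made-up input file name
guess_cmd="fastqAnalyze $IN"                       # guess the QV encoding
conv_cmd="fastqAnalyze -illumina -o reads.sanger.fastq $IN"  # force illumina input, convert to Sanger
stat_cmd="fastqAnalyze -stats $IN"                 # simple statistics only
printf '%s\n' "$guess_cmd" "$conv_cmd" "$stat_cmd"
```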
canu-1.6/documentation/source/commands/fastqSample.rst000066400000000000000000000031111314437614700233000ustar00rootroot00000000000000fastqSample ~~~~~~ :: usage: fastqSample [opts] Input Specification -I NAME input name (prefix) of the reads -T T total number of mate pairs in the input (if not supplied, will be counted) -L L length of a single read (if not supplied, will be determined) -U reads are unmated, expected in *.u.fastq Output Specification -O NAME output name (prefix) of the reads (default is same as -I) -A automatically include coverage or number of reads in the output name -m L ignore reads shorter than L bases -max don't sample randomly, pick the longest reads Method 1: specify desired output coverage: -g G genome size -c C desired coverage in the output reads Method 2: specify desired number of output pairs -p N for mated reads, output 2N reads, or N pairs of reads for unmated reads, output N reads Method 3: specify a desired fraction of the input: -f F output F * T pairs of reads (T as above in -t option) 0.0 < F <= 1.0 Method 4: specify a desired total length -b B output reads/pairs until B bases is exceeded Samples reads from paired Illumina reads NAME.1.fastq and NAME.2.fastq and outputs: NAME.Cx.1.fastq and N.Cx.2.fastq (for coverage based sampling) NAME.n=N.1.fastq and N.n=N.2.fastq (for coverage based sampling) If -T is not supplied, the number of reads will be counted for you. ERROR: no name supplied with -I. ERROR: no method supplied with -c, -p, -f or -b canu-1.6/documentation/source/commands/fastqSimulate-sort.rst000066400000000000000000000003241314437614700246320ustar00rootroot00000000000000fastqSimulate-sort ~~~~~~ :: usage: fastqSimulate-sort -i1 in.1.fastq [-i2 in.2.fastq] -o1 out.1.fastq [-o2 out.2.fastq] ERROR: No in.1.fastq supplied with -i1. ERROR: No out.1.fastq supplied with -i1. 
canu-1.6/documentation/source/commands/fastqSimulate.rst000066400000000000000000000057641314437614700236620ustar00rootroot00000000000000fastqSimulate ~~~~~~ :: usage: fastqSimulate -f reference.fasta -o output-prefix -l read-length .... -f ref.fasta Use sequences in ref.fasta as the genome. -o name Create outputs name.1.fastq and name.2.fastq (and maybe others). -l len Create reads of length 'len' bases. -n n Create 'n' reads (for -se) or 'n' pairs of reads (for -pe and -mp). -x read-cov Set 'np' to create reads that sample the genome to 'read-cov' read coverage. -X clone-cov Set 'np' to create reads that sample the genome to 'clone-cov' clone coverage. -em err Reads will contain fraction mismatch error 'e' (0.01 == 1% error). -ei err Reads will contain fraction insertion error 'e' (0.01 == 1% error). -ed err Reads will contain fraction deletion error 'e' (0.01 == 1% error). -seed s Seed randomness with 32-bit integer s. -allowgaps Allow pairs to span N regions in the reference. By default, pairs are not allowed to span a gap. Reads are never allowed to cover N's. -allowns Allow reads to contain N regions. Implies -allowgaps -nojunction For -mp, do not create chimeric junction reads. Create only fully PE or fully MP reads. -normal p Output a normal-oriented (both forward or both reverse) pair with probability p. Only for -pe and -mp. -se Create single-end reads. -cc junkSize junkStdDev false Create chimeric single-end reads. The chimer is formed from two uniformly distributed positions in the reference. Some amount of random junk is inserted at the junction. With probability 'false' the read is not chimeric, but still the junk bases inserted in the middle. -pe shearSize shearStdDev Create paired-end reads, from fragments of size 'shearSize +- shearStdDev'. -mp insertSize insertStdDev shearSize shearStdDev enrichment Create mate-pair reads. The pairs will be 'insertSize +- insertStdDev' apart. 
The circularized insert is then sheared into fragments of size 'shearSize +- shearStdDev'. With probability 'enrichment' the fragment containing the junction is used to form the pair of reads. The junction location is uniformly distributed through this fragment. Reads are labeled as: tMP - a MP pair fMP - a PE pair aMP - a MP pair with junction in the first read bMP - a MP pair with junction in the second read cMP - a MP pair with junction in both reads (the reads overlap) Output QV's are the Sanger spec. ERROR: No fasta file (-f) supplied. ERROR: No output prefix (-o) supplied. ERROR: No type (-se or -pe or -mp) selected. canu-1.6/documentation/source/commands/filterCorrectionOverlaps.rst000066400000000000000000000021641314437614700260600ustar00rootroot00000000000000filterCorrectionOverlaps ~~~~~~ :: usage: filterCorrectionOverlaps [options] Rewrites an ovlStore, filtering overlaps that shouldn't be used for correcting reads. -G gkpStore input reads -O ovlStore input overlaps -S scoreFile output scores for each read, binary file, to 'scoreFile' per-read logging to 'scoreFile.log' (see -nolog) summary statistics to 'scoreFile.stats' (see -nostats) -c coverage retain at most this many overlaps per read -l length filter overlaps shorter than this length -e (min-)max filter overlaps outside this range of fraction error example: -e 0.20 filter overlaps above 20% error example: -e 0.05-0.20 filter overlaps below 5% error or above 20% error -nolog don't create 'scoreFile.log' -nostats don't create 'scoreFile.stats' ERROR: no gatekeeper store (-G) supplied. ERROR: no overlap store (-O) supplied. ERROR: no output scoreFile (-S) supplied. 
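The filterCorrectionOverlaps options above combine naturally into a single filtering pass. A sketch, with hypothetical store names and thresholds; the command is printed, not executed:

```shell
# Hypothetical filterCorrectionOverlaps invocation: keep at most 40 evidence
# overlaps per read, drop overlaps shorter than 500 bases or above 15% error.
cmd="filterCorrectionOverlaps -G asm.gkpStore -O asm.ovlStore -S asm.globalScores -c 40 -l 500 -e 0.15"
echo "$cmd"
```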
canu-1.6/documentation/source/commands/findErrors.rst000066400000000000000000000000301314437614700231320ustar00rootroot00000000000000findErrors ~~~~~~ :: canu-1.6/documentation/source/commands/gatekeeperCreate.rst000066400000000000000000000004001314437614700242560ustar00rootroot00000000000000gatekeeperCreate ~~~~~~ :: usage: gatekeeperCreate [...] -o gkpStore -o gkpStore create this gkpStore -minlength L discard reads shorter than L ERROR: no gkpStore (-o) supplied. ERROR: no input files supplied. canu-1.6/documentation/source/commands/gatekeeperDumpFASTQ.rst000066400000000000000000000022051314437614700245640ustar00rootroot00000000000000gatekeeperDumpFASTQ ~~~~~~ :: usage: gatekeeperDumpFASTQ [...] -o fastq-prefix -g gkpStore -G gkpStore -o fastq-prefix write files fastq-prefix.(libname).fastq, ... if fastq-prefix is '-', all sequences output to stdout if fastq-prefix ends in .gz, .bz2 or .xz, output is compressed -l libToDump output only read in library number libToDump (NOT IMPLEMENTED) -r id[-id] output only the single read 'id', or the specified range of ids -c clearFile clear range file from OBT modules -allreads if a clear range file, lower case mask the deleted reads -allbases if a clear range file, lower case mask the non-clear bases -onlydeleted if a clear range file, only output deleted reads (the entire read) -fastq output is FASTQ format (with extension .fastq, default) -fasta output is FASTA format (with extension .fasta) -nolibname don't include the library name in the output file name ERROR: no gkpStore (-G) supplied. ERROR: no output prefix (-o) supplied. canu-1.6/documentation/source/commands/gatekeeperDumpMetaData.rst000066400000000000000000000012121314437614700253630ustar00rootroot00000000000000gatekeeperDumpMetaData ~~~~~~ :: usage: gatekeeperDumpMetaData -G gkpStore [p] [...] -G gkpStore [p] dump reads from 'gkpStore', restricted to partition 'p', if supplied. 
-libs dump information about libraries -reads [-full] dump information about reads (-full also dumps some storage metadata) -stats dump summary statistics on reads -b id output starting at read/library 'id' -e id output stopping after read/library 'id' -r id output only the single read 'id' ERROR: no gkpStore (-G) supplied. canu-1.6/documentation/source/commands/gatekeeperPartition.rst000066400000000000000000000005161314437614700250340ustar00rootroot00000000000000gatekeeperPartition ~~~~~~ :: usage: gatekeeperPartition -G gkpStore -P partitionMapFile -G gkpStore path to gatekeeper store -P partFile file mapping read ID to partition format: 'partition readID' ERROR: no gkpStore (-G) supplied. ERROR: no partition input (-P) supplied. canu-1.6/documentation/source/commands/generateCorrectionLayouts.rst000066400000000000000000000015011314437614700262250ustar00rootroot00000000000000generateCorrectionLayouts ~~~~~~ :: usage: generateCorrectionLayouts -G gkpStore -O ovlStore [ -T tigStore | -F ] ... -G gkpStore mandatory path to gkpStore -O ovlStore mandatory path to ovlStore -S file global score (binary) input file -T corStore output layouts to tigStore corStore -F output falconsense-style input directly to stdout -p name output prefix name, for logging and summary -b bgnID -e endID -rl file -L length minimum length of evidence overlaps -E erate max error rate of evidence overlaps -C coverage maximum coverage of evidence reads to emit -M length minimum length of a corrected read ERROR: no gkpStore input (-G) supplied. ERROR: no ovlStore input (-O) supplied. 
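A sketch of how the generateCorrectionLayouts options above might be combined during read correction (hypothetical store names and thresholds; the command is printed, not executed):

```shell
# Hypothetical generateCorrectionLayouts invocation: emit falconsense-style
# input on stdout, using evidence overlaps of at least 1000 bases and at most
# 30% error, capped at 40x evidence coverage per read.
cmd="generateCorrectionLayouts -G asm.gkpStore -O asm.ovlStore -S asm.globalScores -F -L 1000 -E 0.30 -C 40"
echo "$cmd"
```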
canu-1.6/documentation/source/commands/leaff.rst000066400000000000000000000044021314437614700221010ustar00rootroot00000000000000leaff ~~~~~~ :: usage: leaff [-f fasta-file] [options] SOURCE FILES -f file: use sequence in 'file' (-F is also allowed for historical reasons) -A file: read actions from 'file' SOURCE FILE EXAMINATION -d: print the number of sequences in the fasta -i name: print an index, labelling the source 'name' OUTPUT OPTIONS -6 <#>: insert a newline every 60 letters (if the next arg is a number, newlines are inserted every n letters, e.g., -6 80. Disable line breaks with -6 0, or just don't use -6!) -e beg end: Print only the bases from position 'beg' to position 'end' (space based, relative to the FORWARD sequence!) If beg == end, then the entire sequence is printed. It is an error to specify beg > end, or beg > len, or end > len. -ends n Print n bases from each end of the sequence. One input sequence generates two output sequences, with '_5' or '_3' appended to the ID. If 2n >= length of the sequence, the sequence itself is printed, no ends are extracted (they overlap). -C: complement the sequences -H: DON'T print the defline -h: Use the next word as the defline ("-H -H" will reset to the original defline -R: reverse the sequences -u: uppercase all bases SEQUENCE SELECTION -G n s l: print n randomly generated sequences, 0 < s <= length <= l -L s l: print all sequences such that s <= length < l -N l h: print all sequences such that l <= % N composition < h (NOTE 0.0 <= l < h < 100.0) (NOTE that you cannot print sequences with 100% N This is a useful bug). 
-q file: print sequences from the seqid list in 'file' -r num: print 'num' randomly picked sequences -s seqid: print the single sequence 'seqid' -S f l: print all the sequences from ID 'f' to 'l' (inclusive) -W: print all sequences (do the whole file) LONGER HELP -help analysis -help examples canu-1.6/documentation/source/commands/meryl.rst000066400000000000000000000140131314437614700221530ustar00rootroot00000000000000meryl ~~~~~~ :: usage: meryl [personality] [global options] [options] where personality is: -P -- compute parameters -B -- build table -S -- scan table -M -- "math" operations -D -- dump table -P: Given a sequence file (-s) or an upper limit on the number of mers in the file (-n), compute the table size (-t in build) to minimize the memory usage. -m # (size of a mer; required) -c # (homopolymer compression; optional) -p (enable positions) -s seq.fasta (seq.fasta is scanned to determine the number of mers) -n # (compute params assuming file with this many mers in it) Only one of -s, -n need to be specified. If both are given -s takes priority. -B: Given a sequence file (-s) and lots of parameters, compute the mer-count tables. By default, both strands are processed. -f (only build for the forward strand) -r (only build for the reverse strand) -C (use canonical mers, assumes both strands) -L # (DON'T save mers that occur less than # times) -U # (DON'T save mers that occur more than # times) -m # (size of a mer; required) -c # (homopolymer compression; optional) -p (enable positions) -s seq.fasta (sequence to build the table for) -o tblprefix (output table prefix) -v (entertain the user) By default, the computation is done as one large sequential process. Multi-threaded operation is possible, at additional memory expense, as is segmented operation, at additional I/O expense. Threaded operation: Split the counting in to n almost-equally sized pieces. This uses an extra h MB (from -P) per thread. 
-threads n (use n threads to build) Segmented, sequential operation: Split the counting into pieces that will fit into no more than m MB of memory, or into n equal sized pieces. Each piece is computed sequentially, and the results are merged at the end. Only one of -memory and -segments is needed. -memory mMB (use at most m MB of memory per segment) -segments n (use n segments) Segmented, batched operation: Same as sequential, except this allows each segment to be manually executed in parallel. Only one of -memory and -segments is needed. -memory mMB (use at most m MB of memory per segment) -segments n (use n segments) -configbatch (create the batches) -countbatch n (run batch number n) -mergebatch (merge the batches) Initialize the compute with -configbatch, which needs all the build options. Execute all -countbatch jobs, then -mergebatch to complete. meryl -configbatch -B [options] -o file meryl -countbatch 0 -o file meryl -countbatch 1 -o file ... meryl -countbatch N -o file meryl -mergebatch N -o file Batched mode can run on the grid. -sge jobname unique job name for this execution. Meryl will submit jobs with name mpjobname, ncjobname, nmjobname, for phases prepare, count and merge. -sgebuild "options" any additional options to sge, e.g., -sgemerge "options" "-p -153 -pe thread 2 -A merylaccount" N.B. - -N will be ignored N.B. - be sure to quote the options -M: Given a list of tables, perform a math, logical or threshold operation. Unless specified, all operations take any number of databases. Math operations are: min count is the minimum count for all databases. If the mer does NOT exist in all databases, the mer has a zero count, and is NOT in the output. 
minexist count is the minimum count for all databases that contain the mer max count is the maximum count for all databases add count is sum of the counts for all databases sub count is the first minus the second (binary only) abs count is the absolute value of the first minus the second (binary only) Logical operations are: and outputs mer iff it exists in all databases nand outputs mer iff it exists in at least one, but not all, databases or outputs mer iff it exists in at least one database xor outputs mer iff it exists in an odd number of databases Threshold operations are: lessthan x outputs mer iff it has count < x lessthanorequal x outputs mer iff it has count <= x greaterthan x outputs mer iff it has count > x greaterthanorequal x outputs mer iff it has count >= x equal x outputs mer iff it has count == x Threshold operations work on exactly one database. -s tblprefix (use tblprefix as a database) -o tblprefix (create this output) -v (entertain the user) NOTE: Multiple tables are specified with multiple -s switches; e.g.: meryl -M add -s 1 -s 2 -s 3 -s 4 -o all NOTE: It is NOT possible to specify more than one operation: meryl -M add -s 1 -s 2 -sub -s 3 will NOT work. -D: Dump the table (not all of these work). -Dd Dump a histogram of the distance between the same mers. -Dt Dump mers >= a threshold. Use -n to specify the threshold. -Dc Count the number of mers, distinct mers and unique mers. -Dh Dump (to stdout) a histogram of mer counts. -s Read the count table from here (leave off the .mcdat or .mcidx). 
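A sketch tying the -B (build) and -D (dump) personalities above together (hypothetical file names; the commands are printed, not executed):

```shell
# Hypothetical meryl session: count canonical 22-mers in reads.fasta using
# four threads, then dump a histogram of the resulting counts.
build_cmd="meryl -B -C -m 22 -threads 4 -s reads.fasta -o reads"
histo_cmd="meryl -Dh -s reads"
printf '%s\n' "$build_cmd" "$histo_cmd"
```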
canu-1.6/documentation/source/commands/mhapConvert.rst000066400000000000000000000006161314437614700233150ustar00rootroot00000000000000mhapConvert ~~~~~~ :: usage: mhapConvert [options] file.mhap[.gz] Converts mhap native output to ovb -o out.ovb output file -h id num base id and number of hash table reads (mhap output IDs 1 through 'num') -q id base id of query reads (mhap output IDs 'num+1' and higher) ERROR: no overlap files supplied canu-1.6/documentation/source/commands/ovStoreBucketizer.rst000066400000000000000000000024051314437614700245160ustar00rootroot00000000000000ovStoreBucketizer ~~~~~~ :: usage: ovStoreBucketizer -O asm.ovlStore -G asm.gkpStore -i file.ovb.gz -job j [opts] -O asm.ovlStore path to store to create -G asm.gkpStore path to gkpStore for this assembly -C config path to previously created ovStoreBuild config data file -i file.ovb.gz input overlaps -job j index of this overlap input file -F f use up to 'f' files for store creation -obt filter overlaps for OBT -dup filter overlaps for OBT/dedupe -e e filter overlaps above e fraction error -raw write uncompressed buckets DANGER DO NOT USE DO NOT USE DO NOT USE DANGER DANGER DANGER DANGER This command is difficult to run by hand. DANGER DANGER Use ovStoreCreate instead. DANGER DANGER DANGER DANGER DO NOT USE DO NOT USE DO NOT USE DANGER ERROR: No overlap store (-O) supplied. ERROR: No gatekeeper store (-G) supplied. ERROR: No input (-i) supplied. ERROR: No job index (-job) supplied. 
canu-1.6/documentation/source/commands/ovStoreBuild.rst000066400000000000000000000020661314437614700234510ustar00rootroot00000000000000ovStoreBuild ~~~~~~ :: usage: ovStoreBuild -O asm.ovlStore -G asm.gkpStore [opts] [-L fileList | *.ovb.gz] -O asm.ovlStore path to store to create -G asm.gkpStore path to gkpStore for this assembly -L fileList read input filenames from 'fileList' -F f use up to 'f' files for store creation -M g use up to 'g' gigabytes memory for sorting overlaps default 4; g-0.125 gb is available for sorting overlaps -e e filter overlaps above e fraction error -l l filter overlaps below l bases overlap length (needs gkpStore to get read lengths!) Non-building options: -evalues input files are evalue updates from overlap error adjustment -config out.dat don't build a store, just dump a binary partitioning file for ovStoreBucketizer ERROR: No overlap store (-O) supplied. ERROR: No gatekeeper store (-G) supplied. ERROR: No input overlap files (-L or last on the command line) supplied. canu-1.6/documentation/source/commands/ovStoreDump.rst000066400000000000000000000034101314437614700233110ustar00rootroot00000000000000ovStoreDump ~~~~~~ :: usage: ovStoreDump -G gkpStore -O ovlStore ... There are three modes of operation: -d [a[-b]] dump overlaps for reads a to b, inclusive -q a b report the a,b overlap, if it exists. -p a dump a picture of overlaps to fragment 'a'. FORMAT (for -d) -coords dump overlap showing coordinates in the reads (default) -hangs dump overlap showing dovetail hangs unaligned -raw dump overlap showing its raw native format (four hangs) -paf dump overlaps in miniasm/minimap format -binary dump overlap as raw binary data -counts dump the number of overlaps per read MODIFIERS (for -d and -p) -E erate Dump only overlaps <= erate fraction error. -L length Dump only overlaps that are larger than L bases (only for -p picture mode). -d5 Dump only overlaps off the 5' end of the A frag. -d3 Dump only overlaps off the 3' end of the A frag.
-dC Dump only overlaps that are contained in the A frag (B contained in A). -dc Dump only overlaps that are containing the A frag (A contained in B). -v Report statistics (to stderr) on some dumps (-d). -unique Report only overlaps where A id is < B id, do not report both A to B and B to A overlap -best prefix Annotate picture with status from bogart outputs prefix.edges, prefix.singletons, prefix.edges.suspicious -noc With -best data, don't show overlaps to contained reads. -nos With -best data, don't show overlaps to suspicious reads. ERROR: no operation (-d, -q or -p) supplied. ERROR: no input gkpStore (-G) supplied. ERROR: no input ovlStore (-O) supplied. canu-1.6/documentation/source/commands/ovStoreIndexer.rst000066400000000000000000000020031314437614700237770ustar00rootroot00000000000000ovStoreIndexer ~~~~~~ :: usage: ovStoreIndexer ... -O x.ovlStore path to overlap store to build the final index for -F s number of slices used in bucketizing/sorting -t x.ovlStore explicitly test a previously constructed index -f when testing, also create a new 'idx.fixed' which might resolve rare problems -nodelete do not remove intermediate files when the index is successfully created DANGER DO NOT USE DO NOT USE DO NOT USE DANGER DANGER DANGER DANGER This command is difficult to run by hand. DANGER DANGER Use ovStoreCreate instead. DANGER DANGER DANGER DANGER DO NOT USE DO NOT USE DO NOT USE DANGER ERROR: No overlap store (-O) supplied. ERROR: One of -F (number of slices) or -t (test a store) must be supplied. canu-1.6/documentation/source/commands/ovStoreSorter.rst000066400000000000000000000021511314437614700236630ustar00rootroot00000000000000ovStoreSorter ~~~~~~ :: usage: ovStoreSorter ... 
-O x.ovlStore path to overlap store to build the final index for -G asm.gkpStore path to gkpStore for this assembly -F s number of slices used in bucketizing/sorting -job j m index of this overlap input file, and max number of files -M m maximum memory to use, in gigabytes -deleteearly remove intermediates as soon as possible (unsafe) -deletelate remove intermediates when outputs exist (safe) -force force a recompute, even if the output exists DANGER DO NOT USE DO NOT USE DO NOT USE DANGER DANGER DANGER DANGER This command is difficult to run by hand. DANGER DANGER Use ovStoreCreate instead. DANGER DANGER DANGER DANGER DO NOT USE DO NOT USE DO NOT USE DANGER ERROR: No overlap store (-O) supplied. ERROR: no slice number (-F) supplied. ERROR: no max job ID (-job) supplied. canu-1.6/documentation/source/commands/overlapConvert.rst000066400000000000000000000005501314437614700240350ustar00rootroot00000000000000overlapConvert ~~~~~~ :: usage: overlapConvert [options] file.ovb[.gz] -G gkpStore (needed for -coords, the default) -coords output coordinates on reads -hangs output hangs on reads -raw output raw hangs on reads ERROR: -coords mode requires a gkpStore (-G) ERROR: no overlap files supplied canu-1.6/documentation/source/commands/overlapImport.rst000066400000000000000000000013061314437614700236670ustar00rootroot00000000000000overlapImport ~~~~~~ :: usage: overlapImport [options] ascii-ovl-file-input.[.gz] Required: -G name.gkpStore path to valid gatekeeper store Output options: -o file.ovb output file name -O name.ovlStore output overlap store Format options: -legacy 'CA8 overlapStore -d' format -coords 'overlapConvert -coords' format (not implemented) -hangs 'overlapConvert -hangs' format (not implemented) -raw 'overlapConvert -raw' format Input file can be stdin ('-') or a gz/bz2/xz compressed file. ERROR: need to supply a gkpStore (-G). ERROR: need to supply a format type (-legacy, -coords, -hangs, -raw).
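overlapConvert and overlapImport are designed as a pair: the ASCII `-raw` form written by one is the `-raw` form read by the other. A dry-run sketch of such a round trip; every store and file name here is hypothetical, and the stdout redirection for overlapConvert is an assumption about how it emits its output:

```shell
#!/bin/sh
# Sketch only: round-trip binary overlaps through the ASCII '-raw' format.
# All file and store names are hypothetical; commands are printed rather
# than executed (remove the echoes and put canu's bin on PATH to run).
dump="overlapConvert -G asm.gkpStore -raw asm.1.ovb.gz"
load="overlapImport -G asm.gkpStore -raw -o rebuilt.ovb overlaps.txt"
echo "$dump > overlaps.txt"
echo "$load"
```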
canu-1.6/documentation/source/commands/overlapInCore.rst000066400000000000000000000045201314437614700235750ustar00rootroot00000000000000overlapInCore ~~~~~~ :: * No kmer length supplied; -k needed! ERROR: No output file name specified USAGE: overlapInCore [options] -b in contig mode, specify the output file -c contig mode. Use 2 frag stores. First is for reads; second is for contigs -G do partial overlaps -h to specify fragments to put in hash table Implies LSF mode (no changes to frag store) -I designate a file of frag iids to limit olaps to (Contig mode only) -k if one or two digits, the length of a kmer, otherwise the filename containing a list of kmers to ignore in the hash table -l specify the maximum number of overlaps per fragment-end per batch of fragments. -m allow multiple overlaps per oriented fragment pair -M specify memory size. Valid values are '8GB', '4GB', '2GB', '1GB', '256MB'. (Not for Contig mode) -o specify output file name -P write protoIO output (if not -G) -r specify old fragments to overlap -t use parallel threads -u allow only 1 overlap per oriented fragment pair -w filter out overlaps with too many errors in a window -z skip the hopeless check --maxerate only output overlaps with fraction or less error (e.g., 0.06 == 6%) --minlength only output overlaps of or more bases --hashbits n Use n bits for the hash mask. --hashstrings n Load at most n strings into the hash table at one time. --hashdatalen n Load at most n bytes into the hash table at one time. --hashload f Load to at most 0.0 < f < 1.0 capacity (default 0.7). --maxreadlen n For batches with all short reads, pack bits differently to process more reads per batch. all reads must be shorter than n --hashstrings limited to 2^(30-m) Common values: maxreadlen 2048->hashstrings 524288 (default) maxreadlen 512->hashstrings 2097152 maxreadlen 128->hashstrings 8388608 --readsperbatch n Force batch size to n. --readsperthread n Force each thread to process n reads. 
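Pulling several of the switches above together, one batch might be invoked as sketched below. The trailing gkpStore operand and the `-h`/`-r` read ranges are assumptions about how canu's overlap driver calls this binary; the error and length thresholds are example values only, and the command is printed rather than executed:

```shell
#!/bin/sh
# Sketch only: a single overlapInCore batch, assembled from the switches
# documented above.  Option values and the positional gkpStore argument
# are illustrative assumptions; this prints the command as a dry run.
cmd="overlapInCore -t 4 -k 22 --maxerate 0.06 --minlength 500 -h 1-10000 -r 1-10000 -o asm.1.ovb.gz asm.gkpStore"
echo "$cmd"
```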
canu-1.6/documentation/source/commands/overlapInCorePartition.rst000066400000000000000000000001351314437614700254650ustar00rootroot00000000000000overlapInCorePartition ~~~~~~ :: HASH: 0 reads or 0 length. REF: 0 reads or 0 length. canu-1.6/documentation/source/commands/overlapPair.rst000066400000000000000000000014711314437614700233130ustar00rootroot00000000000000overlapPair ~~~~~~ :: usage: overlapPair ... -G gkpStore Mandatory, path to gkpStore Inputs can come from either a store or a file. -O ovlStore -O ovlFile If from an ovlStore, the range of reads processed can be restricted. -b bgnID -e endID Outputs will be written to a store or file, depending on the input type -o ovlStore -o ovlFile -erate e Overlaps are computed at 'e' fraction error; must be larger than the original erate -partial Overlaps are 'overlapInCore -G' partial overlaps -memory m Use up to 'm' GB of memory -t n Use up to 'n' cores Advanced options: -invert Invert the overlap A <-> B before aligning (they are not re-inverted before output) canu-1.6/documentation/source/commands/prefixEditDistance-matchLimitGenerate.rst000066400000000000000000000003761314437614700303540ustar00rootroot00000000000000prefixEditDistance-matchLimitGenerate ~~~~~~ :: usage: prefixEditDistance-matchLimitGenerate minEvalue [maxEvalue [step]] computes overlapper probabilities for minEvalue <= eValue <= maxEvalue' eValue 100 == 0.01 fraction error == 1% error canu-1.6/documentation/source/commands/readConsensus.rst000066400000000000000000000014121314437614700236360ustar00rootroot00000000000000readConsensus ~~~~~~ :: usage: readConsensus ... -G gkpStore Mandatory, path to gkpStore Inputs can come from either an overlap or a tig store. -O ovlStore -T tigStore tigVers If from an ovlStore, the range of reads processed can be restricted. 
-b bgnID -e endID Outputs will be written as the full multialignment and the final consensus sequence -c output.cns -f output.fastq -erate e Overlaps are computed at 'e' fraction error; must be larger than the original erate -memory m Use up to 'm' GB of memory -t n Use up to 'n' cores ERROR: no gatekeeper (-G) supplied. ERROR: no inputs (-O or -T) supplied. ERROR: no outputs (-c or -f) supplied. canu-1.6/documentation/source/commands/simple.rst000066400000000000000000000001121314437614700223100ustar00rootroot00000000000000simple ~~~~~~ :: no input given with '-i' no output given with '-o' canu-1.6/documentation/source/commands/splitReads.rst000066400000000000000000000011241314437614700231440ustar00rootroot00000000000000splitReads ~~~~~~ :: usage: splitReads -G gkpStore -O ovlStore -Ci input.clearFile -Co output.clearFile -o outputPrefix -G gkpStore path to read store -O ovlStore path to overlap store -o name output prefix, for logging -t bgn-end limit processing to only reads from bgn to end (inclusive) -Ci clearFile path to input clear ranges (NOT SUPPORTED) -Co clearFile path to output clear ranges -e erate ignore overlaps with more than 'erate' percent error -minlength l reads trimmed below this many bases are deleted canu-1.6/documentation/source/commands/tgStoreCoverageStat.rst000066400000000000000000000012611314437614700247630ustar00rootroot00000000000000tgStoreCoverageStat ~~~~~~ :: usage: tgStoreCoverageStat -G gkpStore -T tigStore version -o output-prefix [-s genomeSize] ... -G Mandatory, path G to a gkpStore directory. -T Mandatory, path T to a tigStore, and version V. -o Mandatory, prefix for output files. -s Optional, assume genome size S. -n Do not update the tigStore (default = do update). -u Do not estimate based on N50 (default = use N50). -L Be lenient; don't require reads to start at position zero. No gatekeeper store (-G option) supplied. No input tigStore (-T option) supplied. No output prefix (-o option) supplied.
canu-1.6/documentation/source/commands/tgStoreDump.rst000066400000000000000000000056741314437614700233150ustar00rootroot00000000000000tgStoreDump ~~~~~~ :: usage: tgStoreDump -G -T [opts] STORE SELECTION (mandatory) -G path to the gatekeeper store -T path to the tigStore, version, to use TIG SELECTION - if nothing specified, all tigs are reported - all ranges are inclusive. -tig A[-B] only dump tigs between ids A and B -unassembled only dump tigs that are 'unassembled' -bubbles only dump tigs that are 'bubbles' -contigs only dump tigs that are 'contigs' -nreads min max only dump tigs with between min and max reads -length min max only dump tigs with length between 'min' and 'max' bases -coverage c C g G only dump tigs with between fraction g and G at coverage between c and C example: -coverage 10 inf 0.5 1.0 would report tigs where half of the bases are at 10+ times coverage. DUMP TYPE - all dumps, except status, report on tigs selected as above -status the number of tigs in the store -tigs a list of tigs, and some information about them -consensus [opts] the consensus sequence, with options: -gapped report the gapped (multialignment) consensus sequence -fasta report sequences in FASTA format (the default) -fastq report sequences in FASTQ format -layout [opts] the layout of reads in each tig if '-o' is supplied, three files are created, otherwise just the layout is printed to stdout -gapped report the gapped (multialignment) positions -o outputPrefix write plots to 'outputPrefix.*' in the current directory -multialign [opts] the full multialignment, output is to stdout -w width width of the page -s spacing spacing between reads on the same line -sizes [opts] size statistics -s genomesize denominator to use for n50 computation -coverage [opts] read coverage plots, one plot per tig -o outputPrefix write plots to 'outputPrefix.*' in the current directory -depth [opts] a histogram of depths -single one histogram per tig -overlap read overlaps -thin overlap report 
regions where the (thickest) read overlap is less than 'overlap' bases -overlaphistogram a histogram of the thickest overlaps used -o outputPrefix write plots to 'outputPrefix.*' in the current directory canu-1.6/documentation/source/commands/tgStoreFilter.rst000066400000000000000000000000731314437614700236210ustar00rootroot00000000000000tgStoreFilter ~~~~~~ :: this is obsolete. do not use. canu-1.6/documentation/source/commands/tgStoreLoad.rst000066400000000000000000000021631314437614700232550ustar00rootroot00000000000000tgStoreLoad ~~~~~~ :: usage: tgStoreLoad -G -T [input.cns] -G Path to the gatekeeper store -T Path to the tigStore and version to add tigs to -L Load the tig(s) from files listed in 'file-of-files' -n Don't replace, just report what would have happened The primary operation is to replace tigs in the store with ones in a set of input files. The input files can be either supplied directly on the command line or listed in a text file (-L). A new store is created if one doesn't exist, otherwise, whatever tigs are there are replaced with those in the -R file. If version 'v' doesn't exist, it is created. Even if -n is supplied, a new store is created if one doesn't exist. To add a new tig, give it a tig id of -1. New tigs must be added to the latest version. To delete a tig, remove all children, and set the number of them to zero. ERROR: no gatekeeper store (-G) supplied. ERROR: no tig store (-T) supplied. ERROR: no input tigs (-R) supplied. 
canu-1.6/documentation/source/commands/tgTigDisplay.rst000066400000000000000000000001061314437614700234250ustar00rootroot00000000000000tgTigDisplay ~~~~~~ :: usage: tgTigDisplay -G gkpStore -t tigFile canu-1.6/documentation/source/commands/trimReads.rst000066400000000000000000000015011314437614700227530ustar00rootroot00000000000000trimReads ~~~~~~ :: usage: trimReads -G gkpStore -O ovlStore -Co output.clearFile -o outputPrefix -G gkpStore path to read store -O ovlStore path to overlap store -o name output prefix, for logging -t bgn-end limit processing to only reads from bgn to end (inclusive) -Ci clearFile path to input clear ranges (NOT SUPPORTED) -Co clearFile path to output clear ranges -e erate ignore overlaps with more than 'erate' percent error -ol l the minimum evidence overlap length -oc c the minimum evidence overlap coverage evidence overlaps must overlap by 'l' bases to be joined, and must be at least 'c' deep to be retained -minlength l reads trimmed below this many bases are deleted canu-1.6/documentation/source/commands/updateDocs.sh000066400000000000000000000004671314437614700227300ustar00rootroot00000000000000for file in `ls *.rst`; do cmd=`echo $file |sed s/.rst//g` export PATH=$PATH:../../../Darwin-amd64/bin OUTPUT=`$cmd 2>&1 |awk '{print " "$0}'` echo "$cmd" > $file echo "~~~~~~" >> $file echo "" >> $file echo "::" >> $file echo "" >> $file echo "$OUTPUT" >> $file done canu-1.6/documentation/source/commands/utgcns.rst000066400000000000000000000066141314437614700223360ustar00rootroot00000000000000utgcns ~~~~~~ :: usage: utgcns [opts] INPUT -G g Load reads from gkStore 'g' -T t v p Load unitigs from tgStore 't', version 'v', partition 'p'. Expects reads will be in gkStore partition 'p' as well Use p='.'
to specify no partition -t file Test the computation of the unitig layout in 'file' 'file' can be from: 'tgStoreDump -d layout' (human readable layout format) 'utgcns -L' (human readable layout format) 'utgcns -O' (binary multialignment format) -p package Load unitig and read from 'package' created with -P. This is usually used by developers. ALGORITHM -quick No alignments, just paste read sequence into the unitig positions. This is very fast, but the consensus sequence is formed from a mosaic of read sequences, and there can be large indel. This is useful for checking intermediate assembly structure by mapping to reference, or possibly for use as input to a polishing step. -pbdagcon Use pbdagcon (https://github.com/PacificBiosciences/pbdagcon). This is fast and robust. It is the default algorithm. It does not generate a final multialignment output (the -v option will not show anything useful). -utgcns Use utgcns (the original Celera Assembler consensus algorithm) This isn't as fast, isn't as robust, but does generate a final multialign output. OUTPUT -O results Write computed tigs to binary output file 'results' -L layouts Write computed tigs to layout output file 'layouts' -A fasta Write computed tigs to fasta output file 'fasta' -Q fastq Write computed tigs to fastq output file 'fastq' -P package Create a copy of the inputs needed to compute the unitigs. This file can then be sent to the developers for debugging. The unitig(s) are not processed and no other outputs are created. Ideally, only one unitig is selected (-u, below). TIG SELECTION (if -T input is used) -u b Compute only unitig ID 'b' (must be in the correct partition!) -u b-e Compute only unitigs from ID 'b' to ID 'e' -f Recompute unitigs that already have a multialignment -maxlength l Do not compute consensus for unitigs longer than l bases. 
PARAMETERS -e e Expect alignments at up to fraction e error -em m Don't ever allow alignments more than fraction m error -l l Expect alignments of at least l bases -maxcoverage c Use non-contained reads and the longest contained reads, up to C coverage, for consensus generation. The default is 0, and will use all reads. LOGGING -v Show multialigns. -V Enable debugging option 'verbosemultialign'. ERROR: No gkpStore (-G) and no package (-p) supplied. ERROR: No tigStore (-T) OR no test unitig (-t) OR no package (-p) supplied. canu-1.6/documentation/source/conf.py000066400000000000000000000210251314437614700177700ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # canu documentation build configuration file, created by # sphinx-quickstart on Wed Aug 26 18:41:08 2015. # # This file is execfile()d with the current directory set to its # containing dir. # # Note that not all possible configuration values are present in this # autogenerated file. # # All configuration values have a default; values that are commented out # serve to show the default. import sys import os # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. #sys.path.insert(0, os.path.abspath('.')) # -- General configuration ------------------------------------------------ # If your documentation needs a minimal Sphinx version, state it here. #needs_sphinx = '1.0' # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = [ 'sphinx.ext.todo', 'sphinx.ext.mathjax', 'sphinx.ext.ifconfig', ] # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] # The suffix of source filenames. source_suffix = '.rst' # The encoding of source files. 
#source_encoding = 'utf-8-sig' # The master toctree document. master_doc = 'index' # General information about the project. project = u'canu' copyright = u'2015, Adam Phillippy, Sergey Koren, Brian Walenz' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. version = '1.6' # The full version, including alpha/beta/rc tags. release = '1.6' # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. #language = None # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: #today = '' # Else, today_fmt is used as the format for a strftime call. #today_fmt = '%B %d, %Y' # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. exclude_patterns = [] # The reST default role (used for this markup: `text`) to use for all # documents. #default_role = None # If true, '()' will be appended to :func: etc. cross-reference text. #add_function_parentheses = True # If true, the current module name will be prepended to all description # unit titles (such as .. function::). #add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. #show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # A list of ignored prefixes for module index sorting. #modindex_common_prefix = [] # If true, keep warnings as "system message" paragraphs in the built documents. #keep_warnings = False # -- Options for HTML output ---------------------------------------------- # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. 
html_theme = 'default' # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. #html_theme_options = {} # Add any paths that contain custom themes here, relative to this directory. #html_theme_path = [] # Build using the RTD theme, if not on RTD. # https://read-the-docs.readthedocs.org/en/latest/theme.html # https://github.com/snide/sphinx_rtd_theme # on_rtd = os.environ.get('READTHEDOCS', None) == 'True' if not on_rtd: # only import and set the theme if we're building docs locally import sphinx_rtd_theme html_theme = 'sphinx_rtd_theme' html_theme_path = [ "/usr/local/lib/python2.7/site-packages", ] # The name for this set of Sphinx documents. If None, it defaults to # " v documentation". #html_title = None # A shorter title for the navigation bar. Default is the same as html_title. #html_short_title = None # The name of an image file (relative to this directory) to place at the top # of the sidebar. #html_logo = None # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. #html_favicon = None # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ['_static'] # Add any extra paths that contain custom files (such as robots.txt or # .htaccess) here, relative to this directory. These files are copied # directly to the root of the documentation. #html_extra_path = [] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. #html_last_updated_fmt = '%b %d, %Y' # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. 
#html_use_smartypants = True # Custom sidebar templates, maps document names to template names. #html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. #html_additional_pages = {} # If false, no module index is generated. #html_domain_indices = True # If false, no index is generated. #html_use_index = True # If true, the index is split into individual pages for each letter. #html_split_index = False # If true, links to the reST sources are added to the pages. #html_show_sourcelink = True # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. #html_show_sphinx = True # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. #html_show_copyright = True # If true, an OpenSearch description file will be output, and all pages will # contain a tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. #html_use_opensearch = '' # This is the file name suffix for HTML files (e.g. ".xhtml"). #html_file_suffix = None # Output file base name for HTML help builder. htmlhelp_basename = 'canudoc' # -- Options for LaTeX output --------------------------------------------- latex_elements = { # The paper size ('letterpaper' or 'a4paper'). #'papersize': 'letterpaper', # The font size ('10pt', '11pt' or '12pt'). #'pointsize': '10pt', # Additional stuff for the LaTeX preamble. #'preamble': '', } # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, # author, documentclass [howto, manual, or own class]). latex_documents = [ ('index', 'canu.tex', u'canu Documentation', u'Adam Phillippy, Sergey Koren, Brian Walenz', 'manual'), ] # The name of an image file (relative to this directory) to place at the top of # the title page. #latex_logo = None # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. 
#latex_use_parts = False # If true, show page references after internal links. #latex_show_pagerefs = False # If true, show URL addresses after external links. #latex_show_urls = False # Documents to append as an appendix to all manuals. #latex_appendices = [] # If false, no module index is generated. #latex_domain_indices = True # -- Options for manual page output --------------------------------------- # One entry per manual page. List of tuples # (source start file, name, description, authors, manual section). man_pages = [ ('index', 'canu', u'canu Documentation', [u'Adam Phillippy, Sergey Koren, Brian Walenz'], 1) ] # If true, show URL addresses after external links. #man_show_urls = False # -- Options for Texinfo output ------------------------------------------- # Grouping the document tree into Texinfo files. List of tuples # (source start file, target name, title, author, # dir menu entry, description, category) texinfo_documents = [ ('index', 'canu', u'canu Documentation', u'Adam Phillippy, Sergey Koren, Brian Walenz', 'canu', 'One line description of project.', 'Miscellaneous'), ] # Documents to append as an appendix to all manuals. #texinfo_appendices = [] # If false, no module index is generated. #texinfo_domain_indices = True # How to display URL addresses: 'footnote', 'no', or 'inline'. #texinfo_show_urls = 'footnote' # If true, do not generate a @detailmenu in the "Top" node's menu. #texinfo_no_detailmenu = False canu-1.6/documentation/source/faq.rst000066400000000000000000000351171314437614700177710ustar00rootroot00000000000000 .. _faq: Canu FAQ ======== .. contents:: :local: What resources does Canu require for a bacterial genome assembly? A mammalian assembly? ------------------------------------- Canu will detect available resources and configure itself to run efficiently using those resources. It will request resources, for example, the number of compute threads to use, based on the ``genomeSize`` being assembled.
It will fail to even start if it decides there are insufficient resources available. A typical bacterial genome can be assembled with 8GB memory in a few CPU hours - around an hour on 8 cores. It is possible, but not allowed by default, to run with only 4GB memory. A well-behaved large genome, such as human or other mammals, can be assembled in 10,000 to 25,000 CPU hours, depending on coverage. A grid environment is strongly recommended, with at least 16GB available on each compute node, and one node with at least 64GB memory. You should plan on having 3TB free disk space, much more for highly repetitive genomes. Our compute nodes have 48 compute threads and 128GB memory, with a few larger nodes with up to 1TB memory. We develop and test (mostly bacteria, yeast and drosophila) on laptops and desktops with 4 to 12 compute threads and 16GB to 64GB memory. How do I run Canu on my SLURM / SGE / PBS / LSF / Torque system? ------------------------------------- Canu will detect and configure itself to run on most grids. You can supply your own grid options, such as a partition on SLURM or an account code on SGE, with ``gridOptions=""`` which will be passed to every job submitted by Canu. Similar options exist for every stage of Canu, which could be used to, for example, restrict overlapping to a specific partition or queue. To disable grid support and run only on the local machine, specify ``useGrid=false``. My run stopped with the error ``'Failed to submit batch jobs'`` ------------------------------------- The grid you run on must allow compute nodes to submit jobs. This means that if you are on a compute host, ``qsub/bsub/sbatch/etc`` must be available and working. You can test this by starting an interactive compute session and running the submit command manually (e.g. ``qsub`` on SGE, ``bsub`` on LSF, ``sbatch`` on SLURM). If this is not the case, Canu **WILL NOT** work on your grid. You must then set ``useGrid=false`` and run on a single machine.
Alternatively, you can run Canu with ``useGrid=remote`` which will stop at every submit command and list what should be submitted. You then submit these jobs manually, wait for them to complete, and run the Canu command again. This is a manual process but currently the only workaround for grids without submit support on the compute nodes. What parameters should I use for my reads? ------------------------------------- Canu is designed to be universal across a large range of PacBio (C2, P4-C2, P5-C3, P6-C4) and Oxford Nanopore (R6 through R9) data. Assembly quality and/or efficiency can be enhanced for specific datatypes: **Nanopore R7 1D** and **Low Identity Reads** With R7 1D sequencing data, and generally for any raw reads lower than 80% identity, five to ten rounds of error correction are helpful. To run just the correction phase, use options ``-correct corOutCoverage=500 corMinCoverage=0 corMhapSensitivity=high``. Use the output of the previous run (in ``asm.correctedReads.fasta.gz``) as input to the next round. Once corrected, assemble with ``-nanopore-corrected correctedErrorRate=0.3 utgGraphDeviation=50`` **Nanopore R7 2D** and **Nanopore R9 1D** Increase the maximum allowed difference in overlaps from the default of 4.5% to 7.5% with ``correctedErrorRate=0.075`` **Nanopore R9 2D** and **PacBio P6** Slightly decrease the maximum allowed difference in overlaps from the default of 4.5% to 4.0% with ``correctedErrorRate=0.040`` **Early PacBio Sequel** Based on exactly one publicly released *A. thaliana* `dataset `_, slightly decrease the maximum allowed difference from the default of 4.5% to 4.0% with ``correctedErrorRate=0.040 corMhapSensitivity=normal``. For recent Sequel data, the defaults are appropriate. **Nanopore R9 large genomes** Due to some systematic errors, the identity estimate used by Canu for correction can be an over-estimate of true error, inflating runtime.
For recent large genomes (>1gbp) we've used ``'corMhapOptions=--threshold 0.8 --num-hashes 512 --ordered-sketch-size 1000 --ordered-kmer-size 14'``. This can be used with 30x or more of coverage; below that, the defaults are fine. My assembly continuity is not good, how can I improve it? ------------------------------------- The most important determinant of assembly quality is sequence length, followed by the repeat complexity/heterozygosity of your sample. The first thing to check is the amount of corrected bases output by the correction step. This is logged in the stdout of Canu, or in canu-scripts/canu.*.out if you are running in a grid environment. For example, on `a haploid H. sapiens `_ sample: :: -- BEGIN TRIMMING -- ... -- In gatekeeper store 'chm1/trimming/asm.gkpStore': -- Found 5459105 reads. -- Found 91697412754 bases (29.57 times coverage). ... Canu tries to correct the longest 40X of data. Some loss is normal, but output coverage below 20-25X is a sign that correction did not work well (assuming you have more input coverage than that). If that is the case, re-running with ``corMhapSensitivity=normal`` if you have >50X, or ``corMhapSensitivity=high corMinCoverage=0`` otherwise, can help. You can also increase the target coverage to correct (``corOutCoverage=100``) to get more corrected sequence for assembly. If there are sufficient corrected reads, the poor assembly is likely due to either repeats in the genome being longer than the read lengths, or high heterozygosity in the sample. Stay tuned for more information on tuning unitigging in those instances. .. _tweak: What parameters can I tweak? ------------------------------------- For all stages: - ``rawErrorRate`` is the maximum expected difference in an alignment of two *uncorrected* reads. It is a meta-parameter that sets other parameters. - ``correctedErrorRate`` is the maximum expected difference in an alignment of two *corrected* reads. It is a meta-parameter that sets other parameters.
(If you're used to the ``errorRate`` parameter, multiply that by 3 and use it here.) - ``minReadLength`` and ``minOverlapLength``. The defaults are to discard reads shorter than 1000bp and to not look for overlaps shorter than 500bp. Increasing ``minReadLength`` can improve run time, and increasing ``minOverlapLength`` can improve assembly quality by removing false overlaps. However, increasing either too much will quickly degrade assemblies, either by omitting valuable reads or by missing true overlaps. For correction: - ``corOutCoverage`` controls how much coverage in corrected reads is generated. The default is to target 40X, but, for various reasons, this results in 30X to 35X of reads being generated. - ``corMinCoverage``, loosely, controls the quality of the corrected reads. It is the coverage in evidence reads that is needed before a (portion of a) corrected read is reported. Corrected reads are generated as a consensus of other reads; this is just the minimum coverage needed for the consensus sequence to be reported. The default is based on input read coverage: 0x coverage for less than 30X input coverage, and 4x coverage for more than that. For assembly: - ``utgOvlErrorRate`` is essentially a speed optimization. Overlaps above this error rate are not computed. Setting it too high generally just wastes compute time, while setting it too low will degrade assemblies by missing true overlaps between lower quality reads. - ``utgGraphDeviation`` and ``utgRepeatDeviation`` set what quality of overlaps are used in contig construction or in breaking contigs at false repeat joins, respectively. Both are in terms of a deviation from the mean error rate in the longest overlaps. - ``utgRepeatConfusedBP`` controls how similar a true overlap (between two reads in the same contig) and a false overlap (between two reads in different contigs) need to be before the contig is split.
When this occurs, it isn't clear which overlap is 'true' - the longer one or the slightly shorter one - and the contig is split to avoid misassemblies. For polyploid genomes: Generally, there are a couple of ways of dealing with the ploidy. 1) **Avoid collapsing the genome**, so you end up with double (assuming diploid) the genome size, as long as your divergence is above about 2% (for PacBio data). Below this divergence, you'd end up collapsing the variations. We've used the following parameters for polyploid populations (PacBio data): ``corOutCoverage=200 correctedErrorRate=0.040 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50"`` This will output more corrected reads (than the default 40x). The latter option will be more conservative at picking the error rate to use for the assembly, to try to maintain haplotype separation. If it works, you'll end up with an assembly >= 2x your haploid genome size. Post-processing using gene information or other synteny information is required to remove redundancy from this assembly. 2) **Smash haplotypes together** and then do phasing using another approach (like HapCUT2 or whatshap or others). In that case you want to do the opposite: increase the error rates used for finding overlaps: ``corOutCoverage=200 ovlErrorRate=0.15 obtErrorRate=0.15`` Error rates for trimming (``obtErrorRate``) and assembling (``batErrorRate``) can usually be left as is. When trimming, reads will be trimmed using other reads in the same chromosome (and probably some reads from other chromosomes). When assembling, overlaps well outside the observed error rate distribution are discarded. For low coverage: - For less than 30X coverage, increase the allowed difference in overlaps from 4.5% to 7.5% (or more) with ``correctedErrorRate=0.075``, to adjust for inferior read correction. Canu will automatically reduce ``corMinCoverage`` to zero to correct as many reads as possible.
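Taken together with the high-coverage advice that follows, the coverage-based adjustments to ``correctedErrorRate`` for PacBio data can be sketched as a small helper (illustrative only, not part of Canu; 0.045 is the PacBio default stated in the parameter reference):

```python
def suggest_corrected_error_rate(input_coverage, pacbio_default=0.045):
    """Suggest a correctedErrorRate for PacBio data from input coverage.

    Illustrative sketch of this document's advice: below 30X, raise the
    rate to 0.075 to compensate for inferior read correction; above 60X,
    lower it to 0.040 so only the better corrected reads are used.
    """
    if input_coverage < 30:
        return 0.075
    if input_coverage > 60:
        return 0.040
    return pacbio_default

# suggest_corrected_error_rate(20) -> 0.075
# suggest_corrected_error_rate(45) -> 0.045
# suggest_corrected_error_rate(80) -> 0.040
```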
For high coverage: - For more than 60X coverage, decrease the allowed difference in overlaps from 4.5% to 4.0% with ``correctedErrorRate=0.040``, so that only the better corrected reads are used. This is primarily an optimization for speed and generally does not change assembly continuity. My asm.contigs.fasta is empty, why? ------------------------------------- Canu creates three assembled sequence :ref:`output files `: ``.contigs.fasta``, ``.unitigs.fasta``, and ``.unassembled.fasta``, where contigs are the primary output, unitigs are the primary output split at alternate paths, and unassembled are the leftover pieces. The :ref:`contigFilter` parameter sets several parameters that control how small or low coverage initial contigs are handled. By default, initial contigs with more than 50% of their length at less than 5X coverage will be classified as 'unassembled' and removed from the assembly; that is, ``contigFilter="2 0 1.0 0.5 5"``. The filtering can be disabled by changing the last number from '5' to '0' (meaning, filter if 50% is less than 0X coverage). Why is my assembly missing my favorite short plasmid? ------------------------------------- Only the longest 40X of data (based on the specified genome size) is used for correction. Datasets with uneven coverage or small plasmids can fail to generate enough corrected reads to give enough coverage for assembly, resulting in gaps in the genome or even no reads for small plasmids. Set ``corOutCoverage=1000`` (or any value greater than your total input coverage) to correct all input data. An alternate approach is to correct all reads (``-correct corOutCoverage=1000``), then assemble 40X of reads picked at random from the ``.correctedReads.fasta.gz`` output. Why do I get less corrected read data than I asked for? ------------------------------------- Some reads are trimmed during correction due to being chimeric, or because there wasn't enough evidence to generate a quality corrected sequence.
Typically, this results in a 25% loss. Setting ``corMinCoverage=0`` will report all bases, even those of low quality. Canu will trim these in its 'trimming' phase before assembly. What is the minimum coverage required to run Canu? ------------------------------------- For eukaryotic genomes, coverage of more than 20X is enough to outperform current hybrid methods. My genome is AT (or GC) rich, do I need to adjust parameters? What about highly repetitive genomes? ------------------------------------- On bacterial genomes, no adjustment of parameters is (usually) needed. See the next question. On repetitive genomes with a significantly skewed AT/GC ratio, the Jaccard estimate used by MHAP is biased. Setting ``corMaxEvidenceErate=0.15`` is sufficient to correct for the bias in our testing. In general, with high coverage repetitive genomes (such as plants) it can be beneficial to set the above parameter anyway, as it will eliminate repetitive matches, speed up the assembly, and sometimes improve unitigs. How can I send data to you? ------------------------------------- FTP to ftp://ftp.cbcb.umd.edu/incoming/sergek. This is a write-only location that only the Canu developers can see. Here is a quick walk-through using a command-line ftp client (should be available on most Linux and OSX installations). Say we want to transfer a file named ``reads.fastq``. First, run ``ftp ftp.cbcb.umd.edu``, specify ``anonymous`` as the user name and hit return for password (blank). Then: .. code-block:: cd incoming/sergek put reads.fastq quit That's it, you won't be able to see the file but we can download it. canu-1.6/documentation/source/history.rst000066400000000000000000000067231314437614700207310ustar00rootroot00000000000000.. _history: Software Background ==================== Canu is derived from the Celera Assembler.
The Celera Assembler [Myers 2000] was designed to reconstruct mammalian chromosomal DNA sequences from the short fragments of a whole genome shotgun sequencing project. The Celera Assembler was used to produce reconstructions of several large genomes, namely those of Homo sapiens [Venter 2001], Mus musculus [Mural 2002], Rattus norvegicus [unpublished data], Canis familiaris [Kirkness 2003], Drosophila melanogaster [Adams 2000], and Anopheles gambiae [Holt 2002]. The Celera Assembler was shown to be very accurate when its reconstruction of the human genome was compared to independent reconstructions completed later [Istrail 2004]. It was used to reconstruct one of the first large-scale metagenomic projects [Venter 2004, Rusch 2007] and a diploid human reference [Levy 2007, Denisov 2008]. It was adapted to 454 Pyrosequencing [Miller 2008] and PacBio sequencing [Koren 2012], demonstrating finished bacterial genomes [Koren 2013] and efficient algorithms for eukaryotic assembly [Berlin 2015]. In 2015, Canu was forked from Celera Assembler and specialized for single-molecule high-noise sequences. The Celera Assembler codebase is no longer maintained. Canu is a pipeline consisting of several executable programs and perl driver scripts. The source code includes programs in the C++ language with Unix make scripts. The original Celera Assembler was designed to run under Compaq(R) Tru64(R) Unix with access to 32GB RAM. It has also run under IBM(R) AIX(R) and Red Hat Linux. The Celera Assembler was released under the GNU General Public License, version 2, as a supplement to the publication [Istrail 2004]. For the most recent license information please see README.licenses. References -------------------- - Adams et al. (2000) The Genome Sequence of Drosophila melanogaster. Science 287 2185-2195. - Myers et al. (2000) A Whole-Genome Assembly of Drosophila. Science 287 2196-2204. - Venter et al. (2001) The Sequence of the Human Genome. Science 291 1304-1351. - Mural et al.
(2002) A Comparison of Whole-Genome Shotgun-Derived Mouse Chromosome 16 and the Human Genome. Science 296 1661-1671. - Holt et al. (2002) The Genome Sequence of the Malaria Mosquito Anopheles gambiae. Science 298 129-149. - Istrail et al. (2004) Whole Genome Shotgun Assembly and Comparison of Human Genome Assemblies. PNAS 101 1916-1921. - Kirkness et al. (2003) The Dog Genome: Survey Sequencing and Comparative Analysis. Science 301 1898-1903. - Venter et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304 66-74. - Levy et al. (2007) The Diploid Genome Sequence of an Individual Human. PLoS Biology 0050254 - Rusch et al. (2007) The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biology 1821060. - Denisov et al. (2008) Consensus Generation and Variant Detection by Celera Assembler. Bioinformatics 24(8):1035-40 - Miller et al. (2008) Aggressive Assembly of Pyrosequencing Reads with Mates. Bioinformatics 24(24):2818-2824 - Koren et al. (2012) Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotechnology, July 2012. - Koren et al. (2013) Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biology 14:R101. - Berlin et al. (2015) Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing. Nature Biotechnology. canu-1.6/documentation/source/index.rst000066400000000000000000000041521314437614700203340ustar00rootroot00000000000000 ..
introduction what this is quick start one PacBio SMRT cell from ecoli pipeline (not reference, designed to be read through) overview introduce modular pipeline introduce gkpStore, overlaps, ovlStore, tigStore read correction read trimming unitig construction local vs grid mode canu option reference each option, in detail, grouped by function canu executable reference each binary, in detail, alphabetical option index (alphabetical) history Canu ==== .. toctree:: :hidden: quick-start faq tutorial pipeline parameter-reference command-reference history `Canu `_ is a fork of the Celera Assembler designed for high-noise single-molecule sequencing (such as the PacBio RSII or Oxford Nanopore MinION). Publication =========== Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM. `Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation `_. Genome Research. (2017). Install ========= The easiest way to get started is to download a `release `_. If you encounter any issues, please report them using the `github issues `_ page. Alternatively, you can build the latest unreleased version from github: :: git clone https://github.com/marbl/canu.git cd canu/src make -j Learn ========= * :ref:`Quick Start ` - no experience or data required, download and assemble *Escherichia coli* today! * :ref:`FAQ ` - Frequently asked questions * :ref:`Canu tutorial ` - a gentle introduction to the complexities of canu. * :ref:`Canu pipeline ` - what, exactly, is canu doing, anyway? * :ref:`Canu Parameter Reference ` - all the parameters, grouped by function. * :ref:`Canu Command Reference ` - all the commands that canu runs for you. * :ref:`Canu History ` - the history of the Canu project.
canu-1.6/documentation/source/overlap_transformations.svg000066400000000000000000000515021314437614700241760ustar00rootroot00000000000000[SVG figure: overlap transformations - an A/B overlap (a = x, b = y) and its reverse-complement forms (a = -x, b = -y; b = -x, a = -y; b = x, a = y)] canu-1.6/documentation/source/overlaps.svg000066400000000000000000000415051314437614700210520ustar00rootroot00000000000000[SVG figure: the four overlap types, by hang signs - a-hang >= 0 with b-hang >= 0; a-hang >= 0 with b-hang <= 0; a-hang <= 0 with b-hang <= 0; a-hang <= 0 with b-hang >= 0] canu-1.6/documentation/source/parameter-reference.rst000066400000000000000000000763611314437614700231520ustar00rootroot00000000000000 .. _parameter-reference: Canu Parameter Reference ======================== To get the most up-to-date options, run canu -options The default values below will vary based on the input data type and genome size. Boolean options accept true/false or 1/0. Memory sizes are assumed to be in gigabytes if no units are supplied. Values may be non-integer, with or without a unit - 'k' for kilobytes, 'm' for megabytes, 'g' for gigabytes or 't' for terabytes. For example, "0.25t" is equivalent to "256g" (or simply "256"). Global Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The catch-all category. .. _errorRate: errorRate (OBSOLETE) This parameter was removed on January 27th, 2016, and is valid only in Canu 1.4 or earlier. Canu currently still accepts the :ref:`errorRate ` parameter, but its use is strongly discouraged. The expected error in a single corrected read. The seven error rates were then set to three times this value (except for :ref:`corErrorRate `). .. _rawErrorRate: rawErrorRate The allowed difference in an overlap between two uncorrected reads, expressed as fraction error. Sets :ref:`corOvlErrorRate` and :ref:`corErrorRate`. The `rawErrorRate` typically does not need to be modified. It might need to be increased if very early reads are being assembled. The default is 0.300 for PacBio reads, and 0.500 for Nanopore reads. ..
_correctedErrorRate: correctedErrorRate The allowed difference in an overlap between two corrected reads, expressed as fraction error. Sets :ref:`obtOvlErrorRate`, :ref:`utgOvlErrorRate`, :ref:`obtErrorRate`, :ref:`utgErrorRate`, and :ref:`cnsErrorRate`. The `correctedErrorRate` can be adjusted to account for the quality of read correction, for the amount of divergence in the sample being assembled, and for the amount of sequence being assembled. The default is 0.045 for PacBio reads, and 0.144 for Nanopore reads. For low coverage datasets (less than 30X), we recommend increasing `correctedErrorRate` slightly, by 1% or so. For high-coverage datasets (more than 60X), we recommend decreasing `correctedErrorRate` slightly, by 1% or so. Raising the `correctedErrorRate` will increase run time. Likewise, decreasing `correctedErrorRate` will decrease run time, at the risk of missing overlaps and fracturing the assembly. .. _minReadLength: minReadLength Reads shorter than this are not loaded into the assembler. Reads output by correction and trimming that are shorter than this are discarded. Must be no smaller than minOverlapLength. If set high enough, the gatekeeper module will halt, as too many of the input reads have been discarded. Set `stopOnReadQuality` to false to avoid this. .. _minOverlapLength: minOverlapLength Overlaps shorter than this will not be discovered. Smaller values can be used to overcome a lack of read coverage, but will also lead to false overlaps and potential misassemblies. Larger values will result in more correct, but also more fragmented, assemblies. Must be no bigger than minReadLength. .. _genomeSize: genomeSize *required* An estimate of the size of the genome. Common suffixes are allowed, for example, 3.7m or 2.8g. The genome size estimate is used to decide how many reads to correct (via the corOutCoverage_ parameter) and how sensitive the mhap overlapper should be (via the mhapSensitivity_ parameter).
It also impacts some logging, in particular, reports of NG50 sizes. .. _canuIteration: canuIteration Which parallel iteration is being attempted. canuIterationMax How many parallel iterations to try. Ideally, the parallel jobs, run under grid control, would all finish successfully on the first try. Sometimes, jobs fail due to other jobs exhausting resources (memory), or due to the node itself failing. In this case, canu will launch the jobs again. This parameter controls how many times it tries. .. _onSuccess: onSuccess Execute the command supplied when Canu successfully completes an assembly. The command will execute in the assembly directory (the -d option to canu) and will be supplied with the name of the assembly (the -p option to canu) as its first and only parameter. .. _onFailure: onFailure Execute the command supplied when Canu terminates abnormally. The command will execute in the assembly directory (the -d option to canu) and will be supplied with the name of the assembly (the -p option to canu) as its first and only parameter. There are two exceptions when the command is not executed: if a 'spec' file cannot be read, or if canu tries to access an invalid parameter. The former will be reported as a command line error, and canu will never start. The latter should never occur, except when developers are developing the software. Process Control ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ showNext Report the first major command that would be run, but don't run it. Processing to get to that command, for example, checking the output of the previous command or preparing inputs for the next command, is still performed. stopAfter If set, Canu will stop processing after a specific stage in the pipeline finishes. Valid values for ``stopAfter`` are: - ``gatekeeper`` - stops after the reads are loaded into the assembler read database. - ``meryl`` - stops after frequent kmers are tabulated. - ``overlapConfigure`` - stops after overlap jobs are configured.
- ``overlap`` - stops after overlaps are generated, before they are loaded into the overlap database. - ``overlapStoreConfigure`` - stops after the ``ovsMethod=parallel`` jobs are configured; has no impact for ``ovsMethod=sequential``. - ``overlapStore`` - stops after overlaps are loaded into the overlap database. - ``readCorrection`` - stops after corrected reads are generated. - ``readTrimming`` - stops after trimmed reads are generated. - ``unitig`` - stops after unitigs and contigs are created. - ``consensusConfigure`` - stops after consensus jobs are configured. - ``consensus`` - stops after consensus sequences are loaded into the databases. General Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ pathMap A text file containing lines that map a hostname to a path to the assembler binaries. This can be used to provide fine-grained binary directories, for example, when incompatible versions of the operating system are used, or when canu is installed in a non-standard way. The hostname must be the same as returned by 'uname -n'. For example:: grid01 /grid/software/canu/Linux-amd64/bin/ devel01 /devel/canu/Linux-amd64/bin/ shell A path to a Bourne shell, to be used for executing scripts. By default, '/bin/sh', which is typically the same as 'bash'. C shells (csh, tcsh) are not supported. java A path to a Java application launcher of at least version 1.8. gnuplot A path to the gnuplot graphing utility. gnuplotImageFormat The type of image to generate in gnuplot. By default, canu will use png, svg or gif, in that order. gnuplotTested If set, skip the tests to determine if gnuplot will run, and to decide the image type to generate. This is used when gnuplot fails to run, or isn't even installed, and allows canu to continue execution without generating graphs. File Staging ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The correction stage of Canu requires random access to all the reads. 
Performance is greatly improved if the gkpStore database of reads is copied locally to each node that computes corrected read consensus sequences. This 'staging' is enabled by supplying a path name to fast local storage with the `stageDirectory` option, and, optionally, requesting access to that resource from the grid with the `gridEngineStageOption` option. stageDirectory A path to a directory local to each compute node. The directory should use an environment variable specific to the grid engine to ensure that it is unique to each task. For example, in Sun Grid Engine, `/scratch/$JOB_ID-$SGE_TASK_ID` will use both the numeric job ID and the numeric task ID. In SLURM, `/scratch/$SLURM_JOBID` accomplishes the same. If specified on the command line, be sure to escape the dollar sign, otherwise the shell will try to expand it before Canu sees the option: `stageDirectory=/scratch/\$JOB_ID-\$SGE_TASK_ID`. If specified in a specFile, do not escape the dollar signs. gridEngineStageOption This string is passed to the job submission command, and is expected to request local disk space on each node. It is highly grid specific. The string `DISK_SPACE` will be replaced with the amount of disk space needed, in gigabytes. On SLURM, an example is `--gres=lscratch:DISK_SPACE` Cleanup Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ saveOverlaps If set, do not remove raw overlap output from either mhap or overlapInCore. Normally, this output is removed once the overlaps are loaded into an overlap store. saveReadCorrections If set, do not remove intermediate outputs. Normally, intermediate files are removed once they are no longer needed. NOT IMPLEMENTED. saveMerCounts If set, do not remove meryl binary databases. Overlapper Configuration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Overlaps are generated for three purposes: read correction, read trimming and unitig construction. The algorithm and parameters used can be set independently for each set of overlaps. Two overlap algorithms are in use.
One, mhap, is typically applied to raw uncorrected reads and returns alignment-free overlaps with imprecise extents. The other, the original overlapper algorithm 'ovl', returns alignments, but is much more expensive. There are three sets of parameters, one for the 'mhap' algorithm, one for the 'ovl' algorithm, and one for the 'minimap' algorithm. Parameters used for a specific type of overlap are set by a prefix on the option: 'cor' for read correction, 'obt' for read trimming ('overlap based trimming') or 'utg' for unitig construction. For example, 'corOverlapper=ovl' would set the overlapper used for read correction to the 'ovl' algorithm. {prefix}Overlapper Specify which overlap algorithm to use: 'mhap', 'ovl' or 'minimap'. The default is to use 'mhap' for 'cor' and 'ovl' for both 'obt' and 'utg'. Overlapper Configuration, ovl Algorithm --------------------------------------- .. _corOvlErrorRate: .. _obtOvlErrorRate: .. _utgOvlErrorRate: .. _ovlErrorRate: {prefix}OvlErrorRate Overlaps above this error rate are not computed. * `corOvlErrorRate` applies to overlaps generated for correcting reads; * `obtOvlErrorRate` applies to overlaps generated for trimming reads; * `utgOvlErrorRate` applies to overlaps generated for assembling reads. These limits apply to the 'ovl' overlap algorithm, and when alignments are computed for mhap overlaps with :ref:`mhapReAlign `. {prefix}OvlFrequentMers Do not seed overlaps with these kmers (fasta format). {prefix}OvlHashBits Width of the kmer hash. Width 22=1gb, 23=2gb, 24=4gb, 25=8gb. Plus 10b per corOvlHashBlockLength. {prefix}OvlHashBlockLength Amount of sequence (bp) to load into the overlap hash table. {prefix}OvlHashLoad Maximum hash table load. If set too high, table lookups are inefficient; if too low, search overhead dominates run time. {prefix}OvlMerDistinct K-mer frequency threshold; the least frequent fraction of distinct mers can seed overlaps. {prefix}OvlMerSize K-mer size for seeds in overlaps.
{prefix}OvlMerThreshold K-mer frequency threshold; mers more frequent than this count are not used to seed overlaps. {prefix}OvlMerTotal K-mer frequency threshold; the least frequent fraction of all mers can seed overlaps. {prefix}OvlRefBlockLength Amount of sequence (bp) to search against the hash table per batch. {prefix}OvlRefBlockSize Number of reads to search against the hash table per batch. Overlapper Configuration, mhap Algorithm ---------------------------------------- {prefix}MhapBlockSize Number of reads per 1GB block. Memory * size is loaded into memory per job. {prefix}MhapMerSize K-mer size for seeds in mhap. .. _mhapReAlign: {prefix}ReAlign Compute actual alignments from mhap overlaps (uses either obtErrorRate or ovlErrorRate, depending on which overlaps are computed). .. _mhapSensitivity: {prefix}MhapSensitivity Coarse sensitivity level: 'low', 'normal' or 'high'. Based on read coverage (which is impacted by genomeSize), 'low' sensitivity is used if coverage is more than 60; 'normal' is used if coverage is between 60 and 30, and 'high' is used for coverages less than 30. Overlapper Configuration, minimap Algorithm ------------------------------------------- {prefix}MMapBlockSize Number of reads per 1GB block. Memory * size is loaded into memory per job. {prefix}MMapMerSize K-mer size for seeds in minimap. Overlap Store ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The overlap algorithms return overlaps in an arbitrary order. The correction, trimming and assembly algorithms usually need to know all overlaps for a single read. The overlap store duplicates each overlap, sorts them by the first ID, and stores them for quick retrieval of all overlaps for a single read. ovsMemory How much memory, in gigabytes, to use for constructing overlap stores. Must be at least 256m or 0.25g. ovsMethod Two construction algorithms are supported. One uses a single data stream, and is faster for small and moderate size assemblies.
The other uses parallel data streams and can be faster (depending on your network disk bandwidth) for moderate and large assemblies. Meryl ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The 'meryl' algorithm counts the occurrences of kmers in the input reads. It outputs a FASTA format list of frequent kmers, and (optionally) a binary database of the counts for each kmer in the input. Meryl can run in (almost) any memory size, by splitting the computation into smaller (or larger) chunks. merylMemory Amount of memory, in gigabytes, to use for counting kmers. merylThreads Number of compute threads to use for kmer counting. Overlap Based Trimming ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. _obtErrorRate: obtErrorRate Stringency of overlaps to use for trimming reads. trimReadsOverlap Minimum overlap between evidence to make contiguous trim. trimReadsCoverage Minimum depth of evidence to retain bases. .. _grid-engine: Grid Engine Support ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Canu directly supports most common grid scheduling systems. Under normal use, Canu will query the system for grid support, configure itself for the machines available in the grid, then submit itself to the grid for execution. The Canu pipeline is a series of about a dozen steps that alternate between embarrassingly parallel computations (e.g., overlap computation) and sequential bookkeeping steps (e.g., checking if all overlap jobs finished). This is entirely managed by Canu. Canu has first class support for the various schedulers derived from Sun Grid Engine (Univa, Son of Grid Engine) and the Simple Linux Utility for Resource Management (SLURM), meaning that the developers have direct access to these systems. Platform Computing's Load Sharing Facility (LSF) and the various schedulers derived from the Portable Batch System (PBS, Torque and PBSPro) are supported as well, but without developer access, bugs do creep in. As of Canu v1.5, support seems stable and working. useGrid Master control.
If 'false', no algorithms will run under grid control. Does not change the value of the other useGrid options. If 'remote', jobs are configured for grid execution, but not submitted. A message, with commands to launch the job, is reported and canu halts execution. Note that the host used to run canu for 'remote' execution must know about the grid, that is, it must be able to submit jobs to the grid. It is also possible to enable/disable grid support for individual algorithms with options such as `useGridBAT`, `useGridCNS`, et cetera. This has been useful in the (far) past to prevent certain algorithms, notably overlap error adjustment, from running too many jobs concurrently and thrashing disk. Recent storage systems seem to be able to handle the load better -- computers have gotten faster quicker than genomes have gotten larger. There are many options for configuring a new grid ('gridEngine*') and for configuring how canu configures its computes to run under grid control ('gridOptions*'). The grid engine to use is specified with the 'gridEngine' option. gridEngine Which grid engine to use. Auto-detected. Possible choices are 'sge', 'pbs', 'pbspro', 'lsf' or 'slurm'. .. _grid-engine-config: Grid Engine Configuration ------------------------- There are many options to configure support for a new grid engine, and we don't describe them fully. If you feel the need to add support for a new engine, please contact us. That said, the file ``src/pipeline/canu/Defaults.pm`` lists a whole slew of parameters that are used to build up grid commands; they all start with ``gridEngine``. For each grid, these parameters are defined in the various ``src/pipeline/Grid_*.pm`` modules. The parameters are used in ``src/pipeline/canu/Execution.pm``. For SGE grids, two options are sometimes necessary to tell canu about peculiarities of your grid: ``gridEngineThreadsOption`` describes how to request multiple cores, and ``gridEngineMemoryOption`` describes how to request memory.
Usually, canu can figure out how to do this, but sometimes it reports an error such as::

  -- WARNING: Couldn't determine the SGE parallel environment to run multi-threaded codes.
  -- Valid choices are (pick one and supply it to canu):
  --   gridEngineThreadsOption="-pe make THREADS"
  --   gridEngineThreadsOption="-pe make-dedicated THREADS"
  --   gridEngineThreadsOption="-pe mpich-rr THREADS"
  --   gridEngineThreadsOption="-pe openmpi-fill THREADS"
  --   gridEngineThreadsOption="-pe smp THREADS"
  --   gridEngineThreadsOption="-pe thread THREADS"

or::

  -- WARNING: Couldn't determine the SGE resource to request memory.
  -- Valid choices are (pick one and supply it to canu):
  --   gridEngineMemoryOption="-l h_vmem=MEMORY"
  --   gridEngineMemoryOption="-l mem_free=MEMORY"

If you get such a message, just add the appropriate line to your canu command line. Both options will replace the uppercase text (THREADS or MEMORY) with the value canu wants when the job is submitted. For ``gridEngineMemoryOption``, any number of ``-l`` options can be supplied; we could use ``gridEngineMemoryOption="-l h_vmem=MEMORY -l mem_free=MEMORY"`` to request both ``h_vmem`` and ``mem_free`` memory.

.. _grid-options:

Grid Options
------------

To run on the grid, each stage needs to be configured - to tell the grid how many cores it will use and how much memory it needs. Some support for this is automagic (for example, overlapInCore and mhap know how to do this), others need to be manually configured. Yes, it's a problem, and yes, we want to fix it.

The gridOptions* parameters supply grid-specific options to the grid submission command.
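
As a concrete illustration, here is a hedged sketch of an invocation that passes a queue request through to every submitted job and tags the jobs with a name suffix; the queue name, accounting string, read file and genome size are placeholders, not canu defaults:

```shell
canu -p asm -d asm \
  genomeSize=4.8m \
  gridOptions="-q big.q -A myaccount" \
  gridOptionsJobName=asm1 \
  -pacbio-raw reads.fastq
```
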

gridOptions
  Grid submission command options applied to all grid jobs

gridOptionsJobName
  Grid submission command jobs name suffix

gridOptionsBAT
  Grid submission command options applied to unitig construction with the bogart algorithm

gridOptionsGFA
  Grid submission command options applied to gfa alignment and processing

gridOptionsCNS
  Grid submission command options applied to unitig consensus jobs

gridOptionsCOR
  Grid submission command options applied to read correction jobs

gridOptionsExecutive
  Grid submission command options applied to master script jobs

gridOptionsOEA
  Grid submission command options applied to overlap error adjustment jobs

gridOptionsRED
  Grid submission command options applied to read error detection jobs

gridOptionsOVB
  Grid submission command options applied to overlap store bucketizing jobs

gridOptionsOVS
  Grid submission command options applied to overlap store sorting jobs

gridOptionsCORMHAP
  Grid submission command options applied to mhap overlaps for correction jobs

gridOptionsCOROVL
  Grid submission command options applied to overlaps for correction jobs

gridOptionsOBTMHAP
  Grid submission command options applied to mhap overlaps for trimming jobs

gridOptionsOBTOVL
  Grid submission command options applied to overlaps for trimming jobs

gridOptionsUTGMHAP
  Grid submission command options applied to mhap overlaps for unitig construction jobs

gridOptionsUTGOVL
  Grid submission command options applied to overlaps for unitig construction jobs

Algorithm Selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Several algorithmic components of canu can be disabled, based on the type of the reads being assembled, the type of processing desired, or the amount of compute resources available.

enableOEA
  Do overlap error adjustment - comprises two steps: read error detection (RED) and overlap error adjustment (OEA).

Algorithm Execution Method
--------------------------

Canu has a fairly sophisticated (or complicated, depending on whether it is working or not) method for dividing large computes, such as read overlapping and consensus, into many smaller pieces and then running those pieces on a grid or in parallel on the local machine. The size of each piece is generally determined by the amount of memory the task is allowed to use, and this memory size -- actually a range of memory sizes -- is set based on the genomeSize parameter, but can be set explicitly by the user. The same holds for the number of processors each task can use. For example, a genomeSize=5m would result in overlaps using between 4gb and 8gb of memory, and between 1 and 8 processors.

Given these requirements, Canu will pick a specific memory size and number of processors so that the maximum number of jobs will run at the same time. In the overlapper example, if we are running on a machine with 32gb memory and 8 processors, it is not possible to run 8 concurrent jobs that each require 8gb memory, but it is possible to run 4 concurrent jobs each using 6gb memory and 2 processors.

To completely specify how Canu runs algorithms, one needs to specify a maximum memory size, a maximum number of processors, and how many pieces to run at one time. Users can set these manually through the {prefix}Memory, {prefix}Threads and {prefix}Concurrency options. If they are not set, defaults are chosen based on genomeSize.

{prefix}Concurrency
  Set the number of tasks that can run at the same time, when running without grid support.

{prefix}Threads
  Set the number of compute threads used per task.

{prefix}Memory
  Set the amount of memory, in gigabytes, to use for each job in a task.

Available prefixes are:

+--------+-----------+----------------------------------------+
| Prefix | Algorithm |                                        |
+========+===========+========================================+
| cor    |           | Overlap generation using the           |
+--------+           | 'mhap' algorithm for                   |
| obt    | mhap      | 'cor'=correction, 'obt'=trimming       |
+--------+           | or 'utg'=assembly.                     |
| utg    |           |                                        |
+--------+-----------+----------------------------------------+
| cor    |           | Overlap generation using the           |
+--------+           | 'minimap' algorithm for                |
| obt    | mmap      | 'cor'=correction, 'obt'=trimming       |
+--------+           | or 'utg'=assembly.                     |
| utg    |           |                                        |
+--------+-----------+----------------------------------------+
| cor    |           | Overlap generation using the           |
+--------+           | 'overlapInCore' algorithm for          |
| obt    | ovl       | 'cor'=correction, 'obt'=trimming       |
+--------+           | or 'utg'=assembly.                     |
| utg    |           |                                        |
+--------+-----------+----------------------------------------+
|        | ovb       | Parallel overlap store bucketizing     |
+--------+-----------+----------------------------------------+
|        | ovs       | Parallel overlap store bucket sorting  |
+--------+-----------+----------------------------------------+
|        | cor       | Read correction                        |
+--------+-----------+----------------------------------------+
|        | red       | Error detection in reads               |
+--------+-----------+----------------------------------------+
|        | oea       | Error adjustment in overlaps           |
+--------+-----------+----------------------------------------+
|        | bat       | Unitig/contig construction             |
+--------+-----------+----------------------------------------+
|        | cns       | Unitig/contig consensus                |
+--------+-----------+----------------------------------------+

For example, 'mhapMemory' would set the memory limit for computing overlaps with the mhap algorithm; 'cormhapMemory' would set the memory limit only when mhap is used for generating overlaps used for correction.
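
The job-packing rule described above -- pick a per-job memory and thread count, then run as many jobs as the tighter of the memory and processor limits allows -- can be sketched with plain shell arithmetic. This is an illustration of the idea, not canu's code; the 32gb/8-processor figures echo the overlapper example:

```shell
# Machine limits from the overlapper example above (illustrative values).
MAX_MEM=32   # gigabytes available on the machine
MAX_CPU=8    # processors available on the machine

# A candidate per-job configuration.
JOB_MEM=6    # gigabytes per job
JOB_CPU=2    # threads per job

# Concurrency is bounded by whichever resource runs out first.
BY_MEM=$(( MAX_MEM / JOB_MEM ))
BY_CPU=$(( MAX_CPU / JOB_CPU ))

if [ "$BY_MEM" -lt "$BY_CPU" ]; then
  CONCURRENCY=$BY_MEM
else
  CONCURRENCY=$BY_CPU
fi

echo "$CONCURRENCY jobs of ${JOB_MEM}gb / ${JOB_CPU} threads"
```

With these numbers the memory bound would allow five jobs but the processor bound only four, matching the "4 concurrent jobs each using 6gb memory and 2 processors" outcome above.
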

The 'minMemory', 'maxMemory', 'minThreads' and 'maxThreads' options will apply to all jobs, and can be used to artificially limit canu to a portion of the current machine. In the overlapper example above, setting maxThreads=4 would result in two concurrent jobs instead of four.

Overlap Error Adjustment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

red = Read Error Detection

oea = Overlap Error Adjustment

oeaBatchLength
  Number of bases per overlap error correction batch

oeaBatchSize
  Number of reads per overlap error correction batch

redBatchLength
  Number of bases per fragment error detection batch

redBatchSize
  Number of reads per fragment error detection batch

Unitigger
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

unitigger
  Which unitig construction algorithm to use. Only "bogart" is supported.

.. _utgErrorRate:

utgErrorRate
  Stringency of overlaps used for constructing contigs. The `bogart` algorithm uses the distribution of overlap error rates to filter high error overlaps; `bogart` will never see overlaps with error higher than this parameter.

batOptions
  Advanced options to bogart

Consensus Partitioning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

STILL DONE BY UNITIGGER, NEED TO MOVE OUTSIDE

cnsConsensus
  Which algorithm to use for computing consensus sequences. Only 'utgcns' is supported.

cnsPartitions
  Compute consensus by splitting the tigs into N partitions.

cnsPartitionMin
  Don't make a partition with fewer than N reads

cnsMaxCoverage
  Limit unitig consensus to at most this coverage.

.. _cnsErrorRate:

cnsErrorRate
  Inform the consensus generation algorithm of the amount of difference it should expect in a read-to-read alignment. Typically set to :ref:`utgOvlErrorRate `. If set too high, reads could be placed in an incorrect location, leading to errors in the consensus sequence. If set too low, reads could be omitted from the consensus graph (or multialignment, depending on algorithm), resulting in truncated consensus sequences.

.. _correction:

Read Correction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The first step in Canu is to find high-error overlaps and generate corrected sequences for subsequent assembly. This is currently the fastest step in Canu. By default, only the longest 40X of data (based on the specified genome size) is used for correction. Typically, some reads are trimmed during correction due to being chimeric or having erroneous sequence, resulting in a loss of 20-25% (30X output). You can force correction to be non-lossy by setting `corMinCoverage=0`, in which case the corrected reads output will be the same length as the input data, keeping any high-error unsupported bases. Canu will trim these in downstream steps before assembly.

If you have a dataset with uneven coverage or small plasmids, correcting the longest 40X may not give you sufficient coverage of your genome/plasmid. In these cases, you can set `corOutCoverage=999`, or any value greater than your total input coverage, which will correct and assemble all input data, at the expense of runtime.

corErrorRate
  Do not use overlaps with error rate higher than this (estimated error rate for `mhap` and `minimap` overlaps).

corConsensus
  Which algorithm to use for computing read consensus sequences. Only 'falcon' and 'falconpipe' are supported.

corPartitions
  Partition read correction into N jobs

corPartitionMin
  Don't make a read correction partition with fewer than N reads

corMinEvidenceLength
  Limit read correction to only overlaps longer than this; default: unlimited

corMinCoverage
  Limit read correction to regions with at least this minimum coverage. Split reads when coverage drops below threshold.

corMaxEvidenceErate
  Limit read correction to only overlaps at or below this fraction error; default: unlimited

corMaxEvidenceCoverageGlobal
  Limit reads used for correction to supporting at most this coverage; default: 1.0 * estimated coverage

corMaxEvidenceCoverageLocal
  Limit reads being corrected to at most this much evidence coverage; default: 10 * estimated coverage

.. _corOutCoverage:

corOutCoverage
  Only correct the longest reads up to this coverage; default 40

corFilter
  Method to filter short reads from correction; 'quick' or 'expensive' or 'none'

Output Filtering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _contigFilter:

contigFilter
  Remove spurious assemblies from consideration. Any contig that meets any of the following conditions is flagged as 'unassembled' and removed from further consideration:

  - fewer than minReads reads
  - shorter than minLength bases
  - a single read covers more than singleReadSpan fraction of the contig
  - more than lowCovSpan fraction of the contig is at coverage below lowCovDepth

  This filtering is done immediately after initial contigs are formed, before repeat detection. Initial contigs that span a repeat can be split into multiple contigs; none of these new contigs will be 'unassembled', even if they are a single read.

canu-1.6/documentation/source/pipeline.rst000066400000000000000000000010111314437614700210210ustar00rootroot00000000000000

.. _pipeline:

Canu Pipeline
=============

The pipeline is described in Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM. `Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation `_. bioRxiv. (2016).

Figure 1 of the paper shows the primary pipeline (below, top) and the supplement contains the sub-pipeline for building read and overlap databases (below, bottom).

.. image:: canu-pipeline.*

.. image:: canu-overlaps.*

canu-1.6/documentation/source/quick-start.rst000066400000000000000000000155141314437614700215000ustar00rootroot00000000000000

.. _quickstart:

Canu Quick Start
================

Canu specializes in assembling PacBio or Oxford Nanopore sequences. Canu operates in three phases: correction, trimming and assembly. The correction phase will improve the accuracy of bases in reads. The trimming phase will trim reads to the portion that appears to be high-quality sequence, removing suspicious regions such as remaining SMRTbell adapter. The assembly phase will order the reads into contigs, generate consensus sequences and create graphs of alternate paths.

For eukaryotic genomes, coverage of more than 20x is enough to outperform current hybrid methods; however, between 30x and 60x coverage is the recommended minimum. More coverage will let Canu use longer reads for assembly, which will result in better assemblies.

Input sequences can be FASTA or FASTQ format, uncompressed or compressed with gzip (.gz), bzip2 (.bz2) or xz (.xz). Note that zip files (.zip) are not supported.

Canu can resume incomplete assemblies, allowing for recovery from system outages or other abnormal terminations.

Canu will auto-detect computational resources and scale itself to fit, using all of the resources that are available and reasonable for the size of your assembly. Memory and processors can be explicitly limited with parameters :ref:`maxMemory` and :ref:`maxThreads`. See section :ref:`execution` for more details.

Canu will automatically take full advantage of any LSF/PBS/PBSPro/Torque/Slurm/SGE grid available, even submitting itself for execution. Canu makes heavy use of array jobs and requires job submission from compute nodes, which are sometimes not available or allowed. Canu option ``useGrid=false`` will restrict Canu to using only the current machine, while option ``useGrid=remote`` will configure Canu for grid execution but not submit jobs to the grid. See section :ref:`execution` for more details.

The :ref:`tutorial` has more background, and the :ref:`faq` has a wealth of practical advice.
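
For instance, a sketch of a purely local run on the current machine (the read file and genome size are placeholders):

```shell
canu -p asm -d asm-local \
  genomeSize=4.8m \
  useGrid=false \
  -pacbio-raw reads.fastq
```

Swapping ``useGrid=false`` for ``useGrid=remote`` would instead report the grid submission commands and halt, letting you submit the jobs yourself.
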

Assembling PacBio or Nanopore data
----------------------------------

Pacific Biosciences released P6-C4 chemistry reads for Escherichia coli K12. You can `download them from their original release `_, but note that you must have the `SMRTpipe software `_ installed to extract the reads as FASTQ. Instead, use a `FASTQ format 25X subset `_ (223MB). Download from the command line with::

  curl -L -o pacbio.fastq http://gembox.cbcb.umd.edu/mhap/raw/ecoli_p6_25x.filtered.fastq

There doesn't appear to be any "official" Oxford Nanopore sample data, but the `Loman Lab `_ released a `set of runs `_, also for Escherichia coli K12. This is early data, from September 2015. Any of the four runs will work; we picked `MAP-006-1 `_ (243 MB). Download from the command line with::

  curl -L -o oxford.fasta http://nanopore.s3.climb.ac.uk/MAP006-PCR-1_2D_pass.fasta

By default, Canu will correct the reads, then trim the reads, then assemble the reads to unitigs. Canu needs to know the approximate genome size (so it can determine coverage in the input reads) and the technology used to generate the reads.

For PacBio::

  canu \
   -p ecoli -d ecoli-pacbio \
   genomeSize=4.8m \
   -pacbio-raw pacbio.fastq

For Nanopore::

  canu \
   -p ecoli -d ecoli-oxford \
   genomeSize=4.8m \
   -nanopore-raw oxford.fasta

Output and intermediate files will be in directories 'ecoli-pacbio' and 'ecoli-nanopore', respectively. Intermediate files are written in directories 'correction', 'trimming' and 'unitigging' for the respective stages. Output files are named using the '-p' prefix, such as 'ecoli.contigs.fasta', 'ecoli.contigs.gfa', etc. See section :ref:`outputs` for more details on outputs (intermediate files aren't documented).

Assembling With Multiple Technologies and Multiple Files
--------------------------------------------------------

Canu can use reads from any number of input files, which can be a mix of formats and technologies.

We'll assemble a mix of 10X PacBio reads in two FASTQ files and 10X of Nanopore reads in one FASTA file::

  curl -L -o mix.tar.gz http://gembox.cbcb.umd.edu/mhap/raw/ecoliP6Oxford.tar.gz
  tar xvzf mix.tar.gz

  canu \
   -p ecoli -d ecoli-mix \
   genomeSize=4.8m \
   -pacbio-raw pacbio.part?.fastq.gz \
   -nanopore-raw oxford.fasta.gz

Correct, Trim and Assemble, Manually
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sometimes, however, it makes sense to do the three top-level tasks by hand. This would allow trying multiple unitig construction parameters on the same set of corrected and trimmed reads, or skipping trimming and assembly if you only want corrected reads.

We'll use the PacBio reads from above. First, correct the raw reads::

  canu -correct \
   -p ecoli -d ecoli \
   genomeSize=4.8m \
   -pacbio-raw pacbio.fastq

Then, trim the output of the correction::

  canu -trim \
   -p ecoli -d ecoli \
   genomeSize=4.8m \
   -pacbio-corrected ecoli/ecoli.correctedReads.fasta.gz

And finally, assemble the output of trimming, twice, with different stringency on which overlaps to use (see :ref:`correctedErrorRate `)::

  canu -assemble \
   -p ecoli -d ecoli-erate-0.039 \
   genomeSize=4.8m \
   correctedErrorRate=0.039 \
   -pacbio-corrected ecoli/ecoli.trimmedReads.fasta.gz

  canu -assemble \
   -p ecoli -d ecoli-erate-0.075 \
   genomeSize=4.8m \
   correctedErrorRate=0.075 \
   -pacbio-corrected ecoli/ecoli.trimmedReads.fasta.gz

Note that the assembly stages use different '-d' directories. It is not possible to run multiple copies of canu with the same work directory.

Assembling Low Coverage Datasets
----------------------------------

We claimed Canu works down to 20X coverage, and we will now assemble `a 20X subset of S. cerevisiae `_ (215 MB).

When assembling, we adjust :ref:`correctedErrorRate ` to accommodate the slightly lower quality corrected reads::

  curl -L -o yeast.20x.fastq.gz http://gembox.cbcb.umd.edu/mhap/raw/yeast_filtered.20x.fastq.gz

  canu \
   -p asm -d yeast \
   genomeSize=12.1m \
   correctedErrorRate=0.075 \
   -pacbio-raw yeast.20x.fastq.gz

Consensus Accuracy
-------------------

Canu consensus sequences are typically well above 99% identity. Accuracy can be improved by polishing the contigs with tools developed specifically for that task. We recommend `Quiver `_ for PacBio and `Nanopolish `_ for Oxford Nanopore data. When Illumina reads are available, `Pilon `_ can be used to polish either PacBio or Oxford Nanopore assemblies.

canu-1.6/documentation/source/repeat-spanned.svg000066400000000000000000000402611314437614700221230ustar00rootroot00000000000000 image/svg+xml canu-1.6/documentation/source/repeat-unspanned.svg000066400000000000000000000444451314437614700224750ustar00rootroot00000000000000 image/svg+xml canu-1.6/documentation/source/tutorial.rst000066400000000000000000000552241314437614700210760ustar00rootroot00000000000000

.. _celera-assembler: `Celera Assembler `

.. _tutorial:

Canu Tutorial
=============

Canu assembles reads from PacBio RS II or Oxford Nanopore MinION instruments into uniquely-assemblable contigs, unitigs. Canu owes lots of its design and code to `celera-assembler `_.

Canu can be run using hardware of nearly any shape or size, anywhere from laptops to computational grids with thousands of nodes. Obviously, larger assemblies will take a long time to compute on laptops, and smaller assemblies can't take advantage of hundreds of nodes, so what is being assembled plays some part in determining what hardware can be effectively used.

Most algorithms in canu have been multi-threaded (to use all the cores on a single node), parallelized (to use all the nodes in a grid), or both (all the cores on all the nodes).

.. _canu-command:

Canu, the command
~~~~~~~~~~~~~~~~~~~~~~

The **canu** command is the 'executive' program that runs all modules of the assembler. It oversees each of the three top-level tasks (correction, trimming, unitig construction), each of which consists of many steps. Canu ensures that input files for each step exist, that each step successfully finished, and that the output for each step exists. It does minor bits of processing, such as reformatting files, but generally just executes other programs.

::

  canu [-correct | -trim | -assemble | -trim-assemble] \
    [-s ] \
    -p \
    -d \
    genomeSize=[g|m|k] \
    [other-options] \
    [-pacbio-raw | -pacbio-corrected | -nanopore-raw | -nanopore-corrected] *fastq

The -p option, to set the file name prefix of intermediate and output files, is mandatory. If -d is not supplied, canu will run in the current directory, otherwise, Canu will create the `assembly-directory` and run in that directory. It is _not_ possible to run two different assemblies in the same directory.

The -s option will import a list of parameters from the supplied specification ('spec') file. These parameters will be applied before any from the command line are used, providing a method for setting commonly used parameters, but overriding them for specific assemblies.

By default, all three top-level tasks are performed. It is possible to run exactly one task by using the -correct, -trim or -assemble options. These options can be useful if you want to correct reads once and try many different assemblies. We do exactly that in the :ref:`quickstart`. Additionally, supplying pre-corrected reads with -pacbio-corrected or -nanopore-corrected will run only the trimming (-trim) and assembling (-assemble) stages.

Parameters are key=value pairs that configure the assembler.
They set run time parameters (e.g., memory, threads, grid), algorithmic parameters (e.g., error rates, trimming aggressiveness), and enable or disable entire processing steps (e.g., don't correct errors, don't search for subreads). They are described later. One parameter is required: the genomeSize (in bases, with common SI prefixes allowed, for example, 4.7m or 2.8g; see :ref:`genomeSize`). Parameters are listed in the :ref:`parameter-reference`, but the common ones are described in this document.

Reads are supplied to canu by options that describe how the reads were generated, and what level of quality they are, for example, -pacbio-raw indicates the reads were generated on a PacBio RS II instrument, and have had no processing done to them. Each file of reads supplied this way becomes a 'library' of reads. The reads should have been (physically) generated all at the same time using the same steps, but perhaps sequenced in multiple batches. In canu, each library has a set of options setting various algorithmic parameters, for example, how aggressively to trim. To explicitly set library parameters, a text 'gkp' file describing the library and the input files must be created. Don't worry too much about this yet, it's an advanced feature, fully described in Section :ref:`gkp-files`.

The read-files contain sequence data in either FASTA or FASTQ format (or both! A quirk of the implementation allows files that contain both FASTA and FASTQ format reads). The files can be uncompressed, gzip, bzip2 or xz compressed. We've found that "gzip -1" provides good compression that is fast to both compress and decompress. For 'archival' purposes, we use "xz -9".

.. _canu-pipeline:

Canu, the pipeline
~~~~~~~~~~~~~~~~~~~~~~

The canu pipeline, that is, what it actually computes, comprises computing overlaps and processing the overlaps to some result.
Each of the three tasks (read correction, read trimming and unitig construction) follows the same pattern:

* Load reads into the read database, gkpStore.
* Compute k-mer counts in preparation for the overlap computation.
* Compute overlaps.
* Load overlaps into the overlap database, ovlStore.
* Do something interesting with the reads and overlaps.

  * The read correction task will replace the original noisy read sequences with consensus sequences computed from overlapping reads.
  * The read trimming task will use overlapping reads to decide what regions of each read are high-quality sequence, and what regions should be trimmed. After trimming, the single largest high-quality chunk of sequence is retained.
  * The unitig construction task finds sets of overlaps that are consistent, and uses those to place reads into a multialignment layout. The layout is then used to generate a consensus sequence for the unitig.

.. _module-tags:

Module Tags
~~~~~~~~~~~~~~~~~~~~~~

Because the three tasks share common algorithms (all compute overlaps, two compute consensus sequences, etc), parameters are differentiated by a short prefix 'tag' string. This lets canu have one generic parameter that can be set to different values for each stage in each task. For example, "corOvlMemory" will set memory usage for overlaps being generated for read correction; "obtOvlMemory" for overlaps generated for Overlap Based Trimming; "utgOvlMemory" for overlaps generated for unitig construction.

The tags are:

+--------+-------------------------------------------------------------------+
|Tag     | Usage                                                             |
+========+===================================================================+
|master  | the canu script itself, and any components that it runs directly  |
+--------+-------------------------------------------------------------------+
+--------+-------------------------------------------------------------------+
|cns     | unitig consensus generation                                       |
+--------+-------------------------------------------------------------------+
|cor     | read correction generation                                        |
+--------+-------------------------------------------------------------------+
+--------+-------------------------------------------------------------------+
|red     | read error detection                                              |
+--------+-------------------------------------------------------------------+
|oea     | overlap error adjustment                                          |
+--------+-------------------------------------------------------------------+
+--------+-------------------------------------------------------------------+
|ovl     | the standard overlapper                                           |
+--------+-------------------------------------------------------------------+
|corovl  | the standard overlapper, as used in the correction phase          |
+--------+-------------------------------------------------------------------+
|obtovl  | the standard overlapper, as used in the trimming phase            |
+--------+-------------------------------------------------------------------+
|utgovl  | the standard overlapper, as used in the assembly phase            |
+--------+-------------------------------------------------------------------+
+--------+-------------------------------------------------------------------+
|mhap    | the mhap overlapper                                               |
+--------+-------------------------------------------------------------------+
|cormhap | the mhap overlapper, as used in the correction phase              |
+--------+-------------------------------------------------------------------+
|obtmhap | the mhap overlapper, as used in the trimming phase                |
+--------+-------------------------------------------------------------------+
|utgmhap | the mhap overlapper, as used in the assembly phase                |
+--------+-------------------------------------------------------------------+
+--------+-------------------------------------------------------------------+
|mmap    | the `minimap `_ overlapper                                        |
+--------+-------------------------------------------------------------------+
|cormmap | the minimap overlapper, as used in the correction phase           |
+--------+-------------------------------------------------------------------+
|obtmmap | the minimap overlapper, as used in the trimming phase             |
+--------+-------------------------------------------------------------------+
|utgmmap | the minimap overlapper, as used in the assembly phase             |
+--------+-------------------------------------------------------------------+
+--------+-------------------------------------------------------------------+
|ovb     | the bucketizing phase of overlap store building                   |
+--------+-------------------------------------------------------------------+
|ovs     | the sort phase of overlap store building                          |
+--------+-------------------------------------------------------------------+

We'll get to the details eventually.

.. _execution:

Execution Configuration
~~~~~~~~~~~~~~~~~~~~~~~~

There are two modes that canu runs in: locally, using just one machine, or grid-enabled, using multiple hosts managed by a grid engine. LSF, PBS/Torque, PBSPro, Sun Grid Engine (and derivations), and Slurm are supported, though LSF has limited testing. Section :ref:`grid-engine-config` has a few hints on how to set up a new grid engine.

By default, if a grid is detected the canu pipeline will immediately submit itself to the grid and run entirely under grid control. If no grid is detected, or if option ``useGrid=false`` is set, canu will run on the local machine.

In both cases, Canu will auto-detect available resources and configure job sizes based on the resources and genome size you're assembling. Thus, most users should be able to run the command without modifying the defaults. Some advanced options are outlined below.

Each stage has the same five configuration options, and tags are used to specialize the option to a specific stage. The options are:

useGrid=boolean
  Run this stage on the grid, usually in parallel.

gridOptions=string
  Supply this string to the grid submit command.

Memory=integer
  Use this many gigabytes of memory, per process.

Threads
  Use this many compute threads per process.

Concurrency
  If not on the grid, run this many jobs at the same time.

Global grid options, applied to every job submitted to the grid, can be set with 'gridOptions'. This can be used to add accounting information or access credentials.

A name can be associated with this compute using 'gridOptionsJobName'. Canu will work just fine with no name set, but if multiple canu assemblies are running at the same time, they will tend to wait for each other's jobs to finish. For example, if two assemblies are running, at some point both will have overlap jobs running. Each assembly will be waiting for all jobs named 'ovl_asm' to finish. Had the assemblies specified job names, gridOptionsJobName=apple and gridOptionsJobName=orange, then one would be waiting for jobs named 'ovl_asm_apple', and the other would be waiting for jobs named 'ovl_asm_orange'.

.. _error-rates:

Error Rates
~~~~~~~~~~~~~~~~~~~~~~

Canu expects all error rates to be reported as fraction error, not as percent error. We're not sure exactly why this is so. Previously, it used a mix of fraction error and percent error (or both!), and was a little confusing. Here's a handy table you can print out that converts between fraction error and percent error.

Not all values are shown (it'd be quite a large table) but we have every confidence you can figure out the missing values:

============== =============
Fraction Error Percent Error
============== =============
0.01           1%
0.02           2%
0.03           3%
.              .
.              .
0.12           12%
.              .
.              .
============== =============

Canu error rates always refer to the percent difference in an alignment of two reads, not the percent error in a single read, and not the amount of variation in your reads. These error rates are used in two different ways: they are used to limit what overlaps are generated, e.g., don't compute overlaps that have more than 5% difference; and they are used to tell algorithms what overlaps to use, e.g., even though overlaps were computed to 5% difference, don't trust any above 3% difference.

There are seven error rates. Three error rates control overlap creation (:ref:`corOvlErrorRate `, :ref:`obtOvlErrorRate ` and :ref:`utgOvlErrorRate `), and four error rates control algorithms (:ref:`corErrorRate `, :ref:`obtErrorRate `, :ref:`utgErrorRate `, :ref:`cnsErrorRate `).

The three error rates for overlap creation apply to the `ovl` overlap algorithm and the :ref:`mhapReAlign ` option used to generate alignments from `mhap` or `minimap` overlaps. Since `mhap` is used for generating correction overlaps, the :ref:`corOvlErrorRate ` parameter is not used by default. Overlaps for trimming and assembling use the `ovl` algorithm, therefore, :ref:`obtOvlErrorRate ` and :ref:`utgOvlErrorRate ` are used.

The four algorithm error rates are used to select which overlaps can be used for correcting reads (:ref:`corErrorRate `); which overlaps can be used for trimming reads (:ref:`obtErrorRate `); which overlaps can be used for assembling reads (:ref:`utgErrorRate `). The last error rate, :ref:`cnsErrorRate `, tells the consensus algorithm to not trust read alignments above that value.
For convenience, two meta options set the error rates used with uncorrected reads (:ref:`rawErrorRate <rawErrorRate>`) or used with corrected reads (:ref:`correctedErrorRate <correctedErrorRate>`). The default depends on the type of read being assembled.

================== ====== ========
Parameter          PacBio Nanopore
================== ====== ========
rawErrorRate       0.300  0.500
correctedErrorRate 0.045  0.144
================== ====== ========

In practice, only :ref:`correctedErrorRate <correctedErrorRate>` is usually changed. The :ref:`faq` has :ref:`specific suggestions ` on when to change this.

Canu v1.4 and earlier used the :ref:`errorRate <errorRate>` parameter, which set the expected rate of error in a single corrected read.

.. _minimum-lengths:

Minimum Lengths
~~~~~~~~~~~~~~~~~~~~~~

Two minimum sizes are known:

minReadLength
    Discard reads shorter than this when loading into the assembler, and when trimming reads.
minOverlapLength
    Do not save overlaps shorter than this.

Overlap configuration
~~~~~~~~~~~~~~~~~~~~~~

The largest compute of the assembler is also the most complicated to configure. As shown in the 'module tags' section, there are up to eight (!) different overlapper configurations. For each overlapper ('ovl' or 'mhap') there is a global configuration, and three specializations that apply to each stage in the pipeline (correction, trimming or assembly).

Like with 'grid configuration', overlap configuration uses a 'tag' prefix applied to each option. The tags in this instance are 'cor', 'obt' and 'utg'.

For example:

- To change the k-mer size for all instances of the ovl overlapper, 'merSize=23' would be used.
- To change the k-mer size for just the ovl overlapper used during correction, 'corMerSize=16' would be used.
- To change the mhap k-mer size for all instances, 'mhapMerSize=18' would be used.
- To change the mhap k-mer size just during correction, 'corMhapMerSize=15' would be used.
- To use minimap for overlap computation just during correction, 'corOverlapper=minimap' would be used.
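The tag mechanism is a most-specific-first lookup: a stage-tagged option, if set, overrides the untagged global one. A minimal sketch in Python, assuming a flat dictionary of parsed options (the helper `resolve` and its fallback rule are an illustration of the idea, not Canu's actual option handling, which lives in its Perl pipeline):

```python
def resolve(options, tag, name):
    """Return the value of a tag-specialized option, falling back to the
    untagged (global) option.  E.g. resolve(opts, 'cor', 'MerSize')
    prefers corMerSize over merSize."""
    tagged = tag + name                   # e.g. 'corMerSize'
    plain  = name[0].lower() + name[1:]   # e.g. 'merSize'
    if tagged in options:
        return options[tagged]
    return options.get(plain)

opts = {"merSize": 23, "corMerSize": 16}

resolve(opts, "cor", "MerSize")   # correction-specific value wins: 16
resolve(opts, "obt", "MerSize")   # falls back to the global merSize: 23
```

The same lookup applies to the mhap options, e.g. 'corMhapMerSize' over 'mhapMerSize'.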
Ovl Overlapper Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Overlapper
    Select the overlap algorithm to use, 'ovl' or 'mhap'.

Ovl Overlapper Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~

ovlHashBlockLength
    how many bases (in reads) to include in the hash table; directly controls process size
ovlRefBlockSize
    how many reads to compute overlaps for in one process; directly controls process time
ovlRefBlockLength
    same, but use 'bases in reads' instead of 'number of reads'
ovlHashBits
    size of the hash table (SHOULD BE REMOVED AND COMPUTED, MAYBE TWO PASS)
ovlHashLoad
    how much to fill the hash table before computing overlaps (SHOULD BE REMOVED)
ovlMerSize
    size of kmer seed; smaller - more sensitive, but slower

The overlapper will not use frequent kmers to seed overlaps. These are computed by the 'meryl' program, and can be selected in one of three ways.

Terminology. A k-mer is a contiguous sequence of k bases. The read 'ACTTA' has two 4-mers: ACTT and CTTA. To account for reverse-complement sequence, a 'canonical kmer' is the lexicographically smaller of the forward and reverse-complemented kmer sequence. Kmer ACTT, with reverse complement AAGT, has a canonical kmer AAGT. Kmer CTTA, reverse-complement TAAG, has canonical kmer CTTA.

A 'distinct' kmer is the kmer sequence with no count associated with it. A 'total' kmer (for lack of a better term) is the kmer with its count. The sequence TCGTTTTTTTCGTCG has 12 'total' 4-mers and 8 'distinct' kmers.

::

  TCGTTTTTTTCGTCG   count
  TCGT              2     distinct-1
  CGTT              1     distinct-2
  GTTT              1     distinct-3
  TTTT              4     distinct-4
  TTTT              4     copy of distinct-4
  TTTT              4     copy of distinct-4
  TTTT              4     copy of distinct-4
  TTTC              1     distinct-5
  TTCG              1     distinct-6
  TCGT              2     copy of distinct-1
  CGTC              1     distinct-7
  GTCG              1     distinct-8

MerThreshold
    any kmer with count higher than N is not used
MerDistinct
    pick a threshold so as to seed overlaps using this fraction of all distinct kmers in the input. In the example above, fraction 0.875 of the k-mers (7/8) will be at or below threshold 2.
MerTotal
    pick a threshold so as to seed overlaps using this fraction of all kmers in the input. In the example above, fraction 0.667 of the k-mers (8/12) will be at or below threshold 2.
FrequentMers
    don't compute frequent kmers, use those listed in this fasta file

Mhap Overlapper Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~

MhapBlockSize
    Chunk of reads that can fit into 1GB of memory. Combined with memory to compute the size of chunk the reads are split into.
MhapMerSize
    Use k-mers of this size for detecting overlaps.
ReAlign
    After computing overlaps with mhap, compute a sequence alignment for each overlap.
MhapSensitivity
    Either 'normal', 'high', or 'fast'.

Mhap will also down-weight frequent kmers (using tf-idf), but its selection of frequent kmers is not exposed.

Minimap Overlapper Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

MMapBlockSize
    Chunk of reads that can fit into 1GB of memory. Combined with memory to compute the size of chunk the reads are split into.
MMapMerSize
    Use k-mers of this size for detecting overlaps.

Minimap will also ignore high-frequency minimizers, but its selection of frequent minimizers is not exposed.

.. _outputs:

Outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~

As Canu runs, it outputs status messages, execution logs, and some analysis to the console. Most of the analysis is captured in ``.report`` as well.

LOGGING

.report
    Most of the analysis reported during assembly.

READS

.correctedReads.fasta.gz
    The reads after correction.
.trimmedReads.fasta.gz
    The corrected reads after overlap based trimming.

SEQUENCE

.contigs.fasta
    Everything which could be assembled and is part of the primary assembly, including both unique and repetitive elements.
.unitigs.fasta
    Contigs, split at alternate paths in the graph.
.unassembled.fasta
    Reads and low-coverage contigs which could not be incorporated into the primary assembly.
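The 'distinct' and 'total' kmer accounting used by the MerDistinct and MerTotal thresholds above can be checked with a short script. This is an illustrative sketch, not Canu/meryl code: it counts forward kmers only, matching the worked table, and `canonical` demonstrates the reverse-complement rule from the terminology paragraph.

```python
# Reproduces the worked 4-mer example (TCGTTTTTTTCGTCG) from the
# Ovl overlapper section.

def kmers(seq, k):
    """All forward k-mers of seq, in order."""
    return [seq[i:i+k] for i in range(len(seq) - k + 1)]

def canonical(kmer):
    """Lexicographically smaller of a kmer and its reverse complement."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    rc = "".join(comp[b] for b in reversed(kmer))
    return min(kmer, rc)

mers = kmers("TCGTTTTTTTCGTCG", 4)
counts = {}
for m in mers:
    counts[m] = counts.get(m, 0) + 1

total    = len(mers)      # 12 'total' kmers
distinct = len(counts)    # 8 'distinct' kmers

# fraction of kmers at or below threshold 2, as MerDistinct / MerTotal see it:
thr = 2
frac_distinct = sum(1 for c in counts.values() if c <= thr) / distinct  # 7/8
frac_total    = sum(c for c in counts.values() if c <= thr) / total     # 8/12
```

Running this reproduces the 0.875 and 0.667 fractions quoted in the text.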
The header line for each sequence provides some metadata on the sequence.::

  >tig######## len=<integer> reads=<integer> covStat=<float> gappedBases=<yes|no> class=<contig|bubble|unassm> suggestRepeat=<yes|no> suggestCircular=<yes|no>

len
    Length of the sequence, in bp.
reads
    Number of reads used to form the contig.
covStat
    The log of the ratio of the contig being unique versus being two-copy, based on the read arrival rate. Positive values indicate more likely to be unique, while negative values indicate more likely to be repetitive. See `Footnote 24 `_ in `Myers et al., A Whole-Genome Assembly of Drosophila `_.
gappedBases
    If yes, the sequence includes all gaps in the multialignment.
class
    Type of sequence. Unassembled sequences are primarily low-coverage sequences spanned by a single read.
suggestRepeat
    If yes, sequence was detected as a repeat based on graph topology or read overlaps to other sequences.
suggestCircular
    If yes, sequence is likely circular. Not implemented.

GRAPHS

.contigs.gfa
    Unused or ambiguous edges between contig sequences. Bubble edges cannot be represented in this format.
.unitigs.gfa
    Contigs split at bubble intersections.
.unitigs.bed
    The position of each unitig in a contig.

METADATA

The layout provides information on where each read ended up in the final assembly, including contig and positions. It also includes the consensus sequence for each contig.

.contigs.layout, .unitigs.layout
    (undocumented)
.contigs.layout.readToTig, .unitigs.layout.readToTig
    The position of each read in a contig (unitig).
.contigs.layout.tigInfo, .unitigs.layout.tigInfo
    A list of the contigs (unitigs), lengths, coverage, number of reads and other metadata. Essentially the same information provided in the FASTA header line.
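A header of this form can be pulled apart with a few lines of Python. The field layout follows the description above; the sample values in `hdr` are invented for illustration:

```python
def parse_tig_header(line):
    """Parse a Canu contig FASTA header of the form
    '>tig00000001 len=... reads=... ...' into a dict of strings."""
    name, *fields = line.lstrip(">").split()
    info = {"name": name}
    for field in fields:
        key, _, value = field.partition("=")
        info[key] = value
    return info

# hypothetical example header, values invented for illustration
hdr = (">tig00000001 len=41835 reads=8 covStat=131.60 gappedBases=no "
       "class=contig suggestRepeat=no suggestCircular=no")
info = parse_tig_header(hdr)
```

All values come back as strings; numeric fields such as `len` and `covStat` would need an explicit `int()`/`float()` conversion.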
canu-1.6/src/000077500000000000000000000000001314437614700131075ustar00rootroot00000000000000canu-1.6/src/AS_UTL/000077500000000000000000000000001314437614700141365ustar00rootroot00000000000000canu-1.6/src/AS_UTL/AS_UTL_alloc.C000066400000000000000000000043431314437614700164470ustar00rootroot00000000000000
)­ÿ‡ë¼¯Ód0Œi­ƒÈ±o^ÖÞœö3­wêÿ­R%¦ÿØ?þâüÿáúŸÿÙcanu-1.6/src/000077500000000000000000000000001314437614700131075ustar00rootroot00000000000000canu-1.6/src/AS_UTL/000077500000000000000000000000001314437614700141365ustar00rootroot00000000000000canu-1.6/src/AS_UTL/AS_UTL_alloc.C000066400000000000000000000043431314437614700164470ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2005-AUG-24 to 2013-AUG-01 * are Copyright 2005-2008,2011-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Eli Venter on 2007-AUG-30 * are Copyright 2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz beginning on 2015-OCT-08 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_UTL_alloc.H" #if !defined(__CYGWIN__) && !defined(_WIN32) #include #endif #ifdef HW_PHYSMEM uint64 getPhysicalMemorySize(void) { uint64 physMemory = 0; int mib[2] = { CTL_HW, HW_PHYSMEM }; size_t len = sizeof(uint64); errno = 0; if (sysctl(mib, 2, &physMemory, &len, NULL, 0) != 0) fprintf(stderr, "getPhysicalMemorySize()-- sysctl() failed to return CTL_HW, HW_PHYSMEM: %s\n", strerror(errno)), exit(1); if (len != sizeof(uint64)) { #ifdef HW_MEMSIZE mib[1] = HW_MEMSIZE; len = sizeof(uint64); if (sysctl(mib, 2, &physMemory, &len, NULL, 0) != 0 || len != sizeof(uint64)) #endif fprintf(stderr, "getPhysicalMemorySize()-- sysctl() failed to return CTL_HW, HW_PHYSMEM: %s\n", strerror(errno)), exit(1); } return(physMemory); } #else uint64 getPhysicalMemorySize(void) { uint64 physPages = sysconf(_SC_PHYS_PAGES); uint64 pageSize = sysconf(_SC_PAGESIZE); uint64 physMemory = physPages * pageSize; return(physMemory); } #endif canu-1.6/src/AS_UTL/AS_UTL_alloc.H000066400000000000000000000076331314437614700164610ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2005-AUG-24 to 2013-AUG-01 * are Copyright 2005-2008,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-AUG-11 to 2015-JAN-31 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
Walenz beginning on 2015-OCT-08 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef AS_UTL_ALLOC_H #define AS_UTL_ALLOC_H #include "AS_global.H" uint64 getPhysicalMemorySize(void); const uint32 resizeArray_doNothing = 0x00; const uint32 resizeArray_copyData = 0x01; const uint32 resizeArray_clearNew = 0x02; template void allocateArray(TT*& array, LL arrayMax, uint32 op=resizeArray_clearNew) { if (array != NULL) delete [] array; array = new TT [arrayMax]; if (op == resizeArray_clearNew) memset(array, 0, sizeof(TT) * arrayMax); } template void duplicateArray(TT*& to, LL &toLen, LL &toMax, TT *fr, LL frLen, LL UNUSED(frMax), bool forceAlloc=false) { if ((toMax < frLen) || (forceAlloc)) { delete [] to; toLen = frLen; toMax = toLen; to = new TT [toMax]; } memcpy(to, fr, sizeof(TT) * toLen); } template void resizeArray(TT*& array, uint64 arrayLen, LL &arrayMax, uint64 newMax, uint32 op=resizeArray_copyData) { if (newMax <= arrayMax) return; arrayMax = newMax; TT *copy = new TT [arrayMax]; if (op & resizeArray_copyData) memcpy(copy, array, sizeof(TT) * arrayLen); delete [] array; array = copy; if (op & resizeArray_clearNew) memset(array + arrayLen, 0, sizeof(TT) * (newMax - arrayLen)); } template void resizeArrayPair(T1*& array1, T2*& array2, uint64 arrayLen, LL &arrayMax, LL newMax, uint32 op=resizeArray_copyData) { if (newMax <= arrayMax) return; arrayMax = newMax; T1 *copy1 = new T1 [arrayMax]; T2 *copy2 = new T2 [arrayMax]; if (op & resizeArray_copyData) { memcpy(copy1, array1, sizeof(T1) * arrayLen); memcpy(copy2, array2, sizeof(T2) * arrayLen); } delete [] array1; delete [] array2; array1 = copy1; array2 = copy2; if (op & resizeArray_clearNew) { memset(array1 + arrayLen, 0, sizeof(T1) * (newMax - arrayLen)); memset(array2 + arrayLen, 0, sizeof(T2) * (newMax - arrayLen)); } } template void 
increaseArray(TT*& array, uint64 arrayLen, LL &arrayMax, uint64 increment) { if (arrayLen + increment <= arrayMax) return; LL newMax = arrayMax; while (newMax < arrayLen + increment) newMax *= 2; resizeArray(array, arrayLen, arrayMax, newMax, resizeArray_copyData); } template<typename T1, typename T2, typename LL> void increaseArrayPair(T1*& array1, T2*& array2, uint64 arrayLen, LL &arrayMax, uint64 increment) { if (arrayLen + increment <= arrayMax) return; LL newMax = arrayMax; while (newMax < arrayLen + increment) newMax *= 2; resizeArrayPair(array1, array2, arrayLen, arrayMax, newMax, resizeArray_copyData); } #endif // AS_UTL_ALLOC_H canu-1.6/src/AS_UTL/AS_UTL_decodeRange.C /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2012-FEB-12 to 2013-OCT-11 * are Copyright 2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2015-MAY-28 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license.
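The increaseArray()/increaseArrayPair() templates above grow capacity geometrically: arrayMax is doubled until arrayLen + increment fits, then resizeArray() reallocates with copying. A standalone sketch of that strategy follows (increaseArraySketch is my name, not canu's, and the zero-capacity guard is an addition; the original doubling loop assumes a non-zero arrayMax):

```cpp
#include <cstdint>
#include <cstring>
#include <cassert>

// Sketch of the geometric growth in increaseArray(): double the
// capacity until arrayLen + increment fits, then reallocate and copy.
// The (arrayMax == 0) guard is an addition; without it the doubling
// loop would never terminate on an empty array.
template<typename TT>
void increaseArraySketch(TT*& array, uint64_t arrayLen, uint64_t &arrayMax, uint64_t increment) {
  if (arrayLen + increment <= arrayMax)
    return;

  uint64_t newMax = (arrayMax == 0) ? 1 : arrayMax;
  while (newMax < arrayLen + increment)
    newMax *= 2;

  TT *copy = new TT [newMax];
  if (arrayLen > 0)
    memcpy(copy, array, sizeof(TT) * arrayLen);
  delete [] array;

  array    = copy;
  arrayMax = newMax;
}
```

Doubling keeps the amortized cost of repeated appends linear, which is why the original delegates to resizeArray() only after computing the final capacity.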
*/ #include "AS_UTL_decodeRange.H" void AS_UTL_decodeRange(char *range, set<uint64> &ranges) { char *ap = range; uint32 av = 0, bv = 0; while (*ap != 0) { av = strtoull(ap, &ap, 10); if (*ap == ',') { ap++; ranges.insert(av); } else if (*ap == 0) { ranges.insert(av); } else if (*ap == '-') { ap++; bv = strtoull(ap, &ap, 10); for (uint32 xx=av; xx<=bv; xx++) ranges.insert(xx); if (*ap == ',') ap++; } else if (*ap != 0) { fprintf(stderr, "ERROR: invalid range '%s'\n", range); exit(1); } } } void AS_UTL_decodeRange(char *range, set<uint32> &ranges) { char *ap = range; uint32 av = 0, bv = 0; while (*ap != 0) { av = strtoul(ap, &ap, 10); if (*ap == ',') { ap++; ranges.insert(av); } else if (*ap == 0) { ranges.insert(av); } else if (*ap == '-') { ap++; bv = strtoull(ap, &ap, 10); for (uint32 xx=av; xx<=bv; xx++) ranges.insert(xx); if (*ap == ',') ap++; } else if (*ap != 0) { fprintf(stderr, "ERROR: invalid range '%s'\n", range); exit(1); } } } void AS_UTL_decodeRange(char *range, uint64 &lo, uint64 &hi) { char *ap = range; lo = hi = strtoull(ap, &ap, 10); if (*ap == '-') { ap++; hi = strtoull(ap, &ap, 10); } else if (*ap != 0) { fprintf(stderr, "ERROR: invalid range '%s'\n", range); exit(1); } } void AS_UTL_decodeRange(char *range, int64 &lo, int64 &hi) { char *ap = range; lo = hi = strtoll(ap, &ap, 10); if (*ap == '-') { ap++; hi = strtoll(ap, &ap, 10); } else if (*ap != 0) { fprintf(stderr, "ERROR: invalid range '%s'\n", range); exit(1); } } void AS_UTL_decodeRange(char *range, uint32 &lo, uint32 &hi) { char *ap = range; lo = hi = strtoul(ap, &ap, 10); if (*ap == '-') { ap++; hi = strtoul(ap, &ap, 10); } else if (*ap != 0) { fprintf(stderr, "ERROR: invalid range '%s'\n", range); exit(1); } } void AS_UTL_decodeRange(char *range, int32 &lo, int32 &hi) { char *ap = range; lo = hi = strtol(ap, &ap, 10); if (*ap == '-') { ap++; hi = strtol(ap, &ap, 10); } else if (*ap != 0) { fprintf(stderr, "ERROR: invalid range '%s'\n", range); exit(1); } } void AS_UTL_decodeRange(char *range, double
&lo, double &hi) { char *ap = range; lo = hi = strtod(ap, &ap); if (*ap == '-') { ap++; hi = strtod(ap, &ap); } else if (*ap != 0) { fprintf(stderr, "ERROR: invalid range '%s'\n", range); exit(1); } } canu-1.6/src/AS_UTL/AS_UTL_decodeRange.H000066400000000000000000000032261314437614700175610ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2012-FEB-12 to 2013-OCT-11 * are Copyright 2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2015-MAY-28 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
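All the lo/hi overloads of AS_UTL_decodeRange() parse the same grammar: a bare number "N" yields lo == hi == N, and a dash-separated pair "N-M" yields lo = N, hi = M. A self-contained sketch of that grammar (decodeRangeSketch is a hypothetical name; it reports trailing junk through its return value instead of calling exit(1) like canu does):

```cpp
#include <cstdint>
#include <cstdlib>
#include <cassert>

// Sketch of the range grammar shared by the lo/hi overloads of
// AS_UTL_decodeRange(): "N" sets lo == hi == N, "N-M" sets lo = N
// and hi = M. Returns false on trailing junk instead of exiting.
bool decodeRangeSketch(const char *range, int64_t &lo, int64_t &hi) {
  char *ap = nullptr;

  lo = hi = strtoll(range, &ap, 10);

  if (*ap == '-')
    hi = strtoll(ap + 1, &ap, 10);

  return (*ap == 0);   // true only if the whole string was consumed
}
```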
*/ #ifndef AS_UTL_DECODERANGE_H #define AS_UTL_DECODERANGE_H #include "AS_global.H" #include <set> using namespace std; void AS_UTL_decodeRange(char *range, set<uint64> &ranges); void AS_UTL_decodeRange(char *range, set<uint32> &ranges); void AS_UTL_decodeRange(char *range, uint64 &lo, uint64 &hi); void AS_UTL_decodeRange(char *range, int64 &lo, int64 &hi); void AS_UTL_decodeRange(char *range, uint32 &lo, uint32 &hi); void AS_UTL_decodeRange(char *range, int32 &lo, int32 &hi); void AS_UTL_decodeRange(char *range, double &lo, double &hi); #endif // AS_UTL_DECODERANGE_H canu-1.6/src/AS_UTL/AS_UTL_fasta.C /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2007-NOV-02 to 2013-AUG-01 * are Copyright 2007-2008,2010,2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2008-JUN-04 to 2009-MAR-31 * are Copyright 2008-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren on 2013-MAR-20 * are Copyright 2013 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz on 2015-FEB-27 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P.
Walenz beginning on 2015-NOV-23 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_UTL_fasta.H" #include "AS_UTL_fileIO.H" #include <stdarg.h> void AS_UTL_writeFastA(FILE *f, char *s, int sl, int bl, char *h, ...) { va_list ap; char *o = new char [sl + sl / 60 + 2]; int si = 0; int oi = 0; while (si < sl) { o[oi++] = s[si++]; if (bl != 0 && (si % bl) == 0) o[oi++] = '\n'; } if (o[oi-1] != '\n') o[oi++] = '\n'; o[oi] = 0; va_start(ap, h); vfprintf(f, h, ap); va_end(ap); AS_UTL_safeWrite(f, o, "AS_UTL_writeFastA", sizeof(char), oi); delete [] o; } void AS_UTL_writeFastQ(FILE *f, char *s, int sl, char *q, int ql, char *h, ...) { va_list ap; char *o = new char [ql + 1]; int qi = 0; int oi = 0; assert(sl == ql); // Reencode the QV to the Sanger spec. This is a copy, so we don't need to undo it. while (qi < ql) o[oi++] = q[qi++] + '!'; o[oi] = 0; va_start(ap, h); vfprintf(f, h, ap); va_end(ap); AS_UTL_safeWrite(f, s, "AS_UTL_writeFastQ", sizeof(char), sl); fprintf(f, "\n"); fprintf(f, "+\n"); AS_UTL_safeWrite(f, o, "AS_UTL_writeFastQ", sizeof(char), ql); fprintf(f, "\n"); delete [] o; } canu-1.6/src/AS_UTL/AS_UTL_fasta.H /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P.
Walenz from 2007-NOV-02 to 2013-AUG-01 * are Copyright 2007-2008,2010,2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2008-OCT-29 to 2009-MAR-31 * are Copyright 2008-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz beginning on 2015-NOV-23 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef AS_UTL_FASTA_H #define AS_UTL_FASTA_H #include "AS_global.H" // Writes sequence as fasta, with at most 'bl' letters per line (unlimited if 0). void AS_UTL_writeFastA(FILE *f, char *s, int sl, int bl, char *h, ...); // Writes FastQ, with Sanger QVs. void AS_UTL_writeFastQ(FILE *f, char *s, int sl, char *q, int ql, char *h, ...); #endif canu-1.6/src/AS_UTL/AS_UTL_fileIO.C000066400000000000000000000416131314437614700165250ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2005-SEP-26 to 2013-SEP-27 * are Copyright 2005-2009,2011-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Eli Venter on 2007-AUG-27 * are Copyright 2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. 
Walenz from 2015-MAR-25 to 2015-AUG-14 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_UTL_fileIO.H" // Report ALL attempts to seek somewhere. #undef DEBUG_SEEK // Use ftell() to verify that we wrote the expected number of bytes, // and that we ended up at the expected location. #undef VERIFY_WRITE_POSITIONS // Return the basename of a path -- that is, strip off any and all extensions. // Anything after the first dot after the last slash is removed. // // But if a directory, do nothing. void AS_UTL_findBaseFileName(char *basename, const char *filename) { strcpy(basename, filename); if (AS_UTL_fileExists(basename, true, false)) return; char *slash = strrchr(basename, '/'); char *dot = strchr((slash == NULL) ? basename : slash, '.'); if (dot) *dot = 0; } // Provides a safe and reliable mechanism for reading / writing // binary data. // // Split writes/reads into smaller pieces, check the result of each // piece. Really needed by OSF1 (V5.1), useful on other platforms to // be a little more friendly (big writes are usually not // interruptable). void AS_UTL_safeWrite(FILE *file, const void *buffer, const char *desc, size_t size, size_t nobj) { size_t position = 0; size_t length = 32 * 1024 * 1024 / size; size_t towrite = 0; size_t written = 0; #ifdef VERIFY_WRITE_POSITIONS off_t expectedposition = AS_UTL_ftell(file) + nobj * size; if (errno) // If we return, and errno is set, the stream isn't seekable. 
expectedposition = 0; #endif while (position < nobj) { towrite = length; if (position + towrite > nobj) towrite = nobj - position; errno = 0; written = fwrite(((char *)buffer) + position * size, size, towrite, file); if (errno) { fprintf(stderr, "safeWrite()-- Write failure on %s: %s\n", desc, strerror(errno)); fprintf(stderr, "safeWrite()-- Wanted to write " F_SIZE_T " objects (size=" F_SIZE_T "), wrote " F_SIZE_T ".\n", towrite, size, written); assert(errno == 0); } position += written; } // This catches a bizarre bug on FreeBSD (6.1 for sure, 4.10 too, I // think) where we write at the wrong location; see fseek below. // // UNFORTUNATELY, you can't ftell() on stdio. // #ifdef VERIFY_WRITE_POSITIONS if ((expectedposition > 0) && (AS_UTL_ftell(file) != expectedposition)) { fprintf(stderr, "safeWrite()-- EXPECTED " F_OFF_T ", ended up at " F_OFF_T "\n", expectedposition, AS_UTL_ftell(file)); assert(AS_UTL_ftell(file) == expectedposition); } #endif } size_t AS_UTL_safeRead(FILE *file, void *buffer, const char *desc, size_t size, size_t nobj) { size_t position = 0; size_t length = 32 * 1024 * 1024 / size; size_t toread = 0; size_t written = 0; // readen? while (position < nobj) { toread = length; if (position + toread > nobj) toread = nobj - position; errno = 0; written = fread(((char *)buffer) + position * size, size, toread, file); position += written; if (feof(file) || (written == 0)) goto finish; if ((errno) && (errno != EINTR)) { fprintf(stderr, "safeRead()-- Read failure on %s: %s.\n", desc, strerror(errno)); fprintf(stderr, "safeRead()-- Wanted to read " F_SIZE_T " objects (size=" F_SIZE_T "), read " F_SIZE_T ".\n", toread, size, written); assert(errno == 0); } } finish: // Just annoys developers. Stop it. //if (position != nobj) // fprintf(stderr, "AS_UTL_safeRead()-- Short read; wanted " F_SIZE_T " objects, read " F_SIZE_T " instead.\n", // nobj, position); return(position); } #if 0 // Reads a line, allocating space as needed. 
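AS_UTL_safeWrite() above splits a large buffer into pieces of at most roughly 32 MB so each fwrite() stays small and its result can be checked individually. A simplified sketch of that loop (chunkedWrite is my name, not canu's; it stops on a zero-length write rather than inspecting errno, and adds a guard for objects bigger than the chunk budget):

```cpp
#include <cstdio>
#include <cstddef>
#include <cassert>

// Sketch of the chunking loop in AS_UTL_safeWrite(): write at most
// ~32 MB worth of objects per fwrite() call so every piece can be
// checked. Unlike canu's version this stops on a zero-length write
// instead of checking errno.
size_t chunkedWrite(FILE *f, const void *buffer, size_t size, size_t nobj) {
  size_t chunk = 32 * 1024 * 1024 / size;   // objects per piece
  if (chunk == 0)                           // object larger than the budget
    chunk = 1;

  size_t position = 0;
  while (position < nobj) {
    size_t towrite = chunk;
    if (position + towrite > nobj)
      towrite = nobj - position;

    size_t written = fwrite((const char *)buffer + position * size, size, towrite, f);
    if (written == 0)                       // give up on a hard failure
      break;
    position += written;
  }
  return position;                          // number of objects actually written
}
```

Keeping each call small also makes the whole write friendlier to interruption, which is the motivation stated in the original comment.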
Alternate implementation, probably slower than the // getc() based one below. bool readLine(char *&L, uint32 &Llen, uint32 &Lmax, FILE *F) { if ((L == NULL) || (Lmax == 0)) allocateArray(L, Lmax = 4, resizeArray_clearNew); L[Lmax-2] = 0; L[Lmax-1] = 0; fgets(L, Lmax, F); Llen = strlen(L); fprintf(stderr, "READ Llen %u\n", Llen); // fgets() will always NUL-terminate the string. If the second to last // character exists and is not a newline, we didn't read the whole string. while ((L[Lmax-2] != 0) && (L[Lmax-2] != '\n')) { uint32 growth = 4; assert(Llen == Lmax - 1); resizeArray(L, Llen, Lmax, Lmax + growth); // Grow the array. L[Lmax-2] = 0; L[Lmax-1] = 0; fgets(L + Llen, 1 + growth, F); // Read more bytes. Llen += strlen(L + Llen); // How many more? fprintf(stderr, "READ Llen %u Lmax %u '%s'\n", Llen, Lmax, L); } // Trim trailing whitespace. while ((Llen > 0) && (isspace(L[Llen-1]))) L[--Llen] = 0; return(true); } #endif // Reads a line of text from a file. Trims off trailing whitespace, including newlines. bool AS_UTL_readLine(char *&L, uint32 &Llen, uint32 &Lmax, FILE *F) { if ((L == NULL) || (Lmax == 0)) allocateArray(L, Lmax = 1024, resizeArray_clearNew); Llen = 0; int32 ch = getc(F); uint32 growth = 1024; if (feof(F)) return(false); while ((feof(F) == false) && (ch != '\n')) { if (Llen + 1 >= Lmax) resizeArray(L, Llen, Lmax, Lmax + growth, resizeArray_copyData | resizeArray_clearNew); // Grow the array. L[Llen++] = ch; ch = getc(F); } // Terminate. L[Llen] = 0; // Trim trailing whitespace. while ((Llen > 0) && (isspace(L[Llen-1]))) L[--Llen] = 0; return(true); } // Ensure that directory 'dirname' exists. void AS_UTL_mkdir(const char *dirname) { struct stat st; // Stat the file. Don't fail if the file doesn't exist though. errno = 0; if ((stat(dirname, &st) != 0) && (errno != ENOENT)) fprintf(stderr, "AS_UTL_mkdir()-- Couldn't stat '%s': %s\n", dirname, strerror(errno)), exit(1); // If file doesn't exist, make the directory. Fail horribly if it can't be made.
// Or, fail horribly if the 'directory' is a file instead. if ((errno == ENOENT) && (mkdir(dirname, S_IRWXU | S_IRWXG | S_IRWXO) != 0)) fprintf(stderr, "AS_UTL_mkdir()-- Couldn't create directory '%s': %s\n", dirname, strerror(errno)), exit(1); if ((errno != ENOENT) && (S_ISDIR(st.st_mode) == false)) fprintf(stderr, "AS_UTL_mkdir()-- ERROR! '%s' is a file, and not a directory.\n", dirname), exit(1); } // Remove a directory, or do nothing if the file doesn't exist. void AS_UTL_rmdir(const char *dirname) { if (AS_UTL_fileExists(dirname, FALSE, FALSE) == false) return; errno = 0; rmdir(dirname); if (errno) fprintf(stderr, "AS_UTL_rmdir()-- Failed to remove directory '%s': %s\n", dirname, strerror(errno)), exit(1); } void AS_UTL_symlink(const char *pathToFile, const char *pathToLink) { // Fail horribly if the file doesn't exist. if (AS_UTL_fileExists(pathToFile, FALSE, FALSE) == false) fprintf(stderr, "AS_UTL_symlink()-- Original file '%s' doesn't exist, won't make a link to nothing.\n", pathToFile), exit(1); // Succeed silently if the link already exists. if (AS_UTL_fileExists(pathToLink, FALSE, FALSE) == true) return; // Nope? Make the link. errno = 0; symlink(pathToFile, pathToLink); if (errno) fprintf(stderr, "AS_UTL_symlink()-- Failed to make link '%s' pointing to file '%s': %s\n", pathToLink, pathToFile, strerror(errno)), exit(1); } // Remove a file, or do nothing if the file doesn't exist. void AS_UTL_unlink(const char *filename) { if (AS_UTL_fileExists(filename, FALSE, FALSE) == false) return; errno = 0; unlink(filename); if (errno) fprintf(stderr, "AS_UTL_unlink()-- Failed to remove file '%s': %s\n", filename, strerror(errno)), exit(1); } // Returns true if the named file/directory exists, and permissions // allow us to read and/or write. 
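AS_UTL_mkdir() above stats the path first, creates the directory only when stat() reports ENOENT, and rejects an existing path that is not a directory. A reduced sketch that reports the outcome through its return value instead of exiting (ensureDirSketch is my name, not canu's):

```cpp
#include <sys/stat.h>
#include <unistd.h>
#include <cerrno>
#include <cassert>

// Sketch of the AS_UTL_mkdir() logic: stat() first, create the
// directory only when the path is absent, and treat an existing
// non-directory as a failure (reported via the return value here,
// where canu calls exit(1)).
bool ensureDirSketch(const char *dirname) {
  struct stat st;

  if (stat(dirname, &st) != 0)
    return (errno == ENOENT) &&
           (mkdir(dirname, S_IRWXU | S_IRWXG | S_IRWXO) == 0);

  return S_ISDIR(st.st_mode);   // exists: ok only if it is a directory
}
```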
// int AS_UTL_fileExists(const char *path, int directory, int readwrite) { struct stat s; int r; errno = 0; r = stat(path, &s); if (errno) return(0); if ((directory == 1) && (readwrite == 0) && (s.st_mode & S_IFDIR) && (s.st_mode & (S_IRUSR | S_IRGRP | S_IROTH)) && (s.st_mode & (S_IXUSR | S_IXGRP | S_IXOTH))) return(1); if ((directory == 1) && (readwrite == 1) && (s.st_mode & S_IFDIR) && (s.st_mode & (S_IRUSR | S_IRGRP | S_IROTH)) && (s.st_mode & (S_IWUSR | S_IWGRP | S_IWOTH)) && (s.st_mode & (S_IXUSR | S_IXGRP | S_IXOTH))) return(1); if ((directory == 0) && (readwrite == 0) && (s.st_mode & (S_IRUSR | S_IRGRP | S_IROTH))) return(1); if ((directory == 0) && (readwrite == 1) && (s.st_mode & (S_IRUSR | S_IRGRP | S_IROTH)) && (s.st_mode & (S_IWUSR | S_IWGRP | S_IWOTH))) return(1); return(0); } off_t AS_UTL_sizeOfFile(const char *path) { struct stat s; int r; off_t size = 0; errno = 0; r = stat(path, &s); if (errno) { fprintf(stderr, "Failed to stat() file '%s': %s\n", path, strerror(errno)); exit(1); } // gzipped files contain a file contents list, which we can // use to get the uncompressed size. // // gzip -l // compressed uncompressed ratio uncompressed_name // 14444 71680 79.9% up.tar // // bzipped files have no contents and we just guess. 
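AS_UTL_fileExists() above accepts a path when any of the user/group/other mode bits grant the requested kind of access. A reduced sketch of just the plain readable-path case (fileIsReadableSketch is my name; canu's version additionally distinguishes directories and write access):

```cpp
#include <sys/stat.h>
#include <cassert>

// Sketch of the read-only case of AS_UTL_fileExists(): the path
// exists and at least one of the user/group/other read bits is set.
// Directory and write checks from the original are omitted.
bool fileIsReadableSketch(const char *path) {
  struct stat s;

  if (stat(path, &s) != 0)      // stat() failing means "doesn't exist"
    return false;

  return (s.st_mode & (S_IRUSR | S_IRGRP | S_IROTH)) != 0;
}
```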
if (strcasecmp(path+strlen(path)-3, ".gz") == 0) { char cmd[FILENAME_MAX], *p = cmd; snprintf(cmd, FILENAME_MAX, "gzip -l '%s'", path); FILE *F = popen(cmd, "r"); fgets(cmd, FILENAME_MAX, F); // compressed uncompressed ratio uncompressed_name fgets(cmd, FILENAME_MAX, F); // 30264891 43640320 30.6% file pclose(F); while (isspace(*p) == true) p++; // Skip spaces at the start of the line while (isspace(*p) == false) p++; // Skip the compressed size while (isspace(*p) == true) p++; // Skip spaces size = strtoull(p, NULL, 10); // Retain the uncompressed size } else if (strcasecmp(path+strlen(path)-4, ".bz2") == 0) { size = s.st_size * 14 / 10; } else { size = s.st_size; } return(size); } off_t AS_UTL_ftell(FILE *stream) { off_t pos = 0; errno = 0; pos = ftello(stream); if ((errno == ESPIPE) || (errno == EBADF)) // Not a seekable stream. Return some goofy big number. return(((off_t)1) << 42); if (errno) fprintf(stderr, "AS_UTL_ftell()-- Failed with %s.\n", strerror(errno)); assert(errno == 0); return(pos); } void AS_UTL_fseek(FILE *stream, off_t offset, int whence) { off_t beginpos = AS_UTL_ftell(stream); // If the stream is already at the correct position, just return. // // Unless we're on FreeBSD. For unknown reasons, FreeBSD fails // updating the gkpStore with mate links. It seems to misplace the // file pointer, and ends up writing the record to the wrong // location. ftell() is returning the correct current location, // and so AS_PER_genericStore doesn't seek() and just writes to the // current position. At the end of the write, we're off by 4096 // bytes.
// // LINK 498318175,1538 <-> 498318174,1537 // AS_UTL_fseek()-- seek to 159904 (whence=0); already there // safeWrite()-- write nobj=1x104 = 104 bytes at position 159904 // safeWrite()-- wrote nobj=1x104 = 104 bytes position now 164000 // safeWrite()-- EXPECTED 160008, ended up at 164000 // #if !defined __FreeBSD__ && !defined __osf__ && !defined __APPLE__ if ((whence == SEEK_SET) && (beginpos == offset)) { #ifdef DEBUG_SEEK // This isn't terribly informative, and adds a lot of clutter. //fprintf(stderr, "AS_UTL_fseek()-- seek to " F_OFF_T " (whence=%d); already there\n", offset, whence); #endif return; } #endif // __FreeBSD__ if (fseeko(stream, offset, whence) != 0) { fprintf(stderr, "AS_UTL_fseek()-- Failed with %s.\n", strerror(errno)); assert(errno == 0); } #ifdef DEBUG_SEEK fprintf(stderr, "AS_UTL_fseek()-- seek to " F_OFF_T " (requested " F_OFF_T ", whence=%d) from " F_OFF_T "\n", AS_UTL_ftell(stream), offset, whence, beginpos); #endif if (whence == SEEK_SET) assert(AS_UTL_ftell(stream) == offset); } void AS_UTL_loadFileList(char *fileName, vector<char *> &fileList) { errno = 0; FILE *F = fopen(fileName, "r"); if (errno) fprintf(stderr, "Can't open '%s': %s\n", fileName, strerror(errno)), exit(1); char *line = new char [FILENAME_MAX]; fgets(line, FILENAME_MAX, F); while (!feof(F)) { chomp(line); fileList.push_back(line); line = new char [FILENAME_MAX]; fgets(line, FILENAME_MAX, F); } delete [] line; fclose(F); } cftType compressedFileType(char const *filename) { if ((filename == NULL) || (filename[0] == 0) || (strcmp(filename, "-") == 0)) return(cftSTDIN); int32 len = strlen(filename); if ((len > 3) && (strcasecmp(filename + len - 3, ".gz") == 0)) return(cftGZ); else if ((len > 4) && (strcasecmp(filename + len - 4, ".bz2") == 0)) return(cftBZ2); else if ((len > 3) && (strcasecmp(filename + len - 3, ".xz") == 0)) return(cftXZ); else return(cftNONE); } compressedFileReader::compressedFileReader(const char *filename) { char cmd[FILENAME_MAX]; int32 len = 0; _file =
NULL; _pipe = false; _stdi = false; cftType ft = compressedFileType(filename); if ((ft != cftSTDIN) && (AS_UTL_fileExists(filename, FALSE, FALSE) == FALSE)) fprintf(stderr, "ERROR: Failed to open input file '%s': %s\n", filename, strerror(errno)), exit(1); errno = 0; switch (ft) { case cftGZ: snprintf(cmd, FILENAME_MAX, "gzip -dc '%s'", filename); _file = popen(cmd, "r"); _pipe = true; break; case cftBZ2: snprintf(cmd, FILENAME_MAX, "bzip2 -dc '%s'", filename); _file = popen(cmd, "r"); _pipe = true; break; case cftXZ: snprintf(cmd, FILENAME_MAX, "xz -dc '%s'", filename); _file = popen(cmd, "r"); _pipe = true; if (_file == NULL) // popen() returns NULL on error. It does not reliably set errno. fprintf(stderr, "ERROR: Failed to open input file '%s': popen() returned NULL\n", filename), exit(1); errno = 0; break; case cftSTDIN: _file = stdin; _stdi = 1; break; default: _file = fopen(filename, "r"); _pipe = false; break; } if (errno) fprintf(stderr, "ERROR: Failed to open input file '%s': %s\n", filename, strerror(errno)), exit(1); } compressedFileReader::~compressedFileReader() { if (_file == NULL) return; if (_stdi) return; if (_pipe) pclose(_file); else fclose(_file); } compressedFileWriter::compressedFileWriter(const char *filename, int32 level) { char cmd[FILENAME_MAX]; int32 len = 0; _file = NULL; _pipe = false; _stdi = false; cftType ft = compressedFileType(filename); errno = 0; switch (ft) { case cftGZ: snprintf(cmd, FILENAME_MAX, "gzip -%dc > '%s'", level, filename); _file = popen(cmd, "w"); _pipe = true; break; case cftBZ2: snprintf(cmd, FILENAME_MAX, "bzip2 -%dc > '%s'", level, filename); _file = popen(cmd, "w"); _pipe = true; break; case cftXZ: snprintf(cmd, FILENAME_MAX, "xz -%dc > '%s'", level, filename); _file = popen(cmd, "w"); _pipe = true; break; case cftSTDIN: _file = stdout; _stdi = 1; break; default: _file = fopen(filename, "w"); _pipe = false; break; } if (errno) fprintf(stderr, "ERROR: Failed to open output file '%s': %s\n", filename, 
strerror(errno)), exit(1); } compressedFileWriter::~compressedFileWriter() { if (_file == NULL) return; if (_stdi) return; if (_pipe) pclose(_file); else fclose(_file); } canu-1.6/src/AS_UTL/AS_UTL_fileIO.H /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2005-SEP-26 to 2013-SEP-27 * are Copyright 2005,2007-2009,2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2015-AUG-14 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef AS_UTL_FILEIO_H #define AS_UTL_FILEIO_H #include "AS_global.H" #include <vector> using namespace std; // Provides a safe and reliable mechanism for reading / writing // binary data. // // Split writes/reads into smaller pieces, check the result of each // piece. Really needed by OSF1 (V5.1), useful on other platforms to // be a little more friendly (big writes are usually not // interruptable).
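compressedFileType() keys the reader and writer entirely off the file name suffix, and the constructors then pipe through gzip, bzip2, or xz via popen(). A sketch of that dispatch (decompressCommand is a hypothetical helper returning just the tool invocation; canu itself builds full popen() command lines such as `gzip -dc 'file'`):

```cpp
#include <cstring>
#include <strings.h>   // strcasecmp (POSIX)
#include <cassert>

// Sketch of the suffix dispatch in compressedFileType(): "-" or an
// empty name means stdin, otherwise the extension selects the
// decompressor, and anything else is opened directly with fopen().
const char *decompressCommand(const char *filename) {
  if (filename == nullptr || filename[0] == 0 || strcmp(filename, "-") == 0)
    return "stdin";

  size_t len = strlen(filename);

  if (len > 3 && strcasecmp(filename + len - 3, ".gz")  == 0)  return "gzip -dc";
  if (len > 4 && strcasecmp(filename + len - 4, ".bz2") == 0)  return "bzip2 -dc";
  if (len > 3 && strcasecmp(filename + len - 3, ".xz")  == 0)  return "xz -dc";

  return "fopen";
}
```

Using a pipe keeps the rest of canu reading a plain FILE*, at the cost of requiring the external tool on PATH; the destructor must then pclose() instead of fclose(), which is what the _pipe flag tracks.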
#ifndef O_LARGEFILE #define O_LARGEFILE 0 #endif void AS_UTL_findBaseFileName(char *basename, const char *filename); void AS_UTL_safeWrite(FILE *file, const void *buffer, const char *desc, size_t size, size_t nobj); size_t AS_UTL_safeRead (FILE *file, void *buffer, const char *desc, size_t size, size_t nobj); bool AS_UTL_readLine(char *&L, uint32 &Llen, uint32 &Lmax, FILE *F); void AS_UTL_mkdir(const char *dirname); void AS_UTL_rmdir(const char *dirname); void AS_UTL_symlink(const char *pathToFile, const char *pathToLink); void AS_UTL_unlink(const char *filename); int AS_UTL_fileExists(const char *path, int directory=false, int readwrite=false); off_t AS_UTL_sizeOfFile(const char *path); off_t AS_UTL_ftell(FILE *stream); void AS_UTL_fseek(FILE *stream, off_t offset, int whence); // Read a file-of-files into a vector void AS_UTL_loadFileList(char *fileName, vector<char *> &fileList); enum cftType { cftNONE = 0, cftGZ = 1, cftBZ2 = 2, cftXZ = 3, cftSTDIN = 4 }; cftType compressedFileType(char const *filename); class compressedFileReader { public: compressedFileReader(char const *filename); ~compressedFileReader(); FILE *operator*(void) { return(_file); }; FILE *file(void) { return(_file); }; bool isCompressed(void) { return(_pipe); }; private: FILE *_file; bool _pipe; bool _stdi; }; class compressedFileWriter { public: compressedFileWriter(char const *filename, int32 level=1); ~compressedFileWriter(); FILE *operator*(void) { return(_file); }; FILE *file(void) { return(_file); }; bool isCompressed(void) { return(_pipe); }; private: FILE *_file; bool _pipe; bool _stdi; }; #endif // AS_UTL_FILEIO_H canu-1.6/src/AS_UTL/AS_UTL_reverseComplement.C /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2008-MAY-15 to 2013-AUG-01 * are Copyright 2008-2009,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" static char inv[256] = { 0, 0, 0, 0, 0, 0, 0, 0, // 0x00 - 0, 0, 0, 0, 0, 0, 0, 0, // 0x08 - 0, 0, 0, 0, 0, 0, 0, 0, // 0x10 - 0, 0, 0, 0, 0, 0, 0, 0, // 0x18 - 0, 0, 0, 0, 0, 0, 0, 0, // 0x20 - !"#$%&' 0, 0, 0, 0, 0, 0, 0, 0, // 0x28 - ()*+,-./ 0, 0, 0, 0, 0, 0, 0, 0, // 0x30 - 01234567 0, 0, 0, 0, 0, 0, 0, 0, // 0x38 - 89:;<=>? 
0,'T', 0,'G', 0, 0, 0,'C', // 0x40 - @ABCDEFG 0, 0, 0, 0, 0, 0, 0, 0, // 0x48 - HIJKLMNO 0, 0, 0, 0,'A', 0, 0, 0, // 0x50 - PQRSTUVW 0, 0, 0, 0, 0, 0, 0, 0, // 0x58 - XYZ[\]^_ 0,'t', 0,'g', 0, 0, 0,'c', // 0x60 - `abcdefg 0, 0, 0, 0, 0, 0, 0, 0, // 0x68 - hijklmno 0, 0, 0, 0,'a', 0, 0, 0, // 0x70 - pqrstuvw 0, 0, 0, 0, 0, 0, 0, 0, // 0x78 - xyz{|}~ 0, 0, 0, 0, 0, 0, 0, 0, // 0x80 - 0, 0, 0, 0, 0, 0, 0, 0, // 0x88 - 0, 0, 0, 0, 0, 0, 0, 0, // 0x90 - 0, 0, 0, 0, 0, 0, 0, 0, // 0x98 - 0, 0, 0, 0, 0, 0, 0, 0, // 0xa0 - 0, 0, 0, 0, 0, 0, 0, 0, // 0xa8 - 0, 0, 0, 0, 0, 0, 0, 0, // 0xb0 - 0, 0, 0, 0, 0, 0, 0, 0, // 0xb8 - 0, 0, 0, 0, 0, 0, 0, 0, // 0xc0 - 0, 0, 0, 0, 0, 0, 0, 0, // 0xc8 - 0, 0, 0, 0, 0, 0, 0, 0, // 0xd0 - 0, 0, 0, 0, 0, 0, 0, 0, // 0xd8 - 0, 0, 0, 0, 0, 0, 0, 0, // 0xe0 - 0, 0, 0, 0, 0, 0, 0, 0, // 0xe8 - 0, 0, 0, 0, 0, 0, 0, 0, // 0xf0 - 0, 0, 0, 0, 0, 0, 0, 0 // 0xf8 - }; void reverseComplementSequence(char *seq, int len) { char c=0; char *s=seq, *S=seq+len-1; if (len == 0) { len = strlen(seq); S = seq + len - 1; } while (s < S) { c = *s; *s++ = inv[*S]; *S-- = inv[c]; } if (s == S) *s = inv[*s]; } char * reverseComplementCopy(char *seq, int len) { char *rev = new char [len+1]; assert(len > 0); for (int32 p=len, q=0; p>0; ) rev[q++] = inv[seq[--p]]; rev[len] = 0; return(rev); } void reverseComplement(char *seq, char *qlt, int len) { char c=0; char *s=seq, *S=seq+len-1; char *q=qlt, *Q=qlt+len-1; if (qlt == NULL) { reverseComplementSequence(seq, len); return; } if (len == 0) { len = strlen(seq); S = seq + len - 1; Q = qlt + len - 1; } while (s < S) { c = *s; *s++ = inv[*S]; *S-- = inv[c]; c = *q; *q++ = *Q; *Q-- = c; } if (s == S) *s = inv[*s]; } canu-1.6/src/AS_UTL/AS_UTL_reverseComplement.H000066400000000000000000000024551314437614700210630ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into 
contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2008-MAY-15 to 2013-AUG-01 * are Copyright 2008,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef REVERSECOMPLEMENT_H #define REVERSECOMPLEMENT_H #include "AS_global.H" void reverseComplementSequence(char *seq, int len); char *reverseComplementCopy(char *seq, int len); void reverseComplement(char *seq, char *qlt, int len); #endif canu-1.6/src/AS_UTL/AS_UTL_stackTrace.C000066400000000000000000000227101314437614700174370ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2013-MAR-14 to 2013-AUG-01 * are Copyright 2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. 
Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-APR-26 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #if (!defined(__CYGWIN__) && !defined(_WIN31)) #include // backtrace #endif #include // backtrace #include #include #include #include #include #include #include #include // Derived from // http://sourceware.org/git/?p=glibc.git;a=blob;f=debug/segfault.c // // Linux g++ -rdynamic -g3 -o st1 st1.C // FreeBSD CC -o unwind unwind.C -I/usr/local/include -L/usr/local/lib -lunwind #define WRITE_STRING(S) write(2, S, strlen(S)) // If set, a signal handler will be installed to call AS_UTL_catchCrash(). // Defined by default, let the exceptions undef it. // #define INSTALL_HANDLER #if defined(__CYGWIN__) || defined(NOBACKTRACE) // Do nothing. #undef INSTALL_HANDLER #elif defined(LIBBACKTRACE) #warning LIBBACKTRACE extern "C" { #include "libbacktrace/backtrace.h" } backtrace_state *backtraceState = NULL; int AS_UTL_catchCrash_full(void *data, uintptr_t pc, const char *filename, int lineno, const char *function) { fprintf(stderr, "%s::%d in %s()\n", filename, lineno, function); return(0); // to continue tracing return(1); // to stop tracing } void AS_UTL_catchCrash(int sig_num, siginfo_t *UNUSED(info), void *UNUSED(ctx)) { WRITE_STRING("\nFailed with '"); WRITE_STRING(strsignal(sig_num)); WRITE_STRING("'; backtrace (libbacktrace):\n"); backtrace_full(backtraceState, 0, AS_UTL_catchCrash_full, NULL, NULL); //WRITE_STRING("\nBacktrace (libbacktrace print):\n\n"); //backtrace_print(backtraceState, 0, stderr); // Pass the signal through, only so a core file can get generated. 
struct sigaction sa; sa.sa_handler = SIG_DFL; sigemptyset (&sa.sa_mask); sa.sa_flags = 0; sigaction(sig_num, &sa, NULL); raise(sig_num); } #elif defined(LIBUNWIND) #warning LIBUNWIND #include void AS_UTL_catchCrash(int sig_num, siginfo_t *UNUSED(info), void *UNUSED(ctx)) { WRITE_STRING("\nFailed with '"); WRITE_STRING(strsignal(sig_num)); WRITE_STRING("'\n"); WRITE_STRING("\nBacktrace (mangled):\n\n"); unw_cursor_t cursor; unw_context_t uc; unw_word_t ip, sp; int depth = 0; unw_getcontext(&uc); // Get state unw_init_local(&cursor, &uc); // Initialize state cursor while (unw_step(&cursor) > 0) { unw_get_reg(&cursor, UNW_REG_IP, &ip); unw_get_reg(&cursor, UNW_REG_SP, &sp); unw_word_t off; char name[256]; if (unw_get_proc_name (&cursor, name, sizeof (name), &off) == 0) { if (off) fprintf(stderr, "%02d <%s + 0x%lx> ip=%lx sp=%lx\n", depth, name, (long)off, ip, sp); else fprintf(stderr, "%02d <%s> ip=%lx sp=%lx\n", depth, name, ip, sp); } else { fprintf(stderr, "%02d ip=%lx sp=%lx\n", depth, ip, sp); } depth++; } // Pass the signal through, only so a core file can get generated. struct sigaction sa; sa.sa_handler = SIG_DFL; sigemptyset (&sa.sa_mask); sa.sa_flags = 0; sigaction(sig_num, &sa, NULL); raise(sig_num); } #elif defined(BACKWARDCPP) #warning BACKWARDCPP #include "backward.hpp" //namespace backward { //backward::SignalHandling sh; //} // namespace backward void AS_UTL_catchCrash(int sig_num, siginfo_t *UNUSED(info), void *UNUSED(ctx)) { // Report the signal we failed on, be careful to not allocate memory. WRITE_STRING("\nFailed with '"); WRITE_STRING(strsignal(sig_num)); WRITE_STRING("'\n"); // Dump a full backtrace, even including the signal handler frames, // instead of not dumping anything. // The _fd version doesn't allocate memory, the standard backtrace_symbols // does. Demangling names also allocates memory. We'll do the safe one // first, just to get something, then try the others. 
WRITE_STRING("\nBacktrace:\n\n"); backward::StackTrace st; backward::TraceResolver tr; st.load_here(32); tr.load_stacktrace(st); for (size_t tt=0; tt 0) { fprintf(stderr, " " F_SIZE_T " inlined functions:\n", trace.inliners.size()); for (size_t ii=0; ii 0) // Failed. name = mang; fprintf(stderr, "[%d] %s::%s + %s %s\n", i, messages[i], name, obgn, oend); //free(name); } else { fprintf(stderr, "[%d] %s\n", i, messages[i]); } } // Pass the signal through, only so a core file can get generated. struct sigaction sa; sa.sa_handler = SIG_DFL; sigemptyset (&sa.sa_mask); sa.sa_flags = 0; sigaction(sig_num, &sa, NULL); raise(sig_num); } #endif #ifdef INSTALL_HANDLER void AS_UTL_installCrashCatcher(const char *filename) { struct sigaction sigact; memset(&sigact, 0, sizeof(struct sigaction)); sigact.sa_sigaction = AS_UTL_catchCrash; sigact.sa_flags = SA_RESTART | SA_SIGINFO; // Don't especially care if these fail or not. sigaction(SIGINT, &sigact, NULL); // Interrupt sigaction(SIGILL, &sigact, NULL); // Illegal instruction sigaction(SIGFPE, &sigact, NULL); // Floating point exception sigaction(SIGABRT, &sigact, NULL); // Abort - from assert sigaction(SIGBUS, &sigact, NULL); // Bus error sigaction(SIGSEGV, &sigact, NULL); // Segmentation fault #ifdef LIBBACKTRACE backtraceState = backtrace_create_state(filename, true, NULL, NULL); #endif } #else void AS_UTL_installCrashCatcher(const char *UNUSED(filename)) { } #endif canu-1.6/src/AS_UTL/AS_UTL_stackTrace.H000066400000000000000000000024311314437614700174420ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. 
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz from 2013-MAR-14 to 2013-AUG-01
 *      are Copyright 2013 J. Craig Venter Institute, and
 *      are subject to the GNU General Public License version 2
 *
 *    Brian P. Walenz beginning on 2016-JAN-11
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#ifndef AS_UTL_STACKTRACE_H
#define AS_UTL_STACKTRACE_H

#include <signal.h>   //  siginfo_t

void AS_UTL_catchCrash(int sig_num, siginfo_t *info, void *ctx);
void AS_UTL_installCrashCatcher(const char *filename);

#endif  //  AS_UTL_STACKTRACE_H
canu-1.6/src/AS_UTL/AS_UTL_stackTraceTest.C000066400000000000000000000025671314437614700203070ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz on 2013-OCT-18
 *      are Copyright 2013 J. Craig Venter Institute, and
 *      are subject to the GNU General Public License version 2
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#include <signal.h>
#include <stdlib.h>
#include <string.h>

#include "AS_UTL_stackTrace.H"   //  For AS_UTL_catchCrash().

int crash() {
  char *p = NULL;

  *p = 0;   //  Dereference NULL to raise SIGSEGV.

  return 0;
}

int foo4() { crash(); return 0; }
int foo3() { foo4();  return 0; }
int foo2() { foo3();  return 0; }
int foo1() { foo2();  return 0; }

int main(int argc, char **argv) {
  struct sigaction sigact;

  memset(&sigact, 0, sizeof(struct sigaction));

  sigact.sa_sigaction = AS_UTL_catchCrash;   //  Was 'crit_err_hdlr', which is not defined anywhere.
  sigact.sa_flags     = SA_RESTART | SA_SIGINFO;

  //  Don't especially care if this fails or not.

  sigaction(SIGSEGV, &sigact, NULL);

  foo1();

  exit(EXIT_SUCCESS);
}
canu-1.6/src/AS_UTL/bitEncodings.C000066400000000000000000000107521314437614700166570ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz on 2014-DEC-08
 *      are Copyright 2014 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
*/ #include "bitEncodings.H" const uint32 fibonacciValuesLen = 92; const uint64 fibonacciValues[92] = { 1LLU, 2LLU, 3LLU, 5LLU, 8LLU, 13LLU, 21LLU, 34LLU, 55LLU, 89LLU, 144LLU, 233LLU, 377LLU, 610LLU, 987LLU, 1597LLU, 2584LLU, 4181LLU, 6765LLU, 10946LLU, 17711LLU, 28657LLU, 46368LLU, 75025LLU, 121393LLU, 196418LLU, 317811LLU, 514229LLU, 832040LLU, 1346269LLU, 2178309LLU, 3524578LLU, 5702887LLU, 9227465LLU, 14930352LLU, 24157817LLU, 39088169LLU, 63245986LLU, 102334155LLU, 165580141LLU, 267914296LLU, 433494437LLU, 701408733LLU, 1134903170LLU, 1836311903LLU, 2971215073LLU, 4807526976LLU, 7778742049LLU, 12586269025LLU, 20365011074LLU, 32951280099LLU, 53316291173LLU, 86267571272LLU, 139583862445LLU, 225851433717LLU, 365435296162LLU, 591286729879LLU, 956722026041LLU, 1548008755920LLU, 2504730781961LLU, 4052739537881LLU, 6557470319842LLU, 10610209857723LLU, 17167680177565LLU, 27777890035288LLU, 44945570212853LLU, 72723460248141LLU, 117669030460994LLU, 190392490709135LLU, 308061521170129LLU, 498454011879264LLU, 806515533049393LLU, 1304969544928657LLU, 2111485077978050LLU, 3416454622906707LLU, 5527939700884757LLU, 8944394323791464LLU, 14472334024676221LLU, 23416728348467685LLU, 37889062373143906LLU, 61305790721611591LLU, 99194853094755497LLU, 160500643816367088LLU, 259695496911122585LLU, 420196140727489673LLU, 679891637638612258LLU, 1100087778366101931LLU, 1779979416004714189LLU, 2880067194370816120LLU, 4660046610375530309LLU, 7540113804746346429LLU, 12200160415121876738LLU }; canu-1.6/src/AS_UTL/bitEncodings.H000066400000000000000000000240721314437614700166640ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "bitPacking.H" #include "bitOperations.H" // Methods to encode and decode numbers from an array of bits. // // The bit packed array is rooted at location 'ptr' and currently at location // 'pos'. The size of the encoded number (in bits) is returned in 'siz'. // // Came from kmer/libutil/ // eliasDeltaEncoding.h // eliasGammaEncoding.h // fibonacciEncoding.h // generalizedUnaryEncoding.h // unaryEncoding.h // unaryEncodingTester.C // // Unary Encoding // // Routines to store and retrieve a unary encoded number to/from a // bit packed word array based at 'ptr' and currently at location // 'pos'. Both routines return the size of the encoded number in // 'siz'. // // The usual unary encoding. Store the number n as n 0 bits followed // by a single 1 bit. // // 0 -> 1 // 1 -> 01 // 2 -> 001 // 3 -> 0001 // 4 -> 00001 // // See the decoder as to why we use 0 instead of 1 for the count. 
// inline void setUnaryEncodedNumber(uint64 *ptr, uint64 pos, uint64 *siz, uint64 val) { *siz = val + 1; while (val >= 64) { setDecodedValue(ptr, pos, 64, uint64ZERO); pos += 64; val -= 64; siz += 64; } setDecodedValue(ptr, pos, val + 1, uint64ONE); pos += val + 1; } inline uint64 getUnaryEncodedNumber(uint64 *ptr, uint64 pos, uint64 *siz) { uint64 val = uint64ZERO; uint64 enc = uint64ZERO; // How many whole words are zero? // enc = getDecodedValue(ptr, pos, 64); while (enc == uint64ZERO) { val += 64; pos += 64; enc = getDecodedValue(ptr, pos, 64); } // This word isn't zero. Count how many bits are zero (see, the // choice of 0 or 1 for the encoding wasn't arbitrary!) val += 64 - logBaseTwo64(enc); *siz = val + 1; return(val); } // // Generalized Unary Encoding // // Generalized unary encodings. Defined by (start, step, stop). // This implementation uses stop=infinity to encode all possible // numbers. If you know the highest number possible, you'll get a // slight decrease in space used ... // // The method: // // The mth code word consists of 'm' unary encoded, followed by w = // start + m * step binary encoded bits. If a == stop, then the // terminator in the unary code is dropped. // // Encoding is tricky. Take the 3,2,9 example: // m w template # vals #'s // 0 3 1xxx 8 0- 7 // 1 5 01xxxxx 32 8- 39 // 2 7 001xxxxxxx 128 40-167 // 3 9 000xxxxxxxxx 512 168-679 // // I don't see a nice way of mapping our number n to the prefix m, // short of some sort of search. The implementation below is // probably very slow. // // On the bright side, decoding is trivial. Read back the unary // encoded number, then read that many bits to get the value. // static const uint64 _genunary_start = 3; static const uint64 _genunary_step = 2; inline void setGeneralizedUnaryEncodedNumber(uint64 *ptr, uint64 pos, uint64 *siz, uint64 val) { uint64 m = uint64ZERO; uint64 w = _genunary_start; uint64 n = uint64ONE << w; // Search for the prefix m, given our number 'val'. 
// While doing this, we get rid of all the implicitly stored values from 'val'. // #ifdef DEBUG_GENERALIZEDUNARYENCODING fprintf(stderr, " val="uint64FMT" try n="uint64FMT" for m="uint64FMT"\n", val, n, m); #endif while (n <= val) { val -= n; w += _genunary_step; n = uint64ONE << w; m++; #ifdef DEBUG_GENERALIZEDUNARYENCODING fprintf(stderr, " val="uint64FMT" try n="uint64FMT" for m="uint64FMT"\n", val, n, m); #endif } #ifdef DEBUG_GENERALIZEDUNARYENCODING fprintf(stderr, "val="uint64FMT" found m="uint64FMT"\n", val, m); #endif // Now just encode the number // m - the unary encoded prefix // w - the size of the binary encoded number setUnaryEncodedNumber(ptr, pos, siz, m); setDecodedValue(ptr, pos+*siz, w, val); *siz = m + 1 + w; } inline uint64 getGeneralizedUnaryEncodedNumber(uint64 *ptr, uint64 pos, uint64 *siz) { uint64 val = uint64ZERO; uint64 m = uint64ZERO; uint64 w = uint64ZERO; // Comments in the encoder apply here too. m = getUnaryEncodedNumber(ptr, pos, siz); w = _genunary_start + m * _genunary_step; val = getDecodedValue(ptr, pos + *siz, w); *siz = m + 1 + w; #ifdef DEBUG_GENERALIZEDUNARYENCODING fprintf(stderr, "m="uint64FMT" w="uint64FMT" val="uint64FMT"\n", m, w, val); #endif // Add in the implcitly stored pieces of the number while (m--) { w -= _genunary_step; val += uint64ONE << w; } return(val); } // // Elias' Gamma Encoding // inline void setEliasGammaEncodedNumber(uint64 *ptr, uint64 pos, uint64 *siz, uint64 val) { uint64 b = logBaseTwo64(val); setUnaryEncodedNumber(ptr, pos, siz, b); pos += *siz; setDecodedValue(ptr, pos, b, val); *siz += b; } inline uint64 getEliasGammaEncodedNumber(uint64 *ptr, uint64 pos, uint64 *siz) { uint64 b = getUnaryEncodedNumber(ptr, pos, siz); pos += *siz; *siz += b; return(getDecodedValue(ptr, pos, b)); } // // Elias' Delta Encoding // inline void setEliasDeltaEncodedNumber(uint64 *ptr, uint64 pos, uint64 *siz, uint64 val) { uint64 b = logBaseTwo64(val); setEliasGammaEncodedNumber(ptr, pos, siz, b); pos += *siz; 
setDecodedValue(ptr, pos, b-1, val); *siz += b-1; } inline uint64 getEliasDeltaEncodedNumber(uint64 *ptr, uint64 pos, uint64 *siz) { uint64 b = getEliasGammaEncodedNumber(ptr, pos, siz) - 1; pos += *siz; *siz += b; return(uint64ONE << b | getDecodedValue(ptr, pos, b)); } // // Fibonacci Encoding // // FibEncoding can store values up to 17,167,680,177,565 (slightly // below 2^45, so at most a 44-bit number) in a 64-bit quantity. // 93 bits (92 + 1) are needed to store up to 64-bit values. // Since zero cannot be stored, the value is incremented by one before storing. // The actual space used is: // // #### bits // 0 2 // 1 3 // 2 4 // 3 4 // 4 5 // 5 5 // 6 5 // 7 6 // 8 6 // 9 6 // 10 6 // 11 6 // 12 7 // 20 8 // 33 9 // 54 10 // 88 11 // 143 12 // 232 13 // 376 14 // 609 15 // 986 16 // 1596 17 // 2583 18 // 4180 19 // 6764 20 // 10945 21 // 17710 22 // 28656 23 // 46387 24 // 75024 25 // 121392 26 extern const uint32 fibonacciValuesLen; extern const uint64 fibonacciValues[92]; inline void setFibonacciEncodedNumber(uint64 *ptr, uint64 pos, uint64 *siz, uint64 val) { uint64 out1 = uint64ZERO; uint64 out2 = uint64ZERO; uint32 fib = fibonacciValuesLen; uint32 fibmax = uint64ZERO; // We cannot store zero as a fibonacci number, so we simply // increase everything by one. // val++; // Estimate a starting point for our search; we need a function // that is always slightly more than fib() // // Find the highest bit set, do a lookup // // XXX: Still need this! 
while (fib-- > 0) { if (val >= fibonacciValues[fib]) { if (fib >= 64) out2 |= uint64ONE << (127 - fib); else out1 |= uint64ONE << (63 - fib); val -= fibonacciValues[fib]; if (fibmax == uint64ZERO) { fibmax = fib + 1; if (fibmax >= 64) out2 |= uint64ONE << (127 - fibmax); else out1 |= uint64ONE << (63 - fibmax); } } } fibmax++; // Write the encoded numbers to the stream // if (fibmax > 64) { setDecodedValue(ptr, pos, 64, out1); pos += 64; out2 >>= (128 - fibmax); setDecodedValue(ptr, pos, fibmax - 64, out2); } else { out1 >>= (64 - fibmax); setDecodedValue(ptr, pos, fibmax, out1); } *siz = fibmax; } inline uint64 getFibonacciEncodedNumber(uint64 *ptr, uint64 pos, uint64 *siz) { uint64 wrd = (pos >> 6) & 0x0000cfffffffffffllu; uint64 sft = 0x8000000000000000llu >> (pos & 0x000000000000003fllu); uint64 val = 0; uint32 fib = 0; uint64 newbit; uint64 oldbit; oldbit = ptr[wrd] & sft; sft >>= 1; if (sft == uint64ZERO) { wrd++; sft = 0x8000000000000000llu; } newbit = ptr[wrd] & sft; sft >>= 1; if (sft == uint64ZERO) { wrd++; sft = 0x8000000000000000llu; } while (!oldbit || !newbit) { if (oldbit) val += fibonacciValues[fib]; fib++; oldbit = newbit; newbit = ptr[wrd] & sft; sft >>= 1; if (sft == uint64ZERO) { wrd++; sft = 0x8000000000000000llu; } } val += fibonacciValues[fib]; (*siz) = fib + 2; // We stored val+1, remember? Probably not, because the encoder is // next. // return(val - 1); } canu-1.6/src/AS_UTL/bitOperations.H000066400000000000000000000145671314437614700171060ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. 
* * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libutil/bitOperations.h * * Modifications by: * * Brian P. Walenz on 2004-APR-27 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-FEB-23 to 2014-APR-11 * are Copyright 2005-2006,2008,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef BRI_BITS_H #define BRI_BITS_H // For dealing with the bits in bytes. // I wish I could claim these. // // Freed, Edwin E. 1983. "Binary Magic Number" Dr. Dobbs Journal // Vol. 78 (April) pp. 24-37 // // Supposedly tells us how to reverse the bits in a word, count the number // of set bits in a words and more. // // A bit of verbage on counting the number of set bits. The naive way // is to loop and shift: // // uint32 r = uint32ZERO; // while (x) { // r++; // x >>= 1; // } // return(r); // // http://remus.rutgers.edu/~rhoads/Code/bitcount3.c has an optimized // method: // // x -= (0xaaaaaaaa & x) >> 1; // x = (x & 0x33333333) + ((x >> 2) & 0x33333333); // x += x >> 4; // x &= 0x0f0f0f0f; // x += x >> 8; // x += x >> 16; // x &= 0x000000ff; // return(x); // // No loops! // // Freed's methods are easier to understand, and just as fast. // // Using our bit counting routines, Ross Lippert suggested a nice // way of computing log2 -- use log2 shifts to fill up the lower // bits, then count bits. 
See logBaseTwo*() // inline uint64 reverseBits64(uint64 x) { x = ((x >> 1) & uint64NUMBER(0x5555555555555555)) | ((x << 1) & uint64NUMBER(0xaaaaaaaaaaaaaaaa)); x = ((x >> 2) & uint64NUMBER(0x3333333333333333)) | ((x << 2) & uint64NUMBER(0xcccccccccccccccc)); x = ((x >> 4) & uint64NUMBER(0x0f0f0f0f0f0f0f0f)) | ((x << 4) & uint64NUMBER(0xf0f0f0f0f0f0f0f0)); x = ((x >> 8) & uint64NUMBER(0x00ff00ff00ff00ff)) | ((x << 8) & uint64NUMBER(0xff00ff00ff00ff00)); x = ((x >> 16) & uint64NUMBER(0x0000ffff0000ffff)) | ((x << 16) & uint64NUMBER(0xffff0000ffff0000)); x = ((x >> 32) & uint64NUMBER(0x00000000ffffffff)) | ((x << 32) & uint64NUMBER(0xffffffff00000000)); return(x); } inline uint32 reverseBits32(uint32 x) { x = ((x >> 1) & uint32NUMBER(0x55555555)) | ((x << 1) & uint32NUMBER(0xaaaaaaaa)); x = ((x >> 2) & uint32NUMBER(0x33333333)) | ((x << 2) & uint32NUMBER(0xcccccccc)); x = ((x >> 4) & uint32NUMBER(0x0f0f0f0f)) | ((x << 4) & uint32NUMBER(0xf0f0f0f0)); x = ((x >> 8) & uint32NUMBER(0x00ff00ff)) | ((x << 8) & uint32NUMBER(0xff00ff00)); x = ((x >> 16) & uint32NUMBER(0x0000ffff)) | ((x << 16) & uint32NUMBER(0xffff0000)); return(x); } inline uint64 uint64Swap(uint64 x) { x = ((x >> 8) & uint64NUMBER(0x00ff00ff00ff00ff)) | ((x << 8) & uint64NUMBER(0xff00ff00ff00ff00)); x = ((x >> 16) & uint64NUMBER(0x0000ffff0000ffff)) | ((x << 16) & uint64NUMBER(0xffff0000ffff0000)); x = ((x >> 32) & uint64NUMBER(0x00000000ffffffff)) | ((x << 32) & uint64NUMBER(0xffffffff00000000)); return(x); } inline uint32 uint32Swap(uint32 x) { x = ((x >> 8) & uint32NUMBER(0x00ff00ff)) | ((x << 8) & uint32NUMBER(0xff00ff00)); x = ((x >> 16) & uint32NUMBER(0x0000ffff)) | ((x << 16) & uint32NUMBER(0xffff0000)); return(x); } inline uint16 uint16Swap(uint16 x) { x = ((x >> 8) & 0x00ff) | ((x << 8) & 0xff00); return(x); } #if (__GNUC__ > 3) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4) #define PREFETCH(x) __builtin_prefetch((x), 0, 0) #else #define PREFETCH(x) #endif // Amazingingly, this is slower. 
From what I can google, the builtin // is using the 2^16 lookup table method - so a 64-bit popcount does // 4 lookups in the table and sums. Bad cache performance in codes // that already have bad cache performance, I'd guess. // //#if (__GNUC__ > 3) || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4) //#define BUILTIN_POPCOUNT //#endif #ifdef BUILTIN_POPCOUNT inline uint32 countNumberOfSetBits32(uint32 x) { return(__builtin_popcount(x)); } inline uint64 countNumberOfSetBits64(uint64 x) { return(__builtin_popcountll(x)); } #else inline uint32 countNumberOfSetBits32(uint32 x) { x = ((x >> 1) & uint32NUMBER(0x55555555)) + (x & uint32NUMBER(0x55555555)); x = ((x >> 2) & uint32NUMBER(0x33333333)) + (x & uint32NUMBER(0x33333333)); x = ((x >> 4) & uint32NUMBER(0x0f0f0f0f)) + (x & uint32NUMBER(0x0f0f0f0f)); x = ((x >> 8) & uint32NUMBER(0x00ff00ff)) + (x & uint32NUMBER(0x00ff00ff)); x = ((x >> 16) & uint32NUMBER(0x0000ffff)) + (x & uint32NUMBER(0x0000ffff)); return(x); } inline uint64 countNumberOfSetBits64(uint64 x) { x = ((x >> 1) & uint64NUMBER(0x5555555555555555)) + (x & uint64NUMBER(0x5555555555555555)); x = ((x >> 2) & uint64NUMBER(0x3333333333333333)) + (x & uint64NUMBER(0x3333333333333333)); x = ((x >> 4) & uint64NUMBER(0x0f0f0f0f0f0f0f0f)) + (x & uint64NUMBER(0x0f0f0f0f0f0f0f0f)); x = ((x >> 8) & uint64NUMBER(0x00ff00ff00ff00ff)) + (x & uint64NUMBER(0x00ff00ff00ff00ff)); x = ((x >> 16) & uint64NUMBER(0x0000ffff0000ffff)) + (x & uint64NUMBER(0x0000ffff0000ffff)); x = ((x >> 32) & uint64NUMBER(0x00000000ffffffff)) + (x & uint64NUMBER(0x00000000ffffffff)); return(x); } #endif inline uint32 logBaseTwo32(uint32 x) { x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16; return(countNumberOfSetBits32(x)); } inline uint64 logBaseTwo64(uint64 x) { x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16; x |= x >> 32; return(countNumberOfSetBits64(x)); } #endif // BRI_BITS_H 
canu-1.6/src/AS_UTL/bitPackedArray.C000066400000000000000000000072241314437614700171340ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libutil/bitPackedArray.C * * Modifications by: * * Brian P. Walenz from 2005-FEB-07 to 2014-APR-11 * are Copyright 2005-2006,2012,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include #include #include #include #include #include #include "bitPackedArray.H" #include "bitPacking.H" bitPackedArray::bitPackedArray(uint32 valueWidth, uint32 segmentSize) { _valueWidth = valueWidth; _segmentSize = segmentSize; _nextElement = 0; _valuesPerSegment = (uint64)_segmentSize * 1024 * 8 / (uint64)_valueWidth; _numSegments = 0; _maxSegments = 16; _segments = new uint64 * [_maxSegments]; } bitPackedArray::~bitPackedArray() { for (uint32 i=0; i<_numSegments; i++) delete [] _segments[i]; delete [] _segments; } uint64 bitPackedArray::get(uint64 idx) { uint64 s = idx / _valuesPerSegment; uint64 p = _valueWidth * (idx % _valuesPerSegment); if (idx >= _nextElement) { fprintf(stderr, "bitPackedArray::get()-- element index " F_U64 " is out of range, only " F_U64 " elements.\n", idx, _nextElement-1); return(0xdeadbeefdeadbeefULL); } return(getDecodedValue(_segments[s], p, _valueWidth)); } void bitPackedArray::set(uint64 idx, uint64 val) { uint64 s = idx / _valuesPerSegment; uint64 p = _valueWidth * (idx % _valuesPerSegment); //fprintf(stderr, "s=" F_U64 " p=" F_U64 " segments=" F_U64 "/" F_U64 "\n", s, p, _numSegments, _maxSegments); if (idx >= _nextElement) _nextElement = idx+1; if (s >= _maxSegments) { _maxSegments = s + 16; uint64 **S = new uint64 * [_maxSegments]; for (uint32 i=0; i<_numSegments; i++) S[i] = _segments[i]; delete [] _segments; _segments = S; } while (_numSegments <= s) _segments[_numSegments++] = new uint64 [_segmentSize * 1024 / 8]; setDecodedValue(_segments[s], p, _valueWidth, val); } void bitPackedArray::clear(void) { for (uint32 s=0; s<_numSegments; s++) bzero(_segments[s], _segmentSize * 1024); } //////////////////////////////////////// bitArray::bitArray(uint32 segmentSize) { _segmentSize = segmentSize; _valuesPerSegment = (uint64)_segmentSize * 1024 * 8; _numSegments = 0; _maxSegments = 16; _segments = new uint64 * [_maxSegments]; } bitArray::~bitArray() { for (uint32 i=0; i<_numSegments; i++) delete [] _segments[i]; delete [] 
_segments; } void bitArray::clear(void) { for (uint32 s=0; s<_numSegments; s++) bzero(_segments[s], _segmentSize * 1024); } canu-1.6/src/AS_UTL/bitPackedArray.H000066400000000000000000000202561314437614700171410ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libutil/bitPackedArray.H * * Modifications by: * * Brian P. Walenz from 2005-JUL-12 to 2014-APR-11 * are Copyright 2005-2006,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef BITPACKEDARRAY_H #define BITPACKEDARRAY_H #include "AS_global.H" #undef DEBUG_BPH_ADD #undef DEBUG_BPH_GET //////////////////////////////////////// // // bitPackedArray // // implements an integer array using bit-widths less than word-sizes, // e.g., a memory efficient way to store 23 bit numbers. Numbers may // be up to 64 bits wide. // // The array is variable length, and it is implemented as an array, // not a list or tree -- accessing element 1,000,000 will allocate // elements 0 through 999,999. 
// class bitPackedArray { public: // Create a bitpacked array with elements of width 'width' using // 'segmentSize' KB per segment. If you know your array is going // to be much bigger or smaller, crank this value. // bitPackedArray(uint32 valueWidth, uint32 segmentSize = 1024); ~bitPackedArray(); // No array operator is provided, because we cannot return a // reference to a value that is split across two words (or even a // reference to a value that is not bit aligned in the word). // uint64 get(uint64 idx); void set(uint64 idx, uint64 val); // Clear the array. Since the array is variable sized, you must add // things to a new array before clearing it. void clear(void); private: uint32 _valueWidth; uint32 _segmentSize; uint64 _nextElement; // the first invalid element uint64 _valuesPerSegment; uint64 _numSegments; uint64 _maxSegments; uint64 **_segments; }; // An array of bits. Exactly the same as the bitPackedArray, but // optimized for width=1. // class bitArray { public: bitArray(uint32 segmentSize = 1024); ~bitArray(); uint64 get(uint64 idx); uint64 getAndSet(uint64 idx); void set(uint64 idx); void clr(uint64 idx); void clear(void); private: void resize(uint64 s); uint32 _segmentSize; uint64 _valuesPerSegment; uint64 _numSegments; uint64 _maxSegments; uint64 **_segments; }; // Uses the bitPackedArray to implement a heap. The bitPackedArray is dynamically sized, // so this can be too. // class bitPackedHeap { public: bitPackedHeap(uint32 width, uint64 size=16) { _array = new bitPackedArray(width, size); _array->set(0, 0); _lastVal = 0; }; ~bitPackedHeap() { delete _array; }; uint64 get(void) { uint64 biggestVal = ~uint64ZERO; if (_lastVal == 0) return(biggestVal); biggestVal = _array->get(0); _lastVal--; if (_lastVal == 0) return(biggestVal); uint64 t = _array->get(_lastVal); _array->set(0, t); uint64 pidx = 0; uint64 pval = t; uint64 cidx = 1; uint64 cval = 0; // set below while (cidx < _lastVal) { // Set cval here, so we can first test if cidx is in range. 
cval = _array->get(cidx); // Pick the smallest of the two kids if (cidx+1 < _lastVal) { t = _array->get(cidx+1); if (cval > t) { cidx++; cval = t; } } #ifdef DEBUG_BPH_GET fprintf(stderr, "test c=" F_U64 " and p=" F_U64 " lastVal=" F_U64 "\n", cidx, pidx, _lastVal); fprintf(stderr, "test c=" F_U64 "=" F_U64 "\n", cidx, cval); fprintf(stderr, "test p=" F_U64 "=" F_U64 "\n", pidx, pval); fprintf(stderr, "test c=" F_U64 "=" F_U64 " and p=" F_U64 "=" F_U64 "\n", cidx, cval, pidx, pval); #endif if (cval < pval) { #ifdef DEBUG_BPH_GET fprintf(stderr, "swap c=" F_U64 "=" F_U64 " and p=" F_U64 "=" F_U64 "\n", cidx, cval, pidx, pval); #endif // Swap p and c _array->set(pidx, cval); _array->set(cidx, pval); // Move down the tree -- pval doesn't change, we moved it into cidx! pidx = cidx; cidx = cidx * 2 + 1; } else { cidx = _lastVal; } } return(biggestVal); }; void add(uint64 value) { uint64 cidx = _lastVal; uint64 cval = value; uint64 pidx = 0; uint64 pval = 0; bool more = false; #ifdef DEBUG_BPH_ADD fprintf(stderr, "add c=" F_U64 "=" F_U64 " -- lastVal=" F_U64 "\n", cidx, cval, _lastVal); #endif _array->set(cidx, cval); if (cidx > 0) more = true; while (more) { pidx = (cidx-1) / 2; #ifdef DEBUG_BPH_ADD fprintf(stderr, "more c=" F_U64 " and p=" F_U64 "\n", cidx, pidx); #endif pval = _array->get(pidx); #ifdef DEBUG_BPH_ADD fprintf(stderr, "test c=" F_U64 "=" F_U64 " and p=" F_U64 "=" F_U64 "\n", cidx, cval, pidx, pval); #endif if (pval > cval) { #ifdef DEBUG_BPH_ADD fprintf(stderr, "swap c=" F_U64 "=" F_U64 " and p=" F_U64 "=" F_U64 "\n", cidx, cval, pidx, pval); #endif // Swap p and c _array->set(cidx, pval); _array->set(pidx, cval); // Move up the tree -- cval doesn't change, we moved it into pidx! 
cidx = pidx; } else { more = false; } if (cidx == 0) more = false; } _lastVal++; //dump(); }; void dump(void) { for (uint32 i=0; i<_lastVal; i++) fprintf(stderr, "HEAP[" F_U32 "]=" F_U64 "\n", i, _array->get(i)); } void clear(void) { _array->clear(); _lastVal = 0; }; private: bitPackedArray *_array; uint64 _lastVal; }; inline uint64 bitArray::get(uint64 idx) { uint64 s = idx / _valuesPerSegment; uint64 p = idx % _valuesPerSegment; uint64 wrd = (p >> 6) & 0x0000cfffffffffffllu; uint64 bit = (p ) & 0x000000000000003fllu; return((_segments[s][wrd] >> bit) & 0x0000000000000001llu); } inline void bitArray::resize(uint64 s) { if (s < _numSegments) return; if (s > _maxSegments) { _maxSegments = s + 16; uint64 **S = new uint64 * [_maxSegments]; for (uint32 i=0; i<_numSegments; i++) S[i] = _segments[i]; delete [] _segments; _segments = S; } while (_numSegments <= s) _segments[_numSegments++] = new uint64 [_segmentSize * 1024 / 8]; } inline uint64 bitArray::getAndSet(uint64 idx) { uint64 s = idx / _valuesPerSegment; uint64 p = idx % _valuesPerSegment; uint64 wrd = (p >> 6) & 0x0000cfffffffffffllu; uint64 bit = (p ) & 0x000000000000003fllu; uint64 ret = (_segments[s][wrd] >> bit) & 0x0000000000000001llu; _segments[s][wrd] |= uint64ONE << bit; return(ret); } inline void bitArray::set(uint64 idx) { uint64 s = idx / _valuesPerSegment; uint64 p = idx % _valuesPerSegment; resize(s); uint64 wrd = (p >> 6) & 0x0000cfffffffffffllu; uint64 bit = (p ) & 0x000000000000003fllu; _segments[s][wrd] |= uint64ONE << bit; } inline void bitArray::clr(uint64 idx) { uint64 s = idx / _valuesPerSegment; uint64 p = idx % _valuesPerSegment; resize(s); uint64 wrd = (p >> 6) & 0x0000cfffffffffffllu; uint64 bit = (p ) & 0x000000000000003fllu; _segments[s][wrd] &= ~(0x0000000000000001llu << bit); } #endif // BITPACKEDARRAY_H canu-1.6/src/AS_UTL/bitPackedFile.C000066400000000000000000000355171314437614700167430ustar00rootroot00000000000000 
/****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libutil/bitPackedFile.C * * Modifications by: * * Brian P. Walenz from 2003-JAN-02 to 2004-APR-01 * are Copyright 2003-2004 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-MAR-29 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-MAR-16 to 2014-APR-11 * are Copyright 2005-2008,2012,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-29 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "bitPackedFile.H" #include "AS_UTL_fileIO.H" //#include //#include //#include //#include //#include #include // N.B. any read() / write() pair (either order) must have a seek (or // a fflush) in between. 
bitPackedFile::bitPackedFile(char const *name, uint64 offset, bool forceTruncate) { _file = 0; _name = new char [strlen(name) + 1]; strcpy(_name, name); #ifdef WITH_BZIP2 _bzFILE = 0L; _bzerr = 0; _bzfile = 0L; #endif _bfrmax = 1048576 / 8; _bfr = new uint64 [_bfrmax]; _pos = uint64ZERO; _bit = uint64ZERO; memset(_bfr, 0, sizeof(uint64) * _bfrmax); _inCore = false; _bfrDirty = false; _forceFirstLoad = false; _isReadOnly = false; _isBzip2 = false; stat_seekInside = uint64ZERO; stat_seekOutside = uint64ZERO; stat_dirtyFlushes = uint64ZERO; file_offset = 0; endianess_offset = 0; endianess_flipped = false; // Try to open the original name -- we don't support compressed // files for rewrite. We just fail with a can't open message. // // To get read/write and create we have to use open(2), as mode // "r+" of fopen(3) will not create. (Yes, but w+ does, sigh.) // if (forceTruncate) { errno = 0; _file = open(_name, O_RDWR | O_CREAT | O_TRUNC | O_LARGEFILE, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH); if (errno) fprintf(stderr, "bitPackedFile::bitPackedFile()-- failed to open and truncate '%s': %s\n", _name, strerror(errno)), exit(1); } else if (AS_UTL_fileExists(_name)) { errno = 0; _file = open(_name, O_RDONLY | O_LARGEFILE, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH); if (errno) fprintf(stderr, "bitPackedFile::bitPackedFile()-- failed to open '%s': %s\n", _name, strerror(errno)), exit(1); _isReadOnly = true; } else { errno = 0; _file = open(_name, O_RDWR | O_CREAT | O_LARGEFILE, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH); if (errno) fprintf(stderr, "bitPackedFile::bitPackedFile()-- failed to open '%s': %s\n", _name, strerror(errno)), exit(1); } // Move to the correct position in the file. 
// file_offset = offset; if (file_offset > 0) { errno = 0; lseek(_file, file_offset, SEEK_SET); if (errno) fprintf(stderr, "bitPackedFile::bitPackedFile()-- '%s' failed to seek to position " F_U64 ": %s\n", _name, file_offset, strerror(errno)), exit(1); } // Deal with endianess. We write out some bytes (or read back some bytes) to the start of // the file, and then hide them from the user. // endianess_offset = 32 + file_offset; endianess_flipped = false; char t[16] = { 'b', 'i', 't', 'P', 'a', 'c', 'k', 'e', 'd', 'F', 'i', 'l', 'e', 0, 0, 1 }; char c[16] = { 0 }; uint64 at = uint64NUMBER(0xdeadbeeffeeddada); uint64 bt = uint64NUMBER(0x0abeadedbabed8f8); uint64 ac = uint64NUMBER(0); uint64 bc = uint64NUMBER(0); size_t nr = 0; errno = 0; nr += read(_file, c, sizeof(char) * 16); nr += read(_file, &ac, sizeof(uint64)); nr += read(_file, &bc, sizeof(uint64)); // Errors? if (errno) fprintf(stderr, "bitPackedFile::bitPackedFile()-- '%s' failed to read the header: %s\n", _name, strerror(errno)), exit(1); // Empty file, but expecting data! if ((nr == 0) && (_isReadOnly)) fprintf(stderr, "bitPackedFile::bitPackedFile()-- '%s' failed to read the header: empty file\n", _name), exit(1); // Empty file! Write the magic number and our endianess check. if (nr == 0) { errno = 0; write(_file, t, sizeof(char) * 16); write(_file, &at, sizeof(uint64)); write(_file, &bt, sizeof(uint64)); if (errno) fprintf(stderr, "bitPackedFile::bitPackedFile()-- '%s' failed to write the header: %s\n", _name, strerror(errno)), exit(1); return; } if ((c[0] == 'B') && (c[1] == 'Z') && (c[2] == 'h')) { #ifdef WITH_BZIP2 // Looks like a bzip2 file! 
errno = 0; _bzFILE = fopen(_name, "r"); if (errno) { fprintf(stderr, "bitPackedFile::bitPackedFile()-- failed to open bzip2 file '%s'\n", _name); exit(1); } _bzerr = 0; _bzfile = BZ2_bzReadOpen(&_bzerr, _bzFILE, 0, 0, 0L, 0); if ((_bzfile == 0L) || (_bzerr != BZ_OK)) { fprintf(stderr, "bitPackedFile::bitPackedFile()-- failed to init bzip2 file '%s'\n", _name); exit(1); } BZ2_bzRead(&_bzerr, _bzfile, c, sizeof(char) * 16); BZ2_bzRead(&_bzerr, _bzfile, &ac, sizeof(uint64)); BZ2_bzRead(&_bzerr, _bzfile, &bc, sizeof(uint64)); // XXX should check bzerr! _isReadOnly = true; _isBzip2 = true; #else fprintf(stderr, "bitPackedFile::bitPackedFile()-- '%s' looks like a bzip2 file, but bzip2 support not available!\n", _name); exit(1); #endif } // Check the magic number, decide on an endianess to use. // if (strncmp(t, c, 16) == 0) { if ((at == ac) && (bt == bc)) { endianess_flipped = false; } else if ((at == uint64Swap(ac)) && (bt == uint64Swap(bc))) { endianess_flipped = true; } else { fprintf(stderr, "bitPackedFile::bitPackedFile()-- '%s' looked like a bitPackedFile, but failed the endianess check, not opened.\n", _name); exit(1); } } else { fprintf(stderr, "bitPackedFile::bitPackedFile()-- '%s' doesn't appear to be a bitPackedFile, not opened.\n", _name); fprintf(stderr, "bitPackedFile::bitPackedFile()-- found "); for (uint32 i=0; i<16; i++) fprintf(stderr, "%c", isascii(c[i]) ? 
c[i] : '.'); fprintf(stderr, " at position " F_X64 "\n", file_offset); exit(1); } _forceFirstLoad = true; seek(0); } bitPackedFile::~bitPackedFile() { flushDirty(); delete [] _bfr; delete [] _name; close(_file); #ifdef WITH_BZIP2 if (_bzFILE) fclose(_bzFILE); if (_bzfile) BZ2_bzReadClose(&_bzerr, _bzfile); #endif } // If the page is dirty, flush it to disk // void bitPackedFile::flushDirty(void) { if (_bfrDirty == false) return; if (_isReadOnly) { fprintf(stderr, "bitPackedFile::bitPackedFile()-- '%s' is readonly, but is dirty!\n", _name); exit(1); } stat_dirtyFlushes++; errno = 0; lseek(_file, _pos * sizeof(uint64) + endianess_offset, SEEK_SET); if (errno) { fprintf(stderr, "bitPackedFile::seek()-- '%s' failed: %s\n", _name, strerror(errno)); exit(1); } // If we need to, flip all the words we are going to write // if (endianess_flipped) for (uint32 i=0; i<_bfrmax; i++) _bfr[i] = uint64Swap(_bfr[i]); // We should only write bits up to _bit, the position we are // currently at. However, we don't know if the block is being // flushed because we're totally finished with it, or because we // are moving on to the next block. If we're done with it, we // want to flush the word that contains _bit, and if we're moving // on to the next one, we'll flush that word again. So, in // either case, we flush the word that contains _bit. 
// errno = 0; write(_file, _bfr, sizeof(uint64) * _bfrmax); if (errno) { fprintf(stderr, "bitPackedFile::write()-- '%s' failed: %s\n", _name, strerror(errno)); exit(1); } // And then flip them back // if (endianess_flipped) for (uint32 i=0; i<_bfrmax; i++) _bfr[i] = uint64Swap(_bfr[i]); _bfrDirty = false; } void bitPackedFile::seekBzip2(uint64 UNUSED(bitpos)) { #ifdef WITH_BZIP2 // All we can do here is check that bitpos is // a) in our current buffer // b) would be in the next buffer once we read it uint64 newpos = bitpos >> 6; if (_pos + _bfrmax < newpos) { // nope, not in the buffer -- we could probably handle this by just reading and // discarding from the file until we get to the correct bitpos. fprintf(stderr, "bitPackedFile::seekBzip2()-- '%s' seek was not contiguous!\n", _name); exit(1); } // Copy the remaining bits of the current buffer to the start. Or // not, if this is the first load. uint64 lastpos = _bit >> 6; // The word we are currently in uint64 lastlen = (_bfrmax - lastpos); // The number of words left in the buffer if (_forceFirstLoad == true) { lastpos = 0; lastlen = 0; } else { memcpy(_bfr, _bfr + lastpos, sizeof(uint64) * lastlen); } // Update _bit and _pos -- lastlen is now the first invalid word // _bit = bitpos & 0x3f; // 64 * lastlen; _pos = bitpos >> 6; // Fill the buffer size_t wordsread = 0; if (_bzfile) { _bzerr = 0; wordsread = BZ2_bzRead(&_bzerr, _bzfile, _bfr + lastlen, sizeof(uint64) * (_bfrmax - lastlen)); if (_bzerr == BZ_STREAM_END) { //fprintf(stderr, "bitPackedFile::seekBzip2() file ended.\n"); BZ2_bzReadClose(&_bzerr, _bzfile); fclose(_bzFILE); _bzfile = 0L; _bzFILE = 0L; } else if (_bzerr != BZ_OK) { fprintf(stderr, "bitPackedFile::seekBzip2() '%s' read failed.\n", _name); exit(1); } } //fprintf(stderr, "Filled buffer with %d words!\n", wordsread); // Adjust to make wordsread be the index of the last word we actually read. 
// wordsread += lastlen; // Flip all the words we just read, if needed // if (endianess_flipped) for (uint32 i=lastlen; i> 6) is just before the old // position (_pos), assume that we are being accessed iteratively // backwards and load a full buffer so that the position we want to // access is at the end. // // Easy to think of bone-headed ways to break this (e.g., seek to // the second element in a structure, access the first, then access // the third). Not so easy to think of a logical reason someone // would want to do that. // if (((bitpos >> 6) < _pos) && (_pos <= (bitpos >> 6) + 32)) { _pos = bitpos >> 6; if (_pos > _bfrmax) _pos = _pos - _bfrmax + 32; else _pos = 0; } else { _pos = bitpos >> 6; } _bit = bitpos - (_pos << 6); errno = 0; lseek(_file, _pos * 8 + endianess_offset, SEEK_SET); if (errno) { fprintf(stderr, "bitPackedFile::seekNormal() '%s' seek to pos=" F_U64 " failed: %s\n", _name, _pos * 8 + endianess_offset, strerror(errno)); exit(1); } errno = 0; size_t wordsread = read(_file, _bfr, sizeof(uint64) * _bfrmax); if (errno) { fprintf(stderr, "bitPackedFile::seekNormal() '%s' read of " F_U64 " bytes failed': %s\n", _name, sizeof(uint64) * _bfrmax, strerror(errno)); exit(1); } // Flip all the words we just read, if needed // if (endianess_flipped) for (size_t i=0; i> 6; if ((_pos <= np) && (np <= _pos + _bfrmax - 32)) { _bit = bitpos - (_pos << 6); stat_seekInside++; //fprintf(stderr, "SEEK INSIDE to _bit=" F_U64 "\n", _bit); return; } } if (_inCore) { fprintf(stderr, "bitPackedFile::seek()-- file '%s' is in core, but still needed to seek??\n", _name); exit(1); } stat_seekOutside++; flushDirty(); if (_isBzip2) seekBzip2(bitpos); else seekNormal(bitpos); _forceFirstLoad = false; //fprintf(stderr, "SEEK OUTSIDE to _pos=" F_U64 " _bit=" F_U64 "\n", _pos, _bit); } uint64 bitPackedFile::loadInCore(void) { struct stat sb; // Convert this disk-based, read/write bitPackedFile to memory-based read-only. 
flushDirty(); errno = 0; fstat(_file, &sb); if (errno) fprintf(stderr, "bitPackedFile::loadInCore() failed to fstat(): %s\n", strerror(errno)), exit(1); // The extra 1024 words is to keep seek() from attempting to grab // the next block (there isn't a next block, we've got it all!) // when we're near the end of this block. We just make the block // a little bigger than it really is. delete [] _bfr; _bfrmax = sb.st_size / 8 + 1024; _bfr = new uint64 [_bfrmax]; _pos = 0; _bit = 0; // Tada! All we need to do now is load the block! _forceFirstLoad = true; seek(0); _inCore = true; return(_bfrmax * 8); } canu-1.6/src/AS_UTL/bitPackedFile.H000066400000000000000000000104721314437614700167410ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libutil/bitPackedFile.H * * Modifications by: * * Brian P. Walenz from 2003-JAN-02 to 2003-OCT-20 * are Copyright 2003 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-MAR-29 to 2004-APR-21 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUL-12 to 2014-APR-11 * are Copyright 2005-2008,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. 
Walenz on 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef BITPACKEDFILE_H #define BITPACKEDFILE_H #include "AS_global.H" #include "bitOperations.H" #include "bitEncodings.H" #include "bitPacking.H" //#define WITH_BZIP2 #ifdef WITH_BZIP2 #include <bzlib.h> #endif class bitPackedFile { public: bitPackedFile(char const *name, uint64 offset=0, bool forceTruncate=false); ~bitPackedFile(); uint64 getBits(uint32 size); uint64 getNumber(void); void putBits(uint64 bits, uint32 size); void putNumber(uint64 val); uint64 tell(void) { return((_pos << 6) + _bit); }; void seek(uint64 pos); uint64 loadInCore(void); void showStats(FILE *f) { fprintf(f, "inside: " F_U64 " outside: " F_U64 "\n", stat_seekInside, stat_seekOutside); fflush(f); }; private: // Ensure that the buffer has enough space for any future // operation. This constant, currently 31 words, must be strictly // less than the constant used in deciding if seek() is moving // forward or backwards. 
// void sync(void) { if (((_bit >> 6) + 31) >= _bfrmax) seek((_pos << 6) + _bit); }; void flushDirty(void); void seekBzip2(uint64 bitpos); void seekNormal(uint64 bitpos); int _file; char *_name; #ifdef WITH_BZIP2 FILE *_bzFILE; int _bzerr; BZFILE *_bzfile; #endif uint64 _bfrmax; // Number of words in the buffer uint64 *_bfr; // A chunk of the bitPackedFile in core uint64 _pos; // The location this chunk is from (in words) uint64 _bit; // The bit position we are modifying relative to _pos bool _inCore; bool _bfrDirty; bool _forceFirstLoad; bool _isReadOnly; bool _isBzip2; // For collecting statistics on our usage // uint64 stat_seekInside; uint64 stat_seekOutside; uint64 stat_dirtyFlushes; // For converting between hardware of different endianess. // uint64 file_offset; uint64 endianess_offset; bool endianess_flipped; }; inline uint64 bitPackedFile::getBits(uint32 siz) { sync(); uint64 ret = getDecodedValue(_bfr, _bit, siz); _bit += siz; return(ret); } inline uint64 bitPackedFile::getNumber(void) { sync(); uint64 siz = 0; uint64 ret = getFibonacciEncodedNumber(_bfr, _bit, &siz); _bit += siz; return(ret); } inline void bitPackedFile::putBits(uint64 bits, uint32 siz) { assert(_isReadOnly == false); sync(); setDecodedValue(_bfr, _bit, siz, bits); _bit += siz; _bfrDirty = true; } inline void bitPackedFile::putNumber(uint64 val) { assert(_isReadOnly == false); sync(); uint64 siz = 0; setFibonacciEncodedNumber(_bfr, _bit, &siz, val); _bit += siz; _bfrDirty = true; } #endif // BITPACKEDFILE_H canu-1.6/src/AS_UTL/bitPacking.H000066400000000000000000000342101314437614700163220ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libutil/bitPacking.h * * Modifications by: * * Brian P. Walenz from 2004-APR-27 to 2004-JUN-23 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2004-MAY-06 * are Copyright 2004 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2007-DEC-05 to 2014-APR-11 * are Copyright 2007-2008,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef BRI_BITPACKING_H #define BRI_BITPACKING_H #include #include // Routines used for stuffing bits into a word array. // Define this to enable testing that the width of the data element // is greater than zero. The uint64MASK() macro (bri.h) does not // generate a mask for 0. Compiler warnings are issued, because you // shouldn't use this in production code. // //#define CHECK_WIDTH // As CHECK_WIDTH is kind of expensive, we'll warn. #ifdef CHECK_WIDTH #warning CHECK_WIDTH is EXPENSIVE #endif // Returns 'siz' bits from the stream based at 'ptr' and currently at // location 'pos'. The position of the stream is not changed. // // Retrieves a collection of values; the number of bits advanced in // the stream is returned. 
// // Copies the lowest 'siz' bits in 'val' to the stream based at 'ptr' // and currently at 'pos'. The position of the stream is not // changed. // // Sets a collection of values; the number of bits advanced in the // stream is returned. // uint64 getDecodedValue (uint64 *ptr, uint64 pos, uint64 siz); uint64 getDecodedValues(uint64 *ptr, uint64 pos, uint64 num, uint64 *sizs, uint64 *vals); void setDecodedValue (uint64 *ptr, uint64 pos, uint64 siz, uint64 val); uint64 setDecodedValues(uint64 *ptr, uint64 pos, uint64 num, uint64 *sizs, uint64 *vals); // Like getDecodedValue() but will pre/post increment/decrement the // value stored in the stream, in addition to returning the // value. // // preIncrementDecodedValue(ptr, pos, siz) === x = getDecodedValue(ptr, pos, siz) + 1; // setDecodedValue(ptr, pos, siz, x); // // preDecrementDecodedValue(ptr, pos, siz) === x = getDecodedValue(ptr, pos, siz) - 1; // setDecodedValue(ptr, pos, siz, x); // // postIncrementDecodedValue(ptr, pos, siz) === x = getDecodedValue(ptr, pos, siz); // setDecodedValue(ptr, pos, siz, x + 1); // // postDecrementDecodedValue(ptr, pos, siz) === x = getDecodedValue(ptr, pos, siz); // setDecodedValue(ptr, pos, siz, x - 1); // uint64 preIncrementDecodedValue(uint64 *ptr, uint64 pos, uint64 siz); uint64 preDecrementDecodedValue(uint64 *ptr, uint64 pos, uint64 siz); uint64 postIncrementDecodedValue(uint64 *ptr, uint64 pos, uint64 siz); uint64 postDecrementDecodedValue(uint64 *ptr, uint64 pos, uint64 siz); // N.B. - I assume the bits in words are big-endian, which is // backwards from the way we shift things around. // // I define the "addresses" of bits in two consecutive words as // [0123][0123]. 
When adding words to the bit array, they're added // from left to right: // // setDecodedValue(bitstream, %0abc, 3) // setDecodedValue(bitstream, %0def, 3) // // results in [abcd][ef00] // // But when shifting things around, we typically do it from the right // side, since that is where the machine places numbers. // // A picture or two might help. // // // |----b1-----| // |-bit-||-sz-| // XXXXXX // [0---------------63] // ^ // pos // // // If the bits span two words, it'll look like this; b1 is smaller // than siz, and we update bit to be the "uncovered" piece of XXX // (all the stuff in word2). The first word is masked, then those // bits are shifted onto the result in the correct place. The second // word has the correct bits shifted to the right, then those are // appended to the result. // // |b1-| // |-----bit-----||---sz---| // XXXXXXXXXX // [0------------word1][0-------------word2] // ^ // pos // inline uint64 getDecodedValue(uint64 *ptr, uint64 pos, uint64 siz) { uint64 wrd = (pos >> 6) & 0x0000cfffffffffffllu; //PREFETCH(ptr + wrd); makes it worse uint64 bit = (pos ) & 0x000000000000003fllu; uint64 b1 = 64 - bit; uint64 ret = 0; #ifdef CHECK_WIDTH if (siz == 0) { fprintf(stderr, "ERROR: getDecodedValue() called with zero size!\n"); abort(); } if (siz > 64) { fprintf(stderr, "ERROR: getDecodedValue() called with huge size ("uint64FMT")!\n", siz); abort(); } #endif if (b1 >= siz) { ret = ptr[wrd] >> (b1 - siz); } else { bit = siz - b1; ret = (ptr[wrd] & uint64MASK(b1)) << bit; wrd++; ret |= (ptr[wrd] >> (64 - bit)) & uint64MASK(bit); } ret &= uint64MASK(siz); return(ret); } inline void setDecodedValue(uint64 *ptr, uint64 pos, uint64 siz, uint64 val) { uint64 wrd = (pos >> 6) & 0x0000cfffffffffffllu; uint64 bit = (pos ) & 0x000000000000003fllu; uint64 b1 = 64 - bit; #ifdef CHECK_WIDTH if (siz == 0) { fprintf(stderr, "ERROR: setDecodedValue() called with zero size!\n"); abort(); } if (siz > 64) { fprintf(stderr, "ERROR: getDecodedValue() called with huge 
size ("uint64FMT")!\n", siz); abort(); } #endif val &= uint64MASK(siz); if (b1 >= siz) { ptr[wrd] &= ~( uint64MASK(siz) << (b1 - siz) ); ptr[wrd] |= val << (b1 - siz); } else { bit = siz - b1; ptr[wrd] &= ~uint64MASK(b1); ptr[wrd] |= (val & (uint64MASK(b1) << (bit))) >> (bit); wrd++; ptr[wrd] &= ~(uint64MASK(bit) << (64 - bit)); ptr[wrd] |= (val & (uint64MASK(bit))) << (64 - bit); } } inline uint64 getDecodedValues(uint64 *ptr, uint64 pos, uint64 num, uint64 *sizs, uint64 *vals) { // compute the location of the start of the encoded words, then // just walk through to get the remaining words. uint64 wrd = (pos >> 6) & 0x0000cfffffffffffllu; //PREFETCH(ptr + wrd); makes it worse uint64 bit = (pos ) & 0x000000000000003fllu; uint64 b1 = 0; for (uint64 i=0; i 64) { fprintf(stderr, "ERROR: getDecodedValue() called with huge size ("uint64FMT")!\n", siz); abort(); } #endif if (b1 >= sizs[i]) { //fprintf(stderr, "get-single pos=%d b1=%d bit=%d wrd=%d\n", pos, b1, bit, wrd); vals[i] = ptr[wrd] >> (b1 - sizs[i]); bit += sizs[i]; } else { //fprintf(stderr, "get-double pos=%d b1=%d bit=%d wrd=%d bitafter=%d\n", pos, b1, bit, wrd, sizs[i]-b1); bit = sizs[i] - b1; vals[i] = (ptr[wrd] & uint64MASK(b1)) << bit; wrd++; vals[i] |= (ptr[wrd] >> (64 - bit)) & uint64MASK(bit); } if (bit == 64) { wrd++; bit = 0; } assert(bit < 64); vals[i] &= uint64MASK(sizs[i]); pos += sizs[i]; } return(pos); } inline uint64 setDecodedValues(uint64 *ptr, uint64 pos, uint64 num, uint64 *sizs, uint64 *vals) { uint64 wrd = (pos >> 6) & 0x0000cfffffffffffllu; uint64 bit = (pos ) & 0x000000000000003fllu; uint64 b1 = 0; for (uint64 i=0; i 64) { fprintf(stderr, "ERROR: getDecodedValue() called with huge size ("uint64FMT")!\n", siz); abort(); } #endif if (b1 >= sizs[i]) { //fprintf(stderr, "set-single pos=%d b1=%d bit=%d wrd=%d\n", pos, b1, bit, wrd); ptr[wrd] &= ~( uint64MASK(sizs[i]) << (b1 - sizs[i]) ); ptr[wrd] |= vals[i] << (b1 - sizs[i]); bit += sizs[i]; } else { //fprintf(stderr, "set-double pos=%d b1=%d 
bit=%d wrd=%d bitafter=%d\n", pos, b1, bit, wrd, sizs[i]-b1); bit = sizs[i] - b1; ptr[wrd] &= ~uint64MASK(b1); ptr[wrd] |= (vals[i] & (uint64MASK(b1) << (bit))) >> (bit); wrd++; ptr[wrd] &= ~(uint64MASK(bit) << (64 - bit)); ptr[wrd] |= (vals[i] & (uint64MASK(bit))) << (64 - bit); } if (bit == 64) { wrd++; bit = 0; } assert(bit < 64); pos += sizs[i]; } return(pos); } inline uint64 preIncrementDecodedValue(uint64 *ptr, uint64 pos, uint64 siz) { uint64 wrd = (pos >> 6) & 0x0000cfffffffffffllu; uint64 bit = (pos ) & 0x000000000000003fllu; uint64 b1 = 64 - bit; uint64 ret = 0; #ifdef CHECK_WIDTH if (siz == 0) { fprintf(stderr, "ERROR: preIncrementDecodedValue() called with zero size!\n"); abort(); } if (siz > 64) { fprintf(stderr, "ERROR: getDecodedValue() called with huge size ("uint64FMT")!\n", siz); abort(); } #endif if (b1 >= siz) { ret = ptr[wrd] >> (b1 - siz); ret++; ret &= uint64MASK(siz); ptr[wrd] &= ~( uint64MASK(siz) << (b1 - siz) ); ptr[wrd] |= ret << (b1 - siz); } else { bit = siz - b1; ret = (ptr[wrd] & uint64MASK(b1)) << bit; ret |= (ptr[wrd+1] >> (64 - bit)) & uint64MASK(bit); ret++; ret &= uint64MASK(siz); ptr[wrd] &= ~uint64MASK(b1); ptr[wrd] |= (ret & (uint64MASK(b1) << (bit))) >> (bit); wrd++; ptr[wrd] &= ~(uint64MASK(bit) << (64 - bit)); ptr[wrd] |= (ret & (uint64MASK(bit))) << (64 - bit); } return(ret); } inline uint64 preDecrementDecodedValue(uint64 *ptr, uint64 pos, uint64 siz) { uint64 wrd = (pos >> 6) & 0x0000cfffffffffffllu; uint64 bit = (pos ) & 0x000000000000003fllu; uint64 b1 = 64 - bit; uint64 ret = 0; #ifdef CHECK_WIDTH if (siz == 0) { fprintf(stderr, "ERROR: preDecrementDecodedValue() called with zero size!\n"); abort(); } if (siz > 64) { fprintf(stderr, "ERROR: getDecodedValue() called with huge size ("uint64FMT")!\n", siz); abort(); } #endif if (b1 >= siz) { ret = ptr[wrd] >> (b1 - siz); ret--; ret &= uint64MASK(siz); ptr[wrd] &= ~( uint64MASK(siz) << (b1 - siz) ); ptr[wrd] |= ret << (b1 - siz); } else { bit = siz - b1; ret = (ptr[wrd] 
& uint64MASK(b1)) << bit; ret |= (ptr[wrd+1] >> (64 - bit)) & uint64MASK(bit); ret--; ret &= uint64MASK(siz); ptr[wrd] &= ~uint64MASK(b1); ptr[wrd] |= (ret & (uint64MASK(b1) << (bit))) >> (bit); wrd++; ptr[wrd] &= ~(uint64MASK(bit) << (64 - bit)); ptr[wrd] |= (ret & (uint64MASK(bit))) << (64 - bit); } return(ret); } inline uint64 postIncrementDecodedValue(uint64 *ptr, uint64 pos, uint64 siz) { uint64 wrd = (pos >> 6) & 0x0000cfffffffffffllu; uint64 bit = (pos ) & 0x000000000000003fllu; uint64 b1 = 64 - bit; uint64 ret = 0; #ifdef CHECK_WIDTH if (siz == 0) { fprintf(stderr, "ERROR: postIncrementDecodedValue() called with zero size!\n"); abort(); } if (siz > 64) { fprintf(stderr, "ERROR: getDecodedValue() called with huge size ("uint64FMT")!\n", siz); abort(); } #endif if (b1 >= siz) { ret = ptr[wrd] >> (b1 - siz); ret++; ret &= uint64MASK(siz); ptr[wrd] &= ~( uint64MASK(siz) << (b1 - siz) ); ptr[wrd] |= ret << (b1 - siz); } else { bit = siz - b1; ret = (ptr[wrd] & uint64MASK(b1)) << bit; ret |= (ptr[wrd+1] >> (64 - bit)) & uint64MASK(bit); ret++; ret &= uint64MASK(siz); ptr[wrd] &= ~uint64MASK(b1); ptr[wrd] |= (ret & (uint64MASK(b1) << (bit))) >> (bit); wrd++; ptr[wrd] &= ~(uint64MASK(bit) << (64 - bit)); ptr[wrd] |= (ret & (uint64MASK(bit))) << (64 - bit); } ret--; ret &= uint64MASK(siz); return(ret); } inline uint64 postDecrementDecodedValue(uint64 *ptr, uint64 pos, uint64 siz) { uint64 wrd = (pos >> 6) & 0x0000cfffffffffffllu; uint64 bit = (pos ) & 0x000000000000003fllu; uint64 b1 = 64 - bit; uint64 ret = 0; #ifdef CHECK_WIDTH if (siz == 0) { fprintf(stderr, "ERROR: postDecrementDecodedValue() called with zero size!\n"); abort(); } if (siz > 64) { fprintf(stderr, "ERROR: getDecodedValue() called with huge size ("uint64FMT")!\n", siz); abort(); } #endif if (b1 >= siz) { ret = ptr[wrd] >> (b1 - siz); ret--; ret &= uint64MASK(siz); ptr[wrd] &= ~( uint64MASK(siz) << (b1 - siz) ); ptr[wrd] |= ret << (b1 - siz); } else { bit = siz - b1; ret = (ptr[wrd] & 
uint64MASK(b1)) << bit; ret |= (ptr[wrd+1] >> (64 - bit)) & uint64MASK(bit); ret--; ret &= uint64MASK(siz); ptr[wrd] &= ~uint64MASK(b1); ptr[wrd] |= (ret & (uint64MASK(b1) << (bit))) >> (bit); wrd++; ptr[wrd] &= ~(uint64MASK(bit) << (64 - bit)); ptr[wrd] |= (ret & (uint64MASK(bit))) << (64 - bit); } ret++; ret &= uint64MASK(siz); return(ret); } #endif // BRI_BITPACKING_H canu-1.6/src/AS_UTL/decodeBooleanString.C000066400000000000000000000026571314437614700201660ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-NOV-26 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ static bool decodeBoolean(char *feature, char *value) { bool ret = false; // Decodes a string with 0/1, false/true, no/yes into an integer flag. 
switch (value[0]) { case '0': case 'f': case 'F': case 'n': case 'N': ret = false; break; case '1': case 't': case 'T': case 'y': case 'Y': ret = true; break; default: fprintf(stderr, "decodeBoolean()-- feature '%s' has unknown boolean value '%s'\n", feature, value); break; } return(ret); } canu-1.6/src/AS_UTL/dnaAlphabets.C000066400000000000000000000223001314437614700166250ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "dnaAlphabets.H" dnaAlphabets alphabet; void dnaAlphabets::initTablesForACGTSpace(void) { int i, j; for (i=0; i<256; i++) { _whitespaceSymbol[i] = isspace(i) ? 
1 : 0; _toLower[i] = tolower(i); _toUpper[i] = toupper(i); _letterToBits[i] = (unsigned char)0xff; _bitsToLetter[i] = (unsigned char)'?'; _bitsToColor[i] = (unsigned char)'?'; _complementSymbol[i] = (unsigned char)'?'; } for (i=0; i<128; i++) for (j=0; j<128; j++) _IUPACidentity[i][j] = 0; _letterToBits['a'] = _letterToBits['A'] = (unsigned char)0x00; _letterToBits['c'] = _letterToBits['C'] = (unsigned char)0x01; _letterToBits['g'] = _letterToBits['G'] = (unsigned char)0x02; _letterToBits['t'] = _letterToBits['T'] = (unsigned char)0x03; _letterToBits['0'] = (unsigned char)0x00; _letterToBits['1'] = (unsigned char)0x01; _letterToBits['2'] = (unsigned char)0x02; _letterToBits['3'] = (unsigned char)0x03; _bitsToLetter[0x00] = 'A'; _bitsToLetter[0x01] = 'C'; _bitsToLetter[0x02] = 'G'; _bitsToLetter[0x03] = 'T'; _bitsToColor[0x00] = '0'; _bitsToColor[0x01] = '1'; _bitsToColor[0x02] = '2'; _bitsToColor[0x03] = '3'; _complementSymbol['a'] = 't'; // a _complementSymbol['t'] = 'a'; // t _complementSymbol['u'] = 'a'; // u, Really, only for RNA _complementSymbol['g'] = 'c'; // g _complementSymbol['c'] = 'g'; // c _complementSymbol['y'] = 'r'; // c t _complementSymbol['r'] = 'y'; // a g _complementSymbol['s'] = 'w'; // g c _complementSymbol['w'] = 's'; // a t _complementSymbol['k'] = 'm'; // t/u g _complementSymbol['m'] = 'k'; // a c _complementSymbol['b'] = 'v'; // c g t _complementSymbol['d'] = 'h'; // a g t _complementSymbol['h'] = 'd'; // a c t _complementSymbol['v'] = 'b'; // a c g _complementSymbol['n'] = 'n'; // a c g t _complementSymbol['A'] = 'T'; // a _complementSymbol['T'] = 'A'; // t _complementSymbol['U'] = 'A'; // u, Really, only for RNA _complementSymbol['G'] = 'C'; // g _complementSymbol['C'] = 'G'; // c _complementSymbol['Y'] = 'R'; // c t _complementSymbol['R'] = 'Y'; // a g _complementSymbol['S'] = 'W'; // g c _complementSymbol['W'] = 'S'; // a t _complementSymbol['K'] = 'M'; // t/u g _complementSymbol['M'] = 'K'; // a c _complementSymbol['B'] = 'V'; // c g 
t _complementSymbol['D'] = 'H'; // a g t _complementSymbol['H'] = 'D'; // a c t _complementSymbol['V'] = 'B'; // a c g _complementSymbol['N'] = 'N'; // a c g t _complementSymbol['0'] = '0'; // ColorSpace is self-complementing _complementSymbol['1'] = '1'; _complementSymbol['2'] = '2'; _complementSymbol['3'] = '3'; _IUPACidentity['A']['A'] = 1; _IUPACidentity['C']['C'] = 1; _IUPACidentity['G']['G'] = 1; _IUPACidentity['T']['T'] = 1; _IUPACidentity['M']['A'] = 1; _IUPACidentity['M']['C'] = 1; _IUPACidentity['R']['A'] = 1; _IUPACidentity['R']['G'] = 1; _IUPACidentity['W']['A'] = 1; _IUPACidentity['W']['T'] = 1; _IUPACidentity['S']['C'] = 1; _IUPACidentity['S']['G'] = 1; _IUPACidentity['Y']['C'] = 1; _IUPACidentity['Y']['T'] = 1; _IUPACidentity['K']['G'] = 1; _IUPACidentity['K']['T'] = 1; _IUPACidentity['V']['A'] = 1; _IUPACidentity['V']['C'] = 1; _IUPACidentity['V']['G'] = 1; _IUPACidentity['H']['A'] = 1; _IUPACidentity['H']['C'] = 1; _IUPACidentity['H']['T'] = 1; _IUPACidentity['D']['A'] = 1; _IUPACidentity['D']['G'] = 1; _IUPACidentity['D']['T'] = 1; _IUPACidentity['B']['C'] = 1; _IUPACidentity['B']['G'] = 1; _IUPACidentity['B']['T'] = 1; _IUPACidentity['N']['A'] = 1; _IUPACidentity['N']['C'] = 1; _IUPACidentity['N']['G'] = 1; _IUPACidentity['N']['T'] = 1; _IUPACidentity['M']['M'] = 1; _IUPACidentity['R']['R'] = 1; _IUPACidentity['W']['W'] = 1; _IUPACidentity['S']['S'] = 1; _IUPACidentity['Y']['Y'] = 1; _IUPACidentity['K']['K'] = 1; _IUPACidentity['V']['V'] = 1; _IUPACidentity['H']['W'] = 1; _IUPACidentity['D']['D'] = 1; _IUPACidentity['B']['B'] = 1; _IUPACidentity['N']['N'] = 1; // Order isn't important // for (i='A'; i<'Z'; i++) for (j='A'; j<'Z'; j++) { if (_IUPACidentity[j][i]) _IUPACidentity[i][j] = 1; } // Case isn't important // for (i='A'; i<'Z'; i++) for (j='A'; j<'Z'; j++) { if (_IUPACidentity[j][i]) { _IUPACidentity[tolower(i)][tolower(j)] = 1; _IUPACidentity[tolower(i)][j ] = 1; _IUPACidentity[i ][tolower(j)] = 1; } } } void 
dnaAlphabets::initTablesForColorSpace(void) { int i, j; for (i=0; i<128; i++) for (j=0; j<128; j++) _baseToColor[i][j] = '.'; // Invalid // Supports transforming a base sequence to a color sequence. // Not sure how valid this is; treat every letter like it's a gap. // We then override ACGT to be the correct encoding. for (i='a'; i<='z'; i++) { _baseToColor['a'][i] = '4'; _baseToColor['c'][i] = '4'; _baseToColor['g'][i] = '4'; _baseToColor['t'][i] = '4'; _baseToColor['n'][i] = '4'; } for (i='a'; i<='z'; i++) { _baseToColor[i]['a'] = '0'; _baseToColor[i]['c'] = '1'; _baseToColor[i]['g'] = '2'; _baseToColor[i]['t'] = '3'; _baseToColor[i]['n'] = '4'; } _baseToColor['a']['a'] = '0'; _baseToColor['a']['c'] = '1'; _baseToColor['a']['g'] = '2'; _baseToColor['a']['t'] = '3'; _baseToColor['a']['n'] = '4'; _baseToColor['c']['a'] = '1'; _baseToColor['c']['c'] = '0'; _baseToColor['c']['g'] = '3'; _baseToColor['c']['t'] = '2'; _baseToColor['c']['n'] = '4'; _baseToColor['g']['a'] = '2'; _baseToColor['g']['c'] = '3'; _baseToColor['g']['g'] = '0'; _baseToColor['g']['t'] = '1'; _baseToColor['g']['n'] = '4'; _baseToColor['t']['a'] = '3'; _baseToColor['t']['c'] = '2'; _baseToColor['t']['g'] = '1'; _baseToColor['t']['t'] = '0'; _baseToColor['t']['n'] = '4'; for (i='a'; i<='z'; i++) for (j='a'; j<='z'; j++) { _baseToColor[toupper(i)][toupper(j)] = _baseToColor[i][j]; _baseToColor[tolower(i)][toupper(j)] = _baseToColor[i][j]; _baseToColor[toupper(i)][tolower(j)] = _baseToColor[i][j]; _baseToColor[tolower(i)][tolower(j)] = _baseToColor[i][j]; } // Supports composing colors _baseToColor['0']['0'] = '0'; _baseToColor['0']['1'] = '1'; _baseToColor['0']['2'] = '2'; _baseToColor['0']['3'] = '3'; _baseToColor['0']['4'] = '4'; _baseToColor['1']['0'] = '1'; _baseToColor['1']['1'] = '0'; _baseToColor['1']['2'] = '3'; _baseToColor['1']['3'] = '2'; _baseToColor['1']['4'] = '4'; _baseToColor['2']['0'] = '2'; _baseToColor['2']['1'] = '3'; _baseToColor['2']['2'] = '0'; _baseToColor['2']['3'] = '1'; 
_baseToColor['2']['4'] = '4'; _baseToColor['3']['0'] = '3'; _baseToColor['3']['1'] = '2'; _baseToColor['3']['2'] = '1'; _baseToColor['3']['3'] = '0'; _baseToColor['3']['4'] = '4'; // Supports transforming color sequence to base sequence. _baseToColor['a']['0'] = _baseToColor['A']['0'] = 'a'; _baseToColor['a']['1'] = _baseToColor['A']['1'] = 'c'; _baseToColor['a']['2'] = _baseToColor['A']['2'] = 'g'; _baseToColor['a']['3'] = _baseToColor['A']['3'] = 't'; _baseToColor['a']['4'] = _baseToColor['A']['4'] = 'n'; _baseToColor['c']['0'] = _baseToColor['C']['0'] = 'c'; _baseToColor['c']['1'] = _baseToColor['C']['1'] = 'a'; _baseToColor['c']['2'] = _baseToColor['C']['2'] = 't'; _baseToColor['c']['3'] = _baseToColor['C']['3'] = 'g'; _baseToColor['c']['4'] = _baseToColor['C']['4'] = 'n'; _baseToColor['g']['0'] = _baseToColor['G']['0'] = 'g'; _baseToColor['g']['1'] = _baseToColor['G']['1'] = 't'; _baseToColor['g']['2'] = _baseToColor['G']['2'] = 'a'; _baseToColor['g']['3'] = _baseToColor['G']['3'] = 'c'; _baseToColor['g']['4'] = _baseToColor['G']['4'] = 'n'; _baseToColor['t']['0'] = _baseToColor['T']['0'] = 't'; _baseToColor['t']['1'] = _baseToColor['T']['1'] = 'g'; _baseToColor['t']['2'] = _baseToColor['T']['2'] = 'c'; _baseToColor['t']['3'] = _baseToColor['T']['3'] = 'a'; _baseToColor['t']['4'] = _baseToColor['T']['4'] = 'n'; _baseToColor['n']['0'] = _baseToColor['N']['0'] = 'a'; _baseToColor['n']['1'] = _baseToColor['N']['1'] = 'c'; _baseToColor['n']['2'] = _baseToColor['N']['2'] = 'g'; _baseToColor['n']['3'] = _baseToColor['N']['3'] = 't'; _baseToColor['n']['4'] = _baseToColor['N']['4'] = 'n'; } canu-1.6/src/AS_UTL/dnaAlphabets.H000066400000000000000000000054571314437614700166500ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libbio/alphabet-generate.c * kmer/libbio/dnaAlphabets.H * * Modifications by: * * Brian P. Walenz from 2003-MAY-06 to 2003-OCT-21 * are Copyright 2003 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2004-MAY-28 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-MAR-05 to 2008-JUL-28 * are Copyright 2005-2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include #include class dnaAlphabets { public: dnaAlphabets() { initTablesForACGTSpace(); }; ~dnaAlphabets() { }; void initTablesForACGTSpace(void); void initTablesForColorSpace(void); bool isWhitespace(unsigned char x) { return(_whitespaceSymbol[x]); }; unsigned char toLower(unsigned char x) { return(_toLower[x]); }; unsigned char toUpper(unsigned char x) { return(_toUpper[x]); }; unsigned char letterToBits(unsigned char x) { return(_letterToBits[x]); }; unsigned char bitsToLetter(unsigned char x) { return(_bitsToLetter[x]); }; unsigned char complementSymbol(unsigned char x) { return(_complementSymbol[x]); }; bool validCompressedSymbol(unsigned char x) { return(_validCompressedSymbol[x]); }; private: unsigned char _whitespaceSymbol[256]; unsigned char _toLower[256]; unsigned char _toUpper[256]; unsigned char _letterToBits[256]; unsigned char _bitsToLetter[256]; unsigned char _bitsToColor[256]; unsigned char _complementSymbol[256]; unsigned char _validCompressedSymbol[256]; unsigned char _IUPACidentity[128][128]; unsigned char _baseToColor[128][128]; }; extern dnaAlphabets alphabet; canu-1.6/src/AS_UTL/findKeyAndValue.H000066400000000000000000000076641314437614700172750ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2014-NOV-21 to 2015-JAN-30 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
Walenz beginning on 2016-FEB-22 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ class KeyAndValue { public: KeyAndValue(char *line = NULL) { find(line); }; ~KeyAndValue() { } bool find(char *line); char *key(void) { return(key_); }; char *value(void) { return(val_); }; bool value_bool(void) { return((val_[0] == 't') || (val_[0] == 'T') || (val_[0] == '1')); }; int32 value_int32(void) { return(strtol (val_, NULL, 10)); }; int64 value_int64(void) { return(strtoll(val_, NULL, 10)); }; uint32 value_uint32(void) { return(strtoul (val_, NULL, 10)); }; uint64 value_uint64(void) { return(strtoull(val_, NULL, 10)); }; float value_float(void) { return(strtof(val_, NULL)); }; double value_double(void) { return(strtod(val_, NULL)); }; public: bool iscomment(char c) { return((c == '!') || (c == '#') || (c == 0)); }; bool isdelimiter(char c) { return((c == ':') || (c == '=') || isspace(c)); }; char *key_; char *val_; }; // Returns true if a key and value are found. line is modified. // bool KeyAndValue::find(char *line) { key_ = NULL; val_ = NULL; if (line == NULL) return(false); key_ = line; while (isspace(*key_) == true) // Spaces before the key key_++; if ((iscomment(*key_) == true) || // If we're at a comment right now, there is no key (*key_ == 0)) { // and we return failure. key_ = NULL; val_ = NULL; return(false); } val_ = key_; // We're at the key now while ((*val_ != 0) && (isdelimiter(*val_) == false)) // The key cannot contain a delimiter. val_++; *val_++ = 0; while (isdelimiter(*val_) == true) { // Spaces or delimiter after the key *val_ = 0; val_++; } if (*val_ == 0) // And there is no value, must be a filename. return(true); char *eol = val_; // We're at the value now // If quoted, all we need to do is find the other quote and stop. 
if ((*val_ == '"') || (*val_ == '\'')) { val_++; eol++; while (*eol != '"') // The value itself. eol++; // The value CAN contain delimiters and comment markers. *eol = 0; } // Otherwise, not quoted. Find the first comment marker (or eol) then backup to the first non-space. else { while (iscomment(*eol) == false) // The value MUST NOT contain delimiters or comment markers. eol++; // But it can contains spaces and other nasties. eol--; // Back up off the comment or eol. while (isspace(*eol) == true) // And keep backing up as long as we're a space. eol--; eol++; // Move past the last non-space, non-comment *eol = 0; // And terminate the value } return(true); } canu-1.6/src/AS_UTL/hexDump.C000066400000000000000000000056451314437614700156660ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-SEP-15 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" // Dump DATAlen bytes from DATA in a hex format. // It will print W bytes per line, separated into words of 8 bytes. // The end of the line will have the ASCII representation of the data. // // 00000000 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f '................' 
// void hexDump(FILE *F, void *DATA, uint32 DATAlen, uint32 W) { char *STR = new char [8 + 2 + W * 3 + W/8 + 2 + 1 + W * 1 + 1]; for (uint32 Dp=0; Dp < DATAlen; Dp += W) { uint8 *D = (uint8 *)DATA + Dp; uint32 Ds = (Dp + W <= DATAlen) ? (W) : (DATAlen - Dp); for (uint32 Z=Dp, ii=8; ii>0; Z>>=4) // Dump the address in hexadecimal STR[--ii] = ((Z & 0x0f) < 0x0a) ? ((Z & 0x0f) + '0') : ((Z & 0x0f) - 0x0a + 'a'); char *H = STR + 8; // Data pointer char *A = STR + 8 + 1 + 3 * W + W/8; // ASCII pointer *H++ = ' '; // Another space is added at ii=0 below. *A++ = ' '; // Space between the last digit and the string *A++ = '\''; // Bracket at the start of the string. for (uint32 ii=0; ii<W; ii++) { if (ii < Ds) { *H++ = (((D[ii] & 0xf0) >> 4) < 0x0a) ? (((D[ii] & 0xf0) >> 4) + '0') : (((D[ii] & 0xf0) >> 4) - 0x0a + 'a'); *H++ = ((D[ii] & 0x0f) < 0x0a) ? (((D[ii] & 0x0f) ) + '0') : (((D[ii] & 0x0f) ) - 0x0a + 'a'); } else { *H++ = ' '; // ...spaces if we fell off the end of the data *H++ = ' '; } *H++ = ' '; // Space between digits if (ii < Ds) // Printable ASCII or a dot *A++ = ((' ' <= D[ii]) && (D[ii] <= '~')) ? (D[ii]) : ('.'); } *A++ = '\''; // Bracket at the end of the string. *A++ = '\n'; *A = 0; // NUL terminate the string. fputs(STR, F); } delete [] STR; } canu-1.6/src/AS_UTL/hexDump.H000066400000000000000000000020021314437614700156530ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P.
Walenz beginning on 2016-SEP-15 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef HEXDUMP_H #define HEXDUMP_H void hexDump(FILE *F, void *DATA, uint32 DATAlen, uint32 W=32); #endif // HEXDUMP_H canu-1.6/src/AS_UTL/intervalList.H000066400000000000000000000437271314437614700167440ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libutil/intervalList.H * * Modifications by: * * Brian P. Walenz from 2003-AUG-11 to 2003-AUG-13 * are Copyright 2003 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2004-APR-21 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUL-12 to 2014-APR-11 * are Copyright 2005-2007,2010,2013-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-OCT-07 to 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
Walenz beginning on 2015-OCT-30 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef INTERVALLIST_H #define INTERVALLIST_H #include <algorithm> // iNum - lo, hi - coordinates of the interval // iVal - va - data stored at each interval // uint32 - ct - number of elements in this interval // - when merged, needs function that converts multiple iVal and a uint32 into a single iVal template <class iNum, class iVal> class _intervalPair { public: iNum lo; iNum hi; uint32 ct; // Number of source intervals iVal va; // Value at this interval; default is 1 bool operator<(const _intervalPair &that) const { if (lo != that.lo) return(lo < that.lo); return(hi < that.hi); }; }; template <class iNum, class iVal> class intervalDepthRegions { public: iNum pos; // Position of the change in depth iVal change; // The value associated with this object; added or subtracted from 'va'. bool open; // If true, the start of a new interval bool operator<(const intervalDepthRegions &that) const { if (pos != that.pos) return(pos < that.pos); return(open > that.open); }; }; template <class iNum, class iVal=uint32> class intervalList { public: intervalList(uint32 initialSize=32) { _isSorted = true; _isMerged = true; _listLen = 0; _listMax = initialSize; _list = new _intervalPair<iNum, iVal> [_listMax]; }; // Takes as input an unmerged intervalList, returns a new set of intervals, one // for each 'depth'. Two intervals, (1,4) and (2,6), would return 'depths': // 1,2,1 bgn=1, end=2, depth=1 // 2,4,2 // 4,6,1 // intervalList(intervalList<iNum, iVal> &IL) { _isSorted = false; _isMerged = false; _listLen = 0; _listMax = 0; _list = 0L; depth(IL); }; intervalList(intervalDepthRegions<iNum, iVal> *id, uint32 idlen) { _isSorted = false; _isMerged = false; _listLen = 0; _listMax = 0; _list = 0L; #ifdef _GLIBCXX_PARALLEL // Don't use the parallel sort, not with the expense of starting threads.
__gnu_sequential::sort(id, id + idlen); #else std::sort(id, id + idlen); #endif computeDepth(id, idlen); }; ~intervalList() { delete [] _list; }; intervalList &operator=(intervalList &src); void clear(void) { _isSorted = true; _isMerged = true; _listLen = 0; } void add(iNum position, iNum length, iVal value=0); void sort(void); void merge(iNum minOverlap=0); // Merge overlapping regions void merge(intervalList *IL); // Insert IL into this list void filterShort(iNum minLength); void intersect(intervalList &A, intervalList &B); uint32 overlapping(iNum lo, iNum hi, uint32 *&intervals, uint32 &intervalsLen, uint32 &intervalsMax); // Populates this intervalList with regions in A that are completely // contained in a region in B. // // Both A and B call merge(). // void contained(intervalList &A, intervalList &B); void invert(iNum lo, iNum hi); void depth(intervalList &A); uint32 numberOfIntervals(void) { return(_listLen); }; iNum sumOfLengths(void) { iNum len = 0; uint32 i = numberOfIntervals(); if (i > 0) while (i--) len += _list[i].hi - _list[i].lo; return(len); }; iNum &lo(uint32 i) { return(_list[i].lo); }; iNum &hi(uint32 i) { return(_list[i].hi); }; uint32 &count(uint32 i) { return(_list[i].ct); }; // Number of source intervals. uint32 &depth(uint32 i) { return(_list[i].ct); }; // Depth, if converted. iVal &value(uint32 i) { return(_list[i].va); }; // Value or sum of values. 
private: void computeDepth(intervalDepthRegions *id, uint32 idlen); bool _isSorted; bool _isMerged; uint32 _listMax; uint32 _listLen; _intervalPair *_list; }; template intervalList & intervalList::operator=(intervalList &src) { _isSorted = src._isSorted; _isMerged = src._isMerged; if (_listMax < src._listMax) { delete [] _list; _listMax = src._listMax; _list = new _intervalPair [_listMax]; } _listLen = src._listLen; memcpy(_list, src._list, _listLen * sizeof(_intervalPair)); return(*this); } template void intervalList::add(iNum position, iNum length, iVal val) { if (_listLen >= _listMax) { _listMax *= 2; _intervalPair *l = new _intervalPair [_listMax]; memcpy(l, _list, sizeof(_intervalPair) * _listLen); delete [] _list; _list = l; } _list[_listLen].lo = position; _list[_listLen].hi = position + length; _list[_listLen].ct = 1; _list[_listLen].va = val; // Could optimize, and search the list to see if these are false, // but that's rather expensive. _isSorted = false; _isMerged = false; _listLen++; } template void intervalList::sort(void) { if (_isSorted) return; if (_listLen > 1) #ifdef _GLIBCXX_PARALLEL // Don't use the parallel sort, not with the expense of starting threads. __gnu_sequential::sort(_list, _list + _listLen); #else std::sort(_list, _list + _listLen); #endif _isSorted = true; } template void intervalList::merge(iNum minOverlap) { uint32 thisInterval = 0; uint32 nextInterval = 1; if (_isMerged) return; sort(); while (nextInterval < _listLen) { if ((_list[thisInterval].lo == 0) && (_list[thisInterval].hi == 0)) { // Our interval is empty. Copy in the interval we are // examining and move to the next. // XXX This is probably useless, thisInterval should always be // valid. 
_list[thisInterval].lo = _list[nextInterval].lo; _list[thisInterval].hi = _list[nextInterval].hi; _list[thisInterval].ct = _list[nextInterval].ct; _list[thisInterval].va = _list[nextInterval].va; _list[nextInterval].lo = 0; _list[nextInterval].hi = 0; nextInterval++; } else { // This interval is valid. See if it overlaps with the next // interval. bool intersects = false; if ((_list[thisInterval].lo <= _list[nextInterval].lo) && (_list[nextInterval].hi <= _list[thisInterval].hi)) // next is contained in this intersects = true; if (_list[thisInterval].hi - minOverlap >= _list[nextInterval].lo) // next has thick overlap to this intersects = true; if (intersects) { // Got an intersection. // Merge nextInterval into thisInterval -- the hi range // is extended if the nextInterval range is larger. // if (_list[thisInterval].hi < _list[nextInterval].hi) _list[thisInterval].hi = _list[nextInterval].hi; _list[thisInterval].ct += _list[nextInterval].ct; _list[thisInterval].va += _list[nextInterval].va; // Clear the just merged nextInterval and move to the next one. // _list[nextInterval].lo = 0; _list[nextInterval].hi = 0; _list[nextInterval].ct = 0; _list[nextInterval].va = 0; nextInterval++; } else { // No intersection. Move along. Nothing to see here. // If there is a gap between the target and the examine (we // must have merged sometime in the past), copy examine to // the next target. 
thisInterval++; if (thisInterval != nextInterval) { _list[thisInterval].lo = _list[nextInterval].lo; _list[thisInterval].hi = _list[nextInterval].hi; _list[thisInterval].ct = _list[nextInterval].ct; _list[thisInterval].va = _list[nextInterval].va; } nextInterval++; } } } if (thisInterval+1 < _listLen) _listLen = thisInterval + 1; _isMerged = true; } template void intervalList::merge(intervalList *IL) { for (uint32 i=0; i_listLen; i++) add(IL->_list[i].lo, IL->_list[i].hi - IL->_list[i].lo); } template void intervalList::filterShort(iNum minLength) { uint32 fr = 0; uint32 to = 0; for (; fr < _listLen; fr++) { if (hi(fr) - lo(fr) <= minLength) continue; if (to != fr) _list[to] = _list[fr]; to++; } _listLen = to; } template void intervalList::invert(iNum invlo, iNum invhi) { merge(); // Create a new list to store the inversion // uint32 invLen = 0; uint32 invMax = _listLen + 2; _intervalPair *inv = new _intervalPair [invMax]; // Add the zeroth and only? if (_listLen == 0) { inv[invLen].lo = invlo; inv[invLen].hi = invhi; inv[invLen].ct = 1; inv[invLen].va = 0; invLen++; } // Add the first, then the pieces, then the last // else { if (invlo < _list[0].lo) { inv[invLen].lo = invlo; inv[invLen].hi = _list[0].lo; inv[invLen].ct = 1; inv[invLen].va = 0; invLen++; } for (uint32 i=1; i<_listLen; i++) { if (_list[i-1].hi < _list[i].lo) { inv[invLen].lo = _list[i-1].hi; inv[invLen].hi = _list[i].lo; inv[invLen].ct = 1; inv[invLen].va = 0; invLen++; } } if (_list[_listLen-1].hi < invhi) { inv[invLen].lo = _list[_listLen-1].hi; inv[invLen].hi = invhi; inv[invLen].ct = 1; inv[invLen].va = 0; invLen++; } } assert(invLen <= invMax); // Nuke the old list, swap in the new one delete [] _list; _list = inv; _listLen = invLen; _listMax = invMax; } template void intervalList::intersect(intervalList &A, intervalList &B) { A.merge(); B.merge(); uint32 ai = 0; uint32 bi = 0; while ((ai < A.numberOfIntervals()) && (bi < B.numberOfIntervals())) { uint32 al = A.lo(ai); uint32 ah = A.hi(ai); 
uint32 bl = B.lo(bi); uint32 bh = B.hi(bi); uint32 nl = 0; uint32 nh = 0; // If they intersect, make a new region // if ((al <= bl) && (bl < ah)) { nl = bl; nh = (ah < bh) ? ah : bh; } if ((bl <= al) && (al < bh)) { nl = al; nh = (ah < bh) ? ah : bh; } if (nl < nh) add(nl, nh - nl); // Advance the list with the earlier region. // if (ah < bh) { // A ends before B ai++; } else if (ah > bh) { // B ends before A bi++; } else { // Exactly the same ending! ai++; bi++; } } } // Populates an array with the intervals that are within the supplied interval. // // Naive implementation that is easy to verify (and that works on an unsorted list). // template uint32 intervalList::overlapping(iNum rangelo, iNum rangehi, uint32 *&intervals, uint32 &intervalsLen, uint32 &intervalsMax) { if (intervals == 0L) { intervalsMax = 256; intervals = new uint32 [intervalsMax]; } intervalsLen = 0; for (uint32 i=0; i<_listLen; i++) { if ((rangelo <= _list[i].hi) && (rangehi >= _list[i].lo)) { if (intervalsLen >= intervalsMax) { intervalsMax *= 2; uint32 *X = new uint32 [intervalsMax]; memcpy(X, intervals, sizeof(uint32) * intervalsLen); delete [] intervals; intervals = X; } intervals[intervalsLen++] = i; } } return(intervalsLen); } template void intervalList::contained(intervalList &A, intervalList &B) { A.merge(); B.merge(); uint32 ai = 0; uint32 bi = 0; while ((ai < A.numberOfIntervals()) && (bi < B.numberOfIntervals())) { uint32 al = A.lo(ai); uint32 ah = A.hi(ai); uint32 bl = B.lo(bi); uint32 bh = B.hi(bi); // If A is contained in B, make a new region. // if ((bl <= al) && (ah <= bh)) add(bl, bh - bl); #if 0 if ((al <= bl) && (bh <= ah)) add(al, ah - al); #endif // Advance the list with the earlier region. // if (ah < bh) { // A ends before B ai++; } else if (ah > bh) { // B ends before A bi++; } else { // Exactly the same ending! 
ai++; bi++; } } } template void intervalList::depth(intervalList &IL) { uint32 idlen = IL.numberOfIntervals() * 2; intervalDepthRegions *id = new intervalDepthRegions [idlen]; for (uint32 i=0; i void intervalList::computeDepth(intervalDepthRegions *id, uint32 idlen) { // No intervals input? No intervals output. _listLen = 0; if (idlen == 0) return; // Sort by coordinate. #ifdef _GLIBCXX_PARALLEL // Don't use the parallel sort, not with the expense of starting threads. __gnu_sequential::sort(id, id + idlen); #else std::sort(id, id + idlen); #endif // Allocate the (maximum possible) depth of coverage intervals if (_listMax < idlen) { delete [] _list; _listMax = idlen; _list = new _intervalPair [_listMax]; } // The first thing must be an 'open' event. If not, someone supplied a negative length to the // original intervalList. Or, possibly, two zero-length intervals. if (id[0].open == false) for (uint32 ii=0; ii 1) && (_list[_listLen-1].hi == _list[_listLen].lo) && (_list[_listLen-1].ct == _list[_listLen].ct) && (_list[_listLen-1].va == _list[_listLen].va)) { _list[_listLen-1].hi = _list[_listLen].hi; _listLen--; } #if 0 fprintf(stderr, "id[%2d] - list[%u] = lo=%u hi=%u ct=%u va=%d\n", i, _listLen, _list[_listLen].lo, _list[_listLen].hi, _list[_listLen].ct, _list[_listLen].va); #endif } assert(_listLen > 0); assert(_listLen <= _listMax); } #endif // INTERVALLIST_H canu-1.6/src/AS_UTL/intervalListTest.C000066400000000000000000000041301314437614700175600ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. 
* * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-MAR-10 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include #include #include #include #include typedef int8_t int8; typedef int16_t int16; typedef int32_t int32; typedef int64_t int64; typedef uint8_t uint8; typedef uint16_t uint16; typedef uint32_t uint32; typedef uint64_t uint64; #include "intervalList.H" // g++ -o intervalListTest -I.. -I. intervalListTest.C int main(int argc, char **argv) { if (0) { intervalList t1; t1.add(0, 10); t1.add(11,7); t1.add(20, 8); fprintf(stderr, "BEFORE:\n"); for (uint32 ii=0; ii il; il.add(1, -1); intervalList de(il); il.merge(); for (uint32 ii=0; ii KMER_WORDS * 32) fprintf(stderr, "kMer size too large; increase KMER_WORDS in libbio/kmer.H\n"), exit(1); _compressionLength = new uint32 [_merSize]; for (uint32 z=0; z<_merSize; z++) _compressionLength[z] = (cm) ? 0 : 1; if (tm) { _merStorage = new kMer [_templateLength * 2]; _merSizeValid = new uint32 [_templateLength]; for (uint32 i=0; i<2*_templateLength; i++) { _merStorage[i].setMerSize(_merSize); _merStorage[i].setMerSpan(_templateSpan); } // VERY IMPORTANT! Offset the valid length to adjust for the // template that every mer except the first is starting in the // middle of. 
// for (uint32 i=0; i<_templateLength; i++) _merSizeValid[i] = _merSize - i; } else { _merStorage = new kMer [2]; _merSizeValid = new uint32 [1]; _merStorage[0].setMerSize(_merSize); _merStorage[1].setMerSize(_merSize); _merSizeValid[0] = _merSizeValidZero; if (cm) { _merStorage[0].setMerSpan(0); _merStorage[1].setMerSpan(0); } } _fMer = _merStorage + 0; _rMer = _merStorage + 1; } kMerBuilder::~kMerBuilder() { delete [] _merSizeValid; delete [] _merStorage; delete [] _compressionLength; delete [] _template; } void kMerBuilder::clear(bool clearMer) { // Contiguous mers _merSizeValid[0] = _merSizeValidZero; // Compressed mers if (_compression) { _compressionIndex = 0; _compressionFirstIndex = 0; _compressionCurrentLength = 0; for (uint32 z=0; z<_merSize; z++) _compressionLength[z] = 0; _merStorage[0].setMerSpan(0); _merStorage[1].setMerSpan(0); } // Spaced mers if (_template) { for (uint32 i=0; i<2*_templateLength; i++) _merStorage[i].clear(); for (uint32 i=0; i<_templateLength; i++) _merSizeValid[i] = _merSize - i; _templatePos = 0; _templateMer = 0; _templateFirst = 1; } if (clearMer) { _fMer->clear(); _rMer->clear(); } } // // The addBase methods add a single base (cf - forward, cr - complemented) to // the mer. They return true if another base is needed to finish the mer, and // false if the mer is complete. // bool kMerBuilder::addBaseContiguous(uint64 cf, uint64 cr) { // Not a valid base, reset the mer to empty, and request more bases // (this is a slightly optimized version of clear()). if (cf & (unsigned char)0xfc) { clear(false); //_merSizeValid[0] = _merSizeValidZero; return(true); } // Add the base to both mers. *_fMer += cf; *_rMer -= cr; // If there aren't enough bases, request another one. if (_merSizeValid[0] + 1 < _merSizeValidIs) { _merSizeValid[0]++; return(true); } return(false); // Good! Don't need another letter. } bool kMerBuilder::addBaseCompressed(uint64 cf, uint64 cr) { // Not a valid base, reset the mer to empty, and request more bases.
// if (cf & (unsigned char)0xfc) { clear(); return(true); } uint64 lb = theFMer().endOfMer(2); // Last base in the mer uint32 ms = theFMer().getMerSpan(); // Span BEFORE adding the mer if (_merSizeValid[0] <= _merSizeValidZero) lb = 9; // No valid last base (should probably be ~uint64ZERO, but that screws up diagnostic output) #ifdef DEBUGCOMP fprintf(stderr, "kMerBuilder::addBaseCompressed()-- lb="uint64FMT" cf="uint64FMT" ms=" F_U32 " ccl=" F_U32 " lvl=" F_U32 "\n", lb, cf, ms, _compressionCurrentLength, _compression); #endif // Always add one to the current length. When we started, it // was 0. This represents the length AFTER adding the base. // _compressionCurrentLength++; // If the lastbase is the same as the one we want to add (and // there IS a last base), and we've seen too many of these, // remember we've seen another letter in the run, and don't add // it. Request another letter. // if ((lb == cf) && // last is the same as this (_compressionCurrentLength > _compression)) { // run is already too big _compressionLength[_compressionIndex]++; _fMer->setMerSpan(ms + 1); _rMer->setMerSpan(ms + 1); #ifdef DEBUGCOMP fprintf(stderr, "kMerBuilder::addBaseCompressed()-- COMPRESSED currentIdx=%u first=%u", _compressionIndex, _compressionFirstIndex); for (uint32 x=0, y=_compressionFirstIndex; x<_merSize; x++) { fprintf(stderr, " %u(%d)", _compressionLength[y], y); y = (y + 1) % _merSize; } fprintf(stderr, "\n"); #endif return(true); } // Else, it's a new run (a different letter) or our run isn't // big enough to compress and we need to add the duplicate // letter. *_fMer += cf; *_rMer -= cr; // If this is a new letter, propagate the current length to the first letter in this run. That // way, when that letter is popped off the mer, we automagically update our span to include only // as many letters as are here. // // 01234567890 // // E.g. 
For sequence TATTTTTTAGT (that's 6 T's) with a mersize of 3 and compression 2, we'd have // mers with position: // // TATTTTTTAGT // #1 TAT position 0 (with lengths 1, 1, 1) uncompressed mer TAT // #2 ATT position 1 (with lengths 1, 1, 1) ATT // #3 TTA position 6 (with lengths 5, 1, 1) TTTTTTA // #4 TAG position 7 TAG // #5 AGT position 8 AGT // // In #2, because the length so far (1) is not >= the compression (2) we add a new base and // return. // // In #3, the current length is >= the compression, so we keep stuffing on T's and incrementing // the last length, stopping when we get the A. We now propagate the current length to the first // letter in the run. Special case, if the first letter in the run is the first letter in the // mer, we need to immediately update the span. #ifdef DEBUGCOMP fprintf(stderr, "kMerBuilder::addBaseCompressed()-- ADDNEWBASE currentIdx=%u first=%u", _compressionIndex, _compressionFirstIndex); for (uint32 x=0, y=_compressionFirstIndex; x<_merSize; x++) { fprintf(stderr, " %u(%d)", _compressionLength[y], y); y = (y + 1) % _merSize; } fprintf(stderr, "\n"); #endif // If we added a new letter, transfer the run-length count to the first letter in the previous // run. In the above example, when we built the run, the lengths are (1, 1, 5). That is, all // compression occurred on the last letter. When we shift off that first letter, we want to // remove as much of the run as possible. 
if (lb != cf) { if (_compressionFirstIndex != _compressionIndex) { _compressionLength[_compressionFirstIndex] += _compressionLength[_compressionIndex] - 1; _compressionLength[_compressionIndex] = 1; } _compressionFirstIndex = (_compressionIndex + 1) % _merSize; _compressionCurrentLength = 1; } _compressionIndex = (_compressionIndex + 1) % _merSize; ms -= _compressionLength[_compressionIndex]; // subtract the count for the letter we just shifted out #ifdef DEBUGCOMP fprintf(stderr, "kMerBuilder::addBaseCompressed()-- ADDNEWBASE shifted out at idx=" F_U32 " with " F_U32 " positions; final span " F_U32 "\n", _compressionIndex, _compressionLength[_compressionIndex], ms + 1); #endif _compressionLength[_compressionIndex] = 1; // one letter at this position _fMer->setMerSpan(ms + 1); _rMer->setMerSpan(ms + 1); // If there aren't enough bases, request another one. if (_merSizeValid[0] + 1 < _merSizeValidIs) { _merSizeValid[0]++; return(true); } return(false); // Good! Don't need another letter. } bool kMerBuilder::addBaseSpaced(uint64 cf, uint64 cr) { #ifdef DEBUGSPACE fprintf(stderr, "add %c templatePos=%u templateMer=%u\n", ch, _templatePos, _templateMer); #endif // We always advance the templatePos, unfortunately, we need to // use the current value throughout this function. If there // was a single return point, we could advance immediately // before returning. // uint32 tp = _templatePos; _templatePos = (_templatePos + 1) % _templateLength; // If we get an invalid letter, set all mers that would have // had a letter added to be broken. // if (cf & (unsigned char)0xfc) { for (uint32 m=0; m<_templateLength; m++) { uint32 tppos = (tp + _templateLength - m) % _templateLength; if (_template[tppos] == 1) { // Reset to 'zero', but make it skip over any remaining // positions in the current template. 
// _merSizeValid[m] = _merSizeValidZero + tppos - _templateLength + 1; #ifdef DEBUGSPACE fprintf(stderr, "-- invalid letter, reset mer %u to valid %u (mersizevalidzero=%u ttpos=%u templatelength=%u)\n", m, _merSizeValid[m], _merSizeValidZero, tppos, _templateLength); #endif } } if (_templateFirst == 0) _templateMer = (_templateMer + 1) % _templateLength; return(true); } // We have a valid letter, and add it to all the mers that the // template allows. // for (uint32 m=0; m<_templateLength; m++) { uint32 tppos = (tp + _templateLength - m) % _templateLength; if (_template[tppos] == 1) { _merStorage[2*m+0] += cf; _merStorage[2*m+1] -= cr; if (_merSizeValid[m] < _merSizeValidIs) _merSizeValid[m]++; #ifdef DEBUGSPACE fprintf(stderr, "push %c onto %d (at template %u) length = %u %s\n", ch, m, (tp + _templateLength - m) % _templateLength, _merSizeValid[m], (_merSizeValid[m] >= _merSizeValidIs) ? "complete" : ""); #endif } else if (_merSizeValid[m] <= _merSizeValidZero) { // The template doesn't want us to add a letter to the mer, // but we're adjusting for an aborted template, and we're // counting template positions (not just non-zero template // positions) when adjusting. // _merSizeValid[m]++; } } // If the current mer isn't long enough, we move to the next mer, // and request another letter. // if (_merSizeValid[_templateMer] < _merSizeValidIs) { if (_templateFirst == 0) _templateMer = (_templateMer + 1) % _templateLength; #ifdef DEBUGSPACE fprintf(stderr, "-- too short -- need more templateMer=%u templateFirst=%u\n", _templateMer, _templateFirst); #endif return(true); } // On startup, _templateMer is always 0 (the first mer) until // it is long enough to be a valid mer. Then, we clear // _templateFirst so that we can start advancing through mers. // Update the f and r pointers to the correct mers, advance our // template to the next, and terminate. 
// _fMer = _merStorage + 2 * _templateMer + 0; _rMer = _merStorage + 2 * _templateMer + 1; #ifdef DEBUGSPACE fprintf(stderr, "-- valid! (templateMer = %u)\n", _templateMer); #endif _templateFirst = 0; _templateMer = (_templateMer + 1) % _templateLength; return(false); // Good! Don't need another letter. } bool kMerBuilder::addBaseCompressedSpaced(uint64 UNUSED(cf), uint64 UNUSED(cr)) { fprintf(stderr, "kMerBuilder::addBaseCompressedSpace()-- Compressed and spaced mers not supported.\n"); exit(1); } canu-1.6/src/AS_UTL/kMer.H000066400000000000000000000230371314437614700151520ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libbio/kmer.H * * Modifications by: * * Brian P. Walenz from 2005-MAY-19 to 2014-APR-11 * are Copyright 2005-2008,2010-2011,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ // Copyright (c) 2005 J. 
Craig Venter Institute // Author: Brian Walenz // // This program is free software; you can redistribute it and/or modify // it under the terms of the GNU General Public License as published by // the Free Software Foundation; either version 2 of the License, or // (at your option) any later version. // // This program is distributed in the hope that it will be useful, // but WITHOUT ANY WARRANTY; without even the implied warranty of // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the // GNU General Public License for more details. // // You should have received (LICENSE.txt) a copy of the GNU General Public // License along with this program; if not, write to the Free Software // Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA // A 'simple' kMer datastructure. #ifndef BIO_KMER_H #define BIO_KMER_H // The maximum size of a mer. You get 32 bases per word, so // KMER_WORDS=4 will get you up to a 128-mer. // #define KMER_WORDS 1 #include "AS_global.H" #include "bitPackedFile.H" #include "dnaAlphabets.H" // The kmer class is in two mutually exclusive implementations. The first implementation // uses one word to store up to 32-mers. The second implementation uses multiple words // to store arbitrarily large kmers. // // Incomplete and out of date description of the methods: // // // Reverse all the words, reverse and complement the bases in // // each word, then shift right to align the edge. // // // virtual kMerInterface &reverseComplement(void) = 0; // virtual void clear(void); // // // Construct a mer by shifting bases onto the end: // // += shifts onto the right end // // -= shifts onto the left end // // // virtual void operator+=(uint64 x) = 0; // virtual void operator-=(uint64 x) = 0; // // // used by merStream at least // // // virtual void mask(bool) = 0; // // // Return the mer, as a 64-bit integer. If the mer is more than // // 32-bases long, then the left-most (the earliest, the start, etc) // // bases are used. 
// // // virtual operator uint64 () const = 0; // // // These are written/read in 5'endian, which isn't the most natural // // implementation. It's done this way to keep the sequence in // // order (e.g., the merStreamFile). Don't change the order. // // // // On the other hand, the implementation (of write anyway) is // // basically the same as merToString(). // // // // Takes an optional number of BITS to write, pulled from the // // END of the mer. // // // virtual void writeToBitPackedFile(bitPackedFile *BPF, uint32 numBits=0) const = 0; // virtual void readFromBitPackedFile(bitPackedFile *BPF, uint32 numBits=0) = 0; // // // Returns a sub-mer from either the start (left end) or the end // // (right end) of the mer. The sub-mer must be at most 64 bits // // long. Yes, BITS. // // // // The start is difficult, because it can span multiple words. The // // end is always in the first word. // // // virtual uint64 startOfMer(uint32 bits) const = 0; // virtual uint64 endOfMer(uint32 bits) const = 0; // // // Set 'numbits' bits from (the end of) 'val' at bit position 'pos' // // in the mer. This is wildly low-level, but merylStreamReader // // needs it. // // // // The position is measured from the right end. // // (0, 8, X) would copy the bits 7 to 0 of X to bits 7 to 0 of the mer. // // // // Argh! Can't use set/getDecodedValue because that is doing things in the wrong order. // // // // Meryl // // // virtual uint64 getWord(uint32 wrd) const = 0; // { return(MERWORD(wrd)); }; // virtual void setWord(uint32 wrd, uint64 val) = 0; // { MERWORD(wrd) = val; }; // // // Show the mer as ascii // // // // Doesn't print the last full word, if it's on the word boundary // // // // We build the string right to left, print any partial word first, // // then print whole words until we run out of words to print.
// // // virtual char *merToString(char *instr) const = 0; // #if KMER_WORDS == 1 #include "kMerTiny.H" typedef kMerTiny kMer; #else #include "kMerHuge.H" typedef kMerHuge kMer; #endif #undef DEBUGADDBASE #undef DEBUGCOMP #undef DEBUGSPACE class kMerBuilder { public: kMerBuilder(uint32 ms=0, uint32 cm=0, char *tm=0L); ~kMerBuilder(); // Clear all mer data, reset state to as just after construction. void clear(bool clearMer=true); // Returns true if we need another base to finish the mer. This // only occurs for compressed mers, if we are in a homopolymer run. // private: bool addBaseContiguous(uint64 cf, uint64 cr); bool addBaseCompressed(uint64 cf, uint64 cr); bool addBaseSpaced(uint64 cf, uint64 cr); bool addBaseCompressedSpaced(uint64 cf, uint64 cr); public: bool addBase(char ch) { uint64 cf = alphabet.letterToBits(ch); uint64 cr = alphabet.letterToBits(alphabet.complementSymbol(ch)); #ifdef DEBUGADDBASE fprintf(stderr, "addBase() %c\n", ch); #endif if (_style == 0) return(addBaseContiguous(cf, cr)); if (_style == 1) return(addBaseCompressed(cf, cr)); if (_style == 2) return(addBaseSpaced(cf, cr)); if (_style == 3) return(addBaseCompressedSpaced(cf, cr)); fprintf(stderr, "kMerBuilder::addBase()-- Invalid mer type %d.\n", _style); exit(1); return(false); } void mask(void) { _fMer->mask(true); _rMer->mask(false); }; kMer const &theFMer(void) { return(*_fMer); }; kMer const &theRMer(void) { return(*_rMer); }; kMer const &theCMer(void) { return((theFMer() < theRMer()) ? theFMer() : theRMer()); }; uint32 merSize(void) { return(_merSize); }; uint32 templateSpan(void) { return(_templateSpan); }; uint32 baseSpan(uint32 b) { return(_compressionLength[(_compressionIndex + 1 + b) % _merSize]);; }; private: // Style of builder we are uint32 _style; // Amount of the mer that has valid sequence. Sigh. I really needed a signed value here -- // where negative values mean that we first have to get to the end of the template that was // invalid, then we need to build a new mer. 
// // And, yes, just simply making it signed leads to all sorts of compiler warnings about // comparing signed and unsigned. And I've been here before, and those warnings just propagate // endlessly. Just go away, Mr. Smartypants. // // Details: when building spaced seeds, if we hit an N in the middle of the template, we need to // invalidate the mer, but not start building a new mer until we exhaust the current template. // The example is template=1101. Suppose we hit an N at the second 1. We set the merSizeValid // to 0, and proceed. When we push on the base for the last 1 in the template, we'd increment // the merSizeValid. The first two 1's in the template would now create a mer big enough to be // valid, and we'd return it -- but now the template we're using is 0111. // // _merSizeValid is offset by _merSize (e.g., the true valid size is _merSizeValid - _merSize). // _merSizeValidIs is the size _merSizeValid needs to be in order for it to be valid. // Similarly, _merSizeValidZero is the value of zero (currently this is equal to _merSize).
// uint32 _merSize; // Desired number of bases in the mer uint32 *_merSizeValid; // Actual number of bases in the mer uint32 _merSizeValidZero; // Definition of 'zero' bases in the mer uint32 _merSizeValidIs; // Definition of 'full' bases in the mer // An array of mers, we allocate all mers in one block kMer *_merStorage; // Pointer to the currently active mer kMer *_fMer; kMer *_rMer; // For compression uint32 _compression; uint32 _compressionIndex; // index into cL[] that is the last base in the mer uint32 _compressionFirstIndex; // index into cL[] that is the first base in a run uint32 *_compressionLength; // one per base uint32 _compressionCurrentLength; // For templates uint32 _templateSpan; // # of 0's and 1's in the template uint32 _templateLength; // length of the pattern in the template char *_template; // character string template uint32 _templatePos; // position we are building in the template uint32 _templateMer; // the mer we should output next uint32 _templateFirst; // if true, we're still building the initial mer }; #endif // BIO_KMER_H canu-1.6/src/AS_UTL/kMerHuge.H000066400000000000000000000302501314437614700157560ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libbio/kmerhuge.H * * Modifications by: * * Brian P. Walenz from 2007-SEP-13 to 2014-APR-11 * are Copyright 2007-2008,2014 J. 
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #define MERWORD(N) _md[N] class kMerHuge { public: kMerHuge(uint32 ms=uint32ZERO) { setMerSize(ms); clear(); }; ~kMerHuge() { }; void setMerSize(uint32 ms); uint32 getMerSize(void) const { return(_merSize); }; void setMerSpan(uint32 ms) { _merSpan = ms; }; uint32 getMerSpan(void) const { return(_merSpan); }; kMerHuge &reverseComplement(void) { for (uint32 i=0, j=KMER_WORDS-1; i> 2) & 0x3333333333333333llu) | ((MERWORD(i) << 2) & 0xccccccccccccccccllu); MERWORD(i) = ((MERWORD(i) >> 4) & 0x0f0f0f0f0f0f0f0fllu) | ((MERWORD(i) << 4) & 0xf0f0f0f0f0f0f0f0llu); MERWORD(i) = ((MERWORD(i) >> 8) & 0x00ff00ff00ff00ffllu) | ((MERWORD(i) << 8) & 0xff00ff00ff00ff00llu); MERWORD(i) = ((MERWORD(i) >> 16) & 0x0000ffff0000ffffllu) | ((MERWORD(i) << 16) & 0xffff0000ffff0000llu); MERWORD(i) = ((MERWORD(i) >> 32) & 0x00000000ffffffffllu) | ((MERWORD(i) << 32) & 0xffffffff00000000llu); MERWORD(i) ^= 0xffffffffffffffffllu; } *this >>= KMER_WORDS * 64 - 2 * _merSize; return(*this); }; void clear(void) { for (uint32 i=0; i>=(uint32 x) { // thisWord, the word we shift bits into // thatWord, the word we shift bits out of // shift, the number of bits we shift // uint32 thisWord = 0; uint32 thatWord = x >> 6; uint32 shift = x & uint32MASK(6); // Do an initial word-size shift, to reduce the shift amount to // be less than wordsize. Fill any shifted-out words with zero. 
// if (thatWord) { while (thatWord < KMER_WORDS) MERWORD(thisWord++) = MERWORD(thatWord++); while (thisWord < KMER_WORDS) MERWORD(thisWord++) = 0; } // Do bit-size shift, of adjacent words // thisWord = 0; thatWord = 1; MERWORD(thisWord) >>= shift; while (thatWord < KMER_WORDS) { MERWORD(thisWord++) |= MERWORD(thatWord) << (64 - shift); MERWORD(thatWord++) >>= shift; } }; void operator<<=(uint32 x) { uint32 thisWord = KMER_WORDS; uint32 thatWord = KMER_WORDS - (x >> 6); uint32 shift = x & uint32MASK(6); if (thatWord != KMER_WORDS) { while (thatWord > 0) MERWORD(--thisWord) = MERWORD(--thatWord); while (thisWord > 0) MERWORD(--thisWord) = 0; } thisWord = KMER_WORDS; thatWord = KMER_WORDS - 1; MERWORD(thisWord-1) <<= shift; while (thatWord > 0) { --thisWord; --thatWord; MERWORD(thisWord) |= MERWORD(thatWord) >> (64 - shift); MERWORD(thatWord) <<= shift; } }; public: void operator+=(uint64 x) { *this <<= 2; assert((x & 0xfc) == 0); MERWORD(0) |= x & uint64NUMBER(0x3); }; void operator-=(uint64 x) { *this >>= 2; assert((x & 0xfc) == 0); MERWORD(_lastWord) |= (x & uint64NUMBER(0x3)) << _lastShift; }; void mask(bool full) { MERWORD(_maskWord) &= _mask; if (full) for (uint32 x=_maskWord+1; x r.MERWORD(i)) return(false); } return(false); }; bool operator>(kMerHuge const &r) const { for (uint32 i=KMER_WORDS; i--; ) { if (MERWORD(i) > r.MERWORD(i)) return(true); if (MERWORD(i) < r.MERWORD(i)) return(false); } return(false); }; bool operator<=(kMerHuge const &r) const { for (uint32 i=KMER_WORDS; i--; ) { if (MERWORD(i) < r.MERWORD(i)) return(true); if (MERWORD(i) > r.MERWORD(i)) return(false); } return(true); }; bool operator>=(kMerHuge const &r) const { for (uint32 i=KMER_WORDS; i--; ) { if (MERWORD(i) > r.MERWORD(i)) return(true); if (MERWORD(i) < r.MERWORD(i)) return(false); } return(true); }; int qsort_less(kMerHuge const &r) const { for (uint32 i=KMER_WORDS; i--; ) { if (MERWORD(i) < r.MERWORD(i)) return(-1); if (MERWORD(i) > r.MERWORD(i)) return(1); } return(0); }; 
public: operator uint64 () const {return(MERWORD(0));}; public: // these should work generically for both big and small void writeToBitPackedFile(bitPackedFile *BPF, uint32 numBits=0) const { if (numBits == 0) numBits = _merSize << 1; uint32 lastWord = numBits >> 6; if ((numBits & uint32MASK(6)) == 0) lastWord++; if (numBits & uint32MASK(6)) BPF->putBits(MERWORD(lastWord), numBits & uint32MASK(6)); while (lastWord > 0) { lastWord--; BPF->putBits(MERWORD(lastWord), 64); } }; void readFromBitPackedFile(bitPackedFile *BPF, uint32 numBits=0) { if (numBits == 0) numBits = _merSize << 1; uint32 lastWord = numBits >> 6; if ((numBits & uint32MASK(6)) == 0) lastWord++; if (numBits & uint32MASK(6)) MERWORD(lastWord) = BPF->getBits(numBits & uint32MASK(6)); while (lastWord > 0) { lastWord--; MERWORD(lastWord) = BPF->getBits(64); } }; public: // these should work generically for both big and small void setBits(uint32 pos, uint32 numbits, uint64 val) { uint32 wrd = pos >> 6; uint32 bit = pos & 0x3f; val &= uint64MASK(numbits); if (wrd >= KMER_WORDS) { fprintf(stderr, "kMer::setBits()-- ERROR: tried to set pos=" F_U32 " numbits=" F_U32 " larger than KMER_WORDS=%d\n", pos, numbits, KMER_WORDS), exit(1); } // If we have enough space in the word for the bits, replace // those bits in the word. Otherwise we need to split the value // into two pieces, and add to the end of the first word and the // start of the second. 
if (64 - bit >= numbits) { MERWORD(wrd) &= ~(uint64MASK(numbits) << bit); MERWORD(wrd) |= val << bit; } else { if (wrd+1 >= KMER_WORDS) { fprintf(stderr, "kMer::setBits()-- ERROR: tried to set pos=" F_U32 " numbits=" F_U32 " larger than KMER_WORDS=%d\n", pos, numbits, KMER_WORDS), exit(1); } uint32 b1 = 64 - bit; // bits in the first word uint32 b2 = numbits - b1; // bits in the second word MERWORD(wrd) &= ~(uint64MASK(b1) << bit); MERWORD(wrd) |= (val & uint64MASK(b1)) << bit; MERWORD(wrd+1) &= ~(uint64MASK(b2)); MERWORD(wrd+1) |= (val >> b1) & uint64MASK(b2); } }; uint64 getBits(uint32 pos, uint32 numbits) const { uint64 val = uint64ZERO; uint32 wrd = pos >> 6; uint32 bit = pos & 0x3f; if (wrd >= KMER_WORDS) { fprintf(stderr, "kMer::getBits()-- ERROR: tried to get pos=" F_U32 " numbits=" F_U32 " larger than KMER_WORDS=%d\n", pos, numbits, KMER_WORDS), exit(1); } if (64 - bit >= numbits) { val = MERWORD(wrd) >> bit; } else { if (wrd+1 >= KMER_WORDS) { fprintf(stderr, "kMer::getBits()-- ERROR: tried to get pos=" F_U32 " numbits=" F_U32 " larger than KMER_WORDS=%d\n", pos, numbits, KMER_WORDS), exit(1); } uint32 b1 = 64 - bit; // bits in the first word uint32 b2 = numbits - b1; // bits in the second word val = MERWORD(wrd) >> (64-b1); val |= (MERWORD(wrd+1) & uint64MASK(b2)) << b1; } val &= uint64MASK(numbits); return(val); }; public: // these should work generically for both big and small uint64 startOfMer(uint32 bits) const { return(getBits((_merSize << 1) - bits, bits)); }; uint64 endOfMer(uint32 bits) const { return(MERWORD(0) & uint64MASK(bits)); }; public: // these should work generically for both big and small uint64 getWord(uint32 wrd) const { return(MERWORD(wrd)); }; void setWord(uint32 wrd, uint64 val) { MERWORD(wrd) = val; }; public: char *merToString(char *instr) const; private: uint64 _md[KMER_WORDS]; // The _merSize is always the number of letters in the mer -- if we // are a spaced seed, it is the weight. 
// uint32 _merSize; uint32 _merSpan; // The mask is used to make sure the mer has only _merSize bases // set -- we can get more than that if we shift to the left. The // _maskWord is the word that we want to mask: // uint64 _mask; uint32 _maskWord; // For operator-=() (add a base to the left end) we need to know // what the last word is, and how far to shift the bits. // // _lastWord -- the last word that contains bases // _lastShift -- the amount we need to shift left to put bits 0 and 1 // into the last base uint32 _lastWord; uint32 _lastShift; }; inline void kMerHuge::setMerSize(uint32 ms) { _merSize = ms; _merSpan = ms; _lastWord = (2 * ms - 2) / 64; _lastShift = (2 * ms - 2) % 64; _mask = uint64ZERO; _maskWord = _merSize / 32; // Filled whole words with the mer, the mask is special-cased // to clear the whole next word, unless there is no whole next // word, then it does nothing on the last word. // // Otherwise, we can construct the mask as usual. // if ((_merSize % 32) == 0) { if (_maskWord >= KMER_WORDS) { _maskWord = KMER_WORDS - 1; _mask = ~uint64ZERO; } else { _maskWord = _merSize / 32; _mask = uint64ZERO; } } else { _mask = uint64MASK((_merSize % 32) << 1); } if (_maskWord >= KMER_WORDS) { fprintf(stderr, "kMer::setMerSize()-- ERROR! 
Desired merSize of " F_U32 " larger than\n", _merSize); fprintf(stderr, " available storage space (KMER_WORDS=%d, max merSize %d).\n", KMER_WORDS, KMER_WORDS*32); exit(1); } } inline char * uint64ToMerString(uint32 ms, uint64 mer, char *str) { for (uint32 i=0; i> (2*i)) & 0x03]; str[ms] = 0; return(str); } inline char * kMerHuge::merToString(char *instr) const { uint32 lastWord = _merSize >> 5; char *str = instr; if ((_merSize & uint32MASK(6)) == 0) lastWord++; if (_merSize & uint32MASK(5)) { uint64ToMerString(_merSize & uint32MASK(5), MERWORD(lastWord), str); str += _merSize & uint32MASK(5); } while (lastWord > 0) { lastWord--; uint64ToMerString(32, MERWORD(lastWord), str); str += 32; } return(instr); }; canu-1.6/src/AS_UTL/kMerTiny.H000066400000000000000000000127621314437614700160210ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libbio/kmertiny.H * * Modifications by: * * Brian P. Walenz from 2007-SEP-13 to 2014-APR-11 * are Copyright 2007-2008,2010,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
Walenz beginning on 2015-OCT-29 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ class kMerTiny { public: kMerTiny(uint32 ms=uint32ZERO) { setMerSize(ms); clear(); }; ~kMerTiny() { }; void setMerSize(uint32 ms); uint32 getMerSize(void) const { return(_merSize); }; void setMerSpan(uint32 ms) { _merSpan = ms; }; uint32 getMerSpan(void) const { return(_merSpan); }; kMerTiny &reverseComplement(void) { // Reverse the mer _md = ((_md >> 2) & 0x3333333333333333llu) | ((_md << 2) & 0xccccccccccccccccllu); _md = ((_md >> 4) & 0x0f0f0f0f0f0f0f0fllu) | ((_md << 4) & 0xf0f0f0f0f0f0f0f0llu); _md = ((_md >> 8) & 0x00ff00ff00ff00ffllu) | ((_md << 8) & 0xff00ff00ff00ff00llu); _md = ((_md >> 16) & 0x0000ffff0000ffffllu) | ((_md << 16) & 0xffff0000ffff0000llu); _md = ((_md >> 32) & 0x00000000ffffffffllu) | ((_md << 32) & 0xffffffff00000000llu); // Complement the bases _md ^= 0xffffffffffffffffllu; // Shift and mask out the bases not in the mer _md >>= 64 - _merSize * 2; _md &= uint64MASK(_merSize * 2); return(*this); }; void clear(void) { _md = uint64ZERO; }; void smallest(void) { clear(); }; void largest(void) { clear(); reverseComplement(); }; private: void operator>>=(uint32 x) { _md >>= x; }; void operator<<=(uint32 x) { _md <<= x; }; public: void operator+=(uint64 x) { *this <<= 2; assert((x & 0xfc) == 0); _md |= x & uint64NUMBER(0x3); }; void operator-=(uint64 x) { *this >>= 2; assert((x & 0xfc) == 0); _md |= (x & uint64NUMBER(0x3)) << _lastShift; }; public: void mask(bool) { _md &= _mask; }; public: bool operator!=(kMerTiny const &r) const { return(_md != r._md); }; bool operator==(kMerTiny const &r) const { return(_md == r._md); }; bool operator< (kMerTiny const &r) const { return(_md < r._md); }; bool operator> (kMerTiny const &r) const { return(_md > r._md); }; bool operator<=(kMerTiny const &r) const { 
return(_md <= r._md); }; bool operator>=(kMerTiny const &r) const { return(_md >= r._md); }; int qsort_less(kMerTiny const &r) const { if (_md < r._md) return(-1); if (_md > r._md) return( 1); return(0); }; public: operator uint64 () const {return(_md);}; public: void writeToBitPackedFile(bitPackedFile *BPF, uint32 UNUSED(numBits)=0) const { BPF->putBits(_md, _merSize << 1); }; void readFromBitPackedFile(bitPackedFile *BPF, uint32 UNUSED(numBits)=0) { _md = BPF->getBits(_merSize << 1); }; public: void setBits(uint32 pos, uint32 numbits, uint64 val) { _md &= ~(uint64MASK(numbits) << pos); _md |= val << pos; }; uint64 getBits(uint32 pos, uint32 numbits) const { return((_md >> pos) & uint64MASK(numbits)); }; public: uint64 startOfMer(uint32 bits) const { return(getBits((_merSize << 1) - bits, bits)); }; uint64 endOfMer(uint32 bits) const { return(_md & uint64MASK(bits)); }; public: uint64 getWord(uint32 UNUSED(wrd)) const { return(_md); }; void setWord(uint32 UNUSED(wrd), uint64 val) { _md = val; }; public: char *merToString(char *instr) const; private: uint64 _md; // The _merSize is always the number of letters in the mer -- if we // are a spaced seed, it is the weight. // uint32 _merSize; uint32 _merSpan; // The mask is used to make sure the mer has only _merSize bases // set -- we can get more than that if we shift to the left. The // uint64 _mask; // For operator-=() (add a base to the left end) we need to know // what the last word is, and how far to shift the bits. 
// uint32 _lastShift; }; inline void kMerTiny::setMerSize(uint32 ms) { _merSize = ms; _merSpan = ms; _lastShift = (2 * ms - 2) % 64; _mask = uint64MASK(_merSize << 1); } inline char * kMerTiny::merToString(char *str) const { for (uint32 i=0; i<_merSize; i++) str[_merSize-i-1] = alphabet.bitsToLetter((_md >> (2*i)) & 0x03); str[_merSize] = 0; return(str); } #if 0 // This is not used anywhere, and is probably quite stale inline uint64 stringToMer(uint32 ms, char *str) { uint64 mer = 0L; for (uint32 i=0; i The libbacktrace library is provided under a BSD license. See the source files for the exact license text. canu-1.6/src/AS_UTL/libbacktrace/atomic.c000066400000000000000000000056171314437614700201750ustar00rootroot00000000000000/* atomic.c -- Support for atomic functions if not present. Copyright (C) 2013-2016 Free Software Foundation, Inc. Written by Ian Lance Taylor, Google. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: (1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. (2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. (3) The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
   IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT,
   INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
   (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
   HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
   STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
   POSSIBILITY OF SUCH DAMAGE.  */

#include "config.h"

#include <sys/types.h>

#include "backtrace.h"
#include "backtrace-supported.h"

#include "internal.h"

/* This file holds implementations of the atomic functions that are
   used if the host compiler has the sync functions but not the atomic
   functions, as is true of versions of GCC before 4.7.  */

#if !defined (HAVE_ATOMIC_FUNCTIONS) && defined (HAVE_SYNC_FUNCTIONS)

/* Do an atomic load of a pointer.  */

void *
backtrace_atomic_load_pointer (void *arg)
{
  void **pp;
  void *p;

  pp = (void **) arg;
  p = *pp;
  while (!__sync_bool_compare_and_swap (pp, p, p))
    p = *pp;
  return p;
}

/* Do an atomic load of an int.  */

int
backtrace_atomic_load_int (int *p)
{
  int i;

  i = *p;
  while (!__sync_bool_compare_and_swap (p, i, i))
    i = *p;
  return i;
}

/* Do an atomic store of a pointer.  */

void
backtrace_atomic_store_pointer (void *arg, void *p)
{
  void **pp;
  void *old;

  pp = (void **) arg;
  old = *pp;
  while (!__sync_bool_compare_and_swap (pp, old, p))
    old = *pp;
}

/* Do an atomic store of a size_t value.  */

void
backtrace_atomic_store_size_t (size_t *p, size_t v)
{
  size_t old;

  old = *p;
  while (!__sync_bool_compare_and_swap (p, old, v))
    old = *p;
}

/* Do an atomic store of an int value.  */

void
backtrace_atomic_store_int (int *p, int v)
{
  int old;

  old = *p;
  while (!__sync_bool_compare_and_swap (p, old, v))
    old = *p;
}

#endif
canu-1.6/src/AS_UTL/libbacktrace/backtrace-supported.h

/* backtrace-supported.h.in -- Whether stack backtrace is supported.
   Copyright (C) 2012-2016 Free Software Foundation, Inc.
   Written by Ian Lance Taylor, Google.

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions are
   met:

   (1) Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.

   (2) Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in
   the documentation and/or other materials provided with the
   distribution.

   (3) The name of the author may not be used to endorse or promote
   products derived from this software without specific prior written
   permission.

   THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
   IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
   WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
   DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT,
   INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
   (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
   HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
   STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
   POSSIBILITY OF SUCH DAMAGE.  */

/* The file backtrace-supported.h.in is used by configure to generate
   the file backtrace-supported.h.
   The file backtrace-supported.h may be #include'd to see whether the
   backtrace library will be able to get a backtrace and produce
   symbolic information.  */

/* BACKTRACE_SUPPORTED will be #define'd as 1 if the backtrace library
   should work, 0 if it will not.  Libraries may #include this to make
   other arrangements.  */

#define BACKTRACE_SUPPORTED 1

/* BACKTRACE_USES_MALLOC will be #define'd as 1 if the backtrace
   library will call malloc as it works, 0 if it will call mmap
   instead.  This may be used to determine whether it is safe to call
   the backtrace functions from a signal handler.  In general this
   only applies to calls like backtrace and backtrace_pcinfo.  It does
   not apply to backtrace_simple, which never calls malloc.  It does
   not apply to backtrace_print, which always calls fprintf and
   therefore malloc.  */

#define BACKTRACE_USES_MALLOC 0

/* BACKTRACE_SUPPORTS_THREADS will be #define'd as 1 if the backtrace
   library is configured with threading support, 0 if not.  If this is
   0, the threaded parameter to backtrace_create_state must be passed
   as 0.  */

#define BACKTRACE_SUPPORTS_THREADS 1

/* BACKTRACE_SUPPORTS_DATA will be #define'd as 1 if the
   backtrace_syminfo will work for variables.  It will always work for
   functions.  */

#define BACKTRACE_SUPPORTS_DATA 1
canu-1.6/src/AS_UTL/libbacktrace/backtrace.c

/* backtrace.c -- Entry point for stack backtrace library.
   Copyright (C) 2012-2016 Free Software Foundation, Inc.
   Written by Ian Lance Taylor, Google.

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions are
   met:

   (1) Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
   (2) Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in
   the documentation and/or other materials provided with the
   distribution.

   (3) The name of the author may not be used to endorse or promote
   products derived from this software without specific prior written
   permission.

   THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
   IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
   WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
   DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT,
   INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
   (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
   HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
   STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
   POSSIBILITY OF SUCH DAMAGE.  */

#include "config.h"

#include <sys/types.h>

#include "unwind.h"
#include "backtrace.h"
#include "internal.h"

/* The main backtrace_full routine.  */

/* Data passed through _Unwind_Backtrace.  */

struct backtrace_data
{
  /* Number of frames to skip.  */
  int skip;
  /* Library state.  */
  struct backtrace_state *state;
  /* Callback routine.  */
  backtrace_full_callback callback;
  /* Error callback routine.  */
  backtrace_error_callback error_callback;
  /* Data to pass to callback routines.  */
  void *data;
  /* Value to return from backtrace_full.  */
  int ret;
  /* Whether there is any memory available.  */
  int can_alloc;
};

/* Unwind library callback routine.  This is passed to
   _Unwind_Backtrace.
*/ static _Unwind_Reason_Code unwind (struct _Unwind_Context *context, void *vdata) { struct backtrace_data *bdata = (struct backtrace_data *) vdata; uintptr_t pc; int ip_before_insn = 0; #ifdef HAVE_GETIPINFO pc = _Unwind_GetIPInfo (context, &ip_before_insn); #else pc = _Unwind_GetIP (context); #endif if (bdata->skip > 0) { --bdata->skip; return _URC_NO_REASON; } if (!ip_before_insn) --pc; if (!bdata->can_alloc) bdata->ret = bdata->callback (bdata->data, pc, NULL, 0, NULL); else bdata->ret = backtrace_pcinfo (bdata->state, pc, bdata->callback, bdata->error_callback, bdata->data); if (bdata->ret != 0) return _URC_END_OF_STACK; return _URC_NO_REASON; } /* Get a stack backtrace. */ int backtrace_full (struct backtrace_state *state, int skip, backtrace_full_callback callback, backtrace_error_callback error_callback, void *data) { struct backtrace_data bdata; void *p; bdata.skip = skip + 1; bdata.state = state; bdata.callback = callback; bdata.error_callback = error_callback; bdata.data = data; bdata.ret = 0; /* If we can't allocate any memory at all, don't try to produce file/line information. */ p = backtrace_alloc (state, 4096, NULL, NULL); if (p == NULL) bdata.can_alloc = 0; else { backtrace_free (state, p, 4096, NULL, NULL); bdata.can_alloc = 1; } _Unwind_Backtrace (unwind, &bdata); return bdata.ret; } canu-1.6/src/AS_UTL/libbacktrace/backtrace.h000066400000000000000000000203131314437614700206330ustar00rootroot00000000000000/* backtrace.h -- Public header file for stack backtrace library. Copyright (C) 2012-2016 Free Software Foundation, Inc. Written by Ian Lance Taylor, Google. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: (1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 
   (2) Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in
   the documentation and/or other materials provided with the
   distribution.

   (3) The name of the author may not be used to endorse or promote
   products derived from this software without specific prior written
   permission.

   THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
   IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
   WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
   DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT,
   INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
   (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
   HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
   STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
   POSSIBILITY OF SUCH DAMAGE.  */

#ifndef BACKTRACE_H
#define BACKTRACE_H

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#ifdef __cplusplus
extern "C" {
#endif

/* The backtrace state.  This struct is intentionally not defined in
   the public interface.  */

struct backtrace_state;

/* The type of the error callback argument to backtrace functions.
   This function, if not NULL, will be called for certain error cases.
   The DATA argument is passed to the function that calls this one.
   The MSG argument is an error message.  The ERRNUM argument, if
   greater than 0, holds an errno value.  The MSG buffer may become
   invalid after this function returns.

   As a special case, the ERRNUM argument will be passed as -1 if no
   debug info can be found for the executable, but the function
   requires debug info (e.g., backtrace_full, backtrace_pcinfo).  The
   MSG in this case will be something along the lines of "no debug
   info".
Similarly, ERRNUM will be passed as -1 if there is no symbol table, but the function requires a symbol table (e.g., backtrace_syminfo). This may be used as a signal that some other approach should be tried. */ typedef void (*backtrace_error_callback) (void *data, const char *msg, int errnum); /* Create state information for the backtrace routines. This must be called before any of the other routines, and its return value must be passed to all of the other routines. FILENAME is the path name of the executable file; if it is NULL the library will try system-specific path names. If not NULL, FILENAME must point to a permanent buffer. If THREADED is non-zero the state may be accessed by multiple threads simultaneously, and the library will use appropriate atomic operations. If THREADED is zero the state may only be accessed by one thread at a time. This returns a state pointer on success, NULL on error. If an error occurs, this will call the ERROR_CALLBACK routine. */ extern struct backtrace_state *backtrace_create_state ( const char *filename, int threaded, backtrace_error_callback error_callback, void *data); /* The type of the callback argument to the backtrace_full function. DATA is the argument passed to backtrace_full. PC is the program counter. FILENAME is the name of the file containing PC, or NULL if not available. LINENO is the line number in FILENAME containing PC, or 0 if not available. FUNCTION is the name of the function containing PC, or NULL if not available. This should return 0 to continuing tracing. The FILENAME and FUNCTION buffers may become invalid after this function returns. */ typedef int (*backtrace_full_callback) (void *data, uintptr_t pc, const char *filename, int lineno, const char *function); /* Get a full stack backtrace. SKIP is the number of frames to skip; passing 0 will start the trace with the function calling backtrace_full. DATA is passed to the callback routine. 
If any call to CALLBACK returns a non-zero value, the stack backtrace stops, and backtrace returns that value; this may be used to limit the number of stack frames desired. If all calls to CALLBACK return 0, backtrace returns 0. The backtrace_full function will make at least one call to either CALLBACK or ERROR_CALLBACK. This function requires debug info for the executable. */ extern int backtrace_full (struct backtrace_state *state, int skip, backtrace_full_callback callback, backtrace_error_callback error_callback, void *data); /* The type of the callback argument to the backtrace_simple function. DATA is the argument passed to simple_backtrace. PC is the program counter. This should return 0 to continue tracing. */ typedef int (*backtrace_simple_callback) (void *data, uintptr_t pc); /* Get a simple backtrace. SKIP is the number of frames to skip, as in backtrace. DATA is passed to the callback routine. If any call to CALLBACK returns a non-zero value, the stack backtrace stops, and backtrace_simple returns that value. Otherwise backtrace_simple returns 0. The backtrace_simple function will make at least one call to either CALLBACK or ERROR_CALLBACK. This function does not require any debug info for the executable. */ extern int backtrace_simple (struct backtrace_state *state, int skip, backtrace_simple_callback callback, backtrace_error_callback error_callback, void *data); /* Print the current backtrace in a user readable format to a FILE. SKIP is the number of frames to skip, as in backtrace_full. Any error messages are printed to stderr. This function requires debug info for the executable. */ extern void backtrace_print (struct backtrace_state *state, int skip, FILE *); /* Given PC, a program counter in the current program, call the callback function with filename, line number, and function name information. This will normally call the callback function exactly once. 
However, if the PC happens to describe an inlined call, and the debugging information contains the necessary information, then this may call the callback function multiple times. This will make at least one call to either CALLBACK or ERROR_CALLBACK. This returns the first non-zero value returned by CALLBACK, or 0. */ extern int backtrace_pcinfo (struct backtrace_state *state, uintptr_t pc, backtrace_full_callback callback, backtrace_error_callback error_callback, void *data); /* The type of the callback argument to backtrace_syminfo. DATA and PC are the arguments passed to backtrace_syminfo. SYMNAME is the name of the symbol for the corresponding code. SYMVAL is the value and SYMSIZE is the size of the symbol. SYMNAME will be NULL if no error occurred but the symbol could not be found. */ typedef void (*backtrace_syminfo_callback) (void *data, uintptr_t pc, const char *symname, uintptr_t symval, uintptr_t symsize); /* Given ADDR, an address or program counter in the current program, call the callback information with the symbol name and value describing the function or variable in which ADDR may be found. This will call either CALLBACK or ERROR_CALLBACK exactly once. This returns 1 on success, 0 on failure. This function requires the symbol table but does not require the debug info. Note that if the symbol table is present but ADDR could not be found in the table, CALLBACK will be called with a NULL SYMNAME argument. Returns 1 on success, 0 on error. */ extern int backtrace_syminfo (struct backtrace_state *state, uintptr_t addr, backtrace_syminfo_callback callback, backtrace_error_callback error_callback, void *data); #ifdef __cplusplus } /* End extern "C". */ #endif #endif canu-1.6/src/AS_UTL/libbacktrace/config.h000066400000000000000000000065671314437614700202000ustar00rootroot00000000000000/* config.h. Generated from config.h.in by configure. */ /* config.h.in. Generated from configure.ac by autoheader. 
*/

/* ELF size: 32 or 64 */
#ifdef __APPLE__
#define BACKTRACE_ELF_SIZE unused
#else
#define BACKTRACE_ELF_SIZE 64
#endif

/* Define to 1 if you have the __atomic functions */
#define HAVE_ATOMIC_FUNCTIONS 1

/* Define to 1 if you have the declaration of `strnlen', and to 0 if you
   don't. */
#define HAVE_DECL_STRNLEN 1

/* Define to 1 if you have the <dlfcn.h> header file. */
#define HAVE_DLFCN_H 1

/* Define if dl_iterate_phdr is available. */
#ifndef __APPLE__
#define HAVE_DL_ITERATE_PHDR 1
#endif

/* Define to 1 if you have the fcntl function */
#define HAVE_FCNTL 1

/* Define if getexecname is available. */
/* #undef HAVE_GETEXECNAME */

/* Define if _Unwind_GetIPInfo is available. */
#define HAVE_GETIPINFO 1

/* Define to 1 if you have the <inttypes.h> header file. */
#define HAVE_INTTYPES_H 1

/* Define to 1 if you have the <link.h> header file. */
#ifndef __APPLE__
#define HAVE_LINK_H 1
#endif

/* Define to 1 if you have the <memory.h> header file. */
#define HAVE_MEMORY_H 1

/* Define to 1 if you have the <stdint.h> header file. */
#define HAVE_STDINT_H 1

/* Define to 1 if you have the <stdlib.h> header file. */
#define HAVE_STDLIB_H 1

/* Define to 1 if you have the <strings.h> header file. */
#define HAVE_STRINGS_H 1

/* Define to 1 if you have the <string.h> header file. */
#define HAVE_STRING_H 1

/* Define to 1 if you have the __sync functions */
#define HAVE_SYNC_FUNCTIONS 1

/* Define to 1 if you have the <sys/mman.h> header file. */
#define HAVE_SYS_MMAN_H 1

/* Define to 1 if you have the <sys/stat.h> header file. */
#define HAVE_SYS_STAT_H 1

/* Define to 1 if you have the <sys/types.h> header file. */
#define HAVE_SYS_TYPES_H 1

/* Define to 1 if you have the <unistd.h> header file. */
#define HAVE_UNISTD_H 1

/* Define to the sub-directory in which libtool stores uninstalled
   libraries. */
#define LT_OBJDIR ".libs/"

/* Define to the address where bug reports for this package should be sent. */
#define PACKAGE_BUGREPORT ""

/* Define to the full name of this package. */
#define PACKAGE_NAME "package-unused"

/* Define to the full name and version of this package. */
#define PACKAGE_STRING "package-unused version-unused"

/* Define to the one symbol short name of this package. */
#define PACKAGE_TARNAME "libbacktrace"

/* Define to the home page for this package. */
#define PACKAGE_URL ""

/* Define to the version of this package. */
#define PACKAGE_VERSION "version-unused"

/* Define to 1 if you have the ANSI C header files. */
#define STDC_HEADERS 1

/* Enable extensions on AIX 3, Interix. */
#ifndef _ALL_SOURCE
# define _ALL_SOURCE 1
#endif
/* Enable GNU extensions on systems that have them. */
#ifndef _GNU_SOURCE
# define _GNU_SOURCE 1
#endif
/* Enable threading extensions on Solaris. */
#ifndef _POSIX_PTHREAD_SEMANTICS
# define _POSIX_PTHREAD_SEMANTICS 1
#endif
/* Enable extensions on HP NonStop. */
#ifndef _TANDEM_SOURCE
# define _TANDEM_SOURCE 1
#endif
/* Enable general extensions on Solaris. */
#ifndef __EXTENSIONS__
# define __EXTENSIONS__ 1
#endif

/* Define to 1 if on MINIX. */
/* #undef _MINIX */

/* Define to 2 if the system does not provide POSIX.1 features except with
   this defined. */
/* #undef _POSIX_1_SOURCE */

/* Define to 1 if you need to in order for `stat' and other things to work. */
/* #undef _POSIX_SOURCE */
canu-1.6/src/AS_UTL/libbacktrace/dwarf.c

/* dwarf.c -- Get file/line information from DWARF for backtraces.
   Copyright (C) 2012-2016 Free Software Foundation, Inc.
   Written by Ian Lance Taylor, Google.

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions are
   met:

   (1) Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.

   (2) Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in
   the documentation and/or other materials provided with the
   distribution.
   (3) The name of the author may not be used to endorse or promote
   products derived from this software without specific prior written
   permission.

   THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
   IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
   WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
   DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT,
   INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
   (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
   HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
   STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
   POSSIBILITY OF SUCH DAMAGE.  */

#include "config.h"

#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

#include "backtrace.h"
#include "internal.h"

/* DWARF constants.  */

enum dwarf_tag {
  DW_TAG_entry_point = 0x3,
  DW_TAG_compile_unit = 0x11,
  DW_TAG_inlined_subroutine = 0x1d,
  DW_TAG_subprogram = 0x2e,
};

enum dwarf_form {
  DW_FORM_addr = 0x1,
  DW_FORM_block2 = 0x3,
  DW_FORM_block4 = 0x4,
  DW_FORM_data2 = 0x5,
  DW_FORM_data4 = 0x6,
  DW_FORM_data8 = 0x07,
  DW_FORM_string = 0x08,
  DW_FORM_block = 0x09,
  DW_FORM_block1 = 0x0a,
  DW_FORM_data1 = 0x0b,
  DW_FORM_flag = 0x0c,
  DW_FORM_sdata = 0x0d,
  DW_FORM_strp = 0x0e,
  DW_FORM_udata = 0x0f,
  DW_FORM_ref_addr = 0x10,
  DW_FORM_ref1 = 0x11,
  DW_FORM_ref2 = 0x12,
  DW_FORM_ref4 = 0x13,
  DW_FORM_ref8 = 0x14,
  DW_FORM_ref_udata = 0x15,
  DW_FORM_indirect = 0x16,
  DW_FORM_sec_offset = 0x17,
  DW_FORM_exprloc = 0x18,
  DW_FORM_flag_present = 0x19,
  DW_FORM_ref_sig8 = 0x20,
  DW_FORM_GNU_addr_index = 0x1f01,
  DW_FORM_GNU_str_index = 0x1f02,
  DW_FORM_GNU_ref_alt = 0x1f20,
  DW_FORM_GNU_strp_alt = 0x1f21,
};

enum dwarf_attribute {
  DW_AT_name = 0x3,
  DW_AT_stmt_list = 0x10,
  DW_AT_low_pc = 0x11,
  DW_AT_high_pc = 0x12,
  DW_AT_comp_dir = 0x1b,
  DW_AT_abstract_origin = 0x31,
  DW_AT_specification = 0x47,
  DW_AT_ranges = 0x55,
  DW_AT_call_file = 0x58,
  DW_AT_call_line = 0x59,
  DW_AT_linkage_name = 0x6e,
  DW_AT_MIPS_linkage_name = 0x2007,
};

enum dwarf_line_number_op {
  DW_LNS_extended_op = 0x0,
  DW_LNS_copy = 0x1,
  DW_LNS_advance_pc = 0x2,
  DW_LNS_advance_line = 0x3,
  DW_LNS_set_file = 0x4,
  DW_LNS_set_column = 0x5,
  DW_LNS_negate_stmt = 0x6,
  DW_LNS_set_basic_block = 0x7,
  DW_LNS_const_add_pc = 0x8,
  DW_LNS_fixed_advance_pc = 0x9,
  DW_LNS_set_prologue_end = 0xa,
  DW_LNS_set_epilogue_begin = 0xb,
  DW_LNS_set_isa = 0xc,
};

enum dwarf_extended_line_number_op {
  DW_LNE_end_sequence = 0x1,
  DW_LNE_set_address = 0x2,
  DW_LNE_define_file = 0x3,
  DW_LNE_set_discriminator = 0x4,
};

#if defined(__MSDOS__) || defined(_WIN32) || defined(__OS2__) || defined (__CYGWIN__)
# define IS_DIR_SEPARATOR(c) ((c) == '/' || (c) == '\\')
# define HAS_DRIVE_SPEC(f) ((f)[0] && (f)[1] == ':')
# define IS_ABSOLUTE_PATH(f) (IS_DIR_SEPARATOR(f[0]) || HAS_DRIVE_SPEC(f))
#else
# define IS_DIR_SEPARATOR(c) ((c) == '/')
# define IS_ABSOLUTE_PATH(f) IS_DIR_SEPARATOR(f[0])
#endif

#if !defined(HAVE_DECL_STRNLEN) || !HAVE_DECL_STRNLEN

/* If strnlen is not declared, provide our own version.  */

static size_t
xstrnlen (const char *s, size_t maxlen)
{
  size_t i;

  for (i = 0; i < maxlen; ++i)
    if (s[i] == '\0')
      break;
  return i;
}

#define strnlen xstrnlen

#endif

/* A buffer to read DWARF info.  */

struct dwarf_buf
{
  /* Buffer name for error messages.  */
  const char *name;
  /* Start of the buffer.  */
  const unsigned char *start;
  /* Next byte to read.  */
  const unsigned char *buf;
  /* The number of bytes remaining.  */
  size_t left;
  /* Whether the data is big-endian.  */
  int is_bigendian;
  /* Error callback routine.  */
  backtrace_error_callback error_callback;
  /* Data for error_callback.  */
  void *data;
  /* Non-zero if we've reported an underflow error.  */
  int reported_underflow;
};

/* A single attribute in a DWARF abbreviation.  */

struct attr
{
  /* The attribute name.  */
  enum dwarf_attribute name;
  /* The attribute form.  */
  enum dwarf_form form;
};

/* A single DWARF abbreviation.  */

struct abbrev
{
  /* The abbrev code--the number used to refer to the abbrev.  */
  uint64_t code;
  /* The entry tag.  */
  enum dwarf_tag tag;
  /* Non-zero if this abbrev has child entries.  */
  int has_children;
  /* The number of attributes.  */
  size_t num_attrs;
  /* The attributes.  */
  struct attr *attrs;
};

/* The DWARF abbreviations for a compilation unit.  This structure
   only exists while reading the compilation unit.  Most DWARF readers
   seem to use a hash table to map abbrev ID's to abbrev entries.
   However, we primarily care about GCC, and GCC simply issues ID's in
   numerical order starting at 1.  So we simply keep a sorted vector,
   and try to just look up the code.  */

struct abbrevs
{
  /* The number of abbrevs in the vector.  */
  size_t num_abbrevs;
  /* The abbrevs, sorted by the code field.  */
  struct abbrev *abbrevs;
};

/* The different kinds of attribute values.  */

enum attr_val_encoding
{
  /* An address.  */
  ATTR_VAL_ADDRESS,
  /* An unsigned integer.  */
  ATTR_VAL_UINT,
  /* A signed integer.  */
  ATTR_VAL_SINT,
  /* A string.  */
  ATTR_VAL_STRING,
  /* An offset to other data in the containing unit.  */
  ATTR_VAL_REF_UNIT,
  /* An offset to other data within the .debug_info section.  */
  ATTR_VAL_REF_INFO,
  /* An offset to data in some other section.  */
  ATTR_VAL_REF_SECTION,
  /* A type signature.  */
  ATTR_VAL_REF_TYPE,
  /* A block of data (not represented).  */
  ATTR_VAL_BLOCK,
  /* An expression (not represented).  */
  ATTR_VAL_EXPR,
};

/* An attribute value.  */

struct attr_val
{
  /* How the value is stored in the field u.  */
  enum attr_val_encoding encoding;
  union
  {
    /* ATTR_VAL_ADDRESS, ATTR_VAL_UINT, ATTR_VAL_REF*.  */
    uint64_t uint;
    /* ATTR_VAL_SINT.  */
    int64_t sint;
    /* ATTR_VAL_STRING.  */
    const char *string;
    /* ATTR_VAL_BLOCK not stored.  */
  } u;
};

/* The line number program header.  */

struct line_header
{
  /* The version of the line number information.  */
  int version;
  /* The minimum instruction length.
*/ unsigned int min_insn_len; /* The maximum number of ops per instruction. */ unsigned int max_ops_per_insn; /* The line base for special opcodes. */ int line_base; /* The line range for special opcodes. */ unsigned int line_range; /* The opcode base--the first special opcode. */ unsigned int opcode_base; /* Opcode lengths, indexed by opcode - 1. */ const unsigned char *opcode_lengths; /* The number of directory entries. */ size_t dirs_count; /* The directory entries. */ const char **dirs; /* The number of filenames. */ size_t filenames_count; /* The filenames. */ const char **filenames; }; /* Map a single PC value to a file/line. We will keep a vector of these sorted by PC value. Each file/line will be correct from the PC up to the PC of the next entry if there is one. We allocate one extra entry at the end so that we can use bsearch. */ struct line { /* PC. */ uintptr_t pc; /* File name. Many entries in the array are expected to point to the same file name. */ const char *filename; /* Line number. */ int lineno; /* Index of the object in the original array read from the DWARF section, before it has been sorted. The index makes it possible to use Quicksort and maintain stability. */ int idx; }; /* A growable vector of line number information. This is used while reading the line numbers. */ struct line_vector { /* Memory. This is an array of struct line. */ struct backtrace_vector vec; /* Number of valid mappings. */ size_t count; }; /* A function described in the debug info. */ struct function { /* The name of the function. */ const char *name; /* If this is an inlined function, the filename of the call site. */ const char *caller_filename; /* If this is an inlined function, the line number of the call site. */ int caller_lineno; /* Map PC ranges to inlined functions. */ struct function_addrs *function_addrs; size_t function_addrs_count; }; /* An address range for a function. This maps a PC value to a specific function. 
*/ struct function_addrs { /* Range is LOW <= PC < HIGH. */ uint64_t low; uint64_t high; /* Function for this address range. */ struct function *function; }; /* A growable vector of function address ranges. */ struct function_vector { /* Memory. This is an array of struct function_addrs. */ struct backtrace_vector vec; /* Number of address ranges present. */ size_t count; }; /* A DWARF compilation unit. This only holds the information we need to map a PC to a file and line. */ struct unit { /* The first entry for this compilation unit. */ const unsigned char *unit_data; /* The length of the data for this compilation unit. */ size_t unit_data_len; /* The offset of UNIT_DATA from the start of the information for this compilation unit. */ size_t unit_data_offset; /* DWARF version. */ int version; /* Whether unit is DWARF64. */ int is_dwarf64; /* Address size. */ int addrsize; /* Offset into line number information. */ off_t lineoff; /* Primary source file. */ const char *filename; /* Compilation command working directory. */ const char *comp_dir; /* Absolute file name, only set if needed. */ const char *abs_filename; /* The abbreviations for this unit. */ struct abbrevs abbrevs; /* The fields above this point are read in during initialization and may be accessed freely. The fields below this point are read in as needed, and therefore require care, as different threads may try to initialize them simultaneously. */ /* PC to line number mapping. This is NULL if the values have not been read. This is (struct line *) -1 if there was an error reading the values. */ struct line *lines; /* Number of entries in lines. */ size_t lines_count; /* PC ranges to function. */ struct function_addrs *function_addrs; size_t function_addrs_count; }; /* An address range for a compilation unit. This maps a PC value to a specific compilation unit. 
Note that we invert the representation in DWARF: instead of
   listing the units and attaching a list of ranges, we list the
   ranges and have each one point to the unit.  This lets us do a
   binary search to find the unit.  */

struct unit_addrs
{
  /* Range is LOW <= PC < HIGH.  */
  uint64_t low;
  uint64_t high;
  /* Compilation unit for this address range.  */
  struct unit *u;
};

/* A growable vector of compilation unit address ranges.  */

struct unit_addrs_vector
{
  /* Memory.  This is an array of struct unit_addrs.  */
  struct backtrace_vector vec;
  /* Number of address ranges present.  */
  size_t count;
};

/* The information we need to map a PC to a file and line.  */

struct dwarf_data
{
  /* The data for the next file we know about.  */
  struct dwarf_data *next;
  /* The base address for this file.  */
  uintptr_t base_address;
  /* A sorted list of address ranges.  */
  struct unit_addrs *addrs;
  /* Number of address ranges in list.  */
  size_t addrs_count;
  /* The unparsed .debug_info section.  */
  const unsigned char *dwarf_info;
  size_t dwarf_info_size;
  /* The unparsed .debug_line section.  */
  const unsigned char *dwarf_line;
  size_t dwarf_line_size;
  /* The unparsed .debug_ranges section.  */
  const unsigned char *dwarf_ranges;
  size_t dwarf_ranges_size;
  /* The unparsed .debug_str section.  */
  const unsigned char *dwarf_str;
  size_t dwarf_str_size;
  /* Whether the data is big-endian or not.  */
  int is_bigendian;
  /* A vector used for function addresses.  We keep this here so that
     we can grow the vector as we read more functions.  */
  struct function_vector fvec;
};

/* Report an error for a DWARF buffer.  */

static void
dwarf_buf_error (struct dwarf_buf *buf, const char *msg)
{
  char b[200];

  snprintf (b, sizeof b, "%s in %s at %d",
            msg, buf->name, (int) (buf->buf - buf->start));
  buf->error_callback (buf->data, b, 0);
}

/* Require at least COUNT bytes in BUF.  Return 1 if all is well, 0 on
   error.
*/

static int
require (struct dwarf_buf *buf, size_t count)
{
  if (buf->left >= count)
    return 1;

  if (!buf->reported_underflow)
    {
      dwarf_buf_error (buf, "DWARF underflow");
      buf->reported_underflow = 1;
    }

  return 0;
}

/* Advance COUNT bytes in BUF.  Return 1 if all is well, 0 on
   error.  */

static int
advance (struct dwarf_buf *buf, size_t count)
{
  if (!require (buf, count))
    return 0;
  buf->buf += count;
  buf->left -= count;
  return 1;
}

/* Read one byte from BUF and advance 1 byte.  */

static unsigned char
read_byte (struct dwarf_buf *buf)
{
  const unsigned char *p = buf->buf;

  if (!advance (buf, 1))
    return 0;
  return p[0];
}

/* Read a signed char from BUF and advance 1 byte.  */

static signed char
read_sbyte (struct dwarf_buf *buf)
{
  const unsigned char *p = buf->buf;

  if (!advance (buf, 1))
    return 0;
  return (*p ^ 0x80) - 0x80;
}

/* Read a uint16 from BUF and advance 2 bytes.  */

static uint16_t
read_uint16 (struct dwarf_buf *buf)
{
  const unsigned char *p = buf->buf;

  if (!advance (buf, 2))
    return 0;
  if (buf->is_bigendian)
    return ((uint16_t) p[0] << 8) | (uint16_t) p[1];
  else
    return ((uint16_t) p[1] << 8) | (uint16_t) p[0];
}

/* Read a uint32 from BUF and advance 4 bytes.  */

static uint32_t
read_uint32 (struct dwarf_buf *buf)
{
  const unsigned char *p = buf->buf;

  if (!advance (buf, 4))
    return 0;
  if (buf->is_bigendian)
    return (((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16)
            | ((uint32_t) p[2] << 8) | (uint32_t) p[3]);
  else
    return (((uint32_t) p[3] << 24) | ((uint32_t) p[2] << 16)
            | ((uint32_t) p[1] << 8) | (uint32_t) p[0]);
}

/* Read a uint64 from BUF and advance 8 bytes.
*/

static uint64_t
read_uint64 (struct dwarf_buf *buf)
{
  const unsigned char *p = buf->buf;

  if (!advance (buf, 8))
    return 0;
  if (buf->is_bigendian)
    return (((uint64_t) p[0] << 56) | ((uint64_t) p[1] << 48)
            | ((uint64_t) p[2] << 40) | ((uint64_t) p[3] << 32)
            | ((uint64_t) p[4] << 24) | ((uint64_t) p[5] << 16)
            | ((uint64_t) p[6] << 8) | (uint64_t) p[7]);
  else
    return (((uint64_t) p[7] << 56) | ((uint64_t) p[6] << 48)
            | ((uint64_t) p[5] << 40) | ((uint64_t) p[4] << 32)
            | ((uint64_t) p[3] << 24) | ((uint64_t) p[2] << 16)
            | ((uint64_t) p[1] << 8) | (uint64_t) p[0]);
}

/* Read an offset from BUF and advance the appropriate number of
   bytes.  */

static uint64_t
read_offset (struct dwarf_buf *buf, int is_dwarf64)
{
  if (is_dwarf64)
    return read_uint64 (buf);
  else
    return read_uint32 (buf);
}

/* Read an address from BUF and advance the appropriate number of
   bytes.  */

static uint64_t
read_address (struct dwarf_buf *buf, int addrsize)
{
  switch (addrsize)
    {
    case 1:
      return read_byte (buf);
    case 2:
      return read_uint16 (buf);
    case 4:
      return read_uint32 (buf);
    case 8:
      return read_uint64 (buf);
    default:
      dwarf_buf_error (buf, "unrecognized address size");
      return 0;
    }
}

/* Return whether a value is the highest possible address, given the
   address size.  */

static int
is_highest_address (uint64_t address, int addrsize)
{
  switch (addrsize)
    {
    case 1:
      return address == (unsigned char) -1;
    case 2:
      return address == (uint16_t) -1;
    case 4:
      return address == (uint32_t) -1;
    case 8:
      return address == (uint64_t) -1;
    default:
      return 0;
    }
}

/* Read an unsigned LEB128 number.
*/

static uint64_t
read_uleb128 (struct dwarf_buf *buf)
{
  uint64_t ret;
  unsigned int shift;
  int overflow;
  unsigned char b;

  ret = 0;
  shift = 0;
  overflow = 0;
  do
    {
      const unsigned char *p;

      p = buf->buf;
      if (!advance (buf, 1))
        return 0;
      b = *p;
      if (shift < 64)
        ret |= ((uint64_t) (b & 0x7f)) << shift;
      else if (!overflow)
        {
          dwarf_buf_error (buf, "LEB128 overflows uint64_t");
          overflow = 1;
        }
      shift += 7;
    }
  while ((b & 0x80) != 0);

  return ret;
}

/* Read a signed LEB128 number.  */

static int64_t
read_sleb128 (struct dwarf_buf *buf)
{
  uint64_t val;
  unsigned int shift;
  int overflow;
  unsigned char b;

  val = 0;
  shift = 0;
  overflow = 0;
  do
    {
      const unsigned char *p;

      p = buf->buf;
      if (!advance (buf, 1))
        return 0;
      b = *p;
      if (shift < 64)
        val |= ((uint64_t) (b & 0x7f)) << shift;
      else if (!overflow)
        {
          dwarf_buf_error (buf, "signed LEB128 overflows uint64_t");
          overflow = 1;
        }
      shift += 7;
    }
  while ((b & 0x80) != 0);

  if ((b & 0x40) != 0 && shift < 64)
    val |= ((uint64_t) -1) << shift;

  return (int64_t) val;
}

/* Return the length of an LEB128 number.  */

static size_t
leb128_len (const unsigned char *p)
{
  size_t ret;

  ret = 1;
  while ((*p & 0x80) != 0)
    {
      ++p;
      ++ret;
    }
  return ret;
}

/* Free an abbreviations structure.  */

static void
free_abbrevs (struct backtrace_state *state, struct abbrevs *abbrevs,
              backtrace_error_callback error_callback, void *data)
{
  size_t i;

  for (i = 0; i < abbrevs->num_abbrevs; ++i)
    backtrace_free (state,
                    abbrevs->abbrevs[i].attrs,
                    abbrevs->abbrevs[i].num_attrs * sizeof (struct attr),
                    error_callback, data);
  backtrace_free (state,
                  abbrevs->abbrevs,
                  abbrevs->num_abbrevs * sizeof (struct abbrev),
                  error_callback, data);
  abbrevs->num_abbrevs = 0;
  abbrevs->abbrevs = NULL;
}

/* Read an attribute value.  Returns 1 on success, 0 on failure.  If
   the value can be represented as a uint64_t, sets *VAL and sets
   *IS_VALID to 1.  We don't try to store the value of other attribute
   forms, because we don't care about them.
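*/

As a standalone check of the variable-length encoding these readers parse, here is a minimal sketch (the helper `uleb128_demo` is hypothetical and not part of libbacktrace; it mirrors the loop in read_uleb128 without the dwarf_buf bookkeeping):

```c
#include <stdint.h>
#include <stddef.h>

/* Decode an unsigned LEB128 value from P, reading at most LEN bytes.
   Each byte contributes its low 7 bits, least significant group
   first; the high bit of a byte marks continuation.  Illustration
   only.  */
static uint64_t
uleb128_demo (const unsigned char *p, size_t len)
{
  uint64_t ret = 0;
  unsigned int shift = 0;
  size_t i;

  for (i = 0; i < len; ++i)
    {
      ret |= ((uint64_t) (p[i] & 0x7f)) << shift;
      shift += 7;
      if ((p[i] & 0x80) == 0)
        break;
    }
  return ret;
}
```

For example, the byte sequence { 0xe5, 0x8e, 0x26 } decodes to 624485, matching the worked example in the DWARF specification; a single byte with the high bit clear, such as 0x7f, decodes to itself (127).

/* (read_attribute follows)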
*/

static int
read_attribute (enum dwarf_form form, struct dwarf_buf *buf,
                int is_dwarf64, int version, int addrsize,
                const unsigned char *dwarf_str, size_t dwarf_str_size,
                struct attr_val *val)
{
  /* Avoid warnings about val.u.FIELD may be used uninitialized if
     this function is inlined.  The warnings aren't valid but can
     occur because the different fields are set and used
     conditionally.  */
  memset (val, 0, sizeof *val);

  switch (form)
    {
    case DW_FORM_addr:
      val->encoding = ATTR_VAL_ADDRESS;
      val->u.uint = read_address (buf, addrsize);
      return 1;
    case DW_FORM_block2:
      val->encoding = ATTR_VAL_BLOCK;
      return advance (buf, read_uint16 (buf));
    case DW_FORM_block4:
      val->encoding = ATTR_VAL_BLOCK;
      return advance (buf, read_uint32 (buf));
    case DW_FORM_data2:
      val->encoding = ATTR_VAL_UINT;
      val->u.uint = read_uint16 (buf);
      return 1;
    case DW_FORM_data4:
      val->encoding = ATTR_VAL_UINT;
      val->u.uint = read_uint32 (buf);
      return 1;
    case DW_FORM_data8:
      val->encoding = ATTR_VAL_UINT;
      val->u.uint = read_uint64 (buf);
      return 1;
    case DW_FORM_string:
      val->encoding = ATTR_VAL_STRING;
      val->u.string = (const char *) buf->buf;
      return advance (buf, strnlen ((const char *) buf->buf, buf->left) + 1);
    case DW_FORM_block:
      val->encoding = ATTR_VAL_BLOCK;
      return advance (buf, read_uleb128 (buf));
    case DW_FORM_block1:
      val->encoding = ATTR_VAL_BLOCK;
      return advance (buf, read_byte (buf));
    case DW_FORM_data1:
      val->encoding = ATTR_VAL_UINT;
      val->u.uint = read_byte (buf);
      return 1;
    case DW_FORM_flag:
      val->encoding = ATTR_VAL_UINT;
      val->u.uint = read_byte (buf);
      return 1;
    case DW_FORM_sdata:
      val->encoding = ATTR_VAL_SINT;
      val->u.sint = read_sleb128 (buf);
      return 1;
    case DW_FORM_strp:
      {
        uint64_t offset;

        offset = read_offset (buf, is_dwarf64);
        if (offset >= dwarf_str_size)
          {
            dwarf_buf_error (buf, "DW_FORM_strp out of range");
            return 0;
          }
        val->encoding = ATTR_VAL_STRING;
        val->u.string = (const char *) dwarf_str + offset;
        return 1;
      }
    case DW_FORM_udata:
      val->encoding = ATTR_VAL_UINT;
      val->u.uint = read_uleb128
(buf);
      return 1;
    case DW_FORM_ref_addr:
      val->encoding = ATTR_VAL_REF_INFO;
      if (version == 2)
        val->u.uint = read_address (buf, addrsize);
      else
        val->u.uint = read_offset (buf, is_dwarf64);
      return 1;
    case DW_FORM_ref1:
      val->encoding = ATTR_VAL_REF_UNIT;
      val->u.uint = read_byte (buf);
      return 1;
    case DW_FORM_ref2:
      val->encoding = ATTR_VAL_REF_UNIT;
      val->u.uint = read_uint16 (buf);
      return 1;
    case DW_FORM_ref4:
      val->encoding = ATTR_VAL_REF_UNIT;
      val->u.uint = read_uint32 (buf);
      return 1;
    case DW_FORM_ref8:
      val->encoding = ATTR_VAL_REF_UNIT;
      val->u.uint = read_uint64 (buf);
      return 1;
    case DW_FORM_ref_udata:
      val->encoding = ATTR_VAL_REF_UNIT;
      val->u.uint = read_uleb128 (buf);
      return 1;
    case DW_FORM_indirect:
      {
        uint64_t form;

        form = read_uleb128 (buf);
        return read_attribute ((enum dwarf_form) form, buf, is_dwarf64,
                               version, addrsize, dwarf_str, dwarf_str_size,
                               val);
      }
    case DW_FORM_sec_offset:
      val->encoding = ATTR_VAL_REF_SECTION;
      val->u.uint = read_offset (buf, is_dwarf64);
      return 1;
    case DW_FORM_exprloc:
      val->encoding = ATTR_VAL_EXPR;
      return advance (buf, read_uleb128 (buf));
    case DW_FORM_flag_present:
      val->encoding = ATTR_VAL_UINT;
      val->u.uint = 1;
      return 1;
    case DW_FORM_ref_sig8:
      val->encoding = ATTR_VAL_REF_TYPE;
      val->u.uint = read_uint64 (buf);
      return 1;
    case DW_FORM_GNU_addr_index:
      val->encoding = ATTR_VAL_REF_SECTION;
      val->u.uint = read_uleb128 (buf);
      return 1;
    case DW_FORM_GNU_str_index:
      val->encoding = ATTR_VAL_REF_SECTION;
      val->u.uint = read_uleb128 (buf);
      return 1;
    case DW_FORM_GNU_ref_alt:
      val->encoding = ATTR_VAL_REF_SECTION;
      val->u.uint = read_offset (buf, is_dwarf64);
      return 1;
    case DW_FORM_GNU_strp_alt:
      val->encoding = ATTR_VAL_REF_SECTION;
      val->u.uint = read_offset (buf, is_dwarf64);
      return 1;
    default:
      dwarf_buf_error (buf, "unrecognized DWARF form");
      return 0;
    }
}

/* Compare function_addrs for qsort.  When ranges are nested, make the
   smallest one sort last.
*/

static int
function_addrs_compare (const void *v1, const void *v2)
{
  const struct function_addrs *a1 = (const struct function_addrs *) v1;
  const struct function_addrs *a2 = (const struct function_addrs *) v2;

  if (a1->low < a2->low)
    return -1;
  if (a1->low > a2->low)
    return 1;
  if (a1->high < a2->high)
    return 1;
  if (a1->high > a2->high)
    return -1;
  return strcmp (a1->function->name, a2->function->name);
}

/* Compare a PC against a function_addrs for bsearch.  Note that if
   there are multiple ranges containing PC, which one will be returned
   is unpredictable.  We compensate for that in dwarf_fileline.  */

static int
function_addrs_search (const void *vkey, const void *ventry)
{
  const uintptr_t *key = (const uintptr_t *) vkey;
  const struct function_addrs *entry = (const struct function_addrs *) ventry;
  uintptr_t pc;

  pc = *key;
  if (pc < entry->low)
    return -1;
  else if (pc >= entry->high)
    return 1;
  else
    return 0;
}

/* Add a new compilation unit address range to a vector.  Returns 1 on
   success, 0 on failure.  */

static int
add_unit_addr (struct backtrace_state *state, uintptr_t base_address,
               struct unit_addrs addrs,
               backtrace_error_callback error_callback, void *data,
               struct unit_addrs_vector *vec)
{
  struct unit_addrs *p;

  /* Add in the base address of the module here, so that we can look
     up the PC directly.  */
  addrs.low += base_address;
  addrs.high += base_address;

  /* Try to merge with the last entry.  */
  if (vec->count > 0)
    {
      p = (struct unit_addrs *) vec->vec.base + (vec->count - 1);
      if ((addrs.low == p->high || addrs.low == p->high + 1)
          && addrs.u == p->u)
        {
          if (addrs.high > p->high)
            p->high = addrs.high;
          return 1;
        }
    }

  p = ((struct unit_addrs *)
       backtrace_vector_grow (state, sizeof (struct unit_addrs),
                              error_callback, data, &vec->vec));
  if (p == NULL)
    return 0;

  *p = addrs;
  ++vec->count;
  return 1;
}

/* Free a unit address vector.
*/

static void
free_unit_addrs_vector (struct backtrace_state *state,
                        struct unit_addrs_vector *vec,
                        backtrace_error_callback error_callback, void *data)
{
  struct unit_addrs *addrs;
  size_t i;

  addrs = (struct unit_addrs *) vec->vec.base;
  for (i = 0; i < vec->count; ++i)
    free_abbrevs (state, &addrs[i].u->abbrevs, error_callback, data);
}

/* Compare unit_addrs for qsort.  When ranges are nested, make the
   smallest one sort last.  */

static int
unit_addrs_compare (const void *v1, const void *v2)
{
  const struct unit_addrs *a1 = (const struct unit_addrs *) v1;
  const struct unit_addrs *a2 = (const struct unit_addrs *) v2;

  if (a1->low < a2->low)
    return -1;
  if (a1->low > a2->low)
    return 1;
  if (a1->high < a2->high)
    return 1;
  if (a1->high > a2->high)
    return -1;
  if (a1->u->lineoff < a2->u->lineoff)
    return -1;
  if (a1->u->lineoff > a2->u->lineoff)
    return 1;
  return 0;
}

/* Compare a PC against a unit_addrs for bsearch.  Note that if there
   are multiple ranges containing PC, which one will be returned is
   unpredictable.  We compensate for that in dwarf_fileline.  */

static int
unit_addrs_search (const void *vkey, const void *ventry)
{
  const uintptr_t *key = (const uintptr_t *) vkey;
  const struct unit_addrs *entry = (const struct unit_addrs *) ventry;
  uintptr_t pc;

  pc = *key;
  if (pc < entry->low)
    return -1;
  else if (pc >= entry->high)
    return 1;
  else
    return 0;
}

/* Sort the line vector by PC.  We want a stable sort here to maintain
   the order of lines for the same PC values.  Since the sequence is
   being sorted in place, their addresses cannot be relied on to
   maintain stability.  That is the purpose of the index member.  */

static int
line_compare (const void *v1, const void *v2)
{
  const struct line *ln1 = (const struct line *) v1;
  const struct line *ln2 = (const struct line *) v2;

  if (ln1->pc < ln2->pc)
    return -1;
  else if (ln1->pc > ln2->pc)
    return 1;
  else if (ln1->idx < ln2->idx)
    return -1;
  else if (ln1->idx > ln2->idx)
    return 1;
  else
    return 0;
}

/* Find a PC in a line vector.
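*/

The inverted range representation described earlier (sorted [low, high) intervals searched with bsearch) can be exercised in isolation. This is a minimal sketch using hypothetical demo types, not libbacktrace's own structures:

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for struct unit_addrs: just a half-open range.
   Hypothetical demo type, for illustration only.  */
struct demo_range
{
  uint64_t low;
  uint64_t high;
};

/* Same shape as unit_addrs_search above: an entry matches when
   LOW <= PC < HIGH.  */
static int
demo_range_search (const void *vkey, const void *ventry)
{
  uintptr_t pc = *(const uintptr_t *) vkey;
  const struct demo_range *entry = (const struct demo_range *) ventry;

  if (pc < entry->low)
    return -1;
  else if (pc >= entry->high)
    return 1;
  else
    return 0;
}
```

With a sorted array { {0x1000, 0x2000}, {0x2000, 0x3000}, {0x8000, 0x9000} }, looking up PC 0x2100 lands on the second entry, while a PC in the gap (say 0x5000) matches nothing and bsearch returns NULL.

/* Find a PC in a line vector.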
We always allocate an extra entry at the end of the lines
   vector, so that this routine can safely look at the next entry.
   Note that when there are multiple mappings for the same PC value,
   this will return the last one.  */

static int
line_search (const void *vkey, const void *ventry)
{
  const uintptr_t *key = (const uintptr_t *) vkey;
  const struct line *entry = (const struct line *) ventry;
  uintptr_t pc;

  pc = *key;
  if (pc < entry->pc)
    return -1;
  else if (pc >= (entry + 1)->pc)
    return 1;
  else
    return 0;
}

/* Sort the abbrevs by the abbrev code.  This function is passed to
   both qsort and bsearch.  */

static int
abbrev_compare (const void *v1, const void *v2)
{
  const struct abbrev *a1 = (const struct abbrev *) v1;
  const struct abbrev *a2 = (const struct abbrev *) v2;

  if (a1->code < a2->code)
    return -1;
  else if (a1->code > a2->code)
    return 1;
  else
    {
      /* This really shouldn't happen.  It means there are two
         different abbrevs with the same code, and that means we don't
         know which one lookup_abbrev should return.  */
      return 0;
    }
}

/* Read the abbreviation table for a compilation unit.  Returns 1 on
   success, 0 on failure.  */

static int
read_abbrevs (struct backtrace_state *state, uint64_t abbrev_offset,
              const unsigned char *dwarf_abbrev, size_t dwarf_abbrev_size,
              int is_bigendian, backtrace_error_callback error_callback,
              void *data, struct abbrevs *abbrevs)
{
  struct dwarf_buf abbrev_buf;
  struct dwarf_buf count_buf;
  size_t num_abbrevs;

  abbrevs->num_abbrevs = 0;
  abbrevs->abbrevs = NULL;

  if (abbrev_offset >= dwarf_abbrev_size)
    {
      error_callback (data, "abbrev offset out of range", 0);
      return 0;
    }

  abbrev_buf.name = ".debug_abbrev";
  abbrev_buf.start = dwarf_abbrev;
  abbrev_buf.buf = dwarf_abbrev + abbrev_offset;
  abbrev_buf.left = dwarf_abbrev_size - abbrev_offset;
  abbrev_buf.is_bigendian = is_bigendian;
  abbrev_buf.error_callback = error_callback;
  abbrev_buf.data = data;
  abbrev_buf.reported_underflow = 0;

  /* Count the number of abbrevs in this list.
*/

  count_buf = abbrev_buf;
  num_abbrevs = 0;
  while (read_uleb128 (&count_buf) != 0)
    {
      if (count_buf.reported_underflow)
        return 0;
      ++num_abbrevs;
      /* Skip tag.  */
      read_uleb128 (&count_buf);
      /* Skip has_children.  */
      read_byte (&count_buf);
      /* Skip attributes.  */
      while (read_uleb128 (&count_buf) != 0)
        read_uleb128 (&count_buf);
      /* Skip form of last attribute.  */
      read_uleb128 (&count_buf);
    }

  if (count_buf.reported_underflow)
    return 0;

  if (num_abbrevs == 0)
    return 1;

  abbrevs->num_abbrevs = num_abbrevs;
  abbrevs->abbrevs =
    ((struct abbrev *)
     backtrace_alloc (state, num_abbrevs * sizeof (struct abbrev),
                      error_callback, data));
  if (abbrevs->abbrevs == NULL)
    return 0;
  memset (abbrevs->abbrevs, 0, num_abbrevs * sizeof (struct abbrev));

  num_abbrevs = 0;
  while (1)
    {
      uint64_t code;
      struct abbrev a;
      size_t num_attrs;
      struct attr *attrs;

      if (abbrev_buf.reported_underflow)
        goto fail;

      code = read_uleb128 (&abbrev_buf);
      if (code == 0)
        break;

      a.code = code;
      a.tag = (enum dwarf_tag) read_uleb128 (&abbrev_buf);
      a.has_children = read_byte (&abbrev_buf);

      count_buf = abbrev_buf;
      num_attrs = 0;
      while (read_uleb128 (&count_buf) != 0)
        {
          ++num_attrs;
          read_uleb128 (&count_buf);
        }

      if (num_attrs == 0)
        {
          attrs = NULL;
          read_uleb128 (&abbrev_buf);
          read_uleb128 (&abbrev_buf);
        }
      else
        {
          attrs = ((struct attr *)
                   backtrace_alloc (state, num_attrs * sizeof *attrs,
                                    error_callback, data));
          if (attrs == NULL)
            goto fail;
          num_attrs = 0;
          while (1)
            {
              uint64_t name;
              uint64_t form;

              name = read_uleb128 (&abbrev_buf);
              form = read_uleb128 (&abbrev_buf);
              if (name == 0)
                break;
              attrs[num_attrs].name = (enum dwarf_attribute) name;
              attrs[num_attrs].form = (enum dwarf_form) form;
              ++num_attrs;
            }
        }

      a.num_attrs = num_attrs;
      a.attrs = attrs;

      abbrevs->abbrevs[num_abbrevs] = a;
      ++num_abbrevs;
    }

  backtrace_qsort (abbrevs->abbrevs, abbrevs->num_abbrevs,
                   sizeof (struct abbrev), abbrev_compare);

  return 1;

 fail:
  free_abbrevs (state, abbrevs, error_callback, data);
  return 0;
}

/* Return the abbrev information for an abbrev code.
*/

static const struct abbrev *
lookup_abbrev (struct abbrevs *abbrevs, uint64_t code,
               backtrace_error_callback error_callback, void *data)
{
  struct abbrev key;
  void *p;

  /* With GCC, where abbrevs are simply numbered in order, we should
     be able to just look up the entry.  */
  if (code - 1 < abbrevs->num_abbrevs
      && abbrevs->abbrevs[code - 1].code == code)
    return &abbrevs->abbrevs[code - 1];

  /* Otherwise we have to search.  */
  memset (&key, 0, sizeof key);
  key.code = code;
  p = bsearch (&key, abbrevs->abbrevs, abbrevs->num_abbrevs,
               sizeof (struct abbrev), abbrev_compare);
  if (p == NULL)
    {
      error_callback (data, "invalid abbreviation code", 0);
      return NULL;
    }
  return (const struct abbrev *) p;
}

/* Add non-contiguous address ranges for a compilation unit.  Returns
   1 on success, 0 on failure.  */

static int
add_unit_ranges (struct backtrace_state *state, uintptr_t base_address,
                 struct unit *u, uint64_t ranges, uint64_t base,
                 int is_bigendian, const unsigned char *dwarf_ranges,
                 size_t dwarf_ranges_size,
                 backtrace_error_callback error_callback, void *data,
                 struct unit_addrs_vector *addrs)
{
  struct dwarf_buf ranges_buf;

  if (ranges >= dwarf_ranges_size)
    {
      error_callback (data, "ranges offset out of range", 0);
      return 0;
    }

  ranges_buf.name = ".debug_ranges";
  ranges_buf.start = dwarf_ranges;
  ranges_buf.buf = dwarf_ranges + ranges;
  ranges_buf.left = dwarf_ranges_size - ranges;
  ranges_buf.is_bigendian = is_bigendian;
  ranges_buf.error_callback = error_callback;
  ranges_buf.data = data;
  ranges_buf.reported_underflow = 0;

  while (1)
    {
      uint64_t low;
      uint64_t high;

      if (ranges_buf.reported_underflow)
        return 0;

      low = read_address (&ranges_buf, u->addrsize);
      high = read_address (&ranges_buf, u->addrsize);

      if (low == 0 && high == 0)
        break;

      if (is_highest_address (low, u->addrsize))
        base = high;
      else
        {
          struct unit_addrs a;

          a.low = low + base;
          a.high = high + base;
          a.u = u;
          if (!add_unit_addr (state, base_address, a, error_callback, data,
                              addrs))
            return 0;
        }
    }

  if (ranges_buf.reported_underflow)
return 0;

  return 1;
}

/* Find the address range covered by a compilation unit, reading from
   UNIT_BUF and adding values to U.  Returns 1 if all data could be
   read, 0 if there is some error.  */

static int
find_address_ranges (struct backtrace_state *state, uintptr_t base_address,
                     struct dwarf_buf *unit_buf,
                     const unsigned char *dwarf_str, size_t dwarf_str_size,
                     const unsigned char *dwarf_ranges,
                     size_t dwarf_ranges_size,
                     int is_bigendian,
                     backtrace_error_callback error_callback,
                     void *data, struct unit *u,
                     struct unit_addrs_vector *addrs)
{
  while (unit_buf->left > 0)
    {
      uint64_t code;
      const struct abbrev *abbrev;
      uint64_t lowpc;
      int have_lowpc;
      uint64_t highpc;
      int have_highpc;
      int highpc_is_relative;
      uint64_t ranges;
      int have_ranges;
      size_t i;

      code = read_uleb128 (unit_buf);
      if (code == 0)
        return 1;

      abbrev = lookup_abbrev (&u->abbrevs, code, error_callback, data);
      if (abbrev == NULL)
        return 0;

      lowpc = 0;
      have_lowpc = 0;
      highpc = 0;
      have_highpc = 0;
      highpc_is_relative = 0;
      ranges = 0;
      have_ranges = 0;
      for (i = 0; i < abbrev->num_attrs; ++i)
        {
          struct attr_val val;

          if (!read_attribute (abbrev->attrs[i].form, unit_buf,
                               u->is_dwarf64, u->version, u->addrsize,
                               dwarf_str, dwarf_str_size, &val))
            return 0;

          switch (abbrev->attrs[i].name)
            {
            case DW_AT_low_pc:
              if (val.encoding == ATTR_VAL_ADDRESS)
                {
                  lowpc = val.u.uint;
                  have_lowpc = 1;
                }
              break;
            case DW_AT_high_pc:
              if (val.encoding == ATTR_VAL_ADDRESS)
                {
                  highpc = val.u.uint;
                  have_highpc = 1;
                }
              else if (val.encoding == ATTR_VAL_UINT)
                {
                  highpc = val.u.uint;
                  have_highpc = 1;
                  highpc_is_relative = 1;
                }
              break;
            case DW_AT_ranges:
              if (val.encoding == ATTR_VAL_UINT
                  || val.encoding == ATTR_VAL_REF_SECTION)
                {
                  ranges = val.u.uint;
                  have_ranges = 1;
                }
              break;
            case DW_AT_stmt_list:
              if (abbrev->tag == DW_TAG_compile_unit
                  && (val.encoding == ATTR_VAL_UINT
                      || val.encoding == ATTR_VAL_REF_SECTION))
                u->lineoff = val.u.uint;
              break;
            case DW_AT_name:
              if (abbrev->tag == DW_TAG_compile_unit
                  && val.encoding == ATTR_VAL_STRING)
                u->filename = val.u.string;
break;
            case DW_AT_comp_dir:
              if (abbrev->tag == DW_TAG_compile_unit
                  && val.encoding == ATTR_VAL_STRING)
                u->comp_dir = val.u.string;
              break;
            default:
              break;
            }
        }

      if (abbrev->tag == DW_TAG_compile_unit
          || abbrev->tag == DW_TAG_subprogram)
        {
          if (have_ranges)
            {
              if (!add_unit_ranges (state, base_address, u, ranges, lowpc,
                                    is_bigendian, dwarf_ranges,
                                    dwarf_ranges_size, error_callback, data,
                                    addrs))
                return 0;
            }
          else if (have_lowpc && have_highpc)
            {
              struct unit_addrs a;

              if (highpc_is_relative)
                highpc += lowpc;
              a.low = lowpc;
              a.high = highpc;
              a.u = u;
              if (!add_unit_addr (state, base_address, a, error_callback,
                                  data, addrs))
                return 0;
            }

          /* If we found the PC range in the DW_TAG_compile_unit, we
             can stop now.  */
          if (abbrev->tag == DW_TAG_compile_unit
              && (have_ranges || (have_lowpc && have_highpc)))
            return 1;
        }

      if (abbrev->has_children)
        {
          if (!find_address_ranges (state, base_address, unit_buf,
                                    dwarf_str, dwarf_str_size,
                                    dwarf_ranges, dwarf_ranges_size,
                                    is_bigendian, error_callback, data,
                                    u, addrs))
            return 0;
        }
    }

  return 1;
}

/* Build a mapping from address ranges to the compilation units where
   the line number information for that range can be found.  Returns 1
   on success, 0 on failure.  */

static int
build_address_map (struct backtrace_state *state, uintptr_t base_address,
                   const unsigned char *dwarf_info, size_t dwarf_info_size,
                   const unsigned char *dwarf_abbrev,
                   size_t dwarf_abbrev_size,
                   const unsigned char *dwarf_ranges,
                   size_t dwarf_ranges_size,
                   const unsigned char *dwarf_str, size_t dwarf_str_size,
                   int is_bigendian,
                   backtrace_error_callback error_callback,
                   void *data, struct unit_addrs_vector *addrs)
{
  struct dwarf_buf info;
  struct abbrevs abbrevs;

  memset (&addrs->vec, 0, sizeof addrs->vec);
  addrs->count = 0;

  /* Read through the .debug_info section.  FIXME: Should we use the
     .debug_aranges section?  gdb and addr2line don't use it, but I'm
     not sure why.
*/

  info.name = ".debug_info";
  info.start = dwarf_info;
  info.buf = dwarf_info;
  info.left = dwarf_info_size;
  info.is_bigendian = is_bigendian;
  info.error_callback = error_callback;
  info.data = data;
  info.reported_underflow = 0;

  memset (&abbrevs, 0, sizeof abbrevs);
  while (info.left > 0)
    {
      const unsigned char *unit_data_start;
      uint64_t len;
      int is_dwarf64;
      struct dwarf_buf unit_buf;
      int version;
      uint64_t abbrev_offset;
      int addrsize;
      struct unit *u;

      if (info.reported_underflow)
        goto fail;

      unit_data_start = info.buf;

      is_dwarf64 = 0;
      len = read_uint32 (&info);
      if (len == 0xffffffff)
        {
          len = read_uint64 (&info);
          is_dwarf64 = 1;
        }

      unit_buf = info;
      unit_buf.left = len;

      if (!advance (&info, len))
        goto fail;

      version = read_uint16 (&unit_buf);
      if (version < 2 || version > 4)
        {
          dwarf_buf_error (&unit_buf, "unrecognized DWARF version");
          goto fail;
        }

      abbrev_offset = read_offset (&unit_buf, is_dwarf64);
      if (!read_abbrevs (state, abbrev_offset, dwarf_abbrev,
                         dwarf_abbrev_size, is_bigendian, error_callback,
                         data, &abbrevs))
        goto fail;

      addrsize = read_byte (&unit_buf);

      u = ((struct unit *)
           backtrace_alloc (state, sizeof *u, error_callback, data));
      if (u == NULL)
        goto fail;
      u->unit_data = unit_buf.buf;
      u->unit_data_len = unit_buf.left;
      u->unit_data_offset = unit_buf.buf - unit_data_start;
      u->version = version;
      u->is_dwarf64 = is_dwarf64;
      u->addrsize = addrsize;
      u->filename = NULL;
      u->comp_dir = NULL;
      u->abs_filename = NULL;
      u->lineoff = 0;
      u->abbrevs = abbrevs;
      memset (&abbrevs, 0, sizeof abbrevs);

      /* The actual line number mappings will be read as needed.
*/
      u->lines = NULL;
      u->lines_count = 0;
      u->function_addrs = NULL;
      u->function_addrs_count = 0;

      if (!find_address_ranges (state, base_address, &unit_buf,
                                dwarf_str, dwarf_str_size,
                                dwarf_ranges, dwarf_ranges_size,
                                is_bigendian, error_callback, data,
                                u, addrs))
        {
          free_abbrevs (state, &u->abbrevs, error_callback, data);
          backtrace_free (state, u, sizeof *u, error_callback, data);
          goto fail;
        }

      if (unit_buf.reported_underflow)
        {
          free_abbrevs (state, &u->abbrevs, error_callback, data);
          backtrace_free (state, u, sizeof *u, error_callback, data);
          goto fail;
        }
    }
  if (info.reported_underflow)
    goto fail;

  return 1;

 fail:
  free_abbrevs (state, &abbrevs, error_callback, data);
  free_unit_addrs_vector (state, addrs, error_callback, data);
  return 0;
}

/* Add a new mapping to the vector of line mappings that we are
   building.  Returns 1 on success, 0 on failure.  */

static int
add_line (struct backtrace_state *state, struct dwarf_data *ddata,
          uintptr_t pc, const char *filename, int lineno,
          backtrace_error_callback error_callback, void *data,
          struct line_vector *vec)
{
  struct line *ln;

  /* If we are adding the same mapping, ignore it.  This can happen
     when using discriminators.  */
  if (vec->count > 0)
    {
      ln = (struct line *) vec->vec.base + (vec->count - 1);
      if (pc == ln->pc && filename == ln->filename && lineno == ln->lineno)
        return 1;
    }

  ln = ((struct line *)
        backtrace_vector_grow (state, sizeof (struct line), error_callback,
                               data, &vec->vec));
  if (ln == NULL)
    return 0;

  /* Add in the base address here, so that we can look up the PC
     directly.  */
  ln->pc = pc + ddata->base_address;

  ln->filename = filename;
  ln->lineno = lineno;
  ln->idx = vec->count;
  ++vec->count;
  return 1;
}

/* Free the line header information.  We do not free the file names
   themselves, as there may be line structures pointing to them.
*/

static void
free_line_header (struct backtrace_state *state, struct line_header *hdr,
                  backtrace_error_callback error_callback, void *data)
{
  backtrace_free (state, hdr->dirs,
                  hdr->dirs_count * sizeof (const char *),
                  error_callback, data);
  backtrace_free (state, hdr->filenames,
                  hdr->filenames_count * sizeof (char *),
                  error_callback, data);
}

/* Read the line header.  Return 1 on success, 0 on failure.  */

static int
read_line_header (struct backtrace_state *state, struct unit *u,
                  int is_dwarf64, struct dwarf_buf *line_buf,
                  struct line_header *hdr)
{
  uint64_t hdrlen;
  struct dwarf_buf hdr_buf;
  const unsigned char *p;
  const unsigned char *pend;
  size_t i;

  hdr->version = read_uint16 (line_buf);
  if (hdr->version < 2 || hdr->version > 4)
    {
      dwarf_buf_error (line_buf, "unsupported line number version");
      return 0;
    }

  hdrlen = read_offset (line_buf, is_dwarf64);

  hdr_buf = *line_buf;
  hdr_buf.left = hdrlen;

  if (!advance (line_buf, hdrlen))
    return 0;

  hdr->min_insn_len = read_byte (&hdr_buf);
  if (hdr->version < 4)
    hdr->max_ops_per_insn = 1;
  else
    hdr->max_ops_per_insn = read_byte (&hdr_buf);

  /* We don't care about default_is_stmt.  */
  read_byte (&hdr_buf);

  hdr->line_base = read_sbyte (&hdr_buf);
  hdr->line_range = read_byte (&hdr_buf);

  hdr->opcode_base = read_byte (&hdr_buf);
  hdr->opcode_lengths = hdr_buf.buf;
  if (!advance (&hdr_buf, hdr->opcode_base - 1))
    return 0;

  /* Count the number of directory entries.
*/

  hdr->dirs_count = 0;
  p = hdr_buf.buf;
  pend = p + hdr_buf.left;
  while (p < pend && *p != '\0')
    {
      p += strnlen ((const char *) p, pend - p) + 1;
      ++hdr->dirs_count;
    }

  hdr->dirs = ((const char **)
               backtrace_alloc (state,
                                hdr->dirs_count * sizeof (const char *),
                                line_buf->error_callback, line_buf->data));
  if (hdr->dirs == NULL)
    return 0;

  i = 0;
  while (*hdr_buf.buf != '\0')
    {
      if (hdr_buf.reported_underflow)
        return 0;
      hdr->dirs[i] = (const char *) hdr_buf.buf;
      ++i;
      if (!advance (&hdr_buf,
                    strnlen ((const char *) hdr_buf.buf, hdr_buf.left) + 1))
        return 0;
    }
  if (!advance (&hdr_buf, 1))
    return 0;

  /* Count the number of file entries.  */
  hdr->filenames_count = 0;
  p = hdr_buf.buf;
  pend = p + hdr_buf.left;
  while (p < pend && *p != '\0')
    {
      p += strnlen ((const char *) p, pend - p) + 1;
      p += leb128_len (p);
      p += leb128_len (p);
      p += leb128_len (p);
      ++hdr->filenames_count;
    }

  hdr->filenames = ((const char **)
                    backtrace_alloc (state,
                                     hdr->filenames_count
                                     * sizeof (char *),
                                     line_buf->error_callback,
                                     line_buf->data));
  if (hdr->filenames == NULL)
    return 0;
  i = 0;
  while (*hdr_buf.buf != '\0')
    {
      const char *filename;
      uint64_t dir_index;

      if (hdr_buf.reported_underflow)
        return 0;
      filename = (const char *) hdr_buf.buf;
      if (!advance (&hdr_buf,
                    strnlen ((const char *) hdr_buf.buf, hdr_buf.left) + 1))
        return 0;
      dir_index = read_uleb128 (&hdr_buf);
      if (IS_ABSOLUTE_PATH (filename)
          || (dir_index == 0 && u->comp_dir == NULL))
        hdr->filenames[i] = filename;
      else
        {
          const char *dir;
          size_t dir_len;
          size_t filename_len;
          char *s;

          if (dir_index == 0)
            dir = u->comp_dir;
          else if (dir_index - 1 < hdr->dirs_count)
            dir = hdr->dirs[dir_index - 1];
          else
            {
              dwarf_buf_error (line_buf,
                               ("invalid directory index in "
                                "line number program header"));
              return 0;
            }
          dir_len = strlen (dir);
          filename_len = strlen (filename);
          s = ((char *)
               backtrace_alloc (state, dir_len + filename_len + 2,
                                line_buf->error_callback, line_buf->data));
          if (s == NULL)
            return 0;
          memcpy (s, dir, dir_len);
          /* FIXME: If we are on a DOS-based file
system, and the directory or the file name use
             backslashes, then we should use a backslash here.  */
          s[dir_len] = '/';
          memcpy (s + dir_len + 1, filename, filename_len + 1);
          hdr->filenames[i] = s;
        }

      /* Ignore the modification time and size.  */
      read_uleb128 (&hdr_buf);
      read_uleb128 (&hdr_buf);

      ++i;
    }

  if (hdr_buf.reported_underflow)
    return 0;

  return 1;
}

/* Read the line program, adding line mappings to VEC.  Return 1 on
   success, 0 on failure.  */

static int
read_line_program (struct backtrace_state *state, struct dwarf_data *ddata,
                   struct unit *u, const struct line_header *hdr,
                   struct dwarf_buf *line_buf, struct line_vector *vec)
{
  uint64_t address;
  unsigned int op_index;
  const char *reset_filename;
  const char *filename;
  int lineno;

  address = 0;
  op_index = 0;
  if (hdr->filenames_count > 0)
    reset_filename = hdr->filenames[0];
  else
    reset_filename = "";
  filename = reset_filename;
  lineno = 1;
  while (line_buf->left > 0)
    {
      unsigned int op;

      op = read_byte (line_buf);
      if (op >= hdr->opcode_base)
        {
          unsigned int advance;

          /* Special opcode.  */
          op -= hdr->opcode_base;
          advance = op / hdr->line_range;
          address += (hdr->min_insn_len * (op_index + advance)
                      / hdr->max_ops_per_insn);
          op_index = (op_index + advance) % hdr->max_ops_per_insn;
          lineno += hdr->line_base + (int) (op % hdr->line_range);
          add_line (state, ddata, address, filename, lineno,
                    line_buf->error_callback, line_buf->data, vec);
        }
      else if (op == DW_LNS_extended_op)
        {
          uint64_t len;

          len = read_uleb128 (line_buf);
          op = read_byte (line_buf);
          switch (op)
            {
            case DW_LNE_end_sequence:
              /* FIXME: Should we mark the high PC here?  It seems
                 that we already have that information from the
                 compilation unit.
*/ address = 0; op_index = 0; filename = reset_filename; lineno = 1; break; case DW_LNE_set_address: address = read_address (line_buf, u->addrsize); break; case DW_LNE_define_file: { const char *f; unsigned int dir_index; f = (const char *) line_buf->buf; if (!advance (line_buf, strnlen (f, line_buf->left) + 1)) return 0; dir_index = read_uleb128 (line_buf); /* Ignore that time and length. */ read_uleb128 (line_buf); read_uleb128 (line_buf); if (IS_ABSOLUTE_PATH (f)) filename = f; else { const char *dir; size_t dir_len; size_t f_len; char *p; if (dir_index == 0) dir = u->comp_dir; else if (dir_index - 1 < hdr->dirs_count) dir = hdr->dirs[dir_index - 1]; else { dwarf_buf_error (line_buf, ("invalid directory index " "in line number program")); return 0; } dir_len = strlen (dir); f_len = strlen (f); p = ((char *) backtrace_alloc (state, dir_len + f_len + 2, line_buf->error_callback, line_buf->data)); if (p == NULL) return 0; memcpy (p, dir, dir_len); /* FIXME: If we are on a DOS-based file system, and the directory or the file name use backslashes, then we should use a backslash here. */ p[dir_len] = '/'; memcpy (p + dir_len + 1, f, f_len + 1); filename = p; } } break; case DW_LNE_set_discriminator: /* We don't care about discriminators. 
*/ read_uleb128 (line_buf); break; default: if (!advance (line_buf, len - 1)) return 0; break; } } else { switch (op) { case DW_LNS_copy: add_line (state, ddata, address, filename, lineno, line_buf->error_callback, line_buf->data, vec); break; case DW_LNS_advance_pc: { uint64_t advance; advance = read_uleb128 (line_buf); address += (hdr->min_insn_len * (op_index + advance) / hdr->max_ops_per_insn); op_index = (op_index + advance) % hdr->max_ops_per_insn; } break; case DW_LNS_advance_line: lineno += (int) read_sleb128 (line_buf); break; case DW_LNS_set_file: { uint64_t fileno; fileno = read_uleb128 (line_buf); if (fileno == 0) filename = ""; else { if (fileno - 1 >= hdr->filenames_count) { dwarf_buf_error (line_buf, ("invalid file number in " "line number program")); return 0; } filename = hdr->filenames[fileno - 1]; } } break; case DW_LNS_set_column: read_uleb128 (line_buf); break; case DW_LNS_negate_stmt: break; case DW_LNS_set_basic_block: break; case DW_LNS_const_add_pc: { unsigned int advance; op = 255 - hdr->opcode_base; advance = op / hdr->line_range; address += (hdr->min_insn_len * (op_index + advance) / hdr->max_ops_per_insn); op_index = (op_index + advance) % hdr->max_ops_per_insn; } break; case DW_LNS_fixed_advance_pc: address += read_uint16 (line_buf); op_index = 0; break; case DW_LNS_set_prologue_end: break; case DW_LNS_set_epilogue_begin: break; case DW_LNS_set_isa: read_uleb128 (line_buf); break; default: { unsigned int i; for (i = hdr->opcode_lengths[op - 1]; i > 0; --i) read_uleb128 (line_buf); } break; } } } return 1; } /* Read the line number information for a compilation unit. Returns 1 on success, 0 on failure. 
*/ static int read_line_info (struct backtrace_state *state, struct dwarf_data *ddata, backtrace_error_callback error_callback, void *data, struct unit *u, struct line_header *hdr, struct line **lines, size_t *lines_count) { struct line_vector vec; struct dwarf_buf line_buf; uint64_t len; int is_dwarf64; struct line *ln; memset (&vec.vec, 0, sizeof vec.vec); vec.count = 0; memset (hdr, 0, sizeof *hdr); if (u->lineoff != (off_t) (size_t) u->lineoff || (size_t) u->lineoff >= ddata->dwarf_line_size) { error_callback (data, "unit line offset out of range", 0); goto fail; } line_buf.name = ".debug_line"; line_buf.start = ddata->dwarf_line; line_buf.buf = ddata->dwarf_line + u->lineoff; line_buf.left = ddata->dwarf_line_size - u->lineoff; line_buf.is_bigendian = ddata->is_bigendian; line_buf.error_callback = error_callback; line_buf.data = data; line_buf.reported_underflow = 0; is_dwarf64 = 0; len = read_uint32 (&line_buf); if (len == 0xffffffff) { len = read_uint64 (&line_buf); is_dwarf64 = 1; } line_buf.left = len; if (!read_line_header (state, u, is_dwarf64, &line_buf, hdr)) goto fail; if (!read_line_program (state, ddata, u, hdr, &line_buf, &vec)) goto fail; if (line_buf.reported_underflow) goto fail; if (vec.count == 0) { /* This is not a failure in the sense of generating an error, but it is a failure in the sense that we have no useful information. */ goto fail; } /* Allocate one extra entry at the end.
*/ ln = ((struct line *) backtrace_vector_grow (state, sizeof (struct line), error_callback, data, &vec.vec)); if (ln == NULL) goto fail; ln->pc = (uintptr_t) -1; ln->filename = NULL; ln->lineno = 0; ln->idx = 0; if (!backtrace_vector_release (state, &vec.vec, error_callback, data)) goto fail; ln = (struct line *) vec.vec.base; backtrace_qsort (ln, vec.count, sizeof (struct line), line_compare); *lines = ln; *lines_count = vec.count; return 1; fail: vec.vec.alc += vec.vec.size; vec.vec.size = 0; backtrace_vector_release (state, &vec.vec, error_callback, data); free_line_header (state, hdr, error_callback, data); *lines = (struct line *) (uintptr_t) -1; *lines_count = 0; return 0; } /* Read the name of a function from a DIE referenced by a DW_AT_abstract_origin or DW_AT_specification tag. OFFSET is within the same compilation unit. */ static const char * read_referenced_name (struct dwarf_data *ddata, struct unit *u, uint64_t offset, backtrace_error_callback error_callback, void *data) { struct dwarf_buf unit_buf; uint64_t code; const struct abbrev *abbrev; const char *ret; size_t i; /* OFFSET is from the start of the data for this compilation unit. U->unit_data is the data, but it starts U->unit_data_offset bytes from the beginning. 
*/ if (offset < u->unit_data_offset || offset - u->unit_data_offset >= u->unit_data_len) { error_callback (data, "abstract origin or specification out of range", 0); return NULL; } offset -= u->unit_data_offset; unit_buf.name = ".debug_info"; unit_buf.start = ddata->dwarf_info; unit_buf.buf = u->unit_data + offset; unit_buf.left = u->unit_data_len - offset; unit_buf.is_bigendian = ddata->is_bigendian; unit_buf.error_callback = error_callback; unit_buf.data = data; unit_buf.reported_underflow = 0; code = read_uleb128 (&unit_buf); if (code == 0) { dwarf_buf_error (&unit_buf, "invalid abstract origin or specification"); return NULL; } abbrev = lookup_abbrev (&u->abbrevs, code, error_callback, data); if (abbrev == NULL) return NULL; ret = NULL; for (i = 0; i < abbrev->num_attrs; ++i) { struct attr_val val; if (!read_attribute (abbrev->attrs[i].form, &unit_buf, u->is_dwarf64, u->version, u->addrsize, ddata->dwarf_str, ddata->dwarf_str_size, &val)) return NULL; switch (abbrev->attrs[i].name) { case DW_AT_name: /* We prefer the linkage name if we get one. */ if (val.encoding == ATTR_VAL_STRING) ret = val.u.string; break; case DW_AT_linkage_name: case DW_AT_MIPS_linkage_name: if (val.encoding == ATTR_VAL_STRING) return val.u.string; break; case DW_AT_specification: if (abbrev->attrs[i].form == DW_FORM_ref_addr || abbrev->attrs[i].form == DW_FORM_ref_sig8) { /* This refers to a specification defined in some other compilation unit. We can handle this case if we must, but it's harder. */ break; } if (val.encoding == ATTR_VAL_UINT || val.encoding == ATTR_VAL_REF_UNIT) { const char *name; name = read_referenced_name (ddata, u, val.u.uint, error_callback, data); if (name != NULL) ret = name; } break; default: break; } } return ret; } /* Add a single range to U that maps to function. Returns 1 on success, 0 on error.
*/ static int add_function_range (struct backtrace_state *state, struct dwarf_data *ddata, struct function *function, uint64_t lowpc, uint64_t highpc, backtrace_error_callback error_callback, void *data, struct function_vector *vec) { struct function_addrs *p; /* Add in the base address here, so that we can look up the PC directly. */ lowpc += ddata->base_address; highpc += ddata->base_address; if (vec->count > 0) { p = (struct function_addrs *) vec->vec.base + vec->count - 1; if ((lowpc == p->high || lowpc == p->high + 1) && function == p->function) { if (highpc > p->high) p->high = highpc; return 1; } } p = ((struct function_addrs *) backtrace_vector_grow (state, sizeof (struct function_addrs), error_callback, data, &vec->vec)); if (p == NULL) return 0; p->low = lowpc; p->high = highpc; p->function = function; ++vec->count; return 1; } /* Add PC ranges to U that map to FUNCTION. Returns 1 on success, 0 on error. */ static int add_function_ranges (struct backtrace_state *state, struct dwarf_data *ddata, struct unit *u, struct function *function, uint64_t ranges, uint64_t base, backtrace_error_callback error_callback, void *data, struct function_vector *vec) { struct dwarf_buf ranges_buf; if (ranges >= ddata->dwarf_ranges_size) { error_callback (data, "function ranges offset out of range", 0); return 0; } ranges_buf.name = ".debug_ranges"; ranges_buf.start = ddata->dwarf_ranges; ranges_buf.buf = ddata->dwarf_ranges + ranges; ranges_buf.left = ddata->dwarf_ranges_size - ranges; ranges_buf.is_bigendian = ddata->is_bigendian; ranges_buf.error_callback = error_callback; ranges_buf.data = data; ranges_buf.reported_underflow = 0; while (1) { uint64_t low; uint64_t high; if (ranges_buf.reported_underflow) return 0; low = read_address (&ranges_buf, u->addrsize); high = read_address (&ranges_buf, u->addrsize); if (low == 0 && high == 0) break; if (is_highest_address (low, u->addrsize)) base = high; else { if (!add_function_range (state, ddata, function, low + base, high + 
base, error_callback, data, vec)) return 0; } } if (ranges_buf.reported_underflow) return 0; return 1; } /* Read one entry plus all its children. Add function addresses to VEC. Returns 1 on success, 0 on error. */ static int read_function_entry (struct backtrace_state *state, struct dwarf_data *ddata, struct unit *u, uint64_t base, struct dwarf_buf *unit_buf, const struct line_header *lhdr, backtrace_error_callback error_callback, void *data, struct function_vector *vec_function, struct function_vector *vec_inlined) { while (unit_buf->left > 0) { uint64_t code; const struct abbrev *abbrev; int is_function; struct function *function; struct function_vector *vec; size_t i; uint64_t lowpc; int have_lowpc; uint64_t highpc; int have_highpc; int highpc_is_relative; uint64_t ranges; int have_ranges; code = read_uleb128 (unit_buf); if (code == 0) return 1; abbrev = lookup_abbrev (&u->abbrevs, code, error_callback, data); if (abbrev == NULL) return 0; is_function = (abbrev->tag == DW_TAG_subprogram || abbrev->tag == DW_TAG_entry_point || abbrev->tag == DW_TAG_inlined_subroutine); if (abbrev->tag == DW_TAG_inlined_subroutine) vec = vec_inlined; else vec = vec_function; function = NULL; if (is_function) { function = ((struct function *) backtrace_alloc (state, sizeof *function, error_callback, data)); if (function == NULL) return 0; memset (function, 0, sizeof *function); } lowpc = 0; have_lowpc = 0; highpc = 0; have_highpc = 0; highpc_is_relative = 0; ranges = 0; have_ranges = 0; for (i = 0; i < abbrev->num_attrs; ++i) { struct attr_val val; if (!read_attribute (abbrev->attrs[i].form, unit_buf, u->is_dwarf64, u->version, u->addrsize, ddata->dwarf_str, ddata->dwarf_str_size, &val)) return 0; /* The compile unit sets the base address for any address ranges in the function entries. 
*/ if (abbrev->tag == DW_TAG_compile_unit && abbrev->attrs[i].name == DW_AT_low_pc && val.encoding == ATTR_VAL_ADDRESS) base = val.u.uint; if (is_function) { switch (abbrev->attrs[i].name) { case DW_AT_call_file: if (val.encoding == ATTR_VAL_UINT) { if (val.u.uint == 0) function->caller_filename = ""; else { if (val.u.uint - 1 >= lhdr->filenames_count) { dwarf_buf_error (unit_buf, ("invalid file number in " "DW_AT_call_file attribute")); return 0; } function->caller_filename = lhdr->filenames[val.u.uint - 1]; } } break; case DW_AT_call_line: if (val.encoding == ATTR_VAL_UINT) function->caller_lineno = val.u.uint; break; case DW_AT_abstract_origin: case DW_AT_specification: if (abbrev->attrs[i].form == DW_FORM_ref_addr || abbrev->attrs[i].form == DW_FORM_ref_sig8) { /* This refers to an abstract origin defined in some other compilation unit. We can handle this case if we must, but it's harder. */ break; } if (val.encoding == ATTR_VAL_UINT || val.encoding == ATTR_VAL_REF_UNIT) { const char *name; name = read_referenced_name (ddata, u, val.u.uint, error_callback, data); if (name != NULL) function->name = name; } break; case DW_AT_name: if (val.encoding == ATTR_VAL_STRING) { /* Don't override a name we found in some other way, as it will normally be more useful--e.g., this name is normally not mangled. 
*/ if (function->name == NULL) function->name = val.u.string; } break; case DW_AT_linkage_name: case DW_AT_MIPS_linkage_name: if (val.encoding == ATTR_VAL_STRING) function->name = val.u.string; break; case DW_AT_low_pc: if (val.encoding == ATTR_VAL_ADDRESS) { lowpc = val.u.uint; have_lowpc = 1; } break; case DW_AT_high_pc: if (val.encoding == ATTR_VAL_ADDRESS) { highpc = val.u.uint; have_highpc = 1; } else if (val.encoding == ATTR_VAL_UINT) { highpc = val.u.uint; have_highpc = 1; highpc_is_relative = 1; } break; case DW_AT_ranges: if (val.encoding == ATTR_VAL_UINT || val.encoding == ATTR_VAL_REF_SECTION) { ranges = val.u.uint; have_ranges = 1; } break; default: break; } } } /* If we couldn't find a name for the function, we have no use for it. */ if (is_function && function->name == NULL) { backtrace_free (state, function, sizeof *function, error_callback, data); is_function = 0; } if (is_function) { if (have_ranges) { if (!add_function_ranges (state, ddata, u, function, ranges, base, error_callback, data, vec)) return 0; } else if (have_lowpc && have_highpc) { if (highpc_is_relative) highpc += lowpc; if (!add_function_range (state, ddata, function, lowpc, highpc, error_callback, data, vec)) return 0; } else { backtrace_free (state, function, sizeof *function, error_callback, data); is_function = 0; } } if (abbrev->has_children) { if (!is_function) { if (!read_function_entry (state, ddata, u, base, unit_buf, lhdr, error_callback, data, vec_function, vec_inlined)) return 0; } else { struct function_vector fvec; /* Gather any information for inlined functions in FVEC. 
*/ memset (&fvec, 0, sizeof fvec); if (!read_function_entry (state, ddata, u, base, unit_buf, lhdr, error_callback, data, vec_function, &fvec)) return 0; if (fvec.count > 0) { struct function_addrs *faddrs; if (!backtrace_vector_release (state, &fvec.vec, error_callback, data)) return 0; faddrs = (struct function_addrs *) fvec.vec.base; backtrace_qsort (faddrs, fvec.count, sizeof (struct function_addrs), function_addrs_compare); function->function_addrs = faddrs; function->function_addrs_count = fvec.count; } } } } return 1; } /* Read function name information for a compilation unit. We look through the whole unit looking for function tags. */ static void read_function_info (struct backtrace_state *state, struct dwarf_data *ddata, const struct line_header *lhdr, backtrace_error_callback error_callback, void *data, struct unit *u, struct function_vector *fvec, struct function_addrs **ret_addrs, size_t *ret_addrs_count) { struct function_vector lvec; struct function_vector *pfvec; struct dwarf_buf unit_buf; struct function_addrs *addrs; size_t addrs_count; /* Use FVEC if it is not NULL. Otherwise use our own vector. */ if (fvec != NULL) pfvec = fvec; else { memset (&lvec, 0, sizeof lvec); pfvec = &lvec; } unit_buf.name = ".debug_info"; unit_buf.start = ddata->dwarf_info; unit_buf.buf = u->unit_data; unit_buf.left = u->unit_data_len; unit_buf.is_bigendian = ddata->is_bigendian; unit_buf.error_callback = error_callback; unit_buf.data = data; unit_buf.reported_underflow = 0; while (unit_buf.left > 0) { if (!read_function_entry (state, ddata, u, 0, &unit_buf, lhdr, error_callback, data, pfvec, pfvec)) return; } if (pfvec->count == 0) return; addrs_count = pfvec->count; if (fvec == NULL) { if (!backtrace_vector_release (state, &lvec.vec, error_callback, data)) return; addrs = (struct function_addrs *) pfvec->vec.base; } else { /* Finish this list of addresses, but leave the remaining space in the vector available for the next function unit. 
*/ addrs = ((struct function_addrs *) backtrace_vector_finish (state, &fvec->vec, error_callback, data)); if (addrs == NULL) return; fvec->count = 0; } backtrace_qsort (addrs, addrs_count, sizeof (struct function_addrs), function_addrs_compare); *ret_addrs = addrs; *ret_addrs_count = addrs_count; } /* See if PC is inlined in FUNCTION. If it is, print out the inlined information, and update FILENAME and LINENO for the caller. Returns whatever CALLBACK returns, or 0 to keep going. */ static int report_inlined_functions (uintptr_t pc, struct function *function, backtrace_full_callback callback, void *data, const char **filename, int *lineno) { struct function_addrs *function_addrs; struct function *inlined; int ret; if (function->function_addrs_count == 0) return 0; function_addrs = ((struct function_addrs *) bsearch (&pc, function->function_addrs, function->function_addrs_count, sizeof (struct function_addrs), function_addrs_search)); if (function_addrs == NULL) return 0; while (((size_t) (function_addrs - function->function_addrs) + 1 < function->function_addrs_count) && pc >= (function_addrs + 1)->low && pc < (function_addrs + 1)->high) ++function_addrs; /* We found an inlined call. */ inlined = function_addrs->function; /* Report any calls inlined into this one. */ ret = report_inlined_functions (pc, inlined, callback, data, filename, lineno); if (ret != 0) return ret; /* Report this inlined call. */ ret = callback (data, pc, *filename, *lineno, inlined->name); if (ret != 0) return ret; /* Our caller will report the caller of the inlined function; tell it the appropriate filename and line number. */ *filename = inlined->caller_filename; *lineno = inlined->caller_lineno; return 0; } /* Look for a PC in the DWARF mapping for one module. On success, call CALLBACK and return whatever it returns. On error, call ERROR_CALLBACK and return 0. Sets *FOUND to 1 if the PC is found, 0 if not. 
*/ static int dwarf_lookup_pc (struct backtrace_state *state, struct dwarf_data *ddata, uintptr_t pc, backtrace_full_callback callback, backtrace_error_callback error_callback, void *data, int *found) { struct unit_addrs *entry; struct unit *u; int new_data; struct line *lines; struct line *ln; struct function_addrs *function_addrs; struct function *function; const char *filename; int lineno; int ret; *found = 1; /* Find an address range that includes PC. */ entry = bsearch (&pc, ddata->addrs, ddata->addrs_count, sizeof (struct unit_addrs), unit_addrs_search); if (entry == NULL) { *found = 0; return 0; } /* If there are multiple ranges that contain PC, use the last one, in order to produce predictable results. If we assume that all ranges are properly nested, then the last range will be the smallest one. */ while ((size_t) (entry - ddata->addrs) + 1 < ddata->addrs_count && pc >= (entry + 1)->low && pc < (entry + 1)->high) ++entry; /* We need the lines, lines_count, function_addrs, function_addrs_count fields of u. If they are not set, we need to set them. When running in threaded mode, we need to allow for the possibility that some other thread is setting them simultaneously. */ u = entry->u; lines = u->lines; /* Skip units with no useful line number information by walking backward. Useless line number information is marked by setting lines == -1. */ while (entry > ddata->addrs && pc >= (entry - 1)->low && pc < (entry - 1)->high) { if (state->threaded) lines = (struct line *) backtrace_atomic_load_pointer (&u->lines); if (lines != (struct line *) (uintptr_t) -1) break; --entry; u = entry->u; lines = u->lines; } if (state->threaded) lines = backtrace_atomic_load_pointer (&u->lines); new_data = 0; if (lines == NULL) { size_t function_addrs_count; struct line_header lhdr; size_t count; /* We have never read the line information for this unit. Read it now. 
*/ function_addrs = NULL; function_addrs_count = 0; if (read_line_info (state, ddata, error_callback, data, entry->u, &lhdr, &lines, &count)) { struct function_vector *pfvec; /* If not threaded, reuse DDATA->FVEC for better memory consumption. */ if (state->threaded) pfvec = NULL; else pfvec = &ddata->fvec; read_function_info (state, ddata, &lhdr, error_callback, data, entry->u, pfvec, &function_addrs, &function_addrs_count); free_line_header (state, &lhdr, error_callback, data); new_data = 1; } /* Atomically store the information we just read into the unit. If another thread is simultaneously writing, it presumably read the same information, and we don't care which one we wind up with; we just leak the other one. We do have to write the lines field last, so that the acquire-loads above ensure that the other fields are set. */ if (!state->threaded) { u->lines_count = count; u->function_addrs = function_addrs; u->function_addrs_count = function_addrs_count; u->lines = lines; } else { backtrace_atomic_store_size_t (&u->lines_count, count); backtrace_atomic_store_pointer (&u->function_addrs, function_addrs); backtrace_atomic_store_size_t (&u->function_addrs_count, function_addrs_count); backtrace_atomic_store_pointer (&u->lines, lines); } } /* Now all fields of U have been initialized. */ if (lines == (struct line *) (uintptr_t) -1) { /* If reading the line number information failed in some way, try again to see if there is a better compilation unit for this PC. */ if (new_data) return dwarf_lookup_pc (state, ddata, pc, callback, error_callback, data, found); return callback (data, pc, NULL, 0, NULL); } /* Search for PC within this unit. */ ln = (struct line *) bsearch (&pc, lines, entry->u->lines_count, sizeof (struct line), line_search); if (ln == NULL) { /* The PC is between the low_pc and high_pc attributes of the compilation unit, but no entry in the line table covers it. This implies that the start of the compilation unit has no line number information. 
*/ if (entry->u->abs_filename == NULL) { const char *filename; filename = entry->u->filename; if (filename != NULL && !IS_ABSOLUTE_PATH (filename) && entry->u->comp_dir != NULL) { size_t filename_len; const char *dir; size_t dir_len; char *s; filename_len = strlen (filename); dir = entry->u->comp_dir; dir_len = strlen (dir); s = (char *) backtrace_alloc (state, dir_len + filename_len + 2, error_callback, data); if (s == NULL) { *found = 0; return 0; } memcpy (s, dir, dir_len); /* FIXME: Should use backslash if DOS file system. */ s[dir_len] = '/'; memcpy (s + dir_len + 1, filename, filename_len + 1); filename = s; } entry->u->abs_filename = filename; } return callback (data, pc, entry->u->abs_filename, 0, NULL); } /* Search for function name within this unit. */ if (entry->u->function_addrs_count == 0) return callback (data, pc, ln->filename, ln->lineno, NULL); function_addrs = ((struct function_addrs *) bsearch (&pc, entry->u->function_addrs, entry->u->function_addrs_count, sizeof (struct function_addrs), function_addrs_search)); if (function_addrs == NULL) return callback (data, pc, ln->filename, ln->lineno, NULL); /* If there are multiple function ranges that contain PC, use the last one, in order to produce predictable results. */ while (((size_t) (function_addrs - entry->u->function_addrs + 1) < entry->u->function_addrs_count) && pc >= (function_addrs + 1)->low && pc < (function_addrs + 1)->high) ++function_addrs; function = function_addrs->function; filename = ln->filename; lineno = ln->lineno; ret = report_inlined_functions (pc, function, callback, data, &filename, &lineno); if (ret != 0) return ret; return callback (data, pc, filename, lineno, function->name); } /* Return the file/line information for a PC using the DWARF mapping we built earlier. 
*/ static int dwarf_fileline (struct backtrace_state *state, uintptr_t pc, backtrace_full_callback callback, backtrace_error_callback error_callback, void *data) { struct dwarf_data *ddata; int found; int ret; if (!state->threaded) { for (ddata = (struct dwarf_data *) state->fileline_data; ddata != NULL; ddata = ddata->next) { ret = dwarf_lookup_pc (state, ddata, pc, callback, error_callback, data, &found); if (ret != 0 || found) return ret; } } else { struct dwarf_data **pp; pp = (struct dwarf_data **) (void *) &state->fileline_data; while (1) { ddata = backtrace_atomic_load_pointer (pp); if (ddata == NULL) break; ret = dwarf_lookup_pc (state, ddata, pc, callback, error_callback, data, &found); if (ret != 0 || found) return ret; pp = &ddata->next; } } /* FIXME: See if any libraries have been dlopen'ed. */ return callback (data, pc, NULL, 0, NULL); } /* Initialize our data structures from the DWARF debug info for a file. Return NULL on failure. */ static struct dwarf_data * build_dwarf_data (struct backtrace_state *state, uintptr_t base_address, const unsigned char *dwarf_info, size_t dwarf_info_size, const unsigned char *dwarf_line, size_t dwarf_line_size, const unsigned char *dwarf_abbrev, size_t dwarf_abbrev_size, const unsigned char *dwarf_ranges, size_t dwarf_ranges_size, const unsigned char *dwarf_str, size_t dwarf_str_size, int is_bigendian, backtrace_error_callback error_callback, void *data) { struct unit_addrs_vector addrs_vec; struct unit_addrs *addrs; size_t addrs_count; struct dwarf_data *fdata; if (!build_address_map (state, base_address, dwarf_info, dwarf_info_size, dwarf_abbrev, dwarf_abbrev_size, dwarf_ranges, dwarf_ranges_size, dwarf_str, dwarf_str_size, is_bigendian, error_callback, data, &addrs_vec)) return NULL; if (!backtrace_vector_release (state, &addrs_vec.vec, error_callback, data)) return NULL; addrs = (struct unit_addrs *) addrs_vec.vec.base; addrs_count = addrs_vec.count; backtrace_qsort (addrs, addrs_count, sizeof (struct unit_addrs), 
unit_addrs_compare); fdata = ((struct dwarf_data *) backtrace_alloc (state, sizeof (struct dwarf_data), error_callback, data)); if (fdata == NULL) return NULL; fdata->next = NULL; fdata->base_address = base_address; fdata->addrs = addrs; fdata->addrs_count = addrs_count; fdata->dwarf_info = dwarf_info; fdata->dwarf_info_size = dwarf_info_size; fdata->dwarf_line = dwarf_line; fdata->dwarf_line_size = dwarf_line_size; fdata->dwarf_ranges = dwarf_ranges; fdata->dwarf_ranges_size = dwarf_ranges_size; fdata->dwarf_str = dwarf_str; fdata->dwarf_str_size = dwarf_str_size; fdata->is_bigendian = is_bigendian; memset (&fdata->fvec, 0, sizeof fdata->fvec); return fdata; } /* Build our data structures from the DWARF sections for a module. Set FILELINE_FN and STATE->FILELINE_DATA. Return 1 on success, 0 on failure. */ int backtrace_dwarf_add (struct backtrace_state *state, uintptr_t base_address, const unsigned char *dwarf_info, size_t dwarf_info_size, const unsigned char *dwarf_line, size_t dwarf_line_size, const unsigned char *dwarf_abbrev, size_t dwarf_abbrev_size, const unsigned char *dwarf_ranges, size_t dwarf_ranges_size, const unsigned char *dwarf_str, size_t dwarf_str_size, int is_bigendian, backtrace_error_callback error_callback, void *data, fileline *fileline_fn) { struct dwarf_data *fdata; fdata = build_dwarf_data (state, base_address, dwarf_info, dwarf_info_size, dwarf_line, dwarf_line_size, dwarf_abbrev, dwarf_abbrev_size, dwarf_ranges, dwarf_ranges_size, dwarf_str, dwarf_str_size, is_bigendian, error_callback, data); if (fdata == NULL) return 0; if (!state->threaded) { struct dwarf_data **pp; for (pp = (struct dwarf_data **) (void *) &state->fileline_data; *pp != NULL; pp = &(*pp)->next) ; *pp = fdata; } else { while (1) { struct dwarf_data **pp; pp = (struct dwarf_data **) (void *) &state->fileline_data; while (1) { struct dwarf_data *p; p = backtrace_atomic_load_pointer (pp); if (p == NULL) break; pp = &p->next; } if (__sync_bool_compare_and_swap (pp, NULL, 
fdata)) break; } } *fileline_fn = dwarf_fileline; return 1; }

canu-1.6/src/AS_UTL/libbacktrace/elf.c

/* elf.c -- Get debug data from an ELF file for backtraces. Copyright (C) 2012-2016 Free Software Foundation, Inc. Written by Ian Lance Taylor, Google. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: (1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. (2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. (3) The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "config.h" #if (BACKTRACE_ELF_SIZE == 32) || (BACKTRACE_ELF_SIZE == 64) #include #include #include #ifdef HAVE_DL_ITERATE_PHDR #include <link.h> #endif #include "backtrace.h" #include "internal.h" #ifndef HAVE_DL_ITERATE_PHDR /* Dummy version of dl_iterate_phdr for systems that don't have it.
*/ #define dl_phdr_info x_dl_phdr_info #define dl_iterate_phdr x_dl_iterate_phdr struct dl_phdr_info { uintptr_t dlpi_addr; const char *dlpi_name; }; static int dl_iterate_phdr (int (*callback) (struct dl_phdr_info *, size_t, void *) ATTRIBUTE_UNUSED, void *data ATTRIBUTE_UNUSED) { return 0; } #endif /* ! defined (HAVE_DL_ITERATE_PHDR) */ /* The configure script must tell us whether we are 32-bit or 64-bit ELF. We could make this code test and support either possibility, but there is no point. This code only works for the currently running executable, which means that we know the ELF mode at configure time. */ #if BACKTRACE_ELF_SIZE != 32 && BACKTRACE_ELF_SIZE != 64 #error "Unknown BACKTRACE_ELF_SIZE" #endif /* <link.h> might #include <elf.h> which might define our constants with slightly different values. Undefine them to be safe. */ #undef EI_NIDENT #undef EI_MAG0 #undef EI_MAG1 #undef EI_MAG2 #undef EI_MAG3 #undef EI_CLASS #undef EI_DATA #undef EI_VERSION #undef ELFMAG0 #undef ELFMAG1 #undef ELFMAG2 #undef ELFMAG3 #undef ELFCLASS32 #undef ELFCLASS64 #undef ELFDATA2LSB #undef ELFDATA2MSB #undef EV_CURRENT #undef ET_DYN #undef SHN_LORESERVE #undef SHN_XINDEX #undef SHN_UNDEF #undef SHT_SYMTAB #undef SHT_STRTAB #undef SHT_DYNSYM #undef STT_OBJECT #undef STT_FUNC /* Basic types. */ typedef uint16_t b_elf_half; /* Elf_Half. */ typedef uint32_t b_elf_word; /* Elf_Word. */ typedef int32_t b_elf_sword; /* Elf_Sword. */ #if BACKTRACE_ELF_SIZE == 32 typedef uint32_t b_elf_addr; /* Elf_Addr. */ typedef uint32_t b_elf_off; /* Elf_Off. */ typedef uint32_t b_elf_wxword; /* 32-bit Elf_Word, 64-bit ELF_Xword. */ #else typedef uint64_t b_elf_addr; /* Elf_Addr. */ typedef uint64_t b_elf_off; /* Elf_Off. */ typedef uint64_t b_elf_xword; /* Elf_Xword. */ typedef int64_t b_elf_sxword; /* Elf_Sxword. */ typedef uint64_t b_elf_wxword; /* 32-bit Elf_Word, 64-bit ELF_Xword. */ #endif /* Data structures and associated constants.
*/ #define EI_NIDENT 16 typedef struct { unsigned char e_ident[EI_NIDENT]; /* ELF "magic number" */ b_elf_half e_type; /* Identifies object file type */ b_elf_half e_machine; /* Specifies required architecture */ b_elf_word e_version; /* Identifies object file version */ b_elf_addr e_entry; /* Entry point virtual address */ b_elf_off e_phoff; /* Program header table file offset */ b_elf_off e_shoff; /* Section header table file offset */ b_elf_word e_flags; /* Processor-specific flags */ b_elf_half e_ehsize; /* ELF header size in bytes */ b_elf_half e_phentsize; /* Program header table entry size */ b_elf_half e_phnum; /* Program header table entry count */ b_elf_half e_shentsize; /* Section header table entry size */ b_elf_half e_shnum; /* Section header table entry count */ b_elf_half e_shstrndx; /* Section header string table index */ } b_elf_ehdr; /* Elf_Ehdr. */ #define EI_MAG0 0 #define EI_MAG1 1 #define EI_MAG2 2 #define EI_MAG3 3 #define EI_CLASS 4 #define EI_DATA 5 #define EI_VERSION 6 #define ELFMAG0 0x7f #define ELFMAG1 'E' #define ELFMAG2 'L' #define ELFMAG3 'F' #define ELFCLASS32 1 #define ELFCLASS64 2 #define ELFDATA2LSB 1 #define ELFDATA2MSB 2 #define EV_CURRENT 1 #define ET_DYN 3 typedef struct { b_elf_word sh_name; /* Section name, index in string tbl */ b_elf_word sh_type; /* Type of section */ b_elf_wxword sh_flags; /* Miscellaneous section attributes */ b_elf_addr sh_addr; /* Section virtual addr at execution */ b_elf_off sh_offset; /* Section file offset */ b_elf_wxword sh_size; /* Size of section in bytes */ b_elf_word sh_link; /* Index of another section */ b_elf_word sh_info; /* Additional section information */ b_elf_wxword sh_addralign; /* Section alignment */ b_elf_wxword sh_entsize; /* Entry size if section holds table */ } b_elf_shdr; /* Elf_Shdr. 
*/ #define SHN_UNDEF 0x0000 /* Undefined section */ #define SHN_LORESERVE 0xFF00 /* Begin range of reserved indices */ #define SHN_XINDEX 0xFFFF /* Section index is held elsewhere */ #define SHT_SYMTAB 2 #define SHT_STRTAB 3 #define SHT_DYNSYM 11 #if BACKTRACE_ELF_SIZE == 32 typedef struct { b_elf_word st_name; /* Symbol name, index in string tbl */ b_elf_addr st_value; /* Symbol value */ b_elf_word st_size; /* Symbol size */ unsigned char st_info; /* Symbol binding and type */ unsigned char st_other; /* Visibility and other data */ b_elf_half st_shndx; /* Symbol section index */ } b_elf_sym; /* Elf_Sym. */ #else /* BACKTRACE_ELF_SIZE != 32 */ typedef struct { b_elf_word st_name; /* Symbol name, index in string tbl */ unsigned char st_info; /* Symbol binding and type */ unsigned char st_other; /* Visibility and other data */ b_elf_half st_shndx; /* Symbol section index */ b_elf_addr st_value; /* Symbol value */ b_elf_xword st_size; /* Symbol size */ } b_elf_sym; /* Elf_Sym. */ #endif /* BACKTRACE_ELF_SIZE != 32 */ #define STT_OBJECT 1 #define STT_FUNC 2 /* An index of ELF sections we care about. */ enum debug_section { DEBUG_INFO, DEBUG_LINE, DEBUG_ABBREV, DEBUG_RANGES, DEBUG_STR, DEBUG_MAX }; /* Names of sections, indexed by enum elf_section. */ static const char * const debug_section_names[DEBUG_MAX] = { ".debug_info", ".debug_line", ".debug_abbrev", ".debug_ranges", ".debug_str" }; /* Information we gather for the sections we care about. */ struct debug_section_info { /* Section file offset. */ off_t offset; /* Section size. */ size_t size; /* Section contents, after read from file. */ const unsigned char *data; }; /* Information we keep for an ELF symbol. */ struct elf_symbol { /* The name of the symbol. */ const char *name; /* The address of the symbol. */ uintptr_t address; /* The size of the symbol. */ size_t size; }; /* Information to pass to elf_syminfo. */ struct elf_syminfo_data { /* Symbols for the next module. 
*/ struct elf_syminfo_data *next; /* The ELF symbols, sorted by address. */ struct elf_symbol *symbols; /* The number of symbols. */ size_t count; }; /* A dummy callback function used when we can't find any debug info. */ static int elf_nodebug (struct backtrace_state *state ATTRIBUTE_UNUSED, uintptr_t pc ATTRIBUTE_UNUSED, backtrace_full_callback callback ATTRIBUTE_UNUSED, backtrace_error_callback error_callback, void *data) { error_callback (data, "no debug info in ELF executable", -1); return 0; } /* A dummy callback function used when we can't find a symbol table. */ static void elf_nosyms (struct backtrace_state *state ATTRIBUTE_UNUSED, uintptr_t addr ATTRIBUTE_UNUSED, backtrace_syminfo_callback callback ATTRIBUTE_UNUSED, backtrace_error_callback error_callback, void *data) { error_callback (data, "no symbol table in ELF executable", -1); } /* Compare struct elf_symbol for qsort. */ static int elf_symbol_compare (const void *v1, const void *v2) { const struct elf_symbol *e1 = (const struct elf_symbol *) v1; const struct elf_symbol *e2 = (const struct elf_symbol *) v2; if (e1->address < e2->address) return -1; else if (e1->address > e2->address) return 1; else return 0; } /* Compare an ADDR against an elf_symbol for bsearch. We allocate one extra entry in the array so that this can look safely at the next entry. */ static int elf_symbol_search (const void *vkey, const void *ventry) { const uintptr_t *key = (const uintptr_t *) vkey; const struct elf_symbol *entry = (const struct elf_symbol *) ventry; uintptr_t addr; addr = *key; if (addr < entry->address) return -1; else if (addr >= entry->address + entry->size) return 1; else return 0; } /* Initialize the symbol table info for elf_syminfo. 
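The pairing of elf_symbol_compare and elf_symbol_search above is a general technique: sort symbols once by start address with qsort, then answer "which symbol covers this address?" with bsearch using a comparator that returns 0 whenever the key falls inside the entry's [address, address+size) range. A minimal standalone sketch (the struct and function names here are illustrative, not part of libbacktrace):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Illustrative stand-in for struct elf_symbol.
struct sym { const char *name; uintptr_t address; size_t size; };

// qsort comparator: order symbols by start address, as in elf_symbol_compare.
static int sym_compare (const void *v1, const void *v2) {
  const struct sym *e1 = v1;
  const struct sym *e2 = v2;
  if (e1->address < e2->address) return -1;
  if (e1->address > e2->address) return 1;
  return 0;
}

// bsearch comparator: match when ADDR falls in [address, address+size),
// as in elf_symbol_search.
static int sym_search (const void *vkey, const void *ventry) {
  uintptr_t addr = *(const uintptr_t *) vkey;
  const struct sym *e = ventry;
  if (addr < e->address) return -1;
  if (addr >= e->address + e->size) return 1;
  return 0;
}

// Return the name of the symbol covering ADDR, or NULL if none does.
static const char *sym_find (struct sym *tab, size_t n, uintptr_t addr) {
  struct sym *s;
  qsort (tab, n, sizeof (struct sym), sym_compare);
  s = bsearch (&addr, tab, n, sizeof (struct sym), sym_search);
  return s ? s->name : NULL;
}
```

Note one simplification: the real code allocates one extra array entry so that the search comparator can safely look at the following element; this sketch omits that.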
*/ static int elf_initialize_syminfo (struct backtrace_state *state, uintptr_t base_address, const unsigned char *symtab_data, size_t symtab_size, const unsigned char *strtab, size_t strtab_size, backtrace_error_callback error_callback, void *data, struct elf_syminfo_data *sdata) { size_t sym_count; const b_elf_sym *sym; size_t elf_symbol_count; size_t elf_symbol_size; struct elf_symbol *elf_symbols; size_t i; unsigned int j; sym_count = symtab_size / sizeof (b_elf_sym); /* We only care about function symbols. Count them. */ sym = (const b_elf_sym *) symtab_data; elf_symbol_count = 0; for (i = 0; i < sym_count; ++i, ++sym) { int info; info = sym->st_info & 0xf; if ((info == STT_FUNC || info == STT_OBJECT) && sym->st_shndx != SHN_UNDEF) ++elf_symbol_count; } elf_symbol_size = elf_symbol_count * sizeof (struct elf_symbol); elf_symbols = ((struct elf_symbol *) backtrace_alloc (state, elf_symbol_size, error_callback, data)); if (elf_symbols == NULL) return 0; sym = (const b_elf_sym *) symtab_data; j = 0; for (i = 0; i < sym_count; ++i, ++sym) { int info; info = sym->st_info & 0xf; if (info != STT_FUNC && info != STT_OBJECT) continue; if (sym->st_shndx == SHN_UNDEF) continue; if (sym->st_name >= strtab_size) { error_callback (data, "symbol string index out of range", 0); backtrace_free (state, elf_symbols, elf_symbol_size, error_callback, data); return 0; } elf_symbols[j].name = (const char *) strtab + sym->st_name; elf_symbols[j].address = sym->st_value + base_address; elf_symbols[j].size = sym->st_size; ++j; } backtrace_qsort (elf_symbols, elf_symbol_count, sizeof (struct elf_symbol), elf_symbol_compare); sdata->next = NULL; sdata->symbols = elf_symbols; sdata->count = elf_symbol_count; return 1; } /* Add EDATA to the list in STATE. 
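The threaded branch of elf_add_syminfo_data below appends to a singly-linked list without taking a lock: walk to the tail, try to compare-and-swap the NULL next pointer to the new node, and on a lost race rescan from the head. A standalone sketch of that scheme, assuming the GCC `__sync_bool_compare_and_swap` builtin (the `node` type and `list_append` name are illustrative):

```c
#include <assert.h>
#include <stddef.h>

// Illustrative stand-in for struct elf_syminfo_data.
struct node { struct node *next; int id; };

// Lock-free append: find the current tail, then CAS its NULL next
// pointer to ND; if another thread appended first, start over from
// the head (same retry scheme as elf_add_syminfo_data).
static void list_append (struct node **head, struct node *nd) {
  while (1) {
    struct node **pp = head;
    while (*pp != NULL)  // simplified: the real code uses an atomic load here
      pp = &(*pp)->next;
    if (__sync_bool_compare_and_swap (pp, NULL, nd))
      break;
  }
}
```

The retry-from-head loop is what makes this safe: a failed CAS means the tail moved, so the cached `pp` may be stale and must be recomputed.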
*/ static void elf_add_syminfo_data (struct backtrace_state *state, struct elf_syminfo_data *edata) { if (!state->threaded) { struct elf_syminfo_data **pp; for (pp = (struct elf_syminfo_data **) (void *) &state->syminfo_data; *pp != NULL; pp = &(*pp)->next) ; *pp = edata; } else { while (1) { struct elf_syminfo_data **pp; pp = (struct elf_syminfo_data **) (void *) &state->syminfo_data; while (1) { struct elf_syminfo_data *p; p = backtrace_atomic_load_pointer (pp); if (p == NULL) break; pp = &p->next; } if (__sync_bool_compare_and_swap (pp, NULL, edata)) break; } } } /* Return the symbol name and value for an ADDR. */ static void elf_syminfo (struct backtrace_state *state, uintptr_t addr, backtrace_syminfo_callback callback, backtrace_error_callback error_callback ATTRIBUTE_UNUSED, void *data) { struct elf_syminfo_data *edata; struct elf_symbol *sym = NULL; if (!state->threaded) { for (edata = (struct elf_syminfo_data *) state->syminfo_data; edata != NULL; edata = edata->next) { sym = ((struct elf_symbol *) bsearch (&addr, edata->symbols, edata->count, sizeof (struct elf_symbol), elf_symbol_search)); if (sym != NULL) break; } } else { struct elf_syminfo_data **pp; pp = (struct elf_syminfo_data **) (void *) &state->syminfo_data; while (1) { edata = backtrace_atomic_load_pointer (pp); if (edata == NULL) break; sym = ((struct elf_symbol *) bsearch (&addr, edata->symbols, edata->count, sizeof (struct elf_symbol), elf_symbol_search)); if (sym != NULL) break; pp = &edata->next; } } if (sym == NULL) callback (data, addr, NULL, 0, 0); else callback (data, addr, sym->name, sym->address, sym->size); } /* Add the backtrace data for one ELF file. Returns 1 on success, 0 on failure (in both cases descriptor is closed) or -1 if exe is non-zero and the ELF file is ET_DYN, which tells the caller that elf_add will need to be called on the descriptor again after base_address is determined. 
*/ static int elf_add (struct backtrace_state *state, int descriptor, uintptr_t base_address, backtrace_error_callback error_callback, void *data, fileline *fileline_fn, int *found_sym, int *found_dwarf, int exe) { struct backtrace_view ehdr_view; b_elf_ehdr ehdr; off_t shoff; unsigned int shnum; unsigned int shstrndx; struct backtrace_view shdrs_view; int shdrs_view_valid; const b_elf_shdr *shdrs; const b_elf_shdr *shstrhdr; size_t shstr_size; off_t shstr_off; struct backtrace_view names_view; int names_view_valid; const char *names; unsigned int symtab_shndx; unsigned int dynsym_shndx; unsigned int i; struct debug_section_info sections[DEBUG_MAX]; struct backtrace_view symtab_view; int symtab_view_valid; struct backtrace_view strtab_view; int strtab_view_valid; off_t min_offset; off_t max_offset; struct backtrace_view debug_view; int debug_view_valid; *found_sym = 0; *found_dwarf = 0; shdrs_view_valid = 0; names_view_valid = 0; symtab_view_valid = 0; strtab_view_valid = 0; debug_view_valid = 0; if (!backtrace_get_view (state, descriptor, 0, sizeof ehdr, error_callback, data, &ehdr_view)) goto fail; memcpy (&ehdr, ehdr_view.data, sizeof ehdr); backtrace_release_view (state, &ehdr_view, error_callback, data); if (ehdr.e_ident[EI_MAG0] != ELFMAG0 || ehdr.e_ident[EI_MAG1] != ELFMAG1 || ehdr.e_ident[EI_MAG2] != ELFMAG2 || ehdr.e_ident[EI_MAG3] != ELFMAG3) { error_callback (data, "executable file is not ELF", 0); goto fail; } if (ehdr.e_ident[EI_VERSION] != EV_CURRENT) { error_callback (data, "executable file is unrecognized ELF version", 0); goto fail; } #if BACKTRACE_ELF_SIZE == 32 #define BACKTRACE_ELFCLASS ELFCLASS32 #else #define BACKTRACE_ELFCLASS ELFCLASS64 #endif if (ehdr.e_ident[EI_CLASS] != BACKTRACE_ELFCLASS) { error_callback (data, "executable file is unexpected ELF class", 0); goto fail; } if (ehdr.e_ident[EI_DATA] != ELFDATA2LSB && ehdr.e_ident[EI_DATA] != ELFDATA2MSB) { error_callback (data, "executable file has unknown endianness", 0); goto fail; } /* 
If the executable is ET_DYN, it is either a PIE, or we are running directly a shared library with .interp. We need to wait for dl_iterate_phdr in that case to determine the actual base_address. */ if (exe && ehdr.e_type == ET_DYN) return -1; shoff = ehdr.e_shoff; shnum = ehdr.e_shnum; shstrndx = ehdr.e_shstrndx; if ((shnum == 0 || shstrndx == SHN_XINDEX) && shoff != 0) { struct backtrace_view shdr_view; const b_elf_shdr *shdr; if (!backtrace_get_view (state, descriptor, shoff, sizeof shdr, error_callback, data, &shdr_view)) goto fail; shdr = (const b_elf_shdr *) shdr_view.data; if (shnum == 0) shnum = shdr->sh_size; if (shstrndx == SHN_XINDEX) { shstrndx = shdr->sh_link; /* Versions of the GNU binutils between 2.12 and 2.18 did not handle objects with more than SHN_LORESERVE sections correctly. All large section indexes were offset by 0x100. There is more information at http://sourceware.org/bugzilla/show_bug.cgi?id=5900 . Fortunately these object files are easy to detect, as the GNU binutils always put the section header string table near the end of the list of sections. Thus if the section header string table index is larger than the number of sections, then we know we have to subtract 0x100 to get the real section index. */ if (shstrndx >= shnum && shstrndx >= SHN_LORESERVE + 0x100) shstrndx -= 0x100; } backtrace_release_view (state, &shdr_view, error_callback, data); } /* To translate PC to file/line when using DWARF, we need to find the .debug_info and .debug_line sections. */ /* Read the section headers, skipping the first one. */ if (!backtrace_get_view (state, descriptor, shoff + sizeof (b_elf_shdr), (shnum - 1) * sizeof (b_elf_shdr), error_callback, data, &shdrs_view)) goto fail; shdrs_view_valid = 1; shdrs = (const b_elf_shdr *) shdrs_view.data; /* Read the section names.
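The binutils workaround described above reduces to a small pure function: if the section-header string-table index is both out of range and large enough to be a reserved index offset by 0x100, subtract 0x100. A sketch (the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define MY_SHN_LORESERVE 0xFF00  // start of reserved section indexes

// Undo the off-by-0x100 section index written by GNU binutils 2.12-2.18
// for objects with more than SHN_LORESERVE sections: an index that is
// impossibly large (>= shnum) and past the reserved range must have
// been offset, so subtract 0x100 to recover the real index.
static uint32_t fix_shstrndx (uint32_t shstrndx, uint32_t shnum) {
  if (shstrndx >= shnum && shstrndx >= MY_SHN_LORESERVE + 0x100)
    shstrndx -= 0x100;
  return shstrndx;
}
```

The detection relies on binutils placing the string table near the end of the section list, so a correct index is always < shnum.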
*/ shstrhdr = &shdrs[shstrndx - 1]; shstr_size = shstrhdr->sh_size; shstr_off = shstrhdr->sh_offset; if (!backtrace_get_view (state, descriptor, shstr_off, shstr_size, error_callback, data, &names_view)) goto fail; names_view_valid = 1; names = (const char *) names_view.data; symtab_shndx = 0; dynsym_shndx = 0; memset (sections, 0, sizeof sections); /* Look for the symbol table. */ for (i = 1; i < shnum; ++i) { const b_elf_shdr *shdr; unsigned int sh_name; const char *name; int j; shdr = &shdrs[i - 1]; if (shdr->sh_type == SHT_SYMTAB) symtab_shndx = i; else if (shdr->sh_type == SHT_DYNSYM) dynsym_shndx = i; sh_name = shdr->sh_name; if (sh_name >= shstr_size) { error_callback (data, "ELF section name out of range", 0); goto fail; } name = names + sh_name; for (j = 0; j < (int) DEBUG_MAX; ++j) { if (strcmp (name, debug_section_names[j]) == 0) { sections[j].offset = shdr->sh_offset; sections[j].size = shdr->sh_size; break; } } } if (symtab_shndx == 0) symtab_shndx = dynsym_shndx; if (symtab_shndx != 0) { const b_elf_shdr *symtab_shdr; unsigned int strtab_shndx; const b_elf_shdr *strtab_shdr; struct elf_syminfo_data *sdata; symtab_shdr = &shdrs[symtab_shndx - 1]; strtab_shndx = symtab_shdr->sh_link; if (strtab_shndx >= shnum) { error_callback (data, "ELF symbol table strtab link out of range", 0); goto fail; } strtab_shdr = &shdrs[strtab_shndx - 1]; if (!backtrace_get_view (state, descriptor, symtab_shdr->sh_offset, symtab_shdr->sh_size, error_callback, data, &symtab_view)) goto fail; symtab_view_valid = 1; if (!backtrace_get_view (state, descriptor, strtab_shdr->sh_offset, strtab_shdr->sh_size, error_callback, data, &strtab_view)) goto fail; strtab_view_valid = 1; sdata = ((struct elf_syminfo_data *) backtrace_alloc (state, sizeof *sdata, error_callback, data)); if (sdata == NULL) goto fail; if (!elf_initialize_syminfo (state, base_address, symtab_view.data, symtab_shdr->sh_size, strtab_view.data, strtab_shdr->sh_size, error_callback, data, sdata)) { backtrace_free 
(state, sdata, sizeof *sdata, error_callback, data); goto fail; } /* We no longer need the symbol table, but we hold on to the string table permanently. */ backtrace_release_view (state, &symtab_view, error_callback, data); *found_sym = 1; elf_add_syminfo_data (state, sdata); } /* FIXME: Need to handle compressed debug sections. */ backtrace_release_view (state, &shdrs_view, error_callback, data); shdrs_view_valid = 0; backtrace_release_view (state, &names_view, error_callback, data); names_view_valid = 0; /* Read all the debug sections in a single view, since they are probably adjacent in the file. We never release this view. */ min_offset = 0; max_offset = 0; for (i = 0; i < (int) DEBUG_MAX; ++i) { off_t end; if (sections[i].size == 0) continue; if (min_offset == 0 || sections[i].offset < min_offset) min_offset = sections[i].offset; end = sections[i].offset + sections[i].size; if (end > max_offset) max_offset = end; } if (min_offset == 0 || max_offset == 0) { if (!backtrace_close (descriptor, error_callback, data)) goto fail; return 1; } if (!backtrace_get_view (state, descriptor, min_offset, max_offset - min_offset, error_callback, data, &debug_view)) goto fail; debug_view_valid = 1; /* We've read all we need from the executable. 
*/ if (!backtrace_close (descriptor, error_callback, data)) goto fail; descriptor = -1; for (i = 0; i < (int) DEBUG_MAX; ++i) { if (sections[i].size == 0) sections[i].data = NULL; else sections[i].data = ((const unsigned char *) debug_view.data + (sections[i].offset - min_offset)); } if (!backtrace_dwarf_add (state, base_address, sections[DEBUG_INFO].data, sections[DEBUG_INFO].size, sections[DEBUG_LINE].data, sections[DEBUG_LINE].size, sections[DEBUG_ABBREV].data, sections[DEBUG_ABBREV].size, sections[DEBUG_RANGES].data, sections[DEBUG_RANGES].size, sections[DEBUG_STR].data, sections[DEBUG_STR].size, ehdr.e_ident[EI_DATA] == ELFDATA2MSB, error_callback, data, fileline_fn)) goto fail; *found_dwarf = 1; return 1; fail: if (shdrs_view_valid) backtrace_release_view (state, &shdrs_view, error_callback, data); if (names_view_valid) backtrace_release_view (state, &names_view, error_callback, data); if (symtab_view_valid) backtrace_release_view (state, &symtab_view, error_callback, data); if (strtab_view_valid) backtrace_release_view (state, &strtab_view, error_callback, data); if (debug_view_valid) backtrace_release_view (state, &debug_view, error_callback, data); if (descriptor != -1) backtrace_close (descriptor, error_callback, data); return 0; } /* Data passed to phdr_callback. */ struct phdr_data { struct backtrace_state *state; backtrace_error_callback error_callback; void *data; fileline *fileline_fn; int *found_sym; int *found_dwarf; int exe_descriptor; }; /* Callback passed to dl_iterate_phdr. Load debug info from shared libraries. 
*/ static int #ifdef __i386__ __attribute__ ((__force_align_arg_pointer__)) #endif phdr_callback (struct dl_phdr_info *info, size_t size ATTRIBUTE_UNUSED, void *pdata) { struct phdr_data *pd = (struct phdr_data *) pdata; int descriptor; int does_not_exist; fileline elf_fileline_fn; int found_dwarf; /* There is not much we can do if we don't have the module name, unless executable is ET_DYN, where we expect the very first phdr_callback to be for the PIE. */ if (info->dlpi_name == NULL || info->dlpi_name[0] == '\0') { if (pd->exe_descriptor == -1) return 0; descriptor = pd->exe_descriptor; pd->exe_descriptor = -1; } else { if (pd->exe_descriptor != -1) { backtrace_close (pd->exe_descriptor, pd->error_callback, pd->data); pd->exe_descriptor = -1; } descriptor = backtrace_open (info->dlpi_name, pd->error_callback, pd->data, &does_not_exist); if (descriptor < 0) return 0; } if (elf_add (pd->state, descriptor, info->dlpi_addr, pd->error_callback, pd->data, &elf_fileline_fn, pd->found_sym, &found_dwarf, 0)) { if (found_dwarf) { *pd->found_dwarf = 1; *pd->fileline_fn = elf_fileline_fn; } } return 0; } /* Initialize the backtrace data we need from an ELF executable. At the ELF level, all we need to do is find the debug info sections. */ int backtrace_initialize (struct backtrace_state *state, int descriptor, backtrace_error_callback error_callback, void *data, fileline *fileline_fn) { int ret; int found_sym; int found_dwarf; fileline elf_fileline_fn = elf_nodebug; struct phdr_data pd; ret = elf_add (state, descriptor, 0, error_callback, data, &elf_fileline_fn, &found_sym, &found_dwarf, 1); if (!ret) return 0; pd.state = state; pd.error_callback = error_callback; pd.data = data; pd.fileline_fn = &elf_fileline_fn; pd.found_sym = &found_sym; pd.found_dwarf = &found_dwarf; pd.exe_descriptor = ret < 0 ? 
descriptor : -1; dl_iterate_phdr (phdr_callback, (void *) &pd); if (!state->threaded) { if (found_sym) state->syminfo_fn = elf_syminfo; else if (state->syminfo_fn == NULL) state->syminfo_fn = elf_nosyms; } else { if (found_sym) backtrace_atomic_store_pointer (&state->syminfo_fn, elf_syminfo); else (void) __sync_bool_compare_and_swap (&state->syminfo_fn, NULL, elf_nosyms); } if (!state->threaded) { if (state->fileline_fn == NULL || state->fileline_fn == elf_nodebug) *fileline_fn = elf_fileline_fn; } else { fileline current_fn; current_fn = backtrace_atomic_load_pointer (&state->fileline_fn); if (current_fn == NULL || current_fn == elf_nodebug) *fileline_fn = elf_fileline_fn; } return 1; } #endif // __APPLE__ canu-1.6/src/AS_UTL/libbacktrace/fileline.c000066400000000000000000000117031314437614700205010ustar00rootroot00000000000000/* fileline.c -- Get file and line number information in a backtrace. Copyright (C) 2012-2016 Free Software Foundation, Inc. Written by Ian Lance Taylor, Google. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: (1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. (2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. (3) The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "config.h" #include <sys/types.h> #include <stdlib.h> #include <string.h> #include <errno.h> #include "backtrace.h" #include "internal.h" #ifndef HAVE_GETEXECNAME #define getexecname() NULL #endif /* Initialize the fileline information from the executable. Returns 1 on success, 0 on failure. */ static int fileline_initialize (struct backtrace_state *state, backtrace_error_callback error_callback, void *data) { int failed; fileline fileline_fn; int pass; int called_error_callback; int descriptor; if (!state->threaded) failed = state->fileline_initialization_failed; else failed = backtrace_atomic_load_int (&state->fileline_initialization_failed); if (failed) { error_callback (data, "failed to read executable information", -1); return 0; } if (!state->threaded) fileline_fn = state->fileline_fn; else fileline_fn = backtrace_atomic_load_pointer (&state->fileline_fn); if (fileline_fn != NULL) return 1; /* We have not initialized the information. Do it now.
*/ descriptor = -1; called_error_callback = 0; for (pass = 0; pass < 4; ++pass) { const char *filename; int does_not_exist; switch (pass) { case 0: filename = state->filename; break; case 1: filename = getexecname (); break; case 2: filename = "/proc/self/exe"; break; case 3: filename = "/proc/curproc/file"; break; default: abort (); } if (filename == NULL) continue; descriptor = backtrace_open (filename, error_callback, data, &does_not_exist); if (descriptor < 0 && !does_not_exist) { called_error_callback = 1; break; } if (descriptor >= 0) break; } if (descriptor < 0) { if (!called_error_callback) { if (state->filename != NULL) error_callback (data, state->filename, ENOENT); else error_callback (data, "libbacktrace could not find executable to open", 0); } failed = 1; } if (!failed) { if (!backtrace_initialize (state, descriptor, error_callback, data, &fileline_fn)) failed = 1; } if (failed) { if (!state->threaded) state->fileline_initialization_failed = 1; else backtrace_atomic_store_int (&state->fileline_initialization_failed, 1); return 0; } if (!state->threaded) state->fileline_fn = fileline_fn; else { backtrace_atomic_store_pointer (&state->fileline_fn, fileline_fn); /* Note that if two threads initialize at once, one of the data sets may be leaked. */ } return 1; } /* Given a PC, find the file name, line number, and function name. */ int backtrace_pcinfo (struct backtrace_state *state, uintptr_t pc, backtrace_full_callback callback, backtrace_error_callback error_callback, void *data) { if (!fileline_initialize (state, error_callback, data)) return 0; if (state->fileline_initialization_failed) return 0; return state->fileline_fn (state, pc, callback, error_callback, data); } /* Given a PC, find the symbol for it, and its value. 
*/ int backtrace_syminfo (struct backtrace_state *state, uintptr_t pc, backtrace_syminfo_callback callback, backtrace_error_callback error_callback, void *data) { if (!fileline_initialize (state, error_callback, data)) return 0; if (state->fileline_initialization_failed) return 0; state->syminfo_fn (state, pc, callback, error_callback, data); return 1; } canu-1.6/src/AS_UTL/libbacktrace/internal.h000066400000000000000000000245251314437614700205410ustar00rootroot00000000000000/* internal.h -- Internal header file for stack backtrace library. Copyright (C) 2012-2016 Free Software Foundation, Inc. Written by Ian Lance Taylor, Google. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: (1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. (2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. (3) The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #ifndef BACKTRACE_INTERNAL_H #define BACKTRACE_INTERNAL_H /* We assume that <sys/types.h> and "backtrace.h" have already been included. */ #ifndef GCC_VERSION # define GCC_VERSION (__GNUC__ * 1000 + __GNUC_MINOR__) #endif #if (GCC_VERSION < 2007) # define __attribute__(x) #endif #ifndef ATTRIBUTE_UNUSED # define ATTRIBUTE_UNUSED __attribute__ ((__unused__)) #endif #ifndef ATTRIBUTE_MALLOC # if (GCC_VERSION >= 2096) # define ATTRIBUTE_MALLOC __attribute__ ((__malloc__)) # else # define ATTRIBUTE_MALLOC # endif #endif #ifndef HAVE_SYNC_FUNCTIONS /* Define out the sync functions. These should never be called if they are not available. */ #define __sync_bool_compare_and_swap(A, B, C) (abort(), 1) #define __sync_lock_test_and_set(A, B) (abort(), 0) #define __sync_lock_release(A) abort() #endif /* !defined (HAVE_SYNC_FUNCTIONS) */ #ifdef HAVE_ATOMIC_FUNCTIONS /* We have the atomic builtin functions. */ #define backtrace_atomic_load_pointer(p) \ __atomic_load_n ((p), __ATOMIC_ACQUIRE) #define backtrace_atomic_load_int(p) \ __atomic_load_n ((p), __ATOMIC_ACQUIRE) #define backtrace_atomic_store_pointer(p, v) \ __atomic_store_n ((p), (v), __ATOMIC_RELEASE) #define backtrace_atomic_store_size_t(p, v) \ __atomic_store_n ((p), (v), __ATOMIC_RELEASE) #define backtrace_atomic_store_int(p, v) \ __atomic_store_n ((p), (v), __ATOMIC_RELEASE) #else /* !defined (HAVE_ATOMIC_FUNCTIONS) */ #ifdef HAVE_SYNC_FUNCTIONS /* We have the sync functions but not the atomic functions. Define the atomic ones in terms of the sync ones. */ extern void *backtrace_atomic_load_pointer (void *); extern int backtrace_atomic_load_int (int *); extern void backtrace_atomic_store_pointer (void *, void *); extern void backtrace_atomic_store_size_t (size_t *, size_t); extern void backtrace_atomic_store_int (int *, int); #else /* !defined (HAVE_SYNC_FUNCTIONS) */ /* We have neither the sync nor the atomic functions. These will never be called.
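The backtrace_atomic_* macros above expand, when the __atomic builtins are available, to acquire loads and release stores. A minimal single-threaded sketch of that publish/consume pairing (names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>

static int shared_int;
static void *shared_ptr;

// Release store: everything written before this store (shared_int) is
// guaranteed visible to any thread that acquire-loads shared_ptr and
// observes the new value.
static void publish (void *p, int v) {
  shared_int = v;
  __atomic_store_n (&shared_ptr, p, __ATOMIC_RELEASE);
}

// Acquire load: if we see a non-NULL pointer, the matching release
// store has happened, so reading shared_int afterwards is safe.
static int consume (void **out) {
  void *p = __atomic_load_n (&shared_ptr, __ATOMIC_ACQUIRE);
  if (p == NULL)
    return 0;
  *out = p;
  return shared_int;
}
```

This acquire/release discipline is why the library can initialize fileline_fn and syminfo_fn lazily from multiple threads without a lock on the fast path.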
*/ #define backtrace_atomic_load_pointer(p) (abort(), (void *) NULL) #define backtrace_atomic_load_int(p) (abort(), 0) #define backtrace_atomic_store_pointer(p, v) abort() #define backtrace_atomic_store_size_t(p, v) abort() #define backtrace_atomic_store_int(p, v) abort() #endif /* !defined (HAVE_SYNC_FUNCTIONS) */ #endif /* !defined (HAVE_ATOMIC_FUNCTIONS) */ /* The type of the function that collects file/line information. This is like backtrace_pcinfo. */ typedef int (*fileline) (struct backtrace_state *state, uintptr_t pc, backtrace_full_callback callback, backtrace_error_callback error_callback, void *data); /* The type of the function that collects symbol information. This is like backtrace_syminfo. */ typedef void (*syminfo) (struct backtrace_state *state, uintptr_t pc, backtrace_syminfo_callback callback, backtrace_error_callback error_callback, void *data); /* What the backtrace state pointer points to. */ struct backtrace_state { /* The name of the executable. */ const char *filename; /* Non-zero if threaded. */ int threaded; /* The master lock for fileline_fn, fileline_data, syminfo_fn, syminfo_data, fileline_initialization_failed and everything the data pointers point to. */ void *lock; /* The function that returns file/line information. */ fileline fileline_fn; /* The data to pass to FILELINE_FN. */ void *fileline_data; /* The function that returns symbol information. */ syminfo syminfo_fn; /* The data to pass to SYMINFO_FN. */ void *syminfo_data; /* Whether initializing the file/line information failed. */ int fileline_initialization_failed; /* The lock for the freelist. */ int lock_alloc; /* The freelist when using mmap. */ struct backtrace_freelist_struct *freelist; }; /* Open a file for reading. Returns -1 on error. If DOES_NOT_EXIST is not NULL, *DOES_NOT_EXIST will be set to 0 normally and set to 1 if the file does not exist. If the file does not exist and DOES_NOT_EXIST is not NULL, the function will return -1 and will not call ERROR_CALLBACK. 
On other errors, or if DOES_NOT_EXIST is NULL, the function will call ERROR_CALLBACK before returning. */ extern int backtrace_open (const char *filename, backtrace_error_callback error_callback, void *data, int *does_not_exist); /* A view of the contents of a file. This supports mmap when available. A view will remain in memory even after backtrace_close is called on the file descriptor from which the view was obtained. */ struct backtrace_view { /* The data that the caller requested. */ const void *data; /* The base of the view. */ void *base; /* The total length of the view. */ size_t len; }; /* Create a view of SIZE bytes from DESCRIPTOR at OFFSET. Store the result in *VIEW. Returns 1 on success, 0 on error. */ extern int backtrace_get_view (struct backtrace_state *state, int descriptor, off_t offset, size_t size, backtrace_error_callback error_callback, void *data, struct backtrace_view *view); /* Release a view created by backtrace_get_view. */ extern void backtrace_release_view (struct backtrace_state *state, struct backtrace_view *view, backtrace_error_callback error_callback, void *data); /* Close a file opened by backtrace_open. Returns 1 on success, 0 on error. */ extern int backtrace_close (int descriptor, backtrace_error_callback error_callback, void *data); /* Sort without using memory. */ extern void backtrace_qsort (void *base, size_t count, size_t size, int (*compar) (const void *, const void *)); /* Allocate memory. This is like malloc. If ERROR_CALLBACK is NULL, this does not report an error, it just returns NULL. */ extern void *backtrace_alloc (struct backtrace_state *state, size_t size, backtrace_error_callback error_callback, void *data) ATTRIBUTE_MALLOC; /* Free memory allocated by backtrace_alloc. If ERROR_CALLBACK is NULL, this does not report an error. */ extern void backtrace_free (struct backtrace_state *state, void *mem, size_t size, backtrace_error_callback error_callback, void *data); /* A growable vector of some struct. 
This is used for more efficient allocation when we don't know the final size of some group of data that we want to represent as an array. */ struct backtrace_vector { /* The base of the vector. */ void *base; /* The number of bytes in the vector. */ size_t size; /* The number of bytes available at the current allocation. */ size_t alc; }; /* Grow VEC by SIZE bytes. Return a pointer to the newly allocated bytes. Note that this may move the entire vector to a new memory location. Returns NULL on failure. */ extern void *backtrace_vector_grow (struct backtrace_state *state, size_t size, backtrace_error_callback error_callback, void *data, struct backtrace_vector *vec); /* Finish the current allocation on VEC. Prepare to start a new allocation. The finished allocation will never be freed. Returns a pointer to the base of the finished entries, or NULL on failure. */ extern void* backtrace_vector_finish (struct backtrace_state *state, struct backtrace_vector *vec, backtrace_error_callback error_callback, void *data); /* Release any extra space allocated for VEC. This may change VEC->base. Returns 1 on success, 0 on failure. */ extern int backtrace_vector_release (struct backtrace_state *state, struct backtrace_vector *vec, backtrace_error_callback error_callback, void *data); /* Read initial debug data from a descriptor, and set the fileline_data, syminfo_fn, and syminfo_data fields of STATE. Return the fileline_fn field in *FILELINE_FN--this is done this way so that the synchronization code is only implemented once. This is called after the descriptor has first been opened. It will close the descriptor if it is no longer needed. Returns 1 on success, 0 on error. There will be multiple implementations of this function, for different file formats. Each system will compile the appropriate one.
*/ extern int backtrace_initialize (struct backtrace_state *state, int descriptor, backtrace_error_callback error_callback, void *data, fileline *fileline_fn); /* Add file/line information for a DWARF module. */ extern int backtrace_dwarf_add (struct backtrace_state *state, uintptr_t base_address, const unsigned char* dwarf_info, size_t dwarf_info_size, const unsigned char *dwarf_line, size_t dwarf_line_size, const unsigned char *dwarf_abbrev, size_t dwarf_abbrev_size, const unsigned char *dwarf_ranges, size_t dwarf_range_size, const unsigned char *dwarf_str, size_t dwarf_str_size, int is_bigendian, backtrace_error_callback error_callback, void *data, fileline *fileline_fn); #endif canu-1.6/src/AS_UTL/libbacktrace/make.out000066400000000000000000000030531314437614700202130ustar00rootroot00000000000000#!/bin/sh OPTS="-W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition -Wmissing-format-attribute -Wcast-qual -g -O2" gcc48 -I. -funwind-tables $OPTS -c atomic.c -o atomic.o gcc48 -I. -funwind-tables $OPTS -c dwarf.c -o dwarf.o gcc48 -I. -funwind-tables $OPTS -c fileline.c -o fileline.o gcc48 -I. -funwind-tables $OPTS -c posix.c -o posix.o gcc48 -I. -funwind-tables $OPTS -c print.c -o print.o gcc48 -I. -funwind-tables $OPTS -c sort.c -o sort.o gcc48 -I. -funwind-tables $OPTS -c state.c -o state.o gcc48 -I. -funwind-tables $OPTS -c backtrace.c -o backtrace.o gcc48 -I. -funwind-tables $OPTS -c simple.c -o simple.o gcc48 -I. -funwind-tables $OPTS -c elf.c -o elf.o gcc48 -I. -funwind-tables $OPTS -c mmapio.c -o mmapio.o gcc48 -I. 
-funwind-tables $OPTS -c mmap.c -o mmap.o #/bin/sh ./libtool --tag=CC --mode=link gcc48 -funwind-tables -frandom-seed=libbacktrace.la -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition -Wmissing-format-attribute -Wcast-qual -g -O2 -o libbacktrace.la -rpath /work/canu-stack-trace/src/AS_UTL/libbacktrace/../../../FreeBSD-amd64/lib atomic.lo dwarf.lo fileline.lo posix.lo print.lo sort.lo state.lo backtrace.lo simple.lo elf.lo mmapio.lo mmap.lo ar cru libbacktrace.a atomic.o dwarf.o fileline.o posix.o print.o sort.o state.o backtrace.o simple.o elf.o mmapio.o mmap.o ranlib libbacktrace.a # #atomic.c dwarf.c fileline.c posix.c print.c sort.c state.c backtrace.c simple.c elf.c mmapio.c mmap.c #canu-1.6/src/AS_UTL/libbacktrace/make.sh000066400000000000000000000026711314437614700200230ustar00rootroot00000000000000#!/bin/sh OPTS="-W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition -Wmissing-format-attribute -Wcast-qual -g -O2" gcc48 -I. -funwind-tables $OPTS -c atomic.c -o atomic.o gcc48 -I. -funwind-tables $OPTS -c dwarf.c -o dwarf.o gcc48 -I. -funwind-tables $OPTS -c fileline.c -o fileline.o gcc48 -I. -funwind-tables $OPTS -c posix.c -o posix.o gcc48 -I. -funwind-tables $OPTS -c print.c -o print.o gcc48 -I. -funwind-tables $OPTS -c sort.c -o sort.o gcc48 -I. -funwind-tables $OPTS -c state.c -o state.o gcc48 -I. -funwind-tables $OPTS -c backtrace.c -o backtrace.o gcc48 -I. -funwind-tables $OPTS -c simple.c -o simple.o gcc48 -I. -funwind-tables $OPTS -c elf.c -o elf.o gcc48 -I. -funwind-tables $OPTS -c mmapio.c -o mmapio.o gcc48 -I. 
-funwind-tables $OPTS -c mmap.c -o mmap.o #/bin/sh ./libtool --tag=CC --mode=link gcc48 -funwind-tables -frandom-seed=libbacktrace.la -W -Wall -Wwrite-strings -Wstrict-prototypes -Wmissing-prototypes -Wold-style-definition -Wmissing-format-attribute -Wcast-qual -g -O2 -o libbacktrace.la -rpath /work/canu-stack-trace/src/AS_UTL/libbacktrace/../../../FreeBSD-amd64/lib atomic.lo dwarf.lo fileline.lo posix.lo print.lo sort.lo state.lo backtrace.lo simple.lo elf.lo mmapio.lo mmap.lo ar cru libbacktrace.a atomic.o dwarf.o fileline.o posix.o print.o sort.o state.o backtrace.o simple.o elf.o mmapio.o mmap.o ranlib libbacktrace.a canu-1.6/src/AS_UTL/libbacktrace/mmap.c000066400000000000000000000173121314437614700176460ustar00rootroot00000000000000/* mmap.c -- Memory allocation with mmap. Copyright (C) 2012-2016 Free Software Foundation, Inc. Written by Ian Lance Taylor, Google. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: (1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. (2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. (3) The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "config.h" #include #include #include #include #include #include #include "backtrace.h" #include "internal.h" /* Memory allocation on systems that provide anonymous mmap. This permits the backtrace functions to be invoked from a signal handler, assuming that mmap is async-signal safe. */ #ifndef MAP_ANONYMOUS #define MAP_ANONYMOUS MAP_ANON #endif #ifndef MAP_FAILED #define MAP_FAILED ((void *)-1) #endif /* A list of free memory blocks. */ struct backtrace_freelist_struct { /* Next on list. */ struct backtrace_freelist_struct *next; /* Size of this block, including this structure. */ size_t size; }; /* Free memory allocated by backtrace_alloc. */ static void backtrace_free_locked (struct backtrace_state *state, void *addr, size_t size) { /* Just leak small blocks. We don't have to be perfect. */ if (size >= sizeof (struct backtrace_freelist_struct)) { struct backtrace_freelist_struct *p; p = (struct backtrace_freelist_struct *) addr; p->next = state->freelist; p->size = size; state->freelist = p; } } /* Allocate memory like malloc. If ERROR_CALLBACK is NULL, don't report an error. */ void * backtrace_alloc (struct backtrace_state *state, size_t size, backtrace_error_callback error_callback, void *data) { void *ret; int locked; struct backtrace_freelist_struct **pp; size_t pagesize; size_t asksize; void *page; ret = NULL; /* If we can acquire the lock, then see if there is space on the free list. If we can't acquire the lock, drop straight into using mmap. 
__sync_lock_test_and_set returns the old state of the lock, so we have acquired it if it returns 0. */ if (!state->threaded) locked = 1; else locked = __sync_lock_test_and_set (&state->lock_alloc, 1) == 0; if (locked) { for (pp = &state->freelist; *pp != NULL; pp = &(*pp)->next) { if ((*pp)->size >= size) { struct backtrace_freelist_struct *p; p = *pp; *pp = p->next; /* Round for alignment; we assume that no type we care about is more than 8 bytes. */ size = (size + 7) & ~ (size_t) 7; if (size < p->size) backtrace_free_locked (state, (char *) p + size, p->size - size); ret = (void *) p; break; } } if (state->threaded) __sync_lock_release (&state->lock_alloc); } if (ret == NULL) { /* Allocate a new page. */ pagesize = getpagesize (); asksize = (size + pagesize - 1) & ~ (pagesize - 1); page = mmap (NULL, asksize, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (page == MAP_FAILED) { if (error_callback) error_callback (data, "mmap", errno); } else { size = (size + 7) & ~ (size_t) 7; if (size < asksize) backtrace_free (state, (char *) page + size, asksize - size, error_callback, data); ret = page; } } return ret; } /* Free memory allocated by backtrace_alloc. */ void backtrace_free (struct backtrace_state *state, void *addr, size_t size, backtrace_error_callback error_callback ATTRIBUTE_UNUSED, void *data ATTRIBUTE_UNUSED) { int locked; /* If we are freeing a large aligned block, just release it back to the system. This case arises when growing a vector for a large binary with lots of debug info. Calling munmap here may cause us to call mmap again if there is also a large shared library; we just live with that. */ if (size >= 16 * 4096) { size_t pagesize; pagesize = getpagesize (); if (((uintptr_t) addr & (pagesize - 1)) == 0 && (size & (pagesize - 1)) == 0) { /* If munmap fails for some reason, just add the block to the freelist. */ if (munmap (addr, size) == 0) return; } } /* If we can acquire the lock, add the new space to the free list. 
If we can't acquire the lock, just leak the memory. __sync_lock_test_and_set returns the old state of the lock, so we have acquired it if it returns 0. */ if (!state->threaded) locked = 1; else locked = __sync_lock_test_and_set (&state->lock_alloc, 1) == 0; if (locked) { backtrace_free_locked (state, addr, size); if (state->threaded) __sync_lock_release (&state->lock_alloc); } } /* Grow VEC by SIZE bytes. */ void * backtrace_vector_grow (struct backtrace_state *state,size_t size, backtrace_error_callback error_callback, void *data, struct backtrace_vector *vec) { void *ret; if (size > vec->alc) { size_t pagesize; size_t alc; void *base; pagesize = getpagesize (); alc = vec->size + size; if (vec->size == 0) alc = 16 * size; else if (alc < pagesize) { alc *= 2; if (alc > pagesize) alc = pagesize; } else { alc *= 2; alc = (alc + pagesize - 1) & ~ (pagesize - 1); } base = backtrace_alloc (state, alc, error_callback, data); if (base == NULL) return NULL; if (vec->base != NULL) { memcpy (base, vec->base, vec->size); backtrace_free (state, vec->base, vec->size + vec->alc, error_callback, data); } vec->base = base; vec->alc = alc - vec->size; } ret = (char *) vec->base + vec->size; vec->size += size; vec->alc -= size; return ret; } /* Finish the current allocation on VEC. */ void * backtrace_vector_finish ( struct backtrace_state *state ATTRIBUTE_UNUSED, struct backtrace_vector *vec, backtrace_error_callback error_callback ATTRIBUTE_UNUSED, void *data ATTRIBUTE_UNUSED) { void *ret; ret = vec->base; vec->base = (char *) vec->base + vec->size; vec->size = 0; return ret; } /* Release any extra space allocated for VEC. */ int backtrace_vector_release (struct backtrace_state *state, struct backtrace_vector *vec, backtrace_error_callback error_callback, void *data) { size_t size; size_t alc; size_t aligned; /* Make sure that the block that we free is aligned on an 8-byte boundary. 
*/ size = vec->size; alc = vec->alc; aligned = (size + 7) & ~ (size_t) 7; alc -= aligned - size; backtrace_free (state, (char *) vec->base + aligned, alc, error_callback, data); vec->alc = 0; return 1; } canu-1.6/src/AS_UTL/libbacktrace/mmapio.c000066400000000000000000000056541314437614700202040ustar00rootroot00000000000000/* mmapio.c -- File views using mmap. Copyright (C) 2012-2016 Free Software Foundation, Inc. Written by Ian Lance Taylor, Google. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: (1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. (2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. (3) The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "config.h" #include #include #include #include #include "backtrace.h" #include "internal.h" #ifndef MAP_FAILED #define MAP_FAILED ((void *)-1) #endif /* This file implements file views and memory allocation when mmap is available. */ /* Create a view of SIZE bytes from DESCRIPTOR at OFFSET. */ int backtrace_get_view (struct backtrace_state *state ATTRIBUTE_UNUSED, int descriptor, off_t offset, size_t size, backtrace_error_callback error_callback, void *data, struct backtrace_view *view) { size_t pagesize; unsigned int inpage; off_t pageoff; void *map; pagesize = getpagesize (); inpage = offset % pagesize; pageoff = offset - inpage; size += inpage; size = (size + (pagesize - 1)) & ~ (pagesize - 1); map = mmap (NULL, size, PROT_READ, MAP_PRIVATE, descriptor, pageoff); if (map == MAP_FAILED) { error_callback (data, "mmap", errno); return 0; } view->data = (char *) map + inpage; view->base = map; view->len = size; return 1; } /* Release a view read by backtrace_get_view. */ void backtrace_release_view (struct backtrace_state *state ATTRIBUTE_UNUSED, struct backtrace_view *view, backtrace_error_callback error_callback, void *data) { union { const void *cv; void *v; } const_cast; const_cast.cv = view->base; if (munmap (const_cast.v, view->len) < 0) error_callback (data, "munmap", errno); } canu-1.6/src/AS_UTL/libbacktrace/posix.c000066400000000000000000000054661314437614700200650ustar00rootroot00000000000000/* posix.c -- POSIX file I/O routines for the backtrace library. Copyright (C) 2012-2016 Free Software Foundation, Inc. Written by Ian Lance Taylor, Google. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: (1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 
(2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. (3) The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "config.h" #include #include #include #include #include #include "backtrace.h" #include "internal.h" #ifndef O_BINARY #define O_BINARY 0 #endif #ifndef O_CLOEXEC #define O_CLOEXEC 0 #endif #ifndef FD_CLOEXEC #define FD_CLOEXEC 1 #endif /* Open a file for reading. */ int backtrace_open (const char *filename, backtrace_error_callback error_callback, void *data, int *does_not_exist) { int descriptor; if (does_not_exist != NULL) *does_not_exist = 0; descriptor = open (filename, (int) (O_RDONLY | O_BINARY | O_CLOEXEC)); if (descriptor < 0) { if (does_not_exist != NULL && errno == ENOENT) *does_not_exist = 1; else error_callback (data, filename, errno); return -1; } #ifdef HAVE_FCNTL /* Set FD_CLOEXEC just in case the kernel does not support O_CLOEXEC. It doesn't matter if this fails for some reason. FIXME: At some point it should be safe to only do this if O_CLOEXEC == 0. 
*/ fcntl (descriptor, F_SETFD, FD_CLOEXEC); #endif return descriptor; } /* Close DESCRIPTOR. */ int backtrace_close (int descriptor, backtrace_error_callback error_callback, void *data) { if (close (descriptor) < 0) { error_callback (data, "close", errno); return 0; } return 1; } canu-1.6/src/AS_UTL/libbacktrace/print.c000066400000000000000000000053301314437614700200450ustar00rootroot00000000000000/* print.c -- Print the current backtrace. Copyright (C) 2012-2016 Free Software Foundation, Inc. Written by Ian Lance Taylor, Google. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: (1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. (2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. (3) The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "config.h" #include #include #include #include "backtrace.h" #include "internal.h" /* Passed to callbacks. 
*/ struct print_data { struct backtrace_state *state; FILE *f; }; /* Print one level of a backtrace. */ static int print_callback (void *data, uintptr_t pc, const char *filename, int lineno, const char *function) { struct print_data *pdata = (struct print_data *) data; fprintf (pdata->f, "0x%lx %s\n\t%s:%d\n", (unsigned long) pc, function == NULL ? "???" : function, filename == NULL ? "???" : filename, lineno); return 0; } /* Print errors to stderr. */ static void error_callback (void *data, const char *msg, int errnum) { struct print_data *pdata = (struct print_data *) data; if (pdata->state->filename != NULL) fprintf (stderr, "%s: ", pdata->state->filename); fprintf (stderr, "libbacktrace: %s", msg); if (errnum > 0) fprintf (stderr, ": %s", strerror (errnum)); fputc ('\n', stderr); } /* Print a backtrace. */ void backtrace_print (struct backtrace_state *state, int skip, FILE *f) { struct print_data data; data.state = state; data.f = f; backtrace_full (state, skip + 1, print_callback, error_callback, (void *) &data); } canu-1.6/src/AS_UTL/libbacktrace/simple.c000066400000000000000000000061621314437614700202060ustar00rootroot00000000000000/* simple.c -- The backtrace_simple function. Copyright (C) 2012-2016 Free Software Foundation, Inc. Written by Ian Lance Taylor, Google. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: (1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. (2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. (3) The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission. 
THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.  */

#include "config.h"

#include "unwind.h"
#include "backtrace.h"

/* The simple_backtrace routine.  */

/* Data passed through _Unwind_Backtrace.  */

struct backtrace_simple_data
{
  /* Number of frames to skip.  */
  int skip;
  /* Library state.  */
  struct backtrace_state *state;
  /* Callback routine.  */
  backtrace_simple_callback callback;
  /* Error callback routine.  */
  backtrace_error_callback error_callback;
  /* Data to pass to callback routine.  */
  void *data;
  /* Value to return from backtrace.  */
  int ret;
};

/* Unwind library callback routine.  This is passed to
   _Unwind_Backtrace.  */

static _Unwind_Reason_Code
simple_unwind (struct _Unwind_Context *context, void *vdata)
{
  struct backtrace_simple_data *bdata = (struct backtrace_simple_data *) vdata;
  uintptr_t pc;
  int ip_before_insn = 0;

#ifdef HAVE_GETIPINFO
  pc = _Unwind_GetIPInfo (context, &ip_before_insn);
#else
  pc = _Unwind_GetIP (context);
#endif

  if (bdata->skip > 0)
    {
      --bdata->skip;
      return _URC_NO_REASON;
    }

  if (!ip_before_insn)
    --pc;

  bdata->ret = bdata->callback (bdata->data, pc);
  if (bdata->ret != 0)
    return _URC_END_OF_STACK;

  return _URC_NO_REASON;
}

/* Get a simple stack backtrace.
*/ int backtrace_simple (struct backtrace_state *state, int skip, backtrace_simple_callback callback, backtrace_error_callback error_callback, void *data) { struct backtrace_simple_data bdata; bdata.skip = skip + 1; bdata.state = state; bdata.callback = callback; bdata.error_callback = error_callback; bdata.data = data; bdata.ret = 0; _Unwind_Backtrace (simple_unwind, &bdata); return bdata.ret; } canu-1.6/src/AS_UTL/libbacktrace/sort.c000066400000000000000000000061471314437614700177070ustar00rootroot00000000000000/* sort.c -- Sort without allocating memory Copyright (C) 2012-2016 Free Software Foundation, Inc. Written by Ian Lance Taylor, Google. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: (1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. (2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. (3) The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "config.h" #include #include #include "backtrace.h" #include "internal.h" /* The GNU glibc version of qsort allocates memory, which we must not do if we are invoked by a signal handler. So provide our own sort. */ static void swap (char *a, char *b, size_t size) { size_t i; for (i = 0; i < size; i++, a++, b++) { char t; t = *a; *a = *b; *b = t; } } void backtrace_qsort (void *basearg, size_t count, size_t size, int (*compar) (const void *, const void *)) { char *base = (char *) basearg; size_t i; size_t mid; tail_recurse: if (count < 2) return; /* The symbol table and DWARF tables, which is all we use this routine for, tend to be roughly sorted. Pick the middle element in the array as our pivot point, so that we are more likely to cut the array in half for each recursion step. */ swap (base, base + (count / 2) * size, size); mid = 0; for (i = 1; i < count; i++) { if ((*compar) (base, base + i * size) > 0) { ++mid; if (i != mid) swap (base + mid * size, base + i * size, size); } } if (mid > 0) swap (base, base + mid * size, size); /* Recurse with the smaller array, loop with the larger one. That ensures that our maximum stack depth is log count. */ if (2 * mid < count) { backtrace_qsort (base, mid, size, compar); base += (mid + 1) * size; count -= mid + 1; goto tail_recurse; } else { backtrace_qsort (base + (mid + 1) * size, count - (mid + 1), size, compar); count = mid; goto tail_recurse; } } canu-1.6/src/AS_UTL/libbacktrace/state.c000066400000000000000000000046011314437614700200310ustar00rootroot00000000000000/* state.c -- Create the backtrace state. Copyright (C) 2012-2016 Free Software Foundation, Inc. Written by Ian Lance Taylor, Google. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: (1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 
(2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. (3) The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "config.h" #include #include #include "backtrace.h" #include "backtrace-supported.h" #include "internal.h" /* Create the backtrace state. This will then be passed to all the other routines. 
*/ struct backtrace_state * backtrace_create_state (const char *filename, int threaded, backtrace_error_callback error_callback, void *data) { struct backtrace_state init_state; struct backtrace_state *state; #ifndef HAVE_SYNC_FUNCTIONS if (threaded) { error_callback (data, "backtrace library does not support threads", 0); return NULL; } #endif memset (&init_state, 0, sizeof init_state); init_state.filename = filename; init_state.threaded = threaded; state = ((struct backtrace_state *) backtrace_alloc (&init_state, sizeof *state, error_callback, data)); if (state == NULL) return NULL; *state = init_state; return state; } canu-1.6/src/AS_UTL/libbacktrace/unknown.c000066400000000000000000000045771314437614700204240ustar00rootroot00000000000000/* unknown.c -- used when backtrace configury does not know file format. Copyright (C) 2012-2016 Free Software Foundation, Inc. Written by Ian Lance Taylor, Google. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: (1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. (2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. (3) The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "config.h" #if (BACKTRACE_ELF_SIZE == unknown) #include #include "backtrace.h" #include "internal.h" /* A trivial routine that always fails to find fileline data. */ static int unknown_fileline (struct backtrace_state *state ATTRIBUTE_UNUSED, uintptr_t pc, backtrace_full_callback callback, backtrace_error_callback error_callback ATTRIBUTE_UNUSED, void *data) { return callback (data, pc, NULL, 0, NULL); } /* Initialize the backtrace data when we don't know how to read the debug info. */ int backtrace_initialize (struct backtrace_state *state ATTRIBUTE_UNUSED, int descriptor ATTRIBUTE_UNUSED, backtrace_error_callback error_callback ATTRIBUTE_UNUSED, void *data ATTRIBUTE_UNUSED, fileline *fileline_fn) { state->fileline_data = NULL; *fileline_fn = unknown_fileline; return 1; } #endif // __APPLE__ canu-1.6/src/AS_UTL/md5.C000066400000000000000000000304061314437614700147320ustar00rootroot00000000000000#include "md5.H" // The RSA MD5 implementation. Functions md5_* (at the end) are glue // to kmer libutil. // See RFC1321, "The MD5 Message-Digest Algorithm", R. Rivest. // Copyright (C) 1991-2, RSA Data Security, Inc. Created 1991. All // rights reserved. // // License to copy and use this software is granted provided that it // is identified as the "RSA Data Security, Inc. MD5 Message-Digest // Algorithm" in all material mentioning or referencing this software // or this function. 
// // License is also granted to make and use derivative works provided // that such works are identified as "derived from the RSA Data // Security, Inc. MD5 Message-Digest Algorithm" in all material // mentioning or referencing the derived work. // // RSA Data Security, Inc. makes no representations concerning either // the merchantability of this software or the suitability of this // software for any particular purpose. It is provided "as is" // without express or implied warranty of any kind. // // These notices must be retained in any copies of any part of this // documentation and/or software. typedef struct { uint32 state[4]; // state (ABCD) uint32 count[2]; // number of bits, modulo 2^64 (lsb first) unsigned char buffer[64]; // input buffer } MD5_CTX; static void MD5Init(MD5_CTX *); static void MD5Update(MD5_CTX *, unsigned char const *, size_t); static void MD5Final(unsigned char [16], MD5_CTX *); static void MD5Transform(uint32 [4], unsigned char const [64]); static void Encode(unsigned char *, uint32 *, unsigned int); static void Decode(uint32 *, unsigned char const *, unsigned int); // Constants for MD5Transform routine. #define S11 7 #define S12 12 #define S13 17 #define S14 22 #define S21 5 #define S22 9 #define S23 14 #define S24 20 #define S31 4 #define S32 11 #define S33 16 #define S34 23 #define S41 6 #define S42 10 #define S43 15 #define S44 21 static unsigned char PADDING[64] = { 0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; // F, G, H and I are basic MD5 functions. #define F(x, y, z) (((x) & (y)) | ((~x) & (z))) #define G(x, y, z) (((x) & (z)) | ((y) & (~z))) #define H(x, y, z) ((x) ^ (y) ^ (z)) #define I(x, y, z) ((y) ^ ((x) | (~z))) // ROTATE_LEFT rotates x left n bits. #define ROTATE_LEFT(x, n) (((x) << (n)) | ((x) >> (32-(n)))) // FF, GG, HH, and II transformations for rounds 1, 2, 3, and 4. 
// Rotation is separate from addition to prevent recomputation. #define FF(a, b, c, d, x, s, ac) { \ (a) += F ((b), (c), (d)) + (x) + (uint32)(ac); \ (a) = ROTATE_LEFT ((a), (s)); \ (a) += (b); \ } #define GG(a, b, c, d, x, s, ac) { \ (a) += G ((b), (c), (d)) + (x) + (uint32)(ac); \ (a) = ROTATE_LEFT ((a), (s)); \ (a) += (b); \ } #define HH(a, b, c, d, x, s, ac) { \ (a) += H ((b), (c), (d)) + (x) + (uint32)(ac); \ (a) = ROTATE_LEFT ((a), (s)); \ (a) += (b); \ } #define II(a, b, c, d, x, s, ac) { \ (a) += I ((b), (c), (d)) + (x) + (uint32)(ac); \ (a) = ROTATE_LEFT ((a), (s)); \ (a) += (b); \ } // MD5 initialization. Begins an MD5 operation, writing a new context. // void MD5Init (MD5_CTX *context) { context->count[0] = context->count[1] = 0; // Load magic initialization constants. context->state[0] = 0x67452301; context->state[1] = 0xefcdab89; context->state[2] = 0x98badcfe; context->state[3] = 0x10325476; } // MD5 block update operation. Continues an MD5 message-digest // operation, processing another message block, and updating the // context. // void MD5Update (MD5_CTX *context, unsigned char const *input, size_t inputLen) { unsigned int i, index, partLen; // Compute number of bytes mod 64 index = (unsigned int)((context->count[0] >> 3) & 0x3F); // Update number of bits if ((context->count[0] += ((uint32)inputLen << 3)) < ((uint32)inputLen << 3)) context->count[1]++; context->count[1] += ((uint32)inputLen >> 29); partLen = 64 - index; // Transform as many times as possible. if (inputLen >= partLen) { memcpy(&context->buffer[index], input, partLen); MD5Transform(context->state, context->buffer); for (i = partLen; i + 63 < inputLen; i += 64) MD5Transform(context->state, &input[i]); index = 0; } else i = 0; // Buffer remaining input memcpy(&context->buffer[index], &input[i], inputLen-i); } // MD5 finalization. Ends an MD5 message-digest operation, writing the // message digest and zeroizing the context.
// void MD5Final (unsigned char digest[16], MD5_CTX *context) { unsigned char bits[8]; unsigned int index, padLen; // Save number of bits Encode (bits, context->count, 8); // Pad out to 56 mod 64. index = (unsigned int)((context->count[0] >> 3) & 0x3f); padLen = (index < 56) ? (56 - index) : (120 - index); MD5Update (context, PADDING, padLen); // Append length (before padding) MD5Update (context, bits, 8); // Store state in digest Encode (digest, context->state, 16); // Zeroize sensitive information. memset(context, 0, sizeof(*context)); } // MD5 basic transformation. Transforms state based on block. // static void MD5Transform(uint32 state[4], unsigned char const block[64]) { uint32 a = state[0], b = state[1], c = state[2], d = state[3], x[16]; Decode(x, block, 64); // Round 1 FF (a, b, c, d, x[ 0], S11, 0xd76aa478); // 1 FF (d, a, b, c, x[ 1], S12, 0xe8c7b756); // 2 FF (c, d, a, b, x[ 2], S13, 0x242070db); // 3 FF (b, c, d, a, x[ 3], S14, 0xc1bdceee); // 4 FF (a, b, c, d, x[ 4], S11, 0xf57c0faf); // 5 FF (d, a, b, c, x[ 5], S12, 0x4787c62a); // 6 FF (c, d, a, b, x[ 6], S13, 0xa8304613); // 7 FF (b, c, d, a, x[ 7], S14, 0xfd469501); // 8 FF (a, b, c, d, x[ 8], S11, 0x698098d8); // 9 FF (d, a, b, c, x[ 9], S12, 0x8b44f7af); // 10 FF (c, d, a, b, x[10], S13, 0xffff5bb1); // 11 FF (b, c, d, a, x[11], S14, 0x895cd7be); // 12 FF (a, b, c, d, x[12], S11, 0x6b901122); // 13 FF (d, a, b, c, x[13], S12, 0xfd987193); // 14 FF (c, d, a, b, x[14], S13, 0xa679438e); // 15 FF (b, c, d, a, x[15], S14, 0x49b40821); // 16 // Round 2 GG (a, b, c, d, x[ 1], S21, 0xf61e2562); // 17 GG (d, a, b, c, x[ 6], S22, 0xc040b340); // 18 GG (c, d, a, b, x[11], S23, 0x265e5a51); // 19 GG (b, c, d, a, x[ 0], S24, 0xe9b6c7aa); // 20 GG (a, b, c, d, x[ 5], S21, 0xd62f105d); // 21 GG (d, a, b, c, x[10], S22, 0x2441453); // 22 GG (c, d, a, b, x[15], S23, 0xd8a1e681); // 23 GG (b, c, d, a, x[ 4], S24, 0xe7d3fbc8); // 24 GG (a, b, c, d, x[ 9], S21, 0x21e1cde6); // 25 GG (d, a, b, c, x[14], S22, 
0xc33707d6); // 26 GG (c, d, a, b, x[ 3], S23, 0xf4d50d87); // 27 GG (b, c, d, a, x[ 8], S24, 0x455a14ed); // 28 GG (a, b, c, d, x[13], S21, 0xa9e3e905); // 29 GG (d, a, b, c, x[ 2], S22, 0xfcefa3f8); // 30 GG (c, d, a, b, x[ 7], S23, 0x676f02d9); // 31 GG (b, c, d, a, x[12], S24, 0x8d2a4c8a); // 32 // Round 3 HH (a, b, c, d, x[ 5], S31, 0xfffa3942); // 33 HH (d, a, b, c, x[ 8], S32, 0x8771f681); // 34 HH (c, d, a, b, x[11], S33, 0x6d9d6122); // 35 HH (b, c, d, a, x[14], S34, 0xfde5380c); // 36 HH (a, b, c, d, x[ 1], S31, 0xa4beea44); // 37 HH (d, a, b, c, x[ 4], S32, 0x4bdecfa9); // 38 HH (c, d, a, b, x[ 7], S33, 0xf6bb4b60); // 39 HH (b, c, d, a, x[10], S34, 0xbebfbc70); // 40 HH (a, b, c, d, x[13], S31, 0x289b7ec6); // 41 HH (d, a, b, c, x[ 0], S32, 0xeaa127fa); // 42 HH (c, d, a, b, x[ 3], S33, 0xd4ef3085); // 43 HH (b, c, d, a, x[ 6], S34, 0x4881d05); // 44 HH (a, b, c, d, x[ 9], S31, 0xd9d4d039); // 45 HH (d, a, b, c, x[12], S32, 0xe6db99e5); // 46 HH (c, d, a, b, x[15], S33, 0x1fa27cf8); // 47 HH (b, c, d, a, x[ 2], S34, 0xc4ac5665); // 48 // Round 4 II (a, b, c, d, x[ 0], S41, 0xf4292244); // 49 II (d, a, b, c, x[ 7], S42, 0x432aff97); // 50 II (c, d, a, b, x[14], S43, 0xab9423a7); // 51 II (b, c, d, a, x[ 5], S44, 0xfc93a039); // 52 II (a, b, c, d, x[12], S41, 0x655b59c3); // 53 II (d, a, b, c, x[ 3], S42, 0x8f0ccc92); // 54 II (c, d, a, b, x[10], S43, 0xffeff47d); // 55 II (b, c, d, a, x[ 1], S44, 0x85845dd1); // 56 II (a, b, c, d, x[ 8], S41, 0x6fa87e4f); // 57 II (d, a, b, c, x[15], S42, 0xfe2ce6e0); // 58 II (c, d, a, b, x[ 6], S43, 0xa3014314); // 59 II (b, c, d, a, x[13], S44, 0x4e0811a1); // 60 II (a, b, c, d, x[ 4], S41, 0xf7537e82); // 61 II (d, a, b, c, x[11], S42, 0xbd3af235); // 62 II (c, d, a, b, x[ 2], S43, 0x2ad7d2bb); // 63 II (b, c, d, a, x[ 9], S44, 0xeb86d391); // 64 state[0] += a; state[1] += b; state[2] += c; state[3] += d; // Zeroize sensitive information. 
memset (x, 0, sizeof(x)); } // Encodes input (uint32) into output (unsigned char). Assumes len is // a multiple of 4. // static void Encode (unsigned char *output, uint32 *input, unsigned int len) { unsigned int i, j; for (i = 0, j = 0; j < len; i++, j += 4) { output[j] = (unsigned char)(input[i] & 0xff); output[j+1] = (unsigned char)((input[i] >> 8) & 0xff); output[j+2] = (unsigned char)((input[i] >> 16) & 0xff); output[j+3] = (unsigned char)((input[i] >> 24) & 0xff); } } // Decodes input (unsigned char) into output (uint32). Assumes len is // a multiple of 4. // static void Decode (uint32 *output, unsigned char const *input, unsigned int len) { unsigned int i, j; for (i = 0, j = 0; j < len; i++, j += 4) output[i] = ((uint32)input[j]) | (((uint32)input[j+1]) << 8) | (((uint32)input[j+2]) << 16) | (((uint32)input[j+3]) << 24); } //////////////////////////////////////////////////////////////////////////////// // // kmer glue functions // //////////////////////////////////////////////////////////////////////////////// int md5_compare(void const *a, void const *b) { md5_s const *A = (md5_s const *)a; md5_s const *B = (md5_s const *)b; if (A->a < B->a) return(-1); if (A->a > B->a) return(1); if (A->b < B->b) return(-1); if (A->b > B->b) return(1); return(0); } static const char *md5_letters = "0123456789abcdef"; char* md5_toascii(md5_s *m, char *s) { int i; for (i=0; i<16; i++) { s[15-i ] = md5_letters[(m->a >> 4*i) & 0x0f]; s[15-i+16] = md5_letters[(m->b >> 4*i) & 0x0f]; } s[32] = 0; return(s); } md5_s* md5_string(md5_s *m, char *s, uint32 l) { MD5_CTX ctx; unsigned char dig[16]; int i = 0; if (m == NULL) { errno = 0; m = new md5_s; if (errno) { fprintf(stderr, "md5_string()-- Can't allocate a md5_s.\n%s\n", strerror(errno)); exit(1); } } MD5Init(&ctx); MD5Update(&ctx, (unsigned char*)s, l); MD5Final(dig, &ctx); m->a = dig[0]; while (i<8) { m->a <<= 8; m->a |= dig[i++]; } m->b = dig[i++]; while (i<16) { m->b <<= 8; m->b |= dig[i++]; } return(m); } static 
md5_increment_s* md5_increment_initialize(void) { md5_increment_s *m; errno = 0; m = new md5_increment_s; if (errno) { fprintf(stderr, "md5_increment_*()-- Can't allocate a md5_increment_s.\n%s\n", strerror(errno)); exit(1); } m->context = new MD5_CTX; if (errno) { fprintf(stderr, "md5_increment_*()-- Can't allocate a md5 context.\n%s\n", strerror(errno)); exit(1); } MD5Init((MD5_CTX *)m->context); m->bufferPos = 0; return(m); } md5_increment_s* md5_increment_char(md5_increment_s *m, char s) { if (m == NULL) m = md5_increment_initialize(); m->buffer[m->bufferPos++] = s; if (m->bufferPos == MD5_BUFFER_SIZE) { MD5Update((MD5_CTX *)m->context, m->buffer, m->bufferPos); m->bufferPos = 0; } return(m); } md5_increment_s* md5_increment_block(md5_increment_s *m, char *s, uint32 l) { if (m == NULL) m = md5_increment_initialize(); MD5Update((MD5_CTX *)m->context, (unsigned char*)s, l); return(m); } void md5_increment_finalize(md5_increment_s *m) { MD5_CTX *ctx = (MD5_CTX *)m->context; unsigned char dig[16]; int i = 0; if (m->bufferPos > 0) { MD5Update((MD5_CTX *)m->context, m->buffer, m->bufferPos); m->bufferPos = 0; } MD5Final(dig, ctx); m->a = dig[0]; while (i<8) { m->a <<= 8; m->a |= dig[i++]; } m->b = dig[i++]; while (i<16) { m->b <<= 8; m->b |= dig[i++]; } m->context = 0L; delete ctx; } void md5_increment_destroy(md5_increment_s *m) { delete m; } canu-1.6/src/AS_UTL/md5.H000066400000000000000000000022211314437614700147310ustar00rootroot00000000000000#include "AS_global.H" typedef struct { uint64 a; uint64 b; uint32 i; // the iid, used in leaff uint32 pad; // keep us size compatible between 32- and 64-bit machines. } md5_s; #define MD5_BUFFER_SIZE 32*1024 typedef struct { uint64 a; uint64 b; void *context; int bufferPos; unsigned char buffer[MD5_BUFFER_SIZE]; } md5_increment_s; // Returns -1, 0, 1 depending on if a <, ==, > b. Suitable for // qsort(). // int md5_compare(void const *a, void const *b); // Converts an md5_s into a character string. 
s must be at least // 33 bytes long. // char *md5_toascii(md5_s *m, char *s); // Computes the md5 checksum on the string s. // md5_s *md5_string(md5_s *m, char *s, uint32 l); // Computes an md5 checksum piece by piece. // // If m is NULL, a new md5_increment_s is allocated and returned. // md5_increment_s *md5_increment_char(md5_increment_s *m, char s); md5_increment_s *md5_increment_block(md5_increment_s *m, char *s, uint32 l); void md5_increment_finalize(md5_increment_s *m); void md5_increment_destroy(md5_increment_s *m); canu-1.6/src/AS_UTL/memoryMappedFile.H000066400000000000000000000133211314437614700175060ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/memoryMappedFile.H * * Modifications by: * * Brian P. Walenz from 2012-FEB-16 to 2013-AUG-01 * are Copyright 2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-NOV-26 to 2015-APR-21 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Sergey Koren beginning on 2015-DEC-15 * are a 'United States Government Work', and * are released in the public domain * * Brian P. 
Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef AS_UTL_MEMORYMAPPEDFILE_H #define AS_UTL_MEMORYMAPPEDFILE_H #include "AS_global.H" #include "AS_UTL_fileIO.H" #include #include #include #include using namespace std; // The BSD's are able to map to an arbitrary position in the file, but the Linux's can only map to // multiples of pagesize. Thus, this class maps the whole file into address space, then passes out // pointers to pieces in it. This is slightly unfortunate, because array out-of-bounds will not be // caught. To be fair, on the BSD's the file is mapped to a length that is a multiple of pagesize, // so it would take a big out-of-bounds to fail. enum memoryMappedFileType { memoryMappedFile_readOnly = 0x00, memoryMappedFile_readWrite = 0x01 }; #ifndef MAP_POPULATE #define MAP_POPULATE 0 #endif class memoryMappedFile { public: memoryMappedFile(const char *name, memoryMappedFileType type = memoryMappedFile_readOnly) { strcpy(_name, name); _type = type; errno = 0; int fd = (_type == memoryMappedFile_readOnly) ? open(_name, O_RDONLY | O_LARGEFILE) : open(_name, O_RDWR | O_LARGEFILE); if (errno) fprintf(stderr, "memoryMappedFile()-- Couldn't open '%s' for mmap: %s\n", _name, strerror(errno)), exit(1); struct stat sb; fstat(fd, &sb); if (errno) fprintf(stderr, "memoryMappedFile()-- Couldn't stat '%s' for mmap: %s\n", _name, strerror(errno)), exit(1); _length = sb.st_size; _offset = 0; if (_length == 0) fprintf(stderr, "memoryMappedFile()-- File '%s' is empty, can't mmap.\n", _name), exit(1); // Map a region that allows reading, or reading and shared writing. One could add PROT_WRITE // to the readOnly, and modifications will be kept private (if MAP_SHARED is switched to // MAP_PRIVATE) to the process (and discarded at the end). 
// // FreeBSD supports MAP_NOCORE which will exclude the region from any core files generated. Linux does not support it. // // Linux supports MAP_NORESERVE which will not reserve swap space for the file. When reserved, a write is guaranteed to succeed. // // NOTA BENE!! Even though it is writable, it CANNOT be extended. _data = (_type == memoryMappedFile_readOnly) ? mmap(0L, _length, PROT_READ, MAP_FILE | MAP_PRIVATE | MAP_POPULATE, fd, 0) : mmap(0L, _length, PROT_READ | PROT_WRITE, MAP_FILE | MAP_SHARED, fd, 0); if (errno) fprintf(stderr, "memoryMappedFile()-- Couldn't mmap '%s' of length " F_SIZE_T ": %s\n", _name, _length, strerror(errno)), exit(1); close(fd); //fprintf(stderr, "memoryMappedFile()-- File '%s' of length %lu is mapped.\n", _name, _length); }; ~memoryMappedFile() { if (_type == memoryMappedFile_readWrite) msync(_data, _length, MS_SYNC); munmap(_data, _length); }; // Return a pointer to position 'offset' in the file, and set the current position to 'offset + // length'. // // The length parameter is checked against the length of the file, and a fatal error occurs if // 'offset + length' exceeds the bounds of the file. // void *get(size_t offset, size_t length) { if (length == 0) length = _length - offset; if (offset + length > _length) fprintf(stderr, "memoryMappedFile()-- Requested " F_SIZE_T " bytes at position " F_SIZE_T " in file '%s', but only " F_SIZE_T " bytes in file.\n", length, offset, _name, _length), exit(1); _offset = offset + length; return((uint8 *)_data + offset); }; // Return a pointer to the current position in the file, and move the current position ahead by // 'length' bytes. // // get() (or get(0)) returns the current position without moving it. 
// void *get(size_t length=0) { return(get(_offset, length)); }; size_t length(void) { return(_length); }; memoryMappedFileType type(void) { return(_type); }; private: char _name[FILENAME_MAX]; memoryMappedFileType _type; size_t _length; // Length of the mapped file size_t _offset; // File pointer for reading void *_data; }; #endif // AS_UTL_MEMORYMAPPEDFILE_H canu-1.6/src/AS_UTL/memoryMappedFileTest.C000066400000000000000000000023421314437614700203420ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/memoryMappedFileTest.C * * Modifications by: * * Brian P. Walenz from 2014-NOV-26 to 2014-NOV-27 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "memoryMappedFile.H" int main(int argc, char **argv) { memoryMappedFileRW *write = new memoryMappedFileRW("testWrite"); for (uint32 ii=0; ii<1000000000; ii++) (*write)[ii] = ii; delete write; exit(0); } canu-1.6/src/AS_UTL/mt19937ar.C000066400000000000000000000116021314437614700156220ustar00rootroot00000000000000/* A C-program for MT19937, with initialization improved 2002/1/26. Coded by Takuji Nishimura and Makoto Matsumoto. Before using, initialize the state by using init_genrand(seed) or init_by_array(init_key, key_length). 
Copyright (C) 1997 - 2002, Makoto Matsumoto and Takuji Nishimura, All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. The names of its contributors may not be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Any feedback is very welcome. http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html email: m-mat @ math.sci.hiroshima-u.ac.jp (remove space) */ #include "mt19937ar.H" #include #include // initialize with a single seed void mtRandom::construct(uint32 s) { mt[0] = s; // See Knuth TAOCP Vol2. 3rd Ed. P.106 for multiplier. In the previous versions, MSBs of the seed // affect only MSBs of the array mt[]. 
// 2002/01/09 modified by Makoto Matsumoto for (mti=1; mti> 30)) + mti); mag01[0] = uint32ZERO; mag01[1] = MT_MATRIX_A; } /* initialize by an array with array-length */ /* init_key is the array for initializing keys */ /* key_length is its length */ /* slight change for C++, 2004/2/26 */ mtRandom::mtRandom(uint32 *init_key, uint32 key_length) { construct(19650218UL); int i = 1; int j = 0; int k = (MT_N > key_length ? MT_N : key_length); for (; k; k--) { mt[i] = (mt[i] ^ ((mt[i-1] ^ (mt[i-1] >> 30)) * 1664525UL)) + init_key[j] + j; /* non linear */ i++; j++; if (i >= MT_N) { mt[0] = mt[MT_N-1]; i=1; } if (j >= key_length) j=0; } for (k=MT_N-1; k; k--) { mt[i] = (mt[i] ^ ((mt[i-1] ^ (mt[i-1] >> 30)) * 1566083941UL)) - i; /* non linear */ i++; if (i>=MT_N) { mt[0] = mt[MT_N-1]; i=1; } } mt[0] = 0x80000000UL; /* MSB is 1; assuring non-zero initial array */ } /* generates a random number on [0,0xffffffff]-interval */ uint32 mtRandom::mtRandom32(void) { uint32 y = 0; // generate MT_N words at one time // if (mti >= MT_N) { int kk; for (kk=0; kk < MT_N - MT_M; kk++) { y = (mt[kk] & MT_UPPER_MASK) | (mt[kk+1] & MT_LOWER_MASK); mt[kk] = mt[kk + MT_M] ^ (y >> 1) ^ mag01[y & uint32ONE]; } for (; kk < MT_N-1; kk++) { y = (mt[kk] & MT_UPPER_MASK) | (mt[kk + 1] & MT_LOWER_MASK); mt[kk] = mt[kk + (MT_M - MT_N)] ^ (y >> 1) ^ mag01[y & uint32ONE]; } y = (mt[MT_N-1] & MT_UPPER_MASK) | (mt[0] & MT_LOWER_MASK); mt[MT_N-1] = mt[MT_M-1] ^ (y >> 1) ^ mag01[y & uint32ONE]; mti = 0; } y = mt[mti++]; /* Tempering */ y ^= (y >> 11); y ^= (y << 7) & 0x9d2c5680UL; y ^= (y << 15) & 0xefc60000UL; y ^= (y >> 18); return(y); } // generates a random number on gaussian distribution with 0 median and 1 std.dev. 
double mtRandom::mtRandomGaussian(void) { double x1=0, x2=0, w=0, y1=0, y2=0; // from http://www.taygeta.com/random/gaussian.html // // supposedly equivalent to // // y1 = sqrt(-2*ln(x1)) cos(2*pi*x2) // y2 = sqrt(-2*ln(x1)) sin(2*pi*x2) // // but stable when x1 close to zero do { x1 = 2.0 * mtRandomRealClosed() - 1.0; x2 = 2.0 * mtRandomRealClosed() - 1.0; w = x1 * x1 + x2 * x2; } while (w >= 1.0); w = sqrt( (-2.0 * log(w)) / w); y1 = x1 * w; y2 = x2 * w; return(y1); } // Generate a number from an exponential distribution using Inverse Transform Sampling. // double mtRandom::mtRandomExponential(double mode, double lambda) { return(mode - 1/lambda * log(mtRandomRealOpen())); } canu-1.6/src/AS_UTL/mt19937ar.H000066400000000000000000000044401314437614700156310ustar00rootroot00000000000000#ifndef MT19937AR_H #define MT19937AR_H #include "AS_global.H" // Refactoring of // // A C-program for MT19937, with initialization improved 2002/1/26. // Coded by Takuji Nishimura and Makoto Matsumoto. // // to make it thread safe and (hopefully) more portable. // // 20040421, bpw // Refactored again, 20141208 for C++.
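// An illustrative usage sketch (added annotation, not part of the original
// header): the generator is seeded once, then queried repeatedly. The method
// calls below are the class's own; the variable names are made up.
//
//   mtRandom  mt(5489);                     // explicit seed, for reproducible runs
//   uint32    r = mt.mtRandom32();          // uniform on [0, 2^32-1]
//   double    u = mt.mtRandomRealOpen();    // uniform real on [0, 1)
//   double    g = mt.mtRandomGaussian();    // gaussian, mean 0, std.dev. 1
//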
static const uint32 MT_N = 624; // period parameters static const uint32 MT_M = 397; static const uint32 MT_MATRIX_A = 0x9908b0dfUL; // constant vector a static const uint32 MT_UPPER_MASK = 0x80000000UL; // most significant w-r bits static const uint32 MT_LOWER_MASK = 0x7fffffffUL; // least significant r bits class mtRandom { private: void construct(uint32 s); public: mtRandom() { construct(getpid() * time(NULL)); }; mtRandom(uint32 s) { construct(s); }; mtRandom(uint32 *init_key, uint32 key_length); ~mtRandom() { }; uint32 mtRandom32(void); uint64 mtRandom64(void) { return((((uint64)mtRandom32()) << 32) | (uint64)mtRandom32()); } // Real valued randomness // mtRandomRealOpen() -- on [0,1) real interval // mtRandomRealClosed() -- on [0,1] real interval // mrRandomRealOpen53() -- on [0,1) real interval, using 53 bits // // "These real versions are due to Isaku Wada, 2002/01/09 added" and were taken from // the mt19937ar.c distribution (but they had actual functions, not macros) // // They also had // random number in (0,1) as (mtRandom32() + 0.5) * (1.0 / 4294967296.0) // double mtRandomRealOpen(void) { return((double)mtRandom32() * (1.0 / 4294967296.0)); }; double mtRandomRealClosed(void) { return((double)mtRandom32() * (1.0 / 4294967295.0)); }; double mtRandomRealOpen53(void) { return(((mtRandom32() >> 5) * 67108864.0 + (mtRandom32() >> 6)) * (1.0 / 9007199254740992.0)); }; // returns a random number with gaussian distribution, mean of zero and std.dev. 
of 1 // double mtRandomGaussian(void); double mtRandomExponential(double lambda, double tau=1.0); private: uint32 mt[MT_N]; // State vector array uint32 mti; // Ordinal of the first uninit'd element -- mti = N+1 -> elt N is uninit uint32 mag01[2]; // mag01[x] = x * MT_MATRIX_A for x=0,1 }; #endif // MT19937AR_H canu-1.6/src/AS_UTL/mt19937arTest.C000066400000000000000000000006171314437614700164660ustar00rootroot00000000000000#include "mt19937ar.H" int main(int argc, char **argv) { mtRandom mt; if (argc != 4) fprintf(stderr, "usage: %s \n", argv[0]), exit(1); uint32 number = atoi(argv[1]); double mode = atof(argv[2]); double scale = atof(argv[3]); for (uint32 ii=0; ii #include #include #include #include #endif #include // If bufferMax is zero, then the file is accessed using memory // mapped I/O. Otherwise, a small buffer is used. // readBuffer::readBuffer(const char *filename, uint64 bufferMax) { _filename = 0L; _file = 0; _filePos = 0; _mmap = NULL; _stdin = false; _eof = false; _bufferPos = 0; _bufferLen = 0; _bufferMax = 0; _buffer = 0L; if (((filename == 0L) && (isatty(fileno(stdin)) == 0)) || ((filename != 0L) && (filename[0] == '-') && (filename[1] == 0))) { _filename = new char [32]; strcpy(_filename, "(stdin)"); _stdin = true; if (bufferMax == 0) bufferMax = 32 * 1024; } else if (filename == 0L) { fprintf(stderr, "readBuffer()-- no filename supplied, and I will not use the terminal for input.\n"), exit(1); } else { _filename = new char [strlen(filename) + 1]; strcpy(_filename, filename); } if (bufferMax == 0) { _mmap = new memoryMappedFile(_filename); _buffer = (char *)_mmap->get(0); } else { errno = 0; _file = (_stdin) ? 
fileno(stdin) : open(_filename, O_RDONLY | O_LARGEFILE); if (errno) fprintf(stderr, "readBuffer()-- couldn't open the file '%s': %s\n", _filename, strerror(errno)), exit(1); _bufferMax = bufferMax; _buffer = new char [_bufferMax]; } fillBuffer(); if (_bufferLen == 0) _eof = true; } readBuffer::readBuffer(FILE *file, uint64 bufferMax) { if (bufferMax == 0) fprintf(stderr, "readBuffer()-- WARNING: mmap() not supported in readBuffer(FILE *)\n"); _filename = new char [32]; _file = fileno(file); _filePos = 0; _mmap = NULL; _stdin = false; _eof = false; _bufferPos = 0; _bufferLen = 0; _bufferMax = (bufferMax == 0) ? 32 * 1024 : bufferMax; _buffer = new char [_bufferMax]; strcpy(_filename, "(hidden file)"); // Just be sure that we are at the start of the file. errno = 0; lseek(_file, 0, SEEK_SET); if ((errno) && (errno != ESPIPE)) fprintf(stderr, "readBuffer()-- '%s' couldn't seek to position 0: %s\n", _filename, strerror(errno)), exit(1); fillBuffer(); if (_bufferLen == 0) _eof = true; } readBuffer::~readBuffer() { delete [] _filename; if (_mmap) delete _mmap; else delete [] _buffer; if (_stdin == false) close(_file); } void readBuffer::fillBuffer(void) { // If there is still stuff in the buffer, no need to fill. if (_bufferPos < _bufferLen) return; // No more stuff in the buffer. But if mmap'd, this means we're EOF.
if (_mmap) { _eof = true; return; } _bufferPos = 0; _bufferLen = 0; again: errno = 0; _bufferLen = (uint64)::read(_file, _buffer, _bufferMax); if (errno == EAGAIN) goto again; if (errno) fprintf(stderr, "readBuffer::fillBuffer()-- only read " F_U64 " bytes, couldn't read " F_U64 " bytes from '%s': %s\n", _bufferLen, _bufferMax, _filename, strerror(errno)), exit(1); if (_bufferLen == 0) _eof = true; } void readBuffer::seek(uint64 pos) { if (_stdin == true) { if (_filePos < _bufferLen) { _filePos = 0; _bufferPos = 0; return; } else { fprintf(stderr, "readBuffer()-- seek() not available for file 'stdin'.\n"); exit(1); } return; } assert(_stdin == false); if (_mmap) { _bufferPos = pos; _filePos = pos; } else { errno = 0; lseek(_file, pos, SEEK_SET); if (errno) fprintf(stderr, "readBuffer()-- '%s' couldn't seek to position " F_U64 ": %s\n", _filename, pos, strerror(errno)), exit(1); _bufferLen = 0; _bufferPos = 0; _filePos = pos; fillBuffer(); } _eof = (_bufferPos >= _bufferLen); } uint64 readBuffer::read(void *buf, uint64 len) { char *bufchar = (char *)buf; // Handle the mmap'd file first. if (_mmap) { uint64 c = 0; while ((_bufferPos < _bufferLen) && (c < len)) { bufchar[c++] = _buffer[_bufferPos++]; _filePos++; } if (c == 0) _eof = true; return(c); } // Easy case; the next len bytes are already in the buffer; just // copy and move the position. if (_bufferLen - _bufferPos > len) { memcpy(bufchar, _buffer + _bufferPos, len); _bufferPos += len; fillBuffer(); _filePos += len; return(len); } // Existing buffer not big enough. Copy what's there, then finish // with a read. 
uint64 bCopied = 0; // Number of bytes copied into the buffer uint64 bRead = 0; // Number of bytes read into the buffer uint64 bAct = 0; // Number of bytes actually read from disk memcpy(bufchar, _buffer + _bufferPos, _bufferLen - _bufferPos); bCopied = _bufferLen - _bufferPos; _bufferPos = _bufferLen; while (bCopied + bRead < len) { errno = 0; bAct = (uint64)::read(_file, bufchar + bCopied + bRead, len - bCopied - bRead); if (errno) fprintf(stderr, "readBuffer()-- couldn't read " F_U64 " bytes from '%s': %s\n", len, _filename, strerror(errno)), exit(1); // If we hit EOF, return a short read if (bAct == 0) len = 0; bRead += bAct; } fillBuffer(); _filePos += bCopied + bRead; return(bCopied + bRead); } uint64 readBuffer::read(void *buf, uint64 maxlen, char stop) { char *bufchar = (char *)buf; uint64 c = 0; // We will copy up to 'maxlen'-1 bytes into 'buf', or stop at the first occurrence of 'stop'. // This will reserve space at the end of any string for a zero-terminating byte. maxlen--; if (_mmap) { // Handle the mmap'd file first. while ((_bufferPos < _bufferLen) && (c < maxlen)) { bufchar[c++] = _buffer[_bufferPos++]; if (bufchar[c-1] == stop) break; } if (_bufferPos >= _bufferLen) _eof = true; } else { // And the usual case. while ((_eof == false) && (c < maxlen)) { bufchar[c++] = _buffer[_bufferPos++]; if (_bufferPos >= _bufferLen) fillBuffer(); if (bufchar[c-1] == stop) break; } } bufchar[c] = 0; return(c); }
* * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libutil/readBuffer.H * * Modifications by: * * Brian P. Walenz from 2003-APR-15 to 2004-JAN-06 * are Copyright 2003-2004 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-MAR-17 to 2004-MAY-24 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2006-JUN-24 to 2014-APR-11 * are Copyright 2006-2008,2010,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef READ_BUFFER_H #define READ_BUFFER_H #include "AS_global.H" #include "memoryMappedFile.H" class readBuffer { public: readBuffer(const char *filename, uint64 bufferMax = 32 * 1024); readBuffer(FILE *F, uint64 bufferMax = 32 * 1024); ~readBuffer(); bool eof(void) { return(_eof); }; char peek(void); char read(void); uint64 read(void *buf, uint64 len); uint64 read(void *buf, uint64 maxlen, char stop); void seek(uint64 pos); uint64 tell(void) { return(_filePos); }; const char *filename(void) { return(_filename); }; private: void fillBuffer(void); void init(int fileptr, const char *filename, uint64 bufferMax); char *_filename; int _file; uint64 _filePos; memoryMappedFile *_mmap; bool _stdin; bool _eof; // If bufferMax is zero, then we are using the mmapped interface, otherwise, // we are using a open()/read() and a small buffer. uint64 _bufferPos; uint64 _bufferLen; uint64 _bufferMax; char *_buffer; }; // Returns the next letter in the buffer, but DOES NOT advance past // it. 
Might have some weird interaction with EOF -- if you peek() // and the next thing is eof, the _eof flag might get set. // inline char readBuffer::peek(void) { if ((_eof == false) && (_bufferPos >= _bufferLen)) fillBuffer(); if (_eof) return(0); return(_buffer[_bufferPos]); } // Returns the next letter in the buffer. Returns EOF (0) if there // is no next letter. // inline char readBuffer::read(void) { if ((_eof == false) && (_bufferPos >= _bufferLen)) fillBuffer(); if (_eof) return(0); _bufferPos++; _filePos++; return(_buffer[_bufferPos-1]); } #endif // READ_BUFFER_H canu-1.6/src/AS_UTL/speedCounter.C000066400000000000000000000056351314437614700167110ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libutil/speedCounter.C * * Modifications by: * * Brian P. Walenz from 2006-OCT-22 to 2014-APR-11 * are Copyright 2006,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license.
*/ #include "speedCounter.H" const char* speedCounter::_spinr[4] = { "[|]", "[/]", "[-]", "[\\]" }; const char* speedCounter::_liner[19] = { "[- ]", "[-- ]", "[ -- ]", "[ -- ]", "[ -- ]", "[ -- ]", "[ -- ]", "[ -- ]", "[ -- ]", "[ --]", "[ -]", "[ --]", "[ -- ]", "[ -- ]", "[ -- ]", "[ -- ]", "[ -- ]", "[ -- ]", "[ -- ]" }; speedCounter::speedCounter(char const *fmt, double unit, uint64 freq, bool enabled) { _count = 0; _draws = 0; _unit = unit; _freq = freq; _startTime = getTime(); _fmt = fmt; _spin = false; _line = false; _enabled = enabled; // We use _draws instead of shifting _count just because it's // simpler, and both methods need another variable anyway. // Set all the bits below the hightest set in _freq -- // this allows us to do a super-fast test in tick(). // _freq |= _freq >> 1; _freq |= _freq >> 2; _freq |= _freq >> 4; _freq |= _freq >> 8; _freq |= _freq >> 16; _freq |= _freq >> 32; } speedCounter::~speedCounter() { finish(); } canu-1.6/src/AS_UTL/speedCounter.H000066400000000000000000000065161314437614700167170ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libutil/speedCounter.H * * Modifications by: * * Brian P. Walenz from 2003-JAN-02 to 2003-MAY-06 * are Copyright 2003 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUL-12 to 2014-APR-11 * are Copyright 2005-2006,2012-2014 J. 
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef SPEEDCOUNTER_H #define SPEEDCOUNTER_H #include "AS_global.H" #include "timeAndSize.H" class speedCounter { public: // fmt specifies the status format. An example: // " %8f [unit]things (%8.5f [unit]things/sec)\r" // speedCounter(char const *fmt, double unit, uint64 freq, bool enabled=true); ~speedCounter(); void enableSpinner(void) { _spin = true; }; void enableLiner(void) { _line = true; }; bool tick(void) { if (_enabled && ((++_count & _freq) == uint64ZERO)) { double v = _count / _unit; if (_spin) fputs(_spinr[_draws % 4], stderr); if (_line) fputs(_liner[_draws % 19], stderr); _draws++; fprintf(stderr, _fmt, v, v / (getTime() - _startTime)); fflush(stderr); return(true); } return(false); }; bool tick(uint64 increment) { if (_enabled == false) return(false); _count += increment; if ((_count & _freq) == uint64ZERO) { double v = _count / _unit; if (_spin) fputs(_spinr[_draws % 4], stderr); if (_line) fputs(_liner[_draws % 19], stderr); _draws++; fprintf(stderr, _fmt, v, v / (getTime() - _startTime)); fflush(stderr); return(true); } return(false); }; void finish(void) { if (_enabled && (_count >= _freq)) { double v = _count / _unit; if (_spin) fputs(_spinr[_draws % 4], stderr); if (_line) fputs(_liner[_draws % 19], stderr); fprintf(stderr, _fmt, v, v / (getTime() - _startTime)); fprintf(stderr, "\n"); fflush(stderr); } _count = 0; }; private: static const char *_spinr[4]; static const char *_liner[19]; uint64 _count; uint64 _draws; double _unit; uint64 _freq; double _startTime; char const *_fmt; bool _spin; bool _line; bool _enabled; }; #endif // SPEEDCOUNTER_H 
canu-1.6/src/AS_UTL/splitToWords.H000066400000000000000000000100751314437614700167270ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libutil/splitToWords.H * * Modifications by: * * Brian P. Walenz from 2005-JUL-12 to 2014-APR-11 * are Copyright 2005-2006,2012,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2015-AUG-11 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-FEB-25 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef SPLITTOWORDS_H #define SPLITTOWORDS_H class splitToWords { public: splitToWords() { _argWords = 0; _maxWords = 0; _arg = 0L; _maxChars = 0; _cmd = 0L; }; splitToWords(char *cmd) { _argWords = 0; _maxWords = 0; _arg = 0L; _maxChars = 0; _cmd = 0L; split(cmd); }; ~splitToWords() { delete [] _cmd; delete [] _arg; }; void split(char *cmd) { // Step Zero: // // Count the length of the string, in words and in characters. // For simplicity, we overcount words, by just counting white-space. // // Then, allocate space for a temporary copy of the string, and a // set of pointers into the temporary copy (much like argv). 
// uint32 cmdChars = 1; // 1 == Space for terminating 0 uint32 cmdWords = 2; // 2 == Space for first word and terminating 0L for (char *tmp=cmd; *tmp; tmp++) { cmdWords += (*tmp == ' ') ? 1 : 0; cmdWords += (*tmp == '\t') ? 1 : 0; cmdChars++; } if (cmdChars > _maxChars) { delete [] _cmd; _cmd = new char [cmdChars]; _maxChars = cmdChars; } if (cmdWords > _maxWords) { delete [] _arg; _arg = new char * [cmdWords]; _maxWords = cmdWords; } _argWords = 0; // Step One: // // Determine where the words are in the command string, copying the // string to _cmd and storing words in _arg. // bool isFirst = true; char *cmdI = cmd; char *cmdO = _cmd; while (*cmdI) { // If we are at a non-space character, we are in a word. If // this is the first character in the word, save the word in // the args list. // // Otherwise we are at a space and thus not in a word. Make // all spaces be string terminators, and declare that we are // at the start of a word. // if ((*cmdI != ' ') && (*cmdI != '\t') && (*cmdI != '\n') && (*cmdI != '\r')) { *cmdO = *cmdI; if (isFirst) { _arg[_argWords++] = cmdO; isFirst = false; } } else { *cmdO = 0; isFirst = true; } cmdI++; cmdO++; } // Finish off the list by terminating the last arg, and // terminating the list of args. // *cmdO = 0; _arg[_argWords] = 0L; }; uint32 numWords(void) { return(_argWords); }; char *getWord(uint32 i) { return(_arg[i]); }; char *operator[](uint32 i) { return(_arg[i]); }; int64 operator()(uint32 i) { return(strtoull(_arg[i], NULL, 10)); }; private: uint32 _argWords; uint32 _maxWords; char **_arg; uint32 _maxChars; char *_cmd; }; #endif // SPLITTOWORDS_H canu-1.6/src/AS_UTL/stddev.H000066400000000000000000000276631314437614700155560ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-JUL-23 to 2015-AUG-18 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-27 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-MAR-31 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef STDDEV_H #define STDDEV_H #include "AS_global.H" #include <vector> #include <algorithm> using namespace std; // Online mean and std.dev calculation. // B. P. Welford, Technometrics, Vol 4, No 3, Aug 1962 pp 419-420. // http://www2.in.tu-clausthal.de/~zach/teaching/info_literatur/Welford.pdf // Also presented in Knuth Vol 2 (3rd Ed.) pp 232. // template <typename TT> class stdDev { public: stdDev(double mn=0.0, double sn=0.0, uint32 nn=0) { _mn = mn; _sn = sn; _nn = nn; }; ~stdDev() { }; void insert(TT val) { double m0 = _mn; double s0 = _sn; uint32 n0 = _nn + 1; if (_nn == 0x7fffffff) fprintf(stderr, "ERROR: stdDev is full; can't insert() new value.\n"), exit(1); if (_nn & 0x80000000) fprintf(stderr, "ERROR: stdDev has been finalized; can't insert() new value.\n"), exit(1); _mn = m0 + (val - m0) / n0; _sn = s0 + (val - m0) * (val - _mn); _nn = n0; }; void remove(double val) { uint32 n0 = _nn - 1; double m0 = (n0 == 0) ?
(0) : ((_nn * _mn - val) / n0); double s0 = _sn - (val - m0) * (val - _mn); if (_nn == 0) fprintf(stderr, "ERROR: stdDev has no data; can't remove() old value.\n"), exit(1); if (_nn & 0x80000000) fprintf(stderr, "ERROR: stdDev has been finalized; can't remove() old value.\n"), exit(1); _nn = n0; _mn = m0; _sn = s0; }; void finalize(void) { _sn = stddev(); _nn |= 0x80000000; }; uint32 size(void) { return(_nn & 0x7fffffff); }; double mean(void) { return(_mn); }; double variance(void) { if (_nn & 0x80000000) return(_sn * _sn); else return((_nn < 2) ? (0.0) : (_sn / (_nn-1))); }; double stddev(void) { if (_nn & 0x80000000) return(_sn); else return(sqrt(variance())); }; private: double _mn; // mean double _sn; // "sum of variances" uint32 _nn; // number of items in the set }; // Offline mean and std.dev calculation. Filters outliers. // Does not work well with unsigned types. The 'smallest' compute can underflow. // template void computeStdDev(vector dist, double &mean, double &stddev, bool isSorted=false) { mean = 0; stddev = 0; if (dist.size() == 0) return; // Sort the values. Lets us approximate the stddev for filtering out outliers. if (isSorted == false) sort(dist.begin(), dist.end()); // Approximate the stddev to filter out outliers. This is done by assuming we're normally // distributed, finding the values that would represent 1 standard deviation (about 68.27% of the // data), and using that to find the 5 std.dev. limits. TT median = dist[1 * dist.size() / 2]; TT oneThird = dist[1 * dist.size() / 3]; TT twoThird = dist[2 * dist.size() / 3]; TT approxStd = max(median - oneThird, twoThird - median); TT biggest = median + approxStd * 5; TT smallest = median - approxStd * 5; fprintf(stderr, "computeStdDev %d %d %d %d %d %d\n", median, oneThird, twoThird, approxStd, biggest, smallest); // Now, compute the number of samples within our bounds. And find the mean, too. 
size_t numSamples = 0; for (size_t x=0; x 1) stddev = sqrt(stddev / (numSamples - 1)); }; // Compute the mode. Once the values are sorted, we just need to scan the list and remember the // most common value. // template void computeMode(vector dist, TT &mode, bool isSorted=false) { mode = 0; if (dist.size() == 0) return; if (isSorted == false) sort(dist.begin(), dist.end()); uint32 modeCnt = 0; TT modeVal = 0; uint32 modeTmpCnt = 0; TT modeTmpVal = 0; for (uint64 x=0; x void computeMedianAbsoluteDeviation(vector dist, TT &median, TT &mad, bool isSorted=false) { median = 0; mad = 0; if (dist.size() == 0) return; if (isSorted == false) sort(dist.begin(), dist.end()); // Technically, if there are an even number of values, the median should be the average of the two // in the middle. median = dist[ dist.size()/2 ]; vector m; for (uint64 ii=0; ii TT computeExponentialMovingAverage(TT alpha, TT ema, TT value) { assert(0.0 <= alpha); assert(alpha <= 1.0); return(alpha * value + (1 - alpha) * ema); }; template class genericStatistics { public: genericStatistics() { _finalized = false; _mean = 0.0; _stddev = 0.0; _mode = 0; _median = 0; _mad = 0; }; ~genericStatistics() { }; void add(TT data) { _finalized = false; _data.push_back(data); }; uint64 numberOfObjects(void) { finalizeData(); return(_data.size()); } double mean(void) { finalizeData(); return(_mean); } double stddev(void) { finalizeData(); return(_stddev); }; TT median(void) { finalizeData(); return(_median); }; TT mad(void) { // Median Absolute Deviation finalizeData(); return(_mad); }; vector &histogram(void) { // Returns pointer to private histogram data finalizeData(); return(&_histogram); }; vector &Nstatistics(void) { // Returns pointer to private N data finalizeData(); return(&_Nstatistics); }; void finalizeData(void) { if (_finalized == true) return; computeStdDev(_data, _mean, _stddev); // Filters out outliers computeMode(_data, _mode); // Mo filtering computeMedianAbsoluteDeviation(_data, _median, _mad); 
// No filtering _finalized = true; }; private: bool _finalized; vector _data; double _mean; double _stddev; TT _mode; TT _median; TT _mad; vector _histogram; vector _Nstatistics; }; class histogramStatistics { public: histogramStatistics() { _histogramAlloc = 1024 * 1024; _histogramMax = 0; _histogram = new uint64 [_histogramAlloc]; memset(_histogram, 0, sizeof(uint64) * _histogramAlloc); _finalized = false; clearStatistics(); }; ~histogramStatistics() { delete [] _histogram; }; void add(uint64 data, uint32 count=1) { while (_histogramAlloc < data) resizeArray(_histogram, _histogramMax+1, _histogramAlloc, _histogramAlloc * 2, resizeArray_copyData | resizeArray_clearNew); if (_histogramMax < data) _histogramMax = data; _histogram[data] += count; _finalized = false; }; uint64 numberOfObjects(void) { finalizeData(); return(_numObjs); }; double mean(void) { finalizeData(); return(_mean); }; double stddev(void) { finalizeData(); return(_stddev); }; uint64 median(void) { finalizeData(); return(_median); }; uint64 mad(void) { finalizeData(); return(_mad); }; #if 0 vector &histogram(void) { // Returns pointer to private histogram data finalizeData(); return(&_histogram); }; vector &Nstatistics(void) { // Returns pointer to private N data finalizeData(); return(&_Nstatistics); }; #endif void clearStatistics(void) { _numObjs = 0; _mean = 0.0; _stddev = 0.0; _mode = 0; _median = 0; _mad = 0; }; void finalizeData(void) { if (_finalized == true) return; clearStatistics(); // Compute number of objects for (uint64 ii=0; ii <= _histogramMax; ii++) _numObjs += _histogram[ii]; // Compute mean and stddev for (uint64 ii=0; ii <= _histogramMax; ii++) _mean += ii * _histogram[ii]; if (_numObjs > 1) _mean /= _numObjs; for (uint64 ii=0; ii <= _histogramMax; ii++) _stddev += _histogram[ii] * (ii - _mean) * (ii - _mean); if (_numObjs > 1) _stddev = sqrt(_stddev / (_numObjs - 1)); // Compute mode for (uint64 ii=0; ii <= _histogramMax; ii++) if (_histogram[ii] > _histogram[_mode]) _mode = ii; 
// Compute median and mad for (uint64 ii=0; ii <= _histogramMax; ii++) if (_median < _numObjs / 2) { _median += _histogram[ii]; } else { _median = ii; break; } uint64 *maddata = new uint64 [_histogramAlloc]; memset(maddata, 0, sizeof(uint64) * _histogramAlloc); for (uint64 ii=0; ii <= _histogramMax; ii++) { uint64 mad = (ii < _median) ? (_median - ii) : (ii - _median); maddata[mad] += _histogram[ii]; } for (uint64 ii=0; ii <= _histogramMax; ii++) if (_mad < _numObjs / 2) { _mad += maddata[ii]; } else { _mad = ii; break; } // And, done delete [] maddata; _finalized = true; }; uint64 histogram(uint64 ii) { return(_histogram[ii]); }; uint64 histogramMax(void) { return(_histogramMax); }; void writeHistogram(FILE *F, char *label) { fprintf(F, "#%s\tquantity\n", label); for (uint64 ii=0; ii <= _histogramMax; ii++) fprintf(F, F_U64"\t" F_U64 "\n", ii, _histogram[ii]); }; private: bool _finalized; uint64 _histogramAlloc; // Maximum allocated value uint64 _histogramMax; // Maximum valid value uint64 *_histogram; uint64 _numObjs; double _mean; double _stddev; uint64 _mode; uint64 _median; uint64 _mad; }; #endif // STDDEV_H canu-1.6/src/AS_UTL/stddevTest.C000066400000000000000000000057101314437614700163760ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. 
Walenz beginning on 2016-MAR-10 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "stddev.H" // g++ -Wall -o stddevTest -I. -I.. stddevTest.C void testInsert(void) { stdDev sdu; stdDev sdi; stdDev sdd; sdu.insert((uint32)2); sdu.insert((uint32)4); sdu.insert((uint32)4); sdu.insert((uint32)4); sdu.insert((uint32)5); sdu.insert((uint32)5); sdu.insert((uint32)7); sdu.insert((uint32)9); sdi.insert((uint32)2); sdi.insert((uint32)4); sdi.insert((uint32)4); sdi.insert((uint32)4); sdi.insert((uint32)5); sdi.insert((uint32)5); sdi.insert((uint32)7); sdi.insert((uint32)9); sdd.insert((uint32)2); sdd.insert((uint32)4); sdd.insert((uint32)4); sdd.insert((uint32)4); sdd.insert((uint32)5); sdd.insert((uint32)5); sdd.insert((uint32)7); sdd.insert((uint32)9); fprintf(stderr, "Expect mean=5, variance=%f, stddev=%f\n", 32.0 / 7.0, sqrt(32.0 / 7.0)); fprintf(stderr, " uint32 size %u mean %f variance %f stddev %f\n", sdu.size(), sdu.mean(), sdu.variance(), sdu.stddev()); fprintf(stderr, " int32 size %u mean %f variance %f stddev %f\n", sdi.size(), sdi.mean(), sdi.variance(), sdi.stddev()); fprintf(stderr, " double size %u mean %f variance %f stddev %f\n", sdd.size(), sdd.mean(), sdd.variance(), sdd.stddev()); assert(sdu.variance() == 32.0 / 7.0); assert(sdi.variance() == 32.0 / 7.0); assert(sdd.variance() == 32.0 / 7.0); fprintf(stderr, "\n\n"); } void testRemove(void) { double values[10] = { 1, 2, 3, 4, 9, 8, 7, 6, 20, 30 }; stdDev sd; fprintf(stderr, "Expect final to be zero, and insert() == remove().\n"); for (int ii=0; ii<10; ii++) { sd.insert(values[ii]); fprintf(stderr, "insert[%2d] mean %8.4f stddev %8.4f\n", ii+1, sd.mean(), sd.stddev()); } assert(sd.mean() == 9.0); fprintf(stderr, "\n"); for (int ii=9; ii>=0; ii--) { sd.remove(values[ii]); fprintf(stderr, "remove[%2d] mean %8.4f stddev %8.4f\n", ii, 
sd.mean(), sd.stddev()); } assert(sd.mean() == 0.0); assert(sd.stddev() == 0.0); } int main(int argc, char **argv) { testInsert(); testRemove(); exit(0); } canu-1.6/src/AS_UTL/sweatShop.C000066400000000000000000000424251314437614700162260ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libutil/sweatShop.C * * Modifications by: * * Brian P. Walenz from 2006-MAR-02 to 2014-APR-11 * are Copyright 2006,2008,2010-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2015-JUN-24 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "sweatShop.H" #include "timeAndSize.H" #include // pthread scheduling stuff class sweatShopWorker { public: sweatShopWorker() { shop = 0L; threadUserData = 0L; numComputed = 0; workerQueue = 0L; workerQueueLen = 0L; }; sweatShop *shop; void *threadUserData; pthread_t threadID; uint32 numComputed; sweatShopState **workerQueue; uint32 workerQueueLen; }; // This gets created by the loader, passed to the worker, and printed // by the writer. userData is controlled by the user. 
// class sweatShopState { public: sweatShopState(void *userData) { _user = userData; _computed = false; _next = 0L; }; ~sweatShopState() { }; void *_user; bool _computed; sweatShopState *_next; }; // Simply forwards control to the class void* _sweatshop_loaderThread(void *ss_) { sweatShop *ss = (sweatShop *)ss_; return(ss->loader()); } void* _sweatshop_workerThread(void *sw_) { sweatShopWorker *sw = (sweatShopWorker *)sw_; return(sw->shop->worker(sw)); } void* _sweatshop_writerThread(void *ss_) { sweatShop *ss = (sweatShop *)ss_; return(ss->writer()); } void* _sweatshop_statusThread(void *ss_) { sweatShop *ss = (sweatShop *)ss_; return(ss->status()); } sweatShop::sweatShop(void*(*loaderfcn)(void *G), void (*workerfcn)(void *G, void *T, void *S), void (*writerfcn)(void *G, void *S)) { _userLoader = loaderfcn; _userWorker = workerfcn; _userWriter = writerfcn; _globalUserData = 0L; _writerP = 0L; _workerP = 0L; _loaderP = 0L; _showStatus = false; _loaderQueueSize = 1024; _loaderQueueMax = 10240; _loaderQueueMin = 4; // _numberOfWorkers * 2, reset when that changes _loaderBatchSize = 1; _workerBatchSize = 1; _writerQueueSize = 4096; _writerQueueMax = 10240; _numberOfWorkers = 2; _workerData = 0L; _numberLoaded = 0; _numberComputed = 0; _numberOutput = 0; } sweatShop::~sweatShop() { delete [] _workerData; } void sweatShop::setThreadData(uint32 t, void *x) { if (_workerData == 0L) _workerData = new sweatShopWorker [_numberOfWorkers]; if (t >= _numberOfWorkers) fprintf(stderr, "sweatShop::setThreadData()-- worker ID " F_U32 " more than number of workers=" F_U32 "\n", t, _numberOfWorkers), exit(1); _workerData[t].threadUserData = x; } // Build a list of states to add in one swoop // void sweatShop::loaderSave(sweatShopState *&tail, sweatShopState *&head, sweatShopState *thisState) { thisState->_next = 0L; if (tail) { head->_next = thisState; head = thisState; } else { tail = head = thisState; } _numberLoaded++; } // Add a bunch of new states to the queue. 
// void sweatShop::loaderAppend(sweatShopState *&tail, sweatShopState *&head) { int err; if ((tail == 0L) || (head == 0L)) return; err = pthread_mutex_lock(&_stateMutex); if (err != 0) fprintf(stderr, "sweatShop::loaderAppend()-- Failed to lock mutex (%d). Fail.\n", err), exit(1); if (_loaderP == 0L) { _writerP = tail; _workerP = tail; _loaderP = head; } else { _loaderP->_next = tail; } _loaderP = head; err = pthread_mutex_unlock(&_stateMutex); if (err != 0) fprintf(stderr, "sweatShop::loaderAppend()-- Failed to unlock mutex (%d). Fail.\n", err), exit(1); tail = 0L; head = 0L; } void* sweatShop::loader(void) { struct timespec naptime; naptime.tv_sec = 0; naptime.tv_nsec = 166666666ULL; // 1/6 second // We can batch several loads together before we push them onto the // queue, this should reduce the number of times the loader needs to // lock the queue. // // But it also increases the latency, so it's disabled by default. // sweatShopState *tail = 0L; // The first thing loaded sweatShopState *head = 0L; // The last thing loaded uint32 numLoaded = 0; bool moreToLoad = true; while (moreToLoad) { // Zzzzzzz.... while (_numberLoaded > _numberComputed + _loaderQueueSize) nanosleep(&naptime, 0L); sweatShopState *thisState = new sweatShopState((*_userLoader)(_globalUserData)); // If we actually loaded a new state, add it // if (thisState->_user) { loaderSave(tail, head, thisState); numLoaded++; if (numLoaded >= _loaderBatchSize) loaderAppend(tail, head); } else { // Didn't read, must be all done! Push on the end-of-input marker state. 
loaderSave(tail, head, new sweatShopState(0L)); loaderAppend(tail, head); moreToLoad = false; delete thisState; } } //fprintf(stderr, "sweatShop::reader exits.\n"); return(0L); } void* sweatShop::worker(sweatShopWorker *workerData) { struct timespec naptime; naptime.tv_sec = 0; naptime.tv_nsec = 50000000ULL; bool moreToCompute = true; int err; while (moreToCompute) { // Usually because some worker is taking a long time, and the // output queue isn't big enough. // while (_numberOutput + _writerQueueSize < _numberComputed) nanosleep(&naptime, 0L); // Grab the next state. We don't grab it if it's the last in the // queue (else we would fall off the end) UNLESS it really is the // last one. // err = pthread_mutex_lock(&_stateMutex); if (err != 0) fprintf(stderr, "sweatShop::worker()-- Failed to lock mutex (%d). Fail.\n", err), exit(1); for (workerData->workerQueueLen = 0; ((workerData->workerQueueLen < _workerBatchSize) && (_workerP) && ((_workerP->_next != 0L) || (_workerP->_user == 0L))); workerData->workerQueueLen++) { workerData->workerQueue[workerData->workerQueueLen] = _workerP; _workerP = _workerP->_next; } if (_workerP == 0L) moreToCompute = false; err = pthread_mutex_unlock(&_stateMutex); if (err != 0) fprintf(stderr, "sweatShop::worker()-- Failed to unlock mutex (%d). Fail.\n", err), exit(1); if (workerData->workerQueueLen == 0) { // No work, sleep a bit to prevent thrashing the mutex and resume. nanosleep(&naptime, 0L); continue; } // Execute // for (uint32 x=0; x<workerData->workerQueueLen; x++) { sweatShopState *ts = workerData->workerQueue[x]; if (ts && ts->_user) { (*_userWorker)(_globalUserData, workerData->threadUserData, ts->_user); ts->_computed = true; workerData->numComputed++; } else { // When we really do run out of stuff to do, we'll end up here // (only one thread will end up in the other case, with // something to do and moreToCompute=false). If it's actually // the end, skip the sleep and just get outta here.
// if (moreToCompute == true) { fprintf(stderr, "WARNING! Worker is sleeping because the reader is slow!\n"); nanosleep(&naptime, 0L); } } } } //fprintf(stderr, "sweatShop::worker exits.\n"); return(0L); } void* sweatShop::writer(void) { sweatShopState *deleteState = 0L; // Wait for output to appear, then write. // while (_writerP && _writerP->_user) { if (_writerP->_computed == false) { // Wait for a slow computation. struct timespec naptime; naptime.tv_sec = 0; naptime.tv_nsec = 5000000ULL; //fprintf(stderr, "Writer waits for slow thread at " F_U64 ".\n", _numberOutput); nanosleep(&naptime, 0L); } else if (_writerP->_next == 0L) { // Wait for the input. struct timespec naptime; naptime.tv_sec = 0; naptime.tv_nsec = 5000000ULL; //fprintf(stderr, "Writer waits for all threads at " F_U64 ".\n", _numberOutput); nanosleep(&naptime, 0L); } else { (*_userWriter)(_globalUserData, _writerP->_user); _numberOutput++; deleteState = _writerP; _writerP = _writerP->_next; delete deleteState; } } // Tell status to stop. _writerP = 0L; //fprintf(stderr, "sweatShop::writer exits.\n"); return(0L); } // This thread not only shows a status message, but it also updates the critical shared variable // _numberComputed. Worker threads use this to throttle themselves. Thus, even if _showStatus is // not set, and this thread doesn't _appear_ to be doing anything useful....it is. 
// void* sweatShop::status(void) { struct timespec naptime; naptime.tv_sec = 0; naptime.tv_nsec = 250000000ULL; double startTime = getTime() - 0.001; double thisTime = 0; uint64 deltaOut = 0; uint64 deltaCPU = 0; double cpuPerSec = 0; uint64 readjustAt = 16384; while (_writerP) { uint32 nc = 0; for (uint32 i=0; i<_numberOfWorkers; i++) nc += _workerData[i].numComputed; _numberComputed = nc; deltaOut = deltaCPU = 0; thisTime = getTime(); if (_numberComputed > _numberOutput) deltaOut = _numberComputed - _numberOutput; if (_numberLoaded > _numberComputed) deltaCPU = _numberLoaded - _numberComputed; cpuPerSec = _numberComputed / (thisTime - startTime); if (_showStatus) { fprintf(stderr, " %6.1f/s - %8" F_U64P " loaded; %8" F_U64P " queued for compute; %08" F_U64P " finished; %8" F_U64P " written; %8" F_U64P " queued for output)\r", cpuPerSec, _numberLoaded, deltaCPU, _numberComputed, _numberOutput, deltaOut); fflush(stderr); } // Readjust queue sizes based on current performance, but don't let it get too big or small. // In particular, don't let it get below 2*numberOfWorkers. 
// if (_numberComputed > readjustAt) { readjustAt += (uint64)(2 * cpuPerSec); _loaderQueueSize = (uint32)(5 * cpuPerSec); } if (_loaderQueueSize < _loaderQueueMin) _loaderQueueSize = _loaderQueueMin; if (_loaderQueueSize < 2 * _numberOfWorkers) _loaderQueueSize = 2 * _numberOfWorkers; if (_loaderQueueSize > _loaderQueueMax) _loaderQueueSize = _loaderQueueMax; nanosleep(&naptime, 0L); } if (_showStatus) { thisTime = getTime(); if (_numberComputed > _numberOutput) deltaOut = _numberComputed - _numberOutput; if (_numberLoaded > _numberComputed) deltaCPU = _numberLoaded - _numberComputed; cpuPerSec = _numberComputed / (thisTime - startTime); fprintf(stderr, " %6.1f/s - %08" F_U64P " queued for compute; %08" F_U64P " finished; %08" F_U64P " queued for output)\n", cpuPerSec, deltaCPU, _numberComputed, deltaOut); } //fprintf(stderr, "sweatShop::status exits.\n"); return(0L); } void sweatShop::run(void *user, bool beVerbose) { pthread_attr_t threadAttr; pthread_t threadIDloader; pthread_t threadIDwriter; pthread_t threadIDstats; #if 0 int threadSchedPolicy = 0; struct sched_param threadSchedParamDef; struct sched_param threadSchedParamMax; #endif int err = 0; _globalUserData = user; _showStatus = beVerbose; // Configure everything ahead of time. if (_workerBatchSize < 1) _workerBatchSize = 1; if (_workerData == 0L) _workerData = new sweatShopWorker [_numberOfWorkers]; for (uint32 i=0; i<_numberOfWorkers; i++) { _workerData[i].shop = this; _workerData[i].workerQueue = new sweatShopState * [_workerBatchSize]; } // Open the doors. 
errno = 0; err = pthread_mutex_init(&_stateMutex, NULL); if (err) fprintf(stderr, "sweatShop::run()-- Failed to configure pthreads (state mutex): %s.\n", strerror(err)), exit(1); err = pthread_attr_init(&threadAttr); if (err) fprintf(stderr, "sweatShop::run()-- Failed to configure pthreads (attr init): %s.\n", strerror(err)), exit(1); err = pthread_attr_setscope(&threadAttr, PTHREAD_SCOPE_SYSTEM); if (err) fprintf(stderr, "sweatShop::run()-- Failed to configure pthreads (set scope): %s.\n", strerror(err)), exit(1); err = pthread_attr_setdetachstate(&threadAttr, PTHREAD_CREATE_JOINABLE); if (err) fprintf(stderr, "sweatShop::run()-- Failed to configure pthreads (joinable): %s.\n", strerror(err)), exit(1); #if 0 err = pthread_attr_getschedparam(&threadAttr, &threadSchedParamDef); if (err) fprintf(stderr, "sweatShop::run()-- Failed to configure pthreads (get default param): %s.\n", strerror(err)), exit(1); err = pthread_attr_getschedparam(&threadAttr, &threadSchedParamMax); if (err) fprintf(stderr, "sweatShop::run()-- Failed to configure pthreads (get max param): %s.\n", strerror(err)), exit(1); #endif // SCHED_RR needs root privs to run on FreeBSD. 
// //err = pthread_attr_setschedpolicy(&threadAttr, SCHED_RR); //if (err) // fprintf(stderr, "sweatShop::run()-- Failed to configure pthreads (sched policy): %s.\n", strerror(err)), exit(1); #if 0 err = pthread_attr_getschedpolicy(&threadAttr, &threadSchedPolicy); if (err) fprintf(stderr, "sweatShop::run()-- Failed to configure pthreads (sched policy): %s.\n", strerror(err)), exit(1); errno = 0; threadSchedParamMax.sched_priority = sched_get_priority_max(threadSchedPolicy); if (errno) fprintf(stderr, "sweatShop::run()-- WARNING: Failed to configure pthreads (set max param priority): %s.\n", strerror(errno)); // Fire off the loader err = pthread_attr_setschedparam(&threadAttr, &threadSchedParamMax); if (err) fprintf(stderr, "sweatShop::run()-- Failed to set loader priority: %s.\n", strerror(err)), exit(1); #endif err = pthread_create(&threadIDloader, &threadAttr, _sweatshop_loaderThread, this); if (err) fprintf(stderr, "sweatShop::run()-- Failed to launch loader thread: %s.\n", strerror(err)), exit(1); // Wait for it to actually load something (otherwise all the // workers immediately go home) while (!_writerP && !_workerP && !_loaderP) { struct timespec naptime; naptime.tv_sec = 0; naptime.tv_nsec = 250000ULL; nanosleep(&naptime, 0L); } // Start the statistics and writer #if 0 err = pthread_attr_setschedparam(&threadAttr, &threadSchedParamMax); if (err) fprintf(stderr, "sweatShop::run()-- Failed to set status and writer priority: %s.\n", strerror(err)), exit(1); #endif err = pthread_create(&threadIDstats, &threadAttr, _sweatshop_statusThread, this); if (err) fprintf(stderr, "sweatShop::run()-- Failed to launch status thread: %s.\n", strerror(err)), exit(1); err = pthread_create(&threadIDwriter, &threadAttr, _sweatshop_writerThread, this); if (err) fprintf(stderr, "sweatShop::run()-- Failed to launch writer thread: %s.\n", strerror(err)), exit(1); // And some labor #if 0 err = pthread_attr_setschedparam(&threadAttr, &threadSchedParamDef); if (err) fprintf(stderr, 
"sweatShop::run()-- Failed to set worker priority: %s.\n", strerror(err)), exit(1); #endif for (uint32 i=0; i<_numberOfWorkers; i++) { err = pthread_create(&_workerData[i].threadID, &threadAttr, _sweatshop_workerThread, _workerData + i); if (err) fprintf(stderr, "sweatShop::run()-- Failed to launch worker thread " F_U32 ": %s.\n", i, strerror(err)), exit(1); } // Now sit back and relax. err = pthread_join(threadIDloader, 0L); if (err) fprintf(stderr, "sweatShop::run()-- Failed to join loader thread: %s.\n", strerror(err)), exit(1); err = pthread_join(threadIDwriter, 0L); if (err) fprintf(stderr, "sweatShop::run()-- Failed to join writer thread: %s.\n", strerror(err)), exit(1); err = pthread_join(threadIDstats, 0L); if (err) fprintf(stderr, "sweatShop::run()-- Failed to join status thread: %s.\n", strerror(err)), exit(1); for (uint32 i=0; i<_numberOfWorkers; i++) { err = pthread_join(_workerData[i].threadID, 0L); if (err) fprintf(stderr, "sweatShop::run()-- Failed to join worker thread " F_U32 ": %s.\n", i, strerror(err)), exit(1); } // Cleanup. delete _loaderP; _loaderP = _workerP = _writerP = 0L; } canu-1.6/src/AS_UTL/sweatShop.H000066400000000000000000000074131314437614700162310ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libutil/sweatShop.H * * Modifications by: * * Brian P. Walenz from 2006-MAR-02 to 2014-APR-11 * are Copyright 2006,2008,2010-2011,2014 J. 
Craig Venter Institute, and
 *      are subject to the GNU General Public License version 2
 *
 *    Brian P. Walenz from 2014-DEC-05 to 2014-DEC-08
 *      are Copyright 2014 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#ifndef SWEATSHOP_H
#define SWEATSHOP_H

#include <stdio.h>
#include <pthread.h>

#include "AS_global.H"

class sweatShopWorker;
class sweatShopState;

class sweatShop {
public:
  sweatShop(void*(*loaderfcn)(void *G),
            void (*workerfcn)(void *G, void *T, void *S),
            void (*writerfcn)(void *G, void *S));
  ~sweatShop();

  void        setNumberOfWorkers(uint32 x)          { _numberOfWorkers = x;  _loaderQueueMin = x * 2; };

  void        setThreadData(uint32 t, void *x);

  void        setLoaderBatchSize(uint32 batchSize)  { _loaderBatchSize = batchSize; };
  void        setLoaderQueueSize(uint32 queueSize)  { _loaderQueueSize = queueSize;  _loaderQueueMax = queueSize; };

  void        setWorkerBatchSize(uint32 batchSize)  { _workerBatchSize = batchSize; };

  void        setWriterQueueSize(uint32 queueSize)  { _writerQueueSize = queueSize;  _writerQueueMax = queueSize; };

  void        run(void *user=0L, bool beVerbose=false);

private:
  //  Stubs that forward control from the c-based pthread to this class
  friend void *_sweatshop_loaderThread(void *ss);
  friend void *_sweatshop_workerThread(void *ss);
  friend void *_sweatshop_writerThread(void *ss);
  friend void *_sweatshop_statusThread(void *ss);

  //  The threaded routines
  void   *loader(void);
  void   *worker(sweatShopWorker *workerData);
  void   *writer(void);
  void   *status(void);

  //  Utilities for the loader thread
  //void    loaderAdd(sweatShopState *thisState);
  void    loaderSave(sweatShopState *&tail, sweatShopState *&head, sweatShopState *thisState);
  void    loaderAppend(sweatShopState *&tail, sweatShopState *&head);

  pthread_mutex_t   _stateMutex;

  void *(*_userLoader)(void *global);
  void  (*_userWorker)(void *global, void *thread, void *thing);
  void  (*_userWriter)(void *global,
                       void *thing);

  void              *_globalUserData;

  sweatShopState    *_writerP;  //  Where output takes stuff from, the tail
  sweatShopState    *_workerP;  //  Where computes happen, the middle
  sweatShopState    *_loaderP;  //  Where input is put, the head

  bool               _showStatus;

  uint32             _loaderQueueSize, _loaderQueueMin, _loaderQueueMax;
  uint32             _loaderBatchSize;
  uint32             _workerBatchSize;
  uint32             _writerQueueSize, _writerQueueMax;

  uint32             _numberOfWorkers;

  sweatShopWorker   *_workerData;

  uint64             _numberLoaded;
  uint64             _numberComputed;
  uint64             _numberOutput;
};

#endif  //  SWEATSHOP_H
canu-1.6/src/AS_UTL/testRand.C000066400000000000000000000043471314437614700160360ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz from 2008-JUN-27 to 2013-AUG-01
 *      are Copyright 2008,2013 J. Craig Venter Institute, and
 *      are subject to the GNU General Public License version 2
 *
 *    Brian P. Walenz beginning on 2015-OCT-12
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>

#include "AS_UTL_rand.H"

#define ITERATIONS 1000000

int
main(void) {
  double sumx = 0.0, sumx2 = 0.0, sumx3 = 0.0;
  int    i;
  int    Seed = getpid();
  double num;

  srand48(Seed);  /* Initialize Random Number Generator */

  for (i = 0; i < ITERATIONS; i++) {
    double rand = GaussRandomNormalized_AS();
    double mult = rand;

    sumx  += mult;  mult *= rand;
    sumx2 += mult;  mult *= rand;
    sumx3 += mult;
  }

  num = (double)ITERATIONS;

  fprintf(stderr, "***** Normalized Gaussian *****\n");
  fprintf(stderr, "* avg = %f avg2 = %f avg3 = %f\n", sumx/num, sumx2/num, sumx3/num);

  sumx = 0.0;  sumx2 = 0.0;  sumx3 = 0.0;

  for (i = 0; i < ITERATIONS; i++) {
    double rand = GaussRandom_AS(0.0, 5.0);
    double mult = rand;

    sumx  += mult;  mult *= rand;
    sumx2 += mult;  mult *= rand;
    sumx3 += mult;
  }

  num   = (double)ITERATIONS;
  sumx  = sumx/num;
  sumx2 = sqrt(sumx2/num);
  sumx3 = pow(sumx3/num, 0.33);

  fprintf(stderr, "***** Gaussian with STDEV 5.0 *****\n");
  fprintf(stderr, "* avg = %f sqrt avg2 = %f curt avg3 = %f\n", sumx, sumx2, sumx3);

  return 0;
}
canu-1.6/src/AS_UTL/timeAndSize.C000066400000000000000000000050401314437614700164570ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz on 2014-DEC-08
 *      are Copyright 2014 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *    Brian P.
Walenz beginning on 2017-AUG-10
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#include "AS_global.H"

#include <sys/time.h>
#include <sys/resource.h>
#include <errno.h>

double
getTime(void) {
  struct timeval  tp;

  gettimeofday(&tp, NULL);

  return(tp.tv_sec + (double)tp.tv_usec / 1000000.0);
}

static
bool
getrusage(struct rusage &ru) {

  errno = 0;

  if (getrusage(RUSAGE_SELF, &ru) == -1) {
    fprintf(stderr, "getrusage(RUSAGE_SELF, ...) failed: %s\n", strerror(errno));
    return(false);
  }

  return(true);
}

static
bool
getrlimit(struct rlimit &rl) {

  errno = 0;

  if (getrlimit(RLIMIT_DATA, &rl) == -1) {
    fprintf(stderr, "getrlimit(RLIMIT_DATA, ...) failed: %s\n", strerror(errno));
    return(false);
  }

  return(true);
}

double
getCPUTime(void) {
  struct rusage  ru;
  double         tm = 0;

  if (getrusage(ru) == true)
    tm = ((ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1000000.0) +
          (ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1000000.0));

  return(tm);
}

double
getProcessTime(void) {
  struct timeval tp;
  static double  st = 0.0;
  double         tm = 0;

  if (gettimeofday(&tp, NULL) == 0)
    tm = tp.tv_sec + tp.tv_usec / 1000000.0;

  if (st == 0.0)
    st = tm;

  return(tm - st);
}

uint64
getProcessSize(void) {
  struct rusage  ru;
  uint64         sz = 0;

  if (getrusage(ru) == true) {
    sz  = ru.ru_maxrss;
    sz *= 1024;
  }

  return(sz);
}

uint64
getProcessSizeLimit(void) {
  struct rlimit rl;
  uint64        sz = ~uint64ZERO;

  if (getrlimit(rl) == true)
    sz = rl.rlim_cur;

  return(sz);
}
canu-1.6/src/AS_UTL/timeAndSize.H000066400000000000000000000023141314437614700164630ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2017-AUG-10 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" double getTime(void); double getCPUTime(void); double getProcessTime(void); uint64 getProcessSize(void); uint64 getProcessSizeLimit(void); canu-1.6/src/AS_UTL/writeBuffer.H000066400000000000000000000051031314437614700165320ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. 
Walenz beginning on 2016-NOV-18 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2017-MAY-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef WRITE_BUFFER_H #define WRITE_BUFFER_H #include "AS_global.H" #include "AS_UTL_fileIO.H" class writeBuffer { public: writeBuffer(const char *filename, const char *filemode, uint64 bufferMax = 1024 * 1024) { _filename = filename; _filemode = filemode; errno = 0; _file = fopen(filename, filemode); if (errno) fprintf(stderr, "writeBuffer()-- Failed to open file '%s' with mode '%s': %s\n", filename, filemode, strerror(errno)), exit(1); _filePos = AS_UTL_ftell(_file); _bufferLen = 0; _bufferMax = bufferMax; _buffer = new char [_bufferMax]; }; ~writeBuffer() { flush(); delete [] _buffer; fclose(_file); }; uint64 tell(void) { return(_filePos); }; void write(void *data, uint64 length) { if ((_bufferMax < length) || (_bufferLen + length > _bufferMax)) flush(); if (_bufferMax < length) AS_UTL_safeWrite(_file, data, "writeBuffer", 1, length); else { memcpy(_buffer + _bufferLen, data, length); _bufferLen += length; } _filePos += length; }; const char *filename(void) { return(_filename); }; private: void flush(void) { AS_UTL_safeWrite(_file, _buffer, "writeBuffer", 1, _bufferLen); _bufferLen = 0; }; const char *_filename; const char *_filemode; FILE *_file; uint64 _filePos; uint64 _bufferLen; uint64 _bufferMax; char *_buffer; }; #endif // WRITE_BUFFER_H canu-1.6/src/AS_global.C000066400000000000000000000122551314437614700150430ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz from 2007-AUG-03 to 2013-AUG-01
 *      are Copyright 2007-2009,2011-2013 J. Craig Venter Institute, and
 *      are subject to the GNU General Public License version 2
 *
 *    Sergey Koren on 2009-MAR-06
 *      are Copyright 2009 J. Craig Venter Institute, and
 *      are subject to the GNU General Public License version 2
 *
 *    Brian P. Walenz on 2015-MAR-03
 *      are Copyright 2015 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *    Brian P. Walenz beginning on 2015-NOV-08
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#include "AS_global.H"
#include "canu_version.H"

#include "AS_UTL_stackTrace.H"
#include "timeAndSize.H"

#ifdef X86_GCC_LINUX
#include <fpu_control.h>
#endif

#ifdef _GLIBCXX_PARALLEL
#include <parallel/algorithm>
#include <parallel/settings.h>
#endif

//  We take argc and argv, so, maybe, eventually, we'll want to parse
//  something out of there.  We return argc in case what we parse we
//  want to remove.
//
int
AS_configure(int argc, char **argv) {

#ifdef X86_GCC_LINUX
  //  Set the x86 FPU control word to force double precision rounding
  //  rather than `extended' precision rounding.  This causes base
  //  calls and quality values on x86 GCC-Linux (tested on RedHat
  //  Linux) machines to be identical to those on IEEE conforming UNIX
  //  machines.
// fpu_control_t fpu_cw = ( _FPU_DEFAULT & ~_FPU_EXTENDED ) | _FPU_DOUBLE; _FPU_SETCW( fpu_cw ); #endif #ifdef _GLIBCXX_PARALLEL_SETTINGS_H __gnu_parallel::_Settings s = __gnu_parallel::_Settings::get(); // Force all algorithms to be parallel. // Force some algs to be sequential by using a tag, eg: // sort(a, a+end, __gnu_parallel::sequential_tag()); // //s.algorithm_strategy = __gnu_parallel::force_parallel; // The default seems to be 1000, way too small for us. s.sort_minimal_n = 128 * 1024; // The default is MWMS, which, at least on FreeBSD 8.2 w/gcc46, is NOT inplace. // Then again, the others also appear to be NOT inplace as well. //s.sort_algorithm = __gnu_parallel::MWMS; //s.sort_algorithm = __gnu_parallel::QS_BALANCED; //s.sort_algorithm = __gnu_parallel::QS; __gnu_parallel::_Settings::set(s); #endif // Default to one thread. This is mostly to disable the parallel sort, // which seems to have a few bugs left in it. e.g., a crash when using 48 // threads, but not when using 47, 49 or 64 threads. omp_set_num_threads(1); // Install a signal handler to catch seg faults and errors. AS_UTL_installCrashCatcher(argv[0]); // Set the start time. getProcessTime(); // // Et cetera. // for (int32 i=0; i 4) || ((__GNUC__ == 4) && (__GNUC_MINOR__ > 2))) // Not by default! //#ifndef _GLIBCXX_PARALLEL //#define _GLIBCXX_PARALLEL //#endif #else #undef _GLIBCXX_PARALLEL #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifndef TRUE #define TRUE true #endif #ifndef FALSE #define FALSE false #endif #ifndef MIN #define MIN(a,b) (((a) < (b)) ? (a) : (b)) #endif #ifndef MAX #define MAX(a,b) (((a) > (b)) ? (a) : (b)) #endif typedef int8_t int8; typedef int16_t int16; typedef int32_t int32; typedef int64_t int64; typedef uint8_t uint8; typedef uint16_t uint16; typedef uint32_t uint32; typedef uint64_t uint64; //typedef void* PtrT; // Correct? Or should it be LLU and LU? 
#define uint64NUMBER(X) X ## LU #define uint32NUMBER(X) X ## U #define uint64ZERO uint64NUMBER(0x0000000000000000) #define uint64ONE uint64NUMBER(0x0000000000000001) #define uint64MAX uint64NUMBER(0xffffffffffffffff) #define uint64MASK(X) ((~uint64ZERO) >> (64 - (X))) #define uint32ZERO uint32NUMBER(0x00000000) #define uint32ONE uint32NUMBER(0x00000001) #define uint32MAX uint32NUMBER(0xffffffff) #define uint32MASK(X) ((~uint32ZERO) >> (32 - (X))) #define uint16ZERO (0x0000) #define uint16ONE (0x0001) #define uint16MAX (0xffff) #define uint16MASK(X) ((~uint16ZERO) >> (16 - (X))) #define uint8ZERO (0x00) #define uint8ONE (0x01) #define uint8MAX (0xff) #define uint8MASK(X) ((~uint8ZERO) >> (8 - (X))) #define strtouint32(N) (uint32)strtoul (N, NULL, 10) #define strtouint64(N) (uint64)strtoull(N, NULL, 10) #define strtodouble(N) (double)strtod (N, NULL) //inline uint32 strtouint32(char *N, O) (uint32)strtoul (N, O, 10) //inline uint64 strtouint64(char *N, O) (uint64)strtoull(N, O, 10) //inline double strtodouble(char *N, O) (double)strtod (N, O) // Pointers #define F_PTR "0x%016p" // Characters #define F_C "%c" #define F_CP "c" #define F_CI "%*c" // Strings #define F_STR "%s" #define F_STRP "s" #define F_STRI "%*s" // Integers #define F_S16 "%" PRId16 #define F_S16P PRId16 #define F_S16I "%*" PRId16 #define F_U16 "%" PRIu16 #define F_U16P PRIu16 #define F_U16I "%*" PRIu16 #define F_S32 "%" PRId32 #define F_S32P PRId32 #define F_S32I "%*" PRId32 #define F_U32 "%" PRIu32 #define F_U32P PRIu32 #define F_U32I "%*" PRIu32 #define F_S64 "%" PRId64 #define F_S64P PRId64 #define F_S64I "%*" PRId64 #define F_U64 "%" PRIu64 #define F_U64P PRIu64 #define F_U64I "%*" PRIu64 #define F_X64 "%016" PRIx64 #define F_X64P PRIx64 #define F_X64I "%*" PRIx64 // Floating points #define F_F32 "%f" #define F_F32P "f" #define F_F32I "%*f" #define F_F64 "%lf" #define F_F64P "lf" #define F_F64I "%*lf" // Standard typedefs #define F_SIZE_T "%zu" #define F_SIZE_TP "zu" #define F_SIZE_TI "%*zu" 
#define F_OFF_T   F_S64
#define F_OFF_TP  F_S64P
#define F_OFF_TI  F_S64I

#if defined(_FILE_OFFSET_BITS) && (_FILE_OFFSET_BITS == 32)
#error I do not support 32-bit off_t.
#endif

//  perl's chomp is pretty nice
//  Not a great place to put this, but it's getting used all over.
#ifndef chomp
#define chomp(S)  { char *t=(S); while (*t) t++; t--; while (t >= S && isspace(*t)) *t--=0; }
#endif

#ifndef munch
#define munch(S)  { while (*(S) && isspace(*(S))) (S)++; }
#endif

#ifndef crunch
#define crunch(S) { while (*(S) && !isspace(*(S))) (S)++; }
#endif

int AS_configure(int argc, char **argv);

//  These macros are used to eliminate inter-platform differences between
//  calculated results
#define DBL_TO_INT(X)   ((int)((1.0+16.0*DBL_EPSILON)*(X)))
#define ROUNDPOS(X)     (DBL_TO_INT((X)+0.5) )
#define ROUND(X)        (((X)>0.0) ? ROUNDPOS(X) : -ROUNDPOS(-(X)) )
#define ZERO_PLUS       ( 16.0*DBL_EPSILON)
#define ZERO_MINUS      (-16.0*DBL_EPSILON)
#define ONE_PLUS        (1.0+ZERO_PLUS)
#define ONE_MINUS       (1.0+ZERO_MINUS)
#define INT_EQ_DBL(I,D) (fabs((double)(I)-(D)) < 16.0*DBL_EPSILON )
#define DBL_EQ_DBL(A,B) (fabs((A)-(B))<16.0*DBL_EPSILON)

//  Tell gcc (and others, maybe) about unused parameters.  This is important for gcc (especially
//  newer ones) that complain about unused parameters.  Thanks to ideasman42 at
//  http://stackoverflow.com/questions/3599160/unused-parameter-warnings-in-c-code.

#ifdef __GNUC__
#define UNUSED(x) UNUSED_ ## x __attribute__((__unused__))
#else
#define UNUSED(x) UNUSED_ ## x
#endif

#include "AS_UTL_alloc.H"

//  Stubs for ignoring OpenMP on clang.
#ifdef BROKEN_CLANG_OpenMP static void omp_set_dynamic(int x) { #pragma unused(x) } static int omp_get_max_threads(void) { return(1); } static void omp_set_num_threads(int x) { #pragma unused(x) } static int omp_get_num_threads(void) { return(1); } static int omp_get_thread_num(void) { return(0); } typedef int omp_lock_t; static void omp_init_lock(omp_lock_t *a) { #pragma unused(a) } static void omp_set_lock(omp_lock_t *a) { #pragma unused(a) } static void omp_unset_lock(omp_lock_t *a) { #pragma unused(a) } static void omp_destroy_lock(omp_lock_t *a) { #pragma unused(a) } #endif #endif canu-1.6/src/Makefile000066400000000000000000000657501314437614700145640ustar00rootroot00000000000000# boilermake: A reusable, but flexible, boilerplate Makefile. # # Copyright 2008, 2009, 2010 Dan Moulding, Alan T. DeKok # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see . # Caution: Don't edit this Makefile! Create your own main.mk and other # submakefiles, which will be included by this Makefile. # Only edit this if you need to modify boilermake's behavior (fix # bugs, add features, etc). # Note: Parameterized "functions" in this makefile that are marked with # "USE WITH EVAL" are only useful in conjuction with eval. This is # because those functions result in a block of Makefile syntax that must # be evaluated after expansion. 
Since they must be used with eval, most # instances of "$" within them need to be escaped with a second "$" to # accommodate the double expansion that occurs when eval is invoked. # ADD_CLEAN_RULE - Parameterized "function" that adds a new rule and phony # target for cleaning the specified target (removing its build-generated # files). # # USE WITH EVAL # define ADD_CLEAN_RULE clean: clean_${1} .PHONY: clean_${1} clean_${1}: $$(strip rm -f ${TARGET_DIR}/${1} $${${1}_OBJS:%.o=%.[doP]}) $${${1}_POSTCLEAN} endef # ADD_OBJECT_RULE - Parameterized "function" that adds a pattern rule for # building object files from source files with the filename extension # specified in the second argument. The first argument must be the name of the # base directory where the object files should reside (such that the portion # of the path after the base directory will match the path to corresponding # source files). The third argument must contain the rules used to compile the # source files into object code form. # # USE WITH EVAL # define ADD_OBJECT_RULE ${1}/%.o: ${2} ${3} endef # ADD_TARGET_RULE - Parameterized "function" that adds a new target to the # Makefile. The target may be an executable or a library. The two allowable # types of targets are distinguished based on the name: library targets must # end with the traditional ".a" extension. # # USE WITH EVAL # define ADD_TARGET_RULE ifeq "$$(suffix ${1})" ".a" # Add a target for creating a static library. $${TARGET_DIR}/${1}: $${${1}_OBJS} @mkdir -p $$(dir $$@) $$(strip $${AR} $${ARFLAGS} $$@ $${${1}_OBJS}) $${${1}_POSTMAKE} else ifeq "$$(suffix ${1})" ".jar" # Add a target for copying files. Actually, just untars the files, kind # of ugly, but the only way I could discover to get the lib/ files installed. $${TARGET_DIR}/${1}: $${${1}} @mkdir -p $$(dir $$@) $$(strip tar -C $$(dir $$@) -xf $${${1}_SOURCES}) else # Add a target for linking an executable. First, attempt to select the # appropriate front-end to use for linking.
This might not choose the # right one (e.g. if linking with a C++ static library, but all other # sources are C sources), so the user makefile is allowed to specify a # linker to be used for each target. ifeq "$$(strip $${${1}_LINKER})" "" # No linker was explicitly specified to be used for this target. If # there are any C++ sources for this target, use the C++ compiler. # For all other targets, default to using the C compiler. ifneq "$$(strip $$(filter $${CXX_SRC_EXTS},$${${1}_SOURCES}))" "" ${1}_LINKER = $${CXX} else ${1}_LINKER = $${CC} endif endif $${TARGET_DIR}/${1}: $${${1}_OBJS} $${${1}_PREREQS} @mkdir -p $$(dir $$@) $$(strip $${${1}_LINKER} -o $$@ $${LDFLAGS} $${${1}_LDFLAGS} \ $${${1}_OBJS} $${${1}_LDLIBS} $${LDLIBS}) $${${1}_POSTMAKE} endif endif endef # CANONICAL_PATH - Given one or more paths, converts the paths to the canonical # form. The canonical form is the path, relative to the project's top-level # directory (the directory from which "make" is run), and without # any "./" or "../" sequences. For paths that are not located below the # top-level directory, the canonical form is the absolute path (i.e. from # the root of the filesystem) also without "./" or "../" sequences. define CANONICAL_PATH $(patsubst ${CURDIR}/%,%,$(abspath ${1})) endef # COMPILE_C_CMDS - Commands for compiling C source code. define COMPILE_C_CMDS @mkdir -p $(dir $@) $(strip ${CC} -o $@ -c -MD ${CFLAGS} ${SRC_CFLAGS} ${INCDIRS} \ ${SRC_INCDIRS} ${SRC_DEFS} ${DEFS} $<) @cp ${@:%$(suffix $@)=%.d} ${@:%$(suffix $@)=%.P}; \ sed -e 's/#.*//' -e 's/^[^:]*: *//' -e 's/ *\\$$//' \ -e '/^$$/ d' -e 's/$$/ :/' < ${@:%$(suffix $@)=%.d} \ >> ${@:%$(suffix $@)=%.P}; \ rm -f ${@:%$(suffix $@)=%.d} endef # COMPILE_CXX_CMDS - Commands for compiling C++ source code. 
define COMPILE_CXX_CMDS @mkdir -p $(dir $@) $(strip ${CXX} -o $@ -c -MD ${CXXFLAGS} ${SRC_CXXFLAGS} ${INCDIRS} \ ${SRC_INCDIRS} ${SRC_DEFS} ${DEFS} $<) @cp ${@:%$(suffix $@)=%.d} ${@:%$(suffix $@)=%.P}; \ sed -e 's/#.*//' -e 's/^[^:]*: *//' -e 's/ *\\$$//' \ -e '/^$$/ d' -e 's/$$/ :/' < ${@:%$(suffix $@)=%.d} \ >> ${@:%$(suffix $@)=%.P}; \ rm -f ${@:%$(suffix $@)=%.d} endef # INCLUDE_SUBMAKEFILE - Parameterized "function" that includes a new # "submakefile" fragment into the overall Makefile. It also recursively # includes all submakefiles of the specified submakefile fragment. # # USE WITH EVAL # define INCLUDE_SUBMAKEFILE # Initialize all variables that can be defined by a makefile fragment, then # include the specified makefile fragment. TARGET := TGT_CFLAGS := TGT_CXXFLAGS := TGT_DEFS := TGT_INCDIRS := TGT_LDFLAGS := TGT_LDLIBS := TGT_LINKER := TGT_POSTCLEAN := TGT_POSTMAKE := TGT_PREREQS := SOURCES := SRC_CFLAGS := SRC_CXXFLAGS := SRC_DEFS := SRC_INCDIRS := SUBMAKEFILES := # A directory stack is maintained so that the correct paths are used as we # recursively include all submakefiles. Get the makefile's directory and # push it onto the stack. DIR := $(call CANONICAL_PATH,$(dir ${1})) DIR_STACK := $$(call PUSH,$${DIR_STACK},$${DIR}) include ${1} # Initialize internal local variables. OBJS := # Determine which target this makefile's variables apply to. A stack is # used to keep track of which target is the "current" target as we # recursively include other submakefiles. ifneq "$$(strip $${TARGET})" "" # This makefile defined a new target. Target variables defined by this # makefile apply to this new target. Initialize the target's variables. 
TGT := $$(strip $${TARGET}) ALL_TGTS += $${TGT} $${TGT}_CFLAGS := $${TGT_CFLAGS} $${TGT}_CXXFLAGS := $${TGT_CXXFLAGS} $${TGT}_DEFS := $${TGT_DEFS} $${TGT}_DEPS := TGT_INCDIRS := $$(call QUALIFY_PATH,$${DIR},$${TGT_INCDIRS}) TGT_INCDIRS := $$(call CANONICAL_PATH,$${TGT_INCDIRS}) $${TGT}_INCDIRS := $${TGT_INCDIRS} $${TGT}_LDFLAGS := $${TGT_LDFLAGS} $${TGT}_LDLIBS := $${TGT_LDLIBS} $${TGT}_LINKER := $${TGT_LINKER} $${TGT}_OBJS := $${TGT}_POSTCLEAN := $${TGT_POSTCLEAN} $${TGT}_POSTMAKE := $${TGT_POSTMAKE} $${TGT}_PREREQS := $$(addprefix $${TARGET_DIR}/,$${TGT_PREREQS}) $${TGT}_SOURCES := else # The values defined by this makefile apply to the "current" target # as determined by which target is at the top of the stack. TGT := $$(strip $$(call PEEK,$${TGT_STACK})) $${TGT}_CFLAGS += $${TGT_CFLAGS} $${TGT}_CXXFLAGS += $${TGT_CXXFLAGS} $${TGT}_DEFS += $${TGT_DEFS} TGT_INCDIRS := $$(call QUALIFY_PATH,$${DIR},$${TGT_INCDIRS}) TGT_INCDIRS := $$(call CANONICAL_PATH,$${TGT_INCDIRS}) $${TGT}_INCDIRS += $${TGT_INCDIRS} $${TGT}_LDFLAGS += $${TGT_LDFLAGS} $${TGT}_LDLIBS += $${TGT_LDLIBS} $${TGT}_POSTCLEAN += $${TGT_POSTCLEAN} $${TGT}_POSTMAKE += $${TGT_POSTMAKE} $${TGT}_PREREQS += $${TGT_PREREQS} endif # Push the current target onto the target stack. TGT_STACK := $$(call PUSH,$${TGT_STACK},$${TGT}) ifneq "$$(strip $${SOURCES})" "" # This makefile builds one or more objects from source. Validate the # specified sources against the supported source file types. BAD_SRCS := $$(strip $$(filter-out $${ALL_SRC_EXTS},$${SOURCES})) ifneq "$${BAD_SRCS}" "" $$(error Unsupported source file(s) found in ${1} [$${BAD_SRCS}]) endif # Qualify and canonicalize paths. SOURCES := $$(call QUALIFY_PATH,$${DIR},$${SOURCES}) SOURCES := $$(call CANONICAL_PATH,$${SOURCES}) SRC_INCDIRS := $$(call QUALIFY_PATH,$${DIR},$${SRC_INCDIRS}) SRC_INCDIRS := $$(call CANONICAL_PATH,$${SRC_INCDIRS}) # Save the list of source files for this target.
$${TGT}_SOURCES += $${SOURCES} # Convert the source file names to their corresponding object file # names. OBJS := $$(addprefix $${BUILD_DIR}/$$(call CANONICAL_PATH,$${TGT})/,\ $$(addsuffix .o,$$(basename $${SOURCES}))) # Add the objects to the current target's list of objects, and create # target-specific variables for the objects based on any source # variables that were defined. $${TGT}_OBJS += $${OBJS} $${TGT}_DEPS += $${OBJS:%.o=%.P} $${OBJS}: SRC_CFLAGS := $${$${TGT}_CFLAGS} $${SRC_CFLAGS} $${OBJS}: SRC_CXXFLAGS := $${$${TGT}_CXXFLAGS} $${SRC_CXXFLAGS} $${OBJS}: SRC_DEFS := $$(addprefix -D,$${$${TGT}_DEFS} $${SRC_DEFS}) $${OBJS}: SRC_INCDIRS := $$(addprefix -I,\ $${$${TGT}_INCDIRS} $${SRC_INCDIRS}) endif ifneq "$$(strip $${SUBMAKEFILES})" "" # This makefile has submakefiles. Recursively include them. $$(foreach MK,$${SUBMAKEFILES},\ $$(eval $$(call INCLUDE_SUBMAKEFILE,\ $$(call CANONICAL_PATH,\ $$(call QUALIFY_PATH,$${DIR},$${MK}))))) endif # Reset the "current" target to its previous value. TGT_STACK := $$(call POP,$${TGT_STACK}) TGT := $$(call PEEK,$${TGT_STACK}) # Reset the "current" directory to its previous value. DIR_STACK := $$(call POP,$${DIR_STACK}) DIR := $$(call PEEK,$${DIR_STACK}) endef # MIN - Parameterized "function" that results in the minimum lexical value of # the two values given. define MIN $(firstword $(sort ${1} ${2})) endef # PEEK - Parameterized "function" that results in the value at the top of the # specified colon-delimited stack. define PEEK $(lastword $(subst :, ,${1})) endef # POP - Parameterized "function" that pops the top value off of the specified # colon-delimited stack, and results in the new value of the stack. Note that # the popped value cannot be obtained using this function; use peek for that. define POP ${1:%:$(lastword $(subst :, ,${1}))=%} endef # PUSH - Parameterized "function" that pushes a value onto the specified colon- # delimited stack, and results in the new value of the stack.
define PUSH
${2:%=${1}:%}
endef

# QUALIFY_PATH - Given a "root" directory and one or more paths, qualifies the
#   paths using the "root" directory (i.e. appends the root directory name to
#   the paths) except for paths that are absolute.
define QUALIFY_PATH
$(addprefix ${1}/,$(filter-out /%,${2})) $(filter /%,${2})
endef

###############################################################################
#
# Start of Makefile Evaluation
#
###############################################################################

# Older versions of GNU Make lack capabilities needed by boilermake.
# With older versions, "make" may simply output "nothing to do", likely leading
# to confusion. To avoid this, check the version of GNU make up-front and
# inform the user if their version of make doesn't meet the minimum required.
MIN_MAKE_VERSION := 3.81
MIN_MAKE_VER_MSG := boilermake requires GNU Make ${MIN_MAKE_VERSION} or greater
ifeq "${MAKE_VERSION}" ""
    $(info GNU Make not detected)
    $(error ${MIN_MAKE_VER_MSG})
endif
ifneq "${MIN_MAKE_VERSION}" "$(call MIN,${MIN_MAKE_VERSION},${MAKE_VERSION})"
    $(info This is GNU Make version ${MAKE_VERSION})
    $(error ${MIN_MAKE_VER_MSG})
endif

# Define the source file extensions that we know how to handle.
C_SRC_EXTS   := %.c
CXX_SRC_EXTS := %.C %.cc %.cp %.cpp %.CPP %.cxx %.c++
JAVA_EXTS    := %.jar %.tar
ALL_SRC_EXTS := ${C_SRC_EXTS} ${CXX_SRC_EXTS} ${JAVA_EXTS}

# Initialize global variables.
ALL_TGTS  :=
DEFS      :=
DIR_STACK :=
INCDIRS   :=
TGT_STACK :=

# Discover our OS and architecture. These are used to set the BUILD_DIR and
# TARGET_DIR to something more useful than 'build' and '.'.
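The version gate above works because $(sort) orders words lexically: MIN returns the first word of the sorted pair, and make is accepted when the required minimum sorts first (or the versions are equal). A Python sketch of the same check, for illustration only (Python's string sort matches make's word sort for these version strings):

```python
def lexical_min(a, b):
    # MIN: the lexically smaller of two words, like $(firstword $(sort a b)).
    return sorted([a, b])[0]

def make_is_new_enough(make_version, minimum="3.81"):
    # The makefile accepts make when the minimum sorts first (or is equal).
    return lexical_min(minimum, make_version) == minimum

print(make_is_new_enough("4.2.1"))  # True
print(make_is_new_enough("3.80"))   # False
```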
OSTYPE      := $(shell echo `uname`)
OSVERSION   := $(shell echo `uname -r`)
MACHINETYPE := $(shell echo `uname -m`)

ifeq (${MACHINETYPE}, x86_64)
  MACHINETYPE = amd64
endif

ifeq (${MACHINETYPE}, Power Macintosh)
  MACHINETYPE = ppc
endif

ifeq (${OSTYPE}, SunOS)
  MACHINETYPE = ${shell echo `uname -p`}
  ifeq (${MACHINETYPE}, sparc)
    ifeq (${shell /usr/bin/isainfo -b}, 64)
      MACHINETYPE = sparc64
    else
      MACHINETYPE = sparc32
    endif
  endif
endif

# Set compiler and flags based on discovered hardware.
#
# By default, debug symbols are included in all builds (even optimized).
#
# BUILDOPTIMIZED  will disable debug symbols (leaving it just optimized).
# BUILDDEBUG      will disable optimization (leaving it just with debug symbols).
# BUILDSTACKTRACE will enable stack trace on crashes; only works for Linux.
#                 Set to 0 on the command line to disable (it's enabled by
#                 default for Linux).
#
# BUILDPROFILE used to add -pg to LDFLAGS, remove -D_GLIBCXX_PARALLEL from
# CXXFLAGS and LDFLAGS, and remove -fomit-frame-pointer from CXXFLAGS.  It
# added a bunch of complication and wasn't really used.
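The uname normalization above reduces to a small decision function. This Python sketch mirrors the makefile's MACHINETYPE logic; the function name and its arguments are illustrative, not part of the build:

```python
def machine_type(uname_m, ostype, uname_p=None, isainfo_bits=None):
    # Mirror the makefile's MACHINETYPE normalization (a sketch):
    # x86_64 -> amd64, Power Macintosh -> ppc, and on SunOS use the
    # processor type plus isainfo's word size to pick sparc32/sparc64.
    m = uname_m
    if m == "x86_64":
        m = "amd64"
    elif m == "Power Macintosh":
        m = "ppc"
    if ostype == "SunOS":
        m = uname_p or m
        if m == "sparc":
            m = "sparc64" if isainfo_bits == "64" else "sparc32"
    return m

print(machine_type("x86_64", "Linux"))               # amd64
print(machine_type("i386", "SunOS", "sparc", "64"))  # sparc64
```

The normalized value feeds into BUILD_DIR and TARGET_DIR so builds for different platforms land in separate directories.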
ifeq (${OSTYPE}, Linux) CC ?= gcc CXX ?= g++ CXXFLAGS += -D_GLIBCXX_PARALLEL -pthread -fopenmp -fPIC LDFLAGS += -D_GLIBCXX_PARALLEL -pthread -fopenmp -lm CXXFLAGS += -Wall -Wextra -Wno-write-strings -Wno-unused -Wno-char-subscripts -Wno-sign-compare -Wformat BUILDSTACKTRACE ?= 1 ifeq ($(BUILDOPTIMIZED), 1) else CXXFLAGS += -g3 endif ifeq ($(BUILDDEBUG), 1) else CXXFLAGS += -O4 -funroll-loops -fexpensive-optimizations -finline-functions -fomit-frame-pointer endif endif ifeq (${OSTYPE}, Darwin) CC ?= gcc CXX ?= g++ CLANG = $(shell echo `${CXX} --version 2>&1 | grep -c clang`) ifeq ($(CLANG), 0) CXXFLAGS += -D_GLIBCXX_PARALLEL -fopenmp LDFLAGS += -D_GLIBCXX_PARALLEL -fopenmp else CXXFLAGS += -DBROKEN_CLANG_OpenMP LDFLAGS += endif CXXFLAGS += -pthread -fPIC -m64 -Wall -Wextra -Wno-write-strings -Wno-unused -Wno-char-subscripts -Wno-sign-compare -Wformat LDFLAGS += -pthread -lm ifeq ($(BUILDOPTIMIZED), 1) else CXXFLAGS += -g3 endif ifeq ($(BUILDDEBUG), 1) else CXXFLAGS += -O4 -funroll-loops -fexpensive-optimizations -finline-functions -fomit-frame-pointer endif endif ifeq (${OSTYPE}, FreeBSD) ifeq (${MACHINETYPE}, amd64) CC ?= gcc48 CXX ?= g++48 CXXFLAGS += -I/usr/local/include -D_GLIBCXX_PARALLEL -pthread -fopenmp -fPIC LDFLAGS += -L/usr/local/lib -D_GLIBCXX_PARALLEL -pthread -fopenmp -rpath /usr/local/lib/gcc48 -lm -lexecinfo CXXFLAGS += -Wall -Wextra -Wno-write-strings -Wno-unused -Wno-char-subscripts -Wno-sign-compare -Wformat -Wno-parentheses # Google Performance Tools malloc and heapchecker (HEAPCHECK=normal) #CXXFLAGS += #LDFLAGS += -ltcmalloc # Google Performance Tools cpu profiler (CPUPROFILE=/path) #CXXFLAGS += #LDFLAGS += -lprofiler # callgrind #CXXFLAGS += -g3 -Wa,--gstabs -save-temps ifeq ($(BUILDOPTIMIZED), 1) else CXXFLAGS += -g3 endif ifeq ($(BUILDDEBUG), 1) else CXXFLAGS += -O4 -funroll-loops -fexpensive-optimizations -finline-functions -fomit-frame-pointer endif endif endif ifeq (${OSTYPE}, FreeBSD) ifeq (${MACHINETYPE}, arm) CC ?= gcc48 CXX ?= g++48 
CXXFLAGS += -I/usr/local/include -D_GLIBCXX_PARALLEL -pthread -fopenmp -fPIC LDFLAGS += -L/usr/local/lib -D_GLIBCXX_PARALLEL -pthread -fopenmp -rpath /usr/local/lib/gcc48 -lm CXXFLAGS += -Wall -Wextra -Wno-write-strings -Wno-unused -Wno-char-subscripts -Wno-sign-compare -Wformat -Wno-parentheses CXXFLAGS += -funroll-loops -fomit-frame-pointer LDFLAGS += ifeq ($(BUILDOPTIMIZED), 1) else CXXFLAGS += -g3 endif ifeq ($(BUILDDEBUG), 1) else CXXFLAGS += -O4 -funroll-loops -fexpensive-optimizations -finline-functions -fomit-frame-pointer endif endif endif ifneq (,$(findstring CYGWIN, ${OSTYPE})) CC ?= gcc CXX ?= g++ CXXFLAGS := -fopenmp -pthread LDFLAGS := -fopenmp -pthread -lm CXXFLAGS += -Wall -Wextra -Wno-write-strings -Wno-unused -Wno-char-subscripts -Wno-sign-compare -Wformat ifeq ($(BUILDOPTIMIZED), 1) else CXXFLAGS += -g3 endif ifeq ($(BUILDDEBUG), 1) else CXXFLAGS += -O4 -funroll-loops -fexpensive-optimizations -finline-functions -fomit-frame-pointer endif endif # Stack tracing support. Wow, what a pain. Only Linux is supported. This is just documentation, # don't actually enable any of this stuff! # # backward-cpp looks very nice, only a single header file. But it needs libberty (-liberty) and # libbfd (-lbfd). The former should be installed with gcc, and the latter is in elfutils. On # three out of our three development machines, it fails for various reasons. # # libunwind is pretty basic. # # libbacktrace works (on Linux) and is simple enough to include in our tree. # # None of these give any useful information on BSDs (which includes OS X aka macOS). # # # Backtraces with libunwind. Not informative on FreeBSD. #CXXFLAGS += -DLIBUNWIND #LDFLAGS += #LDLIBS += -lunwind -lunwind-x86_64 # # # Backtraces with libbacktrace. FreeBSD works, but trace is empty. #BUILDSTACK = 1 #CXXFLAGS += -DLIBBACKTRACE #LDFLAGS += #LDLIBS += # # # Backtraces with backward-cpp. 
#
# Stack walking:
#  BACKWARD_HAS_UNWIND           - used by gcc/clang for exception handling
#  BACKWARD_HAS_BACKTRACE        - part of glibc, not as accurate, more portable
#
# Stack interpretation:
#  BACKWARD_HAS_DW               - most information, libdw (elfutils or libdwarf)
#  BACKWARD_HAS_BFD              - some information, libbfd
#  BACKWARD_HAS_BACKTRACE_SYMBOL - minimal information (file and function), portable
#
# helix fails with: cannot find -liberty
# gryphon fails with: cannot find -lbfd
# freebsd can't install a working elfutils, needed for libdw:
#   In file included from AS_UTL/AS_UTL_stackTrace.C:183:0:
#   AS_UTL/backward.hpp:241:30: fatal error: elfutils/libdw.h: No such file or directory
#    #include <elfutils/libdw.h>
#
#CXXFLAGS += -DBACKWARDCPP -DBACKWARD_HAS_BFD
#LDFLAGS  +=
#LDLIBS   += -lbfd -liberty -ldl -lz
#
# Needs libdw, elfutils
#CXXFLAGS += -DBACKWARDCPP -DBACKWARD_HAS_DW
#LDFLAGS  +=
#LDLIBS   += -ldl -lz
#
# Generates nothing useful, no function names, just binary names
#CXXFLAGS += -DBACKWARDCPP
#LDFLAGS  +=
#LDLIBS   += -ldl -lz
#
#
# No backtrace support.
#CXXFLAGS += -DNOBACKTRACE

# But, if we have an old GCC, stack tracing support isn't there.

GCC_45 := $(shell expr `${CC} -dumpversion | sed -e 's/\.\([0-9][0-9]\)/\1/g' -e 's/\.\([0-9]\)/0\1/g' -e 's/^[0-9]\{3,4\}$$/&00/'` \>= 40500)

GCC_VV := $(shell ${CC}  -dumpversion)
GXX_VV := $(shell ${CXX} -dumpversion)

ifeq (${BUILDSTACKTRACE}, 1)
ifeq (${GCC_45}, 0)
$(info WARNING:)
$(info WARNING: GCC ${GCC_VV} detected, disabling stack trace support.  Please upgrade to GCC 4.7 or higher.)
$(info WARNING:)
BUILDSTACKTRACE = 0
endif
endif

ifeq (${BUILDSTACKTRACE}, 1)
CXXFLAGS += -DLIBBACKTRACE
else
CXXFLAGS += -DNOBACKTRACE
endif

# Include the main user-supplied submakefile. This also recursively includes
# all other user-supplied submakefiles.
$(eval $(call INCLUDE_SUBMAKEFILE,main.mk))

# Perform post-processing on global variables as needed.
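The GCC_45 test relies on a sed pipeline that flattens a dotted version such as "4.5.0" into the integer 40500 before comparing against 40500. A Python sketch of that normalization, for illustration only; the `re` module stands in for sed and the function name is hypothetical:

```python
import re

def gcc_version_number(dumpversion):
    # Mirror the makefile's sed pipeline (a sketch):
    #   1) drop the dot before any two-digit component ("4.10" -> "410")
    #   2) zero-pad single-digit components (".5" -> "05")
    #   3) pad a 3- or 4-digit result with "00" ("405" -> "40500")
    v = re.sub(r"\.(\d\d)", r"\1", dumpversion)
    v = re.sub(r"\.(\d)", r"0\1", v)
    v = re.sub(r"^\d{3,4}$", lambda m: m.group(0) + "00", v)
    return int(v)

print(gcc_version_number("4.5.0") >= 40500)  # True
print(gcc_version_number("4.4.7") >= 40500)  # False
```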
DEFS := $(addprefix -D,${DEFS}) INCDIRS := $(addprefix -I,$(call CANONICAL_PATH,${INCDIRS})) # Define the "all" target (which simply builds all user-defined targets) as the # default goal. .PHONY: all all: UPDATE_VERSION MAKE_DIRS \ $(addprefix ${TARGET_DIR}/,${ALL_TGTS}) \ ${TARGET_DIR}/canu \ ${TARGET_DIR}/canu.defaults \ ${TARGET_DIR}/lib/canu/Consensus.pm \ ${TARGET_DIR}/lib/canu/CorrectReads.pm \ ${TARGET_DIR}/lib/canu/Configure.pm \ ${TARGET_DIR}/lib/canu/Defaults.pm \ ${TARGET_DIR}/lib/canu/ErrorEstimate.pm \ ${TARGET_DIR}/lib/canu/Execution.pm \ ${TARGET_DIR}/lib/canu/Gatekeeper.pm \ ${TARGET_DIR}/lib/canu/Grid.pm \ ${TARGET_DIR}/lib/canu/Grid_Cloud.pm \ ${TARGET_DIR}/lib/canu/Grid_DNANexus.pm \ ${TARGET_DIR}/lib/canu/Grid_LSF.pm \ ${TARGET_DIR}/lib/canu/Grid_PBSTorque.pm \ ${TARGET_DIR}/lib/canu/Grid_SGE.pm \ ${TARGET_DIR}/lib/canu/Grid_Slurm.pm \ ${TARGET_DIR}/lib/canu/HTML.pm \ ${TARGET_DIR}/lib/canu/Meryl.pm \ ${TARGET_DIR}/lib/canu/Output.pm \ ${TARGET_DIR}/lib/canu/OverlapBasedTrimming.pm \ ${TARGET_DIR}/lib/canu/OverlapErrorAdjustment.pm \ ${TARGET_DIR}/lib/canu/OverlapInCore.pm \ ${TARGET_DIR}/lib/canu/OverlapMhap.pm \ ${TARGET_DIR}/lib/canu/OverlapMMap.pm \ ${TARGET_DIR}/lib/canu/OverlapStore.pm \ ${TARGET_DIR}/lib/canu/Report.pm \ ${TARGET_DIR}/lib/canu/Unitig.pm @echo "" @echo "Success!" @echo "canu installed in ${TARGET_DIR}/canu" # Add a new target rule for each user-defined target. $(foreach TGT,${ALL_TGTS},\ $(eval $(call ADD_TARGET_RULE,${TGT}))) # Add pattern rule(s) for creating compiled object code from C source. $(foreach TGT,${ALL_TGTS},\ $(foreach EXT,${C_SRC_EXTS},\ $(eval $(call ADD_OBJECT_RULE,${BUILD_DIR}/$(call CANONICAL_PATH,${TGT}),\ ${EXT},$${COMPILE_C_CMDS})))) # Add pattern rule(s) for creating compiled object code from C++ source. 
$(foreach TGT,${ALL_TGTS},\ $(foreach EXT,${CXX_SRC_EXTS},\ $(eval $(call ADD_OBJECT_RULE,${BUILD_DIR}/$(call CANONICAL_PATH,${TGT}),\ ${EXT},$${COMPILE_CXX_CMDS})))) # Add "clean" rules to remove all build-generated files. .PHONY: clean $(foreach TGT,${ALL_TGTS},\ $(eval $(call ADD_CLEAN_RULE,${TGT}))) # Include generated rules that define additional (header) dependencies. $(foreach TGT,${ALL_TGTS},\ $(eval -include ${${TGT}_DEPS})) # A fake target, to regenerate the canu_version.H file on every build. .PHONY: UPDATE_VERSION UPDATE_VERSION: @./canu_version_update.pl AS_global.C: UPDATE_VERSION # A fake target, to make the directory for the canu perl modules. .PHONY: MAKE_DIRS MAKE_DIRS: @if [ ! -e ${TARGET_DIR}/lib/canu ] ; then mkdir -p ${TARGET_DIR}/lib/canu ; fi ${TARGET_DIR}/canu: pipelines/canu.pl cp -pf pipelines/canu.pl ${TARGET_DIR}/canu chmod +x ${TARGET_DIR}/canu ${TARGET_DIR}/canu.defaults: echo > ${TARGET_DIR}/canu.defaults "# Add site specific options (for setting up Grid or limiting memory/threads) here." 
chmod -x ${TARGET_DIR}/canu.defaults ${TARGET_DIR}/lib/canu/Consensus.pm: pipelines/canu/Consensus.pm cp -pf pipelines/canu/Consensus.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/CorrectReads.pm: pipelines/canu/CorrectReads.pm cp -pf pipelines/canu/CorrectReads.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Configure.pm: pipelines/canu/Configure.pm cp -pf pipelines/canu/Configure.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Defaults.pm: pipelines/canu/Defaults.pm cp -pf pipelines/canu/Defaults.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/ErrorEstimate.pm: pipelines/canu/ErrorEstimate.pm cp -pf pipelines/canu/ErrorEstimate.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Execution.pm: pipelines/canu/Execution.pm cp -pf pipelines/canu/Execution.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Gatekeeper.pm: pipelines/canu/Gatekeeper.pm cp -pf pipelines/canu/Gatekeeper.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Grid.pm: pipelines/canu/Grid.pm cp -pf pipelines/canu/Grid.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Grid_Cloud.pm: pipelines/canu/Grid_Cloud.pm cp -pf pipelines/canu/Grid_Cloud.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Grid_DNANexus.pm: pipelines/canu/Grid_DNANexus.pm cp -pf pipelines/canu/Grid_DNANexus.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Grid_LSF.pm: pipelines/canu/Grid_LSF.pm cp -pf pipelines/canu/Grid_LSF.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Grid_PBSTorque.pm: pipelines/canu/Grid_PBSTorque.pm cp -pf pipelines/canu/Grid_PBSTorque.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Grid_SGE.pm: pipelines/canu/Grid_SGE.pm cp -pf pipelines/canu/Grid_SGE.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Grid_Slurm.pm: pipelines/canu/Grid_Slurm.pm cp -pf pipelines/canu/Grid_Slurm.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/HTML.pm: pipelines/canu/HTML.pm cp -pf pipelines/canu/HTML.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Meryl.pm: pipelines/canu/Meryl.pm cp -pf 
pipelines/canu/Meryl.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Output.pm: pipelines/canu/Output.pm cp -pf pipelines/canu/Output.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/OverlapBasedTrimming.pm: pipelines/canu/OverlapBasedTrimming.pm cp -pf pipelines/canu/OverlapBasedTrimming.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/OverlapErrorAdjustment.pm: pipelines/canu/OverlapErrorAdjustment.pm cp -pf pipelines/canu/OverlapErrorAdjustment.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/OverlapInCore.pm: pipelines/canu/OverlapInCore.pm cp -pf pipelines/canu/OverlapInCore.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/OverlapMhap.pm: pipelines/canu/OverlapMhap.pm cp -pf pipelines/canu/OverlapMhap.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/OverlapMMap.pm: pipelines/canu/OverlapMMap.pm cp -pf pipelines/canu/OverlapMMap.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/OverlapStore.pm: pipelines/canu/OverlapStore.pm cp -pf pipelines/canu/OverlapStore.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Report.pm: pipelines/canu/Report.pm cp -pf pipelines/canu/Report.pm ${TARGET_DIR}/lib/canu/ ${TARGET_DIR}/lib/canu/Unitig.pm: pipelines/canu/Unitig.pm cp -pf pipelines/canu/Unitig.pm ${TARGET_DIR}/lib/canu/ # Makefile processed. Report that we're starting the build. $(info Building for '${OSTYPE}' '${OSVERSION}' as '${MACHINETYPE}' into '${DESTDIR}${PREFIX}/$(OSTYPE)-$(MACHINETYPE)/{bin,obj}') $(info CC ${CC} ${GCC_VV}) $(info CXX ${CXX} ${GXX_VV}) #${info Using LD_RUN_PATH '${LD_RUN_PATH}'} canu-1.6/src/bogart-analysis/000077500000000000000000000000001314437614700162065ustar00rootroot00000000000000canu-1.6/src/bogart-analysis/analyze-mapped-unitigs-for-joins.pl000066400000000000000000000153301314437614700250400ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. 
# # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. Walenz beginning on 2015-OCT-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## use strict; my $alignedReads = shift @ARGV; die "Can't find blasr-coords aligned reads in '$alignedReads'\n" if (! -e "$alignedReads"); my $b1last = 0; my $e1last = 0; my $n1last = ""; my $f1last = "fwd"; my $b2last = 0; my $e2last = 0; my $n2last = ""; my $f2last = "fwd"; my $nBubble = 0; my $nSpan = 0; my $nGap = 0; my $nGapEv = 0; my $nAbut = 0; my $nOvl = 0; my $nOvlEv = 0; # Load the name to iid map. 
my %NAMtoIID;
my %NAMtoUID;
my %UIDtoIID;

my %IIDtoDEG;
my %IIDtoUTG;

open(F, "< test.gkpStore.fastqUIDmap") or die "Failed to open 'test.gkpStore.fastqUIDmap'\n";
while (<F>) {
    my @v = split '\s+', $_;

    if (scalar(@v) == 3) {
        $NAMtoIID{$v[2]} = $v[1];
        $NAMtoUID{$v[2]} = $v[0];
        $UIDtoIID{$v[0]} = $v[1];
    } else {
        $NAMtoIID{$v[2]} = $v[1];
        $NAMtoUID{$v[2]} = $v[0];
        $UIDtoIID{$v[0]} = $v[1];

        $NAMtoIID{$v[5]} = $v[4];
        $NAMtoUID{$v[5]} = $v[3];
        $UIDtoIID{$v[3]} = $v[4];
    }
}
close(F);

print STDERR "Loaded ", scalar(keys %NAMtoIID), " read names.\n";

open(F, "< 9-terminator/test.posmap.frgdeg") or die "Failed to open '9-terminator/test.posmap.frgdeg'\n";
while (<F>) {
    my @v = split '\s+', $_;
    my $i = $UIDtoIID{$v[0]};
    die "$_" if (!defined($i));
    $IIDtoDEG{$i} = $v[1];
}
close(F);

print STDERR "Loaded ", scalar(keys %IIDtoDEG), " degenerate unitig reads.\n";

open(F, "< 9-terminator/test.posmap.frgutg") or die "Failed to open '9-terminator/test.posmap.frgutg'\n";
while (<F>) {
    my @v = split '\s+', $_;
    my $i = $UIDtoIID{$v[0]};
    die "$_" if (!defined($i));
    $IIDtoUTG{$i} = $v[1];
}
close(F);

print STDERR "Loaded ", scalar(keys %IIDtoUTG), " unitig reads.\n";

# UTG.coords is standard nucmer output.

open(F, "< 9-terminator/UTG.coords") or die "Failed to open '9-terminator/UTG.coords'\n";
$_ = <F>;
$_ = <F>;
$_ = <F>;
$_ = <F>;
$_ = <F>;
while (<F>) {
    s/^\s+//;
    s/\s+$//;

    my @v = split '\s+', $_;

    my $b1 = $v[0];  # Assembly coords
    my $e1 = $v[1];
    my $b2 = $v[3];  # Read coords
    my $e2 = $v[4];

    my $f1 = ($b1 < $e1) ? "fwd" : "rev";
    my $f2 = ($b2 < $e2) ? "fwd" : "rev";

    ($b1,$e1) = ($e1,$b1) if ($b1 > $e1);  # Doesn't seem to occur.
    ($b2,$e2) = ($e2,$b2) if ($b2 > $e2);  # Definitely does occur.
#my $l1 = $v[6]; #my $l2 = $v[7]; my $id = $v[9]; my $n1 = $v[11]; my $n2 = $v[12]; if (($b1last <= $b1) && ($e1 <= $e1last)) { $nBubble++; #print STDERR "BUBBLE $n2 ($b1,$e1) in $n2last ($b1last,$e1last)\n"; next; } $nSpan++; goto bail if ($n1last ne $n1); #print STDERR "last $b1last,$e1last curr $b1,$e1\n"; die "coords not sorted by increasing assembly start!\n" if ($b1last > $b1); $nGap++ if ($e1last < $b1); $nAbut++ if ($e1last == $b1); $nOvl++ if ($e1last > $b1); my $minOvlSpan = 150; if ($e1last < $b1) { my $ge = $b1; my $gb = $e1last; my $gap = $ge - $gb; my $span = 0; open(C, "< $alignedReads") or die "Failed to open '$alignedReads'\n"; while () { chomp; my @v = split '\s+', $_; next if ($v[9] ne $n1); # STOP changing the read name, blasr! if ($v[10] =~ m/^(.*\d+_\d+)\/\d+_\d+/) { $v[10] = $1; } my $iid = $NAMtoIID{$v[10]}; my $utg = $IIDtoUTG{$iid}; my $deg = $IIDtoDEG{$iid}; my $ann; #next if (defined($utg) || defined($deg)); if (defined($utg)) { $ann = "utg $utg"; } elsif (defined($deg)) { $ann = "deg $deg"; } else { $ann = "singleton "; } if (($v[0] + $minOvlSpan < $ge) && ($gb < $v[1] - $minOvlSpan)) { #print "$ann iid $iid -- $_\n"; $span++; } } close(C); if ($span > 0) { $nGapEv++; print "GAP $n1 $n2last ($b1last,$e1last,$f2last) -- $gap -- $n2 ($b1,$e1,$f2) -- SPAN $span\n"; print "\n"; print "\n"; } } if ($e1last == $b1) { } if ($e1last > $b1) { my $ge = $e1last; my $gb = $b1; my $ovl = $ge - $gb; my $span = 0; open(C, "< $alignedReads") or die "Failed to open '$alignedReads'\n"; while () { chomp; my @v = split '\s+', $_; next if ($v[9] ne $n1); if ($v[10] =~ m/^(.*\d+_\d+)\/\d+_\d+/) { $v[10] = $1; } my $iid = $NAMtoIID{$v[10]}; my $utg = $IIDtoUTG{$iid}; my $deg = $IIDtoDEG{$iid}; my $ann; #next if (defined($utg) || defined($deg)); if (defined($utg)) { $ann = "utg $utg"; } elsif (defined($deg)) { $ann = "deg $deg"; } else { $ann = "singleton "; } if (($v[0] + $minOvlSpan < $ge) && ($gb < $v[1] - $minOvlSpan)) { print "$ann iid $iid -- $_\n"; 
$span++; } } close(C); if ($span > 0) { $nOvlEv++; print "OVL $n1 $n2last ($b1last,$e1last,$f2last) -- $ovl -- $n2 ($b1,$e1,$f2) -- SPAN $span\n"; print "\n"; print "\n"; } } bail: ($b1last,$e1last,$n1last,$f1last) = ($b1,$e1,$n1,$f1); ($b2last,$e2last,$n2last,$f2last) = ($b2,$e2,$n2,$f2); } close(F); print STDERR "nBubble $nBubble\n"; print STDERR "nSpan $nSpan\n"; print STDERR "nGap $nGap\n"; print STDERR "nGap $nGapEv with evidence\n"; print STDERR "nAbut $nAbut\n"; print STDERR "nOvl $nOvl\n"; print STDERR "nOvl $nOvlEv with evidence\n"; canu-1.6/src/bogart-analysis/analyze-nucmer-gaps.pl000066400000000000000000000153301314437614700224270ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. Walenz beginning on 2015-OCT-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## use strict; my $alignedReads = shift @ARGV; die "Can't find blasr-coords aligned reads in '$alignedReads'\n" if (! -e "$alignedReads"); my $b1last = 0; my $e1last = 0; my $n1last = ""; my $f1last = "fwd"; my $b2last = 0; my $e2last = 0; my $n2last = ""; my $f2last = "fwd"; my $nBubble = 0; my $nSpan = 0; my $nGap = 0; my $nGapEv = 0; my $nAbut = 0; my $nOvl = 0; my $nOvlEv = 0; # Load the name to iid map. 
my %NAMtoIID;
my %NAMtoUID;
my %UIDtoIID;

my %IIDtoDEG;
my %IIDtoUTG;

open(F, "< test.gkpStore.fastqUIDmap") or die "Failed to open 'test.gkpStore.fastqUIDmap'\n";
while (<F>) {
    my @v = split '\s+', $_;

    if (scalar(@v) == 3) {
        $NAMtoIID{$v[2]} = $v[1];
        $NAMtoUID{$v[2]} = $v[0];
        $UIDtoIID{$v[0]} = $v[1];
    } else {
        $NAMtoIID{$v[2]} = $v[1];
        $NAMtoUID{$v[2]} = $v[0];
        $UIDtoIID{$v[0]} = $v[1];

        $NAMtoIID{$v[5]} = $v[4];
        $NAMtoUID{$v[5]} = $v[3];
        $UIDtoIID{$v[3]} = $v[4];
    }
}
close(F);

print STDERR "Loaded ", scalar(keys %NAMtoIID), " read names.\n";

open(F, "< 9-terminator/test.posmap.frgdeg") or die "Failed to open '9-terminator/test.posmap.frgdeg'\n";
while (<F>) {
    my @v = split '\s+', $_;
    my $i = $UIDtoIID{$v[0]};
    die "$_" if (!defined($i));
    $IIDtoDEG{$i} = $v[1];
}
close(F);

print STDERR "Loaded ", scalar(keys %IIDtoDEG), " degenerate unitig reads.\n";

open(F, "< 9-terminator/test.posmap.frgutg") or die "Failed to open '9-terminator/test.posmap.frgutg'\n";
while (<F>) {
    my @v = split '\s+', $_;
    my $i = $UIDtoIID{$v[0]};
    die "$_" if (!defined($i));
    $IIDtoUTG{$i} = $v[1];
}
close(F);

print STDERR "Loaded ", scalar(keys %IIDtoUTG), " unitig reads.\n";

# UTG.coords is standard nucmer output.

open(F, "< 9-terminator/UTG.coords") or die "Failed to open '9-terminator/UTG.coords'\n";
$_ = <F>;
$_ = <F>;
$_ = <F>;
$_ = <F>;
$_ = <F>;
while (<F>) {
    s/^\s+//;
    s/\s+$//;

    my @v = split '\s+', $_;

    my $b1 = $v[0];  # Assembly coords
    my $e1 = $v[1];
    my $b2 = $v[3];  # Read coords
    my $e2 = $v[4];

    my $f1 = ($b1 < $e1) ? "fwd" : "rev";
    my $f2 = ($b2 < $e2) ? "fwd" : "rev";

    ($b1,$e1) = ($e1,$b1) if ($b1 > $e1);  # Doesn't seem to occur.
    ($b2,$e2) = ($e2,$b2) if ($b2 > $e2);  # Definitely does occur.
#my $l1 = $v[6]; #my $l2 = $v[7]; my $id = $v[9]; my $n1 = $v[11]; my $n2 = $v[12]; if (($b1last <= $b1) && ($e1 <= $e1last)) { $nBubble++; #print STDERR "BUBBLE $n2 ($b1,$e1) in $n2last ($b1last,$e1last)\n"; next; } $nSpan++; goto bail if ($n1last ne $n1); #print STDERR "last $b1last,$e1last curr $b1,$e1\n"; die "coords not sorted by increasing assembly start!\n" if ($b1last > $b1); $nGap++ if ($e1last < $b1); $nAbut++ if ($e1last == $b1); $nOvl++ if ($e1last > $b1); my $minOvlSpan = 150; if ($e1last < $b1) { my $ge = $b1; my $gb = $e1last; my $gap = $ge - $gb; my $span = 0; open(C, "< $alignedReads") or die "Failed to open '$alignedReads'\n"; while () { chomp; my @v = split '\s+', $_; next if ($v[9] ne $n1); # STOP changing the read name, blasr! if ($v[10] =~ m/^(.*\d+_\d+)\/\d+_\d+/) { $v[10] = $1; } my $iid = $NAMtoIID{$v[10]}; my $utg = $IIDtoUTG{$iid}; my $deg = $IIDtoDEG{$iid}; my $ann; #next if (defined($utg) || defined($deg)); if (defined($utg)) { $ann = "utg $utg"; } elsif (defined($deg)) { $ann = "deg $deg"; } else { $ann = "singleton "; } if (($v[0] + $minOvlSpan < $ge) && ($gb < $v[1] - $minOvlSpan)) { #print "$ann iid $iid -- $_\n"; $span++; } } close(C); if ($span > 0) { $nGapEv++; print "GAP $n1 $n2last ($b1last,$e1last,$f2last) -- $gap -- $n2 ($b1,$e1,$f2) -- SPAN $span\n"; print "\n"; print "\n"; } } if ($e1last == $b1) { } if ($e1last > $b1) { my $ge = $e1last; my $gb = $b1; my $ovl = $ge - $gb; my $span = 0; open(C, "< $alignedReads") or die "Failed to open '$alignedReads'\n"; while () { chomp; my @v = split '\s+', $_; next if ($v[9] ne $n1); if ($v[10] =~ m/^(.*\d+_\d+)\/\d+_\d+/) { $v[10] = $1; } my $iid = $NAMtoIID{$v[10]}; my $utg = $IIDtoUTG{$iid}; my $deg = $IIDtoDEG{$iid}; my $ann; #next if (defined($utg) || defined($deg)); if (defined($utg)) { $ann = "utg $utg"; } elsif (defined($deg)) { $ann = "deg $deg"; } else { $ann = "singleton "; } if (($v[0] + $minOvlSpan < $ge) && ($gb < $v[1] - $minOvlSpan)) { print "$ann iid $iid -- $_\n"; 
$span++; } } close(C); if ($span > 0) { $nOvlEv++; print "OVL $n1 $n2last ($b1last,$e1last,$f2last) -- $ovl -- $n2 ($b1,$e1,$f2) -- SPAN $span\n"; print "\n"; print "\n"; } } bail: ($b1last,$e1last,$n1last,$f1last) = ($b1,$e1,$n1,$f1); ($b2last,$e2last,$n2last,$f2last) = ($b2,$e2,$n2,$f2); } close(F); print STDERR "nBubble $nBubble\n"; print STDERR "nSpan $nSpan\n"; print STDERR "nGap $nGap\n"; print STDERR "nGap $nGapEv with evidence\n"; print STDERR "nAbut $nAbut\n"; print STDERR "nOvl $nOvl\n"; print STDERR "nOvl $nOvlEv with evidence\n"; canu-1.6/src/bogart-analysis/bogart-build-fasta.pl000066400000000000000000000111431314437614700222120ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. Walenz beginning on 2015-OCT-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. 
## use strict; my $prefix; my $FILE = undef; my $REF = undef; my $withSGE = 0; my $withPart = 0; if (scalar(@ARGV) == 0) { die "usage: $0 [-sge] [-partition] PREFIX.tigStore REFERENCE.fasta\n"; } while (scalar(@ARGV) > 0) { if ($ARGV[0] eq "-sge") { $withSGE = 1; } elsif ($ARGV[0] eq "-partition") { $withPart = 1; } else { $FILE = $ARGV[0]; $REF = $ARGV[1]; shift @ARGV; } shift @ARGV; } die "Must supply 'PREFIX.tigStore' as first non-option argument.\n" if (!defined($FILE)); die "Must supply 'REFERENCE.fasta' as second non-option argument.\n" if (!defined($REF)); if ($FILE =~ m/^(.*).tigStore/) { $FILE = $1; } if ($FILE =~ m/^(.*)\.\d\d\d\./) { $prefix = $1; } else { die "Failed to find prefix in '$FILE', expecting PREFIX.###.tigStore.\n"; } die "Failed to find reference '$REF'\n" if (! -e $REF); my %detected1; my %detected2; if (! -e "$FILE.tigStore") { my $cmd; $cmd = "bogart \\n"; $cmd .= " -O ../$prefix.ovlStore \\n"; $cmd .= " -G ../$prefix.gkpStore \\n"; $cmd .= " -T ../$prefix.tigStore \\n"; $cmd .= " -B 75000 \\n"; $cmd .= " -e 0.03 \\n"; $cmd .= " -E 0 \\n"; $cmd .= " -b \\n"; $cmd .= " -m 7 \\n"; $cmd .= " -U \\n"; $cmd .= " -o $prefix \\n"; $cmd .= " -D all \\n"; $cmd .= " -d overlapQuality \\n"; print "$cmd\n"; die "Will not run command.\n"; } ######################################## if ($withPart) { my $cmd; $cmd .= "gatekeeper"; $cmd .= " -P $FILE.tigStore.partitioning ../$prefix.gkpStore"; system($cmd); print STDERR "Partitioned.\n"; exit; } ######################################## print STDERR "Running consensus\n"; open(R, "> $FILE.sge.sh") or die; print R "#!/bin/sh\n"; print R "idx=`printf %03d \$SGE_TASK_ID`\n"; print R "utgcns -f -g ../$prefix.gkpStore -t $FILE.tigStore 1 \$idx > $FILE.tigStore/seqDB.v002.p\${idx}.utg.err 2>&1\n"; close(R); open(R, "> $FILE.tmp.sh") or die; open(F, "ls $FILE.tigStore/seqDB.v001.p???.dat |"); while () { chomp; if (m/seqDB.v001.p(\d\d\d).dat$/) { my $idx = $1; if (! 
-e "$FILE.tigStore/seqDB.v002.p${idx}.dat") { print STDERR " $_\n"; print R "utgcns -f -g ../$prefix.gkpStore -t $FILE.tigStore 1 $idx > $FILE.tigStore/seqDB.v002.p${idx}.utg.err 2>&1 &\n"; } else { print STDERR " $_ finished.\n"; } } else { die "No '$_'\n"; } } close(F); print R "wait\n"; close(R); if ($withSGE) { #system("qsub -cwd -j y -o /dev/null -l memory=2g -t 1-9 -q servers.q,black.q $FILE.sge.sh"); exit; } else { system("sh $FILE.tmp.sh"); } #unlink "$FILE.tmp.sh"; #unlink "$FILE.sge.sh"; ######################################## print STDERR "Dumping fasta\n"; if (! -e "$FILE.fasta") { my $cmd; $cmd .= "tigStore"; $cmd .= " -g ../$prefix.gkpStore"; $cmd .= " -t $FILE.tigStore 2"; $cmd .= " -U -d consensus"; $cmd .= "> $FILE.fasta"; system($cmd); } print STDERR "Dumping layout\n"; if (! -e "$FILE.layout") { my $cmd; $cmd .= "tigStore"; $cmd .= " -g ../$prefix.gkpStore"; $cmd .= " -t $FILE.tigStore 2"; $cmd .= " -U -d layout"; $cmd .= "> $FILE.layout"; system($cmd); } ######################################## print STDERR "Running nucmer\n"; if (! -e "$FILE.delta") { my $cmd; $cmd .= "nucmer"; $cmd .= " --maxmatch --coords -p $FILE"; $cmd .= " $REF"; $cmd .= " $FILE.fasta"; system($cmd); } if (! -e "$FILE.png") { my $cmd; $cmd .= "mummerplot"; $cmd .= " --layout --filter -p $FILE -t png"; #$cmd .= " --layout -p $FILE -t png"; $cmd .= " $FILE.delta"; system($cmd); } canu-1.6/src/bogart-analysis/count-bubbles.pl000066400000000000000000000141131314437614700213070ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. 
# # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. Walenz beginning on 2015-OCT-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## use strict; my $FILE = "test.004.buildUnitigs"; $FILE = shift @ARGV if (scalar(@ARGV) > 0); if ((! -e "$FILE.tigStore") || (! -e "$FILE.fasta")) { die "Missing tigStore or fasta. Run build-fasta.pl\n"; } ######################################## print STDERR "Aligning unitigs to unitigs\n"; if (! -e "$FILE.allVall.sim4db" ) { my $cmd; $cmd .= "/work/kmer/FreeBSD-amd64/bin/snapper2 "; $cmd .= " -queries $FILE.fasta "; $cmd .= " -genomic $FILE.fasta "; $cmd .= " -mersize 16 "; $cmd .= " -minmatchidentity 94 "; $cmd .= " -minmatchcoverage 94 "; $cmd .= " -aligns "; $cmd .= " -verbose > $FILE.allVall.sim4db"; system($cmd); } print STDERR "Filtering alignments\n"; if (! 
-e "$FILE.bubbles.sim4db") { my $cmd; $cmd .= "/work/kmer/FreeBSD-amd64/bin/filterPolishes "; $cmd .= " -selfhits "; $cmd .= " -D "; $cmd .= " < $FILE.allVall.sim4db "; $cmd .= " > $FILE.bubbles.sim4db"; system($cmd); } ######################################## my %detected1; my %detected2; my $utg; my $ref; my $ide; my $cov; print STDERR "Reading bubble mapping\n"; open(F, "< $FILE.bubbles.sim4db") or die; while () { chomp; if (m/^edef=utg(\d+)/) { $utg = $1; } if (m/^ddef=utg(\d+)/) { $ref = $1; } if (m/^\d+\[(\d+)-\d+-\d+\]\s\d+\[\d+-\d+\]\s+<(\d+)-\d+-(\d+)-\w+-\w+>$/) { $ide = $3; $cov = $2; } if (m/^(\d+)-(\d+)\s\((\d+)-(\d+)\)\s/) { $cov = 100 * $cov / ($2 - $1); } if (m/^sim4end$/) { die if (!defined($utg)); die if (!defined($ref)); if (($ide >= 96) && ($cov >= 98)) { $detected1{"$utg"}++; $detected2{"$utg-$ref"}++; } #if ($cov < 100) { # print STDERR "LOW COV $utg - $ref -- $cov\n"; #} undef $utg; undef $ref; } } close(F); print STDERR "Found ", scalar(keys %detected2), " actual bubble instances, from ", scalar(keys %detected1), " unitigs.\n"; my $numUnique = 0; foreach my $k (keys %detected1) { if ($detected1{$k} == 1) { #print STDERR "UNIQUE $k\n"; $numUnique++; } } print STDERR "Found $numUnique uniquely placeable bubbles.\n"; # # Scan the log. Discover if we successfully merged a bubble. Report bubbles that we merged that # were not verified by mapping. Remove bubbles we are successful on from the list of alignments # (so we can next report what we failed on). 
# my $numMerged = 0; my $numMergedUnique = 0; my $numMergedMultiple = 0; my $numMergedExtra = 0; my $numMergedExtraMultiple = 0; open(F, "< test.005.bubblePopping.log") or die; while (<F>) { chomp; if (m/merged\sbubble\sunitig\s(\d+)\sinto\sunitig\s(\d+)$/) { $numMerged++; if (!exists($detected1{"$1"})) { #print STDERR "EXTRA $_ (no mapping at all)\n"; $numMergedExtra++; } elsif (!exists($detected2{"$1-$2"})) { #print STDERR "EXTRA $_ (no mapping for this pair out of $detected1{\"$1\"} alignments - possibly incorrectly placed!)\n"; $numMergedExtraMultiple++; } elsif ($detected1{"$1"} == 1) { $numMergedUnique++; } elsif ($detected1{"$1"} > 1) { #print STDERR "MULTI $_ ($detected1{\"$1\"} alignments - possibly incorrectly placed!)\n"; $numMergedMultiple++; } else { die; } delete $detected2{"$1-$2"}; delete $detected1{"$1"}; } } close(F); print STDERR "Merged $numMerged bubbles:\n"; print STDERR " $numMergedUnique were uniquely alignable.\n"; print STDERR " $numMergedMultiple were multiply alignable.\n"; print STDERR " $numMergedExtra were not alignable at all.\n"; print STDERR " $numMergedExtraMultiple were not alignable at the spot it was merged.\n"; # # Scan the layouts.  For each un-popped bubble (BOG didn't pop, but it aligned to a unitig) count # the number of fragments in the unitig, and length of the tig.  Report that and the number of # places the bubble mapped.
# my $numMergedMissed = 0; my $numMergedMissedUnique = 0; my $id = 0; # Unitig ID we're looking at my $nf = 0; # Number of fragments my $ln = 0; # Length of the unitig open(F, "< $FILE.layout"); open(O, "> $FILE.bubblesMISSED"); while (<F>) { chomp; if (m/^unitig\s+(\d+)$/) { if ($nf > 0) { $numMergedMissed++ if ($detected1{$id} > 0); $numMergedMissedUnique++ if ($detected1{$id} == 1); print O "Missed merge unitig $id of length $ln with $nf frags -- mapped to $detected1{$id} places.\n"; } $id = $1; $nf = 0; $ln = 0; } if (($detected1{"$id"} > 0) && (m/^FRG.*position\s+(\d+)\s+(\d+)$/)) { $nf++; $ln = $1 if ($ln < $1); $ln = $2 if ($ln < $2); } } close(F); if ($nf > 0) { $numMergedMissed++ if ($detected1{$id} > 0); $numMergedMissedUnique++ if ($detected1{$id} == 1); print O "Missed merge unitig $id of length $ln with $nf frags -- mapped to $detected1{$id} places.\n"; } close(O); print STDERR "Failed to merge $numMergedMissed unitigs.\n"; print STDERR " $numMergedMissedUnique were uniquely placeable.\n"; canu-1.6/src/bogart-analysis/examine-mapping-ideal.pl000066400000000000000000000306041314437614700227010ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P.
Walenz beginning on 2015-OCT-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## use strict; my $TOOSHORT = 500; my $FILE = "test.004.buildUnitigs"; $FILE = shift @ARGV if (scalar(@ARGV) > 0); my $IDEAL = "../ideal/porphyromonas_gingivalis_w83.flx.3200bp.0900bp.FJRUAFO0.ideal.intervals"; if ((! -e "$FILE.tigStore") || (! -e "$FILE.fasta")) { die "Missing tigStore or fasta.  Run build-fasta.pl\n"; } ######################################## if (! -e "$FILE.coords") { my $cmd; $cmd = "nucmer"; $cmd .= " --maxmatch --coords -p $FILE"; $cmd .= " /work/FRAGS/porphyromonas_gingivalis_w83/reference/AE015924.fasta"; $cmd .= " $FILE.fasta"; system($cmd); } ######################################## # Load the length of each sequence. my %length; if (-e "$FILE.fasta") { open(F, "< $FILE.fasta") or die; while (<F>) { if (m/^>(utg\d+)\s+len=(\d+)$/) { $length{$1} = $2; } } close(F); } else { print STDERR "No $FILE.fasta file found, cannot .....\n"; } ######################################## # Pass one, load the coords.  Analyze anything with one match, counting # the amount of coverage. my %nucmer; if (-e "$FILE.coords") { open(F, "< $FILE.coords") or die; $_ = <F>; $_ = <F>; $_ = <F>; $_ = <F>; $_ = <F>; while (<F>) { s/^\s+//; s/\s+$//; my @v = split '\s+', $_; my $utgBgn = ($v[3] < $v[4]) ? $v[3] : $v[4]; my $utgEnd = ($v[3] < $v[4]) ? $v[4] : $v[3]; my $genBgn = ($v[0] < $v[1]) ? $v[0] : $v[1]; my $genEnd = ($v[0] < $v[1]) ? $v[1] : $v[0]; # Rearrange the coords so that bgn is ALWAYS less than end (we lose orientation). my $str = "$utgBgn\t$utgEnd\t$genBgn\t$genEnd"; if (exists($nucmer{$v[12]})) { $nucmer{$v[12]} .= "\n" . $str; } else { $nucmer{$v[12]} = $str; } } close(F); } else { die "No $FILE.coords??\n"; } ######################################## # For things with one match, report spurs.
my $spur = 0; my $spurShort = 0; my $confirmed = 0; my $confirmedBP = 0; foreach my $utg (keys %nucmer) { my @m = split '\n', $nucmer{$utg}; if (scalar(@m) > 1) { @m = sort {$a <=> $b} @m; $nucmer{$utg} = join "\n", @m; next; } my ($utgBgn, $utgEnd, $genBgn, $genEnd) = split '\s+', $nucmer{$utg}; my $len = $utgEnd - $utgBgn; die if ($len <= 0); my $per = $len / $length{$utg}; if ($per < 0.95) { if ($length{$utg} < $TOOSHORT) { $spurShort++; } else { print STDERR "SPUR $utg len $length{$utg} mapped $len ($per%) coords $nucmer{$utg}\n"; $spur++; } } else { $confirmed++; $confirmedBP += $length{$utg}; } delete $nucmer{$utg}; } print STDERR "FOUND ", scalar(keys %nucmer), " MATCHES, of which $spur are SPURs (and $spurShort are SHORT SPURS)\n"; print STDERR "FOUND $confirmed confirmed unitigs of total length $confirmedBP bp.\n"; ######################################## # Things with multiple matches are candidates for chimera. my $classFalseMateSplit = 0; my $lenFalseMateSplit = 0; my $classShort = 0; my $lenShort = 0; my $classDisconnected = 0; my $lenDisconnected = 0; my $classRepeat = 0; my $lenRepeat = 0; my $classRepeatEndWithRepeat = 0; my $lenRepeatEndWithRepeat = 0; my $classRepeat1Unique = 0; my $lenRepeat1Unique = 0; my $classRepeat2Unique = 0; my $lenRepeat2Unique = 0; my $classMultipleBlock = 0; my $lenMultipleBlock = 0; my $classUnknown = 0; my $lenUnknown = 0; foreach my $utg (keys %nucmer) { my @m = split '\n', $nucmer{$utg}; my $numBlocks = 0; my @bgn; my @end; my $min = 999999999; my $max = 0; #print STDOUT "$utg\n"; # Examine the matches, decide if the multiple matches are due to # unitig is a repeat # short repeats interior to the unitig # long repeats at the end # chimer my $numMatches = 0; my $isDisconnected = 0; my $verboseBuild = 0; print STDOUT "--------------------------------------------------------------------------------\n" if ($verboseBuild); print STDOUT "UNITIG $utg\n" if ($verboseBuild); foreach my $m (@m) { my ($utgBgn, $utgEnd, $genBgn, 
$genEnd) = split '\s+', $m; $numMatches++; print STDOUT " $m" if ($verboseBuild); # Are we an exact match to the largest thing so far? if (($utgBgn == $min) && ($utgEnd == $max)) { print STDOUT " -- is exact\n" if ($verboseBuild); next; } # Search for things comtained or overlapping previous matches. This assumes that matches # are sorted by begin coord. We need to handle: # # ---------- and ----------- not -------- or ----- # ---- --------- ------ ---------- # # and, the ones we're more-or-less really looking for: # # -------- OR -------- # -------- ------- # This isolates out the first case above. We don't want to save it as a region, # because it is completely contained in a previous region. if ($utgEnd <= $max) { print STDOUT " -- is contained\n" if ($verboseBuild); next; } # If we aren't even intersecting, we've found the third case. $isDisconnected++ if (($max > 0) && ($max < $utgBgn)); # Finally, update the min/max extents. This must be last. if ($max == 0) { $min = $utgBgn; $max = $utgEnd; } if (($utgBgn - 5 <= $min) && ($max < $utgEnd)) { # max STRICTLY LESS than utgEnd! pop @end; push @end, $utgEnd; $max = $utgEnd; print STDOUT " -- extends ", $bgn[scalar(@bgn)-1], " ", $end[scalar(@end)-1], " ", scalar(@bgn), " ", scalar(@end), " $utgBgn $utgEnd\n" if ($verboseBuild); next; } if ($utgBgn < $min) { $min = $utgBgn; } if ($max < $utgEnd) { $max = $utgEnd; } push @bgn, $utgBgn; push @end, $utgEnd; $numBlocks++; print STDOUT "\n" if ($verboseBuild); } # One more pass through to count repeats. Above we just found the min/max extent of the # alignments on the unitig. Here we can count if alignments are interior to the extent, at the # end, or completely spanning it. # # This fails by counting interior disconnected chimer (e.g., -- -- --) as 'isInteriorRepeat' my $isMaxExact = 0; my $isInteriorRepeat = 0; my $isTerminalRepeat = 0; foreach my $m (@m) { my ($utgBgn, $utgEnd, $genBgn, $genEnd) = split '\s+', $m; # New match covers the whole region. 
if (($utgBgn - 5 <= $min) && ($max <= $utgEnd + 5)) { $isMaxExact++; next; } # New match completely contained, near the extent. if (($utgBgn - 5 <= $min) || ($max <= $utgEnd + 5)) { $isTerminalRepeat++; next; } # Otherwise, the match must be interior. $isInteriorRepeat++; } print STDERR "UNDEF LENGTH $utg\n" if (!defined($length{$utg})); print STDERR "ZERO LENGTH $utg\n" if ($length{$utg} == 0); print STDERR "SHORT LENGTH $utg\n" if ($length{$utg} < 64); if (($numMatches == 2) && ($isDisconnected > 0) && ($isTerminalRepeat == 2)) { $classFalseMateSplit++; $lenFalseMateSplit += $length{$utg}; #print "$utg (len $length{$utg}) is a FALSE MATE SPLIT.\n"; next; } if ($isDisconnected > 0) { $classDisconnected++; $lenDisconnected += $length{$utg}; #print "$utg (len $length{$utg}) is DISCONNECTED.\n"; next; } if ($isMaxExact >= 2) { $classRepeat++; $lenRepeat += $length{$utg}; #print "$utg (len $length{$utg}) is a REPEAT.\n"; next; } if ($max < $TOOSHORT) { $classShort++; $lenShort += $length{$utg}; next; } if (($isMaxExact == 1) && ($isTerminalRepeat > 0) && ($isInteriorRepeat > 0)) { $classRepeatEndWithRepeat++; $lenRepeatEndWithRepeat += $length{$utg}; if (0) { print "----------\n"; print "$utg extent $min $max numBlocks $numBlocks numMatches $numMatches maxExact $isMaxExact interiorRepeat $isInteriorRepeat terminalRepeat $isTerminalRepeat disconnect $isDisconnected\n"; print "$nucmer{$utg}\n"; for (my $iii=0; $iii 1); next; } if (($isMaxExact == 1) && ($isTerminalRepeat == $numMatches - 1)) { $classRepeat1Unique++; $lenRepeat1Unique += $length{$utg}; if (0) { print "----------\n"; print "$utg extent $min $max numBlocks $numBlocks numMatches $numMatches maxExact $isMaxExact interiorRepeat $isInteriorRepeat terminalRepeat $isTerminalRepeat disconnect $isDisconnected\n"; print "$nucmer{$utg}\n"; for (my $iii=0; $iii 1) { $classMultipleBlock++; $lenMultipleBlock += $length{$utg}; if (0) { print "----------\n"; print "$utg extent $min $max numBlocks $numBlocks numMatches 
$numMatches maxExact $isMaxExact interiorRepeat $isInteriorRepeat terminalRepeat $isTerminalRepeat disconnect $isDisconnected\n"; print "$nucmer{$utg}\n"; for (my $iii=0; $iii 0); if ((! -e "$FILE.tigStore") || (! -e "$FILE.fasta")) { die "Missing tigStore or fasta. Run build-fasta.pl\n"; } ######################################## if (! -e "$FILE.coords") { my $cmd; $cmd = "nucmer"; $cmd .= " --maxmatch --coords -p $FILE"; $cmd .= " /work/FRAGS/porphyromonas_gingivalis_w83/reference/AE015924.fasta"; $cmd .= " $FILE.fasta"; system($cmd); } ######################################## # Load the lengh of each sequence. my %length; if (-e "$FILE.fasta") { open(F, "< $FILE.fasta") or die; while () { if (m/^>(utg\d+)\s+len=(\d+)$/) { $length{$1} = $2; } } close(F); } else { print STDERR "No $FILE.fasta file found, cannot .....\n"; } ######################################## # Pass one, load the coords. Analyze anything with one match, counting # the amount of coverage. my %nucmer; if (-e "$FILE.coords") { open(F, "< $FILE.coords") or die; $_ = ; $_ = ; $_ = ; $_ = ; $_ = ; while () { s/^\s+//; s/\s$+//; my @v = split '\s+', $_; my $utgBgn = ($v[3] < $v[4]) ? $v[3] : $v[4]; my $utgEnd = ($v[3] < $v[4]) ? $v[4] : $v[3]; my $genBgn = ($v[0] < $v[1]) ? $v[0] : $v[1]; my $genEnd = ($v[0] < $v[1]) ? $v[1] : $v[0]; # Rearrange the coords so that bgn is ALWAYS less than end (we lose orientation). my $str = "$utgBgn\t$utgEnd\t$genBgn\t$genEnd"; if (exists($nucmer{$v[12]})) { $nucmer{$v[12]} .= "\n" . $str; } else { $nucmer{$v[12]} = $str; } } close(F); } else { die "No $FILE.coords??\n"; } ######################################## # For things with one match, report spurs. 
my $spur = 0; my $spurShort = 0; my $confirmed = 0; my $confirmedBP = 0; foreach my $utg (keys %nucmer) { my @m = split '\n', $nucmer{$utg}; if (scalar(@m) > 1) { @m = sort {$a <=> $b} @m; $nucmer{$utg} = join "\n", @m; next; } my ($utgBgn, $utgEnd, $genBgn, $genEnd) = split '\s+', $nucmer{$utg}; my $len = $utgEnd - $utgBgn; die if ($len <= 0); my $per = $len / $length{$utg}; if ($per < 0.95) { if ($length{$utg} < $TOOSHORT) { $spurShort++; } else { print STDERR "SPUR $utg len $length{$utg} mapped $len ($per%) coords $nucmer{$utg}\n"; $spur++; } } else { $confirmed++; $confirmedBP += $length{$utg}; } delete $nucmer{$utg}; } print STDERR "FOUND ", scalar(keys %nucmer), " MATCHES, of which $spur are SPURs (and $spurShort are SHORT SPURS)\n"; print STDERR "FOUND $confirmed confirmed unitigs of total length $confirmedBP bp.\n"; ######################################## # Things with multiple matches are candidates for chimera. my $classFalseMateSplit = 0; my $lenFalseMateSplit = 0; my $classShort = 0; my $lenShort = 0; my $classDisconnected = 0; my $lenDisconnected = 0; my $classRepeat = 0; my $lenRepeat = 0; my $classRepeatEndWithRepeat = 0; my $lenRepeatEndWithRepeat = 0; my $classRepeat1Unique = 0; my $lenRepeat1Unique = 0; my $classRepeat2Unique = 0; my $lenRepeat2Unique = 0; my $classMultipleBlock = 0; my $lenMultipleBlock = 0; my $classUnknown = 0; my $lenUnknown = 0; foreach my $utg (keys %nucmer) { my @m = split '\n', $nucmer{$utg}; my $numBlocks = 0; my @bgn; my @end; my $min = 999999999; my $max = 0; #print STDOUT "$utg\n"; # Examine the matches, decide if the multiple matches are due to # unitig is a repeat # short repeats interior to the unitig # long repeats at the end # chimer my $numMatches = 0; my $isDisconnected = 0; my $verboseBuild = 1; print STDOUT "--------------------------------------------------------------------------------\n" if ($verboseBuild); print STDOUT "UNITIG $utg\n" if ($verboseBuild); foreach my $m (@m) { my ($utgBgn, $utgEnd, $genBgn, 
$genEnd) = split '\s+', $m; $numMatches++; print STDOUT " $m" if ($verboseBuild); # Are we an exact match to the largest thing so far? if (($utgBgn == $min) && ($utgEnd == $max)) { print STDOUT " -- is exact\n" if ($verboseBuild); next; } # Search for things comtained or overlapping previous matches. This assumes that matches # are sorted by begin coord. We need to handle: # # ---------- and ----------- not -------- or ----- # ---- --------- ------ ---------- # # and, the ones we're more-or-less really looking for: # # -------- OR -------- # -------- ------- # This isolates out the first case above. We don't want to save it as a region, # because it is completely contained in a previous region. if ($utgEnd <= $max) { print STDOUT " -- is contained in a region\n" if ($verboseBuild); next; } # If we aren't even intersecting, we've found the third case. my $dlabel = ""; if (($max > 0) && ($max < $utgBgn)) { my $dist = $utgBgn - $max; $dlabel = " DISCONNECT $dist"; $isDisconnected++ } # Finally, update the min/max extents. This must be last. if ($max == 0) { $min = $utgBgn; $max = $utgEnd; } if (($utgBgn - 5 <= $min) && ($max < $utgEnd)) { # max STRICTLY LESS than utgEnd! pop @end; push @end, $utgEnd; $max = $utgEnd; print STDOUT " -- extends a region\n" if ($verboseBuild); next; } if ($utgBgn < $min) { $min = $utgBgn; } if ($max < $utgEnd) { $max = $utgEnd; } push @bgn, $utgBgn; push @end, $utgEnd; $numBlocks++; print STDOUT " -- makes a new region$dlabel\n" if ($verboseBuild); } # One more pass through to count repeats. Above we just found the min/max extent of the # alignments on the unitig. Here we can count if alignments are interior to the extent, at the # end, or completely spanning it. 
# # This fails by counting interior disconnected chimer (e.g., -- -- --) as 'isInteriorRepeat' my $isMaxExact = 0; my $isInteriorRepeat = 0; my $isTerminalRepeat = 0; foreach my $m (@m) { my ($utgBgn, $utgEnd, $genBgn, $genEnd) = split '\s+', $m; # New match covers the whole region. if (($utgBgn - 5 <= $min) && ($max <= $utgEnd + 5)) { $isMaxExact++; next; } # New match completely contained, near the extent. if (($utgBgn - 5 <= $min) || ($max <= $utgEnd + 5)) { $isTerminalRepeat++; next; } # Otherwise, the match must be interior. $isInteriorRepeat++; } print STDERR "UNDEF LENGTH $utg\n" if (!defined($length{$utg})); print STDERR "ZERO LENGTH $utg\n" if ($length{$utg} == 0); print STDERR "SHORT LENGTH $utg\n" if ($length{$utg} < 64); if (($numMatches == 2) && ($isDisconnected > 0) && ($isTerminalRepeat == 2)) { $classFalseMateSplit++; $lenFalseMateSplit += $length{$utg}; print "$utg (len $length{$utg}) is a FALSE MATE SPLIT.\n" if ($verboseBuild); next; } if ($isDisconnected > 0) { $classDisconnected++; $lenDisconnected += $length{$utg}; print "$utg (len $length{$utg}) is DISCONNECTED.\n" if ($verboseBuild); next; } if ($isMaxExact >= 2) { $classRepeat++; $lenRepeat += $length{$utg}; print "$utg (len $length{$utg}) is a REPEAT.\n" if ($verboseBuild); next; } if ($max < $TOOSHORT) { $classShort++; $lenShort += $length{$utg}; next; } if (($isMaxExact == 1) && ($isTerminalRepeat > 0) && ($isInteriorRepeat > 0)) { $classRepeatEndWithRepeat++; $lenRepeatEndWithRepeat += $length{$utg}; if (0) { print "----------\n"; print "$utg extent $min $max numBlocks $numBlocks numMatches $numMatches maxExact $isMaxExact interiorRepeat $isInteriorRepeat terminalRepeat $isTerminalRepeat disconnect $isDisconnected\n"; print "$nucmer{$utg}\n"; for (my $iii=0; $iii 1); next; } if (($isMaxExact == 1) && ($isTerminalRepeat == $numMatches - 1)) { $classRepeat1Unique++; $lenRepeat1Unique += $length{$utg}; if (0) { print "----------\n"; print "$utg extent $min $max numBlocks $numBlocks 
numMatches $numMatches maxExact $isMaxExact interiorRepeat $isInteriorRepeat terminalRepeat $isTerminalRepeat disconnect $isDisconnected\n"; print "$nucmer{$utg}\n"; for (my $iii=0; $iii 1) { $classMultipleBlock++; $lenMultipleBlock += $length{$utg}; if (0) { print "----------\n"; print "$utg extent $min $max numBlocks $numBlocks numMatches $numMatches maxExact $isMaxExact interiorRepeat $isInteriorRepeat terminalRepeat $isTerminalRepeat disconnect $isDisconnected\n"; print "$nucmer{$utg}\n"; for (my $iii=0; $iii; my $len = ; my $cns = ; my $qlt = ; my $d1 = ; # cov stat my $d2 = ; # microhet my $d3 = ; # status my $d4 = ; # unique_rept my $d5 = ; # status my $nfrg = ; # num frags my $d7 = ; # num unitigs my $maxPos = 0; if ($utg =~ m/unitig\s+(\d+)$/) { $utg = $1; } else { die "Out of sync on '$utg'\n"; } if ($nfrg =~ m/data.num_frags\s+(\d+)$/) { $nfrg = $1; } else { die "Out of sync on '$nfrg'\n"; } my @frgs; for (my $nnnn=0; $nnnn < $nfrg; $nnnn++) { $_ = ; chomp; if (m/^FRG\stype\sR\sident\s+(\d+)\s+.*position\s+(\d+)\s+(\d+)$/) { push @frgs, "FRG\t$1\t$2\t$3"; $maxPos = ($maxPos < $2) ? $2 : $maxPos; $maxPos = ($maxPos < $3) ? 
$3 : $maxPos; } else { die "Nope: '$_'\n"; } } push @tigStore, "UTG\t$utg\t$nfrg\t$maxPos"; push @tigStore, @frgs; } close(F); if (defined($ARGV[0])) { push @reads, $ARGV[0]; } else { print STDERR "SCANNING tigStore for reads that span a unitig.\n"; my $chk; my $utg; my $nfrg = 0; my $tfrg = 0; my $maxPos; foreach my $l (@tigStore) { if ($nfrg == 0) { ($chk, $utg, $nfrg, $maxPos) = split '\s+', $l; die if ($chk ne "UTG"); next; } $nfrg--; $tfrg++; my ($chk, $iid, $bgn, $end) = split '\s+', $l; die if ($chk ne "FRG"); if ($bgn > $end) { ($bgn, $end) = ($end, $bgn); } if (($bgn == 0) && ($end == $maxPos)) { push @reads, $iid; } } print STDERR "Searching for placements for ", scalar(@reads), " reads.\n"; } foreach my $read (@reads) { my %ovlRead; my %ovlType; #print STDERR "Searching for placements for read $read in unitigs.\n"; print "\n"; # Find overlapping reads. open(F, "overlapStore -b $read -e $read -d $prefix.ovlStore |"); while () { s/^\s+//; s/\s+$//; my @v = split '\s+', $_; $ovlRead{$v[1]} = 1; $ovlType{$v[1]} = ''; if (($v[3] < 0) && ($v[4] < 0)) { $ovlType{$v[1]} = '5'; } elsif (($v[3] > 0) && ($v[4] > 0)) { $ovlType{$v[1]} = '3'; } elsif (($v[3] < 0) && ($v[4] > 0)) { $ovlType{$v[1]} = 'c'; # Contained } else { $ovlType{$v[1]} = 'C'; # Container } } close(F); #print STDERR "Searching for placements for read $read in unitigs, using ", scalar(keys %ovlRead), " overlapping reads.\n"; # Scan unitigs. For any unitig with more than zero overlapping reads present, report. 
my $chk; my $utg; my $nfrg = 0; my $cfrg = 0; my $maxPos; my $nmat; my %ntyp; my $self; foreach my $l (@tigStore) { if ($cfrg == 0) { ($chk, $utg, $nfrg, $maxPos) = split '\s+', $l; die if ($chk ne "UTG"); $cfrg = $nfrg; $nmat = 0; $ntyp{'5'} = 0; $ntyp{'3'} = 0; $ntyp{'C'} = 0; $ntyp{'c'} = 0; $self = 0; next; } $cfrg--; my ($chk, $iid, $bgn, $end) = split '\s+', $l; die if ($chk ne "FRG"); if ($bgn > $end) { ($bgn, $end) = ($end, $bgn); } if (exists($ovlRead{$iid})) { $nmat++; $ntyp{$ovlType{$iid}}++; #print "FRG\t$iid\tat\t$bgn\t$end\t$ovlType{$iid}\n"; } if ($iid == $read) { $self++; } if (($cfrg == 0) && ($self + $nmat > 0)) { if ($self) { print "read $read\tUNITIG $utg\tof length $maxPos with 5'=$ntyp{'5'} 3'=$ntyp{'3'} contained=$ntyp{'c'} container=$ntyp{'C'} overlapping reads out of $nfrg.\n"; } else { print "read $read\tunitig $utg\tof length $maxPos with 5'=$ntyp{'5'} 3'=$ntyp{'3'} contained=$ntyp{'c'} container=$ntyp{'C'} overlapping reads out of $nfrg.\n"; } } } close(F); } canu-1.6/src/bogart-analysis/show-false-best-edges-from-mapping.pl000066400000000000000000000045361314437614700252350ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. 
Walenz beginning on 2015-OCT-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## use strict; my $asm = shift @ARGV; # Reads: # $asm//best.edges # filtered-overlaps.true.ova # filtered-overlaps.false.ova # # Reports which best.edges are false my %true; my %false; open(F, "< filtered-overlaps.true.ova") or die; while (<F>) { s/^\s+//; s/\s+$//; my @v = split '\s+', $_; $true{"$v[0]-$v[1]"}++; $true{"$v[1]-$v[0]"}++; } close(F); open(F, "< filtered-overlaps.false.ova") or die; while (<F>) { s/^\s+//; s/\s+$//; my @v = split '\s+', $_; $false{"$v[0]-$v[1]"}++; $false{"$v[1]-$v[0]"}++; } close(F); my $true5 = 0; my $true3 = 0; my $false5 = 0; my $false3 = 0; my $novel5 = 0; my $novel3 = 0; open(F, "< $asm/best.edges") or die; while (<F>) { s/^\s+//; s/\s+$//; my @v = split '\s+', $_; my $p5 = "$v[0]-$v[2]"; my $p3 = "$v[0]-$v[4]"; if (exists($true{$p5})) { $true5++; } elsif (exists($false{$p5})) { print STDERR "FALSE 5 $_\n"; $false5++; } else { print STDERR "NOVEL 5 $_\n"; $novel5++; } if (exists($true{$p3})) { $true3++; } elsif (exists($false{$p3})) { print STDERR "FALSE 3 $_\n"; $false3++; } else { print STDERR "NOVEL 3 $_\n"; $novel3++; } } close(F); print "true5 $true5\n"; print "true3 $true3\n"; print "false5 $false5\n"; print "false3 $false3\n"; print "novel5 $novel5\n"; print "novel3 $novel3\n"; canu-1.6/src/bogart/000077500000000000000000000000001314437614700143655ustar00rootroot00000000000000canu-1.6/src/bogart/AS_BAT_AssemblyGraph.C000066400000000000000000001025221314437614700203050ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-JUL-21 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_BAT_ReadInfo.H" #include "AS_BAT_BestOverlapGraph.H" #include "AS_BAT_AssemblyGraph.H" #include "AS_BAT_Logging.H" #include "AS_BAT_PlaceReadUsingOverlaps.H" #include "intervalList.H" #include "stddev.H" #undef FILTER_DENSE_BUBBLES_FROM_GRAPH #define FILTER_DENSE_BUBBLES_THRESHOLD 3 // Retain bubbles if they have fewer than this number of edges to other tigs #undef LOG_GRAPH #undef LOG_GRAPH_ALL void AssemblyGraph::buildReverseEdges(void) { writeStatus("AssemblyGraph()-- building reverse edges.\n"); for (uint32 fi=1; fi<RI->numReads()+1; fi++) _pReverse[fi].clear(); for (uint32 fi=1; fi<RI->numReads()+1; fi++) { for (uint32 ff=0; ff<_pForward[fi].size(); ff++) { BestPlacement &bp = _pForward[fi][ff]; BestReverse br(fi, ff); // Ensure that contained edges have no dovetail edges. This screws up the logic when // rebuilding and outputting the graph. if (bp.bestC.b_iid != 0) { assert(bp.best5.b_iid == 0); assert(bp.best3.b_iid == 0); } // Add reverse edges if the forward edge exists if (bp.bestC.b_iid != 0) _pReverse[bp.bestC.b_iid].push_back(br); if (bp.best5.b_iid != 0) _pReverse[bp.best5.b_iid].push_back(br); if (bp.best3.b_iid != 0) _pReverse[bp.best3.b_iid].push_back(br); // Check sanity. assert((bp.bestC.a_hang <= 0) && (bp.bestC.b_hang >= 0)); // ALL contained edges should be this.
assert((bp.best5.a_hang <= 0) && (bp.best5.b_hang <= 0)); // ALL 5' edges should be this. assert((bp.best3.a_hang >= 0) && (bp.best3.b_hang >= 0)); // ALL 3' edges should be this. } } } void AssemblyGraph::buildGraph(const char *UNUSED(prefix), double deviationRepeat, TigVector &tigs, bool tigEndsOnly) { uint32 fiLimit = RI->numReads(); uint32 numThreads = omp_get_max_threads(); uint32 blockSize = (fiLimit < 100 * numThreads) ? numThreads : fiLimit / 99; // Just some logging. Count the number of reads we try to place. uint32 nToPlaceContained = 0; uint32 nToPlace = 0; uint32 nPlacedContained = 0; uint32 nPlaced = 0; uint32 nFailedContained = 0; uint32 nFailed = 0; for (uint32 fid=1; fid<RI->numReads()+1; fid++) { if (tigs.inUnitig(fid) == 0) // Unplaced, don't care. These didn't assemble, and aren't contained. continue; if (OG->isContained(fid)) nToPlaceContained++; else nToPlace++; } writeStatus("\n"); writeStatus("AssemblyGraph()-- allocating vectors for placements, %.3fMB\n", // vector<> is 24 bytes, pretty tiny. (sizeof(vector<BestPlacement>) + sizeof(vector<BestReverse>)) * (fiLimit + 1) / 1048576.0); _pForward = new vector<BestPlacement> [fiLimit + 1]; _pReverse = new vector<BestReverse> [fiLimit + 1]; writeStatus("AssemblyGraph()-- finding edges for %u reads (%u contained), ignoring %u unplaced reads, with %d thread%s.\n", nToPlaceContained + nToPlace, nToPlaceContained, RI->numReads() - nToPlaceContained - nToPlace, numThreads, (numThreads == 1) ? "" : "s"); // Do the placing! #pragma omp parallel for schedule(dynamic, blockSize) for (uint32 fi=1; fi<RI->numReads()+1; fi++) { bool enableLog = true; uint32 fiTigID = tigs.inUnitig(fi); if (fiTigID == 0) // Unplaced, don't care. continue; if (tigs[fiTigID]->_isUnassembled == true) // Unassembled, don't care. continue; if (tigEndsOnly == true) { uint32 f = tigs[fiTigID]->firstRead()->ident; uint32 l = tigs[fiTigID]->lastRead()->ident; if ((f != fi) && (l != fi)) // Not the first read and not the last read, continue; // Don't care. } // Grab a bit about this read.
uint32 fiLen = RI->readLength(fi); ufNode *fiRead = &tigs[fiTigID]->ufpath[ tigs.ufpathIdx(fi) ]; int32 fiMin = fiRead->position.min(); int32 fiMax = fiRead->position.max(); // Find ALL potential placements, regardless of error rate. vector<overlapPlacement> placements; placeReadUsingOverlaps(tigs, NULL, fi, placements); #ifdef LOG_GRAPH //writeLog("AG()-- working on read %u with %u placements\n", fi, placements.size()); #endif // For each placement decide if the overlap is compatible with the tig. for (uint32 pp=0; pp<placements.size(); pp++) { Unitig *tig = tigs[placements[pp].tigID]; double erate = placements[pp].errors / placements[pp].aligned; // Ignore placements into singleton tigs. if (tig->ufpath.size() <= 1) { #ifdef LOG_GRAPH writeLog("AG()-- read %8u placement %2u -> tig %7u placed %9d-%9d verified %9d-%9d cov %7.5f erate %6.4f SINGLETON\n", fi, pp, placements[pp].tigID, placements[pp].position.bgn, placements[pp].position.end, placements[pp].verified.bgn, placements[pp].verified.end, placements[pp].fCoverage, erate); #endif continue; } int32 utgmin = placements[pp].position.min(); // Placement in unitig. int32 utgmax = placements[pp].position.max(); bool utgfwd = placements[pp].position.isForward(); int32 ovlmin = placements[pp].verified.min(); // Placement in unitig, verified by overlaps. int32 ovlmax = placements[pp].verified.max(); assert(placements[pp].covered.bgn < placements[pp].covered.end); // Coverage is always forward. bool is5 = (placements[pp].covered.bgn == 0) ? true : false; // Placement covers the 5' end of the read bool is3 = (placements[pp].covered.end == fiLen) ? true : false; // Placement covers the 3' end of the read // Ignore placements that aren't overlaps (contained reads placed inside this read will do this).
if ((is5 == false) && (is3 == false)) { #ifdef LOG_GRAPH_ALL writeLog("AG()-- read %8u placement %2u -> tig %7u placed %9d-%9d verified %9d-%9d cov %7.5f erate %6.4f SPANNED_REPEAT\n", fi, pp, placements[pp].tigID, placements[pp].position.bgn, placements[pp].position.end, placements[pp].verified.bgn, placements[pp].verified.end, placements[pp].fCoverage, erate); #endif continue; } // Decide if the overlap is to the left (towards 0) or right (towards infinity) of us on the tig. bool onLeft = (((utgfwd == true) && (is5 == true)) || ((utgfwd == false) && (is3 == true))) ? true : false; bool onRight = (((utgfwd == true) && (is3 == true)) || ((utgfwd == false) && (is5 == true))) ? true : false; // Decide if this is already captured in a tig. If so, we'll emit to GFA, but omit from our // internal graph. bool isTig = false; if ((placements[pp].tigID == fiTigID) && (utgmin <= fiMax) && (fiMin <= utgmax)) isTig = true; // Decide if the placement is compatible with the other reads in the tig. #define REPEAT_FRACTION 0.5 if ((isTig == false) && (tig->overlapConsistentWithTig(deviationRepeat, ovlmin, ovlmax, erate) < REPEAT_FRACTION)) { #ifdef LOG_GRAPH_ALL if ((enableLog == true) && (logFileFlagSet(LOG_PLACE_UNPLACED))) writeLog("AG()-- read %8u placement %2u -> tig %7u placed %9d-%9d verified %9d-%9d cov %7.5f erate %6.4f HIGH_ERROR\n", fi, pp, placements[pp].tigID, placements[pp].position.bgn, placements[pp].position.end, placements[pp].verified.bgn, placements[pp].verified.end, placements[pp].fCoverage, erate); #endif continue; } // A valid placement! Create a BestPlacement for it.
      BestPlacement   bp;

#ifdef LOG_GRAPH
      writeLog("AG()-- read %8u placement %2u -> tig %7u placed %9d-%9d verified %9d-%9d cov %7.5f erate %6.4f Fidx %6u Lidx %6u is5 %d is3 %d onLeft %d onRight %d VALID_PLACEMENT\n",
               fi, pp,
               placements[pp].tigID,
               placements[pp].position.bgn, placements[pp].position.end,
               placements[pp].verified.bgn, placements[pp].verified.end,
               placements[pp].fCoverage, erate,
               placements[pp].tigFidx, placements[pp].tigLidx,
               is5, is3, onLeft, onRight);
#endif

      //  Find the reads we have overlaps to.  The range of reads here is the first and last read in
      //  the tig layout that overlaps with us.  We don't need to check that the reads overlap in the
      //  layout: the only false case I can think of involves contained reads.
      //
      //    READ:                  -----------------------------------
      //    TIG:   Fidx  -----------------------------
      //    TIG:   (1)       ------
      //    TIG:         --------------------------------------
      //    TIG:   Lidx     -----------------------------------------
      //                                                    (2) ------
      //
      //  The short read is placed at (1), but also has an overlap to us at (2).

      set<uint32>  tigReads;

      for (uint32 rr=placements[pp].tigFidx; rr <= placements[pp].tigLidx; rr++)
        tigReads.insert(tig->ufpath[rr].ident);

      //  Scan all overlaps.  Decide if the overlap is to the L or R of the _placed_ read, and save
      //  the thickest overlap on the 5' or 3' end of the read.
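The "thickest overlap" scan that follows is an arg-max over overlap lengths, kept separately for each read end. A simplified sketch with a made-up overlap record (canu's BAToverlap carries hangs, error rates, and containment flags on top of this):

```cpp
#include <cstdint>
#include <vector>

//  Hypothetical simplified overlap record.
struct Olap {
  uint32_t  b_iid;      //  The other read.
  uint32_t  len;        //  Overlap length.
  bool      at5prime;   //  Overlap hangs off our 5' end (else 3').
};

//  Return the index of the thickest (longest) overlap on the requested
//  end, or UINT32_MAX if none exists -- mirroring the thickest5/thickest3 scan.
uint32_t thickest(const std::vector<Olap> &ovl, bool want5prime) {
  uint32_t best = UINT32_MAX, bestLen = 0;

  for (uint32_t oo = 0; oo < ovl.size(); oo++)
    if ((ovl[oo].at5prime == want5prime) && (bestLen < ovl[oo].len)) {
      best    = oo;
      bestLen = ovl[oo].len;
    }

  return best;
}
```

Ties go to the earliest overlap scanned, since only a strictly longer overlap replaces the current best.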
uint32 no = 0; BAToverlap *ovl = OC->getOverlaps(fi, no); uint32 thickestC = UINT32_MAX, thickestCident = 0; uint32 thickest5 = UINT32_MAX, thickest5len = 0; uint32 thickest3 = UINT32_MAX, thickest3len = 0; for (uint32 oo=0; oooverlapLength(ovl[oo].a_iid, ovl[oo].b_iid, ovl[oo].a_hang, ovl[oo].b_hang); if (ovl[oo].AisContainer() == true) { continue; } else if ((ovl[oo].AisContained() == true) && (is5 == true) && (is3 == true)) { if (thickestCident < ovl[oo].evalue) { thickestC = oo; thickestCident = ovl[oo].evalue; bp.bestC = ovl[oo]; } } else if ((ovl[oo].AEndIs5prime() == true) && (is5 == true)) { if (thickest5len < olapLen) { thickest5 = oo; thickest5len = olapLen; bp.best5 = ovl[oo]; } } else if ((ovl[oo].AEndIs3prime() == true) && (is3 == true)) { if (thickest3len < olapLen) { thickest3 = oo; thickest3len = olapLen; bp.best3 = ovl[oo]; } } } // If we have both 5' and 3' edges, delete the containment edge. if ((bp.best5.b_iid != 0) && (bp.best3.b_iid != 0)) { thickestC = UINT32_MAX; thickestCident = 0; bp.bestC = BAToverlap(); } // If we have a containment edge, delete the 5' and 3' edges. if (bp.bestC.b_iid != 0) { thickest5 = UINT32_MAX; thickest5len = 0; bp.best5 = BAToverlap(); thickest3 = UINT32_MAX; thickest3len = 0; bp.best3 = BAToverlap(); } // Save the edge. bp.tigID = placements[pp].tigID; bp.placedBgn = placements[pp].position.bgn; bp.placedEnd = placements[pp].position.end; bp.olapBgn = placements[pp].verified.bgn; bp.olapEnd = placements[pp].verified.end; bp.isContig = isTig; bp.isUnitig = false; bp.isBubble = false; bp.isRepeat = false; // If there are best edges off the 5' or 3' end, grab all the overlaps, find the particular // overlap, and generate new BestEdgeOverlaps for them. 
if ((thickestC == UINT32_MAX) && (thickest5 == UINT32_MAX) && (thickest3 == UINT32_MAX)) { #ifdef LOG_GRAPH writeLog("AG()-- read %8u placement %2u -> tig %7u placed %9d-%9d verified %9d-%9d cov %7.5f erate %6.4f NO_EDGES Fidx %6u Lidx %6u is5 %d is3 %d onLeft %d onRight %d\n", fi, pp, placements[pp].tigID, placements[pp].position.bgn, placements[pp].position.end, placements[pp].verified.bgn, placements[pp].verified.end, placements[pp].fCoverage, erate, placements[pp].tigFidx, placements[pp].tigLidx, is5, is3, onLeft, onRight); #endif continue; } assert((thickestC != 0) || (thickest5 != 0) || (thickest3 != 0)); // Save the BestPlacement uint32 ff = _pForward[fi].size(); _pForward[fi].push_back(bp); // And now just log. #ifdef LOG_GRAPH if (thickestC != UINT32_MAX) { writeLog("AG()-- read %8u placement %2u -> tig %7u placed %9d-%9d verified %9d-%9d cov %7.5f erate %6.4f CONTAINED %8d (%8d %8d)%s\n", fi, pp, placements[pp].tigID, placements[pp].position.bgn, placements[pp].position.end, placements[pp].verified.bgn, placements[pp].verified.end, placements[pp].fCoverage, erate, bp.bestC.b_iid, bp.best5.b_iid, bp.best3.b_iid, (isTig == true) ? " IN_UNITIG" : ""); } else { writeLog("AG()-- read %8u placement %2u -> tig %7u placed %9d-%9d verified %9d-%9d cov %7.5f erate %6.4f DOVETAIL (%8d) %8d %8d%s\n", fi, pp, placements[pp].tigID, placements[pp].position.bgn, placements[pp].position.end, placements[pp].verified.bgn, placements[pp].verified.end, placements[pp].fCoverage, erate, bp.bestC.b_iid, bp.best5.b_iid, bp.best3.b_iid, (isTig == true) ? 
" IN_UNITIG" : ""); } #endif } // Over all placements } // Over all reads buildReverseEdges(); writeStatus("AssemblyGraph()-- build complete.\n"); } void placeAsContained(TigVector &tigs, uint32 fi, BestPlacement &bp) { BestEdgeOverlap edge(bp.bestC); ufNode read; Unitig *tig = tigs[ tigs.inUnitig(edge.readId()) ]; if (tig->placeRead(read, fi, bp.bestC.AEndIs3prime(), &edge) == false) { fprintf(stderr, "WARNING: placeAsContained failed for fi=%u\n", fi); assert(0); } bp.tigID = tig->id(); bp.placedBgn = read.position.bgn; bp.placedEnd = read.position.end; bp.olapBgn = INT32_MIN; // We don't know the overlapping region (without a lot bp.olapEnd = INT32_MAX; // of work) so make it invalid. bp.isContig = (tigs.inUnitig(fi) == tigs.inUnitig(edge.readId())); } // This test is correct, but it isn't used correctly. When rebuilding the graph, we don't know if // a read is fully covered. If it isn't fully covered, it isn't 'inContig' even if the positions // overlap. bool areReadsOverlapping(TigVector &tigs, uint32 ai, uint32 bi) { Unitig *at = tigs[ tigs.inUnitig(ai) ]; Unitig *bt = tigs[ tigs.inUnitig(bi) ]; if (at != bt) return(false); ufNode &ar = at->ufpath[ tigs.ufpathIdx(ai) ]; ufNode &br = bt->ufpath[ tigs.ufpathIdx(bi) ]; return((ar.position.min() < br.position.max()) && (br.position.min() < ar.position.max())); } void placeAsDovetail(TigVector &tigs, uint32 fi, BestPlacement &bp) { BestEdgeOverlap edge5(bp.best5), edge3(bp.best3); ufNode read5, read3; if ((bp.best5.b_iid > 0) && (bp.best3.b_iid > 0)) { Unitig *tig5 = tigs[ tigs.inUnitig(edge5.readId()) ]; Unitig *tig3 = tigs[ tigs.inUnitig(edge3.readId()) ]; assert(tig5->id() == tig3->id()); if ((tig5->placeRead(read5, fi, bp.best5.AEndIs3prime(), &edge5) == false) || (tig3->placeRead(read3, fi, bp.best3.AEndIs3prime(), &edge3) == false)) { fprintf(stderr, "WARNING: placeAsDovetail 5' 3' failed for fi=%u\n", fi); assert(0); } bp.tigID = tig5->id(); bp.placedBgn = (read5.position.bgn + read3.position.bgn) / 2; 
bp.placedEnd = (read5.position.end + read3.position.end) / 2; #if 0 bp.isContig = (areReadsOverlapping(tigs, fi, bp.best5.b_iid) && areReadsOverlapping(tigs, fi, bp.best3.b_iid)); #else if ((bp.isContig == true) && // Remove the isContig mark if this read is now (tigs.inUnitig(fi) != bp.tigID)) // in a different tig than the two edges (which is unlikely). bp.isContig = false; #endif } else if (bp.best5.b_iid > 0) { Unitig *tig5 = tigs[ tigs.inUnitig(edge5.readId()) ]; if (tig5->placeRead(read5, fi, bp.best5.AEndIs3prime(), &edge5) == false) { fprintf(stderr, "WARNING: placeAsDovetail 5' failed for fi=%u\n", fi); assert(0); } bp.tigID = tig5->id(); bp.placedBgn = read5.position.bgn; bp.placedEnd = read5.position.end; #if 0 bp.isContig = areReadsOverlapping(tigs, fi, bp.best5.b_iid); #else if ((bp.isContig == true) && // Remove the isContig mark if this read is now (tigs.inUnitig(fi) != bp.tigID)) // in a different tig than the edge. bp.isContig = false; #endif } else if (bp.best3.b_iid > 0) { Unitig *tig3 = tigs[ tigs.inUnitig(edge3.readId()) ]; if (tig3->placeRead(read3, fi, bp.best3.AEndIs3prime(), &edge3) == false) { fprintf(stderr, "WARNING: placeAsDovetail 3' failed for fi=%u\n", fi); assert(0); } bp.tigID = tig3->id(); bp.placedBgn = read3.position.bgn; bp.placedEnd = read3.position.end; #if 0 bp.isContig = areReadsOverlapping(tigs, fi, bp.best3.b_iid); #else if ((bp.isContig == true) && // Remove the isContig mark if this read is now (tigs.inUnitig(fi) != bp.tigID)) // in a different tig than the edge. bp.isContig = false; #endif } assert(tigs[bp.tigID] != NULL); bp.olapBgn = INT32_MIN; // We don't know the overlapping region (without a lot bp.olapEnd = INT32_MAX; // of work) so make it invalid. 
} void AssemblyGraph::rebuildGraph(TigVector &tigs) { writeStatus("AssemblyGraph()-- rebuilding\n"); uint64 nContain = 0; uint64 nSame = 0; uint64 nSplit = 0; for (uint32 fi=1; finumReads()+1; fi++) { for (uint32 ff=0; ff<_pForward[fi].size(); ff++) { BestPlacement &bp = _pForward[fi][ff]; // Figure out which tig each of our three overlaps is in. uint32 t5 = (bp.best5.b_iid > 0) ? tigs.inUnitig(bp.best5.b_iid) : UINT32_MAX; uint32 t3 = (bp.best3.b_iid > 0) ? tigs.inUnitig(bp.best3.b_iid) : UINT32_MAX; //writeLog("AssemblyGraph()-- rebuilding read %u edge %u with overlaps %u %u %u\n", // fi, ff, bp.bestC.b_iid, bp.best5.b_iid, bp.best3.b_iid); // If a containment relationship, place it using the contain and update the placement. if (bp.bestC.b_iid > 0) { assert(bp.best5.b_iid == 0); assert(bp.best3.b_iid == 0); nContain++; placeAsContained(tigs, fi, bp); } // Otherwise, dovetails. If both overlapping reads are in the same tig, place it and update // the placement. else if ((t5 == t3) || // Both in the same tig (t5 == UINT32_MAX) || // 5' overlap isn't set (t3 == UINT32_MAX)) { // 3' overlap isn't set nSame++; placeAsDovetail(tigs, fi, bp); } // Otherwise, yikes, our overlapping reads are in different tigs! We need to make new // placements and delete the current one. else { BestPlacement bp5 = bp; BestPlacement bp3 = bp; bp5.best3 = BAToverlap(); // Erase the 3' overlap bp3.best5 = BAToverlap(); // Erase the 5' overlap assert(bp5.best5.b_iid != 0); // Overlap must exist! assert(bp3.best3.b_iid != 0); // Overlap must exist! nSplit++; placeAsDovetail(tigs, fi, bp5); placeAsDovetail(tigs, fi, bp3); // Add the two placements to our list. We let one placement overwrite the current // placement, move the placement after that to the end of the list, and overwrite // that placement with our other new one. uint32 ll = _pForward[fi].size(); // There's a nasty case when ff is the last currently on the list; there isn't an ff+1 // element to move to the end of the list. 
So, we add a new element to the list - // guaranteeing there is always an ff+1 element - then move, then replace. _pForward[fi].push_back(BestPlacement()); _pForward[fi][ll] = _pForward[fi][ff+1]; _pForward[fi][ff] = bp5; _pForward[fi][ff+1] = bp3; // Skip the edge we just added. ff++; } } } buildReverseEdges(); writeStatus("AssemblyGraph()-- rebuild complete.\n"); } // Filter edges that originate from the middle of a tig. // Need to save interior edges as long as they are consistent with a boundary edge. void AssemblyGraph::filterEdges(TigVector &tigs) { uint64 nUnitig = 0; uint64 nContig = 0; uint64 nBubble = 0; uint64 nRepeat = 0; uint64 nMiddleFiltered = 0, nMiddleReads = 0; uint64 nRepeatFiltered = 0, nRepeatReads = 0; uint64 nIntersecting = 0; uint64 nRepeatEdges = 0; uint64 nBubbleEdges = 0; writeStatus("AssemblyGraph()-- filtering edges\n"); // Mark edges that are from the interior of a tig as 'repeat'. for (uint32 fi=1; finumReads()+1; fi++) { if (_pForward[fi].size() == 0) continue; uint32 tT = tigs.inUnitig(fi); Unitig *tig = tigs[tT]; ufNode &read = tig->ufpath[tigs.ufpathIdx(fi)]; bool hadMiddle = false; for (uint32 ff=0; ff<_pForward[fi].size(); ff++) { BestPlacement &bp = _pForward[fi][ff]; // Edges forming the tig are not repeats. if (bp.isUnitig == true) continue; if (bp.isContig == true) continue; // Edges from the end of a tig are not repeats. 
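The append-then-shuffle trick described above (push a sentinel so an ff+1 slot always exists, move the displaced element to the back, then overwrite ff and ff+1) avoids an O(n) mid-vector insert at the cost of reordering one element. A sketch over a plain std::vector<int> (the helper name is made up):

```cpp
#include <cstdint>
#include <vector>

//  Replace element ff with the two values a and b.  The element that was
//  at ff+1 is moved to the back of the list (order is not preserved,
//  just as in rebuildGraph).
void replaceWithTwo(std::vector<int> &v, uint32_t ff, int a, int b) {
  uint32_t ll = v.size();

  v.push_back(0);        //  Guarantee an ff+1 element exists.
  v[ll]   = v[ff + 1];   //  Move the displaced element to the end.
  v[ff]   = a;
  v[ff+1] = b;
}
```

When ff is the last element, the moved "displaced element" is the sentinel itself, which is then overwritten by b, so the edge case falls out for free.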
if (((read.position.min() == 0) && (read.position.isForward()) && (bp.best5.b_iid > 0) && (bp.best3.b_iid == 0)) || ((read.position.min() == 0) && (read.position.isReverse()) && (bp.best5.b_iid == 0) && (bp.best3.b_iid > 0)) || ((read.position.max() == tig->getLength()) && (read.position.isForward()) && (bp.best5.b_iid == 0) && (bp.best3.b_iid > 0)) || ((read.position.max() == tig->getLength()) && (read.position.isReverse()) && (bp.best5.b_iid > 0) && (bp.best3.b_iid == 0))) { nIntersecting++; continue; } nMiddleFiltered++; bp.isRepeat = true; hadMiddle = true; } if (hadMiddle) nMiddleReads++; } // Filter edges that hit too many tigs for (uint32 fi=1; finumReads()+1; fi++) { if (_pForward[fi].size() == 0) continue; uint32 tT = tigs.inUnitig(fi); Unitig *tig = tigs[tT]; ufNode &read = tig->ufpath[tigs.ufpathIdx(fi)]; set hits; for (uint32 ff=0; ff<_pForward[fi].size(); ff++) { BestPlacement &bp = _pForward[fi][ff]; assert(bp.isUnitig == false); if (bp.isUnitig == true) { continue; } // Skip edges that are in tigs if (bp.isContig == true) { continue; } // if (bp.isRepeat == true) { continue; } // Skip edges that are already ignored hits.insert(bp.tigID); } // If only a few other tigs are involved, keep all. if (hits.size() > 0) writeLog("AG()-- read %u in tig %u has edges to %u tigs\n", fi, tT, hits.size()); #ifdef FILTER_DENSE_BUBBLES_FROM_GRAPH if (hits.size() <= FILTER_DENSE_BUBBLES_THRESHOLD) continue; // Otherwise, mark all edges as repeat. 
nRepeatReads++; for (uint32 ff=0; ff<_pForward[fi].size(); ff++) { BestPlacement &bp = _pForward[fi][ff]; assert(bp.isUnitig == false); if (bp.isUnitig == true) { continue; } // Skip edges that are in tigs if (bp.isContig == true) { continue; } // if (bp.isRepeat == true) { continue; } // Skip edges that are already ignored nRepeatFiltered++; bp.isRepeat = true; } #endif } // Generate statistics for (uint32 fi=1; finumReads()+1; fi++) { for (uint32 ff=0; ff<_pForward[fi].size(); ff++) { BestPlacement &bp = _pForward[fi][ff]; if (bp.isUnitig == true) { nUnitig++; continue; } if (bp.isContig == true) { nContig++; continue; } if (bp.isRepeat == true) { nRepeatEdges++; } if (bp.isRepeat == false) { nBubbleEdges++; } } } // Report writeStatus("AssemblyGraph()-- " F_U64 " contig edges and " F_U64 " unitig edges.\n", nContig, nUnitig); writeStatus("AssemblyGraph()-- " F_U64 " bubble edges and " F_U64 " repeat edges.\n", nBubble, nRepeat); writeStatus("AssemblyGraph()-- " F_U64 " middle contig edges filtered from " F_U64 " reads.\n", nMiddleFiltered, nMiddleReads); writeStatus("AssemblyGraph()-- " F_U64 " repeat end edges filtered from " F_U64 " reads.\n", nRepeatFiltered, nRepeatReads); writeStatus("AssemblyGraph()-- " F_U64 " repeat edges (not output).\n", nRepeatEdges); writeStatus("AssemblyGraph()-- " F_U64 " bubble edges.\n", nBubbleEdges); writeStatus("AssemblyGraph()-- " F_U64 " intersecting edges (from the end of a tig to somewhere else).\n", nIntersecting); } bool reportReadGraph_reportEdge(TigVector &tigs, BestPlacement &pf, bool skipBubble, bool skipRepeat, bool &reportC, bool &report5, bool &report3) { reportC = false; report5 = false; report3 = false; if ((skipBubble == true) && (pf.isBubble == true)) return(false); if ((skipRepeat == true) && (pf.isRepeat == true)) return(false); // If the destination isunassembled, all edges are ignored. 
if ((tigs[pf.tigID] == NULL) || (tigs[pf.tigID]->_isUnassembled == true)) return(false); reportC = (tigs.inUnitig(pf.bestC.b_iid) != 0) && (tigs[ tigs.inUnitig(pf.bestC.b_iid) ]->_isUnassembled == false); report5 = (tigs.inUnitig(pf.best5.b_iid) != 0) && (tigs[ tigs.inUnitig(pf.best5.b_iid) ]->_isUnassembled == false); report3 = (tigs.inUnitig(pf.best3.b_iid) != 0) && (tigs[ tigs.inUnitig(pf.best3.b_iid) ]->_isUnassembled == false); if ((reportC == false) && (report5 == false) && (report3 == false)) return(false); return(true); } // SWIPED FROM BestOverlapGraph::reportBestEdges void AssemblyGraph::reportReadGraph(TigVector &tigs, const char *prefix, const char *label) { char N[FILENAME_MAX]; FILE *BEG = NULL; bool skipBubble = true; bool skipRepeat = true; bool skipUnassembled = true; uint64 nEdgeToUnasm = 0; writeStatus("AssemblyGraph()-- generating '%s.%s.assembly.gfa'.\n", prefix, label); snprintf(N, FILENAME_MAX, "%s.%s.assembly.gfa", prefix, label); BEG = fopen(N, "w"); if (BEG == NULL) return; fprintf(BEG, "H\tVN:Z:bogart/edges\n"); // First, figure out what sequences are used. A sequence is used if it has forward edges, // or if it is referred to by a forward edge. 
uint32 *used = new uint32 [RI->numReads() + 1]; memset(used, 0, sizeof(uint32) * (RI->numReads() + 1)); for (uint32 fi=1; finumReads() + 1; fi++) { for (uint32 pp=0; pp<_pForward[fi].size(); pp++) { BestPlacement &pf = _pForward[fi][pp]; bool reportC=false, report5=false, report3=false; if ((tigs.inUnitig(pf.bestC.b_iid) != 0) && (tigs[ tigs.inUnitig(pf.bestC.b_iid) ]->_isUnassembled == true)) nEdgeToUnasm++; if ((tigs.inUnitig(pf.best5.b_iid) != 0) && (tigs[ tigs.inUnitig(pf.best5.b_iid) ]->_isUnassembled == true)) nEdgeToUnasm++; if ((tigs.inUnitig(pf.best3.b_iid) != 0) && (tigs[ tigs.inUnitig(pf.best3.b_iid) ]->_isUnassembled == true)) nEdgeToUnasm++; if (reportReadGraph_reportEdge(tigs, pf, skipBubble, skipRepeat, reportC, report5, report3) == false) continue; used[fi] = 1; if (reportC) used[pf.bestC.b_iid] = 1; if (report5) used[pf.best5.b_iid] = 1; if (report3) used[pf.best3.b_iid] = 1; } } writeStatus("AssemblyGraph()-- Found " F_U64 " edges to unassembled contigs.\n", nEdgeToUnasm); // Then write those sequences. for (uint32 fi=1; finumReads() + 1; fi++) if (used[fi] == 1) fprintf(BEG, "S\tread%08u\t*\tLN:i:%u\n", fi, RI->readLength(fi)); delete [] used; // Now, report edges. GFA wants edges in exactly this format: // // ------------- // ------------- // // with read orientation given by +/-. Conveniently, this is what we've saved (for the edges). 
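The records written here follow GFA 1.0: an S line declares a segment (canu writes '*' for the sequence plus an LN length tag), and an L line declares a link with +/- orientations and an all-match CIGAR giving the overlap length. A hedged sketch of just the formatting, without canu's ic/iu/ib/ir annotation tags (read IDs and lengths below are made up):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

//  Format a GFA segment line roughly as reportReadGraph() does.
std::string gfaSegment(uint32_t id, uint32_t len) {
  char buf[64];
  snprintf(buf, sizeof(buf), "S\tread%08u\t*\tLN:i:%u", id, len);
  return std::string(buf);
}

//  Format a GFA link line: two oriented segments and an overlap-length CIGAR.
std::string gfaLink(uint32_t a, char aOri, uint32_t b, char bOri, uint32_t olapLen) {
  char buf[80];
  snprintf(buf, sizeof(buf), "L\tread%08u\t%c\tread%08u\t%c\t%uM", a, aOri, b, bOri, olapLen);
  return std::string(buf);
}
```

Emitting only 'used' segments first, then links, keeps the file valid for GFA viewers that expect every linked segment to be declared.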
uint64 nTig[3] = {0,0,0}; // Number of edges - both contig and unitig uint64 nCtg[3] = {0,0,0}; // Number of edges - contig only uint64 nUtg[3] = {0,0,0}; // Number of edges - unitig only (should be zero) uint64 nAsm[3] = {0,0,0}; // Number of edges - between contigs uint64 nBubble = 0; uint64 nRepeat = 0; for (uint32 fi=1; finumReads() + 1; fi++) { for (uint32 pp=0; pp<_pForward[fi].size(); pp++) { BestPlacement &pf = _pForward[fi][pp]; bool reportC=false, report5=false, report3=false; if (reportReadGraph_reportEdge(tigs, pf, skipBubble, skipRepeat, reportC, report5, report3) == false) continue; // Some statistics - number of edges of each type (in a contig, in a unitig, in both (tig), in neither (asm)) if ((pf.isContig == true) && (pf.isUnitig == true)) { if (reportC == true) nTig[0]++; if (report5 == true) nTig[1]++; if (report3 == true) nTig[2]++; } if ((pf.isContig == true) && (pf.isUnitig == false)) { if (reportC == true) nCtg[0]++; if (report5 == true) nCtg[1]++; if (report3 == true) nCtg[2]++; } if ((pf.isContig == false) && (pf.isUnitig == true)) { if (reportC == true) nUtg[0]++; if (report5 == true) nUtg[1]++; if (report3 == true) nUtg[2]++; } if ((pf.isContig == false) && (pf.isUnitig == false)) { if (reportC == true) nAsm[0]++; if (report5 == true) nAsm[1]++; if (report3 == true) nAsm[2]++; } // Finally, output the edge. if (reportC) fprintf(BEG, "C\tread%08u\t+\tread%08u\t%c\t%u\t%uM\tic:i:%d\tiu:i:%d\tib:i:%d\tir:i:%d\n", fi, pf.bestC.b_iid, pf.bestC.flipped ? '-' : '+', -pf.bestC.a_hang, RI->readLength(fi), pf.isContig, pf.isUnitig, pf.isBubble, pf.isRepeat); if (report5) fprintf(BEG, "L\tread%08u\t-\tread%08u\t%c\t%uM\tic:i:%d\tiu:i:%d\tib:i:%d\tir:i:%d\n", fi, pf.best5.b_iid, pf.best5.BEndIs3prime() ? 
'-' : '+', RI->overlapLength(fi, pf.best5.b_iid, pf.best5.a_hang, pf.best5.b_hang), pf.isContig, pf.isUnitig, pf.isBubble, pf.isRepeat); if (report3) fprintf(BEG, "L\tread%08u\t+\tread%08u\t%c\t%uM\tic:i:%d\tiu:i:%d\tib:i:%d\tir:i:%d\n", fi, pf.best3.b_iid, pf.best3.BEndIs3prime() ? '-' : '+', RI->overlapLength(fi, pf.best3.b_iid, pf.best3.a_hang, pf.best3.b_hang), pf.isContig, pf.isUnitig, pf.isBubble, pf.isRepeat); } } fclose(BEG); // And report statistics. writeStatus("AssemblyGraph()-- %8" F_U64P " bubble placements\n", nBubble); writeStatus("AssemblyGraph()-- %8" F_U64P " repeat placements\n", nRepeat); writeStatus("\n"); writeStatus("AssemblyGraph()-- Intratig edges: %8" F_U64P " contained %8" F_U64P " 5' %8" F_U64P " 3' (in both contig and unitig)\n", nTig[0], nTig[1], nTig[2]); writeStatus("AssemblyGraph()-- Contig only edges: %8" F_U64P " contained %8" F_U64P " 5' %8" F_U64P " 3'\n", nCtg[0], nCtg[1], nCtg[2]); writeStatus("AssemblyGraph()-- Unitig only edges: %8" F_U64P " contained %8" F_U64P " 5' %8" F_U64P " 3'\n", nUtg[0], nUtg[1], nUtg[2]); writeStatus("AssemblyGraph()-- Intercontig edges: %8" F_U64P " contained %8" F_U64P " 5' %8" F_U64P " 3' (in neither contig nor unitig)\n", nAsm[0], nAsm[1], nAsm[2]); } canu-1.6/src/bogart/AS_BAT_AssemblyGraph.H000066400000000000000000000074021314437614700203130ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. 
Walenz beginning on 2016-JUL-21 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef INCLUDE_AS_BAT_ASSEMBLYGRAPH #define INCLUDE_AS_BAT_ASSEMBLYGRAPH #include "AS_global.H" #include "AS_BAT_OverlapCache.H" #include "AS_BAT_BestOverlapGraph.H" // For ReadEnd #include "AS_BAT_Unitig.H" // For SeqInterval #include "AS_BAT_TigVector.H" class BestPlacement { public: BestPlacement() { tigID = UINT32_MAX; placedBgn = INT32_MIN; placedEnd = INT32_MAX; olapBgn = INT32_MIN; olapEnd = INT32_MAX; isContig = false; isUnitig = false; isBubble = false; isRepeat = false; }; ~BestPlacement() { }; uint32 tigID; // Which tig this is placed in. int32 placedBgn; // Position in the tig. Can extend negative. int32 placedEnd; // int32 olapBgn; // Position in the tig covered by overlaps. int32 olapEnd; // bool isContig; // This placement is in a contig bool isUnitig; // This placement is in a unitig bool isBubble; // This placement is to an unambiguous region in a contig bool isRepeat; // This placement is to an ambiguous region in a contig that was split BAToverlap bestC; BAToverlap best5; BAToverlap best3; }; class BestReverse { public: BestReverse() { readID = 0; placeID = 0; }; BestReverse(uint32 id, uint32 pp) { readID = id; placeID = pp; }; ~BestReverse() { }; uint32 readID; // readID we have an overlap from; Index into _pForward uint32 placeID; // index into the vector for _pForward[readID] }; class AssemblyGraph { public: AssemblyGraph(const char *prefix, double deviationRepeat, TigVector &tigs, bool tigEndsOnly = false) { buildGraph(prefix, deviationRepeat, tigs, tigEndsOnly); } ~AssemblyGraph() { delete [] _pForward; delete [] _pReverse; }; public: vector &getForward(uint32 fi) { return(_pForward[fi]); }; vector &getReverse(uint32 fi) { return(_pReverse[fi]); }; public: void buildReverseEdges(void); void 
buildGraph(const char *prefix, double deviationRepeat, TigVector &tigs, bool tigEndsOnly); void rebuildGraph(TigVector &tigs); void filterEdges(TigVector &tigs); void reportReadGraph(TigVector &tigs, const char *prefix, const char *label); private: vector *_pForward; // Where each read is placed in other tigs vector *_pReverse; // What reads overlap to me }; #endif // INCLUDE_AS_BAT_ASSEMBLYGRAPH canu-1.6/src/bogart/AS_BAT_BestOverlapGraph.C000066400000000000000000001035741314437614700207640ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_BestOverlapGraph.C * * Modifications by: * * Brian P. Walenz from 2010-NOV-23 to 2014-JAN-29 * are Copyright 2010-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-OCT-09 to 2015-AUG-14 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-MAR-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
 */

#include "AS_BAT_ReadInfo.H"
#include "AS_BAT_BestOverlapGraph.H"
#include "AS_BAT_Logging.H"

#include "AS_BAT_Unitig.H"

#include "intervalList.H"
#include "stddev.H"


void
BestOverlapGraph::removeSuspicious(const char *UNUSED(prefix)) {
  uint32  fiLimit    = RI->numReads();
  uint32  numThreads = omp_get_max_threads();
  uint32  blockSize  = (fiLimit < 100 * numThreads) ? numThreads : fiLimit / 99;

#pragma omp parallel for schedule(dynamic, blockSize)
  for (uint32 fi=1; fi <= fiLimit; fi++) {
    uint32               no  = 0;
    BAToverlap          *ovl = OC->getOverlaps(fi, no);

    bool                 verified = false;
    intervalList<int32>  IL;

    uint32               fLen = RI->readLength(fi);

    for (uint32 ii=0; (ii<no) && (verified == false); ii++) {
      if      ((ovl[ii].a_hang <= 0) && (ovl[ii].b_hang <= 0))
        //  Left side dovetail
        IL.add(0, fLen + ovl[ii].b_hang);

      else if ((ovl[ii].a_hang >= 0) && (ovl[ii].b_hang >= 0))
        //  Right side dovetail
        IL.add(ovl[ii].a_hang, fLen - ovl[ii].a_hang);

      else if ((ovl[ii].a_hang >= 0) && (ovl[ii].b_hang <= 0))
        //  I contain the other
        IL.add(ovl[ii].a_hang, fLen - ovl[ii].a_hang - ovl[ii].b_hang);

      else if ((ovl[ii].a_hang <= 0) && (ovl[ii].b_hang >= 0))
        //  I am contained and thus now perfectly good!
        verified = true;

      else
        //  Huh?  Coding error.
        assert(0);
    }

    if (verified == false) {
      IL.merge();
      verified = (IL.numberOfIntervals() == 1);
    }

    if (verified == false) {
#pragma omp critical (suspInsert)
      {
        _suspicious.insert(fi);
        _nSuspicious++;
      }
    }
  }

  writeStatus("BestOverlapGraph()-- marked " F_U64 " reads as suspicious.\n", _suspicious.size());
}


void
BestOverlapGraph::removeHighErrorBestEdges(void) {
  uint32  fiLimit    = RI->numReads();
  uint32  numThreads = omp_get_max_threads();
  uint32  blockSize  = (fiLimit < 100 * numThreads) ? numThreads : fiLimit / 99;

  stdDev<double>  edgeStats;

  //  Find the overlap for every best edge.
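removeHighErrorBestEdges derives its error cutoff as median + deviation × 1.4826 × MAD, falling back to mean + deviation × stddev when the median is effectively zero; the 1.4826 factor scales the median absolute deviation so it estimates a standard deviation under normally distributed data. A compact, self-contained sketch of the median/MAD statistic (function name is hypothetical):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

//  Compute median + dev * 1.4826 * MAD over a sample of error rates --
//  the statistic removeHighErrorBestEdges() uses as its error limit.
double madErrorLimit(std::vector<double> erates, double dev) {
  std::sort(erates.begin(), erates.end());
  double median = erates[erates.size() / 2];

  std::vector<double> absdev;
  for (double e : erates)
    absdev.push_back(fabs(e - median));

  std::sort(absdev.begin(), absdev.end());
  double mad = absdev[absdev.size() / 2];

  return median + dev * 1.4826 * mad;
}
```

Using the median/MAD instead of mean/stddev keeps a handful of very-high-error best edges from inflating the cutoff.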
double *erates = new double [fiLimit + 1 + fiLimit + 1]; double *absdev = new double [fiLimit + 1 + fiLimit + 1]; uint32 eratesLen = 0; for (uint32 fi=1; fi <= fiLimit; fi++) { BestEdgeOverlap *b5 = getBestEdgeOverlap(fi, false); BestEdgeOverlap *b3 = getBestEdgeOverlap(fi, true); if (b5->readId() != 0) edgeStats.insert(erates[eratesLen++] = b5->erate()); if (b3->readId() != 0) edgeStats.insert(erates[eratesLen++] = b3->erate()); } _mean = edgeStats.mean(); _stddev = edgeStats.stddev(); // Find the median and absolute deviations. sort(erates, erates+eratesLen); _median = erates[ eratesLen / 2 ]; for (uint32 ii=0; ii= 0.0); _mad = absdev[eratesLen/2]; delete [] absdev; delete [] erates; // Compute an error limit based on the median or absolute deviation. double Tmean = _mean + _deviationGraph * _stddev; double Tmad = _median + _deviationGraph * 1.4826 * _mad; _errorLimit = (_median > 1e-10) ? Tmad : Tmean; // The real filtering is done on the next pass through findEdges(). Here, we're just collecting statistics. uint32 oneFiltered = 0; uint32 twoFiltered = 0; for (uint32 fi=1; fi <= fiLimit; fi++) { BestEdgeOverlap *b5 = getBestEdgeOverlap(fi, false); BestEdgeOverlap *b3 = getBestEdgeOverlap(fi, true); bool b5filtered = (b5->erate() > _errorLimit); bool b3filtered = (b3->erate() > _errorLimit); if (b5filtered && b3filtered) _n2EdgeFiltered++; else if (b5filtered || b3filtered) _n1EdgeFiltered++; } writeLog("\n"); writeLog("ERROR RATES (%u samples)\n", edgeStats.size()); writeLog("-----------\n"); writeLog("mean %10.8f stddev %10.8f -> %10.8f fraction error = %10.6f%% error\n", _mean, _stddev, Tmean, 100.0 * Tmean); writeLog("median %10.8f mad %10.8f -> %10.8f fraction error = %10.6f%% error\n", _median, _mad, Tmad, 100.0 * Tmad); writeLog("\n"); } void BestOverlapGraph::removeLopsidedEdges(const char *UNUSED(prefix)) { uint32 fiLimit = RI->numReads(); uint32 numThreads = omp_get_max_threads(); uint32 blockSize = (fiLimit < 100 * numThreads) ? 
numThreads : fiLimit / 99; #pragma omp parallel for schedule(dynamic, blockSize) for (uint32 fi=1; fi <= fiLimit; fi++) { BestEdgeOverlap *this5 = getBestEdgeOverlap(fi, false); BestEdgeOverlap *this3 = getBestEdgeOverlap(fi, true); // Ignore spurs and contains...and previously detected suspicious reads. The suspicious reads // do not have best edges back to them, and it's possible to find reads B where best edge A->B // exists, yet no best edge from B exists. if ((isSuspicious(fi) == true) || // Suspicious overlap pattern (isContained(fi) == true) || // Contained read (duh!) ((this5->readId() == 0) || // Spur read (this3->readId() == 0))) continue; // If there is a huge difference in error rates between the two best overlaps, that's a little // suspicious. This kind-of worked, but it is very sensitive to the 'limit', and was only // tested on one bacteria. It will also do very bad things in metagenomics. #if 0 double this5erate = this5->erate(); double this3erate = this3->erate(); double limit = 0.01; if (fabs(this5erate - this3erate) > limit) { #pragma omp critical (suspInsert) { _suspicious.insert(fi); writeStatus("Incompatible error rates on best edges for read %u -- %.4f %.4f.\n", fi, this5erate, this3erate); #warning NOT COUNTING ERATE DIFFS //_ERateIncompatible++; } continue; } #endif // Find the overlap for this5 and this3. int32 this5ovlLen = RI->overlapLength(fi, this5->readId(), this5->ahang(), this5->bhang()); int32 this3ovlLen = RI->overlapLength(fi, this3->readId(), this3->ahang(), this3->bhang()); // Find the edges for our best overlaps. BestEdgeOverlap *that5 = getBestEdgeOverlap(this5->readId(), this5->read3p()); BestEdgeOverlap *that3 = getBestEdgeOverlap(this3->readId(), this3->read3p()); // If both point back to us, we're done. These must be symmetric, else overlapper is bonkers. 
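The lopsided-edge test below compares the reciprocal overlap lengths with a symmetric percent difference, 200·|a−b|/(a+b): it is 0 for identical lengths, insensitive to which read's overlap is listed first, and bounded by 200. A one-function sketch:

```cpp
#include <cmath>
#include <cstdlib>

//  Symmetric percent difference between two overlap lengths, as used to
//  compare this5ovlLen/that5ovlLen (and the 3' pair) in removeLopsidedEdges().
double percDiff(int a, int b) {
  return 200.0 * std::abs(a - b) / (a + b);
}
```

A read is kept only when both the 5' and 3' comparisons fall at or below 5.0.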
if ((that5->readId() == fi) && (that5->read3p() == false) && (that3->readId() == fi) && (that3->read3p() == true)) continue; // If there is an overlap to something with no overlaps out of it, that's a little suspicious. if ((that5->readId() == 0) || (that3->readId() == 0)) { writeLog("WARNING: read %u has overlap to spur! 3' overlap to read %u back to read %u 5' overlap to read %u back to read %u\n", fi, this5->readId(), that5->readId(), this3->readId(), that3->readId()); #pragma omp critical (suspInsert) _suspicious.insert(fi); continue; } // Something doesn't agree. Find those overlaps... int32 that5ovlLen = RI->overlapLength(this5->readId(), that5->readId(), that5->ahang(), that5->bhang()); int32 that3ovlLen = RI->overlapLength(this3->readId(), that3->readId(), that3->ahang(), that3->bhang()); // ...and compare. double percDiff5 = 200.0 * abs(this5ovlLen - that5ovlLen) / (this5ovlLen + that5ovlLen); double percDiff3 = 200.0 * abs(this3ovlLen - that3ovlLen) / (this3ovlLen + that3ovlLen); if ((percDiff5 <= 5.0) && // Both good, keep 'em as is. (percDiff3 <= 5.0)) { //writeLog("fi %8u -- %8u/%c' len %6u VS %8u/%c' len %6u %8.4f%% -- %8u/%c' len %6u VS %8u/%c' len %6u %8.4f%% -- ACCEPTED\n", // fi, // this5->readId(), this5->read3p() ? '3' : '5', this5ovlLen, that5->readId(), that5->read3p() ? '3' : '5', that5ovlLen, percDiff5, // this3->readId(), this3->read3p() ? '3' : '5', this3ovlLen, that3->readId(), that3->read3p() ? '3' : '5', that3ovlLen, percDiff3); continue; } // Nope, one or both of the edges are too different. Flag the read as suspicious. //writeLog("fi %8u -- %8u/%c' len %6u VS %8u/%c' len %6u %8.4f%% -- %8u/%c' len %6u VS %8u/%c' len %6u %8.4f%%\n", // fi, // this5->readId(), this5->read3p() ? '3' : '5', this5ovlLen, that5->readId(), that5->read3p() ? '3' : '5', that5ovlLen, percDiff5, // this3->readId(), this3->read3p() ? '3' : '5', this3ovlLen, that3->readId(), that3->read3p() ? 
'3' : '5', that3ovlLen, percDiff3); #pragma omp critical (suspInsert) { _suspicious.insert(fi); if ((percDiff5 > 5.0) && (percDiff3 > 5.0)) _n2EdgeIncompatible++; else _n1EdgeIncompatible++; } } } void BestOverlapGraph::removeSpurs(const char *prefix) { uint32 fiLimit = RI->numReads(); uint32 numThreads = omp_get_max_threads(); uint32 blockSize = (fiLimit < 100 * numThreads) ? numThreads : fiLimit / 99; char N[FILENAME_MAX]; snprintf(N, FILENAME_MAX, "%s.best.spurs", prefix); errno = 0; FILE *F = fopen(N, "w"); if (errno) F = NULL; _spur.clear(); for (uint32 fi=1; fi <= fiLimit; fi++) { bool spur5 = (getBestEdgeOverlap(fi, false)->readId() == 0); bool spur3 = (getBestEdgeOverlap(fi, true)->readId() == 0); if (isContained(fi)) // Contained, not a spur. continue; if ((spur5 == false) && (spur3 == false)) // Edges off of both ends. Not a spur. continue; // We've now got either a spur or a singleton. // // How do we get an edge to a singleton, which, by definition, has no edges? The one case I // looked at had different error rates for the A->B and B->A overlap, and these straddled the // error rate cutoff. Dmel had 357 edges to singletons; I didn't look at any of them. bool isSingleton = ((spur5 == true) && (spur3 == true)); if (F) fprintf(F, F_U32" %s\n", fi, (isSingleton) ? "singleton" : ((spur5) ? "5'" : "3'")); if (isSingleton) _singleton.insert(fi); else _spur.insert(fi); } writeStatus("BestOverlapGraph()-- detected " F_SIZE_T " spur reads and " F_SIZE_T " singleton reads.\n", _spur.size(), _singleton.size()); if (F) fclose(F); } void BestOverlapGraph::findEdges(void) { uint32 fiLimit = RI->numReads(); uint32 numThreads = omp_get_max_threads(); uint32 blockSize = (fiLimit < 100 * numThreads) ? 
numThreads : fiLimit / 99; memset(_bestA, 0, sizeof(BestOverlaps) * (fiLimit + 1)); memset(_scorA, 0, sizeof(BestScores) * (fiLimit + 1)); #pragma omp parallel for schedule(dynamic, blockSize) for (uint32 fi=1; fi <= fiLimit; fi++) { uint32 no = 0; BAToverlap *ovl = OC->getOverlaps(fi, no); for (uint32 ii=0; ii<no; ii++) scoreContainment(ovl[ii]); } #pragma omp parallel for schedule(dynamic, blockSize) for (uint32 fi=1; fi <= fiLimit; fi++) { uint32 no = 0; BAToverlap *ovl = OC->getOverlaps(fi, no); // Build edges out of spurs, but don't allow edges into them. This should prevent them from // being incorporated into a promiscuous unitig, but still let them be popped as bubbles (but // they shouldn't because they're spurs). for (uint32 ii=0; ii<no; ii++) if (_spur.count(ovl[ii].b_iid) == 0) scoreEdge(ovl[ii]); } } void BestOverlapGraph::removeContainedDovetails(void) { uint32 fiLimit = RI->numReads(); for (uint32 fi=1; fi <= fiLimit; fi++) { if (isContained(fi) == true) { getBestEdgeOverlap(fi, false)->clear(); getBestEdgeOverlap(fi, true) ->clear(); } } } BestOverlapGraph::BestOverlapGraph(double erateGraph, double deviationGraph, const char *prefix, bool filterSuspicious, bool filterHighError, bool filterLopsided, bool filterSpur) { writeStatus("\n"); writeStatus("BestOverlapGraph()-- allocating best edges (" F_SIZE_T "MB)\n", ((2 * sizeof(BestEdgeOverlap) * (RI->numReads() + 1)) >> 20)); _bestA = new BestOverlaps [RI->numReads() + 1]; // Cleared in findEdges() _scorA = new BestScores [RI->numReads() + 1]; _mean = erateGraph; _stddev = 0.0; _median = erateGraph; _mad = 0.0; _errorLimit = erateGraph; _nSuspicious = 0; _n1EdgeFiltered = 0; _n2EdgeFiltered = 0; _n1EdgeIncompatible = 0; _n2EdgeIncompatible = 0; _suspicious.clear(); _singleton.clear(); _bestM.clear(); _scorM.clear(); _restrict = NULL; _restrictEnabled = false; _erateGraph = erateGraph; _deviationGraph = deviationGraph; // Find initial edges, only so we can report initial statistics on the graph writeStatus("\n"); writeStatus("BestOverlapGraph()-- finding initial best edges.\n"); findEdges(); reportEdgeStatistics(prefix, "INITIAL"); // Mark reads as suspicious if they are not fully covered by overlaps. writeStatus("\n"); writeStatus("BestOverlapGraph()-- %sfiltering suspicious reads.\n", (filterSuspicious == true) ?
"" : "NOT "); if (filterSuspicious) { removeSuspicious(prefix); findEdges(); } if (logFileFlagSet(LOG_ALL_BEST_EDGES)) reportBestEdges(prefix, "best.0.initial"); // Analyze the current best edges to set a cutoff on overlap quality used for graph building. writeStatus("\n"); writeStatus("BestOverlapGraph()-- %sfiltering high error edges.\n", (filterHighError == true) ? "" : "NOT "); if (filterHighError) { removeHighErrorBestEdges(); findEdges(); } if (logFileFlagSet(LOG_ALL_BEST_EDGES)) reportBestEdges(prefix, "best.1.filtered"); // Mark reads as suspicious if the length of the best edge out is very different than the length // of the best edge that should be back to us. E.g., if readA has best edge to readB (of length // lenAB), but readB has best edge to readC (of length lenBC), and lenAB is much shorter than // lenBC, then something is wrong with readA. // // This must come before removeSpurs(). writeStatus("\n"); writeStatus("BestOverlapGraph()-- %sfiltering reads with lopsided best edges.\n", (filterLopsided == true) ? "" : "NOT "); if (filterLopsided) { removeLopsidedEdges(prefix); findEdges(); } if (logFileFlagSet(LOG_ALL_BEST_EDGES)) reportBestEdges(prefix, "best.2.cleaned"); // Mark reads as spurs, so we don't find best edges to them. writeStatus("\n"); writeStatus("BestOverlapGraph()-- %sfiltering spur reads.\n", (filterSpur == true) ? "" : "NOT "); if (filterSpur) { removeSpurs(prefix); findEdges(); } reportBestEdges(prefix, logFileFlagSet(LOG_ALL_BEST_EDGES) ? "best.3.final" : "best"); // One more pass, to find any ambiguous best edges. // Cleanup the contained reads. Why? writeStatus("\n"); writeStatus("BestOverlapGraph()-- removing best edges for contained reads.\n"); removeContainedDovetails(); // Report filtering and final statistics. 
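// Editorial sketch (not part of the original canu source): the construction sequence above is a filter-and-rebuild loop. Each enabled filter marks reads or edges, then findEdges() is rerun so best edges are rescored without the filtered reads: // findEdges(); // initial best edges // removeSuspicious(); findEdges(); // reads not fully covered by overlaps // removeHighErrorBestEdges(); findEdges(); // tighten the error-rate cutoff // removeLopsidedEdges(); findEdges(); // asymmetric best-edge lengths // removeSpurs(); findEdges(); // no best edges into spur reads // removeContainedDovetails(); // clear best edges of contained reads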
writeLog("\n"); writeLog("EDGE FILTERING\n"); writeLog("-------- ------------------------------------------\n"); writeLog("%8u reads have a suspicious overlap pattern\n", _nSuspicious); writeLog("%8u reads had edges filtered\n", _n1EdgeFiltered + _n2EdgeFiltered); writeLog(" %8u had one\n", _n1EdgeFiltered); writeLog(" %8u had two\n", _n2EdgeFiltered); writeLog("%8u reads have length incompatible edges\n", _n1EdgeIncompatible + _n2EdgeIncompatible); writeLog(" %8u have one\n", _n1EdgeIncompatible); writeLog(" %8u have two\n", _n2EdgeIncompatible); reportEdgeStatistics(prefix, "FINAL"); // Done with scoring data. delete [] _scorA; _scorA = NULL; _spur.clear(); setLogFile(prefix, NULL); } void BestOverlapGraph::reportEdgeStatistics(const char *prefix, const char *label) { uint32 fiLimit = RI->numReads(); uint32 numThreads = omp_get_max_threads(); uint32 blockSize = (fiLimit < 100 * numThreads) ? numThreads : fiLimit / 99; uint32 nContained = 0; uint32 nSingleton = 0; uint32 nSpur = 0; uint32 nSpur1Mutual = 0; uint32 nBoth = 0; uint32 nBoth1Mutual = 0; uint32 nBoth2Mutual = 0; for (uint32 fi=1; fi <= fiLimit; fi++) { BestEdgeOverlap *this5 = getBestEdgeOverlap(fi, false); BestEdgeOverlap *this3 = getBestEdgeOverlap(fi, true); // Count contained reads if (isContained(fi)) { nContained++; continue; } // Count singleton reads if ((this5->readId() == 0) && (this3->readId() == 0)) { nSingleton++; continue; } // Compute mutual bestedness bool mutual5 = false; bool mutual3 = false; if (this5->readId() != 0) { BestEdgeOverlap *that5 = getBestEdgeOverlap(this5->readId(), this5->read3p()); mutual5 = ((that5->readId() == fi) && (that5->read3p() == false)); } if (this3->readId() != 0) { BestEdgeOverlap *that3 = getBestEdgeOverlap(this3->readId(), this3->read3p()); mutual3 = ((that3->readId() == fi) && (that3->read3p() == true)); } // Compute spur, and mutual best if ((this5->readId() == 0) || (this3->readId() == 0)) { nSpur++; nSpur1Mutual += (mutual5 || mutual3) ? 
1 : 0; continue; } // Otherwise, both edges exist nBoth++; nBoth1Mutual += (mutual5 != mutual3) ? 1 : 0; nBoth2Mutual += ((mutual5 == true) && (mutual3 == true)) ? 1 : 0; } writeLog("\n"); writeLog("%s EDGES\n", label); writeLog("-------- ----------------------------------------\n"); writeLog("%8u reads are contained\n", nContained); writeLog("%8u reads have no best edges (singleton)\n", nSingleton); writeLog("%8u reads have only one best edge (spur) \n", nSpur); writeLog(" %8u are mutual best\n", nSpur1Mutual); writeLog("%8u reads have two best edges \n", nBoth); writeLog(" %8u have one mutual best edge\n", nBoth1Mutual); writeLog(" %8u have two mutual best edges\n", nBoth2Mutual); writeLog("\n"); } void BestOverlapGraph::reportBestEdges(const char *prefix, const char *label) { char N[FILENAME_MAX]; FILE *BCH = NULL; FILE *BE = NULL, *BEH = NULL, *BEG = NULL; FILE *BS = NULL; FILE *SS = NULL; // Open output files. snprintf(N, FILENAME_MAX, "%s.%s.edges", prefix, label); BE = fopen(N, "w"); snprintf(N, FILENAME_MAX, "%s.%s.singletons", prefix, label); BS = fopen(N, "w"); snprintf(N, FILENAME_MAX, "%s.%s.edges.suspicious", prefix, label); SS = fopen(N, "w"); snprintf(N, FILENAME_MAX, "%s.%s.edges.gfa", prefix, label); BEG = fopen(N, "w"); snprintf(N, FILENAME_MAX, "%s.%s.contains.histogram", prefix, label); BCH = fopen(N, "w"); snprintf(N, FILENAME_MAX, "%s.%s.edges.histogram", prefix, label); BEH = fopen(N, "w"); // Write best edges, singletons and suspicious edges. 
if ((BE) && (BS) && (SS)) { fprintf(BE, "#readId\tlibId\tbest5iid\tbest5end\tbest3iid\tbest3end\teRate5\teRate3\tbest5len\tbest3len\n"); fprintf(BS, "#readId\tlibId\n"); for (uint32 id=1; id<RI->numReads() + 1; id++) { BestEdgeOverlap *bestedge5 = getBestEdgeOverlap(id, false); BestEdgeOverlap *bestedge3 = getBestEdgeOverlap(id, true); if ((bestedge5->readId() == 0) && (bestedge3->readId() == 0) && (isContained(id) == false)) { fprintf(BS, "%u\t%u\n", id, RI->libraryIID(id)); } else if (_suspicious.count(id) > 0) { fprintf(SS, "%u\t%u\t%u\t%c'\t%u\t%c'\t%6.4f\t%6.4f\t%u\t%u%s\n", id, RI->libraryIID(id), bestedge5->readId(), bestedge5->read3p() ? '3' : '5', bestedge3->readId(), bestedge3->read3p() ? '3' : '5', AS_OVS_decodeEvalue(bestedge5->evalue()), AS_OVS_decodeEvalue(bestedge3->evalue()), (bestedge5->readId() == 0 ? 0 : RI->overlapLength(id, bestedge5->readId(), bestedge5->ahang(), bestedge5->bhang())), (bestedge3->readId() == 0 ? 0 : RI->overlapLength(id, bestedge3->readId(), bestedge3->ahang(), bestedge3->bhang())), isContained(id) ? "\tcontained" : ""); } else { fprintf(BE, "%u\t%u\t%u\t%c'\t%u\t%c'\t%6.4f\t%6.4f\t%u\t%u%s\n", id, RI->libraryIID(id), bestedge5->readId(), bestedge5->read3p() ? '3' : '5', bestedge3->readId(), bestedge3->read3p() ? '3' : '5', AS_OVS_decodeEvalue(bestedge5->evalue()), AS_OVS_decodeEvalue(bestedge3->evalue()), (bestedge5->readId() == 0 ? 0 : RI->overlapLength(id, bestedge5->readId(), bestedge5->ahang(), bestedge5->bhang())), (bestedge3->readId() == 0 ? 0 : RI->overlapLength(id, bestedge3->readId(), bestedge3->ahang(), bestedge3->bhang())), isContained(id) ? "\tcontained" : ""); } } } // Write best edge graph. if (BEG) { fprintf(BEG, "H\tVN:Z:bogart/edges\n"); // First, write the sequences used.
for (uint32 id=1; id<RI->numReads() + 1; id++) { BestEdgeOverlap *bestedge5 = getBestEdgeOverlap(id, false); BestEdgeOverlap *bestedge3 = getBestEdgeOverlap(id, true); if ((bestedge5->readId() == 0) && (bestedge3->readId() == 0) && (isContained(id) == false)) { // Do nothing, a singleton. } else if (isContained(id) == true) { // Do nothing, a contained read. } else if (_suspicious.count(id) > 0) { // Do nothing, a suspicious read. } else { // Report the read, it has best edges. fprintf(BEG, "S\tread%08u\t*\tLN:i:%u\n", id, RI->readLength(id)); } } // Now, report edges. GFA wants edges in exactly this format: // // ------------- // ------------- // // with read orientation given by +/-. Conveniently, this is what we've saved (for the edges). for (uint32 id=1; id<RI->numReads() + 1; id++) { BestEdgeOverlap *bestedge5 = getBestEdgeOverlap(id, false); BestEdgeOverlap *bestedge3 = getBestEdgeOverlap(id, true); if ((bestedge5->readId() == 0) && (bestedge3->readId() == 0) && (isContained(id) == false)) { // Do nothing, a singleton. } else if (isContained(id) == true) { // Do nothing, a contained read. } else if (_suspicious.count(id) > 0) { // Do nothing, a suspicious read. } else { if (bestedge5->readId() != 0) { int32 ahang = bestedge5->ahang(); int32 bhang = bestedge5->bhang(); int32 olaplen = RI->overlapLength(id, bestedge5->readId(), bestedge5->ahang(), bestedge5->bhang()); assert((ahang <= 0) && (bhang <= 0)); // ALL 5' edges should be this. fprintf(BEG, "L\tread%08u\t-\tread%08u\t%c\t%uM\n", id, bestedge5->readId(), bestedge5->read3p() ? '-' : '+', olaplen); } if (bestedge3->readId() != 0) { int32 ahang = bestedge3->ahang(); int32 bhang = bestedge3->bhang(); int32 olaplen = RI->overlapLength(id, bestedge3->readId(), bestedge3->ahang(), bestedge3->bhang()); assert((ahang >= 0) && (bhang >= 0)); // ALL 3' edges should be this. fprintf(BEG, "L\tread%08u\t+\tread%08u\t%c\t%uM\n", id, bestedge3->readId(), bestedge3->read3p() ?
'-' : '+', RI->overlapLength(id, bestedge3->readId(), bestedge3->ahang(), bestedge3->bhang())); } } } } // Write error rate histograms of best edges and contains. if ((BCH) && (BEH)) { double *bc = new double [RI->numReads() + 1 + RI->numReads() + 1]; double *be = new double [RI->numReads() + 1 + RI->numReads() + 1]; uint32 bcl = 0; uint32 bel = 0; for (uint32 id=1; id<RI->numReads() + 1; id++) { BestEdgeOverlap *bestedge5 = getBestEdgeOverlap(id, false); BestEdgeOverlap *bestedge3 = getBestEdgeOverlap(id, true); if (isContained(id)) { //bc[bcl++] = bestcont->erate(); #warning what is the error rate of the 'best contained' overlap? bc[bcl++] = bestedge5->erate(); bc[bcl++] = bestedge3->erate(); } else { if (bestedge5->readId() > 0) be[bel++] = bestedge5->erate(); if (bestedge3->readId() > 0) be[bel++] = bestedge3->erate(); } } sort(bc, bc+bcl); sort(be, be+bel); for (uint32 ii=0; ii<bcl; ii++) fprintf(BCH, "%.4f\n", bc[ii]); for (uint32 ii=0; ii<bel; ii++) fprintf(BEH, "%.4f\n", be[ii]); delete [] bc; delete [] be; } if (BE) fclose(BE); if (BS) fclose(BS); if (SS) fclose(SS); if (BEG) fclose(BEG); if (BCH) fclose(BCH); if (BEH) fclose(BEH); } void BestOverlapGraph::scoreContainment(BAToverlap& olap) { if (isOverlapBadQuality(olap)) // Yuck. Don't want to use this crud. return; if (isOverlapRestricted(olap)) // Whoops, don't want this overlap for this BOG return; if ((olap.a_hang == 0) && (olap.b_hang == 0) && (olap.a_iid > olap.b_iid)) // Exact! Each contains the other. Make the lower IID the container. return; if ((olap.a_hang > 0) || (olap.b_hang < 0)) // We only save if A is the contained read. return; setContained(olap.a_iid); } void BestOverlapGraph::scoreEdge(BAToverlap& olap) { bool enableLog = false; // useful for reporting this stuff only for specific reads //if ((olap.a_iid == 97202) || (olap.a_iid == 30701)) // enableLog = true; if (isOverlapBadQuality(olap)) { // Yuck. Don't want to use this crud. if ((enableLog == true) && (logFileFlagSet(LOG_OVERLAP_SCORING))) writeLog("scoreEdge()-- OVERLAP BADQ: %d %d %c hangs " F_S32 " " F_S32 " err %.3f -- bad quality\n", olap.a_iid, olap.b_iid, olap.flipped ? 'A' : 'N', olap.a_hang, olap.b_hang, olap.erate()); return; } if (isOverlapRestricted(olap)) { // Whoops, don't want this overlap for this BOG if ((enableLog == true) && (logFileFlagSet(LOG_OVERLAP_SCORING))) writeLog("scoreEdge()-- OVERLAP RESTRICT: %d %d %c hangs " F_S32 " " F_S32 " err %.3f -- restricted\n", olap.a_iid, olap.b_iid, olap.flipped ?
'A' : 'N', olap.a_hang, olap.b_hang, olap.erate()); return; } if (isSuspicious(olap.b_iid)) { // Whoops, don't want this overlap for this BOG if ((enableLog == true) && (logFileFlagSet(LOG_OVERLAP_SCORING))) writeLog("scoreEdge()-- OVERLAP SUSP: %d %d %c hangs " F_S32 " " F_S32 " err %.3f -- suspicious\n", olap.a_iid, olap.b_iid, olap.flipped ? 'A' : 'N', olap.a_hang, olap.b_hang, olap.erate()); return; } if (((olap.a_hang >= 0) && (olap.b_hang <= 0)) || ((olap.a_hang <= 0) && (olap.b_hang >= 0))) { // Skip containment overlaps. if ((enableLog == true) && (logFileFlagSet(LOG_OVERLAP_SCORING))) writeLog("scoreEdge()-- OVERLAP CONT: %d %d %c hangs " F_S32 " " F_S32 " err %.3f -- container read\n", olap.a_iid, olap.b_iid, olap.flipped ? 'A' : 'N', olap.a_hang, olap.b_hang, olap.erate()); return; } if (isContained(olap.b_iid) == true) { // Skip overlaps to contained reads (allow scoring of best edges from contained reads). if ((enableLog == true) && (logFileFlagSet(LOG_OVERLAP_SCORING))) writeLog("scoreEdge()-- OVERLAP CONT: %d %d %c hangs " F_S32 " " F_S32 " err %.3f -- contained read\n", olap.a_iid, olap.b_iid, olap.flipped ? 'A' : 'N', olap.a_hang, olap.b_hang, olap.erate()); return; } uint64 newScr = scoreOverlap(olap); bool a3p = olap.AEndIs3prime(); BestEdgeOverlap *best = getBestEdgeOverlap(olap.a_iid, a3p); uint64 &score = (a3p) ? (best3score(olap.a_iid)) : (best5score(olap.a_iid)); assert(newScr > 0); if (newScr <= score) { if ((enableLog == true) && (logFileFlagSet(LOG_OVERLAP_SCORING))) writeLog("scoreEdge()-- OVERLAP GOOD: %d %d %c hangs " F_S32 " " F_S32 " err %.3f -- no better than best\n", olap.a_iid, olap.b_iid, olap.flipped ? 'A' : 'N', olap.a_hang, olap.b_hang, olap.erate()); return; } best->set(olap); score = newScr; if ((enableLog == true) && (logFileFlagSet(LOG_OVERLAP_SCORING))) writeLog("scoreEdge()-- OVERLAP BEST: %d %d %c hangs " F_S32 " " F_S32 " err %.3f -- NOW BEST\n", olap.a_iid, olap.b_iid, olap.flipped ? 
'A' : 'N', olap.a_hang, olap.b_hang, olap.erate()); } bool BestOverlapGraph::isOverlapBadQuality(BAToverlap& olap) { bool enableLog = false; // useful for reporting this stuff only for specific reads //if ((olap.a_iid == 97202) || (olap.a_iid == 30701)) // enableLog = true; // The overlap is bad if it involves deleted reads. Shouldn't happen in a normal // assembly, but sometimes users want to delete reads after overlaps are generated. if ((RI->readLength(olap.a_iid) == 0) || (RI->readLength(olap.b_iid) == 0)) { olap.filtered = true; return(true); } // The overlap is GOOD (false == not bad) if the error rate is below the allowed erate. // Initially, this is just the erate passed in. After the first round of finding edges, // it is reset to the mean and stddev of selected best edges. if (olap.erate() <= _errorLimit) { if ((enableLog == true) && (logFileFlagSet(LOG_OVERLAP_SCORING))) writeLog("isOverlapBadQuality()-- OVERLAP GOOD: %d %d %c hangs " F_S32 " " F_S32 " err %.3f\n", olap.a_iid, olap.b_iid, olap.flipped ? 'A' : 'N', olap.a_hang, olap.b_hang, olap.erate()); return(false); } // CA8.3 would further compute the (expected) number of errors in the alignment, and compare // against another limit. This was to allow very short overlaps where one error would push the // error rate above a few percent. canu doesn't do short overlaps. if ((enableLog == true) && (logFileFlagSet(LOG_OVERLAP_SCORING))) writeLog("isOverlapBadQuality()-- OVERLAP REJECTED: %d %d %c hangs " F_S32 " " F_S32 " err %.3f\n", olap.a_iid, olap.b_iid, olap.flipped ? 'A' : 'N', olap.a_hang, olap.b_hang, olap.erate()); olap.filtered = true; return(true); } // If no restrictions are known, this overlap is useful if both reads are not in a unitig // already. Otherwise, we are restricted to just a specific set of reads (usually a single // unitig and all the mated reads). The overlap is useful if both reads are in the set.
// bool BestOverlapGraph::isOverlapRestricted(const BAToverlap &olap) { if (_restrictEnabled == false) return(false); assert(_restrict != NULL); if ((_restrict->count(olap.a_iid) != 0) && (_restrict->count(olap.b_iid) != 0)) return(false); else return(true); } uint64 BestOverlapGraph::scoreOverlap(BAToverlap& olap) { uint64 leng = 0; uint64 rate = AS_MAX_EVALUE - olap.evalue; assert(olap.evalue <= AS_MAX_EVALUE); assert(rate <= AS_MAX_EVALUE); // Containments - the length of the overlaps are all the same. We return the quality. // if (((olap.a_hang >= 0) && (olap.b_hang <= 0)) || ((olap.a_hang <= 0) && (olap.b_hang >= 0))) return(rate); // Dovetails - the length of the overlap is the score, but we bias towards lower error. // (again, shift AFTER assigning to avoid overflows) assert((olap.a_hang < 0) == (olap.b_hang < 0)); // Compute the length of the overlap, as either the official overlap length that currently // takes into account both reads, or as the number of aligned bases on the A read. #if 0 leng = RI->overlapLength(olap.a_iid, olap.b_iid, olap.a_hang, olap.b_hang); #endif if (olap.a_hang > 0) leng = RI->readLength(olap.a_iid) - olap.a_hang; else leng = RI->readLength(olap.a_iid) + olap.b_hang; // Convert the length into an expected number of matches. #if 0 assert(olap.erate() <= 1.0); leng -= leng * olap.erate(); #endif // And finally shift it to the correct place in the word. leng <<= AS_MAX_EVALUE_BITS; return(leng | rate); } canu-1.6/src/bogart/AS_BAT_BestOverlapGraph.H000066400000000000000000000206061314437614700207630ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_BestOverlapGraph.H * * Modifications by: * * Brian P. Walenz from 2010-NOV-23 to 2014-JAN-29 * are Copyright 2010-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-OCT-21 to 2015-JUN-03 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-MAR-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #ifndef INCLUDE_AS_BAT_BESTOVERLAPGRAPH #define INCLUDE_AS_BAT_BESTOVERLAPGRAPH #include "AS_global.H" #include "AS_BAT_OverlapCache.H" class ReadEnd { public: ReadEnd() { _id = 0; _e3p = false; }; ReadEnd(uint32 id, bool e3p) { _id = id; _e3p = e3p; }; uint32 readId(void) const { return(_id); }; bool read3p(void) const { return(_e3p == true); }; bool read5p(void) const { return(_e3p == false); }; bool operator==(ReadEnd const that) const { return((readId() == that.readId()) && (read3p() == that.read3p())); }; bool operator!=(ReadEnd const that) const { return((readId() != that.readId()) || (read3p() != that.read3p())); }; bool operator<(ReadEnd const that) const { if (readId() != that.readId()) return readId() < that.readId(); else return read3p() < that.read3p(); }; private: uint32 _id:31; uint32 _e3p:1; }; // Stores an overlap from an 'a' read (implied by the index into the array of best edges) to a 'b' // read. The hangs are relative to the 'a' read - just as a normal overlap would be. 
// class BestEdgeOverlap { public: BestEdgeOverlap() { clear(); }; BestEdgeOverlap(BAToverlap const &ovl) { set(ovl); }; ~BestEdgeOverlap() { }; void clear(void) { _id = 0; _e3p = 0; _ahang = 0; _bhang = 0; _evalue = 0; }; void set(BAToverlap const &olap) { _id = olap.b_iid; if (((olap.a_hang <= 0) && (olap.b_hang >= 0)) || // If contained, _e3p just means ((olap.a_hang >= 0) && (olap.b_hang <= 0))) // the other read is flipped _e3p = olap.flipped; else // Otherwise, means the _e3p = olap.BEndIs3prime(); // olap is to the 3' end _ahang = olap.a_hang; _bhang = olap.b_hang; _evalue = olap.evalue; }; void set(uint32 id, bool e3p, int32 ahang, int32 bhang, uint32 evalue) { _id = id; _e3p = e3p; _ahang = ahang; _bhang = bhang; _evalue = evalue; }; uint32 readId(void) const { return(_id); }; bool read3p(void) const { return(_e3p == true); }; bool read5p(void) const { return(_e3p == false); }; int32 ahang(void) const { return(_ahang); }; int32 bhang(void) const { return(_bhang); }; uint32 evalue(void) const { return(_evalue); }; double erate(void) const { return(AS_OVS_decodeEvalue(_evalue)); }; private: uint32 _id; uint64 _e3p : 1; // Overlap with the 3' end of that read, or flipped contain int64 _ahang : AS_MAX_READLEN_BITS+1; int64 _bhang : AS_MAX_READLEN_BITS+1; uint64 _evalue : AS_MAX_EVALUE_BITS; }; #if (1 + AS_MAX_READLEN_BITS + 1 + AS_MAX_READLEN_BITS + 1 + AS_MAX_EVALUE_BITS > 64) #error not enough bits to store overlaps. decrease AS_MAX_EVALUE_BITS or AS_MAX_READLEN_BITS. 
#endif class BestOverlaps { public: BestEdgeOverlap _best5; BestEdgeOverlap _best3; uint32 _isC; }; class BestScores { public: BestScores() { _best5score = 0; _best3score = 0; _isC = 0; }; uint64 _best5score; uint64 _best3score; uint32 _isC; }; class BestOverlapGraph { private: void removeSuspicious(const char *prefix); void removeSpurs(const char *prefix); void removeLopsidedEdges(const char *prefix); void findEdges(void); void removeHighErrorBestEdges(void); void removeContainedDovetails(void); public: BestOverlapGraph(double erateGraph, double deviationGraph, const char *prefix, bool filterSuspicious, bool filterHighError, bool filterLopsided, bool filterSpur); ~BestOverlapGraph() { delete [] _bestA; delete [] _scorA; }; // Given a read UINT32 and which end, returns pointer to // BestOverlap node. BestEdgeOverlap *getBestEdgeOverlap(uint32 readid, bool threePrime) { if (_bestA) return((threePrime) ? (&_bestA[readid]._best3) : (&_bestA[readid]._best5)); return((threePrime) ? (&_bestM[readid]._best3) : (&_bestM[readid]._best5)); }; // given a ReadEnd sets it to the next ReadEnd after following the // best edge ReadEnd followOverlap(ReadEnd end) { if (end.readId() == 0) return(ReadEnd()); BestEdgeOverlap *edge = getBestEdgeOverlap(end.readId(), end.read3p()); return(ReadEnd(edge->readId(), !edge->read3p())); }; void setContained(const uint32 readid) { if (_bestA) _bestA[readid]._isC = true; else _bestM[readid]._isC = true; }; bool isContained(const uint32 readid) { if (_bestA) return(_bestA[readid]._isC); return(_bestM[readid]._isC); }; bool isSuspicious(const uint32 readid) { return(_suspicious.count(readid) > 0); }; void reportEdgeStatistics(const char *prefix, const char *label); void reportBestEdges(const char *prefix, const char *label); public: bool isOverlapBadQuality(BAToverlap& olap); // Used in repeat detection private: uint64 scoreOverlap(BAToverlap& olap); private: void scoreContainment(BAToverlap& olap); void scoreEdge(BAToverlap& olap); private: 
uint64 &best5score(uint32 id) { if (_restrictEnabled == false) return(_scorA[id]._best5score); return(_scorM[id]._best5score); }; uint64 &best3score(uint32 id) { if (_restrictEnabled == false) return(_scorA[id]._best3score); return(_scorM[id]._best3score); }; private: BestOverlaps *_bestA; BestScores *_scorA; double _mean; double _stddev; double _median; double _mad; uint32 _nSuspicious; // Stats for output uint32 _n1EdgeFiltered; uint32 _n2EdgeFiltered; uint32 _n1EdgeIncompatible; uint32 _n2EdgeIncompatible; set<uint32> _suspicious; set<uint32> _singleton; set<uint32> _spur; map<uint32, BestOverlaps> _bestM; map<uint32, BestScores> _scorM; // These restrict the best overlap graph to a set of reads, instead of all reads. // Currently (Aug 2016) unused. There used to be a constructor that would take // a set(uint32) of reads we cared about, but it was quite stale and was removed. private: bool isOverlapRestricted(const BAToverlap &olap); private: set<uint32> *_restrict; bool _restrictEnabled; public: double _erateGraph; double _deviationGraph; private: double _errorLimit; }; //BestOverlapGraph extern BestOverlapGraph *OG; #endif // INCLUDE_AS_BAT_BESTOVERLAPGRAPH canu-1.6/src/bogart/AS_BAT_ChunkGraph.C000066400000000000000000000170611314437614700176010ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_ChunkGraph.C * * Modifications by: * * Brian P. Walenz from 2010-NOV-23 to 2013-AUG-01 * are Copyright 2010,2012-2013 J.
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-19 to 2014-DEC-22 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_BAT_ReadInfo.H" #include "AS_BAT_BestOverlapGraph.H" #include "AS_BAT_ChunkGraph.H" #include "AS_BAT_Logging.H" ChunkGraph::ChunkGraph(const char *prefix) { char N[FILENAME_MAX]; snprintf(N, FILENAME_MAX, "%s.chunkGraph.log", prefix); errno = 0; _chunkLog = (logFileFlagSet(LOG_CHUNK_GRAPH)) ? fopen(N, "w") : NULL; if (errno) _chunkLog = NULL; _maxRead = RI->numReads(); _restrict = NULL; _pathLen = new uint32 [_maxRead * 2 + 2]; _chunkLength = new ChunkLength [_maxRead]; _chunkLengthIter = 0; memset(_pathLen, 0, sizeof(uint32) * (_maxRead * 2 + 2)); memset(_chunkLength, 0, sizeof(ChunkLength) * (_maxRead)); for (uint32 fid=1; fid <= _maxRead; fid++) { if (OG->isContained(fid)) { if (_chunkLog) fprintf(_chunkLog, "read %u contained\n", fid); continue; } if (OG->isSuspicious(fid)) { if (_chunkLog) fprintf(_chunkLog, "read %u suspicious\n", fid); continue; } uint32 l5 = countFullWidth(ReadEnd(fid, false)); uint32 l3 = countFullWidth(ReadEnd(fid, true)); _chunkLength[fid-1].readId = fid; _chunkLength[fid-1].cnt = l5 + l3; } if (_chunkLog) fclose(_chunkLog); delete [] _pathLen; _pathLen = NULL; std::sort(_chunkLength, _chunkLength + _maxRead); } ChunkGraph::ChunkGraph(set<uint32> *restrict) { _chunkLog = NULL; _maxRead = 0; _restrict = restrict; for (set<uint32>::iterator it=_restrict->begin(); it != _restrict->end(); it++) _idMap[*it] = _maxRead++; _pathLen = new uint32 [_maxRead * 2 + 2]; _chunkLength = new ChunkLength [_maxRead]; _chunkLengthIter = 0; memset(_pathLen, 0,
sizeof(uint32) * (_maxRead * 2 + 2)); memset(_chunkLength, 0, sizeof(ChunkLength) * (_maxRead)); for (set<uint32>::iterator it=_restrict->begin(); it != _restrict->end(); it++) { uint32 fid = *it; // Actual read ID uint32 fit = _idMap[fid]; // Local array index if (OG->isContained(fid)) continue; _chunkLength[fit].readId = fid; _chunkLength[fit].cnt = (countFullWidth(ReadEnd(fid, false)) + countFullWidth(ReadEnd(fid, true))); } delete [] _pathLen; _pathLen = NULL; std::sort(_chunkLength, _chunkLength + _maxRead); } uint64 ChunkGraph::getIndex(ReadEnd e) { if (_restrict == NULL) return(e.readId() * 2 + e.read3p()); return(_idMap[e.readId()] * 2 + e.read3p()); } uint32 ChunkGraph::countFullWidth(ReadEnd firstEnd) { uint64 firstIdx = getIndex(firstEnd); assert(firstIdx < _maxRead * 2 + 2); if (_pathLen[firstIdx] > 0) { if (_chunkLog) fprintf(_chunkLog, "path from %d,%d'(length=%d)\n", firstEnd.readId(), (firstEnd.read3p()) ? 3 : 5, _pathLen[firstIdx]); return _pathLen[firstIdx]; } uint32 length = 0; std::set<ReadEnd> seen; ReadEnd lastEnd = firstEnd; uint64 lastIdx = firstIdx; // Until we run off the chain, or we hit a read with a known length, compute the length FROM // THE START. // while ((lastIdx != 0) && (_pathLen[lastIdx] == 0)) { seen.insert(lastEnd); _pathLen[lastIdx] = ++length; // Follow the path of lastEnd lastEnd = OG->followOverlap(lastEnd); lastIdx = getIndex(lastEnd); } // Check why we stopped. Three cases: // // 1) We ran out of best edges to follow -- lastEnd.readId() == 0 // 2) We encountered a read with known length -- _pathLen[lastEnd.index()] > 0 // 3) We encountered a self-loop (same condition as case 2) // // To distinguish case 2 and 3, we keep a set<> of the reads we've seen in this construction. // If 'lastEnd' is in that set, then we're case 3. If so, adjust every node in the cycle to have // the same length, the length of the cycle itself. // // 'lastEnd' and 'index' are the first read in the cycle; we've seen this one before.
// if (lastEnd.readId() == 0) { // Case 1. Do nothing. ; } else if (seen.find(lastEnd) != seen.end()) { // Case 3, a cycle. uint32 cycleLen = length - _pathLen[lastIdx] + 1; ReadEnd currEnd = lastEnd; uint64 currIdx = lastIdx; do { _pathLen[currIdx] = cycleLen; currEnd = OG->followOverlap(currEnd); currIdx = getIndex(currEnd); } while (lastEnd != currEnd); } else { // Case 2, an existing path. length += _pathLen[lastIdx]; } // Our return value is now whatever count we're at. uint32 lengthMax = length; // Traverse again, converting "path length from the start" into "path length from the end". Any // cycle has had its length set correctly already, and we stop at either the start of the cycle, // or at the start of any existing path. // ReadEnd currEnd = firstEnd; uint64 currIdx = firstIdx; while (currEnd != lastEnd) { _pathLen[currIdx] = length--; currEnd = OG->followOverlap(currEnd); currIdx = getIndex(currEnd); } if (lengthMax != _pathLen[firstIdx]) { writeStatus("chunkGraph()-- ERROR: lengthMax %d _pathLen[] %d\n", lengthMax, _pathLen[firstIdx]); flushLog(); } assert(lengthMax == _pathLen[firstIdx]); if (logFileFlagSet(LOG_CHUNK_GRAPH)) { seen.clear(); currEnd = firstEnd; currIdx = firstIdx; if (_chunkLog) fprintf(_chunkLog, "path from %d,%d'(length=%d):", firstEnd.readId(), (firstEnd.read3p()) ? 3 : 5, _pathLen[firstIdx]); while ((currEnd.readId() != 0) && (seen.find(currEnd) == seen.end())) { seen.insert(currEnd); if ((_chunkLog) && (currEnd == lastEnd)) fprintf(_chunkLog, " LAST"); if (_chunkLog) fprintf(_chunkLog, " %d,%d'(%d)", currEnd.readId(), (currEnd.read3p()) ? 3 : 5, _pathLen[currIdx]); currEnd = OG->followOverlap(currEnd); currIdx = getIndex(currEnd); } if ((_chunkLog) && (seen.find(currEnd) != seen.end())) fprintf(_chunkLog, " CYCLE %d,%d'(%d)", currEnd.readId(), (currEnd.read3p()) ? 
3 : 5, _pathLen[currIdx]); if (_chunkLog) fprintf(_chunkLog, "\n"); } return(_pathLen[firstIdx]); } canu-1.6/src/bogart/AS_BAT_ChunkGraph.H000066400000000000000000000050441314437614700176040ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_ChunkGraph.H * * Modifications by: * * Brian P. Walenz from 2010-NOV-23 to 2013-AUG-01 * are Copyright 2010,2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-19 to 2014-DEC-22 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
 */

#ifndef INCLUDE_AS_BAT_CHUNKGRAPH
#define INCLUDE_AS_BAT_CHUNKGRAPH

#include "AS_global.H"

#include <map>
#include <set>

using namespace std;

class BestOverlapGraph;

class ChunkLength {
public:
  uint32  readId;
  uint32  cnt;

  bool operator<(ChunkLength const that) const {
    if (cnt == that.cnt)
      return(readId < that.readId);
    return(cnt > that.cnt);
  };
};



class ChunkGraph {
public:
  ChunkGraph(const char *prefix);
  ChunkGraph(set<uint32> *restrict);
  ~ChunkGraph(void) {
    delete [] _chunkLength;
  };

  uint32  nextReadByChunkLength(void) {
    if (_chunkLengthIter >= _maxRead)
      return(0);

    return(_chunkLength[_chunkLengthIter++].readId);
  };

private:
  uint64        getIndex(ReadEnd e);
  uint32        countFullWidth(ReadEnd firstEnd);

  FILE         *_chunkLog;

  uint64        _maxRead;          //  The usual case, for a chunk graph of all reads.

  ChunkLength  *_chunkLength;
  uint32        _chunkLengthIter;

  uint32       *_pathLen;

  //  For a chunk graph of a single unitig plus some extra reads.
  //  This maps the uint32 to an index in the arrays above.

  map<uint32, uint32>   _idMap;
  set<uint32>          *_restrict;
};


extern ChunkGraph *CG;

#endif  //  INCLUDE_AS_BAT_CHUNKGRAPH
canu-1.6/src/bogart/AS_BAT_CreateUnitigs.C000066400000000000000000000677661314437614700203300ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P.
Walenz beginning on 2016-OCT-03
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#include "AS_BAT_ReadInfo.H"
#include "AS_BAT_OverlapCache.H"
#include "AS_BAT_BestOverlapGraph.H"
#include "AS_BAT_AssemblyGraph.H"
#include "AS_BAT_Logging.H"
#include "AS_BAT_Unitig.H"
#include "AS_BAT_TigVector.H"
#include "AS_BAT_PlaceReadUsingOverlaps.H"
#include "AS_BAT_CreateUnitigs.H"


//  Break at a specific position.  In converting to unitigs, the position
//  is the end of a read with an intersection.
//
//    _bgn == true  -> reads that begin at or after position are in the region
//    _bgn == false -> reads that end before position are in the region

class breakPointEnd {
public:
  breakPointEnd(uint32 tigID, uint32 pos, bool bgn) {
    _tigID = tigID;
    _pos   = pos;
    _bgn   = bgn;
  };
  ~breakPointEnd() {
  };

  bool operator<(breakPointEnd const &that) const {
    uint64  a =      _tigID;  a <<= 32;  a |=      _pos;  a <<= 1;  a |=      _bgn;  //  Because _tigID is 32-bit
    uint64  b = that._tigID;  b <<= 32;  b |= that._pos;  b <<= 1;  b |= that._bgn;

    return(a < b);
  };

  bool operator==(breakPointEnd const &that) const {
    uint64  a =      _tigID;  a <<= 32;  a |=      _pos;  a <<= 1;  a |=      _bgn;  //  Because _tigID is 32-bit
    uint64  b = that._tigID;  b <<= 32;  b |= that._pos;  b <<= 1;  b |= that._bgn;

    return(a == b);
  };

  uint32  _tigID;
  uint32  _pos;
  bool    _bgn;
};



Unitig *
copyTig(TigVector &tigs, Unitig *oldtig) {
  Unitig  *newtig = tigs.newUnitig(false);

  newtig->_isUnassembled = oldtig->_isUnassembled;
  newtig->_isRepeat      = oldtig->_isRepeat;

  for (uint32 fi=0; fi<oldtig->ufpath.size(); fi++)
    newtig->addRead(oldtig->ufpath[fi], 0, false);

  return(newtig);
}



//  Split a tig based on read ends.
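The breakPointEnd comparators above pack (tigID, pos, bgn) into a single 64-bit key so one integer compare orders breaks by tig, then position, then end flag. A minimal sketch of that packing, with a hypothetical free-function name standing in for the member operators:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the key used by breakPointEnd::operator< and operator==:
// tig id in the high bits, then the position, then the bgn flag.
uint64_t packBreakKey(uint32_t tigID, uint32_t pos, bool bgn) {
  uint64_t a = tigID;
  a <<= 32;  a |= pos;   // tig id above position: sort by tig first
  a <<= 1;   a |= bgn;   // bgn flag as the final tie-breaker
  return a;
}
```

Sorting breakPointEnd objects therefore groups all of one tig's breaks together, which is what the break-walking loop in createUnitigs() relies on.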
uint32
splitTig(TigVector              &tigs,
         Unitig                 *tig,
         vector<breakPointEnd>  &BP,
         Unitig                **newTigs,
         int32                  *lowCoord,
         bool                    doMove) {

  writeLog("\n");
  writeLog("splitTig()-- processing tig %u\n", tig->id());

  //  The first call is with doMove = false.  This call just figures out how many new tigs are
  //  created.  We use nMoved to count if a new tig is made for a break point.

  uint32  *nMoved = NULL;

  if (doMove == false)
    allocateArray(nMoved, BP.size() + 2);

  //  The second call is with doMove = true.  This does the actual moving.

  if (doMove == true)
    for (uint32 tt=0; tt < BP.size() + 2; tt++) {
      newTigs[tt]  = NULL;
      lowCoord[tt] = INT32_MAX;
    }

  if (doMove == true)
    for (uint32 tt=0; tt < BP.size() - 1; tt++)
      writeLog("splitTig()-- piece %2u from %8u %c to %8u %c\n",
               tt,
               BP[tt  ]._pos, BP[tt  ]._bgn ? 't' : 'f',
               BP[tt+1]._pos, BP[tt+1]._bgn ? 't' : 'f');

  for (uint32 fi=0; fi<tig->ufpath.size(); fi++) {
    ufNode  &read = tig->ufpath[fi];
    uint32   lo   = read.position.min();
    uint32   hi   = read.position.max();

    //writeLog("splitTig()-- processing read #%u ident %u pos %u-%u\n", fi, read.ident, lo, hi);

    //  Find the intervals the end points of the read fall into.  Suppose we're trying to place
    //  the long read.  It begins in piece 1 and ends in piece 6.
    //
    //
    //   [----1---][----3----]---4---[--5---]------6-----]   Piece and boundary condition
    //   ------
    //   --------------------------------------
    //        -----
    //          ------
    //              ------
    //                  ----
    //                       -----
    //                           ----------
    //
    //  The long read can not go in piece 1, as it would span the end boundary.  Piece 2 is
    //  of size zero between pieces 1 and 3, and we can place the read there.  Or, we can place
    //  it in piece 6 (we prefer piece 6).

    uint32  bgnBP = UINT32_MAX;
    uint32  endBP = UINT32_MAX;
    uint32  finBP = UINT32_MAX;

    //  Find the pieces the end points are in.

    for (uint32 tt=0; tt < BP.size()-1; tt++) {
      uint32  p  = BP[tt  ]._pos;   bool  pb = BP[tt  ]._bgn;
      uint32  n  = BP[tt+1]._pos;   bool  nb = BP[tt+1]._bgn;

      if ((p <= lo) && (lo < n))    //  If bgn == true  -- p == lo is in this region
        bgnBP = tt;

      if ((p < hi) && (hi <= n))    //  If bgn == false -- hi == n is in this region
        endBP = tt;
    }

    //  If both pieces are the same, we're done.

    if (bgnBP == endBP) {
      finBP = bgnBP;
    }

    //  If the next BP is a bgn boundary, we can still place the read in this piece.  It'll extend
    //  off the end, but we don't care.

    else if (BP[bgnBP+1]._bgn == true) {
      finBP = bgnBP;
    }

    //  If not, the next boundary is an end point, and we cannot place the read in this piece.
    //  If the endBP piece doesn't have restrictions on the begin, we can place the read there.

    else if (BP[endBP]._bgn == false) {
      finBP = endBP;
    }

    //  Well, shucks.  No place to put the read.  Search for an unbounded region between bgnBP and
    //  endBP.  There must be one, because bgnBP ends with a bgn=false boundary, and endBP begins
    //  with a bgn=true boundary.  If there are no intermediate boundaries, we can place the read in
    //  the middle.  If there are intermediate boundaries, we'll still have some piece that is
    //  unbounded.

    else {
      for (finBP=bgnBP+1; finBP < endBP; finBP++) {
        if ((BP[finBP  ]._bgn == false) &&
            (BP[finBP+1]._bgn == true))
          break;
      }

      if (finBP == endBP)
        writeLog("splitTig()-- failed to place read %u %u-%u in a region.  found bgn=%u and end=%u\n",
                 read.ident, read.position.bgn, read.position.end, bgnBP, endBP);
      assert(finBP != endBP);
    }

    //  Make a new tig, if needed

    if ((doMove == true) && (newTigs[finBP] == NULL)) {
      writeLog("splitTig()-- new tig %u (id=%u) at read %u %u-%u\n",
               tigs.size(), finBP, read.ident, read.position.min(), read.position.max());
      lowCoord[finBP] = read.position.min();
      newTigs[finBP]  = tigs.newUnitig(false);
    }

    //  Now move the read, or account for moving it.
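The piece-selection logic above can be sketched in isolation. This is a minimal, self-contained version under assumed names (`BPE` and `findPiece` are hypothetical stand-ins for breakPointEnd and the inline logic): given sorted break points and a read spanning [lo,hi), pick the piece the read can live in.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct BPE {            // hypothetical stand-in for breakPointEnd
  uint32_t pos;
  bool     bgn;
};

uint32_t findPiece(std::vector<BPE> const &BP, uint32_t lo, uint32_t hi) {
  uint32_t bgnBP = UINT32_MAX;
  uint32_t endBP = UINT32_MAX;

  for (uint32_t tt = 0; tt < BP.size() - 1; tt++) {
    if ((BP[tt].pos <= lo) && (lo < BP[tt + 1].pos))   // piece holding the start
      bgnBP = tt;
    if ((BP[tt].pos <  hi) && (hi <= BP[tt + 1].pos))  // piece holding the end
      endBP = tt;
  }

  if (bgnBP == endBP)              // read is wholly inside one piece
    return bgnBP;
  if (BP[bgnBP + 1].bgn == true)   // read may extend out past a bgn boundary
    return bgnBP;
  if (BP[endBP].bgn == false)      // read may extend back before an end boundary
    return endBP;

  for (uint32_t tt = bgnBP + 1; tt < endBP; tt++)           // otherwise find an
    if ((BP[tt].bgn == false) && (BP[tt + 1].bgn == true))  // unbounded piece between
      return tt;

  return UINT32_MAX;               // no valid piece; splitTig() asserts here
}
```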
    if (doMove) {
      writeLog("splitTig()-- Move read %8u %8u-%-8u to piece %2u tig %6u\n",
               read.ident, read.position.bgn, read.position.end, finBP, newTigs[finBP]->id());
      newTigs[finBP]->addRead(read, -lowCoord[finBP], false);
    }

    else {
      //writeLog("splitTig()-- Move read %u %u-%u to piece %u (pos=%u)\n",
      //         read.ident, read.position.bgn, read.position.end, finBP, BP[finBP]._pos);
      nMoved[finBP]++;
    }
  }

  //  Return the number of tigs created.

  uint32  nTigsCreated = 0;

  if (doMove == false) {
    for (uint32 ii=0; ii<BP.size()+2; ii++)
      if (nMoved[ii] > 0)
        nTigsCreated++;

    delete [] nMoved;
  }

  return(nTigsCreated);
}



static
uint32
checkReadContained(overlapPlacement &op, Unitig *tgB) {

  for (uint32 ii=op.tigFidx; ii<=op.tigLidx; ii++) {
    if (isContained(op.verified, tgB->ufpath[ii].position))
      return(ii + 1);
  }

  return(0);
}



//  Decide which read, and which end, we're overlapping.  We know:
//
//    verified tells us the positions covered with overlaps and the
//    orientation of the aligned read
//
//    isFirst and rdAfwd tell if the invading tig is flopping free to the left
//    or right of this location
//
//                                  break here
//                                  v
//    invaded tig  ----------------------------------------------
//                      ------------>
//                           ------->
//                 <------------------   (ignore these two container reads)
//                     <------------     (in reality, this wouldn't be split)
//                             |    |
//                          (overlap)    (verified.isForward() == false)
//                             |    |
//                             <--------
//                               -----------
//                                  --------------->
//
//    isLow is true if this coordinate is the start of the read placement
//
void
findEnd(overlapPlacement &op, bool rdAfwd, bool isFirst, bool &isLow, int32 &coord) {

  if (((isFirst == true)  && (rdAfwd == true)  && (op.verified.isForward() == true))  ||
      ((isFirst == true)  && (rdAfwd == false) && (op.verified.isForward() == false)) ||
      ((isFirst == false) && (rdAfwd == false) && (op.verified.isForward() == true))  ||   //  rdAfwd is opposite what reality is,
      ((isFirst == false) && (rdAfwd == true)  && (op.verified.isForward() == false))) {   //  because we've flipped the tig outside here
    isLow = false;
    coord = INT32_MIN;
  }
  else {
    isLow = true;
    coord = INT32_MAX;
  }
}



static
uint32
checkRead(Unitig                  *tgA,
          ufNode                  *rdA,
          vector<overlapPlacement> &rdAplacements,
          TigVector               &contigs,
          vector<breakPointEnd>   &breakpoints,
          uint32                   minOverlap,
          uint32                   maxPlacements,
          bool                     isFirst) {
  bool  verbose = true;

  //  To support maxPlacements, we first find all the breaks as we've done forever, then simply
  //  ignore them if there are too many.

  vector<breakPointEnd>  breaks;

  for (uint32 pp=0; pp<rdAplacements.size(); pp++) {
    overlapPlacement  &op  = rdAplacements[pp];
    Unitig            *tgB = contigs[op.tigID];

    bool  toUnassembled = false;
    bool  toSelf        = false;
    bool  expected5     = false;
    bool  expected3     = false;
    bool  tooSmall      = false;
    bool  isContained   = false;
    bool  noOverlaps    = false;
    bool  notSimilar    = false;

    if (tgB->_isUnassembled == true) {
      toUnassembled = true;
      continue;
    }

    //  If we're overlapping with ourself, not a useful edge to be splitting on.

    if ((tgA->id() == tgB->id()) && (isOverlapping(op.verified, rdA->position))) {
      toSelf = true;
      if (verbose == false)
        continue;
    }

    //  If the overlap is on the end that is used in the tig, not a useful edge.
    //
    //              first == true (tig)      first == false (tig)
    //    is5   fwd == true  ---------->     fwd == false <---------
    //    is3   fwd == false <----------     fwd == true  --------->

    bool  is5 = (isFirst == rdA->position.isForward()) ? true : false;

    if ((is5 == true) && (op.covered.bgn != 0)) {
      expected5 = true;
      if (verbose == false)
        continue;
    }

    if ((is5 == false) && (op.covered.end != RI->readLength(rdA->ident))) {
      expected3 = true;
      if (verbose == false)
        continue;
    }

    //  If too small, bail.

    if (op.verified.max() - op.verified.min() < minOverlap) {
      tooSmall = true;
      if (verbose == false)
        continue;
    }

    //  Scan all the reads we supposedly overlap, checking for overlaps.  Save the one that is the
    //  lowest (is5 == true) or highest (is5 == false).  Also, compute an average erate for the
    //  overlaps to this read.

    uint32       ovlLen = 0;
    BAToverlap  *ovl    = OC->getOverlaps(rdA->ident, ovlLen);

    double       erate  = 0.0;
    uint32       erateN = 0;

    bool         isLow  = false;
    int32        coord  = 0;
    ufNode      *rdB    = NULL;

    //  DEBUG:  If not to self, try to find the overlap.  Otherwise, this just adds useless clutter,
    //  the self edge is disqualifying enough.

    if (toSelf == false) {
      findEnd(op, rdA->position.isForward(), isFirst, isLow, coord);

      //  Simple code, but lots of comments.
      writeLog("\n");
      writeLog("Scan reads from #%u to #%u for %s coordinate in verified region %u-%u\n",
               op.tigFidx, op.tigLidx, (isLow) ? "low" : "high", op.verified.min(), op.verified.max());

      for (uint32 ii=op.tigFidx; ii<=op.tigLidx; ii++) {
        for (uint32 oo=0; oo<ovlLen; oo++) {
          ufNode  *rdBii = &tgB->ufpath[ii];

          if (ovl[oo].b_iid != rdBii->ident)
            continue;

          writeLog("Test read #%6u ident %7u %9u-%9u against verified region %9u-%9u",
                   ii, rdBii->ident, rdBii->position.min(), rdBii->position.max(),
                   op.verified.min(), op.verified.max());

          erate  += ovl[oo].erate();
          erateN += 1;

          //  Split on the higher coordinate.  If this is larger than the current coordinate AND still
          //  within the verified overlap range, reset the coordinate.  Allow only dovetail overlaps.

          if ((isLow == false) && (rdBii->position.max() < op.verified.max())) {
            writeLog(" - CANDIDATE hangs %7d %7d", ovl[oo].a_hang, ovl[oo].b_hang);

            if ((rdBii->position.max() > coord) &&
                (rdBii->position.min() < op.verified.min()) /* && (ovl[oo].a_hang < 0) */) {
              writeLog(" - SAVED");
              rdB   = rdBii;
              coord = rdBii->position.max();
            }
          }

          //  Split on the lower coordinate.

          if ((isLow == true) && (rdBii->position.min() > op.verified.min())) {
            writeLog(" - CANDIDATE hangs %7d %7d", ovl[oo].a_hang, ovl[oo].b_hang);

            if ((rdBii->position.min() < coord) &&
                (rdBii->position.max() > op.verified.max()) /* && (ovl[oo].b_hang > 0) */) {
              writeLog(" - SAVED");
              rdB   = rdBii;
              coord = rdBii->position.min();
            }
          }

          writeLog("\n");
        }
      }

      if (erateN > 0)
        erate /= erateN;

      //  Huh?  If didn't find any overlaps, give up without crashing (this hasn't ever been triggered).

      if (rdB == NULL) {
        writeLog("\n");
        writeLog("Failed to find appropriate intersecting read.\n");
        writeLog("\n");
        flushLog();

        noOverlaps = true;
        if (verbose == false)
          continue;
      } else {
        writeLog("Found appropriate intersecting read.\n");
      }
    }  //  End of toSelf DEBUG

    //  Finally, ignore it if the overlap isn't similar to everything else in the tig.  A
    //  complication here is we don't know what erate we have between tgA and tgB.  We approximate
    //  it by averaging all the overlaps from rdA to the reads it overlaps here.  Kind of expensive,
    //  too bad.

#define REPEAT_FRACTION   0.5

#warning deviationGraph hard coded
    double  deviationGraph = 6;

    double  sim = tgB->overlapConsistentWithTig(deviationGraph, op.verified.min(), op.verified.max(), erate);

    if (sim < REPEAT_FRACTION) {
      notSimilar = true;
      if (verbose == false)
        continue;
    }

    //  If not useful, bail.  This only occurs here if verbose == true; otherwise, we shortcircuit in the tests above.

    if (toSelf || expected5 || expected3 || tooSmall || isContained || noOverlaps || notSimilar) {
      if (verbose)
        writeLog("createUnitigs()-- read %6u place %3d edgeTo tig %5u reads #%5u %9u-%9u verified %9d-%9d position %9d-%9d covered %7d-%7d%s%s%s%s%s%s%s\n",
                 rdA->ident, pp, op.tigID,
                 op.tigFidx, tgB->ufpath[op.tigFidx].ident, tgB->ufpath[op.tigLidx].ident,
                 op.verified.bgn, op.verified.end,
                 op.position.bgn, op.position.end,
                 op.covered.bgn, op.covered.end,
                 (toSelf      == true) ? " SELF"         : "",
                 (expected5   == true) ? " EXPECTED_5'"  : "",
                 (expected3   == true) ? " EXPECTED_3'"  : "",
                 (tooSmall    == true) ? " TOO_SMALL"    : "",
                 (isContained == true) ? " IS_CONTAINED" : "",   //  Would be nice to report read it's contained in?
                 (noOverlaps  == true) ? " NO_OVERLAPS"  : "",
                 (notSimilar  == true) ? " NOT_SIMILAR"  : "");
      continue;
    }

    //  Otherwise, it's a useful edge.

    if (verbose)
      writeLog("createUnitigs()-- read %6u place %3d edgeTo tig %5u reads #%5u %9u-%9u verified %9d-%9d position %9d-%9d covered %7d-%7d BREAK at pos %8u read %6u isLow %u sim %.4f\n",
               rdA->ident, pp, op.tigID,
               op.tigFidx, tgB->ufpath[op.tigFidx].ident, tgB->ufpath[op.tigLidx].ident,
               op.verified.bgn, op.verified.end,
               op.position.bgn, op.position.end,
               op.covered.bgn, op.covered.end,
               coord, rdB->ident, isLow, sim);

    breaks.push_back(breakPointEnd(op.tigID, coord, isLow));
  }

  if (breaks.size() == 0) {
    //  Do nothing.
  } else if (breaks.size() > maxPlacements) {
    writeLog("createUnitigs()-- discarding %u breakpoints.\n", breaks.size());
  } else if (breaks.size() <= maxPlacements) {
    writeLog("createUnitigs()-- saving %u breakpoints to master list.\n", breaks.size());
    //breakpoints.insert(breakpoints.end(), breaks.begin(), breaks.end());
    for (uint32 ii=0; ii<breaks.size(); ii++)
      breakpoints.push_back(breaks[ii]);
  }

  return(breaks.size());
}



void
stripNonBackboneFromStart(TigVector &unitigs, Unitig *tig, bool isFirst) {
  vector<ufNode>  ufpath;
  uint32          ii = 0;

  while (RI->isBackbone(tig->ufpath[ii].ident) == false) {   //  Find the first backbone read,
    unitigs.registerRead(tig->ufpath[ii].ident);
    writeLog("WARNING: unitig %u %s read %8u %9u-%9u is not backbone, removing.\n",
             tig->id(), isFirst ? "first" : "last ",
             tig->ufpath[ii].ident, tig->ufpath[ii].position.bgn, tig->ufpath[ii].position.end);
    ii++;
  }

  while (ii < tig->ufpath.size()) {                          //  and copy to a new vector.
    ufpath.push_back(tig->ufpath[ii]);
    ii++;
  }

  tig->ufpath.swap(ufpath);    //  assign the new vector to the tig
  tig->cleanUp();              //  adjust zero, find new length
  tig->reverseComplement();    //  rebuild the idx mappings, and reverse for the next phase
}



void
createUnitigs(TigVector             &contigs,
              TigVector             &unitigs,
              uint32                 minIntersectLen,
              uint32                 maxPlacements,
              vector<confusedEdge>  &confusedEdges,
              vector<tigLoc>        &unitigSource) {

  vector<breakPointEnd>   breaks;

  uint32                  nBreaksSentinel;
  uint32                  nBreaksConfused;
  uint32                  nBreaksIntersection;

  //  Give each tig a pair of bogus breakpoints at the ends, just to get it in the list.  If there
  //  are no break points, it won't be split.  These also serve as sentinels during splitting.

  writeLog("\n");
  writeLog("----------------------------------------\n");
  writeLog("Adding sentinel breaks at the ends of contigs.\n");

  for (uint32 ti=0; ti<contigs.size(); ti++) {
    Unitig  *tig = contigs[ti];

    if ((tig == NULL) ||
        (tig->_isUnassembled == true))
      continue;

    breaks.push_back(breakPointEnd(ti, 0, true));                   //  Add one at the start of the tig
    breaks.push_back(breakPointEnd(ti, tig->getLength(), false));   //  And one at the end
  }

  nBreaksSentinel = breaks.size();

  //  Add breaks for any confused edges detected during repeat detection.  We should, probably,
  //  remove duplicates, but they (should) cause no harm.
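The accept/reject rule at the end of checkRead() above reduces to one predicate: a contig end's breaks are kept only when there is at least one and no more than maxPlacements of them; presumably an end placed more often than that looks repeat-induced. A sketch with a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper mirroring checkRead()'s save/discard decision.
bool keepBreaks(size_t nBreaks, uint32_t maxPlacements) {
  return (nBreaks > 0) && (nBreaks <= maxPlacements);   // zero: nothing to do;
}                                                       // too many: discard all
```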
  writeLog("\n");
  writeLog("----------------------------------------\n");
  writeLog("Adding breaks at confused reads.\n");

  for (uint32 ii=0; ii<confusedEdges.size(); ii++) {
    uint32   aid = confusedEdges[ii].aid;
    bool     a3p = confusedEdges[ii].a3p;

    uint32   tid = contigs.inUnitig(aid);
    uint32   tpp = contigs.ufpathIdx(aid);

    Unitig  *tig = contigs[tid];
    ufNode  *rda = &tig->ufpath[tpp];

    if ((tig == NULL) ||                 //  It won't be NULL, but we definitely don't want to
        (tig->_isUnassembled == true))   //  see unassembled crap here.  We don't care, and they'll crash.
      continue;

    uint32   coord = 0;       //  Pick the coordinate and set isLow based on orientation
    bool     isLow = false;   //  and the end of the read that is confused.

    if (((rda->position.isForward() == true)  && (a3p == true)) ||
        ((rda->position.isForward() == false) && (a3p == false))) {
      coord = rda->position.max();
      isLow = false;
    }

    if (((rda->position.isForward() == true)  && (a3p == false)) ||
        ((rda->position.isForward() == false) && (a3p == true))) {
      coord = rda->position.min();
      isLow = true;
    }

    breakPointEnd  bp(tid, coord, isLow);

    if (breaks.back() == bp)
      continue;

    writeLog("createUnitigs()-- add break tig %u pos %u isLow %c\n", tid, coord, (isLow) ? 't' : 'f');

    breaks.push_back(bp);
  }

  nBreaksConfused = breaks.size();

  //  Check the reads at the end of every tig for intersections to other tigs.  If the read has a
  //  compatible overlap to the middle of some other tig, split the other tig into multiple unitigs.

  writeLog("\n");
  writeLog("----------------------------------------\n");
  writeLog("Finding contig-end to contig-middle intersections.\n");

  uint32  *numP = NULL;
  uint32   lenP = 0;
  uint32   maxP = 1024;

  allocateArray(numP, maxP);

  for (uint32 ti=0; ti<contigs.size(); ti++) {
    Unitig  *tig = contigs[ti];

    if ((tig == NULL) ||
        (tig->_isUnassembled == true))
      continue;

    //  Find break points in other tigs using the first and last reads.

    ufNode  *fi = tig->firstRead();
    ufNode  *li = tig->lastRead();

    vector<overlapPlacement>   fiPlacements;
    vector<overlapPlacement>   liPlacements;

    placeReadUsingOverlaps(contigs, NULL, fi->ident, fiPlacements, placeRead_all);
    placeReadUsingOverlaps(contigs, NULL, li->ident, liPlacements, placeRead_all);

    if (fiPlacements.size() + liPlacements.size() > 0)
      writeLog("\ncreateUnitigs()-- tig %u len %u first read %u with %lu placements - last read %u with %lu placements\n",
               ti, tig->getLength(),
               fi->ident, fiPlacements.size(),
               li->ident, liPlacements.size());

    uint32  npf = checkRead(tig, fi, fiPlacements, contigs, breaks, minIntersectLen, maxPlacements, true);
    uint32  npr = checkRead(tig, li, liPlacements, contigs, breaks, minIntersectLen, maxPlacements, false);

    lenP = max(lenP, npf);
    lenP = max(lenP, npr);

    resizeArray(numP, maxP, maxP, lenP+1, resizeArray_copyData | resizeArray_clearNew);

    numP[npf]++;
    numP[npr]++;
  }

  nBreaksIntersection = breaks.size();

  writeLog("\n");
  writeLog("Histogram of number of placements per contig end:\n");
  writeLog("numPlacements numEnds\n");
  for (uint32 pp=0; pp<=lenP; pp++)
    writeLog("%13u %7u\n", pp, numP[pp]);

  writeLog("\n");
  writeLog("----------------------------------------\n");
  writeLog("Found %u breakpoints (including duplicates).\n", breaks.size());
  writeLog(" %u from sentinels.\n", nBreaksSentinel);
  writeLog(" %u from confused edges.\n", nBreaksConfused - nBreaksSentinel);
  writeLog(" %u from intersections.\n", nBreaksIntersection - nBreaksConfused);
  writeLog("\n");
  writeLog("Splitting contigs into unitigs.\n");
  writeLog("\n");

  delete [] numP;

  //  The splitTigs function operates only on a single tig.  Sort the break points
  //  by tig id to find all the break points for each tig.

  sort(breaks.begin(), breaks.end());

  //  Allocate space for breaking tigs.  These are _vastly_ too big, but guaranteed.

  vector<breakPointEnd>   BP;

  Unitig  **newTigs  = new Unitig * [breaks.size() + 2];   //  Plus two, because we add an extra
  int32    *lowCoord = new int32    [breaks.size() + 2];   //  break at the start and end of each set.
  //  Walk through the breaks, making a new vector of breaks for each tig.

  uint32  ss = 0;
  uint32  ee = 0;

  while (ss < breaks.size()) {
    Unitig  *tig = contigs[breaks[ss]._tigID];

    //  Find the last break point for this tig.  (Technically, the one after the last, but...)

    while ((ee < breaks.size()) && (breaks[ss]._tigID == breaks[ee]._tigID))
      ee++;

    //  Make a new vector for those break points.

    BP.clear();

    for (uint32 bb=ss; bb<ee; bb++)
      BP.push_back(breaks[bb]);

    if (BP.size() > 2)
      writeLog("createUnitigs()-- contig %u found %u breakpoint%s\n",
               tig->id(), BP.size()-2, (BP.size()-2 != 1) ? "s" : "");

    //  Split the tig.  Copy it into the unitigs TigVector too.

    uint32  nTigs = splitTig(contigs, tig, BP, newTigs, lowCoord, false);

    if (nTigs > 1) {
      splitTig(unitigs, tig, BP, newTigs, lowCoord, true);
      writeLog("createUnitigs()-- contig %u was split into %u unitigs, %u through %u.\n",   //  Can't use newTigs, because
               tig->id(), nTigs, unitigs.size() - nTigs, unitigs.size() - 1);               //  there are holes in it
    } else {
      newTigs[0]  = copyTig(unitigs, tig);   //  splitTig populates newTigs and lowCoord, used below.
      lowCoord[0] = 0;
      writeLog("createUnitigs()-- contig %u copied into unitig %u.\n", tig->id(), newTigs[0]->id());
    }

    //  Remember where these unitigs came from.

    unitigSource.resize(unitigs.size() + 1);

    for (uint32 tt=0; tt<BP.size()+2; tt++) {
      if (newTigs[tt] == NULL)
        continue;

      uint32  id = newTigs[tt]->id();

      writeLog("createUnitigs()-- piece %3u -> tig %u from contig %u %u-%u\n",
               tt, id, tig->id(), lowCoord[tt], lowCoord[tt] + newTigs[tt]->getLength());

      unitigSource[id].cID  = tig->id();
      unitigSource[id].cBgn = lowCoord[tt];
      unitigSource[id].cEnd = lowCoord[tt] + newTigs[tt]->getLength();
      unitigSource[id].uID  = id;
    }

    //  Reset for the next iteration.

    ss = ee;
  }

  //  Remove non-backbone reads from the ends of unitigs.  These confound graph building because
  //  they can be missing overlaps.
  //
  //  If the last read in the tig is not a backbone read, we can remove it and all reads that come
  //  after it (because those reads are contained).
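The walk through the sorted breaks above groups entries by tig id with two cursors: advance `ee` past every entry sharing `ss`'s id, process the half-open range [ss, ee), then continue from `ee`. A minimal sketch (hypothetical function, collecting only the ranges):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Collect [ss, ee) ranges of equal tig ids from a sorted id list, the same
// two-cursor walk createUnitigs() uses over the sorted break points.
std::vector<std::pair<uint32_t, uint32_t>>
groupByTig(std::vector<uint32_t> const &tigID) {
  std::vector<std::pair<uint32_t, uint32_t>>  ranges;

  uint32_t ss = 0;
  uint32_t ee = 0;

  while (ss < tigID.size()) {
    while ((ee < tigID.size()) && (tigID[ss] == tigID[ee]))   // run of one tig's breaks
      ee++;

    ranges.push_back({ss, ee});   // one tig's breaks: [ss, ee)
    ss = ee;                      // reset for the next iteration
  }

  return ranges;
}
```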
#if 1
  for (uint32 ti=0; ti<unitigs.size(); ti++) {
    Unitig  *tig = unitigs[ti];

    if ((tig == NULL) ||
        (tig->_isUnassembled == true))
      continue;

    //  First, check if we have any backbone reads.  If we have none, leave it as is.

    uint32  bbReads = 0;
    uint32  nbReads = 0;

    for (uint32 li=0; li<tig->ufpath.size(); li++) {
      if (RI->isBackbone(tig->ufpath[li].ident) == true)
        bbReads++;
      else
        nbReads++;
    }

    if (bbReads == 0)
      continue;

    //  Now remove non-backbone reads from the start of the tig.

    writeLog("unitig %u with %u reads, %u backbone and %u unplaced.\n",
             tig->id(), tig->ufpath.size(), bbReads, nbReads);

    stripNonBackboneFromStart(unitigs, tig, true);    //  Does reverse complement at very end
    stripNonBackboneFromStart(unitigs, tig, false);
  }
#endif

  //  Cleanup.

  delete [] newTigs;
  delete [] lowCoord;
}
canu-1.6/src/bogart/AS_BAT_CreateUnitigs.H000066400000000000000000000033331314437614700203170ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz beginning on 2016-OCT-03
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#ifndef AS_BAT_CREATEUNITIGS_H
#define AS_BAT_CREATEUNITIGS_H

#include "AS_BAT_ReadInfo.H"
#include "AS_BAT_OverlapCache.H"
#include "AS_BAT_BestOverlapGraph.H"
#include "AS_BAT_AssemblyGraph.H"
#include "AS_BAT_Logging.H"
#include "AS_BAT_MarkRepeatReads.H"   //  confusedEdge
#include "AS_BAT_TigVector.H"

class tigLoc {
public:
  tigLoc() {
    cID  = UINT32_MAX;
    cBgn = 0;
    cEnd = 0;
    uID  = UINT32_MAX;
  };

  uint32  cID;
  uint32  cBgn;
  uint32  cEnd;
  uint32  uID;   //  Debugging.
};

void
createUnitigs(TigVector             &contigs,
              TigVector             &unitigs,
              uint32                 minIntersectLen,
              uint32                 maxPlacements,
              vector<confusedEdge>  &confusedEdges,
              vector<tigLoc>        &unitigSource);

#endif  //  AS_BAT_CREATEUNITIGS_H
canu-1.6/src/bogart/AS_BAT_DropDeadEnds.C000066400000000000000000000224241314437614700200420ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz beginning on 2017-MAY-31
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#include "AS_BAT_ReadInfo.H"
#include "AS_BAT_OverlapCache.H"
#include "AS_BAT_BestOverlapGraph.H"
#include "AS_BAT_AssemblyGraph.H"
#include "AS_BAT_Logging.H"
#include "AS_BAT_Unitig.H"
#include "AS_BAT_TigVector.H"
#include "AS_BAT_CreateUnitigs.H"


//  Find the next/previous read in the tig.  Skips contained reads if they are isolated.
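findNextRead() below implements that skip with two interval tests: the candidate read must either dovetail off the end of the current read, or still overlap the read after it. A sketch of just those tests, on bare coordinates (`Pos` and `usableNext` are hypothetical stand-ins for the read positions and the inline logic):

```cpp
#include <cassert>
#include <cstdint>

struct Pos {              // hypothetical stand-in for a read's tig position
  int32_t min;
  int32_t max;
};

bool usableNext(Pos fn, Pos nn, Pos const *nextNext) {
  if (fn.max < nn.max)                 // nn dovetails off the end of fn
    return true;
  if ((nextNext != nullptr) &&         // nn is contained in fn, but the read
      (nextNext->min < nn.max))        // after it still intersects nn
    return true;
  return false;                        // isolated contained read; skip it
}
```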
ufNode * findNextRead(Unitig *tig, ufNode *fn) { for (uint32 ni = tig->ufpathIdx(fn->ident)+1; ni < tig->ufpath.size(); ni++) { ufNode *nn = &tig->ufpath[ni]; // If nn is dovetail, return it. // fn ----------- // nn --------- // if (fn->position.max() < nn->position.max()) return(nn); // Otherwise, if it intersects the next-next read, return it. // fn ---------------------- // nn --------- // next-next ------- // if ((ni + 1 < tig->ufpath.size()) && (tig->ufpath[ni+1].position.min() < nn->position.max())) return(nn); } // Otherwise, ran out of reads. return(NULL); } #if 0 // Not used anymore. Might be incorrect. ufNode * findPrevRead(Unitig *tig, ufNode *li) { // A significant complication of working with reads on the 3' end is that they aren't sorted by // their end position. We get around this by saving a copy of the existing reads, reverse // complementing that, and using the same method as in findNextRead(). // // Don't be clever and thing you can just reverse complement the tig; that can change order of // reads, and we don't want to do that here. vector ufcopy; ufcopy.resize(tig->ufpath.size()); for (uint32 ii=0; iiufpath.size(); ii++) { ufcopy[ii] = tig->ufpath[ii]; ufcopy[ii].position.bgn = tig->getLength() - tig->ufpath[ii].position.bgn; ufcopy[ii].position.end = tig->getLength() - tig->ufpath[ii].position.end; } std::sort(ufcopy.begin(), ufcopy.end()); // ufpathIdx() won't work anymore, but li should be the first read. uint32 niPos=0; while (ufcopy[niPos].ident != li->ident) niPos++; // Set 'fn' to that first node, and search for the next node. This is nearly cut-n-paste from // above (just replaced the return value with one that uses the ufpath). 
ufNode *fn = &ufcopy[niPos]; for (uint32 ni = niPos+1; ni < tig->ufpath.size(); ni++) { ufNode *nn = &ufcopy[ni]; if (fn->position.max() < nn->position.max()) return(&tig->ufpath[ tig->ufpathIdx(nn->ident) ]); if ((ni + 1 < tig->ufpath.size()) && (ufcopy[ni+1].position.min() < nn->position.max())) return(&tig->ufpath[ tig->ufpathIdx(nn->ident) ]); } // Otherwise, ran out of reads. return(NULL); } #endif uint32 dropDeadFirstRead(AssemblyGraph *AG, Unitig *tig) { ufNode *fn = tig->firstRead(); ufNode *sn = findNextRead(tig, fn); // No next read, keep fn in the tig. if (sn == NULL) { writeLog("dropDead()- read %8u no sn\n", fn->ident); return(0); } // Over all edges from the first read, look for any edge to something else. // // If a contained edge to anything, read fn is good and should be kept. // // Otherwise, decide which overlap we want to be using, based on the orientation of the read in // the tig. We assume that this is always the first read, which is OK, because the function name // says so. Any edge to anywhere means the read is good and should be kept. for (uint32 pp=0; ppgetForward(fn->ident).size(); pp++) { BestPlacement &pf = AG->getForward(fn->ident)[pp]; writeLog("dropDead()-- 1st read %8u %s pf %3u/%3u best5 %8u best3 %8u bestC %8u\n", fn->ident, fn->position.isForward() ? "->" : "<-", pp, AG->getForward(fn->ident).size(), pf.best5.b_iid, pf.best3.b_iid, pf.bestC.b_iid); if (pf.bestC.b_iid > 0) { return(0); } if (((fn->position.isForward() == true) && (pf.best5.b_iid != 0)) || ((fn->position.isForward() == false) && (pf.best3.b_iid != 0))) return(0); } // But no edge means we need to check the second read. If it has an edge, then we infer the // first read is bogus and should be removed. If it also has no edge (except to the first read, // duh) then we know nothing: this could be novel sequence or it could be the same garbage that // is infecting the first read. 
// // This is basically the same as the previous loop, except we also need to exclude edges to the // first read. Well, and that if the second read has an edge we declare the first read to be // junk. That's also a bit of a difference from the previous loop. for (uint32 pp=0; ppgetForward(sn->ident).size(); pp++) { BestPlacement &pf = AG->getForward(sn->ident)[pp]; writeLog("dropDead()-- 2nd read %8u %s pf %3u/%3u best5 %8u best3 %8u bestC %8u\n", sn->ident, sn->position.isForward() ? "->" : "<-", pp, AG->getForward(sn->ident).size(), pf.best5.b_iid, pf.best3.b_iid, pf.bestC.b_iid); if ((pf.bestC.b_iid > 0) && (pf.bestC.b_iid != fn->ident)) return(fn->ident); if (((sn->position.isForward() == true) && (pf.best5.b_iid != 0) && (pf.best5.b_iid != fn->ident)) || ((sn->position.isForward() == false) && (pf.best3.b_iid != 0) && (pf.best3.b_iid != fn->ident))) return(fn->ident); } // Otherwise, the second read had only edges to the first read, and we should keep the first // read. return(0); } void dropDeadEnds(AssemblyGraph *AG, TigVector &tigs) { uint32 numF = 0; // Number of first-read drops uint32 numL = 0; // Number of last-read drops uint32 numB = 0; // Number of both-first-and-last-read drops uint32 numT = 0; // Number of tigs mucked with for (uint32 ti=0; tiufpath.size() <= 1) || (tig->_isUnassembled == true)) continue; uint32 fn = dropDeadFirstRead(AG, tig); // Decide if the first read is junk. tig->reverseComplement(); // Flip. uint32 ln = dropDeadFirstRead(AG, tig); // Decide if the last (now first) read is junk. tig->reverseComplement(); // Flip back. if ((fn == 0) && (ln == 0)) // Nothing to remove, just get out of here. continue; // At least one read needs to be kicked out. Make new tigs for everything. 
    char fnMsg[80] = {0};  Unitig *fnTig = NULL;
    char nnMsg[80] = {0};  Unitig *nnTig = NULL;  int32 nnOff = INT32_MAX;
    char lnMsg[80] = {0};  Unitig *lnTig = NULL;

    if (fn > 0)
      fnTig = tigs.newUnitig(false);

    if (tig->ufpath.size() > (fn > 0) + (ln > 0))
      nnTig = tigs.newUnitig(false);

    if (ln > 0)
      lnTig = tigs.newUnitig(false);

    //  Count what we do

    numT++;

    if (fnTig)           numF++;
    if (fnTig && lnTig)  numB++;
    if (lnTig)           numL++;

    //  Move reads to their new unitig.

    strcpy(fnMsg, " ");
    strcpy(nnMsg, " ");
    strcpy(lnMsg, "");

    for (uint32 cc=0, tt=0; tt<tig->ufpath.size(); tt++) {
      ufNode &read = tig->ufpath[tt];

      if (read.ident == fn) {
        sprintf(fnMsg, "first read %9u to tig %7u --", read.ident, fnTig->id());
        fnTig->addRead(read, -read.position.min(), false);
      }

      else if (read.ident == ln) {
        sprintf(lnMsg, "-- last read %9u to tig %7u", read.ident, lnTig->id());
        lnTig->addRead(read, -read.position.min(), false);
      }

      else {
        if (nnOff == INT32_MAX) {
          sprintf(nnMsg, "other reads to tig %7u", nnTig->id());
          nnOff = read.position.min();
        }
        nnTig->addRead(read, -nnOff, false);
      }
    }

    writeLog("dropDeadEnds()-- tig %7u --> %s %s %s\n", tig->id(), fnMsg, nnMsg, lnMsg);

    if (fnTig)  fnTig->cleanUp();   //  Probably not needed, but cheap.
    if (lnTig)  lnTig->cleanUp();   //  Probably not needed, but cheap.
    if (nnTig)  nnTig->cleanUp();   //  Most likely needed.

    //  Old tig is now junk.

    delete tigs[ti];
    tigs[ti] = NULL;
  }

  writeStatus("dropDeadEnds()-- Modified %u tigs.  Dropped %u first and %u last reads, %u tig%s had both reads dropped.\n",
              numT, numF, numL, numB, (numB == 1) ? "" : "s");
}
canu-1.6/src/bogart/AS_BAT_DropDeadEnds.H000066400000000000000000000023561314437614700200510ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2017-MAY-31 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef AS_BAT_DROPDEADENDS_H #define AS_BAT_DROPDEADENDS_H #include "AS_BAT_ReadInfo.H" #include "AS_BAT_OverlapCache.H" #include "AS_BAT_BestOverlapGraph.H" #include "AS_BAT_AssemblyGraph.H" #include "AS_BAT_Logging.H" #include "AS_BAT_TigVector.H" void dropDeadEnds(AssemblyGraph *AG, TigVector &tigs); #endif // AS_BAT_DROPDEADENDS_H canu-1.6/src/bogart/AS_BAT_Instrumentation.C000066400000000000000000001007511314437614700207510ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_Instrumentation.C * * Modifications by: * * Brian P. Walenz from 2010-NOV-23 to 2013-AUG-27 * are Copyright 2010-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. 
Walenz from 2014-DEC-19 to 2014-DEC-23 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */

#include "AS_BAT_ReadInfo.H"
#include "AS_BAT_BestOverlapGraph.H"
#include "AS_BAT_Logging.H"
#include "AS_BAT_Unitig.H"
#include "AS_BAT_SetParentAndHang.H"
#include "AS_BAT_Outputs.H"

#include "intervalList.H"


//  Will fail if a read is in unitig 0, or if a read isn't in a unitig.

void
checkUnitigMembership(TigVector &tigs) {
  uint32 *inUnitig = new uint32 [RI->numReads()+1];
  uint32  noUnitig = 0xffffffff;

  //  All reads start off not placed in a unitig.

  for (uint32 i=0; i<RI->numReads()+1; i++)
    inUnitig[i] = noUnitig;

  //  Over all tigs, remember where each read is.

  for (uint32 ti=0; ti<tigs.size(); ti++) {
    Unitig  *tig = tigs[ti];

    if (tig == NULL)
      continue;

    for (uint32 fi=0; fi<tig->ufpath.size(); fi++) {
      ufNode *frg = &tig->ufpath[fi];

      if (frg->ident > RI->numReads())
        fprintf(stderr, "tig %u ufpath[%d] ident %u more than number of reads %u\n",
                tig->id(), fi, frg->ident, RI->numReads());

      if (inUnitig[frg->ident] != noUnitig)
        fprintf(stderr, "tig %u ufpath[%d] ident %u placed multiple times\n",
                tig->id(), fi, frg->ident);

      assert(frg->ident <= RI->numReads());      //  Can't be out of range.
      assert(inUnitig[frg->ident] == noUnitig);  //  Read must be not placed yet.

      inUnitig[frg->ident] = ti;
    }
  }

  //  Find any read not placed in a unitig.

  for (uint32 i=0; i<RI->numReads()+1; i++) {
    if (RI->readLength(i) == 0)    //  Deleted read.
      continue;

    assert(inUnitig[i] != 0);         //  There shouldn't be a unitig 0.
    assert(inUnitig[i] != noUnitig);  //  The read should be in a unitig.
  }

  delete [] inUnitig;
}


//  Rule S.  Singleton.
bool
classifyRuleS(Unitig *utg, FILE *UNUSED(F), uint32 &num, uint64 &len) {

  if (utg->ufpath.size() > 1)
    return(false);

  //fprintf(F, "unitig " F_U32 " (%s) unassembled - singleton\n", utg->id(),
  //        (utg->_isRepeat) ? "repeat" : "normal");

  num += 1;
  len += utg->getLength();

  return(true);
}


//  Rule 1.  Too few reads.

bool
classifyRule1(Unitig *utg, FILE *F, uint32 &num, uint64 &len, uint32 fewReadsNumber) {

  if (utg->ufpath.size() == 1)
    return(false);

  if (utg->ufpath.size() >= fewReadsNumber)
    return(false);

  fprintf(F, "unitig " F_U32 " (%s) unassembled - too few reads (" F_U64 " < " F_U32 ")\n",
          utg->id(), (utg->_isRepeat) ? "repeat" : "normal", utg->ufpath.size(), fewReadsNumber);

  num += 1;
  len += utg->getLength();

  return(true);
}


//  Rule 2.  Short.

bool
classifyRule2(Unitig *utg, FILE *F, uint32 &num, uint64 &len, uint32 tooShortLength) {

  if (utg->ufpath.size() == 1)
    return(false);

  if (utg->getLength() >= tooShortLength)
    return(false);

  if (utg->ufpath.size() > 1)
    fprintf(F, "unitig " F_U32 " (%s) unassembled - too short (" F_U32 " < " F_U32 ")\n",
            utg->id(), (utg->_isRepeat) ? "repeat" : "normal", utg->getLength(), tooShortLength);

  num += 1;
  len += utg->getLength();

  return(true);
}


//  Rule 3.  Single read spans large fraction of tig.

bool
classifyRule3(Unitig *utg, FILE *F, uint32 &num, uint64 &len, double spanFraction) {

  if (utg->ufpath.size() == 1)
    return(false);

  for (uint32 oi=0; oi<utg->ufpath.size(); oi++) {
    ufNode *frg = &utg->ufpath[oi];

    int frgbgn = MIN(frg->position.bgn, frg->position.end);
    int frgend = MAX(frg->position.bgn, frg->position.end);

    if (frgend - frgbgn > utg->getLength() * spanFraction) {
      if (utg->ufpath.size() > 1)
        fprintf(F, "unitig " F_U32 " (%s) unassembled - single read spans unitig (read " F_U32 " " F_U32 "-" F_U32 " spans fraction %f > %f\n",
                utg->id(), (utg->_isRepeat) ? "repeat" : "normal",
                frg->ident, frg->position.bgn, frg->position.end,
                (double)(frgend - frgbgn) / utg->getLength(), spanFraction);

      num += 1;
      len += utg->getLength();

      return(true);
    }
  }

  return(false);
}


//  Rule 4.  Low coverage.

bool
classifyRule4(Unitig *utg, FILE *F, uint32 &num, uint64 &len, double lowcovFraction, uint32 lowcovDepth) {

  if (utg->ufpath.size() == 1)
    return(false);

  intervalList<int32>  IL;

  for (uint32 oi=0; oi<utg->ufpath.size(); oi++) {
    ufNode *frg = &utg->ufpath[oi];

    int frgbgn = MIN(frg->position.bgn, frg->position.end);
    int frgend = MAX(frg->position.bgn, frg->position.end);

    IL.add(frgbgn, frgend - frgbgn);
  }

  intervalList<int32>  ID(IL);

  uint32  basesLow  = 0;
  uint32  basesHigh = 0;

  for (uint32 ii=0; ii<ID.numberOfIntervals(); ii++)
    if (ID.depth(ii) < lowcovDepth)
      basesLow  += ID.hi(ii) - ID.lo(ii) + 1;
    else
      basesHigh += ID.hi(ii) - ID.lo(ii) + 1;

  assert(basesLow + basesHigh > 0);

  double lowcov = (double)basesLow / (basesLow + basesHigh);

  if (lowcov < lowcovFraction)
    return(false);

  if (utg->ufpath.size() > 1)
    fprintf(F, "Unitig " F_U32 " (%s) unassembled - low coverage (%.2f%% of unitig at < " F_U32 "x coverage, allowed %.2f%%)\n",
            utg->id(), (utg->_isRepeat) ? "repeat" : "normal",
            100.0 * lowcov, lowcovDepth, 100.0 * lowcovFraction);

  num += 1;
  len += utg->getLength();

  return(true);
}


void
classifyTigsAsUnassembled(TigVector &tigs,
                          uint32 fewReadsNumber,
                          uint32 tooShortLength,
                          double spanFraction,
                          double lowcovFraction, uint32 lowcovDepth) {
  uint32 nSingleton   = 0;  uint64 bSingleton   = 0;
  uint32 nTooFew      = 0;  uint64 bTooFew      = 0;
  uint32 nShort       = 0;  uint64 bShort       = 0;
  uint32 nSingleSpan  = 0;  uint64 bSingleSpan  = 0;
  uint32 nCoverage    = 0;  uint64 bCoverage    = 0;
  uint32 nContig      = 0;  uint64 bContig      = 0;

  char   N[FILENAME_MAX];

  snprintf(N, FILENAME_MAX, "%s.unassembled", getLogFilePrefix());

  errno = 0;
  FILE *F = fopen(N, "w");
  if (errno)
    F = NULL;

  if (F) {
    fprintf(F, "# Contigs flagged as unassembled.\n");
    fprintf(F, "#\n");
    fprintf(F, "# fewReadsNumber  %u (singletons always removed and not logged)\n", fewReadsNumber);
    fprintf(F, "# tooShortLength  %u\n", tooShortLength);
    fprintf(F, "# spanFraction    %f\n", spanFraction);
    fprintf(F, "# lowcovFraction  %f\n",
            lowcovFraction);
    fprintf(F, "# lowcovDepth     %u\n", lowcovDepth);
    fprintf(F, "#\n");
  }

  for (uint32 ti=0; ti<tigs.size(); ti++) {
    Unitig  *utg = tigs[ti];

    if (utg == NULL)
      continue;

    utg->_isUnassembled = true;

    //  Check the tig.

    bool  rr = (utg->_isRepeat == true);
    bool  rs = classifyRuleS(utg, F, nSingleton,  bSingleton);
    bool  r1 = classifyRule1(utg, F, nTooFew,     bTooFew,     fewReadsNumber);
    bool  r2 = classifyRule2(utg, F, nShort,      bShort,      tooShortLength);
    bool  r3 = classifyRule3(utg, F, nSingleSpan, bSingleSpan, spanFraction);
    bool  r4 = classifyRule4(utg, F, nCoverage,   bCoverage,   lowcovFraction, lowcovDepth);

    //  If flagged, we're done, just move on.

    if ((rr == false) && (rs || r1 || r2 || r3 || r4))
      continue;

    //  Otherwise, unitig is assembled!

    nContig += 1;
    bContig += utg->getLength();

    utg->_isUnassembled = false;
  }

  if (F)
    fclose(F);

  writeStatus("classifyAsUnassembled()--  %6u tigs %11lu bases -- singleton\n", nSingleton, bSingleton);
  writeStatus("classifyAsUnassembled()--  %6u tigs %11lu bases -- too few reads        (< %u reads)\n", nTooFew, bTooFew, fewReadsNumber);
  writeStatus("classifyAsUnassembled()--  %6u tigs %11lu bases -- too short            (< %u bp)\n", nShort, bShort, tooShortLength);
  writeStatus("classifyAsUnassembled()--  %6u tigs %11lu bases -- single spanning read (> %f tig length)\n", nSingleSpan, bSingleSpan, spanFraction);
  writeStatus("classifyAsUnassembled()--  %6u tigs %11lu bases -- low coverage         (> %f tig length at < %u coverage)\n", nCoverage, bCoverage, lowcovFraction, lowcovDepth);
  writeStatus("classifyAsUnassembled()--  %6u tigs %11lu bases -- acceptable contigs\n", nContig, bContig);
  writeStatus("\n");
}


void
reportN50(FILE *F, vector<uint32> &data, char const *label, uint64 genomeSize) {
  uint64  cnt = data.size();
  uint64  sum = 0;
  uint64  tot = 0;
  uint64  nnn = 10;
  uint64  siz = 0;

  if (cnt == 0)
    return;

  //  Duplicates tgTigSizeAnalysis::printSummary()

  sort(data.begin(), data.end(), greater<uint32>());

  for (uint64 i=0; i<cnt; i++)
    tot += data[i];

  if (genomeSize > 0)
    siz = genomeSize;
  else
    siz = tot;

  for (uint64 i=0; i<cnt; i++) {
    sum += data[i];

    while (siz * nnn / 100 < sum) {
      fprintf(F, "%s ng%-3" F_U64P " %10" F_U32P " bp  lg%-3" F_U64P " %6" F_U64P "\n",
              label, nnn, data[i], nnn, i+1);
      nnn += 10;
    }
  }
}


void
reportTigs(TigVector &tigs, const char *UNUSED(prefix), const char *UNUSED(name), uint64 genomeSize) {

  vector<uint32>  unassembledLength;
  vector<uint32>  repeatLength;
  vector<uint32>  contigLength;

  for (uint32 ti=0; ti<tigs.size(); ti++) {
    Unitig  *utg = tigs[ti];

    if (utg == NULL)
      continue;

    if (utg->_isUnassembled) {
      unassembledLength.push_back(utg->getLength());
    }

    else if (utg->_isRepeat) {
      repeatLength.push_back(utg->getLength());
    }

    else {
      contigLength.push_back(utg->getLength());
    }
  }

  char   N[FILENAME_MAX];

  snprintf(N, FILENAME_MAX, "%s.sizes", getLogFilePrefix());

  errno = 0;
  FILE *F = fopen(N, "w");
  if (errno == 0) {
    reportN50(F, unassembledLength, "UNASSEMBLED", genomeSize);
    reportN50(F, repeatLength,      "REPEAT",      genomeSize);
    reportN50(F, contigLength,      "CONTIGS",     genomeSize);

    fclose(F);
  }

  if (logFileFlagSet(LOG_INTERMEDIATE_TIGS) == 0)
    return;

  //  Dump the tigs to an intermediate store.

  setParentAndHang(tigs);

  writeTigsToStore(tigs, getLogFilePrefix(), "tig", false);
}


#define tCTG  0  //  To a read in a normal tig
#define tRPT  1  //  To a read in a repeat tig
#define tUNA  2  //  To a read in an 'unassembled' leftover tig
#define tUNU  3  //  To a read not placed in a tig
#define tNOP  4  //  To no read (for best edges)

struct olapsUsed {
  uint64  total;

  //  By definition, satisfied overlaps are in the same tig.
  uint64  doveSatSame[5];
  uint64  contSatSame[5];

  //  Unsatisfied overlaps can be in the same tig...
  uint64  doveUnsatSame[5];
  uint64  contUnsatSame[5];

  //  ...or can be between tigs.
  uint64  doveUnsatDiff[5][5];
  uint64  contUnsatDiff[5][5];
};


uint32
getTigType(Unitig *tg) {
  if (tg == NULL)          return(tUNU);
  if (tg->_isUnassembled)  return(tUNA);
  if (tg->_isRepeat)       return(tRPT);
  if (1)                   return(tCTG);
}


bool
satisfiedOverlap(uint32 UNUSED(rdAlo), uint32 rdAhi, bool rdAfwd,
                 uint32 rdBlo, uint32 rdBhi, bool rdBfwd, bool flipped) {
  return(((rdAhi < rdBlo) || (rdBhi < rdBlo)) ||       //  Not satisfied, no overlap
         ((rdAfwd == rdBfwd) && (flipped == true)) ||  //  Not satisfied, same orient, but flipped overlap
         ((rdAfwd != rdBfwd) && (flipped == false)));  //  Not satisfied, diff orient, but normal overlap
}


//  Iterate over all overlaps (but the only interface we have is by iterating
//  over all reads), and count the number of overlaps satisfied in tigs.
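A self-contained sketch of the orientation test encoded by satisfiedOverlap() may make the logic easier to follow. Note that satisfiedOverlap() itself returns true for the "not satisfied" cases listed in its comments; the helper below (overlapIsSatisfied is our illustrative name, not a canu function) uses positive logic instead: true means the overlap is consistent with the two reads' placed positions.

```cpp
#include <cstdint>

//  An overlap between two reads placed in the same tig is "satisfied" when
//  their placed intervals intersect, and their relative orientation agrees
//  with the overlap record: reads placed in the same orientation need a
//  normal (non-flipped) overlap, opposite orientations need a flipped one.
bool overlapIsSatisfied(int32_t aLo, int32_t aHi, bool aFwd,
                        int32_t bLo, int32_t bHi, bool bFwd,
                        bool flipped) {
  if ((aHi < bLo) || (bHi < aLo))   //  Intervals don't intersect: no overlap.
    return false;

  return (aFwd == bFwd) ? (flipped == false)    //  Same orient, normal overlap.
                        : (flipped == true);    //  Diff orient, flipped overlap.
}
```

The three negated clauses in satisfiedOverlap() are exactly the complement of this predicate (modulo the quirk that its first clause compares against rdBlo twice, with rdAlo marked UNUSED).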
void
reportOverlaps(TigVector &tigs, const char *UNUSED(prefix), const char *UNUSED(name)) {
  olapsUsed *dd = new olapsUsed;  //  Dovetail overlaps to non-contained reads
  olapsUsed *dc = new olapsUsed;  //  Dovetail overlaps to contained reads
  olapsUsed *cc = new olapsUsed;  //  Containment overlaps
  olapsUsed *bb = new olapsUsed;  //  Best overlaps

  memset(dd, 0, sizeof(olapsUsed));
  memset(dc, 0, sizeof(olapsUsed));
  memset(cc, 0, sizeof(olapsUsed));
  memset(bb, 0, sizeof(olapsUsed));

  for (uint32 fi=0; fi<RI->numReads()+1; fi++) {
    if (RI->readLength(fi) == 0)
      continue;

    uint32   rdAid   = fi;
    uint32   tgAid   = tigs.inUnitig(rdAid);
    Unitig  *tgA     = tigs[tgAid];
    uint32   tgAtype = getTigType(tgA);

    //  Best overlaps exist if the read isn't contained.

    if (OG->isContained(rdAid) == false) {
      BestEdgeOverlap *b5     = OG->getBestEdgeOverlap(fi, false);
      uint32           rd5id   = b5->readId();
      uint32           tg5id   = tigs.inUnitig(rd5id);
      Unitig          *tg5     = tigs[tg5id];
      uint32           tg5type = getTigType(tg5);

      BestEdgeOverlap *b3     = OG->getBestEdgeOverlap(fi, true);
      uint32           rd3id   = b3->readId();
      uint32           tg3id   = tigs.inUnitig(rd3id);
      Unitig          *tg3     = tigs[tg3id];
      uint32           tg3type = getTigType(tg3);

      bb->total += 2;

      //  If this read isn't even in a tig, add to the unused categories.

      if (tgAid == 0) {
        if (rd5id == 0)
          bb->doveUnsatDiff[tUNU][tNOP]++;
        else
          bb->doveUnsatDiff[tUNU][tg5type]++;

        if (rd3id == 0)
          bb->doveUnsatDiff[tUNU][tNOP]++;
        else
          bb->doveUnsatDiff[tUNU][tg3type]++;
      }

      //  Otherwise, it's in a tig, and we need to compare positions.

      else {
        uint32   rdApos = tigs[tgAid]->ufpathIdx(rdAid);
        ufNode  *rdA    = &tigs[tgAid]->ufpath[rdApos];
        bool     rdAfwd = (rdA->position.bgn < rdA->position.end);
        int32    rdAlo  = (rdAfwd) ? rdA->position.bgn : rdA->position.end;
        int32    rdAhi  = (rdAfwd) ? rdA->position.end : rdA->position.bgn;

        //  Different tigs?  Unsatisfied.  Same tig?  Grab the reads and check for overlap.
        if (tgA != tg5) {
          bb->doveUnsatDiff[tgAtype][tg5type]++;
        } else if (rd5id == 0) {
          bb->doveUnsatDiff[tgAtype][tNOP]++;
        } else {
          uint32   rd5pos = tigs[tg5id]->ufpathIdx(rd5id);
          ufNode  *rd5    = &tigs[tg5id]->ufpath[rd5pos];
          bool     rd5fwd = (rd5->position.bgn < rd5->position.end);
          int32    rd5lo  = (rd5fwd) ? rd5->position.bgn : rd5->position.end;
          int32    rd5hi  = (rd5fwd) ? rd5->position.end : rd5->position.bgn;

          if (satisfiedOverlap(rdAlo, rdAhi, rdAfwd, rd5lo, rd5hi, rd5fwd, (b5->read3p() == true))) {
            bb->doveSatSame[tgAtype]++;
          } else {
            bb->doveUnsatSame[tgAtype]++;
          }
        }

        if (tgA != tg3) {
          bb->doveUnsatDiff[tgAtype][tg3type]++;
        } else if (rd3id == 0) {
          bb->doveUnsatDiff[tgAtype][tNOP]++;
        } else {
          uint32   rd3pos = tigs[tg3id]->ufpathIdx(rd3id);
          ufNode  *rd3    = &tigs[tg3id]->ufpath[rd3pos];
          bool     rd3fwd = (rd3->position.bgn < rd3->position.end);
          int32    rd3lo  = (rd3fwd) ? rd3->position.bgn : rd3->position.end;
          int32    rd3hi  = (rd3fwd) ? rd3->position.end : rd3->position.bgn;

          if (satisfiedOverlap(rdAlo, rdAhi, rdAfwd, rd3lo, rd3hi, rd3fwd, (b3->read3p() == false))) {
            bb->doveSatSame[tgAtype]++;
          } else {
            bb->doveUnsatSame[tgAtype]++;
          }
        }
      }
    }

    //  For all overlaps.

    uint32      ovlLen = 0;
    BAToverlap *ovl    = OC->getOverlaps(fi, ovlLen);

    for (uint32 oi=0; oi<ovlLen; oi++) {
      uint32   rdBid     = ovl[oi].b_iid;
      uint32   tgBid     = tigs.inUnitig(rdBid);
      Unitig  *tgB       = tigs[tgBid];
      uint32   tgBtype   = getTigType(tgB);

      bool     isDove    = ((ovl[oi].a_hang < 0) == (ovl[oi].b_hang < 0));
      bool     contReads = OG->isContained(rdAid) || OG->isContained(rdBid);

      //  Figure out what class of overlap we're counting.

      olapsUsed  *used = NULL;

      if      (isDove == false)
        used = cc;
      else if (contReads == true)
        used = dc;
      else
        used = dd;

      used->total++;

      //  If to reads not in a tig, unsatisfied.

      if ((tgAid == 0) || (tgBid == 0)) {
        if (isDove)
          used->doveUnsatDiff[tgAtype][tgBtype]++;
        else
          used->contUnsatDiff[tgAtype][tgBtype]++;
        continue;
      }

      //  If in different tigs, unsatisfied.

      if (tgAid != tgBid) {
        if (isDove)
          used->doveUnsatDiff[tgAtype][tgBtype]++;
        else
          used->contUnsatDiff[tgAtype][tgBtype]++;
        continue;
      }

      //  Else, possibly satisfied.  We need to check positions.
uint32 rdApos = tigs[tgAid]->ufpathIdx(rdAid); ufNode *rdA = &tigs[tgAid]->ufpath[rdApos]; bool rdAfwd = (rdA->position.bgn < rdA->position.end); int32 rdAlo = (rdAfwd) ? rdA->position.bgn : rdA->position.end; int32 rdAhi = (rdAfwd) ? rdA->position.end : rdA->position.bgn; uint32 rdBpos = tigs[tgBid]->ufpathIdx(rdBid); ufNode *rdB = &tigs[tgBid]->ufpath[rdBpos]; bool rdBfwd = (rdB->position.bgn < rdB->position.end); int32 rdBlo = (rdBfwd) ? rdB->position.bgn : rdB->position.end; int32 rdBhi = (rdBfwd) ? rdB->position.end : rdB->position.bgn; // If overlapping and correctly oriented, good enough for now. Do we want to care about // overlap length? Nah, there's enough fudging (still, I think) in placement that it'd be // tough to get that usefully precise. if (satisfiedOverlap(rdAlo, rdAhi, rdAfwd, rdBlo, rdBhi, rdBfwd, ovl[oi].flipped)) { if (isDove) used->doveUnsatSame[tgAtype]++; else used->contUnsatSame[tgAtype]++; } else { if (isDove) used->doveSatSame[tgAtype]++; else used->contSatSame[tgAtype]++; } } } // Merge the symmetrical counts for (uint32 ii=0; ii<5; ii++) { for (uint32 jj=ii+1; jj<5; jj++) { bb->doveUnsatDiff[ii][jj] += bb->doveUnsatDiff[jj][ii]; bb->doveUnsatDiff[jj][ii] = UINT64_MAX; dd->doveUnsatDiff[ii][jj] += dd->doveUnsatDiff[jj][ii]; dd->doveUnsatDiff[jj][ii] = UINT64_MAX; dc->doveUnsatDiff[ii][jj] += dc->doveUnsatDiff[jj][ii]; dc->doveUnsatDiff[jj][ii] = UINT64_MAX; cc->doveUnsatDiff[ii][jj] += cc->doveUnsatDiff[jj][ii]; cc->doveUnsatDiff[jj][ii] = UINT64_MAX; bb->contUnsatDiff[ii][jj] += bb->contUnsatDiff[jj][ii]; bb->contUnsatDiff[jj][ii] = UINT64_MAX; dd->contUnsatDiff[ii][jj] += dd->contUnsatDiff[jj][ii]; dd->contUnsatDiff[jj][ii] = UINT64_MAX; dc->contUnsatDiff[ii][jj] += dc->contUnsatDiff[jj][ii]; dc->contUnsatDiff[jj][ii] = UINT64_MAX; cc->contUnsatDiff[ii][jj] += cc->contUnsatDiff[jj][ii]; cc->contUnsatDiff[jj][ii] = UINT64_MAX; } } // Emit a nicely formatted report. 
#define B(X) (100.0 * (X) / (bb->total)) #define P(X) (100.0 * (X) / (dd->total)) #define Q(X) (100.0 * (X) / (dc->total)) #define R(X) (100.0 * (X) / (cc->total)) char N[FILENAME_MAX]; snprintf(N, FILENAME_MAX, "%s.overlaps", getLogFilePrefix()); errno = 0; FILE *F = fopen(N, "w"); if (errno) return; fprintf(F, "=====================================\n"); fprintf(F, "OVERLAP COUNTS\n"); fprintf(F, "\n"); fprintf(F, "dovetail overlaps (best) " F_U64 "\n", bb->total); fprintf(F, "dovetail overlaps " F_U64 "\n", dd->total); fprintf(F, "dovetail overlaps to contained reads " F_U64 "\n", dc->total); fprintf(F, "containment overlaps " F_U64 "\n", cc->total); fprintf(F, "\n"); fprintf(F, "=====================================\n"); fprintf(F, "BEST EDGE OVERLAP FATE\n"); fprintf(F, "\n"); fprintf(F, "SATISFIED best edges DOVETAIL\n"); fprintf(F, "--------- ------------ -------\n"); fprintf(F, "same-contig %12" F_U64P " %6.2f%%\n", bb->doveSatSame[tCTG], B(bb->doveSatSame[tCTG])); fprintf(F, "same-repeat %12" F_U64P " %6.2f%%\n", bb->doveSatSame[tRPT], B(bb->doveSatSame[tRPT])); fprintf(F, "\n"); fprintf(F, "UNSATISFIED best edges DOVETAIL\n"); fprintf(F, "----------- ------------ -------\n"); fprintf(F, "same-contig %12" F_U64P " %6.2f%%\n", bb->doveUnsatSame[tCTG], B(bb->doveUnsatSame[tCTG])); fprintf(F, "same-repeat %12" F_U64P " %6.2f%%\n", bb->doveUnsatSame[tRPT], B(bb->doveUnsatSame[tRPT])); fprintf(F, "same-unassembled %12" F_U64P " %6.2f%%\n", bb->doveUnsatSame[tUNA], B(bb->doveUnsatSame[tUNA])); fprintf(F, "same-unused %12" F_U64P " %6.2f%%\n", bb->doveUnsatSame[tUNU], B(bb->doveUnsatSame[tUNU])); fprintf(F, "\n"); fprintf(F, "UNSATISFIED best edges DOVETAIL\n"); fprintf(F, "----------- ------------ -------\n"); fprintf(F, "contig-contig %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tCTG][tCTG], B(bb->doveUnsatDiff[tCTG][tCTG])); fprintf(F, "contig-repeat %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tCTG][tRPT], B(bb->doveUnsatDiff[tCTG][tRPT])); fprintf(F, 
"contig-unassembled %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tCTG][tUNA], B(bb->doveUnsatDiff[tCTG][tUNA])); fprintf(F, "contig-unused %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tCTG][tUNU], B(bb->doveUnsatDiff[tCTG][tUNU])); fprintf(F, "contig-none %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tCTG][tNOP], B(bb->doveUnsatDiff[tCTG][tNOP])); fprintf(F, "\n"); //fprintf(F, "repeat-contig %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tRPT][tCTG], B(bb->doveUnsatDiff[tRPT][tCTG])); fprintf(F, "repeat-repeat %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tRPT][tRPT], B(bb->doveUnsatDiff[tRPT][tRPT])); fprintf(F, "repeat-unassembled %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tRPT][tUNA], B(bb->doveUnsatDiff[tRPT][tUNA])); fprintf(F, "repeat-unused %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tRPT][tUNU], B(bb->doveUnsatDiff[tRPT][tUNU])); fprintf(F, "repeat-none %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tRPT][tNOP], B(bb->doveUnsatDiff[tRPT][tNOP])); fprintf(F, "\n"); //fprintf(F, "unassembled-contig %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tUNA][tCTG], B(bb->doveUnsatDiff[tUNA][tCTG])); //fprintf(F, "unassembled-repeat %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tUNA][tRPT], B(bb->doveUnsatDiff[tUNA][tRPT])); fprintf(F, "unassembled-unassembled %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tUNA][tUNA], B(bb->doveUnsatDiff[tUNA][tUNA])); fprintf(F, "unassembled-unused %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tUNA][tUNU], B(bb->doveUnsatDiff[tUNA][tUNU])); fprintf(F, "unassembled-none %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tUNA][tNOP], B(bb->doveUnsatDiff[tUNA][tNOP])); fprintf(F, "\n"); //fprintf(F, "unused-contig %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tUNU][tCTG], B(bb->doveUnsatDiff[tUNU][tCTG])) //fprintf(F, "unused-repeat %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tUNU][tRPT], B(bb->doveUnsatDiff[tUNU][tRPT])); //fprintf(F, "unused-unassembled %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tUNU][tUNA], B(bb->doveUnsatDiff[tUNU][tUNA])); fprintf(F, 
"unused-unused %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tUNU][tUNU], B(bb->doveUnsatDiff[tUNU][tUNU])); fprintf(F, "unused-none %12" F_U64P " %6.2f%%\n", bb->doveUnsatDiff[tUNU][tNOP], B(bb->doveUnsatDiff[tUNU][tNOP])); fprintf(F, "\n"); fprintf(F, "\n"); fprintf(F, "=====================================\n"); fprintf(F, "ALL OVERLAP FATE\n"); fprintf(F, "\n"); fprintf(F, "SATISFIED all overlaps DOVETAIL DOVECONT CONTAINMENT\n"); fprintf(F, "--------- ------------ ------- ------------ ------- ------------ -------\n"); fprintf(F, "same-contig %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveSatSame[tCTG], P(dd->doveSatSame[tCTG]), dc->doveSatSame[tCTG], Q(dc->doveSatSame[tCTG]), cc->contSatSame[tCTG], R(cc->contSatSame[tCTG])); fprintf(F, "same-repeat %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveSatSame[tRPT], P(dd->doveSatSame[tRPT]), dc->doveSatSame[tRPT], Q(dc->doveSatSame[tRPT]), cc->contSatSame[tRPT], R(cc->contSatSame[tRPT])); fprintf(F, "\n"); fprintf(F, "UNSATISFIED all overlaps DOVETAIL DOVECONT CONTAINMENT\n"); fprintf(F, "----------- ------------ ------- ------------ ------- ------------ -------\n"); fprintf(F, "same-contig %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatSame[tCTG], P(dd->doveUnsatSame[tCTG]), dc->doveUnsatSame[tCTG], Q(dc->doveUnsatSame[tCTG]), cc->contUnsatSame[tCTG], R(cc->contUnsatSame[tCTG])); fprintf(F, "same-repeat %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatSame[tRPT], P(dd->doveUnsatSame[tRPT]), dc->doveUnsatSame[tRPT], Q(dc->doveUnsatSame[tRPT]), cc->contUnsatSame[tRPT], R(cc->contUnsatSame[tRPT])); fprintf(F, "same-unassembled %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatSame[tUNA], P(dd->doveUnsatSame[tUNA]), dc->doveUnsatSame[tUNA], Q(dc->doveUnsatSame[tUNA]), cc->contUnsatSame[tUNA], R(cc->contUnsatSame[tUNA])); fprintf(F, "same-unused %12" F_U64P " 
%6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatSame[tUNU], P(dd->doveUnsatSame[tUNU]), dc->doveUnsatSame[tUNU], Q(dc->doveUnsatSame[tUNU]), cc->contUnsatSame[tUNU], R(cc->contUnsatSame[tUNU])); fprintf(F, "\n"); fprintf(F, "UNSATISFIED all overlaps DOVETAIL DOVECONT CONTAINMENT\n"); fprintf(F, "----------- ------------ ------- ------------ ------- ------------ -------\n"); fprintf(F, "contig-contig %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tCTG][tCTG], P(dd->doveUnsatDiff[tCTG][tCTG]), dc->doveUnsatDiff[tCTG][tCTG], Q(dc->doveUnsatDiff[tCTG][tCTG]), cc->contUnsatDiff[tCTG][tCTG], R(cc->contUnsatDiff[tCTG][tCTG])); fprintf(F, "contig-repeat %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tCTG][tRPT], P(dd->doveUnsatDiff[tCTG][tRPT]), dc->doveUnsatDiff[tCTG][tRPT], Q(dc->doveUnsatDiff[tCTG][tRPT]), cc->contUnsatDiff[tCTG][tRPT], R(cc->contUnsatDiff[tCTG][tRPT])); fprintf(F, "contig-unassembled %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tCTG][tUNA], P(dd->doveUnsatDiff[tCTG][tUNA]), dc->doveUnsatDiff[tCTG][tUNA], Q(dc->doveUnsatDiff[tCTG][tUNA]), cc->contUnsatDiff[tCTG][tUNA], R(cc->contUnsatDiff[tCTG][tUNA])); fprintf(F, "contig-unused %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tCTG][tUNU], P(dd->doveUnsatDiff[tCTG][tUNU]), dc->doveUnsatDiff[tCTG][tUNU], Q(dc->doveUnsatDiff[tCTG][tUNU]), cc->contUnsatDiff[tCTG][tUNU], R(cc->contUnsatDiff[tCTG][tUNU])); fprintf(F, "\n"); //fprintf(F, "repeat-contig %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tRPT][tCTG], P(dd->doveUnsatDiff[tRPT][tCTG]), dc->doveUnsatDiff[tRPT][tCTG], Q(dc->doveUnsatDiff[tRPT][tCTG]), cc->contUnsatDiff[tRPT][tCTG], R(cc->contUnsatDiff[tRPT][tCTG])); fprintf(F, "repeat-repeat %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", 
dd->doveUnsatDiff[tRPT][tRPT], P(dd->doveUnsatDiff[tRPT][tRPT]), dc->doveUnsatDiff[tRPT][tRPT], Q(dc->doveUnsatDiff[tRPT][tRPT]), cc->contUnsatDiff[tRPT][tRPT], R(cc->contUnsatDiff[tRPT][tRPT])); fprintf(F, "repeat-unassembled %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tRPT][tUNA], P(dd->doveUnsatDiff[tRPT][tUNA]), dc->doveUnsatDiff[tRPT][tUNA], Q(dc->doveUnsatDiff[tRPT][tUNA]), cc->contUnsatDiff[tRPT][tUNA], R(cc->contUnsatDiff[tRPT][tUNA])); fprintf(F, "repeat-unused %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tRPT][tUNU], P(dd->doveUnsatDiff[tRPT][tUNU]), dc->doveUnsatDiff[tRPT][tUNU], Q(dc->doveUnsatDiff[tRPT][tUNU]), cc->contUnsatDiff[tRPT][tUNU], R(cc->contUnsatDiff[tRPT][tUNU])); fprintf(F, "\n"); //fprintf(F, "unassembled-contig %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tUNA][tCTG], P(dd->doveUnsatDiff[tUNA][tCTG]), dc->doveUnsatDiff[tUNA][tCTG], Q(dc->doveUnsatDiff[tUNA][tCTG]), cc->contUnsatDiff[tUNA][tCTG], R(cc->contUnsatDiff[tUNA][tCTG])); //fprintf(F, "unassembled-repeat %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tUNA][tRPT], P(dd->doveUnsatDiff[tUNA][tRPT]), dc->doveUnsatDiff[tUNA][tRPT], Q(dc->doveUnsatDiff[tUNA][tRPT]), cc->contUnsatDiff[tUNA][tRPT], R(cc->contUnsatDiff[tUNA][tRPT])); fprintf(F, "unassembled-unassembled %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tUNA][tUNA], P(dd->doveUnsatDiff[tUNA][tUNA]), dc->doveUnsatDiff[tUNA][tUNA], Q(dc->doveUnsatDiff[tUNA][tUNA]), cc->contUnsatDiff[tUNA][tUNA], R(cc->contUnsatDiff[tUNA][tUNA])); fprintf(F, "unassembled-unused %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tUNA][tUNU], P(dd->doveUnsatDiff[tUNA][tUNU]), dc->doveUnsatDiff[tUNA][tUNU], Q(dc->doveUnsatDiff[tUNA][tUNU]), cc->contUnsatDiff[tUNA][tUNU], R(cc->contUnsatDiff[tUNA][tUNU])); 
fprintf(F, "\n"); //fprintf(F, "unused-contig %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tUNU][tCTG], P(dd->doveUnsatDiff[tUNU][tCTG]), dc->doveUnsatDiff[tUNU][tCTG], Q(dc->doveUnsatDiff[tUNU][tCTG]), cc->contUnsatDiff[tUNU][tCTG], R(cc->contUnsatDiff[tUNU][tCTG])); //fprintf(F, "unused-repeat %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tUNU][tRPT], P(dd->doveUnsatDiff[tUNU][tRPT]), dc->doveUnsatDiff[tUNU][tRPT], Q(dc->doveUnsatDiff[tUNU][tRPT]), cc->contUnsatDiff[tUNU][tRPT], R(cc->contUnsatDiff[tUNU][tRPT])); //fprintf(F, "unused-unassembled %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tUNU][tUNA], P(dd->doveUnsatDiff[tUNU][tUNA]), dc->doveUnsatDiff[tUNU][tUNA], Q(dc->doveUnsatDiff[tUNU][tUNA]), cc->contUnsatDiff[tUNU][tUNA], R(cc->contUnsatDiff[tUNU][tUNA])); fprintf(F, "unused-unused %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%% %12" F_U64P " %6.2f%%\n", dd->doveUnsatDiff[tUNU][tUNU], P(dd->doveUnsatDiff[tUNU][tUNU]), dc->doveUnsatDiff[tUNU][tUNU], Q(dc->doveUnsatDiff[tUNU][tUNU]), cc->contUnsatDiff[tUNU][tUNU], R(cc->contUnsatDiff[tUNU][tUNU])); fprintf(F, "\n"); fprintf(F, "\n"); fclose(F); delete dd; delete dc; delete cc; delete bb; } canu-1.6/src/bogart/AS_BAT_Instrumentation.H000066400000000000000000000036711314437614700207610ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * This file is derived from: * * src/AS_BAT/AS_BAT_Instrumentation.H * * Modifications by: * * Brian P. Walenz from 2010-DEC-06 to 2013-AUG-01 * are Copyright 2010,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-19 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef INCLUDE_AS_BAT_INSTRUMENTATION #define INCLUDE_AS_BAT_INSTRUMENTATION void checkUnitigMembership(TigVector &tigs); void reportOverlaps(TigVector &tigs, const char *prefix, const char *name); void reportTigs(TigVector &tigs, const char *prefix, const char *name, uint64 genomeSize); void classifyTigsAsUnassembled(TigVector &tigs, uint32 fewReadsNumber, uint32 tooShortLength, double spanFraction, double lowcovFraction, uint32 lowcovDepth); #endif // INCLUDE_AS_BAT_INSTRUMENTATION canu-1.6/src/bogart/AS_BAT_Logging.C000066400000000000000000000155441314437614700171410ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_Logging.C * * Modifications by: * * Brian P. Walenz from 2012-JUL-29 to 2013-AUG-01 * are Copyright 2012-2013 J. 
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-19 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_BAT_Logging.H" class logFileInstance { public: logFileInstance() { file = stderr; prefix[0] = 0; name[0] = 0; part = 0; length = 0; }; ~logFileInstance() { if ((name[0] != 0) && (file)) { fprintf(stderr, "WARNING: open file '%s'\n", name); fclose(file); } }; void set(char const *prefix_, int32 order_, char const *label_, int32 tn_) { if (label_ == NULL) { file = stderr; prefix[0] = 0; name[0] = 0; part = 0; length = 0; return; } snprintf(prefix, FILENAME_MAX, "%s.%03u.%s", prefix_, order_, label_); snprintf(name, FILENAME_MAX, "%s.%03u.%s.thr%03d", prefix_, order_, label_, tn_); }; void rotate(void) { assert(name[0] != 0); fclose(file); file = NULL; length = 0; part++; } void open(void) { char path[FILENAME_MAX]; assert(file == NULL); assert(name[0] != 0); snprintf(path, FILENAME_MAX, "%s.num%03d.log", name, part); errno = 0; file = fopen(path, "w"); if (errno) { writeStatus("setLogFile()-- Failed to open logFile '%s': %s.\n", path, strerror(errno)); writeStatus("setLogFile()-- Will now log to stderr instead.\n"); file = stderr; } }; void close(void) { if ((file != NULL) && (file != stderr)) fclose(file); file = NULL; prefix[0] = 0; name[0] = 0; part = 0; length = 0; }; FILE *file; char prefix[FILENAME_MAX]; char name[FILENAME_MAX]; uint32 part; uint64 length; }; // NONE of the logFileMain/logFileThread is implemented logFileInstance logFileMain; // For writes during non-threaded portions logFileInstance *logFileThread = NULL; // For writes during threaded portions. 
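The per-category logging machinery below keys off a 64-bit flag word: each log category owns one bit (LOG_CHUNK_GRAPH, LOG_BUILD_UNITIG, and so on), and the logFileFlagSet(L) macro reports a category enabled only when every bit in L is set. A minimal standalone sketch of that pattern; the names here are illustrative stand-ins, not canu's actual globals:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for bogart's logFileFlags word; the real flag
// word and its bit constants are defined in AS_BAT_Logging.C.
static uint64_t logFlags = 0;

const uint64_t FLAG_CHUNK_GRAPH  = 0x0000000000000008llu;  // like LOG_CHUNK_GRAPH
const uint64_t FLAG_BUILD_UNITIG = 0x0000000000000010llu;  // like LOG_BUILD_UNITIG

// Mirrors the logFileFlagSet(L) macro in AS_BAT_Logging.H:
// true only if EVERY bit in L is set, not merely some bit.
inline bool flagSet(uint64_t L) {
  return (logFlags & L) == L;
}
```

Because the test is `(flags & L) == L` rather than `(flags & L) != 0`, a mask combining several categories reads as set only when all of them are enabled, which lets callers gate a log line on a conjunction of categories with a single check.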
uint32 logFileOrder = 0; uint64 logFileFlags = 0; uint64 LOG_OVERLAP_SCORING = 0x0000000000000001; // Debug, scoring of overlaps uint64 LOG_ALL_BEST_EDGES = 0x0000000000000002; uint64 LOG_ERROR_PROFILES = 0x0000000000000004; uint64 LOG_CHUNK_GRAPH = 0x0000000000000008; // Report the chunk graph as we build it uint64 LOG_BUILD_UNITIG = 0x0000000000000010; // Report building of initial tigs (both unitig creation and read placement) uint64 LOG_PLACE_UNPLACED = 0x0000000000000020; // Report placing of unplaced reads uint64 LOG_ORPHAN_DETAIL = 0x0000000000000040; uint64 LOG_SPLIT_DISCONTINUOUS = 0x0000000000000080; // uint64 LOG_INTERMEDIATE_TIGS = 0x0000000000000100; // At various spots, dump the current tigs uint64 LOG_SET_PARENT_AND_HANG = 0x0000000000000200; // uint64 LOG_STDERR = 0x0000000000000400; // Write ALL logging to stderr, not the files. uint64 LOG_PLACE_READ = 0x8000000000000000; // Internal use only. char const *logFileFlagNames[64] = { "overlapScoring", "allBestEdges", "errorProfiles", "chunkGraph", "buildUnitig", "placeUnplaced", "orphans", "splitDiscontinuous", // Update made it to here, need repeats "intermediateTigs", "setParentAndHang", "stderr", NULL }; // Closes the current logFile, opens a new one called 'prefix.logFileOrder.label'. If 'label' is // NULL, the logFile is reset to stderr. void setLogFile(char const *prefix, char const *label) { assert(prefix != NULL); // Allocate space. if (logFileThread == NULL) logFileThread = new logFileInstance [omp_get_max_threads()]; // If writing to stderr, that's all we needed to do. if (logFileFlagSet(LOG_STDERR)) return; // Close out the old. logFileMain.close(); for (int32 tn=0; tnname[0] != 0) && (lf->length > maxLength)) { fprintf(lf->file, "logFile()-- size " F_U64 " exceeds limit of " F_U64 "; rotate to new file.\n", lf->length, maxLength); lf->rotate(); } // Open the file if needed. if (lf->file == NULL) lf->open(); // Write the log. 
va_start(ap, fmt); lf->length += vfprintf(lf->file, fmt, ap); va_end(ap); } void flushLog(void) { int32 nt = omp_get_num_threads(); int32 tn = omp_get_thread_num(); logFileInstance *lf = (nt == 1) ? (&logFileMain) : (&logFileThread[tn]); if (lf->file != NULL) fflush(lf->file); } canu-1.6/src/bogart/AS_BAT_Logging.H000066400000000000000000000046661314437614700171510ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_Logging.H * * Modifications by: * * Brian P. Walenz from 2012-JUL-29 to 2013-AUG-01 * are Copyright 2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-AUG-11 to 2014-DEC-19 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #ifndef INCLUDE_AS_BAT_LOGGING #define INCLUDE_AS_BAT_LOGGING #include "AS_global.H" #include "AS_UTL_fileIO.H" #include #include #include #include #include #include #include #include #ifndef BROKEN_CLANG_OpenMP #include #endif void setLogFile(char const *prefix, char const *name); char *getLogFilePrefix(void); void writeStatus(char const *fmt, ...); void writeLog(char const *fmt, ...); void flushLog(void); #define logFileFlagSet(L) ((logFileFlags & L) == L) extern uint64 logFileFlags; extern uint32 logFileOrder; // Used debug tigStore dumps, etc extern uint64 LOG_OVERLAP_SCORING; extern uint64 LOG_ALL_BEST_EDGES; extern uint64 LOG_ERROR_PROFILES; extern uint64 LOG_CHUNK_GRAPH; extern uint64 LOG_BUILD_UNITIG; extern uint64 LOG_PLACE_UNPLACED; extern uint64 LOG_ORPHAN_DETAIL; extern uint64 LOG_SPLIT_DISCONTINUOUS; extern uint64 LOG_INTERMEDIATE_TIGS; extern uint64 LOG_SET_PARENT_AND_HANG; extern uint64 LOG_STDERR; extern uint64 LOG_PLACE_READ; extern char const *logFileFlagNames[64]; #endif // INCLUDE_AS_BAT_LOGGING canu-1.6/src/bogart/AS_BAT_MarkRepeatReads.C000066400000000000000000001115131314437614700205560ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-MAR-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_BAT_ReadInfo.H" #include "AS_BAT_OverlapCache.H" #include "AS_BAT_BestOverlapGraph.H" #include "AS_BAT_AssemblyGraph.H" #include "AS_BAT_Logging.H" #include "AS_BAT_Unitig.H" #include "AS_BAT_MarkRepeatReads.H" #include "intervalList.H" #include "stddev.H" #include using namespace std; // Hack. uint32 MIN_ANCHOR_HANG = 500; // Require reads to be anchored by this many bases at boundaries of repeats. int32 REPEAT_OVERLAP_MIN = 50; #define REPEAT_FRACTION 0.5 #undef OLD_ANNOTATE #undef SHOW_ANNOTATE #undef SHOW_ANNOTATION_RAW // Show all overlaps used to annotate reads #undef SHOW_ANNOTATION_RAW_FILTERED // Show all overlaps filtered by high error rate #undef DUMP_READ_COVERAGE // Each evidence read picks its single best overlap to tig (based on overlaps to reads in the tig). // Filter out evidence that aligns at erate higher than expected. // Collapse to intervals on tig. // If still not significant and not spanned, break. class olapDat { public: olapDat(uint32 b, uint32 e, uint32 r, uint32 p) { tigbgn = b; tigend = e; eviRid = r; eviPid = p; }; bool operator<(const olapDat &that) const { return(tigbgn < that.tigbgn); }; int32 tigbgn; // Location of the overlap on this tig int32 tigend; // uint32 eviRid; // evidence read uint32 eviPid; // evidence read placeID }; bool olapDatByEviRid(const olapDat &A, const olapDat &B) { if (A.eviRid == B.eviRid) return(A.tigbgn < B.tigbgn); return(A.eviRid < B.eviRid); } class breakPointCoords { public: breakPointCoords(int32 bgn, int32 end, bool rpt=false) { _bgn = bgn; _end = end; _rpt = rpt; }; ~breakPointCoords() { }; bool operator<(breakPointCoords const &that) const { return(_bgn < that._bgn); }; int32 _bgn; int32 _end; bool _rpt; }; // Returns the coordinates the overlap intersects on the A read. 
// // lo hi // v v // ------------ // ---------- void olapToReadCoords(ufNode *frg, int32 ahang, int32 bhang, int32 &lo, int32 &hi) { lo = 0; hi = RI->readLength(frg->ident); if (ahang > 0) lo += ahang; // Positive hang! if (bhang < 0) hi += bhang; // Negative hang! assert(0 <= lo); assert(0 <= hi); assert(lo <= hi); assert(lo <= RI->readLength(frg->ident)); assert(hi <= RI->readLength(frg->ident)); } void findUnitigCoverage(Unitig *tig, intervalList &coverage) { intervalList rawcoverage; for (uint32 fi=0; fiufpath.size(); fi++) { ufNode frg = tig->ufpath[fi]; if (frg.position.bgn < frg.position.end) rawcoverage.add(frg.position.bgn, frg.position.end - frg.position.bgn); else rawcoverage.add(frg.position.end, frg.position.bgn - frg.position.end); } coverage.clear(); coverage.depth(rawcoverage); #ifdef DUMP_READ_COVERAGE char fn[FILENAME_MAX]; snprintf(fn, FILENAME_MAX, "%08u.coverage", tig->id()); FILE *F = fopen(fn, "w"); for (uint32 ii=0; ii &BP, Unitig **newTigs, int32 *lowCoord, uint32 *nRepeat, uint32 *nUnique, bool doMove) { if (doMove == true) { memset(newTigs, 0, sizeof(Unitig *) * BP.size()); memset(lowCoord, 0, sizeof(int32) * BP.size()); } else { memset(nRepeat, 0, sizeof(uint32) * BP.size()); memset(nUnique, 0, sizeof(uint32) * BP.size()); } for (uint32 fi=0; fiufpath.size(); fi++) { ufNode &frg = tig->ufpath[fi]; int32 frgbgn = min(frg.position.bgn, frg.position.end); int32 frgend = max(frg.position.bgn, frg.position.end); // Search for the region that matches the read. BP's are sorted in increasing order. It // probably doesn't matter, but makes the logging a little easier to read. uint32 rid = UINT32_MAX; bool rpt = false; //fprintf(stderr, "Searching for placement for read %u at %d-%d\n", frg.ident, frgbgn, frgend); for (uint32 ii=0; ii nUnique[rid]) newTigs[rid]->_isRepeat = true; } newTigs[rid]->addRead(frg, -lowCoord[rid], false); } // Else, we're not moving, just count how many reads came from repeats or uniques. 
else { if (rpt) nRepeat[rid]++; else nUnique[rid]++; } } // Return the number of tigs created. uint32 nTigsCreated = 0; for (uint32 ii=0; ii 0) nTigsCreated++; return(nTigsCreated); } // Over all reads in tgA, return a vector of olapDat (tigBgn, tigEnd, eviRid) // for all reads that overlap into this tig. // // The current AssemblyGraph is backwards to what we need. It has, for each read, the // overlaps from that read that are compatible - but we need to the overlaps to each // read that are compatible, and the two are not symmetric. A can be compatible in tig 1, // but the same overlapping read B can be incompatible with tig 2. // // We can invert the graph at the start of repeat detection, making a list of // read B ---> overlaps to tig N position X-Y, with read A void annotateRepeatsOnRead(AssemblyGraph *AG, TigVector &UNUSED(tigs), Unitig *tig, double UNUSED(deviationRepeat), vector &repeats) { // Over all reads in this tig, // Grab pointers to all incoming edges. // Push those locations onto our output list. for (uint32 ii=0; iiufpath.size(); ii++) { ufNode *read = &tig->ufpath[ii]; vector &rPlace = AG->getReverse(read->ident); #if 0 writeLog("annotateRepeatsOnRead()-- tig %u read #%u %u at %d-%d reverse %u items\n", tig->id(), ii, read->ident, read->position.bgn, read->position.end, rPlace.size()); #endif for (uint32 rr=0; rrgetForward(rID)[pID]; #ifdef SHOW_ANNOTATION_RAW writeLog("annotateRepeatsOnRead()-- tig %u read #%u %u place %u reverse read %u in tig %u placed %d-%d olap %d-%d%s\n", tig->id(), ii, read->ident, rr, rID, tig->inUnitig(rID), fPlace.placedBgn, fPlace.placedEnd, fPlace.olapBgn, fPlace.olapEnd, (fPlace.isUnitig) ? 
" IN_UNITIG" : ""); #endif if ((fPlace.isUnitig == true) || (fPlace.isContig == true)) continue; repeats.push_back(olapDat(fPlace.olapBgn, fPlace.olapEnd, rID, pID)); } } } void mergeAnnotations(vector &repeatOlaps) { sort(repeatOlaps.begin(), repeatOlaps.end(), olapDatByEviRid); #ifdef SHOW_ANNOTATE for (uint32 ii=0; ii &tigMarksR) { for (uint32 fi=0; fiufpath.size(); fi++) { ufNode *frg = &tig->ufpath[fi]; bool frgfwd = (frg->position.bgn < frg->position.end); int32 frglo = (frgfwd) ? frg->position.bgn : frg->position.end; int32 frghi = (frgfwd) ? frg->position.end : frg->position.bgn; bool discarded = false; for (uint32 ri=0; rigetLength()) && // Read at end of tig, spans off the low end (frglo + MIN_ANCHOR_HANG <= tigMarksR.lo(ri))) spanLo = spanHi = true; if (frglo + MIN_ANCHOR_HANG <= tigMarksR.lo(ri)) // Read spanned off the low end spanLo = true; if (tigMarksR.hi(ri) + MIN_ANCHOR_HANG <= frghi) // Read spanned off the high end spanHi = true; if (spanLo && spanHi) { writeLog("discard region %8d:%-8d - contained in read %6u %8d-%8d\n", tigMarksR.lo(ri), tigMarksR.hi(ri), frg->ident, frglo, frghi); tigMarksR.lo(ri) = 0; tigMarksR.hi(ri) = 0; discarded = true; } } if (discarded) tigMarksR.filterShort(1); } } void reportThickestEdgesInRepeats(Unitig *tig, intervalList &tigMarksR) { writeLog("thickest edges to the repeat regions:\n"); for (uint32 ri=0; riufpath.size(); fi++) { ufNode *frg = &tig->ufpath[fi]; bool frgfwd = (frg->position.bgn < frg->position.end); int32 frglo = (frgfwd) ? frg->position.bgn : frg->position.end; int32 frghi = (frgfwd) ? frg->position.end : frg->position.bgn; bool discarded = false; // Overlap off the 5' end of the region. if (frglo <= tigMarksR.lo(ri) && (tigMarksR.lo(ri) <= frghi)) { uint32 olap = frghi - tigMarksR.lo(ri); if (l5 < olap) { l5 = olap; t5 = fi; t5bgn = frglo; // Easier than recomputing it later on... t5end = frghi; } } // Overlap off the 3' end of the region. 
if (frglo <= tigMarksR.hi(ri) && (tigMarksR.hi(ri) <= frghi)) { uint32 olap = tigMarksR.hi(ri) - frglo; if (l3 < olap) { l3 = olap; t3 = fi; t3bgn = frglo; t3end = frghi; } } if (frglo <= tigMarksR.lo(ri) && (tigMarksR.hi(ri) <= frghi)) { writeLog("saved region %8d:%-8d - closest read %6u (%+6d) %8d:%-8d (%+6d) (contained)\n", tigMarksR.lo(ri), tigMarksR.hi(ri), frg->ident, tigMarksR.lo(ri) - frglo, frglo, frghi, frghi - tigMarksR.hi(ri)); } } if (t5 != UINT32_MAX) writeLog("saved region %8d:%-8d - closest 5' read %6u (%+6d) %8d:%-8d (%+6d)\n", tigMarksR.lo(ri), tigMarksR.hi(ri), tig->ufpath[t5].ident, tigMarksR.lo(ri) - t5bgn, t5bgn, t5end, t5end - tigMarksR.hi(ri)); if (t3 != UINT32_MAX) writeLog("saved region %8d:%-8d - closest 3' read %6u (%+6d) %8d:%-8d (%+6d)\n", tigMarksR.lo(ri), tigMarksR.hi(ri), tig->ufpath[t3].ident, tigMarksR.lo(ri) - t3bgn, t3bgn, t3end, t3end - tigMarksR.hi(ri)); } } uint32 * findConfusedEdges(TigVector &tigs, Unitig *tig, intervalList &tigMarksR, double confusedAbsolute, double confusedPercent, vector &confusedEdges) { uint32 *isConfused = new uint32 [tigMarksR.numberOfIntervals()]; memset(isConfused, 0, sizeof(uint32) * tigMarksR.numberOfIntervals()); // Examine every read in this tig. If the read intersects a marked repeat, find the best edge // that continues the tig in either direction. If those reads are in the repeat region, scan all // the overlaps of this read for any that are of comparable length. If any are found, declare // this repeat to be potentially confused. If none are found - for the whole repeat region - // then we can leave the repeat alone. for (uint32 fi=0; fiufpath.size(); fi++) { ufNode *rdA = &tig->ufpath[fi]; uint32 rdAid = rdA->ident; bool rdAfwd = (rdA->position.bgn < rdA->position.end); int32 rdAlo = (rdAfwd) ? rdA->position.bgn : rdA->position.end; int32 rdAhi = (rdAfwd) ? 
rdA->position.end : rdA->position.bgn; double sc = (rdAhi - rdAlo) / (double)RI->readLength(rdAid); if ((OG->isContained(rdAid) == true) || // Don't care about contained or suspicious (OG->isSuspicious(rdAid) == true)) // reads; we'll use the container instead. continue; for (uint32 ri=0; ri don't care about this read! // Compute the position (in the tig) of the best overlaps. int32 tig5bgn=0, tig5end=0; int32 tig3bgn=0, tig3end=0; // Instead of using the best edge - which might not be the edge used in the unitig - // we need to scan the layout to return the previous/next dovetail // Put this in a function - what to return if no best overlap? BestEdgeOverlap *b5 = OG->getBestEdgeOverlap(rdAid, false); BestEdgeOverlap *b3 = OG->getBestEdgeOverlap(rdAid, true); // If the best edge is to a read not in this tig, there is nothing to compare against. // Is this confused by default? Possibly. The unitig was constructed somehow, and that // must then be the edge coming into us. We'll pick it up later. bool b5use = true; bool b3use = true; if (b5->readId() == 0) b5use = false; if (b3->readId() == 0) b3use = false; if ((b5use) && (tig->inUnitig(b5->readId()) != tig->id())) b5use = false; if ((b3use) && (tig->inUnitig(b3->readId()) != tig->id())) b3use = false; // The best edge read is in this tig. If they don't overlap, again, nothing to compare // against. if (b5use) { ufNode *rdB = &tig->ufpath[tig->ufpathIdx(b5->readId())]; uint32 rdBid = rdB->ident; bool rdBfwd = (rdB->position.bgn < rdB->position.end); int32 rdBlo = (rdBfwd) ? rdB->position.bgn : rdB->position.end; int32 rdBhi = (rdBfwd) ? rdB->position.end : rdB->position.bgn; if ((rdAhi < rdBlo) || (rdBhi < rdAlo)) b5use = false; } if (b3use) { ufNode *rdB = &tig->ufpath[tig->ufpathIdx(b3->readId())]; uint32 rdBid = rdB->ident; bool rdBfwd = (rdB->position.bgn < rdB->position.end); int32 rdBlo = (rdBfwd) ? rdB->position.bgn : rdB->position.end; int32 rdBhi = (rdBfwd) ? 
rdB->position.end : rdB->position.bgn; if ((rdAhi < rdBlo) || (rdBhi < rdAlo)) b3use = false; } // If we can use this edge, compute the placement of the overlap on the unitig. // Call #1; if (b5use) { int32 bgn=0, end=0; olapToReadCoords(rdA, b5->ahang(), b5->bhang(), bgn, end); tig5bgn = (rdAfwd) ? (rdAlo + sc * bgn) : (rdAhi - sc * end); tig5end = (rdAfwd) ? (rdAlo + sc * end) : (rdAhi - sc * bgn); assert(tig5bgn < tig5end); if (tig5bgn < 0) tig5bgn = 0; if (tig5end > tig->getLength()) tig5end = tig->getLength(); } // Call #2 if (b3use) { int32 bgn=0, end=0; olapToReadCoords(rdA, b3->ahang(), b3->bhang(), bgn, end); tig3bgn = (rdAfwd) ? (rdAlo + sc * bgn) : (rdAhi - sc * end); tig3end = (rdAfwd) ? (rdAlo + sc * end) : (rdAhi - sc * bgn); assert(tig3bgn < tig3end); if (tig3bgn < 0) tig3bgn = 0; if (tig3end > tig->getLength()) tig3end = tig->getLength(); } // If either of the 5' or 3' overlaps (or both!) are in the repeat region, we need to check for // close overlaps on that end. uint32 len5 = 0; uint32 len3 = 0; if ((rMin < tig5bgn) && (tig5end < rMax) && (b5use)) len5 = RI->overlapLength(rdAid, b5->readId(), b5->ahang(), b5->bhang()); else b5use = false; if ((rMin < tig3bgn) && (tig3end < rMax) && (b3use)) len3 = RI->overlapLength(rdAid, b3->readId(), b3->ahang(), b3->bhang()); else b3use = false; double score5 = len5 * (1 - b5->erate()); double score3 = len3 * (1 - b3->erate()); // Neither of the best edges are in the repeat region; move to the next region and/or read. if (len5 + len3 == 0) continue; // At least one of the best edge overlaps is in the repeat region. Scan for other edges // that are of comparable length and quality. uint32 ovlLen = 0; BAToverlap *ovl = OC->getOverlaps(rdAid, ovlLen); for (uint32 oo=0; ooufpath.size() == 1)) continue; // Skip if this overlap is the best we're trying to match. 
if ((rdBid == b5->readId()) || (rdBid == b3->readId())) continue; // Skip if this overlap is crappy quality if (OG->isOverlapBadQuality(ovl[oo])) continue; // Skip if the read is contained or suspicious. if ((OG->isContained(rdBid) == true) || (OG->isSuspicious(rdBid) == true)) continue; // Skip if the overlap isn't dovetail. bool ovl5 = ovl[oo].AEndIs5prime(); bool ovl3 = ovl[oo].AEndIs3prime(); if ((ovl5 == false) && (ovl3 == false)) continue; // Skip if we're not using this overlap if ((ovl5 == true) && (b5use == false)) continue; if ((ovl3 == true) && (b3use == false)) continue; uint32 rdBpos = tigs[tgBid]->ufpathIdx(rdBid); ufNode *rdB = &tigs[tgBid]->ufpath[rdBpos]; bool rdBfwd = (rdB->position.bgn < rdB->position.end); int32 rdBlo = (rdBfwd) ? rdB->position.bgn : rdB->position.end; int32 rdBhi = (rdBfwd) ? rdB->position.end : rdB->position.bgn; // If the overlap is to a read in a different tig, or // the overlap is to a read in the same tig, but we don't overlap in the tig, check lengths. // Otherwise, the overlap is present in the tig, and can't be confused. if ((tgBid == tig->id()) && (rdBlo <= rdAhi) && (rdAlo <= rdBhi)) continue; uint32 len = RI->overlapLength(rdAid, ovl[oo].b_iid, ovl[oo].a_hang, ovl[oo].b_hang); double score = len * (1 - ovl[oo].erate()); // Compute percent difference. double ad5 = fabs(score - score5); double ad3 = fabs(score - score3); double pd5 = 200 * ad5 / (score + score5); double pd3 = 200 * ad3 / (score + score3); // Skip if this overlap is vastly worse than the best. 
if ((ovl5 == true) && ((ad5 >= confusedAbsolute) || (pd5 > confusedPercent))) { writeLog("tig %7u read %8u pos %7u-%-7u NOT confused by 5' edge to read %8u - best edge read %8u len %6u erate %.4f score %8.2f - alt edge len %6u erate %.4f score %8.2f - absdiff %8.2f percdiff %8.4f\n", tig->id(), rdAid, rdAlo, rdAhi, rdBid, b5->readId(), len5, b5->erate(), score5, len, ovl[oo].erate(), score, ad5, pd5); continue; } if ((ovl3 == true) && ((ad3 >= confusedAbsolute) || (pd3 > confusedPercent))) { writeLog("tig %7u read %8u pos %7u-%-7u NOT confused by 3' edge to read %8u - best edge read %8u len %6u erate %.4f score %8.2f - alt edge len %6u erate %.4f score %8.2f - absdiff %8.2f percdiff %8.4f\n", tig->id(), rdAid, rdAlo, rdAhi, rdBid, b3->readId(), len3, b3->erate(), score3, len, ovl[oo].erate(), score, ad3, pd3); continue; } // Potential confusion! if (ovl5 == true) { writeLog("tig %7u read %8u pos %7u-%-7u IS confused by 5' edge to read %8u - best edge read %8u len %6u erate %.4f score %8.2f - alt edge len %6u erate %.4f score %8.2f - absdiff %8.2f percdiff %8.4f\n", tig->id(), rdAid, rdAlo, rdAhi, rdBid, b5->readId(), len5, b5->erate(), score5, len, ovl[oo].erate(), score, ad5, pd5); confusedEdges.push_back(confusedEdge(rdAid, false, rdBid)); } if (ovl3 == true) { writeLog("tig %7u read %8u pos %7u-%-7u IS confused by 3' edge to read %8u - best edge read %8u len %6u erate %.4f score %8.2f - alt edge len %6u erate %.4f score %8.2f - absdiff %8.2f percdiff %8.4f\n", tig->id(), rdAid, rdAlo, rdAhi, rdBid, b3->readId(), len3, b3->erate(), score3, len, ovl[oo].erate(), score, ad3, pd3); confusedEdges.push_back(confusedEdge(rdAid, true, rdBid)); } isConfused[ri]++; } } // Over all marks (ri) } // Over all reads (fi) return(isConfused); } void discardUnambiguousRepeats(TigVector &tigs, Unitig *tig, intervalList &tigMarksR, double confusedAbsolute, double confusedPercent, vector &confusedEdges) { uint32 *isConfused = findConfusedEdges(tigs, tig, tigMarksR, confusedAbsolute, 
confusedPercent, confusedEdges); // Scan all the regions, and delete any that have no confusion. bool discarded = false; for (uint32 ri=0; ri &tigMarksR) { // Extend, but don't extend past the end of the tig. for (uint32 ii=0; ii(tigMarksR.lo(ii) - MIN_ANCHOR_HANG, 0); tigMarksR.hi(ii) = min(tigMarksR.hi(ii) + MIN_ANCHOR_HANG, tig->getLength()); } // Merge. bool merged = false; for (uint32 ri=1; ri &BP, uint32 nTigs, Unitig **newTigs, uint32 *nRepeat, uint32 *nUnique) { for (uint32 ii=0; iiid(), (repeat == true) ? "repeat" : "unique", rgnbgn, rgnend, nRepeat[ii], nUnique[ii]); else if (nTigs > 1) writeLog("For tig %5u %s region %8d %8d - %6u/%6u reads repeat/unique - unitig %5u created.\n", tig->id(), (repeat == true) ? "repeat" : "unique", rgnbgn, rgnend, nRepeat[ii], nUnique[ii], newTigs[ii]->id()); else writeLog("For tig %5u %s region %8d %8d - %6u/%6u repeat/unique reads - unitig %5u remains unchanged.\n", tig->id(), (repeat == true) ? "repeat" : "unique", rgnbgn, rgnend, nRepeat[ii], nUnique[ii], tig->id()); } } void markRepeatReads(AssemblyGraph *AG, TigVector &tigs, double deviationRepeat, uint32 confusedAbsolute, double confusedPercent, vector &confusedEdges) { uint32 tiLimit = tigs.size(); uint32 numThreads = omp_get_max_threads(); uint32 blockSize = (tiLimit < 100000 * numThreads) ? numThreads : tiLimit / 99999; writeLog("repeatDetect()-- working on " F_U32 " tigs, with " F_U32 " thread%s.\n", tiLimit, numThreads, (numThreads == 1) ? "" : "s"); vector repeatOlaps; // Overlaps to reads promoted to tig coords intervalList tigMarksR; // Marked repeats based on reads, filtered by spanning reads intervalList tigMarksU; // Non-repeat invervals, just the inversion of tigMarksR for (uint32 ti=0; tiufpath.size() == 1) || // Singleton, nothing to do. (tig->_isUnassembled == true)) // Unassembled, don't care. continue; writeLog("Annotating repeats in reads for tig %u/%u.\n", ti, tiLimit); // Clear out all the existing marks. They're not for this tig. 
// Analyze overlaps for each read. For each overlap to a read not in this tig, or not // overlapping in this tig, and of acceptable error rate, add the overlap to repeatOlaps. repeatOlaps.clear(); uint32 fiLimit = tig->ufpath.size(); uint32 numThreads = omp_get_max_threads(); uint32 blockSize = (fiLimit < 100 * numThreads) ? numThreads : fiLimit / 99; annotateRepeatsOnRead(AG, tigs, tig, deviationRepeat, repeatOlaps); writeLog("Annotated with %lu overlaps.\n", repeatOlaps.size()); // Merge marks for the same read into the largest possible. mergeAnnotations(repeatOlaps); // Make a new set of intervals based on all the detected repeats. tigMarksR.clear(); for (uint32 bb=0, ii=0; iigetLength()); // Create the list of intervals we'll use to make new tigs. vector BP; for (uint32 ii=0; ii 1) splitTig(tigs, tig, BP, newTigs, lowCoord, nRepeat, nUnique, true); // Report the tigs created. reportTigsCreated(tig, BP, nTigs, newTigs, nRepeat, nUnique); // Cleanup. delete [] newTigs; delete [] lowCoord; delete [] nRepeat; delete [] nUnique; // Remove the old unitig....if we made new ones. if (nTigs > 1) { tigs[tig->id()] = NULL; delete tig; } } #if 0 FILE *F = fopen("junk.confusedEdges", "w"); for (uint32 ii=0; ii &confusedEdges); #endif // INCLUDE_AS_BAT_MARKREPEATREADS canu-1.6/src/bogart/AS_BAT_MergeOrphans.C000066400000000000000000000663241314437614700201470ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * This file is derived from: * * src/bogart/AS_BAT_PopBubbles.C * * Modifications by: * * Brian P. Walenz beginning on 2016-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_BAT_ReadInfo.H" #include "AS_BAT_OverlapCache.H" #include "AS_BAT_BestOverlapGraph.H" #include "AS_BAT_Logging.H" #include "AS_BAT_Unitig.H" #include "AS_BAT_PlaceReadUsingOverlaps.H" #include "AS_BAT_Instrumentation.H" #include "AS_BAT_MergeOrphans.H" #include "intervalList.H" #include #include #include using namespace std; #undef SHOW_MULTIPLE_PLACEMENTS // Reports reads that are placed multiple times in a single target region class candidatePop { public: candidatePop(Unitig *orphan_, Unitig *target_, uint32 bgn_, uint32 end_) { orphan = orphan_; target = target_; bgn = bgn_; end = end_; }; Unitig *orphan; Unitig *target; uint32 bgn; uint32 end; vector placed; }; // A list of the target tigs that a orphan could be popped into. typedef map > BubTargetList; // Decide which tigs can be orphans. Any unitig where (nearly) every dovetail read has an overlap // to some other unitig is a candidate for orphan popping. void findPotentialOrphans(TigVector &tigs, BubTargetList &potentialOrphans) { writeStatus("\n"); writeStatus("findPotentialOrphans()-- working on " F_U32 " tigs.\n", tigs.size()); for (uint32 ti=0; tiufpath.size() == 1)) // Singleton, handled elsewhere. continue; // Count the number of reads that have an overlap to some other tig. tigOlapsTo[otherTig] = count. map tigOlapsTo; uint32 nonContainedReads = 0; bool validOrphan = true; for (uint32 fi=0; fiufpath.size(); fi++) { uint32 rid = tig->ufpath[fi].ident; if (OG->isContained(rid) == true) // Don't need to check contained reads. If their container continue; // passes the tests below, the contained read will too. 
nonContainedReads++; // Find the list of tigs that we have an overlap to. set readOlapsTo; uint32 ovlLen = 0; BAToverlap *ovl = OC->getOverlaps(rid, ovlLen); for (uint32 oi=0; oiufpath.size() == 1) || // that is shorter than us. We can not pop this (ovlTig->id() == tig->id()) || // tig as a orphan in any of those cases. (ovlTig->getLength() < tig->getLength())) // continue; readOlapsTo.insert(ovlTigID); // Otherwise, remember that we had an overlap to ovlTig. } // With the list of tigs that this read has an overlap to, add one to each tig in the list of // tigs that this tig has an overlap to. for (set::iterator it=readOlapsTo.begin(); it != readOlapsTo.end(); ++it) tigOlapsTo[*it]++; // Decide if we're a valid potential orphan. If tig id (in it->first) has overlaps to // (nearly) every read we've seen so far (nonContainedReads), we're still a valid orphan. validOrphan = false; for (map::iterator it=tigOlapsTo.begin(); it != tigOlapsTo.end(); ++it) if (it->second == nonContainedReads) // All reads have an overlap to the tig validOrphan = true; // at *it, so valid orphan. if (validOrphan == false) // If not a valid orphan, bail. There is no other break; // tig that all of our reads have overlaps to. } // If not a valid orphan, just move on to the next tig. if (validOrphan == false) continue; // Otherwise, a valid orphan! There is at least one tig that (nearly) every dovetail read has // at least one overlap to. Save those tigs in potentialOrphans. 
uint32 nTigs = 0; for (map::iterator it=tigOlapsTo.begin(); it != tigOlapsTo.end(); ++it) if (it->second >= 0.5 * nonContainedReads) nTigs++; writeLog("findPotentialOrphans()--\n"); writeLog("findPotentialOrphans()-- potential orphan tig %8u length %9u nReads %7u to %3u tigs:\n", tig->id(), tig->getLength(), tig->ufpath.size(), nTigs); for (map::iterator it=tigOlapsTo.begin(); it != tigOlapsTo.end(); ++it) { if (it->second >= 0.5 * nonContainedReads) { Unitig *dest = tigs[it->first]; writeLog("findPotentialOrphans()-- tig %8u length %9u nReads %7u\n", dest->id(), dest->getLength(), dest->ufpath.size()); potentialOrphans[ti].push_back(dest->id()); } } } // Over all tigs. flushLog(); } // Find filtered placements for all the reads in the potential orphan tigs. vector * findOrphanReadPlacements(TigVector &tigs, BubTargetList &potentialOrphans, double deviationOrphan) { uint32 fiLimit = RI->numReads(); uint32 fiNumThreads = omp_get_max_threads(); uint32 fiBlockSize = (fiLimit < 1000 * fiNumThreads) ? fiNumThreads : fiLimit / 999; uint64 nReads = 0; uint64 nPlaces = 0; vector *placed = new vector [fiLimit + 1]; writeLog("findOrphanReadPlacement()--\n"); #pragma omp parallel for schedule(dynamic, fiBlockSize) for (uint32 fi=0; fiisContained(fi)) || // Read is contained, ignore it. (potentialOrphans.count(rdAtigID) == 0)) // Read isn't in a potential orphan, ignore it. continue; #pragma omp atomic nReads++; Unitig *rdAtig = tigs[rdAtigID]; ufNode *rdA = &rdAtig->ufpath[ tigs.ufpathIdx(fi) ]; bool rdAfwd = (rdA->position.bgn < rdA->position.end); int32 rdAlo = (rdAfwd) ? rdA->position.bgn : rdA->position.end; int32 rdAhi = (rdAfwd) ? rdA->position.end : rdA->position.bgn; bool isEnd = (rdAlo == 0) || (rdAhi == rdAtig->getLength()); // Compute all placements for this read. We ask for only fully placed reads. 
vector placements; placeReadUsingOverlaps(tigs, NULL, rdA->ident, placements, placeRead_fullMatch); // Weed out placements that aren't for orphans, or that are for orphans but are poor quality. Or are to ourself! for (uint32 pi=0; piufpath.size() == 1) || // To a singleton tig. (potentialOrphans.count((rdBtigID) > 0))) // To a potential orphan tig continue; // Ignore the placement if it isn't to one of our orphan-popping candidate tigs. bool dontcare = true; vector &porphans = potentialOrphans[rdAtigID]; for (uint32 pb=0; pb tig %6u %6u reads at %8u-%-8u (cov %7.5f erate %6.4f) - NOT CANDIDATE TIG\n", rdAtigID, placements[pi].frgID, placements[pi].tigID, rdBtig->ufpath.size(), placements[pi].position.bgn, placements[pi].position.end, placements[pi].fCoverage, erate); continue; } // Ignore the placement if it is too diverged from the destination tig. if (rdBtig->overlapConsistentWithTig(deviationOrphan, lo, hi, erate) < 0.5) { if (logFileFlagSet(LOG_ORPHAN_DETAIL)) writeLog("findOrphanReadPlacement()-- tig %6u read %8u -> tig %6u %6u reads at %8u-%-8u (cov %7.5f erate %6.4f) - HIGH ERROR\n", rdAtigID, placements[pi].frgID, placements[pi].tigID, rdBtig->ufpath.size(), placements[pi].position.bgn, placements[pi].position.end, placements[pi].fCoverage, erate); continue; } // Good placement! 
if (logFileFlagSet(LOG_ORPHAN_DETAIL)) writeLog("findOrphanReadPlacement()-- tig %6u read %8u -> tig %6u %6u reads at %8u-%-8u (cov %7.5f erate %6.4f)\n", rdAtigID, placements[pi].frgID, placements[pi].tigID, rdBtig->ufpath.size(), placements[pi].position.bgn, placements[pi].position.end, placements[pi].fCoverage, erate); #pragma omp atomic nPlaces++; placed[fi].push_back(placements[pi]); } } writeLog("findOrphanReadPlacement()-- placed %u reads into %u locations\n", nReads, nPlaces); return(placed); } static bool failedToPlaceAnchor(Unitig *orphan, vector *placed) { uint32 nReads = orphan->ufpath.size(); char placed0 = ((nReads > 0) && (placed[ orphan->ufpath[ 0 ].ident ].size() > 0)) ? 't' : '-'; char placed1 = ((nReads > 1) && (placed[ orphan->ufpath[ 1 ].ident ].size() > 0)) ? 't' : '-'; char placedb = ((nReads > 1) && (placed[ orphan->ufpath[ nReads-2 ].ident ].size() > 0)) ? 't' : '-'; char placeda = ((nReads > 0) && (placed[ orphan->ufpath[ nReads-1 ].ident ].size() > 0)) ? 't' : '-'; char placedS[128]; uint32 placedN = 0; bool failed = false; if (nReads > 3) for (uint32 fi=2; fiufpath[fi].ident].size() > 0) placedN++; switch (nReads) { case 0: assert(0); break; case 1: snprintf(placedS, 128, "%c", placed0); break; case 2: snprintf(placedS, 128, "%c%c", placed0, placeda); break; case 3: snprintf(placedS, 128, "%c%c%c", placed0, placed1, placeda); break; case 4: snprintf(placedS, 128, "%c%c%c%c", placed0, placed1, placedb, placeda); break; default: snprintf(placedS, 128, "%c%c[%u]%c%c", placed0, placed1, placedN, placedb, placeda); break; } failed = ((placed0 != 't') || (placeda != 't')); writeLog("failedToPlaceAnchor()-- potential orphan tig %8u (reads %5u length %8u) - placed %s%s\n", orphan->id(), nReads, orphan->getLength(), placedS, failed ? 
" FAILED" : ""); return(failed); } static void addInitialIntervals(Unitig *orphan, vector *placed, uint32 fReadID, uint32 lReadID, map *> &targetIntervals) { uint32 orphanLen = orphan->getLength(); // Add extended intervals for the first read. // // target --------------------------------------------- // read ------- // orphan ------------------------- for (uint32 pp=0; pp; targetIntervals[tid]->add(bgn, orphanLen); // Don't care if it goes off the high end of the tig. } // Add extended intervals for the last read. // // target --------------------------------------------- // read ------- // orphan ------------------------- for (uint32 pp=0; pp; if (end < orphanLen) targetIntervals[tid]->add(0, end); // Careful! Negative will underflow! else targetIntervals[tid]->add(end - orphanLen, orphanLen); } } static void saveCorrectlySizedInitialIntervals(Unitig *orphan, Unitig *target, intervalList *IL, uint32 fReadID, uint32 lReadID, vector *placed, vector &targets) { IL->merge(); // Merge overlapping initial intervals created above. for (uint32 ii=0; iinumberOfIntervals(); ii++) { bool noFirst = true; bool noLast = true; uint32 intBgn = IL->lo(ii); uint32 intEnd = IL->hi(ii); SeqInterval fPos; SeqInterval lPos; // Find the read placement in this interval, if it exists. for (uint32 pp=0; ppid() == placed[fReadID][pp].tigID) && (intBgn <= fPos.min()) && (fPos.max() <= intEnd)) { noFirst = false; break; } } for (uint32 pp=0; ppid() == placed[lReadID][pp].tigID) && (intBgn <= lPos.min()) && (lPos.max() <= intEnd)) { noLast = false; break; } } // Ignore if missing either read. if ((noFirst == true) || (noLast == true)) { writeLog("saveCorrectlySizedInitialIntervals()-- potential orphan tig %8u (length %8u) - target %8u %8u-%-8u (length %8u) - MISSING %s%s%s READ%s\n", orphan->id(), orphan->getLength(), target->id(), intBgn, intEnd, intEnd - intBgn, (noFirst) ? "FIRST" : "", (noFirst && noLast) ? " and " : "", (noLast) ? "LAST" : "", (noFirst && noLast) ? 
"S" : ""); continue; } writeLog("saveCorrectlySizedInitialIntervals()-- potential orphan tig %8u (length %8u) - target %8u %8u-%-8u (length %8u) - %8u-%-8u %8u-%-8u\n", orphan->id(), orphan->getLength(), target->id(), intBgn, intEnd, intEnd - intBgn, fPos.min(), fPos.max(), lPos.min(), lPos.max()); // Ignore if the region is too small or too big. uint32 regionMin = min(fPos.min(), lPos.min()); uint32 regionMax = max(fPos.max(), lPos.max()); if ((regionMax - regionMin < 0.75 * orphan->getLength()) || (regionMax - regionMin > 1.25 * orphan->getLength())) continue; // We probably should be checking orientation. Maybe tomorrow. // Both reads placed, and at about the right size. Save the candidate position - we can // possibly place 'orphan' in 'tigs[target->id()' at position regionMin-regionMax. targets.push_back(new candidatePop(orphan, target, regionMin, regionMax)); } // Over all intervals for this target // We're done with this intervalList, clean it up. This does leave a dangling pointer in the map<> though. delete IL; } void assignReadsToTargets(Unitig *orphan, vector *placed, vector targets) { for (uint32 fi=0; fiufpath.size(); fi++) { uint32 readID = orphan->ufpath[fi].ident; for (uint32 pp=0; pptarget->id() == tid) && // if the target is the same tig and the read (isContained(bgn, end, targets[tt]->bgn, targets[tt]->end))) // is contained in the target position, targets[tt]->placed.push_back(placed[readID][pp]); // save the position to the target } } // Remove duplicate placements from each target. // // Detect duplicates, keep the one with lower error. // There are a lot of duplicate placements, logging isn't terribly useful. 
uint32 nDup = 0; uint32 save; uint32 remo; for (uint32 tt=0; ttplaced.size(); aa++) { for (uint32 bb=0; bbplaced.size(); bb++) { if ((aa == bb) || (t->placed[aa].frgID != t->placed[bb].frgID) || (t->placed[aa].frgID == 0) || (t->placed[bb].frgID == 0)) continue; nDup++; if (t->placed[aa].errors / t->placed[aa].aligned < t->placed[bb].errors / t->placed[bb].aligned) { save = aa; remo = bb; } else { save = bb; remo = aa; } #ifdef SHOW_MULTIPLE_PLACEMENTS writeLog("assignReadsToTargets()-- duplicate read alignment for tig %u read %u - better %u-%-u %.4f - worse %u-%-u %.4f\n", t->placed[save].tigID, t->placed[save].frgID, t->placed[save].position.bgn, t->placed[save].position.end, t->placed[save].errors / t->placed[save].aligned, t->placed[remo].position.bgn, t->placed[remo].position.end, t->placed[remo].errors / t->placed[remo].aligned); #endif t->placed[remo] = overlapPlacement(); } } // Get rid of any now-empty entries. for (uint32 aa=t->placed.size(); aa--; ) { if (t->placed[aa].frgID == 0) { t->placed[aa] = t->placed.back(); t->placed.pop_back(); } } } writeLog("assignReadsToTargets()-- Removed %u duplicate placements.\n", nDup); } void mergeOrphans(TigVector &tigs, double deviationOrphan) { // Find, for each tig, the list of other tigs that it could potentially be placed into. BubTargetList potentialOrphans; findPotentialOrphans(tigs, potentialOrphans); writeStatus("mergeOrphans()-- Found " F_SIZE_T " potential orphans.\n", potentialOrphans.size()); writeLog("\n"); writeLog("mergeOrphans()-- Found " F_SIZE_T " potential orphans.\n", potentialOrphans.size()); writeLog("\n"); // For any tig that is a potential orphan, find all read placements. vector *placed = findOrphanReadPlacements(tigs, potentialOrphans, deviationOrphan); // We now have, in 'placed', a list of all the places that each read could be placed. Decide if there is a _single_ // place for each orphan to be popped. 
uint32 nUniqOrphan = 0; uint32 nReptOrphan = 0; for (uint32 ti=0; tigetLength(), orphan->ufpath.size()); // Create intervals for each placed read. // // target --------------------------------------------- // read ------- // orphan ------------------------- uint32 fReadID = orphan->ufpath.front().ident; uint32 lReadID = orphan->ufpath.back().ident; map *> targetIntervals; addInitialIntervals(orphan, placed, fReadID, lReadID, targetIntervals); // Figure out if each interval has both the first and last read of some orphan, and if those // are properly sized. If so, save a candidatePop. vector targets; for (map *>::iterator it=targetIntervals.begin(); it != targetIntervals.end(); ++it) saveCorrectlySizedInitialIntervals(orphan, tigs[it->first], // The targetID in targetIntervals it->second, // The interval list in targetIntervals fReadID, lReadID, placed, targets); targetIntervals.clear(); // intervalList already freed. // If no targets, nothing to do. writeLog("mergeOrphans()-- Processing orphan %u - found %u target location%s\n", ti, targets.size(), (targets.size() == 1) ? "" : "s"); if (targets.size() == 0) continue; // Assign read placements to targets. assignReadsToTargets(orphan, placed, targets); // Compare the orphan against each target. uint32 nOrphan = 0; // Number of targets that have all the reads. uint32 orphanTarget = 0; // If nOrphan == 1, the target we're popping into. for (uint32 tt=0; ttufpath.size(); uint32 targetSize = targets[tt]->placed.size(); // Report now, before we nuke targets[tt] for being not a orphan! 
if (logFileFlagSet(LOG_ORPHAN_DETAIL)) for (uint32 op=0; opplaced.size(); op++) writeLog("mergeOrphans()-- tig %8u length %9u -> target %8u piece %2u position %9u-%-9u length %8u - read %7u at %9u-%-9u\n", orphan->id(), orphan->getLength(), targets[tt]->target->id(), tt, targets[tt]->bgn, targets[tt]->end, targets[tt]->end - targets[tt]->bgn, targets[tt]->placed[op].frgID, targets[tt]->placed[op].position.bgn, targets[tt]->placed[op].position.end); writeLog("mergeOrphans()-- tig %8u length %9u -> target %8u piece %2u position %9u-%-9u length %8u - expected %3" F_SIZE_TP " reads, had %3" F_SIZE_TP " reads.\n", orphan->id(), orphan->getLength(), targets[tt]->target->id(), tt, targets[tt]->bgn, targets[tt]->end, targets[tt]->end - targets[tt]->bgn, orphanSize, targetSize); // If all reads placed, we can merge this orphan into the target. Preview: if this happens more than once, we just // split the orphan and place reads individually. if (orphanSize == targetSize) { nOrphan++; orphanTarget = tt; } } // If a unique orphan placement, place it there. if (nOrphan == 1) { writeLog("mergeOrphans()-- tig %8u length %8u reads %6u - orphan\n", orphan->id(), orphan->getLength(), orphan->ufpath.size()); nUniqOrphan++; for (uint32 op=0, tt=orphanTarget; opplaced.size(); op++) { ufNode frg; frg.ident = targets[tt]->placed[op].frgID; frg.contained = 0; frg.parent = 0; frg.ahang = 0; frg.bhang = 0; frg.position.bgn = targets[tt]->placed[op].position.bgn; frg.position.end = targets[tt]->placed[op].position.end; writeLog("mergeOrphans()-- move read %u from tig %u to tig %u %u-%-u\n", frg.ident, orphan->id(), targets[tt]->target->id(), frg.position.bgn, frg.position.end); targets[tt]->target->addRead(frg, 0, false); } writeLog("\n"); tigs[orphan->id()] = NULL; delete orphan; } // If multiply placed, we can't distinguish between them, and // instead just place reads where they individually decide to go. 
if (nOrphan > 1) { writeLog("tig %8u length %8u reads %6u - orphan with multiple placements\n", orphan->id(), orphan->getLength(), orphan->ufpath.size()); nReptOrphan++; for (uint32 fi=0; fiufpath.size(); fi++) { uint32 rr = orphan->ufpath[fi].ident; double er = 1.00; uint32 bb = 0; // Over all placements for this read, pick the one with lowest error, as long as it isn't // to the orphan. for (uint32 pp=0; ppid())) // Self placement. continue; er = erate; bb = pp; } assert(rr == placed[rr][bb].frgID); ufNode frg; frg.ident = placed[rr][bb].frgID; frg.contained = 0; frg.parent = 0; frg.ahang = 0; frg.bhang = 0; frg.position.bgn = placed[rr][bb].position.bgn; frg.position.end = placed[rr][bb].position.end; Unitig *target = tigs[placed[rr][bb].tigID]; writeLog("move read %u from tig %u to tig %u %u-%-u\n", frg.ident, orphan->id(), target->id(), frg.position.bgn, frg.position.end); assert(target->id() != orphan->id()); target->addRead(frg, 0, false); } writeLog("\n"); tigs[orphan->id()] = NULL; delete orphan; } // Clean up the targets list. for (uint32 tt=0; ttufpath.size() == 1)) // Singleton, already sorted. continue; tig->sort(); } } canu-1.6/src/bogart/AS_BAT_MergeOrphans.H000066400000000000000000000023401314437614700201400ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/bogart/AS_BAT_PopBubbles.H * * Modifications by: * * Brian P. 
Walenz beginning on 2016-DEC-07
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#ifndef INCLUDE_AS_BAT_MERGEORPHANS
#define INCLUDE_AS_BAT_MERGEORPHANS

#include "AS_global.H"
#include "AS_BAT_BestOverlapGraph.H"
#include "AS_BAT_Unitig.H"

void mergeOrphans(TigVector &tigs, double deviationBubble);

#endif  //  INCLUDE_AS_BAT_MERGEORPHANS

canu-1.6/src/bogart/AS_BAT_OptimizePositions.C

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  This file is derived from:
 *
 *    src/bogart/AS_BAT_Unitig.C
 *
 *  Modifications by:
 *
 *    Brian P. Walenz beginning on 2017-JUL-17
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.

*/ #include "AS_global.H" #include "AS_BAT_Unitig.H" #include "AS_BAT_ReadInfo.H" #include "AS_BAT_BestOverlapGraph.H" #include "AS_BAT_Logging.H" class optPos { public: optPos() { }; ~optPos() { }; void set(ufNode &n) { ident = n.ident; min = n.position.min(); max = n.position.max(); fwd = n.position.isForward(); }; uint32 ident; double min; double max; bool fwd; }; void Unitig::optimize_initPlace(uint32 ii, optPos *op, optPos *np, bool firstPass, set &failed, bool beVerbose) { uint32 iid = ufpath[ii].ident; double nmin = 0; int32 cnt = 0; if ((firstPass == false) && (failed.count(iid) == 0)) // If the second pass and not return; // failed, do nothing. if (firstPass == false) writeLog("optimize_initPlace()-- Second pass begins.\n"); // Then process all overlaps. if (ii > 0) { uint32 ovlLen = 0; BAToverlap *ovl = OC->getOverlaps(iid, ovlLen); for (uint32 oo=0; oo ii) ? "after" : "before"); if (isOvl == false) // Skip if the reads continue; // don't overlap if ((firstPass) && (jj > ii)) // We're setting initial positions, so overlaps to reads after continue; // us aren't correct, unless we're in the 2nd pass // Reads overlap. Compute the position of the read using // the overlap and the other read. nmin += (op[iid].fwd) ? (op[jid].min - ovl[oo].a_hang) : (op[jid].min + ovl[oo].b_hang); cnt += 1; } // over all overlaps // If no overlaps found, flag this read for a second pass. If in the second pass, // not much we can do. 
if ((firstPass == true) && (cnt == 0)) { writeLog("optimize_initPlace()-- Failed to find overlaps for read %u in tig %u at %d-%d (first pass)\n", iid, id(), ufpath[ii].position.bgn, ufpath[ii].position.end); failed.insert(iid); return; } if ((firstPass == false) && (cnt == 0)) { writeLog("optimize_initPlace()-- Failed to find overlaps for read %u in tig %u at %d-%d (second pass)\n", iid, id(), ufpath[ii].position.bgn, ufpath[ii].position.end); flushLog(); } assert(cnt > 0); } // The initialization above does very little to enforce read lengths, and the optimization // doesn't put enough weight in the read length to make it stable. We simply force // the correct read length here. op[iid].min = (cnt == 0) ? 0 : (nmin / cnt); op[iid].max = op[iid].min + RI->readLength(ufpath[ii].ident); np[iid].min = 0; np[iid].max = 0; if (beVerbose) writeLog("optimize_initPlace()-- tig %7u read %9u initialized to position %9.2f %9.2f%s\n", id(), op[iid].ident, op[iid].min, op[iid].max, (firstPass == true) ? "" : " SECONDPASS"); } void Unitig::optimize_recompute(uint32 iid, optPos *op, optPos *np, bool beVerbose) { uint32 ii = ufpathIdx(iid); int32 readLen = RI->readLength(iid); uint32 ovlLen = 0; BAToverlap *ovl = OC->getOverlaps(iid, ovlLen); double nmin = 0.0; double nmax = 0.0; uint32 cnt = 0; if (beVerbose) { writeLog("optimize()-- tig %8u read %8u previous - %9.2f-%-9.2f\n", id(), iid, op[iid].min, op[iid].max); writeLog("optimize()-- tig %8u read %8u length - %9.2f-%-9.2f\n", id(), iid, op[iid].max - readLen, op[iid].min + readLen); } // Process all overlaps. for (uint32 oo=0; ooreadLength(iid); double opiimin = op[iid].min; // New start of this read, same as the old start double opiimax = op[iid].min + readLen; // New end of this read double opiilen = op[iid].max - op[iid].min; if (readLen <= opiilen) // This read is sufficiently long, continue; // do nothing. 
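The initialization pass above estimates a read's start coordinate as the mean, over its usable overlaps, of the position implied by each overlapping read; for a forward read that is the neighbor's start minus the overlap's a-hang. A sketch of that averaging (the Olap struct and hang values are illustrative test data, not canu's BAToverlap):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Olap {               //  Hypothetical: one overlap to an already-placed read.
  double  neighborMin;      //  Placed start of the overlapping read.
  int32_t aHang;            //  a-hang of the overlap.
};

//  Average the positions implied by each overlap; zero if no overlaps,
//  matching the (cnt == 0) ? 0 : (nmin / cnt) fallback above.
double estimateStart(const std::vector<Olap> &ovl) {
  double nmin = 0;
  int    cnt  = 0;

  for (const Olap &o : ovl) {
    nmin += o.neighborMin - o.aHang;   //  Forward-read case from the code above.
    cnt  += 1;
  }

  return (cnt == 0) ? 0 : (nmin / cnt);
}
```

The read's end is then forced to start plus read length, since the averaging alone does not preserve read lengths.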
double scale = readLen / opiilen; double expand = opiimax - op[iid].max; // Amount we changed this read, bases // For each read, adjust positions based on how much they overlap with this read. for (uint32 jj=0; jjreadLength(iid); int32 opll = (int32)op[iid].max - (int32)op[iid].min; double opdd = 200.0 * (opll - readLen) / (opll + readLen); if (op[iid].fwd) { if (beVerbose) writeLog("optimize()-- read %8u -> from %9d,%-9d %7d to %9d,%-9d %7d readLen %7d diff %7.4f%%\n", iid, ufpath[ii].position.bgn, ufpath[ii].position.end, ufpath[ii].position.end - ufpath[ii].position.bgn, (int32)op[iid].min, (int32)op[iid].max, opll, readLen, opdd); ufpath[ii].position.bgn = (int32)op[iid].min; ufpath[ii].position.end = (int32)op[iid].max; } else { if (beVerbose) writeLog("optimize()-- read %8u <- from %9d,%-9d %7d to %9d,%-9d %7d readLen %7d diff %7.4f%%\n", iid, ufpath[ii].position.bgn, ufpath[ii].position.end, ufpath[ii].position.bgn - ufpath[ii].position.end, (int32)op[iid].max, (int32)op[iid].min, opll, readLen, opdd); ufpath[ii].position.bgn = (int32)op[iid].max; ufpath[ii].position.end = (int32)op[iid].min; } } } void TigVector::optimizePositions(const char *prefix, const char *label) { uint32 numThreads = omp_get_max_threads(); uint32 tiLimit = size(); uint32 tiBlockSize = 10; //(tiLimit < 10 * numThreads) ? numThreads : tiLimit / 9; uint32 fiLimit = RI->numReads() + 1; uint32 fiBlockSize = 100; //(fiLimit < 1000 * numThreads) ? numThreads : fiLimit / 999; bool beVerbose = false; writeStatus("optimizePositions()-- Optimizing read positions for %u reads in %u tigs, with %u thread%s.\n", tiLimit, fiLimit, numThreads, (numThreads == 1) ? "" : "s"); // Create work space and initialize to current read positions. 
writeStatus("optimizePositions()-- Allocating scratch space for %u reads (%u KB).\n", fiLimit, sizeof(optPos) * fiLimit * 2 >> 1024); optPos *pp = NULL; optPos *op = new optPos [fiLimit]; optPos *np = new optPos [fiLimit]; memset(op, 0, sizeof(optPos) * fiLimit); memset(np, 0, sizeof(optPos) * fiLimit); for (uint32 fi=0; fiufpath[pp]); np[fi].set(operator[](ti)->ufpath[pp]); } // Compute initial positions using previously placed reads and the read length. // // Initialize positions using only reads before us. If any reads fail to find overlaps, a second // round will init positions using any read (before or after). // writeStatus("optimizePositions()-- Initializing positions with %u threads.\n", numThreads); #pragma omp parallel for schedule(dynamic, tiBlockSize) for (uint32 ti=0; ti failed; if (tig == NULL) continue; for (uint32 ii=0; iiufpath.size(); ii++) tig->optimize_initPlace(ii, op, np, true, failed, beVerbose); for (uint32 ii=0; iiufpath.size(); ii++) tig->optimize_initPlace(ii, op, np, false, failed, true); } // // Recompute positions using all overlaps and reads both before and after. Do this for a handful of iterations // so it somewhat stabilizes. // for (uint32 iter=0; iter<5; iter++) { // Recompute positions writeStatus("optimizePositions()-- Recomputing positions, iteration %u, with %u threads.\n", iter+1, numThreads); #pragma omp parallel for schedule(dynamic, fiBlockSize) for (uint32 fi=0; fioptimize_recompute(fi, op, np, beVerbose); } // Reset zero writeStatus("optimizePositions()-- Reset zero.\n"); for (uint32 ti=0; tiufpath[0].ident ].min; for (uint32 ii=0; iiufpath.size(); ii++) { uint32 iid = tig->ufpath[ii].ident; np[iid].min -= z; np[iid].max -= z; } } // Decide if we've converged. We used to compute percent difference in coordinates, but that is // biased by the position of the read. Just use percent difference from read length. 
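The convergence check that follows computes, per read, twice the movement of each endpoint divided by the read length, and calls the read converged when both values are below 0.005, i.e. each endpoint moved by less than 0.25% of the read length. A sketch of that test (function name is hypothetical):

```cpp
#include <cassert>
#include <cmath>

//  A read has converged when 2 * |delta| / readLen < 0.005 at both ends,
//  the same threshold used in the iteration above.
bool readConverged(double oldMin, double newMin,
                   double oldMax, double newMax, double readLen) {
  double minp = std::fabs(2 * (oldMin - newMin) / readLen);
  double maxp = std::fabs(2 * (oldMax - newMax) / readLen);

  return ((minp < 0.005) && (maxp < 0.005));
}
```

Normalizing by read length rather than by coordinate keeps the test unbiased: a read near the end of a long tig would otherwise look converged simply because its coordinates are large.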
writeStatus("optimizePositions()-- Checking convergence.\n"); uint32 nConverged = 0; uint32 nChanged = 0; for (uint32 fi=0; fireadLength(fi)); double maxp = 2 * (op[fi].max - np[fi].max) / (RI->readLength(fi)); if (minp < 0) minp = -minp; if (maxp < 0) maxp = -maxp; if ((minp < 0.005) && (maxp < 0.005)) nConverged++; else nChanged++; } // All reads processed, swap op and np for the next iteration. pp = op; op = np; np = pp; writeStatus("optimizePositions()-- converged: %6u reads\n", nConverged); writeStatus("optimizePositions()-- changed: %6u reads\n", nChanged); if (nChanged == 0) break; } // // Reset small reads. If we've placed a read too small, expand it (and all reads that overlap) // to make the length not smaller. // writeStatus("optimizePositions()-- Expanding short reads with %u threads.\n", numThreads); #pragma omp parallel for schedule(dynamic, tiBlockSize) for (uint32 ti=0; tioptimize_expand(op); } // // Update the tig with new positions. op[] is the result of the last iteration. // writeStatus("optimizePositions()-- Updating positions.\n"); for (uint32 ti=0; tioptimize_setPositions(op, beVerbose); tig->cleanUp(); } // Cleanup and finish. delete [] op; delete [] np; writeStatus("optimizePositions()-- Finished.\n"); } canu-1.6/src/bogart/AS_BAT_Outputs.C000066400000000000000000000064031314437614700172300ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
 *
 *  This file is derived from:
 *
 *    src/AS_BAT/AS_BAT_Outputs.C
 *
 *  Modifications by:
 *
 *    Brian P. Walenz from 2010-NOV-23 to 2014-MAR-31
 *      are Copyright 2010-2014 J. Craig Venter Institute, and
 *      are subject to the GNU General Public License version 2
 *
 *    Brian P. Walenz from 2014-NOV-17 to 2015-JUN-05
 *      are Copyright 2014-2015 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *    Brian P. Walenz beginning on 2016-JAN-04
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *    Sergey Koren beginning on 2016-MAR-30
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#include "AS_BAT_ReadInfo.H"
#include "AS_BAT_OverlapCache.H"
#include "AS_BAT_BestOverlapGraph.H"
#include "AS_BAT_Logging.H"
#include "AS_BAT_Unitig.H"
#include "AS_BAT_PlaceReadUsingOverlaps.H"

#include "tgStore.H"

void
writeTigsToStore(TigVector  &tigs,
                 char       *filePrefix,
                 char       *storeName,
                 bool        isFinal) {
  char  filename[FILENAME_MAX] = {0};

  snprintf(filename, FILENAME_MAX, "%s.%sStore", filePrefix, storeName);

  tgStore  *tigStore = new tgStore(filename);
  tgTig    *tig      = new tgTig;

  for (uint32 ti=0; ti<tigs.size(); ti++) {
    Unitig  *utg = tigs[ti];

    if ((utg == NULL) ||
        (utg->getNumReads() == 0))
      continue;

    assert(utg->getLength() > 0);

    //  Initialize the output tig.

    tig->clear();

    tig->_tigID           = utg->id();
    tig->_coverageStat    = 1.0;  //  Default to just barely unique
    tig->_microhetProb    = 1.0;  //  Default to 100% probability of unique

    //  Set the class and some flags.

    tig->_class           = (utg->_isUnassembled == true) ? tgTig_unassembled : tgTig_contig;
    tig->_suggestRepeat   = utg->_isRepeat;
    tig->_suggestCircular = utg->_isCircular;

    tig->_layoutLen       = utg->getLength();

    //  Transfer reads from the bogart tig to the output tig.
    resizeArray(tig->_children, tig->_childrenLen, tig->_childrenMax, utg->ufpath.size(), resizeArray_doNothing);

    for (uint32 ti=0; ti<utg->ufpath.size(); ti++) {
      ufNode  *frg = &utg->ufpath[ti];

      tig->addChild()->set(frg->ident, frg->parent, frg->ahang, frg->bhang, frg->position.bgn, frg->position.end);
    }

    //  And write to the store

    tigStore->insertTig(tig, false);
  }

  delete tig;
  delete tigStore;
}

canu-1.6/src/bogart/AS_BAT_Outputs.H

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  This file is derived from:
 *
 *    src/AS_BAT/AS_BAT_Outputs.H
 *
 *  Modifications by:
 *
 *    Brian P. Walenz from 2010-DEC-06 to 2013-AUG-01
 *      are Copyright 2010-2013 J. Craig Venter Institute, and
 *      are subject to the GNU General Public License version 2
 *
 *    Brian P. Walenz from 2014-DEC-19 to 2014-DEC-23
 *      are Copyright 2014 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *    Brian P. Walenz beginning on 2016-JAN-11
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#ifndef INCLUDE_AS_BAT_OUTPUTS
#define INCLUDE_AS_BAT_OUTPUTS

#include "AS_BAT_Unitig.H"
#include "AS_BAT_TigVector.H"

void writeTigsToStore(TigVector &tigs, char *filePrefix, char *storeName, bool isFinal);

#endif  //  INCLUDE_AS_BAT_OUTPUTS

canu-1.6/src/bogart/AS_BAT_OverlapCache.C

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  This file is derived from:
 *
 *    src/AS_BAT/AS_BAT_OverlapCache.C
 *
 *  Modifications by:
 *
 *    Brian P. Walenz from 2011-FEB-15 to 2013-OCT-14
 *      are Copyright 2011-2013 J. Craig Venter Institute, and
 *      are subject to the GNU General Public License version 2
 *
 *    Sergey Koren on 2012-JAN-11
 *      are Copyright 2012 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *    Brian P. Walenz from 2014-AUG-06 to 2015-JUN-25
 *      are Copyright 2014-2015 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *    Brian P. Walenz beginning on 2016-JAN-11
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *    Sergey Koren beginning on 2016-APR-26
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
*/ #include "AS_BAT_ReadInfo.H" #include "AS_BAT_OverlapCache.H" #include "AS_BAT_BestOverlapGraph.H" // sizeof(BestEdgeOverlap) #include "AS_BAT_Unitig.H" // sizeof(ufNode) #include "AS_BAT_Logging.H" #include "memoryMappedFile.H" #include uint64 ovlCacheMagic = 0x65686361436c766fLLU; //0102030405060708LLU; #undef TEST_LINEAR_SEARCH #define ERR_MASK (((uint64)1 << AS_MAX_EVALUE_BITS) - 1) #define SALT_BITS (64 - AS_MAX_READLEN_BITS - AS_MAX_EVALUE_BITS) #define SALT_MASK (((uint64)1 << SALT_BITS) - 1) OverlapCache::OverlapCache(const char *ovlStorePath, const char *prefix, double maxErate, uint32 minOverlap, uint64 memlimit, uint64 genomeSize, bool doSave) { _prefix = prefix; writeStatus("\n"); if (memlimit == UINT64_MAX) { _memLimit = getPhysicalMemorySize(); writeStatus("OverlapCache()-- limited to " F_U64 "MB memory (total physical memory).\n", _memLimit >> 20); writeStatus("\n"); } else if (memlimit > 0) { _memLimit = memlimit; writeStatus("OverlapCache()-- limited to " F_U64 "MB memory (user supplied).\n", _memLimit >> 20); writeStatus("\n"); } else { _memLimit = UINT64_MAX; writeStatus("OverlapCache()-- using unlimited memory (-M 0).\n"); writeStatus("\n"); } // Account for memory used by read data, best overlaps, and tigs. // The chunk graph is temporary, and should be less than the size of the tigs. // Likewise, the buffers used for loading and scoring overlaps aren't accounted for. // // NOTES: // // memFI - read length, // // memUT - worst case, we have one unitig per read. also, maps of read-to-unitig and read-to-vector-position. // // memEP - each read adds two epValue points, the open and close points, and two uint32 pointers // to the data. // // memEO - overlaps for computing error profiles. this is definitely a hack, but I can't think of // any reasonable estimates. just reserve 25% of memory, which then dominates our accounting. 
// // memOS - make sure we're this much below using all the memory - allows for other stuff to run, // and a little buffer in case we're too big. uint64 memFI = RI->memoryUsage(); uint64 memBE = RI->numReads() * sizeof(BestOverlaps); uint64 memUT = RI->numReads() * sizeof(Unitig) + RI->numReads() * sizeof(uint32) * 2; uint64 memUL = RI->numReads() * sizeof(ufNode); uint64 memEP = RI->numReads() * sizeof(uint32) * 2 + RI->numReads() * Unitig::epValueSize() * 2; uint64 memEO = (_memLimit == UINT64_MAX) ? (0.0) : (0.25 * _memLimit); uint64 memOS = (_memLimit < 0.9 * getPhysicalMemorySize()) ? (0.0) : (0.1 * getPhysicalMemorySize()); uint64 memST = ((RI->numReads() + 1) * (sizeof(BAToverlap *) + sizeof(uint32)) + // Cache pointers (RI->numReads() + 1) * sizeof(uint32) + // Num olaps stored per read (RI->numReads() + 1) * sizeof(uint32)); // Num olaps allocated per read _memReserved = memFI + memBE + memUL + memUT + memEP + memEO + memST + memOS; _memStore = memST; _memAvail = (_memReserved + _memStore < _memLimit) ? 
(_memLimit - _memReserved - _memStore) : 0; _memOlaps = 0; writeStatus("OverlapCache()-- %7" F_U64P "MB for read data.\n", memFI >> 20); writeStatus("OverlapCache()-- %7" F_U64P "MB for best edges.\n", memBE >> 20); writeStatus("OverlapCache()-- %7" F_U64P "MB for tigs.\n", memUT >> 20); writeStatus("OverlapCache()-- %7" F_U64P "MB for tigs - read layouts.\n", memUL >> 20); writeStatus("OverlapCache()-- %7" F_U64P "MB for tigs - error profiles.\n", memEP >> 20); writeStatus("OverlapCache()-- %7" F_U64P "MB for tigs - error profile overlaps.\n", memEO >> 20); writeStatus("OverlapCache()-- %7" F_U64P "MB for other processes.\n", memOS >> 20); writeStatus("OverlapCache()-- ---------\n"); writeStatus("OverlapCache()-- %7" F_U64P "MB for data structures (sum of above).\n", _memReserved >> 20); writeStatus("OverlapCache()-- ---------\n"); writeStatus("OverlapCache()-- %7" F_U64P "MB for overlap store structure.\n", _memStore >> 20); writeStatus("OverlapCache()-- %7" F_U64P "MB for overlap data.\n", _memAvail >> 20); writeStatus("OverlapCache()-- ---------\n"); writeStatus("OverlapCache()-- %7" F_U64P "MB allowed.\n", _memLimit >> 20); writeStatus("OverlapCache()--\n"); if (_memAvail == 0) { writeStatus("OverlapCache()-- Out of memory before loading overlaps; increase -M.\n"); exit(1); } _maxEvalue = AS_OVS_encodeEvalue(maxErate); _minOverlap = minOverlap; // Allocate space to load overlaps. With a NULL gkpStore we can't call the bgn or end methods. _ovsMax = 16; _ovs = ovOverlap::allocateOverlaps(NULL, _ovsMax); _ovsSco = new uint64 [_ovsMax]; _ovsTmp = new uint64 [_ovsMax]; // Allocate pointers to overlaps. _overlapLen = new uint32 [RI->numReads() + 1]; _overlapMax = new uint32 [RI->numReads() + 1]; _overlaps = new BAToverlap * [RI->numReads() + 1]; memset(_overlapLen, 0, sizeof(uint32) * (RI->numReads() + 1)); memset(_overlapMax, 0, sizeof(uint32) * (RI->numReads() + 1)); memset(_overlaps, 0, sizeof(BAToverlap *) * (RI->numReads() + 1)); // Open the overlap store. 
ovStore *ovlStore = new ovStore(ovlStorePath, NULL); // Load overlaps! computeOverlapLimit(ovlStore, genomeSize); loadOverlaps(ovlStore, doSave); delete [] _ovs; _ovs = NULL; // There is a small cost with these arrays that we'd delete [] _ovsSco; _ovsSco = NULL; // like to not have, and a big cost with ovlStore (in that delete [] _ovsTmp; _ovsTmp = NULL; // it loaded updated erates into memory), so release delete ovlStore; ovlStore = NULL; // these before symmetrizing overlaps. symmetrizeOverlaps(); } OverlapCache::~OverlapCache() { delete [] _overlaps; delete [] _overlapLen; delete [] _overlapMax; delete _overlapStorage; } // Decide on limits per read. // // From the memory limit, we can compute the average allowed per read. If this is higher than // the expected coverage, we'll not fill memory completely as the reads in unique sequence will // have fewer than this number of overlaps. // // We'd like to iterate this, but the unused space computation assumes all reads are assigned // the same amount of memory. On the next iteration, this isn't true any more. The benefit is // (hopefully) small, and the algorithm is unknown. // // This isn't perfect. It estimates based on whatever is in the store, not only those overlaps // below the error threshold. Result is that memory usage is far below what it should be. Easy to // fix if we assume all reads have the same properties (same library, same length, same error // rate) but not so easy in reality. We need big architecture changes to make it easy (grouping // reads by library, collecting statistics from the overlaps, etc). // // It also doesn't distinguish between 5' and 3' overlaps - it is possible for all the long // overlaps to be off of one end. 
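The iterative threshold estimate described in the comment above can be sketched in isolation. This is a simplified model, not the real computeOverlapLimit(): estimateMaxPerRead and its parameters are illustrative names, the below/equal cases are folded into one branch, and it assumes the starting cap already fits the memory budget.

```cpp
#include <cstdint>
#include <cstddef>
#include <cassert>

// Given per-read overlap counts, a per-overlap cost, and a memory budget,
// raise the per-read cap until the remaining budget can no longer be
// spread across the truncated reads.
uint32_t estimateMaxPerRead(const uint32_t *numPer, uint32_t nReads,
                            uint64_t memAvail, size_t olapSize,
                            uint32_t maxPer) {
  while (true) {
    uint64_t olapLoad = 0;   // overlaps loaded at this threshold
    uint32_t numAbove = 0;   // reads truncated by the threshold

    for (uint32_t i = 0; i < nReads; i++)
      if (numPer[i] <= maxPer) {
        olapLoad += numPer[i];       // read fits; load everything
      } else {
        olapLoad += maxPer;          // read truncated to the cap
        numAbove++;
      }

    if (numAbove == 0)               // nothing truncated; done
      break;

    uint64_t olapFree = memAvail / olapSize - olapLoad;
    uint64_t increase = olapFree / numAbove;   // spread the slack evenly

    if (increase == 0)               // budget exhausted; done
      break;

    maxPer += increase;
  }

  return maxPer;
}
```

With counts {10, 10, 100}, a budget of 60 overlaps, and a starting cap of 20, the cap grows once (20 → 40) and then the budget is exhausted.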
//
void
OverlapCache::computeOverlapLimit(ovStore *ovlStore, uint64 genomeSize) {

  ovlStore->resetRange();

  uint32  frstRead = 0;
  uint32  lastRead = 0;
  uint32 *numPer   = ovlStore->numOverlapsPerFrag(frstRead, lastRead);
  uint32  totlRead = lastRead - frstRead + 1;

  //  Set the minimum number of overlaps per read to twice coverage.  Then set the maximum number of
  //  overlaps per read to a guess of what it will take to fill up memory.

  _minPer = 2 * RI->numBases() / genomeSize;
  _maxPer = _memAvail / (RI->numReads() * sizeof(BAToverlap));

  writeStatus("OverlapCache()-- Retain at least " F_U32 " overlaps/read, based on %.2fx coverage.\n", _minPer, (double)RI->numBases() / genomeSize);
  writeStatus("OverlapCache()-- Initial guess at " F_U32 " overlaps/read.\n", _maxPer);
  writeStatus("OverlapCache()--\n");

  if (_maxPer < _minPer)
    writeStatus("OverlapCache()-- Not enough memory to load the minimum number of overlaps; increase -M.\n"), exit(1);

  uint64  totalOlaps = ovlStore->numOverlapsInRange();

  uint64  olapLoad = 0;  //  Total overlaps we would load at this threshold
  uint64  olapMem  = 0;

  uint32  numBelow = 0;  //  Number of reads below the threshold
  uint32  numEqual = 0;
  uint32  numAbove = 0;  //  Number of reads above the threshold

  writeStatus("OverlapCache()-- Adjusting for sparse overlaps.\n");
  writeStatus("OverlapCache()--\n");
  writeStatus("OverlapCache()-- reads loading olaps olaps memory\n");
  writeStatus("OverlapCache()-- olaps/read all some loaded free\n");
  writeStatus("OverlapCache()-- ---------- ------- ------- ----------- ------- --------\n");

  while (true) {
    olapLoad = 0;
    numBelow = 0;
    numEqual = 0;
    numAbove = 0;

    for (uint32 i=0; i<totlRead; i++) {
      if (numPer[i] < _maxPer) {
        olapLoad += numPer[i];
        numBelow++;

      } else if (numPer[i] == _maxPer) {
        olapLoad += numPer[i];
        numEqual++;

      } else {
        olapLoad += _maxPer;
        numAbove++;
      }
    }

    olapMem = olapLoad * sizeof(BAToverlap);

    writeStatus("OverlapCache()--   %10" F_U32P " %7" F_U32P " %7" F_U32P " %11" F_U64P " %7" F_U64P "MB\n",
                _maxPer, numBelow + numEqual, numAbove, olapLoad, (_memAvail - olapMem) >> 20);

    //  If there are no more overlaps to load, we're done.

    if (numAbove == 0)
      break;

    //  Otherwise, there is still (potentially) space left for more overlaps.
    //  Estimate how much higher we could push the threshold: compute how many more overlaps we
    //  could load before exceeding the memory limit, then assume we'd load that many overlaps
    //  for each of the numAbove reads.

    int64  olapFree = (_memAvail - olapMem) / sizeof(BAToverlap);
    int64  increase = olapFree / numAbove;

    if (increase == 0)
      break;

    _maxPer += increase;
  }

  //  We used to (pre 6 Jul 2017) do the symmetry check only if we didn't load all overlaps.
  //  However, symmetry can also break if we use an error rate cutoff because - for reasons not
  //  explored - the error rate on symmetric overlaps differs.  So, just enable this always.
  //
  //  On a moderate coverage human nanopore assembly, it does:
  //
  //    OverlapCache()-- Symmetrizing overlaps -- finding missing twins.
  //    OverlapCache()--   -- found 8609 missing twins in 51413413 overlaps, 8002 are strong.
  //    OverlapCache()-- Symmetrizing overlaps -- dropping weak non-twin overlaps.
  //    OverlapCache()--   -- dropped 454 overlaps.
  //    OverlapCache()-- Symmetrizing overlaps -- adding 8155 missing twin overlaps.

  _checkSymmetry = (numAbove > 0) ? true : false;
  _checkSymmetry = true;

  delete [] numPer;
}



uint32
OverlapCache::filterDuplicates(uint32 &no) {
  uint32  nFiltered = 0;

  for (uint32 ii=0, jj=1; jj<no; ii++, jj++) {
    if (_ovs[ii].b_iid != _ovs[jj].b_iid)
      continue;

    //  Found two overlaps to the same read; keep the better of the pair.

    nFiltered++;

    uint32  iilen = RI->overlapLength(_ovs[ii].a_iid, _ovs[ii].b_iid, _ovs[ii].a_hang(), _ovs[ii].b_hang());
    uint32  jjlen = RI->overlapLength(_ovs[jj].a_iid, _ovs[jj].b_iid, _ovs[jj].a_hang(), _ovs[jj].b_hang());

    if (iilen == jjlen) {
      if (_ovs[ii].evalue() < _ovs[jj].evalue())
        jjlen = 0;
      else
        iilen = 0;
    }

    if (iilen < jjlen)
      _ovs[ii].a_iid = _ovs[ii].b_iid = 0;
    else
      _ovs[jj].a_iid = _ovs[jj].b_iid = 0;
  }

  //  If nothing was filtered, return.

  if (nFiltered == 0)
    return(0);

  //  Squeeze out the filtered overlaps.  We used to just copy the last element over any deleted
  //  ones, leaving the list unsorted, but we're now (Nov 2016) binary searching on it, so can't do
  //  that.

  //  Needs to have its own log.  Lots of stuff here.
  //writeLog("OverlapCache()-- read %u filtered %u overlaps to the same read pair\n", _ovs[0].a_iid, nFiltered);

  for (uint32 ii=0, jj=0; jj<no; jj++) {
    if (_ovs[jj].a_iid == 0)
      continue;

    _ovs[ii++] = _ovs[jj];
  }

  no -= nFiltered;

  return(nFiltered);
}



uint32
OverlapCache::filterOverlaps(uint32 maxEvalue, uint32 minOverlap, uint32 no) {
  uint32  ns        = 0;
  bool    beVerbose = false;

  for (uint32 ii=0; ii<no; ii++) {
    _ovsSco[ii] = 0;   //  Overlaps left with score zero are dropped.

    if ((RI->readLength(_ovs[ii].a_iid) == 0) ||   //  At least one read in the overlap is deleted
        (RI->readLength(_ovs[ii].b_iid) == 0)) {
      if (beVerbose)
        fprintf(stderr, "olap %d involves deleted reads - %u %s - %u %s\n",
                ii,
                _ovs[ii].a_iid, (RI->readLength(_ovs[ii].a_iid) == 0) ? "deleted" : "active",
                _ovs[ii].b_iid, (RI->readLength(_ovs[ii].b_iid) == 0) ? "deleted" : "active");
      continue;
    }

    if (_ovs[ii].evalue() > maxEvalue) {   //  Too noisy to care
      if (beVerbose)
        fprintf(stderr, "olap %d too noisy evalue %f > maxEvalue %f\n",
                ii, AS_OVS_decodeEvalue(_ovs[ii].evalue()), AS_OVS_decodeEvalue(maxEvalue));
      continue;
    }

    uint32  olen = RI->overlapLength(_ovs[ii].a_iid, _ovs[ii].b_iid, _ovs[ii].a_hang(), _ovs[ii].b_hang());

    if (olen < minOverlap) {   //  Too short to care
      if (beVerbose)
        fprintf(stderr, "olap %d too short olen %u minOverlap %u\n",
                ii, olen, minOverlap);
      continue;
    }

    //  Just right!

    _ovsSco[ii]   = olen;
    _ovsSco[ii] <<= AS_MAX_EVALUE_BITS;
    _ovsSco[ii]  |= (~_ovs[ii].evalue()) & ERR_MASK;
    _ovsSco[ii] <<= SALT_BITS;
    _ovsSco[ii]  |= ii & SALT_MASK;

    ns++;
  }

  if (ns <= _maxPer)   //  Fewer overlaps than the limit, no filtering needed.
    return(ns);

  //  Otherwise, filter out the short and low quality overlaps and count how many we saved.

  memcpy(_ovsTmp, _ovsSco, sizeof(uint64) * no);
  sort(_ovsTmp, _ovsTmp + no);

  uint64  minScore = _ovsTmp[no - _maxPer];

  ns = 0;

  for (uint32 ii=0; ii<no; ii++)
    if (_ovsSco[ii] < minScore)
      _ovsSco[ii] = 0;
    else
      ns++;

  return(ns);
}



void
OverlapCache::loadOverlaps(ovStore *ovlStore, bool doSave) {

  ovlStore->resetRange();

  uint64  numTotal  = 0;
  uint64  numLoaded = 0;
  uint64  numDups   = 0;
  uint32  numReads  = 0;
  uint64  numStore  = ovlStore->numOverlapsInRange();

  if (numStore == 0)
    writeStatus("ERROR: No overlaps in overlap store?\n"), exit(1);

  _overlapStorage = new OverlapStorage(ovlStore->numOverlapsInRange());

  while (1) {
    uint32  numOvl = ovlStore->numberOfOverlaps();   //  Query how many overlaps for the next read.

    if (numOvl == 0)   //  If no overlaps, we're at the end of the store.
      break;

    if (_ovsMax < numOvl) {
      delete [] _ovs;
      delete [] _ovsSco;
      delete [] _ovsTmp;

      _ovsMax = numOvl + 1024;
      _ovs    = ovOverlap::allocateOverlaps(NULL /* gkpStore */, _ovsMax);
      _ovsSco = new uint64 [_ovsMax];
      _ovsTmp = new uint64 [_ovsMax];
    }

    assert(numOvl <= _ovsMax);

    //  Actually load the overlaps, then detect and remove overlaps between the same pair, then
    //  filter short and low quality overlaps.

    uint32  no = ovlStore->readOverlaps(_ovs, _ovsMax);         //  no == total overlaps == numOvl
    uint32  nd = filterDuplicates(no);                          //  nd == duplicated overlaps (no is decreased by this amount)
    uint32  ns = filterOverlaps(_maxEvalue, _minOverlap, no);   //  ns == acceptable overlaps

    //if (_ovs[0].a_iid == 3514657)
    //  fprintf(stderr, "Loaded %u overlaps - no %u nd %u ns %u\n", numOvl, no, nd, ns);

    //  Allocate space for the overlaps.  Allocate a multiple of 8k, assumed to be the page size.
    //
    //  If we're loading all overlaps (ns == no) we don't need to overallocate.  Otherwise, we're
    //  loading only some of them and might have to make a twin later.
    //
    //  Once allocated copy the good overlaps.

    if (ns > 0) {
      uint32  id = _ovs[0].a_iid;

      _overlapMax[id] = ns;
      _overlapLen[id] = ns;
      _overlaps[id]   = _overlapStorage->get(_overlapMax[id]);

      _memOlaps += _overlapMax[id] * sizeof(BAToverlap);

      uint32  oo = 0;

      for (uint32 ii=0; ii<no; ii++) {
        if (_ovsSco[ii] == 0)   //  Skip overlaps removed by the filters above.
          continue;

        _overlaps[id][oo].evalue    = _ovs[ii].evalue();
        _overlaps[id][oo].a_hang    = _ovs[ii].a_hang();
        _overlaps[id][oo].b_hang    = _ovs[ii].b_hang();
        _overlaps[id][oo].flipped   = _ovs[ii].flipped();
        _overlaps[id][oo].filtered  = false;
        _overlaps[id][oo].symmetric = false;
        _overlaps[id][oo].a_iid     = _ovs[ii].a_iid;
        _overlaps[id][oo].b_iid     = _ovs[ii].b_iid;

        oo++;
      }

      assert(oo == ns);
    }

    //  Accounting.

    numTotal  += numOvl;
    numLoaded += ns;
    numDups   += nd;
    numReads++;
  }
}



void
OverlapCache::symmetrizeOverlaps(void) {

  uint32  fiLimit    = RI->numReads();
  uint32  numThreads = omp_get_max_threads();
  uint32  blockSize  = (fiLimit < 100 * numThreads) ? numThreads : fiLimit / 99;

  if (_checkSymmetry == false)
    return;

  uint32  *nonsymPerRead = new uint32 [RI->numReads() + 1];   //  Overlap in this read is missing its twin

  //  For each overlap, see if the twin overlap exists.  It is tempting to skip searching if the
  //  b-read has loaded all overlaps (the overlap we're searching for must exist) but we can't.
  //  We must still mark the overlap as being symmetric.
  writeStatus("OverlapCache()--\n");
  writeStatus("OverlapCache()-- Symmetrizing overlaps.\n");
  writeStatus("OverlapCache()-- Finding missing twins.\n");

#pragma omp parallel for schedule(dynamic, blockSize)
  for (uint32 rr=0; rr<RI->numReads()+1; rr++) {
    nonsymPerRead[rr] = 0;

    for (uint32 oo=0; oo<_overlapLen[rr]; oo++) {
      uint32  rb = _overlaps[rr][oo].b_iid;

      if (_overlaps[rr][oo].symmetric == true)   //  If already marked, we're done.
        continue;

      //  Search for the twin overlap, and if found, we're done.  The twin is marked as symmetric in the function.

      if (searchForOverlap(_overlaps[rb], _overlapLen[rb], rr)) {
        _overlaps[rr][oo].symmetric = true;
        continue;
      }

      //  Didn't find a twin.  Count how many overlaps we need to create duplicates of.

      nonsymPerRead[rr]++;
    }
  }

  uint64  nOverlaps = 0;
  uint64  nOnly     = 0;
  uint64  nCritical = 0;

  for (uint32 rr=0; rr<RI->numReads()+1; rr++) {
    nOverlaps += _overlapLen[rr];
    nOnly     += nonsymPerRead[rr];

    if (_overlapLen[rr] <= _minPer)
      nCritical += nonsymPerRead[rr];
  }

  writeStatus("OverlapCache()-- Found %llu missing twins in %llu overlaps, %llu are strong.\n", nOnly, nOverlaps, nCritical);

  //  Score all the overlaps (again) and drop the lower quality ones.  We need to drop half of the
  //  non-twin overlaps, but also want to retain some minimum number.

  //  But, there are a bunch of overlaps that fall below our score threshold that are symmetric.  We
  //  need to keep these, only because figuring out which ones are 'saved' above will be a total
  //  pain in the ass.

  //  Allocate some scratch space for each thread

  uint64 **ovsScoScratch   = new uint64 * [numThreads];
  uint64 **ovsTmpScratch   = new uint64 * [numThreads];
  uint64  *nDroppedScratch = new uint64   [numThreads];

  for (uint32 tt=0; tt<numThreads; tt++) {
    ovsScoScratch[tt]   = new uint64 [_maxPer + 1];
    ovsTmpScratch[tt]   = new uint64 [_maxPer + 1];
    nDroppedScratch[tt] = 0;
  }

  writeStatus("OverlapCache()-- Dropping weak non-twin overlaps; " F_U64 "MB scratch space allocated.\n",
              (uint64)(2 * sizeof(uint64) * numThreads * (_maxPer + 1)) >> 20);

  //  As advertised, score all the overlaps and drop the weak ones.

  double fractionToDrop = 0.6;

#pragma omp parallel for schedule(dynamic, blockSize)
  for (uint32 rr=0; rr<RI->numReads()+1; rr++) {
    if (_overlapLen[rr] <= _minPer)   //  If already too few overlaps, leave them all as is.
      continue;

    uint64  *ovsSco   = ovsScoScratch[omp_get_thread_num()];
    uint64  *ovsTmp   = ovsTmpScratch[omp_get_thread_num()];
    uint64  &nDropped = nDroppedScratch[omp_get_thread_num()];

    for (uint32 oo=0; oo<_overlapLen[rr]; oo++) {
      ovsSco[oo]   = RI->overlapLength(_overlaps[rr][oo].a_iid, _overlaps[rr][oo].b_iid, _overlaps[rr][oo].a_hang, _overlaps[rr][oo].b_hang);
      ovsSco[oo] <<= AS_MAX_EVALUE_BITS;
      ovsSco[oo]  |= (~_overlaps[rr][oo].evalue) & ERR_MASK;
      ovsSco[oo] <<= SALT_BITS;
      ovsSco[oo]  |= oo & SALT_MASK;

      ovsTmp[oo] = ovsSco[oo];
    }

    sort(ovsTmp, ovsTmp + _overlapLen[rr]);

    uint32  minIdx = (uint32)floor(nonsymPerRead[rr] * fractionToDrop);

    if (minIdx < _minPer)
      minIdx = _minPer;

    uint64  minScore = ovsTmp[minIdx];

    for (uint32 oo=0; oo<_overlapLen[rr]; oo++) {
      if ((ovsSco[oo] < minScore) && (_overlaps[rr][oo].symmetric == false)) {
        nDropped++;
        _overlapLen[rr]--;
        _overlaps[rr][oo] = _overlaps[rr][_overlapLen[rr]];
        ovsSco[oo]        = ovsSco[_overlapLen[rr]];
        oo--;
      }
    }

    for (uint32 oo=0; oo<_overlapLen[rr]; oo++)
      if (_overlaps[rr][oo].symmetric == false)
        assert(minScore <= ovsSco[oo]);
  }

  //  Are we sane?

  for (uint32 rr=RI->numReads()+1; rr-- > 0; )
    if (_overlapLen[rr] > 0) {
      assert(_overlaps[rr][0                 ].a_iid == rr);
      assert(_overlaps[rr][_overlapLen[rr]-1].a_iid == rr);
    }

  //  Cleanup and log results.

  uint64  nDropped = 0;

  for (uint32 ii=0; ii<numThreads; ii++)
    nDropped += nDroppedScratch[ii];

  for (uint32 tt=0; tt<numThreads; tt++) {
    delete [] ovsScoScratch[tt];
    delete [] ovsTmpScratch[tt];
  }

  delete [] ovsScoScratch;
  delete [] ovsTmpScratch;
  delete [] nDroppedScratch;

  writeStatus("OverlapCache()-- Dropped %llu overlaps.\n", nDropped);

  //  Count how many twin overlaps need to be added to each read.

  uint32  *toAddPerRead = new uint32 [RI->numReads() + 1];   //  Overlap needs to be added to this read

  for (uint32 rr=0; rr<RI->numReads()+1; rr++)
    toAddPerRead[rr] = 0;

  for (uint32 rr=0; rr<RI->numReads()+1; rr++) {
    for (uint32 oo=0; oo<_overlapLen[rr]; oo++)
      if (_overlaps[rr][oo].symmetric == false)
        toAddPerRead[_overlaps[rr][oo].b_iid]++;
  }

  uint64  nToAdd = 0;

  for (uint32 rr=0; rr<RI->numReads()+1; rr++)
    nToAdd += toAddPerRead[rr];

  writeStatus("OverlapCache()-- Adding %llu missing twin overlaps.\n", nToAdd);

  //
  //  Expand or shrink space for the overlaps.
  //

  //  Allocate new temporary pointers for each read.
  BAToverlap **nPtr = new BAToverlap * [RI->numReads()+1];

  memset(nPtr, 0, sizeof(BAToverlap *) * (RI->numReads()+1));

  //  The new storage must start after the old storage.  And if it starts after the old storage ends,
  //  we can copy easier.  If not, we just grab some empty overlaps to make space.

  //  A complication occurs at the end of a single segment.  If there isn't enough space in the
  //  current segment for the overlaps, we skip ahead to the next segment without accounting for the
  //  overlaps we skip.  It's possible for the new size to fit into this unused space, which would
  //  then put the old overlaps physically after the new ones.
  //
  //    [ olaps1+unused | olaps2+unused | olaps3+unused | ]  [ olaps4+unused | ..... ]
  //    [ olaps1+new | olaps2+new | olaps3+new | olaps4+new | ]  [ olaps5+new | .... ]
  //
  //  So, we need to compare not overlap counts, but raw positions in the OverlapStorage object.

  OverlapStorage  *oldS = new OverlapStorage(_overlapStorage);   //  Recreates the existing layout without allocating anything
  OverlapStorage  *newS = _overlapStorage;                       //  Resets pointers for the new layout, using existing space

  newS->reset();

  for (uint32 rr=1; rr<RI->numReads()+1; rr++) {
    nPtr[rr] = newS->get(_overlapLen[rr] + toAddPerRead[rr]);   //  Grab the pointer to the new space
    oldS->get(_overlapMax[rr]);                                 //  Move old storage ahead
    newS->advance(oldS);                                        //  Ensure newS is not before where oldS is.

    _overlapMax[rr] = _overlapLen[rr] + toAddPerRead[rr];
  }

  //  With new pointers in hand, copy overlap data - backwards - to the new locations.
  //  (Remember that the reads are 1..numReads(), not 0..numReads()-1)

  for (uint32 rr=RI->numReads()+1; rr-- > 0; ) {
    if (_overlapLen[rr] == 0)
      continue;

    assert(_overlaps[rr][0                 ].a_iid == rr);
    assert(_overlaps[rr][_overlapLen[rr]-1].a_iid == rr);

    for (uint32 oo=_overlapLen[rr]; oo-- > 0; )
      nPtr[rr][oo] = _overlaps[rr][oo];

    assert(_overlaps[rr][0                 ].a_iid == rr);
    assert(_overlaps[rr][_overlapLen[rr]-1].a_iid == rr);
  }

  //  Swap pointers to the pointers and cleanup.
  delete [] _overlaps;

  _overlaps = nPtr;

  delete oldS;   //  newS is the original _overlapStorage; deleting it too would lose all the overlaps.

  //  Copy non-twin overlaps to their twin.
  //
  //  This cannot (easily) be parallelized.  We're iterating over overlaps in read rr, but inserting
  //  overlaps into read rb.

  for (uint32 rr=0; rr<RI->numReads()+1; rr++) {
    for (uint32 oo=0; oo<_overlapLen[rr]; oo++) {
      if (_overlaps[rr][oo].symmetric == true)
        continue;

      uint32  rb = _overlaps[rr][oo].b_iid;
      uint32  nn = _overlapLen[rb]++;

      _overlaps[rb][nn].evalue    = _overlaps[rr][oo].evalue;
      _overlaps[rb][nn].a_hang    = (_overlaps[rr][oo].flipped) ? (_overlaps[rr][oo].b_hang) : (-_overlaps[rr][oo].a_hang);
      _overlaps[rb][nn].b_hang    = (_overlaps[rr][oo].flipped) ? (_overlaps[rr][oo].a_hang) : (-_overlaps[rr][oo].b_hang);
      _overlaps[rb][nn].flipped   = _overlaps[rr][oo].flipped;
      _overlaps[rb][nn].filtered  = _overlaps[rr][oo].filtered;
      _overlaps[rb][nn].symmetric = _overlaps[rr][oo].symmetric = true;
      _overlaps[rb][nn].a_iid     = _overlaps[rr][oo].b_iid;
      _overlaps[rb][nn].b_iid     = _overlaps[rr][oo].a_iid;

      assert(_overlapLen[rb] <= _overlapMax[rb]);
      assert(toAddPerRead[rb] > 0);

      toAddPerRead[rb]--;
    }
  }

  //  Check that everything worked.

  for (uint32 rr=0; rr<RI->numReads()+1; rr++) {
    assert(toAddPerRead[rr] == 0);

    if (_overlapLen[rr] == 0)
      continue;

    assert(_overlaps[rr][0                 ].a_iid == rr);
    assert(_overlaps[rr][_overlapLen[rr]-1].a_iid == rr);
  }

  //  Cleanup.

  delete [] toAddPerRead;

  toAddPerRead = NULL;

  //  Probably should sort again.  Not sure if anything depends on this.
  for (uint32 rr=0; rr<RI->numReads()+1; rr++) {
  }

  writeStatus("OverlapCache()-- Finished.\n");
}



bool
OverlapCache::load(void) {

#if 0
  char     name[FILENAME_MAX];
  FILE    *file;
  size_t   numRead;

  snprintf(name, FILENAME_MAX, "%s.ovlCache", _prefix);

  if (AS_UTL_fileExists(name, FALSE, FALSE) == false)
    return(false);

  writeStatus("OverlapCache()-- Loading graph from '%s'.\n", name);

  errno = 0;

  file = fopen(name, "r");
  if (errno)
    writeStatus("OverlapCache()-- Failed to open '%s' for reading: %s\n", name, strerror(errno)), exit(1);

  uint64  magic      = ovlCacheMagic;
  uint32  ovserrbits = AS_MAX_EVALUE_BITS;
  uint32  ovshngbits = AS_MAX_READLEN_BITS + 1;

  AS_UTL_safeRead(file, &magic,      "overlapCache_magic",      sizeof(uint64), 1);
  AS_UTL_safeRead(file, &ovserrbits, "overlapCache_ovserrbits", sizeof(uint32), 1);
  AS_UTL_safeRead(file, &ovshngbits, "overlapCache_ovshngbits", sizeof(uint32), 1);

  if (magic != ovlCacheMagic)
    writeStatus("OverlapCache()-- ERROR: File '%s' isn't a bogart ovlCache.\n", name), exit(1);

  AS_UTL_safeRead(file, &_memLimit,    "overlapCache_memLimit",    sizeof(uint64), 1);
  AS_UTL_safeRead(file, &_memReserved, "overlapCache_memReserved", sizeof(uint64), 1);
  AS_UTL_safeRead(file, &_memAvail,    "overlapCache_memAvail",    sizeof(uint64), 1);
  AS_UTL_safeRead(file, &_memStore,    "overlapCache_memStore",    sizeof(uint64), 1);
  AS_UTL_safeRead(file, &_memOlaps,    "overlapCache_memOlaps",    sizeof(uint64), 1);
  AS_UTL_safeRead(file, &_maxPer,      "overlapCache_maxPer",      sizeof(uint32), 1);

  _overlaps   = new BAToverlap * [RI->numReads() + 1];
  _overlapLen = new uint32       [RI->numReads() + 1];
  _overlapMax = new uint32       [RI->numReads() + 1];

  AS_UTL_safeRead(file, _overlapLen, "overlapCache_len", sizeof(uint32), RI->numReads() + 1);
  AS_UTL_safeRead(file, _overlapMax, "overlapCache_max", sizeof(uint32), RI->numReads() + 1);

  for (uint32 rr=0; rr<RI->numReads() + 1; rr++) {
    if (_overlapLen[rr] == 0)
      continue;

    _overlaps[rr] = new BAToverlap [ _overlapMax[rr] ];

    memset(_overlaps[rr], 0xff, sizeof(BAToverlap) * _overlapMax[rr]);
AS_UTL_safeRead(file, _overlaps[rr], "overlapCache_ovl", sizeof(BAToverlap), _overlapLen[rr]); assert(_overlaps[rr][0].a_iid == rr); } fclose(file); return(true); #endif return(false); } void OverlapCache::save(void) { #if 0 char name[FILENAME_MAX]; FILE *file; snprintf(name, FILENAME_MAX, "%s.ovlCache", _prefix); writeStatus("OverlapCache()-- Saving graph to '%s'.\n", name); errno = 0; file = fopen(name, "w"); if (errno) writeStatus("OverlapCache()-- Failed to open '%s' for writing: %s\n", name, strerror(errno)), exit(1); uint64 magic = ovlCacheMagic; uint32 ovserrbits = AS_MAX_EVALUE_BITS; uint32 ovshngbits = AS_MAX_READLEN_BITS + 1; AS_UTL_safeWrite(file, &magic, "overlapCache_magic", sizeof(uint64), 1); AS_UTL_safeWrite(file, &ovserrbits, "overlapCache_ovserrbits", sizeof(uint32), 1); AS_UTL_safeWrite(file, &ovshngbits, "overlapCache_ovshngbits", sizeof(uint32), 1); AS_UTL_safeWrite(file, &_memLimit, "overlapCache_memLimit", sizeof(uint64), 1); AS_UTL_safeWrite(file, &_memReserved, "overlapCache_memReserved", sizeof(uint64), 1); AS_UTL_safeWrite(file, &_memAvail, "overlapCache_memAvail", sizeof(uint64), 1); AS_UTL_safeWrite(file, &_memStore, "overlapCache_memStore", sizeof(uint64), 1); AS_UTL_safeWrite(file, &_memOlaps, "overlapCache_memOlaps", sizeof(uint64), 1); AS_UTL_safeWrite(file, &_maxPer, "overlapCache_maxPer", sizeof(uint32), 1); AS_UTL_safeWrite(file, _overlapLen, "overlapCache_len", sizeof(uint32), RI->numReads() + 1); AS_UTL_safeWrite(file, _overlapMax, "overlapCache_max", sizeof(uint32), RI->numReads() + 1); for (uint32 rr=0; rrnumReads() + 1; rr++) AS_UTL_safeWrite(file, _overlaps[rr], "overlapCache_ovl", sizeof(BAToverlap), _overlapLen[rr]); fclose(file); #endif } canu-1.6/src/bogart/AS_BAT_OverlapCache.H000066400000000000000000000236231314437614700201110ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * 
sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_OverlapCache.H * * Modifications by: * * Brian P. Walenz from 2011-FEB-15 to 2013-AUG-28 * are Copyright 2011-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren on 2011-OCT-13 * are Copyright 2011 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz from 2014-OCT-21 to 2015-JUN-16 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef INCLUDE_AS_BAT_OVERLAPCACHE #define INCLUDE_AS_BAT_OVERLAPCACHE #include "AS_global.H" #include "ovStore.H" #include "gkStore.H" #include "memoryMappedFile.H" // CA8 used to re-encode the error rate into a smaller-precision number. This was // confusing and broken (it tried to use a log-based encoding to give more precision // to the smaller values). CA3g gives up and uses all 12 bits of precision. // If not enough space for the minimum number of error bits, bump up to a 64-bit word for overlap // storage. // For storing overlaps in memory. 12 bytes per overlap. 
class BAToverlap { public: BAToverlap() { evalue = 0; a_hang = 0; b_hang = 0; flipped = false; filtered = false; symmetric = false; a_iid = 0; b_iid = 0; }; ~BAToverlap() {}; // Nasty bit of code duplication. bool isDovetail(void) const { return(((a_hang < 0) && (b_hang < 0)) || ((a_hang > 0) && (b_hang > 0))); }; bool AEndIs5prime(void) const { // --------> return((a_hang < 0) && (b_hang < 0)); // ------- }; bool AEndIs3prime(void) const { // --------> return((a_hang > 0) && (b_hang > 0)); // ------- }; bool AisContainer(void) const { // --------> return((a_hang >= 0) && (b_hang <= 0)); // ---- }; bool AisContained(void) const { // ---> return((a_hang <= 0) && (b_hang >= 0)); // --------- }; bool BEndIs3prime(void) const { assert(AisContainer() == false); // Function is not defined assert(AisContained() == false); // for containments. return((AEndIs5prime() && (flipped == false)) || // <=== ------> (AEndIs3prime() && (flipped == true))); // ----> }; bool BEndIs5prime(void) const { assert(AisContainer() == false); // Function is not defined assert(AisContained() == false); // for containments. return((AEndIs5prime() && (flipped == true)) || // ------> (AEndIs3prime() && (flipped == false))); // <=== ----> }; double erate(void) const { return(AS_OVS_decodeEvalue(evalue)); } #if AS_MAX_READLEN_BITS < 24 uint64 evalue : AS_MAX_EVALUE_BITS; // 12 int64 a_hang : AS_MAX_READLEN_BITS+1; // 21+1 int64 b_hang : AS_MAX_READLEN_BITS+1; // 21+1 uint64 flipped : 1; // 1 uint64 filtered : 1; // 1 uint64 symmetric : 1; // 1 - twin overlap exists uint32 a_iid; uint32 b_iid; #if (AS_MAX_EVALUE_BITS + (AS_MAX_READLEN_BITS + 1) + (AS_MAX_READLEN_BITS + 1) + 1 + 1 + 1 > 64) #error not enough bits to store overlaps. decrease AS_MAX_EVALUE_BITS or AS_MAX_READLEN_BITS. 
#endif #else int32 a_hang; int32 b_hang; uint32 evalue : AS_MAX_EVALUE_BITS; // 12 uint32 flipped : 1; // 1 uint32 filtered : 1; // 1 uint32 symmetric : 1; // 1 - twin overlap exists uint32 a_iid; uint32 b_iid; #endif }; inline bool BAToverlap_sortByEvalue(BAToverlap const &a, BAToverlap const &b) { return(a.evalue > b.evalue); } class OverlapStorage { public: OverlapStorage(uint64 nOvl) { _osAllocLen = 1024 * 1024 * 1024 / sizeof(BAToverlap); // 1GB worth of overlaps _osLen = 0; // osMax is cheap and we overallocate it. _osPos = 0; // If allocLen is small, we can end up with _osMax = 2 * nOvl / _osAllocLen + 2; // more blocks than expected, when overlaps _os = new BAToverlap * [_osMax]; // don't fit in the remaining space. memset(_os, 0, sizeof(BAToverlap *) * _osMax); _os[0] = new BAToverlap [_osAllocLen]; // Alloc first block, keeps getOverlapStorage() simple }; OverlapStorage(OverlapStorage *original) { _osAllocLen = original->_osAllocLen; _osLen = 0; _osPos = 0; _osMax = original->_osMax; _os = NULL; }; ~OverlapStorage() { if (_os == NULL) return; for (uint32 ii=0; ii<_osMax; ii++) delete [] _os[ii]; delete [] _os; } void reset(void) { _osLen = 0; _osPos = 0; }; BAToverlap *get(void) { if (_os == NULL) return(NULL); return(_os[_osLen] + _osPos); }; BAToverlap *get(uint32 nOlaps) { if (_osPos + nOlaps > _osAllocLen) { // If we don't fit in the current allocation, _osPos = 0; // move to the next one. _osLen++; } _osPos += nOlaps; // Reserve space for these overlaps. assert(_osLen < _osMax); if (_os == NULL) // If we're not allowed to allocate, return(NULL); // return nothing. if (_os[_osLen] == NULL) // Otherwise, make sure we have space and return _os[_osLen] = new BAToverlap [_osAllocLen]; // that space. 
return(_os[_osLen] + _osPos - nOlaps); }; void advance(OverlapStorage *that) { if (((that->_osLen < _osLen)) || // That segment before mine, or ((that->_osLen == _osLen) && (that->_osPos <= _osPos))) // that segment equal and position before mine return; // So no need to modify _osLen = that->_osLen; _osPos = that->_osPos; }; private: uint32 _osAllocLen; // Size of each allocation uint32 _osLen; // Current allocation being used uint32 _osPos; // Position in current allocation; next free overlap uint32 _osMax; // Number of allocations we can make BAToverlap **_os; // Allocations }; class OverlapCache { public: OverlapCache(const char *ovlStorePath, const char *prefix, double maxErate, uint32 minOverlap, uint64 maxMemory, uint64 genomeSize, bool dosave); ~OverlapCache(); private: uint32 filterOverlaps(uint32 maxOVSerate, uint32 minOverlap, uint32 no); uint32 filterDuplicates(uint32 &no); void computeOverlapLimit(ovStore *ovlStore, uint64 genomeSize); void loadOverlaps(ovStore *ovlStore, bool doSave); void symmetrizeOverlaps(void); public: BAToverlap *getOverlaps(uint32 readIID, uint32 &numOverlaps) { numOverlaps = _overlapLen[readIID]; return(_overlaps[readIID]); } private: bool load(void); void save(void); private: const char *_prefix; uint64 _memLimit; // Expected max size of bogart uint64 _memReserved; // Memory to reserve for processing uint64 _memAvail; // Memory available for storing overlaps uint64 _memStore; // Memory used to support overlaps uint64 _memOlaps; // Memory used to store overlaps uint32 *_overlapLen; uint32 *_overlapMax; BAToverlap **_overlaps; // Instead of allocating space for overlaps per read (which has some visible but unknown size // cost with each allocation), or in a single massive allocation (which we can't resize), we // allocate overlaps in large blocks then set pointers into each block where overlaps for each // read start. This is managed by OverlapStorage. 
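The comment above describes the block-arena layout; a minimal standalone sketch follows. BlockArena is an illustrative name, not the real OverlapStorage API, and it assumes no single reservation exceeds the block size.

```cpp
#include <cstddef>
#include <cassert>
#include <vector>

// Reserve contiguous runs of items from large fixed-size blocks.  A run
// that does not fit in the current block skips ahead to a fresh block,
// leaving the tail of the old block unused -- the same space/simplicity
// trade-off OverlapStorage makes.
template<typename T>
class BlockArena {
public:
  explicit BlockArena(size_t blockLen) : _blockLen(blockLen), _pos(blockLen) {}

  T *reserve(size_t n) {
    assert(n <= _blockLen);                  // A run must fit in one block.

    if (_pos + n > _blockLen) {              // Doesn't fit; start a new block.
      _blocks.push_back(std::vector<T>(_blockLen));
      _pos = 0;
    }

    T *p = _blocks.back().data() + _pos;     // Contiguous run of n items.
    _pos += n;
    return p;
  }

  size_t blocksAllocated(void) const { return _blocks.size(); }

private:
  size_t _blockLen;                          // Items per block
  size_t _pos;                               // Next free item in current block
  std::vector<std::vector<T>> _blocks;       // The blocks themselves
};
```

Because every reservation is contiguous within one block, a caller can treat the returned pointer as a plain array of n items, exactly as bogart treats the per-read overlap lists.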
OverlapStorage *_overlapStorage; uint32 _maxEvalue; // Don't load overlaps with high error uint32 _minOverlap; // Don't load overlaps that are short uint32 _minPer; // Minimum number of overlaps to retain for a single read uint32 _maxPer; // Maximum number of overlaps to load for a single read bool _checkSymmetry; uint32 _ovsMax; // For loading overlaps ovOverlap *_ovs; // uint64 *_ovsSco; // For scoring overlaps during the load uint64 *_ovsTmp; // For picking out a score threshold uint64 _genomeSize; }; extern OverlapCache *OC; #endif // INCLUDE_AS_BAT_OVERLAPCACHE canu-1.6/src/bogart/AS_BAT_PlaceContains.C000066400000000000000000000205241314437614700202700ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_PlaceContains.C * * Modifications by: * * Brian P. Walenz from 2010-NOV-23 to 2014-JAN-29 * are Copyright 2010-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-19 to 2015-JUN-03 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/

#include "AS_BAT_ReadInfo.H"
#include "AS_BAT_BestOverlapGraph.H"
#include "AS_BAT_Logging.H"

#include "AS_BAT_Unitig.H"

#include "AS_BAT_PlaceContains.H"
#include "AS_BAT_PlaceReadUsingOverlaps.H"

#undef  SHOW_PLACEMENT_DETAIL   //  Reports evidence (too much) for placing reads.
#undef  SHOW_PLACEMENT          //  Reports where the read was placed.


void
breakSingletonTigs(TigVector &tigs) {

  //  For any singleton unitig, eject the read and delete the unitig.  Eventually,
  //  we will stop making singleton tigs.

  uint32  removed = 0;

  for (uint32 ti=1; ti<tigs.size(); ti++) {
    Unitig  *utg = tigs[ti];

    if (utg == NULL)
      continue;

    if (utg->ufpath.size() > 1)
      continue;

    tigs[ti] = NULL;                           //  Remove the tig from the list
    tigs.registerRead(utg->ufpath[0].ident);   //  Eject the read
    delete utg;                                //  Reclaim space
    removed++;                                 //  Count
  }

  writeStatus("breakSingletonTigs()-- Removed %u singleton tig%s; reads are now unplaced.\n",
              removed, (removed == 1) ? "" : "s");
}


void
placeUnplacedUsingAllOverlaps(TigVector &tigs, const char *UNUSED(prefix)) {
  uint32  fiLimit    = RI->numReads();
  uint32  numThreads = omp_get_max_threads();
  uint32  blockSize  = (fiLimit < 100 * numThreads) ? numThreads : fiLimit / 99;

  uint32      *placedTig = new uint32      [RI->numReads() + 1];
  SeqInterval *placedPos = new SeqInterval [RI->numReads() + 1];

  memset(placedTig, 0, sizeof(uint32)      * (RI->numReads() + 1));
  memset(placedPos, 0, sizeof(SeqInterval) * (RI->numReads() + 1));

  //  Just some logging.  Count the number of reads we try to place.

  uint32  nToPlaceContained = 0;
  uint32  nToPlace          = 0;
  uint32  nPlacedContained  = 0;
  uint32  nPlaced           = 0;
  uint32  nFailedContained  = 0;
  uint32  nFailed           = 0;

  for (uint32 fid=1; fid<RI->numReads()+1; fid++)
    if (tigs.inUnitig(fid) == 0)   //  I'm NOT ambiguous!
      if (OG->isContained(fid))
        nToPlaceContained++;
      else
        nToPlace++;

  writeStatus("\n");
  writeStatus("placeContains()-- placing %u contained and %u unplaced reads, with %d thread%s.\n",
              nToPlaceContained, nToPlace, numThreads, (numThreads == 1) ? "" : "s");

  //  Do the placing!
#pragma omp parallel for schedule(dynamic, blockSize)
  for (uint32 fid=1; fid<RI->numReads()+1; fid++) {
    bool  enableLog = true;

    if (tigs.inUnitig(fid) > 0)
      continue;

    //  Place the read.

    vector<overlapPlacement>   placements;

    placeReadUsingOverlaps(tigs, NULL, fid, placements, placeRead_fullMatch);

    //  Search the placements for the highest expected identity placement using all overlaps in the unitig.

    uint32   b = UINT32_MAX;

    for (uint32 i=0; i<placements.size(); i++) {
      Unitig  *tig = tigs[placements[i].tigID];

      if (tig->ufpath.size() == 1)  //  Ignore placements in singletons.
        continue;

      uint32  bgn   = placements[i].position.min();
      uint32  end   = placements[i].position.max();

      double  erate = placements[i].errors / placements[i].aligned;

      if (tig->overlapConsistentWithTig(5.0, bgn, end, erate) < 0.5) {
        if ((enableLog == true) && (logFileFlagSet(LOG_PLACE_UNPLACED)))
          writeLog("read %8u tested tig %6u (%6u reads) at %8u-%8u (cov %7.5f erate %6.4f) - HIGH ERROR\n",
                   fid, placements[i].tigID, tig->ufpath.size(),
                   placements[i].position.bgn, placements[i].position.end,
                   placements[i].fCoverage, erate);
        continue;
      }

      if ((enableLog == true) && (logFileFlagSet(LOG_PLACE_UNPLACED)))
        writeLog("read %8u tested tig %6u (%6u reads) at %8u-%8u (cov %7.5f erate %6.4f)\n",
                 fid, placements[i].tigID, tig->ufpath.size(),
                 placements[i].position.bgn, placements[i].position.end,
                 placements[i].fCoverage, erate);

      if ((b == UINT32_MAX) ||
          (placements[i].errors / placements[i].aligned < placements[b].errors / placements[b].aligned))
        b = i;
    }

    //  If we didn't find a best, b will be invalid; set positions for adding to a new tig.
    //  If we did, save both the position it was placed at, and the tigID it was placed in.
    if (b == UINT32_MAX) {
      if ((enableLog == true) && (logFileFlagSet(LOG_PLACE_UNPLACED)))
        writeLog("read %8u remains unplaced\n", fid);
      placedPos[fid].bgn = 0;
      placedPos[fid].end = RI->readLength(fid);
    }

    else {
      if ((enableLog == true) && (logFileFlagSet(LOG_PLACE_UNPLACED)))
        writeLog("read %8u placed tig %6u (%6u reads) at %8u-%8u (cov %7.5f erate %6.4f)\n",
                 fid, placements[b].tigID, tigs[placements[b].tigID]->ufpath.size(),
                 placements[b].position.bgn, placements[b].position.end,
                 placements[b].fCoverage,
                 placements[b].errors / placements[b].aligned);

      placedTig[fid] = placements[b].tigID;
      placedPos[fid] = placements[b].position;
    }
  }

  //  All reads placed, now just dump them in their correct tigs.

  for (uint32 fid=1; fid<RI->numReads()+1; fid++) {
    Unitig  *tig = NULL;
    ufNode   frg;

    if (tigs.inUnitig(fid) > 0)     //  Already placed, just skip it.
      continue;

    //  If not placed, it's garbage.  These reads were not placed in any tig initially, were not
    //  allowed to seed a tig, and now, could find no place to go.  They're garbage.

    if (placedTig[fid] == 0) {
      if (OG->isContained(fid))
        nFailedContained++;
      else
        nFailed++;
    }

    //  Otherwise, it was placed somewhere, grab the tig.

    else {
      if (OG->isContained(fid))
        nPlacedContained++;
      else
        nPlaced++;

      tig = tigs[placedTig[fid]];
    }

    //  Regardless, add it to the tig.  Logging for this is above.

    if (tig) {
      frg.ident        = fid;
      frg.contained    = 0;
      frg.parent       = 0;
      frg.ahang        = 0;
      frg.bhang        = 0;
      frg.position     = placedPos[fid];

      tig->addRead(frg, 0, false);
    }

    //  Update status.

    if (tig)
      RI->setUnplaced(fid);
    else
      RI->setLeftover(fid);
  }

  //  Cleanup.

  delete [] placedPos;
  delete [] placedTig;

  writeStatus("placeContains()-- Placed %u contained reads and %u unplaced reads.\n",
              nPlacedContained, nPlaced);
  writeStatus("placeContains()-- Failed to place %u contained reads (too high error suspected) and %u unplaced reads (lack of overlaps suspected).\n",
              nFailedContained, nFailed);

  //  But wait!  All the tigs need to be sorted.
  //  Well, not really _all_, but the hard ones to sort
  //  are big, and those quite likely had reads added to them, so it's really not worth the effort
  //  of tracking which ones need sorting, since the ones that don't need it are trivial to sort.

  for (uint32 ti=1; ti<tigs.size(); ti++) {
    Unitig  *utg = tigs[ti];

    if (utg)
      utg->sort();
  }
}
canu-1.6/src/bogart/AS_BAT_PlaceContains.H000066400000000000000000000031461314437614700202760ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  This file is derived from:
 *
 *    src/AS_BAT/AS_BAT_PlaceContains.H
 *
 *  Modifications by:
 *
 *    Brian P. Walenz from 2010-DEC-06 to 2013-AUG-01
 *      are Copyright 2010-2013 J. Craig Venter Institute, and
 *      are subject to the GNU General Public License version 2
 *
 *    Brian P. Walenz from 2014-DEC-19 to 2015-JUN-03
 *      are Copyright 2014-2015 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *    Brian P. Walenz beginning on 2016-JAN-11
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
*/ #ifndef INCLUDE_AS_BAT_PLACECONTAINS #define INCLUDE_AS_BAT_PLACECONTAINS #include "AS_BAT_TigVector.H" void breakSingletonTigs(TigVector &tigs); void placeUnplacedUsingAllOverlaps(TigVector &tigs, const char *prefix); #endif // INCLUDE_AS_BAT_PLACECONTAINS canu-1.6/src/bogart/AS_BAT_PlaceReadUsingOverlaps.C000066400000000000000000000567361314437614700221250ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/bogart/AS_BAT_PlaceFragUsingOverlaps.C * * Modifications by: * * Brian P. Walenz beginning on 2016-AUG-12 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_BAT_ReadInfo.H" #include "AS_BAT_BestOverlapGraph.H" #include "AS_BAT_Logging.H" #include "AS_BAT_Unitig.H" #include "AS_BAT_PlaceReadUsingOverlaps.H" #include "intervalList.H" #include #include void placeRead_fromOverlaps(TigVector &tigs, Unitig *target, uint32 fid, uint32 flags, uint32 ovlLen, BAToverlap *ovl, uint32 &ovlPlaceLen, overlapPlacement *ovlPlace) { if (logFileFlagSet(LOG_PLACE_READ)) writeLog("pROU()-- placements for read %u with %u overlaps\n", fid, ovlLen); for (uint32 oo=0; ooid() != btID))) // Skip if we requested a specific tig and if this isn't it. 
continue; Unitig *btig = tigs[btID]; ufNode &bread = btig->ufpath[ tigs.ufpathIdx(ovl[oo].b_iid) ]; if (btig->_isUnassembled == true) // Skip if overlapping read is in an unassembled contig. continue; SeqInterval apos; // Position of the read in the btig SeqInterval bver; // Bases covered by the overlap in the B tig // Place the read relative to the other read. The overlap we have is relative to the A read, // so hangs need to be subtracted from the other coordinate. // // Pictures all show positive hangs. // A ------------> (b) // B (a) ------------> if ((ovl[oo].flipped == false) && (bread.position.isForward() == true)) { apos.bgn = bread.position.min() - ovl[oo].a_hang; apos.end = bread.position.max() - ovl[oo].b_hang; bver.bgn = bread.position.min() - ((ovl[oo].a_hang > 0) ? 0 : ovl[oo].a_hang); bver.end = bread.position.max() - ((ovl[oo].b_hang > 0) ? ovl[oo].b_hang : 0); } // A ------------> (b) // B (a) <------------- if ((ovl[oo].flipped == true) && (bread.position.isForward() == false)) { apos.bgn = bread.position.min() - ovl[oo].a_hang; apos.end = bread.position.max() - ovl[oo].b_hang; bver.bgn = bread.position.min() - ((ovl[oo].a_hang > 0) ? 0 : ovl[oo].a_hang); bver.end = bread.position.max() - ((ovl[oo].b_hang > 0) ? ovl[oo].b_hang : 0); } // A (b) <------------ // B ------------> (a) if ((ovl[oo].flipped == true) && (bread.position.isForward() == true)) { apos.end = bread.position.min() + ovl[oo].b_hang; apos.bgn = bread.position.max() + ovl[oo].a_hang; bver.end = bread.position.min() + ((ovl[oo].b_hang > 0) ? ovl[oo].b_hang : 0); bver.bgn = bread.position.max() + ((ovl[oo].a_hang > 0) ? 0 : ovl[oo].a_hang); } // A (b) <------------ // B <------------ (a) if ((ovl[oo].flipped == false) && (bread.position.isForward() == false)) { apos.end = bread.position.min() + ovl[oo].b_hang; apos.bgn = bread.position.max() + ovl[oo].a_hang; bver.end = bread.position.min() + ((ovl[oo].b_hang > 0) ? 
ovl[oo].b_hang : 0);
      bver.bgn = bread.position.max() + ((ovl[oo].a_hang > 0) ? 0 : ovl[oo].a_hang);
    }

    //  HOWEVER, the verified position is all goobered up if the overlapping read
    //  was placed too short.  Imagine a 20k read with a 500bp overlap, so the hangs
    //  are 19.5k.  If we position this read 1k too short, then readLen-hang is negative,
    //  and we end up misorienting the verified coords (not to mention that they're
    //  likely bogus too).  So, if that happens, we just ignore the overlap.

    int32  bposlen = (bread.position.max() - bread.position.min());

    if (ovl[oo].a_hang < 0)
      bposlen += ovl[oo].a_hang;

    if (ovl[oo].b_hang > 0)
      bposlen -= ovl[oo].b_hang;

    if (bposlen < 0) {
      writeLog("WARNING: read %u overlap to read %u in tig %u at %d-%d - hangs %d %d too large for placement, ignoring overlap\n",
               ovl[oo].a_iid, ovl[oo].b_iid,
               btID, bread.position.bgn, bread.position.end,
               ovl[oo].a_hang, ovl[oo].b_hang);
      disallow = true;
    }

    //  Save the placement in our work space.

    uint32            flen = RI->readLength(ovl[oo].a_iid);
    overlapPlacement  op;

    op.frgID       = fid;
    op.refID       = ovl[oo].b_iid;
    op.tigID       = btig->id();
    op.position    = apos;
    op.verified    = bver;
    op.covered.bgn = (ovl[oo].a_hang < 0) ?    0 : ovl[oo].a_hang;         //  The portion of the read
    op.covered.end = (ovl[oo].b_hang > 0) ? flen : ovl[oo].b_hang + flen;  //  covered by the overlap.
    op.clusterID   = 0;
    op.fCoverage   = 0.0;
    op.errors      = RI->overlapLength(ovl[oo].a_iid, ovl[oo].b_iid, ovl[oo].a_hang, ovl[oo].b_hang) * ovl[oo].erate();
    op.aligned     = op.covered.end - op.covered.bgn;
    op.tigFidx     = UINT32_MAX;
    op.tigLidx     = 0;

    //  If we're looking for final placements either contained completely in the tig or covering the
    //  whole read, disallow any placements that exceed the boundary of the unitig.  This is NOT a
    //  filter on these placements; but any placement here that extends past the end of the tig is
    //  guaranteed to not generate a contained/whole-read placement.
    if ((flags & placeRead_noExtend) || (flags & placeRead_fullMatch))
      if ((op.position.min() < 0) ||
          (op.position.max() > btig->getLength()))
        disallow = true;

    if (logFileFlagSet(LOG_PLACE_READ))
      writeLog("pRUO()-- bases %5d-%-5d to tig %5d %8ubp at %8d-%-8d olap %8d-%-8d via read %7d at %8d-%-8d hang %6d %6d %s%s\n",
               op.covered.bgn, op.covered.end,
               btig->id(), btig->getLength(),
               op.position.bgn, op.position.end,
               op.verified.bgn, op.verified.end,
               bread.ident, bread.position.bgn, bread.position.end,
               ovl[oo].a_hang, ovl[oo].b_hang,
               (ovl[oo].flipped == true) ? "I" : "N",
               (disallow) ? " DISALLOW" : "");

    //  Ensure everything is hunky-dory and save the overlap.

    if (disallow == false) {
      assert(op.covered.bgn >= 0);
      assert(op.covered.end <= flen);
      assert(op.covered.isForward() == true);

      assert(op.position.isForward() == op.verified.isForward());

      if (op.position.isForward() == true) {
        assert(op.position.bgn <= op.verified.bgn);
        assert(op.verified.end <= op.position.end);
      } else {
        assert(op.position.end <= op.verified.end);
        assert(op.verified.bgn <= op.position.bgn);
      }

      ovlPlace[ovlPlaceLen++] = op;
    }
  }
}

void
placeRead_assignEndPointsToCluster(uint32  bgn, uint32  end,
                                   uint32  fid,
                                   overlapPlacement    *ovlPlace,
                                   intervalList<int32> &bgnPoints,
                                   intervalList<int32> &endPoints) {
  int32  windowSlop = 0.075 * RI->readLength(fid);

  if (windowSlop < 5)
    windowSlop = 5;

  for (uint32 oo=bgn; oo<end; oo++) {
    bgnPoints.add(ovlPlace[oo].position.bgn - windowSlop, 2 * windowSlop);
    endPoints.add(ovlPlace[oo].position.end - windowSlop, 2 * windowSlop);
  }

  bgnPoints.merge();
  endPoints.merge();
}

void
placeRead_assignPlacementsToCluster(uint32  bgn, uint32  end,
                                    uint32  fid,
                                    overlapPlacement    *ovlPlace,
                                    intervalList<int32> &bgnPoints,
                                    intervalList<int32> &endPoints) {
  int32  numBgnPoints = bgnPoints.numberOfIntervals();
  int32  numEndPoints = endPoints.numberOfIntervals();

  for (uint32 oo=bgn; ooufpathIdx(ovlPlace[oo].refID); op.tigFidx = min(ord, op.tigFidx); op.tigLidx = max(ord, op.tigLidx); }

  if (op.tigFidx > op.tigLidx)
    writeStatus("pRUO()-- Invalid placement indices: tigFidx %u tigLidx %u\n", op.tigFidx, op.tigLidx);
  assert(op.tigFidx <= op.tigLidx);

  if (logFileFlagSet(LOG_PLACE_READ))
    writeLog("pRUO()-- spans reads #%u (%u) to #%u (%u) in tig %u\n",
             op.tigFidx, tig->ufpath[op.tigFidx].ident,
             op.tigLidx, tig->ufpath[op.tigLidx].ident,
op.tigID); } void placeRead_computePlacement(overlapPlacement &op, uint32 os, uint32 oe, overlapPlacement *ovlPlace, Unitig *tig) { stdDev bgnPos, endPos; bool isFwd = ovlPlace[os].position.isForward(); int32 readLen = RI->readLength(op.frgID); int32 tigLen = tig->getLength(); op.errors = 0; op.aligned = 0; op.verified.bgn = (isFwd) ? INT32_MAX : INT32_MIN; op.verified.end = (isFwd) ? INT32_MIN : INT32_MAX; int32 bgnVer2 = (isFwd) ? INT32_MAX : INT32_MIN;; int32 endVer2 = (isFwd) ? INT32_MIN : INT32_MAX;; int32 bgnVer3 = (isFwd) ? INT32_MAX : INT32_MIN;; int32 endVer3 = (isFwd) ? INT32_MIN : INT32_MAX;; op.covered.bgn = INT32_MAX; // Covered interval is always in op.covered.end = INT32_MIN; // forward read coordinates // Deleted overlaps? From where? for (uint32 oo=os; oo tigLen) op.verified.end = tigLen; } else { if (op.verified.bgn > tigLen) op.verified.bgn = tigLen; if (op.verified.end < 0) op.verified.end = 0; } if (isFwd) { if (op.verified.bgn < op.position.bgn) op.verified.bgn = op.position.bgn; if (op.verified.end > op.position.end) op.verified.end = op.position.end; } else { if (op.verified.bgn > op.position.bgn) op.verified.bgn = op.position.bgn; if (op.verified.end < op.position.end) op.verified.end = op.position.end; } // And check that the result is sane. assert(op.position.isForward() == isFwd); assert(op.position.isForward() == op.verified.isForward()); assert(op.covered.isForward() == true); if (isFwd) { assert(op.position.bgn <= op.verified.bgn); assert(op.verified.end <= op.position.end); } else { assert(op.position.end <= op.verified.end); assert(op.verified.bgn <= op.position.bgn); } } bool placeReadUsingOverlaps(TigVector &tigs, Unitig *target, uint32 fid, vector &placements, uint32 flags) { set verboseEnable; //verboseEnable.insert(fid); // enable for all if (verboseEnable.count(fid) > 0) logFileFlags |= LOG_PLACE_READ; if (logFileFlagSet(LOG_PLACE_READ)) // Nope, not ambiguous. 
if (target) writeLog("\npRUO()-- begin for read %u length %u into target tig %d\n", fid, RI->readLength(fid), target->id()); else writeLog("\npRUO()-- begin for read %u length %u into all tigs\n", fid, RI->readLength(fid)); assert(fid > 0); assert(fid <= RI->numReads()); // Grab overlaps we'll use to place this read. uint32 ovlLen = 0; BAToverlap *ovl = OC->getOverlaps(fid, ovlLen); // Grab some work space, and clear the output. placements.clear(); // Compute placements. Anything that doesn't get placed is left as 'nowhere', specifically, in // unitig 0 (which doesn't exist). uint32 ovlPlaceLen = 0; overlapPlacement *ovlPlace = new overlapPlacement [ovlLen]; placeRead_fromOverlaps(tigs, target, fid, flags, ovlLen, ovl, ovlPlaceLen, ovlPlace); // Sort all the placements. Sort order is: // unitig ID // placed orientation (reverse is first) // position sort(ovlPlace, ovlPlace + ovlPlaceLen, overlapPlacement_byLocation); // Segregate the overlaps by placement in the unitig. We want to construct one // overlapPlacement for each distinct placement. How this is done: // // For all overlapping overlaps for a specific unitig and a specific read orientation, the // end points (+- a few bases) are added to an interval list. The intervals are combined. // Each combined interval now forms the basis of a cluster of overlaps. A list of the pairs of // clusters hit by each overlap is built. If there is one clear winner, that is picked. If // there is no clear winner, the read cannot be placed. // // unitig ========================================================================== // overlaps -------------------- ------------------ // ------------------- ------------------- // ------------------- ------------------- // intervals x1x x2x x3x x4x x5x x6xx // // This unitig has two sets of "overlapping overlaps". The left set could be from a tandem // repeat. We'll get two overlaps to the 2,4 pair, and one overlap to the 1,3 pair. 
Assuming // that is good enough to be a clear winner, we'll ignore the 1,3 overlap and compute position // based on the other two overlaps. uint32 bgn = 0; // Range of overlaps with the same unitig/orientation uint32 end = 1; while (bgn < ovlPlaceLen) { // Find the last placement with the same unitig/orientation as the 'bgn' read. end = bgn + 1; while ((end < ovlPlaceLen) && (ovlPlace[bgn].tigID == ovlPlace[end].tigID) && (ovlPlace[bgn].position.isReverse() == ovlPlace[end].position.isReverse())) end++; if (logFileFlagSet(LOG_PLACE_READ)) writeLog("\nplaceReadUsingOverlaps()-- Merging placements %u to %u to place the read.\n", bgn, end); // Build interval lists for the begin point and the end point. Remember, this is all reads // to a single unitig (the whole picture above), not just the overlapping read sets (left // or right blocks). intervalList bgnPoints; intervalList endPoints; placeRead_assignEndPointsToCluster(bgn, end, fid, ovlPlace, bgnPoints, endPoints); // Now, assign each placement to an end-pair cluster based on the interval ID that the end point falls in. // // Count the number of reads that hit each pair of points. Assign each ovlPlace to an implicit // numbering of each pair of points. placeRead_assignPlacementsToCluster(bgn, end, fid, ovlPlace, bgnPoints, endPoints); // Sort these placements by their clusterID. sort(ovlPlace + bgn, ovlPlace + end, overlapPlacement_byCluster); // Run through each 'cluster' and compute a final placement for the read. // A cluster extends from placements os to oe. // Each cluster generates one placement. 
for (uint32 os=bgn, oe=bgn+1; os tigs[op.tigID]->getLength()))) noExtend = false; if ((fullMatch == true) && (noExtend == true)) placements.push_back(op); if (logFileFlagSet(LOG_PLACE_READ)) writeLog("pRUO()-- placements[%u] - PLACE READ %d in tig %d at %d,%d -- verified %d,%d -- covered %d,%d %4.1f%% -- errors %.2f aligned %d novl %d%s\n", placements.size() - 1, op.frgID, op.tigID, op.position.bgn, op.position.end, op.verified.bgn, op.verified.end, op.covered.bgn, op.covered.end, op.fCoverage * 100.0, op.errors, op.aligned, oe - os, (fullMatch == false) ? " -- PARTIAL" : "", (noExtend == false) ? " -- EXTENDS" : ""); os = oe; } // End of segregating overlaps by placement // Move to the next block of overlaps. bgn = end; } delete [] ovlPlace; if (verboseEnable.count(fid) > 0) logFileFlags &= ~LOG_PLACE_READ; return(true); } canu-1.6/src/bogart/AS_BAT_PlaceReadUsingOverlaps.H000066400000000000000000000107731314437614700221210ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/bogart/AS_BAT_PlaceFragUsingOverlaps.H * * Modifications by: * * Brian P. Walenz beginning on 2016-AUG-12 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #ifndef INCLUDE_AS_BAT_PLACEREADUSINGOVERLAPS #define INCLUDE_AS_BAT_PLACEREADUSINGOVERLAPS #include "AS_BAT_OverlapCache.H" #include "AS_BAT_BestOverlapGraph.H" // For ReadEnd #include "AS_BAT_Unitig.H" // For SeqInterval #include "AS_BAT_TigVector.H" class overlapPlacement { public: overlapPlacement(uint32 fi=0) { frgID = fi; refID = 0; tigID = 0; clusterID = 0; position = SeqInterval(); verified = SeqInterval(); covered = SeqInterval(); fCoverage = 0.0; errors = 0.0; aligned = 0; tigFidx = UINT32_MAX; tigLidx = 0; }; overlapPlacement(uint32 fid, overlapPlacement &op) { frgID = fid; refID = UINT32_MAX; // Not valid in the output overlapPlacement. tigID = op.tigID; clusterID = op.clusterID; // Useless to track forward. position.bgn = 0; position.end = 0; verified.bgn = 0; verified.end = 0; covered.bgn = op.covered.bgn; covered.end = op.covered.end; fCoverage = 0.0; errors = 0.0; aligned = 0; tigFidx = UINT32_MAX; tigLidx = UINT32_MAX; }; ~overlapPlacement() { }; public: uint32 frgID; // Read ID of the read this position is for. uint32 refID; // Read ID of the overlapping read were placed with. uint32 tigID; // Unitig ID of this placement int32 clusterID; SeqInterval position; // Unitig position of this placement SeqInterval verified; // Unitig position of this placement, verified by overlaps SeqInterval covered; // Position of the overlap on the read double fCoverage; // Coverage of the read double errors; // number of errors in alignments uint32 aligned; // number of bases in alignments uint32 tigFidx; // First unitig read that supports this placement uint32 tigLidx; // Last unitig read that supports this placement }; // Sort by: clusterID, tigID, orientation, position // // This sort is used to cluster the reads into overlapping regions. We don't care // about ties. // // clusterID is UINT32_MAX if the placement should be ignored. 
//
inline
bool
overlapPlacement_byLocation(const overlapPlacement &A, const overlapPlacement &B) {
  if (A.tigID != B.tigID)
    return(A.tigID < B.tigID);

  if (A.position.isReverse() != B.position.isReverse())
    return(A.position.isReverse() < B.position.isReverse());

  return(A.position < B.position);
}

//  Sort by:
//    cluster
//
//  This sort is used to group reads by cluster.  We don't care about ties, but they
//  can change the results if the input overlaps change.

inline
bool
overlapPlacement_byCluster(const overlapPlacement &A, const overlapPlacement &B) {
  return(A.clusterID < B.clusterID);
}

const uint32  placeRead_all       = 0x00;   //  Return all alignments
const uint32  placeRead_fullMatch = 0x01;   //  Return only alignments for the whole read
const uint32  placeRead_noExtend  = 0x02;   //  Return only alignments contained in the tig

bool
placeReadUsingOverlaps(TigVector                &tigs,
                       Unitig                   *target,
                       uint32                    fid,
                       vector<overlapPlacement> &placements,
                       uint32                    flags = placeRead_all);

#endif  //  INCLUDE_AS_BAT_PLACEREADUSINGOVERLAPS
canu-1.6/src/bogart/AS_BAT_PopBubbles.txt000066400000000000000000000067741314437614700202450ustar00rootroot00000000000000
findPotentialBubbles()
  - any unitig where at least half the reads have an overlap to some other unitig is a candidate.
  - returns a map of unitig id (the bubble) to a vector of unitig ids (the potential poppers).

findBubbleReadPlacements()
  - threaded on the reads
  - for reads in potential bubbles, uses placeReadUsingOverlaps() to find high-quality
    alignments to unitigs that can pop the bubble.
  - returns an array of vector<overlapPlacement> - one vector per read - of the placements
    for this read.  Placements are high quality and to popper tigs only.
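The findPotentialBubbles() candidate test above can be sketched as a standalone function.  This is a hedged illustration only: isBubbleCandidate(), readToTig, and overlaps are hypothetical names and simplified data structures, not canu's actual TigVector/BestOverlapGraph machinery.

```cpp
// Sketch of the "at least half the reads overlap another unitig" candidate
// test.  readToTig maps each read id to the id of the tig containing it;
// overlaps lists, per read, the ids of the reads it overlaps.  These are
// stand-ins for canu's real data structures.
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

bool
isBubbleCandidate(uint32_t tigID,
                  const std::vector<uint32_t>                      &tigReads,
                  const std::map<uint32_t, uint32_t>               &readToTig,
                  const std::map<uint32_t, std::vector<uint32_t>>  &overlaps) {
  uint32_t external = 0;

  for (uint32_t r : tigReads) {
    auto it = overlaps.find(r);
    if (it == overlaps.end())
      continue;
    for (uint32_t b : it->second)
      if (readToTig.at(b) != tigID) {   //  Overlap leaves this tig;
        external++;                     //  count each read at most once.
        break;
      }
  }

  return (2 * external >= tigReads.size());   //  At least half the reads.
}
```

The per-read `break` matters: the criterion counts reads with any external overlap, not external overlaps themselves, so a repeat-heavy read with many external overlaps still contributes only once.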
popBubbles()
  - findPotentialBubbles()
  - findBubbleReadPlacements()
  - for each candidate tig:
    - build a map of unitig id (target) to an intervalList (targetIntervals)
    - add to the corresponding intervalList each bubble read placement, squish to intervals when done
    - filter out intervals that are too short (0.75x) or too long (1.25x) the bubble tig size
    - save size-consistent intervals to vector of candidatePop (targets)
    - clear targetIntervals list (it's no longer needed)
    - for each read in the candidate tig
      - assign placements (from findBubbleReadPlacements()) to targets.  some placements have no target
    - we now have a list of targets[] with: bubble*, target*, target bgn/end, vector of placed
      bubble reads in this region
    - decide if the candidate is a bubble, a repeat or an orphan
      - a bubble has the 5' and 3' most reads aligned, and only one target
      - a repeat has all reads aligned, and multiple targets
      - an orphan has all reads aligned, and one target

----------------------------------------
OLD-STYLE BUBBLE POPPING

mergeSplitJoin()
  new intersectionList(unitigs)

  foreach unitig (NOT parallel)
    skip if fewer than 15 reads or 300 bases
    mergeBubbles()  - based on previously discovered intersections
    stealBubbles()  - nothing here, not implemented

  for each unitig (parallel) - unitigs created here are not reprocessed
    skip if fewer than 15 reads or 300 bases
    markRepeats()
    markChimera()

----------------------------------------

mergeBubbles(unitigs, erateBubble, targetUnitig, intersectionList)

The intersection list is a 'reverse mapping of all BestEdges between unitigs'.
For each read, a list of the incoming edges from other unitigs.
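The squish-and-filter step in popBubbles() above can be sketched as follows.  This is a standalone approximation: squish() stands in for the merging done by canu's intervalList, and both function names here are hypothetical, chosen for illustration.

```cpp
// Sketch of popBubbles()' target-interval logic: merge overlapping read
// placements into maximal intervals, then keep only intervals whose length
// is between 0.75x and 1.25x of the bubble tig's length.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

typedef std::pair<int32_t, int32_t> interval;   //  (bgn, end)

//  Merge overlapping/touching placements into maximal intervals ("squish").
std::vector<interval>
squish(std::vector<interval> p) {
  std::sort(p.begin(), p.end());

  std::vector<interval> out;
  for (const interval &iv : p) {
    if (out.empty() || iv.first > out.back().second)
      out.push_back(iv);                                            //  Disjoint: start a new interval.
    else
      out.back().second = std::max(out.back().second, iv.second);   //  Overlapping: extend.
  }
  return out;
}

//  Keep only intervals that are size-consistent with the bubble tig.
std::vector<interval>
filterBySize(const std::vector<interval> &ivs, int32_t bubbleLen) {
  std::vector<interval> out;
  for (const interval &iv : ivs) {
    int32_t len = iv.second - iv.first;
    if ((len >= 0.75 * bubbleLen) && (len <= 1.25 * bubbleLen))
      out.push_back(iv);
  }
  return out;
}
```

The size filter is what separates a plausible landing site for the whole bubble from scattered repeat hits: a target region much shorter than the bubble can only absorb part of it, and one much longer suggests the placements came from multiple unrelated loci.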
foreach intersection point get potential bubble unitig if bubble unitig doesn't exist, it was popped already if bubble unitig is more than 500k, it is skipped if bubble unitig is the current unitig, it is skipped findEnds(), skip if none found checkEnds(), skip if bad checkFrags(), skip if fails bubble is merged, remove it findEnds() - return value is first/last reads find the first/last non-contained read get the correct edge get the unitig that edge points to discard the edge if the unitig it points to is the bubble (??) if both unitigs are null, return false checkEnds() - computes placement of first/last reads, false if inconsistent place both reads using overlaps find min/max coords of suspected correct placement if placedLength < bubbleLength / 2 -> return false, bubble shrank too much if placedLength > bubbleLength * 2 -> return false, bubble grew too much if first/last reads are the same, return true check order and orientation between bubble placement and popped placement bubble placed forward - reads have same orient and same order bubble placed reverse - reads have diff orient and diff order if so, return true return false checkFrags() - based on edges, we think the bubble goes here, try to place all the reads canu-1.6/src/bogart/AS_BAT_PopulateUnitig.C000066400000000000000000000154551314437614700205250ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * This file is derived from: * * src/AS_BAT/AS_BAT_PopulateUnitig.C * * Modifications by: * * Brian P. Walenz from 2010-NOV-23 to 2013-AUG-01 * are Copyright 2010-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-19 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_BAT_ReadInfo.H" #include "AS_BAT_BestOverlapGraph.H" #include "AS_BAT_Logging.H" #include "AS_BAT_Unitig.H" #include "AS_BAT_PopulateUnitig.H" void populateUnitig(Unitig *unitig, BestEdgeOverlap *bestnext) { assert(unitig->getLength() > 0); if ((bestnext == NULL) || (bestnext->readId() == 0)) // Nothing to add! return; ufNode read = unitig->ufpath.back(); // The ID of the last read in the unitig, and the end we should walk off of it. int32 lastID = read.ident; bool last3p = (read.position.bgn < read.position.end); uint32 nAdded = 0; // While there are reads to add AND those reads to add are not already in a unitig, // construct a reverse-edge, and add the read. while ((bestnext->readId() != 0) && (unitig->inUnitig(bestnext->readId()) == 0)) { BestEdgeOverlap bestprev; // Reverse nextedge (points from the unitig to the next read to add) so that it points from // the next read to add back to something in the unitig. If the reads are // innie/outtie, we need to reverse the overlap to maintain that the A read is forward. 
    if (last3p == bestnext->read3p())
      bestprev.set(lastID, last3p,  bestnext->bhang(), bestnext->ahang(), bestnext->evalue());
    else
      bestprev.set(lastID, last3p, -bestnext->ahang(), -bestnext->bhang(), bestnext->evalue());

    //  We just made 'bestprev' pointing from read 'bestnext->readId()' end 'bestnext->read3p()'
    //  back to read 'lastID' end 'last3p'.  Compute the placement.

    if (unitig->placeRead(read, bestnext->readId(), bestnext->read3p(), &bestprev)) {
      unitig->addRead(read, 0, false);
      nAdded++;
    } else {
      writeLog("ERROR: Failed to place read %d into BOG path.\n", read.ident);
      assert(0);
    }

    //  Set up for the next read

    lastID = read.ident;
    last3p = (read.position.bgn < read.position.end);

    bestnext = OG->getBestEdgeOverlap(lastID, last3p);
  }

  if (logFileFlagSet(LOG_BUILD_UNITIG))
    if (bestnext->readId() == 0)
      writeLog("Stopped adding at read %u/%c' because no next best edge.  Added %u reads.\n",
               lastID, (last3p) ? '3' : '5', nAdded);
    else
      writeLog("Stopped adding at read %u/%c' because next best read %u/%c' is in unitig %u.  Added %u reads.\n",
               lastID, (last3p) ? '3' : '5',
               bestnext->readId(), bestnext->read3p() ? '3' : '5',
               unitig->inUnitig(bestnext->readId()), nAdded);
}

void
populateUnitig(TigVector &tigs,
               int32      fi) {

  if ((RI->readLength(fi) == 0) ||       //  Skip deleted
      (tigs.inUnitig(fi) != 0) ||        //  Skip placed
      (OG->isContained(fi) == true))     //  Skip contained
    return;

  Unitig *utg = tigs.newUnitig(logFileFlagSet(LOG_BUILD_UNITIG));

  //  Add a first read -- to be 'compatible' with the old code, the first read is added
  //  reversed, we walk off of its 5' end, flip it, and add the 3' walk.

  ufNode  read;

  read.ident        = fi;
  read.contained    = 0;
  read.parent       = 0;
  read.ahang        = 0;
  read.bhang        = 0;
  read.position.bgn = RI->readLength(fi);
  read.position.end = 0;

  utg->addRead(read, 0, logFileFlagSet(LOG_BUILD_UNITIG));

  //  Add reads as long as there is a path to follow...from the 3' end of the first read.
BestEdgeOverlap *bestedge5 = OG->getBestEdgeOverlap(fi, false); BestEdgeOverlap *bestedge3 = OG->getBestEdgeOverlap(fi, true); assert(bestedge5->ahang() <= 0); // Best Edges must be dovetail, which makes this test assert(bestedge5->bhang() <= 0); // much simpler. assert(bestedge3->ahang() >= 0); assert(bestedge3->bhang() >= 0); // If this read is not covered by the two best overlaps we are finished. We will not follow // the paths out. This indicates either low coverage, or a chimeric read. If it is low // coverage, then the best overlaps will be mutual and we'll recover the same path. If it is a // chimeric read the overlaps will not be mutual and we will skip this read. // // The amount of our read that is covered by the two best overlaps is // // (readLen + bestedge5->bhang()) + (readLen - bestedge3->ahang()) // // If that is not significantly longer than the read length, then we will not use this // read as a seed for unitig construction. // if (OG->isSuspicious(fi)) return; #if 0 uint32 covered = RI->readLength(fi) + bestedge5->bhang() + RI->readLength(fi) - bestedge3->ahang(); // This breaks tigs at 0x best-coverage regions. There might be a contain that spans (joins) // the two best overlaps to verify the read, but we can't easily tell right now. if (covered < RI->readLength(fi) + AS_OVERLAP_MIN_LEN / 2) { writeLog("Stopping unitig construction of suspicious read %d in unitig %d\n", utg->ufpath.back().ident, utg->id()); return; } #endif if (logFileFlagSet(LOG_BUILD_UNITIG)) writeLog("Adding 5' edges off of read %d in unitig %d\n", utg->ufpath.back().ident, utg->id()); if (bestedge5->readId()) populateUnitig(utg, bestedge5); utg->reverseComplement(false); if (logFileFlagSet(LOG_BUILD_UNITIG)) writeLog("Adding 3' edges off of read %d in unitig %d\n", utg->ufpath.back().ident, utg->id()); if (bestedge3->readId()) populateUnitig(utg, bestedge3); // Enabling this reverse complement is known to degrade the assembly. It is not known WHY it // degrades the assembly. 
// //utg->reverseComplement(false); } canu-1.6/src/bogart/AS_BAT_PopulateUnitig.H000066400000000000000000000031351314437614700205220ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_PopulateUnitig.H * * Modifications by: * * Brian P. Walenz from 2010-DEC-06 to 2013-AUG-01 * are Copyright 2010,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-19 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef INCLUDE_AS_BAT_POPULATEUNITIG #define INCLUDE_AS_BAT_POPULATEUNITIG void populateUnitig(Unitig *unitig, BestEdgeOverlap *nextedge); void populateUnitig(TigVector &tigs, int32 readID); #endif // INCLUDE_AS_BAT_POPULATEUNITIG canu-1.6/src/bogart/AS_BAT_PromoteToSingleton.C000066400000000000000000000042561314437614700213640ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_PromoteToSingleton.C * * Modifications by: * * Brian P. Walenz from 2012-JAN-05 to 2013-AUG-01 * are Copyright 2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-19 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_BAT_Logging.H" #include "AS_BAT_Unitig.H" #include "AS_BAT_TigVector.H" #include "AS_BAT_ReadInfo.H" void promoteToSingleton(TigVector &tigs) { uint32 nPromoted = 0; for (uint32 fi=1; fi<=RI->numReads(); fi++) { if (tigs.inUnitig(fi) != 0) // Placed. continue; if (RI->readLength(fi) == 0) // Deleted. continue; nPromoted++; Unitig *utg = tigs.newUnitig(false); ufNode read; read.ident = fi; read.contained = 0; read.parent = 0; read.ahang = 0; read.bhang = 0; read.position.bgn = 0; read.position.end = RI->readLength(fi); utg->addRead(read, 0, false); utg->_isUnassembled = true; } writeStatus("promoteToSingleton()-- Moved " F_U32 " unplaced read%s to singleton tigs.\n", nPromoted, (nPromoted == 1) ? 
"" : "s"); } canu-1.6/src/bogart/AS_BAT_PromoteToSingleton.H000066400000000000000000000021161314437614700213620ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * bogart/AS_BAT_BreakRepeats.H * * Modifications by: * * Brian P. Walenz beginning on 2016-APR-13 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef INCLUDE_AS_BAT_PROMOTE_TO_SINGLETON void promoteToSingleton(TigVector &tigs); #endif // INCLUDE_AS_BAT_PROMOTE_TO_SINGLETON canu-1.6/src/bogart/AS_BAT_ReadInfo.C000066400000000000000000000045661314437614700172440ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. 
Walenz beginning on 2016-AUG-12 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_BAT_ReadInfo.H" #include "AS_BAT_Logging.H" ReadInfo::ReadInfo(const char *gkpStorePath, const char *prefix, uint32 minReadLen) { gkStore *gkpStore = gkStore::gkStore_open(gkpStorePath); _numBases = 0; _numReads = gkpStore->gkStore_getNumReads(); _numLibraries = gkpStore->gkStore_getNumLibraries(); _readStatus = new ReadStatus [_numReads + 1]; for (uint32 i=0; i<_numReads + 1; i++) { _readStatus[i].readLength = 0; _readStatus[i].libraryID = 0; _readStatus[i].isBackbone = false; _readStatus[i].isUnplaced = false; _readStatus[i].isLeftover = false; _readStatus[i].unused = 0; } uint32 numSkipped = 0; uint32 numLoaded = 0; for (uint32 fi=1; fi<=_numReads; fi++) { gkRead *read = gkpStore->gkStore_getRead(fi); uint32 iid = read->gkRead_readID(); uint32 len = read->gkRead_sequenceLength(); if (len < minReadLen) { numSkipped++; continue; } _numBases += len; _readStatus[iid].readLength = len; _readStatus[iid].libraryID = read->gkRead_libraryID(); numLoaded++; } gkpStore->gkStore_close(); if (minReadLen > 0) writeStatus("ReadInfo()-- Using %d reads, ignoring %u reads less than " F_U32 " bp long.\n", numLoaded, numSkipped, minReadLen); else writeStatus("ReadInfo()-- Using %d reads, no minimum read length used.\n", numLoaded); } ReadInfo::~ReadInfo() { delete [] _readStatus; } canu-1.6/src/bogart/AS_BAT_ReadInfo.H000066400000000000000000000104741314437614700172440ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
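The `ReadInfo()` constructor above skips reads shorter than `minReadLen`, leaving them with a zero-length entry so the rest of bogart treats them as deleted. A simplified standalone sketch of that filtering loop, with hypothetical types (`StoredRead`, `loadReadLengths`) in place of canu's gkStore API:

```cpp
#include <cstdint>
#include <vector>

struct StoredRead { uint32_t length; };   // stand-in for a gkStore record

// Returns the number of reads skipped; 'lengths' is 1-based like the read
// ids in the store, and a zero entry marks a skipped (deleted) read.
uint32_t loadReadLengths(const std::vector<StoredRead> &store, uint32_t minReadLen,
                         std::vector<uint32_t> &lengths, uint64_t &numBases) {
  uint32_t numSkipped = 0;
  lengths.assign(store.size() + 1, 0);
  numBases = 0;
  for (uint32_t id = 1; id <= store.size(); id++) {
    uint32_t len = store[id - 1].length;
    if (len < minReadLen) {    // too short: count it, leave length at zero
      numSkipped++;
      continue;
    }
    numBases   += len;
    lengths[id] = len;
  }
  return numSkipped;
}
```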
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-AUG-12 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef INCLUDE_AS_BAT_READ_INFO #define INCLUDE_AS_BAT_READ_INFO #include "AS_global.H" #include "ovStore.H" #include "gkStore.H" #include #include #include #include #include struct ReadStatus { uint64 readLength : AS_MAX_READLEN_BITS; uint64 libraryID : AS_MAX_LIBRARIES_BITS; uint64 isBackbone : 1; // Used to construct initial contig uint64 isUnplaced : 1; // Placed in initial contig using overlaps uint64 isLeftover : 1; // Not placed uint64 unused : (64 - AS_MAX_READLEN_BITS - AS_MAX_LIBRARIES_BITS - 3); }; class ReadInfo { public: ReadInfo(const char *gkpStorePath, const char *prefix, uint32 minReadLen); ~ReadInfo(); uint64 memoryUsage(void) { return(sizeof(uint64) + sizeof(uint32) + sizeof(uint32) + sizeof(ReadStatus) * (_numReads + 1)); }; uint64 numBases(void) { return(_numBases); }; uint32 numReads(void) { return(_numReads); }; uint32 numLibraries(void) { return(_numLibraries); }; uint32 readLength(uint32 iid) { return(_readStatus[iid].readLength); }; uint32 libraryIID(uint32 iid) { return(_readStatus[iid].libraryID); }; uint32 overlapLength(uint32 a_iid, uint32 b_iid, int32 a_hang, int32 b_hang) { int32 alen = readLength(a_iid); int32 blen = readLength(b_iid); int32 aovl = 0; int32 bovl = 0; assert(alen > 0); assert(blen > 0); if (a_hang < 0) { // b_hang < 0 ? ---------- : ---- // ? 
---------- : ---------- // aovl = (b_hang < 0) ? (alen + b_hang) : (alen); bovl = (b_hang < 0) ? (blen + a_hang) : (blen + a_hang - b_hang); } else { // b_hang < 0 ? ---------- : ---------- // ? ---- : ---------- // aovl = (b_hang < 0) ? (alen - a_hang + b_hang) : (alen - a_hang); bovl = (b_hang < 0) ? (blen) : (blen - b_hang); } if ((aovl <= 0) || (bovl <= 0) || (aovl > alen) || (bovl > blen)) { fprintf(stderr, "WARNING: bogus overlap found for A=" F_U32 " B=" F_U32 "\n", a_iid, b_iid); fprintf(stderr, "WARNING: A len=" F_S32 " hang=" F_S32 " ovl=" F_S32 "\n", alen, a_hang, aovl); fprintf(stderr, "WARNING: B len=" F_S32 " hang=" F_S32 " ovl=" F_S32 "\n", blen, b_hang, bovl); } if (aovl < 0) aovl = 0; if (bovl < 0) bovl = 0; if (aovl > alen) aovl = alen; if (bovl > blen) bovl = blen; assert(aovl > 0); assert(bovl > 0); assert(aovl <= alen); assert(bovl <= blen); // AVE does not work. return((uint32)((aovl, bovl)/2)); // MAX does not work. return((uint32)MAX(aovl, bovl)); return(aovl); }; void setBackbone(uint32 fi) { _readStatus[fi].isBackbone = true; }; void setUnplaced(uint32 fi) { _readStatus[fi].isUnplaced = true; }; void setLeftover(uint32 fi) { _readStatus[fi].isLeftover = true; }; bool isBackbone(uint32 fi) { return(_readStatus[fi].isBackbone); }; bool isUnplaced(uint32 fi) { return(_readStatus[fi].isUnplaced); }; bool isLeftover(uint32 fi) { return(_readStatus[fi].isLeftover); }; private: uint64 _numBases; uint32 _numReads; uint32 _numLibraries; ReadStatus *_readStatus; }; extern ReadInfo *RI; #endif // INCLUDE_AS_BAT_READ_INFO canu-1.6/src/bogart/AS_BAT_SetParentAndHang.C000066400000000000000000000130731314437614700206740ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
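The hang arithmetic in `ReadInfo::overlapLength()` above reduces, for the A read, to the sketch below (the name `overlapLenOnA` is illustrative, not canu's API; the clamping mirrors what the real code does before its asserts). The B-read portion is computed symmetrically.

```cpp
#include <algorithm>
#include <cstdint>

// A positive a_hang trims the start of the A read out of the overlap; a
// positive b_hang trims the end.  Negative hangs mean the B read extends
// past A on that side, so A's full extent on that side is covered.
int32_t overlapLenOnA(int32_t alen, int32_t a_hang, int32_t b_hang) {
  int32_t aovl;
  if (a_hang < 0)
    aovl = (b_hang < 0) ? (alen + b_hang) : (alen);
  else
    aovl = (b_hang < 0) ? (alen - a_hang + b_hang) : (alen - a_hang);
  return std::min(std::max(aovl, 0), alen);   // clamp, as the real code does
}
```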
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_SetParentAndHang.C * * Modifications by: * * Brian P. Walenz from 2010-NOV-23 to 2013-AUG-01 * are Copyright 2010,2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-19 to 2015-AUG-05 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_BAT_ReadInfo.H" #include "AS_BAT_OverlapCache.H" #include "AS_BAT_Unitig.H" #include "AS_BAT_TigVector.H" #include "AS_BAT_SetParentAndHang.H" void setParentAndHang(TigVector &tigs) { return; map forward; map allreads; // Just for stats, build a map fo the reads in the unitig. for (uint32 ti=0; tiufpath.size() == 0) continue; // Reset parent and hangs, build a map of the reads in the unitig. for (uint32 fi=1; fiufpath.size(); fi++) { ufNode *frg = &tig->ufpath[fi]; frg->parent = 0; frg->ahang = 0; frg->bhang = 0; allreads[frg->ident] = true; } // For each read, set parent/hangs using the edges. for (uint32 fi=0; fiufpath.size(); fi++) { ufNode *frg = &tig->ufpath[fi]; // Remember that we've placed this read, and if it was forward or reverse. forward[frg->ident] = (frg->position.bgn < frg->position.end); // If the first read, there is no parent possible. 
if (ti == 0) continue; // Otherwise, find the thickest overlap to any read already placed in the unitig. uint32 olapsLen = 0; BAToverlap *olaps = OC->getOverlaps(frg->ident, olapsLen); uint32 tt = UINT32_MAX; uint32 ttLen = 0; double ttErr = DBL_MAX; int32 ah = 0; int32 bh = 0; uint32 notPresent = 0; // Potential parent isn't in the unitig uint32 notPlaced = 0; // Potential parent isn't placed yet uint32 negHang = 0; // Potential parent has a negative hang to a placed read uint32 goodOlap = 0; for (uint32 oo=0; oooverlapLength(olaps[oo].a_iid, olaps[oo].b_iid, olaps[oo].a_hang, olaps[oo].b_hang); // Compute the hangs, so we can ignore those that would place this read before the parent. // This is a flaw somewhere in bogart, and should be caught and fixed earlier. // Consensus is expecting the have the hangs for the parent read, not this read, and some // fiddling is needed to flip the overlap for this: // First, swap the reads so it's b-vs-a. // Then, flip the overlap if the b read is in the unitig flipped. int32 ah = (olaps[oo].flipped == false) ? (-olaps[oo].a_hang) : (olaps[oo].b_hang); int32 bh = (olaps[oo].flipped == false) ? (-olaps[oo].b_hang) : (olaps[oo].a_hang); if (forward[olaps[oo].b_iid] == false) { swap(ah, bh); ah = -ah; bh = -bh; } // If the ahang is negative, we flubbed up somewhere, and want to place this read before // the parent (even though positions say to place it after, because we sorted by position). if (ah < 0) { //fprintf(stderr, "ERROR: read %u in tig %u has negative ahang from parent read %u, ejected.\n", // frg->ident, ti, olaps[oo].b_iid); negHang++; continue; } // The overlap is good. Count it as such. goodOlap++; // If the overlap is worse than the one we already have, we don't care. if ((l < ttLen) || // Too short (ttErr < olaps[oo].erate())) { // Too noisy continue; } tt = oo; ttLen = l; ttErr = olaps[oo].erate(); } // If no thickest overlap, we screwed up somewhere. Complain and eject the read. 
if (tt == UINT32_MAX) { fprintf(stderr, "ERROR: read %u in tig %u has no overlap to any previous read, ejected. %u overlaps total. %u negative hang. %u to read not in tig. %u to read later in tig. %u good overlaps.\n", frg->ident, tig->id(), olapsLen, negHang, notPresent, notPlaced, goodOlap); continue; } frg->parent = olaps[tt].b_iid; frg->ahang = ah; frg->bhang = bh; } // Over all reads } // Over all tigs } canu-1.6/src/bogart/AS_BAT_SetParentAndHang.H000066400000000000000000000027521314437614700207030ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_SetParentAndHang.H * * Modifications by: * * Brian P. Walenz from 2010-DEC-06 to 2013-AUG-01 * are Copyright 2010,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-19 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
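Consensus expects the hangs relative to the parent read, so `setParentAndHang()` above swaps the stored A-vs-B overlap to B-vs-A and mirrors it when the parent is placed reversed in the tig. A standalone sketch of just that bookkeeping (the helper `parentHangs` is hypothetical, not canu's API):

```cpp
#include <cstdint>
#include <utility>

struct Hangs { int32_t ah, bh; };

Hangs parentHangs(int32_t a_hang, int32_t b_hang, bool flipped, bool parentForward) {
  // Swap the reads so the overlap reads B-vs-A.
  int32_t ah = flipped ? b_hang : -a_hang;
  int32_t bh = flipped ? a_hang : -b_hang;
  // If the parent lies reversed in the tig, mirror the overlap.
  if (!parentForward) {
    std::swap(ah, bh);
    ah = -ah;
    bh = -bh;
  }
  return {ah, bh};
}
```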
*/ #ifndef INCLUDE_AS_BAT_SETPARENTANDHANG #define INCLUDE_AS_BAT_SETPARENTANDHANG #include "AS_BAT_TigVector.H" void setParentAndHang(TigVector &tigs); #endif // INCLUDE_AS_BAT_SETPARENTANDHANG canu-1.6/src/bogart/AS_BAT_SplitDiscontinuous.C000066400000000000000000000160301314437614700214240ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_SplitDiscontinuous.C * * Modifications by: * * Brian P. Walenz from 2010-NOV-23 to 2013-AUG-01 * are Copyright 2010-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-19 to 2015-MAR-03 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_BAT_BestOverlapGraph.H" #include "AS_BAT_Logging.H" #include "AS_BAT_Unitig.H" #include "AS_BAT_SplitDiscontinuous.H" static Unitig * makeNewUnitig(TigVector &tigs, uint32 splitReadsLen, ufNode *splitReads) { if (splitReadsLen == 0) { writeLog("splitDiscontinuous()-- WARNING: tried to make a new tig with no reads!\n"); return(NULL); } Unitig *newtig = tigs.newUnitig(false); if (logFileFlagSet(LOG_SPLIT_DISCONTINUOUS)) writeLog("splitDiscontinuous()-- new tig " F_U32 " with " F_U32 " reads (starting at read " F_U32 ").\n", newtig->id(), splitReadsLen, splitReads[0].ident); int splitOffset = -splitReads[0].position.min(); // This should already be true, but we force it still splitReads[0].contained = 0; for (uint32 i=0; iaddRead(splitReads[i], splitOffset, false); //logFileFlagSet(LOG_SPLIT_DISCONTINUOUS)); return(newtig); } // Tests if the tig is contiguous. // bool tigIsContiguous(Unitig *tig, uint32 minOverlap) { int32 maxEnd = tig->ufpath[0].position.max(); for (uint32 fi=1; fiufpath.size(); fi++) { ufNode *frg = &tig->ufpath[fi]; if (frg->position.min() > maxEnd - minOverlap) return(false); maxEnd = max(maxEnd, frg->position.max()); } return(true); } // After splitting and ejecting some contains, check for discontinuous tigs. // void splitDiscontinuous(TigVector &tigs, uint32 minOverlap, vector &tigSource) { uint32 numTested = 0; uint32 numSplit = 0; uint32 numCreated = 0; // Sort and make sure the tigs start at zero. Shouldn't be here. for (uint32 ti=0; ticleanUp(); // Allocate space for the largest number of reads. uint32 splitReadsMax = 0; for (uint32 ti=0; tiufpath.size())) splitReadsMax = tigs[ti]->ufpath.size(); ufNode *splitReads = new ufNode [splitReadsMax]; // Now, finally, we can check for gaps in tigs. for (uint32 ti=0; tiufpath.size() < 2)) // No tig, or guaranteed to be contiguous. continue; numTested++; if (tigIsContiguous(tig, minOverlap) == true) // No gaps, nothing to do. continue; numSplit++; // Dang, busted unitig. Fix it up. 
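The contiguity test used above walks the reads in position order and reports a break wherever a read starts past `maxEnd - minOverlap`, i.e. where no sufficiently thick overlap connects it to anything on its left. A minimal sketch with a hypothetical `Pos` layout type (assumes reads already sorted by begin coordinate, as the sort before `cleanUp()` ensures in the real code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Pos { int32_t bgn, end; };   // read placement, bgn <= end, sorted by bgn

bool layoutIsContiguous(const std::vector<Pos> &reads, int32_t minOverlap) {
  int32_t maxEnd = reads[0].end;
  for (size_t fi = 1; fi < reads.size(); fi++) {
    if (reads[fi].bgn > maxEnd - minOverlap)   // no thick overlap to the left
      return false;
    maxEnd = std::max(maxEnd, reads[fi].end);  // contained reads don't shrink it
  }
  return true;
}
```

Tracking the running maximum, rather than the previous read's end, is what lets contained reads pass the test without creating a false break.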
if (logFileFlagSet(LOG_SPLIT_DISCONTINUOUS)) writeLog("splitDiscontinuous()-- discontinuous tig " F_U32 " with " F_SIZE_T " reads broken into:\n", tig->id(), tig->ufpath.size()); int32 maxEnd = 0; uint32 splitReadsLen = 0; for (uint32 fi=0; fiufpath.size(); fi++) { ufNode *frg = &tig->ufpath[fi]; int32 bgn = frg->position.min(); int32 end = frg->position.max(); // Good thick overlap exists to this read, save it. if (bgn <= maxEnd - minOverlap) { assert(splitReadsLen < splitReadsMax); splitReads[splitReadsLen++] = *frg; maxEnd = max(maxEnd, end); continue; } // No thick overlap found. We need to break right here before the current read. We used to // try to place contained reads with their container. For simplicity, we instead just make a // new unitig, letting the main() decide what to do with them (e.g., bubble pop or try to // place all reads in singleton tigs as contained reads again). numCreated++; Unitig *newtig = makeNewUnitig(tigs, splitReadsLen, splitReads); // 'tigs' can be reallocated, so grab the pointer again. tig = tigs[ti]; // Keep tracking tigSource. if ((tigSource.size() > 0) && (newtig)) { tigSource.resize(newtig->id() + 1); tigSource[newtig->id()].cID = tigSource[ tig->id()].cID, tigSource[newtig->id()].cBgn = tigSource[ tig->id()].cBgn + splitReads[0].position.min(); tigSource[newtig->id()].cEnd = tigSource[newtig->id()].cBgn + newtig->getLength(); tigSource[newtig->id()].uID = newtig->id(); } // Done with the split, save the current read. This resets everything. splitReadsLen = 0; splitReads[splitReadsLen++] = *frg; maxEnd = end; } // If we did any splitting, then the length of the reads in splitReads will be less than the // length of the path in the current unitig. Make a final new unitig for the remaining reads. 
if (splitReadsLen != tig->ufpath.size()) { numCreated++; Unitig *newtig = makeNewUnitig(tigs, splitReadsLen, splitReads); if ((tigSource.size() > 0) && (newtig)) { tigSource.resize(newtig->id() + 1); tigSource[newtig->id()].cID = tigSource[ tig->id()].cID, tigSource[newtig->id()].cBgn = tigSource[ tig->id()].cBgn + splitReads[0].position.min(); tigSource[newtig->id()].cEnd = tigSource[newtig->id()].cBgn + newtig->getLength(); tigSource[newtig->id()].uID = newtig->id(); } delete tigs[ti]; tigs[ti] = NULL; } } delete [] splitReads; if (numSplit == 0) writeStatus("splitDiscontinuous()-- Tested " F_U32 " tig%s, split none.\n", numTested, (numTested == 1) ? "" : "s"); else writeStatus("splitDiscontinuous()-- Tested " F_U32 " tig%s, split " F_U32 " tig%s into " F_U32 " new tig%s.\n", numTested, (numTested == 1) ? "" : "s", numSplit, (numSplit == 1) ? "" : "s", numCreated, (numCreated == 1) ? "" : "s"); } void splitDiscontinuous(TigVector &tigs, uint32 minOverlap) { vector nothingToSeeHere; splitDiscontinuous(tigs, minOverlap, nothingToSeeHere); } canu-1.6/src/bogart/AS_BAT_SplitDiscontinuous.H000066400000000000000000000031711314437614700214330ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_SplitDiscontinuous.H * * Modifications by: * * Brian P. Walenz from 2010-DEC-06 to 2013-AUG-01 * are Copyright 2010,2013 J. 
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-19 to 2015-MAR-03 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef INCLUDE_AS_BAT_SPLITDISCONTINUOUS #define INCLUDE_AS_BAT_SPLITDISCONTINUOUS #include "AS_BAT_CreateUnitigs.H" void splitDiscontinuous(TigVector &tigs, uint32 minOverlap, vector &tigSource); void splitDiscontinuous(TigVector &tigs, uint32 minOverlap); #endif // INCLUDE_AS_BAT_SPLITDISCONTINUOUS canu-1.6/src/bogart/AS_BAT_TigGraph.C000066400000000000000000000455461314437614700172650ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-OCT-03 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_BAT_ReadInfo.H" #include "AS_BAT_BestOverlapGraph.H" #include "AS_BAT_AssemblyGraph.H" #include "AS_BAT_Logging.H" #include "AS_BAT_PlaceReadUsingOverlaps.H" #include "AS_BAT_TigGraph.H" #undef SHOW_EDGES #undef SHOW_EDGES_UNPLACED // Generates a lot of noise #undef SHOW_EDGES_VERBOSE class grEdge { public: grEdge() { tigID = 0; bgn = 0; end = 0; fwd = false; extended = false; deleted = true; }; grEdge(uint32 t, int32 b, int32 e, bool f) { tigID = t; bgn = b; end = e; fwd = f; extended = false; deleted = false; }; uint32 tigID; // Which tig we're placing this in int32 bgn; // Location of overlap int32 end; // bool fwd; // Overlap indicates tgB is forward (tgA is defined to be forward) bool extended; bool deleted; }; void emitEdges(TigVector &tigs, Unitig *tgA, bool tgAflipped, FILE *BEG, vector &tigSource) { vector placements; vector edges; // Place the first read. ufNode *rdA = tgA->firstRead(); uint32 rdAlen = RI->readLength(rdA->ident); placeReadUsingOverlaps(tigs, NULL, rdA->ident, placements, placeRead_all); // // Somewhere we need to weed out the high error overlaps - Unitig::overlapConsistentWithTig() won't work // because we're at the end of the tig and can have 1x of coverage. // // Convert those placements into potential edges. // // Overview: from this first placement, we'll try to extend the tig-tig alignment to generate the // full edge. In pictures: // // <----------------------------------------------- tgA // rd1 ------------> // rd2 --------------> // rd3 <---------------- // // --------------------------------> tgB (we don't care about its reads) // // We'll place rd1 in tgB, then place rd2 and extend the alignment, then rd3 and notice that // we've covered all of tgB, so an edge is emitted. If, say, rd2 failed to align fully, we'd // still extend the alignment, and let the total failure of rd3 kill the edge. 
for (uint32 pp=0; ppgetLength(); int32 bgn = placements[pp].verified.min(); int32 end = placements[pp].verified.max(); if ((tgA->id() == tgBid) && // If placed in the same tig and (bgn <= rdA->position.max()) && // at the same location, skip it. (rdA->position.min() <= end)) continue; if (tgB->_isUnassembled == true) // Ignore placements to unassembled crud. continue; // For this to be a valid starting edge, the read must be placed from it's beginning. In the // picture above, rd1 must be placed fully to it's 5' end. The 3' end can flop around; if the // tig-tig alignment isn't true, then rd2 will fail to align. Note thhat if the tig-tig // alignment is fully captured by only rd1, its 3' end will flop around, tgB will be fully covered, // and the edge will be emitted. if (((rdA->isForward() == true) && (placements[pp].covered.bgn > 0)) || ((rdA->isReverse() == true) && (placements[pp].covered.end < rdAlen))) { #ifdef SHOW_EDGES writeLog("emitEdges()-- edge --- - tig %6u read %8u %8u-%-8u len %6u placed bases %8u-%-8u in tig %6u %8u-%-8u %9u - INCOMPLETELY PLACED outside\n", tgA->id(), rdA->ident, rdA->position.bgn, rdA->position.end, rdAlen, placements[pp].covered.bgn, placements[pp].covered.end, tgBid, bgn, end, tgBlen); #endif continue; } // Now, if the placed read didn't get placed to it's other end, and it's placed in the middle // of the tig, reject the placement. 
if (((rdA->isForward() == true) && (placements[pp].covered.end < rdAlen) && (bgn > 100) && (end + 100 < tgBlen)) || ((rdA->isReverse() == true) && (placements[pp].covered.bgn > 0) && (bgn > 100) && (end + 100 < tgBlen))) { #ifdef SHOW_EDGES writeLog("emitEdges()-- edge --- - tig %6u read %8u %8u-%-8u len %6u placed bases %8u-%-8u in tig %6u %8u-%-8u %9u - INCOMPLETELY PLACED inside\n", tgA->id(), rdA->ident, rdA->position.bgn, rdA->position.end, rdAlen, placements[pp].covered.bgn, placements[pp].covered.end, tgBid, bgn, end, tgBlen); #endif continue; } #ifdef SHOW_EDGES writeLog("emitEdges()-- edge %3u - tig %6u read %8u %8u-%-8u placed bases %8u-%-8u in tig %6u %8u-%-8u %s quality %f\n", edges.size(), tgA->id(), rdA->ident, rdA->position.bgn, rdA->position.end, placements[pp].covered.bgn, placements[pp].covered.end, tgBid, bgn, end, placements[pp].verified.isForward() ? "->" : "<-", (double)placements[pp].errors / placements[pp].aligned); #endif // Decide the orientation of the second tig based on the orientation of the read and its // alignment. If the orientations are the same, then the second tig doesn't need to be // flipped. // // <------------------------------------- // <--- read in first tig // // <--- alignment on second tig - so if not the same, the second tig needs to be // -------------------> - flipped to make the alignment work bool fwd = (rdA->isForward() == placements[pp].verified.isForward()); // And save the placement. edges.push_back(grEdge(tgBid, bgn, end, fwd)); } // Technically, we should run through the edges and emit those that are already satisfied. But // we can defer this until after the second read is processed. Heck, we could defer until all // reads are processed, but cleaning up the list makes us a little faster, and also lets us short // circuit when we run out of potential edges before we run out of reads in the tig. 
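The placement filters above boil down to two rules: the leading end of the read must be fully aligned, and a read whose trailing end is clipped is accepted only near an end of the target tig (the 100 bp slop in the code). A standalone sketch of that acceptance test (the name `canSeedEdge` is hypothetical, not canu's API):

```cpp
#include <cstdint>

// covBgn/covEnd: bases of the read covered by the alignment.
// bgn/end: span of the placement in the target tig of length tigLen.
bool canSeedEdge(bool readForward, int32_t covBgn, int32_t covEnd, int32_t readLen,
                 int32_t bgn, int32_t end, int32_t tigLen) {
  // The leading end of the read must be fully aligned.
  if (readForward  && covBgn > 0)       return false;
  if (!readForward && covEnd < readLen) return false;
  // If the trailing end is clipped, the placement must touch a tig end
  // (within the same 100 bp slop the code above uses).
  bool clipped = readForward ? (covEnd < readLen) : (covBgn > 0);
  if (clipped && (bgn > 100) && (end + 100 < tigLen))
    return false;
  return true;
}
```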
// While there are still placements to process, march down the reads in this tig, adding to the // appropriate placement. for (uint32 fi=1; (fiufpath.size()) && (edges.size() > 0); fi++) { ufNode *rdA = &tgA->ufpath[fi]; uint32 rdAlen = RI->readLength(rdA->ident); placeReadUsingOverlaps(tigs, NULL, rdA->ident, placements, placeRead_all); // Mark every edge as being not extended. for (uint32 ee=0; eegetLength(); int32 bgn = placements[pp].verified.min(); int32 end = placements[pp].verified.max(); // Ignore placements to unassembled crud. Just an optimization. We'd filter these out // when trying to associate it with an existing overlap. if (tgB->_isUnassembled == true) continue; // Accept the placement only if it is for the whole read, or if it is touching the end of the target tig. if (((placements[pp].covered.bgn > 0) || (placements[pp].covered.end < rdAlen)) && (bgn > 100) && (end + 100 < tgBlen)) { #ifdef SHOW_EDGES_UNPLACED writeLog("emitEdges()-- read %5u incomplete placement covering %5u-%-5u at %5u-%-5u %s in tig %4u\n", rdA->ident, placements[pp].covered.bgn, placements[pp].covered.end, bgn, end, placements[pp].verified.isForward() ? "->" : "<-", tgBid); #endif continue; } for (uint32 ee=0; ee // <------------------- // --------------------- // // tgB ----------------------------------------------------- // [----edges----] // [----read-----] // // To make it more complicated, a contained read should do nothing, so we can't just // insist the end coordinate gets bigger. We must make sure that the bgn coordinate // doesn't get (significantly) smaller. int32 nbgn = min(edges[ee].bgn, bgn); // edges[] is the current region aligned int32 nend = max(edges[ee].end, end); // bgn,end is where the new read aligned // If tgB is forward, fail if the read aligned to the left (lower) of the current region. 
        if ((edges[ee].fwd == true) && (bgn < edges[ee].bgn) && (end < edges[ee].end)) {
#ifdef SHOW_EDGES_UNPLACED
          writeLog("emitEdges()-- edge %3u - extend from %5u-%-5u to %5u-%-5u -- placed read %5u at %5u-%-5u %s in tig %4u - wrong direction (fwd)\n",
                   ee, edges[ee].bgn, edges[ee].end, nbgn, nend,
                   rdA->ident, bgn, end,
                   placements[pp].verified.isForward() ? "->" : "<-", tgBid);
#endif
          continue;
        }

        //  If tgB is reverse, fail if the read aligned to the left (higher) of the current region.

        if ((edges[ee].fwd == false) && (end > edges[ee].end) && (bgn > edges[ee].bgn)) {
#ifdef SHOW_EDGES_UNPLACED
          writeLog("emitEdges()-- edge %3u - extend from %5u-%-5u to %5u-%-5u -- placed read %5u at %5u-%-5u %s in tig %4u - wrong direction (rev)\n",
                   ee, edges[ee].bgn, edges[ee].end, nbgn, nend,
                   rdA->ident, bgn, end,
                   placements[pp].verified.isForward() ? "->" : "<-", tgBid);
#endif
          continue;
        }

#ifdef SHOW_EDGES
        writeLog("emitEdges()-- edge %3u - extend from %5u-%-5u to %5u-%-5u -- placed read %5u at %5u-%-5u %s in tig %4u\n",
                 ee, edges[ee].bgn, edges[ee].end, nbgn, nend,
                 rdA->ident, bgn, end,
                 placements[pp].verified.isForward() ? "->" : "<-", tgBid);
#endif

        edges[ee].bgn      = nbgn;
        edges[ee].end      = nend;
        edges[ee].extended = true;
      }
    }

    //  Emit edges that are complete and mark them as done.
    //
    //  A better idea is to see if this read is overlapping with the first/last read
    //  in the other tig, and we're close enough to the end, instead of these silly 100bp thresholds.
    //
    //  For edges making circles, when tgA == tgB, we need to flip tgB if tgA is flipped.

    for (uint32 ee=0; ee<edges.size(); ee++) {
      bool  tgBflipped = (edges[ee].tigID == tgA->id()) && (tgAflipped);
      bool  sameContig = false;

      if ((tigSource.size() > 0) &&
          (tigSource[tgA->id()].cID == tigSource[edges[ee].tigID].cID))
        sameContig = true;

      if ((edges[ee].fwd == false) && (edges[ee].bgn <= 100)) {
#ifdef SHOW_EDGES_VERBOSE
        writeLog("emitEdges()-- edge %3u - tig %6u %s edgeTo tig %6u %s of length %6u (%6u-%6u)\n",
                 ee,
                 tgA->id(),       tgAflipped ? "<--" : "-->",
                 edges[ee].tigID, tgBflipped ? "-->" : "<--",
                 edges[ee].end - edges[ee].bgn, edges[ee].bgn, edges[ee].end);
#endif
        fprintf(BEG, "L\ttig%08u\t%c\ttig%08u\t%c\t%uM%s\n",
                edges[ee].tigID, tgBflipped ? '+' : '-',
                tgA->id(),       tgAflipped ? '-' : '+',
                edges[ee].end - edges[ee].bgn,
                (sameContig == true) ? "\tcv:A:T" : "\tcv:A:F");

        tgA->_isCircular = (tgA->id() == edges[ee].tigID);

        edges[ee].deleted = true;
      }

      if ((edges[ee].fwd == true) && (edges[ee].end + 100 >= tigs[edges[ee].tigID]->getLength())) {
#ifdef SHOW_EDGES_VERBOSE
        writeLog("emitEdges()-- edge %3u - tig %6u %s edgeTo tig %6u %s of length %6u (%6u-%6u)\n",
                 ee,
                 tgA->id(),       tgAflipped ? "<--" : "-->",
                 edges[ee].tigID, tgBflipped ? "<--" : "-->",
                 edges[ee].end - edges[ee].bgn, edges[ee].bgn, edges[ee].end);
#endif
        fprintf(BEG, "L\ttig%08u\t%c\ttig%08u\t%c\t%uM%s\n",
                edges[ee].tigID, tgBflipped ? '-' : '+',
                tgA->id(),       tgAflipped ? '-' : '+',
                edges[ee].end - edges[ee].bgn,
                (sameContig == true) ? "\tcv:A:T" : "\tcv:A:F");

        tgA->_isCircular = (tgA->id() == edges[ee].tigID);

        edges[ee].deleted = true;
      }
    }

    //  A bit of cleverness.  If we emit edges before dealing with deleted and non-extended edges,
    //  the first time we hit this code we'll emit edges for both the first read and the second read.

    for (uint32 ee=0; ee<edges.size(); ee++) {
      bool  tgBflipped = (edges[ee].tigID == tgA->id()) && (tgAflipped);

      if (edges[ee].fwd == false)
        tgBflipped = !tgBflipped;

      if (edges[ee].deleted == true)
        continue;

      if (edges[ee].extended == true)
        continue;

#ifdef SHOW_EDGES
      writeLog("emitEdges()-- tig %6u %s edgeTo tig %6u %s [0 %u-%u %u] UNSATISFIED at read %u #%u\n",
               tgA->id(),       tgAflipped ? "<--" : "-->",
               edges[ee].tigID, tgBflipped ? "<--" : "-->",
               edges[ee].bgn, edges[ee].end, tigs[edges[ee].tigID]->getLength(),
               rdA->ident, fi);
#endif

      edges[ee].deleted = true;
    }

    //  Compress the edges list (optional, but messes up logging if not done) to remove the
    //  deleted edges.
    uint32  oo = 0;

    for (uint32 ee=0; ee<edges.size(); ee++)
      if (edges[ee].deleted == false)
        edges[oo++] = edges[ee];

    edges.resize(oo);
  }  //  Over all reads in this tig.

#ifdef SHOW_EDGES
  for (uint32 ee=0; ee<edges.size(); ee++) {
    bool  tgBflipped = (edges[ee].tigID == tgA->id()) && (tgAflipped);

    if (edges[ee].fwd == false)
      tgBflipped = !tgBflipped;

    if (edges[ee].extended == false)
      writeLog("emitEdges()-- tig %6u %s edgeTo tig %6u %s [0 %u-%u %u] UNSATISFIED after all reads\n",
               tgA->id(),       tgAflipped ? "<--" : "-->",
               edges[ee].tigID, tgBflipped ? "<--" : "-->",
               edges[ee].bgn, edges[ee].end, tigs[edges[ee].tigID]->getLength());
  }
#endif
}



//  Unlike placing bubbles and repeats, we don't have enough coverage to do any
//  fancy filtering based on the error profile.  We thus fall back to using
//  the filtering for best edges.

void
reportTigGraph(TigVector &tigs, vector<tigLoc> &tigSource, const char *prefix, const char *label) {
  char  BEGn[FILENAME_MAX];
  char  BEDn[FILENAME_MAX];

  writeLog("\n");
  writeLog("----------------------------------------\n");
  writeLog("Generating graph\n");

  writeStatus("AssemblyGraph()-- generating '%s.%s.gfa'.\n", prefix, label);

  snprintf(BEGn, FILENAME_MAX, "%s.%s.gfa", prefix, label);
  snprintf(BEDn, FILENAME_MAX, "%s.%s.bed", prefix, label);

  FILE *BEG = fopen(BEGn, "w");
  FILE *BED = (tigSource.size() > 0) ? fopen(BEDn, "w") : NULL;

  if (BEG == NULL)
    return;

  //  Write a header.  You've gotta start somewhere!

  fprintf(BEG, "H\tVN:Z:bogart/edges\n");

  //  Then write the sequences used in the graph.  Unlike the read and contig graphs, every
  //  sequence in our set is output.  By construction, only valid unitigs are in it.  Though
  //  we occasionally make a disconnected unitig and need to split it again.

  for (uint32 ti=1; ti<tigs.size(); ti++)
    if ((tigs[ti] != NULL) && (tigs[ti]->_isUnassembled == false))
      fprintf(BEG, "S\ttig%08u\t*\tLN:i:%u\n", ti, tigs[ti]->getLength());

  //  Run through all the tigs, emitting edges for the first and last read.
  for (uint32 ti=1; ti<tigs.size(); ti++) {
    Unitig  *tgA = tigs[ti];

    if ((tgA == NULL) || (tgA->_isUnassembled == true))
      continue;

    //if (ti == 4)
    //  logFileFlags |= LOG_PLACE_READ;

#ifdef SHOW_EDGES
    writeLog("\n");
    writeLog("reportTigGraph()-- tig %u len %u reads %u - firstRead %u\n",
             ti, tgA->getLength(), tgA->ufpath.size(), tgA->firstRead()->ident);
#endif

    emitEdges(tigs, tgA, false, BEG, tigSource);

#ifdef SHOW_EDGES
    writeLog("\n");
    writeLog("reportTigGraph()-- tig %u len %u reads %u - lastRead %u\n",
             ti, tgA->getLength(), tgA->ufpath.size(), tgA->lastRead()->ident);
#endif

    tgA->reverseComplement();
    emitEdges(tigs, tgA, true, BEG, tigSource);
    tgA->reverseComplement();

    if ((tigSource.size() > 0) && (tigSource[ti].cID != UINT32_MAX))
      fprintf(BED, "ctg%08u\t%u\t%u\tutg%08u\t%u\t%c\n",
              tigSource[ti].cID, tigSource[ti].cBgn, tigSource[ti].cEnd,
              ti, 0, '+');

    //logFileFlags &= ~LOG_PLACE_READ;
  }

  if (BEG)  fclose(BEG);
  if (BED)  fclose(BED);

  //  And report statistics.
}
canu-1.6/src/bogart/AS_BAT_TigGraph.H

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz beginning on 2016-OCT-03
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
*/ #ifndef INCLUDE_AS_BAT_TIGGRAPH #define INCLUDE_AS_BAT_TIGGRAPH #include "AS_global.H" #include "AS_BAT_Unitig.H" #include "AS_BAT_TigVector.H" #include "AS_BAT_CreateUnitigs.H" void reportTigGraph(TigVector &tigs, vector &tigSource, const char *prefix, const char *label); #endif // INCLUDE_AS_BAT_ASSEMBLYGRAPH canu-1.6/src/bogart/AS_BAT_TigVector.C000066400000000000000000000131171314437614700174530ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-AUG-09 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_BAT_Logging.H" #include "AS_BAT_Unitig.H" #include "AS_BAT_TigVector.H" TigVector::TigVector(uint32 nReads) { // The read-to-tig map _inUnitig = new uint32 [nReads + 1]; _ufpathIdx = new uint32 [nReads + 1]; for (uint32 ii=0; ii_id = _totalTigs++; if (verbose) writeLog("Creating Unitig %d\n", u->_id); if (_blockNext >= _blockSize) { assert(_numBlocks < _maxBlocks); _blocks[_numBlocks] = new Unitig * [_blockSize]; memset(_blocks[_numBlocks], 0, sizeof(Unitig **) * _blockSize); _numBlocks++; _blockNext = 0; } _blocks[_numBlocks-1][_blockNext++] = u; // The rest are just sanity checks. 
    assert((u->id() / _blockSize) == (_numBlocks - 1));
    assert((u->id() % _blockSize) == (_blockNext - 1));
    assert(operator[](u->id()) == u);
  }

  return(u);
};



void
TigVector::deleteUnitig(uint32 i) {
  delete _blocks[i / _blockSize][i % _blockSize];
  _blocks[i / _blockSize][i % _blockSize] = NULL;
}



#ifdef CHECK_UNITIG_ARRAY_INDEXING
Unitig *&
TigVector::operator[](uint32 i) {
  uint32  idx = i / _blockSize;
  uint32  pos = i % _blockSize;

  if (((i   >= _totalTigs)) ||
      ((idx >= _numBlocks)) ||
      (((pos >= _blockNext) && (idx >= _numBlocks - 1)))) {
    writeStatus("TigVector::operator[]()-- i=" F_U32 " with totalTigs=" F_U64 "\n", i, _totalTigs);
    writeStatus("TigVector::operator[]()-- blockSize=" F_U64 "\n", _blockSize);
    writeStatus("TigVector::operator[]()-- idx=" F_U32 " numBlocks=" F_U64 "\n", idx, _numBlocks);
    writeStatus("TigVector::operator[]()-- pos=" F_U32 " blockNext=" F_U64 "\n", pos, _blockNext);
  }

  assert(i < _totalTigs);
  assert((idx < _numBlocks));
  assert((pos < _blockNext) || (idx < _numBlocks - 1));

  return(_blocks[idx][pos]);
};
#endif



void
TigVector::computeArrivalRate(const char *prefix, const char *label) {
  uint32  tiLimit    = size();
  uint32  numThreads = omp_get_max_threads();
  uint32  blockSize  = (tiLimit < 100000 * numThreads) ? numThreads : tiLimit / 99999;

  writeStatus("computeArrivalRate()-- Computing arrival rates for %u tigs, with %u thread%s.\n",
              tiLimit, numThreads, (numThreads == 1) ? "" : "s");

  vector<uint32>  hist[6];

  //#pragma omp parallel for schedule(dynamic, blockSize)
  for (uint32 ti=0; ti<tiLimit; ti++) {
    Unitig  *tig = operator[](ti);

    if ((tig == NULL) || (tig->ufpath.size() == 1))
      continue;

    tig->computeArrivalRate(prefix, label, hist);
  }

  for (uint32 ii=1; ii<6; ii++) {
    char  N[FILENAME_MAX];
    snprintf(N, FILENAME_MAX, "%s.arrivalRate.%u.dat", prefix, ii);

    FILE *F = fopen(N, "w");

    for (uint32 jj=0; jj<hist[ii].size(); jj++)
      fprintf(F, "%u\n", hist[ii][jj]);

    fclose(F);
  }
}



void
TigVector::computeErrorProfiles(const char *prefix, const char *label) {
  uint32  tiLimit    = size();
  uint32  numThreads = omp_get_max_threads();
  uint32  blockSize  = (tiLimit < 100000 * numThreads) ? numThreads : tiLimit / 99999;

  writeStatus("computeErrorProfiles()-- Computing error profiles for %u tigs, with %u thread%s.\n",
              tiLimit, numThreads, (numThreads == 1) ? "" : "s");

#pragma omp parallel for schedule(dynamic, blockSize)
  for (uint32 ti=0; ti<tiLimit; ti++) {
    Unitig  *tig = operator[](ti);

    if ((tig == NULL) || (tig->ufpath.size() == 1))
      continue;

    tig->computeErrorProfile(prefix, label);
  }

  writeStatus("computeErrorProfiles()-- Finished.\n");
}



void
TigVector::reportErrorProfiles(const char *prefix, const char *label) {
  uint32  tiLimit    = size();
  uint32  numThreads = omp_get_max_threads();
  uint32  blockSize  = (tiLimit < 100000 * numThreads) ? numThreads : tiLimit / 99999;

  for (uint32 ti=0; ti<tiLimit; ti++) {
    Unitig  *tig = operator[](ti);

    if ((tig == NULL) || (tig->ufpath.size() == 1))
      continue;

    tig->reportErrorProfile(prefix, label);
  }
}
canu-1.6/src/bogart/AS_BAT_TigVector.H

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz beginning on 2016-AUG-09
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
*/ #ifndef INCLUDE_AS_BAT_TIGVECTOR #define INCLUDE_AS_BAT_TIGVECTOR #include "AS_global.H" class Unitig; class TigVector { public: TigVector(uint32 nReads); ~TigVector(); Unitig *newUnitig(bool verbose); void deleteUnitig(uint32 i); size_t size(void) { return(_totalTigs); }; Unitig *&operator[](uint32 i) { return(_blocks[i / _blockSize][i % _blockSize]); }; void optimizePositions(const char *prefix, const char *label); void computeArrivalRate(const char *prefix, const char *label); void computeErrorProfiles(const char *prefix, const char *label); void reportErrorProfiles(const char *prefix, const char *label); // Mapping from read to position in a tig. public: void registerRead(uint32 readId, uint32 tigid=0, uint32 ufpathidx=UINT32_MAX) { _inUnitig[readId] = tigid; _ufpathIdx[readId] = ufpathidx; }; uint32 inUnitig(uint32 readId) { return(_inUnitig[readId]); }; uint32 ufpathIdx(uint32 readId) { return(_ufpathIdx[readId]); }; private: uint32 *_inUnitig; // Maps a read iid to a unitig id. uint32 *_ufpathIdx; // Maps a read iid to an index in ufpath // The actual vector. private: uint64 _blockSize; uint64 _numBlocks; uint64 _maxBlocks; Unitig ***_blocks; uint64 _blockNext; uint64 _totalTigs; }; #endif // INCLUDE_AS_BAT_TIGVECTOR canu-1.6/src/bogart/AS_BAT_Unitig.C000066400000000000000000000432301314437614700170030ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
 *
 *  This file is derived from:
 *
 *    src/AS_BAT/AS_BAT_Unitig.C
 *
 *  Modifications by:
 *
 *    Brian P. Walenz from 2010-NOV-23 to 2013-AUG-01
 *      are Copyright 2010-2013 J. Craig Venter Institute, and
 *      are subject to the GNU General Public License version 2
 *
 *    Brian P. Walenz on 2014-DEC-19
 *      are Copyright 2014 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *    Brian P. Walenz beginning on 2016-JAN-11
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#include "AS_global.H"
#include "AS_BAT_Unitig.H"
#include "AS_BAT_ReadInfo.H"
#include "AS_BAT_BestOverlapGraph.H"
#include "AS_BAT_Logging.H"

#undef  SHOW_PROFILE_CONSTRUCTION
#undef  SHOW_PROFILE_CONSTRUCTION_DETAILS


void
Unitig::reverseComplement(bool doSort) {

  //  If there are contained reads, we need to sort by position to place them correctly after
  //  their containers.  If there are no contained reads, sorting can break the initial unitig
  //  building.  When two reads start at position zero, we'll exchange the order.  Initial unitig
  //  building depends on having the first read added become the last read in the unitig
  //  after reversing.

  for (uint32 fi=0; fi<ufpath.size(); fi++) {
    ufNode  *frg = &ufpath[fi];

    frg->position.bgn = getLength() - frg->position.bgn;
    frg->position.end = getLength() - frg->position.end;

    //if (frg->contained != 0)
    //  doSort = true;

    assert(frg->position.bgn >= 0);
    assert(frg->position.end >= 0);
  }

  //  We've updated the positions of everything.  Now, sort or reverse the list, and rebuild the
  //  ufpathIdx map.
  if (doSort) {
    sort();
  }

  else {
    std::reverse(ufpath.begin(), ufpath.end());

    for (uint32 fi=0; fi<ufpath.size(); fi++)
      _vector->registerRead(ufpath[fi].ident, _id, fi);
  }
}



void
Unitig::cleanUp(void) {

  if (ufpath.size() > 1)
    sort();

  int32  minPos = ufpath[0].position.min();

  if (minPos != 0)
    for (uint32 fi=0; fi<ufpath.size(); fi++) {
      ufpath[fi].position.bgn -= minPos;
      ufpath[fi].position.end -= minPos;
    }
}



void
Unitig::computeArrivalRate(const char *prefix, const char *label, vector<uint32> *hist) {

  sort();

  for (uint32 fi=0; fi<ufpath.size(); fi++) {
    ufNode  *rdA    = &ufpath[fi];
    bool     rdAfwd = (rdA->position.bgn < rdA->position.end);
    int32    rdAlo  = (rdAfwd) ? rdA->position.bgn : rdA->position.end;
    int32    rdAhi  = (rdAfwd) ? rdA->position.end : rdA->position.bgn;

    for (uint32 fj=1; fj<6; fj++) {
      if (fi + fj < ufpath.size()) {
        ufNode  *rdB    = &ufpath[fi+fj];
        bool     rdBfwd = (rdB->position.bgn < rdB->position.end);
        int32    rdBlo  = (rdBfwd) ? rdB->position.bgn : rdB->position.end;
        int32    rdBhi  = (rdBfwd) ? rdB->position.end : rdB->position.bgn;

        uint32   dist = rdBlo - rdAlo;

        hist[fj].push_back(dist);
      }
    }
  }
}



class epOlapDat {
public:
  epOlapDat() {
    pos   = 0;
    open  = false;
    erate = 0.0;
  };
  epOlapDat(uint32 p, bool o, float e) {
    pos   = p;
    open  = o;
    erate = e;
  };

  bool  operator<(const epOlapDat &that) const { return(pos < that.pos); };

  uint32  pos   : 31;
  bool    open  :  1;
  float   erate;
};



void
Unitig::computeErrorProfile(const char *UNUSED(prefix), const char *UNUSED(label)) {

#ifdef SHOW_PROFILE_CONSTRUCTION
  writeLog("errorProfile()-- Find error profile for tig " F_U32 " of length " F_U32 " with " F_SIZE_T " reads.\n",
           id(), getLength(), ufpath.size());
#endif

  errorProfile.clear();
  errorProfileIndex.clear();

  //  Count the number of overlaps we need to save.  We do this, instead of growing the array,
  //  because occasionally these are big, and having two around at the same time can blow our
  //  memory.  (Arabidopsis p5 has a tig with 160,246,250 olaps == 1gb memory)

#if 0
  //  A (much) fancier version would merge the overlap detection and errorProfile compute together.
  //  Keep lists of epOlapDat for each read end (some cleverness could probably get rid of the map,
  //  if we just use the index of the read).
  //  Before we process a new read, all data for positions
  //  before this reads start position can be processed and freed.

  map<uint32, uint32>  baseToIndex;

  uint32      *olapsMax = new uint32 [ufpath.size() * 2];
  uint32      *olapsLen = new uint32 [ufpath.size() * 2];
  epOlapDat  **olaps    = new epOlapDat * [ufpath.size() * 2];
#endif

  uint32      olapsMax = 0;
  uint32      olapsLen = 0;
  epOlapDat  *olaps    = NULL;

  for (uint32 fi=0; fi<ufpath.size(); fi++) {
    ufNode  *rdA   = &ufpath[fi];
    int32    rdAlo = rdA->position.min();
    int32    rdAhi = rdA->position.max();

    uint32      ovlLen = 0;
    BAToverlap *ovl    = OC->getOverlaps(rdA->ident, ovlLen);

    for (uint32 oi=0; oi<ovlLen; oi++) {
      if (id() != _vector->inUnitig(ovl[oi].b_iid))   //  Reads in different tigs?
        continue;                                     //  Don't care about this overlap.

      ufNode *rdB = &ufpath[ _vector->ufpathIdx(ovl[oi].b_iid) ];

      if (rdA->ident < rdB->ident)                    //  Only want to see one overlap
        continue;                                     //  for each pair.

      int32  rdBlo = rdB->position.min();
      int32  rdBhi = rdB->position.max();

      if ((rdAhi <= rdBlo) || (rdBhi <= rdAlo))       //  Reads in same tig but not overlapping?
        continue;                                     //  Don't care about this overlap.

      olapsMax += 2;
    }
  }

  //  Scan overlaps to find those that we care about, and save their endpoints.

  olaps = new epOlapDat [olapsMax];

  for (uint32 fi=0; fi<ufpath.size(); fi++) {
    ufNode  *rdA   = &ufpath[fi];
    int32    rdAlo = rdA->position.min();
    int32    rdAhi = rdA->position.max();

    uint32      ovlLen = 0;
    BAToverlap *ovl    = OC->getOverlaps(rdA->ident, ovlLen);

    for (uint32 oi=0; oi<ovlLen; oi++) {
      if (id() != _vector->inUnitig(ovl[oi].b_iid))   //  Reads in different tigs?
        continue;                                     //  Don't care about this overlap.

      ufNode *rdB = &ufpath[ _vector->ufpathIdx(ovl[oi].b_iid) ];

      if (rdA->ident < rdB->ident)                    //  Only want to see one overlap
        continue;                                     //  for each pair.

      int32  rdBlo = rdB->position.min();
      int32  rdBhi = rdB->position.max();

      if ((rdAhi <= rdBlo) || (rdBhi <= rdAlo))       //  Reads in same tig but not overlapping?
        continue;                                     //  Don't care about this overlap.
uint32 bgn = max(rdAlo, rdBlo); uint32 end = min(rdAhi, rdBhi); #ifdef SHOW_PROFILE_CONSTRUCTION_DETAILS writeLog("errorProfile()-- olap %5u read %7u read %7u at %9u-%9u\n", oi, rdA->ident, rdB->ident, bgn, end); #endif olaps[olapsLen++] = epOlapDat(bgn, true, ovl[oi].erate()); // Save an open event, olaps[olapsLen++] = epOlapDat(end, false, ovl[oi].erate()); // and a close event. assert(olapsLen <= olapsMax); } } // Warn if no overlaps. if (olapsLen == 0) { writeLog("WARNING: tig %u length %u nReads %u has no overlaps.\n", id(), getLength(), ufpath.size()); for (uint32 fi=0; fi 0) && (olaps[0].pos != 0)) // Olaps, but missing the first errorProfile.push_back(epValue(0, olaps[0].pos)); // interval, so add it. stdDev curDev; for (uint32 bb=0, ee=0; ee 0) && (olaps[olapsLen-1].pos != getLength())) // Olaps, but missing the last errorProfile.push_back(epValue(olaps[olapsLen-1].pos, getLength())); // interval, so add it. errorProfile.push_back(epValue(getLength(), getLength()+1)); // And one more to make life easier. #ifdef SHOW_PROFILE_CONSTRUCTION writeLog("errorProfile()-- tig %u generated " F_SIZE_T " profile regions from " F_SIZE_T " overlaps.\n", id(), errorProfile.size(), olapsLen); #endif delete [] olaps; // Adjust regions that have no overlaps (mean == 0) to be the average of the adjacent regions. // There are always at least two elements in the profile list: one that starts at coordinate 0, // and the terminating one at coordinate (len, len+1). for (uint32 bi=0; bi bgn)) { fprintf(stderr, "BAD ESTIMATE for bgn=%u end=%u\n", bgn, end); pbi--; } while ((pbi < errorProfileIndex.size()) && (errorProfile[errorProfileIndex[pbi]].end <= bgn)) pbi++; if (pbi == errorProfileIndex.size()) { //fprintf(stderr, "Fell off loop for bgn=%u end=%u last ep bgn=%u end=%u\n", // bgn, end, errorProfile.back().bgn, errorProfile.back().end); pbi--; } // The region pb points to will contain bgn. 
uint32 pb = errorProfileIndex[pbi]; //fprintf(stderr, "For bgn=%u end=%u - stopped at pbi=%u errorProfile[%u] = %u-%u (1)\n", // bgn, end, pbi, pb, errorProfile[pb].bgn, errorProfile[pb].end); // Fine tune search to find the exact first region. while ((0 < pb) && (bgn < errorProfile[pb].bgn)) pb--; while ((pb < errorProfile.size()) && (errorProfile[pb].end <= bgn)) pb++; #endif if ((errorProfile[pb].bgn > bgn) || (bgn >= errorProfile[pb].end)) fprintf(stderr, "For bgn=%u end=%u - stopped at errorProfile[%u] = %u-%u BOOM\n", bgn, end, pb, errorProfile[pb].bgn, errorProfile[pb].end); assert(errorProfile[pb].bgn <= bgn); assert(bgn < errorProfile[pb].end); // Sum the number of bases above the supplied erate. uint32 pe = pb; while ((pe < errorProfile.size()) && (errorProfile[pe].bgn < end)) { if (erate <= errorProfile[pe].max(deviations)) nAbove += errorProfile[pe].end - errorProfile[pe].bgn; else nBelow += errorProfile[pe].end - errorProfile[pe].bgn; pe++; } // Adjust for the bits we overcounted in the first and last regions. if (pe > 0) // Argh. If this read is fully in the first region (where there pe--; // is only 1x coverage) then pe==0. 
  uint32  bb = bgn - errorProfile[pb].bgn;
  uint32  be = errorProfile[pe].end - end;

  assert(bgn >= errorProfile[pb].bgn);
  assert(errorProfile[pe].end >= end);

  if (erate <= errorProfile[pb].max(deviations))
    nAbove -= bb;
  else
    nBelow -= bb;

  if (erate <= errorProfile[pe].max(deviations))
    nAbove -= be;
  else
    nBelow -= be;

  assert(nAbove >= 0);
  assert(nBelow >= 0);

  return((double)nAbove / (nBelow + nAbove));
}



void
Unitig::reportErrorProfile(const char *prefix, const char *label) {
  char  N[FILENAME_MAX];
  FILE *F;

  if (logFileFlagSet(LOG_ERROR_PROFILES) == false)
    return;

  snprintf(N, FILENAME_MAX, "%s.%s.%08u.profile", prefix, label, id());

  F = fopen(N, "w");

  if (F) {
    for (uint32 ii=0; ii<errorProfile.size(); ii++)
      fprintf(F, "%u %u %f %f\n",
              errorProfile[ii].bgn,  errorProfile[ii].end,
              errorProfile[ii].mean, errorProfile[ii].stddev);

    fclose(F);
  }
}
canu-1.6/src/bogart/AS_BAT_Unitig.H

#ifndef INCLUDE_AS_BAT_UNITIG
#define INCLUDE_AS_BAT_UNITIG

#include "AS_global.H"

#include <map>
#include <set>
#include <vector>
#include <algorithm>

using namespace std;

class BestEdgeOverlap;
class optPos;


class SeqInterval {
public:
  SeqInterval() {
    bgn = 0;
    end = 0;
  };
  ~SeqInterval() {
  };

  int32  min(void) const        { return(::min(bgn, end)); };
  int32  max(void) const        { return(::max(bgn, end)); };

  bool   isForward(void) const  { return(bgn < end); };
  bool   isReverse(void) const  { return(bgn > end); };

  bool   operator==(SeqInterval const that) const {
    return(((bgn == that.bgn) && (end == that.end)) ||
           ((bgn == that.end) && (end == that.bgn)));
  };

  bool   operator!=(SeqInterval const that) const {
    return(((bgn != that.bgn) || (end != that.end)) &&
           ((bgn != that.end) || (end != that.bgn)));
  };

  bool   operator<(SeqInterval const that) const {
    return(min() < that.min());
  };

public:
  int32  bgn;  //  MUST be signed!  Read placement needs to set coordinates to negative
  int32  end;  //  coordinates to indicate the read extends off the start of the tig.
};


//  True if A is contained in B.
inline bool isContained(int32 Abgn, int32 Aend, int32 Bbgn, int32 Bend) { assert(Abgn < Aend); assert(Bbgn < Bend); return((Bbgn <= Abgn) && (Aend <= Bend)); } inline bool isContained(SeqInterval &A, SeqInterval &B) { return((B.min() <= A.min()) && (A.max() <= B.max())); } // True if the A and B intervals overlap inline bool isOverlapping(int32 Abgn, int32 Aend, int32 Bbgn, int32 Bend) { assert(Abgn < Aend); assert(Bbgn < Bend); return((Abgn < Bend) && (Bbgn < Aend)); } inline bool isOverlapping(SeqInterval &A, SeqInterval &B) { return((A.min() < B.max()) && (B.min() < A.max())); } // Derived from IntMultiPos, but removes some of the data (48b in IntMultiPos, 32b in struct // ufNode). The minimum size (bit fields, assuming maximum limits, not using the contained // field) seems to be 24b, and is more effort than it is worth (just removing 'contained' would be // a chore). // // ufNode is, of course, 'unitig fragment node'. // class ufNode { public: uint32 ident; uint32 contained; uint32 parent; // IID of the read we align to int32 ahang; // If parent defined, these are relative int32 bhang; // that read SeqInterval position; bool isForward(void) const { return(position.isForward()); }; bool isReverse(void) const { return(position.isReverse()); }; bool operator<(ufNode const &that) const { int32 abgn = (position.bgn < position.end) ? position.bgn : position.end; int32 aend = (position.bgn < position.end) ? position.end : position.bgn; int32 bbgn = (that.position.bgn < that.position.end) ? that.position.bgn : that.position.end; int32 bend = (that.position.bgn < that.position.end) ? that.position.end : that.position.bgn; if (abgn < bbgn) return(true); // A starts before B! if (abgn > bbgn) return(false); // B starts before A! if (aend < bend) return(false); // A contained in B, not less than. if (aend > bend) return(true); // B contained in A, is less than. return(false); // Equality, not less than. 
  };
};



class Unitig {
private:
  Unitig(TigVector *v) {
    _vector        = v;
    _length        = 0;
    _id            = 0;
    _isUnassembled = false;
    _isRepeat      = false;
    _isCircular    = false;
  };

public:
  ~Unitig(void) {
  };

  friend class TigVector;

  void sort(void) {
    std::sort(ufpath.begin(), ufpath.end());

    for (uint32 fi=0; fi<ufpath.size(); fi++)
      _vector->registerRead(ufpath[fi].ident, _id, fi);
  };
  //void bubbleSortLastRead(void);

  void   reverseComplement(bool doSort=true);

  //  Ensure that the children are sorted by begin position,
  //  and that unitigs start at position zero.
  void   cleanUp(void);

  //  Recompute bgn/end positions using all overlaps.
  void   optimize_initPlace(uint32 pp, optPos *op, optPos *np, bool firstPass, set<uint32> &failed, bool beVerbose);
  void   optimize_recompute(uint32 ii, optPos *op, optPos *np, bool beVerbose);
  void   optimize_expand(optPos *op);
  void   optimize_setPositions(optPos *op, bool beVerbose);
  void   optimize(const char *prefix, const char *label);

  uint32 id(void)           { return(_id);           };
  int32  getLength(void)    { return(_length);       };
  uint32 getNumReads(void)  { return(ufpath.size()); };

  //  Place 'read' using an edge to some read in this tig.  The edge is from 'read3p' end.
  //
  bool   placeRead(ufNode          &read,    //  resulting placement
                   uint32           readId,  //  read we want to place
                   bool             read3p,  //  end that the edge is from
                   BestEdgeOverlap *edge);   //  edge to something in this tig

  void   addRead(ufNode node, int offset=0, bool report=false);

public:
  class epValue {
  public:
    epValue(uint32 b, uint32 e) {
      bgn    = b;
      end    = e;
      mean   = 0;
      stddev = 0;
    };

    epValue(uint32 b, uint32 e, float m, float s) {
      bgn    = b;
      end    = e;
      mean   = m;
      stddev = s;
    };

    double  max(double deviations) {
      return(mean + deviations * stddev);
    };

    bool operator<(const epValue &that) const { return(bgn < that.bgn); };
    bool operator<(const uint32  &that) const { return(bgn < that);     };

    uint32  bgn;
    uint32  end;
    float   mean;
    float   stddev;
  };

  static size_t epValueSize(void)  { return(sizeof(epValue)); };

  void    computeArrivalRate(const char *prefix, const char *label, vector<uint32> *hist);

  void    computeErrorProfile(const char *prefix, const char *label);
  void    reportErrorProfile(const char *prefix, const char *label);
  void    clearErrorProfile(void)  { errorProfile.clear(); };

  double  overlapConsistentWithTig(double deviations, uint32 bgn, uint32 end, double erate);

  //  Returns the read that is touching the start of the tig.
  ufNode *firstRead(void) {
    ufNode  *rd5 = &ufpath.front();

    for (uint32 fi=1; (fi < ufpath.size()) && (rd5->position.min() != 0); fi++)
      rd5 = &ufpath[fi];

    if (rd5->position.min() != 0)
      fprintf(stderr, "ERROR: firstRead() in tig %u doesn't start at the start\n", id());
    assert(rd5->position.min() == 0);

    return(rd5);
  };

  //  Returns the read that is touching the end of the tig.
  ufNode *lastRead(void) {
    ufNode  *rd3 = &ufpath.back();

    for (uint32 fi=ufpath.size()-1; (fi-- > 0) && (rd3->position.max() != getLength()); )
      rd3 = &ufpath[fi];

    if (rd3->position.max() != getLength())
      fprintf(stderr, "ERROR: lastRead() in tig %u doesn't end at the end\n", id());
    assert(rd3->position.max() == getLength());

    return(rd3);
  };

  //  Public Member Variables
public:
  vector<ufNode>   ufpath;

  vector<epValue>  errorProfile;
  vector<uint32>   errorProfileIndex;

public:
  //  r > 0 guards against calling these from Idx's, while r < size guards
  //  against calling with Id's.
  //
  uint32  inUnitig (uint32 r)    { assert(r > 0);              return(_vector->inUnitig(r));    };
  uint32  ufpathIdx(uint32 r)    { assert(r > 0);              return(_vector->ufpathIdx(r));   };

  ufNode *readFromId (uint32 r)  { assert(r > 0);              return(&ufpath[ ufpathIdx(r) ]); };
  ufNode *readFromIdx(uint32 r)  { assert(r < ufpath.size());  return(&ufpath[ r ]);            };

private:
  TigVector  *_vector;  //  For updating the read map.

private:
  int32       _length;
  uint32      _id;

public:
  //  Classification.
  bool        _isUnassembled;  //  Is a single read or a pseudo singleton.
  bool        _isRepeat;       //  Is from an identified repeat region.
  bool        _isCircular;     //  Is (probably) a circular tig.
};

#endif  //  INCLUDE_AS_BAT_UNITIG
canu-1.6/src/bogart/AS_BAT_Unitig_AddRead.C

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  This file is derived from:
 *
 *    src/bogart/AS_BAT_Unitig_AddFrag.C
 *
 *  Modifications by:
 *
 *    Brian P.
Walenz beginning on 2016-AUG-12 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_BAT_ReadInfo.H" #include "AS_BAT_BestOverlapGraph.H" #include "AS_BAT_Logging.H" #include "AS_BAT_Unitig.H" void Unitig::addRead(ufNode node, int offset, bool report) { node.position.bgn += offset; node.position.end += offset; assert(node.ident > 0); // keep track of the unitig a read is in _vector->registerRead(node.ident, _id, ufpath.size()); // keep track of max position in unitig int32 frgEnd = MAX(node.position.bgn, node.position.end); if (frgEnd > _length) _length = frgEnd; ufpath.push_back(node); if ((report) || (node.position.bgn < 0) || (node.position.end < 0)) { int32 trulen = RI->readLength(node.ident); int32 poslen = (node.position.end > node.position.bgn) ? (node.position.end - node.position.bgn) : (node.position.bgn - node.position.end); if (node.contained) writeLog("Added read %d (len %d) to unitig %d at %d,%d (idx %lu) (lendiff %d) (contained in %d)\n", node.ident, trulen, _id, node.position.bgn, node.position.end, ufpath.size() - 1, poslen - trulen, node.contained); else writeLog("Added read %d (len %d) to unitig %d at %d,%d (idx %lu) (lendiff %d)\n", node.ident, trulen, _id, node.position.bgn, node.position.end, ufpath.size() - 1, poslen - trulen); assert(poslen / trulen < 10); assert(trulen / poslen < 10); } assert(node.position.bgn >= 0); assert(node.position.end >= 0); } // Percolate the last read to the correct spot in the list. 
#if 0 void Unitig::bubbleSortLastRead(void) { uint32 previd = ufpath.size() - 2; uint32 lastid = ufpath.size() - 1; ufNode last = ufpath[lastid]; uint32 lastbgn = MIN(last.position.bgn, last.position.end); while ((lastid > 0) && (lastbgn < MIN(ufpath[previd].position.bgn, ufpath[previd].position.end))) { ufpath[lastid] = ufpath[previd]; _ufpathIdx[ufpath[lastid].ident] = lastid; lastid--; previd--; } _ufpathIdx[last.ident] = lastid; if (lastid < ufpath.size() - 1) ufpath[lastid] = last; } #endif canu-1.6/src/bogart/AS_BAT_Unitig_PlaceReadUsingEdges.C000066400000000000000000000245061314437614700226660ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/bogart/AS_BAT_Unitig_PlaceFragUsingEdges.C * * Modifications by: * * Brian P. Walenz beginning on 2016-AUG-12 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_BAT_ReadInfo.H" #include "AS_BAT_BestOverlapGraph.H" #include "AS_BAT_Logging.H" #include "AS_BAT_Unitig.H" #undef DEBUG_PLACE_READ ufNode placeRead_contained(uint32 readId, ufNode &parent, BestEdgeOverlap *edge) { bool pFwd = parent.position.isForward(); int32 pMin = parent.position.min(); int32 pMax = parent.position.max(); assert(pMin < pMax); // Reverse the overlap. read3p here means the overlap is flipped. 
int32 ahang = (edge->read3p() == false) ? -edge->ahang() : edge->bhang(); int32 bhang = (edge->read3p() == false) ? -edge->bhang() : edge->ahang(); // Depending on the parent orientation... // // pMin pMax pMin pMax // ----------------> <---------------- // ahang ----- bhang bhang ----- ahang // > 0 < 0 < 0 > 0 int32 fMin = (pFwd == true) ? pMin + ahang : pMin - bhang; int32 fMax = (pFwd == true) ? pMax + bhang : pMax - ahang; //int32 fMin = pMin + ((read3p == false) ? -edge->ahang() : edge->bhang()); // * intraScale //int32 fMax = pMax + ((read3p == false) ? -edge->bhang() : edge->ahang()); // * interScale assert(fMin < fMax); // We don't know the true length of the overlap, and our hang-based math tends to shrink reads. // Reset the end coordinate using the actual length of the read. #if 0 #warning NOT RESETTING fMax BASED ON READ LENGTH writeLog("placeCont()-- read %u %d-%d with hangs %d %d places read %u at %d-%d reset to %d\n", parent.ident, parent.position.min(), parent.position.max(), ahang, bhang, readId, fMin, fMax, fMin + RI->readLength(readId)); #endif fMax = fMin + RI->readLength(readId); // Orientation is straightforward, based on the orient of the parent, and the flipped flag. bool fFwd = (((pFwd == true) && (edge->read3p() == false)) || // parent is fwd, olap is not flipped ((pFwd == false) && (edge->read3p() == true))); // parent is rev, olap is flipped ufNode read; read.ident = readId; read.contained = 0; read.parent = edge->readId(); // == parent->ident read.ahang = 0; // Not used in bogart, set on output read.bhang = 0; // Not used in bogart, set on output read.position.bgn = (fFwd) ? fMin : fMax; read.position.end = (fFwd) ? fMax : fMin; #ifdef DEBUG_PLACE_READ writeLog("placeCont()-- parent %7d pos %7d,%7d -- edge to %7d %c' hangs %7d %7d -- read %7d C' -- placed %7d-%7d oriented %s %7d-%7d %f%% of length\n", parent.ident, parent.position.bgn, parent.position.end, edge->readId(), (edge->read3p()) ? 
'3' : '5', edge->ahang(), edge->bhang(), readId, fMin, fMax, (fFwd) ? "rev" : "fwd", read.position.bgn, read.position.end, 100.0 * (read.position.max() - read.position.min()) / RI->readLength(readId)); #endif return(read); } ufNode placeRead_dovetail(uint32 readId, bool read3p, ufNode &parent, BestEdgeOverlap *edge) { // We have an 'edge' from 'readId' end 'read3p' back to 'parent'. // Use that to compute the placement of 'read'. bool pFwd = parent.position.isForward(); int32 pMin = parent.position.min(); int32 pMax = parent.position.max(); assert(pMin < pMax); // Scale the hangs based on the placed versus actual length of the parent read. //double intraScale = (double)(pMax - pMin) / RI->readLength(parent.ident); // Within the parent read overlap //double interScale = 1.0; // Outside the parent read overlap // We're given an edge from the read-to-place back to the parent. Reverse the edge so it points // from the parent to the read-to-place. // // The canonical edge is from a forward parent to the child. // // -P----\--> +b // +a ---v--------C- // // To reverse the edge: // // If child is forward, swapping the order of the reads results in a canonical overlap. The // hangs become negative. // // -P----\--> +b ----> -a ---/--------C> // +a ---v--------C> ----> -P----v--> -b // // If child is reverse, swapping the order of the reads results in a backwards canonical // overlap, and we need to flip end-to-end also. The hangs are swapped. // // -P----\--> +b ----> -C--------\--> +a // +a <--v--------C- ----> +b <--v----P- // int32 ahang = (read3p == false) ? -edge->ahang() : edge->bhang(); int32 bhang = (read3p == false) ? -edge->bhang() : edge->ahang(); // The read is placed 'to the right' of the parent if // pFwd == true and edge points to 3' end // pFwd == false and edge points to 5' end // bool toRight = (pFwd == edge->read3p()); // If placing 'to the right', we add hangs. Else, subtract the swapped hangs. 
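The placement arithmetic just described can be checked with a small stand-alone sketch (simplified from `placeRead_dovetail()`; it takes the already-reversed hangs as inputs and omits the later read-length reset of `fMax`):

```cpp
#include <cstdint>
#include <utility>

// Given the parent's placement [pMin,pMax], its orientation, which end of
// the parent the edge touches, and the already-reversed hangs, compute the
// child's [fMin,fMax].  The child lies 'to the right' of the parent when
// the parent's orientation agrees with the edge's end.
std::pair<int32_t,int32_t>
placeByHangs(int32_t pMin, int32_t pMax, bool pFwd, bool edgeTo3p,
             int32_t ahang, int32_t bhang) {
  bool toRight = (pFwd == edgeTo3p);
  if (toRight)
    return { pMin + ahang, pMax + bhang };   // extend past the parent's high end
  else
    return { pMin - bhang, pMax - ahang };   // extend past the parent's low end
}
```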
int32 fMin = 0; int32 fMax = 0; if (toRight) { fMin = pMin + ahang; fMax = pMax + bhang; } else { fMin = pMin - bhang; fMax = pMax - ahang; } //int32 fMin = pMin + ((read3p == false) ? -edge->ahang() : edge->bhang()); // * intraScale //int32 fMax = pMax + ((read3p == false) ? -edge->bhang() : edge->ahang()); // * interScale assert(fMin < fMax); // We don't know the true length of the overlap, and our hang-based math tends to shrink reads. // Reset the end coordinate using the actual length of the read. #if 0 #warning NOT RESETTING fMax BASED ON READ LENGTH writeLog("placeDovs()-- read %u %d-%d with hangs %d %d places read %u at %d-%d reset to %d\n", parent.ident, parent.position.min(), parent.position.max(), ahang, bhang, readId, fMin, fMax, fMin + RI->readLength(readId)); #endif fMax = fMin + RI->readLength(readId); // Orientation is a bit more complicated, with eight cases (drawing pictures helps). // // edge from read3p=true to forward parent 3p -> reverse // edge from read3p=false to reverse parent 3p -> reverse // edge from read3p=false to forward parent 5p -> reverse // edge from read3p=true to reverse parent 5p -> reverse // // edge from read3p=true to reverse parent 3p -> forward // edge from read3p=false to forward parent 3p -> forward // edge from read3p=false to reverse parent 5p -> forward // edge from read3p=true to forward parent 5p -> forward // bool fFwd = (((read3p == true) && (pFwd == true) && (edge->read3p() == true)) || ((read3p == false) && (pFwd == false) && (edge->read3p() == true)) || ((read3p == false) && (pFwd == true) && (edge->read3p() == false)) || ((read3p == true) && (pFwd == false) && (edge->read3p() == false))) ? false : true; ufNode read; read.ident = readId; read.contained = 0; read.parent = edge->readId(); // == parent->ident read.ahang = 0; // Not used in bogart, set on output read.bhang = 0; // Not used in bogart, set on output read.position.bgn = (fFwd) ? fMin : fMax; read.position.end = (fFwd) ? 
fMax : fMin; #ifdef DEBUG_PLACE_READ writeLog("placeDove()-- parent %7d pos %7d,%7d -- edge to %7d %c' hangs %7d %7d -- read %7d %c' -- placed %7d-%7d oriented %s %7d-%7d %f%% of length\n", parent.ident, parent.position.bgn, parent.position.end, edge->readId(), (edge->read3p()) ? '3' : '5', edge->ahang(), edge->bhang(), readId, (read3p) ? '3' : '5', fMin, fMax, (fFwd) ? "rev" : "fwd", read.position.bgn, read.position.end, 100.0 * (read.position.max() - read.position.min()) / RI->readLength(readId)); #endif return(read); } // Place a read into this tig using an edge from the read to some read in this tig. // bool Unitig::placeRead(ufNode &read, // output placement uint32 readId, // id of read we want to place bool read3p, // end of read 'edge' is from, meaningless if contained BestEdgeOverlap *edge) { // edge to read in this tig assert(readId > 0); assert(readId <= RI->numReads()); read.ident = readId; read.contained = 0; read.parent = 0; read.ahang = 0; read.bhang = 0; read.position.bgn = 0; read.position.end = 0; // No best edge? Hard to place without one. assert(edge != NULL); if (edge == NULL) return(false); // Empty best edge? Still hard to place. assert(edge->readId() != 0); if (edge->readId() == 0) return(false); // Edge not pointing to a read in this tig? assert(inUnitig(edge->readId()) == id()); if (inUnitig(edge->readId()) != id()) return(false); // Grab the index of the parent read. uint32 bidx = ufpathIdx(edge->readId()); assert(edge->readId() == ufpath[bidx].ident); // Now, just compute the placement and return success! 
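The dispatch that follows — contained versus dovetail — depends only on the signs of the two hangs. A minimal sketch of that test:

```cpp
#include <cstdint>

// An overlap is a containment when the hangs have opposite signs (or are
// zero): one read then extends to or past both ends of the other.  Any
// other sign combination is a dovetail.
bool isContainment(int32_t ahang, int32_t bhang) {
  return ((ahang >= 0) && (bhang <= 0)) ||
         ((ahang <= 0) && (bhang >= 0));
}
```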
if (((edge->ahang() >= 0) && (edge->bhang() <= 0)) || ((edge->ahang() <= 0) && (edge->bhang() >= 0))) read = placeRead_contained(readId, ufpath[bidx], edge); else read = placeRead_dovetail(readId, read3p, ufpath[bidx], edge); return(true); } canu-1.6/src/bogart/addReadsToUnitigs.C000066400000000000000000000374211314437614700200550ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_CNS/addReadsToUnitigs.C * * Modifications by: * * Brian P. Walenz from 2013-AUG-24 to 2014-APR-22 * are Copyright 2013-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-OCT-09 to 2015-APR-10 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_global.H" #include "splitToWords.H" #include "MultiAlign.H" #include "MultiAlignStore.H" #include "MultiAlignment_CNS.H" #include "MultiAlignment_CNS_private.H" #include <map> #include <string> #include <vector> using namespace std; class readMap { public: readMap() { good = false; proc = false; rFWD = false; rIID = UINT32_MAX; rCNT = 0; tIID = UINT32_MAX; tBGN = 0; tEND = 0; }; bool good; bool proc; bool rFWD; uint32 rIID; uint32 rCNT; uint32 tIID; uint32 tBGN; uint32 tEND; }; class ungapToGap { public: uint32 *gapToUngap; }; int main(int argc, char **argv) { char *gkpName = NULL; char *tigName = NULL; int32 tigVers = -1; vector<char *> alignMapNames; bool doConsensus = false; bool doModify = true; #ifdef UNFINISHED_ADD_TO_SINGLETON bool doPlaceUnmapped = true; bool doDeleteUnmapped = false; #endif int32 numFailures = 0; int32 numSkipped = 0; bool showResult = false; char *lookupFile = NULL; map<string,uint32> lookupIID; #ifdef UNFINISHED_ADD_TO_SINGLETON vector<bool> iidInTig; // true if the read is already in a tig vector<uint32> iidInTigByLib; // count of reads in tigs, by library vector<uint32> iidInLib; // count of reads, by library #endif bool loadall = false; CNS_Options options = { CNS_OPTIONS_SPLIT_ALLELES_DEFAULT, CNS_OPTIONS_MIN_ANCHOR_DEFAULT, CNS_OPTIONS_DO_PHASING_DEFAULT }; vector<readMap> RM; vector<ungapToGap> UG; // Communicate to MultiAlignment_CNS.c that we are doing consensus and not cgw.
thisIsConsensus = 1; argc = AS_configure(argc, argv); int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-g") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-t") == 0) { tigName = argv[++arg]; tigVers = atoi(argv[++arg]); if (tigVers <= 0) fprintf(stderr, "invalid tigStore version (-t store version) '-t %s %s'.\n", argv[arg-1], argv[arg]), exit(1); } else if (strcmp(argv[arg], "-m") == 0) { while ((argv[arg+1] != NULL) && (AS_UTL_fileExists(argv[arg+1], false, false) == true)) alignMapNames.push_back(argv[++arg]); } else if (strcmp(argv[arg], "-lookup") == 0) { lookupFile = argv[++arg]; } else if (strcmp(argv[arg], "-r") == 0) { doConsensus = true; } else if (strcmp(argv[arg], "-v") == 0) { showResult = true; } else if (strcmp(argv[arg], "-V") == 0) { VERBOSE_MULTIALIGN_OUTPUT++; } else if (strcmp(argv[arg], "-loadall") == 0) { loadall = true; } else if (strcmp(argv[arg], "-n") == 0) { doModify = false; } else { err++; } arg++; } if (gkpName == NULL) err++; if (tigName == NULL) err++; if (alignMapNames.size() == 0) err++; if (lookupFile == NULL) err++; if (err) { fprintf(stderr, "usage: %s -g gkpStore -t tigStore version -m coords\n", argv[0]); fprintf(stderr, " -g gkpStore gatekeeper store\n"); fprintf(stderr, " -t tigStore version tigStore and version to modify\n"); fprintf(stderr, "\n"); fprintf(stderr, " -m map-file input map coords\n"); fprintf(stderr, " -M fastqUIDmap gatekeeper output fastqUIDmap for read name to IID translation\n"); fprintf(stderr, "\n"); #if 0 fprintf(stderr, "unmapped reads: default is to promote to singleton tigs\n"); fprintf(stderr, " -U leave unmapped reads alone (will crash CGW)\n"); fprintf(stderr, " -D delete unmapped reads from gkpStore\n"); #else fprintf(stderr, "unmapped reads: all reads that are mapped and eligible for addition must be\n"); fprintf(stderr, "marked as deleted before running this program. reads that are added will be\n"); fprintf(stderr, "undeleted. 
reads that are not added will remain deleted.\n"); #endif fprintf(stderr, "\n"); fprintf(stderr, "consensus: default is to not rebuild consensus\n"); fprintf(stderr, " -r rebuild consensus including the new reads\n"); fprintf(stderr, " -v show result\n"); fprintf(stderr, " -V verbose\n"); fprintf(stderr, " -loadall load all reads in gkpStore into memory (faster consensus)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -n do all the work, but discard the result\n"); if (gkpName == NULL) fprintf(stderr, "ERROR: no gkpStore (-g) supplied.\n"); if (tigName == NULL) fprintf(stderr, "ERROR: no tigStore (-t) supplied.\n"); if (alignMapNames.size() == 0) fprintf(stderr, "ERROR: no map-file (-m) inputs supplied.\n"); if (lookupFile == NULL) fprintf(stderr, "ERROR: no fasqUIDmap (-M) supplied.\n"); exit(1); } { fprintf(stderr, "Loading Name to IID map from '%s'\n", lookupFile); errno = 0; FILE *LF = fopen(lookupFile, "r"); if (errno) fprintf(stderr, "Failed to open fastqUIDmap '%s'\n", lookupFile); char LL[1024]; fgets(LL, 1024, LF); while (!feof(LF)) { chomp(LL); // Shouldn't be necessary, but splitToWords isn't doing it! 
splitToWords SW(LL); if (SW.numWords() == 3) { lookupIID[string(SW[2])] = SW(1); } else if (SW.numWords() == 6) { lookupIID[string(SW[2])] = SW(1); lookupIID[string(SW[5])] = SW(4); //fprintf(stderr, "'%s' - %u -- '%s' - %u\n", // SW[2], SW(1), SW[5], SW(4)); } else { } fgets(LL, 1024, LF); } fprintf(stderr, "Loaded " F_SIZE_T " name to IIDs\n", lookupIID.size()); } #ifdef UNFINISHED_ADD_TO_SINGLETON { fprintf(stderr, "Loading tig placement from tigStore '%s'\n", tigName); tigStore = new MultiAlignStore(tigName, tigVers, 0, 0, false, false, false); // Read only for (uint32 ti=0; ti<tigStore->numUnitigs(); ti++) { if (tigStore->isDeleted(ti, true)) continue; MultiAlignT *ma = tigStore->loadMultiAlign(ti, true); if (ma == NULL) continue; uint32 fiMax = GetNumIntMultiPoss(ma->f_list); for (uint32 fi=0; fi<fiMax; fi++) { IntMultiPos *imp = GetIntMultiPos(ma->f_list, fi); iidInTig[imp->ident] = true; } tigStore->unloadMultiAlign(ti, true); } delete tigStore; } #endif // // Load the alignment map. The challenge here is to parse the tig and read names // into correct IIDs. We assume that: // Reads were dumped with -dumpfasta and have names ">UID,IID" // Tigs were dumped with tigStore -d consensus and have names "utgIID" // Alignments are in the convertToExtent -extended format // uint32 totAligns = 0; uint32 totUnmap = 0; uint32 totUnique = 0; uint32 totDups = 0; uint32 lowID = UINT32_MAX; uint32 maxID = 0; for (uint32 an=0; an 1) totDups++; } fprintf(stderr, "Loaded %u aligns from ID %u to %u, %u reads unmapped, %u reads had a unique alignment and %u reads had multiple aligns.\n", totAligns, lowID, maxID, totUnmap, totUnique, totDups); // // Update deletion status in gkpStore. This processes every read, regardless.
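`splitToWords` is canu's whitespace tokenizer; the 3-word/6-word fastqUIDmap parse above can be sketched with the standard library alone (column layout — IID in column 2, name in column 3, repeated for the mate — as assumed from the code):

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Parse one fastqUIDmap-style line into name -> IID entries.  Three words
// describe a single read (UID IID name); six words describe both reads of
// a pair.  Other word counts are silently ignored, as in the original.
void addLookup(const std::string &line,
               std::map<std::string, uint32_t> &lookup) {
  std::istringstream ss(line);
  std::vector<std::string> w;
  for (std::string t; ss >> t; )
    w.push_back(t);

  if (w.size() == 3) {
    lookup[w[2]] = (uint32_t)std::stoul(w[1]);
  } else if (w.size() == 6) {
    lookup[w[2]] = (uint32_t)std::stoul(w[1]);
    lookup[w[5]] = (uint32_t)std::stoul(w[4]);
  }
}
```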
// fprintf(stderr, "Processing mate pairs, updating gkpStore.\n"); gkpStore = gkStore::gkStore_open(gkpName, false, true); // last arg - TRUE - writable uint32 unpaired = 0; uint32 multiple = 0; uint32 pairsToSame = 0; uint32 pairsToDiff = 0; #ifdef UNFINISHED_ADD_TO_SINGLETON // Need to process all reads, since we don't know where the first/last unmapped read is! // We could instead process from the first to last deleted read in gkpStore, or ask which // libraries were being added. lowID = 1; maxID = gkpStore->gkStore_getNumFragments(); #endif for (uint32 ff=lowID, mm=lowID; ff<=maxID; ff++) { gkFragment read; gkFragment mate; // I think this is just a short circuit of the two checks of 'one or both reads has too // few/many mappings' below. //if (RM[ff].rCNT != 1) // // Not mapped, mapped too much // continue; gkpStore->gkStore_getFragment(ff, &read, GKFRAGMENT_INF); mm = read.gkFragment_getMateIID(); if (mm == 0) // No mate, pacbio read? continue; if (mm < ff) // Already processed. continue; gkpStore->gkStore_getFragment(mm, &mate, GKFRAGMENT_INF); if ((RM[ff].rCNT == 0) || (RM[mm].rCNT == 0)) { // One or both reads has too few mappings. unpaired++; RM[ff].good = false; RM[mm].good = false; continue; } if ((RM[ff].rCNT > 1) || (RM[mm].rCNT > 1)) { // One or both reads has too many mappings. multiple++; RM[ff].good = false; RM[mm].good = false; continue; } RM[ff].good = true; RM[mm].good = true; read.gkFragment_setIsDeleted(false); mate.gkFragment_setIsDeleted(false); gkpStore->gkStore_setFragment(&read); gkpStore->gkStore_setFragment(&mate); if (RM[ff].tIID == RM[mm].tIID) pairsToSame++; else pairsToDiff++; } gkpStore->gkStore_close(); fprintf(stderr, "Will NOT add %u pairs - one read failed to map.\n", unpaired); fprintf(stderr, "Will NOT add %u pairs - multiple mappings.\n", multiple); fprintf(stderr, "Will add %u pairs in the same tig\n", pairsToSame); fprintf(stderr, "Will add %u pairs in different tigs\n", pairsToDiff); // // Open stores. 
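The mate-pair accounting above reduces to a per-pair classification from the two mapping counts: a pair is added only when each read maps exactly once. A minimal sketch of that rule:

```cpp
#include <cstdint>

enum class PairStatus { Unmapped, Multiple, Good };

// Classify a mate pair from its per-read mapping counts (readMap::rCNT in
// the code above).  Zero mappings on either read disqualifies the pair as
// 'unpaired'; more than one mapping on either read as 'multiple'.
PairStatus classifyPair(uint32_t cntA, uint32_t cntB) {
  if ((cntA == 0) || (cntB == 0)) return PairStatus::Unmapped;
  if ((cntA > 1)  || (cntB > 1))  return PairStatus::Multiple;
  return PairStatus::Good;
}
```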
gkpStore cannot be opened for writing, because then we can't loadall. // gkpStore = gkStore::gkStore_open(gkpName, false, false); // last arg - false - not writable tigStore = new MultiAlignStore(tigName, tigVers, 0, 0, true, true, false); // Write back to the same version if (loadall) { fprintf(stderr, "Loading all reads into memory.\n"); gkpStore->gkStore_load(0, 0, GKFRAGMENT_QLT); // fails if gkStore is writable } // // Rebuild tigs, stuff them back into the same version. // // Argh, really should convert this to a vector right now.... VA_TYPE(int32) *unGappedOffsets = CreateVA_int32(1024 * 1024); for (uint32 bb=0; bb<RM.size(); bb++) { if ((RM[bb].good == false) || (RM[bb].proc == true)) continue; MultiAlignT *ma = tigStore->loadMultiAlign(RM[bb].tIID, true); uint32 ungapLength = GetMultiAlignUngappedLength(ma); uint32 gapLength = GetMultiAlignLength(ma); vector<uint32> ungapToGap; GetMultiAlignUngapToGap(ma, ungapToGap); //fprintf(stderr, "Loaded UTG %u offset size %lu\n", ma->maID, ungapToGap.size()); //for (uint32 xx=0; xx<ungapToGap.size(); xx++) // fprintf(stderr, "ungap %u -> gap %u\n", xx, ungapToGap[xx]); uint32 readsAdded = 0; for (uint32 ee=bb; ee<RM.size(); ee++) { if ((RM[ee].tIID != ma->maID) || (RM[ee].good == false)) continue; uint32 bgn = ungapToGap[RM[ee].tBGN]; uint32 end = ungapToGap[RM[ee].tEND]; readsAdded++; fprintf(stdout, "bb=%u ee=%u ADD read %u to tig %u at %u,%u (from ungapped %u,%u)\n", bb, ee, RM[ee].rIID, RM[ee].tIID, bgn, end, RM[ee].tBGN, RM[ee].tEND); // Add a read to the tig. IntMultiPos frg; frg.type = AS_READ; frg.ident = RM[ee].rIID; frg.contained = 0; frg.parent = 0; frg.ahang = 0; frg.bhang = 0; frg.position.bgn = (RM[ee].rFWD) ? bgn : end, frg.position.end = (RM[ee].rFWD) ? end : bgn, frg.delta_length = 0; frg.delta = NULL; AppendVA_IntMultiPos(ma->f_list, &frg); // Mark that we've processed this read.
assert(RM[ee].proc == false); RM[ee].proc = true; } fprintf(stderr, "Added %u reads to tig %u (previously %lu reads)\n", readsAdded, ma->maID, GetNumIntMultiPoss(ma->f_list) - readsAdded); if (doConsensus) { fprintf(stderr, "Regenerating consensus.\n"); if (MultiAlignUnitig(ma, gkpStore, &options, NULL)) { if (showResult) PrintMultiAlignT(stdout, ma, gkpStore, false, false, AS_READ_CLEAR_LATEST); } else { fprintf(stderr, "MultiAlignUnitig()-- tig %d failed.\n", ma->maID); numFailures++; } } if (doModify) { fprintf(stderr, "Updating tig %u\n", ma->maID); tigStore->insertMultiAlign(ma, true, false); } } delete gkpStore; delete tigStore; exit(0); } canu-1.6/src/bogart/analyzeBest.C000066400000000000000000000156041314437614700167600ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BOG/analyzeBest.C * * Modifications by: * * Brian P. Walenz from 2010-OCT-09 to 2013-AUG-01 * are Copyright 2010-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-OCT-09 to 2015-APR-10 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "AS_PER_gkpStore.H" #include "splitToWords.H" // Read the 'best.edges' and 'best.contains' outputs from BOG, compute // how many singletons, spurs and contains are present per library. // // Assumes it is run from 4-unitigger. int main(int argc, char **argv) { char *gkpName = 0L; char *bEdge = "best.edges"; char *bCont = "best.contains"; char *bSing = "best.singletons"; int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-g") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-e") == 0) { bEdge = argv[++arg]; } else if (strcmp(argv[arg], "-c") == 0) { bCont = argv[++arg]; } else if (strcmp(argv[arg], "-s") == 0) { bSing = argv[++arg]; } else { err++; } arg++; } if ((err) || (gkpName == 0L)) { fprintf(stderr, "usage: %s -g gkpName [-e best.edges] [-c best.contains]\n", argv[0]); exit(1); } fprintf(stderr, "Opening best edges and contains.\n"); errno = 0; FILE *be = fopen(bEdge, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", bEdge, strerror(errno)), exit(1); FILE *bc = fopen(bCont, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", bCont, strerror(errno)), exit(1); FILE *bs = fopen(bSing, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", bSing, strerror(errno)), exit(1); fprintf(stderr, "Loading read to library mapping.\n"); gkStore *gkp = gkStore::gkStore_open(gkpName, false, false); gkStream *str = new gkStream(gkp, 0, 0, GKFRAGMENT_INF); gkFragment fr; uint32 numFrg = gkp->gkStore_getNumFragments(); uint32 numLib = gkp->gkStore_getNumLibraries(); uint32 *frgToLib = new uint32 [numFrg + 1]; memset(frgToLib, 0, sizeof(uint32) * (numFrg + 1)); uint64 *readPerLib = new uint64 [numLib + 1]; 
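The per-library tallies that follow can be sketched with `std::vector` in place of the raw `new[]`/`memset` arrays (same indexing convention: read and library IIDs are 1-based, so slot 0 is unused):

```cpp
#include <cstdint>
#include <vector>

// Count reads per library.  frgToLib[readIID] gives the library IID of
// each read; IIDs run from 1, so index 0 of both arrays is unused.
std::vector<uint64_t>
tallyPerLib(const std::vector<uint32_t> &frgToLib, uint32_t numLib) {
  std::vector<uint64_t> perLib(numLib + 1, 0);
  for (uint32_t iid = 1; iid < frgToLib.size(); iid++)
    perLib[frgToLib[iid]]++;
  return perLib;
}
```

The vector form gets zero-initialization from the constructor, which is what the paired `memset()` calls accomplish here.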
memset(readPerLib, 0, sizeof(uint64) * (numLib + 1)); uint64 *deldPerLib = new uint64 [numLib + 1]; memset(deldPerLib, 0, sizeof(uint64) * (numLib + 1)); while (str->next(&fr)) { frgToLib[fr.gkFragment_getReadIID()] = fr.gkFragment_getLibraryIID(); if (fr.gkFragment_getIsDeleted()) deldPerLib[fr.gkFragment_getLibraryIID()]++; else readPerLib[fr.gkFragment_getLibraryIID()]++; } delete str; uint64 *cntdPerLib = new uint64 [numLib + 1]; memset(cntdPerLib, 0, sizeof(uint64) * (numLib + 1)); uint64 *cntrPerLib = new uint64 [numLib + 1]; memset(cntrPerLib, 0, sizeof(uint64) * (numLib + 1)); uint32 *cntr = new uint32 [numFrg + 1]; memset(cntr, 0, sizeof(uint32) * (numFrg + 1)); uint64 *singPerLib = new uint64 [numLib + 1]; memset(singPerLib, 0, sizeof(uint64) * (numLib + 1)); uint64 *spu5PerLib = new uint64 [numLib + 1]; memset(spu5PerLib, 0, sizeof(uint64) * (numLib + 1)); uint64 *spu3PerLib = new uint64 [numLib + 1]; memset(spu3PerLib, 0, sizeof(uint64) * (numLib + 1)); uint64 *dovePerLib = new uint64 [numLib + 1]; memset(dovePerLib, 0, sizeof(uint64) * (numLib + 1)); int32 lineMax = 1048576; char *line = new char [lineMax]; splitToWords W; fprintf(stderr, "Processing best.singletons.\n"); while (!feof(bs)) { fgets(line, lineMax, bs); chomp(line); W.split(line); if (line[0] == '#') continue; singPerLib[frgToLib[W(0)]]++; } fprintf(stderr, "Processing best.contains.\n"); while (!feof(bc)) { fgets(line, lineMax, bc); chomp(line); W.split(line); if (line[0] == '#') continue; cntdPerLib[frgToLib[W(0)]]++; cntr[W(3)]++; } for (uint32 i=1; i<=numFrg; i++) { if (cntr[i] > 0) cntrPerLib[frgToLib[i]]++; } fprintf(stderr, "Processing best.edges.\n"); while (!feof(be)) { fgets(line, lineMax, be); chomp(line); W.split(line); if (line[0] == '#') continue; if ((W(2) == 0) && (W(4) == 0)) singPerLib[frgToLib[W(0)]]++; else if (W(2) == 0) spu5PerLib[frgToLib[W(0)]]++; else if (W(4) == 0) spu3PerLib[frgToLib[W(0)]]++; else dovePerLib[frgToLib[W(0)]]++; } fprintf(stderr, "libIID libUID 
#frg #del cnt'd cnt'r sing spur5 spur3 dove\n"); for (uint32 i=0; igkStore_getLibrary(i)->libraryName, readPerLib[i], deldPerLib[i], (deldPerLib[i] == 0) ? 0.0 : 100.0 * deldPerLib[i] / tot, cntdPerLib[i], (cntdPerLib[i] == 0) ? 0.0 : 100.0 * cntdPerLib[i] / tot, cntrPerLib[i], (cntrPerLib[i] == 0) ? 0.0 : 100.0 * cntrPerLib[i] / tot, singPerLib[i], (singPerLib[i] == 0) ? 0.0 : 100.0 * singPerLib[i] / tot, spu5PerLib[i], (spu5PerLib[i] == 0) ? 0.0 : 100.0 * spu5PerLib[i] / tot, spu3PerLib[i], (spu3PerLib[i] == 0) ? 0.0 : 100.0 * spu3PerLib[i] / tot, dovePerLib[i], (dovePerLib[i] == 0) ? 0.0 : 100.0 * dovePerLib[i] / tot); } gkp->gkStore_close(); fclose(be); fclose(bc); delete [] frgToLib; delete [] readPerLib; delete [] deldPerLib; delete [] cntdPerLib; delete [] cntrPerLib; delete [] cntr; delete [] singPerLib; delete [] spu5PerLib; delete [] spu3PerLib; delete [] dovePerLib; delete [] line; return(0); } canu-1.6/src/bogart/bogart.C000066400000000000000000000556471314437614700157700ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/bogart.C * * Modifications by: * * Brian P. Walenz from 2010-NOV-23 to 2014-JAN-29 * are Copyright 2010-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. 
Walenz from 2014-OCT-21 to 2015-AUG-07 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-MAR-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_BAT_ReadInfo.H" #include "AS_BAT_OverlapCache.H" #include "AS_BAT_BestOverlapGraph.H" #include "AS_BAT_ChunkGraph.H" #include "AS_BAT_AssemblyGraph.H" #include "AS_BAT_Logging.H" #include "AS_BAT_Unitig.H" #include "AS_BAT_PopulateUnitig.H" #include "AS_BAT_Instrumentation.H" #include "AS_BAT_PlaceContains.H" #include "AS_BAT_MergeOrphans.H" #include "AS_BAT_MarkRepeatReads.H" #include "AS_BAT_SplitDiscontinuous.H" #include "AS_BAT_DropDeadEnds.H" #include "AS_BAT_PromoteToSingleton.H" #include "AS_BAT_CreateUnitigs.H" #include "AS_BAT_SetParentAndHang.H" #include "AS_BAT_Outputs.H" #include "AS_BAT_TigGraph.H" ReadInfo *RI = 0L; OverlapCache *OC = 0L; BestOverlapGraph *OG = 0L; ChunkGraph *CG = 0L; int main (int argc, char * argv []) { char *gkpStorePath = NULL; char *ovlStorePath = NULL; double erateGraph = 0.075; double erateMax = 0.100; bool filterSuspicious = true; bool filterHighError = true; bool filterLopsided = true; bool filterSpur = true; bool filterDeadEnds = true; uint64 genomeSize = 0; uint32 fewReadsNumber = 2; // Parameters for labeling of unassembled; also set in pipelines/canu/Defaults.pm uint32 tooShortLength = 0; double spanFraction = 1.0; double lowcovFraction = 0.5; uint32 lowcovDepth = 5; double deviationGraph = 6.0; double deviationBubble = 6.0; double deviationRepeat = 3.0; uint32 confusedAbsolute = 5000; double confusedPercent = 500.0; int32 numThreads = 0; uint64 ovlCacheMemory = UINT64_MAX; bool 
doSave = false; char *prefix = NULL; uint32 minReadLen = 0; uint32 minOverlapLen = 500; uint32 minIntersectLen = 500; uint32 maxPlacements = 2; argc = AS_configure(argc, argv); vector err; int arg = 1; while (arg < argc) { if (strcmp(argv[arg], "-o") == 0) { prefix = argv[++arg]; } else if (strcmp(argv[arg], "-G") == 0) { gkpStorePath = argv[++arg]; } else if (strcmp(argv[arg], "-O") == 0) { ovlStorePath = argv[++arg]; } else if (strcmp(argv[arg], "-gs") == 0) { genomeSize = strtoull(argv[++arg], NULL, 10); } else if (strcmp(argv[arg], "-unassembled") == 0) { uint32 invalid = 0; if ((arg + 1 < argc) && (argv[arg + 1][0] != '-')) fewReadsNumber = atoi(argv[++arg]); else invalid++; if ((arg + 1 < argc) && (argv[arg + 1][0] != '-')) tooShortLength = atoi(argv[++arg]); else invalid++; if ((arg + 1 < argc) && (argv[arg + 1][0] != '-')) spanFraction = atof(argv[++arg]); else invalid++; if ((arg + 1 < argc) && (argv[arg + 1][0] != '-')) lowcovFraction = atof(argv[++arg]); else invalid++; if ((arg + 1 < argc) && (argv[arg + 1][0] != '-')) lowcovDepth = atoi(argv[++arg]); else invalid++; if (invalid) { char *s = new char [1024]; snprintf(s, 1024, "Too few parameters to -unassembled option.\n"); err.push_back(s); } } else if ((strcmp(argv[arg], "-mr") == 0) || (strcmp(argv[arg], "-RL") == 0)) { // Deprecated minReadLen = atoi(argv[++arg]); } else if ((strcmp(argv[arg], "-mo") == 0) || (strcmp(argv[arg], "-el") == 0)) { // Deprecated minOverlapLen = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-mi") == 0) { minIntersectLen = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-mp") == 0) { maxPlacements = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-threads") == 0) { if ((numThreads = atoi(argv[++arg])) > 0) omp_set_num_threads(numThreads); } else if (strcmp(argv[arg], "-eg") == 0) { erateGraph = atof(argv[++arg]); } else if (strcmp(argv[arg], "-eM") == 0) { erateMax = atof(argv[++arg]); } else if (strcmp(argv[arg], "-ca") == 0) { // Edge confused, based on absolute 
difference confusedAbsolute = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-cp") == 0) { // Edge confused, based on percent difference confusedPercent = atof(argv[++arg]); } else if (strcmp(argv[arg], "-dg") == 0) { // Deviations, graph deviationGraph = atof(argv[++arg]); } else if (strcmp(argv[arg], "-db") == 0) { // Deviations, bubble deviationBubble = atof(argv[++arg]); } else if (strcmp(argv[arg], "-dr") == 0) { // Deviations, repeat deviationRepeat = atof(argv[++arg]); } else if (strcmp(argv[arg], "-nofilter") == 0) { ++arg; filterSuspicious = ((arg >= argc) || (strcasestr(argv[arg], "suspicious") == NULL)); filterHighError = ((arg >= argc) || (strcasestr(argv[arg], "higherror") == NULL)); filterLopsided = ((arg >= argc) || (strcasestr(argv[arg], "lopsided") == NULL)); filterSpur = ((arg >= argc) || (strcasestr(argv[arg], "spur") == NULL)); filterDeadEnds = ((arg >= argc) || (strcasestr(argv[arg], "deadends") == NULL)); } else if (strcmp(argv[arg], "-M") == 0) { ovlCacheMemory = (uint64)(atof(argv[++arg]) * 1024 * 1024 * 1024); } else if (strcmp(argv[arg], "-save") == 0) { doSave = true; } else if (strcmp(argv[arg], "-D") == 0) { uint32 opt = 0; uint64 flg = 1; bool fnd = false; for (arg++; logFileFlagNames[opt]; flg <<= 1, opt++) { if (strcasecmp(logFileFlagNames[opt], argv[arg]) == 0) { logFileFlags |= flg; fnd = true; } } if (strcasecmp("all", argv[arg]) == 0) { for (flg=1, opt=0; logFileFlagNames[opt]; flg <<= 1, opt++) if (strcasecmp(logFileFlagNames[opt], "stderr") != 0) logFileFlags |= flg; fnd = true; } if (strcasecmp("most", argv[arg]) == 0) { for (flg=1, opt=0; logFileFlagNames[opt]; flg <<= 1, opt++) if ((strcasecmp(logFileFlagNames[opt], "stderr") != 0) && (strcasecmp(logFileFlagNames[opt], "overlapScoring") != 0) && (strcasecmp(logFileFlagNames[opt], "errorProfiles") != 0) && (strcasecmp(logFileFlagNames[opt], "chunkGraph") != 0) && (strcasecmp(logFileFlagNames[opt], "setParentAndHang") != 0)) logFileFlags |= flg; fnd = true; } if (fnd == 
false) { char *s = new char [1024]; snprintf(s, 1024, "Unknown '-D' option '%s'.\n", argv[arg]); err.push_back(s); } } else if (strcmp(argv[arg], "-d") == 0) { uint32 opt = 0; uint64 flg = 1; bool fnd = false; for (arg++; logFileFlagNames[opt]; flg <<= 1, opt++) { if (strcasecmp(logFileFlagNames[opt], argv[arg]) == 0) { logFileFlags &= ~flg; fnd = true; } } if (fnd == false) { char *s = new char [1024]; snprintf(s, 1024, "Unknown '-d' option '%s'.\n", argv[arg]); err.push_back(s); } } else { char *s = new char [1024]; snprintf(s, 1024, "Unknown option '%s'.\n", argv[arg]); err.push_back(s); } arg++; } if (erateGraph < 0.0) err.push_back("Invalid overlap error threshold (-eg option); must be at least 0.0.\n"); if (erateMax < 0.0) err.push_back("Invalid overlap error threshold (-eM option); must be at least 0.0.\n"); if (prefix == NULL) err.push_back("No output prefix name (-o option) supplied.\n"); if (gkpStorePath == NULL) err.push_back("No gatekeeper store (-G option) supplied.\n"); if (ovlStorePath == NULL) err.push_back("No overlap store (-O option) supplied.\n"); if (err.size() > 0) { fprintf(stderr, "usage: %s -o outputName -O ovlStore -G gkpStore -T tigStore\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, " -O Mandatory path to an ovlStore.\n"); fprintf(stderr, " -G Mandatory path to a gkpStore.\n"); fprintf(stderr, " -T Mandatory path to a tigStore (can exist or not).\n"); fprintf(stderr, " -o prefix Mandatory name for the output files\n"); fprintf(stderr, "\n"); fprintf(stderr, "Algorithm Options\n"); fprintf(stderr, "\n"); fprintf(stderr, " -gs Genome size in bases.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -mr len Force reads below 'len' bases to be singletons.\n"); fprintf(stderr, " -mo len Ignore overlaps shorter than 'len' bases.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -mi len Create unitigs from contig intersections of at least 'len' bases.\n"); fprintf(stderr, " -mp num Create unitigs from contig intersections with at most 'num' 
placements.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -nofilter [suspicious],[higherror],[lopsided],[spur]\n"); fprintf(stderr, " Disable filtering of:\n"); fprintf(stderr, " suspicious - reads that have a suspicious lack of overlaps\n"); fprintf(stderr, " higherror - overlaps that have error rates well outside the observed\n"); fprintf(stderr, " lopsided - reads that have unusually asymmetric best overlaps\n"); fprintf(stderr, " spur - reads that have no overlaps on one end\n"); fprintf(stderr, " The value supplied to -nofilter must be one word, order and punctuation\n"); fprintf(stderr, " do not matter. The following examples behave the same:\n"); fprintf(stderr, " '-nofilter suspicious,higherror'\n"); fprintf(stderr, " '-nofilter suspicious-and-higherror'\n"); fprintf(stderr, "\n"); fprintf(stderr, " -threads N Use N compute threads during repeat detection.\n"); fprintf(stderr, " 0 - use OpenMP default (default)\n"); fprintf(stderr, " 1 - use one thread\n"); fprintf(stderr, "\n"); fprintf(stderr, "Overlap Selection - an overlap will be considered for use in a unitig under\n"); fprintf(stderr, " the following conditions:\n"); fprintf(stderr, "\n"); fprintf(stderr, " When constructing the Best Overlap Graph and Greedy tigs ('g'raph):\n"); fprintf(stderr, " -eg 0.020 no more than 0.020 fraction (2.0%%) error ** DEPRECATED **\n"); fprintf(stderr, "\n"); fprintf(stderr, " When loading overlaps, an inflated maximum (to allow reruns with different error rates):\n"); fprintf(stderr, " -eM 0.05 no more than 0.05 fraction (5.0%%) error in any overlap loaded into bogart\n"); fprintf(stderr, " the maximum used will ALWAYS be at least the maximum of the four error rates\n"); fprintf(stderr, "\n"); fprintf(stderr, "Overlap Storage\n"); fprintf(stderr, "\n"); fprintf(stderr, " -M gb Use at most 'gb' gigabytes of memory for storing overlaps.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -save Save the overlap graph to disk, and continue.\n"); fprintf(stderr, "\n");
fprintf(stderr, "Debugging and Logging\n"); fprintf(stderr, "\n"); fprintf(stderr, " -D enable logging/debugging for a specific component.\n"); fprintf(stderr, " -d disable logging/debugging for a specific component.\n"); for (uint32 l=0; logFileFlagNames[l]; l++) fprintf(stderr, " %s\n", logFileFlagNames[l]); fprintf(stderr, "\n"); for (uint32 ii=0; ii<err.size(); ii++) if (err[ii]) fputs(err[ii], stderr); exit(1); } fprintf(stderr, "\n"); fprintf(stderr, "==> PARAMETERS.\n"); fprintf(stderr, "\n"); fprintf(stderr, "Resources:\n"); fprintf(stderr, " Memory " F_U64 " GB\n", ovlCacheMemory >> 30); fprintf(stderr, " Compute Threads %d (%s)\n", omp_get_max_threads(), (numThreads > 0) ? "command line" : "OpenMP default"); fprintf(stderr, "\n"); fprintf(stderr, "Lengths:\n"); fprintf(stderr, " Minimum read %u bases\n", minReadLen); fprintf(stderr, " Minimum overlap %u bases\n", minOverlapLen); fprintf(stderr, "\n"); fprintf(stderr, "Overlap Error Rates:\n"); fprintf(stderr, " Graph %.3f (%.3f%%)\n", erateGraph, erateGraph * 100); fprintf(stderr, " Max %.3f (%.3f%%)\n", erateMax, erateMax * 100); fprintf(stderr, "\n"); fprintf(stderr, "Deviations:\n"); fprintf(stderr, " Graph %.3f\n", deviationGraph); fprintf(stderr, " Bubble %.3f\n", deviationBubble); fprintf(stderr, " Repeat %.3f\n", deviationRepeat); fprintf(stderr, "\n"); fprintf(stderr, "Edge Confusion:\n"); fprintf(stderr, " Absolute %d\n", confusedAbsolute); fprintf(stderr, " Percent %.4f\n", confusedPercent); fprintf(stderr, "\n"); fprintf(stderr, "Unitig Construction:\n"); fprintf(stderr, " Minimum intersection %u bases\n", minIntersectLen); fprintf(stderr, " Maximum placements %u positions\n", maxPlacements); fprintf(stderr, "\n"); fprintf(stderr, "Debugging Enabled:\n"); if (logFileFlags == 0) fprintf(stderr, " (none)\n"); for (uint64 i=0, j=1; i<64; i++, j<<=1) if (logFileFlagSet(j)) fprintf(stderr, " %s\n", logFileFlagNames[i]); writeStatus("\n"); writeStatus("==> LOADING AND FILTERING OVERLAPS.\n"); writeStatus("\n"); setLogFile(prefix, "filterOverlaps"); RI = new ReadInfo(gkpStorePath, prefix, minReadLen); OC =
new OverlapCache(ovlStorePath, prefix, MAX(erateMax, erateGraph), minOverlapLen, ovlCacheMemory, genomeSize, doSave); OG = new BestOverlapGraph(erateGraph, deviationGraph, prefix, filterSuspicious, filterHighError, filterLopsided, filterSpur); CG = new ChunkGraph(prefix); // // Build the initial unitig path from non-contained reads. The first pass is usually the // only one needed, but occasionally (maybe) we miss reads, so we make an explicit pass // through all reads and place whatever isn't already placed. // TigVector contigs(RI->numReads()); // Both initial greedy tigs and final contigs TigVector unitigs(RI->numReads()); // The 'final' contigs, split at every intersection in the graph writeStatus("\n"); writeStatus("==> BUILDING GREEDY TIGS.\n"); writeStatus("\n"); setLogFile(prefix, "buildGreedy"); for (uint32 fi=CG->nextReadByChunkLength(); fi>0; fi=CG->nextReadByChunkLength()) populateUnitig(contigs, fi); delete CG; CG = NULL; breakSingletonTigs(contigs); // populateUnitig() uses only one hang from one overlap to compute the positions of reads. // Once all reads are (approximately) placed, compute positions using all overlaps. contigs.optimizePositions(prefix, "buildGreedy"); //reportOverlaps(contigs, prefix, "buildGreedy"); reportTigs(contigs, prefix, "buildGreedy", genomeSize); // // For future use, remember the reads in contigs. When we make unitigs, we'll // require that every unitig end with one of these reads -- this will let // us reconstruct contigs from the unitigs. // for (uint32 fid=1; fid<RI->numReads()+1; fid++) // This really should be incorporated if (contigs.inUnitig(fid) != 0) // into populateUnitig() RI->setBackbone(fid); // // Place contained reads.
// writeStatus("\n"); writeStatus("==> PLACE CONTAINED READS.\n"); writeStatus("\n"); setLogFile(prefix, "placeContains"); //contigs.computeArrivalRate(prefix, "initial"); contigs.computeErrorProfiles(prefix, "initial"); contigs.reportErrorProfiles(prefix, "initial"); placeUnplacedUsingAllOverlaps(contigs, prefix); // Compute positions again. This fixes issues with contains-in-contains that // tend to excessively shrink reads. The one case debugged placed contains in // a three read nanopore contig, where one of the contained reads shrank by 10%, // which was enough to swap bgn/end coords when they were computed using hangs // (that is, sum of the hangs was bigger than the placed read length). contigs.optimizePositions(prefix, "placeContains"); //reportOverlaps(contigs, prefix, "placeContains"); reportTigs(contigs, prefix, "placeContains", genomeSize); // // Merge orphans. // writeStatus("\n"); writeStatus("==> MERGE ORPHANS.\n"); writeStatus("\n"); setLogFile(prefix, "mergeOrphans"); contigs.computeErrorProfiles(prefix, "unplaced"); contigs.reportErrorProfiles(prefix, "unplaced"); mergeOrphans(contigs, deviationBubble); //checkUnitigMembership(contigs); //reportOverlaps(contigs, prefix, "mergeOrphans"); reportTigs(contigs, prefix, "mergeOrphans", genomeSize); // // Initial construction done. Classify what we have as assembled or unassembled. // classifyTigsAsUnassembled(contigs, fewReadsNumber, tooShortLength, spanFraction, lowcovFraction, lowcovDepth); // // Generate a new graph using only edges that are compatible with existing tigs. // writeStatus("\n"); writeStatus("==> GENERATING ASSEMBLY GRAPH.\n"); writeStatus("\n"); setLogFile(prefix, "assemblyGraph"); contigs.computeErrorProfiles(prefix, "assemblyGraph"); contigs.reportErrorProfiles(prefix, "assemblyGraph"); AssemblyGraph *AG = new AssemblyGraph(prefix, deviationRepeat, contigs); AG->reportReadGraph(contigs, prefix, "initial"); // // Detect and break repeats. 
Annotate each read with overlaps to reads not overlapping in the tig, // project these regions back to the tig, and break unless there is a read spanning the region. // writeStatus("\n"); writeStatus("==> BREAK REPEATS.\n"); writeStatus("\n"); setLogFile(prefix, "breakRepeats"); contigs.computeErrorProfiles(prefix, "repeats"); contigs.reportErrorProfiles(prefix, "repeats"); vector<confusedEdge> confusedEdges; markRepeatReads(AG, contigs, deviationRepeat, confusedAbsolute, confusedPercent, confusedEdges); //checkUnitigMembership(contigs); //reportOverlaps(contigs, prefix, "markRepeatReads"); reportTigs(contigs, prefix, "markRepeatReads", genomeSize); // // Cleanup tigs. Break those that have gaps in them. Place contains again. For any read // still unplaced, make it a singleton unitig. // writeStatus("\n"); writeStatus("==> CLEANUP MISTAKES.\n"); writeStatus("\n"); setLogFile(prefix, "cleanupMistakes"); splitDiscontinuous(contigs, minOverlapLen); promoteToSingleton(contigs); if (filterDeadEnds) { dropDeadEnds(AG, contigs); splitDiscontinuous(contigs, minOverlapLen); promoteToSingleton(contigs); } writeStatus("\n"); writeStatus("==> CLEANUP GRAPH.\n"); writeStatus("\n"); AG->rebuildGraph(contigs); AG->filterEdges(contigs); writeStatus("\n"); writeStatus("==> GENERATE OUTPUTS.\n"); writeStatus("\n"); setLogFile(prefix, "generateOutputs"); //checkUnitigMembership(contigs); reportOverlaps(contigs, prefix, "final"); reportTigs(contigs, prefix, "final", genomeSize); AG->reportReadGraph(contigs, prefix, "final"); delete AG; AG = NULL; // // unitigSource: // // We want some way of tracking unitigs that came from the same contig. Ideally, // we'd be able to emit only the edges that would join unitigs into the original // contig, but it's complicated by containments.
For example: // // [----------------------------------] CONTIG // ------------- UNITIG // -------------------------- UNITIG // ------- UNITIG // // So, instead, we just remember the set of unitigs that were created from each // contig, and assume that any edge between those unitigs represents the contig. // Which it totally doesn't -- any repeat in the contig collapses -- but is a // good first attempt. // vector<tigLoc> unitigSource; // The graph must come first, to find circular contigs. reportTigGraph(contigs, unitigSource, prefix, "contigs"); setParentAndHang(contigs); writeTigsToStore(contigs, prefix, "ctg", true); setLogFile(prefix, "tigGraph"); writeStatus("\n"); writeStatus("==> GENERATE UNITIGS.\n"); writeStatus("\n"); setLogFile(prefix, "generateUnitigs"); contigs.computeErrorProfiles(prefix, "generateUnitigs"); contigs.reportErrorProfiles(prefix, "generateUnitigs"); createUnitigs(contigs, unitigs, minIntersectLen, maxPlacements, confusedEdges, unitigSource); splitDiscontinuous(unitigs, minOverlapLen, unitigSource); reportTigGraph(unitigs, unitigSource, prefix, "unitigs"); setParentAndHang(unitigs); writeTigsToStore(unitigs, prefix, "utg", true); // // Tear down bogart. // // How bizarre. Human regression of 2017-07-28-2128 deadlocked (apparently) when deleting OC. // It had 31 threads in futex_wait, thread 1 was in delete of the second block of data. CPU // usage was 100% IIRC. Reproducible, at least twice, possibly three times. setLogFilePrefix // was moved before the deletes in hope that it'll close down threads. Certainly, it should // close thread output files from createUnitigs. setLogFile(prefix, NULL); // Close files. omp_set_num_threads(1); // Hopefully kills off other threads.
delete CG; delete OG; delete OC; delete RI; writeStatus("\n"); writeStatus("Bye.\n"); return(0); } canu-1.6/src/bogart/bogart.mk000066400000000000000000000025301314437614700161740ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := bogart SOURCES := bogart.C \ AS_BAT_AssemblyGraph.C \ AS_BAT_BestOverlapGraph.C \ AS_BAT_ChunkGraph.C \ AS_BAT_CreateUnitigs.C \ AS_BAT_DropDeadEnds.C \ AS_BAT_Instrumentation.C \ AS_BAT_Logging.C \ AS_BAT_MarkRepeatReads.C \ AS_BAT_MergeOrphans.C \ AS_BAT_OptimizePositions.C \ AS_BAT_Outputs.C \ AS_BAT_OverlapCache.C \ AS_BAT_PlaceContains.C \ AS_BAT_PlaceReadUsingOverlaps.C \ AS_BAT_PopulateUnitig.C \ AS_BAT_PromoteToSingleton.C \ AS_BAT_ReadInfo.C \ AS_BAT_SetParentAndHang.C \ AS_BAT_SplitDiscontinuous.C \ AS_BAT_TigGraph.C \ AS_BAT_TigVector.C \ AS_BAT_Unitig.C \ AS_BAT_Unitig_AddRead.C \ AS_BAT_Unitig_PlaceReadUsingEdges.C SRC_INCDIRS := .. ../AS_UTL ../stores TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/bogart/findOverlappingReads.pl000066400000000000000000000064511314437614700210360ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. 
# # Modifications by: # # Brian P. Walenz beginning on 2016-AUG-05 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## use strict; if (scalar(@ARGV) == 0) { die "usage: $0 assembly-prefix readID [tigStore-bogart-stage]\n"; } my $prefix = shift @ARGV; my $readID = shift @ARGV; my $store = shift @ARGV; my $gkpStore = "$prefix.gkpStore"; my $ovlStore = "$prefix.ovlStore"; my $tigStore = "$prefix.tigStore"; my $tigVers = 1; $tigStore = "$prefix.003.buildUnitigs.tigStore" if ($store eq "003"); $tigStore = "$prefix.004.placeContains.tigStore" if ($store eq "004"); $tigStore = "$prefix.005.mergeOrphans.tigStore" if ($store eq "005"); $tigStore = "$prefix.007.breakRepeats.tigStore" if ($store eq "007"); $tigStore = "$prefix.009.generateOutputs.tigStore" if ($store eq "009"); $gkpStore = "../$gkpStore" if (! -d $gkpStore); $gkpStore = "../$gkpStore" if (! -d $gkpStore); $gkpStore = "../$gkpStore" if (! -d $gkpStore); $ovlStore = "../$ovlStore" if (! -d $ovlStore); $ovlStore = "../$ovlStore" if (! -d $ovlStore); $ovlStore = "../$ovlStore" if (! -d $ovlStore); $tigStore = "../$tigStore" if (! -d $tigStore); $tigStore = "../$tigStore" if (! -d $tigStore); $tigStore = "../$tigStore" if (! -d $tigStore); die "failed to find gkpStore $prefix.gkpStore" if (! -d $gkpStore); die "failed to find ovlStore $prefix.ovlStore" if (! -d $ovlStore); die "failed to find tigStore $prefix.tigStore" if (! 
-d $tigStore); my %readOvl; my $nOvl = 0; my $nTig = 0; open(F, "ovStoreDump -G $gkpStore -O $ovlStore -p $readID |"); while (<F>) { chomp; # For -d dumps if (m/^\s*\d+\s+(\d+)\s+/) { $nOvl++; $readOvl{$1} = $_; } # For -p dumps if (m/^\s*(\d+)\s+A:\s+\d+\s+/) { $nOvl++; $readOvl{$1} = $_; } } close(F); system("ovStoreDump -G $gkpStore -O $ovlStore -p $readID"); print "\n"; my $tig; my $len; my $num; open(F, "tgStoreDump -G $gkpStore -T $tigStore $tigVers -layout |"); while (<F>) { chomp; $tig = $1 if (m/^tig\s+(\d+)$/); $len = $1 if (m/^len\s+(\d+)$/); $num = $1 if (m/^numChildren\s+(\d+)$/); if (m/^read\s+(\d+)\s+/) { my $r = $1; if ($r == $readID) { print "tig $tig len $len -- $_\n"; } elsif (exists($readOvl{$1})) { $nTig++; printf "tig %6d len %8d -- %s -- %s\n", $tig, $len, $_, $readOvl{$1}; } } } close(F); print "\n"; print STDERR "Found $nOvl overlaps in $ovlStore (with $gkpStore).\n"; print STDERR "Found $nTig placements in $tigStore.\n"; canu-1.6/src/bogart/plotErrorProfile.pl000066400000000000000000000033321314437614700202340ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. Walenz beginning on 2016-JUN-06 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license.
## use strict; my $nn = shift @ARGV; my $pn = shift @ARGV; die "usage: $0 \n" if (!defined($pn)); my $name = "$nn." . substr("00000000$pn", -8) . ".profile"; print "Plotting '$nn' '$pn' - '$name'\n"; my $lastX = 0; my $lastY = 0; open(O, "> $name.dat") or die; open(F, "< $name") or die; while (<F>) { if (m/^(\d+)\s+(\d+)\s+(\d+.\d+)\s+\+-\s+(\d+.\d+)\s\(\d+\s+overlaps\)/) { print O "$lastX\t$lastY\n"; print O "$1\t$3\n"; print O "\n"; print O "$1\t$3\n"; print O "$2\t$3\n"; print O "\n"; $lastX = $2; $lastY = $3; } } close(F); print O "$lastX\t$lastY\n"; print O "$lastX\t0\n"; print O "\n"; close(O); open(O, "> $name.gp"); print O "plot '$name.dat' with lines\n"; print O "pause -1\n"; close(O); system("gnuplot $name.gp"); canu-1.6/src/bogus/000077500000000000000000000000001314437614700142265ustar00rootroot00000000000000canu-1.6/src/bogus/bogus-genome.css000066400000000000000000000376001314437614700173350ustar00rootroot00000000000000html, body { height: 98%; } .nav { vertical-align: middle; z-index: 10; } input.icon { display: -moz-inline-stack; display: inline-block; zoom: 1; *display: inline; } div.container { position: absolute; z-index: 0; } div.dragWindow { position:absolute; overflow: hidden; z-index: 1; cursor: url("openhand.cur"); } div.locationTrap { position: absolute; background-color: #BDD7FF; border-color: white white #BDD7FF white; border-style: solid; width: 0px; height: 0px; line-height: 0px; z-index: -10; } div.locationThumb { position: absolute; top: 0px; /* if you change this border from 2px, change GenomeView.showTrap */ border: 2px solid red; margin: 0px -2px 0px -2px; cursor: url("openhand.cur"); } div.locationThumb.dojoMoveItem { cursor: url("closedhand.cur"); } div.overview { width: 100%; border-style: solid; border-width: 5px 0px 5px 0px; border-color: #aaa; color:#aaa; font-family: sans-serif; text-align: center; z-index: -5; } div.block { position: absolute; overflow: visible; top: 0px; height: 100%; } div.track { position: absolute; left:
0px; width: 100%; z-index: 5; } .track.dojoDndItemOver { cursor: inherit; } .track.dojoDndItemBefore { border-top: 3px solid #999; margin-top: -3px; } .track.dojoDndItemAfter { border-bottom: 3px solid #999; margin-bottom: -3px; } div#static_track { top: 0px; position: absolute; background-color: #f0f0f0; z-index: 20; } div.gridline { position: absolute; top: 0px; height: 100%; border-style: none none none solid; border-width: 1px; border-color: #ddd; } div.pos-label { position: absolute; left: 0px; background-color: #ddd; z-index: 100; padding: 4px; font-family: sans-serif; } div.overview-pos { position: absolute; left: 0px; color: black; padding-left: 4px; font-family: sans-serif; border-style: solid; border-color: black; border-width: 0px 0px 0px 1px; } div.blank-block { font: sans-serif; position: absolute; overflow: visible; top: 0px; height: 100%; background-color: #eee; z-index: 19; } div.sequence { position: absolute; left: 0px; font-family: monospace; letter-spacing: 2px; padding-left: 2px; } /* div.minus-feature:before { content: attr(fName); display: block; } div.plus-feature:before { content: attr(fName); display: inline; } */ div.track-label { font-family: sans-serif; z-index: 20; background-color: #BDD7FF; border: 2px #2b434c solid; color: #2b434c; padding: 5px; cursor: pointer; } div.tracklist-label { font-family: sans-serif; z-index: 20; background-color: #BDD7FF; border: 2px #2b434c solid; color: #2b434c; cursor: pointer; padding: 3px; } /* commented for now, multi-select too confusing? 
.tracklist-container.dojoDndItemSelected { background: #ddf; } .tracklist-container.dojoDndItemAnchor { background: #ddf; } */ div.tracklist-container { padding: 5px; margin-top: -3px; margin-bottom: -3px; } .tracklist-container.dojoDndItemBefore { border-top: 3px solid #999; padding-top: 2px; } .tracklist-container.dojoDndItemAfter { border-bottom: 3px solid #999; padding-bottom: 2px; } .feature-label { position: absolute; font-family: monospace; border: 0px; margin: -2px 0px 0px 0px; /* padding: 0px 0px 2px 0px; for more space below labels */ padding: 0px 0px 0px 0px; /* font-size: 80%; */ white-space: nowrap; background-color: #eee; z-index: 10; cursor: pointer; } .basic, .plus-basic, .minus-basic { position: absolute; cursor: pointer; z-index: 10; min-width: 1px; } div.basic-hist { position: absolute; z-index: 10; } .plus-feature, .minus-feature { position:absolute; height: 8px; background-repeat: repeat-x; cursor: pointer; min-width: 1px; z-index: 10; } .plus-feature { background-image: url('img/plus-chevron3.png'); } .minus-feature { background-image: url('img/minus-chevron3.png'); } div.feature-hist { position: absolute; background-color: blue; border-color: lightblue; border-style: solid; border-width: 1px; z-index: 10; } .plus-feature2, .minus-feature2 { position:absolute; height: 15px; background-repeat: repeat-x; cursor: pointer; min-width: 1px; z-index: 10; } .plus-feature2 { background-image: url('img/plus-herringbone16.png'); } .minus-feature2 { background-image: url('img/minus-herringbone16.png'); } div.feature2-hist { position: absolute; background-color: #9f9; border-color: #ada; border-style: solid; border-width: 1px; z-index: 10; } .plus-feature3, .minus-feature3 { position:absolute; height: 8px; background-repeat: repeat-x; cursor: pointer; min-width: 1px; z-index: 10; } .plus-feature3 { background-image: url('img/plus-chevron.png'); } .minus-feature3 { background-image: url('img/minus-chevron.png'); } div.feature3-hist { position: absolute; 
background-color: yellow; border-color: black; border-style: solid; border-width: 1px; z-index: 10; } .plus-feature4, .minus-feature4 { position:absolute; height: 12px; background-repeat: repeat-x; cursor: pointer; min-width: 1px; z-index: 10; } .plus-feature4 { background-image: url('img/plus-pacman.png'); } .minus-feature4 { background-image: url('img/minus-pacman.png'); } div.feature4-hist { position: absolute; background-color: yellow; border-color: black; border-style: solid; border-width: 1px; z-index: 10; } .plus-feature5, .minus-feature5 { position:absolute; height: 8px; background-repeat: repeat-x; cursor: pointer; min-width: 1px; z-index: 10; } .plus-feature5 { background-image: url('img/plus-chevron2.png'); } .minus-feature5 { background-image: url('img/minus-chevron2.png'); } div.feature5-hist { position: absolute; background-color: blue; border-color: lightblue; border-style: solid; border-width: 1px; z-index: 10; } div.exon-hist { position: absolute; background-color: #4B76E8; border-style: solid; border-color: #00f; border-width: 1px; z-index: 10; } .plus-exon, .minus-exon { position: absolute; height: 5px; background-color: #4B76E8; border-style: solid; border-color: #00f; border-width: 1px; cursor: pointer; z-index: 10; } div.est-hist { position: absolute; background-color: #ED9185; border-style: solid; border-color: #c33; border-width: 1px; z-index: 10; } .plus-est, .minus-est { position: absolute; height: 5px; background-color: #ED9185; border-style: solid; border-color: #c33; border-width: 1px; cursor: pointer; z-index: 10; } .plus-dblhelix, .minus-dblhelix { position:absolute; height: 11px; background-image: url('img/dblhelix-red.png'); background-repeat: repeat-x; min-width: 1px; cursor: pointer; z-index: 10; } div.dblhelix-hist { position: absolute; background-color: #fcc; border-color: #daa; border-style: solid; border-width: 1px; z-index: 10; } .plus-helix, .minus-helix { position:absolute; height: 12px; background-image: 
url('img/helix3-green.png'); background-repeat: repeat-x; min-width: 1px; cursor: pointer; z-index: 10; } div.helix-hist { position: absolute; background-color: #cfc; border-color: #ada; border-style: solid; border-width: 1px; z-index: 10; } .loops { position:absolute; height: 13px; background-image: url('img/loops.png'); background-repeat: repeat-x; cursor: pointer; } .plus-cds0, .plus-cds1, .plus-cds2, .minus-cds0, .minus-cds1, .minus-cds2 { position:absolute; height: 13px; background-repeat: repeat-x; cursor: pointer; min-width: 1px; } .plus-cds0 { background-image: url('img/plus-cds0.png'); } .plus-cds1 { background-image: url('img/plus-cds1.png'); } .plus-cds2 { background-image: url('img/plus-cds2.png'); } .minus-cds0 { background-image: url('img/minus-cds0.png'); } .minus-cds1 { background-image: url('img/minus-cds1.png'); } .minus-cds2 { background-image: url('img/minus-cds2.png'); } div.cds-hist { position: absolute; background-color: #fcc; border-color: #daa; border-style: solid; border-width: 1px; z-index: 10; } .topbracket { position:absolute; height: 8px; border-style: solid solid none solid; /* border-width: 2px 2px 0px 2px; */ border-width: 2px; border-color: orange; /* margin-top: 2px */ cursor: pointer; } .bottombracket { position:absolute; height: 8px; border-style: none solid solid solid; border-width: 2px; border-color: green; cursor: pointer; } .hourglass { position:absolute; height: 0px; border-style: solid; border-width: 6px 3px 6px 3px; cursor: pointer; } .plus-triangle { position:absolute; height: 0px; border-style: solid; border-width: 6px 3px 0px 3px; cursor: pointer; } .minus-triangle { position:absolute; height: 0px; border-style: solid; border-width: 0px 3px 6px 3px; cursor: pointer; } .triangle { position:absolute; height: 0px; border-style: solid; border-width: 6px 0px 0px 0px; cursor: pointer; } .hgred { border-color: #f99 white #f99 white; } div.hgred-hist { position: absolute; background-color: #daa; border-color: #d44; 
border-style: solid; border-width: 1px; z-index: 10; } .hgblue { border-color: #99f white #99f white; } div.hgblue-hist { position: absolute; background-color: #aad; border-color: #99f; border-style: solid; border-width: 1px; z-index: 10; } .ibeam { position:absolute; height: 2px; background-color: blue; border-style: solid; border-width: 8px 4px 8px 4px; border-color: white blue white blue; cursor: pointer; } div.transcript-hist { position: absolute; background-color: #ddd; border-color: #FF9185; border-style: solid; border-width: 1px; z-index: 10; } .transcript, .plus-transcript, .minus-transcript { position: absolute; height: 4px; margin-top: 4px; margin-bottom: 4px; background-color: #999; z-index: 6; min-width: 1px; cursor: pointer; } .plus-transcript-arrowhead { position: absolute; /* border stuff seems slow height: 0px; width: 0px; margin-top: -4px; border-style: solid; border-color: white white white #999; border-width: 6px 0px 6px 10px; */ margin-top: -4px; width: 12px; height: 12px; background-image: url('img/plus-transcript-head.png'); background-repeat: no-repeat; } .minus-transcript-arrowhead { position: absolute; /* border stuff seems slow height: 0px; width: 0px; margin-top: -4px; border-style: solid; border-color: white #999 white white; border-width: 6px 10px 6px 0px; */ margin-top: -4px; width: 12px; height: 12px; background-image: url('img/minus-transcript-head.png'); background-repeat: no-repeat; } .plus-transcript-CDS, .minus-transcript-CDS { position: absolute; height: 12px; margin-top: -4px; background-image: url('img/cds.png'); background-repeat: repeat-x; /* border-width: 2px 0px 3px 0px; border-style: solid; border-color: white; background-color: #FF9185; border-style: solid; border-color: #00f; border-width: 1px;*/ cursor: pointer; z-index: 10; min-width: 1px; } .plus-transcript-exon, .minus-transcript-exon, .plus-transcript-UTR, .minus-transcript-UTR, .plus-transcript-five_prime_UTR, .minus-transcript-five_prime_UTR, 
.plus-transcript-three_prime_UTR, .minus-transcript-three_prime_UTR { position: absolute; height: 4px; margin-top: -2px; background-color: #B66; border-style: solid; border-color: #D88; border-width: 2px 0px 2px 0px; z-index: 8; min-width: 1px; cursor: pointer; } .generic_parent, .plus-generic_parent, .minus-generic_parent { position: absolute; height: 4px; margin-top: 2px; margin-bottom: 2px; background-color: #AAA; z-index: 6; min-width: 1px; cursor: pointer; } div.generic_parent-hist { position: absolute; background-color: #ddd; border-color: #555; border-style: solid; border-width: 1px; z-index: 10; } .match_part, .plus-match_part, .minus-match_part { position: absolute; height: 4px; margin-top: -2px; background-color: #66B; border-style: solid; border-color: #88D; border-width: 2px 0px 2px 0px; z-index: 8; min-width: 1px; cursor: pointer; } .generic_part_a, .plus-generic_part_a, .minus-generic_part_a { position: absolute; height: 4px; margin-top: -2px; background-color: #6B6; border-style: solid; border-color: #8D8; border-width: 2px 0px 2px 0px; z-index: 8; min-width: 1px; cursor: pointer; } /* * Definitions for jbrowse to show BOGUS and BOGUSNESS results. 
*/ /* * BOGUS classes */ .frag_align { position: absolute; height: 4px; margin-top: 2px; margin-bottom: 2px; z-index: 6; min-width: 1px; cursor: pointer; background-color: #AAA; } .plus-frag_align { position: absolute; height: 4px; margin-top: 2px; margin-bottom: 2px; z-index: 6; min-width: 1px; cursor: pointer; background-color: #EB2A2A; } .minus-frag_align { position: absolute; height: 4px; margin-top: 2px; margin-bottom: 2px; z-index: 6; min-width: 1px; cursor: pointer; background-color: #2AEB2A; } div.frag_align-hist { position: absolute; background-color: #ddd; border-color: #555; border-style: solid; border-width: 1px; z-index: 10; } .rept_interval { position: absolute; height: 4px; margin-top: 2px; margin-bottom: 2px; z-index: 6; min-width: 1px; cursor: pointer; background-color: #1010bb; } .uniq_interval { position: absolute; height: 4px; margin-top: 2px; margin-bottom: 2px; z-index: 6; min-width: 1px; cursor: pointer; background-color: #10bb10; } .weak_interval { position: absolute; height: 8px; margin-top: -3px; margin-bottom: 2px; z-index: 6; min-width: 1px; cursor: pointer; border-color: #bb1010; border-style: solid; border-width: 1px; } .sepr_interval { position: absolute; height: 8px; margin-top: -3px; margin-bottom: 2px; z-index: 6; min-width: 1px; cursor: pointer; border-color: #bb1010; border-style: solid; border-width: 1px; } /* * BOGUSNESS classes */ .bogusness_span { position: absolute; height: 4px; margin-top: 2px; margin-bottom: 2px; z-index: 6; min-width: 1px; cursor: pointer; background-color: #AAA; } .plus-bogusness_match { position: absolute; height: 8px; margin-top: -2px; margin-bottom: 2px; z-index: 6; min-width: 1px; cursor: pointer; background-color: #ee2222; } .minus-bogusness_match { position: absolute; height: 8px; margin-top: -2px; margin-bottom: 2px; z-index: 6; min-width: 1px; cursor: pointer; background-color: #22ee22; } .plus-bogusness_repeat { position: absolute; height: 8px; margin-top: -2px; margin-bottom: 2px; z-index: 6;
min-width: 1px; cursor: pointer; border-color: #10bb10; border-style: solid; border-width: 2px; background-color: #ee2222; } .minus-bogusness_repeat { position: absolute; height: 8px; margin-top: -2px; margin-bottom: 2px; z-index: 6; min-width: 1px; cursor: pointer; border-color: #bb1010; border-style: solid; border-width: 2px; background-color: #22ee22; } canu-1.6/src/bogus/bogus-genome.json000066400000000000000000000051471314437614700175160ustar00rootroot00000000000000 { "description": "Bogus Results", "db_adaptor": "Bio::DB::SeqFeature::Store", "db_args": { "-adaptor": "memory", "-dir": "."}, "TRACK DEFAULTS": { "class": "feature" }, "tracks": [ { "track": "BOGUS_raw_input", "key": "BOGUS raw input", "feature": ["bogus_raw_input"], "autocomplete": "all", "class": "frag_align" }, { "track": "BOGUS_span_input", "key": "BOGUS span input", "feature": ["bogus_span_input"], "autocomplete": "all", "class": "frag_align" }, { "track": "BOGUS_rept_input", "key": "BOGUS rept input", "feature": ["bogus_rept_input"], "autocomplete": "all", "class": "frag_align" }, { "track": "BOGUS_uniq_input", "key": "BOGUS uniq input", "feature": ["bogus_uniq_input"], "autocomplete": "all", "class": "frag_align" }, { "track": "BOGUS_rept_interval", "key": "BOGUS rept interval", "feature": ["bogus_rept_interval"], "autocomplete": "all", "class": "rept_interval", "subfeatures": true, "subfeature_classes": {"bogus_sepr_interval": "sepr_interval"}, "clientConfig": { "labelScale": 1000000 } }, { "track": "BOGUS_uniq_interval", "key": "BOGUS uniq interval", "feature": ["bogus_uniq_interval"], "autocomplete": "all", "class": "uniq_interval", "subfeatures": true, "subfeature_classes": {"bogus_weak_interval": "weak_interval"}, "clientConfig": { "labelScale": 1000000 } }, { "track": "BOGUSNESS", "key": "BOGUSNESS", "feature": ["bogusness_span"], "autocomplete": "all", "class": "bogusness_span", "subfeatures": true, "subfeature_classes": {"bogusness_match": "bogusness_match"}, "clientConfig": {
"labelScale": 1000000 } }, { "track": "BOGUSNESS_repeat", "key": "BOGUSNESS repeat", "feature": ["bogusness_repeat"], "autocomplete": "all", "class": "bogusness_repeat" } ] } canu-1.6/src/bogus/bogus-run.sh000066400000000000000000000033611314437614700165060ustar00rootroot00000000000000#!/bin/sh # Build ideal unitigs given a fasta or frg file. # Align ideal unitigs back to the reference with nucmer. # Generate a mummerplot. # # ASSUMES all programs are in your path. # mummer: nucmer, mummerplot # kmer: snapper2, convertToExtent # wgs-assembler: bogus # Edit reference to make defline have only one work. This is necessary because snapper reports the # whole line, but bogus truncates to the first word. FAS="../FRAGS/porphyromonas_gingivalis_w83.flx.3200bp.0900bp.FJRUAFO0.fasta" REF="../AE015924.fasta" if [ $# -gt 0 ] ; then FAS=$1 shift fi if [ $# -gt 0 ] ; then REF=$1 shift fi OUT=`echo $FAS | tr '/' ' ' | awk '{ print $NF}' | sed s/.fasta//` if [ ! -e $FAS ] ; then echo "Failed to find FASTA $FAS" exit fi if [ ! -e $REF ] ; then echo "Failed to find REFERENCE $REF" exit fi if [ `cat $REF | wc -l` != 2 ] ; then echo "REFERENCE is multi-line FASTA, sequence must be on one line." exit fi if [ ! -e $OUT.snapper ] ; then snapper2 \ -queries $FAS \ -genomic $REF \ -minmatchidentity 94 -minmatchcoverage 1 -verbose -mersize 16 \ -output $OUT.snapper fi if [ ! -e $OUT.snapper.extent ] ; then convertToExtent \ < $OUT.snapper \ | grep -v ^cDNAid \ | sort -k7n \ > $OUT.snapper.extent fi if [ ! -e $OUT.ideal.fasta ] ; then bogus \ -snapper $OUT.snapper.extent \ -reference $REF \ -output $OUT.ideal fi if [ ! -e $OUT.ideal.fasta ] ; then echo "No fasta output from bogus, not mapping to reference." exit fi if [ ! -e $OUT.ideal.delta ] ; then nucmer --maxmatch --coords -p $OUT.ideal \ $REF \ $OUT.ideal.fasta fi if [ ! 
-e $OUT.ideal.png ] ; then mummerplot --layout --filter -p $OUT.ideal -t png $OUT.ideal.delta fi canu-1.6/src/bogus/bogus.C000066400000000000000000000624741314437614700154660ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/bogus.C * * Modifications by: * * Brian P. Walenz from 2010-NOV-23 to 2013-AUG-01 * are Copyright 2010-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-OCT-09 to 2014-DEC-23 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "bogusUtil.H" class longestAlignment { public: longestAlignment() { bgn = end = len = num = 0; rptBgn = rptEnd = frgLen = 0; }; int32 bgn; // Begin coord on the fragment for this region int32 end; // End coord on the fragment for this region int32 len; // Length of the match int32 num; // Number of matches on this region int32 rptBgn; int32 rptEnd; int32 frgLen; }; // MUCH, much easier to modularize the code if all this stuff is globally available. vector refList; map refMap; vector longest; vector genome; map IIDmap; // Maps an ID string to an IID. 
vector IIDname; // Maps an IID to an ID string. vector IIDcount; // Maps an IID to the number of alignments intervalList REPT; intervalList UNIQ; bool *REPTvalid = NULL; bool *UNIQvalid = NULL; int32 *REPTvalidParent = NULL; int32 *UNIQvalidParent = NULL; FILE *gffOutput = NULL; FILE *intervalOutput = NULL; static void writeInputsAsGFF3(char *outputPrefix) { for (uint32 i=0; ifrgEnd - A->frgBgn) { longest[frgIID].bgn = A->frgBgn; longest[frgIID].end = A->frgEnd; longest[frgIID].len = A->frgEnd - A->frgBgn; longest[frgIID].num = 0; indelRate = (A->genEnd - A->genBgn) / (double)(A->frgEnd - A->frgBgn); } if (longest[frgIID].frgLen < A->frgEnd) { longest[frgIID].rptBgn = 0; longest[frgIID].rptEnd = A->frgEnd; longest[frgIID].frgLen = A->frgEnd; } } // Now that we know the longest, count the number of matches that have approximately the same // span. If that count is one, mark all but that span as repeat, otherwise, mark everything as // repeat. assert(longest[frgIID].num == 0); for (uint32 i=bgn; iisRepeat = true; if (((longest[frgIID].bgn - alignWobble <= A->frgBgn) && (A->frgBgn <= longest[frgIID].bgn + alignWobble)) && ((longest[frgIID].end - alignWobble <= A->frgEnd) && (A->frgEnd <= longest[frgIID].end + alignWobble))) { longest[frgIID].num++; A->isRepeat = false; } } assert(longest[frgIID].num > 0); // More than one longest? Then all alignments are repeats. if (longest[frgIID].num > 1) for (uint32 i=bgn; iisRepeat == false) { assert(LONG == 0L); LONG = A; continue; } if ((A->frgBgn < alignWobble) && (A->frgEnd > longest[frgIID].rptBgn)) // First N bases of the read are part of a repeat longest[frgIID].rptBgn = A->frgEnd; if ((A->frgEnd > longest[frgIID].frgLen - alignWobble) && (A->frgBgn < longest[frgIID].rptEnd)) // Last N bases of the read are part of a repeat longest[frgIID].rptEnd = A->frgBgn; } assert(LONG != 0L); // Indel screws up this compute. We have the repeat region marked on the read, and need to transfer it to the genome.
// We estimate, globally, the indel rate, and scale the repeat region length. if (longest[frgIID].rptBgn > 0) { uint32 rptLen = (longest[frgIID].rptBgn - 0) * indelRate; if (rptLen > 0) { #ifdef DEBUG fprintf(stderr, "addAlignment()-- RPTBGN longest %u-%u frg %u at frg %u-%u chn %u-%u gen %u %u-%u rptlen %u indelrate %f\n", longest[frgIID].rptBgn, longest[frgIID].rptEnd, LONG->frgIID, LONG->frgBgn, LONG->frgEnd, LONG->chnBgn, LONG->chnEnd, LONG->genIID, LONG->genBgn, LONG->genEnd, rptLen, indelRate); #endif addAlignment(genome, LONG->frgIID, 0, longest[frgIID].rptBgn, false, LONG->chnBgn, LONG->chnBgn + rptLen, LONG->identity, LONG->genIID, LONG->genBgn, LONG->genBgn + rptLen); #ifdef DEBUG uint32 ii = genome.size() - 1; fprintf(stderr, "addAlignment()-- RPTBGN FINI frg %u at frg %u-%u chn %u-%u gen %u %u-%u\n", genome[ii].frgIID, genome[ii].frgBgn, genome[ii].frgEnd, genome[ii].chnBgn, genome[ii].chnEnd, genome[ii].genIID, genome[ii].genBgn, genome[ii].genEnd); #endif } } if (longest[frgIID].rptEnd < LONG->frgEnd) { uint32 rptLen = (LONG->frgEnd - longest[frgIID].rptEnd) * indelRate; if (rptLen > 0) { #ifdef DEBUG fprintf(stderr, "addAlignment()-- RPTEND longest %u-%u frg %u at frg %u-%u chn %u-%u gen %u %u-%u rptlen %u indelrate %f\n", longest[frgIID].rptBgn, longest[frgIID].rptEnd, LONG->frgIID, LONG->frgBgn, LONG->frgEnd, LONG->chnBgn, LONG->chnEnd, LONG->genIID, LONG->genBgn, LONG->genEnd, rptLen, indelRate); #endif addAlignment(genome, LONG->frgIID, longest[frgIID].rptEnd, LONG->frgEnd, false, LONG->chnEnd - rptLen, LONG->chnEnd, LONG->identity, LONG->genIID, LONG->genEnd - rptLen, LONG->genEnd); #ifdef DEBUG uint32 ii = genome.size() - 1; fprintf(stderr, "addAlignment()-- RPTEND FINI frg %u at frg %u-%u chn %u-%u gen %u %u-%u\n", genome[ii].frgIID, genome[ii].frgBgn, genome[ii].frgEnd, genome[ii].chnBgn, genome[ii].chnEnd, genome[ii].genIID, genome[ii].genBgn, genome[ii].genEnd); #endif } } } bgn = end; } // Over all matches, by fragment #ifdef DEBUG for 
(uint32 ii=0; ii 0) && (minFrags <= refcnt) && (minLength <= refend - refbgn)) { fprintf(intervalOutput, "%s\t%8" F_S64P "\t%8" F_S64P "\tREPT\t" F_S64 "%s\n", refhdr, refbgn, refend, refcnt, (REPTvalid[ir]) ? "" : " weak"); if (REPTvalid[ir]) fprintf(gffOutput, "%s\t.\tbogus_rept_interval\t" F_S64 "\t" F_S64 "\t.\t.\t.\tID=REPT%04d;fragCount=" F_S64 "\n", refhdr, refbgn, refend, ir, refcnt); else fprintf(gffOutput, "%s\t.\tbogus_weak_interval\t" F_S64 "\t" F_S64 "\t.\t.\t.\tParent=UNIQ%04d;fragCount=" F_S64 "\n", refhdr, refbgn, refend, REPTvalidParent[ir], refcnt); } ir++; } else { for (uint32 rr=0; rr 0) && (minFrags <= refcnt) && (minLength <= refend - refbgn)) { fprintf(intervalOutput, "%s\t%8" F_S64P "\t%8" F_S64P "\tUNIQ\t" F_S64 "%s\n", refhdr, refbgn, refend, refcnt, (UNIQvalid[iu]) ? "" : " separation"); if (UNIQvalid[iu]) fprintf(gffOutput, "%s\t.\tbogus_uniq_interval\t" F_S64 "\t" F_S64 "\t.\t.\t.\tID=UNIQ%04d;fragCount=" F_S64 "\n", refhdr, refbgn, refend, iu, refcnt); else fprintf(gffOutput, "%s\t.\tbogus_sepr_interval\t" F_S64 "\t" F_S64 "\t.\t.\t.\tParent=REPT%04d;fragCount=" F_S64 "\n", refhdr, refbgn, refend, UNIQvalidParent[iu], refcnt); } iu++; } } fclose(gffOutput); fclose(intervalOutput); // See CVS version 1.3 for writing rept/uniq fasta return(0); } canu-1.6/src/bogus/bogus.json.README000066400000000000000000000010251314437614700171720ustar00rootroot00000000000000JSON config for viewing BOGUS and BOGUSNESS in jbrowse. Unzip jbrowse-1.1.zip to the directory with bogus/bogusness results. Copy bogus.json to jbrowse/ Append bogus.css to jbrowse/genome.css In jbrowse/: bin/prepare-refseqs.pl -fasta ../reference.fasta bin/biodb-to-json.pl --conf bogus.json NOTE that the reference sequences CANNOT have a space in the defline. 'bogus' will silently remove these, but jbrowse will not. 
----- # SCALING on ZOOM - in json "clientConfig": { "labelScale": 1000000 } canu-1.6/src/bogus/bogus.mk000066400000000000000000000007571314437614700157070ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := bogus SOURCES := bogus.C \ bogusUtil.C SRC_INCDIRS := .. ../AS_UTL ../stores TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/bogus/bogusUtil.C000066400000000000000000000214711314437614700163140ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_bogusUtil.C * src/bogart/AS_BAT_bogusUtil.C * * Modifications by: * * Brian P. Walenz from 2010-NOV-23 to 2014-JAN-23 * are Copyright 2010-2011,2013-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2011-JUN-27 * are Copyright 2011 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-19 to 2014-DEC-23 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "bogusUtil.H" #define MAX_GENOME_SIZE_INPUT 256 * 1024 * 1024 bool byFragmentID(const genomeAlignment &A, const genomeAlignment &B) { return(A.frgIID < B.frgIID); } bool byGenomePosition(const genomeAlignment &A, const genomeAlignment &B) { if (A.chnBgn < B.chnBgn) // A clearly before B. return(true); if (A.chnBgn > B.chnBgn) // A clearly after B. return(false); if ((A.isRepeat == false) && (B.isRepeat == true)) // Start at the same spot, put unique stuff first return(true); // Note the second condition above. If both A and B are repeat=false, byGenomePosition(A,B) and // byGenomePosition(B,A) both return true, logically impossible. return(false); } void addAlignment(vector &genome, int32 frgIID, int32 frgBgn, int32 frgEnd, bool isReverse, int32 chnBgn, int32 chnEnd, double identity, int32 genIID, int32 genBgn, int32 genEnd) { genomeAlignment A; A.frgIID = frgIID; A.frgBgn = frgBgn; A.frgEnd = frgEnd; A.genIID = genIID; A.genBgn = genBgn; A.genEnd = genEnd; A.chnBgn = chnBgn; A.chnEnd = chnEnd; A.identity = identity; A.isReverse = isReverse; A.isSpanned = false; A.isRepeat = true; assert(A.frgBgn < A.frgEnd); assert(A.genBgn < A.genEnd); genome.push_back(A); } void loadNucmer(char *nucmerName, vector &genome, map &IIDmap, vector &IIDname, vector &refList, map &refMap, double minIdentity) { FILE *inFile = 0L; char inLine[1024]; errno = 0; inFile = fopen(nucmerName, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", nucmerName, strerror(errno)), exit(1); fprintf(stderr, "Loading alignments from '%s'\n", nucmerName); // First FOUR lines are header fgets(inLine, 1024, inFile); // (file paths) fgets(inLine, 1024, inFile); // NUCMER fgets(inLine, 1024, inFile); // (blank) fgets(inLine, 1024,
inFile); // (header) // Scan the header line, counting the number of columns uint32 nCols = 0; uint32 wIdent = 0; for (uint32 xx=0; inLine[xx]; xx++) { if ((wIdent == 0) && (inLine[xx+0] == '[') && (inLine[xx+1] == '%') && (inLine[xx+2] == ' ')) wIdent = nCols; if (inLine[xx] == '[') nCols++; } // Read the first line. fgets(inLine, 1024, inFile); chomp(inLine); while (!feof(inFile)) { for (uint32 xx=0; inLine[xx]; xx++) if (inLine[xx] == '|') inLine[xx] = ' '; splitToWords W(inLine); genomeAlignment A; string gID = W[nCols - 1]; // TAGS is the last header column, string fID = W[nCols - 0]; // but read ID is in column +1 from there. if (IIDmap.find(fID) == IIDmap.end()) { IIDname.push_back(fID); IIDmap[fID] = IIDname.size() - 1; } // Unlike snapper, these are already in base-based coords. A.frgIID = IIDmap[fID]; A.frgBgn = W(2); A.frgEnd = W(3); A.genIID = refMap[gID]; A.genBgn = W(0); A.genEnd = W(1); A.chnBgn = refList[A.genIID].rschnBgn + A.genBgn; A.chnEnd = refList[A.genIID].rschnBgn + A.genEnd; A.identity = atof(W[wIdent]); A.isReverse = false; A.isSpanned = false; A.isRepeat = true; if (A.frgBgn > A.frgEnd) { A.frgBgn = W(3); A.frgEnd = W(2); A.isReverse = true; } if ((A.frgBgn >= A.frgEnd) || (A.genBgn >= A.genEnd)) { fprintf(stderr, "ERROR: %s\n", inLine); if (A.frgBgn >= A.frgEnd) fprintf(stderr, "ERROR: frgBgn,frgEnd = %u,%u\n", A.frgBgn, A.frgEnd); if (A.genBgn >= A.genEnd) fprintf(stderr, "ERROR: genBgn,genEnd = %u,%u\n", A.genBgn, A.genEnd); } assert(A.frgBgn < A.frgEnd); assert(A.genBgn < A.genEnd); if (A.identity < minIdentity) goto nextNucmerLine; genome.push_back(A); nextNucmerLine: fgets(inLine, 1024, inFile); chomp(inLine); } fclose(inFile); } void loadSnapper(char *snapperName, vector &genome, map &IIDmap, vector &IIDname, vector &refList, map &refMap, double minIdentity) { FILE *inFile = 0L; char inLine[1024]; errno = 0; inFile = fopen(snapperName, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", snapperName, 
strerror(errno)), exit(1); // Read the first line fgets(inLine, 1024, inFile); chomp(inLine); while (!feof(inFile)) { splitToWords W(inLine); genomeAlignment A; string fID = W[0]; string gID = W[5]; if ((W[1][0] < '0') || (W[1][0] > '9')) // Skip header lines. goto nextSnapperLine; if (IIDmap.find(fID) == IIDmap.end()) { IIDname.push_back(fID); IIDmap[fID] = IIDname.size() - 1; } // "+1" -- Convert from space-based coords to base-based coords. A.frgIID = IIDmap[fID]; A.frgBgn = W(3) + 1; A.frgEnd = W(4); A.genIID = refMap[gID]; A.genBgn = W(6) + 1; A.genEnd = W(7); A.chnBgn = refList[A.genIID].rschnBgn + A.genBgn; A.chnEnd = refList[A.genIID].rschnBgn + A.genEnd; A.identity = atof(W[8]); A.isReverse = false; A.isSpanned = false; A.isRepeat = true; if (A.frgBgn > A.frgEnd) { A.frgBgn = W(4) + 1; A.frgEnd = W(3); A.isReverse = true; } assert(A.frgBgn < A.frgEnd); assert(A.genBgn < A.genEnd); if (A.identity < minIdentity) goto nextSnapperLine; genome.push_back(A); nextSnapperLine: fgets(inLine, 1024, inFile); chomp(inLine); } fclose(inFile); } void loadReferenceSequence(char *refName, vector &refList, map &refMap) { int32 reflen = 0; int32 refiid = 0; errno = 0; FILE *F = fopen(refName, "r"); if (errno) fprintf(stderr, "Failed to open reference sequences in '%s': %s\n", refName, strerror(errno)), exit(1); char *refhdr = new char [1024]; char *refseq = new char [MAX_GENOME_SIZE_INPUT]; fgets(refhdr, 1024, F); chomp(refhdr); fgets(refseq, MAX_GENOME_SIZE_INPUT, F); chomp(refseq); while (!feof(F)) { if (refhdr[0] != '>') { fprintf(stderr, "ERROR: reference sequences must be one per line.\n"); exit(1); } for (uint32 i=0; refhdr[i]; i++) { refhdr[i] = refhdr[i+1]; // remove '>' if (isspace(refhdr[i])) // stop at first space refhdr[i] = 0; } int32 rl = strlen(refseq); refMap[refhdr] = refiid; refList.push_back(referenceSequence(reflen, reflen + rl, rl, refhdr)); reflen += rl + 1024; refiid++; fgets(refhdr, 1024, F); chomp(refhdr); fgets(refseq, MAX_GENOME_SIZE_INPUT, F); 
chomp(refseq); } fclose(F); } canu-1.6/src/bogus/bogusUtil.H000066400000000000000000000075641314437614700163300ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/AS_BAT_bogusUtil.H * src/bogart/AS_BAT_bogusUtil.H * * Modifications by: * * Brian P. Walenz from 2010-NOV-23 to 2013-AUG-01 * are Copyright 2010-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-OCT-09 to 2014-DEC-23 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #ifndef INCLUDE_BOGUSUTIL #define INCLUDE_BOGUSUTIL #include "AS_global.H" #include "splitToWords.H" #include "intervalList.H" #include #include #include #include using namespace std; class genomeAlignment { public: genomeAlignment() { frgIID = frgBgn = frgEnd = 0; genIID = 0; genBgn = genEnd = 0; identity = 0.0; isDeleted = isReverse = isSpanned = isRepeat = false; }; int32 frgIID; int32 frgBgn; int32 frgEnd; int32 genIID; // Position in the actual sequence int32 genBgn; int32 genEnd; int32 chnBgn; // Position in the chained sequences int32 chnEnd; double identity; // Percent identity of the alignment bool isDeleted; // Used by bogusness bool isReverse; bool isSpanned; bool isRepeat; }; class referenceSequence { public: referenceSequence(int32 cb, int32 ce, int32 rl, char *cn) { rschnBgn = cb; rschnEnd = ce; rsrefLen = rl; assert(strlen(cn) < 256); strcpy(rsrefName, cn); }; int32 rschnBgn; int32 rschnEnd; int32 rsrefLen; char rsrefName[256]; }; bool byFragmentID(const genomeAlignment &A, const genomeAlignment &B); bool byGenomePosition(const genomeAlignment &A, const genomeAlignment &B); void addAlignment(vector &genome, int32 frgIID, int32 frgBgn, int32 frgEnd, bool isReverse, int32 chnBgn, int32 chnEnd, double identity, int32 genIID, int32 genBgn, int32 genEnd); void loadNucmer(char *nucmerName, vector &genome, map &IIDmap, vector &IIDname, vector &refList, map &refMap, double minIdentity); void loadSnapper(char *snapperName, vector &genome, map &IIDmap, vector &IIDname, vector &refList, map &refMap, double minIdentity); void loadReferenceSequence(char *refName, vector &refList, map &refMap); #endif // INCLUDE_BOGUSUTIL canu-1.6/src/bogus/bogusness-run.pl000066400000000000000000000145621314437614700174050ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. 
# # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # This file is derived from: # # src/AS_BAT/bogusness-run.pl # # Modifications by: # # Brian P. Walenz from 2010-DEC-02 to 2013-AUG-01 # are Copyright 2010-2011,2013 J. Craig Venter Institute, and # are subject to the GNU General Public License version 2 # # Brian P. Walenz on 2014-DEC-19 # are Copyright 2014 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. Walenz beginning on 2015-OCT-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## use strict; use FindBin; my $prefix = undef; # output file name prefix my $SEQ = undef; # input sequences, fasta my $REF = undef; # input reference, fasta my $IDEAL = undef; # input ideal, output from bogus, *.ideal.intervals my $IGFF3 = undef; # input ideal, output from bogus, *.ideal.gff3 # Assumes this script is NOT installed; it looks to this directory to get # jbrowse source and configs. 
my $src = $FindBin::Bin; # 1.2.1: Needs: (same) # 1.2: Needs: Heap::Simple Heap::Simple::Perl Heap::Simple::XS PerlIO::gzip Devel::Size # 1.1: Needs: BioPerl, JSON, JSON::XS # my $jbrowseVersion = "1.2.1"; my $jbrowse = "$src/jbrowse-$jbrowseVersion.zip\n"; while (scalar(@ARGV) > 0) { my $arg = shift @ARGV; if ($arg eq "-prefix") { $prefix = shift @ARGV; } elsif ($arg eq "-sequence") { $SEQ = shift @ARGV; } elsif ($arg eq "-reference") { $REF = shift @ARGV; } elsif ($arg eq "-ideal") { $IDEAL = shift @ARGV; } else { die "Unknown option '$arg'\n"; } } if (!defined($prefix) || !defined($SEQ) || !defined($REF) || !defined($IDEAL)) { print STDERR "usage: $0 -prefix P -sequence A.fasta -reference R.fasta -ideal I.ideal.intervals\n"; exit(1); } if ($IDEAL =~ m/(.*).intervals/) { $IDEAL = "$1.intervals"; $IGFF3 = "$1.gff3" } die "IDEAL '$IDEAL' not found.\n" if (! -e "$IDEAL"); die "IGFF3 '$IGFF3' not found.\n" if (! -e "$IGFF3"); # # Map assembly to reference # if (! -e "$prefix.delta") { print STDERR "Running nucmer\n"; my $cmd; $cmd .= "nucmer"; $cmd .= " --maxmatch --coords -p $prefix"; $cmd .= " $REF"; $cmd .= " $SEQ"; system($cmd); } if (! -e "$prefix.png") { print STDERR "Running mummerplot\n"; my $cmd; $cmd .= "mummerplot"; $cmd .= " --layout --filter -p $prefix -t png"; #$cmd .= " --layout -p $prefix -t png"; $cmd .= " $prefix.delta"; system($cmd); } # # Run bogusness on that mapping # if (! -e "$prefix.bogusness.out") { print STDERR "Running bogusness\n"; my $cmd; $cmd = "bogusness \\\n"; $cmd .= " -reference $REF \\\n"; $cmd .= " -ideal $IDEAL \\\n"; $cmd .= " -nucmer $prefix.coords \\\n"; $cmd .= " -output $prefix \\\n"; $cmd .= " > $prefix.bogusness.out"; system($cmd); } # # Build wiki-friendly output # if (! 
-e "$prefix.bogusness.wiki") { my $lastUtg; open(F, "sort -k10n -k18n < $prefix.bogusness.out |") or die "Failed to read sort output.\n"; open(O, "> $prefix.bogusness.wiki") or die "Failed to open '$prefix.bogusness.wiki' for writing.\n"; print O "{| class=\"wikitable\" border=1\n"; print O "! unitig !! align num !! utg coords !! gen coords !! status !! ideal type !! ideal index !! ideal coords !! length !! utg cov !! ideal cov !! annotation\n"; print O "|-\n"; while () { chomp; if (m/^\|\s(utg\d+)\s/) { if (!defined($lastUtg)) { $lastUtg = $1; } if ($lastUtg ne $1) { print O "| colspan=12 bgcolor=#666666 |\n"; print O "|-\n"; $lastUtg = $1; } s!BEGINSin \|\| UNIQ!bgcolor="FireBrick" \| BEGINSin \|\| bgcolor="FireBrick" \| UNIQ!; s!ENDSin \|\| UNIQ!bgcolor="FireBrick" \| ENDSin \|\| bgcolor="FireBrick" \| UNIQ!; s!BEGINSin \|\| REPT!bgcolor="ForestGreen" \| BEGINSin \|\| bgcolor="ForestGreen" \| REPT!; s!ENDSin \|\| REPT!bgcolor="ForestGreen" \| ENDSin \|\| bgcolor="ForestGreen" \| REPT!; s!CONTAINED \|\| UNIQ!bgcolor="Indigo" \| CONTAINED \|\| bgcolor="Indigo" \| UNIQ!; s!CONTAINED \|\| REPT!bgcolor="SteelBlue" \| CONTAINED \|\| bgcolor="SteelBlue" \| REPT!; print O "$_\n"; print O "|-\n"; } } print O "|}\n"; close(O); close(F); } # # Attempt to create a jbrowse instance. # # cd x010 && tar -xf ../jbrowse.tar && mv jbrowse jbrowse-ctg && cd jbrowse-ctg/ && \ # ln -s ../bogusness-ctg.gff3 ../bogus.gff3 . && \ # bin/prepare-refseqs.pl -fasta ../../NC_000913.fasta && \ # bin/biodb-to-json.pl --conf bogus.json && \ # cd ../.. 
# system("mkdir $prefix.jbrowse"); open(F, "> $prefix.jbrowse/create.sh") or die "Failed to open '$prefix.jbrowse/create.sh' for writing.\n"; print F "#!/bin/sh\n"; print F "\n"; print F "cd $prefix.jbrowse\n"; print F "\n"; print F "unzip -q $jbrowse\n"; print F "\n"; print F "mv jbrowse-$jbrowseVersion/* jbrowse/* .\n"; print F "rm -rf jbrowse-$jbrowseVersion\n"; print F "\n"; print F "cp -p $src/bogus-genome.css genome.css\n"; print F "cp -p $src/bogus-genome.json bogus.json\n"; print F "\n"; print F "ln -s ../$prefix.gff3\n"; print F "\n"; print F "ln -s ../$IGFF3 .\n"; print F "\n"; print F "echo bin/prepare-refseqs.pl -fasta ../$REF\n"; print F "bin/prepare-refseqs.pl -fasta ../$REF\n"; print F "\n"; print F "echo bin/biodb-to-json.pl --conf bogus.json\n"; print F "bin/biodb-to-json.pl --conf bogus.json\n"; print F "\n"; close(F); system("sh $prefix.jbrowse/create.sh"); canu-1.6/src/bogus/bogusness.C000066400000000000000000000716031314437614700163510ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/bogusness.C * * Modifications by: * * Brian P. Walenz from 2010-NOV-23 to 2013-AUG-01 * are Copyright 2010-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-19 to 2014-DEC-23 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "bogusUtil.H" // Reads snapper/nucmer output, figures out what the minimal set of alignments that cover each // aligned sequence, then compares those alignments against the ideal unitigs. // // It reports: // bubbles that should have been popped. // unitigs that span a unique-repeat junction (one end in unique, one end in repeat). // unitigs that include a repeat. // unitigs that include a repeat, and are chimeric at the repeat. // unitigs that should be joined. // // #define IDEAL_REPT 0 // A repeat unitig #define IDEAL_UNIQ 1 // A unique unitig #define IDEAL_REPTUNIQ 2 // A portion of this repeat unitig might be separable #define IDEAL_UNIQWEAK 3 // A portion of this unique unitig might be unjoinable #define IDEAL_MIXED 4 // Used for classifying bubbles char *types[4] = { "REPT", "UNIQ", "SEPR", "WEAK" }; #define STATUS_BEGINSin 0 #define STATUS_ENDSin 1 #define STATUS_CONTAINS 2 #define STATUS_CONTAINED 3 char *statuses[4] = { "BEGINSin", "ENDSin", "CONTAINS", "CONTAINED" }; class bogusResult { public: bogusResult(const char *_utgID, int32 _utgIID, int32 _alignNum, int32 _alignTotal, int32 _utgBgn, int32 _utgEnd, int32 _chnBgn, int32 _chnEnd, int32 _genIID, int32 _genBgn, int32 _genEnd, bool _isReverse, int32 _status, int32 _type, int32 _idlNum, int32 _idlBgn, int32 _idlEnd, int32 _alignLen, double _utgCov, double _idlCov) { if (_utgID) strcpy(utgID, _utgID); else utgID[0] = 0; utgIID = _utgIID; alignNum = _alignNum; alignTotal = _alignTotal; utgBgn = _utgBgn; utgEnd = _utgEnd; chnBgn = _chnBgn; chnEnd = _chnEnd; genIID = _genIID; genBgn = _genBgn; genEnd = _genEnd; isReverse = _isReverse; status = _status; type = _type; idlNum = _idlNum; idlBgn = _idlBgn; idlEnd = _idlEnd; alignLen = _alignLen; utgCov = 
_utgCov; idlCov = _idlCov; }; ~bogusResult() { }; bool operator<(const bogusResult &that) const { if (chnBgn < that.chnBgn) return(true); if (chnBgn > that.chnBgn) return(false); return(idlBgn < that.idlBgn); }; char utgID[32]; int32 utgIID; int32 alignNum; int32 alignTotal; int32 utgBgn; int32 utgEnd; int32 chnBgn; // Position in the chained reference int32 chnEnd; int32 genIID; // Position in the actual reference (for output) int32 genBgn; int32 genEnd; bool isReverse; int32 status; int32 type; int32 idlNum; int32 idlBgn; int32 idlEnd; int32 alignLen; double utgCov; double idlCov; }; class idealUnitig { public: idealUnitig() { bgn = 0; end = 0; type = 0; }; idealUnitig(int32 b, int32 e, char t) { bgn = b; end = e; type = t; }; ~idealUnitig() { }; int32 bgn; int32 end; char type; }; vector refList; map refMap; vector ideal; vector genome; map IIDmap; // Maps an ID string to an IID. vector IIDname; // Maps an IID to an ID string. vector results; void loadIdealUnitigs(char *idealName, vector &ideal) { errno = 0; FILE *F = fopen(idealName, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", idealName, strerror(errno)), exit(1); char L[1024]; splitToWords S; char t = 0; fgets(L, 1024, F); while (!feof(F)) { chomp(L); S.split(L); if (strcmp(S[3], "REPT") == 0) t = (S.numWords() == 4) ? IDEAL_REPT : IDEAL_UNIQWEAK; else if (strcmp(S[3], "UNIQ") == 0) t = (S.numWords() == 4) ? 
IDEAL_UNIQ : IDEAL_REPTUNIQ; else fprintf(stderr, "Unknown type in '%s'\n", L), exit(1); #warning ONLY WORKS ON ONE REFERENCE ideal.push_back(idealUnitig(S(1), S(2), t)); fgets(L, 1024, F); } fclose(F); fprintf(stderr, "Loaded %lu ideal unitigs.\n", ideal.size()); } bool isUnitigContained(int32 frgIID, idealUnitig &ideal, genomeAlignment &cover, int32 &alen, int32 &ulen, int32 &ilen, double &ufrac, double &ifrac) { if ((ideal.bgn <= cover.chnBgn) && (cover.genEnd <= ideal.end)) { alen = cover.genEnd - cover.chnBgn; ulen = cover.genEnd - cover.chnBgn; ilen = ideal.end - ideal.bgn; ufrac = 100.0 * alen / ulen; ifrac = 100.0 * alen / ilen; return(true); } return(false); } bool isUnitigContaining(int32 frgIID, idealUnitig &ideal, genomeAlignment &cover, int32 &alen, int32 &ulen, int32 &ilen, double &ufrac, double &ifrac) { if ((cover.chnBgn <= ideal.bgn) && (ideal.end <= cover.chnEnd)) { alen = ideal.end - ideal.bgn; ulen = cover.chnEnd - cover.chnBgn; ilen = ideal.end - ideal.bgn; ufrac = 100.0 * alen / ulen; ifrac = 100.0 * alen / ilen; return(true); } return(false); } bool isUnitigEnding(int32 frgIID, idealUnitig &ideal, genomeAlignment &cover, int32 &alen, int32 &ulen, int32 &ilen, double &ufrac, double &ifrac) { if ((cover.chnBgn <= ideal.bgn) && (ideal.bgn <= cover.chnEnd)) { alen = cover.chnEnd - ideal.bgn; ulen = cover.chnEnd - cover.chnBgn; ilen = ideal.end - ideal.bgn; ufrac = 100.0 * alen / ulen; ifrac = 100.0 * alen / ilen; return(true); } return(false); } bool isUnitigBeginning(int32 frgIID, idealUnitig &ideal, genomeAlignment &cover, int32 &alen, int32 &ulen, int32 &ilen, double &ufrac, double &ifrac) { if ((cover.chnBgn <= ideal.end) && (ideal.end <= cover.chnEnd)) { alen = ideal.end - cover.chnBgn; ulen = cover.chnEnd - cover.chnBgn; ilen = ideal.end - ideal.bgn; ufrac = 100.0 * alen / ulen; ifrac = 100.0 * alen / ilen; return(true); } return(false); } int main(int argc, char **argv) { uint32 nucmerNamesLen = 0; uint32 snapperNamesLen = 0; char 
*nucmerNames[1024];
  char   *snapperNames[1024];

  char   *idealName    = 0L;
  char   *refName      = 0L;
  char   *outputPrefix = 0L;

  FILE   *gffOutput     = 0L;
  FILE   *resultsOutput = 0L;

  int arg = 1;
  int err = 0;
  while (arg < argc) {
    if        (strcmp(argv[arg], "-nucmer") == 0) {
      nucmerNames[nucmerNamesLen++] = argv[++arg];

    } else if (strcmp(argv[arg], "-snapper") == 0) {
      snapperNames[snapperNamesLen++] = argv[++arg];

    } else if (strcmp(argv[arg], "-ideal") == 0) {
      idealName = argv[++arg];

    } else if (strcmp(argv[arg], "-reference") == 0) {
      refName = argv[++arg];

    } else if (strcmp(argv[arg], "-output") == 0) {
      outputPrefix = argv[++arg];

    } else {
      err++;
    }

    arg++;
  }
  if ((nucmerNamesLen == 0) && (snapperNamesLen == 0))
    fprintf(stderr, "ERROR: No input matches supplied (either -nucmer or -snapper).\n"), err++;
  if (refName == 0L)
    fprintf(stderr, "ERROR: No reference supplied (-reference).\n"), err++;
  if (idealName == 0L)
    fprintf(stderr, "ERROR: No ideal unitigs supplied (-ideal).\n"), err++;
  if (outputPrefix == 0L)
    fprintf(stderr, "ERROR: No output prefix supplied (-output).\n"), err++;
  if (err) {
    exit(1);
  }

  {
    char outputName[FILENAME_MAX];

    errno = 0;

    snprintf(outputName, FILENAME_MAX, "%s.bogusness", outputPrefix);
    resultsOutput = fopen(outputName, "w");
    if (errno)
      fprintf(stderr, "Failed to open '%s' for writing: %s\n", outputName, strerror(errno)), exit(1);

    snprintf(outputName, FILENAME_MAX, "%s.gff3", outputPrefix);
    gffOutput = fopen(outputName, "w");
    if (errno)
      fprintf(stderr, "Failed to open '%s' for writing: %s\n", outputName, strerror(errno)), exit(1);

    fprintf(gffOutput, "##gff-version 3\n");
  }

  loadIdealUnitigs(idealName, ideal);

  loadReferenceSequence(refName, refList, refMap);

  for (uint32 nn=0; nn cover;

    //  Find the range of alignments for a single fragment.

    uint32 end = bgn + 1;
    while ((end < genome.size()) && (genome[bgn].frgIID == genome[end].frgIID))
      end++;

    //  Basically, discard any alignment that is contained in some other alignment.
We're not // trying to explain the genome using the unitigs, we're trying to explain (evaluate) the // unitigs using the genome. for (uint32 i=bgn; iutgID, bi->alignNum, bi->alignTotal, bi->utgBgn, bi->utgEnd, refList[bi->genIID].rsrefName, bi->genBgn, bi->genEnd, statuses[bi->status], types[bi->type], bi->idlNum, bi->idlBgn, bi->idlEnd, bi->alignLen, bi->utgCov, bi->idlCov); } fclose(gffOutput); fclose(resultsOutput); #if 0 // USELESS BROKEN STUFF //////////////////////////////////////// // // Attempt to analyze the result. // //////////////////////////////////////// // // The array variables count over various tolerances at the edges. // // -------- spans the REPT, but includes a bit of the UNIQ on // For example: uuuuu uuuu either end. If the UNIQ is less than 250bp, we'd // rrrrrr numURU{0000}=0 numURU{0250}=1 numURU{0500}=1 int32 tolerances[4] = { 0, 250, 500, 1000 }; uint32 numChimeraTotal = 0; uint32 numChimera[1024] = {0}; // Chimeric unitigs by number of pieces uint32 numBubbleUNIQ = 0; // Number of bubbles in unique regions (contained in a UNIQ and another unitig) uint32 numBubbleREPT = 0; // Number of bubbles in repeat regions (contained in a REPT and another unitig) uint32 numBubbleOTHR = 0; // Number of bubbles that span a region uint32 numIncompleteUNIQ[4] = {0}; // Number of unitigs that do not completely span a UNIQ region (**) uint32 numIncompleteREPT[4] = {0}; // Number of unitigs that do not completely span a REPT region (**) uint32 numURU[4] = {0}; // Number of unitigs that correctly span a R, and might have a bit in U on either end. uint32 numRUR[4] = {0}; // Number of unitigs that correctly span a U, and might have a bit in R on either end. 
  uint32 numIllegalBorder[4] = {0};  //  Number of unitigs that cross a UNIQ/REPT border, or multiple borders

  ////////////////////////////////////////////////////////////////////////////////
  //  Chimera

  int32 *chm = new int32 [IIDname.size()];

  memset(chm, 0, sizeof(int32) * IIDname.size());

  for (uint32 i=0; ialignTotal < 1024); if ((bi->alignNum == 1) && (bi->alignTotal > 1)) chm[bi->utgIID]++; } for (uint32 i=0; ichnBgn <= gj->chnBgn) && (gj->chnEnd <= gi->chnEnd) && (gj->frgBgn == bgn[gj->frgIID]) && (gj->frgEnd == end[gj->frgIID])) { bub[gj->frgIID]++; //typ[gj->frgIID] = IDEAL_MIXED; } } }

  //  Now, for any unitig marked as a bubble, try to determine the type.  The rule is
  //  simple.  Three cases:
  //    matches are all UNIQ
  //    matches are all REPT
  //    matches are of mixed type (bubble spans a junction)
  //
  for (uint32 b=0; b) {
    chomp;
    $hash2 = $_  if (!defined($hash2));
    $revCount++;
}
close(F);

open(F, "git describe --tags --long --dirty --always --abbrev=40 |") or die "Failed to run 'git describe'.\n";
while (<F>) {
    chomp;
    if (m/^v(\d+)\.(\d+.*)-(\d+)-g(.{40})-*(.*)$/) {
        $major   = $1;
        $minor   = $2;
        $commits = $3;
        $hash1   = $4;
        $dirty   = $5;
    } else {
        die "Failed to parse describe string '$_'.\n";
    }
}
close(F);

if ($dirty eq "dirty") {
    $dirty = "ahead of github";
} else {
    $dirty = "sync'd with github";
}
}

# If not in a git repo, we might be able to figure things out based on the directory name.

elsif ($cwd =~ m/canu-(.{40})\/src/) {
    $label = "snapshot";
    $hash1 = $1;
    $hash2 = $1;
}

elsif ($cwd =~ m/canu-master\/src/) {
    $label = "master-snapshot";
}

elsif ($cwd =~ m/canu-(\d).(\d)\/src/) {
    $label = "release";
    $major = $1;
    $minor = $2;
}

# Report what we found.  This is really for the gmake output.

if ($commits > 0) {
    print STDERR "Building snapshot v$major.$minor +$commits changes (r$revCount $hash1) ($dirty)\n";
    print STDERR "\n";
} else {
    print STDERR "Building $label v$major.$minor\n";
    print STDERR "\n";
}

# Dump a new file, but don't overwrite the original.
open(F, "> canu_version.H.new") or die "Failed to open 'canu-version.H.new' for writing: $!\n"; print F "// Automagically generated by canu_version_update.pl! Do not commit!\n"; print F "#define CANU_VERSION_LABEL \"$label\"\n"; print F "#define CANU_VERSION_MAJOR \"$major\"\n"; print F "#define CANU_VERSION_MINOR \"$minor\"\n"; print F "#define CANU_VERSION_COMMITS \"$commits\"\n"; print F "#define CANU_VERSION_REVISION \"$revCount\"\n"; print F "#define CANU_VERSION_HASH \"$hash1\"\n"; if (defined($dirty)) { print F "#define CANU_VERSION \"Canu snapshot v$major.$minor +$commits changes (r$revCount $hash1)\\n\"\n"; } elsif (defined($hash1)) { print F "#define CANU_VERSION \"Canu snapshot ($hash1)\\n\"\n"; } else { print F "#define CANU_VERSION \"Canu $major.$minor\\n\"\n"; } close(F); # If they're the same, don't replace the original. if (compare("canu_version.H", "canu_version.H.new") == 0) { unlink "canu_version.H.new"; } else { unlink "canu_version.H"; rename "canu_version.H.new", "canu_version.H"; } # That's all, folks! exit(0); canu-1.6/src/correction/000077500000000000000000000000001314437614700152565ustar00rootroot00000000000000canu-1.6/src/correction/errorEstimate.C000066400000000000000000000126231314437614700202130ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * Modifications by: * * Sergey Koren beginning on 2016-MAY-16 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "ovStore.H" #include "splitToWords.H" #include "AS_UTL_decodeRange.H" #include "stddev.H" #include #include #include using namespace std; int main(int argc, char **argv) { char *scoreFileName = NULL; uint32 deviations = 6; float mass=0.98; bool isOvl=false; argc = AS_configure(argc, argv); int32 arg = 1; int32 err = 0; while (arg < argc) { if (strcmp(argv[arg], "-S") == 0) { scoreFileName = argv[++arg]; } else if (strcmp(argv[arg], "-d") == 0) { deviations = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-m") == 0) { mass = atof(argv[++arg]); } else if (strcmp(argv[arg], "-o") == 0) { isOvl=true; } else { fprintf(stderr, "ERROR: invalid arg '%s'\n", argv[arg]); err++; } arg++; } if (scoreFileName == NULL) err++; if (err) { fprintf(stderr, "usage: %s [options]\n", argv[0]); fprintf(stderr, "\n"); exit(1); } errno = 0; FILE *scoreFile = (scoreFileName == NULL) ? NULL : (scoreFileName[0] == '-' ? stdin : fopen(scoreFileName, "r")); if (errno) fprintf(stderr, "ERROR: failed to open '%s' for reading: %s\n", scoreFileName, strerror(errno)), exit(1); // read the file and store best hits char ovStr[1024]; ovOverlap ov(NULL); map readToLength; map readToIdy; double mean, median, stddev, mad; mean = median = stddev = mad = 0.0; while (fgets(ovStr, 1024, scoreFile) != NULL) { splitToWords W(ovStr); if (isOvl) { ov.a_iid = W(0); ov.b_iid = W(1); if (ov.a_iid == ov.b_iid) continue; ov.dat.ovl.ahg5 = W(4); ov.dat.ovl.ahg3 = W(6); ov.dat.ovl.bhg5 = W(6); ov.dat.ovl.bhg3 = W(7); ov.span(W(3)); ov.erate(atof(W[8])); ov.flipped(W[3][0] == 'I' ? 
true : false); } else { ov.a_iid = W(0); ov.b_iid = W(1); if (ov.a_iid == ov.b_iid) continue; assert(W[4][0] == '0'); ov.dat.ovl.ahg5 = W(5); ov.dat.ovl.ahg3 = W(7) - W(6); if (W[8][0] == '0') { ov.dat.ovl.bhg5 = W(9); ov.dat.ovl.bhg3 = W(11) - W(10); ov.flipped(false); } else { ov.dat.ovl.bhg3 = W(9); ov.dat.ovl.bhg5 = W(11) - W(10); ov.flipped(true); } ov.erate(atof(W[2])); ov.span(W(10)-W(9)); } if (ov.erate() == 0.0) ov.erate(0.01); // round up when we can't estimate accurately if (readToLength.find(ov.b_iid) == readToLength.end() || readToLength[ov.b_iid] < ov.span()) { readToLength[ov.b_iid] = ov.span(); readToIdy[ov.b_iid] = ov.erate(); } } fclose(scoreFile); stdDev edgeStats; // Find the overlap for every best edge. double *absdev = new double [readToLength.size() + 1]; double *erates = new double [readToLength.size() + 1]; uint32 eratesLen = 0; for (map::iterator it=readToIdy.begin(); it != readToIdy.end(); ++it) { edgeStats.insert(erates[eratesLen++] = it->second); } mean = edgeStats.mean(); stddev = edgeStats.stddev(); fprintf(stderr, "with %u points - mean %f stddev %f - would use overlaps below %f fraction error\n", edgeStats.size(), mean, stddev, mean + deviations * stddev); // Find the median and absolute deviations. 
sort(erates, erates+eratesLen); median = erates[ eratesLen / 2 ]; double massCutoff = 0; uint32 totalBelow = 0; for (uint32 ii=0; ii= 0.0); mad = absdev[eratesLen/2]; delete [] absdev; delete [] erates; fprintf(stderr, "with %u points - median %f mad %f - would use overlaps below %f fraction error\n", edgeStats.size(), median, mad, median + deviations * 1.4826 * mad); fprintf(stderr, "with %u points - mass of %d is below %f\n", edgeStats.size(), totalBelow, massCutoff); if (scoreFile) fclose(scoreFile); fprintf(stdout, "%.3f\n", massCutoff /* median + deviations * 1.4826 * mad*/); exit(0); } canu-1.6/src/correction/errorEstimate.mk000066400000000000000000000010451314437614700204340ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := errorEstimate SOURCES := errorEstimate.C SRC_INCDIRS := .. ../AS_UTL ../stores ../overlapInCore ../utgcns/libNDalign ../overlapErrorAdjustment TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/correction/filterCorrectionOverlaps.C000066400000000000000000000317701314437614700224230ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. 
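errorEstimate.C above derives its error-rate cutoff in two robust ways: mean plus N standard deviations, and median plus N * 1.4826 * MAD, where 1.4826 scales the median absolute deviation so it estimates the standard deviation of normally distributed data. The MAD form can be sketched as a standalone helper; `madCutoff` is a hypothetical name for illustration, not a function in canu.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Robust error-rate cutoff: median + deviations * 1.4826 * MAD.
// The 1.4826 factor makes MAD a consistent estimator of sigma for
// normally distributed error rates.
double madCutoff(std::vector<double> erates, double deviations) {
  std::sort(erates.begin(), erates.end());
  double median = erates[erates.size() / 2];

  // Absolute deviations from the median, then their median (the MAD).
  std::vector<double> absdev;
  for (double e : erates)
    absdev.push_back(fabs(e - median));
  std::sort(absdev.begin(), absdev.end());
  double mad = absdev[absdev.size() / 2];

  return median + deviations * 1.4826 * mad;
}
```

With erates {0.01, 0.02, 0.03, 0.04, 0.05} and the default of 6 deviations, the median is 0.03, the MAD is 0.01, and the cutoff is 0.03 + 6 * 1.4826 * 0.01, about 0.119 fraction error.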
Walenz from 2015-MAY-28 to 2015-JUN-25 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-29 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-FEB-24 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "ovStore.H" #include "splitToWords.H" #include "AS_UTL_decodeRange.H" #include #include using namespace std; int main(int argc, char **argv) { char *gkpStoreName = NULL; char *ovlStoreName = NULL; char *scoreFileName = NULL; char logFileName[FILENAME_MAX]; char statsFileName[FILENAME_MAX]; bool noLog = false; bool noStats = false; uint32 expectedCoverage = 25; uint32 minOvlLength = 500; uint32 maxOvlLength = AS_MAX_READLEN; double maxErate = 1.0; double minErate = 1.0; bool legacyScore = false; argc = AS_configure(argc, argv); int32 arg = 1; int32 err = 0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpStoreName = argv[++arg]; } else if (strcmp(argv[arg], "-O") == 0) { ovlStoreName = argv[++arg]; } else if (strcmp(argv[arg], "-S") == 0) { scoreFileName = argv[++arg]; } else if (strcmp(argv[arg], "-c") == 0) { expectedCoverage = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-l") == 0) { minOvlLength = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-e") == 0) { AS_UTL_decodeRange(argv[++arg], minErate, maxErate); } else if (strcmp(argv[arg], "-nolog") == 0) { noLog = true; } else if (strcmp(argv[arg], "-nostats") == 0) { noStats = true; } else if (strcmp(argv[arg], "-legacy") == 0) { legacyScore = true; } else { fprintf(stderr, "ERROR: invalid arg '%s'\n", argv[arg]); err++; } arg++; } if (gkpStoreName == NULL) err++; if (ovlStoreName == NULL) err++; if (scoreFileName == NULL) err++; if (err) { 
fprintf(stderr, "usage: %s [options]\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, "Rewrites an ovlStore, filtering overlaps that shouldn't be used for correcting reads.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -G gkpStore input reads\n"); fprintf(stderr, " -O ovlStore input overlaps\n"); fprintf(stderr, " -S scoreFile output scores for each read, binary file, to 'scoreFile'\n"); fprintf(stderr, " per-read logging to 'scoreFile.log' (see -nolog)\n"); fprintf(stderr, " summary statistics to 'scoreFile.stats' (see -nostats)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -c coverage retain at most this many overlaps per read\n"); fprintf(stderr, "\n"); fprintf(stderr, " -l length filter overlaps shorter than this length\n"); fprintf(stderr, " -e (min-)max filter overlaps outside this range of fraction error\n"); fprintf(stderr, " example: -e 0.20 filter overlaps above 20%% error\n"); fprintf(stderr, " example: -e 0.05-0.20 filter overlaps below 5%% error\n"); fprintf(stderr, " or above 20%% error\n"); fprintf(stderr, "\n"); fprintf(stderr, " -nolog don't create 'scoreFile.log'\n"); fprintf(stderr, " -nostats don't create 'scoreFile.stats'\n"); if (gkpStoreName == NULL) fprintf(stderr, "ERROR: no gatekeeper store (-G) supplied.\n"); if (ovlStoreName == NULL) fprintf(stderr, "ERROR: no overlap store (-O) supplied.\n"); if (scoreFileName == NULL) fprintf(stderr, "ERROR: no output scoreFile (-S) supplied.\n"); exit(1); } // If only one value supplied to -e, both minErate and maxErate are set to it. // Change the max to the value supplied, and set min to zero. Otherwise, two values // were supplied, and we're all set. No checking is made that the user isn't an idiot. 
if (minErate == maxErate) { maxErate = minErate; minErate = 0.0; } uint32 maxEvalue = AS_OVS_encodeEvalue(maxErate); uint32 minEvalue = AS_OVS_encodeEvalue(minErate);; gkStore *gkpStore = gkStore::gkStore_open(gkpStoreName); ovStore *inpStore = new ovStore(ovlStoreName, gkpStore); uint64 *scores = new uint64 [gkpStore->gkStore_getNumReads() + 1]; snprintf(logFileName, FILENAME_MAX, "%s.log", scoreFileName); snprintf(statsFileName, FILENAME_MAX, "%s.stats", scoreFileName); errno = 0; FILE *scoreFile = (scoreFileName == NULL) ? NULL : fopen(scoreFileName, "w"); if (errno) fprintf(stderr, "ERROR: failed to open '%s' for writing: %s\n", scoreFileName, strerror(errno)), exit(1); errno = 0; FILE *logFile = (noLog == true) ? NULL : fopen(logFileName, "w"); if (errno) fprintf(stderr, "ERROR: failed to open '%s' for writing: %s\n", logFileName, strerror(errno)), exit(1); uint32 ovlLen = 0; uint32 ovlMax = 131072; ovOverlap *ovl = ovOverlap::allocateOverlaps(gkpStore, ovlMax); ovOverlap swapped(gkpStore); uint32 histLen = 0; uint32 histMax = ovlMax; uint64 *hist = new uint64 [histMax]; uint64 totalOverlaps = 0; uint64 lowErate = 0; uint64 highErate = 0; uint64 tooShort = 0; uint64 tooLong = 0; uint64 belowCutoff = 0; uint64 retained = 0; uint64 totalReads = gkpStore->gkStore_getNumReads(); uint64 readsNoOlaps = 0; uint64 reads00OlapsFiltered = 0; uint64 reads50OlapsFiltered = 0; uint64 reads80OlapsFiltered = 0; uint64 reads95OlapsFiltered = 0; uint64 reads99OlapsFiltered = 0; for (uint32 id=1; id <= gkpStore->gkStore_getNumReads(); id++) { scores[id] = UINT64_MAX; inpStore->readOverlaps(id, ovl, ovlLen, ovlMax); if (ovlLen == 0) { readsNoOlaps++; continue; } if (ovl[0].a_iid != id) { readsNoOlaps++; continue; } histLen = 0; if (histMax < ovlMax) { delete [] hist; histMax = ovlMax; hist = new uint64 [ovlMax]; } // Figure out which overlaps are good enough to consider and save their length. 
for (uint32 oo=0; oogkStore_getNumReads() + 1); if (scoreFile) fclose(scoreFile); if (logFile) fclose(logFile); delete [] scores; delete inpStore; gkpStore->gkStore_close(); if (noStats == true) exit(0); errno = 0; FILE *statsFile = fopen(statsFileName, "w"); if (errno) fprintf(stderr, "ERROR: failed to open '%s' for writing: %s\n", statsFileName, strerror(errno)), exit(1); fprintf(statsFile, "PARAMETERS:\n"); fprintf(statsFile, "----------\n"); fprintf(statsFile, "\n"); fprintf(statsFile, "%7" F_U32P " (expected coverage)\n", expectedCoverage); fprintf(statsFile, "%7" F_U32P " (don't use overlaps shorter than this)\n", minOvlLength); fprintf(statsFile, "%7.3f (don't use overlaps with erate less than this)\n", minErate); fprintf(statsFile, "%7.3f (don't use overlaps with erate more than this)\n", maxErate); fprintf(statsFile, "\n"); fprintf(statsFile, "OVERLAPS:\n"); fprintf(statsFile, "--------\n"); fprintf(statsFile, "\n"); fprintf(statsFile, "IGNORED:\n"); fprintf(statsFile, "\n"); fprintf(statsFile, "%12" F_U64P " (< %6.4f fraction error)\n", lowErate, AS_OVS_decodeEvalue(minEvalue)); fprintf(statsFile, "%12" F_U64P " (> %6.4f fraction error)\n", highErate, AS_OVS_decodeEvalue(maxEvalue)); fprintf(statsFile, "%12" F_U64P " (< %u bases long)\n", tooShort, minOvlLength); fprintf(statsFile, "%12" F_U64P " (> %u bases long)\n", tooLong, maxOvlLength); fprintf(statsFile, "\n"); fprintf(statsFile, "FILTERED:\n"); fprintf(statsFile, "\n"); fprintf(statsFile, "%12" F_U64P " (too many overlaps, discard these shortest ones)\n", belowCutoff); fprintf(statsFile, "\n"); fprintf(statsFile, "EVIDENCE:\n"); fprintf(statsFile, "\n"); fprintf(statsFile, "%12" F_U64P " (longest overlaps)\n", retained); fprintf(statsFile, "\n"); fprintf(statsFile, "TOTAL:\n"); fprintf(statsFile, "\n"); fprintf(statsFile, "%12" F_U64P " (all overlaps)\n", totalOverlaps); fprintf(statsFile, "\n"); fprintf(statsFile, "READS:\n"); fprintf(statsFile, "-----\n"); fprintf(statsFile, "\n"); 
fprintf(statsFile, "%12" F_U64P " (no overlaps)\n", readsNoOlaps); fprintf(statsFile, "%12" F_U64P " (no overlaps filtered)\n", reads00OlapsFiltered); fprintf(statsFile, "%12" F_U64P " (< 50%% overlaps filtered)\n", reads50OlapsFiltered); fprintf(statsFile, "%12" F_U64P " (< 80%% overlaps filtered)\n", reads80OlapsFiltered); fprintf(statsFile, "%12" F_U64P " (< 95%% overlaps filtered)\n", reads95OlapsFiltered); fprintf(statsFile, "%12" F_U64P " (< 100%% overlaps filtered)\n", reads99OlapsFiltered); fprintf(statsFile, "\n"); fclose(statsFile); // Histogram of overlaps per read // Histogram of overlaps filtered per read // Histogram of overlaps used for correction (number per read, lengths?) exit(0); } canu-1.6/src/correction/filterCorrectionOverlaps.mk000066400000000000000000000007731314437614700226470ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := filterCorrectionOverlaps SOURCES := filterCorrectionOverlaps.C SRC_INCDIRS := .. ../AS_UTL ../stores TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/correction/generateCorrectionLayouts.C000066400000000000000000000416721314437614700225770ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. 
* Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-APR-09 to 2015-SEP-21 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-NOV-27 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-FEB-12 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "ovStore.H" #include "tgStore.H" #include "outputFalcon.H" #include "stashContains.H" #include "splitToWords.H" #include "intervalList.H" #include using namespace std; // Debugging on which reads are filtered, which are used, and which are removed // to meet coverage thresholds. Very big. #undef DEBUG_LAYOUT // Generate a layout for the read in ovl[0].a_iid, using most or all of the overlaps // in ovl. 
tgTig * generateLayout(gkStore *gkpStore, uint64 *readScores, bool legacyScore, uint32 minEvidenceLength, double maxEvidenceErate, double maxEvidenceCoverage, ovOverlap *ovl, uint32 ovlLen, FILE *flgFile) { set children; tgTig *layout = new tgTig; layout->_tigID = ovl[0].a_iid; layout->_coverageStat = 1.0; // Default to just barely unique layout->_microhetProb = 1.0; // Default to 100% probability of unique layout->_class = tgTig_noclass; layout->_suggestRepeat = false; layout->_suggestCircular = false; gkRead *read = gkpStore->gkStore_getRead(ovl[0].a_iid); layout->_layoutLen = read->gkRead_sequenceLength(); resizeArray(layout->_children, layout->_childrenLen, layout->_childrenMax, ovlLen, resizeArray_doNothing); if (flgFile) fprintf(flgFile, "Generate layout for read " F_U32 " length " F_U32 " using up to " F_U32 " overlaps.\n", layout->_tigID, layout->_layoutLen, ovlLen); for (uint32 oo=0; oo AS_MAX_READLEN) { char ovlString[1024]; fprintf(stderr, "ERROR: bogus overlap '%s'\n", ovl[oo].toString(ovlString, ovOverlapAsCoords, false)); } assert(ovlLength < AS_MAX_READLEN); if (ovl[oo].erate() > maxEvidenceErate) { if (flgFile) fprintf(flgFile, " filter read %9u at position %6u,%6u length %5llu erate %.3f - low quality (threshold %.2f)\n", ovl[oo].b_iid, ovl[oo].a_bgn(), ovl[oo].a_end(), ovlLength, ovl[oo].erate(), maxEvidenceErate); continue; } if (ovl[oo].a_end() - ovl[oo].a_bgn() < minEvidenceLength) { if (flgFile) fprintf(flgFile, " filter read %9u at position %6u,%6u length %5llu erate %.3f - too short (threshold %u)\n", ovl[oo].b_iid, ovl[oo].a_bgn(), ovl[oo].a_end(), ovlLength, ovl[oo].erate(), minEvidenceLength); continue; } if ((readScores != NULL) && (ovlScore < readScores[ovl[oo].b_iid])) { if (flgFile) fprintf(flgFile, " filter read %9u at position %6u,%6u length %5llu erate %.3f - filtered by global filter (threshold " F_U64 ")\n", ovl[oo].b_iid, ovl[oo].a_bgn(), ovl[oo].a_end(), ovlLength, ovl[oo].erate(), readScores[ovl[oo].b_iid]); continue; } if 
(children.find(ovl[oo].b_iid) != children.end()) { if (flgFile) fprintf(flgFile, " filter read %9u at position %6u,%6u length %5llu erate %.3f - duplicate\n", ovl[oo].b_iid, ovl[oo].a_bgn(), ovl[oo].a_end(), ovlLength, ovl[oo].erate()); continue; } if (flgFile) fprintf(flgFile, " allow read %9u at position %6u,%6u length %5llu erate %.3f\n", ovl[oo].b_iid, ovl[oo].a_bgn(), ovl[oo].a_end(), ovlLength, ovl[oo].erate()); tgPosition *pos = layout->addChild(); // Set the read. Parent is always the read we're building for, hangs and position come from // the overlap. Easy as pie! if (ovl[oo].flipped() == false) { pos->set(ovl[oo].b_iid, ovl[oo].a_iid, ovl[oo].a_hang(), ovl[oo].b_hang(), ovl[oo].a_bgn(), ovl[oo].a_end()); } else { pos->set(ovl[oo].b_iid, ovl[oo].a_iid, ovl[oo].a_hang(), ovl[oo].b_hang(), ovl[oo].a_end(), ovl[oo].a_bgn()); } // Remember the unaligned bit! pos->_askip = ovl[oo].dat.ovl.bhg5; pos->_bskip = ovl[oo].dat.ovl.bhg3; // record the ID children.insert(ovl[oo].b_iid); } // Use utgcns's stashContains to get rid of extra coverage; we don't care about it, and // just delete it immediately. savedChildren *sc = stashContains(layout, maxEvidenceCoverage); if ((flgFile) && (sc)) sc->reportRemoved(flgFile, layout->tigID()); if (sc) { delete sc->children; delete sc; } // stashContains also sorts by position, so we're done. #if 0 if (flgFile) for (uint32 ii=0; iinumberOfChildren(); ii++) fprintf(flgFile, " read %9u at position %6u,%6u hangs %6d %6d %c unAl %5d %5d\n", layout->getChild(ii)->_objID, layout->getChild(ii)->_min, layout->getChild(ii)->_max, layout->getChild(ii)->_ahang, layout->getChild(ii)->_bhang, layout->getChild(ii)->isForward() ? 
'F' : 'R', layout->getChild(ii)->_askip, layout->getChild(ii)->_bskip); #endif return(layout); } int main(int argc, char **argv) { char *gkpName = 0L; char *ovlName = 0L; char *scoreName = 0L; char *tigName = 0L; bool falconOutput = false; // To stdout bool trimToAlign = false; uint32 errorRate = AS_OVS_encodeEvalue(0.015); char *outputPrefix = NULL; char logName[FILENAME_MAX] = {0}; char sumName[FILENAME_MAX] = {0}; char flgName[FILENAME_MAX] = {0}; FILE *logFile = 0L; FILE *sumFile = 0L; FILE *flgFile = 0L; uint32 minEvidenceOverlap = 40; uint32 minEvidenceCoverage = 4; uint32 iidMin = 0; uint32 iidMax = UINT32_MAX; char *readListName = NULL; set readList; uint32 minEvidenceLength = 0; double maxEvidenceErate = 1.0; double maxEvidenceCoverage = DBL_MAX; uint32 minCorLength = 0; bool filterCorLength = false; bool legacyScore = false; argc = AS_configure(argc, argv); int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { // Input gkpStore gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-O") == 0) { // Input ovlStore ovlName = argv[++arg]; } else if (strcmp(argv[arg], "-S") == 0) { // Input scores scoreName = argv[++arg]; } else if (strcmp(argv[arg], "-T") == 0) { // Output tigStore tigName = argv[++arg]; } else if (strcmp(argv[arg], "-F") == 0) { // Output directly to falcon, not tigStore falconOutput = true; trimToAlign = true; } else if (strcmp(argv[arg], "-p") == 0) { // Output prefix, just logging and summary outputPrefix = argv[++arg]; } else if (strcmp(argv[arg], "-b") == 0) { // Begin read range iidMin = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-e") == 0) { // End read range iidMax = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-rl") == 0) { // List of reads to correct, will also apply -b/-e range readListName = argv[++arg]; } else if (strcmp(argv[arg], "-L") == 0) { // Minimum length of evidence overlap minEvidenceLength = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-E") == 0) { // Max error rate of evidence 
overlap maxEvidenceErate = atof(argv[++arg]); } else if (strcmp(argv[arg], "-c") == 0) { // Min coverage of evidence reads to consider the read corrected minEvidenceCoverage = atof(argv[++arg]); } else if (strcmp(argv[arg], "-C") == 0) { // Max coverage of evidence reads to emit. maxEvidenceCoverage = atof(argv[++arg]); } else if (strcmp(argv[arg], "-M") == 0) { // Minimum length of a corrected read minCorLength = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-legacy") == 0) { legacyScore = true; } else { fprintf(stderr, "ERROR: unknown option '%s'\n", argv[arg]); err++; } arg++; } if (gkpName == NULL) err++; if (ovlName == NULL) err++; if (err) { fprintf(stderr, "usage: %s -G gkpStore -O ovlStore [ -T tigStore | -F ] ...\n", argv[0]); fprintf(stderr, " -G gkpStore mandatory path to gkpStore\n"); fprintf(stderr, " -O ovlStore mandatory path to ovlStore\n"); fprintf(stderr, "\n"); fprintf(stderr, " -S file global score (binary) input file\n"); fprintf(stderr, "\n"); fprintf(stderr, " -T corStore output layouts to tigStore corStore\n"); fprintf(stderr, " -F output falconsense-style input directly to stdout\n"); fprintf(stderr, "\n"); fprintf(stderr, " -p name output prefix name, for logging and summary\n"); fprintf(stderr, "\n"); fprintf(stderr, " -b bgnID \n"); fprintf(stderr, " -e endID \n"); fprintf(stderr, "\n"); fprintf(stderr, " -rl file \n"); fprintf(stderr, "\n"); fprintf(stderr, " -L length minimum length of evidence overlaps\n"); fprintf(stderr, " -E erate maximum error rate of evidence overlaps\n"); fprintf(stderr, "\n"); fprintf(stderr, " -c coverage minimum coverage needed in evidence reads\n"); fprintf(stderr, "\n"); fprintf(stderr, " -C coverage maximum coverage of evidence reads to emit\n"); fprintf(stderr, " -M length minimum length of a corrected read\n"); fprintf(stderr, "\n"); if (gkpName == NULL) fprintf(stderr, "ERROR: no gkpStore input (-G) supplied.\n"); if (ovlName == NULL) fprintf(stderr, "ERROR: no ovlStore input (-O) supplied.\n"); 
exit(1); } // Open inputs and output tigStore. gkStore *gkpStore = gkStore::gkStore_open(gkpName); ovStore *ovlStore = new ovStore(ovlName, gkpStore); tgStore *tigStore = (tigName != NULL) ? new tgStore(tigName) : NULL; // Load read scores, if supplied. uint64 *readScores = NULL; if (scoreName) { readScores = new uint64 [gkpStore->gkStore_getNumReads() + 1]; errno = 0; FILE *scoreFile = fopen(scoreName, "r"); if (errno) fprintf(stderr, "failed to open '%s' for reading: %s\n", scoreName, strerror(errno)), exit(1); AS_UTL_safeRead(scoreFile, readScores, "scores", sizeof(uint64), gkpStore->gkStore_getNumReads() + 1); fclose(scoreFile); } // Threshold the range of reads to operate on. if (gkpStore->gkStore_getNumReads() < iidMin) { fprintf(stderr, "ERROR: only " F_U32 " reads in the store (IDs 0-" F_U32 " inclusive); can't process requested range -b " F_U32 " -e " F_U32 "\n", gkpStore->gkStore_getNumReads(), gkpStore->gkStore_getNumReads()-1, iidMin, iidMax); exit(1); } if (gkpStore->gkStore_getNumReads() < iidMax) iidMax = gkpStore->gkStore_getNumReads(); ovlStore->setRange(iidMin, iidMax); // If a readList is supplied, load it, respecting the iidMin/iidMax (only to cut down on the // size). 
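The evidence filters applied per overlap in generateLayout() above — maximum error rate, minimum aligned length, and (when a `-S` score file was loaded) a per-read global score threshold — can be sketched as a single predicate. This is an illustrative stand-in, not canu's code; the function and parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the evidence filters in generateLayout(): an overlap is rejected
// if its error rate is too high, its span on the A read is too short, or
// (when a global score file was supplied with -S) its score falls below the
// per-read threshold.  Names are illustrative, not canu's.
bool keepEvidence(double         erate,         double   maxEvidenceErate,
                  uint32_t       alignedLength, uint32_t minEvidenceLength,
                  uint64_t       ovlScore,
                  const uint64_t *readScores,   // may be nullptr (no -S file)
                  uint32_t       bID) {
  if (erate > maxEvidenceErate)                            return false;  // low quality
  if (alignedLength < minEvidenceLength)                   return false;  // too short
  if (readScores != nullptr && ovlScore < readScores[bID]) return false;  // global filter
  return true;
}

// Small demo score table for exercising the global-filter branch.
static const uint64_t demoScores[2] = {0, 60};
```

Duplicate-B-read suppression (the `children` set in generateLayout) is a separate check done after these three, so it is omitted here.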
if (readListName != NULL) { errno = 0; char L[1024]; FILE *R = fopen(readListName, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", readListName, strerror(errno)), exit(1); fgets(L, 1024, R); while (!feof(R)) { splitToWords W(L); uint32 id = W(0); if ((iidMin <= id) && (id <= iidMax)) readList.insert(W(0)); fgets(L, 1024, R); } fclose(R); } // Open logging and summary files if (outputPrefix) { snprintf(logName, FILENAME_MAX, "%s.log", outputPrefix); snprintf(sumName, FILENAME_MAX, "%s.summary", outputPrefix); snprintf(flgName, FILENAME_MAX, "%s.filter.log", outputPrefix); errno = 0; logFile = fopen(logName, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", logName, strerror(errno)), exit(1); sumFile = fopen(sumName, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", sumName, strerror(errno)), exit(1); #ifdef DEBUG_LAYOUT flgFile = fopen(flgName, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", flgName, strerror(errno)), exit(1); #endif } if (logFile) fprintf(logFile, "read\torigLen\tnumOlaps\tcorLen\n"); // Initialize processing. uint32 ovlMax = 1024 * 1024; uint32 ovlLen = 0; ovOverlap *ovl = ovOverlap::allocateOverlaps(gkpStore, ovlMax); ovlLen = ovlStore->readOverlaps(ovl, ovlMax, true); gkReadData *readData = new gkReadData; // And process. while (ovlLen > 0) { bool skipIt = false; char skipMsg[1024] = {0}; tgTig *layout = generateLayout(gkpStore, readScores, legacyScore, minEvidenceLength, maxEvidenceErate, maxEvidenceCoverage, ovl, ovlLen, flgFile); // If there was a readList, skip anything not in it. if ((readListName != NULL) && (readList.count(layout->tigID()) == 0)) { strcat(skipMsg, "\tnot_in_readList"); skipIt = true; } // Possibly filter by the length of the uncorrected read. 
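The `-rl` read-list loader just above keeps only IDs inside the `-b`/`-e` range. A minimal self-contained sketch, using `std::istringstream` in place of canu's `splitToWords` (an assumption for portability):

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <sstream>
#include <string>

// Stand-in for the -rl read-list loader: take the first whitespace-separated
// token of each line as a read ID and keep it only if it falls inside the
// [iidMin, iidMax] range.  istringstream replaces canu's splitToWords.
std::set<uint32_t> loadReadList(const std::string &text,
                                uint32_t iidMin, uint32_t iidMax) {
  std::set<uint32_t> readList;
  std::istringstream in(text);
  std::string        line;
  while (std::getline(in, line)) {
    std::istringstream words(line);
    uint32_t id = 0;
    if ((words >> id) && (iidMin <= id) && (id <= iidMax))
      readList.insert(id);   // set membership also deduplicates repeated IDs
  }
  return readList;
}
```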
gkRead *read = gkpStore->gkStore_getRead(layout->tigID()); if (read->gkRead_sequenceLength() < minCorLength) { strcat(skipMsg, "\tread_too_short"); skipIt = true; } // Possibly filter by the length of the corrected read, taking into account depth of coverage. intervalList coverage; for (uint32 ii=0; iinumberOfChildren(); ii++) { tgPosition *pos = layout->getChild(ii); coverage.add(pos->_min, pos->_max - pos->_min); } intervalList depth(coverage); int32 bgn = INT32_MAX; int32 corLen = 0; for (uint32 dd=0; ddnumberOfChildren() <= 1) { strcat(skipMsg, "\tno_children"); skipIt = true; } // Output, if not skipped. if (logFile) fprintf(logFile, "%u\t%u\t%u\t%u%s\n", layout->tigID(), read->gkRead_sequenceLength(), layout->numberOfChildren(), corLen, skipMsg); if ((skipIt == false) && (tigStore != NULL)) tigStore->insertTig(layout, false); if ((skipIt == false) && (falconOutput == true)) outputFalcon(gkpStore, layout, trimToAlign, stdout, readData); delete layout; // Load next batch of overlaps. ovlLen = ovlStore->readOverlaps(ovl, ovlMax, true); } if (falconOutput) fprintf(stdout, "- -\n"); delete readData; if (logFile != NULL) fclose(logFile); if (sumFile != NULL) fclose(sumFile); if (flgFile != NULL) fclose(flgFile); delete tigStore; delete ovlStore; gkpStore->gkStore_close(); return(0); } canu-1.6/src/correction/generateCorrectionLayouts.mk000066400000000000000000000011201314437614700230040ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := generateCorrectionLayouts SOURCES := generateCorrectionLayouts.C ../utgcns/stashContains.C ../falcon_sense/outputFalcon.C SRC_INCDIRS := .. 
../AS_UTL ../stores ../utgcns ../falcon_sense TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/correction/readConsensus.C000066400000000000000000000351161314437614700202040ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-JUN-23 to 2015-JUL-26 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-29 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "sweatShop.H" //#include #include "gkStore.H" #include "ovStore.H" #include "tgStore.H" #include "overlapReadCache.H" #include "NDalign.H" #include "analyzeAlignment.H" #include "AS_UTL_reverseComplement.H" #include "timeAndSize.H" // getTime(); class consensusGlobalData { public: consensusGlobalData(double maxErate_, uint32 bgnID_, int32 endID_, char *gkpName, char *ovlName, char *tigName, uint32 tigVers, char *cnsName, char *fastqName, uint64 memLimit_) { // Inputs gkpStore = gkStore::gkStore_open(gkpName); readCache = new overlapReadCache(gkpStore, memLimit); ovlStore = (ovlName) ? new ovStore(ovlName, gkpStore) : NULL; tigStore = (tigName) ? 
new tgStore(tigName, tigVers) : NULL; if (ovlStore) fprintf(stderr, "consensusGlobalData()-- opened ovlStore '%s'\n", ovlName); if (tigStore) fprintf(stderr, "consensusGlobalData()-- opened tigStore '%s'\n", tigName); // Parameters maxErate = maxErate_; memLimit = memLimit_; if (bgnID_ == 0) bgnID_ = 1; if (endID_ > gkpStore->gkStore_getNumReads()) endID_ = gkpStore->gkStore_getNumReads() + 1; bgnID = bgnID_; curID = bgnID_; endID = endID_; // Outputs cnsFile = NULL; fastqFile = NULL; if (cnsName) { errno = 0; cnsFile = fopen(cnsName, "w"); if (errno) fprintf(stderr, "ERROR: failed to open '%s' for writing: %s\n", cnsName, strerror(errno)), exit(1); } if (fastqName) { errno = 0; fastqFile = fopen(fastqName, "w"); if (errno) fprintf(stderr, "ERROR: failed to open '%s' for writing: %s\n", fastqName, strerror(errno)), exit(1); } // State for loading overlaps // State for loading tigs }; ~consensusGlobalData() { gkpStore->gkStore_close(); delete readCache; delete ovlStore; delete tigStore; if (cnsFile) fclose(cnsFile); if (fastqFile) fclose(fastqFile); }; // Parameters double maxErate; uint64 memLimit; uint32 bgnID; uint32 curID; // Currently loading id uint32 endID; // Inputs gkStore *gkpStore; overlapReadCache *readCache; ovStore *ovlStore; tgStore *tigStore; // State for loading gkReadData readData; // State for loading overlaps // State for loading tigs // Outputs FILE *cnsFile; FILE *fastqFile; }; class consensusThreadData { public: consensusThreadData(consensusGlobalData *g, uint32 tid) { threadID = tid; nPassed = 0; nFailed = 0; align = new NDalign(pedGlobal, g->maxErate, 15); // true = partial aligns, maxErate, seedSize analyze = new analyzeAlignment(); }; ~consensusThreadData() { delete align; delete analyze; }; uint32 threadID; uint64 nPassed; uint64 nFailed; char bRev[AS_MAX_READLEN]; NDalign *align; analyzeAlignment *analyze; }; // The overlap compute needs both strings in the correct orientation. 
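The worker below converts the unaligned skip amounts at the two ends of the B read into alignment coordinates, swapping ends when the placement is reversed. A sketch of that conversion, under the assumed semantics that `askip`/`bskip` are the unaligned base counts at the B read's two ends (the helper name is hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Sketch of the B-read coordinate setup in consensusWorker(): on the forward
// strand the aligned span is [askip, bLen - bskip); for a reversed placement
// the same span is measured from the opposite end of the read.
std::pair<int32_t, int32_t> alignedSpanB(uint32_t bLen,
                                         int32_t askip, int32_t bskip,
                                         bool isReverse) {
  int32_t len = (int32_t)bLen;
  int32_t bLo = (isReverse == false) ? (askip)        : (len - askip);
  int32_t bHi = (isReverse == false) ? (len - bskip)  : (bskip);
  return std::make_pair(bLo, bHi);
}
```

For a reversed placement `bLo > bHi`, matching the overlap spec's convention of reversed coordinates for flipped overlaps.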
// This loaded just loads the tig/overlaps, and converts to a common format. // It ensures that the overlapReadCache is loaded. // aID aBgn aEnd // bID bBgn bEnd bFlip class consensusComputation { public: consensusComputation(tgTig *tig) { _tig = tig; _corLen = 0; _corSeq = NULL; _corQlt = NULL; }; ~consensusComputation() { delete [] _corSeq; delete [] _corQlt; }; public: tgTig *_tig; // Input //ovOverlap *_overlaps; // Input, we convert this to a tig... //uint32 _overlapsLen; uint32 _corLen; char *_corSeq; // Output sequence char *_corQlt; }; // Complicated. Load the overlaps, filter, and convert to a tig. consensusComputation * consensusReaderOverlaps(consensusGlobalData *UNUSED(g)) { return(NULL); } // Simple, just load the tig and call it a day. consensusComputation * consensusReaderTigs(consensusGlobalData *g) { tgTig *t = NULL; consensusComputation *s = NULL; while ((t == NULL) && (g->curID < g->endID)) t = g->tigStore->loadTig(g->curID++); if (t) s = new consensusComputation(t); return(s); } void * consensusReader(void *G) { consensusGlobalData *g = (consensusGlobalData *)G; consensusComputation *s = NULL; if ((g->curID < g->endID) && (g->ovlStore)) s = consensusReaderOverlaps(g); if ((g->curID < g->endID) && (g->tigStore)) s = consensusReaderTigs(g); if (s) g->readCache->loadReads(s->_tig); return(s); } void consensusWorker(void *G, void *T, void *S) { consensusGlobalData *g = (consensusGlobalData *)G; consensusThreadData *t = (consensusThreadData *)T; consensusComputation *s = (consensusComputation *)S; uint32 rID = s->_tig->tigID(); fprintf(stderr, "THREAD %u working on tig %u\n", t->threadID, rID); t->analyze->reset(rID, g->readCache->getRead(rID), g->readCache->getLength(rID)); for (uint32 oo=0; oo_tig->numberOfChildren(); oo++) { if (s->_tig->getChild(oo)->isRead() == false) continue; tgPosition *pos = s->_tig->getChild(oo); // // This colosely follows overlapPair // // Load A. 
uint32 aID = s->_tig->tigID(); char *aStr = g->readCache->getRead (aID); uint32 aLen = g->readCache->getLength(aID); int32 aLo = pos->min() - 100; if (aLo < 0) aLo = 0; int32 aHi = pos->max() + 100; assert(aID == rID); assert(aLo < aHi); // Load B. If reversed, we need to reverse the coordinates to meet the overlap spec. uint32 bID = pos->ident(); char *bStr = g->readCache->getRead (bID); uint32 bLen = g->readCache->getLength(bID); int32 bLo = (pos->isReverse() == false) ? ( pos->askip()) : (bLen - pos->askip()); int32 bHi = (pos->isReverse() == false) ? (bLen - pos->bskip()) : ( pos->bskip()); // Compute the overlap #if 0 fprintf(stderr, "aligned %6u %s %6u -- %5u-%5u %5u-%5u\n", aID, pos->isReverse() ? "<--" : "-->", bID, aLo, aHi, bLo, bHi); #endif t->align->initialize(aID, aStr, aLen, aLo, aHi, bID, bStr, bLen, bLo, bHi, pos->isReverse()); if (t->align->findMinMaxDiagonal(40) == false) continue; if (t->align->findSeeds(false) == false) continue; t->align->findHits(); t->align->chainHits(); if (t->align->processHits() == true) { t->nPassed++; int32 aLoO = t->align->abgn(); int32 aHiO = t->align->aend(); int32 bLoO = t->align->bbgn(); int32 bHiO = t->align->bend(); if (bLoO > bHiO) { bLoO = t->align->bend(); bHiO = t->align->bbgn(); } #if 0 fprintf(stderr, "aligned %6u %s %6u -- %5u-%5u %5u-%5u -- %5u-%5u %5u-%5u -- %6.2f\n", aID, pos->isReverse() ? "<--" : "-->", bID, aLo, aHi, bLo, bHi, aLoO, aHiO, bLoO, bHiO, 100.0 * (aHiO - aLoO) / (aHi - aLo)); //t->align->display(); #endif // This wants pointers to the start of the strings that align, and the offset to the full read. t->analyze->analyze(t->align->astr() + aLoO, aHiO - aLoO, aLoO, t->align->bstr() + bLoO, bHiO - bLoO, t->align->deltaLen(), t->align->delta()); } else { t->nFailed++; #if 1 fprintf(stderr, "FAILED %6u %s %6u -- %5u-%5u %5u-%5u -- %5u-%5u %5u-%5u -- %6.2f\n", aID, pos->isReverse() ? 
"<--" : "-->", bID, aLo, aHi, bLo, bHi, 0, 0, 0, 0, 0.0); //t->align->display(); #endif } } //fprintf(stderr, "THREAD %2u finished with tig %5u -- passed %12u -- failed %12u\n", t->threadID, s->_tig->tigID(), t->nPassed, t->nFailed); t->analyze->generateCorrections(); t->analyze->generateCorrectedRead(); s->_corLen = t->analyze->_corSeqLen; s->_corSeq = new char [s->_corLen + 1]; s->_corQlt = new char [s->_corLen + 1]; memcpy(s->_corSeq, t->analyze->_corSeq, sizeof(char) * (s->_corLen + 1)); memcpy(s->_corQlt, t->analyze->_corSeq, sizeof(char) * (s->_corLen + 1)); for (uint32 ii=0; ii_corLen; ii++) s->_corQlt[ii] = '!' + 10; } void consensusWriter(void *G, void *S) { consensusGlobalData *g = (consensusGlobalData *)G; consensusComputation *s = (consensusComputation *)S; if (g->cnsFile) { } if (g->fastqFile) { fprintf(g->fastqFile, "@read%08u\n", s->_tig->tigID()); fprintf(g->fastqFile, "%s\n", s->_corSeq); fprintf(g->fastqFile, "+\n"); fprintf(g->fastqFile, "%s\n", s->_corQlt); } delete s; } int main(int argc, char **argv) { char *gkpName = NULL; char *ovlName = NULL; char *tigName = NULL; uint32 tigVers = 1; char *cnsName = NULL; char *fastqName = NULL; uint32 bgnID = 1; uint32 endID = UINT32_MAX; uint32 numThreads = 1; double maxErate = 0.02; uint64 memLimit = 4; argc = AS_configure(argc, argv); int err=0; int arg=1; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-O") == 0) { ovlName = argv[++arg]; } else if (strcmp(argv[arg], "-T") == 0) { tigName = argv[++arg]; tigVers = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-c") == 0) { cnsName = argv[++arg]; } else if (strcmp(argv[arg], "-f") == 0) { fastqName = argv[++arg]; } else if (strcmp(argv[arg], "-b") == 0) { bgnID = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-e") == 0) { endID = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-t") == 0) { numThreads = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-erate") == 0) { maxErate = 
atof(argv[++arg]); } else if (strcmp(argv[arg], "-memory") == 0) { memLimit = atoi(argv[++arg]); } else { err++; } arg++; } if (gkpName == NULL) err++; if ((ovlName == NULL) && (tigName == NULL)) err++; if ((ovlName != NULL) && (tigName != NULL)) err++; if ((cnsName == NULL) && (fastqName == NULL)) err++; if (err) { fprintf(stderr, "usage: %s ...\n", argv[0]); fprintf(stderr, " -G gkpStore Mandatory, path to gkpStore\n"); fprintf(stderr, "\n"); fprintf(stderr, "Inputs can come from either an overlap or a tig store.\n"); fprintf(stderr, " -O ovlStore \n"); fprintf(stderr, " -T tigStore tigVers \n"); fprintf(stderr, "\n"); fprintf(stderr, "If from an ovlStore, the range of reads processed can be restricted.\n"); fprintf(stderr, " -b bgnID \n"); fprintf(stderr, " -e endID \n"); fprintf(stderr, "\n"); fprintf(stderr, "Outputs will be written as the full multialignment and the final consensus sequence\n"); fprintf(stderr, " -c output.cns \n"); fprintf(stderr, " -f output.fastq \n"); fprintf(stderr, "\n"); fprintf(stderr, " -erate e Overlaps are computed at 'e' fraction error; must be larger than the original erate\n"); fprintf(stderr, " -memory m Use up to 'm' GB of memory\n"); fprintf(stderr, "\n"); fprintf(stderr, " -t n Use up to 'n' cores\n"); fprintf(stderr, "\n"); if (gkpName == NULL) fprintf(stderr, "ERROR: no gatekeeper (-G) supplied.\n"); if ((ovlName == NULL) && (tigName == NULL)) fprintf(stderr, "ERROR: no inputs (-O or -T) supplied.\n"); if ((ovlName != NULL) && (tigName != NULL)) fprintf(stderr, "ERROR: only one input (-O or -T) may be supplied.\n"); if ((cnsName == NULL) && (fastqName == NULL)) fprintf(stderr, "ERROR: no outputs (-c or -f) supplied.\n"); exit(1); } consensusGlobalData *g = new consensusGlobalData(maxErate, bgnID, endID, gkpName, ovlName, tigName, tigVers, cnsName, fastqName, memLimit); #if 1 consensusThreadData *t = new consensusThreadData(g, 0); while (1) { consensusComputation *c = (consensusComputation *)consensusReader(g); if (c == 
NULL) break; consensusWorker(g, t, c); consensusWriter(g, c); } delete t; #else consensusThreadData **td = new consensusThreadData * [numThreads]; sweatShop *ss = new sweatShop(consensusReader, consensusWorker, consensusWriter); ss->setLoaderQueueSize(16384); ss->setWriterQueueSize(1024); ss->setNumberOfWorkers(numThreads); for (uint32 w=0; wsetThreadData(w, td[w] = new consensusThreadData(g, w)); // these leak ss->run(g, true); delete ss; for (uint32 w=0; w $prefix.stats.dat"); open(E, "> $prefix.erate.dat"); # Ignore the header. $_ = ; while () { s/^\s+//; s/\s+$//; #my @v = split '\s+', $_; #my $biid = $v[0]; #my $abgn = $v[2]; #my $aend = $v[3]; my ($biid, $abgn, $aend, $erate); if (m/(\d+)\s+A:\s+(\d+)\s+(\d+)\s+\(\s*\d+\)\s+B:\s+\d+\s+\d+\s+\(\s*\d+\)\s+(\d+.\d+)%/) { $biid = $1; $abgn = $2; $aend = $3; $erate = $4; } else { print "NOPE $_\n"; } for (my $ii=$abgn; $ii<=$aend; $ii++) { if (!defined($erates[$ii])) { $erates[$ii] = "$erate"; } else { $erates[$ii] .= ":$erate"; } } #print E "$abgn\t$aend\t$erate\n"; } close(F); my $maxPoint = scalar(@erates); for (my $pp=0; $pp<$maxPoint; $pp++) { my @vals = split ':', $erates[$pp]; my $depth = scalar (@vals); @vals = sort {$a <=> $b } @vals; my $min = $vals[0]; my $max = $vals[$depth-1]; my $sum = 0; foreach my $v (@vals) { $sum += $v; } my $ave = $sum / $depth; print D "$pp\t$min\t$ave\t$max\t$depth\n"; foreach my $v (@vals) { print E "$pp\t$v\n"; } } close(D); close(E); system("overlapStore -p $iid test.ovlStore test.gkpStore CLR > $prefix.ovlPicture"); open(F, "| gnuplot"); print F "set terminal 'png' size 1280,800\n"; print F "set output '$prefix.png'\n"; print F "plot [][0:30]"; print F " '$prefix.stats.dat' using 1:2 with lines title 'min % identity',"; print F " '$prefix.stats.dat' using 1:3 with lines title 'ave % identity',"; print F " '$prefix.stats.dat' using 1:4 with lines title 'max % identity',"; print F " '$prefix.stats.dat' using 1:5 with lines title 'depth',"; #print F " '$prefix.erate.dat' 
using 1:2 with points ps 0.2 title 'erate'\n"; print F " '$prefix.erate.dat' using 1:3 with points ps 0.3 title 'bgnerates',"; print F " '$prefix.erate.dat' using 2:3 with points ps 0.3 title 'enderates'\n"; close(F); } canu-1.6/src/erateEstimate/erate-estimate-test-based-on-mapping.pl000066400000000000000000000134301314437614700252460ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # This file is derived from: # # src/AS_BAT/erate-estimate-test-based-on-mapping.pl # # Modifications by: # # Brian P. Walenz from 2014-NOV-15 to 2015-AUG-07 # are Copyright 2014-2015 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. Walenz beginning on 2015-OCT-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## use strict; my $erateFile = "BO/test.blasr.sam.coords.longestErate"; my $uidMapFile = "CAordered/test.gkpStore.fastqUIDmap"; my $olapStore = "CAordered/test.ovlStore"; my %erate5; # Either the mapped global erate, or the lowest erate for each end my %erate3; # Load the map from IID to UID and name. 
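The profile script above gathers every overlap error rate covering each A-read position and prints min/average/max plus depth per position for gnuplot. The per-position aggregation can be sketched in C++ (the script itself is Perl; C++ is used here for consistency with the rest of the tree):

```cpp
#include <cassert>
#include <algorithm>
#include <vector>

// Per-position summary used by the erate-profile script: for one position's
// list of overlap error rates, report minimum, mean, maximum and depth.
// Precondition: erates is non-empty.
struct ErateSummary { double min, ave, max; std::size_t depth; };

ErateSummary summarize(std::vector<double> erates) {
  std::sort(erates.begin(), erates.end());
  double sum = 0.0;
  for (double e : erates)
    sum += e;
  ErateSummary s;
  s.min   = erates.front();
  s.max   = erates.back();
  s.depth = erates.size();
  s.ave   = sum / s.depth;
  return s;
}
```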
my %UIDtoIID; my %UIDtoNAM; my %IIDtoUID; my %IIDtoNAM; my %NAMtoUID; my %NAMtoIID; if (defined($uidMapFile)) { print STDERR "Read names from '$uidMapFile'\n"; open(F, "< $uidMapFile"); while () { my @v = split '\s+', $_; # UID IID NAM UID IID NAM $UIDtoIID{$v[0]} = $v[1]; $UIDtoNAM{$v[0]} = $v[2]; $IIDtoUID{$v[1]} = $v[0]; $IIDtoNAM{$v[1]} = $v[2]; $NAMtoUID{$v[2]} = $v[0]; $NAMtoIID{$v[2]} = $v[1]; if (scalar(@v) == 6) { $UIDtoIID{$v[3]} = $v[4]; $UIDtoNAM{$v[3]} = $v[5]; $IIDtoUID{$v[4]} = $v[3]; $IIDtoNAM{$v[4]} = $v[5]; $NAMtoUID{$v[5]} = $v[3]; $NAMtoIID{$v[5]} = $v[4]; } } close(F); } if (0) { print STDERR "Load global read erates from '$erateFile'\n"; my $found = 0; my $lost = 0; if (! -e "$erateFile") { die "Run erate-coords.pl in BO directory.\n"; } open(F, "< $erateFile") or die; while() { my @v = split '\s+', $_; # This should be read NAME, not assmelby UID. my $iid = $NAMtoIID{$v[0]}; if (defined($iid)) { $found++; $erate5{$iid} = 100 - $v[1]; $erate3{$iid} = 100 - $v[1]; } else { $lost++; #print STDERR "Didn't find name '$v[0]' - dropped from read set?\n"; } } close(F); print STDERR "Found $found erates, lost $lost.\n"; } if (1) { print STDERR "Load overlap erates from '$olapStore'\n"; open(F, "overlapStore -d $olapStore |"); while () { s/^\s+//; s/\s+$//; #print "$_\n"; my ($aIID, $bIID, $orient, $aHang, $bHang, $erate, $crate) = split '\s+', $_; if (($aHang <= 0) && ($bHang >= 0)) { # A contained in B next; } elsif (($aHang >= 0) && ($bHang <= 0)) { # B contained in A next; } elsif (($aHang <= 0) && ($bHang <= 0)) { # Dovetail off the 5' end $erate5{$aIID} = $erate if ((!exists($erate5{$aIID})) || ($erate < $erate5{$aIID})); } elsif (($aHang >= 0) && ($bHang >= 0)) { # Dovetail off the 3' end $erate3{$aIID} = $erate if ((!exists($erate3{$aIID})) || ($erate < $erate3{$aIID})); } else { die; } } } # Scan overlaps, try to decide true from false. 
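Both passes over the overlap store above branch on the signs of the two hangs to separate containments from 5' and 3' dovetails before updating the best-erate-per-end tables. That classification, preserving the script's test order (so a zero-hang overlap counts as a containment first), can be written as:

```cpp
#include <cassert>

// Overlap classification used throughout the erate-estimate scripts: the
// signs of the two hangs say whether the overlap is a containment or a
// dovetail off the 5' or 3' end of the A read.  Tests are ordered as in the
// script, so (0, 0) resolves to A_CONTAINED.
enum OverlapClass { A_CONTAINED, B_CONTAINED, DOVETAIL_5, DOVETAIL_3 };

OverlapClass classifyOverlap(int aHang, int bHang) {
  if ((aHang <= 0) && (bHang >= 0)) return A_CONTAINED;  // A contained in B
  if ((aHang >= 0) && (bHang <= 0)) return B_CONTAINED;  // B contained in A
  if ((aHang <= 0) && (bHang <= 0)) return DOVETAIL_5;   // off the 5' end of A
  return DOVETAIL_3;                                     // off the 3' end of A
}
```

For dovetails, the orientation flag (`N` vs `I`) then decides which end of the B read the overlap touches, as in the second loop of the script.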
print STDERR "Read overlas from '$olapStore'\n"; my @ratioTrue; my @ratioFalse; open(F, "overlapStore -d $olapStore |"); while () { s/^\s+//; s/\s+$//; #print "$_\n"; my ($aIID, $bIID, $orient, $aHang, $bHang, $erate, $crate) = split '\s+', $_; my $aErate = 0; my $bErate = 0; if (($aHang <= 0) && ($bHang >= 0)) { # A contained in B next; } elsif (($aHang >= 0) && ($bHang <= 0)) { # B contained in A next; } elsif (($aHang <= 0) && ($bHang <= 0) && ($orient eq "N")) { # Dovetail off the 5' end of A to the 3' end of B $aErate = $erate5{$aIID}; $bErate = $erate3{$bIID}; } elsif (($aHang <= 0) && ($bHang <= 0) && ($orient eq "I")) { # Dovetail off the 5' end of A to the 5' end of B $aErate = $erate5{$aIID}; $bErate = $erate5{$bIID}; } elsif (($aHang >= 0) && ($bHang >= 0) && ($orient eq "N")) { # Dovetail off the 3' end of A to the 5' end of B $aErate = $erate3{$aIID}; $bErate = $erate5{$bIID}; } elsif (($aHang >= 0) && ($bHang >= 0) && ($orient eq "I")) { # Dovetail off the 3' end of A to the 3' end of B $aErate = $erate3{$aIID}; $bErate = $erate3{$bIID}; } else { die; } print STDERR "Didn't find erate for IID $aIID.\n $_\n" if (!defined($aErate)); print STDERR "Didn't find erate for IID $bIID.\n $_\n" if (!defined($bErate)); next if (!defined($aErate)); next if (!defined($bErate)); my $ratio = int(100 * ($erate) / ($aErate + $bErate)); my $dist = ($aIID < $bIID) ? 
($bIID - $aIID) : ($aIID - $bIID); die if ($dist < 0); if ($dist < 50) { $ratioTrue[$ratio]++; } else { $ratioFalse[$ratio]++; } } close(F); open(F, "> r"); for (my $x=0; $x<1000; $x++) { next if (!defined($ratioTrue[$x]) && (!defined($ratioFalse[$x]))); $ratioTrue[$x] += 0; $ratioFalse[$x] += 0; my $v = $x / 100; print F "$v\t$ratioTrue[$x]\t$ratioFalse[$x]\n"; } close(F); canu-1.6/src/erateEstimate/erateEstimate.C000066400000000000000000000510421314437614700206050ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/erate-estimate.C * * Modifications by: * * Brian P. Walenz from 2014-OCT-21 to 2015-JUL-29 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_global.H" #include "gkStore.H" #include "ovStore.H" #include "memoryMappedFile.H" #include "intervalList.H" #include #include #include #include #include using namespace std; #ifndef BROKEN_CLANG_OpenMP #include #endif #define ERATE_TOLERANCE 0.03 uint32 blockSize = 1000; class readErrorEstimate { public: readErrorEstimate() { seqLen = 0; errorMeanS = NULL; // Sum of error up to this point //errorStdDev = NULL; //errorConfidence = NULL; errorMeanU = NULL; // Updated error estimate for this point }; ~readErrorEstimate() { delete [] errorMeanS; //delete [] errorStdDev; //delete [] errorConfidence; delete [] errorMeanU; }; uint64 initialize(gkRead *read) { //uint32 lid = read->gkRead_getLibraryIID(); //gkLibrary *lb = gkpStore->gkStore_getLibrary(lid); seqLen = read->gkRead_sequenceLength(); errorMeanS = new uint32 [seqLen + 1]; //errorStdDev = new double [seqLen + 1]; //errorConfidence = new double [seqLen + 1]; errorMeanU = new uint16 [seqLen + 1]; #if 0 uint32 err40 = AS_OVS_encodeEvalue(0.40); errorMeanS[0] = err40; errorMeanU[0] = 0; // instead of this loop, we can build one full (0-max_read_len) array, // then copy the start N elements to each errorMeanS for (uint32 ii=1; ii> 14) & 0x000001ff; b_iid_lo = (ovl.b_iid >> 0) & 0x00003fff; a_hang = ovl.a_hang(); b_hang = ovl.b_hang(); erate = ovl.evalue(); flipped = ovl.flipped(); discarded = false; assert(ovl.a_iid == a_iid); assert(ovl.b_iid == ((b_iid_hi << 14) | (b_iid_lo))); assert(ovl.a_hang() == a_hang); assert(ovl.b_hang() == b_hang); assert(ovl.erate() == erate); assert(ovl.flipped() == flipped); }; #if 0 void populateOBT(ovOverlap &obt, readErrorEstimate *readProfile, uint32 iidMin) { obt.a_iid = a_iid; obt.b_iid = (b_iid_hi << 14) | (b_iid_lo); int32 seqLenA = readProfile[obt.a_iid - iidMin].seqLen; int32 seqLenB = readProfile[obt.b_iid - iidMin].seqLen; // Swiped from AS_OVS_overlap.C uint32 abgn = (a_hang < 0) ? (0) : (a_hang); uint32 aend = (b_hang < 0) ? 
(seqLenA + b_hang) : (seqLenA); uint32 bbgn = (a_hang < 0) ? (-a_hang) : (0); uint32 bend = (b_hang < 0) ? (seqLenB) : (seqLenB - b_hang); obt.dat.obt.type = AS_OVS_TYPE_OBT; obt.dat.obt.a_beg = abgn; obt.dat.obt.a_end = aend; obt.dat.obt.b_beg = bbgn; obt.dat.obt.b_end_hi = (bend >> 9) & 0x00003fff; obt.dat.obt.b_end_lo = (bend) & 0x000001ff; obt.dat.obt.fwd = (flipped == false); obt.dat.obt.erate = erate; }; #endif }; class ESToverlapSpan { public: ESToverlapSpan(ESToverlap& ovl, readErrorEstimate *readProfile, uint32 iidMin) { a_iid = ovl.a_iid; b_iid = (ovl.b_iid_hi << 14) | (ovl.b_iid_lo); int32 seqLenA = readProfile[a_iid - iidMin].seqLen; int32 seqLenB = readProfile[b_iid - iidMin].seqLen; // Swiped from AS_OVS_overlap.C uint32 abgn = (ovl.a_hang < 0) ? (0) : (ovl.a_hang); uint32 aend = (ovl.b_hang < 0) ? (seqLenA + ovl.b_hang) : (seqLenA); uint32 bbgn = (ovl.a_hang < 0) ? (-ovl.a_hang) : (0); uint32 bend = (ovl.b_hang < 0) ? (seqLenB) : (seqLenB - ovl.b_hang); a_beg = abgn; a_end = aend; b_beg = bbgn; b_end = bend; fwd = (ovl.flipped == false); erate = ovl.erate; }; uint32 a_iid; uint32 b_iid; uint32 a_beg; uint32 a_end; uint32 b_beg; uint32 b_end; uint32 fwd; uint32 erate; }; void saveProfile(uint32 iid, uint32 iter, readErrorEstimate *readProfile) { char N[FILENAME_MAX]; snprintf(N, FILENAME_MAX, "erate-%08u-%02u.dat", iid, iter); FILE *F = fopen(N, "w"); for (uint32 pp=0; ppgkStore_getNumReads(), iter); #pragma omp parallel for schedule(dynamic, blockSize) for (uint32 iid=0; iid eRateList; for (uint64 oo=overlapIndex[iid]; oo 0) { double estError = computeEstimatedErate(iidMin, ovl, readProfile); if (estError + ERATE_TOLERANCE < erate) { overlaps[oo].discarded = true; nDiscard++; continue; } } // Otherwise, add it to the list of intervals. 
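The estimate update below accumulates each surviving overlap's error rate over its span on the read, then takes the per-base mean of the contributions (zero where nothing covers). A direct array-based sketch standing in for canu's `intervalList` value/depth machinery (no evalue encoding, hypothetical names):

```cpp
#include <cassert>
#include <vector>

struct Span { int lo, hi; double erate; };  // half-open span [lo, hi)

// Sketch of the per-base error-estimate update in erateEstimate.C: every
// surviving overlap contributes its error rate over its span on the read;
// the new estimate at a base is the mean of the contributions covering it,
// and 0 where the base is uncovered.
std::vector<double> meanErate(const std::vector<Span> &spans, int readLen) {
  std::vector<double> sum(readLen, 0.0);
  std::vector<int>    depth(readLen, 0);
  for (const Span &s : spans)
    for (int p = s.lo; p < s.hi; p++) {
      sum[p] += s.erate;
      depth[p]++;
    }
  std::vector<double> mean(readLen, 0.0);
  for (int p = 0; p < readLen; p++)
    if (depth[p] > 0)
      mean[p] = sum[p] / depth[p];
  return mean;
}
```

canu's real code avoids the per-base inner loop by summing over `intervalList` segments and quantizing each segment's mean with `AS_OVS_encodeEvalue`; the result per base is the same.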
nRemain++; eRateList.add(ovl.a_beg, ovl.a_end - ovl.a_beg, erate / 2); } // Convert the list to a sum of error rate per base intervalList eRateMap(eRateList); // Unpack the list into an array of mean error rate per base memset(readProfile[iid].errorMeanU, 0, sizeof(uint16) * (readProfile[iid].seqLen + 1)); for (uint32 ii=0; ii 0) ? (eRateMap.value(ii) / eRateMap.depth(ii)) : 0; assert(0.0 <= eVal); assert(eVal <= 1.0); assert(eRateMap.hi(ii) <= readProfile[iid].seqLen); uint16 eEnc = AS_OVS_encodeEvalue(eVal); for (uint32 pp=eRateMap.lo(ii); pp < eRateMap.hi(ii); pp++) readProfile[iid].errorMeanU[pp] = eEnc; } // Keep users entertained. if ((iid % 1000) == 0) fprintf(stderr, "IID " F_U32 "\r", iid); } // All new estimates are computed. Convert the array of mean error per base into an array of // summed error per base for (uint32 iid=0; iidnumOverlapsInRange(); fprintf(stderr, "Processing from IID " F_U32 " to " F_U32 " out of " F_U32 " reads.\n", iidMin, iidMin + numIIDs, gkpStore->gkStore_getNumReads()); // Can't thread. This does sequential output. Plus, it doesn't compute anything. // Overlaps in the store and those in the list should be in lock-step. We can just // walk down each. 
uint32 overlapblock = 100000000; ovOverlap *overlapsload = ovOverlap::allocateOverlaps(gkpStore, overlapblock); for (uint64 no=0; noreadOverlaps(overlapsload, overlapblock, false); assert(nLoad > 0); for (uint32 xx=0; xxwriteOverlap(overlapsload + xx); nRemain++; } if ((no & 0x000fffff) == 0) fprintf(stderr, " overlap %10" F_U64P " %8" F_U32P "-%8" F_U32P "\r", no, a_iid, b_iid); } } delete [] overlapsload; #if 0 for (uint32 iid=0; iidwriteOverlap(&obt); } if ((iid % 1000) == 0) fprintf(stderr, "IID " F_U32 "\r", iid); } #endif delete outStore; delete inpStore; fprintf(stderr, "\n"); fprintf(stderr, "nDiscarded " F_U64 " (in previous iterations)\n", nDiscarded); fprintf(stderr, "nRemain " F_U64 "\n", nRemain); } int main(int argc, char **argv) { char *gkpName = 0L; gkStore *gkpStore = 0L; char *ovlStoreName = 0L; ovStore *ovlStore = 0L; char *ovlCacheName = 0L; uint32 errorRate = AS_OVS_encodeEvalue(0.015); double errorLimit = 2.5; char *outputPrefix = NULL; char logName[FILENAME_MAX] = {0}; char sumName[FILENAME_MAX] = {0}; FILE *logFile = 0L; FILE *sumFile = 0L; argc = AS_configure(argc, argv); uint32 minEvidenceOverlap = 40; uint32 minEvidenceCoverage = 1; uint32 iidMin = UINT32_MAX; uint32 iidMax = UINT32_MAX; uint32 partNum = 0; uint32 partMax = 1; uint32 minLen = 0; double minErate = 0; int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-O") == 0) { ovlStoreName = argv[++arg]; } else if (strcmp(argv[arg], "-C") == 0) { ovlCacheName = argv[++arg]; } else if (strcmp(argv[arg], "-b") == 0) { iidMin = atoi(argv[++arg]); partNum = 0; partMax = 0; } else if (strcmp(argv[arg], "-e") == 0) { iidMax = atoi(argv[++arg]); partNum = 0; partMax = 0; } else if (strcmp(argv[arg], "-p") == 0) { partNum = atoi(argv[++arg]) - 1; partMax = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-L") == 0) { //minLen = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-E") == 0) { //minErate = 
atof(argv[++arg]); } else { fprintf(stderr, "ERROR: unknown option '%s'\n", argv[arg]); err++; } arg++; } if (err) { exit(1); } // Can't use dynamic threads, screws up the thread-local memory allocations. omp_set_dynamic(false); // Open gatekeeper store fprintf(stderr, "Opening '%s'\n", gkpName); gkpStore = gkStore::gkStore_open(gkpName); // Compute what to compute. if (partNum < partMax) { uint32 nf = gkpStore->gkStore_getNumReads(); iidMin = (partNum + 0) * nf / partMax + 1; iidMax = (partNum + 1) * nf / partMax; if (partNum + 1 == partMax) iidMax = nf; } if (iidMin == UINT32_MAX) iidMin = 1; if (iidMax == UINT32_MAX) iidMax = gkpStore->gkStore_getNumReads(); uint32 numIIDs = (iidMax - iidMin + 1); fprintf(stderr, " iidMin = %9u\n", iidMin); fprintf(stderr, " iidMax = %9u numReads = %9u\n", iidMax, gkpStore->gkStore_getNumReads()); fprintf(stderr, " partNum = %9u\n", partNum); fprintf(stderr, " partMax = %9u\n", partMax); //fprintf(stderr, "ovOverlap " F_U64 "\n", sizeof(ovOverlap)); //fprintf(stderr, "ESToverlap " F_U64 "\n", sizeof(ESToverlap)); // Load read metadata, clear ranges, read lengths, and deleted status. 
fprintf(stderr, "Initializing profiles\n"); uint64 readProfileSize = 0; readErrorEstimate *readProfile = new readErrorEstimate [numIIDs]; for (uint32 iid=0; iidgkStore_getRead(iid + iidMin)); if ((iid % 10000) == 0) fprintf(stderr, " " F_U32 " reads\r", iid); } fprintf(stderr, " " F_U32 " reads\n", numIIDs); fprintf(stderr, " " F_U64 " GB\n", readProfileSize >> 30); // Open overlap stores fprintf(stderr, "Opening '%s'\n", ovlStoreName); ovlStore = new ovStore(ovlStoreName, gkpStore); fprintf(stderr, "Finding number of overlaps\n"); ovlStore->setRange(iidMin, iidMax); uint64 numOvls = ovlStore->numOverlapsInRange(); uint64 *overlapIndex = new uint64 [numIIDs + 1]; uint32 bgn = 0; uint32 end = 0; uint32 *overlapLen = ovlStore->numOverlapsPerFrag(bgn, end); overlapIndex[0] = 0; for (uint32 iid=0; iid> 30); fprintf(stderr, " overlaps " F_U64 " GB (previous size)\n", (sizeof(ovOverlap) * numOvls) >> 30); fprintf(stderr, " overlaps " F_U64 " GB\n", (sizeof(ESToverlap) * numOvls) >> 30); ESToverlap *overlaps = NULL; memoryMappedFile *overlapsMMF = NULL; if (AS_UTL_fileExists(ovlCacheName, FALSE, FALSE)) { fprintf(stderr, " cache '%s' detected, load averted\n", ovlCacheName); overlapsMMF = new memoryMappedFile(ovlCacheName); overlaps = (ESToverlap *)overlapsMMF->get(0); } else { FILE *ESTcache = NULL; uint32 overlapblock = 100000000; ovOverlap *overlapsload = ovOverlap::allocateOverlaps(gkpStore, overlapblock); overlaps = new ESToverlap [numOvls]; if (ovlCacheName) { errno = 0; ESTcache = fopen(ovlCacheName, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", ovlCacheName, strerror(errno)), exit(1); } for (uint64 no=0; noreadOverlaps(overlapsload, overlapblock, false); for (uint32 xx=0; xx #include using namespace std; int main(int argc, char **argv) { char *gkpName = 0L; char *ovlName = 0L; char *tigName = 0L; uint32 tigVers = 0; uint32 errorRate = AS_OVS_encodeEvalue(0.015); char *outputPrefix = NULL; argc = AS_configure(argc, argv); uint32 iidMin = 
0; uint32 iidMax = UINT32_MAX; uint32 numReadsPer = 0; uint32 numPartitions = 128; bool trimToAlign = true; int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-T") == 0) { tigName = argv[++arg]; tigVers = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-o") == 0) { outputPrefix = argv[++arg]; } else if (strcmp(argv[arg], "-b") == 0) { iidMin = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-e") == 0) { iidMax = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-n") == 0) { numReadsPer = atoi(argv[++arg]); numPartitions = 0; } else if (strcmp(argv[arg], "-p") == 0) { numReadsPer = 0; numPartitions = atoi(argv[++arg]); } else { fprintf(stderr, "ERROR: unknown option '%s'\n", argv[arg]); err++; } arg++; } if (err) { exit(1); } if (gkpName == NULL || tigName == NULL) { exit(1); } // Open gkpStore. Pretty much the first thing we always do. gkStore *gkpStore = gkStore::gkStore_open(gkpName); // Open tigStore, check ranges. tgStore *tigStore = new tgStore(tigName, tigVers); uint32 nTigs = tigStore->numTigs(); if (nTigs <= iidMax) iidMax = nTigs - 1; // Count how many reads are referenced in tigs from iidMin to iidMax. These unitigs // are special in that the can contain duplicate reads, so there will be many more // reads referenced than the number of reads in gkpStore. uint64 nReadsInTigs = 0; for (uint32 ti=iidMin; ti<=iidMax; ti++) nReadsInTigs += tigStore->getNumChildren(ti); // Decide how many partitions there should be. Rather easy if the value is supplied, // but if not, compute it from the number of reads per partition. if (numReadsPer > 0) numPartitions = nReadsInTigs / numReadsPer + 1; fprintf(stderr, "Will partition " F_U64 " total child reads into " F_U32 " partitions.\n", nReadsInTigs, numPartitions); // Decide on a partitioning, based on total reads per tig. 
uint32 *tigToPart = new uint32 [nTigs]; uint32 *nReadsPerPart = new uint32 [numPartitions + 1]; memset(tigToPart, 0, sizeof(uint32) * (nTigs)); memset(nReadsPerPart, 0, sizeof(uint32) * (numPartitions + 1)); // Grab the number of reads per tig, again, but let us sort it to do a simple // greedy partitioning. vector > readsPerTig; for (uint32 ti=iidMin; ti<=iidMax; ti++) readsPerTig.push_back(pair(tigStore->getNumChildren(ti), ti)); sort(readsPerTig.rend(), readsPerTig.rbegin()); // Put the next unitig in the most empty partition. Definitely better algorithms exist... for (uint32 ii=0; iiloadTig(ti); if (tig == NULL) continue; if (tig->numberOfChildren() == 0) continue; uint32 pp = tigToPart[ti]; assert(pp > 0); if (partFile[pp] == NULL) { char name[FILENAME_MAX]; snprintf(name, FILENAME_MAX, "%s%04d", outputPrefix, pp); // Sync'd with canu/CorrectReads.pm errno = 0; partFile[pp] = fopen(name, "w"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", name, strerror(errno)), exit(1); } outputFalcon(gkpStore, tig, trimToAlign, partFile[pp], readData); } delete readData; for (uint32 pp=0; pp<=numPartitions; pp++) { if (partFile[pp] == NULL) continue; fprintf(partFile[pp], "- -\n"); fclose(partFile[pp]); } delete tigStore; delete [] partFile; delete [] tigToPart; delete [] nReadsPerPart; gkpStore->gkStore_close(); return(0); } canu-1.6/src/falcon_sense/createFalconSenseInputs.mk000066400000000000000000000010101314437614700226560ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := createFalconSenseInputs SOURCES := createFalconSenseInputs.C outputFalcon.C SRC_INCDIRS := .. 
../AS_UTL ../stores
TGT_LDFLAGS := -L${TARGET_DIR}
TGT_LDLIBS := -lcanu
TGT_PREREQS := libcanu.a
SUBMAKEFILES :=
canu-1.6/src/falcon_sense/falcon_sense.C

/******************************************************************************
 *
 * This file is part of canu, a software program that assembles whole-genome
 * sequencing reads into contigs.
 *
 * This software is based on:
 *  'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *  the 'kmer package' (http://kmer.sourceforge.net)
 * both originally distributed by Applera Corporation under the GNU General
 * Public License, version 2.
 *
 * Canu branched from Celera Assembler at its revision 4587.
 * Canu branched from the kmer project at its revision 1994.
 *
 * Modifications by:
 *
 *   Sergey Koren beginning on 2016-FEB-24
 *     are a 'United States Government Work', and
 *     are released in the public domain
 *
 *   Brian P. Walenz beginning on 2016-NOV-28
 *     are a 'United States Government Work', and
 *     are released in the public domain
 *
 * File 'README.licenses' in the root directory of this distribution contains
 * full conditions and disclaimers for each license.
*/ #include "AS_global.H" #include "gkStore.H" #include "splitToWords.H" #include "AS_UTL_fasta.H" #include "falcon.H" #ifndef BROKEN_CLANG_OpenMP #include #endif #include #include using namespace std; int main (int argc, char **argv) { uint32 threads = 0; uint32 min_cov = 4; uint32 min_len = 500; uint32 min_ovl_len = 500; double min_idy = 0.5; uint32 K = 8; uint32 max_read_len = AS_MAX_READLEN; argc = AS_configure(argc, argv); int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "--n_core") == 0) { threads = atoi(argv[++arg]); } else if (strcmp(argv[arg], "--min_cov") == 0) { min_cov = atoi(argv[++arg]); } else if (strcmp(argv[arg], "--min_idt") == 0) { min_idy = atof(argv[++arg]); } else if (strcmp(argv[arg], "--min_len") == 0) { min_len = atoi(argv[++arg]); } else if (strcmp(argv[arg], "--min_ovl_len") == 0) { min_ovl_len = atoi(argv[++arg]); } else if (strcmp(argv[arg], "--max_read_len") == 0) { max_read_len = atoi(argv[++arg]); if (max_read_len <= 0 || max_read_len > 2*AS_MAX_READLEN) { max_read_len = 2*AS_MAX_READLEN; } } else { fprintf(stderr, "%s: Unknown option '%s'\n", argv[0], argv[arg]); err++; } arg++; } if (err) { fprintf(stderr, "Invalid usage"); exit(1); } if (threads > 0) { omp_set_num_threads(threads); } else { omp_set_num_threads(omp_get_max_threads()); } // read in a loop and get consensus of each read vector seqs; string seed; char *A = new char[AS_MAX_READLEN * 2]; fgets(A, AS_MAX_READLEN * 2, stdin); while (!feof(stdin)) { splitToWords W(A); if (W[0][0] == '+') { uint32 splitSeqID = 0; FConsensus::consensus_data *consensus_data_ptr = FConsensus::generate_consensus( seqs, min_cov, K, min_idy, min_ovl_len, max_read_len ); #ifdef TRACK_POSITIONS //const std::string& sequenceToCorrect = seqs.at(0); char * originalStringPointer = consensus_data_ptr->sequence; #endif char * split = strtok(consensus_data_ptr->sequence, "acgt"); while (split != NULL) { if (strlen(split) > min_len) { AS_UTL_writeFastA(stdout, split, strlen(split), 60, 
">%s_%d\n", seed.c_str(), splitSeqID); splitSeqID++; #ifdef TRACK_POSITIONS int distance_from_beginning = split - originalStringPointer; std::vector relevantOriginalPositions(consensus_data_ptr->originalPos.begin() + distance_from_beginning, consensus_data_ptr->originalPos.begin() + distance_from_beginning + strlen(split)); int firstRelevantPosition = relevantOriginalPositions.front(); int lastRelevantPosition = relevantOriginalPositions.back(); std::string relevantOriginalTemplate = seqs.at(0).substr(firstRelevantPosition, lastRelevantPosition - firstRelevantPosition + 1); // store relevantOriginalTemplate along with corrected read - not implemented #endif } split = strtok(NULL, "acgt"); } FConsensus::free_consensus_data( consensus_data_ptr ); seqs.clear(); seed.clear(); } else if (W[0][0] == '-') { break; } else { if (seed.length() == 0) { seed = W[0]; seqs.push_back(string(W[1])); } else if (strlen(W[1]) > min_ovl_len) { seqs.push_back(string(W[1])); } } fgets(A, AS_MAX_READLEN * 2, stdin); } delete[] A; } canu-1.6/src/falcon_sense/falcon_sense.mk000066400000000000000000000010101314437614700205260ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := falcon_sense SOURCES := falcon_sense.C SRC_INCDIRS := .. 
../AS_UTL ../stores ../overlapInCore/libedlib libfalcon
TGT_LDFLAGS := -L${TARGET_DIR}
TGT_LDLIBS := -lcanu
TGT_PREREQS := libcanu.a
SUBMAKEFILES :=
canu-1.6/src/falcon_sense/libfalcon/
canu-1.6/src/falcon_sense/libfalcon/falcon.C

/******************************************************************************
 *
 * This file is part of canu, a software program that assembles whole-genome
 * sequencing reads into contigs.
 *
 * This software is based on:
 *  'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *  the 'kmer package' (http://kmer.sourceforge.net)
 * both originally distributed by Applera Corporation under the GNU General
 * Public License, version 2.
 *
 * Canu branched from Celera Assembler at its revision 4587.
 * Canu branched from the kmer project at its revision 1994.
 *
 * Modifications by:
 *
 *   Sergey Koren beginning on 2016-FEB-24
 *     are a 'United States Government Work', and
 *     are released in the public domain
 *
 * File 'README.licenses' in the root directory of this distribution contains
 * full conditions and disclaimers for each license.
 */

/*
 * =====================================================================================
 *
 *       Filename:  fastcon.c
 *
 *    Description:
 *
 *        Version:  0.1
 *        Created:  07/20/2013 17:00:00
 *       Revision:  none
 *       Compiler:  gcc
 *
 *         Author:  Jason Chin,
 *        Company:
 *
 * =====================================================================================
 #################################################################################$$
 # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
 #
 # All rights reserved.
# # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. 
#################################################################################$$ */ #include "falcon.H" #include "edlib.H" #include #include #include #include #include #include #include namespace FConsensus { #undef DEBUG typedef int32_t seq_coor_t; typedef struct { seq_coor_t s1; seq_coor_t e1; seq_coor_t s2; seq_coor_t e2; long int score; } aln_range; typedef struct { seq_coor_t t_pos; uint16 delta; char q_base; seq_coor_t p_t_pos; // the tag position of the previous base uint16 p_delta; // the tag delta of the previous base char p_q_base; // the previous base uint32 q_id; } align_tag_t; typedef struct { seq_coor_t len; align_tag_t * align_tags; } align_tags_t; typedef struct { uint16 size; uint16 n_link; seq_coor_t * p_t_pos; // the tag position of the previous base uint16 * p_delta; // the tag delta of the previous base char * p_q_base; // the previous base uint16 * link_count; uint16 count; seq_coor_t best_p_t_pos; uint16 best_p_delta; uint16 best_p_q_base; // encoded base double score; } align_tag_col_t; typedef struct { align_tag_col_t * base; } msa_base_group_t; typedef struct { uint16 size; uint16 max_delta; msa_base_group_t * delta; } msa_delta_group_t; typedef msa_delta_group_t * msa_pos_t; align_tags_t * get_align_tags( char * aln_q_seq, char * aln_t_seq, seq_coor_t aln_seq_len, aln_range * range, uint32 q_id, seq_coor_t t_offset, int a_len, int b_len) { char p_q_base; align_tags_t * tags; seq_coor_t i, j, jj, k, p_j, p_jj; tags = (align_tags_t *)calloc( 1, sizeof(align_tags_t) ); tags->len = aln_seq_len; tags->align_tags = (align_tag_t *)calloc( aln_seq_len + 1, sizeof(align_tag_t) ); i = range->s1 - 1; j = range->s2 - 1; jj = 0; p_j = -1; p_jj = 0; p_q_base = '.'; for (k = 0; k < aln_seq_len; k++) { if (aln_q_seq[k] != '-') { i ++; jj ++; } if (aln_t_seq[k] != '-') { j ++; jj = 0; } assert (i >= 0 && i < a_len && j >=0 && j < b_len); #ifdef DEBUG fprintf(stderr, "t %d %d %d %c %c\n", q_id, j, jj, aln_t_seq[k], aln_q_seq[k]); #endif if ( j + 
t_offset >= 0 && jj < uint16MAX && p_jj < uint16MAX) { (tags->align_tags[k]).t_pos = j + t_offset; (tags->align_tags[k]).delta = jj; (tags->align_tags[k]).p_t_pos = p_j + t_offset; (tags->align_tags[k]).p_delta = p_jj; (tags->align_tags[k]).p_q_base = p_q_base; (tags->align_tags[k]).q_base = aln_q_seq[k]; (tags->align_tags[k]).q_id = q_id; p_j = j; p_jj = jj; p_q_base = aln_q_seq[k]; } } // sentinal at the end //k = aln_seq_len; tags->len = k; (tags->align_tags[k]).t_pos = uint32MAX; (tags->align_tags[k]).delta = uint16MAX; (tags->align_tags[k]).q_base = '.'; (tags->align_tags[k]).q_id = uint32MAX; return tags; } void free_align_tags( align_tags_t * tags) { free( tags->align_tags ); free( tags ); } void allocate_aln_col( align_tag_col_t * col) { col->p_t_pos = ( seq_coor_t * ) calloc(col->size, sizeof( seq_coor_t )); col->p_delta = ( uint16 * ) calloc(col->size, sizeof( uint16 )); col->p_q_base = ( char * )calloc(col->size, sizeof( char )); col->link_count = ( uint16 * ) calloc(col->size, sizeof( uint16 )); } void realloc_aln_col( align_tag_col_t * col ) { col->p_t_pos = (seq_coor_t *) realloc( col->p_t_pos, (col->size) * sizeof( seq_coor_t )); col->p_delta = ( uint16 *) realloc( col->p_delta, (col->size) * sizeof( uint16 )); col->p_q_base = (char *) realloc( col->p_q_base, (col->size) * sizeof( char )); col->link_count = ( uint16 *) realloc( col->link_count, (col->size) * sizeof( uint16 )); } void free_aln_col( align_tag_col_t * col) { free(col->p_t_pos); free(col->p_delta); free(col->p_q_base); free(col->link_count); } void allocate_delta_group( msa_delta_group_t * g) { int i,j; g->max_delta = 0; g->delta = (msa_base_group_t *) calloc( g->size, sizeof(msa_base_group_t)); for (i = 0; i< g->size; i++) { g->delta[i].base = ( align_tag_col_t * ) calloc( 5, sizeof(align_tag_col_t ) ); for (j = 0; j < 5; j++ ) { g->delta[i].base[j].size = 8; allocate_aln_col(&(g->delta[i].base[j])); } } } void realloc_delta_group( msa_delta_group_t * g, uint16 new_size ) { int i, j, 
bs, es; bs = g->size; es = new_size; g->delta = (msa_base_group_t *) realloc(g->delta, new_size * sizeof(msa_base_group_t)); for (i=bs; i < es; i++) { g->delta[i].base = ( align_tag_col_t *) calloc( 5, sizeof(align_tag_col_t ) ); for (j = 0; j < 5; j++ ) { g->delta[i].base[j].size = 8; allocate_aln_col(&(g->delta[i].base[j])); } } g->size = new_size; } void free_delta_group( msa_delta_group_t * g) { //manything to do here int i, j; for (i = 0; i < g->size; i++) { for (j = 0; j < 5; j++) { free_aln_col( &(g->delta[i].base[j]) ); } free(g->delta[i].base); } free(g->delta); } void update_col( align_tag_col_t * col, seq_coor_t p_t_pos, uint16 p_delta, char p_q_base) { int updated = 0; int kk; col->count += 1; for (kk = 0; kk < col->n_link; kk++) { if ( p_t_pos == col->p_t_pos[kk] && p_delta == col->p_delta[kk] && p_q_base == col->p_q_base[kk] ) { col->link_count[kk] ++; updated = 1; break; } } if (updated == 0) { if (col->n_link + 1 > col->size) { if (col->size < (uint16MAX >> 1)-1) { col->size *= 2; } else { col->size += 256; } assert( col->size < uint16MAX-1 ); realloc_aln_col(col); } kk = col->n_link; col->p_t_pos[kk] = p_t_pos; col->p_delta[kk] = p_delta; col->p_q_base[kk] = p_q_base; col->link_count[kk] = 1; col->n_link++; } } msa_pos_t * get_msa_working_sapce(uint32 max_t_len) { msa_pos_t * msa_array; uint32 i; msa_array = (msa_pos_t *)calloc(max_t_len, sizeof(msa_pos_t)); for (i = 0; i < max_t_len; i++) { msa_array[i] = (msa_delta_group_t *)calloc(1, sizeof(msa_delta_group_t)); msa_array[i]->size = 8; allocate_delta_group(msa_array[i]); } return msa_array; } void clean_msa_working_space( msa_pos_t * msa_array, uint32 max_t_len) { uint32 i,j,k; align_tag_col_t * col; for (i = 0; i < max_t_len; i++) { for (j =0; j < msa_array[i]->max_delta + 1; j++) { for (k = 0; k < 5; k++ ) { col = msa_array[i]->delta[j].base + k; /* for (c =0; c < col->size; c++) { col->p_t_pos[c] = 0; col->p_delta[c] = 0; col->p_q_base[c] = 0; col->link_count[c] =0; } */ col->n_link = 0; 
col->count = 0; col->best_p_t_pos = -1; col->best_p_delta = -1; col->best_p_q_base = -1; col->score = 0; } } msa_array[i]->max_delta = 0; } } consensus_data * get_cns_from_align_tags( align_tags_t ** tag_seqs, uint32 n_tag_seqs, uint32 t_len, uint32 min_cov, uint32 max_len ) { seq_coor_t i,j; seq_coor_t t_pos = 0; seq_coor_t t_count = 0; uint32 * coverage; consensus_data * consensus; align_tag_t * c_tag; static msa_pos_t * msa_array = NULL; // figure out true t_len and compact, we might have blank spaces for unaligned sequences for (i = 0; i < n_tag_seqs; i++) if (tag_seqs[i] != NULL) tag_seqs[t_count++] = tag_seqs[i]; // null out the remainder for (i = t_count; i < n_tag_seqs; i++) tag_seqs[i] = NULL; n_tag_seqs = t_count; if (n_tag_seqs == 0) { // allocate an empty consensus sequence consensus = (consensus_data *)calloc( 1, sizeof(consensus_data) ); consensus->sequence = (char *)calloc( 1, sizeof(char) ); consensus->eqv = (int32 *)calloc( 1, sizeof(int32) ); return consensus; } coverage = (uint32 *)calloc( t_len, sizeof(uint32) ); if ( msa_array == NULL) { msa_array = get_msa_working_sapce( max_len ); clean_msa_working_space(msa_array, max_len); } assert(t_len < max_len); // loop through every alignment #ifdef DEBUG fprintf(stderr, "XX %d\n", n_tag_seqs); #endif for (i = 0; i < n_tag_seqs; i++) { // for each alignment position, insert the alignment tag to msa_array for (j = 0; j < tag_seqs[i]->len; j++) { c_tag = tag_seqs[i]->align_tags + j; uint32 delta; delta = c_tag->delta; if (delta == 0) { t_pos = c_tag->t_pos; coverage[ t_pos ] ++; } #ifdef DEBUG fprintf(stderr, "Processing position %d in sequence %d (in msa it is column %d with cov %d) with delta %d and current size is %d\n", j, i, t_pos, coverage[t_pos], delta, msa_array[t_pos]->size); #endif // Assume t_pos was set on earlier iteration. // (Otherwise, use its initial value, which might be an error. 
~cd) assert(delta < uint16MAX); if (delta > msa_array[t_pos]->max_delta) { msa_array[t_pos]->max_delta = delta; if (msa_array[t_pos]->max_delta + 4 > msa_array[t_pos]->size ) { realloc_delta_group(msa_array[t_pos], msa_array[t_pos]->max_delta + 8); } } uint32 base = -1; switch (c_tag->q_base) { case 'A': base = 0; break; case 'C': base = 1; break; case 'G': base = 2; break; case 'T': base = 3; break; case '-': base = 4; break; default : base = 4; break; } // Note: On bad input, base may be -1. assert(c_tag->p_t_pos >= 0 || j == 0); update_col( &(msa_array[t_pos]->delta[delta].base[base]), c_tag->p_t_pos, c_tag->p_delta, c_tag->p_q_base); #ifdef DEBUG fprintf(stderr, "Updating column from seq %d at position %d in column %d base pos %d base %d to be %c and max is %d\n", i, j, t_pos, base, c_tag->p_t_pos, c_tag->p_q_base, msa_array[t_pos]->max_delta); #endif } } // propogate score throught the alignment links, setup backtracking information align_tag_col_t * g_best_aln_col = 0; uint32 g_best_ck = 0; seq_coor_t g_best_t_pos = 0; { int kk; int ck; // char base; int best_i; int best_j; int best_b; int best_ck = -1; double score; double best_score; double g_best_score; // char best_mark; align_tag_col_t * aln_col; g_best_score = -1; for (i = 0; i < t_len; i++) { //loop through every template base #ifdef DEBUG fprintf(stderr, "max delta: %d %d\n", i, msa_array[i]->max_delta); #endif for (j = 0; j <= msa_array[i]->max_delta; j++) { // loop through every delta position for (kk = 0; kk < 5; kk++) { // loop through diff bases of the same delta posiiton /* switch (kk) { case 0: base = 'A'; break; case 1: base = 'C'; break; case 2: base = 'G'; break; case 3: base = 'T'; break; case 4: base = '-'; break; } */ aln_col = msa_array[i]->delta[j].base + kk; best_score = -1; best_i = -1; best_j = -1; best_b = -1; #ifdef DEBUG fprintf(stderr, "Processing consensus template %d which as %d delta and on base %d i pulled up col %d with %d links and best %d %d %d\n", i, j, kk, aln_col, 
aln_col->n_link, aln_col->best_p_t_pos, aln_col->best_p_delta, aln_col->best_p_q_base); #endif for (ck = 0; ck < aln_col->n_link; ck++) { // loop through differnt link to previous column int pi; int pj; int pkk; pi = aln_col->p_t_pos[ck]; pj = aln_col->p_delta[ck]; switch (aln_col->p_q_base[ck]) { case 'A': pkk = 0; break; case 'C': pkk = 1; break; case 'G': pkk = 2; break; case 'T': pkk = 3; break; case '-': pkk = 4; break; default : pkk = 4; break; } if (aln_col->p_t_pos[ck] == -1) { score = (double) aln_col->link_count[ck] - (double) coverage[i] * 0.5; } else if (pj > msa_array[pi]->max_delta) { score = (double) aln_col->link_count[ck] - (double) coverage[i] * 0.5; } else { score = msa_array[pi]->delta[pj].base[pkk].score + (double) aln_col->link_count[ck] - (double) coverage[i] * 0.5; } // best_mark = ' '; if (score > best_score) { best_score = score; aln_col->best_p_t_pos = best_i = pi; aln_col->best_p_delta = best_j = pj; aln_col->best_p_q_base = best_b = pkk; best_ck = ck; // best_mark = '*'; } #ifdef DEBUG fprintf(stderr, "X %d %d %d %d %d %d %c %d %lf\n", coverage[i], i, j, aln_col->count, aln_col->p_t_pos[ck], aln_col->p_delta[ck], aln_col->p_q_base[ck], aln_col->link_count[ck], score); #endif } aln_col->score = best_score; if (best_score > g_best_score) { g_best_score = best_score; g_best_aln_col = aln_col; g_best_ck = best_ck; g_best_t_pos = i; #ifdef DEBUG fprintf(stderr, "GB %d %d %d %d\n", i, j, ck, g_best_aln_col); #endif } } } } assert(g_best_score != -1); } // reconstruct the sequences uint32 index; char bb = '$'; int ck; char * cns_str; int * eqv; double score0; consensus = (consensus_data *)calloc( 1, sizeof(consensus_data) ); consensus->sequence = (char *)calloc( t_len * 2 + 1, sizeof(char) ); consensus->eqv = (int32 *)calloc( t_len * 2 + 1, sizeof(int32) ); cns_str = consensus->sequence; eqv = consensus->eqv; #ifdef TRACK_POSITIONS consensus->originalPos.reserve(t_len * 2 + 1); // This is an over-generous pre-allocation #endif index = 0; ck = 
g_best_ck; i = g_best_t_pos; while (1) { #ifdef TRACK_POSITIONS int originalI = i; #endif if (coverage[i] > min_cov) { switch (ck) { case 0: bb = 'A'; break; case 1: bb = 'C'; break; case 2: bb = 'G'; break; case 3: bb = 'T'; break; case 4: bb = '-'; break; } } else { switch (ck) { case 0: bb = 'a'; break; case 1: bb = 'c'; break; case 2: bb = 'g'; break; case 3: bb = 't'; break; case 4: bb = '-'; break; } } // Note: On bad input, bb will keep previous value, possibly '$'. score0 = g_best_aln_col->score; i = g_best_aln_col->best_p_t_pos; if (i == -1 || index >= t_len * 2) break; j = g_best_aln_col->best_p_delta; ck = g_best_aln_col->best_p_q_base; g_best_aln_col = msa_array[i]->delta[j].base + ck; if (bb != '-') { cns_str[index] = bb; eqv[index] = (int) score0 - (int) g_best_aln_col->score; #ifdef DEBUG fprintf(stderr, "C %d %d %c %lf %d %d\n", i, index, bb, g_best_aln_col->score, coverage[i], eqv[index] ); #endif index ++; #ifdef TRACK_POSITIONS consensus->originalPos.push_back(originalI); #endif } } // reverse the sequence #ifdef TRACK_POSITIONS std::reverse(consensus->originalPos.begin(), consensus->originalPos.end()); #endif for (i = 0; i < index/2; i++) { cns_str[i] = cns_str[i] ^ cns_str[index-i-1]; cns_str[index-i-1] = cns_str[i] ^ cns_str[index-i-1]; cns_str[i] = cns_str[i] ^ cns_str[index-i-1]; eqv[i] = eqv[i] ^ eqv[index-i-1]; eqv[index-i-1] = eqv[i] ^ eqv[index-i-1]; eqv[i] = eqv[i] ^ eqv[index-i-1]; } cns_str[index] = 0; //printf("%s\n", cns_str); clean_msa_working_space(msa_array, t_len+1); free(coverage); return consensus; } consensus_data * generate_consensus( vector input_seq, uint32 min_cov, uint32 K, double min_idt, uint32 min_len, uint32 max_len) { uint32 seq_count; align_tags_t ** tags_list; consensus_data * consensus; double max_diff; max_diff = 1.0 - min_idt; seq_count = input_seq.size(); fflush(stdout); tags_list = (align_tags_t **)calloc( seq_count, sizeof(align_tags_t*) ); #pragma omp parallel for schedule(dynamic) for (uint32 j=0; j < 
seq_count; j++) { // if the current sequence is too long, truncate it to be shorter if (input_seq[j].size() > input_seq[0].size()) { input_seq[j].resize(input_seq[0].size()); } int tolerance = (int)ceil((double)min(input_seq[j].length(), input_seq[0].length())*max_diff*1.1); EdlibAlignResult align = edlibAlign(input_seq[j].c_str(), input_seq[j].size()-1, input_seq[0].c_str(), input_seq[0].size()-1, edlibNewAlignConfig(tolerance, EDLIB_MODE_HW, EDLIB_TASK_PATH)); if (align.numLocations >= 1 && align.endLocations[0] - align.startLocations[0] > min_len && ((float)align.editDistance / (align.endLocations[0]-align.startLocations[0]) < max_diff)) { aln_range arange; arange.s1 = 0; arange.e1 = input_seq[j].length()-1; arange.s2 = align.startLocations[0]; arange.e2 = align.endLocations[0]; #ifdef DEBUG fprintf(stderr, "Found alignment for seq %d from %d - %d to %d - %d the dist %d length %d\n", j, arange.s1, arange.e1, arange.s2, arange.e2, align.editDistance, align.alignmentLength); #endif // convert edlib to expected char *tgt_aln_str = (char *)calloc( align.alignmentLength+1, sizeof(char) ); char *qry_aln_str = (char *)calloc( align.alignmentLength+1, sizeof(char) ); edlibAlignmentToStrings(align.alignment, align.alignmentLength, arange.s2, arange.e2+1, arange.s1, arange.e1, input_seq[0].c_str(), input_seq[j].c_str(), tgt_aln_str, qry_aln_str); // strip leading/trailing gaps on target uint32_t first_pos = 0; for (int i = 0; i < align.alignmentLength; i++) { if (tgt_aln_str[i] != '-') { first_pos=i; break; } } uint32_t last_pos = align.alignmentLength; for (int i = align.alignmentLength-1; i >= 0; i--) { if (tgt_aln_str[i] != '-') { last_pos=i+1; break; } } arange.s1+= first_pos; arange.e1-= (align.alignmentLength-last_pos); arange.e2++; qry_aln_str[last_pos]='\0'; tgt_aln_str[last_pos]='\0'; #ifdef DEBUG fprintf(stderr, "Final positions to be %d %d for str %d and %d %d for str %d adjst %d %d %d\n", arange.s1, arange.e1, input_seq[j].length(), arange.s2, arange.e2, 
input_seq[0].length(), first_pos, last_pos, last_pos-first_pos); fprintf(stderr, "Tgt string is %s %d\n", tgt_aln_str+first_pos, strlen(tgt_aln_str+first_pos)); fprintf(stderr, "Qry string is %s %d\n", qry_aln_str+first_pos, strlen(qry_aln_str+first_pos)); #endif assert(arange.s1 >= 0 && arange.s2 >= 0 && arange.e1 <= input_seq[j].length() && arange.e2 <= input_seq[0].length()); tags_list[j] = get_align_tags(qry_aln_str+first_pos, tgt_aln_str+first_pos, last_pos-first_pos, &arange, j, 0, input_seq[j].length(), input_seq[0].length()); free(tgt_aln_str); free(qry_aln_str); } edlibFreeAlignResult(align); } consensus = get_cns_from_align_tags( tags_list, seq_count, input_seq[0].length(), min_cov, max_len); for (int j=0; j < seq_count; j++) if (tags_list[j] != NULL) free_align_tags(tags_list[j]); free(tags_list); return consensus; } void free_consensus_data( consensus_data * consensus ){ free(consensus->sequence); free(consensus->eqv); free(consensus); } } canu-1.6/src/falcon_sense/libfalcon/falcon.H000066400000000000000000000074611314437614700210620ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Sergey Koren beginning on 2016-FEB-24 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
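The backtracking loop in `get_cns_from_align_tags()` above builds `cns_str` and `eqv` back-to-front and then reverses both in place with the XOR-swap idiom (no temporary variable). A minimal standalone sketch of that reversal; `reverse_xor` is a hypothetical name for illustration, not a canu function:

```cpp
#include <cassert>
#include <cstring>

// In-place reversal via XOR swap, as used on cns_str/eqv above.
// XOR swap is only safe when the two slots are distinct; the i < len/2
// bound guarantees that (the middle element of an odd-length array is
// never swapped with itself).
static void reverse_xor(char *s, int len) {
  for (int i = 0; i < len / 2; i++) {
    s[i]           = s[i] ^ s[len - i - 1];
    s[len - i - 1] = s[i] ^ s[len - i - 1];
    s[i]           = s[i] ^ s[len - i - 1];
  }
}
```

A plain `std::swap` would be clearer and usually no slower; the XOR form merely mirrors what the source does.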
*/ /* * ===================================================================================== * * Filename: common.h * * Description: Common delclaration for the code base * * Version: 0.1 * Created: 07/16/2013 07:46:23 AM * Revision: none * Compiler: gcc * * Author: Jason Chin, * Company: * * ===================================================================================== #################################################################################$$ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc. # # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. 
IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. #################################################################################$$ */ #include "AS_global.H" #include #include #include #include using namespace std; namespace FConsensus { typedef struct { char * sequence; int32 * eqv; #ifdef TRACK_POSITIONS vector originalPos; // For tracking original read positions in the corrected read #endif } consensus_data; consensus_data * generate_consensus( vector input_seq, uint32 min_cov, uint32 K, double min_idt, uint32 min_len, uint32 max_len); void free_consensus_data(consensus_data *); } canu-1.6/src/falcon_sense/outputFalcon.C000066400000000000000000000051101314437614700203320ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-APR-20 to 2015-MAY-20 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "outputFalcon.H" #include "AS_UTL_reverseComplement.H" // The falcon consensus format: // // name sequence // read sequence // read sequence // read sequence // read sequence // + + # generate consensus for the 'name' sequence using 'read' sequences // ... // - - # To end processing // void outputFalcon(gkStore *gkpStore, tgTig *tig, bool trimToAlign, FILE *F, gkReadData *readData) { gkpStore->gkStore_loadReadData(tig->tigID(), readData); fprintf(F, "read" F_U32 " %s\n", tig->tigID(), readData->gkReadData_getSequence()); for (uint32 cc=0; ccnumberOfChildren(); cc++) { tgPosition *child = tig->getChild(cc); gkpStore->gkStore_loadReadData(child->ident(), readData); if (child->isReverse()) reverseComplementSequence(readData->gkReadData_getSequence(), readData->gkReadData_getRead()->gkRead_sequenceLength()); // For debugging/testing, skip one orientation of overlap. // //if (child->isReverse() == false) // continue; //if (child->isReverse() == true) // continue; // Trim the read to the aligned bit char *seq = readData->gkReadData_getSequence(); if (trimToAlign) { seq += child->_askip; seq[ readData->gkReadData_getRead()->gkRead_sequenceLength() - child->_askip - child->_bskip ] = 0; } fprintf(F, "data" F_U32 " %s\n", tig->getChild(cc)->ident(), seq); } fprintf(F, "+ +\n"); } canu-1.6/src/falcon_sense/outputFalcon.H000066400000000000000000000023251314437614700203440ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
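The comment block in `outputFalcon.C` above documents the falcon consensus input format: the template read first, then each evidence read, then `+ +` to trigger consensus for that template, with a final `- -` ending the stream. A sketch of a writer for one such block, assuming hypothetical names (`writeFalconBlock` and its parameters are mine, not canu API; the real `outputFalcon()` pulls sequences from a `gkStore`):

```cpp
#include <cstdio>
#include <cstring>
#include <cassert>
#include <string>
#include <vector>

// Emit one falcon input block: template line, evidence lines, then "+ +".
// The caller writes the terminating "- -" once, after all blocks.
void writeFalconBlock(FILE *F, const std::string &tmplSeq,
                      const std::vector<std::string> &evidence, unsigned id) {
  fprintf(F, "read%u %s\n", id, tmplSeq.c_str());
  for (size_t i = 0; i < evidence.size(); i++)
    fprintf(F, "data%u %s\n", (unsigned)i, evidence[i].c_str());
  fprintf(F, "+ +\n");
}
```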
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2015-APR-20 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef OUTPUT_FALCON_H #define OUTPUT_FALCON_H #include "AS_global.H" #include "gkStore.H" #include "tgStore.H" void outputFalcon(gkStore *gkpStore, tgTig *tig, bool trimToAlign, FILE *F, gkReadData *readData); #endif // OUTPUT_FALCON_H canu-1.6/src/fastq-utilities/000077500000000000000000000000001314437614700162365ustar00rootroot00000000000000canu-1.6/src/fastq-utilities/build-ordered-reads.sh000066400000000000000000000237771314437614700224270ustar00rootroot00000000000000#!/bin/sh # # Given a set of trimmed reads, map them to reference, filter out any that do not # map completely, and generate an output set of reads that is ordered and oriented. # if [ -d "/work/scripts" ] ; then bin=/work/wgspb/FreeBSD-amd64/bin scp=/work/pacbio-overlapper/scripts export PATH=${PATH}:/work/scripts fi if [ -d "/usr/local/projects/CELERA/bri/scripts" ] ; then bin=/usr/local/projects/CELERA/bri/wgspb/Linux-amd64/bin scp=/usr/local/projects/CELERA/bri/pacbio-assembler/scripts export PATH=${PATH}:/usr/local/projects/CELERA/bri/scripts fi if [ -z $bin ] ; then echo scripts not found. exit fi if [ -z $1 ] ; then echo usage: $0 work-directory reads.fastq reference.fasta exit fi wrk=$1 # Working directory name pfx="test" # Prefix of the files inp=$2 # Path to input FASTQ reads ref=$3 # Path to input FASTA reference if [ ! 
-d $wrk ] ; then mkdir $wrk fi if [ ! -e $inp ] ; then echo "Failed to find input FASTQ reads in '$inp'" exit fi if [ ! -e $ref ] ; then echo "" echo "WARNING: no reference, will not build ordered assembly." echo fi if [ ! -e "$wrk/build-ordered-reads.spec" ] ; then echo $wrk/build-ordered-reads.spec cat > $wrk/build-ordered-reads.spec \ < $wrk/BO/$pfx.frg echo gatekeeper -T -F -o $wrk/BO/$pfx.gkpStore $wrk/BO/$pfx.frg $bin/gatekeeper -T -F -o $wrk/BO/$pfx.gkpStore $wrk/BO/$pfx.frg fi if [ ! -e "$wrk/BO/$pfx.fastq" ] ; then echo gatekeeper -dumpfastq $wrk/BO/$pfx $wrk/BO/$pfx.gkpStore $bin/gatekeeper -dumpfastq $wrk/BO/$pfx $wrk/BO/$pfx.gkpStore awk '{ print $1 }' < $wrk/BO/$pfx.unmated.fastq > $wrk/BO/$pfx.fastq rm -f $wrk/BO/$pfx.unmated.fastq rm -f $wrk/BO/$pfx.paired.fastq rm -f $wrk/BO/$pfx.1.fastq rm -f $wrk/BO/$pfx.2.fastq echo replaceUIDwithName $wrk/BO/$pfx.gkpStore.fastqUIDmap $wrk/BO/$pfx.fastq $bin/replaceUIDwithName $wrk/BO/$pfx.gkpStore.fastqUIDmap $wrk/BO/$pfx.fastq fi # # OLD blasr options # # -maxLCPLength 15 # -nCandidates 25 # -maxScore -500 # if [ -e $ref ] ; then if [ ! -e "$wrk/BO/$pfx.blasr.badnm.sam" ] ; then echo \ blasr \ -noSplitSubreads \ -nproc 4 \ -minMatch 12 \ -bestn 10 \ -nCandidates 25 \ -minPctIdentity 65.0 \ -sam -clipping soft \ -out $wrk/BO/$pfx.blasr.badnm.sam \ $wrk/BO/$pfx.fastq \ $ref blasr \ -noSplitSubreads \ -nproc 4 \ -minMatch 12 \ -bestn 10 \ -minPctIdentity 65.0 \ -sam -clipping soft \ -out $wrk/BO/$pfx.blasr.badnm.sam \ $wrk/BO/$pfx.fastq \ $ref fi if [ ! -e "$wrk/BO/$pfx.blasr.sam" ] ; then echo a echo \ samtools calmd \ -S $wrk/BO/$pfx.blasr.badnm.sam \ $ref \ \> $wrk/BO/$pfx.blasr.sam samtools calmd \ -S $wrk/BO/$pfx.blasr.badnm.sam \ $ref \ > $wrk/BO/$pfx.blasr.sam fi if [ ! -e "$wrk/BO/$pfx.blasr.sam.coords" ] ; then echo bowtie2-to-nucmercoords.pl $ref \< $wrk/BO/$pfx.blasr.sam \> $wrk/BO/$pfx.blasr.sam.coords bowtie2-to-nucmercoords.pl $ref < $wrk/BO/$pfx.blasr.sam > $wrk/BO/$pfx.blasr.sam.coords fi if [ ! 
-e "$wrk/BO/$pfx.blasr.sam.coords.ova" ] ; then echo infer-olaps-from-coords perl $scp/infer-olaps-from-genomic-coords.pl \ $wrk/BO/$pfx.blasr.sam.coords \ $wrk/BO/$pfx.fastq \ $wrk/BO/$pfx.blasr.sam.coords \ $wrk/BO/$pfx.gkpStore.fastqUIDmap fi fi ######################################## # # Start an assembly using the blasr overlaps for the perfectly mapping reads. # # We need to rebuild the overlaps to get the IIDs correct for this assembly. # If we were to just use the BO/gkpStore and BO/ovlStore, all the reads that # don't map perfectly end up as singletons. # # For this to work, it is critical that the reads have their original names, NOT UIDs. # # NOTE! Reading coords and lengths from the BL work above, and using reads from the BO work here. # if [ -e $wrk/BO/$pfx.blasr.sam.coords.mapped.ordered.fastq ] ; then if [ ! -d $wrk/BL ] ; then mkdir $wrk/BL fi if [ ! -d "$wrk/BL/$pfx.gkpStore" ] ; then $bin/fastqToCA \ -libraryname L \ -technology sanger \ -type sanger \ -reads $wrk/BO/$pfx.blasr.sam.coords.mapped.ordered.fastq \ > $wrk/BL/$pfx.frg $bin/gatekeeper -T -F \ -o $wrk/BL/$pfx.gkpStore \ $wrk/BL/$pfx.frg fi if [ ! -e "$wrk/BL/$pfx.blasr.sam.coords.ova" ] ; then perl $scp/infer-olaps-from-genomic-coords.pl \ $wrk/BL/$pfx.blasr.sam.coords \ $wrk/BO/$pfx.fastq \ $wrk/BO/$pfx.blasr.sam.coords \ $wrk/BL/$pfx.gkpStore.fastqUIDmap fi if [ ! -d "$wrk/BL/$pfx.ovlStore" ] ; then $bin/convertOverlap -ovl \ < $wrk/BL/$pfx.blasr.sam.coords.ova \ > $wrk/BL/$pfx.blasr.sam.coords.ovb $bin/overlapStoreBuild \ -o $wrk/BL/$pfx.ovlStore \ -g $wrk/BL/$pfx.gkpStore \ -F 1 \ $wrk/BL/$pfx.blasr.sam.coords.ovb fi if [ ! -d "$wrk/BL/$pfx.tigStore" ] ; then perl $bin/runCA -p $pfx -d $wrk/BL -s $wrk/build-ordered-reads.spec \ useGrid=0 scriptOnGrid=0 \ stopAfter=unitigger \ $wrk/BL/$pfx.frg else if [ ! 
-e "$wrk/BL/$pfx.qc" ] ; then perl $bin/runCA -p $pfx -d $wrk/BL -s $wrk/build-ordered-reads.spec \ useGrid=0 scriptOnGrid=0 \ $wrk/BL/$pfx.frg fi fi fi ######################################## # # Assembly with all reads. # if [ ! -d $wrk/CA ] ; then mkdir $wrk/CA fi if [ ! -d "$wrk/CA/$pfx.frg" ] ; then $bin/fastqToCA \ -libraryname L \ -technology sanger \ -type sanger \ -reads $inp \ > $wrk/CA/$pfx.frg fi if [ ! -d "$wrk/CA/$pfx.tigStore" ] ; then perl $bin/runCA -p $pfx -d $wrk/CA -s $wrk/build-ordered-reads.spec \ useGrid=0 scriptOnGrid=0 \ stopAfter=unitigger \ $wrk/CA/$pfx.frg else if [ ! -e "$wrk/CA/$pfx.qc" ] ; then perl $bin/runCA -p $pfx -d $wrk/CA -s $wrk/build-ordered-reads.spec \ useGrid=0 scriptOnGrid=0 \ $wrk/CA/$pfx.frg fi fi ######################################## # # Assembly with all reads. # if [ ! -d $wrk/CAcorrected ] ; then mkdir $wrk/CAcorrected fi if [ ! -d "$wrk/CAcorrected/$pfx.frg" ] ; then $bin/fastqToCA \ -libraryname L \ -technology sanger \ -type sanger \ -reads $inp \ > $wrk/CAcorrected/$pfx.frg fi if [ ! -d "$wrk/CAcorrected/$pfx.tigStore" ] ; then perl $bin/runCA -p $pfx -d $wrk/CAcorrected -s $wrk/build-ordered-reads.spec \ useGrid=0 scriptOnGrid=0 \ doFragmentCorrection=1 \ stopAfter=unitigger \ $wrk/CAcorrected/$pfx.frg else if [ ! -e "$wrk/CAcorrected/$pfx.qc" ] ; then perl $bin/runCA -p $pfx -d $wrk/CAcorrected -s $wrk/build-ordered-reads.spec \ useGrid=0 scriptOnGrid=0 \ doFragmentCorrection=1 \ $wrk/CAcorrected/$pfx.frg fi fi ######################################## # # Assembly with the perfectly mapping reads. # if [ -e $wrk/BO/$pfx.blasr.sam.coords.mapped.ordered.fastq ] ; then if [ ! -d $wrk/CAordered ] ; then mkdir $wrk/CAordered fi if [ ! -d "$wrk/CAordered/$pfx.frg" ] ; then $bin/fastqToCA \ -libraryname L \ -technology sanger \ -type sanger \ -reads $wrk/BO/$pfx.blasr.sam.coords.mapped.ordered.fastq \ > $wrk/CAordered/$pfx.frg fi if [ ! 
-d "$wrk/CAordered/$pfx.tigStore" ] ; then perl $bin/runCA -p $pfx -d $wrk/CAordered -s $wrk/build-ordered-reads.spec \ useGrid=0 scriptOnGrid=0 \ stopAfter=unitigger \ $wrk/CAordered/$pfx.frg else if [ ! -e "$wrk/CAordered/$pfx.qc" ] ; then perl $bin/runCA -p $pfx -d $wrk/CAordered -s $wrk/build-ordered-reads.spec \ useGrid=0 scriptOnGrid=0 \ $wrk/CAordered/$pfx.frg fi fi fi ######################################## # # Assembly with the perfectly mapping reads. # if [ -e $wrk/BO/$pfx.blasr.sam.coords.mapped.ordered.fastq ] ; then if [ ! -d $wrk/CAorderedCorrected ] ; then mkdir $wrk/CAorderedCorrected fi if [ ! -d "$wrk/CAorderedCorrected/$pfx.frg" ] ; then $bin/fastqToCA \ -libraryname L \ -technology sanger \ -type sanger \ -reads $wrk/BO/$pfx.blasr.sam.coords.mapped.ordered.fastq \ > $wrk/CAorderedCorrected/$pfx.frg fi if [ ! -d "$wrk/CAorderedCorrected/$pfx.tigStore" ] ; then perl $bin/runCA -p $pfx -d $wrk/CAorderedCorrected -s $wrk/build-ordered-reads.spec \ useGrid=0 scriptOnGrid=0 \ doFragmentCorrection=1 \ stopAfter=unitigger \ $wrk/CAorderedCorrected/$pfx.frg else if [ ! -e "$wrk/CAorderedCorrected/$pfx.qc" ] ; then perl $bin/runCA -p $pfx -d $wrk/CAorderedCorrected -s $wrk/build-ordered-reads.spec \ useGrid=0 scriptOnGrid=0 \ doFragmentCorrection=1 \ $wrk/CAorderedCorrected/$pfx.frg fi fi fi canu-1.6/src/fastq-utilities/fastqAnalyze.C000066400000000000000000000347551314437614700210220ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. 
* Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_GKP/fastqAnalyze.C * * Modifications by: * * Brian P. Walenz from 2012-FEB-24 to 2013-AUG-01 * are Copyright 2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2015-JAN-13 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-03 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include #include using namespace std; // Map a letter to an index in the freq arrays. uint32 baseToIndex[256]; char indexToBase[8]; #define FREQ_A 0 #define FREQ_C 1 #define FREQ_G 2 #define FREQ_T 3 #define FREQ_N 4 // Just N's #define FREQ_g 5 // Just -'s #define FREQ_Z 6 // Everything else #define FREQ_NUM 7 #define MAX_READ_LEN 1024 * 1024 class nucFreq { public: nucFreq() { memset(mono, 0, sizeof(uint64) * FREQ_NUM); memset(di, 0, sizeof(uint64) * FREQ_NUM * FREQ_NUM); memset(tri, 0, sizeof(uint64) * FREQ_NUM * FREQ_NUM * FREQ_NUM); }; uint64 mono[FREQ_NUM]; uint64 di[FREQ_NUM][FREQ_NUM]; uint64 tri[FREQ_NUM][FREQ_NUM][FREQ_NUM]; }; class nucOut { public: nucOut(char a, char b, char c, uint64 cnt, double frq) { label[0] = a; label[1] = b; label[2] = c; label[3] = 0; count = cnt; freq = frq; }; bool operator<(nucOut const &that) const { return(freq > that.freq); }; char label[4]; uint64 count; double freq; }; void doStats(char *inName, char *otName) { uint64 totSeqs = 0; uint64 totBases = 0; vector seqLen; nucFreq *freq = new nucFreq; char A[MAX_READ_LEN]; char B[MAX_READ_LEN]; char C[MAX_READ_LEN]; char D[MAX_READ_LEN]; errno = 0; FILE *F = fopen(inName, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", inName, 
strerror(errno)), exit(1); //errno = 0; //FILE *O = fopen(otName, "w"); //if (errno) // fprintf(stderr, "Failed to open '%s' for writing: %s\n", otName, strerror(errno)), exit(1); while (!feof(F)) { fgets(A, MAX_READ_LEN, F); fgets(B, MAX_READ_LEN, F); chomp(B); fgets(C, MAX_READ_LEN, F); fgets(D, MAX_READ_LEN, F); chomp(D); if ((A[0] != '@') || (C[0] != '+')) { fprintf(stderr, "WARNING: sequence isn't fastq.\n"); fprintf(stderr, "WARNING: %s", A); fprintf(stderr, "WARNING: %s\n", B); fprintf(stderr, "WARNING: %s", C); fprintf(stderr, "WARNING: %s\n", D); } uint32 a = 0; uint32 b = baseToIndex[B[0]]; uint32 c = baseToIndex[B[1]]; uint32 ii; freq->mono[b]++; freq->mono[c]++; freq->di[b][c]++; for (ii=2; B[ii]; ii++) { a = b; b = c; c = baseToIndex[B[ii]]; freq->mono[c]++; freq->di[b][c]++; freq->tri[a][b][c]++; } ii--; seqLen.push_back(ii); totSeqs++; totBases += ii; if ((totSeqs % 10000) == 0) fprintf(stderr, "Reading " F_U64 "\r", totSeqs); } fprintf(stderr, "Read " F_U64 "\n", totSeqs); fprintf(stdout, "%s\n", inName); fprintf(stdout, "\n"); fprintf(stdout, "sequences\t" F_U64 "\n", totSeqs); fprintf(stdout, "bases\t" F_U64 "\n", totBases); fprintf(stdout, "\n"); fprintf(stdout, "average\t" F_U64 "\n", totBases / totSeqs); fprintf(stdout, "\n"); //sort(seqLen.begin(), seqLen.end()); uint64 min = UINT64_MAX; uint64 max = 0; for (uint32 ii=0; ii output; fprintf(stdout, "\n"); fprintf(stdout, "mononucleotide\n"); fprintf(stdout, "\n"); for (uint32 ii=0; iimono[ii] > 0) output.push_back(nucOut(indexToBase[ii], 0, 0, freq->mono[ii], freq->mono[ii] * 100.0 / totBases)); sort(output.begin(), output.end()); for (uint32 ii=0; iidi[ii][jj] > 0) output.push_back(nucOut(indexToBase[ii], indexToBase[jj], 0, freq->di[ii][jj], freq->di[ii][jj] * 100.0 / totBases)); sort(output.begin(), output.end()); for (uint32 ii=0; iitri[ii][jj][kk] > 0) output.push_back(nucOut(indexToBase[ii], indexToBase[jj], indexToBase[kk], freq->tri[ii][jj][kk], freq->tri[ii][jj][kk] * 100.0 / 
totBases)); sort(output.begin(), output.end()); for (uint32 ii=0; ii 0)) { fgets(A, MAX_READ_LEN, F); fgets(B, MAX_READ_LEN, F); fgets(C, MAX_READ_LEN, F); fgets(D, MAX_READ_LEN, F); chomp(D); for (uint32 x=0; D[x] != 0; x++) { if (D[x] < '!') isNotSanger = true; if (D[x] < ';') isNotSolexa = true; if (D[x] < '@') isNotIllumina3 = true; // Illumina 1.3 if (D[x] < 'B') isNotIllumina5 = true; // Illumina 1.5 if (D[x] < '!') isNotIllumina8 = true; // Illumina 1.5 if ('I' < D[x]) isNotSanger = true; if ('h' < D[x]) isNotSolexa = true; if ('h' < D[x]) isNotIllumina3 = true; // Illumina 1.3 if ('h' < D[x]) isNotIllumina5 = true; // Illumina 1.5 if ('J' < D[x]) isNotIllumina8 = true; // Illumina 1.5 qvCounts[D[x]]++; //fprintf(stderr, "%d%d%d%d%d %c %d\n", // isNotSanger, isNotSolexa, isNotIllumina3, isNotIllumina5, isNotIllumina8, D[x], x); } //fprintf(stderr, "%d%d%d%d%d '%s'\n", // isNotSanger, isNotSolexa, isNotIllumina3, isNotIllumina5, isNotIllumina8, D); numValid = 0; if (isNotSanger == false) numValid++; if (isNotSolexa == false) numValid++; if (isNotIllumina3 == false) numValid++; if (isNotIllumina5 == false) numValid++; if (isNotIllumina8 == false) numValid++; numTrials--; } fclose(F); fprintf(stdout, "%s --", inName); if (isNotSanger == false) fprintf(stdout, " SANGER"); if (isNotSolexa == false) fprintf(stdout, " SOLEXA"); if (isNotIllumina3 == false) fprintf(stdout, " ILLUMINA_1.3+"); if (isNotIllumina5 == false) fprintf(stdout, " ILLUMINA_1.5+"); if (isNotIllumina8 == false) fprintf(stdout, " ILLUMINA_1.8+"); if (numValid == 0) fprintf(stdout, " NO_VALID_ENCODING"); fprintf(stdout, "\n"); if (numValid == 0) { fprintf(stdout, "QV histogram:\n"); for (uint32 c=0, i=0; i<12; i++) { fprintf(stdout, "%3d: ", i * 10); for (uint32 j=0; j<10; j++, c++) fprintf(stdout, " %8d/%c", qvCounts[c], isprint(c) ? 
c : '.'); fprintf(stdout, "\n"); } } if (isNotSanger == false) originalIsSanger = true; if (isNotSolexa == false) originalIsSolexa = true; if (isNotIllumina3 == false) originalIsIllumina = true; if (isNotIllumina5 == false) originalIsIllumina = true; if (isNotIllumina8 == false) originalIsSanger = true; } void doTransformQV(char *inName, char *otName, bool originalIsSolexa, bool originalIsIllumina, bool originalIsSanger) { uint32 numValid = 0; char A[MAX_READ_LEN]; char B[MAX_READ_LEN]; char C[MAX_READ_LEN]; char D[MAX_READ_LEN]; if (originalIsSolexa == true) numValid++; if (originalIsIllumina == true) numValid++; if (originalIsSanger == true) numValid++; if (numValid == 0) fprintf(stderr, "No QV decision made. No valid encoding found. Specify a QV encoding to convert from.\n"), exit(0); if (numValid > 1) fprintf(stderr, "No QV decision made. Multiple valid encodings found. Specify a QV encoding to convert from.\n"), exit(0); if (originalIsSanger == true) fprintf(stderr, "No QV changes needed; original is in sanger format already.\n"), exit(0); errno = 0; FILE *F = fopen(inName, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", inName, strerror(errno)), exit(1); errno = 0; FILE *O = fopen(otName, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", otName, strerror(errno)), exit(1); while (!feof(F)) { fgets(A, MAX_READ_LEN, F); fgets(B, MAX_READ_LEN, F); fgets(C, MAX_READ_LEN, F); fgets(D, MAX_READ_LEN, F); chomp(D); if (feof(F)) break; for (uint32 x=0; D[x] != 0; x++) { if (originalIsSolexa) { double qs = D[x] - '@'; qs /= 10.0; qs = 10.0 * log10(pow(10.0, qs) + 1); D[x] = lround(qs) + '0'; } if (originalIsIllumina) { D[x] -= '@'; D[x] += '!'; } } fprintf(O, "%s%s%s%s\n", A, B, C, D); } fclose(F); fclose(O); } int main(int argc, char **argv) { char *inName = NULL; char *otName = NULL; bool originalIsSolexa = false; bool originalIsIllumina = false; bool originalIsSanger = false; bool convertToSanger = false; bool 
computeStats = false; int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-o") == 0) { otName = argv[++arg]; convertToSanger = true; } else if (strcmp(argv[arg], "-solexa") == 0) { originalIsSolexa = true; } else if (strcmp(argv[arg], "-illumina") == 0) { originalIsIllumina = true; } else if (strcmp(argv[arg], "-sanger") == 0) { originalIsSanger = true; } else if (strcmp(argv[arg], "-stats") == 0) { computeStats = true; } else if (inName == NULL) { inName = argv[arg]; } else { err++; } arg++; } if ((err) || (inName == NULL)) { fprintf(stderr, "usage: %s [-stats] [-o output.fastq] input.fastq\n", argv[0]); fprintf(stderr, " If no options are given, input.fastq is analyzed and a best guess for the\n"); fprintf(stderr, " QV encoding is output. Otherwise, the QV encoding is converted to Sanger-style\n"); fprintf(stderr, " using this guess.\n"); fprintf(stderr, "\n"); fprintf(stderr, " In some cases, the encoding cannot be determined. When this occurs, no guess is\n"); fprintf(stderr, " output. 
For conversion, you can force the input QV type with:\n"); fprintf(stderr, "\n"); fprintf(stderr, " -solexa input QV is solexa\n"); fprintf(stderr, " -illumina input QV is illumina\n"); fprintf(stderr, " -sanger input QV is sanger\n"); fprintf(stderr, "\n"); fprintf(stderr, " -o sanger-style-output.fastq\n"); fprintf(stderr, "\n"); fprintf(stderr, " If -stats is supplied, no QV analysis or conversion is performed, but some simple\n"); fprintf(stderr, " statistics are computed and output to stdout.\n"); fprintf(stderr, "\n"); exit(1); } for (uint32 ii=0; ii<256; ii++) baseToIndex[ii] = FREQ_Z; baseToIndex['a'] = FREQ_A; baseToIndex['A'] = FREQ_A; baseToIndex['c'] = FREQ_C; baseToIndex['C'] = FREQ_C; baseToIndex['g'] = FREQ_G; baseToIndex['G'] = FREQ_G; baseToIndex['t'] = FREQ_T; baseToIndex['T'] = FREQ_T; baseToIndex['n'] = FREQ_N; baseToIndex['N'] = FREQ_N; baseToIndex['-'] = FREQ_g; baseToIndex['-'] = FREQ_g; indexToBase[FREQ_A] = 'A'; indexToBase[FREQ_C] = 'C'; indexToBase[FREQ_G] = 'G'; indexToBase[FREQ_T] = 'T'; indexToBase[FREQ_N] = 'N'; indexToBase[FREQ_g] = '-'; indexToBase[FREQ_Z] = '?'; if (computeStats) doStats(inName, otName), exit(0); doAnalyzeQV(inName, originalIsSolexa, originalIsIllumina, originalIsSanger); if (convertToSanger) doTransformQV(inName, otName, originalIsSolexa, originalIsIllumina, originalIsSanger); exit(0); } canu-1.6/src/fastq-utilities/fastqAnalyze.mk000066400000000000000000000007311314437614700212320ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := fastqAnalyze SOURCES := fastqAnalyze.C SRC_INCDIRS := .. 
../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/fastq-utilities/fastqSample.C000066400000000000000000000366441314437614700206370ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_GKP/fastqSample.C * * Modifications by: * * Brian P. Walenz from 2010-FEB-22 to 2013-AUG-01 * are Copyright 2010-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-AUG-06 to 2015-FEB-24 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2017-JUN-13 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
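`doTransformQV()` in `fastqAnalyze.C` above converts Solexa quality values with the standard mapping Qphred = 10·log10(10^(Qsolexa/10) + 1), where Solexa characters are offset by '@' (64). A self-contained sketch of that formula; note this sketch writes the usual Sanger '!' (33) offset, whereas the source above adds '0', and `solexaToSanger` is an illustrative name, not a canu function:

```cpp
#include <cmath>

// Convert one Solexa-encoded quality character to Sanger (Phred+33).
// For large Q the +1 term vanishes and Qphred ~= Qsolexa; the two scales
// diverge only at low qualities.
char solexaToSanger(char q) {
  double qs = (q - '@') / 10.0;                       // Solexa score / 10
  double qp = 10.0 * std::log10(std::pow(10.0, qs) + 1.0);
  return (char)(std::lround(qp) + '!');               // Sanger offset 33
}
```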
*/ #include "AS_global.H" #include #include using namespace std; #define MAXLEN 1024*1024*50 class aRead { public: aRead() { memset(a, 0, sizeof(char) * MAXLEN); memset(b, 0, sizeof(char) * MAXLEN); memset(c, 0, sizeof(char) * MAXLEN); memset(d, 0, sizeof(char) * MAXLEN); }; bool read(FILE *F) { a[0] = 0; b[0] = 0; c[0] = 0; d[0] = 0; if (F == NULL) return(false); fgets(a, MAXLEN, F); fgets(b, MAXLEN, F); fgets(c, MAXLEN, F); fgets(d, MAXLEN, F); if (feof(F) == true) return(false); if ((a[0] != '@') || (c[0] != '+')) { fprintf(stderr, "ERROR: Not FastQ. Read lines:\n"); fprintf(stderr, " %s", a); fprintf(stderr, " %s", b); fprintf(stderr, " %s", c); fprintf(stderr, " %s", d); exit(1); } return(true); }; void write(FILE *F) { if (F == NULL) return; fputs(a, F); fputs(b, F); fputs(c, F); fputs(d, F); }; uint32 length(void) { return(strlen(b) - 1); // Newline still exists }; private: char a[MAXLEN]; char b[MAXLEN]; char c[MAXLEN]; char d[MAXLEN]; }; class anInput { public: anInput() { id = 0; len = 0; }; anInput(uint64 id_, uint32 len1_, uint32 len2_) { id = id_; len = len1_ + len2_; }; uint64 id; uint32 len; }; inline bool anInputByLongest(const anInput &a, const anInput &b) { return(a.len > b.len); }; int main(int argc, char **argv) { aRead *Ar = new aRead, *Br = new aRead; FILE *Ai = NULL, *Bi = NULL; FILE *Ao = NULL, *Bo = NULL; vector ids; vector sav; char *INPNAME = NULL; char *OUTNAME = NULL; bool AUTONAME = false; uint64 NUMINPUT = 0; // Number of pairs in the input uint64 READLENGTH = 0; // For mated reads, 2x read size bool isMated = true; uint32 MINLEN = 0; uint32 LONGEST = false; uint64 GENOMESIZE = 0; // Size of the genome in bp double COVERAGE = 0; // Desired coverage in output uint64 NUMOUTPUT = 0; // Number of pairs to output double FRACTION = 0.0; // Desired fraction of the input uint64 BASES = 0; // Desired amount of sequence char path1[FILENAME_MAX]; char path2[FILENAME_MAX]; srand48(time(NULL)); int arg=1; int err=0; while (arg < argc) { if 
(strcmp(argv[arg], "-I") == 0) { INPNAME = argv[++arg]; } else if (strcmp(argv[arg], "-O") == 0) { OUTNAME = argv[++arg]; } else if (strcmp(argv[arg], "-A") == 0) { AUTONAME = true; } else if (strcmp(argv[arg], "-T") == 0) { NUMINPUT = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-L") == 0) { READLENGTH = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-U") == 0) { isMated = false; } else if (strcmp(argv[arg], "-m") == 0) { MINLEN = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-max") == 0) { LONGEST = true; } else if (strcmp(argv[arg], "-g") == 0) { GENOMESIZE = atol(argv[++arg]); } else if (strcmp(argv[arg], "-c") == 0) { COVERAGE = atof(argv[++arg]); } else if (strcmp(argv[arg], "-p") == 0) { NUMOUTPUT = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-f") == 0) { FRACTION = atof(argv[++arg]); } else if (strcmp(argv[arg], "-b") == 0) { BASES = atoi(argv[++arg]); } else { err++; } arg++; } if (INPNAME == NULL) err++; if (OUTNAME == NULL) OUTNAME = INPNAME; if ((GENOMESIZE == 0) && (COVERAGE != 0)) err++; if ((COVERAGE == 0) && (NUMOUTPUT == 0) && (FRACTION == 0.0) && (BASES == 0)) err++; if (err) { fprintf(stderr, "\n"); fprintf(stderr, "usage: %s [opts]\n", argv[0]); fprintf(stderr, " Input Specification\n"); fprintf(stderr, " -I NAME input name (prefix) of the reads\n"); fprintf(stderr, " -T T total number of mate pairs in the input (if not supplied, will be counted)\n"); fprintf(stderr, " -L L length of a single read (if not supplied, will be determined)\n"); fprintf(stderr, " -U reads are unmated, expected in *.u.fastq\n"); fprintf(stderr, "\n"); fprintf(stderr, " Output Specification\n"); fprintf(stderr, " -O NAME output name (prefix) of the reads (default is same as -I)\n"); fprintf(stderr, " -A automatically include coverage or number of reads in the output name\n"); fprintf(stderr, " -m L ignore reads shorter than L bases\n"); fprintf(stderr, " -max don't sample randomly, pick the longest reads\n"); fprintf(stderr, "\n"); fprintf(stderr, " Method 
1: specify desired output coverage:\n"); fprintf(stderr, " -g G genome size\n"); fprintf(stderr, " -c C desired coverage in the output reads\n"); fprintf(stderr, "\n"); fprintf(stderr, " Method 2: specify desired number of output pairs\n"); fprintf(stderr, " -p N for mated reads, output 2N reads, or N pairs of reads\n"); fprintf(stderr, " for unmated reads, output N reads\n"); fprintf(stderr, "\n"); fprintf(stderr, " Method 3: specify a desired fraction of the input:\n"); fprintf(stderr, " -f F output F * T pairs of reads (T as above in -t option)\n"); fprintf(stderr, " 0.0 < F <= 1.0\n"); fprintf(stderr, "\n"); fprintf(stderr, " Method 4: specify a desired total length\n"); fprintf(stderr, " -b B output reads/pairs until B bases is exceeded\n"); fprintf(stderr, "\n"); fprintf(stderr, "\n"); fprintf(stderr, "Samples reads from paired Illumina reads NAME.1.fastq and NAME.2.fastq and outputs:\n"); fprintf(stderr, " NAME.Cx.1.fastq and N.Cx.2.fastq (for coverage based sampling)\n"); fprintf(stderr, " NAME.n=N.1.fastq and N.n=N.2.fastq (for coverage based sampling)\n"); fprintf(stderr, "\n"); fprintf(stderr, "If -T is not supplied, the number of reads will be counted for you.\n"); fprintf(stderr, "\n"); if (INPNAME == NULL) fprintf(stderr, "ERROR: no name supplied with -I.\n"); if ((GENOMESIZE == 0) && (COVERAGE != 0)) fprintf(stderr, "ERROR: no genome size supplied with -g (when using -c)\n"); if ((COVERAGE == 0) && (NUMOUTPUT == 0) && (FRACTION == 0.0) && (BASES == 0)) fprintf(stderr, "ERROR: no method supplied with -c, -p, -f or -b\n"); fprintf(stderr, "\n"); exit(1); } // // We know not enough about the reads, and are forced to scan the entire // inputs. // uint64 totBasesInInput = 0; uint64 totPairsInInput = 0; if ((NUMINPUT == 0) || (READLENGTH == 0)) { uint64 Ac = 0; uint64 Bc = 0; fprintf(stderr, "Counting the number of reads in the input.\n"); snprintf(path1, FILENAME_MAX, "%s.%c.fastq", INPNAME, (isMated == true) ? 
'1' : 'u'); snprintf(path2, FILENAME_MAX, "%s.%c.fastq", INPNAME, (isMated == true) ? '2' : 'u'); errno = 0; Ai = fopen(path1, "r"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", path1, strerror(errno)), exit(1); if (isMated == true) { errno = 0; Bi = fopen(path2, "r"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", path2, strerror(errno)), exit(1); } bool moreA = Ar->read(Ai); bool moreB = Br->read(Bi); while (moreA || moreB) { uint32 lA = Ar->length(); uint32 lB = Br->length(); if (lA > 0) Ac++; if (lB > 0) Bc++; ids.push_back(anInput(totPairsInInput, lA, lB)); sav.push_back(false); totPairsInInput += 1; totBasesInInput += lA + lB; moreA = Ar->read(Ai); moreB = Br->read(Bi); } if (Ai) fclose(Ai); if (Bi) fclose(Bi); fprintf(stderr, "Found " F_U64 " bases and " F_U64 " reads in '%s'\n", totBasesInInput, totPairsInInput, path1); if (Ac != Bc) { fprintf(stderr, "ERROR: Number of reads in the .1 and .2 files must be the same.\n"); exit(1); } } // Otherwise, both NUMINPUT and READLENGTH are defined, so we can fill out ids // with defaults. else { if (isMated == false) for (uint64 ii=0; ii 0) { nBasesToOutput = (uint64)(COVERAGE * GENOMESIZE); } if (NUMOUTPUT > 0) { nPairsToOutput = NUMOUTPUT; } if (FRACTION > 0) { nPairsToOutput = (uint64)(FRACTION * totPairsInInput); } if (BASES > 0) { nBasesToOutput = BASES; } if (totBasesInInput < nBasesToOutput) fprintf(stderr, "ERROR: not enough reads, " F_U64 " bp in input, " F_U64 " needed for desired .....\n", totBasesInInput, nBasesToOutput), exit(1); if (totPairsInInput < nPairsToOutput) fprintf(stderr, "ERROR: not enough reads, " F_U64 " %s in input, " F_U64 " needed for desired ......\n", totPairsInInput, (isMated) ? "pairs" : "reads", nPairsToOutput), exit(1); //fprintf(stderr, "OUTPUT: %lu bases\n", nBasesToOutput); //fprintf(stderr, "OUTPUT: %lu pairs\n", nPairsToOutput); // // Randomize the ID list, or sort by length. 
// if (LONGEST) { fprintf(stderr, "Sorting by length\n"); sort(ids.begin(), ids.end(), anInputByLongest); } else { fprintf(stderr, "Shuffling sequences\n"); for (uint64 i=0; i 0) { uint64 nPairs = 0; for (uint64 i=0; i 0) { uint64 nBases = 0; for (uint64 i=0; i 0) { snprintf(path1, FILENAME_MAX, "%s.x=%07.3f.n=%09" F_U64P ".%c.fastq", OUTNAME, (double)nBasesToOutput / GENOMESIZE, nPairsToOutput, (isMated == true) ? '1' : 'u'); snprintf(path2, FILENAME_MAX, "%s.x=%07.3f.n=%09" F_U64P ".%c.fastq", OUTNAME, (double)nBasesToOutput / GENOMESIZE, nPairsToOutput, (isMated == true) ? '2' : 'u'); } else { snprintf(path1, FILENAME_MAX, "%s.x=UNKNOWN.n=%09" F_U64P ".%c.fastq", OUTNAME, nPairsToOutput, (isMated == true) ? '1' : 'u'); snprintf(path2, FILENAME_MAX, "%s.x=UNKNOWN.n=%09" F_U64P ".%c.fastq", OUTNAME, nPairsToOutput, (isMated == true) ? '2' : 'u'); } errno = 0; Ao = fopen(path1, "w"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", path1, strerror(errno)), exit(1); if (isMated == true) { errno = 0; Bo = fopen(path2, "w"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", path2, strerror(errno)), exit(1); } uint64 i=0; uint64 s=0; if (isMated == true) { if (nPairsToOutput > 0) fprintf(stderr, "Extracting " F_U64 " mate pairs into %s and %s\n", nPairsToOutput, path1, path2); else fprintf(stderr, "Extracting " F_U64 " bases of mate pairs into %s and %s\n", nBasesToOutput, path1, path2); for (; Ar->read(Ai) && Br->read(Bi); i++) { if ((i < totPairsInInput) && (sav[i])) { Ar->write(Ao); Br->write(Bo); s++; } } fclose(Ai); fclose(Bi); fclose(Ao); fclose(Bo); } else { if (nPairsToOutput > 0) fprintf(stderr, "Extracting " F_U64 " reads into %s\n", nPairsToOutput, path1); else fprintf(stderr, "Extracting " F_U64 " bases of reads into %s\n", nBasesToOutput, path1); for (; Ar->read(Ai); i++) { if ((i < totPairsInInput) && (sav[i])) { Ar->write(Ao); s++; } } fclose(Ai); fclose(Ao); } delete Ar; delete Br; if (i > totPairsInInput) { fprintf(stderr, "WARNING: There 
are " F_U64 " %s in the input; you claimed there are " F_U64 " (-t option) %s.\n", i, (isMated) ? "mates" : "reads", totPairsInInput, (isMated) ? "mates" : "reads"); fprintf(stderr, "WARNING: Result is not a random sample of the input file.\n"); } if (i < totPairsInInput) { fprintf(stderr, "WARNING: There are " F_U64 " %s in the input; you claimed there are " F_U64 " (-t option) %s.\n", i, (isMated) ? "mates" : "reads", totPairsInInput, (isMated) ? "mates" : "reads"); if (GENOMESIZE > 0) fprintf(stderr, "WARNING: Result is only %f X coverage.\n", (double)s * READLENGTH / GENOMESIZE); } return(0); } canu-1.6/src/fastq-utilities/fastqSample.mk000066400000000000000000000007271314437614700210550ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := fastqSample SOURCES := fastqSample.C SRC_INCDIRS := .. ../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/fastq-utilities/fastqSimulate-checkCoverage.pl000066400000000000000000000104561314437614700241520ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. 
# # This file is derived from: # # src/AS_GKP/fastqSimulate-checkCoverage.pl # # Modifications by: # # Brian P. Walenz on 2014-FEB-20 # are Copyright 2014 J. Craig Venter Institute, and # are subject to the GNU General Public License version 2 # # Brian P. Walenz on 2015-JAN-13 # are Copyright 2015 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. Walenz beginning on 2015-OCT-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## use strict; system("leaff -G 1 2000 2000 > base.fasta"); system("leaff -H -G 1 500 500 | tr ACGT NNNN >> base.fasta"); system("leaff -H -G 1 2000 2000 >> base.fasta"); my (@rcov1, @ccov1); my (@rcov2, @ccov2); my (@rcov3, @ccov3); if (! -e "cov") { my $max = -s "base.fasta"; my $maxiter = 100; for (my $iter=0; $iter<$maxiter; $iter++) { system("fastqSimulate -em 0 -ei 0 -ed 0 -f base.fasta -o t1 -l 100 -x 100 -pe 1000 0 > /dev/null 2>&1"); system("fastqSimulate -em 0 -ei 0 -ed 0 -f base.fasta -o t2 -l 100 -x 100 -allowgaps -pe 1000 0 > /dev/null 2>&1"); system("fastqSimulate -em 0 -ei 0 -ed 0 -f base.fasta -o t3 -l 100 -x 100 -allowgaps -allowns -pe 1000 0 > /dev/null 2>&1"); system("rm *.1.fastq *.2.fastq *.c.fastq"); print STDERR "Summarizing for iter $iter\n"; open(IN, "< t1.i.fastq"); summarize(\@rcov1, \@ccov1); close(IN); open(IN, "< t2.i.fastq"); summarize(\@rcov2, \@ccov2); close(IN); open(IN, "< t3.i.fastq"); summarize(\@rcov3, \@ccov3); close(IN); } open(COV, "> cov"); for (my $x=0; $x<$max; $x++) { $rcov1[$x] += 0; $rcov1[$x] /= $maxiter; $ccov1[$x] += 0; $ccov1[$x] /= $maxiter; $rcov2[$x] += 0; $rcov2[$x] /= $maxiter; $ccov2[$x] += 0; $ccov2[$x] /= $maxiter; $rcov3[$x] += 0; $rcov3[$x] /= $maxiter; $ccov3[$x] += 0; $ccov3[$x] /= $maxiter; print COV 
"$x\t$rcov1[$x]\t$ccov1[$x]\t$rcov2[$x]\t$ccov2[$x]\t$rcov3[$x]\t$ccov3[$x]\n"; } close(COV); } open(GP, "> cov.gp"); print GP "set terminal png size 1280,800\n"; print GP "set output 'cov.png'\n"; print GP "plot 'cov' using 1:2 with lines lt 1 lw 1 title 'read cov, no Ns', \\\n"; print GP " 'cov' using 1:4 with lines lt 2 lw 1 title 'read cov, clone span Ns', \\\n"; print GP " 'cov' using 1:6 with lines lt 3 lw 1 title 'read cov, both span Ns', \\\n"; print GP " 'cov' using 1:3 with lines lt 1 lw 2 title 'clone cov, no Ns', \\\n"; print GP " 'cov' using 1:5 with lines lt 2 lw 2 title 'clone cov, clone span Ns', \\\n"; print GP " 'cov' using 1:7 with lines lt 3 lw 2 title 'clone cov, both span Ns'\n"; close(GP); system("gnuplot cov.gp"); sub summarize(@@) { my ($rcov, $ccov) = @_; while (!eof(IN)) { my $a = <IN>; chomp $a; my $b = <IN>; chomp $b; my $l = length($b); my $c = <IN>; my $d = <IN>; my $bgn; my $end; my $spn; if ($a =~ m/^\@PE_\d+_\d+\@(\d+)-(\d+)\/1/) { $bgn = $1; $end = $1 + $l; $spn = $2; } if ($a =~ m/^\@PE_\d+_\d+\@(\d+)-(\d+)\/2/) { $bgn = $2 - $l; $end = $2; $spn = $bgn; # No clone span for the second read } die if (!defined($bgn)); for (my $x=$bgn; $x<$end; $x++) { ${$rcov}[$x]++; } for (my $x=$bgn; $x<$spn; $x++) { ${$ccov}[$x]++; } } } canu-1.6/src/fastq-utilities/fastqSimulate-perfectSep.pl000066400000000000000000000235401314437614700235170ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994.
# # This file is derived from: # # src/AS_GKP/fastqSimulate-perfectSep.pl # # Modifications by: # # Brian P. Walenz from 2011-DEC-28 to 2013-AUG-01 # are Copyright 2011,2013 J. Craig Venter Institute, and # are subject to the GNU General Public License version 2 # # Brian P. Walenz on 2015-JAN-13 # are Copyright 2015 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. Walenz beginning on 2015-OCT-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## use strict; # Build MP, PE and 454 reads for a reference sequence. The MP reads are filtered into # a perfectly classified set with two libraries PEperfect and MPperfect. Chimers # can be rescued or deleted. my $prefix = shift @ARGV; my $reference = shift @ARGV; my $error = "0.01"; my $BBcoverage = "08"; my $PEcoverage = "25"; my $PEinsert = "0500"; my $PEstddev = "050"; my $MPcoverage = "10"; my $MPinsert = "3000"; my $MPstddev = "300"; my $MPenrich = "0.8"; my $BBreadLen = "400"; my $PEreadLen = "150"; my $MPreadLen = "150"; my $name; # # BUILD 454 # $name = "$prefix.${BBreadLen}bp.fragment.${BBcoverage}x"; if (! -e "$name.s.fastq") { system("fastqSimulate -f $reference -o $name -l $BBreadLen -x $BBcoverage -e $error -se"); unlink("$name.frg"); } if (! -e "$name.frg") { system("fastqToCA -libraryname BB -type sanger -reads $name.s.fastq > $name.frg"); } # # BUILD PE # $name = "$prefix.${PEreadLen}bp.${PEinsert}bpPE.${PEcoverage}x"; if (! -e "$name.i.fastq") { system("fastqSimulate -f $reference -o $name -l $PEreadLen -x $PEcoverage -e $error -pe $PEinsert $PEstddev"); unlink("$name.frg"); } if (! 
-e "$name.frg") { system("fastqToCA -libraryname BB -insertsize $PEinsert $PEstddev -innie -type sanger -reads $name.1.fastq,$name.2.fastq > $name.frg"); } # # BUILD MP # $name = "$prefix.${MPreadLen}bp.${MPinsert}bpMP.${MPcoverage}x"; if (! -e "$name.i.fastq") { print STDERR "fastqSimulate -f $reference -o $name -l $MPreadLen -x $MPcoverage -e $error -mp $MPinsert $MPstddev $PEinsert $PEstddev $MPenrich\n"; system("fastqSimulate -f $reference -o $name -l $MPreadLen -x $MPcoverage -e $error -mp $MPinsert $MPstddev $PEinsert $PEstddev $MPenrich"); unlink("$name.frg"); } if (! -e "$name.frg") { system("fastqToCA -libraryname MP -insertsize $MPinsert $MPstddev -outtie -type sanger -reads $name.1.fastq,$name.2.fastq > $name.frg"); } # # FILTER MP # open(tMP1, "> $name.tMP.1.fastq"); open(tMP2, "> $name.tMP.2.fastq"); open(tMPi, "> $name.tMP.i.fastq"); open(fPE1, "> $name.fPE.1.fastq"); open(fPE2, "> $name.fPE.2.fastq"); open(fPEi, "> $name.fPE.i.fastq"); open(aPE1, "> $name.aPE.1.fastq"); # A-junction reads that are PE open(aPE2, "> $name.aPE.2.fastq"); open(aPEi, "> $name.aPE.i.fastq"); open(aMP1, "> $name.aMP.1.fastq"); # A-junction reads that are MP open(aMP2, "> $name.aMP.2.fastq"); open(aMPi, "> $name.aMP.i.fastq"); open(bPE1, "> $name.bPE.1.fastq"); # B-junction reads that are PE open(bPE2, "> $name.bPE.2.fastq"); open(bPEi, "> $name.bPE.i.fastq"); open(bMP1, "> $name.bMP.1.fastq"); # B-junction reads that are MP open(bMP2, "> $name.bMP.2.fastq"); open(bMPi, "> $name.bMP.i.fastq"); open(MP1, "< $name.1.fastq"); open(MP2, "< $name.2.fastq"); my ($a1, $b1, $c1, $d1); my ($a2, $b2, $c2, $d2); while (!eof(MP1) && !eof(MP2)) { $a1 = <MP1>; chomp $a1; $a2 = <MP2>; chomp $a2; # Painful. The reverse()s below also reverse the newline.
$b1 = <MP1>; chomp $b1; $b2 = <MP2>; chomp $b2; $c1 = <MP1>; chomp $c1; $c2 = <MP2>; chomp $c2; $d1 = <MP1>; chomp $d1; $d2 = <MP2>; chomp $d2; if ($a1 =~ m/^.(...)_/) { my $typ = $1; if ($typ eq "tMP") { $b1 = reverse($b1); $b1 =~ tr/ACGTacgt/TGCAtgca/; $d1 = reverse($d1); $b2 = reverse($b2); $b2 =~ tr/ACGTacgt/TGCAtgca/; $d2 = reverse($d2); print tMP1 "$a1\n$b1\n$c1\n$d1\n"; print tMP2 "$a2\n$b2\n$c2\n$d2\n"; print tMPi "$a1\n$b1\n$c1\n$d1\n"; print tMPi "$a2\n$b2\n$c2\n$d2\n"; } elsif ($typ eq "fPE") { print fPE1 "$a1\n$b1\n$c1\n$d1\n"; print fPE2 "$a2\n$b2\n$c2\n$d2\n"; print fPEi "$a1\n$b1\n$c1\n$d1\n"; print fPEi "$a2\n$b2\n$c2\n$d2\n"; } elsif ($typ eq "aMP") { if ($a1 =~ m/^...._\d+_\d+.\d+-\d+_(\d+)\/(\d+)\/\d+\/\d$/) { my $len = $2; my $pos = $1; if ($pos < $MPreadLen / 2) { # Saving the 5' end, makes MP. $b1 = substr($b1, 0, $pos); $b1 = reverse($b1); $b1 =~ tr/ACGTacgt/TGCAtgca/; $d1 = substr($d1, 0, $pos); $d1 = reverse($d1); $b2 = reverse($b2); $b2 =~ tr/ACGTacgt/TGCAtgca/; $d2 = reverse($d2); print aMP1 "$a1\n$b1\n$c1\n$d1\n"; print aMP2 "$a2\n$b2\n$c2\n$d2\n"; print aMPi "$a1\n$b1\n$c1\n$d1\n"; print aMPi "$a2\n$b2\n$c2\n$d2\n"; } else { # Saving the 3' end, makes PE. $b1 = substr($b1, $pos); $d1 = substr($d1, $pos); print aPE1 "$a1\n$b1\n$c1\n$d1\n"; print aPE2 "$a2\n$b2\n$c2\n$d2\n"; print aPEi "$a1\n$b1\n$c1\n$d1\n"; print aPEi "$a2\n$b2\n$c2\n$d2\n"; } } else { chomp $a1; die "No aMP '$a1'\n"; } } elsif ($typ eq "bMP") { if ($a1 =~ m/^...._\d+_\d+.\d+-\d+_(\d+)\/(\d+)\/\d+\/\d$/) { my $pos = $1; my $len = $2; if ($pos < $MPreadLen / 2) { # Saving the 5' end, makes MP. $b1 = reverse($b1); $b1 =~ tr/ACGTacgt/TGCAtgca/; $d1 = reverse($d1); $b2 = substr($b2, 0, $pos); $b2 = reverse($b2); $b2 =~ tr/ACGTacgt/TGCAtgca/; $d2 = substr($d2, 0, $pos); $d2 = reverse($d2); print bMP1 "$a1\n$b1\n$c1\n$d1\n"; print bMP2 "$a2\n$b2\n$c2\n$d2\n"; print bMPi "$a1\n$b1\n$c1\n$d1\n"; print bMPi "$a2\n$b2\n$c2\n$d2\n"; } else { # Saving the 3' end, makes PE.
$b2 = substr($b2, $pos); $d2 = substr($d2, $pos); print bPE1 "$a1\n$b1\n$c1\n$d1\n"; print bPE2 "$a2\n$b2\n$c2\n$d2\n"; print bPEi "$a1\n$b1\n$c1\n$d1\n"; print bPEi "$a2\n$b2\n$c2\n$d2\n"; } } else { chomp $a1; die "No bMP '$a1'\n"; } } elsif ($typ eq "cMP") { chomp; print STDERR "$a1\n"; } else { die "Unknown type '$typ'\n"; } } else { chomp $a1; die "no match: '$a1'\n"; } } close(tMP1); close(tMP2); close(tMPi); close(fPE1); close(fPE2); close(fPEi); close(aMP1); close(aMP2); close(aMPi); close(aPE1); close(aPE2); close(aPEi); close(bMP1); close(bMP2); close(bMPi); close(bPE1); close(bPE2); close(bPEi); system("fastqToCA -libraryname tMP -insertsize $MPinsert $MPstddev -innie -type sanger -mates $name.tMP.1.fastq,$name.tMP.2.fastq > $name.tMP.frg"); system("fastqToCA -libraryname fPE -insertsize $PEinsert $PEstddev -innie -type sanger -mates $name.fPE.1.fastq,$name.fPE.2.fastq > $name.fPE.frg"); system("fastqToCA -libraryname aMP -insertsize $MPinsert $MPstddev -innie -type sanger -mates $name.aMP.1.fastq,$name.aMP.2.fastq > $name.aMP.frg"); system("fastqToCA -libraryname aPE -insertsize $PEinsert $PEstddev -innie -type sanger -mates $name.aPE.1.fastq,$name.aPE.2.fastq > $name.aPE.frg"); system("fastqToCA -libraryname bMP -insertsize $MPinsert $MPstddev -innie -type sanger -mates $name.bMP.1.fastq,$name.bMP.2.fastq > $name.bMP.frg"); system("fastqToCA -libraryname bPE -insertsize $PEinsert $PEstddev -innie -type sanger -mates $name.bPE.1.fastq,$name.bPE.2.fastq > $name.bPE.frg"); system("perl /work/FRAGS/map-illumina-pairs.pl plot-tMP JUNK.150bp.3000bpMP.10x.tMP.1.fastq JUNK.150bp.3000bpMP.10x.tMP.2.fastq Yersinia_pestis_KIM10_AE009952.fasta"); system("perl /work/FRAGS/map-illumina-pairs.pl plot-fPE JUNK.150bp.3000bpMP.10x.fPE.1.fastq JUNK.150bp.3000bpMP.10x.fPE.2.fastq Yersinia_pestis_KIM10_AE009952.fasta"); system("perl /work/FRAGS/map-illumina-pairs.pl plot-aMP JUNK.150bp.3000bpMP.10x aMP.1.fastq JUNK.150bp.3000bpMP.10x.aMP.2.fastq 
Yersinia_pestis_KIM10_AE009952.fasta"); system("perl /work/FRAGS/map-illumina-pairs.pl plot-aPE JUNK.150bp.3000bpMP.10x.aPE.1.fastq JUNK.150bp.3000bpMP.10x.aPE.2.fastq Yersinia_pestis_KIM10_AE009952.fasta"); system("perl /work/FRAGS/map-illumina-pairs.pl plot-bMP JUNK.150bp.3000bpMP.10x.bMP.1.fastq JUNK.150bp.3000bpMP.10x.bMP.2.fastq Yersinia_pestis_KIM10_AE009952.fasta"); system("perl /work/FRAGS/map-illumina-pairs.pl plot-bPE JUNK.150bp.3000bpMP.10x.bPE.1.fastq JUNK.150bp.3000bpMP.10x.bPE.2.fastq Yersinia_pestis_KIM10_AE009952.fasta"); canu-1.6/src/fastq-utilities/fastqSimulate-sort.C000066400000000000000000000141151314437614700221530ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_GKP/fastqSimulate-sort.C * * Modifications by: * * Brian P. Walenz from 2013-MAR-21 to 2013-AUG-01 * are Copyright 2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2015-JAN-13 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <assert.h> #include "AS_global.H" #include "AS_UTL_fileIO.H" #include "AS_UTL_reverseComplement.H" #include <vector> #include <algorithm> using namespace std; class pairedRead { public: bool operator<(const pairedRead &that) const { return((seqId < that.seqId) || ((seqId == that.seqId) && (seqPos < that.seqPos))); } uint32 seqId; uint32 seqPos; char *readA; char *readB; }; uint32 hLen = 1024; uint32 sLen = 1024 * 1024 * 16; char *a = NULL; char *b = NULL; char *c = NULL; char *d = NULL; char * readRead(FILE *inFile, uint32 &seq, uint32 &bgn, uint32 &end) { seq = 0; bgn = 0; end = 0; if (inFile == NULL) return(NULL); if (a == NULL) { a = new char [hLen]; b = new char [sLen]; c = new char [hLen]; d = new char [sLen]; } uint32 al = 0; uint32 bl = 0; uint32 cl = 0; uint32 dl = 0; fgets(a, hLen, inFile); al = strlen(a); fgets(b, sLen, inFile); bl = strlen(b); fgets(c, hLen, inFile); cl = strlen(c); fgets(d, sLen, inFile); dl = strlen(d); assert(a[0] == '@'); assert(c[0] == '+'); uint32 p=0; // @tMP_0_0@2569105-2572074_239/553/2571835/1 while (a[p] != '_') p++; p++; while (a[p] != '_') p++; p++; seq = strtoul(a+p, NULL, 10); while (a[p] != '@') p++; p++; bgn = strtoul(a+p, NULL, 10); while (a[p] != '-') p++; p++; end = strtoul(a+p, NULL, 10); //fprintf(stderr, "seq=" F_U32 " bgn=" F_U32 " end=" F_U32 " line %s", // seq, bgn, end, a); char *retstr = new char [al + bl + cl + dl + 1]; memcpy(retstr, a, al); memcpy(retstr + al, b, bl); memcpy(retstr + al + bl, c, cl); memcpy(retstr + al + bl + cl, d, dl); retstr[al + bl + cl + dl] = 0; return(retstr); } int main(int argc, char **argv) { char *inName1 = NULL; char *inName2 = NULL; char *otName1 = NULL; char *otName2 = NULL; int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-i1") == 0) { inName1 = argv[++arg]; } else if (strcmp(argv[arg], "-i2") == 0) { inName2 = argv[++arg]; } else if (strcmp(argv[arg], "-o1") == 0) { otName1 = argv[++arg]; } else if (strcmp(argv[arg], "-o2") == 0) { otName2 =
argv[++arg]; } else { err++; } arg++; } if (inName1 == NULL) err++; if (otName1 == NULL) err++; if ((inName2 == NULL) != (otName2 == NULL)) err++; if (err) { fprintf(stderr, "usage: %s -i1 in.1.fastq [-i2 in.2.fastq] -o1 out.1.fastq [-o2 out.2.fastq]\n", argv[0]); if (inName1 == NULL) fprintf(stderr, "ERROR: No in.1.fastq supplied with -i1.\n"); if (otName1 == NULL) fprintf(stderr, "ERROR: No out.1.fastq supplied with -i1.\n"); if ((inName2 == NULL) != (otName2 == NULL)) fprintf(stderr, "ERROR: Matedness of input and output don't agree (neither or both -i2 and -o2 must be used).\n"); exit(1); } errno = 0; FILE *inFile1 = (inName1 == NULL) ? NULL : fopen(inName1, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", inName1, strerror(errno)), exit(1); errno = 0; FILE *inFile2 = (inName2 == NULL) ? NULL : fopen(inName2, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", inName2, strerror(errno)), exit(1); errno = 0; FILE *otFile1 = (otName1 == NULL) ? NULL : fopen(otName1, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", otName1, strerror(errno)), exit(1); errno = 0; FILE *otFile2 = (otName2 == NULL) ? NULL : fopen(otName2, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", otName2, strerror(errno)), exit(1); // Load reads. vector reads; pairedRead pr; while (!feof(inFile1)) { uint32 seq1=0, bgn1=0, end1=0; uint32 seq2=0, bgn2=0, end2=0; char *a = readRead(inFile1, seq1, bgn1, end1); char *b = readRead(inFile2, seq2, bgn2, end2); if (feof(inFile1)) break; if ((a != NULL) && (b != NULL)) { pr.seqId = (seq1 < seq2) ? seq1 : seq2; pr.seqPos = (bgn1 < end1) ? 
bgn1 : end1; assert(bgn1 == bgn2); assert(end1 == end2); } else if (a != NULL) { pr.seqId = seq1; pr.seqPos = bgn1; } else { assert(0); } pr.readA = a; pr.readB = b; reads.push_back(pr); } fprintf(stderr, "Loaded " F_U64 " mated reads.\n", reads.size()); sort(reads.begin(), reads.end()); for (uint32 i=0; i #include #include #include #include "AS_global.H" #include "AS_UTL_fileIO.H" #include "AS_UTL_reverseComplement.H" #include using namespace std; #undef DEBUG_ERRORS // Print when mismatch, insert or delete errors are added vector seqStartPositions; static char revComp[256]; static char errorBase[256][3]; static char insertBase[4]; static char validBase[256]; double readMismatchRate = 0.01; // Fraction mismatch error double readInsertRate = 0.01; // Fraction insertion error double readDeleteRate = 0.01; // Fraction deletion error bool allowGaps = false; bool allowNs = false; const uint32 mpJunctionsNone = 0; const uint32 mpJunctionsNormal = 1; const uint32 mpJunctionsAlways = 2; double pRevComp = 0.5; double pNormal = 0.0; #define QV_BASE '!' // Returns random int in range bgn <= x < end. // int32 randomUniform(int32 bgn, int32 end) { if (bgn >= end) fprintf(stderr, "randomUniform()-- ERROR: invalid range bgn=%d end=%d\n", bgn, end); assert(bgn < end); return((int32)floor((end - bgn) * drand48() + bgn)); } // Generate a random gaussian using the Marsaglia polar method. // int32 randomGaussian(double mean, double stddev) { double u = 0.0; double v = 0.0; double r = 0.0; do { u = 2.0 * drand48() - 1.0; v = 2.0 * drand48() - 1.0; r = u * u + v * v; } while (r >= 1.0); if (r < 1e-10) r = 1e-10; r = sqrt(-2 * log(r) / r); // Uniform gaussian is u*r and v*r. We only use one of these. 
return((int32)(mean + u * r * stddev)); } int32 findSequenceIndex(int32 pos) { int32 seqIdx = 0; for (seqIdx=0; seqIdx 0); seqIdx--; return(seqIdx); } uint64 nNoChange = 0; uint64 nMismatch = 0; uint64 nInsert = 0; uint64 nDelete = 0; void makeSequenceError(char *s1, char *q1, int32 &p) { double r = drand48(); if ((r < readMismatchRate) && (p >= 0)) { #ifdef DEBUG_ERRORS fprintf(stderr, "MISMATCH at p=%d base=%d/%c qc=%d/%c (INITIAL)\n", p, s1[p], s1[p], q1[p], q1[p]); #endif s1[p] = errorBase[s1[p]][randomUniform(0, 3)]; q1[p] = (validBase[s1[p]]) ? QV_BASE + 8 : QV_BASE + 2; nMismatch++; #ifdef DEBUG_ERRORS fprintf(stderr, "MISMATCH at p=%d base=%d/%c qc=%d/%c\n", p, s1[p], s1[p], q1[p], q1[p]); #endif return; } r -= readMismatchRate; if (r < readInsertRate) { p++; s1[p] = insertBase[randomUniform(0, 4)]; q1[p] = (validBase[s1[p]]) ? QV_BASE + 4 : QV_BASE + 2; nInsert++; #ifdef DEBUG_ERRORS fprintf(stderr, "INSERT at p=%d base=%d/%c qc=%d/%c\n", p, s1[p], s1[p], q1[p], q1[p]); #endif return; } r -= readInsertRate; if ((r < readDeleteRate) && (p > 0)) { p--; nDelete++; #ifdef DEBUG_ERRORS fprintf(stderr, "DELETE at p=%d\n", p); #endif return; } r -= readDeleteRate; nNoChange++; } bool makeSequences(char *frag, int32 fragLen, int32 readLen, char *s1, char *q1, char *s2, char *q2, bool makeNormal = false) { for (int32 p=0, i=0; p') || ((allowGaps == false) && (seq[i] == 'N'))) goto trySEagain; // Generate the sequence. if (makeSequences(seq + bgn, 0, readLen, s1, q1, NULL, NULL) == false) goto trySEagain; // Make sure the read doesn't contain N's (redundant in this particular case) if (allowNs == false) for (int32 i=0; i') || ((allowGaps == false) && (seq[i] == 'N'))) goto tryPEagain; // Read sequences from the ends. 
bool makeNormal = ((pNormal > 0.0) && (drand48() < pNormal)); if (makeSequences(seq + bgn, len, readLen, s1, q1, s2, q2, makeNormal) == false) goto tryPEagain; // Make sure the reads don't contain N's if (allowNs == false) for (int32 i=0; i') || ((allowGaps == false) && (seq[i] == 'N'))) goto tryMPagain; // If we fail the mpEnrichment test, pick a random shearing and return PE reads. // Otherwise, rotate the sequence to circularize and return MP reads. if (mpEnrichment < drand48()) { // Failed to wash away non-biotin marked sequence, make PE int32 sbgn = bgn + randomUniform(0, len - slen); bool makeNormal = ((pNormal > 0.0) && (drand48() < pNormal)); if (makeSequences(seq + sbgn, slen, readLen, s1, q1, s2, q2, makeNormal) == false) goto tryMPagain; // Make sure the reads don't contain N's if (allowNs == false) for (int32 i=0; i= slen)) fprintf(stderr, "ERROR: invalid shift %d.\n", shift); assert(shift > 0); assert(shift < slen); // Put 'shift' bases from the end of the insert on the start of sh[], // and then fill the remaining of sh[] with the beginning of the insert. // // sh[] == [------>END] [BGN--------->] int32 pInsert = bgn; int32 pShift = shift; while (pShift < slen) // Copy BGN---> sh[pShift++] = seq[pInsert++]; pInsert = bgn + len - shift; pShift = 0; while (pInsert < bgn + len) // Copy --->END sh[pShift++] = seq[pInsert++]; assert(pShift == shift); sh[slen] = 0; bool makeNormal = ((pNormal > 0.0) && (drand48() < pNormal)); if (makeSequences(sh, slen, readLen, s1, q1, s2, q2, makeNormal) == false) goto tryMPagain; // Make sure the reads don't contain N's if (allowNs == false) for (int32 i=0; i= slen - readLen) type = (type == 'a') ? 'c' : 'b'; if (mpJunctions == mpJunctionsNone) assert(type == 't'); if (mpJunctions == mpJunctionsAlways) assert(type != 't'); // Add a marker for the chimeric point. This unfortunately includes some knowledge of // makeSequences(); the second sequence is reverse complemented. 
In that case, adjust shift // to the position in that reverse complemented read. // if ((shift > 0) && (shift < readLen)) { q1[shift-1] = QV_BASE + 10; q1[shift-0] = QV_BASE + 10; } if ((shift > slen - readLen) && (shift < slen)) { assert((readLen - (shift + readLen - slen)) > 0); assert((readLen - (shift + readLen - slen)) < readLen); shift = readLen - (shift + readLen - slen); q2[shift - 1] = QV_BASE + 10; q2[shift - 0] = QV_BASE + 10; } // Output sequences, with a descriptive ID. Because bowtie2 removes /1 and /2 when the // mate maps concordantly, we no longer use that form. fprintf(outputI, "@%cMP%s_%d_%d@%d-%d_%d/%d/%d#1\n", type, (makeNormal) ? "normal" : "", np, idx, bgn, bgn+len, shift, slen, bgn+len-shift); fprintf(outputI, "%s\n", s1); fprintf(outputI, "+\n"); fprintf(outputI, "%s\n", q1); fprintf(outputI, "@%cMP%s_%d_%d@%d-%d_%d/%d/%d#2\n", type, (makeNormal) ? "normal" : "", np, idx, bgn, bgn+len, shift, slen, bgn+len-shift); fprintf(outputI, "%s\n", s2); fprintf(outputI, "+\n"); fprintf(outputI, "%s\n", q2); fprintf(output1, "@%cMP%s_%d_%d@%d-%d_%d/%d/%d#1\n", type, (makeNormal) ? "normal" : "", np, idx, bgn, bgn+len, shift, slen, bgn+len-shift); fprintf(output1, "%s\n", s1); fprintf(output1, "+\n"); fprintf(output1, "%s\n", q1); fprintf(output2, "@%cMP%s_%d_%d@%d-%d_%d/%d/%d#2\n", type, (makeNormal) ? "normal" : "", np, idx, bgn, bgn+len, shift, slen, bgn+len-shift); fprintf(output2, "%s\n", s2); fprintf(output2, "+\n"); fprintf(output2, "%s\n", q2); reverseComplement(s1, q1, readLen); reverseComplement(s2, q2, readLen); fprintf(outputC, "@%cMP%s_%d_%d@%d-%d_%d/%d/%d#1\n", type, (makeNormal) ? "normal" : "", np, idx, bgn+len, bgn, shift, slen, bgn+len-shift); fprintf(outputC, "%s\n", s1); fprintf(outputC, "+\n"); fprintf(outputC, "%s\n", q1); fprintf(outputC, "@%cMP%s_%d_%d@%d-%d_%d/%d/%d#2\n", type, (makeNormal) ?
"normal" : "", np, idx, bgn+len, bgn, shift, slen, bgn+len-shift); fprintf(outputC, "%s\n", s2); fprintf(outputC, "+\n"); fprintf(outputC, "%s\n", q2); } //if ((np % 1000) == 0) // fprintf(stderr, "%9d / %9d - %5.2f%%\r", np, numPairs, 100.0 * np / numPairs); } delete [] s1; delete [] q1; delete [] s2; delete [] q2; delete [] sh; } void makeCC(char *seq, int32 seqLen, FILE *outputI, FILE *UNUSED(outputC), int32 readLen, int32 numReads, int32 ccJunkSize, int32 ccJunkStdDev, double ccFalse) { char acgt[4] = { 'A', 'C', 'G', 'T' }; char *s1 = new char [readLen + 1]; char *q1 = new char [readLen + 1]; for (int32 nr=0; nr readLen - 80) goto tryCCagain; int32 lenf = randomUniform(1, readLen - lenj); int32 lenr = readLen - lenj - lenf; if ((lenf < 1) || (lenr < 1)) goto tryCCagain; int32 bgnf = randomUniform(1, seqLen - readLen); int32 idxf = findSequenceIndex(bgnf); int32 zerf = seqStartPositions[idxf]; int32 bgnr = randomUniform(1, seqLen - readLen); int32 idxr = findSequenceIndex(bgnr); int32 zerr = seqStartPositions[idxr]; bool isFalse = false; if (ccFalse < drand48()) { bgnr = bgnf + readLen - lenr; idxr = findSequenceIndex(bgnr); zerr = seqStartPositions[idxr]; isFalse = true; } // Scan the sequences, if we spanned a sequence break or encounter a block of Ns, don't use this pair if (idxf != idxr) goto tryCCagain; if (allowNs == false) for (int32 i=bgnf; i') || ((allowGaps == false) && (seq[i] == 'N'))) goto tryCCagain; if (allowNs == false) for (int32 i=bgnr; i') || ((allowGaps == false) && (seq[i] == 'N'))) goto tryCCagain; // Generate the sequence. if ((makeSequences(seq + bgnf, 0, lenf, s1, q1, NULL, NULL) == false) || (makeSequences(seq + bgnr, 0, lenr, s1 + readLen - lenr, q1 + readLen - lenr, NULL, NULL) == false)) goto tryCCagain; // Load the read with random garbage. 
for (int32 i=lenf; i= argc) { fprintf(stderr, "Not enough args to -pe.\n"); err++; } else { peEnable = true; peShearSize = atoi(argv[++arg]); peShearStdDev = atoi(argv[++arg]); } } else if (strcmp(argv[arg], "-mp") == 0) { if (arg + 5 >= argc) { fprintf(stderr, "Not enough args to -mp.\n"); err++; } else { mpEnable = true; mpInsertSize = atoi(argv[++arg]); mpInsertStdDev = atoi(argv[++arg]); mpShearSize = atoi(argv[++arg]); mpShearStdDev = atoi(argv[++arg]); mpEnrichment = atof(argv[++arg]); } } else if (strcmp(argv[arg], "-cc") == 0) { if (arg + 3 >= argc) { fprintf(stderr, "Not enough args to -cc.\n"); err++; } else { ccEnable = true; ccJunkSize = atoi(argv[++arg]); ccJunkStdDev = atoi(argv[++arg]); ccFalse = atof(argv[++arg]); } } else if (strcmp(argv[arg], "-seed") == 0) { seed = atoi(argv[++arg]); } else { fprintf(stderr, "Unknown arg '%s'\n", argv[arg]); err++; } arg++; } if ((err) || (fastaName == NULL) || (outputPrefix == NULL) || ((seEnable == false) && (peEnable == false) && (mpEnable == false) && (ccEnable == false)) || ((seEnable == true) && (cloneCoverage > 0))) { fprintf(stderr, "usage: %s -f reference.fasta -o output-prefix -l read-length ....\n", argv[0]); fprintf(stderr, " -f ref.fasta Use sequences in ref.fasta as the genome.\n"); fprintf(stderr, " -o name Create outputs name.1.fastq and name.2.fastq (and maybe others).\n"); fprintf(stderr, " -l len Create reads of length 'len' bases.\n"); fprintf(stderr, " -n n Create 'n' reads (for -se) or 'n' pairs of reads (for -pe and -mp).\n"); fprintf(stderr, " -x read-cov Set 'np' to create reads that sample the genome to 'read-cov' read coverage.\n"); fprintf(stderr, " -X clone-cov Set 'np' to create reads that sample the genome to 'clone-cov' clone coverage.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -em err Reads will contain fraction mismatch error 'e' (0.01 == 1%% error).\n"); fprintf(stderr, " -ei err Reads will contain fraction insertion error 'e' (0.01 == 1%% error).\n"); fprintf(stderr, " -ed 
err Reads will contain fraction deletion error 'e' (0.01 == 1%% error).\n"); fprintf(stderr, "\n"); fprintf(stderr, " -seed s Seed randomness with 32-bit integer s.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -allowgaps Allow pairs to span N regions in the reference. By default, pairs\n"); fprintf(stderr, " are not allowed to span a gap. Reads are never allowed to cover N's.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -allowns Allow reads to contain N regions. Implies -allowgaps\n"); fprintf(stderr, "\n"); fprintf(stderr, " -nojunction For -mp, do not create chimeric junction reads. Create only fully PE or\n"); fprintf(stderr, " fully MP reads.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -normal p Output a normal-oriented (both forward or both reverse) pair with\n"); fprintf(stderr, " probability p. Only for -pe and -mp.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -se\n"); fprintf(stderr, " Create single-end reads.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -cc junkSize junkStdDev false\n"); fprintf(stderr, " Create chimeric single-end reads. The chimer is formed from two uniformly\n"); fprintf(stderr, " distributed positions in the reference. Some amount of random junk is inserted\n"); fprintf(stderr, " at the junction. With probability 'false' the read is not chimeric, but still\n"); fprintf(stderr, " the junk bases inserted in the middle.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -pe shearSize shearStdDev\n"); fprintf(stderr, " Create paired-end reads, from fragments of size 'shearSize +- shearStdDev'.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -mp insertSize insertStdDev shearSize shearStdDev enrichment\n"); fprintf(stderr, " Create mate-pair reads. The pairs will be 'insertSize +- insertStdDev'\n"); fprintf(stderr, " apart. The circularized insert is then sheared into fragments of size\n"); fprintf(stderr, " 'shearSize +- shearStdDev'. 
With probability 'enrichment' the fragment\n"); fprintf(stderr, " containing the junction is used to form the pair of reads. The junction\n"); fprintf(stderr, " location is uniformly distributed through this fragment.\n"); fprintf(stderr, " Reads are labeled as:\n"); fprintf(stderr, " tMP - a MP pair\n"); fprintf(stderr, " fMP - a PE pair\n"); fprintf(stderr, " aMP - a MP pair with junction in the first read\n"); fprintf(stderr, " bMP - a MP pair with junction in the second read\n"); fprintf(stderr, " cMP - a MP pair with junction in both reads (the reads overlap)\n"); fprintf(stderr, "\n"); fprintf(stderr, "Output QV's are the Sanger spec.\n"); fprintf(stderr, "\n"); if (fastaName == NULL) fprintf(stderr, "ERROR: No fasta file (-f) supplied.\n"); if (outputPrefix == NULL) fprintf(stderr, "ERROR: No output prefix (-o) supplied.\n"); if ((seEnable == false) && (peEnable == false) && (mpEnable == false) && (ccEnable == false)) fprintf(stderr, "ERROR: No type (-se or -pe or -mp) selected.\n"); if ((seEnable == true) && (cloneCoverage > 0)) fprintf(stderr, "ERROR: Can't sample clone coverage with single-ended (-se) reads.\n"); exit(1); } if ((readMismatchRate < 0.0) || (readMismatchRate > 1.0)) err++; if ((readInsertRate < 0.0) || (readInsertRate > 1.0)) err++; if ((readDeleteRate < 0.0) || (readDeleteRate > 1.0)) err++; if (readMismatchRate + readInsertRate + readDeleteRate > 1.0) err++; if (err > 0) { fprintf(stderr, "Invalid error rates.\n"); exit(1); } // // Initialize // // The errorBase[0] assignments permit makeSequenceError() to return a result // when the end of genome is hit. This is caught later in makeSequence() and the // read is aborted. 
fprintf(stderr, "seed = " F_U64 "\n", seed); srand48(seed); memset(revComp, '&', sizeof(char) * 256); revComp['A'] = 'T'; revComp['C'] = 'G'; revComp['G'] = 'C'; revComp['T'] = 'A'; revComp['N'] = 'N'; memset(errorBase, '*', sizeof(char) * 256 * 3); errorBase[ 0 ][0] = 0 ; errorBase[ 0 ][1] = 0 ; errorBase[ 0 ][2] = 0 ; errorBase['A'][0] = 'C'; errorBase['A'][1] = 'G'; errorBase['A'][2] = 'T'; errorBase['C'][0] = 'A'; errorBase['C'][1] = 'G'; errorBase['C'][2] = 'T'; errorBase['G'][0] = 'A'; errorBase['G'][1] = 'C'; errorBase['G'][2] = 'T'; errorBase['T'][0] = 'A'; errorBase['T'][1] = 'C'; errorBase['T'][2] = 'G'; errorBase['N'][0] = 'N'; errorBase['N'][1] = 'N'; errorBase['N'][2] = 'N'; memset(insertBase, '*', sizeof(char) * 4); insertBase[0] = 'A'; insertBase[1] = 'C'; insertBase[2] = 'G'; insertBase[3] = 'T'; memset(validBase, 0, sizeof(char) * 256); validBase['A'] = 1; validBase['C'] = 1; validBase['G'] = 1; validBase['T'] = 1; // // Open output files, failing quickly. // errno = 0; if ((seEnable == true) || (ccEnable == true)) { snprintf(outputName, FILENAME_MAX, "%s.s.fastq", outputPrefix); outputI = fopen(outputName, "w"); if (errno) fprintf(stderr, "Failed to open output file '%s': %s\n", outputName, strerror(errno)), exit(1); } if ((seEnable == false) && (ccEnable == false)) { snprintf(outputName, FILENAME_MAX, "%s.i.fastq", outputPrefix); outputI = fopen(outputName, "w"); if (errno) fprintf(stderr, "Failed to open output file '%s': %s\n", outputName, strerror(errno)), exit(1); snprintf(outputName, FILENAME_MAX, "%s.c.fastq", outputPrefix); outputC = fopen(outputName, "w"); if (errno) fprintf(stderr, "Failed to open output file '%s': %s\n", outputName, strerror(errno)), exit(1); } if (peEnable || mpEnable) { snprintf(outputName, FILENAME_MAX, "%s.1.fastq", outputPrefix); output1 = fopen(outputName, "w"); if (errno) fprintf(stderr, "Failed to open output file '%s': %s\n", outputName, strerror(errno)), exit(1); snprintf(outputName, FILENAME_MAX, 
"%s.2.fastq", outputPrefix); output2 = fopen(outputName, "w"); if (errno) fprintf(stderr, "Failed to open output file '%s': %s\n", outputName, strerror(errno)), exit(1); } // // Load all reference sequences into a single string. Separate different sequences with a '>'; we'll not make // fragments that span these markers (inefficiently, sigh). // fastaFile = fopen(fastaName, "r"); if (errno) fprintf(stderr, "Failed to open fasta file '%s': %s\n", fastaName, strerror(errno)), exit(1); numSeq = 0; seqMax = AS_UTL_sizeOfFile(fastaName); seqLen = 0; seq = new char [seqMax + 1]; memset(seq, 0, sizeof(char) * seqMax); uint32 nInvalid = 0; while (!feof(fastaFile)) { fgets(seq + seqLen, seqMax - seqLen, fastaFile); if (seq[seqLen] == '>') { numSeq++; seqLen++; seqStartPositions.push_back(seqLen); continue; } for (; ((seq[seqLen] != '\n') && (seq[seqLen] != '\r') && (seq[seqLen] != 0)); seqLen++) { seq[seqLen] = toupper(seq[seqLen]); if ((seq[seqLen] != 'N') && (validBase[seq[seqLen]] == 0)) { nInvalid++; //fprintf(stderr, "Replace invalid base '%c' at position %u.\n", seq[seqLen], seqLen); seq[seqLen] = insertBase[randomUniform(0, 3)]; //q1[p] = (validBase[s1[p]]) ?
QV_BASE + 8 : QV_BASE + 2; } } assert(seqLen < seqMax); } fclose(fastaFile); seq[seqLen] = 0; assert(numSeq == seqStartPositions.size()); fprintf(stderr, "Loaded %u sequences of length %d, with %u invalid bases fixed.\n", numSeq, seqLen - numSeq, nInvalid); if ((numSeq == 0) || (seqLen == 0)) fprintf(stderr, "ERROR: No sequences or bases loaded, can't simulate reads.\n"), exit(1); // // If requested, compute the number of pairs to get a desired X of coverage // { uint32 cloneSize = 0; uint32 cloneStdDev = 0; uint32 readNumReads = UINT32_MAX; uint32 readNumPairs = UINT32_MAX; uint32 cloneNumReads = UINT32_MAX; uint32 cloneNumPairs = UINT32_MAX; if (peEnable) { cloneSize = peShearSize; cloneStdDev = peShearStdDev; } if (mpEnable) { cloneSize = mpInsertSize; cloneStdDev = mpInsertStdDev; } if (ccEnable) { cloneSize = ccJunkSize; cloneStdDev = ccJunkStdDev; } if (readCoverage > 0) { readNumReads = (uint32)floor(readCoverage * (seqLen - numSeq) / readLen); readNumPairs = readNumReads / 2; } if ((cloneCoverage > 0) && (seEnable == false)) { cloneNumPairs = (uint32)floor(cloneCoverage * (seqLen - numSeq) / cloneSize); cloneNumReads = cloneNumPairs * 2; } numReads = MIN(numReads, readNumReads); numPairs = MIN(numPairs, readNumPairs); numReads = MIN(numReads, cloneNumReads); numPairs = MIN(numPairs, cloneNumPairs); if (seEnable) fprintf(stderr, "Generate %.2f X read coverage of a %dbp genome with %u %dbp reads.\n", (double)numReads * readLen / seqLen, seqLen - numSeq, numReads, readLen); else fprintf(stderr, "Generate %.2f X read (%.2f X clone) coverage of a %dbp genome with %u pairs of %dbp reads from a clone of %d +- %dbp.\n", (double)numReads * readLen / seqLen, (double)numPairs * cloneSize / seqLen, seqLen - numSeq, numPairs, readLen, cloneSize, cloneStdDev); } // // // if (seEnable) makeSE(seq, seqLen, outputI, outputC, readLen, numReads); if (peEnable) makePE(seq, seqLen, outputI, outputC, output1, output2, readLen, numPairs, peShearSize, peShearStdDev); if (mpEnable) 
makeMP(seq, seqLen, outputI, outputC, output1, output2, readLen, numPairs, mpInsertSize, mpInsertStdDev, mpShearSize, mpShearStdDev, mpEnrichment, mpJunctions); if (ccEnable) makeCC(seq, seqLen, outputI, outputC, readLen, numReads, ccJunkSize, ccJunkStdDev, ccFalse); // // // if ((seEnable == true) || (ccEnable == true)) fclose(outputI); if ((seEnable == false) && (ccEnable == false)) { fclose(outputI); fclose(outputC); } if (peEnable || mpEnable) { fclose(output1); fclose(output2); } delete [] seq; fprintf(stderr, "\n"); fprintf(stderr, "Number of reads with:\n"); fprintf(stderr, " nNoChange = " F_U64 "\n", nNoChange); fprintf(stderr, " nMismatch = " F_U64 "\n", nMismatch); fprintf(stderr, " nInsert = " F_U64 "\n", nInsert); fprintf(stderr, " nDelete = " F_U64 "\n", nDelete); exit(0); } canu-1.6/src/fastq-utilities/fastqSimulate.mk000066400000000000000000000007331314437614700214140ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := fastqSimulate SOURCES := fastqSimulate.C SRC_INCDIRS := .. ../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/gfa/000077500000000000000000000000001314437614700136445ustar00rootroot00000000000000canu-1.6/src/gfa/alignGFA.C000066400000000000000000000660501314437614700153670ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2017-APR-04 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "tgStore.H" #include "edlib.H" #include "splitToWords.H" #include "AS_UTL_reverseComplement.H" #include "gfa.H" #include "bed.H" #define IS_GFA 1 #define IS_BED 2 class sequence { public: sequence() { seq = NULL; len = 0; }; ~sequence() { delete [] seq; }; void set(tgTig *tig) { len = tig->length(false); seq = new char [len + 1]; memcpy(seq, tig->bases(false), len); seq[len] = 0; }; char *seq; uint32 len; }; class sequences { public: sequences(char *tigName, uint32 tigVers) { tgStore *tigStore = new tgStore(tigName, tigVers); b = 0; e = tigStore->numTigs(); seqs = new sequence [e+1]; used = new uint32 [e+1]; for (uint32 ti=b; ti < e; ti++) { tgTig *tig = tigStore->loadTig(ti); used[ti] = 0; if (tig == NULL) continue; seqs[ti].set(tig); tigStore->unloadTig(ti); } delete tigStore; }; ~sequences() { delete [] seqs; delete [] used; }; sequence &operator[](uint32 xx) { if (xx < e) return(seqs[xx]); fprintf(stderr, "ERROR: sequence id %u out of range b=%u e=%u\n", xx, b, e); assert(xx < e); return(seqs[0]); }; uint32 b; uint32 e; sequence *seqs; uint32 *used; }; void dotplot(uint32 Aid, bool Afwd, char *Aseq, uint32 Bid, bool Bfwd, char *Bseq) { char Aname[128], Afile[128]; char Bname[128], Bfile[128]; char Pname[128], Pfile[128]; FILE *F; sprintf(Aname, "tig%08u%c", Aid, (Afwd) ? 
'+' : '-'); sprintf(Afile, "tig%08u%c.fasta", Aid, (Afwd) ? '+' : '-'); sprintf(Bname, "tig%08u%c", Bid, (Bfwd) ? '+' : '-'); sprintf(Bfile, "tig%08u%c.fasta", Bid, (Bfwd) ? '+' : '-'); sprintf(Pname, "plot-%s-%s", Aname, Bname); sprintf(Pfile, "plot-%s-%s.sh", Aname, Bname); F = fopen(Pfile, "w"); fprintf(F, "#!/bin/sh\n"); fprintf(F, "\n"); fprintf(F, "nucmer --maxmatch --nosimplify -p %s %s.fasta %s.fasta\n", Pname, Aname, Bname); fprintf(F, "show-coords -l -o -r -T %s.delta | expand -t 8 > %s.coords\n", Pname, Pname); fprintf(F, "mummerplot --fat -t png -p %s %s.delta\n", Pname, Pname); fprintf(F, "echo mummerplot --fat -p %s %s.delta\n", Pname, Pname); fclose(F); F = fopen(Afile, "w"); fprintf(F, ">%s\n%s\n", Aname, Aseq); fclose(F); F = fopen(Bfile, "w"); fprintf(F, ">%s\n%s\n", Bname, Bseq); fclose(F); sprintf(Pfile, "sh plot-%s-%s.sh", Aname, Bname); system(Pfile); } bool checkLink(gfaLink *link, sequences &seqs, bool beVerbose, bool doPlot) { char *Aseq = seqs[link->_Aid].seq, *Arev = NULL; char *Bseq = seqs[link->_Bid].seq, *Brev = NULL; int32 Abgn, Aend, Alen = seqs[link->_Aid].len; int32 Bbgn, Bend, Blen = seqs[link->_Bid].len; EdlibAlignResult result = { 0, NULL, NULL, 0, NULL, 0, 0 }; int32 AalignLen = 0; int32 BalignLen = 0; int32 editDist = 0; int32 alignLen = 0; int32 maxEdit = 0; // NOTE! edlibAlign calls the 'A' sequence the 'query' and // the 'B' sequence the 'target' (aka, 'reference'). link->alignmentLength(AalignLen, BalignLen, alignLen); // Regardless of whether we find a new alignment or not, remove the old one. // If we don't find a new one, we'll discard the link. delete [] link->_cigar; link->_cigar = NULL; if (link->_Afwd == false) Aseq = Arev = reverseComplementCopy(Aseq, Alen); if (link->_Bfwd == false) Bseq = Brev = reverseComplementCopy(Bseq, Blen); // Try to find the end coordinate on B. Align the last bits of A to B. // // -------(---------] v--??
// [------------)------ // Abgn = max(Alen - AalignLen, 0); Aend = Alen; Bbgn = 0; Bend = min(Blen, (int32)(1.10 * BalignLen)); // Allow 25% gaps over what the GFA said? maxEdit = (int32)ceil(alignLen * 0.12); if (beVerbose) fprintf(stderr, "LINK tig%08u %c %17s tig%08u %c %17s Aalign %6u Balign %6u align %6u\n", link->_Aid, (link->_Afwd) ? '+' : '-', "", link->_Bid, (link->_Bfwd) ? '+' : '-', "", AalignLen, BalignLen, alignLen); if (beVerbose) fprintf(stderr, "TEST tig%08u %c %8d-%-8d tig%08u %c %8d-%-8d maxEdit=%6d (extend B)", link->_Aid, (link->_Afwd) ? '+' : '-', Abgn, Aend, link->_Bid, (link->_Bfwd) ? '+' : '-', Bbgn, Bend, maxEdit); result = edlibAlign(Aseq + Abgn, Aend-Abgn, // The 'query' Bseq + Bbgn, Bend-Bbgn, // The 'target' edlibNewAlignConfig(maxEdit, EDLIB_MODE_HW, EDLIB_TASK_LOC)); if (result.numLocations > 0) { if (beVerbose) fprintf(stderr, "\n"); Bend = Bbgn + result.endLocations[0] + 1; // 0-based to space-based edlibFreeAlignResult(result); } else { if (beVerbose) fprintf(stderr, " - FAILED\n"); } // Do the same for A. Aend and Bbgn never change; Bend was set above. // // ------(--------------] // ^--?? [-------]----------- // Abgn = max(Alen - (int32)(1.10 * AalignLen), 0); // Allow 25% gaps over what the GFA said? if (beVerbose) fprintf(stderr, " tig%08u %c %8d-%-8d tig%08u %c %8d-%-8d maxEdit=%6d (extend A)", link->_Aid, (link->_Afwd) ? '+' : '-', Abgn, Aend, link->_Bid, (link->_Bfwd) ? '+' : '-', Bbgn, Bend, maxEdit); // NEEDS to be MODE_HW because we need to find the suffix alignment. result = edlibAlign(Bseq + Bbgn, Bend-Bbgn, // The 'query' Aseq + Abgn, Aend-Abgn, // The 'target' edlibNewAlignConfig(maxEdit, EDLIB_MODE_HW, EDLIB_TASK_LOC)); if (result.numLocations > 0) { if (beVerbose) fprintf(stderr, "\n"); Abgn = Abgn + result.startLocations[0]; edlibFreeAlignResult(result); } else { if (beVerbose) fprintf(stderr, " - FAILED\n"); } // One more alignment, this time, with feeling - notice EDLIB_MODE_NW and EDLIB_TASK_PATH.
if (beVerbose) fprintf(stderr, " tig%08u %c %8d-%-8d tig%08u %c %8d-%-8d maxEdit=%6d (final)", link->_Aid, (link->_Afwd) ? '+' : '-', Abgn, Aend, link->_Bid, (link->_Bfwd) ? '+' : '-', Bbgn, Bend, maxEdit); result = edlibAlign(Aseq + Abgn, Aend-Abgn, Bseq + Bbgn, Bend-Bbgn, edlibNewAlignConfig(2 * maxEdit, EDLIB_MODE_NW, EDLIB_TASK_PATH)); bool success = false; if (result.numLocations > 0) { if (beVerbose) fprintf(stderr, "\n"); editDist = result.editDistance; alignLen = ((Aend - Abgn) + (Bend - Bbgn) + (editDist)) / 2; alignLen = result.alignmentLength; // Edlib 'alignmentLength' is populated only for TASK_PATH link->_cigar = edlibAlignmentToCigar(result.alignment, result.alignmentLength, EDLIB_CIGAR_STANDARD); edlibFreeAlignResult(result); success = true; } else { if (beVerbose) fprintf(stderr, " - FAILED\n"); } if (beVerbose) fprintf(stderr, " tig%08u %c %8d-%-8d tig%08u %c %8d-%-8d %.4f\n", link->_Aid, (link->_Afwd) ? '+' : '-', Abgn, Aend, link->_Bid, (link->_Bfwd) ? '+' : '-', Bbgn, Bend, (double)editDist / alignLen); // Make a plot. if ((success == false) && (doPlot == true)) dotplot(link->_Aid, link->_Afwd, Aseq, link->_Bid, link->_Bfwd, Bseq); // Cleanup for the next link. delete [] Arev; delete [] Brev; if (beVerbose) fprintf(stderr, "\n"); return(success); } // Align all of B into A. Extend A as needed to make the whole thing fit. // Abgn, Aend and score are updated with the alignment. // bool checkRecord_align(char *label, char *Aname, char *Aseq, int32 Alen, int32 &Abgn, int32 &Aend, char *Bname, char *Bseq, int32 Blen, int32 &score, bool beVerbose) { EdlibAlignResult result = { 0, NULL, NULL, 0, NULL, 0, 0 }; int32 editDist = 0; int32 alignLen = 0; int32 alignScore = 0; int32 maxEdit = (int32)ceil(Blen * 0.03); // Should be the same sequence, but allow for a little difference. 
int32 step = (int32)ceil(Blen * 0.15); Aend = min(Aend + 2 * step, Alen); // Limit Aend to the actual length of the contig (consensus can shrink) Abgn = max(Aend - Blen - 2 * step, 0); // Then push Abgn back to make space for the unitig. tryAgain: if (beVerbose) fprintf(stderr, "ALIGN %5s utg %s len=%7d to ctg %s %9d-%9d len=%9d", label, Bname, Blen, Aname, Abgn, Aend, Alen); #if 0 char N[FILENAME_MAX]; FILE *F; char ach = Aseq[Aend]; Aseq[Aend] = 0; char bch = Bseq[Bend]; Bseq[Bend] = 0; sprintf(N, "compare%04d-%04d-ctg%04d.fasta", record->_Aid, record->_Bid, record->_Aid); F = fopen(N, "w"); fprintf(F, ">ctg%04d\n%s\n", record->_Aid, Aseq + Abgn); fclose(F); sprintf(N, "compare%04d-%04d-utg%04d.fasta", record->_Aid, record->_Bid, record->_Bid); F = fopen(N, "w"); fprintf(F, ">utg%04d\n%s\n", record->_Bid, Bseq + Bbgn); fclose(F); Aseq[Aend] = ach; Bseq[Bend] = bch; #endif result = edlibAlign(Bseq, Blen, // The 'query' (unitig) Aseq + Abgn, Aend-Abgn, // The 'target' (contig) edlibNewAlignConfig(maxEdit, EDLIB_MODE_HW, EDLIB_TASK_LOC)); // Got an alignment? Process and report, and maybe try again. if (result.numLocations > 0) { int32 nAbgn = Abgn + result.startLocations[0]; int32 nAend = Abgn + result.endLocations[0] + 1; // 0-based to space-based char *cigar = NULL; editDist = result.editDistance; alignLen = ((nAend - nAbgn) + (Blen) + (editDist)) / 2; alignScore = 1000 - (int32)(1000.0 * editDist / alignLen); // If there's an alignment, we can get a cigar string and better alignment length. if ((result.alignment != NULL) && (result.alignmentLength > 0)) { cigar = edlibAlignmentToCigar(result.alignment, result.alignmentLength, EDLIB_CIGAR_STANDARD); alignLen = result.alignmentLength; } edlibFreeAlignResult(result); if (beVerbose) fprintf(stderr, " - POSITION from %9d-%-9d to %9d-%-9d score %5d/%9d = %4d%s%s\n", Abgn, Aend, nAbgn, nAend, editDist, alignLen, alignScore, (cigar != NULL) ? " align " : "", (cigar != NULL) ? 
cigar : ""); delete [] cigar; // If it's a full alignment -- if the A region was big enough to have unaligned bases -- then // we're done. Update the result and get out of here. if (((Abgn < nAbgn) || (Abgn == 0)) && ((nAend < Aend) || (Aend == Alen))) { Abgn = nAbgn; Aend = nAend; score = alignScore; return(true); } // Otherwise, we ran out of A sequence to align to before we ran out of stuff to align. Extend // the A region and try again. if (Abgn == nAbgn) Abgn = max(Abgn - step, 0); if (Aend == nAend) Aend = min(Aend + step, Alen); goto tryAgain; } // Didn't get a good alignment. // We fail for one of two reasons - either not enough bases in the reference, or too high of // error. Unitigs are supposed to be from the same sequence, but they might be lower coverage // and therefore higher error. It's more likely they are misplaced. if ((Aend - Abgn < 4 * Blen) && (maxEdit < Blen * 0.25)) { if (beVerbose) fprintf(stderr, " - FAILED, RELAX\n"); Abgn = max(Abgn - step, 0); Aend = min(Aend + step, Alen); maxEdit *= 1.2; goto tryAgain; } if (beVerbose) fprintf(stderr, " - ABORT, ABORT, ABORT!\n"); return(false); } bool checkRecord(bedRecord *record, sequences &ctgs, sequences &utgs, bool beVerbose, bool UNUSED(doPlot)) { char *Aseq = ctgs[record->_Aid].seq; char *Bseq = utgs[record->_Bid].seq, *Brev = NULL; int32 Abgn = record->_bgn; int32 Aend = record->_end; int32 Alen = ctgs[record->_Aid].len; int32 Blen = utgs[record->_Bid].len; bool success = true; int32 alignScore = 0; if (record->_Bfwd == false) Bseq = Brev = reverseComplementCopy(Bseq, Blen); // If Bseq (the unitig) is small, just align the full thing. if (Blen < 50000) { success &= checkRecord_align("ALL", record->_Aname, Aseq, Alen, Abgn, Aend, record->_Bname, Bseq, Blen, alignScore, beVerbose); } // Otherwise, we need to try to align only the ends of the unitig. 
// // -----------------------[AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA]------------------ // BBBBB..............BBBBB // else { int32 AbgnL = Abgn, AendL = Abgn + 50000; int32 AbgnR = Aend - 50000, AendR = Aend; char *BseqL = Bseq; char *BseqR = Bseq + Blen - 50000; #if 0 success &= checkRecord_align("ALL", record->_Aname, Aseq, Alen, Abgn, Aend, record->_Bname, Bseq, Blen, alignScore, beVerbose); #endif success &= checkRecord_align("LEFT", record->_Aname, Aseq, Alen, AbgnL, AendL, record->_Bname, BseqL, 50000, alignScore, beVerbose); success &= checkRecord_align("RIGHT", record->_Aname, Aseq, Alen, AbgnR, AendR, record->_Bname, BseqR, 50000, alignScore, beVerbose); Abgn = AbgnL; Aend = AendR; } delete [] Brev; // If successful, save the coordinates. Because we're usually not aligning the whole // unitig to the contig, we can't save the score. if (success) { record->_bgn = Abgn; record->_end = Aend; record->_score = 0; //alignScore; } return(success); } // // Try to find an alignment for each link in the GFA file. If found, output a new link // with correct CIGAR string. If not found, discard the link. // void processGFA(char *tigName, uint32 tigVers, char *inGFA, char *otGFA, uint32 verbosity) { // Load the GFA file. fprintf(stderr, "-- Reading GFA '%s'.\n", inGFA); gfaFile *gfa = new gfaFile(inGFA); fprintf(stderr, "-- Loading sequences from tigStore '%s' version %u.\n", tigName, tigVers); sequences *seqsp = new sequences(tigName, tigVers); sequences &seqs = *seqsp; // Set GFA lengths based on the sequences we loaded. fprintf(stderr, "-- Resetting sequence lengths.\n"); for (uint32 ii=0; ii_sequences.size(); ii++) gfa->_sequences[ii]->_length = seqs[gfa->_sequences[ii]->_id].len; // Align! uint32 passCircular = 0; uint32 failCircular = 0; uint32 passNormal = 0; uint32 failNormal = 0; uint32 iiLimit = gfa->_links.size(); uint32 iiNumThreads = omp_get_max_threads(); uint32 iiBlockSize = (iiLimit < 1000 * iiNumThreads) ? 
iiNumThreads : iiLimit / 999; fprintf(stderr, "-- Aligning " F_U32 " links using " F_U32 " threads.\n", iiLimit, iiNumThreads); #pragma omp parallel for schedule(dynamic, iiBlockSize) for (uint32 ii=0; ii_links[ii]; if (link->_Aid == link->_Bid) { if (verbosity > 0) fprintf(stderr, "Processing circular link for tig %u\n", link->_Aid); if (link->_Afwd != link->_Bfwd) fprintf(stderr, "WARNING: %s %c %s %c -- circular to the same end!?\n", link->_Aname, link->_Afwd ? '+' : '-', link->_Bname, link->_Bfwd ? '+' : '-'); bool pN = checkLink(link, seqs, (verbosity > 0), false); if (pN == true) passCircular++; else failCircular++; } // Now the usual case. else { if (verbosity > 0) fprintf(stderr, "Processing link between tig %u %s and tig %u %s\n", link->_Aid, link->_Afwd ? "-->" : "<--", link->_Bid, link->_Bfwd ? "-->" : "<--"); bool pN = checkLink(link, seqs, (verbosity > 0), false); if (pN == true) passNormal++; else failNormal++; } // If the cigar exists, we found an alignment. If not, delete the link. if (link->_cigar == NULL) { if (verbosity > 0) fprintf(stderr, " Failed to find alignment.\n"); delete gfa->_links[ii]; gfa->_links[ii] = NULL; } } fprintf(stderr, "-- Writing GFA '%s'.\n", otGFA); gfa->saveFile(otGFA); fprintf(stderr, "-- Cleaning up.\n"); delete seqsp; delete gfa; fprintf(stderr, "-- Aligned %6u circular tigs, failed %6u\n", passCircular, failCircular); fprintf(stderr, "-- Aligned %6u linear tigs, failed %6u\n", passNormal, failNormal); } // // Find an alignment between the unitig (the feature) and the contig (the 'chromosome'). // Output updated coordinates. // void processBED(char *tigName, uint32 tigVers, char *seqName, uint32 seqVers, char *inBED, char *otBED, uint32 verbosity) { // Load the BED file.
fprintf(stderr, "-- Reading BED '%s'.\n", inBED); bedFile *bed = new bedFile(inBED); fprintf(stderr, "-- Loading sequences from tigStore '%s' version %u.\n", tigName, tigVers); sequences *utgsp = new sequences(tigName, tigVers); sequences &utgs = *utgsp; fprintf(stderr, "-- Loading sequences from tigStore '%s' version %u.\n", seqName, seqVers); sequences *ctgsp = new sequences(seqName, seqVers); sequences &ctgs = *ctgsp; // Align! uint32 pass = 0; uint32 fail = 0; uint32 iiLimit = bed->_records.size(); uint32 iiNumThreads = omp_get_max_threads(); uint32 iiBlockSize = (iiLimit < 1000 * iiNumThreads) ? iiNumThreads : iiLimit / 999; fprintf(stderr, "-- Aligning " F_U32 " records using " F_U32 " threads.\n", iiLimit, iiNumThreads); #pragma omp parallel for schedule(dynamic, iiBlockSize) for (uint32 ii=0; ii_records[ii]; if (checkRecord(record, ctgs, utgs, (verbosity > 0), false)) { pass++; } else { delete bed->_records[ii]; bed->_records[ii] = NULL; fail++; } } fprintf(stderr, "-- Writing BED '%s'.\n", otBED); bed->saveFile(otBED); fprintf(stderr, "-- Cleaning up.\n"); delete utgsp; delete ctgsp; delete bed; fprintf(stderr, "-- Aligned %6u unitigs to contigs, failed %6u\n", pass, fail); } // // Infer a graph from the positions of unitigs (features) in contigs (chromosomes). Generate a GFA // input and toss that up to processGFA. // void processBEDtoGFA(char *tigName, uint32 tigVers, char *inBED, char *otGFA, uint32 verbosity) { int32 minOlap = 100; // We only really need the sequence lengths here, but eventually, we'll want to generate // alignments for all the overlaps, and so we'll need the sequences too. fprintf(stderr, "-- Loading sequences from tigStore '%s' version %u.\n", tigName, tigVers); sequences *seqsp = new sequences(tigName, tigVers); sequences &seqs = *seqsp; // Load the BED file and allocate an output GFA. 
fprintf(stderr, "-- Reading BED '%s'.\n", inBED); bedFile *bed = new bedFile(inBED); gfaFile *gfa = new gfaFile("H\tVN:Z:bogart/edges"); // Iterate over sequences, looking for overlaps in contigs. Stupid, O(n^2) but seems fast enough. uint32 iiLimit = bed->_records.size(); uint32 iiNumThreads = omp_get_max_threads(); uint32 iiBlockSize = (iiLimit < 1000 * iiNumThreads) ? iiNumThreads : iiLimit / 999; fprintf(stderr, "-- Aligning " F_U32 " records using " F_U32 " threads.\n", iiLimit, iiNumThreads); #pragma omp parallel for schedule(dynamic, iiBlockSize) for (uint64 ii=0; ii<bed->_records.size(); ii++) { for (uint64 jj=ii+1; jj<bed->_records.size(); jj++) { if (bed->_records[ii]->_Aid != bed->_records[jj]->_Aid) // Different contigs? continue; // No overlap. if ((bed->_records[ii]->_end < bed->_records[jj]->_bgn + minOlap) || // No (thick) intersection? (bed->_records[jj]->_end < bed->_records[ii]->_bgn + minOlap)) // continue; // No overlap. // Overlap! //fprintf(stderr, "OVERLAP %s %d-%d - %s %d-%d\n", // bed->_records[ii]->_Bname, bed->_records[ii]->_bgn, bed->_records[ii]->_end, // bed->_records[jj]->_Bname, bed->_records[jj]->_bgn, bed->_records[jj]->_end); int32 olapLen = 0; if (bed->_records[ii]->_bgn < bed->_records[jj]->_end) olapLen = bed->_records[ii]->_end - bed->_records[jj]->_bgn; if (bed->_records[jj]->_bgn < bed->_records[ii]->_end) olapLen = bed->_records[jj]->_end - bed->_records[ii]->_bgn; assert(olapLen > 0); char cigar[81]; sprintf(cigar, "%dM", olapLen); gfaLink *link = new gfaLink(bed->_records[ii]->_Bname, bed->_records[ii]->_Bid, true, bed->_records[jj]->_Bname, bed->_records[jj]->_Bid, true, cigar); bool pN = checkLink(link, seqs, (verbosity > 0), false); #pragma omp critical { if (pN) gfa->_links.push_back(link); else gfa->_links.push_back(link); // Remember sequences we've hit. seqs.used[bed->_records[ii]->_Bid]++; seqs.used[bed->_records[jj]->_Bid]++; } } } // Add sequences.
We could have done this as we're running through making edges, but we then // need to figure out if we've seen a sequence already. char seqName[80]; for (uint32 ii=0; ii 0) { sprintf(seqName, "utg%08u", ii); gfa->_sequences.push_back(new gfaSequence(seqName, ii, seqs[ii].len)); } // Write the file, cleanup, done! gfa->saveFile(otGFA); delete gfa; delete bed; } int main (int argc, char **argv) { char *tigName = NULL; // For GFA and BED, the source of the tigs uint32 tigVers = UINT32_MAX; char *seqName = NULL; // For BED, the source of the 'chromosomes' uint32 seqVers = UINT32_MAX; // The -C option (either chromosome or container) char *inGraph = NULL; char *otGraph = NULL; uint32 graphType = IS_GFA; uint32 verbosity = 0; argc = AS_configure(argc, argv); int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-T") == 0) { tigName = argv[++arg]; tigVers = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-C") == 0) { seqName = argv[++arg]; seqVers = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-gfa") == 0) { graphType = IS_GFA; } else if (strcmp(argv[arg], "-bed") == 0) { graphType = IS_BED; } else if (strcmp(argv[arg], "-i") == 0) { inGraph = argv[++arg]; } else if (strcmp(argv[arg], "-o") == 0) { otGraph = argv[++arg]; } else if (strcmp(argv[arg], "-V") == 0) { verbosity++; } else if (strcmp(argv[arg], "-t") == 0) { omp_set_num_threads(atoi(argv[++arg])); } else { fprintf(stderr, "%s: Unknown option '%s'\n", argv[0], argv[arg]); err++; } arg++; } if (tigName == NULL) err++; if (inGraph == NULL) err++; if (otGraph == NULL) err++; if ((tigName) && (tigVers == 0)) err++; if ((seqName) && (seqVers == 0)) err++; if (err) { fprintf(stderr, "usage: %s [opts]\n", argv[0]); fprintf(stderr, " Validates a GFA by generating alignments.\n"); fprintf(stderr, " Optionally writes new GFA with updated CIGAR string (NOT IMPLEMENTED).\n"); fprintf(stderr, "\n"); fprintf(stderr, " -G g Load reads from gkStore 'g'.\n"); fprintf(stderr, " -T t v Load tigs from tgStore 't', 
version 'v'.\n"); fprintf(stderr, " -C t v For BED format, the source of the 'chromosomes'. Similar to -T.\n"); fprintf(stderr, " Consensus sequence must exist for -T and -C (usually in v=2)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -i input Input graph.\n"); fprintf(stderr, " -o output Output graph.\n"); fprintf(stderr, " Graphs are either GFA (v1) or BED format.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -gfa The input and output graphs are in GFA (v1) format.\n"); fprintf(stderr, " -bed The input graph is in BED format. If -C is supplied, the\n"); fprintf(stderr, " output will also be BED, and will have updated positions.\n"); fprintf(stderr, " If -C is not supplied, the output will be GFA (v1) of the\n"); fprintf(stderr, " overlaps inferred from the BED positions.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -V Increase chatter.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -t threads Use 'threads' computational threads.\n"); fprintf(stderr, "\n"); if (tigName == NULL) fprintf(stderr, "ERROR: no tigStore (-T) supplied.\n"); if (inGraph == NULL) fprintf(stderr, "ERROR: no input GFA (-i) supplied.\n"); if (otGraph == NULL) fprintf(stderr, "ERROR: no output GFA (-o) supplied.\n"); if ((tigName) && (tigVers == 0)) fprintf(stderr, "ERROR: invalid tigStore version (-T) supplied.\n"); if ((seqName) && (seqVers == 0)) fprintf(stderr, "ERROR: invalid tigStore version (-C) supplied.\n"); exit(1); } if (graphType == IS_GFA) processGFA(tigName, tigVers, inGraph, otGraph, verbosity); if ((graphType == IS_BED) && (seqName != NULL)) processBED(tigName, tigVers, seqName, seqVers, inGraph, otGraph, verbosity); if ((graphType == IS_BED) && (seqName == NULL)) processBEDtoGFA(tigName, tigVers, inGraph, otGraph, verbosity); fprintf(stderr, "Bye.\n"); exit(0); } canu-1.6/src/gfa/alignGFA.mk000066400000000000000000000007651314437614700156150ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build
directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := alignGFA SOURCES := alignGFA.C SRC_INCDIRS := .. ../AS_UTL ../stores ../overlapInCore/libedlib TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/gfa/bed.C000066400000000000000000000065371314437614700145150ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2017-MAY-12 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "AS_UTL_fileIO.H" #include "bed.H" // Search for canu-specific names, and convert to tigID's. 
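// A quick illustration of the conversion (hypothetical names; anything
// without a recognized 'tig', 'ctg' or 'utg' prefix maps to UINT32_MAX):
//
//   nameToCanuID("tig00000042")  ->  42
//   nameToCanuID("ctg1234")      ->  1234
//   nameToCanuID("scaffold_1")   ->  UINT32_MAX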
static uint32 nameToCanuID(char *name) { uint32 id = UINT32_MAX; if ((name[0] == 't') && (name[1] == 'i') && (name[2] == 'g')) id = strtoll(name + 3, NULL, 10); if ((name[0] == 'c') && (name[1] == 't') && (name[2] == 'g')) id = strtoll(name + 3, NULL, 10); if ((name[0] == 'u') && (name[1] == 't') && (name[2] == 'g')) id = strtoll(name + 3, NULL, 10); return(id); } bedRecord::bedRecord() { _Aname = NULL; _Aid = UINT32_MAX; _bgn = UINT32_MAX; _end = 0; _Bname = NULL; _Bid = UINT32_MAX; _score = 0; _Bfwd = false; } bedRecord::bedRecord(char *inLine) { load(inLine); } bedRecord::~bedRecord() { delete [] _Aname; delete [] _Bname; } void bedRecord::load(char *inLine) { splitToWords W(inLine); _Aname = new char [strlen(W[0]) + 1]; _Aid = UINT32_MAX; _bgn = W(1); _end = W(2); _Bname = new char [strlen(W[3]) + 1]; _Bid = UINT32_MAX; _score = W(4); _Bfwd = W[5][0] == '+'; strcpy(_Aname, W[0]); strcpy(_Bname, W[3]); _Aid = nameToCanuID(_Aname); // Search for canu-specific names, and convert to tigID's. _Bid = nameToCanuID(_Bname); } void bedRecord::save(FILE *outFile) { fprintf(outFile, "%s\t%d\t%d\t%s\t%u\t%c\n", _Aname, _bgn, _end, _Bname, _score, (_Bfwd == true) ? 
'+' : '-'); } bedFile::bedFile(char *inFile) { loadFile(inFile); } bedFile::~bedFile() { for (uint32 ii=0; ii<_records.size(); ii++) delete _records[ii]; } bool bedFile::loadFile(char *inFile) { FILE *F = NULL; char *L = NULL; uint32 Llen = 0; uint32 Lmax = 0; errno = 0; F = fopen(inFile, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", inFile, strerror(errno)), exit(1); while (AS_UTL_readLine(L, Llen, Lmax, F)) { _records.push_back(new bedRecord(L)); } fclose(F); delete [] L; fprintf(stderr, "bed: Loaded " F_S64 " records.\n", _records.size()); return(true); } bool bedFile::saveFile(char *outFile) { FILE *F = NULL; errno = 0; F = fopen(outFile, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", outFile, strerror(errno)), exit(1); for (uint32 ii=0; ii<_records.size(); ii++) if (_records[ii]) _records[ii]->save(F); fclose(F); return(true); } canu-1.6/src/gfa/bed.H000066400000000000000000000030651314437614700145130ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2017-MAY-12 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license.
*/ #ifndef AS_UTL_BED_H #define AS_UTL_BED_H #include "AS_global.H" #include "splitToWords.H" class bedRecord { public: bedRecord(); bedRecord(char *inLine); ~bedRecord(); void load(char *inLine); void save(FILE *outFile); public: char *_Aname; // The 'chromosome' uint32 _Aid; // Canu specific. int32 _bgn; int32 _end; char *_Bname; // The 'feature' uint32 _Bid; // Canu specific. uint32 _score; bool _Bfwd; }; class bedFile { public: bedFile(char *inFile); ~bedFile(); bool loadFile(char *inFile); bool saveFile(char *outFile); public: vector<bedRecord *> _records; }; #endif // AS_UTL_BED_H canu-1.6/src/gfa/gfa.C000066400000000000000000000210621314437614700145060ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2017-APR-04 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "AS_UTL_fileIO.H" #include "gfa.H" template <class TT> static bool findGFAtokenI(char *features, char *token, TT &value) { char *p = NULL; p = strstr(features, token); if (p == NULL) return(false); p += strlen(token); // Skip over the token... if (*p == ':') // ...and any :, if the user forgot to include it.
p++; value = (TT)strtoll(p, NULL, 10); //fprintf(stderr, "FOUND feature '%s' in '%s' -> '%s' %u\n", token, features, p, value); return(true); } // Search for canu-specific names, and convert to tigID's. // Allow either 'tig', 'utg' or 'ctg'. static uint32 nameToCanuID(char *name) { uint32 id = UINT32_MAX; if (((name[0] == 't') && (name[1] == 'i') && (name[2] == 'g')) || ((name[0] == 'u') && (name[1] == 't') && (name[2] == 'g')) || ((name[0] == 'c') && (name[1] == 't') && (name[2] == 'g'))) id = strtoll(name + 3, NULL, 10); return(id); } gfaSequence::gfaSequence() { _name = NULL; _id = UINT32_MAX; _sequence = NULL; _features = NULL; _length = 0; } gfaSequence::gfaSequence(char *inLine) { load(inLine); } gfaSequence::gfaSequence(char *name, uint32 id, uint32 len) { _name = new char [strlen(name) + 1]; _id = id; _sequence = NULL; _features = NULL; _length = len; strcpy(_name, name); } gfaSequence::~gfaSequence() { delete [] _name; delete [] _sequence; delete [] _features; } void gfaSequence::load(char *inLine) { splitToWords W(inLine); _name = new char [strlen(W[1]) + 1]; _id = UINT32_MAX; _sequence = new char [strlen(W[2]) + 1]; _features = new char [strlen(W[3]) + 1]; _length = 0; strcpy(_name, W[1]); strcpy(_sequence, W[2]); strcpy(_features, W[3]); // Scan the _features for a length. findGFAtokenI(_features, "LN:i:", _length); // And any canu ID _id = nameToCanuID(_name); } void gfaSequence::save(FILE *outFile) { fprintf(outFile, "S\t%s\t%s\tLN:i:%u\n", _name, _sequence ? 
_sequence : "*", _length); } gfaLink::gfaLink() { _Aname = NULL; _Aid = UINT32_MAX; _Afwd = false; _Bname = NULL; _Bid = UINT32_MAX; _Bfwd = false; _cigar = NULL; _features = NULL; } gfaLink::gfaLink(char *inLine) { load(inLine); } gfaLink::gfaLink(char *Aname, uint32 Aid, bool Afwd, char *Bname, uint32 Bid, bool Bfwd, char *cigar) { _Aname = new char [strlen(Aname) + 1]; _Aid = Aid; _Afwd = Afwd; _Bname = new char [strlen(Bname) + 1]; _Bid = Bid; _Bfwd = Bfwd; _cigar = new char [strlen(cigar) + 1]; _features = NULL; strcpy(_Aname, Aname); strcpy(_Bname, Bname); strcpy(_cigar, cigar); _Aid = nameToCanuID(_Aname); // Search for canu-specific names, and convert to tigID's. _Bid = nameToCanuID(_Bname); } gfaLink::~gfaLink() { delete [] _Aname; delete [] _Bname; delete [] _cigar; delete [] _features; } void gfaLink::load(char *inLine) { splitToWords W(inLine); _Aname = new char [strlen(W[1]) + 1]; _Aid = UINT32_MAX; _Afwd = W[2][0] == '+'; _Bname = new char [strlen(W[3]) + 1]; _Bid = UINT32_MAX; _Bfwd = W[4][0] == '+'; _cigar = new char [strlen(W[5]) + 1]; _features = new char [(W[6]) ? strlen(W[6]) + 1 : 1]; strcpy(_Aname, W[1]); strcpy(_Bname, W[3]); strcpy(_cigar, W[5]); strcpy(_features, (W[6]) ? W[6] : ""); _Aid = nameToCanuID(_Aname); // Search for canu-specific names, and convert to tigID's. _Bid = nameToCanuID(_Bname); } void gfaLink::save(FILE *outFile) { fprintf(outFile, "L\t%s\t%c\t%s\t%c\t%s\n", _Aname, (_Afwd == true) ? '+' : '-', _Bname, (_Bfwd == true) ? '+' : '-', (_cigar == NULL) ? 
"*" : _cigar); } void gfaLink::alignmentLength(int32 &queryLen, int32 &refceLen, int32 &alignLen) { char *cp = _cigar; refceLen = 0; // Bases on the reference involved in the alignment queryLen = 0; // Bases on the query involved in the alignment alignLen = 0; // Length of the alignment if (cp == NULL) return; if (*cp == '*') return; do { int64 val = strtoll(cp, &cp, 10); char code = *cp++; switch (code) { case 'M': // Alignment, either match or mismatch refceLen += val; queryLen += val; alignLen += val; break; case 'I': // Insertion to the reference - gap in query queryLen += val; alignLen += val; break; case 'D': // Deletion from the reference - gap in reference refceLen += val; alignLen += val; break; case 'N': // Skipped in the reference (e.g., intron) refceLen += val; fprintf(stderr, "warning - unsupported CIGAR code '%c' in '%s'\n", code, _cigar); break; case 'S': // Soft-clipped from the query - not part of the alignment fprintf(stderr, "warning - unsupported CIGAR code '%c' in '%s'\n", code, _cigar); break; case 'H': // Hard-clipped from the query - not part of the alignment, and removed from the read as input fprintf(stderr, "warning - unsupported CIGAR code '%c' in '%s'\n", code, _cigar); break; case 'P': // Padding - "silent deletion from padded reference" - ??? 
fprintf(stderr, "warning - unsupported CIGAR code '%c' in '%s'\n", code, _cigar); break; case '=': // Alignment, match refceLen += val; queryLen += val; alignLen += val; break; case 'X': // Alignment, mismatch refceLen += val; queryLen += val; alignLen += val; break; default: fprintf(stderr, "unknown CIGAR code '%c' in '%s'\n", code, _cigar); break; } } while (*cp != 0); } gfaFile::gfaFile() { _header = NULL; } gfaFile::gfaFile(char *inFile) { _header = NULL; if ((inFile[0] == 'H') && (inFile[1] == '\t')) { _header = new char [strlen(inFile) + 1]; strcpy(_header, inFile); } else { loadFile(inFile); } } gfaFile::~gfaFile() { delete [] _header; for (uint32 ii=0; ii<_sequences.size(); ii++) delete _sequences[ii]; for (uint32 ii=0; ii<_links.size(); ii++) delete _links[ii]; } bool gfaFile::loadFile(char *inFile) { FILE *F = NULL; char *L = NULL; uint32 Llen = 0; uint32 Lmax = 0; errno = 0; F = fopen(inFile, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", inFile, strerror(errno)), exit(1); while (AS_UTL_readLine(L, Llen, Lmax, F)) { char type = L[0]; if (L[1] != '\t') fprintf(stderr, "gfaFile::loadFile()-- malformed file; second letter must be tab in line '%s'\n", L), exit(1); if (type == 'H') { delete [] _header; _header = new char [Llen]; strcpy(_header, L+2); } else if (type == 'S') { _sequences.push_back(new gfaSequence(L)); } else if (type == 'L') { _links.push_back(new gfaLink(L)); } else { fprintf(stderr, "gfaFile::loadFile()-- unrecognized line '%s'\n", L), exit(1); } } fclose(F); delete [] L; fprintf(stderr, "gfa: Loaded " F_S64 " sequences and " F_S64 " links.\n", _sequences.size(), _links.size()); return(true); } bool gfaFile::saveFile(char *outFile) { FILE *F = NULL; errno = 0; F = fopen(outFile, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", outFile, strerror(errno)), exit(1); fprintf(F, "H\t%s\n", _header); for (uint32 ii=0; ii<_sequences.size(); ii++) if (_sequences[ii]) _sequences[ii]->save(F); for
(uint32 ii=0; ii<_links.size(); ii++) if (_links[ii]) _links[ii]->save(F); fclose(F); return(true); } canu-1.6/src/gfa/gfa.H000066400000000000000000000042221314437614700145120ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2017-APR-04 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef AS_UTL_GFA_H #define AS_UTL_GFA_H #include "AS_global.H" #include "splitToWords.H" // Features assumed to hold only the length, and we don't use it. class gfaSequence { public: gfaSequence(); gfaSequence(char *inLine); gfaSequence(char *name, uint32 id, uint32 len); ~gfaSequence(); void load(char *inLine); void save(FILE *outFile); public: char *_name; uint32 _id; char *_sequence; char *_features; uint32 _length; }; class gfaLink { public: gfaLink(); gfaLink(char *inLine); gfaLink(char *Aname, uint32 Aid, bool Afwd, char *Bname, uint32 Bid, bool Bfwd, char *cigar); ~gfaLink(); void load(char *inLine); void save(FILE *outFile); void alignmentLength(int32 &queryLen, int32 &refceLen, int32 &alignLen); public: char *_Aname; uint32 _Aid; // Canu specific. bool _Afwd; char *_Bname; uint32 _Bid; // Canu specific. 
bool _Bfwd; char *_cigar; char *_features; }; class gfaFile { public: gfaFile(); gfaFile(char *inFile); ~gfaFile(); bool loadFile(char *inFile); bool saveFile(char *outFile); public: char *_header; vector _sequences; vector _links; }; #endif // AS_UTL_GFA_H canu-1.6/src/main.mk000066400000000000000000000205101314437614700143620ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${DESTDIR})" "" DESTDIR := endif ifeq "$(strip ${PREFIX})" "" ifeq "$(strip ${DESTDIR})" "" PREFIX := $(realpath ..) else PREFIX := /canu endif endif ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := $(DESTDIR)$(PREFIX)/$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := $(DESTDIR)$(PREFIX)/$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := libcanu.a SOURCES := AS_global.C \ \ AS_UTL/AS_UTL_decodeRange.C \ AS_UTL/AS_UTL_fasta.C \ AS_UTL/AS_UTL_fileIO.C \ AS_UTL/AS_UTL_reverseComplement.C \ AS_UTL/AS_UTL_stackTrace.C \ \ AS_UTL/AS_UTL_alloc.C \ \ AS_UTL/bitEncodings.C \ AS_UTL/bitPackedFile.C \ AS_UTL/bitPackedArray.C \ AS_UTL/dnaAlphabets.C \ AS_UTL/hexDump.C \ AS_UTL/md5.C \ AS_UTL/mt19937ar.C \ AS_UTL/readBuffer.C \ AS_UTL/speedCounter.C \ AS_UTL/sweatShop.C \ AS_UTL/timeAndSize.C \ AS_UTL/kMer.C \ \ falcon_sense/libfalcon/falcon.C \ \ stores/gkStore.C \ stores/gkStoreEncode.C \ \ stores/ovOverlap.C \ stores/ovStore.C \ stores/ovStoreWriter.C \ stores/ovStoreFilter.C \ stores/ovStoreFile.C \ stores/ovStoreHistogram.C \ \ stores/tgStore.C \ stores/tgTig.C \ stores/tgTigSizeAnalysis.C \ stores/tgTigMultiAlignDisplay.C \ \ stores/libsnappy/snappy-sinksource.cc \ stores/libsnappy/snappy-stubs-internal.cc \ stores/libsnappy/snappy.cc \ \ meryl/libmeryl.C \ \ overlapInCore/overlapReadCache.C \ \ overlapErrorAdjustment/analyzeAlignment.C \ \ overlapInCore/liboverlap/Binomial_Bound.C \ overlapInCore/liboverlap/Display_Alignment.C \ 
overlapInCore/liboverlap/prefixEditDistance.C \ overlapInCore/liboverlap/prefixEditDistance-allocateMoreSpace.C \ overlapInCore/liboverlap/prefixEditDistance-extend.C \ overlapInCore/liboverlap/prefixEditDistance-forward.C \ overlapInCore/liboverlap/prefixEditDistance-reverse.C \ \ overlapInCore/libedlib/edlib.C \ \ utgcns/libNDalign/NDalign.C \ \ utgcns/libNDalign/Binomial_Bound.C \ utgcns/libNDalign/NDalgorithm.C \ utgcns/libNDalign/NDalgorithm-allocateMoreSpace.C \ utgcns/libNDalign/NDalgorithm-extend.C \ utgcns/libNDalign/NDalgorithm-forward.C \ utgcns/libNDalign/NDalgorithm-reverse.C \ \ utgcns/libcns/abAbacus-addRead.C \ utgcns/libcns/abAbacus-appendBases.C \ utgcns/libcns/abAbacus-applyAlignment.C \ utgcns/libcns/abAbacus-baseCall.C \ utgcns/libcns/abAbacus-mergeRefine.C \ utgcns/libcns/abAbacus-refine.C \ utgcns/libcns/abAbacus-refreshMultiAlign.C \ utgcns/libcns/abAbacus.C \ utgcns/libcns/abColumn.C \ utgcns/libcns/abMultiAlign.C \ utgcns/libcns/unitigConsensus.C \ utgcns/libpbutgcns/AlnGraphBoost.C \ \ gfa/gfa.C \ gfa/bed.C \ \ meryl/libkmer/existDB-create-from-fasta.C \ meryl/libkmer/existDB-create-from-meryl.C \ meryl/libkmer/existDB-create-from-sequence.C \ meryl/libkmer/existDB-state.C \ meryl/libkmer/existDB.C \ meryl/libkmer/positionDB-access.C \ meryl/libkmer/positionDB-dump.C \ meryl/libkmer/positionDB-file.C \ meryl/libkmer/positionDB-mismatch.C \ meryl/libkmer/positionDB-sort.C \ meryl/libkmer/positionDB.C ifeq (${BUILDSTACKTRACE}, 1) SOURCES += AS_UTL/libbacktrace/atomic.c \ AS_UTL/libbacktrace/backtrace.c \ AS_UTL/libbacktrace/dwarf.c \ AS_UTL/libbacktrace/elf.c \ AS_UTL/libbacktrace/fileline.c \ AS_UTL/libbacktrace/mmap.c \ AS_UTL/libbacktrace/mmapio.c \ AS_UTL/libbacktrace/posix.c \ AS_UTL/libbacktrace/print.c \ AS_UTL/libbacktrace/simple.c \ AS_UTL/libbacktrace/sort.c \ AS_UTL/libbacktrace/state.c \ AS_UTL/libbacktrace/unknown.c endif SRC_INCDIRS := . 
\ AS_UTL \ stores \ stores/libsnappy \ alignment \ utgcns/libNDalign \ utgcns/libcns \ utgcns/libpbutgcns \ utgcns/libNDFalcon \ utgcns/libboost \ meryl/libleaff \ overlapInCore \ overlapInCore/libedlib \ overlapInCore/liboverlap \ falcon_sense/libfalcon SUBMAKEFILES := stores/gatekeeperCreate.mk \ stores/gatekeeperDumpFASTQ.mk \ stores/gatekeeperDumpMetaData.mk \ stores/gatekeeperPartition.mk \ stores/ovStoreBuild.mk \ stores/ovStoreBucketizer.mk \ stores/ovStoreSorter.mk \ stores/ovStoreIndexer.mk \ stores/ovStoreDump.mk \ stores/ovStoreStats.mk \ stores/tgStoreCompress.mk \ stores/tgStoreDump.mk \ stores/tgStoreLoad.mk \ stores/tgStoreFilter.mk \ stores/tgStoreCoverageStat.mk \ stores/tgTigDisplay.mk \ \ meryl/libleaff.mk \ meryl/leaff.mk \ meryl/meryl.mk \ meryl/maskMers.mk \ meryl/simple.mk \ meryl/estimate-mer-threshold.mk \ meryl/existDB.mk \ meryl/positionDB.mk \ \ merTrim/merTrim.mk \ \ overlapInCore/overlapInCore.mk \ overlapInCore/overlapInCorePartition.mk \ overlapInCore/overlapConvert.mk \ overlapInCore/overlapImport.mk \ overlapInCore/overlapPair.mk \ \ overlapInCore/liboverlap/prefixEditDistance-matchLimitGenerate.mk \ \ mhap/mhap.mk \ mhap/mhapConvert.mk \ \ minimap/mmapConvert.mk \ \ correction/filterCorrectionOverlaps.mk \ correction/generateCorrectionLayouts.mk \ correction/readConsensus.mk \ correction/errorEstimate.mk \ \ falcon_sense/createFalconSenseInputs.mk \ falcon_sense/falcon_sense.mk \ \ overlapBasedTrimming/trimReads.mk \ overlapBasedTrimming/splitReads.mk \ \ overlapErrorAdjustment/findErrors.mk \ overlapErrorAdjustment/findErrors-Dump.mk \ overlapErrorAdjustment/correctOverlaps.mk \ \ bogart/bogart.mk \ \ bogus/bogus.mk \ \ erateEstimate/erateEstimate.mk \ \ utgcns/utgcns.mk \ \ gfa/alignGFA.mk \ \ fastq-utilities/fastqAnalyze.mk \ fastq-utilities/fastqSample.mk \ fastq-utilities/fastqSimulate.mk \ fastq-utilities/fastqSimulate-sort.mk 
canu-1.6/src/merTrim/000077500000000000000000000000001314437614700145265ustar00rootroot00000000000000canu-1.6/src/merTrim/merTrim-compare-logs.pl000066400000000000000000000060001314437614700210640ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # This file is derived from: # # src/AS_MER/merTrim-compare-logs.pl # # Modifications by: # # Brian P. Walenz from 2014-NOV-15 to 2014-DEC-05 # are Copyright 2014 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. Walenz beginning on 2015-OCT-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. 
## use strict; my $log1 = shift @ARGV; my $log2 = shift @ARGV; if (!defined($log1) || !defined($log2)) { die "usage: $0 run1.log run2.log\n"; } open(L1, "< $log1") or die; open(L2, "< $log2") or die; my ($a1, $a2, $a3, $a4, $a5, $a6, $a7, $a8, $a9); my ($b1, $b2, $b3, $b4, $b5, $b6, $b7, $b8, $b9); while (!eof(L1) && !eof(L2)) { anotherA: do { $a1 = <L1>; # FINAL, or "Correct" lines } while ($a1 !~ m/^FINAL/); $a2 = <L1>; # ORI seq $a3 = <L1>; # COR seq $a4 = <L1>; # COR qlt $a5 = <L1>; # COVERAGE $a6 = <L1>; # CORRECTIONS $a7 = <L1>; # DISCONNECTION $a8 = <L1>; # ADAPTER $a9 = <L1>; # RESULT #if ($a1 =~ m/^ADAPTERSEARCH/) { # goto anotherA; #} anotherB: do { $b1 = <L2>; # FINAL, or "Correct" lines } while ($b1 !~ m/^FINAL/); $b2 = <L2>; # ORI seq $b3 = <L2>; # COR seq $b4 = <L2>; # COR qlt $b5 = <L2>; # COVERAGE $b6 = <L2>; # CORRECTIONS $b7 = <L2>; # DISCONNECTION $b8 = <L2>; # ADAPTER $b9 = <L2>; # RESULT #if ($b1 =~ m/^ADAPTERSEARCH/) { # goto anotherB; #} my ($aID, $aLen, $aBgn, $aEnd); my ($bID, $bLen, $bBgn, $bEnd); # FINAL or ADAPTERSEARCH if ($a1 =~ m/^\w+\sread\s(\d+)\slen\s(\d+)\s\(trim\s(\d+)-(\d+)\)$/) { $aID = $1; $aLen = $2; $aBgn = $3; $aEnd = $4; } else { die "Nope a1 $a1"; } if ($b1 =~ m/^\w+\sread\s(\d+)\slen\s(\d+)\s\(trim\s(\d+)-(\d+)\)$/) { $bID = $1; $bLen = $2; $bBgn = $3; $bEnd = $4; } else { die "Nope b1 $b1"; } die "ID mismatch $aID $bID\n" if ($aID != $bID); if (($aBgn != $bBgn) || ($aEnd != $bEnd)) { print "$aID/$bID $aLen/$bLen $aBgn-$aEnd $bBgn-$bEnd\n"; } if (($aID % 10000) == 0) { print STDERR "$aID\n"; } } canu-1.6/src/merTrim/merTrim.C000066400000000000000000002003161314437614700162530ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_MER/merTrim.C * * Modifications by: * * Brian P. Walenz from 2010-FEB-22 to 2014-APR-11 * are Copyright 2010-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren on 2010-JUL-12 * are Copyright 2010 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-NOV-23 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "ovStore.H" #include "gkStore.H" #include "AS_UTL_reverseComplement.H" #include #include "sweatShop.H" #include "existDB.H" #include "positionDB.H" #include "libmeryl.H" #include "merTrimResult.H" // There is a serious bug in storing the kmer count in the existDB that merTrim 1.40 exposes. kmer // must be at least r1950 to fix the bug. #if !defined(EXISTDB_H_VERSION) || (EXISTDB_H_VERSION < 1960) #error kmer needs to be updated to at least r1960 #error note that the kmer svn url changed in mid December 2012, the old url does not have r1960 #endif uint32 VERBOSE = 0; #define ALLGOOD 1 #define ALLCRAP 2 #define ATTEMPTCORRECTION 3 // Doesn't work, gets different results. 
#undef USE_MERSTREAM_REBUILD #undef TEST_TESTBASE //char *createAdapterString(bool adapIllumina, bool adap454); class mertrimGlobalData { public: mertrimGlobalData() { gkpPath = 0L; gkp = 0L; fqInputPath = 0L; fqOutputPath = 0L; fqVerifyPath = 0L; merSize = 22; merCountsFile = 0L; merCountsCache = false; adapCountsFile = 0L; adapIllumina = false; adap454 = false; compression = 0; // Does not work! numThreads = 4; beVerbose = false; forceCorrection = false; correctMismatch = true; correctIndel = true; actualCoverage = 0; // Estimate of coverage minCorrectFraction = 1.0 / 3.0; // Base can be corrected if less than 1/3 coverage minCorrect = 0; // minVerifiedFraction = 1.0 / 4.0; // Base can be corrected to something with only 1/4 coverage minVerified = 0; // endTrimDefault = true; endTrimNum = 0; #if 0 // These never worked well. endTrimWinScale[0] = 0.50; endTrimErrAllow[0] = 2; endTrimWinScale[1] = 0.25; endTrimErrAllow[1] = 0; #endif endTrimQV = '2'; doTrimming = true; discardZeroCoverage = false; discardImperfectCoverage = false; trimImperfectCoverage = true; fqInput = NULL; fqOutput = NULL; fqVerify = NULL; fqLog = NULL; genomicDB = NULL; adapterDB = NULL; resPath = NULL; resFile = NULL; gktBgn = 0; gktEnd = 0; gktCur = 0; }; ~mertrimGlobalData() { gkp->gkStore_close(); gkp = NULL; delete fqInput; delete fqOutput; delete fqVerify; delete fqLog; delete genomicDB; delete adapterDB; if (resFile != NULL) fclose(resFile); }; void initializeGatekeeper(void) { if (gkpPath == NULL) return; fprintf(stderr, "opening gkStore '%s'\n", gkpPath); gkp = gkStore::gkStore_open(gkpPath); if (gktBgn == 0) { gktBgn = 1; gktEnd = gkp->gkStore_getNumReads(); } gktCur = gktBgn; if (gktBgn > gktEnd) fprintf(stderr, "ERROR: invalid range: -b (" F_U32 ") >= -e (" F_U32 ").\n", gktBgn, gktEnd), exit(1); if (gktEnd > gkp->gkStore_getNumReads()) fprintf(stderr, "ERROR: invalid range: -e (" F_U32 ") > num frags (" F_U32 ").\n", gktEnd, gkp->gkStore_getNumReads()), exit(1); errno = 0; resFile 
= fopen(resPath, "w"); if (errno) fprintf(stderr, "Failed to open output file '%s': %s\n", resPath, strerror(errno)), exit(1); }; void initializeFASTQ(void) { if (fqInputPath == NULL) return; if (fqOutputPath == NULL) return; fqInput = new compressedFileReader(fqInputPath); fqOutput = new compressedFileWriter(fqOutputPath); if (fqVerifyPath) fqVerify = new compressedFileReader(fqVerifyPath); char fqName[FILENAME_MAX]; snprintf(fqName, FILENAME_MAX, "%s.log", fqOutputPath); fqLog = new compressedFileWriter(fqName); }; void initialize(void) { initializeGatekeeper(); initializeFASTQ(); if (actualCoverage == 0) { merylStreamReader *MF = new merylStreamReader(merCountsFile); uint32 i = 0; uint32 iX = 0; //fprintf(stderr, "distinct: " F_U64 "\n", MF->numberOfDistinctMers()); //fprintf(stderr, "unique: " F_U64 "\n", MF->numberOfUniqueMers()); //fprintf(stderr, "total: " F_U64 "\n", MF->numberOfTotalMers()); //fprintf(stderr, "Xcoverage zero 1 0 " F_U64 "\n", MF->histogram(1)); for (i=2; (i < MF->histogramLength()) && (MF->histogram(i-1) > MF->histogram(i)); i++) //fprintf(stderr, "Xcoverage drop " F_U32 " " F_U64 " " F_U64 "\n", i, MF->histogram(i-1), MF->histogram(i)); ; iX = i - 1; for (; i < MF->histogramLength(); i++) { if (MF->histogram(iX) < MF->histogram(i)) { //fprintf(stderr, "Xcoverage incr " F_U32 " " F_U64 " " F_U64 "\n", i, MF->histogram(iX), MF->histogram(i)); iX = i; } else { //fprintf(stderr, "Xcoverage drop " F_U32 " " F_U64 " " F_U64 "\n", i, MF->histogram(iX), MF->histogram(i)); } } fprintf(stderr, "Guessed X coverage is " F_U32 "\n", iX); delete MF; actualCoverage = iX; } if (minCorrectFraction > 0) minCorrect = (uint32)floor(minCorrectFraction * actualCoverage); if (minVerifiedFraction > 0) minVerified = (uint32)floor(minVerifiedFraction * actualCoverage); fprintf(stderr, "Use minCorrect=" F_U32 " minVerified=" F_U32 "\n", minCorrect, minVerified); if (minCorrect < minVerified) { fprintf(stderr, "WARNING!\n"); fprintf(stderr, "WARNING! 
minVerified (-verified) should be less than minCorrect (-correct).\n"); fprintf(stderr, "WARNING!\n"); } if (adapCountsFile) { fprintf(stderr, "loading adapter mer database.\n"); adapterDB = new existDB(adapCountsFile, merSize, existDBcounts, 0, UINT32_MAX); adapterDB->printState(stderr); } else if (adapIllumina || adap454) { fprintf(stderr, "creating adapter mer database.\n"); #if 0 char *adapter = createAdapterString(adapIllumina, adap454); adapterDB = new existDB(adapter, merSize, existDBcanonical | existDBcounts); //adapterDB->printState(stderr); delete [] adapter; #endif } else { fprintf(stderr, "not searching for adapter.\n"); } char cacheName[FILENAME_MAX]; snprintf(cacheName, FILENAME_MAX, "%s.merTrimDB", merCountsFile); if (AS_UTL_fileExists(cacheName, FALSE, FALSE)) { fprintf(stderr, "loading genome mer database from cache '%s'.\n", cacheName); genomicDB = new existDB(cacheName); } else if (merCountsFile) { fprintf(stderr, "loading genome mer database from meryl '%s'.\n", merCountsFile); genomicDB = new existDB(merCountsFile, merSize, existDBcounts, MIN(minCorrect, minVerified), UINT32_MAX); if (merCountsCache) { fprintf(stderr, "saving genome mer database to cache '%s'.\n", cacheName); genomicDB->saveState(cacheName); } } }; public: // Command line parameters // char *gkpPath; char *fqInputPath; char *fqOutputPath; char *fqVerifyPath; uint32 merSize; char *merCountsFile; bool merCountsCache; char *adapCountsFile; bool adapIllumina; bool adap454; uint32 compression; uint32 numThreads; bool beVerbose; bool forceCorrection; bool correctMismatch; bool correctIndel; char *resPath; FILE *resFile; bool endTrimDefault; uint32 endTrimNum; double endTrimWinScale[16]; uint32 endTrimErrAllow[16]; char endTrimQV; bool doTrimming; bool discardZeroCoverage; bool discardImperfectCoverage; bool trimImperfectCoverage; // Global data // gkStore *gkp; uint32 actualCoverage; double minCorrectFraction; uint32 minCorrect; double minVerifiedFraction; uint32 minVerified; 
compressedFileReader *fqInput; compressedFileWriter *fqOutput; compressedFileReader *fqVerify; compressedFileWriter *fqLog; existDB *genomicDB; existDB *adapterDB; // Input State // uint32 gktBgn; uint32 gktCur; uint32 gktEnd; }; class mertrimThreadData { public: mertrimThreadData(mertrimGlobalData *g) { kb = new kMerBuilder(g->merSize, g->compression, 0L); }; ~mertrimThreadData() { delete kb; }; public: kMerBuilder *kb; }; class mertrimComputation { public: mertrimComputation() { readName = NULL; origSeq = NULL; origQlt = NULL; corrSeq = NULL; corrQlt = NULL; seqMap = NULL; verifyName = NULL; verifySeq = NULL; verifyErr = NULL; rMS = NULL; disconnect = NULL; coverage = NULL; adapter = NULL; corrected = NULL; eDB = NULL; } ~mertrimComputation() { delete [] readName; delete [] origSeq; delete [] origQlt; delete [] corrSeq; delete [] corrQlt; delete [] verifyName; delete [] verifySeq; delete [] verifyErr; delete [] seqMap; delete rMS; delete [] disconnect; delete [] coverage; delete [] adapter; delete [] corrected; } void initializeGatekeeper(mertrimGlobalData *g_) { g = g_; readIID = fr.gkRead_readID(); seqLen = fr.gkRead_sequenceLength(); allocLen = seqLen + seqLen; readName = NULL; origSeq = new char [allocLen]; origQlt = new char [allocLen]; corrSeq = new char [allocLen]; corrQlt = new char [allocLen]; verifyName = NULL; verifySeq = NULL; verifyErr = NULL; seqMap = new uint32 [allocLen]; nMersExpected = 0; nMersTested = 0; nMersFound = 0; nMersCorrect = 0; rMS = NULL; disconnect = NULL; coverage = NULL; adapter = NULL; corrected = NULL; eDB = NULL; #warning HORRIBLY NON OPTIMAL READING OF READS gkReadData rd; g->gkp->gkStore_loadReadData(&fr, &rd); strcpy(origSeq, rd.gkReadData_getSequence()); strcpy(origQlt, rd.gkReadData_getQualities()); strcpy(corrSeq, rd.gkReadData_getSequence()); strcpy(corrQlt, rd.gkReadData_getQualities()); // Replace Ns with a random low-quality base. 
This is necessary, since the mer routines // will not make a mer for N, and we never see it to correct it. char letters[4] = { 'A', 'C', 'G', 'T' }; // Not really replacing N with random ACGT, but good enough for us. for (uint32 i=0; i<seqLen; i++) if ((corrSeq[i] != 'A') && (corrSeq[i] != 'C') && (corrSeq[i] != 'G') && (corrSeq[i] != 'T')) { corrSeq[i] = letters[i & 0x03]; corrQlt[i] = 0; } }; bool initializeFASTQ(mertrimGlobalData *g_) { g = g_; readIID = g->gktCur++; seqLen = 0; allocLen = AS_MAX_READLEN + AS_MAX_READLEN + 1; // Used for seq/qlt storage only readName = new char [1024]; origSeq = new char [allocLen]; origQlt = new char [allocLen]; corrSeq = new char [allocLen]; corrQlt = new char [allocLen]; verifyName = NULL; // probably redundant verifySeq = NULL; verifyErr = NULL; seqMap = new uint32 [allocLen]; nMersExpected = 0; nMersTested = 0; nMersFound = 0; nMersCorrect = 0; rMS = NULL; disconnect = NULL; coverage = NULL; adapter = NULL; corrected = NULL; eDB = NULL; // Load the answer, if supplied (uses the real read storage space as temporary) if (g->fqVerify) { verifyName = new char [1024]; verifySeq = new char [allocLen]; verifyErr = new char [allocLen]; memset(verifyName, 0, 1024); memset(verifySeq, 0, allocLen); // Needed, for verified reads shorter than real reads memset(verifyErr, '-', allocLen); fgets(verifyName, 1024, g->fqVerify->file()); fgets(verifySeq, allocLen, g->fqVerify->file()); fgets(origQlt, allocLen, g->fqVerify->file()); // qv name line, ignored fgets(origQlt, allocLen, g->fqVerify->file()); chomp(verifyName); chomp(verifySeq); } // Load a read to correct fgets(readName, 1024, g->fqInput->file()); fgets(origSeq, allocLen, g->fqInput->file()); fgets(origQlt, allocLen, g->fqInput->file()); // qv name line, ignored fgets(origQlt, allocLen, g->fqInput->file()); if (feof(g->fqInput->file())) return(false); chomp(readName); chomp(origSeq); chomp(origQlt); // Adjust QVs to CA encoding, bases to upper case, and non-acgt to acgt.
#warning ASSUMING SANGER QV ENCODING for (uint32 i=0; origQlt[i]; i++) { if ('a' <= origSeq[i]) origSeq[i] += 'A' - 'a'; if (origQlt[i] < '!') fprintf(stderr, "ERROR: invalid QV '%c' (%d) in read '%s': '%s'\n", origQlt[i], origQlt[i], readName, origQlt); // Our Sanger reads (dumped as fastq from gkpStore) have QV's higher than this. //if ('J' < origQlt[i]) // fprintf(stderr, "ERROR: invalid QV '%c' (%d) in read '%s': '%s'\n", // origQlt[i], origQlt[i], readName, origQlt); origQlt[i] -= '!'; } // Copy to the corrected sequence strcpy(corrSeq, origSeq); strcpy(corrQlt, origQlt); // Replace Ns with a random low-quality base. This is necessary, since the mer routines // will not make a mer for N, and we never see it to correct it. uint32 numReplace = 0; char letters[4] = { 'A', 'C', 'G', 'T' }; // Not really replacing N with random ACGT, but good enough for us. for (uint32 i=0; corrSeq[i]; i++) if ((corrSeq[i] != 'A') && (corrSeq[i] != 'C') && (corrSeq[i] != 'G') && (corrSeq[i] != 'T')) { numReplace++; corrSeq[i] = letters[i & 0x03]; corrQlt[i] = 0; } seqLen = strlen(origSeq); allocLen = seqLen + seqLen; clrBgn = 0; clrEnd = seqLen; garbageInInput = false; gapInConfirmedKmers = false; hasNoConfirmedKmers = false; imperfectKmerCoverage = false; containsAdapter = false; containsAdapterCount = false; containsAdapterFixed = false; containsAdapterBgn = false; containsAdapterEnd = false; suspectedChimer = false; suspectedChimerBgn = 0; suspectedChimerEnd = 0; for (uint32 i=0; i<allocLen; i++) seqMap[i] = i; if (numReplace >= errorRate * seqLen) { garbageInInput = true; clrBgn = 0; clrEnd = 0; } return(true); };
scoreAdapter(void); void attemptCorrection(bool isReversed); void attemptTrimming(bool doTrimming, char endTrimQV); void attemptTrimming5End(uint32 *errorPos, uint32 endWindow, uint32 endAllowed); void attemptTrimming3End(uint32 *errorPos, uint32 endWindow, uint32 endAllowed); uint32 getClrBgn(void) { return(seqMap[clrBgn]); }; uint32 getClrEnd(void) { return(seqMap[clrEnd]); }; uint32 getSeqLen(void) { return(seqMap[seqLen]); }; void dump(char *label); // Public for the writer. gkRead fr; mertrimGlobalData *g; mertrimThreadData *t; uint32 readIID; uint32 seqLen; uint32 allocLen; char *readName; char *origSeq; char *origQlt; char *corrSeq; char *corrQlt; char *verifyName; char *verifySeq; char *verifyErr; uint32 *seqMap; uint32 nMersExpected; uint32 nMersTested; uint32 nMersFound; uint32 nMersCorrect; merStream *rMS; // kmers in the read, for searching against genomic kmers uint32 clrBgn; uint32 clrEnd; bool garbageInInput; bool gapInConfirmedKmers; bool hasNoConfirmedKmers; bool imperfectKmerCoverage; bool containsAdapter; // Read hits adapter kmers uint32 containsAdapterCount; // Number of uncorrected adapter kmers hit uint32 containsAdapterFixed; // Number of corrected adapter kmers hit uint32 containsAdapterBgn; // Location of adapter uint32 containsAdapterEnd; // Location of adapter bool suspectedChimer; uint32 suspectedChimerBgn; uint32 suspectedChimerEnd; uint32 *disconnect; // per base - a hole before this base uint32 *coverage; // per base - mer coverage uint32 *adapter; // per base - mer coverage in adapter kmers uint32 *corrected; // per base - type of correction here existDB *eDB; uint32 nHole; // Number of spaces (between bases) with no mer coverage uint32 nCorr; // Number of bases corrected uint32 nFail; // Number of bases uncorrected because no answer found uint32 nConf; // Number of bases uncorrected because multiple answers found char merstring[256]; }; // Scan the sequence, counting the number of kmers verified. If we find all of them, we're done. 
// uint32 mertrimComputation::evaluate(void) { if (VERBOSE > 1) fprintf(stderr, "\nPROCESS read %d name %s\n", readIID, readName); if (garbageInInput == true) return(ALLCRAP); if (rMS == NULL) rMS = new merStream(t->kb, new seqStream(corrSeq, seqLen), false, true); rMS->rewind(); nMersExpected = clrEnd - clrBgn - g->merSize + 1; nMersTested = 0; nMersFound = 0; nMersCorrect = 0; while ((rMS->nextMer()) && (rMS->thePositionInSequence() + g->merSize - 1 < clrEnd)) { if (rMS->thePositionInSequence() < clrBgn) // Mer before the clear range begins. continue; if (clrEnd <= rMS->thePositionInSequence() + g->merSize - 1) // Mer after the clear range ends continue; nMersTested++; //fprintf(stderr, "pos %d count %d\n", // rMS->thePositionInSequence() + g->merSize - 1, // eDB->count(rMS->theCMer())); if (eDB->count(rMS->theCMer()) >= g->minCorrect) // We don't need to correct this kmer. nMersCorrect++; if (eDB->count(rMS->theCMer()) >= g->minVerified) // We trust this mer. nMersFound++; } if (VERBOSE > 0) fprintf(stderr, "INITIAL read %u %s len %u has %u mers, %u correct and %u trusted.\n", readIID, readName, seqLen, nMersTested, nMersCorrect, nMersFound); if (nMersCorrect == nMersExpected) // All mers correct, read is 100% verified! return(ALLGOOD); if (nMersFound == 0) { // No trusted kMers found, read is 100% garbage (or 100% unique). hasNoConfirmedKmers = true; return(ALLCRAP); } // Attempt correction. 
return(ATTEMPTCORRECTION); } void mertrimComputation::reverse(void) { uint32 c = 0; uint32 *s = NULL; uint32 *S = NULL; reverseComplement(corrSeq, corrQlt, seqLen); uint32 cb = seqLen - clrBgn; uint32 ce = seqLen - clrEnd; clrBgn = ce; clrEnd = cb; delete rMS; rMS = new merStream(t->kb, new seqStream(corrSeq, seqLen), false, true); if (corrected) { s = corrected; S = corrected + seqLen - 1; while (s < S) { if (*s == 'X') *s = 0; if (*S == 'X') *S = 0; c = *s; *s++ = *S; *S-- = c; } } if (adapter) { s = adapter; S = adapter + seqLen - 1; while (s < S) { c = *s; *s++ = *S; *S-- = c; } } s = seqMap; S = seqMap + seqLen - 1; while (s < S) { c = *s; *s++ = *S; *S-- = c; } } void mertrimComputation::analyze(void) { if (rMS == NULL) return; rMS->rewind(); if (coverage == NULL) coverage = new uint32 [allocLen]; if (disconnect == NULL) disconnect = new uint32 [allocLen]; memset(coverage, 0, sizeof(uint32) * (allocLen)); memset(disconnect, 0, sizeof(uint32) * (allocLen)); while (rMS->nextMer()) { uint32 posBgn = rMS->thePositionInSequence(); uint32 posEnd = rMS->thePositionInSequence() + g->merSize; assert(posEnd <= seqLen); if (eDB->count(rMS->theCMer()) < g->minVerified) // This mer is too weak for us. Skip it. continue; // If we aren't the first mer, then there should be coverage for our first base. If not, // we have found a correctable error, an uncorrectable error, or a chimeric read. if ((posBgn > 0) && (coverage[posBgn-1] > 0) && (coverage[posBgn] == 0)) disconnect[posBgn-1] = disconnect[posBgn] = 'D'; // Add coverage for the good mer. for (uint32 add=posBgn; add<posEnd; add++) coverage[add]++; } rMS->rewind(); if (VERBOSE > 1) dump("ANALYZE"); } bool mertrimComputation::correctMismatch(uint32 pos, uint32 mNum, uint32 mExtra, bool isReversed) { uint32 nA = (corrSeq[pos] != 'A') ? testBaseChange(pos, 'A') : 0; uint32 nC = (corrSeq[pos] != 'C') ? testBaseChange(pos, 'C') : 0; uint32 nG = (corrSeq[pos] != 'G') ? testBaseChange(pos, 'G') : 0; uint32 nT = (corrSeq[pos] != 'T') ?
testBaseChange(pos, 'T') : 0; uint32 rB = 0; // Base to change to uint32 rV = 0; // Count of that kmer evidence uint32 nR = 0; if (VERBOSE > 2) { if (nA > mNum + mExtra) fprintf(stderr, "testA at %d -- %d req=%d\n", pos, nA, mNum + mExtra); if (nC > mNum + mExtra) fprintf(stderr, "testC at %d -- %d req=%d\n", pos, nC, mNum + mExtra); if (nG > mNum + mExtra) fprintf(stderr, "testG at %d -- %d req=%d\n", pos, nG, mNum + mExtra); if (nT > mNum + mExtra) fprintf(stderr, "testT at %d -- %d req=%d\n", pos, nT, mNum + mExtra); } // VERBOSE // If we found a single perfectly correct choice, ignore all the other solutions. if (nA == g->merSize) nR++; if (nC == g->merSize) nR++; if (nG == g->merSize) nR++; if (nT == g->merSize) nR++; if (nR == 1) { if (nA != g->merSize) nA = 0; if (nC != g->merSize) nC = 0; if (nG != g->merSize) nG = 0; if (nT != g->merSize) nT = 0; } // Count the number of viable solutions. nR = 0; if (nA > mNum + mExtra) { nR++; rB = 'A'; rV = nA; } if (nC > mNum + mExtra) { nR++; rB = 'C'; rV = nC; } if (nG > mNum + mExtra) { nR++; rB = 'G'; rV = nG; } if (nT > mNum + mExtra) { nR++; rB = 'T'; rV = nT; } if (nR == 0) // Nothing viable, keep the base as is. return(false); // Something to change to. By definition, this is a stronger mer than the original. if (nR > 1) { // Multiple solutions. Pick the most common. If we don't do this, the confirmed kmer // coverage drops at this location, and we trim the read. This would result in // zero coverage at every variation. uint32 mm = MAX(MAX(nA, nC), MAX(nG, nT)); if (nA == mm) { rB = 'A'; rV = nA; } if (nC == mm) { rB = 'C'; rV = nC; } if (nG == mm) { rB = 'G'; rV = nG; } if (nT == mm) { rB = 'T'; rV = nT; } //corrected[pos] = 'X'; //return(false); } // One solution! Correct it. if (VERBOSE > 0) { if (nR > 1) fprintf(stderr, "Correct read %d at position %d from %c (%u) to %c (%u) (QV %d) (%s) (multiple choices nA=%d nC=%d nG=%d nT=%d)\n", readIID, (isReversed == false) ? 
pos : seqLen - pos, corrSeq[pos], mNum, rB, rV, corrQlt[pos], (isReversed == false) ? "fwd" : "rev", nA, nC, nG, nT); else fprintf(stderr, "Correct read %d at position %d from %c (%u) to %c (%u) (QV %d) (%s)\n", readIID, (isReversed == false) ? pos : seqLen - pos, corrSeq[pos], mNum, rB, rV, corrQlt[pos], (isReversed == false) ? "fwd" : "rev"); } // VERBOSE corrSeq[pos] = rB; corrected[pos] = 'C'; // Rebuild the merStream to use the corrected sequence, then move to the same position in // the sequence. This is done because we cannot simply change the string -- we need to // change the state of the kMerBuilder associated with the merStream, and we can't do that. #ifdef USE_MERSTREAM_REBUILD rMS->rebuild(); #else pos = rMS->thePositionInSequence(); delete rMS; rMS = new merStream(t->kb, new seqStream(corrSeq, seqLen), false, true); rMS->nextMer(); while (pos != rMS->thePositionInSequence()) rMS->nextMer(); #endif return(true); } bool mertrimComputation::correctIndel(uint32 pos, uint32 mNum, uint32 mExtra, bool isReversed) { uint32 nD = testBaseIndel(pos, '-'); uint32 nA = testBaseIndel(pos, 'A'); uint32 nC = testBaseIndel(pos, 'C'); uint32 nG = testBaseIndel(pos, 'G'); uint32 nT = testBaseIndel(pos, 'T'); char rB = 0; uint32 rV = 0; uint32 nR = 0; if (nD > mNum + mExtra) { nR++; rB = '-'; rV = nD; } if (nA > mNum + mExtra) { nR++; rB = 'A'; rV = nA; } if (nC > mNum + mExtra) { nR++; rB = 'C'; rV = nC; } if (nG > mNum + mExtra) { nR++; rB = 'G'; rV = nG; } if (nT > mNum + mExtra) { nR++; rB = 'T'; rV = nT; } if (VERBOSE > 2) { if (nD > mNum + mExtra) fprintf(stderr, "test-- %d -- %d req=%d\n", pos, nD, mNum + mExtra); if (nA > mNum + mExtra) fprintf(stderr, "test+A %d -- %d req=%d\n", pos, nA, mNum + mExtra); if (nC > mNum + mExtra) fprintf(stderr, "test+C %d -- %d req=%d\n", pos, nC, mNum + mExtra); if (nG > mNum + mExtra) fprintf(stderr, "test+G %d -- %d req=%d\n", pos, nG, mNum + mExtra); if (nT > mNum + mExtra) fprintf(stderr, "test+T %d -- %d req=%d\n", pos, nT, 
mNum + mExtra); } // VERBOSE if (nR == 0) // No solutions. return(false); if (nR > 1) { // Multiple solutions. corrected[pos] = 'X'; return(false); } // One solution. Make a correction. Either a deletion or an insert. if (nD > mNum + mExtra) { if (VERBOSE > 0) { fprintf(stderr, "Correct read %d at position %d from %c (%u) to DELETE (%u) (QV %d) (%s)\n", readIID, (isReversed == false) ? pos : seqLen - pos, corrSeq[pos], mNum, rV, corrQlt[pos], (isReversed == false) ? "fwd" : "rev"); } // VERBOSE for (uint32 i=pos; i<seqLen; i++) { corrSeq[i] = corrSeq[i+1]; corrQlt[i] = corrQlt[i+1]; if (adapter) adapter[i] = adapter[i+1]; seqMap[i] = seqMap[i+1]; } seqLen--; clrEnd--; corrected[pos] = 'D'; } else { if (VERBOSE > 0) { fprintf(stderr, "Correct read %d at position %d from . (%u) to INSERT %c (%u) (%s)\n", readIID, (isReversed == false) ? pos : seqLen - pos, mNum, rB, rV, (isReversed == false) ? "fwd" : "rev"); } // VERBOSE for (uint32 i=seqLen+1; i>pos; i--) { corrSeq[i] = corrSeq[i-1]; corrQlt[i] = corrQlt[i-1]; if (adapter) adapter[i] = adapter[i-1]; seqMap[i] = seqMap[i-1]; } corrSeq[pos] = rB; corrQlt[pos] = '5'; if (adapter) adapter[pos] = 0; seqMap[pos] = seqMap[pos-1]; seqLen++; clrEnd++; corrected[pos] = 'I'; } // Rebuild the merstream. When we call analyze() the stream gets rewound. If we do not // restore our position, it is possible to get stuck in an infinite loop, inserting and // deleting the same base over and over. This happens because we don't explicitly require // that the mer we are at be found, just that we find enough mers to make the change. // // So, on the next pass through, we'd encounter the same mer we didn't find before, attempt // to change it again, and possibly delete the base we inserted. // // test+A 57 -- 6 req=2 // Correct read 6 at position 57 INSERT A // // test-- 58 -- 3 req=2 // Correct read 6 at position 58 from A to DELETE (QV 5) // // The first time through, we insert an A (with 6 mers agreeing). The second time through, // since we didn't fix the mer we were at, our choice is to delete the base we inserted (3 // mers tell us to do so).
#ifdef USE_MERSTREAM_REBUILD rMS->rebuild(); #else pos = rMS->thePositionInSequence(); delete rMS; rMS = new merStream(t->kb, new seqStream(corrSeq, seqLen), false, true); analyze(); rMS->nextMer(); while (pos != rMS->thePositionInSequence()) rMS->nextMer(); #endif return(true); } void mertrimComputation::searchAdapter(bool isReversed) { if (corrected == NULL) { corrected = new uint32 [allocLen]; memset(corrected, 0, sizeof(uint32) * (allocLen)); } if (rMS == NULL) rMS = new merStream(t->kb, new seqStream(corrSeq, seqLen), false, true); rMS->rewind(); // A combination of evaluate() and attemptCorrection(). Find any adapter kmers in the // read, do any corrections, then mark (in the 'adapter' array) the location of those // adapter bases. while (rMS->nextMer()) { uint32 pos = rMS->thePositionInSequence() + g->merSize - 1; uint32 count = eDB->count(rMS->theCMer()); if (count >= 1) { // Mer exists, no need to correct. containsAdapter = true; continue; } uint32 mNum = testBaseChange(pos, corrSeq[pos]); // Test if we can repair the sequence with a single base change.
if (g->correctMismatch) if (correctMismatch(pos, mNum, 1, isReversed)) { containsAdapter = true; containsAdapterFixed++; } } } void mertrimComputation::scoreAdapter(void) { if (containsAdapter == false) return; assert(adapter == NULL); adapter = new uint32 [allocLen]; memset(adapter, 0, sizeof(uint32) * (allocLen)); assert(clrBgn == 0); assert(clrEnd == seqLen); containsAdapterBgn = seqLen; containsAdapterEnd = 0; rMS->rewind(); while (rMS->nextMer()) { uint32 bgn = rMS->thePositionInSequence(); uint32 end = bgn + g->merSize - 1; uint32 count = eDB->count(rMS->theCMer()); if (count == 0) continue; containsAdapterCount++; containsAdapterBgn = MIN(containsAdapterBgn, bgn); containsAdapterEnd = MAX(containsAdapterEnd, end + 1); if (VERBOSE > 1) fprintf(stderr, "ADAPTER at " F_U32 "," F_U32 " [" F_U32 "," F_U32 "]\n", bgn, end, containsAdapterBgn, containsAdapterEnd); for (uint32 a=bgn; a<=end; a++) adapter[a]++; } if (VERBOSE) dump("ADAPTERSEARCH"); } void mertrimComputation::attemptCorrection(bool isReversed) { if (corrected == NULL) { corrected = new uint32 [allocLen]; memset(corrected, 0, sizeof(uint32) * (allocLen)); } assert(coverage); assert(disconnect); assert(corrected); assert(rMS != NULL); rMS->rewind(); while (rMS->nextMer()) { uint32 pos = rMS->thePositionInSequence() + g->merSize - 1; uint32 count = eDB->count(rMS->theCMer()); //fprintf(stderr, "MER at %d is %s has count %d %s\n", // pos, // rMS->theFMer().merToString(merstring), // (count >= g->minCorrect) ? "CORRECT" : "ERROR", // count); if (count >= g->minCorrect) // Mer is correct, no need to correct it! continue; // State the minimum number of mers we'd accept as evidence any change we make is correct. The // penalty for OVER correcting (too low a threshold) is potentially severe -- we could // insert/delete over and over and over eventually blowing up. The penalty for UNDER // correcting, however, is that we trim a read too aggressively. 
// // Being strictly greater than before works for mismatches and deletions. // // For insertions, especially insertions in single nucleotide runs, this doesn't work so well. // There are two cases, an insertion before a run, and an insertion after a run. Below, X // represents ACGT, and the A's are the run. // XXXXXAAAAA: inserting an A after the X's, returns the same sequence, and the count of // good mers doesn't change. // AAAAAXXXXX: inserting an A before the X's changes the sequence, returning AAAAAAXXXX, // and it is possible (likely) that this new sequence will have a higher count. // // We therefore require that we find at least TWO more good mers before accepting a change. // // The drawback of this is that we cannot correct two adjacent errors. The first error // (the one we're currently working on) is corrected and adds one to the coverage count, // but then we hit that second error and do not find any more mers. // // A solution would be to retry any base we cannot correct and allow a positive change of // one mer to accept the change. (in other words, change +1 below to +0). uint32 mNum = testBaseChange(pos, corrSeq[pos]); // Test if we can repair the sequence with a single base change. if (g->correctMismatch) if (correctMismatch(pos, mNum, 1, isReversed)) continue; if ((g->correctIndel) && (g->merSize < pos) && (pos < seqLen - g->merSize)) if (correctIndel(pos, mNum, 3, isReversed)) continue; } if (VERBOSE > 1) { dump("POSTCORRECT"); } // VERBOSE } uint32 mertrimComputation::testBases(char *bases, uint32 basesLen) { uint32 offset = 0; uint32 numConfirmed = 0; // // UNTESTED with KMER_WORDS != 1 // kMer F(g->merSize); kMer R(g->merSize); for (uint32 i=1; imerSize && offsetmerSize && offsetcount(F) >= g->minVerified) numConfirmed++; } else { if (eDB->count(R) >= g->minVerified) numConfirmed++; } } return(numConfirmed); } // Attempt to change the base at pos to make the kmers spanning it agree. 
// Returns the number of kmers validated, and the letter to change to. // uint32 mertrimComputation::testBaseChange(uint32 pos, char replacement) { uint32 numConfirmed = 0; char originalBase = corrSeq[pos]; uint32 offset = pos + 1 - g->merSize; corrSeq[pos] = replacement; numConfirmed = testBases(corrSeq + offset, MIN(seqLen - offset, 2 * g->merSize - 1)); #ifdef TEST_TESTBASE { uint32 oldConfirmed = 0; merStream *localms = new merStream(new kMerBuilder(merSize, compression, 0L), new seqStream(corrSeq + offset, seqLen - offset), true, true); // Test for (uint32 i=0; i<g->merSize && localms->nextMer(); i++) if (eDB->count(localms->theCMer()) >= g->minVerified) oldConfirmed++; delete localms; assert(oldConfirmed == numConfirmed); } #endif corrSeq[pos] = originalBase; //if (numConfirmed > 0) // fprintf(stderr, "testBaseChange() pos=%d replacement=%c confirmed=%d\n", // pos, replacement, numConfirmed); return(numConfirmed); } uint32 mertrimComputation::testBaseIndel(uint32 pos, char replacement) { uint32 numConfirmed = 0; char testStr[128] = {0}; uint32 len = 0; uint32 offset = pos + 1 - g->merSize; uint32 limit = g->merSize * 2 - 1; assert(2 * g->merSize < 120); // Overly pessimistic // Copy the first merSize bases. while (len < g->merSize - 1) testStr[len++] = corrSeq[offset++]; // Copy the second merSize bases, but overwrite the last base in the first copy (if we're testing // a deletion) or insert a replacement base. if (replacement == '-') { offset++; } else { testStr[len++] = replacement; } // Copy the rest of the bases.
while ((len < limit) && (corrSeq[offset])) testStr[len++] = corrSeq[offset++]; numConfirmed = testBases(testStr, len); #ifdef TEST_TESTBASE { uint32 oldConfirmed = 0; merStream *localms = new merStream(new kMerBuilder(merSize, compression, 0L), new seqStream(testStr, len), true, true); // Test for (uint32 i=0; i<g->merSize && localms->nextMer(); i++) if (eDB->count(localms->theCMer()) >= g->minVerified) oldConfirmed++; delete localms; assert(oldConfirmed == numConfirmed); } #endif //if (numConfirmed > 0) // fprintf(stderr, "testBaseIndel() pos=%d replacement=%c confirmed=%d\n", // pos, replacement, numConfirmed); return(numConfirmed); } void mertrimComputation::attemptTrimming5End(uint32 *errorPos, uint32 endWindow, uint32 errAllow) { for (bool doTrim = true; doTrim; ) { uint32 endFound = 0; uint32 endTrimPos = 0; uint32 bgn = clrBgn; uint32 end = clrBgn + endWindow; if (end > clrEnd) end = clrEnd; for (uint32 i=bgn; i<end; i++) if (errorPos[i]) { endFound++; endTrimPos = i + 1; } if (VERBOSE > 1) fprintf(stderr, "BGNTRIM found=%u pos=%u from %u to %u\n", endFound, endTrimPos, clrBgn, clrBgn + endWindow); if (endFound > errAllow) clrBgn = endTrimPos; else doTrim = false; } } void mertrimComputation::attemptTrimming3End(uint32 *errorPos, uint32 endWindow, uint32 errAllow) { for (bool doTrim = true; doTrim; ) { uint32 endFound = 0; uint32 endTrimPos = clrEnd; uint32 bgn = clrBgn; uint32 end = clrEnd; if (clrBgn + endWindow <= clrEnd) bgn = clrEnd - endWindow; assert(bgn >= clrBgn); assert(bgn <= clrEnd); for (int32 i=bgn; i<(int32)end; i++) if (errorPos[i]) { if (endFound == 0) endTrimPos = i; endFound++; } if (VERBOSE > 1) fprintf(stderr, "ENDTRIM found=%u pos=%u from %u to %u\n", endFound, endTrimPos, clrEnd - endWindow, clrEnd); if (endFound > errAllow) clrEnd = endTrimPos; else doTrim = false; } } void mertrimComputation::attemptTrimming(bool doTrimming, char endTrimQV) { // Just bail if the read is all junk. Nothing to do here.
// if ((garbageInInput) || (hasNoConfirmedKmers)) { clrBgn = 0; clrEnd = 0; return; } if ((clrBgn == 0) && (clrEnd == 0)) return; assert(coverage != NULL); // Lop off the ends with no confirmed kmers or a QV less than 3. The Illumina '2' qv ('B' in // Illumina 1.5+ encodings) is defined as "rest of read is < QV 15; do not use". // if (doTrimming) { while ((clrBgn < clrEnd) && ((coverage[clrBgn] == 0) || (corrQlt[clrBgn] <= endTrimQV))) clrBgn++; while ((clrEnd > clrBgn) && ((coverage[clrEnd-1] == 0) || (corrQlt[clrEnd-1] <= endTrimQV))) clrEnd--; } //fprintf(stderr, "TRIM: %d,%d (lop-off-ends)\n", clrBgn, clrEnd); //dump("TRIM"); // Deal with adapter. We'll pick the biggest end that is adapter free for our sequence. // If both ends have adapter, so be it; the read gets trashed. if (containsAdapter) { containsAdapterBgn = seqLen; containsAdapterEnd = 0; for (uint32 i=0; i<seqLen; i++) { if ((adapter[i] > 0) && (i < containsAdapterBgn)) containsAdapterBgn = i; if ((adapter[i] > 0) && (containsAdapterEnd < i)) containsAdapterEnd = i + 1; } if (containsAdapterBgn < clrBgn) containsAdapterBgn = clrBgn; if (clrEnd < containsAdapterEnd) containsAdapterEnd = clrEnd; uint32 bgnClear = containsAdapterBgn - clrBgn; uint32 endClear = clrEnd - containsAdapterEnd; if (bgnClear > endClear) // Start is bigger clrEnd = containsAdapterBgn; else // End is bigger clrBgn = containsAdapterEnd; if (clrBgn >= clrEnd) { clrBgn = 0; clrEnd = 0; } //fprintf(stderr, "TRIM: %d,%d (adapter)\n", clrBgn, clrEnd); //dump("TRIM"); } if (doTrimming == false) return; // Lop off the ends with no confirmed kmers (again) // while ((clrBgn < clrEnd) && (coverage[clrBgn] == 0)) clrBgn++; while ((clrEnd > clrBgn) && (coverage[clrEnd-1] == 0)) clrEnd--; //fprintf(stderr, "TRIM: %d,%d (lop-off-ends-2)\n", clrBgn, clrEnd); //dump("TRIM"); // True if there is an error at position i.
// uint32 *errorPos = new uint32 [seqLen]; for (uint32 i=0; i<seqLen; i++) errorPos[i] = (corrected != NULL) && (corrected[i] != 0); for (uint32 i=0; i<g->endTrimNum; i++) { attemptTrimming5End(errorPos, g->merSize * g->endTrimWinScale[i], g->endTrimErrAllow[i]); attemptTrimming3End(errorPos, g->merSize * g->endTrimWinScale[i], g->endTrimErrAllow[i]); } delete [] errorPos; // If there are zero coverage areas in the interior, trash the whole read. if (g->discardZeroCoverage == true) { for (uint32 i=clrBgn; i<clrEnd; i++) if (coverage[i] == 0) { clrBgn = 0; clrEnd = 0; } } if ((g->discardImperfectCoverage == true) && (clrBgn < clrEnd) && (clrEnd > 0)) { uint32 bgn = clrBgn; uint32 end = clrEnd - 1; while ((bgn + 1 < end) && (coverage[bgn] < coverage[bgn+1])) bgn++; while ((bgn + 1 < end) && (coverage[end-1] > coverage[end])) end--; uint32 bgnc = coverage[bgn]; uint32 endc = coverage[end]; if (VERBOSE) fprintf(stderr, "IMPERFECT: bgn=%u %u end=%u %u\n", bgn, bgnc, end, endc); if (bgnc != endc) imperfectKmerCoverage = true; else for (uint32 i=bgn; i<=end; i++) if (coverage[i] != bgnc) imperfectKmerCoverage = true; } // If the coverage isn't perfect (see above), trim out the ends that make it imperfect. // This leaves interior imperfect regions, but there should be enough to get // an overlap on either end. if ((g->trimImperfectCoverage == true) && (clrBgn < clrEnd) && (clrEnd > 0)) { uint32 bgn = clrBgn; uint32 end = clrEnd - 1; // Search from the ends in until we find the highest coverage. while ((bgn <= end) && (coverage[bgn] < g->merSize - 4)) bgn++; while ((bgn <= end) && (coverage[end] < g->merSize - 4)) end--; end++; // Check that everything between is nice and high. Why the -4? This lets // us tolerate one or two 'missing' kmers. uint32 mbgn = bgn; // Maximal bgn/end region uint32 mend = bgn; uint32 tbgn = bgn; // Currently testing region uint32 tend = bgn; while (tend < end) { if (coverage[tend] < g->merSize - 4) { if ((mend - mbgn) < (tend - tbgn)) { mbgn = tbgn; mend = tend; } tbgn = tend + 1; // Next region starts on the next position.
} tend++; } // If we found a sub region, reset to it (after remembering to check the last sub region). if (mbgn < mend) { if ((mend - mbgn) < (tend - tbgn)) { mbgn = tbgn; mend = tend; } if (VERBOSE) fprintf(stderr, "Reset clr from %d,%d to %d,%d\n", bgn, end, mbgn, mend); bgn = mbgn; end = mend; } // Then search towards the ends, as long as the coverage is strictly decreasing. if (bgn < end) { while ((bgn > clrBgn) && (coverage[bgn-1] < coverage[bgn]) && (coverage[bgn-1] > 0)) bgn--; while ((end < clrEnd) && (coverage[end] < coverage[end-1]) && (coverage[end] > 0)) end++; } else { imperfectKmerCoverage = true; } clrBgn = bgn; clrEnd = end; } // Make sense of the clear ranges. If they're invalid, reset to 0,0. if (clrBgn >= clrEnd) { clrBgn = 0; clrEnd = 0; } if (VERBOSE > 1) { fprintf(stderr, "TRIM: %d,%d (post)\n", clrBgn, clrEnd); dump("TRIM"); } // VERBOSE } void mertrimComputation::dump(char *label) { char *logLine = new char [4 * seqLen]; uint32 logPos = 0; uint32 bogus = (clrEnd == 0) ? 0 : UINT32_MAX; fprintf(stderr, "%s read %d %s len %d (trim %d-%d)\n", label, readIID, readName, seqLen, clrBgn, clrEnd); logPos = 0; for (uint32 i=0; origSeq[i]; i++) { if (i == clrBgn) { logLine[logPos++] = '-'; logLine[logPos++] = '['; } if (i == bogus) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; } // read deleted logLine[logPos++] = origSeq[i]; if (i+1 == clrEnd) { logLine[logPos++] = ']'; logLine[logPos++] = '-'; } } strcpy(logLine + logPos, " (ORI)\n"); fprintf(stderr, logLine); logPos = 0; for (uint32 i=0; it = t; s->eDB = g->genomicDB; uint32 eval = s->evaluate(); // Attempt correction if there are kmers to correct from. if (eval == ATTEMPTCORRECTION) { s->analyze(); s->attemptCorrection(false); s->reverse(); s->analyze(); s->attemptCorrection(true); s->reverse(); } // Search for linker/adapter. This needs to be after correction, since indel screws // up the clear ranges we set in scoreAdapter(). 
if ((eval != ALLCRAP) && (g->adapterDB != NULL)) { s->eDB = g->adapterDB; s->searchAdapter(false); s->reverse(); s->searchAdapter(true); s->reverse(); s->scoreAdapter(); s->eDB = g->genomicDB; } // Attempt trimming if the read wasn't perfect if (eval != ALLGOOD) { s->analyze(); s->attemptTrimming(g->doTrimming, g->endTrimQV); } if (VERBOSE) s->dump("FINAL"); } mertrimComputation * mertrimReaderGatekeeper(mertrimGlobalData *g) { mertrimComputation *s = NULL; while ((g->gktCur <= g->gktEnd) && (s == NULL)) { s = new mertrimComputation(); g->gkp->gkStore_getRead(g->gktCur); g->gktCur++; // Original version used to check if the library was eligible for initial trimming // based on kmers. That means nothing in canu. #if 1 s->initializeGatekeeper(g); #else if ((g->forceCorrection) || (g->gkp->gkStore_getLibrary(s->fr.gkRead_libraryID())->doTrim_initialMerBased)) { s->initializeGatekeeper(g); } else { delete s; s = NULL; } #endif } return(s); } mertrimComputation * mertrimReaderFASTQ(mertrimGlobalData *g) { mertrimComputation *s = new mertrimComputation(); if (s->initializeFASTQ(g) == false) { delete s; s = NULL; } return(s); } void * mertrimReader(void *G) { mertrimGlobalData *g = (mertrimGlobalData *)G; mertrimComputation *s = NULL; if (g->gkp) s = mertrimReaderGatekeeper(g); if (g->fqInput) s = mertrimReaderFASTQ(g); return(s); } void mertrimWriterGatekeeper(mertrimGlobalData *g, mertrimComputation *s) { mertrimResult res; res.readIID = s->fr.gkRead_readID(); #warning BOGUS MIN_READ_LENGTH USED uint32 minReadLength = 500; if ((s->getClrEnd() <= s->getClrBgn()) || (s->getClrEnd() - s->getClrBgn() < minReadLength)) res.deleted = true; else res.deleted = false; res.clrBgn = s->getClrBgn(); res.clrEnd = s->getClrEnd(); res.chimer = s->suspectedChimer; res.chmBgn = s->suspectedChimerBgn; res.chmEnd = s->suspectedChimerEnd; res.writeResult(g->resFile); } void mertrimWriterFASTQ(mertrimGlobalData *g, mertrimComputation *s) { // Note that getClrBgn/getClrEnd return positions 
in the ORIGINAL read, not the correct read. // DO NOT USE HERE! uint32 seqOffset = 0; char label[256]; label[0] = 0; #warning BOGUS MIN_READ_LENGTH USED uint32 minReadLength = 500; if (s->garbageInInput == true) { strcat(label, "DEL-GARBAGE"); seqOffset = 0; s->corrSeq[0] = 0; s->corrQlt[0] = 0; s->clrBgn = 0; s->clrEnd = 0; goto outputFastq; } if (s->hasNoConfirmedKmers == true) { strcat(label, "DEL-NO-KMER"); seqOffset = 0; s->corrSeq[0] = 0; s->corrQlt[0] = 0; s->clrBgn = 0; s->clrEnd = 0; goto outputFastq; } if ((s->containsAdapter == true) && (s->suspectedChimer == false)) { strcat(label, "ADAPTERTRIM"); seqOffset = s->clrBgn; s->corrSeq[s->clrEnd] = 0; s->corrQlt[s->clrEnd] = 0; goto outputFastq; } if ((s->containsAdapter == false) && (s->suspectedChimer == true)) { if (s->suspectedChimerBgn - s->clrBgn >= minReadLength) { strcpy(label, "MP"); // Junction read, longest portion would make an MP pair seqOffset = s->clrBgn; s->corrSeq[s->suspectedChimerBgn] = 0; s->corrQlt[s->suspectedChimerBgn] = 0; } else if (s->clrEnd - s->suspectedChimerEnd >= minReadLength) { strcpy(label, "PE"); // Junction read, longest portion would make a PE pair seqOffset = s->suspectedChimerEnd; s->corrSeq[s->clrEnd] = 0; s->corrQlt[s->clrEnd] = 0; } else { strcpy(label, "MP-PE-SHORT");// Junction read, neither side is long enough to save, so save nothing. 
seqOffset = 0; s->corrSeq[0] = 0; s->corrQlt[0] = 0; s->clrBgn = 0; s->clrEnd = 0; } goto outputFastq; } if (s->gapInConfirmedKmers == true) { strcat(label, "DEL-ZERO-COV"); seqOffset = 0; s->corrSeq[0] = 0; s->corrQlt[0] = 0; s->clrBgn = 0; s->clrEnd = 0; goto outputFastq; } if (s->imperfectKmerCoverage == true) { strcat(label, "DEL-INPERFECT-COV"); seqOffset = 0; s->corrSeq[0] = 0; s->corrQlt[0] = 0; s->clrBgn = 0; s->clrEnd = 0; goto outputFastq; } if ((s->suspectedChimer == false) && (s->containsAdapter == false)) { strcat(label, "CLEAN"); seqOffset = s->clrBgn; s->corrSeq[s->clrEnd] = 0; s->corrQlt[s->clrEnd] = 0; goto outputFastq; } assert(s->suspectedChimer); assert(s->containsAdapter); strcpy(label, "JUNCTION+ADAPTER"); seqOffset = 0; s->corrSeq[0] = 0; s->corrQlt[0] = 0; outputFastq: // If there is a verify sequence, verify. uint32 nVerifyErrors = 0; #warning unfinished verify if (s->verifySeq) { } fprintf(g->fqLog->file(), F_U32"\t" F_U32 "\tchimer\t%c\t" F_U32 "\t" F_U32 "\tadapter\t%c\t" F_U32 "\t" F_U32 "\t" F_U32 "\t" F_U32 "\t%s\t%s\n", s->clrBgn, s->clrEnd, s->suspectedChimer ? 't' : 'f', s->suspectedChimerBgn, s->suspectedChimerEnd, s->containsAdapter ? 't' : 'f', s->containsAdapterCount, s->containsAdapterFixed, s->containsAdapterBgn, s->containsAdapterEnd, label, s->readName); // Convert from CA QV to Sanger QV for (uint32 i=0; s->corrQlt[i]; i++) s->corrQlt[i] += '!'; // If no sequence, write a single N, with low QV (! == lowest). 
if ((s->corrSeq[seqOffset] == 0) || (s->corrQlt[seqOffset] == 0)) fprintf(g->fqOutput->file(), "%s type=%s\nN\n+\n!\n", s->readName, label); else fprintf(g->fqOutput->file(), "%s type=%s\n%s\n+\n%s\n", s->readName, label, s->corrSeq + seqOffset, s->corrQlt + seqOffset); if (VERBOSE) fprintf(stderr, "RESULT read %d len %d (trim %d-%d) %s\n", s->readIID, s->seqLen, s->clrBgn, s->clrEnd, label); } void mertrimWriter(void *G, void *S) { mertrimGlobalData *g = (mertrimGlobalData *)G; mertrimComputation *s = (mertrimComputation *)S; // The get*() functions return positions in the original uncorrected sequence. They // map from positions in the corrected sequence (which has inserts and deletes) back // to the original sequence. // assert(s->getClrBgn() <= s->getClrEnd()); assert(s->getClrEnd() <= s->getSeqLen()); if (g->resFile) mertrimWriterGatekeeper(g, s); if (g->fqOutput) mertrimWriterFASTQ(g, s); delete s; } int main(int argc, char **argv) { mertrimGlobalData *g = new mertrimGlobalData; argc = AS_configure(argc, argv); int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-g") == 0) { g->gkpPath = argv[++arg]; } else if (strcmp(argv[arg], "-F") == 0) { g->fqInputPath = argv[++arg]; } else if (strcmp(argv[arg], "-T") == 0) { g->fqVerifyPath = argv[++arg]; } else if (strcmp(argv[arg], "-m") == 0) { g->merSize = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-mc") == 0) { g->merCountsFile = argv[++arg]; } else if (strcmp(argv[arg], "-enablecache") == 0) { g->merCountsCache = true; } else if (strcmp(argv[arg], "-mC") == 0) { g->adapCountsFile = argv[++arg]; } else if (strcmp(argv[arg], "-mCillumina") == 0) { g->adapIllumina = 1; } else if (strcmp(argv[arg], "-mC454") == 0) { g->adap454 = 2; } else if (strcmp(argv[arg], "-t") == 0) { g->numThreads = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-b") == 0) { g->gktBgn = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-e") == 0) { g->gktEnd = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-v") == 0) { 
g->beVerbose = true; } else if (strcmp(argv[arg], "-V") == 0) { VERBOSE++; } else if (strcmp(argv[arg], "-coverage") == 0) { g->actualCoverage = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-correct") == 0) { g->minCorrectFraction = atof(argv[++arg]); if (g->minCorrectFraction >= 1) { g->minCorrect = (uint32)g->minCorrectFraction; g->minCorrectFraction = 0; } } else if (strcmp(argv[arg], "-evidence") == 0) { g->minVerifiedFraction = atof(argv[++arg]); if (g->minVerifiedFraction >= 1) { g->minVerified = (uint32)g->minVerifiedFraction; g->minVerifiedFraction = 0; } } else if (strcmp(argv[arg], "-endtrim") == 0) { if (g->endTrimDefault == true) g->endTrimNum = 0; g->endTrimWinScale[g->endTrimNum] = atof(argv[++arg]); g->endTrimErrAllow[g->endTrimNum] = atoi(argv[++arg]); g->endTrimNum++; } else if (strcmp(argv[arg], "-notrimming") == 0) { g->doTrimming = false; } else if (strcmp(argv[arg], "-discardzero") == 0) { g->discardZeroCoverage = true; } else if (strcmp(argv[arg], "-discardimperfect") == 0) { g->discardImperfectCoverage = true; } else if (strcmp(argv[arg], "-notrimimperfect") == 0) { g->trimImperfectCoverage = false; } else if (strcmp(argv[arg], "-endtrimqv") == 0) { g->endTrimQV = argv[++arg][0]; } else if (strcmp(argv[arg], "-f") == 0) { g->forceCorrection = true; } else if (strcmp(argv[arg], "-NM") == 0) { g->correctMismatch = false; } else if (strcmp(argv[arg], "-NI") == 0) { g->correctIndel = false; } else if (strcmp(argv[arg], "-o") == 0) { g->fqOutputPath = argv[++arg]; g->resPath = argv[arg]; } else { fprintf(stderr, "unknown option '%s'\n", argv[arg]); err++; } arg++; } if ((g->gkpPath == 0L) && (g->fqInputPath == 0L)) err++; if (err) { fprintf(stderr, "usage: %s ...\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, " -F reads.fastq input reads\n"); fprintf(stderr, " -o reads.fastq output reads\n"); fprintf(stderr, "\n"); fprintf(stderr, " -T reads.fasta truth reads for validation\n"); fprintf(stderr, "\n"); fprintf(stderr, " -m ms mer 
size\n"); fprintf(stderr, " -mc counts kmer database (in 'counts.mcdat' and 'counts.mcidx')\n"); fprintf(stderr, " -enablecache dump the final kmer data to 'counts.merTrimDB'\n"); fprintf(stderr, "\n"); fprintf(stderr, " -coverage C\n"); fprintf(stderr, " -correct n mers with count below n can be changed\n"); fprintf(stderr, " (that is, count >= n are correct mers)\n"); fprintf(stderr, " -evidence n mers with count at least n will be used for changes\n"); fprintf(stderr, "\n"); fprintf(stderr, " -mC adapter.fasta screen for these adapter sequences\n"); fprintf(stderr, " -mCillumina screen for common Illumina adapter sequences\n"); fprintf(stderr, " -mC454 screen for common 454 adapter and linker sequences\n"); fprintf(stderr, "\n"); fprintf(stderr, " -endtrim (undocumented)\n"); fprintf(stderr, " -notrimming do only correction, no trimming\n"); fprintf(stderr, " -discardzero trash the whole read if coverage drops to zero in the middle\n"); fprintf(stderr, " -discardimperfect trash the whole read if coverage isn't perfect\n"); fprintf(stderr, " -notrimimperfect do NOT trim off ends that make the coverage imperfect\n"); fprintf(stderr, " -endtrimqv Q trim ends of reads if they are below qv Q (Sanger encoded; default '2')\n"); fprintf(stderr, "\n"); fprintf(stderr, " -NM do NOT correct mismatch errors\n"); fprintf(stderr, " -NI do NOT correct indel errors\n"); fprintf(stderr, "\n"); fprintf(stderr, " -t T use T CPU cores\n"); fprintf(stderr, "\n"); fprintf(stderr, " -v report progress to stderr\n"); fprintf(stderr, " -V report trimming evidence to stdout (more -V -> more reports)\n"); fprintf(stderr, "\n"); exit(1); } g->initialize(); gkRead fr; #if 0 // DEBUG, non-threaded version. 
speedCounter SC(" Trimming: %11.0f reads -- %7.5f reads/second\r", 1.0, 0x1fff, true); mertrimThreadData *t = new mertrimThreadData(g); g->tBgn = 140222; g->tCur = 140222; g->tEnd = 140222; mertrimComputation *s = (mertrimComputation *)mertrimReader(g); while (s) { mertrimWorker(g, t, s); mertrimWriter(g, s); SC.tick(); s = (mertrimComputation *)mertrimReader(g); } delete t; #else // PRODUCTION, threaded version sweatShop *ss = new sweatShop(mertrimReader, mertrimWorker, mertrimWriter); ss->setLoaderQueueSize(16384); ss->setWriterQueueSize(1024); ss->setNumberOfWorkers(g->numThreads); for (uint32 w=0; w<g->numThreads; w++) ss->setThreadData(w, new mertrimThreadData(g)); // these leak ss->run(g, g->beVerbose); // true == verbose #endif delete g; fprintf(stderr, "\nSuccess! Bye.\n"); return(0); } canu-1.6/src/merTrim/merTrim.mk000066400000000000000000000010301314437614700164700ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := merTrim SOURCES := merTrim.C SRC_INCDIRS := .. ../AS_UTL ../stores ../meryl ../meryl/libleaff ../meryl/libkmer TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lleaff -lcanu TGT_PREREQS := libleaff.a libcanu.a SUBMAKEFILES := canu-1.6/src/merTrim/merTrimAdapter.C000066400000000000000000000573441314437614700175660ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. 
* * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_MER/merTrimAdapter.C * * Modifications by: * * Brian P. Walenz from 2012-MAY-10 to 2013-AUG-01 * are Copyright 2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "bio++.H" #include "sweatShop.H" #include "existDB.H" #include "positionDB.H" #include "libmeryl.H" #include "logMsg.H" // Size of the begin/end substring to use when creating end-to-end sequences. This must be larger // than 2*merSize-1, so there is a kmer spanning the junction on both sides. Larger is not a // problem; it slightly slows down the table build. 
// #define KSIZE 32 char * createAdapterString(bool adapIllumina, bool adap454) { uint32 na = 0; uint32 adapterMax = 256; const char *adapterNam[256]; const char *adapterSeq[256]; if (adapIllumina) { adapterNam[na] = "Illumina Single End Adapter 1"; adapterSeq[na++] = "ACACTCTTTCCCTACACGACGCTGTTCCATCT"; adapterNam[na] = "Illumina Single End Adapter 2"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT"; adapterNam[na] = "Illumina Single End PCR Primer 1"; adapterSeq[na++] = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"; adapterNam[na] = "Illumina Single End PCR Primer 2"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT"; adapterNam[na] = "Illumina Single End Sequencing Primer"; adapterSeq[na++] = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"; adapterNam[na] = "Illumina Paired End Adapter 1"; adapterSeq[na++] = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"; adapterNam[na] = "Illumina Paired End Adapter 2"; adapterSeq[na++] = "CTCGGCATTCCTGCTGAACCGCTCTTCCGATCT"; adapterNam[na] = "Illumina Paried End PCR Primer 1"; adapterSeq[na++] = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"; adapterNam[na] = "Illumina Paired End PCR Primer 2"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT"; adapterNam[na] = "Illumina Paried End Sequencing Primer 1"; adapterSeq[na++] = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"; adapterNam[na] = "Illumina Paired End Sequencing Primer 2"; adapterSeq[na++] = "CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT"; adapterNam[na] = "Illumina DpnII expression Adapter 1"; adapterSeq[na++] = "ACAGGTTCAGAGTTCTACAGTCCGAC"; adapterNam[na] = "Illumina DpnII expression Adapter 2"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGA"; adapterNam[na] = "Illumina DpnII expression PCR Primer 1"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGA"; adapterNam[na] = "Illumina DpnII expression PCR Primer 2"; adapterSeq[na++] = "AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA"; adapterNam[na] = "Illumina DpnII expression Sequencing Primer"; 
adapterSeq[na++] = "CGACAGGTTCAGAGTTCTACAGTCCGACGATC"; adapterNam[na] = "Illumina NlaIII expression Adapter 1"; adapterSeq[na++] = "ACAGGTTCAGAGTTCTACAGTCCGACATG"; adapterNam[na] = "Illumina NlaIII expression Adapter 2"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGA"; adapterNam[na] = "Illumina NlaIII expression PCR Primer 1"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGA"; adapterNam[na] = "Illumina NlaIII expression PCR Primer 2"; adapterSeq[na++] = "AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA"; adapterNam[na] = "Illumina NlaIII expression Sequencing Primer"; adapterSeq[na++] = "CCGACAGGTTCAGAGTTCTACAGTCCGACATG"; adapterNam[na] = "Illumina Small RNA Adapter 1"; adapterSeq[na++] = "GTTCAGAGTTCTACAGTCCGACGATC"; adapterNam[na] = "Illumina Small RNA Adapter 2"; adapterSeq[na++] = "TCGTATGCCGTCTTCTGCTTGT"; adapterNam[na] = "Illumina Small RNA RT Primer"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGA"; adapterNam[na] = "Illumina Small RNA PCR Primer 1"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGA"; adapterNam[na] = "Illumina Small RNA PCR Primer 2"; adapterSeq[na++] = "AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA"; adapterNam[na] = "Illumina Small RNA Sequencing Primer"; adapterSeq[na++] = "CGACAGGTTCAGAGTTCTACAGTCCGACGATC"; adapterNam[na] = "Illumina Multiplexing Adapter 1"; adapterSeq[na++] = "GATCGGAAGAGCACACGTCT"; adapterNam[na] = "Illumina Multiplexing Adapter 2"; adapterSeq[na++] = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"; adapterNam[na] = "Illumina Multiplexing PCR Primer 1.01"; adapterSeq[na++] = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"; adapterNam[na] = "Illumina Multiplexing PCR Primer 2.01"; adapterSeq[na++] = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"; adapterNam[na] = "Illumina Multiplexing Read1 Sequencing Primer"; adapterSeq[na++] = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"; adapterNam[na] = "Illumina Multiplexing Index Sequencing Primer"; adapterSeq[na++] = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"; adapterNam[na] = "Illumina Multiplexing Read2 Sequencing Primer"; 
adapterSeq[na++] = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"; adapterNam[na] = "Illumina PCR Primer Index 1"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTC"; adapterNam[na] = "Illumina PCR Primer Index 2"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTC"; adapterNam[na] = "Illumina PCR Primer Index 3"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTC"; adapterNam[na] = "Illumina PCR Primer Index 4"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTC"; adapterNam[na] = "Illumina PCR Primer Index 5"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTC"; adapterNam[na] = "Illumina PCR Primer Index 6"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTC"; adapterNam[na] = "Illumina PCR Primer Index 7"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTC"; adapterNam[na] = "Illumina PCR Primer Index 8"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTC"; adapterNam[na] = "Illumina PCR Primer Index 9"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTC"; adapterNam[na] = "Illumina PCR Primer Index 10"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTTC"; adapterNam[na] = "Illumina PCR Primer Index 11"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTC"; adapterNam[na] = "Illumina PCR Primer Index 12"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTC"; adapterNam[na] = "Illumina DpnII Gex Adapter 1"; adapterSeq[na++] = "GATCGTCGGACTGTAGAACTCTGAAC"; adapterNam[na] = "Illumina DpnII Gex Adapter 1.01"; adapterSeq[na++] = "ACAGGTTCAGAGTTCTACAGTCCGAC"; adapterNam[na] = "Illumina DpnII Gex Adapter 2"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGA"; adapterNam[na] = "Illumina DpnII Gex Adapter 2.01"; adapterSeq[na++] = "TCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "Illumina DpnII Gex PCR Primer 1"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGA"; adapterNam[na] = "Illumina DpnII Gex PCR Primer 2"; adapterSeq[na++] = 
"AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA"; adapterNam[na] = "Illumina DpnII Gex Sequencing Primer"; adapterSeq[na++] = "CGACAGGTTCAGAGTTCTACAGTCCGACGATC"; adapterNam[na] = "Illumina NlaIII Gex Adapter 1.01"; adapterSeq[na++] = "TCGGACTGTAGAACTCTGAAC"; adapterNam[na] = "Illumina NlaIII Gex Adapter 1.02"; adapterSeq[na++] = "ACAGGTTCAGAGTTCTACAGTCCGACATG"; adapterNam[na] = "Illumina NlaIII Gex Adapter 2.01"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGA"; adapterNam[na] = "Illumina NlaIII Gex Adapter 2.02"; adapterSeq[na++] = "TCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "Illumina NlaIII Gex PCR Primer 1"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGA"; adapterNam[na] = "Illumina NlaIII Gex PCR Primer 2"; adapterSeq[na++] = "AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA"; adapterNam[na] = "Illumina NlaIII Gex Sequencing Primer"; adapterSeq[na++] = "CCGACAGGTTCAGAGTTCTACAGTCCGACATG"; adapterNam[na] = "Illumina Small RNA RT Primer"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGA"; adapterNam[na] = "Illumina 5p RNA Adapter"; adapterSeq[na++] = "GTTCAGAGTTCTACAGTCCGACGATC"; adapterNam[na] = "Illumina RNA Adapter1"; adapterSeq[na++] = "TCGTATGCCGTCTTCTGCTTGT"; adapterNam[na] = "Illumina Small RNA 3p Adapter 1"; adapterSeq[na++] = "ATCTCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "Illumina Small RNA PCR Primer 1"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGA"; adapterNam[na] = "Illumina Small RNA PCR Primer 2"; adapterSeq[na++] = "AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA"; adapterNam[na] = "Illumina Small RNA Sequencing Primer"; adapterSeq[na++] = "CGACAGGTTCAGAGTTCTACAGTCCGACGATC"; adapterNam[na] = "TruSeq Universal Adapter"; adapterSeq[na++] = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"; adapterNam[na] = "TruSeq Adapter, Index 1"; adapterSeq[na++] = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "TruSeq Adapter, Index 2"; adapterSeq[na++] = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "TruSeq 
Adapter, Index 3"; adapterSeq[na++] = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "TruSeq Adapter, Index 4"; adapterSeq[na++] = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "TruSeq Adapter, Index 5"; adapterSeq[na++] = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "TruSeq Adapter, Index 6"; adapterSeq[na++] = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "TruSeq Adapter, Index 7"; adapterSeq[na++] = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "TruSeq Adapter, Index 8"; adapterSeq[na++] = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTTGAATCTCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "TruSeq Adapter, Index 9"; adapterSeq[na++] = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "TruSeq Adapter, Index 10"; adapterSeq[na++] = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "TruSeq Adapter, Index 11"; adapterSeq[na++] = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "TruSeq Adapter, Index 12"; adapterSeq[na++] = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG"; adapterNam[na] = "Illumina RNA RT Primer"; adapterSeq[na++] = "GCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "Illumina RNA PCR Primer"; adapterSeq[na++] = "AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGA"; adapterNam[na] = "RNA PCR Primer, Index 1"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 2"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 3"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 4"; adapterSeq[na++] = 
"CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 5"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 6"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 7"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 8"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 9"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 10"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 11"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 12"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 13"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATTTGACTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 14"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGGAACTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 15"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 16"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGGACGGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 17"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCTCTACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 18"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGCGGACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 19"; adapterSeq[na++] = 
"CAAGCAGAAGACGGCATACGAGATTTTCACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 20"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGGCCACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 21"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCGAAACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 22"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCGTACGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 23"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCCACTCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 24"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGCTACCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 25"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATATCAGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 26"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGCTCATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 27"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATAGGAATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 28"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCTTTTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 29"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATTAGTTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 30"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCCGGTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 31"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATATCGTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 32"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATTGAGTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 33"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCGCCTGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 34"; adapterSeq[na++] = 
"CAAGCAGAAGACGGCATACGAGATGCCATGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 35"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATAAAATGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 36"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATTGTTGGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 37"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATATTCCGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 38"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATAGCTAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 39"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGTATAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 40"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATTCTGAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 41"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGTCGTCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 42"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCGATTAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 43"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGCTGTAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 44"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATATTATAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 45"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATGAATGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 46"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATTCGGGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 47"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATCTTCGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; adapterNam[na] = "RNA PCR Primer, Index 48"; adapterSeq[na++] = "CAAGCAGAAGACGGCATACGAGATTGCCGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA"; // Cre-Lox from Van Nieuwerburgh, et al. 
"Illumina mate-paired DNA sequencing-library preparation // using Cre-Lox recombination", Nucleic Acids Research, 2012, Vol 40, No. 3 adapterNam[na] = "CreLox 5'3'"; adapterSeq[na++] = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"; adapterNam[na] = "CreLox 3'5'"; adapterSeq[na++] = "GCATTATTGAAGCATATCGTATGTAATATGCTTCAATATGCT"; } if (0) { adapterNam[na] = "ABI Dynabead EcoP Oligo"; adapterSeq[na++] = "CTGATCTAGAGGTACCGGATCCCAGCAGT"; adapterNam[na] = "ABI Solid3 Adapter A"; adapterSeq[na++] = "CTGCCCCGGGTTCCTCATTCTCTCAGCAGCATG"; adapterNam[na] = "ABI Solid3 Adapter B"; adapterSeq[na++] = "CCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGAT"; adapterNam[na] = "ABI Solid3 5' AMP Primer"; adapterSeq[na++] = "CCACTACGCCTCCGCTTTCCTCTCTATG"; adapterNam[na] = "ABI Solid3 3' AMP Primer"; adapterSeq[na++] = "CTGCCCCGGGTTCCTCATTCT"; adapterNam[na] = "ABI Solid3 EF1 alpha Sense Primer"; adapterSeq[na++] = "CATGTGTGTTGAGAGCTTC"; adapterNam[na] = "ABI Solid3 EF1 alpha Antisense Primer"; adapterSeq[na++] = "GAAAACCAAAGTGGTCCAC"; adapterNam[na] = "ABI Solid3 GAPDH Forward Primer"; adapterSeq[na++] = "TTAGCACCCCTGGCCAAGG"; adapterNam[na] = "ABI Solid3 GAPDH Reverse Primer"; adapterSeq[na++] = "CTTACTCCTTGGAGGCCATG"; } if (adap454) { adapterNam[na] = "454 FLX Linker"; adapterSeq[na++] = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"; adapterNam[na] = "454 Titanium Linker"; adapterSeq[na++] = "TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG"; adapterNam[na] = "454 AdaptorA"; adapterSeq[na++] = "CTGAGACAGGGAGGGAACAGATGGGACACGCAGGGATGAGATGG"; adapterNam[na] = "454 AdaptorB"; adapterSeq[na++] = "CTGAGACACGCAACAGGGGATAGGCAAGGCACACAGGGGATAGG"; } assert(na < 255); // Build a new sequence, from every adapter above, then every pair of adapter, F-F, F-R, R-F, R-R. // // For the pairs, we only care about the junction, so can limit to the two kmers on either side. 
uint32 adapterListMax = 16 * 1024 * 1024; char *adapterList = new char [adapterListMax]; char *adapterEnd = adapterList; memset(adapterList, 0, sizeof(char) * 16 * 1024 * 1024); int32 ladapter[256]; memset(ladapter, 0, sizeof(int32) * 256); char fadapterA[256]; char radapterA[256]; char fadapterB[256]; char radapterB[256]; for (uint32 a=0; agkStore_enableClearRange(AS_READ_CLEAR_OBTINITIAL); gkpStore->gkStore_enableClearRange(AS_READ_CLEAR_TNT); gkpStore->gkStore_metadataCaching(true); bool tntEnabled = false; // Over every result file, read each line, change the clear range. gkFragment gkf; mertrimResult res; fgets(resultFileName, FILENAME_MAX, listFile); while (!feof(listFile)) { chomp(resultFileName); errno = 0; FILE *resultFile = fopen(resultFileName, "r"); if (errno) fprintf(stderr, "Failed to open result file '%s' for reading: %s\n", resultFileName, strerror(errno)), exit(1); while (res.readResult(resultFile)) { gkpStore->gkStore_getFragment(res.readIID, &gkf, GKFRAGMENT_INF); if (res.chimer) gkf.gkFragment_setClearRegion(res.chmBgn, res.chmEnd, AS_READ_CLEAR_TNT); gkf.gkFragment_setClearRegion(res.clrBgn, res.clrEnd, AS_READ_CLEAR_OBTINITIAL); if (res.deleted) gkpStore->gkStore_delFragment(res.readIID); else gkpStore->gkStore_setFragment(&gkf); res.print(logFile); } fclose(resultFile); fgets(resultFileName, FILENAME_MAX, listFile); } fclose(listFile); fclose(logFile); gkpStore->gkStore_close(); } void merTrimShow(char *gkpStoreName, char *resultName) { errno = 0; FILE *resultFile = fopen(resultName, "r"); if (errno) fprintf(stderr, "Failed to open result file '%s' for reading: %s\n", resultName, strerror(errno)), exit(1); mertrimResult res; while (res.readResult(resultFile)) res.print(stdout); } int main(int argc, char **argv) { char *gkpStoreName = NULL; char *listName = NULL; char *logName = NULL; char *resultName = NULL; int arg = 1; int err = 0; while (arg < argc) { if (strcmp(argv[arg], "-g") == 0) { gkpStoreName = argv[++arg]; } else if 
(strcmp(argv[arg], "-L") == 0) {
      listName = argv[++arg];

    } else if (strcmp(argv[arg], "-l") == 0) {
      logName = argv[++arg];

    } else if (strcmp(argv[arg], "-d") == 0) {
      resultName = argv[++arg];

    } else {
      err++;
    }

    arg++;
  }

  if (((gkpStoreName == NULL) && (listName != NULL)) ||
      ((gkpStoreName != NULL) && (listName == NULL)))
    err++;
  if ((listName == NULL) && (resultName == NULL))
    err++;
  if ((listName != NULL) && (resultName != NULL))
    err++;

  if (err) {
    fprintf(stderr, "usage: %s -L merTrimOutputList -g gkpStore [-l output.log]\n", argv[0]);
    fprintf(stderr, "       %s -d merTrimOutput\n", argv[0]);
    fprintf(stderr, "\n");
    fprintf(stderr, " The first form will read a list of merTrim output names from\n");
    fprintf(stderr, " merTrimOutputList, and apply the results to gkpStore.\n");
    fprintf(stderr, "\n");
    fprintf(stderr, " The second form will read a single merTrimOutput file and decode\n");
    fprintf(stderr, " the results to stdout.\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "\n");

    if (((gkpStoreName == NULL) && (listName != NULL)) ||
        ((gkpStoreName != NULL) && (listName == NULL)))
      fprintf(stderr, "ERROR: First form needs both -L and -g.\n");
    if (gkpStoreName == NULL)
      fprintf(stderr, "ERROR: No gatekeeper store supplied with -g.\n");
    if ((listName == NULL) && (resultName == NULL))
      fprintf(stderr, "ERROR: Exactly one of -L and -d must be specified.\n");
    if ((listName != NULL) && (resultName != NULL))
      fprintf(stderr, "ERROR: Only one of -L and -d can be specified.\n");
    fprintf(stderr, "\n");
    exit(1);
  }

  if (listName)
    merTrimApply(gkpStoreName, listName, logName);
  if (resultName)
    merTrimShow(gkpStoreName, resultName);

  return(0);
}
canu-1.6/src/merTrim/merTrimResult.H

/******************************************************************************
 *
 * This file is part of canu, a software program that assembles whole-genome
 * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_MER/merTrimResult.H * * Modifications by: * * Brian P. Walenz from 2011-AUG-22 to 2013-AUG-01 * are Copyright 2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #ifndef MERTRIMRESULT_H #define MERTRIMRESULT_H class mertrimResult { public: mertrimResult() { readIID = 0; deleted = 0; clrBgn = 0; clrEnd = 0; chimer = 0; chmBgn = 0; chmEnd = 0; }; void print(FILE *F) { if (F == NULL) return; if (chimer) fprintf(F, F_U32"\t" F_U32 "\t" F_U32 "\tchimer\t" F_U32 "\t" F_U32 "%s\n", readIID, clrBgn, clrEnd, chmBgn, chmEnd, (deleted) ? "\tdeleted" : ""); else fprintf(F, F_U32"\t" F_U32 "\t" F_U32 "%s\n", readIID, clrBgn, clrEnd, (deleted) ? 
"\tdeleted" : ""); }; void writeResult(FILE *W) { if (W == NULL) return; AS_UTL_safeWrite(W, this, "merTrimResult", sizeof(mertrimResult), 1); }; bool readResult(FILE *R) { if (R == NULL) return(false); if (!feof(R)) AS_UTL_safeRead(R, this, "merTrimResult", sizeof(mertrimResult), 1); return(feof(R) == false); }; uint32 readIID; uint32 deleted; uint32 clrBgn; uint32 clrEnd; uint32 chimer; uint32 chmBgn; uint32 chmEnd; }; #endif // MERTRIMRESULT_H canu-1.6/src/mercy/000077500000000000000000000000001314437614700142265ustar00rootroot00000000000000canu-1.6/src/mercy/mercy-regions.C000066400000000000000000000260301314437614700171160ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_MER/mercy-regions.C * * Modifications by: * * Brian P. Walenz from 2007-MAR-19 to 2013-AUG-01 * are Copyright 2007-2008,2010-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-OCT-09 to 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#include "AS_global.H"
#include "splitToWords.H"
#include "intervalList.H"

//  This reads the assembly frgctg, varctg and merQC badmers, computes
//  the number and location of bad-mer, bad-var regions, and their
//  depth, in contig space.
//
//  File paths are hardcoded.

//  This code ONLY works on 64-bit hardware, but it's easy to fix.

using namespace std;

#include <map>

void
readDepth(char *depthname, map<uint64, intervalList<uint32> *> &lowCoverage) {
  char  line[1024] = {0};

  map<uint64, intervalList<uint32> *>  ILs;

  fprintf(stderr, "Reading depth from '%s'\n", depthname);

  errno = 0;
  FILE *F = fopen(depthname, "r");
  if (errno)
    fprintf(stderr, "failed to open '%s': %s\n", depthname, strerror(errno)), exit(1);

  uint32 i=0;

  fgets(line, 1024, F);
  while (!feof(F)) {
    splitToWords  W(line);

    uint64  uid = strtoul(W[1], 0L, 10);
    uint32  beg = strtoul(W[2], 0L, 10);
    uint32  end = strtoul(W[3], 0L, 10);

    if (beg > end)
      fprintf(stderr, "ERROR: l=" F_U32 " h=" F_U32 "\n", beg, end);

    if (ILs[uid] == 0L)
      ILs[uid] = new intervalList<uint32>;

    ILs[uid]->add(beg, end - beg);
    i++;

    fgets(line, 1024, F);
  }
  fclose(F);

  fprintf(stderr, " " F_U32 " lines.\n", i);

  map<uint64, intervalList<uint32> *>::iterator  it = ILs.begin();
  map<uint64, intervalList<uint32> *>::iterator  ed = ILs.end();

  while (it != ed) {
    lowCoverage[it->first] = new intervalList<uint32>(*it->second);
    delete it->second;
    it->second = 0L;
    it++;
  }
}

void
readVariation(char *depthname, map<uint64, intervalList<uint32> *> &variation) {
  char  line[1024 * 1024] = {0};

  fprintf(stderr, "Reading variation from '%s'\n", depthname);

  errno = 0;
  FILE *F = fopen(depthname, "r");
  if (errno)
    fprintf(stderr, "failed to open '%s': %s\n", depthname, strerror(errno)), exit(1);

  uint32 i=0;

  fgets(line, 1024 * 1024, F);
  while (!feof(F)) {
    splitToWords  W(line);

    uint64  uid = strtoul(W[1], 0L, 10);
    uint32  beg = strtoul(W[2], 0L, 10);
    uint32  end = strtoul(W[3], 0L, 10);

    if (variation[uid] == 0L)
      variation[uid] = new intervalList<uint32>;

    variation[uid]->add(beg, end - beg);
    i++;

    fgets(line, 1024 * 1024, F);
  }
  fclose(F);

  fprintf(stderr, " " F_U32 " lines.\n", i);
}

void
readBadMers(char *depthname,
map*> &badMers) { char line[1024] = {0}; fprintf(stderr, "Reading badMers from '%s'\n", depthname); errno = 0; FILE *F = fopen(depthname, "r"); if (errno) fprintf(stderr, "failed to open '%s': %s\n", depthname, strerror(errno)), exit(1); uint32 i=0; fgets(line, 1024, F); while (!feof(F)) { splitToWords W(line); // Change every non-digit to a space in the first word. for (uint32 z=strlen(W[0])-1; z--; ) if (!isdigit(W[0][z])) W[0][z] = ' '; uint64 uid = strtoul(W[0], 0L, 10); uint32 beg = strtoul(W[3], 0L, 10); uint32 end = strtoul(W[4], 0L, 10); if (badMers[uid] == 0L) badMers[uid] = new intervalList; badMers[uid]->add(beg, end - beg); i++; fgets(line, 1024, F); } fclose(F); fprintf(stderr, " " F_U32 " lines.\n", i); } int main(int argc, char **argv) { map*> badMers; map*> variation; map*> lowCoverage; bool showDepthIntersect = false; bool showVariantIntersect = false; bool showVarDepthIntersect = false; argc = AS_configure(argc, argv); int arg=1; while (arg < argc) { if (strcmp(argv[arg], "-D") == 0) { } else if (strcmp(argv[arg], "-pd") == 0) { showDepthIntersect = true; } else if (strcmp(argv[arg], "-pv") == 0) { showVariantIntersect = true; } else if (strcmp(argv[arg], "-pvd") == 0) { showVarDepthIntersect = true; } else { fprintf(stderr, "usage: %s [-D debugfile] [-pd] [-pv] [-pvd]\n", argv[0]); fprintf(stderr, " -pd print bad mers regions isect depth\n"); fprintf(stderr, " -pv print bad mers regions isect variants\n"); fprintf(stderr, " -pvd print bad mers regions isect both variants and depth\n"); exit(1); } arg++; } #if 1 // HuRef6, in the assembly directory. // readDepth ("/project/huref6/assembly/h6/9-terminator/h6.posmap.frgctg", lowCoverage); readVariation("/project/huref6/assembly/h6/9-terminator/h6.posmap.varctg", variation); readBadMers ("/project/huref6/assembly/h6-mer-validation/h6-ms22-allfrags-normalcontigs.badmers.0.singlecontig.zerofrag.badmers", badMers); #endif #if 0 // HuRef6, ws=25, in the assembly directory. 
// readDepth ("/project/huref6/assembly/h6/9-terminator-ws25/h6.posmap.frgctg", lowCoverage); readVariation("/project/huref6/assembly/h6/9-terminator-ws25/h6.posmap.varctg", variation); readBadMers ("/project/huref6/assembly/h6-mer-validation/h6-version4-ws25/h6-ms22-allfrags-normalcontigs.badmers.0.singlecontig.zerofrag.badmers", badMers); #endif #if 0 // Our scratch huref // readDepth ("/project/huref6/redo_consensus-gennady/mer-validation/h6tmp.posmap.frgctg", lowCoverage); readVariation("/project/huref6/redo_consensus-gennady/mer-validation/h6tmp.posmap.varctg", variation); readBadMers ("/project/huref6/redo_consensus-gennady/mer-validation/h6tmp-ms22-allfrags-allcontigs.badmers.0.singlecontig.zerofrag.badmers", badMers); #endif uint32 badBegDepth[1024] = {0}; uint32 badEndDepth[1024] = {0}; uint32 badDepth[32][32]; for (uint32 i=0; i<32; i++) for (uint32 j=0; j<32; j++) badDepth[i][j] = 0; map*>::iterator it = badMers.begin(); map*>::iterator ed = badMers.end(); while (it != ed) { uint64 uid = it->first; intervalList *Iv = variation[uid]; intervalList *Ib = badMers[uid]; intervalList *Ii = 0L; intervalList *Id = lowCoverage[uid]; if (Iv) Iv->merge(); if (Ib) Ib->merge(); if (Iv && Ib) { Ii = new intervalList; Ii->intersect(*Iv, *Ib); } if (Ii) { uint32 ii = 0; uint32 id = 0; while ((ii < Ii->numberOfIntervals()) && (id < Id->numberOfIntervals())) { // We want to count the number of times a badmer region // begins/ends in some depth. 
//fprintf(stderr, "testing beg " F_U32 " " F_U32 " -- " F_U32 " " F_U32 "\n", // Ii->lo(ii), Ii->hi(ii), Id->lo(id), Id->hi(id)); uint32 beg = 0; uint32 end = 0; // Low points are not allowed to be equal to high points, skip to the next while ((id < Id->numberOfIntervals()) && (Id->hi(id) <= Ii->lo(ii))) { id++; //fprintf(stderr, "testing beg (m) " F_U32 " " F_U32 " -- " F_U32 " " F_U32 "\n", // Ii->lo(ii), Ii->hi(ii), Id->lo(id), Id->hi(id)); } if (id < Id->numberOfIntervals()) { uint32 lo = Id->lo(id); uint32 hi = Id->hi(id); // Low points are not allowed to be equal to high points. if ((lo <= Ii->lo(ii)) && (Ii->lo(ii) < hi)) { beg = Id->depth(id); } else { fprintf(stderr, "failed to find begin " F_U32 " " F_U32 " -- " F_U32 " " F_U32 " " F_U32 "\n", Ii->lo(ii), Ii->hi(ii), Id->lo(id), Id->hi(id), Id->depth(id)); if (id > 0) fprintf(stderr, " " F_U32 " " F_U32 " -- " F_U32 " " F_U32 " " F_U32 "\n", Ii->lo(ii), Ii->hi(ii), Id->lo(id-1), Id->hi(id-1), Id->depth(id-1)); //exit(1); } } //fprintf(stderr, "testing end " F_U64 " " F_U64 " -- " F_U64 " " F_U64 "\n", // Ii->lo(ii), Ii->hi(ii), Id->lo(id), Id->hi(id)); // High points can be equal. while ((id < Id->numberOfIntervals()) && (Id->hi(id) < Ii->hi(ii))) { id++; //fprintf(stderr, "testing end (m) " F_U64 " " F_U64 " -- " F_U64 " " F_U64 "\n", // Ii->lo(ii), Ii->hi(ii), Id->lo(id), Id->hi(id)); } if (id < Id->numberOfIntervals()) { uint32 lo = Id->lo(id); uint32 hi = Id->hi(id); // High points aren't allowed to be equal to lo, but can be equal to hi. 
if ((lo < Ii->hi(ii)) && (Ii->hi(ii) <= hi)) { end = Id->depth(id); } else { fprintf(stderr, "failed to find end " F_U32 " " F_U32 " -- " F_U32 " " F_U32 " " F_U32 "\n", Ii->lo(ii), Ii->hi(ii), Id->lo(id), Id->hi(id), Id->depth(id)); if (id > 0) fprintf(stderr, " " F_U32 " " F_U32 " -- " F_U32 " " F_U32 " " F_U32 "\n", Ii->lo(ii), Ii->hi(ii), Id->lo(id-1), Id->hi(id-1), Id->depth(id-1)); //exit(1); } } badBegDepth[beg]++; badEndDepth[end]++; fprintf(stdout, F_U64"\t" F_U32 "\t" F_U32 "\tdepth=" F_U32 "," F_U32 "\n", uid, Ii->lo(ii), Ii->hi(ii), beg, end); if ((beg < 32) && (end < 32)) badDepth[beg][end]++; ii++; } } it++; } uint32 bb = 0; uint32 be = 0; for (uint32 x=0; x<32; x++) { fprintf(stdout, F_U32"\t" F_U32 "\t" F_U32 "\n", x, badBegDepth[x], badEndDepth[x]); bb += badBegDepth[x]; be += badEndDepth[x]; } fprintf(stdout, "total\t" F_U32 "\t" F_U32 "\n", bb, be); for (uint32 i=0; i<30; i++) { for (uint32 j=0; j<30; j++) fprintf(stdout, "%5u", badDepth[i][j]); fprintf(stdout, "\n"); } return(0); } canu-1.6/src/mercy/mercy.C000066400000000000000000000340751314437614700154620ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_MER/mercy.C * * Modifications by: * * Brian P. Walenz from 2007-MAR-19 to 2014-APR-11 * are Copyright 2007-2009,2011,2013-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. 
Walenz on 2014-DEC-05
 *    are Copyright 2014 Battelle National Biodefense Institute, and
 *    are subject to the BSD 3-Clause License
 *
 *  Brian P. Walenz beginning on 2016-JAN-11
 *    are a 'United States Government Work', and
 *    are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "AS_global.H"
#include "libmeryl.H"

//  The categories depend on the type of input (fragments or contigs):
//
//  0 -- no count, mer not present
//  1 -- single copy
//  2 --   2 ->  10 copies (contigs)  --  2 ->   2mode copies (frags)
//  3 --  11 -> 100 copies (contigs)  --    ->  10mode copies (frags)
//  4 -- 101+       copies (contigs)  --    -> 100mode copies (frags)
//  5 --                                   -> infinity copies (frags)
//
//  You'll also need to modify compare() and output() if you change this.

#define NUMCATEGORIES 6

//  The output files are global for convenience.  Otherwise, we'd be passing
//  them to compare() for every single mer.
// bool dumpFlag = false; FILE *dumpSCZF = 0L; FILE *dumpMCZF = 0L; FILE *dumpMCSF = 0L; FILE *dumpMCMF = 0L; char merstring[1024]; uint32 findMode(char *name) { merylStreamReader *M = new merylStreamReader(name); uint32 *H = new uint32 [16384]; fprintf(stderr, "Finding mode of '%s'\n", name); for (uint32 i=0; i<16384; i++) H[i] = 0; while (M->validMer()) { if (M->theCount() < 16384) H[M->theCount()]++; M->nextMer(); } uint32 mi = 16; for (uint32 i=mi; i<16384; i++) if (H[i] > H[mi]) mi = i; fprintf(stderr, "Mode of '%s' is " F_U32 "\n", name, mi); return(mi); } void compare(merylStreamReader *F, merylStreamReader *C, kMer &minmer, uint32 mode, uint32 R[NUMCATEGORIES][NUMCATEGORIES]) { uint32 Ftype = 0; uint32 Ctype = 0; kMer Fmer = F->theFMer(); kMer Cmer = C->theFMer(); uint32 Fcnt = F->theCount(); uint32 Ccnt = C->theCount(); if (Fcnt == 0) Ftype = 0; else if (Fcnt == 1) Ftype = 1; else if (Fcnt <= 2*mode) Ftype = 2; else if (Fcnt <= 10*mode) Ftype = 3; else if (Fcnt <= 100*mode) Ftype = 4; else Ftype = 5; if (Ccnt == 0) Ctype = 0; else if (Ccnt == 1) Ctype = 1; else if (Ccnt <= 10) Ctype = 2; else if (Ccnt <= 100) Ctype = 3; else Ctype = 4; // If the mer isn't valid, we hit the end of the file, and the mer // thus (obviously) isn't in the file. // if (F->validMer() == false) Ftype = 0; if (C->validMer() == false) Ctype = 0; // If either type is 0, we're done, but only increment the count if // this mer is the minmer. // if ((Ftype == 0) || (Ctype == 0)) { if (((Ftype == 0) && (Cmer == minmer)) || ((Ctype == 0) && (Fmer == minmer))) { R[Ftype][Ctype]++; // Save the mer if it's in contigs, but not fragments. if (dumpFlag) if (Ftype == 0) if (Ctype == 1) fprintf(dumpSCZF, ">" F_U32 "\n%s\n", Ccnt, Cmer.merToString(merstring)); else fprintf(dumpMCZF, ">" F_U32 "\n%s\n", Ccnt, Cmer.merToString(merstring)); } return; } // If the mers don't agree, we're also done. If either is the // minmer, note that we saw it. 
// if (Fmer != Cmer) { if (Fmer == minmer) R[Ftype][0]++; if (Cmer == minmer) { R[0][Ctype]++; // Again, save the mer since it's in contigs, but not fragments. if (dumpFlag) if (Ctype == 1) fprintf(dumpSCZF, ">" F_U32 "\n%s\n", Ccnt, Cmer.merToString(merstring)); else fprintf(dumpMCZF, ">" F_U32 "\n%s\n", Ccnt, Cmer.merToString(merstring)); } return; } // If we're not the minmer, we're done. if (Fmer != minmer) return; // Otherwise, the mers are in both inputs R[Ftype][Ctype]++; // Save the mer if it's in contigs "more" than if in fragments. if (dumpFlag) { if (Ftype < Ctype) if (Ctype == 2) fprintf(dumpMCSF, ">" F_U32 "\n%s\n", Ccnt, Cmer.merToString(merstring)); else fprintf(dumpMCMF, ">" F_U32 "\n%s\n", Ccnt, Cmer.merToString(merstring)); if ((Ftype == 0) && (Ctype == 1)) fprintf(dumpSCZF, ">" F_U32 "\n%s\n", Ccnt, Cmer.merToString(merstring)); } } void output(char *title, uint32 mode, uint32 R[NUMCATEGORIES][NUMCATEGORIES]) { fprintf(stdout, "\n\n%s\n", title); fprintf(stdout, "(frags) | zero | one | <= 10 | <= 100 | <= inf | (contigs)\n"); for (uint32 i=0; i<6; i++) { switch (i) { case 0: fprintf(stdout, "zero "); break; case 1: fprintf(stdout, "one "); break; case 2: fprintf(stdout, "<= 2mode "); break; case 3: fprintf(stdout, "<= 10mode "); break; case 4: fprintf(stdout, "<= 100mode "); break; case 5: fprintf(stdout, "<= inf "); break; default: fprintf(stdout, "????????? 
"); break; } for (uint32 j=0; j<5; j++) fprintf(stdout, "%12" F_U32P, R[i][j]); fprintf(stdout, "\n"); } } int main(int argc, char **argv) { merylStreamReader *AF = 0L; merylStreamReader *TF = 0L; merylStreamReader *AC = 0L; merylStreamReader *DC = 0L; merylStreamReader *CO = 0L; uint32 AFmode = 0; uint32 TFmode = 0; char dumpSCZFname[FILENAME_MAX] = {0}; // single contig, zero frags char dumpMCZFname[FILENAME_MAX] = {0}; // low contig, zero frags char dumpMCSFname[FILENAME_MAX] = {0}; // medium contig, low frags char dumpMCMFname[FILENAME_MAX] = {0}; // everything else, contig > frags bool beVerbose = false; argc = AS_configure(argc, argv); int arg=1; while (arg < argc) { if (strcmp(argv[arg], "-af") == 0) { // All frags ++arg; AFmode = findMode(argv[arg]); AF = new merylStreamReader(argv[arg]); AF->nextMer(); } else if (strcmp(argv[arg], "-tf") == 0) { // Trimmed frags ++arg; TFmode = findMode(argv[arg]); TF = new merylStreamReader(argv[arg]); TF->nextMer(); } else if (strcmp(argv[arg], "-ac") == 0) { // All contigs AC = new merylStreamReader(argv[++arg]); AC->nextMer(); } else if (strcmp(argv[arg], "-dc") == 0) { // Degenerate contigs DC = new merylStreamReader(argv[++arg]); DC->nextMer(); } else if (strcmp(argv[arg], "-co") == 0) { // Contigs CO = new merylStreamReader(argv[++arg]); CO->nextMer(); } else if (strcmp(argv[arg], "-dump") == 0) { arg++; dumpFlag = true; snprintf(dumpSCZFname, FILENAME_MAX, "%s.0.singlecontig.zerofrag.fasta", argv[arg]); snprintf(dumpMCZFname, FILENAME_MAX, "%s.1.multiplecontig.zerofrag.fasta", argv[arg]); snprintf(dumpMCSFname, FILENAME_MAX, "%s.2.multiplecontig.lowfrag.fasta", argv[arg]); snprintf(dumpMCMFname, FILENAME_MAX, "%s.3.multiplecontig.multiplefrag.fasta", argv[arg]); } else if (strcmp(argv[arg], "-v") == 0) { beVerbose = true; } else { fprintf(stderr, "unknown option '%s'\n", argv[arg]); } arg++; } if ((AF == 0L) && (TF == 0L) && (AC == 0L) && (DC == 0L) && (CO == 0L)) { fprintf(stderr, "usage: %s [opts] [-v] [-dump 
prefix]\n", argv[0]); fprintf(stderr, "At least one fragcounts and one contigcounts are needed.\n"); fprintf(stderr, " -af | -tf fragcounts\n"); fprintf(stderr, " -ac | -dc | -co contigcounts \n"); fprintf(stderr, "Dumping is probably only useful with exactly one frag and\n"); fprintf(stderr, "one contig, but I'll let you do it with any number.\n"); exit(1); } if ((AF == 0L) && (TF == 0L)) { fprintf(stderr, "ERROR - need at least one of -af, -tf\n"); exit(1); } if ((AC == 0L) && (DC == 0L) && (CO == 0L)) { fprintf(stderr, "ERROR - need at least one of -ac, -dc, -co\n"); exit(1); } // Check mersizes. // uint32 merSize = 0; uint32 ms[5] = { 0 }; if (AF) merSize = ms[0] = AF->merSize(); if (TF) merSize = ms[1] = TF->merSize(); if (AC) merSize = ms[2] = AC->merSize(); if (DC) merSize = ms[3] = DC->merSize(); if (CO) merSize = ms[4] = CO->merSize(); bool differ = false; if ((ms[0] > 0) && (ms[0] != merSize)) differ = true; if ((ms[1] > 0) && (ms[1] != merSize)) differ = true; if ((ms[2] > 0) && (ms[2] != merSize)) differ = true; if ((ms[3] > 0) && (ms[3] != merSize)) differ = true; if ((ms[4] > 0) && (ms[4] != merSize)) differ = true; if (differ) { fprintf(stderr, "error: mer size differ.\n"); fprintf(stderr, " AF - " F_U32 "\n", ms[0]); fprintf(stderr, " TF - " F_U32 "\n", ms[1]); fprintf(stderr, " AC - " F_U32 "\n", ms[2]); fprintf(stderr, " DC - " F_U32 "\n", ms[3]); fprintf(stderr, " CO - " F_U32 "\n", ms[4]); exit(1); } if (dumpFlag) { errno = 0; dumpSCZF = fopen(dumpSCZFname, "w"); dumpMCZF = fopen(dumpMCZFname, "w"); dumpMCSF = fopen(dumpMCSFname, "w"); dumpMCMF = fopen(dumpMCMFname, "w"); if (errno) fprintf(stderr, "Failed to open the dump files: %s\n", strerror(errno)), exit(1); } uint32 AFvsAC[NUMCATEGORIES][NUMCATEGORIES]; uint32 AFvsDC[NUMCATEGORIES][NUMCATEGORIES]; uint32 AFvsCO[NUMCATEGORIES][NUMCATEGORIES]; uint32 TFvsAC[NUMCATEGORIES][NUMCATEGORIES]; uint32 TFvsDC[NUMCATEGORIES][NUMCATEGORIES]; uint32 TFvsCO[NUMCATEGORIES][NUMCATEGORIES]; for (uint32 
i=0; ivalidMer()) minmer = AF->theFMer(); if (TF && TF->validMer()) minmer = TF->theFMer(); if (AC && AC->validMer()) minmer = AC->theFMer(); if (DC && DC->validMer()) minmer = DC->theFMer(); if (CO && CO->validMer()) minmer = CO->theFMer(); speedCounter *C = new speedCounter(" Examining: %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, beVerbose); bool morestuff = true; while (morestuff) { // Find any mer in our set if (AF && AF->validMer()) minmer = AF->theFMer(); if (TF && TF->validMer()) minmer = TF->theFMer(); if (AC && AC->validMer()) minmer = AC->theFMer(); if (DC && DC->validMer()) minmer = DC->theFMer(); if (CO && CO->validMer()) minmer = CO->theFMer(); // Find the smallest mer in our set if (AF && AF->validMer() && (AF->theFMer() < minmer)) minmer = AF->theFMer(); if (TF && TF->validMer() && (TF->theFMer() < minmer)) minmer = TF->theFMer(); if (AC && AC->validMer() && (AC->theFMer() < minmer)) minmer = AC->theFMer(); if (DC && DC->validMer() && (DC->theFMer() < minmer)) minmer = DC->theFMer(); if (CO && CO->validMer() && (CO->theFMer() < minmer)) minmer = CO->theFMer(); // We need to do up to six comparisons here. 
if (AF && AC) compare(AF, AC, minmer, AFmode, AFvsAC); if (AF && DC) compare(AF, DC, minmer, AFmode, AFvsDC); if (AF && CO) compare(AF, CO, minmer, AFmode, AFvsCO); if (TF && AC) compare(TF, AC, minmer, TFmode, TFvsAC); if (TF && DC) compare(TF, DC, minmer, TFmode, TFvsDC); if (TF && CO) compare(TF, CO, minmer, TFmode, TFvsCO); C->tick(); #if 0 if (C->tick()) { char stringjunk[256]; fprintf(stderr, "\nMM %s\n", minmer.merToString(stringjunk)); if (AF) fprintf(stderr, "AF %s\n", AF->theFMer().merToString(stringjunk)); if (TF) fprintf(stderr, "TF %s\n", TF->theFMer().merToString(stringjunk)); if (AC) fprintf(stderr, "AC %s\n", AC->theFMer().merToString(stringjunk)); if (DC) fprintf(stderr, "DC %s\n", DC->theFMer().merToString(stringjunk)); if (CO) fprintf(stderr, "CO %s\n", CO->theFMer().merToString(stringjunk)); } #endif // Advance to the next mer, if we were just used morestuff = false; if ((AF) && (AF->theFMer() == minmer)) morestuff |= AF->nextMer(); if ((TF) && (TF->theFMer() == minmer)) morestuff |= TF->nextMer(); if ((AC) && (AC->theFMer() == minmer)) morestuff |= AC->nextMer(); if ((DC) && (DC->theFMer() == minmer)) morestuff |= DC->nextMer(); if ((CO) && (CO->theFMer() == minmer)) morestuff |= CO->nextMer(); } delete C; // output if ((AF) && (AC)) output("all frags vs all contigs", AFmode, AFvsAC); if ((AF) && (DC)) output("all frags vs deg. contigs", AFmode, AFvsDC); if ((AF) && (CO)) output("all frags vs non-deg. contigs", AFmode, AFvsCO); if ((TF) && (AC)) output("trimmed frags vs all contigs", TFmode, TFvsAC); if ((TF) && (DC)) output("trimmed frags vs deg. contigs", TFmode, TFvsDC); if ((TF) && (CO)) output("trimmed frags vs non-deg. 
contigs", TFmode, TFvsCO); delete AF; delete TF; delete AC; delete DC; delete CO; exit(0); } canu-1.6/src/mercy/mercy.sh000066400000000000000000000160241314437614700157040ustar00rootroot00000000000000#!/bin/sh # This file is part of Celera Assembler, a software program that # assembles whole-genome shotgun reads into contigs and scaffolds. # Copyright (C) 2006-2007, J. Craig Venter Institute # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received (LICENSE.txt) a copy of the GNU General Public # License along with this program; if not, write to the Free Software # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA # Test if the mers in the consensus sequence are supported by mers in # the fragments. # If we count just the clear, we get a clearer (ha, ha) picture of # the assembly quality, while if we count all reads we get a picture # of trimming. # onlyClear=1 onlyReal=1 mem=8192 mem=16384 mem=24576 ms=22 binroot=/bioinfo/assembly/walenz/src/genomics asmMerQC=$binroot/meryl/asmMerQC mapMers=$binroot/meryl/mapMers dir=/scratch/drosnightly asm=willi dir=/project/huref6/redo-consensus_gennady asm=h6tmp dir=/project/huref6/assembly/h6 asm=h6 # Count mers in reads # if [ ! -e $asm-ms$ms-clr-frags.mcidx ] ; then bin/dumpFragStoreAsFasta -frg $dir/$asm.frgStore | \ meryl -B -C -m $ms -s - -o $asm-ms$ms-clr-frags -threads 4 -memory $mem -v fi if [ ! 
-e $asm-ms$ms-all-frags.mcidx ] ; then bin/dumpFragStoreAsFasta -allbases -allfrags -frg $dir/$asm.frgStore | \ meryl -B -C -m $ms -s - -o $asm-ms$ms-all-frags -threads 4 -memory $mem -v fi echo Finding contigs. if [ ! -e $asm.normalcontigs.fasta ] ; then bin/asmOutputContigsFasta < $dir/9-terminator/$asm.asm > $asm.normalcontigs.fasta & fi if [ ! -e $asm.degeneratecontigs.fasta ] ; then bin/asmOutputContigsFasta -D < $dir/9-terminator/$asm.asm > $asm.degeneratecontigs.fasta & fi if [ ! -e $asm.allcontigs.fasta ] ; then bin/asmOutputContigsFasta -d < $dir/9-terminator/$asm.asm > $asm.allcontigs.fasta & fi # Count mers in contigs # if [ ! -e $asm-ms$ms-normal-contigs.mcidx ] ; then meryl -B -C -m $ms -s $asm.normalcontigs.fasta -o $asm-ms$ms-normal-contigs -threads 4 -segments 4 -v & fi if [ ! -e $asm-ms$ms-degenerate-contigs.mcidx ] ; then meryl -B -C -m $ms -s $asm.degeneratecontigs.fasta -o $asm-ms$ms-degenerate-contigs -threads 4 -segments 4 -v & fi if [ ! -e $asm-ms$ms-all-contigs.mcidx ] ; then meryl -B -C -m $ms -s $asm.allcontigs.fasta -o $asm-ms$ms-all-contigs -threads 4 -segments 4 -v & fi if [ ! -e $asm-ms$ms.asmMerQC ] ; then $asmMerQC -af $asm-ms$ms-all-frags \ -tf $asm-ms$ms-clr-frags \ -co $asm-ms$ms-normal-contigs \ -ac $asm-ms$ms-all-contigs \ -dc $asm-ms$ms-degenerate-contigs \ > $asm-ms$ms.asmMerQC & fi echo Finding badmers. if [ ! -e $asm-ms$ms-allfrags-normalcontigs.badmers.asmMerQC ] ; then $asmMerQC -af $asm-ms$ms-all-frags \ -co $asm-ms$ms-normal-contigs \ -dump $asm-ms$ms-allfrags-normalcontigs.badmers \ > $asm-ms$ms-allfrags-normalcontigs.badmers.asmMerQC & fi if [ ! -e $asm-ms$ms-allfrags-allcontigs.badmers.asmMerQC ] ; then $asmMerQC -af $asm-ms$ms-all-frags \ -ac $asm-ms$ms-all-contigs \ -dump $asm-ms$ms-allfrags-allcontigs.badmers \ > $asm-ms$ms-allfrags-allcontigs.badmers.asmMerQC & fi if [ ! 
-e $asm-ms$ms-allfrags-degeneratecontigs.badmers.asmMerQC ] ; then $asmMerQC -af $asm-ms$ms-all-frags \ -dc $asm-ms$ms-degenerate-contigs \ -dump $asm-ms$ms-allfrags-degeneratecontigs.badmers \ > $asm-ms$ms-allfrags-degeneratecontigs.badmers.asmMerQC & fi if [ ! -e $asm-ms$ms-clrfrags-normalcontigs.badmers.asmMerQC ] ; then $asmMerQC -tf $asm-ms$ms-clr-frags \ -co $asm-ms$ms-normal-contigs \ -dump $asm-ms$ms-clrfrags-normalcontigs.badmers \ > $asm-ms$ms-clrfrags-normalcontigs.badmers.asmMerQC & fi if [ ! -e $asm-ms$ms-clrfrags-allcontigs.badmers.asmMerQC ] ; then $asmMerQC -tf $asm-ms$ms-clr-frags \ -ac $asm-ms$ms-all-contigs \ -dump $asm-ms$ms-clrfrags-allcontigs.badmers \ > $asm-ms$ms-clrfrags-allcontigs.badmers.asmMerQC & fi if [ ! -e $asm-ms$ms-clrfrags-degeneratecontigs.badmers.asmMerQC ] ; then $asmMerQC -tf $asm-ms$ms-clr-frags \ -dc $asm-ms$ms-degenerate-contigs \ -dump $asm-ms$ms-clrfrags-degeneratecontigs.badmers \ > $asm-ms$ms-clrfrags-degeneratecontigs.badmers.asmMerQC & fi echo Mapping. if [ ! -e $asm-ms$ms-allfrags-normalcontigs.badmers.0.singlecontig.zerofrag.badmers ] ; then $mapMers -m 22 \ -mers $asm-ms$ms-allfrags-normalcontigs.badmers.0.singlecontig.zerofrag.fasta \ -seq $asm.normalcontigs.fasta \ > $asm-ms$ms-allfrags-normalcontigs.badmers.0.singlecontig.zerofrag.badmers & fi if [ ! -e $asm-ms$ms-allfrags-allcontigs.badmers.0.singlecontig.zerofrag.badmers ] ; then $mapMers -m 22 \ -mers $asm-ms$ms-allfrags-allcontigs.badmers.0.singlecontig.zerofrag.fasta \ -seq $asm.allcontigs.fasta \ > $asm-ms$ms-allfrags-allcontigs.badmers.0.singlecontig.zerofrag.badmers & fi if [ ! -e $asm-ms$ms-allfrags-degeneratecontigs.badmers.0.singlecontig.zerofrag.badmers ] ; then $mapMers -m 22 \ -mers $asm-ms$ms-allfrags-degeneratecontigs.badmers.0.singlecontig.zerofrag.fasta \ -seq $asm.degeneratecontigs.fasta \ > $asm-ms$ms-allfrags-degeneratecontigs.badmers.0.singlecontig.zerofrag.badmers & fi if [ ! 
-e $asm-ms$ms-clrfrags-normalcontigs.badmers.0.singlecontig.zerofrag.badmers ] ; then
  $mapMers -m 22 \
           -mers $asm-ms$ms-clrfrags-normalcontigs.badmers.0.singlecontig.zerofrag.fasta \
           -seq $asm.normalcontigs.fasta \
  > $asm-ms$ms-clrfrags-normalcontigs.badmers.0.singlecontig.zerofrag.badmers &
fi
if [ ! -e $asm-ms$ms-clrfrags-allcontigs.badmers.0.singlecontig.zerofrag.badmers ] ; then
  $mapMers -m 22 \
           -mers $asm-ms$ms-clrfrags-allcontigs.badmers.0.singlecontig.zerofrag.fasta \
           -seq $asm.allcontigs.fasta \
  > $asm-ms$ms-clrfrags-allcontigs.badmers.0.singlecontig.zerofrag.badmers &
fi
if [ ! -e $asm-ms$ms-clrfrags-degeneratecontigs.badmers.0.singlecontig.zerofrag.badmers ] ; then
  $mapMers -m 22 \
           -mers $asm-ms$ms-clrfrags-degeneratecontigs.badmers.0.singlecontig.zerofrag.fasta \
           -seq $asm.degeneratecontigs.fasta \
  > $asm-ms$ms-clrfrags-degeneratecontigs.badmers.0.singlecontig.zerofrag.badmers &
fi
if [ ! -e $asm-ms$ms-allfrags-normalcontigs.badmers.5.all.badmers ] ; then
  cat $asm-ms$ms-allfrags-normalcontigs.badmers.[01].*.fasta > $asm-ms$ms-allfrags-normalcontigs.badmers.5.allzero.fasta
  $mapMers -m 22 \
           -mers $asm-ms$ms-allfrags-normalcontigs.badmers.5.allzero.fasta \
           -seq $asm.normalcontigs.fasta \
  > $asm-ms$ms-allfrags-normalcontigs.badmers.5.allzero.badmers &
fi

date
canu-1.6/src/meryl/000077500000000000000000000000001314437614700142375ustar00rootroot00000000000000canu-1.6/src/meryl/compare-counts.C000066400000000000000000000145041314437614700173060ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
* * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/meryl/compare-counts.C * * Modifications by: * * Brian P. Walenz from 2010-AUG-28 to 2014-APR-11 * are Copyright 2010,2012,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-NOV-22 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include #include #include #include #include "libmeryl.H" #if 0 void heatMap() { speedCounter *C = new speedCounter(" Examining: %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1ffffff, false); #define MAXA 150 #define MAXB 150 double heatraw[MAXA][MAXB]; double heatsca[MAXA][MAXB]; for (uint32 i=0; inextMer(); B->nextMer(); while ((A->validMer()) || (B->validMer())) { kMer &a = A->theFMer(); kMer &b = B->theFMer(); uint32 ac = A->theCount(); uint32 bc = B->theCount(); if (ac >= MAXA) ac = MAXA-1; if (bc >= MAXB) bc = MAXB-1; if (A->validMer() == false) { ac = 0; heatraw[ac][bc]++; B->nextMer(); continue; } if (B->validMer() == false) { bc = 0; heatraw[ac][bc]++; A->nextMer(); continue; } if (a == b) { heatraw[ac][bc]++; A->nextMer(); B->nextMer(); } else if (a < b) { heatraw[ac][0]++; A->nextMer(); } else { heatraw[0][bc]++; B->nextMer(); } C->tick(); } delete C; delete A; delete B; // Scale each row to be between 0 and 1 #if 0 for (uint32 j=0; jmerSize(); #define HMAX 64 * 1024 uint32 *Htrue = new uint32 [HMAX]; uint32 *Hnoise = new uint32 [HMAX]; for (uint32 i=0; inextMer(); S->nextMer(); while ((T->validMer()) || (S->validMer())) { kMer &t = T->theFMer(); kMer &s = S->theFMer(); uint32 tc = 
T->theCount(); uint32 sc = S->theCount(); if (tc >= HMAX) tc = HMAX-1; if (sc >= HMAX) sc = HMAX-1; // If we're out of truth kmers, the sample is noise. if (T->validMer() == false) { Hnoise[sc]++; S->nextMer(); continue; } // If we're out of sample kmers, do nothing but go to the next truth kmer. if (S->validMer() == false) { T->nextMer(); continue; } // If the kmers are equal, this is a true kmer if (t == s) { Htrue[sc]++; T->nextMer(); S->nextMer(); } // If the truth kmer is the lesser, get the next truth. else if (t < s) { T->nextMer(); } // Else the sample kmer is smaller, add it to the noise pile, and get the next. else { Hnoise[sc]++; S->nextMer(); } } delete T; delete S; char outputName[FILENAME_MAX]; snprintf(outputName, FILENAME_MAX, "%s.gp", outputPrefix); FILE *outputGP = fopen(outputName, "w"); snprintf(outputName, FILENAME_MAX, "%s.dat", outputPrefix); FILE *outputDAT = fopen(outputName, "w"); fprintf(outputGP, "set terminal png\n"); fprintf(outputGP, "set output \"%s.png\"\n", outputPrefix); fprintf(outputGP, "set title \"%s true/false %d-mers\"\n", plotTitle, kmerSize); fprintf(outputGP, "set xlabel \"k-mer count\"\n"); fprintf(outputGP, "set ylabel \"number of kmers\"\n"); fprintf(outputGP, "plot [0:100] [0:1000000] \"%s.dat\" using 1:2 with lines title \"true\", \"%s.dat\" using 1:3 with lines title \"false\"\n", outputPrefix, outputPrefix); fclose(outputGP); for (uint32 i=0; i hist[i])) i++; uint32 iX = i - 1; while (i < histLen) { if (hist[iX] < hist[i]) iX = i; i++; } fprintf(stderr, "Guessed X coverage is " F_U32 "\n", iX); return(iX); } void loadHistogram(merylStreamReader *MF, uint64 &nDistinct, uint64 &nUnique, uint64 &nTotal, uint32 &histLen, uint32* &hist) { nDistinct = MF->numberOfDistinctMers(); nUnique = MF->numberOfUniqueMers(); nTotal = MF->numberOfTotalMers(); histLen = MF->histogramLength(); hist = new uint32 [histLen]; for (uint32 hh=0; hhhistogram(hh); } void loadHistogram(FILE *HF, uint64 &nDistinct, uint64 &nUnique, uint64 
&nTotal, uint32 &histLen, uint32* &hist) { char L[1024]; uint32 histMax; nDistinct = 0; nUnique = 0; nTotal = 0; histLen = 0; histMax = 1048576; hist = new uint32 [histMax]; memset(hist, 0, sizeof(uint32) * histMax); fgets(L, 1024, HF); while (!feof(HF)) { splitToWords W(L); uint32 h = W(0); uint32 c = W(1); while (h >= histMax) resizeArray(hist, histLen, histMax, histMax * 2, resizeArray_copyData | resizeArray_clearNew); hist[h] = c; histLen = (histLen < h) ? h : histLen; fgets(L, 1024, HF); } histLen++; nUnique = hist[1]; for (uint32 hh=0; hh 0) ? expectedCoverage : guessCoverage(hist, histLen); // Pass 1: look for a reasonable limit, using %distinct and %total. // uint64 totalUsefulDistinct = nDistinct - nUnique; uint64 totalUsefulAll = nTotal - nUnique; uint64 distinct = 0; uint64 total = 0; uint32 maxCount = 0; uint32 extCount = 0; uint32 kk = 2; // If we cover 99% of all the distinct mers, that's reasonable. // // If we're a somewhat high count, and we're covering 2/3 of the total mers, assume that there // are lots of errors (or polymorphism) that are preventing us from covering many distinct mers. // for (; kk < histLen; kk++) { distinct += hist[kk]; total += hist[kk] * kk; if (((distinct / (double)totalUsefulDistinct) > 0.9975) || (((total / (double)totalUsefulAll) > 0.6667) && (kk > 50 * guessedCoverage))) { maxCount = kk; break; } } fprintf(stderr, "Set maxCount to " F_U32 " (" F_U32 " kmers), which will cover %.2f%% of distinct mers and %.2f%% of all mers.\n", maxCount, hist[maxCount], 100.0 * distinct / totalUsefulDistinct, 100.0 * total / totalUsefulAll); // Compute an average number of kmers around this count. Skip the 1 count. uint32 min = (maxCount < 27) ? 2 : maxCount - 25; uint32 max = (maxCount + 26 > histLen) ? 
histLen : maxCount + 26; uint64 avg = 0; uint64 tot = 0; for (int32 ii=min; ii 0) { limit = histLen; fprintf(stderr, "No break in kmer coverage found.\n"); } else { limit = min; fprintf(stderr, "Found break in kmer coverage at %d\n", limit); } // Scan forward until we see a 10x increase in the number of kmers, OR, until we cover 90% of the sequence and are at sufficient coverage. for (; kk < limit; kk++) { if (avg * 10 < kk * hist[kk]) break; if (((double)total / totalUsefulAll >= 0.9) && (kk > 50 * guessedCoverage)) break; distinct += hist[kk]; total += hist[kk] * kk; if (hist[kk] > 0) extCount = kk; } if (extCount > 0) maxCount = extCount; fprintf(stderr, "Set maxCount to " F_U32 " (" F_U32 " kmers), which will cover %.2f%% of distinct mers and %.2f%% of all mers.\n", maxCount, hist[maxCount], 100.0 * distinct / totalUsefulDistinct, 100.0 * total / totalUsefulAll); fprintf(stdout, F_U32"\n", maxCount); delete [] hist; return(0); } canu-1.6/src/meryl/estimate-mer-threshold.mk000066400000000000000000000010121314437614700211500ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := estimate-mer-threshold SOURCES := estimate-mer-threshold.C SRC_INCDIRS := .. ../AS_UTL libleaff TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lleaff -lcanu TGT_PREREQS := libleaff.a libcanu.a SUBMAKEFILES := canu-1.6/src/meryl/existDB.C000066400000000000000000000170361314437614700157140ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
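The threshold estimator above first guesses the coverage by walking the k-mer count histogram down the error spike at low counts until the first local minimum, then taking the tallest remaining peak. A minimal sketch of that guess, assuming a plain `std::vector` histogram indexed by count (the function name mirrors the source; the container is an illustration, not the canu API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// hist[c] = number of distinct k-mers seen exactly c times.
// Descend the error spike starting at count 2, then return the count
// with the most k-mers after (and including) that first dip.
uint32_t guessCoverage(const std::vector<uint32_t> &hist) {
  size_t i = 2;
  while ((i < hist.size()) && (hist[i - 1] > hist[i]))  // falling edge of error spike
    i++;
  size_t peak = i - 1;                                  // first local minimum
  for (; i < hist.size(); i++)                          // tallest point after the dip
    if (hist[peak] < hist[i])
      peak = i;
  return (uint32_t)peak;
}
```

On a histogram with a large error spike at count 1 and a main peak near the true coverage, this returns the main peak's count.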
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2003-FEB-20 to 2003-OCT-20 * are Copyright 2003 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-APR-30 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-DEC-04 to 2014-APR-11 * are Copyright 2005,2007-2008,2011,2013-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "existDB.H" #include "speedCounter.H" #include "libmeryl.H" #include "merStream.H" // Driver for the existDB creation. Reads a sequence.fasta, builds // an existDB for the mers in the file, and then writes the internal // structures to disk. // // The existDB constructor is smart enough to read either a pre-built // image or a regular multi-fasta file. 
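The comment above describes existDB as an exact-membership (and count) table over the k-mers of a sequence file. A toy direct-indexed version conveys the idea: pack each k-mer into 2k bits and set that position in a bit vector of 4^k entries. This sketch is an assumption-laden simplification — the real existDB uses a hashed, bucketed layout and works for large k, while direct indexing is only feasible for small k:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy exact-membership table: one bit per possible k-mer.
struct TinyExistDB {
  uint32_t          merSize;
  std::vector<bool> bits;

  explicit TinyExistDB(uint32_t k) : merSize(k), bits(1ull << (2 * k), false) {}

  // Pack A/C/G/T into 2 bits each (A=0, C=1, G=2, T=3).
  static uint64_t pack(const std::string &mer) {
    uint64_t m = 0;
    for (char c : mer)
      m = (m << 2) | (uint64_t)(c == 'C' ? 1 : c == 'G' ? 2 : c == 'T' ? 3 : 0);
    return m;
  }

  // Insert every k-mer of a sequence.
  void add(const std::string &seq) {
    for (size_t i = 0; i + merSize <= seq.size(); i++)
      bits[pack(seq.substr(i, merSize))] = true;
  }

  bool exists(const std::string &mer) const {
    return bits[pack(mer)];
  }
};
```

The `-testexhaustive` mode below does the analogous thing in reverse: it enumerates all 4^k packed values and queries each one.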
int testFiles(char *filename, char *prefix, uint32 merSize) { char *prefixfilename = new char [strlen(prefix) + 32]; // Create existDB e and save it to disk // existDB *e = new existDB(filename, merSize, existDBnoFlags | existDBcounts, 0, ~uint32ZERO); sprintf(prefixfilename, "%s.1", prefix); e->saveState(prefixfilename); // Create existDB f by loading the saved copy from disk // existDB *f = new existDB(prefixfilename); // Create a fresh existDB g (to check if we corrup the original when saved) // existDB *g = new existDB(filename, merSize, existDBnoFlags | existDBcounts, 0, ~uint32ZERO); speedCounter *C = new speedCounter(" %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, true); fprintf(stderr, "Need to iterate over %7.2f Mmers.\n", (uint64MASK(2 * merSize) + 1) / 1000000.0); for (uint64 d=0, m=uint64MASK(2 * merSize); m--; ) { bool ee = e->exists(m); bool ef = f->exists(m); bool eg = g->exists(m); uint32 ce = e->count(m); uint32 cf = f->count(m); uint32 cg = g->count(m); if ((ee != ef) || (ef != eg) || (ee != eg)) fprintf(stderr, "mer "F_X64" not found : e=%d f=%d g=%d\n", m, ee, ef, eg); if ((ce != cf) || (cf != cg) || (ce != cg)) fprintf(stderr, "mer "F_X64" count differs : e=%u f=%u g=%u (exists=%d)\n", m, ce, cf, cg, ee); if ((m & 0xffffff) == 0) { // Been a while since a report, so report. d = 1; } if ((ce > 1) && (d == 1)) { // Report anything not unique, to make sure that we're testing real counts and not just existence. 
fprintf(stderr, "mer "F_X64" : e=%u f=%u g=%u (exists=%d)\n", m, ce, cf, cg, ee); d = 0; } C->tick(); } delete e; delete C; return(0); } int testExistence(char *filename, uint32 merSize) { existDB *E = new existDB(filename, merSize, existDBnoFlags, 0, ~uint32ZERO); merStream *M = new merStream(new kMerBuilder(merSize), new seqStream(filename), true, true); uint64 tried = 0; uint64 lost = 0; while (M->nextMer()) { tried++; if (!E->exists(M->theFMer())) lost++; } delete M; delete E; if (lost) { fprintf(stderr, "Tried "F_U64", didn't find "F_U64" merStream mers in the existDB.\n", tried, lost); return(1); } else { return(0); } } int testExhaustive(char *filename, char *merylname, uint32 merSize) { existDB *E = new existDB(filename, merSize, existDBnoFlags, 0, ~uint32ZERO); merylStreamReader *M = new merylStreamReader(merylname); speedCounter *C = new speedCounter(" %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, true); uint64 found = uint64ZERO; uint64 expected = uint64ZERO; FILE *DUMP = 0L; DUMP = fopen("testExhaustive.ms.dump", "w"); while (M->nextMer()) { if (E->exists(M->theFMer())) { expected++; fprintf(DUMP, F_X64"\n", (uint64)M->theFMer()); } else { fprintf(DUMP, F_X64" MISSED!\n", (uint64)M->theFMer()); } } fclose(DUMP); fprintf(stderr, "Found "F_U64" mers in the meryl database.\n", expected); fprintf(stderr, "Need to iterate over %7.2f Mmers.\n", (uint64MASK(2 * merSize) + 1) / 1000000.0); DUMP = fopen("testExhaustive.ck.dump", "w"); for (uint64 m = uint64MASK(2 * merSize); m--; ) { if (E->exists(m)) { found++; fprintf(DUMP, F_X64"\n", m); } C->tick(); } fclose(DUMP); delete C; delete E; delete M; if (expected != found) { fprintf(stderr, "Expected to find "F_U64" mers, but found "F_U64" instead.\n", expected, found); return(1); } else { return(0); } } const char *usage = "usage: %s [stuff]\n" " -mersize mersize\n" " -- Use the specified mersize when building existDB tables.\n" "\n" " -build some.fasta prefix\n" " -- Build an existDB on all mers in 
some.fasta and save\n" " the tables into prefix.\n" "\n" " -describe prefix\n" " -- Reports the state of some existDB file.\n" "\n" " -testfiles some.fasta prefix\n" " -- Build an existDB table from some.fasta. Write that table to disk.\n" " Load the table back. Compare that each mer in some.fasta is present\n" " in all three existDB tables created earlier.\n" "\n" " -testexistence some.fasta\n" " -- Build an existDB table from some.fasta, check that every\n" " mer in some.fasta can be found in the table. Does not\n" " guarantee that every mer in the table is found in the file.\n" "\n" " -testexhaustive some.fasta some.meryl\n" " -- Build an existDB table from some.fasta, check _EVERY_ mer\n" " for existance. Complain if a mer exists in the table but\n" " not in the meryl database. Assumes 'some.meryl' is the\n" " mercount of some.fasta.\n" "\n"; int main(int argc, char **argv) { uint32 mersize = 20; if (argc < 3) { fprintf(stderr, usage, argv[0]); exit(1); } int arg = 1; while (arg < argc) { if (strncmp(argv[arg], "-mersize", 2) == 0) { arg++; mersize = atoi(argv[arg]); } else if (strncmp(argv[arg], "-describe", 2) == 0) { existDB *e = new existDB(argv[argc-1], false); e->printState(stdout); delete e; exit(0); } else if (strncmp(argv[arg], "-testfiles", 8) == 0) { exit(testFiles(argv[arg+1], argv[arg+2], mersize)); } else if (strncmp(argv[arg], "-testexistence", 8) == 0) { exit(testExistence(argv[arg+1], mersize)); } else if (strncmp(argv[arg], "-testexhaustive", 8) == 0) { exit(testExhaustive(argv[arg+1], argv[arg+2], mersize)); } else if (strncmp(argv[arg], "-build", 2) == 0) { existDB *e = new existDB(argv[argc-2], mersize, existDBnoFlags, 0, ~uint32ZERO); e->saveState(argv[argc-1]); delete e; exit(0); } arg++; } exit(0); } canu-1.6/src/meryl/existDB.mk000066400000000000000000000007651314437614700161420ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. 
ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := existDB SOURCES := existDB.C SRC_INCDIRS := .. ../AS_UTL libleaff libkmer TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lleaff -lcanu TGT_PREREQS := libleaff.a libcanu.a SUBMAKEFILES := canu-1.6/src/meryl/gkrpt.pl000077500000000000000000000176341314437614700157410ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # This file is derived from: # # src/AS_MER/gkrpt.pl # # Modifications by: # # Brian P. Walenz from 2008-JAN-03 to 2013-AUG-01 # are Copyright 2008,2013 J. Craig Venter Institute, and # are subject to the GNU General Public License version 2 # # Brian P. Walenz on 2015-APR-10 # are Copyright 2015 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. Walenz beginning on 2015-OCT-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. 
## use strict; my %kmers; while (!eof(STDIN)){ my $count = ; chomp $count; my $kmer = ; chomp $kmer; $count =~ s/>//g; $count = int($count); $kmer =~ tr/acgt/ACGT/; my $numbad = $kmer =~ tr/ACGT//c; if ($numbad > 0){ die "$numbad nonacgtACGT characters in kmer $kmer\n"; } if (defined($kmers{$kmer})){ die "kmer repeated in input $kmer\n"; } $kmers{$kmer} = $count; } foreach my $kmer (sort { $kmers{$b} <=> $kmers{$a} } (keys %kmers)){ my $startkmer = my $curkmer = $kmer; my $startcount = my $curcount = $kmers{$kmer}; if ($curcount < 0) { next; } else { $kmers{$kmer} = - $curcount; } my $currpt = $curkmer; for ( ; ; ) { my $nextcount = -1; my $realcount = -1; my $maxkmer; my $realkmer; my $tmpkmer; my $maxcount = -1; $curkmer = (substr $curkmer, 1) . "A"; my $rckmer = reverse $curkmer; $rckmer =~ tr/ACGT/TGCA/; if (defined($kmers{$curkmer})){ $nextcount = $kmers{$curkmer}; $tmpkmer = $curkmer; } else { if (defined($kmers{$rckmer})){ $nextcount = $kmers{$rckmer}; $tmpkmer = $rckmer; } else { $nextcount = -1; } } if ((abs $nextcount) > $maxcount){ $maxcount = abs $nextcount; $realcount = $nextcount; $realkmer = $tmpkmer; $maxkmer = $curkmer; } substr($curkmer, -1) = "C"; if (defined($kmers{$curkmer})){ $nextcount = $kmers{$curkmer}; $tmpkmer = $curkmer; } else { substr($rckmer, 0, 1) = "G"; if (defined($kmers{$rckmer})){ $nextcount = $kmers{$rckmer}; $tmpkmer = $rckmer; } else { $nextcount = -1; } } if ((abs $nextcount) > $maxcount){ $maxcount = abs $nextcount; $realcount = $nextcount; $realkmer = $tmpkmer; $maxkmer = $curkmer; } substr($curkmer, -1) = "G"; if (defined($kmers{$curkmer})){ $nextcount = $kmers{$curkmer}; $tmpkmer = $curkmer; } else { substr($rckmer, 0, 1) = "C"; if (defined($kmers{$rckmer})){ $nextcount = $kmers{$rckmer}; $tmpkmer = $rckmer; } else { $nextcount = -1; } } if ((abs $nextcount) > $maxcount){ $maxcount = abs $nextcount; $realcount = $nextcount; $realkmer = $tmpkmer; $maxkmer = $curkmer; } substr($curkmer, -1) = "T"; if 
(defined($kmers{$curkmer})){ $nextcount = $kmers{$curkmer}; $tmpkmer = $curkmer; } else { substr($rckmer, 0, 1) = "A"; if (defined($kmers{$rckmer})){ $nextcount = $kmers{$rckmer}; $tmpkmer = $rckmer; } else { $nextcount = -1; } } if ((abs $nextcount) > $maxcount){ $maxcount = abs $nextcount; $realcount = $nextcount; $realkmer = $tmpkmer; $maxkmer = $curkmer; } if (($realcount < 0) || ($realcount < ($curcount / 2))) { last; } else { $curkmer = $maxkmer; $curcount = $realcount; $kmers{$realkmer} = - $realcount; $currpt .= (substr $curkmer, -1); } } $curcount = $startcount; $curkmer = $startkmer; for ( ; ; ) { my $nextcount = -1; my $realcount = -1; my $maxkmer; my $realkmer; my $tmpkmer; my $maxcount = -1; $curkmer = "A" . (substr $curkmer, 0, -1); my $rckmer = reverse $curkmer; $rckmer =~ tr/ACGT/TGCA/; if (defined($kmers{$curkmer})){ $nextcount = $kmers{$curkmer}; $tmpkmer = $curkmer; } else { if (defined($kmers{$rckmer})){ $nextcount = $kmers{$rckmer}; $tmpkmer = $rckmer; } else { $nextcount = -1; } } if ((abs $nextcount) > $maxcount){ $maxcount = abs $nextcount; $realcount = $nextcount; $realkmer = $tmpkmer; $maxkmer = $curkmer; } substr($curkmer, 0, 1) = "C"; if (defined($kmers{$curkmer})){ $nextcount = $kmers{$curkmer}; $tmpkmer = $curkmer; } else { substr($rckmer, -1) = "G"; if (defined($kmers{$rckmer})){ $nextcount = $kmers{$rckmer}; $tmpkmer = $rckmer; } else { $nextcount = -1; } } if ((abs $nextcount) > $maxcount){ $maxcount = abs $nextcount; $realcount = $nextcount; $realkmer = $tmpkmer; $maxkmer = $curkmer; } substr($curkmer, 0, 1) = "G"; if (defined($kmers{$curkmer})){ $nextcount = $kmers{$curkmer}; $tmpkmer = $curkmer; } else { substr($rckmer, -1) = "C"; if (defined($kmers{$rckmer})){ $nextcount = $kmers{$rckmer}; $tmpkmer = $rckmer; } else { $nextcount = -1; } } if ((abs $nextcount) > $maxcount){ $maxcount = abs $nextcount; $realcount = $nextcount; $realkmer = $tmpkmer; $maxkmer = $curkmer; } substr($curkmer, 0, 1) = "T"; if 
(defined($kmers{$curkmer})){ $nextcount = $kmers{$curkmer}; $tmpkmer = $curkmer; } else { substr($rckmer, -1) = "A"; if (defined($kmers{$rckmer})){ $nextcount = $kmers{$rckmer}; $tmpkmer = $rckmer; } else { $nextcount = -1; } } if ((abs $nextcount) > $maxcount){ $maxcount = abs $nextcount; $realcount = $nextcount; $realkmer = $tmpkmer; $maxkmer = $curkmer; } if (($realcount < 0) || ($realcount < ($curcount / 2))) { last; } else { $curkmer = $maxkmer; $curcount = $realcount; $kmers{$realkmer} = - $realcount; $currpt = (substr $curkmer, 0, 1) . $currpt; } } if ((my $lenrpt = length $currpt) > $ARGV[0]) { print ">$startkmer $startcount $lenrpt\n$currpt\n"; } } canu-1.6/src/meryl/leaff-blocks.C000066400000000000000000000045171314437614700167020ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/leaff/blocks.C * * Modifications by: * * Brian P. Walenz from 2009-FEB-07 to 2014-APR-11 * are Copyright 2009,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
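The Perl script above reconstructs a repeat by greedily extending the highest-count seed k-mer one base at a time, keeping the successor with the highest count and stopping once that count falls below half the current one. A sketch of just the rightward extension, under stated simplifications: no reverse-complement lookup and no mark-as-used (negative count) bookkeeping, both of which the script also does:

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <string>
#include <unordered_map>

// Extend 'seed' to the right through the k-mer count map, always taking
// the successor (seed minus first base, plus one of A/C/G/T) with the
// highest count.  Stop when no successor exists or the best successor's
// count is less than half the current count (2*best < cur avoids the
// integer-division truncation of cur/2).
std::string extendRight(std::string seed,
                        const std::unordered_map<std::string, uint32_t> &kmers) {
  std::string repeat   = seed;
  uint32_t    curCount = kmers.at(seed);
  for (;;) {
    std::string best;
    uint32_t    bestCount = 0;
    for (char b : {'A', 'C', 'G', 'T'}) {
      std::string next = seed.substr(1) + b;
      auto        it   = kmers.find(next);
      if (it != kmers.end() && it->second > bestCount) {
        bestCount = it->second;
        best      = next;
      }
    }
    if (best.empty() || 2 * bestCount < curCount)
      break;
    seed     = best;
    curCount = bestCount;
    repeat  += best.back();
  }
  return repeat;
}
```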
*/ #include "AS_global.H" #include "seqCache.H" void dumpBlocks(char *filename) { seqCache *F = 0L; seqInCore *S = 0L; bool V[256] = {0}; for (uint32 i=0; i<256; i++) V[i] = false; V['n'] = true; V['N'] = true; F = new seqCache(filename); for (uint32 s=0; sgetNumberOfSequences(); s++) { seqInCore *S = F->getSequenceInCore(s); uint32 len = S->sequenceLength(); char begseq = S->sequence()[0]; bool nnn = V[begseq]; uint32 begpos = 0; uint32 pos = 0; for (pos=0; possequence()[pos]; if (nnn != V[seq]) { fprintf(stdout, "%c " F_U32 " " F_U32 " " F_U32 " " F_U32 "\n", begseq, s, begpos, pos, pos - begpos); nnn = V[seq]; begpos = pos; begseq = seq; } } fprintf(stdout, "%c " F_U32 " " F_U32 " " F_U32 " " F_U32 "\n", begseq, s, begpos, pos, pos - begpos); fprintf(stdout, ". " F_U32 " " F_U32 " " F_U32 "\n", s, pos, 0); delete S; } delete F; } canu-1.6/src/meryl/leaff-duplicates.C000066400000000000000000000125221314437614700175550ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/leaff/dups.C * * Modifications by: * * Brian P. Walenz from 2009-FEB-07 to 2014-APR-11 * are Copyright 2009,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "seqCache.H" #include "md5.H" md5_s * computeMD5ForEachSequence(seqCache *F) { uint32 numSeqs = F->getNumberOfSequences(); md5_s *result = new md5_s [numSeqs]; for (uint32 idx=0; idx < numSeqs; idx++) { seqInCore *s1 = F->getSequenceInCore(idx); md5_string(result+idx, s1->sequence(), s1->sequenceLength()); result[idx].i = s1->getIID(); delete s1; } return(result); } void mapDuplicates_Print(char *filea, seqInCore *sa, char *fileb, seqInCore *sb) { if (strcmp(sa->sequence(), sb->sequence()) == 0) fprintf(stdout, F_U32" <-> " F_U32 "\n", sa->getIID(), sb->getIID()); else fprintf(stderr, "COLLISION DETECTED BETWEEN %s:" F_U32 " AND %s:" F_U32 "!\nPLEASE REPORT THIS TO bri@walenz.org!\n", filea, sa->getIID(), fileb, sb->getIID()); } void findDuplicates(char *filename) { seqInCore *s1 = 0L; seqInCore *s2 = 0L; seqCache *A = new seqCache(filename); uint32 numSeqs = A->getNumberOfSequences(); fprintf(stderr, "Computing MD5's for each sequence in '%s'.\n", filename); md5_s *result = computeMD5ForEachSequence(A); fprintf(stderr, "Sorting MD5's.\n"); qsort(result, numSeqs, sizeof(md5_s), md5_compare); fprintf(stderr, "Verifying identity, and output\n"); for (uint32 idx=1; idxgetSequenceInCore(result[idx-1].i); s2 = A->getSequenceInCore(result[idx].i); if (strcmp(s1->sequence(), s2->sequence()) == 0) { fprintf(stdout, F_U32":%s\n" F_U32 ":%s\n\n", result[idx-1].i, s1->header(), result[idx ].i, s2->header()); } else { fprintf(stderr, "COLLISION DETECTED BETWEEN IID " F_U32 " AND " F_U32 "!\nPLEASE REPORT THIS TO bri@walenz.org!\n", result[idx-1].i, result[idx].i); } delete s1; delete s2; } } delete [] result; delete A; } void mapDuplicates(char *filea, char *fileb) { fprintf(stderr, "Computing MD5's for 
each sequence in '%s'.\n", filea); seqCache *A = new seqCache(filea); md5_s *resultA = computeMD5ForEachSequence(A); fprintf(stderr, "Computing MD5's for each sequence in '%s'.\n", fileb); seqCache *B = new seqCache(fileb); md5_s *resultB = computeMD5ForEachSequence(B); uint32 numSeqsA = A->getNumberOfSequences(); uint32 numSeqsB = B->getNumberOfSequences(); uint32 idxA = 0; uint32 idxB = 0; fprintf(stderr, "Sorting MD5's.\n"); qsort(resultA, numSeqsA, sizeof(md5_s), md5_compare); qsort(resultB, numSeqsB, sizeof(md5_s), md5_compare); fprintf(stderr, "Finding duplicates.\n"); while ((idxAgetSequenceInCore(resultA[idxA].i); seqInCore *sb = B->getSequenceInCore(resultB[idxB].i); mapDuplicates_Print(filea, sa, fileb, sb); // While the B sequence matches the current A sequence, output a match // uint32 idxBb = idxB+1; int resb = md5_compare(resultA+idxA, resultB+idxBb); while (resb == 0) { seqInCore *sbb = B->getSequenceInCore(resultB[idxBb].i); mapDuplicates_Print(filea, sa, fileb, sbb); delete sbb; idxBb++; resb = md5_compare(resultA+idxA, resultB+idxBb); } // And likewise for A // uint32 idxAa = idxA+1; int resa = md5_compare(resultA+idxAa, resultB+idxB); while (resa == 0) { seqInCore *saa = A->getSequenceInCore(resultA[idxAa].i); mapDuplicates_Print(filea, saa, fileb, sb); delete saa; idxAa++; resa = md5_compare(resultA+idxAa, resultB+idxB); } delete sa; delete sb; idxA++; idxB++; } else { if (res < 0) idxA++; else idxB++; } } delete A; delete B; } canu-1.6/src/meryl/leaff-gc.C000066400000000000000000000112541314437614700160120ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
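`findDuplicates` above hashes every sequence with MD5, sorts the digests, and reports adjacent entries whose digests match, double-checking with a direct string compare so a hash collision is caught rather than misreported. A sketch of the same plan, with the loudly-flagged substitution that `std::hash` stands in for MD5 (the real code carries actual MD5 digests via `md5_s`):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Return (i, j) index pairs of sequences that are byte-identical.
// Sorting by digest makes duplicates adjacent, so only neighbors need
// the expensive full comparison.
std::vector<std::pair<uint32_t, uint32_t>>
findDuplicatePairs(const std::vector<std::string> &seqs) {
  std::vector<std::pair<size_t, uint32_t>> keyed;   // (digest, original index)
  for (uint32_t i = 0; i < seqs.size(); i++)
    keyed.push_back({std::hash<std::string>{}(seqs[i]), i});
  std::sort(keyed.begin(), keyed.end());

  std::vector<std::pair<uint32_t, uint32_t>> dups;
  for (size_t i = 1; i < keyed.size(); i++)
    if (keyed[i - 1].first == keyed[i].first &&       // digests agree ...
        seqs[keyed[i - 1].second] == seqs[keyed[i].second])  // ... and so do the bytes
      dups.push_back({keyed[i - 1].second, keyed[i].second});
  return dups;
}
```

Sorting digests instead of whole sequences is the design point: comparisons during the sort touch a fixed-size key, not megabase strings.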
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/leaff/gc.C * * Modifications by: * * Brian P. Walenz from 2009-FEB-07 to 2014-APR-11 * are Copyright 2009,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "seqCache.H" void computeGCcontent(char *filename) { seqCache *A = new seqCache(filename); for (uint32 idx=0; idx < A->getNumberOfSequences(); idx++) { seqInCore *S = A->getSequenceInCore(idx); char *s = S->sequence(); uint32 genomeLength = S->sequenceLength(); fprintf(stdout, ">%s\n", S->header()); int gc[256] = {0}; gc['c'] = 1; gc['C'] = 1; gc['g'] = 1; gc['G'] = 1; // Replace the sequence with "g or c". We can't do this inline, // since output reports the sequence too. The extra 1000 at the // end is important, since we do not bother checking for the end // of the valid data, just assume that it's zero. // char *g = new char [S->sequenceLength() + 1000]; for (uint32 i=0; i 1) ? g[i-2] : 0); ave5 += g[i+2] - ((i > 2) ? g[i-3] : 0); ave11 += g[i+5] - ((i > 5) ? g[i-6] : 0); ave51 += g[i+25] - ((i > 25) ? g[i-25] : 0); ave101 += g[i+50] - ((i > 50) ? g[i-51] : 0); ave201 += g[i+100] - ((i > 100) ? 
g[i-101] : 0); ave501 += g[i+250] - ((i > 250) ? g[i-251] : 0); ave1001 += g[i+500] - ((i > 500) ? g[i-501] : 0); ave2001 += g[i+1000] - ((i > 1000) ? g[i-1001] : 0); fprintf(stdout, F_U32"\t" F_U32 "\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\t%.2f\n", i, s[i], ave3 / (double)((i >= 1) ? 3 - ((i < genomeLength - 1) ? 0 : i + 2 - genomeLength) : i+2), ave5 / (double)((i >= 2) ? 5 - ((i < genomeLength - 2) ? 0 : i + 3 - genomeLength) : i+3), ave11 / (double)((i >= 5) ? 11 - ((i < genomeLength - 4) ? 0 : i + 5 - genomeLength) : i+6), ave51 / (double)((i >= 25) ? 51 - ((i < genomeLength - 24) ? 0 : i + 25 - genomeLength) : i+26), ave101 / (double)((i >= 50) ? 101 - ((i < genomeLength - 49) ? 0 : i + 50 - genomeLength) : i+51), ave201 / (double)((i >= 100) ? 201 - ((i < genomeLength - 99) ? 0 : i + 100 - genomeLength) : i+101), ave501 / (double)((i >= 250) ? 501 - ((i < genomeLength - 249) ? 0 : i + 250 - genomeLength) : i+251), ave1001 / (double)((i >= 500) ? 1001 - ((i < genomeLength - 499) ? 0 : i + 500 - genomeLength) : i+501), ave2001 / (double)((i >= 1000) ? 2001 - ((i < genomeLength - 999) ? 0 : i + 1000 - genomeLength) : i+1001)); } delete [] g; delete S; } } canu-1.6/src/meryl/leaff-partition.C000066400000000000000000000144341314437614700174350ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/leaff/partition.C * * Modifications by: * * Brian P. 
Walenz from 2009-FEB-07 to 2014-APR-11 * are Copyright 2009-2010,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "seqCache.H" #include struct partition_s { uint32 length; uint32 index; uint32 partition; }; static int partition_s_compare(const void *A, const void *B) { const partition_s *a = (const partition_s *)A; const partition_s *b = (const partition_s *)B; if (a->length < b->length) return(1); if (a->length > b->length) return(-1); return(0); } static partition_s * loadPartition(seqCache *F) { uint32 n = F->getNumberOfSequences(); partition_s *p = new partition_s [n]; for (uint32 i=0; igetSequenceLength(i); p[i].index = i; p[i].partition = 0; } qsort(p, n, sizeof(partition_s), partition_s_compare); return(p); } static void outputPartition(seqCache *F, char *prefix, partition_s *p, uint32 openP, uint32 n) { char filename[FILENAME_MAX]; // Check that everything has been partitioned // for (uint32 i=0; igetSequenceInCore(p[i].index); fprintf(file, ">%s\n", S->header()); fwrite(S->sequence(), sizeof(char), S->sequenceLength(), file); fprintf(file, "\n"); if (S->sequenceLength() != p[i].length) { fprintf(stderr, "Huh? '%s' " F_U32 " != " F_U32 "\n", S->header(), S->sequenceLength(), p[i].length); } delete S; } fclose(file); } } else { // This dumps the partition information to stdout. 
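The partitioning strategy in partitionBySize() is first-fit-decreasing bin packing: lengths are sorted descending by partition_s_compare(), any sequence larger than the limit gets a singleton partition, and each pass then greedily fills one open partition with the longest sequences that still fit. A condensed sketch of that strategy, assuming plain uint32_t lengths instead of partition_s records (greedyPartitionCount is a made-up name):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// First-fit-decreasing sketch: returns how many partitions are needed so
// that no partition exceeds 'limit', except for singleton partitions
// holding oversized sequences.
uint32_t greedyPartitionCount(std::vector<uint32_t> len, uint32_t limit) {
    std::sort(len.begin(), len.end(), std::greater<uint32_t>());

    uint32_t parts = 0;
    size_t   first = 0;                       // first non-oversized entry

    while (first < len.size() && len[first] > limit) {
        parts++;                              // oversized: partition by itself
        first++;
    }

    std::vector<bool> placed(len.size(), false);
    size_t remaining = len.size() - first;

    while (remaining > 0) {                   // one open partition per pass
        uint32_t space = limit;
        parts++;
        for (size_t i = first; i < len.size(); i++)
            if (!placed[i] && len[i] <= space) {
                placed[i] = true;             // longest sequence that fits
                space    -= len[i];
                remaining--;
            }
    }
    return parts;
}
```

Sorting longest-first is what makes the greedy placement effective: big sequences claim space early, and short ones fill the leftover gaps.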
// fprintf(stdout, F_U32"\n", openP); for (uint32 o=1; o<=openP; o++) { uint32 sizeP = 0; for (uint32 i=0; igetNumberOfSequences(); partition_s *p = loadPartition(F); uint32 openP = 1; // Currently open partition uint32 sizeP = 0; // Size of open partition uint32 seqsP = n; // Number of sequences to partition // For any sequences larger than partitionSize, create // partitions containing just one sequence // for (uint32 i=0; i partitionSize) { p[i].partition = openP++; seqsP--; } } // For the remaining, iterate through the list, // greedily placing the longest sequence that fits // into the open partition // while (seqsP > 0) { for (uint32 i=0; igetNumberOfSequences(); partition_s *p = loadPartition(F); if (partitionSize > n) partitionSize = n; // The size, in bases, of each partition // uint32 *s = new uint32 [partitionSize]; for (uint32 i=0; igetNumberOfSequences(); partition_s *p = new partition_s [n]; uint32 numSeqPerPart = (uint32)ceil(n / (double)numSegments); for (uint32 i=0; igetSequenceLength(i); p[i].index = i; p[i].partition = i / numSeqPerPart + 1; } outputPartition(F, prefix, p, numSegments, n); delete [] p; delete F; } canu-1.6/src/meryl/leaff-simulate.C000066400000000000000000000146511314437614700172500ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/leaff/simseq.C * * Modifications by: * * Brian P. Walenz from 2004-JUN-24 to 2004-OCT-10 * are Copyright 2004 Brian P. 
Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2008-SEP-22 to 2009-JUN-13 * are Copyright 2008-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-12 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "mt19937ar.H" // This is Liliana Florea's sequencing error simulator. Bri hacked // it to use a real RNG, and to make it work from leaff. typedef struct edit_script { uint32 optype; uint32 num; struct edit_script *next; } EditScript_t; typedef struct align { uint32 offset, len; EditScript_t *script; } Align_t; // This guy is provided by leaff extern mtRandom MT; // RAND returns x numbers, starting at number y. // #define RAND(x,y) (int)((y) + (MT.mtRandom32() % (x))) #define max(x,y) ((x)>=(y) ? (x):(y)) #define min(x,y) ((x)<=(y) ? 
(x):(y)) #define MOV 3 #define SUB 2 #define INS 1 #define DEL 0 EditScript_t * new_script(int optype, uint32 num, EditScript_t *next) { EditScript_t *newtp = new EditScript_t; newtp->optype = optype; newtp->num = num; newtp->next = next; return(newtp); } /* DEL(pos), SUB(pos) - modifY position pos; INS - insert right before pos */ void insert(Align_t *aln, uint32 in_pos, uint32 in_optype) { uint32 i, num, optype; EditScript_t *t, *tp; //fprintf(stderr, "Modify script op=%d pos=%d\n", in_optype, in_pos); for (t=aln->script, i=0, tp=NULL; t; tp=t, t=t->next) { num = t->num; optype = t->optype; switch (optype) { case INS: if (in_pos==i+1) { if (tp) tp->next = new_script(in_optype, 1, tp->next); else aln->script = new_script(in_optype, 1, aln->script); return; } break; case DEL: i += num; break; case SUB: case MOV: if (inum = l; tp = t; tp->next = new_script(in_optype, 1, tp->next); tp = tp->next; tp->next = new_script(optype, r, tp->next); } else if (!l) { if (tp) tp->next = new_script(in_optype, 1, t); else aln->script = new_script(in_optype, 1, aln->script); if (in_optype!=INS) t->num -= 1; } else { tp = t; tp->next = new_script(in_optype, 1, tp->next); if (in_optype!=INS) t->num -= 1; } return; } i += num; break; default: fprintf(stderr, "Unrecognized optype (%d).\n", in_optype); break; } } //fprintf(stderr, "Failed to modify sequence (%d,%d).\n", in_optype, in_pos); } void print_simseq(char *seq, char *hdr, Align_t *aln, double P, uint32 CUT, uint32 COPY) { uint32 k, e; char *s; char let_4[4] = {'A','C','G','T'}; char let_3A[3] = {'C','G','T'}; char let_3C[3] = {'A','G','T'}; char let_3G[3] = {'A','C','T'}; char let_3T[3] = {'A','C','G'}; EditScript_t *t; fprintf(stdout, ">"); while ((*hdr) && !isspace(*hdr)) fprintf(stdout, "%c", *hdr++); fprintf(stdout, ":seq=%d:copy=%d:loc=%d-%d:err=%1.2f\n", CUT+1, COPY+1, aln->offset, aln->offset+aln->len-1, P); s = seq + aln->offset-1; for (t=aln->script; t; t=t->next) { if (*s == 0) break; switch (t->optype) { case INS: 
for (k=0; knum; k++) { e = RAND(4,0); fprintf(stdout, "%c", let_4[e]); } break; case DEL: while (*s && t->num) { s++; t->num--; } break; case SUB: for (k=0; knum; k++) { e = RAND(3,0); if (*s=='A') fprintf(stdout, "%c", let_3A[e]); else if (*s=='C') fprintf(stdout, "%c", let_3C[e]); else if (*s=='G') fprintf(stdout, "%c", let_3G[e]); else if (*s=='T') fprintf(stdout, "%c", let_3T[e]); else fprintf(stdout, "%c", 'A'); s++; } break; case MOV: for (k=0; knum; k++) { if (*s == 0) { k = t->num; } else { fprintf(stdout, "%c", *s); s++; } } break; default: fprintf(stderr, "Unrecognized optype (%d).\n", t->optype); break; } } fprintf(stdout, "\n"); } void simseq(char *seq, char *hdr, uint32 len, uint32 N, uint32 L, uint32 C, double P) { Align_t align; uint32 i, j, k; uint32 start; EditScript_t *s; for (i=0; inext; delete s; } } } } canu-1.6/src/meryl/leaff-statistics.C000066400000000000000000000103021314437614700176040ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/leaff/stats.C * * Modifications by: * * Brian P. Walenz from 2009-FEB-07 to 2014-APR-11 * are Copyright 2009,2012-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-08 to 2015-MAR-21 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "seqCache.H" #include using namespace std; void stats(char *filename, uint64 refLen) { seqCache *F = new seqCache(filename); bool V[256]; for (uint32 i=0; i<256; i++) V[i] = false; V['n'] = true; V['N'] = true; uint32 numSeq = F->getNumberOfSequences(); uint64 Ss = 0; // actual length of span uint64 Rs = 0; // reference length of span uint32 *Ls = new uint32 [numSeq]; uint64 Sb = 0; uint64 Rb = 0; uint32 *Lb = new uint32 [numSeq]; for (uint32 i=0; igetSequenceInCore(s); uint32 len = S->sequenceLength(); uint32 span = len; uint32 base = len; for (uint32 pos=1; possequence()[pos]]) base--; } Ss += span; Sb += base; Ls[S->getIID()] = span; Lb[S->getIID()] = base; delete S; } if (refLen > 0) { Rs = refLen; Rb = refLen; } else { Rs = Ss; Rb = Sb; } //qsort(Ls, numSeq, sizeof(uint32), uint32_compare); //qsort(Lb, numSeq, sizeof(uint32), uint32_compare); sort(Ls, Ls + numSeq); sort(Lb, Lb + numSeq); reverse(Ls, Ls + numSeq); reverse(Lb, Lb + numSeq); uint32 n50s[11] = {0}; uint32 l50s[11] = {0}; uint32 n50b[11] = {0}; uint32 l50b[11] = {0}; uint32 sizes[11] = {0}; uint32 sizeb[11] = {0}; for (uint32 i=0; i<11; i++) { sizes[i] = i * Rs / 10; sizeb[i] = i * Rb / 10; //fprintf(stderr, "SIZE %2d s=%d b=%d\n", i, sizes[i], sizeb[i]); } for (uint32 i=0, sum=0, n=1; (i < numSeq) && (n < 11); i++) { if ((sum < sizes[n]) && (sizes[n] <= sum + Ls[i])) { n50s[n] = Ls[i]; l50s[n] = i; n++; } sum += Ls[i]; } for (uint32 i=0, sum=0, n=1; (i < numSeq) && (n < 11); i++) { if ((sum < sizeb[n]) && (sizeb[n] <= sum + Lb[i])) { n50b[n] = Lb[i]; l50b[n] = i; n++; } sum += Lb[i]; } //for (uint32 i=0, sum=0; sum < Rb/2; i++) { //} fprintf(stdout, "%s\n", F->getSourceName()); fprintf(stdout, "\n"); fprintf(stdout, 
"numSeqs " F_U32 "\n", numSeq); fprintf(stdout, "\n"); fprintf(stdout, "SPAN (smallest " F_U32 " largest " F_U32 ")\n", Ls[numSeq-1], Ls[0]); for (uint32 i=1; i<10; i++) fprintf(stdout, "n" F_U32 " %10" F_U32P " at index " F_U32 "\n", 10 * i, n50s[i], l50s[i]); fprintf(stdout, "totLen %10" F_U64P "\n", Ss); fprintf(stdout, "refLen %10" F_U64P "\n", Rs); fprintf(stdout, "\n"); fprintf(stdout, "BASES (smallest " F_U32 " largest " F_U32 ")\n", Lb[numSeq-1], Lb[0]); for (uint32 i=1; i<10; i++) fprintf(stdout, "n" F_U32 " %10" F_U32P " at index " F_U32 "\n", 10 * i, n50b[i], l50b[i]); fprintf(stdout, "totLen %10" F_U64P "\n", Sb); fprintf(stdout, "refLen %10" F_U64P "\n", Rb); delete [] Ls; delete [] Lb; } canu-1.6/src/meryl/leaff.C000066400000000000000000000711421314437614700154250ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/leaff/leaff.C * * Modifications by: * * Brian P. Walenz from 2003-JAN-02 to 2003-OCT-14 * are Copyright 2003 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-FEB-20 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-MAR-06 to 2014-APR-11 * are Copyright 2005-2009,2011-2012,2014 J. 
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Liliana Florea on 2011-NOV-16 * are Copyright 2011 Liliana Florea, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-AUG-22 to 2015-JAN-13 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "seqCache.H" #include "seqStore.H" #include "md5.H" #include "mt19937ar.H" #include "dnaAlphabets.H" // Analysis functions // void dumpBlocks(char *filename); void stats(char *filename, uint64 refLen); void partitionBySize(char *prefix, uint64 partitionSize, char *filename); void partitionByBucket(char *prefix, uint64 partitionSize, char *filename); void partitionBySegment(char *prefix, uint64 numSegments, char *filename); void simseq(char *,char *, uint32, uint32, uint32, uint32, double); void computeGCcontent(char *name); void findDuplicates(char *filename); void mapDuplicates(char *filea, char *fileb); void processFile(char *filename); void processArray(int argc, char **argv); bool doReverse = false; bool doComplement = false; bool withDefLine = true; char *specialDefLine = 0L; uint32 withLineBreaks = 0; bool toUppercase = false; char translate[256] = {0}; seqCache *fasta = 0L; uint32 begPos = (uint32)0; uint32 endPos = ~(uint32)0; uint32 endExtract = ~(uint32)0; mtRandom MT; static void failIfNoSource(void) { if (fasta == 0L) fprintf(stderr, "No source file specified.\n"), exit(1); } static void failIfNotRandomAccess(void) { if (fasta->randomAccessSupported() == false) fprintf(stderr, "Algorithm required random access; soruce file not supported.\n"), exit(1); } static void helpStandard(char 
*program) { fprintf(stderr, "usage: %s [-f fasta-file] [options]\n", program); fprintf(stderr, "\n"); fprintf(stderr, "SOURCE FILES\n"); fprintf(stderr, " -f file: use sequence in 'file' (-F is also allowed for historical reasons)\n"); fprintf(stderr, " -A file: read actions from 'file'\n"); fprintf(stderr, "\n"); fprintf(stderr, "SOURCE FILE EXAMINATION\n"); fprintf(stderr, " -d: print the number of sequences in the fasta\n"); fprintf(stderr, " -i name: print an index, labelling the source 'name'\n"); fprintf(stderr, "\n"); fprintf(stderr, "OUTPUT OPTIONS\n"); fprintf(stderr, " -6 <#>: insert a newline every 60 letters\n"); fprintf(stderr, " (if the next arg is a number, newlines are inserted every\n"); fprintf(stderr, " n letters, e.g., -6 80. Disable line breaks with -6 0,\n"); fprintf(stderr, " or just don't use -6!)\n"); fprintf(stderr, " -e beg end: Print only the bases from position 'beg' to position 'end'\n"); fprintf(stderr, " (space based, relative to the FORWARD sequence!) If\n"); fprintf(stderr, " beg == end, then the entire sequence is printed. It is an\n"); fprintf(stderr, " error to specify beg > end, or beg > len, or end > len.\n"); fprintf(stderr, " -ends n Print n bases from each end of the sequence. One input\n"); fprintf(stderr, " sequence generates two output sequences, with '_5' or '_3'\n"); fprintf(stderr, " appended to the ID. 
If 2n >= length of the sequence, the\n"); fprintf(stderr, " sequence itself is printed, no ends are extracted (they\n"); fprintf(stderr, " overlap).\n"); fprintf(stderr, " -C: complement the sequences\n"); fprintf(stderr, " -H: DON'T print the defline\n"); fprintf(stderr, " -h: Use the next word as the defline (\"-H -H\" will reset to the\n"); fprintf(stderr, " original defline\n"); fprintf(stderr, " -R: reverse the sequences\n"); fprintf(stderr, " -u: uppercase all bases\n"); fprintf(stderr, "\n"); fprintf(stderr, "SEQUENCE SELECTION\n"); fprintf(stderr, " -G n s l: print n randomly generated sequences, 0 < s <= length <= l\n"); fprintf(stderr, " -L s l: print all sequences such that s <= length < l\n"); fprintf(stderr, " -N l h: print all sequences such that l <= %% N composition < h\n"); fprintf(stderr, " (NOTE 0.0 <= l < h < 100.0)\n"); fprintf(stderr, " (NOTE that you cannot print sequences with 100%% N\n"); fprintf(stderr, " This is a useful bug).\n"); fprintf(stderr, " -q file: print sequences from the seqid list in 'file'\n"); fprintf(stderr, " -r num: print 'num' randomly picked sequences\n"); fprintf(stderr, " -s seqid: print the single sequence 'seqid'\n"); fprintf(stderr, " -S f l: print all the sequences from ID 'f' to 'l' (inclusive)\n"); fprintf(stderr, " -W: print all sequences (do the whole file)\n"); fprintf(stderr, "\n"); fprintf(stderr, "LONGER HELP\n"); fprintf(stderr, " -help analysis\n"); fprintf(stderr, " -help examples\n"); } static void helpAnalysis(char *program) { fprintf(stderr, "usage: %s [-f ] [options]\n", program); fprintf(stderr, "\n"); fprintf(stderr, " --findduplicates a.fasta\n"); fprintf(stderr, " Reports sequences that are present more than once. 
Output\n"); fprintf(stderr, " is a list of pairs of deflines, separated by a newline.\n"); fprintf(stderr, "\n"); fprintf(stderr, " --mapduplicates a.fasta b.fasta\n"); fprintf(stderr, " Builds a map of IIDs from a.fasta and b.fasta that have\n"); fprintf(stderr, " identical sequences. Format is \"IIDa <-> IIDb\"\n"); fprintf(stderr, "\n"); fprintf(stderr, " --md5 a.fasta:\n"); fprintf(stderr, " Don't print the sequence, but print the md5 checksum\n"); fprintf(stderr, " (of the entire sequence) followed by the entire defline.\n"); fprintf(stderr, "\n"); fprintf(stderr, " --partition prefix [ n[gmk]bp | n ] a.fasta\n"); fprintf(stderr, " --partitionmap [ n[gmk]bp | n ] a.fasta\n"); fprintf(stderr, " Partition the sequences into roughly equal size pieces of\n"); fprintf(stderr, " size nbp, nkbp, nmbp or ngbp; or into n roughly equal sized\n"); fprintf(stderr, " parititions. Sequences larger that the partition size are\n"); fprintf(stderr, " in a partition by themself. --partitionmap writes a\n"); fprintf(stderr, " description of the partition to stdout; --partiton creates\n"); fprintf(stderr, " a fasta file 'prefix-###.fasta' for each partition.\n"); fprintf(stderr, " Example: -F some.fasta --partition parts 130mbp\n"); fprintf(stderr, " -F some.fasta --partition parts 16\n"); fprintf(stderr, "\n"); fprintf(stderr, " --segment prefix n a.fasta\n"); fprintf(stderr, " Splits the sequences into n files, prefix-###.fasta.\n"); fprintf(stderr, " Sequences are not reordered.\n"); fprintf(stderr, "\n"); fprintf(stderr, " --gccontent a.fasta\n"); fprintf(stderr, " Reports the GC content over a sliding window of\n"); fprintf(stderr, " 3, 5, 11, 51, 101, 201, 501, 1001, 2001 bp.\n"); fprintf(stderr, "\n"); fprintf(stderr, " --testindex a.fasta\n"); fprintf(stderr, " Test the index of 'file'. If index is up-to-date, leaff\n"); fprintf(stderr, " exits successfully, else, leaff exits with code 1. 
If an\n"); fprintf(stderr, " index file is supplied, that one is tested, otherwise, the\n"); fprintf(stderr, " default index file name is used.\n"); fprintf(stderr, "\n"); fprintf(stderr, " --dumpblocks a.fasta\n"); fprintf(stderr, " Generates a list of the blocks of N and non-N. Output\n"); fprintf(stderr, " format is 'base seq# beg end len'. 'N 84 483 485 2' means\n"); fprintf(stderr, " that a block of 2 N's starts at space-based position 483\n"); fprintf(stderr, " in sequence ordinal 84. A '.' is the end of sequence\n"); fprintf(stderr, " marker.\n"); fprintf(stderr, "\n"); fprintf(stderr, " --errors L N C P a.fasta\n"); fprintf(stderr, " For every sequence in the input file, generate new\n"); fprintf(stderr, " sequences including simulated sequencing errors.\n"); fprintf(stderr, " L -- length of the new sequence. If zero, the length\n"); fprintf(stderr, " of the original sequence will be used.\n"); fprintf(stderr, " N -- number of subsequences to generate. If L=0, all\n"); fprintf(stderr, " subsequences will be the same, and you should use\n"); fprintf(stderr, " C instead.\n"); fprintf(stderr, " C -- number of copies to generate. 
Each of the N\n"); fprintf(stderr, " subsequences will have C copies, each with different\n"); fprintf(stderr, " errors.\n"); fprintf(stderr, " P -- probability of an error.\n"); fprintf(stderr, "\n"); fprintf(stderr, " HINT: to simulate ESTs from genes, use L=500, N=10, C=10\n"); fprintf(stderr, " -- make C=10 sequencer runs of N=10 EST sequences\n"); fprintf(stderr, " of length 500bp each.\n"); fprintf(stderr, " to simulate mRNA from genes, use L=0, N=10, C=10\n"); fprintf(stderr, " to simulate reads from genomes, use L=800, N=10, C=1\n"); fprintf(stderr, " -- of course, N= should be increased to give the\n"); fprintf(stderr, " appropriate depth of coverage\n"); fprintf(stderr, "\n"); fprintf(stderr, " --stats a.fasta [refLen]\n"); fprintf(stderr, " Reports size statistics; number, N50, sum, largest.\n"); fprintf(stderr, " If 'refLen' is supplied, N50 is based on this size.\n"); fprintf(stderr, "\n"); fprintf(stderr, " --seqstore out.seqStore\n"); fprintf(stderr, " Converts the input file (-f) to a seqStore file.\n"); } static void helpExamples(char *program) { fprintf(stderr, "usage: %s [-f ] [options]\n", program); fprintf(stderr, "\n"); fprintf(stderr, "Options are ORDER DEPENDENT. Sequences are printed whenever an ACTION occurs\n"); fprintf(stderr, "on the command line. SEQUENCE OPTIONS are not reset when a sequence is printed.\n"); fprintf(stderr, "\n"); fprintf(stderr, "SEQUENCES are numbered starting at ZERO, not one.\n"); fprintf(stderr, "\n"); fprintf(stderr, " Print the first 10 bases of the fourth sequence in file 'genes':\n"); fprintf(stderr, " -f genes -e 0 10 -s 3\n"); fprintf(stderr, "\n"); fprintf(stderr, " Print the first 10 bases of the fourth and fifth sequences:\n"); fprintf(stderr, " -f genes -e 0 10 -s 3 -s 4\n"); fprintf(stderr, "\n"); fprintf(stderr, " Print the fourth and fifth sequences reverse complemented, and the sixth\n"); fprintf(stderr, " sequence forward. 
The second set of -R -C toggle off reverse-complement:\n"); fprintf(stderr, " -f genes -R -C -s 3 -s 4 -R -C -s 5\n"); fprintf(stderr, "\n"); fprintf(stderr, " Convert file 'genes' to a seqStore 'genes.seqStore'. The seqStore\n"); fprintf(stderr, " provides better performance with the kmer tools.\n"); fprintf(stderr, " -f genes --seqstore genes.seqStore\n"); } static void printSequence(char *def, char *seq, uint32 beg, uint32 end) { if (beg >= end) return; if ((endExtract != ~uint32ZERO) && (endExtract + endExtract < end - beg)) { char d[1024]; uint32 l = strlen(seq); snprintf(d, 1024, "%s_5", def); printSequence(d, seq, 0, endExtract); snprintf(d, 1024, "%s_3", def); printSequence(d, seq, l-endExtract, l); return; } if (specialDefLine) def = specialDefLine; if (withDefLine == false) def = 0L; uint32 limit = end - beg; char *n = new char [end - beg + 1]; char *m; if ((doReverse == false) && (doComplement == false)) { m = n; seq += beg; while (limit--) *(m++) = translate[*(seq++)]; } else if ((doReverse == true) && (doComplement == false)) { m = n + limit - 1; seq += beg; while (limit--) *(m--) = translate[*(seq++)]; } else if ((doReverse == false) && (doComplement == true)) { m = n; seq += beg; while (limit--) *(m++) = alphabet.complementSymbol(translate[*(seq++)]); } else if ((doReverse == true) && (doComplement == true)) { m = n + limit - 1; seq += beg; while (limit--) *(m--) = alphabet.complementSymbol(translate[*(seq++)]); } n[end-beg] = 0; if (def) fprintf(stdout, ">%s\n", def); if (withLineBreaks) { char *t = n; char *a = new char [withLineBreaks+1]; while (*t) { uint32 i=0; while ((*t) && (i < withLineBreaks)) a[i++] = *(t++); a[i++] = '\n'; a[i] = 0; fprintf(stdout, "%s", a); } delete [] a; } else { fprintf(stdout, "%s\n", n); } delete [] n; } static void printSequence(seqInCore *sic) { printSequence(sic->header(), sic->sequence(), (begPos!=(uint32)0) ? begPos:0, (endPos!=~uint32(0)) ? 
endPos:sic->sequenceLength()); } static void printSequence(uint32 sid) { seqInCore *sic = fasta->getSequenceInCore(sid); if (sic == 0L) fprintf(stderr, "WARNING: Didn't find sequence with iid '" F_U32 "'\n", sid); else printSequence(sic); delete sic; } static void printSequence(char *sid) { seqInCore *sic = fasta->getSequenceInCore(sid); if (sic == 0L) fprintf(stderr, "WARNING: Didn't find sequence with name/iid '%s'\n", sid); else printSequence(sic); delete sic; } static void printIDsFromFile(char *name) { uint32 idLen = 0; uint32 idMax = 63; char *id = new char [idMax+1]; readBuffer B(name); char x = B.read(); // For optimal performance, we should sort the list of ID's given // by their IID, but the user might have a good reason for wanting // them unsorted. while (B.eof() == false) { while (alphabet.isWhitespace(x) && (B.eof() == false)) x = B.read(); if (B.eof() == false) { idLen = 0; while (!alphabet.isWhitespace(x) && (B.eof() == false)) { id[idLen++] = x; x = B.read(); if (idLen >= idMax) { idMax *= 2; char *newid = new char [idMax+1]; memcpy(newid, id, sizeof(char) * idLen); delete [] id; id = newid; } } id[idLen] = 0; seqInCore *S = fasta->getSequenceInCore(id); if (S == 0L) fprintf(stderr, "WARNING: Didn't find sequence with name/iid '%s'\n", id); else printSequence(S); } } delete [] id; } void processArray(int argc, char **argv) { int arg = 1; while (arg < argc) { if ((strcmp(argv[arg], "-f") == 0) || (strcmp(argv[arg], "-F") == 0)) { delete fasta; fasta = new seqCache(argv[++arg]); } else if (strcmp(argv[arg], "-i") == 0) { failIfNoSource(); ++arg; if ((argv[arg] == 0L) || (argv[arg][0] == '-')) fprintf(stderr, "ERROR: next arg to -i should be 'name', I got '%s'\n", (argv[arg] == 0L) ? 
"(nullpointer)" : argv[arg]), exit(1); for (uint32 s=0; sgetNumberOfSequences(); s++) fprintf(stdout, "G\tseq\t%s:" F_U32 "\t" F_U32 "\t%s\n", argv[arg], s, fasta->getSequenceLength(s), ">unimplemented"); } else if (strcmp(argv[arg], "-d") == 0) { failIfNoSource(); printf(F_U32"\n", fasta->getNumberOfSequences()); } else if (strcmp(argv[arg], "-L") == 0) { uint32 small = strtouint32(argv[++arg]); uint32 large = strtouint32(argv[++arg]); failIfNoSource(); for (uint32 s=0; sgetNumberOfSequences(); s++) if ((small <= fasta->getSequenceLength(s)) && (fasta->getSequenceLength(s) < large)) printSequence(s); } else if (strcmp(argv[arg], "-N") == 0) { double small = atof(argv[++arg]); double large = atof(argv[++arg]); failIfNoSource(); for (uint32 s=0; sgetNumberOfSequences(); s++) { seqInCore *S = fasta->getSequenceInCore(s); uint32 Ns = 0; uint32 len = S->sequenceLength(); char *seq = S->sequence(); for (uint32 i=begPos; igetNumberOfSequences(); s++) printSequence(s); } else if (strcmp(argv[arg], "-G") == 0) { uint32 n = strtouint32(argv[++arg]); uint32 s = strtouint32(argv[++arg]); uint32 l = strtouint32(argv[++arg]); char bases[4] = {'A', 'C', 'G', 'T'}; char *def = new char [1024]; char *seq = new char [l + 1]; if (s == 0) s = 1; if (s > l) fprintf(stderr, "leaff: usage: -G num-seqs min-length max-length\n"), exit(1); for (uint32 i=0; igetSequenceIID(argv[++arg]); uint32 highID = fasta->getSequenceIID(argv[++arg]); if (lowID > highID) { uint32 t = lowID; lowID = highID; highID = t; } for (uint32 s=lowID; (s <= highID) && (s <= fasta->getNumberOfSequences()); s++) printSequence(s); } else if (strcmp(argv[arg], "-r") == 0) { uint32 num = strtouint32(argv[++arg]); failIfNoSource(); failIfNotRandomAccess(); // Impossible to fix, or load whole thing into memory if (num >= fasta->getNumberOfSequences()) num = fasta->getNumberOfSequences(); uint32 *seqs = new uint32 [fasta->getNumberOfSequences()]; for (uint32 i=0; igetNumberOfSequences(); i++) seqs[i] = i; for (uint32 i=0; 
igetNumberOfSequences(); i++) { uint32 j = MT.mtRandom32() % (fasta->getNumberOfSequences() - i) + i; uint32 t = seqs[j]; seqs[j] = seqs[i]; seqs[i] = t; } for (uint32 i=0; igetNumberOfSequences(); s++) { seqInCore *S = fasta->getSequenceInCore(s); fprintf(stdout, "%s %s\n", md5_toascii(md5_string(&md5, S->sequence(), S->sequenceLength()), sum), S->header()); delete S; } delete fasta; exit(0); } else if ((strcmp(argv[arg], "--partition") == 0) || (strcmp(argv[arg], "--partitionmap") == 0)) { char *prefix = 0L; if (strcmp(argv[arg], "--partition") == 0) prefix = argv[++arg]; // does the next arg end with gbp, mbp, kbp or bp? If so, // partition by length, else partition into buckets. // int al = strlen(argv[arg+1]); uint64 ps = strtouint64(argv[arg+1]); char a3 = (al<3) ? '0' : alphabet.toLower(argv[arg+1][al-3]); char a2 = (al<2) ? '0' : alphabet.toLower(argv[arg+1][al-2]); char a1 = (al<1) ? '0' : alphabet.toLower(argv[arg+1][al-1]); // partition! if (!isdigit(a1) || !isdigit(a2) || !isdigit(a3)) { if ((a3 == 'g') && (a2 == 'b') && (a1 == 'p')) { ps *= 1000000000; } else if ((a3 == 'm') && (a2 == 'b') && (a1 == 'p')) { ps *= 1000000; } else if ((a3 == 'k') && (a2 == 'b') && (a1 == 'p')) { ps *= 1000; } else if (isdigit(a3) && (a2 == 'b') && (a1 == 'p')) { ps *= 1; } else { fprintf(stderr, "Unknown partition size option '%s'\n", argv[arg+1]), exit(1); } if (ps == 0) fprintf(stderr, "Unknown or zero partition size '%s'\n", argv[arg+1]), exit(1); partitionBySize(prefix, ps, argv[arg+2]); } else { if (ps == 0) fprintf(stderr, "Unknown or zero partition size '%s'\n", argv[arg+1]), exit(1); partitionByBucket(prefix, ps, argv[arg+2]); } exit(0); } else if (strcmp(argv[arg], "--segment") == 0) { partitionBySegment(argv[arg+1], strtouint32(argv[arg+2]), argv[arg+3]); exit(0); } else if (strcmp(argv[arg], "--gccontent") == 0) { computeGCcontent(argv[++arg]); exit(0); } else if (strcmp(argv[arg], "--dumpblocks") == 0) { dumpBlocks(argv[++arg]); exit(0); } else if 
(strcmp(argv[arg], "--stats") == 0) { stats(argv[arg+1], (argv[arg+2] != 0L) ? strtouint64(argv[arg+2]) : 0); exit(0); } else if (strcmp(argv[arg], "--errors") == 0) { uint32 L = strtouint32(argv[++arg]); // Desired length uint32 l = 0; // min of desired length, length of sequence uint32 N = strtouint32(argv[++arg]); // number of copies per sequence uint32 C = strtouint32(argv[++arg]); // number of mutations per copy double P = atof(argv[++arg]); // probability of mutation uint32 i = 0; fasta = new seqCache(argv[++arg]); seqInCore *S = fasta->getSequenceInCore(i++); while (S) { char *seq = S->sequence(); char *hdr = S->header(); uint32 len = S->sequenceLength(); l = len; if ((L > 0) && (L < len)) l = L; simseq(seq, hdr, len, N, l, C, P); delete S; S = fasta->getSequenceInCore(i++); } delete fasta; exit(0); } else if (strcmp(argv[arg], "--seqstore") == 0) { constructSeqStore(argv[++arg], fasta); exit(0); } else if (strcmp(argv[arg], "-help") == 0) { if ((argv[arg+1]) && (strcmp(argv[arg+1], "analysis") == 0)) helpAnalysis(argv[0]); else if ((argv[arg+1]) && (strcmp(argv[arg+1], "examples") == 0)) helpExamples(argv[0]); else helpStandard(argv[0]); exit(0); } else { helpStandard(argv[0]); fprintf(stderr, "Unknown option '%s'\n", argv[arg]); exit(1); } arg++; } delete fasta; fasta = 0L; } void processFile(char *filename) { FILE *F = NULL; if (strcmp(filename, "-") == 0) { F = stdin; } else { errno = 0; F = fopen(filename, "r"); if (errno) fprintf(stderr, "Couldn't open '%s': %s\n", filename, strerror(errno)), exit(1); } uint64 max = 16 * 1024 * 1024; uint64 pos = 0; size_t len = 0; char *data = new char [max]; // Suck the file into 'data' while (!feof(F)) { errno = 0; len = fread(data+pos, 1, max - pos, F); if (errno) fprintf(stderr, "Couldn't read " F_U64 " bytes from '%s': %s\n", (uint64)(max-pos), filename, strerror(errno)), exit(1); pos += len; if (pos >= max) { max += 16 * 1024 * 1024; char *tmpd = new char [max]; memcpy(tmpd, data, pos); delete [] data; data = 
tmpd; } } if (strcmp(filename, "-") != 0) fclose(F); len = pos; // (over)count the number of words; we start at two, since the // first arg is the name of the program, and if there is only one // word and no whitespace in the file, the below loop fails to // count the second word. int argc = 2; char **argv = 0L; for (uint32 i=0; inextMer()) { if (_isForward) { countingTable[ HASH(M->theFMer()) ]++; numberOfMers++; } if (_isCanonical) { countingTable[ HASH(M->theCMer()) ]++; numberOfMers++; } } delete M; #ifdef STATS uint64 dist[32] = {0}; uint64 maxcnt = 0; for (uint64 i=tableSizeInEntries+1; i--; ) { if (countingTable[i] > maxcnt) maxcnt = countingTable[i]; if (countingTable[i] < 32) dist[countingTable[i]]++; } for(uint64 i=0; i<32; i++) fprintf(stderr, "existDB::usage[%2d] = %d\n", i, dist[i]); fprintf(stderr, "existDB::maxcnt = %d\n", maxcnt); #endif //////////////////////////////////////////////////////////////////////////////// // // Determine how many bits we need to hold the value // numberOfMers.....then.... // // This is numberOfMers+1 because we need to store the // first position after the last mer. That is, if there are two // mers, we will store that the first mer is at position 0, the // second mer is at position 1, and the end of the second mer is at // position 2. // if (_compressedHash) { _hshWidth = 1; while ((numberOfMers+1) > (uint64ONE << _hshWidth)) _hshWidth++; } //////////////////////////////////////////////////////////////////////////////// // // 2) Allocate a hash table and some mer storage buckets. 
// _hashTableWords = tableSizeInEntries + 2; if (_compressedHash) _hashTableWords = _hashTableWords * _hshWidth / 64 + 1; _bucketsWords = numberOfMers + 2; if (_compressedBucket) _bucketsWords = _bucketsWords * _chkWidth / 64 + 1; _countsWords = numberOfMers + 2; if (_compressedCounts) _countsWords = _countsWords * _cntWidth / 64 + 1; if (beVerbose) { fprintf(stderr, "existDB::createFromFastA()-- hashTable is "F_U64"MB\n", _hashTableWords >> 17); fprintf(stderr, "existDB::createFromFastA()-- buckets is "F_U64"MB\n", _bucketsWords >> 17); if (flags & existDBcounts) fprintf(stderr, "existDB::createFromFastA()-- counts is "F_U64"MB\n", _countsWords >> 17); } _hashTable = new uint64 [_hashTableWords]; _buckets = new uint64 [_bucketsWords]; _countsWords = (flags & existDBcounts) ? _countsWords : 0; _counts = (flags & existDBcounts) ? new uint64 [_countsWords] : 0L; // These aren't strictly needed. _buckets is cleared as it is initialied. _hashTable // is also cleared as it is initialized, but in the _compressedHash case, the last // few words might be uninitialized. They're unused. // //memset(_hashTable, 0, sizeof(uint64) * _hashTableWords); //memset(_buckets, 0, sizeof(uint64) * _bucketsWords); // buckets is cleared as it is built //memset(_counts, 0, sizeof(uint64) * _countsWords); _hashTable[_hashTableWords-1] = 0; _hashTable[_hashTableWords-2] = 0; _hashTable[_hashTableWords-3] = 0; _hashTable[_hashTableWords-4] = 0; //////////////////////////////////////////////////////////////////////////////// // // Make the hash table point to the start of the bucket, and reset // the counting table -- we're going to use it to fill the buckets. 
// uint64 tmpPosition = 0; uint64 begPosition = 0; uint64 ptr = 0; if (_compressedHash) { for (uint64 i=0; inextMer()) { if (_isForward) insertMer(HASH(M->theFMer()), CHECK(M->theFMer()), 1, countingTable); if (_isCanonical) insertMer(HASH(M->theCMer()), CHECK(M->theCMer()), 1, countingTable); } delete M; // Compress out the gaps we have from redundant kmers. uint64 pos = 0; uint64 frm = 0; uint64 len = 0; for (uint64 i=0; imerSize(); if (merSize != _merSizeInBases) { fprintf(stderr, "createFromMeryl()-- ERROR: requested merSize ("F_U32") is different than merSize in meryl database ("F_U32").\n", merSize, _merSizeInBases); exit(1); } // We can set this exactly, but not memory optimal (see meryl/estimate.C:optimalNumberOfBuckets()). // Instead, we just blindly use whatever meryl used. // uint32 tblBits = M->prefixSize(); // But it is faster to reset to this. Might use 2x the memory. //uint32 tblBits = logBaseTwo64(M->numberOfDistinctMers() + 1); _shift1 = 2 * _merSizeInBases - tblBits; _shift2 = _shift1 / 2; _mask1 = uint64MASK(tblBits); _mask2 = uint64MASK(_shift1); _hshWidth = uint32ZERO; _chkWidth = 2 * _merSizeInBases - tblBits; _cntWidth = 16; uint64 tableSizeInEntries = uint64ONE << tblBits; uint64 numberOfMers = uint64ZERO; uint64 *countingTable = new uint64 [tableSizeInEntries + 1]; if (beVerbose) { fprintf(stderr, "createFromMeryl()-- tableSizeInEntries "F_U64"\n", tableSizeInEntries); fprintf(stderr, "createFromMeryl()-- count range "F_U32"-"F_U32"\n", lo, hi); } for (uint64 i=tableSizeInEntries+1; i--; ) countingTable[i] = 0; _isCanonical = flags & existDBcanonical; _isForward = flags & existDBforward; if (beVerbose) { fprintf(stderr, "createFromMeryl()-- canonical %c\n", (_isCanonical) ? 'T' : 'F'); fprintf(stderr, "createFromMeryl()-- forward %c\n", (_isForward) ? 'T' : 'F'); } assert(_isCanonical + _isForward == 1); // 1) Count bucket sizes // While we don't know the bucket sizes right now, but we do know // how many buckets and how many mers. 
// // Because we could be inserting both forward and reverse, we can't // really move the direction testing outside the loop, unless we // want to do two iterations over M. // speedCounter *C = new speedCounter(" %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, beVerbose); while (M->nextMer()) { if ((lo <= M->theCount()) && (M->theCount() <= hi)) { if (_isForward) { countingTable[ HASH(M->theFMer()) ]++; numberOfMers++; } if (_isCanonical) { kMer r = M->theFMer(); r.reverseComplement(); if (M->theFMer() < r) countingTable[ HASH(M->theFMer()) ]++; else countingTable[ HASH(r) ]++; numberOfMers++; } C->tick(); } } if (beVerbose) fprintf(stderr, "createFromMeryl()-- numberOfMers "F_U64"\n", numberOfMers); delete C; delete M; if (_compressedHash) { _hshWidth = 1; while ((numberOfMers+1) > (uint64ONE << _hshWidth)) _hshWidth++; } if (beVerbose) { fprintf(stderr, "existDB::createFromMeryl()-- Found "F_U64" mers between count of "F_U32" and "F_U32"\n", numberOfMers, lo, hi); } // 2) Allocate hash table, mer storage buckets // _hashTableWords = tableSizeInEntries + 2; if (_compressedHash) _hashTableWords = _hashTableWords * _hshWidth / 64 + 1; _bucketsWords = numberOfMers + 2; if (_compressedBucket) _bucketsWords = _bucketsWords * _chkWidth / 64 + 1; _countsWords = numberOfMers + 2; if (_compressedCounts) _countsWords = _countsWords * _cntWidth / 64 + 1; if (beVerbose) { fprintf(stderr, "existDB::createFromMeryl()-- hashTable is "F_U64"MB\n", _hashTableWords >> 17); fprintf(stderr, "existDB::createFromMeryl()-- buckets is "F_U64"MB\n", _bucketsWords >> 17); if (flags & existDBcounts) fprintf(stderr, "existDB::createFromMeryl()-- counts is "F_U64"MB\n", _countsWords >> 17); } _hashTable = new uint64 [_hashTableWords]; _buckets = new uint64 [_bucketsWords]; _countsWords = (flags & existDBcounts) ? _countsWords : 0; _counts = (flags & existDBcounts) ? new uint64 [_countsWords] : 0L; // These aren't strictly needed. _buckets is cleared as it is initialied. 
_hashTable // is also cleared as it is initialized, but in the _compressedHash case, the last // few words might be uninitialized. They're unused. //memset(_hashTable, 0, sizeof(uint64) * _hashTableWords); //memset(_buckets, 0, sizeof(uint64) * _bucketsWords); // buckets is cleared as it is built //memset(_counts, 0, sizeof(uint64) * _countsWords); _hashTable[_hashTableWords-1] = 0; _hashTable[_hashTableWords-2] = 0; _hashTable[_hashTableWords-3] = 0; _hashTable[_hashTableWords-4] = 0; //////////////////////////////////////////////////////////////////////////////// // // Make the hash table point to the start of the bucket, and reset // the counting table -- we're going to use it to fill the buckets. // uint64 tmpPosition = 0; uint64 begPosition = 0; uint64 ptr = 0; if (_compressedHash) { for (uint64 i=0; inextMer()) { if ((lo <= M->theCount()) && (M->theCount() <= hi)) { if (_isForward) insertMer(HASH(M->theFMer()), CHECK(M->theFMer()), M->theCount(), countingTable); if (_isCanonical) { kMer r = M->theFMer(); r.reverseComplement(); if (M->theFMer() < r) insertMer(HASH(M->theFMer()), CHECK(M->theFMer()), M->theCount(), countingTable); else insertMer(HASH(r), CHECK(r), M->theCount(), countingTable); numberOfMers++; } C->tick(); } } delete C; delete M; delete [] countingTable; return(true); } canu-1.6/src/meryl/libkmer/existDB-create-from-sequence.C000066400000000000000000000211541314437614700233450ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. 
* Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2012-MAY-08 to 2014-APR-11 * are Copyright 2012-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-AUG-31 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "existDB.H" #include "merStream.H" bool existDB::createFromSequence(char const *sequence, uint32 merSize, uint32 flags) { bool beVerbose = false; bool rebuilding = false; _hashTable = 0L; _buckets = 0L; _counts = 0L; _merSizeInBases = merSize; _searchForDupe = true; if ((flags & existDBcompressHash) || (flags & existDBcompressBuckets) || (flags & existDBcompressCounts)) fprintf(stderr, "existDB::createFromSequence: compression not supported.\n"), exit(1); // This (at =22) eats up 16MB, and should allow a lot of mers at big sizes. Unfortunately, we // know nothing about how man mers are going to be in the input. // // Setting this too high drastically reduces performance, suspected because of cache misses. // Setting this too low will also reduce performance, by increasing the search time in a bucket. 
// uint32 tblBits = logBaseTwo64(strlen(sequence)); rebuild: _shift1 = 2 * _merSizeInBases - tblBits; _shift2 = _shift1 / 2; _mask1 = uint64MASK(tblBits); _mask2 = uint64MASK(_shift1); _hshWidth = uint32ZERO; _chkWidth = 2 * merSize - tblBits; _cntWidth = 16; uint64 tableSizeInEntries = uint64ONE << tblBits; uint64 numberOfMers = uint64ZERO; uint64 *countingTable = new uint64 [tableSizeInEntries + 1]; for (uint64 i=tableSizeInEntries+1; i--; ) countingTable[i] = 0; _isCanonical = flags & existDBcanonical; _isForward = flags & existDBforward; assert(_isCanonical + _isForward == 1); //////////////////////////////////////////////////////////////////////////////// // // 1) Count bucket sizes // merStream *M = new merStream(new kMerBuilder(_merSizeInBases), new seqStream(sequence, strlen(sequence)), true, true); while (M->nextMer()) { if (_isForward) { countingTable[ HASH(M->theFMer()) ]++; numberOfMers++; } if (_isCanonical) { countingTable[ HASH(M->theCMer()) ]++; numberOfMers++; } } delete M; #ifdef STATS uint64 dist[32] = {0}; uint64 maxcnt = 0; for (uint64 i=tableSizeInEntries+1; i--; ) { if (countingTable[i] > maxcnt) maxcnt = countingTable[i]; if (countingTable[i] < 32) dist[countingTable[i]]++; } for(uint64 i=0; i<32; i++) fprintf(stderr, "existDB::usage[%2d] = %d\n", i, dist[i]); fprintf(stderr, "existDB::maxcnt = %d\n", maxcnt); #endif //////////////////////////////////////////////////////////////////////////////// // // Determine how many bits we need to hold the value // numberOfMers.....then.... // // This is numberOfMers+1 because we need to store the // first position after the last mer. That is, if there are two // mers, we will store that the first mer is at position 0, the // second mer is at position 1, and the end of the second mer is at // position 2. 
// if (_compressedHash) { _hshWidth = 1; while ((numberOfMers+1) > (uint64ONE << _hshWidth)) _hshWidth++; } //////////////////////////////////////////////////////////////////////////////// // // 2) Allocate a hash table and some mer storage buckets. // _hashTableWords = tableSizeInEntries + 2; if (_compressedHash) _hashTableWords = _hashTableWords * _hshWidth / 64 + 1; _bucketsWords = numberOfMers + 2; if (_compressedBucket) _bucketsWords = _bucketsWords * _chkWidth / 64 + 1; _countsWords = numberOfMers + 2; if (_compressedCounts) _countsWords = _countsWords * _cntWidth / 64 + 1; if (beVerbose) { fprintf(stderr, "existDB::createFromSequence()-- hashTable is "F_U64"MB\n", _hashTableWords >> 17); fprintf(stderr, "existDB::createFromSequence()-- buckets is "F_U64"MB\n", _bucketsWords >> 17); if (flags & existDBcounts) fprintf(stderr, "existDB::createFromSequence()-- counts is "F_U64"MB\n", _countsWords >> 17); } _hashTable = new uint64 [_hashTableWords]; _buckets = new uint64 [_bucketsWords]; _countsWords = (flags & existDBcounts) ? _countsWords : 0; _counts = (flags & existDBcounts) ? new uint64 [_countsWords] : 0L; // These aren't strictly needed. _buckets is cleared as it is initialied. _hashTable // is also cleared as it is initialized, but in the _compressedHash case, the last // few words might be uninitialized. They're unused. //memset(_hashTable, 0, sizeof(uint64) * _hashTableWords); //memset(_buckets, 0, sizeof(uint64) * _bucketsWords); // buckets is cleared as it is built //memset(_counts, 0, sizeof(uint64) * _countsWords); _hashTable[_hashTableWords-1] = 0; _hashTable[_hashTableWords-2] = 0; _hashTable[_hashTableWords-3] = 0; _hashTable[_hashTableWords-4] = 0; //////////////////////////////////////////////////////////////////////////////// // // Make the hash table point to the start of the bucket, and reset // the counting table -- we're going to use it to fill the buckets. 
// uint64 tmpPosition = 0; uint64 begPosition = 0; uint64 ptr = 0; if (_compressedHash) { for (uint64 i=0; inextMer()) { if (_isForward) insertMer(HASH(M->theFMer()), CHECK(M->theFMer()), 1, countingTable); if (_isCanonical) insertMer(HASH(M->theCMer()), CHECK(M->theCMer()), 1, countingTable); } delete M; // Compress out the gaps we have from redundant kmers. uint64 pos = 0; uint64 frm = 0; uint64 len = 0; for (uint64 i=0; i 0) _counts = new uint64 [_countsWords]; fread(_hashTable, sizeof(uint64), _hashTableWords, F); fread(_buckets, sizeof(uint64), _bucketsWords, F); if (_countsWords > 0) fread(_counts, sizeof(uint64), _countsWords, F); } fclose(F); if (errno) { fprintf(stderr, "existDB::loadState()-- Read failure.\n%s\n", strerror(errno)); exit(1); } return(true); } void existDB::printState(FILE *stream) { fprintf(stream, "merSizeInBases: "F_U32"\n", _merSizeInBases); fprintf(stream, "tableBits "F_U32"\n", 2 * _merSizeInBases - _shift1); fprintf(stream, "-----------------\n"); fprintf(stream, "_hashTableWords "F_U64" ("F_U64" KB)\n", _hashTableWords, _hashTableWords >> 7); fprintf(stream, "_bucketsWords "F_U64" ("F_U64" KB)\n", _bucketsWords, _bucketsWords >> 7); fprintf(stream, "_countsWords "F_U64" ("F_U64" KB)\n", _countsWords, _countsWords >> 7); fprintf(stream, "-----------------\n"); fprintf(stream, "_shift1: "F_U32"\n", _shift1); fprintf(stream, "_shift2 "F_U32"\n", _shift2); fprintf(stream, "_mask1 "F_X64"\n", _mask1); fprintf(stream, "_mask2 "F_X64"\n", _mask2); if (_compressedHash) { fprintf(stream, "_compressedHash true\n"); fprintf(stream, "_hshWidth "F_U32"\n", _hshWidth); } else { fprintf(stream, "_compressedHash false\n"); fprintf(stream, "_hshWidth undefined\n"); } if (_compressedBucket) { fprintf(stream, "_compressedBucket true\n"); fprintf(stream, "_chkWidth "F_U32"\n", _chkWidth); } else { fprintf(stream, "_compressedBucket false\n"); fprintf(stream, "_chkWidth undefined\n"); } if (_compressedCounts) { fprintf(stream, "_compressedCount 
true\n"); fprintf(stream, "_cntWidth "F_U32"\n", _cntWidth); } else { fprintf(stream, "_compressedCount false\n"); fprintf(stream, "_cntWidth undefined\n"); } } canu-1.6/src/meryl/libkmer/existDB.C000066400000000000000000000121531314437614700173340ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2003-JAN-02 to 2003-OCT-20 * are Copyright 2003 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-APR-12 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-MAR-20 to 2014-APR-11 * are Copyright 2005-2008,2012-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "existDB.H" #include "AS_UTL_fileIO.H" existDB::existDB(char const *filename, bool loadData) { clear(); _compressedHash = false; _compressedBucket = false; if (loadState(filename, true, loadData) == false) { fprintf(stderr, "existDB::existDB()-- Tried to read state from '%s', but failed.\n", filename); exit(1); } } existDB::existDB(char const *filename, uint32 merSize, existDBflags flags, uint32 lo, uint32 hi) { clear(); _compressedHash = flags & existDBcompressHash; _compressedBucket = flags & existDBcompressBuckets; _compressedCounts = flags & existDBcompressCounts; _searchForDupe = false; // Try to read state from the filename. If successful, make sure // that the merSize is correct. // if (loadState(filename)) { bool fail = false; if (_merSizeInBases != merSize) { fprintf(stderr, "existDB::existDB()-- Read state from '%s', but got different mer sizes\n", filename); fprintf(stderr, "existDB::existDB()-- Got "F_U32", expected "F_U32"\n", _merSizeInBases, merSize); fail = true; } if (fail) exit(1); return; } // If no direction flags are set, set the default direction of // forward. Stupid precedence rules. // if ((flags & (existDBcanonical | existDBforward)) == uint32ZERO) flags |= existDBforward; // If we can open 'filename' for reading, then we assume the file // is a multi-fasta, and we build an existDB/ // // Otherwise, we assume that 'filename' is really the prefix for a // meryl database. 
if (AS_UTL_fileExists(filename)) createFromFastA(filename, merSize, flags); else createFromMeryl(filename, merSize, lo, hi, flags); } existDB::existDB(char const *sequence, uint32 merSize, existDBflags flags) { clear(); _compressedHash = flags & existDBcompressHash; _compressedBucket = flags & existDBcompressBuckets; _compressedCounts = flags & existDBcompressCounts; if ((flags & (existDBcanonical | existDBforward)) == uint32ZERO) flags |= existDBforward; createFromSequence(sequence, merSize, flags); } existDB::~existDB() { delete [] _hashTable; delete [] _buckets; delete [] _counts; } bool existDB::exists(uint64 mer) { uint64 c, h, st, ed; if (_compressedHash) { h = HASH(mer) * _hshWidth; st = getDecodedValue(_hashTable, h, _hshWidth); ed = getDecodedValue(_hashTable, h + _hshWidth, _hshWidth); } else { h = HASH(mer); st = _hashTable[h]; ed = _hashTable[h+1]; } if (st == ed) return(false); c = CHECK(mer); if (_compressedBucket) { st *= _chkWidth; ed *= _chkWidth; for (; st> _shift1) ^ (k >> _shift2) ^ k) & _mask1); }; uint64 CHECK(uint64 k) { return(k & _mask2); }; void insertMer(uint64 hsh, uint64 chk, uint64 cnt, uint64 *countingTable) { // If the mer is already here, just update the count. This only // works if not _compressedBucket, and only makes sense for loading from // fasta or sequence. 
if ((_compressedBucket == false) && (_searchForDupe)) { uint64 st = _hashTable[hsh]; uint64 ed = countingTable[hsh]; for (; stgetSequenceInCore(); intervalList IL; speedCounter SC(" %8f frags (%8.5f frags/sec)\r", 1, 1000, true); while (S) { merStream *MS = new merStream(new kMerBuilder(22), new seqStream(S->sequence(), S->sequenceLength()), true, true); IL.clear(); while (MS->nextMer()) { if (E->exists(MS->theFMer())) { IL.add(MS->thePositionInSequence(), 22); } } IL.merge(); if (IL.sumOfLengths() > 0) { fprintf(stdout, "%5.2f\n", 100.0 * IL.sumOfLengths() / (double)S->sequenceLength()); } delete MS; delete S; SC.tick(); S = Q->getSequenceInCore(); } delete Q; delete E; return(0); } canu-1.6/src/meryl/libkmer/percentCovered.mk000066400000000000000000000006261314437614700211710ustar00rootroot00000000000000#CXXFLAGS := -fopenmp -D_GLIBCXX_PARALLEL -O3 -fPIC -m64 -pipe -Wno-write-strings #LDFLAGS := -fopenmp -lm TARGET := percentCovered SOURCES := percentCovered.C SRC_INCDIRS := ../libutil ../libbio ../libseq ../libkmer ../libmeryl TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lkmer -lmeryl -lseq -lbio -lutil TGT_PREREQS := libkmer.a libmeryl.a libseq.a libbio.a libutil.a SUBMAKEFILES := canu-1.6/src/meryl/libkmer/posDB.mk000066400000000000000000000006001314437614700172200ustar00rootroot00000000000000#CXXFLAGS := -fopenmp -D_GLIBCXX_PARALLEL -O3 -fPIC -m64 -pipe -Wno-write-strings #LDFLAGS := -fopenmp -lm TARGET := posDB SOURCES := driver-posDB.C SRC_INCDIRS := ../libutil ../libbio ../libseq ../libmeryl TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lkmer -lmeryl -lseq -lbio -lutil TGT_PREREQS := libkmer.a libmeryl.a libseq.a libbio.a libutil.a SUBMAKEFILES := canu-1.6/src/meryl/libkmer/positionDB-access.C000066400000000000000000000254071314437614700213110ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into 
contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2003-JAN-02 to 2003-AUG-14 * are Copyright 2003 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-APR-21 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-FEB-07 to 2014-APR-11 * are Copyright 2005-2008,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "positionDB.H" void positionDB::reallocateSpace(uint64*& posn, uint64& posnMax, uint64& posnLen, uint64 len) { if (posnMax < posnLen + len) { uint64 *pp; posnMax = posnLen + len + (len >> 2); if (posnMax == 0) posnMax = 16384; try { pp = new uint64 [posnMax]; } catch (...) { fprintf(stderr, "positionDB::get()-- Can't allocate space for more positions, requested "F_U64" uint64's.\n", posnMax); abort(); } memcpy(pp, posn, sizeof(uint64) * posnLen); delete [] posn; posn = pp; } } void positionDB::loadPositions(uint64 J, uint64*& posn, uint64& posnMax, uint64& posnLen, uint64& count) { uint64 sizs[3] = {_pptrWidth, 1, _sizeWidth}; uint64 vals[3] = {0, 0, 1}; getDecodedValues(_buckets, J + _chckWidth, (_sizeWidth == 0) ? 2 : 3, sizs, vals); // If the size is stored, the count is updated to the correct // thing. If it's not stored, the count is set to 1 by the default // value of vals[2], and reset after we get the number of positions // stored. 
// count = vals[2]; if (vals[1]) { reallocateSpace(posn, posnMax, posnLen, 64); posn[posnLen++] = vals[0]; } else { uint64 ptr = vals[0] * _posnWidth; uint64 len = getDecodedValue(_positions, ptr, _posnWidth); if (_sizeWidth == 0) count = len; reallocateSpace(posn, posnMax, posnLen, len + 64); for (ptr += _posnWidth; len > 0; ptr += _posnWidth, len--) posn[posnLen++] = getDecodedValue(_positions, ptr, _posnWidth); } } bool positionDB::getExact(uint64 mer, uint64*& posn, uint64& posnMax, uint64& posnLen, uint64& count) { uint64 h = HASH(mer); uint64 c = CHECK(mer); uint64 st, ed; if (_hashTable_BP) { st = getDecodedValue(_hashTable_BP, h * _hashWidth, _hashWidth); ed = getDecodedValue(_hashTable_BP, h * _hashWidth + _hashWidth, _hashWidth); } else { st = _hashTable_FW[h]; ed = _hashTable_FW[h+1]; } posnLen = 0; if (st == ed) return(false); for (uint64 i=st, J=st * _wFin; i 0) return(vals[2]); if (vals[1]) return(1); return(getDecodedValue(_positions, vals[0] * _posnWidth, _posnWidth)); } } return(0); } uint64 positionDB::setCount(uint64 mer, uint64 count) { uint64 h = HASH(mer); uint64 c = CHECK(mer); uint64 st, ed; if (_hashTable_BP) { st = getDecodedValue(_hashTable_BP, h * _hashWidth, _hashWidth); ed = getDecodedValue(_hashTable_BP, h * _hashWidth + _hashWidth, _hashWidth); } else { st = _hashTable_FW[h]; ed = _hashTable_FW[h+1]; } if (st == ed) return(0); for (uint64 i=st, J=st * _wFin; i 0) count = vals[3]; // What happened here: By default, the count is 1. If it is // NOT a unique mer in the table, we reset the count to the // number of entries in the table. Then, if there is a count // stored in the table, we reset the count again. // Move on to copying the data, if in the correct range. if (vals[2] == 1) { // Is a single mer in our table. Copy if the actual count is // acceptable. if ((lo <= count) && (count < hi)) { okCount++; setDecodedValues(_buckets, nb, (_sizeWidth == 0) ? 
3 : 4, sizs, vals); nb += _wFin; } else { _numberOfDistinct--; _numberOfMers--; loCount++; } } else { // Mer has more than one location in the table. Copy all // locations if the count is acceptable. if ((lo <= count) && (count < hi)) { okCount++; // Copy the bucket vals[1] = np / _posnWidth; setDecodedValues(_buckets, nb, (_sizeWidth == 0) ? 3 : 4, sizs, vals); nb += _wFin; // Copy length of the positions if (cp != np) setDecodedValue(_positions, np, _posnWidth, len); np += _posnWidth; cp += _posnWidth; // Copy positions while (len > 0) { if (cp != np) setDecodedValue(_positions, np, _posnWidth, getDecodedValue(_positions, cp, _posnWidth)); np += _posnWidth; cp += _posnWidth; len--; } } else { // Not acceptable count _numberOfDistinct--; _numberOfEntries -= len; if (count < lo) loCount++; if (count > hi) hiCount++; } } // Move to the next entry st++; cb += _wFin; } // Over all entries in the bucket // Update the end position of this bucket if (_hashTable_BP) setDecodedValue(_hashTable_BP, h * _hashWidth + _hashWidth, _hashWidth, nb / _wFin); else _hashTable_FW[h+1] = nb / _wFin; } // Over all buckets fprintf(stderr, "positionDB::filter()-- Filtered "F_U64" kmers less than "F_U64"\n", loCount, lo); fprintf(stderr, "positionDB::filter()-- Filtered "F_U64" kmers more than "F_U64"\n", hiCount, hi); fprintf(stderr, "positionDB::filter()-- Saved "F_U64" kmers with acceptable count\n", okCount); //dump("posDB.after"); } canu-1.6/src/meryl/libkmer/positionDB-dump.C000066400000000000000000000046721314437614700210160ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. 
* * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2003-JAN-02 to 2003-AUG-14 * are Copyright 2003 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-APR-21 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2007-DEC-11 to 2014-APR-11 * are Copyright 2007-2008,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "positionDB.H" void positionDB::dump(char *name) { uint64 sizs[4] = {_chckWidth, _pptrWidth, 1, _sizeWidth}; uint64 vals[4] = {0, 0, 0, 0}; FILE *F = fopen(name, "w"); for (uint64 h=0; h<_tableSizeInEntries; h++) { uint64 st, ed; if (_hashTable_BP) { st = getDecodedValue(_hashTable_BP, h * _hashWidth, _hashWidth); ed = getDecodedValue(_hashTable_BP, h * _hashWidth + _hashWidth, _hashWidth); } else { st = _hashTable_FW[h]; ed = _hashTable_FW[h+1]; } fprintf(F, "B "F_U64" "F_U64"-"F_U64"\n", h, st, ed); while (st < ed) { uint64 cb = st * _wFin; getDecodedValues(_buckets, cb, (_sizeWidth == 0) ? 3 : 4, sizs, vals); fprintf(F, "%c chk="F_X64" pos="F_U64" siz="F_U64, (vals[2] == 0) ? 
'D' : 'U', vals[0], vals[1], vals[3]); if (vals[2] == 0) { uint64 pos = vals[1] * _posnWidth; uint64 len = getDecodedValue(_positions, pos, _posnWidth); for (pos += _posnWidth; len > 0; pos += _posnWidth, len--) fprintf(F, " "F_U64, getDecodedValue(_positions, pos, _posnWidth)); } fprintf(F, "\n"); st++; } } fclose(F); } canu-1.6/src/meryl/libkmer/positionDB-file.C000066400000000000000000000244201314437614700207610ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2003-AUG-14 to 2004-APR-01 * are Copyright 2003-2004 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2004-APR-30 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-MAR-12 to 2014-APR-11 * are Copyright 2005,2007-2008,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "positionDB.H" static char magic[16] = { 'p', 'o', 's', 'i', 't', 'i', 'o', 'n', 'D', 'B', '.', 'v', '1', ' ', ' ', ' ' }; static char faild[16] = { 'p', 'o', 's', 'i', 't', 'i', 'o', 'n', 'D', 'B', 'f', 'a', 'i', 'l', 'e', 'd' }; // These came from kmer/libutil/file.c and need to be replaced with more standard functions // Split writes/reads into smaller pieces, check the result of each // piece. Really needed by OSF1 (V5.1). // void safeWrite(int filedes, const void *buffer, const char *desc, size_t nbytes) { size_t position = 0; size_t length = 32 * 1024 * 1024; size_t towrite = 0; size_t written = 0; while (position < nbytes) { towrite = length; if (position + towrite > nbytes) towrite = nbytes - position; errno = 0; written = write(filedes, ((char *)buffer) + position, towrite); if ((errno) || (towrite != written)) { fprintf(stderr, "safeWrite()-- Write failure on %s: %s\n", desc, strerror(errno)); fprintf(stderr, "safeWrite()-- Wanted to write "F_S64" bytes, wrote "F_S64".\n", (int64)towrite, (int64)written); exit(1); } position += written; } } int safeRead(int filedes, const void *buffer, const char *desc, size_t nbytes) { size_t position = 0; size_t length = 32 * 1024 * 1024; size_t toread = 0; size_t written = 0; // readen? 
int failed = 0; while (position < nbytes) { toread = length; if (position + toread > nbytes) toread = nbytes - position; errno = 0; written = read(filedes, ((char *)buffer) + position, toread); failed = errno; #ifdef VERY_SAFE if (toread != written) failed = 1; #endif if ((failed) && (errno != EINTR)) { fprintf(stderr, "safeRead()-- Read failure on %s: %s.\n", desc, strerror(errno)); fprintf(stderr, "safeRead()-- Wanted to read "F_S64" bytes, read "F_S64".\n", (int64)toread, (int64)written); exit(1); } if (written == 0) break; position += written; } return(position); } void positionDB::saveState(char const *filename) { fprintf(stderr, "Saving positionDB to '%s'\n", filename); errno = 0; int F = open(filename, O_RDWR | O_CREAT | O_LARGEFILE, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH); if (errno) { fprintf(stderr, "Can't open '%s' for writing positionDB.\n%s\n", filename, strerror(errno)); exit(1); } bool magicFirst = false; // Test if this is a pipe. If so, we write the magic first, // otherwise we write the magic last. // errno = 0; lseek(F, 0, SEEK_SET); if (errno == ESPIPE) magicFirst = true; if (magicFirst) write(F, magic, sizeof(char) * 16); else write(F, faild, sizeof(char) * 16); if (errno) { fprintf(stderr, "positionDB::saveState()-- Write failure on magic first.\n%s\n", strerror(errno)); exit(1); } // If only to be completely annoying and anal, we clear the // pointers before we write the data. Sure, we could just write // the stuff we care about, but this is easier. This is easier. // Before you go rip out this stuff, remember that you can now // checksum the resulting files. So don't do it. // uint32 *bs = _bucketSizes; uint64 *cb = _countingBuckets; uint64 *hp = _hashTable_BP; uint32 *hw = _hashTable_FW; uint64 *bu = _buckets; uint64 *ps = _positions; uint64 *he = _hashedErrors; _bucketSizes = 0L; _countingBuckets = 0L; _hashTable_BP = (uint64 *)((_hashTable_BP) ? uint64ONE : uint64ZERO); _hashTable_FW = (uint32 *)((_hashTable_FW) ? 
uint64ONE : uint64ZERO); _buckets = 0L; _positions = 0L; _hashedErrors = 0L; safeWrite(F, this, "this", sizeof(positionDB) * 1); _bucketSizes = bs; _countingBuckets = cb; _hashTable_BP = hp; _hashTable_FW = hw; _buckets = bu; _positions = ps; _hashedErrors = he; if (_hashTable_BP) { safeWrite(F, _hashTable_BP, "_hashTable_BP", sizeof(uint64) * (_tableSizeInEntries * _hashWidth / 64 + 1)); } else { safeWrite(F, _hashTable_FW, "_hashTable_FW", sizeof(uint32) * (_tableSizeInEntries + 1)); } safeWrite(F, _buckets, "_buckets", sizeof(uint64) * (_numberOfDistinct * _wFin / 64 + 1)); safeWrite(F, _positions, "_positions", sizeof(uint64) * (_numberOfEntries * _posnWidth / 64 + 1)); safeWrite(F, _hashedErrors, "_hashedErrors", sizeof(uint64) * (_hashedErrorsLen)); if (magicFirst == false) { lseek(F, 0, SEEK_SET); if (errno) { fprintf(stderr, "positionDB::saveState()-- Failed to seek to start of file -- write failed.\n%s\n", strerror(errno)); exit(1); } write(F, magic, sizeof(char) * 16); if (errno) { fprintf(stderr, "positionDB::saveState()-- Write failure on magic last.\n%s\n", strerror(errno)); exit(1); } } close(F); } bool positionDB::loadState(char const *filename, bool beNoisy, bool loadData) { char cigam[16] = { 0 }; fprintf(stderr, "Loading positionDB from '%s'\n", filename); errno = 0; int F = open(filename, O_RDONLY | O_LARGEFILE, 0); if (errno) { fprintf(stderr, "Can't open '%s' for reading pre-built positionDB: %s\n", filename, strerror(errno)); return(false); } safeRead(F, cigam, "Magic Number", sizeof(char) * 16); if (strncmp(faild, cigam, 16) == 0) { if (beNoisy) { fprintf(stderr, "positionDB::loadState()-- Incomplete positionDB binary file.\n"); fprintf(stderr, "positionDB::loadState()-- Read '%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c'\n", cigam[0], cigam[1], cigam[2], cigam[3], cigam[4], cigam[5], cigam[6], cigam[7], cigam[8], cigam[9], cigam[10], cigam[11], cigam[12], cigam[13], cigam[14], cigam[15]); fprintf(stderr, "positionDB::loadState()-- Expected 
'%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c'\n", magic[0], magic[1], magic[2], magic[3], magic[4], magic[5], magic[6], magic[7], magic[8], magic[9], magic[10], magic[11], magic[12], magic[13], magic[14], magic[15]); } close(F); return(false); } else if (strncmp(magic, cigam, 16) != 0) { if (beNoisy) { fprintf(stderr, "positionDB::loadState()-- Not a positionDB binary file, maybe a sequence file?\n"); fprintf(stderr, "positionDB::loadState()-- Read '%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c'\n", cigam[0], cigam[1], cigam[2], cigam[3], cigam[4], cigam[5], cigam[6], cigam[7], cigam[8], cigam[9], cigam[10], cigam[11], cigam[12], cigam[13], cigam[14], cigam[15]); fprintf(stderr, "positionDB::loadState()-- Expected '%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c%c'\n", magic[0], magic[1], magic[2], magic[3], magic[4], magic[5], magic[6], magic[7], magic[8], magic[9], magic[10], magic[11], magic[12], magic[13], magic[14], magic[15]); } close(F); return(false); } safeRead(F, this, "positionDB", sizeof(positionDB) * 1); _bucketSizes = 0L; _countingBuckets = 0L; _buckets = 0L; _positions = 0L; _hashedErrors = 0L; if (loadData) { uint64 hs = _tableSizeInEntries * _hashWidth / 64 + 1; uint64 bs = _numberOfDistinct * _wFin / 64 + 1; uint64 ps = _numberOfEntries * _posnWidth / 64 + 1; if (_hashTable_BP) { _hashTable_BP = new uint64 [hs]; _hashTable_FW = 0L; safeRead(F, _hashTable_BP, "_hashTable_BP", sizeof(uint64) * hs); } else { _hashTable_BP = 0L; _hashTable_FW = new uint32 [_tableSizeInEntries + 1]; safeRead(F, _hashTable_FW, "_hashTable_FW", sizeof(uint32) * (_tableSizeInEntries + 1)); } _buckets = new uint64 [bs]; _positions = new uint64 [ps]; _hashedErrors = new uint64 [_hashedErrorsMax]; safeRead(F, _buckets, "_buckets", sizeof(uint64) * bs); safeRead(F, _positions, "_positions", sizeof(uint64) * ps); safeRead(F, _hashedErrors, "_hashedErrors", sizeof(uint64) * _hashedErrorsLen); } close(F); return(true); } void positionDB::printState(FILE *stream) { fprintf(stream, "merSizeInBases: "F_U32"\n", 
_merSizeInBases); fprintf(stream, "merSkipInBases: "F_U32"\n", _merSkipInBases); fprintf(stream, "tableSizeInBits: "F_U32"\n", _tableSizeInBits); fprintf(stream, "tableSizeInEntries: "F_U64"\n", _tableSizeInEntries); fprintf(stream, "hashWidth: "F_U32"\n", _hashWidth); fprintf(stream, "chckWidth: "F_U32"\n", _chckWidth); fprintf(stream, "posnWidth: "F_U32"\n", _posnWidth); fprintf(stream, "numberOfMers: "F_U64"\n", _numberOfMers); fprintf(stream, "numberOfPositions: "F_U64"\n", _numberOfPositions); fprintf(stream, "numberOfDistinct: "F_U64"\n", _numberOfDistinct); fprintf(stream, "numberOfUnique: "F_U64"\n", _numberOfUnique); fprintf(stream, "numberOfEntries: "F_U64"\n", _numberOfEntries); fprintf(stream, "maximumEntries: "F_U64"\n", _maximumEntries); } canu-1.6/src/meryl/libkmer/positionDB-mismatch.C000066400000000000000000000302121314437614700216430ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2007-NOV-11 to 2014-APR-11 * are Copyright 2007-2008,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "positionDB.H" static int stringscmp(const void *A, const void *B) { uint64 const a = *(uint64 const *)A; uint64 const b = *(uint64 const *)B; if (a < b) return(-1); if (a > b) return(1); return(0); } static uint32 makeUnique(uint64 *strings, uint32 stringsLen) { qsort(strings, stringsLen, sizeof(uint64), stringscmp); uint32 len = 0; uint32 nxt = 1; while (nxt < stringsLen) { if (strings[len] != strings[nxt]) { len++; strings[len] = strings[nxt]; } nxt++; } return(len+1); } #if 0 // debug static void dumpPatterns(uint64 *strings, uint32 stringsLen, uint32 ts) { for (uint32 i=0; i= stringsMax) stringsLen = makeUnique(strings, stringsLen); for (uint32 x=0; x<243; x++) strings[stringsLen++] = HASH((m1 & e1[x]) ^ (m2 & e2[x]) ^ (m3 & e3[x]) ^ (m4 & e4[x]) ^ (m5 & e5[x])); } stringsLen = makeUnique(strings, stringsLen); stringsLen = makeUnique(strings, stringsLen); //dumpPatterns(strings, stringsLen, _tableSizeInBits); //fprintf(stderr, "DONE5 totpat="F_U64" toterr="F_U64" stringsLen="F_U32"\n", totpat, toterr, stringsLen); } // Six errors if (6 <= _nErrorsAllowed) { for (uint32 ai=0; ai<_merSizeInBases; ai++) for (uint32 bi=0; bi= stringsMax) stringsLen = makeUnique(strings, stringsLen); for (uint32 x=0; x<729; x++) strings[stringsLen++] = HASH((m1 & e1[x]) ^ (m2 & e2[x]) ^ (m3 & e3[x]) ^ (m4 & e4[x]) ^ (m5 & e5[x]) ^ (m6 & e6[x])); } stringsLen = makeUnique(strings, stringsLen); stringsLen = makeUnique(strings, stringsLen); //dumpPatterns(strings, stringsLen, _tableSizeInBits); //fprintf(stderr, "DONE6 totpat="F_U64" toterr="F_U64" stringsLen="F_U32"\n", totpat, toterr, stringsLen); } if (7 <= _nErrorsAllowed) { fprintf(stderr, "Only 6 errors allowed.\n"); exit(1); } for (uint32 i=1; i> 1)); if (err <= numMismatches) { diffs = REBUILD(hash, chck) ^ mer; d1 = diffs & uint64NUMBER(0x5555555555555555); d2 = diffs & uint64NUMBER(0xaaaaaaaaaaaaaaaa); err = countNumberOfSetBits64(d1 | (d2 >> 1)); if (err <= numMismatches) // err is junk, just need a parameter 
here loadPositions(J, posn, posnMax, posnLen, err); } } } } return(posnLen > 0); } canu-1.6/src/meryl/libkmer/positionDB-sort.C000066400000000000000000000114551314437614700210350ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2003-JAN-02 to 2003-MAY-06 * are Copyright 2003 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-APR-21 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2006-JUL-07 to 2014-APR-11 * are Copyright 2006-2008,2011,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "positionDB.H" void adjustHeap(uint64 *C, uint64 *P, int64 i, int64 n) { uint64 c = C[i]; uint64 p = P[i]; int64 j = (i << 1) + 1; // let j be the left child while (j < n) { if ((j < n-1) && (C[j] < C[j+1])) j++; if (c >= C[j]) // a position for M[i] has been found break; C[(j-1)/2] = C[j]; // Move larger child up a level P[(j-1)/2] = P[j]; j = (j << 1) + 1; } C[(j-1)/2] = c; P[(j-1)/2] = p; } void positionDB::sortAndRepackBucket(uint64 b) { uint64 st = _bucketSizes[b]; uint64 ed = _bucketSizes[b+1]; uint32 le = (uint32)(ed - st); if (ed < st) fprintf(stdout, "ERROR: Bucket "F_U64" starts at "F_U64" ends at "F_U64"?\n", b, st, ed); if (le == 0) return; // One mer in the list? It's distinct and unique! (and doesn't // contribute to the position list space count) // if (le == 1) { _numberOfDistinct++; _numberOfUnique++; return; } // Allocate more space, if we need to. // if (_sortedMax <= le) { _sortedMax = le + 1024; delete [] _sortedChck; delete [] _sortedPosn; _sortedChck = new uint64 [_sortedMax]; _sortedPosn = new uint64 [_sortedMax]; } // Unpack the bucket // uint64 lens[3] = {_chckWidth, _posnWidth, 1 + _sizeWidth}; uint64 vals[3] = {0}; for (uint64 i=st, J=st * _wCnt; i<ed; i++, J += _wCnt) { getDecodedValues(_countingBuckets, J, 3, lens, vals); _sortedChck[i-st] = vals[0]; _sortedPosn[i-st] = vals[1]; } // Build the heap // uint32 unsetBucket = 0; for (int64 t=(le-2)/2; t>=0; t--) { if (_sortedPosn[t] == uint64MASK(_posnWidth)) { unsetBucket = 1; fprintf(stdout, "ERROR: unset posn bucket="F_U64" t="F_S64" le="F_U32"\n", b, t, le); } adjustHeap(_sortedChck, _sortedPosn, t, le); } if (unsetBucket) for (uint32 t=0; t<le; t++) fprintf(stdout, "bucket="F_U64" t="F_U32" chck="F_X64" posn="F_U64"\n", b, t, _sortedChck[t], _sortedPosn[t]); // Pull the maximum off the heap, swap it to the end, reheap // for (uint32 t=le-1; t>0; t--) { uint64 tc = _sortedChck[t]; uint64 tp = _sortedPosn[t]; _sortedChck[t] = _sortedChck[0]; _sortedPosn[t] = _sortedPosn[0]; _sortedChck[0] = tc; _sortedPosn[0] = tp; adjustHeap(_sortedChck, _sortedPosn, 0, t); } // Scan the list of sorted mers, counting the number of distinct and unique, // and the space needed in the position list.
uint64 entries = 1; // For t=0 for (uint32 t=1; t<le; t++) { if (_sortedChck[t-1] > _sortedChck[t]) fprintf(stdout, "ERROR: bucket="F_U64" t="F_U32" le="F_U32": "F_X64" > "F_X64"\n", b, t, le, _sortedChck[t-1], _sortedChck[t]); if (_sortedChck[t-1] != _sortedChck[t]) { _numberOfDistinct++; if (_maximumEntries < entries) _maximumEntries = entries; if (entries == 1) _numberOfUnique++; else _numberOfEntries += entries + 1; // +1 for the length entries = 0; } entries++; } // Don't forget the last mer! // _numberOfDistinct++; if (_maximumEntries < entries) _maximumEntries = entries; if (entries == 1) _numberOfUnique++; else _numberOfEntries += entries + 1; // Repack the sorted entries // for (uint64 i=st, J=st * _wCnt; i> 2) & 0x3333333333333333llu) | ((_md << 2) & 0xccccccccccccccccllu); _md = ((_md >> 4) & 0x0f0f0f0f0f0f0f0fllu) | ((_md << 4) & 0xf0f0f0f0f0f0f0f0llu); _md = ((_md >> 8) & 0x00ff00ff00ff00ffllu) | ((_md << 8) & 0xff00ff00ff00ff00llu); _md = ((_md >> 16) & 0x0000ffff0000ffffllu) | ((_md << 16) & 0xffff0000ffff0000llu); _md = ((_md >> 32) & 0x00000000ffffffffllu) | ((_md << 32) & 0xffffffff00000000llu); // Complement the bases _md ^= 0xffffffffffffffffllu; // Shift and mask out the bases not in the mer _md >>= 64 - _merSize * 2; _md &= uint64MASK(_merSize * 2); return(_md); } positionDB::positionDB(char const *filename, uint32 merSize, uint32 merSkip, uint32 maxMismatch, bool loadData) { memset(this, 0, sizeof(positionDB)); // loadData == false only for driver-posDB.C, and only so it can // dump stats on a posDB file.
if (loadState(filename, true, false) == false) { fprintf(stderr, "positionDB()-- Tried to read state from '%s', but failed.\n", filename); exit(1); } if ((loadData) && (merSize != _merSizeInBases)) { fprintf(stderr, "positionDB()-- Tried to read state from '%s', but mer size is wrong (found "F_U32", wanted "F_U32").\n", filename, _merSizeInBases, merSize); exit(1); } if ((loadData) && (merSkip != _merSkipInBases)) { fprintf(stderr, "positionDB()-- Tried to read state from '%s', but mer skip is wrong (found "F_U32", wanted "F_U32").\n", filename, _merSkipInBases, merSkip); exit(1); } if ((loadData) && (maxMismatch != _nErrorsAllowed)) { fprintf(stderr, "positionDB()-- Tried to read state from '%s', but max number of mismatches is wrong (found "F_U32", wanted "F_U32").\n", filename, _nErrorsAllowed, maxMismatch); exit(1); } if (loadState(filename, true, loadData) == false) { fprintf(stderr, "positionDB()-- Tried to read state from '%s', but failed.\n", filename); exit(1); } } positionDB::positionDB(merStream *MS, uint32 merSize, uint32 merSkip, existDB *mask, existDB *only, merylStreamReader *counts, uint32 minCount, uint32 maxCount, uint32 maxMismatch, uint32 maxMemory, bool beVerbose) { memset(this, 0, sizeof(positionDB)); // Guesstimate a nice table size based on the number of input mers // and the mersize, unless the user gave us a table size. // // We need to ensure that // 2 * merSize + posnWidth + 1 - 64 <= tblBits <= 2 * merSize - 4 // // The catch is that we don't exactly know posnWidth right now. We // can overestimate it, though, based on the size of the sequence // that is backing the merStream. // // The second catch is that we don't want to make tblBits too big // or too small. If too big, we waste a lot of memory in the hash // table pointers, and if too small, we waste even more memory in // the data table (not to mention the algorithm dies because it // assumed buckets in the data table are small). 
// // The memory size is (roughly): // // 2^tblBits * log(numDistinctMers) + // numDistinctMers * (2*merSize - tblBits + 1 + log(numMers) + // (numMers - numUniqieMers) * log(numMers) // // this is approximately proportional to: // // 2^tblBits * posnWidth + // approxMers * (2*merSize - tblBits + 1 + posnWidth) // uint64 approxMers = MS->approximateNumberOfMers(); uint64 posnWidth = logBaseTwo64(approxMers + 1); // Find the smallest and largest tblBits we could possibly use. // uint64 sm = 2 * merSize + posnWidth + 1 - 64; uint64 lg = 2 * merSize - 4; if (2 * merSize + posnWidth + 1 < 64) sm = 2; if (sm < 16) sm = 16; if (sm > lg) { fprintf(stderr, "ERROR: too many mers for this mersize!\n"); fprintf(stderr, " sm = "F_U64"\n", sm); fprintf(stderr, " lg = "F_U64"\n", lg); fprintf(stderr, " merSize = "F_U32" bits\n", 2 * merSize); fprintf(stderr, " approxMers = "F_U64" mers\n", approxMers); fprintf(stderr, " posnWidth = "F_U64" bits\n", posnWidth); exit(1); } // Iterate through all the choices, picking the one with the // smallest expected footprint. // { if (beVerbose) { fprintf(stderr, "potential configurations for approximately "F_U64" "F_U32"-mers (posnW="F_U64").\n", approxMers, merSize, posnWidth); } uint64 mini = 0; // tblSize of the smallest found uint64 minm = ~mini; // memory size of the smallest found double minw = 0.0; // work of the smallest found uint64 memory = 0; double effort = 0; if (maxMemory == 0) maxMemory = ~uint32ZERO; for (uint64 i=sm; i<=lg; i++) { // These are only needed if maxMismatch is set, but it's // simpler to always set. 
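The estimate in the comment above can be turned into a small calculator. A sketch under stated assumptions: `bitsFor`, `estimateBits`, and `pickTableBits` are hypothetical helpers, the formula is taken literally from the comment (in bits), and the mismatch-effort term that the real constructor also weighs is ignored.

```cpp
#include <cassert>
#include <cstdint>

// Bits needed to represent values in 0..n, i.e. ceil(log2(n + 1)).
uint32_t bitsFor(uint64_t n) {
  uint32_t b = 0;
  while (n > 0) { b++; n >>= 1; }
  return b;
}

// Estimated structure size in bits for a given table size, following the
// comment above:
//   2^tblBits * posnWidth  +  approxMers * (2*merSize - tblBits + 1 + posnWidth)
uint64_t estimateBits(uint32_t tblBits, uint32_t merSize, uint64_t approxMers) {
  uint64_t posnWidth = bitsFor(approxMers);
  return (uint64_t(1) << tblBits) * posnWidth +
         approxMers * (2 * merSize - tblBits + 1 + posnWidth);
}

// Scan the legal range of table sizes and keep the cheapest, as the
// constructor's configuration loop does.
uint32_t pickTableBits(uint32_t merSize, uint64_t approxMers, uint32_t lo, uint32_t hi) {
  uint32_t best     = lo;
  uint64_t bestBits = ~uint64_t(0);
  for (uint32_t i = lo; i <= hi; i++) {
    uint64_t m = estimateBits(i, merSize, approxMers);
    if (m < bestBits) { bestBits = m; best = i; }
  }
  return best;
}
```

The tension the comment describes is visible here: the first term grows with the table, the second shrinks, so the minimum sits somewhere between the `sm` and `lg` bounds computed above.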
// _merSizeInBases = merSize; _merSizeInBits = 2 * _merSizeInBases; _merSkipInBases = merSkip; _tableSizeInBits = i; _tableSizeInEntries = uint64ONE << _tableSizeInBits; _hashWidth = uint32ZERO; _hashMask = uint64MASK(_tableSizeInBits); _chckWidth = _merSizeInBits - _tableSizeInBits; _posnWidth = uint64ZERO; _sizeWidth = 0; _shift1 = _merSizeInBits - _tableSizeInBits; _shift2 = _shift1 / 2; _mask1 = uint64MASK(_tableSizeInBits); _mask2 = uint64MASK(_shift1); // Everyone wants to know the memory size (in MB). // memory = ((uint64ONE << i) * posnWidth + approxMers * (2*merSize - i + 1 + posnWidth)) >> 23; // If we know we're looking for mismatches, we compute the amount // of work needed per lookup, and use that, instead of strict // memory sizing, to decide the table size. // if (maxMismatch > 0) effort = setUpMismatchMatcher(maxMismatch, approxMers); // If our memory size is smaller than allowed, AND it's the // smallest, or the work is smaller, save the table size. // if ((memory < maxMemory) && ((memory < minm) || (effort < minw))) { mini = i; minm = memory; minw = effort; } if (beVerbose) { fprintf(stderr, "tblBits=%02lu shifts=%02u,%02u -- size %8.3fGB -- work %8.3f%s\n", i, _shift1, _shift2, memory / 1024.0, effort, (mini == i) ?
" ***" : ""); } } _tableSizeInBits = mini; } if (_tableSizeInBits == 0) { fprintf(stderr, "ERROR: No positionDB parameters within allowed memory limit.\n"); exit(1); } if (beVerbose) { uint32 s1 = 2*merSize-_tableSizeInBits; fprintf(stderr, "tblBits="F_U32" s1="F_U32" s2="F_U32" -- merSize="F_U32" bits + posnWidth="F_U64" bits (est "F_U64" mers) FINAL\n", _tableSizeInBits, s1, s1/2, merSize, posnWidth, approxMers); } _merSizeInBases = merSize; _merSizeInBits = 2 * _merSizeInBases; _merSkipInBases = merSkip; _tableSizeInEntries = uint64ONE << _tableSizeInBits; _hashWidth = uint32ZERO; _hashMask = uint64MASK(_tableSizeInBits); _chckWidth = _merSizeInBits - _tableSizeInBits; _posnWidth = uint64ZERO; _sizeWidth = 0; if (maxCount == 0) maxCount = ~uint32ZERO; if (counts) _sizeWidth = (maxCount < ~uint32ZERO) ? logBaseTwo64(maxCount+1) : 32; _shift1 = _merSizeInBits - _tableSizeInBits; _shift2 = _shift1 / 2; _mask1 = uint64MASK(_tableSizeInBits); _mask2 = uint64MASK(_shift1); #if 0 fprintf(stderr, "merSizeInBits "F_U32"\n", _merSizeInBits); fprintf(stderr, "hashWidth "F_U32"\n", _hashWidth); fprintf(stderr, "chckWidth "F_U32"\n", _chckWidth); fprintf(stderr, "shift1 "F_U32"\n", _shift1); fprintf(stderr, "shift2 "F_U32"\n", _shift2); #endif if (maxMismatch > 0) setUpMismatchMatcher(maxMismatch, approxMers); build(MS, mask, only, counts, minCount, maxCount, beVerbose); } void positionDB::build(merStream *MS, existDB *mask, existDB *only, merylStreamReader *counts, uint32 minCount, uint32 maxCount, bool beVerbose) { _bucketSizes = 0L; _countingBuckets = 0L; _hashTable_BP = 0L; _hashTable_FW = 0L; _buckets = 0L; _positions = 0L; _wCnt = 0; _wFin = 0; // For get/setDecodedValues(). uint64 lensC[4] = {~uint64ZERO, ~uint64ZERO, ~uint64ZERO, ~uint64ZERO}; uint64 lensF[4] = {~uint64ZERO, ~uint64ZERO, ~uint64ZERO, ~uint64ZERO}; uint64 vals[4] = {0}; uint64 nval = (_sizeWidth == 0) ? 
3 : 4; _numberOfMers = uint64ZERO; _numberOfPositions = uint64ZERO; _numberOfDistinct = uint64ZERO; _numberOfUnique = uint64ZERO; _numberOfEntries = uint64ZERO; _maximumEntries = uint64ZERO; // We assume later that these are already allocated. _sortedMax = 16384; _sortedChck = new uint64 [_sortedMax]; _sortedPosn = new uint64 [_sortedMax]; if (MS == 0L) { fprintf(stderr, "positionDB()-- ERROR: No merStream? Nothing to build a table with!\n"); exit(1); } MS->rewind(); //////////////////////////////////////////////////////////////////////////////// // // 1) Count bucket sizes // // We'll later want to reuse the _bucketSizes space for storing the // hash table. To make it somewhat safe, we allocate the space as // uint64, then cast it to be uint32. // // bktAllocIsJunk tells us if we should release this memory (if we // need to allocate separate space for the hash table). We'd need // to do this if the hashWidth is more than 32 bits, but we won't // know that for a little bit. // // The _bucketSizes is offset by one from bktAlloc so that we don't // overwrite _bucketSizes when we are constructing hash table. // uint64 *bktAlloc; try { bktAlloc = new uint64 [_tableSizeInEntries / 2 + 4]; } catch (std::bad_alloc) { fprintf(stderr, "positionDB()-- caught std::bad_alloc in %s at line %d\n", __FILE__, __LINE__); fprintf(stderr, "positionDB()-- bktAlloc = new uint64 ["F_U64"]\n", _tableSizeInEntries / 2 + 4); exit(1); } bool bktAllocIsJunk = false; bzero(bktAlloc, sizeof(uint64) * (_tableSizeInEntries / 2 + 4)); // Why +2? We try to reuse the bktAlloc space for the hash table, // which is constructed from the bucketSizes. The hashTable is // built from the bucketSizes. It definitely needs to be +1, and // so we use +2 just in case the human is being stupid again. 
// _bucketSizes = (uint32 *)(bktAlloc + 2); #ifdef ERROR_CHECK_COUNTING fprintf(stdout, "ERROR_CHECK_COUNTING is defined.\n"); uint32 *_errbucketSizes = new uint32 [_tableSizeInEntries + 2]; for (uint64 i=0; i<_tableSizeInEntries + 2; i++) _errbucketSizes[i] = uint32ZERO; #endif if (beVerbose) fprintf(stderr, " Allocated bucket size counting space with total size "F_U64" KB\n", _tableSizeInEntries >> 8); speedCounter *C = new speedCounter(" %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, beVerbose); // Two choices here // // 1) No masking or onlying is done. Stream the mers and just // count the positions. This is the original behavior. // // 2) Masking or onlying is done. Open the output stream file, // stream the mers by, checking for mask/only of both // forward and reverse mers. If either is found, push // the (forward) mer and position onto the stream. // close the output stream. // // Save the mer if it doesn't exist in the mask (both f and r), // or does exist in the only (either f or r), add it. // // The input databases for mask and only are (currently) made // using canonical mers. We halve the number of exists() by // also using canonical mers here. // MS->rewind(); while (MS->nextMer(_merSkipInBases)) { _bucketSizes[ HASH(MS->theFMer()) ]++; #ifdef ERROR_CHECK_COUNTING _errbucketSizes[ HASH(MS->theFMer()) ]++; #endif _numberOfMers++; _numberOfPositions = MS->thePositionInStream(); assert((_numberOfPositions >> 60) == 0); C->tick(); } delete C; C = 0L; if (beVerbose) fprintf(stderr, " Found "F_U64" mers (max position = "F_U64")\n", _numberOfMers, _numberOfPositions); // This caught a nasty bug in merStream rewind(), and it's pretty // cheap, so I left it in. Search for the other DEBUGnumPositions. // uint64 DEBUGnumPositions = _numberOfPositions + 1; // This is _numberOfMers+1 because we need to store the first // position after the last mer. 
That is, if there are two mers, we // will store that the first mer is at position 0, the second mer // is at position 1, and the end of the second mer is at position // 2. // // In reality, it should be the number of distinct mers, not the // total number of mers, but we don't know that yet. And so // occasionally we'll make things too big and waste a bit of // memory. // _hashWidth = logBaseTwo64(_numberOfMers+1); _posnWidth = logBaseTwo64(_numberOfPositions+1); /////////////////////////////////////////////////////////////////////////////// // // 2) Allocate buckets and make bucketSizes be a pointer into them // _wCnt = _chckWidth + _posnWidth + 1 + _sizeWidth; lensC[0] = _chckWidth; lensC[1] = _posnWidth; lensC[2] = 1; lensC[3] = _sizeWidth; uint64 bucketsSpace = (_numberOfMers+1) * _wCnt / 64 + 1; uint32 endPosition = 0; if (beVerbose) fprintf(stderr, " Allocated "F_U64"KB for buckets ("F_U64" 64-bit words)\n", bucketsSpace >> 7, bucketsSpace); try { _countingBuckets = new uint64 [bucketsSpace]; } catch (std::bad_alloc) { fprintf(stderr, "positionDB()-- caught std::bad_alloc in %s at line %d\n", __FILE__, __LINE__); fprintf(stderr, "positionDB()-- _countingBuckets = new uint64 ["F_U64"]\n", bucketsSpace); exit(1); } for (uint64 i=0; irewind(); while (MS->nextMer(_merSkipInBases)) { uint64 h = HASH(MS->theFMer()); #ifdef ERROR_CHECK_COUNTING if (_bucketSizes[h] == 0) { char str[33]; fprintf(stderr, "positionDB()-- ERROR_CHECK_COUNTING: Bucket "F_U64" ran out of things! '%s'\n", h, MS->theFMer().merToString(str)); fprintf(stderr, "positionDB()-- ERROR_CHECK_COUNTING: Stream is at "F_U64"\n", MS->thePositionInStream()); } #endif _bucketSizes[h]--; #ifdef ERROR_CHECK_COUNTING _errbucketSizes[h]--; #endif #ifdef ERROR_CHECK_EMPTY_BUCKETS // Check that everything is empty. Empty is defined as set to all 1's. 
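The buckets being filled here are not arrays of structs but bit-packed fields, read and written through getDecodedValues()/setDecodedValues(). The underlying idea — a w-bit value at an arbitrary bit offset inside an array of 64-bit words, possibly straddling a word boundary — can be sketched as follows. `setPacked`/`getPacked` are hypothetical stand-ins; the real kmer-library routines have a different interface and may differ in bit order.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Store a w-bit value at bit position bitPos in a packed array of 64-bit
// words.  The value may straddle a word boundary, so the high part spills
// into the next word.
void setPacked(std::vector<uint64_t> &words, uint64_t bitPos, uint32_t w, uint64_t val) {
  uint64_t word = bitPos >> 6;
  uint32_t off  = bitPos & 63;
  uint64_t mask = (w == 64) ? ~uint64_t(0) : ((uint64_t(1) << w) - 1);
  val &= mask;
  words[word] = (words[word] & ~(mask << off)) | (val << off);
  if (off + w > 64) {                         // value spills into the next word
    uint32_t used   = 64 - off;               // bits stored in the low word
    uint64_t hiMask = mask >> used;
    words[word + 1] = (words[word + 1] & ~hiMask) | (val >> used);
  }
}

// Fetch the w-bit value back, reassembling the two halves if it straddled.
uint64_t getPacked(const std::vector<uint64_t> &words, uint64_t bitPos, uint32_t w) {
  uint64_t word = bitPos >> 6;
  uint32_t off  = bitPos & 63;
  uint64_t mask = (w == 64) ? ~uint64_t(0) : ((uint64_t(1) << w) - 1);
  uint64_t v = words[word] >> off;
  if (off + w > 64)
    v |= words[word + 1] << (64 - off);
  return v & mask;
}
```

This is why the widths computed just above (`_hashWidth`, `_posnWidth`) matter so much: every record costs exactly its field widths in bits, not a rounded-up machine word.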
getDecodedValues(_countingBuckets, (uint64)_bucketSizes[h] * (uint64)_wCnt, nval, lensC, vals); if (((~vals[0]) & uint64MASK(lensC[0])) || ((~vals[1]) & uint64MASK(lensC[1])) || ((~vals[2]) & uint64MASK(lensC[2])) || ((lensC[3] > 0) && ((~vals[3]) & uint64MASK(lensC[3])))) fprintf(stdout, "ERROR_CHECK_EMPTY_BUCKETS: countingBucket not empty! pos=%lu 0x%016lx 0x%016lx 0x%016lx 0x%016lx\n", _bucketSizes[h] * _wCnt, (~vals[0]) & uint64MASK(lensC[0]), (~vals[1]) & uint64MASK(lensC[1]), (~vals[2]) & uint64MASK(lensC[2]), (~vals[3]) & uint64MASK(lensC[3])); #endif vals[0] = CHECK(MS->theFMer()); vals[1] = MS->thePositionInStream(); vals[2] = 0; vals[3] = 0; setDecodedValues(_countingBuckets, (uint64)_bucketSizes[h] * (uint64)_wCnt, nval, lensC, vals); #ifdef ERROR_CHECK_COUNTING_ENCODING getDecodedValues(_countingBuckets, (uint64)_bucketSizes[h] * (uint64)_wCnt, nval, lensC, vals); if (vals[0] != CHECK(MS->theFMer())) fprintf(stdout, "ERROR_CHECK_COUNTING_ENCODING error: CHCK corrupted! Wanted "uint64HEX" got "uint64HEX"\n", CHECK(MS->theFMer()), vals[0]); if (vals[1] != MS->thePositionInStream()) fprintf(stdout, "ERROR_CHECK_COUNTING_ENCODING error: POSN corrupted! Wanted "uint64HEX" got "uint64HEX"\n", MS->thePositionInStream(), vals[1]); if (vals[2] != 0) fprintf(stdout, "ERROR_CHECK_COUNTING_ENCODING error: UNIQ corrupted.\n"); if (vals[3] != 0) fprintf(stdout, "ERROR_CHECK_COUNTING_ENCODING error: SIZE corrupted.\n"); #endif C->tick(); } delete C; C = 0L; #ifdef ERROR_CHECK_COUNTING for (uint64 i=0; i<_tableSizeInEntries; i++) if (_errbucketSizes[i] != 0) fprintf(stdout, "ERROR_CHECK_COUNTING: Bucket "F_U32" wasn't filled fully? 
"F_U32" left over.\n", i, _errbucketSizes[i]); delete [] _errbucketSizes; _errbucketSizes = 0L; #endif //////////////////////////////////////////////////////////////////////////////// // // 4) Sort each bucket -- count: // 1) number of distinct mers // 2) number of unique mers // 3) number of entries in position table ( sum mercount+1 for all mercounts > 1) // also need to repack the sorted things // if (beVerbose) fprintf(stderr, " Sorting and repacking buckets ("F_U64" buckets).\n", _tableSizeInEntries); C = new speedCounter(" %7.2f Mbuckets -- %5.2f Mbuckets/second\r", 1000000.0, 0x1ffffff, beVerbose); for (uint64 i=0; i<_tableSizeInEntries; i++) { sortAndRepackBucket(i); C->tick(); } delete C; C = 0L; if (beVerbose) fprintf(stderr, " Found %12lu total mers\n" " Found %12lu distinct mers\n" " Found %12lu unique mers\n" " Need "F_U64" non-unique position list entries ("F_U64" maximum count)\n", _numberOfMers, _numberOfDistinct, _numberOfUnique, _numberOfEntries, _maximumEntries); //////////////////////////////////////////////////////////////////////////////// // // Compute the size of the final bucket position entry. It's // either a position into the sequence, or a pointer into a list of // positions. In rare cases, the pointer is larger than the // sequence position, and we need to do extra work. // // The width of position pointers (in buckets) is the max of // _posnWidth (a pointer to the sequence position) and // _pptrWidth (a pointer to an entry in the positions table). // _pptrWidth = logBaseTwo64(_numberOfEntries+1); if (_pptrWidth < _posnWidth) _pptrWidth = _posnWidth; _wFin = _chckWidth + _pptrWidth + 1 + _sizeWidth; lensF[0] = _chckWidth; lensF[1] = _pptrWidth; lensF[2] = 1; lensF[3] = _sizeWidth; //////////////////////////////////////////////////////////////////////////////// // // 5) Allocate: real hash table, buckets and position table. // // XXXX how do we count the number of buckets/positions we never // use because they are masked out?? 
// // If we are just thresholding (ignore things with count > 100) // it's easy, a simple loop over something. // // If we have an exist/only db....are they in the same order? Can // we loop over both at the same time and count that way? That'd // be cool! Mersize is the same, why can the table size be the // same too -- OK, if the existDB has a small number of mers in it, // then we don't need a large table. uint64 hs = _tableSizeInEntries * _hashWidth / 64 + 1; uint64 bs = _numberOfDistinct * _wFin / 64 + 1; uint64 ps = _numberOfEntries * _posnWidth / 64 + 1; if (_hashWidth <= 32) { if (beVerbose) fprintf(stderr, " Reusing bucket counting space for hash table.\n"); #ifdef UNCOMPRESS_HASH_TABLE _hashTable_BP = 0L; _hashTable_FW = (uint32 *)bktAlloc; #else _hashTable_BP = bktAlloc; _hashTable_FW = 0L; #endif bktAllocIsJunk = false; } else { // Can't use the full-width hash table, since the data size is > // 32 bits -- we'd need to allocate 64-bit ints for it, and // that'll likely be too big...and we'd need to have // _hashTable_FW64 or something. if (beVerbose) fprintf(stderr, " Allocated "F_U64"KB for hash table ("F_U64" 64-bit words)\n", hs >> 7, hs); try { _hashTable_BP = new uint64 [hs]; _hashTable_FW = 0L; } catch (std::bad_alloc) { fprintf(stderr, "positionDB()-- caught std::bad_alloc in %s at line %d\n", __FILE__, __LINE__); fprintf(stderr, "positionDB()-- _hashTable_BP = new uint64 ["F_U64"]\n", hs); exit(1); } bktAllocIsJunk = true; } // If we have enough space to reuse the counting space, reuse it. // Else, allocate more space. // // We need to ensure that there are enough bits and that the size // of a bucket didn't increase. If the bucket size did increase, // and we see more unique buckets than total mers (up to some // point) we overwrite data. 
// // Recall that bucketSpace ~= numberOfMers * wCnt // if ((bs < bucketsSpace) && (_wFin <= _wCnt)) { if (beVerbose) fprintf(stderr, " Reusing bucket space; Have: "F_U64" Need: "F_U64" (64-bit words)\n", bucketsSpace, bs); _buckets = _countingBuckets; bs = bucketsSpace; // for output at the end } else { if (beVerbose) fprintf(stderr, " Allocated "F_U64"KB for buckets ("F_U64" 64-bit words)\n", bs >> 7, bs); try { _buckets = new uint64 [bs]; } catch (std::bad_alloc) { fprintf(stderr, "positionDB()-- caught std::bad_alloc in %s at line %d\n", __FILE__, __LINE__); fprintf(stderr, "positionDB()-- _buckets = new uint64 ["F_U64"]\n", bs); exit(1); } } if (beVerbose) fprintf(stderr, " Allocated "F_U64"KB for positions ("F_U64" 64-bit words)\n", ps >> 7, ps); try { _positions = new uint64 [ps]; } catch (std::bad_alloc) { fprintf(stderr, "positionDB()-- caught std::bad_alloc in %s at line %d\n", __FILE__, __LINE__); fprintf(stderr, "positionDB()-- _positions = new uint64 ["F_U64"\n", ps); exit(1); } //////////////////////////////////////////////////////////////////////////////// // // 6) Transfer from the sorted buckets to the hash table. // if (beVerbose) fprintf(stderr, " Transferring to final structure ("F_U64" buckets).\n", _tableSizeInEntries); uint64 bucketStartPosition = 0; // Current positions and bit positions in the buckets and position list. // uint64 currentBbit = uint64ZERO; // Bit position into bucket uint64 currentPbit = uint64ZERO; // Bit position into positions uint64 currentPpos = uint64ZERO; // Value position into positions #ifdef TEST_NASTY_BUGS // Save the position array pointer of each bucket for debugging. // uint64 currentBpos = uint64ZERO; // Value position into bucket uint32 *posPtrCheck = new uint32 [65826038]; #endif // We also take this opportunity to reset some statistics that are // wrong. 
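The transfer loop that follows reads each bucket `b` as the half-open range `[_bucketSizes[b], _bucketSizes[b+1])`. Boundaries of that shape come from turning per-bucket counts into running offsets, the standard counting-sort transform, sketched here with hypothetical names (the array needs `nBuckets + 1` slots so the last boundary marks the end of the final bucket):

```cpp
#include <cassert>
#include <cstdint>

// In place, replace each bucket's count with the offset where that bucket
// starts; slot nBuckets ends up holding the total, i.e. the end boundary.
static void countsToOffsets(uint32_t *sizes, uint32_t nBuckets) {
  uint32_t sum = 0;
  for (uint32_t b = 0; b <= nBuckets; b++) {
    uint32_t c = sizes[b];
    sizes[b] = sum;
    sum += c;
  }
}
```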
// _numberOfMers = 0; _numberOfPositions = 0; _numberOfDistinct = 0; _numberOfUnique = 0; _numberOfEntries = 0; _maximumEntries = 0; C = new speedCounter(" %7.2f Mbuckets -- %5.2f Mbuckets/second\r", 1000000.0, 0x1ffffff, beVerbose); // We need b outside the loop! // uint64 b; for (b=0; b<_tableSizeInEntries; b++) { C->tick(); // Set the start of the bucket -- we took pains to ensure that // we don't overwrite _bucketSizes[b], if we are reusing that // space for the hash table. // if (_hashTable_BP) setDecodedValue(_hashTable_BP, (uint64)b * (uint64)_hashWidth, _hashWidth, bucketStartPosition); else _hashTable_FW[b] = bucketStartPosition; // Get the number of mers in the counting bucket. The error // checking and sizing of _sortedChck and _sortedPosn was already // done in the sort. // uint64 st = _bucketSizes[b]; uint64 ed = _bucketSizes[b+1]; uint32 le = ed - st; // Unpack the check values // for (uint64 i=st, J=st * _wCnt; i maxCount) useMer = false; if ((useMer == true) && (mask || only)) { // MER_REMOVAL_DURING_XFER. Great. The existDB has // (usually) the canonical mer. We have the forward mer. // Well, no, we have the forward mers' hash and check. So, // we reconstruct the mer, reverse complement it, and then // throw the mer out if either the forward or reverse exists // (or doesn't exist). 
uint64 m = REBUILD(b, _sortedChck[stM]); uint64 r; if (mask) { if (mask->isCanonical()) { r = reverseComplementMer(_merSizeInBases, m); if (r < m) m = r; } if (mask->exists(m)) useMer = false; } if (only) { if (only->isCanonical()) { r = reverseComplementMer(_merSizeInBases, m); if (r < m) m = r; } if (only->exists(m) == false) useMer = false; } } if (useMer) { _numberOfMers += edM - stM; _numberOfPositions += edM - stM; _numberOfDistinct++; if (stM+1 == edM) { _numberOfUnique++; #ifdef TEST_NASTY_BUGS posPtrCheck[currentBpos++] = _sortedPosn[stM]; #endif vals[0] = _sortedChck[stM]; vals[1] = _sortedPosn[stM]; vals[2] = 1; vals[3] = 0; currentBbit = setDecodedValues(_buckets, currentBbit, nval, lensF, vals); bucketStartPosition++; } else { _numberOfEntries += edM - stM; if (_maximumEntries < edM - stM) _maximumEntries = edM - stM; #ifdef TEST_NASTY_BUGS posPtrCheck[currentBpos++] = currentPpos; #endif vals[0] = _sortedChck[stM]; vals[1] = currentPpos; vals[2] = 0; vals[3] = 0; currentBbit = setDecodedValues(_buckets, currentBbit, nval, lensF, vals); bucketStartPosition++; // Store the positions. Store the number of positions // here, then store all positions. // // The positions are in the proper place in _sortedPosn, // and setDecodedValue masks out the extra crap, so no // temporary needed. Probably should be done with // setDecodedValues, but then we need another array telling // the sizes of each piece. // setDecodedValue(_positions, currentPbit, _posnWidth, edM - stM); currentPbit += _posnWidth; currentPpos++; for (; stM < edM; stM++) { if (_sortedPosn[stM] >= DEBUGnumPositions) { fprintf(stderr, "positionDB()-- ERROR: Got position "F_U64", but only "F_U64" available!\n", _sortedPosn[stM], DEBUGnumPositions); abort(); } setDecodedValue(_positions, currentPbit, _posnWidth, _sortedPosn[stM]); currentPbit += _posnWidth; currentPpos++; } } } // useMer // All done with this mer. 
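The mask/only check above canonicalizes a mer by building its reverse complement and keeping the smaller of the two encodings. A self-contained sketch of that step, assuming the usual 2-bit packing A=00, C=01, G=10, T=11 (so complementing a base is XOR with 3) with the leftmost base in the high bits — the library's `reverseComplementMer()` may differ in detail:

```cpp
#include <cassert>
#include <cstdint>

// Reverse-complement a 2-bit-packed k-mer of merSize bases.
static uint64_t reverseComplement(uint64_t mer, uint32_t merSize) {
  uint64_t rc = 0;
  for (uint32_t i = 0; i < merSize; i++) {
    rc = (rc << 2) | ((mer & 3) ^ 3);   // complement the low base, push it
    mer >>= 2;                          // consume bases right to left
  }
  return rc;
}

// Canonical form: the numerically smaller of the mer and its RC.
static uint64_t canonical(uint64_t mer, uint32_t merSize) {
  uint64_t rc = reverseComplement(mer, merSize);
  return (rc < mer) ? rc : mer;
}
```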
// stM = edM; } // while (stM < le) } // for each bucket // Set the end of the last bucket // if (_hashTable_BP) setDecodedValue(_hashTable_BP, b * _hashWidth, _hashWidth, bucketStartPosition); else _hashTable_FW[b] = bucketStartPosition; delete C; // Clear out the end of the arrays -- this is only so that we can // checksum the result. // if (_hashTable_BP) { b = b * _hashWidth + _hashWidth; setDecodedValue(_hashTable_BP, b, 64 - (b % 64), uint64ZERO); } setDecodedValue(_buckets, currentBbit, 64 - (currentBbit % 64), uint64ZERO); setDecodedValue(_positions, currentPbit, 64 - (currentPbit % 64), uint64ZERO); if (beVerbose) { fprintf(stderr, " Avail: Bucket %12lu Position %12lu (64-bit words)\n", bs, ps); fprintf(stderr, " Avail: Bucket %12lu Position %12lu (entries)\n", _numberOfDistinct, _numberOfEntries); fprintf(stderr, " Used: Bucket %12lu Position %12lu (64-bit words)\n", currentBbit / 64, currentPbit / 64); } // Reset the sizes to what we actually found. If we then // dump/reload, we shrink our footprint. // _numberOfDistinct = currentBbit / _wFin; _numberOfEntries = currentPbit / _posnWidth; if (beVerbose) { fprintf(stderr, " Used: Bucket %12lu Position %12lu (entries)\n", _numberOfDistinct, _numberOfEntries); fprintf(stderr, " Found %12lu total mers\n" " Found %12lu distinct mers\n" " Found %12lu unique mers\n" " Need "F_U64" non-unique position list entries ("F_U64" maximum count)\n", _numberOfMers, _numberOfDistinct, _numberOfUnique, _numberOfEntries, _maximumEntries); } // If we removed mers, there is a small chance that our hash table // is too big -- we might have removed enough mers to make the // width smaller. If so, rebuild the hash table.
// // Also, hooray, we finally know the number of distinct mers, so we // can make this nice and tight // if (_hashTable_BP) { uint32 newHashWidth = 1; while ((_numberOfDistinct+1) > (uint64ONE << newHashWidth)) newHashWidth++; if (newHashWidth != _hashWidth) { uint64 npos = 0; uint64 opos = 0; if (beVerbose) fprintf(stderr, " Rebuilding the hash table, from "F_U32" bits wide to "F_U32" bits wide.\n", _hashWidth, newHashWidth); for (uint64 z=0; z<_tableSizeInEntries+1; z++) { setDecodedValue(_hashTable_BP, npos, newHashWidth, getDecodedValue(_hashTable_BP, opos, _hashWidth)); npos += newHashWidth; opos += _hashWidth; } // Clear the end again. setDecodedValue(_hashTable_BP, npos, 64 - (npos % 64), uint64ZERO); } _hashWidth = newHashWidth; } // If supplied, add in any counts. The meryl table is, sadly, in // the wrong order, and we must hash and search. // // Meryl _should_ be storing only forward mers, but we have no way // of checking. // // After all counts are loaded, check if we can compress the counts // space any. Check if the largestMerylCount is much smaller than // the space it is stored in. If so, we can compress the table. 
// uint64 largestMerylCount = 0; uint64 countsLoaded = 0; if (counts) { if (beVerbose) fprintf(stderr, " Loading "F_U64" mercounts.\n", counts->numberOfDistinctMers()); C = new speedCounter(" %7.2f Mmercounts -- %5.2f Mmercounts/second\r", 1000000.0, 0x1fffff, beVerbose); while (counts->nextMer()) { kMer k = counts->theFMer(); uint64 c = counts->theCount(); uint64 f = setCount(k, c); k.reverseComplement(); uint64 r = setCount(k, c); if (f + r > 0) { countsLoaded++; if (largestMerylCount < c) largestMerylCount = c; } C->tick(); } delete C; if (beVerbose) fprintf(stderr, " Loaded "F_U64" mercounts; largest is "F_U64".\n", countsLoaded, largestMerylCount); if (logBaseTwo64(largestMerylCount + 1) < _sizeWidth) { if (beVerbose) fprintf(stderr, " Compress sizes from "F_U32" bits to "F_U32" bits.\n", _sizeWidth, (uint32)logBaseTwo64(largestMerylCount + 1)); uint64 oSiz[4] = { _chckWidth, _pptrWidth, 1, _sizeWidth }; uint64 nSiz[4] = { _chckWidth, _pptrWidth, 1, logBaseTwo64(largestMerylCount + 1) }; uint64 tVal[4] = { 0, 0, 0, 0 }; uint64 oP = 0, oS = oSiz[0] + oSiz[1] + oSiz[2] + oSiz[3]; uint64 nP = 0, nS = nSiz[0] + nSiz[1] + nSiz[2] + nSiz[3]; assert(nS < oS); C = new speedCounter(" %7.2f Mmercounts -- %5.2f Mmercounts/second\r", 1000000.0, 0x1fffff, beVerbose); for (uint64 bu=0; bu<_numberOfDistinct; bu++) { getDecodedValues(_buckets, oP, 4, oSiz, tVal); setDecodedValues(_buckets, nP, 4, nSiz, tVal); oP += oS; nP += nS; C->tick(); } delete C; _sizeWidth = nSiz[3]; _wFin = _chckWidth + _pptrWidth + 1 + _sizeWidth; } } #ifdef TEST_NASTY_BUGS // Unpack the bucket positions and check. Report the first one // that is broken. 
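The count-compression pass above (and the hash-width rebuild before it) repacks a packed table to narrower fields in the same buffer, front to back. That is safe because with a strictly smaller field width the write cursor can never overtake the read cursor. An illustration of the same idea with byte-width fields instead of bit fields (not the library's code, just the invariant):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Shrink n 16-bit slots to 8-bit slots in place. Reading slot i at the
// old (wide) position always happens at or ahead of writing it at the
// new (narrow) position, so no unread data is clobbered.
static void narrowInPlace(uint8_t *buf, uint64_t n) {
  for (uint64_t i = 0; i < n; i++) {
    uint16_t v;
    memcpy(&v, buf + 2 * i, 2);   // read at the old position
    buf[i] = (uint8_t)v;          // write at the new position
  }
}
```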
// for(uint64 bb=0; bbrewind(); if (mask) { C = new speedCounter(" %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, beVerbose); uint32 extraMer = 0; while (MS->nextMer(_merSkipInBases)) { uint64 mer = MS->theFMer(); if (mask->exists(mer) && exists(mer)) extraMer++; C->tick(); } delete C; fprintf(stderr, "positionDB()-- mask: "F_U32" mers extra!\n", extraMer); } else if (only) { C = new speedCounter(" %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, beVerbose); uint32 missingMer = 0; while (MS->nextMer(_merSkipInBases)) { uint64 mer = MS->theFMer(); if (only->exists(mer) && !exists(mer)) missingMer++; C->tick(); } delete C; fprintf(stderr, "positionDB()-- only: "F_U32" mers missing!\n", missingMer); } #endif // Free the counting buckets if we aren't using the space for // something else. // if (_buckets != _countingBuckets) delete [] _countingBuckets; // In theory, we could move these to be immediately after the data // is useless. // _bucketSizes = 0L; _countingBuckets = 0L; delete [] _sortedChck; delete [] _sortedPosn; _sortedMax = 0; _sortedChck = 0L; _sortedPosn = 0L; if (bktAllocIsJunk) delete [] bktAlloc; } positionDB::~positionDB() { delete [] _hashTable_BP; delete [] _hashTable_FW; delete [] _buckets; delete [] _positions; delete [] _hashedErrors; } canu-1.6/src/meryl/libkmer/positionDB.H000066400000000000000000000200061314437614700200450ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * Modifications by: * * Brian P. Walenz from 2003-JAN-02 to 2003-OCT-21 * are Copyright 2003 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-APR-21 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-SEP-13 to 2014-APR-11 * are Copyright 2005-2008,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef POSITIONDB_H #define POSITIONDB_H #include "AS_global.H" #include "merStream.H" // The two existDB inputs can be either forward or canonical. If // canonical, we are smart enough to search exist/only with the // canonical mer. // Returns position in posn, resizing it if needed. Space is // allocated if none supplied. The following is valid: // // uint64 *posn = 0L; // uint64 posnMax = 0; // uint64 posnLen = 0; // if (get(somemer, posn, posnMax, posnLen)) { // do something with the positions // } // // exists() returns T/F if mer exists or not // count() returns the number of times that mer is present // Define this to use an uncompressed hash table when the width is 32 // bits or less. Doing so is A LOT faster in mismatch lookups, but // does use more memory. #undef UNCOMPRESS_HASH_TABLE // Define this to leave out references to getTime(), speedCounter() // and make the positionDB build very quietly. 
#undef SILENTPOSITIONDB // Define these to enable some debugging methods #undef DEBUGPOSDB #undef DEBUGREBUILD class existDB; class merylStreamReader; class positionDB { public: positionDB(char const *filename, uint32 merSize, uint32 merSkip, uint32 maxMismatch, bool loadData=true); positionDB(merStream *MS, uint32 merSize, uint32 merSkip, existDB *mask, existDB *only, merylStreamReader *counts, uint32 minCount, uint32 maxCount, uint32 maxMismatch, uint32 maxMemory, bool beVerbose); ~positionDB(); private: void build(merStream *MS, existDB *mask, existDB *only, merylStreamReader *counts, uint32 minCount, uint32 maxCount, bool beVerbose); private: void reallocateSpace(uint64*& posn, uint64& posnMax, uint64& posnLen, uint64 len); void loadPositions(uint64 v, uint64*& posn, uint64& posnMax, uint64& posnLen, uint64& count); public: bool getExact(uint64 mer, uint64*& posn, uint64& posnMax, uint64& posnLen, uint64& count); bool existsExact(uint64 mer); uint64 countExact(uint64 mer); public: void filter(uint64 lo, uint64 hi); private: double setUpMismatchMatcher(uint32 nErrorsAllowed, uint64 approxMers); public: bool getUpToNMismatches(uint64 mer, uint32 maxMismatches, uint64*& posn, uint64& posnMax, uint64& posnLen); private: uint64 setCount(uint64 mer, uint64 count); // Save or load a built table // public: void saveState(char const *filename); bool loadState(char const *filename, bool beNoisy=false, bool loadData=true); void printState(FILE *stream); // Only really useful for debugging. Don't use. 
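The HASH/CHECK/REBUILD members declared below save memory by storing only the CHECK bits of each mer in a bucket; the bucket index (the HASH) supplies the rest. The library's HASH xor-folds the mer, which makes REBUILD the iterative headache it admits to being. The simplest invertible variant — hash is just the high bits (plain quotienting) — shows why bucket plus check reconstruct the key exactly; the names here are illustrative, not the library's:

```cpp
#include <cassert>
#include <cstdint>

struct Split {
  uint64_t hash;    // bucket index: the high bits
  uint64_t check;   // stored per entry: the low bits
};

static Split split(uint64_t mer, uint32_t checkBits) {
  return { mer >> checkBits, mer & ((1ULL << checkBits) - 1) };
}

// Rebuilding is exact: no information was lost, only partitioned.
static uint64_t rebuild(Split s, uint32_t checkBits) {
  return (s.hash << checkBits) | s.check;
}
```

With xor-folded hashes the partition is no longer a simple split, but the same invariant holds: hash and check together determine the mer, which is what `checkREBUILD()` verifies.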
// void dump(char *name); bool checkREBUILD(uint64 m) { #ifdef DEBUGREBUILD uint64 h = HASH(m); uint64 c = CHECK(m); uint64 r = REBUILD(h, c); if (r != m) { fprintf(stderr, "shift1 = "F_U32"\n", _shift1); fprintf(stderr, "shift2 = "F_U32"\n", _shift2); fprintf(stderr, "M = "F_X64"\n", m); fprintf(stderr, "H = "F_X64"\n", h); fprintf(stderr, "C = "F_X64"\n", c); fprintf(stderr, "R = "F_X64"\n", r); return(false); } return(true); #else return(REBUILD(HASH(m), CHECK(m)) == m); #endif }; private: uint64 HASH(uint64 k) { return(((k >> _shift1) ^ (k >> _shift2) ^ k) & _mask1); }; uint64 CHECK(uint64 k) { return(k & _mask2); }; uint64 REBUILD(uint64 h, uint64 c) { // Decode a HASH and a CHECK to get back the mer. You'd better // bloody PRAY you don't break this (test/test-rebuild.C). It // was a headache++ to write. uint64 sha = _shift1 - _shift2; uint64 msk = uint64MASK(sha); // The check is exactly the mer....just not all there. uint64 mer = c; uint64 shf = sha - (_tableSizeInBits % 2); uint64 shg = 0; uint64 shh = _shift1; // Unrolling this is troublesome - we still need the tests, // bizarre merSize, tblSize combinations use lots of iterations // (when the merSize and tblSize are about the same, the CHECK is // small, and so we need to do lots of iterations).
//fprintf(stderr, "shf="F_U64W(2)" shg="F_U64W(2)" shh="F_U64W(2)" mer="F_X64"\n", shf, shg, shh, mer); do { mer |= (((h >> shg) ^ (mer >> shg) ^ (mer >> shf)) & msk) << shh; //fprintf(stderr, "shf="F_U64W(2)" shg="F_U64W(2)" shh="F_U64W(2)" mer="F_X64"\n", shf, shg, shh, mer); shf += sha; shg += sha; shh += sha; } while ((shf < _merSizeInBits) && (shh < 64)); mer &= uint64MASK(_merSizeInBits); return(mer); }; void sortAndRepackBucket(uint64 b); uint32 *_bucketSizes; uint64 *_countingBuckets; uint64 *_hashTable_BP; // Bit packed uint32 *_hashTable_FW; // Full width uint64 *_buckets; uint64 *_positions; uint32 _merSizeInBases; uint32 _merSizeInBits; uint32 _merSkipInBases; uint64 _tableSizeInEntries; uint32 _tableSizeInBits; uint32 _hashWidth; // Hash bits uint32 _chckWidth; // Check bits uint32 _posnWidth; // Positions in the sequence uint32 _pptrWidth; // Pointers to positions uint32 _sizeWidth; // Extra number in the table uint64 _hashMask; uint32 _wCnt; uint32 _wFin; uint32 _shift1; uint32 _shift2; uint64 _mask1; uint64 _mask2; uint64 _numberOfMers; uint64 _numberOfPositions; uint64 _numberOfDistinct; uint64 _numberOfUnique; uint64 _numberOfEntries; uint64 _maximumEntries; // For sorting the mers // uint32 _sortedMax; uint64 *_sortedChck; uint64 *_sortedPosn; // For the mismatch matcher uint32 _nErrorsAllowed; uint32 _hashedErrorsLen; uint32 _hashedErrorsMax; uint64 *_hashedErrors; }; #endif // POSITIONDB_H canu-1.6/src/meryl/libleaff.mk000066400000000000000000000017171314437614700163420ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory.
ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := libleaff.a SOURCES := libleaff/fastaFile.C \ libleaff/fastaStdin.C \ libleaff/fastqFile.C \ libleaff/fastqStdin.C \ libleaff/gkStoreFile.C \ libleaff/merStream.C \ libleaff/seqCache.C \ libleaff/seqFactory.C \ libleaff/seqStore.C \ libleaff/seqStream.C \ libleaff/sffFile.C # libleaff/selftest.C # libleaff/test-merStream.C # libleaff/test-seqCache.C # libleaff/test-seqStream.C SRC_INCDIRS := .. ../AS_UTL ../stores TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := TGT_PREREQS := SUBMAKEFILES := canu-1.6/src/meryl/libleaff/000077500000000000000000000000001314437614700160035ustar00rootroot00000000000000canu-1.6/src/meryl/libleaff/fastaFile.C000066400000000000000000000372731314437614700200210ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-MAY-19 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "fastaFile.H" #include "dnaAlphabets.H" // Says 'kmerFastaFileIdx' #define FASTA_MAGICNUMBER1 0x7473614672656d6bULL #define FASTA_MAGICNUMBER2 0x786449656c694661ULL fastaFile::fastaFile(const char *filename) { clear(); #ifdef DEBUG fprintf(stderr, "fastaFile::fastaFile()-- '%s'\n", (filename) ? filename : "NULLPOINTER"); #endif strcpy(_filename, filename); constructIndex(); _rb = new readBuffer(_filename); _numberOfSequences = _header._numberOfSequences; } fastaFile::fastaFile() { clear(); } fastaFile::~fastaFile() { delete _rb; delete [] _index; delete [] _names; } seqFile * fastaFile::openFile(const char *filename) { struct stat st; #ifdef DEBUG fprintf(stderr, "fastaFile::openFile()-- '%s'\n", (filename) ? filename : "NULLPOINTER"); #endif if (((filename == 0L) && (isatty(fileno(stdin)) == 0)) || ((filename != 0L) && (filename[0] == '-') && (filename[1] == 0))) return(0L); errno = 0; stat(filename, &st); if (errno) return(0L); if ((st.st_mode & S_IFREG) == 0) return(0L); // Otherwise, open and see if we can get the first sequence. We // assume it's fasta if we find a '>' denoting a defline the first // thing in the file. // // Use of a readBuffer here is a bit heavyweight, but it's safe and // easy. Opening a fastaFile isn't, after all, lightweight anyway. // fastaFile *f = 0L; readBuffer *r = new readBuffer(filename); char x = r->read(); while ((r->eof() == false) && (alphabet.isWhitespace(x) == true)) x = r->read(); // If we get a fasta record separator assume it's a fasta file. If // it's eof, the file is empty, and we might as well return this // fasta file and let the client deal with the lack of sequence. // if ((x == '>') || (r->eof() == true)) f = new fastaFile(filename); delete r; return(f); } uint32 fastaFile::find(const char *sequencename) { char *ptr = _names; // If this proves far too slow, rewrite the _names string to // separate IDs with 0xff, then use strstr on the whole thing. 
// To find the ID, scan down the string counting the number of 0xff's. // // Similar code is used for seqStore::find() for (uint32 iid=0; iid < _header._numberOfSequences; iid++) { //fprintf(stderr, "fastaFile::find()-- '%s' vs '%s'\n", sequencename, ptr); if (strcmp(sequencename, ptr) == 0) return(iid); while (*ptr) ptr++; ptr++; } return(~uint32ZERO); } uint32 fastaFile::getSequenceLength(uint32 iid) { #ifdef DEBUG fprintf(stderr, "fastaFile::getSequenceLength()-- " F_U32 "\n", iid); #endif return((iid < _numberOfSequences) ? _index[iid]._seqLength : 0); } bool fastaFile::getSequence(uint32 iid, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax) { #ifdef DEBUG fprintf(stderr, "fastaFile::getSequence(full)-- " F_U32 "\n", iid); #endif // Assume there is no index. Without being horribly complicated // (as in the previous versions of this codebase) all we'd get from // having an index around is the length of the sequence. // // Previous versions used to use the index to tell if the sequence // was squeezed (and so a direct copy to the output), if it was // fixed width (mostly direct copies) or unknown. Now we just // assume it's unknown and go byte by byte. If speed is a concern, // use the seqFile instead. if (iid >= _header._numberOfSequences) { fprintf(stderr, "fastaFile::getSequence(full)-- iid " F_U32 " more than number of sequences " F_U32 "\n", iid, _header._numberOfSequences); return(false); } if (sMax == 0) { sMax = 2048; s = new char [sMax]; } if (hMax == 0) { hMax = 2048; h = new char [hMax]; } if ((_index) && (sMax < _index[iid]._seqLength)) { sMax = _index[iid]._seqLength; delete [] s; s = new char [sMax]; } hLen = 0; sLen = 0; #ifdef DEBUG fprintf(stderr, "fastaFile::getSequence(full)-- seek to iid=" F_U32 " at pos=" F_U32 "\n", iid, _index[iid]._seqPosition); #endif _rb->seek(_index[iid]._seqPosition); char x = _rb->read(); // Skip whitespace at the start of the sequence.
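The `find()` scan above walks `_names`, a buffer of sequence names stored back to back, each terminated by a NUL, and returns the index of the match. A self-contained sketch with hypothetical names:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Linear lookup in a packed, NUL-separated name table of n entries.
// Returns the entry index, or UINT32_MAX (the source's ~uint32ZERO) if absent.
static uint32_t findName(const char *names, uint32_t n, const char *want) {
  const char *p = names;
  for (uint32_t iid = 0; iid < n; iid++) {
    if (strcmp(want, p) == 0)
      return iid;
    p += strlen(p) + 1;   // skip to the byte after this name's NUL
  }
  return UINT32_MAX;
}
```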
while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true)) x = _rb->read(); // We should be at a '>' character now. Fail if not. if (_rb->eof()) return(false); if (x != '>') fprintf(stderr, "fastaFile::getSequence(full)-- ERROR1: In %s, expected '>' at beginning of defline, got '%c' instead.\n", _filename, x), exit(1); // Skip the '>' in the defline x = _rb->read(); // Skip whitespace between the '>' and the defline while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true) && (x != '\r') && (x != '\n')) x = _rb->read(); // Copy the defline, until the first newline. while ((_rb->eof() == false) && (x != '\r') && (x != '\n')) { h[hLen++] = x; if (hLen >= hMax) { hMax += 2048; char *H = new char [hMax]; memcpy(H, h, hLen); delete [] h; h = H; } x = _rb->read(); } h[hLen] = 0; // Skip whitespace between the defline and the sequence. while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true)) x = _rb->read(); // Copy the sequence, until EOF or the next '>'. while ((_rb->eof() == false) && (_rb->peek() != '>')) { if (alphabet.isWhitespace(x) == false) { s[sLen++] = x; if (sLen >= sMax) { if (sMax == 4294967295) // 4G - 1 fprintf(stderr, "fastaFile::getSequence()-- ERROR: sequence is too long; must be less than 4 Gbp.\n"), exit(1); if (sMax >= 2147483648) // 2G sMax = 4294967295; else sMax *= 2; char *S = new char [sMax]; memcpy(S, s, sLen); delete [] s; s = S; } } x = _rb->read(); } s[sLen] = 0; _nextID++; return(true); } // slow bool fastaFile::getSequence(uint32 iid, uint32 bgn, uint32 end, char *s) { if (iid >= _header._numberOfSequences) { fprintf(stderr, "fastaFile::getSequence(part)-- iid " F_U32 " more than number of sequences " F_U32 "\n", iid, _header._numberOfSequences); return(false); } #ifdef DEBUG fprintf(stderr, "fastaFile::getSequence(part)-- " F_U32 "\n", iid); #endif // It is impossible to be efficient here; see the big comment in // the other getSequence() above. 
// // We can't even guess where to start scanning the sequence; we // just don't have any information about how much whitespace is in // the sequence. _rb->seek(_index[iid]._seqPosition); uint32 pos = 0; char x = _rb->read(); // Skip whitespace at the start of the sequence. while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true)) x = _rb->read(); // We should be at a '>' character now. Fail if not. if (_rb->eof()) return(false); if (x != '>') fprintf(stderr, "fastaFile::getSequence(part)-- ERROR2: In %s, expected '>' at beginning of defline, got '%c' instead.\n", _filename, x), exit(1); // Skip the defline. while ((_rb->eof() == false) && (x != '\r') && (x != '\n')) x = _rb->read(); // Skip whitespace between the defline and the sequence. while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true)) x = _rb->read(); // Skip sequence up until bgn. while ((_rb->eof() == false) && (pos < bgn)) { if (alphabet.isWhitespace(x) == false) pos++; x = _rb->read(); } // Copy sequence while ((_rb->eof() == false) && (pos < end)) { if (alphabet.isWhitespace(x) == false) s[pos++ - bgn] = x; x = _rb->read(); } s[pos - bgn] = 0; // Fail if we didn't copy enough stuff. return((pos == end) ? 
true : false); } void fastaFile::clear(void) { memset(_filename, 0, FILENAME_MAX); memset(_typename, 0, FILENAME_MAX); strcpy(_typename, "FastA"); _randomAccessSupported = true; _numberOfSequences = 0; _rb = 0L; memset(&_header, 0, sizeof(fastaFileHeader)); _index = 0L; _names = 0L; _nextID = 0; } void fastaFile::loadIndex(char *indexname) { struct stat fastastat; if (AS_UTL_fileExists(indexname) == false) return; errno = 0; if (stat(_filename, &fastastat)) { fprintf(stderr, "fastaFile::constructIndex()-- stat of file '%s' failed: %s\n", _filename, strerror(errno)); return; } FILE *I = fopen(indexname, "r"); if (errno) { fprintf(stderr, "fastaFile::constructIndex()-- open of file '%s' failed: %s\n", indexname, strerror(errno)); return; } fread(&_header, sizeof(fastaFileHeader), 1, I); if ((_header._magic[0] != FASTA_MAGICNUMBER1) && (_header._magic[1] != FASTA_MAGICNUMBER2)) { fprintf(stderr, "fastaFile::constructIndex()-- magic mismatch.\n"); fclose(I); return; } #if 0 (_header._fastaModificationTime != (uint64)fastastat.st_mtime) (_header._fastaCreationTime != (uint64)fastastat.st_ctime) #endif if (_header._fastaFileSize != (uint64)fastastat.st_size) { fprintf(stderr, "fastaFile::constructIndex()-- stat mismatch.\n"); fclose(I); return; } _index = new fastaFileIndex [_header._numberOfSequences]; _names = new char [_header._namesLength]; fread(_index, sizeof(fastaFileIndex), _header._numberOfSequences, I); fread(_names, sizeof(char), _header._namesLength, I); #ifdef DEBUG fprintf(stderr, "fastaFile::constructIndex()-- '%s' LOADED\n", _filename); #endif fclose(I); return; } void fastaFile::constructIndex(void) { if (_index) return; // If the filename ends in '.fasta' then append a 'idx', // otherwise, append '.fastaidx'. 
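The suffix rule just described — a name ending in `.fasta` gets `idx` appended (yielding `.fastaidx`), anything else gets the full `.fastaidx` suffix — can be sketched as below. The helper name is hypothetical; the source writes into a `FILENAME_MAX` buffer with `strcat` where this sketch uses `snprintf`:

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Derive the index filename from the fasta filename.
static void indexName(const char *filename, char *out, size_t outLen) {
  size_t l = strlen(filename);
  if ((l >= 6) && (strcmp(filename + l - 6, ".fasta") == 0))
    snprintf(out, outLen, "%sidx", filename);        // x.fasta -> x.fastaidx
  else
    snprintf(out, outLen, "%s.fastaidx", filename);  // x.fa -> x.fa.fastaidx
}
```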
char indexname[FILENAME_MAX]; strcpy(indexname, _filename); uint32 l = strlen(_filename); if ((l > 5) && (strcmp(_filename + l - 6, ".fasta") == 0)) strcat(indexname, "idx"); else strcat(indexname, ".fastaidx"); // If the index exists, suck it in and return. loadIndex(indexname); if (_index) return; #ifdef DEBUG fprintf(stderr, "fastaFile::constructIndex()-- '%s' BUILDING\n", _filename); #endif // Allocate some space for the index structures. uint32 indexMax = 64 * 1024 * 1024 / sizeof(fastaFileIndex); uint32 indexLen = 0; _index = new fastaFileIndex [indexMax]; uint32 namesMax = 32 * 1024 * 1024; uint32 namesLen = 0; _names = new char [namesMax]; // Some local storage uint64 seqStart; uint32 seqLen; uint32 seqLenMax = ~uint32ZERO; uint32 namePos; readBuffer ib(_filename); char x = ib.read(); #ifdef DEBUGINDEX fprintf(stderr, "readBuffer '%s' eof=%d x=%c %d\n", _filename, ib.eof(), x, x); #endif // Build it. // Skip whitespace at the start of the sequence. while ((ib.eof() == false) && (alphabet.isWhitespace(x) == true)) { #ifdef DEBUGINDEX fprintf(stderr, "skip '%c' %d\n", x, x); #endif x = ib.read(); } while (ib.eof() == false) { #ifdef DEBUGINDEX fprintf(stderr, "index\n"); #endif // We should be at a '>' character now. Fail if not. if (x != '>') fprintf(stderr, "fastaFile::constructIndex()-- ERROR3: In %s, expected '>' at beginning of defline, got '%c' instead.\n", _filename, x), exit(1); // Save info - ib's position is correctly at the first letter in // the defline (which might be whitespace), but the reader // expects our position to be at the '>' -- hence the -1. 
seqStart = ib.tell() - 1; seqLen = 0; namePos = namesLen; // Read that first letter x = ib.read(); // Copy the name to the names while ((ib.eof() == false) && (alphabet.isWhitespace(x) == false)) { if (namesLen + 1 >= namesMax) { namesMax += 32 * 1024 * 1024; char *nt = new char [namesMax]; memcpy(nt, _names, namesLen); delete [] _names; _names = nt; } _names[namesLen++] = x; #ifdef DEBUGINDEX fprintf(stderr, "name += %c\n", x); #endif x = ib.read(); } if (namesLen + 1 >= namesMax) { namesMax += 32 * 1024 * 1024; char *nt = new char [namesMax]; memcpy(nt, _names, namesLen); delete [] _names; _names = nt; } _names[namesLen++] = 0; // Skip the rest of the defline while ((ib.eof() == false) && (x != '\r') && (x != '\n')) { #ifdef DEBUGINDEX fprintf(stderr, "skip let %c\n", x); #endif x = ib.read(); } // Skip whitespace between the defline and the sequence. while ((ib.eof() == false) && (alphabet.isWhitespace(x) == true)) { #ifdef DEBUGINDEX fprintf(stderr, "skip num %d\n", x); #endif x = ib.read(); } #ifdef DEBUGINDEX fprintf(stderr, "x=%c peek=%c\n", x, ib.peek()); #endif // Count sequence length while ((ib.eof() == false) && (ib.peek() != '>')) { #ifdef DEBUGINDEX fprintf(stderr, "seqlen %s %c\n", (alphabet.isWhitespace(x) == false) ? "save" : "skip", x); #endif if (alphabet.isWhitespace(x) == false) seqLen++; if (seqLen >= seqLenMax) fprintf(stderr, "fastaFile::constructIndex()-- ERROR: In %s, sequence '%s' is too long. Maximum length is %u bases.\n", _filename, _names + namePos, seqLenMax), exit(1); x = ib.read(); } // Save to the index. if (indexLen >= indexMax) { indexMax *= 2; fastaFileIndex *et = new fastaFileIndex[indexMax]; memcpy(et, _index, sizeof(fastaFileIndex) * indexLen); delete [] _index; _index = et; } _index[indexLen]._seqPosition = seqStart; _index[indexLen]._seqLength = seqLen; #ifdef DEBUG fprintf(stderr, "INDEX iid=" F_U32 " len=" F_U32 " pos=" F_U64 "\n", indexLen, seqLen, seqStart); #endif indexLen++; // Load the '>' for the next iteration. 
x = ib.read(); } // Fill out the index meta data struct stat fastastat; errno = 0; if (stat(_filename, &fastastat)) fprintf(stderr, "fastaFile::constructIndex()-- stat() of file '%s' failed: %s\n", _filename, strerror(errno)), exit(1); _header._magic[0] = FASTA_MAGICNUMBER1; _header._magic[1] = FASTA_MAGICNUMBER2; _header._numberOfSequences = indexLen; _header._namesLength = namesLen; _header._fastaFileSize = fastastat.st_size; _header._fastaModificationTime = fastastat.st_mtime; _header._fastaCreationTime = fastastat.st_ctime; // Dump the index, if possible. errno = 0; FILE *I = fopen(indexname, "w"); if (errno) return; fwrite(&_header, sizeof(fastaFileHeader), 1, I); fwrite( _index, sizeof(fastaFileIndex), _header._numberOfSequences, I); fwrite( _names, sizeof(char), _header._namesLength, I); fclose(I); } canu-1.6/src/meryl/libleaff/fastaFile.H000066400000000000000000000051531314437614700200160ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #ifndef FASTAFILE_H #define FASTAFILE_H #include "seqFile.H" struct fastaFileHeader { uint64 _magic[2]; uint32 _numberOfSequences; // Number of sequences in the file uint32 _namesLength; // Bytes in the names uint64 _fastaFileSize; // st_size - size of file in bytes uint64 _fastaModificationTime; // st_mtime - time of last data modification uint64 _fastaCreationTime; // st_ctime - time of last file status change }; struct fastaFileIndex { uint64 _seqPosition; // Position of the sequence in the file uint32 _seqLength; // Length of the sequence (no whitespace counted) }; class fastaFile : public seqFile { protected: fastaFile(const char *filename); fastaFile(); public: ~fastaFile(); protected: seqFile *openFile(const char *filename); public: uint32 find(const char *sequencename); uint32 getSequenceLength(uint32 iid); bool getSequence(uint32 iid, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax); bool getSequence(uint32 iid, uint32 bgn, uint32 end, char *s); private: void clear(void); void loadIndex(char *indexname); void constructIndex(void); readBuffer *_rb; fastaFileHeader _header; fastaFileIndex *_index; char *_names; uint32 _nextID; // Next sequence in the read buffer uint32 _gs_iid; uint32 _gs_pos; friend class seqFactory; }; #endif // FASTAFILE_H canu-1.6/src/meryl/libleaff/fastaStdin.C000066400000000000000000000155441314437614700202200ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-MAY-19 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "fastaStdin.H" #include "dnaAlphabets.H" fastaStdin::fastaStdin(const char *filename) { clear(); #ifdef DEBUG fprintf(stderr, "fastaStdin::fastaStdin()-- '%s'\n", (filename) ? filename : "NULLPOINTER"); #endif if (filename == 0L) { strcpy(_filename, "(stdin)"); _rb = new readBuffer("-"); } else { _pipe = popen(filename, "r"); _rb = new readBuffer(_pipe); } } fastaStdin::fastaStdin() { clear(); } fastaStdin::~fastaStdin() { delete _rb; delete [] _header; delete [] _sequence; } seqFile * fastaStdin::openFile(const char *filename) { #ifdef DEBUG fprintf(stderr, "fastaStdin::openFile()-- '%s'\n", (filename) ? filename : "NULLPOINTER"); #endif if (((filename == 0L) && (isatty(fileno(stdin)) == 0)) || ((filename != 0L) && (filename[0] == '-') && (filename[1] == 0))) return(new fastaStdin(0L)); if (filename == 0L) return(0L); // The stdin variants also handle compressed inputs (because we can't seek in these). 
uint32 fl = strlen(filename); char cmd[32 + fl]; if ((filename[fl-3] == '.') && (filename[fl-2] == 'g') && (filename[fl-1] == 'z')) sprintf(cmd, "gzip -dc %s", filename); else if ((filename[fl-4] == '.') && (filename[fl-3] == 'b') && (filename[fl-2] == 'z') && (filename[fl-1] == '2')) sprintf(cmd, "bzip2 -dc %s", filename); else if ((filename[fl-3] == '.') && (filename[fl-2] == 'x') && (filename[fl-1] == 'z')) sprintf(cmd, "xz -dc %s", filename); else return(0L); return(new fastaStdin(cmd)); } uint32 fastaStdin::getNumberOfSequences(void) { if (_rb->peek() == 0) return(_nextIID); else return(_nextIID + 1); } uint32 fastaStdin::find(const char *sequencename) { fprintf(stderr, "fastaStdin::find()-- ERROR! Used for random access on sequence '%s'.\n", sequencename); assert(0); return(~uint32ZERO); } uint32 fastaStdin::getSequenceLength(uint32 iid) { if (iid == _nextIID) if (loadNextSequence(_header, _headerLen, _headerMax, _sequence, _sequenceLen, _sequenceMax) == false) return(0); if (iid + 1 != _nextIID) { fprintf(stderr, "fastaStdin::getSequenceLength()-- ERROR! Used for random access. Requested iid=%u, at iid=%u\n", iid, _nextIID); assert(0); } return(strlen(_sequence)); } bool fastaStdin::getSequence(uint32 iid, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax) { bool ret = true; #ifdef DEBUG fprintf(stderr, "fastaStdin::getSequence(full)-- " F_U32 "\n", iid); #endif if (iid == _nextIID) if (loadNextSequence(_header, _headerLen, _headerMax, _sequence, _sequenceLen, _sequenceMax) == false) return(false); if (iid + 1 != _nextIID) { fprintf(stderr, "fastaStdin::getSequence(full)-- ERROR! Used for random access. 
Requested iid=%u, at iid=%u\n", iid, _nextIID); assert(0); } if (hLen < _headerMax) { delete [] h; hMax = _headerMax; h = new char [hMax]; } if (sLen < _sequenceMax) { delete [] s; sMax = _sequenceMax; s = new char [sMax]; } memcpy(h, _header, _headerLen + 1); hLen = _headerLen; memcpy(s, _sequence, _sequenceLen + 1); sLen = _sequenceLen; return(true); } bool fastaStdin::getSequence(uint32 iid, uint32 bgn, uint32 end, char *UNUSED(s)) { fprintf(stderr, "fastaStdin::getSequence(part)-- ERROR! Used for random access on sequence %u bgn %u end %u.\n", iid, bgn, end); assert(0); return(false); } void fastaStdin::clear(void) { memset(_filename, 0, FILENAME_MAX); memset(_typename, 0, FILENAME_MAX); _randomAccessSupported = false; strcpy(_typename, "FastAstream"); _numberOfSequences = ~uint32ZERO; _rb = 0L; _nextIID = 0; _pipe = 0L; _header = 0L; _headerLen = 0; _headerMax = 0; _sequence = 0L; _sequenceLen = 0; _sequenceMax = 0; } bool fastaStdin::loadNextSequence(char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax) { if (hMax == 0) { hMax = 2048; h = new char [hMax]; } if (sMax == 0) { sMax = 2048; s = new char [sMax]; } hLen = 0; sLen = 0; char x = _rb->read(); // Skip whitespace at the start of the sequence. while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true)) x = _rb->read(); // We should be at a '>' character now. Fail if not. if (_rb->eof() == true) return(false); if (x != '>') fprintf(stderr, "fastaStdin::loadNextSequence(part)-- ERROR: In %s, expected '>' at beginning of defline, got '%c' instead.\n", _filename, x), exit(1); // Skip the '>' in the defline x = _rb->read(); // Skip whitespace between the '>' and the defline while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true) && (x != '\r') && (x != '\n')) x = _rb->read(); // Copy the defline, until the first newline. 
while ((_rb->eof() == false) && (x != '\r') && (x != '\n')) { h[hLen++] = x; if (hLen >= hMax) { //fprintf(stderr, "realloc header\n"); hMax += 2048; char *H = new char [hMax]; memcpy(H, h, hLen); delete [] h; h = H; } x = _rb->read(); } h[hLen] = 0; // Skip whitespace between the defline and the sequence. while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true)) x = _rb->read(); // Copy the sequence, until EOF or the next '>'. while ((_rb->eof() == false) && (_rb->peek() != '>')) { if (alphabet.isWhitespace(x) == false) { s[sLen++] = x; if (sLen >= sMax) { //fprintf(stderr, "realloc sequence\n"); sMax *= 2; char *S = new char [sMax]; memcpy(S, s, sLen); delete [] s; s = S; } } x = _rb->read(); } s[sLen] = 0; _nextIID++; return(true); } canu-1.6/src/meryl/libleaff/fastaStdin.H000066400000000000000000000042211314437614700202130ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
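fastaStdin::loadNextSequence() above streams one (defline, sequence) pair at a time from a pipe that cannot be rewound, growing its buffers by doubling. A compact sketch of the same contract over a std::istream (function name hypothetical; canu manages raw doubling char arrays rather than std::string):

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>

// Pull the next FASTA record from a forward-only stream.
// Returns false at end of input.
bool nextFasta(std::istream &in, std::string &hdr, std::string &seq) {
  hdr.clear();
  seq.clear();
  int c = in.get();
  while (c != EOF && isspace(c))                 // skip leading whitespace
    c = in.get();
  if (c == EOF || c != '>')                      // must be at a defline
    return false;
  c = in.get();
  while (c != EOF && c != '\n' && c != '\r') {   // copy the defline
    hdr += (char)c;
    c = in.get();
  }
  while (c != EOF && in.peek() != '>') {         // copy bases until next record
    if (!isspace(c))
      seq += (char)c;
    c = in.get();
  }
  return true;
}
```

As in the original, the key constraint is that only peek-ahead of a single character is available, so the parser decides a record has ended by peeking for the next '>' rather than seeking.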
*/ #ifndef FASTASTDIN_H #define FASTASTDIN_H #include "seqFile.H" class fastaStdin : public seqFile { protected: fastaStdin(const char *filename); fastaStdin(); public: ~fastaStdin(); protected: seqFile *openFile(const char *filename); public: uint32 getNumberOfSequences(void); public: uint32 find(const char *sequencename); uint32 getSequenceLength(uint32 iid); bool getSequence(uint32 iid, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax); bool getSequence(uint32 iid, uint32 bgn, uint32 end, char *s); private: void clear(void); bool loadNextSequence(char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax); readBuffer *_rb; uint32 _nextIID; FILE *_pipe; char *_header; uint32 _headerLen; uint32 _headerMax; char *_sequence; uint32 _sequenceLen; uint32 _sequenceMax; friend class seqFactory; }; #endif // FASTASTDIN_H canu-1.6/src/meryl/libleaff/fastqFile.C000066400000000000000000000400571314437614700200330ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-MAY-19 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "fastqFile.H" #include "dnaAlphabets.H" // Says 'kmerFastaFileIdx' #define FASTQ_MAGICNUMBER1 0x7473614672656d6bULL #define FASTQ_MAGICNUMBER2 0x786449656c694661ULL fastqFile::fastqFile(const char *filename) { clear(); #ifdef DEBUG fprintf(stderr, "fastqFile::fastqFile()-- '%s'\n", (filename) ? filename : "NULLPOINTER"); #endif strcpy(_filename, filename); constructIndex(); _rb = new readBuffer(_filename); _numberOfSequences = _header._numberOfSequences; } fastqFile::fastqFile() { clear(); } fastqFile::~fastqFile() { delete _rb; delete [] _index; delete [] _names; } seqFile * fastqFile::openFile(const char *filename) { struct stat st; #ifdef DEBUG fprintf(stderr, "fastqFile::openFile()-- '%s'\n", (filename) ? filename : "NULLPOINTER"); #endif if (((filename == 0L) && (isatty(fileno(stdin)) == 0)) || ((filename != 0L) && (filename[0] == '-') && (filename[1] == 0))) return(0L); errno = 0; stat(filename, &st); if (errno) return(0L); if ((st.st_mode & S_IFREG) == 0) return(0L); // Otherwise, open and see if we can get the first sequence. We // assume it's fastq if we find a '>' denoting a defline the first // thing in the file. // // Use of a readBuffer here is a bit heavyweight, but it's safe and // easy. Opening a fastqFile isn't, after all, lightweight anyway. // fastqFile *f = 0L; readBuffer *r = new readBuffer(filename); char x = r->read(); while ((r->eof() == false) && (alphabet.isWhitespace(x) == true)) x = r->read(); // If we get a fastq record separator assume it's a fastq file. If // it's eof, the file is empty, and we might as well return this // fastq file and let the client deal with the lack of sequence. // if ((x == '@') || (r->eof() == true)) f = new fastqFile(filename); delete r; return(f); } uint32 fastqFile::find(const char *sequencename) { char *ptr = _names; // If this proves far too slow, rewrite the _names string to // separate IDs with 0xff, then use strstr on the whole thing. 
To // find the ID, scan down the string counting the number of 0xff's. // // Similar code is used for seqStore::find() for (uint32 iid=0; iid < _header._numberOfSequences; iid++) { //fprintf(stderr, "fastqFile::find()-- '%s' vs '%s'\n", sequencename, ptr); if (strcmp(sequencename, ptr) == 0) return(iid); while (*ptr) ptr++; ptr++; } return(~uint32ZERO); } uint32 fastqFile::getSequenceLength(uint32 iid) { #ifdef DEBUG fprintf(stderr, "fastqFile::getSequenceLength()-- " F_U32 "\n", iid); #endif return((iid < _numberOfSequences) ? _index[iid]._seqLength : 0); } bool fastqFile::getSequence(uint32 iid, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax) { #ifdef DEBUG fprintf(stderr, "fastqFile::getSequence(full)-- " F_U32 "\n", iid); #endif if (iid >= _header._numberOfSequences) { fprintf(stderr, "fastqFile::getSequence(full)-- iid " F_U32 " more than number of sequences " F_U32 "\n", iid, _header._numberOfSequences); return(false); } if (sMax == 0) { sMax = 2048; s = new char [sMax]; } if (hMax == 0) { hMax = 2048; h = new char [hMax]; } if ((_index) && (sMax < _index[iid]._seqLength)) { sMax = _index[iid]._seqLength; delete [] s; s = new char [sMax]; } hLen = 0; sLen = 0; #ifdef DEBUG fprintf(stderr, "fastqFile::getSequence(full)-- seek to iid=" F_U32 " at pos=" F_U32 "\n", iid, _index[iid]._seqPosition); #endif _rb->seek(_index[iid]._seqPosition); char x = _rb->read(); // Skip whitespace at the start of the sequence. while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true)) x = _rb->read(); // We should be at a '@' character now. Fail if not. 
if (_rb->eof()) return(false); if (x != '@') fprintf(stderr, "fastqFile::getSequence(full)-- ERROR1: In %s, expected '@' at beginning of defline, got '%c' instead.\n", _filename, x), exit(1); // Skip the '@' in the defline x = _rb->read(); // Skip whitespace between the '@' and the defline while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true) && (x != '\r') && (x != '\n')) x = _rb->read(); // Copy the defline, until the first newline. while ((_rb->eof() == false) && (x != '\r') && (x != '\n')) { h[hLen++] = x; if (hLen >= hMax) { hMax += 2048; char *H = new char [hMax]; memcpy(H, h, hLen); delete [] h; h = H; } x = _rb->read(); } h[hLen] = 0; // Skip whitespace between the defline and the sequence. while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true)) x = _rb->read(); // Copy the sequence, until EOF or the start of the QV bases. while ((_rb->eof() == false) && (x != '+')) { if (alphabet.isWhitespace(x) == false) { s[sLen++] = x; if (sLen >= sMax) { if (sMax == 4294967295) // 4G - 1 fprintf(stderr, "fastqFile::getSequence()-- ERROR: sequence is too long; must be less than 4 Gbp.\n"), exit(1); if (sMax >= 2147483648) // 2G sMax = 4294967295; else sMax *= 2; char *S = new char [sMax]; memcpy(S, s, sLen); delete [] s; s = S; } } x = _rb->read(); } s[sLen] = 0; // Skip the rest of the QV id line and then the entire QV line. 
//x = _rb->read(); assert((_rb->eof() == true) || (x == '+')); while ((_rb->eof() == false) && (x != '\r') && (x != '\n')) x = _rb->read(); x = _rb->read(); while ((_rb->eof() == false) && (x != '\r') && (x != '\n')) x = _rb->read(); _nextID++; return(true); } // slow bool fastqFile::getSequence(uint32 iid, uint32 bgn, uint32 end, char *s) { if (iid >= _header._numberOfSequences) { fprintf(stderr, "fastqFile::getSequence(part)-- iid " F_U32 " more than number of sequences " F_U32 "\n", iid, _header._numberOfSequences); return(false); } #ifdef DEBUG fprintf(stderr, "fastqFile::getSequence(part)-- " F_U32 "\n", iid); #endif // Unlike the fasta version of this, we know that all the sequence is on one line. However, we // expect fastq sequences to be small, and we still do the same processing -- character by character. _rb->seek(_index[iid]._seqPosition); uint32 pos = 0; char x = _rb->read(); // Skip whitespace at the start of the sequence. while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true)) x = _rb->read(); // We should be at a '@' character now. Fail if not. if (_rb->eof()) return(false); if (x != '@') fprintf(stderr, "fastqFile::getSequence(part)-- ERROR2: In %s, expected '@' at beginning of defline, got '%c' instead.\n", _filename, x), exit(1); // Skip the defline. while ((_rb->eof() == false) && (x != '\r') && (x != '\n')) x = _rb->read(); // Skip whitespace between the defline and the sequence. while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true)) x = _rb->read(); // Skip sequence up until bgn. while ((_rb->eof() == false) && (pos < bgn)) { if (alphabet.isWhitespace(x) == false) pos++; x = _rb->read(); } // Copy sequence while ((_rb->eof() == false) && (pos < end)) { if (alphabet.isWhitespace(x) == false) s[pos++ - bgn] = x; x = _rb->read(); } s[pos - bgn] = 0; // Fail if we didn't copy enough stuff. return((pos == end) ? 
true : false); } void fastqFile::clear(void) { memset(_filename, 0, FILENAME_MAX); memset(_typename, 0, FILENAME_MAX); strcpy(_typename, "Fastq"); _randomAccessSupported = true; _numberOfSequences = 0; _rb = 0L; memset(&_header, 0, sizeof(fastqFileHeader)); _index = 0L; _names = 0L; _nextID = 0; } void fastqFile::loadIndex(char *indexname) { struct stat fastqstat; if (AS_UTL_fileExists(indexname) == false) return; errno = 0; if (stat(_filename, &fastqstat)) { fprintf(stderr, "fastqFile::constructIndex()-- stat of file '%s' failed: %s\n", _filename, strerror(errno)); return; } FILE *I = fopen(indexname, "r"); if (errno) { fprintf(stderr, "fastqFile::constructIndex()-- open of file '%s' failed: %s\n", indexname, strerror(errno)); return; } fread(&_header, sizeof(fastqFileHeader), 1, I); if ((_header._magic[0] != FASTQ_MAGICNUMBER1) && (_header._magic[1] != FASTQ_MAGICNUMBER2)) { fprintf(stderr, "fastqFile::constructIndex()-- magic mismatch.\n"); fclose(I); return; } if ((_header._fastqFileSize != (uint64)fastqstat.st_size) || (_header._fastqModificationTime != (uint64)fastqstat.st_mtime) || (_header._fastqCreationTime != (uint64)fastqstat.st_ctime)) { fprintf(stderr, "fastqFile::constructIndex()-- stat mismatch.\n"); fclose(I); return; } _index = new fastqFileIndex [_header._numberOfSequences]; _names = new char [_header._namesLength]; fread(_index, sizeof(fastqFileIndex), _header._numberOfSequences, I); fread(_names, sizeof(char), _header._namesLength, I); #ifdef DEBUG fprintf(stderr, "fastqFile::constructIndex()-- '%s' LOADED\n", _filename); #endif fclose(I); return; } void fastqFile::constructIndex(void) { if (_index) return; // If the filename ends in '.fastq' then append a 'idx', // otherwise, append '.fastqidx'. 
char indexname[FILENAME_MAX]; strcpy(indexname, _filename); uint32 l = strlen(_filename); if ((l > 5) && (strcmp(_filename + l - 6, ".fastq") == 0)) strcat(indexname, "idx"); else strcat(indexname, ".fastqidx"); // If the index exists, suck it in and return. loadIndex(indexname); if (_index) return; #ifdef DEBUG fprintf(stderr, "fastqFile::constructIndex()-- '%s' BUILDING\n", _filename); #endif // Allocate some space for the index structures. uint32 indexMax = 64 * 1024 * 1024 / sizeof(fastqFileIndex); uint32 indexLen = 0; _index = new fastqFileIndex [indexMax]; uint32 namesMax = 32 * 1024 * 1024; uint32 namesLen = 0; _names = new char [namesMax]; // Some local storage uint64 seqStart; uint32 seqLen; uint32 seqLenMax = ~uint32ZERO; uint32 namePos; readBuffer ib(_filename); char x = ib.read(); #ifdef DEBUGINDEX fprintf(stderr, "readBuffer '%s' eof=%d x=%c %d\n", _filename, ib.eof(), x, x); #endif // Build it. // Skip whitespace at the start of the sequence. while ((ib.eof() == false) && (alphabet.isWhitespace(x) == true)) { #ifdef DEBUGINDEX fprintf(stderr, "skip '%c' %d\n", x, x); #endif x = ib.read(); } while (ib.eof() == false) { #ifdef DEBUGINDEX fprintf(stderr, "index\n"); #endif // We should be at a '@' character now. Fail if not. if (x != '@') fprintf(stderr, "fastqFile::constructIndex()-- ERROR3: In %s, expected '@' at beginning of defline, got '%c' instead.\n", _filename, x), exit(1); // Save info - ib's position is correctly at the first letter in // the defline (which might be whitespace), but the reader // expects our position to be at the '@' -- hence the -1. 
seqStart = ib.tell() - 1; seqLen = 0; namePos = namesLen; // Read that first letter x = ib.read(); // Copy the name to the names while ((ib.eof() == false) && (alphabet.isWhitespace(x) == false)) { if (namesLen + 1 >= namesMax) { namesMax += 32 * 1024 * 1024; char *nt = new char [namesMax]; memcpy(nt, _names, namesLen); delete [] _names; _names = nt; } _names[namesLen++] = x; #ifdef DEBUGINDEX fprintf(stderr, "name += %c\n", x); #endif x = ib.read(); } if (namesLen + 1 >= namesMax) { namesMax += 32 * 1024 * 1024; char *nt = new char [namesMax]; memcpy(nt, _names, namesLen); delete [] _names; _names = nt; } _names[namesLen++] = 0; // Skip the rest of the defline while ((ib.eof() == false) && (x != '\r') && (x != '\n')) { #ifdef DEBUGINDEX fprintf(stderr, "skip let %c\n", x); #endif x = ib.read(); } // Skip whitespace between the defline and the sequence. while ((ib.eof() == false) && (alphabet.isWhitespace(x) == true)) { #ifdef DEBUGINDEX fprintf(stderr, "skip num %d\n", x); #endif x = ib.read(); } #ifdef DEBUGINDEX fprintf(stderr, "x=%c peek=%c\n", x, ib.peek()); #endif // Count sequence length while ((ib.eof() == false) && (x != '+')) { #ifdef DEBUGINDEX fprintf(stderr, "seqlen %s %c\n", (alphabet.isWhitespace(x) == false) ? "save" : "skip", x); #endif if (alphabet.isWhitespace(x) == false) seqLen++; if (seqLen >= seqLenMax) fprintf(stderr, "fastqFile::constructIndex()-- ERROR: In %s, sequence '%s' is too long. Maximum length is %u bases.\n", _filename, _names + namePos, seqLenMax), exit(1); x = ib.read(); } // Save to the index. 
if (indexLen >= indexMax) { fprintf(stderr, "REALLOC len=" F_U32 " from " F_U32 " to " F_U32 "\n", indexLen, indexMax, indexMax * 2); indexMax *= 2; fastqFileIndex *et = new fastqFileIndex[indexMax]; memcpy(et, _index, sizeof(fastqFileIndex) * indexLen); delete [] _index; _index = et; } _index[indexLen]._seqPosition = seqStart; _index[indexLen]._seqLength = seqLen; #if 0 if ((indexLen * sizeof(fastqFileIndex) > 131000) && (indexLen * sizeof(fastqFileIndex) < 131200)) fprintf(stderr, "INDEX pos=" F_U64 " iid=" F_U32 " len=" F_U32 " pos=" F_U64 "\n", indexLen * sizeof(fastqFileIndex), indexLen, seqLen, seqStart); #endif indexLen++; // Skip the rest of the QV def line, then the entire QV line, then load the '@' for the next sequence. //x = ib.read(); assert((ib.eof() == true) || (x == '+')); while ((ib.eof() == false) && (x != '\r') && (x != '\n')) x = ib.read(); x = ib.read(); while ((ib.eof() == false) && (x != '\r') && (x != '\n')) x = ib.read(); while ((ib.eof() == false) && (x != '@')) x = ib.read(); } // Fill out the index meta data struct stat fastqstat; errno = 0; if (stat(_filename, &fastqstat)) fprintf(stderr, "fastqFile::constructIndex()-- stat() of file '%s' failed: %s\n", _filename, strerror(errno)), exit(1); _header._magic[0] = FASTQ_MAGICNUMBER1; _header._magic[1] = FASTQ_MAGICNUMBER2; _header._numberOfSequences = indexLen; _header._namesLength = namesLen; _header._fastqFileSize = fastqstat.st_size; _header._fastqModificationTime = fastqstat.st_mtime; _header._fastqCreationTime = fastqstat.st_ctime; // Dump the index, if possible. 
errno = 0; FILE *I = fopen(indexname, "w"); if (errno) return; fwrite(&_header, sizeof(fastqFileHeader), 1, I); fwrite( _index, sizeof(fastqFileIndex), _header._numberOfSequences, I); fwrite( _names, sizeof(char), _header._namesLength, I); fclose(I); } canu-1.6/src/meryl/libleaff/fastqFile.H000066400000000000000000000051531314437614700200360ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
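fastqFile above walks records as '@'-defline, sequence, '+'-separator, then exactly one quality line. The quality line must be skipped line-wise rather than scanned for '@', because '@' (Phred quality 31) is a legal quality character. A sketch of that record walk, collecting just the read names (helper name hypothetical):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Walk FASTQ records and collect the name token from each defline.
std::vector<std::string> fastqNames(std::istream &in) {
  std::vector<std::string> names;
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty() || line[0] != '@')
      continue;                       // tolerate blank lines between records
    // name is the defline up to the first whitespace, '@' stripped
    names.push_back(line.substr(1, line.find_first_of(" \t") - 1));
    std::getline(in, line);           // sequence
    std::getline(in, line);           // '+' separator
    std::getline(in, line);           // quality -- may itself start with '@'
  }
  return names;
}
```

Note the test below uses a quality string of all '@' characters: a parser that searched forward for '@' to find the next record would mis-split exactly here, which is why the code above advances by whole lines.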
*/ #ifndef FASTQFILE_H #define FASTQFILE_H #include "seqFile.H" struct fastqFileHeader { uint64 _magic[2]; uint32 _numberOfSequences; // Number of sequences in the file uint32 _namesLength; // Bytes in the names uint64 _fastqFileSize; // st_size - size of file in bytes uint64 _fastqModificationTime; // st_mtime - time of last data modification uint64 _fastqCreationTime; // st_ctime - time of last file status change }; struct fastqFileIndex { uint64 _seqPosition; // Position of the sequence in the file uint32 _seqLength; // Length of the sequence (no whitespace counted) }; class fastqFile : public seqFile { protected: fastqFile(const char *filename); fastqFile(); public: ~fastqFile(); protected: seqFile *openFile(const char *filename); public: uint32 find(const char *sequencename); uint32 getSequenceLength(uint32 iid); bool getSequence(uint32 iid, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax); bool getSequence(uint32 iid, uint32 bgn, uint32 end, char *s); private: void clear(void); void loadIndex(char *indexname); void constructIndex(void); readBuffer *_rb; fastqFileHeader _header; fastqFileIndex *_index; char *_names; uint32 _nextID; // Next sequence in the read buffer uint32 _gs_iid; uint32 _gs_pos; friend class seqFactory; }; #endif // FASTQFILE_H canu-1.6/src/meryl/libleaff/fastqStdin.C000066400000000000000000000164051314437614700202350ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-29 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "fastqStdin.H" #include "dnaAlphabets.H" fastqStdin::fastqStdin(const char *filename) { clear(); #ifdef DEBUG fprintf(stderr, "fastqStdin::fastqStdin()-- '%s'\n", (filename) ? filename : "NULLPOINTER"); #endif if (filename == 0L) { strcpy(_filename, "(stdin)"); _rb = new readBuffer("-"); } else { _pipe = popen(filename, "r"); _rb = new readBuffer(_pipe); } } fastqStdin::fastqStdin() { clear(); } fastqStdin::~fastqStdin() { delete _rb; delete [] _header; delete [] _sequence; delete [] _quality; } seqFile * fastqStdin::openFile(const char *filename) { #ifdef DEBUG fprintf(stderr, "fastqStdin::openFile()-- '%s'\n", (filename) ? filename : "NULLPOINTER"); #endif if (((filename == 0L) && (isatty(fileno(stdin)) == 0)) || ((filename != 0L) && (filename[0] == '-') && (filename[1] == 0))) return(new fastqStdin(0L)); // The stdin variants also handle compressed inputs (because we can't seek in these). 
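The suffix dispatch in fastqStdin::openFile() builds a `gzip -dc` / `bzip2 -dc` / `xz -dc` command for popen() by indexing `filename[fl-3]` and `filename[fl-4]` directly, which reads out of bounds for very short names. A bounds-checked sketch of the same idea (decompressor commands as in the source; the helper name is hypothetical):

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Map a compressed-file suffix to a decompress-to-stdout command,
// or return "" if the suffix is not recognized.
std::string decompressCommand(const char *filename) {
  size_t fl = strlen(filename);
  auto ends = [&](const char *suf) {
    size_t sl = strlen(suf);
    return fl >= sl && strcmp(filename + fl - sl, suf) == 0;
  };
  if (ends(".gz"))  return std::string("gzip -dc ")  + filename;
  if (ends(".bz2")) return std::string("bzip2 -dc ") + filename;
  if (ends(".xz"))  return std::string("xz -dc ")    + filename;
  return "";
}
```

Checking `fl >= sl` before forming `filename + fl - sl` keeps the comparison in bounds even for names shorter than the suffix.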
if (filename == 0L) return(0L); uint32 fl = strlen(filename); char cmd[32 + fl]; if ((filename[fl-3] == '.') && (filename[fl-2] == 'g') && (filename[fl-1] == 'z')) sprintf(cmd, "gzip -dc %s", filename); else if ((filename[fl-4] == '.') && (filename[fl-3] == 'b') && (filename[fl-2] == 'z') && (filename[fl-1] == '2')) sprintf(cmd, "bzip2 -dc %s", filename); else if ((filename[fl-3] == '.') && (filename[fl-2] == 'x') && (filename[fl-1] == 'z')) sprintf(cmd, "xz -dc %s", filename); else return(0L); return(new fastqStdin(cmd)); } uint32 fastqStdin::getNumberOfSequences(void) { if (_rb->peek() == 0) return(_nextIID); else return(_nextIID + 1); } uint32 fastqStdin::find(const char *sequencename) { fprintf(stderr, "fastqStdin::find()-- ERROR! Used for random access on sequence '%s'.\n", sequencename); assert(0); return(~uint32ZERO); } uint32 fastqStdin::getSequenceLength(uint32 iid) { if (iid == _nextIID) if (loadNextSequence(_header, _headerLen, _headerMax, _sequence, _sequenceLen, _sequenceMax) == false) return(0); if (iid + 1 != _nextIID) { fprintf(stderr, "fastqStdin::getSequence()-- ERROR! Used for random access. Requested iid=%u, at iid=%u\n", iid, _nextIID); assert(0); } return(strlen(_sequence)); } bool fastqStdin::getSequence(uint32 iid, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax) { bool ret = true; #ifdef DEBUG fprintf(stderr, "fastqStdin::getSequence(full)-- " F_U32 "\n", iid); #endif if (iid == _nextIID) if (loadNextSequence(_header, _headerLen, _headerMax, _sequence, _sequenceLen, _sequenceMax) == false) return(false); if (iid + 1 != _nextIID) { fprintf(stderr, "fastqStdin::getSequence(full)-- ERROR! Used for random access. 
Requested iid=%u, at iid=%u\n", iid, _nextIID); assert(0); } if (hLen < _headerMax) { delete [] h; hMax = _headerMax; h = new char [hMax]; } if (sLen < _sequenceMax) { delete [] s; sMax = _sequenceMax; s = new char [sMax]; } memcpy(h, _header, _headerLen + 1); hLen = _headerLen; memcpy(s, _sequence, _sequenceLen + 1); sLen = _sequenceLen; return(true); } bool fastqStdin::getSequence(uint32 iid, uint32 bgn, uint32 end, char *s) { fprintf(stderr, "fastqStdin::getSequence(part)-- ERROR! Used for random access on iid " F_U32 " from position " F_U32 "-" F_U32 ".\n", iid, bgn, end); assert(0); return(false); } void fastqStdin::clear(void) { memset(_filename, 0, FILENAME_MAX); memset(_typename, 0, FILENAME_MAX); _randomAccessSupported = false; strcpy(_typename, "FastQstream"); _numberOfSequences = ~uint32ZERO; _rb = 0L; _nextIID = 0; _pipe = 0L; _header = 0L; _headerLen = 0; _headerMax = 0; _sequence = 0L; _sequenceLen = 0; _sequenceMax = 0; _quality = 0L; _qualityLen = 0; _qualityMax = 0; } bool fastqStdin::loadNextSequence(char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax) { if (hMax == 0) { hMax = 2048; h = new char [hMax]; } if (sMax == 0) { sMax = 2048; s = new char [sMax]; } hLen = 0; sLen = 0; char x = _rb->read(); // Skip whitespace at the start of the sequence. while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true)) x = _rb->read(); // We should be at a '@' character now. Fail if not. if (_rb->eof() == true) return(false); if (x != '@') fprintf(stderr, "fastqStdin::loadNextSequence(part)-- ERROR: In %s, expected '@' at beginning of defline, got '%c' instead.\n", _filename, x), exit(1); // Skip the '@' in the defline x = _rb->read(); // Skip whitespace between the '@' and the defline while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true) && (x != '\r') && (x != '\n')) x = _rb->read(); // Copy the defline, until the first newline. 
while ((_rb->eof() == false) && (x != '\r') && (x != '\n')) { h[hLen++] = x; if (hLen >= hMax) { //fprintf(stderr, "realloc header\n"); hMax += 2048; char *H = new char [hMax]; memcpy(H, h, hLen); delete [] h; h = H; } x = _rb->read(); } h[hLen] = 0; // Skip whitespace between the defline and the sequence. while ((_rb->eof() == false) && (alphabet.isWhitespace(x) == true)) x = _rb->read(); // Copy the sequence, until EOF or the start of the QV bases. while ((_rb->eof() == false) && (_rb->peek() != '+')) { if (alphabet.isWhitespace(x) == false) { s[sLen++] = x; if (sLen >= sMax) { //fprintf(stderr, "realloc sequence\n"); sMax *= 2; char *S = new char [sMax]; memcpy(S, s, sLen); delete [] s; s = S; } } x = _rb->read(); } s[sLen] = 0; // Skip the rest of the QV id line and then the entire QV line. //x = _rb->read(); assert((_rb->eof() == true) || (x == '+')); while ((_rb->eof() == false) && (x != '\r') && (x != '\n')) x = _rb->read(); x = _rb->read(); while ((_rb->eof() == false) && (x != '\r') && (x != '\n')) x = _rb->read(); _nextIID++; return(true); } canu-1.6/src/meryl/libleaff/fastqStdin.H000066400000000000000000000043651314437614700202440ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. 
Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef FASTQSTDIN_H #define FASTQSTDIN_H #include "seqFile.H" class fastqStdin : public seqFile { protected: fastqStdin(const char *filename); fastqStdin(); public: ~fastqStdin(); protected: seqFile *openFile(const char *filename); public: uint32 getNumberOfSequences(void); public: uint32 find(const char *sequencename); uint32 getSequenceLength(uint32 iid); bool getSequence(uint32 iid, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax); bool getSequence(uint32 iid, uint32 bgn, uint32 end, char *s); private: void clear(void); bool loadNextSequence(char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax); readBuffer *_rb; uint32 _nextIID; FILE *_pipe; char *_header; uint32 _headerLen; uint32 _headerMax; char *_sequence; uint32 _sequenceLen; uint32 _sequenceMax; char *_quality; uint32 _qualityLen; uint32 _qualityMax; friend class seqFactory; }; #endif // FASTQSTDIN_H canu-1.6/src/meryl/libleaff/gkStoreFile.C000066400000000000000000000074421314437614700203340ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. 
Walenz from 2015-FEB-04 to 2015-AUG-14 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "gkStoreFile.H" //#include "AS_UTL_fileIO.H" gkStoreFile::gkStoreFile() { clear(); gkp = NULL; } gkStoreFile::gkStoreFile(const char *name) { clear(); strcpy(_filename, name); gkp = gkStore::gkStore_open(_filename); _numberOfSequences = gkp->gkStore_getNumReads(); //fprintf(stderr, "Opened '%s' with %u reads\n", _filename, _numberOfSequences); } gkStoreFile::~gkStoreFile() { gkp->gkStore_close(); } seqFile * gkStoreFile::openFile(const char *name) { struct stat st; // Assume it's a gkStore if it is a directory, and the info / reads / blobs files exist. char infoName[FILENAME_MAX]; char readName[FILENAME_MAX]; char blobName[FILENAME_MAX]; sprintf(infoName, "%s/info", name); sprintf(readName, "%s/reads", name); sprintf(blobName, "%s/blobs", name); if ((AS_UTL_fileExists(name, true) == false) || (AS_UTL_fileExists(infoName) == false) || (AS_UTL_fileExists(readName) == false) || (AS_UTL_fileExists(blobName) == false)) return(0L); // Yup, probably a gkStore. If it isn't, the gkStore() constructor blows up. 
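The detection rule described above can be sketched as follows: the path is treated as a gkStore only if it is a directory containing `info`, `reads`, and `blobs`. This sketch substitutes plain `stat()` for canu's `AS_UTL_fileExists()`, and the helper name `looksLikeGkStore()` is hypothetical.

```cpp
#include <string>
#include <sys/stat.h>

// Sketch (not canu code) of gkStoreFile::openFile()'s detection rule:
// a gkStore is a directory holding the three component files.
static bool looksLikeGkStore(const std::string &path) {
  struct stat st;

  if (stat(path.c_str(), &st) != 0 || !S_ISDIR(st.st_mode))
    return false;                              // not a directory at all

  for (const char *f : { "info", "reads", "blobs" })
    if (stat((path + "/" + f).c_str(), &st) != 0)
      return false;                            // missing a component file

  return true;
}
```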
return(new gkStoreFile(name)); } bool gkStoreFile::getSequence(uint32 iid, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax) { if (iid > _numberOfSequences) { fprintf(stderr, "gkStoreFile::getSequence()-- iid %u exceeds number in store %u\n", iid, _numberOfSequences); return(false); } iid++; uint32 rLength = gkp->gkStore_getRead(iid)->gkRead_sequenceLength(); if (hMax < 32) { delete h; h = new char [32]; hMax = 32; } if (sMax < rLength) { delete s; s = new char [rLength + 1]; sMax = rLength; } hLen = sprintf(h, F_U32, iid); sLen = rLength; gkp->gkStore_loadReadData(iid, &readData); memcpy(s, readData.gkReadData_getSequence(), sizeof(char) * rLength); s[sLen] = 0; return(true); } bool gkStoreFile::getSequence(uint32 iid, uint32 bgn, uint32 end, char *s) { if (iid > _numberOfSequences) { fprintf(stderr, "gkStoreFile::getSequence()-- iid %u exceeds number in store %u\n", iid, _numberOfSequences); return(false); } iid++; uint32 rLength = gkp->gkStore_getRead(iid)->gkRead_sequenceLength(); //fprintf(stderr, "return canu iid %u of length %u\n", iid, rLength); assert(bgn < end); assert(bgn <= rLength); assert(end <= rLength); gkp->gkStore_loadReadData(iid, &readData); memcpy(s, readData.gkReadData_getSequence() + bgn, sizeof(char) * (end - bgn)); s[end-bgn] = 0; return(true); } void gkStoreFile::clear(void) { memset(_filename, 0, FILENAME_MAX); memset(_typename, 0, FILENAME_MAX); strcpy(_typename, "GKSTORE"); _numberOfSequences = 0; } canu-1.6/src/meryl/libleaff/gkStoreFile.H000066400000000000000000000040461314437614700203360ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-FEB-04 to 2015-MAR-17 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-29 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef GKSTOREFILE_H #define GKSTOREFILE_H #include "seqFile.H" #include "gkStore.H" class gkStoreFile : public seqFile { protected: gkStoreFile(const char *filename); gkStoreFile(); public: ~gkStoreFile(); protected: seqFile *openFile(const char *name); public: uint32 find(const char *sequencename) { fprintf(stderr, "gkStoreFile::find()-- Lookup of sequencename '%s' not supported.\n", sequencename); assert(0); return(0); }; uint32 getSequenceLength(uint32 iid) { return(gkp->gkStore_getRead(iid + 1)->gkRead_sequenceLength()); }; bool getSequence(uint32 iid, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax); bool getSequence(uint32 iid, uint32 bgn, uint32 end, char *s); private: void clear(void); gkStore *gkp; gkReadData readData; friend class seqFactory; }; #endif // GKSTOREFILE_H canu-1.6/src/meryl/libleaff/merStream.C000066400000000000000000000044261314437614700200540ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "merStream.H" merStream::merStream(kMerBuilder *kb, seqStream *ss, bool kbown, bool ssown) { _kb = kb; _ss = ss; _kbdelete = kbown; _ssdelete = ssown; _beg = uint64ZERO; _end = ~uint64ZERO; _kb->clear(); _invalid = true; } merStream::~merStream() { if (_kbdelete) delete _kb; if (_ssdelete) delete _ss; } void merStream::rewind(void) { _ss->rewind(); _kb->clear(); _invalid = true; } void merStream::rebuild(void) { _ss->setPosition(_ss->strPos() - _kb->theFMer().getMerSpan()); _kb->clear(); _invalid = true; } void merStream::setBaseRange(uint64 beg, uint64 end) { assert(beg < end); //fprintf(stderr, "merStream::setBaseRange()-- from "uint64FMT" to "uint64FMT".\n", beg, end); // We can't tell the seqStore when to stop; while we could compute the span of a spaced seed, we // cannot compute it for a compressed seed. We need to stop iterating when the beginning of the // mer reaches the requested end. _ss->setRange(beg, ~uint64ZERO); _beg = beg; _end = end; _kb->clear(); _invalid = true; } uint64 merStream::approximateNumberOfMers(void) { uint64 approx = _end - _beg; uint64 k = _kb->merSize(); // If we don't know the range, sum all the sequence lengths, otherwise, it's just the length from // begin to end. 
if (_end == ~uint64ZERO) { approx = uint64ZERO; for (uint32 s=0; s<_ss->numberOfSequences(); s++) { uint32 l = _ss->lengthOf(s); if (l > k) approx += l - k + 1; } } return(approx); } canu-1.6/src/meryl/libleaff/merStream.H000066400000000000000000000076611314437614700200650ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef MERSTREAM_H #define MERSTREAM_H #include "seqFile.H" #include "seqStream.H" #include "kMer.H" // // merStream needs exclusive use of a kMerBuilder and a seqStream. // // The kMerBuilder can be used over and over. I think snapper is the // only one that does this though. // // The seqStream can be used elsewhere, but ONLY for looking up // positions. // // The merStream does NOT assume ownership of either of these, unless // the own flags are set. // // The stream is not valid until nextMer is called; allowing loops of // while (MS->nextMer()) { // process(MS->theFMer()); // } // // setRange() positions refer to ACGT letters in the input, NOT mers. // rewind() repositions the file to the start of the range. 
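The estimate in `approximateNumberOfMers()` rests on a simple counting rule: a sequence of length l holds l - k + 1 mers of size k. A sketch under that rule, summed over sequence lengths; the function name is illustrative, and unlike the source (which requires l > k) this sketch also counts the single mer when l == k.

```cpp
#include <cstdint>
#include <vector>

// Count rule used by approximateNumberOfMers(): each sequence of length l
// contributes l - k + 1 mers of size k, or zero if it is shorter than k.
// approxMers() is an illustrative stand-in, not a canu function.
static uint64_t approxMers(const std::vector<uint32_t> &seqLens, uint32_t k) {
  uint64_t total = 0;

  for (uint32_t l : seqLens)
    if (l >= k)
      total += (uint64_t)l - k + 1;

  return total;
}
```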
// class merStream { public: merStream(kMerBuilder *kb, seqStream *ss, bool kbown=false, bool ssown=false); ~merStream(); kMer const & theFMer(void) { assert(_invalid == false); return(_kb->theFMer()); }; kMer const & theRMer(void) { assert(_invalid == false); return(_kb->theRMer()); }; kMer const & theCMer(void) { assert(_invalid == false); return(_kb->theCMer()); }; bool nextMer(uint32 skip=0) { char ch; do { ch = _ss->get(); if (ch == 0) return(false); } while ((_kb->addBase(ch) == true) || (skip-- > 0)); _kb->mask(); _invalid = false; #if 0 char merstring[256]; fprintf(stderr, "merStream::nextMer()-- seqPos="uint64FMT" merPos="uint64FMT" span="uint32FMT" base0span="uint32FMT" end="uint64FMT" %s %s\n", _ss->strPos(), _ss->strPos() - theFMer().getMerSpan(), theFMer().getMerSpan(), _kb->baseSpan(0), _end, _kb->theFMer().merToString(merstring), (_ss->strPos() - theFMer().getMerSpan() < _end) ? "" : "STOP"); #endif // The mer is out of range if: // o it begins at or past the _end // o the span of the first base ends at or past the _end // // If the mer isn't spaced, the base span is always 1. If it is spaced, the span will be // between 1 and ... who knows. 
return(_ss->strPos() - theFMer().getMerSpan() + _kb->baseSpan(0) - 1 < _end); }; void rewind(void); void rebuild(void); void setBaseRange(uint64 beg, uint64 end); uint64 thePositionInSequence(void) { assert(_invalid == false); return(_ss->seqPos() - theFMer().getMerSpan()); }; uint64 thePositionInStream(void) { assert(_invalid == false); return(_ss->strPos() - theFMer().getMerSpan()); }; uint64 theSequenceNumber(void) { assert(_invalid == false); return(_ss->seqIID()); }; uint64 approximateNumberOfMers(void); private: kMerBuilder *_kb; seqStream *_ss; bool _kbdelete; bool _ssdelete; bool _invalid; uint64 _beg; uint64 _end; }; #endif // MERSTREAM_H canu-1.6/src/meryl/libleaff/selftest.C000066400000000000000000000042541314437614700177450ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ int main(int argc, char **argv) { seqFile *SF = openSeqFile(argv[1]); fprintf(stdout, "source '%s' of type '%s' has "uint32FMT" sequences.\n", SF->getSourceName(), SF->getFileTypeName(), SF->getNumberOfSequences()); fprintf(stdout, "getSequenceLength() vs getSequence(full)\n"); { char *h = 0L; char *s = 0L; uint32 hLen=0, hMax=0; uint32 sLen=0, sMax=0; for (uint32 sid=0; sid<SF->getNumberOfSequences(); sid++) { SF->getSequence(sid, h, hLen, hMax, s, sLen, sMax); if ((strlen(s) != SF->getSequenceLength(sid)) || (strlen(s) != sLen) || (SF->getSequenceLength(sid) != sLen)) { fprintf(stdout, "length differ for sid="uint32FMT" h='%s' strlen(s)=%d sLen="uint32FMT" getSequenceLength()="uint32FMT"\n", sid, h, strlen(s), sLen, SF->getSequenceLength(sid)); } } delete [] h; delete [] s; } fprintf(stdout, "getSequenceLength() vs getSequence(part)\n"); { char *p = new char [128 * 1024 * 1024]; for (uint32 sid=0; sid<SF->getNumberOfSequences(); sid++) { SF->getSequence(sid, 0, SF->getSequenceLength(sid), p); if (strlen(p) != SF->getSequenceLength(sid)) { fprintf(stdout, "length differ for sid="uint32FMT" strlen(s)=%d getSequenceLength()="uint32FMT"\n", sid, strlen(p), SF->getSequenceLength(sid)); } } delete [] p; } return(0); } canu-1.6/src/meryl/libleaff/seqCache.C000066400000000000000000000117001314437614700176220ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P.
Walenz from 2014-DEC-08 to 2014-DEC-22 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-MAY-19 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "seqCache.H" #include "seqFactory.H" seqCache::seqCache(const char *filename, uint32 cachesize, bool verbose) { _fb = openSeqFile(filename); _idToGetNext = 0; _allSequencesLoaded = false; _reportLoading = verbose; _cacheMap = 0L; _cacheSize = 0; _cacheNext = 0; _cache = 0L; setCacheSize(cachesize); } seqCache::~seqCache() { flushCache(); delete _fb; delete [] _cacheMap; delete [] _cache; } uint32 seqCache::getSequenceIID(char *name) { uint32 iid = ~uint32ZERO; // If the name is all integers, AND below the number of sequences // we have, return that, otherwise, look it up. // bool isInt = true; char *x = name; while (*x) { if ((*x < '0') || ('9' < *x)) isInt = false; x++; } if (isInt) iid = strtouint32(name); if (iid >= _fb->getNumberOfSequences()) iid = _fb->find(name); #ifdef DEBUG fprintf(stderr, "seqCache::getSequenceIID()-- '%s' -> " F_U32 "\n", name, iid); #endif return(iid); } seqInCore * seqCache::getSequenceInCore(uint32 iid) { uint32 cacheID = ~uint32ZERO; seqInCore *retSeq = 0L; if ((_fb->randomAccessSupported() == true) && (iid >= _fb->getNumberOfSequences())) return(0L); if (_allSequencesLoaded == true) { cacheID = iid; } else if ((_cacheSize > 0) && (_cacheMap[iid] != ~uint32ZERO)) { cacheID = _cacheMap[iid]; } else { uint32 hLen=0, hMax=0, sLen=0, sMax=0; char *h=0L, *s=0L; if (_fb->getSequence(iid, h, hLen, hMax, s, sLen, sMax) == false) return(0L); retSeq = new seqInCore(iid, h, hLen, s, sLen, true); // Remove any old cached sequence, then store the one we just made if (_cache) { if (_cache[_cacheNext]) { 
_cacheMap[_cache[_cacheNext]->getIID()] = ~uint32ZERO; delete _cache[_cacheNext]; } _cache[_cacheNext] = retSeq; _cacheMap[iid] = _cacheNext; cacheID = _cacheNext; retSeq = 0L; _cacheNext = (_cacheNext + 1) % _cacheSize; } } // If no retSeq set, make a copy of the one we have in the cache. if ((retSeq == 0L) && (cacheID != ~uint32ZERO)) retSeq = new seqInCore(iid, _cache[cacheID]->header(), _cache[cacheID]->headerLength(), _cache[cacheID]->sequence(), _cache[cacheID]->sequenceLength(), false); return(retSeq); } void seqCache::setCacheSize(uint32 cachesize) { uint32 ns = _fb->getNumberOfSequences(); flushCache(); if (cachesize == 0) { _cacheMap = 0L; _cacheSize = 0; _cacheNext = 0; _cache = 0L; return; } _cacheMap = new uint32 [ns]; _cacheSize = cachesize; _cacheNext = 0; _cache = new seqInCore * [_cacheSize]; for (uint32 i=0; igetNumberOfSequences(); _cacheNext = 0; _cache = new seqInCore * [_cacheSize]; for (uint32 iid=0; iid<_cacheSize; iid++) { uint32 hLen=0, hMax=0, sLen=0, sMax=0; char *h=0L, *s=0L; if (_fb->getSequence(iid, h, hLen, hMax, s, sLen, sMax) == false) fprintf(stderr, "seqCache::loadAllSequences()-- Failed to load iid " F_U32 ".\n", iid), exit(1); _cache[iid] = new seqInCore(iid, h, hLen, s, sLen, true); } _allSequencesLoaded = true; } void seqCache::flushCache(void) { if (_fb == 0L) return; if (_cacheMap) { uint32 ns = _fb->getNumberOfSequences(); for (uint32 i=0; igetSourceName()); }; const char *getFileTypeName(void) { return(_fb->getFileTypeName()); }; bool randomAccessSupported(void) { return(_fb->randomAccessSupported()); }; uint32 getNumberOfSequences(void) { return(_fb->getNumberOfSequences()); }; uint32 getSequenceLength(uint32 iid) { return(_fb->getSequenceLength(iid)); }; void setCacheSize(uint32 cachesize); void loadAllSequences(void); void flushCache(void); private: seqFile *_fb; uint32 _idToGetNext; bool _allSequencesLoaded; bool _reportLoading; uint32 *_cacheMap; // Maps ID to cache entry uint32 _cacheSize; // Size of cache uint32 
_cacheNext; // Next cache spot to use seqInCore **_cache; // Cache of sequences }; #endif // SEQCACHE_H canu-1.6/src/meryl/libleaff/seqFactory.C000066400000000000000000000045251314437614700202350ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2015-FEB-04 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-06 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "seqFactory.H" #include "fastaFile.H" #include "fastaStdin.H" #include "fastqFile.H" #include "fastqStdin.H" #include "seqStore.H" #include "gkStoreFile.H" seqFactory *seqFactory::me = 0L; seqFactory::seqFactory() { _filesNum = 0; _filesMax = 16; _files = new seqFile * [_filesMax]; registerFile(new fastaFile); registerFile(new fastaStdin); registerFile(new fastqFile); registerFile(new fastqStdin); registerFile(new seqStore); registerFile(new gkStoreFile); //registerFile(new sffFile); } seqFactory::~seqFactory() { for (uint32 i=0; i<_filesNum; i++) delete _files[i]; delete [] _files; } void seqFactory::registerFile(seqFile *f) { if (_filesNum >= _filesMax) { fprintf(stderr, "seqFactory::registerFile()-- Wow! You registered lots of files! 
Now fix %s at line %d.\n", __FILE__, __LINE__); exit(1); } _files[_filesNum++] = f; } seqFile * seqFactory::openFile(const char *name) { seqFile *n = 0L; for (uint32 i=0; i<_filesNum; i++) { n = _files[i]->openFile(name); if (n) return(n); } fprintf(stderr, "seqFactory::openFile()-- Cannot determine type of file '%s'. Tried:\n", name); for (uint32 i=0; i<_filesNum; i++) fprintf(stderr, "seqFactory::openFile()-- '%s'\n", _files[i]->getFileTypeName()); exit(1); return(n); } canu-1.6/src/meryl/libleaff/seqFactory.H000066400000000000000000000026571314437614700202460ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #ifndef SEQFACTORY_H #define SEQFACTORY_H #include "seqFile.H" class seqFactory { protected: seqFactory(); ~seqFactory(); public: static seqFactory *instance(void) { if (me == 0L) me = new seqFactory; return(me); }; void registerFile(seqFile *f); seqFile *openFile(const char *name); private: static seqFactory *me; uint32 _filesNum; uint32 _filesMax; seqFile **_files; }; #define openSeqFile(S) seqFactory::instance()->openFile((S)) #endif // SEQFACTORY_H canu-1.6/src/meryl/libleaff/seqFile.H000066400000000000000000000050741314437614700175120ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-29 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #ifndef SEQFILE_H #define SEQFILE_H #undef DEBUG #include "AS_global.H" #include "readBuffer.H" // General flow of the constructors is: // Clear all data // Open the file // Set _filename, _typename // Read/build the index structure // Position the file to the first read // Set _numberOfSequences (IMPORTANT, and subtle) class seqFile { protected: seqFile(const char *UNUSED(filename)) {}; seqFile() {}; public: virtual ~seqFile() {}; protected: virtual seqFile *openFile(const char *filename) = 0; public: virtual const char *getSourceName(void) { return(_filename); }; virtual const char *getFileTypeName(void) { return(_typename); }; virtual bool randomAccessSupported(void) { return(_randomAccessSupported); }; virtual uint32 getNumberOfSequences(void) { return(_numberOfSequences); }; public: virtual uint32 find(const char *sequencename) = 0; virtual uint32 getSequenceLength(uint32 id) = 0; virtual bool getSequence(uint32 id, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax) = 0; virtual bool getSequence(uint32 iid, uint32 bgn, uint32 end, char *s) = 0; protected: char _filename[FILENAME_MAX]; char _typename[FILENAME_MAX]; bool _randomAccessSupported; uint32 _numberOfSequences; friend class seqFactory; }; #endif // SEQFILE_H canu-1.6/src/meryl/libleaff/seqStore.C000066400000000000000000000426271314437614700177270ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. 
Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-29 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "seqStore.H" #include "seqCache.H" #include "dnaAlphabets.H" #include "speedCounter.H" // Says 'kmerSeqStoreFile' #define SEQSTORE_MAGICNUMBER1 0x5371655372656d6bULL #define SEQSTORE_MAGICNUMBER2 0x656c694665726f74ULL seqStore::seqStore(const char *filename) { clear(); strcpy(_filename, filename); errno = 0; FILE *F = fopen(_filename, "r"); if (errno) fprintf(stderr, "seqStore::seqStore()-- Failed to open '%s': %s\n", _filename, strerror(errno)), exit(1); fread(&_header, sizeof(seqStoreHeader), 1, F); fclose(F); //_indexBPF = new bitPackedFile(_filename, _header._indexStart); //_blockBPF = new bitPackedFile(_filename, _header._blockStart); //_namesBPF = new bitPackedFile(_filename, _header._namesStart); _bpf = new bitPackedFile(_filename, sizeof(seqStoreHeader)); _numberOfSequences = _header._numberOfSequences; } seqStore::seqStore() { clear(); } seqStore::~seqStore() { //if ((_filename) && (_filename[0] != 0)) // fprintf(stderr, "Closing seqStore '%s'\n", _filename); delete _bpf; delete [] _index; delete [] _block; delete [] _names; delete _indexBPF; delete _blockBPF; delete _namesBPF; } seqFile * seqStore::openFile(const char *filename) { uint64 magic1, magic2; struct stat st; errno = 0; stat(filename, &st); if (errno) return(0L); if ((st.st_mode & S_IFREG) == 0) return(0L); // Check the magic. Fail if not correct. 
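The two magic constants above decode to the ASCII string "kmerSeqStoreFile" when their bytes are laid out little-endian, which is what the comment "Says 'kmerSeqStoreFile'" means. A sketch of the sixteen-byte check that `openFile()` performs on the start of the file; it assumes a little-endian host, as the on-disk layout implies, and the helper name is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Sketch of the seqStore magic-number test: the first sixteen bytes must
// match the two 64-bit words, which spell "kmerSeqStoreFile" little-endian.
static bool looksLikeSeqStore(const uint8_t *first16) {
  uint64_t m1, m2;

  memcpy(&m1, first16,     8);   // memcpy avoids unaligned-access issues
  memcpy(&m2, first16 + 8, 8);

  return (m1 == 0x5371655372656d6bULL) &&      // "kmerSeqS"
         (m2 == 0x656c694665726f74ULL);        // "toreFile"
}
```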
errno = 0; FILE *F = fopen(filename, "r"); if (errno) return(0L); fread(&magic1, sizeof(uint64), 1, F); fread(&magic2, sizeof(uint64), 1, F); fclose(F); if ((magic1 != SEQSTORE_MAGICNUMBER1) || (magic2 != SEQSTORE_MAGICNUMBER2)) return(0L); return(new seqStore(filename)); } // If this proves far too slow, rewrite the _names string to separate IDs with 0xff, then use // strstr on the whole thing. To find the ID, scan down the string counting the number of 0xff's. // // Similar code is used for fastaFile::find() // uint32 seqStore::find(const char *sequencename) { if (_names == NULL) loadIndex(); char *ptr = _names; for (uint32 iid=0; iid < _header._numberOfSequences; iid++) { if (strcmp(sequencename, ptr) == 0) return(iid); while (*ptr) ptr++; ptr++; } return(~uint32ZERO); } uint32 seqStore::getSequenceLength(uint32 iid) { if (_index == NULL) loadIndex(); return((iid < _header._numberOfSequences) ? _index[iid]._seqLength : 0); } bool seqStore::getSequence(uint32 iid, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax) { if (_index == NULL) loadIndex(); if (iid >= _header._numberOfSequences) { fprintf(stderr, "seqStore::getSequence(full)-- iid " F_U32 " more than number of sequences " F_U32 "\n", iid, _header._numberOfSequences); return(false); } if (sMax == 0) s = 0L; // So the delete below doesn't bomb if (hMax == 0) h = 0L; if (sMax < _index[iid]._seqLength + 1) { sMax = _index[iid]._seqLength + 1024; delete [] s; s = new char [sMax]; } if (hMax < _index[iid]._hdrLength + 1) { hMax = _index[iid]._hdrLength + 1024; delete [] h; h = new char [hMax]; } hLen = 0; sLen = 0; // Copy the defline into h memcpy(h, _names + _index[iid]._hdrPosition, _index[iid]._hdrLength); h[_index[iid]._hdrLength] = 0; // Decode and copy the sequence into s uint32 seqLen = _index[iid]._seqLength; uint32 block = _index[iid]._block; uint64 seekpos = _index[iid]._seqPosition * 2; _bpf->seek(seekpos); while (sLen < seqLen) { assert(_bpf->tell() == _block[block]._bpf * 
2); assert(sLen == _block[block]._pos); if (_block[block]._isACGT == 0) { memset(s + sLen, 'N', _block[block]._len); sLen += _block[block]._len; } else { for (uint32 xx=0; xx<_block[block]._len; xx++) { s[sLen++] = alphabet.bitsToLetter(_bpf->getBits(2)); } } block++; } s[sLen] = 0; return(true); } bool seqStore::getSequence(uint32 iid, uint32 bgn, uint32 end, char *s) { if (_index == NULL) loadIndex(); if (iid >= _header._numberOfSequences) { fprintf(stderr, "seqStore::getSequence(part)-- iid " F_U32 " more than number of sequences " F_U32 "\n", iid, _header._numberOfSequences); return(false); } if (bgn >= end) { fprintf(stderr, "seqStore::getSequence(part)-- for iid " F_U32 "; invalid bgn=" F_U32 " end=" F_U32 "; seqLen=" F_U32 "\n", iid, bgn, end, _index[iid]._seqLength); return(false); } // Decode and copy the sequence into s uint32 block = _index[iid]._block; uint32 sLen = 0; // length of sequence we've copied uint32 sPos = 0; // position in the sequence // Skip blocks before we care. // while (sPos + _block[block]._len < bgn) { sPos += _block[block]._len; block++; } assert(sPos == _block[block]._pos); // Move into the block (we could just set sPos = bgn...). sPos += bgn - _block[block]._pos; // Handle the partial block. Copy what is left in the block, or // the requested size, whichever is smaller. uint32 partLen = MIN((_block[block]._pos + _block[block]._len - bgn), (end - bgn)); if (_block[block]._isACGT == 0) { memset(s, 'N', partLen); sLen += partLen; _bpf->seek(_block[block+1]._bpf * 2); } else { _bpf->seek((_block[block]._bpf + bgn - _block[block]._pos) * 2); for (uint32 xx=0; xx<partLen; xx++) s[sLen++] = alphabet.bitsToLetter(_bpf->getBits(2)); } sPos += partLen; block++; while (sPos < end) { assert(_bpf->tell() == _block[block]._bpf * 2); assert(sPos == _block[block]._pos); // Like the partial block above, pick how much to copy as the // smaller of the block size and what is left to fill.
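Both `getSequence()` variants decode bases two bits at a time via `alphabet.bitsToLetter(_bpf->getBits(2))`, with non-ACGT stretches stored as gap blocks and rewritten as 'N'. A sketch of such a 2-bit mapping; the A=0, C=1, G=2, T=3 assignment is an assumption here, since canu's real tables live in dnaAlphabets.

```cpp
#include <cstdint>

// Sketch of a 2-bit DNA alphabet like the one seqStore decodes with.
// The concrete code assignment (A=0, C=1, G=2, T=3) is assumed, not canu's.
static uint8_t letterToBits(char c) {
  switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return 0xff;   // not ACGT; such runs go into a gap block
  }
}

static char bitsToLetter(uint8_t b) {
  return "ACGT"[b & 0x3];    // inverse mapping for the low two bits
}
```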
partLen = MIN((_block[block]._len), (end - sPos)); if (_block[block]._isACGT == 0) { memset(s + sLen, 'N', partLen); sLen += partLen; } else { for (uint32 xx=0; xx<partLen; xx++) s[sLen++] = alphabet.bitsToLetter(_bpf->getBits(2)); } sPos += partLen; block++; } s[sLen] = 0; return(true); } void seqStore::clear(void) { memset(_filename, 0, FILENAME_MAX); memset(_typename, 0, FILENAME_MAX); strcpy(_typename, "seqStore"); _numberOfSequences = 0; _bpf = 0L; memset(&_header, 0, sizeof(seqStoreHeader)); _index = 0L; _block = 0L; _names = 0L; _indexBPF = 0L; _blockBPF = 0L; _namesBPF = 0L; _lastIIDloaded = ~uint32ZERO; } void seqStore::loadIndex(void) { if (_index) return; delete _indexBPF; _indexBPF = 0L; delete _blockBPF; _blockBPF = 0L; delete _namesBPF; _namesBPF = 0L; errno = 0; FILE *F = fopen(_filename, "r"); if (errno) fprintf(stderr, "seqStore::seqStore()-- Failed to open '%s': %s\n", _filename, strerror(errno)), exit(1); fread(&_header, sizeof(seqStoreHeader), 1, F); //fprintf(stderr, "seqStore::seqStore()-- Allocating space for " F_U32 " sequences (" F_U64 "MB)\n", _header._numberOfSequences, _header._numberOfSequences * sizeof(seqStoreIndex) / 1024 / 1024); //fprintf(stderr, "seqStore::seqStore()-- Allocating space for " F_U32 " blocks (" F_U64 "MB)\n", _header._numberOfBlocks, _header._numberOfBlocks * sizeof(seqStoreBlock) / 1024 / 1024); //fprintf(stderr, "seqStore::seqStore()-- Allocating space for " F_U32 " labels (" F_U64 "MB)\n", _header._namesLength, _header._namesLength * sizeof(char) / 1024 / 1024); _index = new seqStoreIndex [_header._numberOfSequences]; _block = new seqStoreBlock [_header._numberOfBlocks]; _names = new char [_header._namesLength]; fseeko(F, _header._indexStart, SEEK_SET); fread( _index, sizeof(seqStoreIndex), _header._numberOfSequences, F); #if 0 for (uint32 i=0; i<_header._numberOfSequences; i++) fprintf(stderr, "IDX[%4u] hdrPos=%u hdrLen=%u seqPos=%llu seqLen=%u block=%u\n", i, _index[i]._hdrPosition, _index[i]._hdrLength, _index[i]._seqPosition, _index[i]._seqLength,
_index[i]._block); #endif fseeko(F, _header._blockStart, SEEK_SET); fread( _block, sizeof(seqStoreBlock), _header._numberOfBlocks, F); fseeko(F, _header._namesStart, SEEK_SET); fread( _names, sizeof(char), _header._namesLength, F); if (errno) fprintf(stderr, "seqStore::seqStore()-- Failed to read index from '%s': %s\n", _filename, strerror(errno)), exit(1); fclose(F); } static void addSeqStoreBlock(uint32 &BLOKmax, uint32 &BLOKlen, seqStoreBlock* &BLOK, seqStoreBlock &b, uint32 &nBlockACGT, uint32 &nBlockGAP, uint64 &nACGT) { //fprintf(stderr, "addSeqStoreBlock()-- BLOK max=%u len=%u ACGT=%u GAP=%u nACGT=%lu\n", // BLOKmax, BLOKlen, nBlockACGT, nBlockGAP, nACGT); if (b._len == 0) return; if (b._isACGT == 1) { nBlockACGT++; nACGT += b._len; } else { nBlockGAP++; } BLOK[BLOKlen++] = b; if (BLOKlen >= BLOKmax) { BLOKmax *= 2; seqStoreBlock *nb = new seqStoreBlock [BLOKmax]; memcpy(nb, BLOK, BLOKlen * sizeof(seqStoreBlock)); delete [] BLOK; BLOK = nb; } } void constructSeqStore(char *filename, seqCache *inputseq) { fprintf(stderr, "constructSeqStore()-- constructing seqStore '%s' from seqCache '%s' of type '%s'.\n", filename, inputseq->getSourceName(), inputseq->getFileTypeName()); seqStoreHeader HEAD; memset(&HEAD, 0, sizeof(seqStoreHeader)); bitPackedFile *DATA = new bitPackedFile(filename, sizeof(seqStoreHeader), true); uint32 INDXmax = 1048576; seqStoreIndex *INDX = new seqStoreIndex [INDXmax]; uint32 BLOKmax = 1048576; uint32 BLOKlen = 0; seqStoreBlock *BLOK = new seqStoreBlock [BLOKmax]; uint32 NAMEmax = 32 * 1024 * 1024; uint32 NAMElen = 0; char *NAME = new char [NAMEmax]; seqInCore *sic = inputseq->getSequenceInCore(); uint64 nACGT = 0; uint32 nBlockACGT = 0; uint32 nBlockGAP = 0; uint32 nSequences = 0; speedCounter C(" reading sequences %7.0f sequences -- %5.0f sequences/second\r", 1.0, 0x1ffff, true); while (sic != NULL) { if (sic->sequence()) { char *seq = sic->sequence(); seqStoreBlock b; if (nSequences >= INDXmax) { seqStoreIndex *I = new 
      seqStoreIndex [INDXmax * 2];
        memcpy(I, INDX, sizeof(seqStoreIndex) * nSequences);
        delete [] INDX;
        INDXmax *= 2;
        INDX = I;
      }

      INDX[nSequences]._hdrPosition = NAMElen;
      INDX[nSequences]._hdrLength   = sic->headerLength();
      INDX[nSequences]._seqPosition = DATA->tell() / 2;
      INDX[nSequences]._seqLength   = sic->sequenceLength();
      INDX[nSequences]._block       = BLOKlen;

#if 0
      fprintf(stderr, "ADD SEQUENCE hdr pos=%u len=%u seq pos=%u len=%u blok=%u\n",
              INDX[nSequences]._hdrPosition,
              INDX[nSequences]._hdrLength,
              INDX[nSequences]._seqPosition,
              INDX[nSequences]._seqLength,
              INDX[nSequences]._block);
#endif

#if SEQSTOREBLOCK_MAXPOS < uint64MASK(32)
      if (sic->sequenceLength() > SEQSTOREBLOCK_MAXPOS)
        fprintf(stderr, "constructSeqStore()-- sequence %s too long, must be shorter than " F_U64 " Gbp.\n",
                sic->header(), SEQSTOREBLOCK_MAXPOS / 1024 / 1024 / 1024), exit(1);
#endif

#if SEQSTOREBLOCK_MAXIID < uint64MASK(32)
      if (sic->getIID() > SEQSTOREBLOCK_MAXIID)
        fprintf(stderr, "constructSeqStore()-- too many sequences, must be fewer than " F_U64 ".\n",
                SEQSTOREBLOCK_MAXIID), exit(1);
#endif

      if (NAMElen + sic->headerLength() + 1 > NAMEmax) {
        NAMEmax += 32 * 1024 * 1024;
        char *nm = new char [NAMEmax];
        memcpy(nm, NAME, sizeof(char) * NAMElen);
        delete [] NAME;
        NAME = nm;
      }

      strcpy(NAME + NAMElen, sic->header());
      NAMElen += sic->headerLength() + 1;

      b._isACGT = 0;
      b._iid    = sic->getIID();
      b._pos    = 0;
      b._len    = 0;
      b._bpf    = DATA->tell() / 2;

      for (uint32 p=0; p<sic->sequenceLength(); p++) {
        uint64 bits = alphabet.letterToBits(seq[p]);

        // If the length of the current block is too big (which would
        // soon overflow the bit field storing length) write out a
        // block and reset the length.
        //
        if (b._len == SEQSTOREBLOCK_MAXLEN) {
          addSeqStoreBlock(BLOKmax, BLOKlen, BLOK, b, nBlockACGT, nBlockGAP, nACGT);
          b._pos = p;
          b._len = 0;
          b._bpf = DATA->tell() / 2;
        }

        if (bits == 0xff) {
          // This letter is NOT ACGT.  If the current block is an ACGT block, write it
          // and reset.
// if (b._isACGT == 1) { addSeqStoreBlock(BLOKmax, BLOKlen, BLOK, b, nBlockACGT, nBlockGAP, nACGT); b._isACGT = 0; b._iid = sic->getIID(); b._pos = p; b._len = 0; b._bpf = DATA->tell() / 2; } } else { // This letter is ACGT. If the current block is NOT an ACGT block, write it // and reset. // if (b._isACGT == 0) { addSeqStoreBlock(BLOKmax, BLOKlen, BLOK, b, nBlockACGT, nBlockGAP, nACGT); b._isACGT = 1; b._iid = sic->getIID(); b._pos = p; b._len = 0; b._bpf = DATA->tell() / 2; } } // Always add one to the length of the current block, and // write out the base if the letter is ACGT. // b._len++; if (bits != 0xff) DATA->putBits(bits, 2); } // Emit the last block // addSeqStoreBlock(BLOKmax, BLOKlen, BLOK, b, nBlockACGT, nBlockGAP, nACGT); } // If there is no sequence, the index record for this sequence is left blank. // nSequences++; C.tick(); delete sic; sic = inputseq->getSequenceInCore(); } // And a sentinel EOF block -- gets the last position in the file, // useful for the binary search. We always have a space block at // the end of the list, but we don't care if we just used the last // block (and so we don't bother to reallocate the array if it is // full). BLOK[BLOKlen]._isACGT = 0; BLOK[BLOKlen]._iid = uint32MASK(32); BLOK[BLOKlen]._pos = uint32MASK(31); BLOK[BLOKlen]._len = 0; BLOK[BLOKlen]._bpf = DATA->tell() / 2; BLOKlen++; // Update the header, assemble the final file. 
delete DATA; HEAD._magic[0] = SEQSTORE_MAGICNUMBER1; HEAD._magic[1] = SEQSTORE_MAGICNUMBER2; HEAD._pad = uint32ZERO; HEAD._numberOfSequences = nSequences; HEAD._numberOfACGT = nACGT; HEAD._numberOfBlocksACGT = nBlockACGT; HEAD._numberOfBlocksGAP = nBlockGAP; HEAD._numberOfBlocks = BLOKlen; HEAD._namesLength = NAMElen; HEAD._indexStart = uint64ZERO; HEAD._blockStart = uint64ZERO; HEAD._namesStart = uint64ZERO; errno = 0; FILE *F = fopen(filename, "r+"); if (errno) fprintf(stderr, "constructSeqStore()-- Failed to reopen '%s' to write data: %s\n", filename, strerror(errno)), exit(1); fseeko(F, 0, SEEK_END); HEAD._indexStart = ftello(F); fwrite(INDX, sizeof(seqStoreIndex), HEAD._numberOfSequences, F); fseeko(F, 0, SEEK_END); HEAD._blockStart = ftello(F); fwrite(BLOK, sizeof(seqStoreBlock), HEAD._numberOfBlocks, F); fseeko(F, 0, SEEK_END); HEAD._namesStart = ftello(F); fwrite(NAME, sizeof(char), HEAD._namesLength, F); fseeko(F, 0, SEEK_SET); fwrite(&HEAD, sizeof(seqStoreHeader), 1, F); fclose(F); if (errno) fprintf(stderr, "constructSeqStore()-- Failed to write data to '%s': %s\n", filename, strerror(errno)), exit(1); delete [] INDX; delete [] BLOK; delete [] NAME; // ESTmapper depends on this output. fprintf(stderr, "constructSeqStore()-- seqStore '%s' constructed (" F_U32 " sequences, " F_U64 " ACGT letters, " F_U32 " ACGT blocks, " F_U32 " GAP blocks).\n", filename, HEAD._numberOfSequences, HEAD._numberOfACGT, HEAD._numberOfBlocksACGT, HEAD._numberOfBlocksGAP); } canu-1.6/src/meryl/libleaff/seqStore.H000066400000000000000000000075731314437614700177350ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef SEQSTORE_H #define SEQSTORE_H #include "seqCache.H" #include "bitPackedFile.H" // A binary fasta file. // // HEADER // magic number // number of sequences // optional - alphabet size // optional - alphabet map (0x00 -> 'a', etc) // position of index start // position of data start // DATA // INDEX // position of sequence start in DATA // header length // sequence length // MAP // name to IID mapping struct seqStoreHeader { uint64 _magic[2]; uint32 _pad; uint32 _numberOfSequences; uint64 _numberOfACGT; uint32 _numberOfBlocksACGT; uint32 _numberOfBlocksGAP; uint32 _numberOfBlocks; uint32 _namesLength; uint64 _indexStart; uint64 _blockStart; uint64 _namesStart; }; // This index allows us to return a complete sequence // struct seqStoreIndex { uint32 _hdrPosition; // Offset into _names for the defline uint32 _hdrLength; // Length of the defline uint64 _seqPosition; // Offset into _bpf for the sequence data uint32 _seqLength; // Length, in bases, of the sequence uint32 _block; // The seqStoreBlock that starts this sequence }; // This index allows us to seek to a specific base in the // file of sequences. Each block is either: // ACGT - and has data // N - no data // It will map a specific ACGT location to the sequence, and the ID // of that sequence (seq ID and location in that sequence). 
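The block index comment above pins down exact field widths (1 bit for the ACGT flag, 32 for position and IID, 23 for length, 40 for the bit-file offset). A minimal standalone sketch of what those widths imply — the names `Block`, `bitMask`, and `wrap23` are illustrative only, not the real canu types:

```cpp
#include <cassert>
#include <cstdint>

// Analog of the uint64MASK(n) idiom: the largest value an n-bit field holds.
constexpr uint64_t bitMask(unsigned n) {
  return (n >= 64) ? ~0ULL : ((1ULL << n) - 1);
}

// Same field widths as seqStoreBlock below; names are illustrative only.
struct Block {
  uint64_t isACGT : 1;   // 1 if the block holds real bases, 0 for a gap
  uint64_t pos    : 32;  // position of the block in its sequence
  uint64_t iid    : 32;  // which sequence the block belongs to
  uint64_t len    : 23;  // block length, at most bitMask(23) bases
  uint64_t bpf    : 40;  // bit position of the packed bases in the data file
};

// Storing a too-large value in an unsigned bit-field wraps modulo 2^width,
// which is why constructSeqStore() flushes a block once it reaches the
// SEQSTOREBLOCK_MAXLEN limit instead of letting _len overflow.
inline uint64_t wrap23(uint64_t v) {
  Block b{};
  b.len = v;
  return b.len;
}
```

So a single block can describe at most 2^23-1 = 8388607 bases, and the SEQSTOREBLOCK_MAXPOS/MAXIID masks cap sequences at 2^32-1 bases and 2^32-1 IIDs.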
// struct seqStoreBlock { uint64 _isACGT:1; // block is acgt uint64 _pos:32; // position in sequence uint64 _iid:32; // iid of the sequence we are in uint64 _len:23; // length of block uint64 _bpf:40; // position in the bit file of sequence }; #define SEQSTOREBLOCK_MAXPOS uint64MASK(32) #define SEQSTOREBLOCK_MAXIID uint64MASK(32) #define SEQSTOREBLOCK_MAXLEN uint64MASK(23) class seqStore : public seqFile { protected: seqStore(const char *filename); seqStore(); public: ~seqStore(); protected: seqFile *openFile(const char *filename); public: uint32 find(const char *sequencename); uint32 getSequenceLength(uint32 iid); bool getSequence(uint32 iid, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax); bool getSequence(uint32 iid, uint32 bgn, uint32 end, char *s); private: void clear(void); void loadIndex(void); bitPackedFile *_bpf; seqStoreHeader _header; seqStoreIndex *_index; seqStoreBlock *_block; char *_names; bitPackedFile *_indexBPF; bitPackedFile *_blockBPF; bitPackedFile *_namesBPF; uint32 _lastIIDloaded; friend class seqFactory; }; // Construct a new seqStore 'filename' from input file 'inputseq'. // void constructSeqStore(char *filename, seqCache *inputseq); #endif // SEQSTORE_H canu-1.6/src/meryl/libleaff/seqStream.C000066400000000000000000000244271314437614700200640ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. 
Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "seqFactory.H" #include "seqStream.H" seqStream::seqStream(const char *filename) { _file = openSeqFile(filename); _string = 0L; _currentIdx = 0; _currentPos = 0; _streamPos = 0; _bufferMax = 1048576; _bufferLen = 0; _bufferPos = 0; _bufferSep = 0; _buffer = new char [_bufferMax + 1]; _idxLen = _file->getNumberOfSequences(); _idx = new seqStreamIndex [_idxLen + 1]; //fprintf(stderr, "seqStream::seqStream()-- Allocating " F_U64 "MB for seqStreamIndex on " F_U64 " sequences.\n", // _idxLen * sizeof(seqStreamIndex) / 1024 / 1024, _idxLen); _seqNumOfPos = 0L; _lengthOfSequences = 0; _eof = false; _separator = '.'; _separatorLength = 2; setSeparator('.', 2); _bgn = 0; _end = _lengthOfSequences; } seqStream::seqStream(const char *sequence, uint32 length) { _file = 0L; _string = (char *)sequence; _currentIdx = 0; _currentPos = 0; _streamPos = 0; _bufferMax = length; _bufferLen = length; _bufferPos = 0; _bufferSep = 0; _buffer = _string; _idxLen = 1; _idx = new seqStreamIndex [_idxLen + 1]; _seqNumOfPos = 0L; _idx[0]._iid = 0; _idx[0]._len = length; _idx[0]._bgn = 0; _idx[1]._iid = ~uint32ZERO; _idx[1]._len = 0; _idx[1]._bgn = length; _lengthOfSequences = length; _eof = false; _separator = '.'; _separatorLength = 20; _bgn = 0; _end = length; } seqStream::~seqStream() { if (_file) { delete _file; delete [] _buffer; } delete [] _idx; delete [] _seqNumOfPos; } void seqStream::setSeparator(char sep, uint32 len) { // Special case; no separator needed for string backed sequences. 
  if (_string)
    return;

  // Bizarre signedness issue with sep=255
  //   ST->get() == sep        FAILS
  //   x=ST->get(); x == sep   SUCCEEDS
  //
  // Not suggested to use non-printable ascii.

  if ((isprint(sep) == 0) ||
      (tolower(sep) == 'a') || (tolower(sep) == 'c') ||
      (tolower(sep) == 'g') || (tolower(sep) == 't')) {
    fprintf(stderr, "seqStream::setSeparator()-- ERROR! Separator letter must be printable ASCII and not [ACGTacgt].\n");
    exit(1);
  }

  if (len == 0) {
    fprintf(stderr, "seqStream::setSeparator()-- ERROR! Separator length cannot be zero.\n");
    exit(1);
  }

  _lengthOfSequences = 0;
  _separator         = sep;
  _separatorLength   = len;

  for (uint32 s=0; s<_idxLen; s++) {
    _idx[s]._iid = s;
    _idx[s]._len = _file->getSequenceLength(s);
    _idx[s]._bgn = _lengthOfSequences;
    _lengthOfSequences += _idx[s]._len;
  }

  _idx[_idxLen]._iid = ~uint32ZERO;
  _idx[_idxLen]._len = 0;
  _idx[_idxLen]._bgn = _lengthOfSequences;

  // Rebuild our sequence number of position map, if it exists.
  //
  if (_seqNumOfPos) {
    delete [] _seqNumOfPos;
    tradeSpaceForTime();
  }
}


void
seqStream::tradeSpaceForTime(void) {
  uint32 i = 0;
  uint32 s = 0;

  //fprintf(stderr, "Allocating " F_U32 " uint32s for seqNumOfPos.\n", _lengthOfSequences);

  _seqNumOfPos = new uint32 [_lengthOfSequences];

  for (i=0; i<_lengthOfSequences; i++) {

    // Increment the sequence number until we enter into the next
    // sequence.  Zero length sequences require the use of a 'while'
    // here.
    //
    while (i >= _idx[s+1]._bgn)
      s++;

    _seqNumOfPos[i] = s;
  }
}


unsigned char
seqStream::get(void) {

  if (_streamPos >= _end)
    _eof = true;

  if ((_eof == false) && (_bufferPos >= _bufferLen))
    fillBuffer();

  if (_eof)
    return(0);

  if (_bufferSep == 0) {
    _currentPos++;
    _streamPos++;
  } else {
    _bufferSep--;
  }

  return(_buffer[_bufferPos++]);
}


void
seqStream::rewind(void) {

  // Search for the correct spot.  Uncommon operation, be inefficient
  // but simple.  The range was checked to be good by setRange().
uint32 s = 0; uint64 l = 0; while ((s < _idxLen) && (l + _idx[s]._len < _bgn)) l += _idx[s++]._len; _eof = false; // (_bgn - l) is a 32-bit quanitity because of the second half of // the while above. Although _bgn is a 64-bit value, the value // used to set _bufferPos will be for that of a string constructor, // and so _bgn will be 32-bits. fillBuffer() resets _bufferPos if // we're backed by a file. _currentIdx = s; _currentPos = _bgn - l; _streamPos = _bgn; _bufferPos = _bgn; //fprintf(stderr, "seqStream::rewind()-- 1 currentIdx=" F_U32 " currentPos=" F_U32 " streamPos=" F_U32 " bufferPos=" F_U32 "\n", // _currentIdx, _currentPos, _streamPos, _bufferPos); fillBuffer(); //fprintf(stderr, "seqStream::rewind()-- 2 currentIdx=" F_U32 " currentPos=" F_U32 " streamPos=" F_U32 " bufferPos=" F_U32 "\n", // _currentIdx, _currentPos, _streamPos, _bufferPos); } void seqStream::setRange(uint64 bgn, uint64 end) { assert(bgn < end); uint32 s = 0; uint64 l = 0; while (s < _idxLen) l += _idx[s++]._len; if (end == ~uint64ZERO) end = l; if ((bgn > l) || (end > l)) fprintf(stderr, "seqStream::setRange()-- ERROR: range (" F_U64 "," F_U64 ") too big; only " F_U64 " positions.\n", bgn, end, l), exit(1); _bgn = bgn; _end = end; rewind(); } void seqStream::setPosition(uint64 pos) { assert(_bgn <= pos); assert( pos < _end); uint64 old = _bgn; _bgn = pos; rewind(); _bgn = old; } uint32 seqStream::sequenceNumberOfPosition(uint64 p) { uint32 s = ~uint32ZERO; // binary search on our list of start positions, to find the // sequence that p is in. 
if (_lengthOfSequences <= p) { fprintf(stderr, "seqStream::sequenceNumberOfPosition()-- WARNING: position p=" F_U64 " too big; only " F_U64 " positions.\n", p, _lengthOfSequences); return(s); } if (_seqNumOfPos) return(_seqNumOfPos[p]); if (_idxLen < 16) { for (s=0; s<_idxLen; s++) if ((_idx[s]._bgn <= p) && (p < _idx[s+1]._bgn)) break; } else { uint32 lo = 0; uint32 hi = _idxLen; uint32 md = 0; while (lo <= hi) { md = (lo + hi) / 2; if (p < _idx[md]._bgn) { // This block starts after the one we're looking for. hi = md; } else if ((_idx[md]._bgn <= p) && (p < _idx[md+1]._bgn)) { // Got it! lo = md + 1; hi = md; s = md; } else { // By default, then, the block is too low. lo = md; } } } return(s); } void seqStream::fillBuffer(void) { // Special case for when we're backed by a character string; there // is no need to fill the buffer. // if (_file == 0L) { if (_currentPos >= _end) _eof = true; return; } // Read bytes from the _file, stuff them into the buffer. Assumes // there is nothing in the buffer to save. _bufferLen = 0; _bufferPos = 0; // Still more stuff in the sequence? Get it. if (_currentPos < _idx[_currentIdx]._len) { #ifdef DEBUG fprintf(stderr, "seqStream::fillBuffer()-- More Seq currentPos=" F_U32 " len=" F_U32 "\n", _currentPos, _idx[_currentIdx]._len); #endif _bufferLen = MIN(_idx[_currentIdx]._len - _currentPos, _bufferMax); if (_file->getSequence(_idx[_currentIdx]._iid, _currentPos, _currentPos + _bufferLen, _buffer) == false) fprintf(stderr, "seqStream::fillBuffer()-- Failed to getSequence(part) #1 iid=" F_U32 " bgn=" F_U32 " end=" F_U32 "\n", _idx[_currentIdx]._iid, _currentPos, _currentPos + _bufferLen), exit(1); return; } // We've finished a sequence. Load the next. 
_currentPos = 0; _currentIdx++; while ((_currentIdx < _idxLen) && (_idx[_currentIdx]._len == 0)) _currentIdx++; #ifdef DEBUG fprintf(stderr, "seqStream::fillBuffer()-- New Seq currentPos=" F_U32 " len=" F_U32 "\n", _currentPos, _idx[_currentIdx]._len); #endif // All done if there is no more sequence. if (_currentIdx >= _idxLen) { _eof = true; return; } // Insert a separator. for (_bufferLen = 0; _bufferLen < _separatorLength; _bufferLen++) _buffer[_bufferLen] = _separator; // Keep track of the separator - this is used to make sure we don't // advance the sequence/stream position while the separator is // being returned. // _bufferSep = _bufferLen; // How much to get; minimum of what is left in the sequence, and // the buffer size. Don't forget about the separator we already // inserted! // uint32 bl = MIN(_idx[_currentIdx]._len - _currentPos, _bufferMax - _bufferLen); if (_file->getSequence(_idx[_currentIdx]._iid, _currentPos, _currentPos + bl, _buffer + _bufferLen) == false) fprintf(stderr, "seqStream::fillBuffer()-- Failed to getSequence(part) #2 iid=" F_U32 " bgn=" F_U32 " end=" F_U32 "\n", _idx[_currentIdx]._iid, _currentPos, _currentPos + bl), exit(1); _bufferLen += bl; // Load more, until buffer is full. Not really needed, and won't // improve performance much. AND it adds a lot of complexity to // track which sequence is current (_currentIdx). return; } canu-1.6/src/meryl/libleaff/seqStream.H000066400000000000000000000107371314437614700200700ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. 
* * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef SEQSTREAM_H #define SEQSTREAM_H #include "seqFile.H" struct seqStreamIndex { uint32 _iid; // seqFile IID uint32 _len; // length of the sequence uint64 _bgn; // begin position in the stream }; class seqStream { public: seqStream(const char *filename); seqStream(const char *sequence, uint32 length); ~seqStream(); // Separate sequences with this letter. Non-ACGT is always // returned as 'N'. Changing the length of the separator AFTER // setting the range will result in the wrong range being used. // void setSeparator(char sep, uint32 len); // get() returns one letter per input letter -- a gap of size n // will return n gap symbols. // unsigned char get(void); bool eof(void) { return(_eof); }; // Returns to the start of the range. // void rewind(void); // Set the range of ACGT sequence we will return. Coordinates are // space-based. Example: // // >one // AAA // >two // C // >three // GGG // // We separate these sequences with three '-' letters. // // strPos 012...3...456 // AAA---C---GGG // // range(0,0) -> nothing // range(0,1) -> A // range(0,3) -> AAA // range(0,4) -> AAAnnnC // range(0,5) -> AAAnnnCnnnG // void setRange(uint64 bgn, uint64 end); void setPosition(uint64 pos); // seqPos() is the position we are at in the current sequence; // seqIID() is the iid of that sequence; // strPos() is the position we are at in the chained sequence // // Values are not defined if the letter is a separator. 
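// The chained-coordinate rule documented above (range coordinates count only
// ACGT letters; separators are emitted between sequences but consume no
// positions) can be modeled by a toy standalone function -- `chainedRange`
// is illustrative only, not the real seqStream:
//
//   #include <cassert>
//   #include <cstdint>
//   #include <string>
//   #include <vector>
//
//   std::string chainedRange(const std::vector<std::string> &seqs,
//                            uint64_t bgn, uint64_t end,
//                            char sep = 'n', uint32_t sepLen = 3) {
//     std::string out;
//     uint64_t pos = 0;  // position in ACGT-only coordinates
//
//     for (size_t s = 0; s < seqs.size(); s++) {
//       if ((s > 0) && (pos > bgn) && (pos < end))
//         out.append(sepLen, sep);  // separator between two emitted sequences
//
//       for (char c : seqs[s])
//         if ((pos >= bgn) && (pos < end))
//           out += c, pos++;
//         else
//           pos++;
//     }
//
//     return out;
//   }
//
// With seqs = {AAA, C, GGG} this reproduces the table above:
// chainedRange(seqs, 0, 4) gives "AAAnnnC" and (0, 5) gives "AAAnnnCnnnG".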
// uint32 seqPos(void) { return(_currentPos); }; uint32 seqIID(void) { return(_idx[_currentIdx]._iid); }; uint64 strPos(void) { return(_streamPos); }; uint32 numberOfSequences(void) { return(_idxLen); }; // Return the length of, position of (in the chain) and IID of the // (s)th sequence in the chain. // uint32 lengthOf(uint32 s) { return((s >= _idxLen) ? ~uint32ZERO : _idx[s]._len); }; uint32 IIDOf(uint32 s) { return((s >= _idxLen) ? ~uint32ZERO : _idx[s]._iid); }; uint64 startOf(uint32 s) { return((s >= _idxLen) ? ~uint64ZERO : _idx[s]._bgn); }; // For a chain position p, returns the s (above) for that position. // uint32 sequenceNumberOfPosition(uint64 p); void tradeSpaceForTime(void); private: void fillBuffer(void); seqFile *_file; // Backed by a seqFile. char *_string; // Backed by a character string. uint64 _bgn; // Begin/End position in chained sequence uint64 _end; uint32 _currentIdx; // index into _idx of the current sequence uint32 _currentPos; // position in the current sequence uint64 _streamPos; // position in the chained sequence // Buffer for holding sequence from the seqFile. uint32 _bufferMax; // Max size of the buffer uint32 _bufferLen; // Actual size of the buffer uint32 _bufferPos; // Position we are at in the buffer uint32 _bufferSep; // How much of the buffer is separator char *_buffer; // Info about the raw sequences uint32 _idxLen; seqStreamIndex *_idx; uint32 *_seqNumOfPos; uint64 _lengthOfSequences; bool _eof; char _separator; uint32 _separatorLength; }; #endif // SEQSTREAM_H canu-1.6/src/meryl/libleaff/sffFile.C000066400000000000000000000143661314437614700174770ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-06 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "sffFile.H" #include "bitOperations.H" // Lots of ths came from AS_GKP_sff.c sffFile::sffFile() { clear(); } sffFile::sffFile(const char *name) { clear(); strcpy(_filename, name); _rb = new readBuffer(name); _rb->read(&_header, 31); if (_header.magic_number != 0x2e736666) { _header.swap_endianess = 1; _header.magic_number = uint32Swap(_header.magic_number); _header.index_offset = uint64Swap(_header.index_offset); _header.index_length = uint32Swap(_header.index_length); _header.number_of_reads = uint32Swap(_header.number_of_reads); _header.header_length = uint16Swap(_header.header_length); _header.key_length = uint16Swap(_header.key_length); _header.number_of_flows_per_read = uint16Swap(_header.number_of_flows_per_read); } assert(_header.magic_number == 0x2e736666); assert(_header.number_of_flows_per_read < SFF_NUMBER_OF_FLOWS_MAX); assert(_header.key_length < SFF_KEY_SEQUENCE_MAX); _rb->read(_header.flow_chars, sizeof(char) * _header.number_of_flows_per_read); _rb->read(_header.key_sequence, sizeof(char) * _header.key_length); _firstReadLocation = _header.header_length; // The spec says the index might be here, however, all files I've // seen have the index at the end of the 
file. // if ((_header.index_length > 0) && (_header.index_offset == _header.header_length)) _firstReadLocation += _header.index_length; // Index // _index = new sffIndex [_header.number_of_reads]; for (uint64 i=0; i<_header.number_of_reads; i++) { uint64 pos = _rb->tell(); _rb->read(&_read, 16); if (_header.swap_endianess) { _read.read_header_length = uint16Swap(_read.read_header_length); _read.name_length = uint16Swap(_read.name_length); _read.number_of_bases = uint32Swap(_read.number_of_bases); } _index[i]._seqPos = pos; _index[i]._seqLen = _read.number_of_bases; _index[i]._namLen = _read.name_length; pos += _read.read_header_length; pos += sizeof(uint16) * _header.number_of_flows_per_read; pos += sizeof(uint8) * _read.number_of_bases; pos += sizeof(char) * _read.number_of_bases; pos += sizeof(uint8) * _read.number_of_bases; pos += (_header.number_of_flows_per_read * sizeof(uint16) + _read.number_of_bases * sizeof(uint8) + _read.number_of_bases * sizeof(char) + _read.number_of_bases * sizeof(uint8)) % 8; _rb->seek(pos); } // // Index _rb->seek(_firstReadLocation); _numberOfSequences = _header.number_of_reads; } sffFile::~sffFile() { delete _rb; delete [] _index; } //////////////////////////////////////// seqFile * sffFile::openFile(const char *name) { struct stat st; // Open the file, return if it matches the SFF magic_number. 
errno = 0; stat(name, &st); if (errno) return(0L); if ((st.st_mode & S_IFREG) == 0) return(0L); FILE *F = fopen(name, "r"); if (errno) { fprintf(stderr, "sffFile::openFile()- failed to open '%s': %s\n", name, strerror(errno)); return(0L); } uint32 magic_number = 0; AS_UTL_safeRead(F, &magic_number, "sff magic_number", sizeof(uint32), 1); fclose(F); if ((magic_number == 0x2e736666) || (uint32Swap(magic_number) == 0x2e736666)) return(new sffFile(name)); return(0L); } bool sffFile::getSequence(uint32 iid, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax) { if (iid > _header.number_of_reads) return(false); memset(&_read, 0, sizeof(sffRead)); _rb->seek(_index[iid]._seqPos); _rb->read(&_read, 16); if (_header.swap_endianess) { _read.read_header_length = uint16Swap(_read.read_header_length); _read.name_length = uint16Swap(_read.name_length); _read.number_of_bases = uint32Swap(_read.number_of_bases); _read.clip_quality_left = uint16Swap(_read.clip_quality_left); _read.clip_quality_right = uint16Swap(_read.clip_quality_right); _read.clip_adapter_left = uint16Swap(_read.clip_adapter_left); _read.clip_adapter_right = uint16Swap(_read.clip_adapter_right); } assert(_read.read_header_length < SFF_NAME_LENGTH_MAX); assert(_read.number_of_bases < SFF_NUMBER_OF_BASES_MAX); _rb->read(_read.name, sizeof(char) * _read.name_length); _read.name[_read.name_length] = 0; uint64 pos = _rb->tell(); pos += _read.read_header_length; pos += sizeof(uint16) * _header.number_of_flows_per_read; pos += sizeof(uint8) * _read.number_of_bases; _rb->seek(pos); _rb->read(_read.bases, sizeof(char) * _read.number_of_bases); _read.bases[_read.number_of_bases] = 0; return(true); } bool sffFile::getSequence(uint32 iid, uint32 bgn, uint32 end, char *s) { if (iid > _header.number_of_reads) return(false); // Same as above, mostly. 
return(false); } void sffFile::clear(void) { memset(_filename, 0, FILENAME_MAX); memset(_typename, 0, FILENAME_MAX); strcpy(_typename, "SFF"); _randomAccessSupported = true; _numberOfSequences = 0; _rb = 0L; memset(&_header, 0, sizeof(sffHeader)); memset(&_read, 0, sizeof(sffRead)); _index = 0L; _firstReadLocation = 0; _readIID = 0; } canu-1.6/src/meryl/libleaff/sffFile.H000066400000000000000000000066051314437614700175010ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef SFF_H #define SFF_H #include "seqFile.H" #define SFF_KEY_SEQUENCE_MAX 64 #define SFF_NAME_LENGTH_MAX 256 #define SFF_NUMBER_OF_FLOWS_MAX 512 #define SFF_NUMBER_OF_BASES_MAX 2048 // The assembler itself cannot handle longer struct sffHeader { // The next block is read in one swoop from the sff file. DO NOT MODIFY! 
uint32 magic_number; char version[4]; uint64 index_offset; uint32 index_length; uint32 number_of_reads; uint16 header_length; uint16 key_length; uint16 number_of_flows_per_read; uint8 flowgram_format_code; char flow_chars[SFF_NUMBER_OF_FLOWS_MAX]; // h->number_of_flows_per_read char key_sequence[SFF_KEY_SEQUENCE_MAX]; // h->key_length uint32 swap_endianess; }; struct sffRead { // The next block is read in one swoop from the sff file. DO NOT MODIFY! uint16 read_header_length; uint16 name_length; uint32 number_of_bases; uint16 clip_quality_left; uint16 clip_quality_right; uint16 clip_adapter_left; uint16 clip_adapter_right; char name[SFF_NAME_LENGTH_MAX]; // r->name_length uint16 flowgram_values[SFF_NUMBER_OF_FLOWS_MAX]; // h->number_of_flows_per_read uint8 flow_index_per_base[SFF_NUMBER_OF_BASES_MAX]; // r->number_of_bases char bases[SFF_NUMBER_OF_BASES_MAX]; // r->number_of_bases uint8 quality_scores[SFF_NUMBER_OF_BASES_MAX]; // r->number_of_bases char quality[SFF_NUMBER_OF_BASES_MAX]; // quality_scores converted to CA-format qv }; struct sffIndex { uint64 _seqPos; uint32 _seqLen; uint32 _namLen; }; class sffFile : public seqFile { protected: sffFile(const char *filename); sffFile(); public: ~sffFile(); protected: seqFile *openFile(const char *name); public: uint32 find(const char *sequencename) { assert(0); return(0); }; uint32 getSequenceLength(uint32 iid) { return(_index[iid]._seqLen); }; bool getSequence(uint32 iid, char *&h, uint32 &hLen, uint32 &hMax, char *&s, uint32 &sLen, uint32 &sMax); bool getSequence(uint32 iid, uint32 bgn, uint32 end, char *s); private: void clear(void); readBuffer *_rb; sffHeader _header; sffRead _read; sffIndex *_index; uint64 _firstReadLocation; uint64 _readIID; friend class seqFactory; }; #endif // SFF_H canu-1.6/src/meryl/libleaff/test-correctSequence.H000066400000000000000000000105451314437614700222300ustar00rootroot00000000000000 /****************************************************************************** * * This file is 
part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef TEST_CORRECTSEQUENCE_H #define TEST_CORRECTSEQUENCE_H //#define WITH_WHITESPACE struct correctSequence_t { char header[256]; uint32 headerLength; char *sequence; uint32 sequenceLength; }; correctSequence_t *correctSequence = 0L; mt_s *mtctx = 0L; char *chainSeq; uint32 *chainSeqPos; uint32 *chainSeqIID; uint64 *chainStrPos; void generateCorrectSequence(uint32 minLen, uint32 maxLen, uint32 numSeq) { char bases[4] = {'A', 'C', 'G', 'T'}; uint32 n = numSeq; uint32 s = minLen; uint32 l = maxLen; uint32 seed = (uint32)(getTime() * 1000); fprintf(stderr, "generateCorrectSequence()-- Using seed "uint32FMT"\n", seed); fprintf(stderr, "generateCorrectSequence()-- Generating "uint32FMT" sequences of length "uint32FMT" to "uint32FMT"\n", numSeq, minLen, maxLen); correctSequence = new correctSequence_t [n]; mtctx = mtInit(seed); FILE *F = fopen("test-correctSequence.fasta", "w"); for (uint32 i=0; i%s\n", correctSequence[i].header); for (uint32 r=mtRandom32(mtctx) % 4; r--; ) fprintf(F, "\n"); for (uint32 p=0; p%s\n", correctSequence[i].header); fprintf(F, "%s\n", correctSequence[i].sequence); #endif } for (uint32 r=mtRandom32(mtctx) % 4; r--; ) fprintf(F, "\n"); fclose(F); } void generateChainedAnswer(uint32 numSeq, char sep, uint32 sepLen) { uint32 maxLen = 0; for (uint32 i=0; inextMer(); MS->theFMer().merToString(testmer); if (verbose) { fprintf(stdout, "MS 
pos="uint32FMT" posInSeq="uint64FMT" posInStr="uint64FMT" seqNum="uint64FMT"\n", pos, MS->thePositionInSequence(), MS->thePositionInStream(), MS->theSequenceNumber()); if (strncmp(testmer, seq + pos, merSize)) fprintf(stdout, "MS pos="uint32FMT" failed '%s' != '%s'.\n", pos, testmer, seq + pos); } assert(nm == true); assert(MS->thePositionInSequence() == SP[pos]); assert(MS->thePositionInStream() == SP[pos]); assert(MS->theSequenceNumber() == 0); assert(strncmp(testmer, seq + pos, merSize) == 0); pos++; } // Should have no more mers nm = MS->nextMer(); assert(nm == false); return(err); } uint32 testMerStreamOperation(merStream *MS, uint32 beg, uint32 end, uint32 sepLen) { uint32 err = 0; char fmerstr[256]; char rmerstr[256]; char cmerstr[256]; char tmerstr[256]; while (MS->nextMer()) { MS->theFMer().merToString(fmerstr); MS->theRMer().merToString(rmerstr); MS->theCMer().merToString(cmerstr); if ((strcmp(fmerstr, cmerstr) != 0) && ((strcmp(rmerstr, cmerstr) != 0))) { fprintf(stderr, "mer strings disagree; F:%s R:%s C:%s\n", fmerstr, rmerstr, cmerstr); FAIL(); } reverseComplementSequence(rmerstr, strlen(rmerstr)); if (strcmp(fmerstr, rmerstr) != 0) { fprintf(stderr, "mer strings disagree after reverse; F:%s R:%s\n", fmerstr, rmerstr); FAIL(); } uint32 pseq = MS->thePositionInSequence(); uint32 pstr = MS->thePositionInStream(); uint32 piid = MS->theSequenceNumber(); uint32 mersize = MS->theFMer().getMerSize(); uint32 merspan = MS->theFMer().getMerSpan(); #if 0 if (beg > 10) { uint32 pp = pstr + piid * sepLen - 10; uint32 xx = 0; fprintf(stderr, "beg="uint32FMT" pstr="uint32FMT" '", beg, pstr); for (xx=0; xx<10; xx++, pp++) fprintf(stderr, "%c", chainSeq[pp]); fprintf(stderr, ":"); for (xx=0; xx 1) { ST = new seqStream("test-correctSequence.fasta"); ST->setSeparator(sep, sepLen); } else { ST = new seqStream(correctSequence[0].sequence, correctSequence[0].sequenceLength); } MS = new merStream(KB, ST, true, true); uint32 maxLen = ST->startOf(numSeq-1) + 
ST->lengthOf(numSeq-1); // Whole thing, rewind, whole thing fprintf(stderr, "whole thing.\n"); err += testMerStreamOperation(MS, 0, maxLen, sepLen); MS->rewind(); err += testMerStreamOperation(MS, 0, maxLen, sepLen); // Random subsets - we're not terribly interested in streaming, // just getting the start/end correct. fprintf(stderr, "subsets.\n"); for (uint32 iter=0; iter<500; iter++) { uint32 beg = mtRandom32(mtctx) % maxLen; uint32 end = (beg + 10000 < maxLen) ? (beg + 10000) : maxLen; //fprintf(stderr, "subsets - "uint32FMT"-"uint32FMT"\n", beg, end); MS->setBaseRange(beg, end); err += testMerStreamOperation(MS, beg, end, sepLen); MS->rewind(); err += testMerStreamOperation(MS, beg, end, sepLen); } delete MS; return(err); } int main(int argc, char **argv) { uint32 minLen = 1000; uint32 maxLen = 200000; uint32 numSeq = 1000; uint32 err = 0; // Very simple merStream test { fprintf(stdout, "merStream(kMerBuilder(20), ...)\n"); merStream *MS = new merStream(new kMerBuilder(20), new seqStream("GGGTCAACTCCGCCCGCACTCTAGC", 25), true, true); uint32 SP[10] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }; testMerStreamSimple(MS, 20, "GGGTCAACTCCGCCCGCACTCTAGC", SP); MS->rewind(); testMerStreamSimple(MS, 20, "GGGTCAACTCCGCCCGCACTCTAGC", SP); MS->rewind(); MS->rewind(); testMerStreamSimple(MS, 20, "GGGTCAACTCCGCCCGCACTCTAGC", SP); delete MS; fprintf(stdout, "merStream(kMerBuilder(20), ...) - PASSED\n"); } { fprintf(stdout, "merStream(kMerBuilder(20, 1), ...)\n"); merStream *MS = new merStream(new kMerBuilder(20, 1), new seqStream("GGGAATTTTCAACTCCGCCCGCACTCTAGCCCAAA", 35), true, true); uint32 SP[10] = { 0, 3, 5, 9, 10, 12 }; testMerStreamSimple(MS, 20, "GATCACTCGCGCACTCTAGCA", SP); MS->rewind(); testMerStreamSimple(MS, 20, "GATCACTCGCGCACTCTAGCA", SP); MS->rewind(); MS->rewind(); testMerStreamSimple(MS, 20, "GATCACTCGCGCACTCTAGCA", SP); delete MS; fprintf(stdout, "merStream(kMerBuilder(20, 1), ...) 
- PASSED\n"); } // Move on to harder tests generateCorrectSequence(minLen, maxLen, numSeq); // Tests seqStream(string, strlen) construction method fprintf(stderr, "err += testMerStream(new kMerBuilder(20, 0, 0L), 1, '.', 1);\n"); err += testMerStream(new kMerBuilder(20, 0, 0L), 1, '.', 1); fprintf(stderr, "err += testMerStream(new kMerBuilder(22, 1, 0L), 1, '.', 1);\n"); err += testMerStream(new kMerBuilder(22, 1, 0L), 1, '.', 1); // Tests seqStream(filename) construction method fprintf(stderr, "err += testMerStream(new kMerBuilder(20, 0, 0L), numSeq, '.', 1);\n"); err += testMerStream(new kMerBuilder(20, 0, 0L), numSeq, '.', 1); fprintf(stderr, "err += testMerStream(new kMerBuilder(28, 0, 0L), numSeq, '.', 100);\n"); err += testMerStream(new kMerBuilder(28, 0, 0L), numSeq, '.', 100); fprintf(stderr, "err += testMerStream(new kMerBuilder(24, 4, 0L), numSeq, '.', 100);\n"); err += testMerStream(new kMerBuilder(24, 4, 0L), numSeq, '.', 100); removeCorrectSequence(numSeq); if (err == 0) fprintf(stderr, "Success!\n"); exit(err > 0); } canu-1.6/src/meryl/libleaff/test-seqCache.C000066400000000000000000000134731314437614700206100ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "util.h" #include "seqCache.H" #include "seqStream.H" #include "merStream.H" #include "test-correctSequence.H" uint32 testSeqVsCorrect(seqInCore *S, uint32 testID) { uint32 err = 0; if (S == 0L) { fprintf(stderr, "testID:"uint32FMT" - empty sequence\n", testID); return(1); } uint32 sid = S->getIID(); if (strcmp(S->header(), correctSequence[sid].header) != 0) { fprintf(stderr, "testID:"uint32FMT" - header differs '%s' vs '%s'\n", testID, S->header(), correctSequence[sid].header); err++; } if (S->headerLength() != correctSequence[sid].headerLength) { fprintf(stderr, "testID:"uint32FMT" - header length differs "uint32FMT" vs "uint32FMT"\n", testID, S->headerLength(), correctSequence[sid].headerLength); err++; } if (strcmp(S->sequence(), correctSequence[sid].sequence) != 0) { fprintf(stderr, "testID:"uint32FMT" - sequence differs\n", testID); err++; } if (strlen(S->sequence()) != correctSequence[sid].sequenceLength) { fprintf(stderr, "testID:"uint32FMT" - sequence length differs strlen "uint32FMT" vs "uint32FMT"\n", testID, (uint32)strlen(S->sequence()), correctSequence[sid].sequenceLength); err++; } if (S->sequenceLength() != correctSequence[sid].sequenceLength) { fprintf(stderr, "testID:"uint32FMT" - sequence length differs "uint32FMT" vs "uint32FMT"\n", testID, S->sequenceLength(), correctSequence[sid].sequenceLength); err++; } return(err); } uint32 testSeqCacheIDLookups(seqCache *SC) { uint32 err = 0; uint32 numSeq = SC->getNumberOfSequences(); double start = getTime(); // 1 - getSequenceIID() fprintf(stderr, "1 - getSequenceIID()\n"); for (uint32 sid=0; sid<numSeq; sid++) { if (sid != SC->getSequenceIID(correctSequence[sid].header)) { fprintf(stderr, "2 - failed to find name '%s'\n", correctSequence[sid].header); err++; } } fprintf(stderr, "Test took %f seconds.\n", getTime() - start); return(err); } uint32 testSeqCache(seqCache *SC) { uint32 err = 0; uint32 numSeq = SC->getNumberOfSequences(); seqInCore *S = 0L; double start = getTime(); // 0 - getSequenceLength() fprintf(stderr, "0
- getSequenceLength()\n"); for (uint32 sid=0; sid<numSeq; sid++) if (SC->getSequenceLength(sid) != correctSequence[sid].sequenceLength) { fprintf(stderr, "1 - length differs.\n"); err++; } // 2 - stream with getSequenceInCore() fprintf(stderr, "2 - stream with getSequenceInCore()\n"); S = SC->getSequenceInCore(); while (S != 0L) { err += testSeqVsCorrect(S, 2); delete S; S = SC->getSequenceInCore(); } // 3 - iterate with getSequenceInCore(sid++) fprintf(stderr, "3 - iterate with getSequenceInCore(sid++)\n"); for (uint32 sid=0; sid<numSeq; sid++) { S = SC->getSequenceInCore(sid); err += testSeqVsCorrect(S, 3); delete S; } // 4 - random with getSequenceInCore(sid) fprintf(stderr, "4 - random with getSequenceInCore(sid)\n"); for (uint32 cnt=0; cnt<4*numSeq; cnt++) { uint32 sid = mtRandom32(mtctx) % numSeq; S = SC->getSequenceInCore(sid); err += testSeqVsCorrect(S, 4); delete S; } fprintf(stderr, "Test took %f seconds.\n", getTime() - start); return(err); } int main(int argc, char **argv) { uint32 minLen = 100; uint32 maxLen = 2000; uint32 numSeq = 100000; seqCache *SC = 0L; uint32 err = 0; generateCorrectSequence(minLen, maxLen, numSeq); fprintf(stderr, "seqCache(file, 0, true) (ID lookups)\n"); SC = new seqCache("test-correctSequence.fasta", 0, true); //err += testSeqCacheIDLookups(SC); delete SC; fprintf(stderr, "seqCache(file, 0, true)\n"); SC = new seqCache("test-correctSequence.fasta", 0, true); err += testSeqCache(SC); delete SC; fprintf(stderr, "seqCache(file, 1, true)\n"); SC = new seqCache("test-correctSequence.fasta", 1, true); err += testSeqCache(SC); delete SC; fprintf(stderr, "seqCache(file, 2, true)\n"); SC = new seqCache("test-correctSequence.fasta", 2, true); err += testSeqCache(SC); delete SC; fprintf(stderr, "seqCache(file, 4, true)\n"); SC = new seqCache("test-correctSequence.fasta", 4, true); err += testSeqCache(SC); delete SC; fprintf(stderr, "seqCache(file, 8, true)\n"); SC = new seqCache("test-correctSequence.fasta", 8, true); err += testSeqCache(SC); delete SC; fprintf(stderr, "seqCache(file,
32, true)\n"); SC = new seqCache("test-correctSequence.fasta", 32, true); err += testSeqCache(SC); delete SC; fprintf(stderr, "seqCache(file, 200, true)\n"); SC = new seqCache("test-correctSequence.fasta", 200, true); err += testSeqCache(SC); delete SC; fprintf(stderr, "seqCache(file, 1000000, true)\n"); SC = new seqCache("test-correctSequence.fasta", 1000000, true); err += testSeqCache(SC); delete SC; fprintf(stderr, "seqCache(file, 0, true) -- loadAllSequence\n"); SC = new seqCache("test-correctSequence.fasta", 0, true); SC->loadAllSequences(); err += testSeqCache(SC); delete SC; removeCorrectSequence(numSeq); if (err == 0) fprintf(stderr, "Success!\n"); exit(err > 0); } canu-1.6/src/meryl/libleaff/test-seqStream.C000066400000000000000000000205331314437614700210330ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "util.h" #include "seqCache.H" #include "seqStream.H" #include "merStream.H" #include "test-correctSequence.H" #define FAIL() { err++; assert(0); } uint32 testIndexing(uint32 numSeq, char sep, uint32 sepLen) { uint32 err = 0; seqStream *ST = 0L; fprintf(stderr, "testIndexing()-- numSeq="uint32FMT" sep=%c sepLen="uint32FMT"\n", numSeq, sep, sepLen); generateChainedAnswer(numSeq, sep, sepLen); if (numSeq > 1) { ST = new seqStream("test-correctSequence.fasta"); ST->setSeparator(sep, sepLen); } else { ST = new seqStream(correctSequence[0].sequence, correctSequence[0].sequenceLength); } uint32 maxLen = ST->startOf(numSeq-1) + ST->lengthOf(numSeq-1); // Basic checks on the reverse lookup - this is state independent; // it changes only based on the separator length. In other words, // there is no need to check this while iterating through the // seqStream. fprintf(stderr, "IGNORE THIS WARNING: "); if (ST->sequenceNumberOfPosition(maxLen) != ~uint32ZERO) { fprintf(stderr, "maxLen too small.\n"); FAIL(); } if (ST->sequenceNumberOfPosition(maxLen - 1) == ~uint32ZERO) { fprintf(stderr, "maxLen too big.\n"); FAIL(); } // Check all lookups - lengthOf() and IIDOf() are implicitly // checked by the operation of seqStream (get() mostly). startOf() // isn't, but inserting errors in setRange() led to // infinite-looking loops. 
uint64 pos = 0; uint64 sta = 0; for (uint32 sid=0; sid<numSeq; sid++) { if (ST->lengthOf(sid) != correctSequence[sid].sequenceLength) { fprintf(stderr, "lengthOf "uint32FMT" returned "uint32FMT", not correct "uint32FMT"\n", sid, ST->lengthOf(sid), correctSequence[sid].sequenceLength); FAIL(); } if (ST->startOf(sid) != sta) { fprintf(stderr, "startOf "uint32FMT" returned "uint64FMT", not correct "uint64FMT"\n", sid, ST->startOf(sid), sta); FAIL(); } if (ST->IIDOf(sid) != sid) { fprintf(stderr, "IIDOf "uint32FMT" returned "uint32FMT", not correct "uint32FMT"\n", sid, ST->IIDOf(sid), sid); FAIL(); } sta += correctSequence[sid].sequenceLength; for (uint32 ppp=0; ppp<correctSequence[sid].sequenceLength; ppp++, pos++) { if (ST->sequenceNumberOfPosition(pos) != sid) { fprintf(stderr, "sequenceNumberOfPosition "uint64FMT" returned "uint32FMT", not correct "uint32FMT".\n", pos, ST->sequenceNumberOfPosition(pos), sid); FAIL(); } } } if (pos != maxLen) { fprintf(stderr, "maxLen wrong.\n"); FAIL(); } // Check the separator. Seek to a spot right before one, and count // that we have the correct length. More rigorously tested in // testChaining().
for (uint32 sid=0; sid<numSeq-1; sid++) { ST->setRange(ST->startOf(sid) + ST->lengthOf(sid)-1, ~uint64ZERO); ST->get(); for (uint32 x=0; x<sepLen; x++) { char s = ST->get(); if (s != sep) { fprintf(stderr, "wrong separator at sep "uint32FMT" got %d expected %d\n", x, s, sep); FAIL(); } } if (ST->get() == sep) { fprintf(stderr, "too many separators!\n"); FAIL(); } } delete ST; return(err); } uint32 testSeqStream(seqStream *ST, uint32 sib, uint32 sie, char sep) { uint32 err = 0; while (ST->eof() == false) { uint32 sp = ST->seqPos(); uint32 si = ST->seqIID(); uint64 st = ST->strPos(); char ch = ST->get(); if (ch != 0) { if (ch != chainSeq[sib]) { fprintf(stderr, "sp="uint32FMT" si="uint32FMT" st="uint64FMT" ch=%c -- letter wrong got '%c'\n", sp, si, st, ch, chainSeq[sib]); FAIL(); } if ((ch != sep) && (sp != chainSeqPos[sib])) { fprintf(stderr, "sp="uint32FMT" si="uint32FMT" st="uint64FMT" ch=%c -- seqPos wrong got "uint32FMT"\n", sp, si, st, ch, chainSeqPos[sib]); FAIL(); } if ((ch != sep) && (si != chainSeqIID[sib])) { fprintf(stderr, "sp="uint32FMT" si="uint32FMT" st="uint64FMT" ch=%c -- seqIID wrong got "uint32FMT"\n", sp, si, st, ch, chainSeqIID[sib]); FAIL(); } if ((ch != sep) && (st != chainStrPos[sib])) { fprintf(stderr, "sp="uint32FMT" si="uint32FMT" st="uint64FMT" ch=%c -- strPos wrong got "uint64FMT"\n", sp, si, st, ch, chainStrPos[sib]); FAIL(); } sib++; } } if (sib != sie) { fprintf(stderr, "iterated length wrong; sib="uint32FMT" sie="uint32FMT"\n", sib, sie); FAIL(); } return(err); } uint32 testChaining(uint32 numSeq, char sep, uint32 sepLen) { uint32 err = 0; seqStream *ST = 0L; fprintf(stderr, "testChaining()-- numSeq="uint32FMT" sep=%c sepLen="uint32FMT"\n", numSeq, sep, sepLen); generateChainedAnswer(numSeq, sep, sepLen); if (numSeq > 1) { ST = new seqStream("test-correctSequence.fasta"); ST->setSeparator(sep, sepLen); } else { ST = new seqStream(correctSequence[0].sequence, correctSequence[0].sequenceLength); } // Do a test on the whole thing.
{ uint32 sib = 0; uint32 sie = strlen(chainSeq); fprintf(stderr, "initial test with full range\n"); testSeqStream(ST, sib, sie, sep); fprintf(stderr, "initial test with full range (rewind)\n"); ST->rewind(); testSeqStream(ST, sib, sie, sep); } // Set the range to random values, and check all the results. // We've already verified the index works, so we're free to use // that (but we currently don't). uint32 maxLen = ST->startOf(numSeq-1) + ST->lengthOf(numSeq-1); fprintf(stderr, "test on subranges\n"); for (uint32 iter=0; iter<500; iter++) { uint32 beg = mtRandom32(mtctx) % maxLen; uint32 end = mtRandom32(mtctx) % maxLen; if (beg > end) { uint32 t = end; end = beg; beg = t; } ST->setRange(beg, end); // Compute the position in our stream for the ACGT based beg and // end. The quirk here is that our stream includes the // separator. uint32 sib = 0; // chainSeq position uint32 sie = 0; for (uint32 ppp=0, sid=0; sideof() == false) ST->get(); ST->rewind(); } else { //fprintf(stderr, "Random iter "uint32FMT"\n", iter); } testSeqStream(ST, sib, sie, sep); } return(err > 0); } int main(int argc, char **argv) { uint32 minLen = 100; uint32 maxLen = 20000; uint32 numSeq = 1000; uint32 err = 0; generateCorrectSequence(minLen, maxLen, numSeq); // Tests seqStream(string, strlen) construction method err += testIndexing(1, '.', 1); err += testChaining(1, '.', 1); // Tests seqStream(filename) construction method err += testIndexing(numSeq, '.', 1); err += testIndexing(numSeq, ':', 10); err += testIndexing(numSeq, 'z', 100); err += testIndexing(numSeq, '-', 1000); err += testChaining(numSeq, '.', 1); err += testChaining(numSeq, ':', 10); err += testChaining(numSeq, 'z', 100); err += testChaining(numSeq, '-', 1000); removeCorrectSequence(numSeq); if (err == 0) fprintf(stderr, "Success!\n"); exit(err > 0); } 
canu-1.6/src/meryl/libleaff/test/000077500000000000000000000000001314437614700167625ustar00rootroot00000000000000canu-1.6/src/meryl/libleaff/test/Makefile000066400000000000000000000011671314437614700204270ustar00rootroot00000000000000 PROG = test-merstream-speed INCLUDE = -I.. -I../../libutil -I../../libbio -I../../libseq LIBS = -L.. -L../../libutil -L../../libbio -L../../libseq -lseq -lbio -lutil -lm OBJS = include ../../Make.compilers all: $(PROG) @echo Tests passed! test-merstream-speed: test-merstream-speed.C $(CXX) $(CXXFLAGS_COMPILE) -c -o test-merstream-speed.o test-merstream-speed.C $(INCLUDE) $(CXX) $(CXXLDFLAGS) -o test-merstream-speed test-merstream-speed.o $(LIBS) ../../leaff/leaff -G 10000 1000 10000 > junk.fasta cat junk.fasta > /dev/null ./test-merstream-speed junk.fasta rm -f junk* clean: rm -f $(PROG) *.o *junk* canu-1.6/src/meryl/libleaff/test/test-merstream-speed.C000066400000000000000000000041531314437614700231430ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include <stdio.h> #include <stdlib.h> #include "bio++.H" #include "seqCache.H" #include "seqStream.H" #include "merStream.H" int main(int argc, char **argv) { speedCounter *C = 0L; FILE *F = 0L; seqStream *S = 0L; merStream *M = 0L; if (argc != 2) { fprintf(stderr, "usage: %s some.fasta\n", argv[0]); fprintf(stderr, "Reads some.fasta using fgetc(), the seqStream and the merStream,\n"); fprintf(stderr, "reporting the speed of each method.\n"); exit(1); } //////////////////////////////////////// F = fopen(argv[1], "r"); C = new speedCounter("fgetc(): %7.2f Mthings -- %5.2f Mthings/second\r", 1000000.0, 0x3fffff, true); while (!feof(F)) fgetc(F), C->tick(); delete C; fclose(F); //////////////////////////////////////// S = new seqStream(argv[1]); C = new speedCounter("seqStream: %7.2f Mthings -- %5.2f Mthings/second\r", 1000000.0, 0x3fffff, true); while (S->get()) C->tick(); delete C; delete S; //////////////////////////////////////// M = new merStream(new kMerBuilder(20), new seqStream(argv[1]), true, true); C = new speedCounter("seqStream -> merStream: %7.2f Mthings -- %5.2f Mthings/second\r", 1000000.0, 0x3fffff, true); while (M->nextMer()) C->tick(); delete C; delete M; exit(0); } canu-1.6/src/meryl/libmeryl.C000066400000000000000000000420461314437614700161700ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libmeryl/libmeryl.C * * Modifications by: * * Brian P.
Walenz from 2003-SEP-08 to 2004-APR-08 * are Copyright 2003-2004 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2004-APR-30 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-MAY-23 to 2014-APR-11 * are Copyright 2005-2008,2012,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2015-JUL-01 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-29 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "AS_UTL_fileIO.H" #include "libmeryl.H" // Version 3 ?? // Version 4 removed _histogramHuge, dynamically sizing it on write. // 0123456789012345 static char *ImagicV = "merylStreamIv04\n"; static char *ImagicX = "merylStreamIvXX\n"; static char *DmagicV = "merylStreamDv04\n"; static char *DmagicX = "merylStreamDvXX\n"; static char *PmagicV = "merylStreamPv04\n"; static char *PmagicX = "merylStreamPvXX\n"; merylStreamReader::merylStreamReader(const char *fn_, uint32 ms_) { char idxname[FILENAME_MAX]; char datname[FILENAME_MAX]; char posname[FILENAME_MAX]; if (fn_ == 0L) { fprintf(stderr, "ERROR - no counted database file specified.\n"); exit(1); } memset(_filename, 0, sizeof(char) * FILENAME_MAX); strcpy(_filename, fn_); // Open the files // snprintf(idxname, FILENAME_MAX, "%s.mcidx", _filename); snprintf(datname, FILENAME_MAX, "%s.mcdat", _filename); snprintf(posname, FILENAME_MAX, "%s.mcpos", _filename); // bitPackedFile will create a file if it doesn't exist, so we need to fail ahead // of time. 
bool idxexist = AS_UTL_fileExists(idxname); bool datexist = AS_UTL_fileExists(datname); bool posexist = AS_UTL_fileExists(posname); if ((idxexist == false) || (datexist == false)) { fprintf(stderr, "merylStreamReader()-- ERROR: Didn't find data files for reading mer data.\n"); fprintf(stderr, "merylStreamReader()-- ERROR: Expecting to find '%s' and\n", idxname); fprintf(stderr, "merylStreamReader()-- ERROR: '%s'\n", datname); exit(1); } _IDX = new bitPackedFile(idxname); _DAT = new bitPackedFile(datname); _POS = (posexist) ? new bitPackedFile(posname) : 0L; // Verify that they are what they should be, and read in the header // char Imagic[16] = {0}; char Dmagic[16] = {0}; char Pmagic[16] = {0}; bool fail = false; for (uint32 i=0; i<16; i++) { Imagic[i] = _IDX->getBits(8); Dmagic[i] = _DAT->getBits(8); if (_POS) Pmagic[i] = _POS->getBits(8); } if (strncmp(Imagic, ImagicX, 16) == 0) { fprintf(stderr, "merylStreamReader()-- ERROR: %s.mcidx is an INCOMPLETE merylStream index file!\n", _filename); fail = true; } if (strncmp(Imagic, ImagicX, 13) != 0) { fprintf(stderr, "merylStreamReader()-- ERROR: %s.mcidx is not a merylStream index file!\n", _filename); fail = true; } if (strncmp(Dmagic, DmagicX, 16) == 0) { fprintf(stderr, "merylStreamReader()-- ERROR: %s.mcdat is an INCOMPLETE merylStream data file!\n", _filename); fail = true; } if (strncmp(Dmagic, DmagicX, 13) != 0) { fprintf(stderr, "merylStreamReader()-- ERROR: %s.mcdat is not a merylStream data file!\n", _filename); fail = true; } if ((Imagic[13] != Dmagic[13]) || (Imagic[14] != Dmagic[14])) { fprintf(stderr, "merylStreamReader()-- ERROR: %s.mcidx and %s.mcdat are different versions!\n", _filename, _filename); fail = true; } if (_POS) { if (strncmp(Pmagic, PmagicX, 16) == 0) { fprintf(stderr, "merylStreamReader()-- ERROR: %s.mcpos is an INCOMPLETE merylStream data file!\n", _filename); fail = true; } if (strncmp(Pmagic, PmagicX, 13) != 0) { fprintf(stderr, "merylStreamReader()-- ERROR: %s.mcpos is not a 
merylStream data file!\n", _filename); fail = true; } } if (fail) exit(1); _idxIsPacked = _IDX->getBits(32); _datIsPacked = _IDX->getBits(32); _posIsPacked = _IDX->getBits(32); _merSizeInBits = _IDX->getBits(32) << 1; _merCompression = _IDX->getBits(32); _prefixSize = _IDX->getBits(32); _merDataSize = _merSizeInBits - _prefixSize; _numUnique = _IDX->getBits(64); _numDistinct = _IDX->getBits(64); _numTotal = _IDX->getBits(64); _histogramPos = 0; _histogramLen = 0; _histogramMaxValue = 0; _histogram = 0L; uint32 version = atoi(Imagic + 13); // Versions earlier than four used a fixed-size histogram, stored at the start // of the index. if (version < 4) { _histogramPos = _IDX->tell(); _histogramLen = _IDX->getBits(64); // Previous _histogramHuge, now unused _histogramLen = _IDX->getBits(64); _histogramMaxValue = _IDX->getBits(64); _histogram = new uint64 [_histogramLen]; for (uint32 i=0; i<_histogramLen; i++) _histogram[i] = _IDX->getBits(64); } // Version 4 switched to a dynamically sized histogram, stored at the end // of the index. 
else { _histogramPos = _IDX->getBits(64); _histogramLen = _IDX->getBits(64); _histogramMaxValue = _IDX->getBits(64); _histogram = new uint64 [_histogramLen]; uint64 position = _IDX->tell(); _IDX->seek(_histogramPos); for (uint32 i=0; i<_histogramLen; i++) _histogram[i] = _IDX->getBits(64); _IDX->seek(position); } _thisBucket = uint64ZERO; _thisBucketSize = getIDXnumber(); _numBuckets = uint64ONE << _prefixSize; _thisMer.setMerSize(_merSizeInBits >> 1); _thisMer.clear(); _thisMerCount = uint64ZERO; _thisMerPositionsMax = 0; _thisMerPositions = 0L; _validMer = true; #ifdef SHOW_VARIABLES fprintf(stderr, "_merSizeInBits = " F_U32 "\n", _merSizeInBits); fprintf(stderr, "_merCompression = " F_U32 "\n", _merCompression); fprintf(stderr, "_prefixSize = " F_U32 "\n", _prefixSize); fprintf(stderr, "_merDataSize = " F_U32 "\n", _merDataSize); fprintf(stderr, "_numUnique = " F_U64 "\n", _numUnique); fprintf(stderr, "_numDistinct = " F_U64 "\n", _numDistinct); fprintf(stderr, "_numTotal = " F_U64 "\n", _numTotal); fprintf(stderr, "_thisBucket = " F_U64 "\n", _thisBucket); fprintf(stderr, "_thisBucketSize = " F_U64 "\n", _thisBucketSize); fprintf(stderr, "_thisMerCount = " F_U64 "\n", _thisMerCount); #endif if ((ms_ > 0) && (_merSizeInBits >> 1 != ms_)) { fprintf(stderr, "merylStreamReader()-- ERROR: User requested mersize " F_U32 " but '%s' is mersize " F_U32 "\n", ms_, _filename, _merSizeInBits >> 1); exit(1); } } merylStreamReader::~merylStreamReader() { delete _IDX; delete _DAT; delete _POS; delete [] _thisMerPositions; delete [] _histogram; } bool merylStreamReader::nextMer(void) { // Use a while here, so that we skip buckets that are empty // while ((_thisBucketSize == 0) && (_thisBucket < _numBuckets)) { _thisBucketSize = getIDXnumber(); _thisBucket++; } if (_thisBucket >= _numBuckets) return(_validMer = false); // Before you get rid of the clear() -- if, say, the list of mers // is sorted and we can shift the mer to make space for the new // stuff -- make sure that 
nobody is calling reverseComplement()! // _thisMer.clear(); _thisMer.readFromBitPackedFile(_DAT, _merDataSize); _thisMer.setBits(_merDataSize, _prefixSize, _thisBucket); _thisMerCount = getDATnumber(); _thisBucketSize--; if (_POS) { if (_thisMerPositionsMax < _thisMerCount) { delete [] _thisMerPositions; _thisMerPositionsMax = _thisMerCount + 1024; _thisMerPositions = new uint32 [_thisMerPositionsMax]; } for (uint32 i=0; i<_thisMerCount; i++) { _thisMerPositions[i] = _POS->getBits(32); } } return(true); } merylStreamWriter::merylStreamWriter(const char *fn_, uint32 merSize, uint32 merComp, uint32 prefixSize, bool positionsEnabled) { char outpath[FILENAME_MAX]; memset(_filename, 0, sizeof(char) * FILENAME_MAX); strcpy(_filename, fn_); snprintf(outpath, FILENAME_MAX, "%s.mcidx.creating", _filename); _IDX = new bitPackedFile(outpath, 0, true); snprintf(outpath, FILENAME_MAX, "%s.mcdat.creating", _filename); _DAT = new bitPackedFile(outpath, 0, true); if (positionsEnabled) { snprintf(outpath, FILENAME_MAX, "%s.mcpos.creating", _filename); _POS = new bitPackedFile(outpath, 0, true); } else { _POS = 0L; } _idxIsPacked = 1; _datIsPacked = 1; _posIsPacked = 0; _merSizeInBits = merSize * 2; _merCompression = merComp; _prefixSize = prefixSize; _merDataSize = _merSizeInBits - _prefixSize; _thisBucket = uint64ZERO; _thisBucketSize = uint64ZERO; _numBuckets = uint64ONE << _prefixSize; _numUnique = uint64ZERO; _numDistinct = uint64ZERO; _numTotal = uint64ZERO; _histogramPos = 0; _histogramLen = 1024; _histogramMaxValue = 0; _histogram = new uint64 [_histogramLen]; for (uint32 i=0; i<_histogramLen; i++) _histogram[i] = 0; _thisMerIsBits = false; _thisMerIskMer = false; _thisMer.setMerSize(_merSizeInBits >> 1); _thisMer.clear(); _thisMerPre = uint64ZERO; _thisMerMer = uint64ZERO; _thisMerPreSize = prefixSize; _thisMerMerSize = 2 * merSize - prefixSize; _thisMerCount = uint64ZERO; // Initialize the index file. 
for (uint32 i=0; i<16; i++) _IDX->putBits(ImagicX[i], 8); _IDX->putBits(_idxIsPacked, 32); _IDX->putBits(_datIsPacked, 32); _IDX->putBits(_posIsPacked, 32); _IDX->putBits(_merSizeInBits >> 1, 32); _IDX->putBits(_merCompression, 32); _IDX->putBits(_prefixSize, 32); _IDX->putBits(_numUnique, 64); _IDX->putBits(_numDistinct, 64); _IDX->putBits(_numTotal, 64); _IDX->putBits(0, 64); // Offset to the histogram _IDX->putBits(0, 64); // Length of the histogram data _IDX->putBits(0, 64); // Max value seen in the histogram // Initialize the data file. for (uint32 i=0; i<16; i++) _DAT->putBits(DmagicX[i], 8); // Initialize the positions file. if (_POS) for (uint32 i=0; i<16; i++) _POS->putBits(PmagicX[i], 8); } merylStreamWriter::~merylStreamWriter() { writeMer(); // Finish writing the buckets. while (_thisBucket < _numBuckets + 2) { setIDXnumber(_thisBucketSize); _thisBucketSize = 0; _thisBucket++; } // Save the position of the histogram _histogramPos = _IDX->tell(); // And write the histogram for (uint32 i=0; i<=_histogramMaxValue; i++) _IDX->putBits(_histogram[i], 64); // Seek back to the start and rewrite the magic numbers. _IDX->seek(0); for (uint32 i=0; i<16; i++) _IDX->putBits(ImagicV[i], 8); _IDX->putBits(_idxIsPacked, 32); _IDX->putBits(_datIsPacked, 32); _IDX->putBits(_posIsPacked, 32); _IDX->putBits(_merSizeInBits >> 1, 32); _IDX->putBits(_merCompression, 32); _IDX->putBits(_prefixSize, 32); _IDX->putBits(_numUnique, 64); _IDX->putBits(_numDistinct, 64); _IDX->putBits(_numTotal, 64); _IDX->putBits(_histogramPos, 64); _IDX->putBits(_histogramMaxValue+1, 64); // The length of the data (includes 0) _IDX->putBits(_histogramMaxValue, 64); // The maximum value of the data delete _IDX; delete [] _histogram; // Seek back to the start of the data and rewrite the magic numbers. _DAT->seek(0); for (uint32 i=0; i<16; i++) _DAT->putBits(DmagicV[i], 8); delete _DAT; // Seek back to the start of the positions and rewrite the magic numbers. 
if (_POS) { _POS->seek(0); for (uint32 i=0; i<16; i++) _POS->putBits(PmagicV[i], 8); } delete _POS; // All done! Rename our temporary outputs to final outputs. char outpath[FILENAME_MAX]; char finpath[FILENAME_MAX]; snprintf(outpath, FILENAME_MAX, "%s.mcidx.creating", _filename); snprintf(finpath, FILENAME_MAX, "%s.mcidx", _filename); rename(outpath, finpath); snprintf(outpath, FILENAME_MAX, "%s.mcdat.creating", _filename); snprintf(finpath, FILENAME_MAX, "%s.mcdat", _filename); rename(outpath, finpath); if (_POS) { snprintf(outpath, FILENAME_MAX, "%s.mcpos.creating", _filename); snprintf(finpath, FILENAME_MAX, "%s.mcpos", _filename); rename(outpath, finpath); } } void merylStreamWriter::writeMer(void) { if (_thisMerCount == 0) return; _numTotal += _thisMerCount; _numDistinct++; if (_thisMerCount >= _histogramLen) resizeArray(_histogram, _histogramMaxValue+1, _histogramLen, _thisMerCount + 16384, resizeArray_copyData | resizeArray_clearNew); _histogram[_thisMerCount]++; if (_histogramMaxValue < _thisMerCount) _histogramMaxValue = _thisMerCount; assert((_thisMerIsBits == false) || (_thisMerIskMer == false)); if (_thisMerIsBits) { if (_thisMerCount == 1) { _DAT->putBits(_thisMerMer, _thisMerMerSize); setDATnumber(1); _thisBucketSize++; _numUnique++; } else { _DAT->putBits(_thisMerMer, _thisMerMerSize); setDATnumber(_thisMerCount); _thisBucketSize++; } } else { if (_thisMerCount == 1) { _thisMer.writeToBitPackedFile(_DAT, _merDataSize); setDATnumber(1); _thisBucketSize++; _numUnique++; } else if (_thisMerCount > 1) { _thisMer.writeToBitPackedFile(_DAT, _merDataSize); setDATnumber(_thisMerCount); _thisBucketSize++; } } } void merylStreamWriter::addMer(kMer &mer, uint32 count, uint32 *positions) { uint64 val; if (_thisMerIskMer == false) { _thisMerIskMer = true; assert(_thisMerIsBits == false); } // Fail if we see a smaller mer than last time. 
  //
  if (mer < _thisMer) {
    char str[1024];
    fprintf(stderr, "merylStreamWriter::addMer()-- ERROR: your mer stream isn't sorted increasingly!\n");
    fprintf(stderr, "merylStreamWriter::addMer()-- last: %s\n", _thisMer.merToString(str));
    fprintf(stderr, "merylStreamWriter::addMer()-- this: %s\n", mer.merToString(str));
    exit(1);
  }

  //  If there was a position given, write it.
  //
  if (positions && _POS)
    for (uint32 i=0; i<count; i++)
      _POS->putBits(positions[i], 32);

  //  If the new mer is the same as the last one, just increase the count.
  //
  if (mer == _thisMer) {
    _thisMerCount += count;
    return;
  }

  //  Write thisMer to disk.  If the count is zero, we don't write
  //  anything.  The count is zero for the first mer (all A) unless we
  //  add that mer, and if the silly user gives us a mer with zero count.
  //
  writeMer();

  //  If the new mer is in a different bucket from the last mer, write
  //  out some bucket counts.  We need a while loop (as opposed to just
  //  writing one bucket) because we aren't guaranteed that the mers
  //  are in adjacent buckets.
// val = mer.startOfMer(_prefixSize); while (_thisBucket < val) { setIDXnumber(_thisBucketSize); _thisBucketSize = 0; _thisBucket++; } // Remember the new mer for the next time // _thisMer = mer; _thisMerCount = count; } void merylStreamWriter::addMer(uint64 prefix, uint32 prefixBits, uint64 mer, uint32 merBits, uint32 count, uint32 *UNUSED(positions)) { if (_thisMerIsBits == false) { _thisMerIsBits = true; assert(_thisMerIskMer == false); } assert(prefixBits == _prefixSize); assert(prefixBits == _thisMerPreSize); assert(merBits == _thisMerMerSize); assert(prefixBits + merBits == _merSizeInBits); if (((prefix < _thisMerPre)) || ((prefix <= _thisMerPre) && (mer < _thisMerMer))) { assert(0); } if ((prefix == _thisMerPre) && (mer == _thisMerMer)) { _thisMerCount += count; return; } writeMer(); while (_thisBucket < prefix) { setIDXnumber(_thisBucketSize); _thisBucketSize = 0; _thisBucket++; } _thisMerPre = prefix; _thisMerMer = mer; _thisMerCount = count; } canu-1.6/src/meryl/libmeryl.H000066400000000000000000000166001314437614700161720ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/libmeryl/libmeryl.H * * Modifications by: * * Brian P. Walenz from 2003-SEP-08 to 2004-APR-08 * are Copyright 2003-2004 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-APR-21 to 2004-OCT-10 * are Copyright 2004 Brian P. 
Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-MAY-23 to 2014-APR-11 * are Copyright 2005,2007-2008,2012,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2015-JUN-16 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-MAY-19 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef LIBMERYL_H #define LIBMERYL_H #include "kMer.H" // A merStream reader/writer for meryl mercount data. // // merSize is used to check that the meryl file is the correct size. // If it isn't the code fails. // // The reader returns mers in lexicographic order. No random access. // The writer assumes that mers come in sorted increasingly. // // numUnique the total number of mers with count of one // numDistinct the total number of distinct mers in this file // numTotal the total number of mers in this file class merylStreamReader { public: merylStreamReader(const char *fn, uint32 ms=0); ~merylStreamReader(); kMer &theFMer(void) { return(_thisMer); }; uint64 theCount(void) { return(_thisMerCount); }; bool hasPositions(void) { return(_POS != 0L); }; uint32 *thePositions(void) { return(_thisMerPositions); }; uint32 getPosition(uint32 i) { return(((_POS) && (i < _thisMerCount)) ? 
_thisMerPositions[i] : ~uint32ZERO); }; uint32 merSize(void) { return(_merSizeInBits >> 1); }; uint32 merCompression(void) { return(_merCompression); }; uint32 prefixSize(void) { return(_prefixSize); }; uint64 numberOfUniqueMers(void) { return(_numUnique); }; uint64 numberOfDistinctMers(void) { return(_numDistinct); }; uint64 numberOfTotalMers(void) { return(_numTotal); }; uint64 histogram(uint32 i) { return((i < _histogramLen) ? _histogram[i] : ~uint64ZERO); }; uint64 histogramLength(void) { return(_histogramLen); }; uint64 histogramMaximumCount(void) { return(_histogramMaxValue); }; bool nextMer(void); bool validMer(void) { return(_validMer); }; private: char _filename[FILENAME_MAX]; bitPackedFile *_IDX; bitPackedFile *_DAT; bitPackedFile *_POS; uint64 getIDXnumber(void) { uint64 n = 1; if (_idxIsPacked) n = _IDX->getNumber(); else n = _IDX->getBits(32); return(n); }; uint64 getDATnumber(void) { uint64 n = 1; if (_datIsPacked) { if (_DAT->getBits(1)) n = _DAT->getNumber() + 2; } else { n = _DAT->getBits(32); } return(n); }; // Why not bool? Seems like the bitPackedFile is incompatible // with bools. 
uint32 _idxIsPacked; uint32 _datIsPacked; uint32 _posIsPacked; uint32 _merSizeInBits; uint32 _merCompression; uint32 _prefixSize; uint32 _merDataSize; uint64 _thisBucket; uint64 _thisBucketSize; uint64 _numBuckets; kMer _thisMer; uint64 _thisMerCount; uint32 _thisMerPositionsMax; uint32 *_thisMerPositions; uint64 _numUnique; uint64 _numDistinct; uint64 _numTotal; uint64 _histogramPos; // position of the histogram data in IDX uint64 _histogramLen; // number of entries in the histo uint64 _histogramMaxValue; // highest count ever seen uint64 *_histogram; bool _validMer; }; class merylStreamWriter { public: merylStreamWriter(const char *filePrefix, uint32 merSize, // In bases uint32 merComp, // A length, bases uint32 prefixSize, // In bits bool positionsEnabled); ~merylStreamWriter(); void addMer(kMer &mer, uint32 count=1, uint32 *positions=0L); void addMer(uint64 prefix, uint32 prefixBits, uint64 mer, uint32 merBits, uint32 count=1, uint32 *positions=0L); private: void writeMer(void); void setIDXnumber(uint64 n) { if (_idxIsPacked) _IDX->putNumber(n); else _IDX->putBits(n, 32); }; void setDATnumber(uint64 n) { if (_datIsPacked) { if (n == 1) { _DAT->putBits(uint64ZERO, 1); } else { _DAT->putBits(uint64ONE, 1); _DAT->putNumber(n-2); } } else { _DAT->putBits(n, 32); } }; char _filename[FILENAME_MAX]; bitPackedFile *_IDX; bitPackedFile *_DAT; bitPackedFile *_POS; uint32 _idxIsPacked; uint32 _datIsPacked; uint32 _posIsPacked; uint32 _merSizeInBits; uint32 _merCompression; uint32 _prefixSize; uint32 _merDataSize; uint64 _thisBucket; uint64 _thisBucketSize; uint64 _numBuckets; uint64 _numUnique; uint64 _numDistinct; uint64 _numTotal; uint64 _histogramPos; // position of the histogram data in IDX uint64 _histogramLen; // number of entries in the histogram uint64 _histogramMaxValue; // highest count ever seen uint64 *_histogram; bool _thisMerIsBits; bool _thisMerIskMer; kMer _thisMer; uint64 _thisMerPre; uint64 _thisMerMer; uint32 _thisMerPreSize; uint32 _thisMerMerSize; 
uint64 _thisMerCount; }; #endif // LIBMERYL_H canu-1.6/src/meryl/mapMers-depth.C000066400000000000000000000124511314437614700170540ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/meryl/mapMers-depth.C * * Modifications by: * * Brian P. Walenz from 2007-JUN-08 to 2014-APR-11 * are Copyright 2007-2008,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-OCT-07 to 2014-DEC-22 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include #include #include #include "bio++.H" #include "seqCache.H" #include "merStream.H" #include "libmeryl.H" #include "existDB.H" #warning this code might not work due to intervalList changes int main(int argc, char **argv) { uint32 merSize = 16; char *merylFile = 0L; char *fastaFile = 0L; bool beVerbose = false; uint32 loCount = 0; uint32 hiCount = ~uint32ZERO; uint32 windowsize = 0; uint32 skipsize = 0; int arg=1; while (arg < argc) { if (strcmp(argv[arg], "-m") == 0) { merSize = strtouint32(argv[++arg]); } else if (strcmp(argv[arg], "-mers") == 0) { merylFile = argv[++arg]; } else if (strcmp(argv[arg], "-seq") == 0) { fastaFile = argv[++arg]; } else if (strcmp(argv[arg], "-v") == 0) { beVerbose = true; } else if (strcmp(argv[arg], "-lo") == 0) { loCount = strtouint32(argv[++arg]); } else if (strcmp(argv[arg], "-hi") == 0) { hiCount = strtouint32(argv[++arg]); } else if (strcmp(argv[arg], "-w") == 0) { windowsize = strtouint32(argv[++arg]); } else if (strcmp(argv[arg], "-s") == 0) { skipsize = strtouint32(argv[++arg]); } else { fprintf(stderr, "unknown option '%s'\n", argv[arg]); } arg++; } if ((merylFile == 0L) || (fastaFile == 0L)) { fprintf(stderr, "usage: %s -m mersize -mers mers -seq fasta > output\n", argv[0]); exit(1); } existDB *E = new existDB(merylFile, merSize, existDBcounts | existDBcompressCounts | existDBcompressBuckets, loCount, hiCount); seqCache *F = new seqCache(fastaFile); for (uint32 Sid=0; Sid < F->getNumberOfSequences(); Sid++) { seqInCore *S = F->getSequenceInCore(Sid); merStream *MS = new merStream(new kMerBuilder(merSize), new seqStream(S->sequence(), S->sequenceLength()), true, true); uint32 idlen = 0; intervalDepthRegions *id = new intervalDepthRegions [S->sequenceLength() * 2 + 2]; while (MS->nextMer()) { int32 cnt = (int32)E->count(MS->theFMer()) + (int32)E->count(MS->theRMer()); // Old intervalDepth was to add 'cnt' in the first and subtract 'cnt' in the second. // Then to use the 'ct' field below. 
// New intervalDepth is the same, but uses the value field. // Count is now the number of intervals that are represented in this block. id[idlen].pos = MS->thePositionInSequence(); id[idlen].change = cnt; id[idlen].open = true; idlen++; id[idlen].pos = MS->thePositionInSequence() + merSize; id[idlen].change = cnt; id[idlen].open = false; idlen++; } intervalList ID(id, idlen); uint32 x = 0; uint32 len = S->sequenceLength(); // Default case, report un-averaged depth at every single location. // if ((windowsize == 0) && (skipsize == 0)) { for (uint32 i=0; i < ID.numberOfIntervals(); i++) { for (; x < ID.lo(i); x++) fprintf(stdout, uint32FMTW(7)"\t"uint32FMTW(6)"\n", x, 0); for (; x < ID.hi(i); x++) fprintf(stdout, uint32FMTW(7)"\t"uint32FMTW(6)"\n", x, ID.value(i)); } for (; x < len; x++) fprintf(stdout, uint32FMTW(7)"\t"uint32FMTW(6)"\n", x, 0); } else { uint32 *depth = new uint32 [len]; for (x=0; x < len; x++) depth[x] = 0; for (uint32 i=0; i < ID.numberOfIntervals(); i++) for (x=ID.lo(i); x < ID.hi(i); x++) depth[x] = ID.count(i); uint32 avedepth = 0; for (x=0; x < windowsize; x++) avedepth += depth[x]; while (x < len) { uint32 avepos = (x - 1) - (windowsize - 1) / 2; if ((avepos % skipsize) == 0) fprintf(stdout, uint32FMT"\t%.4f\n", avepos, (double)avedepth / (double)windowsize); avedepth = avedepth + depth[x] - depth[x-windowsize]; x++; } delete [] depth; } delete [] id; delete MS; delete S; } delete F; delete E; } canu-1.6/src/meryl/mapMers.C000066400000000000000000000157471314437614700157650ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. 
* * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/meryl/mapMers.C * * Modifications by: * * Brian P. Walenz from 2006-OCT-18 to 2014-APR-11 * are Copyright 2006-2008,2013-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2015-JUL-22 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include #include #include #include "bio++.H" #include "seqCache.H" #include "merStream.H" #include "libmeryl.H" #include "existDB.H" #define OP_NONE 0 #define OP_STATS 1 #define OP_REGIONS 2 #define OP_DETAILS 3 int main(int argc, char **argv) { uint32 merSize = 16; char *merylFile = 0L; char *fastaFile = 0L; bool beVerbose = false; uint32 loCount = 0; uint32 hiCount = ~uint32ZERO; uint32 operation = OP_NONE; // For OP_STATS uint32 Clen = 0; uint32 Cmax = 4 * 1024 * 1024; uint32 *C = new uint32 [Cmax]; int arg=1; while (arg < argc) { if (strcmp(argv[arg], "-m") == 0) { merSize = strtouint32(argv[++arg]); } else if (strcmp(argv[arg], "-mers") == 0) { merylFile = argv[++arg]; } else if (strcmp(argv[arg], "-seq") == 0) { fastaFile = argv[++arg]; } else if (strcmp(argv[arg], "-v") == 0) { beVerbose = true; } else if (strcmp(argv[arg], "-lo") == 0) { loCount = strtouint32(argv[++arg]); } else if (strcmp(argv[arg], "-hi") == 0) { hiCount = strtouint32(argv[++arg]); } else if (strcmp(argv[arg], "-stats") == 0) { operation = OP_STATS; } else if (strcmp(argv[arg], "-regions") == 0) { operation = OP_REGIONS; } else if (strcmp(argv[arg], "-details") == 0) { operation = OP_DETAILS; } else { fprintf(stderr, "unknown option '%s'\n", argv[arg]); } arg++; } if ((operation == OP_NONE) || (merylFile == 0L) || (fastaFile 
== 0L)) {
    fprintf(stderr, "usage: %s [-stats | -regions | -details] -m mersize -mers mers -seq fasta > output\n", argv[0]);
    exit(1);
  }

#if 0
  existDB  *E = NULL;

  if (fileExists("junk.existDB")) {
    fprintf(stderr, "loading from junk.existDB\n");
    E = new existDB("junk.existDB");
    fprintf(stderr, "loaded\n");
  } else {
    exit(1);
    E = new existDB(merylFile, merSize, existDBcounts, loCount, hiCount);
    E->saveState("junk.existDB");
  }
#endif

  existDB  *E = new existDB(merylFile, merSize, existDBcounts, loCount, hiCount);
  seqCache *F = new seqCache(fastaFile);

  fprintf(stderr, "Begin.\n");

  for (uint32 Sid=0; Sid < F->getNumberOfSequences(); Sid++) {
    seqInCore *S  = F->getSequenceInCore(Sid);
    merStream *MS = new merStream(new kMerBuilder(merSize),
                                  new seqStream(S->sequence(), S->sequenceLength()),
                                  true, true);

    //  With counts, report mean, mode, median, min, max for each frag.

    if (operation == OP_STATS) {
      Clen = 0;
      while (MS->nextMer())
        C[Clen++] = E->count(MS->theFMer()) + E->count(MS->theRMer());

      uint64 mean = uint64ZERO;
      uint64 min  = ~uint64ZERO;
      uint64 max  = uint64ZERO;

      uint64 hist[16] = { 0 };  //  Histogram bins are powers of two: <=1, <=2, <=4, <=8, <=16, <=32, <=64, <=128, <=256, <=512, <=1024, <=2048, <=4096, <=8192, <=16384, <=32768

      for (uint32 i=0; i<Clen; i++) {
        mean += C[i];
        if (min > C[i])  min = C[i];
        if (max < C[i])  max = C[i];
        hist[ logBaseTwo64(C[i]) ]++;
      }

      if (Clen > 0) {
        mean /= Clen;
      } else {
        mean = uint64ZERO;
        min  = uint64ZERO;
        max  = uint64ZERO;
      }

      fprintf(stdout, "%s\t"
              uint64FMT"\t"uint64FMT"\t"uint64FMT"\t"
              uint64FMT"\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\t"
              uint64FMT"\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\n",
              S->header(),
              mean, min, max,
              hist[ 0], hist[ 1], hist[ 2], hist[ 3],
              hist[ 4], hist[ 5], hist[ 6], hist[ 7],
              hist[ 8], hist[ 9], hist[10], hist[11],
              hist[12], hist[13], hist[14], hist[15]);
    }

    //  Without counts, reports regions with mer coverage.
// Orientation tells us nothing, since the mers are probably canonical if (operation == OP_REGIONS) { uint64 beg = ~uint64ZERO; uint64 end = ~uint64ZERO; uint64 pos = ~uint64ZERO; uint64 numCovReg = 0; uint64 lenCovReg = 0; while (MS->nextMer()) { if (E->exists(MS->theFMer()) || E->exists(MS->theRMer())) { pos = MS->thePositionInSequence(); if (beg == ~uint64ZERO) beg = end = pos; if (pos <= end + merSize) { end = pos; } else { fprintf(stdout, "%s\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\n", S->header(), beg, end+merSize, end+merSize - beg); numCovReg++; lenCovReg += end+merSize - beg; beg = end = pos; } } else { fprintf(stdout, "%s\t"uint64FMT"\tuncovered\n", S->header(), MS->thePositionInSequence()); } } if (beg != ~uint64ZERO) fprintf(stdout, "%s\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\n", S->header(), beg, end+merSize, end+merSize - beg); fprintf(stderr, "numCovReg: "uint64FMT"\n", numCovReg); fprintf(stderr, "lenCovReg: "uint64FMT"\n", lenCovReg); } if (operation == OP_DETAILS) { char merString[256]; while (MS->nextMer()) { uint64 beg = MS->thePositionInSequence(); uint64 end = beg + merSize; uint64 fnt = E->count(MS->theFMer()); uint64 rnt = E->count(MS->theRMer()); fprintf(stdout, "%s\t%s\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\t"uint64FMT"\n", S->header(), MS->theFMer().merToString(merString), beg, end, fnt, rnt, fnt + rnt); } } delete MS; delete S; } delete F; delete E; } canu-1.6/src/meryl/maskMers.C000066400000000000000000000426701314437614700161360ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. 
* * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/meryl/maskMers.C * * Modifications by: * * Brian P. Walenz from 2008-MAR-31 to 2014-APR-11 * are Copyright 2008,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-NOV-22 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "seqStream.H" #include "libmeryl.H" #include "speedCounter.H" #include #define MAX_COVERAGE 51 class mateRescueData { public: mateRescueData() { _mean = 0; _stddev = 0; _coverage = 0; _normal = 0L; _normalZero = 0; }; void init(int32 mean_, int32 stddev_, uint32 coverage_) { _mean = mean_; _stddev = stddev_; _coverage = coverage_; assert(_mean > 3 * _stddev); double a = 1.0 / (_stddev * sqrt(2 * M_PI)); double c = 2 * _stddev * _stddev; int32 b1l = (int32)floor(-3 * _stddev); int32 b1h = (int32)ceil ( 3 * _stddev); _normal = new double [b1h - b1l + 1]; _normalZero = -b1l; for (int32 l=0; l= _numSeq) || (onlySeqIID_ == i)) { fprintf(stderr, "Loading sequence " F_U32 " of length " F_U32 "\n", i, _seqLen[i]); _masking[i] = new char [_seqLen[i]]; _repeatID[i] = new uint32 [_seqLen[i]]; //memset(_masking[i], 'g', sizeof(char) * _seqLen[i]); //memset(_repeatID[i], 0, sizeof(uint32) * _seqLen[i]); fread(_masking[i], sizeof(char), _seqLen[i], maskMersFile); fread(_repeatID[i], sizeof(uint32), _seqLen[i], maskMersFile); } else { fseek(maskMersFile, sizeof(char) * _seqLen[i], SEEK_CUR); fseek(maskMersFile, sizeof(uint32) * _seqLen[i], SEEK_CUR); _seqLen[i] = 0; } } fclose(maskMersFile); } 
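The masking cache that loadMasking() reads is a flat binary dump: the sequence count, the mer size, every sequence length, then each sequence's per-base mask characters followed by its per-base repeat IDs. A minimal standalone sketch of that layout follows; the `MaskCache` struct and the `save`/`load` names are ours, and `uint32` is spelled `uint32_t` here, but the field order and element types mirror the fread/fwrite calls in the surrounding code.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

//  In-memory mirror of the maskMers cache file, per loadMasking()/saveMasking():
//  numSeq, merSize, seqLen[numSeq], then per sequence the mask bytes and repeat IDs.
struct MaskCache {
  uint32_t                            merSize;
  std::vector<uint32_t>               seqLen;
  std::vector<std::vector<char>>      masking;   //  'g' gap, 'u' unique, 'r' repeat
  std::vector<std::vector<uint32_t>>  repeatID;  //  nonzero when another copy is nearby
};

bool save(const MaskCache &M, const char *path) {
  FILE *F = fopen(path, "wb");
  if (F == nullptr)
    return false;

  uint32_t numSeq = (uint32_t)M.seqLen.size();

  fwrite(&numSeq,         sizeof(uint32_t), 1,      F);
  fwrite(&M.merSize,      sizeof(uint32_t), 1,      F);
  fwrite(M.seqLen.data(), sizeof(uint32_t), numSeq, F);

  for (uint32_t i = 0; i < numSeq; i++) {
    fwrite(M.masking[i].data(),  sizeof(char),     M.seqLen[i], F);
    fwrite(M.repeatID[i].data(), sizeof(uint32_t), M.seqLen[i], F);
  }

  fclose(F);
  return true;
}

bool load(MaskCache &M, const char *path) {
  FILE *F = fopen(path, "rb");
  if (F == nullptr)
    return false;

  uint32_t numSeq = 0;

  fread(&numSeq,    sizeof(uint32_t), 1, F);
  fread(&M.merSize, sizeof(uint32_t), 1, F);

  M.seqLen.resize(numSeq);
  fread(M.seqLen.data(), sizeof(uint32_t), numSeq, F);

  M.masking.resize(numSeq);
  M.repeatID.resize(numSeq);

  for (uint32_t i = 0; i < numSeq; i++) {
    M.masking[i].resize(M.seqLen[i]);
    M.repeatID[i].resize(M.seqLen[i]);
    fread(M.masking[i].data(),  sizeof(char),     M.seqLen[i], F);
    fread(M.repeatID[i].data(), sizeof(uint32_t), M.seqLen[i], F);
  }

  fclose(F);
  return true;
}
```

Note that, like the original, this format is not portable across machines with different endianness or `uint32` width; it is a cache, regenerated from the fasta and meryl inputs when absent.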
void merMaskedSequence::saveMasking(void) { FILE *maskMersFile = fopen(_maskMersName, "w"); fwrite(&_numSeq, sizeof(uint32), 1, maskMersFile); fwrite(&_merSize, sizeof(uint32), 1, maskMersFile); fwrite( _seqLen, sizeof(uint32), _numSeq, maskMersFile); for (uint32 i=0; i<_numSeq; i++) { fwrite(_masking[i], sizeof(char), _seqLen[i], maskMersFile); fwrite(_repeatID[i], sizeof(uint32), _seqLen[i], maskMersFile); } fclose(maskMersFile); } void merMaskedSequence::buildMasking(void) { seqStream *STR = new seqStream(_fastaName); _numSeq = STR->numberOfSequences(); _seqLen = new int32 [_numSeq]; _masking = new char * [_numSeq]; _repeatID = new uint32 * [_numSeq]; _merSize = 0; fprintf(stderr, F_U32" sequences in '%s'\n", _numSeq, _fastaName); for (uint32 i=0; i<_numSeq; i++) { _seqLen[i] = STR->lengthOf(i); _masking[i] = new char [_seqLen[i]]; _repeatID[i] = new uint32 [_seqLen[i]]; memset(_masking[i], 'g', sizeof(char) * _seqLen[i]); memset(_repeatID[i], 0, sizeof(uint32) * _seqLen[i]); } // g -> gap in sequence // u -> unique mer // r -> repeat mer // // For all the r's we also need to remember the other locations // that repeat is at. We annotate the map with a repeat id, set if // another copy of the repeat is nearby. 
  merylStreamReader *MS = new merylStreamReader(_merylName);
  speedCounter      *CT = new speedCounter(" Masking mers in sequence: %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, true);

  uint32 rid = 0;

  _merSize = MS->merSize();

  while (MS->nextMer()) {
    //fprintf(stderr, "mer count="uint64FMT" pos=" F_U32 "\n", MS->theCount(), MS->getPosition(0));

    if (MS->theCount() == 1) {
      uint32 p = MS->getPosition(0);
      uint32 s = STR->sequenceNumberOfPosition(p);
      p -= STR->startOf(s);

      _masking[s][p] = 'u';
    } else {
      std::sort(MS->thePositions(), MS->thePositions() + MS->theCount());

      uint32 lastS = ~uint32ZERO;
      uint32 lastP = 0;

      rid++;

      for (uint32 i=0; i<MS->theCount(); i++) {
        uint32 p = MS->getPosition(i);
        uint32 s = STR->sequenceNumberOfPosition(p);
        p -= STR->startOf(s);

        //  Always set the masking.
        _masking[s][p] = 'r';

        //  If there is a repeat close by, set the repeat ID.
        if ((s == lastS) && (lastP + 40000 > p)) {
          _repeatID[s][lastP] = rid;
          _repeatID[s][p]     = rid;
        }

        lastS = s;
        lastP = p;
      }
    }

    CT->tick();
  }

  delete CT;
  delete MS;
  delete STR;

  saveMasking();
}


void
computeDensity(merMaskedSequence *S, char *outputPrefix) {
  char     outputName[FILENAME_MAX];
  FILE    *outputFile;
  uint32   windowSizeMax = 10000;

  for (uint32 s=0; s<S->numSeq(); s++) {
    //  seqLen == 0 iff that sequence is not loaded.
    if (S->seqLen(s) == 0)
      continue;

    snprintf(outputName, FILENAME_MAX, "%s.density.seq%02" F_U32P, outputPrefix, s);

    outputFile = fopen(outputName, "w");

    fprintf(stderr, "Starting '%s'\n", outputName);
    fprintf(outputFile, "#window\tunique\trepeat\tgaps\n");

    //  Not the most efficient, but good enough for us right now.
    for (int32 p=0; p<S->seqLen(s); ) {
      uint32 windowSize = 0;
      uint32 uniqueSum  = 0;
      uint32 repeatSum  = 0;
      uint32 gapSum     = 0;

      while ((windowSize < windowSizeMax) && (p < S->seqLen(s))) {
        char m = S->masking(s, p);

        if (m == 'u')  uniqueSum++;
        if (m == 'g')  gapSum++;
        if (m == 'r')  repeatSum++;

        windowSize++;
        p++;
      }

      fprintf(outputFile, F_U32"\t%f\t%f\t%f\n",
              p - windowSize,
              (double)uniqueSum / windowSize,
              (double)repeatSum / windowSize,
              (double)gapSum    / windowSize);
    }

    fclose(outputFile);
  }
}


//  For each 'r' mer, compute the number of 'u' mers
//  that are within some mean +- stddev range.
//
//  We count for two blocks:
//
//    | <- mean -> | <- mean -> |
//    ---[block1]---------------mer---------------[block2]---
//
//  Once we know that, we can compute the probability that
//  a repeat mer can be rescued.
//
//  p1 = uniq/total           -- for 1 X coverage
//  pn = 1 - (1-p1)^n         -- for n X coverage

void
computeMateRescue(merMaskedSequence *S, char *outputPrefix, mateRescueData *lib, uint32 libLen) {
  char     outputName[FILENAME_MAX];
  FILE    *outputFile;
  FILE    *outputData;

  uint32   closeRepeatsLen = 0;
  uint32   closeRepeatsMax = 80000;
  int32   *closeRepeats    = new int32 [closeRepeatsMax];

  speedCounter *CT = new speedCounter(" Examining repeats: %7.2f Kbases -- %5.2f Kbases/second\r", 1000.0, 0x1ffff, true);

  uint32   totalDepth = 0;
  for (uint32 l=0; l<libLen; l++)
    totalDepth += lib[l].coverage();

  for (uint32 s=0; s<S->numSeq(); s++) {
    //  seqLen == 0 iff that sequence is not loaded.
if (S->seqLen(s) == 0) continue; fprintf(stderr, "Starting sequence " F_U32 "\n", s); snprintf(outputName, FILENAME_MAX, "%s.mateRescue.seq%02" F_U32P ".out", outputPrefix, s); outputFile = fopen(outputName, "w"); snprintf(outputName, FILENAME_MAX, "%s.mateRescue.seq%02" F_U32P ".dat", outputPrefix, s); outputData = fopen(outputName, "w"); double numRR[MAX_COVERAGE] = {0}; // num repeats rescued (expected) for [] X coverage double numNR[MAX_COVERAGE] = {0}; // num repeats nonrescuable (expected) for [] X coverage uint32 numRT = 0; // num repeats total for (int32 p=0; pseqLen(s); p++) { CT->tick(); double pRtot = 0.0; double pFtot = 0.0; if ((S->masking(s, p) != 'g') && (S->masking(s, p) != 'u') && (S->masking(s, p) != 'r')) fprintf(stderr, "INVALID MASKING - got %d = %c\n", S->masking(s, p), S->masking(s, p)); if (S->masking(s, p) == 'r') { numRT++; // Index over x-coverage in libraries. MUST BE 1. uint32 ridx = 1; for (uint32 l=0; lrepeatID(s, p) > 0) { int32 pl = (int32)floor(p - 3 * stddev); int32 ph = (int32)ceil (p + 3 * stddev); if (pl < 0) pl = 0; if (ph > S->seqLen(s)) ph = S->seqLen(s); for (int32 pi=pl; pirepeatID(s, pi) == S->repeatID(s, p)) && (pi != p)) closeRepeats[closeRepeatsLen++] = pi; } int32 b1l = (int32)floor(p - mean - 3 * stddev); int32 b1h = (int32)ceil (p - mean + 3 * stddev); int32 b2l = (int32)floor(p + mean - 3 * stddev); int32 b2h = (int32)ceil (p + mean + 3 * stddev); if (b1l < 0) b1l = 0; if (b1h < 0) b1h = 0; if (b1h > S->seqLen(s)) b1h = S->seqLen(s); if (b2l < 0) b2l = 0; if (b2h > S->seqLen(s)) b2h = S->seqLen(s); if (b2l > S->seqLen(s)) b2l = S->seqLen(s); //fprintf(stderr, "b1: %d-%d b2:%d-%d\n", b1l, b1h, b2l, b2h); // probability we can rescue this repeat with this mate pair double pRescue = 0.0; double pFailed = 0.0; if (closeRepeatsLen == 0) { // No close repeats, use the fast method. 
for (int32 b=b1l; bmasking(s, b) == 'u') pRescue += lib[l].normal(b - p + mean); } for (int32 b=b2l; bmasking(s, b) == 'u') pRescue += lib[l].normal(b - p - mean); } } else { // Close repeats, gotta be slow. for (int32 b=b1l; bmasking(s, b) == 'u') { int32 mrl = b + mean - 3 * stddev; int32 mrh = b + mean + 3 * stddev; bool rescuable = true; for (uint32 cri=0; rescuable && crimasking(s, b) == 'u') { int32 mrl = b - mean - 3 * stddev; int32 mrh = b - mean + 3 * stddev; bool rescuable = true; for (uint32 cri=0; rescuable && crimerSize(), numRT, numRR[x], numNR[x], x, lib[l].mean(), lib[l].stddev()); n++; if (n >= lib[l].coverage()) { l++; n = 0; } } fclose(outputFile); fclose(outputData); } delete CT; } int main(int argc, char **argv) { char *merylName = 0L; char *fastaName = 0L; char *outputPrefix = 0L; uint32 onlySeqIID = ~uint32ZERO; bool doDensity = false; bool doRescue = false; mateRescueData lib[MAX_COVERAGE]; uint32 libLen = 0; int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-mers") == 0) { merylName = argv[++arg]; } else if (strcmp(argv[arg], "-seq") == 0) { fastaName = argv[++arg]; } else if (strcmp(argv[arg], "-only") == 0) { onlySeqIID = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-output") == 0) { outputPrefix = argv[++arg]; } else if (strcmp(argv[arg], "-d") == 0) { doDensity = true; } else if (strcmp(argv[arg], "-r") == 0) { if (atoi(argv[arg+3]) > 0) { doRescue = true; lib[libLen++].init(atoi(argv[arg+1]), atoi(argv[arg+2]), atoi(argv[arg+3])); } arg += 3; } else { fprintf(stderr, "unknown option '%s'\n", argv[arg]); err++; } arg++; } if ((err) || (merylName == 0L) || (fastaName == 0L) || (outputPrefix == 0L)) { fprintf(stderr, "usage: %s -mers mers -seq fasta -output prefix [-d] [-r mean stddev coverage]\n", argv[0]); exit(1); } merMaskedSequence *S = new merMaskedSequence(fastaName, merylName, onlySeqIID); if (doDensity) computeDensity(S, outputPrefix); if (doRescue) computeMateRescue(S, outputPrefix, lib, libLen); return(0); } 
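The mate-rescue model above combines a discretized normal over mate-pair offsets (built in mateRescueData::init() with a = 1/(stddev*sqrt(2*pi)) and c = 2*stddev^2) with the coverage formula pn = 1 - (1-p1)^n from the comment before computeMateRescue(). A small standalone sketch of just that math; the function names are ours:

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

//  Discretized normal weights over integer offsets in [-3*stddev, +3*stddev],
//  built the same way mateRescueData::init() fills its _normal table.
std::vector<double> normalTable(int32_t stddev) {
  double a = 1.0 / (stddev * std::sqrt(2.0 * M_PI));
  double c = 2.0 * (double)stddev * (double)stddev;

  std::vector<double> N;
  for (int32_t o = -3 * stddev; o <= 3 * stddev; o++)
    N.push_back(a * std::exp(-(double)o * (double)o / c));
  return N;
}

//  pn = 1 - (1-p1)^n: the chance that at least one of n independent mate
//  placements rescues a repeat, given per-placement success probability p1.
double rescueProbability(double p1, uint32_t n) {
  return 1.0 - std::pow(1.0 - p1, (double)n);
}
```

In the real code p1 is the normal-weighted mass of 'u' positions in the two blocks flanking the repeat, so pRescue already folds the table into the per-placement probability before the coverage exponentiation is applied.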
canu-1.6/src/meryl/maskMers.mk000066400000000000000000000007571314437614700163630ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := maskMers SOURCES := maskMers.C SRC_INCDIRS := .. ../AS_UTL libleaff TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lleaff -lcanu TGT_PREREQS := libleaff.a libcanu.a SUBMAKEFILES := canu-1.6/src/meryl/meryl-args.C000066400000000000000000000544771314437614700164460ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/meryl/args.C * * Modifications by: * * Brian P. Walenz from 2004-MAR-31 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-APR-05 to 2004-JUL-01 * are Copyright 2004 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-MAR-20 to 2014-APR-11 * are Copyright 2005-2011,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2015-MAY-29 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
Walenz beginning on 2016-SEP-14 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "meryl.H" // Some string handling utilities. // bool writeString(const char *str, FILE *F) { errno = 0; uint32 len = 0; if (str) { len = (uint32)strlen(str) + 1; fwrite(&len, sizeof(uint32), 1, F); fwrite( str, sizeof(char), len, F); } else { fwrite(&len, sizeof(uint32), 1, F); } if (errno) { fprintf(stderr, "writeString()-- Failed to write string of length " F_U32 ": %s\n", len, strerror(errno)); fprintf(stderr, "writeString()-- First 80 bytes of string is:\n"); fprintf(stderr, "%80.80s\n", str); return(false); } return(true); } char* readString(FILE *F) { errno = 0; uint32 len = 0; fread(&len, sizeof(uint32), 1, F); if (errno) { fprintf(stderr, "readString()-- Failed to read string: %s\n", strerror(errno)); exit(1); } char *str = 0L; if (len > 0) { str = new char [len]; fread(str, sizeof(char), len, F); if (errno) { fprintf(stderr, "readString()-- Failed to read string: %s\n", strerror(errno)); exit(1); } } return(str); } char* duplString(char *str) { char *dupstr = 0L; if (str) { uint32 len = (uint32)strlen(str); dupstr = new char [len+1]; strcpy(dupstr, str); } return(dupstr); } void merylArgs::usage(void) { fprintf(stderr, "usage: %s [personality] [global options] [options]\n", execName); fprintf(stderr, "\n"); fprintf(stderr, "where personality is:\n"); fprintf(stderr, " -P -- compute parameters\n"); fprintf(stderr, " -B -- build table\n"); fprintf(stderr, " -S -- scan table\n"); fprintf(stderr, " -M -- \"math\" operations\n"); fprintf(stderr, " -D -- dump table\n"); fprintf(stderr, "\n"); fprintf(stderr, "-P: Given a sequence file (-s) or an upper limit on the\n"); fprintf(stderr, " number of mers in the file (-n), compute the table size\n"); fprintf(stderr, " (-t in build) to minimize the memory 
usage.\n"); fprintf(stderr, " -m # (size of a mer; required)\n"); fprintf(stderr, " -c # (homopolymer compression; optional)\n"); fprintf(stderr, " -p (enable positions)\n"); fprintf(stderr, " -s seq.fasta (seq.fasta is scanned to determine the number of mers)\n"); fprintf(stderr, " -n # (compute params assuming file with this many mers in it)\n"); fprintf(stderr, "\n"); fprintf(stderr, " Only one of -s, -n needs to be specified. If both are given,\n"); fprintf(stderr, " -s takes priority.\n"); fprintf(stderr, "\n"); fprintf(stderr, "\n"); fprintf(stderr, "-B: Given a sequence file (-s) and lots of parameters, compute\n"); fprintf(stderr, " the mer-count tables. By default, both strands are processed.\n"); fprintf(stderr, " -f (only build for the forward strand)\n"); fprintf(stderr, " -r (only build for the reverse strand)\n"); fprintf(stderr, " -C (use canonical mers, assumes both strands)\n"); fprintf(stderr, " -L # (DON'T save mers that occur less than # times)\n"); fprintf(stderr, " -U # (DON'T save mers that occur more than # times)\n"); fprintf(stderr, " -m # (size of a mer; required)\n"); fprintf(stderr, " -c # (homopolymer compression; optional)\n"); fprintf(stderr, " -p (enable positions)\n"); fprintf(stderr, " -s seq.fasta (sequence to build the table for)\n"); fprintf(stderr, " -o tblprefix (output table prefix)\n"); fprintf(stderr, " -v (entertain the user)\n"); fprintf(stderr, "\n"); fprintf(stderr, " By default, the computation is done as one large sequential process.\n"); fprintf(stderr, " Multi-threaded operation is possible, at additional memory expense, as\n"); fprintf(stderr, " is segmented operation, at additional I/O expense.\n"); fprintf(stderr, "\n"); fprintf(stderr, " Threaded operation: Split the counting into n almost-equally sized\n"); fprintf(stderr, " pieces. 
This uses an extra h MB (from -P) per thread.\n"); fprintf(stderr, " -threads n (use n threads to build)\n"); fprintf(stderr, "\n"); fprintf(stderr, " Segmented, sequential operation: Split the counting into pieces that\n"); fprintf(stderr, " will fit into no more than m MB of memory, or into n equal sized pieces.\n"); fprintf(stderr, " Each piece is computed sequentially, and the results are merged at the end.\n"); fprintf(stderr, " Only one of -memory and -segments is needed.\n"); fprintf(stderr, " -memory mMB (use at most m MB of memory per segment)\n"); fprintf(stderr, " -segments n (use n segments)\n"); fprintf(stderr, "\n"); fprintf(stderr, " Segmented, batched operation: Same as sequential, except this allows\n"); fprintf(stderr, " each segment to be manually executed in parallel.\n"); fprintf(stderr, " Only one of -memory and -segments is needed.\n"); fprintf(stderr, " -memory mMB (use at most m MB of memory per segment)\n"); fprintf(stderr, " -segments n (use n segments)\n"); fprintf(stderr, " -configbatch (create the batches)\n"); fprintf(stderr, " -countbatch n (run batch number n)\n"); fprintf(stderr, " -mergebatch (merge the batches)\n"); fprintf(stderr, " Initialize the compute with -configbatch, which needs all the build options.\n"); fprintf(stderr, " Execute all -countbatch jobs, then -mergebatch to complete.\n"); fprintf(stderr, " meryl -configbatch -B [options] -o file\n"); fprintf(stderr, " meryl -countbatch 0 -o file\n"); fprintf(stderr, " meryl -countbatch 1 -o file\n"); fprintf(stderr, " ...\n"); fprintf(stderr, " meryl -countbatch N -o file\n"); fprintf(stderr, " meryl -mergebatch N -o file\n"); fprintf(stderr, " Batched mode can run on the grid.\n"); fprintf(stderr, " -sge jobname unique job name for this execution. 
Meryl will submit\n"); fprintf(stderr, " jobs with name mpjobname, mcjobname, mmjobname, for\n"); fprintf(stderr, " phases prepare, count and merge.\n"); fprintf(stderr, " -sgebuild \"options\" any additional options to sge, e.g.,\n"); fprintf(stderr, " -sgemerge \"options\" \"-p -153 -pe thread 2 -A merylaccount\"\n"); fprintf(stderr, " N.B. - -N will be ignored\n"); fprintf(stderr, " N.B. - be sure to quote the options\n"); fprintf(stderr, "\n"); fprintf(stderr, "-M: Given a list of tables, perform a math, logical or threshold operation.\n"); fprintf(stderr, " Unless specified, all operations take any number of databases.\n"); fprintf(stderr, "\n"); fprintf(stderr, " Math operations are:\n"); fprintf(stderr, " min count is the minimum count for all databases. If the mer\n"); fprintf(stderr, " does NOT exist in all databases, the mer has a zero count, and\n"); fprintf(stderr, " is NOT in the output.\n"); fprintf(stderr, " minexist count is the minimum count for all databases that contain the mer\n"); fprintf(stderr, " max count is the maximum count for all databases\n"); fprintf(stderr, " add count is the sum of the counts for all databases\n"); fprintf(stderr, " sub count is the first minus the second (binary only)\n"); fprintf(stderr, " abs count is the absolute value of the first minus the second (binary only)\n"); fprintf(stderr, "\n"); fprintf(stderr, " Logical operations are:\n"); fprintf(stderr, " and outputs mer iff it exists in all databases\n"); fprintf(stderr, " nand outputs mer iff it exists in at least one, but not all, databases\n"); fprintf(stderr, " or outputs mer iff it exists in at least one database\n"); fprintf(stderr, " xor outputs mer iff it exists in an odd number of databases\n"); fprintf(stderr, "\n"); fprintf(stderr, " Threshold operations are:\n"); fprintf(stderr, " lessthan x outputs mer iff it has count < x\n"); fprintf(stderr, " lessthanorequal x outputs mer iff it has count <= x\n"); fprintf(stderr, " greaterthan x outputs mer iff it has 
count > x\n"); fprintf(stderr, " greaterthanorequal x outputs mer iff it has count >= x\n"); fprintf(stderr, " equal x outputs mer iff it has count == x\n"); fprintf(stderr, " Threshold operations work on exactly one database.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -s tblprefix (use tblprefix as a database)\n"); fprintf(stderr, " -o tblprefix (create this output)\n"); fprintf(stderr, " -v (entertain the user)\n"); fprintf(stderr, "\n"); fprintf(stderr, " NOTE: Multiple tables are specified with multiple -s switches; e.g.:\n"); fprintf(stderr, " %s -M add -s 1 -s 2 -s 3 -s 4 -o all\n", execName); fprintf(stderr, " NOTE: It is NOT possible to specify more than one operation:\n"); fprintf(stderr, " %s -M add -s 1 -s 2 -sub -s 3\n", execName); fprintf(stderr, " will NOT work.\n"); fprintf(stderr, "\n"); fprintf(stderr, "\n"); fprintf(stderr, "-D: Dump the table (not all of these work).\n"); fprintf(stderr, "\n"); fprintf(stderr, " -Dd Dump a histogram of the distance between the same mers.\n"); fprintf(stderr, " -Dt Dump mers >= a threshold. 
Use -n to specify the threshold.\n"); fprintf(stderr, " -Dc Count the number of mers, distinct mers and unique mers.\n"); fprintf(stderr, " -Dh Dump (to stdout) a histogram of mer counts.\n"); fprintf(stderr, " -s Read the count table from here (leave off the .mcdat or .mcidx).\n"); fprintf(stderr, "\n"); fprintf(stderr, "\n"); } void merylArgs::clear(void) { execName = 0L; options = 0L; beVerbose = false; doForward = true; doReverse = false; doCanonical = false; inputFile = 0L; outputFile = 0L; queryFile = 0L; merSize = 20; merComp = 0; positionsEnabled = false; numMersEstimated = 0; numMersActual = 0; numBasesActual = 0; mersPerBatch = 0; basesPerBatch = 0; numBuckets = 0; numBuckets_log2 = 0; merDataWidth = 0; merDataMask = uint64ZERO; bucketPointerWidth = 0; numThreads = 0; memoryLimit = 0; segmentLimit = 0; configBatch = false; countBatch = false; mergeBatch = false; batchNumber = 0; sgeJobName = 0L; sgeBuildOpt = 0L; sgeMergeOpt = 0L; isOnGrid = false; lowCount = 0; highCount = ~lowCount; desiredCount = 0; outputCount = 0; outputAll = 0; outputPosition = 0; mergeFilesMax = 0; mergeFilesLen = 0; mergeFiles = 0L; personality = 0; } merylArgs::merylArgs(int argc, char **argv) { clear(); execName = duplString(argv[0]); if (argc == 1) { usage(); exit(1); } // Count how many '-s' switches there are, then allocate space // for them in mergeFiles. We also sum the length of all options, // so we can copy them into an 'options' string used when we // resubmit to the grid. // uint32 optionsLen = 0; for (int arg=1; arg < argc; arg++) { optionsLen += strlen(argv[arg]) + 1; if (strcmp(argv[arg], "-s") == 0) mergeFilesMax++; } mergeFiles = new char * [mergeFilesMax]; options = new char [2 * optionsLen + 1]; options[0] = 0; bool fail = false; char *optptr = options; for (int arg=1; arg < argc; arg++) { if (arg > 1) *optptr++ = ' '; // Arg! If the arg has spaces or other stuff that the shell // needs escaped we need to escape them again. 
So, we copy byte // by byte and insert escapes at the right points. for (char *op=argv[arg]; *op; op++, optptr++) { if (isspace(*op) || !isalnum(*op)) if ((*op != '-') && (*op != '_') && (*op != '.') && (*op != '/')) *optptr++ = '\\'; *optptr = *op; } //strcat(options, argv[arg]); } // Parse the options // for (int arg=1; arg < argc; arg++) { if (strncmp(argv[arg], "-V", 2) == 0) { fprintf(stdout, "meryl the Mighty Mer Counter version (no version)\n"); exit(0); } else if (strcmp(argv[arg], "-m") == 0) { arg++; merSize = strtouint32(argv[arg]); } else if (strcmp(argv[arg], "-c") == 0) { arg++; merComp = strtouint32(argv[arg]); } else if (strcmp(argv[arg], "-p") == 0) { positionsEnabled = true; } else if (strcmp(argv[arg], "-s") == 0) { arg++; delete [] inputFile; inputFile = duplString(argv[arg]); mergeFiles[mergeFilesLen++] = duplString(argv[arg]); } else if (strcmp(argv[arg], "-n") == 0) { arg++; numMersEstimated = strtouint64(argv[arg]); } else if (strcmp(argv[arg], "-f") == 0) { doForward = true; doReverse = false; doCanonical = false; } else if (strcmp(argv[arg], "-r") == 0) { doForward = false; doReverse = true; doCanonical = false; } else if (strcmp(argv[arg], "-C") == 0) { doForward = false; doReverse = false; doCanonical = true; } else if (strcmp(argv[arg], "-L") == 0) { arg++; lowCount = strtouint32(argv[arg]); } else if (strcmp(argv[arg], "-U") == 0) { arg++; highCount = strtouint32(argv[arg]); } else if (strcmp(argv[arg], "-o") == 0) { arg++; delete [] outputFile; outputFile = duplString(argv[arg]); } else if (strcmp(argv[arg], "-v") == 0) { beVerbose = true; } else if (strcmp(argv[arg], "-P") == 0) { personality = 'P'; } else if (strcmp(argv[arg], "-B") == 0) { personality = 'B'; } else if (strcmp(argv[arg], "-S") == 0) { personality = 'S'; } else if (strcmp(argv[arg], "-M") == 0) { arg++; if (strcmp(argv[arg], "merge") == 0) { personality = PERSONALITY_MERGE; } else if (strcmp(argv[arg], "min") == 0) { personality = PERSONALITY_MIN; } else if 
(strcmp(argv[arg], "minexist") == 0) { personality = PERSONALITY_MINEXIST; } else if (strcmp(argv[arg], "max") == 0) { personality = PERSONALITY_MAX; } else if (strcmp(argv[arg], "maxexist") == 0) { personality = PERSONALITY_MAXEXIST; } else if (strcmp(argv[arg], "add") == 0) { personality = PERSONALITY_ADD; } else if (strcmp(argv[arg], "sub") == 0) { personality = PERSONALITY_SUB; } else if (strcmp(argv[arg], "abs") == 0) { personality = PERSONALITY_ABS; } else if (strcmp(argv[arg], "divide") == 0) { personality = PERSONALITY_DIVIDE; } else if (strcmp(argv[arg], "and") == 0) { personality = PERSONALITY_AND; } else if (strcmp(argv[arg], "nand") == 0) { personality = PERSONALITY_NAND; } else if (strcmp(argv[arg], "or") == 0) { personality = PERSONALITY_OR; } else if (strcmp(argv[arg], "xor") == 0) { personality = PERSONALITY_XOR; } else if (strcmp(argv[arg], "lessthan") == 0) { personality = PERSONALITY_LEQ; arg++; desiredCount = strtouint32(argv[arg]) - 1; } else if (strcmp(argv[arg], "lessthanorequal") == 0) { personality = PERSONALITY_LEQ; arg++; desiredCount = strtouint32(argv[arg]); } else if (strcmp(argv[arg], "greaterthan") == 0) { personality = PERSONALITY_GEQ; arg++; desiredCount = strtouint32(argv[arg]) + 1; } else if (strcmp(argv[arg], "greaterthanorequal") == 0) { personality = PERSONALITY_GEQ; arg++; desiredCount = strtouint32(argv[arg]); } else if (strcmp(argv[arg], "equal") == 0) { personality = PERSONALITY_EQ; arg++; desiredCount = strtouint32(argv[arg]); } else { fprintf(stderr, "ERROR: unknown math personality %s\n", argv[arg]); exit(1); } } else if (strcmp(argv[arg], "-Dd") == 0) { personality = 'd'; } else if (strcmp(argv[arg], "-Dt") == 0) { personality = 't'; } else if (strcmp(argv[arg], "-Dp") == 0) { personality = 'p'; } else if (strcmp(argv[arg], "-Dc") == 0) { personality = 'c'; } else if (strcmp(argv[arg], "-Dh") == 0) { personality = 'h'; } else if (strcmp(argv[arg], "-memory") == 0) { arg++; memoryLimit = strtouint64(argv[arg]) * 1024 * 
1024; } else if (strcmp(argv[arg], "-segments") == 0) { arg++; segmentLimit = strtouint64(argv[arg]); } else if (strcmp(argv[arg], "-threads") == 0) { arg++; numThreads = strtouint32(argv[arg]); } else if (strcmp(argv[arg], "-configbatch") == 0) { personality = 'B'; configBatch = true; countBatch = false; mergeBatch = false; batchNumber = uint32ZERO; } else if (strcmp(argv[arg], "-countbatch") == 0) { arg++; personality = 'B'; configBatch = false; countBatch = true; mergeBatch = false; batchNumber = strtouint32(argv[arg]); } else if (strcmp(argv[arg], "-mergebatch") == 0) { personality = 'B'; configBatch = false; countBatch = false; mergeBatch = true; batchNumber = uint32ZERO; } else if (strcmp(argv[arg], "-sge") == 0) { sgeJobName = argv[++arg]; } else if (strcmp(argv[arg], "-sgebuild") == 0) { sgeBuildOpt = argv[++arg]; } else if (strcmp(argv[arg], "-sgemerge") == 0) { sgeMergeOpt = argv[++arg]; } else if (strcmp(argv[arg], "-forcebuild") == 0) { isOnGrid = true; } else { fprintf(stderr, "Unknown option '%s'.\n", argv[arg]); fail = true; } } // Using threads is only useful if we are not a batch. // if ((numThreads > 0) && (configBatch || countBatch || mergeBatch)) { if (configBatch) fprintf(stderr, "WARNING: -threads has no effect with -configbatch, disabled.\n"); if (countBatch) fprintf(stderr, "WARNING: -threads has no effect with -countbatch, disabled.\n"); if (mergeBatch) fprintf(stderr, "WARNING: -threads has no effect with -mergebatch, disabled.\n"); numThreads = 1; } if (numThreads == 0) numThreads = omp_get_max_threads(); omp_set_num_threads(numThreads); // SGE is not useful unless we are in batch mode. 
// if (sgeJobName && !configBatch && !countBatch && !mergeBatch) { fprintf(stderr, "ERROR: -sge not useful unless in batch mode (replace -B with -configbatch)\n"); exit(1); } if (fail) exit(1); } merylArgs::merylArgs(const char *prefix) { char filename[FILENAME_MAX]; clear(); snprintf(filename, FILENAME_MAX, "%s.merylArgs", prefix); errno = 0; FILE *F = fopen(filename, "rb"); if (errno) { fprintf(stderr, "merylArgs::readConfig()-- Failed to open '%s': %s\n", filename, strerror(errno)); exit(1); } char magic[17] = {0}; fread(magic, sizeof(char), 16, F); if (strncmp(magic, "merylBatcherv02", 16) != 0) { fprintf(stderr, "merylArgs::readConfig()-- '%s' doesn't appear to be a merylArgs file.\n", filename); exit(1); } // Load the config, then reset the pointers. fread(this, sizeof(merylArgs), 1, F); execName = readString(F); options = 0L; inputFile = readString(F); outputFile = readString(F); queryFile = 0L; sgeJobName = readString(F); sgeBuildOpt = readString(F); sgeMergeOpt = readString(F); mergeFiles = new char* [mergeFilesLen]; for (uint32 i=0; i<mergeFilesLen; i++) mergeFiles[i] = readString(F); fclose(F); } #include <stdio.h> #include <stdlib.h> #include <string.h> #include <math.h> #include "meryl.H" #include "libmeryl.H" void binaryOperations(merylArgs *args) { if (args->mergeFilesLen != 2) { fprintf(stderr, "ERROR - must have exactly two files!\n"); exit(1); } if (args->outputFile == 0L) { fprintf(stderr, "ERROR - no output file specified.\n"); exit(1); } if ((args->personality != PERSONALITY_SUB) && (args->personality != PERSONALITY_ABS) && (args->personality != PERSONALITY_DIVIDE)) { fprintf(stderr, "ERROR - only personalities sub, abs and divide\n"); fprintf(stderr, "ERROR - are supported in binaryOperations().\n"); fprintf(stderr, "ERROR - this is a coding error, not a user error.\n"); exit(1); } // Open the input files, read in the first mer // merylStreamReader *A = new merylStreamReader(args->mergeFiles[0]); merylStreamReader *B = new merylStreamReader(args->mergeFiles[1]); A->nextMer(); B->nextMer(); // Make sure that the mersizes agree, and pick a prefix size for // the 
output // if (A->merSize() != B->merSize()) { fprintf(stderr, "ERROR - mersizes are different!\n"); fprintf(stderr, "ERROR - mersize of '%s' is " F_U32 "\n", args->mergeFiles[0], A->merSize()); fprintf(stderr, "ERROR - mersize of '%s' is " F_U32 "\n", args->mergeFiles[1], B->merSize()); exit(1); } // Open the output file, using the larger of the two prefix sizes // merylStreamWriter *W = new merylStreamWriter(args->outputFile, A->merSize(), A->merCompression(), (A->prefixSize() > B->prefixSize()) ? A->prefixSize() : B->prefixSize(), A->hasPositions()); // SUB - report A - B // ABS - report the absolute difference between the two files // // These two operations are very similar (SUB was derived from ABS), so // any bug found in one is probably in the other. // kMer Amer; uint32 Acnt = uint32ZERO; kMer Bmer; uint32 Bcnt = uint32ZERO; switch (args->personality) { case PERSONALITY_SUB: while (A->validMer() || B->validMer()) { Amer = A->theFMer(); Acnt = A->theCount(); Bmer = B->theFMer(); Bcnt = B->theCount(); // If the A stream is all out of mers, set Amer to be the // same as Bmer, and set Acnt to zero. Similar for B. // if (!A->validMer()) { Amer = Bmer; Acnt = uint32ZERO; } if (!B->validMer()) { Bmer = Amer; Bcnt = uint32ZERO; } //fprintf(stderr, "sub A="uint64HEX" B="uint64HEX"\n", Amer, Bmer); if (Amer == Bmer) { W->addMer(Amer, (Acnt > Bcnt) ? Acnt - Bcnt : 0); A->nextMer(); B->nextMer(); } else if (Amer < Bmer) { W->addMer(Amer, Acnt); A->nextMer(); } else { B->nextMer(); } } break; case PERSONALITY_ABS: while (A->validMer() || B->validMer()) { Amer = A->theFMer(); Acnt = A->theCount(); Bmer = B->theFMer(); Bcnt = B->theCount(); // If the A stream is all out of mers, set Amer to be the // same as Bmer, and set Acnt to zero. Similar for B. // if (!A->validMer()) { Amer = Bmer; Acnt = uint32ZERO; } if (!B->validMer()) { Bmer = Amer; Bcnt = uint32ZERO; } if (Amer == Bmer) { W->addMer(Amer, (Acnt > Bcnt) ? 
Acnt - Bcnt : Bcnt - Acnt); A->nextMer(); B->nextMer(); } else if (Amer < Bmer) { W->addMer(Amer, Acnt); A->nextMer(); } else { W->addMer(Bmer, Bcnt); B->nextMer(); } } break; case PERSONALITY_DIVIDE: while (A->validMer() || B->validMer()) { Amer = A->theFMer(); Acnt = A->theCount(); Bmer = B->theFMer(); Bcnt = B->theCount(); // If the A stream is all out of mers, set Amer to be the // same as Bmer, and set Acnt to zero. Similar for B. // if (!A->validMer()) { Amer = Bmer; Acnt = uint32ZERO; } if (!B->validMer()) { Bmer = Amer; Bcnt = uint32ZERO; } if (Amer == Bmer) { if ((Acnt > 0) && (Bcnt > 0)) { double d = 1000.0 * (double)Acnt / (double)Bcnt; if (d > 4096.0 * 1024.0 * 1024.0) d = 4096.0 * 1024.0 * 1024.0; W->addMer(Amer, (uint32)floor(d)); } A->nextMer(); B->nextMer(); } else if (Amer < Bmer) { A->nextMer(); } else { B->nextMer(); } } break; } delete A; delete B; delete W; } canu-1.6/src/meryl/meryl-build.C000066400000000000000000000671161314437614700166030ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/meryl/build.C * * Modifications by: * * Brian P. Walenz from 2003-JAN-02 to 2004-APR-08 * are Copyright 2003-2004 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Clark Mobarry on 2004-FEB-12 * are Copyright 2004 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. 
Walenz from 2004-MAR-23 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-MAR-12 to 2014-APR-11 * are Copyright 2005-2009,2011-2012,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2015-JUL-01 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-29 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "meryl.H" #include "seqStream.H" #include "merStream.H" #include "speedCounter.H" void runThreaded(merylArgs *args); // You probably want this to be the same as KMER_WORDS, but in rare // cases, it can be less. // #define SORTED_LIST_WIDTH KMER_WORDS // to make the sorted list be wider, we also need to store wide // things in the bitpackedarray buckets. probably easy (do multiple // adds of data, each at most 64 bits) but not braindead. 
#if SORTED_LIST_WIDTH == 1 class sortedList_t { public: uint64 _w; uint32 _p; bool operator<(sortedList_t &that) { return(_w < that._w); }; bool operator>=(sortedList_t &that) { return(_w >= that._w); }; sortedList_t &operator=(sortedList_t &that) { _w = that._w; _p = that._p; return(*this); }; }; #else class sortedList_t { public: uint64 _w[SORTED_LIST_WIDTH]; uint32 _p; bool operator<(sortedList_t &that) { for (uint32 i=SORTED_LIST_WIDTH; i--; ) { if (_w[i] < that._w[i]) return(true); if (_w[i] > that._w[i]) return(false); } return(false); }; bool operator>=(sortedList_t &that) { for (uint32 i=SORTED_LIST_WIDTH; i--; ) { if (_w[i] > that._w[i]) return(true); if (_w[i] < that._w[i]) return(false); } return(true); }; sortedList_t &operator=(sortedList_t &that) { for (uint32 i=SORTED_LIST_WIDTH; i--; ) _w[i] = that._w[i]; _p = that._p; return(*this); }; }; #endif void adjustHeap(sortedList_t *M, int64 i, int64 n) { sortedList_t m = M[i]; int64 j = (i << 1) + 1; // let j be the left child while (j < n) { if ((j < n-1) && (M[j] < M[j+1])) j++; // j is now the larger child if (m >= M[j]) // a position for M[i] has been found break; M[(j-1)/2] = M[j]; // Move larger child up a level j = (j << 1) + 1; } M[(j-1)/2] = m; } void submitPrepareBatch(merylArgs *args) { FILE *F; char nam[FILENAME_MAX]; char cmd[FILENAME_MAX]; snprintf(nam, FILENAME_MAX, "%s-prepare.sh", args->outputFile); errno = 0; F = fopen(nam, "w"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", nam, strerror(errno)), exit(1); fprintf(F, "#!/bin/sh\n\n"); fprintf(F, ". 
$SGE_ROOT/$SGE_CELL/common/settings.sh\n"); fprintf(F, "%s -forcebuild %s\n", args->execName, args->options); fclose(F); if (args->sgeMergeOpt) snprintf(cmd, FILENAME_MAX, "qsub -cwd -b n -j y -o %s-prepare.err %s -N mp%s %s-prepare.sh", args->outputFile, args->sgeMergeOpt, args->sgeJobName, args->outputFile); else snprintf(cmd, FILENAME_MAX, "qsub -cwd -b n -j y -o %s-prepare.err -N mp%s %s-prepare.sh", args->outputFile, args->sgeJobName, args->outputFile); fprintf(stderr, "%s\n", cmd); if (system(cmd)) fprintf(stderr, "%s\nFailed to execute qsub command: %s\n", cmd, strerror(errno)), exit(1); } void submitCountBatches(merylArgs *args) { FILE *F; char nam[FILENAME_MAX]; char cmd[FILENAME_MAX]; snprintf(nam, FILENAME_MAX, "%s-count.sh", args->outputFile); errno = 0; F = fopen(nam, "w"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", nam, strerror(errno)), exit(1); fprintf(F, "#!/bin/sh\n\n"); fprintf(F, ". $SGE_ROOT/$SGE_CELL/common/settings.sh\n"); fprintf(F, "batchnum=`expr $SGE_TASK_ID - 1`\n"); fprintf(F, "%s -v -countbatch $batchnum -o %s\n", args->execName, args->outputFile); fclose(F); if (args->sgeBuildOpt) snprintf(cmd, FILENAME_MAX, "qsub -t 1-" F_U64 " -cwd -b n -j y -o %s-count-\\$TASK_ID.err %s -N mc%s %s-count.sh", args->segmentLimit, args->outputFile, args->sgeBuildOpt, args->sgeJobName, args->outputFile); else snprintf(cmd, FILENAME_MAX, "qsub -t 1-" F_U64 " -cwd -b n -j y -o %s-count-\\$TASK_ID.err -N mc%s %s-count.sh", args->segmentLimit, args->outputFile, args->sgeJobName, args->outputFile); fprintf(stderr, "%s\n", cmd); if (system(cmd)) fprintf(stderr, "%s\nFailed to execute qsub command: %s\n", cmd, strerror(errno)), exit(1); // submit the merge snprintf(nam, FILENAME_MAX, "%s-merge.sh", args->outputFile); errno = 0; F = fopen(nam, "w"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", nam, strerror(errno)), exit(1); fprintf(F, "#!/bin/sh\n\n"); fprintf(F, ". 
$SGE_ROOT/$SGE_CELL/common/settings.sh\n"); fprintf(F, "%s -mergebatch -o %s\n", args->execName, args->outputFile); fclose(F); if (args->sgeMergeOpt) snprintf(cmd, FILENAME_MAX, "qsub -hold_jid mc%s -cwd -b n -j y -o %s-merge.err %s -N mm%s %s-merge.sh", args->sgeJobName, args->outputFile, args->sgeMergeOpt, args->sgeJobName, args->outputFile); else snprintf(cmd, FILENAME_MAX, "qsub -hold_jid mc%s -cwd -b n -j y -o %s-merge.err -N mm%s %s-merge.sh", args->sgeJobName, args->outputFile, args->sgeJobName, args->outputFile); fprintf(stderr, "%s\n", cmd); if (system(cmd)) fprintf(stderr, "%s\nFailed to execute qsub command: %s\n", cmd, strerror(errno)), exit(1); } void prepareBatch(merylArgs *args) { bool fatalError = false; if (args->inputFile == 0L) fprintf(stderr, "ERROR - no input file specified.\n"), fatalError = true; if (args->outputFile == 0L) fprintf(stderr, "ERROR - no output file specified.\n"), fatalError = true; if ((args->doForward == false) && (args->doReverse == false) && (args->doCanonical == false)) fprintf(stderr, "ERROR - need to specify at least one of -f, -r, -C\n"), fatalError = true; if ((args->doForward && args->doReverse) || (args->doForward && args->doCanonical) || (args->doReverse && args->doCanonical)) fprintf(stderr, "ERROR - only one of -f, -r and -C may be specified!\n"), fatalError = true; if (args->lowCount > args->highCount) fprintf(stderr, "ERROR - lowCount > highCount??\n"), fatalError = true; if (args->segmentLimit && args->memoryLimit) fprintf(stderr, "ERROR: Only one of -memory and -segments can be specified.\n"), fatalError=true; if (fatalError) exit(1); // If we were given no segment or memory limit, but threads, we // really want to create n segments. 
// if ((args->numThreads > 0) && (args->segmentLimit == 0) && (args->memoryLimit == 0)) args->segmentLimit = args->numThreads; { seqStream *seqstr = new seqStream(args->inputFile); args->numBasesActual = 0; for (uint32 i=0; i<seqstr->numberOfSequences(); i++) args->numBasesActual += seqstr->lengthOf(i); merStream *merstr = new merStream(new kMerBuilder(args->merSize), seqstr, true, true); args->numMersActual = merstr->approximateNumberOfMers() + 1; delete merstr; } #warning not submitting prepareBatch to grid #if 0 if ((args->isOnGrid) || (args->sgeJobName == 0L)) { } else { // Shucks, we need to build the merstream file. Lets do it // on the grid! // submitPrepareBatch(args); exit(0); } #endif // If there is a memory limit, figure out how to divide the work into an integer multiple of // numThreads segments. // // Otherwise, if there is a segment limit, split the total number of mers into n pieces. // // Otherwise, we must be doing it all in one fell swoop. // if (args->memoryLimit) { args->mersPerBatch = estimateNumMersInMemorySize(args->merSize, args->memoryLimit, args->numThreads, args->positionsEnabled, args->beVerbose); // Degenerate case; if we can fit more per batch than there are in total, just divide them equally. if (args->mersPerBatch > args->numMersActual) args->mersPerBatch = args->numMersActual / args->numThreads; // Compute how many segments we need, rounding up. args->segmentLimit = (uint64)ceil((double)args->numMersActual / (double)args->mersPerBatch); // Then keep the compute balanced and make the number of segments be a multiple of the number of threads. 
args->segmentLimit = args->numThreads * (uint32)ceil((double)args->segmentLimit / (double)args->numThreads); } else if (args->segmentLimit) { args->mersPerBatch = (uint64)ceil((double)args->numMersActual / (double)args->segmentLimit); } else { args->mersPerBatch = args->numMersActual; args->segmentLimit = 1; } args->basesPerBatch = (uint64)ceil((double)args->numBasesActual / (double)args->segmentLimit); // Choose the optimal number of buckets to reduce memory usage. Yes, this is already done in // estimateNumMersInMemorySize() (but not saved) and we need to do it for the other cases anyway. // // We use the number of mers per batch + 1 because we need to store the first position after the // last mer. That is, if there are two mers, we will store that the first mer is at position 0, // the second mer is at position 1, and the end of the second mer is at position 2. // args->bucketPointerWidth = logBaseTwo64(args->basesPerBatch + 1); args->numBuckets_log2 = optimalNumberOfBuckets(args->merSize, args->basesPerBatch, args->positionsEnabled); args->numBuckets = (uint64ONE << args->numBuckets_log2); args->merDataWidth = args->merSize * 2 - args->numBuckets_log2; if (args->merDataWidth > SORTED_LIST_WIDTH * 64) { fprintf(stderr, " numMersActual = " F_U64 "\n", args->numMersActual); fprintf(stderr, " mersPerBatch = " F_U64 "\n", args->mersPerBatch); fprintf(stderr, " basesPerBatch = " F_U64 "\n", args->basesPerBatch); fprintf(stderr, " numBuckets = " F_U64 " (" F_U32 " bits)\n", args->numBuckets, args->numBuckets_log2); fprintf(stderr, " bucketPointerWidth = " F_U32 "\n", args->bucketPointerWidth); fprintf(stderr, " merDataWidth = " F_U32 "\n", args->merDataWidth); fprintf(stderr, "Sorry! merSize too big! 
Increase KMER_WORDS in libbio.kmer.H\n"); exit(1); } if (args->beVerbose) { fprintf(stderr, "Computing " F_U64 " segments using " F_U32 " threads and " F_U64 "MB memory (" F_U64 "MB if in one batch).\n", args->segmentLimit, args->numThreads, estimateMemory(args->merSize, args->mersPerBatch, args->positionsEnabled) * args->numThreads, estimateMemory(args->merSize, args->numMersActual, args->positionsEnabled)); fprintf(stderr, " numMersActual = " F_U64 "\n", args->numMersActual); fprintf(stderr, " mersPerBatch = " F_U64 "\n", args->mersPerBatch); fprintf(stderr, " basesPerBatch = " F_U64 "\n", args->basesPerBatch); fprintf(stderr, " numBuckets = " F_U64 " (" F_U32 " bits)\n", args->numBuckets, args->numBuckets_log2); fprintf(stderr, " bucketPointerWidth = " F_U32 "\n", args->bucketPointerWidth); fprintf(stderr, " merDataWidth = " F_U32 "\n", args->merDataWidth); } } void runSegment(merylArgs *args, uint64 segment) { merStream *M = 0L; merylStreamWriter *W = 0L; speedCounter *C = 0L; uint32 *bucketSizes = 0L; uint64 *bucketPointers = 0L; uint64 *merDataArray[SORTED_LIST_WIDTH] = { 0L }; uint32 *merPosnArray = 0L; // If this segment exists already, skip it. // // XXX: This should be a command line option. // XXX: This should check that the files are complete meryl files. // char filename[FILENAME_MAX]; snprintf(filename, FILENAME_MAX, "%s.batch" F_U64 ".mcdat", args->outputFile, segment); if (AS_UTL_fileExists(filename)) { if (args->beVerbose) fprintf(stderr, "Found result for batch " F_U64 " in %s.\n", segment, filename); return; } if ((args->beVerbose) && (args->segmentLimit > 1)) fprintf(stderr, "Computing segment " F_U64 " of " F_U64 ".\n", segment+1, args->segmentLimit); // Allocate space for bucket pointers and (temporary) bucket sizes. 
if (args->beVerbose) fprintf(stderr, " Allocating " F_U64 "MB for bucket pointer table (" F_U32 " bits wide).\n", (args->numBuckets * args->bucketPointerWidth + 128) >> 23, args->bucketPointerWidth); bucketPointers = new uint64 [(args->numBuckets * args->bucketPointerWidth + 128) >> 6]; if (args->beVerbose) fprintf(stderr, " Allocating " F_U64 "MB for counting the size of each bucket.\n", args->numBuckets >> 18); bucketSizes = new uint32 [ args->numBuckets ]; for (uint64 i=args->numBuckets; i--; ) bucketSizes[i] = uint32ZERO; // Position the mer stream at the start of this segments' mers. // The last segment goes until the stream runs out of mers, // everybody else does args->basesPerBatch mers. C = new speedCounter(" Counting mers in buckets: %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, args->beVerbose); M = new merStream(new kMerBuilder(args->merSize, args->merComp), new seqStream(args->inputFile), true, true); M->setBaseRange(args->basesPerBatch * segment, args->basesPerBatch * segment + args->basesPerBatch); char mstring[256]; if (args->doForward) { while (M->nextMer()) { //fprintf(stderr, "FMER %s\n", M->theFMer().merToString(mstring)); bucketSizes[ args->hash(M->theFMer()) ]++; C->tick(); } } if (args->doReverse) { while (M->nextMer()) { //fprintf(stderr, "RMER %s\n", M->theRMer().merToString(mstring)); bucketSizes[ args->hash(M->theRMer()) ]++; C->tick(); } } if (args->doCanonical) { while (M->nextMer()) { if (M->theFMer() <= M->theRMer()) { //fprintf(stderr, "FMER %s\n", M->theFMer().merToString(mstring)); bucketSizes[ args->hash(M->theFMer()) ]++; } else { //fprintf(stderr, "RMER %s\n", M->theRMer().merToString(mstring)); bucketSizes[ args->hash(M->theRMer()) ]++; } C->tick(); } } delete C; delete M; // Create the hash index using the counts. The hash points // to the end of the bucket; when we add a word, we move the // hash bucket pointer down one. // // When done, we can deallocate the counting table. 
// if (args->beVerbose) fprintf(stderr, " Creating bucket pointers.\n"); { uint64 mi=0; uint64 mj=0; uint64 mc=0; while (mi < args->numBuckets) { mc += bucketSizes[mi++]; setDecodedValue(bucketPointers, mj, args->bucketPointerWidth, mc); mj += args->bucketPointerWidth; } // Add the location of the end of the table. This is not // modified when adding words, but is used to determine // the size of the last bucket. // setDecodedValue(bucketPointers, mj, args->bucketPointerWidth, mc); } // All done with the counting table, get rid of it. if (args->beVerbose) fprintf(stderr, " Releasing " F_U64 "MB from counting the size of each bucket.\n", args->numBuckets >> 18); delete [] bucketSizes; // Allocate space for mer storage and (optional) position data. If mers are bigger than 32, we // allocate full words. if (args->beVerbose) fprintf(stderr, " Allocating " F_U64 "MB for mer storage (" F_U32 " bits wide).\n", (args->basesPerBatch * args->merDataWidth + 64) >> 23, args->merDataWidth); for (uint64 mword=0, width=args->merDataWidth; width > 0; ) { if (width >= 64) { merDataArray[mword] = new uint64 [ args->basesPerBatch + 1 ]; width -= 64; mword++; } else { merDataArray[mword] = new uint64 [ (args->basesPerBatch * width + 64) >> 6 ]; width = 0; } } // Position data. if (args->positionsEnabled) { if (args->beVerbose) fprintf(stderr, " Allocating " F_U64 "MB for mer position storage.\n", (args->basesPerBatch * 32 + 32) >> 23); merPosnArray = new uint32 [ args->basesPerBatch + 1 ]; } C = new speedCounter(" Filling mers into list: %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, args->beVerbose); M = new merStream(new kMerBuilder(args->merSize, args->merComp), new seqStream(args->inputFile), true, true); M->setBaseRange(args->basesPerBatch * segment, args->basesPerBatch * segment + args->basesPerBatch); while (M->nextMer()) { kMer const &m = ((args->doReverse) || (args->doCanonical && (M->theFMer() > M->theRMer()))) ? 
M->theRMer() : M->theFMer(); uint64 element = preDecrementDecodedValue(bucketPointers, args->hash(m) * args->bucketPointerWidth, args->bucketPointerWidth); #if SORTED_LIST_WIDTH == 1 // Even though this would work in the general loop below, we // special case one word mers to avoid the loop overhead. // setDecodedValue(merDataArray[0], element * args->merDataWidth, args->merDataWidth, m.endOfMer(args->merDataWidth)); #else for (uint64 mword=0, width=args->merDataWidth; width>0; ) { if (width >= 64) { merDataArray[mword][element] = m.getWord(mword); width -= 64; mword++; } else { setDecodedValue(merDataArray[mword], element * width, width, m.getWord(mword) & uint64MASK(width)); width = 0; } } #endif if (args->positionsEnabled) merPosnArray[element] = M->thePositionInStream(); C->tick(); } delete C; delete M; char batchOutputFile[FILENAME_MAX]; snprintf(batchOutputFile, FILENAME_MAX, "%s.batch" F_U64, args->outputFile, segment); C = new speedCounter(" Writing output: %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, args->beVerbose); W = new merylStreamWriter((args->segmentLimit == 1) ? 
args->outputFile : batchOutputFile, args->merSize, args->merComp, args->numBuckets_log2, args->positionsEnabled); // Sort each bucket into sortedList, then output the mers // sortedList_t *sortedList = 0L; uint32 sortedListMax = 0; uint32 sortedListLen = 0; for (uint64 bucket=0, bucketPos=0; bucket < args->numBuckets; bucket++) { uint64 st = getDecodedValue(bucketPointers, bucketPos, args->bucketPointerWidth); bucketPos += args->bucketPointerWidth; uint64 ed = getDecodedValue(bucketPointers, bucketPos, args->bucketPointerWidth); if (ed < st) { fprintf(stderr, "ERROR: In segment " F_U64 "\n", segment); fprintf(stderr, "ERROR: Bucket " F_U64 " (out of " F_U64 ") ends before it starts!\n", bucket, args->numBuckets); fprintf(stderr, "ERROR: start=" F_U64 "\n", st); fprintf(stderr, "ERROR: end =" F_U64 "\n", ed); } assert(ed >= st); if ((ed - st) > (uint64ONE << 30)) { fprintf(stderr, "ERROR: In segment " F_U64 "\n", segment); fprintf(stderr, "ERROR: Bucket " F_U64 " (out of " F_U64 ") is HUGE!\n", bucket, args->numBuckets); fprintf(stderr, "ERROR: start=" F_U64 "\n", st); fprintf(stderr, "ERROR: end =" F_U64 "\n", ed); } // Nothing here? Keep going. if (ed == st) continue; sortedListLen = (uint32)(ed - st); // Allocate more space, if we need to. // if (sortedListLen > sortedListMax) { delete [] sortedList; sortedList = new sortedList_t [2 * sortedListLen + 1]; sortedListMax = 2 * sortedListLen; } // Clear out the sortedList -- if we don't, we leave the high // bits unset which will probably make the sort random. 
// bzero(sortedList, sizeof(sortedList_t) * sortedListLen); // Unpack the mers into the sorting array // if (args->positionsEnabled) for (uint64 i=st; i<ed; i++) sortedList[i-st]._p = merPosnArray[i]; #if SORTED_LIST_WIDTH == 1 for (uint64 i=st, J=st * args->merDataWidth; i<ed; i++, J += args->merDataWidth) sortedList[i-st]._w = getDecodedValue(merDataArray[0], J, args->merDataWidth); #else for (uint64 i=st; i<ed; i++) { for (uint64 mword=0, width=args->merDataWidth; width>0; ) { if (width >= 64) { sortedList[i-st]._w[mword] = merDataArray[mword][i]; width -= 64; mword++; } else { sortedList[i-st]._w[mword] = getDecodedValue(merDataArray[mword], i * width, width); width = 0; } } } #endif // Sort if there is more than one item // if (sortedListLen > 1) { for (int64 t=(sortedListLen-2)/2; t>=0; t--) adjustHeap(sortedList, t, sortedListLen); for (int64 t=sortedListLen-1; t>0; t--) { sortedList_t tv = sortedList[t]; sortedList[t] = sortedList[0]; sortedList[0] = tv; adjustHeap(sortedList, 0, t); } } // Dump the list of mers to the file. // kMer mer(args->merSize); for (uint32 t=0; t<sortedListLen; t++) { C->tick(); // Build the complete mer // #if SORTED_LIST_WIDTH == 1 mer.setWord(0, sortedList[t]._w); #else for (uint64 mword=0; mword < SORTED_LIST_WIDTH; mword++) mer.setWord(mword, sortedList[t]._w[mword]); #endif mer.setBits(args->merDataWidth, args->numBuckets_log2, bucket); // Add it if (args->positionsEnabled) W->addMer(mer, 1, &sortedList[t]._p); else W->addMer(mer, 1, 0L); } } delete [] sortedList; delete C; delete W; for (uint32 x=0; x<SORTED_LIST_WIDTH; x++) delete [] merDataArray[x]; delete [] merPosnArray; delete [] bucketPointers; if (args->beVerbose) fprintf(stderr, "Segment " F_U64 " finished.\n", segment); } void build(merylArgs *args) { if (!args->countBatch && !args->mergeBatch) prepareBatch(args); // Since we write no output until after all the work is done, check that // we can actually make output files before starting the work.
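The per-bucket sort above is a textbook in-place heapsort: adjustHeap() sifts an element down a max-heap, the first loop heapifies, and the second repeatedly swaps the maximum to the shrinking end. A minimal sketch over plain uint64_t values (meryl sorts packed sortedList_t records instead):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Sift the element at 'root' down through a max-heap of 'len' items.
static void adjustHeap(std::vector<uint64_t> &L, int64_t root, int64_t len) {
  uint64_t v = L[root];
  int64_t  c = 2 * root + 1;               // left child
  while (c < len) {
    if (c + 1 < len && L[c] < L[c + 1])    // pick the larger child
      c++;
    if (v >= L[c])                         // heap property restored
      break;
    L[root] = L[c];
    root    = c;
    c       = 2 * root + 1;
  }
  L[root] = v;
}

void heapSort(std::vector<uint64_t> &L) {
  int64_t len = (int64_t)L.size();
  for (int64_t t = (len - 2) / 2; t >= 0; t--)   // heapify bottom-up
    adjustHeap(L, t, len);
  for (int64_t t = len - 1; t > 0; t--) {        // extract max, shrink heap
    std::swap(L[0], L[t]);
    adjustHeap(L, 0, t);
  }
}
```

Heapsort is a good fit here because it is in-place: the bucket array is already allocated at its exact size, and no auxiliary merge buffer is needed.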
char N[FILENAME_MAX]; sprintf(N, "%s.existenceTest", args->outputFile); if (AS_UTL_fileExists(args->outputFile, true) == true) fprintf(stderr, "ERROR: output prefix -o cannot be a directory.\n"), exit(1); if (AS_UTL_fileExists(N) == false) { errno = 0; FILE *F = fopen(N, "w"); if (errno) fprintf(stderr, "ERROR: can't make outputs with prefix '%s': %s\n", args->outputFile, strerror(errno)), exit(1); fclose(F); AS_UTL_unlink(N); } // Three choices: // // threaded -- start threads, launch pieces in each thread. This // thread waits for completion and then merges the results. // // batched -- write info file and exit. Compute and merge is done // on separate invocations. // // segmented -- write info file, then do each piece sequentially. // After all pieces finished, do a merge. // // bool doMerge = false; // Write out our configuration and exit if we are -configbatch if (args->configBatch) { args->writeConfig(); if (args->sgeJobName) { fprintf(stdout, "Batch prepared. Submitting to the grid.\n"); submitCountBatches(args); } else { fprintf(stdout, "Batch prepared. Please run:\n"); for (uint64 s=0; s<args->segmentLimit; s++) fprintf(stdout, "%s -countbatch " F_U64 " -o %s\n", args->execName, s, args->outputFile); fprintf(stdout, "%s -mergebatch -o %s\n", args->execName, args->outputFile); } } // Read back the configuration, run the segment and exit if we are -countbatch else if (args->countBatch) { merylArgs *savedArgs = new merylArgs(args->outputFile); savedArgs->beVerbose = args->beVerbose; runSegment(savedArgs, args->batchNumber); delete savedArgs; } // Check that all the files exist if we are -mergebatch and continue with execution // // MEMORY LEAK! We should delete this at the end of the function, but it's a pain, and who // cares? else if (args->mergeBatch) { merylArgs *savedArgs = new merylArgs(args->outputFile); savedArgs->beVerbose = args->beVerbose; args = savedArgs; doMerge = true; } // Otherwise, compute batches.
else { #pragma omp parallel for for (uint64 s=0; s<args->segmentLimit; s++) runSegment(args, s); doMerge = true; } // If there is more than one segment, merge them to get the output. // // We do this by constructing a meryl command line and recursively // (effectively) calling meryl. // // The command line is // // ./meryl -M merge [-v] -s batch1 -s batch2 ... -s batchN -o outputFile // if ((doMerge) && (args->segmentLimit > 1)) { if (args->beVerbose) fprintf(stderr, "Merge results.\n"); int argc = 0; char **argv = new char* [7 + 2 * args->segmentLimit]; bool *arga = new bool [7 + 2 * args->segmentLimit]; arga[argc] = false; argv[argc++] = "meryl-build-merge"; arga[argc] = false; argv[argc++] = "-M"; arga[argc] = false; argv[argc++] = "merge"; if (args->beVerbose) { arga[argc] = false; argv[argc++] = "-v"; } for (uint32 i=0; i<args->segmentLimit; i++) { arga[argc] = false; argv[argc++] = "-s"; arga[argc] = true; argv[argc] = new char [FILENAME_MAX]; snprintf(argv[argc], FILENAME_MAX, "%s.batch" F_U32, args->outputFile, i); argc++; } arga[argc] = false; argv[argc++] = "-o"; arga[argc] = false; argv[argc++] = args->outputFile; merylArgs *addArgs = new merylArgs(argc, argv); multipleOperations(addArgs); // Cleanup the memory leak.
// delete addArgs; for (int i=0; i<argc; i++) if (arga[i]) delete [] argv[i]; delete [] argv; delete [] arga; for (uint32 i=0; i<args->segmentLimit; i++) { char filename[FILENAME_MAX]; snprintf(filename, FILENAME_MAX, "%s.batch" F_U32 ".mcidx", args->outputFile, i); unlink(filename); snprintf(filename, FILENAME_MAX, "%s.batch" F_U32 ".mcdat", args->outputFile, i); unlink(filename); snprintf(filename, FILENAME_MAX, "%s.batch" F_U32 ".mcpos", args->outputFile, i); unlink(filename); } } // If we just merged, delete the merstream file // if (doMerge) { char filename[FILENAME_MAX]; snprintf(filename, FILENAME_MAX, "%s.merStream", args->outputFile); unlink(filename); } } canu-1.6/src/meryl/meryl-dump.C000066400000000000000000000126731314437614700164470ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/meryl/dump.C * * Modifications by: * * Brian P. Walenz from 2003-JAN-02 to 2004-APR-07 * are Copyright 2003-2004 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-APR-12 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-MAY-23 to 2014-APR-11 * are Copyright 2005,2007-2008,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P.
Walenz beginning on 2016-FEB-25 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include "meryl.H" #include "libmeryl.H" #include <algorithm> void dumpThreshold(merylArgs *args) { merylStreamReader *M = new merylStreamReader(args->inputFile); char str[1025]; while (M->nextMer()) { if (M->theCount() >= args->numMersEstimated) fprintf(stdout, ">" F_U64 "\n%s\n", M->theCount(), M->theFMer().merToString(str)); } delete M; } void dumpPositions(merylArgs *args) { merylStreamReader *M = new merylStreamReader(args->inputFile); char str[1025]; if (M->hasPositions() == false) { fprintf(stderr, "File '%s' contains no position information.\n", args->inputFile); } else { while (M->nextMer()) { fprintf(stdout, ">" F_U64, M->theCount()); for (uint32 i=0; i<M->theCount(); i++) fprintf(stdout, " " F_U32, M->getPosition(i)); fprintf(stdout, "\n%s\n", M->theFMer().merToString(str)); } } delete M; } void countUnique(merylArgs *args) { merylStreamReader *M = new merylStreamReader(args->inputFile); #warning make this a test #if 0 uint64 numDistinct = 0; uint64 numUnique = 0; uint64 numMers = 0; uint64 c = 0; while (M->nextMer()) { c = M->theCount(); numDistinct++; if (c == 1) numUnique++; numMers += c; } assert(numMers == M->numberOfTotalMers()); assert(numDistinct == M->numberOfDistinctMers()); assert(numUnique == M->numberOfUniqueMers()); fprintf(stderr, "OK\n"); #endif fprintf(stdout, "Found " F_U64 " mers.\n", M->numberOfTotalMers()); fprintf(stdout, "Found " F_U64 " distinct mers.\n", M->numberOfDistinctMers()); fprintf(stdout, "Found " F_U64 " unique mers.\n", M->numberOfUniqueMers()); delete M; } void plotHistogram(merylArgs *args) { uint64 distinct = 0; uint64 total = 0; merylStreamReader *M = new merylStreamReader(args->inputFile); fprintf(stderr, "Found " F_U64 " mers.\n", M->numberOfTotalMers());
fprintf(stderr, "Found " F_U64 " distinct mers.\n", M->numberOfDistinctMers()); fprintf(stderr, "Found " F_U64 " unique mers.\n", M->numberOfUniqueMers()); fprintf(stderr, "Largest mercount is " F_U64 ".\n", M->histogramMaximumCount()); for (uint32 i=1; i<M->histogramLength(); i++) { uint64 hist = M->histogram(i); if (hist > 0) { distinct += hist; total += hist * i; fprintf(stdout, F_U32"\t" F_U64 "\t%.4f\t%.4f\n", i, hist, distinct / (double)M->numberOfDistinctMers(), total / (double)M->numberOfTotalMers()); } } delete M; } void dumpDistanceBetweenMers(merylArgs *args) { merylStreamReader *M = new merylStreamReader(args->inputFile); // This is now tough because we don't know where the sequences end, // and our positions encode position in the chain. uint32 histMax = 64 * 1024 * 1024; uint64 *hist = new uint64 [histMax]; uint64 histHuge = 0; memset(hist, 0, sizeof(uint64) * histMax); if (M->hasPositions() == false) { fprintf(stderr, "File '%s' contains no position information.\n", args->inputFile); } else { while (M->nextMer()) { std::sort(M->thePositions(), M->thePositions() + M->theCount()); for (uint32 i=1; i<M->theCount(); i++) { uint32 d = M->getPosition(i) - M->getPosition(i-1); if (d < histMax) hist[d]++; else histHuge++; } } uint32 maxd = 0; for (uint32 d=0; d 25) ? 50 : 2 * merSize - 2; // Max width of bucket pointer table. // t - prefix stored in the bucket pointer table; number of entries in the table // N - width of a bucket pointer for (uint64 t=2; t < tMax; t++) { for (uint64 N=1; N<40; N++) { uint64 Nmin = uint64ONE << (N - 1); // Minimum number of mers we want to fit in the table uint64 Nmax = uint64ONE << (N); // Maximum number of mers that can fit in the table uint64 bucketsize = (uint64ONE << t) * N; // Size, in bits, of the pointer table uint64 n = (memLimt - bucketsize) / (2*merSize - t + posPerMer); // Number of mers we can fit into mer data table.
if ((memLimt > bucketsize) && // pointer table small enough to fit in memory (n > 0) && // at least some space to store mers (n <= Nmax) && // enough space for the mers in the data table (Nmin <= n) && // ...but not more than enough space (maxN < n)) { // this value of t fits more mers that any other seen so far maxN = n; bestT = t; } } } if (beVerbose) fprintf(stdout, "Can fit " F_U64 " mers into table with prefix of " F_U64 " bits, using %.3fMB (%.3fMB for positions)\n", maxN * numThreads, bestT, (((uint64ONE << bestT) * logBaseTwo64(maxN) + maxN * (2*merSize - bestT + posPerMer)) >> 3) * numThreads / 1048576.0, ((maxN * posPerMer) >> 3) * numThreads / 1048576.0); return(maxN); } uint64 estimateMemory(uint32 merSize, uint64 numMers, bool positionsEnabled) { uint64 posPerMer = (positionsEnabled == false) ? 0 : 32; uint64 tMax = (merSize > 25) ? 50 : 2 * merSize - 2; uint64 tMin = tMax; uint64 memMin = UINT64_MAX; for (uint64 t=2; t < tMax; t++) { uint64 N = logBaseTwo64(numMers); // Width of the bucket pointer table uint64 memUsed = ((uint64ONE << t) * logBaseTwo64(numMers) + numMers * (2 * merSize - t + posPerMer)) >> 3; if (memUsed < memMin) { tMin = t; memMin = memUsed; } //fprintf(stderr, "t=%2lu N=%2lu memUsed=%16lu -- tMin=%2lu memMin=%16lu\n", // t, N, memUsed, tMin, memMin); } return(memMin >> 20); } uint32 optimalNumberOfBuckets(uint32 merSize, uint64 numMers, bool positionsEnabled) { uint64 opth = ~uint64ZERO; uint64 opts = ~uint64ZERO; uint64 h = 0; uint64 s = 0; uint64 hwidth = logBaseTwo64(numMers); // Positions consume space too, but only if enabled. Probably // doesn't matter here. 
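estimateMemory() and optimalNumberOfBuckets() above both minimize the same cost model: with a prefix of t (or h) bits, the pointer table takes 2^t entries of pointer-width bits, while each mer then needs only its remaining 2*merSize - t suffix bits (plus 32 bits per position when positions are enabled). A simplified stand-alone sketch of that search, with a fixed pointer width; bestPrefixWidth is an illustrative name, not meryl's:

```cpp
#include <cstdint>

// Total bits for a given prefix width h:
//   bucket pointers:  (1 << h) entries of 'hwidth' bits each
//   mer data:         numMers entries of (2*merSize - h + posPerMer) bits
// Return the h in [2, 2*merSize) that minimizes the sum.
uint32_t bestPrefixWidth(uint32_t merSize, uint64_t numMers,
                         uint64_t hwidth, uint64_t posPerMer) {
  uint64_t bestH = 0;
  uint64_t bestS = UINT64_MAX;
  for (uint64_t h = 2; h < 2 * (uint64_t)merSize && h < 63; h++) {
    uint64_t s = (UINT64_C(1) << h) * hwidth +
                 numMers * (2 * (uint64_t)merSize - h + posPerMer);
    if (s < bestS) {
      bestH = h;
      bestS = s;
    }
  }
  return (uint32_t)bestH;
}
```

Growing the prefix by one bit doubles the pointer table but saves numMers bits of mer data, so the optimum sits roughly where 2^h * hwidth approaches numMers.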
// uint64 posPerMer = 0; if (positionsEnabled) posPerMer = 32; // Find the table size (in bits, h) that minimizes memory usage // for the given merSize and numMers // // We have two tables: // the bucket pointers num buckets * pointer width == 2 << h * hwidth // the mer data: num mers * (mersize - hwidth) // uint64 hmax = 64 - logBaseTwo64(hwidth + numMers * (2 * merSize - h)); for (h=2; h<=hmax && h<2*merSize; h++) { s = (uint64ONE << h) * hwidth + numMers * (2 * merSize - h + posPerMer); //fprintf(stderr, "optimalNumberOfBuckets()-- h=" F_U64 " s=" F_U64 "\n", h, s); if (s < opts) { opth = h; opts = s; } } return((uint32)opth); } void estimate(merylArgs *args) { if (args->inputFile) { merStream M(new kMerBuilder(args->merSize, args->merComp), new seqStream(args->inputFile), true, true); speedCounter C(" %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, args->beVerbose); if (args->beVerbose) fprintf(stderr, "Counting mers in '%s'\n", args->inputFile); args->numMersEstimated = 0; while (M.nextMer()) { C.tick(); args->numMersEstimated++; } C.finish(); } uint32 opth = optimalNumberOfBuckets(args->merSize, args->numMersEstimated, args->positionsEnabled); uint64 memu = ((uint64ONE << opth) * logBaseTwo64(args->numMersEstimated+1) + args->numMersEstimated * (2 * args->merSize - opth)); fprintf(stderr, F_U64" " F_U32 "-mers can be computed using " F_U64 "MB memory.\n", args->numMersEstimated, args->merSize, memu >> 23); } canu-1.6/src/meryl/meryl-merge-listmerge.C000066400000000000000000000315271314437614700205710ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/meryl/merge.listmerge.C * * Modifications by: * * Brian P. Walenz from 2008-JUN-20 to 2014-APR-11 * are Copyright 2008,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include "meryl.H" #include "libmeryl.H" using namespace std; #include <algorithm> struct mMer { kMer _mer; uint32 _cnt; uint32 _off; uint32 _nxt; uint32 _stp; }; class mMerList { public: mMerList(uint32 maxSize) { _posLen = 0; _posMax = 2 * maxSize; _pos = new uint32 [_posMax]; _mmmLen = 0; _mmmMax = maxSize; _mmm = new mMer [_mmmMax]; _tip = ~uint32ZERO; _fre = 0; for (uint32 i=0; i<_mmmMax; i++) { _mmm[i]._cnt = 0; _mmm[i]._off = 0; _mmm[i]._nxt = i+1; _mmm[i]._stp = 0; } _mmm[_mmmMax-1]._nxt = ~uint32ZERO; }; ~mMerList() { delete [] _pos; delete [] _mmm; }; bool loadMore(void) { return((_mmmMax < _tip) || (_mmm[_tip]._stp == 1)); }; uint32 length(void) { return(_mmmLen); }; kMer *pop(uint32 &cnt, uint32* &pos) { kMer *ret = 0L; //fprintf(stderr, "POP tip="uint32FMT"\n", _tip); if (_tip < _mmmMax) { uint32 f = _tip; ret = &_mmm[f]._mer; cnt = _mmm[f]._cnt; pos = (_mmm[f]._off != ~uint32ZERO) ? _pos + _mmm[f]._off : 0L; // Move tip to the next thing _tip = _mmm[f]._nxt; // And append this one to the free list.
_mmm[f]._nxt = _fre; _fre = f; _mmmLen--; //fprintf(stderr, "POP f="uint32FMT" tip="uint32FMT" len="uint32FMT"\n", f, _tip, _mmmLen); } return(ret); }; // rebuild the position list, squeezes out empty items void rebuild(void) { if (_posLen > 0) { assert(0); uint32 *np = new uint32 [_posMax]; _posLen = 0; for (uint32 i=0; i<_mmmLen; i++) { mMer *m = _mmm + i; if (m->_off != ~uint32ZERO) { _mmm[_mmmLen]._off = _posLen; for (uint32 p=0; p<m->_cnt; p++, _posLen++) np[_posLen] = _pos[p]; } } delete [] _pos; _pos = np; } }; // Read more mers from the file void read(merylStreamReader *R, uint32 num, bool loadAll) { uint32 xxx = 0; uint32 las = ~uint32ZERO; uint32 pos = _tip; bool stop = false; //fprintf(stderr, "read()- loading "uint32FMT"\n", num); assert(_mmmLen + num < _mmmMax); // Load until we hit the sentinel. if (loadAll == false) num = ~uint32ZERO; for (xxx=0; (xxx < num) && (stop == false) && (R->nextMer()); xxx++) { // Insert into a free node uint32 fre = _fre; _fre = _mmm[fre]._nxt; _mmm[fre]._mer = R->theFMer(); _mmm[fre]._cnt = R->theCount(); _mmm[fre]._off = ~uint32ZERO; _mmm[fre]._stp = 0; uint32 *ppp = R->thePositions(); if (ppp) { _mmm[fre]._off = _posLen; if (_posMax <= _posLen + _mmm[fre]._cnt) { fprintf(stderr, "Reallocate _pos\n"); _posMax *= 2; uint32 *tmp = new uint32 [_posMax]; memcpy(tmp, _pos, sizeof(uint32) * _posLen); delete [] _pos; _pos = tmp; } for (uint32 i=0; i<_mmm[fre]._cnt; i++, _posLen++) _pos[_posLen] = ppp[i]; } // Keep count _mmmLen++; // Figure out where to put it in the list. New duplicates must // go AFTER the existing -- that's the job of <=. while ((pos < _mmmMax) && (_mmm[pos]._mer <= R->theFMer())) { las = pos; pos = _mmm[pos]._nxt; } if (_mmmMax < _tip) { // No tip, make new list.
_mmm[fre]._nxt = _tip; _tip = fre; las = ~uint32ZERO; pos = _tip; } else if (_mmmMax < las) { // Valid list, but we want to insert before the start _mmm[fre]._nxt = _tip; _tip = fre; las = ~uint32ZERO; pos = _tip; } else if (pos < _mmmMax) { // Valid pos, insert in the middle (after las, before pos) _mmm[fre]._nxt = _mmm[las]._nxt; _mmm[las]._nxt = fre; las = fre; //pos = _mmm[las]._nxt; } else { // Have a list, but we ran off the end, append (after las) _mmm[fre]._nxt = ~uint32ZERO; _mmm[las]._nxt = fre; pos = fre; if (loadAll == false) stop = true; } } // Set the sentinal. This forces us to load more mers. // if (loadAll == true) { //fprintf(stderr, "read()-- stop on tip = "uint32FMT"\n", las); _mmm[las]._stp = 1; } //fprintf(stderr, "read()-- now up to "uint32FMT" mers ("uint32FMT" pos); loaded "uint32FMT" out of "uint32FMT" requested.\n", _mmmLen, _posLen, xxx, num); }; private: uint32 _posLen; uint32 _posMax; uint32 *_pos; uint32 _mmmLen; uint32 _mmmMax; mMer *_mmm; uint32 _tip; uint32 _fre; }; void multipleOperations(merylArgs *args) { if (args->mergeFilesLen < 2) { fprintf(stderr, "ERROR - must have at least two databases (you gave "uint32FMT")!\n", args->mergeFilesLen); exit(1); } if (args->outputFile == 0L) { fprintf(stderr, "ERROR - no output file specified.\n"); exit(1); } if ((args->personality != PERSONALITY_MERGE) && (args->personality != PERSONALITY_MIN) && (args->personality != PERSONALITY_MINEXIST) && (args->personality != PERSONALITY_MAX) && (args->personality != PERSONALITY_ADD) && (args->personality != PERSONALITY_AND) && (args->personality != PERSONALITY_NAND) && (args->personality != PERSONALITY_OR) && (args->personality != PERSONALITY_XOR)) { fprintf(stderr, "ERROR - only personalities min, minexist, max, add, and, nand, or, xor\n"); fprintf(stderr, "ERROR - are supported in multipleOperations().\n"); fprintf(stderr, "ERROR - this is a coding error, not a user error.\n"); exit(1); } uint32 maxSize = 64 * 1024 * 1024; merylStreamReader **R = 
new merylStreamReader* [args->mergeFilesLen]; merylStreamWriter *W = 0L; mMerList *M = new mMerList(maxSize + maxSize / 4); for (uint32 i=0; i<args->mergeFilesLen; i++) R[i] = new merylStreamReader(args->mergeFiles[i]); // Verify that the mersizes are all the same // bool fail = false; uint32 merSize = R[0]->merSize(); uint32 merComp = R[0]->merCompression(); for (uint32 i=0; i<args->mergeFilesLen; i++) { fail |= (merSize != R[i]->merSize()); fail |= (merComp != R[i]->merCompression()); } if (fail) fprintf(stderr, "ERROR: mer size or compression level differ.\n"), exit(1); // Open the output file, using the largest prefix size found in the // input/mask files. // uint32 prefixSize = 0; for (uint32 i=0; i<args->mergeFilesLen; i++) if (prefixSize < R[i]->prefixSize()) prefixSize = R[i]->prefixSize(); W = new merylStreamWriter(args->outputFile, merSize, merComp, prefixSize); // Load mers from all files, remember the largest mer we load. // bool loadAll = true; for (uint32 i=0; i<args->mergeFilesLen; i++) { M->read(R[i], maxSize / args->mergeFilesLen, loadAll); loadAll = false; } fprintf(stderr, "Initial load: length="uint32FMT"\n", M->length()); bool moreStuff = true; kMer currentMer; // The current mer we're operating on uint32 currentCount = uint32ZERO; // The count (operation dependent) of this mer uint32 currentTimes = uint32ZERO; // Number of files it's in uint32 currentPositionsMax = 0; uint32 *currentPositions = 0L; kMer *thisMer; // The mer we just read uint32 thisCount = uint32ZERO; // The count of the mer we just read uint32 *thisPositions = 0L; speedCounter *C = new speedCounter(" %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, args->beVerbose); currentMer.setMerSize(merSize); while (moreStuff) { // Load more stuff if needed.
// if (M->loadMore() == true) { M->rebuild(); uint32 additionalLoading = 8192; if (maxSize / args->mergeFilesLen > M->length()) additionalLoading = maxSize / args->mergeFilesLen - M->length(); loadAll = true; for (uint32 i=0; i<args->mergeFilesLen; i++) { if (R[i]->validMer()) { M->read(R[i], additionalLoading, loadAll); loadAll = false; } } } // All done? Exit. if (M->length() == 0) moreStuff = false; thisMer = M->pop(thisCount, thisPositions); // If we've hit a different mer, write out the last one if ((M->length() == 0) || (*thisMer != currentMer)) { switch (args->personality) { case PERSONALITY_MIN: if (currentTimes == args->mergeFilesLen) W->addMer(currentMer, currentCount); break; case PERSONALITY_MERGE: case PERSONALITY_MINEXIST: case PERSONALITY_MAX: case PERSONALITY_ADD: W->addMer(currentMer, currentCount, currentPositions); break; case PERSONALITY_AND: if (currentTimes == args->mergeFilesLen) W->addMer(currentMer, currentCount); break; case PERSONALITY_NAND: if (currentTimes != args->mergeFilesLen) W->addMer(currentMer, currentCount); break; case PERSONALITY_OR: W->addMer(currentMer, currentCount); break; case PERSONALITY_XOR: if ((currentTimes % 2) == 1) W->addMer(currentMer, currentCount); break; default: fprintf(stderr, "ERROR - invalid personality in multipleOperations::write\n"); fprintf(stderr, "ERROR - this is a coding error, not a user error.\n"); exit(1); break; } currentMer = *thisMer; currentCount = uint32ZERO; currentTimes = uint32ZERO; C->tick(); } if (moreStuff == false) break; // Perform the operation switch (args->personality) { case PERSONALITY_MERGE: if (thisPositions) { if (currentPositionsMax == 0) { currentPositionsMax = 1048576; currentPositions = new uint32 [currentPositionsMax]; } if (currentPositionsMax < currentCount + thisCount) { while (currentPositionsMax < currentCount + thisCount) currentPositionsMax *= 2; uint32 *t = new uint32 [currentPositionsMax]; memcpy(t, currentPositions, sizeof(uint32) * currentCount); delete []
currentPositions; currentPositions = t; } if (thisCount < 16) { for (uint32 i=0; i<thisCount; i++) currentPositions[currentCount + i] = thisPositions[i]; } else { memcpy(currentPositions + currentCount, thisPositions, sizeof(uint32) * thisCount); } } currentCount += thisCount; break; case PERSONALITY_MIN: case PERSONALITY_MINEXIST: if (currentCount > thisCount) currentCount = thisCount; break; case PERSONALITY_MAX: if (currentCount < thisCount) currentCount = thisCount; break; case PERSONALITY_ADD: currentCount += thisCount; break; case PERSONALITY_AND: case PERSONALITY_NAND: case PERSONALITY_OR: case PERSONALITY_XOR: currentCount = 1; break; default: fprintf(stderr, "ERROR - invalid personality in multipleOperations::operate\n"); fprintf(stderr, "ERROR - this is a coding error, not a user error.\n"); exit(1); break; } currentTimes++; } for (uint32 i=0; i<args->mergeFilesLen; i++) delete R[i]; delete [] R; delete W; delete M; delete C; } canu-1.6/src/meryl/meryl-merge-qsort.C000066400000000000000000000333051314437614700177420ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
*/ #include <stdio.h> #include <stdlib.h> #include <string.h> #include "meryl.H" #include "libmeryl.H" using namespace std; #include <algorithm> struct mMer { kMer _mer; uint32 _cnt; uint32 _off; }; static int mMerGreaterThan(void const *a, void const *b) { mMer const *A = (mMer const *)a; mMer const *B = (mMer const *)b; return(B->_mer.qsort_less(A->_mer)); } class mMerList { public: mMerList(uint32 maxSize) { _posLen = 0; _posMax = 2 * maxSize; _pos = new uint32 [_posMax]; _mmmLen = 0; _mmmMax = maxSize; _mmm = new mMer [_mmmMax]; }; ~mMerList() { delete [] _pos; delete [] _mmm; }; uint32 length(void) { return(_mmmLen); }; // Until we sort, first() is the last thing loaded. // After we sort, first() is the lowest mer in the set. kMer &first(void) { return(_mmm[_mmmLen-1]._mer); }; //kMer &last(void) { return(_mmm[0]._mer); }; //kMer &get(uint32 i) { return(_mmm[i]._mer); }; // Return the first (sorted order) thing in the list -- it's the last on the list. kMer *pop(uint32 &cnt, uint32* &pos) { if (_mmmLen == 0) return(0L); _mmmLen--; assert(_sorted); cnt = _mmm[_mmmLen]._cnt; pos = 0L; if (_mmm[_mmmLen]._off != ~uint32ZERO) pos = _pos + _mmm[_mmmLen]._off; return(&_mmm[_mmmLen]._mer); } // rebuild the position list, squeezes out empty items void rebuild(void) { if (_posLen > 0) { uint32 *np = new uint32 [_posMax]; _posLen = 0; for (uint32 i=0; i<_mmmLen; i++) { mMer *m = _mmm + i; if (m->_off != ~uint32ZERO) { _mmm[_mmmLen]._off = _posLen; for (uint32 p=0; p<m->_cnt; p++, _posLen++) np[_posLen] = _pos[p]; } } delete [] _pos; _pos = np; } }; // Read more mers from the file void read(merylStreamReader *R, uint32 num) { uint32 xxx = 0; if (_mmmLen + num >= _mmmMax) { fprintf(stderr, "Reallocate _mmm\n"); _mmmMax = _mmmMax + 2 * num; mMer *tmp = new mMer [_mmmMax]; memcpy(tmp, _mmm, sizeof(mMer) * _mmmLen); delete [] _mmm; _mmm = tmp; } _sorted = false; R->nextMer(); for (xxx=0; (xxx < num) && (R->validMer()); xxx++) { if (_mmmMax <= _mmmLen) { fprintf(stderr, "Reallocate _mmm\n"); _mmmMax *= 2; mMer *tmp = new
mMer [_mmmMax]; memcpy(tmp, _mmm, sizeof(mMer) * _mmmLen); delete [] _mmm; _mmm = tmp; } _mmm[_mmmLen]._mer = R->theFMer(); _mmm[_mmmLen]._cnt = R->theCount(); _mmm[_mmmLen]._off = ~uint32ZERO; uint32 *pos = R->thePositions(); if (pos) { _mmm[_mmmLen]._off = _posLen; if (_posMax <= _posLen + _mmm[_mmmLen]._cnt) { fprintf(stderr, "Reallocate _pos\n"); _posMax *= 2; uint32 *tmp = new uint32 [_posMax]; memcpy(tmp, _pos, sizeof(uint32) * _posLen); delete [] _pos; _pos = tmp; } for (uint32 i=0; i<_mmm[_mmmLen]._cnt; i++, _posLen++) _pos[_posLen] = pos[i]; } _mmmLen++; R->nextMer(); } //fprintf(stderr, "read()-- now up to "uint32FMT" mers ("uint32FMT" pos); loaded "uint32FMT" out of "uint32FMT" requested.\n", _mmmLen, _posLen, xxx, num); }; // Sort our list of mers void sort(void) { if (_sorted == false) { //fprintf(stderr, "SORT BEG\n"); qsort_mt(_mmm, _mmmLen, sizeof(mMer), mMerGreaterThan, 8, 32 * 1024); _sorted = true; //fprintf(stderr, "SORT END\n"); } }; private: bool _sorted; uint32 _posLen; uint32 _posMax; uint32 *_pos; uint32 _mmmLen; uint32 _mmmMax; mMer *_mmm; }; void multipleOperations(merylArgs *args) { char debugstring[256]; char debugstring2[256]; if (args->mergeFilesLen < 2) { fprintf(stderr, "ERROR - must have at least two databases (you gave "uint32FMT")!\n", args->mergeFilesLen); exit(1); } if (args->outputFile == 0L) { fprintf(stderr, "ERROR - no output file specified.\n"); exit(1); } if ((args->personality != PERSONALITY_MERGE) && (args->personality != PERSONALITY_MIN) && (args->personality != PERSONALITY_MINEXIST) && (args->personality != PERSONALITY_MAX) && (args->personality != PERSONALITY_ADD) && (args->personality != PERSONALITY_AND) && (args->personality != PERSONALITY_NAND) && (args->personality != PERSONALITY_OR) && (args->personality != PERSONALITY_XOR)) { fprintf(stderr, "ERROR - only personalities min, minexist, max, add, and, nand, or, xor\n"); fprintf(stderr, "ERROR - are supported in multipleOperations().\n"); fprintf(stderr, "ERROR - 
this is a coding error, not a user error.\n"); exit(1); } merylStreamReader **R = new merylStreamReader* [args->mergeFilesLen]; merylStreamWriter *W = 0L; uint32 maxSize = 512 * 1024; mMerList *M = new mMerList(maxSize + maxSize / 4); // Open the input files and load some mers - we need to do this // just so we can check the mersizes/compression next. // for (uint32 i=0; i<args->mergeFilesLen; i++) { R[i] = new merylStreamReader(args->mergeFiles[i]); M->read(R[i], 1 + i); } // Verify that the mersizes are all the same // bool fail = false; uint32 merSize = R[0]->merSize(); uint32 merComp = R[0]->merCompression(); for (uint32 i=0; i<args->mergeFilesLen; i++) { fail |= (merSize != R[i]->merSize()); fail |= (merComp != R[i]->merCompression()); } if (fail) fprintf(stderr, "ERROR: mer sizes (or compression level) differ.\n"), exit(1); // Open the output file, using the largest prefix size found in the // input/mask files. // uint32 prefixSize = 0; for (uint32 i=0; i<args->mergeFilesLen; i++) if (prefixSize < R[i]->prefixSize()) prefixSize = R[i]->prefixSize(); W = new merylStreamWriter(args->outputFile, merSize, merComp, prefixSize); kMer lastLoaded; lastLoaded.setMerSize(merSize); lastLoaded.smallest(); // Load mers from all files, remember the largest mer we load. // for (uint32 i=0; i<args->mergeFilesLen; i++) { M->read(R[i], maxSize / args->mergeFilesLen); if (lastLoaded < M->first()) lastLoaded = M->first(); } // Make sure all files have at least that largest mer loaded.
// for (uint32 i=0; i<args->mergeFilesLen; i++) while (R[i]->validMer() && (R[i]->theFMer() <= lastLoaded)) M->read(R[i], 2 * 1024); fprintf(stderr, "Initial load: length="uint32FMT" lastLoaded=%s\n", M->length(), lastLoaded.merToString(debugstring)); M->sort(); bool allLoaded = false; bool moreStuff = true; kMer currentMer; // The current mer we're operating on uint32 currentCount = uint32ZERO; // The count (operation dependent) of this mer uint32 currentTimes = uint32ZERO; // Number of files it's in uint32 currentPositionsMax = 0; uint32 *currentPositions = 0L; kMer *thisMer; // The mer we just read uint32 thisCount = uint32ZERO; // The count of the mer we just read uint32 *thisPositions = 0L; speedCounter *C = new speedCounter(" %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, args->beVerbose); currentMer.setMerSize(merSize); while (moreStuff) { // Load more stuff if needed. M is sorted, so first() is the // smallest mer in the set - we're good up to and including // lastLoaded. // if ((allLoaded == false) && ((M->length() == 0) || (lastLoaded < M->first()))) { #if 0 if (M->length() > 0) fprintf(stderr, "LOADMORE length="uint32FMT" lastLoaded=%s first=%s\n", M->length(), lastLoaded.merToString(debugstring2), M->first().merToString(debugstring)); else fprintf(stderr, "LOADMORE length="uint32FMT" lastLoaded=%s first=EMPTY\n", M->length(), lastLoaded.merToString(debugstring2)); #endif // We need to copy all the mers currently loaded into fresh // storage, so we can deallocate the position storage. Yucky.
// M->rebuild(); allLoaded = true; // Load more stuff to give us a large collection of mers // uint32 additionalLoading = 8192; if (maxSize / args->mergeFilesLen > M->length()) additionalLoading = maxSize / args->mergeFilesLen - M->length(); //fprintf(stderr, "LOADMORE adding "uint32FMT" from each file\n", additionalLoading); lastLoaded.setMerSize(merSize); lastLoaded.smallest(); for (uint32 i=0; i<args->mergeFilesLen; i++) { if (R[i]->validMer()) { M->read(R[i], additionalLoading); if (lastLoaded < M->first()) lastLoaded = M->first(); allLoaded = false; } } // Make sure all files have at least that largest mer loaded. // for (uint32 i=0; i<args->mergeFilesLen; i++) while (R[i]->validMer() && (R[i]->theFMer() <= lastLoaded)) M->read(R[i], 2 * 1024); M->sort(); } // All done? Exit. if (M->length() == 0) moreStuff = false; thisMer = M->pop(thisCount, thisPositions); // If we've hit a different mer, write out the last one if ((M->length() == 0) || (*thisMer != currentMer)) { switch (args->personality) { case PERSONALITY_MIN: if (currentTimes == args->mergeFilesLen) W->addMer(currentMer, currentCount); break; case PERSONALITY_MERGE: case PERSONALITY_MINEXIST: case PERSONALITY_MAX: case PERSONALITY_ADD: W->addMer(currentMer, currentCount, currentPositions); break; case PERSONALITY_AND: if (currentTimes == args->mergeFilesLen) W->addMer(currentMer, currentCount); break; case PERSONALITY_NAND: if (currentTimes != args->mergeFilesLen) W->addMer(currentMer, currentCount); break; case PERSONALITY_OR: W->addMer(currentMer, currentCount); break; case PERSONALITY_XOR: if ((currentTimes % 2) == 1) W->addMer(currentMer, currentCount); break; default: fprintf(stderr, "ERROR - invalid personality in multipleOperations::write\n"); fprintf(stderr, "ERROR - this is a coding error, not a user error.\n"); exit(1); break; } currentMer = *thisMer; currentCount = uint32ZERO; currentTimes = uint32ZERO; C->tick(); } if (moreStuff == false) break; // Perform the operation switch (args->personality) { case
PERSONALITY_MERGE: if (thisPositions) { if (currentPositionsMax == 0) { currentPositionsMax = 1048576; currentPositions = new uint32 [currentPositionsMax]; } if (currentPositionsMax < currentCount + thisCount) { while (currentPositionsMax < currentCount + thisCount) currentPositionsMax *= 2; uint32 *t = new uint32 [currentPositionsMax]; memcpy(t, currentPositions, sizeof(uint32) * currentCount); delete [] currentPositions; currentPositions = t; } if (thisCount < 16) { for (uint32 i=0; i<thisCount; i++) currentPositions[currentCount + i] = thisPositions[i]; } else { memcpy(currentPositions + currentCount, thisPositions, sizeof(uint32) * thisCount); } } currentCount += thisCount; break; case PERSONALITY_MIN: case PERSONALITY_MINEXIST: if (currentTimes == 0) { currentCount = thisCount; } else { if (currentCount > thisCount) currentCount = thisCount; } break; case PERSONALITY_MAX: if (currentCount < thisCount) currentCount = thisCount; break; case PERSONALITY_ADD: currentCount += thisCount; break; case PERSONALITY_AND: case PERSONALITY_NAND: case PERSONALITY_OR: case PERSONALITY_XOR: currentCount = 1; break; default: fprintf(stderr, "ERROR - invalid personality in multipleOperations::operate\n"); fprintf(stderr, "ERROR - this is a coding error, not a user error.\n"); exit(1); break; } currentTimes++; } for (uint32 i=0; i<args->mergeFilesLen; i++) delete R[i]; delete R; delete W; delete M; delete C; } canu-1.6/src/meryl/meryl-merge.C000066400000000000000000000220761314437614700165770ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/meryl/merge.C * * Modifications by: * * Brian P.
Walenz from 2003-JAN-02 to 2004-APR-08 * are Copyright 2003-2004 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-APR-09 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-MAY-23 to 2014-APR-11 * are Copyright 2005-2009,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "meryl.H" #include "libmeryl.H" void multipleOperations(merylArgs *args) { if (args->mergeFilesLen < 2) { fprintf(stderr, "ERROR - must have at least two databases (you gave " F_U32 ")!\n", args->mergeFilesLen); exit(1); } if (args->outputFile == 0L) { fprintf(stderr, "ERROR - no output file specified.\n"); exit(1); } if ((args->personality != PERSONALITY_MERGE) && (args->personality != PERSONALITY_MIN) && (args->personality != PERSONALITY_MINEXIST) && (args->personality != PERSONALITY_MAX) && (args->personality != PERSONALITY_MAXEXIST) && (args->personality != PERSONALITY_ADD) && (args->personality != PERSONALITY_AND) && (args->personality != PERSONALITY_NAND) && (args->personality != PERSONALITY_OR) && (args->personality != PERSONALITY_XOR)) { fprintf(stderr, "ERROR - only personalities min, minexist, max, maxexist, add, and, nand, or, xor\n"); fprintf(stderr, "ERROR - are supported in multipleOperations(). 
(%d)\n", args->personality); fprintf(stderr, "ERROR - this is a coding error, not a user error.\n"); exit(1); } merylStreamReader **R = new merylStreamReader* [args->mergeFilesLen]; merylStreamWriter *W = 0L; // Open the input files, read in the first mer // for (uint32 i=0; i<args->mergeFilesLen; i++) { R[i] = new merylStreamReader(args->mergeFiles[i]); R[i]->nextMer(); } // Verify that the mersizes are all the same // bool fail = false; uint32 merSize = R[0]->merSize(); uint32 merComp = R[0]->merCompression(); for (uint32 i=0; i<args->mergeFilesLen; i++) { fail |= (merSize != R[i]->merSize()); fail |= (merComp != R[i]->merCompression()); } if (fail) fprintf(stderr, "ERROR: mer sizes (or compression level) differ.\n"), exit(1); // Open the output file, using the largest prefix size found in the // input/mask files. // uint32 prefixSize = 0; for (uint32 i=0; i<args->mergeFilesLen; i++) if (prefixSize < R[i]->prefixSize()) prefixSize = R[i]->prefixSize(); W = new merylStreamWriter(args->outputFile, merSize, merComp, prefixSize, args->positionsEnabled); // We will find the smallest mer in any file, and count the number of times // it is present in the input files. bool moreInput = true; kMer currentMer; // The current mer we're operating on uint32 currentCount = uint32ZERO; // The count (operation dependent) of this mer uint32 currentTimes = uint32ZERO; // Number of files it's in uint32 currentPositionsMax = 0; uint32 *currentPositions = 0L; kMer thisMer; // The mer we just read uint32 thisFile = ~uint32ZERO; // The file we read it from uint32 thisCount = uint32ZERO; // The count of the mer we just read speedCounter *C = new speedCounter(" %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, args->beVerbose); currentMer.setMerSize(merSize); thisMer.setMerSize(merSize); while (moreInput) { // Find the smallest mer present in any input file.
// moreInput = false; thisMer.clear(); thisFile = ~uint32ZERO; thisCount = uint32ZERO; // Load thisMer with the first valid mer for (uint32 i=0; i<args->mergeFilesLen && !moreInput; i++) if (R[i]->validMer()) { moreInput = true; thisCount = R[i]->theCount(); thisFile = i; thisMer = R[i]->theFMer(); } // Now find the smallest one if (moreInput) { for (uint32 i=thisFile+1; i<args->mergeFilesLen; i++) if ((R[i]->validMer()) && (R[i]->theFMer()) < thisMer) { moreInput = true; thisCount = R[i]->theCount(); thisFile = i; thisMer = R[i]->theFMer(); } } // If we've hit a different mer, write out the last one if ((moreInput == false) || (thisMer != currentMer)) { switch (args->personality) { case PERSONALITY_MIN: case PERSONALITY_MAX: if (currentTimes == args->mergeFilesLen) W->addMer(currentMer, currentCount); break; case PERSONALITY_MERGE: case PERSONALITY_MINEXIST: case PERSONALITY_MAXEXIST: case PERSONALITY_ADD: W->addMer(currentMer, currentCount, currentPositions); break; case PERSONALITY_AND: if (currentTimes == args->mergeFilesLen) W->addMer(currentMer, currentCount); break; case PERSONALITY_NAND: if (currentTimes != args->mergeFilesLen) W->addMer(currentMer, currentCount); break; case PERSONALITY_OR: W->addMer(currentMer, currentCount); break; case PERSONALITY_XOR: if ((currentTimes % 2) == 1) W->addMer(currentMer, currentCount); break; default: fprintf(stderr, "ERROR - invalid personality in multipleOperations::write\n"); fprintf(stderr, "ERROR - this is a coding error, not a user error.\n"); exit(1); break; } currentMer = thisMer; currentCount = uint32ZERO; currentTimes = uint32ZERO; C->tick(); } // All done? Exit.
if (moreInput == false) continue; // Perform the operation switch (args->personality) { case PERSONALITY_MERGE: if (R[thisFile]->thePositions()) { if (currentPositionsMax == 0) { currentPositionsMax = 1048576; currentPositions = new uint32 [currentPositionsMax]; } if (currentPositionsMax < currentCount + thisCount) { while (currentPositionsMax < currentCount + thisCount) currentPositionsMax *= 2; uint32 *t = new uint32 [currentPositionsMax]; memcpy(t, currentPositions, sizeof(uint32) * currentCount); delete [] currentPositions; currentPositions = t; } if (thisCount < 16) { uint32 *p = R[thisFile]->thePositions(); for (uint32 i=0; i<thisCount; i++) currentPositions[currentCount + i] = p[i]; } else { memcpy(currentPositions + currentCount, R[thisFile]->thePositions(), sizeof(uint32) * thisCount); } } // Otherwise, we're the same as ADD. currentCount += thisCount; break; case PERSONALITY_MIN: case PERSONALITY_MINEXIST: if (currentTimes == 0) { currentCount = thisCount; } else { if (currentCount > thisCount) currentCount = thisCount; } break; case PERSONALITY_MAX: case PERSONALITY_MAXEXIST: if (currentCount < thisCount) currentCount = thisCount; break; case PERSONALITY_ADD: currentCount += thisCount; break; case PERSONALITY_AND: case PERSONALITY_NAND: case PERSONALITY_OR: case PERSONALITY_XOR: currentCount = 1; break; default: fprintf(stderr, "ERROR - invalid personality in multipleOperations::operate\n"); fprintf(stderr, "ERROR - this is a coding error, not a user error.\n"); exit(1); break; } currentTimes++; // Move the file we just read from to the next mer R[thisFile]->nextMer(); } for (uint32 i=0; i<args->mergeFilesLen; i++) delete R[i]; delete R; delete W; delete C; } canu-1.6/src/meryl/meryl-unaryOp.C000066400000000000000000000063511314437614700171330ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/meryl/unaryOp.C * * Modifications by: * * Brian P. Walenz from 2003-JAN-02 to 2004-APR-07 * are Copyright 2003-2004 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2007-OCT-12 to 2008-JUN-09 * are Copyright 2007-2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-05 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include <stdio.h> #include <stdlib.h> #include <string.h> #include "meryl.H" #include "libmeryl.H" void unaryOperations(merylArgs *args) { if (args->mergeFilesLen != 1) { fprintf(stderr, "ERROR - must have exactly one file!\n"); exit(1); } if (args->outputFile == 0L) { fprintf(stderr, "ERROR - no output file specified.\n"); exit(1); } if ((args->personality != PERSONALITY_LEQ) && (args->personality != PERSONALITY_GEQ) && (args->personality != PERSONALITY_EQ)) { fprintf(stderr, "ERROR - only personalities lessthan, lessthanorequal,\n"); fprintf(stderr, "ERROR - greaterthan, greaterthanorequal, and equal\n"); fprintf(stderr, "ERROR - are supported in unaryOperations().\n"); fprintf(stderr, "ERROR - this is a coding error, not a user error.\n"); exit(1); } // Open the input and output files -- we don't know the number // unique, distinct, and total until after the operation, so we // leave them zero. // merylStreamReader *R = new merylStreamReader(args->mergeFiles[0]); merylStreamWriter *W = new merylStreamWriter(args->outputFile, R->merSize(), R->merCompression(), R->prefixSize(), R->hasPositions()); switch (args->personality) { case PERSONALITY_LEQ: while (R->nextMer()) if (R->theCount() <= args->desiredCount) W->addMer(R->theFMer(), R->theCount(), R->thePositions()); break; case PERSONALITY_GEQ: while (R->nextMer()) if (R->theCount() >= args->desiredCount) W->addMer(R->theFMer(), R->theCount(), R->thePositions()); break; case PERSONALITY_EQ: while (R->nextMer()) if (R->theCount() == args->desiredCount) W->addMer(R->theFMer(), R->theCount(), R->thePositions()); break; } delete R; delete W; } canu-1.6/src/meryl/meryl.C000066400000000000000000000054341314437614700155010ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/meryl/meryl.C * * Modifications by: * * Brian P. Walenz from 2003-JAN-02 to 2004-APR-07 * are Copyright 2003-2004 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-MAR-25 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-MAY-23 to 2009-AUG-07 * are Copyright 2005,2007-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2014-DEC-08 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <ctype.h> #include <unistd.h> #include "meryl.H" int main(int argc, char **argv) { merylArgs *args = new merylArgs(argc, argv); switch (args->personality) { case 'P': estimate(args); break; case 'B': build(args); break; case 'd': dumpDistanceBetweenMers(args); break; case 't': dumpThreshold(args); break; case 'p': dumpPositions(args); break; case 'c': countUnique(args); break; case 'h': plotHistogram(args); break; case PERSONALITY_MIN: case PERSONALITY_MINEXIST: case PERSONALITY_MAX: case PERSONALITY_MAXEXIST: case PERSONALITY_ADD: case PERSONALITY_AND: case PERSONALITY_NAND: case PERSONALITY_OR: case PERSONALITY_XOR: multipleOperations(args); break; case PERSONALITY_SUB: case PERSONALITY_ABS: case PERSONALITY_DIVIDE: binaryOperations(args); break; case PERSONALITY_LEQ: case PERSONALITY_GEQ: case PERSONALITY_EQ: unaryOperations(args); break; default: args->usage(); fprintf(stderr, "%s: unknown personality. Specify -P, -B, -S or -M!\n", args->execName); exit(1); break; } delete args; return(0); } canu-1.6/src/meryl/meryl.H000066400000000000000000000122531314437614700155030ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/meryl/meryl.H * * Modifications by: * * Brian P. Walenz from 2003-JAN-02 to 2004-APR-07 * are Copyright 2003-2004 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P.
Walenz from 2004-MAR-25 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-MAY-23 to 2014-APR-11 * are Copyright 2005-2011,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2015-MAY-29 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-29 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef MERYL_H #define MERYL_H #include "AS_global.H" #include "AS_UTL_fileIO.H" #include "kMer.H" #include "bitPackedFile.H" #include "libmeryl.H" #include "speedCounter.H" #include "timeAndSize.H" #define PERSONALITY_MERGE 0xff #define PERSONALITY_MIN 0x01 #define PERSONALITY_MINEXIST 0x02 #define PERSONALITY_MAX 0x03 #define PERSONALITY_MAXEXIST 0x04 #define PERSONALITY_ADD 0x05 #define PERSONALITY_SUB 0x06 #define PERSONALITY_DIVIDE 0x07 #define PERSONALITY_ABS 0x08 #define PERSONALITY_AND 0x10 #define PERSONALITY_NAND 0x11 #define PERSONALITY_OR 0x12 #define PERSONALITY_XOR 0x13 #define PERSONALITY_LEQ 0x14 #define PERSONALITY_GEQ 0x15 #define PERSONALITY_EQ 0x16 class merylArgs { public: merylArgs(int argc, char **argv); merylArgs(const char *prefix); ~merylArgs(); void usage(void); void clear(void); uint64 hash(kMer const &mer) { return(mer.startOfMer(numBuckets_log2)); }; bool writeConfig(void); bool readConfig(const char *prefix); public: char *execName; char *options; bool beVerbose; bool doForward; bool doReverse; bool doCanonical; char *inputFile; char *outputFile; char *queryFile; uint32 merSize; uint32 merComp; bool positionsEnabled; uint64 numMersEstimated; uint64 numMersActual; uint64 numBasesActual; uint64 mersPerBatch; uint64 
basesPerBatch; uint64 numBuckets; uint32 numBuckets_log2; uint32 merDataWidth; uint64 merDataMask; uint32 bucketPointerWidth; uint32 numThreads; uint64 memoryLimit; uint64 segmentLimit; bool configBatch; bool countBatch; bool mergeBatch; uint32 batchNumber; char *sgeJobName; char *sgeBuildOpt; char *sgeMergeOpt; bool isOnGrid; uint32 lowCount; uint32 highCount; uint32 desiredCount; bool outputCount; bool outputAll; bool outputPosition; bool includeDefLine; bool includeMer; uint32 mergeFilesMax; uint32 mergeFilesLen; char **mergeFiles; uint32 personality; }; uint64 estimateNumMersInMemorySize(uint32 merSize, uint64 mem, uint32 numThreads, bool positionsEnabled, bool beVerbose); uint64 estimateMemory(uint32 merSize, uint64 numMers, bool positionsEnabled); uint32 optimalNumberOfBuckets(uint32 merSize, uint64 numMers, bool positionsEnabled); void estimate(merylArgs *args); void build(merylArgs *args); void multipleOperations(merylArgs *args); void binaryOperations(merylArgs *args); void unaryOperations(merylArgs *args); void dump(merylArgs *args); void dumpThreshold(merylArgs *args); void dumpPositions(merylArgs *args); void countUnique(merylArgs *args); void dumpDistanceBetweenMers(merylArgs *args); void plotHistogram(merylArgs *args); #endif // MERYL_H canu-1.6/src/meryl/meryl.mk000066400000000000000000000012631314437614700157220ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := meryl SOURCES := meryl-args.C \ meryl-binaryOp.C \ meryl-build.C \ meryl-dump.C \ meryl-estimate.C \ meryl-merge.C \ meryl-unaryOp.C \ meryl.C SRC_INCDIRS := .. 
../AS_UTL libleaff TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lleaff -lcanu TGT_PREREQS := libleaff.a libcanu.a SUBMAKEFILES := canu-1.6/src/meryl/positionDB.C000066400000000000000000000245341314437614700164250ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2003-AUG-14 to 2003-SEP-18 * are Copyright 2003 Applera Corporation, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2004-APR-30 to 2004-OCT-10 * are Copyright 2004 Brian P. Walenz, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-MAY-19 to 2014-APR-11 * are Copyright 2005,2007-2008,2011,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "positionDB.H" #include "existDB.H" // Driver for the positionDB creation. Reads a sequence.fasta, builds // a positionDB for the mers in the file, and then writes the internal // structures to disk. // // The positionDB constructor is smart enough to read either a pre-built // image or a regular multi-fasta file. 
#define MERSIZE 20 int test1(char *filename) { merStream *T = new merStream(new kMerBuilder(MERSIZE), new seqStream(filename), true, true); positionDB *M = new positionDB(T, MERSIZE, 0, 0L, 0L, 0L, 0, 0, 0, 0, true); uint64 *posn = new uint64 [1024]; uint64 posnMax = 1024; uint64 posnLen = uint64ZERO; uint64 count = uint64ZERO; uint32 missing = uint32ZERO; uint32 failed = uint32ZERO; char str[33]; T->rewind(); while (T->nextMer()) { if (M->getExact(T->theFMer(), posn, posnMax, posnLen, count)) { missing = uint32ZERO; for (uint32 i=0; ithePositionInStream()) missing++; if (missing != 1) { failed++; fprintf(stdout, "%s @ "F_U64"/"F_U64": Found "F_U64" table entries, and "F_U32" matching positions (", T->theFMer().merToString(str), T->theSequenceNumber(), T->thePositionInStream(), posnLen, missing); for (uint32 i=0; itheFMer().merToString(str), T->thePositionInStream()); } } delete M; delete T; return(failed != 0); } int test2(char *filename, char *query) { merStream *T = new merStream(new kMerBuilder(MERSIZE), new seqStream(filename), true, true); positionDB *M = new positionDB(T, MERSIZE, 0, 0L, 0L, 0L, 0, 0, 0, 0, true); uint64 *posn = new uint64 [1024]; uint64 posnMax = 1024; uint64 posnLen = uint64ZERO; uint64 count = uint64ZERO; char str[33]; delete T; T = new merStream(new kMerBuilder(MERSIZE), new seqStream(query), true, true); while (T->nextMer()) { if (M->getExact(T->theFMer(), posn, posnMax, posnLen, count)) { fprintf(stdout, "Got a F match for mer=%s at "F_U64"/"F_U64" (in mers), numMatches="F_U64"\n", T->theFMer().merToString(str), T->theSequenceNumber(), T->thePositionInStream(), posnLen); } if (M->getExact(T->theRMer(), posn, posnMax, posnLen, count)) { fprintf(stdout, "Got a R match for mer=%s at "F_U64"/"F_U64" (in mers), numMatches="F_U64"\n", T->theRMer().merToString(str), T->theSequenceNumber(), T->thePositionInStream(), posnLen); } } delete M; delete T; return(0); } // Builds a positionDB possibly using a subset of the file. 
// // Subset on entire sequences: // -use x-y,a,b // // Subset on a range of mers, in this case, use only the 1000th // through 1999th (inclusive) mer: // -merbegin 1000 -merend 2000 // // Or do both, use the first 1000 mers from the 3rd sequence: // -use 3 -merbegin 0 -merend 1000 int main(int argc, char **argv) { uint32 mersize = 20; uint32 merskip = 0; char *maskF = 0L; char *onlyF = 0L; uint64 merBegin = ~uint64ZERO; uint64 merEnd = ~uint64ZERO; char *sequenceFile = 0L; char *outputFile = 0L; if (argc < 3) { fprintf(stderr, "usage: %s [args]\n", argv[0]); fprintf(stderr, " -mersize k The size of the mers, default=20.\n"); fprintf(stderr, " -merskip k The skip between mers, default=0\n"); fprintf(stderr, " -use a-b,c Specify which sequences to use, default=all\n"); fprintf(stderr, " -merbegin b Build on a subset of the mers, starting at mer #b, default=all mers\n"); fprintf(stderr, " -merend e Build on a subset of the mers, ending at mer #e, default=all mers\n"); fprintf(stderr, " -sequence s.fasta Input sequences.\n"); fprintf(stderr, " -output p.posDB Output filename.\n"); fprintf(stderr, "\n"); fprintf(stderr, " To dump information about an image:\n"); fprintf(stderr, " -dump datafile\n"); fprintf(stderr, "\n"); fprintf(stderr, " To run sanity tests:\n"); fprintf(stderr, " -buildonly [build opts] sequence.fasta\n"); fprintf(stderr, " -- just builds a table and exits\n"); fprintf(stderr, " -existence [build opts] sequence.fasta\n"); fprintf(stderr, " -- builds (or reads) a table reports if any mers\n"); fprintf(stderr, " in sequence.fasta cannot be found\n"); fprintf(stderr, " -extra [build opts] sequence.fasta\n"); fprintf(stderr, " -- builds (or reads) a table reports if any mers\n"); fprintf(stderr, " NOT in sequence.fasta are be found\n"); fprintf(stderr, " -test1 sequence.fasta\n"); fprintf(stderr, " -- Tests if each and every mer is found in the\n"); fprintf(stderr, " positionDB. 
Reports if it doesn't find a mer\n"); fprintf(stderr, " at the correct position. Doesn't report if table\n"); fprintf(stderr, " has too much stuff.\n"); fprintf(stderr, " -test2 db.fasta sequence.fasta\n"); fprintf(stderr, " -- Builds a positionDB from db.fasta, then searches\n"); fprintf(stderr, " the table for each mer in sequence.fasta. Reports\n"); fprintf(stderr, " all mers it finds.\n"); fprintf(stderr, " -- This is a silly test and you shouldn't do it.\n"); exit(1); } int arg = 1; while (arg < argc) { if (strcmp(argv[arg], "-mersize") == 0) { mersize = strtouint32(argv[++arg]); } else if (strcmp(argv[arg], "-merskip") == 0) { merskip = strtouint32(argv[++arg]); } else if (strcmp(argv[arg], "-mask") == 0) { maskF = argv[++arg]; } else if (strcmp(argv[arg], "-only") == 0) { onlyF = argv[++arg]; } else if (strcmp(argv[arg], "-merbegin") == 0) { merBegin = strtouint64(argv[++arg]); } else if (strcmp(argv[arg], "-merend") == 0) { merEnd = strtouint64(argv[++arg]); } else if (strcmp(argv[arg], "-sequence") == 0) { sequenceFile = argv[++arg]; } else if (strcmp(argv[arg], "-output") == 0) { outputFile = argv[++arg]; } else if (strcmp(argv[arg], "-dump") == 0) { positionDB *e = new positionDB(argv[++arg], 0, 0, 0, false); e->printState(stdout); delete e; exit(0); } else if (strcmp(argv[arg], "-test1") == 0) { exit(test1(argv[arg+1])); } else if (strcmp(argv[arg], "-test2") == 0) { exit(test2(argv[arg+1], argv[arg+2])); } else { fprintf(stderr, "ERROR: unknown arg '%s'\n", argv[arg]); exit(1); } arg++; } // Exit quickly if the output file exists. // if (AS_UTL_fileExists(outputFile)) { fprintf(stderr, "Output file '%s' exists already!\n", outputFile); exit(0); } merStream *MS = new merStream(new kMerBuilder(MERSIZE), new seqStream(sequenceFile), true, true); // Approximate the number of mers in the sequences. // uint64 numMers = MS->approximateNumberOfMers(); // Reset the limits. 
// // XXX: If the user somehow knows how many mers are in the input // file, and specifies an end between there and the amount of // sequence, we'll pointlessly still make a merStreamFile, even // though we shouldn't. // if (merBegin == ~uint64ZERO) merBegin = 0; if (merEnd == ~uint64ZERO) merEnd = numMers; if (merBegin >= merEnd) { fprintf(stderr, "ERROR: merbegin="F_U64" and merend="F_U64" are incompatible.\n", merBegin, merEnd); exit(1); } if ((merBegin > 0) || (merEnd < numMers)) MS->setBaseRange(merBegin, merEnd); existDB *maskDB = 0L; if (maskF) { fprintf(stderr, "Building maskDB from '%s'\n", maskF); maskDB = new existDB(maskF, mersize, existDBnoFlags, 0, ~uint32ZERO); } existDB *onlyDB = 0L; if (onlyF) { fprintf(stderr, "Building onlyDB from '%s'\n", onlyF); onlyDB = new existDB(onlyF, mersize, existDBnoFlags, 0, ~uint32ZERO); } fprintf(stderr, "Building table with merSize "F_U32", merSkip "F_U32"\n", mersize, merskip); positionDB *positions = new positionDB(MS, mersize, merskip, maskDB, onlyDB, 0L, 0, 0, 0, 0, true); fprintf(stderr, "Dumping positions table to '%s'\n", outputFile); positions->saveState(outputFile); delete MS; delete positions; exit(0); } canu-1.6/src/meryl/positionDB.mk000066400000000000000000000007731314437614700166510ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := positionDB SOURCES := positionDB.C SRC_INCDIRS := .. 
../AS_UTL libleaff libkmer TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lleaff -lcanu TGT_PREREQS := libleaff.a libcanu.a SUBMAKEFILES := canu-1.6/src/meryl/simple.C000066400000000000000000000146211314437614700156400ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * kmer/meryl/simple.C * * Modifications by: * * Brian P. Walenz from 2010-AUG-31 to 2014-APR-11 * are Copyright 2010-2011,2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-05 to 2015-APR-24 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "AS_UTL_fileIO.H" #include "kMer.H" #include "bitPackedFile.H" #include "libmeryl.H" #include "seqStream.H" #include "merStream.H" #include "speedCounter.H" #include "speedCounter.H" #include "timeAndSize.H" #include using namespace std; #if 0 #include #include #include #include #include #include "bio++.H" #include "meryl.H" #include "libmeryl.H" #include "seqStream.H" #include "merStream.H" #endif // A very simple mer counter. 
Allocates a gigantic 32-bit array, // populates the array with mers, sorts, writes output. int main(int argc, char **argv) { char *inName = 0L; char *otName = 0L; uint32 merSize = 22; uint32 merCompression = 1; bool doForward = false; bool doReverse = false; bool doCanonical = false; speedCounter *C = 0L; merStream *M = 0L; merylStreamWriter *W = 0L; uint64 numMers = 0; bool algSorting = false; int arg = 1; int err = 0; while (arg < argc) { if (strcmp(argv[arg], "-i") == 0) { inName = argv[++arg]; } else if (strcmp(argv[arg], "-o") == 0) { otName = argv[++arg]; } else if (strcmp(argv[arg], "-m") == 0) { merSize = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-f") == 0) { doForward = true; } else if (strcmp(argv[arg], "-r") == 0) { doReverse = true; } else if (strcmp(argv[arg], "-C") == 0) { doCanonical = true; } else if (strcmp(argv[arg], "-c") == 0) { merCompression = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-sort") == 0) { algSorting = true; } else if (strcmp(argv[arg], "-direct") == 0) { algSorting = false; } else { fprintf(stderr, "unknown option '%s'\n", argv[arg]); err++; } arg++; } if (inName == 0L) { fprintf(stderr, "no input given with '-i'\n"); err++; } if (otName == 0L) { fprintf(stderr, "no output given with '-o'\n"); err++; } if (err) exit(1); M = new merStream(new kMerBuilder(merSize, merCompression), new seqStream(inName), true, true); numMers = M->approximateNumberOfMers(); delete M; fprintf(stderr, "Guessing " F_U64 " mers in input '%s'\n", numMers, inName); M = new merStream(new kMerBuilder(merSize, merCompression), new seqStream(inName), true, true); if (algSorting) { uint64 theMersLen = 0; uint64 theMersMax = 2 * numMers; // for allowing both -f and -r uint32 *theMers = new uint32 [theMersMax]; fprintf(stderr, "Allocating " F_U64 "MB for mer storage.\n", numMers * sizeof(uint64) >> 20); C = new speedCounter(" Filling mer list: %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, 1); while (M->nextMer()) { if (doForward) 
theMers[theMersLen++] = M->theFMer(); if (doReverse) theMers[theMersLen++] = M->theRMer(); if (doCanonical) theMers[theMersLen++] = (M->theFMer() <= M->theRMer()) ? M->theFMer() : M->theRMer(); C->tick(); } delete C; delete M; fprintf(stderr, "Found " F_U64 " mers in input '%s'\n", theMersLen, inName); if (theMersLen > theMersMax) fprintf(stderr, "ERROR: too many mers in input!\n"), exit(1); sort(theMers, theMers + theMersLen); C = new speedCounter(" Writing output: %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, 1); W = new merylStreamWriter(otName, merSize, merCompression, 16, false); kMer mer(merSize); for (uint64 i=0; iaddMer(mer, 1, 0L); C->tick(); } delete C; delete W; delete [] theMers; } else { uint64 numCounts = ((uint64)1) << (2 * merSize); uint32 *theCounts = new uint32 [numCounts]; fprintf(stderr, "Allocating " F_U64 "MB for count storage.\n", numCounts * sizeof(uint32) >> 20); memset(theCounts, 0, sizeof(uint32) * numCounts); C = new speedCounter(" Filling mer counts: %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, 1); while (M->nextMer()) { if (doForward) theCounts[M->theFMer()]++; if (doReverse) theCounts[M->theRMer()]++; if (doCanonical) theCounts[(M->theFMer() <= M->theRMer()) ? M->theFMer() : M->theRMer()]++; C->tick(); } delete C; delete M; C = new speedCounter(" Writing output: %7.2f Mmers -- %5.2f Mmers/second\r", 1000000.0, 0x1fffff, 1); W = new merylStreamWriter(otName, merSize, merCompression, 16, false); kMer mer(merSize); for (uint64 i=0; i 0) { mer.setWord(0, i); W->addMer(mer, theCounts[i], 0L); C->tick(); } } delete C; delete W; delete [] theCounts; } exit(0); } canu-1.6/src/meryl/simple.mk000066400000000000000000000007531314437614700160660ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. 
ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := simple SOURCES := simple.C SRC_INCDIRS := .. ../AS_UTL libleaff TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lleaff -lcanu TGT_PREREQS := libleaff.a libcanu.a SUBMAKEFILES := canu-1.6/src/meryl/test/000077500000000000000000000000001314437614700152165ustar00rootroot00000000000000canu-1.6/src/meryl/test/Makefile000066400000000000000000000025401314437614700166570ustar00rootroot00000000000000PROG = stupidcount exhaustive INCLUDE = -I.. -I../../libutil -I../../libbio -I../../libmeryl LIBS = -L.. -L../../libutil -L../../libbio -L../../libmeryl -lmeryl -lbio -lutil -lm MERSIZE = 26 include ../../Make.compilers all: $(PROG) test-reduce stupidcount: stupidcount.C $(CXX) $(CXXFLAGS_COMPILE) -c -o stupidcount.o stupidcount.C $(INCLUDE) $(CXX) $(CXXLDFLAGS) -o stupidcount stupidcount.o $(LIBS) exhaustive: exhaustive.C kmerlite.H $(CXX) $(CXXFLAGS_COMPILE) -c -o exhaustive.o exhaustive.C $(INCLUDE) $(CXX) $(CXXLDFLAGS) -o exhaustive exhaustive.o $(LIBS) test-exhaustive: exhaustive ../meryl ../../leaff/leaff ../../leaff/leaff -G 1000 10000 40000 > g.fasta ../meryl -B -s g.fasta -o s -m $(MERSIZE) -threads 7 ./exhaustive -m s -f g.fasta test-reduce: ../meryl ../meryl -B -f -m 20 -s test-seq1.fasta -o 1 # Build the initial table ../meryl -Dt -n 0 -s 1 > 2.reduce.fasta # Dump the initial table as fasta ../meryl -B -f -m 20 -s 2.reduce.fasta -o 2 # Build a new table on the dumped fasta ../meryl -M sub -s 1 -s 2 -o 3 # Remove one copy of each mer ../meryl -Dt -n 1 -s 3 # Dump the resulting file echo 1 10 9 1 is correct touch test-reduce test: ../meryl -B -s test-seq1.fasta -o t -m 20 clean: rm -f $(PROG) *.o *.mc??? 
test-reduce *.seqStore* g.fasta 2.reduce.fasta *.fastaidx canu-1.6/src/meryl/test/exhaustive.C000066400000000000000000000140561314437614700175150ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "bio++.H" #include "libmeryl.H" #include "kmerlite.H" // This tests that all the mers in an input fasta file are counted // properly. It does not test that the meryl output contains exactly // those mers, just that those mers are there. // // If everything fits into one batch, then it _will_ verify that the // meryl output is exactly correct. // // Reads a meryl-format kmer count in chunks. Each chunk is stored // in a searchable structure (we should be using, say, an extended // existDB, but we're using a balanced binary tree). The entire // source fasta file is then streamed against the kmer chunk, // decrementing the count for each mer. When the whole file is // streamed, any kmers with positive count are reported. // NB: My hacked kazlib returns a pointer to whatever we give it. // Since we gave it a pointer to an object, it gives us back a // pointer to "a pointer to an object". Hence, this ugliness. 
// int kMerLiteSort(void const *a, void const *b) { kMerLite const *A = *((kMerLite * const *)a); kMerLite const *B = *((kMerLite * const *)b); if (*A < *B) return(-1); if (*A > *B) return(1); return(0); } int main(int argc, char **argv) { char *merylCount = 0L; char *fastaName = 0L; int arg=1; while (arg < argc) { if (strcmp(argv[arg], "-m") == 0) { merylCount = argv[++arg]; } else if (strcmp(argv[arg], "-f") == 0) { fastaName = argv[++arg]; } else { fprintf(stderr, "unknown option '%s'\n", argv[arg]); } arg++; } if ((merylCount == 0L) || (fastaName == 0L)) { fprintf(stderr, "usage: %s -m -f \n", argv[0]); exit(1); } // Open the count files // merylStreamReader *MSR = new merylStreamReader(merylCount); fprintf(stderr, "Mers are "uint32FMT" bases.\n", MSR->merSize()); fprintf(stderr, "There are "uint64FMT" unique (copy = 1) mers.\n", MSR->numberOfUniqueMers()); fprintf(stderr, "There are "uint64FMT" distinct mers.\n", MSR->numberOfDistinctMers()); fprintf(stderr, "There are "uint64FMT" mers total.\n", MSR->numberOfTotalMers()); // Guess how many mers we can fit into 700MB, then report how many chunks we need to do. 
uint32 merSize = MSR->merSize(); uint64 memoryLimit = 700 * 1024 * 1024; uint64 perMer = sizeof(kMerLite) + sizeof(dnode_t); uint64 mersPerBatch = memoryLimit / perMer; uint32 numBatches = MSR->numberOfDistinctMers() / mersPerBatch; uint32 batch = 0; dnode_t *nodes = new dnode_t [mersPerBatch]; kMerLite *mers = new kMerLite [mersPerBatch]; if (MSR->numberOfDistinctMers() % mersPerBatch) numBatches++; fprintf(stderr, "perMer: "uint64FMT" bytes ("uint64FMT" for kMerLite, "uint64FMT" for dnode_t.\n", perMer, (uint64)sizeof(kMerLite), (uint64)sizeof(dnode_t)); fprintf(stderr, "We can fit "uint64FMT" mers into "uint64FMT"MB.\n", mersPerBatch, memoryLimit >> 20); fprintf(stderr, "So we need "uint32FMT" batches to verify the count.\n", numBatches); while (MSR->validMer()) { uint64 mersRemain = mersPerBatch; dict_t *merDict = dict_create(mersPerBatch, kMerLiteSort); batch++; // STEP 1: Insert mersPerBatch into the merDict // fprintf(stderr, "STEP 1 BATCH "uint32FMTW(2)": Insert into merDict\n", batch); while (MSR->nextMer() && mersRemain) { mersRemain--; mers[mersRemain] = MSR->theFMer(); // initialize the node with the value, then insert the node // into the tree using the key int32 val = (int32)MSR->theCount(); dnode_init(&nodes[mersRemain], (void *)val); dict_insert(merDict, &nodes[mersRemain], &mers[mersRemain]); } // STEP 2: Stream the original file, decrementing the count // fprintf(stderr, "STEP 2 BATCH "uint32FMTW(2)": Stream fasta\n", batch); seqStream *CS = new seqStream(fastaName, true); merStream *MS = new merStream(new kMerBuilder(merSize), CS); kMerLite mer; dnode_t *nod; while (MS->nextMer()) { mer = MS->theFMer(); nod = dict_lookup(merDict, &mer); if (nod != 0L) { int32 val = (int32)dnode_get(nod); val--; dnode_put(nod, (void *)val); } else { // Unless the whole meryl file fit into our merDict, we cannot warn if // we don't find mers. 
// if (numBatches == 1) { char str[1024]; fprintf(stderr, "Didn't find node for mer '%s'\n", mer.merToString(merSize, str)); } } } delete MS; delete CS; // STEP 3: Check every node in the tree to make sure that the counts // are exactly zero. // fprintf(stderr, "STEP 3 BATCH "uint32FMTW(2)": Check\n", batch); nod = dict_first(merDict); while (nod) { int32 val = (int32)dnode_get(nod); kMerLite const *nodmer = (kMerLite const *)dnode_getkey(nod); if (val != 0) { char str[1024]; fprintf(stderr, "Got count "int32FMT" for mer '%s'\n", val, nodmer->merToString(merSize, str)); } nod = dict_next(merDict, nod); } // STEP 4: Destroy the dictionary. // fprintf(stderr, "STEP 4 BATCH "uint32FMTW(2)": Destroy\n", batch); while ((nod = dict_first(merDict))) dict_delete(merDict, nod); dict_destroy(merDict); } } canu-1.6/src/meryl/test/kmerlite.H000066400000000000000000000076451314437614700171570ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "bio++.H" #ifndef KMER_LITE_H #define KMER_LITE_H //////////////////////////////////////// // // This is kMerLite -- derived from kMer.H, removing // most of the accessors. // // Assumes that KMER_WORDS is already defined. class kMerLite { public: // Used by some of the test routines. 
void dump(void) const { for (uint32 i=0; i> 5; char *str = instr; if ((merSize & uint32MASK(6)) == 0) lastWord++; // We build the string right to left, print any partial word // first, then print whole words until we run out of words to // print. if (merSize & uint32MASK(5)) { ::merToString(merSize & uint32MASK(5), _wd[lastWord], str); str += merSize & uint32MASK(5); } while (lastWord > 0) { lastWord--; ::merToString(32, _wd[lastWord], str); str += 32; } return(instr); }; #if KMER_WORDS == 1 bool operator!=(kMerLite const &r) const { return(_wd[0] != r._wd[0]); }; bool operator==(kMerLite const &r) const { return(_wd[0] == r._wd[0]); }; bool operator<(kMerLite const &r) const { return(_wd[0] < r._wd[0]); }; bool operator>(kMerLite const &r) const { return(_wd[0] > r._wd[0]); }; bool operator<=(kMerLite const &r) const { return(_wd[0] <= r._wd[0]); }; bool operator>=(kMerLite const &r) const { return(_wd[0] >= r._wd[0]); }; #else bool operator!=(kMerLite const &r) const { uint64 res = uint64ZERO; for (uint32 i=KMER_WORDS; i--; ) res |= _wd[i] ^ r._wd[i]; return(res != uint64ZERO); }; bool operator==(kMerLite const &r) const { uint64 res = uint64ZERO; for (uint32 i=KMER_WORDS; i--; ) res |= _wd[i] ^ r._wd[i]; return(res == uint64ZERO); }; bool operator<(kMerLite const &r) const { for (uint32 i=KMER_WORDS; i--; ) { if (_wd[i] < r._wd[i]) return(true); if (_wd[i] > r._wd[i]) return(false); } return(false); }; bool operator>(kMerLite const &r) const { for (uint32 i=KMER_WORDS; i--; ) { if (_wd[i] > r._wd[i]) return(true); if (_wd[i] < r._wd[i]) return(false); } return(false); }; bool operator<=(kMerLite const &r) const { for (uint32 i=KMER_WORDS; i--; ) { if (_wd[i] < r._wd[i]) return(true); if (_wd[i] > r._wd[i]) return(false); } return(true); }; bool operator>=(kMerLite const &r) const { for (uint32 i=KMER_WORDS; i--; ) { if (_wd[i] > r._wd[i]) return(true); if (_wd[i] < r._wd[i]) return(false); } return(true); }; #endif private: uint64 _wd[KMER_WORDS]; }; #endif // 
KMER_LITE_H canu-1.6/src/meryl/test/stupidcount.C000066400000000000000000000032101314437614700176770ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "bio++.H" // Reads a sequence file, outputs a list of the mers in it. You can // then pipe this to unix sort and uniq to do a mercount. You // probably don't want to count large things this way... 
int main(int argc, char **argv) { char *seqName = 0L; uint32 merSize = 20; int arg=1; while (arg < argc) { if (strcmp(argv[arg], "-s") == 0) { seqName = argv[++arg]; } else if (strcmp(argv[arg], "-m") == 0) { merSize = strtouint32(argv[++arg], 0L); } arg++; } if (seqName == 0L) { fprintf(stderr, "usage: %s [-m mersize] -s seqfile.fasta\n", argv[0]); exit(1); } seqStream *CS = new seqStream(seqName, true); merStream *MS = new merStream(new kMerBuilder(merSize), CS); char str[1024]; while (MS->nextMer()) fprintf(stdout, "%s\n", MS->theFMer().merToString(str)); delete MS; delete CS; exit(0); } canu-1.6/src/meryl/test/test-seq1.fasta000066400000000000000000000005321314437614700200640ustar00rootroot00000000000000> 1 A 20 CG 0 T AAAAAAAAAAAAAAAAAAAAGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG > 1 A 1 CG 0 T ----not-a-mer------ -----is-a-mer------- AAAAAAAAAAAAAAAAAAAANGCGCGCGCGCGCGCGCGCGNCGCGCGCGCGCGCGCGCGCG > (zero 63 bases) ATNCGGATYCGATCGASCHJAGSVHYWERIGHWEEIRVHSDKFVHWIERVHIWRVHKSDFVKS > 20 T NNNNTTTTTTTTTTTTTTTTTTTTNNNNTTTTTTTTTTTTTTTTTTTTNNNN canu-1.6/src/meryl/test/test-seq2.fasta000066400000000000000000000010711314437614700200640ustar00rootroot00000000000000> 2 A 20 CG 0 T AAAAAAAAAAAAAAAAAAAAN AAAAAAAAAAAAAAAAAAAAN AAAAAAAAAAAAAAAAAAAAN AAAAAAAAAAAAAAAAAAAAN AAAAAAAAAAAAAAAAAAAAN AAAAAAAAAAAAAAAAAAACN AAAAAAAAAAAAAAAAAAACN AAAAAAAAAAAAAAAAAAACN AAAAAAAAAAAAAAAAAAACN AAAAAAAAAAAAAAAAAAACN AAAAAAAAAAAAAAAAAAAAAAAGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCG > 1 A 1 CG 0 T ----not-a-mer------ -----is-a-mer------- AAAAAAAAAAAAAAAAAAAANGCGCGCGCGCGCGCGCGCGNCGCGCGCGCGCGCGCGCGCG > (zero 63 bases) ATNCGGATYCGATCGASCHJAGSVHYWERIGHWEEIRVHSDKFVHWIERVHIWRVHKSDFVKS > 20 T NNNNTTTTTTTTTTTTTTTTTTTTNNNNTTTTTTTTTTTTTTTTTTTTNNNN canu-1.6/src/meryl/test/test-seq3.fasta000066400000000000000000000000741314437614700200670ustar00rootroot00000000000000> ACGCTCAGCTACTACGACTTAGAGAAAATAGCGATATAGCGATCGATCGATTAGAGA 
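The counters above rely on the same core trick: pack each k-mer into an integer, two bits per base, and (in simple.C's `-C` mode) keep the canonical form, the smaller of the forward mer and its reverse complement, exactly as in `(M->theFMer() <= M->theRMer()) ? M->theFMer() : M->theRMer()`. A minimal, self-contained sketch of that encoding, independent of the kMer/merStream classes (the function names here are illustrative, not canu API):

```cpp
#include <cassert>
#include <cstdint>

// Pack a k-mer (k <= 32) into a 64-bit word, two bits per base:
// A=0, C=1, G=2, T=3.  Returns the forward encoding.
uint64_t packMer(const char *s, uint32_t k) {
  uint64_t w = 0;
  for (uint32_t i = 0; i < k; i++) {
    uint64_t b = 0;
    switch (s[i]) {
      case 'A': b = 0; break;
      case 'C': b = 1; break;
      case 'G': b = 2; break;
      case 'T': b = 3; break;
    }
    w = (w << 2) | b;
  }
  return w;
}

// Reverse complement of a packed k-mer: complement each base (XOR
// with 3) while reversing the order of the 2-bit fields.
uint64_t revComp(uint64_t w, uint32_t k) {
  uint64_t r = 0;
  for (uint32_t i = 0; i < k; i++) {
    r = (r << 2) | ((w & 3) ^ 3);
    w >>= 2;
  }
  return r;
}

// Canonical mer: the numerically smaller of forward and reverse.
uint64_t canonical(const char *s, uint32_t k) {
  uint64_t f = packMer(s, k);
  uint64_t r = revComp(f, k);
  return (f <= r) ? f : r;
}
```

With this encoding a direct-count table (as in simple.C's non-sorting path) is just an array of 4^k counters indexed by the packed mer; note ACGT is its own reverse complement, so `canonical("ACGT", 4)` equals `packMer("ACGT", 4)`.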
canu-1.6/src/mhap/000077500000000000000000000000001314437614700140345ustar00rootroot00000000000000canu-1.6/src/mhap/mhap.mk000066400000000000000000000005401314437614700153110ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := mhap-2.1.2.jar SOURCES := mhap-2.1.2.tar canu-1.6/src/mhap/mhapConvert.C000066400000000000000000000126531314437614700164350ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-MAR-27 to 2015-JUN-25 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Sergey Koren beginning on 2016-FEB-24 * are a 'United States Government Work', and * are released in the public domain * * Brian P. Walenz beginning on 2016-AUG-09 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_global.H" #include "ovStore.H" #include "splitToWords.H" #include using namespace std; int main(int argc, char **argv) { char *outName = NULL; char *gkpName = NULL; uint32 baseIDhash = 0; uint32 numIDhash = 0; uint32 baseIDquery = 0; vector files; int32 arg = 1; int32 err = 0; while (arg < argc) { if (strcmp(argv[arg], "-o") == 0) { outName = argv[++arg]; } else if (strcmp(argv[arg], "-h") == 0) { baseIDhash = atoi(argv[++arg]) - 1; numIDhash = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-q") == 0) { baseIDquery = atoi(argv[++arg]) - 1; } else if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (AS_UTL_fileExists(argv[arg])) { files.push_back(argv[arg]); } else { fprintf(stderr, "ERROR: invalid arg '%s'\n", argv[arg]); err++; } arg++; } if ((err) || (gkpName == NULL) || (outName == NULL) || (files.size() == 0)) { fprintf(stderr, "usage: %s [options] file.mhap[.gz]\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, " Converts mhap native output to ovb\n"); fprintf(stderr, "\n"); fprintf(stderr, " -o out.ovb output file\n"); fprintf(stderr, "\n"); fprintf(stderr, " -h id num base id and number of hash table reads\n"); fprintf(stderr, " (mhap output IDs 1 through 'num')\n"); fprintf(stderr, " -q id base id of query reads\n"); fprintf(stderr, " (mhap output IDs 'num+1' and higher)\n"); if (gkpName == NULL) fprintf(stderr, "ERROR: no gkpStore (-G) supplied\n"); if (files.size() == 0) fprintf(stderr, "ERROR: no overlap files supplied\n"); exit(1); } char *ovStr = new char [1024]; gkStore *gkpStore = gkStore::gkStore_open(gkpName); ovOverlap ov(gkpStore); ovFile *of = new ovFile(NULL, outName, ovFileFullWrite); for (uint32 ff=0; fffile()) != NULL) { splitToWords W(ovStr); ov.a_iid = W(0) + baseIDquery - numIDhash; // First ID is the query ov.b_iid = W(1) + baseIDhash; // Second ID is the hash table if (ov.a_iid == ov.b_iid) continue; assert(W[4][0] == '0'); // first read is always forward assert(W(5) < W(6)); // first read bgn < end 
assert(W(6) <= W(7)); // first read end <= len assert(W(9) < W(10)); // second read bgn < end assert(W(10) <= W(11)); // second read end <= len ov.dat.ovl.forUTG = true; ov.dat.ovl.forOBT = true; ov.dat.ovl.forDUP = true; ov.dat.ovl.ahg5 = W(5); ov.dat.ovl.ahg3 = W(7) - W(6); if (W[8][0] == '0') { ov.dat.ovl.bhg5 = W(9); ov.dat.ovl.bhg3 = W(11) - W(10); ov.flipped(false); } else { ov.dat.ovl.bhg3 = W(9); ov.dat.ovl.bhg5 = W(11) - W(10); ov.flipped(true); } ov.erate(atof(W[2])); // Check the overlap - the hangs must be less than the read length. uint32 alen = gkpStore->gkStore_getRead(ov.a_iid)->gkRead_sequenceLength(); uint32 blen = gkpStore->gkStore_getRead(ov.b_iid)->gkRead_sequenceLength(); if ((alen < ov.dat.ovl.ahg5 + ov.dat.ovl.ahg3) || (blen < ov.dat.ovl.bhg5 + ov.dat.ovl.bhg3)) { fprintf(stderr, "INVALID OVERLAP %8u (len %6d) %8u (len %6d) hangs %6lu %6lu - %6lu %6lu flip %lu\n", ov.a_iid, alen, ov.b_iid, blen, ov.dat.ovl.ahg5, ov.dat.ovl.ahg3, ov.dat.ovl.bhg5, ov.dat.ovl.bhg3, ov.dat.ovl.flipped); exit(1); } // Overlap looks good, write it! of->writeOverlap(&ov); } delete in; arg++; } delete of; delete [] ovStr; gkpStore->gkStore_close(); exit(0); } canu-1.6/src/mhap/mhapConvert.mk000066400000000000000000000007541314437614700166610ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := mhapConvert SOURCES := mhapConvert.C SRC_INCDIRS := .. 
../AS_UTL ../stores liboverlap TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/minimap/000077500000000000000000000000001314437614700145415ustar00rootroot00000000000000canu-1.6/src/minimap/mmapConvert.C000066400000000000000000000170111314437614700171400ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Sergey Koren beginning on 2016-FEB-24 * are a 'United States Government Work', and * are released in the public domain * * Brian P. Walenz beginning on 2016-OCT-24 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_global.H" #include "ovStore.H" #include "splitToWords.H" #include using namespace std; int main(int argc, char **argv) { char *outName = NULL; char *gkpName = NULL; bool partialOverlaps = false; uint32 minOverlapLength = 0; uint32 tolerance = 0; vector files; int32 arg = 1; int32 err = 0; while (arg < argc) { if (strcmp(argv[arg], "-o") == 0) { outName = argv[++arg]; } else if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-tolerance") == 0) { tolerance = atoi(argv[++arg]);; } else if (strcmp(argv[arg], "-partial") == 0) { partialOverlaps = true; } else if (strcmp(argv[arg], "-len") == 0) { minOverlapLength = atoi(argv[++arg]); } else if (AS_UTL_fileExists(argv[arg])) { files.push_back(argv[arg]); } else { fprintf(stderr, "ERROR: invalid arg '%s'\n", argv[arg]); err++; } arg++; } if ((err) || (gkpName == NULL) || (outName == NULL) || (files.size() == 0)) { fprintf(stderr, "usage: %s [options] file.mhap[.gz]\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, " Converts mhap native output to ovb\n"); fprintf(stderr, "\n"); fprintf(stderr, " -o out.ovb output file\n"); fprintf(stderr, "\n"); if (gkpName == NULL) fprintf(stderr, "ERROR: no gkpStore (-G) supplied\n"); if (files.size() == 0) fprintf(stderr, "ERROR: no overlap files supplied\n"); exit(1); } char *ovStr = new char [1024*1024]; gkStore *gkpStore = gkStore::gkStore_open(gkpName); ovOverlap ov(gkpStore); ovFile *of = new ovFile(NULL, outName, ovFileFullWrite); for (uint32 ff=0; fffile()) != NULL) { splitToWords W(ovStr); ov.a_iid = W(0); ov.b_iid = W(5); if (ov.a_iid == ov.b_iid) continue; ov.dat.ovl.ahg5 = W(2); ov.dat.ovl.ahg3 = W(1) - W(3); if (W[4][0] == '+') { ov.dat.ovl.bhg5 = W(7); ov.dat.ovl.bhg3 = W(6) - W(8); ov.flipped(false); } else { ov.dat.ovl.bhg3 = W(7); ov.dat.ovl.bhg5 = W(6) - W(8); ov.flipped(true); } ov.erate(1-((double)W(9)/W(10))); // Check the overlap - the hangs must be less than the read length. 
uint32 alen = gkpStore->gkStore_getRead(ov.a_iid)->gkRead_sequenceLength(); uint32 blen = gkpStore->gkStore_getRead(ov.b_iid)->gkRead_sequenceLength(); if ((alen < ov.dat.ovl.ahg5 + ov.dat.ovl.ahg3) || (blen < ov.dat.ovl.bhg5 + ov.dat.ovl.bhg3)) { fprintf(stderr, "INVALID OVERLAP %8u (len %6d) %8u (len %6d) hangs %6lu %6lu - %6lu %6lu flip %lu\n", ov.a_iid, alen, ov.b_iid, blen, ov.dat.ovl.ahg5, ov.dat.ovl.ahg3, ov.dat.ovl.bhg5, ov.dat.ovl.bhg3, ov.dat.ovl.flipped); exit(1); } if (!ov.overlapIsDovetail() && partialOverlaps == false) { if (alen <= blen && ov.dat.ovl.ahg5 >= 0 && ov.dat.ovl.ahg3 >= 0 && ov.dat.ovl.bhg5 >= ov.dat.ovl.ahg5 && ov.dat.ovl.bhg3 >= ov.dat.ovl.ahg3 && ((ov.dat.ovl.ahg5 + ov.dat.ovl.ahg3)) < tolerance) { ov.dat.ovl.bhg5 = max(0, ov.dat.ovl.bhg5 - ov.dat.ovl.ahg5); ov.dat.ovl.ahg5 = 0; ov.dat.ovl.bhg3 = max(0, ov.dat.ovl.bhg3 - ov.dat.ovl.ahg3); ov.dat.ovl.ahg3 = 0; } // second is b contained (both b hangs can be extended) // else if (alen >= blen && ov.dat.ovl.bhg5 >= 0 && ov.dat.ovl.bhg3 >= 0 && ov.dat.ovl.ahg5 >= ov.dat.ovl.bhg5 && ov.dat.ovl.ahg3 >= ov.dat.ovl.bhg3 && ((ov.dat.ovl.bhg5 + ov.dat.ovl.bhg3)) < tolerance) { ov.dat.ovl.ahg5 = max(0, ov.dat.ovl.ahg5 - ov.dat.ovl.bhg5); ov.dat.ovl.bhg5 = 0; ov.dat.ovl.ahg3 = max(0, ov.dat.ovl.ahg3 - ov.dat.ovl.bhg3); ov.dat.ovl.bhg3 = 0; } // third is 5' dovetail ----------> // ----------> // or // <--------- // bhg5 here is always first overhang on b read // else if (ov.dat.ovl.ahg3 <= ov.dat.ovl.bhg3 && (ov.dat.ovl.ahg3 >= 0 && ((double)(ov.dat.ovl.ahg3)) < tolerance) && (ov.dat.ovl.bhg5 >= 0 && ((double)(ov.dat.ovl.bhg5)) < tolerance)) { ov.dat.ovl.ahg5 = max(0, ov.dat.ovl.ahg5 - ov.dat.ovl.bhg5); ov.dat.ovl.bhg5 = 0; ov.dat.ovl.bhg3 = max(0, ov.dat.ovl.bhg3 - ov.dat.ovl.ahg3); ov.dat.ovl.ahg3 = 0; } // // fourth is 3' dovetail ----------> // ----------> // or // <---------- // bhg5 is always first overhang on b read else if (ov.dat.ovl.ahg5 <= ov.dat.ovl.bhg5 && (ov.dat.ovl.ahg5 >= 0 && 
((double)(ov.dat.ovl.ahg5)) < tolerance) && (ov.dat.ovl.bhg3 >= 0 && ((double)(ov.dat.ovl.bhg3)) < tolerance)) { ov.dat.ovl.bhg5 = max(0, ov.dat.ovl.bhg5 - ov.dat.ovl.ahg5); ov.dat.ovl.ahg5 = 0; ov.dat.ovl.ahg3 = max(0, ov.dat.ovl.ahg3 - ov.dat.ovl.bhg3); ov.dat.ovl.bhg3 = 0; } } ov.dat.ovl.forUTG = (partialOverlaps == false) && (ov.overlapIsDovetail() == true);; ov.dat.ovl.forOBT = partialOverlaps; ov.dat.ovl.forDUP = partialOverlaps; // check the length is big enough if (ov.a_end() - ov.a_bgn() < minOverlapLength || ov.b_end() - ov.b_bgn() < minOverlapLength) { continue; } // Overlap looks good, write it! of->writeOverlap(&ov); } arg++; } delete of; delete [] ovStr; gkpStore->gkStore_close(); exit(0); } canu-1.6/src/minimap/mmapConvert.mk000066400000000000000000000007541314437614700173730ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := mmapConvert SOURCES := mmapConvert.C SRC_INCDIRS := .. ../AS_UTL ../stores liboverlap TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/overlapBasedTrimming/000077500000000000000000000000001314437614700172255ustar00rootroot00000000000000canu-1.6/src/overlapBasedTrimming/adjustFlipped.C000066400000000000000000000111461314437614700221320ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. 
* * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-APR-15 to 2015-JUN-16 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "adjustOverlaps.H" // Adjust the overlap for any trimming done already. This works by computing the fraction of the // overlap trimmed for each read and each end, picking the largest fraction for each end, and // applying that fraction to the other read. // // It expects only flipped overlaps. The output coordinates are for a REVERSE COMPLEMENTED // b read. if you care which end is the actual 5' or 3' end, look at flipped(). bool adjustFlipped(clearRangeFile *iniClr, gkStore *gkp, ovOverlap *ovl, uint32 &aovlbgn, uint32 &aovlend, uint32 &bovlbgn, uint32 &bovlend, uint32 &aclrbgn, uint32 &aclrend, uint32 &bclrbgn, uint32 &bclrend) { assert(ovl->flipped() == true); uint32 bLen = gkp->gkStore_getRead(ovl->b_iid)->gkRead_sequenceLength(); aovlbgn = ovl->a_bgn(); bovlbgn = bLen - ovl->b_bgn(); // bgn(), because this is the higher coord aovlend = ovl->a_end(); bovlend = bLen - ovl->b_end(); aclrbgn = iniClr->bgn(ovl->a_iid); bclrbgn = bLen - iniClr->end(ovl->b_iid); // end(), because this is the higher coord aclrend = iniClr->end(ovl->a_iid); bclrend = bLen - iniClr->bgn(ovl->b_iid); assert(aovlbgn < aovlend); assert(bovlbgn < bovlend); if ((aclrend <= aovlbgn) || (aovlend <= aclrbgn) || (bclrend <= bovlbgn) || (bovlend <= bclrbgn)) { // Overlap doesn't intersect clear range, fail. 
#if 0 fprintf(stderr, "Discard FLIP overlap from %u,%u-%u,%u based on clear ranges %u,%u and %u,%u\n", aovlbgn, aovlend, bovlbgn, bovlend, aclrbgn, aclrend, bclrbgn, bclrend); #endif return(false); } uint32 alen = aovlend - aovlbgn; uint32 blen = bovlend - bovlbgn; double afracbgn = (double)((aclrbgn < aovlbgn) ? (0) : (aclrbgn - aovlbgn)) / alen; double bfracbgn = (double)((bclrbgn < bovlbgn) ? (0) : (bclrbgn - bovlbgn)) / blen; double afracend = (double)((aclrend > aovlend) ? (0) : (aovlend - aclrend)) / alen; double bfracend = (double)((bclrend > bovlend) ? (0) : (bovlend - bclrend)) / blen; //fprintf(stderr, "frac a %.20f %.20f b %.20f %.20f\n", afracbgn, afracend, bfracbgn, bfracend); //fprintf(stderr, "frac a %.20f %.20f b %.20f %.20f\n", afracbgn * alen, afracend * alen, bfracbgn * blen, bfracend * blen); double maxbgn = max(afracbgn, bfracbgn); double maxend = max(afracend, bfracend); //fprintf(stderr, "frac a %.20f %.20f b %.20f %.20f\n", maxbgn * alen, maxend * alen, maxbgn * blen, maxend * blen); assert(maxbgn < 1.0); assert(maxend < 1.0); uint32 aadjbgn = (uint32)round(maxbgn * alen); uint32 badjbgn = (uint32)round(maxbgn * blen); uint32 aadjend = (uint32)round(maxend * alen); uint32 badjend = (uint32)round(maxend * blen); //fprintf(stderr, "frac a %u %u b %u %u alen %u blen %u\n", aadjbgn, aadjend, badjbgn, badjend, alen, blen); #if 0 fprintf(stderr, "Adjusted FLIP overlap from %u,%u-%u,%u (adjust %u,%u,%u,%u) to %u,%u-%u,%u based on clear ranges %u,%u and %u,%u maxbgn=%f maxend=%f\n", aovlbgn, aovlend, bovlbgn, bovlend, aadjbgn, aadjend, badjbgn, badjend, aovlbgn + aadjbgn, aovlend - aadjend, bovlbgn + badjbgn, bovlend - badjend, aclrbgn, aclrend, bclrbgn, bclrend, maxbgn, maxend); #endif aovlbgn += aadjbgn; bovlbgn += badjbgn; aovlend -= aadjend; bovlend -= badjend; assert(aclrbgn <= aovlbgn); assert(bclrbgn <= bovlbgn); assert(aovlend <= aclrend); assert(bovlend <= bclrend); return(true); } 
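The fraction-based adjustment described in the comments of adjustFlipped()/adjustNormal() can be illustrated in isolation: each read reports the fraction of its overlap that the clear range trimmed at one end, and the larger fraction is applied to both reads, scaled by each read's own overlap length. This is a hedged sketch, not canu code; the struct and function names (`EndAdjust`, `adjustBgn`) are mine.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

//  Illustrative sketch (names are hypothetical, not canu's): the per-end
//  adjustment used by adjustNormal()/adjustFlipped() for the begin end.
struct EndAdjust {
  uint32_t aTrim;   //  bases to drop from the a read at this end
  uint32_t bTrim;   //  bases to drop from the b read at this end
};

EndAdjust
adjustBgn(uint32_t aclrbgn, uint32_t aovlbgn, uint32_t alen,
          uint32_t bclrbgn, uint32_t bovlbgn, uint32_t blen) {
  //  Fraction of each read's overlap trimmed away at the begin end;
  //  zero if the clear range starts before the overlap does.
  double afrac = (aclrbgn < aovlbgn) ? 0.0 : (double)(aclrbgn - aovlbgn) / alen;
  double bfrac = (bclrbgn < bovlbgn) ? 0.0 : (double)(bclrbgn - bovlbgn) / blen;

  //  The larger fraction is applied to BOTH reads.
  double frac = std::max(afrac, bfrac);

  return { (uint32_t)std::round(frac * alen),
           (uint32_t)std::round(frac * blen) };
}
```

For example, if the a read lost 10 of its 100 overlap bases to trimming and the b read (overlap length 200) lost none, the shared fraction is 0.1, so the a read drops 10 bases and the b read drops 20.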
canu-1.6/src/overlapBasedTrimming/adjustNormal.C000066400000000000000000000103711314437614700217760ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-APR-15 to 2015-JUN-16 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "adjustOverlaps.H" // Adjust the overlap for any trimming done already. This works by computing the fraction of the // overlap trimmed for each read and each end, picking the largest fraction for each end, and // applying that fraction to the other read. // // It expects only normal overlaps. 
bool adjustNormal(clearRangeFile *iniClr, gkStore *gkp, ovOverlap *ovl, uint32 &aovlbgn, uint32 &aovlend, uint32 &bovlbgn, uint32 &bovlend, uint32 &aclrbgn, uint32 &aclrend, uint32 &bclrbgn, uint32 &bclrend) { assert(ovl->flipped() == false); aovlbgn = ovl->a_bgn(); bovlbgn = ovl->b_bgn(); aovlend = ovl->a_end(); bovlend = ovl->b_end(); aclrbgn = iniClr->bgn(ovl->a_iid); bclrbgn = iniClr->bgn(ovl->b_iid); aclrend = iniClr->end(ovl->a_iid); bclrend = iniClr->end(ovl->b_iid); assert(aovlbgn < aovlend); assert(bovlbgn < bovlend); if ((aclrend <= aovlbgn) || (aovlend <= aclrbgn) || (bclrend <= bovlbgn) || (bovlend <= bclrbgn)) { // Overlap doesn't intersect clear range, fail. #if 0 fprintf(stderr, "Discard NORM overlap from %u,%u-%u,%u based on clear ranges %u,%u and %u,%u\n", aovlbgn, aovlend, bovlbgn, bovlend, aclrbgn, aclrend, bclrbgn, bclrend); #endif return(false); } uint32 alen = aovlend - aovlbgn; uint32 blen = bovlend - bovlbgn; double afracbgn = (double)((aclrbgn < aovlbgn) ? (0) : (aclrbgn - aovlbgn)) / alen; double bfracbgn = (double)((bclrbgn < bovlbgn) ? (0) : (bclrbgn - bovlbgn)) / blen; double afracend = (double)((aclrend > aovlend) ? (0) : (aovlend - aclrend)) / alen; double bfracend = (double)((bclrend > bovlend) ? 
(0) : (bovlend - bclrend)) / blen; //fprintf(stderr, "frac a %.20f %.20f b %.20f %.20f\n", afracbgn, afracend, bfracbgn, bfracend); //fprintf(stderr, "frac a %.20f %.20f b %.20f %.20f\n", afracbgn * alen, afracend * alen, bfracbgn * blen, bfracend * blen); double maxbgn = max(afracbgn, bfracbgn); double maxend = max(afracend, bfracend); //fprintf(stderr, "frac a %.20f %.20f b %.20f %.20f\n", maxbgn * alen, maxend * alen, maxbgn * blen, maxend * blen); assert(maxbgn < 1.0); assert(maxend < 1.0); uint32 aadjbgn = (uint32)round(maxbgn * alen); uint32 badjbgn = (uint32)round(maxbgn * blen); uint32 aadjend = (uint32)round(maxend * alen); uint32 badjend = (uint32)round(maxend * blen); //fprintf(stderr, "frac a %u %u b %u %u alen %u blen %u\n", aadjbgn, aadjend, badjbgn, badjend, alen, blen); #if 0 fprintf(stderr, "Adjusted NORM overlap from %u,%u-%u,%u (adjust %u,%u,%u,%u) to %u,%u-%u,%u based on clear ranges %u,%u and %u,%u maxbgn=%f maxend=%f\n", aovlbgn, aovlend, bovlbgn, bovlend, aadjbgn, aadjend, badjbgn, badjend, aovlbgn + aadjbgn, aovlend - aadjend, bovlbgn + badjbgn, bovlend - badjend, aclrbgn, aclrend, bclrbgn, bclrend, maxbgn, maxend); #endif aovlbgn += aadjbgn; bovlbgn += badjbgn; aovlend -= aadjend; bovlend -= badjend; assert(aclrbgn <= aovlbgn); assert(bclrbgn <= bovlbgn); assert(aovlend <= aclrend); assert(bovlend <= bclrend); return(true); } canu-1.6/src/overlapBasedTrimming/adjustOverlaps.H000066400000000000000000000031471314437614700223510ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. 
* * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2015-JUN-16 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef ADJUST_OVERLAPS_H #define ADJUST_OVERLAPS_H #include "AS_global.H" #include "gkStore.H" #include "ovStore.H" #include "clearRangeFile.H" bool adjustNormal(clearRangeFile *iniClr, gkStore *gkp, ovOverlap *ovl, uint32 &aovlbgn, uint32 &aovlend, uint32 &bovlbgn, uint32 &bovlend, uint32 &aclrbgn, uint32 &aclrend, uint32 &bclrbgn, uint32 &bclrend); bool adjustFlipped(clearRangeFile *iniClr, gkStore *gkp, ovOverlap *ovl, uint32 &aovlbgn, uint32 &aovlend, uint32 &bovlbgn, uint32 &bovlend, uint32 &aclrbgn, uint32 &aclrend, uint32 &bclrbgn, uint32 &bclrend); #endif canu-1.6/src/overlapBasedTrimming/clearRangeFile.H000066400000000000000000000104521314437614700222030ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2015-APR-15 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #ifndef CLEAR_RANGE_FILE_H #define CLEAR_RANGE_FILE_H #include "AS_global.H" #include "gkStore.H" #include "AS_UTL_fileIO.H" class clearRangeFile { public: // Create a clear range file. If the file doesn't exist, maxID must be set to the number // of reads allowed. If the file does exist, we don't care and load the number of reads // from the file. clearRangeFile(char *fileName, gkStore *gkp) { strcpy(_fileName, fileName); if (AS_UTL_fileExists(_fileName) == false) { _modified = true; // Always gets written, regardless of changes _lastID = gkp->gkStore_getNumReads(); _bgn = new uint32 [_lastID + 1]; _end = new uint32 [_lastID + 1]; // Every range is undefined memset(_bgn, 0, sizeof(uint32) * (_lastID + 1)); memset(_end, 0, sizeof(uint32) * (_lastID + 1)); // Nope, every range is set to fully clear. reset(gkp); } else { _modified = false; FILE *F = fopen(_fileName, "r"); if (errno) fprintf(stderr, "clearRangeFile()-- Failed to open '%s' for loading clear ranges: %s\n", _fileName, strerror(errno)), exit(1); AS_UTL_safeRead(F, &_lastID, "clearRangeFile::lastID", sizeof(uint32), 1); assert(gkp->gkStore_getNumReads() == _lastID); // Sane? _bgn = new uint32 [_lastID + 1]; _end = new uint32 [_lastID + 1]; AS_UTL_safeRead(F, _bgn, "clearRangeFile::bgn", sizeof(uint32), _lastID + 1); AS_UTL_safeRead(F, _end, "clearRangeFile::end", sizeof(uint32), _lastID + 1); fclose(F); } }; // Close the clear range file. 
~clearRangeFile() { if (_modified == true) { FILE *F = fopen(_fileName, "w"); if (errno) fprintf(stderr, "clearRangeFile()-- Failed to open '%s' for saving clear ranges: %s\n", _fileName, strerror(errno)), exit(1); AS_UTL_safeWrite(F, &_lastID, "clearRangeFile::lastID", sizeof(uint32), 1); AS_UTL_safeWrite(F, _bgn, "clearRangeFile::bgn", sizeof(uint32), _lastID + 1); AS_UTL_safeWrite(F, _end, "clearRangeFile::end", sizeof(uint32), _lastID + 1); fclose(F); } delete [] _bgn; delete [] _end; }; void reset(gkStore *gkp) { for (uint32 fi=1; fi <= _lastID; fi++) { _bgn[fi] = 0; _end[fi] = gkp->gkStore_getRead(fi)->gkRead_sequenceLength(); } }; uint32 bgn(uint32 id) { assert(id <= _lastID); return(_bgn[id]); }; uint32 end(uint32 id) { assert(id <= _lastID); return(_end[id]); }; uint32 &setbgn(uint32 id) { assert(id <= _lastID); _modified = true; return(_bgn[id]); }; uint32 &setend(uint32 id) { assert(id <= _lastID); _modified = true; return(_end[id]); }; bool isDeleted(uint32 id) { assert(id <= _lastID); return((_bgn[id] == UINT32_MAX) && (_end[id] == UINT32_MAX)); }; void setDeleted(uint32 id) { assert(id <= _lastID); _modified = true; _bgn[id] = UINT32_MAX; _end[id] = UINT32_MAX; }; void copy(clearRangeFile *source) { if (source == NULL) return; assert(_lastID == source->_lastID); memcpy(_bgn, source->_bgn, sizeof(uint32) * (_lastID + 1)); memcpy(_end, source->_end, sizeof(uint32) * (_lastID + 1)); }; private: bool _modified; char _fileName[FILENAME_MAX]; uint32 _lastID; // [_lastID] is valid; allocated _lastID+1 spots. uint32 *_bgn; uint32 *_end; }; #endif // CLEAR_RANGE_FILE_H canu-1.6/src/overlapBasedTrimming/detect-unsplit-subreads-in-overlaps.pl000066400000000000000000000056041314437614700265760ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. 
#
#  This software is based on:
#    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
#    the 'kmer package' (http://kmer.sourceforge.net)
#  both originally distributed by Applera Corporation under the GNU General
#  Public License, version 2.
#
#  Canu branched from Celera Assembler at its revision 4587.
#  Canu branched from the kmer project at its revision 1994.
#
#  Modifications by:
#
#    Brian P. Walenz beginning on 2015-OCT-12
#      are a 'United States Government Work', and
#      are released in the public domain
#
#  File 'README.licenses' in the root directory of this distribution contains
#  full conditions and disclaimers for each license.
##

use strict;

my %readLen;

open(F, "gatekeeper -dumpfragments -tabular test.gkpStore |");
while (<F>) {
    my @v = split '\s+', $_;
    $readLen{$v[1]} = $v[9];
}
close(F);

my %readHash;
my %readConf;
my $lastRead = 0;
my $nOvl     = 0;
my $potSub   = 0;
my $proSub   = 0;

open(F, "overlapStore -d test.ovlStore |");
while (<F>) {
    s/^\s+//;
    s/\s+$//;

    my @v = split '\s+', $_;

    if ($lastRead != $v[0]) {
        my $mCount = scalar(keys %readConf);

        if ($mCount > 0) {
            print "$lastRead\t$mCount\n";
            $potSub++;
        }
        if ($mCount > 5) {
            $proSub++;
        }

        $lastRead = $v[0];
        $nOvl     = 0;

        undef %readHash;
        undef %readConf;
    }

    next if (($v[3] < 0) && ($v[4] > 0));   # read is contained in other

    $nOvl++;

    if (! exists($readHash{$v[1]})) {
        $readHash{$v[1]} = "$v[3] $v[4]";
        next;
    }

    my ($na, $nb) = ($v[3], $v[4]);
    my ($oa, $ob) = split '\s+', $readHash{$v[1]};

    my $len = $readLen{$v[0]};

    my $nbgn = ($na > 0) ? $na : 0;
    my $nend = ($nb < 0) ? ($len + $nb) : $len;

    my $obgn = ($oa > 0) ? $oa : 0;
    my $oend = ($ob < 0) ? ($len + $ob) : $len;

    #print "$_ -- $nbgn $nend - $obgn $oend\n";

    goto saveConf if (($nbgn <= $obgn) && ($nend <= $obgn));   # n before o
    goto saveConf if (($nbgn >= $oend) && ($nend >= $oend));   # n after o

    my $min = ($nbgn < $obgn) ? $obgn : $nbgn;   # Largest bgn coord
    my $max = ($nend < $oend) ?
              $nend : $oend;                     # Smallest end coord

    my $nlen = $nend - $nbgn;
    my $olen = $oend - $obgn;
    my $olap = $max - $min;

    #print "$_ -- $olap $nlen $olen\n";

    next if (($olap / $nlen > 0.5) || ($olap / $olen > 0.5));   # reads overlap by lots

    # different regions, probably sub read detected
  saveConf:
    $readConf{$v[1]}++;
    #print "$_\n";
}
close(F);

print STDERR "potential unsplit subreads: $potSub\n";
print STDERR "probable unsplit subreads: $proSub\n";
canu-1.6/src/overlapBasedTrimming/generate-random-chimeric-fragments.pl000066400000000000000000000112631314437614700264020ustar00rootroot00000000000000
#!/usr/bin/env perl
###############################################################################
#
#  This file is part of canu, a software program that assembles whole-genome
#  sequencing reads into contigs.
#
#  This software is based on:
#    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
#    the 'kmer package' (http://kmer.sourceforge.net)
#  both originally distributed by Applera Corporation under the GNU General
#  Public License, version 2.
#
#  Canu branched from Celera Assembler at its revision 4587.
#  Canu branched from the kmer project at its revision 1994.
#
#  Modifications by:
#
#    Brian P. Walenz beginning on 2015-OCT-12
#      are a 'United States Government Work', and
#      are released in the public domain
#
#  File 'README.licenses' in the root directory of this distribution contains
#  full conditions and disclaimers for each license.
##

use strict;

# ###########################################################################
#
# This file is part of Celera Assembler, a software program that
# assembles whole-genome shotgun reads into contigs and scaffolds.
# Copyright (C) 2009, J. Craig Venter Institute.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
# # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received (LICENSE.txt) a copy of the GNU General Public # License along with this program; if not, write to the Free Software # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA # ########################################################################### # # Generates random chimera and spur fragments from a supplied reference # genome. # # User unfriendly. # # A spur: [junk][sequence] # A chimer: [junk][sequence][junk][sequence][junk] my $genome = shift @ARGV; my $readLength = 400; my $leaff = "/work/wgs/kmer/leaff/leaff"; my $cvtto = "perl /work/wgs/src/AS_MSG/convert-fasta-to-v2.pl"; die "Reference sequence '$genome' not found.\n" if (! -e $genome); # generate some random reads from the genome, both forward and reverse. system("$leaff --errors $readLength 10000 1 0.01 $genome > sample3.fasta"); system("$leaff -f sample3.fasta -R -C -W > sample1.fasta"); system("$leaff --errors $readLength 10000 1 0.01 $genome > sample2.fasta"); # generate random small bits of sequence from those samples; leaff can't do this nicely. 
my @sequences;

open(F, "< sample1.fasta");
while (!eof(F)) {
    my $h = <F>;  chomp $h;
    my $s = <F>;  chomp $s;
    my $r = rand();
    push @sequences, "$r\0seq\0$s";
}
close(F);

open(F, "< sample2.fasta");
while (!eof(F)) {
    my $h = <F>;  chomp $h;
    my $s = <F>;  chomp $s;
    my $r = rand();
    push @sequences, "$r\0rev\0$s";
}
close(F);

@sequences = sort { $a <=> $b } @sequences;

open(F, "> chimer.fasta");
open(R, "$leaff -G 100000 $readLength $readLength |");

my $iid = 0;
my $def = "";
my $seq = "";
my $seqLen = 0;

while (scalar(@sequences) > 0) {
    my $np = int(rand() * 3 + 1);
    my $gl;

    $iid++;

    $def    = "$iid";
    $seq    = "";
    $seqLen = 0;

    $gl = int(rand() * 200 - 50);   # between -50 and 150, junk on the front
    if ($gl > 0) {
        my $h = <R>;
        my $s = <R>;
        $def    .= "-jnk-$gl";
        $seq    .= substr($s, 0, $gl);
        $seqLen += $gl;
    }

    for (my $pp=0; $pp < $np; $pp++) {
        my $S = pop @sequences;
        my ($r, $h, $s) = split '\0', $S;

        my $l = int(rand() * 200 + 50);   # between 50 and 250.

        $def    .= "-$h-$l";
        $seq    .= substr($s, 0, $l);
        $seqLen += $l;

        if ($pp + 1 == $np) {
            $gl = 200;
        } else {
            $gl = 100;
        }

        $gl = int(rand() * $gl - 50);   # between -50 and 50, junk in the middle
        if ($gl > 0) {
            my $h = <R>;
            my $s = <R>;
            $def    .= "-jnk-$gl";
            $seq    .= substr($s, 0, $gl);
            $seqLen += $gl;
        }
    }

    print F ">$def\n$seq\n";
}
close(F);

system("tr 'ACGT' 'NNNN' < chimer.fasta > chimer.qv");
system("perl /work/wgs/src/AS_MSG/convert-fasta-to-v2.pl -l CHIMER -s chimer.fasta -q chimer.qv > chimer.frg");

unlink "sample1.fasta";
unlink "sample2.fasta";
unlink "sample3.fasta";
unlink "sample1.fastaidx";
unlink "sample2.fastaidx";
unlink "sample3.fastaidx";
unlink "chimer.fasta";
unlink "chimer.qv";
canu-1.6/src/overlapBasedTrimming/splitReads-subReads.C000066400000000000000000000272741314437614700232250ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz beginning on 2016-JAN-11
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#include "splitReads.H"

//  Examine overlaps for a specific pattern indicating a read that flips back on itself.
//  This requires multiple overlaps from the same read, in opposite orientation, to work.
//
//  If this read isn't suspected to contain sub reads, all we can do is mark some
//  overlaps as junk.
//
//  If this read is suspected to contain sub reads, we want to mark the junction region.

uint32
intervalOverlap(uint32 b1, uint32 e1, uint32 b2, uint32 e2) {
  uint32 minmax = MIN(e1, e2);
  uint32 maxmin = MAX(b1, b2);

  if (minmax > maxmin)
    return(minmax - maxmin);

  return(0);
}

bool
doCheckSubRead(gkStore *gkp, uint32 id) {
  gkRead     *read = gkp->gkStore_getRead(id);
  gkLibrary  *libr = gkp->gkStore_getLibrary(read->gkRead_libraryID());

  return(libr->gkLibrary_checkForSubReads() == true);
}

//  Populate w->blist with intervals where a suspected subread junction occurs.

void
detectSubReads(gkStore  *gkp,
               workUnit *w,
               FILE     *subreadFile,
               bool      subreadFileVerbose) {

  assert(w->adjLen > 0);
  assert(doCheckSubRead(gkp, w->id) == true);

  map<uint32,uint32>  secondIdx;
  map<uint32,uint32>  numOlaps;

  bool  largePalindrome = false;

  intervalList<int32>  BAD;
  intervalList<int32>  BADall;

  //  Count the number of overlaps for each b_iid, and remember the last index.  There are supposed to
  //  be at most two overlaps per ID pair, so if we remember the last, and iterate through, we can
  //  get both.
  for (uint32 ii=0; ii<w->adjLen; ii++) {
    secondIdx[w->adj[ii].b_iid] = ii;
    numOlaps [w->adj[ii].b_iid]++;
  }

  //  Scan overlaps.  For any pair of b_iid, with overlaps in opposite directions, compute a 'bad'
  //  interval where a suspected flip occurs.

  for (uint32 ii=0; ii<w->adjLen; ii++) {
    adjOverlap  *aii = w->adj + ii;

    if (numOlaps[w->adj[ii].b_iid] == 1) {
      //  Only one overlap, can't indicate sub read!
      //if ((subreadFile) && (subreadFileVerbose))
      //  fprintf(subreadFile, "oneOverlap %u (%u-%u) %u (%u-%u) -- can't indicate subreads\n",
      //          w->adj[ii].a_iid, w->adj[ii].aovlbgn, w->adj[ii].aovlend, w->adj[ii].b_iid, w->adj[ii].bovlbgn, w->adj[ii].bovlend);
      continue;
    }

    //  We should never get more than two overlaps per read pair.
    if (numOlaps[w->adj[ii].b_iid] > 2) {
      fprintf(stderr, "ERROR: more overlaps than expected for pair %u %u.\n",
              w->adj[ii].a_iid, w->adj[ii].b_iid);
      continue;
    }

    assert(numOlaps[w->adj[ii].b_iid] == 2);

    uint32       jj  = secondIdx[w->adj[ii].b_iid];
    adjOverlap  *ajj = w->adj + jj;

    assert(jj < w->adjLen);

    if (ii == jj) {
      //  Already did this one!
      //if ((subreadFile) && (subreadFileVerbose))
      //  fprintf(subreadFile, "sameOverlap %u (%u-%u) %u (%u-%u)\n",
      //          w->adj[ii].a_iid, w->adj[ii].aovlbgn, w->adj[ii].aovlend, w->adj[ii].b_iid, w->adj[ii].bovlbgn, w->adj[ii].bovlend);
      continue;
    }

    //  The two overlaps should be for the same reads.
    assert(w->adj[ii].a_iid == w->adj[jj].a_iid);
    assert(w->adj[ii].b_iid == w->adj[jj].b_iid);

    //  And opposite orientations.
    if (w->adj[ii].flipped == w->adj[jj].flipped) {
      fprintf(stderr, "ERROR: same orient duplicate overlaps for pair %u %u\n",
              w->adj[ii].a_iid, w->adj[ii].b_iid);
      continue;
    }

    assert(w->adj[ii].flipped != w->adj[jj].flipped);

    bool  AcheckSub = (doCheckSubRead(gkp, w->adj[ii].a_iid) == true);
    bool  BcheckSub = (doCheckSubRead(gkp, w->adj[ii].b_iid) == true);

    assert(AcheckSub == true);   //  Otherwise we wouldn't be in this function!

    //  Decide what type of duplicate we have.
    //
    //  Overlap on the A read  -=>  B read is potentially sub read containing  -=>  don't use overlaps
    //  Overlap on the B read  -=>  A read is potentially sub read containing  -=>  split this read

    uint32  Aoverlap = intervalOverlap(w->adj[ii].aovlbgn, w->adj[ii].aovlend, w->adj[jj].aovlbgn, w->adj[jj].aovlend);
    uint32  Boverlap = intervalOverlap(w->adj[ii].bovlbgn, w->adj[ii].bovlend, w->adj[jj].bovlbgn, w->adj[jj].bovlend);

    //  If there is no overlap anywhere, we're not sure what is going on.  This could be a genomic
    //  repeat.  Leave the overlaps alone.
    //
    if ((Aoverlap == 0) && (Boverlap == 0))
      continue;

    //  Remember if the overlapping overlap is large - we'll later check if the bad region falls
    //  within here, and if there are enough spanning reads to not trim.  We also use this as one more
    //  count of BAD.
    //
    if ((AcheckSub) && (Aoverlap > 1000) &&
        (BcheckSub) && (Boverlap > 1000)) {
      uint32  dist = (w->adj[ii].a_iid > w->adj[ii].b_iid) ?
                       (w->adj[ii].a_iid - w->adj[ii].b_iid) :
                       (w->adj[ii].b_iid - w->adj[ii].a_iid);

      if (subreadFile)
        fprintf(subreadFile, " II %8u (%6u-%6u) %8u (%6u-%6u) JJ %8u (%6u-%6u) %8u (%6u-%6u) %s\n",
                w->adj[ii].a_iid, w->adj[ii].aovlbgn, w->adj[ii].aovlend, w->adj[ii].b_iid, w->adj[ii].bovlbgn, w->adj[ii].bovlend,
                w->adj[jj].a_iid, w->adj[jj].aovlbgn, w->adj[jj].aovlend, w->adj[jj].b_iid, w->adj[jj].bovlbgn, w->adj[jj].bovlend,
                (dist > 5) ? " PALINDROME WARNING--FAR-IID--WARNING" : "PALINDROME");

      largePalindrome = true;
    }

#if 0
    //  Otherwise, if the overlaps overlap on both reads by significant chunks, don't believe
    //  either.  These are possibly both chimeric reads, at least PacBio junction reads.
    //
    //  Or an inverted repeat.
    //
    if ((AcheckSub) && (Aoverlap > 50) &&
        (BcheckSub) && (Boverlap > 50)) {
      if (subreadFile)
        fprintf(subreadFile, "BothOv %u (%u-%u) %u (%u-%u) %u (%u-%u) %u (%u-%u)\n",
                w->adj[ii].a_iid, w->adj[ii].aovlbgn, w->adj[ii].aovlend, w->adj[ii].b_iid, w->adj[ii].bovlbgn, w->adj[ii].bovlend,
                w->adj[jj].a_iid, w->adj[jj].aovlbgn, w->adj[jj].aovlend, w->adj[jj].b_iid, w->adj[jj].bovlbgn, w->adj[jj].bovlend);
    }
#endif

#if 0
    //  Stronger overlap in the A reads.  The B read looks like it has subreads, which is perfectly fine
    //  evidence for us.  Unless they span a junction.
    //
    if ((BcheckSub) && (Boverlap < Aoverlap)) {
      if (subreadFile)
        fprintf(subreadFile, "BcheckSub %u (%u-%u) %u (%u-%u) %u (%u-%u) %u (%u-%u)\n",
                w->adj[ii].a_iid, w->adj[ii].aovlbgn, w->adj[ii].aovlend, w->adj[ii].b_iid, w->adj[ii].bovlbgn, w->adj[ii].bovlend,
                w->adj[jj].a_iid, w->adj[jj].aovlbgn, w->adj[jj].aovlend, w->adj[jj].b_iid, w->adj[jj].bovlbgn, w->adj[jj].bovlend);
    }
#endif

    //  It looks like A has sub reads if the B read has a strong overlap in overlaps, and the A read does not
    //  have a strong overlap.

    if ((Aoverlap > 250) ||
        (Boverlap < 250))
      //  A strong overlap in the A read, there isn't a sub read junction we can identify, OR
      //  A weak overlap in the B read, and we expected the B read to align to both of the A subreads.
      continue;

    //  Decide on a region in the read that is suspected to contain the chimer junction.
    //
    //  In the true case: ii overlap is first on the read; bad region from the end of this overlap
    //  to the start of the jj overlap.
    //
    //  Note that sometimes overlaps extend through the junction.  This will just flip the region
    //  around.  We're expecting to find non-overlapping overlaps, but if we find overlapping ones,
    //  the bad interval is still between the end points.
    //
    //    -------------->         ------------>
    //         <---------    vs      <---------
    //
    uint32  badbgn = (w->adj[ii].aovlbgn < w->adj[jj].aovlbgn) ?
                       w->adj[ii].aovlend : w->adj[jj].aovlend;
    uint32  badend = (w->adj[ii].aovlbgn < w->adj[jj].aovlbgn) ?
                       w->adj[jj].aovlbgn : w->adj[ii].aovlbgn;

    if (badbgn > badend) {
      uint32 a = badbgn;
      badbgn = badend;
      badend = a;
    }

    assert(badbgn <= badend);

    if (subreadFile)
      fprintf(subreadFile, " II %8u (%6u-%6u) %8u (%6u-%6u) JJ %8u (%6u-%6u) %8u (%6u-%6u) BAD %6u-%6u size %6u %s\n",
              w->adj[ii].a_iid, w->adj[ii].aovlbgn, w->adj[ii].aovlend, w->adj[ii].b_iid, w->adj[ii].bovlbgn, w->adj[ii].bovlend,
              w->adj[jj].a_iid, w->adj[jj].aovlbgn, w->adj[jj].aovlend, w->adj[jj].b_iid, w->adj[jj].bovlbgn, w->adj[jj].bovlend,
              badbgn, badend, badend - badbgn,
              (badend - badbgn <= SUBREAD_LOOP_MAX_SIZE) ? "(EVIDENCE)" : "(too far)");

    //  A true subread signature will have a small bad interval (10 bases) and largely agree on the
    //  interval.  False signature will have a large size, and not agree.  We only check for size
    //  though.
    //
    if (badend - badbgn <= SUBREAD_LOOP_MAX_SIZE)
      BAD.add(badbgn, badend - badbgn);

    //  Save all plausible pairs.
    //
    if (badend - badbgn <= SUBREAD_LOOP_EXT_SIZE)
      BADall.add(badbgn, badend - badbgn);
  }

  //
  //  Merge all the 'bad' intervals.  Save the merged intervals for later use.
  //

  BAD.merge();
  BADall.merge();

  for (uint32 bb=0; bb<BAD.numberOfIntervals(); bb++) {
    uint32  allHits = 0;
    uint32  numSpan = 0;

    for (uint32 aa=0; aa<BADall.numberOfIntervals(); aa++)
      if ((BADall.lo(aa) <= BAD.hi(bb)) &&
          (BAD.lo(bb) <= BADall.hi(aa)))
        allHits += BADall.count(aa);

    for (uint32 ii=0; ii<w->adjLen; ii++)
      if ((w->adj[ii].aovlbgn + 100 < BAD.lo(bb)) &&
          (BAD.hi(bb) + 100 < w->adj[ii].aovlend))
        numSpan += (doCheckSubRead(gkp, w->adj[ii].a_iid)) ? 1 : 2;

    if (subreadFile)
      fprintf(subreadFile, "AcheckSub region %u (" F_S32 "-" F_S32 ") with %u hits %u bighits - span %u largePalindrome %s\n",
              w->adj[0].a_iid, BAD.lo(bb), BAD.hi(bb), BAD.count(bb), allHits, numSpan,
              largePalindrome ? "true" : "false");

    if (numSpan > 9)
      //  If there are 10 or more spanning read (equivalents) this is not a subread junction.  There
      //  is plenty of evidence it is true.
      continue;

    if (BAD.count(bb) + allHits / 4 + largePalindrome < 3)
      //  If 2 or fewer reads claim this is a sub read junction, skip it.  Evidence is weak.
      continue;

    if (subreadFile)
      fprintf(subreadFile, "CONFIRMED BAD REGION %d-%d\n",
              BAD.lo(bb), BAD.hi(bb));

    w->blist.push_back(badRegion(w->id, badType_subread, BAD.lo(bb), BAD.hi(bb)));
  }
}
canu-1.6/src/overlapBasedTrimming/splitReads-trimBad.C000066400000000000000000000113431314437614700230250ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz beginning on 2016-JAN-19
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *    Sergey Koren beginning on 2017-JUN-13
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#include "splitReads.H"

//  If after chimer trimming a read had a bad interval in the clear, just delete the read.
//  Evidence said it was both good and bad.
//
//  But first, if the bad interval just touches the clear, trim it out.

//  Process the list of bad intervals in blist to find a single clear range.
//  Store the result in w->clrBgn and w->clrEnd.

void
trimBadInterval(gkStore *gkp,
                workUnit *w,
                uint32 minReadLength,
                FILE *subreadFile,
                bool subreadFileVerbose) {

  if (w->blist.size() == 0)
    return;

  intervalList<uint32>  goodRegions;

  //  Build an interval list of all the bad regions, and invert them into good regions.
for (uint32 bb=0; bb<w->blist.size(); bb++) goodRegions.add(w->blist[bb].bgn, w->blist[bb].end - w->blist[bb].bgn); goodRegions.invert(w->clrBgn, w->clrEnd); // Find the largest good region, save it in the output clear range. If there are no // regions (the whole read was marked bad?), default to a bogus clear range. // Was previously set to UINT32_MAX. However, for a read with no good region, when UINT32_MAX is returned // to the calling function, asserts on lines 370-371 fail because UINT32_MAX is not in the initial clear range; // set to 0 instead. w->clrBgn = 0; w->clrEnd = 0; for (uint32 rr=0; rr<goodRegions.numberOfIntervals(); rr++) { if ((w->clrEnd - w->clrBgn) < (goodRegions.hi(rr) - goodRegions.lo(rr))) { w->clrBgn = goodRegions.lo(rr); w->clrEnd = goodRegions.hi(rr); } } // If the largest isn't big enough, remember that, and annotate the log appropriately if (w->clrEnd - w->clrBgn < minReadLength) w->isOK = false; // For logging, find the two bordering bad regions if (subreadFile) { vector<uint32> loBad; vector<uint32> hiBad; uint32 spur5 = UINT32_MAX; uint32 spur3 = UINT32_MAX; for (uint32 rr=0; rr<w->blist.size(); rr++) { if (w->blist[rr].type == badType_5spur) { assert(spur5 == UINT32_MAX); // Should be at most one 5' spur region spur5 = rr; } if (w->blist[rr].type == badType_3spur) { assert(spur3 == UINT32_MAX); // Should be at most one 3' spur region spur3 = rr; } if (w->blist[rr].end == w->clrBgn) loBad.push_back(rr); if (w->clrEnd == w->blist[rr].bgn) hiBad.push_back(rr); } // Can't really say much about the number of regions we find. There could be none (if the // biggest good region extends to the end of the read), exactly one (what we expect), or more // than one (if two algorithms agree on a bad region).
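trimBadInterval() collects the bad intervals, inverts them into good intervals, and keeps the largest good interval as the new clear range. A minimal stand-alone sketch of that merge-invert-select pattern, using std::pair as a simplified stand-in for canu's intervalList (largestGoodRegion is an illustrative name, not canu's API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Invert a set of 'bad' [bgn,end) intervals inside the clear range
// [clrBgn,clrEnd) into the 'good' intervals between them, and return the
// largest good interval.  Returns {0,0} when the bad regions cover the
// whole clear range (the same bogus default the real code falls back to).
static std::pair<uint32_t, uint32_t>
largestGoodRegion(std::vector<std::pair<uint32_t, uint32_t>> bad,
                  uint32_t clrBgn, uint32_t clrEnd) {
  std::sort(bad.begin(), bad.end());

  std::pair<uint32_t, uint32_t> best(0, 0);
  uint32_t pos = clrBgn;

  auto consider = [&](uint32_t lo, uint32_t hi) {
    if (hi - lo > best.second - best.first)
      best = {lo, hi};
  };

  for (auto const &b : bad) {            // gap before each bad interval
    if (b.first > pos)
      consider(pos, b.first);
    pos = std::max(pos, b.second);       // merges overlapping bad intervals
  }
  if (pos < clrEnd)                      // gap after the last bad interval
    consider(pos, clrEnd);

  return best;
}
```

With bad regions [100,200) and [150,300) in a clear range [0,1000), the merged bad span is [100,300) and the largest good region is [300,1000).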
char *logPtr = w->logMsg; sprintf(logPtr, "iid %6u trim %7u %7u", w->id, w->clrBgn, w->clrEnd); while (*logPtr) logPtr++; if (w->isOK == false) sprintf(logPtr, " TOO_SHORT"); while (*logPtr) logPtr++; if (spur5 != UINT32_MAX) sprintf(logPtr, " (5'spur %7u %7u)", w->blist[spur5].bgn, w->blist[spur5].end); while (*logPtr) logPtr++; if (spur3 != UINT32_MAX) sprintf(logPtr, " (3'spur %7u %7u)", w->blist[spur3].bgn, w->blist[spur3].end); while (*logPtr) logPtr++; for (uint32 xx=0; xx<loBad.size(); xx++) { uint32 x = loBad[xx]; sprintf(logPtr, " (%s %7u %7u)", w->blist[x].typeName(), w->blist[x].bgn, w->blist[x].end); while (*logPtr) logPtr++; } for (uint32 xx=0; xx<hiBad.size(); xx++) { uint32 x = hiBad[xx]; sprintf(logPtr, " (%s %7u %7u)", w->blist[x].typeName(), w->blist[x].bgn, w->blist[x].end); while (*logPtr) logPtr++; } logPtr[0] = '\n'; logPtr[1] = 0; } } canu-1.6/src/overlapBasedTrimming/splitReads-workUnit.C000066400000000000000000000116261314437614700232710ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-APR-15 to 2015-JUN-16 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2017-AUG-08 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "splitReads.H" // Adjust the overlap for trimming already done. Filter out useless overlaps. Save the // overlap into a useful format.
void workUnit::addAndFilterOverlaps(gkStore *gkp, clearRangeFile *finClr, double errorRate, ovOverlap *ovl, uint32 ovlLen) { if (adjMax < ovlLen) { delete [] adj; adjMax = ovlLen; adj = new adjOverlap [adjMax]; } adjLen = 0; for (uint32 oo=0; oo<ovlLen; oo++) { ovOverlap *o = ovl + oo; adjOverlap *a = adj + adjLen; uint32 idA = o->a_iid; uint32 idB = o->b_iid; if (finClr->isDeleted(idA) || finClr->isDeleted(idB)) continue; // Returns the finClr clear range of the two reads (in *clr*), and the overlap adjusted to // that clear range (in *ovl*). The B read coordinates are reverse-complemented if the // overlap is reversed (so bovlbgn < bovlend always). a->a_iid = o->a_iid; a->b_iid = o->b_iid; a->flipped = o->flipped(); bool valid = false; if (o->flipped() == false) valid = adjustNormal(finClr, gkp, o, a->aovlbgn, a->aovlend, a->bovlbgn, a->bovlend, a->aclrbgn, a->aclrend, a->bclrbgn, a->bclrend); else valid = adjustFlipped(finClr, gkp, o, a->aovlbgn, a->aovlend, a->bovlbgn, a->bovlend, a->aclrbgn, a->aclrend, a->bclrbgn, a->bclrend); if (valid == false) // adjust() says the overlap doesn't intersect the clear range, so nothing here. continue; assert(a->aclrbgn <= a->aovlbgn); // Also checked in adjust()... assert(a->bclrbgn <= a->bovlbgn); assert(a->aovlend <= a->aclrend); assert(a->bovlend <= a->bclrend); // Reset evidence hangs that are close to zero to be zero. This shifts the overlap end point // to remove as much of the useless hang as possible.
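The begin-side hang adjustment described above (slide both overlap begin points left to absorb a small unaligned B-read hang, limited by the A read's slack; the end-side case is symmetric) can be sketched in isolation. This is an illustrative helper: the Adj struct and absorbSmallBeginHang are invented for the sketch, with field names mirroring adjOverlap and the 15bp threshold taken from the code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct Adj {                      // only the fields the sketch needs
  uint32_t aclrbgn, aovlbgn;      // A read: clear begin, overlap begin
  uint32_t bclrbgn, bovlbgn;      // B read: clear begin, overlap begin
};

// If the B-side begin hang is small (fewer than 15bp of unaligned clear
// sequence before the overlap), slide both overlap begin points left to
// absorb it, limited by how much slack the A side has.
static void absorbSmallBeginHang(Adj &a) {
  if (a.bovlbgn - a.bclrbgn >= 15)
    return;                                        // hang too big; leave it
  uint32_t limit  = a.aovlbgn - a.aclrbgn;         // slack on the A read
  uint32_t adjust = std::min(a.bovlbgn - a.bclrbgn, limit);
  a.aovlbgn -= adjust;
  a.bovlbgn -= adjust;
}
```

For example, with a 5bp B-side hang and 20bp of A-side slack, both begin points move left by 5; with a 10bp hang but only 3bp of slack, they move by 3.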
if (a->bovlbgn - a->bclrbgn < 15) { uint32 limit = a->aovlbgn - a->aclrbgn; uint32 adjust = a->bovlbgn - a->bclrbgn; if (adjust > limit) adjust = limit; if ((a->aovlbgn < adjust) || (a->bovlbgn < adjust)) fprintf(stderr, "ovl %u %u-%u %u %u-%u -> clr %u-%u %u-%u adj %u-%u %u-%u\n", o->a_iid, o->a_bgn(), o->a_end(), o->b_iid, o->b_bgn(), o->b_end(), a->aclrbgn, a->aclrend, a->bclrbgn, a->bclrend, a->aovlbgn, a->aovlend, a->bovlbgn, a->bovlend); assert(a->aovlbgn >= adjust); assert(a->bovlbgn >= adjust); a->aovlbgn -= adjust; a->bovlbgn -= adjust; assert(a->aclrbgn <= a->aovlbgn); assert(a->bclrbgn <= a->bovlbgn); } if (a->bclrend - a->bovlend < 15) { uint32 limit = a->aclrend - a->aovlend; uint32 adjust = a->bclrend - a->bovlend; if (adjust > limit) adjust = limit; a->aovlend += adjust; a->bovlend += adjust; assert(a->aovlend <= a->aclrend); assert(a->bovlend <= a->bclrend); } // Filter out garbage overlaps // // The first version used hard and fast cutoffs of (>=35bp and <= 0.02 error) or (>= 70bp). // These were not fair to short reads. // // The second version used complicated rules (see cvs), but based the decision off of the // length of the A-read. Fair, but only for assemblies of similarly sized reads. Totally // unfair for Sanger-vs-Illumina overlaps. // // The third version sets a minimum length and identity based on the shorter fragment. These // mirror closely what the second version was doing. It was extended to allow even shorter // lengths if either read is aligned to an end. // // The fourth version, for long noisy reads, accepts all overlaps. if (0) continue; // Save the overlap. adjLen++; } } canu-1.6/src/overlapBasedTrimming/splitReads.C000066400000000000000000000440261314437614700214510ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-APR-15 to 2015-JUN-25 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "splitReads.H" #include "trimStat.H" #include "clearRangeFile.H" #include "AS_UTL_decodeRange.H" int main(int argc, char **argv) { char *gkpName = NULL; char *ovsName = NULL; char *finClrName = NULL; char *outClrName = NULL; double errorRate = 0.06; //uint32 minAlignLength = 40; uint32 minReadLength = 64; uint32 idMin = 1; uint32 idMax = UINT32_MAX; char *outputPrefix = NULL; char outputName[FILENAME_MAX]; FILE *staFile = NULL; FILE *reportFile = NULL; FILE *subreadFile = NULL; bool doSubreadLogging = false; bool doSubreadLoggingVerbose = false; // Statistics on the trimming - the second set are from the old logging, and don't really apply anymore. 
trimStat readsIn; // Read is eligible for trimming trimStat deletedIn; // Read was deleted already trimStat noTrimIn; // Read not requesting trimming trimStat noOverlaps; // no overlaps in store trimStat noCoverage; // no coverage after adjusting for trimming done trimStat readsProcChimera; // Read was processed for chimera signal trimStat readsProcSpur; // Read was processed for spur signal trimStat readsProcSubRead; // Read was processed for subread signal #if 0 trimStat badSpur5; trimStat badSpur3; trimStat badChimera; trimStat badSubread; #endif trimStat readsNoChange; trimStat readsBadSpur5, basesBadSpur5; trimStat readsBadSpur3, basesBadSpur3; trimStat readsBadChimera, basesBadChimera; trimStat readsBadSubread, basesBadSubread; trimStat readsTrimmed5; trimStat readsTrimmed3; #if 0 trimStat fullCoverage; // fully covered by overlaps trimStat noSignalNoGap; // no signal, no gaps trimStat noSignalButGap; // no signal, with gaps trimStat bothFixed; // both chimera and spur signal trimmed trimStat chimeraFixed; // only chimera signal trimmed trimStat spurFixed; // only spur signal trimmed trimStat bothDeletedSmall; // deleted because of both cimera and spur signals trimStat chimeraDeletedSmall; // deleted because of chimera signal trimStat spurDeletedSmall; // deleted because of spur signal trimStat spurDetectedNormal; // normal spur detected trimStat spurDetectedLinker; // linker spur detected trimStat chimeraDetectedInnie; // innpue-pair chimera detected trimStat chimeraDetectedOverhang; // overhanging chimera detected trimStat chimeraDetectedGap; // gap chimera detected trimStat chimeraDetectedLinker; // linker chimera detected #endif trimStat deletedOut; // Read was deleted by trimming argc = AS_configure(argc, argv); int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-O") == 0) { ovsName = argv[++arg]; } else if (strcmp(argv[arg], "-o") == 0) { outputPrefix = argv[++arg]; } else 
if (strcmp(argv[arg], "-t") == 0) { AS_UTL_decodeRange(argv[++arg], idMin, idMax); } else if (strcmp(argv[arg], "-Ci") == 0) { finClrName = argv[++arg]; } else if (strcmp(argv[arg], "-Co") == 0) { outClrName = argv[++arg]; } else if (strcmp(argv[arg], "-e") == 0) { errorRate = atof(argv[++arg]); //} else if (strcmp(argv[arg], "-l") == 0) { // minAlignLength = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-minlength") == 0) { minReadLength = atoi(argv[++arg]); } else { fprintf(stderr, "%s: unknown option '%s'\n", argv[0], argv[arg]); err++; } arg++; } if (errorRate < 0.0) err++; if ((gkpName == 0L) || (ovsName == 0L) || (outputPrefix == NULL) || (err)) { fprintf(stderr, "usage: %s -G gkpStore -O ovlStore -Ci input.clearFile -Co output.clearFile -o outputPrefix]\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, " -G gkpStore path to read store\n"); fprintf(stderr, " -O ovlStore path to overlap store\n"); fprintf(stderr, "\n"); fprintf(stderr, " -o name output prefix, for logging\n"); fprintf(stderr, "\n"); fprintf(stderr, " -t bgn-end limit processing to only reads from bgn to end (inclusive)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -Ci clearFile path to input clear ranges (NOT SUPPORTED)\n"); fprintf(stderr, " -Co clearFile path to ouput clear ranges\n"); fprintf(stderr, "\n"); fprintf(stderr, " -e erate ignore overlaps with more than 'erate' percent error\n"); //fprintf(stderr, " -l length ignore overlaps shorter than 'l' aligned bases (NOT SUPPORTED)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -minlength l reads trimmed below this many bases are deleted\n"); fprintf(stderr, "\n"); if (errorRate < 0.0) fprintf(stderr, "ERROR: Error rate (-e) value %f too small; must be 'fraction error' and above 0.0\n", errorRate); exit(1); } gkStore *gkp = gkStore::gkStore_open(gkpName); ovStore *ovs = new ovStore(ovsName, gkp); clearRangeFile *finClr = new clearRangeFile(finClrName, gkp); clearRangeFile *outClr = new clearRangeFile(outClrName, gkp); if (outClr) // If 
the outClr file exists, those clear ranges are loaded. We need to reset them // back to 'untrimmed' for now. outClr->reset(gkp); if (finClr && outClr) // A finClr file was supplied, so use those as the clear ranges. outClr->copy(finClr); snprintf(outputName, FILENAME_MAX, "%s.log", outputPrefix); errno = 0; reportFile = fopen(outputName, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", outputName, strerror(errno)), exit(1); if (doSubreadLogging) { snprintf(outputName, FILENAME_MAX, "%s.subread.log", outputPrefix); errno = 0; subreadFile = fopen(outputName, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", outputName, strerror(errno)), exit(1); } uint32 ovlLen = 0; uint32 ovlMax = 64 * 1024; ovOverlap *ovl = ovOverlap::allocateOverlaps(gkp, ovlMax); memset(ovl, 0, sizeof(ovOverlap) * ovlMax); workUnit *w = new workUnit; if (idMin < 1) idMin = 1; if (idMax > gkp->gkStore_getNumReads()) idMax = gkp->gkStore_getNumReads(); fprintf(stderr, "Processing from ID " F_U32 " to " F_U32 " out of " F_U32 " reads, using errorRate = %.2f\n", idMin, idMax, gkp->gkStore_getNumReads(), errorRate); for (uint32 id=idMin; id<=idMax; id++) { gkRead *read = gkp->gkStore_getRead(id); gkLibrary *libr = gkp->gkStore_getLibrary(read->gkRead_libraryID()); if (finClr->isDeleted(id)) { // Read already trashed. deletedIn += read->gkRead_sequenceLength(); continue; } if ((libr->gkLibrary_removeSpurReads() == false) && (libr->gkLibrary_removeChimericReads() == false) && (libr->gkLibrary_checkForSubReads() == false)) { // Nothing to do. noTrimIn += read->gkRead_sequenceLength(); continue; } readsIn += read->gkRead_sequenceLength(); uint32 nLoaded = ovs->readOverlaps(id, ovl, ovlLen, ovlMax); //fprintf(stderr, "read %7u with %7u overlaps\r", id, nLoaded); if (nLoaded == 0) { // No overlaps, nothing to check! 
noOverlaps += read->gkRead_sequenceLength(); continue; } w->clear(id, finClr->bgn(id), finClr->end(id)); w->addAndFilterOverlaps(gkp, finClr, errorRate, ovl, ovlLen); if (w->adjLen == 0) { // All overlaps trimmed out! noCoverage += read->gkRead_sequenceLength(); continue; } // Find bad regions. //if (libr->gkLibrary_markBad() == true) // // From an external file, a list of known bad regions. If no overlaps span // // the region with sufficient coverage, mark the region as bad. This was // // motivated by the old 454 linker detection. // markBad(gkp, w, subreadFile, doSubreadLoggingVerbose); //if (libr->gkLibrary_removeSpurReads() == true) { // readsProcSpur += read->gkRead_sequenceLength(); // detectSpur(gkp, w, subreadFile, doSubreadLoggingVerbose); // Get stats on spur region detected - save the length of each region to the trimStats object. //} //if (libr->gkLibrary_removeChimericReads() == true) { // readsProcChimera += read->gkRead_sequenceLength(); // detectChimer(gkp, w, subreadFile, doSubreadLoggingVerbose); // Get stats on chimera region detected - save the length of each region to the trimStats object. //} if (libr->gkLibrary_checkForSubReads() == true) { readsProcSubRead += read->gkRead_sequenceLength(); detectSubReads(gkp, w, subreadFile, doSubreadLoggingVerbose); } // Get stats on the bad regions found. This kind of duplicates code in trimBadInterval(), but // I don't want to pass all the stats objects into there. 
if (w->blist.size() == 0) { readsNoChange += read->gkRead_sequenceLength(); } else { uint32 nSpur5 = 0, bSpur5 = 0; uint32 nSpur3 = 0, bSpur3 = 0; uint32 nChimera = 0, bChimera = 0; uint32 nSubread = 0, bSubread = 0; for (uint32 bb=0; bb<w->blist.size(); bb++) { switch (w->blist[bb].type) { case badType_5spur: nSpur5 += 1; basesBadSpur5 += w->blist[bb].end - w->blist[bb].bgn; break; case badType_3spur: nSpur3 += 1; basesBadSpur3 += w->blist[bb].end - w->blist[bb].bgn; break; case badType_chimera: nChimera += 1; basesBadChimera += w->blist[bb].end - w->blist[bb].bgn; break; case badType_subread: nSubread += 1; basesBadSubread += w->blist[bb].end - w->blist[bb].bgn; break; default: break; } } if (nSpur5 > 0) readsBadSpur5 += nSpur5; if (nSpur3 > 0) readsBadSpur3 += nSpur3; if (nChimera > 0) readsBadChimera += nChimera; if (nSubread > 0) readsBadSubread += nSubread; } // Find solution. This coalesces the list (in 'w') of all the bad regions found, picks out the // largest good region, generates a log of the bad regions that support this decision, and sets // the trim points. trimBadInterval(gkp, w, minReadLength, subreadFile, doSubreadLoggingVerbose); // Log the solution. AS_UTL_safeWrite(reportFile, w->logMsg, "logMsg", sizeof(char), strlen(w->logMsg)); // Save the solution.... outClr->setbgn(w->id) = w->clrBgn; outClr->setend(w->id) = w->clrEnd; // And maybe delete the read. if (w->isOK == false) { deletedOut += read->gkRead_sequenceLength(); outClr->setDeleted(w->id); } // Update stats on what was trimmed. The asserts say the clear range didn't expand, and the if // tests if the clear range changed.
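The invariant enforced by the asserts that follow this comment (trimming may only shrink the clear range, never expand it) can be captured in a small checker that also reports the per-end tallies fed into readsTrimmed5/readsTrimmed3. TrimDelta and trimAccounting are invented names for illustration; the real code uses bare assert() calls.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

struct TrimDelta { uint32_t bases5, bases3; };

// The trimmed clear range [clrBgn,clrEnd) must lie inside the initial clear
// range [iniBgn,iniEnd): trimming can only shrink a read.  Returns the bases
// removed from the 5' and 3' ends; throws if the invariant is violated.
static TrimDelta trimAccounting(uint32_t iniBgn, uint32_t iniEnd,
                                uint32_t clrBgn, uint32_t clrEnd) {
  if (clrBgn < iniBgn || clrEnd > iniEnd || clrBgn > clrEnd)
    throw std::logic_error("clear range expanded or inverted");
  return { clrBgn - iniBgn, iniEnd - clrEnd };
}
```

A read with initial clear range [100,900) trimmed to [150,800) loses 50 bases from the 5' end and 100 from the 3' end.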
assert(w->clrBgn >= w->iniBgn); assert(w->iniEnd >= w->clrEnd); if (w->clrBgn > w->iniBgn) readsTrimmed5 += w->clrBgn - w->iniBgn; if (w->iniEnd > w->clrEnd) readsTrimmed3 += w->iniEnd - w->clrEnd; } delete [] ovl; delete w; gkp->gkStore_close(); delete finClr; delete outClr; // Close log files if (reportFile) fclose(reportFile); if (subreadFile) fclose(subreadFile); // Write the summary if (outputPrefix) { snprintf(outputName, FILENAME_MAX, "%s.stats", outputPrefix); errno = 0; staFile = fopen(outputName, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", outputName, strerror(errno)); } if (staFile == NULL) staFile = stdout; // Would like to know number of subreads per read fprintf(staFile, "PARAMETERS:\n"); fprintf(staFile, "----------\n"); fprintf(staFile, "%7u (reads trimmed below this many bases are deleted)\n", minReadLength); fprintf(staFile, "%7.4f (use overlaps at or below this fraction error)\n", errorRate); //fprintf(staFile, "%7u (use only overlaps longer than this)\n", minAlignLength); // NOT SUPPORTED! 
fprintf(staFile, "INPUT READS:\n"); fprintf(staFile, "-----------\n"); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (reads processed)\n", readsIn.nReads, readsIn.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (reads not processed, previously deleted)\n", deletedIn.nReads, deletedIn.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (reads not processed, in a library where trimming isn't allowed)\n", noTrimIn.nReads, noTrimIn.nBases); fprintf(staFile, "\n"); fprintf(staFile, "PROCESSED:\n"); fprintf(staFile, "--------\n"); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (no overlaps)\n", noOverlaps.nReads, noOverlaps.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (no coverage after adjusting for trimming done already)\n", noCoverage.nReads, noCoverage.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (processed for chimera)\n", readsProcChimera.nReads, readsProcChimera.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (processed for spur)\n", readsProcSpur.nReads, readsProcSpur.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (processed for subreads)\n", readsProcSubRead.nReads, readsProcSubRead.nBases); fprintf(staFile, "\n"); fprintf(staFile, "READS WITH SIGNALS:\n"); fprintf(staFile, "------------------\n"); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " signals (number of 5' spur signal)\n", readsBadSpur5.nReads, readsBadSpur5.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " signals (number of 3' spur signal)\n", readsBadSpur3.nReads, readsBadSpur3.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " signals (number of chimera signal)\n", readsBadChimera.nReads, readsBadChimera.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " signals (number of subread signal)\n", readsBadSubread.nReads, readsBadSubread.nBases); fprintf(staFile, "\n"); fprintf(staFile, "SIGNALS:\n"); fprintf(staFile, "-------\n"); 
fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (size of 5' spur signal)\n", basesBadSpur5.nReads, basesBadSpur5.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (size of 3' spur signal)\n", basesBadSpur3.nReads, basesBadSpur3.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (size of chimera signal)\n", basesBadChimera.nReads, basesBadChimera.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (size of subread signal)\n", basesBadSubread.nReads, basesBadSubread.nBases); fprintf(staFile, "\n"); fprintf(staFile, "TRIMMING:\n"); fprintf(staFile, "--------\n"); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (trimmed from the 5' end of the read)\n", readsTrimmed5.nReads, readsTrimmed5.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (trimmed from the 3' end of the read)\n", readsTrimmed3.nReads, readsTrimmed3.nBases); #if 0 fprintf(staFile, "DELETED:\n"); fprintf(staFile, "-------\n"); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (deleted because of both cimera and spur signals)\n", bothDeletedSmall.nReads, bothDeletedSmall.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (deleted because of chimera signal)\n", chimeraDeletedSmall.nReads, chimeraDeletedSmall.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (deleted because of spur signal)\n", spurDeletedSmall.nReads, spurDeletedSmall.nBases); fprintf(staFile, "\n"); fprintf(staFile, "SPUR TYPES:\n"); fprintf(staFile, "----------\n"); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (normal spur detected)\n", spurDetectedNormal.nReads, spurDetectedNormal.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (linker spur detected)\n", spurDetectedLinker.nReads, spurDetectedLinker.nBases); fprintf(staFile, "\n"); fprintf(staFile, "CHIMERA TYPES:\n"); fprintf(staFile, "-------------\n"); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (innie-pair chimera detected)\n", 
chimeraDetectedInnie.nReads, chimeraDetectedInnie.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (overhanging chimera detected)\n", chimeraDetectedOverhang.nReads, chimeraDetectedOverhang.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (gap chimera detected)\n", chimeraDetectedGap.nReads, chimeraDetectedGap.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (linker chimera detected)\n", chimeraDetectedLinker.nReads, chimeraDetectedLinker.nBases); #endif // INPUT READS = ACCEPTED + TRIMMED + DELETED // SPUR TYPE = TRIMMED and DELETED spur and both categories // CHIMERA TYPE = TRIMMED and DELETED chimera and both categories if (staFile != stdout) fclose(staFile); exit(0); } canu-1.6/src/overlapBasedTrimming/splitReads.H000066400000000000000000000132201314437614700214460ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-APR-20 to 2015-JUN-16 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #ifndef SPLIT_READS_H #define SPLIT_READS_H #include "AS_global.H" #include "gkStore.H" #include "ovStore.H" #include "adjustOverlaps.H" #include "clearRangeFile.H" #include "intervalList.H" #include "AS_UTL_decodeRange.H" #include <vector> #include <algorithm> using namespace std; // But then use only overlaps larger than this for some of the more questionable overlaps. #define MIN_INTERVAL_OVERLAP 60 // Same orient overlaps closer than this are evidence of PacBio subreads. #define SUBREAD_LOOP_MAX_SIZE 500 #define SUBREAD_LOOP_EXT_SIZE 2000 // Original //#define SUBREAD_LOOP_MAX_SIZE 100 //#define SUBREAD_LOOP_EXT_SIZE 0 // WITH_REPORT_FULL will report ALL overlap evidence. // REPORT_OVERLAPS will print the incoming overlaps in the log. // #undef WITH_REPORT_FULL #undef REPORT_OVERLAPS #undef DEBUG_ISLINKER #undef DEBUG_INTERVAL #undef DEBUG_FILTER // An overlap, adjusted for trimming done already. // class adjOverlap { public: uint32 a_iid; uint32 b_iid; uint32 flipped; // The b read is 3' to 5' uint32 aclrbgn, aclrend; // Clear range of the two reads uint32 bclrbgn, bclrend; // B read coords are adjusted for any flip uint32 aovlbgn, aovlend; // Overlap extent, adjusted for the clear range uint32 bovlbgn, bovlend; uint32 aOVLbgn, aOVLend; // Original unadjusted overlap uint32 bOVLbgn, bOVLend; }; const uint32 badType_nothing = 0; const uint32 badType_5spur = 1; const uint32 badType_3spur = 2; const uint32 badType_chimera = 3; const uint32 badType_subread = 4; // A region that is bad.
// class badRegion { public: badRegion() { id = 0; type = 0; bgn = 0; end = 0; }; badRegion(uint32 id_, uint32 type_, uint32 bgn_, uint32 end_) { id = id_; type = type_; bgn = bgn_; end = end_; }; const char *typeName(void) const { const char *N[5] = { "nothing", "5'spur", "3'spur", "chimera", "subread" }; return(N[type]); } public: uint32 id; // Read ID for this bad region uint32 type; // Type of bad region - 5'spur, 3'spur, chimera, subread uint32 bgn; // uint32 end; // }; class workUnit { public: workUnit(uint32 id_=0, uint32 iniBgn_=0, uint32 iniEnd_=0) { clear(id_, iniBgn_, iniEnd_); logMsg[0] = 0; blist.clear(); adjLen = 0; adjMax = 0; adj = NULL; }; ~workUnit() { delete [] adj; }; void clear(uint32 id_, uint32 iniBgn_, uint32 iniEnd_) { id = id_; iniBgn = iniBgn_; iniEnd = iniEnd_; isOK = true; isBad = false; isSpur5 = false; isSpur3 = false; isChimera = false; isSubread = false; clrBgn = iniBgn_; clrEnd = iniEnd_; logMsg[0] = 0; blist.clear(); adjLen = 0; //adjMax = 0; // Do NOT reset the allocated space. 
//adj = NULL; }; void addAndFilterOverlaps(gkStore *gkp, clearRangeFile *finClr, double errorRate, ovOverlap *ovl, uint32 ovlLen); public: uint32 id; // Read ID uint32 iniBgn; // The input clear range uint32 iniEnd; // Results bool isOK; // Read is acceptable, might have been fixed as below bool isBad; // Read couldn't be fixed bool isSpur5; // Spur trimmed on 5' end bool isSpur3; // Spur trimmed on 3' end bool isChimera; // Read split because of suspected chimeric region bool isSubread; // Read split because of suspected subreads uint32 clrBgn; // The final clear range uint32 clrEnd; char logMsg[1024]; // Work space vector<badRegion> blist; // Overlaps uint32 adjLen; uint32 adjMax; adjOverlap *adj; }; void detectSpur(gkStore *gkp, workUnit *w, FILE *subreadFile, bool subreadFileVerbose); void detectChimer(gkStore *gkp, workUnit *w, FILE *subreadFile, bool subreadFileVerbose); void detectSubReads(gkStore *gkp, workUnit *w, FILE *subreadFile, bool subreadFileVerbose); void trimBadInterval(gkStore *gkp, workUnit *w, uint32 minReadLength, FILE *subreadFile, bool subreadFileVerbose); #endif // SPLIT_READS_H canu-1.6/src/overlapBasedTrimming/splitReads.mk000066400000000000000000000012051314437614700216660ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := splitReads SOURCES := splitReads.C \ splitReads-workUnit.C \ splitReads-subReads.C \ splitReads-trimBad.C \ adjustNormal.C \ adjustFlipped.C SRC_INCDIRS := ..
../AS_UTL ../stores TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/overlapBasedTrimming/test-random-chimeric-fragments.pl000066400000000000000000000074111314437614700255670ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. Walenz beginning on 2015-OCT-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. # use strict; # ########################################################################### # # This file is part of Celera Assembler, a software program that # assembles whole-genome shotgun reads into contigs and scaffolds. # Copyright (C) 2009, J. Craig Venter Institute. # # This program is free software; you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation; either version 2 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details.
# # You should have received (LICENSE.txt) a copy of the GNU General Public # License along with this program; if not, write to the Free Software # Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA # ########################################################################### # # Figures out what pieces ended up after trimming. Use with generate-random-chimeric-fragments.pl # # User unfriendly. my $uid; my $iid; open(F, "/work/wgs/FreeBSD-amd64/bin/gatekeeper -dumpfragments /work/test/test.gkpStore |") or die; while (<F>) { if (m/^fragmentIdent\s+=\s+(.*),(\d+)$/) { $uid = $1; $iid = $2; } if (m/^fragmentIsDeleted\s+=\s+1$/) { undef $uid; undef $iid; } if (m/^fragmentClear\s+=\s+OBTCHIMERA,(\d+),(\d+)$/) { my $bgn = $1; my $end = $2; my $b = 0; my $e = 0; if ($uid =~ m/(\d+)-/) { my @v = split '-', $uid; shift @v; # iid of read my $numMatches = 0; my $strMatches = ""; my $badMatches = 0; while (scalar(@v) > 0) { my $typ = shift @v; my $len = shift @v; $e = $b + $len; if (($bgn < $e - 5) && ($b + 5 < $end)) { $numMatches++; $strMatches .= "$b-$e-$typ "; if (($typ ne "seq") && ($typ ne "rev")) { $badMatches++; } } $b = $e; } if (($numMatches > 1) || ($badMatches > 0)) { my $ovl = `/work/wgs/FreeBSD-amd64/bin/overlapStore -g /work/test/test.gkpStore -dp $iid /work/test/0-overlaptrim/test.obtStore`; if (length($ovl) > 0) { print "$uid\n"; print "$bgn-$end\n"; print "$strMatches\n"; print $ovl; print "========================================================================================================================\n"; } } } } } close(F); canu-1.6/src/overlapBasedTrimming/trimReads-bestEdge.C000066400000000000000000000256021314437614700230100ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-MAY-28 to 2015-JUN-16 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "trimReads.H" #include <vector> #include <algorithm> #include <functional> using namespace std; // Generates plots for each trim. #undef GNUPLOT // Logging to the screen. #undef VERBOSE bool bestEdge(ovOverlap *ovl, uint32 ovlLen, gkRead *read, uint32 ibgn, uint32 iend, uint32 &fbgn, uint32 &fend, char *logMsg, uint32 errorValue, uint32 minOverlap, uint32 minCoverage, uint32 minReadLength) { fbgn = ibgn; fend = iend; logMsg[0] = 0; assert(read->gkRead_readID() == ovl[0].a_iid); assert(ovlLen > 0); // // Trim once, using the largest covered rule. This gets rid of any gaps in overlap coverage, // etc. // uint32 lbgn = 0; uint32 lend = 0; if (largestCovered(ovl, ovlLen, read, ibgn, iend, lbgn, lend, logMsg, errorValue, minOverlap, minCoverage, minReadLength) == false) return(false); // // Trim again, to maximize overlap length. // int32 iid = read->gkRead_readID(); uint32 len = read->gkRead_sequenceLength(); vector<uint32> trim5; vector<uint32> trim3; uint32 nContained = 0; #ifdef GNUPLOT vector<uint32> trim5iid, trim5sco; vector<uint32> trim3iid, trim3sco; #endif // For each overlap, add potential trim points where the overlap ends. for (uint32 i=0; i < ovlLen; i++) { uint32 tbgn = ibgn + ovl[i].a_bgn(); uint32 tend = ibgn + ovl[i].a_end(); if (ovl[i].evalue() > errorValue) // Overlap is crappy.
continue; // Add trim points as long as they aren't outside the largest covered region. if (lbgn <= tbgn) trim5.push_back(tbgn); if (tend <= lend) trim3.push_back(tend); // If overlap indicates this read is contained, we're done. There isn't any // trimming needed. We can set the final trim and return, or let the rest // of the algorithm/logging finish. if ((tbgn == ibgn) && (tend == iend)) nContained++; } // If the read is contained in more than one other read, we're done. No trimming needed. if (nContained >= 2) { //fbgn = ibgn; //fend = iend; //return(true); trim5.clear(); trim3.clear(); } // Add trim points for the largest covered, i.e., no trimming past what largest covered did. trim5.push_back(lbgn); trim3.push_back(lend); // Duplicate removal, and possibly the processing algorithm, need the trim points // sorted from outside-in. sort(trim5.begin(), trim5.end(), std::less()); sort(trim3.begin(), trim3.end(), std::greater()); // Remove duplicate points (easier here than in the loops below) { uint32 old = 0; uint32 sav = 0; for (old=0, sav=0; old= len - triml) // Not possible to get a higher score by trimming more. break; //fprintf(stderr, "trim5 pt %u out of %u\n", pt, trim5.size()); for (uint32 i=0; i < ovlLen; i++) { uint32 tbgn = ibgn + ovl[i].a_bgn(); uint32 tend = ibgn + ovl[i].a_end(); if ((triml < tbgn) || (tend <= triml)) // Alignment starts after of the trim point; not a valid overlap. // or, alignment ends before the trim point (trimmed out). continue; // Limit the overlap to the largest covered. if (tend > lend) tend = lend; assert(tend >= triml); uint32 tlen = tend - triml; assert(tlen <= len); // Save the best score. Break ties by favoring the current best. if (((score < tlen)) || ((score == tlen) && (ovl[i].b_iid == best5iid))) { score = tlen; sciid = ovl[i].b_iid; scend = tend; #ifdef GNUPLOT trim5sco[pt] = score; trim5iid[pt] = sciid; #endif } } // Give up if we're not finding anything longer after 1/3 of the current read. 
Previous // attempts at this wanted to ensure that the current read (sciid) is the same as the best // (best5iid) but ties and slightly different begin/end points make this unreliable. For // example, an overlap from 100-500 and an overlap from 200-501. The first is longest up until // trim point 200, then the second becomes longest. The BEST longest is still the first // overlap, at trim point 100. if (score < best5score * 0.66) break; // Save a new best if the score is better, and the endpoint is more interior to the read. if ((best5score < score) && (best5end < scend)) { //fprintf(stderr, "RESET 5 end to pt %d score %d iid %d end %d\n", // pt, score, sciid, scend); best5pt = pt; best5score = score; best5iid = sciid; best5end = scend; } } //fprintf(stderr, "BEST at %u position %u pt %u\n", best5score, trim5[best5pt], best5pt); // Find the best 3' point. for (uint32 pt=0; pt= trimr - 0) // Not possible to get a higher score by trimming more. break; //fprintf(stderr, "trim3 pt %u out of %u\n", pt, trim3.size()); for (uint32 i=0; i < ovlLen; i++) { uint32 tbgn = ibgn + ovl[i].a_bgn(); uint32 tend = ibgn + ovl[i].a_end(); if ((tend < trimr) || (trimr <= tbgn)) // Alignment ends before the trim point; not a valid overlap, // or, alignment starts after the trim point (trimmed out) continue; // Limit the overlap to the largest covered. 
if (tbgn < lbgn) tbgn = lbgn; assert(trimr >= tbgn); uint32 tlen = trimr - tbgn; assert(tlen <= len); if (((score < tlen)) || ((score == tlen) && (ovl[i].b_iid == best3iid))) { score = tlen; sciid = ovl[i].b_iid; scbgn = tbgn; #ifdef GNUPLOT trim3sco[pt] = score; trim3iid[pt] = sciid; #endif } } if (score < best3score * 0.66) break; if ((best3score < score) && (scbgn < best3bgn)) { //fprintf(stderr, "RESET 3 end to pt %d score %d iid %d bgn %d\n", // pt, score, sciid, scbgn); best3pt = pt; best3score = score; best3iid = sciid; best3bgn = scbgn; } } #ifdef GNUPLOT { char D[FILENAME_MAX]; char G[FILENAME_MAX]; char S[FILENAME_MAX]; FILE *F; snprintf(D, FILENAME_MAX, "trim-%08d.dat", read->gkRead_readID()); snprintf(G, FILENAME_MAX, "trim-%08d.gp", read->gkRead_readID()); snprintf(S, FILENAME_MAX, "gnuplot < trim-%08d.gp", read->gkRead_readID()); F = fopen(D, "w"); for (uint32 i=0; igkRead_readIID()); fprintf(F, "plot \"trim-%08d.dat\" using 3:5 with linespoints, \"trim-%08d.dat\" using 10:12 with linespoints\n", read->gkRead_readIID(), read->gkRead_readIID()); fclose(F); system(S); } #endif //fprintf(stderr, "BEST at %u position %u pt %u\n", best3score, trim3[best3pt], best3pt); // Set trimming. Be just a little aggressive, and get rid of an extra base or two, if possible. // If not possible, the read is crap, and will be deleted anyway. fbgn = trim5[best5pt]; fend = trim3[best3pt]; if ((fbgn + 4 < fend) && (2 < fend)) { fbgn += 2; fend -= 2; } // Did we go outside the largestCovered region? Just reset. Ideally we'd assert, and crash. if (fbgn < lbgn) fbgn = lbgn; if (lend < fend) fend = lend; // Lastly, did we just end up with a bogus trim? There isn't any guard against picking the 5' // trim point to the right of the 3' trim point. 
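The 5' scan above maximizes the overlap length remaining past each candidate trim point, preferring the less aggressive (more exterior) trim on ties. A standalone sketch of that core scoring idea, using simplified (bgn,end) overlap pairs and hypothetical helper names (score5, pick5) rather than canu's ovOverlap:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Olap { uint32_t bgn, end; };  // hypothetical simplified overlap record

// Score a candidate 5' trim point: the longest overlap extent remaining to
// the right of the point.  Overlaps not spanning the point contribute nothing.
uint32_t score5(const std::vector<Olap> &ovl, uint32_t pt) {
  uint32_t best = 0;
  for (const Olap &o : ovl)
    if ((o.bgn <= pt) && (pt < o.end))
      best = std::max(best, o.end - pt);
  return best;
}

// Scan candidate points outside-in and keep the first strictly better score,
// so ties favor the leftmost (least aggressive) trim point.
uint32_t pick5(const std::vector<Olap> &ovl, std::vector<uint32_t> pts) {
  std::sort(pts.begin(), pts.end());
  uint32_t bestPt = 0, bestSc = 0;
  for (uint32_t pt : pts) {
    uint32_t sc = score5(ovl, pt);
    if (sc > bestSc) { bestSc = sc; bestPt = pt; }
  }
  return bestPt;
}
```

With overlaps 100-500 and 200-501 (the example from the comment above), point 100 scores 400 while point 200 scores only 301, so the scan keeps the outermost point.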
#if 1 if (fend < fbgn) { fprintf(stderr, "iid = %u\n", read->gkRead_readID()); fbgn = lbgn; fend = lend; } #endif assert(fbgn <= fend); return(true); } canu-1.6/src/overlapBasedTrimming/trimReads-largestCovered.C000066400000000000000000000150141314437614700242330ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-MAY-28 to 2015-JUN-16 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "trimReads.H" #include "intervalList.H" bool largestCovered(ovOverlap *ovl, uint32 ovlLen, gkRead *read, uint32 ibgn, uint32 iend, uint32 &fbgn, uint32 &fend, char *logMsg, uint32 errorValue, uint32 minOverlap, uint32 minCoverage, uint32 minReadLength) { logMsg[0] = 0; assert(read->gkRead_readID() == ovl[0].a_iid); assert(ovlLen > 0); intervalList IL; intervalList ID; int32 iid = read->gkRead_readID(); uint32 nSkip = 0; uint32 nUsed = 0; for (uint32 i=0; i errorValue) { // Overlap is crappy. 
//fprintf(stderr, "skip %2u\n", i); nSkip++; continue; } //fprintf(stderr, "save %2u\n", i); nUsed++; IL.add(tbgn, tend - tbgn); } #if 0 for (uint32 it=0; it 0) { intervalList DE(IL); uint32 it = 0; uint32 ib = 0; uint32 ie = 0; while (it < DE.numberOfIntervals()) { //fprintf(stderr, "DE - %d - " F_S64 " " F_S64 " " F_U32 "\n", fr.gkFragment_getReadIID(), DE.lo(it), DE.hi(it), DE.depth(it)); if (DE.depth(it) < minCoverage) { // Dropped below good coverage depth. If we have an interval, save it. Reset. if (ie > ib) { //fprintf(stderr, "AD1 %d-%d len %d\n", ib, ie, ie - ib); ID.add(ib, ie - ib); } ib = 0; ie = 0; } else if ((ib == 0) && (ie == 0)) { // Depth is good. If no current interval, make a new one. ib = DE.lo(it); ie = DE.hi(it); //fprintf(stderr, "NE1 %d-%d len %d\n", ib, ie, ie - ib); } else if (ie == DE.lo(it)) { // Depth is good. If this interval is adjacent to the current, extend. ie = DE.hi(it); //fprintf(stderr, "EXT %d-%d len %d\n", ib, ie, ie - ib); } else { // Depth is good, but we just had a gap in coverage. Save any current interval. Reset. if (ie > ib) { //fprintf(stderr, "AD2 %d-%d len %d\n", ib, ie, ie - ib); ID.add(ib, ie - ib); } ib = DE.lo(it); ie = DE.hi(it); //fprintf(stderr, "NE2 %d-%d len %d\n", ib, ie, ie - ib); } it++; } if (ie > ib) { //fprintf(stderr, "AD3 %d-%d len %d\n", ib, ie, ie - ib); ID.add(ib, ie - ib); } } // Now that we've created depth, merge the intervals. IL.merge(minOverlap); // IL - covered interavls enforcing a minimum overlap size (these can overlap) // ID - covered intervals enforcing a minimum depth (these cannot overlap) // // Create new intervals from the intersection of IL and ID. // // This catches one nasty case, where a thin overlap has more than minDepth coverage. 
// // ------------- 3x coverage // ------------- all overlaps 1 or 2 dashes long // --------- // ----------- if (minCoverage > 0) { intervalList FI; uint32 li = 0; uint32 di = 0; while ((li < IL.numberOfIntervals()) && (di < ID.numberOfIntervals())) { uint32 ll = IL.lo(li); uint32 lh = IL.hi(li); uint32 dl = ID.lo(di); uint32 dh = ID.hi(di); uint32 nl = 0; uint32 nh = 0; // If they intersect, make a new region if ((ll <= dl) && (dl < lh)) { nl = dl; nh = (lh < dh) ? lh : dh; } if ((dl <= ll) && (ll < dh)) { nl = ll; nh = (lh < dh) ? lh : dh; } if (nl < nh) FI.add(nl, nh - nl); // Advance the list with the earlier region. if (lh <= dh) // IL ends at or before ID li++; if (dh <= lh) { // ID ends at or before IL di++; } } // Replace the intervals to use with the intersection. IL = FI; } //////////////////////////////////////// // The IL.ct(it) is always 1 if we filter low coverage. It is no longer reported. #if 0 if (IL.numberOfIntervals() > 1) for (uint32 it=0; itgkRead_readID(), IL.lo(it), IL.hi(it)); #endif if (IL.numberOfIntervals() == 0) { strcpy(logMsg, "\tno high quality overlaps"); return(false); } fbgn = IL.lo(0); fend = IL.hi(0); sprintf(logMsg, "\tskipped %u overlaps; used %u overlaps", nSkip, nUsed); for (uint32 it=0; it fend - fbgn) { fbgn = IL.lo(it); fend = IL.hi(it); } } return(true); } canu-1.6/src/overlapBasedTrimming/trimReads-quality.C000066400000000000000000000143211314437614700227520ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. 
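largestCovered() unions overlap intervals (optionally filtered by depth) and keeps the longest merged region. A toy version of the merge-and-pick step, assuming plain (bgn,end) pairs in place of canu's intervalList, with the same rule that two intervals join only when they share at least minOverlap bases:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

typedef std::pair<uint32_t, uint32_t> Iv;  // (bgn, end), half-open

// Merge sorted intervals, joining two only if they overlap by at least
// 'minOverlap' bases, and return the largest merged region.
Iv largestMerged(std::vector<Iv> iv, uint32_t minOverlap) {
  if (iv.empty())
    return Iv(0, 0);

  std::sort(iv.begin(), iv.end());

  Iv cur = iv[0], best(0, 0);

  for (size_t i = 1; i < iv.size(); i++) {
    if (iv[i].first + minOverlap <= cur.second) {
      // Enough shared bases; extend the current merged region.
      cur.second = std::max(cur.second, iv[i].second);
    } else {
      // Gap (or too-thin join); remember the current region if largest.
      if (cur.second - cur.first > best.second - best.first)
        best = cur;
      cur = iv[i];
    }
  }

  if (cur.second - cur.first > best.second - best.first)
    best = cur;

  return best;
}
```

Note how the minOverlap requirement reproduces the "thin overlap" case sketched in the comment above: two regions that merely touch by a few bases stay separate, and only the larger one survives.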
* Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2015-NOV-23 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "trimReads.H" // This is mostly historical. // // It needs to be updated to use sanger qv encoding, and to take that as input. // A simple initialized array -- performs a quality letter -> quality // value translation. // class qualityLookup { public: qualityLookup() { for (uint32 i=0; i<255; i++) q[i] = 1 / pow(10, i / 10.0); }; ~qualityLookup() { }; double lookup(uint32 x) { return(q[x]); }; private: double q[255]; }; qualityLookup qual; static void findGoodQuality(double *qltD, uint32 qltLen, char minQualityLetter, uint32 &qltL, uint32 &qltR) { struct pair { uint32 start; uint32 end; }; pair *f = new pair [qltLen + 1]; pair *r = new pair [qltLen + 1]; uint32 fpos=0, flen=0; uint32 rpos=0, rlen=0; uint32 p = 0; double q = 0; double minQuality = qual.lookup(minQualityLetter - '!'); // Scan forward, find first base with quality >= 20. // Then, find the first base that makes cumulative expected #errors > 1/100 // #errors = 1 / 10 ^ log(q) // while (p < qltLen) { // Find the next begin point // while ((p < qltLen) && (qltD[p] > minQuality)) p++; // Got a begin point! Scan until the quality drops significantly. // f[fpos].start = p; f[fpos].end = p; q = qltD[p]; p++; while ((p < qltLen) && (q / (p - f[fpos].start) < minQuality)) { q += qltD[p]; p++; } f[fpos].end = p; if (f[fpos].end - f[fpos].start > 10) fpos++; } // Scan backward, just like the forward. // // Stung by using uint32 for p; p is one more than it wants to be. // Although, to be fair, there are just about as many cases of p+1 // as p below. 
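The qualityLookup table above precomputes the standard Phred relationship between a quality value and an expected error probability. As a standalone sketch (qualToErrorProb is a hypothetical helper name):

```cpp
#include <cmath>

// Phred-style conversion: quality value q corresponds to an expected
// error probability of 10^(-q/10), exactly what qualityLookup stores
// per quality letter.
double qualToErrorProb(int q) {
  return 1.0 / std::pow(10.0, q / 10.0);
}
```

So quality 20 means roughly one expected error per hundred bases, and quality 30 one per thousand.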
// p = qltLen; q = 0; while (p > 0) { while ((p > 0) && (qltD[p-1] > minQuality)) p--; r[rpos].start = p; r[rpos].end = p; if (p > 0) { p--; q = qltD[p]; while ((p > 0) && (q / (r[rpos].end - p) < minQuality)) { p--; q += qltD[p]; } r[rpos].start = p; if (r[rpos].end - r[rpos].start > 10) rpos++; } } // Now, just pick the largest overlap qltL = 0; qltR = 0; flen = fpos; rlen = rpos; //fprintf(stderr, "qltLen = " F_U32 " flen=" F_U32 " rlen=" F_U32 "\n", qltLen, flen, rlen); uint32 winningFPos = 0; uint32 winningRPos = 0; uint32 winningStyle = 0; for (fpos=0; fpos (qltR - qltL)) { winningFPos = fpos; winningRPos = rpos; winningStyle = 0; qltL = f[fpos].start; qltR = r[rpos].end; } } // fffffffffff // rrrrrrrrrr // else if ((f[fpos].start <= r[rpos].start) && (r[rpos].start <= f[fpos].end) && (f[fpos].end <= r[rpos].end)) { if ((f[fpos].end - r[rpos].start) > (qltR - qltL)) { winningFPos = fpos; winningRPos = rpos; winningStyle = 1; qltL = r[rpos].start; qltR = f[fpos].end; } } // fffffffffffffffffff // rrrrrrrrrr // else if ((f[fpos].start <= r[rpos].start) && (r[rpos].end <= f[fpos].end)) { if ((r[rpos].end - r[rpos].start) > (qltR - qltL)) { winningFPos = fpos; winningRPos = rpos; winningStyle = 2; qltL = r[rpos].start; qltR = r[rpos].end; } } // fffffffffff // rrrrrrrrrrrrrrrrrrrr // else if ((r[rpos].start <= f[fpos].start) && (f[fpos].end <= r[rpos].end)) { if ((f[fpos].end - f[fpos].start) > (qltR - qltL)) { winningFPos = fpos; winningRPos = rpos; winningStyle = 3; qltL = f[fpos].start; qltR = f[fpos].end; } } else if (f[fpos].end < r[rpos].start) { // NOP, no overlap. } else if (r[rpos].end < f[fpos].start) { // NOP, no overlap. } else { fprintf(stderr, "UNMATCHED OVERLAP\t" F_U32 "\t" F_U32 "\t" F_U32 "\t" F_U32 "\n", f[fpos].start, f[fpos].end, r[rpos].start, r[rpos].end); } } } delete [] f; delete [] r; } // Takes a gkFragment, returns clear ranges for some quality // threshold. 
Higher level than I wanted, but it obscures // everything, and is exactly the interface that this and // mergeTrimming.C want. // void doTrim(gkRead *read, gkReadData *readData, double minQuality, uint32 &left, uint32 &right) { uint32 qltLen = read->gkRead_sequenceLength(); char *qltC = readData->gkReadData_getQualities(); double *qltD = new double [qltLen]; for (uint32 i=0; i end) // the original clear range was completely outside the max range. // bool enforceMaximumClearRange(gkRead *read, uint32 UNUSED(ibgn), uint32 UNUSED(iend), uint32 &fbgn, uint32 &fend, char *logMsg, clearRangeFile *maxClr) { if (maxClr == NULL) return(true); if (fbgn == fend) return(true); uint32 mbgn = maxClr->bgn(read->gkRead_readID()); uint32 mend = maxClr->end(read->gkRead_readID()); assert(mbgn < mend); assert(fbgn <= fend); if ((fend < mbgn) || (mend < fbgn)) { // Final clear not intersecting maximum clear. strcat(logMsg, (logMsg[0]) ? " - " : "\t"); strcat(logMsg, "outside maximum allowed clear range"); return(false); } else if ((fbgn < mbgn) || (mend < fend)) { // Final clear extends outside the maximum clear. fbgn = MAX(fbgn, mbgn); fend = MIN(fend, mend); strcat(logMsg, (logMsg[0]) ? " - " : "\t"); strcat(logMsg, "adjusted to obey maximum allowed clear range"); return(true); } else { // Final clear already within the maximum clear. 
return(true); } } int main(int argc, char **argv) { char *gkpName = 0L; char *ovsName = 0L; char *iniClrName = NULL; char *maxClrName = NULL; char *outClrName = NULL; uint32 errorValue = AS_OVS_encodeEvalue(0.015); uint32 minAlignLength = 40; uint32 minReadLength = 64; char *outputPrefix = NULL; char logName[FILENAME_MAX] = {0}; char sumName[FILENAME_MAX] = {0}; FILE *logFile = 0L; FILE *staFile = 0L; uint32 idMin = 1; uint32 idMax = UINT32_MAX; uint32 minEvidenceOverlap = 40; uint32 minEvidenceCoverage = 1; // Statistics on the trimming trimStat readsIn; // Read is eligible for trimming trimStat deletedIn; // Read was deleted already trimStat noTrimIn; // Read not requesting trimming trimStat readsOut; // Read was trimmed to a valid read trimStat noOvlOut; // Read was deleted; no ovelaps trimStat deletedOut; // Read was deleted; too small after trimming trimStat noChangeOut; // Read was untrimmed trimStat trim5; // Bases trimmed from the 5' end trimStat trim3; argc = AS_configure(argc, argv); int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-O") == 0) { ovsName = argv[++arg]; } else if (strcmp(argv[arg], "-Ci") == 0) { iniClrName = argv[++arg]; } else if (strcmp(argv[arg], "-Cm") == 0) { maxClrName = argv[++arg]; } else if (strcmp(argv[arg], "-Co") == 0) { outClrName = argv[++arg]; } else if (strcmp(argv[arg], "-e") == 0) { double erate = atof(argv[++arg]); errorValue = AS_OVS_encodeEvalue(erate); } else if (strcmp(argv[arg], "-l") == 0) { minAlignLength = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-minlength") == 0) { minReadLength = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-ol") == 0) { minEvidenceOverlap = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-oc") == 0) { minEvidenceCoverage = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-o") == 0) { outputPrefix = argv[++arg]; } else if (strcmp(argv[arg], "-t") == 0) { AS_UTL_decodeRange(argv[++arg], idMin, idMax); 
} else { fprintf(stderr, "ERROR: unknown option '%s'\n", argv[arg]); err++; } arg++; } if ((gkpName == NULL) || (ovsName == NULL) || (outputPrefix == NULL) || (err)) { fprintf(stderr, "usage: %s -G gkpStore -O ovlStore -Co output.clearFile -o outputPrefix\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, " -G gkpStore path to read store\n"); fprintf(stderr, " -O ovlStore path to overlap store\n"); fprintf(stderr, "\n"); fprintf(stderr, " -o name output prefix, for logging\n"); fprintf(stderr, "\n"); fprintf(stderr, " -t bgn-end limit processing to only reads from bgn to end (inclusive)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -Ci clearFile path to input clear ranges (NOT SUPPORTED)\n"); //fprintf(stderr, " -Cm clearFile path to maximal clear ranges\n"); fprintf(stderr, " -Co clearFile path to ouput clear ranges\n"); fprintf(stderr, "\n"); fprintf(stderr, " -e erate ignore overlaps with more than 'erate' percent error\n"); //fprintf(stderr, " -l length ignore overlaps shorter than 'l' aligned bases (NOT SUPPORTED)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -ol l the minimum evidence overlap length\n"); fprintf(stderr, " -oc c the minimum evidence overlap coverage\n"); fprintf(stderr, " evidence overlaps must overlap by 'l' bases to be joined, and\n"); fprintf(stderr, " must be at least 'c' deep to be retained\n"); fprintf(stderr, "\n"); fprintf(stderr, " -minlength l reads trimmed below this many bases are deleted\n"); fprintf(stderr, "\n"); exit(1); } gkStore *gkp = gkStore::gkStore_open(gkpName); ovStore *ovs = new ovStore(ovsName, gkp); clearRangeFile *iniClr = (iniClrName == NULL) ? NULL : new clearRangeFile(iniClrName, gkp); clearRangeFile *maxClr = (maxClrName == NULL) ? NULL : new clearRangeFile(maxClrName, gkp); clearRangeFile *outClr = (outClrName == NULL) ? NULL : new clearRangeFile(outClrName, gkp); if (outClr) // If the outClr file exists, those clear ranges are loaded. We need to reset them // back to 'untrimmed' for now. 
outClr->reset(gkp); if (iniClr && outClr) // An iniClr file was supplied, so use those as the initial clear ranges. outClr->copy(iniClr); if (outputPrefix) { snprintf(logName, FILENAME_MAX, "%s.log", outputPrefix); errno = 0; logFile = fopen(logName, "w"); if (errno) fprintf(stderr, "Failed to open log file '%s' for writing: %s\n", logName, strerror(errno)), exit(1); fprintf(logFile, "id\tinitL\tinitR\tfinalL\tfinalR\tmessage (DEL=deleted NOC=no change MOD=modified)\n"); } uint32 ovlLen = 0; uint32 ovlMax = 64 * 1024; ovOverlap *ovl = ovOverlap::allocateOverlaps(gkp, ovlMax); memset(ovl, 0, sizeof(ovOverlap) * ovlMax); char logMsg[1024] = {0}; if (idMin < 1) idMin = 1; if (idMax > gkp->gkStore_getNumReads()) idMax = gkp->gkStore_getNumReads(); fprintf(stderr, "Processing from ID " F_U32 " to " F_U32 " out of " F_U32 " reads.\n", idMin, idMax, gkp->gkStore_getNumReads()); for (uint32 id=idMin; id<=idMax; id++) { gkRead *read = gkp->gkStore_getRead(id); gkLibrary *libr = gkp->gkStore_getLibrary(read->gkRead_libraryID()); logMsg[0] = 0; // If the fragment is deleted, do nothing. If the fragment was deleted AFTER overlaps were // generated, then the overlaps will be out of sync -- we'll get overlaps for these fragments // we skip. // if ((iniClr) && (iniClr->isDeleted(id) == true)) { deletedIn += read->gkRead_sequenceLength(); continue; } // If it did not request trimming, do nothing. Similar to the above, we'll get overlaps to // fragments we skip. // if ((libr->gkLibrary_finalTrim() != GK_FINALTRIM_LARGEST_COVERED) && (libr->gkLibrary_finalTrim() != GK_FINALTRIM_BEST_EDGE)) { noTrimIn += read->gkRead_sequenceLength(); continue; } readsIn += read->gkRead_sequenceLength(); // Decide on the initial trimming. We copied any iniClr into outClr above, and if there wasn't // an iniClr, then outClr is the full read. uint32 ibgn = outClr->bgn(id); uint32 iend = outClr->end(id); // Set the, ahem, initial final trimming.
bool isGood = false; uint32 fbgn = ibgn; uint32 fend = iend; // Load overlaps. uint32 nLoaded = ovs->readOverlaps(id, ovl, ovlLen, ovlMax); // Trim! if (nLoaded == 0) { // No overlaps, so mark it as junk. isGood = false; } else if (libr->gkLibrary_finalTrim() == GK_FINALTRIM_LARGEST_COVERED) { // Use the largest region covered by overlaps as the trim assert(ovlLen > 0); assert(id == ovl[0].a_iid); isGood = largestCovered(ovl, ovlLen, read, ibgn, iend, fbgn, fend, logMsg, errorValue, minEvidenceOverlap, minEvidenceCoverage, minReadLength); assert(fbgn <= fend); } else if (libr->gkLibrary_finalTrim() == GK_FINALTRIM_BEST_EDGE) { // Use the largest region covered by overlaps as the trim assert(ovlLen > 0); assert(id == ovl[0].a_iid); isGood = bestEdge(ovl, ovlLen, read, ibgn, iend, fbgn, fend, logMsg, errorValue, minEvidenceOverlap, minEvidenceCoverage, minReadLength); assert(fbgn <= fend); } else { // Do nothing. Really shouldn't get here. assert(0); continue; } // Enforce the maximum clear range if ((isGood) && (maxClr)) { isGood = enforceMaximumClearRange(read, ibgn, iend, fbgn, fend, logMsg, maxClr); assert(fbgn <= fend); } // // Trimmed. Make sense of the result, write some logs, and update the output. // // If bad trimming or too small, write the log and keep going. // if (nLoaded == 0) { noOvlOut += read->gkRead_sequenceLength(); outClr->setbgn(id) = fbgn; outClr->setend(id) = fend; outClr->setDeleted(id); // Gah, just obliterates the clear range. fprintf(logFile, F_U32"\t" F_U32 "\t" F_U32 "\t" F_U32 "\t" F_U32 "\tNOV%s\n", id, ibgn, iend, fbgn, fend, (logMsg[0] == 0) ? "" : logMsg); } else if ((isGood == false) || (fend - fbgn < minReadLength)) { deletedOut += read->gkRead_sequenceLength(); outClr->setbgn(id) = fbgn; outClr->setend(id) = fend; outClr->setDeleted(id); // Gah, just obliterates the clear range. fprintf(logFile, F_U32"\t" F_U32 "\t" F_U32 "\t" F_U32 "\t" F_U32 "\tDEL%s\n", id, ibgn, iend, fbgn, fend, (logMsg[0] == 0) ? 
"" : logMsg); } // If we didn't change anything, also write a log. // else if ((ibgn == fbgn) && (iend == fend)) { noChangeOut += read->gkRead_sequenceLength(); fprintf(logFile, F_U32"\t" F_U32 "\t" F_U32 "\t" F_U32 "\t" F_U32 "\tNOC%s\n", id, ibgn, iend, fbgn, fend, (logMsg[0] == 0) ? "" : logMsg); continue; } // Otherwise, we actually did something. else { readsOut += fend - fbgn; outClr->setbgn(id) = fbgn; outClr->setend(id) = fend; assert(ibgn <= fbgn); assert(fend <= iend); if (fbgn - ibgn > 0) trim5 += fbgn - ibgn; if (iend - fend > 0) trim3 += iend - fend; fprintf(logFile, F_U32"\t" F_U32 "\t" F_U32 "\t" F_U32 "\t" F_U32 "\tMOD%s\n", id, ibgn, iend, fbgn, fend, (logMsg[0] == 0) ? "" : logMsg); } } // Clean up. gkp->gkStore_close(); delete ovs; delete iniClr; delete maxClr; delete outClr; if (logFile) fclose(logFile); // should fprintf() the numbers directly here so an explanation of each category can be supplied; // simpler for now to have report() do it. // Dump the statistics and plots if (outputPrefix) { snprintf(sumName, FILENAME_MAX, "%s.stats", outputPrefix); errno = 0; staFile = fopen(sumName, "w"); if (errno) fprintf(stderr, "Failed to open stats file '%s' for writing: %s\n", sumName, strerror(errno)), exit(1); } if (staFile == NULL) staFile = stderr; fprintf(staFile, "PARAMETERS:\n"); fprintf(staFile, "----------\n"); fprintf(staFile, "%7u (reads trimmed below this many bases are deleted)\n", minReadLength); fprintf(staFile, "%7.4f (use overlaps at or below this fraction error)\n", AS_OVS_decodeEvalue(errorValue)); fprintf(staFile, "%7u (break region if overlap is less than this long, for 'largest covered' algorithm)\n", minEvidenceOverlap); fprintf(staFile, "%7u (break region if overlap coverage is less than this many read%s, for 'largest covered' algorithm)\n", minEvidenceCoverage, (minEvidenceCoverage == 1) ? 
"" : "s"); fprintf(staFile, "\n"); fprintf(staFile, "INPUT READS:\n"); fprintf(staFile, "-----------\n"); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (reads processed)\n", readsIn.nReads, readsIn.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (reads not processed, previously deleted)\n", deletedIn.nReads, deletedIn.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (reads not processed, in a library where trimming isn't allowed)\n", noTrimIn.nReads, noTrimIn.nBases); readsIn .generatePlots(outputPrefix, "inputReads", 250); deletedIn.generatePlots(outputPrefix, "inputDeletedReads", 250); noTrimIn .generatePlots(outputPrefix, "inputNoTrimReads", 250); fprintf(staFile, "\n"); fprintf(staFile, "OUTPUT READS:\n"); fprintf(staFile, "------------\n"); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (trimmed reads output)\n", readsOut.nReads, readsOut.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (reads with no change, kept as is)\n", noChangeOut.nReads, noChangeOut.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (reads with no overlaps, deleted)\n", noOvlOut.nReads, noOvlOut.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (reads with short trimmed length, deleted)\n", deletedOut.nReads, deletedOut.nBases); readsOut .generatePlots(outputPrefix, "outputTrimmedReads", 250); noOvlOut .generatePlots(outputPrefix, "outputNoOvlReads", 250); deletedOut .generatePlots(outputPrefix, "outputDeletedReads", 250); noChangeOut.generatePlots(outputPrefix, "outputUnchangedReads", 250); fprintf(staFile, "\n"); fprintf(staFile, "TRIMMING DETAILS:\n"); fprintf(staFile, "----------------\n"); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (bases trimmed from the 5' end of a read)\n", trim5.nReads, trim5.nBases); fprintf(staFile, "%6" F_U32P " reads %12" F_U64P " bases (bases trimmed from the 3' end of a read)\n", trim3.nReads, trim3.nBases); 
trim5.generatePlots(outputPrefix, "trim5", 25); trim3.generatePlots(outputPrefix, "trim3", 25); if ((staFile) && (staFile != stderr)) fclose(staFile); // Buh-bye. exit(0); } canu-1.6/src/overlapBasedTrimming/trimReads.H000066400000000000000000000047061314437614700212770ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2015-JUN-16 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #ifndef TRIM_READS_H #define TRIM_READS_H #include "AS_global.H" #include "gkStore.H" #include "ovStore.H" #include "intervalList.H" #define OBT_MODE_WIGGLE (5) #define OBT_CQ_LENGTH (100) #define OBT_CQO_LENGTH (200) #define OBT_CQO_OVERLAP (100) #define OBT_CQ_SHORT (5) #define OBT_QLT_CLOSE_5 (10) // 5,6 use 5'mode, use 5'min>1 #define OBT_QLT_FAR_5 (50) // 11 use min5' #define OBT_QLT_MODE3 (150) // 9 use 3'mode #define OBT_QLT_CLOSE_MAXM3 (30) // 14 use max>1 close to max #define OBT_QLT_CLOSE_MAX3 (100) // 12 use max3' bool largestCovered(ovOverlap *ovl, uint32 ovlLen, gkRead *read, uint32 ibgn, uint32 iend, uint32 &fbgn, uint32 &fend, char *logMsg, uint32 errorRate, uint32 minOverlap, uint32 minCoverage, uint32 minReadLength); bool bestEdge(ovOverlap *ovl, uint32 ovlLen, gkRead *read, uint32 ibgn, uint32 iend, uint32 &fbgn, uint32 &fend, char *logMsg, uint32 errorRate, uint32 minOverlap, uint32 minCoverage, uint32 minReadLength); #endif // TRIM_READS_H canu-1.6/src/overlapBasedTrimming/trimReads.mk000066400000000000000000000011131314437614700215040ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := trimReads SOURCES := trimReads.C \ trimReads-bestEdge.C \ trimReads-largestCovered.C \ trimReads-quality.C SRC_INCDIRS := .. ../AS_UTL ../stores TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/overlapBasedTrimming/trimStat.H000066400000000000000000000054131314437614700211500ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
 * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-JAN-19 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef TRIM_STAT_H #define TRIM_STAT_H #include "AS_global.H" class trimStat { public: trimStat() { nReads = 0; nBases = 0; }; trimStat &operator+=(uint32 bases) { nReads += 1; nBases += bases; histo.push_back(bases); return(*this); }; void generatePlots(char *outputPrefix, char *outputName, uint32 binwidth) { char N[FILENAME_MAX]; FILE *F; snprintf(N, FILENAME_MAX, "%s.%s.dat", outputPrefix, outputName); F = fopen(N, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", N, strerror(errno)), exit(1); for (uint64 ii=0; ii<histo.size(); ii++) fprintf(F, "%u\n", histo[ii]); fclose(F); snprintf(N, FILENAME_MAX, "gnuplot < %s.%s.gp > /dev/null 2>&1", outputPrefix, outputName); system(N); }; uint32 nReads; uint64 nBases; vector<uint32> histo; }; #endif // TRIM_STAT_H canu-1.6/src/overlapErrorAdjustment/000077500000000000000000000000001314437614700176305ustar00rootroot00000000000000canu-1.6/src/overlapErrorAdjustment/analyzeAlignment.C000066400000000000000000000475411314437614700232501ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-JUN-18 to 2015-JUL-01 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "analyzeAlignment.H" // Return the substitution vote corresponding to Ch . static Vote_Value_t Matching_Vote(char ch) { switch (ch) { case 'A': return(A_SUBST); break; case 'C': return(C_SUBST); break; case 'G': return(G_SUBST); break; case 'T': return(T_SUBST); break; } fprintf(stderr, "Matching_Vote()-- invalid letter '%c'\n", ch); return(NO_VOTE); } static char Matching_Char(Vote_Value_t vv) { switch (vv) { case A_SUBST: return('A'); break; case C_SUBST: return('C'); break; case G_SUBST: return('G'); break; case T_SUBST: return('T'); break; default: return('?'); break; } return('?'); } // This is expecting: // aSeq and bSeq to be pointers to the start of the sequence that was aligned with Prefix_Edit_Distance // aLen and bLen to be the strlen of those strings // aOffset to be the offset from base zero of the read // deltaLen and delta as passed back from the aligner // void analyzeAlignment::analyze(char *aSeq, int32 aLen, int32 aOffset, char *bSeq, int32 bLen, int32 deltaLen, int32 *delta) { assert(aLen >= 0); assert(bLen >= 0); int32 ct = 0; _readSub[ct] = -1; _algnSub[ct] = -1; _voteValue[ct] = A_SUBST; // Dummy value ct++; int32 i = 0; int32 j = 0; int32 p 
= 0; uint32 nMatch = 0; uint32 nMismatch = 0; uint32 nInsert = 0; uint32 nDelete = 0; for (int32 k=0; k 0) { _readSub[ct] = i; _algnSub[ct] = p; _voteValue[ct] = DELETE; #ifdef DEBUG fprintf(stderr, "DELETE %c at %d #%d\n", aSeq[i], i, p); #endif nDelete++; ct++; i++; assert(i <= aLen); p++; } #ifdef DEBUG fprintf(stderr, "#4 match %u mismatch %u insert %u delete %u\n", nMatch, nMismatch, nInsert, nDelete); #endif } // No more deltas. While there is still sequence, add matches or mismatches. #ifdef DEBUG fprintf(stderr, "#5 match %u mismatch %u insert %u delete %u\n", nMatch, nMismatch, nInsert, nDelete); fprintf(stderr, "#5 k=DONE i=%d out of %d j=%d out of %d\n", i, aLen, j, bLen); #endif while (i < aLen) { //fprintf(stderr, "k=DONE i=%d out of %d j=%d out of %d\n", i, aLen, j, bLen); if (aSeq[i] != bSeq[j]) nMismatch++; else nMatch++; if (aSeq[i] != bSeq[j]) { _readSub[ct] = i; _algnSub[ct] = p; switch (bSeq[j]) { case 'A': _voteValue[ct] = A_SUBST; break; case 'C': _voteValue[ct] = C_SUBST; break; case 'G': _voteValue[ct] = G_SUBST; break; case 'T': _voteValue[ct] = T_SUBST; break; default : fprintf(stderr, "ERROR:[3] Bad sequence '%c' 0x%02x)\n", bSeq[j], bSeq[j]); assert(0); } ct++; } i++; assert(i <= aLen); // Guaranteed, we're looping on this j++; assert(j <= bLen); p++; } #ifdef DEBUG fprintf(stderr, "#6 match %u mismatch %u insert %u delete %u\n", nMatch, nMismatch, nInsert, nDelete); #endif _readSub[ct] = i; _algnSub[ct] = p; // For each identified change, add votes for some region around the change. //fprintf(stderr, "Found %u changes.\n", ct); for (int32 i=1; i<=ct; i++) { int32 prev_match = _algnSub[i] - _algnSub[i-1] - 1; int32 p_lo = (i == 1 ? 0 : End_Exclude_Len); int32 p_hi = (i == ct ? prev_match : prev_match - End_Exclude_Len); // If distance to previous match is bigger than 'kmer' size, make a new vote. // This operates one ahead of where votes are added - we add votes for _readSub[i-1] when at [i]. 
if (prev_match >= Kmer_Len) { fprintf(stderr, "adjust ct %d pos %d - lo %d hi %d\n", i, _readSub[i-1], p_lo, p_hi); fprintf(stderr, " match vote %u to %u\n", aOffset + _readSub[i-1] + 1, aOffset + _readSub[i-1] + p_lo + 1); for (int32 p=0; p 0) || (_voteValue[i-1] <= T_SUBST) || (_voteValue[i] <= T_SUBST))) { int32 next_match = _algnSub[i+1] - _algnSub[i] - 1; if (prev_match + next_match >= Vote_Qualify_Len) castVote(_voteValue[i], aOffset + _readSub[i]); } } } void analyzeAlignment::outputDetails(uint32 j) { fprintf(stderr, "%3" F_U32P ": %c conf %3" F_U64P " deletes %3" F_U64P " | subst %3" F_U64P " %3" F_U64P " %3" F_U64P " %3" F_U64P " | no_insert %3" F_U64P " insert %3" F_U64P " %3" F_U64P " %3" F_U64P " %3" F_U64P "\n", j, _seq[j], _vote[j].confirmed, _vote[j].deletes, _vote[j].a_subst, _vote[j].c_subst, _vote[j].g_subst, _vote[j].t_subst, _vote[j].no_insert, _vote[j].a_insert, _vote[j].c_insert, _vote[j].g_insert, _vote[j].t_insert); }; void analyzeAlignment::outputDetails(void) { fprintf(stderr, ">%d\n", _readID); for (uint32 j=0; _seq[j] != '\0'; j++) outputDetails(j); }; void analyzeAlignment::generateCorrections(FILE *corFile) { _corLen = 0; _cor[_corLen].keep_left = (_lDegree < Degree_Threshold); _cor[_corLen].keep_right = (_rDegree < Degree_Threshold); _cor[_corLen].type = IDENT; _cor[_corLen].pos = 0; _cor[_corLen].readID = _readID; _corLen++; resizeArray(_cor, _corLen, _corMax, _corLen+1); uint32 passedLowConfirmed = 0; uint32 substitutions = 0; uint32 skippedTooFew = 0; // 0 or 1 votes uint32 skippedTooWeak = 0; // No vote more than 50% uint32 skippedNoChange = 0; // Is a substitution vote, but it's the same as the base that is there uint32 skippedHaplo = 0; // More than one significant vote, and we're not correcting haplotypes uint32 skippedConfirmed = 0; // ?? 
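The vote-filtering cascade applied in generateCorrections — enough total votes, a strict majority (2*max > total), and a winning base that differs from the one already present — can be condensed into a few lines. `pickSubstitution` below is an illustrative reduction only, not canu's routine; it omits the haplotype-count and confirmed-count tests.

```cpp
#include <cassert>
#include <cstdint>

// Decide a substitution call from per-base vote counts: need more than
// one vote total, a strict majority (2*max > total), and a winner that
// differs from the current base.  Returns 0 when no change is called.
char pickSubstitution(char current, uint32_t a, uint32_t c, uint32_t g, uint32_t t) {
  const char     base[4]  = { 'A', 'C', 'G', 'T' };
  const uint32_t count[4] = { a, c, g, t };

  uint32_t total = a + c + g + t;
  uint32_t win   = 0;

  for (uint32_t i = 1; i < 4; i++)       // index of the most-voted base
    if (count[i] > count[win])
      win = i;

  if (total <= 1)               return 0;   // too few votes to act on
  if (2 * count[win] <= total)  return 0;   // no strict majority
  if (base[win] == current)     return 0;   // same as the existing base
  return base[win];
}
```

A tie (for example 3 votes A, 3 votes C) fails the strict-majority test, mirroring the `2 * max <= total` skip above.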
uint32 passedInsert = 0; uint32 insertions = 0; uint32 skippedInsTotal = 0; uint32 skippedInsMax = 0; uint32 skippedInsHaplo = 0; uint32 skippedInsTooMany = 0; for (uint32 j=0; j<_seqLen; j++) { outputDetails(j); if (_vote[j].confirmed < 2) { Vote_Value_t vval = DELETE; uint64 max = _vote[j].deletes; bool is_change = true; passedLowConfirmed++; if (_vote[j].a_subst > max) { vval = A_SUBST; max = _vote[j].a_subst; is_change = (_seq[j] != 'A'); } if (_vote[j].c_subst > max) { vval = C_SUBST; max = _vote[j].c_subst; is_change = (_seq[j] != 'C'); } if (_vote[j].g_subst > max) { vval = G_SUBST; max = _vote[j].g_subst; is_change = (_seq[j] != 'G'); } if (_vote[j].t_subst > max) { vval = T_SUBST; max = _vote[j].t_subst; is_change = (_seq[j] != 'T'); } uint64 haplo_ct = ((_vote[j].deletes >= Min_Haplo_Occurs) + (_vote[j].a_subst >= Min_Haplo_Occurs) + (_vote[j].c_subst >= Min_Haplo_Occurs) + (_vote[j].g_subst >= Min_Haplo_Occurs) + (_vote[j].t_subst >= Min_Haplo_Occurs)); uint64 total = (_vote[j].deletes + _vote[j].a_subst + _vote[j].c_subst + _vote[j].g_subst + _vote[j].t_subst); // The original had a gargantuan if test (five clauses, all had to be true) to decide if a // record should be output. It was negated into many small tests if we should skip the // output. A side effect is that we can abort a little earlier (skipping the two clauses // above)....but we don't bother. // (total > 1) if (total <= 1) { fprintf(stderr, "FEW total = " F_U64 " <= 1\n", total); skippedTooFew++; continue; } // (2 * max > total) if (2 * max <= total) { fprintf(stderr, "WEAK 2*max = " F_U64 " <= total = " F_U64 "\n", 2*max, total); skippedTooWeak++; continue; } // (is_change == true) if (is_change == false) { fprintf(stderr, "SAME is_change = %s\n", (is_change) ? 
"true" : "false"); skippedNoChange++; continue; } // ((haplo_ct < 2) || (Use_Haplo_Ct == false)) if ((haplo_ct >= 2) && (Use_Haplo_Ct == true)) { fprintf(stderr, "HAPLO haplo_ct=" F_U64 " >= 2 AND Use_Haplo_Ct = %s\n", haplo_ct, (Use_Haplo_Ct) ? "true" : "false"); skippedHaplo++; continue; } // ((_vote[j].confirmed == 0) || // ((_vote[j].confirmed == 1) && (max > 6))) if ((_vote[j].confirmed > 0) && ((_vote[j].confirmed != 1) || (max <= 6))) { fprintf(stderr, "INDET confirmed = " F_U64 " max = " F_U64 "\n", _vote[j].confirmed, max); skippedConfirmed++; continue; } // Otherwise, output. substitutions++; fprintf(stderr, "SUBSTITUTE position " F_U32 " to %c\n", j, Matching_Char(vval)); _cor[_corLen].type = vval; _cor[_corLen].pos = j; _cor[_corLen].readID = _readID; _corLen++; resizeArray(_cor, _corLen, _corMax, _corLen+1); } // confirmed < 2 if (_vote[j].no_insert < 2) { Vote_Value_t ins_vote = A_INSERT; uint64 ins_max = _vote[j].a_insert; passedInsert++; if (ins_max < _vote[j].c_insert) { ins_vote = C_INSERT; ins_max = _vote[j].c_insert; } if (ins_max < _vote[j].g_insert) { ins_vote = G_INSERT; ins_max = _vote[j].g_insert; } if (ins_max < _vote[j].t_insert) { ins_vote = T_INSERT; ins_max = _vote[j].t_insert; } uint64 ins_haplo_ct = ((_vote[j].a_insert >= Min_Haplo_Occurs) + (_vote[j].c_insert >= Min_Haplo_Occurs) + (_vote[j].g_insert >= Min_Haplo_Occurs) + (_vote[j].t_insert >= Min_Haplo_Occurs)); uint64 ins_total = (_vote[j].a_insert + _vote[j].c_insert + _vote[j].g_insert + _vote[j].t_insert); if (ins_total <= 1) { fprintf(stderr, "FEW ins_total = " F_U64 " <= 1\n", ins_total); skippedInsTotal++; continue; } if (2 * ins_max >= ins_total) { fprintf(stderr, "WEAK 2*ins_max = " F_U64 " <= ins_total = " F_U64 "\n", 2*ins_max, ins_total); skippedInsMax++; continue; } if ((ins_haplo_ct >= 2) && (Use_Haplo_Ct == true)) { fprintf(stderr, "HAPLO ins_haplo_ct=" F_U64 " >= 2 AND Use_Haplo_Ct = %s\n", ins_haplo_ct, (Use_Haplo_Ct) ? 
"true" : "false"); skippedInsHaplo++; continue; } if ((_vote[j].no_insert > 0) && ((_vote[j].no_insert != 1) || (ins_max <= 6))) { fprintf(stderr, "INDET no_insert = " F_U64 " ins_max = " F_U64 "\n", _vote[j].no_insert, ins_max); skippedInsTooMany++; continue; } // Otherwise, output. insertions++; fprintf(stderr, "INSERT position " F_U32 " to %c\n", j, Matching_Char(ins_vote)); _cor[_corLen].type = ins_vote; _cor[_corLen].pos = j; _cor[_corLen].readID = _readID; _corLen++; resizeArray(_cor, _corLen, _corMax, _corLen+1); } // insert < 2 } fprintf(stderr, "Processed corrections: made %6u subs and %6u inserts - possible %6u (few %6u weak %6u same %6u haplo %6u confirmed %6u) inserts %6u (total %6u max %6u haplo %6u confirmed %6u)\n", substitutions, insertions, passedLowConfirmed, skippedTooFew, skippedTooWeak, skippedNoChange, skippedHaplo, skippedConfirmed, passedInsert, skippedInsTotal, skippedInsMax, skippedInsHaplo, skippedInsTooMany); if (corFile) AS_UTL_safeWrite(corFile, _cor, "corrections", sizeof(Correction_Output_t), _corLen); } #if 0 fprintf(stderr, "Corrected " F_U64 " bases with " F_U64 " substitutions, " F_U64 " deletions and " F_U64 " insertions.\n", G->basesLen, changes[A_SUBST] + changes[C_SUBST] + changes[G_SUBST] + changes[T_SUBST], changes[DELETE], changes[A_INSERT] + changes[C_INSERT] + changes[G_INSERT] + changes[T_INSERT]); #endif void analyzeAlignment::generateCorrectedRead(Adjust_t *fadj, uint32 *fadjLen, uint64 *changes) { // oseq = original sequence // fseq - fixed sequence _corSeqLen = 0; uint32 corPos = 1; // First entry is the ident block. uint32 adjVal = 0; // Corrected reads start at position zero, really! for (uint32 i=0; i<_seqLen; i++) { // No more corrections, or no more corrections for this read -- just copy bases till the end. 
if ((corPos == _corLen) || (_cor[corPos].readID != _readID)) { //fprintf(stderr, "no more corrections at i=%u, copy rest of read as is\n", i); while (i < _seqLen) _corSeq[_corSeqLen++] = _filter[_seq[i++]]; break; } // Not at a correction -- copy the base. if (i < _cor[corPos].pos) { _corSeq[_corSeqLen++] = _filter[_seq[i]]; continue; } if ((i != _cor[corPos].pos) && (i != _cor[corPos].pos + 1)) fprintf(stderr, "i=%d corPos=%d _cor[corPos].pos=%d\n", i, corPos, _cor[corPos].pos); assert((i == _cor[corPos].pos) || (i == _cor[corPos].pos + 1)); if (changes) changes[_cor[corPos].type]++; switch (_cor[corPos].type) { case DELETE: // Delete base //fprintf(stderr, "DELETE %u pos %u adjust %d\n", (*fadjLen), i+1, adjVal-1); if (fadj) { fadj[(*fadjLen)].adjpos = i + 1; fadj[(*fadjLen)].adjust = --adjVal; (*fadjLen)++; } break; case A_SUBST: _corSeq[_corSeqLen++] = 'A'; break; case C_SUBST: _corSeq[_corSeqLen++] = 'C'; break; case G_SUBST: _corSeq[_corSeqLen++] = 'G'; break; case T_SUBST: _corSeq[_corSeqLen++] = 'T'; break; case A_INSERT: if (i != _cor[corPos].pos + 1) { // Insert not immediately after subst //fprintf(stderr, "A i=%d != _cor[%d].pos+1=%d\n", i, corPos, _cor[corPos].pos+1); _corSeq[_corSeqLen++] = _filter[_seq[i++]]; } _corSeq[_corSeqLen++] = 'A'; if (fadj) { fadj[(*fadjLen)].adjpos = i + 1; fadj[(*fadjLen)].adjust = ++adjVal; (*fadjLen)++; } i--; // Undo the automagic loop increment break; case C_INSERT: if (i != _cor[corPos].pos + 1) { //fprintf(stderr, "C i=%d != _cor[%d].pos+1=%d\n", i, corPos, _cor[corPos].pos+1); _corSeq[_corSeqLen++] = _filter[_seq[i++]]; } _corSeq[_corSeqLen++] = 'C'; if (fadj) { fadj[(*fadjLen)].adjpos = i + 1; fadj[(*fadjLen)].adjust = ++adjVal; (*fadjLen)++; } i--; break; case G_INSERT: if (i != _cor[corPos].pos + 1) { //fprintf(stderr, "G i=%d != _cor[%d].pos+1=%d\n", i, corPos, _cor[corPos].pos+1); _corSeq[_corSeqLen++] = _filter[_seq[i++]]; } _corSeq[_corSeqLen++] = 'G'; if (fadj) { fadj[(*fadjLen)].adjpos = i + 1; 
fadj[(*fadjLen)].adjust = ++adjVal; (*fadjLen)++; } i--; break; case T_INSERT: if (i != _cor[corPos].pos + 1) { //fprintf(stderr, "T i=%d != _cor[%d].pos+1=%d\n", i, corPos, _cor[corPos].pos+1); _corSeq[_corSeqLen++] = _filter[_seq[i++]]; } _corSeq[_corSeqLen++] = 'T'; if (fadj) { fadj[(*fadjLen)].adjpos = i + 1; fadj[(*fadjLen)].adjust = ++adjVal; (*fadjLen)++; } i--; break; default: fprintf (stderr, "ERROR: Illegal vote type\n"); break; } corPos++; } // Terminate the sequence. _corSeq[_corSeqLen] = 0; } canu-1.6/src/overlapErrorAdjustment/analyzeAlignment.H000066400000000000000000000147361314437614700232560ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-JUN-18 to 2015-JUL-01 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "correctionOutput.H" // Allocate one per base in read under evaluation. // 5 values per 64 bit word -> 12 bits per value + 4 left over // 6 values per 64 bit word -> 10 bits per value + 4 left over // 7 values per 64 bit word -> 9 bits per value + 1 left over // // Since there are only 11 values to store, we use the 10 bit size. 
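The space arithmetic in the comment above can be checked directly with a small bitfield struct. `VoteTally` here is an illustrative stand-in (same field names and 10-bit widths as `Vote_Tally_t`, but standard C++ types), together with the saturating update used when votes are cast so a 10-bit counter never wraps.

```cpp
#include <cassert>
#include <cstdint>

// Eleven 10-bit counters need 110 bits: six fit in the first 64-bit
// word (60 bits), the remaining five in the second, so the struct is
// two words.  At 12 bits each (5 per word), 11 values would need three.
struct VoteTally {
  uint64_t confirmed : 10;
  uint64_t deletes   : 10;
  uint64_t a_subst   : 10;
  uint64_t c_subst   : 10;
  uint64_t g_subst   : 10;
  uint64_t t_subst   : 10;
  uint64_t no_insert : 10;
  uint64_t a_insert  : 10;
  uint64_t c_insert  : 10;
  uint64_t g_insert  : 10;
  uint64_t t_insert  : 10;
};

const uint32_t VOTE_MAX = 1023;   // 2^10 - 1, the 10-bit ceiling

// Apply n delete votes, saturating at VOTE_MAX instead of wrapping.
uint32_t saturatingVotes(uint32_t n) {
  VoteTally t = {};
  for (uint32_t i = 0; i < n; i++)
    if (t.deletes < VOTE_MAX)
      t.deletes++;
  return t.deletes;
}
```

The `sizeof` is two 64-bit words on the common ABIs (GCC/Clang/MSVC pack `uint64_t` bitfields into successive 64-bit units), which is what makes the 10-bit choice pay off.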
// // This is only in-memory, so we don't worry about padding it to maintain // proper sizes between runs. #define VOTETALLY_BITS 10 #define VOTETALLY_MAX 1023 struct Vote_Tally_t { uint64 confirmed : VOTETALLY_BITS; uint64 deletes : VOTETALLY_BITS; uint64 a_subst : VOTETALLY_BITS; uint64 c_subst : VOTETALLY_BITS; uint64 g_subst : VOTETALLY_BITS; uint64 t_subst : VOTETALLY_BITS; uint64 no_insert : VOTETALLY_BITS; uint64 a_insert : VOTETALLY_BITS; uint64 c_insert : VOTETALLY_BITS; uint64 g_insert : VOTETALLY_BITS; uint64 t_insert : VOTETALLY_BITS; }; // Maps an original location to a fixed location. Supposedly // smaller/faster than the obvious array mapping... struct Adjust_t { int32 adjpos; int32 adjust; }; class analyzeAlignment { public: analyzeAlignment() { Degree_Threshold = 2; // ?? Use_Haplo_Ct = true; // Use haplotype counts to correct End_Exclude_Len = 3; // ?? Kmer_Len = 9; // ?? Vote_Qualify_Len = 9; // ?? Min_Haplo_Occurs = 3; // This many or more votes for the same base indicates a haplotype _readID = 0; _seqLen = 0; _seq = NULL; _corSeqLen = 0; _corSeqMax = AS_MAX_READLEN; _corSeq = new char [_corSeqMax]; _voteMax = 0; _vote = NULL; _lDegree = 0; _rDegree = 0; _readSub = new int32 [AS_MAX_READLEN]; _algnSub = new int32 [AS_MAX_READLEN]; _voteValue = new Vote_Value_t [AS_MAX_READLEN]; _corLen = 0; _corMax = AS_MAX_READLEN / 4; _cor = new Correction_Output_t [_corMax]; // The original version converted ACGT to lowercase, and replaced non-ACGT with 'a'. 
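A byte-indexed lookup table like `_filter` replaces per-character branching with a single array access. The sketch below is an illustrative standalone version of the same mapping (lowercase ACGT pass through as lowercase, every other byte collapses to 'a'):

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Build the 256-entry filter once; apply() then does one table lookup
// per base, with no branches in the inner loop.
struct BaseFilter {
  char map[256];

  BaseFilter() {
    memset(map, 'a', sizeof(map));                      // non-ACGT -> 'a'
    map[(unsigned char)'A'] = map[(unsigned char)'a'] = 'a';
    map[(unsigned char)'C'] = map[(unsigned char)'c'] = 'c';
    map[(unsigned char)'G'] = map[(unsigned char)'g'] = 'g';
    map[(unsigned char)'T'] = map[(unsigned char)'t'] = 't';
  }

  std::string apply(const std::string &seq) const {
    std::string out(seq);
    for (size_t i = 0; i < out.size(); i++)
      out[i] = map[(unsigned char)out[i]];
    return out;
  }
};
```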
for (uint32 i=0; i<256; i++) _filter[i] = 'a'; _filter['A'] = _filter['a'] = 'a'; _filter['C'] = _filter['c'] = 'c'; _filter['G'] = _filter['g'] = 'g'; _filter['T'] = _filter['t'] = 't'; }; ~analyzeAlignment() { delete [] _corSeq; delete [] _vote; delete [] _readSub; delete [] _algnSub; delete [] _voteValue; delete [] _cor; }; public: void reset(uint32 id, char *seq, uint32 seqLen) { fprintf(stderr, "reset() for read id %u of length %u\n", id, seqLen); _readID = id; _seqLen = seqLen; _seq = seq; _corSeqLen = 0; if (_voteMax < _seqLen) { delete [] _vote; _voteMax = _seqLen + 2000; _vote = new Vote_Tally_t [_voteMax]; } // Seems like overkill, but I don't know where to put it memset(_vote, 0, sizeof(Vote_Tally_t) * _voteMax); }; void analyze(char *a, int32 aLen, int32 aOffset, char *b, int32 bLen, int32 deltaLen, int32 *delta); void outputDetails(uint32 j); void outputDetails(void); void generateCorrections(FILE *corFile=NULL); void generateCorrectedRead(Adjust_t *fadj=NULL, uint32 *fadjLen=NULL, uint64 *changes=NULL); private: void castVote(Vote_Value_t val, int32 pos) { int32 v = 0; switch (val) { case DELETE: if (_vote[pos].deletes < VOTETALLY_MAX) v = ++_vote[pos].deletes; break; case A_SUBST: if (_vote[pos].a_subst < VOTETALLY_MAX) v = ++_vote[pos].a_subst; break; case C_SUBST: if (_vote[pos].c_subst < VOTETALLY_MAX) v = ++_vote[pos].c_subst; break; case G_SUBST: if (_vote[pos].g_subst < VOTETALLY_MAX) v = ++_vote[pos].g_subst; break; case T_SUBST: if (_vote[pos].t_subst < VOTETALLY_MAX) v = ++_vote[pos].t_subst; break; case A_INSERT: if (_vote[pos].a_insert < VOTETALLY_MAX) v = ++_vote[pos].a_insert; break; case C_INSERT: if (_vote[pos].c_insert < VOTETALLY_MAX) v = ++_vote[pos].c_insert; break; case G_INSERT: if (_vote[pos].g_insert < VOTETALLY_MAX) v = ++_vote[pos].g_insert; break; case T_INSERT: if (_vote[pos].t_insert < VOTETALLY_MAX) v = ++_vote[pos].t_insert; break; case NO_VOTE: break; default : fprintf(stderr, "ERROR: Illegal vote type\n"); break; } }; 
private: // Parameters int32 Degree_Threshold; int32 Use_Haplo_Ct; int32 End_Exclude_Len; int32 Kmer_Len; int32 Vote_Qualify_Len; int32 Min_Haplo_Occurs; // Per-read data uint32 _readID; uint32 _seqLen; char *_seq; public: uint32 _corSeqLen; uint32 _corSeqMax; char *_corSeq; private: uint32 _voteMax; Vote_Tally_t *_vote; uint32 _lDegree; uint32 _rDegree; int32 *_readSub; int32 *_algnSub; Vote_Value_t *_voteValue; // Outputs (used to generate the corrected read) uint32 _corLen; uint32 _corMax; Correction_Output_t *_cor; char _filter[256]; }; canu-1.6/src/overlapErrorAdjustment/correctOverlaps-Correct_Frags.C000066400000000000000000000231321314437614700256330ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-MAY-20 to 2015-JUN-03 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-MAY-02 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "correctOverlaps.H" #include "correctionOutput.H" // Shouldn't be global. 
char filter[256]; void correctRead(uint32 curID, char *fseq, uint32 &fseqLen, Adjust_t *fadj, uint32 &fadjLen, char *oseq, uint32 oseqLen, Correction_Output_t *C, uint64 &Cpos, uint64 Clen, uint64 *changes) { #if DO_NO_CORRECTIONS // for testing if the adjustments are screwed up. yup. strcpy(fseq, oseq); fseqLen += oseqLen; return; #endif //fprintf(stderr, "Correcting read %u\n", curID); // Find the correct corrections. while ((Cpos < Clen) && (C[Cpos].readID < curID)) { //fprintf(stderr, "SKIP Cpos=%d for read %u, want read %u\n", Cpos, C[Cpos].readID, curID); Cpos++; } // Skip any IDENT message. assert(C[Cpos].type == IDENT); //G.reads[G.readsLen].keep_left = C[Cpos].keep_left; //G.reads[G.readsLen].keep_right = C[Cpos].keep_right; Cpos++; //fprintf(stderr, "Start at Cpos=%d position=%d type=%d id=%d\n", Cpos, C[Cpos].pos, C[Cpos].type, C[Cpos].readID); int32 adjVal = 0; for (uint32 i=0; i%u\n%s\n", curID, fseq); } // Open and read corrections from Correct_File_Path and // apply them to sequences in Frag . // Load reads from gkpStore, and apply corrections. void Correct_Frags(coParameters *G, gkStore *gkpStore) { // The original converted to lowercase, and made non-acgt be 'a'. for (uint32 i=0; i<256; i++) filter[i] = 'a'; filter['A'] = filter['a'] = 'a'; filter['C'] = filter['c'] = 'c'; filter['G'] = filter['g'] = 'g'; filter['T'] = filter['t'] = 't'; // Open the corrections, as an array. memoryMappedFile *Cfile = new memoryMappedFile(G->correctionsName); Correction_Output_t *C = (Correction_Output_t *)Cfile->get(); uint64 Cpos = 0; uint64 Clen = Cfile->length() / sizeof(Correction_Output_t); uint64 firstRecord = 0; uint64 currentRecord = 0; fprintf(stderr, "Reading " F_U64 " corrections from '%s'.\n", Clen, G->correctionsName); // Count the number of bases, so we can do two gigantic allocations for bases and adjustments. // Adjustments are always less than the number of corrections; we could also count exactly. 
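The two-pass pattern used in Correct_Frags — count totals first, make one large allocation, then hand each read a pointer into it — can be shown in miniature. The names below (`Slice`, `packReads`) are illustrative, not canu's:

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// One backing buffer for many variable-length reads: a counting pass
// sizes the allocation, then each read gets a slice of it.
struct Slice {
  char   *bases;
  size_t  len;
};

std::vector<Slice> packReads(const std::vector<std::string> &reads,
                             std::vector<char> &storage) {
  size_t total = 0;

  for (const std::string &r : reads)        // pass 1: count, +1 per NUL
    total += r.size() + 1;

  storage.resize(total);                    // one allocation for everything

  std::vector<Slice> out;
  size_t off = 0;

  for (const std::string &r : reads) {      // pass 2: copy, record slices
    std::memcpy(&storage[off], r.c_str(), r.size() + 1);
    out.push_back({ &storage[off], r.size() });
    off += r.size() + 1;
  }
  return out;
}
```

As in the code below, the slices stay valid only because the backing buffer is never resized after the second pass begins.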
G->basesLen = 0; G->adjustsLen = 0; for (uint32 curID=G->bgnID; curID<=G->endID; curID++) { gkRead *read = gkpStore->gkStore_getRead(curID); G->basesLen += read->gkRead_sequenceLength() + 1; } for (uint64 c=0; cadjustsLen++; break; } } fprintf(stderr, "Correcting " F_U64 " bases with " F_U64 " indel adjustments.\n", G->basesLen, G->adjustsLen); fprintf(stderr, "--Allocate " F_U64 " + " F_U64 " + " F_U64 " MB for bases, adjusts and reads.\n", (sizeof(char) * G->basesLen) >> 20, (sizeof(Adjust_t) * G->adjustsLen) >> 20, (sizeof(Frag_Info_t) * (G->endID - G->bgnID + 1)) >> 20); G->bases = new char [G->basesLen]; G->adjusts = new Adjust_t [G->adjustsLen]; G->reads = new Frag_Info_t [G->endID - G->bgnID + 1]; G->readsLen = 0; G->basesLen = 0; G->adjustsLen = 0; uint64 changes[12] = {0}; // Load reads and apply corrections for each one. gkReadData *readData = new gkReadData; for (uint32 curID=G->bgnID; curID<=G->endID; curID++) { gkRead *read = gkpStore->gkStore_getRead(curID); gkpStore->gkStore_loadReadData(read, readData); uint32 readLength = read->gkRead_sequenceLength(); char *readBases = readData->gkReadData_getSequence(); // Save pointers to the bases and adjustments. G->reads[G->readsLen].bases = G->bases + G->basesLen; G->reads[G->readsLen].basesLen = 0; G->reads[G->readsLen].adjusts = G->adjusts + G->adjustsLen; G->reads[G->readsLen].adjustsLen = 0; // Find the correct corrections. while ((Cpos < Clen) && (C[Cpos].readID < curID)) Cpos++; // We should be at the IDENT message. if (C[Cpos].type != IDENT) { fprintf(stderr, "ERROR: didn't find IDENT at Cpos=" F_U64 " for read " F_U32 "\n", Cpos, curID); fprintf(stderr, " C[Cpos] = keep_left=%u keep_right=%u type=%u pos=%u readID=%u\n", C[Cpos].keep_left, C[Cpos].keep_right, C[Cpos].type, C[Cpos].pos, C[Cpos].readID); } assert(C[Cpos].type == IDENT); G->reads[G->readsLen].keep_left = C[Cpos].keep_left; G->reads[G->readsLen].keep_right = C[Cpos].keep_right; //Cpos++; // Now do the corrections. 
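Applying position-sorted corrections to a read — substitute a base, drop it, or insert a new one after it — can be sketched with hypothetical types; `Edit` and `Op` below are illustrative stand-ins, not canu's `Correction_Output_t`, and allow at most one edit per position for brevity.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Op { Subst, Del, Ins };

struct Edit {
  size_t pos;   // position in the original read, strictly ascending
  Op     op;
  char   ch;    // replacement or inserted base (ignored for Del)
};

// One left-to-right pass over the read, consuming edits as their
// positions are reached -- the same shape as the correction loop.
std::string applyEdits(const std::string &seq, const std::vector<Edit> &edits) {
  std::string out;
  size_t e = 0;

  for (size_t i = 0; i < seq.size(); i++) {
    if (e < edits.size() && edits[e].pos == i) {
      switch (edits[e].op) {
        case Op::Del:                          break;  // drop this base
        case Op::Subst: out += edits[e].ch;    break;  // replace it
        case Op::Ins:   out += seq[i];                 // keep it, then add
                        out += edits[e].ch;    break;
      }
      e++;
    } else {
      out += seq[i];   // no correction here, copy as-is
    }
  }
  return out;
}
```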
correctRead(curID, G->reads[G->readsLen].bases, G->reads[G->readsLen].basesLen, G->reads[G->readsLen].adjusts, G->reads[G->readsLen].adjustsLen, readData->gkReadData_getSequence(), read->gkRead_sequenceLength(), C, Cpos, Clen, changes); // Update the lengths in the globals. G->basesLen += G->reads[G->readsLen].basesLen + 1; G->adjustsLen += G->reads[G->readsLen].adjustsLen; G->readsLen += 1; } delete readData; delete Cfile; fprintf(stderr, "Corrected " F_U64 " bases with " F_U64 " substitutions, " F_U64 " deletions and " F_U64 " insertions.\n", G->basesLen, changes[A_SUBST] + changes[C_SUBST] + changes[G_SUBST] + changes[T_SUBST], changes[DELETE], changes[A_INSERT] + changes[C_INSERT] + changes[G_INSERT] + changes[T_INSERT]); } canu-1.6/src/overlapErrorAdjustment/correctOverlaps-Prefix_Edit_Distance.C000066400000000000000000000203561314437614700271310ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-MAY-02 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "correctOverlaps.H" static void Compute_Delta(pedWorkArea_t *WA, int32 e, int32 d, int32 row) { int32 last = row; int32 stackLen = 0; for (int32 k=e; k>0; k--) { int32 from = d; int32 max = 1 + WA->Edit_Array_Lazy[k-1][d]; int32 j = WA->Edit_Array_Lazy[k-1][d-1]; if (j > max) { from = d-1; max = j; } j = 1 + WA->Edit_Array_Lazy[k-1][d+1]; if (j > max) { from = d+1; max = j; } if (from == d-1) { WA->deltaStack[stackLen++] = max - last - 1; d--; last = WA->Edit_Array_Lazy[k-1][from]; } else if (from == d+1) { WA->deltaStack[stackLen++] = last - (max - 1); d++; last = WA->Edit_Array_Lazy[k-1][from]; } } WA->deltaStack[stackLen++] = last + 1; for (int32 k=0, i=stackLen-1; i>0; i--) WA->delta[k++] = abs(WA->deltaStack[i]) * Sign(WA->deltaStack[i-1]); WA->deltaLen = stackLen - 1; } // Allocate another block of 64mb for edits // Needs to be at least: // 52,432 to handle 40% error at 64k overlap // 104,860 to handle 80% error at 64k overlap // 209,718 to handle 40% error at 256k overlap // 419,434 to handle 80% error at 256k overlap // 3,355,446 to handle 40% error at 4m overlap // 6,710,890 to handle 80% error at 4m overlap // Bigger means we can assign more than one Edit_Array[] in one allocation. uint32 EDIT_SPACE_SIZE = 16 * 1024 * 1024; static void Allocate_More_Edit_Space(pedWorkArea_t *WA) { // Determine the last allocated block, and the last assigned block int32 b = 0; // Last edit array assigned int32 e = 0; // Last edit array assigned more space int32 a = WA->alloc.size(); // First unallocated block while (WA->Edit_Array_Lazy[b] != NULL) b++; // Fill in the edit space array. Well, not quite yet. First, decide the minimum size. // // Element [0] can access from [-2] to [2] = 5 elements. // Element [1] can access from [-3] to [3] = 7 elements. // // Element [e] can access from [-2-e] to [2+e] = 5 + e * 2 elements // // So, our offset for this new block needs to put [e][0] at offset... 
int32 Offset = 2 + b; int32 Del = 6 + b * 2; int32 Size = EDIT_SPACE_SIZE; while (Size < Offset + Del) Size *= 2; // Allocate another block int32 *alloc = new int32 [Size]; WA->alloc.push_back(alloc); // And, now, fill in the edit space array. e = b; while ((Offset + Del < Size) && (e < WA->Edit_Array_Max)) { WA->Edit_Array_Lazy[e++] = alloc + Offset; Offset += Del; Del += 2; } if (e == b) fprintf(stderr, "Allocate_More_Edit_Space()-- ERROR: couldn't allocate enough space for even one more entry! e=%d\n", e); assert(e != b); fprintf(stderr, "--Allocate %d MB for edit array work space %d (positions %u-%u)\n", Size >> 20, a, b, e-1); } // Return the minimum number of changes (inserts, deletes, replacements) // needed to match string A[0 .. (m-1)] with a prefix of string // T[0 .. (n-1)] if it's not more than Error_Limit . // // Put delta description of alignment in WA->delta and set // WA->deltaLen to the number of entries there if it's a complete // match. // Set A_End and T_End to the rightmost positions where the // alignment ended in A and T , respectively. // Set Match_To_End true if the match extended to the end // of at least one string; otherwise, set it false to indicate // a branch point. int32 Prefix_Edit_Dist(char *A, int32 m, char *T, int32 n, int32 Error_Limit, int32 &A_End, int32 &T_End, bool &Match_To_End, pedWorkArea_t *WA) { //assert (m <= n); int32 Best_d = 0; int32 Best_e = 0; int32 Longest = 0; WA->deltaLen = 0; int32 shorter = min(m, n); int32 Row = 0; while ((Row < shorter) && (A[Row] == T[Row])) Row++; //fprintf(stderr, "Row=%d matches at the start\n", Row); if (WA->Edit_Array_Lazy[0] == NULL) Allocate_More_Edit_Space(WA); WA->Edit_Array_Lazy[0][0] = Row; // Exact match? 
if (Row == shorter) { A_End = Row; T_End = Row; Match_To_End = true; return(0); } int32 Left = 0; int32 Right = 0; double Max_Score = 0.0; int32 Max_Score_Len = 0; int32 Max_Score_Best_d = 0; int32 Max_Score_Best_e = 0; for (int32 e=1; e<=Error_Limit; e++) { if (WA->Edit_Array_Lazy[e] == NULL) Allocate_More_Edit_Space(WA); Left = max(Left - 1, -e); Right = min(Right + 1, e); WA->Edit_Array_Lazy[e-1][Left] = -2; WA->Edit_Array_Lazy[e-1][Left-1] = -2; WA->Edit_Array_Lazy[e-1][Right] = -2; WA->Edit_Array_Lazy[e-1][Right+1] = -2; for (int32 d=Left; d<=Right; d++) { Row = 1 + WA->Edit_Array_Lazy[e-1][d]; Row = max(Row, WA->Edit_Array_Lazy[e-1][d-1]); Row = max(Row, WA->Edit_Array_Lazy[e-1][d+1] + 1); while ((Row < m) && (Row + d < n) && (A[Row] == T[Row + d])) Row++; //fprintf(stderr, "Row=%d matches at error e=%d\n", Row, e); assert(e < WA->Edit_Array_Max); WA->Edit_Array_Lazy[e][d] = Row; if (Row == m || Row + d == n) { //fprintf(stderr, "Hit end Row=%d m=%d Row+d=%d n=%d\n", Row, m, Row+d, n); // Force last error to be mismatch rather than insertion if ((Row == m) && (1 + WA->Edit_Array_Lazy[e-1][d+1] == WA->Edit_Array_Lazy[e][d]) && (d < Right)) { d++; WA->Edit_Array_Lazy[e][d] = WA->Edit_Array_Lazy[e][d-1]; } A_End = Row; // One past last align position T_End = Row + d; Match_To_End = true; Compute_Delta(WA, e, d, Row); return(e); } } while (Left <= Right && Left < 0 && WA->Edit_Array_Lazy[e][Left] < WA->G->Edit_Match_Limit[e]) Left++; if (Left >= 0) while (Left <= Right && WA->Edit_Array_Lazy[e][Left] + Left < WA->G->Edit_Match_Limit[e]) Left++; if (Left > Right) break; while (Right > 0 && WA->Edit_Array_Lazy[e][Right] + Right < WA->G->Edit_Match_Limit[e]) Right--; if (Right <= 0) while (WA->Edit_Array_Lazy[e][Right] < WA->G->Edit_Match_Limit[e]) Right--; assert (Left <= Right); for (int32 d=Left; d <= Right; d++) if (WA->Edit_Array_Lazy[e][d] > Longest) { Best_d = d; Best_e = e; Longest = WA->Edit_Array_Lazy[e][d]; } int32 Score = Longest * BRANCH_PT_MATCH_VALUE 
- e; // Assumes BRANCH_PT_MATCH_VALUE - BRANCH_PT_ERROR_VALUE == 1.0 // findErrors also included a second test; overlapper doesn't. if (Score > Max_Score) { Max_Score = Score; Max_Score_Len = Longest; Max_Score_Best_d = Best_d; Max_Score_Best_e = Best_e; } } // findErrors does this call. Overlapper doesn't. //Compute_Delta(WA, Max_Score_Best_e, Max_Score_Best_d, Max_Score_Len); A_End = Max_Score_Len; T_End = Max_Score_Len + Max_Score_Best_d; Match_To_End = false; // findErrors is returning Max_Score_Best_e. So does overlapper. // The original return was just e, but the only way we get here is if the e loop // exits with e = Error_Limit+1. return(Error_Limit + 1); } canu-1.6/src/overlapErrorAdjustment/correctOverlaps-Read_Olaps.C000066400000000000000000000051301314437614700251170ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-JUN-16 to 2015-JUN-25 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-MAY-02 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "correctOverlaps.H" // Load overlaps with aIID from G->bgnID to G->endID. // Overlaps can be unsorted. 
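The e/d loop above is the tail of a furthest-reaching-point edit-distance computation (Myers-style O(ND)): Edit_Array_Lazy[e][d] holds the deepest row of A reachable with e errors on the diagonal where the T position is Row + d. A minimal, self-contained sketch of the same recurrence, without canu's Left/Right banding, Edit_Match_Limit pruning, branch-point scoring, or delta output; the function and variable names here are illustrative, not canu's:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Sketch of the furthest-reaching-point recurrence used by Prefix_Edit_Dist:
// row(e, d) is the deepest position in A matched with e errors on diagonal d,
// where the matching B position is row + d.  This toy version computes a
// plain whole-string edit distance.
int editDistance(const std::string &A, const std::string &B) {
  const int m = (int)A.size();
  const int n = (int)B.size();
  const int off = m + n + 1;                    // array offset for diagonal d
  std::vector<int> cur(2 * off + 1, -2);        // -2 marks an unreached state
  std::vector<int> prev;

  auto extend = [&](int row, int d) {           // slide along exact matches,
    while (row < m && row + d < n && A[row] == B[row + d])
      row++;                                    // like the inner while above
    return row;
  };

  cur[off] = extend(0, 0);
  if (m == n && cur[off] == m)
    return 0;

  for (int e = 1; e <= m + n; e++) {
    prev = cur;
    for (int d = -e; d <= e; d++) {
      int i = d + off;
      int row = std::max(prev[i] + 1,           // mismatch on diagonal d
                std::max(prev[i - 1],           // gap in A: consume a B char
                         prev[i + 1] + 1));     // gap in B: consume an A char
      row = std::min(row, std::min(m, n - d));  // stay inside both strings
      cur[i] = (row < 0) ? -2 : extend(row, d);
    }
    if (cur[n - m + off] == m)                  // both strings exhausted
      return e;
  }
  return m + n;                                 // not reachable
}
```

The three-way max mirrors the `Row = 1 + E[e-1][d]; max(E[e-1][d-1]); max(E[e-1][d+1] + 1)` update above; canu additionally trims the d range each iteration and stops early for prefix (rather than whole-string) alignment.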
void Read_Olaps(coParameters *G, gkStore *gkpStore) { ovStore *ovs = new ovStore(G->ovlStorePath, gkpStore); ovs->setRange(G->bgnID, G->endID); uint64 numolaps = ovs->numOverlapsInRange(); uint64 numNormal = 0; uint64 numInnie = 0; fprintf(stderr, "Read_Olaps()-- Loading " F_U64 " overlaps from '%s' for reads " F_U32 " to " F_U32 "\n", numolaps, G->ovlStorePath, G->bgnID, G->endID); fprintf(stderr, "--Allocate " F_U64 " MB for overlaps.\n", (sizeof(Olap_Info_t) * numolaps) >> 20); G->olaps = new Olap_Info_t [numolaps]; G->olapsLen = 0; ovOverlap olap(gkpStore); while (ovs->readOverlap(&olap)) { G->olaps[G->olapsLen].a_iid = olap.a_iid; G->olaps[G->olapsLen].b_iid = olap.b_iid; G->olaps[G->olapsLen].a_hang = olap.a_hang(); G->olaps[G->olapsLen].b_hang = olap.b_hang(); //G->olaps[G->olapsLen].orient = (olap.flipped()) ? INNIE : NORMAL; G->olaps[G->olapsLen].innie = (olap.flipped() == true); G->olaps[G->olapsLen].normal = (olap.flipped() == false); G->olaps[G->olapsLen].order = G->olapsLen; G->olaps[G->olapsLen].evalue = olap.evalue(); numNormal += (G->olaps[G->olapsLen].normal == true); numInnie += (G->olaps[G->olapsLen].innie == true); G->olapsLen++; } delete ovs; fprintf(stderr, "Read_Olaps()-- Loaded " F_U64 " overlaps -- " F_U64 " normal and " F_U64 " innie.\n", G->olapsLen, numNormal, numInnie); } canu-1.6/src/overlapErrorAdjustment/correctOverlaps-Redo_Olaps.C000066400000000000000000000366561314437614700251560ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. 
* Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-MAY-14 to 2015-JUN-03 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-FEB-25 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "correctOverlaps.H" #include "correctionOutput.H" #include "AS_UTL_reverseComplement.H" void correctRead(uint32 curID, char *fseq, uint32 &fseqLen, Adjust_t *fadj, uint32 &fadjLen, char *oseq, uint32 oseqLen, Correction_Output_t *C, uint64 &Cpos, uint64 Clen, uint64 *changes=NULL); int32 Prefix_Edit_Dist(char *A, int32 m, char *T, int32 n, int32 Error_Limit, int32 &A_End, int32 &T_End, bool &Match_To_End, pedWorkArea_t *ped); #define DISPLAY_WIDTH 250 // Show (to stdout ) the alignment encoded in delta [0 .. (deltaLen - 1)] // between strings a [0 .. (a_len - 1)] and b [0 .. (b_len - 1)] . 
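Display_Alignment below renders the delta encoding produced by the edit-distance code: each delta entry d encodes |d|-1 aligned columns followed by one indel, and the sign decides which row receives the gap. A hedged sketch of that expansion as a single pass (a hypothetical helper for illustration, not canu code):

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Expands a delta-encoded alignment into two gapped rows, following the
// same convention as Display_Alignment: delta[k] < 0 puts a gap in the A
// row (an extra base in B), delta[k] > 0 puts a gap in the B row, and
// |delta[k]|-1 aligned columns precede each indel.  Any matched tail after
// the last delta entry is copied as-is.
void expandDelta(const std::string &a, const std::string &b,
                 const int *delta, int deltaLen,
                 std::string &top, std::string &bot) {
  size_t i = 0, j = 0;
  for (int k = 0; k < deltaLen; k++) {
    for (int m = 1; m < std::abs(delta[k]); m++) {   // aligned columns
      top += a[i++];
      bot += b[j++];
    }
    if (delta[k] < 0) {                              // gap in the A row
      top += '-';
      bot += b[j++];
    } else {                                         // gap in the B row
      top += a[i++];
      bot += '-';
    }
  }
  while (i < a.size() && j < b.size()) {             // common tail
    top += a[i++];
    bot += b[j++];
  }
}
```

With a = "acgt", b = "agt" and delta = {2}, the rows come out "acgt" / "a-gt": one matched column, then the A-only 'c' as a gap in the B row.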
static void Display_Alignment(char *a, int32 aLen, char *b, int32 bLen, int32 *delta, int32 deltaLen) { int32 i = 0; int32 j = 0; char *top = new char [32 * 1024]; int32 topLen = 0; char *bot = new char [32 * 1024]; int32 botLen = 0; for (int32 k = 0; k < deltaLen; k++) { for (int32 m = 1; m < abs(delta[k]); m++) { top[topLen++] = a[i++]; j++; } if (delta[k] < 0) { top[topLen++] = '-'; j++; } else { top[topLen++] = a[i++]; } } while (i < aLen && j < bLen) { top[topLen++] = a[i++]; j++; } top[topLen] = '\0'; i = j = 0; for (int32 k = 0; k < deltaLen; k++) { for (int32 m = 1; m < abs(delta[k]); m++) { bot[botLen++] = b[j++]; i++; } if (delta[k] > 0) { bot[botLen++] = '-'; i++; } else { bot[botLen++] = b[j++]; } } while (j < bLen && i < aLen) { bot[botLen++] = b[j++]; i++; } bot[botLen] = '\0'; for (i = 0; i < topLen || i < botLen; i += DISPLAY_WIDTH) { putc('\n', stderr); fprintf(stderr, "A: "); for (j = 0; j < DISPLAY_WIDTH && i + j < topLen; j++) putc(top[i + j], stderr); putc('\n', stderr); fprintf(stderr, "B: "); for (j = 0; j < DISPLAY_WIDTH && i + j < botLen; j++) putc(bot[i + j], stderr); putc('\n', stderr); fprintf(stderr, " "); for (j = 0; j < DISPLAY_WIDTH && i + j < botLen && i + j < topLen; j++) if (top[i + j] != ' ' && bot[i + j] != ' ' && top[i + j] != bot[i + j]) putc('^', stderr); else putc(' ', stderr); putc('\n', stderr); } delete [] top; delete [] bot; } // Set hanging offset values for reversed fragment in // rev_adj[0 .. (adj_ct - 1)] based on corresponding forward // values in fadj[0 .. (adj_ct - 1)]. frag_len is the length // of the fragment. 
static void Make_Rev_Adjust(Adjust_t *radj, Adjust_t *fadj, int32 adj_ct, int32 frag_len) { if (adj_ct == 0) return; int32 i = 0; int32 j = 0; int32 prev = 0; for (i=adj_ct-1; i>0; i--) { if (fadj[i].adjust == fadj[i-1].adjust + 1) { radj[j].adjpos = 2 + frag_len - fadj[i].adjpos; radj[j].adjust = prev + 1; prev = radj[j].adjust; } else if (fadj[i].adjust == fadj[i-1].adjust - 1) { radj[j].adjpos = 3 + frag_len - fadj[i].adjpos; radj[j].adjust = prev - 1; prev = radj[j].adjust; } else { fprintf(stderr, "ERROR: Bad adjustment value. i = %d adj_ct = %d adjust[i] = %d adjust[i-1] = %d\n", i, adj_ct, fadj[i].adjust, fadj[i-1].adjust); assert(0); } j++; } assert(i == 0); if (fadj[i].adjust == 1) { radj[j].adjpos = 2 + frag_len - fadj[i].adjpos; radj[j].adjust = prev + 1; } else if (fadj[i].adjust == -1) { radj[j].adjpos = 3 + frag_len - fadj[i].adjpos; radj[j].adjust = prev - 1; } else { fprintf(stderr, "ERROR: Bad adjustment value. i = %d adj_ct = %d adjust[i] = %d\n", i, adj_ct, fadj[i].adjust); assert(0); } assert(j+1 == adj_ct); } // Return the adjusted value of hang based on // adjust[0 .. (adjust_ct - 1)] . static int32 Hang_Adjust(int32 hang, Adjust_t *adjust, int32 adjust_ct) { int32 delta = 0; assert(hang >= 0); // Replacing second test >= with just > didn't change anything. Both had 14 fails. for (int32 i=0; (i < adjust_ct) && (hang >= adjust[i].adjpos); i++) { //if (delta != adjust[i].adjust) // fprintf(stderr, "hang_adjust i=%d adjust_ct=%d adjust=%d pos=%d\n", i, adjust_ct, adjust[i].adjust, adjust[i].adjpos); delta = adjust[i].adjust; } //fprintf(stderr, "hang adjust delta %d\n", delta); return(hang + delta); } // Read old fragments in gkpStore and choose the ones that // have overlaps with fragments in Frag. Recompute the // overlaps, using fragment corrections and output the revised error. void Redo_Olaps(coParameters *G, gkStore *gkpStore) { // Figure out the range of B reads we care about. 
We probably could just loop over every read in // the store with minimal penalty. uint64 thisOvl = 0; uint64 lastOvl = G->olapsLen - 1; uint32 loBid = G->olaps[thisOvl].b_iid; uint32 hiBid = G->olaps[lastOvl].b_iid; // Open all the corrections. memoryMappedFile *Cfile = new memoryMappedFile(G->correctionsName); Correction_Output_t *C = (Correction_Output_t *)Cfile->get(); uint64 Cpos = 0; uint64 Clen = Cfile->length() / sizeof(Correction_Output_t); // Allocate some temporary work space for the forward and reverse corrected B reads. fprintf(stderr, "--Allocate " F_U64 " MB for fseq and rseq.\n", (2 * sizeof(char) * 2 * (AS_MAX_READLEN + 1)) >> 20); char *fseq = new char [AS_MAX_READLEN + 1 + AS_MAX_READLEN + 1]; uint32 fseqLen = 0; char *rseq = new char [AS_MAX_READLEN + 1 + AS_MAX_READLEN + 1]; uint32 rseqLen = 0; fprintf(stderr, "--Allocate " F_U64 " MB for fadj and radj.\n", (2 * sizeof(Adjust_t) * (AS_MAX_READLEN + 1)) >> 20); Adjust_t *fadj = new Adjust_t [AS_MAX_READLEN + 1]; Adjust_t *radj = new Adjust_t [AS_MAX_READLEN + 1]; uint32 fadjLen = 0; // radj is the same length fprintf(stderr, "--Allocate " F_U64 " MB for pedWorkArea_t.\n", sizeof(pedWorkArea_t) >> 20); gkReadData *readData = new gkReadData; pedWorkArea_t *ped = new pedWorkArea_t; uint64 Total_Alignments_Ct = 0; uint64 Failed_Alignments_Ct = 0; uint64 Failed_Alignments_Both_Ct = 0; uint64 Failed_Alignments_End_Ct = 0; uint64 Failed_Alignments_Length_Ct = 0; uint32 rhaFail = 0; uint32 rhaPass = 0; uint64 olapsFwd = 0; uint64 olapsRev = 0; ped->initialize(G, G->errorRate); // Process overlaps. Loop over the B reads, and recompute each overlap. 
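Hang_Adjust, defined earlier in this file, maps a hang measured on the uncorrected read onto the corrected read: it scans the per-read adjustment list and applies the last cumulative indel adjustment recorded at or before the hang position. A small sketch of that logic with made-up adjustment values (Adjust_t redeclared locally so the example is self-contained):

```cpp
#include <cassert>

// Mirrors the Hang_Adjust loop: each entry records, at read position
// 'adjpos', the net number of bases inserted (+) or deleted (-) by
// corrections up to that point; the hang is shifted by the last entry
// whose position is at or before the hang.  Values below are illustrative.
struct Adjust_t {
  int adjpos;
  int adjust;
};

int hangAdjust(int hang, const Adjust_t *adjust, int adjust_ct) {
  int delta = 0;
  for (int i = 0; (i < adjust_ct) && (hang >= adjust[i].adjpos); i++)
    delta = adjust[i].adjust;   // keep the latest applicable adjustment
  return hang + delta;
}
```

Note the adjustments are cumulative, not summed here: only the final applicable `adjust` value is added, exactly as in the loop above.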
for (uint32 curID=loBid; curID<=hiBid; curID++) { if (((curID - loBid) % 1024) == 0) fprintf(stderr, "Recomputing overlaps - %9u - %9u - %9u\r", loBid, curID, hiBid); if (curID < G->olaps[thisOvl].b_iid) continue; gkRead *read = gkpStore->gkStore_getRead(curID); gkpStore->gkStore_loadReadData(read, readData); // Apply corrections to the B read (also converts to lower case, reverses it, etc) //fprintf(stderr, "Correcting B read %u at Cpos=%u\n", curID, Cpos); fseqLen = 0; rseqLen = 0; fadjLen = 0; correctRead(curID, fseq, fseqLen, fadj, fadjLen, readData->gkReadData_getSequence(), read->gkRead_sequenceLength(), C, Cpos, Clen); // Create copies of the sequence for forward and reverse. There isn't a need for the forward copy (except that // we mutate it with corrections), and the reverse copy could be deferred until it is needed. memcpy(rseq, fseq, sizeof(char) * (fseqLen + 1)); reverseComplementSequence(rseq, fseqLen); Make_Rev_Adjust(radj, fadj, fadjLen, fseqLen); // Recompute alignments for all overlaps involving the B read. for (; ((thisOvl <= lastOvl) && (G->olaps[thisOvl].b_iid == curID)); thisOvl++) { Olap_Info_t *olap = G->olaps + thisOvl; //fprintf(stderr, "processing overlap %u - %u\n", olap->a_iid, olap->b_iid); // Find the A segment. It's always forward. It's already been corrected. char *a_part = G->reads[olap->a_iid - G->bgnID].bases; if (olap->a_hang > 0) { int32 ha = Hang_Adjust(olap->a_hang, G->reads[olap->a_iid - G->bgnID].adjusts, G->reads[olap->a_iid - G->bgnID].adjustsLen); a_part += ha; //fprintf(stderr, "offset a_part by ha=%d\n", ha); } // Find the B segment. char *b_part = (olap->normal == true) ? fseq : rseq; //if (olap->normal == true) // fprintf(stderr, "b_part = fseq %40.40s\n", fseq); //else // fprintf(stderr, "b_part = rseq %40.40s\n", rseq); if (olap->normal == true) olapsFwd++; else olapsRev++; bool rha=false; if (olap->a_hang < 0) { int32 ha = (olap->normal == true) ? 
Hang_Adjust(-olap->a_hang, fadj, fadjLen) : Hang_Adjust(-olap->a_hang, radj, fadjLen); b_part += ha; //fprintf(stderr, "offset b_part by ha=%d normal=%d\n", ha, olap->normal); rha=true; } // Compute the alignment. int32 a_part_len = strlen(a_part); int32 b_part_len = strlen(b_part); int32 olap_len = min(a_part_len, b_part_len); int32 a_end = 0; int32 b_end = 0; bool match_to_end = false; //fprintf(stderr, ">A\n%s\n", a_part); //fprintf(stderr, ">B\n%s\n", b_part); int32 errors = Prefix_Edit_Dist(a_part, a_part_len, b_part, b_part_len, G->Error_Bound[olap_len], a_end, b_end, match_to_end, ped); // ped->delta isn't used. // ?? These both occur, but the first is much much more common. if ((ped->deltaLen > 0) && (ped->delta[0] == 1) && (0 < G->olaps[thisOvl].a_hang)) { int32 stop = min(ped->deltaLen, (int32)G->olaps[thisOvl].a_hang); // a_hang is int32:31! int32 i = 0; for (i=0; (i < stop) && (ped->delta[i] == 1); i++) ; //fprintf(stderr, "RESET 1 i=%d delta=%d\n", i, ped->delta[i]); assert((i == stop) || (ped->delta[i] != -1)); ped->deltaLen -= i; memmove(ped->delta, ped->delta + i, ped->deltaLen * sizeof (int)); a_part += i; a_end -= i; a_part_len -= i; errors -= i; } else if ((ped->deltaLen > 0) && (ped->delta[0] == -1) && (G->olaps[thisOvl].a_hang < 0)) { int32 stop = min(ped->deltaLen, - G->olaps[thisOvl].a_hang); int32 i = 0; for (i=0; (i < stop) && (ped->delta[i] == -1); i++) ; //fprintf(stderr, "RESET 2 i=%d delta=%d\n", i, ped->delta[i]); assert((i == stop) || (ped->delta[i] != 1)); ped->deltaLen -= i; memmove(ped->delta, ped->delta + i, ped->deltaLen * sizeof (int)); b_part += i; b_end -= i; b_part_len -= i; errors -= i; } Total_Alignments_Ct++; int32 olapLen = min(a_end, b_end); if ((match_to_end == false) && (olapLen <= 0)) Failed_Alignments_Both_Ct++; if (match_to_end == false) Failed_Alignments_End_Ct++; if (olapLen <= 0) Failed_Alignments_Length_Ct++; if ((match_to_end == false) || (olapLen <= 0)) { Failed_Alignments_Ct++; #if 0 // I can't find any 
patterns in these errors. I thought that it was caused by the corrections, but I // found a case where no corrections were made and the alignment still failed. Perhaps it is differences // in the alignment code (the forward vs reverse prefix distance in overlapper vs only the forward here)? fprintf(stderr, "Redo_Olaps()--\n"); fprintf(stderr, "Redo_Olaps()--\n"); fprintf(stderr, "Redo_Olaps()-- Bad alignment errors %d a_end %d b_end %d match_to_end %d olapLen %d\n", errors, a_end, b_end, match_to_end, olapLen); fprintf(stderr, "Redo_Olaps()-- Overlap a_hang %d b_hang %d innie %d\n", olap->a_hang, olap->b_hang, olap->innie); fprintf(stderr, "Redo_Olaps()-- Reads a_id %u a_length %d b_id %u b_length %d\n", G->olaps[thisOvl].a_iid, G->reads[ G->olaps[thisOvl].a_iid ].basesLen, G->olaps[thisOvl].b_iid, G->reads[ G->olaps[thisOvl].b_iid ].basesLen); fprintf(stderr, "Redo_Olaps()-- A %s\n", a_part); fprintf(stderr, "Redo_Olaps()-- B %s\n", b_part); Display_Alignment(a_part, a_part_len, b_part, b_part_len, ped->delta, ped->deltaLen); fprintf(stderr, "\n"); #endif if (rha) rhaFail++; continue; } if (rha) rhaPass++; G->olaps[thisOvl].evalue = AS_OVS_encodeEvalue((double)errors / olapLen); //fprintf(stderr, "REDO - errors = %u / olapLep = %u -- %f\n", errors, olapLen, AS_OVS_decodeEvalue(G->olaps[thisOvl].evalue)); } } fprintf(stderr, "\n"); delete ped; delete readData; delete [] radj; delete [] fadj; delete [] rseq; delete [] fseq; delete Cfile; fprintf(stderr, "-- Release bases, adjusts and reads.\n"); delete [] G->bases; G->bases = NULL; delete [] G->adjusts; G->adjusts = NULL; delete [] G->reads; G->reads = NULL; fprintf(stderr, "Olaps Fwd " F_U64 "\n", olapsFwd); fprintf(stderr, "Olaps Rev " F_U64 "\n", olapsRev); fprintf(stderr, "Total: " F_U64 "\n", Total_Alignments_Ct); fprintf(stderr, "Failed: " F_U64 " (both)\n", Failed_Alignments_Both_Ct); fprintf(stderr, "Failed: " F_U64 " (either)\n", Failed_Alignments_Ct); fprintf(stderr, "Failed: " F_U64 " (match to end)\n", 
Failed_Alignments_End_Ct); fprintf(stderr, "Failed: " F_U64 " (negative length)\n", Failed_Alignments_Length_Ct); fprintf(stderr, "rhaFail %u rhaPass %u\n", rhaFail, rhaPass); } canu-1.6/src/overlapErrorAdjustment/correctOverlaps.C000066400000000000000000000156451314437614700231240ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2015-JUN-16 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-MAR-30 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "correctOverlaps.H" #include "Binomial_Bound.H" void Read_Olaps(coParameters *G, gkStore *gkpStore); void Correct_Frags(coParameters *G, gkStore *gkpStore); void Redo_Olaps(coParameters *G, gkStore *gkpStore); int main(int argc, char **argv) { coParameters *G = new coParameters(); argc = AS_configure(argc, argv); int arg = 1; int err = 0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { G->gkpStorePath = argv[++arg]; } else if (strcmp(argv[arg], "-R") == 0) { G->bgnID = atoi(argv[++arg]); G->endID = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-O") == 0) { // -F? -S Olap_Path G->ovlStorePath = argv[++arg]; } else if (strcmp(argv[arg], "-e") == 0) { G->errorRate = atof(argv[++arg]); } else if (strcmp(argv[arg], "-l") == 0) { G->minOverlap = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-c") == 0) { // For 'corrections' file input G->correctionsName = argv[++arg]; } else if (strcmp(argv[arg], "-o") == 0) { // For 'erates' output G->eratesName = argv[++arg]; } else if (strcmp(argv[arg], "-t") == 0) { // But we're not threaded! G->numThreads = atoi(argv[++arg]); } else { err++; } arg++; } if (G->gkpStorePath == NULL) fprintf(stderr, "ERROR: no input gatekeeper store (-G) supplied.\n"), err++; if (G->ovlStorePath == NULL) fprintf(stderr, "ERROR: no input overlap store (-O) supplied.\n"), err++; if (G->correctionsName == NULL) fprintf(stderr, "ERROR: no input read corrections file (-c) supplied.\n"), err++; if (G->eratesName == NULL) fprintf(stderr, "ERROR: no output erates file (-o) supplied.\n"), err++; if (err) { fprintf(stderr, "USAGE: %s [-d ] [-o ] [-q ]\n", argv[0]); fprintf(stderr, " [-x ] [-F OlapFile] [-S OlapStore]\n"); fprintf(stderr, " [-c ] [-e \n"); fprintf(stderr, " \n"); fprintf(stderr, "\n"); fprintf(stderr, "Recalculates overlaps for frags .. 
in\n"); fprintf(stderr, " using corrections in \n"); fprintf(stderr, "\n"); fprintf(stderr, "Options:\n"); fprintf(stderr, "-e specifies binary file to dump corrected erates to\n"); fprintf(stderr, " for later updating of olap store by update-erates \n"); fprintf(stderr, "-F specify file of sorted overlaps to use (in the format\n"); fprintf(stderr, " produced by get-olaps\n"); fprintf(stderr, "-o specifies name of file to which OVL messages go\n"); fprintf(stderr, "-q overlaps less than this error rate are\n"); fprintf(stderr, " automatically output\n"); fprintf(stderr, "-S specify the binary overlap store containing overlaps to use\n"); exit(1); } //fprintf (stderr, "Quality Threshold = %.2f%%\n", 100.0 * Quality_Threshold); // // Initialize Globals // fprintf(stderr, "Initializing.\n"); double MAX_ERRORS = 1 + (uint32)(G->errorRate * AS_MAX_READLEN); Initialize_Match_Limit(G->Edit_Match_Limit, G->errorRate, MAX_ERRORS); for (int32 i=0; i <= AS_MAX_READLEN; i++) G->Error_Bound[i] = (int)ceil(i * G->errorRate); // // // fprintf(stderr, "Opening gkpStore '%s'.\n", G->gkpStorePath); gkStore *gkpStore = gkStore::gkStore_open(G->gkpStorePath); if (G->bgnID < 1) G->bgnID = 1; if (gkpStore->gkStore_getNumReads() < G->endID) G->endID = gkpStore->gkStore_getNumReads(); // Load the reads for the overlaps we are going to be correcting, and apply corrections to them fprintf(stderr, "Correcting reads " F_U32 " to " F_U32 ".\n", G->bgnID, G->endID); Correct_Frags(G, gkpStore); // Load overlaps we're going to correct fprintf(stderr, "Loading overlaps.\n"); Read_Olaps(G, gkpStore); // Now sort them on the B iid. 
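The sort that follows, together with the second sort after Redo_Olaps, forms a sort/restore pair: overlaps are grouped by B read id so each B read is loaded once, then returned to on-disk order via the saved `order` index before the evalues are written. A sketch of the pattern with the struct trimmed to the fields the two comparators use:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Trimmed-down stand-in for Olap_Info_t: just the ids the comparators
// consult plus the saved input position.
struct Olap {
  unsigned a_iid, b_iid, order;
};

// Processing order: by B read id, ties broken by A read id, as in
// Olap_Info_t_by_bID (the real comparator also breaks ties on 'innie').
void sortByB(std::vector<Olap> &v) {
  std::sort(v.begin(), v.end(), [](const Olap &a, const Olap &b) {
    return (a.b_iid != b.b_iid) ? (a.b_iid < b.b_iid) : (a.a_iid < b.a_iid);
  });
}

// Output order: restore the original on-disk order, as in
// Olap_Info_t_by_Order.
void sortByOrder(std::vector<Olap> &v) {
  std::sort(v.begin(), v.end(), [](const Olap &a, const Olap &b) {
    return a.order < b.order;
  });
}
```

Recording `order` at load time is cheaper than a stable sort plus an inverse permutation, and it lets the evalue array be written back in exactly the order the overlap store expects.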
fprintf(stderr, "Sorting overlaps.\n"); #ifdef _GLIBCXX_PARALLEL __gnu_sequential::sort(G->olaps, G->olaps + G->olapsLen, Olap_Info_t_by_bID()); #else sort(G->olaps, G->olaps + G->olapsLen, Olap_Info_t_by_bID()); #endif // Recompute overlaps fprintf(stderr, "Recomputing overlaps.\n"); Redo_Olaps(G, gkpStore); gkpStore->gkStore_close(); gkpStore = NULL; // Sort the overlaps back into the original order fprintf(stderr, "Sorting overlaps.\n"); #ifdef _GLIBCXX_PARALLEL __gnu_sequential::sort(G->olaps, G->olaps + G->olapsLen, Olap_Info_t_by_Order()); #else sort(G->olaps, G->olaps + G->olapsLen, Olap_Info_t_by_Order()); #endif // Dump the new erates fprintf (stderr, "Saving corrected error rates to file %s\n", G->eratesName); { errno = 0; FILE *fp = fopen(G->eratesName, "w"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", G->eratesName, strerror(errno)), exit(1); AS_UTL_safeWrite(fp, &G->bgnID, "loid", sizeof(int32), 1); AS_UTL_safeWrite(fp, &G->endID, "hiid", sizeof(int32), 1); AS_UTL_safeWrite(fp, &G->olapsLen, "num", sizeof(uint64), 1); fprintf(stderr, "--Allocate " F_U64 " MB for output error rates.\n", (sizeof(uint16) * G->olapsLen) >> 20); uint16 *evalue = new uint16 [G->olapsLen]; for (int32 i=0; i<G->olapsLen; i++) evalue[i] = G->olaps[i].evalue; AS_UTL_safeWrite(fp, evalue, "evalue", sizeof(uint16), G->olapsLen); delete [] evalue; fclose(fp); } // Finished. //fprintf (stderr, "%d/%d failed/total alignments (%.1f%%)\n", // Failed_Alignments_Ct, Total_Alignments_Ct, // Total_Alignments_Ct == 0 ? 0.0 : (100.0 * Failed_Alignments_Ct) / Total_Alignments_Ct); delete G; fprintf(stderr, "\n"); fprintf(stderr, "Bye.\n"); exit(0); } canu-1.6/src/overlapErrorAdjustment/correctOverlaps.H000066400000000000000000000170241314437614700231220ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-MAY-02 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include #include "gkStore.H" #include "ovStore.H" #include <vector> using namespace std; #define Sign(a) ( ((a) > 0) - ((a) < 0) ) // Value to add for a match in finding branch points // 1.20 was the calculated value for 6% vs 35% error discrimination // Converting to integers didn't make it faster #define BRANCH_PT_MATCH_VALUE 0.272 // Value to add for a mismatch in finding branch points // -2.19 was the calculated value for 6% vs 35% error discrimination // Converting to integers didn't make it faster #define BRANCH_PT_ERROR_VALUE -0.728 // Default value for End_Exclude_Len #define DEFAULT_END_EXCLUDE_LEN 3 // Default value for bases on each side of SNP to vote for change #define DEFAULT_HALF_LEN 4 // Default value for Kmer_Len #define DEFAULT_KMER_LEN 9 // Default value for Quality_Threshold #define DEFAULT_QUALITY_THRESHOLD 0.015 // Probability limit to "band" edit-distance calculation // Determines NORMAL_DISTRIB_THOLD #define EDIT_DIST_PROB_BOUND 1e-4 // The number of errors that are ignored in setting probability // bound for terminating alignment extensions in edit distance // calculations #define ERRORS_FOR_FREE 1 // Most bytes allowed in line of fasta file #define MAX_FASTA_LINE 2048 // Longest name allowed for a file in the overlap store #define MAX_FILENAME_LEN 1000 // Most errors in any edit distance
computation // 0.40 // KNOWN ONLY AT RUN TIME //#define MAX_ERRORS (1 + (int) (AS_OVL_ERROR_RATE * AS_MAX_READLEN)) // Factor by which to grow memory in olap array when reading it #define EXPANSION_FACTOR 1.4 // Branch points must be at least this many bases from the // end of the fragment to be reported #define MIN_BRANCH_END_DIST 20 // Branch point tails must fall off from the max by at least // this rate #define MIN_BRANCH_TAIL_SLOPE 0.20 // Determined by EDIT_DIST_PROB_BOUND #define NORMAL_DISTRIB_THOLD 3.62 struct Adjust_t { int32 adjpos; int32 adjust; }; class Frag_Info_t { public: Frag_Info_t() { bases = NULL; basesLen = 0; adjusts = NULL; adjustsLen = 0; keep_left = false; keep_right = false; }; ~Frag_Info_t() { }; char *bases; Adjust_t *adjusts; uint32 basesLen; uint32 adjustsLen; uint32 keep_right : 1; // I think these are unused. uint32 keep_left : 1; // If so, we get back 8 bytes. If not, redo Correct_Frags to use a temporary, and use 31 bits for the two lengths. }; class Olap_Info_t { public: Olap_Info_t() { a_iid = 0; b_iid = 0; a_hang = 0; b_hang = 0; innie = false; normal = false; order = 0; evalue = 0; }; ~Olap_Info_t() {}; uint32 a_iid; uint32 b_iid; int64 a_hang : 31; int64 b_hang : 31; uint64 innie : 1; // was 'orient' with choice INNIE=0 or NORMAL=1 uint64 normal : 1; // so 'normal' always != 'innie' uint64 order; uint32 evalue; }; // Sort by increasing b_iid. // // It is possible, but unlikely, to have two overlaps to the same pair of reads, // if we overlap a5'-b3' and a3'-b5'. I think. 
// class Olap_Info_t_by_bID { public: inline bool operator()(const Olap_Info_t &a, const Olap_Info_t &b) { if (a.b_iid < b.b_iid) return(true); if (a.b_iid > b.b_iid) return(false); if (a.a_iid < b.a_iid) return(true); if (a.a_iid > b.a_iid) return(false); return(a.innie != b.innie); }; }; class Olap_Info_t_by_Order { public: inline bool operator()(const Olap_Info_t &a, const Olap_Info_t &b) { return(a.order < b.order); }; }; class coParameters; class pedWorkArea_t { public: pedWorkArea_t() { G = NULL; memset(delta, 0, sizeof(int32) * AS_MAX_READLEN); memset(deltaStack, 0, sizeof(int32) * AS_MAX_READLEN); deltaLen = 0; Edit_Array_Lazy = NULL; }; ~pedWorkArea_t() { for (uint32 xx=0; xx < alloc.size(); xx++) delete [] alloc[xx]; delete [] Edit_Array_Lazy; }; void initialize(coParameters *G_, double errorRate) { G = G_; Edit_Array_Max = 1 + (uint32)(errorRate * AS_MAX_READLEN); // must be set before it is used below fprintf(stderr, "-- Allocate " F_U64 " MB for Edit_Array pointers.\n", (sizeof(int32 *) * Edit_Array_Max) >> 20); Edit_Array_Lazy = new int32 * [Edit_Array_Max]; memset(Edit_Array_Lazy, 0, sizeof(int32 *) * Edit_Array_Max); }; public: coParameters *G; int32 delta[AS_MAX_READLEN]; // Only need ERATE * READLEN int32 deltaStack[AS_MAX_READLEN]; int32 deltaLen; vector<int32 *> alloc; // Allocated blocks, don't use directly. int32 **Edit_Array_Lazy; // Doled out space.
int32 Edit_Array_Max; // Former MAX_ERRORS }; class coParameters { public: coParameters() { gkpStorePath = NULL; ovlStorePath = NULL; // Input read corrections, output overlap corrections correctionsName = NULL; eratesName = NULL; // Range of IDs to process bgnID = 0; endID = UINT32_MAX; bases = NULL; basesLen = 0; adjusts = NULL; adjustsLen = 0; reads = NULL; readsLen = 0; olaps = NULL; olapsLen = 0; numThreads = 1; errorRate = 0.06; minOverlap = 0; }; ~coParameters() { delete [] bases; delete [] adjusts; delete [] reads; delete [] olaps; }; // Paths to stores char *gkpStorePath; char *ovlStorePath; // Input read corrections, output overlap corrections char *correctionsName; char *eratesName; // Range of IDs to process uint32 bgnID; uint32 endID; char *bases; uint64 basesLen; Adjust_t *adjusts; uint64 adjustsLen; Frag_Info_t *reads; // These are relative to bgnID! uint32 readsLen; // Number of fragments being corrected Olap_Info_t *olaps; uint64 olapsLen; // Number of overlaps being used uint32 numThreads; // Only one thread supported. double errorRate; uint32 minOverlap; pedWorkArea_t ped; // Globals // This array[e] is the minimum value of Edit_Array[e][d] // to be worth pursuing in edit-distance computations between guides // (only MAX_ERRORS needed) int Edit_Match_Limit[AS_MAX_READLEN+1]; // This array[i] is the maximum number of errors allowed // in a match between sequences of length i , which is // i * MAXERROR_RATE . int Error_Bound[AS_MAX_READLEN + 1]; }; canu-1.6/src/overlapErrorAdjustment/correctOverlaps.mk000066400000000000000000000012761314437614700233440ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. 
ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := correctOverlaps SOURCES := correctOverlaps.C \ correctOverlaps-Correct_Frags.C \ correctOverlaps-Read_Olaps.C \ correctOverlaps-Redo_Olaps.C \ correctOverlaps-Prefix_Edit_Distance.C SRC_INCDIRS := .. ../AS_UTL ../stores ../overlapInCore/liboverlap TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/overlapErrorAdjustment/correctionOutput.H000066400000000000000000000024601314437614700233330ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ // Definitions for our exportable data. enum Vote_Value_t { IDENT, // Just an iid in this record. DELETE, A_SUBST, C_SUBST, G_SUBST, T_SUBST, // DON'T rearrange this! Code depends on the ordering. 
A_INSERT, C_INSERT, G_INSERT, T_INSERT, NO_VOTE, EXTENSION }; struct Correction_Output_t { uint32 keep_left : 1; // set true if left overlap degree is low uint32 keep_right : 1; // set true if right overlap degree is low uint32 type : 4; // Vote_Value_t uint32 pos : 26; // uint32 readID; }; canu-1.6/src/overlapErrorAdjustment/findErrors-Analyze_Alignment.C000066400000000000000000000232161314437614700254540ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz on 2015-JUN-18 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Sergey Koren beginning on 2016-MAR-22 * are a 'United States Government Work', and * are released in the public domain * * Brian P. Walenz beginning on 2016-MAY-18 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "findErrors.H" // Add vote val to G.reads[sub] at sequence position p static void Cast_Vote(feParameters *G, Vote_Value_t val, int32 pos, int32 sub) { int32 v=0; switch (val) { case DELETE: if (G->reads[sub].vote[pos].deletes < MAX_VOTE) v = ++G->reads[sub].vote[pos].deletes; break; case A_SUBST: if (G->reads[sub].vote[pos].a_subst < MAX_VOTE) v = ++G->reads[sub].vote[pos].a_subst; break; case C_SUBST: if (G->reads[sub].vote[pos].c_subst < MAX_VOTE) v = ++G->reads[sub].vote[pos].c_subst; break; case G_SUBST: if (G->reads[sub].vote[pos].g_subst < MAX_VOTE) v = ++G->reads[sub].vote[pos].g_subst; break; case T_SUBST: if (G->reads[sub].vote[pos].t_subst < MAX_VOTE) v = ++G->reads[sub].vote[pos].t_subst; break; case A_INSERT: if (G->reads[sub].vote[pos].a_insert < MAX_VOTE) v = ++G->reads[sub].vote[pos].a_insert; break; case C_INSERT: if (G->reads[sub].vote[pos].c_insert < MAX_VOTE) v = ++G->reads[sub].vote[pos].c_insert; break; case G_INSERT: if (G->reads[sub].vote[pos].g_insert < MAX_VOTE) v = ++G->reads[sub].vote[pos].g_insert; break; case T_INSERT: if (G->reads[sub].vote[pos].t_insert < MAX_VOTE) v = ++G->reads[sub].vote[pos].t_insert; break; case NO_VOTE: break; default : fprintf(stderr, "ERROR: Illegal vote type\n"); break; } // Largely useless, just too much output. //fprintf(stderr, "Cast_Vote()-- sub %d at %d vote %d\n", sub, p, val); } // Return the substitution vote corresponding to Ch . static Vote_Value_t Matching_Vote(char ch) { switch (ch) { case 'a': return(A_SUBST); break; case 'c': return(C_SUBST); break; case 'g': return(G_SUBST); break; case 't': return(T_SUBST); break; } fprintf(stderr, "Matching_Vote()-- invalid letter '%c'\n", ch); return(NO_VOTE); } // Analyze the delta-encoded alignment in delta[0 .. (deltaLen - 1)] // between a_part and b_part and store the resulting votes // about the a sequence in G->reads[sub]. The alignment starts // a_offset bytes in from the start of the a sequence in G->reads[sub] . 
// a_len and b_len are the lengths of the prefixes of a_part and // b_part , resp., that align. void Analyze_Alignment(Thread_Work_Area_t *wa, char *a_part, int32 a_len, int32 a_offset, char *b_part, int32 b_len, int32 sub) { assert(a_len >= 0); assert(b_len >= 0); int32 ct = 0; // Necessary?? //memset(wa->globalvote, 0, sizeof(Vote_t) * AS_MAX_READLEN); wa->globalvote[ct].frag_sub = -1; wa->globalvote[ct].align_sub = -1; wa->globalvote[ct].vote_val = A_SUBST; // Dummy value ct++; int32 i = 0; int32 j = 0; int32 p = 0; for (int32 k=0; k<wa->ped.deltaLen; k++) { //fprintf(stderr, "k=%d deltalen=%d i=%d out of %d j=%d out of %d\n", k, wa->ped.deltaLen, i, a_len, j, b_len); // Add delta[k] matches or mismatches for (int32 m=1; m<abs(wa->ped.delta[k]); m++) { if (a_part[i] != b_part[j]) { wa->globalvote[ct].frag_sub = i; wa->globalvote[ct].align_sub = p; switch (b_part[j]) { case 'a': wa->globalvote[ct].vote_val = A_SUBST; break; case 'c': wa->globalvote[ct].vote_val = C_SUBST; break; case 'g': wa->globalvote[ct].vote_val = G_SUBST; break; case 't': wa->globalvote[ct].vote_val = T_SUBST; break; default : fprintf(stderr, "ERROR:[1] Bad sequence '%c' 0x%02x)\n", b_part[j], b_part[j]); assert(0); } ct++; } i++; //assert(i <= a_len); j++; //assert(j <= b_len); p++; } // If a negative delta, insert a base. if (wa->ped.delta[k] < 0) { wa->globalvote[ct].frag_sub = i - 1; wa->globalvote[ct].align_sub = p; //fprintf(stderr, "INSERT %c at %d #%d\n", b_part[j], i-1, p); switch (b_part[j]) { case 'a': wa->globalvote[ct].vote_val = A_INSERT; break; case 'c': wa->globalvote[ct].vote_val = C_INSERT; break; case 'g': wa->globalvote[ct].vote_val = G_INSERT; break; case 't': wa->globalvote[ct].vote_val = T_INSERT; break; default : fprintf(stderr, "ERROR:[2] Bad sequence '%c' 0x%02x)\n", b_part[j], b_part[j]); assert(0); } ct++; j++; //assert(j <= b_len); p++; } // If a positive delta, delete the base.
if (wa->ped.delta[k] > 0) { wa->globalvote[ct].frag_sub = i; wa->globalvote[ct].align_sub = p; wa->globalvote[ct].vote_val = DELETE; //fprintf(stderr, "DELETE %c at %d #%d\n", a_part[i], i, p); ct++; i++; assert(i <= a_len); p++; } } // No more deltas. While there is still sequence, add matches or mismatches. //fprintf(stderr, "k=done i=%d out of %d j=%d out of %d\n", i, a_len, j, b_len); while (i < a_len) { //fprintf(stderr, "k=done i=%d out of %d j=%d out of %d\n", i, a_len, j, b_len); if (a_part[i] != b_part[j]) { wa->globalvote[ct].frag_sub = i; wa->globalvote[ct].align_sub = p; switch (b_part[j]) { case 'a': wa->globalvote[ct].vote_val = A_SUBST; break; case 'c': wa->globalvote[ct].vote_val = C_SUBST; break; case 'g': wa->globalvote[ct].vote_val = G_SUBST; break; case 't': wa->globalvote[ct].vote_val = T_SUBST; break; default : fprintf(stderr, "ERROR:[3] Bad sequence '%c' 0x%02x)\n", b_part[j], b_part[j]); assert(0); } ct++; } i++; //assert(i <= a_len); // Guaranteed, we're looping on this j++; //assert(j <= b_len); p++; } wa->globalvote[ct].frag_sub = i; wa->globalvote[ct].align_sub = p; // For each identified change, add votes for some region around the change. // // This is adding extra votes if the distance between two errors is larger than a kmer. // Not sure why there are no 'matching bases' in this region. // // X == changes, mismatch or indel // // ------- <- confirmed count added // ----- <- no_insert count added // matching-bases} X 1 2 3 1 2 3 4 3 2 1 X {matching-bases // ----- ----- // match match // votes votes // for (int32 i=1; i<=ct; i++) { int32 prev_match = wa->globalvote[i].align_sub - wa->globalvote[i - 1].align_sub - 1; int32 p_lo = (i == 1 ? 0 : wa->G->End_Exclude_Len); int32 p_hi = (i == ct ? prev_match : prev_match - wa->G->End_Exclude_Len); // If distance to previous match is bigger than 'kmer' size, make a new vote.
if (prev_match >= wa->G->Kmer_Len) { for (int32 p=0; p<p_lo; p++) Cast_Vote(wa->G, Matching_Vote(a_part[wa->globalvote[i-1].frag_sub + p + 1]), a_offset + wa->globalvote[i-1].frag_sub + p + 1, sub); for (int32 p=p_lo; p<p_hi; p++) { int32 k = a_offset + wa->globalvote[i-1].frag_sub + p + 1; if (wa->G->reads[sub].vote[k].confirmed < MAX_VOTE) wa->G->reads[sub].vote[k].confirmed++; if ((p < p_hi - 1) && (wa->G->reads[sub].vote[k].no_insert < MAX_VOTE)) wa->G->reads[sub].vote[k].no_insert++; } for (int32 p=p_hi; p<prev_match; p++) Cast_Vote(wa->G, Matching_Vote(a_part[wa->globalvote[i-1].frag_sub + p + 1]), a_offset + wa->globalvote[i-1].frag_sub + p + 1, sub); } // Don't allow consecutive inserts. If we aren't the last change, and there is non-adjacent // previous (or this and the previous votes are not insertions), do another vote. if ((i < ct) && ((prev_match > 0) || (wa->globalvote[i-1].vote_val <= T_SUBST) || (wa->globalvote[i ].vote_val <= T_SUBST))) { int32 next_match = wa->globalvote[i + 1].align_sub - wa->globalvote[i].align_sub - 1; // if our vote is outside of the bounds (meaning we have gaps at the start or end of the alignment), skip the vote if (a_offset + wa->globalvote[i].frag_sub < 0 || a_offset + wa->globalvote[i].frag_sub >= a_len) { continue; } if (prev_match + next_match >= wa->G->Vote_Qualify_Len) Cast_Vote(wa->G, wa->globalvote[i].vote_val, a_offset + wa->globalvote[i].frag_sub, sub); } } } canu-1.6/src/overlapErrorAdjustment/findErrors-Dump.C000066400000000000000000000050511314437614700227550ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587.
* Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-MAY-20 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "findErrors.H" int main(int argc, char **argv) { char *redName = NULL; argc = AS_configure(argc, argv); int arg = 1; int err = 0; while (arg < argc) { if (strcmp(argv[arg], "-r") == 0) { redName = argv[++arg]; } else { fprintf(stderr, "Unknown option '%s'\n", argv[arg]); err++; } arg++; } if (redName == NULL) err++; if (err > 0) { fprintf(stderr, "usage: %s -r file.red\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, "Dumps, as ASCII, the results from findErrors *.red files.\n"); if (redName == NULL) fprintf(stderr, "ERROR: no *.red file (-r) supplied.\n"); exit(1); } char *typeName[13] = { "IDENT", "DELETE", "A_SUBST", "C_SUBST", "G_SUBST", "T_SUBST", "A_INSERT", "C_INSERT", "G_INSERT", "T_INSERT", "NO_VOTE", "EXTENSION", NULL }; memoryMappedFile *Cfile = new memoryMappedFile(redName); Correction_Output_t *C = (Correction_Output_t *)Cfile->get(); uint64 Cpos = 0; uint64 Clen = Cfile->length() / sizeof(Correction_Output_t); for (uint32 ii=0; ii%d\n", G->bgnID + i); for (uint32 j=0; G->reads[i].sequence[j] != '\0'; j++) fprintf(stderr, "%3d: %c conf %3d deletes %3d | subst %3d %3d %3d %3d | no_insert %3d insert %3d %3d %3d %3d\n", j, j >= G->reads[i].clear_len ? 
toupper (G->reads[i].sequence[j]) : G->reads[i].sequence[j], G->reads[i].vote[j].confirmed, G->reads[i].vote[j].deletes, G->reads[i].vote[j].a_subst, G->reads[i].vote[j].c_subst, G->reads[i].vote[j].g_subst, G->reads[i].vote[j].t_subst, G->reads[i].vote[j].no_insert, G->reads[i].vote[j].a_insert, G->reads[i].vote[j].c_insert, G->reads[i].vote[j].g_insert, G->reads[i].vote[j].t_insert); } void Output_Corrections(feParameters *G) { Correction_Output_t out; errno = 0; FILE *fp = fopen(G->outputFileName, "wb"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", G->outputFileName, strerror(errno)), exit(1); for (uint32 i=0; i<G->readsLen; i++) { //if (i == 0) // Output_Details(G, i); out.keep_left = (G->reads[i].left_degree < G->Degree_Threshold); out.keep_right = (G->reads[i].right_degree < G->Degree_Threshold); out.type = IDENT; out.pos = 0; out.readID = G->bgnID + i; //fprintf(stderr, "read %d clear_len %d\n", i, G->reads[i].clear_len); AS_UTL_safeWrite(fp, &out, "correction1", sizeof(Correction_Output_t), 1); if (G->reads[i].sequence == NULL) // Deleted fragment continue; for (uint32 j=0; j<G->reads[i].clear_len; j++) { if (G->reads[i].vote[j].confirmed < 2) { Vote_Value_t vote = DELETE; int32 max = G->reads[i].vote[j].deletes; bool is_change = true; if (G->reads[i].vote[j].a_subst > max) { vote = A_SUBST; max = G->reads[i].vote[j].a_subst; is_change = (G->reads[i].sequence[j] != 'a'); } if (G->reads[i].vote[j].c_subst > max) { vote = C_SUBST; max = G->reads[i].vote[j].c_subst; is_change = (G->reads[i].sequence[j] != 'c'); } if (G->reads[i].vote[j].g_subst > max) { vote = G_SUBST; max = G->reads[i].vote[j].g_subst; is_change = (G->reads[i].sequence[j] != 'g'); } if (G->reads[i].vote[j].t_subst > max) { vote = T_SUBST; max = G->reads[i].vote[j].t_subst; is_change = (G->reads[i].sequence[j] != 't'); } int32 haplo_ct = ((G->reads[i].vote[j].deletes >= MIN_HAPLO_OCCURS) + (G->reads[i].vote[j].a_subst >= MIN_HAPLO_OCCURS) + (G->reads[i].vote[j].c_subst >= MIN_HAPLO_OCCURS)
+ (G->reads[i].vote[j].g_subst >= MIN_HAPLO_OCCURS) + (G->reads[i].vote[j].t_subst >= MIN_HAPLO_OCCURS)); int32 total = (G->reads[i].vote[j].deletes + G->reads[i].vote[j].a_subst + G->reads[i].vote[j].c_subst + G->reads[i].vote[j].g_subst + G->reads[i].vote[j].t_subst); // The original had a gargantuan if test (five clauses, all had to be true) to decide if a record should be output. // It was negated into many small tests of whether we should skip the output. // A side effect is that we can abort a little earlier in two cases (and we don't even bother). //fprintf(stderr, "TEST read %d position %d type %d -- ", i, j, vote); // (total > 1) if (total <= 1) { //fprintf(stderr, "FEW total = %d <= 1\n", total); continue; } // (2 * max > total) if (2 * max <= total) { //fprintf(stderr, "WEAK 2*max = %d <= total = %d\n", 2*max, total); continue; } // (is_change == true) if (is_change == false) { //fprintf(stderr, "SAME is_change = %d\n", is_change); continue; } // ((haplo_ct < 2) || (G->Use_Haplo_Ct == false)) if ((haplo_ct >= 2) && (G->Use_Haplo_Ct == true)) { //fprintf(stderr, "HAPLO haplo_ct=%d >= 2 AND Use_Haplo_Ct = %d\n", haplo_ct, G->Use_Haplo_Ct); continue; } // ((G->reads[i].vote[j].confirmed == 0) || // ((G->reads[i].vote[j].confirmed == 1) && (max > 6))) if ((G->reads[i].vote[j].confirmed > 0) && ((G->reads[i].vote[j].confirmed != 1) || (max <= 6))) { //fprintf(stderr, "INDET confirmed = %d max = %d\n", G->reads[i].vote[j].confirmed, max); continue; } // Otherwise, output.
out.type = vote; out.pos = j; //fprintf(stderr, "CORRECT!\n"); AS_UTL_safeWrite(fp, &out, "correction2", sizeof(Correction_Output_t), 1); } // confirmed < 2 if (G->reads[i].vote[j].no_insert < 2) { Vote_Value_t ins_vote = A_INSERT; int32 ins_max = G->reads[i].vote[j].a_insert; if (ins_max < G->reads[i].vote[j].c_insert) { ins_vote = C_INSERT; ins_max = G->reads[i].vote[j].c_insert; } if (ins_max < G->reads[i].vote[j].g_insert) { ins_vote = G_INSERT; ins_max = G->reads[i].vote[j].g_insert; } if (ins_max < G->reads[i].vote[j].t_insert) { ins_vote = T_INSERT; ins_max = G->reads[i].vote[j].t_insert; } int32 ins_haplo_ct = ((G->reads[i].vote[j].a_insert >= MIN_HAPLO_OCCURS) + (G->reads[i].vote[j].c_insert >= MIN_HAPLO_OCCURS) + (G->reads[i].vote[j].g_insert >= MIN_HAPLO_OCCURS) + (G->reads[i].vote[j].t_insert >= MIN_HAPLO_OCCURS)); int32 ins_total = (G->reads[i].vote[j].a_insert + G->reads[i].vote[j].c_insert + G->reads[i].vote[j].g_insert + G->reads[i].vote[j].t_insert); //fprintf(stderr, "TEST read %d position %d type %d (insert) -- ", i, j, ins_vote); if (ins_total <= 1) { //fprintf(stderr, "FEW ins_total = %d <= 1\n", ins_total); continue; } if (2 * ins_max <= ins_total) { //fprintf(stderr, "WEAK 2*ins_max = %d <= ins_total = %d\n", 2*ins_max, ins_total); continue; } if ((ins_haplo_ct >= 2) && (G->Use_Haplo_Ct == true)) { //fprintf(stderr, "HAPLO ins_haplo_ct=%d >= 2 AND Use_Haplo_Ct = %d\n", ins_haplo_ct, G->Use_Haplo_Ct); continue; } if ((G->reads[i].vote[j].no_insert > 0) && ((G->reads[i].vote[j].no_insert != 1) || (ins_max <= 6))) { //fprintf(stderr, "INDET no_insert = %d ins_max = %d\n", G->reads[i].vote[j].no_insert, ins_max); continue; } // Otherwise, output.
out.type = ins_vote; out.pos = j; //fprintf(stderr, "INSERT!\n"); AS_UTL_safeWrite(fp, &out, "correction3", sizeof(Correction_Output_t), 1); } // insert < 2 } } fclose (fp); } canu-1.6/src/overlapErrorAdjustment/findErrors-Prefix_Edit_Distance.C000066400000000000000000000203011314437614700260570ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "findErrors.H" // Set delta to the entries indicating the insertions/deletions // in the alignment encoded in edit_array ending at position // edit_array[e][d]. row is the position in the first // string where the alignment ended. Set (*delta_len) to // the number of entries in delta . 
static void Compute_Delta(pedWorkArea_t *WA, int32 e, int32 d, int32 row) { int32 last = row; int32 stackLen = 0; for (int32 k=e; k>0; k--) { int32 from = d; int32 max = 1 + WA->Edit_Array_Lazy[k-1][d]; int32 j = WA->Edit_Array_Lazy[k-1][d-1]; if (j > max) { from = d-1; max = j; } j = 1 + WA->Edit_Array_Lazy[k-1][d+1]; if (j > max) { from = d+1; max = j; } if (from == d-1) { WA->deltaStack[stackLen++] = max - last - 1; d--; last = WA->Edit_Array_Lazy[k-1][from]; } else if (from == d+1) { WA->deltaStack[stackLen++] = last - (max - 1); d++; last = WA->Edit_Array_Lazy[k-1][from]; } } WA->deltaStack[stackLen++] = last + 1; for (int32 k=0, i=stackLen-1; i>0; i--) WA->delta[k++] = abs(WA->deltaStack[i]) * Sign(WA->deltaStack[i-1]); WA->deltaLen = stackLen - 1; } // Allocate another block of 64mb for edits // Needs to be at least: // 52,432 to handle 40% error at 64k overlap // 104,860 to handle 80% error at 64k overlap // 209,718 to handle 40% error at 256k overlap // 419,434 to handle 80% error at 256k overlap // 3,355,446 to handle 40% error at 4m overlap // 6,710,890 to handle 80% error at 4m overlap // Bigger means we can assign more than one Edit_Array[] in one allocation. uint32 EDIT_SPACE_SIZE = 16 * 1024 * 1024; static void Allocate_More_Edit_Space(pedWorkArea_t *WA) { // Determine the last allocated block, and the last assigned block int32 b = 0; // Last edit array assigned int32 e = 0; // Last edit array assigned more space int32 a = WA->alloc.size(); // First unallocated block while (WA->Edit_Array_Lazy[b] != NULL) b++; // Fill in the edit space array. Well, not quite yet. First, decide the minimum size. // // Element [0] can access from [-2] to [2] = 5 elements. // Element [1] can access from [-3] to [3] = 7 elements. // // Element [e] can access from [-2-e] to [2+e] = 5 + e * 2 elements // // So, our offset for this new block needs to put [e][0] at offset... 
int32 Offset = 2 + b; int32 Del = 6 + b * 2; int32 Size = EDIT_SPACE_SIZE; while (Size < Offset + Del) Size *= 2; // Allocate another block int32 *alloc = new int32 [Size]; WA->alloc.push_back(alloc); // And, now, fill in the edit space array. e = b; while ((Offset + Del < Size) && (e < WA->Edit_Array_Max)) { WA->Edit_Array_Lazy[e++] = alloc + Offset; Offset += Del; Del += 2; } if (e == b) fprintf(stderr, "Allocate_More_Edit_Space()-- ERROR: couldn't allocate enough space for even one more entry! e=%d\n", e); assert(e != b); //fprintf(stderr, "WorkArea %d allocates space %d of size %d for array %d through %d\n", WA->thread_id, a, Size, b, e-1); } // Return the minimum number of changes (inserts, deletes, replacements) // needed to match string A[0 .. (m-1)] with a prefix of string // T[0 .. (n-1)] if it's not more than Error_Limit . // // Put delta description of alignment in WA->delta and set // WA->deltaLen to the number of entries there if it's a complete // match. // Set A_End and T_End to the rightmost positions where the // alignment ended in A and T , respectively. // Set Match_To_End true if the match extended to the end // of at least one string; otherwise, set it false to indicate // a branch point. int32 Prefix_Edit_Dist(char *A, int32 m, char *T, int32 n, int32 Error_Limit, int32 &A_End, int32 &T_End, bool &Match_To_End, pedWorkArea_t *WA) { //assert (m <= n); int32 Best_d = 0; int32 Best_e = 0; int32 Longest = 0; WA->deltaLen = 0; int32 shorter = min(m, n); int32 Row = 0; while ((Row < shorter) && (A[Row] == T[Row])) Row++; if (WA->Edit_Array_Lazy[0] == NULL) Allocate_More_Edit_Space(WA); WA->Edit_Array_Lazy[0][0] = Row; // Exact match? 
if (Row == shorter) { A_End = Row; T_End = Row; Match_To_End = true; return(0); } int32 Left = 0; int32 Right = 0; double Max_Score = 0.0; int32 Max_Score_Len = 0; int32 Max_Score_Best_d = 0; int32 Max_Score_Best_e = 0; for (int32 e=1; e<=Error_Limit; e++) { if (WA->Edit_Array_Lazy[e] == NULL) Allocate_More_Edit_Space(WA); Left = max(Left - 1, -e); Right = min(Right + 1, e); WA->Edit_Array_Lazy[e-1][Left] = -2; WA->Edit_Array_Lazy[e-1][Left-1] = -2; WA->Edit_Array_Lazy[e-1][Right] = -2; WA->Edit_Array_Lazy[e-1][Right+1] = -2; for (int32 d=Left; d<=Right; d++) { Row = 1 + WA->Edit_Array_Lazy[e-1][d]; Row = max(Row, WA->Edit_Array_Lazy[e-1][d-1]); Row = max(Row, WA->Edit_Array_Lazy[e-1][d+1] + 1); while ((Row < m) && (Row + d < n) && (A[Row] == T[Row + d])) Row++; assert(e < WA->Edit_Array_Max); WA->Edit_Array_Lazy[e][d] = Row; if (Row == m || Row + d == n) { // Force last error to be mismatch rather than insertion if ((Row == m) && (1 + WA->Edit_Array_Lazy[e-1][d+1] == WA->Edit_Array_Lazy[e][d]) && (d < Right)) { d++; WA->Edit_Array_Lazy[e][d] = WA->Edit_Array_Lazy[e][d-1]; } A_End = Row; // One past last align position T_End = Row + d; Match_To_End = true; Compute_Delta(WA, e, d, Row); return(e); } } while (Left <= Right && Left < 0 && WA->Edit_Array_Lazy[e][Left] < WA->G->Edit_Match_Limit[e]) Left++; if (Left >= 0) while (Left <= Right && WA->Edit_Array_Lazy[e][Left] + Left < WA->G->Edit_Match_Limit[e]) Left++; if (Left > Right) break; while (Right > 0 && WA->Edit_Array_Lazy[e][Right] + Right < WA->G->Edit_Match_Limit[e]) Right--; if (Right <= 0) while (WA->Edit_Array_Lazy[e][Right] < WA->G->Edit_Match_Limit[e]) Right--; assert (Left <= Right); for (int32 d=Left; d <= Right; d++) if (WA->Edit_Array_Lazy[e][d] > Longest) { Best_d = d; Best_e = e; Longest = WA->Edit_Array_Lazy[e][d]; } int32 Score = Longest * BRANCH_PT_MATCH_VALUE - e; // Assumes BRANCH_PT_MATCH_VALUE - BRANCH_PT_ERROR_VALUE == 1.0 // CorrectOverlaps didn't have the second clause. 
// Neither did overlapper. if ((Score > Max_Score) && (Best_e <= WA->G->Error_Bound[min(Longest, Longest + Best_d)])) { Max_Score = Score; Max_Score_Len = Longest; Max_Score_Best_d = Best_d; Max_Score_Best_e = Best_e; } } // CorrectOverlaps doesn't have this call Compute_Delta(WA, Max_Score_Best_e, Max_Score_Best_d, Max_Score_Len); A_End = Max_Score_Len; T_End = Max_Score_Len + Max_Score_Best_d; Match_To_End = false; // CorrectOverlaps was returning just 'e', not the best as below. // It used this return value to compute the new error rate. return(Max_Score_Best_e); } canu-1.6/src/overlapErrorAdjustment/findErrors-Process_Olap.C000066400000000000000000000124141314437614700244420ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-JUN-27 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "findErrors.H" #include "AS_UTL_reverseComplement.H" int32 Prefix_Edit_Dist(char *A, int m, char *T, int n, int Error_Limit, int32 &A_End, int32 &T_End, bool &Match_To_End, pedWorkArea_t * wa); void Analyze_Alignment(Thread_Work_Area_t *wa, char *a_part, int32 a_len, int32 a_offset, char *b_part, int32 b_len, int32 sub); // Find the alignment referred to in olap , where the a_iid // fragment is in Frag and the b_iid sequence is in b_seq . // Use the alignment to increment the appropriate vote fields // for the a fragment. shredded is true iff the b fragment // is from shredded data, in which case the overlap will be // ignored if the a fragment is also shredded. // rev_seq is a buffer to hold the reverse complement of b_seq // if needed. (* rev_id) is used to keep track of whether // rev_seq has been created yet. (* wa) is the work-area // containing space for the process to use in case of multi-threading. void Process_Olap(Olap_Info_t *olap, char *b_seq, bool shredded, Thread_Work_Area_t *wa) { #if 0 fprintf(stderr, "Process_Olap: %8d %8d %5d %5d %c\n", olap->a_iid, olap->b_iid, olap->a_hang, olap->b_hang, olap->innie == true ? 'I' : 'N'); #endif int32 ri = olap->a_iid - wa->G->bgnID; if ((shredded == true) && (wa->G->reads[ri].shredded == true)) return; char *a_part = wa->G->reads[ri].sequence; int32 a_offset = 0; char *b_part = (olap->normal == true) ? b_seq : wa->rev_seq; int32 b_offset = 0; // If innie, reverse-complement the B sequence. if ((olap->innie == true) && (wa->rev_id != olap->b_iid)) { strcpy(b_part, b_seq); reverseComplementSequence(b_part, 0); wa->rev_id = olap->b_iid; } // Adjust for hangs. if (olap->a_hang > 0) { a_offset = olap->a_hang; a_part += a_offset; } if (olap->a_hang < 0) { b_offset = -olap->a_hang; b_part += b_offset; } // Count degree - just how many times we cover the end of the read? 
if ((olap->a_hang <= 0) && (wa->G->reads[ri].left_degree < MAX_DEGREE)) wa->G->reads[ri].left_degree++; if ((olap->b_hang >= 0) && (wa->G->reads[ri].right_degree < MAX_DEGREE)) wa->G->reads[ri].right_degree++; // Get the alignment uint32 a_part_len = strlen(a_part); uint32 b_part_len = strlen(b_part); int32 a_end = 0; int32 b_end = 0; uint32 olap_len = min(a_part_len, b_part_len); bool match_to_end = false; //fprintf(stderr, "A: offset %d length %d\n", a_offset, a_part_len); //fprintf(stderr, "B: offset %d length %d\n", b_offset, b_part_len); int32 errors = Prefix_Edit_Dist(a_part, a_part_len, b_part, b_part_len, wa->G->Error_Bound[olap_len], a_end, b_end, match_to_end, &wa->ped); if ((a_end < 0) || (a_end > a_part_len) || (b_end < 0) || (b_end > b_part_len)) { fprintf (stderr, "ERROR: Bad edit distance.\n"); fprintf (stderr, " errors = %d a_end = %d b_end = %d\n", errors, a_end, b_end); fprintf (stderr, " a_part_len = %d b_part_len = %d\n", a_part_len, b_part_len); fprintf (stderr, " a_iid = %d b_iid = %d match_to_end = %c\n", olap->a_iid, olap->b_iid, match_to_end ? 
'T' : 'F'); } assert(a_end >= 0); assert(a_end <= a_part_len); assert(b_end >= 0); assert(b_end <= b_part_len); //printf(" errors = %d delta_len = %d\n", errors, wa->ped.deltaLen); //printf(" a_align = %d/%d b_align = %d/%d\n", a_end, a_part_len, b_end, b_part_len); //Display_Alignment(a_part, a_end, b_part, b_end, wa->delta, wa->deltaLen, wa->G->reads[ri].clear_len - a_offset); if ((match_to_end == false) && (a_end + a_offset >= wa->G->reads[ri].clear_len - 1)) { olap_len = min(a_end, b_end); match_to_end = true; } if ((errors <= wa->G->Error_Bound[olap_len]) && (match_to_end == true)) { wa->passedOlaps++; Analyze_Alignment(wa, a_part, a_end, a_offset, b_part, b_end, ri); } else { wa->failedOlaps++; } } canu-1.6/src/overlapErrorAdjustment/findErrors-Read_Frags.C000066400000000000000000000102401314437614700240410ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-MAY-05 to 2015-JUN-03 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "findErrors.H" // Open and read fragments with IIDs from Lo_Frag_IID to // Hi_Frag_IID (INCLUSIVE) from gkpStore_Path and store them in // global Frag . // This shares lots of code with Extract_Needed_Frags. void Read_Frags(feParameters *G, gkStore *gkpStore) { // The original converted to lowercase, and made non-acgt be 'a'. char filter[256]; for (uint32 i=0; i<256; i++) filter[i] = 'a'; filter['A'] = filter['a'] = 'a'; filter['C'] = filter['c'] = 'c'; filter['G'] = filter['g'] = 'g'; filter['T'] = filter['t'] = 't'; // Count the number of bases, so we can do two gigantic allocations for // bases and votes. uint64 basesLength = 0; uint64 votesLength = 0; uint64 readsLoaded = 0; fprintf(stderr, "Read_Frags()-- from " F_U32 " through " F_U32 "\n", G->bgnID, G->endID); for (uint32 curID=G->bgnID; curID<=G->endID; curID++) { gkRead *read = gkpStore->gkStore_getRead(curID); basesLength += read->gkRead_sequenceLength() + 1; votesLength += read->gkRead_sequenceLength(); } uint64 totAlloc = (sizeof(char) * basesLength + sizeof(Vote_Tally_t) * basesLength + sizeof(Frag_Info_t) * G->readsLen); fprintf(stderr, "Read_Frags()-- allocate %lu MB for bases, votes and info, for %u reads of total length %lu (%.4f bytes/base)\n", totAlloc >> 20, G->endID - G->bgnID + 1, basesLength, (basesLength > 0) ? 
((double)totAlloc / basesLength) : 0.0); G->readBases = new char [basesLength]; G->readVotes = new Vote_Tally_t [votesLength]; // NO constructor, MUST INIT G->readsLen = G->endID - G->bgnID + 1; G->reads = new Frag_Info_t [G->readsLen]; // Has constructor, no need to init memset(G->readBases, 0, sizeof(char) * basesLength); memset(G->readVotes, 0, sizeof(Vote_Tally_t) * votesLength); basesLength = 0; votesLength = 0; gkReadData *readData = new gkReadData; for (uint32 curID=G->bgnID; curID<=G->endID; curID++) { gkRead *read = gkpStore->gkStore_getRead(curID); gkpStore->gkStore_loadReadData(read, readData); uint32 readLength = read->gkRead_sequenceLength(); char *readBases = readData->gkReadData_getSequence(); G->reads[curID - G->bgnID].sequence = G->readBases + basesLength; G->reads[curID - G->bgnID].vote = G->readVotes + votesLength; basesLength += readLength + 1; votesLength += readLength; readsLoaded += 1; for (uint32 bb=0; bb<readLength; bb++) G->reads[curID - G->bgnID].sequence[bb] = filter[readBases[bb]]; G->reads[curID - G->bgnID].sequence[readLength] = 0; // All good reads end. G->reads[curID - G->bgnID].clear_len = readLength; G->reads[curID - G->bgnID].shredded = false; G->reads[curID - G->bgnID].left_degree = 0; G->reads[curID - G->bgnID].right_degree = 0; } delete readData; fprintf(stderr, "Read_Frags()-- from " F_U32 " through " F_U32 " -- loaded " F_U64 " bases in " F_U64 " reads.\n", G->bgnID, G->endID-1, basesLength, readsLoaded); } canu-1.6/src/overlapErrorAdjustment/findErrors-Read_Olaps.C000066400000000000000000000044111314437614700240600ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-JUN-16 to 2015-JUN-25 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "findErrors.H" // Load overlaps with aIID from G->bgnID to G->endID. // Overlaps can be unsorted. void Read_Olaps(feParameters *G, gkStore *gkpStore) { ovStore *ovs = new ovStore(G->ovlStorePath, gkpStore); ovs->setRange(G->bgnID, G->endID); uint64 numolaps = ovs->numOverlapsInRange(); fprintf(stderr, "Read_Olaps()-- loading " F_U64 " overlaps.\n", numolaps); G->olaps = new Olap_Info_t [numolaps]; G->olapsLen = 0; ovOverlap olap(gkpStore); while (ovs->readOverlap(&olap)) { G->olaps[G->olapsLen].a_iid = olap.a_iid; G->olaps[G->olapsLen].b_iid = olap.b_iid; G->olaps[G->olapsLen].a_hang = olap.a_hang(); G->olaps[G->olapsLen].b_hang = olap.b_hang(); G->olaps[G->olapsLen].innie = (olap.flipped() == true); G->olaps[G->olapsLen].normal = (olap.flipped() == false); // These are violated if the innie/normal members are signed! 
assert(G->olaps[G->olapsLen].innie != G->olaps[G->olapsLen].normal); assert((G->olaps[G->olapsLen].innie == false) || (G->olaps[G->olapsLen].innie == true)); assert((G->olaps[G->olapsLen].normal == false) || (G->olaps[G->olapsLen].normal == true)); G->olapsLen++; } delete ovs; } canu-1.6/src/overlapErrorAdjustment/findErrors.C000066400000000000000000000355731314437614700220660ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-MAY-29 to 2015-JUL-01 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "findErrors.H" #include "Binomial_Bound.H" void Process_Olap(Olap_Info_t *olap, char *b_seq, bool shredded, Thread_Work_Area_t *wa); void Read_Frags(feParameters *G, gkStore *gkpStore); void Read_Olaps(feParameters *G, gkStore *gkpStore); void Output_Corrections(feParameters *G); // From overlapInCore.C int Binomial_Bound(int e, double p, int Start, double Limit); // Read fragments lo_frag..hi_frag (INCLUSIVE) from store and save the ids and sequences of those // with overlaps to fragments in global Frag . 
static void Extract_Needed_Frags(feParameters *G, gkStore *gkpStore, uint32 loID, uint32 hiID, Frag_List_t *fl, uint64 &nextOlap) { // The original converted to lowercase, and made non-acgt be 'a'. char filter[256]; for (uint32 i=0; i<256; i++) filter[i] = 'a'; filter['A'] = filter['a'] = 'a'; filter['C'] = filter['c'] = 'c'; filter['G'] = filter['g'] = 'g'; filter['T'] = filter['t'] = 't'; // Count the amount of stuff we're loading. fl->readsLen = 0; fl->basesLen = 0; uint64 lastOlap = nextOlap; uint32 ii = 0; // Index into reads arrays uint32 fi = G->olaps[lastOlap].b_iid; // Actual ID we're extracting assert(loID <= fi); fprintf(stderr, "\n"); fprintf(stderr, "Extract_Needed_Frags()-- Loading used reads between " F_U32 " and " F_U32 ", at overlap " F_U64 ".\n", fi, hiID, lastOlap); while (fi <= hiID) { gkRead *read = gkpStore->gkStore_getRead(fi); fl->readsLen += 1; fl->basesLen += read->gkRead_sequenceLength() + 1; // Advance to the next overlap lastOlap++; while ((lastOlap < G->olapsLen) && (G->olaps[lastOlap].b_iid == fi)) lastOlap++; fi = (lastOlap < G->olapsLen) ? G->olaps[lastOlap].b_iid : hiID + 1; } fprintf(stderr, "Extract_Needed_Frags()-- Loading reads for overlaps " F_U64 " to " F_U64 " (reads " F_U32 " bases " F_U64 ")\n", nextOlap, lastOlap, fl->readsLen, fl->basesLen); // Ensure there is space. if (fl->readsMax < fl->readsLen) { delete [] fl->readIDs; delete [] fl->readBases; //fprintf(stderr, "Extract_Needed_Frags()-- realloc reads from " F_U32 " to " F_U32 "\n", fl->readsMax, 12 * fl->readsLen / 10); fl->readIDs = new uint32 [12 * fl->readsLen / 10]; fl->readBases = new char * [12 * fl->readsLen / 10]; fl->readsMax = 12 * fl->readsLen / 10; } if (fl->basesMax < fl->basesLen) { delete [] fl->bases; //fprintf(stderr, "Extract_Needed_Frags()-- realloc bases from " F_U64 " to " F_U64 "\n", fl->basesMax, 12 * fl->basesLen / 10); fl->bases = new char [12 * fl->basesLen / 10]; fl->basesMax = 12 * fl->basesLen / 10; } // Load. 
This is complicated by loading only the reads that have overlaps we care about.

  fl->readsLen = 0;
  fl->basesLen = 0;

  gkReadData *readData = new gkReadData;

  ii = 0;
  fi = G->olaps[nextOlap].b_iid;

  assert(loID <= fi);

  while (fi <= hiID) {
    gkRead *read = gkpStore->gkStore_getRead(fi);

    fl->readIDs[ii]   = fi;
    fl->readBases[ii] = fl->bases + fl->basesLen;
    fl->basesLen     += read->gkRead_sequenceLength() + 1;

    gkpStore->gkStore_loadReadData(read, readData);

    uint32  readLen   = read->gkRead_sequenceLength();
    char   *readBases = readData->gkReadData_getSequence();

    for (uint32 bb=0; bb<readLen; bb++)
      fl->readBases[ii][bb] = filter[readBases[bb]];

    fl->readBases[ii][readLen] = 0;  //  All good reads end.

    ii++;

    //  Advance to the next overlap.

    nextOlap++;

    while ((nextOlap < G->olapsLen) && (G->olaps[nextOlap].b_iid == fi))
      nextOlap++;

    fi = (nextOlap < G->olapsLen) ? G->olaps[nextOlap].b_iid : hiID + 1;
  }

  delete readData;

  fl->readsLen = ii;

  if (fl->readsLen > 0)
    fprintf(stderr, "Extract_Needed_Frags()-- Loaded " F_U32 " reads (%.4f%%). Loaded IDs " F_U32 " through " F_U32 ".\n",
            fl->readsLen, 100.0 * fl->readsLen / (hiID - 1 - loID), fl->readIDs[0], fl->readIDs[fl->readsLen-1]);
  else
    fprintf(stderr, "Extract_Needed_Frags()-- Loaded " F_U32 " reads (%.4f%%).\n",
            fl->readsLen, 100.0 * fl->readsLen / (hiID - 1 - loID));
}



//  Process all old fragments in Internal_gkpStore.
//  Only do overlaps/corrections with fragments where
//    frag_iid % Num_PThreads == thread_id

void *
Threaded_Process_Stream(void *ptr) {
  Thread_Work_Area_t *wa = (Thread_Work_Area_t *)ptr;

  for (int32 i=0; i<wa->frag_list->readsLen; i++) {
    int32  skip_id = -1;

    while (wa->frag_list->readIDs[i] > wa->G->olaps[wa->nextOlap].b_iid) {
      if (wa->G->olaps[wa->nextOlap].b_iid != skip_id) {
        fprintf(stderr, "SKIP: b_iid = %d\n", wa->G->olaps[wa->nextOlap].b_iid);
        skip_id = wa->G->olaps[wa->nextOlap].b_iid;
      }
      wa->nextOlap++;
    }

    if (wa->frag_list->readIDs[i] != wa->G->olaps[wa->nextOlap].b_iid) {
      fprintf(stderr, "ERROR: Lists don't match\n");
      fprintf(stderr, "frag_list iid = %d  nextOlap = %d  i = %d\n",
              wa->frag_list->readIDs[i], wa->G->olaps[wa->nextOlap].b_iid, i);
      exit(1);
    }

    wa->rev_id = UINT32_MAX;

    while ((wa->nextOlap < wa->G->olapsLen) && (wa->G->olaps[wa->nextOlap].b_iid == wa->frag_list->readIDs[i])) {
      if (wa->G->olaps[wa->nextOlap].a_iid % wa->G->numThreads == wa->thread_id) {
        Process_Olap(wa->G->olaps + wa->nextOlap,
                     wa->frag_list->readBases[i],
                     false,  //  shredded
                     wa);
      }

      wa->nextOlap++;
    }
  }

  pthread_exit(ptr);

  return(NULL);
}



//  Read old fragments in gkpStore that have overlaps with
//  fragments in Frag.  Read a batch at a time and process them
//  with multiple pthreads.  Each thread processes all the old fragments
//  but only changes entries in Frag that correspond to its thread
//  ID.  Recomputes the overlaps and records the vote information about
//  changes to make (or not) to fragments in Frag .
static
void
Threaded_Stream_Old_Frags(feParameters *G, gkStore *gkpStore, uint64 &passedOlaps, uint64 &failedOlaps) {
  pthread_attr_t  attr;

  pthread_attr_init(&attr);
  pthread_attr_setstacksize(&attr, THREAD_STACKSIZE);

  pthread_t          *thread_id = new pthread_t          [G->numThreads];
  Thread_Work_Area_t *thread_wa = new Thread_Work_Area_t [G->numThreads];

  for (uint32 i=0; i<G->numThreads; i++) {
    thread_wa[i].thread_id   = i;
    thread_wa[i].loID        = 0;
    thread_wa[i].hiID        = 0;
    thread_wa[i].nextOlap    = 0;
    thread_wa[i].G           = G;
    thread_wa[i].frag_list   = NULL;
    thread_wa[i].rev_id      = UINT32_MAX;
    thread_wa[i].passedOlaps = 0;
    thread_wa[i].failedOlaps = 0;

    memset(thread_wa[i].rev_seq, 0, sizeof(char) * AS_MAX_READLEN);

    double MAX_ERRORS = 1 + (uint32)(G->errorRate * AS_MAX_READLEN);

    thread_wa[i].ped.initialize(G, G->errorRate);
  }

  uint32  loID  = G->olaps[0].b_iid;
  uint32  hiID  = loID + FRAGS_PER_BATCH - 1;
  uint32  endID = G->olaps[G->olapsLen - 1].b_iid;

  if (hiID > endID)
    hiID = endID;

  uint64  frstOlap = 0;
  uint64  nextOlap = 0;

  Frag_List_t  frag_list_1;
  Frag_List_t  frag_list_2;

  Frag_List_t *curr_frag_list = &frag_list_1;
  Frag_List_t *next_frag_list = &frag_list_2;

  Extract_Needed_Frags(G, gkpStore, loID, hiID, curr_frag_list, nextOlap);

  while (loID <= endID) {

    //  Process fragments in curr_frag_list in background

    for (uint32 i=0; i<G->numThreads; i++) {
      thread_wa[i].loID      = loID;
      thread_wa[i].hiID      = hiID;
      thread_wa[i].nextOlap  = frstOlap;
      thread_wa[i].frag_list = curr_frag_list;

      int status = pthread_create(thread_id + i, &attr, Threaded_Process_Stream, thread_wa + i);

      if (status != 0)
        fprintf(stderr, "pthread_create error: %s\n", strerror(status)), exit(1);
    }

    //  Read next batch of fragments

    loID = hiID + 1;

    if (loID <= endID) {
      hiID = loID + FRAGS_PER_BATCH - 1;

      if (hiID > endID)
        hiID = endID;

      frstOlap = nextOlap;

      Extract_Needed_Frags(G, gkpStore, loID, hiID, next_frag_list, nextOlap);
    }

    //  Wait for background processing to finish

    for (uint32 i=0; i<G->numThreads; i++) {
      void *ptr;

      int status = pthread_join(thread_id[i], &ptr);

      if (status != 0)
        fprintf(stderr, "pthread_join error: %s\n", strerror(status)), exit(1);
    }

    //  Swap the lists and compute another block

    {
      Frag_List_t *s = curr_frag_list;

      curr_frag_list = next_frag_list;
      next_frag_list = s;
    }
  }

  //  Threads all done, sum up stats.

  passedOlaps = 0;
  failedOlaps = 0;

  for (uint32 i=0; i<G->numThreads; i++) {
    passedOlaps += thread_wa[i].passedOlaps;
    failedOlaps += thread_wa[i].failedOlaps;
  }

  delete [] thread_id;
  delete [] thread_wa;
}


int
main(int argc, char **argv) {
  feParameters *G = new feParameters();

  argc = AS_configure(argc, argv);

  int arg = 1;
  int err = 0;

  while (arg < argc) {
    if        (strcmp(argv[arg], "-G") == 0) {
      G->gkpStorePath = argv[++arg];

    } else if (strcmp(argv[arg], "-R") == 0) {
      G->bgnID = atoi(argv[++arg]);
      G->endID = atoi(argv[++arg]);

    } else if (strcmp(argv[arg], "-O") == 0) {
      G->ovlStorePath = argv[++arg];

    } else if (strcmp(argv[arg], "-e") == 0) {
      G->errorRate = atof(argv[++arg]);

    } else if (strcmp(argv[arg], "-l") == 0) {
      G->minOverlap = atoi(argv[++arg]);

    } else if (strcmp(argv[arg], "-o") == 0) {  //  For 'corrections' file output
      G->outputFileName = argv[++arg];

    } else if (strcmp(argv[arg], "-t") == 0) {
      G->numThreads = atoi(argv[++arg]);

    } else if (strcmp(argv[arg], "-d") == 0) {
      G->Degree_Threshold = strtol(argv[++arg], NULL, 10);

    } else if (strcmp(argv[arg], "-k") == 0) {
      G->Kmer_Len = strtol(argv[++arg], NULL, 10);

    } else if (strcmp(argv[arg], "-p") == 0) {
      G->Use_Haplo_Ct = FALSE;

    } else if (strcmp(argv[arg], "-V") == 0) {
      G->Vote_Qualify_Len = strtol(argv[++arg], NULL, 10);

    } else if (strcmp(argv[arg], "-x") == 0) {
      G->End_Exclude_Len = strtol(argv[++arg], NULL, 10);

    } else {
      fprintf(stderr, "Unknown option '%s'\n", argv[arg]);
      err++;
    }

    arg++;
  }

  if (G->gkpStorePath == NULL)
    err++;
  if (G->ovlStorePath == NULL)
    err++;
  if (G->numThreads == 0)
    err++;

  if (err > 0) {
    fprintf(stderr, "usage: %s [-ehp][-d DegrThresh][-k KmerLen][-x ExcludeLen]\n", argv[0]);
    fprintf(stderr, "          [-F OlapFile][-S
OlapStore][-o CorrectFile]\n");
    fprintf(stderr, "          [-t NumPThreads][-v VerboseLevel]\n");
    fprintf(stderr, "          [-V Vote_Qualify_Len]\n");
    fprintf(stderr, "          <gkpStore> <lo> <hi>\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "Makes corrections to fragment sequence based on overlaps\n");
    fprintf(stderr, "and recomputes overlaps on corrected fragments\n");
    fprintf(stderr, "Fragments come from <gkpStore>; <lo> and <hi> specify\n");
    fprintf(stderr, "the range of fragments to modify\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "Options:\n");
    fprintf(stderr, "-d   set keep flag on end of frags with less than this many olaps\n");
    fprintf(stderr, "-F   specify file of sorted overlaps to use (in the format produced\n");
    fprintf(stderr, "     by get-olaps)\n");
    fprintf(stderr, "-h   print this message\n");
    fprintf(stderr, "-k   minimum exact-match region to prevent change\n");
    fprintf(stderr, "-o   specify output file to hold correction info\n");
    fprintf(stderr, "-p   don't use haplotype counts to correct\n");
    fprintf(stderr, "-S   specify the binary overlap store containing overlaps to use\n");
    fprintf(stderr, "-t   set number of p-threads to use\n");
    fprintf(stderr, "-v   specify level of verbose outputs, higher is more\n");
    fprintf(stderr, "-V   specify number of exact match bases around an error to vote to change\n");
    fprintf(stderr, "-x   length of end of exact match to exclude in preventing change\n");

    if (G->gkpStorePath == NULL)
      fprintf(stderr, "ERROR: no gatekeeper store (-G) supplied.\n");
    if (G->ovlStorePath == NULL)
      fprintf(stderr, "ERROR: no overlap store (-O) supplied.\n");
    if (G->numThreads == 0)
      fprintf(stderr, "ERROR: number of compute threads (-t) must be larger than zero.\n");

    exit(1);
  }

  //  Initialize Globals

  double MAX_ERRORS = 1 + (uint32)(G->errorRate * AS_MAX_READLEN);

  Initialize_Match_Limit(G->Edit_Match_Limit, G->errorRate, MAX_ERRORS);

  for (uint32 i = 0; i <= AS_MAX_READLEN; i++)
    G->Error_Bound[i] = (int)ceil(i * G->errorRate);

  //  Load data.
gkStore *gkpStore = gkStore::gkStore_open(G->gkpStorePath); if (G->bgnID < 1) G->bgnID = 1; if (gkpStore->gkStore_getNumReads() < G->endID) G->endID = gkpStore->gkStore_getNumReads(); Read_Frags(G, gkpStore); Read_Olaps(G, gkpStore); // Sort overlaps, process each. sort(G->olaps, G->olaps + G->olapsLen); uint64 passedOlaps = 0; uint64 failedOlaps = 0; Threaded_Stream_Old_Frags(G, gkpStore, passedOlaps, failedOlaps); // All done. Sum up what we did. fprintf(stderr, "\n"); fprintf(stderr, "Passed overlaps = %10" F_U64P " %8.4f%%\n", passedOlaps, 100.0 * passedOlaps / (failedOlaps + passedOlaps)); fprintf(stderr, "Failed overlaps = %10" F_U64P " %8.4f%%\n", failedOlaps, 100.0 * failedOlaps / (failedOlaps + passedOlaps)); // Dump output. //Output_Details(G); Output_Corrections(G); // Cleanup and exit! gkpStore->gkStore_close(); delete G; fprintf(stderr, "\n"); fprintf(stderr, "Bye.\n"); exit(0); } canu-1.6/src/overlapErrorAdjustment/findErrors.H000066400000000000000000000217631314437614700220670ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-JUN-27 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/

#include "AS_global.H"

#include <pthread.h>

#include "gkStore.H"
#include "ovStore.H"
#include "correctionOutput.H"

#include <vector>

using namespace std;


//  Private stuff

#define  Sign(a) ( ((a) > 0) - ((a) < 0) )

//  Value to add for a match in finding branch points
//  1.20 was the calculated value for 6% vs 35% error discrimination
//  Converting to integers didn't make it faster

#define  BRANCH_PT_MATCH_VALUE    0.272

//  Value to add for a mismatch in finding branch points
//  -2.19 was the calculated value for 6% vs 35% error discrimination
//  Converting to integers didn't make it faster

#define  BRANCH_PT_ERROR_VALUE   -0.728

//  Number of bits to store integer versions of error rates

#define  ERATE_BITS  16

//  The number of errors that are ignored in setting probability
//  bound for terminating alignment extensions in edit distance
//  calculations

#define  ERRORS_FOR_FREE  1

//  Factor by which to grow memory in olap array when reading it

#define  EXPANSION_FACTOR  1.4

//  Number of old fragments to read into memory-based fragment
//  store at a time for processing

#define  FRAGS_PER_BATCH  100000

//  Longest name allowed for a file in the overlap store

#define  MAX_FILENAME_LEN  1000

//  Most errors in any edit distance computation
//  KNOWN ONLY AT RUN TIME
//#define  MAX_ERRORS  (1 + (int) (AS_OVL_ERROR_RATE * AS_MAX_READLEN))

//  Highest number of overlaps at a fragment end before overflow

#define  MAX_DEGREE  32767

//  Highest number of votes before overflow

#define  MAX_VOTE  255

//  Branch points must be at least this many bases from the
//  end of the fragment to be reported

#define  MIN_BRANCH_END_DIST  20

//  Branch point tails must fall off from the max by at least
//  this rate

#define  MIN_BRANCH_TAIL_SLOPE  0.20

//  This many or more votes at the same base indicate
//  a separate haplotype

#define  MIN_HAPLO_OCCURS  3

//  The amount of memory to allocate for the stack of each thread

#define  THREAD_STACKSIZE  (128 * 512 * 512)


struct Vote_Tally_t {
  uint32  confirmed : 8;
  uint32  deletes   : 8;
  uint32  a_subst   : 8;
  uint32  c_subst   : 8;
uint32 g_subst : 8; uint32 t_subst : 8; uint32 no_insert : 8; uint32 a_insert : 8; uint32 c_insert : 8; uint32 g_insert : 8; uint32 t_insert : 8; }; struct Vote_t { int32 frag_sub; int32 align_sub; Vote_Value_t vote_val; }; class Frag_Info_t { public: Frag_Info_t() { sequence = NULL; vote = NULL; clear_len = 0; left_degree = 0; right_degree = 0; shredded = false; unused = false; }; ~Frag_Info_t() { }; char *sequence; Vote_Tally_t *vote; uint64 clear_len : 31; uint64 left_degree : 31; uint64 right_degree : 31; uint64 shredded : 1; // True if shredded read uint64 unused : 1; }; class Olap_Info_t { public: Olap_Info_t() { a_iid = 0; b_iid = 0; a_hang = 0; b_hang = 0; innie = false; normal = false; }; ~Olap_Info_t() {}; uint32 a_iid; uint32 b_iid; int64 a_hang : 31; int64 b_hang : 31; uint64 innie : 1; // was 'orient' with choice INNIE=0 or NORMAL=1 uint64 normal : 1; // so 'normal' always != 'innie' // Sort by increasing b_iid, then increasing a_iid. bool operator<(Olap_Info_t const &that) const { if (b_iid < that.b_iid) return(true); if (b_iid > that.b_iid) return(false); if (a_iid < that.a_iid) return(true); if (a_iid > that.a_iid) return(false); // It is possible, but unlikely, to have two overlaps to the same pair of reads, // if we overlap a5'-b3' and a3'-b5'. I think. 
    return(innie < that.innie);
  };
};


class Frag_List_t {
public:
  Frag_List_t() {
    readsMax  = 0;
    readsLen  = 0;
    readIDs   = NULL;
    readBases = NULL;

    basesMax  = 0;
    basesLen  = 0;
    bases     = NULL;
  };
  ~Frag_List_t() {
    delete [] readIDs;
    delete [] readBases;
    delete [] bases;
  };

  uint32   readsMax;
  uint32   readsLen;
  uint32  *readIDs;
  char   **readBases;

  uint64   basesMax;
  uint64   basesLen;
  char    *bases;      //  Read sequences, 0 terminated
};


class feParameters;


class pedWorkArea_t {
public:
  pedWorkArea_t() {
    G = NULL;

    memset(delta,      0, sizeof(int32) * AS_MAX_READLEN);
    memset(deltaStack, 0, sizeof(int32) * AS_MAX_READLEN);

    deltaLen = 0;

    Edit_Array_Lazy = NULL;
    Edit_Array_Max  = 0;
  };

  ~pedWorkArea_t() {
    for (uint32 xx=0; xx < alloc.size(); xx++)
      delete [] alloc[xx];

    delete [] Edit_Array_Lazy;
  };

  void  initialize(feParameters *G_, double errorRate) {
    G = G_;

    Edit_Array_Max  = 1 + (uint32)(errorRate * AS_MAX_READLEN);
    Edit_Array_Lazy = new int32 * [Edit_Array_Max];

    memset(Edit_Array_Lazy, 0, sizeof(int32 *) * Edit_Array_Max);
  };

public:
  feParameters *G;

  int32    delta[AS_MAX_READLEN];       //  Only need ERATE * READLEN
  int32    deltaStack[AS_MAX_READLEN];
  int32    deltaLen;

  vector<int32 *>   alloc;              //  Allocated blocks, don't use directly.
  int32   **Edit_Array_Lazy;            //  Doled out space.
  int32     Edit_Array_Max;             //  Former MAX_ERRORS
};


struct Thread_Work_Area_t {
  int32          thread_id;

  uint32         loID;
  uint32         hiID;

  uint64         nextOlap;

  feParameters  *G;

  Frag_List_t   *frag_list;

  char           rev_seq[AS_MAX_READLEN + 1];  //  Used in Process_Olap to hold RC of the B read
  uint32         rev_id;                       //  Ident of the rev_seq read.
Vote_t globalvote[AS_MAX_READLEN]; uint64 passedOlaps; uint64 failedOlaps; pedWorkArea_t ped; }; class feParameters { public: feParameters() { gkpStorePath = NULL; ovlStorePath = NULL; bgnID = 0; endID = UINT32_MAX; readBases = NULL; readVotes = NULL; reads = NULL; readsLen = 0; olaps = NULL; olapsLen = 0; outputFileName = NULL; numThreads = 4; errorRate = 0.06; minOverlap = 0; // Parameters // Output Degree_Threshold = 2; //, DEFAULT_DEGREE_THRESHOLD; Use_Haplo_Ct = TRUE; // Analyze_Alignment End_Exclude_Len = 3; //DEFAULT_END_EXCLUDE_LEN; Kmer_Len = 9; //DEFAULT_KMER_LEN; Vote_Qualify_Len = 9; //DEFAULT_VOTE_QUALIFY_LEN; }; ~feParameters() { delete [] readBases; delete [] readVotes; delete [] reads; delete [] olaps; }; // Paths to stores char *gkpStorePath; char *ovlStorePath; // Range of IDs to process uint32 bgnID; uint32 endID; char *readBases; Vote_Tally_t *readVotes; Frag_Info_t *reads; uint32 readsLen; // Number of fragments being corrected Olap_Info_t *olaps; uint64 olapsLen; // Number of overlaps being used char *outputFileName; uint32 numThreads; double errorRate; uint32 minOverlap; // This array [e] is the minimum value of Edit_Array [e] [d] to be worth pursuing in edit-distance // computations between guides (only MAX_ERRORS needed) int Edit_Match_Limit [AS_MAX_READLEN + 1]; // Set keep flag on end of fragment if number of olaps < this value int Degree_Threshold; // Set false by -h option to ignore haplotype counts when correcting int Use_Haplo_Ct; // Length of ends of exact-match regions not used in preventing sequence correction int End_Exclude_Len; // Length of minimum exact match in overlap to confirm base pairs int Kmer_Len; // Number of bases surrounding a SNP to vote for change int Vote_Qualify_Len; // This array [i] is the maximum number of errors allowed in a match between sequences of length // i , which is i * MAXERROR_RATE . 
int Error_Bound [AS_MAX_READLEN + 1]; }; canu-1.6/src/overlapErrorAdjustment/findErrors.mk000066400000000000000000000013561314437614700223030ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := findErrors SOURCES := findErrors.C \ findErrors-Analyze_Alignment.C \ findErrors-Output.C \ findErrors-Prefix_Edit_Distance.C \ findErrors-Process_Olap.C \ findErrors-Read_Frags.C \ findErrors-Read_Olaps.C SRC_INCDIRS := .. ../AS_UTL ../stores ../overlapInCore/liboverlap TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/overlapInCore-analysis/000077500000000000000000000000001314437614700175005ustar00rootroot00000000000000canu-1.6/src/overlapInCore-analysis/analyze-true-vs-test.pl000066400000000000000000000432441314437614700240670ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. Walenz beginning on 2015-OCT-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. 
##

use strict;

#  Reads true overlaps from the first file, separates the second file into
#  true/false based on iid.  Hangs are NOT checked.

my $truePrefix   = "BL";
my $testPrefix   = "CAordered";

my $trueOverlaps = "$truePrefix/test.ovlStore";
my $testOverlaps = "$testPrefix/test.ovlStore";

my $filtTrue     = "filtered.true";
my $filtFalse    = "filtered.false";

#
#  Load read lengths, make sure both assemblies are the same
#

my @readLengths;
my %contained;

open(F, "gatekeeper -dumpfragments -tabular $truePrefix/test.gkpStore |");
while (<F>) {
    my @v = split '\s+', $_;
    $readLengths[$v[1]] = $v[9];
}
close(F);

open(F, "gatekeeper -dumpfragments -tabular $testPrefix/test.gkpStore |");
while (<F>) {
    my @v = split '\s+', $_;
    die "IID $v[1] BL $readLengths[$v[1]] != CA $v[9]\n"  if ($readLengths[$v[1]] != $v[9]);
}
close(F);

open(F, "< $testPrefix/4-unitigger/best.contains");
while (<F>) {
    my @v = split '\s+', $_;
    $contained{$v[0]}++;
}
close(F);

#
#  Load true overlaps
#

my %true;          #  If set, id pair is a true overlap
my %trueComputed;  #  If set, id pair is a true overlap, and was computed

{
    my $last;      #  local storage
    my @lenIdent;  #  local storage of length-vs-ident for true overlaps.  ident is not known.
    open(F, "overlapStore -d $trueOverlaps |") or die "Failed to open '$trueOverlaps' for reading.\n";
    while (<F>) {
        s/^\s+//;
        s/\s+$//;

        my ($aIID, $bIID, $orient, $aHang, $bHang, $origErate, $corrErate) = split '\s+', $_;

        if ($last != $aIID) {
            open(O, "> iid$last.len-vs-ident.dat");
            print O @lenIdent;
            close(O);

            undef @lenIdent;
        }

        my $oLen = computeOverlapLength($aIID, $bIID, $aHang, $bHang);

        $true{"$aIID-$bIID"} = $oLen;
        $true{"$bIID-$aIID"} = $oLen;

        next if (exists($contained{$aIID}));
        next if (exists($contained{$bIID}));

        push @lenIdent, "$oLen\t$12.0\n";

        $last = $aIID;
    }
    close(F);
}

print STDERR "true: ", scalar(keys %true) / 2, "\n";

#
#  Process computed overlaps
#

my %trueH;
my %falseH;
my %keysH;

my @lenIdent5T;
my @lenIdent5F;
my @lenIdent3T;
my @lenIdent3F;

my $longestF = 0;
my $longestT = 0;

{
    my $last = 0;

    open(LT, "> all.true.length-vs-ident.dat");
    open(LF, "> all.false.length-vs-ident.dat");

    open(TO, "> all.true.filtered-overlaps.ova");
    open(FO, "> all.false.filtered-overlaps.ova");

    open(F, "overlapStore -d $testOverlaps |") or die "Failed to open '$testOverlaps' for reading.\n";
    while (<F>) {
        s/^\s+//;
        s/\s+$//;

        my ($aIID, $bIID, $orient, $aHang, $bHang, $origErate, $corrErate) = split '\s+', $_;

        if ($last != $aIID) {
            if (scalar(keys %falseH) > 0) {

                #  error rate histogram

                if (1) {
                    open(O, "> iid$last.eratehistogram.dat");
                    foreach my $ii (sort { $a <=> $b } keys %keysH) {
                        $trueH{$ii}  += 0.0;
                        $falseH{$ii} += 0.0;
                        print O "$ii\t$trueH{$ii}\t$falseH{$ii}\n";
                    }
                    close(O);

                    open(O, "| gnuplot > /dev/null 2> /dev/null");
                    print O "set terminal 'png'\n";
                    print O "set output 'iid$last.eratehistogram.png'\n";
                    print O "plot 'iid$last.eratehistogram.dat' using 1:3 with lines title 'FALSE',";
                    print O "     'iid$last.eratehistogram.dat' using 1:2 with lines title 'TRUE'\n";
                    close(O);
                }

                #  length vs identity

                if (1) {  #0.5 * $longestT < $longestF) {
                    #print STDERR "IID $last true: len=$longestT false: len=$longestF\n";

                    open(O, "> iid$last.len-vs-ident.5.true.dat");
                    print O @lenIdent5T;
close(O); open(O, "> iid$last.len-vs-ident.5.false.dat"); print O @lenIdent5F; close(O); open(O, "> iid$last.len-vs-ident.3.true.dat"); print O @lenIdent3T; close(O); open(O, "> iid$last.len-vs-ident.3.false.dat"); print O @lenIdent3F; close(O); open(O, "| gnuplot > /dev/null 2> /dev/null"); print O "set terminal 'png'\n"; print O "set output 'iid$last.len-vs-ident.png'\n"; print O "plot [10:35] [0:15000]"; print O " 'iid$last.len-vs-ident.5.false.dat' using 2:1 title 'FALSE 5' pt 1 lc 1,"; print O " 'iid$last.len-vs-ident.5.true.dat' using 2:1 title 'TRUE 5' pt 1 lc 2,"; print O " 'iid$last.len-vs-ident.3.false.dat' using 2:1 title 'FALSE 3' pt 2 lc 1,"; print O " 'iid$last.len-vs-ident.3.true.dat' using 2:1 title 'TRUE 3' pt 2 lc 2,"; print O " 'iid$last.len-vs-ident.dat' using 2:1 title 'MISSED' pt 3 lc 3\n"; print O "set output 'iid$last.5.len-vs-ident.png'\n"; print O "plot [10:35] [0:15000]"; print O " 'iid$last.len-vs-ident.5.false.dat' using 2:1 title 'FALSE 5' pt 1 lc 1,"; print O " 'iid$last.len-vs-ident.5.true.dat' using 2:1 title 'TRUE 5' pt 1 lc 2\n"; print O "set output 'iid$last.3.len-vs-ident.png'\n"; print O "plot [10:35] [0:15000]"; print O " 'iid$last.len-vs-ident.3.false.dat' using 2:1 title 'FALSE 3' pt 2 lc 1,"; print O " 'iid$last.len-vs-ident.3.true.dat' using 2:1 title 'TRUE 3' pt 2 lc 2\n"; close(O); } } undef %trueH; undef %falseH; undef %keysH; undef @lenIdent5T; undef @lenIdent5F; undef @lenIdent3T; undef @lenIdent3F; $longestT = 0; $longestF = 0; } if (exists($true{"$aIID-$bIID"})) { print TO "$_\n"; } else { print FO "$_\n"; } $trueComputed{"$aIID-$bIID"}++; next if (exists($contained{$aIID})); next if (exists($contained{$bIID})); my $oLen = computeOverlapLength($aIID, $bIID, $aHang, $bHang); my $intErate = int($origErate); my $is5 = ($aHang < 0); # Computing the number of matches instead of actual length makes every overlap shorter # than truth, and they all get flagged in output # #$oLen -= int($oLen * $origErate / 100); if 
(exists($true{"$aIID-$bIID"})) {
            $trueH{$intErate}++;
            $keysH{$intErate}++;

            print LT "$oLen\t$origErate\n";

            if ($is5) {
                push @lenIdent5T, "$oLen\t$origErate\n";
            } else {
                push @lenIdent3T, "$oLen\t$origErate\n";
            }

            if ($longestT < $oLen) {
                $longestT = $oLen;
            }

        } else {
            $falseH{$intErate}++;
            $keysH{$intErate}++;

            print LF "$oLen\t$origErate\n";

            if ($is5) {
                push @lenIdent5F, "$oLen\t$origErate\n";
            } else {
                push @lenIdent3F, "$oLen\t$origErate\n";
            }

            if ($longestF < $oLen) {
                $longestF = $oLen;
            }
        }

        $last = $aIID;
    }
    close(F);

    close(LT);
    close(LF);

    close(TO);
    close(FO);
}

print "convertOverlap -ovl < all.true.filtered-overlaps.ova > all.true.filtered-overlaps.ovb\n";
system("convertOverlap -ovl < all.true.filtered-overlaps.ova > all.true.filtered-overlaps.ovb");

print "convertOverlap -ovl < all.false.filtered-overlaps.ova > all.false.filtered-overlaps.ovb\n";
system("convertOverlap -ovl < all.false.filtered-overlaps.ova > all.false.filtered-overlaps.ovb");

print "rm -rf all.true.filtered-overlaps.ovlStore all.true.filtered-overlaps+false.ovlStore\n";
system("rm -rf all.true.filtered-overlaps.ovlStore all.true.filtered-overlaps+false.ovlStore");

print "overlapStoreBuild -g $testPrefix/test.gkpStore -o all.true.filtered-overlaps.ovlStore -F 1 all.true.filtered-overlaps.ovb\n";
system("overlapStoreBuild -g $testPrefix/test.gkpStore -o all.true.filtered-overlaps.ovlStore -F 1 all.true.filtered-overlaps.ovb");

print "overlapStoreBuild -g $testPrefix/test.gkpStore -o all.true.filtered-overlaps+false.ovlStore -F 1 all.true.filtered-overlaps.ovb all.false.filtered-overlaps.ovb\n";
system("overlapStoreBuild -g $testPrefix/test.gkpStore -o all.true.filtered-overlaps+false.ovlStore -F 1 all.true.filtered-overlaps.ovb all.false.filtered-overlaps.ovb");

my %bestTrueEdge;
my %bestTrueContain;

open(F, "< $truePrefix/4-unitigger/best.edges") or die;
while (<F>) {
    my @v = split '\s+', $_;
    $bestTrueEdge{"$v[0]-$v[2]"}++;
    $bestTrueEdge{"$v[2]-$v[0]"}++;
    $bestTrueEdge{"$v[0]-$v[4]"}++;
    $bestTrueEdge{"$v[4]-$v[0]"}++;
}
close(F);

open(F, "< $truePrefix/4-unitigger/best.contains") or die;
while (<F>) {
    my @v = split '\s+', $_;
    $bestTrueContain{"$v[0]-$v[3]"}++;
}
close(F);

my %bestTestEdge;
my %bestTestContain;

open(F, "< $testPrefix/4-unitigger/best.edges") or die;
while (<F>) {
    my @v = split '\s+', $_;
    $bestTestEdge{"$v[0]-$v[2]"}++;
    $bestTestEdge{"$v[2]-$v[0]"}++;
    $bestTestEdge{"$v[0]-$v[4]"}++;
    $bestTestEdge{"$v[4]-$v[0]"}++;
}
close(F);

open(F, "< $testPrefix/4-unitigger/best.contains") or die;
while (<F>) {
    my @v = split '\s+', $_;
    $bestTestContain{"$v[0]-$v[3]"}++;
}
close(F);

#
#  Output true overlaps that we missed
#

open(F, "overlapStore -d $trueOverlaps |") or die "Failed to open '$trueOverlaps' for reading.\n";

open(O, "> true-overlaps.missed.other.ova");
open(E, "> true-overlaps.missed.edge.ova");
open(C, "> true-overlaps.missed.contain.ova");

while (<F>) {
    s/^\s+//;
    s/\s+$//;

    my ($aIID, $bIID, $orient, $aHang, $bHang, $origErate, $corrErate) = split '\s+', $_;

    next if (exists($trueComputed{"$aIID-$bIID"}));

    if    (exists($bestTrueEdge{"$aIID-$bIID"})) {
        print E "$_\n";
    } elsif (exists($bestTrueContain{"$aIID-$bIID"})) {
        print C "$_\n";
    } else {
        print O "$_\n";
    }
}
close(O);
close(F);

#
#  Classify computed overlaps into true/false for each type (best edge, best contain, the others).
#

open(F, "overlapStore -d $testOverlaps |") or die "Failed to open '$testOverlaps' for reading.\n";

open(BET, "> computed-overlaps.bestE.true.dat");
open(BEN, "> computed-overlaps.bestE.near.dat");
open(BEF, "> computed-overlaps.bestE.false.dat");

open(BCT, "> computed-overlaps.bestC.true.dat");
open(BCN, "> computed-overlaps.bestC.near.dat");
open(BCF, "> computed-overlaps.bestC.false.dat");

open(OTT, "> computed-overlaps.other.true.dat");
open(OTF, "> computed-overlaps.other.false.dat");

open(BETo, "> computed-overlaps.bestE.true.ova");
open(BENo, "> computed-overlaps.bestE.near.ova");
open(BEFo, "> computed-overlaps.bestE.false.ova");

open(BCTo, "> computed-overlaps.bestC.true.ova");
open(BCNo, "> computed-overlaps.bestC.near.ova");
open(BCFo, "> computed-overlaps.bestC.false.ova");

open(OTTo, "> computed-overlaps.other.true.ova");
open(OTFo, "> computed-overlaps.other.false.ova");

while (<F>) {
    s/^\s+//;
    s/\s+$//;

    my ($aIID, $bIID, $orient, $aHang, $bHang, $origErate, $corrErate) = split '\s+', $_;

    my $pair = "$aIID-$bIID";
    my $len  = computeOverlapLength($aIID, $bIID, $aHang, $bHang);

    $origErate = int(10 * $origErate) / 10;

    my $diff = ($aIID < $bIID) ?
$bIID - $aIID : $aIID - $bIID; die if ($diff < 0); if (exists($bestTestEdge{$pair})) { if (exists($bestTrueEdge{$pair})) { print BET "$origErate\t$len\n"; print BETo "$_\n"; } elsif (exists($true{$pair})) { print BEN "$origErate\t$len\n"; print BENo "$_\n"; } else { print BEF "$origErate\t$len\n"; print BEFo "$_\n"; } } elsif (exists($bestTestContain{$pair})) { if (exists($bestTrueContain{$pair})) { print BCT "$origErate\t$len\n"; print BCTo "$_\n"; } elsif (exists($true{$pair})) { print BCN "$origErate\t$len\n"; print BCNo "$_\n"; } else { print BCF "$origErate\t$len\n"; print BCFo "$_\n"; } } else { if (exists($true{$pair})) { print OTT "$origErate\t$len\n"; print OTTo "$_\n"; } else { print OTF "$origErate\t$len\n"; print OTFo "$_\n"; } } } close(OTF); close(OTFo); close(OTT); close(OTTo); close(BCF); close(BCFo); close(BCN); close(BCNo); close(BCT); close(BCTo); close(BEF); close(BEFo); close(BEN); close(BENo); close(BET); close(BETo); system("awk '{ print \$1 }' computed-overlaps.bestE.true.dat | sort -n | uniq -c > computed-overlaps.bestE.true.eratehist"); system("awk '{ print \$1 }' computed-overlaps.bestE.near.dat | sort -n | uniq -c > computed-overlaps.bestE.near.eratehist"); system("awk '{ print \$1 }' computed-overlaps.bestE.false.dat | sort -n | uniq -c > computed-overlaps.bestE.false.eratehist"); system("awk '{ print \$1 }' computed-overlaps.bestC.true.dat | sort -n | uniq -c > computed-overlaps.bestC.true.eratehist"); system("awk '{ print \$1 }' computed-overlaps.bestC.near.dat | sort -n | uniq -c > computed-overlaps.bestC.near.eratehist"); system("awk '{ print \$1 }' computed-overlaps.bestC.false.dat | sort -n | uniq -c > computed-overlaps.bestC.false.eratehist"); system("awk '{ print \$1 }' computed-overlaps.other.true.dat | sort -n | uniq -c > computed-overlaps.other.true.eratehist"); system("awk '{ print \$1 }' computed-overlaps.other.false.dat | sort -n | uniq -c > computed-overlaps.other.false.eratehist"); open(F, "> tmp.gp"); print F "set 
terminal png\n"; print F "\n"; print F "set output 'computed-overlaps.bestE.eratehist.png'\n"; print F "set title 'Computed Overlaps, best edges, error rate histogram'\n"; print F "plot 'computed-overlaps.bestE.true.eratehist' using 2:1 title 'true' ps 1.00 lc 2, \\\n"; print F " 'computed-overlaps.bestE.near.eratehist' using 2:1 title 'near' ps 1.00 lc 3, \\\n"; print F " 'computed-overlaps.bestE.false.eratehist' using 2:1 title 'false' ps 1.00 lc 1\n"; print F "\n"; print F "set title 'Computed Overlaps, best contains, error rate histogram'\n"; print F "set output 'computed-overlaps.bestC.eratehist.png'\n"; print F "plot 'computed-overlaps.bestC.true.eratehist' using 2:1 title 'true' ps 1.00 lc 2, \\\n"; print F " 'computed-overlaps.bestC.near.eratehist' using 2:1 title 'near' ps 1.00 lc 3, \\\n"; print F " 'computed-overlaps.bestC.false.eratehist' using 2:1 title 'false' ps 1.00 lc 1\n"; print F "\n"; print F "set title 'Computed Overlaps, non-best, error rate histogram'\n"; print F "set output 'computed-overlaps.other.eratehist.png'\n"; print F "plot 'computed-overlaps.other.true.eratehist' using 2:1 title 'true' ps 1.00 lc 2, \\\n"; print F " 'computed-overlaps.other.false.eratehist' using 2:1 title 'false' ps 1.00 lc 1\n"; print F "\n"; print F "\n"; print F "set title 'Computed Overlaps, best edges non-best, error-vs-length'\n"; print F "set output 'computed-overlaps.bestE.length-vs-erate.png'\n"; print F "plot 'computed-overlaps.bestE.true.dat' using 1:2 title 'true' ps 0.50 lc 2, \\\n"; print F " 'computed-overlaps.bestE.near.dat' using 1:2 title 'near' ps 0.25 lc 3, \\\n"; print F " 'computed-overlaps.bestE.false.dat' using 1:2 title 'false' ps 0.25 lc 1\n"; print F "\n"; print F "set title 'Computed Overlaps, non-best, error-vs-length'\n"; print F "set output 'computed-overlaps.bestC.length-vs-erate.png'\n"; print F "plot 'computed-overlaps.bestC.true.dat' using 1:2 title 'true' ps 0.50 lc 2, \\\n"; print F " 'computed-overlaps.bestC.near.dat' using 1:2 
title 'near' ps 0.25 lc 3, \\\n"; print F " 'computed-overlaps.bestC.false.dat' using 1:2 title 'false' ps 0.25 lc 1\n"; print F "\n"; print F "set title 'Computed Overlaps, non-best, error-vs-length'\n"; print F "set output 'computed-overlaps.other.length-vs-erate.png'\n"; print F "plot 'computed-overlaps.other.true.dat' using 1:2 title 'true' ps 0.50 lc 2, \\\n"; print F " 'computed-overlaps.other.false.dat' using 1:2 title 'false' ps 0.25 lc 1\n"; close(F); system("gnuplot tmp.gp && rm tmp.gp"); sub computeOverlapLength ($$$$) { my ($aIID, $bIID, $aHang, $bHang) = @_; my $aLen = $readLengths[$aIID]; my $bLen = $readLengths[$bIID]; my $aOvl = 0; my $bOvl = 0; # Swiped from bogart if ($aHang < 0) { # bHang < 0 ? ---------- : ---- # ? ---------- : ---------- # $aOvl = ($bHang < 0) ? ($aLen + $bHang) : ($aLen); $bOvl = ($bHang < 0) ? ($bLen + $aHang) : ($bLen + $aHang - $bHang); } else { # bHang < 0 ? ---------- : ---------- # ? ---- : ---------- # $aOvl = ($bHang < 0) ? ($aLen - $aHang + $bHang) : ($aLen - $aHang); $bOvl = ($bHang < 0) ? ($bLen) : ($bLen - $bHang); } return(($aOvl + $bOvl) / 2); } canu-1.6/src/overlapInCore-analysis/check-ordered-reads-for-missed-overlaps.pl000066400000000000000000000063251314437614700275350ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. 
Walenz beginning on 2015-OCT-12
 #      are a 'United States Government Work', and
 #      are released in the public domain
 #
 #  File 'README.licenses' in the root directory of this distribution contains
 #  full conditions and disclaimers for each license.
 ##

use strict;

my $dataset;
$dataset = "BL";
$dataset = "CAordered";   #  Last assignment wins; comment out to test the other dataset.

my $maxB = 0;
my $maxA = 0;

sub processIIDs ($@) {
    my $thisIID = shift @_;
    my @iids    = sort { $a <=> $b } @_;

    return if (scalar(@iids) == 0);

    my $minIID = $iids[0];
    my $curIID = 0;
    my $maxIID = $iids[scalar(@iids) - 1];

    my $expectedB = 0;
    my $expectedA = 0;
    my $foundB    = 0;
    my $foundA    = 0;
    my $missingB  = 0;
    my $missingA  = 0;

    foreach my $iid (@iids) {
        #print "$iid -- $curIID\n";
        next if ($iid < $thisIID - 30);   #  Ignore false overlaps
        next if ($iid > $thisIID + 30);

        $expectedB = $thisIID - $iid  if ($expectedB < $thisIID - $iid);
        $expectedA = $iid - $thisIID  if ($expectedA < $iid - $thisIID);

        $curIID = $iid - 1  if ($curIID == 0);   #  If curIID==0, first time here; set so nothing is missing yet.

        if ($iid < $thisIID) {
            $foundB++;
            if ($curIID + 1 < $iid) {
                #print STDERR "  missingB $curIID - $iid\n";
                $missingB += $iid - 1 - $curIID;
            }
        } else {
            $foundA++;
            if ($curIID + 1 < $iid) {
                print STDERR "  missingA $curIID - $iid\n";
                $missingA += $iid - 1 - $curIID;
            }
        }

        $curIID = $iid;
        $curIID++  if ($curIID + 1 == $thisIID);
    }

    my $found    = $foundB    + $foundA;
    my $expected = $expectedB + $expectedA;
    my $missing  = $missingB  + $missingA;

    if ($missingA > 0) {
        #  Guard against division by zero when no overlaps are expected on one side.
        my $fracB = ($expectedB > 0) ? int(10000 * $missingB / $expectedB) / 100 : 0;
        my $fracA = ($expectedA > 0) ? int(10000 * $missingA / $expectedA) / 100 : 0;

        print "$thisIID range $expectedB/$expectedA found $foundB/$foundA missing $missingB/$missingA $fracB%/$fracA%\n";
    }
}

my $lastIID = 0;
my @iids;

open(F, "overlapStore -d JUNKTEST3/$dataset/test.ovlStore |") or die;
while (<F>) {
    s/^\s+//;
    s/\s+$//;

    #  v[0] = iid   v[1] = iid   v[2] = orient
    #  v[3] = hang  v[4] = hang  v[5] = ident  v[6] = ident

    my @v = split '\s+', $_;

    if ($v[0] != $lastIID) {
        #print STDERR "PROCESS $lastIID ", scalar(@iids), "\n";
        processIIDs($lastIID, @iids);
        undef @iids;
        $lastIID = $v[0];
    }

    #print STDERR "$v[0] $v[1]\n";
    push @iids, $v[1];
}
close(F);
canu-1.6/src/overlapInCore-analysis/filterTrue.pl

#!/usr/bin/env perl

###############################################################################
 #
 #  This file is part of canu, a software program that assembles whole-genome
 #  sequencing reads into contigs.
 #
 #  This software is based on:
 #    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 #    the 'kmer package' (http://kmer.sourceforge.net)
 #  both originally distributed by Applera Corporation under the GNU General
 #  Public License, version 2.
 #
 #  Canu branched from Celera Assembler at its revision 4587.
 #  Canu branched from the kmer project at its revision 1994.
 #
 #  Modifications by:
 #
 #    Brian P.
Walenz beginning on 2015-OCT-12
 #      are a 'United States Government Work', and
 #      are released in the public domain
 #
 #  File 'README.licenses' in the root directory of this distribution contains
 #  full conditions and disclaimers for each license.
 ##

use strict;

#  Reads true overlaps from the first file, separates the second file into
#  true/false based on iid.  Hangs are NOT checked.

my $trueOverlaps = "BL/test.ovlStore";
my $testOverlaps = "CAordered/test.ovlStore";

my $filtTrue  = "filtered.true";
my $filtFalse = "filtered.false";

#
#  Load read lengths, make sure both assemblies are the same
#

my @readLengths;
my %contained;

open(F, "gatekeeper -dumpfragments -tabular BL/test.gkpStore |") or die;
while (<F>) {
    my @v = split '\s+', $_;
    $readLengths[$v[1]] = $v[9];
}
close(F);

open(F, "gatekeeper -dumpfragments -tabular CAordered/test.gkpStore |") or die;
while (<F>) {
    my @v = split '\s+', $_;
    die "IID $v[1] BL $readLengths[$v[1]] != CA $v[9]\n"  if ($readLengths[$v[1]] != $v[9]);
}
close(F);

open(F, "< CAordered/4-unitigger/best.contains") or die;
while (<F>) {
    my @v = split '\s+', $_;
    $contained{$v[0]}++;
}
close(F);

#
#  Load true overlaps
#

my %true;          #  If set, id pair is a true overlap
my %trueComputed;  #  If set, id pair is a true overlap, and was computed

{
    my $last;      #  local storage
    my @lenIdent;  #  local storage of length-vs-ident for true overlaps.  ident is not known.
    open(F, "overlapStore -d $trueOverlaps |") or die "Failed to open '$trueOverlaps' for reading.\n";
    while (<F>) {
        s/^\s+//;
        s/\s+$//;

        my ($aIID, $bIID, $orient, $aHang, $bHang, $origErate, $corrErate) = split '\s+', $_;

        if (defined($last) && ($last != $aIID)) {   #  Skip the dump on the very first record.
            open(O, "> iid$last.len-vs-ident.dat");
            print O @lenIdent;
            close(O);
            undef @lenIdent;
        }

        my $oLen = computeOverlapLength($aIID, $bIID, $aHang, $bHang);

        $true{"$aIID-$bIID"} = $oLen;
        $true{"$bIID-$aIID"} = $oLen;

        next if (exists($contained{$aIID}));
        next if (exists($contained{$bIID}));

        push @lenIdent, "$oLen\t12.0\n";   #  Identity isn't known for true overlaps; 12.0 is a fixed placeholder.

        $last = $aIID;
    }
    close(F);
}

print STDERR "true: ", scalar(keys %true) / 2, "\n";

#
#  Process computed overlaps
#

my %trueH;
my %falseH;
my %keysH;

my @lenIdent5T;
my @lenIdent5F;
my @lenIdent3T;
my @lenIdent3F;

my $longestF = 0;
my $longestT = 0;

{
    my $last = 0;

    open(LT, "> all.true.length-vs-ident.dat");
    open(LF, "> all.false.length-vs-ident.dat");

    open(TO, "> all.true.filtered-overlaps.ova");
    open(FO, "> all.false.filtered-overlaps.ova");

    open(F, "overlapStore -d $testOverlaps |") or die "Failed to open '$testOverlaps' for reading.\n";
    while (<F>) {
        s/^\s+//;
        s/\s+$//;

        my ($aIID, $bIID, $orient, $aHang, $bHang, $origErate, $corrErate) = split '\s+', $_;

        if ($last != $aIID) {
            if (scalar(keys %falseH) > 0) {

                #  error rate histogram
                if (1) {
                    open(O, "> iid$last.eratehistogram.dat");
                    foreach my $ii (sort { $a <=> $b } keys %keysH) {
                        $trueH{$ii}  += 0.0;
                        $falseH{$ii} += 0.0;
                        print O "$ii\t$trueH{$ii}\t$falseH{$ii}\n";
                    }
                    close(O);

                    open(O, "| gnuplot > /dev/null 2> /dev/null");
                    print O "set terminal 'png'\n";
                    print O "set output 'iid$last.eratehistogram.png'\n";
                    print O "plot 'iid$last.eratehistogram.dat' using 1:3 with lines title 'FALSE',";
                    print O " 'iid$last.eratehistogram.dat' using 1:2 with lines title 'TRUE'\n";
                    close(O);
                }

                #  length vs identity
                if (1) {   #0.5 * $longestT < $longestF) {
                    #print STDERR "IID $last true: len=$longestT false: len=$longestF\n";

                    open(O, "> iid$last.len-vs-ident.5.true.dat");
                    print O @lenIdent5T;
close(O); open(O, "> iid$last.len-vs-ident.5.false.dat"); print O @lenIdent5F; close(O); open(O, "> iid$last.len-vs-ident.3.true.dat"); print O @lenIdent3T; close(O); open(O, "> iid$last.len-vs-ident.3.false.dat"); print O @lenIdent3F; close(O); open(O, "| gnuplot > /dev/null 2> /dev/null"); print O "set terminal 'png'\n"; print O "set output 'iid$last.len-vs-ident.png'\n"; print O "plot [10:35] [0:15000]"; print O " 'iid$last.len-vs-ident.5.false.dat' using 2:1 title 'FALSE 5' pt 1 lc 1,"; print O " 'iid$last.len-vs-ident.5.true.dat' using 2:1 title 'TRUE 5' pt 1 lc 2,"; print O " 'iid$last.len-vs-ident.3.false.dat' using 2:1 title 'FALSE 3' pt 2 lc 1,"; print O " 'iid$last.len-vs-ident.3.true.dat' using 2:1 title 'TRUE 3' pt 2 lc 2,"; print O " 'iid$last.len-vs-ident.dat' using 2:1 title 'MISSED' pt 3 lc 3\n"; print O "set output 'iid$last.5.len-vs-ident.png'\n"; print O "plot [10:35] [0:15000]"; print O " 'iid$last.len-vs-ident.5.false.dat' using 2:1 title 'FALSE 5' pt 1 lc 1,"; print O " 'iid$last.len-vs-ident.5.true.dat' using 2:1 title 'TRUE 5' pt 1 lc 2\n"; print O "set output 'iid$last.3.len-vs-ident.png'\n"; print O "plot [10:35] [0:15000]"; print O " 'iid$last.len-vs-ident.3.false.dat' using 2:1 title 'FALSE 3' pt 2 lc 1,"; print O " 'iid$last.len-vs-ident.3.true.dat' using 2:1 title 'TRUE 3' pt 2 lc 2\n"; close(O); } } undef %trueH; undef %falseH; undef %keysH; undef @lenIdent5T; undef @lenIdent5F; undef @lenIdent3T; undef @lenIdent3F; $longestT = 0; $longestF = 0; } if (exists($true{"$aIID-$bIID"})) { print TO "$_\n"; } else { print FO "$_\n"; } $trueComputed{"$aIID-$bIID"}++; next if (exists($contained{$aIID})); next if (exists($contained{$bIID})); my $oLen = computeOverlapLength($aIID, $bIID, $aHang, $bHang); my $intErate = int($origErate); my $is5 = ($aHang < 0); # Computing the number of matches instead of actual length makes every overlap shorter # than truth, and they all get flagged in output # #$oLen -= int($oLen * $origErate / 100); if 
(exists($true{"$aIID-$bIID"})) { $trueH{$intErate}++; $keysH{$intErate}++; print LT "$oLen\t$origErate\n"; if ($is5) { push @lenIdent5T, "$oLen\t$origErate\n"; } else { push @lenIdent3T, "$oLen\t$origErate\n"; } if ($longestT < $oLen) { $longestT = $oLen; } } else { $falseH{$intErate}++; $keysH{$intErate}++; print LF "$oLen\t$origErate\n"; if ($is5) { push @lenIdent5F, "$oLen\t$origErate\n"; } else { push @lenIdent3F, "$oLen\t$origErate\n"; } if ($longestF < $oLen) { $longestF = $oLen; } } $last = $aIID; } close(F); close(LT); close(LF); close(TO); close(FO); } print "convertOverlap -ovl < all.true.filtered-overlaps.ova > all.true.filtered-overlaps.ovb\n"; system("convertOverlap -ovl < all.true.filtered-overlaps.ova > all.true.filtered-overlaps.ovb"); print "convertOverlap -ovl < all.false.filtered-overlaps.ova > all.false.filtered-overlaps.ovb\n"; system("convertOverlap -ovl < all.false.filtered-overlaps.ova > all.false.filtered-overlaps.ovb"); print "rm -rf all.true.filtered-overlaps.ovlStore all.true.filtered-overlaps+false.ovlStore\n"; system("rm -rf all.true.filtered-overlaps.ovlStore all.true.filtered-overlaps+false.ovlStore"); print "overlapStoreBuild -g CAordered/test.gkpStore -o all.true.filtered-overlaps.ovlStore -F 1 all.true.filtered-overlaps.ovb\n"; system("overlapStoreBuild -g CAordered/test.gkpStore -o all.true.filtered-overlaps.ovlStore -F 1 all.true.filtered-overlaps.ovb"); print "overlapStoreBuild -g CAordered/test.gkpStore -o all.true.filtered-overlaps+false.ovlStore -F 1 all.true.filtered-overlaps.ovb all.false.filtered-overlaps.ovb\n"; system("overlapStoreBuild -g CAordered/test.gkpStore -o all.true.filtered-overlaps+false.ovlStore -F 1 all.true.filtered-overlaps.ovb all.false.filtered-overlaps.ovb"); my %bestTrueEdge; my %bestTrueContain; open(F, "< BL/4-unitigger/best.edges") or die; while () { my @v = split '\s+', $_; $bestTrueEdge{"$v[0]-$v[2]"}++; $bestTrueEdge{"$v[2]-$v[0]"}++; $bestTrueEdge{"$v[0]-$v[4]"}++; 
$bestTrueEdge{"$v[4]-$v[0]"}++; } close(F); open(F, "< BL/4-unitigger/best.contains") or die; while () { my @v = split '\s+', $_; $bestTrueContain{"$v[0]-$v[3]"}++; } close(F); my %bestTestEdge; my %bestTestContain; open(F, "< CAordered/4-unitigger/best.edges") or die; while () { my @v = split '\s+', $_; $bestTestEdge{"$v[0]-$v[2]"}++; $bestTestEdge{"$v[2]-$v[0]"}++; $bestTestEdge{"$v[0]-$v[4]"}++; $bestTestEdge{"$v[4]-$v[0]"}++; } close(F); open(F, "< CAordered/4-unitigger/best.contains") or die; while () { my @v = split '\s+', $_; $bestTestContain{"$v[0]-$v[3]"}++; } close(F); # # Output true overlaps that we missed # open(F, "overlapStore -d $trueOverlaps |") or die "Failed to open '$trueOverlaps' for reading.\n"; open(O, "> true-overlaps.missed.other.ova"); open(E, "> true-overlaps.missed.edge.ova"); open(C, "> true-overlaps.missed.contain.ova"); while () { s/^\s+//; s/\s+$//; my ($aIID, $bIID, $orient, $aHang, $bHang, $origErate, $corrErate) = split '\s+', $_; next if (exists($trueComputed{"$aIID-$bIID"})); if (exists($bestTrueEdge{"$aIID-$bIID"})) { print E "$_\n"; } elsif (exists($bestTrueContain{"$aIID-$bIID"})) { print C "$_\n"; } else { print O "$_\n"; } } close(O); close(F); # # Classify computed overlaps into true/false for each type (best edge, best contain, the others). 
#

open(F, "overlapStore -d $testOverlaps |") or die "Failed to open '$testOverlaps' for reading.\n";

open(BET, "> computed-overlaps.bestE.true.dat");
open(BEN, "> computed-overlaps.bestE.near.dat");
open(BEF, "> computed-overlaps.bestE.false.dat");
open(BCT, "> computed-overlaps.bestC.true.dat");
open(BCN, "> computed-overlaps.bestC.near.dat");
open(BCF, "> computed-overlaps.bestC.false.dat");
open(OTT, "> computed-overlaps.other.true.dat");
open(OTF, "> computed-overlaps.other.false.dat");

open(BETo, "> computed-overlaps.bestE.true.ova");
open(BENo, "> computed-overlaps.bestE.near.ova");
open(BEFo, "> computed-overlaps.bestE.false.ova");
open(BCTo, "> computed-overlaps.bestC.true.ova");
open(BCNo, "> computed-overlaps.bestC.near.ova");
open(BCFo, "> computed-overlaps.bestC.false.ova");
open(OTTo, "> computed-overlaps.other.true.ova");
open(OTFo, "> computed-overlaps.other.false.ova");

while (<F>) {
    s/^\s+//;
    s/\s+$//;

    my ($aIID, $bIID, $orient, $aHang, $bHang, $origErate, $corrErate) = split '\s+', $_;

    my $pair = "$aIID-$bIID";
    my $len  = computeOverlapLength($aIID, $bIID, $aHang, $bHang);

    $origErate = int(10 * $origErate) / 10;

    my $diff = ($aIID < $bIID) ?
$bIID - $aIID : $aIID - $bIID; die if ($diff < 0); if (exists($bestTestEdge{$pair})) { if (exists($bestTrueEdge{$pair})) { print BET "$origErate\t$len\n"; print BETo "$_\n"; } elsif (exists($true{$pair})) { print BEN "$origErate\t$len\n"; print BENo "$_\n"; } else { print BEF "$origErate\t$len\n"; print BEFo "$_\n"; } } elsif (exists($bestTestContain{$pair})) { if (exists($bestTrueContain{$pair})) { print BCT "$origErate\t$len\n"; print BCTo "$_\n"; } elsif (exists($true{$pair})) { print BCN "$origErate\t$len\n"; print BCNo "$_\n"; } else { print BCF "$origErate\t$len\n"; print BCFo "$_\n"; } } else { if (exists($true{$pair})) { print OTT "$origErate\t$len\n"; print OTTo "$_\n"; } else { print OTF "$origErate\t$len\n"; print OTFo "$_\n"; } } } close(OTF); close(OTFo); close(OTT); close(OTTo); close(BCF); close(BCFo); close(BCN); close(BCNo); close(BCT); close(BCTo); close(BEF); close(BEFo); close(BEN); close(BENo); close(BET); close(BETo); system("awk '{ print \$1 }' computed-overlaps.bestE.true.dat | sort -n | uniq -c > computed-overlaps.bestE.true.eratehist"); system("awk '{ print \$1 }' computed-overlaps.bestE.near.dat | sort -n | uniq -c > computed-overlaps.bestE.near.eratehist"); system("awk '{ print \$1 }' computed-overlaps.bestE.false.dat | sort -n | uniq -c > computed-overlaps.bestE.false.eratehist"); system("awk '{ print \$1 }' computed-overlaps.bestC.true.dat | sort -n | uniq -c > computed-overlaps.bestC.true.eratehist"); system("awk '{ print \$1 }' computed-overlaps.bestC.near.dat | sort -n | uniq -c > computed-overlaps.bestC.near.eratehist"); system("awk '{ print \$1 }' computed-overlaps.bestC.false.dat | sort -n | uniq -c > computed-overlaps.bestC.false.eratehist"); system("awk '{ print \$1 }' computed-overlaps.other.true.dat | sort -n | uniq -c > computed-overlaps.other.true.eratehist"); system("awk '{ print \$1 }' computed-overlaps.other.false.dat | sort -n | uniq -c > computed-overlaps.other.false.eratehist"); open(F, "> tmp.gp"); print F "set 
terminal png\n"; print F "\n"; print F "set output 'computed-overlaps.bestE.eratehist.png'\n"; print F "set title 'Computed Overlaps, best edges, error rate histogram'\n"; print F "plot 'computed-overlaps.bestE.true.eratehist' using 2:1 title 'true' ps 1.00 lc 2, \\\n"; print F " 'computed-overlaps.bestE.near.eratehist' using 2:1 title 'near' ps 1.00 lc 3, \\\n"; print F " 'computed-overlaps.bestE.false.eratehist' using 2:1 title 'false' ps 1.00 lc 1\n"; print F "\n"; print F "set title 'Computed Overlaps, best contains, error rate histogram'\n"; print F "set output 'computed-overlaps.bestC.eratehist.png'\n"; print F "plot 'computed-overlaps.bestC.true.eratehist' using 2:1 title 'true' ps 1.00 lc 2, \\\n"; print F " 'computed-overlaps.bestC.near.eratehist' using 2:1 title 'near' ps 1.00 lc 3, \\\n"; print F " 'computed-overlaps.bestC.false.eratehist' using 2:1 title 'false' ps 1.00 lc 1\n"; print F "\n"; print F "set title 'Computed Overlaps, non-best, error rate histogram'\n"; print F "set output 'computed-overlaps.other.eratehist.png'\n"; print F "plot 'computed-overlaps.other.true.eratehist' using 2:1 title 'true' ps 1.00 lc 2, \\\n"; print F " 'computed-overlaps.other.false.eratehist' using 2:1 title 'false' ps 1.00 lc 1\n"; print F "\n"; print F "\n"; print F "set title 'Computed Overlaps, best edges non-best, error-vs-length'\n"; print F "set output 'computed-overlaps.bestE.length-vs-erate.png'\n"; print F "plot 'computed-overlaps.bestE.true.dat' using 1:2 title 'true' ps 0.50 lc 2, \\\n"; print F " 'computed-overlaps.bestE.near.dat' using 1:2 title 'near' ps 0.25 lc 3, \\\n"; print F " 'computed-overlaps.bestE.false.dat' using 1:2 title 'false' ps 0.25 lc 1\n"; print F "\n"; print F "set title 'Computed Overlaps, non-best, error-vs-length'\n"; print F "set output 'computed-overlaps.bestC.length-vs-erate.png'\n"; print F "plot 'computed-overlaps.bestC.true.dat' using 1:2 title 'true' ps 0.50 lc 2, \\\n"; print F " 'computed-overlaps.bestC.near.dat' using 1:2 
title 'near' ps 0.25 lc 3, \\\n"; print F " 'computed-overlaps.bestC.false.dat' using 1:2 title 'false' ps 0.25 lc 1\n"; print F "\n"; print F "set title 'Computed Overlaps, non-best, error-vs-length'\n"; print F "set output 'computed-overlaps.other.length-vs-erate.png'\n"; print F "plot 'computed-overlaps.other.true.dat' using 1:2 title 'true' ps 0.50 lc 2, \\\n"; print F " 'computed-overlaps.other.false.dat' using 1:2 title 'false' ps 0.25 lc 1\n"; close(F); system("gnuplot tmp.gp && rm tmp.gp"); sub computeOverlapLength ($$$$) { my ($aIID, $bIID, $aHang, $bHang) = @_; my $aLen = $readLengths[$aIID]; my $bLen = $readLengths[$bIID]; my $aOvl = 0; my $bOvl = 0; # Swiped from bogart if ($aHang < 0) { # bHang < 0 ? ---------- : ---- # ? ---------- : ---------- # $aOvl = ($bHang < 0) ? ($aLen + $bHang) : ($aLen); $bOvl = ($bHang < 0) ? ($bLen + $aHang) : ($bLen + $aHang - $bHang); } else { # bHang < 0 ? ---------- : ---------- # ? ---- : ---------- # $aOvl = ($bHang < 0) ? ($aLen - $aHang + $bHang) : ($aLen - $aHang); $bOvl = ($bHang < 0) ? ($bLen) : ($bLen - $bHang); } return(($aOvl + $bOvl) / 2); } canu-1.6/src/overlapInCore-analysis/find-missed-true-overlaps.pl000066400000000000000000000060061314437614700250470ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. 
Walenz beginning on 2015-OCT-12
 #      are a 'United States Government Work', and
 #      are released in the public domain
 #
 #  File 'README.licenses' in the root directory of this distribution contains
 #  full conditions and disclaimers for each license.
 ##

use strict;

#  Given:
#    an assembly of true inferred overlaps
#    an assembly of computed overlaps
#  report the number of true overlaps that are missed in the computed overlaps.

my %truthOverlaps;
my @readLengths;

open(F, "gatekeeper -dumpfragments -tabular JUNKTEST3/BL/test.gkpStore |") or die;
while (<F>) {
    my @v = split '\s+', $_;
    $readLengths[$v[1]] = $v[9];
}
close(F);

open(F, "gatekeeper -dumpfragments -tabular JUNKTEST3/CAordered/test.gkpStore |") or die;
while (<F>) {
    my @v = split '\s+', $_;
    die "IID $v[1] BL $readLengths[$v[1]] != CA $v[9]\n"  if ($readLengths[$v[1]] != $v[9]);
}
close(F);

open(F, "overlapStore -d JUNKTEST3/BL/test.ovlStore |") or die;
while (<F>) {
    s/^\s+//;
    s/\s+$//;

    my ($aIID, $bIID, $orient, $aHang, $bHang, $origID, $corrID) = split '\s+', $_;

    my $aLen = $readLengths[$aIID];
    my $bLen = $readLengths[$bIID];

    my $aOvl = 0;
    my $bOvl = 0;
    my $oLen = 0;

    #  Swiped from bogart

    if ($aHang < 0) {
        #  bHang < 0    ?     ----------  :     ----
        #               ?  ----------     :  ----------
        #
        $aOvl = ($bHang < 0) ? ($aLen + $bHang) : ($aLen);
        $bOvl = ($bHang < 0) ? ($bLen + $aHang) : ($bLen + $aHang - $bHang);
    } else {
        #  bHang < 0    ?  ----------     :  ----------
        #               ?     ----        :     ----------
        #
        $aOvl = ($bHang < 0) ? ($aLen - $aHang + $bHang) : ($aLen - $aHang);
        $bOvl = ($bHang < 0) ? ($bLen)                   : ($bLen - $bHang);
    }

    $oLen = ($aOvl + $bOvl) / 2;

    $truthOverlaps{"$aIID-$bIID"}++;
}
close(F);

print STDERR "truthOverlaps: ", scalar(keys %truthOverlaps), "\n";

open(F, "overlapStore -d JUNKTEST3/CAordered/test.ovlStore |") or die;
while (<F>) {
    s/^\s+//;
    s/\s+$//;

    #  v[0] = iid   v[1] = iid   v[2] = orient
    #  v[3] = hang  v[4] = hang  v[5] = ident  v[6] = ident

    my ($aIID, $bIID, $orient, $aHang, $bHang, $origID, $corrID) = split '\s+', $_;

    delete $truthOverlaps{"$aIID-$bIID"};
}
close(F);

print STDERR "truthOverlaps missed: ", scalar(keys %truthOverlaps), "\n";
canu-1.6/src/overlapInCore-analysis/infer-obt-from-genomic-blasr.pl

#!/usr/bin/env perl

###############################################################################
 #
 #  This file is part of canu, a software program that assembles whole-genome
 #  sequencing reads into contigs.
 #
 #  This software is based on:
 #    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 #    the 'kmer package' (http://kmer.sourceforge.net)
 #  both originally distributed by Applera Corporation under the GNU General
 #  Public License, version 2.
 #
 #  Canu branched from Celera Assembler at its revision 4587.
 #  Canu branched from the kmer project at its revision 1994.
 #
 #  Modifications by:
 #
 #    Brian P. Walenz beginning on 2015-OCT-12
 #      are a 'United States Government Work', and
 #      are released in the public domain
 #
 #  File 'README.licenses' in the root directory of this distribution contains
 #  full conditions and disclaimers for each license.
 ##

use strict;

#  Convert BLASR default output to OBT overlaps:
#    aIID bIID [f|r] aBgn aEnd bBgn bEnd error

my %IDmap;

open(F, "< $ARGV[0]") or die "Failed to open '$ARGV[0]' for reading sequence names.\n";
while (<F>) {
    my ($uid, $iid, $name) = split '\s+', $_;
    $IDmap{$name} = $iid;
}
close(F);

while (<STDIN>) {
    my @v = split '\s+', $_;

    #  The first read seems to have a sub-read range appended.
    if ($v[0] =~ m/^(.*)\/\d+_\d+$/) {
        $v[0] = $1;
    }

    my $aiid  = $IDmap{$v[0]};
    my $biid  = $IDmap{$v[1]};
    my $fr    = ($v[8] == 0) ? "f" : "r";
    my $error = 100.0 - $v[3];

    ($v[9], $v[10]) = ($v[10], $v[9])  if ($v[8] == 1);

    die "No A iid found for '$v[0]'\n"  if (!defined($aiid));
    die "No B iid found for '$v[1]'\n"  if (!defined($biid));

    next if ($aiid == $biid);

    #  HACK!
    #$error = 0.01;

    print "$aiid\t$biid\t$fr\t$v[5]\t$v[6]\t$v[9]\t$v[10]\t$error\n";
}
canu-1.6/src/overlapInCore-analysis/infer-olaps-from-genomic-coords.pl

#!/usr/bin/env perl

###############################################################################
 #
 #  This file is part of canu, a software program that assembles whole-genome
 #  sequencing reads into contigs.
 #
 #  This software is based on:
 #    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 #    the 'kmer package' (http://kmer.sourceforge.net)
 #  both originally distributed by Applera Corporation under the GNU General
 #  Public License, version 2.
 #
 #  Canu branched from Celera Assembler at its revision 4587.
 #  Canu branched from the kmer project at its revision 1994.
 #
 #  Modifications by:
 #
 #    Brian P. Walenz beginning on 2015-OCT-12
 #      are a 'United States Government Work', and
 #      are released in the public domain
 #
 #  File 'README.licenses' in the root directory of this distribution contains
 #  full conditions and disclaimers for each license.
## use strict; my $outputPrefix = $ARGV[0]; my $readsFile = $ARGV[1]; my $coordsFile = $ARGV[2]; my $uidMapFile = $ARGV[3]; my $orientForward = 0; # # Load name to IID map # my %NAMEtoIID; my %UIDtoNAME; if (defined($uidMapFile)) { print STDERR "Read names from '$uidMapFile'\n"; open(F, "< $uidMapFile"); while () { my @v = split '\s+', $_; # UID IID NAME UID IID NAME if (scalar(@v) == 3) { $NAMEtoIID{$v[2]} = $v[1]; $UIDtoNAME{$v[0]} = $v[2]; } if (scalar(@v) == 6) { $NAMEtoIID{$v[2]} = $v[1]; $NAMEtoIID{$v[5]} = $v[4]; $UIDtoNAME{$v[0]} = $v[2]; $UIDtoNAME{$v[3]} = $v[5]; } } close(F); } # # Load read lengths. Needed to find reads that do not map full length, and to compute hangs. # Assumes reads are dumped from gkpStore and have names 'uid,iid'. my %readLength; my %totalAligns; my %goodAligns; print STDERR "Read lengths from '$readsFile'\n"; open(F, "< $readsFile") or die "Couldn't read '$readsFile'\n"; while (!eof(F)) { my $a = ; my $b = ; my $c = ; my $d = ; if ($a =~ m/^.(\S+)$/) { if (! 
exists($NAMEtoIID{$1})) { #print STDERR "WARNING: read '$1' not in map.\n" } else { $readLength{$1} = length($b) - 1; $totalAligns{$1} = 0; $goodAligns{$1} = 0; } } else { chomp; die "Failed to parse read name from '$a'\n"; } } close(F); my @aligns; my $totalAligns = 0; my $goodAligns = 0; my $readOrder = 0; my %positions; open(F, "< $coordsFile"); $_ = ; $_ = ; $_ = ; $_ = ; while () { my @v = split '\s+', $_; my $b = $v[0]; # reference begin my $e = $v[1]; # reference end my $r = $v[9]; # reference name my $n = $v[10]; # read name my $f = 1; my $L = $v[2]; # read begin, unused my $R = $v[3]; # read end, unused if ($R < $L) { $f = 0; $L = $v[3]; $R = $v[2]; } die "L=$L R=$R\n" if ($R < $L); die "b=$b e=$e\n" if ($e < $b); next if (!exists($readLength{$n})); die "Didn't find read length for read '$n'\n" if (!exists($readLength{$n})); $totalAligns++; $totalAligns{$n}++; next if (10 < $L); next if (10 < $readLength{$n} - $R); $goodAligns++; $goodAligns{$n}++; my $scale = ($R - $L) / ($e - $b); # scale from reference coords to read coords # The aligns line starts with reference name. This lets us group aligns per reference sequence. push @aligns, "$r\0$b\0$e\0$f\0$scale\0$n"; } # Generate olaps. We sort by reference name, then step through all aligns to that reference and # sort by begin coordinate. my $olaps = 0; open(OVA, "> $outputPrefix.ova"); @aligns = sort @aligns; while (scalar(@aligns) > 0) { my @coords; my @v = split '\0', $aligns[0]; my $r = $v[0]; # Find all coords for this reference sequence while ($r eq $v[0]) { push @coords, "$v[1]\0$v[2]\0$v[3]\0$v[4]\0$v[5]"; shift @aligns; @v = split '\0', $aligns[0]; } # Sort those coords by start position @coords = sort { $a <=> $b } @coords; # Process each one while (scalar(@coords) > 0) { my ($rb, $re, $rf, $rs, $rn) = split '\0', $coords[0]; die "Reversed $coords[0]\n" if ($re < $rb); next if (!exists($NAMEtoIID{$rn})); # Just skip reads we don't care about. 
        die if (!exists($NAMEtoIID{$rn}));   #  For development, fail.

        #  Save this read's position

        if (exists($positions{$rn})) {
            print STDERR "WARNING: read $rn already has a position!\n";
        }
        $positions{$rn} = $readOrder++;

        #  Relabel the read name to an IID.

        $rn = $NAMEtoIID{$rn};

        shift @coords;

        foreach my $cc (@coords) {
            my ($cb, $ce, $cf, $cs, $cn) = split '\0', $cc;

            my $olap = $re - $cb;

            last if ($olap < 40);

            my $ahang;
            my $bhang;

            if ($rf == 1) {
                #  Read is forward
                $ahang = $cb - $rb;
                $bhang = $ce - $re;
            } else {
                #  Read is reverse; hangs are swapped and negative from forward case
                $ahang = $re - $ce;
                $bhang = $rb - $cb;
            }

            next if (!exists($NAMEtoIID{$cn}));
            die     if (!exists($NAMEtoIID{$cn}));

            $cn = $NAMEtoIID{$cn};

            #  Scale the hangs.  Hangs are computed using genomic positions, but deletions from the read
            #  cause this hang to be too large.

            if ($ahang < 0) {
                $ahang = $ahang * $cs;   #  hang is composed of the C read
            } else {
                $ahang = $ahang * $rs;   #  hang is composed of the R read
            }

            if ($bhang < 0) {
                $bhang = $bhang * $rs;   #  hang is composed of the R read
            } else {
                $bhang = $bhang * $cs;   #  hang is composed of the C read
            }

            printf(OVA "%9s %9s %s %5d %5d %.2f %.2f\n",
                   $rn, $cn,
                   ($rf == $cf) ? "N" : "I",
                   $ahang, $bhang,
                   0.0, 0.0);

            $olaps++;
        }
    }
}
close(OVA);

#  Output reads that aligned, didn't align.
open(R, "< $readsFile");

open(M, "> $outputPrefix.mapped.fastq");
open(P, "> $outputPrefix.partial.fastq");
open(F, "> $outputPrefix.failed.fastq");

my $mappedReads  = 0;
my $partialReads = 0;
my $failedReads  = 0;

my @mappedReads;
my @orientReads;

while (!eof(R)) {
    my $a = <R>;
    my $b = <R>;
    my $c = <R>;
    my $d = <R>;
    my $n;

    if ($a =~ m/\@(\S+)\s*/) {
        $n = $1;
    } else {
        chomp;
        die "Failed to parse read name from '$a'\n";
    }

    #die "Failed to find UIDtoNAME for UID '$n'\n" if (!exists($UIDtoNAME{$n}));
    #
    #$a = "\@$UIDtoNAME{$n}\n";

    if ($goodAligns{$n} > 0) {
        $mappedReads++;
        print M "$a$b$c$d";
        die "No position for '$n'\n" if (!exists($positions{$n}));
        push @mappedReads, "$positions{$n}\0$a$b$c$d";
        push @orientReads, "$positions{$n}\0$a$b$c$d";
    } elsif ($totalAligns{$n} > 0) {
        $partialReads++;
        print P "$a$b$c$d";
    } else {
        $failedReads++;
        print F "$a$b$c$d";
    }
}

close(F);
close(P);
close(M);
close(R);

@mappedReads = sort { $a <=> $b } @mappedReads;
@orientReads = sort { $a <=> $b } @orientReads;

open(A, "> $outputPrefix.mapped.ordered.fastq");
foreach my $a (@mappedReads) {
    my ($idx,$dat) = split '\0', $a;
    print A "$dat";
}
close(A);

open(A, "> $outputPrefix.mapped.ordered.oriented.fastq");
foreach my $a (@orientReads) {
    my ($idx,$dat) = split '\0', $a;
    print A "$dat";
}
close(A);

print STDERR "RESULT $outputPrefix ", scalar(keys %readLength), " input $mappedReads mapped $partialReads partial $failedReads unaligned reads, $totalAligns aligns, $goodAligns good aligns, $olaps overlaps.\n";
canu-1.6/src/overlapInCore-analysis/infer-olaps-from-pairwise-blasr.pl000066400000000000000000000106261314437614700261440ustar00rootroot00000000000000
#!/usr/bin/env perl

###############################################################################
#
#  This file is part of canu, a software program that assembles whole-genome
#  sequencing reads into contigs.
#
#  This software is based on:
#    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
#    the 'kmer package' (http://kmer.sourceforge.net)
#  both originally distributed by Applera Corporation under the GNU General
#  Public License, version 2.
#
#  Canu branched from Celera Assembler at its revision 4587.
#  Canu branched from the kmer project at its revision 1994.
#
#  Modifications by:
#
#    Brian P. Walenz beginning on 2015-OCT-12
#      are a 'United States Government Work', and
#      are released in the public domain
#
#  File 'README.licenses' in the root directory of this distribution contains
#  full conditions and disclaimers for each license.
##

use strict;

#  Convert BLASR default output to OVL overlaps: aIID bIID [I|N] aHang bHang error error_corrected

die "Don't use BLASR default output; use SAM format.\n";

my %IDmap;

#  BLASR is not reporting symmetric overlaps.  We need to keep track of which overlaps we have reported.
my %reported;

open(F, "< $ARGV[0]") or die "Failed to open '$ARGV[0]' for reading sequence names.\n";
while (<F>) {
    my ($uid, $iid, $name) = split '\s+', $_;
    $IDmap{$name} = $iid;
}
close(F);

while (<STDIN>) {
    my @v = split '\s+', $_;

    $_ = join '\t', $_;

    #  The first read seems to have a sub-read range appended.
    if ($v[0] =~ m/^(.*)\/\d+_\d+$/) {
        $v[0] = $1;
    }

    my $aiid;
    my $biid;

    #  Or it might have UID,IID.  If this is ALWAYS the case, we don't need the uid map, yay!

    if ($v[0] =~ m/^\d+,(\d+)$/) {
        $aiid = $1;
    } else {
        $aiid = $IDmap{$v[0]};
    }

    if ($v[1] =~ m/^\d+,(\d+)$/) {
        $biid = $1;
    } else {
        $biid = $IDmap{$v[1]};
    }

    my $ni = ($v[8] == 0) ? "N" : "I";

    my $error = 100.0 - $v[3];
    my $ecorr = 100.0 - $v[3];

    if ((exists($reported{"$aiid-$biid"}) || (exists($reported{"$biid-$aiid"})))) {
        next;
    }

    $reported{"$aiid-$biid"} = 0;

    #  Argh!  Do NOT flip coords if reversed.
    #($v[9], $v[10]) = ($v[10], $v[9])  if ($v[8] == 1);

    die "No A iid found for '$v[0]'\n" if (!defined($aiid));
    die "No B iid found for '$v[1]'\n" if (!defined($biid));

    next if ($aiid == $biid);
    #next if ($aiid < $biid);

    die "First read is flipped\n$_\n" if ($v[4] != 0);

    my $a1 = $v[5];            #  Amount unaligned on left of first
    my $b1 = $v[7] - $v[6];
    my $a2 = $v[9];
    my $b2 = $v[11] - $v[10];

    #print "'$v[7]' '$v[8]' '$v[9]' '$v[10]'\n";
    #print "a1 $a1 from [5]       $v[5]\n";
    #print "b1 $b1 from [7]-[6]   $v[7] - $v[6]\n";
    #print "a2 $a2 from [9]       $v[9]\n";
    #print "b2 $b2 from [11]-[10] $v[11] - $v[10]\n";

    my $fl = $v[8];

    my $ahang = 0;
    my $bhang = 0;
    my $label = "";

    #  Handle A contained in B
    if (($a1 == 0) && ($b1 == 0)) {
        if ($fl == 0) {
            $ahang = -$a2;  $bhang =  $b2;  $label = "AinBf";
        } else {
            $ahang = -$b2;  $bhang =  $a2;  $label = "AinBr";
        }
    }

    #  Handle B contained in A
    elsif (($a2 == 0) && ($b2 == 0)) {
        if ($fl == 0) {
            $ahang =  $a1;  $bhang = -$b1;  $label = "BinAf";
        } else {
            $ahang =  $b1;  $bhang = -$a1;  $label = "BinAr";
        }
    }

    #  Handle a dovetail off the left end of A
    elsif (($a1 == 0) && ($b2 == 0) && ($fl == 0)) {
        $ahang = -$a2;  $bhang = -$b1;  $label = "BdoveAf";
    }
    elsif (($a1 == 0) && ($a2 == 0) && ($fl == 1)) {
        $ahang = -$b2;  $bhang = -$b1;  $label = "BdoveAr";
    }

    #  Handle dovetail off the right end of A
    elsif (($b1 == 0) && ($a2 == 0) && ($fl == 0)) {
        $ahang =  $a1;  $bhang =  $b2;  $label = "AdoveBf";
    }
    elsif (($b1 == 0) && ($b2 == 0) && ($fl == 1)) {
        $ahang =  $a1;  $bhang =  $a2;  $label = "AdoveBr";
    }

    #  All the rest aren't valid overlaps.
    else {
        next;
        $label = "INVALID";
    }

    #  HACK!
    $error = 0.01;
    $ecorr = 0.01;

    #print "$aiid\t$biid\t$ni\t$ahang\t$bhang\t$error\t$ecorr\t$label\t$_";
    print "$aiid\t$biid\t$ni\t$ahang\t$bhang\t$error\t$ecorr\n";
}
canu-1.6/src/overlapInCore-analysis/infer-olaps-from-pairwise-coords.pl000066400000000000000000000116511314437614700263310ustar00rootroot00000000000000
#!/usr/bin/env perl

###############################################################################
#
#  This file is part of canu, a software program that assembles whole-genome
#  sequencing reads into contigs.
#
#  This software is based on:
#    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
#    the 'kmer package' (http://kmer.sourceforge.net)
#  both originally distributed by Applera Corporation under the GNU General
#  Public License, version 2.
#
#  Canu branched from Celera Assembler at its revision 4587.
#  Canu branched from the kmer project at its revision 1994.
#
#  Modifications by:
#
#    Brian P. Walenz beginning on 2015-OCT-12
#      are a 'United States Government Work', and
#      are released in the public domain
#
#  File 'README.licenses' in the root directory of this distribution contains
#  full conditions and disclaimers for each license.
##

use strict;

#  cd /work/blasr-overlapper/JUNKTEST3/CAordered
#  time sh /work/scripts/blasr-overlaps.sh BLASROVERLAP.fasta BLASROVERLAP.fasta BLASROVERLAP
#7043.059u 16.347s 22:09.56 530.9%	6573+816k 11+192523io 11881pf+0w

#
#  NAME to IID MAP
#

my %NAMEtoIID;

open(F, "< CAordered/test.gkpStore.fastqUIDmap") or die "Failed to open 'CAordered/test.gkpStore.fastqUIDmap'\n";
while (<F>) {
    my @v = split '\s+', $_;
    #$IIDtoNAME{$v[1]} = $v[2];
    $NAMEtoIID{$v[2]} = $v[1];
}
close(F);

#
#  READ LENGTHS
#

my @readLength;
my %contained;

open(F, "gatekeeper -dumpfragments -tabular BL/test.gkpStore |");
while (<F>) {
    my @v = split '\s+', $_;
    $readLength[$v[1]] = $v[9];
}
close(F);

open(F, "gatekeeper -dumpfragments -tabular CAordered/test.gkpStore |");
while (<F>) {
    my @v = split '\s+', $_;
    die "IID $v[1] BL $readLength[$v[1]] != CA $v[9]\n" if ($readLength[$v[1]] != $v[9]);
}
close(F);

#
#
#

my %reported;

open(F, "< BO/PAIRS.blasr.sam.coords") or die "Failed to open 'BO/PAIRS.blasr.sam.coords'\n";
open(O, "> BO/PAIRS.blasr.sam.ova");

$_ = <F>;   #  skip the header lines in the coords file
$_ = <F>;
$_ = <F>;
$_ = <F>;

while (<F>) {
    my @v = split '\s+', $_;

    my $aIID = $NAMEtoIID{$v[9]};
    my $bIID = $NAMEtoIID{$v[10]};

    next if (!defined($aIID));   #  Extra overlaps are not an error.
    next if (!defined($bIID));

    die "undef for a $v[9]\n"  if (!defined($aIID));   #  Extra overlaps are an error.
    die "undef for b $v[10]\n" if (!defined($bIID));

    next if ($aIID == $bIID);

    next if (exists($reported{"$aIID-$bIID"}));

    $reported{"$aIID-$bIID"}++;
    $reported{"$bIID-$aIID"}++;

    my $l1 = $readLength[$aIID];
    my $a1 = $v[0] - 1;       #  Amount unaligned on left of first
    my $b1 = $l1 - $v[1];     #  Amount unaligned on right of first

    my $ori;
    my $l2 = $readLength[$bIID];
    my $a2;
    my $b2;

    if ($v[2] < $v[3]) {
        $ori = "N";
        $a2  = $v[2] - 1;
        $b2  = $l2 - $v[3];
    } else {
        $ori = "I";
        $a2  = $v[3] - 1;
        $b2  = $l2 - $v[2];
    }

    #  Extend near global to be global.
    my $maxTol = 10;

    $a1 = 0   if ($a1 < $maxTol);
    $b1 = $l1 if ($l1 < $b1 + $maxTol);
    $a2 = 0   if ($a2 < $maxTol);
    $b2 = $l2 if ($l2 < $b2 + $maxTol);

    my $fl = ($ori eq "I");

    my $ahang = 0;
    my $bhang = 0;
    my $label = "";

    #  Handle A contained in B
    if (($a1 == 0) && ($b1 == 0)) {
        if ($fl == 0) {
            $ahang = -$a2;  $bhang =  $b2;  $label = "AinBf";
        } else {
            $ahang = -$b2;  $bhang =  $a2;  $label = "AinBr";
        }
    }

    #  Handle B contained in A
    elsif (($a2 == 0) && ($b2 == 0)) {
        if ($fl == 0) {
            $ahang =  $a1;  $bhang = -$b1;  $label = "BinAf";
        } else {
            $ahang =  $b1;  $bhang = -$a1;  $label = "BinAr";
        }
    }

    #  Handle a dovetail off the left end of A
    elsif (($a1 == 0) && ($b2 == 0) && ($fl == 0)) {
        $ahang = -$a2;  $bhang = -$b1;  $label = "BdoveAf";
    }
    elsif (($a1 == 0) && ($a2 == 0) && ($fl == 1)) {
        $ahang = -$b2;  $bhang = -$b1;  $label = "BdoveAr";
    }

    #  Handle dovetail off the right end of A
    elsif (($b1 == 0) && ($a2 == 0) && ($fl == 0)) {
        $ahang =  $a1;  $bhang =  $b2;  $label = "AdoveBf";
    }
    elsif (($b1 == 0) && ($b2 == 0) && ($fl == 1)) {
        $ahang =  $a1;  $bhang =  $a2;  $label = "AdoveBr";
    }

    #  All the rest aren't valid overlaps.
    else {
        next;
        $label = "INVALID";
    }

    my $error = 100 - $v[8];
    my $ecorr = 100 - $v[8];

    #print "$aIID\t$bIID\t$ni\t$ahang\t$bhang\t$error\t$ecorr\t$label\t$_";
    print O "$aIID\t$bIID\t$ori\t$ahang\t$bhang\t$error\t$ecorr\n";
}
close(F);
close(O);

system("convertOverlap -ovl < BO/PAIRS.blasr.sam.ova > BO/PAIRS.blasr.sam.ovb");
system("overlapStoreBuild -g CAordered/test.gkpStore -o blasr-pairs.ovlStore -F 1 BO/PAIRS.blasr.sam.ovb");
canu-1.6/src/overlapInCore-analysis/infer-ovl-from-genomic-blasr.pl000066400000000000000000000105351314437614700254230ustar00rootroot00000000000000
#!/usr/bin/env perl

###############################################################################
#
#  This file is part of canu, a software program that assembles whole-genome
#  sequencing reads into contigs.
#
#  This software is based on:
#    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
#    the 'kmer package' (http://kmer.sourceforge.net)
#  both originally distributed by Applera Corporation under the GNU General
#  Public License, version 2.
#
#  Canu branched from Celera Assembler at its revision 4587.
#  Canu branched from the kmer project at its revision 1994.
#
#  Modifications by:
#
#    Brian P. Walenz beginning on 2015-OCT-12
#      are a 'United States Government Work', and
#      are released in the public domain
#
#  File 'README.licenses' in the root directory of this distribution contains
#  full conditions and disclaimers for each license.
##

use strict;

#  Convert BLASR default output to OVL overlaps: aIID bIID [I|N] aHang bHang error error_corrected

my %IDmap;

#  BLASR is not reporting symmetric overlaps.  We need to keep track of which overlaps we have reported.
my %reported;

open(F, "< $ARGV[0]") or die "Failed to open '$ARGV[0]' for reading sequence names.\n";
while (<F>) {
    my ($uid, $iid, $name) = split '\s+', $_;
    $IDmap{$name} = $iid;
}
close(F);

while (<STDIN>) {
    my @v = split '\s+', $_;

    $_ = join '\t', $_;

    #  The first read seems to have a sub-read range appended.
    if ($v[0] =~ m/^(.*)\/\d+_\d+$/) {
        $v[0] = $1;
    }

    my $aiid;
    my $biid;

    #  Or it might have UID,IID.  If this is ALWAYS the case, we don't need the uid map, yay!

    if ($v[0] =~ m/^\d+,(\d+)$/) {
        $aiid = $1;
    } else {
        $aiid = $IDmap{$v[0]};
    }

    if ($v[1] =~ m/^\d+,(\d+)$/) {
        $biid = $1;
    } else {
        $biid = $IDmap{$v[1]};
    }

    my $ni = ($v[8] == 0) ? "N" : "I";

    my $error = 100.0 - $v[3];
    my $ecorr = 100.0 - $v[3];

    if ((exists($reported{"$aiid-$biid"}) || (exists($reported{"$biid-$aiid"})))) {
        next;
    }

    $reported{"$aiid-$biid"} = 0;

    #  Argh!  Do NOT flip coords if reversed.
    #($v[9], $v[10]) = ($v[10], $v[9])  if ($v[8] == 1);

    die "No A iid found for '$v[0]'\n" if (!defined($aiid));
    die "No B iid found for '$v[1]'\n" if (!defined($biid));

    next if ($aiid == $biid);
    #next if ($aiid < $biid);

    die "First read is flipped\n$_\n" if ($v[4] != 0);

    my $a1 = $v[5];            #  Amount unaligned on left of first
    my $b1 = $v[7] - $v[6];
    my $a2 = $v[9];
    my $b2 = $v[11] - $v[10];

    #print "'$v[7]' '$v[8]' '$v[9]' '$v[10]'\n";
    #print "a1 $a1 from [5]       $v[5]\n";
    #print "b1 $b1 from [7]-[6]   $v[7] - $v[6]\n";
    #print "a2 $a2 from [9]       $v[9]\n";
    #print "b2 $b2 from [11]-[10] $v[11] - $v[10]\n";

    my $fl = $v[8];

    my $ahang = 0;
    my $bhang = 0;
    my $label = "";

    #  Handle A contained in B
    if (($a1 == 0) && ($b1 == 0)) {
        if ($fl == 0) {
            $ahang = -$a2;  $bhang =  $b2;  $label = "AinBf";
        } else {
            $ahang = -$b2;  $bhang =  $a2;  $label = "AinBr";
        }
    }

    #  Handle B contained in A
    elsif (($a2 == 0) && ($b2 == 0)) {
        if ($fl == 0) {
            $ahang =  $a1;  $bhang = -$b1;  $label = "BinAf";
        } else {
            $ahang =  $b1;  $bhang = -$a1;  $label = "BinAr";
        }
    }

    #  Handle a dovetail off the left end of A
    elsif (($a1 == 0) && ($b2 == 0) && ($fl == 0)) {
        $ahang = -$a2;  $bhang = -$b1;  $label = "BdoveAf";
    }
    elsif (($a1 == 0) && ($a2 == 0) && ($fl == 1)) {
        $ahang = -$b2;  $bhang = -$b1;  $label = "BdoveAr";
    }

    #  Handle dovetail off the right end of A
    elsif (($b1 == 0) && ($a2 == 0) && ($fl == 0)) {
        $ahang =  $a1;  $bhang =  $b2;  $label = "AdoveBf";
    }
    elsif (($b1 == 0) && ($b2 == 0) && ($fl == 1)) {
        $ahang =  $a1;  $bhang =  $a2;  $label = "AdoveBr";
    }

    #  All the rest aren't valid overlaps.
    else {
        next;
        $label = "INVALID";
    }

    #  HACK!
    $error = 0.01;
    $ecorr = 0.01;

    #print "$aiid\t$biid\t$ni\t$ahang\t$bhang\t$error\t$ecorr\t$label\t$_";
    print "$aiid\t$biid\t$ni\t$ahang\t$bhang\t$error\t$ecorr\n";
}
canu-1.6/src/overlapInCore/000077500000000000000000000000001314437614700156575ustar00rootroot00000000000000canu-1.6/src/overlapInCore/libedlib/000077500000000000000000000000001314437614700174255ustar00rootroot00000000000000canu-1.6/src/overlapInCore/libedlib/edlib.C000066400000000000000000001754661314437614700206270ustar00rootroot00000000000000

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Sergey Koren beginning on 2016-AUG-30
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *    Brian P. Walenz beginning on 2016-SEP-23
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

/*
 * The MIT License (MIT)
 *
 * Copyright (c) 2014 Martin Šošić
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy of
 * this software and associated documentation files (the "Software"), to deal in
 * the Software without restriction, including without limitation the rights to
 * use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
 * the Software, and to permit persons to whom the Software is furnished to do so,
 * subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in all
 * copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
 * FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
 * COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
 * IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
 * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 */

#include "edlib.H"

//  The angle-bracketed header names were lost in extraction; this list is
//  inferred from usage below (uint64_t, assert, memcpy/strlen, reverse, vector).
#include <stdint.h>
#include <cstdlib>
#include <cstring>
#include <cassert>
#include <algorithm>
#include <vector>

using namespace std;

typedef uint64_t Word;

static const int  WORD_SIZE     = sizeof(Word) * 8;           // Size of Word in bits
static const Word WORD_1        = (Word)1;
static const Word HIGH_BIT_MASK = WORD_1 << (WORD_SIZE - 1);  // 100..00

// Data needed to find alignment.
struct AlignmentData {
    Word* Ps;
    Word* Ms;
    int*  scores;
    int*  firstBlocks;
    int*  lastBlocks;

    AlignmentData(int maxNumBlocks, int targetLength) {
        // We build a complete table and mark first and last block for each column
        // (because algorithm is banded so only part of each columns is used).
        // TODO: do not build a whole table, but just enough blocks for each column.
        Ps     = new Word[maxNumBlocks * targetLength];
        Ms     = new Word[maxNumBlocks * targetLength];
        scores = new int[maxNumBlocks * targetLength];
        firstBlocks = new int[targetLength];
        lastBlocks  = new int[targetLength];
    }

    ~AlignmentData() {
        delete[] Ps;
        delete[] Ms;
        delete[] scores;
        delete[] firstBlocks;
        delete[] lastBlocks;
    }
};

struct Block {
    Word P;     // Pvin
    Word M;     // Mvin
    int  score; // score of last cell in block;

    Block() {}
    Block(Word P, Word M, int score) :P(P), M(M), score(score) {}
};


static int myersCalcEditDistanceSemiGlobal(const Word* Peq, int W, int maxNumBlocks,
                                           const unsigned char* query, int queryLength,
                                           const unsigned char* target, int targetLength,
                                           int alphabetLength, int k, EdlibAlignMode mode,
                                           int* bestScore_, int** positions_, int* numPositions_);

static int myersCalcEditDistanceNW(const Word* Peq, int W, int maxNumBlocks,
                                   const unsigned char* query, int queryLength,
                                   const unsigned char* target, int targetLength,
                                   int alphabetLength, int k, int* bestScore_,
                                   int* position_, bool findAlignment,
                                   AlignmentData** alignData, int targetStopPosition);

static int obtainAlignment(
        const unsigned char* query, const unsigned char* rQuery, int queryLength,
        const unsigned char* target, const unsigned char* rTarget, int targetLength,
        int alphabetLength, int bestScore,
        unsigned char** alignment, int* alignmentLength);

static int obtainAlignmentHirschberg(
        const unsigned char* query, const unsigned char* rQuery, int queryLength,
        const unsigned char* target, const unsigned char* rTarget, int targetLength,
        int alphabetLength, int bestScore,
        unsigned char** alignment, int* alignmentLength);

static int obtainAlignmentTraceback(int queryLength, int targetLength,
                                    int bestScore, const AlignmentData* alignData,
                                    unsigned char** alignment, int* alignmentLength);

static int transformSequences(const char* queryOriginal, int queryLength,
                              const char* targetOriginal, int targetLength,
                              unsigned char** queryTransformed,
                              unsigned char** targetTransformed);

static inline int ceilDiv(int x,
                          int y);

static inline unsigned char* createReverseCopy(const unsigned char* seq, int length);

static inline Word* buildPeq(int alphabetLength, const unsigned char* query, int queryLength);


/**
 * Main edlib method.
 */
EdlibAlignResult edlibAlign(const char* const queryOriginal, const int queryLength,
                            const char* const targetOriginal, const int targetLength,
                            const EdlibAlignConfig config) {
    EdlibAlignResult result;
    result.editDistance = -1;
    result.endLocations = result.startLocations = NULL;
    result.numLocations = 0;
    result.alignment = NULL;
    result.alignmentLength = 0;
    result.alphabetLength = 0;

    assert(queryLength > 0);
    assert(targetLength > 0);

    /*------------ TRANSFORM SEQUENCES AND RECOGNIZE ALPHABET -----------*/
    unsigned char* query, * target;
    int alphabetLength = transformSequences(queryOriginal, queryLength, targetOriginal, targetLength,
                                            &query, &target);
    result.alphabetLength = alphabetLength;
    /*-------------------------------------------------------*/

    /*--------------------- INITIALIZATION ------------------*/
    int maxNumBlocks = ceilDiv(queryLength, WORD_SIZE);  // bmax in Myers
    int W = maxNumBlocks * WORD_SIZE - queryLength;      // number of redundant cells in last level blocks
    Word* Peq = buildPeq(alphabetLength, query, queryLength);
    /*-------------------------------------------------------*/

    /*------------------ MAIN CALCULATION -------------------*/
    // TODO: Store alignment data only after k is determined? That could make things faster.
    int positionNW;  // Used only when mode is NW.
    AlignmentData* alignData = NULL;
    bool dynamicK = false;
    int k = config.k;
    if (k < 0) {  // If valid k is not given, auto-adjust k until solution is found.
        dynamicK = true;
        k = WORD_SIZE;  // Gives better results than smaller k.
    }

    do {
        if (config.mode == EDLIB_MODE_HW || config.mode == EDLIB_MODE_SHW) {
            myersCalcEditDistanceSemiGlobal(Peq, W, maxNumBlocks,
                                            query, queryLength, target, targetLength,
                                            alphabetLength, k, config.mode, &(result.editDistance),
                                            &(result.endLocations), &(result.numLocations));
        } else {  // mode == EDLIB_MODE_NW
            myersCalcEditDistanceNW(Peq, W, maxNumBlocks,
                                    query, queryLength, target, targetLength,
                                    alphabetLength, k, &(result.editDistance), &positionNW,
                                    false, &alignData, -1);
        }
        k *= 2;
    } while(dynamicK && result.editDistance == -1);

    if (result.editDistance >= 0) {  // If there is solution.
        // If NW mode, set end location explicitly.
        if (config.mode == EDLIB_MODE_NW) {
            result.endLocations = new int [1];
            result.endLocations[0] = targetLength - 1;
            result.numLocations = 1;
        }

        // Find starting locations.
        if (config.task == EDLIB_TASK_LOC || config.task == EDLIB_TASK_PATH) {
            result.startLocations = new int [result.numLocations];
            if (config.mode == EDLIB_MODE_HW) {  // If HW, I need to calculate start locations.
                const unsigned char* rTarget = createReverseCopy(target, targetLength);
                const unsigned char* rQuery  = createReverseCopy(query, queryLength);
                Word* rPeq = buildPeq(alphabetLength, rQuery, queryLength);  // Peq for reversed query
                for (int i = 0; i < result.numLocations; i++) {
                    int endLocation = result.endLocations[i];
                    int bestScoreSHW, numPositionsSHW;
                    int* positionsSHW;
                    myersCalcEditDistanceSemiGlobal(
                            rPeq, W, maxNumBlocks,
                            rQuery, queryLength, rTarget + targetLength - endLocation - 1, endLocation + 1,
                            alphabetLength, result.editDistance, EDLIB_MODE_SHW,
                            &bestScoreSHW, &positionsSHW, &numPositionsSHW);
                    // Taking last location as start ensures that alignment will not start with insertions
                    // if it can start with mismatches instead.
                    result.startLocations[i] = endLocation - positionsSHW[numPositionsSHW - 1];
                    delete[] positionsSHW;
                }
                delete[] rTarget;
                delete[] rQuery;
                delete[] rPeq;
            } else {  // If mode is SHW or NW
                for (int i = 0; i < result.numLocations; i++) {
                    result.startLocations[i] = 0;
                }
            }
        }

        // Find alignment -> all comes down to finding alignment for NW.
        // Currently we return alignment only for first pair of locations.
        if (config.task == EDLIB_TASK_PATH) {
            int alnStartLocation = result.startLocations[0];
            int alnEndLocation = result.endLocations[0];
            const unsigned char* alnTarget = target + alnStartLocation;
            const int alnTargetLength = alnEndLocation - alnStartLocation + 1;
            const unsigned char* rAlnTarget = createReverseCopy(alnTarget, alnTargetLength);
            const unsigned char* rQuery  = createReverseCopy(query, queryLength);
            obtainAlignment(query, rQuery, queryLength,
                            alnTarget, rAlnTarget, alnTargetLength,
                            alphabetLength, result.editDistance,
                            &(result.alignment), &(result.alignmentLength));
            delete[] rAlnTarget;
            delete[] rQuery;
        }
    }
    /*-------------------------------------------------------*/

    //--- Free memory ---//
    delete[] Peq;
    delete[] query;
    delete[] target;
    delete alignData;
    //-------------------//

    return result;
}


char* edlibAlignmentToCigar(const unsigned char* const alignment, const int alignmentLength,
                            const EdlibCigarFormat cigarFormat) {
    if (cigarFormat != EDLIB_CIGAR_EXTENDED && cigarFormat != EDLIB_CIGAR_STANDARD) {
        return 0;
    }

    // Maps move code from alignment to char in cigar.
    //                      0    1    2    3
    char moveCodeToChar[] = {'=', 'I', 'D', 'X'};
    if (cigarFormat == EDLIB_CIGAR_STANDARD) {
        moveCodeToChar[0] = moveCodeToChar[3] = 'M';
    }

    vector<char>* cigar = new vector<char>();
    char lastMove = 0;  // Char of last move. 0 if there was no previous move.
    int numOfSameMoves = 0;
    for (int i = 0; i <= alignmentLength; i++) {
        // if new sequence of same moves started
        if (i == alignmentLength || (moveCodeToChar[alignment[i]] != lastMove && lastMove != 0)) {
            // Write number of moves to cigar string.
            int numDigits = 0;
            for (; numOfSameMoves; numOfSameMoves /= 10) {
                cigar->push_back('0' + numOfSameMoves % 10);
                numDigits++;
            }
            reverse(cigar->end() - numDigits, cigar->end());
            // Write code of move to cigar string.
            cigar->push_back(lastMove);
            // If not at the end, start new sequence of moves.
            if (i < alignmentLength) {
                // Check if alignment has valid values.
                if (alignment[i] > 3) {
                    delete cigar;
                    return 0;
                }
                numOfSameMoves = 0;
            }
        }
        if (i < alignmentLength) {
            lastMove = moveCodeToChar[alignment[i]];
            numOfSameMoves++;
        }
    }
    cigar->push_back(0);  // Null character termination.
    char* cigar_ = new char [cigar->size()];
    memcpy(cigar_, &(*cigar)[0], cigar->size() * sizeof(char));
    delete cigar;

    return cigar_;
}


void edlibAlignmentToStrings(const unsigned char* alignment, int alignmentLength,
                             int tgtStart, int tgtEnd,
                             int qryStart, int qryEnd,
                             const char *tgt, const char *qry,
                             char *tgt_aln_str, char *qry_aln_str) {
    for (int a = 0, qryPos=qryStart, tgtPos=tgtStart; a < alignmentLength; a++) {
        assert(qryPos <= qryEnd && tgtPos <= tgtEnd);

        if (alignment[a] == EDLIB_EDOP_MATCH || alignment[a] == EDLIB_EDOP_MISMATCH) {
            // match or mismatch
            qry_aln_str[a] = qry[qryPos];
            tgt_aln_str[a] = tgt[tgtPos];
            qryPos++;
            tgtPos++;
        } else if (alignment[a] == EDLIB_EDOP_INSERT) {
            // insertion in target
            tgt_aln_str[a] = '-';
            qry_aln_str[a] = qry[qryPos];
            qryPos++;
        } else if (alignment[a] == EDLIB_EDOP_DELETE) {
            // insertion in query
            tgt_aln_str[a] = tgt[tgtPos];
            qry_aln_str[a] = '-';
            tgtPos++;
        }
    }
    qry_aln_str[alignmentLength] = tgt_aln_str[alignmentLength] = '\0';
    assert(strlen(qry_aln_str) == alignmentLength && strlen(tgt_aln_str) == alignmentLength);
}


/**
 * Build Peq table for given query and alphabet.
 * Peq is table of dimensions alphabetLength+1 x maxNumBlocks.
 * Bit i of Peq[s * maxNumBlocks + b] is 1 if i-th symbol from block b of query equals symbol s, otherwise it is 0.
 * NOTICE: free returned array with delete[]!
 */
static inline Word* buildPeq(const int alphabetLength, const unsigned char* const query,
                             const int queryLength) {
    int maxNumBlocks = ceilDiv(queryLength, WORD_SIZE);
    // table of dimensions alphabetLength+1 x maxNumBlocks. Last symbol is wildcard.
    Word* Peq = new Word[(alphabetLength + 1) * maxNumBlocks];

    // Build Peq (1 is match, 0 is mismatch). NOTE: last column is wildcard(symbol that matches anything) with just 1s
    for (int symbol = 0; symbol <= alphabetLength; symbol++) {
        for (int b = 0; b < maxNumBlocks; b++) {
            if (symbol < alphabetLength) {
                Peq[symbol * maxNumBlocks + b] = 0;
                for (int r = (b+1) * WORD_SIZE - 1; r >= b * WORD_SIZE; r--) {
                    Peq[symbol * maxNumBlocks + b] <<= 1;
                    // NOTE: We pretend like query is padded at the end with W wildcard symbols
                    if (r >= queryLength || query[r] == symbol)
                        Peq[symbol * maxNumBlocks + b] += 1;
                }
            } else {  // Last symbol is wildcard, so it is all 1s
                Peq[symbol * maxNumBlocks + b] = (Word)-1;
            }
        }
    }

    return Peq;
}


/**
 * Returns new sequence that is reverse of given sequence.
 */
static inline unsigned char* createReverseCopy(const unsigned char* const seq, const int length) {
    unsigned char* rSeq = new unsigned char[length];
    for (int i = 0; i < length; i++) {
        rSeq[i] = seq[length - i - 1];
    }
    return rSeq;
}


/**
 * Corresponds to Advance_Block function from Myers.
 * Calculates one word(block), which is part of a column.
 * Highest bit of word (one most to the left) is most bottom cell of block from column.
 * Pv[i] and Mv[i] define vin of cell[i]: vin = cell[i] - cell[i-1].
 * @param [in] Pv  Bitset, Pv[i] == 1 if vin is +1, otherwise Pv[i] == 0.
 * @param [in] Mv  Bitset, Mv[i] == 1 if vin is -1, otherwise Mv[i] == 0.
 * @param [in] Eq  Bitset, Eq[i] == 1 if match, 0 if mismatch.
 * @param [in] hin  Will be +1, 0 or -1.
 * @param [out] PvOut  Bitset, PvOut[i] == 1 if vout is +1, otherwise PvOut[i] == 0.
 * @param [out] MvOut  Bitset, MvOut[i] == 1 if vout is -1, otherwise MvOut[i] == 0.
 * @param [out] hout  Will be +1, 0 or -1.
 */
static inline int calculateBlock(Word Pv, Word Mv, Word Eq, const int hin,
                                 Word &PvOut, Word &MvOut) {
    // hin can be 1, -1 or 0.
    // 1  -> 00...01
    // 0  -> 00...00
    // -1 -> 11...11 (2-complement)

    Word hinIsNeg = (Word)(hin >> 2) & WORD_1;  // 00...001 if hin is -1, 00...000 if 0 or 1

    Word Xv = Eq | Mv;
    // This is instruction below written using 'if': if (hin < 0) Eq |= (Word)1;
    Eq |= hinIsNeg;
    Word Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq;

    Word Ph = Mv | ~(Xh | Pv);
    Word Mh = Pv & Xh;

    int hout = 0;
    // This is instruction below written using 'if': if (Ph & HIGH_BIT_MASK) hout = 1;
    hout = (Ph & HIGH_BIT_MASK) >> (WORD_SIZE - 1);
    // This is instruction below written using 'if': if (Mh & HIGH_BIT_MASK) hout = -1;
    hout -= (Mh & HIGH_BIT_MASK) >> (WORD_SIZE - 1);

    Ph <<= 1;
    Mh <<= 1;

    // This is instruction below written using 'if': if (hin < 0) Mh |= (Word)1;
    Mh |= hinIsNeg;
    // This is instruction below written using 'if': if (hin > 0) Ph |= (Word)1;
    Ph |= (Word)((hin + 1) >> 1);

    PvOut = Mh | ~(Xv | Ph);
    MvOut = Ph & Xv;

    return hout;
}


/**
 * Does ceiling division x / y.
 * Note: x and y must be non-negative and x + y must not overflow.
 */
static inline int ceilDiv(const int x, const int y) {
    return x % y ? x / y + 1 : x / y;
}

static inline int min(const int x, const int y) {
    return x < y ? x : y;
}

static inline int max(const int x, const int y) {
    return x > y ? x : y;
}


/**
 * @param [in] block
 * @return Values of cells in block, starting with bottom cell in block.
 */
static inline vector<int> getBlockCellValues(const Block block) {
    vector<int> scores(WORD_SIZE);
    int score = block.score;
    Word mask = HIGH_BIT_MASK;
    for (int i = 0; i < WORD_SIZE - 1; i++) {
        scores[i] = score;
        if (block.P & mask) score--;
        if (block.M & mask) score++;
        mask >>= 1;
    }
    scores[WORD_SIZE - 1] = score;
    return scores;
}


/**
 * Writes values of cells in block into given array, starting with first/top cell.
 * @param [in] block
 * @param [out] dest  Array into which cell values are written.
 *                    Must have size of at least WORD_SIZE.
 */
static inline void readBlock(const Block block, int* const dest) {
    int score = block.score;
    Word mask = HIGH_BIT_MASK;
    for (int i = 0; i < WORD_SIZE - 1; i++) {
        dest[WORD_SIZE - 1 - i] = score;
        if (block.P & mask) score--;
        if (block.M & mask) score++;
        mask >>= 1;
    }
    dest[0] = score;
}


/**
 * Writes values of cells in block into given array, starting with last/bottom cell.
 * @param [in] block
 * @param [out] dest  Array into which cell values are written. Must have size of at least WORD_SIZE.
 */
static inline void readBlockReverse(const Block block, int* const dest) {
    int score = block.score;
    Word mask = HIGH_BIT_MASK;
    for (int i = 0; i < WORD_SIZE - 1; i++) {
        dest[i] = score;
        if (block.P & mask) score--;
        if (block.M & mask) score++;
        mask >>= 1;
    }
    dest[WORD_SIZE - 1] = score;
}


/**
 * @param [in] block
 * @param [in] k
 * @return True if all cells in block have value larger than k, otherwise false.
 */
static inline bool allBlockCellsLarger(const Block block, const int k) {
    vector<int> scores = getBlockCellValues(block);
    for (int i = 0; i < WORD_SIZE; i++) {
        if (scores[i] <= k) return false;
    }
    return true;
}


/**
 * Uses Myers' bit-vector algorithm to find edit distance for one of semi-global alignment methods.
 * @param [in] Peq  Query profile.
 * @param [in] W  Size of padding in last block.
 *                TODO: Calculate this directly from query, instead of passing it.
 * @param [in] maxNumBlocks  Number of blocks needed to cover the whole query.
 *                           TODO: Calculate this directly from query, instead of passing it.
 * @param [in] query
 * @param [in] queryLength
 * @param [in] target
 * @param [in] targetLength
 * @param [in] alphabetLength
 * @param [in] k
 * @param [in] mode  EDLIB_MODE_HW or EDLIB_MODE_SHW
 * @param [out] bestScore_  Edit distance.
 * @param [out] positions_  Array of 0-indexed positions in target at which best score was found.
 *                          Make sure to free this array with free().
 * @param [out] numPositions_  Number of positions in the positions_ array.
 * @return Status.
 */
static int myersCalcEditDistanceSemiGlobal(
        const Word* const Peq, const int W, const int maxNumBlocks,
        const unsigned char* const query, const int queryLength,
        const unsigned char* const target, const int targetLength,
        const int alphabetLength, int k, const EdlibAlignMode mode,
        int* const bestScore_, int** const positions_, int* const numPositions_) {
    *positions_ = NULL;
    *numPositions_ = 0;

    // firstBlock is 0-based index of first block in Ukkonen band.
    // lastBlock is 0-based index of last block in Ukkonen band.
    int firstBlock = 0;
    int lastBlock = min(ceilDiv(k + 1, WORD_SIZE), maxNumBlocks) - 1; // y in Myers
    Block *bl; // Current block

    Block* blocks = new Block[maxNumBlocks];

    // For HW, solution will never be larger than queryLength.
    if (mode == EDLIB_MODE_HW) {
        k = min(queryLength, k);
    }

    // Every STRONG_REDUCE_NUM-th column is reduced in a more expensive way.
    // This gives a speed-up of about 2 times for small k.
    const int STRONG_REDUCE_NUM = 2048;

    // Initialize P, M and score
    bl = blocks;
    for (int b = 0; b <= lastBlock; b++) {
        bl->score = (b + 1) * WORD_SIZE;
        bl->P = (Word)-1; // All 1s
        bl->M = (Word)0;
        bl++;
    }

    int bestScore = -1;
    vector<int> positions; // TODO: Maybe put this on heap?
    const int startHout = mode == EDLIB_MODE_HW ?
                              0 : 1; // If 0, gap before query is not penalized
    const unsigned char* targetChar = target;
    for (int c = 0; c < targetLength; c++) { // for each column
        const Word* Peq_c = Peq + (*targetChar) * maxNumBlocks;

        //----------------------- Calculate column -------------------------//
        int hout = startHout;
        bl = blocks + firstBlock;
        Peq_c += firstBlock;
        for (int b = firstBlock; b <= lastBlock; b++) {
            hout = calculateBlock(bl->P, bl->M, *Peq_c, hout, bl->P, bl->M);
            bl->score += hout;
            bl++; Peq_c++;
        }
        bl--; Peq_c--;
        //------------------------------------------------------------------//

        //---------- Adjust number of blocks according to Ukkonen ----------//
        if ((lastBlock < maxNumBlocks - 1) && (bl->score - hout <= k) // bl is pointing to last block
            && ((*(Peq_c + 1) & WORD_1) || hout < 0)) { // Peq_c is pointing to last block
            // If score of left block is not too big, calculate one more block
            lastBlock++; bl++; Peq_c++;
            bl->P = (Word)-1; // All 1s
            bl->M = (Word)0;
            bl->score = (bl - 1)->score - hout + WORD_SIZE + calculateBlock(bl->P, bl->M, *Peq_c, hout, bl->P, bl->M);
        } else {
            while (lastBlock >= firstBlock && bl->score >= k + WORD_SIZE) {
                lastBlock--; bl--; Peq_c--;
            }
        }

        // Every STRONG_REDUCE_NUM columns, do a more expensive but more effective block reduction -> this is important!
        if (c % STRONG_REDUCE_NUM == 0) {
            while (lastBlock >= firstBlock && allBlockCellsLarger(*bl, k)) {
                lastBlock--; bl--; Peq_c--;
            }
        }

        if (mode != EDLIB_MODE_HW) {
            while (firstBlock <= lastBlock && blocks[firstBlock].score >= k + WORD_SIZE) {
                firstBlock++;
            }
            if (c % STRONG_REDUCE_NUM == 0) { // Do strong reduction every STRONG_REDUCE_NUM columns
                while (firstBlock <= lastBlock && allBlockCellsLarger(blocks[firstBlock], k)) {
                    firstBlock++;
                }
            }
        }

        // For HW, even if all cells are > k, there still may be a solution in the next
        // column because starting conditions at the upper boundary are 0.
        // That means the first block is always a candidate for a solution,
        // and we can never end the calculation before the last column.
        if (mode == EDLIB_MODE_HW) {
            lastBlock = max(0, lastBlock);
        }

        // If the band stops existing, finish
        if (lastBlock < firstBlock) {
            *bestScore_ = bestScore;
            if (bestScore != -1) {
                *positions_ = new int [positions.size()];
                *numPositions_ = positions.size();
                copy(positions.begin(), positions.end(), *positions_);
            }
            delete[] blocks;
            return EDLIB_STATUS_OK;
        }
        //------------------------------------------------------------------//

        //------------------------- Update best score ----------------------//
        if (lastBlock == maxNumBlocks - 1) {
            int colScore = bl->score;
            if (colScore <= k) { // Scores > k don't have correct values (so we cannot use them), but are certainly > k.
                // NOTE: Score that I find in column c is actually score from column c-W
                if (bestScore == -1 || colScore <= bestScore) {
                    if (colScore != bestScore) {
                        positions.clear();
                        bestScore = colScore;
                        // Change k so we will look only for equal or better
                        // scores than the best found so far.
                        k = bestScore;
                    }
                    positions.push_back(c - W);
                }
            }
        }
        //------------------------------------------------------------------//

        targetChar++;
    }

    // Obtain results for last W columns from last column.
    if (lastBlock == maxNumBlocks - 1) {
        vector<int> blockScores = getBlockCellValues(*bl);
        for (int i = 0; i < W; i++) {
            int colScore = blockScores[i + 1];
            if (colScore <= k && (bestScore == -1 || colScore <= bestScore)) {
                if (colScore != bestScore) {
                    positions.clear();
                    k = bestScore = colScore;
                }
                positions.push_back(targetLength - W + i);
            }
        }
    }

    *bestScore_ = bestScore;
    if (bestScore != -1) {
        *positions_ = new int [positions.size()];
        *numPositions_ = positions.size();
        copy(positions.begin(), positions.end(), *positions_);
    }

    delete[] blocks;
    return EDLIB_STATUS_OK;
}

/**
 * Uses Myers' bit-vector algorithm to find edit distance for global(NW) alignment method.
 * @param [in] Peq Query profile.
 * @param [in] W Size of padding in last block.
 *                TODO: Calculate this directly from query, instead of passing it.
 * @param [in] maxNumBlocks Number of blocks needed to cover the whole query.
 *                TODO: Calculate this directly from query, instead of passing it.
 * @param [in] query
 * @param [in] queryLength
 * @param [in] target
 * @param [in] targetLength
 * @param [in] alphabetLength
 * @param [in] k
 * @param [out] bestScore_ Edit distance.
 * @param [out] position_ 0-indexed position in target at which best score was found.
 * @param [in] findAlignment If true, whole matrix is remembered and alignment data is returned.
 *                Quadratic amount of memory is consumed.
 * @param [out] alignData Data needed for alignment traceback (for reconstruction of alignment).
 *                Set only if findAlignment is set to true, otherwise it is NULL.
 *                Make sure to free this array using delete[].
 * @param [out] targetStopPosition If set to -1, whole calculation is performed normally, as expected.
 *                If set to p, calculation is performed up to position p in target (inclusive)
 *                and column p is returned as the only column in alignData.
 * @return Status.
 */
static int myersCalcEditDistanceNW(const Word* const Peq, const int W, const int maxNumBlocks,
                                   const unsigned char* const query, const int queryLength,
                                   const unsigned char* const target, const int targetLength,
                                   const int alphabetLength, int k, int* const bestScore_,
                                   int* const position_, const bool findAlignment,
                                   AlignmentData** const alignData, const int targetStopPosition) {
    if (targetStopPosition > -1 && findAlignment) {
        // They cannot both be set at the same time!
        return EDLIB_STATUS_ERROR;
    }

    // Every STRONG_REDUCE_NUM-th column is reduced in a more expensive way.
    const int STRONG_REDUCE_NUM = 2048; // TODO: Choose this number dynamically (based on query and target lengths?), so it does not affect speed of computation

    if (k < abs(targetLength - queryLength)) {
        *bestScore_ = *position_ = -1;
        return EDLIB_STATUS_OK;
    }

    k = min(k, max(queryLength, targetLength)); // Upper bound for k

    // firstBlock is 0-based index of first block in Ukkonen band.
// lastBlock is 0-based index of last block in Ukkonen band. int firstBlock = 0; // This is optimal now, by my formula. int lastBlock = min(maxNumBlocks, ceilDiv(min(k, (k + queryLength - targetLength) / 2) + 1, WORD_SIZE)) - 1; Block* bl; // Current block Block* blocks = new Block[maxNumBlocks]; // Initialize P, M and score bl = blocks; for (int b = 0; b <= lastBlock; b++) { bl->score = (b + 1) * WORD_SIZE; bl->P = (Word)-1; // All 1s bl->M = (Word)0; bl++; } // If we want to find alignment, we have to store needed data. if (findAlignment) *alignData = new AlignmentData(maxNumBlocks, targetLength); else if (targetStopPosition > -1) *alignData = new AlignmentData(maxNumBlocks, 1); else *alignData = NULL; const unsigned char* targetChar = target; for (int c = 0; c < targetLength; c++) { // for each column const Word* Peq_c = Peq + *targetChar * maxNumBlocks; //----------------------- Calculate column -------------------------// int hout = 1; bl = blocks + firstBlock; for (int b = firstBlock; b <= lastBlock; b++) { hout = calculateBlock(bl->P, bl->M, Peq_c[b], hout, bl->P, bl->M); bl->score += hout; bl++; } bl--; //------------------------------------------------------------------// // bl now points to last block // Update k. I do it only on end of column because it would slow calculation too much otherwise. // NOTICE: I add W when in last block because it is actually result from W cells to the left and W cells up. k = min(k, bl->score + max(targetLength - c - 1, queryLength - ((1 + lastBlock) * WORD_SIZE - 1) - 1) + (lastBlock == maxNumBlocks - 1 ? W : 0)); //---------- Adjust number of blocks according to Ukkonen ----------// //--- Adjust last block ---// // If block is not beneath band, calculate next block. Only next because others are certainly beneath band. if (lastBlock + 1 < maxNumBlocks && !(//score[lastBlock] >= k + WORD_SIZE || // NOTICE: this condition could be satisfied if above block also! 
              ((lastBlock + 1) * WORD_SIZE - 1 >
               k - bl->score + 2 * WORD_SIZE - 2 - targetLength + c + queryLength))) {
            lastBlock++; bl++;
            bl->P = (Word)-1; // All 1s
            bl->M = (Word)0;
            int newHout = calculateBlock(bl->P, bl->M, Peq_c[lastBlock], hout, bl->P, bl->M);
            bl->score = (bl - 1)->score - hout + WORD_SIZE + newHout;
            hout = newHout;
        }

        // While block is out of band, move one block up.
        // NOTE: Condition used here is more loose than the one from the article, since I simplified the max() part of it.
        //       I could consider adding that max part, for optimal performance.
        while (lastBlock >= firstBlock
               && (bl->score >= k + WORD_SIZE
                   || ((lastBlock + 1) * WORD_SIZE - 1 > // TODO: Does not work if do not put +1! Why???
                       k - bl->score + 2 * WORD_SIZE - 2 - targetLength + c + queryLength + 1))) {
            lastBlock--; bl--;
        }
        //-------------------------//

        //--- Adjust first block ---//
        // While outside of band, advance block
        while (firstBlock <= lastBlock
               && (blocks[firstBlock].score >= k + WORD_SIZE
                   || ((firstBlock + 1) * WORD_SIZE - 1 <
                       blocks[firstBlock].score - k - targetLength + queryLength + c))) {
            firstBlock++;
        }
        //--------------------------/

        // TODO: consider if this part is useful, it does not seem to help much
        if (c % STRONG_REDUCE_NUM == 0) { // Every STRONG_REDUCE_NUM columns do a more expensive but more effective reduction
            while (lastBlock >= firstBlock) {
                // If all cells outside of band, remove block
                vector<int> scores = getBlockCellValues(*bl);
                int numCells = lastBlock == maxNumBlocks - 1 ? WORD_SIZE - W : WORD_SIZE;
                int r = lastBlock * WORD_SIZE + numCells - 1;
                bool reduce = true;
                for (int i = WORD_SIZE - numCells; i < WORD_SIZE; i++) {
                    // TODO: Does not work if do not put +1! Why???
                    if (scores[i] <= k && r <= k - scores[i] - targetLength + c + queryLength + 1) {
                        reduce = false;
                        break;
                    }
                    r--;
                }
                if (!reduce) break;
                lastBlock--; bl--;
            }
            while (firstBlock <= lastBlock) {
                // If all cells outside of band, remove block
                vector<int> scores = getBlockCellValues(blocks[firstBlock]);
                int numCells = firstBlock == maxNumBlocks - 1 ? WORD_SIZE - W : WORD_SIZE;
                int r = firstBlock * WORD_SIZE + numCells - 1;
                bool reduce = true;
                for (int i = WORD_SIZE - numCells; i < WORD_SIZE; i++) {
                    if (scores[i] <= k && r >= scores[i] - k - targetLength + c + queryLength) {
                        reduce = false;
                        break;
                    }
                    r--;
                }
                if (!reduce) break;
                firstBlock++;
            }
        }

        // If the band stops existing, finish
        if (lastBlock < firstBlock) {
            *bestScore_ = *position_ = -1;
            delete[] blocks;
            return EDLIB_STATUS_OK;
        }
        //------------------------------------------------------------------//

        //---- Save column so it can be used for reconstruction ----//
        if (findAlignment && c < targetLength) {
            bl = blocks + firstBlock;
            for (int b = firstBlock; b <= lastBlock; b++) {
                (*alignData)->Ps[maxNumBlocks * c + b] = bl->P;
                (*alignData)->Ms[maxNumBlocks * c + b] = bl->M;
                (*alignData)->scores[maxNumBlocks * c + b] = bl->score;
                (*alignData)->firstBlocks[c] = firstBlock;
                (*alignData)->lastBlocks[c] = lastBlock;
                bl++;
            }
        }
        //----------------------------------------------------------//

        //---- If this is stop column, save it and finish ----//
        if (c == targetStopPosition) {
            for (int b = firstBlock; b <= lastBlock; b++) {
                (*alignData)->Ps[b] = (blocks + b)->P;
                (*alignData)->Ms[b] = (blocks + b)->M;
                (*alignData)->scores[b] = (blocks + b)->score;
                (*alignData)->firstBlocks[0] = firstBlock;
                (*alignData)->lastBlocks[0] = lastBlock;
            }
            *bestScore_ = -1;
            *position_ = targetStopPosition;
            delete[] blocks;
            return EDLIB_STATUS_OK;
        }
        //----------------------------------------------------//

        targetChar++;
    }

    if (lastBlock == maxNumBlocks - 1) { // If last block of last column was calculated
        // Obtain best score from block -> it is complicated because query is
padded with W cells int bestScore = getBlockCellValues(blocks[lastBlock])[W]; if (bestScore <= k) { *bestScore_ = bestScore; *position_ = targetLength - 1; delete[] blocks; return EDLIB_STATUS_OK; } } *bestScore_ = *position_ = -1; delete[] blocks; return EDLIB_STATUS_OK; } /** * Finds one possible alignment that gives optimal score by moving back through the dynamic programming matrix, * that is stored in alignData. Consumes large amount of memory: O(queryLength * targetLength). * @param [in] queryLength Normal length, without W. * @param [in] targetLength Normal length, without W. * @param [in] bestScore Best score. * @param [in] alignData Data obtained during finding best score that is useful for finding alignment. * @param [out] alignment Alignment. * @param [out] alignmentLength Length of alignment. * @return Status code. */ static int obtainAlignmentTraceback(const int queryLength, const int targetLength, const int bestScore, const AlignmentData* const alignData, unsigned char** const alignment, int* const alignmentLength) { const int maxNumBlocks = ceilDiv(queryLength, WORD_SIZE); const int W = maxNumBlocks * WORD_SIZE - queryLength; *alignment = new unsigned char [queryLength + targetLength - 1]; *alignmentLength = 0; int c = targetLength - 1; // index of column int b = maxNumBlocks - 1; // index of block in column int currScore = bestScore; // Score of current cell int lScore = -1; // Score of left cell int uScore = -1; // Score of upper cell int ulScore = -1; // Score of upper left cell Word currP = alignData->Ps[c * maxNumBlocks + b]; // P of current block Word currM = alignData->Ms[c * maxNumBlocks + b]; // M of current block // True if block to left exists and is in band bool thereIsLeftBlock = c > 0 && b >= alignData->firstBlocks[c-1] && b <= alignData->lastBlocks[c-1]; // We set initial values of lP and lM to 0 only to avoid compiler warnings, they should not affect the // calculation as both lP and lM should be initialized at some moment later (but 
    // compiler cannot detect it since this initialization is guaranteed by "business" logic).
    Word lP = 0, lM = 0;
    if (thereIsLeftBlock) {
        lP = alignData->Ps[(c - 1) * maxNumBlocks + b]; // P of block to the left
        lM = alignData->Ms[(c - 1) * maxNumBlocks + b]; // M of block to the left
    }
    currP <<= W;
    currM <<= W;
    int blockPos = WORD_SIZE - W - 1; // 0-based index of current cell in its block

    // TODO(martin): refactor this whole piece of code. There are too many if-else statements,
    // it is too easy for a bug to hide and too hard to effectively cover all the edge-cases.
    // We need better separation of logic and responsibilities.
    while (true) {
        if (c == 0) {
            thereIsLeftBlock = true;
            lScore = b * WORD_SIZE + blockPos + 1;
            ulScore = lScore - 1;
        }

        // TODO: improvement: calculate only those cells that are needed,
        //       for example if I calculate upper cell and can move up,
        //       there is no need to calculate left and upper left cell
        //---------- Calculate scores ---------//
        if (lScore == -1 && thereIsLeftBlock) {
            lScore = alignData->scores[(c - 1) * maxNumBlocks + b]; // score of block to the left
            for (int i = 0; i < WORD_SIZE - blockPos - 1; i++) {
                if (lP & HIGH_BIT_MASK) lScore--;
                if (lM & HIGH_BIT_MASK) lScore++;
                lP <<= 1;
                lM <<= 1;
            }
        }
        if (ulScore == -1) {
            if (lScore != -1) {
                ulScore = lScore;
                if (lP & HIGH_BIT_MASK) ulScore--;
                if (lM & HIGH_BIT_MASK) ulScore++;
            }
            else if (c > 0 && b-1 >= alignData->firstBlocks[c-1] && b-1 <= alignData->lastBlocks[c-1]) {
                // This is the case when upper left cell is last cell in block,
                // and block to left is not in band so lScore is -1.
                ulScore = alignData->scores[(c - 1) * maxNumBlocks + b - 1];
            }
        }
        if (uScore == -1) {
            uScore = currScore;
            if (currP & HIGH_BIT_MASK) uScore--;
            if (currM & HIGH_BIT_MASK) uScore++;
            currP <<= 1;
            currM <<= 1;
        }
        //-------------------------------------//

        // TODO: should I check if there is upper block?
//-------------- Move --------------// // Move up - insertion to target - deletion from query if (uScore != -1 && uScore + 1 == currScore) { currScore = uScore; lScore = ulScore; uScore = ulScore = -1; if (blockPos == 0) { // If entering new (upper) block if (b == 0) { // If there are no cells above (only boundary cells) (*alignment)[(*alignmentLength)++] = EDLIB_EDOP_INSERT; // Move up for (int i = 0; i < c + 1; i++) // Move left until end (*alignment)[(*alignmentLength)++] = EDLIB_EDOP_DELETE; break; } else { blockPos = WORD_SIZE - 1; b--; currP = alignData->Ps[c * maxNumBlocks + b]; currM = alignData->Ms[c * maxNumBlocks + b]; if (c > 0 && b >= alignData->firstBlocks[c-1] && b <= alignData->lastBlocks[c-1]) { thereIsLeftBlock = true; lP = alignData->Ps[(c - 1) * maxNumBlocks + b]; // TODO: improve this, too many operations lM = alignData->Ms[(c - 1) * maxNumBlocks + b]; } else { thereIsLeftBlock = false; // TODO(martin): There may not be left block, but there can be left boundary - do we // handle this correctly then? Are l and ul score set correctly? I should check that / refactor this. 
} } } else { blockPos--; lP <<= 1; lM <<= 1; } // Mark move (*alignment)[(*alignmentLength)++] = EDLIB_EDOP_INSERT; } // Move left - deletion from target - insertion to query else if (lScore != -1 && lScore + 1 == currScore) { currScore = lScore; uScore = ulScore; lScore = ulScore = -1; c--; if (c == -1) { // If there are no cells to the left (only boundary cells) (*alignment)[(*alignmentLength)++] = EDLIB_EDOP_DELETE; // Move left int numUp = b * WORD_SIZE + blockPos + 1; for (int i = 0; i < numUp; i++) // Move up until end (*alignment)[(*alignmentLength)++] = EDLIB_EDOP_INSERT; break; } currP = lP; currM = lM; if (c > 0 && b >= alignData->firstBlocks[c-1] && b <= alignData->lastBlocks[c-1]) { thereIsLeftBlock = true; lP = alignData->Ps[(c - 1) * maxNumBlocks + b]; lM = alignData->Ms[(c - 1) * maxNumBlocks + b]; } else { if (c == 0) { // If there are no cells to the left (only boundary cells) thereIsLeftBlock = true; lScore = b * WORD_SIZE + blockPos + 1; ulScore = lScore - 1; } else { thereIsLeftBlock = false; } } // Mark move (*alignment)[(*alignmentLength)++] = EDLIB_EDOP_DELETE; } // Move up left - (mis)match else if (ulScore != -1) { unsigned char moveCode = ulScore == currScore ? 
EDLIB_EDOP_MATCH : EDLIB_EDOP_MISMATCH; currScore = ulScore; uScore = lScore = ulScore = -1; c--; if (c == -1) { // If there are no cells to the left (only boundary cells) (*alignment)[(*alignmentLength)++] = moveCode; // Move left int numUp = b * WORD_SIZE + blockPos; for (int i = 0; i < numUp; i++) // Move up until end (*alignment)[(*alignmentLength)++] = EDLIB_EDOP_INSERT; break; } if (blockPos == 0) { // If entering upper left block if (b == 0) { // If there are no more cells above (only boundary cells) (*alignment)[(*alignmentLength)++] = moveCode; // Move up left for (int i = 0; i < c + 1; i++) // Move left until end (*alignment)[(*alignmentLength)++] = EDLIB_EDOP_DELETE; break; } blockPos = WORD_SIZE - 1; b--; currP = alignData->Ps[c * maxNumBlocks + b]; currM = alignData->Ms[c * maxNumBlocks + b]; } else { // If entering left block blockPos--; currP = lP; currM = lM; currP <<= 1; currM <<= 1; } // Set new left block if (c > 0 && b >= alignData->firstBlocks[c-1] && b <= alignData->lastBlocks[c-1]) { thereIsLeftBlock = true; lP = alignData->Ps[(c - 1) * maxNumBlocks + b]; lM = alignData->Ms[(c - 1) * maxNumBlocks + b]; } else { if (c == 0) { // If there are no cells to the left (only boundary cells) thereIsLeftBlock = true; lScore = b * WORD_SIZE + blockPos + 1; ulScore = lScore - 1; } else { thereIsLeftBlock = false; } } // Mark move (*alignment)[(*alignmentLength)++] = moveCode; } else { // Reached end - finished! break; } //----------------------------------// } // BPW suspects this is just releasing memory. //*alignment = (unsigned char*) realloc(*alignment, (*alignmentLength) * sizeof(unsigned char)); reverse(*alignment, *alignment + (*alignmentLength)); return EDLIB_STATUS_OK; } /** * Finds one possible alignment that gives optimal score (bestScore). * It will split problem into smaller problems using Hirschberg's algorithm and when they are small enough, * it will solve them using traceback algorithm. 
* @param [in] query * @param [in] rQuery Reversed query. * @param [in] queryLength * @param [in] target * @param [in] rTarget Reversed target. * @param [in] targetLength * @param [in] alphabetLength * @param [in] bestScore Best(optimal) score. * @param [out] alignment Sequence of edit operations that make target equal to query. * @param [out] alignmentLength Length of alignment. * @return Status code. */ static int obtainAlignment( const unsigned char* const query, const unsigned char* const rQuery, const int queryLength, const unsigned char* const target, const unsigned char* const rTarget, const int targetLength, const int alphabetLength, const int bestScore, unsigned char** const alignment, int* const alignmentLength) { // Handle special case when one of sequences has length of 0. if (queryLength == 0 || targetLength == 0) { *alignmentLength = targetLength + queryLength; *alignment = new unsigned char [*alignmentLength]; for (int i = 0; i < *alignmentLength; i++) { (*alignment)[i] = queryLength == 0 ? EDLIB_EDOP_DELETE : EDLIB_EDOP_INSERT; } return EDLIB_STATUS_OK; } const int maxNumBlocks = ceilDiv(queryLength, WORD_SIZE); const int W = maxNumBlocks * WORD_SIZE - queryLength; int statusCode; // TODO: think about reducing number of memory allocations in alignment functions, probably // by sharing some memory that is allocated only once. That refers to: Peq, columns in Hirschberg, // and it could also be done for alignments - we could have one big array for alignment that would be // sparsely populated by each of steps in recursion, and at the end we would just consolidate those results. // If estimated memory consumption for traceback algorithm is smaller than 1MB use it, // otherwise use Hirschberg's algorithm. By running few tests I choose boundary of 1MB as optimal. 
long long alignmentDataSize = (long long) (2 * sizeof(Word) + sizeof(int)) * maxNumBlocks * targetLength + (long long) 2 * sizeof(int) * targetLength; if (alignmentDataSize < 1024 * 1024) { int score_, endLocation_; // Used only to call function. AlignmentData* alignData = NULL; Word* Peq = buildPeq(alphabetLength, query, queryLength); myersCalcEditDistanceNW(Peq, W, maxNumBlocks, query, queryLength, target, targetLength, alphabetLength, bestScore, &score_, &endLocation_, true, &alignData, -1); assert(score_ == bestScore); assert(endLocation_ == targetLength - 1); statusCode = obtainAlignmentTraceback(queryLength, targetLength, bestScore, alignData, alignment, alignmentLength); delete alignData; delete[] Peq; } else { statusCode = obtainAlignmentHirschberg(query, rQuery, queryLength, target, rTarget, targetLength, alphabetLength, bestScore, alignment, alignmentLength); } return statusCode; } /** * Finds one possible alignment that gives optimal score (bestScore). * Uses Hirschberg's algorithm to split problem into two sub-problems, solve them and combine them together. * @param [in] query * @param [in] rQuery Reversed query. * @param [in] queryLength * @param [in] target * @param [in] rTarget Reversed target. * @param [in] targetLength * @param [in] alphabetLength * @param [in] bestScore Best(optimal) score. * @param [out] alignment Sequence of edit operations that make target equal to query. * @param [out] alignmentLength Length of alignment. * @return Status code. 
 */
static int obtainAlignmentHirschberg(
        const unsigned char* const query, const unsigned char* const rQuery, const int queryLength,
        const unsigned char* const target, const unsigned char* const rTarget, const int targetLength,
        const int alphabetLength, const int bestScore,
        unsigned char** const alignment, int* const alignmentLength) {
    const int maxNumBlocks = ceilDiv(queryLength, WORD_SIZE);
    const int W = maxNumBlocks * WORD_SIZE - queryLength;

    Word* Peq = buildPeq(alphabetLength, query, queryLength);
    Word* rPeq = buildPeq(alphabetLength, rQuery, queryLength);

    // Used only to call functions.
    int score_, endLocation_;

    // Divide dynamic matrix into two halves, left and right.
    const int leftHalfWidth = targetLength / 2;
    const int rightHalfWidth = targetLength - leftHalfWidth;

    // Calculate left half.
    AlignmentData* alignDataLeftHalf = NULL;
    int leftHalfCalcStatus = myersCalcEditDistanceNW(
            Peq, W, maxNumBlocks, query, queryLength, target, targetLength,
            alphabetLength, bestScore, &score_, &endLocation_, false,
            &alignDataLeftHalf, leftHalfWidth - 1);

    // Calculate right half.
    AlignmentData* alignDataRightHalf = NULL;
    int rightHalfCalcStatus = myersCalcEditDistanceNW(
            rPeq, W, maxNumBlocks, rQuery, queryLength, rTarget, targetLength,
            alphabetLength, bestScore, &score_, &endLocation_, false,
            &alignDataRightHalf, rightHalfWidth - 1);

    delete[] Peq;
    delete[] rPeq;

    if (leftHalfCalcStatus == EDLIB_STATUS_ERROR || rightHalfCalcStatus == EDLIB_STATUS_ERROR) {
        if (alignDataLeftHalf) delete alignDataLeftHalf;
        if (alignDataRightHalf) delete alignDataRightHalf;
        return EDLIB_STATUS_ERROR;
    }

    // Unwrap the left half.
    int firstBlockIdxLeft = alignDataLeftHalf->firstBlocks[0];
    int lastBlockIdxLeft = alignDataLeftHalf->lastBlocks[0];
    // TODO: avoid this allocation by using some shared array?
    // scoresLeft contains scores from left column, starting with scoresLeftStartIdx row (query index)
    // and ending with scoresLeftEndIdx row (0-indexed).
    int scoresLeftLength = (lastBlockIdxLeft - firstBlockIdxLeft + 1) * WORD_SIZE;
    int* scoresLeft = new int[scoresLeftLength];
    for (int blockIdx = firstBlockIdxLeft; blockIdx <= lastBlockIdxLeft; blockIdx++) {
        Block block(alignDataLeftHalf->Ps[blockIdx], alignDataLeftHalf->Ms[blockIdx],
                    alignDataLeftHalf->scores[blockIdx]);
        readBlock(block, scoresLeft + (blockIdx - firstBlockIdxLeft) * WORD_SIZE);
    }
    int scoresLeftStartIdx = firstBlockIdxLeft * WORD_SIZE;
    // If last block contains padding, shorten the length of scores for the length of padding.
    if (lastBlockIdxLeft == maxNumBlocks - 1) {
        scoresLeftLength -= W;
    }

    // Unwrap the right half (I also reverse it while unwrapping).
    int firstBlockIdxRight = alignDataRightHalf->firstBlocks[0];
    int lastBlockIdxRight = alignDataRightHalf->lastBlocks[0];
    int scoresRightLength = (lastBlockIdxRight - firstBlockIdxRight + 1) * WORD_SIZE;
    int* scoresRight = new int[scoresRightLength];
    int* scoresRightOriginalStart = scoresRight;
    for (int blockIdx = firstBlockIdxRight; blockIdx <= lastBlockIdxRight; blockIdx++) {
        Block block(alignDataRightHalf->Ps[blockIdx], alignDataRightHalf->Ms[blockIdx],
                    alignDataRightHalf->scores[blockIdx]);
        readBlockReverse(block, scoresRight + (lastBlockIdxRight - blockIdx) * WORD_SIZE);
    }
    int scoresRightStartIdx = queryLength - (lastBlockIdxRight + 1) * WORD_SIZE;
    // If there is padding at the beginning of scoresRight (that can happen because of reversing that we do),
    // move pointer forward to remove the padding (that is why we remember originalStart).
    if (scoresRightStartIdx < 0) {
        assert(scoresRightStartIdx == -1 * W);
        scoresRight += W;
        scoresRightStartIdx += W;
        scoresRightLength -= W;
    }

    delete alignDataLeftHalf;
    delete alignDataRightHalf;

    //--------------------- Find the best move ----------------//
    // Find the query/row index of cell in left column which together with its lower right neighbour
    // from right column gives the best score (when summed).
We also have to consider boundary cells // (those cells at -1 indexes). // x| // -+- // |x int queryIdxLeftStart = max(scoresLeftStartIdx, scoresRightStartIdx - 1); int queryIdxLeftEnd = min(scoresLeftStartIdx + scoresLeftLength - 1, scoresRightStartIdx + scoresRightLength - 2); int leftScore, rightScore; int queryIdxLeftAlignment; // Query/row index of cell in left column where alignment is passing through. bool queryIdxLeftAlignmentFound = false; for (int queryIdx = queryIdxLeftStart; queryIdx <= queryIdxLeftEnd; queryIdx++) { leftScore = scoresLeft[queryIdx - scoresLeftStartIdx]; rightScore = scoresRight[queryIdx + 1 - scoresRightStartIdx]; if (leftScore + rightScore == bestScore) { queryIdxLeftAlignment = queryIdx; queryIdxLeftAlignmentFound = true; break; } } // Check boundary cells. if (!queryIdxLeftAlignmentFound && scoresLeftStartIdx == 0 && scoresRightStartIdx == 0) { leftScore = leftHalfWidth; rightScore = scoresRight[0]; if (leftScore + rightScore == bestScore) { queryIdxLeftAlignment = -1; queryIdxLeftAlignmentFound = true; } } if (!queryIdxLeftAlignmentFound && scoresLeftStartIdx + scoresLeftLength == queryLength && scoresRightStartIdx + scoresRightLength == queryLength) { leftScore = scoresLeft[scoresLeftLength - 1]; rightScore = rightHalfWidth; if (leftScore + rightScore == bestScore) { queryIdxLeftAlignment = queryLength - 1; queryIdxLeftAlignmentFound = true; } } delete[] scoresLeft; delete[] scoresRightOriginalStart; if (queryIdxLeftAlignmentFound == false) { // If there was no move that is part of optimal alignment, then there is no such alignment // or given bestScore is not correct! return EDLIB_STATUS_ERROR; } //----------------------------------------------------------// // Calculate alignments for upper half of left half (upper left - ul) // and lower half of right half (lower right - lr). 
const int ulHeight = queryIdxLeftAlignment + 1; const int lrHeight = queryLength - ulHeight; const int ulWidth = leftHalfWidth; const int lrWidth = rightHalfWidth; unsigned char* ulAlignment = NULL; int ulAlignmentLength; int ulStatusCode = obtainAlignment(query, rQuery + lrHeight, ulHeight, target, rTarget + lrWidth, ulWidth, alphabetLength, leftScore, &ulAlignment, &ulAlignmentLength); unsigned char* lrAlignment = NULL; int lrAlignmentLength; int lrStatusCode = obtainAlignment(query + ulHeight, rQuery, lrHeight, target + ulWidth, rTarget, lrWidth, alphabetLength, rightScore, &lrAlignment, &lrAlignmentLength); if (ulStatusCode == EDLIB_STATUS_ERROR || lrStatusCode == EDLIB_STATUS_ERROR) { delete[] ulAlignment; delete[] lrAlignment; return EDLIB_STATUS_ERROR; } // Build alignment by concatenating upper left alignment with lower right alignment. *alignmentLength = ulAlignmentLength + lrAlignmentLength; *alignment = new unsigned char [*alignmentLength]; memcpy(*alignment, ulAlignment, ulAlignmentLength); memcpy(*alignment + ulAlignmentLength, lrAlignment, lrAlignmentLength); delete[] ulAlignment; delete[] lrAlignment; return EDLIB_STATUS_OK; } /** * Takes char query and char target, recognizes alphabet and transforms them into unsigned char sequences * where elements in sequences are not any more letters of alphabet, but their index in alphabet. * Most of internal edlib functions expect such transformed sequences. * This function will allocate queryTransformed and targetTransformed, so make sure to free them when done. * Example: * Original sequences: "ACT" and "CGT". * Alphabet would be recognized as ['A', 'C', 'T', 'G']. Alphabet length = 4. * Transformed sequences: [0, 1, 2] and [1, 3, 2]. * @param [in] queryOriginal * @param [in] queryLength * @param [in] targetOriginal * @param [in] targetLength * @param [out] queryTransformed It will contain values in range [0, alphabet length - 1]. 
 * @param [out] targetTransformed It will contain values in range [0, alphabet length - 1].
 * @return Alphabet length - number of letters in recognized alphabet.
 */
static int transformSequences(const char* const queryOriginal, const int queryLength,
                              const char* const targetOriginal, const int targetLength,
                              unsigned char** const queryTransformed,
                              unsigned char** const targetTransformed) {
    // Alphabet is constructed from letters that are present in sequences.
    // Each letter is assigned an ordinal number, starting from 0 up to alphabetLength - 1,
    // and new query and target are created in which letters are replaced with their ordinal numbers.
    // This query and target are used in all the calculations later.
    *queryTransformed = new unsigned char [queryLength];
    *targetTransformed = new unsigned char [targetLength];

    // Alphabet information, it is constructed on the fly while transforming sequences.
    unsigned char letterIdx[256]; //!< letterIdx[c] is index of letter c in alphabet
    bool inAlphabet[256]; // inAlphabet[c] is true if c is in alphabet
    for (int i = 0; i < 256; i++) inAlphabet[i] = false;
    int alphabetLength = 0;

    for (int i = 0; i < queryLength; i++) {
        unsigned char c = static_cast<unsigned char>(queryOriginal[i]);
        if (!inAlphabet[c]) {
            inAlphabet[c] = true;
            letterIdx[c] = alphabetLength;
            alphabetLength++;
        }
        (*queryTransformed)[i] = letterIdx[c];
    }
    for (int i = 0; i < targetLength; i++) {
        unsigned char c = static_cast<unsigned char>(targetOriginal[i]);
        if (!inAlphabet[c]) {
            inAlphabet[c] = true;
            letterIdx[c] = alphabetLength;
            alphabetLength++;
        }
        (*targetTransformed)[i] = letterIdx[c];
    }

    return alphabetLength;
}

EdlibAlignConfig edlibNewAlignConfig(int k, EdlibAlignMode mode, EdlibAlignTask task) {
    EdlibAlignConfig config;
    config.k = k;
    config.mode = mode;
    config.task = task;
    return config;
}

EdlibAlignConfig edlibDefaultAlignConfig(void) {
    return edlibNewAlignConfig(-1, EDLIB_MODE_NW, EDLIB_TASK_DISTANCE);
}

void edlibFreeAlignResult(EdlibAlignResult result) {
    delete[] result.endLocations;
    delete[]
result.startLocations; delete[] result.alignment; } canu-1.6/src/overlapInCore/libedlib/edlib.H000066400000000000000000000253531314437614700206250ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Sergey Koren beginning on 2016-AUG-30 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ /* * The MIT License (MIT) * * Copyright (c) 2014 Martin Šošić * * Permission is hereby granted, free of charge, to any person obtaining a copy of * this software and associated documentation files (the "Software"), to deal in * the Software without restriction, including without limitation the rights to * use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of * the Software, and to permit persons to whom the Software is furnished to do so, * subject to the following conditions: * * The above copyright notice and this permission notice shall be included in all * copies or substantial portions of the Software. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS * FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR * COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER * IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ #ifndef EDLIB_H #define EDLIB_H /** * @file * @author Martin Sosic * @brief Main header file, containing all public functions and structures. */ // Status codes #define EDLIB_STATUS_OK 0 #define EDLIB_STATUS_ERROR 1 /** * Alignment methods - how should Edlib treat gaps before and after query? */ typedef enum { /** * Global method. This is the standard method. * Useful when you want to find out how similar is first sequence to second sequence. */ EDLIB_MODE_NW, /** * Prefix method. Similar to global method, but with a small twist - gap at query end is not penalized. * What that means is that deleting elements from the end of second sequence is "free"! * For example, if we had "AACT" and "AACTGGC", edit distance would be 0, because removing "GGC" from the end * of second sequence is "free" and does not count into total edit distance. This method is appropriate * when you want to find out how well first sequence fits at the beginning of second sequence. */ EDLIB_MODE_SHW, /** * Infix method. Similar as prefix method, but with one more twist - gaps at query end and start are * not penalized. What that means is that deleting elements from the start and end of second sequence is "free"! * For example, if we had ACT and CGACTGAC, edit distance would be 0, because removing CG from the start * and GAC from the end of second sequence is "free" and does not count into total edit distance. * This method is appropriate when you want to find out how well first sequence fits at any part of * second sequence. 
* For example, if your second sequence was a long text and your first sequence was a sentence from that text, * but slightly scrambled, you could use this method to discover how scrambled it is and where it fits in * that text. In bioinformatics, this method is appropriate for aligning read to a sequence. */ EDLIB_MODE_HW } EdlibAlignMode; /** * Alignment tasks - what do you want Edlib to do? */ typedef enum { EDLIB_TASK_DISTANCE, //!< Find edit distance and end locations. EDLIB_TASK_LOC, //!< Find edit distance, end locations and start locations. EDLIB_TASK_PATH //!< Find edit distance, end locations and start locations and alignment path. } EdlibAlignTask; /** * Describes cigar format. * @see http://samtools.github.io/hts-specs/SAMv1.pdf * @see http://drive5.com/usearch/manual/cigar.html */ typedef enum { EDLIB_CIGAR_STANDARD, //!< Match: 'M', Insertion: 'I', Deletion: 'D', Mismatch: 'M'. EDLIB_CIGAR_EXTENDED //!< Match: '=', Insertion: 'I', Deletion: 'D', Mismatch: 'X'. } EdlibCigarFormat; // Edit operations. #define EDLIB_EDOP_MATCH 0 //!< Match. #define EDLIB_EDOP_INSERT 1 //!< Insertion to target = deletion from query. #define EDLIB_EDOP_DELETE 2 //!< Deletion from target = insertion to query. #define EDLIB_EDOP_MISMATCH 3 //!< Mismatch. /** * @brief Configuration object for edlibAlign() function. */ typedef struct { /** * Set k to non-negative value to tell edlib that edit distance is not larger than k. * Smaller k can significantly improve speed of computation. * If edit distance is larger than k, edlib will set edit distance to -1. * Set k to negative value and edlib will internally auto-adjust k until score is found. */ int k; /** * Alignment method. * EDLIB_MODE_NW: global (Needleman-Wunsch) * EDLIB_MODE_SHW: prefix. Gap after query is not penalized. * EDLIB_MODE_HW: infix. Gaps before and after query are not penalized. */ EdlibAlignMode mode; /** * Alignment task - tells Edlib what to calculate. Less to calculate, faster it is. 
* EDLIB_TASK_DISTANCE - find edit distance and end locations of optimal alignment paths in target. * EDLIB_TASK_LOC - find edit distance and start and end locations of optimal alignment paths in target. * EDLIB_TASK_PATH - find edit distance, alignment path (and start and end locations of it in target). */ EdlibAlignTask task; } EdlibAlignConfig; /** * Helper method for easy construction of configuration object. * @return Configuration object filled with given parameters. */ EdlibAlignConfig edlibNewAlignConfig(int k, EdlibAlignMode mode, EdlibAlignTask task); /** * @return Default configuration object, with following defaults: * k = -1, mode = EDLIB_MODE_NW, task = EDLIB_TASK_DISTANCE. */ EdlibAlignConfig edlibDefaultAlignConfig(void); /** * Container for results of alignment done by edlibAlign() function. */ typedef struct { /** * -1 if k is non-negative and edit distance is larger than k. */ int editDistance; /** * Array of zero-based positions in target where optimal alignment paths end. * If gap after query is penalized, gap counts as part of query (NW), otherwise not. * Set to NULL if edit distance is larger than k. * If you do not free whole result object using edlibFreeAlignResult(), do not forget to use free(). */ int* endLocations; /** * Array of zero-based positions in target where optimal alignment paths start, * they correspond to endLocations. * If gap before query is penalized, gap counts as part of query (NW), otherwise not. * Set to NULL if not calculated or if edit distance is larger than k. * If you do not free whole result object using edlibFreeAlignResult(), do not forget to use free(). */ int* startLocations; /** * Number of end (and start) locations. */ int numLocations; /** * Alignment is found for first pair of start and end locations. * Set to NULL if not calculated. * Alignment is sequence of numbers: 0, 1, 2, 3. * 0 stands for match. * 1 stands for insertion to target. * 2 stands for insertion to query. * 3 stands for mismatch. 
* Alignment aligns query to target from beginning of query till end of query. * If gaps are not penalized, they are not in alignment. * If you do not free whole result object using edlibFreeAlignResult(), do not forget to use free(). */ unsigned char* alignment; /** * Length of alignment. */ int alignmentLength; /** * Number of different characters in query and target together. */ int alphabetLength; } EdlibAlignResult; /** * Frees memory in EdlibAlignResult that was allocated by edlib. * If you do not use it, make sure to free needed members manually using free(). */ void edlibFreeAlignResult(EdlibAlignResult result); /** * Aligns two sequences (query and target) using edit distance (Levenshtein distance). * Through config parameter, this function supports different alignment methods (global, prefix, infix), * as well as different modes of search (tasks). * It always returns edit distance and end locations of optimal alignment in target. * It optionally returns start locations of optimal alignment in target and alignment path, * if you choose appropriate tasks. * @param [in] query First sequence. * @param [in] queryLength Number of characters in first sequence. * @param [in] target Second sequence. * @param [in] targetLength Number of characters in second sequence. * @param [in] config Additional alignment parameters, like alignment method and wanted results. * @return Result of alignment, which can contain edit distance, start and end locations and alignment path. * Make sure to clean up the object using edlibFreeAlignResult() or by manually freeing needed members. */ EdlibAlignResult edlibAlign(const char* query, const int queryLength, const char* target, const int targetLength, const EdlibAlignConfig config); /** * Builds cigar string from given alignment sequence. * @param [in] alignment Alignment sequence. * 0 stands for match. * 1 stands for insertion to target. * 2 stands for insertion to query. * 3 stands for mismatch.
* @param [in] alignmentLength * @param [in] cigarFormat Cigar will be returned in specified format. * @return Cigar string. * I stands for insertion. * D stands for deletion. * X stands for mismatch. (used only in extended format) * = stands for match. (used only in extended format) * M stands for (mis)match. (used only in standard format) * String is null terminated. * Needed memory is allocated and given pointer is set to it. * Do not forget to free it later using free()! */ char* edlibAlignmentToCigar(const unsigned char* alignment, int alignmentLength, EdlibCigarFormat cigarFormat); void edlibAlignmentToStrings(const unsigned char* alignment, int alignmentLength, int tgtStart, int tgtEnd, int qryStart, int qryEnd, const char *tgt, const char *qry, char *tgt_aln_str, char *qry_aln_str); #endif // EDLIB_H canu-1.6/src/overlapInCore/liboverlap/000077500000000000000000000000001314437614700200165ustar00rootroot00000000000000canu-1.6/src/overlapInCore/liboverlap/Binomial_Bound.C000066400000000000000000000126451314437614700230130ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-FEB-09 to 2015-JUN-02 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
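The header above defines edit-op codes 0-3 and two CIGAR flavors for edlibAlignmentToCigar(). Below is a minimal, standalone sketch of the run-length encoding for the extended format ('=', 'I', 'D', 'X'); the function name is illustrative and not part of the edlib API.

```cpp
#include <string>

// Edit-op codes, matching EDLIB_EDOP_* in edlib.H.
enum { OP_MATCH = 0, OP_INSERT = 1, OP_DELETE = 2, OP_MISMATCH = 3 };

// Run-length encode an alignment array of 0..3 op codes into an
// extended-format CIGAR string ('=', 'I', 'D', 'X').
std::string alignmentToCigarExtended(const unsigned char *aln, int len) {
  const char opChar[4] = { '=', 'I', 'D', 'X' };
  std::string cigar;
  for (int i = 0; i < len; ) {
    int j = i;
    while (j < len && aln[j] == aln[i])   // extend the current run
      j++;
    cigar += std::to_string(j - i);       // run length
    cigar += opChar[aln[i]];              // op letter
    i = j;
  }
  return cigar;
}
```

For the standard format, ops 0 and 3 would both map to 'M', so adjacent match/mismatch runs would need to be merged before encoding.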
Walenz beginning on 2015-OCT-12 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "Binomial_Bound.H" #include "gkStore.H" #undef COMPUTE_IN_LOG_SPACE // Determined by EDIT_DIST_PROB_BOUND #define NORMAL_DISTRIB_THOLD 3.62 // Probability limit to "band" edit-distance calculation // Determines NORMAL_DISTRIB_THOLD #define EDIT_DIST_PROB_BOUND 1e-4 // Return the smallest n >= Start s.t. // prob [>= e errors in n binomial trials (p = error prob)] > EDIT_DIST_PROB_BOUND // int Binomial_Bound(int e, double p, int Start) { double Normal_Z, Mu_Power, Factorial, Poisson_Coeff; double q, Sum, P_Power, Q_Power, X; int k, n, Bin_Coeff, Ct; q = 1.0 - p; if (Start < e) Start = e; for (n = Start; n < AS_MAX_READLEN; n ++) { if (n <= 35) { Sum = 0.0; Bin_Coeff = 1; Ct = 0; P_Power = 1.0; Q_Power = pow (q, n); for (k = 0; k < e && 1.0 - Sum > EDIT_DIST_PROB_BOUND; k ++) { X = Bin_Coeff * P_Power * Q_Power; Sum += X; Bin_Coeff *= n - Ct; Bin_Coeff /= ++ Ct; P_Power *= p; Q_Power /= q; } if (1.0 - Sum > EDIT_DIST_PROB_BOUND) return(n); } else { Normal_Z = (e - 0.5 - n * p) / sqrt (n * p * q); if (Normal_Z <= NORMAL_DISTRIB_THOLD) return n; #ifndef COMPUTE_IN_LOG_SPACE Sum = 0.0; Mu_Power = 1.0; Factorial = 1.0; Poisson_Coeff = exp (- n * p); for (k = 0; k < e; k ++) { Sum += Mu_Power * Poisson_Coeff / Factorial; Mu_Power *= n * p; Factorial *= k + 1; } #else Sum = 0.0; Mu_Power = 0.0; Factorial = 0.0; Poisson_Coeff = - n * p; for (k = 0; k < e; k ++) { Sum += exp(Mu_Power + Poisson_Coeff - Factorial); Mu_Power += log(n * p); Factorial = lgamma(k + 1); } #endif if (1.0 - Sum > EDIT_DIST_PROB_BOUND) return(n); } } return(AS_MAX_READLEN); } void Initialize_Match_Limit(int32 *ml, double maxErate, int32 maxErrors) { int32 e = 0; int32 s = 1; int32 l = MIN(maxErrors, 2000); // Compute the first 2000 values; set to 
maxErrors to do no estimation // The number of errors that are ignored in setting probability bound for terminating alignment // extensions in edit distance calculations int32 ERRORS_FOR_FREE = 1; // Free errors. while (e <= ERRORS_FOR_FREE) ml[e++] = 0; // Compute the actual limits. This is _VERY_ expensive for longer reads. BITS=17 is about all // it can support. #ifdef DUMP_MATCH_LIMIT l = maxErrors; // For showing deviations #endif while (e < l) { s = Binomial_Bound(e - ERRORS_FOR_FREE, maxErate, s); ml[e] = s - 1; assert(ml[e] >= ml[e-1]); //if ((e % 100) == 0) // fprintf(stderr, " %8.4f%% - %8d / %8d\r", 100.0 * e / maxErrors, e, maxErrors); e++; } // Estimate the remaining limits. using a linear function based on a precomputed slope. // prefixEditDistance-matchLimitGenerate computes the data values for a bunch of error rates. // These are used to compute the slope of a line from the [2000] point through the [max] point. // These slopes fit, almost exactly, an a/x+b curve, and that curve is used to compute the slope // for any error rate. // #if AS_MAX_READLEN_BITS == 17 double sl = 0.962830901135531 / maxErate + 0.096810267016486; #endif #if AS_MAX_READLEN_BITS == 18 double sl = 0.964368146781421 / maxErate + 0.118101522100597; #endif #if AS_MAX_READLEN_BITS == 19 double sl = 0.963823337297648 / maxErate + 0.156091528250625; #endif #if AS_MAX_READLEN_BITS == 20 double sl = 0.971023157863311 / maxErate + 0.154425731746994; #endif #if AS_MAX_READLEN_BITS == 21 double sl = 0.982064188397525 / maxErate + 0.067835741959926; #endif #if AS_MAX_READLEN_BITS == 22 double sl = 0.986446300363063 / maxErate + 0.052358358862826; #endif #if AS_MAX_READLEN_BITS == 23 double sl = 0.989923769842693 / maxErate + 0.0395372203695468; #endif #if AS_MAX_READLEN_BITS == 24 double sl = 0.992440299290478 / maxErate + 0.036791522317757; #endif // And the first value. 
double vl = ml[e-1] + sl; while (e < maxErrors) { ml[e] = (int32)ceil(vl); vl += sl; e++; } #ifdef DUMP_MATCH_LIMIT FILE *F = fopen("values-new.dat", "w"); for (int32 e=0; e 0) { bot[bot_len++] = '-'; i++; nBgap++; } else { bot[bot_len++] = b[j++]; } } while (j < b_len && i < a_len) { bot[bot_len++] = b[j++]; i++; } bot[bot_len] = '\0'; } uint32 diffs = 0; for (int32 i=0; (i < top_len) || (i < bot_len); i += DISPLAY_WIDTH) { fprintf(stderr, "%d\n", i); fprintf(stderr, "A: "); for (int32 j=0; (j < DISPLAY_WIDTH) && (i+j < top_len); j++) putc(top[i+j], stderr); fprintf(stderr, "\n"); fprintf(stderr, "B: "); for (int32 j=0; (j < DISPLAY_WIDTH) && (i+j < bot_len); j++) putc(bot[i+j], stderr); fprintf(stderr, "\n"); fprintf(stderr, " "); for (int32 j=0; (jStart - 1; int32 S_Right_Begin = Match->Start + Match->Len; int32 S_Right_Len = S_Len - S_Right_Begin; int32 T_Left_Begin = Match->Offset - 1; int32 T_Right_Begin = Match->Offset + Match->Len; int32 T_Right_Len = T_Len - T_Right_Begin; int32 Total_Olap = (min(Match->Start, Match->Offset) + Match->Len + min(S_Right_Len, T_Right_Len)); int32 Error_Limit = Error_Bound[Total_Olap]; #ifdef SHOW_EXTEND_ALIGN fprintf(stdout, "prefixEditDistance::Extend_Alignment()-- limit olap of %u bases to %u errors - %f%%\n", Total_Olap, Error_Limit, 100.0 * Error_Limit / Total_Olap); fprintf(stdout, "prefixEditDistance::Extend_Alignment()-- S: %d-%d and %d-%d T: %d-%d and %d-%d\n", 0, S_Left_Begin, S_Right_Begin, S_Right_Begin + S_Right_Len, 0, T_Left_Begin, T_Right_Begin, T_Right_Begin + T_Right_Len); #endif Left_Delta_Len = 0; Right_Delta_Len = 0; bool invertLeftDeltas = false; bool invertRightDeltas = false; if ((S_Right_Len == 0) || (T_Right_Len == 0)) { S_Hi = 0; T_Hi = 0; rMatchToEnd = true; } else if (S_Right_Len <= T_Right_Len) { Right_Errors = forward(S + S_Right_Begin, S_Right_Len, T + T_Right_Begin, T_Right_Len, Error_Limit, S_Hi, T_Hi, rMatchToEnd); for (int32 i=0; i 0) { if (Right_Delta[0] > 0) Left_Delta[Left_Delta_Len++] = 
-(Right_Delta[0] + Leftover + Match->Len); else Left_Delta[Left_Delta_Len++] = -(Right_Delta[0] - Leftover - Match->Len); } // WHY?! Does this mean the inversion on the forward() calls is backwards? for (int32 i=1; i 0; k--) { assert(Edit_Array_Lazy[k] != NULL); from = d; max = 1 + Edit_Array_Lazy[k - 1][d]; if ((j = Edit_Array_Lazy[k - 1][d - 1]) > max) { from = d - 1; max = j; } if ((j = 1 + Edit_Array_Lazy[k - 1][d + 1]) > max) { from = d + 1; max = j; } if (from == d - 1) { Delta_Stack[Right_Delta_Len++] = max - last - 1; d--; last = Edit_Array_Lazy[k - 1][from]; } else if (from == d + 1) { Delta_Stack[Right_Delta_Len++] = last - (max - 1); d++; last = Edit_Array_Lazy[k - 1][from]; } } Delta_Stack[Right_Delta_Len++] = last + 1; k = 0; for (i = Right_Delta_Len - 1; i > 0; i--) Right_Delta[k++] = abs (Delta_Stack[i]) * Sign (Delta_Stack[i - 1]); Right_Delta_Len--; } // Return the minimum number of changes (inserts, deletes, replacements) // needed to match string A[0 .. (m-1)] with a prefix of string // T[0 .. (n-1)] if it's not more than Error_Limit . // If no match, return the number of errors for the best match // up to a branch point. // Put delta description of alignment in Right_Delta and set // Right_Delta_Len to the number of entries there if it's a complete // match. // Set A_End and T_End to the rightmost positions where the // alignment ended in A and T , respectively. // Set Match_To_End true if the match extended to the end // of at least one string; otherwise, set it false to indicate // a branch point.
int32 prefixEditDistance::forward(char *A, int32 m, char *T, int32 n, int32 Error_Limit, int32 &A_End, int32 &T_End, bool &Match_To_End) { double Score; int Max_Score_Len = 0, Max_Score_Best_d = 0, Max_Score_Best_e = 0; int Best_d, Best_e, From, Last, Longest, Max, Row; int d, e, i, j, k; assert (m <= n); Best_d = Best_e = Longest = 0; Right_Delta_Len = 0; for (Row = 0; Row < m && (A[Row] == T[Row] || A[Row] == 'n' || T[Row] == 'n'); Row++) ; if (Edit_Array_Lazy[0] == NULL) Allocate_More_Edit_Space(0); Edit_Array_Lazy[0][0] = Row; if (Row == m) { // Exact match A_End = T_End = m; Match_To_End = TRUE; #ifdef SHOW_EXTEND_ALIGN fprintf(stdout, "WorkArea %2d FWD exact match\n", omp_get_thread_num()); #endif return 0; } int32 Left = 0; int32 Right = 0; double Max_Score = 0.0; for (e = 1; e <= Error_Limit; e++) { if (Edit_Array_Lazy[e] == NULL) Allocate_More_Edit_Space(e); Left = MAX (Left - 1, -e); Right = MIN (Right + 1, e); Edit_Array_Lazy[e - 1][Left ] = -2; Edit_Array_Lazy[e - 1][Left - 1] = -2; Edit_Array_Lazy[e - 1][Right ] = -2; Edit_Array_Lazy[e - 1][Right + 1] = -2; for (d = Left; d <= Right; d++) { Row = 1 + Edit_Array_Lazy[e - 1][d]; if ((j = Edit_Array_Lazy[e - 1][d - 1]) > Row) Row = j; if ((j = 1 + Edit_Array_Lazy[e - 1][d + 1]) > Row) Row = j; while (Row < m && Row + d < n && (A[Row] == T[Row + d] || A[Row] == 'n' || T[Row + d] == 'n')) Row++; Edit_Array_Lazy[e][d] = Row; if (Row == m || Row + d == n) { // Check for branch point here caused by uneven distribution of errors Score = Row * Branch_Match_Value - e; // Assumes Branch_Match_Value - Branch_Error_Value == 1.0 int32 Tail_Len = Row - Max_Score_Len; bool abort = false; double slope = (double)(Max_Score - Score) / Tail_Len; if ((doingPartialOverlaps == true) && (Score < Max_Score)) abort = true; #ifdef SHOW_EXTEND_ALIGN fprintf(stdout, "WorkArea %2d FWD e=%d MIN=%d Tail_Len=%d Max_Score=%d Score=%d slope=%f SLOPE=%f\n", omp_get_thread_num(), e, MIN_BRANCH_END_DIST, Tail_Len, Max_Score, Score, slope, 
MIN_BRANCH_TAIL_SLOPE); #endif if ((e > MIN_BRANCH_END_DIST / 2) && (Tail_Len >= MIN_BRANCH_END_DIST) && (slope >= MIN_BRANCH_TAIL_SLOPE)) abort = true; if (abort) { A_End = Max_Score_Len; T_End = Max_Score_Len + Max_Score_Best_d; Set_Right_Delta (Max_Score_Best_e, Max_Score_Best_d); Match_To_End = FALSE; #ifdef SHOW_EXTEND_ALIGN fprintf(stdout, "WorkArea %2d FWD ABORT alignment at e=%d best_e=%d\n", omp_get_thread_num(), e, Max_Score_Best_e); #endif return(Max_Score_Best_e); } // Force last error to be mismatch rather than insertion if ((Row == m) && (1 + Edit_Array_Lazy[e - 1][d + 1] == Edit_Array_Lazy[e][d]) && (d < Right)) { d++; Edit_Array_Lazy[e][d] = Edit_Array_Lazy[e][d - 1]; } A_End = Row; // One past last align position T_End = Row + d; Set_Right_Delta (e, d); Match_To_End = TRUE; #ifdef SHOW_EXTEND_ALIGN fprintf(stdout, "WorkArea %2d FWD END alignment at e=%d\n", omp_get_thread_num(), e); #endif return(e); } } while ((Left <= Right) && (Left < 0) && (Edit_Array_Lazy[e][Left] < Edit_Match_Limit[e])) Left++; if (Left >= 0) while ((Left <= Right) && (Edit_Array_Lazy[e][Left] + Left < Edit_Match_Limit[e])) Left++; if (Left > Right) { #ifdef SHOW_EXTEND_ALIGN //fprintf(stdout, "WorkArea %2d FWD BREAK at Left=%d Right=%d\n", omp_get_thread_num(), Left, Right); #endif break; } while ((Right > 0) && (Edit_Array_Lazy[e][Right] + Right < Edit_Match_Limit[e])) Right--; if (Right <= 0) while (Edit_Array_Lazy[e][Right] < Edit_Match_Limit[e]) Right--; assert (Left <= Right); for (d = Left; d <= Right; d++) if (Edit_Array_Lazy[e][d] > Longest) { Best_d = d; Best_e = e; Longest = Edit_Array_Lazy[e][d]; } Score = Longest * Branch_Match_Value - e; // Assumes Branch_Match_Value - Branch_Error_Value == 1.0 if (Score > Max_Score) { Max_Score = Score; Max_Score_Len = Longest; Max_Score_Best_d = Best_d; Max_Score_Best_e = Best_e; } } #ifdef SHOW_EXTEND_ALIGN fprintf(stdout, "WorkArea %2d FWD ERROR_LIMIT at e=%d Error_Limit=%d best_e=%d\n", omp_get_thread_num(), e, Error_Limit, 
Max_Score_Best_e); #endif A_End = Max_Score_Len; T_End = Max_Score_Len + Max_Score_Best_d; Set_Right_Delta (Max_Score_Best_e, Max_Score_Best_d); Match_To_End = FALSE; return Max_Score_Best_e; } canu-1.6/src/overlapInCore/liboverlap/prefixEditDistance-matchLimit.sh000066400000000000000000000007461314437614700262300ustar00rootroot00000000000000#!/bin/sh bgn=`expr $SGE_TASK_ID + 0` end=`expr $SGE_TASK_ID + 1` ./bin-bound-test $bgn $end for er in 0100 0200 0300 0400 0500 0600 0700 0800 0900 1000 \ 1100 1200 1300 1400 1500 1600 1700 1800 1900 2000 \ 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 \ 3100 3200 3300 3400 3500 3600 3700 3800 3900 4000 \ 4100 4200 4300 4400 4500 4600 4700 4800 4900 5000 ; do qsub -cwd -j y -o /dev/null -b y ./prefixEditDistance-matchLimit $er done canu-1.6/src/overlapInCore/liboverlap/prefixEditDistance-matchLimitGenerate.C000066400000000000000000000160371314437614700274530ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-JUN-02 to 2015-AUG-14 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
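The Edit_Match_Limit tables used by forward()/reverse() (filled by Initialize_Match_Limit() and the generator that follows) rest on Binomial_Bound(): the smallest alignment length n at which e errors still occur with probability above EDIT_DIST_PROB_BOUND. Below is a simplified standalone sketch of its exact branch, using double arithmetic throughout (the original uses an integer binomial coefficient for n <= 35 and switches to a normal/Poisson approximation beyond); the names are illustrative.

```cpp
#include <cmath>

// P(X >= e) for X ~ Binomial(n, p): sum P(X = k) for k < e, subtract from 1.
double binomialTailAtLeast(int e, int n, double p) {
  double q = 1.0 - p;
  double sum = 0.0, binCoeff = 1.0, pPow = 1.0, qPow = std::pow(q, n);
  for (int k = 0; k < e; k++) {
    sum      += binCoeff * pPow * qPow;           // add P(X = k)
    binCoeff  = binCoeff * (n - k) / (k + 1);     // C(n, k+1) from C(n, k)
    pPow     *= p;
    qPow     /= q;
  }
  return 1.0 - sum;
}

// Smallest n >= max(start, e) whose tail probability exceeds 'bound', i.e.
// the shortest alignment for which e errors are no longer "surprising".
int binomialBound(int e, double p, int start, int maxN, double bound) {
  if (start < e)
    start = e;
  for (int n = start; n < maxN; n++)
    if (binomialTailAtLeast(e, n, p) > bound)
      return n;
  return maxN;
}
```

Initialize_Match_Limit() computes this per error count up to 2000 errors, then extrapolates with the precomputed per-BITS slopes to avoid the very expensive exact computation for long reads.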
Walenz beginning on 2016-FEB-05 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-FEB-15 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "Binomial_Bound.H" #ifndef BROKEN_CLANG_OpenMP #include <omp.h> #endif // To use: // // Set read length in gkStore.H. Run this "50 5000 50" to compute data from 0.50% to 50.00% // in steps of 0.50%. It will write three output files with the data: // *.C - for inclusion in programs // *.bin - binary dump of the array // *.dat - ascii integer dump of the array // // WARNING! BITS=21 needs about 40 CPU hours. (5000 took 2:40; 3800 took 1:28) // // The *.dat also contains the slope of the line from [0] to [i] and from [i] to [max] for each i. // // To generate the slope parameters used in Binomial_Bound.C (BITS=20 and BITS=21 are expensive): // // prefixEditDistance-matchLimitGenerate 100 5000 100 // // grep '^2000 ' *21/*dat | sed 's/.dat:/ /' | sed 's/prefixEditDistance-matchLimitData-BITS=[0-9][0-9]\/prefixEditDistance-matchLimit-//' > slopes // // gnuplot: // f(x) = a/x+b // fit f(x) 'slopes' using 1:5 via a,b // plot 'slopes' using 1:5 with lines, f(x) // show var // plug a and b into Binomial_Bound.C // int main(int argc, char **argv) { int32 minEvalue = 0; int32 maxEvalue = 0; int32 step = 1; char D[FILENAME_MAX]; char O[FILENAME_MAX]; if (argc == 2) { minEvalue = atoi(argv[1]); maxEvalue = minEvalue; } else if (argc == 3) { minEvalue = atoi(argv[1]); maxEvalue = atoi(argv[2]); } else if (argc == 4) { minEvalue = atoi(argv[1]); maxEvalue = atoi(argv[2]); step = atoi(argv[3]); } else { fprintf(stderr, "usage: %s minEvalue [maxEvalue [step]]\n", argv[0]); fprintf(stderr, " computes overlapper probabilities for minEvalue <= eValue <= maxEvalue'\n");
fprintf(stderr, " eValue 100 == 0.01 fraction error == 1%% error\n"); exit(1); } fprintf(stderr, "Computing Edit_Match_Limit data for reads of length %ubp (bits = %u).\n", AS_MAX_READLEN, AS_MAX_READLEN_BITS); sprintf(D, "prefixEditDistance-matchLimitData-BITS=%01d", AS_MAX_READLEN_BITS); AS_UTL_mkdir(D); #pragma omp parallel for schedule(dynamic, 1) for (int32 evalue=maxEvalue; evalue>=minEvalue; evalue -= step) { char N[FILENAME_MAX]; // Local to this thread! double erate = evalue / 10000.0; int32 start = 1; int32 MAX_ERRORS = (1 + (int) (erate * AS_MAX_READLEN)); int32 ERRORS_FOR_FREE = 1; int32 *starts = new int32 [MAX_ERRORS + 1]; memset(starts, 0, sizeof(int32) * (MAX_ERRORS + 1)); sprintf(N, "%s/prefixEditDistance-matchLimit-%04d.bin", D, evalue); if (AS_UTL_fileExists(N)) { fprintf(stderr, "eValue %04d -- eRate %6.4f -- %7.4f%% error -- %8d values -- thread %2d - LOAD\n", evalue, erate, erate * 100.0, MAX_ERRORS, omp_get_thread_num()); errno = 0; FILE *F = fopen(N, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", N, strerror(errno)), exit(1); int32 me = 0; double er = 0.0; fread(&me, sizeof(int32), 1, F); fread(&er, sizeof(double), 1, F); fread( starts, sizeof(int32), MAX_ERRORS, F); assert(me == MAX_ERRORS); assert(er == erate); fclose(F); } else { fprintf(stderr, "eValue %04d -- eRate %6.4f -- %7.4f%% error -- %8d values -- thread %2d - COMPUTE\n", evalue, erate, erate * 100.0, MAX_ERRORS, omp_get_thread_num()); for (int32 e=ERRORS_FOR_FREE + 1; e 0; k--) { assert(Edit_Array_Lazy[k] != NULL); from = d; max = 1 + Edit_Array_Lazy[k - 1][d]; if ((j = Edit_Array_Lazy[k - 1][d - 1]) > max) { from = d - 1; max = j; } if ((j = 1 + Edit_Array_Lazy[k - 1][d + 1]) > max) { from = d + 1; max = j; } if (from == d - 1) { Left_Delta[Left_Delta_Len++] = max - last - 1; d--; last = Edit_Array_Lazy[k - 1][from]; } else if (from == d + 1) { Left_Delta[Left_Delta_Len++] = last - (max - 1); d++; last = Edit_Array_Lazy[k - 1][from]; } } leftover = 
last; // Don't allow first delta to be +1 or -1 assert (Left_Delta_Len == 0 || Left_Delta[0] != -1); // BPW - The original test was Left_Delta_Len>0, but that hit uninitialized data when Len == 1. if (Left_Delta_Len > 1 && Left_Delta[0] == 1 && t_end + t_len > 0) { int i; if (Left_Delta[1] > 0) // <- uninitialized here Left_Delta[0] = Left_Delta[1] + 1; else Left_Delta[0] = Left_Delta[1] - 1; for (i = 2; i < Left_Delta_Len; i++) Left_Delta[i - 1] = Left_Delta[i]; Left_Delta_Len--; t_end--; if (Left_Delta_Len == 0) leftover++; } } // Return the minimum number of changes (inserts, deletes, replacements) // needed to match string A[0 .. (1-m)] right-to-left with a prefix of string // T[0 .. (1-n)] right-to-left if it's not more than Error_Limit . // If no match, return the number of errors for the best match // up to a branch point. // Put delta description of alignment in Left_Delta and set // Left_Delta_Len to the number of entries there. // Set A_End and T_End to the leftmost positions where the // alignment ended in A and T , respectively. // If the alignment succeeds set Leftover to the number of // characters that match after the last Left_Delta entry; // otherwise, set Leftover to zero. // Set Match_To_End true if the match extended to the end // of at least one string; otherwise, set it false to indicate // a branch point. 
int32 prefixEditDistance::reverse(char *A, int32 m, char *T, int32 n, int32 Error_Limit, int32 &A_End, int32 &T_End, int32 &Leftover, // <- novel bool &Match_To_End) { double Score; int Max_Score_Len = 0, Max_Score_Best_d = 0, Max_Score_Best_e = 0; int Best_d, Best_e, From, Last, Longest, Max, Row; int d, e, j, k; assert (m <= n); Best_d = Best_e = Longest = 0; Left_Delta_Len = 0; for (Row = 0; Row < m && (A[- Row] == T[- Row] || A[- Row] == 'n' || T[- Row] == 'n'); Row++) ; if (Edit_Array_Lazy[0] == NULL) Allocate_More_Edit_Space(0); Edit_Array_Lazy[0][0] = Row; if (Row == m) { A_End = T_End = - m; Leftover = m; Match_To_End = TRUE; #ifdef SHOW_EXTEND_ALIGN fprintf(stdout, "WorkArea %2d REV exact match\n", omp_get_thread_num()); #endif return 0; } int32 Left = 0; int32 Right = 0; double Max_Score = 0.0; for (e = 1; e <= Error_Limit; e++) { if (Edit_Array_Lazy[e] == NULL) Allocate_More_Edit_Space(e); Left = MAX (Left - 1, -e); Right = MIN (Right + 1, e); Edit_Array_Lazy[e - 1][Left ] = -2; Edit_Array_Lazy[e - 1][Left - 1] = -2; Edit_Array_Lazy[e - 1][Right ] = -2; Edit_Array_Lazy[e - 1][Right + 1] = -2; for (d = Left; d <= Right; d++) { Row = 1 + Edit_Array_Lazy[e - 1][d]; if ((j = Edit_Array_Lazy[e - 1][d - 1]) > Row) Row = j; if ((j = 1 + Edit_Array_Lazy[e - 1][d + 1]) > Row) Row = j; while (Row < m && Row + d < n && (A[- Row] == T[- Row - d] || A[- Row] == 'n' || T[- Row - d] == 'n')) Row++; Edit_Array_Lazy[e][d] = Row; if (Row == m || Row + d == n) { // Check for branch point here caused by uneven distribution of errors Score = Row * Branch_Match_Value - e; // Assumes Branch_Match_Value - Branch_Error_Value == 1.0 int32 Tail_Len = Row - Max_Score_Len; bool abort = false; double slope = (double)(Max_Score - Score) / Tail_Len; if ((doingPartialOverlaps == true) && (Score < Max_Score)) abort = true; #ifdef SHOW_EXTEND_ALIGN fprintf(stdout, "WorkArea %2d REV e=%d MIN=%d Tail_Len=%d Max_Score=%d Score=%d slope=%f SLOPE=%f\n", omp_get_thread_num(), e, 
MIN_BRANCH_END_DIST, Tail_Len, Max_Score, Score, slope, MIN_BRANCH_TAIL_SLOPE); #endif if ((e > MIN_BRANCH_END_DIST / 2) && (Tail_Len >= MIN_BRANCH_END_DIST) && (slope >= MIN_BRANCH_TAIL_SLOPE)) abort = true; if (abort) { A_End = - Max_Score_Len; T_End = - Max_Score_Len - Max_Score_Best_d; Set_Left_Delta (Max_Score_Best_e, Max_Score_Best_d, Leftover, T_End, n); Match_To_End = FALSE; #ifdef SHOW_EXTEND_ALIGN fprintf(stdout, "WorkArea %2d REV ABORT alignment at e=%d best_e=%d\n", omp_get_thread_num(), e, Max_Score_Best_e); #endif return(Max_Score_Best_e); } A_End = - Row; // One past last align position T_End = - Row - d; Set_Left_Delta (e, d, Leftover, T_End, n); Match_To_End = TRUE; #ifdef SHOW_EXTEND_ALIGN fprintf(stdout, "WorkArea %2d REV END alignment at e=%d\n", omp_get_thread_num(), e); #endif return(e); } } while ((Left <= Right) && (Left < 0) && (Edit_Array_Lazy[e][Left] < Edit_Match_Limit[e])) Left++; if (Left >= 0) while ((Left <= Right) && (Edit_Array_Lazy[e][Left] + Left < Edit_Match_Limit[e])) Left++; if (Left > Right) { #ifdef SHOW_EXTEND_ALIGN //fprintf(stdout, "WorkArea %2d REV BREAK at Left=%d Right=%d\n", omp_get_thread_num(), Left, Right); #endif break; } while ((Right > 0) && (Edit_Array_Lazy[e][Right] + Right < Edit_Match_Limit[e])) Right--; if (Right <= 0) while (Edit_Array_Lazy[e][Right] < Edit_Match_Limit[e]) Right--; assert (Left <= Right); for (d = Left; d <= Right; d++) if (Edit_Array_Lazy[e][d] > Longest) { Best_d = d; Best_e = e; Longest = Edit_Array_Lazy[e][d]; } Score = Longest * Branch_Match_Value - e; // Assumes Branch_Match_Value - Branch_Error_Value == 1.0 if (Score > Max_Score) { Max_Score = Score; Max_Score_Len = Longest; Max_Score_Best_d = Best_d; Max_Score_Best_e = Best_e; } } #ifdef SHOW_EXTEND_ALIGN fprintf(stdout, "WorkArea %2d REV ERROR_LIMIT at e=%d Error_Limit=%d best_e=%d\n", omp_get_thread_num(), e, Error_Limit, Max_Score_Best_e); #endif A_End = - Max_Score_Len; T_End = - Max_Score_Len - Max_Score_Best_d; Set_Left_Delta 
(Max_Score_Best_e, Max_Score_Best_d, Leftover, T_End, n); Match_To_End = FALSE; return Max_Score_Best_e; } canu-1.6/src/overlapInCore/liboverlap/prefixEditDistance.C000066400000000000000000000070301314437614700237000ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-JUN-03 to 2015-SEP-21 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-FEB-05 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "prefixEditDistance.H" #include "Binomial_Bound.H" prefixEditDistance::prefixEditDistance(bool doingPartialOverlaps_, double maxErate_) { maxErate = maxErate_; doingPartialOverlaps = doingPartialOverlaps_; MAX_ERRORS = (1 + (int)ceil(maxErate * AS_MAX_READLEN)); MIN_BRANCH_END_DIST = 20; MIN_BRANCH_TAIL_SLOPE = ((maxErate > 0.06) ? 
1.0 : 0.20); Left_Delta = new int [MAX_ERRORS]; Right_Delta = new int [MAX_ERRORS]; allocated = 3 * MAX_ERRORS * sizeof(int); Delta_Stack = new int [MAX_ERRORS]; Edit_Space_Lazy = new int * [MAX_ERRORS]; Edit_Array_Lazy = new int * [MAX_ERRORS]; memset(Edit_Space_Lazy, 0, sizeof(int *) * MAX_ERRORS); memset(Edit_Array_Lazy, 0, sizeof(int *) * MAX_ERRORS); allocated += MAX_ERRORS * sizeof (int); allocated += MAX_ERRORS * sizeof (int); // Edit_Match_Limit_Allocation = new int32 [MAX_ERRORS + 1]; Edit_Match_Limit = Edit_Match_Limit_Allocation; Initialize_Match_Limit(Edit_Match_Limit_Allocation, maxErate, MAX_ERRORS); for (int32 i=0; i <= AS_MAX_READLEN; i++) { //Error_Bound[i] = (int32) (i * maxErate + 0.0000000000001); Error_Bound[i] = (int32)ceil(i * maxErate); } // Value to add for a match in finding branch points. // // ALH: Note that maxErate also affects what overlaps get found // // ALH: Scoring seems to be unusual: given an alignment // of length l with k mismatches, the score seems to be // computed as l + k * error value and NOT (l-k)*match+k*error // // I.e. letting x := DEFAULT_BRANCH_MATCH_VAL, // the max mismatch fraction p to give a non-negative score // would be p = x/(1-x); conversely, to compute x for a // goal p, we have x = p/(1+p). E.g. // // for p=0.06, x = .06 / (1.06) = .0566038 // for p=0.35, x = .35 / (1.35) = .259259 // for p=0.2, x = .20 / (1.20) = .166667 // for p=0.15, x = .15 / (1.15) = .130435 // // Value was for 6% vs 35% error discrimination. // Converting to integers didn't make it faster. // Corresponding error value is this value minus 1.0 Branch_Match_Value = maxErate / (1 + maxErate); Branch_Error_Value = Branch_Match_Value - 1.0; }; prefixEditDistance::~prefixEditDistance() { delete [] Left_Delta; delete [] Right_Delta; delete [] Delta_Stack; for (uint32 i=0; i 0) - ((a) < 0) ) enum Overlap_t { NONE, LEFT_BRANCH_PT, RIGHT_BRANCH_PT, DOVETAIL }; // the input to Extend_Alignment. 
struct Match_Node_t { int32 Offset; // To start of exact match in hash-table frag int32 Len; // Of exact match int32 Start; // Of exact match in current (new) frag int32 Next; // Subscript of next match in list }; class prefixEditDistance { public: prefixEditDistance(bool doingPartialOverlaps_, double maxErate_); ~prefixEditDistance(); void Allocate_More_Edit_Space(int e); void Set_Right_Delta(int32 e, int32 d); int32 forward(char *A, int32 m, char *T, int32 n, int32 Error_Limit, int32 &A_End, int32 &T_End, bool &Match_To_End); void Set_Left_Delta(int32 e, int32 d, int32 &leftover, int32 &t_end, int32 t_len); int32 reverse(char *A, int32 m, char *T, int32 n, int32 Error_Limit, int32 &A_End, int32 &T_End, int32 &Leftover, // <- novel bool &Match_To_End); Overlap_t Extend_Alignment(Match_Node_t *Match, char *S, uint32 S_ID, int32 S_Len, char *T, uint32 T_ID, int32 T_Len, int32 &S_Lo, int32 &S_Hi, int32 &T_Lo, int32 &T_Hi, int32 &Errors); public: // The four below were global #defines, two depended on the error rate which is now local. 
// Most errors in any edit distance computation uint32 MAX_ERRORS; // Branch points must be at least this many bases from the end of the fragment to be reported uint32 MIN_BRANCH_END_DIST; // Branch point tails must fall off from the max by at least this rate double MIN_BRANCH_TAIL_SLOPE; double maxErate; bool doingPartialOverlaps; uint64 allocated; int32 Left_Delta_Len; int32 *Left_Delta; int32 Right_Delta_Len; int32 *Right_Delta; int32 *Delta_Stack; int32 **Edit_Space_Lazy; // Array of pointers, if set, it was a new'd allocation int32 **Edit_Array_Lazy; // Array of pointers, some are not new'd allocations #ifdef DEBUG_EDIT_SPACE_ALLOC int32 Edit_Space_Lazy_Max; // Last allocated block, DEBUG ONLY #endif // This array [e] is the minimum value of Edit_Array[e][d] // to be worth pursuing in edit-distance computations between reads const int32 *Edit_Match_Limit; int32 *Edit_Match_Limit_Allocation; // The maximum number of errors allowed in a match between reads of length i, // which is i * AS_OVL_ERROR_RATE. int32 Error_Bound[AS_MAX_READLEN + 1]; // Scores of matches and mismatches in alignments. Alignment ends at maximum score. double Branch_Match_Value; double Branch_Error_Value; }; #endif canu-1.6/src/overlapInCore/overlapConvert.C000066400000000000000000000066061314437614700210040ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. 
Walenz from 2015-FEB-11 to 2015-JUN-25 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "ovStore.H" #include <vector> using namespace std; int main(int argc, char **argv) { char *gkpStoreName = NULL; gkStore *gkpStore = NULL; ovOverlapDisplayType dt = ovOverlapAsCoords; bool native = false; vector<char *> files; int32 arg = 1; int32 err = 0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpStoreName = argv[++arg]; } else if (strcmp(argv[arg], "-coords") == 0) { dt = ovOverlapAsCoords; } else if (strcmp(argv[arg], "-hangs") == 0) { dt = ovOverlapAsHangs; } else if (strcmp(argv[arg], "-raw") == 0) { dt = ovOverlapAsRaw; } else if (strcmp(argv[arg], "-native") == 0) { native = true; } else if (AS_UTL_fileExists(argv[arg])) { files.push_back(argv[arg]); } else { fprintf(stderr, "ERROR: invalid arg '%s'\n", argv[arg]); err++; } arg++; } if ((gkpStoreName == NULL) && (dt == ovOverlapAsCoords)) err++; if ((err) || (files.size() == 0)) { fprintf(stderr, "usage: %s [options] file.ovb[.gz]\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, " -G gkpStore (needed for -coords, the default)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -coords output coordinates on reads\n"); fprintf(stderr, " -hangs output hangs on reads\n"); fprintf(stderr, " -raw output raw hangs on reads\n"); fprintf(stderr, "\n"); fprintf(stderr, " -native input ovb file is NOT snappy compressed\n"); fprintf(stderr, "\n"); if ((gkpStoreName == NULL) && (dt == ovOverlapAsCoords)) fprintf(stderr, "ERROR: -coords mode requires a gkpStore (-G)\n"); if (files.size() == 0) fprintf(stderr, "ERROR: no overlap files supplied\n"); exit(1); } if
(gkpStoreName) gkpStore = gkStore::gkStore_open(gkpStoreName); char *ovStr = new char [1024]; ovOverlap ov(gkpStore); for (uint32 ff=0; ff<files.size(); ff++) { ovFile *of = new ovFile(gkpStore, files[ff], ovFileFull); if (native) of->enableSnappy(false); while (of->readOverlap(&ov)) fputs(ov.toString(ovStr, dt, true), stdout); delete of; } delete [] ovStr; gkpStore->gkStore_close(); exit(0); } canu-1.6/src/overlapInCore/overlapConvert.mk000066400000000000000000000007621314437614700212260ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := overlapConvert SOURCES := overlapConvert.C SRC_INCDIRS := .. ../AS_UTL ../stores liboverlap TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/overlapInCore/overlapImport.C000066400000000000000000000207361314437614700206360ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-MAY-14 to 2015-JUN-25 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P.
Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-FEB-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "ovStore.H" #include "splitToWords.H" #include "mt19937ar.H" #include <vector> using namespace std; #define TYPE_NONE 'N' #define TYPE_LEGACY 'L' #define TYPE_COORDS 'C' #define TYPE_HANGS 'H' #define TYPE_RAW 'R' #define TYPE_OVB 'O' #define TYPE_RANDOM 'r' int main(int argc, char **argv) { char *gkpStoreName = NULL; gkStore *gkpStore = NULL; char *ovlFileName = NULL; char *ovlStoreName = NULL; char inType = TYPE_NONE; uint64 numRandom = 0; bool native = false; vector<char *> files; int32 arg = 1; int32 err = 0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpStoreName = argv[++arg]; } else if (strcmp(argv[arg], "-o") == 0) { ovlFileName = argv[++arg]; } else if (strcmp(argv[arg], "-O") == 0) { ovlStoreName = argv[++arg]; } else if (strcmp(argv[arg], "-legacy") == 0) { inType = TYPE_LEGACY; } else if (strcmp(argv[arg], "-coords") == 0) { fprintf(stderr, "-coords not implemented.\n"), exit(1); inType = TYPE_COORDS; } else if (strcmp(argv[arg], "-hangs") == 0) { fprintf(stderr, "-hangs not implemented.\n"), exit(1); inType = TYPE_HANGS; } else if (strcmp(argv[arg], "-raw") == 0) { inType = TYPE_RAW; } else if (strcmp(argv[arg], "-ovb") == 0) { fprintf(stderr, "-ovb not implemented.\n"), exit(1); inType = TYPE_OVB; } else if (strcmp(argv[arg], "-random") == 0) { inType = TYPE_RANDOM; numRandom = strtoull(argv[++arg], NULL, 10); files.push_back(NULL); } else if (strcmp(argv[arg], "-native") == 0) { native = true; } else if ((strcmp(argv[arg], "-") == 0) || (AS_UTL_fileExists(argv[arg]))) { files.push_back(argv[arg]); } else { fprintf(stderr, "ERROR: invalid
arg '%s'\n", argv[arg]); err++; } arg++; } if (gkpStoreName == NULL) err++; if (inType == TYPE_NONE) err++; if ((err) || (files.size() == 0)) { fprintf(stderr, "usage: %s [options] ascii-ovl-file-input.[.gz]\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, "Required:\n"); fprintf(stderr, " -G name.gkpStore path to valid gatekeeper store\n"); fprintf(stderr, "\n"); fprintf(stderr, "Output Format:\n"); fprintf(stderr, " -o file.ovb output file name\n"); fprintf(stderr, " -O name.ovlStore output overlap store"); fprintf(stderr, "\n"); fprintf(stderr, "Input Format:\n"); fprintf(stderr, " -legacy 'CA8 overlapStore -d' format\n"); fprintf(stderr, " -coords 'overlapConvert -coords' format (not implemented)\n"); fprintf(stderr, " -hangs 'overlapConvert -hangs' format (not implemented)\n"); fprintf(stderr, " -raw 'overlapConvert -raw' format\n"); fprintf(stderr, " -ovb 'overlapInCore' format (not implemented)\n"); fprintf(stderr, " -random N create N random overlaps, for store testing\n"); fprintf(stderr, "\n"); fprintf(stderr, " -native output ovb (-o) files will not be snappy compressed\n"); fprintf(stderr, "\n"); fprintf(stderr, "Input file can be stdin ('-') or a gz/bz2/xz compressed file.\n"); fprintf(stderr, "\n"); if (gkpStoreName == NULL) fprintf(stderr, "ERROR: need to supply a gkpStore (-G).\n"); if (inType == TYPE_NONE) fprintf(stderr, "ERROR: need to supply a format type (-legacy, -coords, -hangs, -raw).\n"); if (files.size() == 0) fprintf(stderr, "ERROR: need to supply input files.\n"); exit(1); } if (gkpStoreName) gkpStore = gkStore::gkStore_open(gkpStoreName); char *S = new char [1024]; splitToWords W; ovOverlap ov(gkpStore); ovFile *of = (ovlFileName == NULL) ? NULL : new ovFile(gkpStore, ovlFileName, ovFileFullWrite); ovStoreWriter *os = (ovlStoreName == NULL) ? NULL : new ovStoreWriter(ovlStoreName, gkpStore); if ((of) && (native == true)) of->enableSnappy(false); // Make random inputs first. 
if (inType == TYPE_RANDOM) { mtRandom mt; for (uint64 ii=0; ii<numRandom; ii++) { uint32 aID = floor(mt.mtRandomRealOpen() * gkpStore->gkStore_getNumReads()) + 1; uint32 bID = floor(mt.mtRandomRealOpen() * gkpStore->gkStore_getNumReads()) + 1; #if 0 // For testing when reads have no overlaps in store building. Issue #302. aID = aID & 0xfffffff0; bID = bID & 0xfffffff0; if (aID == 0) aID = 1; if (bID == 0) bID = 1; #endif uint32 aLen = gkpStore->gkStore_getRead(aID)->gkRead_sequenceLength(); uint32 bLen = gkpStore->gkStore_getRead(bID)->gkRead_sequenceLength(); bool olapFlip = mt.mtRandom32() % 2; // We could be fancy and make actual overlaps that make sense, or punt and make overlaps that // are valid but nonsense. ov.a_iid = aID; ov.b_iid = bID; ov.flipped(olapFlip); ov.a_hang((int32)(mt.mtRandomRealOpen() * 2 * aLen - aLen)); ov.b_hang((int32)(mt.mtRandomRealOpen() * 2 * bLen - bLen)); ov.dat.ovl.forOBT = false; ov.dat.ovl.forDUP = false; ov.dat.ovl.forUTG = true; ov.erate(mt.mtRandomRealOpen() * 0.1); if (of) of->writeOverlap(&ov); if (os) os->writeOverlap(&ov); } files.pop_back(); } // Now process any files. for (uint32 ff=0; ff<files.size(); ff++) { compressedFileReader *in = new compressedFileReader(files[ff]); fgets(S, 1024, in->file()); while (!feof(in->file())) { W.split(S); switch (inType) { case TYPE_LEGACY: // Aiid Biid 'I/N' ahang bhang erate erate ov.a_iid = W(0); ov.b_iid = W(1); ov.flipped(W[2][0] == 'I'); ov.a_hang(W(3)); ov.b_hang(W(4)); // Overlap store reports %error, but we expect fraction error. //ov.erate(atof(W[5]); // Don't use the original uncorrected error rate ov.erate(atof(W[6]) / 100.0); break; case TYPE_COORDS: break; case TYPE_HANGS: break; case TYPE_RAW: ov.a_iid = W(0); ov.b_iid = W(1); ov.flipped(W[2][0] == 'I'); ov.dat.ovl.span = W(3); ov.dat.ovl.ahg5 = W(4); ov.dat.ovl.ahg3 = W(5); ov.dat.ovl.bhg5 = W(6); ov.dat.ovl.bhg3 = W(7); ov.erate(atof(W[8]) / 1); ov.dat.ovl.forUTG = false; ov.dat.ovl.forOBT = false; ov.dat.ovl.forDUP = false; for (uint32 i = 9; i < W.numWords(); i++) { ov.dat.ovl.forUTG |= ((W[i][0] == 'U') && (W[i][1] == 'T') && (W[i][2] == 'G')); // Fails if W[i] == "U".
ov.dat.ovl.forOBT |= ((W[i][0] == 'O') && (W[i][1] == 'B') && (W[i][2] == 'T')); ov.dat.ovl.forDUP |= ((W[i][0] == 'D') && (W[i][1] == 'U') && (W[i][2] == 'P')); } break; default: break; } if (of) of->writeOverlap(&ov); if (os) os->writeOverlap(&ov); fgets(S, 1024, in->file()); } delete in; } delete os; delete of; delete [] S; gkpStore->gkStore_close(); exit(0); } canu-1.6/src/overlapInCore/overlapImport.mk000066400000000000000000000007601314437614700210560ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := overlapImport SOURCES := overlapImport.C SRC_INCDIRS := .. ../AS_UTL ../stores liboverlap TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/overlapInCore/overlapInCore-Build_Hash_Index.C000066400000000000000000000514141314437614700236670ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * This file is derived from: * * src/AS_OVL/AS_OVL_overlap_common.h * src/AS_OVM/overlapInCore-Build_Hash_Index.C * * Modifications by: * * Michael Schatz on 2004-SEP-23 * are Copyright 2004 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2005-MAR-22 * are Copyright 2005 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Eli Venter from 2005-JUL-15 to 2007-NOV-20 * are Copyright 2005,2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Aaron Halpern from 2006-MAR-27 to 2006-AUG-21 * are Copyright 2006 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Art Delcher on 2007-FEB-13 * are Copyright 2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2007-AUG-27 to 2009-JAN-16 * are Copyright 2007,2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2011-MAR-08 to 2011-JUN-17 * are Copyright 2011 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz from 2014-DEC-15 to 2015-AUG-25 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-NOV-23 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "overlapInCore.H" #include "AS_UTL_reverseComplement.H" // Add string s as an extra hash table string and return // a single reference to the beginning of it. 
static String_Ref_t Add_Extra_Hash_String(const char *s) { String_Ref_t ref = 0; String_Ref_t sub = 0; int len; uint32 new_len = Used_Data_Len + G.Kmer_Len; if (Extra_String_Subcount < MAX_EXTRA_SUBCOUNT) { sub = String_Ct + Extra_String_Ct - 1; } else { sub = String_Ct + Extra_String_Ct; if (sub >= String_Start_Size) { uint64 n = max(sub * 1.1, String_Start_Size * 1.5); //fprintf(stderr, "REALLOC String_Start from " F_U64 " to " F_U64 "\n", String_Start_Size, n); resizeArray(String_Start, String_Start_Size, String_Start_Size, n); } String_Start[sub] = Used_Data_Len; Extra_String_Ct++; Extra_String_Subcount = 0; new_len++; } if (new_len >= Extra_Data_Len) { uint64 n = max(new_len * 1.1, Extra_Data_Len * 1.5); //fprintf(stderr, "REALLOC basesData from " F_U64 " to " F_U64 "\n", Extra_Data_Len, n); resizeArray(basesData, Extra_Data_Len, Extra_Data_Len, n); } strncpy(basesData + String_Start[sub] + G.Kmer_Len * Extra_String_Subcount, s, G.Kmer_Len + 1); Used_Data_Len = new_len; setStringRefStringNum(ref, sub); if (sub > MAX_STRING_NUM) { fprintf(stderr, "Too many skip kmer strings for hash table.\n"); fprintf(stderr, "Try skipping hopeless check (-z option)\n"); fprintf(stderr, "Exiting\n"); exit (1); } setStringRefOffset(ref, (String_Ref_t)Extra_String_Subcount * (String_Ref_t)G.Kmer_Len); assert(Extra_String_Subcount * G.Kmer_Len < OFFSET_MASK); setStringRefLast(ref, (uint64)1); setStringRefEmpty(ref, TRUELY_ONE); Extra_String_Subcount++; return(ref); } // Mark left/right_end_screened in global String_Info for // ref and everything in its list, if they occur near // enough to the end of the string. 
static void Mark_Screened_Ends_Single(String_Ref_t ref) { int32 s_num = getStringRefStringNum(ref); int32 len = String_Info[s_num].length; if (getStringRefOffset(ref) < HOPELESS_MATCH) String_Info[s_num].lfrag_end_screened = TRUE; if (len - getStringRefOffset(ref) - G.Kmer_Len + 1 < HOPELESS_MATCH) String_Info[s_num].rfrag_end_screened = TRUE; } static void Mark_Screened_Ends_Chain(String_Ref_t ref) { Mark_Screened_Ends_Single (ref); while (! getStringRefLast(ref)) { ref = nextRef[(String_Start[getStringRefStringNum(ref)] + getStringRefOffset(ref)) / (HASH_KMER_SKIP + 1)]; Mark_Screened_Ends_Single (ref); } } // Set the empty bit to true for the hash table entry // corresponding to string s whose hash key is key . // Also set global String_Info.left/right_end_screened // true if the entry occurs near the left/right end, resp., // of the string in the hash table. If not found, add an // entry to the hash table and mark it empty. static void Hash_Mark_Empty(uint64 key, char * s) { String_Ref_t h_ref; char * t; unsigned char key_check; int64 ct, probe; int64 sub; int i, shift; sub = HASH_FUNCTION (key); key_check = KEY_CHECK_FUNCTION (key); probe = PROBE_FUNCTION (key); ct = 0; do { for (i = 0; i < Hash_Table[sub].Entry_Ct; i ++) if (Hash_Table[sub].Check[i] == key_check) { h_ref = Hash_Table[sub].Entry[i]; t = basesData + String_Start[getStringRefStringNum(h_ref)] + getStringRefOffset(h_ref); if (strncmp (s, t, G.Kmer_Len) == 0) { if (! 
getStringRefEmpty(Hash_Table[sub].Entry[i])) Mark_Screened_Ends_Chain (Hash_Table[sub].Entry[i]); setStringRefEmpty(Hash_Table[sub].Entry[i], TRUELY_ONE); return; } } assert (i == Hash_Table[sub].Entry_Ct); if (Hash_Table[sub].Entry_Ct < ENTRIES_PER_BUCKET) { // Not found if (G.Use_Hopeless_Check) { Hash_Table[sub].Entry[i] = Add_Extra_Hash_String (s); setStringRefEmpty(Hash_Table[sub].Entry[i], TRUELY_ONE); Hash_Table[sub].Check[i] = key_check; Hash_Table[sub].Entry_Ct ++; Hash_Table[sub].Hits[i] = 0; Hash_Entries ++; shift = HASH_CHECK_FUNCTION (key); Hash_Check_Array[sub] |= (((Check_Vector_t) 1) << shift); } return; } sub = (sub + probe) % HASH_TABLE_SIZE; } while (++ ct < HASH_TABLE_SIZE); fprintf (stderr, "ERROR: Hash table full\n"); assert (FALSE); } // Set Empty bit true for all entries in global Hash_Table // that match a kmer in file Kmer_Skip_File . // Add the entry (and then mark it empty) if it's not in Hash_Table. static void Mark_Skip_Kmers(void) { uint64 key; char line[MAX_LINE_LEN]; int ct = 0; rewind (G.Kmer_Skip_File); while (fgets (line, MAX_LINE_LEN, G.Kmer_Skip_File) != NULL) { int i, len; ct ++; len = strlen (line) - 1; if (line[0] != '>' || line[len] != '\n') { fprintf (stderr, "ERROR: Bad line %d in kmer skip file\n", ct); fputs (line, stderr); exit (1); } if (fgets (line, MAX_LINE_LEN, G.Kmer_Skip_File) == NULL) { fprintf (stderr, "ERROR: Bad line after %d in kmer skip file\n", ct); exit (1); } ct ++; len = strlen (line) - 1; if (len != G.Kmer_Len || line[len] != '\n') { fprintf (stderr, "ERROR: Bad line %d in kmer skip file\n", ct); fputs (line, stderr); exit (1); } line[len] = '\0'; //if ((ct % 200000) == 0) // fprintf(stderr, "Loaded skip %10d '%s'\n", ct/2, line); key = 0; for (i = 0; i < len; i ++) { line[i] = tolower (line[i]); key |= (uint64) (Bit_Equivalent[(int) line[i]]) << (2 * i); } Hash_Mark_Empty (key, line); reverseComplementSequence (line, len); key = 0; for (i = 0; i < len; i ++) key |= (uint64) (Bit_Equivalent[(int) 
line[i]]) << (2 * i); Hash_Mark_Empty (key, line); } fprintf (stderr, "String_Ct = " F_U64 " Extra_String_Ct = " F_U64 " Extra_String_Subcount = " F_U64 "\n", String_Ct, Extra_String_Ct, Extra_String_Subcount); fprintf (stderr, "Read %d kmers to mark to skip\n", ct / 2); } // Insert Ref with hash key Key into global Hash_Table . // Ref represents string S . static void Hash_Insert(String_Ref_t Ref, uint64 Key, char * S) { String_Ref_t H_Ref; char * T; int Shift; unsigned char Key_Check; int64 Ct, Probe, Sub; int i; Sub = HASH_FUNCTION (Key); Shift = HASH_CHECK_FUNCTION (Key); Hash_Check_Array[Sub] |= (((Check_Vector_t) 1) << Shift); Key_Check = KEY_CHECK_FUNCTION (Key); Probe = PROBE_FUNCTION (Key); Ct = 0; do { for (i = 0; i < Hash_Table[Sub].Entry_Ct; i ++) if (Hash_Table[Sub].Check[i] == Key_Check) { H_Ref = Hash_Table[Sub].Entry[i]; T = basesData + String_Start[getStringRefStringNum(H_Ref)] + getStringRefOffset(H_Ref); if (strncmp (S, T, G.Kmer_Len) == 0) { if (getStringRefLast(H_Ref)) { Extra_Ref_Ct ++; } nextRef[(String_Start[getStringRefStringNum(Ref)] + getStringRefOffset(Ref)) / (HASH_KMER_SKIP + 1)] = H_Ref; Extra_Ref_Ct ++; setStringRefLast(Ref, TRUELY_ZERO); Hash_Table[Sub].Entry[i] = Ref; if (Hash_Table[Sub].Hits[i] < HIGHEST_KMER_LIMIT) Hash_Table[Sub].Hits[i] ++; return; } } if (i != Hash_Table[Sub].Entry_Ct) { fprintf (stderr, "i = %d Sub = " F_S64 " Entry_Ct = %d\n", i, Sub, Hash_Table[Sub].Entry_Ct); } assert (i == Hash_Table[Sub].Entry_Ct); if (Hash_Table[Sub].Entry_Ct < ENTRIES_PER_BUCKET) { setStringRefLast(Ref, TRUELY_ONE); Hash_Table[Sub].Entry[i] = Ref; Hash_Table[Sub].Check[i] = Key_Check; Hash_Table[Sub].Entry_Ct ++; Hash_Entries ++; Hash_Table[Sub].Hits[i] = 1; return; } Sub = (Sub + Probe) % HASH_TABLE_SIZE; } while (++ Ct < HASH_TABLE_SIZE); fprintf (stderr, "ERROR: Hash table full\n"); assert (FALSE); } // Insert string subscript i into the global hash table. 
// Sequence and information about the string are in // global variables basesData, String_Start, String_Info, .... static void Put_String_In_Hash(uint32 UNUSED(curID), uint32 i) { String_Ref_t ref = 0; int skip_ct; uint64 key; uint64 key_is_bad; int j; uint32 kmers_skipped = 0; uint32 kmers_bad = 0; uint32 kmers_inserted = 0; char *p = basesData + String_Start[i]; char *window = basesData + String_Start[i]; key = key_is_bad = 0; for (uint32 j=0; j<G.Kmer_Len; j++) { key_is_bad |= (uint64) (Char_Is_Bad[(int) * p]) << j; key |= (uint64) (Bit_Equivalent[(int) * (p ++)]) << (2 * j); } setStringRefStringNum(ref, i); if (i > MAX_STRING_NUM) fprintf (stderr, "Too many strings for hash table--exiting\n"), exit(1); setStringRefOffset(ref, TRUELY_ZERO); skip_ct = 0; setStringRefEmpty(ref, TRUELY_ZERO); if (key_is_bad == false) { Hash_Insert(ref, key, window); kmers_inserted++; } else { kmers_bad++; } while (*p != 0) { window++; String_Ref_t newoff = getStringRefOffset(ref) + 1; assert(newoff < OFFSET_MASK); setStringRefOffset(ref, newoff); if (++skip_ct > HASH_KMER_SKIP) skip_ct = 0; key_is_bad >>= 1; key_is_bad |= (uint64) (Char_Is_Bad[(int) * p]) << (G.Kmer_Len - 1); key >>= 2; key |= (uint64) (Bit_Equivalent[(int) * (p ++)]) << (2 * (G.Kmer_Len - 1)); if (skip_ct > 0) { kmers_skipped++; continue; } if (key_is_bad) { kmers_bad++; continue; } Hash_Insert(ref, key, window); kmers_inserted++; } //fprintf(stderr, "STRING %u skipped %u bad %u inserted %u\n", // curID, kmers_skipped, kmers_bad, kmers_inserted); } // Read the next batch of strings from stream and create a hash // table index of their G.Kmer_Len -mers. Return 1 if successful; // 0 otherwise. The batch ends when either end-of-file is encountered // or Max_Hash_Strings have been read in. first_frag_id is the // internal ID of the first fragment in the hash table.
int Build_Hash_Index(gkStore *gkpStore, uint32 bgnID, uint32 endID) { String_Ref_t ref; uint64 total_len; uint64 hash_entry_limit; fprintf(stderr, "Build_Hash_Index from " F_U32 " to " F_U32 "\n", bgnID, endID); Hash_String_Num_Offset = bgnID; String_Ct = 0; Extra_String_Ct = 0; Extra_String_Subcount = MAX_EXTRA_SUBCOUNT; total_len = 0; //if (Data == NULL) { // Extra_Data_Len = Max_Hash_Data_Len + AS_MAX_READLEN; // Data_Len = Max_Hash_Data_Len + AS_MAX_READLEN; // // basesData = new char [Data_Len]; // qualsData = new char [Data_Len]; // // old_ref_len = Data_Len / (HASH_KMER_SKIP + 1); // nextRef = new String_Ref_t [old_ref_len]; //} //memset(nextRef, 0xff, old_ref_len * sizeof(String_Ref_t)); memset(Hash_Table, 0x00, HASH_TABLE_SIZE * sizeof(Hash_Bucket_t)); memset(Hash_Check_Array, 0x00, HASH_TABLE_SIZE * sizeof(Check_Vector_t)); Extra_Ref_Ct = 0; Hash_Entries = 0; hash_entry_limit = G.Max_Hash_Load * HASH_TABLE_SIZE * ENTRIES_PER_BUCKET; #if 0 fprintf(stderr, "HASH LOADING STARTED: fragID %12" F_U64P "\n", first_frag_id); fprintf(stderr, "HASH LOADING STARTED: strings %12" F_U64P " out of %12" F_U64P " max.\n", String_Ct, G.Max_Hash_Strings); fprintf(stderr, "HASH LOADING STARTED: length %12" F_U64P " out of %12" F_U64P " max.\n", total_len, G.Max_Hash_Data_Len); fprintf(stderr, "HASH LOADING STARTED: entries %12" F_U64P " out of %12" F_U64P " max (load %.2f).\n", Hash_Entries, hash_entry_limit, (100.0 * Hash_Entries) / (HASH_TABLE_SIZE * ENTRIES_PER_BUCKET)); #endif // Compute an upper limit on the number of bases we will load. The number of Hash_Entries // can't be computed here, so the real loop below could end earlier than expected - and we // don't use a little bit of memory. 
uint32 nSkipped = 0; uint32 nShort = 0; uint32 nLoadable = 0; uint64 maxAlloc = 0; uint32 curID = 0; // The last ID loaded into the hash for (curID=bgnID; ((String_Ct < G.Max_Hash_Strings) && (total_len < G.Max_Hash_Data_Len) && (curID <= endID)); curID++) { gkRead *read = gkpStore->gkStore_getRead(curID); if ((read->gkRead_libraryID() < G.minLibToHash) || (read->gkRead_libraryID() > G.maxLibToHash)) { nSkipped++; continue; } if (read->gkRead_sequenceLength() < G.Min_Olap_Len) { nShort++; continue; } nLoadable++; maxAlloc += read->gkRead_sequenceLength() + 1; } fprintf(stderr, "Found " F_U32 " reads with length " F_U64 " to load; " F_U32 " skipped by being too short; " F_U32 " skipped per library restriction\n", nLoadable, maxAlloc, nShort, nSkipped); // This should be less than what the user requested on the command line if (maxAlloc >= G.Max_Hash_Data_Len + AS_MAX_READLEN) fprintf(stderr, "maxAlloc = " F_U64 " G.Max_Hash_Data_Len = " F_U64 " AS_MAX_READLEN = %u\n", maxAlloc, G.Max_Hash_Data_Len, AS_MAX_READLEN); assert(maxAlloc < G.Max_Hash_Data_Len + AS_MAX_READLEN); // Allocate space, then fill it. uint64 nextRef_Len = maxAlloc / (HASH_KMER_SKIP + 1); Extra_Data_Len = Data_Len = maxAlloc; basesData = new char [Data_Len]; qualsData = new char [Data_Len]; nextRef = new String_Ref_t [nextRef_Len]; memset(nextRef, 0xff, sizeof(String_Ref_t) * nextRef_Len); gkReadData *readData = new gkReadData; for (curID=bgnID; ((String_Ct < G.Max_Hash_Strings) && (total_len < G.Max_Hash_Data_Len) && (Hash_Entries < hash_entry_limit) && (curID <= endID)); curID++, String_Ct++) { // Load sequence if it exists, otherwise, add an empty read. // Duplicated in Process_Overlaps(). 
String_Start[String_Ct] = UINT64_MAX; String_Info[String_Ct].length = 0; String_Info[String_Ct].lfrag_end_screened = TRUE; String_Info[String_Ct].rfrag_end_screened = TRUE; gkRead *read = gkpStore->gkStore_getRead(curID); if ((read->gkRead_libraryID() < G.minLibToHash) || (read->gkRead_libraryID() > G.maxLibToHash)) continue; uint32 len = read->gkRead_sequenceLength(); if (len < G.Min_Olap_Len) continue; gkpStore->gkStore_loadReadData(read, readData); char *seqptr = readData->gkReadData_getSequence(); char *qltptr = readData->gkReadData_getQualities(); // Note where we are going to store the string, and how long it is String_Start[String_Ct] = total_len; String_Info[String_Ct].length = len; String_Info[String_Ct].lfrag_end_screened = FALSE; String_Info[String_Ct].rfrag_end_screened = FALSE; // Store it. for (uint32 i=0; i 0) { uint32 extra = new_len % (HASH_KMER_SKIP + 1); if (extra > 0) new_len += 1 + HASH_KMER_SKIP - extra; } #endif // Trouble - allocate more space for sequence and quality data. // This was computed ahead of time! if (total_len > maxAlloc) fprintf(stderr, "total_len=" F_U64 " len=" F_U32 " maxAlloc=" F_U64 "\n", total_len, len, maxAlloc); assert(total_len <= maxAlloc); // What is Extra_Data_Len? It's set to Data_Len if we would have reallocated here. Put_String_In_Hash(curID, String_Ct); if ((String_Ct % 100000) == 0) fprintf (stderr, "String_Ct:%12" F_U64P "/%12" F_U32P " totalLen:%12" F_U64P "/%12" F_U64P " Hash_Entries:%12" F_U64P "/%12" F_U64P " Load: %.2f%%\n", String_Ct, G.Max_Hash_Strings, total_len, G.Max_Hash_Data_Len, Hash_Entries, hash_entry_limit, 100.0 * Hash_Entries / (HASH_TABLE_SIZE * ENTRIES_PER_BUCKET)); } curID--; // We always stop on the read after we loaded. 
delete readData; fprintf(stderr, "HASH LOADING STOPPED: strings %12" F_U64P " out of %12" F_U32P " max.\n", String_Ct, G.Max_Hash_Strings); fprintf(stderr, "HASH LOADING STOPPED: length %12" F_U64P " out of %12" F_U64P " max.\n", total_len, G.Max_Hash_Data_Len); fprintf(stderr, "HASH LOADING STOPPED: entries %12" F_U64P " out of %12" F_U64P " max (load %.2f).\n", Hash_Entries, hash_entry_limit, 100.0 * Hash_Entries / (HASH_TABLE_SIZE * ENTRIES_PER_BUCKET)); if (String_Ct == 0) { fprintf(stderr, "HASH LOADING STOPPED: no strings added?\n"); return(endID); } Used_Data_Len = total_len; //fprintf(stderr, "Extra_Ref_Ct = " F_U64 " Max_Extra_Ref_Space = " F_U64 "\n", Extra_Ref_Ct, Max_Extra_Ref_Space); if (Extra_Ref_Ct > Max_Extra_Ref_Space) { int32 newSize = (Max_Extra_Ref_Space == 0) ? 16 * 1024 : Max_Extra_Ref_Space * 2; while (newSize < Extra_Ref_Ct) newSize *= 2; String_Ref_t *newSpace = new String_Ref_t [newSize]; memcpy(newSpace, Extra_Ref_Space, sizeof(String_Ref_t) * Max_Extra_Ref_Space); delete [] Extra_Ref_Space; Max_Extra_Ref_Space = newSize; // Former max_extra_ref_ct Extra_Ref_Space = newSpace; } if (G.Kmer_Skip_File != NULL) Mark_Skip_Kmers(); // Coalesce reference chain into adjacent entries in Extra_Ref_Space Extra_Ref_Ct = 0; for (uint64 i = 0; i < HASH_TABLE_SIZE; i ++) for (int32 j = 0; j < Hash_Table[i].Entry_Ct; j ++) { ref = Hash_Table[i].Entry[j]; if (! getStringRefLast(ref) && ! getStringRefEmpty(ref)) { Extra_Ref_Space[Extra_Ref_Ct] = ref; setStringRefStringNum(Hash_Table[i].Entry[j], (String_Ref_t)(Extra_Ref_Ct >> OFFSET_BITS)); setStringRefOffset (Hash_Table[i].Entry[j], (String_Ref_t)(Extra_Ref_Ct & OFFSET_MASK)); Extra_Ref_Ct ++; do { ref = nextRef[(String_Start[getStringRefStringNum(ref)] + getStringRefOffset(ref)) / (HASH_KMER_SKIP + 1)]; Extra_Ref_Space[Extra_Ref_Ct ++] = ref; } while (! 
getStringRefLast(ref)); } } return(curID); } canu-1.6/src/overlapInCore/overlapInCore-Find_Overlaps.C000066400000000000000000000305641314437614700232740ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_OVL/AS_OVL_overlap_common.h * src/AS_OVM/overlapInCore-Find_Overlaps.C * * Modifications by: * * Michael Schatz on 2004-SEP-23 * are Copyright 2004 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2005-MAR-22 * are Copyright 2005 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Eli Venter from 2005-JUL-15 to 2007-NOV-20 * are Copyright 2005,2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Aaron Halpern from 2006-MAR-27 to 2006-AUG-21 * are Copyright 2006 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Art Delcher on 2007-FEB-13 * are Copyright 2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2007-AUG-27 to 2009-JAN-16 * are Copyright 2007,2009 J. 
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2011-MAR-08 to 2011-JUN-17 * are Copyright 2011 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz on 2014-DEC-15 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-JUN-08 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "overlapInCore.H" // Add information for the match in ref to the list // starting at subscript (* start). The matching window begins // offset bytes from the beginning of this string. static void Add_Match(String_Ref_t ref, int * start, int offset, int * consistent, Work_Area_t * WA) { int * p, save; int diag = 0, new_diag, expected_start = 0, num_checked = 0; int move_to_front = FALSE; new_diag = getStringRefOffset(ref) - offset; for (p = start; (* p) != 0; p = & (WA->Match_Node_Space [(* p)].Next)) { expected_start = WA->Match_Node_Space [(* p)].Start + WA->Match_Node_Space [(* p)].Len - G.Kmer_Len + 1 + HASH_KMER_SKIP; diag = WA->Match_Node_Space [(* p)].Offset - WA->Match_Node_Space [(* p)].Start; if (expected_start < offset) break; if (expected_start == offset) { if (new_diag == diag) { WA->Match_Node_Space [(* p)].Len += 1 + HASH_KMER_SKIP; if (move_to_front) { save = (* p); (* p) = WA->Match_Node_Space [(* p)].Next; WA->Match_Node_Space [save].Next = (* start); (* start) = save; } return; } else move_to_front = TRUE; } num_checked ++; } if (WA->Next_Avail_Match_Node == WA->Match_Node_Size) { int32 newSize = WA->Match_Node_Size * 2; Match_Node_t *newSpace = new Match_Node_t [newSize]; 
memcpy(newSpace, WA->Match_Node_Space, sizeof(Match_Node_t) * WA->Match_Node_Size); delete [] WA->Match_Node_Space; WA->Match_Node_Size = newSize; WA->Match_Node_Space = newSpace; } if ((* start) != 0 && (num_checked > 0 || abs (diag - new_diag) > 3 || offset < expected_start + G.Kmer_Len - 2)) (* consistent) = FALSE; save = (* start); (* start) = WA->Next_Avail_Match_Node; WA->Next_Avail_Match_Node ++; WA->Match_Node_Space [(* start)].Offset = getStringRefOffset(ref); WA->Match_Node_Space [(* start)].Len = G.Kmer_Len; WA->Match_Node_Space [(* start)].Start = offset; WA->Match_Node_Space [(* start)].Next = save; #if 0 fprintf(stderr, "Add_Match()-- %3d offset %d len %d start %d next %d\n", *start, WA->Match_Node_Space [(* start)].Offset, WA->Match_Node_Space [(* start)].Len, WA->Match_Node_Space [(* start)].Start, WA->Match_Node_Space [(* start)].Next); #endif } // Add information for Ref and all its matches to the global hash table in String_Olap_Space. Grow // the space if necessary. The matching window begins Offset bytes from the beginning of this // string. static void Add_Ref(String_Ref_t Ref, int Offset, Work_Area_t * WA) { uint32 Prev, StrNum, Sub; int consistent; StrNum = getStringRefStringNum(Ref); Sub = (StrNum ^ (StrNum >> STRING_OLAP_SHIFT)) & STRING_OLAP_MASK; while (WA->String_Olap_Space [Sub].Full && WA->String_Olap_Space [Sub].String_Num != StrNum) { Prev = Sub; Sub = WA->String_Olap_Space [Sub].Next; if (Sub == 0) { if (WA->Next_Avail_String_Olap == WA->String_Olap_Size) { int32 newSize = WA->String_Olap_Size * 2; String_Olap_t *newSpace = new String_Olap_t [newSize]; memcpy(newSpace, WA->String_Olap_Space, sizeof(String_Olap_t) * WA->String_Olap_Size); delete [] WA->String_Olap_Space; WA->String_Olap_Size = newSize; WA->String_Olap_Space = newSpace; } Sub = WA->Next_Avail_String_Olap ++; WA->String_Olap_Space [Prev].Next = Sub; WA->String_Olap_Space [Sub].Full = FALSE; break; } } if (! 
WA->String_Olap_Space [Sub].Full) { WA->String_Olap_Space [Sub].String_Num = StrNum; WA->String_Olap_Space [Sub].Match_List = 0; WA->String_Olap_Space [Sub].diag_sum = 0.0; WA->String_Olap_Space [Sub].diag_ct = 0; WA->String_Olap_Space [Sub].diag_bgn = AS_MAX_READLEN; WA->String_Olap_Space [Sub].diag_end = 0; WA->String_Olap_Space [Sub].Next = 0; WA->String_Olap_Space [Sub].Full = TRUE; WA->String_Olap_Space [Sub].consistent = TRUE; } consistent = WA->String_Olap_Space [Sub].consistent; WA->String_Olap_Space [Sub].diag_sum += (double)getStringRefOffset(Ref) - Offset; WA->String_Olap_Space [Sub].diag_ct ++; if (WA->String_Olap_Space [Sub].diag_bgn > Offset) WA->String_Olap_Space [Sub].diag_bgn = Offset; if (WA->String_Olap_Space [Sub].diag_end < Offset) WA->String_Olap_Space [Sub].diag_end = Offset; Add_Match (Ref, & (WA->String_Olap_Space [Sub].Match_List), Offset, & consistent, WA); WA->String_Olap_Space [Sub].consistent = consistent; return; } // Search for string S with hash key Key in the global // Hash_Table starting at subscript Sub. Return the matching // reference in the hash table if there is one, or else a reference // with the Empty bit set true. Set (* Where) to the subscript in // Extra_Ref_Space where the reference was found if it was found there. // Set (* hi_hits) to TRUE if hash table entry is found but is empty // because it was screened out, otherwise set to FALSE. static String_Ref_t Hash_Find(uint64 Key, int64 Sub, char * S, int64 * Where, int * hi_hits) { String_Ref_t H_Ref = 0; char * T; unsigned char Key_Check; int64 Ct, Probe; int i; Key_Check = KEY_CHECK_FUNCTION (Key); Probe = PROBE_FUNCTION (Key); (* hi_hits) = FALSE; Ct = 0; do { for (i = 0; i < Hash_Table [Sub].Entry_Ct; i ++) if (Hash_Table [Sub].Check [i] == Key_Check) { int is_empty; H_Ref = Hash_Table [Sub].Entry [i]; //fprintf(stderr, "Href = Hash_Table %u Entry %u = " F_U64 "\n", Sub, i, H_Ref); is_empty = getStringRefEmpty(H_Ref); if (! getStringRefLast(H_Ref) && ! 
is_empty) { (* Where) = ((uint64)getStringRefStringNum(H_Ref) << OFFSET_BITS) + getStringRefOffset(H_Ref); H_Ref = Extra_Ref_Space [(* Where)]; //fprintf(stderr, "Href = Extra_Ref_Space " F_U64 " = " F_U64 "\n", *Where, H_Ref); } //fprintf(stderr, "Href = " F_U64 " Get String_Start[ " F_U64 " ] + " F_U64 "\n", getStringRefStringNum(H_Ref), getStringRefOffset(H_Ref)); T = basesData + String_Start [getStringRefStringNum(H_Ref)] + getStringRefOffset(H_Ref); if (strncmp (S, T, G.Kmer_Len) == 0) { if (is_empty) { setStringRefEmpty(H_Ref, TRUELY_ONE); (* hi_hits) = TRUE; } return H_Ref; } } if (Hash_Table [Sub].Entry_Ct < ENTRIES_PER_BUCKET) { setStringRefEmpty(H_Ref, TRUELY_ONE); return H_Ref; } Sub = (Sub + Probe) % HASH_TABLE_SIZE; } while (++ Ct < HASH_TABLE_SIZE); setStringRefEmpty(H_Ref, TRUELY_ONE); return H_Ref; } // Find and output all overlaps and branch points between string // Frag and any fragment currently in the global hash table. // Frag_Len is the length of Frag and Frag_Num is its ID number. // Dir is the orientation of Frag . 
void Find_Overlaps(char Frag [], int Frag_Len, char quality [], uint32 Frag_Num, Direction_t Dir, Work_Area_t * WA) { String_Ref_t Ref; char * P, * Window; uint64 Key, Next_Key; int64 Sub, Next_Sub, Where; Check_Vector_t This_Check, Next_Check; int Offset, Shift, Next_Shift; int hi_hits; int j; memset (WA->String_Olap_Space, 0, STRING_OLAP_MODULUS * sizeof (String_Olap_t)); WA->Next_Avail_String_Olap = STRING_OLAP_MODULUS; WA->Next_Avail_Match_Node = 1; assert (Frag_Len >= G.Kmer_Len); Offset = 0; P = Window = Frag; WA->left_end_screened = FALSE; WA->right_end_screened = FALSE; WA->A_Olaps_For_Frag = 0; WA->B_Olaps_For_Frag = 0; Key = 0; for (j = 0; j < G.Kmer_Len; j ++) Key |= (uint64) (Bit_Equivalent [(int) * (P ++)]) << (2 * j); Sub = HASH_FUNCTION (Key); Shift = HASH_CHECK_FUNCTION (Key); Next_Key = (Key >> 2); Next_Key |= ((uint64) (Bit_Equivalent [(int) * P])) << (2 * (G.Kmer_Len - 1)); Next_Sub = HASH_FUNCTION (Next_Key); Next_Shift = HASH_CHECK_FUNCTION (Next_Key); Next_Check = Hash_Check_Array [Next_Sub]; if ((Hash_Check_Array [Sub] & (((Check_Vector_t) 1) << Shift)) != 0) { Ref = Hash_Find (Key, Sub, Window, & Where, & hi_hits); if (hi_hits) { WA->left_end_screened = TRUE; } if (! getStringRefEmpty(Ref)) { while (TRUE) { if (Frag_Num < getStringRefStringNum(Ref) + Hash_String_Num_Offset) Add_Ref (Ref, Offset, WA); if (getStringRefLast(Ref)) break; else { Ref = Extra_Ref_Space [++ Where]; assert (! 
getStringRefEmpty(Ref)); } } } } while ((* P) != '\0') { Window ++; Offset ++; Key = Next_Key; Shift = Next_Shift; Sub = Next_Sub; This_Check = Next_Check; P ++; Next_Key = (Key >> 2); Next_Key |= ((uint64) (Bit_Equivalent [(int) * P])) << (2 * (G.Kmer_Len - 1)); Next_Sub = HASH_FUNCTION (Next_Key); Next_Shift = HASH_CHECK_FUNCTION (Next_Key); Next_Check = Hash_Check_Array [Next_Sub]; if ((This_Check & (((Check_Vector_t) 1) << Shift)) != 0) { Ref = Hash_Find (Key, Sub, Window, & Where, & hi_hits); if (hi_hits) { if (Offset < HOPELESS_MATCH) { WA->left_end_screened = TRUE; } if (Frag_Len - Offset - G.Kmer_Len + 1 < HOPELESS_MATCH) { WA->right_end_screened = TRUE; } } if (! getStringRefEmpty(Ref)) { while (TRUE) { if (Frag_Num < getStringRefStringNum(Ref) + Hash_String_Num_Offset) Add_Ref (Ref, Offset, WA); if (getStringRefLast(Ref)) break; else { Ref = Extra_Ref_Space [++ Where]; assert (! getStringRefEmpty(Ref)); } } } } } Process_String_Olaps (Frag, Frag_Len, quality, Frag_Num, Dir, WA); } canu-1.6/src/overlapInCore/overlapInCore-Output.C000066400000000000000000000213031314437614700220300ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * This file is derived from: * * src/AS_OVL/AS_OVL_overlap_common.h * src/AS_OVM/overlapInCore-Output.C * * Modifications by: * * Michael Schatz on 2004-SEP-23 * are Copyright 2004 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2005-MAR-22 * are Copyright 2005 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Eli Venter from 2005-JUL-15 to 2007-NOV-20 * are Copyright 2005,2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Aaron Halpern from 2006-MAR-27 to 2006-AUG-21 * are Copyright 2006 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Art Delcher on 2007-FEB-13 * are Copyright 2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2007-AUG-27 to 2009-JAN-16 * are Copyright 2007,2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2011-MAR-08 to 2011-JUN-17 * are Copyright 2011 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz from 2014-NOV-17 to 2015-JUN-16 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-26 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "overlapInCore.H" // Output the overlap between strings S_ID and T_ID which // have lengths S_Len and T_Len , respectively. // The overlap information is in (* olap) . 
// S_Dir indicates the orientation of S . // T is always forward. void Output_Overlap(uint32 S_ID, int S_Len, Direction_t S_Dir, uint32 T_ID, int T_Len, Olap_Info_t *olap, Work_Area_t *WA) { ovOverlap *ovs = WA->overlaps + WA->overlapsLen++; // Overlap is good for UTG only. ovs->dat.ovl.forUTG = true; ovs->dat.ovl.forOBT = false; ovs->dat.ovl.forDUP = false; // Set the span of the alignment. ovs->dat.ovl.span = (olap->s_hi - olap->s_lo); ovs->dat.ovl.span += (olap->t_hi - olap->t_lo); ovs->dat.ovl.span += (olap->delta_ct); assert((ovs->dat.ovl.span % 2) == 0); ovs->dat.ovl.span /= 2; assert (S_ID < T_ID); int32 S_Right_Hang = S_Len - olap->s_hi - 1; int32 T_Right_Hang = T_Len - olap->t_hi - 1; bool Sleft; char orient = 0; int32 ahg = 0; int32 bhg = 0; //fprintf(stderr, "S %d %d = %d len %d T %d %d = %d len %d delta_ct %d delta %d\n", // olap->s_lo, olap->s_hi, olap->s_hi - olap->s_lo, S_Len, // olap->t_lo, olap->t_hi, olap->t_hi - olap->t_lo, T_Len, // olap->delta_ct, olap->delta[0]); assert(olap->s_lo < olap->s_hi); assert(olap->t_lo < olap->t_hi); // Ensure this is a dovetail or contained overlap. assert((olap->s_lo == 0) || (olap->t_lo == 0)); assert((olap->s_hi == S_Len - 1) || (olap->t_hi == T_Len - 1)); if ((olap->s_lo > olap->t_lo) || ((olap->s_lo == olap->t_lo) && (S_Right_Hang > T_Right_Hang))) { assert(olap->t_lo == 0); assert((olap->s_hi == S_Len - 1) || (olap->t_hi == T_Len - 1)); Sleft = true; } else { assert(olap->s_lo == 0); assert((olap->s_hi == S_Len - 1) || (olap->t_hi == T_Len - 1)); Sleft = false; } if (Sleft) { ovs->a_iid = S_ID; ovs->b_iid = T_ID; } else { ovs->a_iid = T_ID; ovs->b_iid = S_ID; } //if (Sleft) { // if (S_Right_Hang >= T_Right_Hang) // overlap_type = AS_CONTAINMENT; // else // overlap_type = AS_DOVETAIL; //} else { // if (T_Right_Hang >= S_Right_Hang) // overlap_type = AS_CONTAINMENT; // else // overlap_type = AS_DOVETAIL; //} if (Sleft) { orient = (S_Dir == FORWARD) ? 
'N' : 'O'; ahg = olap->s_lo; bhg = T_Right_Hang - S_Right_Hang; } else { orient = (S_Dir == FORWARD) ? 'N' : 'I'; ahg = olap->t_lo; bhg = S_Right_Hang - T_Right_Hang; } // CMM: Regularize the reverse-oriented containment overlaps to a common orientation. // // This catches the case where a reverse-oriented S (T is always forward) is placed // in the A position; we flip the overlap to make S be forward and T be reverse. // if ((orient == 'O') && (S_Right_Hang >= T_Right_Hang)) { orient = 'I'; ahg = -(T_Right_Hang - S_Right_Hang); bhg = -(olap->s_lo); } ovs->erate(olap->quality); switch (orient) { case 'N': ovs->a_hang(ahg); ovs->b_hang(bhg); ovs->dat.ovl.flipped = false; break; case 'I': ovs->a_hang(ahg); ovs->b_hang(bhg); ovs->dat.ovl.flipped = true; break; case 'O': ovs->a_hang(-bhg); ovs->b_hang(-ahg); ovs->dat.ovl.flipped = true; break; case 'A': // Never reached. ovs->a_hang(-bhg); ovs->b_hang(-ahg); ovs->dat.ovl.flipped = false; break; } #if OUTPUT_OVERLAP_DELTAS signed char deltas[2 * AS_READ_MAX_NORMAL_LEN]; signed char *deltaCursor = deltas; if (Sleft == false) for (int i = 0; i < olap->delta_ct; i ++) olap->delta [i] *= -1; for (int i = 0; i < olap->delta_ct; i ++) { for (int j = abs (olap->delta [i]); j > 0; j -= AS_LONGEST_DELTA) { if (j > AS_LONGEST_DELTA) *deltaCursor++ = AS_LONG_DELTA_CODE; else *deltaCursor++ = j * Sign (olap->delta [i]); } } *deltaCursor = AS_ENDOF_DELTA_CODE; #endif WA->Total_Overlaps ++; if (bhg <= 0) WA->Contained_Overlap_Ct ++; else WA->Dovetail_Overlap_Ct ++; // Write overlaps if we've saved too many. // They're also written at the end of the thread.
if (WA->overlapsLen >= WA->overlapsMax) #pragma omp critical { for (int32 zz=0; zz<WA->overlapsLen; zz++) Out_BOF->writeOverlap(WA->overlaps + zz); WA->overlapsLen = 0; } } void Output_Partial_Overlap(uint32 s_id, uint32 t_id, Direction_t dir, const Olap_Info_t *olap, int s_len, int t_len, Work_Area_t *WA) { Total_Overlaps++; ovOverlap *ovl = WA->overlaps + WA->overlapsLen++; assert(s_id < t_id); ovl->a_iid = s_id; ovl->b_iid = t_id; // Overlap is good for OBT or DUP. It will be refined more during the store build. ovl->dat.ovl.forUTG = false; ovl->dat.ovl.forOBT = true; ovl->dat.ovl.forDUP = true; // Set the span of the alignment. ovl->dat.ovl.span = (olap->s_hi - olap->s_lo); ovl->dat.ovl.span += (olap->t_hi - olap->t_lo); ovl->dat.ovl.span += (olap->delta_ct); assert((ovl->dat.ovl.span % 2) == 0); ovl->dat.ovl.span /= 2; // Convert to canonical form with s forward and use space-based // coordinates if (dir == FORWARD) { ovl->dat.ovl.ahg5 = (olap->s_lo); ovl->dat.ovl.ahg3 = s_len - (olap->s_hi + 1); ovl->dat.ovl.bhg5 = (olap->t_lo); ovl->dat.ovl.bhg3 = t_len - (olap->t_hi + 1); ovl->dat.ovl.flipped = false; //assert(a < b); //assert(c < d); } else { ovl->dat.ovl.ahg5 = s_len - (olap->s_hi + 1); ovl->dat.ovl.ahg3 = (olap->s_lo); ovl->dat.ovl.bhg5 = t_len - (olap->t_hi + 1); ovl->dat.ovl.bhg3 = (olap->t_lo); ovl->dat.ovl.flipped = true; //assert(a < b); //assert(c > d); // Reverse! } ovl->erate(olap->quality); // We also flush the file at the end of a thread if (WA->overlapsLen >= WA->overlapsMax) { #pragma omp critical for (int32 zz=0; zz<WA->overlapsLen; zz++) Out_BOF->writeOverlap(WA->overlaps + zz); WA->overlapsLen = 0; } } canu-1.6/src/overlapInCore/overlapInCore-Process_Overlaps.C000066400000000000000000000136741314437614700240330ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_OVL/AS_OVL_overlap_common.h * src/AS_OVM/overlapInCore-Process_Overlaps.C * * Modifications by: * * Michael Schatz on 2004-SEP-23 * are Copyright 2004 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2005-MAR-22 * are Copyright 2005 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Eli Venter from 2005-JUL-15 to 2007-NOV-20 * are Copyright 2005,2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Aaron Halpern from 2006-MAR-27 to 2006-AUG-21 * are Copyright 2006 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Art Delcher on 2007-FEB-13 * are Copyright 2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2007-AUG-27 to 2009-JAN-16 * are Copyright 2007,2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2011-MAR-08 to 2015-AUG-21 * are Copyright 2011,2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz from 2014-DEC-15 to 2015-AUG-25 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
Walenz beginning on 2015-NOV-23 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-JUN-08 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "overlapInCore.H" #include "AS_UTL_reverseComplement.H" // Find and output all overlaps between strings in store and those in the global hash table. // This is the entry point for each compute thread. void * Process_Overlaps(void *ptr){ Work_Area_t *WA = (Work_Area_t *)ptr; gkReadData *readData = new gkReadData; char *bases = new char [AS_MAX_READLEN + 1]; char *quals = new char [AS_MAX_READLEN + 1]; while (WA->bgnID < G.endRefID) { WA->overlapsLen = 0; WA->Total_Overlaps = 0; WA->Contained_Overlap_Ct = 0; WA->Dovetail_Overlap_Ct = 0; WA->Kmer_Hits_Without_Olap_Ct = 0; WA->Kmer_Hits_With_Olap_Ct = 0; WA->Kmer_Hits_Skipped_Ct = 0; WA->Multi_Overlap_Ct = 0; fprintf(stderr, "Thread %02u processes reads " F_U32 "-" F_U32 "\n", WA->thread_id, WA->bgnID, WA->endID); for (uint32 fi=WA->bgnID; fi<=WA->endID; fi++) { // Load sequence/quality data // Duplicated in Build_Hash_Index() gkRead *read = WA->gkpStore->gkStore_getRead(fi); if ((read->gkRead_libraryID() < G.minLibToRef) || (read->gkRead_libraryID() > G.maxLibToRef)) continue; uint32 len = read->gkRead_sequenceLength(); if (len < G.Min_Olap_Len) continue; WA->gkpStore->gkStore_loadReadData(read, readData); char *seqptr = readData->gkReadData_getSequence(); char *qltptr = readData->gkReadData_getQualities(); for (uint32 i=0; i<len; i++) { bases[i] = seqptr[i]; quals[i] = qltptr[i]; } bases[len] = 0; quals[len] = 0; Find_Overlaps(bases, len, quals, read->gkRead_readID(), FORWARD, WA); reverseComplement(bases, quals, len); Find_Overlaps(bases, len, quals, read->gkRead_readID(), REVERSE, WA); } // Write out this block of overlaps, no need to keep them in core! // While we have a mutex, also find the next block of things to process.
fprintf(stderr, "Thread %02u writes reads " F_U32 "-" F_U32 " (" F_U64 " overlaps " F_U64 "/" F_U64 "/" F_U64 " kmer hits with/without overlap/skipped)\n", WA->thread_id, WA->bgnID, WA->endID, WA->overlapsLen, WA->Kmer_Hits_With_Olap_Ct, WA->Kmer_Hits_Without_Olap_Ct, WA->Kmer_Hits_Skipped_Ct); // Flush any remaining overlaps and update statistics. #pragma omp critical { for (int zz=0; zz<WA->overlapsLen; zz++) Out_BOF->writeOverlap(WA->overlaps + zz); WA->overlapsLen = 0; Total_Overlaps += WA->Total_Overlaps; Contained_Overlap_Ct += WA->Contained_Overlap_Ct; Dovetail_Overlap_Ct += WA->Dovetail_Overlap_Ct; Kmer_Hits_Without_Olap_Ct += WA->Kmer_Hits_Without_Olap_Ct; Kmer_Hits_With_Olap_Ct += WA->Kmer_Hits_With_Olap_Ct; Kmer_Hits_Skipped_Ct += WA->Kmer_Hits_Skipped_Ct; Multi_Overlap_Ct += WA->Multi_Overlap_Ct; WA->bgnID = G.curRefID; WA->endID = G.curRefID + G.perThread - 1; if (WA->endID > G.endRefID) WA->endID = G.endRefID; G.curRefID = WA->endID + 1; } } delete readData; delete [] bases; delete [] quals; return(ptr); } canu-1.6/src/overlapInCore/overlapInCore-Process_String_Overlaps.C000066400000000000000000000615061314437614700253600ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994.
* * This file is derived from: * * src/AS_OVL/AS_OVL_overlap_common.h * src/AS_OVM/overlapInCore-Process_String_Overlaps.C * * Modifications by: * * Michael Schatz on 2004-SEP-23 * are Copyright 2004 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2005-MAR-22 * are Copyright 2005 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Eli Venter from 2005-JUL-15 to 2007-NOV-20 * are Copyright 2005,2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Aaron Halpern from 2006-MAR-27 to 2006-AUG-21 * are Copyright 2006 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Art Delcher on 2007-FEB-13 * are Copyright 2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2007-AUG-27 to 2009-JAN-16 * are Copyright 2007,2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2011-MAR-08 to 2011-JUN-17 * are Copyright 2011 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz from 2014-DEC-15 to 2015-JUL-20 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-29 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-MAR-23 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include <cmath> #include "overlapInCore.H" static uint64 computeExpected(uint64 kmerSize, double ovlLen, double erate) { if (ovlLen < kmerSize) return 0; return int(floor(exp(-1.0 * (double)kmerSize * erate) * (ovlLen - kmerSize + 1))); } static uint64 computeMinimumKmers(uint64 kmerSize, double ovlLen, double erate) { if (G.Filter_By_Kmer_Count == 0) return G.Filter_By_Kmer_Count; ovlLen = (ovlLen < 0 ? ovlLen*-1.0 : ovlLen); return max(G.Filter_By_Kmer_Count, computeExpected(kmerSize, ovlLen, erate)); } // Choose the best overlap in olap[0 .. (ct - 1)] . // Mark all others as deleted (by setting deleted[] true for them) // and combine their information in the min/max entries in the // best one. static void Combine_Into_One_Olap(Olap_Info_t olap[], int ct, int deleted[]) { int min_diag, max_diag; int s_left_boundary, s_right_boundary; int t_left_boundary, t_right_boundary; int i, best; best = 0; min_diag = olap[0].min_diag; max_diag = olap[0].max_diag; s_left_boundary = olap[0].s_left_boundary; s_right_boundary = olap[0].s_right_boundary; t_left_boundary = olap[0].t_left_boundary; t_right_boundary = olap[0].t_right_boundary; for (i = 1; i < ct; i ++) { if (olap[i].quality < olap[best].quality) best = i; if (olap[i].min_diag < min_diag) min_diag = olap[i].min_diag; if (olap[i].max_diag > max_diag) max_diag = olap[i].max_diag; if (olap[i].s_left_boundary < s_left_boundary) s_left_boundary = olap[i].s_left_boundary; if (olap[i].s_right_boundary > s_right_boundary) s_right_boundary = olap[i].s_right_boundary; if (olap[i].t_left_boundary < t_left_boundary) t_left_boundary = olap[i].t_left_boundary; if (olap[i].t_right_boundary > t_right_boundary) t_right_boundary = olap[i].t_right_boundary; } olap[best].min_diag = min_diag; olap[best].max_diag = max_diag; olap[best].s_left_boundary = s_left_boundary; olap[best].s_right_boundary = s_right_boundary; olap[best].t_left_boundary = t_left_boundary; olap[best].t_right_boundary = t_right_boundary; for (i = 0; i < ct; i ++)
deleted[i] = (i != best); } // Combine overlaps whose overlap regions intersect sufficiently // in p[0 .. (ct - 1)] by marking the poorer quality one // deleted (by setting deleted[] true for it) and combining // its min/max info in the other. Assume all entries in // deleted are 0 initially. static void Merge_Intersecting_Olaps(Olap_Info_t p[], int ct, int deleted[]) { int i, j, lo_diag, hi_diag; for (i = 0; i < ct - 1; i ++) for (j = i + 1; j < ct; j ++) { if (deleted[i] || deleted[j]) continue; lo_diag = p[i].min_diag; hi_diag = p[i].max_diag; if ((lo_diag <= 0 && p[j].min_diag > 0) || (lo_diag > 0 && p[j].min_diag <= 0)) continue; if ((lo_diag >= 0 && p[j]. t_right_boundary - lo_diag - p[j].s_left_boundary >= MIN_INTERSECTION) || (lo_diag <= 0 && p[j]. s_right_boundary + lo_diag - p[j]. t_left_boundary >= MIN_INTERSECTION) || (hi_diag >= 0 && p[j]. t_right_boundary - hi_diag - p[j].s_left_boundary >= MIN_INTERSECTION) || (hi_diag <= 0 && p[j]. s_right_boundary + hi_diag - p[j]. t_left_boundary >= MIN_INTERSECTION)) { Olap_Info_t * discard, * keep; if (p[i].quality < p[j].quality) { keep = p + i; discard = p + j; deleted[j] = TRUE; } else { keep = p + j; discard = p + i; deleted[i] = TRUE; } if (discard->min_diag < keep->min_diag) keep->min_diag = discard->min_diag; if (discard->max_diag > keep->max_diag) keep->max_diag = discard->max_diag; if (discard->s_left_boundary < keep->s_left_boundary) keep->s_left_boundary = discard->s_left_boundary; if (discard->s_right_boundary > keep->s_right_boundary) keep->s_right_boundary = discard->s_right_boundary; if (discard->t_left_boundary < keep->t_left_boundary) keep->t_left_boundary = discard->t_left_boundary; if (discard->t_right_boundary > keep->t_right_boundary) keep->t_right_boundary = discard->t_right_boundary; } } } // Add information for the overlap between strings S and T // at positions s_lo .. s_hi and t_lo .. t_hi , resp., and // with quality qual to the array olap[] which // currently has (* ct) entries. 
Increment (* ct) if this // is a new, distinct overlap; otherwise, modify an existing // entry if this is just a "slide" of an existing overlap. static void Add_Overlap(int s_lo, int s_hi, int t_lo, int t_hi, double qual, Olap_Info_t * olap, int &ct, Work_Area_t * WA) { // If not partials, combine overlapping overlaps if (G.Doing_Partial_Overlaps == false) { int32 new_diag = t_lo - s_lo; for (int32 i=0; i < ct; i++) { int32 old_diag = olap[i].t_lo - olap[i].s_lo; // If intersecting, just extend the existing saved overlap. if ((new_diag > 0 && old_diag > 0 && olap[i].t_right_boundary - new_diag - olap[i].s_left_boundary >= MIN_INTERSECTION) || (new_diag <= 0 && old_diag <= 0 && olap[i].s_right_boundary + new_diag - olap[i].t_left_boundary >= MIN_INTERSECTION)) { if (new_diag < olap[i].min_diag) olap[i].min_diag = new_diag; if (new_diag > olap[i].max_diag) olap[i].max_diag = new_diag; if (s_lo < olap[i].s_left_boundary) olap[i].s_left_boundary = s_lo; if (s_hi > olap[i].s_right_boundary) olap[i].s_right_boundary = s_hi; if (t_lo < olap[i].t_left_boundary) olap[i].t_left_boundary = t_lo; if (t_hi > olap[i].t_right_boundary) olap[i].t_right_boundary = t_hi; // If better quality, copy in the new overlap if (qual < olap[i].quality) { olap[i].s_lo = s_lo; olap[i].s_hi = s_hi; olap[i].t_lo = t_lo; olap[i].t_hi = t_hi; olap[i].quality = qual; //memcpy(& (olap[i].delta), WA->editDist->Left_Delta, WA->editDist->Left_Delta_Len * sizeof (int)); memcpy(olap[i].delta, WA->editDist->Left_Delta, WA->editDist->Left_Delta_Len * sizeof(int32)); olap[i].delta_ct = WA->editDist->Left_Delta_Len; } return; } } } if (ct >= MAX_DISTINCT_OLAPS) { // no room for a new entry; this shouldn't happen //fprintf(stderr, "SKIP - no space left.\n"); return; } // Add a new overlap olap[ct].s_lo = olap[ct].s_left_boundary = s_lo; olap[ct].s_hi = olap[ct].s_right_boundary = s_hi; olap[ct].t_lo = olap[ct].t_left_boundary = t_lo; olap[ct].t_hi = olap[ct].t_right_boundary = t_hi; olap[ct].quality = qual; 
//memcpy(& (olap[ct].delta), WA->editDist->Left_Delta, WA->editDist->Left_Delta_Len * sizeof (int)); memcpy(olap[ct].delta, WA->editDist->Left_Delta, WA->editDist->Left_Delta_Len * sizeof(int32)); olap[ct].delta_ct = WA->editDist->Left_Delta_Len; olap[ct].min_diag = t_lo - s_lo; olap[ct].max_diag = t_lo - s_lo; ct++; } // Return TRUE iff the exact match region beginning at // position start in the first string and offset in // the second string lies along the alignment from // lo .. hi on the first string where the delta-encoding // of the alignment is given by // WA->Left_Delta[0 .. (WA->Left_Delta_Len-1)] . static int Lies_On_Alignment(int start, int offset, int s_lo, int t_lo, Work_Area_t * WA) { int i, diag, new_diag; diag = t_lo - s_lo; new_diag = offset - start; for (i = 0; i < WA->editDist->Left_Delta_Len; i ++) { s_lo += abs (WA->editDist->Left_Delta[i]); if (start < s_lo) return (abs (new_diag - diag) <= SHIFT_SLACK); if (WA->editDist->Left_Delta[i] < 0) diag ++; else { s_lo ++; diag --; } } return (abs (new_diag - diag) <= SHIFT_SLACK); } // Choose the best partial overlap in olap[0 .. (ct - 1)] . // Set the corresponding deleted entry to false for the others. // Best is the greatest number of matching bases. static void Choose_Best_Partial(Olap_Info_t * olap, int ct, int deleted[]) { double matching_bases; int i, best; best = 0; matching_bases = (1.0 - olap[0].quality) * (2 + olap[0].s_hi - olap[0].s_lo + olap[0].t_hi - olap[0].t_lo); // actually twice the number of matching bases but the max will be // the same overlap for (i = 1; i < ct; i ++) { double mb; mb = (1.0 - olap[i].quality) * (2 + olap[i].s_hi - olap[i].s_lo + olap[i].t_hi - olap[i].t_lo); if (matching_bases < mb || (matching_bases == mb && olap[i].quality < olap[best].quality)) best = i; } for (i = 0; i < ct; i ++) deleted[i] = (i != best); } // Return TRUE iff there is any length-( window_len ) subarray // of a[0 .. (n-1)] that sums to threshold or higher. 
static int Has_Bad_Window(char *a, int n, int window_len, int threshold) { if (n < window_len) return(FALSE); int32 sum = 0; int32 i=0; int32 j=0; for (i=0; i<window_len; i++) sum += a[i]; if (sum >= threshold) return(TRUE); while (i < n) { sum -= a[j++]; sum += a[i++]; if (sum >= threshold) return(TRUE); } return(FALSE); } // Find and report all overlaps and branch points between string S // (with length S_Len and id S_ID ) and string T (with // length & screen info in t_info and id T_ID ) using the exact // matches in the list beginning at subscript (* Start). Dir is // the orientation of S . static void Process_Matches (int * Start, char * S, int S_Len, char * S_quality, uint32 S_ID, Direction_t Dir, char * T, Hash_Frag_Info_t t_info, char * T_quality, uint32 T_ID, Work_Area_t * WA, int consistent) { int P, * Ref; Olap_Info_t *distinct_olap = NULL; Match_Node_t * Longest_Match, * Ptr; Overlap_t Kind_Of_Olap = NONE; double Quality; int Olap_Len; int overlaps_output = 0; int distinct_olap_ct; int Max_Len, S_Lo, S_Hi, T_Lo, T_Hi; int t_len; int Done_S_Left, Done_S_Right; int Errors; Done_S_Left = Done_S_Right = FALSE; t_len = t_info.length; assert ((* Start) != 0); // If a singleton match is hopeless on either side // it needn't be processed if (G.Use_Hopeless_Check && WA->Match_Node_Space[(* Start)].Next == 0 && ! G.Doing_Partial_Overlaps) { int s_head, t_head, s_tail, t_tail; int is_hopeless = FALSE; s_head = WA->Match_Node_Space[(* Start)].Start; t_head = WA->Match_Node_Space[(* Start)].Offset; if (s_head <= t_head) { if (s_head > HOPELESS_MATCH && ! WA->left_end_screened) is_hopeless = TRUE; } else { if (t_head > HOPELESS_MATCH && ! t_info.lfrag_end_screened) is_hopeless = TRUE; } s_tail = S_Len - s_head - WA->Match_Node_Space[(* Start)].Len + 1; t_tail = t_len - t_head - WA->Match_Node_Space[(* Start)].Len + 1; if (s_tail <= t_tail) { if (s_tail > HOPELESS_MATCH && ! WA->right_end_screened) is_hopeless = TRUE; } else { if (t_tail > HOPELESS_MATCH && !
t_info.rfrag_end_screened) is_hopeless = TRUE; } if (is_hopeless) { (* Start) = 0; WA->Kmer_Hits_Without_Olap_Ct ++; return; } } distinct_olap = WA->distinct_olap; distinct_olap_ct = 0; while ((* Start) != 0) { int a_hang, b_hang; int hit_limit = FALSE; Max_Len = WA->Match_Node_Space[(* Start)].Len; Longest_Match = WA->Match_Node_Space + (* Start); for (P = WA->Match_Node_Space[(* Start)].Next; P != 0; P = WA->Match_Node_Space[P].Next) if (WA->Match_Node_Space[P].Len > Max_Len) { Max_Len = WA->Match_Node_Space[P].Len; Longest_Match = WA->Match_Node_Space + P; } a_hang = Longest_Match->Start - Longest_Match->Offset; b_hang = a_hang + S_Len - t_len; hit_limit = ((WA->A_Olaps_For_Frag >= G.Frag_Olap_Limit && a_hang <= 0) || (WA->B_Olaps_For_Frag >= G.Frag_Olap_Limit && b_hang <= 0)); if (! hit_limit) { //fprintf(stderr, "Extend_Alignment()- start %d len %d offset %d diag %d - S ID %u %d-%d - T ID %u %d-%d\n", // Longest_Match->Start, // Longest_Match->Len, // Longest_Match->Offset, // Longest_Match->Start - Longest_Match->Offset, // S_ID, S_Lo, S_Hi, T_ID, T_Lo, T_Hi); Kind_Of_Olap = WA->editDist->Extend_Alignment(Longest_Match, S, S_ID, S_Len, T, T_ID, t_len, S_Lo, S_Hi, T_Lo, T_Hi, Errors); if (Kind_Of_Olap == DOVETAIL || G.Doing_Partial_Overlaps) { if (1 + S_Hi - S_Lo >= G.Min_Olap_Len && 1 + T_Hi - T_Lo >= G.Min_Olap_Len) { Olap_Len = 1 + MIN (S_Hi - S_Lo, T_Hi - T_Lo); Quality = (double) Errors / Olap_Len; if (Errors <= WA->editDist->Error_Bound[Olap_Len]) { //fprintf(stderr, "Add_Overlap()- quality %f count %d\n", Quality, distinct_olap_ct); Add_Overlap (S_Lo, S_Hi, T_Lo, T_Hi, Quality, distinct_olap, distinct_olap_ct, WA); } } } } if (consistent) (* Start) = 0; for (Ref = Start; (* Ref) != 0; ) { Ptr = WA->Match_Node_Space + (* Ref); if (Ptr == Longest_Match || ((Kind_Of_Olap == DOVETAIL || G.Doing_Partial_Overlaps) && S_Lo - SHIFT_SLACK <= Ptr->Start && Ptr->Start + Ptr->Len <= (S_Hi + 1) + SHIFT_SLACK - 1 && Lies_On_Alignment (Ptr->Start, Ptr->Offset, S_Lo, 
T_Lo, WA) )) (* Ref) = Ptr->Next; // Remove this node, it matches the alignment else Ref = & (Ptr->Next); } } if (distinct_olap_ct > 0) { int deleted[MAX_DISTINCT_OLAPS] = {0}; Olap_Info_t * p; int i; // Check if any previously distinct overlaps should be merged because // of other merges. if (G.Doing_Partial_Overlaps) { if (G.Unique_Olap_Per_Pair) Choose_Best_Partial (distinct_olap, distinct_olap_ct, deleted); //else // Do nothing, output them all } else { if (G.Unique_Olap_Per_Pair) Combine_Into_One_Olap (distinct_olap, distinct_olap_ct, deleted); else Merge_Intersecting_Olaps (distinct_olap, distinct_olap_ct, deleted); } p = distinct_olap; for (i = 0; i < distinct_olap_ct; i ++) { if (! deleted[i]) { bool rejected = FALSE; if (G.Use_Window_Filter) { int32 d; int32 i = p->s_lo; int32 j = p->t_lo; int32 q_len = 0; char *q_diff = WA->q_diff; for (int32 k=0; k<p->delta_ct; k++) { int32 len = abs(p->delta[k]); for (int32 n = 1; n < len; n++) { if (S[i] == T[j] || S[i] == 'n' || T[j] == 'n') { d = 0; } else { d = MIN (S_quality[i], T_quality[j]); d = MIN (d, QUALITY_CUTOFF); } q_diff[q_len++] = d; i++; j++; } if (p->delta[k] > 0) { d = S_quality[i]; i++; } else { d = T_quality[j]; j++; } q_diff[q_len++] = MIN (d, QUALITY_CUTOFF); } while (i <= p->s_hi) { if (S[i] == T[j] || S[i] == 'n' || T[j] == 'n') { d = 0; } else { d = MIN(S_quality[i], T_quality[j]); d = MIN(d, QUALITY_CUTOFF); } q_diff[q_len++] = d; i++; j++; } if (Has_Bad_Window(q_diff, q_len, BAD_WINDOW_LEN, BAD_WINDOW_VALUE)) { rejected = TRUE; Bad_Short_Window_Ct++; } else if (Has_Bad_Window(q_diff, q_len, 100, 240)) { rejected = TRUE; Bad_Long_Window_Ct++; } } if (!
rejected) { if (G.Doing_Partial_Overlaps) Output_Partial_Overlap(S_ID, T_ID, Dir, p, S_Len, t_len, WA); else Output_Overlap(S_ID, S_Len, Dir, T_ID, t_len, p, WA); overlaps_output++; if (p->s_lo == 0) WA->A_Olaps_For_Frag++; if (p->s_hi >= S_Len - 1) WA->B_Olaps_For_Frag++; } } p++; } } if (overlaps_output == 0) WA->Kmer_Hits_Without_Olap_Ct++; else { WA->Kmer_Hits_With_Olap_Ct++; if (overlaps_output > 1) WA->Multi_Overlap_Ct++; } return; } // Compare the diag_sum fields in a and b as (String_Olap_t *) 's and // return -1 if a < b , 0 if a == b , and 1 if a > b . // Used for qsort . static int By_Diag_Sum (const void * a, const void * b) { String_Olap_t * x, * y; x = (String_Olap_t *) a; y = (String_Olap_t *) b; if (x->diag_sum < y->diag_sum) return -1; else if (x->diag_sum > y->diag_sum) return 1; return 0; } // Find and report all overlaps and branch points between string S // and all strings in global String_Olap_Space . // Return the number of entries processed. // Len is the length of S , ID is its fragment ID and // Dir indicates if S is forward, or reverse-complemented. int Process_String_Olaps (char * S, int Len, char * S_quality, uint32 ID, Direction_t Dir, Work_Area_t * WA) { int32 i, ct, root_num, start, processed_ct; // Move all full entries to front of String_Olap_Space and set // diag_sum to average diagonal. if enough entries to bother, // sort by average diagonal. Then process positive & negative diagonals // separately in order from longest to shortest overlap. Stop // processing when output limit has been reached. 
for (i = ct = 0; i < WA->Next_Avail_String_Olap; i ++) if (WA->String_Olap_Space[i].Full) { root_num = WA->String_Olap_Space[i].String_Num; if (root_num + Hash_String_Num_Offset > ID) { if (WA->String_Olap_Space[i].Match_List == 0) { fprintf (stderr, " Curr_String_Num = %d root_num %d have no matches\n", ID, root_num); exit (-2); } if (i != ct) WA->String_Olap_Space[ct] = WA->String_Olap_Space[i]; assert (WA->String_Olap_Space[ct].diag_ct > 0); WA->String_Olap_Space[ct].diag_sum /= WA->String_Olap_Space[ct].diag_ct; ct ++; } } if (ct == 0) return ct; if (ct <= G.Frag_Olap_Limit) { for (i = 0; i < ct; i ++) { root_num = WA->String_Olap_Space[i].String_Num; //fprintf(stderr, "Processing overlap from %d and global, curr match is %d of %.2f len and %d diag matches min of %d\n", ID, (root_num + Hash_String_Num_Offset), (double)WA->String_Olap_Space[i].diag_end-WA->String_Olap_Space[i].diag_bgn, WA->String_Olap_Space[i].diag_ct, computeMinimumKmers(G.Kmer_Len, WA->String_Olap_Space[i].diag_end-WA->String_Olap_Space[i].diag_bgn, G.maxErate)); if (computeMinimumKmers(G.Kmer_Len, WA->String_Olap_Space[i].diag_end-WA->String_Olap_Space[i].diag_bgn, G.maxErate) > WA->String_Olap_Space[i].diag_ct) { WA->Kmer_Hits_Skipped_Ct++; continue; } Process_Matches(&WA->String_Olap_Space[i].Match_List, S, Len, S_quality, ID, Dir, basesData + String_Start[root_num], String_Info[root_num], qualsData + String_Start[root_num], root_num + Hash_String_Num_Offset, WA, WA->String_Olap_Space[i].consistent); assert(WA->String_Olap_Space[i].Match_List == 0); } return ct; } qsort (WA->String_Olap_Space, ct, sizeof (String_Olap_t), By_Diag_Sum); for (start = 0; start < ct && WA->String_Olap_Space[start].diag_sum < 0; start ++) ; processed_ct = 0; for (i = start; i < ct && WA->A_Olaps_For_Frag < G.Frag_Olap_Limit ; i ++) { root_num = WA->String_Olap_Space[i].String_Num; if (computeMinimumKmers(G.Kmer_Len, WA->String_Olap_Space[i].diag_end-WA->String_Olap_Space[i].diag_bgn, G.maxErate) > 
WA->String_Olap_Space[i].diag_ct) { WA->Kmer_Hits_Skipped_Ct++; continue; } Process_Matches(&WA->String_Olap_Space[i].Match_List, S, Len, S_quality, ID, Dir, basesData + String_Start[root_num], String_Info[root_num], qualsData + String_Start[root_num], root_num + Hash_String_Num_Offset, WA, WA->String_Olap_Space[i].consistent); assert(WA->String_Olap_Space[i].Match_List == 0); processed_ct ++; } for (i = start - 1; i >= 0 && WA->B_Olaps_For_Frag < G.Frag_Olap_Limit ; i --) { root_num = WA->String_Olap_Space[i].String_Num; if (computeMinimumKmers(G.Kmer_Len, WA->String_Olap_Space[i].diag_end-WA->String_Olap_Space[i].diag_bgn, G.maxErate) > WA->String_Olap_Space[i].diag_ct) { WA->Kmer_Hits_Skipped_Ct++; continue; } Process_Matches(&WA->String_Olap_Space[i].Match_List, S, Len, S_quality, ID, Dir, basesData + String_Start[root_num], String_Info[root_num], qualsData + String_Start[root_num], root_num + Hash_String_Num_Offset, WA, WA->String_Olap_Space[i].consistent); assert(WA->String_Olap_Space[i].Match_List == 0); processed_ct ++; } return processed_ct; } canu-1.6/src/overlapInCore/overlapInCore.C000066400000000000000000000515341314437614700205430ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * This file is derived from: * * src/AS_OVL/AS_OVL_driver_common.h * src/AS_OVM/overlapInCore.C * * Modifications by: * * Michael Schatz from 2004-SEP-23 to 2012-JAN-26 * are Copyright 2004,2012 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2005-MAR-22 * are Copyright 2005 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005-2009,2011-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-AUG-11 to 2015-AUG-25 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-27 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2015-NOV-20 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "overlapInCore.H" #include "AS_UTL_decodeRange.H" oicParameters G; uint32 STRING_NUM_BITS = 31; // MUST BE EXACTLY THIS uint32 OFFSET_BITS = 31; uint64 TRUELY_ONE = (uint64)1; uint64 TRUELY_ZERO = (uint64)0; uint64 STRING_NUM_MASK = (TRUELY_ONE << STRING_NUM_BITS) - 1; uint64 OFFSET_MASK = (TRUELY_ONE << OFFSET_BITS) - 1; uint64 MAX_STRING_NUM = STRING_NUM_MASK; int64 Bad_Short_Window_Ct = 0; // The number of overlaps rejected because of too many errors in a small window int64 Bad_Long_Window_Ct = 0; // The number of overlaps rejected because of too many errors in a long window // Stores sequence and quality data of fragments in hash table char *basesData = NULL; char *qualsData = NULL; size_t Data_Len = 0; String_Ref_t *nextRef = NULL; size_t Extra_Data_Len; // Total length available for hash table string data, // including both regular strings and extra strings // added from kmer screening uint64 Max_Extra_Ref_Space = 0; // allocated amount uint64 Extra_Ref_Ct = 0; // used amount String_Ref_t *Extra_Ref_Space = NULL; uint64 Extra_String_Ct = 0; // Number of extra strings of screen kmers added to hash table uint64 Extra_String_Subcount = 0; // Number of kmers already added to last extra string in hash table Check_Vector_t * Hash_Check_Array = NULL; // Bit vector to eliminate impossible hash matches uint64 Hash_String_Num_Offset = 1; Hash_Bucket_t * Hash_Table; uint64 Kmer_Hits_With_Olap_Ct = 0; uint64 Kmer_Hits_Without_Olap_Ct = 0; uint64 Kmer_Hits_Skipped_Ct = 0; uint64 Multi_Overlap_Ct = 0; uint64 String_Ct; // Number of fragments in the hash table Hash_Frag_Info_t * String_Info = NULL; int64 * String_Start = NULL; uint32 String_Start_Size = 0; // Number of available positions in String_Start size_t Used_Data_Len = 0; // Number of bytes of Data currently occupied, including // regular strings and extra kmer screen strings int32 Bit_Equivalent[256] = {0}; // Table to convert characters to 2-bit integer code int32 Char_Is_Bad[256] = {0}; // Table to 
check if character is not a, c, g or t. uint64 Hash_Entries = 0; uint64 Total_Overlaps = 0; uint64 Contained_Overlap_Ct = 0; uint64 Dovetail_Overlap_Ct = 0; uint64 HSF1 = 666; uint64 HSF2 = 666; uint64 SV1 = 666; uint64 SV2 = 666; uint64 SV3 = 666; ovFile *Out_BOF = NULL; // Allocate memory for (* WA) and set initial values. // Set thread_id field to id . void Initialize_Work_Area(Work_Area_t *WA, int id, gkStore *gkpStore) { uint64 allocated = 0; WA->String_Olap_Size = INIT_STRING_OLAP_SIZE; WA->String_Olap_Space = new String_Olap_t [WA->String_Olap_Size]; WA->Match_Node_Size = INIT_MATCH_NODE_SIZE; WA->Match_Node_Space = new Match_Node_t [WA->Match_Node_Size]; allocated += WA->String_Olap_Size * sizeof (String_Olap_t); allocated += WA->Match_Node_Size * sizeof (Match_Node_t); WA->status = 0; WA->thread_id = id; WA->gkpStore = gkpStore; WA->overlapsLen = 0; WA->overlapsMax = 1024 * 1024 / sizeof(ovOverlap); WA->overlaps = ovOverlap::allocateOverlaps(WA->gkpStore, WA->overlapsMax); allocated += sizeof(ovOverlap) * WA->overlapsMax; WA->editDist = new prefixEditDistance(G.Doing_Partial_Overlaps, G.maxErate); WA->q_diff = new char [AS_MAX_READLEN]; WA->distinct_olap = new Olap_Info_t [MAX_DISTINCT_OLAPS]; } void Delete_Work_Area(Work_Area_t *WA) { delete WA->editDist; delete [] WA->String_Olap_Space; delete [] WA->Match_Node_Space; delete [] WA->overlaps; delete [] WA->distinct_olap; delete [] WA->q_diff; } int OverlapDriver(void) { Work_Area_t *thread_wa = new Work_Area_t [G.Num_PThreads]; gkStore *gkpStore = gkStore::gkStore_open(G.Frag_Store_Path); Out_BOF = new ovFile(gkpStore, G.Outfile_Name, ovFileFullWrite); fprintf(stderr, "Initializing %u work areas.\n", G.Num_PThreads); #pragma omp parallel for for (uint32 i=0; i<G.Num_PThreads; i++) Initialize_Work_Area(thread_wa + i, i, gkpStore); if (gkpStore->gkStore_getNumReads() < G.endHashID) G.endHashID = gkpStore->gkStore_getNumReads(); // Note distinction between the local bgn/end and the global G.bgn/G.end.
uint32 bgnHashID = G.bgnHashID; uint32 endHashID = G.bgnHashID + G.Max_Hash_Strings - 1; // Inclusive! // Iterate over read blocks, build a hash table, then search in threads. while (bgnHashID < G.endHashID) { if (endHashID > G.endHashID) endHashID = G.endHashID; assert(0 < bgnHashID); assert(bgnHashID <= endHashID); assert(endHashID <= gkpStore->gkStore_getNumReads()); // Load as much as we can. If we load less than expected, the endHashID is updated to reflect // the last read loaded. endHashID = Build_Hash_Index(gkpStore, bgnHashID, endHashID); // Decide the range of reads to process. No more than what is loaded in the table. if (G.bgnRefID < 1) G.bgnRefID = 1; if (G.endRefID > gkpStore->gkStore_getNumReads()) G.endRefID = gkpStore->gkStore_getNumReads(); G.curRefID = G.bgnRefID; // The old version used to further divide the ref range into blocks of at most // Max_Reads_Per_Batch so that those reads could be loaded into core. We don't // need to do that anymore. G.perThread = 1 + (G.endRefID - G.bgnRefID) / G.Num_PThreads / 8; fprintf(stderr, "\n"); fprintf(stderr, "Range: %u-%u. Store has %u reads.\n", G.bgnRefID, G.endRefID, gkpStore->gkStore_getNumReads()); fprintf(stderr, "Chunk: " F_U32 " reads/thread -- (G.endRefID=" F_U32 " - G.bgnRefID=" F_U32 ") / G.Num_PThreads=" F_U32 " / 8\n", G.perThread, G.endRefID, G.bgnRefID, G.Num_PThreads); fprintf(stderr, "\n"); fprintf(stderr, "Starting " F_U32 "-" F_U32 " with " F_U32 " per thread\n", G.bgnRefID, G.endRefID, G.perThread); fprintf(stderr, "\n"); // Initialize each thread, reset the current position. curRefID and endRefID are updated, this // cannot be done in the parallel loop! 
for (uint32 i=0; igkStore_close(); for (uint32 i=0; i 0.06) { if (G.Use_Window_Filter) fprintf(stderr, "High error rates requested -- window-filter turned off despite -w flag!\n"); G.Use_Window_Filter = FALSE; G.Use_Hopeless_Check = FALSE; } if (G.Max_Hash_Strings == 0) fprintf(stderr, "* No memory model supplied; -M needed!\n"), err++; if (G.Kmer_Len == 0) fprintf(stderr, "* No kmer length supplied; -k needed!\n"), err++; if (G.Max_Hash_Strings > MAX_STRING_NUM) fprintf(stderr, "Too many strings (--hashstrings), must be less than " F_U64 "\n", MAX_STRING_NUM), err++; if (G.Outfile_Name == NULL) fprintf (stderr, "ERROR: No output file name specified\n"), err++; if ((err) || (G.Frag_Store_Path == NULL)) { fprintf(stderr, "USAGE: %s [options] \n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, "-b in contig mode, specify the output file\n"); fprintf(stderr, "-c contig mode. Use 2 frag stores. First is\n"); fprintf(stderr, " for reads; second is for contigs\n"); fprintf(stderr, "-G do partial overlaps\n"); fprintf(stderr, "-h to specify fragments to put in hash table\n"); fprintf(stderr, " Implies LSF mode (no changes to frag store)\n"); fprintf(stderr, "-I designate a file of frag iids to limit olaps to\n"); fprintf(stderr, " (Contig mode only)\n"); fprintf(stderr, "-k if one or two digits, the length of a kmer, otherwise\n"); fprintf(stderr, " the filename containing a list of kmers to ignore in\n"); fprintf(stderr, " the hash table\n"); fprintf(stderr, "-l specify the maximum number of overlaps per\n"); fprintf(stderr, " fragment-end per batch of fragments.\n"); fprintf(stderr, "-m allow multiple overlaps per oriented fragment pair\n"); fprintf(stderr, "-M specify memory size. Valid values are '8GB', '4GB',\n"); fprintf(stderr, " '2GB', '1GB', '256MB'. 
(Not for Contig mode)\n"); fprintf(stderr, "-o specify output file name\n"); fprintf(stderr, "-P write protoIO output (if not -G)\n"); fprintf(stderr, "-r specify old fragments to overlap\n"); fprintf(stderr, "-t use parallel threads\n"); fprintf(stderr, "-u allow only 1 overlap per oriented fragment pair\n"); fprintf(stderr, "-w filter out overlaps with too many errors in a window\n"); fprintf(stderr, "-z skip the hopeless check\n"); fprintf(stderr, "\n"); fprintf(stderr, "--maxerate only output overlaps with fraction or less error (e.g., 0.06 == 6%%)\n"); fprintf(stderr, "--minlength only output overlaps of or more bases\n"); fprintf(stderr, "\n"); fprintf(stderr, "--hashbits n Use n bits for the hash mask.\n"); fprintf(stderr, "--hashstrings n Load at most n strings into the hash table at one time.\n"); fprintf(stderr, "--hashdatalen n Load at most n bytes into the hash table at one time.\n"); fprintf(stderr, "--hashload f Load to at most 0.0 < f < 1.0 capacity (default 0.7).\n"); fprintf(stderr, "\n"); fprintf(stderr, "--maxreadlen n For batches with all short reads, pack bits differently to\n"); fprintf(stderr, " process more reads per batch.\n"); fprintf(stderr, " all reads must be shorter than n\n"); fprintf(stderr, " --hashstrings limited to 2^(30-m)\n"); fprintf(stderr, " Common values:\n"); fprintf(stderr, " maxreadlen 2048->hashstrings 524288 (default)\n"); fprintf(stderr, " maxreadlen 512->hashstrings 2097152\n"); fprintf(stderr, " maxreadlen 128->hashstrings 8388608\n"); fprintf(stderr, "\n"); fprintf(stderr, "--readsperbatch n Force batch size to n.\n"); fprintf(stderr, "--readsperthread n Force each thread to process n reads.\n"); fprintf(stderr, "\n"); exit(1); } // We know enough now to set the hash function variables, and some other random variables. HSF1 = G.Kmer_Len - (G.Hash_Mask_Bits / 2); HSF2 = 2 * G.Kmer_Len - G.Hash_Mask_Bits; SV1 = HSF1 + 2; SV2 = (HSF1 + HSF2) / 2; SV3 = HSF2 - 2; // Log parameters. 
fprintf(stderr, "\n"); fprintf(stderr, "STRING_NUM_BITS " F_U32 "\n", STRING_NUM_BITS); fprintf(stderr, "OFFSET_BITS " F_U32 "\n", OFFSET_BITS); fprintf(stderr, "STRING_NUM_MASK " F_U64 "\n", STRING_NUM_MASK); fprintf(stderr, "OFFSET_MASK " F_U64 "\n", OFFSET_MASK); fprintf(stderr, "MAX_STRING_NUM " F_U64 "\n", MAX_STRING_NUM); fprintf(stderr, "\n"); fprintf(stderr, "Hash_Mask_Bits " F_U32 "\n", G.Hash_Mask_Bits); fprintf(stderr, "Max_Hash_Strings " F_U32 "\n", G.Max_Hash_Strings); fprintf(stderr, "Max_Hash_Data_Len " F_U64 "\n", G.Max_Hash_Data_Len); fprintf(stderr, "Max_Hash_Load %f\n", G.Max_Hash_Load); fprintf(stderr, "Kmer Length " F_U64 "\n", G.Kmer_Len); fprintf(stderr, "Min Overlap Length %d\n", G.Min_Olap_Len); fprintf(stderr, "Max Error Rate %f\n", G.maxErate); fprintf(stderr, "Min Kmer Matches " F_U64 "\n", G.Filter_By_Kmer_Count); fprintf(stderr, "\n"); fprintf(stderr, "Num_PThreads " F_U32 "\n", G.Num_PThreads); omp_set_num_threads(G.Num_PThreads); assert (8 * sizeof (uint64) > 2 * G.Kmer_Len); Bit_Equivalent['a'] = Bit_Equivalent['A'] = 0; Bit_Equivalent['c'] = Bit_Equivalent['C'] = 1; Bit_Equivalent['g'] = Bit_Equivalent['G'] = 2; Bit_Equivalent['t'] = Bit_Equivalent['T'] = 3; for (int i = 0; i < 256; i ++) { char ch = tolower ((char) i); if (ch == 'a' || ch == 'c' || ch == 'g' || ch == 't') Char_Is_Bad[i] = 0; else Char_Is_Bad[i] = 1; } fprintf(stderr, "\n"); fprintf(stderr, "HASH_TABLE_SIZE " F_U64 "\n", HASH_TABLE_SIZE); fprintf(stderr, "sizeof(Hash_Bucket_t) " F_U64 "\n", (uint64)sizeof(Hash_Bucket_t)); fprintf(stderr, "hash table size: " F_U64 " MB\n", (HASH_TABLE_SIZE * sizeof(Hash_Bucket_t)) >> 20); fprintf(stderr, "\n"); Hash_Table = new Hash_Bucket_t [HASH_TABLE_SIZE]; fprintf(stderr, "check " F_U64 " MB\n", ((HASH_TABLE_SIZE * sizeof (Check_Vector_t)) >> 20)); fprintf(stderr, "info " F_SIZE_T " MB\n", ((G.Max_Hash_Strings * sizeof (Hash_Frag_Info_t)) >> 20)); fprintf(stderr, "start " F_SIZE_T " MB\n", ((G.Max_Hash_Strings * sizeof (int64)) 
>> 20)); fprintf(stderr, "\n"); Hash_Check_Array = new Check_Vector_t [HASH_TABLE_SIZE]; String_Info = new Hash_Frag_Info_t [G.Max_Hash_Strings]; String_Start = new int64 [G.Max_Hash_Strings]; String_Start_Size = G.Max_Hash_Strings; memset(Hash_Check_Array, 0, sizeof(Check_Vector_t) * HASH_TABLE_SIZE); memset(String_Info, 0, sizeof(Hash_Frag_Info_t) * G.Max_Hash_Strings); memset(String_Start, 0, sizeof(int64) * G.Max_Hash_Strings); OverlapDriver(); delete [] basesData; delete [] qualsData; delete [] nextRef; delete [] String_Start; delete [] String_Info; delete [] Hash_Check_Array; delete [] Hash_Table; FILE *stats = stderr; if (G.Outstat_Name != NULL) { errno = 0; stats = fopen(G.Outstat_Name, "w"); if (errno) { fprintf(stderr, "WARNING: failed to open '%s' for writing: %s\n", G.Outstat_Name, strerror(errno)); stats = stderr; } } fprintf(stats, " Kmer hits without olaps = " F_S64 "\n", Kmer_Hits_Without_Olap_Ct); fprintf(stats, " Kmer hits with olaps = " F_S64 "\n", Kmer_Hits_With_Olap_Ct); //fprintf(stats, " Kmer hits below %u = " F_S64 "\n", G.Filter_By_Kmer_Count, Kmer_Hits_Skipped_Ct); fprintf(stats, " Multiple overlaps/pair = " F_S64 "\n", Multi_Overlap_Ct); fprintf(stats, " Total overlaps produced = " F_S64 "\n", Total_Overlaps); fprintf(stats, " Contained overlaps = " F_S64 "\n", Contained_Overlap_Ct); fprintf(stats, " Dovetail overlaps = " F_S64 "\n", Dovetail_Overlap_Ct); fprintf(stats, "Rejected by short window = " F_S64 "\n", Bad_Short_Window_Ct); fprintf(stats, " Rejected by long window = " F_S64 "\n", Bad_Long_Window_Ct); if (stats != stderr) fclose(stats); fprintf(stderr, "Bye.\n"); return(0); } canu-1.6/src/overlapInCore/overlapInCore.H000066400000000000000000000415511314437614700205460ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_OVL/AS_OVL_overlap.h * src/AS_OVM/overlapInCore.H * * Modifications by: * * Michael Schatz on 2004-SEP-23 * are Copyright 2004 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2005-MAR-22 * are Copyright 2005 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005,2007-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Aaron Halpern from 2005-SEP-29 to 2006-AUG-23 * are Copyright 2005-2006 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Art Delcher on 2007-FEB-13 * are Copyright 2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren on 2011-MAR-08 * are Copyright 2011 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz from 2014-DEC-15 to 2015-AUG-25 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-27 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2015-NOV-20 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_global.H" #include "gkStore.H" #include "ovStore.H" #include "prefixEditDistance.H" #ifndef OVERLAPINCORE_H #define OVERLAPINCORE_H #define HASH_KMER_SKIP 0 // Skip this many kmers between the kmers put into the hash // table. Setting this to 0 will make every kmer go // into the hash table. #define BAD_WINDOW_LEN 50 // Length of window in which to look for clustered errors // to invalidate an overlap #define BAD_WINDOW_VALUE (8 * QUALITY_CUTOFF) // This many or more errors in a window of BAD_WINDOW_LEN // invalidates an overlap #define CHECK_MASK 0xff // To set Check field in hash bucket #define DEFAULT_HI_HIT_LIMIT INT_MAX // Any kmer in hash table with at least this many hits // cannot initiate an overlap. Can be changed on the command // line with -K option #define DELETED_FRAG 2 // Indicates fragment was marked as deleted in the store #define DISPLAY_WIDTH 60 // Number of characters per line when displaying sequences #define ENTRIES_PER_BUCKET 21 // In main hash table. Recommended values are 21, 31 or 42 // depending on cache line size. 
#define HASH_CHECK_MASK 0x1f // Used to set and check bit in Hash_Check_Array // Change if change Check_Vector_t #define HASH_EXPANSION_FACTOR 1.4 // Hash table size is >= this times MAX_HASH_STRINGS #define HASH_MASK (((uint64)1 << G.Hash_Mask_Bits) - 1) // Extract right Hash_Mask_Bits bits of hash key #define HASH_TABLE_SIZE (1 + HASH_MASK) // Number of buckets in hash table #define HIGHEST_KMER_LIMIT 255 // If Hi_Hit_Limit is more than this, it's ignored #define HOPELESS_MATCH 90 // A string this long or longer without an exact kmer // match is assumed to be hopeless to find a match // within the error threshold #define IID_GAP_LIMIT 100 // When using a list of fragment IID's, gaps between // IID's this long or longer force a new load partial // store to be done #define INIT_MATCH_NODE_SIZE 10000 // Initial number of nodes to hold exact matches #define INIT_SCREEN_MATCHES 50 // Initial number of screen-match entries per fragment #define INIT_STRING_OLAP_SIZE 5000 // Initial number of different New fragments that // overlap a single Old fragment #define K_MER_STEP 1 // 1 = every k-mer in search // 2 = every other k-mer // 3 = every third, ... // Used to skip some k-mers in finding matches #define MAX_BRANCH_COUNT UCHAR_MAX // The largest branch point count before an overflow #define MAX_DISTINCT_OLAPS 3 // Most possible really different overlaps (i.e., not // just shifts from periodic regions) between 2 fragments // in a given orientation. For fragments of approximately // same size, should never be more than 2. 
#define EXPECTED_STRING_LEN (AS_MAX_READLEN / 2) #define INITIAL_DATA_LEN (EXPECTED_STRING_LEN * Max_Hash_Strings) // The number of bytes to allocate initially for hash-table sequence #define MAX_LINE_LEN 1000 // Maximum input line when reading FASTA file #define MAX_NAME_LEN 500 // Longest file name allowed #define MIN_CALC_KMER 4 // When calculating the Hi_Hit_Limit based on genome length, etc, // don't set it below this #define MIN_INTERSECTION 10 // Minimum length of match region to be worth reporting #define MIN_OLAP_OUTSIDE_SCREEN 30 // Minimum number of bases outside of screened regions // to be a reportable overlap. Entire overlap (including screened // portion) must still be MIN_OLAP_LEN . #define OUTPUT_OVERLAP_DELTAS 0 // If true include delta-encoding of overlap alignment // in overlap messages. Otherwise, omit them. // As of 6 Oct 2008, support for overlap deltas has been removed. // However, there are enough remnants in AS_MSG to output them. // Just enabling OUTPUT_OVERLAP_DELTAS will not compile; see // AS_MSG_USE_OVL_DELTA in AS_MSG. #define PROBE_MASK 0x3e // Used to determine probe step to resolve collisions #define QUALITY_CUTOFF 20 // Regard quality values higher than this as equal to this // for purposes of finding bad windows #define SCRIPT_NAME "lsf-ovl" // Default name of script produced by make-ovl-script #define SHIFT_SLACK 1 // Allow to be off by this many bases in combining/comparing alignments #define STRING_OLAP_SHIFT 8 // To compute hash function into the String_Olap hash table. #define STRING_OLAP_MODULUS (1 << STRING_OLAP_SHIFT) // The size of the String_Olap hash table. The rest of // the space is used for chaining. This number should be // relatively small to reflect the number of fragments a // given fragment has exact matches with. #define STRING_OLAP_MASK (STRING_OLAP_MODULUS - 1) // To compute hash function into the String_Olap hash table. 
#define THREAD_STACKSIZE (16 * 512 * 512) // The amount of stack space to allocate to each thread. #define VALID_FRAG 1 // Indicates fragment was valid in the fragment store //#define WINDOW_SCREEN_OLAP 10 // Amount by which k-mers can overlap a screen region and still // be added to the hash table. #define MAX_EXTRA_SUBCOUNT (AS_MAX_READLEN / G.Kmer_Len) #define HASH_FUNCTION(k) (((k) ^ ((k) >> HSF1) ^ ((k) >> HSF2)) & HASH_MASK) // Gives subscript in hash table for key k #define HASH_CHECK_FUNCTION(k) (((k) ^ ((k) >> SV1) ^ ((k) >> SV2)) & HASH_CHECK_MASK) // Gives bit position to see if key could be in bucket #define KEY_CHECK_FUNCTION(k) (((k) ^ ((k) >> SV1) ^ ((k) >> SV3)) & CHECK_MASK) // Gives bit pattern to see if key could match #define PROBE_FUNCTION(k) ((((k) ^ ((k) >> SV2) ^ ((k) >> SV3)) & PROBE_MASK) | 1) // Gives secondary hash function. Force to be odd so that will be relatively // prime wrt the hash table size, which is a power of 2. typedef enum Direction_Type { FORWARD, REVERSE } Direction_t; typedef struct String_Olap_Node { uint32 String_Num; // Of hash-table frag that have exact match with int32 Match_List; // Subscript of start of list of exact matches double diag_sum; // Sum of diagonals of all k-mer matches to this frag int32 diag_ct; // Count of all k-mer matches to this frag int diag_bgn; int diag_end; signed int Next : 29; // Next match if this is a collision unsigned Full : 1; unsigned consistent : 1; } String_Olap_t; typedef struct Olap_Info { int s_lo, s_hi; int t_lo, t_hi; double quality; int delta [AS_MAX_READLEN+1]; // needs only MAX_ERRORS int delta_ct; int s_left_boundary, s_right_boundary; int t_left_boundary, t_right_boundary; int min_diag, max_diag; } Olap_Info_t; // The following structure holds what used to be global information, but // is now encapsulated so that multiple copies can be made for multiple // parallel threads. 
typedef struct Work_Area { //int *Error_Bound; String_Olap_t * String_Olap_Space; int32 String_Olap_Size; int32 Next_Avail_String_Olap; Match_Node_t * Match_Node_Space; int32 Match_Node_Size; int32 Next_Avail_Match_Node; // Counts the number of overlaps for each fragment. Cuts off // overlaps above a limit. int32 A_Olaps_For_Frag; int32 B_Olaps_For_Frag; gkStore *gkpStore; int left_end_screened; int right_end_screened; int status; int thread_id; uint32 bgnID; // Range of reads we are processing uint32 endID; // was frag_segment_lo and frag_segment_hi (all lowercase) // Instead of outputting each overlap as we create it, we // buffer them and output blocks of overlaps. uint64 overlapsLen; uint64 overlapsMax; ovOverlap *overlaps; // Various stats that used to be global and updated whenever we // output an overlap or finished processing a set of hits. // Needed a mutex to update. uint64 Total_Overlaps; uint64 Contained_Overlap_Ct; uint64 Dovetail_Overlap_Ct; uint64 Kmer_Hits_Without_Olap_Ct; uint64 Kmer_Hits_With_Olap_Ct; uint64 Kmer_Hits_Skipped_Ct; uint64 Multi_Overlap_Ct; prefixEditDistance *editDist; char * q_diff; Olap_Info_t *distinct_olap; } Work_Area_t; typedef uint32 Check_Vector_t; // Bit vector to see if hash bucket could possibly contain a match typedef uint64 String_Ref_t; #define BIT_EMPT 62 #define BIT_LAST 63 extern uint32 STRING_NUM_BITS; extern uint32 OFFSET_BITS; extern uint64 TRUELY_ZERO; extern uint64 TRUELY_ONE; extern uint64 STRING_NUM_MASK; extern uint64 OFFSET_MASK; extern uint64 MAX_STRING_NUM; // // [ Last (1) ][ Empty (1) ][ Offset (11) ][ StringNum (19) ] // #define getStringRefStringNum(X) (((X) ) & STRING_NUM_MASK) #define getStringRefOffset(X) (((X) >> STRING_NUM_BITS) & OFFSET_MASK) #define getStringRefEmpty(X) (((X) >> BIT_EMPT ) & TRUELY_ONE) #define getStringRefLast(X) (((X) >> BIT_LAST ) & TRUELY_ONE) #define setStringRefStringNum(X, Y) ((X) = (((X) & ~(STRING_NUM_MASK )) | ((Y)))) #define setStringRefOffset(X, Y) ((X) = (((X) & 
~(OFFSET_MASK << STRING_NUM_BITS)) | ((Y) << STRING_NUM_BITS))) #define setStringRefEmpty(X, Y) ((X) = (((X) & ~(TRUELY_ONE << BIT_EMPT )) | ((Y) << BIT_EMPT))) #define setStringRefLast(X, Y) ((X) = (((X) & ~(TRUELY_ONE << BIT_LAST )) | ((Y) << BIT_LAST))) typedef struct Hash_Bucket { String_Ref_t Entry [ENTRIES_PER_BUCKET]; unsigned char Check [ENTRIES_PER_BUCKET]; unsigned char Hits [ENTRIES_PER_BUCKET]; int16 Entry_Ct; } Hash_Bucket_t; typedef struct Hash_Frag_Info { uint32 length : 30; uint32 lfrag_end_screened : 1; uint32 rfrag_end_screened : 1; } Hash_Frag_Info_t; extern char *basesData; extern char *qualsData; extern String_Ref_t *nextRef; extern size_t Data_Len; extern int64 Bad_Short_Window_Ct; extern int64 Bad_Long_Window_Ct; extern size_t Extra_Data_Len; extern uint64 Max_Extra_Ref_Space; extern uint64 Extra_Ref_Ct; extern String_Ref_t * Extra_Ref_Space; extern uint64 Extra_String_Ct; extern uint64 Extra_String_Subcount; extern Check_Vector_t * Hash_Check_Array; extern uint64 Hash_String_Num_Offset; extern Hash_Bucket_t * Hash_Table; extern uint64 Kmer_Hits_With_Olap_Ct; extern uint64 Kmer_Hits_Without_Olap_Ct; extern uint64 Kmer_Hits_Skipped_Ct; extern uint64 Multi_Overlap_Ct; extern uint64 String_Ct; extern Hash_Frag_Info_t * String_Info; extern int64 * String_Start; extern uint32 String_Start_Size; extern size_t Used_Data_Len; extern int32 Bit_Equivalent [256]; extern int32 Char_Is_Bad [256]; extern uint64 Hash_Entries; extern uint64 Total_Overlaps; extern uint64 Contained_Overlap_Ct; extern uint64 Dovetail_Overlap_Ct; class oicParameters { public: oicParameters() { initialize(); }; ~oicParameters() {}; void initialize(void) { maxErate = 0.06; Doing_Partial_Overlaps = false; bgnHashID = 1; endHashID = UINT32_MAX; minLibToHash = 0; maxLibToHash = UINT32_MAX; bgnRefID = 1; endRefID = UINT32_MAX; minLibToRef = 0; maxLibToRef = UINT32_MAX; Kmer_Len = 0; Kmer_Skip_File = NULL; Filter_By_Kmer_Count = 0; Frag_Olap_Limit = UINT64_MAX; Unique_Olap_Per_Pair = 
true; Hash_Mask_Bits = 22; Max_Hash_Load = 0.6; Max_Hash_Strings = 10000; Max_Hash_Data_Len = 100000000; Outfile_Name = NULL; Outstat_Name = NULL; Num_PThreads = 1; Min_Olap_Len = 0; Use_Window_Filter = false; Use_Hopeless_Check = true; Frag_Store_Path = NULL; }; double maxErate; // If set true by the G option (G for Granger) // then allow overlaps that do not extend to the end // of either read. bool Doing_Partial_Overlaps; // -G uint32 bgnHashID; // -h uint32 endHashID; uint32 minLibToHash; // -H uint32 maxLibToHash; // The range of reads we need to process uint32 frag_segment_lo; uint32 frag_segment_hi; uint32 bgnRefID; // -r uint32 curRefID; // When processing, this is where we are at, bgn < cur <= end. uint32 endRefID; uint32 minLibToRef; // -R uint32 maxLibToRef; uint32 perThread; // When processing, how many to do per block uint64 Kmer_Len; // -k uint64 Filter_By_Kmer_Count; FILE *Kmer_Skip_File; // -k // Maximum number of overlaps for end of an old fragment against // a single hash table of frags, in each orientation uint64 Frag_Olap_Limit; // -l // If true will allow at most // one overlap output message per oriented fragment pair // Set true by -u command-line option; set false by -m bool Unique_Olap_Per_Pair; // -m and -u uint32 Hash_Mask_Bits; // --hashbits uint32 Max_Hash_Strings; // --hashstrings uint64 Max_Hash_Data_Len; // --hashdatalen double Max_Hash_Load; // --hashload // --maxreadlen sets OFFSET_BITS, STRING_NUM_BITS, STRING_NUM_MASK and MAX_STRING_NUM. char *Outfile_Name; // -o char *Outstat_Name; // -s uint32 Num_PThreads; // -t int32 Min_Olap_Len; // --minlength, former -v // Determines whether check for a window containing too many // errors is used to disqualify overlaps. bool Use_Window_Filter; // -w // Determines whether check for absence of kmer matches // at the end of a read is used to abort the overlap before // the extension from a single kmer match is attempted. 
bool Use_Hopeless_Check; // -z char *Frag_Store_Path; }; extern oicParameters G; extern uint64 HSF1; extern uint64 HSF2; extern uint64 SV1; extern uint64 SV2; extern uint64 SV3; extern ovFile *Out_BOF; void Output_Overlap(uint32 S_ID, int S_Len, Direction_t S_Dir, uint32 T_ID, int T_Len, Olap_Info_t * olap, Work_Area_t *WA); void Output_Partial_Overlap(uint32 s_id, uint32 t_id, Direction_t dir, const Olap_Info_t * p, int s_len, int t_len, Work_Area_t *WA); int Process_String_Olaps (char * S, int Len, char * S_quality, uint32 ID, Direction_t Dir, Work_Area_t * WA); void Find_Overlaps (char Frag [], int Frag_Len, char quality [], uint32 Frag_Num, Direction_t Dir, Work_Area_t * WA); void * Process_Overlaps (void *); int Build_Hash_Index(gkStore *store, uint32 bgnID, uint32 endID); #endif // OVERLAPINCORE_H canu-1.6/src/overlapInCore/overlapInCore.mk000066400000000000000000000013251314437614700207610ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := overlapInCore SOURCES := overlapInCore.C \ overlapInCore-Build_Hash_Index.C \ overlapInCore-Find_Overlaps.C \ overlapInCore-Output.C \ overlapInCore-Process_Overlaps.C \ overlapInCore-Process_String_Overlaps.C SRC_INCDIRS := .. ../AS_UTL ../stores liboverlap TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/overlapInCore/overlapInCorePartition.C000066400000000000000000000342321314437614700224310ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_OVL/overlap_partition.C * * Modifications by: * * Brian P. Walenz from 2011-JUN-12 to 2013-AUG-01 * are Copyright 2011-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2012-JUL-29 to 2013-DEC-02 * are Copyright 2012-2013 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz from 2014-NOV-21 to 2015-AUG-25 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "AS_UTL_decodeRange.H" // Reads gkpStore, outputs three files: // ovlbat - batch names // ovljob - job names // ovlopt - overlapper options // // From (very) old paper notes, overlapInCore only computes overlaps for referenceID < hashID. 
uint32 batchMax = 1000; void outputJob(FILE *BAT, FILE *JOB, FILE *OPT, uint32 hashBeg, uint32 hashEnd, uint32 refBeg, uint32 refEnd, uint32 maxNumReads, uint32 maxLength, uint32 &batchSize, uint32 &batchName, uint32 &jobName) { fprintf(BAT, "%03" F_U32P "\n", batchName); fprintf(JOB, "%06" F_U32P "\n", jobName); if (maxNumReads == 0) { fprintf(OPT, "-h " F_U32 "-" F_U32 " -r " F_U32 "-" F_U32 "\n", hashBeg, hashEnd, refBeg, refEnd); fprintf(stderr, "HASH %10d-%10d REFR %10d-%10d JOB %d\n", hashBeg, hashEnd, refBeg, refEnd, jobName); } else { fprintf(OPT, "-h " F_U32 "-" F_U32 " -r " F_U32 "-" F_U32 " --hashstrings " F_U32 " --hashdatalen " F_U32 "\n", hashBeg, hashEnd, refBeg, refEnd, maxNumReads, maxLength); fprintf(stderr, "HASH %10d-%10d REFR %10d-%10d STRINGS %10d BASES %10d JOB %d\n", hashBeg, hashEnd, refBeg, refEnd, maxNumReads, maxLength, jobName); } refBeg = refEnd + 1; batchSize++; if (batchSize >= batchMax) { batchSize = 0; batchName++; } jobName++; } uint32 * loadReadLengths(gkStore *gkp, set<uint32> &libToHash, uint32 &hashMin, uint32 &hashMax, set<uint32> &libToRef, uint32 &refMin, uint32 &refMax) { uint32 numReads = gkp->gkStore_getNumReads(); uint32 numLibs = gkp->gkStore_getNumLibraries(); uint32 *readLen = new uint32 [numReads + 1]; bool testHash = false; bool testRef = false; if (libToHash.size() > 0) { testHash = true; hashMin = UINT32_MAX; hashMax = 0; } if (libToRef.size() > 0) { testRef = true; refMin = UINT32_MAX; refMax = 0; } bool *doHash = new bool [numLibs + 1]; bool *doRef = new bool [numLibs + 1]; for (uint32 i=0; i<=numLibs; i++) { doHash[i] = (libToHash.count(i) == 0) ? false : true; doRef[i] = (libToRef.count(i) == 0) ?
false : true; } fprintf(stderr, "Loading lengths of " F_U32 " fragments (" F_SIZE_T "mb)\n", numReads, (numReads * sizeof(uint32)) >> 20); memset(readLen, 0, sizeof(uint32) * (numReads + 1)); for (uint32 ii=1; ii<=numReads; ii++) { gkRead *read = gkp->gkStore_getRead(ii); if (read->gkRead_readID() != ii) fprintf(stderr, "ERROR: readID=%u != ii=%u\n", read->gkRead_readID(), ii); assert(read->gkRead_readID() == ii); readLen[ii] = read->gkRead_sequenceLength(); if ((testHash == true) && (doHash[read->gkRead_libraryID()] == true)) { if (ii < hashMin) hashMin = ii; if (hashMax < ii) hashMax = ii; } if ((testRef == true) && (doRef[read->gkRead_libraryID()] == true)) { if (ii < refMin) refMin = ii; if (refMax < ii) refMax = ii; } if ((ii % 1048576) == 0) fprintf(stderr, "Loading lengths at " F_U32 " out of " F_U32 ". H: " F_U32 "," F_U32 " R: " F_U32 "," F_U32 "\n", ii, numReads, hashMin, hashMax, refMin, refMax); } delete [] doHash; delete [] doRef; return(readLen); } void partitionFrags(gkStore *gkp, FILE *BAT, FILE *JOB, FILE *OPT, uint32 minOverlapLength, uint64 ovlHashBlockSize, uint64 ovlRefBlockLength, uint64 ovlRefBlockSize, set<uint32> &libToHash, set<uint32> &libToRef) { uint32 hashMin = 1; uint32 hashBeg = 1; uint32 hashEnd = 0; uint32 hashMax = UINT32_MAX; uint32 refMin = 1; uint32 refBeg = 1; uint32 refEnd = 0; uint32 refMax = UINT32_MAX; uint32 batchSize = 0; uint32 batchName = 1; uint32 jobName = 1; uint32 numReads = gkp->gkStore_getNumReads(); uint32 *readLen = NULL; if ((ovlRefBlockLength > 0) || (libToHash.size() > 0) || (libToRef.size() > 0)) readLen = loadReadLengths(gkp, libToHash, hashMin, hashMax, libToRef, refMin, refMax); if (hashMax > numReads) hashMax = numReads; if (refMax > numReads) refMax = numReads; fprintf(stderr, "Partitioning for hash: " F_U32 "-" F_U32 " ref: " F_U32 "," F_U32 "\n", hashMin, hashMax, refMin, refMax); hashBeg = hashMin; while (hashBeg < hashMax) { hashEnd = hashBeg + ovlHashBlockSize - 1; if (hashEnd > hashMax) hashEnd = hashMax; refBeg
= refMin; refEnd = 0; while ((refBeg < refMax) && ((refBeg < hashEnd) || (libToHash.size() != 0 && libToHash == libToRef))) { uint64 refLen = 0; if (ovlRefBlockLength > 0) { do { refEnd++; if (readLen[refEnd] < minOverlapLength) continue; refLen += readLen[refEnd]; } while ((refLen < ovlRefBlockLength) && (refEnd < refMax)); } else { refEnd = refBeg + ovlRefBlockSize - 1; } if (refEnd > refMax) refEnd = refMax; if ((refEnd > hashEnd) && (libToHash.size() == 0 || libToHash != libToRef)) refEnd = hashEnd; outputJob(BAT, JOB, OPT, hashBeg, hashEnd, refBeg, refEnd, 0, 0, batchSize, batchName, jobName); refBeg = refEnd + 1; } hashBeg = hashEnd + 1; } delete [] readLen; } void partitionLength(gkStore *gkp, FILE *BAT, FILE *JOB, FILE *OPT, uint32 minOverlapLength, uint64 ovlHashBlockLength, uint64 ovlRefBlockLength, uint64 ovlRefBlockSize, set<uint32> &libToHash, set<uint32> &libToRef) { uint32 hashMin = 1; uint32 hashBeg = 1; uint32 hashEnd = 0; uint32 hashMax = UINT32_MAX; uint32 refMin = 1; uint32 refBeg = 1; uint32 refEnd = 0; uint32 refMax = UINT32_MAX; uint32 batchSize = 0; uint32 batchName = 1; uint32 jobName = 1; uint32 numReads = gkp->gkStore_getNumReads(); uint32 *readLen = loadReadLengths(gkp, libToHash, hashMin, hashMax, libToRef, refMin, refMax); if (hashMax > numReads) hashMax = numReads; if (refMax > numReads) refMax = numReads; fprintf(stderr, "Partitioning for hash: " F_U32 "-" F_U32 " ref: " F_U32 "," F_U32 "\n", hashMin, hashMax, refMin, refMax); hashBeg = hashMin; hashEnd = hashMin - 1; while (hashBeg < hashMax) { uint64 hashLen = 0; assert(hashEnd == hashBeg - 1); // Non deleted reads contribute one byte per untrimmed base, and every fragment contributes one // more byte for the terminating zero. In canu, there are no deleted reads.
do { hashEnd++; if (readLen[hashEnd] < minOverlapLength) continue; hashLen += readLen[hashEnd] + 1; } while ((hashLen < ovlHashBlockLength) && (hashEnd < hashMax)); assert(hashEnd <= hashMax); refBeg = refMin; refEnd = 0; while ((refBeg < refMax) && ((refBeg < hashEnd) || (libToHash.size() != 0 && libToHash == libToRef))) { uint64 refLen = 0; if (ovlRefBlockLength > 0) { do { refEnd++; if (readLen[refEnd] < minOverlapLength) continue; refLen += readLen[refEnd]; } while ((refLen < ovlRefBlockLength) && (refEnd < refMax)); } else { refEnd = refBeg + ovlRefBlockSize - 1; } if (refEnd > refMax) refEnd = refMax; if ((refEnd > hashEnd) && (libToHash.size() == 0 || libToHash != libToRef)) refEnd = hashEnd; outputJob(BAT, JOB, OPT, hashBeg, hashEnd, refBeg, refEnd, hashEnd - hashBeg + 1, hashLen, batchSize, batchName, jobName); refBeg = refEnd + 1; } hashBeg = hashEnd + 1; } delete [] readLen; } FILE * openOutput(char *prefix, char *type) { char A[FILENAME_MAX]; snprintf(A, FILENAME_MAX, "%s.%s.WORKING", prefix, type); errno = 0; FILE *F = fopen(A, "w"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", A, strerror(errno)), exit(1); return(F); } void renameToFinal(char *prefix, char *type) { char A[FILENAME_MAX]; char B[FILENAME_MAX]; snprintf(A, FILENAME_MAX, "%s.%s.WORKING", prefix, type); snprintf(B, FILENAME_MAX, "%s.%s", prefix, type); rename(A, B); } int main(int argc, char **argv) { char *gkpStoreName = NULL; gkStore *gkpStore = NULL; char *outputPrefix = NULL; char outputName[FILENAME_MAX]; uint64 ovlHashBlockLength = 0; uint64 ovlHashBlockSize = 0; uint64 ovlRefBlockLength = 0; uint64 ovlRefBlockSize = 0; uint32 minOverlapLength = 0; bool checkAllLibUsed = true; set<uint32> libToHash; set<uint32> libToRef; AS_configure(argc, argv); int arg = 1; int err = 0; while (arg < argc) { if (strcmp(argv[arg], "-g") == 0) { gkpStoreName = argv[++arg]; } else if (strcmp(argv[arg], "-bl") == 0) { ovlHashBlockLength = strtoull(argv[++arg], NULL, 10); } else if (strcmp(argv[arg], "-bs")
== 0) { ovlHashBlockSize = strtoull(argv[++arg], NULL, 10); } else if (strcmp(argv[arg], "-rl") == 0) { ovlRefBlockLength = strtoull(argv[++arg], NULL, 10); } else if (strcmp(argv[arg], "-rs") == 0) { ovlRefBlockSize = strtoull(argv[++arg], NULL, 10); } else if (strcmp(argv[arg], "-ol") == 0) { minOverlapLength = strtoull(argv[++arg], NULL, 10); } else if (strcmp(argv[arg], "-H") == 0) { AS_UTL_decodeRange(argv[++arg], libToHash); } else if (strcmp(argv[arg], "-R") == 0) { AS_UTL_decodeRange(argv[++arg], libToRef); } else if (strcmp(argv[arg], "-C") == 0) { checkAllLibUsed = false; } else if (strcmp(argv[arg], "-o") == 0) { outputPrefix = argv[++arg]; } else { fprintf(stderr, "ERROR: Unknown option '%s'\n", argv[arg]); err++; } arg++; } if (err) { fprintf(stderr, "usage: %s [opts]\n", argv[0]); exit(1); } if ((ovlHashBlockLength > 0) && (ovlHashBlockSize > 0)) fprintf(stderr, "ERROR: At most one of -bl and -bs can be non-zero.\n"), exit(1); if ((ovlRefBlockLength > 0) && (ovlRefBlockSize > 0)) fprintf(stderr, "ERROR: At most one of -rl and -rs can be non-zero.\n"), exit(1); fprintf(stderr, "HASH: " F_U64 " reads or " F_U64 " length.\n", ovlHashBlockSize, ovlHashBlockLength); fprintf(stderr, "REF: " F_U64 " reads or " F_U64 " length.\n", ovlRefBlockSize, ovlRefBlockLength); gkStore *gkp = gkStore::gkStore_open(gkpStoreName); uint32 numLibs = gkp->gkStore_getNumLibraries(); uint32 invalidLibs = 0; for (set<uint32>::iterator it=libToHash.begin(); it != libToHash.end(); it++) if (numLibs < *it) fprintf(stderr, "ERROR: -H " F_U32 " is invalid; only " F_U32 " libraries in '%s'\n", *it, numLibs, gkpStoreName), invalidLibs++; for (set<uint32>::iterator it=libToRef.begin(); it != libToRef.end(); it++) if (numLibs < *it) fprintf(stderr, "ERROR: -R " F_U32 " is invalid; only " F_U32 " libraries in '%s'\n", *it, numLibs, gkpStoreName), invalidLibs++; if ((libToHash.size() > 0) && (libToRef.size() > 0)) { for (uint32 lib=1; lib<=numLibs; lib++) { if ((libToHash.find(lib) == libToHash.end()) &&
(libToRef.find(lib) == libToRef.end())) { if (checkAllLibUsed == true) fprintf(stderr, "ERROR: library " F_U32 " is not mentioned in either -H or -R.\n", lib), invalidLibs++; else fprintf(stderr, "Warning: library " F_U32 " is not mentioned in either -H or -R.\n", lib); } } } if (invalidLibs > 0) fprintf(stderr, "ERROR: one of -H and/or -R are invalid.\n"), exit(1); FILE *BAT = openOutput(outputPrefix, "ovlbat"); FILE *JOB = openOutput(outputPrefix, "ovljob"); FILE *OPT = openOutput(outputPrefix, "ovlopt"); if (ovlHashBlockLength == 0) partitionFrags(gkp, BAT, JOB, OPT, minOverlapLength, ovlHashBlockSize, ovlRefBlockLength, ovlRefBlockSize, libToHash, libToRef); else partitionLength(gkp, BAT, JOB, OPT, minOverlapLength, ovlHashBlockLength, ovlRefBlockLength, ovlRefBlockSize, libToHash, libToRef); fclose(BAT); fclose(JOB); fclose(OPT); renameToFinal(outputPrefix, "ovlbat"); renameToFinal(outputPrefix, "ovljob"); renameToFinal(outputPrefix, "ovlopt"); gkp->gkStore_close(); exit(0); } canu-1.6/src/overlapInCore/overlapInCorePartition.mk000066400000000000000000000010021314437614700226430ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := overlapInCorePartition SOURCES := overlapInCorePartition.C SRC_INCDIRS := .. ../AS_UTL ../stores liboverlap TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/overlapInCore/overlapPair.C000066400000000000000000000665341314437614700202650ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-MAR-27 to 2015-JUL-20 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-FEB-12 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include <pthread.h> #include "gkStore.H" #include "ovStore.H" #include "edlib.H" #include "overlapReadCache.H" #include "AS_UTL_reverseComplement.H" #include "timeAndSize.H" // getTime(); // The process will load BATCH_SIZE overlaps into memory, then load all the reads referenced by // those overlaps. Once all data is loaded, compute threads are spawned. Each thread will reserve // THREAD_SIZE overlaps to compute. A small THREAD_SIZE relative to BATCH_SIZE will result in // better load balancing, but too small and the overhead of reserving overlaps will dominate (too // small is on the order of 1). While threads are computing, the next batch of overlaps and reads // is loaded. // // A large BATCH_SIZE will make startup cost large - no computes are started until the initial load // is finished. To alleviate this (a little bit), the initial load is only 1/8 of the full // BATCH_SIZE. #define BATCH_SIZE 1024 * 1024 #define THREAD_SIZE 128 // Does slightly better with 2550 than 500. Speed takes a slight hit.
#define MHAP_SLOP 500 class alignStats { public: alignStats() { startTime = getTime(); reportThreshold = 0; clear(); }; ~alignStats() { }; void clear(void) { nSkipped = 0; nPassed = 0; nFailed = 0; nFailExtA = 0; nFailExtB = 0; nFailExt = 0; nExtendedA = 0; nExtendedB = 0; nPartial = 0; nDovetail = 0; nExt5a = 0; nExt3a = 0; nExt5b = 0; nExt3b = 0; }; alignStats &operator+=(alignStats &that) { nSkipped += that.nSkipped; nPassed += that.nPassed; nFailed += that.nFailed; nFailExtA += that.nFailExtA; nFailExtB += that.nFailExtB; nFailExt += that.nFailExt; nExtendedA += that.nExtendedA; nExtendedB += that.nExtendedB; nPartial += that.nPartial; nDovetail += that.nDovetail; nExt5a += that.nExt5a; nExt3a += that.nExt3a; nExt5b += that.nExt5b; nExt3b += that.nExt3b; return(*this); }; void reportStatus(void) { if (nPassed + nFailed < reportThreshold) return; reportThreshold += 10000; fprintf(stderr, "Tested %9lu olaps -- Skipped %8.4f%% -- Passed %8.4f%% -- %8.2f olaps/sec\n", nPassed + nFailed, 100.0 * nSkipped / (nPassed + nFailed), 100.0 * nPassed / (nPassed + nFailed), (nPassed + nFailed) / (getTime() - startTime)); }; void reportFinal(void) { fprintf(stderr, "\n"); fprintf(stderr, " -- %lu overlaps processed.\n", nPassed + nFailed); fprintf(stderr, " -- %lu skipped.\n", nSkipped); fprintf(stderr, " -- %lu failed %lu passed (%.4f%%).\n", nFailed, nPassed, 100.0 * nPassed / (nPassed + nFailed)); fprintf(stderr, " --\n"); fprintf(stderr, " -- %lu failed initial alignment, allowing A to extend\n", nFailExtA); fprintf(stderr, " -- %lu failed initial alignment, allowing B to extend\n", nFailExtB); fprintf(stderr, " -- %lu failed initial alignment\n", nFailExt); fprintf(stderr, " --\n"); fprintf(stderr, " -- %lu partial overlaps (of any quality)\n", nPartial); fprintf(stderr, " -- %lu dovetail overlaps (before extensions, of any quality)\n", nDovetail); fprintf(stderr, " --\n"); fprintf(stderr, " -- %lu/%lu A read dovetail extensions\n", nExt5a, nExt3a); fprintf(stderr, " -- 
%lu/%lu B read dovetail extensions\n", nExt5b, nExt3b); }; double startTime; uint64 reportThreshold; uint64 nSkipped; uint64 nPassed; uint64 nFailed; uint64 nFailExtA; uint64 nFailExtB; uint64 nFailExt; uint64 nExtendedA; uint64 nExtendedB; uint64 nPartial; uint64 nDovetail; uint64 nExt5a; uint64 nExt3a; uint64 nExt5b; uint64 nExt3b; }; class workSpace { public: workSpace() { threadID = 0; maxErate = 0; partialOverlaps = false; invertOverlaps = false; gkpStore = NULL; overlapsLen = 0; overlaps = NULL; readSeq = NULL; }; ~workSpace() { delete[] readSeq; }; public: uint32 threadID; double maxErate; bool partialOverlaps; bool invertOverlaps; char* readSeq; gkStore *gkpStore; uint32 overlapsLen; // Not used. ovOverlap *overlaps; }; overlapReadCache *rcache = NULL; // Used to be just 'cache', but that conflicted with -pg: /usr/lib/libc_p.a(msgcat.po):(.bss+0x0): multiple definition of `cache' uint32 batchPrtID = 0; // When to report progress uint32 batchPosID = 0; // The current position of the batch uint32 batchEndID = 0; // The end of the batch pthread_mutex_t balanceMutex; uint32 minOverlapLength = 0; alignStats globalStats; bool debug = false; bool getRange(uint32 &bgnID, uint32 &endID) { pthread_mutex_lock(&balanceMutex); bgnID = batchPosID; batchPosID += THREAD_SIZE; // Supposed to overflow. endID = batchPosID; if (endID > batchEndID) endID = batchEndID; pthread_mutex_unlock(&balanceMutex); // If we're out of overlaps, batchPosID is more than batchEndID (from the last call to this // function), which makes bgnID > endID (in this call). return(bgnID < endID); } // Try to extend the overlap on the B read. If successful, returns new bbgn,bend and editDist and alignLen. 
// bool extendAlignment(char *aRead, int32 abgn, int32 aend, int32 UNUSED(alen), char *Alabel, uint32 Aid, char *bRead, int32 &bbgn, int32 &bend, int32 blen, char *Blabel, uint32 Bid, double maxErate, int32 slop, int32 &editDist, int32 &alignLen) { alignStats threadStats; EdlibAlignResult result = { 0, NULL, NULL, 0, NULL, 0, 0 }; bool success = false; // Find an alignment, allowing extensions on the B read. int32 bbgnExt = max(0, bbgn - slop); int32 bendExt = min(blen, bend + slop); // This probably isn't exactly correct, but close enough. int32 maxEdit = (int32)ceil(max(aend - abgn, bendExt - bbgnExt) * maxErate * 1.1); if (debug) fprintf(stderr, " align %s %6u %6d-%-6d to %s %6u %6d-%-6d", Alabel, Aid, abgn, aend, Blabel, Bid, bbgnExt, bendExt); result = edlibAlign(aRead + abgn, aend - abgn, bRead + bbgnExt, bendExt - bbgnExt, edlibNewAlignConfig(maxEdit, EDLIB_MODE_HW, EDLIB_TASK_LOC)); // Change the overlap for any extension found. if (result.numLocations > 0) { bbgn = bbgnExt + result.startLocations[0]; bend = bbgnExt + result.endLocations[0] + 1; // Edlib returns 0-based positions, add one to end to get space-based. editDist = result.editDistance; alignLen = result.alignmentLength; // Edlib 'alignmentLength' isn't populated for TASK_LOC, so we approximate it. 
alignLen = ((aend - abgn) + (bend - bbgn) + (editDist)) / 2; if (debug) fprintf(stderr, " aligned to %s at %6d-%-6d editDist %5d alignLen %6d qual %6.4f\n", Blabel, bbgn, bend, editDist, alignLen, 1.0 - editDist / (double)alignLen); success = true; } else { if (debug) fprintf(stderr, "\n"); } edlibFreeAlignResult(result); return(success); } bool finalAlignment(char *aRead, int32 alen,// char *Alabel, uint32 Aid, char *bRead, int32 blen,// char *Blabel, uint32 Bid, ovOverlap *ovl, double maxErate, int32 &editDist, int32 &alignLen) { EdlibAlignResult result = { 0, NULL, NULL, 0, NULL, 0, 0 }; bool success = false; int32 abgn = (int32) ovl->dat.ovl.ahg5; int32 aend = (int32)alen - ovl->dat.ovl.ahg3; int32 bbgn = (int32) ovl->dat.ovl.bhg5; int32 bend = (int32)blen - ovl->dat.ovl.bhg3; int32 maxEdit = (int32)ceil(max(aend - abgn, bend - bbgn) * maxErate * 1.1); result = edlibAlign(aRead + abgn, aend - abgn, bRead + bbgn, bend - bbgn, edlibNewAlignConfig(maxEdit, EDLIB_MODE_NW, EDLIB_TASK_LOC)); // NOTE! Global alignment. if (result.numLocations > 0) { editDist = result.editDistance; alignLen = result.alignmentLength; // Edlib 'alignmentLength' isn't populated for TASK_LOC, so we approximate it. alignLen = ((aend - abgn) + (bend - bbgn) + (editDist)) / 2; success = true; } else { } edlibFreeAlignResult(result); return(success); } void * recomputeOverlaps(void *ptr) { workSpace *WA = (workSpace *)ptr; uint32 bgnID = 0; uint32 endID = 0; while (getRange(bgnID, endID)) { alignStats localStats; for (uint32 oo=bgnID; oo<endID; oo++) { ovOverlap *ovl = WA->overlaps + oo; // Swap IDs if requested (why would anyone want to do this?) if (WA->invertOverlaps) { ovOverlap swapped = WA->overlaps[oo]; WA->overlaps[oo].swapIDs(swapped); // Needs to be from a temporary! } // Initialize early, just so we can use goto.
uint32 aID = ovl->a_iid; char *aRead = rcache->getRead(aID); int32 alen = (int32)rcache->getLength(aID); int32 abgn = (int32) ovl->dat.ovl.ahg5; int32 aend = (int32)alen - ovl->dat.ovl.ahg3; uint32 bID = ovl->b_iid; char *bRead = WA->readSeq; int32 blen = (int32)rcache->getLength(bID); int32 bbgn = (int32) ovl->dat.ovl.bhg5; int32 bend = (int32)blen - ovl->dat.ovl.bhg3; int32 alignLen = 1; int32 editDist = INT32_MAX; EdlibAlignResult result = { 0, NULL, NULL, 0, NULL, 0, 0 }; if (debug) { fprintf(stderr, "--------\n"); fprintf(stderr, "OLAP A %7u %6d-%-6d\n", aID, abgn, aend); fprintf(stderr, " B %7u %6d-%-6d %s\n", bID, bbgn, bend, (ovl->flipped() == false) ? "" : " flipped"); fprintf(stderr, "\n"); } // Invalidate the overlap. ovl->evalue(AS_MAX_EVALUE); ovl->dat.ovl.forOBT = false; ovl->dat.ovl.forDUP = false; ovl->dat.ovl.forUTG = false; // Make some bad changes, for testing #if 0 abgn += 100; aend -= 100; bbgn += 100; bend -= 100; #endif // Too short? Don't bother doing anything. // // Warning! Edlib failed on a 10bp to 10bp (extended to 5kbp) alignment. if ((aend - abgn < minOverlapLength) || (bend - bbgn < minOverlapLength)) { localStats.nSkipped++; goto finished; } // Grab the B read sequence. strcpy(bRead, rcache->getRead(bID)); // If flipped, reverse complement the B read. if (ovl->flipped() == true) reverseComplementSequence(bRead, blen); // // Find initial alignments, allowing one, then the other, sequence to be extended as needed. // if (extendAlignment(bRead, bbgn, bend, blen, "B", bID, aRead, abgn, aend, alen, "A", aID, WA->maxErate, MHAP_SLOP, editDist, alignLen) == false) { localStats.nFailExtA++; } if (extendAlignment(aRead, abgn, aend, alen, "A", aID, bRead, bbgn, bend, blen, "B", bID, WA->maxErate, MHAP_SLOP, editDist, alignLen) == false) { localStats.nFailExtB++; } // If no alignments were found, fail. if (alignLen == 1) { localStats.nFailExt++; goto finished; } // Update the overlap. 
ovl->dat.ovl.ahg5 = abgn; ovl->dat.ovl.ahg3 = alen - aend; ovl->dat.ovl.bhg5 = bbgn; ovl->dat.ovl.bhg3 = blen - bend; if (debug) { fprintf(stderr, "\n"); fprintf(stderr, "init A %7u %6d-%-6d\n", aID, abgn, aend); fprintf(stderr, " B %7u %6d-%-6d\n", bID, bbgn, bend); fprintf(stderr, "\n"); } // If we're just doing partial alignments or if we've found a dovetail, we're all done. if (WA->partialOverlaps == true) { localStats.nPartial++; goto finished; } if (ovl->overlapIsDovetail() == true) { localStats.nDovetail++; goto finished; } #warning do we need to check for contained too? // Otherwise, try to extend the alignment to make a dovetail overlap. { int32 ahg5 = ovl->dat.ovl.ahg5; int32 ahg3 = ovl->dat.ovl.ahg3; int32 bhg5 = ovl->dat.ovl.bhg5; int32 bhg3 = ovl->dat.ovl.bhg3; int32 slop = 0; if ((ahg5 >= bhg5) && (bhg5 > 0)) { //fprintf(stderr, "extend 5' by B=%d\n", bhg5); ahg5 -= bhg5; bhg5 -= bhg5; // Now zero. slop = bhg5 * WA->maxErate + 100; abgn = (int32) ahg5; aend = (int32)alen - ahg3; bbgn = (int32) bhg5; bend = (int32)blen - bhg3; if (extendAlignment(bRead, bbgn, bend, blen, "Bb5", bID, aRead, abgn, aend, alen, "Ab5", aID, WA->maxErate, slop, editDist, alignLen) == true) { ahg5 = abgn; //ahg3 = alen - aend; } else { ahg5 = ovl->dat.ovl.ahg5; bhg5 = ovl->dat.ovl.bhg5; } localStats.nExt5b++; } if ((bhg5 >= ahg5) && (ahg5 > 0)) { //fprintf(stderr, "extend 5' by A=%d\n", ahg5); bhg5 -= ahg5; ahg5 -= ahg5; // Now zero. slop = ahg5 * WA->maxErate + 100; abgn = (int32) ahg5; aend = (int32)alen - ahg3; bbgn = (int32) bhg5; bend = (int32)blen - bhg3; if (extendAlignment(aRead, abgn, aend, alen, "Aa5", aID, bRead, bbgn, bend, blen, "Ba5", bID, WA->maxErate, slop, editDist, alignLen) == true) { bhg5 = bbgn; //bhg3 = blen - bend; } else { bhg5 = ovl->dat.ovl.bhg5; ahg5 = ovl->dat.ovl.ahg5; } localStats.nExt5a++; } if ((bhg3 >= ahg3) && (ahg3 > 0)) { //fprintf(stderr, "extend 3' by A=%d\n", ahg3); bhg3 -= ahg3; ahg3 -= ahg3; // Now zero. 
slop = ahg3 * WA->maxErate + 100; abgn = (int32) ahg5; aend = (int32)alen - ahg3; bbgn = (int32) bhg5; bend = (int32)blen - bhg3; if (extendAlignment(aRead, abgn, aend, alen, "Aa3", aID, bRead, bbgn, bend, blen, "Ba3", bID, WA->maxErate, slop, editDist, alignLen) == true) { //bhg5 = bbgn; bhg3 = blen - bend; } else { bhg3 = ovl->dat.ovl.bhg3; ahg3 = ovl->dat.ovl.ahg3; } localStats.nExt3a++; } if ((ahg3 >= bhg3) && (bhg3 > 0)) { //fprintf(stderr, "extend 3' by B=%d\n", bhg3); ahg3 -= bhg3; bhg3 -= bhg3; // Now zero. slop = bhg3 * WA->maxErate + 100; abgn = (int32) ahg5; aend = (int32)alen - ahg3; bbgn = (int32) bhg5; bend = (int32)blen - bhg3; if (extendAlignment(bRead, bbgn, bend, blen, "Bb3", bID, aRead, abgn, aend, alen, "Ab3", aID, WA->maxErate, slop, editDist, alignLen) == true) { //ahg5 = abgn; ahg3 = alen - aend; } else { ahg3 = ovl->dat.ovl.ahg3; bhg3 = ovl->dat.ovl.bhg3; } localStats.nExt3b++; } // Now reset the overlap. ovl->dat.ovl.ahg5 = ahg5; ovl->dat.ovl.ahg3 = ahg3; ovl->dat.ovl.bhg5 = bhg5; ovl->dat.ovl.bhg3 = bhg3; } // If not a contained overlap // If we're still not dovetail, nothing more we want to do. Let the overlap be trashed. if (debug) { fprintf(stderr, "\n"); fprintf(stderr, "fini A %7u %6d-%-6d %d %d\n", aID, abgn, aend, ovl->a_bgn(), ovl->a_end()); fprintf(stderr, " B %7u %6d-%-6d %d %d %s\n", bID, bbgn, bend, ovl->b_bgn(), ovl->b_end(), (ovl->flipped() == false) ? "" : " flipped"); fprintf(stderr, "\n"); } finalAlignment(aRead, alen,// "A", aID, bRead, blen,// "B", bID, ovl, WA->maxErate, editDist, alignLen); finished: // Trash the overlap if it's junky quality. 
double eRate = editDist / (double)alignLen; if ((alignLen < minOverlapLength) || (eRate > WA->maxErate)) { localStats.nFailed++; ovl->evalue(AS_MAX_EVALUE); ovl->dat.ovl.forOBT = false; ovl->dat.ovl.forDUP = false; ovl->dat.ovl.forUTG = false; } else { localStats.nPassed++; ovl->erate(eRate); ovl->dat.ovl.forOBT = (WA->partialOverlaps == true); ovl->dat.ovl.forDUP = (WA->partialOverlaps == true); ovl->dat.ovl.forUTG = (WA->partialOverlaps == false) && (ovl->overlapIsDovetail() == true); } } // Over all overlaps in this range // Log that we've done stuff pthread_mutex_lock(&balanceMutex); globalStats += localStats; globalStats.reportStatus(); localStats.clear(); pthread_mutex_unlock(&balanceMutex); } // Over all ranges return(NULL); } int main(int argc, char **argv) { char *gkpName = NULL; char *ovlName = NULL; char *outName = NULL; uint32 bgnID = 0; uint32 endID = UINT32_MAX; uint32 numThreads = 1; double maxErate = 0.12; bool partialOverlaps = false; bool invertOverlaps = false; uint64 memLimit = 4; argc = AS_configure(argc, argv); int err=0; int arg=1; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-O") == 0) { ovlName = argv[++arg]; } else if (strcmp(argv[arg], "-o") == 0) { outName = argv[++arg]; } else if (strcmp(argv[arg], "-b") == 0) { bgnID = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-e") == 0) { endID = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-t") == 0) { numThreads = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-erate") == 0) { maxErate = atof(argv[++arg]); } else if (strcmp(argv[arg], "-partial") == 0) { partialOverlaps = true; } else if (strcmp(argv[arg], "-invert") == 0) { invertOverlaps = true; } else if (strcmp(argv[arg], "-memory") == 0) { memLimit = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-len") == 0) { minOverlapLength = atoi(argv[++arg]); } else { err++; } arg++; } if (gkpName == NULL) err++; if (ovlName == NULL) err++; if (outName == NULL) err++; if (err) 
{ fprintf(stderr, "usage: %s ...\n", argv[0]); fprintf(stderr, " -G gkpStore Mandatory, path to gkpStore\n"); fprintf(stderr, "\n"); fprintf(stderr, "Inputs can come from either a store or a file.\n"); fprintf(stderr, " -O ovlStore \n"); fprintf(stderr, " -O ovlFile \n"); fprintf(stderr, "\n"); fprintf(stderr, "If from an ovlStore, the range of reads processed can be restricted.\n"); fprintf(stderr, " -b bgnID \n"); fprintf(stderr, " -e endID \n"); fprintf(stderr, "\n"); fprintf(stderr, "Outputs will be written to a store or file, depending on the input type\n"); fprintf(stderr, " -o ovlStore \n"); fprintf(stderr, " -o ovlFile \n"); fprintf(stderr, "\n"); fprintf(stderr, " -erate e Overlaps are computed at 'e' fraction error; must be larger than the original erate\n"); fprintf(stderr, " -partial Overlaps are 'overlapInCore -G' partial overlaps\n"); fprintf(stderr, " -memory m Use up to 'm' GB of memory\n"); fprintf(stderr, "\n"); fprintf(stderr, " -t n Use up to 'n' cores\n"); fprintf(stderr, "\n"); fprintf(stderr, "Advanced options:\n"); fprintf(stderr, "\n"); fprintf(stderr, " -invert Invert the overlap A <-> B before aligning (they are not re-inverted before output)\n"); fprintf(stderr, "\n"); exit(1); } gkStore *gkpStore = gkStore::gkStore_open(gkpName); ovStore *ovlStore = NULL; ovStoreWriter *outStore = NULL; ovFile *ovlFile = NULL; ovFile *outFile = NULL; if (AS_UTL_fileExists(ovlName, true)) { fprintf(stderr, "Reading overlaps from store '%s' and writing to '%s'\n", ovlName, outName); ovlStore = new ovStore(ovlName, gkpStore); outStore = new ovStoreWriter(outName, gkpStore); if (bgnID < 1) bgnID = 1; if (endID > gkpStore->gkStore_getNumReads()) endID = gkpStore->gkStore_getNumReads(); ovlStore->setRange(bgnID, endID); } else { fprintf(stderr, "Reading overlaps from file '%s' and writing to '%s'\n", ovlName, outName); ovlFile = new ovFile(gkpStore, ovlName, ovFileFull); outFile = new ovFile(gkpStore, outName, ovFileFullWrite); } workSpace *WA = new workSpace 
[numThreads]; pthread_t *tID = new pthread_t [numThreads]; pthread_attr_t attr; pthread_attr_init(&attr); pthread_attr_setstacksize(&attr, 12 * 131072); pthread_mutex_init(&balanceMutex, NULL); // Initialize thread work areas. Mirrored from overlapInCore.C for (uint32 tt=0; ttreadOverlaps(overlaps, overlapsMax, false); if (ovlFile) *overlapsLen = ovlFile->readOverlaps(overlaps, overlapsMax); overlapsMax *= 8; // Back to the normal batch size. fprintf(stderr, "Loaded %u overlaps.\n", *overlapsLen); rcache->loadReads(overlaps, *overlapsLen); // Loop over all the overlaps. while (overlapsALen + overlapsBLen > 0) { // Launch next batch of threads //fprintf(stderr, "LAUNCH THREADS\n"); // Globals, ugh. These limit the threads to the range of overlaps we have loaded. Each thread // will pull out THREAD_SIZE overlaps at a time to compute, updating batchPosID as it does so. // Each thread will stop when batchPosID > batchEndID. batchPrtID = 0; batchPosID = 0; batchEndID = *overlapsLen; for (uint32 tt=0; ttwriteOverlap(overlaps + oo); if (ovlFile) outFile->writeOverlaps(overlaps, *overlapsLen); // Load more overlaps if (ovlStore) *overlapsLen = ovlStore->readOverlaps(overlaps, overlapsMax, false); if (ovlFile) *overlapsLen = ovlFile->readOverlaps(overlaps, overlapsMax); fprintf(stderr, "Loaded %u overlaps.\n", *overlapsLen); rcache->loadReads(overlaps, *overlapsLen); // Wait for threads to finish for (uint32 tt=0; ttpurgeReads(); } // Report. The last batch has no work to do. globalStats.reportFinal(); // Goodbye. 
delete rcache; gkpStore->gkStore_close(); delete ovlStore; delete outStore; delete ovlFile; delete outFile; delete [] overlapsA; delete [] overlapsB; delete [] WA; delete [] tID; fprintf(stderr, "\n"); fprintf(stderr, "Bye.\n"); return(0); } canu-1.6/src/overlapInCore/overlapPair.mk000066400000000000000000000010171314437614700204730ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := overlapPair SOURCES := overlapPair.C SRC_INCDIRS := .. ../AS_UTL ../stores ../meryl/libleaff libedlib TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lleaff -lcanu TGT_PREREQS := libleaff.a libcanu.a SUBMAKEFILES := canu-1.6/src/overlapInCore/overlapReadCache.C000066400000000000000000000117541314437614700211630ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-JUN-25 to 2015-JUL-01 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "overlapReadCache.H" #include <set> #include #include using namespace std; overlapReadCache::overlapReadCache(gkStore *gkpStore_, uint64 memLimit) { gkpStore = gkpStore_; nReads = gkpStore->gkStore_getNumReads(); readAge = new uint32 [nReads + 1]; readLen = new uint32 [nReads + 1]; memset(readAge, 0, sizeof(uint32) * (nReads + 1)); memset(readLen, 0, sizeof(uint32) * (nReads + 1)); readSeqFwd = new char * [nReads + 1]; //readSeqRev = new char * [nReads + 1]; memset(readSeqFwd, 0, sizeof(char *) * (nReads + 1)); //memset(readSeqRev, 0, sizeof(char *) * (nReads + 1)); memoryLimit = memLimit * 1024 * 1024 * 1024; } overlapReadCache::~overlapReadCache() { delete [] readAge; delete [] readLen; for (uint32 rr=0; rr<=nReads; rr++) { delete [] readSeqFwd[rr]; //delete [] readSeqRev[rr]; } delete [] readSeqFwd; //delete [] readSeqRev; } void overlapReadCache::loadRead(uint32 id) { gkRead *read = gkpStore->gkStore_getRead(id); gkpStore->gkStore_loadReadData(read, &readdata); readLen[id] = read->gkRead_sequenceLength(); readSeqFwd[id] = new char [readLen[id] + 1]; //readSeqRev[id] = new char [readLen[id] + 1]; memcpy(readSeqFwd[id], readdata.gkReadData_getSequence(), sizeof(char) * readLen[id]); readSeqFwd[id][readLen[id]] = 0; } // Make sure that the reads in 'reads' are in the cache. // Ideally, these are just the reads we need to load. void overlapReadCache::loadReads(set<uint32> reads) { uint32 nn = 0; uint32 nc = reads.size() / 25; // For each read in the input set, load it. //if (reads.size() > 0) // fprintf(stderr, "loadReads()-- Need to load %u reads.\n", reads.size()); for (set<uint32>::iterator it=reads.begin(); it != reads.end(); ++it) { //if ((++nn % nc) == 0) // fprintf(stderr, "loadReads()-- %6.2f%% finished.\n", 100.0 * nn / reads.size()); if (readLen[*it] != 0) continue; loadRead(*it); } //fprintf(stderr, "loadReads()-- %6.2f%% finished.\n", 100.0); // Age all the reads in the cache.
uint32 nLoaded = 0; for (uint32 id=0; id<=nReads; id++) { if (readLen[id] > 0) nLoaded++; readAge[id]++; } //fprintf(stderr, "loadReads()-- loaded %u -- %u in cache\n", reads.size(), nLoaded); } void overlapReadCache::markForLoading(set<uint32> &reads, uint32 id) { // Note that it was just used. readAge[id] = 0; // Already loaded? Done! if (readLen[id] != 0) return; // Already pending? Done! if (reads.count(id) != 0) return; // Mark it for loading. reads.insert(id); } void overlapReadCache::loadReads(ovOverlap *ovl, uint32 nOvl) { set<uint32> reads; for (uint32 oo=0; oo<nOvl; oo++) { markForLoading(reads, ovl[oo].a_iid); markForLoading(reads, ovl[oo].b_iid); } loadReads(reads); } void overlapReadCache::loadReads(tgTig *tig) { set<uint32> reads; markForLoading(reads, tig->tigID()); for (uint32 oo=0; oo<tig->numberOfChildren(); oo++) if (tig->getChild(oo)->isRead() == true) markForLoading(reads, tig->getChild(oo)->ident()); loadReads(reads); } void overlapReadCache::purgeReads(void) { uint32 maxAge = 0; uint64 memoryUsed = 0; // Find maxAge, and sum memory used for (uint32 rr=0; rr<=nReads; rr++) { if (maxAge < readAge[rr]) maxAge = readAge[rr]; memoryUsed += readLen[rr]; } // Purge oldest until memory is below watermark while ((memoryLimit < memoryUsed) && (maxAge > 1)) { fprintf(stderr, "purgeReads()-- used " F_U64 "MB limit " F_U64 "MB -- purge age " F_U32 "\n", memoryUsed >> 20, memoryLimit >> 20, maxAge); for (uint32 rr=0; rr<=nReads; rr++) { if (maxAge == readAge[rr]) { memoryUsed -= readLen[rr]; delete [] readSeqFwd[rr]; readSeqFwd[rr] = NULL; //delete [] readSeqRev[rr]; readSeqRev[rr] = NULL; readLen[rr] = 0; readAge[rr] = 0; } } maxAge--; } } canu-1.6/src/overlapInCore/overlapReadCache.H /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/overlapInCore/overlapPair-readCache.H * * Modifications by: * * Brian P. Walenz from 2015-JUN-16 to 2015-JUN-23 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "ovStore.H" #include "tgStore.H" class overlapReadCache { public: overlapReadCache(gkStore *gkpStore_, uint64 memLimit); ~overlapReadCache(); private: void loadRead(uint32 id); void loadReads(set<uint32> reads); void markForLoading(set<uint32> &reads, uint32 id); public: void loadReads(ovOverlap *ovl, uint32 nOvl); void loadReads(tgTig *tig); void purgeReads(void); char *getRead(uint32 id) { assert(readLen[id] > 0); return(readSeqFwd[id]); }; uint32 getLength(uint32 id) { assert(readLen[id] > 0); return(readLen[id]); }; private: gkStore *gkpStore; uint32 nReads; uint32 *readAge; uint32 *readLen; char **readSeqFwd; //char **readSeqRev; // Save it, or recompute? gkReadData readdata; uint64 memoryLimit; }; canu-1.6/src/pipelines/ canu-1.6/src/pipelines/bogart-sweep.pl #!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs.
# # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. Walenz beginning on 2016-MAR-10 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## use strict; my $wrk = "/work/canuassemblies/sent"; my $asm = "test"; system("mkdir -p $wrk") if (! -d $wrk); my $gs = "5000000"; my $b = 6000; my $m = 4; my $t = 1; my $d = "all"; my (@EG, @EB, @EM, @ER, @OL, @RS, @NS, @CS); @EG = ( "0.0100", "0.0200", "0.0300", "0.0400", "0.0500", "0.0600", "0.0700", "0.0800", "0.0900", "0.1000" ); @EB = ( "-0.0050", "+0.0050" ); @EM = ( "-0.0050", "+0.0050" ); @ER = ( "-0.0050", "+0.0050" ); @OL = ( "100", "500", "1000", "2500", "5000", "7500", "10000" ); @RS = ( "-no", "-RS" ); @NS = ( "-no", "-NS" ); @CS = ( "-no", "-CS" ); @EG = ( "0.0400", "0.0500", "0.0600" ); @EB = ( "+0.0125" ); @EM = ( "-0.0125" ); @ER = ( "+0.0000" ); @OL = ( "50", "500", "1000", "2000", "3000", "4000", "5000", "6000", "7000", "8000", "9000", "10000", "11000" ); @RS = ( "-RS" ); @NS = ( "-NS" ); @CS = ( "-CS" ); @EG = ( "0.0500" ); @EB = ( "+0.0125" ); @EM = ( "-0.0125" ); @ER = ( "+0.0000" ); @OL = ( "4100", "4200", "4300", "4400", "4500", "4600", "4700", "4800", "4900", "5100", "5200", "5300", "5400", "5500", "5600", "5700", "5800", "5900" ); @RS = ( "-RS" ); @NS = ( "-NS" ); @CS = ( "-CS" ); undef @OL; for (my $ii=1100; $ii<1200; $ii += 1) { push @OL, $ii; } @OL = ( "1135", "1136", "1137", "1138" ); #-unassembled 2 1000 0.75 0.75 2 -repeatdetect 6 11 15 -threads 1 -D most foreach my $eg (@EG) { foreach my $eb 
(@EB) { foreach my $em (@EM) { foreach my $er (@ER) { foreach my $ol (@OL) { foreach my $rs (@RS) { foreach my $ns (@NS) { foreach my $cs (@CS) { my ($egl, $ebl, $eml, $erl, $oll) = ($eg, $eb, $em, $er, $ol); $ebl = $egl + $1 if ($eb =~ m/^\+(\d+.\d+)/); $ebl = $egl - $1 if ($eb =~ m/^-(\d+.\d+)/); $eml = $egl + $1 if ($em =~ m/^\+(\d+.\d+)/); $eml = $egl - $1 if ($em =~ m/^-(\d+.\d+)/); $erl = $egl + $1 if ($er =~ m/^\+(\d+.\d+)/); $erl = $egl - $1 if ($er =~ m/^-(\d+.\d+)/); $egl = sprintf("%6.4f", $egl); $ebl = sprintf("%6.4f", $ebl); $eml = sprintf("%6.4f", $eml); $erl = sprintf("%6.4f", $erl); $oll = sprintf("%05d", $ol); my $path = "test-eg$egl-eb$ebl-em$eml-er$erl-ol$oll$rs$ns$cs"; print "$path\n"; system("mkdir -p $path") if (! -d $path); open(F, "> $wrk/$path/bogart.sh") or die "can't open '$wrk/$path/bogart.sh' for writing: $!\n"; print F "#!/bin/sh\n"; print F "\n"; print F "cd $wrk/$path\n"; print F "\n"; print F "if [ ! -e test.tigStore ] ; then\n"; print F " /work/canu/FreeBSD-amd64/bin/bogart \\\n"; print F " -G $wrk/$asm.gkpStore \\\n"; print F " -O $wrk/$asm.ovlStore \\\n"; print F " -T test.tigStore -o test\\\n"; print F " -B $b -M $m -threads $t \\\n"; print F " -gs $gs \\\n"; print F " -eg $egl -eb $ebl -em $eml -er $erl -el $ol \\\n"; print F " -RS \\\n" if ($rs eq "-RS"); print F " -NS \\\n" if ($ns eq "-NS"); print F " -CS \\\n" if ($cs eq "-CS"); print F " -unassembled 2 1000 0.75 0.75 2 \\\n"; print F " -repeatdetect 6 32 5 \\\n"; print F " -D $d \\\n" if (length($d) > 0); print F " > bogart.err 2>& 1\n"; print F "fi\n"; print F "\n"; close(F); open(F, "> $wrk/$path/utgcns.sh") or die "can't open '$wrk/$path/utgcns.sh' for writing: $!\n"; print F "#!/bin/sh\n"; print F "\n"; print F "cd $wrk/$path\n"; print F "\n"; print F "if [ ! -e test.fasta ] ; then\n"; print F " /work/canu/FreeBSD-amd64/bin/utgcns \\\n"; print F " -G $wrk/$asm.gkpStore \\\n"; print F " -T test.tigStore 1 . 
\\\n"; print F " -O test.cns -L test.lay -A test.fasta\n"; print F "fi\n"; print F "\n"; print F "rm -f test.tigStore/seqDB.v002.dat\n"; print F "rm -f test.tigStore/seqDB.v002.tig\n"; print F "\n"; print F "/work/canu/FreeBSD-amd64/bin/tgStoreLoad \\\n"; print F " -G $wrk/$asm.gkpStore \\\n"; print F " -T test.tigStore 2 \\\n"; print F " test.cns\n"; print F "\n"; print F "/work/canu/FreeBSD-amd64/bin/tgStoreDump \\\n"; print F " -G $wrk/$asm.gkpStore \\\n"; print F " -T test.tigStore 2 \\\n"; print F " -consensus -fasta -contigs -bubbles \\\n"; print F "> contigs.fasta\n"; print F "\n"; print F "rm -f *.delta\n"; print F "rm -f *.coords\n"; print F "rm -f *.png\n"; print F "\n"; print F "sh /work/scripts/dotplot.sh usmarc /data/references/salmonella_enterica_usmarc_3124.1-cp006631.1.fasta contigs.fasta\n"; print F "sh /work/scripts/dotplot.sh serge /data/references/salmonella_enterica_serge.fasta contigs.fasta\n"; print F "\n"; print F "cp -fp usmarc.png $wrk/$path.usmarc.png\n"; print F "cp -fp serge.png $wrk/$path.serge.png\n"; print F "\n"; close(F); system("sh $wrk/$path/bogart.sh"); system("qsub -q vomit.q -cwd -j y -o /dev/null $wrk/$path/utgcns.sh > /dev/null 2>&1"); } } } } } } } } canu-1.6/src/pipelines/canu-object-store.pl000077500000000000000000000073571314437614700207770ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. 
Walenz beginning on 2017-MAR-14 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## use strict; use File::Basename; # A simple wrapper to emulate an object store. Implements: # describe --name - prints if the store has the named object, nothing otherwise # upload --path - uploads local to the store as object # download --output - downloads object into # # As seen immediately below, requires a magic hardcoded path. Files are copied to this directory # to become the objects. my $STASH; $STASH = "/assembly/STASH" if (-d "/assembly/STASH"); $STASH = "/Users/walenzbp/STASH" if (-d "/Users/walenzbp/STASH"); die "No STASH found\n" if (!defined($STASH)); my $task = shift @ARGV; if ($task eq "describe") { my $file; my $path = ""; my $wait = 0; while (scalar(@ARGV) > 0) { my $arg = shift @ARGV; if ($arg eq "--name") { $path = shift @ARGV; } else { die "Unknown option $arg\n"; } } if (-e "$STASH/$path") { print "$path\n"; } } if ($task eq "upload") { my $file; my $path; my $wait = 0; while (scalar(@ARGV) > 0) { my $arg = shift @ARGV; if ($arg eq "--path") { $path = shift @ARGV; } elsif ($arg eq "--wait") { $wait = 1; } elsif (!defined($file)) { $file = $arg; } else { die "Unknown option $arg\n"; } } # Copy local file $file, assumed to be in this directory, to the stash as $path. die "dx upload - no stash path supplied.\n" if (!defined($path)); die "dx upload - no input file supplied.\n" if (!defined($file)); die "dx upload - input file $file not found.\n" if (($file ne "-") && (! -e $file)); system("mkdir -p $STASH/" . 
dirname($path)); if ($file eq "-") { system("dd status=none of=$STASH/$path"); } else { system("cp -fp $file $STASH/$path"); } } if ($task eq "download") { my $file; my $path; my $wait = 0; while (scalar(@ARGV) > 0) { my $arg = shift @ARGV; if ($arg eq "--output") { $file = shift @ARGV; } elsif (!defined($path)) { $path = $arg; } else { die "Unknown option $arg\n"; } } # Copy local file $file, assumed to be in this directory, to the stash as $path. die "dx download - no stash path supplied.\n" if (!defined($path)); die "dx download - no output file supplied.\n" if (!defined($file)); #die "dx download - stash path $path not found.\n" if (! -e "$STASH/$path"); exit(0) if (! -e "$STASH/$path"); if ($file eq "-") { system("dd status=none if=$STASH/$path"); } else { system("cp -fp $STASH/$path $file"); } } exit(0); canu-1.6/src/pipelines/canu.pl000066400000000000000000000627531314437614700163770ustar00rootroot00000000000000#!/usr/bin/env perl ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # This file is derived from: # # src/pipelines/ca3g.pl # # Modifications by: # # Brian P. Walenz from 2015-FEB-27 to 2015-AUG-26 # are Copyright 2015 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. 
Walenz beginning on 2015-NOV-03 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2015-NOV-19 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## use strict; use FindBin; use Cwd qw(getcwd abs_path); use lib "$FindBin::RealBin/lib"; use lib "$FindBin::RealBin/lib/canu/lib/perl5"; use lib "$FindBin::RealBin/lib/canu/lib64/perl5"; use File::Path 2.08 qw(make_path remove_tree); use Carp; use canu::Defaults; use canu::Execution; use canu::Configure; use canu::Grid; use canu::Grid_Cloud; use canu::Grid_SGE; use canu::Grid_Slurm; use canu::Grid_PBSTorque; use canu::Grid_LSF; use canu::Grid_DNANexus; use canu::Gatekeeper; use canu::Meryl; use canu::OverlapInCore; use canu::OverlapMhap; use canu::OverlapMMap; use canu::OverlapStore; use canu::CorrectReads; use canu::ErrorEstimate; use canu::OverlapBasedTrimming; use canu::OverlapErrorAdjustment; use canu::Unitig; use canu::Consensus; use canu::Output; use canu::HTML; my @specFiles; # Files of specs my @specOpts; # Command line specs my @inputFiles; # Command line inputs, later inputs in spec files are added # Initialize our defaults. Must be done before defaults are reported in printOptions() below. setDefaults(); # The bin directory is needed for -version, can only be set after setDefaults(), but really should be # set after checkParameters() so it can know pathMap. my $bin = getBinDirectory(); # Path to binaries, reset later. my $cmd = undef; # Temporary string passed to system(). my $asm = undef; # Name of our assembly. my $asmAuto = undef; # If set, the name was auto-discovered. # What a mess. We can't set the version string until after we have a bin directory, and # Defaults.pm can't call stuff in Execution.pm. So, we need to special case setting the version # string. 
setVersion($bin); # Check for the presence of a -options switch BEFORE we do any work. # This lets us print the default values of options. if (scalar(@ARGV) == 0) { printHelp(1); } foreach my $arg (@ARGV) { if (($arg eq "-options") || ($arg eq "-defaults")) { printOptions(); exit(0); } if (($arg eq "-version") || ($arg eq "--version")) { print getGlobal("version") . "\n"; exit(0); } } # By default, all three steps are run. Options -correct, -trim and -assemble # can limit the pipeline to just that stage. # At some pain, we stash the original options for later use. We need # to use these when we resubmit ourself to the grid. We can't simply dump # all of @ARGV into here, because we need to fix up relative paths first. my $rootdir = undef; my $readdir = undef; my $mode = undef; # "correct", "trim", "trim-assemble" or "assemble" my $type = undef; # "pacbio" or "nanopore" my $step = "run"; my $haveRaw = 0; my $haveCorrected = 0; while (scalar(@ARGV)) { my $arg = shift @ARGV; if (($arg eq "-h") || ($arg eq "-help") || ($arg eq "--help")) { printHelp(1); } elsif (($arg eq "-citation") || ($arg eq "--citation")) { print STDERR "\n"; printCitation(undef); exit(0); } elsif ($arg eq "-d") { $rootdir = shift @ARGV; } elsif ($arg eq "-p") { $asm = shift @ARGV; addCommandLineOption("-p '$asm'"); } elsif ($arg eq "-s") { my $spec = shift @ARGV; $spec = abs_path($spec); push @specFiles, $spec; addCommandLineOption("-s '$spec'"); } elsif ($arg eq "-correct") { $mode = $step = "correct"; addCommandLineOption("-correct"); } elsif ($arg eq "-trim") { $mode = $step = "trim"; addCommandLineOption("-trim"); } elsif ($arg eq "-assemble") { $mode = $step = "assemble"; addCommandLineOption("-assemble"); } elsif ($arg eq "-trim-assemble") { $mode = $step = "trim-assemble"; addCommandLineOption("-trim-assemble"); } elsif ($arg eq "-readdir") { $readdir = shift @ARGV; addCommandLineOption("-readdir '$readdir'"); } elsif (($arg eq "-pacbio") || ($arg eq "-nanopore")) { $type = "pacbio" if 
($arg eq "-pacbio");
        $type = "nanopore"  if ($arg eq "-nanopore");

    } elsif (($arg eq "-pacbio-raw")       ||   #  File handling is also present in
             ($arg eq "-pacbio-corrected") ||   #  Defaults.pm around line 438
             ($arg eq "-nanopore-raw")     ||
             ($arg eq "-nanopore-corrected")) {
        my $file = $ARGV[0];
        my $fopt = addSequenceFile($readdir, $file, 1);

        while (defined($fopt)) {
            push @inputFiles, "$arg\0$fopt";
            addCommandLineOption("$arg '$fopt'");
            shift @ARGV;
            $file = $ARGV[0];
            $fopt = addSequenceFile($readdir, $file);
        }

    } elsif (-e $arg) {
        addCommandLineError("ERROR: File supplied on command line; use -s, -pacbio-raw, -pacbio-corrected, -nanopore-raw, or -nanopore-corrected.\n");

    } elsif ($arg =~ m/=/) {
        push @specOpts, $arg;
        addCommandLineOption("'$arg'");

    } else {
        addCommandLineError("ERROR: Invalid command line option '$arg'. Did you forget quotes around options with spaces?\n");
    }
}

#  If no $asm or $dir, see if there is an assembly here.  If so, set $asm to what was found.

if (!defined($asm)) {
    $asmAuto = 1;   #  If we don't actually find a prefix, we'll fail right after this, so OK to set blindly.

    open(F, "ls -d */*gkpStore |");
    while (<F>) {
        $asm = $1  if (m/^correction\/(.*).gkpStore$/);
        $asm = $1  if (m/^trimming\/(.*).gkpStore$/);
        $asm = $1  if (m/^unitigging\/(.*).gkpStore$/);
    }
    close(F);
}

#  Fail if some obvious things aren't set.

addCommandLineError("ERROR: Assembly name prefix not supplied with -p.\n")  if (!defined($asm));

#  Load parameters from the defaults files.

@inputFiles = setParametersFromFile("$bin/canu.defaults", $readdir, @inputFiles)  if (-e "$bin/canu.defaults");
@inputFiles = setParametersFromFile("$ENV{'HOME'}/.canu", $readdir, @inputFiles)  if (-e "$ENV{'HOME'}/.canu");

#  For each of the spec files, parse it, setting parameters and remembering any input files discovered.

foreach my $specFile (@specFiles) {
    @inputFiles = setParametersFromFile($specFile, $readdir, @inputFiles);
}

#  Set parameters from the command line.
setParametersFromCommandLine(@specOpts); # Reset $bin, now that all options, specifically the pathMap, are set. $bin = getBinDirectory(); # If anything complained (invalid option, missing file, etc) printHelp() will trigger and exit. printHelp(); # Now that we know the bin directory, print the version so those pesky users # will (hopefully) include it when they paste in logs. print STDERR "-- " . getGlobal("version") . "\n"; print STDERR "--\n"; print STDERR "-- CITATIONS\n"; print STDERR "--\n"; printCitation("-- "); print STDERR "-- CONFIGURE CANU\n"; print STDERR "--\n"; # Check java and gnuplot. checkJava(); checkGnuplot(); # And one last chance to fail - because java and gnuplot both can set an error. printHelp(); # Detect grid support. If 'gridEngine' isn't set, the execution methods submitScript() and # submitOrRunParallelJob() will return without submitting, or run locally (respectively). This # means that we can leave the default of 'useGrid' to 'true', and execution will do the right thing # when there isn't a grid. print STDERR "-- Detected ", getNumberOfCPUs(), " CPUs and ", getPhysicalMemorySize(), " gigabytes of memory.\n"; print STDERR "-- Limited to ", getGlobal("maxMemory"), " gigabytes from maxMemory option.\n" if (defined(getGlobal("maxMemory"))); print STDERR "-- Limited to ", getGlobal("maxThreads"), " CPUs from maxThreads option.\n" if (defined(getGlobal("maxThreads"))); detectSGE(); detectSlurm(); detectPBSTorque(); detectLSF(); detectDNANexus(); # Report if no grid engine found, or if the user has disabled grid support. if (!defined(getGlobal("gridEngine"))) { print STDERR "-- No grid engine detected, grid disabled.\n"; } if ((getGlobal("useGrid") eq "0") && (defined(getGlobal("gridEngine")))) { print STDERR "-- Grid engine disabled per useGrid=false option.\n"; setGlobal("gridEngine", undef); } # Finish setting up the grid. This is done AFTER parameters are set from the command line, to # let the user override any of our defaults. 
configureSGE(); configureSlurm(); configurePBSTorque(); configureLSF(); configureRemote(); configureDNANexus(); # Based on genomeSize, configure the execution of every component. # This needs to be done AFTER the grid is setup! configureAssembler(); # And, finally, move to the assembly directory, finish setting things up, and report the critical # parameters. setWorkDirectory(); if (defined($rootdir)) { make_path($rootdir) if (! -d $rootdir); chdir($rootdir); } setGlobal("onExitDir", getcwd()); setGlobal("onExitNam", $asm); setGlobalIfUndef("objectStoreNameSpace", $asm); # No good place to put this. # Figure out read inputs. From an existing store? From files? Corrected? Etc, etc. my $haveCorrected = 0; my $haveRaw = 0; my $setUpForPacBio = 0; my $setUpForNanopore = 0; # If we're a cloud run, fetch the store we expect to be working with. fetchStore("unitigging/$asm.gkpStore") if ((! -e "unitigging/$asm.gkpStore") && (fileExists("unitigging/$asm.gkpStore.tar"))); fetchStore("trimming/$asm.gkpStore") if ((! -e "trimming/$asm.gkpStore") && (fileExists("trimming/$asm.gkpStore.tar")) && (! -e "unitigging/$asm.gkpStore")); fetchStore("correction/$asm.gkpStore") if ((! -e "correction/$asm.gkpStore") && (fileExists("correction/$asm.gkpStore.tar")) && (! -e "trimming/$asm.gkpStore")); # Scan for an existing gkpStore. If the output from that stage exists, ignore the store there. my $gkp; $gkp = "correction/$asm.gkpStore" if ((-e "correction/$asm.gkpStore/libraries.txt") && (sequenceFileExists("$asm.correctedReads") eq undef)); $gkp = "trimming/$asm.gkpStore" if ((-e "trimming/$asm.gkpStore/libraries.txt") && (sequenceFileExists("$asm.trimmedReads") eq undef)); $gkp = "unitigging/$asm.gkpStore" if ((-e "unitigging/$asm.gkpStore/libraries.txt")); # Scan for existing stage outputs. These only get used if there isn't a gkpStore found above. 
my $reads;

$reads = sequenceFileExists("$asm.correctedReads")  if (!defined($reads));
$reads = sequenceFileExists("$asm.trimmedReads")    if (!defined($reads));

#  A handy function for reporting what reads we found.

sub reportReadsFound ($$$$) {
    my ($setUpForPacBio, $setUpForNanopore, $haveRaw, $haveCorrected) = @_;
    my $rt;
    my $ct;

    $rt = "both PacBio and Nanopore"   if (($setUpForPacBio  > 0) && ($setUpForNanopore  > 0));
    $rt = "PacBio"                     if (($setUpForPacBio  > 0) && ($setUpForNanopore == 0));
    $rt = "Nanopore"                   if (($setUpForPacBio == 0) && ($setUpForNanopore  > 0));
    $rt = "unknown"                    if (($setUpForPacBio == 0) && ($setUpForNanopore == 0));

    $ct = "uncorrected"                if (($haveRaw  > 0) && ($haveCorrected == 0));
    $ct = "corrected"                  if (($haveRaw == 0) && ($haveCorrected  > 0));
    $ct = "uncorrected AND corrected"  if (($haveRaw  > 0) && ($haveCorrected  > 0));

    return("$rt $ct");
}

#  If a gkpStore was found, scan the reads in it to decide what we're working with.

if (defined($gkp)) {
    my $numPacBioRaw         = 0;
    my $numPacBioCorrected   = 0;
    my $numNanoporeRaw       = 0;
    my $numNanoporeCorrected = 0;

    open(L, "< $gkp/libraries.txt") or caExit("can't open '$gkp/libraries.txt' for reading: $!", undef);
    while (<L>) {
        $numPacBioRaw++          if (m/pacbio-raw/);
        $numPacBioCorrected++    if (m/pacbio-corrected/);
        $numNanoporeRaw++        if (m/nanopore-raw/);
        $numNanoporeCorrected++  if (m/nanopore-corrected/);
    }
    close(L);

    $setUpForPacBio++    if ($numPacBioRaw       + $numPacBioCorrected   > 0);
    $setUpForNanopore++  if ($numNanoporeRaw     + $numNanoporeCorrected > 0);

    $haveRaw++           if ($numPacBioRaw       + $numNanoporeRaw       > 0);
    $haveCorrected++     if ($numPacBioCorrected + $numNanoporeCorrected > 0);

    my $rtct = reportReadsFound($setUpForPacBio, $setUpForNanopore, $haveRaw, $haveCorrected);

    print STDERR "--\n";
    print STDERR "-- Found $rtct reads in '$gkp'.\n";
}

#  Like above, scan the gkpStore to decide what we're working with.  The catch here is that
#  we scan the previous store, and all reads are corrected.
elsif (defined($reads)) {
    $gkp = "correction/$asm.gkpStore"  if ((-e "correction/$asm.gkpStore/libraries.txt") && (sequenceFileExists("$asm.correctedReads")));
    $gkp = "trimming/$asm.gkpStore"    if ((-e "trimming/$asm.gkpStore/libraries.txt")   && (sequenceFileExists("$asm.trimmedReads")));

    my $numPacBio   = 0;
    my $numNanopore = 0;

    if (defined($gkp)) {
        open(L, "< $gkp/libraries.txt") or caExit("can't open '$gkp/libraries.txt' for reading: $!", undef);
        while (<L>) {
            $numPacBio++    if (m/pacbio/);
            $numNanopore++  if (m/nanopore/);
        }
        close(L);

        $setUpForPacBio++    if ($numPacBio   > 0);
        $setUpForNanopore++  if ($numNanopore > 0);

        $haveCorrected++;
    } else {
       #$setUpForPacBio++;  #  Leaving both setUp's as zero reports 'unknown' and
        $haveCorrected++;   #  defaults to Pacbio below (search for setUpForNanopore).
    }

    #  Regardless of what the user gave us, we always want to restart with these reads.

    undef @inputFiles;
    push  @inputFiles, (($setUpForNanopore == 0) ? "-pacbio" : "-nanopore") . "-corrected\0$reads";

    my $rtct = reportReadsFound($setUpForPacBio, $setUpForNanopore, $haveRaw, $haveCorrected);

    print STDERR "--\n";
    print STDERR "-- Found $rtct reads in '$reads'.\n";
}

#  Scan input files, counting the different types of libraries we have.

elsif (scalar(@inputFiles) > 0) {
    foreach my $typefile (@inputFiles) {
        my ($type, $file) = split '\0', $typefile;

        $haveCorrected++     if ($type =~ m/corrected/);
        $haveRaw++           if ($type =~ m/raw/);

        $setUpForPacBio++    if ($type =~ m/pacbio/);
        $setUpForNanopore++  if ($type =~ m/nanopore/);
    }

    my $rtct = reportReadsFound($setUpForPacBio, $setUpForNanopore, $haveRaw, $haveCorrected);

    print STDERR "--\n";
    print STDERR "-- Found $rtct reads in the input files.\n";
}

#  Set an initial run mode, based on the libraries we have found, or the stores that exist (unless
#  it was set on the command line).
if (!defined($mode)) {
    $mode = "run"            if ($haveRaw       > 0);
    $mode = "trim-assemble"  if ($haveCorrected > 0);

    $mode = "run"            if (-e "correction/$asm.gkpStore/libraries.txt");
    $mode = "trim-assemble"  if (-e "trimming/$asm.gkpStore/libraries.txt");
    $mode = "assemble"       if (-e "unitigging/$asm.gkpStore/libraries.txt");
}

#  Set the type of the reads.  A command line option could force the type, e.g., "-pacbio" or
#  "-nanopore", to let you do cRaZy stuff like "-nanopore -pacbio-raw *fastq".

if (!defined($type)) {
    $type = "pacbio"    if ($setUpForPacBio   > 0);
    $type = "nanopore"  if ($setUpForNanopore > 0);
}

#  Now set error rates (if not set already) based on the dominant read type.

if ($type eq "nanopore") {
    setGlobalIfUndef("corOvlErrorRate", 0.320);
    setGlobalIfUndef("obtOvlErrorRate", 0.144);
    setGlobalIfUndef("utgOvlErrorRate", 0.144);
    setGlobalIfUndef("corErrorRate",    0.500);
    setGlobalIfUndef("obtErrorRate",    0.144);
    setGlobalIfUndef("utgErrorRate",    0.144);
    setGlobalIfUndef("cnsErrorRate",    0.192);
}

if ($type eq "pacbio") {
    setGlobalIfUndef("corOvlErrorRate", 0.240);
    setGlobalIfUndef("obtOvlErrorRate", 0.045);
    setGlobalIfUndef("utgOvlErrorRate", 0.045);
    setGlobalIfUndef("corErrorRate",    0.300);
    setGlobalIfUndef("obtErrorRate",    0.045);
    setGlobalIfUndef("utgErrorRate",    0.045);
    setGlobalIfUndef("cnsErrorRate",    0.075);
}

#  Check for a few errors:
#    no mode                -> don't have any reads or any store to run from.
#    both raw and corrected -> don't know how to process these

caExit("ERROR: No reads supplied, and can't find any reads in any gkpStore", undef)  if (!defined($mode));
caExit("ERROR: Failed to determine the sequencing technology of the reads", undef)   if (!defined($type));
caExit("ERROR: Can't mix uncorrected and corrected reads", undef)                    if ($haveRaw && $haveCorrected);

#  Do a final check on parameters, cleaning up paths and case, and failing on invalid stuff.

checkParameters();

#  And one final last chance to fail - because java and gnuplot both can set an error.

printHelp();

#  Go!
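The error-rate defaults chosen above differ sharply between the two technologies. As an illustrative reference, they can be collected into a small lookup table; this is a Python sketch, not part of canu, with the values copied verbatim from the setGlobalIfUndef() calls above:

```python
# Default error-rate parameters keyed by read technology, mirroring the
# setGlobalIfUndef() calls in canu's main script.  These are the values
# used only when the user has not set the parameter explicitly.
DEFAULT_ERROR_RATES = {
    "nanopore": {
        "corOvlErrorRate": 0.320,
        "obtOvlErrorRate": 0.144,
        "utgOvlErrorRate": 0.144,
        "corErrorRate":    0.500,
        "obtErrorRate":    0.144,
        "utgErrorRate":    0.144,
        "cnsErrorRate":    0.192,
    },
    "pacbio": {
        "corOvlErrorRate": 0.240,
        "obtOvlErrorRate": 0.045,
        "utgOvlErrorRate": 0.045,
        "corErrorRate":    0.300,
        "obtErrorRate":    0.045,
        "utgErrorRate":    0.045,
        "cnsErrorRate":    0.075,
    },
}

def default_error_rate(read_type, parameter):
    """Return the default used when the user left `parameter` unset."""
    return DEFAULT_ERROR_RATES[read_type][parameter]
```

Note that the trimming (obt) and unitigging (utg) defaults are identical within each technology; only correction tolerates a much higher error rate.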
printf STDERR "--\n"; printf STDERR "-- Generating assembly '$asm' in '" . getcwd() . "'\n"; printf STDERR "--\n"; printf STDERR "-- Parameters:\n"; printf STDERR "--\n"; printf STDERR "-- genomeSize %s\n", getGlobal("genomeSize"); printf STDERR "--\n"; printf STDERR "-- Overlap Generation Limits:\n"; printf STDERR "-- corOvlErrorRate %6.4f (%6.2f%%)\n", getGlobal("corOvlErrorRate"), getGlobal("corOvlErrorRate") * 100.0; printf STDERR "-- obtOvlErrorRate %6.4f (%6.2f%%)\n", getGlobal("obtOvlErrorRate"), getGlobal("obtOvlErrorRate") * 100.0; printf STDERR "-- utgOvlErrorRate %6.4f (%6.2f%%)\n", getGlobal("utgOvlErrorRate"), getGlobal("utgOvlErrorRate") * 100.0; printf STDERR "--\n"; printf STDERR "-- Overlap Processing Limits:\n"; printf STDERR "-- corErrorRate %6.4f (%6.2f%%)\n", getGlobal("corErrorRate"), getGlobal("corErrorRate") * 100.0; printf STDERR "-- obtErrorRate %6.4f (%6.2f%%)\n", getGlobal("obtErrorRate"), getGlobal("obtErrorRate") * 100.0; printf STDERR "-- utgErrorRate %6.4f (%6.2f%%)\n", getGlobal("utgErrorRate"), getGlobal("utgErrorRate") * 100.0; printf STDERR "-- cnsErrorRate %6.4f (%6.2f%%)\n", getGlobal("cnsErrorRate"), getGlobal("cnsErrorRate") * 100.0; # Check that we were supplied a work directory, and that it exists, or we can create it. make_path("canu-logs") if (! -d "canu-logs"); make_path("canu-scripts") if (! -d "canu-scripts"); # This environment variable tells the binaries to log their execution in canu-logs/ $ENV{'CANU_DIRECTORY'} = getcwd(); # Report the parameters used. writeLog(); # Submit ourself for grid execution? If not grid enabled, or already running on the grid, this # call just returns. The arg MUST be undef. submitScript($asm, undef); # # When doing 'run', this sets options for each stage. # - overlapper 'mhap' for correction, 'ovl' for trimming and assembly. # - consensus 'falconpipe' for correction, 'utgcns' for assembly. No consensus in trimming. # - errorRates 15% for correction and 2% for trimming and assembly. 
Internally, this is
#    multiplied by three for obt, ovl, cns, etc.
#

sub setOptions ($$) {
    my $mode = shift @_;  #  E.g., "run" or "trim-assemble" or just plain ol' "trim"
    my $step = shift @_;  #  Step we're setting options for.

    #  Decide if we care about running this step in this mode.  I almost applied
    #  De Morgan's Laws to this.  I don't think it would have been any clearer.

    if (($mode eq $step) ||
        ($mode eq "run") ||
        (($mode eq "trim-assemble") && ($step eq "trim")) ||
        (($mode eq "trim-assemble") && ($step eq "assemble"))) {
        #  Do run this.
    } else {
        return("don't run this");
    }

    #  Create directories for the step, if needed.

    make_path("correction")  if ((! -d "correction") && ($step eq "correct"));
    make_path("trimming")    if ((! -d "trimming")   && ($step eq "trim"));
    make_path("unitigging")  if ((! -d "unitigging") && ($step eq "assemble"));

    #  Return that we want to run this step.

    return($step);
}

#
#  Pipeline piece
#

sub overlap ($$) {
    my $asm     = shift @_;
    my $tag     = shift @_;
    my $ovlType = ($tag eq "utg") ? "normal" : "partial";

    if (getGlobal("${tag}overlapper") eq "mhap") {
        mhapConfigure($asm, $tag, $ovlType);
        mhapPrecomputeCheck($asm, $tag, $ovlType)  foreach (1..getGlobal("canuIterationMax") + 1);
        #  this also does mhapReAlign
        mhapCheck($asm, $tag, $ovlType)            foreach (1..getGlobal("canuIterationMax") + 1);

    } elsif (getGlobal("${tag}overlapper") eq "minimap") {
        mmapConfigure($asm, $tag, $ovlType);
        mmapPrecomputeCheck($asm, $tag, $ovlType)  foreach (1..getGlobal("canuIterationMax") + 1);
        mmapCheck($asm, $tag, $ovlType)            foreach (1..getGlobal("canuIterationMax") + 1);

    } else {
        overlapConfigure($asm, $tag, $ovlType);
        overlapCheck($asm, $tag, $ovlType)         foreach (1..getGlobal("canuIterationMax") + 1);
    }

    createOverlapStore($asm, $tag, getGlobal("ovsMethod"));
}

#
#  Begin pipeline
#
#  The checks for sequenceFileExists() at the start aren't needed except for
#  object storage mode.
Gatekeeper has no way of knowing, inside # gatekeeper(), that this stage is completed and it shouldn't fetch the # store. In 'normal' operation, the store exists already, and we just # return. # if (setOptions($mode, "correct") eq "correct") { if (sequenceFileExists("$asm.correctedReads") eq undef) { print STDERR "--\n"; print STDERR "--\n"; print STDERR "-- BEGIN CORRECTION\n"; print STDERR "--\n"; gatekeeper($asm, "cor", @inputFiles); merylConfigure($asm, "cor"); merylCheck($asm, "cor") foreach (1..getGlobal("canuIterationMax") + 1); merylProcess($asm, "cor"); overlap($asm, "cor"); buildCorrectionLayouts($asm); generateCorrectedReads($asm) foreach (1..getGlobal("canuIterationMax") + 1); dumpCorrectedReads($asm); buildHTML($asm, "cor"); } my $correctedReads = sequenceFileExists("$asm.correctedReads"); caExit("can't find corrected reads '$asm.correctedReads*' in directory '" . getcwd() . "'", undef) if (!defined($correctedReads)); undef @inputFiles; push @inputFiles, "-$type-corrected\0$correctedReads"; } if (setOptions($mode, "trim") eq "trim") { if (sequenceFileExists("$asm.trimmedReads") eq undef) { print STDERR "--\n"; print STDERR "--\n"; print STDERR "-- BEGIN TRIMMING\n"; print STDERR "--\n"; gatekeeper($asm, "obt", @inputFiles); merylConfigure($asm, "obt"); merylCheck($asm, "obt") foreach (1..getGlobal("canuIterationMax") + 1); merylProcess($asm, "obt"); overlap($asm, "obt"); trimReads ($asm); splitReads($asm); dumpReads ($asm); #summarizeReads($asm); buildHTML($asm, "obt"); } my $trimmedReads = sequenceFileExists("$asm.trimmedReads"); caExit("can't find trimmed reads '$asm.trimmedReads*' in directory '" . getcwd() . 
"'", undef)  if (!defined($trimmedReads));

    undef @inputFiles;
    push  @inputFiles, "-$type-corrected\0$trimmedReads";
}

if (setOptions($mode, "assemble") eq "assemble") {
    if (sequenceFileExists("$asm.contigs") eq undef) {
        print STDERR "--\n";
        print STDERR "--\n";
        print STDERR "-- BEGIN ASSEMBLY\n";
        print STDERR "--\n";

        gatekeeper($asm, "utg", @inputFiles);

        merylConfigure($asm, "utg");
        merylCheck($asm, "utg")  foreach (1..getGlobal("canuIterationMax") + 1);
        merylProcess($asm, "utg");

        overlap($asm, "utg");

        #readErrorDetection($asm);

        readErrorDetectionConfigure($asm);
        readErrorDetectionCheck($asm)      foreach (1..getGlobal("canuIterationMax") + 1);

        overlapErrorAdjustmentConfigure($asm);
        overlapErrorAdjustmentCheck($asm)  foreach (1..getGlobal("canuIterationMax") + 1);

        updateOverlapStore($asm);

        unitig($asm);
        unitigCheck($asm)  foreach (1..getGlobal("canuIterationMax") + 1);

        foreach (1..getGlobal("canuIterationMax") + 1) {  #  Consensus wants to change the script between the first and
            consensusConfigure($asm);                     #  second iterations.  The script is rewritten in
            consensusCheck($asm);                         #  consensusConfigure(), so we need to add that to the loop.
        }

        consensusLoad($asm);
        consensusAnalyze($asm);

        alignGFA($asm)  foreach (1..getGlobal("canuIterationMax") + 1);

        generateOutputs($asm);
    }
}

print STDERR "--\n";
print STDERR "-- Bye.\n";

exit(0);
canu-1.6/src/pipelines/canu/Configure.pm

###############################################################################
#
#  This file is part of canu, a software program that assembles whole-genome
#  sequencing reads into contigs.
#
#  This software is based on:
#    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
#    the 'kmer package' (http://kmer.sourceforge.net)
#  both originally distributed by Applera Corporation under the GNU General
#  Public License, version 2.
# # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. Walenz beginning on 2015-NOV-27 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2015-DEC-02 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## package canu::Configure; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(displayMemoryValue displayGenomeSize configureAssembler); use strict; use Carp qw(cluck); use Sys::Hostname; use canu::Defaults; use canu::Execution; # This is called to expand parameter ranges for memory and thread parameters. # Examples of valid ranges: # # no units - 1-4:2 - assumes 'g' in the adjust (if memory) # one unit - 1g-4:2 1-4g:2 1-4:2g - all the others are set to 'g' # two units - 1g-4g:2 1g-4:2g 1-4g:2g - bgn/end are the same, stp uses end # all three - 1g-4g:2g - use as is # # Quirks: 1g-2000m will increment every 1m. # 1g-2000m:1g only adds 1g. # 1g-2048m:1g adds 1 and 2g. sub expandRange ($$$$) { my $var = shift @_; my $val = shift @_; my $min = shift @_; # limit the minimum to be above this my $max = shift @_; # limit the maximum to be below this my @v = split ',', $val; my @r; foreach my $v (@v) { my $bgn; my $bgnu; my $end; my $endu; my $stp; my $stpu; # Decode the range. 
if      ($v =~ m/^(\d+\.{0,1}\d*)([kKmMgGtT]{0,1})$/) {
            $bgn = $1;  $bgnu = $2;
            $end = $1;  $endu = $2;
            $stp =  1;  $stpu = $2;
        } elsif ($v =~ m/^(\d+\.{0,1}\d*)([kKmMgGtT]{0,1})-(\d+\.{0,1}\d*)([kKmMgGtT]{0,1})$/) {
            $bgn = $1;  $bgnu = $2;
            $end = $3;  $endu = $4;
            $stp =  1;  $stpu = $4;
        } elsif ($v =~ m/^(\d+\.{0,1}\d*)([kKmMgGtT]{0,1})-(\d+\.{0,1}\d*)([kKmMgGtT]{0,1}):(\d+\.{0,1}\d*)([kKmMgGtT]{0,1})$/) {
            $bgn = $1;  $bgnu = $2;
            $end = $3;  $endu = $4;
            $stp = $5;  $stpu = $6;
        } else {
            caExit("can't parse '$var' entry '$v'", undef);
        }

        #  Undef things that are null.  The code that follows this was written assuming undef.

        $bgnu = undef  if ($bgnu eq "");
        $endu = undef  if ($endu eq "");
        $stpu = undef  if ($stpu eq "");

        #  Process the range

        my $def = defined($bgnu) + defined($endu) + defined($stpu);

        #  If no units, this could be a memory or a thread setting.  Don't use units.
        if ($def == 0) {
        }

        #  If only one unit specified, set the others to the same.
        elsif ($def == 1) {
            if    (defined($bgnu))  { $endu = $stpu = $bgnu; }
            elsif (defined($endu))  { $bgnu = $stpu = $endu; }
            elsif (defined($stpu))  { $bgnu = $endu = $stpu; }
        }

        #  If two units specified, set the unset as:
        #    bgn or end unset - set based on the other range
        #    stp unset        - set based on end
        elsif ($def == 2) {
            if    (!defined($bgnu))  { $bgnu = $endu; }
            elsif (!defined($endu))  { $endu = $bgnu; }
            elsif (!defined($stpu))  { $stpu = $endu; }
        }

        #  Nothing to do if all three are set!
        elsif ($def == 3) {
        }

        #  Convert the value and unit to gigabytes.

        my $b = adjustMemoryValue("$bgn$bgnu");
        my $e = adjustMemoryValue("$end$endu");
        my $s = adjustMemoryValue("$stp$stpu");

        #  Enforce the user supplied minimum and maximum.  We cannot 'decrease min to user supplied
        #  maximum' because this effectively ignores the task setting.  For, e.g., batMemory=64-128
        #  and maxMemory=32, we want it to fail.

        $b = $min  if ((defined($min)) && ($b < $min));   #  Increase min to user supplied minimum.
        $e = $min  if ((defined($min)) && ($e < $min));   #  Increase max to user supplied minimum.

       #$b = $max  if ((defined($max)) && ($b > $max));   #  Decrease min to user supplied maximum.
$e = $max  if ((defined($max)) && ($e > $max));   #  Decrease max to user supplied maximum.

        #  Iterate over the range, push values to test onto the array.

        for (my $ii=$b; $ii<=$e; $ii += $s) {
            push @r, $ii;
        }
    }

   #print "$var = ";
   #foreach my $r (@r) {
   #    print "$r ";
   #}
   #print "\n";

    return(@r);
}

sub findGridMaxMemoryAndThreads () {
    my @grid   = split '\0', getGlobal("availableHosts");
    my $maxmem = 0;
    my $maxcpu = 0;

    foreach my $g (@grid) {
        my ($cpu, $mem, $num) = split '-', $g;

        $maxmem = ($maxmem < $mem) ? $mem : $maxmem;
        $maxcpu = ($maxcpu < $cpu) ? $cpu : $maxcpu;
    }

    return($maxmem, $maxcpu);
}

#  Side effect!  This will RESET the $global{} parameters to the computed value.  This lets
#  the rest of canu - in particular, the part that runs the jobs - use the correct value.  Without
#  resetting, I'd be making code changes all over the place to support the values returned.

sub getAllowedResources ($$$$@) {
    my $tag = shift @_;  #  Variant, e.g., "cor", "utg"
    my $alg = shift @_;  #  Algorithm, e.g., "mhap", "ovl"
    my $err = shift @_;  #  Report of things we can't run.
    my $all = shift @_;  #  Report of things we can run.
    my $dbg = shift @_;  #  Optional, report debugging stuff

    #  If no grid, or grid not enabled, everything falls under 'local'.

    my $class = ((getGlobal("useGrid") ne "0") && (defined(getGlobal("gridEngine")))) ? "grid" : "local";

    #  If grid, but no hosts, fail.

    if (($class eq "grid") && (!defined(getGlobal("availableHosts")))) {
        caExit("invalid useGrid (" . getGlobal("useGrid") . ") and gridEngine (" . getGlobal("gridEngine") . "); found no execution hosts - is grid available from this host?", undef);
    }

    #  Figure out limits.

    my $minMemory   = getGlobal("minMemory");
    my $minThreads  = getGlobal("minThreads");
    my $maxMemory   = getGlobal("maxMemory");
    my $maxThreads  = getGlobal("maxThreads");

    my $taskMemory  = getGlobal("${tag}${alg}Memory");   #  Algorithm limit, "utgovlMemory", etc.
    my $taskThreads = getGlobal("${tag}${alg}Threads");

    #
    #  The task limits MUST be defined.
caExit("${tag}${alg}Memory is not defined", undef) if (!defined($taskMemory)); caExit("${tag}${alg}Threads is not defined", undef) if (!defined($taskThreads)); # If the maximum limits aren't set, default to 'unlimited' (for the grid; we'll effectively filter # by the number of jobs we can fit on the hosts) or to the current hardware limits. if ($dbg) { print STDERR "--\n"; print STDERR "-- ERROR\n"; print STDERR "-- ERROR Limited to at least $minMemory GB memory via minMemory option\n" if (defined($minMemory)); print STDERR "-- ERROR Limited to at least $minThreads threads via minThreads option\n" if (defined($minThreads)); print STDERR "-- ERROR Limited to at most $maxMemory GB memory via maxMemory option\n" if (defined($maxMemory)); print STDERR "-- ERROR Limited to at most $maxThreads threads via maxThreads option\n" if (defined($maxThreads)); } # Figure out the largest memory and threads that could ever be supported. This lets us short-circuit # the loop below. my ($gridMaxMem, $gridMaxThr) = findGridMaxMemoryAndThreads(); $maxMemory = (($class eq "grid") ? $gridMaxMem : getPhysicalMemorySize()) if (!defined($maxMemory)); $maxThreads = (($class eq "grid") ? $gridMaxThr : getNumberOfCPUs()) if (!defined($maxThreads)); # Build a list of the available hardware configurations we can run on. If grid, we get this # from the list of previously discovered hosts. If local, it's just this machine. my @gridCor; # Number of cores my @gridMem; # GB's of memory my @gridNum; # Number of nodes if ($class eq "grid") { my @grid = split '\0', getGlobal("availableHosts"); foreach my $g (@grid) { my ($cpu, $mem, $num) = split '-', $g; if (($cpu > 0) && ($mem > 0) && ($num > 0)) { push @gridCor, $cpu; push @gridMem, $mem; push @gridNum, $num; } } } else { push @gridCor, $maxThreads; push @gridMem, $maxMemory; push @gridNum, 1; } if ($dbg) { print STDERR "-- ERROR\n"; print STDERR "-- ERROR Found ", scalar(@gridCor), " machine ", ((scalar(@gridCor) == 1) ? 
"configuration:\n" : "configurations:\n"); for (my $ii=0; $ii $maxMemory) || ($t > $maxThreads))) { # print STDERR "-- ERROR Tested $tag$alg requesting $t cores and ${m}GB memory - rejected: limited to ${maxMemory}GB and $maxThreads cores.\n"; #} next if ($m > $maxMemory); # Bail if either of the suggest settings are next if ($t > $maxThreads); # larger than the maximum allowed. # Save this memory size. ovsMemory uses a list of possible memory sizes to # pick the smallest one that results in an acceptable number of files. $availMemoryMin = $m if (!defined($availMemoryMin) || ($m < $availMemoryMin)); $availMemoryMax = $m if (!defined($availMemoryMax) || ($availMemoryMax < $m)); # For a job using $m GB memory and $t threads, we can compute how many processes will # fit on each node in our set of available machines. The smaller of the two is then # the number of processes we can run on this node. my $processes = 0; my $cores = 0; my $memory = 0; for (my $ii=0; $ii $maxMemory); caExit("invalid taskThread=$taskThreads; maxThreads=$maxThreads", undef) if ($taskThreads > $maxThreads); # Finally, reset the concurrency (if we're running locally) so we don't swamp our poor workstation. my $concurrent = undef; if ($class eq "local") { my $nc = int($maxThreads / $taskThreads); if (($taskThreads * getGlobal("${tag}${alg}Concurrency") > $maxThreads)) { $err .= "-- Reset concurrency from ", getGlobal("${tag}${alg}Concurrency"), " to $nc.\n"; setGlobal("${tag}${alg}Concurrency", $nc); } if (!defined(getGlobal("${tag}${alg}Concurrency"))) { setGlobal("${tag}${alg}Concurrency", $nc); } $concurrent = getGlobal("${tag}${alg}Concurrency"); } # And report. 
my $nam; if ($alg eq "meryl") { $nam = "(k-mer counting)"; } elsif ($alg eq "mhap") { $nam = "(overlap detection with mhap)"; } elsif ($alg eq "mmap") { $nam = "(overlap detection with minimap)"; } elsif ($alg eq "ovl") { $nam = "(overlap detection)"; } elsif ($alg eq "cor") { $nam = "(read correction)"; } elsif ($alg eq "ovb") { $nam = "(overlap store bucketizer)"; } elsif ($alg eq "ovs") { $nam = "(overlap store sorting)"; } elsif ($alg eq "red") { $nam = "(read error detection)"; } elsif ($alg eq "oea") { $nam = "(overlap error adjustment)"; } elsif ($alg eq "bat") { $nam = "(contig construction)"; } elsif ($alg eq "cns") { $nam = "(consensus)"; } elsif ($alg eq "gfa") { $nam = "(GFA alignment and processing)"; } else { caFailure("unknown task '$alg' in getAllowedResources().", undef); } my $job = substr(" $concurrent", -3) . " job" . (($concurrent == 1) ? " " : "s"); my $thr = substr(" $taskThreads", -3) . " CPU" . (($taskThreads == 1) ? " " : "s"); my $mem = substr(" $taskMemory", -4) . " GB"; my $t = substr("$tag$alg ", 0, 7); if (!defined($all)) { #$all .= "-- Memory, Threads and Concurrency configuration:\n" if ( defined($concurrent)); #$all .= "-- Memory and Threads configuration:\n" if (!defined($concurrent)); if (defined($concurrent)) { $all .= "-- (tag)Concurrency\n"; $all .= "-- (tag)Threads |\n"; $all .= "-- (tag)Memory | |\n"; $all .= "-- (tag) | | | algorithm\n"; $all .= "-- ------- ------ -------- -------- -----------------------------\n"; } else { $all .= "-- (tag)Threads\n"; $all .= "-- (tag)Memory |\n"; $all .= "-- (tag) | | algorithm\n"; $all .= "-- ------- ------ -------- -----------------------------\n"; } } $all .= "-- Local: $t $mem $thr x $job $nam\n" if ( defined($concurrent)); $all .= "-- Grid: $t $mem $thr $nam\n" if (!defined($concurrent)); return($err, $all); } # Converts number with units to gigabytes. If no units, gigabytes is assumed. 
sub adjustMemoryValue ($) { my $val = shift @_; return(undef) if (!defined($val)); return($1) if ($val =~ m/^(\d+\.{0,1}\d*)$/); return($1 / 1024 / 1024) if ($val =~ m/^(\d+\.{0,1}\d*)[kK]$/); return($1 / 1024) if ($val =~ m/^(\d+\.{0,1}\d*)[mM]$/); return($1) if ($val =~ m/^(\d+\.{0,1}\d*)[gG]$/); return($1 * 1024) if ($val =~ m/^(\d+\.{0,1}\d*)[tT]$/); return($1 * 1024 * 1024) if ($val =~ m/^(\d+\.{0,1}\d*)[pP]$/); die "Invalid memory value '$val'\n"; } # Converts gigabytes to number with units. sub displayMemoryValue ($) { my $val = shift @_; return(($val * 1024 * 1024) . "k") if ($val < adjustMemoryValue("1m")); return(($val * 1024) . "m") if ($val < adjustMemoryValue("1g")); return(($val) . "g") if ($val < adjustMemoryValue("1t")); return(($val / 1024) . "t"); } # Converts number with units to bases. sub adjustGenomeSize ($) { my $val = shift @_; return(undef) if (!defined($val)); return($1) if ($val =~ m/^(\d+\.{0,1}\d*)$/i); return($1 * 1000) if ($val =~ m/^(\d+\.{0,1}\d*)[kK]$/i); return($1 * 1000000) if ($val =~ m/^(\d+\.{0,1}\d*)[mM]$/i); return($1 * 1000000000) if ($val =~ m/^(\d+\.{0,1}\d*)[gG]$/i); return($1 * 1000000000000) if ($val =~ m/^(\d+\.{0,1}\d*)[tT]$/i); die "Invalid genome size '$val'\n"; } # Converts bases to number with units. sub displayGenomeSize ($) { my $val = shift @_; return(($val)) if ($val < adjustGenomeSize("1k")); return(($val / 1000) . "k") if ($val < adjustGenomeSize("1m")); return(($val / 1000000) . "m") if ($val < adjustGenomeSize("1g")); return(($val / 1000000000) . "g") if ($val < adjustGenomeSize("1t")); return(($val / 1000000000000) . "t"); } # # If minMemory or minThreads isn't defined, pick a reasonable pair based on genome size. # sub configureAssembler () { # Parse units on things the user possibly set. 
    setGlobal("genomeSize", adjustGenomeSize(getGlobal("genomeSize")));

    setGlobal("minMemory", adjustMemoryValue(getGlobal("minMemory")));
    setGlobal("maxMemory", adjustMemoryValue(getGlobal("maxMemory")));

    #  For overlapper and mhap, allow larger maximums for larger genomes.  More memory won't help
    #  smaller genomes, and the smaller minimums won't hurt larger genomes (which are probably being
    #  run on larger machines anyway, so the minimums won't be used).
    #
    #  For the uncorrected overlapper, both memory and thread count are reduced: memory because it is
    #  very CPU bound, and thread count because it can be quite unbalanced.

    if      (getGlobal("genomeSize") < adjustGenomeSize("40m")) {
        setGlobalIfUndef("corOvlMemory", "2-6");    setGlobalIfUndef("corOvlThreads", "1");
        setGlobalIfUndef("obtOvlMemory", "4-8");    setGlobalIfUndef("obtOvlThreads", "1-8");
        setGlobalIfUndef("utgOvlMemory", "4-8");    setGlobalIfUndef("utgOvlThreads", "1-8");

        setGlobalIfUndef("corMhapMemory", "4-6");   setGlobalIfUndef("corMhapThreads", "1-16");
        setGlobalIfUndef("obtMhapMemory", "4-6");   setGlobalIfUndef("obtMhapThreads", "1-16");
        setGlobalIfUndef("utgMhapMemory", "4-6");   setGlobalIfUndef("utgMhapThreads", "1-16");

        setGlobalIfUndef("corMMapMemory", "4-6");   setGlobalIfUndef("corMMapThreads", "1-16");
        setGlobalIfUndef("obtMMapMemory", "4-6");   setGlobalIfUndef("obtMMapThreads", "1-16");
        setGlobalIfUndef("utgMMapMemory", "4-6");   setGlobalIfUndef("utgMMapThreads", "1-16");

    } elsif (getGlobal("genomeSize") < adjustGenomeSize("500m")) {
        setGlobalIfUndef("corOvlMemory", "2-6");    setGlobalIfUndef("corOvlThreads", "1");
        setGlobalIfUndef("obtOvlMemory", "4-8");    setGlobalIfUndef("obtOvlThreads", "1-8");
        setGlobalIfUndef("utgOvlMemory", "4-8");    setGlobalIfUndef("utgOvlThreads", "1-8");

        setGlobalIfUndef("corMhapMemory", "8-13");  setGlobalIfUndef("corMhapThreads", "1-16");
        setGlobalIfUndef("obtMhapMemory", "8-13");  setGlobalIfUndef("obtMhapThreads", "1-16");
        setGlobalIfUndef("utgMhapMemory", "8-13");  setGlobalIfUndef("utgMhapThreads", "1-16");

        setGlobalIfUndef("corMMapMemory", "8-13");  setGlobalIfUndef("corMMapThreads", "1-16");
        setGlobalIfUndef("obtMMapMemory", "8-13");  setGlobalIfUndef("obtMMapThreads", "1-16");
        setGlobalIfUndef("utgMMapMemory", "8-13");  setGlobalIfUndef("utgMMapThreads", "1-16");

    } elsif (getGlobal("genomeSize") < adjustGenomeSize("2g")) {
        setGlobalIfUndef("corOvlMemory", "2-8");    setGlobalIfUndef("corOvlThreads", "1");
        setGlobalIfUndef("obtOvlMemory", "4-12");   setGlobalIfUndef("obtOvlThreads", "1-8");
        setGlobalIfUndef("utgOvlMemory", "4-12");   setGlobalIfUndef("utgOvlThreads", "1-8");

        setGlobalIfUndef("corMhapMemory", "16-32"); setGlobalIfUndef("corMhapThreads", "4-16");
        setGlobalIfUndef("obtMhapMemory", "16-32"); setGlobalIfUndef("obtMhapThreads", "4-16");
        setGlobalIfUndef("utgMhapMemory", "16-32"); setGlobalIfUndef("utgMhapThreads", "4-16");

        setGlobalIfUndef("corMMapMemory", "16-32"); setGlobalIfUndef("corMMapThreads", "1-16");
        setGlobalIfUndef("obtMMapMemory", "16-32"); setGlobalIfUndef("obtMMapThreads", "1-16");
        setGlobalIfUndef("utgMMapMemory", "16-32"); setGlobalIfUndef("utgMMapThreads", "1-16");

    } elsif (getGlobal("genomeSize") < adjustGenomeSize("5g")) {
        setGlobalIfUndef("corOvlMemory", "2-8");    setGlobalIfUndef("corOvlThreads", "1");
        setGlobalIfUndef("obtOvlMemory", "4-16");   setGlobalIfUndef("obtOvlThreads", "1-8");
        setGlobalIfUndef("utgOvlMemory", "4-16");   setGlobalIfUndef("utgOvlThreads", "1-8");

        setGlobalIfUndef("corMhapMemory", "16-48"); setGlobalIfUndef("corMhapThreads", "4-16");
        setGlobalIfUndef("obtMhapMemory", "16-48"); setGlobalIfUndef("obtMhapThreads", "4-16");
        setGlobalIfUndef("utgMhapMemory", "16-48"); setGlobalIfUndef("utgMhapThreads", "4-16");

        setGlobalIfUndef("corMMapMemory", "16-48"); setGlobalIfUndef("corMMapThreads", "1-16");
        setGlobalIfUndef("obtMMapMemory", "16-48"); setGlobalIfUndef("obtMMapThreads", "1-16");
        setGlobalIfUndef("utgMMapMemory", "16-48"); setGlobalIfUndef("utgMMapThreads", "1-16");

    } else {
        setGlobalIfUndef("corOvlMemory", "2-8");    setGlobalIfUndef("corOvlThreads", "1");
        setGlobalIfUndef("obtOvlMemory", "4-16");   setGlobalIfUndef("obtOvlThreads", "1-8");
        setGlobalIfUndef("utgOvlMemory", "4-16");   setGlobalIfUndef("utgOvlThreads", "1-8");

        setGlobalIfUndef("corMhapMemory", "32-64"); setGlobalIfUndef("corMhapThreads", "4-16");
        setGlobalIfUndef("obtMhapMemory", "32-64"); setGlobalIfUndef("obtMhapThreads", "4-16");
        setGlobalIfUndef("utgMhapMemory", "32-64"); setGlobalIfUndef("utgMhapThreads", "4-16");

        setGlobalIfUndef("corMMapMemory", "32-64"); setGlobalIfUndef("corMMapThreads", "1-16");
        setGlobalIfUndef("obtMMapMemory", "32-64"); setGlobalIfUndef("obtMMapThreads", "1-16");
        setGlobalIfUndef("utgMMapMemory", "32-64"); setGlobalIfUndef("utgMMapThreads", "1-16");
    }

    #  Overlapper block sizes probably don't need to be modified based on genome size.

    setGlobalIfUndef("corOvlHashBlockLength",   2500000);
    setGlobalIfUndef("corOvlRefBlockSize",        20000);
    setGlobalIfUndef("corOvlRefBlockLength",          0);

    setGlobalIfUndef("obtOvlHashBlockLength", 100000000);
    setGlobalIfUndef("obtOvlRefBlockSize",      2000000);
    setGlobalIfUndef("obtOvlRefBlockLength",          0);

    setGlobalIfUndef("utgOvlHashBlockLength", 100000000);
    setGlobalIfUndef("utgOvlRefBlockSize",      2000000);
    setGlobalIfUndef("utgOvlRefBlockLength",          0);

    #  Overlap store construction should be based on the number of overlaps, but we obviously don't
    #  know that until much later.  If we set memory too large, we risk inefficiency (in the parallel
    #  version, for sure); too small and we run out of file handles.
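#  The memory and thread settings above are "min-max" range strings such as "4-8" or a single
#  value such as "1".  The sketch below is an illustrative Python re-implementation of how such a
#  range can be split into numeric bounds; `parse_range` is a hypothetical name, not canu code.

```python
def parse_range(spec):
    """Parse a canu-style resource range like '4-8' (or a bare '1') into (lo, hi)."""
    lo, sep, hi = spec.partition("-")
    lo = float(lo)
    hi = float(hi) if sep else lo   # a bare value means lo == hi
    return (lo, hi)
```

#  A scheduler can then clamp the range against the machine's actual memory or core count and
#  pick any value inside it.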
    if      (getGlobal("genomeSize") < adjustGenomeSize("300m")) {
        setGlobalIfUndef("ovsMethod", "sequential");
        setGlobalIfUndef("ovbMemory", "2-4");   setGlobalIfUndef("ovbThreads", "1");
        setGlobalIfUndef("ovsMemory", "2-8");   setGlobalIfUndef("ovsThreads", "1");

    } elsif (getGlobal("genomeSize") < adjustGenomeSize("1g")) {
        setGlobalIfUndef("ovsMethod", "parallel");
        setGlobalIfUndef("ovbMemory", "2-4");   setGlobalIfUndef("ovbThreads", "1");
        setGlobalIfUndef("ovsMemory", "4-16");  setGlobalIfUndef("ovsThreads", "1");

    } else {
        setGlobalIfUndef("ovsMethod", "parallel");
        setGlobalIfUndef("ovbMemory", "2-4");   setGlobalIfUndef("ovbThreads", "1");
        setGlobalIfUndef("ovsMemory", "4-32");  setGlobalIfUndef("ovsThreads", "1");
    }

    #  Correction and consensus are somewhat invariant.

    if      (getGlobal("genomeSize") < adjustGenomeSize("40m")) {
        setGlobalIfUndef("cnsMemory", "8-32");     setGlobalIfUndef("cnsThreads", "1-4");
        setGlobalIfUndef("corMemory", "6-16");     setGlobalIfUndef("corThreads", "1-2");
        setGlobalIfUndef("cnsPartitions", "8");    setGlobalIfUndef("cnsPartitionMin", "15000");
        setGlobalIfUndef("corPartitions", "256");  setGlobalIfUndef("corPartitionMin", "5000");

    } elsif (getGlobal("genomeSize") < adjustGenomeSize("1g")) {
        setGlobalIfUndef("cnsMemory", "16-48");    setGlobalIfUndef("cnsThreads", "2-8");
        setGlobalIfUndef("corMemory", "6-20");     setGlobalIfUndef("corThreads", "2-4");
        setGlobalIfUndef("cnsPartitions", "64");   setGlobalIfUndef("cnsPartitionMin", "20000");
        setGlobalIfUndef("corPartitions", "512");  setGlobalIfUndef("corPartitionMin", "10000");

    } else {
        setGlobalIfUndef("cnsMemory", "64-128");   setGlobalIfUndef("cnsThreads", "2-8");
        setGlobalIfUndef("corMemory", "10-32");    setGlobalIfUndef("corThreads", "2-4");
        setGlobalIfUndef("cnsPartitions", "256");  setGlobalIfUndef("cnsPartitionMin", "25000");
        setGlobalIfUndef("corPartitions", "1024"); setGlobalIfUndef("corPartitionMin", "15000");
    }

    #  Meryl too, basically just small or big.  This should really be using the number of bases
    #  reported from gatekeeper.

    if      (getGlobal("genomeSize") < adjustGenomeSize("100m")) {
        setGlobalIfUndef("merylMemory", "4-8");     setGlobalIfUndef("merylThreads", "1-4");
    } elsif (getGlobal("genomeSize") < adjustGenomeSize("1g")) {
        setGlobalIfUndef("merylMemory", "16-64");   setGlobalIfUndef("merylThreads", "1-16");
    } else {
        setGlobalIfUndef("merylMemory", "64-256");  setGlobalIfUndef("merylThreads", "1-32");
    }

    #  Overlap error adjustment
    #
    #  Configuration is primarily done through memory size.  This blows up when there are many
    #  short(er) reads and large memory machines are available.
    #
    #  The limit is arbitrary.
    #
    #  On medicago, with 740,000 reads (median length ~1,500bp), this will result in about 150 jobs.
    #  The memory-only limit generated only 7 jobs.
    #
    #  On drosophila, with 270,000 reads (median length ~17,000bp), this will result in about 50 jobs.
    #  The memory-only limit generated 36 jobs.

    setGlobalIfUndef("redBatchSize", "5000");    setGlobalIfUndef("redBatchLength", "");
    setGlobalIfUndef("oeaBatchSize", "25000");   setGlobalIfUndef("oeaBatchLength", "");

    if      (getGlobal("genomeSize") < adjustGenomeSize("40m")) {
        setGlobalIfUndef("redMemory", "1-2");    setGlobalIfUndef("redThreads", "1-4");
        setGlobalIfUndef("oeaMemory", "1");      setGlobalIfUndef("oeaThreads", "1");

    } elsif (getGlobal("genomeSize") < adjustGenomeSize("500m")) {
        setGlobalIfUndef("redMemory", "2-6");    setGlobalIfUndef("redThreads", "1-6");
        setGlobalIfUndef("oeaMemory", "2");      setGlobalIfUndef("oeaThreads", "1");

    } elsif (getGlobal("genomeSize") < adjustGenomeSize("2g")) {
        setGlobalIfUndef("redMemory", "2-8");    setGlobalIfUndef("redThreads", "1-8");
        setGlobalIfUndef("oeaMemory", "2");      setGlobalIfUndef("oeaThreads", "1");

    } elsif (getGlobal("genomeSize") < adjustGenomeSize("5g")) {
        setGlobalIfUndef("redMemory", "2-16");   setGlobalIfUndef("redThreads", "1-8");
        setGlobalIfUndef("oeaMemory", "4");      setGlobalIfUndef("oeaThreads", "1");

    } else {
        setGlobalIfUndef("redMemory", "2-16");   setGlobalIfUndef("redThreads", "1-8");
        setGlobalIfUndef("oeaMemory", "4");      setGlobalIfUndef("oeaThreads", "1");
    }

    #  And bogart and GFA alignment/processing.
    #
    #  GFA for genomes less than 40m is run in the canu process itself.

    if      (getGlobal("genomeSize") < adjustGenomeSize("40m")) {
        setGlobalIfUndef("batMemory", "2-16");      setGlobalIfUndef("batThreads", "1-4");
        setGlobalIfUndef("gfaMemory", "2-8");       setGlobalIfUndef("gfaThreads", "1-4");

    } elsif (getGlobal("genomeSize") < adjustGenomeSize("500m")) {
        setGlobalIfUndef("batMemory", "16-64");     setGlobalIfUndef("batThreads", "2-8");
        setGlobalIfUndef("gfaMemory", "4-8");       setGlobalIfUndef("gfaThreads", "2-8");

    } elsif (getGlobal("genomeSize") < adjustGenomeSize("2g")) {
        setGlobalIfUndef("batMemory", "32-256");    setGlobalIfUndef("batThreads", "4-16");
        setGlobalIfUndef("gfaMemory", "8-16");      setGlobalIfUndef("gfaThreads", "4-16");

    } elsif (getGlobal("genomeSize") < adjustGenomeSize("5g")) {
        setGlobalIfUndef("batMemory", "128-512");   setGlobalIfUndef("batThreads", "8-32");
        setGlobalIfUndef("gfaMemory", "16-32");     setGlobalIfUndef("gfaThreads", "8-32");

    } else {
        setGlobalIfUndef("batMemory", "256-1024");  setGlobalIfUndef("batThreads", "16-64");
        setGlobalIfUndef("gfaMemory", "32-64");     setGlobalIfUndef("gfaThreads", "16-64");
    }

    #  Finally, use all that setup to pick actual values for each component.
    #
    #  ovsMemory needs to be configured here iff the sequential build method is used.  This runs in
    #  the canu process, and needs to have a single memory size.  The parallel method will pick a
    #  memory size based on the number of overlaps and submit jobs using that size.
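#  All of the genome-size branches above follow the same pattern: walk a list of size thresholds
#  and take the defaults of the first tier the genome fits under.  The following Python sketch is
#  illustrative only (the tier table copies the bogart/bat values from the code above, but the
#  `TIERS`/`defaults_for` names are hypothetical, not part of canu):

```python
# Thresholds in bases; the final tier is unbounded.
TIERS = [
    (40e6,         {"batMemory": "2-16",     "batThreads": "1-4"}),
    (500e6,        {"batMemory": "16-64",    "batThreads": "2-8"}),
    (2e9,          {"batMemory": "32-256",   "batThreads": "4-16"}),
    (5e9,          {"batMemory": "128-512",  "batThreads": "8-32"}),
    (float("inf"), {"batMemory": "256-1024", "batThreads": "16-64"}),
]

def defaults_for(genome_size):
    """Return the defaults of the first tier the genome size falls under."""
    for limit, settings in TIERS:
        if genome_size < limit:
            return settings
```

#  Keeping the thresholds in one ordered table, rather than an if/elsif chain per parameter,
#  makes it easy to see that every parameter is covered by every tier.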
    my $err;
    my $all;

    ($err, $all) = getAllowedResources("",    "meryl", $err, $all);

    ($err, $all) = getAllowedResources("cor", "mhap", $err, $all)   if (getGlobal("corOverlapper") eq "mhap");
    ($err, $all) = getAllowedResources("cor", "mmap", $err, $all)   if (getGlobal("corOverlapper") eq "minimap");
    ($err, $all) = getAllowedResources("cor", "ovl",  $err, $all)   if (getGlobal("corOverlapper") eq "ovl");

    ($err, $all) = getAllowedResources("obt", "mhap", $err, $all)   if (getGlobal("obtOverlapper") eq "mhap");
    ($err, $all) = getAllowedResources("obt", "mmap", $err, $all)   if (getGlobal("obtOverlapper") eq "minimap");
    ($err, $all) = getAllowedResources("obt", "ovl",  $err, $all)   if (getGlobal("obtOverlapper") eq "ovl");

    ($err, $all) = getAllowedResources("utg", "mhap", $err, $all)   if (getGlobal("utgOverlapper") eq "mhap");
    ($err, $all) = getAllowedResources("utg", "mmap", $err, $all)   if (getGlobal("utgOverlapper") eq "minimap");
    ($err, $all) = getAllowedResources("utg", "ovl",  $err, $all)   if (getGlobal("utgOverlapper") eq "ovl");

    ($err, $all) = getAllowedResources("", "cor", $err, $all);
    ($err, $all) = getAllowedResources("", "ovb", $err, $all);
    ($err, $all) = getAllowedResources("", "ovs", $err, $all);
    ($err, $all) = getAllowedResources("", "red", $err, $all);
    ($err, $all) = getAllowedResources("", "oea", $err, $all);
    ($err, $all) = getAllowedResources("", "bat", $err, $all);
    ($err, $all) = getAllowedResources("", "cns", $err, $all);
    ($err, $all) = getAllowedResources("", "gfa", $err, $all);

    #  Check some minimums.

    if ((getGlobal("ovsMemory") =~ m/^([0123456789.]+)-*[0123456789.]*$/) && ($1 < 0.25)) {
        caExit("ovsMemory must be at least 0.25g or 256m", undef);
    }

    #  2017-02-21 -- not sure why $err is being reported here if it doesn't stop.  What's in it?

    print STDERR "--\n"   if (defined($err));
    print STDERR $err     if (defined($err));
    print STDERR "--\n";
    print STDERR $all;
}
canu-1.6/src/pipelines/canu/Consensus.pm

###############################################################################
#
#  This file is part of canu, a software program that assembles whole-genome
#  sequencing reads into contigs.
#
#  This software is based on:
#    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
#    the 'kmer package' (http://kmer.sourceforge.net)
#  both originally distributed by Applera Corporation under the GNU General
#  Public License, version 2.
#
#  Canu branched from Celera Assembler at its revision 4587.
#  Canu branched from the kmer project at its revision 1994.
#
#  This file is derived from:
#
#    src/pipelines/ca3g/Consensus.pm
#
#  Modifications by:
#
#    Brian P. Walenz from 2015-MAR-06 to 2015-AUG-25
#      are Copyright 2015 Battelle National Biodefense Institute, and
#      are subject to the BSD 3-Clause License
#
#    Brian P. Walenz beginning on 2015-NOV-03
#      are a 'United States Government Work', and
#      are released in the public domain
#
#    Sergey Koren beginning on 2015-DEC-16
#      are a 'United States Government Work', and
#      are released in the public domain
#
#  File 'README.licenses' in the root directory of this distribution contains
#  full conditions and disclaimers for each license.
##

package canu::Consensus;

require Exporter;

@ISA    = qw(Exporter);
@EXPORT = qw(consensusConfigure consensusCheck consensusLoad consensusAnalyze alignGFA);

use strict;

use File::Path 2.08 qw(make_path remove_tree);

use canu::Defaults;
use canu::Execution;
use canu::Gatekeeper;
use canu::Unitig;
use canu::HTML;
use canu::Grid_Cloud;


sub utgcns ($$$) {
    my $asm     = shift @_;
    my $ctgjobs = shift @_;
    my $utgjobs = shift @_;

    my $jobs = $ctgjobs + $utgjobs;

    my $path = "unitigging/5-consensus";

    open(F, "> $path/consensus.sh") or caExit("can't open '$path/consensus.sh' for writing: $!", undef);

    print F "#!" . getGlobal("shell") . "\n";
    print F "\n";
    print F setWorkDirectoryShellCode($path);
    print F "\n";
    print F getJobIDShellCode();
    print F "\n";
    print F getBinDirectoryShellCode();
    print F "\n";
    print F "if [ \$jobid -gt $jobs ]; then\n";
    print F "  echo Error: Only $jobs partitions, you asked for \$jobid.\n";
    print F "  exit 1\n";
    print F "fi\n";
    print F "\n";
    print F "if [ \$jobid -le $ctgjobs ] ; then\n";
    print F "  tag=\"ctg\"\n";
    print F "else\n";
    print F "  tag=\"utg\"\n";
    print F "  jobid=`expr \$jobid - $ctgjobs`\n";
    print F "fi\n";
    print F "\n";
    print F "jobid=`printf %04d \$jobid`\n";
    print F "\n";
    print F "if [ ! -d ./\${tag}cns ] ; then\n";
    print F "  mkdir -p ./\${tag}cns\n";
    print F "fi\n";
    print F "\n";
    print F "if [ -e ./\${tag}cns/\$jobid.cns ] ; then\n";
    print F "  exit 0\n";
    print F "fi\n";
    print F "\n";
    print F fetchFileShellCode("unitigging/$asm.\${tag}Store", "seqDB.v001.dat", "");
    print F fetchFileShellCode("unitigging/$asm.\${tag}Store", "seqDB.v001.tig", "");
    print F "\n";
    print F fetchStoreShellCode("unitigging/$asm.\${tag}Store/partitionedReads.gkpStore", $path, "");
    print F "\n";
    print F "\$bin/utgcns \\\n";
    print F "  -G ../$asm.\${tag}Store/partitionedReads.gkpStore \\\n";   #  Optional; utgcns will default to this
    print F "  -T ../$asm.\${tag}Store 1 \$jobid \\\n";
    print F "  -O ./\${tag}cns/\$jobid.cns.WORKING \\\n";
    print F "  -maxcoverage " . getGlobal('cnsMaxCoverage') . " \\\n";
    print F "  -e " . getGlobal("cnsErrorRate") . " \\\n";
    print F "  -quick \\\n"      if (getGlobal("cnsConsensus") eq "quick");
    print F "  -pbdagcon \\\n"   if (getGlobal("cnsConsensus") eq "pbdagcon");
    print F "  -edlib \\\n"      if (getGlobal("canuIteration") >= 0);
    print F "  -utgcns \\\n"     if (getGlobal("cnsConsensus") eq "utgcns");
    print F "  -threads " . getGlobal("cnsThreads") . " \\\n";
    print F "&& \\\n";
    print F "mv ./\${tag}cns/\$jobid.cns.WORKING ./\${tag}cns/\$jobid.cns \\\n";
    print F "\n";
    print F stashFileShellCode("unitigging/5-consensus", "\${tag}cns/\$jobid.cns", "");
    print F "\n";
    print F "exit 0\n";

    if (getGlobal("canuIteration") < 0) {
        print STDERR "-- Using fast alignment for consensus (iteration '", getGlobal("canuIteration"), "').\n";
    } else {
        print STDERR "-- Using slow alignment for consensus (iteration '", getGlobal("canuIteration"), "').\n";
    }

    close(F);

    makeExecutable("$path/consensus.sh");
    stashFile("$path/consensus.sh");
}


sub cleanupPartitions ($$) {
    my $asm = shift @_;
    my $tag = shift @_;

    return   if (! -e "unitigging/$asm.${tag}Store/partitionedReads.gkpStore/partitions/map");

    my $gkpTime = -M "unitigging/$asm.${tag}Store/partitionedReads.gkpStore/partitions/map";
    my $tigTime = -M "unitigging/$asm.ctgStore/seqDB.v001.tig";

    return   if ($gkpTime <= $tigTime);

    print STDERR "-- Partitioned gkpStore is older than tigs, rebuild partitioning (gkpStore $gkpTime days old; ctgStore $tigTime days old).\n";

    remove_tree("unitigging/$asm.${tag}Store/partitionedReads.gkpStore");
}


sub partitionReads ($$) {
    my $asm = shift @_;
    my $tag = shift @_;
    my $bin = getBinDirectory();
    my $cmd;

    return   if (-e "unitigging/$asm.${tag}Store/partitionedReads.gkpStore/partitions/map");
    return   if (fileExists("unitigging/$asm.${tag}Store/partitionedReads.gkpStore.tar"));

    fetchStore("unitigging/$asm.gkpStore");

    fetchFile("unitigging/$asm.${tag}Store/seqDB.v001.dat");
    fetchFile("unitigging/$asm.${tag}Store/seqDB.v001.tig");

    $cmd  = "$bin/gatekeeperPartition \\\n";
    $cmd .= "  -G ./$asm.gkpStore \\\n";
    $cmd .= "  -T ./$asm.${tag}Store 1 \\\n";
    $cmd .= "  -b " . getGlobal("cnsPartitionMin") . " \\\n"   if (defined(getGlobal("cnsPartitionMin")));
    $cmd .= "  -p " . getGlobal("cnsPartitions") . " \\\n"     if (defined(getGlobal("cnsPartitions")));
    $cmd .= "> ./$asm.${tag}Store/partitionedReads.log 2>&1";

    if (runCommand("unitigging", $cmd)) {
        caExit("failed to partition the reads", "unitigging/$asm.${tag}Store/partitionedReads.log");
    }

    stashStore("unitigging/$asm.${tag}Store/partitionedReads.gkpStore");
    stashFile ("unitigging/$asm.${tag}Store/partitionedReads.log");
}


sub computeNumberOfConsensusJobs ($$) {
    my $asm  = shift @_;
    my $tag  = shift @_;
    my $jobs = "0001";
    my $bin  = getBinDirectory();

    fetchFile("unitigging/$asm.${tag}Store/partitionedReads.log");

    open(F, "< unitigging/$asm.${tag}Store/partitionedReads.log") or caExit("can't open 'unitigging/$asm.${tag}Store/partitionedReads.log' for reading: $!", undef);
    while (<F>) {
        #if (m/^partition (\d+) has \d+ reads$/) {
        #    $jobs = $1;
        #}
        if (m/^Found \d+ unpartitioned reads and maximum partition of (\d+)$/) {
            $jobs = $1;
        }
    }
    close(F);

    return($jobs);
}


sub consensusConfigure ($) {
    my $asm  = shift @_;
    my $bin  = getBinDirectory();
    my $cmd;
    my $path = "unitigging/5-consensus";

    goto allDone   if (skipStage($asm, "consensusConfigure") == 1);
    goto allDone   if ((fileExists("unitigging/$asm.ctgStore/seqDB.v002.tig")) &&
                       (fileExists("unitigging/$asm.utgStore/seqDB.v002.tig")));

    make_path($path)   if (! -d $path);

    #  If the gkpStore partitions are older than the ctgStore unitig output, assume the unitigs have
    #  changed and remove the gkpStore partition.  -M is (annoyingly) 'file age', so we need to
    #  rebuild if gkp is older (larger) than tig.

    cleanupPartitions($asm, "ctg");
    cleanupPartitions($asm, "utg");

    #  Partition gkpStore if needed.  Yeah, we could create both at the same time, with significant
    #  effort in coding it up.

    partitionReads($asm, "ctg");
    partitionReads($asm, "utg");

    #  Set up the consensus compute.  It's in a useless if chain because there used to be
    #  different executables; now they're all rolled into utgcns itself.
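#  The job count in computeNumberOfConsensusJobs above comes from scanning the
#  gatekeeperPartition log for its summary line.  An illustrative Python sketch of that scan
#  (`count_jobs` is a hypothetical name; note the Perl keeps the count as a zero-padded string,
#  "0001" by default, while this sketch uses a plain integer):

```python
import re

def count_jobs(log_lines):
    """Return the maximum partition number reported in the log, defaulting to 1."""
    jobs = 1
    for line in log_lines:
        m = re.match(r'^Found \d+ unpartitioned reads and maximum partition of (\d+)$', line)
        if m:
            jobs = int(m.group(1))
    return jobs
```

#  Anchoring the pattern at both ends keeps stray log chatter from being mistaken for the
#  summary line.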
    my $ctgjobs = computeNumberOfConsensusJobs($asm, "ctg");
    my $utgjobs = computeNumberOfConsensusJobs($asm, "utg");

    #  This configure is an odd-ball.  Unlike all the other places that write scripts,
    #  we'll rewrite this one every time, so that we can change the alignment algorithm
    #  on the second attempt.

    my $firstTime = (! -e "$path/consensus.sh");

    if ((getGlobal("cnsConsensus") eq "quick") ||
        (getGlobal("cnsConsensus") eq "pbdagcon") ||
        (getGlobal("cnsConsensus") eq "utgcns")) {
        utgcns($asm, $ctgjobs, $utgjobs);

    } else {
        caFailure("unknown consensus style '" . getGlobal("cnsConsensus") . "'", undef);
    }

    print STDERR "-- Configured $ctgjobs contig and $utgjobs unitig consensus jobs.\n";

  finishStage:
    emitStage($asm, "consensusConfigure")   if ($firstTime);
    buildHTML($asm, "utg");

  allDone:
    stopAfter("consensusConfigure");
}


#  Checks that all consensus jobs are complete, loads them into the store.
#
sub consensusCheck ($) {
    my $asm     = shift @_;
    my $attempt = getGlobal("canuIteration");
    my $path    = "unitigging/5-consensus";

    goto allDone   if (skipStage($asm, "consensusCheck", $attempt) == 1);
    goto allDone   if ((fileExists("$path/ctgcns.files")) &&
                       (fileExists("$path/utgcns.files")));
    goto allDone   if (fileExists("unitigging/$asm.ctgStore/seqDB.v002.tig"));

    fetchFile("$path/consensus.sh");

    #  Figure out if all the tasks finished correctly.

    my $ctgjobs = computeNumberOfConsensusJobs($asm, "ctg");
    my $utgjobs = computeNumberOfConsensusJobs($asm, "utg");
    my $jobs    = $ctgjobs + $utgjobs;

    caExit("no consensus jobs found?", undef)   if ($jobs == 0);

    my $currentJobID = "0001";
    my $tag          = "ctgcns";

    my @ctgSuccessJobs;
    my @utgSuccessJobs;
    my @failedJobs;
    my $failureMessage = "";

    for (my $job=1; $job <= $jobs; $job++) {
        if      (fileExists("$path/$tag/$currentJobID.cns")) {
            push @ctgSuccessJobs, "5-consensus/$tag/$currentJobID.cns\n"       if ($tag eq "ctgcns");
            push @utgSuccessJobs, "5-consensus/$tag/$currentJobID.cns\n"       if ($tag eq "utgcns");

        } elsif (fileExists("$path/$tag/$currentJobID.cns.gz")) {
            push @ctgSuccessJobs, "5-consensus/$tag/$currentJobID.cns.gz\n"    if ($tag eq "ctgcns");
            push @utgSuccessJobs, "5-consensus/$tag/$currentJobID.cns.gz\n"    if ($tag eq "utgcns");

        } elsif (fileExists("$path/$tag/$currentJobID.cns.bz2")) {
            push @ctgSuccessJobs, "5-consensus/$tag/$currentJobID.cns.bz2\n"   if ($tag eq "ctgcns");
            push @utgSuccessJobs, "5-consensus/$tag/$currentJobID.cns.bz2\n"   if ($tag eq "utgcns");

        } elsif (fileExists("$path/$tag/$currentJobID.cns.xz")) {
            push @ctgSuccessJobs, "5-consensus/$tag/$currentJobID.cns.xz\n"    if ($tag eq "ctgcns");
            push @utgSuccessJobs, "5-consensus/$tag/$currentJobID.cns.xz\n"    if ($tag eq "utgcns");

        } else {
            $failureMessage .= "--   job $tag/$currentJobID.cns FAILED.\n";
            push @failedJobs, $job;
        }

        $currentJobID++;

        $currentJobID = "0001"    if ($job == $ctgjobs);   #  Reset for first utg job.
        $tag          = "utgcns"  if ($job == $ctgjobs);
    }

    #  Failed jobs, retry.

    if (scalar(@failedJobs) > 0) {

        #  If too many attempts, give up.

        if ($attempt >= getGlobal("canuIterationMax")) {
            print STDERR "--\n";
            print STDERR "-- Consensus jobs failed, tried $attempt times, giving up.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
            caExit(undef, undef);
        }

        if ($attempt > 0) {
            print STDERR "--\n";
            print STDERR "-- Consensus jobs failed, retry.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
        }

        #  Otherwise, run some jobs.

        emitStage($asm, "consensusCheck", $attempt);
        buildHTML($asm, "utg");

        submitOrRunParallelJob($asm, "cns", $path, "consensus", @failedJobs);
        return;
    }

  finishStage:
    print STDERR "-- All ", scalar(@ctgSuccessJobs) + scalar(@utgSuccessJobs), " consensus jobs finished successfully.\n";

    open(L, "> $path/ctgcns.files") or caExit("can't open '$path/ctgcns.files' for writing: $!", undef);
    print L @ctgSuccessJobs;
    close(L);

    stashFile("$path/ctgcns.files");

    open(L, "> $path/utgcns.files") or caExit("can't open '$path/utgcns.files' for writing: $!", undef);
    print L @utgSuccessJobs;
    close(L);

    stashFile("$path/utgcns.files");

    emitStage($asm, "consensusCheck");
    buildHTML($asm, "utg");

  allDone:
}


sub purgeFiles ($$$$$$) {
    my $asm     = shift @_;
    my $tag     = shift @_;
    my $Ncns    = shift @_;
    my $Nfastq  = shift @_;
    my $Nlayout = shift @_;
    my $Nlog    = shift @_;

    remove_tree("unitigging/$asm.ctgStore/partitionedReads.gkpStore");   #  The partitioned gkpStores
    remove_tree("unitigging/$asm.utgStore/partitionedReads.gkpStore");   #  are useless now.  Bye bye!

    unlink "unitigging/$asm.ctgStore/partitionedReads.log";
    unlink "unitigging/$asm.utgStore/partitionedReads.log";

    my $path = "unitigging/5-consensus";

    open(F, "< $path/$tag.files") or caExit("can't open '$path/$tag.files' for reading: $!\n", undef);
    while (<F>) {
        chomp;

        if (m/^(.*)\/0*(\d+).cns$/) {
            my $ID6 = substr("00000" . $2, -6);
            my $ID4 = substr("000"   . $2, -4);
            my $ID0 = $2;

            if (-e "unitigging/$1/$ID4.cns") {
                $Ncns++;
                unlink "unitigging/$1/$ID4.cns";
            }
            if (-e "unitigging/$1/$ID4.fastq") {
                $Nfastq++;
                unlink "unitigging/$1/$ID4.fastq";
            }
            if (-e "unitigging/$1/$ID4.layout") {
                $Nlayout++;
                unlink "unitigging/$1/$ID4.layout";
            }
            if (-e "unitigging/$1/consensus.$ID6.out") {
                $Nlog++;
                unlink "unitigging/$1/consensus.$ID6.out";
            }
            if (-e "unitigging/$1/consensus.$ID0.out") {
                $Nlog++;
                unlink "unitigging/$1/consensus.$ID0.out";
            }

        } else {
            caExit("unknown consensus job name '$_'\n", undef);
        }
    }
    close(F);

    unlink "$path/$tag.files";
    rmdir  "$path/$tag";

    return($Ncns, $Nfastq, $Nlayout, $Nlog);
}


sub consensusLoad ($) {
    my $asm  = shift @_;
    my $bin  = getBinDirectory();
    my $cmd;
    my $path = "unitigging/5-consensus";

    goto allDone   if (skipStage($asm, "consensusLoad") == 1);
    goto allDone   if ((fileExists("unitigging/$asm.ctgStore/seqDB.v002.tig")) &&
                       (fileExists("unitigging/$asm.utgStore/seqDB.v002.tig")));

    #  Expects to have a list of output files from the consensusCheck() function.

    fetchFile("$path/ctgcns.files");
    fetchFile("$path/utgcns.files");

    caExit("can't find '$path/ctgcns.files' for loading tigs into store: $!", undef)   if (! -e "$path/ctgcns.files");
    caExit("can't find '$path/utgcns.files' for loading tigs into store: $!", undef)   if (! -e "$path/utgcns.files");

    #  Now just load them.

    if (! fileExists("unitigging/$asm.ctgStore/seqDB.v002.tig")) {
        fetchFile("unitigging/$asm.ctgStore/seqDB.v001.dat");
        fetchFile("unitigging/$asm.ctgStore/seqDB.v001.tig");

        open(F, "< $path/ctgcns.files");
        while (<F>) {
            chomp;
            fetchFile("unitigging/$_");
        }
        close(F);

        $cmd  = "$bin/tgStoreLoad \\\n";
        $cmd .= "  -G ./$asm.gkpStore \\\n";
        $cmd .= "  -T ./$asm.ctgStore 2 \\\n";
        $cmd .= "  -L ./5-consensus/ctgcns.files \\\n";
        $cmd .= "> ./5-consensus/ctgcns.files.ctgStoreLoad.err 2>&1";

        if (runCommand("unitigging", $cmd)) {
            caExit("failed to load unitig consensus into ctgStore", "$path/ctgcns.files.ctgStoreLoad.err");
        }

        unlink "$path/ctgcns.files.ctgStoreLoad.err";

        stashFile("unitigging/$asm.ctgStore/seqDB.v002.dat");
        stashFile("unitigging/$asm.ctgStore/seqDB.v002.tig");
    }

    if (! fileExists("unitigging/$asm.utgStore/seqDB.v002.tig")) {
        fetchFile("unitigging/$asm.utgStore/seqDB.v001.dat");
        fetchFile("unitigging/$asm.utgStore/seqDB.v001.tig");

        open(F, "< $path/utgcns.files");
        while (<F>) {
            chomp;
            fetchFile("unitigging/$_");
        }
        close(F);

        $cmd  = "$bin/tgStoreLoad \\\n";
        $cmd .= "  -G ./$asm.gkpStore \\\n";
        $cmd .= "  -T ./$asm.utgStore 2 \\\n";
        $cmd .= "  -L ./5-consensus/utgcns.files \\\n";
        $cmd .= "> ./5-consensus/utgcns.files.utgStoreLoad.err 2>&1";

        if (runCommand("unitigging", $cmd)) {
            caExit("failed to load unitig consensus into utgStore", "$path/utgcns.files.utgStoreLoad.err");
        }

        unlink "$path/utgcns.files.utgStoreLoad.err";

        stashFile("unitigging/$asm.utgStore/seqDB.v002.dat");
        stashFile("unitigging/$asm.utgStore/seqDB.v002.tig");
    }

    #  Remove consensus outputs.

    if ((-e "$path/ctgcns.files") ||
        (-e "$path/utgcns.files")) {
        print STDERR "-- Purging consensus output after loading to ctgStore and/or utgStore.\n";

        my $Ncns    = 0;
        my $Nfastq  = 0;
        my $Nlayout = 0;
        my $Nlog    = 0;

        ($Ncns, $Nfastq, $Nlayout, $Nlog) = purgeFiles($asm, "ctgcns", $Ncns, $Nfastq, $Nlayout, $Nlog);
        ($Ncns, $Nfastq, $Nlayout, $Nlog) = purgeFiles($asm, "utgcns", $Ncns, $Nfastq, $Nlayout, $Nlog);

        print STDERR "-- Purged $Ncns .cns outputs.\n"        if ($Ncns > 0);
        print STDERR "-- Purged $Nfastq .fastq outputs.\n"    if ($Nfastq > 0);
        print STDERR "-- Purged $Nlayout .layout outputs.\n"  if ($Nlayout > 0);
        print STDERR "-- Purged $Nlog .err log outputs.\n"    if ($Nlog > 0);
    }

    reportUnitigSizes($asm, 2, "after consensus generation");

  finishStage:
    emitStage($asm, "consensusLoad");
    buildHTML($asm, "utg");

  allDone:
}


sub consensusAnalyze ($) {
    my $asm = shift @_;
    my $bin = getBinDirectory();
    my $cmd;

    goto allDone   if (skipStage($asm, "consensusAnalyze") == 1);
    goto allDone   if (fileExists("unitigging/$asm.ctgStore.coverageStat.log"));

    fetchStore("unitigging/$asm.gkpStore");

    fetchFile("unitigging/$asm.ctgStore/seqDB.v001.dat");   #  Shouldn't need this, right?
    fetchFile("unitigging/$asm.ctgStore/seqDB.v001.tig");   #  So why does it?

    fetchFile("unitigging/$asm.ctgStore/seqDB.v002.dat");
    fetchFile("unitigging/$asm.ctgStore/seqDB.v002.tig");

    $cmd  = "$bin/tgStoreCoverageStat \\\n";
    $cmd .= "  -G ./$asm.gkpStore \\\n";
    $cmd .= "  -T ./$asm.ctgStore 2 \\\n";
    $cmd .= "  -s " . getGlobal("genomeSize") . " \\\n";
    $cmd .= "  -o ./$asm.ctgStore.coverageStat \\\n";
    $cmd .= "> ./$asm.ctgStore.coverageStat.err 2>&1";

    if (runCommand("unitigging", $cmd)) {
        caExit("failed to compute coverage statistics", "unitigging/$asm.ctgStore.coverageStat.err");
    }

    unlink "unitigging/$asm.ctgStore.coverageStat.err";

    stashFile("unitigging/$asm.ctgStore.coverageStat.stats");
    stashFile("unitigging/$asm.ctgStore.coverageStat.log");

  finishStage:
    emitStage($asm, "consensusAnalyze");
    buildHTML($asm, "utg");

  allDone:
    stopAfter("consensus");
}


sub alignGFA ($) {
    my $asm     = shift @_;
    my $attempt = getGlobal("canuIteration");
    my $path    = "unitigging/4-unitigger";

    #  Decide if this is small enough to run right now, or if we should submit to the grid.

    #my $bin = getBinDirectory();

    #  This is just big enough to not fit comfortably in the canu process itself.
goto allDone if (skipStage($asm, "alignGFA") == 1); goto allDone if (fileExists("unitigging/4-unitigger/$asm.contigs.aligned.gfa") && fileExists("unitigging/4-unitigger/$asm.unitigs.aligned.gfa") && fileExists("unitigging/4-unitigger/$asm.unitigs.aligned.bed")); # If a large genome, run this on the grid, else, run in the canu process itself. my $runGrid = (getGlobal("genomeSize") >= 40000000); fetchFile("$path/alignGFA.sh"); if (! -e "$path/alignGFA.sh") { open(F, "> $path/alignGFA.sh") or caExit("can't open '$path/alignGFA.sh.sh' for writing: $!\n", undef); print F "#!" . getGlobal("shell") . "\n"; print F "\n"; print F getBinDirectoryShellCode(); print F "\n"; print F setWorkDirectoryShellCode($path) if ($runGrid); # If not local, need to cd first. print F "\n"; print F fetchFileShellCode("unitigging/$asm.utgStore", "seqDB.v001.dat", ""); print F fetchFileShellCode("unitigging/$asm.utgStore", "seqDB.v001.tig", ""); print F "\n"; print F fetchFileShellCode("unitigging/$asm.utgStore", "seqDB.v002.dat", ""); print F fetchFileShellCode("unitigging/$asm.utgStore", "seqDB.v002.tig", ""); print F "\n"; print F fetchFileShellCode("unitigging/$asm.ctgStore", "seqDB.v001.dat", ""); print F fetchFileShellCode("unitigging/$asm.ctgStore", "seqDB.v001.tig", ""); print F "\n"; print F fetchFileShellCode("unitigging/$asm.ctgStore", "seqDB.v002.dat", ""); print F fetchFileShellCode("unitigging/$asm.ctgStore", "seqDB.v002.tig", ""); print F "\n"; print F "\n"; print F "if [ ! -e ./$asm.unitigs.aligned.gfa ] ; then\n"; print F " \$bin/alignGFA \\\n"; print F " -T ../$asm.utgStore 2 \\\n"; print F " -i ./$asm.unitigs.gfa \\\n"; print F " -o ./$asm.unitigs.aligned.gfa \\\n"; print F " -t " . getGlobal("gfaThreads") . " \\\n"; print F " > ./$asm.unitigs.aligned.gfa.err 2>&1"; print F "\n"; print F stashFileShellCode("$path", "$asm.unitigs.aligned.gfa", " "); print F "fi\n"; print F "\n"; print F "\n"; print F "if [ ! 
-e ./$asm.contigs.aligned.gfa ] ; then\n"; print F " \$bin/alignGFA \\\n"; print F " -T ../$asm.ctgStore 2 \\\n"; print F " -i ./$asm.contigs.gfa \\\n"; print F " -o ./$asm.contigs.aligned.gfa \\\n"; print F " -t " . getGlobal("gfaThreads") . " \\\n"; print F " > ./$asm.contigs.aligned.gfa.err 2>&1"; print F "\n"; print F stashFileShellCode("$path", "$asm.contigs.aligned.gfa", " "); print F "fi\n"; print F "\n"; print F "\n"; print F "if [ ! -e ./$asm.unitigs.aligned.bed ] ; then\n"; print F " \$bin/alignGFA -bed \\\n"; print F " -T ../$asm.utgStore 2 \\\n"; print F " -C ../$asm.ctgStore 2 \\\n"; print F " -i ./$asm.unitigs.bed \\\n"; print F " -o ./$asm.unitigs.aligned.bed \\\n"; print F " -t " . getGlobal("gfaThreads") . " \\\n"; print F " > ./$asm.unitigs.aligned.bed.err 2>&1"; print F "\n"; print F stashFileShellCode("$path", "$asm.unitigs.aligned.bed", " "); print F "fi\n"; print F "\n"; print F "\n"; print F "if [ -e ./$asm.unitigs.aligned.gfa -a \\\n"; print F " -e ./$asm.contigs.aligned.gfa -a \\\n"; print F " -e ./$asm.unitigs.aligned.bed ] ; then\n"; print F " echo GFA alignments updated.\n"; print F " exit 0\n"; print F "else\n"; print F " echo GFA alignments failed.\n"; print F " exit 1\n"; print F "fi\n"; close(F); makeExecutable("$path/alignGFA.sh"); stashFile("$path/alignGFA.sh"); } # Since there is only one job, if we get here, we're not done. Any other 'check' function # shows how to process multiple jobs. This only checks for the existence of the final outputs. # (meryl and unitig are the same) # If too many attempts, give up. if ($attempt >= getGlobal("canuIterationMax")) { print STDERR "--\n"; print STDERR "-- Graph alignment jobs failed, tried $attempt times, giving up.\n"; print STDERR "--\n"; caExit(undef, undef); } if ($attempt > 0) { print STDERR "--\n"; print STDERR "-- Graph alignment jobs failed, retry.\n"; print STDERR "--\n"; } # Otherwise, run some jobs. 
emitStage($asm, "alignGFA", $attempt); if ($runGrid) { submitOrRunParallelJob($asm, "gfa", $path, "alignGFA", (1)); } else { if (runCommand($path, "./alignGFA.sh")) { caExit("failed to align contigs", "./$asm.contigs.aligned.gfa.err"); } } return; finishStage: emitStage($asm, "alignGFA"); allDone: } canu-1.6/src/pipelines/canu/CorrectReads.pm000066400000000000000000001555111314437614700207530ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # This file is derived from: # # src/pipelines/ca3g/CorrectReads.pm # # Modifications by: # # Brian P. Walenz from 2015-APR-09 to 2015-SEP-03 # are Copyright 2015 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. Walenz beginning on 2015-OCT-19 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2015-NOV-17 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. 
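The evidence-coverage cap computed by getCorCov() in this module accepts either an absolute coverage or a multiplier written as 'Nx' of corOutCoverage; a minimal shell sketch of that rule, with illustrative values:

```shell
des=40        # corOutCoverage (illustrative value)
cov="2x"      # a corMaxEvidenceCoverage* setting (illustrative value)
case "$cov" in
  *x) cov=$(( des * ${cov%x} )) ;;   # trailing 'x': scale the desired coverage
esac
echo "$cov"
```

With these values the cap becomes 40 * 2 = 80; an absolute setting such as `cov=100` would pass through unchanged.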
## package canu::CorrectReads; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(buildCorrectionLayouts generateCorrectedReads dumpCorrectedReads); use strict; use File::Path 2.08 qw(make_path remove_tree); use canu::Defaults; use canu::Execution; use canu::Gatekeeper; use canu::Report; use canu::HTML; use canu::Grid_Cloud; # Returns a coverage: # If $cov is not defined, default to the desired output coverage * 1.0. # Otherwise, if it ends in an 'x', scale the desired output coverage by that factor. # Otherwise, use the coverage as given. # sub getCorCov ($$) { my $asm = shift @_; my $typ = shift @_; my $cov = getGlobal("corMaxEvidenceCoverage$typ"); my $exp = getExpectedCoverage("correction", $asm); my $des = getGlobal("corOutCoverage"); if (!defined($cov)) { $cov = $des; } elsif ($cov =~ m/(.*)x/) { $cov = int($des * $1); } return($cov); } # Query gkpStore to find the read types involved. Return an error rate that is appropriate for # aligning reads of that type to each other. sub getCorErrorRate ($) { my $asm = shift @_; my $bin = getBinDirectory(); my $erate = getGlobal("corErrorRate"); if (defined($erate)) { print STDERR "-- Using overlaps no worse than $erate fraction error for correcting reads (from corErrorRate parameter).\n"; return($erate); } my $numPacBioRaw = 0; my $numPacBioCorrected = 0; my $numNanoporeRaw = 0; my $numNanoporeCorrected = 0; open(L, "< correction/$asm.gkpStore/libraries.txt") or caExit("can't open 'correction/$asm.gkpStore/libraries.txt' for reading: $!", undef); while (<L>) { $numPacBioRaw++ if (m/pacbio-raw/); $numPacBioCorrected++ if (m/pacbio-corrected/); $numNanoporeRaw++ if (m/nanopore-raw/); $numNanoporeCorrected++ if (m/nanopore-corrected/); } close(L); $erate = 0.10; # Default; correction of already-corrected reads was forced. 
$erate = 0.30 if ($numPacBioRaw > 0); $erate = 0.50 if ($numNanoporeRaw > 0); print STDERR "-- Found $numPacBioRaw raw and $numPacBioCorrected corrected PacBio libraries.\n"; print STDERR "-- Found $numNanoporeRaw raw and $numNanoporeCorrected corrected Nanopore libraries.\n"; print STDERR "-- Using overlaps no worse than $erate fraction error for correcting reads.\n"; return($erate); } # Return the number of jobs for 'falcon', 'falconpipe' or 'utgcns' # sub computeNumberOfCorrectionJobs ($) { my $asm = shift @_; my $nJobs = 0; my $nPerJob = 0; my $path = "correction/2-correction"; if (getGlobal("corConsensus") eq "falcon" && -e "$path/correction_inputs" ) { open(F, "ls $path/correction_inputs/ |") or caExit("can't find list of correction_inputs: $!", undef); while (<F>) { $nJobs++ if (m/^\d\d\d\d$/); } close(F); return($nJobs, undef); } if ((getGlobal("corConsensus") eq "utgcns") || (getGlobal("corConsensus") eq "falconpipe") || (getGlobal("corConsensus") eq "falcon")) { my $nPart = getGlobal("corPartitions"); my $nReads = getNumberOfReadsInStore("correction", $asm); caExit("didn't find any reads in store 'correction/$asm.gkpStore'?", undef) if ($nReads == 0); $nPerJob = int($nReads / $nPart + 1); $nPerJob = getGlobal("corPartitionMin") if ($nPerJob < getGlobal("corPartitionMin")); for (my $j=1; $j<=$nReads; $j += $nPerJob) { # We could just divide, except for rounding issues.... $nJobs++; } } return($nJobs, $nPerJob); } # Generate a corStore, dump files for falcon to process, generate a script to run falcon. # sub buildCorrectionLayouts_direct ($) { my $asm = shift @_; my $bin = getBinDirectory(); my $cmd; my $base = "correction"; my $path = "correction/2-correction"; # Outer level buildCorrectionLayouts() ensures the task is not finished. my $maxCov = getCorCov($asm, "Local"); if (! 
-e "correction/$asm.corStore") { $cmd = "$bin/generateCorrectionLayouts \\\n"; $cmd .= " -rl ./$asm.readsToCorrect \\\n" if (-e "$path/$asm.readsToCorrect"); $cmd .= " -G ../$asm.gkpStore \\\n"; $cmd .= " -O ../$asm.ovlStore \\\n"; $cmd .= " -S ./$asm.globalScores \\\n" if (-e "$path/$asm.globalScores"); $cmd .= " -T ../$asm.corStore.WORKING \\\n"; $cmd .= " -L " . getGlobal("corMinEvidenceLength") . " \\\n" if (defined(getGlobal("corMinEvidenceLength"))); $cmd .= " -E " . getGlobal("corMaxEvidenceErate") . " \\\n" if (defined(getGlobal("corMaxEvidenceErate"))); $cmd .= " -C $maxCov \\\n" if (defined($maxCov)); $cmd .= " -legacy \\\n" if (defined(getGlobal("corLegacyFilter"))); $cmd .= "> ../$asm.corStore.err 2>&1"; if (runCommand($path, $cmd)) { caExit("failed to generate layouts for correction", "correction/$asm.corStore.err"); } rename "correction/$asm.corStore.WORKING", "correction/$asm.corStore"; } # First, call this function to compute the partitioning. my ($jobs, $nPer) = computeNumberOfCorrectionJobs($asm); make_path("$path/correction_inputs") if (! -d "$path/correction_inputs"); make_path("$path/correction_outputs") if (! -d "$path/correction_outputs"); if (getGlobal("corConsensus") eq "falcon") { $cmd = "$bin/createFalconSenseInputs \\\n"; $cmd .= " -G ../$asm.gkpStore \\\n"; $cmd .= " -T ../$asm.corStore 1 \\\n"; $cmd .= " -o ./correction_inputs/ \\\n"; $cmd .= " -p " . $jobs . " \\\n"; $cmd .= "> ./correction_inputs.err 2>&1"; if (runCommand($path, $cmd)) { caExit("failed to generate falcon inputs", "$path/correction_inputs.err"); } } if (getGlobal("corConsensus") eq "falcon") { # The second call confirms we have the proper number of output files, and sets $jobs. ($jobs, $nPer) = computeNumberOfCorrectionJobs($asm); } #getAllowedResources("", "cor"); open(F, "> $path/correctReads.sh") or caExit("can't open '$path/correctReads.sh'", undef); print F "#!" . getGlobal("shell") . 
"\n"; print F "\n"; print F getBinDirectoryShellCode(); print F "\n"; print F setWorkDirectoryShellCode($path); print F fetchStoreShellCode("$base/$asm.gkpStore", "$base/2-correction", ""); print F "\n"; print F getJobIDShellCode(); print F "\n"; print F "if [ \$jobid -gt $jobs ]; then\n"; print F " echo Error: Only $jobs partitions, you asked for \$jobid.\n"; print F " exit 1\n"; print F "fi\n"; print F "\n"; if (getGlobal("corConsensus") eq "utgcns") { print F "bgn=`expr \\( \$jobid - 1 \\) \\* $nPer`\n"; print F "end=`expr \\( \$jobid + 0 \\) \\* $nPer`\n"; print F "\n"; } print F "jobid=`printf %04d \$jobid`\n"; print F "\n"; print F "if [ -e \"./correction_outputs/\$jobid.fasta\" ] ; then\n"; print F " echo Job finished successfully.\n"; print F " exit 0\n"; print F "fi\n"; print F "\n"; print F "if [ ! -d \"./correction_outputs\" ] ; then\n"; print F " mkdir -p \"./correction_outputs\"\n"; print F "fi\n"; print F "\n"; print F "gkpStore=\"../$asm.gkpStore\"\n"; print F "\n"; my $stageDir = getGlobal("stageDirectory"); if (defined($stageDir)) { print F "if [ ! -d $stageDir ] ; then\n"; print F " mkdir -p $stageDir\n"; print F "fi\n"; print F "\n"; print F "mkdir -p $stageDir/$asm.gkpStore\n"; print F "\n"; print F "echo Start copy at `date`\n"; print F "cp -p \$gkpStore/info $stageDir/$asm.gkpStore/info\n"; print F "cp -p \$gkpStore/libraries $stageDir/$asm.gkpStore/libraries\n"; print F "cp -p \$gkpStore/reads $stageDir/$asm.gkpStore/reads\n"; print F "cp -p \$gkpStore/blobs $stageDir/$asm.gkpStore/blobs\n"; print F "echo Finished at `date`\n"; print F "\n"; print F "gkpStore=\"$stageDir/$asm.gkpStore\"\n"; print F "\n"; } my $erate = getCorErrorRate($asm); my $minidt = 1 - $erate; # UTGCNS for correction is writing FASTQ, but needs to write FASTA. The names below were changed to fasta preemptively. 
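The per-job read-range arithmetic written into correctReads.sh just above can be checked in isolation; the jobid and nPer values here are illustrative stand-ins for the grid task id and the partition size.

```shell
jobid=3    # illustrative job index (the generated script derives this from the grid task id)
nPer=100   # illustrative reads-per-job partition size
bgn=`expr \( $jobid - 1 \) \* $nPer`   # first read id handled by this job
end=`expr \( $jobid + 0 \) \* $nPer`   # last read id handled by this job
echo "$bgn-$end"
```

Job 3 with 100 reads per partition yields the range 200-300, which is the form passed to utgcns as `-u $bgn-$end`.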
if (getGlobal("corConsensus") eq "utgcns") { caExit("UTGCNS for correction is writing FASTQ, but needs to write FASTA", undef); print F "\n"; print F "\$bin/utgcns \\\n"; print F " -u \$bgn-\$end \\\n"; print F " -e $erate \\\n"; print F " -G \$gkpStore \\\n"; print F " -T ../$asm.corStore 1 . \\\n"; print F " -O ./correction_outputs/\$jobid.cns.WORKING \\\n"; print F " -L ./correction_outputs/\$jobid.layout.WORKING \\\n"; print F " -F ./correction_outputs/\$jobid.fasta.WORKING \\\n"; print F "&& \\\n"; print F "mv ./correction_outputs/\$jobid.cns.WORKING ./correction_outputs/\$jobid.cns \\\n"; print F "&& \\\n"; print F "mv ./correction_outputs/\$jobid.layout.WORKING ./correction_outputs/\$jobid.layout \\\n"; print F "&& \\\n"; print F "mv ./correction_outputs/\$jobid.fasta.WORKING ./correction_outputs/\$jobid.fasta \\\n"; print F "\n"; } if (getGlobal("corConsensus") eq "falcon") { print F "\n"; print F "\$bin/falcon_sense \\\n"; print F " --min_idt $minidt \\\n"; print F " --min_len " . getGlobal("minReadLength") . "\\\n"; print F " --max_read_len " . 2 * getMaxReadLengthInStore($base, $asm) . " \\\n"; print F " --min_ovl_len " . getGlobal("minOverlapLength") . "\\\n"; print F " --min_cov " . getGlobal("corMinCoverage") . " \\\n"; print F " --n_core " . getGlobal("corThreads") . " \\\n"; print F " < ./correction_inputs/\$jobid \\\n"; print F " > ./correction_outputs/\$jobid.fasta.WORKING \\\n"; print F " 2> ./correction_outputs/\$jobid.err \\\n"; print F "&& \\\n"; print F "mv ./correction_outputs/\$jobid.fasta.WORKING ./correction_outputs/\$jobid.fasta \\\n"; } if (defined($stageDir)) { print F "\n"; print F "rm -rf $stageDir/$asm.gkpStore\n"; # Prevent accidents of 'rm -rf /' if stageDir = "/". 
print F "rmdir $stageDir\n"; } print F "\n"; print F "exit 0\n"; close(F); makeExecutable("$path/correctReads.sh"); stashFile("$path/correctReads.sh"); finishStage: ; allDone: } # For falcon_sense, using a pipe and no intermediate files # sub buildCorrectionLayouts_piped ($) { my $asm = shift @_; my $bin = getBinDirectory(); my $cmd; my $base = "correction"; my $path = "correction/2-correction"; # Outer level buildCorrectionLayouts() ensures the task is not finished. make_path("$path/correction_inputs") if (! -d "$path/correction_inputs"); make_path("$path/correction_outputs") if (! -d "$path/correction_outputs"); my ($nJobs, $nPerJob) = computeNumberOfCorrectionJobs($asm); # Does math based on number of reads and parameters. my $nReads = getNumberOfReadsInStore("correction", $asm); #getAllowedResources("", "cor"); open(F, "> $path/correctReads.sh") or caExit("can't open '$path/correctReads.sh'", undef); print F "#!" . getGlobal("shell") . "\n"; print F "\n"; print F getBinDirectoryShellCode(); print F "\n"; print F setWorkDirectoryShellCode($path); print F fetchStoreShellCode("$base/$asm.gkpStore", "$base/2-correction", ""); print F "\n"; print F getJobIDShellCode(); print F "\n"; print F "if [ \$jobid -gt $nJobs ]; then\n"; print F " echo Error: Only $nJobs partitions, you asked for \$jobid.\n"; print F " exit 1\n"; print F "fi\n"; print F "\n"; my $bgnID = 1; my $endID = $bgnID + $nPerJob - 1; my $jobID = 1; while ($bgnID < $nReads) { $endID = $bgnID + $nPerJob - 1; $endID = $nReads if ($endID > $nReads); print F "if [ \$jobid -eq $jobID ] ; then\n"; print F " bgn=$bgnID\n"; print F " end=$endID\n"; print F "fi\n"; $bgnID = $endID + 1; $jobID++; } print F "\n"; print F "jobid=`printf %04d \$jobid`\n"; print F "\n"; print F "if [ -e \"./correction_outputs/\$jobid.fasta\" ] ; then\n"; print F " echo Job finished successfully.\n"; print F " exit 0\n"; print F "fi\n"; print F "\n"; print F "if [ ! 
-d \"./correction_outputs\" ] ; then\n"; print F " mkdir -p \"./correction_outputs\"\n"; print F "fi\n"; print F "\n"; print F fetchStoreShellCode("correction/$asm.gkpStore", "correction/3-correction", ""); print F "\n"; print F fetchStoreShellCode("correction/$asm.ovlStore", "correction/3-correction", ""); print F "\n"; print F fetchFileShellCode("correction/2-correction", "$asm.readsToCorrect", ""); print F "\n"; print F fetchFileShellCode("correction/2-correction", "$asm.globalScores", ""); print F "\n"; print F "gkpStore=\"../$asm.gkpStore\"\n"; print F "\n"; my $stageDir = getGlobal("stageDirectory"); if (defined($stageDir)) { print F "if [ ! -d $stageDir ] ; then\n"; print F " mkdir -p $stageDir\n"; print F "fi\n"; print F "\n"; print F "mkdir -p $stageDir/$asm.gkpStore\n"; print F "\n"; print F "echo Start copy at `date`\n"; print F "cp -p \$gkpStore/info $stageDir/$asm.gkpStore/info\n"; print F "cp -p \$gkpStore/libraries $stageDir/$asm.gkpStore/libraries\n"; print F "cp -p \$gkpStore/reads $stageDir/$asm.gkpStore/reads\n"; print F "cp -p \$gkpStore/blobs $stageDir/$asm.gkpStore/blobs\n"; print F "echo Finished at `date`\n"; print F "\n"; print F "gkpStore=\"$stageDir/$asm.gkpStore\"\n"; print F "\n"; } my $maxCov = getCorCov($asm, "Local"); my $erate = getCorErrorRate($asm); my $minidt = 1 - $erate; print F "\n"; print F "if [ \"x\$BASH\" != \"x\" ] ; then\n"; # Needs doublequotes, else shell doesn't expand $BASH print F " set -o pipefail\n"; print F "fi\n"; print F "\n"; print F "( \\\n"; print F "\$bin/generateCorrectionLayouts -b \$bgn -e \$end \\\n"; print F " -rl ./$asm.readsToCorrect \\\n" if (-e "$path/$asm.readsToCorrect"); print F " -G \$gkpStore \\\n"; print F " -O ../$asm.ovlStore \\\n"; print F " -S ./$asm.globalScores \\\n" if (-e "$path/$asm.globalScores"); print F " -L " . getGlobal("corMinEvidenceLength") . " \\\n" if (defined(getGlobal("corMinEvidenceLength"))); print F " -E " . getGlobal("corMaxEvidenceErate") . 
" \\\n" if (defined(getGlobal("corMaxEvidenceErate"))); print F " -C $maxCov \\\n" if (defined($maxCov)); print F " -legacy \\\n" if (defined(getGlobal("corLegacyFilter"))); print F " -F \\\n"; print F "&& \\\n"; print F " touch ./correction_outputs/\$jobid.dump.success \\\n"; print F ") \\\n"; print F "| \\\n"; print F "\$bin/falcon_sense \\\n"; print F " --min_idt $minidt \\\n"; print F " --min_len " . getGlobal("minReadLength") . "\\\n"; print F " --max_read_len " . 2 * getMaxReadLengthInStore($base, $asm) . " \\\n"; print F " --min_ovl_len " . getGlobal("minOverlapLength") . "\\\n"; print F " --min_cov " . getGlobal("corMinCoverage") . " \\\n"; print F " --n_core " . getGlobal("corThreads") . " \\\n"; print F " > ./correction_outputs/\$jobid.fasta.WORKING \\\n"; print F " 2> ./correction_outputs/\$jobid.err \\\n"; print F "&& \\\n"; print F "mv ./correction_outputs/\$jobid.fasta.WORKING ./correction_outputs/\$jobid.fasta \\\n"; print F "\n"; print F "if [ ! -e \"./correction_outputs/\$jobid.dump.success\" ] ; then\n"; print F " echo Read layout generation failed.\n"; print F " mv ./correction_outputs/\$jobid.fasta ./correction_outputs/\$jobid.fasta.INCOMPLETE\n"; print F "fi\n"; print F "\n"; if (defined($stageDir)) { print F "rm -rf $stageDir/$asm.gkpStore\n"; # Prevent accidents of 'rm -rf /' if stageDir = "/". 
print F "rmdir $stageDir\n"; print F "\n"; } print F stashFileShellCode("$path", "correction_outputs/\$jobid.fasta", ""); print F "\n"; print F "exit 0\n"; close(F); makeExecutable("$path/correctReads.sh"); stashFile("$path/correctReads.sh"); finishStage: ; allDone: } sub lengthStats (@) { my @v = sort { $b <=> $a } @_; my $total = 0; my $mean = 0; my $n50 = 0; if (scalar(@v) > 0) { foreach my $v (@v) { $total += $v; $n50 = $v if ($total < getGlobal("genomeSize") / 2); } $mean = int($total / scalar(@v) + 0.5); } return($mean, $n50); } sub quickFilter ($$) { my $asm = shift @_; my $minTotal = shift @_; my $bin = getBinDirectory(); my $path = "correction/2-correction"; my $totCorLengthIn = 0; my $totCorLengthOut = 0; my $minCorLength = 0; open(O, "> $path/$asm.readsToCorrect.WORKING") or caExit("can't open '$path/$asm.readsToCorrect.WORKING' for writing: $!\n", undef); open(F, "$bin/gatekeeperDumpMetaData -G correction/$asm.gkpStore -reads | sort -T . -k3nr | ") or caExit("can't dump gatekeeper for read lengths: $!\n", undef); print O "read\toriginalLength\tcorrectedLength\n"; while (<F>) { my @v = split '\s+', $_; $totCorLengthIn += $v[2]; $totCorLengthOut += $v[2]; print O "$v[0]\t$v[2]\t0\n"; if ($minTotal != 0 && $totCorLengthIn >= $minTotal) { $minCorLength = $v[2]; last; } } close(F); close(O); rename "$path/$asm.readsToCorrect.WORKING", "$path/$asm.readsToCorrect"; stashFile("$path/$asm.readsToCorrect"); } sub expensiveFilter ($) { my $asm = shift @_; my $bin = getBinDirectory(); my $cmd; my $path = "correction/2-correction"; my $minCov = getGlobal("corMinCoverage"); my $maxCov = getCorCov($asm, "Local"); if (! fileExists("$path/$asm.estimate.log")) { print STDERR "-- Computing expected corrected read lengths '$path/$asm.estimate.log'.\n"; $cmd = "$bin/generateCorrectionLayouts \\\n"; $cmd .= " -G ../$asm.gkpStore \\\n"; $cmd .= " -O ../$asm.ovlStore \\\n"; $cmd .= " -S ./$asm.globalScores \\\n" if (-e "$path/$asm.globalScores"); $cmd .= " -L " . 
getGlobal("corMinEvidenceLength") . " \\\n" if (defined(getGlobal("corMinEvidenceLength"))); $cmd .= " -E " . getGlobal("corMaxEvidenceErate") . " \\\n" if (defined(getGlobal("corMaxEvidenceErate"))); $cmd .= " -c $minCov \\\n" if (defined($minCov)); $cmd .= " -C $maxCov \\\n" if (defined($maxCov)); $cmd .= " -legacy \\\n" if (defined(getGlobal("corLegacyFilter"))); $cmd .= " -p ./$asm.estimate.WORKING"; if (runCommand($path, $cmd)) { rename "$path/$asm.estimate.log", "$path/$asm.estimate.log.FAILED"; caExit("failed to generate estimated lengths of corrected reads", "correction/$asm.corStore.err"); } rename "$path/$asm.estimate.WORKING.filter.log", "$path/$asm.estimate"; rename "$path/$asm.estimate.WORKING.summary", "$path/$asm.estimate.stats"; rename "$path/$asm.estimate.WORKING.log", "$path/$asm.estimate.log"; unlink "$path/$asm.estimate.correctedLength.log"; unlink "$path/$asm.estimate.originalLength.log"; } else { print STDERR "-- Expected corrected read lengths found in '$path/$asm.estimate.log'.\n"; } if (! -e "$path/$asm.estimate.correctedLength.log") { print STDERR "-- Sorting reads by expected corrected length.\n"; if (runCommandSilently($path, "sort -T . -k4nr -k2nr < ./$asm.estimate.log > ./$asm.estimate.correctedLength.log", 1)) { caExit("failed to sort by corrected read length", undef); } } if (! -e "$path/$asm.estimate.originalLength.log") { print STDERR "-- Sorting reads by uncorrected length.\n"; if (runCommandSilently($path, "sort -T . 
-k2nr -k4nr < ./$asm.estimate.log > ./$asm.estimate.originalLength.log", 1)) { caExit("failed to sort by original read length", undef); } } my $totRawLengthIn = 0; # Bases in raw reads we correct my $totRawLengthOut = 0; # Expected bases in corrected reads my $minRawLength = 0; my $totCorLengthIn = 0; # Bases in raw reads we correct my $totCorLengthOut = 0; # Expected bases in corrected reads my $minCorLength = 0; my $minTotal = getGlobal("genomeSize") * getGlobal("corOutCoverage"); # Lists of reads to correct if we use the raw length or the corrected length as a filter. my $nReads = getNumberOfReadsInStore("correction", $asm); my @rawReads; my @corReads; my @corReadLen; for (my $ii=0; $ii<=$nReads; $ii++) { $rawReads[$ii] = undef; $corReads[$ii] = undef; $corReadLen[$ii] = undef; } # The expected length of the corrected reads, based on the filter my @corLengthRawFilter; my @corLengthCorFilter; # Filter! print STDERR "-- Loading expected corrected read lengths.\n"; open(F, "< $path/$asm.estimate.originalLength.log"); while (<F>) { my @v = split '\s+', $_; next if ($v[0] eq "read"); $corReadLen[int($v[0])] = int($v[3]); } close(F); print STDERR "-- Picking longest corrected reads.\n"; open(F, "< $path/$asm.estimate.originalLength.log"); while (<F>) { my @v = split '\s+', $_; next if ($v[0] eq "read"); $totRawLengthIn += $v[1]; $totRawLengthOut += $v[3]; #print O "$v[0]\t$v[1]\t$v[3]\n"; $rawReads[int($v[0])] = 1; push @corLengthRawFilter, $v[3]; if ($totRawLengthIn >= $minTotal) { # Compare against raw bases $minRawLength = $v[1]; last; } } print STDERR "-- Writing longest corrected reads to '$path/$asm.readsToCorrect'.\n"; open(F, "< $path/$asm.estimate.correctedLength.log"); open(O, "| sort -T . 
-k1n > $path/$asm.readsToCorrect.WORKING") or caExit("can't open sort -k1n > '$path/$asm.readsToCorrect.WORKING' for writing: $!\n", undef); print O "read\toriginalLength\tcorrectedLength\n"; while (<F>) { my @v = split '\s+', $_; next if ($v[0] eq "read"); $totCorLengthIn += $v[1]; $totCorLengthOut += $v[3]; print O "$v[0]\t$v[1]\t$v[3]\n"; $corReads[int($v[0])] = 1; push @corLengthCorFilter, $v[3]; if ($totCorLengthOut >= $minTotal) { # Compare against corrected bases $minCorLength = $v[3]; last; } } close(O); close(F); rename "$path/$asm.readsToCorrect.WORKING", "$path/$asm.readsToCorrect"; stashFile("$path/$asm.readsToCorrect"); # Generate true/false positive/negative lists. print STDERR "-- Summarizing filter.\n"; open(F, "< $path/$asm.estimate.correctedLength.log") or die; open(TN, "> $path/$asm.estimate.tn.log") or die; open(FN, "> $path/$asm.estimate.fn.log") or die; open(FP, "> $path/$asm.estimate.fp.log") or die; open(TP, "> $path/$asm.estimate.tp.log") or die; my ($tnReads, $tnBasesR, $tnBasesC, $tnBasesRave, $tnBasesCave) = (0, 0, 0, 0, 0 ); my ($fnReads, $fnBasesR, $fnBasesC, $fnBasesRave, $fnBasesCave) = (0, 0, 0, 0, 0 ); my ($fpReads, $fpBasesR, $fpBasesC, $fpBasesRave, $fpBasesCave) = (0, 0, 0, 0, 0 ); my ($tpReads, $tpBasesR, $tpBasesC, $tpBasesRave, $tpBasesCave) = (0, 0, 0, 0, 0 ); while (<F>) { my @v = split '\s+', $_; next if ($v[0] eq "read"); my $er = defined($rawReads[int($v[0])]); my $ec = defined($corReads[int($v[0])]); if (($er == 0) && ($ec == 0)) { print TN $_; $tnReads++; $tnBasesR += $v[1]; $tnBasesC += $corReadLen[$v[0]]; } # True negative, yay! if (($er == 0) && ($ec == 1)) { print FN $_; $fnReads++; $fnBasesR += $v[1]; $fnBasesC += $corReadLen[$v[0]]; } # False negative. Bad. if (($er == 1) && ($ec == 0)) { print FP $_; $fpReads++; $fpBasesR += $v[1]; $fpBasesC += $corReadLen[$v[0]]; } # False positive. Bad. 
if (($er == 1) && ($ec == 1)) { print TP $_; $tpReads++; $tpBasesR += $v[1]; $tpBasesC += $corReadLen[$v[0]]; } # True positive, yay! } undef @corReadLen; close(TP); close(FP); close(FN); close(TN); close(F); $tnBasesRave = $tnBasesR / $tnReads if ($tnReads > 0); $tnBasesCave = $tnBasesC / $tnReads if ($tnReads > 0); $fnBasesRave = $fnBasesR / $fnReads if ($fnReads > 0); $fnBasesCave = $fnBasesC / $fnReads if ($fnReads > 0); $fpBasesRave = $fpBasesR / $fpReads if ($fpReads > 0); $fpBasesCave = $fpBasesC / $fpReads if ($fpReads > 0); $tpBasesRave = $tpBasesR / $tpReads if ($tpReads > 0); $tpBasesCave = $tpBasesC / $tpReads if ($tpReads > 0); # Dump a summary of the filter my ($rawFilterMean, $rawFilterN50) = lengthStats(@corLengthRawFilter); my ($corFilterMean, $corFilterN50) = lengthStats(@corLengthCorFilter); undef @corLengthRawFilter; undef @corLengthCorFilter; my $nCorReads = 0; my $nRawReads = 0; for (my $ii=0; $ii<=$nReads; $ii++) { $nCorReads++ if (defined($corReads[$ii])); $nRawReads++ if (defined($rawReads[$ii])); } undef @corReads; undef @rawReads; open(F, "> $path/$asm.readsToCorrect.summary") or caExit("can't open '$path/$asm.readsToCorrect.summary' for writing: $!\n", undef); print F "Corrected read length filter:\n"; print F "\n"; print F " nReads $nCorReads\n"; print F " nBases $totCorLengthIn (input bases)\n"; print F " nBases $totCorLengthOut (corrected bases)\n"; print F " Mean $corFilterMean\n"; print F " N50 $corFilterN50\n"; print F "\n"; print F "Raw read length filter:\n"; print F "\n"; print F " nReads $nRawReads\n"; print F " nBases $totRawLengthIn (input bases)\n"; print F " nBases $totRawLengthOut (corrected bases)\n"; print F " Mean $rawFilterMean\n"; print F " N50 $rawFilterN50\n"; print F "\n"; printf F "TN %9d reads %13d raw bases (%6d ave) %13d corrected bases (%6d ave)\n", $tnReads, $tnBasesR, $tnBasesRave, $tnBasesC, $tnBasesCave; printf F "FN %9d reads %13d raw bases (%6d ave) %13d corrected bases (%6d ave)\n", $fnReads, $fnBasesR, 
$fnBasesRave, $fnBasesC, $fnBasesCave; printf F "FP %9d reads %13d raw bases (%6d ave) %13d corrected bases (%6d ave)\n", $fpReads, $fpBasesR, $fpBasesRave, $fpBasesC, $fpBasesCave; printf F "TP %9d reads %13d raw bases (%6d ave) %13d corrected bases (%6d ave)\n", $tpReads, $tpBasesR, $tpBasesRave, $tpBasesC, $tpBasesCave; close(F); stashFile("$path/$asm.readsToCorrect.summary"); my $report; $report = "--\n"; $report .= "-- Reads to be corrected:\n"; $report .= "-- $nCorReads reads longer than $minRawLength bp\n"; $report .= "-- $totCorLengthIn bp\n"; $report .= "-- Expected corrected reads:\n"; $report .= "-- $nCorReads reads\n"; $report .= "-- $totCorLengthOut bp\n"; $report .= "-- $minCorLength bp minimum length\n"; $report .= "-- $corFilterMean bp mean length\n"; $report .= "-- $corFilterN50 bp n50 length\n"; addToReport("corrections", $report); # Plot a scatter plot of the original vs the expected corrected read lengths. Early versions # also plotted the sorted length vs the other length, but those were not interesting. if (! 
fileExists("$path/$asm.estimate.original-x-correctedLength.gp")) { my $gnuplot = getGlobal("gnuplot"); my $format = getGlobal("gnuplotImageFormat"); open(F, "> $path/$asm.estimate.original-x-correctedLength.gp"); print F "set title 'original length (x) vs corrected length (y)'\n"; print F "set xlabel 'original read length'\n"; print F "set ylabel 'corrected read length (expected)'\n"; print F "set pointsize 0.25\n"; print F "\n"; print F "set terminal $format size 1024,1024\n"; print F "set output './$asm.estimate.original-x-corrected.lg.$format'\n"; print F "plot './$asm.estimate.tn.log' using 2:4 title 'tn', \\\n"; print F " './$asm.estimate.fn.log' using 2:4 title 'fn', \\\n"; print F " './$asm.estimate.fp.log' using 2:4 title 'fp', \\\n"; print F " './$asm.estimate.tp.log' using 2:4 title 'tp'\n"; print F "set terminal $format size 256,256\n"; print F "set output './$asm.estimate.original-x-corrected.sm.$format'\n"; print F "replot\n"; close(F); if (runCommandSilently($path, "$gnuplot ./$asm.estimate.original-x-correctedLength.gp > /dev/null 2>&1", 0)) { print STDERR "--\n"; print STDERR "-- WARNING: gnuplot failed; no plots will appear in HTML output.\n"; print STDERR "--\n"; print STDERR "----------------------------------------\n"; } stashFile("$path/$asm.estimate.original-x-correctedLength.gp"); stashFile("$path/$asm.estimate.original-x-corrected.lg.$format"); stashFile("$path/$asm.estimate.original-x-corrected.sm.$format"); } } sub buildCorrectionLayouts ($) { my $asm = shift @_; my $bin = getBinDirectory(); my $cmd; my $path = "correction/2-correction"; # All we do here is decide if the job is finished, and delegate # to the correct function if not finished. # But first we analyze the layouts to pick a length cutoff - or more precisely, to pick a set # of reads to correct. # # If one doesn't want to compute the (expensive) stats, one can just correct more reads. How many more to correct # is dependent on the particular reads. 
An extra 1x to 5x seems reasonable. # A one-pass algorithm would write layouts to tigStore, computing stats as before, then delete # tigs that shouldn't be corrected. I suspect this will be slower. goto allDone if (skipStage($asm, "cor-buildCorrectionLayouts") == 1); goto allDone if (sequenceFileExists("$asm.correctedReads")); # Output exists goto allDone if (fileExists("$path/cnsjob.files")); # Jobs all finished goto allDone if (fileExists("$path/correctReads.sh")); # Jobs created make_path("$path") if (! -d "$path"); # Set the minimum coverage for a corrected read based on coverage in input reads. if (!defined(getGlobal("corMinCoverage"))) { my $cov = getExpectedCoverage("correction", $asm); setGlobal("corMinCoverage", 4); setGlobal("corMinCoverage", 4) if ($cov < 60); setGlobal("corMinCoverage", 0) if ($cov <= 20); print STDERR "-- Set corMinCoverage=", getGlobal("corMinCoverage"), " based on read coverage of $cov.\n"; } # This will eventually get rolled into overlap store creation. Generate a list of scores for # 'global' overlap filtering. fetchFile("$path/$asm.globalScores"); if (! fileExists("$path/$asm.globalScores")) { print STDERR "-- Computing global filter scores '$path/$asm.globalScores'.\n"; fetchStore("./correction/$asm.ovlStore"); my $maxCov = getCorCov($asm, "Global"); my $minLen = (defined(getGlobal("corMinEvidenceLength"))) ? getGlobal("corMinEvidenceLength") : 0; $cmd = "$bin/filterCorrectionOverlaps \\\n"; $cmd .= " -G ../$asm.gkpStore \\\n"; $cmd .= " -O ../$asm.ovlStore \\\n"; $cmd .= " -S ./$asm.globalScores.WORKING \\\n"; $cmd .= " -c $maxCov \\\n"; $cmd .= " -l $minLen \\\n"; $cmd .= " -e " . getGlobal("corMaxEvidenceErate") . 
" \\\n" if (defined(getGlobal("corMaxEvidenceErate"))); $cmd .= " -legacy \\\n" if (defined(getGlobal("corLegacyFilter"))); $cmd .= "> ./$asm.globalScores.err 2>&1"; if (runCommand($path, $cmd)) { caExit("failed to globally filter overlaps for correction", "$path/$asm.globalScores.err"); } rename "$path/$asm.globalScores.WORKING", "$path/$asm.globalScores"; rename "$path/$asm.globalScores.WORKING.stats", "$path/$asm.globalScores.stats"; rename "$path/$asm.globalScores.WORKING.log", "$path/$asm.globalScores.log"; unlink "$path/$asm.globalScores.err"; stashFile("$path/$asm.globalScores"); } else { print STDERR "-- Global filter scores found in '$path/$asm.globalScores'.\n"; } my $report; #FORMAT open(F, "< $path/$asm.globalScores.stats") or caExit("can't open '$path/$asm.globalScores.stats' for reading: $!", undef); while (<F>) { $report .= "-- $_"; } close(F); addToReport("filtering", $report); # For 'quick' filtering, but more reads to correct, sort the reads by length, and correct the # longest Nx of reads. # # For 'expensive' filtering, but fewer reads to correct, first estimate the corrected lengths # (requires a pass through the overlaps), then pull out the longest Nx of corrected reads. # # Both are required to create a file $asm.readsToCorrect, containing a list of IDs to correct. fetchFile("$path/$asm.readsToCorrect"); if (! fileExists("$path/$asm.readsToCorrect")) { if (getGlobal("corFilter") eq "quick") { quickFilter($asm, (getGlobal("genomeSize") * getGlobal("corOutCoverage"))); } elsif (getGlobal("corFilter") eq "expensive") { expensiveFilter($asm); } elsif (getGlobal("corFilter") eq "none" ) { quickFilter($asm, 0); } else { caFailure("unknown corFilter '" . getGlobal("corFilter") . "'", undef); } caExit("failed to create list of reads to correct", undef) if (! 
-e "$path/$asm.readsToCorrect"); } else { print STDERR "-- Filtered list of reads found in '$path/$asm.readsToCorrect'.\n"; } buildCorrectionLayouts_direct($asm) if (getGlobal("corConsensus") eq "utgcns"); buildCorrectionLayouts_direct($asm) if (getGlobal("corConsensus") eq "falcon"); buildCorrectionLayouts_piped($asm) if (getGlobal("corConsensus") eq "falconpipe"); finishStage: emitStage($asm, "cor-buildCorrectionLayouts"); buildHTML($asm, "cor"); allDone: } sub generateCorrectedReads ($) { my $asm = shift @_; my $attempt = getGlobal("canuIteration"); my $bin = getBinDirectory(); my $path = "correction/2-correction"; goto allDone if (skipStage($asm, "cor-generateCorrectedReads", $attempt) == 1); goto allDone if (sequenceFileExists("$asm.correctedReads")); # Compute the size of gkpStore for staging { my $size = 0; $size += -s "correction/$asm.gkpStore/info"; $size += -s "correction/$asm.gkpStore/libraries"; $size += -s "correction/$asm.gkpStore/reads"; $size += -s "correction/$asm.gkpStore/blobs"; $size = int($size / 1024 / 1024 / 1024 + 1.5); setGlobal("corStageSpace", $size); } # Figure out if all the tasks finished correctly. fetchFile("$path/correctReads.sh"); my ($jobs, undef) = computeNumberOfCorrectionJobs($asm); my $currentJobID = "0001"; my @successJobs; my @failedJobs; my $failureMessage = ""; for (my $job=1; $job <= $jobs; $job++) { if (fileExists("$path/correction_outputs/$currentJobID.fasta")) { push @successJobs, "$path/correction_outputs/$currentJobID.fasta\n"; } else { $failureMessage .= "-- job $path/correction_outputs/$currentJobID.fasta FAILED.\n"; push @failedJobs, $job; } $currentJobID++; } # Failed jobs, retry. if (scalar(@failedJobs) > 0) { # If too many attempts, give up. 
if ($attempt >= getGlobal("canuIterationMax")) {
            print STDERR "--\n";
            print STDERR "-- Read correction jobs failed, tried $attempt times, giving up.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
            caExit(undef, undef);
        }

        if ($attempt > 0) {
            print STDERR "--\n";
            print STDERR "-- Read correction jobs failed, retry.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
        }

        #  Otherwise, run some jobs.

        emitStage($asm, "cor-generateCorrectedReads", $attempt);
        buildHTML($asm, "cor");

        submitOrRunParallelJob($asm, "cor", $path, "correctReads", @failedJobs);
        return;
    }

  finishStage:
    print STDERR "-- Found ", scalar(@successJobs), " read correction output files.\n";

    open(L, "> $path/corjob.files") or caExit("failed to open '$path/corjob.files'", undef);
    print L @successJobs;
    close(L);

    stashFile("$path/corjob.files");

    emitStage($asm, "cor-generateCorrectedReads");
    buildHTML($asm, "cor");

  allDone:
}

sub dumpCorrectedReads ($) {
    my $asm  = shift @_;
    my $bin  = getBinDirectory();
    my $path = "correction/2-correction";

    goto allDone if (skipStage($asm, "cor-dumpCorrectedReads") == 1);
    goto allDone if (sequenceFileExists("$asm.correctedReads"));

    print STDERR "-- Concatenating correctReads output.\n";

    my $files = 0;
    my $reads = 0;

    fetchFile("$path/corjob.files");

    open(F, "< $path/corjob.files")                       or caExit("can't open '$path/corjob.files' for reading: $!", undef);
    open(N, "< correction/$asm.gkpStore/readNames.txt")   or caExit("can't open 'correction/$asm.gkpStore/readNames.txt' for reading: $!", undef);
    open(O, "| gzip -1c > $asm.correctedReads.fasta.gz")  or caExit("can't open '$asm.correctedReads.fasta.gz' for writing: $!", undef);
    open(L, "> $asm.correctedReads.length")               or caExit("can't open '$asm.correctedReads.length' for writing: $!", undef);

    while (<F>) {
        chomp;

        fetchFile($_);

        open(R, "< $_") or caExit("can't open correction output '$_' for reading: $!\n", undef);

        my $h;        #  Current header line
        my $s;        #  Current sequence
        my $n;        #  Next header line

        my $nameid;   #  Currently loaded id and name from gkpStore/readNames
        my $name;

        $n = <R>;  chomp $n;   #  Read the first line, the first header.

        while (!eof(R)) {
            $h = $n;               #  Read name.
            $s = undef;            #  No sequence yet.
            $n = <R>;  chomp $n;   #  Sequence, or the next header.

            #  Read sequence until the next header or we stop reading lines.  Perl seems to be
            #  setting EOF when the last line is read, which is early, IMHO.  We loop until
            #  we both are EOF and have an empty line.

            while (($n !~ m/^>/) && ((length($n) > 0) || !eof(R))) {
                $s .= $n;
                $n = <R>;  chomp $n;
            }

            #  Parse the header of the corrected read to find the IID and the split piece number

            my $rid = undef;   #  Read ID
            my $pid = undef;   #  Piece ID

            if ($h =~ m/read(\d+)_(\d+)/) {
                $rid = $1;
                $pid = $2;
            }

            #  Load the next line from the gatekeeper ID map file until we find the correct one.

            while (!eof(N) && ($rid != $nameid)) {
                my $rn = <N>;

                ($rn =~ m/^(\d+)\s+(.*)$/);

                $nameid = $1;
                $name   = $2;
            }

            #  If a match, replace the header with the actual read id.  If no match, use the bogus
            #  corrected read name as is.

            if ($rid eq $nameid) {
                $h = ">$name id=${rid}_${pid}";
            }

            #  And write the read to the output as FASTA.

            print O "$h", "\n";
            print O "$s", "\n";

            #  Or as FASTQ
            #my $q = $s;
            #$n =~ s/^>/\@/;
            #$q =~ tr/[A-Z][a-z]/*/;
            #print O $h, "\n";
            #print O $s, "\n";
            #print O "+\n";
            #print O $q, "\n";

            print L "$rid\t$pid\t", length($s), "\n";

            $reads++;
        }

        $files++;
        close(R);
    }

    close(O);
    close(F);

    stashFile("$asm.correctedReads.fasta.gz");
    stashFile("$asm.correctedReads.length");

    #  Analyze the results.

    print STDERR "-- Analyzing correctReads output.\n";

    my @origLengthHist;
    my @expcLengthHist;
    my @corrLengthHist;

    #my @origExpcDiffLengthHist;
    #my @origCorrDiffLengthHist;
    #my @expcCorrDiffLengthHist;

    my @corrPiecesHist;

    my $inputReads     = 0;
    my $inputBases     = 0;
    my $failedReads    = 0;
    my $correctedReads = 0;
    my $correctedBases = 0;

    my $maxReadLen = 0;

    my $minDiff = 0;
    my $maxDiff = 0;

    fetchFile("$path/$asm.readsToCorrect");
    fetchFile("$asm.correctedReads.length");   #  Already here, we just wrote it.
open(R, "< $path/$asm.readsToCorrect") or caExit("can't open '$path/$asm.readsToCorrect' for reading: $!", undef); open(L, "< $asm.correctedReads.length") or caExit("can't open '$asm.correctedReads.length' for reading: $!", undef); open(S, "> $path/$asm.original-expected-corrected-length.dat") or caExit("", undef); my $moreData = 1; my $r = ; chomp $r; # Read first 'read to correct' (input read) my @r = split '\s+', $r; if ($r[0] == 0) { $r = ; chomp $r; # First line was header, read again. @r = split '\s+', $r; } my $l = ; chomp $l; # Read first 'corrected read' (output read) my @l = split '\s+', $l; if ($l[0] == 0) { $l = ; chomp $l; # First line was header, read again. @l = split '\s+', $l; } while ($moreData) { # The 'read to correct' should be a superset of the 'corrected reads'. # There will be only one 'read to correct', but possibly multiple (or even zero) # 'corrected reads'. Read until the IDs differ. For this, we only want to # count the number of pieces and the total length. my $numPieces = 0; my $numBases = 0; while ($l[0] == $r[0]) { $numPieces++; $numBases += $l[2]; last if (eof(L)); # Ugly, break out if we just processed the last 'corrected read'. $l = ; chomp $l; # Read next 'corrected read' @l = split '\s+', $l; } # Add to the length scatter plot (original length, expected length, actual length). print S "$r[0]\t$r[1]\t$r[2]\t$numBases\t", $r[1] - $r[2], "\t", $r[1] - $numBases, "\t", $r[2] - $numBases, "\n"; $minDiff = $r[2] - $numBases if ($r[2] - $numBases < $minDiff); $maxDiff = $r[2] - $numBases if ($r[2] - $numBases > $minDiff); # Add to the histograms. 
$origLengthHist[$r[1]]++;
        $expcLengthHist[$r[2]]++;
        $corrLengthHist[$numBases]++;

        #$origExpcDiffLengthHist[$r[1] - $r[2]]++;
        #$origCorrDiffLengthHist[$r[1] - $numBases]++;
        #$expcCorrDiffLengthHist[$r[2] - $numBases]++;

        $corrPiecesHist[$numPieces]++;

        $inputReads++;
        $inputBases += $r[1];

        $failedReads++ if ($numBases == 0);

        $correctedReads += $numPieces;
        $correctedBases += $numBases;

        $maxReadLen = $r[1]      if ($maxReadLen < $r[1]);
        $maxReadLen = $r[2]      if ($maxReadLen < $r[2]);
        $maxReadLen = $numBases  if ($maxReadLen < $numBases);

        #  Read next 'read to correct'.  If this reads the last line, eof(R) is true, so can't
        #  loop on that.  Instead, we set a flag after we process that last line.

        if (!eof(R)) {
            $r = <R>;  chomp $r;
            @r = split '\s+', $r;
        } else {
            $moreData = 0;
        }
    }

    close(S);
    close(R);
    close(L);

    stashFile("$path/$asm.original-expected-corrected-length.dat");

    #  Write a summary of the corrections.

    my $report = getFromReport("corrections");

    $report .= "-- Actual corrected reads:\n";
    $report .= "--   $correctedReads reads\n";
    $report .= "--   $correctedBases bp\n";

    for (my $ii=0; $ii<scalar(@corrPiecesHist); $ii++) {   #  (reconstructed loop; the original body was lost)
        $report .= "--   $corrPiecesHist[$ii] reads split into $ii pieces\n"   if (defined($corrPiecesHist[$ii]));
    }

    addToReport("corrections", $report);

    open(F, "> $path/$asm.correction.summary") or caExit("", undef);
    print F "CORRECTION INPUTS:\n";
    print F "-----------------\n";
    print F "$inputReads (input reads)\n";
    print F "$inputBases (input bases)\n";
    print F "\n";
    print F "CORRECTION OUTPUTS:\n";
    print F "-----------------\n";
    print F "$failedReads (reads that failed to generate any corrected bases)\n";
    print F "$correctedReads (corrected read pieces)\n";
    print F "$correctedBases (corrected bases)\n";
    print F "\n";
    print F "PIECES PER READ:\n";
    print F "---------------\n";

    for (my $ii=0; $ii<scalar(@corrPiecesHist); $ii++) {   #  (reconstructed loop; the original body was lost)
        print F "$ii pieces: $corrPiecesHist[$ii] reads\n"   if (defined($corrPiecesHist[$ii]));
    }

    close(F);

    stashFile("$path/$asm.correction.summary");

    my $gnuplot = getGlobal("gnuplot");             #  (reconstructed definitions; both variables are
    my $format  = getGlobal("gnuplotImageFormat");  #   required by the plotting code below)

    open(F, "> $path/$asm.originalLength-vs-correctedLength.gp") or caExit("", undef);
    print F "\n";
    print F "set pointsize 0.25\n";
    print F "\n";
    print F "set title 'original read length vs expected corrected read length'\n";
    print F "set xlabel 'original read length'\n";
    print F "set ylabel 'expected corrected read length'\n";
    print F "\n";
    print F "set terminal $format size
1024,1024\n"; print F "set output './$asm.originalLength-vs-expectedLength.lg.$format'\n"; print F "plot [0:$maxReadLen] [0:$maxReadLen] './$asm.original-expected-corrected-length.dat' using 2:3 title 'original (x) vs expected (y)'\n"; print F "set terminal $format size 256,256\n"; print F "set output './$asm.originalLength-vs-expectedLength.sm.$format'\n"; print F "replot\n"; print F "\n"; print F "set title 'original read length vs sum of corrected read lengths'\n"; print F "set xlabel 'original read length'\n"; print F "set ylabel 'sum of corrected read lengths'\n"; print F "\n"; print F "set terminal $format size 1024,1024\n"; print F "set output './$asm.originalLength-vs-correctedLength.lg.$format'\n"; print F "plot [0:$maxReadLen] [0:$maxReadLen] './$asm.original-expected-corrected-length.dat' using 2:4 title 'original (x) vs corrected (y)'\n"; print F "set terminal $format size 256,256\n"; print F "set output './$asm.originalLength-vs-correctedLength.sm.$format'\n"; print F "replot\n"; print F "\n"; print F "set title 'expected read length vs sum of corrected read lengths'\n"; print F "set xlabel 'expected read length'\n"; print F "set ylabel 'sum of corrected read lengths'\n"; print F "\n"; print F "set terminal $format size 1024,1024\n"; print F "set output './$asm.expectedLength-vs-correctedLength.lg.$format'\n"; print F "plot [0:$maxReadLen] [0:$maxReadLen] './$asm.original-expected-corrected-length.dat' using 3:4 title 'expected (x) vs corrected (y)'\n"; print F "set terminal $format size 256,256\n"; print F "set output './$asm.expectedLength-vs-correctedLength.sm.$format'\n"; print F "replot\n"; close(F); if (runCommandSilently($path, "$gnuplot ./$asm.originalLength-vs-correctedLength.gp > /dev/null 2>&1", 0)) { print STDERR "--\n"; print STDERR "-- WARNING: gnuplot failed; no plots will appear in HTML output.\n"; print STDERR "--\n"; print STDERR "----------------------------------------\n"; } 
stashFile("$path/$asm.originalLength-vs-correctedLength.gp"); stashFile("$asm.originalLength-vs-expectedLength.lg.$format"); stashFile("$asm.originalLength-vs-expectedLength.sm.$format"); stashFile("$asm.originalLength-vs-correctedLength.lg.$format"); stashFile("$asm.originalLength-vs-correctedLength.sm.$format"); stashFile("$asm.expectedLength-vs-correctedLength.lg.$format"); stashFile("$asm.expectedLength-vs-correctedLength.sm.$format"); # Histograms of lengths, including the difference between expected and actual corrected (the # other two difference plots weren't interesting; original-expected was basically all zero, and # so original-actual was nearly the same as expected-actual. open(F, "> $path/$asm.length-histograms.gp") or caExit("", undef); print F "set title 'read length'\n"; print F "set ylabel 'number of reads'\n"; print F "set xlabel 'read length, bin width = 250'\n"; print F "\n"; print F "binwidth=250\n"; print F "set boxwidth binwidth\n"; print F "bin(x,width) = width*floor(x/width) + binwidth/2.0\n"; print F "\n"; print F "set terminal $format size 1024,1024\n"; print F "set output './$asm.length-histograms.lg.$format'\n"; print F "plot [1:$maxReadLen] [0:] \\\n"; print F " './$asm.original-expected-corrected-length.dat' using (bin(\$2,binwidth)):(1.0) smooth freq with boxes title 'original', \\\n"; print F " './$asm.original-expected-corrected-length.dat' using (bin(\$3,binwidth)):(1.0) smooth freq with boxes title 'expected', \\\n"; print F " './$asm.original-expected-corrected-length.dat' using (bin(\$4,binwidth)):(1.0) smooth freq with boxes title 'corrected'\n"; print F "set terminal $format size 256,256\n"; print F "set output './$asm.length-histograms.sm.$format'\n"; print F "replot\n"; print F "\n"; print F "set xlabel 'difference between expected and corrected read length, bin width = 250, min=$minDiff, max=$maxDiff'\n"; print F "\n"; print F "set terminal $format size 1024,1024\n"; print F "set output 
'./$asm.length-difference-histograms.lg.$format'\n";
    print F "plot [$minDiff:$maxDiff] [0:] \\\n";
    print F " './$asm.original-expected-corrected-length.dat' using (bin(\$7,binwidth)):(1.0) smooth freq with boxes title 'expected - corrected'\n";
    print F "set terminal $format size 256,256\n";
    print F "set output './$asm.length-difference-histograms.sm.$format'\n";
    print F "replot\n";
    close(F);

    if (runCommandSilently($path, "$gnuplot ./$asm.length-histograms.gp > /dev/null 2>&1", 0)) {
        print STDERR "--\n";
        print STDERR "-- WARNING: gnuplot failed; no plots will appear in HTML output.\n";
        print STDERR "--\n";
        print STDERR "----------------------------------------\n";
    }

    stashFile("$path/$asm.length-histograms.gp");
    stashFile("$asm.length-histograms.lg.$format");
    stashFile("$asm.length-histograms.sm.$format");
    stashFile("$asm.length-difference-histograms.lg.$format");
    stashFile("$asm.length-difference-histograms.sm.$format");

    #  Now that all outputs are (re)written, cleanup the job outputs.

    print STDERR "--\n";
    print STDERR "-- Purging correctReads output after merging to final output file.\n";
    print STDERR "-- Purging disabled by saveReadCorrections=true.\n"   if (getGlobal("saveReadCorrections") == 1);

    my $Nsuccess = 0;
    my $Nerr     = 0;
    my $Nfasta   = 0;
    my $Nlog     = 0;

    if (getGlobal("saveReadCorrections") != 1) {
        open(F, "< $path/corjob.files") or caExit("can't open '$path/corjob.files' for reading: $!", undef);
        while (<F>) {
            chomp;

            if (m/^(.*)\/correction_outputs\/0*(\d+).fasta$/) {
                my $ID6 = substr("00000" . $2, -6);
                my $ID4 = substr("000" .
$2, -4); my $ID0 = $2; if (-e "$1/correction_outputs/$ID4.dump.success") { $Nsuccess++; unlink "$1/correction_outputs/$ID4.dump.success"; } if (-e "$1/correction_outputs/$ID4.err") { $Nerr++; unlink "$1/correction_outputs/$ID4.err"; } if (-e "$1/correction_outputs/$ID4.fasta") { $Nfasta++; unlink "$1/correction_outputs/$ID4.fasta"; } if (-e "$1/correctReads.$ID6.out") { $Nlog++; unlink "$1/correctReads.$ID6.out"; } if (-e "$1/correctReads.$ID0.out") { $Nlog++; unlink "$1/correctReads.$ID0.out"; } } else { caExit("unknown correctReads job name '$_'\n", undef); } } close(F); print STDERR "-- Purged $Nsuccess .dump.success sentinels.\n" if ($Nsuccess > 0); print STDERR "-- Purged $Nfasta .fasta outputs.\n" if ($Nfasta > 0); print STDERR "-- Purged $Nerr .err outputs.\n" if ($Nerr > 0); print STDERR "-- Purged $Nlog .out job log outputs.\n" if ($Nlog > 0); } remove_tree("correction/$asm.ovlStore") if (getGlobal("saveOverlaps") eq "0"); finishStage: emitStage($asm, "cor-dumpCorrectedReads"); buildHTML($asm, "cor"); allDone: print STDERR "--\n"; print STDERR "-- Corrected reads saved in '", sequenceFileExists("$asm.correctedReads"), "'.\n"; stopAfter("readCorrection"); } canu-1.6/src/pipelines/canu/CorrectReads.txt000066400000000000000000000032031314437614700211440ustar00rootroot00000000000000 buildCorrectionLayouts() ---------------------------------------- filterCorrectionOverlaps (binary) - writes asm.globalScores - log to asm.globalScores.log -- PER READ, #olaps, #scored, #filtered, #saved, reason - log to asm.globalScores.err -- STATS - knows a -S , so write log and stats using that. 
- params corMaxEvidenceCoverageGlobal - params corMinEvidenceLength ---------------------------------------- quickFilter() or expensiveFilter() Creates asm.readsToCorrect with 'readID', 'originalLength', 'expectedCorrectedLength' quick filter just picks the longest originalLength reads that sum to corOutCoverage * genomeSize no logging, no stats, no plot expensive filter calls generateCorrectionLayouts (binary) to write asm.estimate.log and asm.estimate.stats with expected corrected length based on the overlaps we'd use canu.pl makes asm.estimate.* files with tp/tn rates and a figure canu.pl makes asm.readsToCorrect.summary and asm.estimate.original-x-corrected.png ---------------------------------------- buildCorrectionLayouts (_direct or _piped) generateCorrectionLayouts - reads asm.readsToCorrect (who makes this?) - reads asm.globalScores - writes asm.corStore (direct) - writes falcon-formatted reads for a pipe to compute consensus - params -L corMinEvidenceLength - params -E corMaxEvidenceErate - params -C maxCov (corMaxEvidenceCoverageLocal) - only errors to stderr - writes no log or summary (disabled in canu here, output in the expensiveFilter above) When outputs of the parallel processes are merged, a length file is created. This, with the expensive filter length file, can generate stats on corrections. canu-1.6/src/pipelines/canu/Defaults.pm000066400000000000000000001611601314437614700201370ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. 
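The 'quick' filter described in the CorrectReads.txt notes above just takes the longest input reads until their summed length reaches corOutCoverage × genomeSize. A minimal illustration of that selection, in Python (the function and argument names are invented for the sketch, and treating a target of 0 as "no cutoff" is an interpretation of the corFilter=none path, not canu's literal code):

```python
def quick_filter(read_lengths, genome_size, out_coverage):
    """Pick the longest reads whose summed length covers out_coverage * genome_size.

    read_lengths maps read id -> original read length; the returned set of ids
    is the analogue of the asm.readsToCorrect list.  A target of 0 disables the
    cutoff, so every read is kept.
    """
    target = genome_size * out_coverage
    chosen = set()
    total = 0
    # Longest first; stop once the coverage budget is met.
    for rid, length in sorted(read_lengths.items(), key=lambda kv: kv[1], reverse=True):
        if target > 0 and total >= target:
            break
        chosen.add(rid)
        total += length
    return chosen
```

Because only the length sum matters, this needs a single sort and no pass through the overlaps, which is exactly why the notes contrast it with the 'expensive' filter that must estimate corrected lengths first.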
#  Canu branched from the kmer project at its revision 1994.
 #
 #  This file is derived from:
 #
 #    src/pipelines/ca3g/Defaults.pm
 #
 #  Modifications by:
 #
 #    Brian P. Walenz from 2015-FEB-27 to 2015-SEP-21
 #      are Copyright 2015 Battelle National Biodefense Institute, and
 #      are subject to the BSD 3-Clause License
 #
 #    Brian P. Walenz beginning on 2015-OCT-21
 #      are a 'United States Government Work', and
 #      are released in the public domain
 #
 #    Sergey Koren beginning on 2015-NOV-19
 #      are a 'United States Government Work', and
 #      are released in the public domain
 #
 #  File 'README.licenses' in the root directory of this distribution contains
 #  full conditions and disclaimers for each license.
 ##

package canu::Defaults;

require Exporter;

@ISA    = qw(Exporter);
@EXPORT = qw(getCommandLineOptions addCommandLineOption addCommandLineError writeLog getNumberOfCPUs getPhysicalMemorySize getAllowedResources diskSpace printOptions printHelp printCitation addSequenceFile setParametersFromFile setParametersFromCommandLine checkJava checkGnuplot checkParameters getGlobal setGlobal setGlobalIfUndef setDefaults setVersion);

use strict;

use Cwd qw(getcwd abs_path);
use Carp qw(cluck);
use Sys::Hostname;
use Text::Wrap;
use File::Basename;   #  dirname

my %global;    #  Parameter value
my %synops;    #  Parameter description (for -defaults)
my %synnam;    #  Parameter name (because the key is lowercase)

my $cLineOpts = undef;
my $specLog   = "";

sub getGlobal ($) {
    my $var = shift @_;

    $var =~ tr/A-Z/a-z/;

    #  We lost the use of caFailure in Defaults.pm (because it was moved to
    #  Execution.pm so it can run stuff) here, so duplicate the functionality.
    #  This should only trigger on static pipeline errors (i.e., not depending
    #  on the reads input) and so should never occur in the wild.

    if (!exists($global{$var})) {
        print STDERR "================================================================================\n";
        print STDERR "Unknown parameter '$var' accessed.
Stack trace:\n"; cluck; exit(1); } return($global{$var}); } sub setGlobalSpecialization ($@) { my $val = shift @_; foreach my $var (@_) { $global{$var} = $val; } return(1); } sub setGlobal ($$) { my $VAR = shift @_; my $var = $VAR; my $val = shift @_; my $set = 0; $var =~ tr/A-Z/a-z/; $val = undef if ($val eq "undef"); # Set to undefined, the default for many of the options. $val = undef if ($val eq ""); # Map 'true'/'false' et al. to 0/1. $val = 0 if (($val =~ m/^false$/i) || ($val =~ m/^f$/i)); $val = 1 if (($val =~ m/^true$/i) || ($val =~ m/^t$/i)); # Grid options foreach my $opt ("gridoptions") { $set += setGlobalSpecialization($val, ("${opt}corovl", "${opt}obtovl", "${opt}utgovl")) if ($var eq "${opt}ovl"); $set += setGlobalSpecialization($val, ("${opt}cormhap", "${opt}obtmhap", "${opt}utgmhap")) if ($var eq "${opt}mhap"); $set += setGlobalSpecialization($val, ("${opt}cormmap", "${opt}obtmmap", "${opt}utgmmap")) if ($var eq "${opt}mmap"); } foreach my $opt ("memory", "threads", "concurrency") { $set += setGlobalSpecialization($val, ( "corovl${opt}", "obtovl${opt}", "utgovl${opt}")) if ($var eq "ovl${opt}"); $set += setGlobalSpecialization($val, ("cormhap${opt}", "obtmhap${opt}", "utgmhap${opt}")) if ($var eq "mhap${opt}"); $set += setGlobalSpecialization($val, ("cormmap${opt}", "obtmmap${opt}", "utgmmap${opt}")) if ($var eq "mmap${opt}"); } # Overlapping algorithm choice options foreach my $opt ("overlapper", "realign") { $set += setGlobalSpecialization($val, ("cor${opt}", "obt${opt}", "utg${opt}")) if ($var eq "${opt}"); } # OverlapInCore options foreach my $opt ("ovlerrorrate", "ovlhashblocklength", "ovlrefblocksize", "ovlrefblocklength", "ovlhashbits", "ovlhashload", "ovlmersize", "ovlmerthreshold", "ovlmerdistinct", "ovlmertotal", "ovlfrequentmers") { $set += setGlobalSpecialization($val, ("cor${opt}", "obt${opt}", "utg${opt}")) if ($var eq "${opt}"); } # Mhap options foreach my $opt ("mhapblocksize", "mhapmersize", "mhapsensitivity", "mhapfilterunique", 
"mhapfilterthreshold", "mhapnotf") { $set += setGlobalSpecialization($val, ("cor${opt}", "obt${opt}", "utg${opt}")) if ($var eq "${opt}"); } # MiniMap options foreach my $opt ("mmapblocksize", "mmapmersize") { $set += setGlobalSpecialization($val, ("cor${opt}", "obt${opt}", "utg${opt}")) if ($var eq "${opt}"); } # Handle the two error rate aliases. if ($var eq "rawerrorrate") { setGlobalIfUndef("corOvlErrorRate", $val); setGlobalIfUndef("corErrorRate", $val); return; } if ($var eq "correctederrorrate") { setGlobalIfUndef("obtOvlErrorRate", $val); setGlobalIfUndef("obtErrorRate", $val); setGlobalIfUndef("utgOvlErrorRate", $val); setGlobalIfUndef("utgErrorRate", $val); setGlobalIfUndef("cnsErrorRate", $val); return; } return if ($set > 0); #if ($var eq "canuiteration") { # print STDERR "-- WARNING: set canuIteration to $val\n"; #} # If we get a parameter we don't understand, we should be parsing command line options or # reading spec files, and we can let the usual error handling handle it. addCommandLineError("ERROR: Parameter '$VAR' is not known.\n") if (!exists($global{$var})); $global{$var} = $val; } sub setGlobalIfUndef ($$) { my $var = shift @_; my $val = shift @_; $var =~ tr/A-Z/a-z/; $val = undef if ($val eq ""); # Set to undefined, the default for many of the options. return if (defined($global{$var})); $global{$var} = $val; } sub getCommandLineOptions () { return($cLineOpts); } sub addCommandLineOption ($) { my $opt = shift @_; return if ($opt =~ m/canuIteration=/); # Ignore canu resetting canuIteration $cLineOpts .= " " if (defined($cLineOpts) && ($cLineOpts !~ m/\s$/)); $cLineOpts .= $opt; } sub addCommandLineError($) { $global{'errors'} .= shift @_; } sub writeLog () { my $time = time(); my $host = hostname(); my $pid = $$; open(F, "> canu-logs/${time}_${host}_${pid}_canu"); print F $specLog; close(F); } # # Host management - these really belong in 'Execution.pm' (or 'Utilities.pm') but can't go there # (Execution.pm) and be used here too. 
#

sub getNumberOfCPUs () {
    my $os   = $^O;
    my $ncpu = 1;

    #  See http://stackoverflow.com/questions/6481005/obtain-the-number-of-cpus-cores-in-linux

    if ($os eq "freebsd") {
        $ncpu = int(`/sbin/sysctl -n hw.ncpu`);
    }

    if ($os eq "darwin") {
        $ncpu = int(`/usr/bin/getconf _NPROCESSORS_ONLN`);
    }

    if ($os eq "linux" || $os eq "cygwin") {
        $ncpu = int(`getconf _NPROCESSORS_ONLN`);
    }

    return($ncpu);
}

sub getPhysicalMemorySize () {
    my $os     = $^O;
    my $memory = 1;

    if ($os eq "freebsd") {
        $memory = `/sbin/sysctl -n hw.physmem` / 1024 / 1024 / 1024;
    }

    if ($os eq "darwin") {
        $memory = `/usr/sbin/sysctl -n hw.memsize` / 1024 / 1024 / 1024;
    }

    if ($os eq "linux" || $os eq "cygwin") {
        open(F, "< /proc/meminfo");   #  Way to go, Linux!  Make it easy on us!
        while (<F>) {
            if (m/MemTotal:\s+(\d+)/) {
                $memory = $1 / 1024 / 1024;
            }
        }
        close(F);
    }

    return(int($memory + 0.5));   #  Poor man's rounding
}

sub diskSpace ($) {
    my $dir = dirname($_[0]);
    my ($total, $used, $free, $avail) = (0, 0, 0, 0);

    if (-d $dir) {
        open(DF, "df -P -k $dir |");
        while (<DF>) {
            chomp;

            if (m/^(.*)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+%)\s+(.*)$/) {
                $total = int($2 / 1048.576) / 1000;
                $used  = int($3 / 1048.576) / 1000;
                $free  = int($4 / 1048.576) / 1000;
                $avail = int($4 / 1048.576) / 1000;   #  Possibly limited by quota?
            }
        }
        close(DF);
    }

    #print STDERR "Disk space: total $total GB, used $used GB, free $free GB, available $avail GB\n";

    return (wantarray) ? ($total, $used, $free, $avail) : $avail;
}

sub printOptions () {
    my $pretty = 0;

    foreach my $k (sort values %synnam) {
        my $o = $k;
        my $u = $synops{$k};

        next if (length($u) == 0);

        if ($pretty == 0) {
            $o = substr("$k                                        ", 0, 40);
        } else {
            $Text::Wrap::columns = 60;

            $o = "$o\n";
            $u = wrap(" ", " ", $u) .
"\n"; } print "$o$u\n"; } } sub printHelp (@) { my $force = shift @_; return if (!defined($force) && !defined($global{"errors"})); print "\n"; print "usage: canu [-version] [-citation] \\\n"; print " [-correct | -trim | -assemble | -trim-assemble] \\\n"; print " [-s ] \\\n"; print " -p \\\n"; print " -d \\\n"; print " genomeSize=[g|m|k] \\\n"; print " [other-options] \\\n"; print " [-pacbio-raw |\n"; print " -pacbio-corrected |\n"; print " -nanopore-raw |\n"; print " -nanopore-corrected] file1 file2 ...\n"; print "\n"; print "example: canu -d run1 -p godzilla genomeSize=1g -nanopore-raw reads/*.fasta.gz \n"; print "\n"; print "\n"; print " To restrict canu to only a specific stage, use:\n"; print " -correct - generate corrected reads\n"; print " -trim - generate trimmed reads\n"; print " -assemble - generate an assembly\n"; print " -trim-assemble - generate trimmed reads and then assemble them\n"; print "\n"; print " The assembly is computed in the -d , with output files named\n"; print " using the -p . This directory is created if needed. It is not\n"; print " possible to run multiple assemblies in the same directory.\n"; print "\n"; print " The genome size should be your best guess of the haploid genome size of what is being\n"; print " assembled. It is used primarily to estimate coverage in reads, NOT as the desired\n"; print " assembly size. Fractional values are allowed: '4.7m' equals '4700k' equals '4700000'\n"; print "\n"; print " Some common options:\n"; print " useGrid=string\n"; print " - Run under grid control (true), locally (false), or set up for grid control\n"; print " but don't submit any jobs (remote)\n"; print " rawErrorRate=fraction-error\n"; print " - The allowed difference in an overlap between two raw uncorrected reads. For lower\n"; print " quality reads, use a higher number. 
The defaults are 0.300 for PacBio reads and\n"; print " 0.500 for Nanopore reads.\n"; print " correctedErrorRate=fraction-error\n"; print " - The allowed difference in an overlap between two corrected reads. Assemblies of\n"; print " low coverage or data with biological differences will benefit from a slight increase\n"; print " in this. Defaults are 0.045 for PacBio reads and 0.144 for Nanopore reads.\n"; print " gridOptions=string\n"; print " - Pass string to the command used to submit jobs to the grid. Can be used to set\n"; print " maximum run time limits. Should NOT be used to set memory limits; Canu will do\n"; print " that for you.\n"; print " minReadLength=number\n"; print " - Ignore reads shorter than 'number' bases long. Default: 1000.\n"; print " minOverlapLength=number\n"; print " - Ignore read-to-read overlaps shorter than 'number' bases long. Default: 500.\n"; print " A full list of options can be printed with '-options'. All options can be supplied in\n"; print " an optional sepc file with the -s option.\n"; print "\n"; print " Reads can be either FASTA or FASTQ format, uncompressed, or compressed with gz, bz2 or xz.\n"; print " Reads are specified by the technology they were generated with:\n"; print " -pacbio-raw \n"; print " -pacbio-corrected \n"; print " -nanopore-raw \n"; print " -nanopore-corrected \n"; print "\n"; print "Complete documentation at http://canu.readthedocs.org/en/latest/\n"; print "\n"; if (defined($global{'errors'})) { print "$global{'errors'}"; print "\n"; } exit(1); } sub printCitation ($) { my $prefix = shift @_; print STDERR "${prefix}Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.\n"; print STDERR "${prefix}Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.\n"; print STDERR "${prefix}Genome Res. 
2017 May;27(5):722-736.\n";
    print STDERR "${prefix}http://doi.org/10.1101/gr.215087.116\n";
    print STDERR "${prefix}\n";
    print STDERR "${prefix}Read and contig alignments during correction, consensus and GFA building use:\n";
    print STDERR "${prefix}  Šošić M, Šikić M.\n";
    print STDERR "${prefix}  Edlib: a C/C++ library for fast, exact sequence alignment using edit distance.\n";
    print STDERR "${prefix}  Bioinformatics. 2017 May 1;33(9):1394-1395.\n";
    print STDERR "${prefix}  http://doi.org/10.1093/bioinformatics/btw753\n";
    print STDERR "${prefix}\n";
    print STDERR "${prefix}Overlaps are generated using:\n";
    print STDERR "${prefix}  Berlin K, et al.\n";
    print STDERR "${prefix}  Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.\n";
    print STDERR "${prefix}  Nat Biotechnol. 2015 Jun;33(6):623-30.\n";
    print STDERR "${prefix}  http://doi.org/10.1038/nbt.3238\n";
    print STDERR "${prefix}\n";
    print STDERR "${prefix}  Myers EW, et al.\n";
    print STDERR "${prefix}  A Whole-Genome Assembly of Drosophila.\n";
    print STDERR "${prefix}  Science. 2000 Mar 24;287(5461):2196-204.\n";
    print STDERR "${prefix}  http://doi.org/10.1126/science.287.5461.2196\n";
    print STDERR "${prefix}\n";
    print STDERR "${prefix}  Li H.\n";
    print STDERR "${prefix}  Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.\n";
    print STDERR "${prefix}  Bioinformatics. 2016 Jul 15;32(14):2103-10.\n";
    print STDERR "${prefix}  http://doi.org/10.1093/bioinformatics/btw152\n";
    print STDERR "${prefix}\n";
    print STDERR "${prefix}Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:\n";
    print STDERR "${prefix}  Chin CS, et al.\n";
    print STDERR "${prefix}  Phased diploid genome assembly with single-molecule real-time sequencing.\n";
    print STDERR "${prefix}  Nat Methods.
2016 Dec;13(12):1050-1054.\n";
    print STDERR "${prefix}  http://doi.org/10.1038/nmeth.4035\n";
    print STDERR "${prefix}\n";
    print STDERR "${prefix}Contig consensus sequences are generated using an algorithm derived from pbdagcon:\n";
    print STDERR "${prefix}  Chin CS, et al.\n";
    print STDERR "${prefix}  Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.\n";
    print STDERR "${prefix}  Nat Methods. 2013 Jun;10(6):563-9\n";
    print STDERR "${prefix}  http://doi.org/10.1038/nmeth.2474\n";
    print STDERR "${prefix}\n";
}

sub makeAbsolute ($) {
    my $var = shift @_;
    my $val = getGlobal($var);
    my $abs = abs_path($val);

    if (defined($val) && ($val ne $abs)) {
        setGlobal($var, $abs);

        $val =~ s/\\\"/\"/g;
        $val =~ s/\"/\\\"/g;
        $val =~ s/\\\$/\$/g;
        $val =~ s/\$/\\\$/g;

        addCommandLineOption("'$var=$val'");
    }
}

sub fixCase ($) {
    my $var = shift @_;
    my $val = getGlobal($var);

    if (defined($val)) {
        $val =~ tr/A-Z/a-z/;
        setGlobal($var, $val);
    }
}

sub addSequenceFile ($$@) {
    my $dir  = shift @_;
    my $file = shift @_;
    my $err  = shift @_;

    return(undef)            if (!defined($file));    #  No file name?  Nothing to do.

    $file = "$dir/$file"     if (defined($dir));      #  If $dir defined, assume file is in there.

    return(abs_path($file))  if (-e $file);           #  If found, return the full path.

    #  And if not found, report an error, unless told not to.  This is because on the command
    #  line, the first word after -pacbio-raw must exist, but all the other words could
    #  be files or options.

    addCommandLineError("ERROR: Input read file '$file' not found.\n")   if (defined($err));

    return(undef);
}

sub setParametersFromFile ($$@) {
    my $specFile  = shift @_;
    my $readdir   = shift @_;
    my @fragFiles = @_;

    #  Client should be ensuring that the file exists before calling this function.
    die "specFile '$specFile' not found.\n" if (!
-e "$specFile"); $specLog .= "\n"; $specLog .= "###\n"; $specLog .= "### Reading options from '$specFile'\n"; $specLog .= "###\n"; $specLog .= "\n"; # We lost the use of caExit() here (moved to Execution.pm) and so can't call it. # Just die. open(F, "< $specFile") or die("can't open '$specFile' for reading: $!\n"); while () { $specLog .= $_; s/^\s+//; s/\s+$//; next if (m/^#/); next if (length($_) eq 0); # First, figure out the two words. my $one; my $two; my $opt; if (m/^-(pacbio|nanopore)-(corrected|raw)\s+(.*)\s*$/) { # Comments not allowed, because then we can't decide $one = "-$1-$2"; # if the # is a comment, or part of the file! $two = $3; # e.g., this_is_file_#1 vs $opt = 0; # this_is_the_only_file#no more data } elsif (m/^(\w*)\s*=\s*([^#]*)\s*#*.*?$/) { # Word two won't match a #, but will gobble up spaces at the end. $one = $1; # Then, we can match a #, and any amount of comment, minimally. $two = $2; # If word two is made non-greedy, it will shrink to nothing, as $opt = 1; # the last bit will gobble up everything, since we're allowed } # to match zero #'s in between. else { addCommandLineError("ERROR: File not found or unknown specFile option line '$_'.\n"); } # Now, clean up the second word to handle quotes. $two =~ s/^\s+//; # There can be spaces from the greedy match. $two =~ s/\s+$//; $two = $1 if ($two =~ m/^'(.+)'$/); # Remove single quotes | But don't allowed mixed quotes; users $two = $1 if ($two =~ m/^"(.+)"$/); # Remove double quotes | should certainly know better # And do something. if ($opt == 1) { $two =~ s/^\s+//; # Remove spaces again. They'll just confuse our option processing. $two =~ s/\s+$//; setGlobal($one, $two); } else { my $file = addSequenceFile($readdir, $two, 1); # Don't remove spaces. File could be " file ", for some stupid reason. 
if (defined($file)) { push @fragFiles, "$one\0$file"; } else { addCommandLineError("ERROR: File not found in spec file option '$_'\n"); } } } close(F); return(@fragFiles); } sub setParametersFromCommandLine(@) { my @specOpts = @_; if (scalar(@specOpts) > 0) { $specLog .= "\n"; $specLog .= "###\n"; $specLog .= "### Reading options from the command line.\n"; $specLog .= "###\n"; $specLog .= "\n"; } foreach my $s (@specOpts) { $specLog .= "$s\n"; if ($s =~ m/\s*(\w*)\s*=(.*)/) { my ($var, $val) = ($1, $2); $var =~ s/^\s+//; $var =~ s/\s+$//; $val =~ s/^\s+//; $val =~ s/\s+$//; setGlobal($var, $val); } else { addCommandLineError("ERROR: Misformed command line option '$s'.\n"); } } } sub setExecDefaults ($$) { my $tag = shift @_; my $name = shift @_; $global{"gridOptions${tag}"} = undef; $synops{"gridOptions${tag}"} = "Grid engine options applied to $name jobs"; $global{"${tag}Memory"} = undef; $synops{"${tag}Memory"} = "Amount of memory, in gigabytes, to use for $name jobs"; $global{"${tag}Threads"} = undef; $synops{"${tag}Threads"} = "Number of threads to use for $name jobs"; $global{"${tag}StageSpace"} = undef; $synops{"${tag}StageSpace"} = "Amount of local disk space needed to stage data for $name jobs"; $global{"${tag}Concurrency"} = undef; $synops{"${tag}Concurrency"} = "If grid not enabled, number of $name jobs to run at the same time; default is n_proc / n_threads"; } sub setOverlapDefault ($$$$) { my $tag = shift @_; my $var = shift @_; my $value = shift @_; my $description = shift @_; $global{"${tag}${var}"} = $value; $synops{"${tag}${var}"} = $description; $synops{ "${var}"} = $description; } sub setOverlapDefaults ($$$) { my $tag = shift @_; # If 'cor', some parameters are loosened for raw pacbio reads my $name = shift @_; my $default = shift @_; # Sets ${tag}Overlapper # Which overlapper to use. 
setOverlapDefault($tag, "Overlapper", $default, "Which overlap algorithm to use for $name"); setOverlapDefault($tag, "ReAlign", 0, "Refine overlaps by computing the actual alignment: 'true' or 'false'. Not useful for overlapper=ovl. Uses ${tag}OvlErrorRate"); # OverlapInCore parameters. setOverlapDefault($tag, "OvlHashBlockLength", undef, "Amount of sequence (bp) to load into the overlap hash table"); setOverlapDefault($tag, "OvlRefBlockSize", undef, "Number of reads to search against the hash table per batch"); setOverlapDefault($tag, "OvlRefBlockLength", 0, "Amount of sequence (bp) to search against the hash table per batch"); setOverlapDefault($tag, "OvlHashBits", ($tag eq "cor") ? 18 : 23, "Width of the kmer hash. Width 22=1gb, 23=2gb, 24=4gb, 25=8gb. Plus 10b per ${tag}OvlHashBlockLength"); setOverlapDefault($tag, "OvlHashLoad", 0.75, "Maximum hash table load. If set too high, table lookups are inefficient; if too low, search overhead dominates run time; default 0.75"); setOverlapDefault($tag, "OvlMerSize", ($tag eq "cor") ? 19 : 22, "K-mer size for seeds in overlaps"); setOverlapDefault($tag, "OvlMerThreshold", "auto", "K-mer frequency threshold; mers more frequent than this count are ignored; default 'auto'"); setOverlapDefault($tag, "OvlMerDistinct", undef, "K-mer frequency threshold; the least frequent fraction of distinct mers can seed overlaps"); setOverlapDefault($tag, "OvlMerTotal", undef, "K-mer frequency threshold; the least frequent fraction of all mers can seed overlaps"); setOverlapDefault($tag, "OvlFrequentMers", undef, "Do not seed overlaps with these kmers (fasta format)"); setOverlapDefault($tag, "OvlFilter", undef, "Filter overlaps based on expected kmers vs observed kmers"); # Mhap parameters. FilterThreshold MUST be a string, otherwise it gets printed in scientific notation (5e-06) which java doesn't understand.
setOverlapDefault($tag, "MhapVersion", "2.1.2", "Version of the MHAP jar file to use"); setOverlapDefault($tag, "MhapFilterThreshold", "0.000005", "Value between 0 and 1. kmers which comprise more than this fraction of the input are downweighted"); setOverlapDefault($tag, "MhapFilterUnique", undef, "Expert option: True or false, suppress the low-frequency k-mer distribution based on them being likely noise and not true overlaps. Threshold auto-computed based on error rate and coverage."); setOverlapDefault($tag, "MhapNoTf", undef, "Expert option: True or false, do not use tf weighting, only idf of tf-idf."); setOverlapDefault($tag, "MhapOptions", undef, "Expert option: free-form parameters to pass to MHAP."); setOverlapDefault($tag, "MhapBlockSize", 3000, "Number of reads per 1GB; memory * blockSize = the size of block loaded into memory per job"); setOverlapDefault($tag, "MhapMerSize", ($tag eq "cor") ? 16 : 16, "K-mer size for seeds in mhap"); setOverlapDefault($tag, "MhapOrderedMerSize", ($tag eq "cor") ? 12 : 18, "K-mer size for second-stage filter in mhap"); setOverlapDefault($tag, "MhapSensitivity", undef, "Coarse sensitivity level: 'low', 'normal' or 'high'. Set automatically based on coverage; 'high' <= 30x < 'normal' < 60x <= 'low'"); # MiniMap parameters. setOverlapDefault($tag, "MMapBlockSize", 6000, "Number of reads per 1GB; memory * blockSize = the size of block loaded into memory per job"); setOverlapDefault($tag, "MMapMerSize", ($tag eq "cor") ? 15 : 21, "K-mer size for seeds in minimap"); } sub setDefault ($$$) { my $var = shift @_; my $value = shift @_; my $description = shift @_; $global{$var} = $value; $synops{$var} = $description; } sub setDefaults () { ##### Internal stuff $global{"errors"} = undef; # Command line errors $global{"version"} = undef; # Reset at the end of this function, once we know where binaries are.
$global{"availablehosts"} = undef; # Internal list of cpus-memory-nodes describing the grid $global{"canuiteration"} = 0; $global{"canuiterationmax"} = 2; $global{"onexitdir"} = undef; # Copy of $wrk, for caExit() and caFailure() ONLY. $global{"onexitnam"} = undef; # Copy of $asm, for caExit() and caFailure() ONLY. ##### Meta options (no $global for these, only synopsis), more of these, many many more, are defined in setOverlapDefaults(). $synops{"rawErrorRate"} = "Expected fraction error in an alignment of two uncorrected reads"; $synops{"correctedErrorRate"} = "Expected fraction error in an alignment of two corrected reads"; ##### General Configuration Options (aka miscellany) my $java = (exists $ENV{"JAVA_HOME"} && -e "$ENV{'JAVA_HOME'}/bin/java") ? "$ENV{'JAVA_HOME'}/bin/java" : "java"; setDefault("showNext", undef, "Don't run any commands, just report what would run"); setDefault("pathMap", undef, "File with a hostname to binary directory map; binary directories must be absolute paths"); setDefault("shell", "/bin/sh", "Command interpreter to use; sh-compatible (e.g., bash), NOT C-shell (csh or tcsh); default '/bin/sh'"); setDefault("java", $java, "Java interpreter to use; at least version 1.8; default 'java'"); setDefault("gnuplot", "gnuplot", "Path to the gnuplot executable"); setDefault("gnuplotImageFormat", undef, "Image format that gnuplot will generate, used in HTML reports. 
Default: based on gnuplot, 'png', 'svg' or 'gif'"); setDefault("gnuplotTested", 0, "If set, skip the initial testing of gnuplot"); setDefault("stageDirectory", undef, "If set, copy heavily used data to this node-local location"); ##### Cleanup and Termination options setDefault("saveOverlaps", 0, "Save intermediate overlap files, almost never a good idea"); setDefault("saveReadCorrections", 0, "Save intermediate read correction files, almost never a good idea"); setDefault("saveMerCounts", 0, "Save full mer counting results, sometimes useful"); setDefault("onSuccess", undef, "Full path to command to run on successful completion"); setDefault("onFailure", undef, "Full path to command to run on failure"); ##### Error Rates setDefault("corOvlErrorRate", undef, "Overlaps above this error rate are not computed"); setDefault("obtOvlErrorRate", undef, "Overlaps at or below this error rate are used to trim reads"); setDefault("utgOvlErrorRate", undef, "Overlaps at or below this error rate are used to construct unitigs"); setDefault("utgErrorRate", undef, "Overlaps at or below this error rate are used to construct contigs"); setDefault("utgGraphDeviation", 6, "Overlaps this much above median will not be used for initial graph construction"); setDefault("utgRepeatDeviation", 3, "Overlaps this much above mean unitig error rate will not be used for repeat splitting"); setDefault("utgRepeatConfusedBP", 2100, "Repeats where the next best edge is at least this many bp shorter will not be split"); setDefault("corErrorRate", undef, "Only use raw alignments below this error rate to construct corrected reads"); setDefault("cnsErrorRate", undef, "Consensus expects alignments at about this error rate"); ##### Minimums and maximums setDefault("minReadLength", 1000, "Reads shorter than this length are not loaded into the assembler; default 1000"); setDefault("minOverlapLength", 500, "Overlaps shorter than this length are not computed; default 500"); setDefault("minMemory", undef, "Minimum
amount of memory needed to compute the assembly (do not set unless prompted!)"); setDefault("maxMemory", undef, "Maximum memory to use by any component of the assembler"); setDefault("minThreads", undef, "Minimum number of compute threads suggested to compute the assembly"); setDefault("maxThreads", undef, "Maximum number of compute threads to use by any component of the assembler"); ##### Stopping conditions setDefault("stopOnReadQuality", 1, "Stop if a significant portion of the input data is too short or has quality value or base composition errors"); setDefault("stopAfter", undef, "Stop after a specific algorithm step is completed"); ##### Grid Engine configuration, internal parameters. These are filled out in canu.pl, right after this function returns. setDefault("gridEngine", undef, "Grid engine configuration, not documented"); setDefault("gridEngineSubmitCommand", undef, "Grid engine configuration, not documented"); setDefault("gridEngineNameOption", undef, "Grid engine configuration, not documented"); setDefault("gridEngineArrayOption", undef, "Grid engine configuration, not documented"); setDefault("gridEngineArrayName", undef, "Grid engine configuration, not documented"); setDefault("gridEngineArrayMaxJobs", undef, "Grid engine configuration, not documented"); setDefault("gridEngineOutputOption", undef, "Grid engine configuration, not documented"); setDefault("gridEnginePropagateCommand", undef, "Grid engine configuration, not documented"); setDefault("gridEngineThreadsOption", undef, "Grid engine configuration, not documented"); setDefault("gridEngineMemoryOption", undef, "Grid engine configuration, not documented"); setDefault("gridEngineMemoryUnits", undef, "Grid engine configuration, not documented"); setDefault("gridEngineNameToJobIDCommand", undef, "Grid engine configuration, not documented"); setDefault("gridEngineNameToJobIDCommandNoArray", undef, "Grid engine configuration, not documented"); setDefault("gridEngineStageOption", undef, "Grid engine 
configuration, not documented"); setDefault("gridEngineTaskID", undef, "Grid engine configuration, not documented"); setDefault("gridEngineArraySubmitID", undef, "Grid engine configuration, not documented"); setDefault("gridEngineJobID", undef, "Grid engine configuration, not documented"); ##### Grid Engine Pipeline setDefault("useGrid", 1, "If 'true', enable grid-based execution; if 'false', run all jobs on the local machine; if 'remote', create jobs for grid execution but do not submit; default 'true'"); foreach my $c (qw(BAT GFA CNS COR MERYL CORMHAP CORMMAP COROVL OBTMHAP OBTMMAP OBTOVL OEA OVB OVS RED UTGMHAP UTGMMAP UTGOVL)) { setDefault("useGrid$c", 1, "If 'true', run module $c under grid control; if 'false' run locally."); } ##### Grid Engine configuration, for each step of the pipeline setDefault("gridOptions", undef, "Grid engine options applied to all jobs"); setDefault("gridOptionsExecutive", undef, "Grid engine options applied to the canu executive script"); setDefault("gridOptionsJobName", undef, "Grid jobs job-name suffix"); ##### Grid Engine configuration and parameters, for each step of the pipeline (memory, threads) setExecDefaults("meryl", "mer counting"); setExecDefaults("cor", "read correction"); setExecDefaults("corovl", "overlaps for correction"); setExecDefaults("obtovl", "overlaps for trimming"); setExecDefaults("utgovl", "overlaps for unitig construction"); setExecDefaults("cormhap", "mhap overlaps for correction"); setExecDefaults("obtmhap", "mhap overlaps for trimming"); setExecDefaults("utgmhap", "mhap overlaps for unitig construction"); setExecDefaults("cormmap", "mmap overlaps for correction"); setExecDefaults("obtmmap", "mmap overlaps for trimming"); setExecDefaults("utgmmap", "mmap overlaps for unitig construction"); setExecDefaults("ovb", "overlap store bucketizing"); setExecDefaults("ovs", "overlap store sorting"); setExecDefaults("red", "read error detection"); setExecDefaults("oea", "overlap error adjustment"); 
setExecDefaults("bat", "unitig construction"); setExecDefaults("cns", "unitig consensus"); setExecDefaults("gfa", "graph alignment and processing"); ##### Object Storage setDefault("objectStore", undef, "Type of object storage used; not ready for production yet"); setDefault("objectStoreClient", undef, "Path to the command line client used to access the object storage"); setDefault("objectStoreNameSpace", undef, "Object store parameters; specific to the type of objectStore used"); ##### Overlapper setOverlapDefaults("cor", "correction", "mhap"); # Overlaps computed for correction setOverlapDefaults("obt", "overlap based trimming", "ovl"); # Overlaps computed for trimming setOverlapDefaults("utg", "unitig construction", "ovl"); # Overlaps computed for unitigging ##### Overlap Store setDefault("ovsMethod", undef, "Use the 'sequential' or 'parallel' algorithm for constructing an overlap store; default 'sequential'"); ##### Mers setDefault("merylMemory", undef, "Amount of memory, in gigabytes, to use for mer counting"); setDefault("merylThreads", undef, "Number of threads to use for mer counting"); setDefault("merylConcurrency", undef, "Unused, there is only one process"); ##### Overlap Based Trimming setDefault("obtErrorRate", undef, "Stringency of overlaps to use for trimming"); setDefault("trimReadsOverlap", 1, "Minimum overlap between evidence to make contiguous trim; default '1'"); setDefault("trimReadsCoverage", 1, "Minimum depth of evidence to retain bases; default '1'"); #$global{"splitReads..."} = 1; #$synops{"splitReads..."} = ""; ##### Fragment/Overlap Error Correction setDefault("enableOEA", 1, "Do overlap error adjustment - comprises two steps: read error detection (RED) and overlap error adjustment (OEA); default 'true'"); setDefault("redBatchSize", undef, "Number of reads per fragment error detection batch"); setDefault("redBatchLength", undef, "Number of bases per fragment error detection batch"); setDefault("oeaBatchSize", undef, "Number of reads per 
overlap error correction batch"); setDefault("oeaBatchLength", undef, "Number of bases per overlap error correction batch"); ##### Unitigger & BOG & bogart Options setDefault("unitigger", "bogart", "Which unitig algorithm to use; only 'bogart' supported; default 'bogart'"); setDefault("genomeSize", undef, "An estimate of the size of the genome"); setDefault("batOptions", undef, "Advanced options to bogart"); setDefault("batMemory", undef, "Approximate maximum memory usage, in gigabytes, default is the maxMemory limit"); setDefault("batThreads", undef, "Number of threads to use; default is the maxThreads limit"); setDefault("batConcurrency", undef, "Unused, only one process supported"); setDefault("contigFilter", "2 0 1.0 0.5 5", "Parameters to filter out 'unassembled' unitigs. Five values: minReads minLength singleReadSpan lowCovFraction lowCovDepth"); ##### Consensus Options setDefault("cnsPartitions", undef, "Partition consensus into N jobs"); setDefault("cnsPartitionMin", undef, "Don't make a consensus partition with fewer than N reads"); setDefault("cnsMaxCoverage", 40, "Limit unitig consensus to at most this coverage ('0' = unlimited); default '40'"); setDefault("cnsConsensus", "pbdagcon", "Which consensus algorithm to use; 'pbdagcon' (fast, reliable); 'utgcns' (multialignment output); 'quick' (single read mosaic); default 'pbdagcon'"); ##### Correction Options setDefault("corPartitions", undef, "Partition read correction into N jobs"); setDefault("corPartitionMin", undef, "Don't make a read correction partition with fewer than N reads"); setDefault("corMinEvidenceLength", undef, "Limit read correction to only overlaps longer than this; default: unlimited"); setDefault("corMaxEvidenceErate", undef, "Limit read correction to only overlaps at or below this fraction error; default: unlimited"); setDefault("corMaxEvidenceCoverageGlobal", "1.0x", "Limit reads used for correction to supporting at most this coverage; default: '1.0x' = 1.0 * estimated coverage");
setDefault("corMaxEvidenceCoverageLocal", "2.0x", "Limit reads being corrected to at most this much evidence coverage; default: '2.0x' = 2.0 * estimated coverage"); setDefault("corOutCoverage", 40, "Only correct the longest reads up to this coverage; default 40"); setDefault("corMinCoverage", undef, "Minimum number of bases supporting each corrected base; corrected reads are split wherever coverage drops below this; default based on input read coverage: 0 if <= 30x, else 4"); setDefault("corFilter", "expensive", "Method to filter short reads from correction; 'quick' or 'expensive'; default 'expensive'"); setDefault("corConsensus", "falconpipe", "Which consensus algorithm to use; only 'falcon' and 'falconpipe' are supported; default 'falconpipe'"); setDefault("corLegacyFilter", undef, "Expert option: global filter, length * identity (default) or length, with ties broken by identity (if on)"); # Convert all the keys to lowercase, and remember the case-sensitive version foreach my $k (keys %synops) { (my $l = $k) =~ tr/A-Z/a-z/; $synnam{$l} = $k; # Remember that option $l is stylized as $k. next if (!exists($global{$k})); # If no option for this (it's a meta-option), skip. next if ( exists($global{$l})); # If lowercase already exists, skip. $global{$l} = $global{$k}; # Otherwise, set the lowercase option and delete $global{$k}; # delete the uppercase version } # If this is set, it breaks the consensus.sh and overlap.sh scripts. Good grief! Why # are you running this in a task array!? if (exists($ENV{getGlobal("gridEngineTaskID")})) { undef $ENV{getGlobal("gridEngineTaskID")}; print STDERR "ENV: ", getGlobal("gridEngineTaskID"), " needs to be unset, done.\n"; } } # Get the version information. Needs to be last so that pathMap can be defined.
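The key-normalization loop at the end of setDefaults() above is what makes canu option names case-insensitive: every key in %global is re-filed under its lowercase form, while %synnam keeps the stylized spelling for help and error output. A minimal standalone sketch of the same scheme, using a hypothetical two-option table in place of canu's real one:

```perl
#!/usr/bin/env perl
# Standalone sketch of the case-normalization done at the end of
# setDefaults().  The %global contents here are a hypothetical example;
# canu's real table holds every pipeline option.
use strict;
use warnings;

my %global = ("minReadLength" => 1000, "minOverlapLength" => 500);
my %synnam;

foreach my $k (keys %global) {       # 'keys' snapshots the key list, so
    (my $l = $k) =~ tr/A-Z/a-z/;     # mutating %global inside the loop is safe
    $synnam{$l} = $k;                # remember the stylized spelling
    next if ($l eq $k);              # already lowercase, nothing to move
    $global{$l} = $global{$k};       # re-file the value under the
    delete $global{$k};              # lowercase key
}

print "$global{'minreadlength'} ($synnam{'minreadlength'})\n";
```

Lookups then go through the lowercase key no matter how the user spells the option, and messages can still echo the canonical 'minReadLength' form via %synnam.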
sub setVersion ($) { my $bin = shift @_; my $version; open(F, "$bin/gatekeeperCreate --version 2>&1 |"); while (<F>) { $version = $_; chomp $version; } close(F); $global{'version'} = $version; } sub checkJava () { return if ((getGlobal("corOverlapper") ne "mhap") && (getGlobal("obtOverlapper") ne "mhap") && (getGlobal("utgOverlapper") ne "mhap")); my $java = getGlobal("java"); my $versionStr = "unknown"; my $version = 0; # Argh, we can't use runCommand() here, because we're included in Execution.pm. Try to check # it with -x. Nope. Fails if $java == "java". #if (! -x $java) { # addCommandLineError("ERROR: java executable '$java' not found or not executable\n"); #} open(F, "$java -Xmx1g -showversion 2>&1 |"); while (<F>) { # First word is either "java" or "openjdk" or ... if (m/^.*\s+version\s+\"(\d+\.\d+)(.*)\".*$/) { $versionStr = "$1$2"; $version = $1; } } close(F); if ($version < 1.8) { addCommandLineError("ERROR: mhap overlapper requires java version at least 1.8.0; you have $versionStr (from '$java').\n"); addCommandLineError("ERROR: '$java -showversion' reports:\n"); open(F, "$java -showversion 2>&1 |"); while (<F>) { chomp; addCommandLineError("ERROR: '$_'\n"); } close(F); } else { print STDERR "-- Detected Java(TM) Runtime Environment '$versionStr' (from '$java').\n"; } } sub checkGnuplot () { return if (getGlobal("gnuPlotTested") == 1); my $gnuplot = getGlobal("gnuplot"); my $format = getGlobal("gnuplotImageFormat"); my $version = undef; # Check for existence of gnuplot. open(F, "$gnuplot -V |"); while (<F>) { chomp; $version = $_; $version = $1 if ($version =~ m/^gnuplot\s+(.*)$/); } close(F); if (!defined($version)) { addCommandLineError("ERROR: Failed to run gnuplot from '$gnuplot'."); addCommandLineError("ERROR: Set option gnuplot= or gnuplotTested=true to skip this test and not generate plots.\n"); return; } # Check for existence of a decent output format. Need to redirect in /dev/null to make gnuplot # not use its builtin pager.
if (!defined($format)) { my $havePNG = 0; my $haveSVG = 0; my $haveGIF = 0; open(F, "> /tmp/gnuplot-$$-test.gp"); print F "set terminal\n"; close(F); system("cd /tmp && $gnuplot < /dev/null /tmp/gnuplot-$$-test.gp > /tmp/gnuplot-$$-test.err 2>&1"); open(F, "< /tmp/gnuplot-$$-test.err"); while (<F>) { s/^\s+//; s/\s+$//; my @t = split '\s+', $_; $havePNG = 1 if ($t[0] eq 'png'); $haveSVG = 1 if ($t[0] eq 'svg'); $haveGIF = 1 if ($t[0] eq 'gif'); } close(F); $format = "gif" if ($haveGIF); $format = "svg" if ($haveSVG); $format = "png" if ($havePNG); setGlobal("gnuplotImageFormat", $format); unlink "/tmp/gnuplot-$$-test.gp"; unlink "/tmp/gnuplot-$$-test.err"; } if (!defined($format)) { addCommandLineError("ERROR: Failed to detect a suitable output format for gnuplot.\n"); addCommandLineError("ERROR: Looked for png, svg and gif, found none of them.\n"); addCommandLineError("Set option gnuplotImageFormat=, or gnuplotTested=true to skip this test and not generate plots.\n"); return; } # Test if we can actually make images. open(F, "> /tmp/gnuplot-$$-test.gp"); print F "set title 'gnuplot test'\n"; print F "set xlabel 'X'\n"; print F "set ylabel 'Y'\n"; print F "\n"; print F "set terminal $format size 1024,1024\n"; print F "set output '/tmp/gnuplot-$$-test.1.$format'\n"; print F "\n"; print F "plot [-30:20] sin(x*20) * atan(x)\n\n"; print F "\n"; print F "set terminal $format size 256,256\n"; print F "set output '/tmp/gnuplot-$$-test.2.$format'\n"; print F "\n"; print F "bogus line\n"; close(F); # Dang, we don't have runCommandSilently here, so have to do it the hard way. system("cd /tmp && $gnuplot < /dev/null /tmp/gnuplot-$$-test.gp > /tmp/gnuplot-$$-test.err 2>&1"); if ((! -e "/tmp/gnuplot-$$-test.1.$format") || (!
-e "/tmp/gnuplot-$$-test.2.$format")) { addCommandLineError("ERROR: gnuplot failed to generate images.\n"); open(F, "< /tmp/gnuplot-$$-test.err"); while (<F>) { chomp; addCommandLineError("ERROR: gnuplot reports: $_\n"); } close(F); addCommandLineError("ERROR: Set option gnuplotImageFormat=, or gnuplotTested=true to skip this test and not generate plots.\n"); return; } # Yay, gnuplot works! print STDERR "-- Detected gnuplot version '$version' (from '$gnuplot') and image format '$format'.\n"; #addCommandLineOption("gnuplotTested=1"); unlink "/tmp/gnuplot-$$-test.gp"; unlink "/tmp/gnuplot-$$-test.err"; unlink "/tmp/gnuplot-$$-test.1.$format"; unlink "/tmp/gnuplot-$$-test.2.$format"; } sub checkParameters () { # # Fiddle with filenames to make them absolute paths. # makeAbsolute("corOvlFrequentMers"); makeAbsolute("obtOvlFrequentMers"); makeAbsolute("utgOvlFrequentMers"); # # Adjust case on some of them # fixCase("corOverlapper"); fixCase("obtOverlapper"); fixCase("utgOverlapper"); fixCase("corConsensus"); fixCase("cnsConsensus"); fixCase("corFilter"); fixCase("unitigger"); fixCase("stopAfter"); # # Well, crud. 'gridEngine' wants to be uppercase, not lowercase like fixCase() would do. # $global{"gridengine"} =~ tr/a-z/A-Z/; # NOTE: lowercase 'gridengine' # # Check for inconsistent parameters # # Genome size isn't properly decoded until later, but we want to fail quickly. So, just test if # a unitless number is supplied, and if that number is tiny. { my $gs = getGlobal("genomeSize"); if (!defined($gs)) { addCommandLineError("ERROR: Required parameter 'genomeSize' not set.\n"); } if (($gs =~ m/^(\d+)$/) || ($gs =~ m/^(\d+\.\d+)$/)) { if ($gs < 1000) { addCommandLineError("ERROR: Implausibly small genome size $gs.
Check units!\n"); } } } foreach my $var ("corOvlErrorRate", "obtOvlErrorRate", "utgOvlErrorRate", "corErrorRate", "obtErrorRate", "utgErrorRate", "cnsErrorRate") { if (!defined(getGlobal($var))) { addCommandLineError("ERROR: Invalid '$var' specified; must be set\n"); } elsif (getGlobal($var) !~ m/^[.-0123456789]/) { addCommandLineError("ERROR: Invalid '$var' specified (" . getGlobal("$var") . "); must be numeric\n"); } elsif ((getGlobal($var) < 0.0) || (getGlobal($var) > 1.0)) { addCommandLineError("ERROR: Invalid '$var' specified (" . getGlobal("$var") . "); must be at least 0.0 and no more than 1.0\n"); } } if (getGlobal("minReadLength") < getGlobal("minOverlapLength")) { my $mr = getGlobal("minReadLength"); my $mo = getGlobal("minOverlapLength"); addCommandLineError("ERROR: minReadLength=$mr must be at least minOverlapLength=$mo.\n"); } foreach my $var ("corOutCoverage") { if (!defined(getGlobal($var))) { addCommandLineError("ERROR: Invalid 'corOutCoverage' specified; must be at least 1.0\n"); } elsif (getGlobal($var) =~ m/all/i) { setGlobal($var, 9999); } elsif (getGlobal($var) !~ m/^[.-0123456789]/) { addCommandLineError("ERROR: Invalid '$var' specified (" . getGlobal("$var") . "); must be numeric\n"); } elsif (getGlobal($var) < 1.0) { addCommandLineError("ERROR: Invalid '$var' specified (" . getGlobal("$var") . "); must be at least 1.0\n"); } } foreach my $var ("corMaxEvidenceCoverageGlobal", "corMaxEvidenceCoverageLocal") { if (!defined(getGlobal($var))) { # If undef, defaults to corOutCoverage in CorrectReads.pm } elsif (getGlobal($var) =~ m/^(\d*\.*\d*)(x*)$/) { if (($1 < 1.0) && ($2 ne "x")) { addCommandLineError("ERROR: Invalid '$var' specified (" . getGlobal("$var") . "); must be at least 1.0\n"); } } else { addCommandLineError("ERROR: Invalid '$var' specified (" . getGlobal("$var") . 
"); must be numeric\n"); } } foreach my $var ("utgGraphDeviation", "utgRepeatDeviation", "utgRepeatConfusedBP", "minReadLength", "minOverlapLength") { if (!defined(getGlobal($var))) { addCommandLineError("ERROR: Invalid '$var' specified; must be set\n"); } elsif (getGlobal($var) !~ m/^[.-0123456789]/) { addCommandLineError("ERROR: Invalid '$var' specified (" . getGlobal("$var") . "); must be numeric\n"); } elsif (getGlobal($var) < 0.0) { addCommandLineError("ERROR: Invalid '$var' specified (" . getGlobal("$var") . "); must be at least 0.0\n"); } } # # Check for invalid usage # foreach my $tag ("cor", "obt", "utg") { if ((getGlobal("${tag}Overlapper") ne "mhap") && (getGlobal("${tag}Overlapper") ne "ovl") && (getGlobal("${tag}Overlapper") ne "minimap")) { addCommandLineError("ERROR: Invalid '${tag}Overlapper' specified (" . getGlobal("${tag}Overlapper") . "); must be 'mhap', 'ovl', or 'minimap'\n"); } } foreach my $tag ("cor", "obt", "utg") { if (getGlobal("${tag}MhapSensitivity") eq "fast") { print STDERR "WARNING: deprecated ${tag}MhapSensitivity=fast replaced with ${tag}MhapSensitivity=low\n"; } if (defined(getGlobal("${tag}MhapSensitivity")) && (getGlobal("${tag}MhapSensitivity") ne "low") && (getGlobal("${tag}MhapSensitivity") ne "normal") && (getGlobal("${tag}MhapSensitivity") ne "high")) { addCommandLineError("ERROR: Invalid '${tag}MhapSensitivity' specified (" . getGlobal("${tag}MhapSensitivity") . "); must be 'low', 'normal' or 'high'\n"); } } if ((getGlobal("ovsMethod") ne "sequential") && (getGlobal("ovsMethod") ne "parallel")) { addCommandLineError("ERROR: Invalid 'ovsMethod' specified (" . getGlobal("ovsMethod") . "); must be 'sequential' or 'parallel'\n"); } if ((getGlobal("useGrid") eq "0") && (getGlobal("ovsMethod") eq "parallel")) { addCommandLineError("ERROR: ovsMethod=parallel requires useGrid=true or useGrid=remote.
Set ovsMethod=sequential if no grid is available\n"); } if ((getGlobal("unitigger") ne "unitigger") && (getGlobal("unitigger") ne "bogart")) { addCommandLineError("ERROR: Invalid 'unitigger' specified (" . getGlobal("unitigger") . "); must be 'unitigger' or 'bogart'\n"); } if ((getGlobal("corConsensus") ne "utgcns") && (getGlobal("corConsensus") ne "falcon") && (getGlobal("corConsensus") ne "falconpipe")) { addCommandLineError("ERROR: Invalid 'corConsensus' specified (" . getGlobal("corConsensus") . "); must be 'utgcns' or 'falcon' or 'falconpipe'\n"); } if ((getGlobal("cnsConsensus") ne "quick") && (getGlobal("cnsConsensus") ne "pbdagcon") && (getGlobal("cnsConsensus") ne "utgcns")) { addCommandLineError("ERROR: Invalid 'cnsConsensus' specified (" . getGlobal("cnsConsensus") . "); must be 'quick', 'pbdagcon', or 'utgcns'\n"); } if ((!defined(getGlobal("lowCoverageAllowed")) && defined(getGlobal("lowCoverageDepth"))) || ( defined(getGlobal("lowCoverageAllowed")) && !defined(getGlobal("lowCoverageDepth")))) { addCommandLineError("ERROR: Invalid 'lowCoverageAllowed' and 'lowCoverageDepth' specified; both must be set\n"); } if ((getGlobal("saveOverlaps") ne "0") && (getGlobal("saveOverlaps") ne "stores") && (getGlobal("saveOverlaps") ne "1")) { addCommandLineError("ERROR: Invalid 'saveOverlaps' specified (" . getGlobal("saveOverlaps") . "); must be 'false', 'stores', or 'true'\n"); } if ((getGlobal("corFilter") ne "quick") && (getGlobal("corFilter") ne "expensive") && (getGlobal("corFilter") ne "none")) { addCommandLineError("ERROR: Invalid 'corFilter' specified (" . getGlobal("corFilter") . "); must be 'none' or 'quick' or 'expensive'\n"); } if ((getGlobal("useGrid") ne "0") && (getGlobal("useGrid") ne "1") && (getGlobal("useGrid") ne "remote")) { addCommandLineError("ERROR: Invalid 'useGrid' specified (" . getGlobal("useGrid") .
"); must be 'true', 'false' or 'remote'\n"); } if (defined(getGlobal("stopAfter"))) { my $ok = 0; my $st = getGlobal("stopAfter"); $st =~ tr/A-Z/a-z/; my $failureString = "ERROR: Invalid stopAfter specified (" . getGlobal("stopAfter") . "); must be one of:\n"; my @stopAfter = ("gatekeeper", "meryl", "overlapConfigure", "overlap", "overlapStoreConfigure", "overlapStore", "readCorrection", "readTrimming", "unitig", "consensusConfigure", "consensusCheck", "consensusLoad", "consensusAnalyze"); foreach my $sa (@stopAfter) { $failureString .= "ERROR: '$sa'\n"; $sa =~ tr/A-Z/a-z/; if ($st eq $sa) { $ok++; setGlobal('stopAfter', $st); } } addCommandLineError($failureString) if ($ok == 0); } { my @v = split '\s+', getGlobal("contigFilter"); if (scalar(@v) != 5) { addCommandLineError("contigFilter must have five values: minReads minLength singleReadSpan lowCovFraction lowCovDepth\n"); } addCommandLineError("contigFilter 'minReads' must be a positive integer, currently $v[0]\n") if (($v[0] < 0) || ($v[0] !~ m/^[0-9]+$/)); addCommandLineError("contigFilter 'minLength' must be a positive integer, currently $v[1]\n") if (($v[1] < 0) || ($v[1] !~ m/^[0-9]+$/)); addCommandLineError("contigFilter 'singleReadSpan' must be between 0.0 and 1.0, currently $v[2]\n") if (($v[2] < 0) || (1 < $v[2]) || ($v[2] !~ m/^[0-9]*\.{0,1}[0-9]*$/)); addCommandLineError("contigFilter 'lowCovFraction' must be between 0.0 and 1.0, currently $v[3]\n") if (($v[3] < 0) || (1 < $v[3]) || ($v[3] !~ m/^[0-9]*\.{0,1}[0-9]*$/)); addCommandLineError("contigFilter 'lowCovDepth' must be a positive integer, currently $v[4]\n") if (($v[4] < 0) || ($v[4] !~ m/^[0-9]+$/)); } # # Minimap, no valid identities, set legacy # if (getGlobal("corOverlapper") eq "minimap") { setGlobalIfUndef("corLegacyFilter", 1); } } 1; canu-1.6/src/pipelines/canu/ErrorEstimate.pm ############################################################################### #
# This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Sergey Koren beginning on 2016-MAY-16 # are a 'United States Government Work', and # are released in the public domain # # Brian P. Walenz beginning on 2016-DEC-12 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## package canu::ErrorEstimate; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(estimateKmerError estimateRawError estimateCorrectedError uniqueKmerThreshold); use strict; use POSIX qw(floor); use File::Path 2.08 qw(make_path remove_tree); use canu::Defaults; use canu::Execution; use canu::Gatekeeper; use canu::HTML; # # NOTE!!! This was made to compile ONLY. All the paths will be wrong. runCommand() is expecting # commands to be run in the directory they operate on, but this was written to run with absolute # paths. 
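The subroutines below implement a Poisson model: uniqueKmerThreshold() raises the k-mer count cutoff while the cumulative probability of discarding a true k-mer stays within the allowed loss. A standalone sketch of that accumulation, using an assumed effective coverage of 20x and a 1% loss budget in place of the values canu derives from read length, mer size, and the estimated raw error rate:

```perl
#!/usr/bin/env perl
# Standalone sketch of the Poisson threshold search in uniqueKmerThreshold()
# below.  The effective coverage (20x) and loss budget (1%) are assumed
# example values, not canu's computed ones.
use strict;
use warnings;

sub fac { my $x = shift; return ($x <= 1) ? 1 : $x * fac($x - 1); }

sub poisson_pdf {
    my ($lambda, $k) = @_;
    return (($lambda ** $k) * exp(-$lambda)) / fac($k);
}

my $effective_coverage = 20;     # assumed: expected count of a true k-mer
my $loss               = 0.01;   # tolerate losing 1% of true k-mers

my $threshold = 0;
my $kmer_loss = poisson_pdf($effective_coverage, 0);

# Keep raising the cutoff while the Poisson mass at or below it fits the budget.
while ($kmer_loss + poisson_pdf($effective_coverage, $threshold + 1) <= $loss) {
    $threshold++;
    $kmer_loss += poisson_pdf($effective_coverage, $threshold);
}

printf "threshold=%d  cumulative loss=%.5f\n", $threshold, $kmer_loss;
```

K-mers seen at most $threshold times are then treated as likely sequencing errors; with deeper effective coverage the cutoff rises, since a true k-mer becomes very unlikely to appear only a handful of times.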
# sub fac($) { my $x = shift @_; return 1 if($x == 0); return 1 if($x == 1); return $x * fac($x - 1); } sub poisson_pdf ($$) { my $lambda = shift @_; my $k = shift @_; return ( ( ($lambda ** $k) * exp(-$lambda) ) / fac($k) ); } sub uniqueKmerThreshold($$$$) { my $path = shift @_; my $asm = shift @_; my $merSize = shift @_; my $loss = shift @_; my $bin = getBinDirectory(); my $errorRate = estimateRawError($path, $asm, "cor", $merSize); my $readLength = getNumberOfBasesInStore($path, $asm) / getNumberOfReadsInStore ($path, $asm); my $effective_coverage = getExpectedCoverage($path, $asm) * ( ($readLength - $merSize + 1)/$readLength ) * (1 - $errorRate) ** $merSize; my $threshold = 0; my $kMer_loss = poisson_pdf($effective_coverage, 0); return 1 if($kMer_loss > $loss); my $keepTrying = 1; while($keepTrying) { $keepTrying = 0; my $p_true_kMers_threshold_p1 = poisson_pdf($effective_coverage, $threshold+1); if(($kMer_loss + $p_true_kMers_threshold_p1) <= $loss) { $threshold++; $kMer_loss += $p_true_kMers_threshold_p1; $keepTrying = 1; } } return ($threshold == 0 ? 1 : $threshold); } sub computeSampleSize($$$$$) { my $path = shift @_; my $asm = shift @_; my $tag = shift @_; my $percent = shift @_; my $coverage = shift @_; my $sampleSize = 0; my $minSampleSize = 100; my $maxSampleSize = getGlobal("${tag}MhapBlockSize") * 4; if (defined($percent)) { $sampleSize = int($percent * getNumberOfReadsInStore ($path, $asm))+1; $sampleSize++ if ($sampleSize % 2 != 0); } elsif (defined($coverage)) { $sampleSize = int(($coverage * getGlobal("genomeSize")) / (getNumberOfBasesInStore($path, $asm) / getNumberOfReadsInStore ($path, $asm))) + 1; } $sampleSize = $maxSampleSize if (defined($percent) && $sampleSize > $maxSampleSize); return $sampleSize < $minSampleSize ? 
$minSampleSize : $sampleSize; } sub runMHAP($$$$$$$$$$$$) { my ($path, $tag, $numHashes, $minNumMatches, $threshold, $ordSketch, $ordSketchMer, $sampleSize, $hash, $query, $out, $err) = @_; my $filterThreshold = getGlobal("${tag}MhapFilterThreshold"); my $merSize = getGlobal("${tag}MhapMerSize"); my $javaPath = getGlobal("java"); my $bin = getBinDirectory(); print STDERR "--\n"; print STDERR "-- PARAMETERS: hashes=$numHashes, minMatches=$minNumMatches, threshold=$threshold\n"; print STDERR "--\n"; my $cmd = "$javaPath -d64 -server -Xmx4g -jar $bin/mhap-" . getGlobal("${tag}MhapVersion") . ".jar "; $cmd .= " --no-self --repeat-weight 0.9 -k $merSize --num-hashes $numHashes --num-min-matches $minNumMatches --ordered-sketch-size $ordSketch --ordered-kmer-size $ordSketchMer --threshold $threshold --filter-threshold $filterThreshold --num-threads " . getGlobal("${tag}mhapThreads"); $cmd .= " -s $hash -q $query 2> /dev/null | awk '{if (\$1 != \$2+$sampleSize) { print \$0}}' | $bin/errorEstimate -d 2 -m 0.95 -S - > $out 2> $err"; runCommand($path, $cmd); } sub estimateRawError($$$$) { my $base = "correction"; my $path = shift @_; my $asm = shift @_; my $tag = shift @_; my $merSize = shift @_; my $bin = getBinDirectory(); my $numReads = getNumberOfReadsInStore ($path, $asm); goto allDone if (skipStage($asm, "errorEstimate") == 1); goto allDone if (-e "$path/asm.gkpStore/raw.estimate.out"); goto allDone if (getGlobal("errorrate") > 0); my ($numHashes, $minNumMatches, $threshold, $ordSketch, $ordSketchMer); $numHashes = 10000; $minNumMatches = 3; $threshold = 0.65; $ordSketch = 10000; $ordSketchMer = getGlobal("${tag}MhapOrderedMerSize"); # subsample raw reads my $sampleSize = computeSampleSize($path, $asm, $tag, 0.01, undef); $sampleSize /= 2; my $cmd = "$bin/gatekeeperDumpFASTQ -G $base/$asm.gkpStore -nolibname -fasta -r 1-$sampleSize -o - > $base/$asm.gkpStore/subset.fasta 2> /dev/null"; runCommandSilently($path, $cmd, 1); my $min = $numReads - $sampleSize + 1; my $cmd = 
"$bin/gatekeeperDumpFASTQ -G $base/$asm.gkpStore -nolibname -fasta -r $min-$numReads -o - >> $base/$asm.gkpStore/subset.fasta 2> /dev/null"; runCommandSilently($path, $cmd, 1); my $querySize = computeSampleSize($path, $asm, $tag, undef, 2); my $cmd = "$bin/gatekeeperDumpFASTQ -G $base/$asm.gkpStore -nolibname -fasta -r 1-$querySize -o - > $base/$asm.gkpStore/reads.fasta 2> /dev/null"; runCommandSilently($path, $cmd, 1); print STDERR "--\n"; print STDERR "-- ESTIMATOR (mhap) (raw) (hash sample size=". ($sampleSize*2) . ") (query sample size=$querySize)\n"; runMHAP($path, $tag, $numHashes, $minNumMatches, $threshold, $ordSketch, $ordSketchMer, $sampleSize*2, "$base/$asm.gkpStore/subset.fasta", "$base/$asm.gkpStore/reads.fasta", "$base/$asm.gkpStore/raw.estimate.out", "$base/$asm.gkpStore/raw.estimate.err"); unlink("$base/$asm.gkpStore/subset.fasta"); unlink("$base/$asm.gkpStore/reads.fasta"); allDone: return 0.15 if (! -e "$base/$asm.gkpStore/raw.estimate.out"); my $errorRate = 0; open(L, "< $base/$asm.gkpStore/raw.estimate.out") or caExit("can't open '$base/$asm.gkpStore/raw.estimate.out' for reading: $!", undef); while () { $errorRate = sprintf "%.3f", ($_ / 2); $errorRate = 0.15 if ($errorRate <= 0.005); } close(L); return $errorRate; } # Map subset of reads to long reads with mhap. # Compute resulting distribution and estimate error rate sub estimateCorrectedError ($$) { my $asm = shift @_; my $tag = shift @_; my $bin = getBinDirectory(); # DISABLED 2016 JAN 27 when 'errorRate' was replaced with 'correctedErrorRate'. # # A new option needs to be added to enable this explicitly. # This doesn't work on grid either (the set error rates are lost on the next restart). 
return; my $base = "correction"; my $path = "correction/3-estimator"; # only run if we aren't done and were asked to goto allDone if (skipStage($asm, "errorEstimate") == 1); goto allDone if (-e "$base/$asm.estimate.out"); goto allDone if (getGlobal("errorrate") > 0); # Mhap parameters - filterThreshold needs to be a string, else it is printed as 5e-06. # my ($numHashes, $minNumMatches, $threshold, $ordSketch, $ordSketchMer); $numHashes = 256; $minNumMatches = 4; $threshold = 0.85; $ordSketch = 1000; $ordSketchMer = getGlobal("${tag}MhapOrderedMerSize") + 2; make_path("$path"); # subsample corrected reads, this assumes the fasta records are on a single line. We take some reads from the top and bottom of file to avoid sampling one library my $sampleSize = computeSampleSize("correction", $asm, $tag, 0.01, undef); my $cmd = "gunzip -c correction/asm.correctedReads.fasta.gz |head -n $sampleSize > $path/subset.fasta"; runCommandSilently($path, $cmd, 1); my $cmd = "gunzip -c correction/asm.correctedReads.fasta.gz |tail -n $sampleSize >> $path/subset.fasta"; runCommandSilently($path, $cmd, 1); my $querySize = computeSampleSize("correction", $asm, $tag, undef, 2); my $cmd = "gunzip -c correction/asm.correctedReads.fasta.gz |head -n $querySize > $path/reads.fasta"; runCommandSilently($path, $cmd, 1); my $cmd = "gunzip -c correction/asm.correctedReads.fasta.gz |tail -n $querySize >> $path/reads.fasta"; runCommandSilently($path, $cmd, 1); # now compute the overlaps print STDERR "--\n"; print STDERR "-- ESTIMATOR (mhap) (corrected) (hash sample size=$sampleSize) (query sample size=$querySize)\n"; runMHAP("correction", $tag, $numHashes, $minNumMatches, $threshold, $ordSketch, $ordSketchMer, $sampleSize, "$path/subset.fasta", "$path/reads.fasta", "$base/$asm.estimate.out", "$base/$asm.estimate.err"); unlink("$path/subset.fasta"); unlink("$path/reads.fasta"); allDone: return if (! 
-e "$base/$asm.estimate.out"); my $errorRate = 0; open(L, "< $base/$asm.estimate.out") or caExit("can't open '$base/$asm.estimate.out' for reading: $!", undef); while () { $errorRate = sprintf "%.3f", ($_ / 2); } close(L); print STDERR "-- \n"; if ($errorRate > 0.13) { print STDERR "-- Estimated error rate: " . ($errorRate*100) . "% > " . (0.13 * 100) . "% limit, capping it.\n"; $errorRate = 0.13; } elsif ($errorRate < 0.005) { print STDERR "-- Estimated error rate: " . ($errorRate*100) . "%, increasing to " . (0.005 * 100). "%.\n"; $errorRate = 0.005; } else { print STDERR "-- Estimated error rate: " . ($errorRate * 100) . "%.\n"; } setErrorRate($errorRate, 1); showErrorRates("-- "); print STDERR "-- \n"; } canu-1.6/src/pipelines/canu/Execution.pm000066400000000000000000001341221314437614700203310ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # This file is derived from: # # kmer/ESTmapper/scheduler.pm # kmer/scripts/libBri.pm # kmer/scripts/scheduler.pm # src/pipelines/ca3g/Execution.pm # # Modifications by: # # Brian P. Walenz from 2003-JAN-03 to 2003-NOV-11 # are Copyright 2003 Applera Corporation, and # are subject to the GNU General Public License version 2 # # Brian P. Walenz on 2004-MAR-22 # are Copyright 2004 Brian P. Walenz, and # are subject to the GNU General Public License version 2 # # Brian P. Walenz from 2006-APR-07 to 2011-DEC-28 # are Copyright 2006,2008-2009,2011 J. 
Craig Venter Institute, and # are subject to the GNU General Public License version 2 # # Brian P. Walenz from 2015-FEB-27 to 2015-SEP-11 # are Copyright 2015 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. Walenz beginning on 2015-NOV-03 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2015-NOV-25 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## package canu::Execution; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(stopAfter skipStage emitStage touch makeExecutable getInstallDirectory getJobIDShellCode getLimitShellCode getBinDirectory getBinDirectoryShellCode setWorkDirectory setWorkDirectoryShellCode submitScript submitOrRunParallelJob runCommand runCommandSilently findCommand findExecutable caExit caFailure); use strict; use Config; # for @signame use Cwd qw(getcwd); use Carp qw(longmess); use POSIX ":sys_wait_h"; # For waitpid(..., &WNOHANG) use List::Util qw(min max); use File::Path 2.08 qw(make_path remove_tree); use File::Spec; use canu::Defaults; use canu::Report qw(generateReport); # # Functions for running multiple processes at the same time. This is private to the module. 
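The private scheduler below queues command strings, forks them into a fixed number of slots, and reaps finished children with a non-blocking waitpid. The same queue-fill-reap pattern, sketched in Python with `subprocess` (an illustrative analogue, not canu code; `run_queue` is a hypothetical name):

```python
import subprocess
import sys
import time

def run_queue(commands, slots):
    """Run each command (an argv list) with at most 'slots' alive at once.

    Mirrors the scheduler below: schedulerSubmit() queues work,
    schedulerRun() reaps finished children and fills free slots, and
    schedulerFinish() blocks until everything completes.
    """
    queue = list(commands)
    running = []
    launched = 0
    while queue or running:
        # Reap any processes that have finished (cf. waitpid with WNOHANG).
        running = [p for p in running if p.poll() is None]
        # Fill free slots from the queue.
        while queue and len(running) < slots:
            running.append(subprocess.Popen(queue.pop(0)))
            launched += 1
        time.sleep(0.01)   # avoid a busy spin; the Perl blocks in waitpid instead
    return launched
```

The Perl version additionally lets a configurable number of jobs remain running at exit (`$numberOfProcessesToWait`); this sketch always drains the queue completely.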
# my $numberOfProcesses = 0; # Number of jobs concurrently running my $numberOfProcessesToWait = 0; # Number of jobs we can leave running at exit my @processQueue = (); my @processesRunning = (); my $printProcessCommand = 1; # Show commands as they run sub schedulerSetNumberOfProcesses { $numberOfProcesses = shift @_; } sub schedulerSubmit ($) { my $cmd = shift @_; chomp $cmd; push @processQueue, $cmd; } sub schedulerForkProcess ($) { my $process = shift @_; my $pid; # From Programming Perl, page 167 FORK: { if ($pid = fork) { # Parent # return($pid); } elsif (defined $pid) { # Child # exec($process); } elsif ($! =~ /No more processes/) { # EAGIN, supposedly a recoverable fork error sleep 1; redo FORK; } else { die "Can't fork: $!\n"; } } } sub schedulerReapProcess ($) { my $pid = shift @_; if (waitpid($pid, &WNOHANG) > 0) { return(1); } else { return(0); } } sub schedulerRun () { my @running; # Reap any processes that have finished foreach my $i (@processesRunning) { push @running, $i if (schedulerReapProcess($i) == 0); } @processesRunning = @running; # Run processes in any available slots while ((scalar(@processesRunning) < $numberOfProcesses) && (scalar(@processQueue) > 0)) { my $process = shift @processQueue; print STDERR " $process\n"; push @processesRunning, schedulerForkProcess($process); } } sub schedulerFinish ($$) { my $dir = shift @_; my $nam = shift @_; my $child; my @newProcesses; my $remain; $remain = scalar(@processQueue); my $startsecs = time(); my $diskfree = (defined($dir)) ? 
(diskSpace($dir)) : (0); print STDERR "----------------------------------------\n"; print STDERR "-- Starting '$nam' concurrent execution on ", scalar(localtime()), " with $diskfree GB free disk space ($remain processes; $numberOfProcesses concurrently)\n" if (defined($dir)); print STDERR "-- Starting '$nam' concurrent execution on ", scalar(localtime()), " ($remain processes; $numberOfProcesses concurrently)\n" if (!defined($dir)); print STDERR "\n"; print STDERR " cd $dir\n"; my $cwd = getcwd(); # Remember where we are. chdir($dir); # So we can root the jobs in the correct location. # Run all submitted jobs # while ($remain > 0) { schedulerRun(); $remain = scalar(@processQueue); if ($remain > 0) { $child = waitpid -1, 0; undef @newProcesses; foreach my $i (@processesRunning) { push @newProcesses, $i if ($child != $i); } undef @processesRunning; @processesRunning = @newProcesses; } } # Wait for them to finish, if requested # while (scalar(@processesRunning) > $numberOfProcessesToWait) { waitpid(shift @processesRunning, 0); } chdir($cwd); $diskfree = (defined($dir)) ? (diskSpace($dir)) : (0); my $warning = " !!! WARNING !!!" 
if ($diskfree < 10); my $elapsed = time() - $startsecs; $elapsed = "lickety-split" if ($elapsed eq "0"); $elapsed = "$elapsed second" if ($elapsed eq "1"); $elapsed = "$elapsed seconds" if ($elapsed > 1); print STDERR "\n"; print STDERR "-- Finished on ", scalar(localtime()), " ($elapsed) with $diskfree GB free disk space$warning\n"; print STDERR "----------------------------------------\n"; } # # File Management # sub touch ($@) { open(F, "> $_[0]") or caFailure("failed to touch file '$_[0]'", undef); print F "$_[1]\n" if (defined($_[1])); close(F); } sub makeExecutable ($) { my $file = shift @_; chmod(0755 & ~umask(), $file); } # # State management # sub stopAfter ($) { my $stopAfter = shift @_; $stopAfter =~ tr/A-Z/a-z/; if ((defined($stopAfter)) && (defined(getGlobal("stopAfter"))) && (getGlobal("stopAfter") eq $stopAfter)) { print STDERR "Stop requested after '$stopAfter'.\n"; exit(0); } } sub emitStage ($$@) { my $asm = shift @_; my $stage = shift @_; my $attempt = shift @_; generateReport($asm); if (!defined($attempt)) { print STDERR "-- Finished stage '$stage', reset canuIteration.\n"; setGlobal("canuIteration", 0); } } sub skipStage ($$@) { return(0); } # Decide what bin directory to use. # # When we are running on the grid, the path of this perl script is NOT always the correct # architecture. If the submission host is FreeBSD, but the grid is Linux, the BSD box will submit # FreeBSD/bin/canu to the grid. Unless it knows which grid host it will run on in advance, there # is no way to pick the correct one. The grid host then has to have enough smarts to choose the # correct binaries, and that is what we're doing here. # # To make it more trouble, shell scripts need to do all this by themselves. # sub getInstallDirectory () { my $installDir = $FindBin::RealBin; if ($installDir =~ m!^(.*)/\w+-\w+/bin$!) { $installDir = $1; } return($installDir); } # Emits a block of shell code to parse the grid task id and offset. 
# Expects zero or one argument, which is interpreted different in grid and non-grid mode. # Off grid - the job to run # On grid - an offset to add to SGE_TASK_ID or SLURM_ARRAY_TASK_ID to compute the job to run # # PBSPro refuses to run an array job with one element. They're submitted as a normal job. Here, # we check if it is running on the grid and if the task ID (aka, array ID) isn't set. If so, we # assume it is job 1. # sub getJobIDShellCode () { my $string; my $taskenv = getGlobal('gridEngineTaskID'); $string .= "# Discover the job ID to run, from either a grid environment variable and a\n"; $string .= "# command line offset, or directly from the command line.\n"; $string .= "#\n"; $string .= "if [ x\$PBS_JOBID != x -a x\$$taskenv = x ]; then\n" if (uc(getGlobal("gridEngine")) eq "PBSPRO"); $string .= " $taskenv=1\n" if (uc(getGlobal("gridEngine")) eq "PBSPRO"); $string .= "fi\n" if (uc(getGlobal("gridEngine")) eq "PBSPRO"); $string .= "if [ x\$$taskenv = x -o x\$$taskenv = xundefined -o x\$$taskenv = x0 ]; then\n"; $string .= " baseid=\$1\n"; # Off grid $string .= " offset=0\n"; $string .= "else\n"; $string .= " baseid=\$$taskenv\n"; # On Grid $string .= " offset=\$1\n"; $string .= "fi\n"; $string .= "if [ x\$offset = x ]; then\n"; $string .= " offset=0\n"; $string .= "fi\n"; $string .= "if [ x\$baseid = x ]; then\n"; $string .= " echo Error: I need $taskenv set, or a job index on the command line.\n"; $string .= " exit\n"; $string .= "fi\n"; $string .= "jobid=`expr \$baseid + \$offset`\n"; $string .= "if [ x\$$taskenv = x ]; then\n"; $string .= " echo Running job \$jobid based on command line options.\n"; $string .= "else\n"; $string .= " echo Running job \$jobid based on $taskenv=\$$taskenv and offset=\$offset.\n"; $string .= "fi\n"; } # Emits a block of shell code to change shell imposed limit on the number of open files and # processes. 
# sub getLimitShellCode () { my $string; $string .= "echo \"\"\n"; $string .= "echo \"Attempting to increase maximum allowed processes and open files.\""; $string .= "\n"; $string .= "max=`ulimit -Hu`\n"; $string .= "bef=`ulimit -Su`\n"; $string .= "if [ \$bef -lt \$max ] ; then\n"; $string .= " ulimit -Su \$max\n"; $string .= " aft=`ulimit -Su`\n"; $string .= " echo \" Changed max processes per user from \$bef to \$aft (max \$max).\"\n"; $string .= "else\n"; $string .= " echo \" Max processes per user limited to \$bef, no increase possible.\"\n"; $string .= "fi\n"; $string .= "\n"; $string .= "max=`ulimit -Hn`\n"; $string .= "bef=`ulimit -Sn`\n"; $string .= "if [ \$bef -lt \$max ] ; then\n"; $string .= " ulimit -Sn \$max\n"; $string .= " aft=`ulimit -Sn`\n"; $string .= " echo \" Changed max open files from \$bef to \$aft (max \$max).\"\n"; $string .= "else\n"; $string .= " echo \" Max open files limited to \$bef, no increase possible.\"\n"; $string .= "fi\n"; $string .= "\n"; $string .= "echo \"\"\n"; $string .= "\n"; return($string); } # Used inside canu to find where binaries are located. It uses uname to find OS, architecture and # system name, then uses that to construct a path to binaries. If a "pathMap" is defined, this is # used to hardcode a path to a system name. # sub getBinDirectory () { my $installDir = getInstallDirectory(); my $syst = `uname -s`; chomp $syst; # OS implementation my $arch = `uname -m`; chomp $arch; # Hardware platform my $name = `uname -n`; chomp $name; # Name of the system $arch = "amd64" if ($arch eq "x86_64"); $arch = "ppc" if ($arch eq "Power Macintosh"); my $path = "$installDir/$syst-$arch/bin"; my $pathMap = getGlobal("pathMap"); if (defined($pathMap)) { open(F, "< $pathMap") or caFailure("failed to open pathMap '$pathMap'", undef); while () { my ($n, $b) = split '\s+', $_; $path = $b if ($name eq $n); } close(F); } if (! 
-d "$path") { $path = $installDir; } return($path); } # Emits a block of shell code to locate binaries during shell scripts. See comments on # getBinDirectory. # sub getBinDirectoryShellCode () { my $installDir = getInstallDirectory(); my $string; $string .= "syst=`uname -s`\n"; $string .= "arch=`uname -m`\n"; $string .= "name=`uname -n`\n"; $string .= "\n"; $string .= "if [ \"\$arch\" = \"x86_64\" ] ; then\n"; $string .= " arch=\"amd64\"\n"; $string .= "fi\n"; $string .= "if [ \"\$arch\" = \"Power Macintosh\" ] ; then\n"; $string .= " arch=\"ppc\"\n"; $string .= "fi\n"; $string .= "\n"; $string .= "bin=\"$installDir/\$syst-\$arch/bin\"\n"; $string .= "\n"; my $pathMap = getGlobal("pathMap"); if (defined($pathMap)) { open(PM, "< $pathMap") or caFailure("failed to open pathMap '$pathMap'", undef); while () { my ($n, $b) = split '\s+', $_; $string .= "if [ \"\$name\" = \"$n\" ] ; then\n"; $string .= " bin=\"$b\"\n"; $string .= "fi\n"; } close(PM); $string .= "\n"; } $string .= "if [ ! -d \"\$bin\" ] ; then\n"; $string .= " bin=\"$installDir\"\n"; $string .= "fi\n"; $string .= "\n"; return($string); } # # If running on a cloud system, shell scripts are started in some random location. # setWorkDirectory() will create the directory the script is supposed to run in (e.g., # correction/0-mercounts) and move into it. This will keep the scripts compatible with the way # they are run from within canu.pl. # # If you're fine running in 'some random location' do nothing here. # # Note that canu does minimal cleanup. 
# sub setWorkDirectory () { if ((getGlobal("objectStore") eq "TEST") && (defined($ENV{"JOB_ID"}))) { my $jid = $ENV{'JOB_ID'}; my $tid = $ENV{'SGE_TASK_ID'}; make_path("/assembly/COMPUTE/job-$jid-$tid"); chdir ("/assembly/COMPUTE/job-$jid-$tid"); } elsif (getGlobal("objectStore") eq "DNANEXUS") { } elsif (getGlobal("gridEngine") eq "PBSPRO") { chdir($ENV{"PBS_O_WORKDIR"}) if (exists($ENV{"PBS_O_WORKDIR"})); } } sub setWorkDirectoryShellCode ($) { my $path = shift @_; my $code = ""; if (getGlobal("objectStore") eq "TEST") { $code .= "if [ z\$SGE_TASK_ID != z ] ; then\n"; $code .= " jid=\$JOB_ID\n"; $code .= " tid=\$SGE_TASK_ID\n"; $code .= " mkdir -p /assembly/COMPUTE/job-\$jid-\$tid/$path\n"; $code .= " cd /assembly/COMPUTE/job-\$jid-\$tid/$path\n"; $code .= " echo IN /assembly/COMPUTE/job-\$jid-\$tid/$path\n"; $code .= "fi\n"; } elsif (getGlobal("objectStore") eq "DNANEXUS") { # You're probably fine running in some random location, but if there is faster disk # available, move there. } elsif (getGlobal("gridEngine") eq "PBSPRO") { $code .= "if [ z\$PBS_O_WORKDIR != z ] ; then\n"; $code .= " cd \$PBS_O_WORKDIR\n"; $code .= "fi\n"; } return($code); } # Spend too much effort ensuring that the name is unique in the system. For 'canu' jobs, we don't # care. sub makeRandomSuffix ($) { my $length = shift @_; my @chars = +('0'..'9', 'a'..'k', 'm'..'z', 'A'..'H', 'J'..'N', 'P'..'Z'); # Remove 'l', 'I' and 'O' my $suffix; while ($length-- > 0) { $suffix .= @chars[int(rand(59))]; } return($suffix); } sub makeUniqueJobName ($$) { my $jobType = shift @_; my $asm = shift @_; # If a canu job, just return the standard name. No uniquification needed. if ($jobType eq "canu") { return("canu_" . $asm . ((defined(getGlobal("gridOptionsJobName"))) ? ("_" . getGlobal("gridOptionsJobName")) : (""))); } # For all other jobs, we need to ensure the name is unique. We do this by adding digits at the end. my $jobName = "${jobType}_" . $asm . ((defined(getGlobal("gridOptionsJobName"))) ? 
("_" . getGlobal("gridOptionsJobName")) : ("")); my %jobs; # First, find the list of all jobs that exist. if (uc(getGlobal("gridEngine")) eq "SGE") { open(F, "qstat -xml |"); while (<F>) { $jobs{$1}++ if (m/^\s*<JB_name>(.*)<\/JB_name>$/); } close(F); } if (uc(getGlobal("gridEngine")) eq "PBS") { } if (uc(getGlobal("gridEngine")) eq "PBSPro") { } if (uc(getGlobal("gridEngine")) eq "LSF") { } if (uc(getGlobal("gridEngine")) eq "DNANEXUS") { } # If the jobName doesn't exist, we can use it. return($jobName) if (! exists($jobs{$jobName})); # Otherwise, find a unique random 2-letter suffix. my $jobIdx = makeRandomSuffix(2); while (exists($jobs{"${jobName}_$jobIdx"})) { $jobIdx = makeRandomSuffix(2); } # And return it! Simple! # this was breaking dependencies when multiple jobs were submitted like for a failed consensus run, turn off for now return("${jobName}"); #return("${jobName}_$jobIdx"); } # Submit ourself back to the grid. If the one argument is defined, make us hold on jobs with that # name. # # The previous version (CA) would use "gridPropagateHold" to reset holds on existing jobs so that # they would also hold on this job. # sub submitScript ($$) { my $asm = shift @_; my $jobHold = shift @_; return if (getGlobal("useGrid") ne "1"); # If not requested to run on the grid, return if (getGlobal("gridEngine") eq undef); # or can't run on the grid, don't run on the grid. # If no job hold, and we are already on the grid, do NOT resubmit ourself. # # When the user launches canu on the head node, a call to submitScript() is made to launch canu # under grid control. That results in a restart of canu, and another call to submitScript(), # but this time, the environment variable is set, so we can skip the resubmission, and continue # with canu execution. return if (($jobHold eq undef) && (exists($ENV{getGlobal("gridEngineJobID")}))); # Find the next available output file. make_path("canu-scripts") if (!
-d "canu-scripts"); # Done in canu.pl, just being paranoid my $idx = "01"; while (-e "canu-scripts/canu.$idx.out") { $idx++; } my $outName = "canu-scripts/canu.$idx.out"; my $script = "canu-scripts/canu.$idx.sh"; # Make a script for us to submit. open(F, "> $script") or caFailure("failed to open '$script' for writing", undef); print F "#!" . getGlobal("shell") . "\n"; print F "\n" if (getGlobal("gridEngine") eq "SGE"); print F "# Attempt to (re)configure SGE. For unknown reasons, jobs submitted\n" if (getGlobal("gridEngine") eq "SGE"); print F "# to SGE, and running under SGE, fail to read the shell init scripts,\n" if (getGlobal("gridEngine") eq "SGE"); print F "# and so they don't set up SGE (or ANY other paths, etc) properly.\n" if (getGlobal("gridEngine") eq "SGE"); print F "# For the record, interactive logins (qlogin) DO set the environment.\n" if (getGlobal("gridEngine") eq "SGE"); print F "\n"; print F "if [ \"x\$SGE_ROOT\" != \"x\" ]; then \n" if (getGlobal("gridEngine") eq "SGE"); print F " . \$SGE_ROOT/\$SGE_CELL/common/settings.sh\n" if (getGlobal("gridEngine") eq "SGE"); print F "fi\n" if (getGlobal("gridEngine") eq "SGE"); print F "\n"; print F "# On the off chance that there is a pathMap, and the host we\n"; print F "# eventually get scheduled on doesn't see other hosts, we decide\n"; print F "# at run time where the binary is.\n"; print F "\n"; print F getBinDirectoryShellCode(); print F "\n"; print F setWorkDirectoryShellCode("."); print F "\n"; print F "rm -f canu.out\n"; print F "ln -s canu-scripts/canu.$idx.out canu.out\n"; print F "\n"; print F "/usr/bin/env perl \\\n"; print F "\$bin/canu " . getCommandLineOptions() . " canuIteration=" . getGlobal("canuIteration") . "\n"; close(F); makeExecutable("$script"); # Construct a submission command line. my ($jobName, $memOption, $thrOption, $gridOpts); $jobName = makeUniqueJobName("canu", $asm); # The canu.pl script isn't expected to take resources. We'll default to 4gb and one thread. 
my $mem = 4; my $thr = 1; # However, the sequential overlap store is still built from within the canu process. if (getGlobal("ovsMethod") eq "sequential") { $mem = getGlobal("ovsMemory"); $mem = $2 if ($mem =~ m/^(\d+)-(\d+)$/); } $memOption = buildMemoryOption($mem, 1); $thrOption = buildThreadOption($thr); $gridOpts = $jobHold; $gridOpts .= " " if (defined($gridOpts)); $gridOpts .= $memOption if (defined($memOption)); $gridOpts .= " " if (defined($gridOpts)); $gridOpts .= $thrOption if (defined($thrOption)); $gridOpts .= " " if (defined($gridOpts)); $gridOpts .= getGlobal("gridOptions") if (defined(getGlobal("gridOptions"))); $gridOpts .= " " if (defined($gridOpts)); $gridOpts .= getGlobal("gridOptionsExecutive") if (defined(getGlobal("gridOptionsExecutive"))); my $submitCommand = getGlobal("gridEngineSubmitCommand"); my $nameOption = getGlobal("gridEngineNameOption"); my $outputOption = getGlobal("gridEngineOutputOption"); my $qcmd = "$submitCommand $gridOpts $nameOption '$jobName' $outputOption $outName $script"; runCommand(getcwd(), $qcmd) and caFailure("Failed to submit script", undef); exit(0); } sub buildGridArray ($$$$) { my ($name, $bgn, $end, $opt) = @_; my $off = 0; # In some grids (SGE) this is the maximum size of an array job. # In some grids (Slurm) this is the maximum index of an array job. # # So, here, we just don't let any index be above the value. Both types will be happy. if ($end > getGlobal('gridEngineArrayMaxJobs')) { $off = $bgn - 1; $bgn -= $off; $end -= $off; } # PBSPro requires array jobs to have bgn < end. When $bgn == $end, we # just remove the array qualifier. But only if this option is setting # the number of jobs, not if it is setting the name. if (uc(getGlobal("gridEngine")) eq "PBSPRO") { $opt = "" if (($bgn == $end) && ($opt =~ m/ARRAY_JOBS/)); $off = ""; } # Further, PBS/Torque won't let scripts be passed options unless they # are prefixed with a -F....and PBSPro doesn't need this. 
if (uc(getGlobal("gridEngine")) eq "PBS") { $off = "-F \"$off\""; $off = ""; } $opt =~ s/ARRAY_NAME/$name/g; # Replace ARRAY_NAME with 'job name' $opt =~ s/ARRAY_JOBS/$bgn-$end/g; # Replace ARRAY_JOBS with 'bgn-end' return($opt, $off); } sub buildOutputName ($$$) { my $path = shift @_; my $script = shift @_; my $tid = substr("000000" . (shift @_), -6); my $o; # When this function is called, canu.pl is running in the assembly directory. # But, when the script is executed, it is rooted in '$path'. To get the # 'logs' working, we need to check if the directory relative to the assembly root exists, # but set it relative to $path (which is also where $script is relative to). $o = "$script.$tid.out"; $o = "logs/$1.$tid.out" if ((-e "$path/logs") && ($script =~ m/scripts\/(.*)/)); return($o); } sub buildOutputOption ($$) { my $path = shift @_; my $script = shift @_; my $tid = getGlobal("gridEngineArraySubmitID"); my $opt = getGlobal("gridEngineOutputOption"); if (defined($tid) && defined($opt)) { my $o; $o = "$script.$tid.out"; $o = "logs/$1.$tid.out" if ((-e "$path/logs") && ($script =~ m/scripts\/(.*)/)); return("$opt $o"); } return(undef); } sub buildStageOption ($$) { my $t = shift @_; my $d = shift @_; my $r; if ($t eq "cor") { $r = getGlobal("gridEngineStageOption"); $r =~ s/DISK_SPACE/${d}/g; } return($r); } sub buildMemoryOption ($$) { my $m = shift @_; my $t = shift @_; my $r; my $u = "g"; if (uc(getGlobal("gridEngine")) eq "SGE") { $m /= $t; } if ((uc(getGlobal("gridEngine")) eq "SLURM") && (getGlobal("gridEngineMemoryOption") =~ m/mem-per-cpu/i)) { $m /= $t; } if (int($m) != $m) { $m = int($m * 1024); $u = "m"; } if (uc(getGlobal("gridEngine")) eq "LSF") { $m = $m / 1024 if (getGlobal("gridEngineMemoryUnits") eq "t"); $m = $m * 1 if (getGlobal("gridEngineMemoryUnits") eq "g"); $m = $m * 1024 if (getGlobal("gridEngineMemoryUnits") eq "m"); $m = $m * 1024 * 1024 if (getGlobal("gridEngineMemoryUnits") eq "k"); $u = ""; } $r = getGlobal("gridEngineMemoryOption"); 
$r =~ s/MEMORY/${m}${u}/g; return($r); } sub buildThreadOption ($) { my $t = shift @_; my $r; $r = getGlobal("gridEngineThreadsOption"); $r =~ s/THREADS/$t/g; return($r); } sub purgeGridJobSubmitScripts ($$) { my $path = shift @_; my $script = shift @_; my $idx = "01"; while (-e "$path/$script.jobSubmit-$idx.sh") { unlink "$path/$script.jobSubmit-$idx.sh"; $idx++; } } sub buildGridJob ($$$$$$$$$) { my $asm = shift @_; my $jobType = shift @_; my $path = shift @_; my $script = shift @_; my $mem = shift @_; my $thr = shift @_; my $dsk = shift @_; my $bgnJob = shift @_; my $endJob = shift @_; # Unpack the job range if needed. if ($bgnJob =~ m/^(\d+)-(\d+)$/) { $bgnJob = $1; $endJob = $2; } if (!defined($endJob)) { $endJob = $bgnJob; } # Figure out the command and options needed to run the job. my $submitCommand = getGlobal("gridEngineSubmitCommand"); my $nameOption = getGlobal("gridEngineNameOption"); my $jobNameT = makeUniqueJobName($jobType, $asm); my ($jobName, $jobOff) = buildGridArray($jobNameT, $bgnJob, $endJob, getGlobal("gridEngineArrayName")); my ($arrayOpt, $arrayOff) = buildGridArray($jobNameT, $bgnJob, $endJob, getGlobal("gridEngineArrayOption")); my $outputOption = buildOutputOption($path, $script); my $stageOption = buildStageOption($jobType, $dsk); my $memOption = buildMemoryOption($mem, $thr); my $thrOption = buildThreadOption($thr); my $globalOptions = getGlobal("gridOptions"); my $jobOptions = getGlobal("gridOptions$jobType"); my $opts; $opts = "$stageOption " if (defined($stageOption)); $opts .= "$memOption " if (defined($memOption)); $opts .= "$thrOption " if (defined($thrOption)); $opts .= "$globalOptions " if (defined($globalOptions)); $opts .= "$jobOptions " if (defined($jobOptions)); $opts .= "$outputOption " if (defined($outputOption)); $opts =~ s/\s+$//; # Find a unique file name to save the command. my $idx = "01"; while (-e "$path/$script.jobSubmit-$idx.sh") { $idx++; } # Build and save the command line. 
Return the command PREFIX (we'll be adding .sh and .out as # appropriate), and the job name it will be submitted with (which isn't expected to be used). open(F, "> $path/$script.jobSubmit-$idx.sh") or die; print F "#!/bin/sh\n"; print F "\n"; print F "$submitCommand \\\n"; print F " $opts \\\n" if (defined($opts)); print F " $nameOption \"$jobName\" \\\n"; print F " $arrayOpt \\\n"; print F " ./$script.sh $arrayOff \\\n"; print F "> ./$script.jobSubmit-$idx.out 2>&1\n"; close(F); makeExecutable("$path/$script.jobSubmit-$idx.sh"); return("$script.jobSubmit-$idx", $jobName); } # Convert @jobs to a list of ranges, a-b, c, d-e, etc. These will be directly submitted to the # grid, or run one-by-one locally. # # If we're SGE, we can combine everything to one job range: a-b,c,d,e-f. Except that # buildGridJob() doesn't know how to handle that. sub convertToJobRange (@) { my @jobs; # Expand the ranges into a simple list of job ids. foreach my $j (@_) { if ($j =~ m/^(\d+)-(\d+)$/) { for (my $a=$1; $a<=$2; $a++) { push @jobs, $a; } } elsif ($j =~ m/^(\d+)$/) { push @jobs, $1; } else { caFailure("invalid job format in '$j'", undef); } } # Sort. my @jobsA = sort { $a <=> $b } @jobs; undef @jobs; # Merge adjacent ids into a range. my $st = $jobsA[0]; my $ed = $jobsA[0]; shift @jobsA; foreach my $j (@jobsA) { if ($ed + 1 == $j) { $ed = $j; } else { push @jobs, ($st == $ed) ? "$st" : "$st-$ed"; $st = $j; $ed = $j; } } push @jobs, ($st == $ed) ? "$st" : "$st-$ed"; # In some grids (SGE) this is the maximum size of an array job. # In some grids (Slurm) this is the maximum index of an array job. # # So, here, we make blocks that have at most that many jobs. When we submit the job, we'll # offset the indices to be 1..Max. my $l = getGlobal("gridEngineArrayMaxJobs") - 1; if ($l > 0) { @jobsA = @jobs; undef @jobs; foreach my $j (@jobsA) { if ($j =~ m/^(\d+)-(\d+)$/) { my $b = $1; my $e = $2; while ($b <= $e) { my $B = ($b + $l < $e) ? 
($b + $l) : $e; push @jobs, "$b-$B"; $b += $l + 1; } } else { push @jobs, $j } } undef @jobsA; } return(@jobs); } # Expects # job type ("ovl", etc) # output directory # script name with no directory or .sh # number of jobs in the task # # If under grid control, submit grid jobs. Otherwise, run in parallel locally. # sub submitOrRunParallelJob ($$$$@) { my $asm = shift @_; # Name of the assembly my $jobType = shift @_; # E.g., ovl, cns, ... - populates 'gridOptionsXXX # - also becomes the grid job name prefix, so three letters suggested my $path = shift @_; # Location of script to run my $script = shift @_; # Runs $path/$script.sh > $path/$script.######.out my $mem = getGlobal("${jobType}Memory"); my $thr = getGlobal("${jobType}Threads"); my $dsk = getGlobal("${jobType}StageSpace"); my @jobs = convertToJobRange(@_); # The script MUST be executable. makeExecutable("$path/$script.sh"); # Report what we're doing. #my $t = localtime(); #print STDERR "----------------------------------------GRIDSTART $t\n"; #print STDERR "$path/$script.sh with $mem gigabytes memory and $thr threads.\n"; # Break infinite loops. If the grid jobs keep failing, give up after a few attempts. # # submitScript() passes canuIteration on to the next call. # canuIteration is reset to zero if the Check() for any parallel step succeeds. # # Assuming grid jobs die on each attempt: # 0) canu run from the command line submits iteration 1; canuIteration is NOT incremented # because no parallel jobs have been submitted. # 1) Iteration 1 - canu.pl submits jobs, increments the iteration count, and submits itself as iteration 2 # 2) Iteration 2 - canu.pl submits jobs, increments the iteration count, and submits itself as iteration 3 # 3) Iteration 3 - canu.pl fails with the error below # # If the jobs succeed in Iteration 2, the canu in iteration 3 will pass the Check(), never call # this function, and continue the pipeline.
my $iter = getGlobal("canuIteration"); my $max = getGlobal("canuIterationMax"); if ($iter >= $max) { caExit("canu iteration count too high, stopping pipeline (most likely a problem in the grid-based computes)", undef); } elsif ($iter == 0) { $iter = "First"; } elsif ($iter == 1) { $iter = "Second"; } elsif ($iter == 2) { $iter = "Third"; } elsif ($iter == 3) { $iter = "Fourth"; } elsif ($iter == 4) { $iter = "Fifth"; } else { $iter = "${iter}th"; } print STDERR "--\n"; print STDERR "-- Running jobs. $iter attempt out of $max.\n"; setGlobal("canuIteration", getGlobal("canuIteration") + 1); # If 'gridEngineJobID' environment variable exists (SGE: JOB_ID; LSF: LSB_JOBID) then we are # currently running under grid control. If so, run the grid command to submit more jobs, then # submit ourself back to the grid. If not, tell the user to run the grid command by hand. # Jobs under grid control, and we submit them if (defined(getGlobal("gridEngine")) && (getGlobal("useGrid") eq "1") && (getGlobal("useGrid$jobType") eq "1") && (exists($ENV{getGlobal("gridEngineJobID")}))) { my @jobsSubmitted; print STDERR "--\n"; purgeGridJobSubmitScripts($path, $script); foreach my $j (@jobs) { my ($cmd, $jobName) = buildGridJob($asm, $jobType, $path, $script, $mem, $thr, $dsk, $j, undef); runCommandSilently($path, "./$cmd.sh", 0) and caFailure("Failed to submit batch jobs", "$path/$cmd.out"); # Parse the stdout/stderr from the submit command to find the id of the job # we just submitted. We'll use this to hold the next iteration until all these # jobs have completed. open(F, "< $path/$cmd.out"); while (<F>) { chomp; if (uc(getGlobal("gridEngine")) eq "SGE") { # Your job 148364 ("canu_asm") has been submitted if (m/Your\sjob\s(\d+)\s/) { $jobName = $1; } # Your job-array 148678.1500-1534:1 ("canu_asm") has been submitted if (m/Your\sjob-array\s(\d+).\d+-\d+:\d\s/) { $jobName = $1; } } if (uc(getGlobal("gridEngine")) eq "LSF") { # Job <759810> is submitted to queue <14>.
if (m/Job\s<(\d+)>\sis/) { $jobName = "ended($1)"; } } if (uc(getGlobal("gridEngine")) eq "PBS") { # 123456.qm2 $jobName = $_; } if (uc(getGlobal("gridEngine")) eq "PBSPRO") { # ?? $jobName = $_; } if (uc(getGlobal("gridEngine")) eq "SLURM") { if (m/Submitted\sbatch\sjob\s(\d+)/) { $jobName = $1; } else { $jobName = $_; } } if (uc(getGlobal("gridEngine")) eq "DNANEXUS") { } } close(F); if ($j =~ m/^\d+$/) { print STDERR "-- '$cmd.sh' -> job $jobName task $j.\n"; } else { print STDERR "-- '$cmd.sh' -> job $jobName tasks $j.\n"; } push @jobsSubmitted, $jobName; } print STDERR "--\n"; # All jobs submitted. Make an option to hold the executive on those jobs. my $jobHold; if (uc(getGlobal("gridEngine")) eq "SGE") { $jobHold = "-hold_jid " . join ",", @jobsSubmitted; } if (uc(getGlobal("gridEngine")) eq "LSF") { $jobHold = "-w \"" . (join "&&", @jobsSubmitted) . "\""; } if (uc(getGlobal("gridEngine")) eq "PBS") { $jobHold = "-W depend=afteranyarray:" . join ":", @jobsSubmitted; } if (uc(getGlobal("gridEngine")) eq "PBSPRO") { $jobHold = "-W depend=afterany:" . join ":", @jobsSubmitted; } if (uc(getGlobal("gridEngine")) eq "SLURM") { $jobHold = "--depend=afterany:" . join ":", @jobsSubmitted; } if (uc(getGlobal("gridEngine")) eq "DNANEXUS") { $jobHold = "...whatever magic needed to hold the job until all jobs in @jobsSubmitted are done..."; } submitScript($asm, $jobHold); # submitScript() should never return. If it does, then a parallel step was attempted too many times. caExit("Too many attempts to run a parallel stage on the grid. Stop.", undef); } # Jobs under grid control, but the user must submit them if (defined(getGlobal("gridEngine")) && (getGlobal("useGrid") ne "0") && (getGlobal("useGrid$jobType") eq "1") && (!
exists($ENV{getGlobal("gridEngineJobID")}))) { print STDERR "\n"; print STDERR "Please run the following commands to submit jobs to the grid for execution using $mem gigabytes memory and $thr threads:\n"; print STDERR "\n"; purgeGridJobSubmitScripts($path, $script); foreach my $j (@jobs) { my $cwd = getcwd(); my ($cmd, $jobName) = buildGridJob($asm, $jobType, $path, $script, $mem, $thr, $dsk, $j, undef); print " $cwd/$path/$cmd.sh\n"; } print STDERR "\n"; print STDERR "When all jobs complete, restart canu as before.\n"; print STDERR "\n"; exit(0); } # Standard jobs, run locally. foreach my $j (@jobs) { my $st; my $ed; if ($j =~ m/^(\d+)-(\d+)$/) { $st = $1; $ed = $2; } else { $st = $ed = $j; } for (my $i=$st; $i<=$ed; $i++) { schedulerSubmit("./$script.sh $i > ./" . buildOutputName($path, $script, $i) . " 2>&1"); } } # compute limit based on # of cpus my $nCParallel = getGlobal("${jobType}Concurrency"); $nCParallel = int(getGlobal("maxThreads") / $thr) if ((!defined($nCParallel)) || ($nCParallel == 0)); $nCParallel = 1 if ((!defined($nCParallel)) || ($nCParallel == 0)); # compute limit based on physical memory my $nMParallel = getGlobal("${jobType}Concurrency"); $nMParallel = int(getGlobal("maxMemory") / getGlobal("${jobType}Memory")) if ((!defined($nMParallel)) || ($nMParallel == 0)); $nMParallel = 1 if ((!defined($nMParallel)) || ($nMParallel == 0)); # run min of our limits my $nParallel = $nCParallel < $nMParallel ? $nCParallel : $nMParallel; schedulerSetNumberOfProcesses($nParallel); schedulerFinish($path, $jobType); } # Pretty-ify the command. If there are no newlines already in it, break # before every switch and before file redirects. sub prettifyCommand ($) { my $dis = shift @_; if (($dis =~ tr/\n/\n/) == 0) { $dis =~ s/\s-/ \\\n -/g; # Replace ' -' with '\n -' (newline, two spaces, then the dash) $dis =~ s/\s>\s/ \\\n> /; # Replace ' > ' with '\n> ' $dis =~ s/\s2>\s/ \\\n2> /; # Replace ' 2> ' with '\n2> ' } $dis = " " . 
$dis; # Indent the command by four spaces. $dis =~ s/\n/\n /g; return($dis); } sub reportRunError ($) { my $rc = shift @_; # Bunch of busy work to get the names of signals. Is it really worth it?! my @signame; if (defined($Config{sig_name})) { my $i = 0; foreach my $n (split('\s+', $Config{sig_name})) { $signame[$i] = $n; $i++; } } else { for (my $i=0; $i<127; $i++) { $signame[$i] = "signal $i"; } } # The rest is rather straightforward at least. print STDERR "ERROR:\n"; if ($rc == -1) { print STDERR "ERROR: Failed to run the command. (rc=$rc)\n"; } elsif ($rc & 127) { print STDERR "ERROR: Failed with signal $signame[$rc & 127]. (rc=$rc)\n"; } else { print STDERR "ERROR: Failed with exit code ", $rc >> 8 , ". (rc=$rc)\n"; } print STDERR "ERROR:\n"; } # Utility to run a command and check the exit status, report time used. # sub runCommand ($$) { my $dir = shift @_; my $cmd = shift @_; my $dis = prettifyCommand($cmd); return if ($cmd eq ""); # Check if the directory exists. if (! -d $dir) { caFailure("Directory '$dir' doesn't exist, can't run command", ""); } # If only showing the next command, show it and stop. if (getGlobal("showNext")) { print STDERR "--NEXT-COMMAND\n"; print STDERR "$dis\n"; exit(0); } # Log that we're starting, and show the pretty-ified command. my $cwd = getcwd(); # Remember where we are. chdir($dir); # So we can root the jobs in the correct location. my $startsecs = time(); my $diskfree = diskSpace("."); print STDERR "----------------------------------------\n"; print STDERR "-- Starting command on ", scalar(localtime()), " with $diskfree GB free disk space\n"; print STDERR "\n"; print STDERR " cd $dir\n"; print STDERR "$dis\n"; my $rc = 0xffff & system($cmd); $diskfree = diskSpace("."); my $warning = " !!! WARNING !!!" 
if ($diskfree < 10); my $elapsed = time() - $startsecs; $elapsed = "lickety-split" if ($elapsed eq "0"); $elapsed = "$elapsed second" if ($elapsed eq "1"); $elapsed = "$elapsed seconds" if ($elapsed > 1); print STDERR "\n"; print STDERR "-- Finished on ", scalar(localtime()), " ($elapsed) with $diskfree GB free disk space$warning\n"; print STDERR "----------------------------------------\n"; chdir($cwd); # Pretty much copied from Programming Perl page 230 return(0) if ($rc == 0); reportRunError($rc); return(1); } # Duplicated in Grid_Cloud.pm to get around recursive 'use' statements. sub runCommandSilently ($$$) { my $dir = shift @_; my $cmd = shift @_; my $dis = prettifyCommand($cmd); my $critical = shift @_; return(0) if ($cmd eq ""); my $cwd = getcwd(); # Remember where we are. chdir($dir); # So we can root the jobs in the correct location. my $rc = 0xffff & system($cmd); chdir($cwd); return(0) if ($rc == 0); # No errors, return no error. return(1) if ($critical == 0); # If not critical, return that it failed, otherwise, report error and fail. print STDERR "$dis\n"; reportRunError($rc); return(1); } sub findCommand ($) { my $cmd = shift @_; my @path = File::Spec->path; for my $path (@path) { if (-x "$path/$cmd") { return("$path/$cmd"); } } return(undef); } sub findExecutable ($) { my $exec = shift @_; my $path = `which \"$exec\" 2> /dev/null`; $path =~ s/^\s+//; $path =~ s/\s+$//; return(undef) if ($path eq ""); return($path); } # Use caExit() for transient errors, like not opening files, processes that die, etc. sub caExit ($$) { my $asm = getGlobal("onExitNam"); my $msg = shift @_; my $log = shift @_; my $version = getGlobal("version"); print STDERR "\n"; print STDERR "ABORT:\n"; print STDERR "ABORT: $version\n"; print STDERR "ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.\n"; print STDERR "ABORT: Try restarting. 
If that doesn't work, ask for help.\n"; print STDERR "ABORT:\n"; print STDERR "ABORT: $msg.\n" if (defined($msg)); print STDERR "ABORT:\n" if (defined($msg)); if (defined($log) && -e $log) { my $df = diskSpace($log); print STDERR "ABORT: Disk space available: $df GB\n"; print STDERR "ABORT:\n"; } if (-e $log) { print STDERR "ABORT: Last 50 lines of the relevant log file ($log):\n"; print STDERR "ABORT:\n"; open(Z, "tail -n 50 $log |"); while (<Z>) { print STDERR "ABORT: $_"; } close(Z); print STDERR "ABORT:\n"; } my $fail = getGlobal('onFailure'); if (defined($fail)) { runCommandSilently(getGlobal("onExitDir"), "$fail $asm", 0); } exit(1); } # Use caFailure() for errors that definitely will require code changes to fix. sub caFailure ($$) { my $asm = getGlobal("onExitNam"); my $msg = shift @_; my $log = shift @_; my $version = getGlobal("version"); my $trace = longmess("Failed"); $trace =~ s/\n/\nCRASH: /g; print STDERR "\n"; print STDERR "CRASH:\n"; print STDERR "CRASH: $version\n"; print STDERR "CRASH: Please panic, this is abnormal.\n"; print STDERR "CRASH:\n"; print STDERR "CRASH: $msg.\n"; print STDERR "CRASH:\n"; print STDERR "CRASH: $trace\n"; #print STDERR "CRASH:\n"; # $trace has an extra CRASH: at the end if (-e $log) { print STDERR "CRASH: Last 50 lines of the relevant log file ($log):\n"; print STDERR "CRASH:\n"; open(Z, "tail -n 50 $log |"); while (<Z>) { print STDERR "CRASH: $_"; } close(Z); print STDERR "CRASH:\n"; } else { print STDERR "CRASH: No log file supplied.\n"; print STDERR "CRASH:\n"; } my $fail = getGlobal('onFailure'); if (defined($fail)) { runCommandSilently(getGlobal("onExitDir"), "$fail $asm", 0); } exit(1); } 1; canu-1.6/src/pipelines/canu/Execution.txt000066400000000000000000000052041314437614700205320ustar00rootroot00000000000000 Running Commands ---------------- Simple single commands can be run, and logged to the chatter output, with something like if (output-doesn't-exist) { if (runCommand(directory, command)) {
caExit(command-failed-message, command.err) } do-steps-to-make-output-exist } If no chatter output is desired, runCommandSilently() can be used. Ideally in the same recipe as above, but usually it isn't guarded at all. This function will terminate ungracefully if the command fails. For jobs to be run on the grid, either in parallel or a single job, the function submitOrRunParallelJob() is used. This takes a job type (as in gridOptions{jobType}), a path and a script, and a list of numeric job IDs to run. The list can be formed of simple integers or ranges, or both (e.g., 1,2,3-9,10,12-99). Its use is straightforward, but the wrapper to make it work both for grid-based execution and local execution is non-trivial. useGrid=remote fails maxMemory / maxThreads minMemory / minThreads - means what? maxGridCores - sge "-tc N", slurm/pbs "-a 1-1000%50" A *Configure() function needs to prepare the job. Its product is a shell script to run the job. A *Check() function parses the shell script (usually) to find out which jobs to run, then submitOrRunParallelJob() to execute them. Each Check() function is called up to three times (for a MaxIteration=2) - the first two to actually try to compute and the last to fail. For executions not using the grid, the check function will run the jobs and fall through to the finishStage: clause, reporting the job finished and maybe generating some stats. If the job fails, it is retried, using the same flow. For executions using the grid, the check function breaks execution and checking into two grid jobs. A parallel job runs the compute, and a sequential job holds on the parallel job. The sequential job remembers canuIteration (by having it passed on the command line as a parameter). If the parallel jobs succeeded, it falls through to finishStage: as above. If they had failed, they are tried again. 
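The job-id lists accepted by submitOrRunParallelJob() are normalized by convertToJobRange() in Execution.pm. As a standalone illustration of that merging (a simplified re-implementation for this document, not the pipeline code itself; mergeJobRanges is a hypothetical name):

```perl
#!/usr/bin/perl
# Sketch of the range handling described above: expand "1,2,3-9"-style
# job lists into individual ids, then re-merge adjacent ids into ranges.
use strict;
use warnings;

sub mergeJobRanges {
    my @ids;
    foreach my $j (@_) {                        # expand "a-b" ranges into ids
        if    ($j =~ m/^(\d+)-(\d+)$/) { push @ids, ($1 .. $2); }
        elsif ($j =~ m/^(\d+)$/)       { push @ids, $1; }
    }
    @ids = sort { $a <=> $b } @ids;
    my @out;
    my ($st, $ed) = ($ids[0], $ids[0]);
    foreach my $j (@ids[1 .. $#ids]) {
        if ($ed + 1 == $j) { $ed = $j; next; }  # extend the current run
        push @out, ($st == $ed) ? "$st" : "$st-$ed";
        ($st, $ed) = ($j, $j);
    }
    push @out, ($st == $ed) ? "$st" : "$st-$ed";
    return(@out);
}

print join(",", mergeJobRanges(7, "1-3", 4, 9)), "\n";   # prints 1-4,7,9
```

The real function additionally splits the merged ranges into blocks no larger than gridEngineArrayMaxJobs, as described above.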
The flow of each *Check() function is: goto allDone if (outputs exist) decide if job outputs exist or not if (job outputs do not exist) { if (attempt > 1) report job failed if (attempt > max) report failed, caExit() report starting an attempt emitStage(check, $attempt) buildHTML() submitOrRun() return # if not on grid, we need to call again to decide if job outputs exist } finishStage: report job finished successfully do any processing to make 'outputs exist' true above setGlobal(iteration, 0) emitStage() buildHTML() stopAfter() allDone: anything that runs after EVERY call canu-1.6/src/pipelines/canu/Gatekeeper.pm000066400000000000000000000423221314437614700204420ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # This file is derived from: # # src/pipelines/ca3g/Gatekeeper.pm # # Modifications by: # # Brian P. Walenz from 2015-FEB-27 to 2015-SEP-21 # are Copyright 2015 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. Walenz beginning on 2015-OCT-27 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2016-FEB-29 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. 
## package canu::Gatekeeper; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(getMaxReadLengthInStore getNumberOfReadsInStore getNumberOfBasesInStore getExpectedCoverage sequenceFileExists gatekeeper); use strict; use Cwd qw(getcwd); use canu::Defaults; use canu::Execution; use canu::HTML; use canu::Report; use canu::Grid_Cloud; sub getMaxReadLengthInStore ($$) { my $base = shift @_; my $asm = shift @_; my $ml = 0; open(L, "< $base/$asm.gkpStore/maxreadlength.txt") or caExit("can't open '$base/$asm.gkpStore/maxreadlength.txt' for reading: $!", undef); $ml = <L>; close(L); return(int($ml)); } sub getNumberOfReadsInStore ($$) { my $base = shift @_; my $asm = shift @_; my $nr = 0; # No file, no reads. return($nr) if (! -e "$base/$asm.gkpStore/info.txt"); # Read the info file. gatekeeperCreate creates this at the end. open(F, "< $base/$asm.gkpStore/info.txt") or caExit("can't open '$base/$asm.gkpStore/info.txt' for reading: $!", undef); while (<F>) { if (m/numReads\s+=\s+(\d+)/) { $nr = $1; } } close(F); return($nr); } sub getNumberOfBasesInStore ($$) { my $base = shift @_; my $asm = shift @_; my $nb = 0; # No file, no bases. return($nb) if (! -e "$base/$asm.gkpStore/info.txt"); # Read the reads list. gatekeeperGenerateReadsList creates this. open(F, "< $base/$asm.gkpStore/reads.txt") or caExit("can't open '$base/$asm.gkpStore/reads.txt' for reading: $!", undef); while (<F>) { my @v = split '\s+', $_; $nb += $v[2]; } close(F); # An alternate, used in Meryl::getGenomeCoverage (deceased) was to parse the stats output.
# #open(F, "$bin/gatekeeperDumpMetaData -stats -G $base/$asm.gkpStore | ") or caFailure("failed to read gatekeeper stats from '$base/$asm.gkpStore'", undef); #while (<F>) { # my ($junk1, $library, $junk2, $reads, $junk3, $junk4, $bases, $junk5, $average, $junk6, $min, $junk7, $max) = split '\s+', $_; # if ($library == 0) { # $gs = $bases / getGlobal("genomeSize"); # last; # } #} #close(F); return($nb); } sub getExpectedCoverage ($$) { my $base = shift @_; my $asm = shift @_; return(int(getNumberOfBasesInStore($base, $asm) / getGlobal("genomeSize"))); } # Returns undef if a sequence file with the supplied name cannot be found. Common suffixes and compressions are tried. # Otherwise, returns the found sequence file. sub sequenceFileExists ($) { my $p = shift @_; foreach my $s ("", ".fasta", ".fastq", ".fa", ".fq") { foreach my $c ("", ".gz", ".xz") { return("$p$s$c") if (fileExists("$p$s$c")); } } return(undef); } sub gatekeeperCreateStore ($$@) { my $base = shift @_; my $asm = shift @_; my $bin = getBinDirectory(); my @inputs = @_; # If the store failed to build because of input errors and warnings, rename the store and continue. # Not sure how to support this in DNANexus. if (-e "$base/$asm.gkpStore.ACCEPTED") { rename("$base/$asm.gkpStore.ACCEPTED", "$base/$asm.gkpStore"); rename("$base/$asm.gkpStore.BUILDING.err", "$base/$asm.gkpStore.err"); return; } # If the store failed to build and the user just reruns canu, this will be triggered. We'll # skip rebuilding the store again, and report the original error message. if (-e "$base/$asm.gkpStore.BUILDING") { print STDERR "-- WARNING:\n"; print STDERR "-- WARNING: Previously failed gkpStore detected.\n"; print STDERR "-- WARNING:\n"; } # Not sure how this can occur. Possibly the user just deleted gkpStore.BUILDING and restarted? if ((!
-e "$base/$asm.gkpStore.BUILDING") && (-e "$base/$asm.gkpStore.gkp")) { print STDERR "-- WARNING:\n"; print STDERR "-- WARNING: Existing sequence inputs used.\n"; print STDERR "-- WARNING:\n"; } # Fail if there are no inputs. caExit("no input files specified, and store not already created, I have nothing to work on!", undef) if (scalar(@inputs) == 0); # Convert the canu-supplied reads into correct relative paths. This is made complicated by # gatekeeperCreate being run in a directory one below where we are now. # At the same time, check that all files exist. if (!-e "$base/$asm.gkpStore.gkp") { my $ff = undef; foreach my $iii (@inputs) { my ($type, $file) = split '\0', $iii; if (($file =~ m/\.correctedReads\./) || ($file =~ m/\.trimmedReads\./)) { fetchFile($file); chdir($base); # Move to where we run the command $file = "../$file" if (-e "../$file"); # If file exists up one dir, it's our file $iii = "$type\0$file"; # Rewrite the option chdir(".."); # ($file is used below too) } chdir($base); $ff .= (defined($ff) ? "\n " : "") . "reads '$file' not found." if (! -e $file); chdir(".."); } caExit($ff, undef) if defined($ff); # Build a gkp file for all the raw sequence inputs. For simplicity, we just copy in any gkp # files as is. This documents what gatekeeper was built with, etc. 
open(F, "> $base/$asm.gkpStore.gkp") or caExit("can't open '$base/$asm.gkpStore.gkp' for writing: $!", undef); foreach my $iii (@inputs) { if ($iii =~ m/^-(.*)\0(.*)$/) { my $tech = $1; my $file = $2; my @name = split '/', $2; my $name = $name[scalar(@name)-1]; $name = $1 if ($name =~ m/(.*).[xgb][z]2{0,1}$/i); $name = $1 if ($name =~ m/(.*).fast[aq]$/i); $name = $1 if ($name =~ m/(.*).f[aq]$/i); print F "########################################\n"; print F "# $tech: $file\n"; print F "#\n"; print F "name $name\n"; print F "preset $tech\n"; print F "$file\n"; print F "\n"; } elsif (-e $iii) { print F "########################################\n"; print F "# $iii\n"; print F "#\n"; open(I, "< $iii") or caExit("can't open gatekeeper input '$iii' for reading: $!", undef); while (<I>) { print F $_; } close(I); print F "\n"; } else { caExit("unrecognized gatekeeper input file '$iii'", undef); } } close(F); } # Load the store. if (! -e "$base/$asm.gkpStore.BUILDING") { my $cmd; $cmd .= "$bin/gatekeeperCreate \\\n"; $cmd .= " -minlength " . getGlobal("minReadLength") . " \\\n"; $cmd .= " -o ./$asm.gkpStore.BUILDING \\\n"; $cmd .= " ./$asm.gkpStore.gkp \\\n"; $cmd .= "> ./$asm.gkpStore.BUILDING.err 2>&1"; # A little funny business to make gatekeeper not fail on read quality issues. # A return code of 0 is total success. # A return code of 1 means it found errors in the inputs, but finished. # Anything larger is a crash. if (runCommand($base, $cmd) > 1) { caExit("gatekeeper failed", "$base/$asm.gkpStore.BUILDING.err"); } } # Check for quality issues. if (-e "$base/$asm.gkpStore.BUILDING.err") { my $nProblems = 0; open(F, "< $base/$asm.gkpStore.BUILDING.err"); while (<F>) { $nProblems++ if (m/Check\syour\sreads/); } close(F); if ($nProblems > 0) { print STDERR "\n"; print STDERR "Gatekeeper detected problems in your input reads.
Please review the logging in files:\n"; print STDERR " ", getcwd(), "/$base/$asm.gkpStore.BUILDING.err\n"; print STDERR " ", getcwd(), "/$base/$asm.gkpStore.BUILDING/errorLog\n"; if (getGlobal("stopOnReadQuality")) { print STDERR "If you wish to proceed, rename the store with the following commands and restart canu.\n"; print STDERR "\n"; print STDERR " mv ", getcwd(), "/$base/$asm.gkpStore.BUILDING \\\n"; print STDERR " ", getcwd(), "/$base/$asm.gkpStore.ACCEPTED\n"; print STDERR "\n"; print STDERR "Or remove '", getcwd(), "/$base/' and re-run with stopOnReadQuality=false\n"; print STDERR "\n"; exit(1); } else { print STDERR "Proceeding with assembly because stopOnReadQuality=false.\n"; } } } rename "$base/$asm.gkpStore.BUILDING", "$base/$asm.gkpStore"; rename "$base/$asm.gkpStore.BUILDING.err", "$base/$asm.gkpStore.err"; } sub gatekeeperGenerateReadsList ($$) { my $base = shift @_; my $asm = shift @_; my $bin = getBinDirectory(); if (runCommandSilently($base, "$bin/gatekeeperDumpMetaData -G ./$asm.gkpStore -reads > ./$asm.gkpStore/reads.txt 2> /dev/null", 1)) { caExit("failed to generate list of reads in store", undef); } } sub gatekeeperGenerateLibrariesList ($$) { my $base = shift @_; my $asm = shift @_; my $bin = getBinDirectory(); if (runCommandSilently($base, "$bin/gatekeeperDumpMetaData -G ./$asm.gkpStore -libs > ./$asm.gkpStore/libraries.txt 2> /dev/null", 1)) { caExit("failed to generate list of libraries in store", undef); } } sub gatekeeperGenerateReadLengths ($$) { my $base = shift @_; my $asm = shift @_; my $nb = 0; my @rl; my @hi; my $mm; my $minLen = 999999; my $maxLen = 0; open(F, "< $base/$asm.gkpStore/reads.txt") or caExit("can't open '$base/$asm.gkpStore/reads.txt' for reading: $!", undef); while (<F>) { my @v = split '\s+', $_; push @rl, $v[2]; # Save the length $nb += $v[2]; # Sum the bases $minLen = ($minLen < $v[2]) ? $minLen : $v[2]; $maxLen = ($v[2] < $maxLen) ?
$maxLen : $v[2]; } close(F); @rl = sort { $a <=> $b } @rl; # Buckets of size 1000 are easy to interpret, but sometimes not ideal. my $bucketSize = 0; if ($maxLen - $minLen < 10000) { $bucketSize = 100; } elsif ($maxLen - $minLen < 100000) { $bucketSize = 1000; } elsif ($maxLen - $minLen < 1000000) { $bucketSize = 5000; } else { $bucketSize = 10000; } # Generate the histogram (int truncates) foreach my $rl (@rl) { my $b = int($rl / $bucketSize); $hi[$b]++; } $mm = int($maxLen / $bucketSize); # Max histogram value # Write the sorted read lengths (for gnuplot) and the maximum read length (for correction consensus) open(F, "> $base/$asm.gkpStore/readlengths.txt") or caExit("can't open '$base/$asm.gkpStore/readlengths.txt' for writing: $!", undef); foreach my $rl (@rl) { print F "$rl\n"; } close(F); open(F, "> $base/$asm.gkpStore/maxreadlength.txt") or caExit("can't open '$base/$asm.gkpStore/maxreadlength.txt' for writing: $!", undef); print F "$maxLen\n"; close(F); # Generate PNG histograms my $gnuplot = getGlobal("gnuplot"); my $format = getGlobal("gnuplotImageFormat"); open(F, "> $base/$asm.gkpStore/readlengths.gp") or caExit("can't open '$base/$asm.gkpStore/readlengths.gp' for writing: $!", undef); print F "set title 'read length'\n"; print F "set xlabel 'read length, bin width = 250'\n"; print F "set ylabel 'number of reads'\n"; print F "\n"; print F "binwidth=250\n"; print F "set boxwidth binwidth\n"; print F "bin(x,width) = width*floor(x/width) + binwidth/2.0\n"; print F "\n"; print F "set terminal $format size 1024,1024\n"; print F "set output './$asm.gkpStore/readlengths.lg.$format'\n"; print F "plot [] './$asm.gkpStore/readlengths.txt' using (bin(\$1,binwidth)):(1.0) smooth freq with boxes title ''\n"; print F "\n"; print F "set terminal $format size 256,256\n"; print F "set output './$asm.gkpStore/readlengths.sm.$format'\n"; print F "plot [] './$asm.gkpStore/readlengths.txt' using (bin(\$1,binwidth)):(1.0) smooth freq with boxes title ''\n"; close(F); if 
(runCommandSilently($base, "$gnuplot ./$asm.gkpStore/readlengths.gp > /dev/null 2>&1", 0)) { print STDERR "--\n"; print STDERR "-- WARNING: gnuplot failed; no plots will appear in HTML output.\n"; print STDERR "--\n"; print STDERR "----------------------------------------\n"; } # Generate the ASCII histogram my $reads = getNumberOfReadsInStore($base, $asm); my $bases = getNumberOfBasesInStore($base, $asm); my $coverage = int(100 * $bases / getGlobal("genomeSize")) / 100; my $scale = 0; my $hist; for (my $ii=0; $ii<=$mm; $ii++) { # Scale the *'s so that the longest has 70 of 'em $scale = $hi[$ii] / 70 if ($scale < $hi[$ii] / 70); } $hist = "--\n"; $hist .= "-- In gatekeeper store '$base/$asm.gkpStore':\n"; $hist .= "-- Found $reads reads.\n"; $hist .= "-- Found $bases bases ($coverage times coverage).\n"; $hist .= "--\n"; $hist .= "-- Read length histogram (one '*' equals " . int(100 * $scale) / 100 . " reads):\n"; for (my $ii=0; $ii<=$mm; $ii++) { my $s = $ii * $bucketSize; my $e = $ii * $bucketSize + $bucketSize - 1; $hi[$ii] += 0; # Otherwise, cells with no count print as null. $hist .= sprintf("-- %6d %6d %6d %s\n", $s, $e, $hi[$ii], "*" x int($hi[$ii] / $scale)); } return($hist); } sub gatekeeper ($$@) { my $asm = shift @_; my $tag = shift @_; my @inputs = @_; my $base; $base = "correction" if ($tag eq "cor"); $base = "trimming" if ($tag eq "obt"); $base = "unitigging" if ($tag eq "utg"); # Try fetching the store from object storage. This might not be needed in all cases (e.g., # between mhap precompute and mhap compute), but it greatly simplifies stuff, like immediately # here needing to check if the store exists. fetchStore("$base/$asm.gkpStore"); # An empty store? Remove it and try again. if ((-e "$base/$asm.gkpStore/info") && (getNumberOfReadsInStore($base, $asm) == 0)) { print STDERR "-- Removing empty or incomplete gkpStore '$base/$asm.gkpStore'\n"; remove_tree("$asm.gkpStore"); } # Store with reads? Yay! Report it, then skip.
goto allDone if (skipStage($asm, "$tag-gatekeeper") == 1); goto allDone if (getNumberOfReadsInStore($base, $asm) > 0); # Create the store. If all goes well, we get asm.gkpStore. If not, we could end up with # asm.gkpStore.BUILDING and ask the user to examine it and rename it to asm.gkpStore.ACCEPTED # and restart. On the restart, gatekeeperCreateStore() detects the 'ACCEPTED' store and # renames to asm.gkpStore. gatekeeperCreateStore($base, $asm, @inputs) if (! -e "$base/$asm.gkpStore"); caExit("gatekeeper store exists, but contains no reads", undef) if (getNumberOfReadsInStore($base, $asm) == 0); gatekeeperGenerateReadsList($base, $asm) if (! -e "$base/$asm.gkpStore/reads.txt"); gatekeeperGenerateLibrariesList($base, $asm) if (! -e "$base/$asm.gkpStore/libraries.txt"); my $hist = gatekeeperGenerateReadLengths($base, $asm) if (! -e "$base/$asm.gkpStore/readlengths.txt"); addToReport("${tag}GkpStore", $hist); # Now that all the extra data is generated, stash the store. stashStore("$base/$asm.gkpStore"); finishStage: emitStage($asm, "$tag-gatekeeper"); buildHTML($asm, $tag); allDone: stopAfter("gatekeeper"); } canu-1.6/src/pipelines/canu/Grid.pm000066400000000000000000000046011314437614700172510ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P.
Walenz beginning on 2015-NOV-27 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2016-JUN-20 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## package canu::Grid; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(formatAllowedResources configureRemote); use strict; use canu::Defaults; # # Given a map of "cpu-mem" to number-of-nodes, write a log message, and return a string for use # later in the pipeline. # sub formatAllowedResources (\%$) { my $hosts_ref = shift @_; my $geName = shift @_; my %hosts = %$hosts_ref; my $hosts = undef; print STDERR "-- \n"; foreach my $c (keys %hosts) { my ($cpus, $mem) = split '-', $c; my $nodes = $hosts{$c}; printf(STDERR "-- Found %3d host%s with %3d core%s and %4d GB memory under $geName control.\n", $nodes, ($nodes == 1) ? " " : "s", $cpus, ($cpus == 1) ? " " : "s", $mem); $hosts .= "\0" if (defined($hosts)); $hosts .= "$cpus-$mem-$nodes"; } return $hosts; } sub configureRemote () { if ((getGlobal("useGrid") eq "remote") && (getGlobal("gridEngine") eq "")) { caExit("invalid 'useGrid=remote' specified; no gridEngine available", undef); } return if (uc(getGlobal("gridEngine")) ne ""); # If here, gridEngine is not set, and we're running locally. # Set to a variable we don't expect to see in the environment. setGlobalIfUndef("gridEngineTaskID", "CANU_LOCAL_JOB_ID"); } canu-1.6/src/pipelines/canu/Grid_Cloud.pm000066400000000000000000000243601314437614700204030ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. 
# # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. Walenz beginning on 2017-FEB-15 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## package canu::Grid_Cloud; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(fileExists fileExistsShellCode fetchFile fetchFileShellCode stashFile stashFileShellCode fetchStore fetchStoreShellCode stashStore stashStoreShellCode); use strict; use File::Path qw(make_path); use File::Basename; use Cwd qw(getcwd); use canu::Defaults; use canu::Grid; use canu::Execution qw(runCommand runCommandSilently); # This file contains most of the magic needed to access an object store. Two flavors of each # function are needed: one that runs in the canu.pl process (rooted in the base assembly directory, # where the 'correction', 'trimming' and 'unitigging' directories exist) and one that is # used in shell scripts (rooted where the shell script is run from). # Convert a/path/to/file to ../../../.. sub pathToDots ($) { return(join("/", map("..", (1..scalar(split '/', $_[0]))))); } # True if we're using an object store. sub isOS () { return(getGlobal("objectStore")); } # # fileExists() returns true if the file exists on disk or in the object store. It does not fetch # the file. It returns undef if the file doesn't exist. The second argument to # fileExistsShellCode() is an optional indent level (a whitespace string). # # The shellCode version should emit the if test for file existence, but nothing else (not even the # endif). 
# sub fileExists ($) { my $file = shift @_; my $exists = ""; my $client = getGlobal("objectStoreClient"); my $ns = getGlobal("objectStoreNameSpace"); return(1) if (-e $file); # If file exists, it exists. if (isOS() eq "TEST") { $exists = `$client describe --name $ns/$file`; } elsif (isOS() eq "DNANEXUS") { } else { $exists = ""; } $exists =~ s/^\s+//; $exists =~ s/\s+$//; return(($exists ne "") ? 1 : undef); } sub fileExistsShellCode ($@) { my $file = shift @_; my $indent = shift @_; my $code = ""; my $client = getGlobal("objectStoreClient"); my $ns = getGlobal("objectStoreNameSpace"); if (isOS() eq "TEST") { $code .= "${indent}if [ ! -e $file ] ; then\n"; $code .= "${indent} exists=`$client describe --name $ns/$file`\n"; $code .= "${indent}fi\n"; $code .= "${indent}if [ -e $file -o x\$exists != x ] ; then\n"; } elsif (isOS() eq "DNANEXUS") { } else { $code .= "${indent}if [ -e $file ]; then\n"; } return($code); } # # fetchFile() and stashFile() both expect to be called from the assembly root directory, and have # the path to the file passed in, e.g., "correction/0-mercounts/whatever.histogram". # # The shellCode versions expect the same, but need the path from the assembly root to the location # the shell script is running split. A meryl script would give "correction/0-mercounts" for the # first arg, and could give "some/directory/file" for the file. # sub fetchFile ($) { my $file = shift @_; my $client = getGlobal("objectStoreClient"); my $ns = getGlobal("objectStoreNameSpace"); return if (-e $file); # If it exists, we don't need to fetch it. if (isOS() eq "TEST") { make_path(dirname($file)); runCommandSilently(".", "$client download --output $file $ns/$file", 1); } elsif (isOS() eq "DNANEXUS") { } else { # Nothing we can be obnoxious about here, I suppose we could log... 
} } sub fetchFileShellCode ($$$) { my $path = shift @_; my $dots = pathToDots($path); my $file = shift @_; my $indent = shift @_; my $code = ""; my $client = getGlobal("objectStoreClient"); my $ns = getGlobal("objectStoreNameSpace"); # We definitely need to be able to fetch files from places that are # parallel to us, e.g., from 0-mercounts when we're in 1-overlapper. # # To get a file, we first go up to the assembly root, then check if the # file exists, and fetch it if not. # # The call needs to be something like: # stashFileShellCode("correction/0-mercounts", "whatever", ""); if (isOS() eq "TEST") { $code .= "${indent}if [ ! -e $dots/$path/$file ] ; then\n"; $code .= "${indent} mkdir -p $dots/$path\n"; $code .= "${indent} cd $dots/$path\n"; $code .= "${indent} $client download --output $file $ns/$path/$file\n"; $code .= "${indent} cd -\n"; $code .= "${indent}fi\n"; } elsif (isOS() eq "DNANEXUS") { } else { $code .= "# File must exist: $file\n"; } return($code); } sub stashFile ($) { my $file = shift @_; my $client = getGlobal("objectStoreClient"); my $ns = getGlobal("objectStoreNameSpace"); return if (! -e $file); if (isOS() eq "TEST") { runCommandSilently(".", "$client upload --path $ns/$file $file", 1); } elsif (isOS() eq "DNANEXUS") { } else { # Nothing we can be obnoxious about here, I suppose we could log... } } sub stashFileShellCode ($$$) { my $path = shift @_; my $dots = pathToDots($path); my $file = shift @_; my $indent = shift @_; my $code = ""; my $client = getGlobal("objectStoreClient"); my $ns = getGlobal("objectStoreNameSpace"); # Just like for fetching, we allow stashing files from parallel # directories (even though that should never happen). 
if (isOS() eq "TEST") { $code .= "${indent}if [ -e $dots/$path/$file ] ; then\n"; $code .= "${indent} cd $dots/$path\n"; $code .= "${indent} $client upload --path $ns/$path/$file $file\n"; $code .= "${indent} cd -\n"; $code .= "${indent}fi\n"; } elsif (isOS() eq "DNANEXUS") { } else { $code .= "# File is important: $file\n"; } return($code); } # # Given $base/$asm.gkpStore, fetch or stash it. # # The non-shell versions are assumed to be running in the assembly directory, that is, where # $base/$asm.gkpStore would exist naturally. This is consistent with canu.pl - it runs in the # assembly directory, and then chdir to subdirectories to run binaries. # # The shell versions usually run within a subdirectory (e.g., in correction/0-mercounts). They # need to know this location, so they can go up to the assembly directory to fetch and unpack the # store. After fetching, they chdir back to the subdirectory. # sub fetchStore ($) { my $store = shift @_; # correction/asm.gkpStore my $client = getGlobal("objectStoreClient"); my $ns = getGlobal("objectStoreNameSpace"); return if (-e "$store/info"); # Store exists on disk return if (! fileExists("$store.tar")); # Store doesn't exist in object store if (isOS() eq "TEST") { runCommandSilently(".", "$client download --output - $ns/$store.tar | tar -xf -", 1); } elsif (isOS() eq "DNANEXUS") { } else { } } sub stashStore ($) { my $store = shift @_; # correction/asm.gkpStore my $client = getGlobal("objectStoreClient"); my $ns = getGlobal("objectStoreNameSpace"); return if (! 
-e "$store/info"); # Store doesn't exist on disk if (isOS() eq "TEST") { runCommandSilently(".", "tar -cf - $store | $client upload --path $ns/$store.tar -", 1); } elsif (isOS() eq "DNANEXUS") { } else { } } sub fetchStoreShellCode ($$@) { my $store = shift @_; # correction/asm.gkpStore - store we're trying to get my $root = shift @_; # correction/1-overlapper - place the script is running in my $indent = shift @_; # my $base = dirname($store); # correction my $basep = pathToDots($root); # ../.. my $name = basename($store); # asm.gkpStore my $code; my $client = getGlobal("objectStoreClient"); my $ns = getGlobal("objectStoreNameSpace"); if (isOS() eq "TEST") { $code .= "${indent}if [ ! -e $basep/$store/info ] ; then\n"; $code .= "${indent} echo Fetching $ns/$store\n"; $code .= "${indent} $client download --output - $ns/$store.tar | tar -C $basep -xf -\n"; $code .= "${indent}fi\n"; } elsif (isOS() eq "DNANEXUS") { } else { $code .= "# Store must exist: $store\n"; } return($code); } sub stashStoreShellCode ($$@) { my $store = shift @_; # correction/asm.gkpStore - store we're trying to get my $root = shift @_; # correction/1-overlapper - place the script is running in my $indent = shift @_; # my $base = dirname($store); # correction my $basep = pathToDots($root); # ../.. 
my $name = basename($store); # asm.gkpStore my $code; my $client = getGlobal("objectStoreClient"); my $ns = getGlobal("objectStoreNameSpace"); if (isOS() eq "TEST") { $code .= "${indent}if [ -e $basep/$store/info ] ; then\n"; $code .= "${indent} echo Stashing $ns/$store\n"; $code .= "${indent} tar -C $basep -cf - $store | $client upload --path $ns/$store.tar -\n"; $code .= "${indent}fi\n"; } elsif (isOS() eq "DNANEXUS") { } else { $code .= "# Store is important: $store\n"; } return($code); } canu-1.6/src/pipelines/canu/Grid_DNANexus.pm000066400000000000000000000066121314437614700207620ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. Walenz beginning on 2017-FEB-11 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## package canu::Grid_DNANexus; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(detectDNANexus configureDNANexus); use strict; use canu::Defaults; use canu::Grid; use canu::Execution; sub detectDNANexus () { return if ( defined(getGlobal("gridEngine"))); # Grid not requested. return if (!defined($ENV{'DNA_NEXUS'})); # Not a DNA Nexus grid print STDERR "-- Detected DNA Nexus '...some-version...'.\n"; setGlobal("gridEngine", "DNANEXUS"); # DNANexus mode doesn't support (easily) the gatekeeper check on short reads. 
# The issue is that we'd need to save the store, ask the user to accept it (and rename), # then continue. Nothing super tricky, just not done. setGlobal("stopOnReadQuality", 0); } sub configureDNANexus () { return if (uc(getGlobal("gridEngine")) ne "DNANEXUS"); my $maxArraySize = 65535; # Probe for the maximum array job size setGlobalIfUndef("gridEngineSubmitCommand", ""); setGlobalIfUndef("gridEngineNameOption", ""); setGlobalIfUndef("gridEngineArrayOption", ""); setGlobalIfUndef("gridEngineArrayName", ""); setGlobalIfUndef("gridEngineArrayMaxJobs", $maxArraySize); setGlobalIfUndef("gridEngineOutputOption", ""); setGlobalIfUndef("gridEnginePropagateCommand", ""); setGlobalIfUndef("gridEngineThreadsOption", undef); setGlobalIfUndef("gridEngineMemoryOption", undef); setGlobalIfUndef("gridEngineNameToJobIDCommand", undef); setGlobalIfUndef("gridEngineNameToJobIDCommandNoArray", undef); setGlobalIfUndef("gridEngineTaskID", ""); setGlobalIfUndef("gridEngineArraySubmitID", ""); setGlobalIfUndef("gridEngineJobID", ""); my %hosts; # Probe for how to request multiple CPUs on each node, set # . # . # . # Probe for how to reserve memory on each node # . # . # . # Build a list of the resources available in the grid. This will contain a list with keys of # "#CPUs-#GBs" and values of the number of nodes With such a config. Later on, we'll use this # to figure out what specific settings to use for each algorithm. # # The list is saved in global{"availableHosts"} # . # $hosts{"4-32"} = 15; # 15 machines with 4 CPUs and 32gb memory # . setGlobal("availableHosts", formatAllowedResources(%hosts, "DNA Nexus")); } canu-1.6/src/pipelines/canu/Grid_LSF.pm000066400000000000000000000122401314437614700177530ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. 
# # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. Walenz beginning on 2015-NOV-27 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2015-NOV-30 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## package canu::Grid_LSF; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(detectLSF configureLSF); use strict; use canu::Defaults; use canu::Execution; use canu::Grid; sub detectLSF () { return if ( defined(getGlobal("gridEngine"))); my $bsub = findExecutable("bsub"); return if (!defined($bsub)); print STDERR "-- Detected LSF with 'bsub' binary in $bsub.\n"; setGlobal("gridEngine", "LSF"); } sub configureLSF () { return if (uc(getGlobal("gridEngine")) ne "LSF"); setGlobalIfUndef("gridEngineSubmitCommand", "bsub"); setGlobalIfUndef("gridEngineNameOption", "-J"); setGlobalIfUndef("gridEngineArrayOption", ""); setGlobalIfUndef("gridEngineArrayName", "ARRAY_NAME\[ARRAY_JOBS\]"); setGlobalIfUndef("gridEngineArrayMaxJobs", 65535); setGlobalIfUndef("gridEngineOutputOption", "-o"); setGlobalIfUndef("gridEngineThreadsOption", "-R span[hosts=1] -n THREADS"); setGlobalIfUndef("gridEngineMemoryOption", "-M MEMORY"); setGlobalIfUndef("gridEnginePropagateCommand", "bmodify -w \"done\(\"WAIT_TAG\"\)\""); setGlobalIfUndef("gridEngineNameToJobIDCommand", "bjobs -A -J \"WAIT_TAG\" | grep -v JOBID"); setGlobalIfUndef("gridEngineNameToJobIDCommandNoArray", "bjobs -J \"WAIT_TAG\" | grep -v JOBID"); 
setGlobalIfUndef("gridEngineTaskID", "LSB_JOBINDEX"); setGlobalIfUndef("gridEngineArraySubmitID", "%I"); setGlobalIfUndef("gridEngineJobID", "LSB_JOBID"); # # LSF has variation in the units used to request memory # They are defined by the LSF_UNIT_FOR_LIMITS variable in lsf.conf # Poll and see if we can find it # my $memUnits = undef; open(F, "lsadmin showconf lim |"); my $s = <F>; # cluster name my $d = <F>; # date/time while (<F>) { my @v = split '=', $_; if ($v[0] =~ m/LSF_UNIT_FOR_LIMITS/) { $memUnits = "t" if ($v[1] =~ m/[tT]/); $memUnits = "g" if ($v[1] =~ m/[gG]/); $memUnits = "m" if ($v[1] =~ m/[mM]/); $memUnits = "k" if ($v[1] =~ m/[kK]/); } } close(F); if (!defined($memUnits)) { print STDERR "-- Warning: unknown memory units for grid engine LSF; assuming KB\n"; $memUnits = "k"; } # Build a list of the resources available in the grid. This will contain a list with keys # of "#CPUs-#GBs" and values of the number of nodes with such a config. Later on, we'll use this # to figure out what specific settings to use for each algorithm. # # The list is saved in global{"availableHosts"} # # !!! UNTESTED !!
# my %hosts; open(F, "lshosts |"); my $h = <F>; # header my @h = split '\s+', $h; my $cpuIdx = 4; my $memIdx = 5; for (my $ii=0; ($ii < scalar(@h)); $ii++) { $cpuIdx = $ii if ($h[$ii] eq "ncpus"); $memIdx = $ii if ($h[$ii] eq "maxmem"); } while (<F>) { my @v = split '\s+', $_; my $cpus = $v[$cpuIdx]; my $mem = $v[$memIdx]; # if we failed to find the units from the configuration, inherit it from the lshosts output if (!defined($memUnits)) { $memUnits = "t" if ($mem =~ m/(\d+.*\d+)[tT]/); $memUnits = "g" if ($mem =~ m/(\d+.*\d+)[gG]/); $memUnits = "m" if ($mem =~ m/(\d+.*\d+)[mM]/); $memUnits = "k" if ($mem =~ m/(\d+.*\d+)[kK]/); } $mem = $1 * 1024 if ($mem =~ m/(\d+.*\d+)[tT]/); $mem = $1 * 1 if ($mem =~ m/(\d+.*\d+)[gG]/); $mem = $1 / 1024 if ($mem =~ m/(\d+.*\d+)[mM]/); $mem = $1 / 1024 / 1024 if ($mem =~ m/(\d+.*\d+)[kK]/); $mem = int($mem); $hosts{"$cpus-$mem"}++ if ($cpus gt 0); } close(F); setGlobal("availableHosts", formatAllowedResources(%hosts, "LSF")); setGlobal("gridEngineMemoryUnits", $memUnits); print STDERR "-- \n"; print STDERR "-- On LSF detected memory is requested in " . uc(${memUnits}) . "B\n"; print STDERR "-- \n"; } canu-1.6/src/pipelines/canu/Grid_PBSTorque.pm000066400000000000000000000161131314437614700211560ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P.
Walenz beginning on 2015-NOV-27 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2015-NOV-30 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## package canu::Grid_PBSTorque; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(detectPBSTorque configurePBSTorque); use strict; use canu::Defaults; use canu::Execution; use canu::Grid; sub detectPBSVersion () { my $isPro = 0; my $version = ""; open(F, "pbsnodes --version 2>&1 |"); while (<F>) { if (m/pbs_version\s+=\s+(.*)/) { $isPro = 1; $version = $1; } if (m/Version:\s+(.*)/) { $version = $1; } } close(F); return($version, $isPro); } sub detectPBSTorque () { return if ( defined(getGlobal("gridEngine"))); my $pbsnodes = findExecutable("pbsnodes"); return if (!defined($pbsnodes)); my ($version, $isPro) = detectPBSVersion(); if ($isPro == 0) { print STDERR "-- Detected PBS/Torque '$version' with 'pbsnodes' binary in $pbsnodes.\n"; setGlobal("gridEngine", "PBS"); } else { print STDERR "-- Detected PBSPro '$version' with 'pbsnodes' binary in $pbsnodes.\n"; setGlobal("gridEngine", "PBSPRO"); } } sub configurePBSTorqueNodes () { my %hosts; print STDERR "-- Detecting PBS/Torque resources.\n"; open(F, "pbsnodes |"); while (<F>) { my $cpus = 0; my $mem = 0; if ($_ =~ m/status/) { my @stats = split ',', $_; for my $stat (@stats) { if ($stat =~ m/physmem/) { $mem = ( split '=', $stat )[-1]; } elsif ($stat =~ m/ncpus/) { $cpus = int(( split '=', $stat )[-1]); } } $mem = $1 * 1024 if ($mem =~ m/(\d+.*\d+)[tT]/); $mem = $1 * 1 if ($mem =~ m/(\d+.*\d+)[gG]/); $mem = $1 / 1024 if ($mem =~ m/(\d+.*\d+)[mM]/); $mem = $1 / 1024 / 1024 if ($mem =~ m/(\d+.*\d+)[kK]/); $mem = int($mem); $hosts{"$cpus-$mem"}++ if ($cpus gt 0); } } close(F); setGlobal("availableHosts", formatAllowedResources(%hosts, "PBS/Torque")); } sub
configurePBSProNodes () { my %hosts; my $mem = 0; my $cpus = 0; print STDERR "-- Detecting PBSPro resources.\n"; open(F, "pbsnodes -av |"); while (<F>) { if (m/resources_available.mem\s*=\s*(\d+)kb/) { $mem = int($1 / 1024 / 1024); } if (m/resources_available.mem\s*=\s*(\d+)mb/) { $mem = int($1 / 1024); } if (m/resources_available.mem\s*=\s*(\d+)gb/) { $mem = int($1); } if (m/resources_available.ncpus\s*=\s*(\d+)/) { $cpus = $1; } if (($cpus > 0) && ($mem > 0)) { $hosts{"$cpus-$mem"}++; $cpus = 0; $mem = 0; } } close(F); setGlobal("availableHosts", formatAllowedResources(%hosts, "PBSPro")); } sub configurePBSTorque () { return if ((uc(getGlobal("gridEngine")) ne "PBS") && (uc(getGlobal("gridEngine")) ne "PBSPRO")); my $isPro = (uc(getGlobal("gridEngine")) eq "PBSPRO"); # For Torque, see if there is a max array size. # For Pro, set to 1000. my $maxArraySize = getGlobal("gridEngineArrayMaxJobs"); if (!defined($maxArraySize)) { $maxArraySize = 1000; open(F, "qmgr -c 'p s' |"); while (<F>) { if (m/max_job_array_size\s+=\s+(\d+)/) { # Torque $maxArraySize = $1; } if (m/max_array_size\s+=\s+(\d+)/) { # PBSPro $maxArraySize = $1; } } close(F); } # PBSPro, again, throws a curve ball at us. There is no way to set the output of array jobs # to something reasonable like name.TASK_ID.err, even though that is basically the default. # So, we unset gridEngineArraySubmitID to get the default name, but then need to move the '-j oe' # somewhere else - and put it in the submit command.
setGlobalIfUndef("gridEngineSubmitCommand", "qsub -j oe -d `pwd`") if ($isPro == 0); setGlobalIfUndef("gridEngineSubmitCommand", "qsub -j oe") if ($isPro == 1); setGlobalIfUndef("gridEngineNameOption", "-N"); setGlobalIfUndef("gridEngineArrayOption", "-t ARRAY_JOBS") if ($isPro == 0); setGlobalIfUndef("gridEngineArrayOption", "-J ARRAY_JOBS") if ($isPro == 1); setGlobalIfUndef("gridEngineArrayName", "ARRAY_NAME"); setGlobalIfUndef("gridEngineArrayMaxJobs", $maxArraySize); setGlobalIfUndef("gridEngineOutputOption", "-o"); setGlobalIfUndef("gridEngineThreadsOption", "-l nodes=1:ppn=THREADS"); setGlobalIfUndef("gridEngineMemoryOption", "-l mem=MEMORY"); setGlobalIfUndef("gridEnginePropagateCommand", "qalter -W depend=afterany:\"WAIT_TAG\""); setGlobalIfUndef("gridEngineNameToJobIDCommand", "qstat -f |grep -F -B 1 WAIT_TAG | grep Id: | grep -F [] |awk '{print \$NF}'"); setGlobalIfUndef("gridEngineNameToJobIDCommandNoArray", "qstat -f |grep -F -B 1 WAIT_TAG | grep Id: |awk '{print \$NF}'"); setGlobalIfUndef("gridEngineTaskID", "PBS_ARRAYID") if ($isPro == 0); setGlobalIfUndef("gridEngineTaskID", "PBS_ARRAY_INDEX") if ($isPro == 1); setGlobalIfUndef("gridEngineArraySubmitID", "\\\$PBS_ARRAYID") if ($isPro == 0); setGlobalIfUndef("gridEngineArraySubmitID", undef) if ($isPro == 1); # Was "\\\$PBS_ARRAY_INDEX" setGlobalIfUndef("gridEngineJobID", "PBS_JOBID"); # Build a list of the resources available in the grid. This will contain a list with keys # of "#CPUs-#GBs" and values of the number of nodes With such a config. Later on, we'll use this # to figure out what specific settings to use for each algorithm. 
# # The list is saved in global{"availableHosts"} configurePBSTorqueNodes() if ($isPro == 0); configurePBSProNodes() if ($isPro == 1); } canu-1.6/src/pipelines/canu/Grid_SGE.pm000066400000000000000000000271371314437614700177600ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Brian P. Walenz beginning on 2015-NOV-27 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2015-NOV-30 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. 
## package canu::Grid_SGE; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(detectSGE configureSGE); use strict; use canu::Defaults; use canu::Grid; use canu::Execution; sub detectSGE () { return if ( defined(getGlobal("gridEngine"))); return if (!defined($ENV{'SGE_ROOT'})); print STDERR "-- Detected Sun Grid Engine in '$ENV{'SGE_ROOT'}/$ENV{'SGE_CELL'}'.\n"; setGlobal("gridEngine", "SGE"); } sub configureSGE () { return if (uc(getGlobal("gridEngine")) ne "SGE"); my $maxArraySize = getGlobal("gridEngineArrayMaxJobs"); if (!defined($maxArraySize)) { $maxArraySize = 65535; open(F, "qconf -sconf |") or caExit("can't run 'qconf' to get SGE config", undef); while (<F>) { if (m/max_aj_tasks\s+(\d+)/) { $maxArraySize = $1; } } close(F); } setGlobalIfUndef("gridEngineSubmitCommand", "qsub"); setGlobalIfUndef("gridEngineNameOption", "-cwd -N"); setGlobalIfUndef("gridEngineArrayOption", "-t ARRAY_JOBS"); setGlobalIfUndef("gridEngineArrayName", "ARRAY_NAME"); setGlobalIfUndef("gridEngineArrayMaxJobs", $maxArraySize); setGlobalIfUndef("gridEngineOutputOption", "-j y -o"); setGlobalIfUndef("gridEnginePropagateCommand", "qalter -hold_jid \"WAIT_TAG\""); setGlobalIfUndef("gridEngineThreadsOption", undef); #"-pe threads THREADS"); setGlobalIfUndef("gridEngineMemoryOption", undef); #"-l mem=MEMORY"); setGlobalIfUndef("gridEngineNameToJobIDCommand", undef); setGlobalIfUndef("gridEngineNameToJobIDCommandNoArray", undef); setGlobalIfUndef("gridEngineTaskID", "SGE_TASK_ID"); setGlobalIfUndef("gridEngineArraySubmitID", "\\\$TASK_ID"); setGlobalIfUndef("gridEngineJobID", "JOB_ID"); # Try to figure out the name of the threaded job execution environment. # It's the one with allocation_rule of $pe_slots.
my $configError = 0; if (!defined(getGlobal("gridEngineThreadsOption"))) { my @env = `qconf -spl`; chomp @env; my @thr; my $bestThr = undef; my $bestSlots = 0; foreach my $env (@env) { my $ns = 0; my $ar = 0; my $jf = 0; open(F, "qconf -sp $env |"); while (<F>) { $ns = $1 if (m/slots\s+(\d+)/); # How many slots can we use? $ar = 1 if (m/allocation_rule.*pe_slots/); # All slots need to be on a single node. #$cs = 1 if (m/control_slaves.*FALSE/); # Doesn't apply to pe_slots. $jf = 1 if (m/job_is_first_task.*TRUE/); # The first task (slot) does actual work. } close(F); next if ($ar == 0); next if ($jf == 0); push @thr, $env; if ($ns > $bestSlots) { $bestThr = $env; $bestSlots = $ns; } } if (scalar(@thr) == 1) { print STDERR "-- Detected Grid Engine environment '$thr[0]'.\n"; setGlobal("gridEngineThreadsOption", "-pe $thr[0] THREADS"); } elsif (scalar(@thr) > 1) { print STDERR "--\n"; print STDERR "-- WARNING: Couldn't determine the SGE parallel environment to run multi-threaded codes.\n"; print STDERR "-- WARNING: Valid choices are:\n"; foreach my $thr (@thr) { print STDERR "-- WARNING: gridEngineThreadsOption=\"-pe $thr THREADS\"\n"; } print STDERR "-- WARNING:\n"; print STDERR "-- WARNING: Using SGE parallel environment '$bestThr'.\n"; print STDERR "--\n"; setGlobal("gridEngineThreadsOption", "-pe $bestThr THREADS"); #$configError++; } else { print STDERR "--\n"; print STDERR "-- WARNING: Couldn't determine the SGE parallel environment to run multi-threaded codes.\n"; print STDERR "-- WARNING: No valid choices found! Find an appropriate Parallel Environment name (qconf -spl) and set:\n"; print STDERR "-- WARNING: gridEngineThreadsOption=\"-pe <name> THREADS\"\n"; print STDERR "--\n"; $configError++; } } elsif (getGlobal("gridEngineThreadsOption") =~ m/-pe\s+(.*)\s+THREADS$/) { print STDERR "-- User supplied Grid Engine environment '", getGlobal("gridEngineThreadsOption"), "'.\n"; } else { caFailure("Couldn't parse gridEngineThreadsOption='" .
getGlobal("gridEngineThreadsOption") . "'", undef); } # Try to figure out the name of the memory resource. if (!defined(getGlobal("gridEngineMemoryOption"))) { my @mem; open(F, "qconf -sc |"); while (<F>) { my @vals = split '\s+', $_; next if ($vals[5] ne "YES"); # Not a consumable resource. next if ($vals[2] ne "MEMORY"); # Not a memory resource. next if ($vals[0] =~ m/swap/); # Don't care about swap. next if ($vals[0] =~ m/virtual/); # Don't care about vm space. push @mem, $vals[0]; } close(F); if (scalar(@mem) == 1) { print STDERR "-- Detected Grid Engine consumable '$mem[0]'.\n"; setGlobal("gridEngineMemoryOption", "-l $mem[0]=MEMORY"); } elsif (scalar(@mem) > 1) { print STDERR "--\n"; print STDERR "-- WARNING: Couldn't determine the SGE resource to request memory.\n"; print STDERR "-- WARNING: Valid choices are (pick one and supply it to canu):\n"; foreach my $mem (@mem) { print STDERR "-- WARNING: gridEngineMemoryOption=\"-l $mem=MEMORY\"\n"; } print STDERR "--\n"; $configError++; } else { print STDERR "--\n"; print STDERR "-- WARNING: Couldn't determine the SGE resource to request memory.\n"; print STDERR "-- WARNING: No valid choices found! Find an appropriate complex name (qconf -sc) and set:\n"; print STDERR "-- WARNING: gridEngineMemoryOption=\"-l <name>=MEMORY\"\n"; print STDERR "--\n"; $configError++; } } elsif (getGlobal("gridEngineMemoryOption") =~ m/^-l\s+.*=MEMORY$/) { print STDERR "-- User supplied Grid Engine consumable '", getGlobal("gridEngineMemoryOption"), "'.\n"; } else { caFailure("Couldn't parse gridEngineMemoryOption='" . getGlobal("gridEngineMemoryOption") . "'", undef); } caExit("can't configure for SGE", undef) if ($configError); # Check that SGE is set up to use the #! line instead of the (stupid) defaults.
my %start_mode; my %start_shell; if (getGlobal('gridOptions') !~ m/-S/) { open(Q, "qconf -sql |"); while (<Q>) { chomp; my $q = $_; $start_mode{$q} = "na"; $start_shell{$q} = "na"; open(F, "qconf -sq $q |"); while (<F>) { $start_mode{$q} = $1 if (m/shell_start_mode\s+(\S+)/); $start_shell{$q} = $1 if (m/shell\s+(\S+)/); } close(F); } my $startBad = undef; foreach my $q (keys %start_mode) { if (($start_mode{$q} ne "unix_behavior") && ($start_shell{$q} =~ m/csh$/)) { $startBad .= "-- WARNING: Queue '$q' has start mode set to 'posix_behavior' and shell set to '$start_shell{$q}'.\n"; } } if (defined($startBad)) { my $bash = findCommand("bash"); my $sh = findCommand("sh"); my $shell; $shell = $bash if ($bash ne ""); $shell = $sh if ($sh ne ""); print STDERR "--\n"; print STDERR "-- WARNING:\n"; print STDERR "$startBad"; print STDERR "-- WARNING:\n"; print STDERR "-- WARNING: Some queues in your configuration will fail to start jobs correctly.\n"; print STDERR "-- WARNING: Jobs will be submitted with option:\n"; print STDERR "-- WARNING: gridOptions=-S $shell\n"; print STDERR "-- WARNING:\n"; print STDERR "-- WARNING: If jobs fail to start, modify the above option to use a valid shell\n"; print STDERR "-- WARNING: and supply it directly to canu.\n"; print STDERR "-- WARNING:\n"; if (!defined(getGlobal('gridOptions'))) { setGlobal('gridOptions', "-S $shell"); } else { setGlobal('gridOptions', getGlobal('gridOptions') . " -S $shell"); } } } # Build a list of the resources available in the grid. This will contain a list with keys # of "#CPUs-#GBs" and values of the number of nodes with such a config. Later on, we'll use this # to figure out what specific settings to use for each algorithm.
    #
    #  The list is saved in global{"availableHosts"}

    my %hosts;
    my $hosts = "";

    open(F, "qhost |");

    my $h = <F>;  #  Header
    my $b = <F>;  #  Table bar

    my @h = split '\s+', $h;

    my $cpuIdx = 2;
    my $memIdx = 4;

    for (my $ii=0; ($ii < scalar(@h)); $ii++) {
        $cpuIdx = $ii  if ($h[$ii] eq "NCPU");
        $memIdx = $ii  if ($h[$ii] eq "MEMTOT");
    }

    while (<F>) {
        my @v = split '\s+', $_;

        next  if ($v[3] eq "-");  #  Node disabled or otherwise not available

        my $cpus = $v[$cpuIdx];
        my $mem  = $v[$memIdx];

        $mem = $1 * 1024  if ($mem =~ m/(\d+.*\d+)[tT]/);
        $mem = $1 * 1     if ($mem =~ m/(\d+.*\d+)[gG]/);
        $mem = $1 / 1024  if ($mem =~ m/(\d+.*\d+)[mM]/);
        $mem = int($mem);

        $hosts{"$cpus-$mem"}++  if ($cpus > 0);
    }
    close(F);

    if (scalar(keys(%hosts)) == 0) {
        my $mm = getGlobal("maxMemory");
        my $mt = getGlobal("maxThreads");

        print STDERR "--\n";
        print STDERR "-- WARNING: No hosts found in 'qhost' report.\n";
        print STDERR "-- WARNING: Will use maxMemory=$mm and maxThreads=$mt instead.\n";

        print STDERR "-- ERROR: maxMemory not defined!\n"   if (!defined($mm));
        print STDERR "-- ERROR: maxThreads not defined!\n"  if (!defined($mt));

        caExit("maxMemory or maxThreads not defined", undef)  if (!defined($mm) || !defined($mt));

        $hosts{"$mt-$mm"}++;
    }

    setGlobal("availableHosts", formatAllowedResources(%hosts, "Sun Grid Engine"));
}
canu-1.6/src/pipelines/canu/Grid_Slurm.pm000066400000000000000000000112261314437614700204340ustar00rootroot00000000000000###############################################################################
#
#  This file is part of canu, a software program that assembles whole-genome
#  sequencing reads into contigs.
#
#  This software is based on:
#    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
#    the 'kmer package' (http://kmer.sourceforge.net)
#  both originally distributed by Applera Corporation under the GNU General
#  Public License, version 2.
#
#  Canu branched from Celera Assembler at its revision 4587.
#  Canu branched from the kmer project at its revision 1994.
#
#  Modifications by:
#
#    Brian P. Walenz beginning on 2015-NOV-27
#      are a 'United States Government Work', and
#      are released in the public domain
#
#    Sergey Koren beginning on 2015-NOV-30
#      are a 'United States Government Work', and
#      are released in the public domain
#
#  File 'README.licenses' in the root directory of this distribution contains
#  full conditions and disclaimers for each license.
##

package canu::Grid_Slurm;

require Exporter;

@ISA    = qw(Exporter);
@EXPORT = qw(detectSlurm configureSlurm);

use strict;

use canu::Defaults;
use canu::Execution;
use canu::Grid;


sub detectSlurm () {
    return   if ( defined(getGlobal("gridEngine")));

    my $sinfo = findExecutable("sinfo");

    return   if (!defined($sinfo));

    print STDERR "-- Detected Slurm with 'sinfo' binary in $sinfo.\n";
    setGlobal("gridEngine", "SLURM");
}


sub configureSlurm () {
    return   if (uc(getGlobal("gridEngine")) ne "SLURM");

    my $maxArraySize = 65535;

    #  From the docs (http://slurm.schedmd.com/job_array.html):
    #
    #    Note that the minimum index value is zero and the maximum value is a Slurm configuration
    #    parameter (MaxArraySize minus one).
    #
    #  Which is a totally stupid name for the parameter, and a totally stupid interpretation.
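The rule quoted above means the largest usable array index is the configured MaxArraySize minus one, falling back to a default when the parameter is absent. A standalone sketch of that computation (Python rather than the module's Perl, so it runs outside canu; the config text below is invented, not real `scontrol` output):

```python
import re

def max_array_index(scontrol_config, default=65535):
    """Return the largest usable Slurm array index: MaxArraySize - 1,
    or a default when the parameter is not found in the config dump."""
    for line in scontrol_config.splitlines():
        m = re.search(r"MaxArraySize\s*=\s*(\d+)", line)
        if m:
            return int(m.group(1)) - 1
    return default

config = "SchedulerType           = sched/backfill\nMaxArraySize            = 1001\n"
print(max_array_index(config))         # 1000: indices 0..1000 are usable
print(max_array_index("no such key"))  # falls back to the 65535 default
```

This mirrors the `$maxArraySize = $1 - 1` adjustment the module applies to the value reported by `scontrol show config`.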
    open(F, "scontrol show config |") or caExit("can't run 'scontrol' to get SLURM config", undef);
    while (<F>) {
        if (m/MaxArraySize\s+=\s+(\d+)/) {
            $maxArraySize = $1 - 1;
            print STDERR "-- Detected Slurm with 'MaxArraySize' limited to $maxArraySize jobs.\n";
        }
    }
    close(F);

    setGlobalIfUndef("gridEngineSubmitCommand",              "sbatch");
    setGlobalIfUndef("gridEngineNameOption",                 "-D `pwd` -J");
    setGlobalIfUndef("gridEngineArrayOption",                "-a ARRAY_JOBS");
    setGlobalIfUndef("gridEngineArrayName",                  "ARRAY_NAME");
    setGlobalIfUndef("gridEngineArrayMaxJobs",               $maxArraySize);
    setGlobalIfUndef("gridEngineOutputOption",               "-o");  ##  NB: SLURM default joins STDERR & STDOUT if no -e specified
    setGlobalIfUndef("gridEngineThreadsOption",              "--cpus-per-task=THREADS");
    setGlobalIfUndef("gridEngineMemoryOption",               "--mem-per-cpu=MEMORY");
    setGlobalIfUndef("gridEnginePropagateCommand",           "scontrol update job=\"WAIT_TAG\"");  ##  TODO: manually verify this in all cases
    setGlobalIfUndef("gridEngineNameToJobIDCommand",         "squeue -h -o\%F -n \"WAIT_TAG\" | uniq");  ##  TODO: manually verify this in all cases
    setGlobalIfUndef("gridEngineNameToJobIDCommandNoArray",  "squeue -h -o\%i -n \"WAIT_TAG\"");  ##  TODO: manually verify this in all cases
    setGlobalIfUndef("gridEngineTaskID",                     "SLURM_ARRAY_TASK_ID");
    setGlobalIfUndef("gridEngineArraySubmitID",              "%A_%a");
    setGlobalIfUndef("gridEngineJobID",                      "SLURM_JOB_ID");

    #  Build a list of the resources available in the grid.  This will contain a list with keys
    #  of "#CPUs-#GBs" and values of the number of nodes with such a config.  Later on, we'll use this
    #  to figure out what specific settings to use for each algorithm.
    #
    #  The list is saved in global{"availableHosts"}
    #
    my %hosts;

    #  NODELIST NODES CPUS MEMORY
    open(F, "sinfo --exact -o '%N %D %c %m' | grep -v drained | grep -v interactive |");

    my $h = <F>;  #  header

    my @h = split '\s+', $h;

    my $nodeIdx = 1;
    my $cpuIdx  = 4;
    my $memIdx  = 6;

    for (my $ii=0; ($ii < scalar(@h)); $ii++) {
        $nodeIdx = $ii  if ($h[$ii] eq "NODES");
        $cpuIdx  = $ii  if ($h[$ii] eq "CPUS");
        $memIdx  = $ii  if ($h[$ii] eq "MEMORY");
    }

    while (<F>) {
        my @v = split '\s+', $_;

        my $cpus  = $v[$cpuIdx];
        my $mem   = $v[$memIdx] / 1024;
        my $nodes = $v[$nodeIdx];

        $hosts{"$cpus-$mem"} += int($nodes)  if ($cpus > 0);
    }
    close(F);

    setGlobal("availableHosts", formatAllowedResources(%hosts, "Slurm"));
}
canu-1.6/src/pipelines/canu/HTML.pm000066400000000000000000001154471314437614700171410ustar00rootroot00000000000000###############################################################################
#
#  This file is part of canu, a software program that assembles whole-genome
#  sequencing reads into contigs.
#
#  This software is based on:
#    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
#    the 'kmer package' (http://kmer.sourceforge.net)
#  both originally distributed by Applera Corporation under the GNU General
#  Public License, version 2.
#
#  Canu branched from Celera Assembler at its revision 4587.
#  Canu branched from the kmer project at its revision 1994.
#
#  Modifications by:
#
#    Brian P. Walenz beginning on 2015-NOV-08
#      are a 'United States Government Work', and
#      are released in the public domain
#
#  File 'README.licenses' in the root directory of this distribution contains
#  full conditions and disclaimers for each license.
##

package canu::HTML;

require Exporter;

@ISA    = qw(Exporter);
@EXPORT = qw(buildHTML);

use strict;

use File::Copy;
use File::Path 2.08 qw(make_path remove_tree);

use canu::Defaults;
use canu::Execution;


sub copyFile ($$) {
    my $sPath = shift @_;  #  Path to source file.
    my $dPath = shift @_;  #  Path to destination file.

    if ((-e $sPath) && ((!
-e $dPath) || ((-M $sPath) < (-M $dPath)))) { copy($sPath, $dPath); } } sub simpleFigure ($$$$) { my $body = shift @_; my $sImage = shift @_; my $dImage = shift @_; my $text = shift @_; my $format = getGlobal("gnuplotImageFormat"); # No image? Note so in the html. if ((! -e "$sImage.sm.$format") && (! -e "$sImage.lg.$format") && (! -e "$dImage.sm.$format") && (! -e "$dImage.lg.$format")) { push @$body, "

Image '$sImage' not found.

\n"; return; } # Copy the file to our files location. copyFile("$sImage.lg.$format", "$dImage.lg.$format"); copyFile("$sImage.sm.$format", "$dImage.sm.$format"); # Empty image? Note so in the html. if ((-z "$dImage.sm.$format") || (-z "$dImage.lg.$format")) { push @$body, "

Image '$sImage' is empty. Probably no data to display.

\n"; return; } # Otherwise, show it! push @$body, "
\n"; push @$body, "\n"; push @$body, "
\n"; push @$body, "$text\n"; push @$body, "
\n"; push @$body, "
\n"; } sub buildGatekeeperHTML ($$$$$$) { my $base = shift @_; my $asm = shift @_; my $tag = shift @_; my $css = shift @_; # Array reference my $body = shift @_; # Array reference my $scripts = shift @_; # Array reference push @$body, "

Input Reads

\n"; push @$body, "\n"; if (! -e "$base/$asm.gkpStore/load.dat") { push @$body, "

None loaded.

\n"; return; } push @$body, "\n"; open(F, "< $base/$asm.gkpStore/load.dat") or caExit("can't open '$base/$asm.gkpStore/load.dat' for reading: $!", undef); while () { # nam blocks show up once per file. if (m/^nam\s(\d+)\s(.*)$/) { my $idx = $1; my $file = $2; push @$body, "\n"; push @$scripts, "document.getElementById('gkpload$idx').onclick = toggleTable;\n"; push @$scripts, "document.getElementById('gkpload$idx').style = 'cursor: pointer;';\n"; } # lib blocks show up once per file, all parameters are on the same line elsif (m/^lib\s/) { my @libs = split '\s+', $_; my ($param, $np, $var, $val); $param = shift @libs; # Throw out the first 'lib' word. $np = scalar(@libs); $param = shift @libs; # First thing we want to report. # First row needs to have a spanning cell for the 'parameters'. ($var, $val) = split '=', $param; push @$body, "\n"; # Remaining rows just have var=val. foreach $param (@libs) { ($var, $val) = split '=', $param; push @$body, "\n"; } } # dat blocks show up once per file, and are the last block emitted for a file elsif (m/^dat\s(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\d+)$/) { my $nLOADEDA = $1; my $bLOADEDA = $2; my $nSKIPPEDA = $3; my $bSKIPPEDA = $4; my $nLOADEDQ = $5; my $bLOADEDQ = $6; my $nSKIPPEDQ = $7; my $bSKIPPEDQ = $8; my $nWARNS = $9; push @$body, "\n",; push @$body, "\n"; push @$body, "\n"; push @$body, "\n"; my $nl = $nLOADEDA + $nLOADEDQ; my $bl = $bLOADEDA + $bLOADEDQ; my $ns = $nSKIPPEDA + $nSKIPPEDQ; my $bs = $bSKIPPEDA + $bSKIPPEDQ; push @$body, "\n"; } # the sum block shows up excatly once, a summary of all the reads loaded elsif (m/^sum\s(\d+)\s(\d+)\s(\d+)\s(\d+)\s(\d+)$/) { my $nLOADED = $1; my $bLOADED = $2; my $nSKIPPED = $3; my $bSKIPPED = $4; my $nWARNS = $5; push @$body, "
$file
Parameters$var = $val
$var = $val
FASTA$nLOADEDA reads ($bLOADEDA bp)
$nSKIPPEDA reads ($bSKIPPEDA bp) were short and not loaded
FASTQ$nLOADEDQ reads ($bLOADEDQ bp)
$nSKIPPEDQ reads ($bSKIPPEDQ bp) were short and not loaded
$nl reads ($bl bp) loaded, $ns reads ($bs bp) skipped, $nWARNS warnings
\n"; push @$body, "\n"; push @$body, "

Final Store

\n"; push @$body, "\n"; push @$body, "\n"; push @$body, "\n"; push @$body, "\n"; push @$body, "\n"; push @$body, "\n"; push @$body, "
$base/$asm.gkpStore
readsLoaded$nLOADED reads ($bLOADED bp)
readsSkipped$nSKIPPED reads ($bSKIPPED bp) (read was too short)
warnings$nWARNS warnings (invalid base or quality value)
\n"; } else { caExit("failed to read '$base/$asm.gkpStore/load.log': invalid format", undef); } } close(F); push @$body, "

Read Length Histogram

\n"; simpleFigure($body, "$base/$asm.gkpStore/readlengths", "$base.html.files/readlengths", ""); } sub buildMerylHTML ($$$$$$) { my $base = shift @_; my $asm = shift @_; my $tag = shift @_; my $css = shift @_; # Array reference my $body = shift @_; # Array reference my $scripts = shift @_; # Array reference push @$body, "

k-Mer Counts

\n"; push @$body, "\n"; if (! -d "$base/0-mercounts") { push @$body, "

Stage not computed. ($base/0-mercounts)

\n"; return; } my %merSizes; open(F, "ls $base/0-mercounts/ |") or caExit("can't find files in '$base/0-mercounts': $!", undef); while () { if (m/\.ms(\d+)\./) { $merSizes{$1}++; } } close(F); foreach my $ms (keys %merSizes) { my $numTotal = 0; my $numDistinct = 0; my $numUnique = 0; my $largest = 0; if (-e "$base/0-mercounts/$asm.ms$ms.histogram.info") { open(F, "< $base/0-mercounts/$asm.ms$ms.histogram.info") or caExit("can't open '$base/0-mercounts/$asm.ms$ms.histogram.info' for reading: $!", undef); while () { $numTotal = $1 if (m/Found\s(\d+)\s+mers./); $numDistinct = $1 if (m/Found\s(\d+)\s+distinct\smers./); $numUnique = $1 if (m/Found\s(\d+)\s+unique\smers./); $largest = $1 if (m/Largest\smercount\sis\s(\d+)/); } close(F); simpleFigure($body, "$base/0-mercounts/$asm.ms$ms.histogram", "$base.html.files/$asm.ms$ms.histogram", "Histogram for k=$ms with $numTotal mers, $numDistinct distinct mers and $numUnique single-copy mers. Largest count is $largest."); } elsif ((-e "$base/0-mercounts/$asm.ms$ms.ignore") && (-z "$base/0-mercounts/$asm.ms$ms.ignore")) { push @$body, "Threshold zero. No mers reported.\n"; } elsif ((-e "$base/0-mercounts/$asm.ms$ms.fasta") && (-z "$base/0-mercounts/$asm.ms$ms.fasta")) { push @$body, "Threshold zero. No mers reported.\n"; } else { push @$body, "Using user-supplied frequent mers.\n"; } } } sub buildCorrectionHTML ($$$$$$) { my $base = shift @_; my $asm = shift @_; my $tag = shift @_; my $css = shift @_; # Array reference my $body = shift @_; # Array reference my $scripts = shift @_; # Array reference # Need to include the minimum original read length that is correctable # Summarizes filterCorrectionOverlaps outputs. push @$body, "

Overlap Filtering

\n"; push @$body, "\n"; if (-e "$base/2-correction/$asm.globalScores.stats") { my $rh; # 'row header', for labeling a set of rows with a common cell push @$body, "\n"; open(F, "< $base/2-correction/$asm.globalScores.stats") or caExit("can't open '$base/2-correction/$asm.globalScores.stats' for reading: $!", undef); while () { chomp; next if (m/^$/); # Skip blank lines. push @$body, "\n" if ($_ eq "PARAMETERS:"); push @$body, "\n" if ($_ eq "OVERLAPS:"); push @$body, "\n" if ($_ eq "READS:"); $rh = "" if ($_ eq "PARAMETERS:"); $rh = "" if ($_ eq "OVERLAPS:"); # Gets replaced by 'IGNORED' below. $rh = "" if ($_ eq "READS:"); $rh = "" if ($_ eq "IGNORED:"); $rh = "" if ($_ eq "FILTERED:"); $rh = "" if ($_ eq "EVIDENCE:"); $rh = "" if ($_ eq "TOTAL:"); if (m/^\s*(\d+\.*\d*)\s+\((.*)\)$/) { push @$body, "$rh\n"; $rh = undef; } } close(F); push @$body, "
PARAMETERS
OVERLAPS
READS
IgnoredFilteredEvidenceTotal
$1$2
\n"; } else { push @$body, "

Stage not computed or results file removed ($base/2-correction/$asm.globalScores.stats).

\n"; } push @$body, "

Read Correction

\n"; push @$body, "\n"; # Summarizes expensiveFilter() outputs - we want to get the 'corrected read length filter' numbers. # which should be the first set in the file. my $nReads = undef; my $nBasesIn = undef; my $nBasesOut = undef; if (-e "$base/2-correction/$asm.readsToCorrect.summary") { open(F, "< $base/2-correction/$asm.readsToCorrect.summary") or caExit("can't open '$base/2-correction/$asm.readsToCorrect.summary' for reading: $!", undef); while () { $nReads = $1 if ((m/nReads\s+(\d+)/) && (!defined($nReads))); $nBasesIn = $1 if ((m/nBasds\s+(\d+).*input/) && (!defined($nBasesIn))); $nBasesOut = $1 if ((m/nReads\s+(\d+).*output/) && (!defined($nBasesOut))); last if (m/^Raw\sreads/); } close(F); push @$body, "

Filter method: corFilter=expensive. Expect to correct $nReads reads with ${nBasesIn}bp to ${nBasesOut}bp.

\n"; } else { push @$body, "

Filter method: corFilter=quick.

\n"; } # $base/2-correction/$asm.readsToCorrect has 'readID', 'originalLength' and 'expectedCorrectedLength'. # $BASE/$asm.correctedReads.length has 'readID', 'pieceID', 'length'. # # Both files should be sorted by increasing ID, so a simple merge sufficies. if (-e "$base/2-correction/$asm.correction.summary") { my $rh; push @$body, "\n"; open(F, "< $base/2-correction/$asm.correction.summary") or caExit("can't open '$base/2-correction/$asm.correction.summary' for reading: $!", undef); while () { chomp; next if (m/^$/); # Skip blank lines. push @$body, "\n" if ($_ eq "CORRECTION INPUTS:"); push @$body, "\n" if ($_ eq "CORRECTION OUTPUTS:"); push @$body, "\n" if ($_ eq "PIECES PER READ:"); # Normal table lines. if (m/^\s*(\d+\.*\d*)\s+\((.*)\)$/) { push @$body, "$rh\n"; $rh = undef; } # Pieces per read histogram. if (m/^\s*(\d+)\s+pieces:\s+(\d+)$/) { push @$body, "$rh\n"; $rh = undef; } } close(F); push @$body, "
INPUTS
OUTPUTS
PIECES PER READ
$1$2
$1$2
\n"; } # Really should be a 'caption' on the 'pieces per read' table. push @$body, "

A single input read can be split into multiple output reads, or possibly not even output at all.

\n"; # Simple vs Expensive filter true/false positive simpleFigure($body, "$base/2-correction/$asm.estimate.original-x-corrected", "$base.html.files/$asm.estimate.original-x-corrected", "Scatter plot of the original read length (X axis) against the expected corrected read length (Y axis).\n" . "Colors show a comparison of the simple filter (which doesn't use overlaps) to the expensive filter (which does).\n" . "A large green triangle (false negatives) hints that there could be abnormally low quality regions in the reads.\n"); # Scatter plots of read lengths - they don't show much. # Original vs expected shown above. simpleFigure($body, "$base/2-correction/$asm.originalLength-vs-expectedLength", "$base.html.files/$asm.originalLength-vs-expectedLength", "Scatter plot of original vs expected read length. Shown in filter plot above."); simpleFigure($body, "$base/2-correction/$asm.originalLength-vs-correctedLength", "$base.html.files/$asm.originalLength-vs-correctedLength", "Scatter plot of original vs corrected read length."); simpleFigure($body, "$base/2-correction/$asm.expectedLength-vs-correctedLength", "$base.html.files/$asm.expectedLength-vs-correctedLength", "Scatter plot of expected vs corrected read length."); # Histogram - expected vs corrected lengths NEEDS TO SHOW NEGATIVES!? simpleFigure($body, "$base/2-correction/$asm.length-difference-histograms", "$base.html.files/$asm.length-difference-histograms", "Histogram of the difference between the expected and corrected read lengths.\n" . 
"Note that a negative difference means the corrected read is larger than expected.\n"); # Histogram - original, expected, corrected lengths simpleFigure($body, "$base/2-correction/$asm.length-histograms", "$base.html.files/$asm.length-histograms", "Histogram of original (red), expected (green) and actual corrected (blue) read lengths.\n"); } sub buildTrimmingHTML ($$$$$$) { my $base = shift @_; my $asm = shift @_; my $tag = shift @_; my $css = shift @_; # Array reference my $body = shift @_; # Array reference my $scripts = shift @_; # Array reference push @$body, "

Trimming

\n"; push @$body, "\n"; if (-e "$base/3-overlapbasedtrimming/$asm.1.trimReads.stats") { my $rh; # 'row header', for labeling a set of rows with a common cell # Read once to make a parameters table. We could have embedded this in the loop below, but it's cleaner here. #push @$body, "\n"; #push @$body, "
\n"; # Read again for the statistics. push @$body, "\n"; open(F, "< $base/3-overlapbasedtrimming/$asm.1.trimReads.stats") or caExit("can't open '$base/3-overlapbasedtrimming/$asm.1.trimReads.stats' for reading: $!", undef); while () { chomp; next if (m/^$/); # Skip blank lines. push @$body, "\n" if ($_ eq "PARAMETERS:"); push @$body, "
PARAMETERS
\n" if ($_ eq "INPUT READS:"); # Start a new table because 'params' has only push @$body, "\n" if ($_ eq "INPUT READS:"); # 2 cols, but the rest have 3 push @$body, "\n" if ($_ eq "INPUT READS:"); push @$body, "\n" if ($_ eq "INPUT READS:"); push @$body, "\n" if ($_ eq "OUTPUT READS:"); push @$body, "\n" if ($_ eq "OUTPUT READS:"); push @$body, "\n" if ($_ eq "TRIMMING DETAILS:"); push @$body, "\n" if ($_ eq "TRIMMING DETAILS:"); # Normal stats line "number (text)" if (m/^\s*(\d+\.*\d*)\s+\((.*)\)$/) { push @$body, "$rh\n"; $rh = undef; } # Specific to trimming "number reads number bases (text)" if (m/^\s*(\d+\.*\d*)\s+reads\s+(\d+\.*\d*)\s+bases\s+\((.*)\)$/) { push @$body, "$rh\n"; $rh = undef; } } close(F); push @$body, "
INPUT READS
readsbases
OUTPUT READS
readsbases
TRIMMING DETAILS
readsbases
$1$2
$1$2$3
\n"; } else { push @$body, "

Stage not computed or results file removed ($base/3-overlapbasedtrimming/$asm.1.trimReads.stats).

\n"; } simpleFigure($body, "$base/3-overlapbasedtrimming/$asm.1.trimReads.inputDeletedReads", "$base.html.files/$asm.1.trimReads.inputDeletedReads", ""); simpleFigure($body, "$base/3-overlapbasedtrimming/$asm.1.trimReads.inputNoTrimReads", "$base.html.files/$asm.1.trimReads.inputNoTrimReads", ""); simpleFigure($body, "$base/3-overlapbasedtrimming/$asm.1.trimReads.inputReads", "$base.html.files/$asm.1.trimReads.inputReads", ""); simpleFigure($body, "$base/3-overlapbasedtrimming/$asm.1.trimReads.outputDeletedReads", "$base.html.files/$asm.1.trimReads.outputDeletedReads", ""); simpleFigure($body, "$base/3-overlapbasedtrimming/$asm.1.trimReads.outputNoOvlReads", "$base.html.files/$asm.1.trimReads.outputNoOvlReads", ""); simpleFigure($body, "$base/3-overlapbasedtrimming/$asm.1.trimReads.outputTrimmedReads", "$base.html.files/$asm.1.trimReads.outputTrimmedReads", ""); simpleFigure($body, "$base/3-overlapbasedtrimming/$asm.1.trimReads.outputUnchangedReads", "$base.html.files/$asm.1.trimReads.outputUnchangedReads", ""); simpleFigure($body, "$base/3-overlapbasedtrimming/$asm.1.trimReads.trim3", "$base.html.files/$asm.1.trimReads.trim3", ""); simpleFigure($body, "$base/3-overlapbasedtrimming/$asm.1.trimReads.trim5", "$base.html.files/$asm.1.trimReads.trim5", ""); push @$body, "

Splitting

\n"; push @$body, "\n"; if (-e "$base/3-overlapbasedtrimming/$asm.2.splitReads.stats") { my $rh; # 'row header', for labeling a set of rows with a common cell # Read once to make a parameters table. We could have embedded this in the loop below, but it's cleaner here. #push @$body, "\n"; #push @$body, "
\n"; # Read again for the statistics. push @$body, "\n"; open(F, "< $base/3-overlapbasedtrimming/$asm.2.splitReads.stats") or caExit("can't open '$base/3-overlapbasedtrimming/$asm.2.splitReads.stats' for reading: $!", undef); while () { chomp; next if (m/^$/); # Skip blank lines. push @$body, "\n" if ($_ eq "PARAMETERS:"); push @$body, "
PARAMETERS
\n" if ($_ eq "INPUT READS:"); # Start a new table because 'params' has only push @$body, "\n" if ($_ eq "INPUT READS:"); # 2 cols, but the rest have 3 push @$body, "\n" if ($_ eq "INPUT READS:"); push @$body, "\n" if ($_ eq "INPUT READS:"); push @$body, "\n" if ($_ eq "PROCESSED:"); push @$body, "\n" if ($_ eq "PROCESSED:"); push @$body, "\n" if ($_ eq "READS WITH SIGNALS:"); push @$body, "\n" if ($_ eq "READS WITH SIGNALS:"); push @$body, "\n" if ($_ eq "SIGNALS:"); push @$body, "\n" if ($_ eq "SIGNALS:"); push @$body, "\n" if ($_ eq "TRIMMING:"); push @$body, "\n" if ($_ eq "TRIMMING:"); # Normal stats line "number (text)" if (m/^\s*(\d+\.*\d*)\s+\((.*)\)$/) { push @$body, "$rh\n"; $rh = undef; } # Specific to trimming "number reads number bases (text)" if (m/^\s*(\d+\.*\d*)\s+reads\s+(\d+\.*\d*)\s+bases\s+\((.*)\)$/) { push @$body, "$rh\n"; $rh = undef; } if (m/^\s*(\d+\.*\d*)\s+reads\s+(\d+\.*\d*)\s+signals\s+\((.*)\)$/) { push @$body, "$rh\n"; $rh = undef; } } close(F); push @$body, "
INPUT READS
readsbases
PROCESSED
readsbases
READS WITH SIGNALS
readssignals
SIGNALS
readsbases
TRIMMING
readsbases
$1$2
$1$2$3
$1$2$3
\n"; } else { push @$body, "

Stage not computed or results file removed ($base/3-overlapbasedtrimming/$asm.2.splitReads.stats).

\n"; } #buildGatekeeperHTML($base, $asm, $tag, $css, $body, $scripts); # Analyzes the output fastq } sub buildOverlapperHTML ($$$$$$) { my $base = shift @_; my $asm = shift @_; my $tag = shift @_; my $css = shift @_; # Array reference my $body = shift @_; # Array reference my $scripts = shift @_; # Array reference push @$body, "

Overlaps

\n"; push @$body, "\n"; if (! -d "$base/$asm.ovlStore") { push @$body, "

Overlaps not computed.

\n"; return; } if (! -e "$base/$asm.ovlStore.summary") { push @$body, "

No statistics available for store '$base/$asm.ovlStore'.

\n"; return; } push @$body, "\n"; push @$body, "\n"; my ($category, $reads, $readsP, $length, $lengthsd, $size, $sizesd, $analysis); open(F, "< $base/$asm.ovlStore.summary") or caExit("Failed to open overlap store statistics in '$base/$asm.ovlStore': $!", undef); $_ = ; $_ = ; while () { chomp; next if ($_ eq ""); if (m/(.*)\s+(\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+\+-\s+(\d+.\d+)\s+(\d+.\d+)\s+\+-\s+(\d+.\d+)\s+\((.*)\)$/) { $category = $1; $reads = $2; $readsP = $3; $length = $4; $lengthsd = $5; $size = $6; $sizesd = $7; $analysis = $8; push @$body, "\n"; } elsif (m/(.*)\s+(\d+)\s+(\d+.\d+)\s+(\d+.\d+)\s+\+-\s+(\d+.\d+)\s+\((.*)\)$/) { $category = $1; $reads = $2; $readsP = $3; $length = $4; $lengthsd = $5; $size = undef; $sizesd = undef; $analysis = $6; push @$body, "\n"; } else { chomp; caExit("failed to parse line '$_' in file '$base/$asm.ovlStore.summary'", undef); } } close(F); push @$body, "
CategoryReads%Read LengthFeature Size or CoverageAnalysis
$category$reads$readsP$length±$lengthsd$size±$sizesd$analysis
$category$reads$readsP$length±$lengthsd$analysis
\n"; } sub buildOverlapErrorCorrectionHTML ($$$$$$) { my $base = shift @_; my $asm = shift @_; my $tag = shift @_; my $css = shift @_; # Array reference my $body = shift @_; # Array reference my $scripts = shift @_; # Array reference push @$body, "

Overlap Error Adjustment

\n"; push @$body, "\n"; } sub reportSizeStatistics ($$$) { my $css = shift @_; # Array reference my $body = shift @_; # Array reference my $scripts = shift @_; # Array reference $_ = ; chomp; # First real line. push @$body, "\n"; push @$body, "\n"; while (!eof(F) && (length($_) > 0)) { if (m/^(\w+)\s+\((\d+)\s+tigs\)\s+\((\d+)\s+length\)\s+\((\d+)\s+average\)\s+\((\d+.\d+x)\s+coverage\)$/) { push @$body, "\n"; } if (m/^ng(\d\d\d)\s+(\d+)\s+lg(\d\d\d)\s+(\d+)\s+sum\s+(\d+)\s+\((\w+\))$/) { my $ng = $1; my $ngv = $2; my $lg = $3; my $lgv = $4; my $sum = $5; my $typ = $6; $ng =~ s/^0*//; push @$body, "\n"; } $_ = ; chomp; } push @$body, "
FractionLengthSequencesBases
$_
$ng$ngv$lgv$sum
\n"; } sub buildUnitiggerHTML ($$$$$$) { my $base = shift @_; my $asm = shift @_; my $tag = shift @_; my $css = shift @_; # Array reference my $body = shift @_; # Array reference my $scripts = shift @_; # Array reference return if (! -d "$base/4-unitigger"); my @logs; push @logs, "$base/4-unitigger/unitigger.err"; open(F, "ls $base/4-unitigger |"); while () { chomp; push @logs, "$base/4-unitigger/$_" if (m/log$/); } close(F); push @$body, "

Unitigs

\n"; push @$body, "\n"; if (-e "$base/4-unitigger/unitigger.err") { my $all = 0; my $some = 0; my $someL = 0; my $olaps = 0; open(F, "< $base/4-unitigger/unitigger.err"); while () { chomp; #if (m/maxPer.*numBelow=(\d+)\snumEqual=(\d+)\snumAbove=(\d+)\stotalLoad=(\d+)\s/) { # push @$body, "Loaded $4 overlaps. $3 overlaps were omitted due to memory constraints.\n"; #} $someL = $1 if (m/_maxPer\s+=\s+(\d+)\s+overlaps/); $all += $1 if (m/numBelow\s+=\s+(\d+)\s+reads/); $all += $1 if (m/numEqual\s+=\s+(\d+)\s+reads/); $some = $1 if (m/numAbove\s+=\s+(\d+)\s+reads/); $olaps = $1 if (m/totalLoad\s+=\s+(\d+)\s+overlaps/); } close(F); push @$body, "

Overlaps

\n"; push @$body, "\n"; push @$body, "Loaded all overlaps for $all reads.
\n"; push @$body, "Loaded some overlaps for $some reads (the best $someL for each read).
\n" if ($some > 0); push @$body, "Loaded $olaps overlaps in total.
\n"; } if (-e "$base/4-unitigger/$asm.001.filterOverlaps.thr000.num000.log") { push @$body, "

Edges

\n"; push @$body, "\n"; my $initContained = 0; my $initSingleton = 0; my $initSpur = 0; my $initSpurMutualBest = 0; my $initBest = 0; my $initBest0Mutual = 0; my $initBest1Mutual = 0; my $initBest2Mutual = 0; my $mean = 0; my $stddev = 0; my $ms = 0; my $median = 0; my $mad = 0; my $mm = 0; my $noBest = 0; my $highErr = 0; my $acceptable = 0; my $suspicious = 0; my $filtered1 = 0; my $filtered2 = 0; my $lopsided1 = 0; my $lopsided2 = 0; my $finalContained = 0; my $finalSingleton = 0; my $finalSpur = 0; my $finalSpurMutualBest = 0; my $finalBest = 0; my $finalBest0Mutual = 0; my $finalBest1Mutual = 0; my $finalBest2Mutual = 0; open(F, "$base/4-unitigger/$asm.001.filterOverlaps.thr000.num000.log"); $_ = ; chomp; my $block = "none"; while (!eof(F)) { $block = "init" if (m/^INITIAL\sEDGES/); $block = "error" if (m/^ERROR\sRATES/); $block = "edge" if (m/^EDGE\sFILTERING/); $block = "final" if (m/^FINAL\sEDGES/); $initContained = $1 if (($block eq "init") && (m/(\d+)\sreads\sare\scontained/)); $initSingleton = $1 if (($block eq "init") && (m/(\d+)\sreads\shave\sno\sbest\sedges/)); $initSpur = $1 if (($block eq "init") && (m/(\d+)\sreads\shave\sonly\sone\sbest\sedge.*spur/)); $initSpurMutualBest = $1 if (($block eq "init") && (m/(\d+)\sare\smutual\sbest/)); $initBest = $1 if (($block eq "init") && (m/(\d+)\sreads\shave\stwo\sbest\sedges/)); $initBest1Mutual = $1 if (($block eq "init") && (m/(\d+)\shave\sone\smutual\sbest/)); $initBest2Mutual = $1 if (($block eq "init") && (m/(\d+)\shave\stwo\smutual\sbest/)); if (($block eq "error") && (m/mean\s+(\d+.\d+)\s+stddev\s+(\d+.\d+)\s+.*\s+(\d+.\d+)\s+fraction\serror/)) { $mean = $1; $stddev = $2; $ms = $3; } if (($block eq "error") && (m/median\s+(\d+.\d+)\s+mad\s+(\d+.\d+)\s+.*\s+(\d+.\d+)\s+fraction\serror/)) { $median = $1; $mad = $2; $mm = $3; } $suspicious = $1 if (($block eq "edge") && (m/(\d+)\sreads\shave\sa\ssuspicious\soverlap\spattern/)); $filtered1 = $1 if (($block eq "edge") && (m/(\d+)\shad\sone/)); $filtered2 = 
$1 if (($block eq "edge") && (m/(\d+)\shad\stwo/)); $lopsided1 = $1 if (($block eq "edge") && (m/(\d+)\shave\sone/)); $lopsided2 = $1 if (($block eq "edge") && (m/(\d+)\shave\stwo/)); $finalContained = $1 if (($block eq "final") && (m/(\d+)\sreads\sare\scontained/)); $finalSingleton = $1 if (($block eq "final") && (m/(\d+)\sreads\shave\sno\sbest\sedges/)); $finalSpur = $1 if (($block eq "final") && (m/(\d+)\sreads\shave\sonly\sone\sbest\sedge.*spur/)); $finalSpurMutualBest = $1 if (($block eq "final") && (m/(\d+)\sare\smutual\sbest/)); $finalBest = $1 if (($block eq "final") && (m/(\d+)\sreads\shave\stwo\sbest\sedges/)); $finalBest1Mutual = $1 if (($block eq "final") && (m/(\d+)\shave\sone\smutual\sbest/)); $finalBest2Mutual = $1 if (($block eq "final") && (m/(\d+)\shave\stwo\smutual\sbest/)); $_ = ; chomp; } close(F); $initBest0Mutual = $initBest - $initBest1Mutual - $initBest2Mutual; $finalBest0Mutual = $finalBest - $finalBest1Mutual - $finalBest2Mutual; push @$body, "Constructing unitigs using overlaps of at most this fraction error:
\n"; push @$body, "$median +- $mad = $mm = ", $mm * 100, "\% (median absolute deviation)
\n"; push @$body, "$mean +- $stddev = $ms = ", $ms * 100, "\% (standard deviation)
\n"; push @$body, "
\n"; push @$body, "INITIAL EDGES
\n"; push @$body, "$initContained reads are contained.
\n"; push @$body, "$initSingleton reads are singleton.
\n"; push @$body, "$initSpur reads are spur ($initSpurMutualBest have a mutual best edge).
\n"; push @$body, "$initBest reads form the backbone ($initBest0Mutual have no mutual best edges; $initBest1Mutual have one; $initBest2Mutual have both).
\n"; push @$body, "
\n"; push @$body, "FILTERING
\n"; push @$body, "$suspicious reads have a suspicious overlap pattern.
\n"; push @$body, "$filtered1 had one high error rate edge filtered; $filtered2 had both.
\n"; push @$body, "$lopsided1 had one size incompatible edge filtered; $lopsided2 had both.
\n"; push @$body, "
\n"; push @$body, "FINAL EDGES
\n"; push @$body, "$finalContained reads are contained.
\n"; push @$body, "$finalSingleton reads are singleton.
\n"; push @$body, "$finalSpur reads are spur ($finalSpurMutualBest have a mutual best edge).
\n"; push @$body, "$finalBest reads form the backbone ($finalBest0Mutual have no mutual best edges; $finalBest1Mutual have one; $finalBest2Mutual have both).
\n"; } push @$body, "

Initial Tig Sizes

\n"; if (-e "$base/4-unitigger/$asm.003.buildUnitigs.sizes") { open(F, "< $base/4-unitigger/$asm.003.buildUnitigs.sizes"); reportSizeStatistics($css, $body, $scripts); close(F); } push @$body, "

Final Tig Sizes

\n"; if (-e "$base/4-unitigger/$asm.008.generateOutputs.sizes") { open(F, "< $base/4-unitigger/$asm.008.generateOutputs.sizes"); reportSizeStatistics($css, $body, $scripts); close(F); } } sub buildConsensusHTML ($$$$$$) { my $base = shift @_; my $asm = shift @_; my $tag = shift @_; my $css = shift @_; # Array reference my $body = shift @_; # Array reference my $scripts = shift @_; # Array reference push @$body, "

Consensus

\n"; push @$body, "\n"; } sub buildOutputHTML ($$$$$$) { my $base = shift @_; my $asm = shift @_; my $tag = shift @_; my $css = shift @_; # Array reference my $body = shift @_; # Array reference my $scripts = shift @_; # Array reference push @$body, "

Final Outputs

\n"; push @$body, "\n"; } sub buildHTML ($$) { my $asm = shift @_; my $tag = shift @_; my @css; my @body; my @scripts; my $base; $base = "correction" if ($tag eq "cor"); $base = "trimming" if ($tag eq "obt"); $base = "unitigging" if ($tag eq "utg"); make_path("$base.html.files") if (! -e "$base.html.files"); # For correction runs if ($tag eq "cor") { push @body, "

Correction

\n"; buildGatekeeperHTML($base, $asm, $tag, \@css, \@body, \@scripts); buildMerylHTML($base, $asm, $tag, \@css, \@body, \@scripts); buildOverlapperHTML($base, $asm, $tag, \@css, \@body, \@scripts); buildCorrectionHTML($base, $asm, $tag, \@css, \@body, \@scripts); } # For trimming runs if ($tag eq "obt") { push @body, "

Trimming

\n"; buildGatekeeperHTML($base, $asm, $tag, \@css, \@body, \@scripts); buildMerylHTML($base, $asm, $tag, \@css, \@body, \@scripts); buildOverlapperHTML($base, $asm, $tag, \@css, \@body, \@scripts); buildTrimmingHTML($base, $asm, $tag, \@css, \@body, \@scripts); } # For assembly runs if ($tag eq "utg") { push @body, "

Assembly

\n"; buildGatekeeperHTML($base, $asm, $tag, \@css, \@body, \@scripts); buildMerylHTML($base, $asm, $tag, \@css, \@body, \@scripts); buildOverlapperHTML($base, $asm, $tag, \@css, \@body, \@scripts); buildOverlapErrorCorrectionHTML($base, $asm, $tag, \@css, \@body, \@scripts); buildUnitiggerHTML($base, $asm, $tag, \@css, \@body, \@scripts); buildConsensusHTML($base, $asm, $tag, \@css, \@body, \@scripts); buildOutputHTML($base, $asm, $tag, \@css, \@body, \@scripts); } #print STDERR "WRITING '$base/$asm-summary.html'\n"; open(F, "> $base.html") or die "can't open '$base.html' for writing: $!\n"; print F "\n"; print F "\n"; print F "\n"; print F "\n"; print F "\n"; print F "canu analysis for assembly '$asm' in directory '$base'\n"; print F "\n"; print F "\n"; print F "\n"; print F "\n"; print F "\n"; print F @body; print F "\n"; print F "\n"; print F "\n"; print F "\n"; print F "\n"; print F "\n"; close(F); } canu-1.6/src/pipelines/canu/Meryl.pm000066400000000000000000000625461314437614700174700ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # This file is derived from: # # src/pipelines/ca3g/Meryl.pm # # Modifications by: # # Brian P. Walenz from 2015-FEB-27 to 2015-AUG-25 # are Copyright 2015 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. 
Walenz beginning on 2015-NOV-03 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2015-NOV-19 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## package canu::Meryl; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(merylConfigure merylCheck merylProcess); use strict; use File::Path 2.08 qw(make_path remove_tree); use File::Basename; use POSIX qw(ceil); use canu::Defaults; use canu::Execution; use canu::Gatekeeper; use canu::ErrorEstimate; use canu::HTML; use canu::Report; use canu::Grid_Cloud; sub merylGenerateHistogram ($$) { my $asm = shift @_; my $tag = shift @_; my $hist; # We don't know $ofile from where merylGenerateHistogram is typically called (Report.pm) # and so we're forced to configure every time. my ($base, $path, $merSize, $merThresh, $merScale, $merDistinct, $merTotal, $ffile, $ofile) = merylParameters($asm, $tag); return(undef) if (! -e "$path/$ofile.histogram"); return(undef) if (! 
-e "$path/$ofile.histogram.info"); # Load the statistics my $numTotal = 0; my $numDistinct = 0; my $numUnique = 0; my $largest = 0; open(F, "< $path/$ofile.histogram.info") or caFailure("can't open meryl histogram information file '$path/$ofile.histogram.info' for reading: $!\n", undef); while () { $numTotal = $1 if (m/Found\s(\d+)\s+mers./); $numDistinct = $1 if (m/Found\s(\d+)\s+distinct\smers./); $numUnique = $1 if (m/Found\s(\d+)\s+unique\smers./); $largest = $1 if (m/Largest\smercount\sis\s(\d+)/); } close(F); # Load histogram data my @tc; # Total count my @fu; # Fraction unique my @ft; # Fraction total my $mc; open(F, "< $path/$ofile.histogram"); while () { my @v = split '\s+', $_; $tc[$v[0]] = $v[1]; $fu[$v[0]] = $v[2]; $ft[$v[0]] = $v[3]; $mc = $v[0]; # histogram should be sorted } close(F); # Prune the high-count kmers # # In blocks of 40, extend the histogram until the average of the next block is nearly the same # as the average of this block. if (0) { my $lo = 2; my $hi = 3; my $st = 1; my $aveLast = 0; my $aveThis = 0; for (my $ii=$lo; $ii<$hi; $ii++) { $aveThis += $tc[$ii]; } $aveThis /= ($hi - $lo); $aveLast = 0; print STDERR "aveLast $aveLast aveThis $aveThis $lo $hi INITIAL\n"; while (($hi < $mc) && ($aveThis > 2) && (($aveThis < 0.90 * $aveLast) || ($aveLast < 0.90 * $aveThis))) { $lo += $st; $hi += $st; $st += 1; $aveLast = $aveThis; $aveThis = 0; for (my $ii=$lo; $ii<$hi; $ii++) { $aveThis += $tc[$ii]; } $aveThis /= ($hi - $lo); print STDERR "aveLast $aveLast aveThis $aveThis $lo $hi\n"; } print STDERR "aveLast $aveLast aveThis $aveThis $lo $hi FINAL\n"; } my @TC; my @FU; my @FT; my $TCmax = 0; my $lo = 1; my $hi = 2; my $st = 1; for (my $ii=0; $ii <= 40; $ii++) { for (my $jj=$lo; $jj < $hi; $jj++) { $TC[$ii] += $tc[$jj]; # Sum the counts $FU[$ii] = ($fu[$ii] < $FU[$ii]) ? $FU[$ii] : $fu[$jj]; # But the fractions are already cumulative, $FT[$ii] = ($ft[$ii] < $FT[$ii]) ? $FT[$ii] : $ft[$jj]; # we just need to skip zeros. 
} if ($ii > 0) { $TCmax = ($TCmax < $TC[$ii]) ? $TC[$ii] : $TCmax; } $lo = $hi; $hi += $st; $st += 1; } my $maxY = $lo; my $Xscale = $TCmax / 70; # Now just draw the histogram $hist .= "--\n"; $hist .= "-- $merSize-mers Fraction\n"; $hist .= "-- Occurrences NumMers Unique Total\n"; $lo = 1; $hi = 2; $st = 1; for (my $ii=0; $ii<=40; $ii++) { my $numXs = int($TC[$ii] / $Xscale); if ($numXs <= 70) { $hist .= sprintf("-- %6d-%6d %9d %s%s %.4f %.4f\n", $lo, $hi-1, $TC[$ii], "*" x ($numXs), " " x (70 - $numXs), $FU[$ii], $FT[$ii]); } else { $hist .= sprintf("-- %6d-%6d %9d %s%s %.4f %.4f\n", $lo, $hi-1, $TC[$ii], "*" x 67, "-->", $FU[$ii], $FT[$ii]); } last if ($hi >= $maxY); $lo = $hi; $hi += $st; $st += 1; } $hist .= sprintf("--\n"); $hist .= sprintf("-- %11d (max occurrences)\n", $largest); $hist .= sprintf("-- %11d (total mers, non-unique)\n", $numTotal - $numUnique); $hist .= sprintf("-- %11d (distinct mers, non-unique)\n", $numDistinct - $numUnique); $hist .= sprintf("-- %11d (unique mers)\n", $numUnique); return($hist); } # Threshold: Three methods to pick it. # Threshold - 'auto', 'auto * X', 'auto / X', or an integer value # Distinct - by the fraction distinct retained # Total - by the fraction total retained sub merylPlotHistogram ($$$$) { my $path = shift @_; my $ofile = shift @_; my $suffix = shift @_; my $size = shift @_; # Size of image, not merSize! 
return if (fileExists("$path/$ofile.histogram.$suffix.gp")); my $gnuplot = getGlobal("gnuplot"); my $format = getGlobal("gnuplotImageFormat"); fetchFile("$path/$ofile.histogram"); open(F, "> $path/$ofile.histogram.$suffix.gp"); print F "\n"; print F "unset multiplot\n"; print F "\n"; print F "set terminal $format size $size,$size\n"; print F "set output '$ofile.histogram.$suffix.$format'\n"; print F "\n"; print F "set multiplot\n"; print F "\n"; print F "# Distinct-vs-total full size plot\n"; print F "\n"; print F "set origin 0.0,0.0\n"; print F "set size 1.0,1.0\n"; print F "\n"; print F "set xrange [0.5:1.0]\n"; print F "set yrange [0.0:1.0]\n"; print F "\n"; print F "unset ytics\n"; print F "set y2tics 0.1\n"; #print F "set y2tics add ('0.6765' 0.6765)\n"; print F "\n"; print F "plot [0.5:1.0] '$ofile.histogram' using 3:4 with lines title 'Distinct-vs-Total'\n"; print F "\n"; print F "# Distinct-vs-total zoom in lower left corner\n"; print F "\n"; print F "set origin 0.05,0.10\n"; print F "set size 0.40,0.40\n"; print F "\n"; print F "set xrange [0.975:1.0]\n"; print F "set yrange [0.4:0.80]\n"; print F "\n"; print F "unset ytics\n"; # ytics on the left of the plot print F "set y2tics 0.1\n"; # y2tics on the right of the plot #print F "set y2tics add ('0.6765' 0.6765)\n"; print F "\n"; print F "plot [0.975:1.0] '$ofile.histogram' using 3:4 with lines title 'Distinct-vs-Total'\n"; print F "\n"; print F "# Histogram in upper left corner\n"; print F "\n"; print F "set origin 0.05,0.55\n"; print F "set size 0.40,0.40\n"; print F "\n"; print F "set xrange [0:200]\n"; print F "set yrange [0:30000000]\n"; print F "\n"; print F "unset ytics\n"; # ytics on the left of the plot print F "set y2tics 10e6\n"; # y2tics on the right of the plot print F "unset mytics\n"; print F "\n"; print F "plot [0:200] '$ofile.histogram' using 1:2 with lines title 'Histogram'\n"; close(F); if (runCommandSilently($path, "$gnuplot ./$ofile.histogram.$suffix.gp > /dev/null 2>&1", 0)) { print 
STDERR "--\n"; print STDERR "-- WARNING: gnuplot failed; no plots will appear in HTML output.\n"; print STDERR "--\n"; print STDERR "----------------------------------------\n"; } stashFile("$path/$ofile.histogram.$suffix.gp"); stashFile("$path/$ofile.histogram.$suffix.$format"); } sub merylParameters ($$) { my $asm = shift @_; my $tag = shift @_; my ($base, $path, $merSize, $merThresh, $merScale, $merDistinct, $merTotal, $ffile, $ofile); # Find a place to run stuff. $base = "correction" if ($tag eq "cor"); $base = "trimming" if ($tag eq "obt"); $base = "unitigging" if ($tag eq "utg"); $path = "$base/0-mercounts"; # Decide on which set of parameters we need to be using, and make output file names. if (getGlobal("${tag}Overlapper") eq "ovl") { $merSize = getGlobal("${tag}OvlMerSize"); $merThresh = getGlobal("${tag}OvlMerThreshold"); $merScale = 1.0; $merDistinct = getGlobal("${tag}OvlMerDistinct"); $merTotal = getGlobal("${tag}OvlMerTotal"); $ffile = "$asm.ms$merSize.frequentMers.fasta"; # The fasta file we should be creating (ends in FASTA). $ofile = "$asm.ms$merSize"; # The meryl database 'intermediate file'. } elsif (getGlobal("${tag}Overlapper") eq "mhap") { $merSize = getGlobal("${tag}mhapMerSize"); $merThresh = undef; $merScale = 1.0; $merDistinct = undef; $merTotal = undef; $ffile = "$asm.ms$merSize.frequentMers.ignore.gz"; # The mhap-specific file we should be creating (ends in IGNORE). $ofile = "$asm.ms$merSize"; # The meryl database 'intermediate file'. } elsif (getGlobal("${tag}Overlapper") eq "minimap") { $merSize = 0; $merThresh = 0; $merScale = 1.0; $merDistinct = undef; $merTotal = undef; $ffile = undef; $ofile = undef; } else { caFailure("unknown ${tag}Overlapper '" . getGlobal("${tag}Overlapper") . "'", undef); } # Decode the threshold. Auto with modifications ("auto * X") or ("auto / X")? 
if ($merThresh =~ m/auto\s*\*\s*(\S+)/) { $merThresh = "auto"; $merScale = $1; } if ($merThresh =~ m/auto\s*\/\s*(\S+)/) { $merThresh = "auto"; $merScale = 1.0 / $1; } # Return all this goodness. return($base, $path, $merSize, $merThresh, $merScale, $merDistinct, $merTotal, $ffile, $ofile); } sub merylConfigure ($$) { my $asm = shift @_; my $tag = shift @_; my $bin = getBinDirectory(); my $cmd; my ($base, $path, $merSize, $merThresh, $merScale, $merDistinct, $merTotal, $ffile, $ofile) = merylParameters($asm, $tag); goto allDone if (skipStage($asm, "$tag-merylConfigure") == 1); goto allDone if (fileExists("$path/meryl.sh")); goto allDone if (!defined($ffile)); goto allDone if (fileExists("$path/$ffile")); goto allDone if (fileExists("$path/$ofile.mcidx") && fileExists("$path/$ofile.mcdat")); make_path($path) if (! -d $path); # User supplied mers? Copy them to the proper location and exit. my $sfile = getGlobal("${tag}OvlFrequentMers"); if (defined($sfile) && ! -e "$path/$ffile") { caFailure("${tag}OvlFrequentMers '$sfile' not found", undef) if (! -e $sfile); copy($sfile, "$path/$ffile"); stashFile("$path/$ffile"); goto allDone; } # No filtering? Make an empty file and exit. if ((defined($merThresh)) && ($merThresh ne "auto") && ($merThresh == 0) && (!defined($merDistinct)) && (!defined($merTotal))) { touch("$path/$ffile"); stashFile("$path/$ffile"); goto allDone; } # Nope, build a script for computing kmer counts. my $mem = int(getGlobal("merylMemory") * 1024 * 0.8); # Because meryl expects megabytes, not gigabytes. my $thr = getGlobal("merylThreads"); my $cov = getExpectedCoverage($base, $asm); caExit("merylMemory isn't defined?", undef) if (!defined($mem)); caExit("merylThreads isn't defined?", undef) if (!defined($thr)); open(F, "> $path/meryl.sh") or caExit("can't open '$path/meryl.sh' for writing: $!", undef); print F "#!" . getGlobal("shell") .
"\n"; print F "\n"; print F getBinDirectoryShellCode(); print F "\n"; print F setWorkDirectoryShellCode($path); print F fetchStoreShellCode("$base/$asm.gkpStore", $path); print F "\n"; print F "# Purge any previous intermediate result. Possibly not needed, but safer.\n"; print F "\n"; print F "rm -f ./$ofile.WORKING*\n"; print F "\n"; print F "\$bin/meryl \\\n"; print F " -B -C -L 2 -v -m $merSize -threads $thr -memory $mem \\\n"; print F " -s ../$asm.gkpStore \\\n"; print F " -o ./$ofile.WORKING \\\n"; print F "&& \\\n"; print F "mv ./$ofile.WORKING.mcdat ./$ofile.mcdat \\\n"; print F "&& \\\n"; print F "mv ./$ofile.WORKING.mcidx ./$ofile.mcidx\n"; print F "\n"; print F stashFileShellCode("$path", "$ofile.mcdat", ""); print F "\n"; print F stashFileShellCode("$path", "$ofile.mcidx", ""); print F "\n"; print F "\n"; print F "# Dump a histogram\n"; print F "\n"; print F "\$bin/meryl \\\n"; print F " -Dh -s ./$ofile \\\n"; print F "> ./$ofile.histogram.WORKING \\\n"; print F "2> ./$ofile.histogram.info \\\n"; print F "&& \\\n"; print F "mv -f ./$ofile.histogram.WORKING ./$ofile.histogram\n"; print F "\n"; print F stashFileShellCode("$path", "$ofile.histogram", ""); print F "\n"; print F stashFileShellCode("$path", "$ofile.histogram.info", ""); print F "\n"; print F "\n"; print F "# Compute a nice kmer threshold.\n"; print F "\n"; print F "\$bin/estimate-mer-threshold \\\n"; print F " -h ./$ofile.histogram \\\n"; print F " -c $cov \\\n"; print F "> ./$ofile.estMerThresh.out.WORKING \\\n"; print F "2> ./$ofile.estMerThresh.err \\\n"; print F "&& \\\n"; print F "mv ./$ofile.estMerThresh.out.WORKING ./$ofile.estMerThresh.out\n"; print F "\n"; print F stashFileShellCode("$path", "$ofile.estMerThresh.out", ""); print F "\n"; print F stashFileShellCode("$path", "$ofile.estMerThresh.err", ""); print F "\n"; print F "\n"; print F "exit 0\n"; close(F); makeExecutable("$path/meryl.sh"); stashFile("$path/meryl.sh"); finishStage: emitStage($asm, "merylConfigure"); buildHTML($asm, 
$tag); allDone: } sub merylCheck ($$) { my $asm = shift @_; my $tag = shift @_; my $attempt = getGlobal("canuIteration"); my $bin = getBinDirectory(); my $cmd; my ($base, $path, $merSize, $merThresh, $merScale, $merDistinct, $merTotal, $ffile, $ofile) = merylParameters($asm, $tag); # If the frequent mer file exists, don't bother running meryl. We don't really need the # databases. goto allDone if (skipStage($asm, "$tag-meryl") == 1); goto allDone if (fileExists("$path/meryl.success")); goto finishStage if (!defined($ffile)); goto finishStage if (fileExists("$path/$ffile")); goto finishStage if (fileExists("$path/$ofile.mcidx") && fileExists("$path/$ofile.mcdat")); fetchFile("$path/meryl.sh"); # Since there is only one job, if we get here, we're not done. Any other 'check' function # shows how to process multiple jobs. This only checks for the existence of the final outputs. # (unitigger is the same) # If too many attempts, give up. if ($attempt >= getGlobal("canuIterationMax")) { print STDERR "--\n"; print STDERR "-- Meryl failed, tried $attempt times, giving up.\n"; print STDERR "--\n"; caExit(undef, undef); } if ($attempt > 0) { print STDERR "--\n"; print STDERR "-- Meryl failed, retry.\n"; print STDERR "--\n"; } # Otherwise, run some jobs. emitStage($asm, "merylCheck", $attempt); buildHTML($asm, $tag); submitOrRunParallelJob($asm, "meryl", $path, "meryl", (1)); return; finishStage: print STDERR "-- Meryl finished successfully.\n"; make_path($path); # With object storage, we might not have this directory! 
open(F, "> $path/meryl.success") or caExit("can't open '$path/meryl.success' for writing: $!", undef); close(F); stashFile("$path/meryl.success"); emitStage($asm, "merylCheck"); buildHTML($asm, $tag); allDone: } sub merylProcess ($$) { my $asm = shift @_; my $tag = shift @_; my $bin = getBinDirectory(); my $cmd; my ($base, $path, $merSize, $merThresh, $merScale, $merDistinct, $merTotal, $ffile, $ofile) = merylParameters($asm, $tag); # ffile exists if we've already output it here, or if user supplied a file, or if user wants no masking. goto allDone if (skipStage($asm, "$tag-meryl") == 1); goto allDone if (fileExists("$path/$ffile")); # Compute a threshold, if needed. if ($merThresh eq "auto") { fetchFile("$path/$ofile.estMerThresh.out"); open(F, "< $path/$ofile.estMerThresh.out") or caFailure("failed to read estimated mer threshold from '$path/$ofile.estMerThresh.out'", undef); $merThresh = ; $merThresh = int($merThresh * $merScale) + 1; close(F); } # Compute a threshold based on the fraction distinct or total. if (defined($merDistinct) || defined($merTotal)) { fetchFile("$path/$ofile.histogram"); open(F, "< $path/$ofile.histogram") or caFailure("failed to read mer histogram from '$path/$ofile.histogram'", undef); while () { my ($threshold, $num, $distinct, $total) = split '\s+', $_; if (($merThresh > 0) && ($merThresh < $threshold)) { print STDERR "-- Supplied merThreshold $merThresh is the smallest.\n"; last; } if ((defined($merDistinct)) && ($merDistinct <= $distinct)) { $merThresh = (($merThresh > 0) && ($merThresh < $threshold)) ? $merThresh : $threshold; print STDERR "-- Supplied merDistinct $merDistinct with threshold $threshold is the smallest.\n"; last; } if ((defined($merTotal)) && ($merTotal <= $total)) { $merThresh = (($merThresh > 0) && ($merThresh < $threshold)) ? 
$merThresh : $threshold; print STDERR "-- Supplied merTotal $merTotal with threshold $threshold is the smallest.\n"; last; } } close(F); } # Plot the histogram - annotated with the thresholds merylPlotHistogram($path, $ofile, "lg", 1024); # $ofile has merSize encoded in it merylPlotHistogram($path, $ofile, "sm", 256); # Display the histogram, and save to the report. Shouldn't this (and the plots above) # go in finishStage? addToReport("${tag}Meryl", merylGenerateHistogram($asm, $tag)); # Generate the frequent mers for overlapper if (getGlobal("${tag}Overlapper") eq "ovl") { fetchFile("$path/$ofile.mcdat"); fetchFile("$path/$ofile.mcidx"); if ((! -e "$path/$ofile.mcdat") || (! -e "$path/$ofile.mcidx")) { caFailure("meryl can't dump frequent mers, databases don't exist. Remove $path/meryl.success to try again.", undef); } if (runCommand($path, "$bin/meryl -Dt -n $merThresh -s ./$ofile > ./$ffile 2> ./$ffile.err")) { unlink "$path/$ffile"; caFailure("meryl failed to dump frequent mers", "$path/$ffile.err"); } unlink "$path/$ffile.err"; stashFile("$path/$ffile"); } # Generate the frequent mers for mhap # # mer value numInstances totalKmers # TTTTGTTTTTTTTTTT 0.0000044602 589 132055862 # # The fraction is just $3/$4. I assume this is used with "--filter-threshold 0.000005". if (getGlobal("${tag}Overlapper") eq "mhap") { my $totalMers = 0; my $maxCount = 0; fetchFile("$path/$ofile.histogram"); fetchFile("$path/$ofile.histogram.info"); # Meryl reports number of distinct canonical mers, we multiply by two to get the # (approximate) number of distinct mers. Palindromes are counted twice, oh well.
open(F, "< $path/$ofile.histogram.info") or die "Failed to open '$path/$ofile.histogram.info' for reading: $!\n"; while () { if (m/Found\s+(\d+)\s+mers./) { $totalMers = 2 * $1; } if (m/Largest\s+mercount\s+is\s+(\d+)./) { $maxCount = $1; } } close(F); caFailure("didn't find any mers?", "$path/$ofile.histogram.info") if ($totalMers == 0); my $filterThreshold = getGlobal("${tag}MhapFilterThreshold"); my $misRate = 0.1; my $minCount = int($filterThreshold * $totalMers); my $totalToOutput = 0; my $totalFiltered = 0; if (defined(getGlobal("${tag}MhapFilterUnique"))) { $minCount = uniqueKmerThreshold($base, $asm, $merSize, $misRate) + 1; } open(F, "< $path/$ofile.histogram") or die "Failed to open '$path/$ofile.histogram' for reading: $!\n"; while () { my ($kCount, $occurences, $cumsum, $faction) = split '\s+', $_; if ($kCount < $minCount) { $totalFiltered = $cumsum * 100; } if ($kCount >= $minCount) { $totalToOutput += $occurences; } } close(F); $totalToOutput *= 2; # for the reverse complement fetchFile("$path/$ofile.mcdat"); fetchFile("$path/$ofile.mcidx"); open(F, "$bin/meryl -Dt -n $minCount -s $path/$ofile | ") or die "Failed to run meryl to generate frequent mers $!\n"; open(O, "| gzip -c > $path/$ofile.frequentMers.ignore.gz") or die "Failed to open '$path/$ofile.frequentMers.ignore.gz' for writing: $!\n"; printf(O "%d\n", $totalToOutput); while (!eof(F)) { my $h = ; my $m = ; chomp $m; my $r = reverse $m; $r =~ tr/ACGTacgt/TGCAtgca/; if ($h =~ m/^>(\d+)/) { printf(O "%s\t%e\n", $m, $1 / $totalMers); printf(O "%s\t%e\n", $r, $1 / $totalMers); } } close(O); close(F); stashFile("$path/$ffile"); if (defined(getGlobal("${tag}MhapFilterUnique"))) { printf STDERR "-- For %s overlapping, filtering low-occurence k-mers < %d (%.2f\%) based on estimated error of %.2f\%.\n", getGlobal("${tag}Overlapper"), $minCount, $totalFiltered, 100*estimateRawError($base, $asm, $tag, $merSize); } printf STDERR "-- For %s overlapping, set repeat k-mer threshold to %d.\n", 
getGlobal("${tag}Overlapper"), int($filterThreshold * $totalMers); } # Report the new threshold. if ((getGlobal("${tag}Overlapper") eq "ovl") && ($merThresh > 0) && (getGlobal("${tag}OvlMerThreshold") ne $merThresh)) { print STDERR "-- Reset ${tag}OvlMerThreshold from ", getGlobal("${tag}OvlMerThreshold"), " to $merThresh.\n"; setGlobal("${tag}OvlMerThreshold", $merThresh); } finishStage: fetchFile("$path/$ofile.histogram.info"); fetchFile("$path/$ffile"); if (-e "$path/$ofile.histogram.info") { my $numTotal = 0; my $numDistinct = 0; my $numUnique = 0; my $largest = 0; open(F, "< $path/$ofile.histogram.info") or caFailure("can't open meryl histogram information file '$path/$ofile.histogram.info' for reading: $!\n", undef); while () { $numTotal = $1 if (m/Found\s(\d+)\s+mers./); $numDistinct = $1 if (m/Found\s(\d+)\s+distinct\smers./); $numUnique = $1 if (m/Found\s(\d+)\s+unique\smers./); $largest = $1 if (m/Largest\smercount\sis\s(\d+)/); } close(F); print STDERR "--\n"; print STDERR "-- Found $numTotal $merSize-mers; $numDistinct distinct and $numUnique unique. Largest count $largest.\n"; } elsif (-z "$path/$ffile") { print STDERR "--\n"; print STDERR "-- Threshold zero. No mers will be masked.\n"; } else { print STDERR "--\n"; print STDERR "-- Using frequent mers in '", getGlobal("${tag}OvlFrequentMers"), "'\n"; } unlink "$path/$ofile.mcidx" if (getGlobal("saveMerCounts") == 0); unlink "$path/$ofile.mcdat" if (getGlobal("saveMerCounts") == 0); emitStage($asm, "$tag-meryl"); buildHTML($asm, $tag); allDone: stopAfter("meryl"); } canu-1.6/src/pipelines/canu/Output.pm000066400000000000000000000162431314437614700176710ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. 
# # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # This file is derived from: # # src/pipelines/ca3g/Output.pm # # Modifications by: # # Brian P. Walenz from 2015-MAR-16 to 2015-AUG-25 # are Copyright 2015 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. Walenz beginning on 2015-NOV-02 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2015-DEC-02 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## package canu::Output; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(generateOutputs); use strict; use File::Copy; use canu::Defaults; use canu::Execution; use canu::HTML; use canu::Grid_Cloud; sub generateOutputs ($) { my $asm = shift @_; my $bin = getBinDirectory(); my $cmd; my $type = "fasta"; # Should probably be an option. goto allDone if (skipStage($asm, "generateOutputs") == 1); # Layouts if (! fileExists("$asm.contigs.layout")) { fetchStore("unitigging/$asm.gkpStore"); fetchFile("unitigging/$asm.ctgStore/seqDB.v002.dat"); fetchFile("unitigging/$asm.ctgStore/seqDB.v002.tig"); $cmd = "$bin/tgStoreDump \\\n"; $cmd .= " -G ./unitigging/$asm.gkpStore \\\n"; $cmd .= " -T ./unitigging/$asm.ctgStore 2 \\\n"; $cmd .= " -o ./$asm.contigs \\\n"; $cmd .= " -layout \\\n"; $cmd .= "> ./$asm.contigs.layout.err 2>&1"; if (runCommand(".", $cmd)) { caExit("failed to output contig layouts", "$asm.contigs.layout.err"); } unlink "$asm.contigs.layout.err"; stashFile("$asm.contigs.layout"); } if (! 
fileExists("$asm.unitigs.layout")) { fetchStore("unitigging/$asm.gkpStore"); fetchFile("unitigging/$asm.utgStore/seqDB.v001.dat"); # Why is this needed? fetchFile("unitigging/$asm.utgStore/seqDB.v001.tig"); fetchFile("unitigging/$asm.utgStore/seqDB.v002.dat"); fetchFile("unitigging/$asm.utgStore/seqDB.v002.tig"); $cmd = "$bin/tgStoreDump \\\n"; $cmd .= " -G ./unitigging/$asm.gkpStore \\\n"; $cmd .= " -T ./unitigging/$asm.utgStore 2 \\\n"; $cmd .= " -o ./$asm.unitigs \\\n"; $cmd .= " -layout \\\n"; $cmd .= "> ./$asm.unitigs.layout.err 2>&1"; if (runCommand(".", $cmd)) { caExit("failed to output unitig layouts", "$asm.unitigs.layout.err"); } unlink "$asm.unitigs.layout.err"; stashFile("$asm.unitigs.layout"); } # Sequences foreach my $tt ("unassembled", "contigs") { if (! fileExists("$asm.$tt.$type")) { fetchStore("unitigging/$asm.gkpStore"); fetchFile("unitigging/$asm.ctgStore/seqDB.v002.dat"); fetchFile("unitigging/$asm.ctgStore/seqDB.v002.tig"); $cmd = "$bin/tgStoreDump \\\n"; $cmd .= " -G ./unitigging/$asm.gkpStore \\\n"; $cmd .= " -T ./unitigging/$asm.ctgStore 2 \\\n"; $cmd .= " -consensus -$type \\\n"; $cmd .= " -$tt \\\n"; $cmd .= "> ./$asm.$tt.$type\n"; $cmd .= "2> ./$asm.$tt.err"; if (runCommand(".", $cmd)) { caExit("failed to output $tt consensus sequences", "$asm.$tt.err"); } unlink "$asm.$tt.err"; stashFile("$asm.$tt.$type"); } } if (! 
fileExists("$asm.unitigs.$type")) { fetchStore("unitigging/$asm.gkpStore"); fetchFile("unitigging/$asm.utgStore/seqDB.v002.dat"); fetchFile("unitigging/$asm.utgStore/seqDB.v002.tig"); $cmd = "$bin/tgStoreDump \\\n"; $cmd .= " -G ./unitigging/$asm.gkpStore \\\n"; $cmd .= " -T ./unitigging/$asm.utgStore 2 \\\n"; $cmd .= " -consensus -$type \\\n"; $cmd .= " -contigs \\\n"; $cmd .= "> ./$asm.unitigs.$type\n"; $cmd .= "2> ./$asm.unitigs.err"; if (runCommand(".", $cmd)) { caExit("failed to output unitig consensus sequences", "$asm.unitigs.err"); } unlink "$asm.unitigs.err"; stashFile("$asm.unitigs.$type"); } # Graphs if ((!fileExists("$asm.contigs.gfa")) && ( fileExists("unitigging/4-unitigger/$asm.contigs.aligned.gfa"))) { fetchFile("unitigging/4-unitigger/$asm.contigs.aligned.gfa"); copy("unitigging/4-unitigger/$asm.contigs.aligned.gfa", "$asm.contigs.gfa"); stashFile("$asm.contigs.gfa"); } if ((! fileExists("$asm.unitigs.gfa")) && ( fileExists("unitigging/4-unitigger/$asm.unitigs.aligned.gfa"))) { fetchFile("unitigging/4-unitigger/$asm.unitigs.aligned.gfa"); copy("unitigging/4-unitigger/$asm.unitigs.aligned.gfa", "$asm.unitigs.gfa"); stashFile("$asm.unitigs.gfa"); } if ((! fileExists("$asm.unitigs.bed")) && ( fileExists("unitigging/4-unitigger/$asm.unitigs.aligned.bed"))) { fetchFile("unitigging/4-unitigger/$asm.unitigs.aligned.bed"); copy("unitigging/4-unitigger/$asm.unitigs.aligned.bed", "$asm.unitigs.bed"); stashFile("$asm.unitigs.bed"); } # User-supplied termination command. if (defined(getGlobal("onSuccess"))) { print STDERR "-- Running user-supplied termination command.\n"; runCommand(getGlobal("onExitDir"), getGlobal("onSuccess") . 
" $asm"); } finishStage: emitStage($asm, "generateOutputs"); buildHTML($asm, "utg"); allDone: print STDERR "--\n"; print STDERR "-- Assembly '", getGlobal("onExitNam"), "' finished in '", getGlobal("onExitDir"), "'.\n"; print STDERR "--\n"; print STDERR "-- Summary saved in 'unitigging.html'.\n"; print STDERR "--\n"; print STDERR "-- Sequences saved:\n"; print STDERR "-- Contigs -> '$asm.contigs.$type'\n"; print STDERR "-- Unassembled -> '$asm.unassembled.$type'\n"; print STDERR "-- Unitigs -> '$asm.unitigs.$type'\n"; print STDERR "--\n"; print STDERR "-- Read layouts saved:\n"; print STDERR "-- Contigs -> '$asm.contigs.layout'.\n"; print STDERR "-- Unitigs -> '$asm.unitigs.layout'.\n"; print STDERR "--\n"; print STDERR "-- Graphs saved:\n"; print STDERR "-- Contigs -> '$asm.contigs.gfa'.\n"; print STDERR "-- Unitigs -> '$asm.unitigs.gfa'.\n"; print STDERR "--\n"; print STDERR "-- Bye.\n"; finishStage: emitStage($asm, "outputSequence"); buildHTML($asm, "utg"); allDone: } canu-1.6/src/pipelines/canu/OverlapBasedTrimming.pm000066400000000000000000000175571314437614700224600ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # This file is derived from: # # src/pipelines/ca3g/OverlapBasedTrimming.pm # # Modifications by: # # Brian P. Walenz from 2015-MAR-16 to 2015-AUG-25 # are Copyright 2015 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. 
Walenz beginning on 2015-NOV-04 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2017-MAR-03 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## package canu::OverlapBasedTrimming; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(qualTrimReads dedupeReads trimReads splitReads dumpReads); use strict; use File::Path 2.08 qw(make_path remove_tree); use canu::Defaults; use canu::Execution; use canu::Gatekeeper; use canu::Report; use canu::HTML; use canu::Grid_Cloud; sub trimReads ($) { my $asm = shift @_; my $bin = getBinDirectory(); my $cmd; my $path = "trimming/3-overlapbasedtrimming"; goto allDone if (skipStage($asm, "obt-trimReads") == 1); goto allDone if (fileExists("trimming/3-overlapbasedtrimming/$asm.1.trimReads.clear")); make_path($path) if (! -d $path); fetchStore("./trimming/$asm.ovlStore"); # Previously, we'd pick the error rate used by unitigger. Now, we don't know unitigger here, # and require an obt specific error rate. $cmd = "$bin/trimReads \\\n"; $cmd .= " -G ../$asm.gkpStore \\\n"; $cmd .= " -O ../$asm.ovlStore \\\n"; $cmd .= " -Co ./$asm.1.trimReads.clear \\\n"; $cmd .= " -e " . getGlobal("obtErrorRate") . " \\\n"; $cmd .= " -minlength " . getGlobal("minReadLength") . " \\\n"; #$cmd .= " -Cm ./$asm.max.clear \\\n" if (-e "./$asm.max.clear"); $cmd .= " -ol " . getGlobal("trimReadsOverlap") . " \\\n"; $cmd .= " -oc " . getGlobal("trimReadsCoverage") . " \\\n"; $cmd .= " -o ./$asm.1.trimReads \\\n"; $cmd .= "> ./$asm.1.trimReads.err 2>&1"; if (runCommand($path, $cmd)) { caFailure("trimReads failed", "$path/$asm.1.trimReads.err"); } caFailure("trimReads finished, but no '$asm.1.trimReads.clear' output found", undef) if (! 
-e "$path/$asm.1.trimReads.clear"); unlink("$path/$asm.1.trimReads.err"); stashFile("./trimming/3-overlapbasedtrimming/$asm.1.trimReads.clear"); my $report; #FORMAT open(F, "< trimming/3-overlapbasedtrimming/$asm.1.trimReads.stats") or caExit("can't open 'trimming/3-overlapbasedtrimming/$asm.1.trimReads.stats' for reading: $!", undef); while () { $report .= "-- $_"; } close(F); addToReport("trimming", $report); if (0) { $cmd = "$bin/gatekeeperDumpFASTQ \\\n"; $cmd .= " -G ../$asm.gkpStore \\\n"; $cmd .= " -c ./$asm.1.trimReads.clear \\\n"; $cmd .= " -o ./$asm.1.trimReads.trimmed \\\n"; $cmd .= "> ./$asm.1.trimReads.trimmed.err 2>&1"; if (runCommand($path, $cmd)) { caFailure("dumping trimmed reads failed", "$path/$asm.1.trimReads.trimmed.err"); } } finishStage: emitStage($asm, "obt-trimReads"); buildHTML($asm, "obt"); allDone: } sub splitReads ($) { my $asm = shift @_; my $bin = getBinDirectory(); my $cmd; my $path = "trimming/3-overlapbasedtrimming"; goto allDone if (skipStage($asm, "obt-splitReads") == 1); goto allDone if (fileExists("trimming/3-overlapbasedtrimming/$asm.2.splitReads.clear")); make_path($path) if (! -d $path); fetchStore("./trimming/$asm.ovlStore"); fetchFile("./trimming/3-overlapbasedtrimming/$asm.1.trimReads.clear"); my $erate = getGlobal("obtErrorRate"); # Was this historically #$cmd .= " -mininniepair 0 -minoverhanging 0 \\\n" if (getGlobal("doChimeraDetection") eq "aggressive"); $cmd = "$bin/splitReads \\\n"; $cmd .= " -G ../$asm.gkpStore \\\n"; $cmd .= " -O ../$asm.ovlStore \\\n"; $cmd .= " -Ci ./$asm.1.trimReads.clear \\\n" if (-e "trimming/3-overlapbasedtrimming/$asm.1.trimReads.clear"); #$cmd .= " -Cm ./$asm.max.clear \\\n" if (-e "trimming/3-overlapbasedtrimming/$asm.max.clear"); $cmd .= " -Co ./$asm.2.splitReads.clear \\\n"; $cmd .= " -e $erate \\\n"; $cmd .= " -minlength " . getGlobal("minReadLength") . 
" \\\n"; $cmd .= " -o ./$asm.2.splitReads \\\n"; $cmd .= "> ./$asm.2.splitReads.err 2>&1"; if (runCommand($path, $cmd)) { caFailure("splitReads failed", "$path/$asm.2.splitReads.err"); } caFailure("splitReads finished, but no '$asm.2.splitReads.clear' output found", undef) if (! -e "$path/$asm.2.splitReads.clear"); unlink("$path/$asm.2.splitReads.err"); stashFile("./trimming/3-overlapbasedtrimming/$asm.2.splitReads.clear"); my $report; #FORMAT open(F, "< trimming/3-overlapbasedtrimming/$asm.2.splitReads.stats") or caExit("can't open 'trimming/3-overlapbasedtrimming/$asm.2.splitReads.stats' for reading: $!", undef); while () { $report .= "-- $_"; } close(F); addToReport("splitting", $report); if (0) { $cmd = "$bin/gatekeeperDumpFASTQ \\\n"; $cmd .= " -G ../$asm.gkpStore \\\n"; $cmd .= " -c ./$asm.2.splitReads.clear \\\n"; $cmd .= " -o ./$asm.2.splitReads.trimmed \\\n"; $cmd .= "> ./$asm.2.splitReads.trimmed.err 2>&1"; if (runCommand($path, $cmd)) { caFailure("dumping trimmed reads failed", "$path/$asm.2.splitReads.trimmed.err"); } } finishStage: emitStage($asm, "obt-splitReads"); buildHTML($asm, "obt"); allDone: } sub dumpReads ($) { my $asm = shift @_; my $bin = getBinDirectory(); my $cmd; my $path = "trimming/3-overlapbasedtrimming"; my $inp; goto allDone if (skipStage($asm, "obt-dumpReads") == 1); goto allDone if (sequenceFileExists("$asm.trimmedReads")); make_path($path) if (! 
-d $path); fetchFile("./trimming/3-overlapbasedtrimming/$asm.1.trimReads.clear"); fetchFile("./trimming/3-overlapbasedtrimming/$asm.2.splitReads.clear"); $inp = "./3-overlapbasedtrimming/$asm.1.trimReads.clear" if (-e "$path/$asm.1.trimReads.clear"); $inp = "./3-overlapbasedtrimming/$asm.2.splitReads.clear" if (-e "$path/$asm.2.splitReads.clear"); caFailure("dumping trimmed reads failed; no 'clear' input", "trimming/$asm.trimmedReads.err") if (!defined($inp)); $cmd = "$bin/gatekeeperDumpFASTQ -fasta -nolibname \\\n"; $cmd .= " -G ./$asm.gkpStore \\\n"; $cmd .= " -c $inp \\\n"; $cmd .= " -o ../$asm.trimmedReads.gz \\\n"; # Adds .fasta $cmd .= "> ../$asm.trimmedReads.err 2>&1"; if (runCommand("trimming", $cmd)) { caFailure("dumping trimmed reads failed", "./$asm.trimmedReads.err"); } unlink("./$asm.trimmedReads.err"); stashFile("./$asm.trimmedReads.fasta.gz"); remove_tree("trimming/$asm.ovlStore") if (getGlobal("saveOverlaps") eq "0"); finishStage: emitStage($asm, "obt-dumpReads"); buildHTML($asm, "obt"); allDone: print STDERR "--\n"; print STDERR "-- Trimmed reads saved in 'trimming/$asm.trimmedReads.fasta.gz'\n"; stopAfter("readTrimming"); } canu-1.6/src/pipelines/canu/OverlapErrorAdjustment.pm000066400000000000000000000532131314437614700230500ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # This file is derived from: # # src/pipelines/ca3g/OverlapErrorAdjustment.pm # # Modifications by: # # Brian P. 
Walenz from 2015-FEB-27 to 2015-SEP-21 # are Copyright 2015 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. Walenz beginning on 2015-NOV-03 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2016-MAR-27 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## package canu::OverlapErrorAdjustment; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(readErrorDetectionConfigure readErrorDetectionCheck overlapErrorAdjustmentConfigure overlapErrorAdjustmentCheck updateOverlapStore); use strict; use File::Path 2.08 qw(make_path remove_tree); use canu::Defaults; use canu::Execution; use canu::Gatekeeper; use canu::Report; use canu::HTML; use canu::Grid_Cloud; # Hardcoded to use utgOvlErrorRate sub readErrorDetectionConfigure ($) { my $asm = shift @_; my $bin = getBinDirectory(); my $path = "unitigging/3-overlapErrorAdjustment"; return if (getGlobal("enableOEA") == 0); goto allDone if (fileExists("$path/red.sh")); # Script exists goto allDone if (fileExists("$path/red.red")); # Result exists goto allDone if (skipStage($asm, "readErrorDetectionConfigure") == 1); goto allDone if (fileExists("unitigging/$asm.ovlStore/evalues")); # Stage entirely finished goto allDone if (-d "unitigging/$asm.ctgStore"); # Assembly finished make_path("$path") if (! -d "$path"); # RED uses 13 bytes/base plus 12 bytes/overlap + space for evidence reads.
my @readLengths; my @numOlaps; #print STDERR "$bin/gatekeeperDumpMetaData -G unitigging/$asm.gkpStore -reads\n"; open(F, "$bin/gatekeeperDumpMetaData -G unitigging/$asm.gkpStore -reads |"); while (<F>) { my @v = split '\s+', $_; $readLengths[$v[0]] = $v[2]; } close(F); # NEEDS OPTIMIZE - only need counts here, not the whole store fetchStore("unitigging/$asm.ovlStore"); #print STDERR "$bin/ovStoreDump -G unitigging/$asm.gkpStore -O unitigging/$asm.ovlStore -d -counts\n"; open(F, "$bin/ovStoreDump -G unitigging/$asm.gkpStore -O unitigging/$asm.ovlStore -d -counts |"); while (<F>) { my @v = split '\s+', $_; $numOlaps[$v[0]] = $v[1]; } close(F); # Make an array of partitions, putting as many reads into each as will fit in the desired memory. my @bgn; my @end; my $nj = 0; #getAllowedResources("", "red"); my $maxID = getNumberOfReadsInStore("unitigging", $asm); my $maxMem = getGlobal("redMemory") * 1024 * 1024 * 1024; my $maxReads = getGlobal("redBatchSize"); my $maxBases = getGlobal("redBatchLength"); print STDERR "\n"; print STDERR "Configure RED for ", getGlobal("redMemory"), "gb memory with batches of at most ", ($maxReads > 0) ? $maxReads : "(unlimited)", " reads and ", ($maxBases > 0) ? $maxBases : "(unlimited)", " bases.\n"; print STDERR "\n"; my $reads = 0; my $bases = 0; my $olaps = 0; my $coverage = getExpectedCoverage("unitigging", $asm); push @bgn, 1; for (my $id = 1; $id <= $maxID; $id++) { $reads += 1; $bases += $readLengths[$id]; $olaps += $numOlaps[$id]; # Guess how much extra memory used for overlapping reads. Small genomes tend to load every read in the store, # large genomes ...
load repeats + 2 * coverage * bases in reads (times 2 for overlaps off of each end) my $memory = (13 * $bases) + (12 * $olaps) + (2 * $bases * $coverage); if ((($maxMem > 0) && ($memory >= $maxMem * 0.75)) || # Allow 25% slop (10% is probably sufficient) (($maxReads > 0) && ($reads >= $maxReads)) || (($maxBases > 0) && ($bases >= $maxBases)) || (($id == $maxID))) { push @end, $id; printf(STDERR "RED job %3u from read %9u to read %9u - %7.3f GB for %7u reads - %7.3f GB for %9u olaps - %7.3f GB for evidence\n", $nj + 1, $bgn[$nj], $end[$nj], $memory / 1024 / 1024 / 1024, $reads, 13 * $bases / 1024 / 1024 / 1024, $bases, 12 * $olaps / 1024 / 1024 / 1024, $olaps, 2 * $bases * $coverage / 1024 / 1024 / 1024); $nj++; $reads = 0; $bases = 0; $olaps = 0; push @bgn, $id + 1; # RED expects inclusive ranges. } } # Dump a script. my $batchSize = getGlobal("redBatchSize"); my $numThreads = getGlobal("redThreads"); my $numReads = getNumberOfReadsInStore("unitigging", $asm); open(F, "> $path/red.sh") or caExit("can't open '$path/red.sh' for writing: $!", undef); print F "#!" . getGlobal("shell") . "\n\n"; print F "\n"; print F getBinDirectoryShellCode(); print F "\n"; print F setWorkDirectoryShellCode($path); print F fetchStoreShellCode("unitigging/$asm.gkpStore", $path, ""); print F fetchStoreShellCode("unitigging/$asm.ovlStore", $path, ""); print F "\n"; print F getJobIDShellCode(); print F "\n"; for (my $jj=1; $jj <= $nj; $jj++) { print F "if [ \$jobid = $jj ] ; then\n"; print F " minid=$bgn[$jj-1]\n"; print F " maxid=$end[$jj-1]\n"; print F "fi\n"; } print F "jobid=`printf %05d \$jobid`\n"; print F "\n"; print F "if [ -e ./\$jobid.red ] ; then\n"; print F " echo Job previously completed successfully.\n"; print F " exit\n"; print F "fi\n"; print F "\n"; print F "\$bin/findErrors \\\n"; print F " -G ../$asm.gkpStore \\\n"; print F " -O ../$asm.ovlStore \\\n"; print F " -R \$minid \$maxid \\\n"; print F " -e " . getGlobal("utgOvlErrorRate") . " -l " . 
getGlobal("minOverlapLength") . " \\\n"; print F " -o ./\$jobid.red.WORKING \\\n"; print F " -t $numThreads \\\n"; print F "&& \\\n"; print F "mv ./\$jobid.red.WORKING ./\$jobid.red\n"; print F "\n"; print F stashFileShellCode("$path", "\$jobid.red", ""); print F "\n"; close(F); makeExecutable("$path/red.sh"); stashFile("$path/red.sh"); finishStage: emitStage($asm, "readErrorDetectionConfigure"); buildHTML($asm, "utg"); allDone: } sub readErrorDetectionCheck ($) { my $asm = shift @_; my $attempt = getGlobal("canuIteration"); my $path = "unitigging/3-overlapErrorAdjustment"; return if (getGlobal("enableOEA") == 0); goto allDone if (fileExists("$path/red.red")); # Output exists goto allDone if (skipStage($asm, "readErrorDetectionCheck", $attempt) == 1); goto allDone if (fileExists("unitigging/$asm.ovlStore/evalues")); # Stage entrely finished goto allDone if (-d "unitigging/$asm.ctgStore"); # Assembly finished fetchFile("$path/red.sh"); # Figure out if all the tasks finished correctly. my @successJobs; my @failedJobs; my $failureMessage = ""; open(A, "< $path/red.sh") or caExit("can't open '$path/red.sh' for reading: $!", undef); while () { if (m/if.*jobid\s+=\s+(\d+)\s+.*then/) { my $ji = substr("00000" . $1, -5); my $jn = "unitigging/3-overlapErrorAdjustment/$ji.red"; if (! fileExists($jn)) { $failureMessage .= "-- job $ji.red FAILED.\n"; push @failedJobs, $1; } else { push @successJobs, $jn; } } } close(A); # Failed jobs, retry. if (scalar(@failedJobs) > 0) { # If too many attempts, give up. if ($attempt >= getGlobal("canuIterationMax")) { print STDERR "--\n"; print STDERR "-- Read error detection jobs failed, tried $attempt times, giving up.\n"; print STDERR $failureMessage; print STDERR "--\n"; caExit(undef, undef); } if ($attempt > 0) { print STDERR "--\n"; print STDERR "-- Read error detection jobs failed, retry.\n"; print STDERR $failureMessage; print STDERR "--\n"; } # Otherwise, run some jobs. 
emitStage($asm, "readErrorDetectionCheck", $attempt); buildHTML($asm, "utg"); submitOrRunParallelJob($asm, "red", $path, "red", @failedJobs); return; } finishStage: print STDERR "-- Found ", scalar(@successJobs), " read error detection output files.\n"; # I didn't wan't to concat all the corrections, but it is _vastly_ easier to do so, compared to # hacking correctOverlaps to handle multiple corrections files. Plus, it is now really just a # concat; before, the files needed to be parsed to strip off a header. open(O, "> $path/red.red") or caExit("can't open '$path/red.red' for writing: $!", undef); binmode(O); foreach my $f (@successJobs) { fetchFile($f); open(F, "< $f") or caExit("can't open '$f' for reading: $!", undef); binmode(F); my $buf; my $len = sysread(F, $buf, 1024 * 1024); while ($len > 0) { syswrite(O, $buf, $len); $len = sysread(F, $buf, 1024 * 1024); } close(F); } close(O); stashFile("$path/red.red"); foreach my $f (@successJobs) { unlink $f; } emitStage($asm, "readErrorDetectionCheck"); buildHTML($asm, "utg"); allDone: } sub overlapErrorAdjustmentConfigure ($) { my $asm = shift @_; my $bin = getBinDirectory(); my $path = "unitigging/3-overlapErrorAdjustment"; return if (getGlobal("enableOEA") == 0); goto allDone if (fileExists("$path/oea.sh")); # Script exists goto allDone if (skipStage($asm, "overlapErrorAdjustmentConfigure") == 1); goto allDone if (fileExists("unitigging/$asm.ovlStore/evalues")); # Stage entrely finished goto allDone if (-d "unitigging/$asm.ctgStore"); # Assembly finished # OEA uses 1 byte/base + 8 bytes/adjustment + 28 bytes/overlap. We don't know the number of adjustments, but that's # basically error rate. No adjustment is output for mismatches. 
my @readLengths; my @numOlaps; #print STDERR "$bin/gatekeeperDumpMetaData -G unitigging/$asm.gkpStore -reads\n"; open(F, "$bin/gatekeeperDumpMetaData -G unitigging/$asm.gkpStore -reads |"); while () { my @v = split '\s+', $_; $readLengths[$v[0]] = $v[2]; } close(F); #print STDERR "$bin/ovStoreDump -G unitigging/$asm.gkpStore -O unitigging/$asm.ovlStore -d -counts\n"; open(F, "$bin/ovStoreDump -G unitigging/$asm.gkpStore -O unitigging/$asm.ovlStore -d -counts |"); while () { my @v = split '\s+', $_; $numOlaps[$v[0]] = $v[1]; } close(F); # Make an array of partitions, putting as many reads into each as will fit in the desired memory. tryOEAagain: my @bgn; undef @bgn; my @end; undef @end; my @log; undef @log; my $nj = 0; my $maxID = getNumberOfReadsInStore("unitigging", $asm); my $maxMem = getGlobal("oeaMemory") * 1024 * 1024 * 1024; my $maxReads = getGlobal("oeaBatchSize"); my $maxBases = getGlobal("oeaBatchLength"); print STDERR "\n"; print STDERR "Configure OEA for ", getGlobal("oeaMemory"), "gb memory with batches of at most ", ($maxReads > 0) ? $maxReads : "(unlimited)", " reads and ", ($maxBases > 0) ? $maxBases : "(unlimited)", " bases.\n"; my $reads = 0; my $bases = 0; my $olaps = 0; fetchFile("$path/red.red"); my $coverage = getExpectedCoverage("unitigging", $asm); my $corrSize = (-s "$path/red.red"); my $smallJobs = 0; my $smallJobSize = 1024; push @bgn, 1; for (my $id = 1; $id <= $maxID; $id++) { $reads += 1; $bases += $readLengths[$id]; $olaps += $numOlaps[$id]; # Hacked to attempt to estimate adjustment size better. Olaps should only require 12 bytes each. 
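The comments above give simple linear memory models (RED: 13 bytes/base + 12 bytes/overlap plus roughly 2 * coverage * bases of evidence; OEA: ~1 byte/base plus adjustments and per-overlap overhead). A minimal sketch of the RED estimate, using only the constants stated in those comments (the function name and units are illustrative, not part of canu):

```python
def red_batch_memory_bytes(bases, olaps, coverage):
    """Estimate RED batch memory: 13 bytes per base of reads in the batch,
    12 bytes per overlap, plus ~2 * coverage * bases for evidence reads
    loaded off both ends of each read (constants from the comments above)."""
    return 13 * bases + 12 * olaps + 2 * bases * coverage

# The partitioning loop closes a batch once this estimate reaches 75% of
# the configured memory, leaving the 25% slop the code allows.
```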
my $memBases = (1 * $bases); # Corrected reads for this batch my $memAdj1 = (8 * $corrSize) * 0.33; # Overestimate of the size of the indel adjustments needed (total size includes mismatches) my $memReads = (32 * $reads); # Read data in the batch my $memOlaps = (32 * $olaps); # Loaded overlaps my $memSeq = (4 * 2097152); # two char arrays of 2*maxReadLen my $memAdj2 = (16 * 2097152); # two Adjust_t arrays of maxReadLen my $memWA = (32 * 1048576); # Work area (16mb) and edit array (16mb) my $memMisc = (256 * 1048576); # Work area (16mb) and edit array (16mb) and (192mb) slop my $memory = $memBases + $memAdj1 + $memReads + $memOlaps + $memSeq + $memAdj2 + $memWA + $memMisc; if ((($maxMem > 0) && ($memory >= $maxMem * 0.75)) || (($maxReads > 0) && ($reads >= $maxReads)) || (($maxBases > 0) && ($bases >= $maxBases)) || (($id == $maxID))) { push @end, $id; $smallJobs++ if ($end[$nj] - $bgn[$nj] < $smallJobSize); push @log, sprintf("OEA job %3u from read %9u to read %9u - %4.1f bases + %4.1f adjusts + %4.1f reads + %4.1f olaps + %4.1f fseq/rseq + %4.1f fadj/radj + %4.1f work + %4.1f misc = %5.1f MB\n", $nj + 1, $bgn[$nj], $end[$nj], $memBases / 1024 / 1024, $memAdj1 / 1024 / 1024, $memReads / 1024 / 1024, $memOlaps / 1024 / 1024, $memSeq / 1024 / 1024, $memAdj2 / 1024 / 1024, $memWA / 1024 / 1024, $memMisc / 1024 / 1024, $memory / 1024 / 1024); $nj++; $reads = 0; $bases = 0; $olaps = 0; push @bgn, $id + 1; # OEA expects inclusive ranges. } } # If too many small jobs, increase memory and try again. We'll allow any size jobs as long as # there are 8 or less, but then demand there are at most 2 small jobs. if (($nj > 8) && ($smallJobs >= 2)) { my $curMem = getGlobal("oeaMemory"); my $newMem = int(1000 * getGlobal("oeaMemory") * 1.25) / 1000; print STDERR " FAILED - configured $nj jobs, but $smallJobs jobs process $smallJobSize reads or less each. Increasing memory from $curMem GB to $newMem GB.\n"; setGlobal("oeaMemory", $newMem); goto tryOEAagain; } # Report. 
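The partitioning loops for RED and OEA share the same greedy shape: accumulate reads in ID order until the memory estimate reaches 75% of the cap (or a read/base cap, or the last read), then close the batch as an inclusive ID range. A simplified sketch of that shape, with the memory model reduced to a single bytes-per-base term (names are illustrative):

```python
def partition_reads(read_lengths, max_mem_bytes, bytes_per_base=13):
    """Greedily batch reads 1..N; close a batch when estimated memory
    reaches 75% of the cap (the 25% slop used above) or at the last read.
    Returns inclusive (bgn, end) read-ID ranges, as RED/OEA jobs expect."""
    batches = []
    bgn, bases = 1, 0
    for rid, length in enumerate(read_lengths, start=1):
        bases += length
        if bytes_per_base * bases >= max_mem_bytes * 0.75 or rid == len(read_lengths):
            batches.append((bgn, rid))
            bgn, bases = rid + 1, 0
    return batches
```

The OEA variant additionally counts small batches and, when too many appear, raises oeaMemory by 25% and reruns the whole loop (the tryOEAagain label above).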
print STDERR "Configured $nj jobs.\n"; print STDERR "\n"; foreach my $l (@log) { print STDERR $l; } print STDERR "\n"; # Dump a script open(F, "> $path/oea.sh") or caExit("can't open '$path/oea.sh' for writing: $!", undef); print F "#!" . getGlobal("shell") . "\n\n"; print F "\n"; print F getBinDirectoryShellCode(); print F "\n"; print F setWorkDirectoryShellCode($path); print F fetchStoreShellCode("unitigging/$asm.gkpStore", $path, ""); print F fetchStoreShellCode("unitigging/$asm.ovlStore", $path, ""); print F "\n"; print F getJobIDShellCode(); print F "\n"; for (my $jj=1; $jj <= $nj; $jj++) { print F "if [ \$jobid = $jj ] ; then\n"; print F " minid=$bgn[$jj-1]\n"; print F " maxid=$end[$jj-1]\n"; print F "fi\n"; } print F "jobid=`printf %05d \$jobid`\n"; print F "\n"; print F "if [ -e ./\$jobid.oea ] ; then\n"; print F " echo Job previously completed successfully.\n"; print F " exit\n"; print F "fi\n"; print F "\n"; print F fetchFileShellCode("unitigging/3-overlapErrorAdjustment", "red.red", ""); print F "\n"; print F "\$bin/correctOverlaps \\\n"; print F " -G ../$asm.gkpStore \\\n"; print F " -O ../$asm.ovlStore \\\n"; print F " -R \$minid \$maxid \\\n"; print F " -e " . getGlobal("utgOvlErrorRate") . " -l " . getGlobal("minOverlapLength") . 
" \\\n"; print F " -c ./red.red \\\n"; print F " -o ./\$jobid.oea.WORKING \\\n"; print F "&& \\\n"; print F "mv ./\$jobid.oea.WORKING ./\$jobid.oea\n"; print F "\n"; print F stashFileShellCode("$path", "\$jobid.oea", ""); print F "\n"; close(F); makeExecutable("$path/oea.sh"); stashFile("$path/oea.sh"); finishStage: emitStage($asm, "overlapErrorAdjustmentConfigure"); buildHTML($asm, "utg"); allDone: } sub overlapErrorAdjustmentCheck ($) { my $asm = shift @_; my $attempt = getGlobal("canuIteration"); my $path = "unitigging/3-overlapErrorAdjustment"; return if (getGlobal("enableOEA") == 0); goto allDone if (fileExists("$path/oea.files")); # Output exists goto allDone if (skipStage($asm, "overlapErrorAdjustmentCheck", $attempt) == 1); goto allDone if (fileExists("unitigging/$asm.ovlStore/evalues")); # Stage entrely finished goto allDone if (-d "unitigging/$asm.ctgStore"); # Assembly finished # Figure out if all the tasks finished correctly. my $batchSize = getGlobal("oeaBatchSize"); my $failedJobs = 0; my $numReads = getNumberOfReadsInStore("unitigging", $asm); my $numJobs = 0; #int($numReads / $batchSize) + (($numReads % $batchSize == 0) ? 0 : 1); # Need to read script to find number of jobs! my @successJobs; my @failedJobs; my $failureMessage = ""; fetchFile("$path/oea.sh"); open(A, "< $path/oea.sh") or caExit("can't open '$path/oea.sh' for reading: $!", undef); while () { if (m/if.*jobid\s+=\s+(\d+)\s+.*then/) { my $ji = substr("00000" . $1, -5); if (! fileExists("unitigging/3-overlapErrorAdjustment/$ji.oea")) { $failureMessage .= "-- job $ji.oea FAILED.\n"; push @failedJobs, $1; } else { push @successJobs, "./$ji.oea"; } } } close(A); # Failed jobs, retry. if (scalar(@failedJobs) > 0) { # If too many attempts, give up. 
if ($attempt >= getGlobal("canuIterationMax")) { print STDERR "--\n"; print STDERR "-- Overlap error adjustment jobs failed, tried $attempt times, giving up.\n"; print STDERR $failureMessage; print STDERR "--\n"; caExit(undef, undef); } if ($attempt > 0) { print STDERR "--\n"; print STDERR "-- Overlap error adjustment jobs failed, retry.\n"; print STDERR $failureMessage; print STDERR "--\n"; } # Otherwise, run some jobs. emitStage($asm, "overlapErrorAdjustmentCheck", $attempt); buildHTML($asm, "utg"); submitOrRunParallelJob($asm, "oea", $path, "oea", @failedJobs); return; } finishStage: print STDERR "-- Found ", scalar(@successJobs), " overlap error adjustment output files.\n"; open(L, "> $path/oea.files") or caExit("can't open '$path/oea.files' for writing: $!", undef); foreach my $f (@successJobs) { print L "$f\n"; } close(L); stashFile("$path/oea.files"); emitStage($asm, "overlapErrorAdjustmentCheck"); buildHTML($asm, "utg"); allDone: } sub updateOverlapStore ($) { my $asm = shift @_; my $bin = getBinDirectory(); my $cmd; my $path = "unitigging/3-overlapErrorAdjustment"; return if (getGlobal("enableOEA") == 0); goto allDone if (skipStage($asm, "updateOverlapStore") == 1); goto allDone if (fileExists("unitigging/$asm.ovlStore/evalues")); # Stage entrely finished goto allDone if (-d "unitigging/$asm.ctgStore"); # Assembly finished fetchFile("unitigging/3-overlapErrorAdjustment/oea.files"); caExit("didn't find '$path/oea.files' to add to store, yet jobs claim to be finished", undef) if (! 
-e "$path/oea.files"); open(F, "< $path/oea.files"); while () { chomp; fetchFile("$path/$_"); } close(F); fetchStore("unitigging/$asm.ovlStore"); $cmd = "$bin/ovStoreBuild \\\n"; $cmd .= " -G ../$asm.gkpStore \\\n"; $cmd .= " -O ../$asm.ovlStore \\\n"; $cmd .= " -evalues \\\n"; $cmd .= " -L ./oea.files \\\n"; $cmd .= "> ./oea.apply.err 2>&1"; if (runCommand($path, $cmd)) { unlink "unitigging/$asm.ovlStore/evalues"; caExit("failed to add error rates to overlap store", "$path/oea.apply.err"); } stashFile("unitigging/$asm.ovlStore/evalues"); my $report = "-- No report available.\n"; #open(F, "< $path/oea.apply.stats") or caExit("Failed to open error rate adjustment statistics in '$path/oea.apply.stats': $!", undef); #while () { #} #close(F); addToReport("adjustments", $report); finishStage: emitStage($asm, "updateOverlapStore"); buildHTML($asm, "utg"); allDone: } canu-1.6/src/pipelines/canu/OverlapInCore.pm000066400000000000000000000412471314437614700211030ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # This file is derived from: # # src/pipelines/ca3g/OverlapInCore.pm # # Modifications by: # # Brian P. Walenz from 2015-FEB-27 to 2015-AUG-25 # are Copyright 2015 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. 
Walenz beginning on 2015-OCT-19 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2016-JUN-08 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## package canu::OverlapInCore; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(overlapConfigure overlap overlapCheck); use strict; use File::Path 2.08 qw(make_path remove_tree); use canu::Defaults; use canu::Execution; use canu::Report; use canu::HTML; use canu::Grid_Cloud; sub overlapConfigure ($$$) { my $asm = shift @_; my $tag = shift @_; my $type = shift @_; my $bin = getBinDirectory(); my $cmd; my $base; my $path; $base = "correction" if ($tag eq "cor"); $base = "trimming" if ($tag eq "obt"); $base = "unitigging" if ($tag eq "utg"); $path = "$base/1-overlapper"; caFailure("invalid type '$type'", undef) if (($type ne "partial") && ($type ne "normal")); fetchFile("$path/overlap.sh"); fetchFile("$path/ovljob.files"); goto allDone if (skipStage($asm, "$tag-overlapConfigure") == 1); goto allDone if (fileExists("$path/overlap.sh") && fileExists("$path/$asm.partition.ovlbat") && fileExists("$path/$asm.partition.ovljob") && fileExists("$path/$asm.partition.ovlopt")); goto allDone if (fileExists("$path/ovljob.files")); goto allDone if (-e "$base/$asm.ovlStore"); goto allDone if (fileExists("$base/$asm.ovlStore.tar")); print STDERR "--\n"; print STDERR "-- OVERLAPPER (normal) (correction) erate=", getGlobal("corOvlErrorRate"), "\n" if ($tag eq "cor"); print STDERR "-- OVERLAPPER (normal) (trimming) erate=", getGlobal("obtOvlErrorRate"), "\n" if ($tag eq "obt"); print STDERR "-- OVERLAPPER (normal) (assembly) erate=", getGlobal("utgOvlErrorRate"), "\n" if ($tag eq "utg"); print STDERR "--\n"; make_path("$path") if (! 
-d "$path"); # overlapInCorePartition internally uses 'WORKING' outputs, and renames to the final # version right before it exits. All we need to do here is check for existence of # the output, and exit if the command fails. fetchFile("$path/$asm.partition.ovlbat"); # Don't know where to put these. Before the tests above? fetchFile("$path/$asm.partition.ovljob"); # Here? Just before we use them? fetchFile("$path/$asm.partition.ovlopt"); if ((! -e "$path/$asm.partition.ovlbat") || (! -e "$path/$asm.partition.ovljob") || (! -e "$path/$asm.partition.ovlopt")) { # These used to be runCA options, but were removed in canu. They were used mostly for # illumina-pacbio correction, but were also used (or could have been used) during the # Salmon assembly when overlaps were computed differently depending on the libraries # involved (and was run manually). These are left in for documentation. # #my $checkLibrary = getGlobal("${tag}CheckLibrary"); #my $hashLibrary = getGlobal("${tag}HashLibrary"); #my $refLibrary = getGlobal("${tag}RefLibrary"); my $hashBlockLength = getGlobal("${tag}OvlHashBlockLength"); my $hashBlockSize = 0; my $refBlockSize = getGlobal("${tag}OvlRefBlockSize"); my $refBlockLength = getGlobal("${tag}OvlRefBlockLength"); my $minOlapLength = getGlobal("minOverlapLength"); if (($refBlockSize > 0) && ($refBlockLength > 0)) { caExit("can't set both ${tag}OvlRefBlockSize and ${tag}OvlRefBlockLength", undef); } $cmd = "$bin/overlapInCorePartition \\\n"; $cmd .= " -g ../$asm.gkpStore \\\n"; $cmd .= " -bl $hashBlockLength \\\n"; $cmd .= " -bs $hashBlockSize \\\n"; $cmd .= " -rs $refBlockSize \\\n"; $cmd .= " -rl $refBlockLength \\\n"; #$cmd .= " -H $hashLibrary \\\n" if ($hashLibrary ne "0"); #$cmd .= " -R $refLibrary \\\n" if ($refLibrary ne "0"); #$cmd .= " -C \\\n" if (!$checkLibrary); $cmd .= " -ol $minOlapLength \\\n"; $cmd .= " -o ./$asm.partition \\\n"; $cmd .= "> ./$asm.partition.err 2>&1"; if (runCommand($path, $cmd)) { caExit("failed partition for 
overlapper", undef); } stashFile("$path/$asm.partition.ovlbat"); stashFile("$path/$asm.partition.ovljob"); stashFile("$path/$asm.partition.ovlopt"); unlink "$path/overlap.sh"; } open(BAT, "< $path/$asm.partition.ovlbat") or caExit("can't open '$path/$asm.partition.ovlbat' for reading: $!", undef); open(JOB, "< $path/$asm.partition.ovljob") or caExit("can't open '$path/$asm.partition.ovljob' for reading: $!", undef); open(OPT, "< $path/$asm.partition.ovlopt") or caExit("can't open '$path/$asm.partition.ovlopt' for reading: $!", undef); my @bat = ; chomp @bat; my @job = ; chomp @job; my @opt = ; chomp @opt; close(BAT); close(JOB); close(OPT); fetchFile("$path/overlap.sh"); if (! -e "$path/overlap.sh") { my $merSize = getGlobal("${tag}OvlMerSize"); #my $hashLibrary = getGlobal("${tag}OvlHashLibrary"); #my $refLibrary = getGlobal("${tag}OvlRefLibrary"); # Create a script to run overlaps. We make a giant job array for this -- we need to know # hashBeg, hashEnd, refBeg and refEnd -- from that we compute batchName and jobName. my $hashBits = getGlobal("${tag}OvlHashBits"); my $hashLoad = getGlobal("${tag}OvlHashLoad"); open(F, "> $path/overlap.sh") or caExit("can't open '$path/overlap.sh' for writing: $!", undef); print F "#!" . getGlobal("shell") . "\n"; print F "\n"; print F "perl='/usr/bin/env perl'\n"; print F "\n"; print F getBinDirectoryShellCode(); print F "\n"; print F setWorkDirectoryShellCode($path); print F fetchStoreShellCode("$base/$asm.gkpStore", "$base/1-overlapper", ""); print F "\n"; print F getJobIDShellCode(); print F "\n"; for (my $ii=1; $ii<=scalar(@bat); $ii++) { print F "if [ \$jobid -eq $ii ] ; then\n"; print F " bat=\"$bat[$ii-1]\"\n"; # Needed to mkdir. print F " job=\"$bat[$ii-1]/$job[$ii-1]\"\n"; # Needed to simplify overlapCheck() below. print F " opt=\"$opt[$ii-1]\"\n"; print F "fi\n"; print F "\n"; } print F "\n"; print F "if [ ! 
-d ./\$bat ]; then\n"; print F " mkdir ./\$bat\n"; print F "fi\n"; print F "\n"; print F fileExistsShellCode("./\$job.ovb"); print F " echo Job previously completed successfully.\n"; print F " exit\n"; print F "fi\n"; print F "\n"; print F fetchFileShellCode("$base/0-mercounts", "$asm.ms$merSize.frequentMers.fasta", ""); print F "\n"; print F "\$bin/overlapInCore \\\n"; print F " -G \\\n" if ($type eq "partial"); print F " -t ", getGlobal("${tag}OvlThreads"), " \\\n"; print F " -k $merSize \\\n"; print F " -k ../0-mercounts/$asm.ms$merSize.frequentMers.fasta \\\n"; print F " --hashbits $hashBits \\\n"; print F " --hashload $hashLoad \\\n"; print F " --maxerate ", getGlobal("corOvlErrorRate"), " \\\n" if ($tag eq "cor"); # Explicitly using proper name for grepability. print F " --maxerate ", getGlobal("obtOvlErrorRate"), " \\\n" if ($tag eq "obt"); print F " --maxerate ", getGlobal("utgOvlErrorRate"), " \\\n" if ($tag eq "utg"); print F " --minlength ", getGlobal("minOverlapLength"), " \\\n"; print F " --minkmers \\\n" if (defined(getGlobal("${tag}OvlFilter")) && getGlobal("${tag}OvlFilter")==1); print F " \$opt \\\n"; print F " -o ./\$job.ovb.WORKING \\\n"; print F " -s ./\$job.stats \\\n"; #print F " -H $hashLibrary \\\n" if ($hashLibrary ne "0"); #print F " -R $refLibrary \\\n" if ($refLibrary ne "0"); print F " ../$asm.gkpStore \\\n"; print F "&& \\\n"; print F "mv ./\$job.ovb.WORKING ./\$job.ovb\n"; print F "\n"; print F stashFileShellCode("$base/1-overlapper/", "\$job.ovb", ""); print F stashFileShellCode("$base/1-overlapper/", "\$job.counts", ""); print F stashFileShellCode("$base/1-overlapper/", "\$job.stats", ""); print F "\n"; print F "exit 0\n"; close(F); makeExecutable("$path/overlap.sh"); stashFile("$path/overlap.sh"); } my $jobs = scalar(@job); my $batchName = $bat[$jobs-1]; chomp $batchName; my $jobName = $job[$jobs-1]; chomp $jobName; my $numJobs = 0; open(F, "< $path/overlap.sh") or caExit("can't open '$path/overlap.sh' for reading: $!", undef); 
while (<F>) { $numJobs++ if (m/^\s+job=/); } close(F); print STDERR "--\n"; print STDERR "-- Configured $numJobs overlapInCore jobs.\n"; finishStage: emitStage($asm, "$tag-overlapConfigure"); buildHTML($asm, $tag); allDone: stopAfter("overlapConfigure"); } sub reportSumMeanStdDev (@) { my $sum = 0; my $mean = 0; my $stddev = 0; my $n = scalar(@_); my $formatted; $sum += $_ foreach (@_); $mean = $sum / $n; $stddev += ($_ - $mean) * ($_ - $mean) foreach (@_); $stddev = int(1000 * sqrt($stddev / ($n-1))) / 1000 if ($n > 1); $formatted = substr(" $mean", -10) . " +- $stddev"; return($sum, $formatted); } sub reportOverlapStats ($$@) { my $base = shift @_; my $asm = shift @_; my @statsJobs = @_; my @hitsWithoutOlaps; my @hitsWithOlaps; my @multiOlaps; my @totalOlaps; my @containedOlaps; my @dovetailOlaps; my @shortReject; my @longReject; foreach my $s (@statsJobs) { fetchFile("$base/$s"); open(F, "< $base/$s") or caExit("can't open '$base/$s' for reading: $!", undef); $_ = <F>; push @hitsWithoutOlaps, $1 if (m/^\s*Kmer\shits\swithout\solaps\s=\s(\d+)$/); $_ = <F>; push @hitsWithOlaps, $1 if (m/^\s*Kmer\shits\swith\solaps\s=\s(\d+)$/); $_ = <F>; push @multiOlaps, $1 if (m/^\s*Multiple\soverlaps\/pair\s=\s(\d+)$/); $_ = <F>; push @totalOlaps, $1 if (m/^\s*Total\soverlaps\sproduced\s=\s(\d+)$/); $_ = <F>; push @containedOlaps, $1 if (m/^\s*Contained\soverlaps\s=\s(\d+)$/); $_ = <F>; push @dovetailOlaps, $1 if (m/^\s*Dovetail\soverlaps\s=\s(\d+)$/); $_ = <F>; push @shortReject, $1 if (m/^\s*Rejected\sby\sshort\swindow\s=\s(\d+)$/); $_ = <F>; push @longReject, $1 if (m/^\s*Rejected\sby\slong\swindow\s=\s(\d+)$/); close(F); } printf STDERR "--\n"; printf STDERR "-- overlapInCore compute '$base/1-overlapper':\n"; printf STDERR "-- kmer hits\n"; printf STDERR "-- with no overlap %12d %s\n", reportSumMeanStdDev(@hitsWithoutOlaps); printf STDERR "-- with an overlap %12d %s\n", reportSumMeanStdDev(@hitsWithOlaps); printf STDERR "--\n"; printf STDERR "-- overlaps %12d %s\n", reportSumMeanStdDev(@totalOlaps);
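reportSumMeanStdDev() above returns the column total plus a "mean +- stddev" string, using the sample (n-1) standard deviation. The same computation, as a sketch in Python for clarity:

```python
import math

def sum_mean_stddev(values):
    """Total, mean, and sample standard deviation (n-1 denominator);
    stddev is reported as 0 when there are fewer than two values,
    mirroring reportSumMeanStdDev() above."""
    n = len(values)
    total = sum(values)
    mean = total / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
    return total, mean, math.sqrt(var)
```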
printf STDERR "-- contained %12d %s\n", reportSumMeanStdDev(@containedOlaps); printf STDERR "-- dovetail %12d %s\n", reportSumMeanStdDev(@dovetailOlaps); printf STDERR "--\n"; printf STDERR "-- overlaps rejected\n"; printf STDERR "-- multiple per pair %12d %s\n", reportSumMeanStdDev(@multiOlaps); printf STDERR "-- bad short window %12d %s\n", reportSumMeanStdDev(@shortReject); printf STDERR "-- bad long window %12d %s\n", reportSumMeanStdDev(@longReject); } # Check that the overlapper jobs properly executed. If not, # complain, but don't help the user fix things. # sub overlapCheck ($$$) { my $asm = shift @_; my $tag = shift @_; my $type = shift @_; my $attempt = getGlobal("canuIteration"); my $base; my $path; $base = "correction" if ($tag eq "cor"); $base = "trimming" if ($tag eq "obt"); $base = "unitigging" if ($tag eq "utg"); $path = "$base/1-overlapper"; goto allDone if (skipStage($asm, "$tag-overlapCheck", $attempt) == 1); goto allDone if (fileExists("$path/ovljob.files")); goto allDone if (-e "$base/$asm.ovlStore"); goto allDone if (fileExists("$base/$asm.ovlStore.tar")); # Figure out if all the tasks finished correctly. 
my $currentJobID = 1;
    my @successJobs;
    my @statsJobs;
    my @miscJobs;
    my @failedJobs;
    my $failureMessage = "";

    fetchFile("$path/overlap.sh");

    open(F, "< $path/overlap.sh") or caExit("can't open '$path/overlap.sh' for reading: $!", undef);
    while (<F>) {
        if (m/^\s+job=\"(\d+\/\d+)\"$/) {
            if      (fileExists("$path/$1.ovb.gz")) {
                push @successJobs, "1-overlapper/$1.ovb.gz\n";   # Dumped to a file, so include \n
                push @statsJobs,   "1-overlapper/$1.stats";      # Used here, don't include \n
                push @miscJobs,    "1-overlapper/$1.stats\n";
                push @miscJobs,    "1-overlapper/$1.counts\n";
            } elsif (fileExists("$path/$1.ovb")) {
                push @successJobs, "1-overlapper/$1.ovb\n";
                push @statsJobs,   "1-overlapper/$1.stats";
                push @miscJobs,    "1-overlapper/$1.stats\n";
                push @miscJobs,    "1-overlapper/$1.counts\n";
            } elsif (fileExists("$path/$1.ovb.bz2")) {
                push @successJobs, "1-overlapper/$1.ovb.bz2\n";
                push @statsJobs,   "1-overlapper/$1.stats";
                push @miscJobs,    "1-overlapper/$1.stats\n";
                push @miscJobs,    "1-overlapper/$1.counts\n";
            } elsif (fileExists("$path/$1.ovb.xz")) {
                push @successJobs, "1-overlapper/$1.ovb.xz\n";
                push @statsJobs,   "1-overlapper/$1.stats";
                push @miscJobs,    "1-overlapper/$1.stats\n";
                push @miscJobs,    "1-overlapper/$1.counts\n";
            } else {
                $failureMessage .= "-- job $path/$1.ovb FAILED.\n";
                push @failedJobs, $currentJobID;
            }

            $currentJobID++;
        }
    }
    close(F);

    # Failed jobs, retry.

    if (scalar(@failedJobs) > 0) {

        # If too many attempts, give up.

        if ($attempt >= getGlobal("canuIterationMax")) {
            print STDERR "--\n";
            print STDERR "-- Overlap jobs failed, tried $attempt times, giving up.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
            caExit(undef, undef);
        }

        if ($attempt > 0) {
            print STDERR "--\n";
            print STDERR "-- Overlap jobs failed, retry.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
        }

        # Otherwise, run some jobs.
emitStage($asm, "$tag-overlapCheck", $attempt); buildHTML($asm, $tag); submitOrRunParallelJob($asm, "${tag}ovl", $path, "overlap", @failedJobs); return; } finishStage: print STDERR "-- Found ", scalar(@successJobs), " overlapInCore output files.\n"; open(L, "> $path/ovljob.files") or caExit("can't open '$path/ovljob.files' for writing: $!", undef); print L @successJobs; close(L); open(L, "> $path/ovljob.more.files") or caExit("can't open '$path/ovljob.more.files' for writing: $!", undef); print L @miscJobs; close(L); stashFile("$path/ovljob.files"); stashFile("$path/ovljob.more.files"); reportOverlapStats($base, $asm, @statsJobs); emitStage($asm, "$tag-overlapCheck"); buildHTML($asm, $tag); allDone: stopAfter("overlap"); } canu-1.6/src/pipelines/canu/OverlapMMap.pm000066400000000000000000000575251314437614700205640ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # Modifications by: # # Sergey Koren beginning on 2016-FEB-24 # are a 'United States Government Work', and # are released in the public domain # # Brian P. Walenz beginning on 2016-MAY-02 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. 
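#  Editorial overview (a sketch added for orientation; every name below is
#  taken from the code in this file, nothing new is assumed):
#
#      mmapConfigure        writes 1-overlapper/precompute.sh and mmap.sh,
#                           partitioning the reads into blocks and jobs.
#      mmapPrecomputeCheck  expects one blocks/NNNNNN.mmi index file per
#                           precompute job.
#      mmapCheck            expects one results/NNNNNN.ovb output file per
#                           overlap job.
#
#  Each check routine rescans its script for job lines, resubmits failed
#  jobs until canuIterationMax attempts are used, then writes the lists of
#  outputs (ovljob.files, ovljob.more.files) for the overlap store build.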
## 

package canu::OverlapMMap;

require Exporter;

@ISA    = qw(Exporter);
@EXPORT = qw(mmapConfigure mmapPrecomputeCheck mmapCheck);

use strict;

use File::Path 2.08 qw(make_path remove_tree);

use canu::Defaults;
use canu::Execution;
use canu::Gatekeeper;
use canu::HTML;
use canu::Grid_Cloud;

# Map long reads to long reads with minimap.

sub mmapConfigure ($$$) {
    my $asm = shift @_;
    my $tag = shift @_;
    my $typ = shift @_;
    my $bin = getBinDirectory();

    my $base;    # e.g., $base/1-overlapper/mmap.sh
    my $path;    # e.g., $path/mmap.sh

    $base = "correction"  if ($tag eq "cor");
    $base = "trimming"    if ($tag eq "obt");
    $base = "unitigging"  if ($tag eq "utg");

    $path = "$base/1-overlapper";

    caFailure("invalid type '$typ'", undef) if (($typ ne "partial") && ($typ ne "normal"));

    goto allDone if (skipStage($asm, "$tag-mmapConfigure") == 1);
    goto allDone if (fileExists("$path/precompute.sh")) && (fileExists("$path/mmap.sh"));
    goto allDone if (fileExists("$path/ovljob.files"));
    goto allDone if (-e "$base/$asm.ovlStore");
    goto allDone if (fileExists("$base/$asm.ovlStore.tar"));

    my $numPacBioRaw         = 0;
    my $numPacBioCorrected   = 0;
    my $numNanoporeRaw       = 0;
    my $numNanoporeCorrected = 0;

    open(L, "< $base/$asm.gkpStore/libraries.txt") or caExit("can't open '$base/$asm.gkpStore/libraries.txt' for reading: $!", undef);
    while (<L>) {
        $numPacBioRaw++         if (m/pacbio-raw/);
        $numPacBioCorrected++   if (m/pacbio-corrected/);
        $numNanoporeRaw++       if (m/nanopore-raw/);
        $numNanoporeCorrected++ if (m/nanopore-corrected/);
    }
    close(L);

    my $parameters = "";

    if      ($numPacBioRaw > 0) {
        $parameters = "-x ava-pb";
    } elsif ($numNanoporeRaw > 0) {
        $parameters = "-x ava-ont";
    } elsif ($numPacBioCorrected > 0) {
        $parameters = "-x ava-pb -c -Hk21 -w14";    #tuned to find 1000bp 5% error
    } elsif ($numNanoporeCorrected > 0) {
        $parameters = "-x ava-ont -c -k17 -w11";    #tuned to find 1000bp 15% error
    } else {
        caFailure("--ERROR: no known read types found in $base/$asm.gkpStore/libraries.txt", undef);
    }

    print STDERR "--\n";
    print STDERR "-- OVERLAPPER
(mmap) (correction) with $parameters\n" if ($tag eq "cor"); print STDERR "-- OVERLAPPER (mmap) (trimming) with $parameters\n" if ($tag eq "obt"); print STDERR "-- OVERLAPPER (mmap) (assembly) with $parameters\n" if ($tag eq "utg"); print STDERR "--\n"; make_path($path) if (! -d $path); # Constants. my $numReads = getNumberOfReadsInStore($base, $asm); my $memorySize = getGlobal("${tag}mmapMemory"); my $blockPerGb = getGlobal("${tag}MMapBlockSize"); my $blockSize = int($blockPerGb * $memorySize); print STDERR "-- Given $memorySize GB, can fit $blockSize reads per block.\n"; # Divide the reads into blocks of ovlHashBlockSize. Each one of these blocks is used as the # table in mmap. Several of these blocks are used as the queries. my @blocks; # Range of reads to extract for this block my @blockBgn; # First read in the block my @blockLen; # Number of reads in the block my @hashes; # One for each job, the block that is the hash table my @skipSelf; # One for each job, jobs that would search block N against hash block N need to be handled special my @convert; # One for each job, flags to the mmap-ovl conversion program push @blocks, "no zeroth block, makes the loop where this is used easier"; push @blockBgn, "no zeroth block"; push @blockLen, "no zeroth block"; push @hashes, "no zeroth job"; push @skipSelf, "no zeroth job"; for (my $bgn=1; $bgn < $numReads; $bgn += $blockSize) { my $end = $bgn + $blockSize - 1; $end = $numReads if ($end > $numReads); #print STDERR "BLOCK ", scalar(@blocks), " reads from $bgn through $end\n"; push @blocks, "-b $bgn -e $end"; push @blockBgn, $bgn; push @blockLen, $end - $bgn + 1; } # Each mmap job will process one block against a set of other blocks. We'll pick, arbitrarily, # to use num_blocks/4 for that size, unless it is too small. my $numBlocks = scalar(@blocks); my $qryStride = ($numBlocks < 16) ? 
(2) : int($numBlocks / 4);

    print STDERR "-- For $numBlocks blocks, set stride to $qryStride blocks.\n";
    print STDERR "-- Logging partitioning to '$path/partitioning.log'.\n";

    open(L, "> $path/partitioning.log") or caExit("can't open '$path/partitioning.log' for writing: $!\n", undef);

    # Make queries. Each hash block needs to search against all blocks less than or equal to it.
    # Each job will search at most $qryStride blocks at once. So, queries could be:
    #   1: 1 vs 1,2,3 (with self-allowed, and query block 1 implicitly included)
    #   2: 1 vs 4,5 (with no-self allowed)
    #   3: 2 vs 2,3,4
    #   4: 2 vs 5
    #   5: 3 vs 3,4,5
    #   6: 4 vs 4,5
    #   7: 5 vs 5

    make_path("$path/queries");

    for (my $bid=1; $bid < $numBlocks; $bid++) {

        # Note that we never do qbgn = bid; the self-self overlap is special cased.

        for (my $qbgn = $bid; $qbgn < $numBlocks; $qbgn += $qryStride) {

            my $andSelf = "";

            if ($bid == $qbgn) {
                # The hash block bid is in the range and we need to compute hash-to-hash
                # overlaps, and exclude the block from the queries.
                push @skipSelf, "true";
                $qbgn++;
                $andSelf = " (and self)";
            } else {
                # The hash block bid isn't in the range of query blocks, don't allow hash-to-hash
                # overlaps.
                push @skipSelf, "";
            }

            my $qend = $qbgn + $qryStride - 1;              # Block bid searches reads in dat files from
            $qend = $numBlocks-1 if ($qend >= $numBlocks);  # qbgn to qend (inclusive).

            my $job = substr("000000" . scalar(@hashes), -6);   # Unique ID for this compute

            # Make a place to save queries. If this is the last-block-special-case, make a directory,
            # but don't link in any files. Without the directory, we'd need even more special case
            # code down in mmap.sh to exclude the -q option for this last block.

            make_path("$path/queries/$job");

            if ($qbgn < $numBlocks) {
                print L "Job ", scalar(@hashes), " computes block $bid vs blocks $qbgn-$qend$andSelf,\n";

                for (my $qid=$qbgn; $qid <= $qend; $qid++) {
                    my $qry = substr("000000" .
$qid, -6); # Name for the query block symlink("../../blocks/$qry.fasta", "$path/queries/$job/$qry.fasta"); } } else { print L "Job ", scalar(@hashes), " computes block $bid vs itself.\n"; $qbgn = $bid; # Otherwise, the @convert -q value is bogus } # This is easy, the ID of the hash. push @hashes, substr("000000" . $bid, -6); # One new job for block bid with qend-qbgn query files in it } } close(L); # Tar up the queries directory. Only useful for cloud support. runCommandSilently($path, "tar -cf queries.tar queries", 1); stashFile("$path/queries.tar"); # Create a script to generate precomputed blocks, including extracting the reads from gkpStore. #OPTIMIZE #OPTIMIZE Probably a big optimization for cloud assemblies, the block fasta inputs can be #OPTIMIZE computed ahead of time, stashed, and then fetched to do the actual precompute. #OPTIMIZE open(F, "> $path/precompute.sh") or caFailure("can't open '$path/precompute.sh' for writing: $!", undef); print F "#!" . getGlobal("shell") . "\n"; print F "\n"; print F getBinDirectoryShellCode(); print F "\n"; print F setWorkDirectoryShellCode($path); print F fetchStoreShellCode("$base/$asm.gkpStore", "$base/1-overlapper", ""); print F "\n"; print F getJobIDShellCode(); print F "\n"; for (my $ii=1; $ii < scalar(@blocks); $ii++) { print F "if [ \$jobid -eq $ii ] ; then\n"; print F " rge=\"$blocks[$ii]\"\n"; print F " job=\"", substr("000000" . $ii, -6), "\"\n"; print F "fi\n"; print F "\n"; } print F "\n"; print F "if [ x\$job = x ] ; then\n"; print F " echo Job partitioning error. jobid \$jobid is invalid.\n"; print F " exit 1\n"; print F "fi\n"; print F "\n"; print F "if [ ! 
-d ./blocks ]; then\n"; print F " mkdir -p ./blocks\n"; print F "fi\n"; print F "\n"; print F fileExistsShellCode("./blocks/\$job.fasta"); print F " echo Job previously completed successfully.\n"; print F " exit\n"; print F "fi\n"; print F "\n"; print F "\$bin/gatekeeperDumpFASTQ \\\n"; print F " -G ../$asm.gkpStore \\\n"; print F " \$rge \\\n"; print F " -nolibname \\\n"; print F " -noreadname \\\n"; print F " -fasta \\\n"; print F " -o ./blocks/\$job.input \\\n"; print F "&& \\\n"; print F "mv -f ./blocks/\$job.input.fasta ./blocks/\$job.fasta\n"; print F "if [ ! -e ./blocks/\$job.fasta ] ; then\n"; print F " echo Failed to extract fasta.\n"; print F " exit 1\n"; print F "fi\n"; print F "\n"; print F "\n"; print F "echo \"\"\n"; print F "echo Starting mmap precompute.\n"; print F "echo \"\"\n"; print F "\n"; print F " \$bin/minimap2 \\\n"; print F " $parameters -t ", getGlobal("${tag}mmapThreads"), " \\\n"; print F " -d ./blocks/\$job.input.mmi\\\n"; print F " ./blocks/\$job.fasta \\\n"; print F "&& \\\n"; print F "mv -f ./blocks/\$job.input.mmi ./blocks/\$job.mmi\n"; print F "\n"; print F "if [ ! -e ./blocks/\$job.mmi ] ; then\n"; print F " echo MMap failed.\n"; print F " exit 1\n"; print F "fi\n"; print F "\n"; print F stashFileShellCode("$base/1-overlapper/blocks", "\$job.mmi", ""); print F "\n"; print F "exit 0\n"; close(F); # Create a script to run mmap. open(F, "> $path/mmap.sh") or caFailure("can't open '$path/mmap.sh' for writing: $!", undef); print F "#!" . getGlobal("shell") . "\n"; print F "\n"; print F getBinDirectoryShellCode(); print F "\n"; print F setWorkDirectoryShellCode($path); print F fetchStoreShellCode("$base/$asm.gkpStore", "$base/1-overlapper", ""); print F "\n"; print F getJobIDShellCode(); print F "\n"; for (my $ii=1; $ii < scalar(@hashes); $ii++) { print F "if [ \$jobid -eq $ii ] ; then\n"; print F " blk=\"$hashes[$ii]\"\n"; print F " slf=\"$skipSelf[$ii]\"\n"; print F " qry=\"", substr("000000" . 
$ii, -6), "\"\n";
        print F "fi\n";
        print F "\n";
    }

    print F "\n";
    print F "if [ x\$qry = x ]; then\n";
    print F " echo Error: Job index out of range.\n";
    print F " exit 1\n";
    print F "fi\n";
    print F "\n";
    print F "if [ -e ./results/\$qry.ovb ]; then\n";
    print F " echo Job previously completed successfully.\n";
    print F " exit\n";
    print F "fi\n";
    print F "\n";
    print F fetchFileShellCode("$path", "queries.tar", "");
    print F "\n";
    print F "if [ -e ./queries.tar -a ! -d ./queries ] ; then\n";
    print F " tar -xf ./queries.tar\n";
    print F "fi\n";
    print F "\n";
    print F "if [ ! -d ./results ]; then\n";
    print F " mkdir -p ./results\n";
    print F "fi\n";
    print F "\n";
    print F fetchFileShellCode("$path", "blocks/\$blk.mmi", "");
    print F "for ii in `ls ./queries/\$qry` ; do\n";
    print F " echo Fetch blocks/\$ii\n";
    print F fetchFileShellCode("$path", "blocks/\$ii", " ");
    print F "done\n";
    print F "\n";

    # Begin the comparison: loop over the query blocks and compare the hash block against each.
    # If hash-to-hash overlaps are needed, compare the block against itself first; otherwise,
    # start with an empty output file.

    print F "if [ x\$slf = x ]; then\n";
    print F " > ./results/\$qry.mmap.WORKING\n";
    print F "else\n";
    print F " \$bin/minimap2 \\\n";
    print F " $parameters -t ", getGlobal("${tag}mmapThreads"), " \\\n";
    print F " ./blocks/\$blk.mmi \\\n";
    print F " ./blocks/\$blk.fasta \\\n";
    print F " > ./results/\$qry.mmap.WORKING \n";
    print F " \n";
    print F "fi\n";
    print F "\n";
    print F "for file in `ls queries/\$qry/*.fasta`; do\n";
    print F " \$bin/minimap2 \\\n";
    print F " $parameters -t ", getGlobal("${tag}mmapThreads"), " \\\n";
    print F " ./blocks/\$blk.mmi \\\n";
    print F " \$file \\\n";
    print F " >> ./results/\$qry.mmap.WORKING \n";
    print F "done\n";
    print F "\n";
    print F "mv ./results/\$qry.mmap.WORKING ./results/\$qry.mmap\n";
    print F "\n";
    print F "if [ -e ./results/\$qry.mmap -a \\\n";
    print F " !
-e ./results/\$qry.ovb ] ; then\n"; print F " \$bin/mmapConvert \\\n"; print F " -G ../$asm.gkpStore \\\n"; print F " -o ./results/\$qry.mmap.ovb.WORKING \\\n"; print F " -partial \\\n" if ($typ eq "partial"); print F " -tolerance 100 \\\n" if ($typ eq "normal"); print F " -len " , getGlobal("minOverlapLength"), " \\\n"; print F " ./results/\$qry.mmap \\\n"; print F " && \\\n"; print F " mv ./results/\$qry.mmap.ovb.WORKING ./results/\$qry.mmap.ovb\n"; print F "fi\n"; print F "\n"; if (getGlobal('saveOverlaps') eq "0") { print F "if [ -e ./results/\$qry.mmap -a \\\n"; print F " -e ./results/\$qry.mmap.ovb ] ; then\n"; print F " rm -f ./results/\$qry.mmap\n"; print F "fi\n"; print F "\n"; } print F "if [ -e ./results/\$qry.mmap.ovb ] ; then\n"; if (getGlobal("${tag}ReAlign") eq "1") { print F " \$bin/overlapPair \\\n"; print F " -G ../$asm.gkpStore \\\n"; print F " -O ./results/\$qry.mmap.ovb \\\n"; print F " -o ./results/\$qry.ovb \\\n"; print F " -partial \\\n" if ($typ eq "partial"); print F " -erate ", getGlobal("corOvlErrorRate"), " \\\n" if ($tag eq "cor"); # Explicitly using proper name for grepability. print F " -erate ", getGlobal("obtOvlErrorRate"), " \\\n" if ($tag eq "obt"); print F " -erate ", getGlobal("utgOvlErrorRate"), " \\\n" if ($tag eq "utg"); print F " -memory " . getGlobal("${tag}mmapMemory") . " \\\n"; print F " -t " . getGlobal("${tag}mmapThreads") . 
" \n"; } else { print F " mv -f ./results/\$qry.mmap.ovb ./results/\$qry.ovb\n"; print F " mv -f ./results/\$qry.mmap.counts ./results/\$qry.counts\n"; } print F "fi\n"; print F stashFileShellCode("$path", "results/\$qry.ovb", ""); print F stashFileShellCode("$path", "results/\$qry.counts", ""); print F "\n"; print F "\n"; print F "exit 0\n"; close(F); if (-e "$path/precompute.sh") { my $numJobs = 0; open(F, "< $path/precompute.sh") or caFailure("can't open '$path/precompute.sh' for reading: $!", undef); while () { $numJobs++ if (m/^\s+job=/); } close(F); print STDERR "-- Configured $numJobs mmap precompute jobs.\n"; } if (-e "$path/mmap.sh") { my $numJobs = 0; open(F, "< $path/mmap.sh") or caFailure("can't open '$path/mmap.sh' for reading: $!", undef); while () { $numJobs++ if (m/^\s+qry=/); } close(F); print STDERR "-- Configured $numJobs mmap overlap jobs.\n"; } makeExecutable("$path/precompute.sh"); makeExecutable("$path/mhap.sh"); stashFile("$path/precompute.sh"); stashFile("$path/mhap.sh"); finishStage: emitStage($asm, "$tag-mmapConfigure"); buildHTML($asm, $tag); allDone: stopAfter("overlapConfigure"); } sub mmapPrecomputeCheck ($$$) { my $asm = shift @_; my $tag = shift @_; my $typ = shift @_; my $attempt = getGlobal("canuIteration"); my $base; # e.g., $base/1-overlapper/mhap.sh my $path; # e.g., $path/mhap.sh $base = "correction" if ($tag eq "cor"); $base = "trimming" if ($tag eq "obt"); $base = "unitigging" if ($tag eq "utg"); $path = "$base/1-overlapper"; goto allDone if (skipStage($asm, "$tag-mmapPrecomputeCheck", $attempt) == 1); goto allDone if (fileExists("$path/precompute.files")); goto allDone if (-e "$base/$asm.ovlStore"); goto allDone if (fileExists("$base/$asm.ovlStore.tar")); fetchFile("$path/precompute.sh"); # Figure out if all the tasks finished correctly. 
my $currentJobID = 1;
    my @successJobs;
    my @failedJobs;
    my $failureMessage = "";

    open(F, "< $path/precompute.sh") or caFailure("can't open '$path/precompute.sh' for reading: $!", undef);
    while (<F>) {
        if (m/^\s+job=\"(\d+)\"$/) {
            if (fileExists("$path/blocks/$1.mmi")) {
                push @successJobs, "1-overlapper/blocks/$1.mmi\n";
            } else {
                $failureMessage .= "-- job $path/blocks/$1.fasta FAILED.\n";
                push @failedJobs, $currentJobID;
            }

            $currentJobID++;
        }
    }
    close(F);

    # Failed jobs, retry.

    if (scalar(@failedJobs) > 0) {

        # If too many attempts, give up.

        if ($attempt >= getGlobal("canuIterationMax")) {
            print STDERR "--\n";
            print STDERR "-- MiniMap precompute jobs failed, tried $attempt times, giving up.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
            caExit(undef, undef);
        }

        if ($attempt > 0) {
            print STDERR "--\n";
            print STDERR "-- MiniMap precompute jobs failed, retry.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
        }

        # Otherwise, run some jobs.

        emitStage($asm, "$tag-mmapPrecomputeCheck", $attempt);
        buildHTML($asm, $tag);

        submitOrRunParallelJob($asm, "${tag}mmap", $path, "precompute", @failedJobs);
        return;
    }

  finishStage:
    print STDERR "-- All ", scalar(@successJobs), " mmap precompute jobs finished successfully.\n";

    open(L, "> $path/precompute.files") or caExit("failed to open '$path/precompute.files'", undef);
    print L @successJobs;
    close(L);

    stashFile("$path/precompute.files");

    emitStage($asm, "$tag-mmapPrecomputeCheck");
    buildHTML($asm, $tag);

  allDone:
}

sub mmapCheck ($$$) {
    my $asm = shift @_;
    my $tag = shift @_;
    my $typ = shift @_;

    my $attempt = getGlobal("canuIteration");

    my $base;    # e.g., $base/1-overlapper/mmap.sh
    my $path;    # e.g., $path/mmap.sh

    $base = "correction"  if ($tag eq "cor");
    $base = "trimming"    if ($tag eq "obt");
    $base = "unitigging"  if ($tag eq "utg");

    $path = "$base/1-overlapper";

    goto allDone if (skipStage($asm, "$tag-mmapCheck", $attempt) == 1);
    goto allDone if (fileExists("$path/mmap.files"));
    goto allDone if (-e "$base/$asm.ovlStore");
    goto
allDone if (fileExists("$base/$asm.ovlStore.tar"));

    fetchFile("$path/mmap.sh");

    # Figure out if all the tasks finished correctly.

    my $currentJobID = 1;
    my @mmapJobs;
    my @successJobs;
    my @miscJobs;
    my @failedJobs;
    my $failureMessage = "";

    open(F, "< $path/mmap.sh") or caExit("failed to open '$path/mmap.sh'", undef);
    while (<F>) {
        if (m/^\s+qry=\"(\d+)\"$/) {
            if      (fileExists("$path/results/$1.ovb.gz")) {
                push @mmapJobs,    "1-overlapper/results/$1.mmap\n";
                push @successJobs, "1-overlapper/results/$1.ovb.gz\n";
                push @miscJobs,    "1-overlapper/results/$1.stats\n";
                push @miscJobs,    "1-overlapper/results/$1.counts\n";
            } elsif (fileExists("$path/results/$1.ovb")) {
                push @mmapJobs,    "1-overlapper/results/$1.mmap\n";
                push @successJobs, "1-overlapper/results/$1.ovb\n";
                push @miscJobs,    "1-overlapper/results/$1.stats\n";
                push @miscJobs,    "1-overlapper/results/$1.counts\n";
            } elsif (fileExists("$path/results/$1.ovb.bz2")) {
                push @mmapJobs,    "1-overlapper/results/$1.mmap\n";
                push @successJobs, "1-overlapper/results/$1.ovb.bz2\n";
                push @miscJobs,    "1-overlapper/results/$1.stats\n";
                push @miscJobs,    "1-overlapper/results/$1.counts\n";
            } elsif (fileExists("$path/results/$1.ovb.xz")) {
                push @mmapJobs,    "1-overlapper/results/$1.mmap\n";
                push @successJobs, "1-overlapper/results/$1.ovb.xz\n";
                push @miscJobs,    "1-overlapper/results/$1.stats\n";
                push @miscJobs,    "1-overlapper/results/$1.counts\n";
            } else {
                $failureMessage .= "-- job $path/results/$1.ovb FAILED.\n";
                push @failedJobs, $currentJobID;
            }

            $currentJobID++;
        }
    }
    close(F);

    # Also find the queries symlinks so we can remove those. And the query directories, because
    # the last directory can be empty, and so we'd never see it at all if only finding files.

    open(F, "cd $base && find 1-overlapper/queries -print |");
    while (<F>) {
        push @mmapJobs, $_;
    }
    close(F);

    # Failed jobs, retry.

    if (scalar(@failedJobs) > 0) {

        # If too many attempts, give up.
if ($attempt >= getGlobal("canuIterationMax")) { print STDERR "--\n"; print STDERR "-- MiniMap overlap jobs failed, tried $attempt times, giving up.\n"; print STDERR $failureMessage; print STDERR "--\n"; caExit(undef, undef); } if ($attempt > 0) { print STDERR "--\n"; print STDERR "-- MiniMap overlap jobs failed, retry.\n"; print STDERR $failureMessage; print STDERR "--\n"; } # Otherwise, run some jobs. emitStage($asm, "$tag-mmapCheck", $attempt); buildHTML($asm, $tag); submitOrRunParallelJob($asm, "${tag}mmap", $path, "mmap", @failedJobs); return; } finishStage: print STDERR "-- Found ", scalar(@successJobs), " mmap overlap output files.\n"; open(L, "> $path/mmap.files") or caExit("failed to open '$path/mmap.files'", undef); print L @mmapJobs; close(L); open(L, "> $path/ovljob.files") or caExit("failed to open '$path/ovljob.files'", undef); print L @successJobs; close(L); open(L, "> $path/ovljob.more.files") or caExit("failed to open '$path/ovljob.more.files'", undef); print L @miscJobs; close(L); stashFile("$path/mmap.files"); stashFile("$path/ovljob.files"); stashFile("$path/ovljob.more.files"); emitStage($asm, "$tag-mmapCheck"); buildHTML($asm, $tag); allDone: stopAfter("overlap"); } canu-1.6/src/pipelines/canu/OverlapMhap.pm000066400000000000000000000734331314437614700206130ustar00rootroot00000000000000 ############################################################################### # # This file is part of canu, a software program that assembles whole-genome # sequencing reads into contigs. # # This software is based on: # 'Celera Assembler' (http://wgs-assembler.sourceforge.net) # the 'kmer package' (http://kmer.sourceforge.net) # both originally distributed by Applera Corporation under the GNU General # Public License, version 2. # # Canu branched from Celera Assembler at its revision 4587. # Canu branched from the kmer project at its revision 1994. # # This file is derived from: # # src/pipelines/ca3g/OverlapMhap.pm # # Modifications by: # # Brian P. 
Walenz from 2015-MAR-27 to 2015-SEP-21 # are Copyright 2015 Battelle National Biodefense Institute, and # are subject to the BSD 3-Clause License # # Brian P. Walenz beginning on 2015-NOV-03 # are a 'United States Government Work', and # are released in the public domain # # Sergey Koren beginning on 2015-NOV-20 # are a 'United States Government Work', and # are released in the public domain # # File 'README.licenses' in the root directory of this distribution contains # full conditions and disclaimers for each license. ## package canu::OverlapMhap; require Exporter; @ISA = qw(Exporter); @EXPORT = qw(mhapConfigure mhapPrecomputeCheck mhapCheck); use strict; use POSIX qw(floor); use File::Path 2.08 qw(make_path remove_tree); use canu::Defaults; use canu::Execution; use canu::Gatekeeper; use canu::HTML; use canu::Grid_Cloud; # Map long reads to long reads with mhap. # Problems: # - mhap .dat output can't be verified. if the job is killed, a partial .dat is output. It needs to write to # .dat.WORKING, then let the script rename it when the job finishes successfully. There is no output file name option. 
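#  The write-to-temporary-then-rename idiom referred to above, as this
#  pipeline uses for its own result files (a sketch with illustrative names;
#  the point of the note above is that mhap itself cannot do this, since it
#  has no output file name option):
#
#      overlapper ... > out.dat.WORKING && mv out.dat.WORKING out.dat
#
#  The final name appears only after the command succeeds, so a partial
#  file left behind by a killed job can never be mistaken for a completed
#  result.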
sub mhapConfigure ($$$) { my $asm = shift @_; my $tag = shift @_; my $typ = shift @_; my $bin = getBinDirectory(); my $base; # e.g., $base/1-overlapper/mhap.sh my $path; # e.g., $path/mhap.sh $base = "correction" if ($tag eq "cor"); $base = "trimming" if ($tag eq "obt"); $base = "unitigging" if ($tag eq "utg"); $path = "$base/1-overlapper"; caFailure("invalid type '$typ'", undef) if (($typ ne "partial") && ($typ ne "normal")); goto allDone if (skipStage($asm, "$tag-mhapConfigure") == 1); goto allDone if (fileExists("$path/precompute.sh")) && (fileExists("$path/mhap.sh")); goto allDone if (fileExists("$path/ovljob.files")); goto allDone if (-e "$base/$asm.ovlStore"); goto allDone if (fileExists("$base/$asm.ovlStore.tar")); print STDERR "--\n"; print STDERR "-- OVERLAPPER (mhap) (correction)\n" if ($tag eq "cor"); print STDERR "-- OVERLAPPER (mhap) (trimming)\n" if ($tag eq "obt"); print STDERR "-- OVERLAPPER (mhap) (assembly)\n" if ($tag eq "utg"); print STDERR "--\n"; make_path($path) if (! -d $path); # Mhap parameters my ($numHashes, $minNumMatches, $threshold, $ordSketch, $ordSketchMer); if (!defined(getGlobal("${tag}MhapSensitivity"))) { my $cov = getExpectedCoverage($base, $asm); setGlobal("${tag}MhapSensitivity", "low"); # Yup, super inefficient. The code is setGlobal("${tag}MhapSensitivity", "normal") if ($cov < 60); # compact and clear and runs once. setGlobal("${tag}MhapSensitivity", "high") if ($cov <= 30); # Live with it. 
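    #  Worked example of the defaults just above (editorial note, derived
    #  from the three setGlobal calls): the assignments run in order, so the
    #  last one whose condition holds wins.
    #
    #      coverage 25  ->  low, then normal (25 < 60), then high (25 <= 30)  ->  "high"
    #      coverage 45  ->  low, then normal (45 < 60)                        ->  "normal"
    #      coverage 80  ->  low                                               ->  "low"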
print STDERR "-- Set ${tag}MhapSensitivity=", getGlobal("${tag}MhapSensitivity"), " based on read coverage of $cov.\n";
    }

    if      (getGlobal("${tag}MhapSensitivity") eq "low") {
        $numHashes     = 256;
        $minNumMatches = 3;
        $threshold     = 0.80;
        $ordSketch     = 1000;
        $ordSketchMer  = getGlobal("${tag}MhapOrderedMerSize") + 2;
    } elsif (getGlobal("${tag}MhapSensitivity") eq "normal") {
        $numHashes     = 512;
        $minNumMatches = 3;
        $threshold     = 0.78;
        $ordSketch     = 1536;
        $ordSketchMer  = getGlobal("${tag}MhapOrderedMerSize");
    } elsif (getGlobal("${tag}MhapSensitivity") eq "high") {
        $numHashes     = 768;
        $minNumMatches = 2;
        $threshold     = 0.73;
        $ordSketch     = 1536;
        $ordSketchMer  = getGlobal("${tag}MhapOrderedMerSize");
    } else {
        caFailure("invalid ${tag}MhapSensitivity=" . getGlobal("${tag}MhapSensitivity"), undef);
    }

    # due to systematic bias in nanopore data, adjust threshold up by 5%

    my $numNanoporeRaw = 0;

    open(L, "< $base/$asm.gkpStore/libraries.txt") or caExit("can't open '$base/$asm.gkpStore/libraries.txt' for reading: $!", undef);
    while (<L>) {
        $numNanoporeRaw++ if (m/nanopore-raw/);
    }
    close(L);

    $threshold += 0.05 if ($numNanoporeRaw > 0);

    my $filterThreshold = getGlobal("${tag}MhapFilterThreshold");

    # Constants.
my $merSize = getGlobal("${tag}MhapMerSize"); my $numReads = getNumberOfReadsInStore($base, $asm); my $memorySize = getGlobal("${tag}mhapMemory"); my $blockPerGb = getGlobal("${tag}MhapBlockSize"); if ($numHashes >= 768) { $blockPerGb = int($blockPerGb / 2); } # quick guess parameter adjustment for corrected reads, hack for now and should better take error rate into account if (($tag eq "obt") || ($tag eq "utg")) { $numHashes = "128"; $minNumMatches = 5; $ordSketch = 1000; $threshold = 1-getGlobal("${tag}OvlErrorRate"); } print STDERR "--\n"; print STDERR "-- PARAMETERS: hashes=$numHashes, minMatches=$minNumMatches, threshold=$threshold\n"; print STDERR "--\n"; my $blockSize = int($blockPerGb * $memorySize); print STDERR "-- Given $memorySize GB, can fit $blockSize reads per block.\n"; # Divide the reads into blocks of ovlHashBlockSize. Each one of these blocks is used as the # table in mhap. Several of these blocks are used as the queries. my @blocks; # Range of reads to extract for this block my @blockBgn; # First read in the block my @blockLen; # Number of reads in the block my @hashes; # One for each job, the block that is the hash table my @skipSelf; # One for each job, jobs that would search block N against hash block N need to be handled special my @convert; # One for each job, flags to the mhap-ovl conversion program push @blocks, "no zeroth block, makes the loop where this is used easier"; push @blockBgn, "no zeroth block"; push @blockLen, "no zeroth block"; push @hashes, "no zeroth job"; push @skipSelf, "no zeroth job"; push @convert, "no zeroth job"; for (my $bgn=1; $bgn < $numReads; $bgn += $blockSize) { my $end = $bgn + $blockSize - 1; $end = $numReads if ($end > $numReads); #print STDERR "BLOCK ", scalar(@blocks), " reads from $bgn through $end\n"; push @blocks, "-b $bgn -e $end"; push @blockBgn, $bgn; push @blockLen, $end - $bgn + 1; } # Each mhap job will process one block against a set of other blocks. 
We'll pick, arbitrarily,
    # to use num_blocks/4 for that size, unless it is too small.

    my $numBlocks = scalar(@blocks);
    my $qryStride = ($numBlocks < 16) ? (2) : int($numBlocks / 4);

    print STDERR "-- For $numBlocks blocks, set stride to $qryStride blocks.\n";
    print STDERR "-- Logging partitioning to '$path/partitioning.log'.\n";

    open(L, "> $path/partitioning.log") or caExit("can't open '$path/partitioning.log' for writing: $!\n", undef);

    # Make queries. Each hash block needs to search against all blocks less than or equal to it.
    # Each job will search at most $qryStride blocks at once. So, queries could be:
    #   1: 1 vs 1,2,3 (with self-allowed, and query block 1 implicitly included)
    #   2: 1 vs 4,5 (with no-self allowed)
    #   3: 2 vs 2,3,4
    #   4: 2 vs 5
    #   5: 3 vs 3,4,5
    #   6: 4 vs 4,5
    #   7: 5 vs 5

    make_path("$path/queries");

    for (my $bid=1; $bid < $numBlocks; $bid++) {

        # Note that we never do qbgn = bid; the self-self overlap is special cased.

        for (my $qbgn = $bid; $qbgn < $numBlocks; $qbgn += $qryStride) {

            # mhap's read labeling is dumb. It dumps all reads into one array, labeling the hash
            # reads 0 to N, and the query reads N+1 to N+1+M. This makes it impossible to tell if
            # we are comparing a read against itself (e.g., when comparing hash block 1 against
            # query blocks 1,2,3,4 -- compared to hash block 1 against query blocks 5,6,7,8).
            #
            # So, there is a giant special case to compare the hash block against itself enabled by
            # default. This needs to be disabled for the second example above.
            #
            # This makes for ANOTHER special case, that of the last block. There are no query
            # blocks, just the hash block, and we need to omit the flag...do we?

            my $andSelf = "";

            if ($bid == $qbgn) {
                # The hash block bid is in the range and we need to compute hash-to-hash
                # overlaps, and exclude the block from the queries.
                push @skipSelf, "";
                $qbgn++;
                $andSelf = " (and self)";
            } else {
                # The hash block bid isn't in the range of query blocks, don't allow hash-to-hash
                # overlaps.
                push @skipSelf, "--no-self";
            }

            my $qend = $qbgn + $qryStride - 1;                #  Block bid searches reads in dat files from
            $qend = $numBlocks-1  if ($qend >= $numBlocks);   #  qbgn to qend (inclusive).

            my $job = substr("000000" . scalar(@hashes), -6);   #  Unique ID for this compute

            #  Make a place to save queries.  If this is the last-block-special-case, make a directory,
            #  but don't link in any files.  Without the directory, we'd need even more special case
            #  code down in mhap.sh to exclude the -q option for this last block.

            make_path("$path/queries/$job");

            if ($qbgn < $numBlocks) {
                print L "Job ", scalar(@hashes), " computes block $bid vs blocks $qbgn-$qend$andSelf,\n";

                for (my $qid=$qbgn; $qid <= $qend; $qid++) {
                    my $qry = substr("000000" . $qid, -6);   #  Name for the query block

                    symlink("../../blocks/$qry.dat", "$path/queries/$job/$qry.dat");
                }

            } else {
                print L "Job ", scalar(@hashes), " computes block $bid vs itself.\n";

                $qbgn = $bid;   #  Otherwise, the @convert -q value is bogus
            }

            #  This is easy, the ID of the hash.

            push @hashes, substr("000000" . $bid, -6);   #  One new job for block bid with qend-qbgn query files in it

            #  Annoyingly, if we're against 'self', then the conversion needs to know that the query IDs
            #  aren't offset by the number of hash reads.

            if ($andSelf eq "") {
                push @convert, "-h $blockBgn[$bid] $blockLen[$bid] -q $blockBgn[$qbgn]";
            } else {
                push @convert, "-h $blockBgn[$bid] 0 -q $blockBgn[$bid]";
            }
        }
    }

    close(L);

    #  Tar up the queries directory.  Only useful for cloud support.

    runCommandSilently($path, "tar -cf queries.tar queries", 1);
    stashFile("$path/queries.tar");

    #  Create a script to generate precomputed blocks, including extracting the reads from gkpStore.

    #OPTIMIZE
    #OPTIMIZE  Probably a big optimization for cloud assemblies, the block fasta inputs can be
    #OPTIMIZE  computed ahead of time, stashed, and then fetched to do the actual precompute.
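#  The query schedule in the comments above ("1 vs 1,2,3", "1 vs 4,5", ...) can be
#  reproduced with a small standalone sketch.  `enumerateJobs` is a hypothetical
#  helper, not part of canu; it copies only the loop structure above (including the
#  self-block bump and the last-block "vs itself" case), emitting strings instead
#  of building the @hashes/@skipSelf/@convert arrays.

```perl
use strict;
use warnings;

#  Enumerate mhap jobs: hash block $bid vs query blocks $bid..$numBlocks-1,
#  at most $qryStride query blocks per job, with the hash-vs-hash comparison
#  folded into the first job for each hash block.
sub enumerateJobs {
    my ($numBlocks, $qryStride) = @_;
    my @jobs;

    for (my $bid=1; $bid < $numBlocks; $bid++) {
        for (my $qbgn = $bid; $qbgn < $numBlocks; $qbgn += $qryStride) {
            my $self = 0;

            if ($bid == $qbgn) {   #  Fold the self comparison into this job.
                $self = 1;
                $qbgn++;
            }

            my $qend = $qbgn + $qryStride - 1;
            $qend = $numBlocks - 1  if ($qend >= $numBlocks);

            if ($qbgn < $numBlocks) {
                push @jobs, sprintf("%d vs %d-%d%s", $bid, $qbgn, $qend, ($self) ? " (and self)" : "");
            } else {
                push @jobs, sprintf("%d vs itself", $bid);   #  Last-block special case.
            }
        }
    }

    return(@jobs);
}

#  Five usable blocks (indices 1..5, index 0 unused) with stride 2 gives the
#  seven jobs listed in the comments.
print "$_\n"  for (enumerateJobs(6, 2));
```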
    #OPTIMIZE

    my $javaPath   = getGlobal("java");
    my $javaMemory = int(getGlobal("${tag}mhapMemory") * 1024 + 0.5);

    my $cygA = ($^O ne "cygwin") ? "" : "\$(cygpath -w ";   #  Some baloney to convert nice POSIX paths into
    my $cygB = ($^O ne "cygwin") ? "" : ")";                #  whatever abomination Windows expects.

    open(F, "> $path/precompute.sh") or caFailure("can't open '$path/precompute.sh' for writing: $!", undef);
    print F "#!" . getGlobal("shell") . "\n";
    print F "\n";
    print F getBinDirectoryShellCode();
    print F "\n";
    print F setWorkDirectoryShellCode($path);
    print F fetchStoreShellCode("$base/$asm.gkpStore", "$base/1-overlapper", "");
    print F "\n";
    print F getJobIDShellCode();
    print F "\n";

    for (my $ii=1; $ii < scalar(@blocks); $ii++) {
        print F "if [ \$jobid -eq $ii ] ; then\n";
        print F " rge=\"$blocks[$ii]\"\n";
        print F " job=\"", substr("000000" . $ii, -6), "\"\n";
        print F "fi\n";
        print F "\n";
    }

    print F "\n";
    print F "if [ x\$job = x ] ; then\n";
    print F " echo Job partitioning error. jobid \$jobid is invalid.\n";
    print F " exit 1\n";
    print F "fi\n";
    print F "\n";
    print F "if [ ! -d ./blocks ]; then\n";
    print F " mkdir -p ./blocks\n";
    print F "fi\n";
    print F "\n";
    print F fileExistsShellCode("./blocks/\$job.dat");
    print F " echo Job previously completed successfully.\n";
    print F " exit\n";
    print F "fi\n";
    print F "\n";
    print F "# Remove any previous result.\n";
    print F "rm -f ./blocks/\$job.input.dat\n";
    print F "\n";
    #print F fetchFileShellCode("./blocks/\$job.input.fasta");
    print F "\n";
    print F "\$bin/gatekeeperDumpFASTQ \\\n";
    print F " -G ../$asm.gkpStore \\\n";
    print F " \$rge \\\n";
    print F " -nolibname \\\n";
    print F " -fasta \\\n";
    print F " -o ./blocks/\$job.input \\\n";
    print F "|| \\\n";
    print F "mv -f ./blocks/\$job.input.fasta ./blocks/\$job.input.fasta.FAILED\n";
    print F "\n";
    print F "\n";
    print F "if [ ! -e ./blocks/\$job.input.fasta ] ; then\n";
    print F " echo Failed to extract fasta.\n";
    print F " exit 1\n";
    print F "fi\n";
    print F "\n";
    print F fetchFileShellCode("$base/0-mercounts", "$asm.ms$merSize.frequentMers.ignore.gz", "");
    print F "\n";
    print F "echo \"\"\n";
    print F "echo Starting mhap precompute.\n";
    print F "echo \"\"\n";
    print F "\n";
    print F "# So mhap writes its output in the correct spot.\n";
    print F "cd ./blocks\n";
    print F "\n";
    print F "$javaPath -d64 -server -Xmx", $javaMemory, "m \\\n";
    print F " -jar $cygA \$bin/mhap-" . getGlobal("${tag}MhapVersion") . ".jar $cygB \\\n";
    print F " --repeat-weight 0.9 --repeat-idf-scale 10 -k $merSize \\\n";
    print F " --supress-noise 2 \\\n"  if (defined(getGlobal("${tag}MhapFilterUnique")) && getGlobal("${tag}MhapFilterUnique") == 1);
    print F " --no-tf \\\n"            if (defined(getGlobal("${tag}MhapNoTf")) && getGlobal("${tag}MhapNoTf") == 1);
    print F " --num-hashes $numHashes \\\n";
    print F " --num-min-matches $minNumMatches \\\n";
    print F " --ordered-sketch-size $ordSketch \\\n";
    print F " --ordered-kmer-size $ordSketchMer \\\n";
    print F " --threshold $threshold \\\n";
    print F " --filter-threshold $filterThreshold \\\n";
    print F " --min-olap-length ", getGlobal("minOverlapLength"), " \\\n";
    print F " --num-threads ", getGlobal("${tag}mhapThreads"), " \\\n";
    print F " " . getGlobal("${tag}MhapOptions") . " \\\n"  if (defined(getGlobal("${tag}MhapOptions")));
    print F " -f $cygA ../../0-mercounts/$asm.ms$merSize.frequentMers.ignore.gz $cygB \\\n"  if (-e "$base/0-mercounts/$asm.ms$merSize.frequentMers.ignore.gz");
    print F " -p $cygA ./\$job.input.fasta $cygB \\\n";
    print F " -q $cygA . $cygB \\\n";
    print F "&& \\\n";
    print F "mv -f ./\$job.input.dat ./\$job.dat\n";
    print F "\n";
    print F "if [ ! -e ./\$job.dat ] ; then\n";
    print F " echo Mhap failed.\n";
    print F " exit 1\n";
    print F "fi\n";
    print F "\n";
    print F "# Clean up, remove the fasta input\n";
    print F "rm -f ./blocks/\$job.input.fasta\n";
    print F "\n";
    print F stashFileShellCode("$base/1-overlapper/blocks", "\$job.dat", "");
    print F "\n";
    print F "exit 0\n";
    close(F);

    #  Create a script to run mhap.

    open(F, "> $path/mhap.sh") or caFailure("can't open '$path/mhap.sh' for writing: $!", undef);
    print F "#!" . getGlobal("shell") . "\n";
    print F "\n";
    print F getBinDirectoryShellCode();
    print F "\n";
    print F setWorkDirectoryShellCode($path);
    print F fetchStoreShellCode("$base/$asm.gkpStore", "$base/1-overlapper", "");
    print F "\n";
    print F getJobIDShellCode();
    print F "\n";

    for (my $ii=1; $ii < scalar(@hashes); $ii++) {
        print F "if [ \$jobid -eq $ii ] ; then\n";
        print F " blk=\"$hashes[$ii]\"\n";
        print F " slf=\"$skipSelf[$ii]\"\n";
        print F " cvt=\"$convert[$ii]\"\n";
        print F " qry=\"", substr("000000" . $ii, -6), "\"\n";
        print F "fi\n";
        print F "\n";
    }

    print F "\n";
    print F "if [ x\$qry = x ]; then\n";
    print F " echo Error: Job index out of range.\n";
    print F " exit 1\n";
    print F "fi\n";
    print F "\n";
    print F "if [ -e ./results/\$qry.ovb ]; then\n";
    print F " echo Job previously completed successfully.\n";
    print F " exit\n";
    print F "fi\n";
    print F "\n";
    print F fetchFileShellCode("$path", "queries.tar", "");
    print F "\n";
    print F "if [ -e ./queries.tar -a ! -d ./queries ] ; then\n";
    print F " tar -xf ./queries.tar\n";
    print F "fi\n";
    print F "\n";
    print F "if [ ! -d ./results ]; then\n";
    print F " mkdir -p ./results\n";
    print F "fi\n";
    print F "\n";
    print F "if [ ! -d ./blocks ] ; then\n";
    print F " mkdir -p ./blocks\n";
    print F "fi\n";
    print F fetchFileShellCode("$path", "blocks/\$blk.dat", "");
    print F "for ii in `ls ./queries/\$qry` ; do\n";
    print F " echo Fetch blocks/\$ii\n";
    print F fetchFileShellCode("$path", "blocks/\$ii", " ");
    print F "done\n";
    print F "\n";
    print F fetchFileShellCode("$base/0-mercounts", "$asm.ms$merSize.frequentMers.ignore.gz", "");
    print F "\n";
    print F "echo \"\"\n";
    print F "echo Running block \$blk in query \$qry\n";
    print F "echo \"\"\n";
    print F "\n";
    print F "if [ ! -e ./results/\$qry.mhap ] ; then\n";
    print F " $javaPath -d64 -server -Xmx", $javaMemory, "m \\\n";
    print F " -jar $cygA \$bin/mhap-" . getGlobal("${tag}MhapVersion") . ".jar $cygB \\\n";
    print F " --repeat-weight 0.9 --repeat-idf-scale 10 -k $merSize \\\n";
    print F " --supress-noise 2 \\\n"  if (defined(getGlobal("${tag}MhapFilterUnique")) && getGlobal("${tag}MhapFilterUnique") == 1);
    print F " --no-tf \\\n"            if (defined(getGlobal("${tag}MhapNoTf")) && getGlobal("${tag}MhapNoTf") == 1);
    print F " --num-hashes $numHashes \\\n";
    print F " --num-min-matches $minNumMatches \\\n";
    print F " --threshold $threshold \\\n";
    print F " --filter-threshold $filterThreshold \\\n";
    print F " --ordered-sketch-size $ordSketch \\\n";
    print F " --ordered-kmer-size $ordSketchMer \\\n";
    print F " --min-olap-length ", getGlobal("minOverlapLength"), " \\\n";
    print F " --num-threads ", getGlobal("${tag}mhapThreads"), " \\\n";
    print F " " . getGlobal("${tag}MhapOptions") . " \\\n"  if (defined(getGlobal("${tag}MhapOptions")));
    print F " -s $cygA ./blocks/\$blk.dat \$slf $cygB \\\n";
    print F " -q $cygA queries/\$qry $cygB \\\n";
    print F " > ./results/\$qry.mhap.WORKING \\\n";
    print F " && \\\n";
    print F " mv -f ./results/\$qry.mhap.WORKING ./results/\$qry.mhap\n";
    print F "fi\n";
    print F "\n";
    print F "if [ -e ./results/\$qry.mhap -a \\\n";
    print F " ! -e ./results/\$qry.ovb ] ; then\n";
    print F " \$bin/mhapConvert \\\n";
    print F " -G ../$asm.gkpStore \\\n";
    print F " \$cvt \\\n";
    print F " -o ./results/\$qry.mhap.ovb.WORKING \\\n";
    print F " ./results/\$qry.mhap \\\n";
    print F " && \\\n";
    print F " mv ./results/\$qry.mhap.ovb.WORKING ./results/\$qry.mhap.ovb\n";
    print F "fi\n";
    print F "\n";

    if (getGlobal('saveOverlaps') eq "0") {
        print F "if [ -e ./results/\$qry.mhap -a \\\n";
        print F " -e ./results/\$qry.mhap.ovb ] ; then\n";
        print F " rm -f ./results/\$qry.mhap\n";
        print F "fi\n";
        print F "\n";
    }

    if (getGlobal("${tag}ReAlign") eq "1") {
        print F "if [ -e ./results/\$qry.mhap.ovb ] ; then\n";
        print F " \$bin/overlapPair \\\n";
        print F " -G ../$asm.gkpStore \\\n";
        print F " -O ./results/\$qry.mhap.ovb \\\n";
        print F " -o ./results/\$qry.WORKING.ovb \\\n";
        print F " -partial \\\n"  if ($typ eq "partial");
        print F " -len " , getGlobal("minOverlapLength"), " \\\n";
        print F " -erate ", getGlobal("corOvlErrorRate"), " \\\n"  if ($tag eq "cor");   #  Explicitly using proper name for grepability.
        print F " -erate ", getGlobal("obtOvlErrorRate"), " \\\n"  if ($tag eq "obt");
        print F " -erate ", getGlobal("utgOvlErrorRate"), " \\\n"  if ($tag eq "utg");
        print F " -memory " . getGlobal("${tag}mhapMemory") . " \\\n";
        print F " -t " . getGlobal("${tag}mhapThreads") . " \\\n";
        print F " && \\\n";
        print F " mv -f ./results/\$qry.WORKING.ovb ./results/\$qry.ovb\n";
        print F "fi\n";
    } else {
        print F "if [ -e ./results/\$qry.mhap.ovb ] ; then\n";
        print F " mv -f ./results/\$qry.mhap.ovb ./results/\$qry.ovb\n";
        print F "fi\n";
    }

    print F "\n";
    print F stashFileShellCode("$path", "results/\$qry.ovb", "");
    print F stashFileShellCode("$path", "results/\$qry.counts", "");
    print F "\n";
    print F "exit 0\n";
    close(F);

    if (-e "$path/precompute.sh") {
        my $numJobs = 0;
        open(F, "< $path/precompute.sh") or caFailure("can't open '$path/precompute.sh' for reading: $!", undef);
        while (<F>) {
            $numJobs++ if (m/^\s+job=/);
        }
        close(F);

        print STDERR "-- Configured $numJobs mhap precompute jobs.\n";
    }

    if (-e "$path/mhap.sh") {
        my $numJobs = 0;
        open(F, "< $path/mhap.sh") or caFailure("can't open '$path/mhap.sh' for reading: $!", undef);
        while (<F>) {
            $numJobs++ if (m/^\s+qry=/);
        }
        close(F);

        print STDERR "-- Configured $numJobs mhap overlap jobs.\n";
    }

    makeExecutable("$path/precompute.sh");
    makeExecutable("$path/mhap.sh");

    stashFile("$path/precompute.sh");
    stashFile("$path/mhap.sh");

  finishStage:
    emitStage($asm, "$tag-mhapConfigure");
    buildHTML($asm, $tag);

  allDone:
    stopAfter("overlapConfigure");
}

sub mhapPrecomputeCheck ($$$) {
    my $asm     = shift @_;
    my $tag     = shift @_;
    my $typ     = shift @_;
    my $attempt = getGlobal("canuIteration");

    my $base;   #  e.g., $base/1-overlapper/mhap.sh
    my $path;   #  e.g., $path/mhap.sh

    $base = "correction"  if ($tag eq "cor");
    $base = "trimming"    if ($tag eq "obt");
    $base = "unitigging"  if ($tag eq "utg");

    $path = "$base/1-overlapper";

    goto allDone if (skipStage($asm, "$tag-mhapPrecomputeCheck", $attempt) == 1);
    goto allDone if (fileExists("$path/precompute.files"));
    goto allDone if (-e "$base/$asm.ovlStore");
    goto allDone if (fileExists("$base/$asm.ovlStore.tar"));

    fetchFile("$path/precompute.sh");

    #  Figure out if all the tasks finished correctly.
    my $currentJobID   = 1;
    my @successJobs;
    my @failedJobs;
    my $failureMessage = "";

    open(F, "< $path/precompute.sh") or caFailure("can't open '$path/precompute.sh' for reading: $!", undef);
    while (<F>) {
        if (m/^\s+job=\"(\d+)\"$/) {
            if (fileExists("$path/blocks/$1.dat")) {
                push @successJobs, "1-overlapper/blocks/$1.dat\n";
            } else {
                $failureMessage .= "-- job $path/blocks/$1.dat FAILED.\n";
                push @failedJobs, $currentJobID;
            }

            $currentJobID++;
        }
    }
    close(F);

    #  Failed jobs, retry.

    if (scalar(@failedJobs) > 0) {

        #  If too many attempts, give up.

        if ($attempt >= getGlobal("canuIterationMax")) {
            print STDERR "--\n";
            print STDERR "-- Mhap precompute jobs failed, tried $attempt times, giving up.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
            caExit(undef, undef);
        }

        if ($attempt > 0) {
            print STDERR "--\n";
            print STDERR "-- Mhap precompute jobs failed, retry.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
        }

        #  Otherwise, run some jobs.

        emitStage($asm, "$tag-mhapPrecomputeCheck", $attempt);
        buildHTML($asm, $tag);

        submitOrRunParallelJob($asm, "${tag}mhap", $path, "precompute", @failedJobs);
        return;
    }

  finishStage:
    print STDERR "-- All ", scalar(@successJobs), " mhap precompute jobs finished successfully.\n";

    open(L, "> $path/precompute.files") or caExit("failed to open '$path/precompute.files'", undef);
    print L @successJobs;
    close(L);

    stashFile("$path/precompute.files");

    emitStage($asm, "$tag-mhapPrecomputeCheck");
    buildHTML($asm, $tag);

  allDone:
}

sub mhapCheck ($$$) {
    my $asm     = shift @_;
    my $tag     = shift @_;
    my $typ     = shift @_;
    my $attempt = getGlobal("canuIteration");

    my $base;   #  e.g., $base/1-overlapper/mhap.sh
    my $path;   #  e.g., $path/mhap.sh

    $base = "correction"  if ($tag eq "cor");
    $base = "trimming"    if ($tag eq "obt");
    $base = "unitigging"  if ($tag eq "utg");

    $path = "$base/1-overlapper";

    goto allDone if (skipStage($asm, "$tag-mhapCheck", $attempt) == 1);
    goto allDone if (fileExists("$path/mhap.files"));
    goto allDone if (-e "$base/$asm.ovlStore");
    goto allDone if
        (fileExists("$base/$asm.ovlStore.tar"));

    fetchFile("$path/mhap.sh");

    #  Figure out if all the tasks finished correctly.

    my $currentJobID   = 1;
    my @mhapJobs;
    my @successJobs;
    my @miscJobs;
    my @failedJobs;
    my $failureMessage = "";

    open(F, "< $path/mhap.sh") or caExit("failed to open '$path/mhap.sh'", undef);
    while (<F>) {
        if (m/^\s+qry=\"(\d+)\"$/) {
            if      (fileExists("$path/results/$1.ovb.gz")) {
                push @mhapJobs,    "1-overlapper/results/$1.mhap\n";
                push @successJobs, "1-overlapper/results/$1.ovb.gz\n";
                push @miscJobs,    "1-overlapper/results/$1.stats\n";
                push @miscJobs,    "1-overlapper/results/$1.counts\n";

            } elsif (fileExists("$path/results/$1.ovb")) {
                push @mhapJobs,    "1-overlapper/results/$1.mhap\n";
                push @successJobs, "1-overlapper/results/$1.ovb\n";
                push @miscJobs,    "1-overlapper/results/$1.stats\n";
                push @miscJobs,    "1-overlapper/results/$1.counts\n";

            } elsif (fileExists("$path/results/$1.ovb.bz2")) {
                push @mhapJobs,    "1-overlapper/results/$1.mhap\n";
                push @successJobs, "1-overlapper/results/$1.ovb.bz2\n";
                push @miscJobs,    "1-overlapper/results/$1.stats\n";
                push @miscJobs,    "1-overlapper/results/$1.counts\n";

            } elsif (fileExists("$path/results/$1.ovb.xz")) {
                push @mhapJobs,    "1-overlapper/results/$1.mhap\n";
                push @successJobs, "1-overlapper/results/$1.ovb.xz\n";
                push @miscJobs,    "1-overlapper/results/$1.stats\n";
                push @miscJobs,    "1-overlapper/results/$1.counts\n";

            } else {
                $failureMessage .= "-- job $path/results/$1.ovb FAILED.\n";
                push @failedJobs, $currentJobID;
            }

            $currentJobID++;
        }
    }
    close(F);

    #  Also find the queries symlinks so we can remove those.  And the query directories, because
    #  the last directory can be empty, and so we'd never see it at all if only finding files.

    open(F, "cd $base && find 1-overlapper/queries -print |");
    while (<F>) {
        push @mhapJobs, $_;
    }
    close(F);

    #  Failed jobs, retry.

    if (scalar(@failedJobs) > 0) {

        #  If too many attempts, give up.
        if ($attempt >= getGlobal("canuIterationMax")) {
            print STDERR "--\n";
            print STDERR "-- Mhap overlap jobs failed, tried $attempt times, giving up.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
            caExit(undef, undef);
        }

        if ($attempt > 0) {
            print STDERR "--\n";
            print STDERR "-- Mhap overlap jobs failed, retry.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
        }

        #  Otherwise, run some jobs.

        emitStage($asm, "$tag-mhapCheck", $attempt);
        buildHTML($asm, $tag);

        submitOrRunParallelJob($asm, "${tag}mhap", $path, "mhap", @failedJobs);
        return;
    }

  finishStage:
    print STDERR "-- Found ", scalar(@successJobs), " mhap overlap output files.\n";

    open(L, "> $path/mhap.files") or caExit("failed to open '$path/mhap.files'", undef);
    print L @mhapJobs;
    close(L);

    open(L, "> $path/ovljob.files") or caExit("failed to open '$path/ovljob.files'", undef);
    print L @successJobs;
    close(L);

    open(L, "> $path/ovljob.more.files") or caExit("failed to open '$path/ovljob.more.files'", undef);
    print L @miscJobs;
    close(L);

    stashFile("$path/mhap.files");
    stashFile("$path/ovljob.files");
    stashFile("$path/ovljob.more.files");

    emitStage($asm, "$tag-mhapCheck");
    buildHTML($asm, $tag);

  allDone:
    stopAfter("overlap");
}
canu-1.6/src/pipelines/canu/OverlapStore.pm000066400000000000000000000573761314437614700210260ustar00rootroot00000000000000
###############################################################################
#
#  This file is part of canu, a software program that assembles whole-genome
#  sequencing reads into contigs.
#
#  This software is based on:
#    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
#    the 'kmer package' (http://kmer.sourceforge.net)
#  both originally distributed by Applera Corporation under the GNU General
#  Public License, version 2.
#
#  Canu branched from Celera Assembler at its revision 4587.
#  Canu branched from the kmer project at its revision 1994.
#
#  This file is derived from:
#
#    src/pipelines/ca3g/OverlapStore.pm
#
#  Modifications by:
#
#    Brian P.
#    Walenz from 2015-FEB-27 to 2015-SEP-21
#      are Copyright 2015 Battelle National Biodefense Institute, and
#      are subject to the BSD 3-Clause License
#
#    Brian P. Walenz beginning on 2015-OCT-10
#      are a 'United States Government Work', and
#      are released in the public domain
#
#    Sergey Koren beginning on 2015-DEC-08
#      are a 'United States Government Work', and
#      are released in the public domain
#
#  File 'README.licenses' in the root directory of this distribution contains
#  full conditions and disclaimers for each license.
##

package canu::OverlapStore;

require Exporter;

@ISA    = qw(Exporter);
@EXPORT = qw(createOverlapStore);

use strict;

use File::Basename;   #  dirname
use File::Path 2.08 qw(make_path remove_tree);
use POSIX qw(ceil);

use canu::Defaults;
use canu::Execution;
use canu::Report;
use canu::HTML;
use canu::Grid_Cloud;

#  Parallel documentation:  Each overlap job is converted into a single bucket of overlaps.  Within
#  each bucket, the overlaps are distributed into many slices, one per sort job.  The sort jobs then
#  load the same slice from each bucket.

#  NOT FILTERING overlaps by error rate when building the parallel store.

sub createOverlapStoreSequential ($$$) {
    my $base = shift @_;
    my $asm  = shift @_;
    my $tag  = shift @_;
    my $bin  = getBinDirectory();
    my $cmd;

    #  Fetch inputs.  If you're not cloud-based, this does nothing.  Really.  Trust me.

    fetchFile("$base/1-overlapper/ovljob.files");

    open(F, "< $base/1-overlapper/ovljob.files") or caExit("failed to open overlapper output list in '$base/1-overlapper/ovljob.files'", undef);
    while (<F>) {
        chomp;

        if (m/^(.*).ovb$/) {
            fetchFile("$base/$1.ovb");
            fetchFile("$base/$1.counts");
        } else {
            caExit("didn't recognize ovljob.files line '$_'", undef);
        }
    }
    close(F);

    #  This is running in the canu process itself.  Execution.pm has special case code
    #  to submit canu to grids using the maximum of 4gb and this memory limit.

    my $memSize = getGlobal("ovsMemory");

    #  The parallel store build will unlimit 'max user processes'.
    #  The sequential method usually runs out of open file handles first
    #  (meaning it has never run out of processes yet).

    $cmd  = "$bin/ovStoreBuild \\\n";
    $cmd .= " -O ./$asm.ovlStore.BUILDING \\\n";
    $cmd .= " -G ./$asm.gkpStore \\\n";
    $cmd .= " -M $memSize \\\n";
    $cmd .= " -L ./1-overlapper/ovljob.files \\\n";
    $cmd .= " > ./$asm.ovlStore.err 2>&1";

    if (runCommand($base, $cmd)) {
        caExit("failed to create the overlap store", "$base/$asm.ovlStore.err");
    }

    unlink("$base/$asm.ovlStore.err");

    rename("$base/$asm.ovlStore.BUILDING", "$base/$asm.ovlStore");

    stashStore("$base/$asm.ovlStore");
}

#  Count the number of inputs.  We don't expect any to be missing (they were just checked
#  by overlapCheck()) but feel silly not checking again.
#
#  Note that these files are rooted in '$base' (because that's where we run the overlap store
#  building) but canu.pl itself is rooted in the same directory as '$base', so we need to
#  add in '$base'.

sub countOverlapStoreInputs ($) {
    my $base      = shift @_;
    my $numInputs = 0;

    open(F, "< $base/1-overlapper/ovljob.files") or caExit("Failed to open overlap store input file '$base/1-overlapper/ovljob.files': $0", undef);
    while (<F>) {
        chomp;
        caExit("overlapper output '$_' not found", undef) if (! -e "$base/$_");
        $numInputs++;
    }
    close(F);

    return($numInputs);
}

sub getNumOlapsAndSlices ($$) {
    my $base = shift @_;
    my $asm  = shift @_;
    my $path = "$base/$asm.ovlStore.BUILDING";

    my $numOlaps  = undef;
    my $numSlices = undef;
    my $memLimit  = undef;

    open(F, "< $path/config.err") or caExit("can't open '$path/config.err' for reading: $!\n", undef);
    while (<F>) {
        if (m/Will sort (\d+.\d+) million overlaps per bucket, using (\d+) buckets (\d+.\d+) GB per bucket./) {
            $numOlaps  = $1;
            $numSlices = $2;
            $memLimit  = ceil($3);
        }
    }
    close(F);

    if (!defined($numOlaps) || !defined($numSlices) || !defined($memLimit)) {
        caExit("Failed to find any overlaps ($numOlaps) or slices ($numSlices) or memory limit ($memLimit)", undef);
    }

    #  Bump up the memory limit on grid jobs a bit.
    setGlobal("ovsMemory", ceil($memLimit + 0.5));

    #  The memory limit returned is used to tell ovStoreSorter itself how much space to reserve.

    return($numOlaps, $numSlices, $memLimit);
}

sub overlapStoreConfigure ($$$) {
    my $base = shift @_;
    my $asm  = shift @_;
    my $tag  = shift @_;
    my $bin  = getBinDirectory();
    my $cmd;

    my $path = "$base/$asm.ovlStore.BUILDING";

    goto allDone if (skipStage($asm, "$tag-overlapStoreConfigure") == 1);
    goto allDone if ((-e "$path/scripts/0-config.sh") &&
                     (-e "$path/scripts/1-bucketize.sh") &&
                     (-e "$path/scripts/2-sort.sh") &&
                     (-e "$path/scripts/3-index.sh"));
    goto allDone if (-d "$base/$asm.ovlStore");

    my $numInputs = countOverlapStoreInputs($base);

    #  Create an output directory, and populate it with more directories and scripts

    system("mkdir -p $path/scripts") if (! -d "$path/scripts");
    system("mkdir -p $path/logs")    if (! -d "$path/logs");

    #  Run the normal store build, but just to get the partitioning.  ovStoreBuild internally
    #  writes to config.WORKING, then renames when it is finished.  No need for the script
    #  to be overly careful about incomplete files.

    if (! -e "$path/scripts/0-config.sh") {
        open(F, "> $path/scripts/0-config.sh") or die;
        print F "#!" . getGlobal("shell") . "\n";
        print F "\n";
        #print F setWorkDirectoryShellCode($path);   #  This is never run on grid, so don't need to cd first.
        #print F "\n";
        #print F getJobIDShellCode();
        #print F "\n";
        print F getBinDirectoryShellCode();
        print F "\n";
        print F getLimitShellCode();
        print F "\n";
        print F "\$bin/ovStoreBuild \\\n";
        print F " -G ./$asm.gkpStore \\\n";
        print F " -O ./$asm.ovlStore \\\n";   #  NOT created!
        print F " -M " . getGlobal("ovsMemory") . " \\\n";
        print F " -config ./$asm.ovlStore.BUILDING/config \\\n";
        print F " -L ./1-overlapper/ovljob.files \\\n";
        close(F);
    }

    makeExecutable("$path/scripts/0-config.sh");
    stashFile("$path/scripts/0-config.sh");

    if (!
        -e "$path/config") {
        $cmd  = "./$asm.ovlStore.BUILDING/scripts/0-config.sh \\\n";
        $cmd .= "> ./$asm.ovlStore.BUILDING/config.err 2>&1\n";

        if (runCommand($base, $cmd)) {
            caExit("failed to generate configuration for building overlap store", "$path/config.err");
        }
    }

    #  Parse the output to find the number of jobs we need to sort and the memory.
    #  ovs store memory is left as a range (e.g. 4-16) so building can scale itself to
    #  (hopefully) fit both into memory and into max system open files.

    my ($numOlaps, $numSlices, $memLimit) = getNumOlapsAndSlices($base, $asm);

    #  Parallel jobs for bucketizing.  This should really be part of overlap computation itself.

    #getAllowedResources("", "ovb");

    if (! -e "$path/scripts/1-bucketize.sh") {
        open(F, "> $path/scripts/1-bucketize.sh") or die;
        print F "#!" . getGlobal("shell") . "\n";
        print F "\n";
        print F setWorkDirectoryShellCode($path);
        print F "\n";
        print F getJobIDShellCode();
        print F "\n";
        print F getBinDirectoryShellCode();
        print F "\n";
        print F "bn=`printf %04d \$jobid`\n";
        print F "jn=\"undefined\"\n";
        print F "\n";
        print F "# This script runs in $path/, but the overlap file list\n";
        print F "# is relative to $base/, so we need to add a few dots to make things work.\n";
        print F "\n";

        my $tstid = 1;
        open(I, "< $base/1-overlapper/ovljob.files") or die "Failed to open '$base/1-overlapper/ovljob.files': $0\n";
        while (<I>) {
            chomp;
            print F "if [ \"\$jobid\" -eq \"$tstid\" ] ; then jn=\"../$_\"; fi\n";
            $tstid++;
        }
        close(I);

        print F "\n";
        print F "if [ \$jn = \"undefined\" ] ; then\n";
        print F " echo \"Job out of range.\"\n";
        print F " exit\n";
        print F "fi\n";
        print F "\n";
        print F "if [ -e \"./bucket\$bn/sliceSizes\" ] ; then\n";
        print F " echo \"Bucket bucket\$bn finished successfully.\"\n";
        print F " exit\n";
        print F "fi\n";
        print F "\n";
        print F "if [ -e \"./create\$bn\" ] ; then\n";
        print F " echo \"Removing incomplete bucket create\$bn\"\n";
        print F " rm -rf \"./create\$bn\"\n";
        print F "fi\n";
        print F "\n";
        print F getLimitShellCode();
        print F "\n";
        print F "\$bin/ovStoreBucketizer \\\n";
        print F " -O . \\\n";
        print F " -G ../$asm.gkpStore \\\n";
        print F " -C ./config \\\n";
        #print F " -e " . getGlobal("") . " \\\n" if (defined(getGlobal("")));
        print F " -job \$jobid \\\n";
        print F " -i \$jn\n";
        close(F);
    }

    #  Parallel jobs for sorting each bucket

    #getAllowedResources("", "ovs");

    if (! -e "$path/scripts/2-sort.sh") {
        open(F, "> $path/scripts/2-sort.sh") or die;
        print F "#!" . getGlobal("shell") . "\n";
        print F "\n";
        print F setWorkDirectoryShellCode($path);
        print F "\n";
        print F getJobIDShellCode();
        print F "\n";
        print F getBinDirectoryShellCode();
        print F "\n";
        print F getLimitShellCode();
        print F "\n";
        print F "\$bin/ovStoreSorter \\\n";
        print F " -deletelate \\\n";   #  Choices -deleteearly -deletelate or nothing
        print F " -M $memLimit \\\n";
        print F " -O . \\\n";
        print F " -G ../$asm.gkpStore \\\n";
        print F " -F $numSlices \\\n";
        print F " -job \$jobid $numInputs\n";
        close(F);
    }

    #  A final job to merge the indices.

    if (! -e "$path/scripts/3-index.sh") {
        open(F, "> $path/scripts/3-index.sh") or die;
        print F "#!" . getGlobal("shell") . "\n";
        print F "\n";
        #print F setWorkDirectoryShellCode($path);   #  This is never run on grid, so don't need to cd first.
        #print F "\n";
        #print F getJobIDShellCode();
        #print F "\n";
        print F getBinDirectoryShellCode();
        print F "\n";
        print F "\$bin/ovStoreIndexer \\\n";
        #print F " -nodelete \\\n";   #  Choices -nodelete or nothing
        print F " -O . \\\n";
        print F " -F $numSlices\n";
        close(F);
    }

    makeExecutable("$path/scripts/1-bucketize.sh");
    makeExecutable("$path/scripts/2-sort.sh");
    makeExecutable("$path/scripts/3-index.sh");

    stashFile("$path/scripts/1-bucketize.sh");
    stashFile("$path/scripts/2-sort.sh");
    stashFile("$path/scripts/3-index.sh");

  finishStage:
    emitStage($asm, "$tag-overlapStoreConfigure");
    buildHTML($asm, $tag);

  allDone:
    stopAfter("overlapStoreConfigure");
}

sub overlapStoreBucketizerCheck ($$$) {
    my $base    = shift @_;
    my $asm     = shift @_;
    my $tag     = shift @_;
    my $attempt = getGlobal("canuIteration");

    my $path = "$base/$asm.ovlStore.BUILDING";

    goto allDone if (skipStage($asm, "$tag-overlapStoreBucketizerCheck", $attempt) == 1);
    goto allDone if (-d "$base/$asm.ovlStore");
    goto allDone if (-e "$path/1-bucketize.success");

    fetchFile("scripts/1-bucketize/1-bucketize.sh");

    #  Figure out if all the tasks finished correctly.

    my $numInputs      = countOverlapStoreInputs($base);
    my $currentJobID   = 1;
    my @successJobs;
    my @failedJobs;
    my $failureMessage = "";
    my $bucketID       = "0001";

    #  Two ways to check for completeness, either 'sliceSizes' exists, or the 'bucket' directory
    #  exists.  The compute is done in a 'create' directory, which is renamed to 'bucket' just
    #  before the job completes.

    open(F, "< $base/1-overlapper/ovljob.files") or caExit("can't open '$base/1-overlapper/ovljob.files' for reading: $!", undef);
    while (<F>) {
        chomp;

        if (! -e "$path/bucket$bucketID") {
            $failureMessage .= "-- job $path/bucket$bucketID FAILED.\n";
            push @failedJobs, $currentJobID;
        } else {
            push @successJobs, $currentJobID;
        }

        $currentJobID++;
        $bucketID++;
    }
    close(F);

    #  Failed jobs, retry.

    if (scalar(@failedJobs) > 0) {

        #  If too many attempts, give up.
        if ($attempt >= getGlobal("canuIterationMax")) {
            print STDERR "--\n";
            print STDERR "-- Overlap store bucketizer jobs failed, tried $attempt times, giving up.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
            caExit(undef, undef);
        }

        if ($attempt > 0) {
            print STDERR "--\n";
            print STDERR "-- Overlap store bucketizer jobs failed, retry.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
        }

        #  Otherwise, run some jobs.

        emitStage($asm, "$tag-overlapStoreBucketizerCheck", $attempt);
        buildHTML($asm, $tag);

        submitOrRunParallelJob($asm, "ovB", $path, "scripts/1-bucketize", @failedJobs);
        return;
    }

  finishStage:
    print STDERR "-- Overlap store bucketizer finished.\n";

    touch("$path/1-bucketize.success");

    emitStage($asm, "$tag-overlapStoreBucketizerCheck");
    buildHTML($asm, $tag);

  allDone:
}

sub overlapStoreSorterCheck ($$$) {
    my $base    = shift @_;
    my $asm     = shift @_;
    my $tag     = shift @_;
    my $attempt = getGlobal("canuIteration");

    my $path = "$base/$asm.ovlStore.BUILDING";

    goto allDone if (skipStage($asm, "$tag-overlapStoreSorterCheck", $attempt) == 1);
    goto allDone if (-d "$base/$asm.ovlStore");
    goto allDone if (-e "$path/2-sorter.success");

    fetchFile("scripts/1-bucketize/2-sort.sh");

    #  Figure out if all the tasks finished correctly.

    my ($numOlaps, $numSlices, $memLimit) = getNumOlapsAndSlices($base, $asm);

    my $currentJobID   = 1;
    my @successJobs;
    my @failedJobs;
    my $failureMessage = "";
    my $sortID         = "0001";

    open(F, "< $base/1-overlapper/ovljob.files") or caExit("can't open '$base/1-overlapper/ovljob.files' for reading: $!", undef);

    #  A valid result has three files:
    #    $path/$sortID
    #    $path/$sortID.index
    #    $path/$sortID.info
    #
    #  A crashed result has one file, if it crashes before output
    #    $path/$sortID.ovs
    #
    #  On out of disk, the .info is missing.  It's the last thing created.
    #
    while ($currentJobID <= $numSlices) {
        if ((! -e "$path/$sortID") ||
            (!
             -e "$path/$sortID.info") ||
            (  -e "$path/$sortID.ovs")) {
            $failureMessage .= "-- job $path/$sortID FAILED.\n";
            unlink "$path/$sortID.ovs";
            push @failedJobs, $currentJobID;
        } else {
            push @successJobs, $currentJobID;
        }

        $currentJobID++;
        $sortID++;
    }
    close(F);

    #  Failed jobs, retry.

    if (scalar(@failedJobs) > 0) {

        #  If too many attempts, give up.

        if ($attempt >= getGlobal("canuIterationMax")) {
            print STDERR "--\n";
            print STDERR "-- Overlap store sorting jobs failed, tried $attempt times, giving up.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
            caExit(undef, undef);
        }

        if ($attempt > 0) {
            print STDERR "--\n";
            print STDERR "-- Overlap store sorting jobs failed, retry.\n";
            print STDERR $failureMessage;
            print STDERR "--\n";
        }

        #  Otherwise, run some jobs.

        emitStage($asm, "$tag-overlapStoreSorterCheck", $attempt);
        buildHTML($asm, $tag);

        submitOrRunParallelJob($asm, "ovS", $path, "scripts/2-sort", @failedJobs);
        return;
    }

  finishStage:
    print STDERR "-- Overlap store sorter finished.\n";

    touch("$path/2-sorter.success");

    emitStage($asm, "$tag-overlapStoreSorterCheck");
    buildHTML($asm, $tag);

  allDone:
}

sub createOverlapStoreParallel ($$$) {
    my $base = shift @_;
    my $asm  = shift @_;
    my $tag  = shift @_;
    my $path = "$base/$asm.ovlStore.BUILDING";

    overlapStoreConfigure      ($base, $asm, $tag);

    overlapStoreBucketizerCheck($base, $asm, $tag) foreach (1..getGlobal("canuIterationMax") + 1);
    overlapStoreSorterCheck    ($base, $asm, $tag) foreach (1..getGlobal("canuIterationMax") + 1);

    if (runCommand($path, "./scripts/3-index.sh > ./logs/3-index.err 2>&1")) {
        caExit("failed to build index for overlap store", "$base/$asm.ovlStore.BUILDING/logs/3-index.err");
    }

    rename $path, "$base/$asm.ovlStore";
}

sub checkOverlapStore ($$) {
    my $base = shift @_;
    my $asm  = shift @_;
    my $bin  = getBinDirectory();
    my $cmd;

    $cmd  = "$bin/ovStoreDump \\\n";
    $cmd .= " -G ./$asm.gkpStore \\\n";
    $cmd .= " -O ./$asm.ovlStore \\\n";
    $cmd .= " -d -counts \\\n";
    $cmd .= " > ./$asm.ovlStore/counts.dat 2> ./$asm.ovlStore/counts.err";

    print STDERR "-- Checking store.\n";

    if (runCommand($base, $cmd)) {
        caExit("failed to dump counts of overlaps; invalid store?", "./$asm.ovlStore/counts.err");
    }

    my $totOvl   = 0;
    my $nulReads = 0;
    my $ovlReads = 0;

    open(F, "< ./$base/$asm.ovlStore/counts.dat") or die;
    while (<F>) {
        my @v = split '\s+', $_;

        $nulReads += 1 if ($v[1] < 1);
        $ovlReads += 1 if ($v[1] > 0);
        $totOvl   += $v[1];
    }
    close(F);

    print STDERR "--\n";
    print STDERR "-- Overlap store '$base/$asm.ovlStore' successfully constructed.\n";
    print STDERR "-- Found $totOvl overlaps for $ovlReads reads; $nulReads reads have no overlaps.\n";
    print STDERR "--\n";

    unlink "./$base/$asm.ovlStore/counts.dat";
    unlink "./$base/$asm.ovlStore/counts.err";
}

sub generateOverlapStoreStats ($$) {
    my $base = shift @_;
    my $asm  = shift @_;
    my $bin  = getBinDirectory();
    my $cmd;

    $cmd  = "$bin/ovStoreStats \\\n";
    $cmd .= " -G ./$asm.gkpStore \\\n";
    $cmd .= " -O ./$asm.ovlStore \\\n";
    $cmd .= " -o ./$asm.ovlStore \\\n";
    $cmd .= " > ./$asm.ovlStore.summary.err 2>&1";

    if (runCommand($base, $cmd)) {
        print STDERR "--\n";
        print STDERR "-- WARNING: failed to generate statistics for the overlap store; no summary will appear in report.\n";
        print STDERR "--\n";
        print STDERR "----------------------------------------\n";
        return;
    }

    unlink "$base/$asm.ovlStore.summary.err";
}

sub createOverlapStore ($$$) {
    my $asm = shift @_;
    my $tag = shift @_;
    my $seq = shift @_;
    my $base;

    $base = "correction"  if ($tag eq "cor");
    $base = "trimming"    if ($tag eq "obt");
    $base = "unitigging"  if ($tag eq "utg");

    goto allDone if (skipStage($asm, "$tag-createOverlapStore") == 1);
    goto allDone if ((-d "$base/$asm.ovlStore") || (fileExists("$base/$asm.ovlStore.tar")));
    goto allDone if ((-d "$base/$asm.ctgStore") || (fileExists("$base/$asm.ctgStore.tar")));

    #  Did we _really_ complete?

    caExit("overlapper claims to be finished, but no job list found in '$base/1-overlapper/ovljob.files'", undef) if (!
fileExists("$base/1-overlapper/ovljob.files")); # Then just build the store! Simple! createOverlapStoreSequential($base, $asm, $tag) if ($seq eq "sequential"); createOverlapStoreParallel ($base, $asm, $tag) if ($seq eq "parallel"); checkOverlapStore($base, $asm); goto finishStage if (getGlobal("saveOverlaps") eq "1"); # Delete the inputs and directories. # # Directories - Viciously remove the whole thing (after all files are deleted, so we # can get the sizes). # Files - Sum the size, remove the file, and try to remove the directory. In # particular, we don't want to remove_tree() this directory - there could # be other stuff in it - only remove if empty. # # Ideally, every directory we have in our list should be empty after we delete the files in the # list. But they won't be. Usually because there are empty directories in there too. Maybe # some stray files we didn't track. Regardless, just blow them away. # # Previous (to July 2017) versions tried to gently rmdir things, but it was ugly and didn't # quite work. my %directories; my $bytes = 0; my $files = 0; foreach my $file ("$base/1-overlapper/ovljob.files", "$base/1-overlapper/ovljob.more.files", "$base/1-overlapper/mhap.files", "$base/1-overlapper/mmap.files", "$base/1-overlapper/precompute.files") { next if (! -e $file); open(F, "< $file") or caExit("can't open '$file' for reading: $!\n", undef); while () { chomp; if (-d "$base/$_") { print STDERR "DIRECTORY $base/$_\n"; $directories{$_}++; } elsif (-e "$base/$_") { print STDERR "FILE $base/$_\n"; $bytes += -s "$base/$_"; $files += 1; unlink "$base/$_"; rmdir dirname("$base/$_"); # Try to rmdir the directory the file is in. If empty, yay! 
            }
        }
        close(F);
    }

    foreach my $dir (keys %directories) {
        print STDERR "REMOVE TREE $base/$dir\n";
        remove_tree("$base/$dir");
    }

    unlink "$base/1-overlapper/ovljob.files";
    unlink "$base/1-overlapper/ovljob.more.files";
    unlink "$base/1-overlapper/mhap.files";
    unlink "$base/1-overlapper/mmap.files";
    unlink "$base/1-overlapper/precompute.files";

    print STDERR "--\n";
    print STDERR "-- Purged ", int(1000 * $bytes / 1024 / 1024 / 1024) / 1000, " GB in $files overlap output files.\n";

    #  Now all done!

  finishStage:
    if ($tag eq "utg") {
        generateOverlapStoreStats($base, $asm);
    }

    if (-e "$base/$asm.ovlStore.summary") {
        print STDERR "--\n";
        print STDERR "-- Overlap store '$base/$asm.ovlStore' contains:\n";
        print STDERR "--\n";

        my $report;

        open(F, "< $base/$asm.ovlStore.summary") or caExit("Failed to open overlap store statistics in '$base/$asm.ovlStore': $!", undef);
        while (<F>) {
            $report .= "-- $_";
        }
        close(F);

        addToReport("overlaps", $report);   #  Also shows it.
    } else {
        print STDERR "-- Overlap store '$base/$asm.ovlStore' statistics not available (skipped in correction and trimming stages).\n";
        print STDERR "--\n";
    }

    emitStage($asm, "$tag-createOverlapStore");
    buildHTML($asm, $tag);

  allDone:
    stopAfter("overlapStore");
}

canu-1.6/src/pipelines/canu/Report.pm

###############################################################################
#
#  This file is part of canu, a software program that assembles whole-genome
#  sequencing reads into contigs.
#
#  This software is based on:
#    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
#    the 'kmer package' (http://kmer.sourceforge.net)
#  both originally distributed by Applera Corporation under the GNU General
#  Public License, version 2.
#
#  Canu branched from Celera Assembler at its revision 4587.
#  Canu branched from the kmer project at its revision 1994.
#
#  Modifications by:
#
#    Brian P. Walenz beginning on 2017-MAR-20
#      are a 'United States Government Work', and
#      are released in the public domain
#
#  File 'README.licenses' in the root directory of this distribution contains
#  full conditions and disclaimers for each license.
##

package canu::Report;

require Exporter;

@ISA    = qw(Exporter);
@EXPORT = qw(generateReport addToReport getFromReport);

use strict;

#use File::Copy;
#use File::Path 2.08 qw(make_path remove_tree);

#use canu::Defaults;
use canu::Grid_Cloud;

#  Holds the report we have so far (loaded from disk) and is then populated with
#  any new results and written back to disk.
#
#  For a rather technical reason, some keys will exist in the hash (*Meryl, for example)
#  but have an undef value.
#
my %report;

#  Parse an existing report to load the data present in it.

sub loadReport ($) {
    my $asm = shift @_;
    my $tag;
    my $rpt;

    fetchFile("$asm.report");

    return   if (! -e "$asm.report");

    #  Parse the file, loading each section into the proper key.

    open(F, "< $asm.report") or caExit("can't open '$asm.report' for reading: $!", undef);
    while (<F>) {
        if    (m/^\[CORRECTION/)  { $tag = "cor"; }
        elsif (m/^\[TRIMMING/)    { $tag = "obt"; }
        elsif (m/^\[UNITIGGING/)  { $tag = "utg"; }

        if    (m/READS\]$/)        { $rpt = "${tag}GkpStore";  $report{$rpt} = undef; }
        elsif (m/MERS\]$/)         { $rpt = "${tag}Meryl";     $report{$rpt} = undef; }
        elsif (m/FILTERING\]$/)    { $rpt = "filtering";       $report{$rpt} = undef; }
        elsif (m/CORRECTIONS\]$/)  { $rpt = "corrections";     $report{$rpt} = undef; }
        elsif (m/TRIMMING\]$/)     { $rpt = "trimming";        $report{$rpt} = undef; }
        elsif (m/SPLITTING\]$/)    { $rpt = "splitting";       $report{$rpt} = undef; }
        elsif (m/OVERLAPS\]$/)     { $rpt = "overlaps";        $report{$rpt} = undef; }
        elsif (m/ADJUSTMENT\]$/)   { $rpt = "adjustments";     $report{$rpt} = undef; }
        elsif (m/UNITIGS\]$/)      { $rpt = "unitigs";         $report{$rpt} = undef; }
        elsif (m/CONTIGS\]$/)      { $rpt = "contigs";         $report{$rpt} = undef; }
        elsif (m/CONSENSUS\]$/)    { $rpt = "consensus";       $report{$rpt} = undef; }
        else {
            $report{$rpt} .= $_;
        }
    }
    close(F);

    #  Ignore blank lines at the start and end.

    foreach my $k (keys %report) {
        $report{$k} =~ s/^\s+//;
        $report{$k} =~ s/\s+$//;
        $report{$k} .= "\n";
    }
}

sub saveReportItem ($$) {
    my $title = shift @_;
    my $item  = shift @_;

    print F "[$title]\n$item\n"   if (defined($item));
}

sub saveReport ($) {
    my $asm = shift @_;
    my $tag;

    open(F, "> $asm.report") or caExit("can't open '$asm.report' for writing: $!", undef);

    saveReportItem("CORRECTION/READS",       $report{"corGkpStore"});
    saveReportItem("CORRECTION/MERS",        $report{"corMeryl"});
    saveReportItem("CORRECTION/FILTERING",   $report{"correctionFiltering"});
    saveReportItem("CORRECTION/CORRECTIONS", $report{"corrections"});

    saveReportItem("TRIMMING/READS",         $report{"obtGkpStore"});
    saveReportItem("TRIMMING/MERS",          $report{"obtMeryl"});
    saveReportItem("TRIMMING/TRIMMING",      $report{"trimming"});
    saveReportItem("TRIMMING/SPLITTING",     $report{"splitting"});

    saveReportItem("UNITIGGING/READS",       $report{"utgGkpStore"});
    saveReportItem("UNITIGGING/MERS",        $report{"utgMeryl"});
    saveReportItem("UNITIGGING/OVERLAPS",    $report{"overlaps"});
    saveReportItem("UNITIGGING/ADJUSTMENT",  $report{"adjustments"});
    saveReportItem("UNITIGGING/UNITIGS",     $report{"unitigs"});
    saveReportItem("UNITIGGING/CONTIGS",     $report{"contigs"});
    saveReportItem("UNITIGGING/CONSENSUS",   $report{"consensus"});

    close(F);

    stashFile("$asm.report");
}

sub generateReport ($) {
    my $asm = shift @_;

    loadReport($asm);
    saveReport($asm);
}

sub addToReport ($$) {
    my $item = shift @_;
    my $text = shift @_;

    print STDERR $text;   #  Client code is GREATLY simplified if this dumps to the screen too.

    $report{$item} = $text;
}

sub getFromReport ($) {
    return($report{$_[0]});
}

1;

canu-1.6/src/pipelines/canu/Unitig.pm

###############################################################################
#
#  This file is part of canu, a software program that assembles whole-genome
#  sequencing reads into contigs.
#
#  This software is based on:
#    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
#    the 'kmer package' (http://kmer.sourceforge.net)
#  both originally distributed by Applera Corporation under the GNU General
#  Public License, version 2.
#
#  Canu branched from Celera Assembler at its revision 4587.
#  Canu branched from the kmer project at its revision 1994.
#
#  This file is derived from:
#
#    src/pipelines/ca3g/Unitig.pm
#
#  Modifications by:
#
#    Brian P. Walenz from 2015-FEB-27 to 2015-AUG-26
#      are Copyright 2015 Battelle National Biodefense Institute, and
#      are subject to the BSD 3-Clause License
#
#    Brian P. Walenz beginning on 2015-NOV-02
#      are a 'United States Government Work', and
#      are released in the public domain
#
#    Sergey Koren beginning on 2015-NOV-27
#      are a 'United States Government Work', and
#      are released in the public domain
#
#  File 'README.licenses' in the root directory of this distribution contains
#  full conditions and disclaimers for each license.
##

package canu::Unitig;

require Exporter;

@ISA    = qw(Exporter);
@EXPORT = qw(reportUnitigSizes unitig unitigCheck);

use strict;

use File::Path 2.08 qw(make_path remove_tree);
use POSIX qw(ceil);

use canu::Defaults;
use canu::Execution;
use canu::Configure;    #  For displayGenomeSize
use canu::Gatekeeper;
use canu::Report;
use canu::HTML;
use canu::Meryl;
use canu::Grid_Cloud;

sub reportUnitigSizes ($$$) {
    my $asm      = shift @_;
    my $version  = shift @_;
    my $label    = shift @_;

    my $bin      = getBinDirectory();
    my $cmd      = "";

    my $ctgNum   = 0;
    my $ctgBases = 0;
    my $ctgSizes = "";

    my $usmNum   = 0;
    my $usmBases = 0;

    my $bubNum   = 0;
    my $bubBases = 0;

    my $rptNum   = 0;
    my $rptBases = 0;

    my $gs = getGlobal("genomeSize");

    my $V = substr("000000" . $version, -3);
    my $N = "$asm.ctgStore/seqDB.v$V.sizes.txt";

    fetchFile("unitigging/$N");

    if (! -e "unitigging/$N") {
        fetchStore("unitigging/$asm.gkpStore");
        fetchFile("unitigging/$asm.ctgStore/seqDB.v$V.dat");
        fetchFile("unitigging/$asm.ctgStore/seqDB.v$V.tig");

        $cmd  = "$bin/tgStoreDump \\\n";   #  Duplicated at the end of unitigger.sh
        $cmd .= " -G ./$asm.gkpStore \\\n";
        $cmd .= " -T ./$asm.ctgStore $version \\\n";
        $cmd .= " -sizes -s " . getGlobal("genomeSize") . " \\\n";
        $cmd .= "> ./$N";

        if (runCommand("unitigging", $cmd)) {
            caExit("failed to generate unitig sizes", undef);
        }

        stashFile("unitigging/$N");
    }

    $ctgSizes .= "--            NG (bp)  LG (contigs)    sum (bp)\n";
    $ctgSizes .= "--         ----------  ------------  ----------\n";

    open(F, "< unitigging/$N") or caExit("failed to open 'unitigging/$N' for reading: $!", undef);
    while (<F>) {
        $rptBases = $1   if (m/lenSuggestRepeat\s+sum\s+(\d+)/);
        $rptNum   = $1   if (m/lenSuggestRepeat\s+num\s+(\d+)/);

        $usmBases = $1   if (m/lenUnassembled\s+sum\s+(\d+)/);
        $usmNum   = $1   if (m/lenUnassembled\s+num\s+(\d+)/);

        $bubBases = $1   if (m/lenBubble\s+sum\s+(\d+)/);
        $bubNum   = $1   if (m/lenBubble\s+num\s+(\d+)/);

        $ctgBases = $1   if (m/lenContig\s+sum\s+(\d+)/);
        $ctgNum   = $1   if (m/lenContig\s+num\s+(\d+)/);

        if (m/lenContig\s+ng(\d+)\s+(\d+)\s+bp\s+lg(\d+)\s+(\d+)\s+sum\s+(\d+)\s+bp$/) {
            my $n  = substr("        $1",  -3);
            my $ng = substr("        $2", -10);
            my $lg = substr("        $4",  -6);
            my $ss = substr("        $5", -10);

            $ctgSizes .= "--    $n  $ng    $lg  $ss\n";
        }

        #  User could have changed genome size since this report was dumped.
        if (m/lenContig.*genomeSize\s+(\d+)/) {
            $gs = $1;
        }
    }
    close(F);

    my $report;

    $report .= "-- Found, in version $version, $label:\n";
    $report .= "--   contigs:      $ctgNum sequences, total length $ctgBases bp (including $rptNum repeats of total length $rptBases bp).\n";
    $report .= "--   bubbles:      $bubNum sequences, total length $bubBases bp.\n";
    $report .= "--   unassembled:  $usmNum sequences, total length $usmBases bp.\n";
    $report .= "--\n";
    $report .= "-- Contig sizes based on genome size " . displayGenomeSize($gs) . "bp:\n";
    $report .= "--\n";
    $report .= $ctgSizes;
    $report .= "--\n";

    #  Hmmm.  Report wants a small tag, but $label is a bit verbose.

    addToReport("contigs",   $report)   if ($label eq "after unitig construction");
    addToReport("consensus", $report)   if ($label eq "after consensus generation");
}

sub unitig ($) {
    my $asm  = shift @_;
    my $path = "unitigging/4-unitigger";

    goto allDone   if (skipStage($asm, "unitig") == 1);
    goto allDone   if (fileExists("unitigging/4-unitigger/unitigger.sh"));
    goto allDone   if (fileExists("unitigging/$asm.utgStore/seqDB.v001.tig") &&
                       fileExists("unitigging/$asm.ctgStore/seqDB.v001.tig"));

    make_path($path)   if (! -d $path);

    #  How many reads per partition?  This will change - it'll move to be after unitigs are constructed.

    my $overlapLength = getGlobal("minOverlapLength");

    #  Dump a script to run the unitigger.

    open(F, "> $path/unitigger.sh") or caExit("can't open '$path/unitigger.sh' for writing: $!\n", undef);
    print F "#!" . getGlobal("shell") . "\n";
    print F "\n";
    print F getBinDirectoryShellCode();
    print F "\n";
    print F setWorkDirectoryShellCode($path);
    print F "\n";
    print F fetchStoreShellCode("unitigging/$asm.gkpStore", $path, "");
    print F fetchStoreShellCode("unitigging/$asm.ovlStore", $path, "");
    print F "\n";
    print F fetchFileShellCode("unitigging/$asm.ovlStore", "evalues", "");
    print F "\n";
    print F getJobIDShellCode();
    print F "\n";
    print F "if [ -e ../$asm.ctgStore/seqDB.v001.tig -a -e ../$asm.utgStore/seqDB.v001.tig ] ; then\n";
    print F "  exit 0\n";
    print F "fi\n";
    print F "\n";

    if (getGlobal("unitigger") eq "bogart") {
        print F "\$bin/bogart \\\n";
        print F " -G ../$asm.gkpStore \\\n";
        print F " -O ../$asm.ovlStore \\\n";
        print F " -o ./$asm \\\n";
        print F " -gs " . getGlobal("genomeSize")          . " \\\n";
        print F " -eg " . getGlobal("utgErrorRate")        . " \\\n";
        print F " -eM " . getGlobal("utgErrorRate")        . " \\\n";
        print F " -mo " . $overlapLength                   . " \\\n";
        print F " -dg " . getGlobal("utgGraphDeviation")   . " \\\n";
        print F " -db " . getGlobal("utgGraphDeviation")   . " \\\n";
        print F " -dr " . getGlobal("utgRepeatDeviation")  . " \\\n";
        print F " -ca " . getGlobal("utgRepeatConfusedBP") . " \\\n";
        print F " -cp " . "200"                            . " \\\n";
        print F " -threads " . getGlobal("batThreads")     . " \\\n"   if (defined(getGlobal("batThreads")));
        print F " -M " . getGlobal("batMemory")            . " \\\n"   if (defined(getGlobal("batMemory")));
        print F " -unassembled " . getGlobal("contigFilter") . " \\\n"   if (defined(getGlobal("contigFilter")));
        print F " " . getGlobal("batOptions")              . " \\\n"   if (defined(getGlobal("batOptions")));
        print F " > ./unitigger.err 2>&1 \\\n";
        print F "&& \\\n";
        print F "mv ./$asm.ctgStore ../$asm.ctgStore \\\n";
        print F "&& \\\n";
        print F "mv ./$asm.utgStore ../$asm.utgStore\n";
    } else {
        caFailure("unknown unitigger '" . getGlobal("unitigger") . "'", undef);
    }

    print F "\n";
    print F stashFileShellCode("unitigging/4-unitigger", "$asm.unitigs.gfa", "");
    print F stashFileShellCode("unitigging/4-unitigger", "$asm.contigs.gfa", "");
    print F "\n";
    print F stashFileShellCode("unitigging/$asm.ctgStore", "seqDB.v001.dat", "");
    print F stashFileShellCode("unitigging/$asm.ctgStore", "seqDB.v001.tig", "");
    print F "\n";
    print F stashFileShellCode("unitigging/$asm.utgStore", "seqDB.v001.dat", "");
    print F stashFileShellCode("unitigging/$asm.utgStore", "seqDB.v001.tig", "");
    print F "\n";
    print F "\$bin/tgStoreDump \\\n";                              #  Duplicated in reportUnitigSizes()
    print F " -G ../$asm.gkpStore \\\n";                           #  Done here so we don't need another
    print F " -T ../$asm.ctgStore 1 \\\n";                         #  pull of gkpStore and ctgStore
    print F " -sizes -s " . getGlobal("genomeSize") . " \\\n";
    print F "> ../$asm.ctgStore/seqDB.v001.sizes.txt";
    print F "\n";
    print F stashFileShellCode("unitigging/$asm.ctgStore", "seqDB.v001.sizes.txt", "");
    print F "\n";
    print F "exit 0\n";
    close(F);

    makeExecutable("$path/unitigger.sh");
    stashFile("$path/unitigger.sh");

  finishStage:
    emitStage($asm, "unitig");
    buildHTML($asm, "utg");

  allDone:
}

sub unitigCheck ($) {
    my $asm     = shift @_;
    my $attempt = getGlobal("canuIteration");
    my $path    = "unitigging/4-unitigger";

    goto allDone       if (skipStage($asm, "unitigCheck", $attempt) == 1);
    goto allDone       if (fileExists("$path/unitigger.success"));
    goto finishStage   if (fileExists("unitigging/$asm.utgStore/seqDB.v001.tig") &&
                           fileExists("unitigging/$asm.ctgStore/seqDB.v001.tig"));

    fetchFile("$path/unitigger.sh");

    #  Since there is only one job, if we get here, we're not done.  Any other 'check' function
    #  shows how to process multiple jobs.  This only checks for the existence of the final outputs.
    #  (meryl is the same)

    #  If too many attempts, give up.
    if ($attempt >= getGlobal("canuIterationMax")) {
        print STDERR "--\n";
        print STDERR "-- Bogart failed, tried $attempt times, giving up.\n";
        print STDERR "--\n";
        caExit(undef, undef);
    }

    if ($attempt > 0) {
        print STDERR "--\n";
        print STDERR "-- Bogart failed, retry\n";
        print STDERR "--\n";
    }

    #  Otherwise, run some jobs.

    emitStage($asm, "unitigCheck", $attempt);
    buildHTML($asm, "utg");

    submitOrRunParallelJob($asm, "bat", $path, "unitigger", (1));
    return;

  finishStage:
    print STDERR "-- Unitigger finished successfully.\n";

    make_path($path);   #  With object storage, we might not have this directory!

    open(F, "> $path/unitigger.success") or caExit("can't open '$path/unitigger.success' for writing: $!", undef);
    close(F);

    stashFile("$path/unitigger.success");

    reportUnitigSizes($asm, 1, "after unitig construction");

    emitStage($asm, "unitigCheck");
    buildHTML($asm, "utg");

  allDone:
    stopAfter("unitig");
}

canu-1.6/src/pipelines/parallel-ovl-store-test.sh

#!/bin/sh

bin="/work/canu/FreeBSD-amd64/bin"

if [ ! -e test.gkpStore ] ; then
  echo Didn\'t find \'test.gkpStore\', can\'t make fake overlaps.
  exit
fi

rm -rf *ovb *counts test.ovlStore test.ovlStore.seq

jobs=50
size=100000000

#  Create a bunch of overlap outputs, one for each job.

echo ""
echo "Creating overlaps"
echo ""

for ii in `seq 1 $jobs` ; do
  name=`printf %03d $ii`
  $bin/overlapImport -G test.gkpStore -o $name.ovb -random $size
done

#  Configure.

$bin/ovStoreBuild -G test.gkpStore -O test.ovlStore -M 0.26 -config config *ovb > config.err 2>&1

buckets=`grep Will config.err | grep buckets | awk '{ print $9 }'`

echo ""
echo "Using $buckets buckets for sorting"
echo ""

#  Bucketize each job.

for ii in `seq 1 $jobs` ; do
  name=`printf %03d $ii`
  $bin/ovStoreBucketizer -G test.gkpStore -O test.ovlStore -C config -i $name.ovb -job $ii
done

#  Sort each bucket.

echo ""
echo "Sorting each bucket"
echo ""

for ii in `seq 1 $buckets` ; do
  echo ""
  echo "Sorting bucket $ii"
  echo ""
  $bin/ovStoreSorter -deletelate -G test.gkpStore -O test.ovlStore -F $jobs -job $ii $jobs
done

#  And build the index

$bin/ovStoreIndexer -O test.ovlStore -F $buckets

#  And a sequential store?

exit

echo ""
echo "Building sequential store"
echo ""

$bin/ovStoreBuild -G test.gkpStore -O test.ovlStore.seq *ovb > test.ovlStore.seq.err 2>&1

canu-1.6/src/pipelines/simple-repeat-test.pl

#!/usr/bin/env perl

###############################################################################
#
#  This file is part of canu, a software program that assembles whole-genome
#  sequencing reads into contigs.
#
#  This software is based on:
#    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
#    the 'kmer package' (http://kmer.sourceforge.net)
#  both originally distributed by Applera Corporation under the GNU General
#  Public License, version 2.
#
#  Canu branched from Celera Assembler at its revision 4587.
#  Canu branched from the kmer project at its revision 1994.
#
#  Modifications by:
#
#    Brian P. Walenz beginning on 2017-JAN-09
#      are a 'United States Government Work', and
#      are released in the public domain
#
#  File 'README.licenses' in the root directory of this distribution contains
#  full conditions and disclaimers for each license.
##

use strict;

#  Generate a circular 'genome' with nUnique pieces and nUnique+1 repeats, the pair at the start/end
#  forming the circularization.  Assemble it, and compare against 'reference'.

my $repeatSize = 10000;
my $uniqueSize = 90000;
my $nUnique    = 4;

my $readLenL   = 4000;   #  Long reads
my $readLenM   = 3500;   #  Medium reads
my $readLenS   = 3000;   #  Short reads

my $coverageL  = 30;
my $coverageM  = 1;
my $coverageS  = 1;

sub randomSequence ($) {
    my $L = shift @_;

    open(F, "leaff -G 1 $L $L |");
    $_ = <F>;  chomp;
    $_ = <F>;  chomp;
    close(F);

    return($_);
}

if (! -e "reads.genome.fasta") {
    my $len  = 0;
    my $R    = randomSequence($repeatSize);
    my $Rlen = length($R);

    open(O, "> reads.genome.fasta");
    print O ">G\n";

    for (my $r=0; $r < $nUnique; $r++) {
        print STDERR "REPEAT $len ", $len + $Rlen, "\n";
        print O "$R\n";
        $len += $Rlen;

        my $U    = randomSequence($uniqueSize);
        my $Ulen = length($U);

        print STDERR "UNIQUE $len ", $len + $Ulen, "\n";
        print O "$U\n";
        $len += $Ulen;
    }

    print STDERR "REPEAT $len ", $len + $Rlen, "\n";
    print O "$R\n";
    $len += $Rlen;

    close(O);
}

if (! -e "reads.s.fastq") {
    system("fastqSimulate -f reads.genome.fasta -o readsL -l $readLenL -x $coverageL -em 0.005 -ei 0 -ed 0 -se");
    system("fastqSimulate -f reads.genome.fasta -o readsM -l $readLenM -x $coverageM -em 0.005 -ei 0 -ed 0 -se");
    system("fastqSimulate -f reads.genome.fasta -o readsS -l $readLenS -x $coverageS -em 0.005 -ei 0 -ed 0 -se");
}

my $canu;

$canu  = "canu -assemble -p test -d test";
$canu .= " ovlMerThreshold=0";
$canu .= " useGrid=0 genomeSize=1m ovlThreads=24";
$canu .= " -pacbio-corrected readsL.s.fastq";
$canu .= " -pacbio-corrected readsM.s.fastq";
$canu .= " -pacbio-corrected readsS.s.fastq";

system($canu);

system("dotplot.sh UTG reads.genome.fasta test/test.unitigs.fasta");
system("dotplot.sh CTG reads.genome.fasta test/test.contigs.fasta");

exit(0);

canu-1.6/src/stores/
canu-1.6/src/stores/gatekeeperCreate.C

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  This file is derived from:
 *
 *    src/AS_GKP/gkpStoreCreate.C
 *
 *  Modifications by:
 *
 *    Brian P. Walenz from 2012-FEB-06 to 2013-AUG-01
 *      are Copyright 2012-2013 J. Craig Venter Institute, and
 *      are subject to the GNU General Public License version 2
 *
 *    Brian P. Walenz from 2014-OCT-09 to 2015-AUG-10
 *      are Copyright 2014-2015 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *    Brian P. Walenz beginning on 2015-NOV-08
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#include "AS_global.H"
#include "gkStore.H"
#include "findKeyAndValue.H"
#include "AS_UTL_fileIO.H"

#undef  UPCASE   //  Don't convert lowercase to uppercase, special case for testing alignments.
#define UPCASE   //  Convert lowercase to uppercase.  Probably needed.

//  Canu doesn't really use QVs as of late 2015.  Only consensus is aware of them, but since read
//  correction doesn't output QVs, the QVs that consensus sees are totally bogus.  So, disable
//  storage of them - gatekeeperCreate will store a default QV for every read, and consensus will
//  process as usual.
//
#define DO_NOT_STORE_QVs

//  Support fastq or fasta, even in the same file.
//  Eventually want to support bax.h5 natively.

uint32  validSeq[256] = {0};

uint32
loadFASTA(char *L, char *H, char *S, uint32 &Slen, char *Q, compressedFileReader *F, FILE *errorLog, uint32 &nWARNS) {
  uint32  nLines = 0;     //  Lines read from the input
  uint32  nBases = 0;     //  Bases read from the input, used for reporting errors
  bool    valid  = true;

  //  We've already read the header.  It's in L.  But we want to use L to load the sequence, so the
  //  header is copied to H.  We need to do a copy anyway - we need to return the next header in L.
// We can either copy L to H now, or use H to read lines and copy H to L at the end. We're stuck // either way. strcpy(H, L + 1); // Clear the sequence. S[0] = 0; Q[0] = 0; // Sentinel to tell gatekeeper to use the fixed QV value Slen = 0; // Load sequence. This is a bit tricky, since we need to peek ahead // and stop reading before the next header is loaded. Instead, we read the // next line into what we'd read the header into outside here. fgets(L, AS_MAX_READLEN+1, F->file()); nLines++; chomp(L); // Catch empty reads - reads with no sequence line at all. if (L[0] == '>') { fprintf(errorLog, "read '%s' is empty.\n", H); nWARNS++; return(nLines); } // Copy in the sequence, as long as it is valid sequence. If any invalid letters // are found, set the base to 'N'. uint32 baseErrors = 0; while ((!feof(F->file())) && (L[0] != '>')) { nBases += strlen(L); // Could do this in the loop below, but it makes it ugly. for (uint32 i=0; (Slen < AS_MAX_READLEN) && (L[i] != 0); i++) { switch (L[i]) { #ifdef UPCASE case 'a': S[Slen] = 'A'; break; case 'c': S[Slen] = 'C'; break; case 'g': S[Slen] = 'G'; break; case 't': S[Slen] = 'T'; break; #else case 'a': S[Slen] = 'a'; break; case 'c': S[Slen] = 'c'; break; case 'g': S[Slen] = 'g'; break; case 't': S[Slen] = 't'; break; #endif case 'A': S[Slen] = 'A'; break; case 'C': S[Slen] = 'C'; break; case 'G': S[Slen] = 'G'; break; case 'T': S[Slen] = 'T'; break; case 'n': S[Slen] = 'N'; break; case 'N': S[Slen] = 'N'; break; default: baseErrors++; S[Slen] = 'N'; break; } Slen++; } // Grab the next line. It should be more sequence, or the next header, or eof. // The last two are stop conditions for the while loop. L[0] = 0; fgets(L, AS_MAX_READLEN+1, F->file()); nLines++; chomp(L); } // Terminate the sequence. S[Slen] = 0; // Report errors. if (baseErrors > 0) { fprintf(errorLog, "read '%s' has " F_U32 " invalid base%s. Converted to 'N'.\n", H, baseErrors, (baseErrors > 1) ? 
"s" : ""); nWARNS++; } if (Slen == 0) { fprintf(errorLog, "read '%s' is empty.\n", H); nWARNS++; } if (Slen != nBases) { fprintf(errorLog, "read '%s' is too long; contains %u bases, but we can only handle %u.\n", H, nBases, AS_MAX_READLEN); nWARNS++; } // Do NOT clear L, it contains the next header. return(nLines); } uint32 loadFASTQ(char *L, char *H, char *S, uint32 &Slen, char *Q, compressedFileReader *F, FILE *errorLog, uint32 &nWARNS) { // We've already read the header. It's in L. strcpy(H, L + 1); // Load sequence. S[0] = 0; Slen = 0; S[AS_MAX_READLEN+1-2] = 0; // If this is ever set, the read is probably longer than we can support. S[AS_MAX_READLEN+1-1] = 0; // This will always be zero; fgets() sets it. Q[AS_MAX_READLEN+1-2] = 0; // This too. Q[AS_MAX_READLEN+1-1] = 0; fgets(S, AS_MAX_READLEN+1, F->file()); chomp(S); // Check for long reads. If found, read the rest of the line, and report an error. The -1 (in // the print) is because fgets() and strlen() will count the newline, which isn't a base. if ((S[AS_MAX_READLEN+1-2] != 0) && (S[AS_MAX_READLEN+1-2] != '\n')) { char *overflow = new char [1048576]; uint32 nBases = AS_MAX_READLEN; do { overflow[1048576-2] = 0; overflow[1048576-1] = 0; fgets(overflow, 1048576, F->file()); nBases += strlen(overflow); } while (overflow[1048576-2] != 0); fprintf(errorLog, "read '%s' is too long; contains %u bases, but we can only handle %u.\n", H, nBases-1, AS_MAX_READLEN); nWARNS++; delete [] overflow; } // Check for and correct invalid bases. 
uint32 baseErrors = 0; for (uint32 i=0; (Slen < AS_MAX_READLEN) && (S[i] != 0); i++) { switch (S[i]) { #ifdef UPCASE case 'a': S[i] = 'A'; break; case 'c': S[i] = 'C'; break; case 'g': S[i] = 'G'; break; case 't': S[i] = 'T'; break; #else case 'a': break; case 'c': break; case 'g': break; case 't': break; #endif case 'A': break; case 'C': break; case 'G': break; case 'T': break; case 'n': S[i] = 'N'; break; case 'N': break; default: S[i] = 'N'; Q[i] = '!'; // QV=0, ASCII=33 baseErrors++; break; } Slen++; } if (baseErrors > 0) { fprintf(errorLog, "read '%s' has " F_U32 " invalid base%s. Converted to 'N'.\n", L, baseErrors, (baseErrors > 1) ? "s" : ""); nWARNS++; } // Load the qv header, and then load the qvs themselves over the header. Q[0] = 0; fgets(Q, AS_MAX_READLEN+1, F->file()); fgets(Q, AS_MAX_READLEN+1, F->file()); chomp(Q); // As with the base, we need to suck in the rest of the longer-than-allowed QV string. But we don't need to report it // or do anything fancy, just advance the file pointer. if ((Q[AS_MAX_READLEN-1] != 0) && (Q[AS_MAX_READLEN-1] != '\n')) { char *overflow = new char [1048576]; do { overflow[1048576-2] = 0; overflow[1048576-1] = 0; fgets(overflow, 1048576, F->file()); } while (overflow[1048576-2] != 0); delete [] overflow; } // Convert from the (assumed to be) Sanger QVs to plain ol' integers. uint32 QVerrors = 0; #ifndef DO_NOT_STORE_QVs for (uint32 i=0; Q[i]; i++) { if (Q[i] < '!') { // QV=0, ASCII=33 Q[i] = '!'; QVerrors++; } if (Q[i] > '!' + 60) { // QV=60, ASCII=93=']' Q[i] = '!' + 60; QVerrors++; } } if (QVerrors > 0) { fprintf(errorLog, "read '%s' has " F_U32 " invalid QV%s. Converted to min or max value.\n", L, QVerrors, (QVerrors > 1) ? "s" : ""); nWARNS++; } #else // If we're not using QVs, just reset the first value to -1. This is the sentinel that FASTA sequences set, // causing the encoding later to use a fixed QV for all bases. Q[0] = 0; #endif // Clear the lines, so we can load the next one. 
L[0] = 0; return(4); // FASTQ always reads exactly four lines } void loadReads(gkStore *gkpStore, gkLibrary *gkpLibrary, uint32 gkpFileID, uint32 minReadLength, FILE *nameMap, FILE *htmlLog, FILE *errorLog, char *fileName, uint32 &nWARNS, uint32 &nLOADED, uint64 &bLOADED, uint32 &nSKIPPED, uint64 &bSKIPPED) { char *L = new char [AS_MAX_READLEN + 1]; // +1. One for the newline, and one for the terminating nul. char *H = new char [AS_MAX_READLEN + 1]; char *S = new char [AS_MAX_READLEN + 1]; char *Q = new char [AS_MAX_READLEN + 1]; uint32 Slen = 0; uint32 Qlen = 0; uint64 lineNumber = 1; fprintf(stderr, "\n"); fprintf(stderr, " Loading reads from '%s'\n", fileName); fprintf(htmlLog, "nam " F_U32 " %s\n", gkpFileID, fileName); fprintf(htmlLog, "lib preset=N/A"); fprintf(htmlLog, " defaultQV=%u", gkpLibrary->gkLibrary_defaultQV()); fprintf(htmlLog, " isNonRandom=%s", gkpLibrary->gkLibrary_isNonRandom() ? "true" : "false"); fprintf(htmlLog, " removeDuplicateReads=%s", gkpLibrary->gkLibrary_removeDuplicateReads() ? "true" : "false"); fprintf(htmlLog, " finalTrim=%s", gkpLibrary->gkLibrary_finalTrim() ? "true" : "false"); fprintf(htmlLog, " removeSpurReads=%s", gkpLibrary->gkLibrary_removeSpurReads() ? "true" : "false"); fprintf(htmlLog, " removeChimericReads=%s", gkpLibrary->gkLibrary_removeChimericReads() ? "true" : "false"); fprintf(htmlLog, " checkForSubReads=%s\n", gkpLibrary->gkLibrary_checkForSubReads() ? 
"true" : "false"); compressedFileReader *F = new compressedFileReader(fileName); uint32 nFASTAlocal = 0; // number of sequences read from disk uint32 nFASTQlocal = 0; uint32 nWARNSlocal = 0; uint32 nLOADEDAlocal = 0; // Sequences actaully loaded into the store uint32 nLOADEDQlocal = 0; uint64 bLOADEDAlocal = 0; uint64 bLOADEDQlocal = 0; uint32 nSKIPPEDAlocal = 0; // Sequences skipped because they are too short uint32 nSKIPPEDQlocal = 0; uint64 bSKIPPEDAlocal = 0; uint64 bSKIPPEDQlocal = 0; fgets(L, AS_MAX_READLEN+1, F->file()); chomp(L); while (!feof(F->file())) { bool isFASTA = false; bool isFASTQ = false; if (L[0] == '>') { lineNumber += loadFASTA(L, H, S, Slen, Q, F, errorLog, nWARNSlocal); isFASTA = true; nFASTAlocal++; } else if (L[0] == '@') { lineNumber += loadFASTQ(L, H, S, Slen, Q, F, errorLog, nWARNSlocal); isFASTQ = true; nFASTQlocal++; } else { fprintf(errorLog, "invalid read header '%.40s%s' in file '%s' at line " F_U64 ", skipping.\n", L, (strlen(L) > 80) ? "..." : "", fileName, lineNumber); L[0] = 0; nWARNSlocal++; } // If S[0] isn't nul, we loaded a sequence and need to store it. if (Slen < minReadLength) { fprintf(errorLog, "read '%s' of length " F_U32 " in file '%s' at line " F_U64 " is too short, skipping.\n", H, Slen, fileName, lineNumber); if (isFASTA) { nSKIPPEDAlocal += 1; bSKIPPEDAlocal += Slen; } if (isFASTQ) { nSKIPPEDQlocal += 1; bSKIPPEDQlocal += Slen; } S[0] = 0; Q[0] = 0; } if (S[0] != 0) { gkRead *nr = gkpStore->gkStore_addEmptyRead(gkpLibrary); gkReadData *nd = nr->gkRead_encodeSeqQlt(H, S, Q, gkpLibrary->gkLibrary_defaultQV()); gkpStore->gkStore_stashReadData(nr, nd); delete nd; if (isFASTA) { nLOADEDAlocal += 1; bLOADEDAlocal += Slen; } if (isFASTQ) { nLOADEDQlocal += 1; bLOADEDQlocal += Slen; } fprintf(nameMap, F_U32"\t%s\n", gkpStore->gkStore_getNumReads(), H); } // If L[0] is nul, we need to load the next line. If not, the next line is the header (from // the fasta loader). 
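The loadReads() loop above dispatches on the first character of each record header: '>' routes to the FASTA loader, '@' to the FASTQ loader, anything else is logged and skipped. As a tiny standalone classifier; `RecordType` and `recordType` are illustrative names, not canu's:

```cpp
#include <cassert>

// Classify a record by the first character of its header line,
// mirroring the '>' / '@' dispatch in loadReads().
enum RecordType { rtFASTA, rtFASTQ, rtINVALID };

inline RecordType
recordType(char const *line) {
  if (line[0] == '>')  return(rtFASTA);    // FASTA header
  if (line[0] == '@')  return(rtFASTQ);    // FASTQ header
  return(rtINVALID);                       // warned about and skipped
}
```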
if (L[0] == 0) { fgets(L, AS_MAX_READLEN+1, F->file()); lineNumber++; chomp(L); } } delete F; delete [] Q; delete [] S; delete [] H; delete [] L; lineNumber--; // The last fgets() returns EOF, but we still count the line. // Write status to the screen fprintf(stderr, " Processed " F_U64 " lines.\n", lineNumber); fprintf(stderr, " Loaded " F_U64 " bp from:\n", bLOADEDAlocal + bLOADEDQlocal); if (nFASTAlocal > 0) fprintf(stderr, " " F_U32 " FASTA format reads (" F_U64 " bp).\n", nFASTAlocal, bLOADEDAlocal); if (nFASTQlocal > 0) fprintf(stderr, " " F_U32 " FASTQ format reads (" F_U64 " bp).\n", nFASTQlocal, bLOADEDQlocal); if (nWARNSlocal > 0) fprintf(stderr, " WARNING: " F_U32 " reads issued a warning.\n", nWARNSlocal); if (nSKIPPEDAlocal > 0) fprintf(stderr, " WARNING: " F_U32 " reads (%0.4f%%) with " F_U64 " bp (%0.4f%%) were too short (< " F_U32 "bp) and were ignored.\n", nSKIPPEDAlocal, 100.0 * nSKIPPEDAlocal / (nSKIPPEDAlocal + nLOADEDAlocal), bSKIPPEDAlocal, 100.0 * bSKIPPEDAlocal / (bSKIPPEDAlocal + bLOADEDAlocal), minReadLength); if (nSKIPPEDQlocal > 0) fprintf(stderr, " WARNING: " F_U32 " reads (%0.4f%%) with " F_U64 " bp (%0.4f%%) were too short (< " F_U32 "bp) and were ignored.\n", nSKIPPEDQlocal, 100.0 * nSKIPPEDQlocal / (nSKIPPEDQlocal + nLOADEDQlocal), bSKIPPEDQlocal, 100.0 * bSKIPPEDQlocal / (bSKIPPEDQlocal + bLOADEDQlocal), minReadLength); // Write status to HTML fprintf(htmlLog, "dat " F_U32 " " F_U64 " " F_U32 " " F_U64 " " F_U32 " " F_U64 " " F_U32 " " F_U64 " " F_U32 "\n", nLOADEDAlocal, bLOADEDAlocal, nSKIPPEDAlocal, bSKIPPEDAlocal, nLOADEDQlocal, bLOADEDQlocal, nSKIPPEDQlocal, bSKIPPEDQlocal, nWARNSlocal); // Add the just loaded numbers to the global numbers nWARNS += nWARNSlocal; nLOADED += nLOADEDAlocal + nLOADEDQlocal; bLOADED += bLOADEDAlocal + bLOADEDQlocal; nSKIPPED += nSKIPPEDAlocal + nSKIPPEDQlocal; bSKIPPED += bSKIPPEDAlocal + bSKIPPEDQlocal; }; int main(int argc, char **argv) { char *gkpStoreName = NULL; char *outPrefix = NULL; 
gkStore_mode mode = gkStore_create; uint32 minReadLength = 0; uint32 firstFileArg = 0; char errorLogName[FILENAME_MAX]; char htmlLogName[FILENAME_MAX]; char nameMapName[FILENAME_MAX]; argc = AS_configure(argc, argv); int arg = 1; int err = 0; while (arg < argc) { if (strcmp(argv[arg], "-o") == 0) { // This previously used gkStore_append here, but if mode = gkStore_create; // two instances of gatekeeperCreate were run in the gkpStoreName = argv[++arg]; // same directory, they would clobber each other, // generating a blobs file that is a mix of both. } else if (strcmp(argv[arg], "-a") == 0) { // So now the -c will fail if any trace of a store mode = gkStore_extend; // exists in the output location, and -a will gkpStoreName = argv[++arg]; // blindly add reads to an existing set of files } else if (strcmp(argv[arg], "-minlength") == 0) { minReadLength = atoi(argv[++arg]); } else if (strcmp(argv[arg], "--") == 0) { firstFileArg = arg++; break; } else if (argv[arg][0] == '-') { fprintf(stderr, "ERROR: unknown option '%s'\n", argv[arg]); err++; } else { firstFileArg = arg; break; } arg++; } if (gkpStoreName == NULL) err++; if (firstFileArg == 0) err++; if (err) { fprintf(stderr, "usage: %s [...] -o gkpStore\n", argv[0]); fprintf(stderr, " -o gkpStore create this gkpStore\n"); fprintf(stderr, " -a gkpStore append to this gkpStore\n"); fprintf(stderr, " \n"); fprintf(stderr, " -minlength L discard reads shorter than L\n"); fprintf(stderr, " \n"); fprintf(stderr, " \n"); if (gkpStoreName == NULL) fprintf(stderr, "ERROR: no gkpStore (-o) supplied.\n"); if (firstFileArg == 0) fprintf(stderr, "ERROR: no input files supplied.\n"); exit(1); } gkStore *gkpStore = gkStore::gkStore_open(gkpStoreName, mode); gkRead *gkpRead = NULL; gkLibrary *gkpLibrary = NULL; uint32 gkpFileID = 0; // Used for HTML output, an ID for each file loaded. 
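The option scan above stops consuming options at either a literal "--" or the first token that does not start with '-'; everything from there on is treated as an input file. A simplified sketch of that convention — `firstFile` is illustrative, and unlike the real loop it assumes option values have already been consumed:

```cpp
#include <cassert>
#include <cstring>

// Return the index of the first input-file argument: options run
// until a literal "--" (files start after it) or until the first
// token that doesn't begin with '-'.
inline int
firstFile(int argc, char const *argv[], int arg) {
  for (; arg < argc; arg++) {
    if (strcmp(argv[arg], "--") == 0)   // explicit end of options
      return(arg + 1);
    if (argv[arg][0] != '-')            // first non-option token
      return(arg);
  }
  return(argc);                         // no input files supplied
}
```

Returning `argc` when nothing qualifies lets the caller detect the "no input files supplied" error case with a single comparison.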
uint32 inLineLen = 1024; char inLine[1024] = { 0 }; validSeq['a'] = validSeq['c'] = validSeq['g'] = validSeq['t'] = validSeq['n'] = 1; validSeq['A'] = validSeq['C'] = validSeq['G'] = validSeq['T'] = validSeq['N'] = 1; errno = 0; snprintf(errorLogName, FILENAME_MAX, "%s/errorLog", gkpStoreName); FILE *errorLog = fopen(errorLogName, "w"); if (errno) fprintf(stderr, "ERROR: cannot open error file '%s': %s\n", errorLogName, strerror(errno)), exit(1); snprintf(htmlLogName, FILENAME_MAX, "%s/load.dat", gkpStoreName); FILE *htmlLog = fopen(htmlLogName, "w"); if (errno) fprintf(stderr, "ERROR: cannot open uid map file '%s': %s\n", htmlLogName, strerror(errno)), exit(1); snprintf(nameMapName, FILENAME_MAX, "%s/readNames.txt", gkpStoreName); FILE *nameMap = fopen(nameMapName, "w"); if (errno) fprintf(stderr, "ERROR: cannot open uid map file '%s': %s\n", nameMapName, strerror(errno)), exit(1); uint32 nERROR = 0; // There aren't any errors, we just exit fatally if encountered. uint32 nWARNS = 0; uint32 nLOADED = 0; // Reads loaded uint64 bLOADED = 0; // Bases loaded uint32 nSKIPPED = 0; uint64 bSKIPPED = 0; // Bases not loaded, too short for (; firstFileArg < argc; firstFileArg++) { fprintf(stderr, "\n"); fprintf(stderr, "Starting file '%s'.\n", argv[firstFileArg]); compressedFileReader *inFile = new compressedFileReader(argv[firstFileArg]); char *line = new char [10240]; char *linekv = new char [10240]; KeyAndValue keyval; while (fgets(line, 10240, inFile->file()) != NULL) { chomp(line); strcpy(linekv, line); // keyval.find() modifies the input line, adding a nul byte to split the key and value. keyval.find(linekv); if (keyval.key() == NULL) { // No key, so must be a comment or blank line continue; } if (strcasecmp(keyval.key(), "name") == 0) { gkpLibrary = gkpStore->gkStore_addEmptyLibrary(keyval.value()); continue; } // We'd better have a gkpLibrary defined, if not, the .gkp input file is incorrect. 
if (gkpLibrary == NULL) { fprintf(stderr, "WARNING: no 'name' tag in gkp input; creating library with name 'DEFAULT'.\n"); gkpLibrary = gkpStore->gkStore_addEmptyLibrary(keyval.value()); nWARNS++; } if (strcasecmp(keyval.key(), "preset") == 0) { gkpLibrary->gkLibrary_parsePreset(keyval.value()); } else if (strcasecmp(keyval.key(), "qv") == 0) { gkpLibrary->gkLibrary_setDefaultQV(keyval.value_double()); } else if (strcasecmp(keyval.key(), "isNonRandom") == 0) { gkpLibrary->gkLibrary_setIsNonRandom(keyval.value_bool()); } else if (strcasecmp(keyval.key(), "readType") == 0) { gkpLibrary->gkLibrary_setReadType(keyval.value()); } else if (strcasecmp(keyval.key(), "removeDuplicateReads") == 0) { gkpLibrary->gkLibrary_setRemoveDuplicateReads(keyval.value_bool()); } else if (strcasecmp(keyval.key(), "finalTrim") == 0) { gkpLibrary->gkLibrary_setFinalTrim(keyval.value()); } else if (strcasecmp(keyval.key(), "removeSpurReads") == 0) { gkpLibrary->gkLibrary_setRemoveSpurReads(keyval.value_bool()); } else if (strcasecmp(keyval.key(), "removeChimericReads") == 0) { gkpLibrary->gkLibrary_setRemoveChimericReads(keyval.value_bool()); } else if (strcasecmp(keyval.key(), "checkForSubReads") == 0) { gkpLibrary->gkLibrary_setCheckForSubReads(keyval.value_bool()); } else if (AS_UTL_fileExists(line, false, false)) { loadReads(gkpStore, gkpLibrary, gkpFileID++, minReadLength, nameMap, htmlLog, errorLog, line, nWARNS, nLOADED, bLOADED, nSKIPPED, bSKIPPED); } else { fprintf(stderr, "ERROR: option '%s' not recognized, and not a file of reads.\n", line); exit(1); } } delete inFile; delete [] line; delete [] linekv; } gkpStore->gkStore_close(); fclose(nameMap); fclose(errorLog); fprintf(stderr, "\n"); fprintf(stderr, "Finished with:\n"); fprintf(stderr, " " F_U32 " warnings (bad base or qv, too short, too long)\n", nWARNS); fprintf(stderr, "\n"); fprintf(stderr, "Loaded into store:\n"); fprintf(stderr, " " F_U64 " bp.\n", bLOADED); fprintf(stderr, " " F_U32 " reads.\n", nLOADED); 
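The summary reporting that follows prints skipped reads and bases as percentages, guarding each division with a `(x + y > 0) ? ... : 0` ternary so an empty store doesn't divide by zero. The same guard as a small helper; `pct` is an illustrative name:

```cpp
#include <cassert>
#include <cstdint>

// Percentage of 'part' within 'whole', returning 0 when 'whole' is
// zero so empty inputs don't divide by zero.
inline double
pct(uint64_t part, uint64_t whole) {
  return((whole > 0) ? (100.0 * part / whole) : 0.0);
}
```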
fprintf(stderr, "\n"); fprintf(stderr, "Skipped (too short):\n"); fprintf(stderr, " " F_U64 " bp (%.4f%%).\n", bSKIPPED, (bSKIPPED + bLOADED > 0) ? (100.0 * bSKIPPED / (bSKIPPED + bLOADED)) : 0); fprintf(stderr, " " F_U32 " reads (%.4f%%).\n", nSKIPPED, (nSKIPPED + nLOADED > 0) ? (100.0 * nSKIPPED / (nSKIPPED + nLOADED)) : 0); fprintf(stderr, "\n"); fprintf(stderr, "\n"); fprintf(htmlLog, "sum " F_U32 " " F_U64 " " F_U32 " " F_U64 " " F_U32 "\n", nLOADED, bLOADED, nSKIPPED, bSKIPPED, nWARNS); fclose(htmlLog); if (nERROR > 0) fprintf(stderr, "gatekeeperCreate did NOT finish successfully; too many errors.\n"); if (bSKIPPED > 0.25 * (bSKIPPED + bLOADED)) fprintf(stderr, "gatekeeperCreate did NOT finish successfully; too many bases skipped. Check your reads.\n"); if (nWARNS > 0.25 * (nLOADED)) fprintf(stderr, "gatekeeperCreate did NOT finish successfully; too many warnings. Check your reads.\n"); if (nSKIPPED > 0.50 * (nLOADED)) fprintf(stderr, "gatekeeperCreate did NOT finish successfully; too many short reads. Check your reads!\n"); if ((nERROR > 0) || (bSKIPPED > 0.25 * (bSKIPPED + bLOADED)) || (nWARNS > 0.25 * (nSKIPPED + nLOADED)) || (nSKIPPED > 0.50 * (nSKIPPED + nLOADED))) exit(1); fprintf(stderr, "gatekeeperCreate finished successfully.\n"); exit(0); } canu-1.6/src/stores/gatekeeperCreate.mk000066400000000000000000000007411314437614700202210ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := gatekeeperCreate SOURCES := gatekeeperCreate.C SRC_INCDIRS := .. 
../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/stores/gatekeeperDumpFASTQ.C000066400000000000000000000307571314437614700203070ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_GKP/gkpStoreDumpFASTQ.C * * Modifications by: * * Brian P. Walenz from 2012-FEB-06 to 2013-AUG-01 * are Copyright 2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-NOV-21 to 2015-JUN-03 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-NOV-23 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "AS_UTL_decodeRange.H" #include "AS_UTL_fileIO.H" #include "AS_UTL_fasta.H" #include "clearRangeFile.H" // Write sequence in multiple formats. This used to write to four fastq files, the .1, .2, .paired and .unmated. // It's left around for future expansion to .fastq and .bax.h5. 
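The libOutput class defined next opens its FASTA/FASTQ streams lazily: getFASTQ()/getFASTA() open the file only on first use, so per-library outputs that never receive a read are never created on disk. A stripped-down sketch of that pattern; `LazyFile` is an illustrative name, not canu's:

```cpp
#include <cassert>
#include <cstdio>

// Lazy-open file wrapper: the stream is created on first access and
// reused afterward, so unused outputs never touch the filesystem.
class LazyFile {
public:
  LazyFile(char const *name) : _name(name), _file(NULL) {};
  ~LazyFile() { if (_file) fclose(_file); };

  FILE *get(void) {              // open on first access only
    if (_file == NULL)
      _file = fopen(_name, "w");
    return(_file);
  };

  bool  isOpen(void) const { return(_file != NULL); };

private:
  char const *_name;
  FILE       *_file;
};
```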
// class libOutput { public: libOutput(char const *outPrefix, char const *outSuffix, char const *libName = NULL) { strcpy(_p, outPrefix); if (outSuffix[0]) snprintf(_s, FILENAME_MAX, ".%s", outSuffix); else _s[0] = 0; if (libName) strcpy(_n, libName); else _n[0] = 0; _WRITER = NULL; _FASTA = NULL; _FASTQ = NULL; }; ~libOutput() { if (_WRITER) delete _WRITER; }; FILE *getFASTQ(void) { if (_FASTQ) return(_FASTQ); return(openFASTQ()); }; FILE *openFASTQ(void) { char N[FILENAME_MAX]; if (_n[0]) snprintf(N, FILENAME_MAX, "%s.%s.fastq%s", _p, _n, _s); else snprintf(N, FILENAME_MAX, "%s.fastq%s", _p, _s); if ((_p[0] == '-') && (_p[1] == 0)) { snprintf(N, FILENAME_MAX, "(stdout)"); _FASTQ = stdout; } else { _WRITER = new compressedFileWriter(N); _FASTQ = _WRITER->file(); } return(_FASTQ); }; FILE *getFASTA(void) { if (_FASTA) return(_FASTA); return(openFASTA()); }; FILE *openFASTA(void) { char N[FILENAME_MAX]; if (_n[0]) snprintf(N, FILENAME_MAX, "%s.%s.fasta%s", _p, _n, _s); else snprintf(N, FILENAME_MAX, "%s.fasta%s", _p, _s); if ((_p[0] == '-') && (_p[1] == 0)) { snprintf(N, FILENAME_MAX, "(stdout)"); _FASTA = stdout; } else { _WRITER = new compressedFileWriter(N); _FASTA = _WRITER->file(); } return(_FASTA); }; private: char _p[FILENAME_MAX]; char _s[FILENAME_MAX]; char _n[FILENAME_MAX]; compressedFileWriter *_WRITER; FILE *_FASTA; FILE *_FASTQ; }; char * scanPrefix(char *prefix) { int32 len = strlen(prefix); if ((len > 3) && (strcasecmp(prefix + len - 3, ".gz") == 0)) { prefix[len-3] = 0; return(prefix + len - 2); } if ((len > 4) && (strcasecmp(prefix + len - 4, ".bz2") == 0)) { prefix[len-4] = 0; return(prefix + len - 3); } if ((len > 3) && (strcasecmp(prefix + len - 3, ".xz") == 0)) { prefix[len-3] = 0; return(prefix + len - 2); } return(prefix + len); } int main(int argc, char **argv) { char *gkpStoreName = NULL; char *outPrefix = NULL; char *outSuffix = NULL; char *clrName = NULL; uint32 libToDump = 0; uint32 bgnID = 1; uint32 endID = UINT32_MAX; bool dumpAllReads 
= false; bool dumpAllBases = false; bool dumpOnlyDeleted = false; bool dumpFASTQ = true; bool dumpFASTA = false; bool withLibName = true; bool withReadName = true; argc = AS_configure(argc, argv); int arg = 1; int err = 0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpStoreName = argv[++arg]; } else if (strcmp(argv[arg], "-o") == 0) { outPrefix = argv[++arg]; outSuffix = scanPrefix(outPrefix); } else if (strcmp(argv[arg], "-c") == 0) { clrName = argv[++arg]; } else if (strcmp(argv[arg], "-l") == 0) { libToDump = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-b") == 0) { // DEPRECATED! bgnID = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-e") == 0) { // DEPRECATED! endID = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-r") == 0) { AS_UTL_decodeRange(argv[++arg], bgnID, endID); } else if (strcmp(argv[arg], "-allreads") == 0) { dumpAllReads = true; } else if (strcmp(argv[arg], "-allbases") == 0) { dumpAllBases = true; } else if (strcmp(argv[arg], "-onlydeleted") == 0) { dumpOnlyDeleted = true; dumpAllReads = true; // Otherwise we won't report the deleted reads! } else if (strcmp(argv[arg], "-fastq") == 0) { dumpFASTQ = true; dumpFASTA = false; } else if (strcmp(argv[arg], "-fasta") == 0) { dumpFASTQ = false; dumpFASTA = true; } else if (strcmp(argv[arg], "-nolibname") == 0) { withLibName = false; } else if (strcmp(argv[arg], "-noreadname") == 0) { withReadName = false; } else { err++; fprintf(stderr, "ERROR: unknown option '%s'\n", argv[arg]); } arg++; } if (gkpStoreName == NULL) err++; if (outPrefix == NULL) err++; if (err) { fprintf(stderr, "usage: %s [...] 
-o fastq-prefix -g gkpStore\n", argv[0]); fprintf(stderr, " -G gkpStore\n"); fprintf(stderr, " -o fastq-prefix write files fastq-prefix.(libname).fastq, ...\n"); fprintf(stderr, " if fastq-prefix is '-', all sequences output to stdout\n"); fprintf(stderr, " if fastq-prefix ends in .gz, .bz2 or .xz, output is compressed\n"); fprintf(stderr, "\n"); fprintf(stderr, " -l libToDump output only read in library number libToDump (NOT IMPLEMENTED)\n"); fprintf(stderr, " -r id[-id] output only the single read 'id', or the specified range of ids\n"); fprintf(stderr, "\n"); fprintf(stderr, " -c clearFile clear range file from OBT modules\n"); fprintf(stderr, " -allreads if a clear range file, lower case mask the deleted reads\n"); fprintf(stderr, " -allbases if a clear range file, lower case mask the non-clear bases\n"); fprintf(stderr, " -onlydeleted if a clear range file, only output deleted reads (the entire read)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -fastq output is FASTQ format (with extension .fastq, default)\n"); fprintf(stderr, " -fasta output is FASTA format (with extension .fasta)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -nolibname don't include the library name in the output file name\n"); fprintf(stderr, "\n"); fprintf(stderr, " -noreadname don't include the read name in the sequence header. header will be:\n"); fprintf(stderr, " '>original-name id= clr=, with names\n"); fprintf(stderr, " '> clr=, without names\n"); if (gkpStoreName == NULL) fprintf(stderr, "ERROR: no gkpStore (-G) supplied.\n"); if (outPrefix == NULL) fprintf(stderr, "ERROR: no output prefix (-o) supplied.\n"); exit(1); } gkStore *gkpStore = gkStore::gkStore_open(gkpStoreName); uint32 numReads = gkpStore->gkStore_getNumReads(); uint32 numLibs = gkpStore->gkStore_getNumLibraries(); clearRangeFile *clrRange = (clrName == NULL) ? 
NULL : new clearRangeFile(clrName, gkpStore); if (bgnID < 1) bgnID = 1; if (numReads < endID) endID = numReads; if (endID < bgnID) fprintf(stderr, "No reads to dump; reversed ranges make no sense: bgn=" F_U32 " end=" F_U32 "??\n", bgnID, endID); fprintf(stderr, "Dumping reads from %u to %u (inclusive).\n", bgnID, endID); libOutput **out = new libOutput * [numLibs + 1]; // Allocate outputs. If withLibName == false, all reads will artificially be in lib zero, the // other files won't ever be created. Otherwise, the zeroth file won't ever be created. out[0] = new libOutput(outPrefix, outSuffix, NULL); for (uint32 i=1; i<=numLibs; i++) out[i] = new libOutput(outPrefix, outSuffix, gkpStore->gkStore_getLibrary(i)->gkLibrary_libraryName()); // Grab a new readData, and iterate through reads to dump. gkReadData *readData = new gkReadData; for (uint32 rid=bgnID; rid<=endID; rid++) { gkRead *read = gkpStore->gkStore_getRead(rid); uint32 libID = (withLibName == false) ? 0 : read->gkRead_libraryID(); uint32 flen = read->gkRead_sequenceLength(); uint32 lclr = 0; uint32 rclr = flen; bool ignore = false; //fprintf(stderr, "READ %u claims id %u length %u in lib %u\n", rid, read->gkRead_readID(), read->gkRead_sequenceLength(), libID); // If a clear range file is supplied, grab the clear range. If it hasn't been set, the default // is the entire read. if (clrRange) { lclr = clrRange->bgn(rid); rclr = clrRange->end(rid); ignore = clrRange->isDeleted(rid); } // Abort if we're not dumping anything from this read // - not in a library we care about // - deleted, and not dumping all reads // - not deleted, but only reporting deleted reads if (((libToDump != 0) && (libID == libToDump)) || ((dumpAllReads == false) && (ignore == true)) || ((dumpOnlyDeleted == true) && (ignore == false))) continue; // And if we're told to ignore the read, and here, then the read was deleted and we're printing // all reads. Reset the clear range to the whole read, the clear range is invalid. 
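When -allbases is set, the dump soft-masks the read: bases outside the clear range [lclr, rclr) are lower-cased, bases inside are upper-cased, and the whole read is then printed. The original does this with `'a' - 'A'` arithmetic on the assumption bases are already cased consistently; this standalone sketch uses `tolower`/`toupper` for clarity, and `softMask` is an illustrative name:

```cpp
#include <cassert>
#include <cctype>
#include <cstring>

// Lower-case bases outside the clear range [bgn, end), upper-case the
// bases inside it, in place.
inline void
softMask(char *seq, size_t bgn, size_t end) {
  size_t len = strlen(seq);

  for (size_t i=0; i<len; i++)
    seq[i] = (i < bgn || i >= end)
               ? (char)tolower((unsigned char)seq[i])
               : (char)toupper((unsigned char)seq[i]);
}
```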
if (ignore) { lclr = 0; rclr = read->gkRead_sequenceLength(); } // Grab the sequence and quality. gkpStore->gkStore_loadReadData(read, readData); char *name = readData->gkReadData_getName(); char *seq = readData->gkReadData_getSequence(); char *qlt = readData->gkReadData_getQualities(); uint32 clen = rclr - lclr; // Soft mask not-clear bases if (dumpAllBases == true) { for (uint32 i=0; i= 'A') ? 'a' - 'A' : 0; for (uint32 i=lclr; i= 'A') ? 0 : 'A' - 'a'; for (uint32 i=rclr; i= 'A') ? 'a' - 'A' : 0; lclr = 0; rclr = flen; } // Chop off the ends we're not printing. seq += lclr; qlt += lclr; seq[clen] = 0; qlt[clen] = 0; // Print the read. if (dumpFASTA) // Dear GCC: I'm NOT ambiguous if ((withReadName == true) && (name != NULL)) AS_UTL_writeFastA(out[libID]->getFASTA(), seq, clen, 100, ">%s id=" F_U32 " clr=" F_U32 "," F_U32 "\n", name, rid, lclr, rclr); else AS_UTL_writeFastA(out[libID]->getFASTA(), seq, clen, 100, ">" F_U32 " clr=" F_U32 "," F_U32 "\n", rid, lclr, rclr); if (dumpFASTQ) // Dear GCC: I'm NOT ambiguous if ((withReadName == true) && (name != NULL)) AS_UTL_writeFastQ(out[libID]->getFASTQ(), seq, clen, qlt, clen, "@%s id=" F_U32 " clr=" F_U32 "," F_U32 "\n", name, rid, lclr, rclr); else AS_UTL_writeFastQ(out[libID]->getFASTQ(), seq, clen, qlt, clen, "@" F_U32 " clr=" F_U32 "," F_U32 "\n", rid, lclr, rclr); } delete clrRange; delete readData; for (uint32 i=0; i<=numLibs; i++) delete out[i]; delete [] out; gkpStore->gkStore_close(); exit(0); } canu-1.6/src/stores/gatekeeperDumpFASTQ.mk000066400000000000000000000010111314437614700205110ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := gatekeeperDumpFASTQ SOURCES := gatekeeperDumpFASTQ.C SRC_INCDIRS := .. 
../stores ../AS_UTL ../overlapBasedTrimming TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/stores/gatekeeperDumpMetaData.C000066400000000000000000000200051314437614700210720ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-MAR-18 to 2015-SEP-21 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-NOV-24 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_global.H" #include "gkStore.H" void dumpLibs(gkStore *gkp, uint32 bgnID, uint32 endID) { //fprintf(stderr, "Dumping libraries from %u to %u (inclusive).\n", bgnID, endID); fprintf(stdout, "libID\tnonRandom\treadType\tcorrectBases\tfinalTrim\tremoveDupe\tremoveSpur\tremoveChimer\tcheckSubRead\tdefaultQV\tlibName\n"); for (uint32 lid=bgnID; lid<=endID; lid++) { gkLibrary *library = gkp->gkStore_getLibrary(lid); fprintf(stdout, F_U32"\t" F_U32 "\t%s\t%s\t%s\t" F_U32 "\t" F_U32 "\t" F_U32 "\t" F_U32 "\t" F_U32 "\t%s\n", library->gkLibrary_libraryID(), library->gkLibrary_isNonRandom(), library->gkLibrary_readTypeString(), library->gkLibrary_readCorrectionString(), library->gkLibrary_finalTrimString(), library->gkLibrary_removeDuplicateReads(), library->gkLibrary_removeSpurReads(), library->gkLibrary_removeChimericReads(), library->gkLibrary_checkForSubReads(), library->gkLibrary_defaultQV(), library->gkLibrary_libraryName()); } } void dumpReads(gkStore *gkp, uint32 bgnID, uint32 endID, bool fullDump) { //fprintf(stderr, "Dumping reads from %u to %u (inclusive).\n", bgnID, endID); for (uint32 rid=bgnID; rid<=endID; rid++) { gkRead *read = gkp->gkStore_getReadInPartition(rid); if (read == NULL) continue; if (fullDump == false) fprintf(stdout, F_U32"\t" F_U32 "\t" F_U32 "\n", read->gkRead_readID(), read->gkRead_libraryID(), read->gkRead_sequenceLength()); else fprintf(stdout, F_U32"\t" F_U32 "\t" F_U32 "\t" F_U64 "\t" F_U64 "\n", read->gkRead_readID(), read->gkRead_libraryID(), read->gkRead_sequenceLength(), read->gkRead_mPtr(), read->gkRead_pID()); } } class readStats { public: readStats() { _nBases = 0; _minBases = UINT32_MAX; _maxBases = 0; }; ~readStats() { }; void addRead(gkRead *read) { uint32 l = read->gkRead_sequenceLength(); _readLengths.push_back(l); _nBases += l; if (l < _minBases) _minBases = l; if (_maxBases < l) _maxBases = l; }; uint32 numberOfReads(void) { return(_readLengths.size()); }; uint64 numberOfBases(void) { return(_nBases); }; 
uint64 minBases(void) { return(_minBases); }; uint64 maxBases(void) { return(_maxBases); }; private: vector _readLengths; uint64 _nBases; uint32 _minBases; uint32 _maxBases; }; void dumpStats(gkStore *gkp, uint32 bgnID, uint32 endID) { //fprintf(stderr, "Dumping read statistics from %u to %u (inclusive).\n", bgnID, endID); readStats *rs = new readStats [gkp->gkStore_getNumLibraries() + 1]; for (uint32 rid=bgnID; rid<=endID; rid++) { gkRead *read = gkp->gkStore_getReadInPartition(rid); if (read == NULL) continue; uint32 l = read->gkRead_libraryID(); rs[0].addRead(read); rs[l].addRead(read); } // Stats per library (this mode when -libs -stats is selected?) and global. // Stats include: // number of reads // total bases // min, mean, stddev, max base per read // length histogram plot for (uint32 l=0; lgkStore_getNumLibraries() + 1; l++) { fprintf(stdout, "library " F_U32 " reads " F_U32 " bases: total " F_U64 " ave " F_U64 " min " F_U64 " max " F_U64 "\n", l, rs[l].numberOfReads(), rs[l].numberOfBases(), rs[l].numberOfBases() / rs[l].numberOfReads(), rs[l].minBases(), rs[l].maxBases()); } } int main(int argc, char **argv) { char *gkpStoreName = NULL; uint32 gkpStorePart = UINT32_MAX; bool wantLibs = false; bool wantReads = true; bool wantStats = false; // Useful only for reads bool fullDump = true; uint32 bgnID = 1; uint32 endID = UINT32_MAX; argc = AS_configure(argc, argv); int arg = 1; int err = 0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpStoreName = argv[++arg]; if ((arg+1 < argc) && (argv[arg+1][0] != '-')) gkpStorePart = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-libs") == 0) { wantLibs = true; wantReads = false; wantStats = false; } else if (strcmp(argv[arg], "-reads") == 0) { wantLibs = false; wantReads = true; wantStats = false; } else if (strcmp(argv[arg], "-full") == 0) { fullDump = true; } else if (strcmp(argv[arg], "-stats") == 0) { wantLibs = false; wantReads = false; wantStats = true; } else if (strcmp(argv[arg], "-b") == 0) { 
bgnID = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-e") == 0) { endID = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-r") == 0) { bgnID = atoi(argv[++arg]); endID = bgnID; } else { err++; fprintf(stderr, "ERROR: unknown option '%s'\n", argv[arg]); } arg++; } if (gkpStoreName == NULL) err++; if (err) { fprintf(stderr, "usage: %s -G gkpStore [p] [...]\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, " -G gkpStore [p] dump reads from 'gkpStore', restricted to\n"); fprintf(stderr, " partition 'p', if supplied.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -libs dump information about libraries\n"); fprintf(stderr, " -reads [-full] dump information about reads\n"); fprintf(stderr, " (-full also dumps some storage metadata)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -stats dump summary statistics on reads\n"); fprintf(stderr, "\n"); fprintf(stderr, " -b id output starting at read/library 'id'\n"); fprintf(stderr, " -e id output stopping after read/library 'id'\n"); fprintf(stderr, "\n"); fprintf(stderr, " -r id output only the single read 'id'\n"); fprintf(stderr, "\n"); if (gkpStoreName == NULL) fprintf(stderr, "ERROR: no gkpStore (-G) supplied.\n"); exit(1); } gkStore *gkpStore = gkStore::gkStore_open(gkpStoreName, gkStore_readOnly, gkpStorePart); uint32 numReads = gkpStore->gkStore_getNumReads(); uint32 numLibs = gkpStore->gkStore_getNumLibraries(); if (wantLibs) { if (bgnID < 1) bgnID = 1; if (numLibs < endID) endID = numLibs; } else { if (bgnID < 1) bgnID = 1; if (numReads < endID) endID = numReads; } if (endID < bgnID) fprintf(stderr, "No objects to dump; reversed ranges make no sense: bgn=" F_U32 " end=" F_U32 "??\n", bgnID, endID); if (wantLibs) dumpLibs(gkpStore, bgnID, endID); if (wantReads) dumpReads(gkpStore, bgnID, endID, fullDump); if (wantStats) dumpStats(gkpStore, bgnID, endID); gkpStore->gkStore_close(); exit(0); } 
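The per-library length statistics kept by readStats above reduce to a running total with min/max tracking. A standalone sketch; `LenStats` is an illustrative name, and the mean is guarded here because dumpStats() divides bases by reads without checking for an empty library:

```cpp
#include <cassert>
#include <cstdint>

// Running read-length statistics: total bases, read count, and the
// shortest/longest read seen so far.
struct LenStats {
  uint64_t nBases = 0;
  uint32_t nReads = 0;
  uint32_t minLen = UINT32_MAX;
  uint32_t maxLen = 0;

  void add(uint32_t l) {
    nBases += l;
    nReads += 1;
    if (l < minLen)  minLen = l;
    if (maxLen < l)  maxLen = l;
  }

  // Integer mean, guarded so an empty library returns 0 instead of
  // dividing by zero.
  uint64_t mean(void) const { return((nReads > 0) ? nBases / nReads : 0); }
};
```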
canu-1.6/src/stores/gatekeeperDumpMetaData.mk000066400000000000000000000007551314437614700213310ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := gatekeeperDumpMetaData SOURCES := gatekeeperDumpMetaData.C SRC_INCDIRS := .. ../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/stores/gatekeeperPartition.C000066400000000000000000000222241314437614700205420ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2014-DEC-23 to 2015-MAR-17 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
 */

#include "AS_global.H"
#include "gkStore.H"
#include "tgStore.H"

//#include "AS_UTL_fileIO.H"

#include <math.h>

uint32 *
buildPartition(char    *tigStoreName,
               uint32   tigStoreVers,
               uint32   readCountTarget,
               uint32   partCountTarget,
               uint32   numReads) {
  tgStore *tigStore = new tgStore(tigStoreName, tigStoreVers);

  // Decide on how many reads per partition. We take two targets, the partCountTarget
  // is used to decide how many partitions to make, but if there are too few reads in
  // each partition, we'll reset to readCountTarget.

  if (readCountTarget < numReads / partCountTarget)
    readCountTarget = numReads / partCountTarget;

  // Figure out how many partitions we'll make, then spread the reads equally through them.

  uint32 numParts = (uint32)ceil((double)numReads / readCountTarget);

  readCountTarget = 1 + numReads / numParts;

  fprintf(stderr, "For %u reads, will make %u partition%s with up to %u reads%s.\n",
          numReads,
          (numParts), (numParts == 1) ? "" : "s",
          readCountTarget,
          (numParts == 1) ? "" : " in each");

  // Allocate space for the partitioning.

  uint32 *readToPart = new uint32 [numReads + 1];

  for (uint32 i=0; i<=numReads; i++)   // All reads are in invalid
    readToPart[i] = UINT32_MAX;        // partitions, initially.

  // Run through all tigs and partition!

  uint32 partCount = 1;
  uint32 tigsCount = 0;
  uint32 readCount = 0;

  for (uint32 ti=0; ti<tigStore->numTigs(); ti++) {
    if (tigStore->isDeleted(ti))
      continue;

    tgTig *tig = tigStore->loadTig(ti);

    // Move to the next partition if needed

    if ((readCount + tig->numberOfChildren() >= readCountTarget) &&
        (readCount > 0)) {
      fprintf(stderr, "Partition %d has %d tigs and %d reads.\n",
              partCount, tigsCount, readCount);

      partCount++;
      tigsCount = 0;
      readCount = 0;
    }

    // Assign all the reads in this tig to this partition.
    readCount += tig->numberOfChildren();

    for (uint32 ci=0; ci<tig->numberOfChildren(); ci++)
      readToPart[tig->getChild(ci)->ident()] = partCount;

    tigStore->unloadTig(ti);
  }

  if (readCount > 0)
    fprintf(stderr, "Partition %d has %d tigs and %d reads.\n",
            partCount, tigsCount, readCount);

  delete tigStore;

  return(readToPart);
}



int
main(int argc, char **argv) {
  char      *gkpStorePath    = NULL;
  char      *tigStorePath    = NULL;
  uint32     tigStoreVers    = 0;
  uint32     readCountTarget = 2500;   //  No partition smaller than this
  uint32     partCountTarget = 200;    //  No more than this many partitions
  bool       doDelete        = false;

  argc = AS_configure(argc, argv);

  vector<char const *>  err;
  int                   arg = 1;

  while (arg < argc) {
    if        (strcmp(argv[arg], "-G") == 0) {
      gkpStorePath = argv[++arg];

    } else if (strcmp(argv[arg], "-T") == 0) {
      tigStorePath = argv[++arg];
      tigStoreVers = atoi(argv[++arg]);

    } else if (strcmp(argv[arg], "-b") == 0) {
      readCountTarget = atoi(argv[++arg]);

    } else if (strcmp(argv[arg], "-p") == 0) {
      partCountTarget = atoi(argv[++arg]);

    } else if (strcmp(argv[arg], "-D") == 0) {
      tigStorePath = argv[++arg];
      tigStoreVers = 1;
      doDelete     = true;

    } else {
      char *s = new char [1024];
      snprintf(s, 1024, "ERROR: unknown option '%s'\n", argv[arg]);
      err.push_back(s);
    }

    arg++;
  }

  if ((gkpStorePath == NULL) && (doDelete == false))
    err.push_back("ERROR: no gkpStore (-G) supplied.\n");
  if (tigStorePath == NULL)
    err.push_back("ERROR: no tigStore (-T) supplied.\n");

  if (err.size() > 0) {
    fprintf(stderr, "usage: %s [-G <gkpStore> -T <tigStore> <vers>] ...\n", argv[0]);
    fprintf(stderr, "       %s [-D <gkpStore>]\n", argv[0]);
    fprintf(stderr, "  -G <gkpStore>         path to gatekeeper store\n");
    fprintf(stderr, "  -T <tigStore> <vers>  path to tig store and version to be partitioned\n");
    fprintf(stderr, "  -D <gkpStore>         remove a partitioned gkpStore\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "  -b <reads>            minimum number of reads per partition (50000)\n");
    fprintf(stderr, "  -p <partitions>       number of partitions (200)\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "Create a partitioned copy of <gkpStore> and place it in <tigStore>/partitionedReads.gkpStore\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "NOTE:  Path handling in this is probably quite brittle.  Due to an implementation\n");
    fprintf(stderr, "       detail, the new store must have symlinks back to the original store.  Canu\n");
    fprintf(stderr, "       wants to use relative paths, and this program tries to adjust <gkpStore> to\n");
    fprintf(stderr, "       be relative to <tigStore> -- in particular, remove any '..'\n");
    fprintf(stderr, "       or '.' components -- or run this from a higher/lower directory.\n");

    for (uint32 ii=0; ii<err.size(); ii++)
      fputs(err[ii], stderr);

    exit(1);
  }

  //  [ ... the original text between the error report and the delete/clone
  //        logic was lost in extraction; the two paths below are reconstructed
  //        from their use in the surviving code ... ]

  char gkpSourcePath[FILENAME_MAX];
  char gkpClonePath[FILENAME_MAX];

  snprintf(gkpSourcePath, FILENAME_MAX, "%s", (gkpStorePath) ? gkpStorePath : "");
  snprintf(gkpClonePath,  FILENAME_MAX, "%s/partitionedReads.gkpStore", tigStorePath);

  if (doDelete == true) {
    gkStore *gkpStore = gkStore::gkStore_open(gkpClonePath, gkStore_readOnly);

    gkpStore->gkStore_deletePartitions();
    gkpStore->gkStore_close();
  }

  if (doDelete == false) {
    gkStore::gkStore_clone(gkpSourcePath, gkpClonePath);                        //  Make the clone.

    gkStore *gkpStore = gkStore::gkStore_open(gkpClonePath, gkStore_readOnly);  //  Open the clone.

    uint32  *partition = buildPartition(tigStorePath, tigStoreVers,   //  Scan all the tigs
                                        readCountTarget,              //  to build a map from
                                        partCountTarget,              //  read to partition.
                                        gkpStore->gkStore_getNumReads());

    gkpStore->gkStore_buildPartitions(partition);                     //  Build partitions.

    delete [] partition;

    gkpStore->gkStore_close();
  }

  exit(0);
}

canu-1.6/src/stores/gatekeeperPartition.mk

#  If 'make' isn't run from the root directory, we need to set these to
#  point to the upper level build directory.

ifeq "$(strip ${BUILD_DIR})" ""
  BUILD_DIR    := ../$(OSTYPE)-$(MACHINETYPE)/obj
endif
ifeq "$(strip ${TARGET_DIR})" ""
  TARGET_DIR   := ../$(OSTYPE)-$(MACHINETYPE)/bin
endif

TARGET       := gatekeeperPartition
SOURCES      := gatekeeperPartition.C

SRC_INCDIRS  := .. ../AS_UTL

TGT_LDFLAGS  := -L${TARGET_DIR}
TGT_LDLIBS   := -lcanu
TGT_PREREQS  := libcanu.a

SUBMAKEFILES :=

canu-1.6/src/stores/gkStore.C

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2014-NOV-26 to 2015-AUG-10 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-09 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2015-DEC-09 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "gkStore.H" #include "AS_UTL_fileIO.H" gkStore *gkStore::_instance = NULL; uint32 gkStore::_instanceCount = 0; // Define this to use the original memory mapped file interface to the blobs data. #undef MMAP_BLOBS // Lowest level function to load data into a read. // void gkRead::gkRead_loadData(gkReadData *readData, uint8 *blob) { readData->_read = this; // The resize will only increase the space. if the new is less than the max, it returns immediately. resizeArrayPair(readData->_seq, readData->_qlt, readData->_seqAlloc, readData->_seqAlloc, (uint32)_seqLen+1, resizeArray_doNothing); // One might be tempted to set the readData blob to point to the blob data in the mmap, // but doing so will cause it to be written out again. readData->_blobLen = 0; readData->_blobMax = 0; readData->_blob = NULL; // Make sure that our blob is actually a blob. 
char chunk[5]; if ((blob[0] != 'B') && (blob[1] != 'L') && (blob[2] != 'O') && (blob[3] != 'B')) fprintf(stderr, "Index error in read " F_U32 " %c mPtr " F_U64 " pID " F_U64 " expected BLOB, got %02x %02x %02x %02x '%c%c%c%c'\n", gkRead_readID(), '?', //(_numberOfPartitions == 0) ? 'm' : 'p', _mPtr, _pID, blob[0], blob[1], blob[2], blob[3], blob[0], blob[1], blob[2], blob[3]); assert(blob[0] == 'B'); assert(blob[1] == 'L'); assert(blob[2] == 'O'); assert(blob[3] == 'B'); uint32 blobLen = *((uint32 *)blob + 1); blob += 8; while ((blob[0] != 'S') || (blob[1] != 'T') || (blob[2] != 'O') || (blob[3] != 'P')) { chunk[0] = blob[0]; chunk[1] = blob[1]; chunk[2] = blob[2]; chunk[3] = blob[3]; chunk[4] = 0; uint32 chunkLen = *((uint32 *)blob + 1); if (strncmp(chunk, "VERS", 4) == 0) { } else if (strncmp(chunk, "NAME", 4) == 0) { resizeArray(readData->_name, 0, readData->_nameAlloc, chunkLen + 1, resizeArray_doNothing); memcpy(readData->_name, blob + 8, chunkLen); readData->_name[chunkLen] = 0; } else if (strncmp(chunk, "QSEQ", 4) == 0) { //fprintf(stderr, "QSEQ not supported.\n"); } else if (strncmp(chunk, "USEQ", 4) == 0) { assert(_seqLen <= chunkLen); assert(_seqLen <= readData->_seqAlloc); memcpy(readData->_seq, blob + 8, _seqLen); readData->_seq[_seqLen] = 0; } else if (strncmp(chunk, "UQLT", 4) == 0) { assert(_seqLen <= chunkLen); assert(_seqLen <= readData->_seqAlloc); memcpy(readData->_qlt, blob + 8, _seqLen); readData->_qlt[_seqLen] = 0; } else if (strncmp(chunk, "2SEQ", 4) == 0) { gkRead_decode2bit(blob + 8, chunkLen, readData->_seq, _seqLen); } else if (strncmp(chunk, "3SEQ", 4) == 0) { gkRead_decode3bit(blob + 8, chunkLen, readData->_seq, _seqLen); } else if (strncmp(chunk, "4QLT", 4) == 0) { gkRead_decode4bit(blob + 8, chunkLen, readData->_qlt, _seqLen); } else if (strncmp(chunk, "5QLT", 4) == 0) { gkRead_decode5bit(blob + 8, chunkLen, readData->_qlt, _seqLen); } else if (strncmp(chunk, "QVAL", 4) == 0) { uint32 qval = *((uint32 *)blob + 2); for (uint32 ii=0; 
ii<_seqLen; ii++) readData->_qlt[ii] = qval; } else { fprintf(stderr, "gkRead::gkRead_loadDataFromBlob()-- unknown chunk type %02x %02x %02x %02x '%c%c%c%c' skipped\n", chunk[0], chunk[1], chunk[2], chunk[3], chunk[0], chunk[1], chunk[2], chunk[3]); assert(0); } blob += 4 + 4 + chunkLen; } } void gkRead::gkRead_loadDataFromStream(gkReadData *readData, FILE *file) { char tag[5]; uint32 size; // Ideally, we'd do one read to get the whole blob. Without knowing // the length, we're forced to do two. AS_UTL_safeRead(file, tag, "gkStore::gkStore_loadDataFromFile::blob", sizeof(int8), 4); AS_UTL_safeRead(file, &size, "gkStore::gkStore_loadDataFromFile::size", sizeof(uint32), 1); uint8 *blob = new uint8 [8 + size]; memcpy(blob, tag, sizeof(uint8) * 4); memcpy(blob+4, &size, sizeof(uint32) * 1); AS_UTL_safeRead(file, blob+8, "gkStore::gkStore_loadDataFromFile::blob", sizeof(char), size); gkRead_loadData(readData, blob); delete [] blob; } void gkRead::gkRead_loadDataFromMMap(gkReadData *readData, void *blobs) { //fprintf(stderr, "gkRead::gkRead_loadDataFromMMap()-- read %lu position %lu\n", _readID, _mPtr); gkRead_loadData(readData, ((uint8 *)blobs) + _mPtr); } void gkRead::gkRead_loadDataFromFile(gkReadData *readData, FILE *file) { //fprintf(stderr, "gkRead::gkRead_loadDataFromFile()-- read %lu position %lu\n", _readID, _mPtr); AS_UTL_fseek(file, _mPtr, SEEK_SET); gkRead_loadDataFromStream(readData, file); } // Dump a block of encoded data to disk, then update the gkRead to point to it. // void gkStore::gkStore_stashReadData(gkRead *read, gkReadData *data) { assert(_blobsWriter != NULL); read->_mPtr = _blobsWriter->tell(); read->_pID = _partitionID; // 0 if not partitioned //fprintf(stderr, "STASH read %u at position " F_U64 " or length " F_U64 "\n", read->gkRead_readID(), read->_mPtr, data->_blobLen); _blobsWriter->write(data->_blob, data->_blobLen); } // Load read metadata and data from a stream. 
// void gkStore::gkStore_loadReadFromStream(FILE *S, gkRead *read, gkReadData *readData) { char tag[5]; uint32 size; // Mark this as a read. Needed for tgTig::loadFromStreamOrLayout(), and loading this stuff in // utgcns. AS_UTL_safeRead(S, tag, "gkStore::gkStore_loadReadFromStream::tag", sizeof(char), 4); if (strncmp(tag, "READ", 4) != 0) fprintf(stderr, "Failed to load gkRead, got tag '%c%c%c%c' (0x%02x 0x%02x 0x%02x 0x%02x), expected 'READ'.\n", tag[0], tag[1], tag[2], tag[3], tag[0], tag[1], tag[2], tag[3]), exit(1); // Load the read metadata AS_UTL_safeRead(S, read, "gkStore::gkStore_loadReadFromStream::read", sizeof(gkRead), 1); // Load the read data. read->gkRead_loadDataFromStream(readData, S); } // Dump the read metadata and read data to a stream. // void gkStore::gkStore_saveReadToStream(FILE *S, uint32 id) { // Mark this as a read. Needed for tgTig::loadFromStreamOrLayout(), and loading this stuff in // utgcns. fprintf(S, "READ"); // Dump the read metadata gkRead *read = gkStore_getRead(id); AS_UTL_safeWrite(S, read, "gkStore::gkStore_saveReadToStream::read", sizeof(gkRead), 1); // Figure out where the blob actually is, and make sure that it really is a blob uint8 *blob = (uint8 *)_blobs + read->_mPtr; uint32 blobLen = 8 + *((uint32 *)blob + 1); assert(blob[0] == 'B'); assert(blob[1] == 'L'); assert(blob[2] == 'O'); assert(blob[3] == 'B'); // Write the blob to the stream AS_UTL_safeWrite(S, blob, "gkStore::gkStore_saveReadToStream::blob", sizeof(char), blobLen); } // Store the 'len' bytes of data in 'dat' into the class-managed _blob data block. // Ensures that the _blob block is appropriately padded to maintain 32-bit alignment. 
// void gkReadData::gkReadData_encodeBlobChunk(char const *tag, uint32 len, void *dat) { // Allocate an initial blob if we don't have one if (_blobMax == 0) { _blobLen = 0; _blobMax = 1048576; _blob = new uint8 [_blobMax]; } // Or make it bigger while (_blobMax <= _blobLen + 8 + len) { _blobMax *= 2; uint8 *b = new uint8 [_blobMax]; memcpy(b, _blob, sizeof(uint8) * _blobLen); delete [] _blob; _blob = b; } // Figure out how much padding we need to add uint32 pad = 4 - (len % 4); if (pad == 4) pad = 0; // Copy in the chunk id and padded length len += pad; memcpy(_blob + _blobLen, tag, sizeof(char) * 4); _blobLen += sizeof(char) * 4; memcpy(_blob + _blobLen, &len, sizeof(uint32)); _blobLen += sizeof(uint32); len -= pad; // Then the unpadded data and any padding. memcpy(_blob + _blobLen, dat, sizeof(uint8) * len); _blobLen += sizeof(uint8) * len; if (pad > 2) _blob[_blobLen++] = 0; if (pad > 1) _blob[_blobLen++] = 0; if (pad > 0) _blob[_blobLen++] = 0; // Finally, update the total blob length. _blobLen -= 8; memcpy(_blob + 4, &_blobLen, sizeof(uint32)); _blobLen += 8; } gkReadData * gkRead::gkRead_encodeSeqQlt(char *H, char *S, char *Q, uint32 qv) { gkReadData *rd = new gkReadData; uint32 RID = _readID; // Debugging // If there is a QV string, ensure that the lengths are the same. If not, trim or pad the QVs. // Then, convert the expected Sanger-encoded QV's (base='!') to be just integers. 
  uint32  Hlen = strlen(H);
  uint32  Slen = _seqLen = strlen(S);
  uint32  Qlen = 0;

  if (Q[0] != 0) {
    Qlen = strlen(Q);

    if (Slen < Qlen) {
      fprintf(stderr, "--  WARNING: read '%s' sequence length %u != quality length %u; quality bases truncated.\n",
              H, Slen, Qlen);
      Q[Slen] = 0;
    }

    if (Slen > Qlen) {
      fprintf(stderr, "--  WARNING: read '%s' sequence length %u != quality length %u; quality bases padded.\n",
              H, Slen, Qlen);

      for (uint32 ii=Qlen; ii
      //  [ ... the padding loop body, the QV conversion, and the sequence and
      //        quality encoding that set seq, qlt, seq2Len, seq3Len, qlt4Len,
      //        qlt5Len and blobVers (including the braces closing this block)
      //        were lost in extraction ... ]

  rd->gkReadData_encodeBlobChunk("BLOB", 0,    NULL);
  rd->gkReadData_encodeBlobChunk("VERS", 4,    &blobVers);
  rd->gkReadData_encodeBlobChunk("NAME", Hlen, H);

  if      (seq2Len > 0)
    rd->gkReadData_encodeBlobChunk("2SEQ", seq2Len, seq);   //  Two-bit encoded sequence (ACGT only)
  else if (seq3Len > 0)
    rd->gkReadData_encodeBlobChunk("3SEQ", seq3Len, seq);   //  Three-bit encoded sequence (ACGTN)
  else
    rd->gkReadData_encodeBlobChunk("USEQ", Slen,    S);     //  Unencoded sequence

  if      (qlt4Len > 0)
    rd->gkReadData_encodeBlobChunk("4QLT", qlt4Len, qlt);   //  Four-bit (0-15) encoded QVs
  else if (qlt5Len > 0)
    rd->gkReadData_encodeBlobChunk("5QLT", qlt5Len, qlt);   //  Five-bit (0-32) encoded QVs
  else if (Q[0] == 0)
    rd->gkReadData_encodeBlobChunk("QVAL", 4,       &qv);   //  Constant QV for every base
  else
    rd->gkReadData_encodeBlobChunk("UQLT", Qlen,    Q);     //  Unencoded quality

  rd->gkReadData_encodeBlobChunk("STOP", 0, NULL);

  //  Cleanup.  Restore the QV's.  Delete temporary storage.

  for (uint32 ii=0; ii
  //  [ ... the cleanup loop, the rest of this function, and the head of the
  //        gkStore constructor were lost in extraction; the text resumes
  //        inside the constructor's store-compatibility check ... ]
              0) ? "complete" : "partial");
      exit(1);
    }
  }

  //  Clear ourself, to make valgrind happier.
  _librariesMMap        = NULL;
  _librariesAlloc       = 0;
  _libraries            = NULL;

  _readsMMap            = NULL;
  _readsAlloc           = 0;
  _reads                = NULL;

  _blobsMMap            = NULL;
  _blobs                = NULL;
  _blobsWriter          = NULL;

  _blobsFiles           = NULL;

  _mode                 = mode;

  _numberOfPartitions   = 0;
  _partitionID          = 0;
  _readIDtoPartitionIdx = NULL;
  _readIDtoPartitionID  = NULL;
  _readsPerPartition    = NULL;
  //_readsInThisPartition = NULL;

  //
  //  READ ONLY
  //

  if ((mode == gkStore_readOnly) && (partID == UINT32_MAX)) {
    //fprintf(stderr, "gkStore()-- opening '%s' for read-only access.\n", _storePath);

    if (AS_UTL_fileExists(_storePath, true, false) == false) {
      fprintf(stderr, "gkStore()-- failed to open '%s' for read-only access: store doesn't exist.\n", _storePath);
      exit(1);
    }

    snprintf(name, FILENAME_MAX, "%s/libraries", _storePath);
    _librariesMMap = new memoryMappedFile (name, memoryMappedFile_readOnly);
    _libraries     = (gkLibrary *)_librariesMMap->get(0);

    snprintf(name, FILENAME_MAX, "%s/reads", _storePath);
    _readsMMap     = new memoryMappedFile (name, memoryMappedFile_readOnly);
    _reads         = (gkRead *)_readsMMap->get(0);

    snprintf(name, FILENAME_MAX, "%s/blobs", _storePath);
#ifdef MMAP_BLOBS
    _blobsMMap     = new memoryMappedFile (name, memoryMappedFile_readOnly);
    _blobs         = (void *)_blobsMMap->get(0);
#else
    _blobsFiles    = new FILE * [omp_get_max_threads()];

    errno = 0;

    //  [ ... the per-thread fopen loop, the end of this branch, and the head
    //        of the read-write branch were lost in extraction; the following
    //        is a minimal reconstruction from the surviving statements ... ]

    for (uint32 ii=0; ii<(uint32)omp_get_max_threads(); ii++) {
      _blobsFiles[ii] = fopen(name, "r");
      if (errno)
        fprintf(stderr, "gkStore()-- failed to open '%s' for reading: %s\n", name, strerror(errno)), exit(1);
    }
#endif
  }

  //
  //  MODIFY, NO APPEND -- open the stores read-write  (branch head reconstructed)
  //

  else if ((mode == gkStore_modify) && (partID == UINT32_MAX)) {
    snprintf(name, FILENAME_MAX, "%s/libraries", _storePath);
    _librariesMMap = new memoryMappedFile (name, memoryMappedFile_readWrite);
    _libraries     = (gkLibrary *)_librariesMMap->get(0);

    snprintf(name, FILENAME_MAX, "%s/reads", _storePath);
    _readsMMap     = new memoryMappedFile (name, memoryMappedFile_readWrite);
    _reads         = (gkRead *)_readsMMap->get(0);

    snprintf(name, FILENAME_MAX, "%s/blobs", _storePath);
    _blobsMMap     = new memoryMappedFile (name, memoryMappedFile_readWrite);
    _blobs         = (void *)_blobsMMap->get(0);
  }

  //
  //  MODIFY, APPEND, open mmap'd files, but copy them entirely to local memory
  //

  else if ((mode == gkStore_extend) && (partID == UINT32_MAX)) {
    //fprintf(stderr, "gkStore()-- opening '%s' for read-write and append access.\n", _storePath);

    if (AS_UTL_fileExists(_storePath, true, true) == false)
      AS_UTL_mkdir(_storePath);

    _librariesAlloc = MAX(64, 2 *
_info.numLibraries); _libraries = new gkLibrary [_librariesAlloc]; snprintf(name, FILENAME_MAX, "%s/libraries", _storePath); if (AS_UTL_fileExists(name, false, false) == true) { _librariesMMap = new memoryMappedFile (name, memoryMappedFile_readOnly); memcpy(_libraries, _librariesMMap->get(0), sizeof(gkLibrary) * (_info.numLibraries + 1)); delete _librariesMMap; _librariesMMap = NULL;; } _readsAlloc = MAX(128, 2 * _info.numReads); _reads = new gkRead [_readsAlloc]; snprintf(name, FILENAME_MAX, "%s/reads", _storePath); if (AS_UTL_fileExists(name, false, false) == true) { _readsMMap = new memoryMappedFile (name, memoryMappedFile_readOnly); memcpy(_reads, _readsMMap->get(0), sizeof(gkRead) * (_info.numReads + 1)); delete _readsMMap; _readsMMap = NULL; } snprintf(name, FILENAME_MAX, "%s/blobs", _storePath); _blobsMMap = NULL; _blobs = NULL; _blobsWriter = new writeBuffer(name, "a+"); } // // PARTITIONED, no modifications, no appends // // BIG QUESTION: do we want to partition the read metadata too, or is it small enough // to load in every job? For now, we load all the metadata. else if ((mode == gkStore_readOnly) && (partID != UINT32_MAX)) { //fprintf(stderr, "gkStore()-- opening '%s' partition '%u' for read-only access.\n", _storePath, partID); // For partitioned reads, we need to have a uint32 map of readID to partitionReadID so we can // lookup the metadata in the partitoned _reads data. This is 4 bytes per read, compared to 24 // bytes for the full meta data. Assuming 100x of 3kb read coverage on human, that's 100 // million reads, so 0.400 GB vs 2.4 GB. 
snprintf(name, FILENAME_MAX, "%s/partitions/map", _storePath); errno = 0; FILE *F = fopen(name, "r"); if (errno) fprintf(stderr, "gkStore::gkStore()-- failed to open '%s' for reading: %s\n", name, strerror(errno)), exit(1); AS_UTL_safeRead(F, &_numberOfPartitions, "gkStore::_numberOfPartitions", sizeof(uint32), 1); _partitionID = partID; _readsPerPartition = new uint32 [_numberOfPartitions + 1]; // No zeroth element in any of these _readIDtoPartitionID = new uint32 [gkStore_getNumReads() + 1]; _readIDtoPartitionIdx = new uint32 [gkStore_getNumReads() + 1]; AS_UTL_safeRead(F, _readsPerPartition, "gkStore::_readsPerPartition", sizeof(uint32), _numberOfPartitions + 1); AS_UTL_safeRead(F, _readIDtoPartitionID, "gkStore::_readIDtoPartitionID", sizeof(uint32), gkStore_getNumReads() + 1); AS_UTL_safeRead(F, _readIDtoPartitionIdx, "gkStore::_readIDtoPartitionIdx", sizeof(uint32), gkStore_getNumReads() + 1); fclose(F); snprintf(name, FILENAME_MAX, "%s/libraries", _storePath); _librariesMMap = new memoryMappedFile (name, memoryMappedFile_readOnly); _libraries = (gkLibrary *)_librariesMMap->get(0); //fprintf(stderr, " -- openend '%s' at " F_X64 "\n", name, _libraries); snprintf(name, FILENAME_MAX, "%s/partitions/reads.%04" F_U32P, _storePath, partID); _readsMMap = new memoryMappedFile (name, memoryMappedFile_readOnly); _reads = (gkRead *)_readsMMap->get(0); //fprintf(stderr, " -- openend '%s' at " F_X64 "\n", name, _reads); snprintf(name, FILENAME_MAX, "%s/partitions/blobs.%04" F_U32P, _storePath, partID); _blobsMMap = new memoryMappedFile (name, memoryMappedFile_readOnly); _blobs = (void *)_blobsMMap->get(0); //fprintf(stderr, " -- openend '%s' at " F_X64 "\n", name, _blobs); } // Info only, no access to reads or libraries. 
  else if (mode == gkStore_infoOnly) {
    //fprintf(stderr, "gkStore()-- opening '%s' for info-only access.\n", _storePath);
  }

  else {
    fprintf(stderr, "gkStore::gkStore()-- invalid mode '%s' with partition ID %u.\n",
            toString(mode), partID);
    assert(0);
  }
}



gkStore::~gkStore() {
  char    N[FILENAME_MAX];
  FILE   *F;

  //  Should check that inf on disk is the same as inf in memory, and update if needed.

  bool    needsInfoUpdate = false;

  //  Write N+1 because we write, but don't count, the [0] element.

  if (_librariesMMap) {
    delete _librariesMMap;
  }

  else if (_libraries) {
    snprintf(N, FILENAME_MAX, "%s/libraries", gkStore_path());

    errno = 0;
    F = fopen(N, "w");
    if (errno)
      fprintf(stderr, "gkStore::~gkStore()-- failed to open '%s' for writing: %s\n", N, strerror(errno)), exit(1);

    AS_UTL_safeWrite(F, _libraries, "libraries", sizeof(gkLibrary), gkStore_getNumLibraries() + 1);

    fclose(F);

    delete [] _libraries;

    needsInfoUpdate = true;
  }

  if (_readsMMap) {
    delete _readsMMap;
  }

  else if (_reads) {
    snprintf(N, FILENAME_MAX, "%s/reads", gkStore_path());

    errno = 0;
    F = fopen(N, "w");
    if (errno)
      fprintf(stderr, "gkStore::~gkStore()-- failed to open '%s' for writing: %s\n", N, strerror(errno)), exit(1);

    AS_UTL_safeWrite(F, _reads, "reads", sizeof(gkRead), gkStore_getNumReads() + 1);

    fclose(F);

    delete [] _reads;

    needsInfoUpdate = true;
  }

  if (needsInfoUpdate) {
    snprintf(N, FILENAME_MAX, "%s/info", gkStore_path());

    errno = 0;
    F = fopen(N, "w");
    if (errno)
      fprintf(stderr, "gkStore::~gkStore()-- failed to open '%s' for writing: %s\n", N, strerror(errno)), exit(1);

    AS_UTL_safeWrite(F, &_info, "info", sizeof(gkStoreInfo), 1);

    fclose(F);

    snprintf(N, FILENAME_MAX, "%s/info.txt", gkStore_path());

    errno = 0;
    F = fopen(N, "w");
    if (errno)
      fprintf(stderr, "gkStore::~gkStore()-- failed to open '%s' for writing: %s\n", N, strerror(errno)), exit(1);

    _info.writeInfoAsText(F);

    fclose(F);
  }

  if (_blobsMMap)    delete _blobsMMap;
  if (_blobsWriter)  delete _blobsWriter;

  for (uint32 ii=0; ii
  //  [ ... the loop bound and body closing the per-thread blobs files, the end
  //        of this destructor, and the head of gkStore_addEmptyLibrary -- which
  //        copies the library name into 'libname' -- were lost in extraction;
  //        its length check survives below ... ]

    if (ii >= LIBRARY_NAME_SIZE) {
libname[LIBRARY_NAME_SIZE-1] = 0; truncated = true; break; } } return(_libraries + _info.numLibraries); } gkRead * gkStore::gkStore_addEmptyRead(gkLibrary *lib) { assert(_readsMMap == NULL); assert(_info.numReads <= _readsAlloc); assert(_mode != gkStore_readOnly); // We reserve the zeroth read for "null". This is easy to accomplish // here, just pre-increment the number of reads. However, we need to be sure // to iterate up to and including _info.numReads. _info.numReads++; if (_info.numReads == _readsAlloc) increaseArray(_reads, _info.numReads, _readsAlloc, _info.numReads/2); _reads[_info.numReads] = gkRead(); _reads[_info.numReads]._readID = _info.numReads; _reads[_info.numReads]._libraryID = lib->gkLibrary_libraryID(); //fprintf(stderr, "ADD READ %u = %u alloc = %u\n", _info.numReads, _reads[_info.numReads]._readID, _readsAlloc); return(_reads + _info.numReads); } void gkRead::gkRead_copyDataToPartition(void *blobs, FILE **partfiles, uint64 *partfileslen, uint32 partID) { if (partID == UINT32_MAX) // If an invalid partition, don't do anything. return; // Figure out where the blob actually is, and make sure that it really is a blob uint8 *blob = (uint8 *)blobs + _mPtr; uint32 blobLen = 8 + *((uint32 *)blob + 1); assert(blob[0] == 'B'); assert(blob[1] == 'L'); assert(blob[2] == 'O'); assert(blob[3] == 'B'); // The partfile should be at what we think is the end. assert(partfileslen[partID] == AS_UTL_ftell(partfiles[partID])); // Write the blob to the partition, update the length of the partition AS_UTL_safeWrite(partfiles[partID], blob, "gkRead::gkRead_copyDataToPartition::blob", sizeof(char), blobLen); // Update the read to the new location of the blob in the partitioned data. _mPtr = partfileslen[partID]; _pID = partID; // And finalize by remembering the length. 
partfileslen[partID] += blobLen; assert(partfileslen[partID] == AS_UTL_ftell(partfiles[partID])); } void gkRead::gkRead_copyDataToPartition(FILE **blobsFiles, FILE **partfiles, uint64 *partfileslen, uint32 partID) { // Load the blob from disk. char tag[5]; uint8 *blob; uint32 blobLen; FILE *file = blobsFiles[omp_get_thread_num()]; // Ideally, we'd do one read to get the whole blob. Without knowing // the length, we're forced to do two. AS_UTL_safeRead(file, tag, "gkStore::gkStore_loadDataFromFile::tag", sizeof(int8), 4); AS_UTL_safeRead(file, &blobLen, "gkStore::gkStore_loadDataFromFile::blobLen", sizeof(uint32), 1); blob = new uint8 [8 + blobLen]; memcpy(blob, tag, sizeof(uint8) * 4); memcpy(blob+4, &blobLen, sizeof(uint32) * 1); AS_UTL_safeRead(file, blob+8, "gkStore::gkStore_loadDataFromFile::blob", sizeof(char), blobLen); assert(blob[0] == 'B'); assert(blob[1] == 'L'); assert(blob[2] == 'O'); assert(blob[3] == 'B'); // If a valid partition, write the data (we always have to read it though, so don't be all clever // and try to move this test backward). if (partID != UINT32_MAX) { assert(partfileslen[partID] == AS_UTL_ftell(partfiles[partID])); // The partfile should be at what we think is the end. // Write the blob to the partition, update the length of the partition blobLen += 8; AS_UTL_safeWrite(partfiles[partID], blob, "gkRead::gkRead_copyDataToPartition::blob", sizeof(char), blobLen); // Update the read to the new location of the blob in the partitioned data. _mPtr = partfileslen[partID]; _pID = partID; // And finalize by remembering the length. partfileslen[partID] += blobLen; assert(partfileslen[partID] == AS_UTL_ftell(partfiles[partID])); } delete [] blob; } void gkStore::gkStore_buildPartitions(uint32 *partitionMap) { char name[FILENAME_MAX]; // Store cannot be partitioned already, and it must be readOnly (for safety) as we don't need to // be changing any of the normal store data. 
assert(_numberOfPartitions == 0); assert(_mode == gkStore_readOnly); // Figure out what the last partition is uint32 maxPartition = 0; uint32 unPartitioned = 0; assert(partitionMap[0] == UINT32_MAX); for (uint32 fi=1; fi<=gkStore_getNumReads(); fi++) { if (partitionMap[fi] == UINT32_MAX) unPartitioned++; else if (maxPartition < partitionMap[fi]) maxPartition = partitionMap[fi]; } fprintf(stderr, "Found " F_U32 " unpartitioned reads and maximum partition of " F_U32 "\n", unPartitioned, maxPartition); // Create the partitions by opening N copies of the data stores, // and writing data to each. FILE **blobfiles = new FILE * [maxPartition + 1]; uint64 *blobfileslen = new uint64 [maxPartition + 1]; // Offset, in bytes, into the blobs file FILE **readfiles = new FILE * [maxPartition + 1]; uint32 *readfileslen = new uint32 [maxPartition + 1]; // aka _readsPerPartition uint32 *readIDmap = new uint32 [gkStore_getNumReads() + 1]; // aka _readIDtoPartitionIdx // Be nice and put all the partitions in a subdirectory. snprintf(name, FILENAME_MAX, "%s/partitions", _storePath); if (AS_UTL_fileExists(name, true, true) == false) AS_UTL_mkdir(name); // Open all the output files -- fail early if we can't open that many files. 
blobfiles[0] = NULL; blobfileslen[0] = UINT64_MAX; readfiles[0] = NULL; readfileslen[0] = UINT32_MAX; for (uint32 i=1; i<=maxPartition; i++) { snprintf(name, FILENAME_MAX, "%s/partitions/blobs.%04d", _storePath, i); errno = 0; blobfiles[i] = fopen(name, "w"); blobfileslen[i] = 0; if (errno) fprintf(stderr, "gkStore::gkStore_buildPartitions()-- ERROR: failed to open partition %u file '%s' for write: %s\n", i, name, strerror(errno)), exit(1); snprintf(name, FILENAME_MAX, "%s/partitions/reads.%04d", _storePath, i); errno = 0; readfiles[i] = fopen(name, "w"); readfileslen[i] = 0; if (errno) fprintf(stderr, "gkStore::gkStore_buildPartitions()-- ERROR: failed to open partition %u file '%s' for write: %s\n", i, name, strerror(errno)), exit(1); } // Open the output partition map file -- we might as well fail early if we can't make it also. snprintf(name, FILENAME_MAX, "%s/partitions/map", _storePath); errno = 0; FILE *rIDmF = fopen(name, "w"); if (errno) fprintf(stderr, "gkStore::gkStore_buildPartitions()-- ERROR: failed to open partition map file '%s': %s\n", name, strerror(errno)), exit(1); // Copy the blob from the master file to the partitioned file, update pointers. readIDmap[0] = UINT32_MAX; // There isn't a zeroth read, make it bogus. for (uint32 fi=1; fi<=gkStore_getNumReads(); fi++) { uint32 pi = partitionMap[fi]; assert(pi != 0); // No zeroth partition, right? // Make a copy of the read, then modify it for the partition, then write it to the partition. // Without the copy, we'd need to update the master record too. gkRead partRead = _reads[fi]; if (_blobs) partRead.gkRead_copyDataToPartition(_blobs, blobfiles, blobfileslen, pi); if (_blobsFiles) partRead.gkRead_copyDataToPartition(_blobsFiles, blobfiles, blobfileslen, pi); // Because the blobsFiles copyDataToPartition variant is streaming through the file, // we need to let it load (and ignore) deleted reads. After they're loaded (and ignored) // we can then skip it. 
if (pi < UINT32_MAX) { #if 0 fprintf(stderr, "read " F_U32 "=" F_U32 " len " F_U32 " -- blob master " F_U64 " -- to part " F_U32 " new read id " F_U32 " blob " F_U64 "/" F_U64 " -- at readIdx " F_U32 "\n", fi, _reads[fi].gkRead_readID(), _reads[fi].gkRead_sequenceLength(), _reads[fi]._mPtr, pi, partRead.gkRead_readID(), partRead._pID, partRead._mPtr, readfileslen[pi]); #endif AS_UTL_safeWrite(readfiles[pi], &partRead, "gkStore::gkStore_buildPartitions::read", sizeof(gkRead), 1); readIDmap[fi] = readfileslen[pi]++; } else { #if 0 fprintf(stderr, "read " F_U32 "=" F_U32 " len " F_U32 " -- blob master " F_U64 " -- DELETED\n", fi, _reads[fi].gkRead_readID(), _reads[fi].gkRead_sequenceLength(), _reads[fi]._mPtr); #endif } } // There isn't a zeroth read. AS_UTL_safeWrite(rIDmF, &maxPartition, "gkStore::gkStore_buildPartitions::maxPartition", sizeof(uint32), 1); AS_UTL_safeWrite(rIDmF, readfileslen, "gkStore::gkStore_buildPartitions::readfileslen", sizeof(uint32), maxPartition + 1); AS_UTL_safeWrite(rIDmF, partitionMap, "gkStore::gkStore_buildPartitions::partitionMap", sizeof(uint32), gkStore_getNumReads() + 1); AS_UTL_safeWrite(rIDmF, readIDmap, "gkStore::gkStore_buildPartitions::readIDmap", sizeof(uint32), gkStore_getNumReads() + 1); // cleanup -- close all the files, delete storage fclose(rIDmF); for (uint32 i=1; i<=maxPartition; i++) { fprintf(stderr, "partition " F_U32 " has " F_U32 " reads\n", i, readfileslen[i]); errno = 0; fclose(blobfiles[i]); fclose(readfiles[i]); if (errno) fprintf(stderr, " warning: %s\n", strerror(errno)); } delete [] readIDmap; delete [] readfileslen; delete [] readfiles; delete [] blobfileslen; delete [] blobfiles; } void gkStore::gkStore_clone(char *originalPath, char *clonePath) { char cPath[FILENAME_MAX]; char sPath[FILENAME_MAX]; getcwd(cPath, FILENAME_MAX); AS_UTL_mkdir(clonePath); chdir(clonePath); snprintf(sPath, FILENAME_MAX, "%s/info", originalPath); AS_UTL_symlink(sPath, "info"); snprintf(sPath, FILENAME_MAX, "%s/libraries", 
originalPath); AS_UTL_symlink(sPath, "libraries"); snprintf(sPath, FILENAME_MAX, "%s/reads", originalPath); AS_UTL_symlink(sPath, "reads"); snprintf(sPath, FILENAME_MAX, "%s/blobs", originalPath); AS_UTL_symlink(sPath, "blobs"); chdir(cPath); } void gkStore::gkStore_delete(void) { char path[FILENAME_MAX]; gkStore_deletePartitions(); snprintf(path, FILENAME_MAX, "%s/info", gkStore_path()); AS_UTL_unlink(path); snprintf(path, FILENAME_MAX, "%s/libraries", gkStore_path()); AS_UTL_unlink(path); snprintf(path, FILENAME_MAX, "%s/reads", gkStore_path()); AS_UTL_unlink(path); snprintf(path, FILENAME_MAX, "%s/blobs", gkStore_path()); AS_UTL_unlink(path); AS_UTL_rmdir(gkStore_path()); } void gkStore::gkStore_deletePartitions(void) { char path[FILENAME_MAX]; snprintf(path, FILENAME_MAX, "%s/partitions/map", gkStore_path()); if (AS_UTL_fileExists(path, false, false) == false) return; // How many partitions? FILE *F = fopen(path, "r"); if (errno) fprintf(stderr, "ERROR: failed to open partition meta data '%s': %s\n", path, strerror(errno)), exit(1); AS_UTL_safeRead(F, &_numberOfPartitions, "gkStore_deletePartitions::numberOfPartitions", sizeof(uint32), 1); fclose(F); // Yay! Delete! AS_UTL_unlink(path); for (uint32 ii=0; ii<_numberOfPartitions; ii++) { snprintf(path, FILENAME_MAX, "%s/partitions/reads.%04u", gkStore_path(), ii+1); AS_UTL_unlink(path); snprintf(path, FILENAME_MAX, "%s/partitions/blobs.%04u", gkStore_path(), ii+1); AS_UTL_unlink(path); } // And the directory. 
  snprintf(path, FILENAME_MAX, "%s/partitions", gkStore_path());
  AS_UTL_rmdir(path);
}



void
gkStoreStats::init(gkStore *UNUSED(gkp)) {
#if 0
  gkFragment    fr;
  gkStream     *fs = new gkStream(gkp, 0, 0, GKFRAGMENT_INF);

  numActiveFrag     = 0;
  numMatedFrag      = 0;
  readLength        = 0;
  clearLength       = 0;

  lowestID          = new uint32 [gkp->gkStore_getNumLibraries() + 1];
  highestID         = new uint32 [gkp->gkStore_getNumLibraries() + 1];

  numActivePerLib   = new uint32 [gkp->gkStore_getNumLibraries() + 1];
  numMatedPerLib    = new uint32 [gkp->gkStore_getNumLibraries() + 1];
  readLengthPerLib  = new uint64 [gkp->gkStore_getNumLibraries() + 1];
  clearLengthPerLib = new uint64 [gkp->gkStore_getNumLibraries() + 1];

  for (uint32 i=0; i<gkp->gkStore_getNumLibraries() + 1; i++) {
    lowestID[i]          = 0;
    highestID[i]         = 0;

    numActivePerLib[i]   = 0;
    numMatedPerLib[i]    = 0;
    readLengthPerLib[i]  = 0;
    clearLengthPerLib[i] = 0;
  }

  while (fs->next(&fr)) {
    uint32 lid = fr.gkFragment_getLibraryID();
    uint32 rid = fr.gkFragment_getReadID();

    if (lowestID[lid] == 0) {
      lowestID[lid]  = rid;
      highestID[lid] = rid;
    }
    if (highestID[lid] < rid) {
      highestID[lid] = rid;
    }

    numActiveFrag++;
    numActivePerLib[lid]++;

    readLength             += fr.gkFragment_getSequenceLength();
    readLengthPerLib[lid]  += fr.gkFragment_getSequenceLength();

    clearLength            += fr.gkFragment_getClearRegionLength();
    clearLengthPerLib[lid] += fr.gkFragment_getClearRegionLength();
  }

  delete fs;
#endif
}

canu-1.6/src/stores/gkStore.H

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
* Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2014-NOV-26 to 2015-AUG-14 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-09 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef GKSTORE_H #define GKSTORE_H #include "AS_global.H" #include "memoryMappedFile.H" #include "writeBuffer.H" #include using namespace std; // The number of library IIDs we can handle. // #define AS_MAX_LIBRARIES_BITS 6 #define AS_MAX_LIBRARIES (((uint32)1 << AS_MAX_LIBRARIES_BITS) - 1) #define LIBRARY_NAME_SIZE 128 // Maximum length of reads. // // If 16, an overlap is only 20 bytes. (5x 32 bit words) // If 17-21, an overlap is 24 bytes. (3x 64 bit words) // If 22-32, an overlap is 32 bytes. (4x 64 bit words) // // if 26, bogart has issues with storing the error rate // If 28, alignment/alignment-drivers.C won't link // If 29, alignment/alignment-drivers.C won't link // If 30, alignment/alignment-drivers.C won't link // If 31, alignment/alignment-drivers.C won't compile, len+len+2 == 0 // If 32, it won't compile because of shifting (uint32)1 << 32 == 0. // #define AS_MAX_READLEN_BITS 21 #define AS_MAX_READLEN (((uint32)1 << AS_MAX_READLEN_BITS) - 1) // The number of read IDs we can handle. Longer reads implies fewer reads. // readLen 32 + numLibs 6 -> numReads 26 ( 64 million) // readLen 30 + numLibs 6 -> numReads 28 (256 million) // readLen 28 + numLibs 6 -> numReads 30 (1024 million) // readLen 26 + numLibs 6 -> numReads 32 (4096 million) // limited elsewhere! // readLen 24 + numLibs 6 -> numReads 34 (4096 million) // limited elsewhere! // readLen 22 + numLibs 6 -> numReads 36 (4096 million) // limited elsewhere! 
// readLen 21 + numLibs 6 -> numReads 37 (4096 million) // limited elsewhere! // readLen 20 + numLibs 6 -> numReads 38 (4096 million) // limited elsewhere! // #define AS_MAX_READS_BITS 64 - AS_MAX_READLEN_BITS - AS_MAX_LIBRARIES_BITS #define AS_MAX_READS (((uint64)1 << AS_MAX_READS_BITS) - 1) // Per-library options. // Read type #define GK_READTYPE_GENERIC 0x0000 #define GK_READTYPE_CONTIG 0x0001 #define GK_READTYPE_PACBIO_RAW 0x0002 #define GK_READTYPE_PACBIO_CORRECTED 0x0003 #define GK_READTYPE_NANOPORE_RAW 0x0004 #define GK_READTYPE_NANOPORE_CORRECTED 0x0005 // Correction algorithm #define GK_CORRECTION_NONE 0x0000 #define GK_CORRECTION_CONSENSUS 0x0001 #define GK_CORRECTION_MER 0x0002 // Trimming algorithm #define GK_FINALTRIM_NONE 0x0000 #define GK_FINALTRIM_LARGEST_COVERED 0x0001 // largest region covered by good overlaps #define GK_FINALTRIM_BEST_EDGE 0x0002 // largest region covered by best overlaps (broken) class gkLibrary { public: gkLibrary() { memset(_libraryName, 0, sizeof(char) * LIBRARY_NAME_SIZE); strncpy(_libraryName, "UNDEFINED", LIBRARY_NAME_SIZE-1); _libraryID = UINT32_MAX; gkLibrary_clearFeatures(); }; ~gkLibrary() { }; void gkLibrary_clearFeatures(void) { // DO NOT change defaults without updating gkLibrary_parsePreset(). 
_isNonRandom = 0; _readCorrection = GK_CORRECTION_NONE; _readType = GK_READTYPE_GENERIC; _finalTrim = GK_FINALTRIM_LARGEST_COVERED; _removeDuplicateReads = 1; _removeSpurReads = 1; _removeChimericReads = 1; _checkForSubReads = 1; _defaultQV = 20; }; public: char const *gkLibrary_libraryName(void) { return(_libraryName); }; uint32 gkLibrary_libraryID(void) { return(_libraryID); }; uint32 gkLibrary_isNonRandom(void) { return(_isNonRandom); }; uint32 gkLibrary_readType(void) { return(_readType); }; char const *gkLibrary_readTypeString(void); uint32 gkLibrary_readCorrection(void) { return(_readCorrection); }; char const *gkLibrary_readCorrectionString(void); uint32 gkLibrary_finalTrim(void) { return(_finalTrim); }; char const *gkLibrary_finalTrimString(void); uint32 gkLibrary_removeDuplicateReads(void) { return(_removeDuplicateReads); }; uint32 gkLibrary_removeSpurReads(void) { return(_removeSpurReads); }; uint32 gkLibrary_removeChimericReads(void) { return(_removeChimericReads); }; uint32 gkLibrary_checkForSubReads(void) { return(_checkForSubReads); }; uint32 gkLibrary_defaultQV(void) { return(_defaultQV); }; void gkLibrary_setIsNonRandom(bool f) { _isNonRandom = f; }; void gkLibrary_setReadType(char *f); void gkLibrary_setReadCorrection(char *t); void gkLibrary_setFinalTrim(char *t); void gkLibrary_setRemoveDuplicateReads(bool f) { _removeDuplicateReads = f; }; void gkLibrary_setRemoveSpurReads(bool f) { _removeSpurReads = f; }; void gkLibrary_setRemoveChimericReads(bool f) { _removeChimericReads = f; }; void gkLibrary_setCheckForSubReads(bool f) { _checkForSubReads = f; }; void gkLibrary_setDefaultQV(double qv) { _defaultQV = qv; }; void gkLibrary_parsePreset(char *t); private: char _libraryName[LIBRARY_NAME_SIZE]; uint32 _libraryID; // If set, reads are from a non-random library and shouldn't count toward coverage stats. uint32 _isNonRandom; // What generated these reads? uint32 _readType; // Should reads be corrected? How? 
uint32 _readCorrection; // Should reads be trimmed based on overlaps? How? uint32 _finalTrim; // Should duplicate reads (based on overlaps) be removed? uint32 _removeDuplicateReads; // Should spur reads be cleaned up? How? uint32 _removeSpurReads; // Should chimeric reads be cleaned up? How? uint32 _removeChimericReads; // Should PacBio circular sub-reads be cleaned up? How? uint32 _checkForSubReads; // For reads with no QVs, use this. uint32 _defaultQV; friend class gkStore; }; class gkRead; class gkReadData { public: gkReadData() { _read = NULL; _name = NULL; _nameAlloc = 0; _seq = NULL; _qlt = NULL; _seqAlloc = 0; _blobLen = 0; _blobMax = 0; _blob = NULL; }; ~gkReadData() { delete [] _name; delete [] _seq; delete [] _qlt; delete [] _blob; }; gkRead *gkReadData_getRead(void) { return(_read); }; char *gkReadData_getName(void) { return(_name); }; bool gkReadData_hasSequence(void) { return(_seq != NULL); }; bool gkReadData_hasQualities(void) { return(_qlt != NULL); }; char *gkReadData_getSequence(void) { return(_seq); }; char *gkReadData_getQualities(void) { return(_qlt); }; private: gkRead *_read; // Pointer to the mmap'd read char *_name; uint32 _nameAlloc; char *_seq; // Everyone has sequence char *_qlt; // and quality uint32 _seqAlloc; uint32 _blobLen; uint32 _blobMax; uint8 *_blob; // And maybe even an encoded blob of data from the store. // Used by the store for adding a read. 
void gkReadData_encodeBlobChunk(char const *tag, uint32 len, void *dat); friend class gkRead; friend class gkStore; }; class gkRead { public: gkRead() { _readID = 0; _libraryID = 0; _seqLen = 0; _mPtr = 0; _pID = 0; }; ~gkRead() { }; uint32 gkRead_readID(void) { return((uint32)_readID); }; uint32 gkRead_libraryID(void) { return((uint32)_libraryID); }; uint32 gkRead_sequenceLength(void) { return((uint32)_seqLen); }; uint64 gkRead_mPtr(void) { return(_mPtr); }; // For debugging, in gatekeeperDumpMetaData uint64 gkRead_pID(void) { return(_pID); }; // Functions to load the read data from disk. // // loadData() -- lowest level, called by the other functions to decode the // encoded data into the gkReadData structure. // loadDataFromStream() -- reads data from a FILE, does not position the stream // loadDataFromFile() -- reads data from a FILE, positions the stream first // loadDataFromMMap() -- reads data from a memory mapped file // private: void gkRead_loadData (gkReadData *readData, uint8 *blob); void gkRead_loadDataFromStream(gkReadData *readData, FILE *file); void gkRead_loadDataFromFile (gkReadData *readData, FILE *file); void gkRead_loadDataFromMMap (gkReadData *readData, void *blob); private: uint32 gkRead_encode2bit(uint8 *&chunk, char *seq, uint32 seqLen); uint32 gkRead_encode3bit(uint8 *&chunk, char *seq, uint32 seqLen); uint32 gkRead_encode4bit(uint8 *&chunk, char *qlt, uint32 seqLen); uint32 gkRead_encode5bit(uint8 *&chunk, char *qlt, uint32 seqLen); bool gkRead_decode2bit(uint8 *chunk, uint32 chunkLen, char *seq, uint32 seqLen); bool gkRead_decode3bit(uint8 *chunk, uint32 chunkLen, char *seq, uint32 seqLen); bool gkRead_decode4bit(uint8 *chunk, uint32 chunkLen, char *qlt, uint32 seqLen); bool gkRead_decode5bit(uint8 *chunk, uint32 chunkLen, char *qlt, uint32 seqLen); // Called by gatekeeperCreate to add a new read to the store.
public: gkReadData *gkRead_encodeSeqQlt(char *H, char *S, char *Q, uint32 qv); private: char *gkRead_encodeSequence(char *sequence, char *encoded); char *gkRead_decodeSequence(char *encoded, char *sequence); char *gkRead_encodeQuality(char *sequence, char *encoded); char *gkRead_decodeQuality(char *encoded, char *sequence); private: // Used by the store to copy data to a partition void gkRead_copyDataToPartition(void *blobs, FILE **partfiles, uint64 *partfileslen, uint32 partID); void gkRead_copyDataToPartition(FILE **blobsFiles, FILE **partfiles, uint64 *partfileslen, uint32 partID); private: uint64 _readID : AS_MAX_READS_BITS; uint64 _libraryID : AS_MAX_LIBRARIES_BITS; uint64 _seqLen : AS_MAX_READLEN_BITS; uint64 _mPtr : 48; // Pointer to blob of data in blob file, 0..256 TB uint64 _pID : 16; // Partition file id, 0...65536 friend class gkStore; }; // gkStoreInfo is saved on disk. // gkStore is the in memory structure used to access the data. // class gkStoreInfo { public: gkStoreInfo() { gkMagic = 0x504b473a756e6163llu; // canu:GKP gkVersion = 0x0000000000000001llu; gkLibrarySize = sizeof(gkLibrary); gkReadSize = sizeof(gkRead); gkMaxLibrariesBits = AS_MAX_LIBRARIES_BITS; gkLibraryNameSize = LIBRARY_NAME_SIZE; gkMaxReadBits = AS_MAX_READS_BITS; gkMaxReadLenBits = AS_MAX_READLEN_BITS; gkUNUSED = 0; numLibraries = 0; numReads = 0; }; ~gkStoreInfo() { }; void writeInfoAsText(FILE *F) { fprintf(F, "gkMagic = 0x" F_X64 "\n", gkMagic); fprintf(F, "gkVersion = 0x" F_X64 "\n", gkVersion); fprintf(F, "\n"); fprintf(F, "gkLibrarySize = " F_U32 "\n", gkLibrarySize); fprintf(F, "gkReadSize = " F_U32 "\n", gkReadSize); fprintf(F, "gkMaxLibrariesBits = " F_U32 "\n", gkMaxLibrariesBits); fprintf(F, "gkLibraryNameSize = " F_U32 "\n", gkLibraryNameSize); fprintf(F, "gkMaxReadBits = " F_U32 "\n", gkMaxReadBits); fprintf(F, "gkMaxReadLenBits = " F_U32 "\n", gkMaxReadLenBits); fprintf(F, "\n"); fprintf(F, "numLibraries = " F_U32 "\n", numLibraries); fprintf(F, "numReads = " F_U32 
"\n", numReads); }; private: uint64 gkMagic; uint64 gkVersion; uint32 gkLibrarySize; // Sanity checks that this code can load the data properly. uint32 gkReadSize; uint32 gkMaxLibrariesBits; uint32 gkLibraryNameSize; uint32 gkMaxReadBits; uint32 gkMaxReadLenBits; uint32 gkUNUSED; // Used to hold a blob block size that was never implemented uint32 numLibraries; // Counts of types of things we have loaded uint32 numReads; friend class gkStore; }; // The default behavior is to open the store for read only, and to load // all the metadata into memory. typedef enum { gkStore_readOnly = 0x00, // Open read only gkStore_modify = 0x01, // Open for modification - never used, explicitly uses mmap file gkStore_create = 0x02, // Open for creating, will fail if files exist already gkStore_extend = 0x03, // Open for modification and appending new reads/libraries gkStore_infoOnly = 0x04 // Open read only, but only load the info on the store; no access to reads or libraries } gkStore_mode; static const char * toString(gkStore_mode m) { switch (m) { case gkStore_readOnly: return("gkStore_readOnly"); break; case gkStore_modify: return("gkStore_modify"); break; case gkStore_create: return("gkStore_create"); break; case gkStore_extend: return("gkStore_extend"); break; case gkStore_infoOnly: return("gkStore_infoOnly"); break; } return("undefined-mode"); } class gkStore { private: gkStore(char const *path, gkStore_mode mode, uint32 partID); ~gkStore(); public: static gkStore *gkStore_open(char const *path, gkStore_mode mode=gkStore_readOnly, uint32 partID=UINT32_MAX) { // If an instance exists, return it, otherwise, make a new one. 
#pragma omp critical { if (_instance != NULL) { _instanceCount++; //fprintf(stderr, "gkStore_open(%s) from thread %d, %u instances now\n", path, omp_get_thread_num(), _instanceCount); } else { _instance = new gkStore(path, mode, partID); _instanceCount = 1; //fprintf(stderr, "gkStore_open(%s) from thread %d, first instance, create store\n", path, omp_get_thread_num()); } } return(_instance); }; void gkStore_close(void) { #pragma omp critical { _instanceCount--; if (_instanceCount == 0) { delete _instance; _instance = NULL; //fprintf(stderr, "gkStore_close(%s) from thread %d, no instances remain, delete store\n", // _storeName, omp_get_thread_num()); } else { //fprintf(stderr, "gkStore_close(%s) from thread %d, %u instances remain\n", // _storeName, omp_get_thread_num(), _instanceCount); } } }; public: const char *gkStore_path(void) { return(_storePath); }; // Returns the path to the store const char *gkStore_name(void) { return(_storeName); }; // Returns the name, e.g., name.gkpStore void gkStore_buildPartitions(uint32 *partitionMap); static void gkStore_clone(char *originalPath, char *clonePath); void gkStore_delete(void); // Deletes the files in the store. void gkStore_deletePartitions(void); // Deletes the files for a partition. uint32 gkStore_getNumLibraries(void) { return(_info.numLibraries); }; uint32 gkStore_getNumReads(void) { return(_info.numReads); }; gkLibrary *gkStore_getLibrary(uint32 id) { return(&_libraries[id]); }; // Returns a read, using the copy in the partition if the partition exists.
gkRead *gkStore_getRead(uint32 id) { if ((_readIDtoPartitionID) && (_readIDtoPartitionID[id] != _partitionID)) { fprintf(stderr, "getRead()-- WARNING: access to read %u in partition %u is slow when partition %u is loaded.\n", id, _readIDtoPartitionID[id], _partitionID); assert(0); } if ((_readIDtoPartitionID) && (_readIDtoPartitionID[id] == _partitionID)) { //fprintf(stderr, "getRead()-- id=%u mapped=%u mappedid=%u\n", // id, _readIDtoPartitionIdx[id], _reads[_readIDtoPartitionIdx[id]].gkRead_readID()); return(_reads + _readIDtoPartitionIdx[id]); } return(_reads + id); }; // Returns a read, but only if it is in the currently loaded partition. gkRead *gkStore_getReadInPartition(uint32 id) { if (_readIDtoPartitionID == NULL) // Not partitioned, return regular read. return(gkStore_getRead(id)); if (_readIDtoPartitionID[id] != _partitionID) // Partitioned, and not in this partition. return(NULL); //fprintf(stderr, "getRead()-- SUCCESS: access to read %u in partition %u is FAST when partition %u is loaded.\n", // id, _readIDtoPartitionID[id], _partitionID); return(_reads + _readIDtoPartitionIdx[id]); } gkLibrary *gkStore_addEmptyLibrary(char const *name); gkRead *gkStore_addEmptyRead(gkLibrary *lib); void gkStore_loadReadData(gkRead *read, gkReadData *readData) { //fprintf(stderr, "loadReadData()- read " F_U64 " thread " F_S32 " out of " F_S32 "\n", // read->_readID, omp_get_thread_num(), omp_get_max_threads()); if (_blobs) read->gkRead_loadDataFromMMap(readData, _blobs); if (_blobsFiles) read->gkRead_loadDataFromFile(readData, _blobsFiles[omp_get_thread_num()]); }; void gkStore_loadReadData(uint32 readID, gkReadData *readData) { gkStore_loadReadData(gkStore_getRead(readID), readData); }; void gkStore_stashReadData(gkRead *read, gkReadData *data); // Used in utgcns, for the package format.
static void gkStore_loadReadFromStream(FILE *S, gkRead *read, gkReadData *readData); void gkStore_saveReadToStream(FILE *S, uint32 id); private: static gkStore *_instance; static uint32 _instanceCount; gkStoreInfo _info; // All the stuff stored on disk. char _storePath[FILENAME_MAX]; // Needed to create files char _storeName[FILENAME_MAX]; // Useful for log files in other programs gkStore_mode _mode; // What mode this store is opened as, sanity checking // If these are memory mapped, then multiple processes on the same host can share the // (read-only) data. // // For blobs, we allow either using the mmap directly, or skipping the mmap and // using a buffer. memoryMappedFile *_librariesMMap; uint32 _librariesAlloc; // If zero, the mmap is used. gkLibrary *_libraries; memoryMappedFile *_readsMMap; uint32 _readsAlloc; // If zero, the mmap is used. gkRead *_reads; memoryMappedFile *_blobsMMap; // Either the full blobs, or the partitioned blobs. void *_blobs; // Pointer to the data in the blobsMMap. writeBuffer *_blobsWriter; // For constructing a store, data gets dumped here. FILE **_blobsFiles; // For loading reads directly, one per thread. 
// If the store is opened partitioned, this data is loaded from disk uint32 _numberOfPartitions; // Total number of partitions that exist uint32 _partitionID; // Which partition this is uint32 *_readsPerPartition; // Number of reads in each partition, mostly sanity checking uint32 *_readIDtoPartitionIdx; // Map from global ID to local partition index uint32 *_readIDtoPartitionID; // Map from global ID to partition ID }; class gkStoreStats { public: gkStoreStats(char const *gkStoreName) { gkStore *gkp = gkStore::gkStore_open(gkStoreName); init(gkp); gkp->gkStore_close(); }; gkStoreStats(gkStore *gkp) { init(gkp); }; ~gkStoreStats() { delete [] lowestID; delete [] highestID; delete [] numActivePerLib; delete [] readLengthPerLib; delete [] clearLengthPerLib; }; void init(gkStore *gkp); // Global stats over the whole store uint32 numActiveFrag; uint64 readLength; uint64 clearLength; // Per library stats uint32 *lowestID; uint32 *highestID; uint32 *numActivePerLib; uint64 *readLengthPerLib; uint64 *clearLengthPerLib; }; #endif canu-1.6/src/stores/gkStoreEncode.C /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2015-DEC-03 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license.
 */

#include "gkStore.H"


//  Encode seq as 2-bit bases.  Doesn't touch qlt.

uint32
gkRead::gkRead_encode2bit(uint8 *&chunk, char *seq, uint32 seqLen) {

  //  Scan the read, if there are non-acgt, return length 0; this cannot encode it.
  for (uint32 ii=0; ii<seqLen; ii++)
    if ((seq[ii] != 'A') && (seq[ii] != 'C') && (seq[ii] != 'G') && (seq[ii] != 'T'))
      return(0);

  uint32 chunkLen = seqLen / 4 + ((seqLen % 4) ? 1 : 0);

  chunk = new uint8 [chunkLen];

  for (uint32 cc=0, ii=0; ii<seqLen; cc++) {
    uint8 byte = 0;

    //  Pack up to four bases per byte, most significant pair first; a short
    //  final byte is padded with zero bits at the bottom.
    for (uint32 bb=0; bb<4; bb++) {
      byte <<= 2;

      if (ii < seqLen) {
        switch (seq[ii++]) {
          case 'C':  byte |= 0x01;  break;
          case 'G':  byte |= 0x02;  break;
          case 'T':  byte |= 0x03;  break;
        }
      }
    }

    chunk[cc] = byte;
  }

  return(chunkLen);
}


bool
gkRead::gkRead_decode2bit(uint8 *chunk, uint32 chunkLen, char *seq, uint32 seqLen) {
  char    acgt[4] = { 'A', 'C', 'G', 'T' };
  uint32  ii      = 0;

  for (uint32 cc=0; cc<chunkLen; cc++) {
    uint8 byte = chunk[cc];

    if (ii + 4 <= seqLen) {            //  This if is redundant, but pretty.
      seq[ii++] = acgt[((byte >> 6) & 0x03)];
      seq[ii++] = acgt[((byte >> 4) & 0x03)];
      seq[ii++] = acgt[((byte >> 2) & 0x03)];
      seq[ii++] = acgt[((byte >> 0) & 0x03)];
    }

    else {
      if (ii < seqLen)  seq[ii++] = acgt[((byte >> 6) & 0x03)];   //  This if is also redundant, and also pretty.
      if (ii < seqLen)  seq[ii++] = acgt[((byte >> 4) & 0x03)];
      if (ii < seqLen)  seq[ii++] = acgt[((byte >> 2) & 0x03)];
      if (ii < seqLen)  seq[ii++] = acgt[((byte >> 0) & 0x03)];
    }
  }

  seq[seqLen] = 0;

  return(true);
}


//  Encode seq as 3-bases-in-7-bits.  Doesn't touch qlt.

uint32
gkRead::gkRead_encode3bit(uint8 *&UNUSED(chunk), char *UNUSED(seq), uint32 UNUSED(seqLen)) {
  return(0);
}

bool
gkRead::gkRead_decode3bit(uint8 *UNUSED(chunk), uint32 UNUSED(chunkLen), char *UNUSED(seq), uint32 UNUSED(seqLen)) {
  return(false);
}


//  Encode qualities as 4 bit integers.  Doesn't touch seq.

uint32
gkRead::gkRead_encode4bit(uint8 *&UNUSED(chunk), char *qlt, uint32 UNUSED(seqLen)) {
  if (qlt[0] == 0)   //  No QVs in the string.
    return(0);

  return(0);
}

bool
gkRead::gkRead_decode4bit(uint8 *UNUSED(chunk), uint32 UNUSED(chunkLen), char *UNUSED(qlt), uint32 UNUSED(seqLen)) {
  return(false);
}


//  Encode qualities as 5 bit integers.  Doesn't touch seq.

uint32
gkRead::gkRead_encode5bit(uint8 *&UNUSED(chunk), char *qlt, uint32 UNUSED(seqLen)) {
  if (qlt[0] == 0)   //  No QVs in the string.
return(0); return(0); } bool gkRead::gkRead_decode5bit(uint8 *UNUSED(chunk), uint32 UNUSED(chunkLen), char *UNUSED(qlt), uint32 UNUSED(seqLen)) { return(false); } canu-1.6/src/stores/libsnappy/000077500000000000000000000000001314437614700164275ustar00rootroot00000000000000canu-1.6/src/stores/libsnappy/snappy-internal.h000066400000000000000000000225651314437614700217360ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-AUG-30 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ // Copyright 2008 Google Inc. All Rights Reserved. // // Redistribution and use in source and binary forms, with or without // modification, are permitted provided that the following conditions are // met: // // * Redistributions of source code must retain the above copyright // notice, this list of conditions and the following disclaimer. // * Redistributions in binary form must reproduce the above // copyright notice, this list of conditions and the following disclaimer // in the documentation and/or other materials provided with the // distribution. // * Neither the name of Google Inc. nor the names of its // contributors may be used to endorse or promote products derived from // this software without specific prior written permission. 
// // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR // A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT // OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT // LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY // THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. // // Internals shared between the Snappy implementation and its unittest. #ifndef THIRD_PARTY_SNAPPY_SNAPPY_INTERNAL_H_ #define THIRD_PARTY_SNAPPY_SNAPPY_INTERNAL_H_ #include "snappy-stubs-internal.h" namespace snappy { namespace internal { class WorkingMemory { public: WorkingMemory() : large_table_(NULL) { } ~WorkingMemory() { delete[] large_table_; } // Allocates and clears a hash table using memory in "*this", // stores the number of buckets in "*table_size" and returns a pointer to // the base of the hash table. uint16* GetHashTable(size_t input_size, int* table_size); private: uint16 small_table_[1<<10]; // 2KB uint16* large_table_; // Allocated only when needed DISALLOW_COPY_AND_ASSIGN(WorkingMemory); }; // Flat array compression that does not emit the "uncompressed length" // prefix. Compresses "input" string to the "*op" buffer. // // REQUIRES: "input_length <= kBlockSize" // REQUIRES: "op" points to an array of memory that is at least // "MaxCompressedLength(input_length)" in size. // REQUIRES: All elements in "table[0..table_size-1]" are initialized to zero. // REQUIRES: "table_size" is a power of two // // Returns an "end" pointer into "op" buffer. 
// "end - op" is the compressed size of "input". char* CompressFragment(const char* input, size_t input_length, char* op, uint16* table, const int table_size); // Return the largest n such that // // s1[0,n-1] == s2[0,n-1] // and n <= (s2_limit - s2). // // Does not read *s2_limit or beyond. // Does not read *(s1 + (s2_limit - s2)) or beyond. // Requires that s2_limit >= s2. // // Separate implementation for x86_64, for speed. Uses the fact that // x86_64 is little endian. #if defined(ARCH_K8) static inline int FindMatchLength(const char* s1, const char* s2, const char* s2_limit) { assert(s2_limit >= s2); int matched = 0; // Find out how long the match is. We loop over the data 64 bits at a // time until we find a 64-bit block that doesn't match; then we find // the first non-matching bit and use that to calculate the total // length of the match. while (PREDICT_TRUE(s2 <= s2_limit - 8)) { if (UNALIGNED_LOAD64(s2) == UNALIGNED_LOAD64(s1 + matched)) { s2 += 8; matched += 8; } else { // On current (mid-2008) Opteron models there is a 3% more // efficient code sequence to find the first non-matching byte. // However, what follows is ~10% better on Intel Core 2 and newer, // and we expect AMD's bsf instruction to improve. uint64 x = UNALIGNED_LOAD64(s2) ^ UNALIGNED_LOAD64(s1 + matched); int matching_bits = Bits::FindLSBSetNonZero64(x); matched += matching_bits >> 3; return matched; } } while (PREDICT_TRUE(s2 < s2_limit)) { if (s1[matched] == *s2) { ++s2; ++matched; } else { return matched; } } return matched; } #else static inline int FindMatchLength(const char* s1, const char* s2, const char* s2_limit) { // Implementation based on the x86-64 version, above. 
assert(s2_limit >= s2); int matched = 0; while (s2 <= s2_limit - 4 && UNALIGNED_LOAD32(s2) == UNALIGNED_LOAD32(s1 + matched)) { s2 += 4; matched += 4; } if (LittleEndian::IsLittleEndian() && s2 <= s2_limit - 4) { uint32 x = UNALIGNED_LOAD32(s2) ^ UNALIGNED_LOAD32(s1 + matched); int matching_bits = Bits::FindLSBSetNonZero(x); matched += matching_bits >> 3; } else { while ((s2 < s2_limit) && (s1[matched] == *s2)) { ++s2; ++matched; } } return matched; } #endif // Lookup tables for decompression code. Give --snappy_dump_decompression_table // to the unit test to recompute char_table. enum { LITERAL = 0, COPY_1_BYTE_OFFSET = 1, // 3 bit length + 3 bits of offset in opcode COPY_2_BYTE_OFFSET = 2, COPY_4_BYTE_OFFSET = 3 }; static const int kMaximumTagLength = 5; // COPY_4_BYTE_OFFSET plus the actual offset. // Mapping from i in range [0,4] to a mask to extract the bottom 8*i bits static const uint32 wordmask[] = { 0u, 0xffu, 0xffffu, 0xffffffu, 0xffffffffu }; // Data stored per entry in lookup table: // Range Bits-used Description // ------------------------------------ // 1..64 0..7 Literal/copy length encoded in opcode byte // 0..7 8..10 Copy offset encoded in opcode byte / 256 // 0..4 11..13 Extra bytes after opcode // // We use eight bits for the length even though 7 would have sufficed // because of efficiency reasons: // (1) Extracting a byte is faster than a bit-field // (2) It properly aligns copy offset so we do not need a <<8 static const uint16 char_table[256] = { 0x0001, 0x0804, 0x1001, 0x2001, 0x0002, 0x0805, 0x1002, 0x2002, 0x0003, 0x0806, 0x1003, 0x2003, 0x0004, 0x0807, 0x1004, 0x2004, 0x0005, 0x0808, 0x1005, 0x2005, 0x0006, 0x0809, 0x1006, 0x2006, 0x0007, 0x080a, 0x1007, 0x2007, 0x0008, 0x080b, 0x1008, 0x2008, 0x0009, 0x0904, 0x1009, 0x2009, 0x000a, 0x0905, 0x100a, 0x200a, 0x000b, 0x0906, 0x100b, 0x200b, 0x000c, 0x0907, 0x100c, 0x200c, 0x000d, 0x0908, 0x100d, 0x200d, 0x000e, 0x0909, 0x100e, 0x200e, 0x000f, 0x090a, 0x100f, 0x200f, 0x0010, 0x090b, 0x1010, 
0x2010, 0x0011, 0x0a04, 0x1011, 0x2011, 0x0012, 0x0a05, 0x1012, 0x2012, 0x0013, 0x0a06, 0x1013, 0x2013, 0x0014, 0x0a07, 0x1014, 0x2014, 0x0015, 0x0a08, 0x1015, 0x2015, 0x0016, 0x0a09, 0x1016, 0x2016, 0x0017, 0x0a0a, 0x1017, 0x2017, 0x0018, 0x0a0b, 0x1018, 0x2018, 0x0019, 0x0b04, 0x1019, 0x2019, 0x001a, 0x0b05, 0x101a, 0x201a, 0x001b, 0x0b06, 0x101b, 0x201b, 0x001c, 0x0b07, 0x101c, 0x201c, 0x001d, 0x0b08, 0x101d, 0x201d, 0x001e, 0x0b09, 0x101e, 0x201e, 0x001f, 0x0b0a, 0x101f, 0x201f, 0x0020, 0x0b0b, 0x1020, 0x2020, 0x0021, 0x0c04, 0x1021, 0x2021, 0x0022, 0x0c05, 0x1022, 0x2022, 0x0023, 0x0c06, 0x1023, 0x2023, 0x0024, 0x0c07, 0x1024, 0x2024, 0x0025, 0x0c08, 0x1025, 0x2025, 0x0026, 0x0c09, 0x1026, 0x2026, 0x0027, 0x0c0a, 0x1027, 0x2027, 0x0028, 0x0c0b, 0x1028, 0x2028, 0x0029, 0x0d04, 0x1029, 0x2029, 0x002a, 0x0d05, 0x102a, 0x202a, 0x002b, 0x0d06, 0x102b, 0x202b, 0x002c, 0x0d07, 0x102c, 0x202c, 0x002d, 0x0d08, 0x102d, 0x202d, 0x002e, 0x0d09, 0x102e, 0x202e, 0x002f, 0x0d0a, 0x102f, 0x202f, 0x0030, 0x0d0b, 0x1030, 0x2030, 0x0031, 0x0e04, 0x1031, 0x2031, 0x0032, 0x0e05, 0x1032, 0x2032, 0x0033, 0x0e06, 0x1033, 0x2033, 0x0034, 0x0e07, 0x1034, 0x2034, 0x0035, 0x0e08, 0x1035, 0x2035, 0x0036, 0x0e09, 0x1036, 0x2036, 0x0037, 0x0e0a, 0x1037, 0x2037, 0x0038, 0x0e0b, 0x1038, 0x2038, 0x0039, 0x0f04, 0x1039, 0x2039, 0x003a, 0x0f05, 0x103a, 0x203a, 0x003b, 0x0f06, 0x103b, 0x203b, 0x003c, 0x0f07, 0x103c, 0x203c, 0x0801, 0x0f08, 0x103d, 0x203d, 0x1001, 0x0f09, 0x103e, 0x203e, 0x1801, 0x0f0a, 0x103f, 0x203f, 0x2001, 0x0f0b, 0x1040, 0x2040 }; } // end namespace internal } // end namespace snappy #endif // THIRD_PARTY_SNAPPY_SNAPPY_INTERNAL_H_ canu-1.6/src/stores/libsnappy/snappy-sinksource.cc000066400000000000000000000063261314437614700224420ustar00rootroot00000000000000// Copyright 2011 Google Inc. All Rights Reserved. 
// // Redistribution and use in source and binary forms, with or without // modification, are permitted provided that the following conditions are // met: // // * Redistributions of source code must retain the above copyright // notice, this list of conditions and the following disclaimer. // * Redistributions in binary form must reproduce the above // copyright notice, this list of conditions and the following disclaimer // in the documentation and/or other materials provided with the // distribution. // * Neither the name of Google Inc. nor the names of its // contributors may be used to endorse or promote products derived from // this software without specific prior written permission. // // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR // A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT // OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT // LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY // THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
#include <string.h> #include "snappy-sinksource.h" namespace snappy { Source::~Source() { } Sink::~Sink() { } char* Sink::GetAppendBuffer(size_t length, char* scratch) { return scratch; } char* Sink::GetAppendBufferVariable( size_t min_size, size_t desired_size_hint, char* scratch, size_t scratch_size, size_t* allocated_size) { *allocated_size = scratch_size; return scratch; } void Sink::AppendAndTakeOwnership( char* bytes, size_t n, void (*deleter)(void*, const char*, size_t), void *deleter_arg) { Append(bytes, n); (*deleter)(deleter_arg, bytes, n); } ByteArraySource::~ByteArraySource() { } size_t ByteArraySource::Available() const { return left_; } const char* ByteArraySource::Peek(size_t* len) { *len = left_; return ptr_; } void ByteArraySource::Skip(size_t n) { left_ -= n; ptr_ += n; } UncheckedByteArraySink::~UncheckedByteArraySink() { } void UncheckedByteArraySink::Append(const char* data, size_t n) { // Do no copying if the caller filled in the result of GetAppendBuffer() if (data != dest_) { memcpy(dest_, data, n); } dest_ += n; } char* UncheckedByteArraySink::GetAppendBuffer(size_t len, char* scratch) { return dest_; } void UncheckedByteArraySink::AppendAndTakeOwnership( char* data, size_t n, void (*deleter)(void*, const char*, size_t), void *deleter_arg) { if (data != dest_) { memcpy(dest_, data, n); (*deleter)(deleter_arg, data, n); } dest_ += n; } char* UncheckedByteArraySink::GetAppendBufferVariable( size_t min_size, size_t desired_size_hint, char* scratch, size_t scratch_size, size_t* allocated_size) { *allocated_size = desired_size_hint; return dest_; } } // namespace snappy canu-1.6/src/stores/libsnappy/snappy-sinksource.h /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-AUG-30 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ // Copyright 2011 Google Inc. All Rights Reserved. // // Redistribution and use in source and binary forms, with or without // modification, are permitted provided that the following conditions are // met: // // * Redistributions of source code must retain the above copyright // notice, this list of conditions and the following disclaimer. // * Redistributions in binary form must reproduce the above // copyright notice, this list of conditions and the following disclaimer // in the documentation and/or other materials provided with the // distribution. // * Neither the name of Google Inc. nor the names of its // contributors may be used to endorse or promote products derived from // this software without specific prior written permission. // // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR // A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT // OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT // LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY // THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. #ifndef THIRD_PARTY_SNAPPY_SNAPPY_SINKSOURCE_H_ #define THIRD_PARTY_SNAPPY_SNAPPY_SINKSOURCE_H_ #include <stddef.h> namespace snappy { // A Sink is an interface that consumes a sequence of bytes. class Sink { public: Sink() { } virtual ~Sink(); // Append "bytes[0,n-1]" to this. virtual void Append(const char* bytes, size_t n) = 0; // Returns a writable buffer of the specified length for appending. // May return a pointer to the caller-owned scratch buffer which // must have at least the indicated length. The returned buffer is // only valid until the next operation on this Sink. // // After writing at most "length" bytes, call Append() with the // pointer returned from this function and the number of bytes // written. Many Append() implementations will avoid copying // bytes if this function returned an internal buffer. // // If a non-scratch buffer is returned, the caller may only pass a // prefix of it to Append(). That is, it is not correct to pass an // interior pointer of the returned array to Append(). // // The default implementation always returns the scratch buffer. virtual char* GetAppendBuffer(size_t length, char* scratch); // For higher performance, Sink implementations can provide custom // AppendAndTakeOwnership() and GetAppendBufferVariable() methods. // These methods can reduce the number of copies done during // compression/decompression. // Append "bytes[0,n-1] to the sink.
Takes ownership of "bytes" // and calls the deleter function as (*deleter)(deleter_arg, bytes, n) // to free the buffer. deleter function must be non NULL. // // The default implementation just calls Append and frees "bytes". // Other implementations may avoid a copy while appending the buffer. virtual void AppendAndTakeOwnership( char* bytes, size_t n, void (*deleter)(void*, const char*, size_t), void *deleter_arg); // Returns a writable buffer for appending and writes the buffer's capacity to // *allocated_size. Guarantees *allocated_size >= min_size. // May return a pointer to the caller-owned scratch buffer which must have // scratch_size >= min_size. // // The returned buffer is only valid until the next operation // on this ByteSink. // // After writing at most *allocated_size bytes, call Append() with the // pointer returned from this function and the number of bytes written. // Many Append() implementations will avoid copying bytes if this function // returned an internal buffer. // // If the sink implementation allocates or reallocates an internal buffer, // it should use the desired_size_hint if appropriate. If a caller cannot // provide a reasonable guess at the desired capacity, it should set // desired_size_hint = 0. // // If a non-scratch buffer is returned, the caller may only pass // a prefix to it to Append(). That is, it is not correct to pass an // interior pointer to Append(). // // The default implementation always returns the scratch buffer. virtual char* GetAppendBufferVariable( size_t min_size, size_t desired_size_hint, char* scratch, size_t scratch_size, size_t* allocated_size); private: // No copying Sink(const Sink&); void operator=(const Sink&); }; // A Source is an interface that yields a sequence of bytes class Source { public: Source() { } virtual ~Source(); // Return the number of bytes left to read from the source virtual size_t Available() const = 0; // Peek at the next flat region of the source. Does not reposition // the source. 
The returned region is empty iff Available()==0. // // Returns a pointer to the beginning of the region and store its // length in *len. // // The returned region is valid until the next call to Skip() or // until this object is destroyed, whichever occurs first. // // The returned region may be larger than Available() (for example // if this ByteSource is a view on a substring of a larger source). // The caller is responsible for ensuring that it only reads the // Available() bytes. virtual const char* Peek(size_t* len) = 0; // Skip the next n bytes. Invalidates any buffer returned by // a previous call to Peek(). // REQUIRES: Available() >= n virtual void Skip(size_t n) = 0; private: // No copying Source(const Source&); void operator=(const Source&); }; // A Source implementation that yields the contents of a flat array class ByteArraySource : public Source { public: ByteArraySource(const char* p, size_t n) : ptr_(p), left_(n) { } virtual ~ByteArraySource(); virtual size_t Available() const; virtual const char* Peek(size_t* len); virtual void Skip(size_t n); private: const char* ptr_; size_t left_; }; // A Sink implementation that writes to a flat array without any bound checks. class UncheckedByteArraySink : public Sink { public: explicit UncheckedByteArraySink(char* dest) : dest_(dest) { } virtual ~UncheckedByteArraySink(); virtual void Append(const char* data, size_t n); virtual char* GetAppendBuffer(size_t len, char* scratch); virtual char* GetAppendBufferVariable( size_t min_size, size_t desired_size_hint, char* scratch, size_t scratch_size, size_t* allocated_size); virtual void AppendAndTakeOwnership( char* bytes, size_t n, void (*deleter)(void*, const char*, size_t), void *deleter_arg); // Return the current output pointer so that a caller can see how // many bytes were produced. // Note: this is not a Sink method. 
char* CurrentDestination() const { return dest_; } private: char* dest_; }; } // namespace snappy #endif // THIRD_PARTY_SNAPPY_SNAPPY_SINKSOURCE_H_ canu-1.6/src/stores/libsnappy/snappy-stubs-internal.cc000066400000000000000000000034431314437614700232240ustar00rootroot00000000000000// Copyright 2011 Google Inc. All Rights Reserved. // // Redistribution and use in source and binary forms, with or without // modification, are permitted provided that the following conditions are // met: // // * Redistributions of source code must retain the above copyright // notice, this list of conditions and the following disclaimer. // * Redistributions in binary form must reproduce the above // copyright notice, this list of conditions and the following disclaimer // in the documentation and/or other materials provided with the // distribution. // * Neither the name of Google Inc. nor the names of its // contributors may be used to endorse or promote products derived from // this software without specific prior written permission. // // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR // A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT // OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT // LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY // THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
#include <algorithm> #include <string> #include "snappy-stubs-internal.h" namespace snappy { void Varint::Append32(string* s, uint32 value) { char buf[Varint::kMax32]; const char* p = Varint::Encode32(buf, value); s->append(buf, p - buf); } } // namespace snappy canu-1.6/src/stores/libsnappy/snappy-stubs-internal.h000066400000000000000000000420421314437614700230640ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-AUG-30 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ // Copyright 2011 Google Inc. All Rights Reserved. // // Redistribution and use in source and binary forms, with or without // modification, are permitted provided that the following conditions are // met: // // * Redistributions of source code must retain the above copyright // notice, this list of conditions and the following disclaimer. // * Redistributions in binary form must reproduce the above // copyright notice, this list of conditions and the following disclaimer // in the documentation and/or other materials provided with the // distribution. // * Neither the name of Google Inc. nor the names of its // contributors may be used to endorse or promote products derived from // this software without specific prior written permission.
// // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR // A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT // OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT // LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY // THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. // // Various stubs for the open-source version of Snappy. #ifndef THIRD_PARTY_SNAPPY_OPENSOURCE_SNAPPY_STUBS_INTERNAL_H_ #define THIRD_PARTY_SNAPPY_OPENSOURCE_SNAPPY_STUBS_INTERNAL_H_ #ifdef HAVE_CONFIG_H #include "config.h" #endif #include <string> #include <assert.h> #include <stdlib.h> #include <string.h> #ifdef HAVE_SYS_MMAN_H #include <sys/mman.h> #endif #include "snappy-stubs-public.h" #if defined(__x86_64__) // Enable 64-bit optimized versions of some routines. #define ARCH_K8 1 #endif // Needed by OS X, among others. #ifndef MAP_ANONYMOUS #define MAP_ANONYMOUS MAP_ANON #endif // Pull in std::min, std::ostream, and the likes. This is safe because this // header file is never used from any public header files. using namespace std; // The size of an array, if known at compile-time. // Will give unexpected results if used on a pointer. // We undefine it first, since some compilers already have a definition. #ifdef ARRAYSIZE #undef ARRAYSIZE #endif #define ARRAYSIZE(a) (sizeof(a) / sizeof(*(a))) // Static prediction hints.
#ifdef HAVE_BUILTIN_EXPECT #define PREDICT_FALSE(x) (__builtin_expect(x, 0)) #define PREDICT_TRUE(x) (__builtin_expect(!!(x), 1)) #else #define PREDICT_FALSE(x) x #define PREDICT_TRUE(x) x #endif // This is only used for recomputing the tag byte table used during // decompression; for simplicity we just remove it from the open-source // version (anyone who wants to regenerate it can just do the call // themselves within main()). #define DEFINE_bool(flag_name, default_value, description) \ bool FLAGS_ ## flag_name = default_value #define DECLARE_bool(flag_name) \ extern bool FLAGS_ ## flag_name namespace snappy { static const uint32 kuint32max = static_cast<uint32>(0xFFFFFFFF); static const int64 kint64max = static_cast<int64>(0x7FFFFFFFFFFFFFFFLL); // Potentially unaligned loads and stores. // x86 and PowerPC can simply do these loads and stores native. #if defined(__i386__) || defined(__x86_64__) || defined(__powerpc__) #define UNALIGNED_LOAD16(_p) (*reinterpret_cast<const uint16 *>(_p)) #define UNALIGNED_LOAD32(_p) (*reinterpret_cast<const uint32 *>(_p)) #define UNALIGNED_LOAD64(_p) (*reinterpret_cast<const uint64 *>(_p)) #define UNALIGNED_STORE16(_p, _val) (*reinterpret_cast<uint16 *>(_p) = (_val)) #define UNALIGNED_STORE32(_p, _val) (*reinterpret_cast<uint32 *>(_p) = (_val)) #define UNALIGNED_STORE64(_p, _val) (*reinterpret_cast<uint64 *>(_p) = (_val)) // ARMv7 and newer support native unaligned accesses, but only of 16-bit // and 32-bit values (not 64-bit); older versions either raise a fatal signal, // do an unaligned read and rotate the words around a bit, or do the reads very // slowly (trip through kernel mode). There's no simple #define that says just // "ARMv7 or higher", so we have to filter away all ARMv5 and ARMv6 // sub-architectures. // // This is a mess, but there's not much we can do about it. // // To further complicate matters, only LDR instructions (single reads) are // allowed to be unaligned, not LDRD (two reads) or LDM (many reads).
Unless we // explicitly tell the compiler that these accesses can be unaligned, it can and // will combine accesses. On armcc, the way to signal this is done by accessing // through the type (uint32 __packed *), but GCC has no such attribute // (it ignores __attribute__((packed)) on individual variables). However, // we can tell it that a _struct_ is unaligned, which has the same effect, // so we do that. #elif defined(__arm__) && \ !defined(__ARM_ARCH_4__) && \ !defined(__ARM_ARCH_4T__) && \ !defined(__ARM_ARCH_5__) && \ !defined(__ARM_ARCH_5T__) && \ !defined(__ARM_ARCH_5TE__) && \ !defined(__ARM_ARCH_5TEJ__) && \ !defined(__ARM_ARCH_6__) && \ !defined(__ARM_ARCH_6J__) && \ !defined(__ARM_ARCH_6K__) && \ !defined(__ARM_ARCH_6Z__) && \ !defined(__ARM_ARCH_6ZK__) && \ !defined(__ARM_ARCH_6T2__) #if __GNUC__ #define ATTRIBUTE_PACKED __attribute__((__packed__)) #else #define ATTRIBUTE_PACKED #endif namespace base { namespace internal { struct Unaligned16Struct { uint16 value; uint8 dummy; // To make the size non-power-of-two. } ATTRIBUTE_PACKED; struct Unaligned32Struct { uint32 value; uint8 dummy; // To make the size non-power-of-two. } ATTRIBUTE_PACKED; } // namespace internal } // namespace base #define UNALIGNED_LOAD16(_p) \ ((reinterpret_cast<const ::snappy::base::internal::Unaligned16Struct *>(_p))->value) #define UNALIGNED_LOAD32(_p) \ ((reinterpret_cast<const ::snappy::base::internal::Unaligned32Struct *>(_p))->value) #define UNALIGNED_STORE16(_p, _val) \ ((reinterpret_cast< ::snappy::base::internal::Unaligned16Struct *>(_p))->value = \ (_val)) #define UNALIGNED_STORE32(_p, _val) \ ((reinterpret_cast< ::snappy::base::internal::Unaligned32Struct *>(_p))->value = \ (_val)) // TODO(user): NEON supports unaligned 64-bit loads and stores. // See if that would be more efficient on platforms supporting it, // at least for copies.
inline uint64 UNALIGNED_LOAD64(const void *p) { uint64 t; memcpy(&t, p, sizeof t); return t; } inline void UNALIGNED_STORE64(void *p, uint64 v) { memcpy(p, &v, sizeof v); } #else // These functions are provided for architectures that don't support // unaligned loads and stores. inline uint16 UNALIGNED_LOAD16(const void *p) { uint16 t; memcpy(&t, p, sizeof t); return t; } inline uint32 UNALIGNED_LOAD32(const void *p) { uint32 t; memcpy(&t, p, sizeof t); return t; } inline uint64 UNALIGNED_LOAD64(const void *p) { uint64 t; memcpy(&t, p, sizeof t); return t; } inline void UNALIGNED_STORE16(void *p, uint16 v) { memcpy(p, &v, sizeof v); } inline void UNALIGNED_STORE32(void *p, uint32 v) { memcpy(p, &v, sizeof v); } inline void UNALIGNED_STORE64(void *p, uint64 v) { memcpy(p, &v, sizeof v); } #endif // This can be more efficient than UNALIGNED_LOAD64 + UNALIGNED_STORE64 // on some platforms, in particular ARM. inline void UnalignedCopy64(const void *src, void *dst) { if (sizeof(void *) == 8) { UNALIGNED_STORE64(dst, UNALIGNED_LOAD64(src)); } else { const char *src_char = reinterpret_cast<const char *>(src); char *dst_char = reinterpret_cast<char *>(dst); UNALIGNED_STORE32(dst_char, UNALIGNED_LOAD32(src_char)); UNALIGNED_STORE32(dst_char + 4, UNALIGNED_LOAD32(src_char + 4)); } } // The following guarantees declaration of the byte swap functions. #ifdef WORDS_BIGENDIAN #ifdef HAVE_SYS_BYTEORDER_H #include <sys/byteorder.h> #endif #ifdef HAVE_SYS_ENDIAN_H #include <sys/endian.h> #endif #ifdef _MSC_VER #include <stdlib.h> #define bswap_16(x) _byteswap_ushort(x) #define bswap_32(x) _byteswap_ulong(x) #define bswap_64(x) _byteswap_uint64(x) #elif defined(__APPLE__) // Mac OS X / Darwin features #include <libkern/OSByteOrder.h> #define bswap_16(x) OSSwapInt16(x) #define bswap_32(x) OSSwapInt32(x) #define bswap_64(x) OSSwapInt64(x) #elif defined(HAVE_BYTESWAP_H) #include <byteswap.h> #elif defined(bswap32) // FreeBSD defines bswap{16,32,64} in <sys/endian.h> (already #included).
#define bswap_16(x) bswap16(x) #define bswap_32(x) bswap32(x) #define bswap_64(x) bswap64(x) #elif defined(BSWAP_64) // Solaris 10 defines BSWAP_{16,32,64} in <sys/byteorder.h> (already #included). #define bswap_16(x) BSWAP_16(x) #define bswap_32(x) BSWAP_32(x) #define bswap_64(x) BSWAP_64(x) #else inline uint16 bswap_16(uint16 x) { return (x << 8) | (x >> 8); } inline uint32 bswap_32(uint32 x) { x = ((x & 0xff00ff00UL) >> 8) | ((x & 0x00ff00ffUL) << 8); return (x >> 16) | (x << 16); } inline uint64 bswap_64(uint64 x) { x = ((x & 0xff00ff00ff00ff00ULL) >> 8) | ((x & 0x00ff00ff00ff00ffULL) << 8); x = ((x & 0xffff0000ffff0000ULL) >> 16) | ((x & 0x0000ffff0000ffffULL) << 16); return (x >> 32) | (x << 32); } #endif #endif // WORDS_BIGENDIAN // Convert to little-endian storage, opposite of network format. // Convert x from host to little endian: x = LittleEndian.FromHost(x); // convert x from little endian to host: x = LittleEndian.ToHost(x); // // Store values into unaligned memory converting to little endian order: // LittleEndian.Store16(p, x); // // Load unaligned values stored in little endian converting to host order: // x = LittleEndian.Load16(p); class LittleEndian { public: // Conversion functions. #ifdef WORDS_BIGENDIAN static uint16 FromHost16(uint16 x) { return bswap_16(x); } static uint16 ToHost16(uint16 x) { return bswap_16(x); } static uint32 FromHost32(uint32 x) { return bswap_32(x); } static uint32 ToHost32(uint32 x) { return bswap_32(x); } static bool IsLittleEndian() { return false; } #else // !defined(WORDS_BIGENDIAN) static uint16 FromHost16(uint16 x) { return x; } static uint16 ToHost16(uint16 x) { return x; } static uint32 FromHost32(uint32 x) { return x; } static uint32 ToHost32(uint32 x) { return x; } static bool IsLittleEndian() { return true; } #endif // !defined(WORDS_BIGENDIAN) // Functions to do unaligned loads and stores in little-endian order.
static uint16 Load16(const void *p) { return ToHost16(UNALIGNED_LOAD16(p)); } static void Store16(void *p, uint16 v) { UNALIGNED_STORE16(p, FromHost16(v)); } static uint32 Load32(const void *p) { return ToHost32(UNALIGNED_LOAD32(p)); } static void Store32(void *p, uint32 v) { UNALIGNED_STORE32(p, FromHost32(v)); } }; // Some bit-manipulation functions. class Bits { public: // Return floor(log2(n)) for positive integer n. Returns -1 iff n == 0. static int Log2Floor(uint32 n); // Return the first set least / most significant bit, 0-indexed. Returns an // undefined value if n == 0. FindLSBSetNonZero() is similar to ffs() except // that it's 0-indexed. static int FindLSBSetNonZero(uint32 n); static int FindLSBSetNonZero64(uint64 n); private: DISALLOW_COPY_AND_ASSIGN(Bits); }; #ifdef HAVE_BUILTIN_CTZ inline int Bits::Log2Floor(uint32 n) { return n == 0 ? -1 : 31 ^ __builtin_clz(n); } inline int Bits::FindLSBSetNonZero(uint32 n) { return __builtin_ctz(n); } inline int Bits::FindLSBSetNonZero64(uint64 n) { return __builtin_ctzll(n); } #else // Portable versions. inline int Bits::Log2Floor(uint32 n) { if (n == 0) return -1; int log = 0; uint32 value = n; for (int i = 4; i >= 0; --i) { int shift = (1 << i); uint32 x = value >> shift; if (x != 0) { value = x; log += shift; } } assert(value == 1); return log; } inline int Bits::FindLSBSetNonZero(uint32 n) { int rc = 31; for (int i = 4, shift = 1 << 4; i >= 0; --i) { const uint32 x = n << shift; if (x != 0) { n = x; rc -= shift; } shift >>= 1; } return rc; } // FindLSBSetNonZero64() is defined in terms of FindLSBSetNonZero(). inline int Bits::FindLSBSetNonZero64(uint64 n) { const uint32 bottombits = static_cast<uint32>(n); if (bottombits == 0) { // Bottom bits are zero, so scan in top bits return 32 + FindLSBSetNonZero(static_cast<uint32>(n >> 32)); } else { return FindLSBSetNonZero(bottombits); } } #endif // End portable versions. // Variable-length integer encoding. class Varint { public: // Maximum lengths of varint encoding of uint32.
static const int kMax32 = 5; // Attempts to parse a varint32 from a prefix of the bytes in [ptr,limit-1]. // Never reads a character at or beyond limit. If a valid/terminated varint32 // was found in the range, stores it in *OUTPUT and returns a pointer just // past the last byte of the varint32. Else returns NULL. On success, // "result <= limit". static const char* Parse32WithLimit(const char* ptr, const char* limit, uint32* OUTPUT); // REQUIRES "ptr" points to a buffer of length sufficient to hold "v". // EFFECTS Encodes "v" into "ptr" and returns a pointer to the // byte just past the last encoded byte. static char* Encode32(char* ptr, uint32 v); // EFFECTS Appends the varint representation of "value" to "*s". static void Append32(string* s, uint32 value); }; inline const char* Varint::Parse32WithLimit(const char* p, const char* l, uint32* OUTPUT) { const unsigned char* ptr = reinterpret_cast<const unsigned char*>(p); const unsigned char* limit = reinterpret_cast<const unsigned char*>(l); uint32 b, result; if (ptr >= limit) return NULL; b = *(ptr++); result = b & 127; if (b < 128) goto done; if (ptr >= limit) return NULL; b = *(ptr++); result |= (b & 127) << 7; if (b < 128) goto done; if (ptr >= limit) return NULL; b = *(ptr++); result |= (b & 127) << 14; if (b < 128) goto done; if (ptr >= limit) return NULL; b = *(ptr++); result |= (b & 127) << 21; if (b < 128) goto done; if (ptr >= limit) return NULL; b = *(ptr++); result |= (b & 127) << 28; if (b < 16) goto done; return NULL; // Value is too long to be a varint32 done: *OUTPUT = result; return reinterpret_cast<const char*>(ptr); } inline char* Varint::Encode32(char* sptr, uint32 v) { // Operate on characters as unsigneds unsigned char* ptr = reinterpret_cast<unsigned char*>(sptr); static const int B = 128; if (v < (1<<7)) { *(ptr++) = v; } else if (v < (1<<14)) { *(ptr++) = v | B; *(ptr++) = v>>7; } else if (v < (1<<21)) { *(ptr++) = v | B; *(ptr++) = (v>>7) | B; *(ptr++) = v>>14; } else if (v < (1<<28)) { *(ptr++) = v | B; *(ptr++) = (v>>7) | B; *(ptr++) = (v>>14) | B; *(ptr++)
= v>>21; } else { *(ptr++) = v | B; *(ptr++) = (v>>7) | B; *(ptr++) = (v>>14) | B; *(ptr++) = (v>>21) | B; *(ptr++) = v>>28; } return reinterpret_cast<char*>(ptr); } // If you know the internal layout of the std::string in use, you can // replace this function with one that resizes the string without // filling the new space with zeros (if applicable) -- // it will be non-portable but faster. inline void STLStringResizeUninitialized(string* s, size_t new_size) { s->resize(new_size); } // Return a mutable char* pointing to a string's internal buffer, // which may not be null-terminated. Writing through this pointer will // modify the string. // // string_as_array(&str)[i] is valid for 0 <= i < str.size() until the // next call to a string method that invalidates iterators. // // As of 2006-04, there is no standard-blessed way of getting a // mutable reference to a string's internal buffer. However, issue 530 // (http://www.open-std.org/JTC1/SC22/WG21/docs/lwg-defects.html#530) // proposes this as the method. It will officially be part of the standard // for C++0x. This should already work on all current implementations. inline char* string_as_array(string* str) { return str->empty() ? NULL : &*str->begin(); } } // namespace snappy #endif // THIRD_PARTY_SNAPPY_OPENSOURCE_SNAPPY_STUBS_INTERNAL_H_ canu-1.6/src/stores/libsnappy/snappy-stubs-public.h000066400000000000000000000100401314437614700225230ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994.
* * Modifications by: * * Brian P. Walenz beginning on 2016-AUG-30 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ // Copyright 2011 Google Inc. All Rights Reserved. // Author: sesse@google.com (Steinar H. Gunderson) // // Redistribution and use in source and binary forms, with or without // modification, are permitted provided that the following conditions are // met: // // * Redistributions of source code must retain the above copyright // notice, this list of conditions and the following disclaimer. // * Redistributions in binary form must reproduce the above // copyright notice, this list of conditions and the following disclaimer // in the documentation and/or other materials provided with the // distribution. // * Neither the name of Google Inc. nor the names of its // contributors may be used to endorse or promote products derived from // this software without specific prior written permission. // // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR // A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT // OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT // LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY // THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. // // Various type stubs for the open-source version of Snappy. 
// // This file cannot include config.h, as it is included from snappy.h, // which is a public header. Instead, snappy-stubs-public.h is generated by // from snappy-stubs-public.h.in at configure time. #ifndef THIRD_PARTY_SNAPPY_OPENSOURCE_SNAPPY_STUBS_PUBLIC_H_ #define THIRD_PARTY_SNAPPY_OPENSOURCE_SNAPPY_STUBS_PUBLIC_H_ #if 1 #include <stdint.h> #endif #if 1 #include <stddef.h> #endif #if 0 #include <sys/uio.h> #endif #define SNAPPY_MAJOR 1 #define SNAPPY_MINOR 1 #define SNAPPY_PATCHLEVEL 3 #define SNAPPY_VERSION \ ((SNAPPY_MAJOR << 16) | (SNAPPY_MINOR << 8) | SNAPPY_PATCHLEVEL) #include <string> namespace snappy { #if 1 typedef int8_t int8; typedef uint8_t uint8; typedef int16_t int16; typedef uint16_t uint16; typedef int32_t int32; typedef uint32_t uint32; typedef int64_t int64; typedef uint64_t uint64; #else typedef signed char int8; typedef unsigned char uint8; typedef short int16; typedef unsigned short uint16; typedef int int32; typedef unsigned int uint32; typedef long long int64; typedef unsigned long long uint64; #endif typedef std::string string; #ifndef DISALLOW_COPY_AND_ASSIGN #define DISALLOW_COPY_AND_ASSIGN(TypeName) \ TypeName(const TypeName&); \ void operator=(const TypeName&) #endif #if !0 // Windows does not have an iovec type, yet the concept is universally useful. // It is simple to define it ourselves, so we put it inside our own namespace. struct iovec { void* iov_base; size_t iov_len; }; #endif } // namespace snappy #endif // THIRD_PARTY_SNAPPY_OPENSOURCE_SNAPPY_STUBS_PUBLIC_H_ canu-1.6/src/stores/libsnappy/snappy.cc000066400000000000000000001344021314437614700202540ustar00rootroot00000000000000// Copyright 2005 Google Inc. All Rights Reserved. // // Redistribution and use in source and binary forms, with or without // modification, are permitted provided that the following conditions are // met: // // * Redistributions of source code must retain the above copyright // notice, this list of conditions and the following disclaimer.
// * Redistributions in binary form must reproduce the above // copyright notice, this list of conditions and the following disclaimer // in the documentation and/or other materials provided with the // distribution. // * Neither the name of Google Inc. nor the names of its // contributors may be used to endorse or promote products derived from // this software without specific prior written permission. // // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR // A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT // OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT // LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY // THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. #include "snappy.h" #include "snappy-internal.h" #include "snappy-sinksource.h" #include <stdio.h> #include <algorithm> #include <string> #include <vector> namespace snappy { using internal::COPY_1_BYTE_OFFSET; using internal::COPY_2_BYTE_OFFSET; using internal::COPY_4_BYTE_OFFSET; using internal::LITERAL; using internal::char_table; using internal::kMaximumTagLength; using internal::wordmask; // Any hash function will produce a valid compressed bitstream, but a good // hash function reduces the number of collisions and thus yields better // compression for compressible input, and more speed for incompressible // input. Of course, it doesn't hurt if the hash function is reasonably fast // either, as it gets called a lot.
static inline uint32 HashBytes(uint32 bytes, int shift) { uint32 kMul = 0x1e35a7bd; return (bytes * kMul) >> shift; } static inline uint32 Hash(const char* p, int shift) { return HashBytes(UNALIGNED_LOAD32(p), shift); } size_t MaxCompressedLength(size_t source_len) { // Compressed data can be defined as: // compressed := item* literal* // item := literal* copy // // The trailing literal sequence has a space blowup of at most 62/60 // since a literal of length 60 needs one tag byte + one extra byte // for length information. // // Item blowup is trickier to measure. Suppose the "copy" op copies // 4 bytes of data. Because of a special check in the encoding code, // we produce a 4-byte copy only if the offset is < 65536. Therefore // the copy op takes 3 bytes to encode, and this type of item leads // to at most the 62/60 blowup for representing literals. // // Suppose the "copy" op copies 5 bytes of data. If the offset is big // enough, it will take 5 bytes to encode the copy op. Therefore the // worst case here is a one-byte literal followed by a five-byte copy. // I.e., 6 bytes of input turn into 7 bytes of "compressed" data. // // This last factor dominates the blowup, so the final estimate is: return 32 + source_len + source_len/6; } // Copy "len" bytes from "src" to "op", one byte at a time. Used for // handling COPY operations where the input and output regions may // overlap. For example, suppose: // src == "ab" // op == src + 2 // len == 20 // After IncrementalCopy(src, op, len), the result will have // eleven copies of "ab" // ababababababababababab // Note that this does not match the semantics of either memcpy() // or memmove(). static inline void IncrementalCopy(const char* src, char* op, ssize_t len) { assert(len > 0); do { *op++ = *src++; } while (--len > 0); } // Equivalent to IncrementalCopy except that it can write up to ten extra // bytes after the end of the copy, and that it is faster. 
//
// The main part of this loop is a simple copy of eight bytes at a time until
// we've copied (at least) the requested amount of bytes. However, if op and
// src are less than eight bytes apart (indicating a repeating pattern of
// length < 8), we first need to expand the pattern in order to get the correct
// results. For instance, if the buffer looks like this, with the eight-byte
// <s1> and <s2> patterns marked as intervals:
//
//    abxxxxxxxxxxxx
//    [------]           src
//      [------]         op
//
// a single eight-byte copy from <s1> to <s2> will repeat the pattern once,
// after which we can move <s2> two bytes without moving <s1>:
//
//    ababxxxxxxxxxx
//    [------]           src
//      [------]         op
//
// and repeat the exercise until the two no longer overlap.
//
// This allows us to do very well in the special case of one single byte
// repeated many times, without taking a big hit for more general cases.
//
// The worst case of extra writing past the end of the match occurs when
// op - src == 1 and len == 1; the last copy will read from byte positions
// [0..7] and write to [4..11], whereas it was only supposed to write to
// position 1. Thus, ten excess bytes.

namespace {

const int kMaxIncrementCopyOverflow = 10;

inline void IncrementalCopyFastPath(const char* src, char* op, ssize_t len) {
  while (PREDICT_FALSE(op - src < 8)) {
    UnalignedCopy64(src, op);
    len -= op - src;
    op += op - src;
  }
  while (len > 0) {
    UnalignedCopy64(src, op);
    src += 8;
    op += 8;
    len -= 8;
  }
}

}  // namespace

static inline char* EmitLiteral(char* op,
                                const char* literal,
                                int len,
                                bool allow_fast_path) {
  int n = len - 1;  // Zero-length literals are disallowed
  if (n < 60) {
    // Fits in tag byte
    *op++ = LITERAL | (n << 2);

    // The vast majority of copies are below 16 bytes, for which a
    // call to memcpy is overkill.
This fast path can sometimes // copy up to 15 bytes too much, but that is okay in the // main loop, since we have a bit to go on for both sides: // // - The input will always have kInputMarginBytes = 15 extra // available bytes, as long as we're in the main loop, and // if not, allow_fast_path = false. // - The output will always have 32 spare bytes (see // MaxCompressedLength). if (allow_fast_path && len <= 16) { UnalignedCopy64(literal, op); UnalignedCopy64(literal + 8, op + 8); return op + len; } } else { // Encode in upcoming bytes char* base = op; int count = 0; op++; while (n > 0) { *op++ = n & 0xff; n >>= 8; count++; } assert(count >= 1); assert(count <= 4); *base = LITERAL | ((59+count) << 2); } memcpy(op, literal, len); return op + len; } static inline char* EmitCopyLessThan64(char* op, size_t offset, int len) { assert(len <= 64); assert(len >= 4); assert(offset < 65536); if ((len < 12) && (offset < 2048)) { size_t len_minus_4 = len - 4; assert(len_minus_4 < 8); // Must fit in 3 bits *op++ = COPY_1_BYTE_OFFSET + ((len_minus_4) << 2) + ((offset >> 8) << 5); *op++ = offset & 0xff; } else { *op++ = COPY_2_BYTE_OFFSET + ((len-1) << 2); LittleEndian::Store16(op, offset); op += 2; } return op; } static inline char* EmitCopy(char* op, size_t offset, int len) { // Emit 64 byte copies but make sure to keep at least four bytes reserved while (PREDICT_FALSE(len >= 68)) { op = EmitCopyLessThan64(op, offset, 64); len -= 64; } // Emit an extra 60 byte copy if have too much data to fit in one copy if (len > 64) { op = EmitCopyLessThan64(op, offset, 60); len -= 60; } // Emit remainder op = EmitCopyLessThan64(op, offset, len); return op; } bool GetUncompressedLength(const char* start, size_t n, size_t* result) { uint32 v = 0; const char* limit = start + n; if (Varint::Parse32WithLimit(start, limit, &v) != NULL) { *result = v; return true; } else { return false; } } namespace internal { uint16* WorkingMemory::GetHashTable(size_t input_size, int* table_size) { // Use smaller 
hash table when input.size() is smaller, since we // fill the table, incurring O(hash table size) overhead for // compression, and if the input is short, we won't need that // many hash table entries anyway. assert(kMaxHashTableSize >= 256); size_t htsize = 256; while (htsize < kMaxHashTableSize && htsize < input_size) { htsize <<= 1; } uint16* table; if (htsize <= ARRAYSIZE(small_table_)) { table = small_table_; } else { if (large_table_ == NULL) { large_table_ = new uint16[kMaxHashTableSize]; } table = large_table_; } *table_size = htsize; memset(table, 0, htsize * sizeof(*table)); return table; } } // end namespace internal // For 0 <= offset <= 4, GetUint32AtOffset(GetEightBytesAt(p), offset) will // equal UNALIGNED_LOAD32(p + offset). Motivation: On x86-64 hardware we have // empirically found that overlapping loads such as // UNALIGNED_LOAD32(p) ... UNALIGNED_LOAD32(p+1) ... UNALIGNED_LOAD32(p+2) // are slower than UNALIGNED_LOAD64(p) followed by shifts and casts to uint32. // // We have different versions for 64- and 32-bit; ideally we would avoid the // two functions and just inline the UNALIGNED_LOAD64 call into // GetUint32AtOffset, but GCC (at least not as of 4.6) is seemingly not clever // enough to avoid loading the value multiple times then. For 64-bit, the load // is done when GetEightBytesAt() is called, whereas for 32-bit, the load is // done at GetUint32AtOffset() time. #ifdef ARCH_K8 typedef uint64 EightBytesReference; static inline EightBytesReference GetEightBytesAt(const char* ptr) { return UNALIGNED_LOAD64(ptr); } static inline uint32 GetUint32AtOffset(uint64 v, int offset) { assert(offset >= 0); assert(offset <= 4); return v >> (LittleEndian::IsLittleEndian() ? 
8 * offset : 32 - 8 * offset);
}

#else

typedef const char* EightBytesReference;

static inline EightBytesReference GetEightBytesAt(const char* ptr) {
  return ptr;
}

static inline uint32 GetUint32AtOffset(const char* v, int offset) {
  assert(offset >= 0);
  assert(offset <= 4);
  return UNALIGNED_LOAD32(v + offset);
}

#endif

// Flat array compression that does not emit the "uncompressed length"
// prefix. Compresses "input" string to the "*op" buffer.
//
// REQUIRES: "input" is at most "kBlockSize" bytes long.
// REQUIRES: "op" points to an array of memory that is at least
// "MaxCompressedLength(input.size())" in size.
// REQUIRES: All elements in "table[0..table_size-1]" are initialized to zero.
// REQUIRES: "table_size" is a power of two
//
// Returns an "end" pointer into "op" buffer.
// "end - op" is the compressed size of "input".
namespace internal {
char* CompressFragment(const char* input,
                       size_t input_size,
                       char* op,
                       uint16* table,
                       const int table_size) {
  // "ip" is the input pointer, and "op" is the output pointer.
  const char* ip = input;
  assert(input_size <= kBlockSize);
  assert((table_size & (table_size - 1)) == 0);  // table must be power of two
  const int shift = 32 - Bits::Log2Floor(table_size);
  assert(static_cast<int>(kuint32max >> shift) == table_size - 1);
  const char* ip_end = input + input_size;
  const char* base_ip = ip;
  // Bytes in [next_emit, ip) will be emitted as literal bytes. Or
  // [next_emit, ip_end) after the main loop.
  const char* next_emit = ip;

  const size_t kInputMarginBytes = 15;
  if (PREDICT_TRUE(input_size >= kInputMarginBytes)) {
    const char* ip_limit = input + input_size - kInputMarginBytes;

    for (uint32 next_hash = Hash(++ip, shift); ; ) {
      assert(next_emit < ip);
      // The body of this loop calls EmitLiteral once and then EmitCopy one or
      // more times. (The exception is that when we're close to exhausting
      // the input we goto emit_remainder.)
// // In the first iteration of this loop we're just starting, so // there's nothing to copy, so calling EmitLiteral once is // necessary. And we only start a new iteration when the // current iteration has determined that a call to EmitLiteral will // precede the next call to EmitCopy (if any). // // Step 1: Scan forward in the input looking for a 4-byte-long match. // If we get close to exhausting the input then goto emit_remainder. // // Heuristic match skipping: If 32 bytes are scanned with no matches // found, start looking only at every other byte. If 32 more bytes are // scanned (or skipped), look at every third byte, etc.. When a match is // found, immediately go back to looking at every byte. This is a small // loss (~5% performance, ~0.1% density) for compressible data due to more // bookkeeping, but for non-compressible data (such as JPEG) it's a huge // win since the compressor quickly "realizes" the data is incompressible // and doesn't bother looking for matches everywhere. // // The "skip" variable keeps track of how many bytes there are since the // last match; dividing it by 32 (ie. right-shifting by five) gives the // number of bytes to move ahead for each iteration. uint32 skip = 32; const char* next_ip = ip; const char* candidate; do { ip = next_ip; uint32 hash = next_hash; assert(hash == Hash(ip, shift)); uint32 bytes_between_hash_lookups = skip >> 5; skip += bytes_between_hash_lookups; next_ip = ip + bytes_between_hash_lookups; if (PREDICT_FALSE(next_ip > ip_limit)) { goto emit_remainder; } next_hash = Hash(next_ip, shift); candidate = base_ip + table[hash]; assert(candidate >= base_ip); assert(candidate < ip); table[hash] = ip - base_ip; } while (PREDICT_TRUE(UNALIGNED_LOAD32(ip) != UNALIGNED_LOAD32(candidate))); // Step 2: A 4-byte match has been found. We'll later see if more // than 4 bytes match. But, prior to the match, input // bytes [next_emit, ip) are unmatched. Emit them as "literal bytes." 
assert(next_emit + 16 <= ip_end); op = EmitLiteral(op, next_emit, ip - next_emit, true); // Step 3: Call EmitCopy, and then see if another EmitCopy could // be our next move. Repeat until we find no match for the // input immediately after what was consumed by the last EmitCopy call. // // If we exit this loop normally then we need to call EmitLiteral next, // though we don't yet know how big the literal will be. We handle that // by proceeding to the next iteration of the main loop. We also can exit // this loop via goto if we get close to exhausting the input. EightBytesReference input_bytes; uint32 candidate_bytes = 0; do { // We have a 4-byte match at ip, and no need to emit any // "literal bytes" prior to ip. const char* base = ip; int matched = 4 + FindMatchLength(candidate + 4, ip + 4, ip_end); ip += matched; size_t offset = base - candidate; assert(0 == memcmp(base, candidate, matched)); op = EmitCopy(op, offset, matched); // We could immediately start working at ip now, but to improve // compression we first update table[Hash(ip - 1, ...)]. const char* insert_tail = ip - 1; next_emit = ip; if (PREDICT_FALSE(ip >= ip_limit)) { goto emit_remainder; } input_bytes = GetEightBytesAt(insert_tail); uint32 prev_hash = HashBytes(GetUint32AtOffset(input_bytes, 0), shift); table[prev_hash] = ip - base_ip - 1; uint32 cur_hash = HashBytes(GetUint32AtOffset(input_bytes, 1), shift); candidate = base_ip + table[cur_hash]; candidate_bytes = UNALIGNED_LOAD32(candidate); table[cur_hash] = ip - base_ip; } while (GetUint32AtOffset(input_bytes, 1) == candidate_bytes); next_hash = HashBytes(GetUint32AtOffset(input_bytes, 2), shift); ++ip; } } emit_remainder: // Emit the remaining bytes as a literal if (next_emit < ip_end) { op = EmitLiteral(op, next_emit, ip_end - next_emit, false); } return op; } } // end namespace internal // Signature of output types needed by decompression code. 
// The decompression code is templatized on a type that obeys this
// signature so that we do not pay virtual function call overhead in
// the middle of a tight decompression loop.
//
// class DecompressionWriter {
//  public:
//   // Called before decompression
//   void SetExpectedLength(size_t length);
//
//   // Called after decompression
//   bool CheckLength() const;
//
//   // Called repeatedly during decompression
//   bool Append(const char* ip, size_t length);
//   bool AppendFromSelf(uint32 offset, size_t length);
//
//   // The rules for how TryFastAppend differs from Append are somewhat
//   // convoluted:
//   //
//   //  - TryFastAppend is allowed to decline (return false) at any
//   //    time, for any reason -- just "return false" would be
//   //    a perfectly legal implementation of TryFastAppend.
//   //    The intention is for TryFastAppend to allow a fast path
//   //    in the common case of a small append.
//   //  - TryFastAppend is allowed to read up to <available> bytes
//   //    from the input buffer, whereas Append is allowed to read
//   //    <length>. However, if it returns true, it must leave
//   //    at least five (kMaximumTagLength) bytes in the input buffer
//   //    afterwards, so that there is always enough space to read the
//   //    next tag without checking for a refill.
//   //  - TryFastAppend must always return decline (return false)
//   //    if <length> is 61 or more, as in this case the literal length is not
//   //    decoded fully. In practice, this should not be a big problem,
//   //    as it is unlikely that one would implement a fast path accepting
//   //    this much data.
//   //
//   bool TryFastAppend(const char* ip, size_t available, size_t length);
// };

// Helper class for decompression
class SnappyDecompressor {
 private:
  Source*      reader_;      // Underlying source of bytes to decompress
  const char*  ip_;          // Points to next buffered byte
  const char*  ip_limit_;    // Points just past buffered bytes
  uint32       peeked_;      // Bytes peeked from reader (need to skip)
  bool         eof_;         // Hit end of input without an error?
  char         scratch_[kMaximumTagLength];  // See RefillTag().

  // Ensure that all of the tag metadata for the next tag is available
  // in [ip_..ip_limit_-1]. Also ensures that [ip,ip+4] is readable even
  // if (ip_limit_ - ip_ < 5).
  //
  // Returns true on success, false on error or end of input.
  bool RefillTag();

 public:
  explicit SnappyDecompressor(Source* reader)
      : reader_(reader),
        ip_(NULL),
        ip_limit_(NULL),
        peeked_(0),
        eof_(false) {
  }

  ~SnappyDecompressor() {
    // Advance past any bytes we peeked at from the reader
    reader_->Skip(peeked_);
  }

  // Returns true iff we have hit the end of the input without an error.
  bool eof() const {
    return eof_;
  }

  // Read the uncompressed length stored at the start of the compressed data.
  // On success, stores the length in *result and returns true.
  // On failure, returns false.
  bool ReadUncompressedLength(uint32* result) {
    assert(ip_ == NULL);  // Must not have read anything yet
    // Length is encoded in 1..5 bytes
    *result = 0;
    uint32 shift = 0;
    while (true) {
      if (shift >= 32) return false;
      size_t n;
      const char* ip = reader_->Peek(&n);
      if (n == 0) return false;
      const unsigned char c = *(reinterpret_cast<const unsigned char*>(ip));
      reader_->Skip(1);
      uint32 val = c & 0x7f;
      if (((val << shift) >> shift) != val) return false;
      *result |= val << shift;
      if (c < 128) {
        break;
      }
      shift += 7;
    }
    return true;
  }

  // Process the next item found in the input.
  // Returns true if successful, false on error or end of input.
  template <class Writer>
  void DecompressAllTags(Writer* writer) {
    const char* ip = ip_;

    // We could have put this refill fragment only at the beginning of the loop.
    // However, duplicating it at the end of each branch gives the compiler more
    // scope to optimize the expression based on the local
    // context, which overall increases speed.
#define MAYBE_REFILL() \
    if (ip_limit_ - ip < kMaximumTagLength) { \
      ip_ = ip; \
      if (!RefillTag()) return; \
      ip = ip_; \
    }

    MAYBE_REFILL();
    for ( ;; ) {
      const unsigned char c = *(reinterpret_cast<const unsigned char*>(ip++));

      if ((c & 0x3) == LITERAL) {
        size_t literal_length = (c >> 2) + 1u;
        if (writer->TryFastAppend(ip, ip_limit_ - ip, literal_length)) {
          assert(literal_length < 61);
          ip += literal_length;
          // NOTE(user): There is no MAYBE_REFILL() here, as TryFastAppend()
          // will not return true unless there's already at least five spare
          // bytes in addition to the literal.
          continue;
        }
        if (PREDICT_FALSE(literal_length >= 61)) {
          // Long literal.
          const size_t literal_length_length = literal_length - 60;
          literal_length =
              (LittleEndian::Load32(ip) & wordmask[literal_length_length]) + 1;
          ip += literal_length_length;
        }

        size_t avail = ip_limit_ - ip;
        while (avail < literal_length) {
          if (!writer->Append(ip, avail)) return;
          literal_length -= avail;
          reader_->Skip(peeked_);
          size_t n;
          ip = reader_->Peek(&n);
          avail = n;
          peeked_ = avail;
          if (avail == 0) return;  // Premature end of input
          ip_limit_ = ip + avail;
        }
        if (!writer->Append(ip, literal_length)) {
          return;
        }
        ip += literal_length;
        MAYBE_REFILL();
      } else {
        const uint32 entry = char_table[c];
        const uint32 trailer = LittleEndian::Load32(ip) & wordmask[entry >> 11];
        const uint32 length = entry & 0xff;
        ip += entry >> 11;

        // copy_offset/256 is encoded in bits 8..10. By just fetching
        // those bits, we get copy_offset (since the bit-field starts at
        // bit 8).
        const uint32 copy_offset = entry & 0x700;
        if (!writer->AppendFromSelf(copy_offset + trailer, length)) {
          return;
        }
        MAYBE_REFILL();
      }
    }

#undef MAYBE_REFILL
  }
};

bool SnappyDecompressor::RefillTag() {
  const char* ip = ip_;
  if (ip == ip_limit_) {
    // Fetch a new fragment from the reader
    reader_->Skip(peeked_);  // All peeked bytes are used up
    size_t n;
    ip = reader_->Peek(&n);
    peeked_ = n;
    if (n == 0) {
      eof_ = true;
      return false;
    }
    ip_limit_ = ip + n;
  }

  // Read the tag character
  assert(ip < ip_limit_);
  const unsigned char c = *(reinterpret_cast<const unsigned char*>(ip));
  const uint32 entry = char_table[c];
  const uint32 needed = (entry >> 11) + 1;  // +1 byte for 'c'
  assert(needed <= sizeof(scratch_));

  // Read more bytes from reader if needed
  uint32 nbuf = ip_limit_ - ip;
  if (nbuf < needed) {
    // Stitch together bytes from ip and reader to form the word
    // contents. We store the needed bytes in "scratch_". They
    // will be consumed immediately by the caller since we do not
    // read more than we need.
    memmove(scratch_, ip, nbuf);
    reader_->Skip(peeked_);  // All peeked bytes are used up
    peeked_ = 0;
    while (nbuf < needed) {
      size_t length;
      const char* src = reader_->Peek(&length);
      if (length == 0) return false;
      uint32 to_add = min<size_t>(needed - nbuf, length);
      memcpy(scratch_ + nbuf, src, to_add);
      nbuf += to_add;
      reader_->Skip(to_add);
    }
    assert(nbuf == needed);
    ip_ = scratch_;
    ip_limit_ = scratch_ + needed;
  } else if (nbuf < kMaximumTagLength) {
    // Have enough bytes, but move into scratch_ so that we do not
    // read past end of input
    memmove(scratch_, ip, nbuf);
    reader_->Skip(peeked_);  // All peeked bytes are used up
    peeked_ = 0;
    ip_ = scratch_;
    ip_limit_ = scratch_ + nbuf;
  } else {
    // Pass pointer to buffer returned by reader_.
    ip_ = ip;
  }
  return true;
}

template <typename Writer>
static bool InternalUncompress(Source* r, Writer* writer) {
  // Read the uncompressed length from the front of the compressed input
  SnappyDecompressor decompressor(r);
  uint32 uncompressed_len = 0;
  if (!decompressor.ReadUncompressedLength(&uncompressed_len)) return false;
  return InternalUncompressAllTags(&decompressor, writer, uncompressed_len);
}

template <typename Writer>
static bool InternalUncompressAllTags(SnappyDecompressor* decompressor,
                                      Writer* writer,
                                      uint32 uncompressed_len) {
  writer->SetExpectedLength(uncompressed_len);

  // Process the entire input
  decompressor->DecompressAllTags(writer);
  writer->Flush();
  return (decompressor->eof() && writer->CheckLength());
}

bool GetUncompressedLength(Source* source, uint32* result) {
  SnappyDecompressor decompressor(source);
  return decompressor.ReadUncompressedLength(result);
}

size_t Compress(Source* reader, Sink* writer) {
  size_t written = 0;
  size_t N = reader->Available();
  char ulength[Varint::kMax32];
  char* p = Varint::Encode32(ulength, N);
  writer->Append(ulength, p-ulength);
  written += (p - ulength);

  internal::WorkingMemory wmem;
  char* scratch = NULL;
  char* scratch_output = NULL;

  while (N > 0) {
    // Get next block to compress (without copying if possible)
    size_t fragment_size;
    const char* fragment = reader->Peek(&fragment_size);
    assert(fragment_size != 0);  // premature end of input
    const size_t num_to_read = min(N, kBlockSize);
    size_t bytes_read = fragment_size;

    size_t pending_advance = 0;
    if (bytes_read >= num_to_read) {
      // Buffer returned by reader is large enough
      pending_advance = num_to_read;
      fragment_size = num_to_read;
    } else {
      // Read into scratch buffer
      if (scratch == NULL) {
        // If this is the last iteration, we want to allocate N bytes
        // of space, otherwise the max possible kBlockSize space.
// num_to_read contains exactly the correct value scratch = new char[num_to_read]; } memcpy(scratch, fragment, bytes_read); reader->Skip(bytes_read); while (bytes_read < num_to_read) { fragment = reader->Peek(&fragment_size); size_t n = min(fragment_size, num_to_read - bytes_read); memcpy(scratch + bytes_read, fragment, n); bytes_read += n; reader->Skip(n); } assert(bytes_read == num_to_read); fragment = scratch; fragment_size = num_to_read; } assert(fragment_size == num_to_read); // Get encoding table for compression int table_size; uint16* table = wmem.GetHashTable(num_to_read, &table_size); // Compress input_fragment and append to dest const int max_output = MaxCompressedLength(num_to_read); // Need a scratch buffer for the output, in case the byte sink doesn't // have room for us directly. if (scratch_output == NULL) { scratch_output = new char[max_output]; } else { // Since we encode kBlockSize regions followed by a region // which is <= kBlockSize in length, a previously allocated // scratch_output[] region is big enough for this iteration. } char* dest = writer->GetAppendBuffer(max_output, scratch_output); char* end = internal::CompressFragment(fragment, fragment_size, dest, table, table_size); writer->Append(dest, end - dest); written += (end - dest); N -= num_to_read; reader->Skip(pending_advance); } delete[] scratch; delete[] scratch_output; return written; } // ----------------------------------------------------------------------- // IOVec interfaces // ----------------------------------------------------------------------- // A type that writes to an iovec. // Note that this is not a "ByteSink", but a type that matches the // Writer template argument to SnappyDecompressor::DecompressAllTags(). class SnappyIOVecWriter { private: const struct iovec* output_iov_; const size_t output_iov_count_; // We are currently writing into output_iov_[curr_iov_index_]. size_t curr_iov_index_; // Bytes written to output_iov_[curr_iov_index_] so far. 
  size_t curr_iov_written_;

  // Total bytes decompressed into output_iov_ so far.
  size_t total_written_;

  // Maximum number of bytes that will be decompressed into output_iov_.
  size_t output_limit_;

  inline char* GetIOVecPointer(size_t index, size_t offset) {
    return reinterpret_cast<char*>(output_iov_[index].iov_base) + offset;
  }

 public:
  // Does not take ownership of iov. iov must be valid during the
  // entire lifetime of the SnappyIOVecWriter.
  inline SnappyIOVecWriter(const struct iovec* iov, size_t iov_count)
      : output_iov_(iov),
        output_iov_count_(iov_count),
        curr_iov_index_(0),
        curr_iov_written_(0),
        total_written_(0),
        output_limit_(-1) {
  }

  inline void SetExpectedLength(size_t len) {
    output_limit_ = len;
  }

  inline bool CheckLength() const {
    return total_written_ == output_limit_;
  }

  inline bool Append(const char* ip, size_t len) {
    if (total_written_ + len > output_limit_) {
      return false;
    }

    while (len > 0) {
      assert(curr_iov_written_ <= output_iov_[curr_iov_index_].iov_len);
      if (curr_iov_written_ >= output_iov_[curr_iov_index_].iov_len) {
        // This iovec is full. Go to the next one.
        if (curr_iov_index_ + 1 >= output_iov_count_) {
          return false;
        }
        curr_iov_written_ = 0;
        ++curr_iov_index_;
      }

      const size_t to_write = std::min(
          len, output_iov_[curr_iov_index_].iov_len - curr_iov_written_);
      memcpy(GetIOVecPointer(curr_iov_index_, curr_iov_written_),
             ip,
             to_write);
      curr_iov_written_ += to_write;
      total_written_ += to_write;
      ip += to_write;
      len -= to_write;
    }

    return true;
  }

  inline bool TryFastAppend(const char* ip, size_t available, size_t len) {
    const size_t space_left = output_limit_ - total_written_;
    if (len <= 16 && available >= 16 + kMaximumTagLength && space_left >= 16 &&
        output_iov_[curr_iov_index_].iov_len - curr_iov_written_ >= 16) {
      // Fast path, used for the majority (about 95%) of invocations.
char* ptr = GetIOVecPointer(curr_iov_index_, curr_iov_written_); UnalignedCopy64(ip, ptr); UnalignedCopy64(ip + 8, ptr + 8); curr_iov_written_ += len; total_written_ += len; return true; } return false; } inline bool AppendFromSelf(size_t offset, size_t len) { if (offset > total_written_ || offset == 0) { return false; } const size_t space_left = output_limit_ - total_written_; if (len > space_left) { return false; } // Locate the iovec from which we need to start the copy. size_t from_iov_index = curr_iov_index_; size_t from_iov_offset = curr_iov_written_; while (offset > 0) { if (from_iov_offset >= offset) { from_iov_offset -= offset; break; } offset -= from_iov_offset; assert(from_iov_index > 0); --from_iov_index; from_iov_offset = output_iov_[from_iov_index].iov_len; } // Copy bytes starting from the iovec pointed to by from_iov_index to // the current iovec. while (len > 0) { assert(from_iov_index <= curr_iov_index_); if (from_iov_index != curr_iov_index_) { const size_t to_copy = std::min( output_iov_[from_iov_index].iov_len - from_iov_offset, len); Append(GetIOVecPointer(from_iov_index, from_iov_offset), to_copy); len -= to_copy; if (len > 0) { ++from_iov_index; from_iov_offset = 0; } } else { assert(curr_iov_written_ <= output_iov_[curr_iov_index_].iov_len); size_t to_copy = std::min(output_iov_[curr_iov_index_].iov_len - curr_iov_written_, len); if (to_copy == 0) { // This iovec is full. Go to the next one. 
if (curr_iov_index_ + 1 >= output_iov_count_) { return false; } ++curr_iov_index_; curr_iov_written_ = 0; continue; } if (to_copy > len) { to_copy = len; } IncrementalCopy(GetIOVecPointer(from_iov_index, from_iov_offset), GetIOVecPointer(curr_iov_index_, curr_iov_written_), to_copy); curr_iov_written_ += to_copy; from_iov_offset += to_copy; total_written_ += to_copy; len -= to_copy; } } return true; } inline void Flush() {} }; bool RawUncompressToIOVec(const char* compressed, size_t compressed_length, const struct iovec* iov, size_t iov_cnt) { ByteArraySource reader(compressed, compressed_length); return RawUncompressToIOVec(&reader, iov, iov_cnt); } bool RawUncompressToIOVec(Source* compressed, const struct iovec* iov, size_t iov_cnt) { SnappyIOVecWriter output(iov, iov_cnt); return InternalUncompress(compressed, &output); } // ----------------------------------------------------------------------- // Flat array interfaces // ----------------------------------------------------------------------- // A type that writes to a flat array. // Note that this is not a "ByteSink", but a type that matches the // Writer template argument to SnappyDecompressor::DecompressAllTags(). class SnappyArrayWriter { private: char* base_; char* op_; char* op_limit_; public: inline explicit SnappyArrayWriter(char* dst) : base_(dst), op_(dst), op_limit_(dst) { } inline void SetExpectedLength(size_t len) { op_limit_ = op_ + len; } inline bool CheckLength() const { return op_ == op_limit_; } inline bool Append(const char* ip, size_t len) { char* op = op_; const size_t space_left = op_limit_ - op; if (space_left < len) { return false; } memcpy(op, ip, len); op_ = op + len; return true; } inline bool TryFastAppend(const char* ip, size_t available, size_t len) { char* op = op_; const size_t space_left = op_limit_ - op; if (len <= 16 && available >= 16 + kMaximumTagLength && space_left >= 16) { // Fast path, used for the majority (about 95%) of invocations. 
UnalignedCopy64(ip, op); UnalignedCopy64(ip + 8, op + 8); op_ = op + len; return true; } else { return false; } } inline bool AppendFromSelf(size_t offset, size_t len) { char* op = op_; const size_t space_left = op_limit_ - op; // Check if we try to append from before the start of the buffer. // Normally this would just be a check for "produced < offset", // but "produced <= offset - 1u" is equivalent for every case // except the one where offset==0, where the right side will wrap around // to a very big number. This is convenient, as offset==0 is another // invalid case that we also want to catch, so that we do not go // into an infinite loop. assert(op >= base_); size_t produced = op - base_; if (produced <= offset - 1u) { return false; } if (len <= 16 && offset >= 8 && space_left >= 16) { // Fast path, used for the majority (70-80%) of dynamic invocations. UnalignedCopy64(op - offset, op); UnalignedCopy64(op - offset + 8, op + 8); } else { if (space_left >= len + kMaxIncrementCopyOverflow) { IncrementalCopyFastPath(op - offset, op, len); } else { if (space_left < len) { return false; } IncrementalCopy(op - offset, op, len); } } op_ = op + len; return true; } inline size_t Produced() const { return op_ - base_; } inline void Flush() {} }; bool RawUncompress(const char* compressed, size_t n, char* uncompressed) { ByteArraySource reader(compressed, n); return RawUncompress(&reader, uncompressed); } bool RawUncompress(Source* compressed, char* uncompressed) { SnappyArrayWriter output(uncompressed); return InternalUncompress(compressed, &output); } bool Uncompress(const char* compressed, size_t n, string* uncompressed) { size_t ulength; if (!GetUncompressedLength(compressed, n, &ulength)) { return false; } // On 32-bit builds: max_size() < kuint32max. Check for that instead // of crashing (e.g., consider externally specified compressed data). 
if (ulength > uncompressed->max_size()) { return false; } STLStringResizeUninitialized(uncompressed, ulength); return RawUncompress(compressed, n, string_as_array(uncompressed)); } // A Writer that drops everything on the floor and just does validation class SnappyDecompressionValidator { private: size_t expected_; size_t produced_; public: inline SnappyDecompressionValidator() : expected_(0), produced_(0) { } inline void SetExpectedLength(size_t len) { expected_ = len; } inline bool CheckLength() const { return expected_ == produced_; } inline bool Append(const char* ip, size_t len) { produced_ += len; return produced_ <= expected_; } inline bool TryFastAppend(const char* ip, size_t available, size_t length) { return false; } inline bool AppendFromSelf(size_t offset, size_t len) { // See SnappyArrayWriter::AppendFromSelf for an explanation of // the "offset - 1u" trick. if (produced_ <= offset - 1u) return false; produced_ += len; return produced_ <= expected_; } inline void Flush() {} }; bool IsValidCompressedBuffer(const char* compressed, size_t n) { ByteArraySource reader(compressed, n); SnappyDecompressionValidator writer; return InternalUncompress(&reader, &writer); } bool IsValidCompressed(Source* compressed) { SnappyDecompressionValidator writer; return InternalUncompress(compressed, &writer); } void RawCompress(const char* input, size_t input_length, char* compressed, size_t* compressed_length) { ByteArraySource reader(input, input_length); UncheckedByteArraySink writer(compressed); Compress(&reader, &writer); // Compute how many bytes were added *compressed_length = (writer.CurrentDestination() - compressed); } size_t Compress(const char* input, size_t input_length, string* compressed) { // Pre-grow the buffer to the max length of the compressed output compressed->resize(MaxCompressedLength(input_length)); size_t compressed_length; RawCompress(input, input_length, string_as_array(compressed), &compressed_length); compressed->resize(compressed_length); 
return compressed_length; } // ----------------------------------------------------------------------- // Sink interface // ----------------------------------------------------------------------- // A type that decompresses into a Sink. The template parameter // Allocator must export one method "char* Allocate(int size);", which // allocates a buffer of "size" and appends that to the destination. template class SnappyScatteredWriter { Allocator allocator_; // We need random access into the data generated so far. Therefore // we keep track of all of the generated data as an array of blocks. // All of the blocks except the last have length kBlockSize. vector blocks_; size_t expected_; // Total size of all fully generated blocks so far size_t full_size_; // Pointer into current output block char* op_base_; // Base of output block char* op_ptr_; // Pointer to next unfilled byte in block char* op_limit_; // Pointer just past block inline size_t Size() const { return full_size_ + (op_ptr_ - op_base_); } bool SlowAppend(const char* ip, size_t len); bool SlowAppendFromSelf(size_t offset, size_t len); public: inline explicit SnappyScatteredWriter(const Allocator& allocator) : allocator_(allocator), full_size_(0), op_base_(NULL), op_ptr_(NULL), op_limit_(NULL) { } inline void SetExpectedLength(size_t len) { assert(blocks_.empty()); expected_ = len; } inline bool CheckLength() const { return Size() == expected_; } // Return the number of bytes actually uncompressed so far inline size_t Produced() const { return Size(); } inline bool Append(const char* ip, size_t len) { size_t avail = op_limit_ - op_ptr_; if (len <= avail) { // Fast path memcpy(op_ptr_, ip, len); op_ptr_ += len; return true; } else { return SlowAppend(ip, len); } } inline bool TryFastAppend(const char* ip, size_t available, size_t length) { char* op = op_ptr_; const int space_left = op_limit_ - op; if (length <= 16 && available >= 16 + kMaximumTagLength && space_left >= 16) { // Fast path, used for the 
majority (about 95%) of invocations. UNALIGNED_STORE64(op, UNALIGNED_LOAD64(ip)); UNALIGNED_STORE64(op + 8, UNALIGNED_LOAD64(ip + 8)); op_ptr_ = op + length; return true; } else { return false; } } inline bool AppendFromSelf(size_t offset, size_t len) { // See SnappyArrayWriter::AppendFromSelf for an explanation of // the "offset - 1u" trick. if (offset - 1u < op_ptr_ - op_base_) { const size_t space_left = op_limit_ - op_ptr_; if (space_left >= len + kMaxIncrementCopyOverflow) { // Fast path: src and dst in current block. IncrementalCopyFastPath(op_ptr_ - offset, op_ptr_, len); op_ptr_ += len; return true; } } return SlowAppendFromSelf(offset, len); } // Called at the end of the decompress. We ask the allocator // to write all blocks to the sink. inline void Flush() { allocator_.Flush(Produced()); } }; template bool SnappyScatteredWriter::SlowAppend(const char* ip, size_t len) { size_t avail = op_limit_ - op_ptr_; while (len > avail) { // Completely fill this block memcpy(op_ptr_, ip, avail); op_ptr_ += avail; assert(op_limit_ - op_ptr_ == 0); full_size_ += (op_ptr_ - op_base_); len -= avail; ip += avail; // Bounds check if (full_size_ + len > expected_) { return false; } // Make new block size_t bsize = min(kBlockSize, expected_ - full_size_); op_base_ = allocator_.Allocate(bsize); op_ptr_ = op_base_; op_limit_ = op_base_ + bsize; blocks_.push_back(op_base_); avail = bsize; } memcpy(op_ptr_, ip, len); op_ptr_ += len; return true; } template bool SnappyScatteredWriter::SlowAppendFromSelf(size_t offset, size_t len) { // Overflow check // See SnappyArrayWriter::AppendFromSelf for an explanation of // the "offset - 1u" trick. const size_t cur = Size(); if (offset - 1u >= cur) return false; if (expected_ - cur < len) return false; // Currently we shouldn't ever hit this path because Compress() chops the // input into blocks and does not create cross-block copies.
However, it is // nice if we do not rely on that, since we can get better compression if we // allow cross-block copies and thus might want to change the compressor in // the future. size_t src = cur - offset; while (len-- > 0) { char c = blocks_[src >> kBlockLog][src & (kBlockSize-1)]; Append(&c, 1); src++; } return true; } class SnappySinkAllocator { public: explicit SnappySinkAllocator(Sink* dest): dest_(dest) {} ~SnappySinkAllocator() {} char* Allocate(int size) { Datablock block(new char[size], size); blocks_.push_back(block); return block.data; } // We flush only at the end, because the writer wants // random access to the blocks and once we hand the // block over to the sink, we can't access it anymore. // Also we don't write more than has been actually written // to the blocks. void Flush(size_t size) { size_t size_written = 0; size_t block_size; for (int i = 0; i < blocks_.size(); ++i) { block_size = min(blocks_[i].size, size - size_written); dest_->AppendAndTakeOwnership(blocks_[i].data, block_size, &SnappySinkAllocator::Deleter, NULL); size_written += block_size; } blocks_.clear(); } private: struct Datablock { char* data; size_t size; Datablock(char* p, size_t s) : data(p), size(s) {} }; static void Deleter(void* arg, const char* bytes, size_t size) { delete[] bytes; } Sink* dest_; vector blocks_; // Note: copying this object is allowed }; size_t UncompressAsMuchAsPossible(Source* compressed, Sink* uncompressed) { SnappySinkAllocator allocator(uncompressed); SnappyScatteredWriter writer(allocator); InternalUncompress(compressed, &writer); return writer.Produced(); } bool Uncompress(Source* compressed, Sink* uncompressed) { // Read the uncompressed length from the front of the compressed input SnappyDecompressor decompressor(compressed); uint32 uncompressed_len = 0; if (!decompressor.ReadUncompressedLength(&uncompressed_len)) { return false; } char c; size_t allocated_size; char* buf = uncompressed->GetAppendBufferVariable( 1, uncompressed_len, &c, 1, 
&allocated_size); // If we can get a flat buffer, then use it, otherwise do block by block // uncompression if (allocated_size >= uncompressed_len) { SnappyArrayWriter writer(buf); bool result = InternalUncompressAllTags( &decompressor, &writer, uncompressed_len); uncompressed->Append(buf, writer.Produced()); return result; } else { SnappySinkAllocator allocator(uncompressed); SnappyScatteredWriter writer(allocator); return InternalUncompressAllTags(&decompressor, &writer, uncompressed_len); } } } // end namespace snappy canu-1.6/src/stores/libsnappy/snappy.h000066400000000000000000000245531314437614700201230ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-AUG-30 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ // Copyright 2005 and onwards Google Inc. // // Redistribution and use in source and binary forms, with or without // modification, are permitted provided that the following conditions are // met: // // * Redistributions of source code must retain the above copyright // notice, this list of conditions and the following disclaimer. 
// * Redistributions in binary form must reproduce the above // copyright notice, this list of conditions and the following disclaimer // in the documentation and/or other materials provided with the // distribution. // * Neither the name of Google Inc. nor the names of its // contributors may be used to endorse or promote products derived from // this software without specific prior written permission. // // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS // "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT // LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR // A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT // OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT // LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY // THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT // (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE // OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. // // A light-weight compression algorithm. It is designed for speed of // compression and decompression, rather than for the utmost in space // savings. // // For getting better compression ratios when you are compressing data // with long repeated sequences or compressing data that is similar to // other data, while still compressing fast, you might look at first // using BMDiff and then compressing the output of BMDiff with // Snappy. #ifndef THIRD_PARTY_SNAPPY_SNAPPY_H__ #define THIRD_PARTY_SNAPPY_SNAPPY_H__ #include #include #include "snappy-stubs-public.h" namespace snappy { class Source; class Sink; // ------------------------------------------------------------------------ // Generic compression/decompression routines. 
// ------------------------------------------------------------------------ // Compress the bytes read from "*source" and append to "*sink". Return the // number of bytes written. size_t Compress(Source* source, Sink* sink); // Find the uncompressed length of the given stream, as given by the header. // Note that the true length could deviate from this; the stream could e.g. // be truncated. // // Also note that this leaves "*source" in a state that is unsuitable for // further operations, such as RawUncompress(). You will need to rewind // or recreate the source yourself before attempting any further calls. bool GetUncompressedLength(Source* source, uint32* result); // ------------------------------------------------------------------------ // Higher-level string based routines (should be sufficient for most users) // ------------------------------------------------------------------------ // Sets "*output" to the compressed version of "input[0,input_length-1]". // Original contents of *output are lost. // // REQUIRES: "input[]" is not an alias of "*output". size_t Compress(const char* input, size_t input_length, string* output); // Decompresses "compressed[0,compressed_length-1]" to "*uncompressed". // Original contents of "*uncompressed" are lost. // // REQUIRES: "compressed[]" is not an alias of "*uncompressed". // // returns false if the message is corrupted and could not be decompressed bool Uncompress(const char* compressed, size_t compressed_length, string* uncompressed); // Decompresses "compressed" to "*uncompressed". // // returns false if the message is corrupted and could not be decompressed bool Uncompress(Source* compressed, Sink* uncompressed); // This routine uncompresses as much of the "compressed" as possible // into sink. It returns the number of valid bytes added to sink // (extra invalid bytes may have been added due to errors; the caller // should ignore those). 
The emitted data typically has length // GetUncompressedLength(), but may be shorter if an error is // encountered. size_t UncompressAsMuchAsPossible(Source* compressed, Sink* uncompressed); // ------------------------------------------------------------------------ // Lower-level character array based routines. May be useful for // efficiency reasons in certain circumstances. // ------------------------------------------------------------------------ // REQUIRES: "compressed" must point to an area of memory that is at // least "MaxCompressedLength(input_length)" bytes in length. // // Takes the data stored in "input[0..input_length]" and stores // it in the array pointed to by "compressed". // // "*compressed_length" is set to the length of the compressed output. // // Example: // char* output = new char[snappy::MaxCompressedLength(input_length)]; // size_t output_length; // RawCompress(input, input_length, output, &output_length); // ... Process(output, output_length) ... // delete [] output; void RawCompress(const char* input, size_t input_length, char* compressed, size_t* compressed_length); // Given data in "compressed[0..compressed_length-1]" generated by // calling the Snappy::Compress routine, this routine // stores the uncompressed data to // uncompressed[0..GetUncompressedLength(compressed)-1] // returns false if the message is corrupted and could not be decompressed bool RawUncompress(const char* compressed, size_t compressed_length, char* uncompressed); // Given data from the byte source 'compressed' generated by calling // the Snappy::Compress routine, this routine stores the uncompressed // data to // uncompressed[0..GetUncompressedLength(compressed,compressed_length)-1] // returns false if the message is corrupted and could not be decompressed bool RawUncompress(Source* compressed, char* uncompressed); // Given data in "compressed[0..compressed_length-1]" generated by // calling the Snappy::Compress routine, this routine // stores the uncompressed data to
the iovec "iov". The number of physical // buffers in "iov" is given by iov_cnt and their cumulative size // must be at least GetUncompressedLength(compressed). The individual buffers // in "iov" must not overlap with each other. // // returns false if the message is corrupted and could not be decompressed bool RawUncompressToIOVec(const char* compressed, size_t compressed_length, const struct iovec* iov, size_t iov_cnt); // Given data from the byte source 'compressed' generated by calling // the Snappy::Compress routine, this routine stores the uncompressed // data to the iovec "iov". The number of physical // buffers in "iov" is given by iov_cnt and their cumulative size // must be at least GetUncompressedLength(compressed). The individual buffers // in "iov" must not overlap with each other. // // returns false if the message is corrupted and could not be decompressed bool RawUncompressToIOVec(Source* compressed, const struct iovec* iov, size_t iov_cnt); // Returns the maximal size of the compressed representation of // input data that is "source_bytes" bytes in length. size_t MaxCompressedLength(size_t source_bytes); // REQUIRES: "compressed[]" was produced by RawCompress() or Compress() // Returns true and stores the length of the uncompressed data in // *result normally. Returns false on parsing error. // This operation takes O(1) time. bool GetUncompressedLength(const char* compressed, size_t compressed_length, size_t* result); // Returns true iff the contents of "compressed[]" can be uncompressed // successfully. Does not return the uncompressed data. Takes // time proportional to compressed_length, but is usually at least // a factor of four faster than actual decompression. bool IsValidCompressedBuffer(const char* compressed, size_t compressed_length); // Returns true iff the contents of "compressed" can be uncompressed // successfully. Does not return the uncompressed data.
Takes // time proportional to *compressed length, but is usually at least // a factor of four faster than actual decompression. // On success, consumes all of *compressed. On failure, consumes an // unspecified prefix of *compressed. bool IsValidCompressed(Source* compressed); // The size of a compression block. Note that many parts of the compression // code assumes that kBlockSize <= 65536; in particular, the hash table // can only store 16-bit offsets, and EmitCopy() also assumes the offset // is 65535 bytes or less. Note also that if you change this, it will // affect the framing format (see framing_format.txt). // // Note that there might be older data around that is compressed with larger // block sizes, so the decompression code should not rely on the // non-existence of long backreferences. static const int kBlockLog = 16; static const size_t kBlockSize = 1 << kBlockLog; static const int kMaxHashTableBits = 14; static const size_t kMaxHashTableSize = 1 << kMaxHashTableBits; } // end namespace snappy #endif // THIRD_PARTY_SNAPPY_SNAPPY_H__ canu-1.6/src/stores/ovOverlap.C000066400000000000000000000116371314437614700165170ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. 
Walenz from 2014-DEC-15 to 2015-JUL-01 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Sergey Koren beginning on 2016-MAR-11 * are a 'United States Government Work', and * are released in the public domain * * Brian P. Walenz beginning on 2016-OCT-17 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "ovStore.H" #include "gkStore.H" // Even though the b_end_hi | b_end_lo is uint64 in the struct, the result // of combining them doesn't appear to be 64-bit. The cast is necessary. char * ovOverlap::toString(char *str, ovOverlapDisplayType type, bool newLine) { switch (type) { case ovOverlapAsHangs: sprintf(str, "%10" F_U32P " %10" F_U32P " %c %6" F_S32P " %6" F_U32P " %6" F_S32P " %7.6f%s%s", a_iid, b_iid, flipped() ? 'I' : 'N', a_hang(), span(), b_hang(), erate(), (overlapIsDovetail()) ? "" : " PARTIAL", (newLine) ? "\n" : ""); break; case ovOverlapAsCoords: sprintf(str, "%10" F_U32P " %10" F_U32P " %c %6" F_U32P " %6" F_U32P " %6" F_U32P " %6" F_U32P " %6" F_U32P " %7.6f%s", a_iid, b_iid, flipped() ? 'I' : 'N', span(), a_bgn(), a_end(), b_bgn(), b_end(), erate(), (newLine) ? "\n" : ""); break; case ovOverlapAsRaw: sprintf(str, "%10" F_U32P " %10" F_U32P " %c %6" F_U32P " %6" F_OVP " %6" F_OVP " %6" F_OVP " %6" F_OVP " %7.6f %s %s %s%s", a_iid, b_iid, flipped() ? 'I' : 'N', span(), dat.ovl.ahg5, dat.ovl.ahg3, dat.ovl.bhg5, dat.ovl.bhg3, erate(), dat.ovl.forOBT ? "OBT" : " ", dat.ovl.forDUP ? "DUP" : " ", dat.ovl.forUTG ? "UTG" : " ", (newLine) ? "\n" : ""); break; case ovOverlapAsCompat: sprintf(str, "%8" F_U32P " %8" F_U32P " %c %6d %6d %5.2f %5.2f%s", a_iid, b_iid, dat.ovl.flipped ? 'I' : 'N', a_hang(), b_hang(), erate() * 100.0, erate() * 100.0, (newLine) ? 
"\n" : ""); break; case ovOverlapAsPaf: // miniasm/map expects entries to be separated by tabs // no padding spaces on names so we don't confuse read identifiers sprintf(str, "%" F_U32P "\t%6" F_U32P "\t%6" F_U32P "\t%6" F_U32P "\t%c\t%" F_U32P "\t%6" F_U32P "\t%6" F_U32P "\t%6" F_U32P "\t%6" F_U32P "\t%6" F_U32P "\t%6" F_U32P " %s", a_iid, (g->gkStore_getRead(a_iid)->gkRead_sequenceLength()), a_bgn(), a_end(), flipped() ? '-' : '+', b_iid, (g->gkStore_getRead(b_iid)->gkRead_sequenceLength()), flipped() ? b_end() : b_bgn(), flipped() ? b_bgn() : b_end(), (uint32)floor(span() == 0 ? (1-erate() * (a_end()-a_bgn())) : (1-erate()) * span()), span() == 0 ? a_end() - a_bgn() : span(), 255, (newLine) ? "\n" : ""); break; } return(str); } void ovOverlap::swapIDs(ovOverlap const &orig) { a_iid = orig.b_iid; b_iid = orig.a_iid; // Copy the overlap as is, then fix it for the ID swap. for (uint32 ii=0; ii 0 ---> ahg5 > 0 (and bhg5 == 0) // a_hang < 0 ---> bhg5 > 0 (and ahg5 == 0) // // b_hang > 0 ---> bhg3 > 0 (and ahg3 == 0) // b_hang < 0 ---> ahg3 > 0 (and bhg3 == 0) // // Convenience functions. int32 a_hang(void) const { return((int32)dat.ovl.ahg5 - (int32)dat.ovl.bhg5); }; int32 b_hang(void) const { return((int32)dat.ovl.bhg3 - (int32)dat.ovl.ahg3); }; void a_hang(int32 a) { dat.ovl.ahg5 = (a < 0) ? 0 : a; dat.ovl.bhg5 = (a < 0) ? -a : 0; }; void b_hang(int32 b) { dat.ovl.bhg3 = (b < 0) ? 0 : b; dat.ovl.ahg3 = (b < 0) ? -b : 0; }; // These return the actual coordinates on the read. For reverse B reads, the coordinates are in the reverse-complemented // sequence, and are returned as bgn > end to show this. uint32 a_bgn(void) const { return(dat.ovl.ahg5); }; uint32 a_end(void) const { return(g->gkStore_getRead(a_iid)->gkRead_sequenceLength() - dat.ovl.ahg3); }; uint32 b_bgn(void) const { return((dat.ovl.flipped) ? (g->gkStore_getRead(b_iid)->gkRead_sequenceLength() - dat.ovl.bhg5) : (dat.ovl.bhg5)); }; uint32 b_end(void) const { return((dat.ovl.flipped) ?
(dat.ovl.bhg3) : (g->gkStore_getRead(b_iid)->gkRead_sequenceLength() - dat.ovl.bhg3)); }; uint32 span(void) const { return(dat.ovl.span); }; void span(uint32 s) { dat.ovl.span = s; }; #if 0 // Return an approximate span as the average of the read span aligned. uint32 span(void) const { if (dat.ovl.span > 0) return(dat.ovl.span); else { uint32 ab = a_bgn(), ae = a_end(); uint32 bb = b_bgn(), be = b_end(); if (bb < be) return(((ae - ab) + (be - bb)) / 2); else return(((ae - ab) + (bb - be)) / 2); } } #endif void flipped(uint32 f) { dat.ovl.flipped = f; }; uint32 flipped(void) const { return(dat.ovl.flipped == true); }; void erate(double e) { dat.ovl.evalue = AS_OVS_encodeEvalue(e); }; double erate(void) const { return(AS_OVS_decodeEvalue(dat.ovl.evalue)); }; void evalue(uint64 e) { dat.ovl.evalue = e; }; uint64 evalue(void) const { return(dat.ovl.evalue); }; bool forOBT(void) { return(dat.ovl.forOBT); }; bool forDUP(void) { return(dat.ovl.forDUP); }; bool forUTG(void) { return(dat.ovl.forUTG); }; // These are true only if the overlap is dovetail, which is the usual case, and isn't checked. 
uint32 overlapAEndIs5prime(void) const { return((dat.ovl.bhg5 > 0) && (dat.ovl.ahg3 > 0)); }; uint32 overlapAEndIs3prime(void) const { return((dat.ovl.ahg5 > 0) && (dat.ovl.bhg3 > 0)); }; uint32 overlapBEndIs5prime(void) const { return((overlapAEndIs5prime() && (dat.ovl.flipped == true)) || (overlapAEndIs3prime() && (dat.ovl.flipped == false))); }; uint32 overlapBEndIs3prime(void) const { return((overlapAEndIs5prime() && (dat.ovl.flipped == false)) || (overlapAEndIs3prime() && (dat.ovl.flipped == true))); }; uint32 overlapAIsContained(void) const { return((dat.ovl.ahg5 == 0) && (dat.ovl.ahg3 == 0)); }; uint32 overlapBIsContainer(void) const { return((dat.ovl.ahg5 == 0) && (dat.ovl.ahg3 == 0)); }; uint32 overlapAIsContainer(void) const { return((dat.ovl.bhg5 == 0) && (dat.ovl.bhg3 == 0)); }; uint32 overlapBIsContained(void) const { return((dat.ovl.bhg5 == 0) && (dat.ovl.bhg3 == 0)); }; // Test if the overlap is dovetail or partial. uint32 overlap5primeIsPartial(void) const { return((dat.ovl.ahg5 > 0) && (dat.ovl.bhg5 > 0)); }; uint32 overlap3primeIsPartial(void) const { return((dat.ovl.ahg3 > 0) && (dat.ovl.bhg3 > 0)); }; uint32 overlapIsPartial(void) const { return(overlap5primeIsPartial() || overlap3primeIsPartial()); }; char *toString(char *str, ovOverlapDisplayType type, bool newLine); void swapIDs(ovOverlap const &orig); void clear(void) { //g = NULL; // Explicitly DO NOT clear the pointer to gkpStore. 
for (uint32 ii=0; ii that.a_iid) return(false); if (b_iid < that.b_iid) return(true); if (b_iid > that.b_iid) return(false); for (uint32 ii=0; ii that.dat.dat[ii]) return(false); } return(false); }; public: gkStore *g; public: uint32 a_iid; uint32 b_iid; union { ovOverlapWORD dat[ovOverlapNWORDS]; ovOverlapDAT ovl; } dat; }; #endif // AS_OVOVERLAP_H canu-1.6/src/stores/ovStore.C000066400000000000000000000423421314437614700162000ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_OVS/AS_OVS_overlapStore.C * src/AS_OVS/AS_OVS_overlapStore.c * * Modifications by: * * Brian P. Walenz from 2007-MAR-08 to 2013-AUG-01 * are Copyright 2007-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren on 2007-MAY-08 * are Copyright 2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2011-JUN-02 to 2011-JUN-03 * are Copyright 2011 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Gregory Sims from 2012-FEB-01 to 2012-FEB-14 * are Copyright 2012 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-09 to 2015-AUG-14 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
Walenz beginning on 2015-OCT-12 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2015-DEC-15 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "ovStore.H" ovStore::ovStore(const char *path, gkStore *gkp) { char name[FILENAME_MAX]; if (path == NULL) fprintf(stderr, "ovStore::ovStore()-- ERROR: no name supplied.\n"), exit(1); if ((path[0] == '-') && (path[1] == 0)) fprintf(stderr, "ovStore::ovStore()-- ERROR: name cannot be '-' (stdin).\n"), exit(1); memset(_storePath, 0, FILENAME_MAX); strncpy(_storePath, path, FILENAME_MAX-1); _info.clear(); _gkp = gkp; _offtFile = NULL; _offt.clear(); _offm.clear(); _evaluesMap = NULL; _evalues = NULL; _overlapsThisFile = 0; _currentFileIndex = 0; _bof = NULL; // Now open the store if (_info.load(_storePath) == false) fprintf(stderr, "ERROR: failed to initialize ovStore '%s'.\n", path), exit(1); if (_info.checkIncomplete() == true) fprintf(stderr, "ERROR: directory '%s' is an incomplete ovStore, remove and rebuild.\n", path), exit(1); if (_info.checkMagic() == false) fprintf(stderr, "ERROR: directory '%s' is not an ovStore.\n", path), exit(1); if (_info.checkVersion() == false) fprintf(stderr, "ERROR: directory '%s' is not a supported ovStore version (store version %u; supported version %u).\n", path, _info.getVersion(), _info.getCurrentVersion()), exit(1); if (_info.checkSize() == false) fprintf(stderr, "ERROR: directory '%s' is not a supported read length (store is %u bits, AS_MAX_READLEN_BITS is %u).\n", path, _info.getSize(), AS_MAX_READLEN_BITS), exit(1); // Open the index snprintf(name, FILENAME_MAX, "%s/index", _storePath); errno = 0; _offtFile = fopen(name, "r"); if (errno) fprintf(stderr, "ERROR: failed to open offset file '%s': %s\n", name, strerror(errno)), exit(1); // Open and load erates
snprintf(name, FILENAME_MAX, "%s/evalues", _storePath); if (AS_UTL_fileExists(name)) { _evaluesMap = new memoryMappedFile(name, memoryMappedFile_readOnly); _evalues = (uint16 *)_evaluesMap->get(0); } // Set the initial range to everything. _firstIIDrequested = _info.smallestID(); _lastIIDrequested = _info.largestID(); } ovStore::~ovStore() { if (_evaluesMap) { delete _evaluesMap; _evaluesMap = NULL; _evalues = NULL; } delete _bof; fclose(_offtFile); } uint32 ovStore::readOverlap(ovOverlap *overlap) { // If we've finished reading overlaps for the current a_iid, get // another a_iid. If we hit EOF here, we're all done, no more // overlaps. while (_offt._numOlaps == 0) if (0 == AS_UTL_safeRead(_offtFile, &_offt, "ovStore::readOverlap::offset", sizeof(ovStoreOfft), 1)) return(0); // And if we've exited the range of overlaps requested, return. if (_offt._a_iid > _lastIIDrequested) return(0); while ((_bof == NULL) || (_bof->readOverlap(overlap) == FALSE)) { char name[FILENAME_MAX]; // We read no overlap, open the next file and try again. if (_bof) delete _bof; _currentFileIndex++; snprintf(name, FILENAME_MAX, "%s/%04d", _storePath, _currentFileIndex); _bof = new ovFile(_gkp, name, ovFileNormal); } overlap->a_iid = _offt._a_iid; overlap->g = _gkp; if (_evalues) overlap->evalue(_evalues[_offt._overlapID++]); _offt._numOlaps--; return(1); } uint32 ovStore::numberOfOverlaps(void) { ovOverlap *ovl = NULL; uint32 novl = 0; return(readOverlaps(ovl, novl)); } uint32 ovStore::readOverlaps(ovOverlap *&overlaps, uint32 &maxOverlaps, bool restrictToIID) { int numOvl = 0; // If we've finished reading overlaps for the current a_iid, get // another a_iid. If we hit EOF here, we're all done, no more // overlaps. while (_offt._numOlaps == 0) if (0 == AS_UTL_safeRead(_offtFile, &_offt, "ovStore::readOverlaps::offset", sizeof(ovStoreOfft), 1)) return(0); // And if we've exited the range of overlaps requested, return. if (_offt._a_iid > _lastIIDrequested) return(0); // Just a query? 
Return the number of overlaps we'd want to read if ((overlaps == NULL) || (maxOverlaps == 0)) return(_offt._numOlaps); // Allocate more space, if needed if (maxOverlaps < _offt._numOlaps) { delete [] overlaps; while (maxOverlaps < _offt._numOlaps) maxOverlaps *= 2; overlaps = ovOverlap::allocateOverlaps(_gkp, maxOverlaps); } // Read all the overlaps for this ID. while (((restrictToIID == true) && (_offt._numOlaps > 0)) || ((restrictToIID == false) && (_offt._numOlaps > 0) && (numOvl < maxOverlaps))) { // Read an overlap. If this fails, open the next partition and read from there. while ((_bof == NULL) || (_bof->readOverlap(overlaps + numOvl) == false)) { char name[FILENAME_MAX]; // We read no overlap, open the next file and try again. delete _bof; _bof = NULL; _currentFileIndex++; if (_currentFileIndex > _info.lastFileIndex()) // No more files, stop trying to load an overlap. break; snprintf(name, FILENAME_MAX, "%s/%04d", _storePath, _currentFileIndex); _bof = new ovFile(_gkp, name, ovFileNormal); } // If the currentFileIndex is invalid, we ran out of overlaps to load. Don't save that // empty overlap to the list. if (_currentFileIndex <= _info.lastFileIndex()) { overlaps[numOvl].a_iid = _offt._a_iid; overlaps[numOvl].g = _gkp; if (_evalues) overlaps[numOvl].evalue(_evalues[_offt._overlapID++]); numOvl++; assert(_offt._numOlaps > 0); _offt._numOlaps--; } // If restrictToIID == false, we're loading all overlaps up to the end of the store, or the // requested last IID. If to the end of store, we never read a last 'offset' and so a_iid is // still valid (below lastIIDrequested == infinity) but numOlaps is still zero, and the main // loop terminates.
if (restrictToIID == false) { while (_offt._numOlaps == 0) if (0 == AS_UTL_safeRead(_offtFile, &_offt, "ovStore::readOverlap::offset", sizeof(ovStoreOfft), 1)) break; if (_offt._a_iid > _lastIIDrequested) break; } } // while space for more overlaps, load overlaps assert(numOvl <= maxOverlaps); return(numOvl); } uint32 ovStore::readOverlaps(uint32 iid, ovOverlap *&ovl, uint32 &ovlLen, uint32 &ovlMax) { // Allocate initial space if needed. if (ovl == NULL) { ovlLen = 0; ovlMax = 65 * 1024; ovl = ovOverlap::allocateOverlaps(_gkp, ovlMax); } if (iid < ovl[0].a_iid) // Overlaps loaded are for a future read. return(0); if (iid == ovl[0].a_iid) // Overlaps loaded are for this read, YAY! We assume that ALL overlaps are loaded // for this iid. return(ovlLen); // Until we load the correct overlap, repeat. do { // Count the number of overlaps we would load ovlLen = numberOfOverlaps(); if (ovlLen == 0) // Quit now if there are no overlaps. This simplifies the rest of the loop. return(0); // Allocate space for these overlaps. while (ovlMax < ovlLen) { ovlMax *= 2; delete [] ovl; ovl = ovOverlap::allocateOverlaps(_gkp, ovlMax); } // Load the overlaps ovlLen = readOverlaps(ovl, ovlMax); // If we read overlaps for a fragment after 'iid', we're done. The client will properly save // these overlaps until the iid becomes active. // if (iid < ovl[0].a_iid) return(0); // If we've found the overlaps, we're also done, and we return the number of overlaps. // if (iid == ovl[0].a_iid) return(ovlLen); // On the other hand, if we read overlaps for a fragment before 'iid', we can either keep // reading until we find the overlaps for this fragment, or jump to the correct spot to read // overlaps. // // The rule is simple. If we're within 50 of the correct IID, keep streaming. Otherwise, make // a jump. setRange() seems to ALWAYS close and open a file, which is somewhat expensive, // especially if the file doesn't actually change.
    //
    if (50 < iid - ovl[0].a_iid)
      setRange(iid, UINT32_MAX);

  } while (ovl[0].a_iid < iid);

  //  Code can't get here.
  //    If we ran out of overlaps, the first return is used.
  //    If we read past the iid we're looking for, the second return is used.
  //    If we found the overlaps we're looking for, the third return is used.
  //    If we didn't find the overlaps, we loop.

  assert(0);
  return(0);
}



void
ovStore::setRange(uint32 firstIID, uint32 lastIID) {
  char  name[FILENAME_MAX];

  //  make the index be one record per read iid, regardless, then we
  //  can quickly grab the correct record, and seek to the start of
  //  those overlaps

  if (firstIID > _info.largestID())
    firstIID = _info.largestID() + 1;
  if (lastIID >= _info.largestID())
    lastIID = _info.largestID();

  //  If our range is invalid (firstIID > lastIID) we keep going, and
  //  let readOverlap() deal with it.

  AS_UTL_fseek(_offtFile, (off_t)firstIID * sizeof(ovStoreOfft), SEEK_SET);

  //  Unfortunately, we need to actually read the record to figure out
  //  where to position the overlap stream.  If the read fails, we
  //  silently return, letting readOverlap() deal with the problem.

  _offt.clear();

  //  Everything should notice that offsetFile is at EOF and not try
  //  to find overlaps, but, just in case, we set invalid first/last
  //  IIDs.
  //
  _firstIIDrequested = firstIID;
  _lastIIDrequested  = lastIID;

  if (0 == AS_UTL_safeRead(_offtFile, &_offt, "ovStore::setRange::offset", sizeof(ovStoreOfft), 1))
    return;

  _overlapsThisFile = 0;
  _currentFileIndex = _offt._fileno;

  delete _bof;

  snprintf(name, FILENAME_MAX, "%s/%04d", _storePath, _currentFileIndex);
  _bof = new ovFile(_gkp, name, ovFileNormal);
  _bof->seekOverlap(_offt._offset);
}



void
ovStore::resetRange(void) {
  char  name[FILENAME_MAX];

  rewind(_offtFile);

  _offt.clear();

  _overlapsThisFile = 0;
  _currentFileIndex = 1;

  delete _bof;

  snprintf(name, FILENAME_MAX, "%s/%04d", _storePath, _currentFileIndex);
  _bof = new ovFile(_gkp, name, ovFileNormal);

  _firstIIDrequested = _info.smallestID();
  _lastIIDrequested  = _info.largestID();
}



uint64
ovStore::numOverlapsInRange(void) {
  off_t         originalposition = 0;
  uint64        i       = 0;
  uint64        len     = 0;
  ovStoreOfft  *offsets = NULL;
  uint64        numolap = 0;

  if (_firstIIDrequested > _lastIIDrequested)
    return(0);

  originalposition = AS_UTL_ftell(_offtFile);

  AS_UTL_fseek(_offtFile, (off_t)_firstIIDrequested * sizeof(ovStoreOfft), SEEK_SET);

  //  Even if we're doing a whole human-size store, this allocation is
  //  (a) temporary and (b) only 512MB.  The only current consumer of
  //  this code is FragCorrectOVL.c, which doesn't run on the whole
  //  human, it runs on ~24 pieces, which cuts this down to < 32MB.

  len     = _lastIIDrequested - _firstIIDrequested + 1;
  offsets = new ovStoreOfft [len];

  if (len != AS_UTL_safeRead(_offtFile, offsets, "AS_OVS_numOverlapsInRange", sizeof(ovStoreOfft), len)) {
    fprintf(stderr, "AS_OVS_numOverlapsInRange()-- short read on offsets!\n");
    exit(1);
  }

  for (i=0; i<len; i++)
    numolap += offsets[i]._numOlaps;

  delete [] offsets;

  AS_UTL_fseek(_offtFile, originalposition, SEEK_SET);

  return(numolap);
}



uint32 *
ovStore::numOverlapsPerFrag(uint32 &firstFrag, uint32 &lastFrag) {

  if (_firstIIDrequested > _lastIIDrequested)
    return(NULL);

  firstFrag = _firstIIDrequested;
  lastFrag  = _lastIIDrequested;

  off_t  originalPosition = AS_UTL_ftell(_offtFile);

  AS_UTL_fseek(_offtFile, (off_t)_firstIIDrequested * sizeof(ovStoreOfft), SEEK_SET);

  //  Even if we're doing a whole human-size store, this allocation is
  //  (a) temporary and (b) only 512MB.
  //  The only current consumer of
  //  this code is FragCorrectOVL.c, which doesn't run on the whole
  //  human, it runs on ~24 pieces, which cuts this down to < 32MB.

  uint64        len     = _lastIIDrequested - _firstIIDrequested + 1;
  ovStoreOfft  *offsets = new ovStoreOfft [len];
  uint32       *numolap = new uint32 [len];

  uint64 act = AS_UTL_safeRead(_offtFile, offsets, "ovStore::numOverlapsInRange::offsets", sizeof(ovStoreOfft), len);

  if (len != act)
    fprintf(stderr, "AS_OVS_numOverlapsPerFrag()-- short read on offsets!  Expected len=" F_U64 " read act=" F_U64 "\n",
            len, act), exit(1);

  for (uint64 i=0; i<len; i++)
    numolap[i] = offsets[i]._numOlaps;

  delete [] offsets;

  AS_UTL_fseek(_offtFile, originalPosition, SEEK_SET);

  return(numolap);
}



void
ovStore::addEvalues(vector<char *> &fileList) {
  char  name[FILENAME_MAX];

  snprintf(name, FILENAME_MAX, "%s/evalues", _storePath);

  //  If we have an opened memory mapped file, close it.

  if (_evaluesMap) {
    delete _evaluesMap;
    _evaluesMap = NULL;
    _evalues    = NULL;
  }

  //  Allocate space for the evalues.

  _evalues = new uint16 [_info.numOverlaps()];

  //  Remove a bogus evalues file if one exists.

  if ((AS_UTL_fileExists(name) == true) &&
      (AS_UTL_sizeOfFile(name) != (sizeof(uint16) * _info.numOverlaps()))) {
    fprintf(stderr, "WARNING: existing evalues file is incorrect size: should be " F_U64 " bytes, is " F_U64 " bytes.  Removing.\n",
            (sizeof(uint16) * _info.numOverlaps()), AS_UTL_sizeOfFile(name));
    AS_UTL_unlink(name);
  }

  //  Clear the evalues.

  for (uint64 ii=0; ii<_info.numOverlaps(); ii++)
    _evalues[ii] = UINT16_MAX;

  //  For each file in the fileList, open it, read the header (bgnID, endID and
  //  number of values), load the evalues, then copy this data to the actual
  //  evalues file.

  for (uint32 i=0; iget(0); }
canu-1.6/src/stores/ovStore.H

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz from 2014-DEC-09 to 2015-JUL-01
 *      are Copyright 2014-2015 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *    Brian P. Walenz beginning on 2015-OCT-12
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *    Sergey Koren beginning on 2016-MAR-11
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#ifndef AS_OVSTORE_H
#define AS_OVSTORE_H

#include "AS_global.H"
#include "gkStore.H"

#define SNAPPY

#include "ovOverlap.H"
#include "ovStoreFile.H"
#include "ovStoreHistogram.H"

const uint64 ovStoreVersion         = 2;
const uint64 ovStoreMagic           = 0x53564f3a756e6163;   //  == "canu:OVS" - store complete
const uint64 ovStoreMagicIncomplete = 0x50564f3a756e6163;   //  == "canu:OVP" - store under construction


class ovStoreInfo {
public:
  ovStoreInfo() {
    clear();
  };

  void    clear(void) {
    _ovsMagic         = ovStoreMagicIncomplete;   //  Appropriate for a new store.
_ovsVersion = ovStoreVersion; _UNUSED = 0; _smallestIID = UINT64_MAX; _largestIID = 0; _numOverlapsTotal = 0; _highestFileIndex = 0; _maxReadLenInBits = AS_MAX_READLEN_BITS; }; bool load(const char *path, uint32 index=UINT32_MAX, bool temporary=false) { char name[FILENAME_MAX]; if (temporary == false) snprintf(name, FILENAME_MAX, "%s/info", path); else snprintf(name, FILENAME_MAX, "%s/%04u.info", path, index); if (AS_UTL_fileExists(name, false, false) == false) { fprintf(stderr, "ERROR: directory '%s' is not an overlapStore; didn't find file '%s': %s\n", path, name, strerror(errno)); return(false); } errno = 0; FILE *ovsinfo = fopen(name, "r"); if (errno) { fprintf(stderr, "ERROR: directory '%s' is not an overlapStore; failed to open '%s': %s\n", path, name, strerror(errno)); return(false); } AS_UTL_safeRead(ovsinfo, this, "ovStore::ovStore::info", sizeof(ovStoreInfo), 1); fclose(ovsinfo); return(true); }; bool test(const char *path) { char name[FILENAME_MAX]; snprintf(name, FILENAME_MAX, "%s/info", path); if (AS_UTL_fileExists(name, false, false) == false) return(false); errno = 0; FILE *ovsinfo = fopen(name, "r"); if (errno) fprintf(stderr, "ERROR: failed to load '%s'; can't check if this is a valid ovStore: %s\n", name, strerror(errno)), exit(1); AS_UTL_safeRead(ovsinfo, this, "ovStore::ovStore::info", sizeof(ovStoreInfo), 1); return(checkMagic()); }; void save(const char *path, uint32 index=UINT32_MAX, bool temporary=false) { char name[FILENAME_MAX]; if (temporary == false) snprintf(name, FILENAME_MAX, "%s/info", path); else snprintf(name, FILENAME_MAX, "%s/%04u.info", path, index); if (temporary == false) { _ovsMagic = ovStoreMagic; _ovsVersion = ovStoreVersion; _highestFileIndex = index; } else { } errno = 0; FILE *ovsinfo = fopen(name, "w"); if (errno) fprintf(stderr, "ERROR: failed to save '%s': %s\n", name, strerror(errno)), exit(1); AS_UTL_safeWrite(ovsinfo, this, "ovStore::ovStore::saveinfo", sizeof(ovStoreInfo), 1); fclose(ovsinfo); }; void 
addOverlap(uint32 id, uint32 nOverlaps=1) { if (_smallestIID > id) _smallestIID = id; if (_largestIID < id) _largestIID = id; _numOverlapsTotal += nOverlaps; }; bool checkIncomplete(void) { return(_ovsMagic == ovStoreMagicIncomplete); }; bool checkMagic(void) { return(_ovsMagic == ovStoreMagic); }; bool checkVersion(void) { return(_ovsVersion == ovStoreVersion); }; bool checkSize(void) { return(_maxReadLenInBits == AS_MAX_READLEN_BITS); }; uint32 getVersion(void) { return((uint32)_ovsVersion); }; uint32 getCurrentVersion(void) { return((uint32)ovStoreVersion); }; uint32 getSize(void) { return((uint32)_maxReadLenInBits); }; uint64 numOverlaps(void) { return(_numOverlapsTotal); }; uint32 smallestID(void) { return(_smallestIID); }; uint32 largestID(void) { return(_largestIID); }; uint32 lastFileIndex(void) { return(_highestFileIndex); }; private: uint64 _ovsMagic; uint64 _ovsVersion; uint64 _UNUSED; // needed to keep the file layout the same uint64 _smallestIID; // smallest frag iid in the store uint64 _largestIID; // largest frag iid in the store uint64 _numOverlapsTotal; // number of overlaps in the store uint64 _highestFileIndex; uint64 _maxReadLenInBits; // length of a fragment }; class ovStoreOfft { public: ovStoreOfft() { clear(); }; ~ovStoreOfft() { }; void clear(void) { _a_iid = 0; _fileno = 0; _offset = 0; _numOlaps = 0; _overlapID = 0; }; private: uint32 _a_iid; // read ID for this block of overlaps. uint32 _fileno; // the file that contains this a_iid uint32 _offset; // offset to the first overlap for this iid uint32 _numOlaps; // number of overlaps for this iid uint64 _overlapID; // overlapID for the first overlap in this block. in memory, this is the id of the next overlap. 
friend class ovStore; friend class ovStoreWriter; friend void writeOverlaps(gkStore *gkp, char *storePath, ovOverlap *ovls, uint64 ovlsLen, uint32 fileID); friend bool testIndex(char *storePath, bool doFixes); friend void mergeInfoFiles(char *storePath, uint32 nPieces); }; class ovStoreWriter { public: ~ovStoreWriter(); // For sequential construction, there is only a constructor, destructor and writeOverlap(). // Overlaps must be sorted by a_iid (then b_iid) already. ovStoreWriter(const char *path, gkStore *gkp); void writeOverlap(ovOverlap *olap); // For parallel construction, usage is much more complicated. The constructor // will write a single file of sorted overlaps, and each file has it's own metadata. // After all files are written, the metadata is merged into one file. ovStoreWriter(const char *path, gkStore *gkp, uint32 fileLimit, uint32 fileID, uint32 jobIdxMax); void writeOverlaps(ovOverlap *ovls, uint64 ovlsLen); uint64 loadBucketSizes(uint64 *bucketSizes); void loadOverlapsFromSlice(uint32 slice, uint64 expectedLen, ovOverlap *ovls, uint64& ovlsLen); void removeOverlapSlice(void); void mergeInfoFiles(void); void mergeHistogram(void); bool testIndex(bool doFixes); void checkSortingIsComplete(void); void removeAllIntermediateFiles(void); private: char _storePath[FILENAME_MAX]; ovStoreInfo _info; gkStore *_gkp; FILE *_offtFile; // For writing overlaps, a place to dump ovStoreOfft's. ovStoreOfft _offt; // For writing overlaps, the current ovStoreOfft. ovStoreOfft _offm; // For writing overlaps, an empty ovStoreOfft, for reads with no overlaps. 
memoryMappedFile *_evaluesMap; uint16 *_evalues; uint64 _overlapsThisFile; // Count of the number of overlaps written so far uint64 _overlapsThisFileMax; uint32 _currentFileIndex; ovFile *_bof; ovStoreHistogram *_histogram; // When constructing a sequential store, collects all the stats from each file // Parallel store support uint32 _fileLimit; // number of slices used in bucketizing/sorting uint32 _fileID; // index of the overlap file we're processing uint32 _jobIdxMax; // total number of overlap files }; class ovStore { public: ovStore(const char *name, gkStore *gkp); ~ovStore(); // Read the next overlap from the store. Return value is the number of overlaps read. uint32 readOverlap(ovOverlap *overlap); // Return the number of overlaps that would be read. Basically the same as the next readOverlaps() call. uint32 numberOfOverlaps(void); // Read ALL remaining overlaps for the current A_iid. Return value is the number of overlaps read. uint32 readOverlaps(ovOverlap *&overlaps, uint32 &maxOverlaps, bool restrictToIID=true); // Append ALL remaining overlaps for the current A_iid to the overlaps in ovl. Return value is // the number of overlaps in ovl that are for A_iid == iid. // // It is up to the client to verify that ovl[0] is the same as iid (e.g., that the return value // is not zero); ovlLen is the number of overlaps in ovl, NOT the number of overlaps in ovl that // are the same as iid. // uint32 readOverlaps(uint32 iid, ovOverlap *&ovl, uint32 &ovlLen, uint32 &ovlMax); void setRange(uint32 low, uint32 high); void resetRange(void); uint64 numOverlapsInRange(void); uint32 * numOverlapsPerFrag(uint32 &firstFrag, uint32 &lastFrag); // Add new evalues for reads between bgnID and endID. No checking of IDs is done, but the number // of evalues must agree. 
  void       addEvalues(vector<char *> &fileList);

  //  Return the statistics associated with this store

  ovStoreHistogram  *getHistogram(void) {
    return(new ovStoreHistogram(_storePath));
  };

private:
  char               _storePath[FILENAME_MAX];

  ovStoreInfo        _info;
  gkStore           *_gkp;

  uint32             _firstIIDrequested;
  uint32             _lastIIDrequested;

  FILE              *_offtFile;   //  For writing overlaps, a place to dump ovStoreOfft's.
  ovStoreOfft        _offt;       //  For writing overlaps, the current ovStoreOfft.
  ovStoreOfft        _offm;       //  For writing overlaps, an empty ovStoreOfft, for reads with no overlaps.

  memoryMappedFile  *_evaluesMap;
  uint16            *_evalues;

  uint64             _overlapsThisFile;  //  Count of the number of overlaps written so far
  uint32             _currentFileIndex;
  ovFile            *_bof;
};


//  For store construction.  Probably should be in either ovOverlap or ovStore.

class ovStoreFilter {
public:
  ovStoreFilter(gkStore *gkp_, double maxErate);
  ~ovStoreFilter();

  void    filterOverlap(ovOverlap &foverlap, ovOverlap &roverlap);
  void    resetCounters(void);

  uint64  savedUnitigging(void)    { return(saveUTG);      };
  uint64  savedTrimming(void)      { return(saveOBT);      };
  uint64  savedDedupe(void)        { return(saveDUP);      };

  uint64  filteredErate(void)      { return(skipERATE);    };
  uint64  filteredNoTrim(void)     { return(skipOBT);      };
  uint64  filteredBadTrim(void)    { return(skipOBTbad);   };
  uint64  filteredShortTrim(void)  { return(skipOBTshort); };
  uint64  filteredNoDedupe(void)   { return(skipDUP);      };
  uint64  filteredNotDupe(void)    { return(skipDUPdiff);  };
  uint64  filteredDiffLib(void)    { return(skipDUPlib);   };

public:
  gkStore *gkp;

  uint32   maxID;
  uint32   maxEvalue;

  uint64   saveUTG;
  uint64   saveOBT;
  uint64   saveDUP;

  uint64   skipERATE;

  uint64   skipOBT;       //  OBT not requested for the A read
  uint64   skipOBTbad;    //  Overlap too similar
  uint64   skipOBTshort;  //  Overlap is too short

  uint64   skipDUP;       //  DUP not requested for the A read
  uint64   skipDUPdiff;   //  Overlap isn't remotely similar
  uint64   skipDUPlib;

  char    *skipReadOBT;   //  State of the filter.
char *skipReadDUP; }; #endif // AS_OVSTORE_H canu-1.6/src/stores/ovStoreBucketizer.C000066400000000000000000000232321314437614700202250ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_OVS/overlapStoreBucketizer.C * * Modifications by: * * Brian P. Walenz from 2012-APR-02 to 2013-AUG-01 * are Copyright 2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-AUG-22 to 2015-SEP-21 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-NOV-08 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-FEB-29 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "ovStore.H" static void writeToFile(gkStore *gkp, ovOverlap *overlap, ovFile **sliceFile, uint32 sliceFileMax, uint64 *sliceSize, uint32 *iidToBucket, char *ovlName, uint32 jobIndex, bool useGzip) { uint32 df = iidToBucket[overlap->a_iid]; if (sliceFile[df] == NULL) { char name[FILENAME_MAX]; snprintf(name, FILENAME_MAX, "%s/create%04d/slice%04d%s", ovlName, jobIndex, df, (useGzip) ? 
".gz" : ""); sliceFile[df] = new ovFile(gkp, name, ovFileFullWriteNoCounts); sliceSize[df] = 0; } sliceFile[df]->writeOverlap(overlap); sliceSize[df]++; } int main(int argc, char **argv) { char *ovlName = NULL; char *gkpName = NULL; char *cfgName = NULL; uint32 maxFiles = sysconf(_SC_OPEN_MAX) - 16; uint32 fileLimit = maxFiles; uint32 jobIndex = 0; double maxErrorRate = 1.0; uint64 maxError = AS_OVS_encodeEvalue(maxErrorRate); char *ovlInput = NULL; bool useGzip = false; argc = AS_configure(argc, argv); int err=0; int arg=1; while (arg < argc) { if (strcmp(argv[arg], "-O") == 0) { ovlName = argv[++arg]; } else if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-C") == 0) { cfgName = argv[++arg]; } else if (strcmp(argv[arg], "-F") == 0) { fileLimit = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-job") == 0) { jobIndex = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-i") == 0) { ovlInput = argv[++arg]; } else if (strcmp(argv[arg], "-e") == 0) { maxErrorRate = atof(argv[++arg]); maxError = AS_OVS_encodeEvalue(maxErrorRate); } else if (strcmp(argv[arg], "-gzip") == 0) { useGzip = true; } else { fprintf(stderr, "ERROR: unknown option '%s'\n", argv[arg]); err++; } arg++; } if (ovlName == NULL) err++; if (gkpName == NULL) err++; if (ovlInput == NULL) err++; if (jobIndex == 0) err++; if (fileLimit > maxFiles) err++; if (err) { fprintf(stderr, "usage: %s -O asm.ovlStore -G asm.gkpStore -i file.ovb -job j [opts]\n", argv[0]); fprintf(stderr, " -O asm.ovlStore path to store to create\n"); fprintf(stderr, " -G asm.gkpStore path to gkpStore for this assembly\n"); fprintf(stderr, "\n"); fprintf(stderr, " -C config path to previously created ovStoreBuild config data file\n"); fprintf(stderr, "\n"); fprintf(stderr, " -i file.ovb[.gz] input overlaps\n"); fprintf(stderr, " -job j index of this overlap input file\n"); fprintf(stderr, "\n"); fprintf(stderr, " -F f use up to 'f' files for store creation\n"); fprintf(stderr, "\n"); 
fprintf(stderr, " -obt filter overlaps for OBT\n"); fprintf(stderr, " -dup filter overlaps for OBT/dedupe\n"); fprintf(stderr, "\n"); fprintf(stderr, " -e e filter overlaps above e fraction error\n"); fprintf(stderr, "\n"); fprintf(stderr, " -gzip compress buckets even more\n"); fprintf(stderr, "\n"); fprintf(stderr, " DANGER DO NOT USE DO NOT USE DO NOT USE DANGER\n"); fprintf(stderr, " DANGER DANGER\n"); fprintf(stderr, " DANGER This command is difficult to run by hand. DANGER\n"); fprintf(stderr, " DANGER Use ovStoreCreate instead. DANGER\n"); fprintf(stderr, " DANGER DANGER\n"); fprintf(stderr, " DANGER DO NOT USE DO NOT USE DO NOT USE DANGER\n"); fprintf(stderr, "\n"); if (ovlName == NULL) fprintf(stderr, "ERROR: No overlap store (-O) supplied.\n"); if (gkpName == NULL) fprintf(stderr, "ERROR: No gatekeeper store (-G) supplied.\n"); if (ovlInput == NULL) fprintf(stderr, "ERROR: No input (-i) supplied.\n"); if (jobIndex == 0) fprintf(stderr, "ERROR: No job index (-job) supplied.\n"); if (fileLimit > maxFiles) fprintf(stderr, "ERROR: Too many jobs (-F); only " F_U32 " supported on this architecture.\n", maxFiles); exit(1); } { if (AS_UTL_fileExists(ovlName, TRUE, FALSE) == false) AS_UTL_mkdir(ovlName); } { char name[FILENAME_MAX]; snprintf(name, FILENAME_MAX, "%s/create%04d", ovlName, jobIndex); if (AS_UTL_fileExists(name, TRUE, FALSE) == false) AS_UTL_mkdir(name); else fprintf(stderr, "Overwriting previous result; directory '%s' exists.\n", name), exit(0); } { char name[FILENAME_MAX]; snprintf(name, FILENAME_MAX, "%s/bucket%04d/sliceSizes", ovlName, jobIndex); if (AS_UTL_fileExists(name, FALSE, FALSE) == true) fprintf(stderr, "Job finished; file '%s' exists.\n", name), exit(0); } gkStore *gkp = gkStore::gkStore_open(gkpName); uint32 maxIID = gkp->gkStore_getNumReads() + 1; uint32 *iidToBucket = new uint32 [maxIID]; { errno = 0; FILE *C = fopen(cfgName, "r"); if (errno) fprintf(stderr, "ERROR: failed to open config file '%s' for reading: %s\n", cfgName, 
strerror(errno)), exit(1); uint32 maxIIDtest = 0; AS_UTL_safeRead(C, &maxIIDtest, "maxIID", sizeof(uint32), 1); AS_UTL_safeRead(C, iidToBucket, "iidToBucket", sizeof(uint32), maxIID); if (maxIIDtest != maxIID) fprintf(stderr, "ERROR: maxIID in store (" F_U32 ") differs from maxIID in config file (" F_U32 ").\n", maxIID, maxIIDtest), exit(1); } ovFile **sliceFile = new ovFile * [fileLimit + 1]; uint64 *sliceSize = new uint64 [fileLimit + 1]; memset(sliceFile, 0, sizeof(ovFile *) * (fileLimit + 1)); memset(sliceSize, 0, sizeof(uint64) * (fileLimit + 1)); fprintf(stderr, "maxError fraction: %.3f percent: %.3f encoded: " F_U64 "\n", maxErrorRate, maxErrorRate * 100, maxError); fprintf(stderr, "Bucketizing %s\n", ovlInput); ovStoreFilter *filter = new ovStoreFilter(gkp, maxError); ovOverlap foverlap(gkp); ovOverlap roverlap(gkp); ovFile *inputFile = new ovFile(gkp, ovlInput, ovFileFull); // Do bigger buffers increase performance? Do small ones hurt? //AS_OVS_setBinaryOverlapFileBufferSize(2 * 1024 * 1024); while (inputFile->readOverlap(&foverlap)) { filter->filterOverlap(foverlap, roverlap); // The filter copies f into r, and checks IDs // If all are skipped, don't bother writing the overlap. if ((foverlap.dat.ovl.forUTG == true) || (foverlap.dat.ovl.forOBT == true) || (foverlap.dat.ovl.forDUP == true)) writeToFile(gkp, &foverlap, sliceFile, fileLimit, sliceSize, iidToBucket, ovlName, jobIndex, useGzip); if ((roverlap.dat.ovl.forUTG == true) || (roverlap.dat.ovl.forOBT == true) || (roverlap.dat.ovl.forDUP == true)) writeToFile(gkp, &roverlap, sliceFile, fileLimit, sliceSize, iidToBucket, ovlName, jobIndex, useGzip); } delete inputFile; #warning not reporting fate //filter->reportFate(); //filter->resetCounters(); delete filter; for (uint32 i=0; i<=fileLimit; i++) delete sliceFile[i]; // Write slice sizes, rename bucket. 
{ char name[FILENAME_MAX]; char finl[FILENAME_MAX]; snprintf(name, FILENAME_MAX, "%s/create%04d/sliceSizes", ovlName, jobIndex); FILE *F = fopen(name, "w"); if (errno) fprintf(stderr, "ERROR: Failed to open %s: %s\n", name, strerror(errno)), exit(1); AS_UTL_safeWrite(F, sliceSize, "sliceSize", sizeof(uint64), fileLimit + 1); fclose(F); snprintf(name, FILENAME_MAX, "%s/create%04d", ovlName, jobIndex); snprintf(finl, FILENAME_MAX, "%s/bucket%04d", ovlName, jobIndex); errno = 0; rename(name, finl); if (errno) fprintf(stderr, "ERROR: Failed to rename '%s' to final name '%s': %s\n", name, finl, strerror(errno)); } delete [] sliceFile; delete [] sliceSize; fprintf(stderr, "Success!\n"); return(0); } canu-1.6/src/stores/ovStoreBucketizer.mk000066400000000000000000000007431314437614700204540ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := ovStoreBucketizer SOURCES := ovStoreBucketizer.C SRC_INCDIRS := .. ../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/stores/ovStoreBuild.C000066400000000000000000000544171314437614700171660ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
 *
 *  This file is derived from:
 *
 *    src/AS_OVS/overlapStoreBuild.C
 *
 *  Modifications by:
 *
 *    Brian P. Walenz from 2012-APR-02 to 2013-AUG-01
 *      are Copyright 2012-2013 J. Craig Venter Institute, and
 *      are subject to the GNU General Public License version 2
 *
 *    Brian P. Walenz from 2014-JUL-31 to 2015-SEP-21
 *      are Copyright 2014-2015 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *    Brian P. Walenz beginning on 2015-NOV-03
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *    Sergey Koren beginning on 2016-FEB-20
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#include "AS_global.H"
#include "AS_UTL_decodeRange.H"
#include "gkStore.H"
#include "ovStore.H"

#include <vector>
#include <algorithm>

using namespace std;

#define MEMORY_OVERHEAD  (256 * 1024 * 1024)

//  This is the size of the datastructure that we're using to store overlaps for sorting.
//  At present, with ovOverlap, it is over-allocating a pointer that we don't need, but
//  to make a custom structure, we'd need to duplicate a bunch of code or copy data after
//  loading and before writing.
//
//  Used in both ovStoreSorter.C and ovStoreBuild.C.
//
#define ovOverlapSortSize  (sizeof(ovOverlap))



static
void
addEvalues(char *ovlName, vector<char *> &fileList) {
  ovStore  *ovs = new ovStore(ovlName, NULL);

  ovs->addEvalues(fileList);

  delete ovs;

  fprintf(stderr, "- Evalues updated.\n");
}



void
reportConfiguration(char *configOut, uint32 maxIID, uint32 *iidToBucket) {
  char  F[FILENAME_MAX];

  snprintf(F, FILENAME_MAX, "%s.WORKING", configOut);

  //  Write to the .WORKING copy, then rename it to the final name below.

  errno = 0;
  FILE *C = fopen(F, "w");
  if (errno)
    fprintf(stderr, "Failed to open config output file '%s': %s\n", F, strerror(errno)), exit(1);

  AS_UTL_safeWrite(C, &maxIID,     "maxIID",      sizeof(uint32), 1);
  AS_UTL_safeWrite(C, iidToBucket, "iidToBucket", sizeof(uint32), maxIID);

  fclose(C);

  rename(F, configOut);

  delete [] iidToBucket;

  fprintf(stderr, "- Saved configuration to '%s'.\n", configOut);
}



static
uint32 *
computeIIDperBucket(uint32 fileLimit, uint64 minMemory, uint64 maxMemory, uint32 maxIID, vector<char *> &fileList) {
  uint32  *iidToBucket = new uint32 [maxIID];

  int64  procMax  = sysconf(_SC_CHILD_MAX);
  int64  openMax  = sysconf(_SC_OPEN_MAX);
  int64  maxFiles = 0;

  //  As of late August 2016, the ovb files are not gzip compressed, and do not need to use an
  //  external process to decompress.  The support for limiting by number of processes is left in -
  //  but disabled - because it's really just these three lines here, and because the code still
  //  supports gzip inputs.

  if (openMax > 16)
    openMax -= 16;

  if (procMax > 8192)     //  Once saw a case where procMax was 18,446,744,073,709,551,615 (2^64-1)
    procMax = 8192;       //  and openMax was 262,144.  It didn't end well.

  maxFiles = openMax;     //  MIN(procMax, openMax);  ENABLE THIS TO LIMIT PROCESSES TOO.

  //  If we're reading from stdin, not much we can do but divide the IIDs equally per file.  Note
  //  that the IIDs must be consecutive; the obvious, simple and clean division of 'mod' won't work.
  if (fileList[0][0] == '-') {
    if (maxMemory > 0) {
      minMemory = 0;
      maxMemory = 0;
      fileLimit = maxFiles;
      fprintf(stderr, "WARNING: memory limit (-M) specified, but can't be used with inputs from stdin; using %d files instead.\n", fileLimit);
    } else {
      fprintf(stderr, "Sorting overlaps from stdin using %d files.\n", fileLimit);
    }

    uint32  iidPerBucket  = maxIID / fileLimit;
    uint32  thisBucket    = 1;
    uint32  iidThisBucket = 0;

    for (uint32 ii=0; ii<maxIID; ii++) {
      iidToBucket[ii] = thisBucket;
      iidThisBucket++;

      if (iidThisBucket > iidPerBucket) {
        iidThisBucket = 0;
        thisBucket++;
      }
    }

    return(iidToBucket);
  }

  //  Otherwise, we have files, and should have counts.  Load them!

  ovStoreHistogram  *hist = new ovStoreHistogram();
  uint32            *oPR  = NULL;

  allocateArray(oPR, maxIID);

  for (uint32 i=0; i<fileList.size(); i++)
    hist->loadData(fileList[i], maxIID);

  uint64  numOverlaps = hist->getOverlapsPerRead(oPR, maxIID);

  delete hist;
  hist = NULL;

  if (numOverlaps == 0)
    fprintf(stderr, "Found no overlaps to sort.\n"), exit(1);

  fprintf(stderr, "Found " F_U64 " (%.2f million) overlaps.\n", numOverlaps, numOverlaps / 1000000.0);

  //  Partition the overlaps into buckets.

  uint64  olapsPerBucketMax = 1;
  double  GBperOlap         = ovOverlapSortSize / 1024.0 / 1024.0 / 1024.0;

  //  If a file limit, distribute the overlaps to equal sized files.

  if (fileLimit > 0) {
    olapsPerBucketMax = (uint64)ceil((double)numOverlaps / (double)fileLimit);

    fprintf(stderr, "Will sort using " F_U32 " files; " F_U64 " (%.2f million) overlaps per bucket; %.2f GB memory per bucket\n",
            fileLimit, olapsPerBucketMax, olapsPerBucketMax / 1000000.0, olapsPerBucketMax * GBperOlap);
  }

  //  If a memory limit, distribute the overlaps to files no larger than the limit.
  //
  //  This will pick the smallest memory size that uses fewer than maxFiles buckets.  Unreasonable
  //  values can break this - either too low memory or too high allowed open files (an OS limit).
if (maxMemory > 0) { fprintf(stderr, "Configuring for %.2f GB to %.2f GB memory and " F_S64 " open files.\n", minMemory / 1024.0 / 1024.0 / 1024.0, maxMemory / 1024.0 / 1024.0 / 1024.0, maxFiles); if (minMemory < MEMORY_OVERHEAD + ovOverlapSortSize) { fprintf(stderr, "Reset minMemory from " F_U64 " to " F_SIZE_T "\n", minMemory, MEMORY_OVERHEAD + ovOverlapSortSize); minMemory = MEMORY_OVERHEAD + ovOverlapSortSize; } uint64 incr = (maxMemory - minMemory) / 128; if (incr < 1024 * 1024) incr = 1024 * 1024; uint64 useMemory = minMemory; // Compute the initial number of overlaps per bucket, based on the smallest memory allowed. olapsPerBucketMax = (useMemory - MEMORY_OVERHEAD) / ovOverlapSortSize; // Find the smallest memory size that uses fewer files than the OS allows. for (; ((useMemory <= maxMemory) && (numOverlaps / olapsPerBucketMax + 1 > maxFiles)); useMemory += incr) { olapsPerBucketMax = (useMemory - MEMORY_OVERHEAD) / ovOverlapSortSize; fprintf(stderr, "At memory %.3fGB, " F_U64 " olaps per bucket, " F_U64 " buckets (pass 1).\n", useMemory / 1024.0 / 1024.0 / 1024.0, olapsPerBucketMax, numOverlaps / olapsPerBucketMax + 1); } // If we're at less than half the max, make buckets a little bit bigger to reduce the open file // count. This helps when multiple bucketizer jobs get scheduled to the same node. if (useMemory < minMemory + (maxMemory - minMemory) / 2) { for (; ((useMemory <= maxMemory) && (numOverlaps / olapsPerBucketMax + 1 > maxFiles / 2)); useMemory += incr) { olapsPerBucketMax = (useMemory - MEMORY_OVERHEAD) / ovOverlapSortSize; fprintf(stderr, "At memory %.3fGB, " F_U64 " olaps per bucket, " F_U64 " buckets (pass 2).\n", useMemory / 1024.0 / 1024.0 / 1024.0, olapsPerBucketMax, numOverlaps / olapsPerBucketMax + 1); } } // Give up if we hit our max limit. 
if ((minMemory > maxMemory) || (numOverlaps / olapsPerBucketMax + 1) > maxFiles) {
      fprintf(stderr, "ERROR: Cannot sort %.2f million overlaps using %.2f GB memory; too few file handles available.\n",
              numOverlaps / 1000000.0, maxMemory / 1024.0 / 1024.0 / 1024.0);
      fprintf(stderr, "ERROR:   minMemory      " F_U64 "\n", minMemory);
      fprintf(stderr, "ERROR:   maxMemory      " F_U64 "\n", maxMemory);
      fprintf(stderr, "ERROR:   olapsPerBucket " F_U64 "\n", olapsPerBucketMax);
      fprintf(stderr, "ERROR:   buckets        " F_U64 "\n", numOverlaps / olapsPerBucketMax + 1);
      fprintf(stderr, "ERROR:   SC_CHILD_MAX   " F_S64 "\n", (int64)sysconf(_SC_CHILD_MAX));
      fprintf(stderr, "ERROR:   SC_OPEN_MAX    " F_S64 "\n", (int64)sysconf(_SC_OPEN_MAX));
      fprintf(stderr, "ERROR: Increase memory size (in canu, ovsMemory; in ovStoreBuild, -M)\n");
      exit(1);
    }

    fprintf(stderr, "Will sort using " F_U64 " files; " F_U64 " (%.2f million) overlaps per bucket; %.2f GB memory per bucket\n",
            numOverlaps / olapsPerBucketMax + 1, olapsPerBucketMax, olapsPerBucketMax / 1000000.0,
            olapsPerBucketMax * GBperOlap + MEMORY_OVERHEAD / 1024.0 / 1024.0 / 1024.0);
  }

  //  Given the limit on each bucket, count the number of buckets needed, then reset the limit on
  //  each bucket to have the same number of overlaps for every bucket.

  {
    uint64  olaps  = 0;
    uint32  bucket = 1;

    for (uint32 ii=0; ii<maxIID; ii++) {
      olaps += oPR[ii];

      if (olaps >= olapsPerBucketMax) {
        olaps = 0;
        bucket++;
      }
    }

    olapsPerBucketMax = (uint64)ceil((double)numOverlaps / (double)bucket);
  }

  //  And, finally, assign IIDs to buckets.
{
    uint64  olaps  = 0;
    uint32  bucket = 1;

    for (uint32 ii=0; ii<maxIID; ii++) {
      iidToBucket[ii] = bucket;
      olaps          += oPR[ii];

      if (olaps >= olapsPerBucketMax) {
        fprintf(stderr, "  bucket %4d has " F_U64 " olaps.\n", bucket, olaps);
        olaps = 0;
        bucket++;
      }
    }

    fprintf(stderr, "  bucket %4d has " F_U64 " olaps.\n", bucket, olaps);
  }

  fprintf(stderr, "Will sort %.3f million overlaps per bucket, using %u buckets %.2f GB per bucket.\n",
          olapsPerBucketMax / 1000000.0, iidToBucket[maxIID-1],
          olapsPerBucketMax * GBperOlap + MEMORY_OVERHEAD / 1024.0 / 1024.0 / 1024.0);
  fprintf(stderr, "\n");

  delete hist;

  return(iidToBucket);
}



static
void
writeToDumpFile(gkStore    *gkp,
                ovOverlap  *overlap,
                ovFile    **dumpFile,
                uint64     *dumpLength,
                uint32     *iidToBucket,
                char       *ovlName) {

  uint32 df = iidToBucket[overlap->a_iid];

  //  If the dump file isn't opened, open it.

  if (dumpFile[df] == NULL) {
    char name[FILENAME_MAX];
    snprintf(name, FILENAME_MAX, "%s/tmp.sort.%04d", ovlName, df);
    fprintf(stderr, "-- Create bucket '%s'\n", name);
    dumpFile[df]   = new ovFile(gkp, name, ovFileFullWriteNoCounts);
    dumpLength[df] = 0;
  }

  //  And write the overlap.
dumpFile[df]->writeOverlap(overlap);
  dumpLength[df]++;
}



int
main(int argc, char **argv) {
  char           *ovlName    = NULL;
  char           *gkpName    = NULL;
  uint32          fileLimit  = 0;
  uint64          minMemory  = (uint64)1 * 1024 * 1024 * 1024;
  uint64          maxMemory  = (uint64)4 * 1024 * 1024 * 1024;
  double          maxError   = 1.0;
  uint32          minOverlap = 0;
  vector<char *>  fileList;
  uint32          nThreads   = 4;
  bool            eValues    = false;
  char           *configOut  = NULL;

  argc = AS_configure(argc, argv);

  int err=0;
  int arg=1;
  while (arg < argc) {
    if        (strcmp(argv[arg], "-O") == 0) {
      ovlName = argv[++arg];

    } else if (strcmp(argv[arg], "-G") == 0) {
      gkpName = argv[++arg];

    } else if (strcmp(argv[arg], "-F") == 0) {
      fileLimit = atoi(argv[++arg]);
      minMemory = 0;
      maxMemory = 0;

    } else if (strcmp(argv[arg], "-M") == 0) {
      double lo=0.0, hi=0.0;
      AS_UTL_decodeRange(argv[++arg], lo, hi);
      minMemory = (uint64)ceil(lo * 1024.0 * 1024.0 * 1024.0);
      maxMemory = (uint64)ceil(hi * 1024.0 * 1024.0 * 1024.0);
      fileLimit = 0;

    } else if (strcmp(argv[arg], "-e") == 0) {
      maxError = atof(argv[++arg]);

    } else if (strcmp(argv[arg], "-l") == 0) {
      minOverlap = atoi(argv[++arg]);

    } else if (strcmp(argv[arg], "-L") == 0) {
      AS_UTL_loadFileList(argv[++arg], fileList);

    } else if (strcmp(argv[arg], "-evalues") == 0) {
      eValues = true;

    } else if (strcmp(argv[arg], "-config") == 0) {
      configOut = argv[++arg];

    } else if (((argv[arg][0] == '-') && (argv[arg][1] == 0)) ||
               (AS_UTL_fileExists(argv[arg]))) {
      //  Assume it's an input file
      fileList.push_back(argv[arg]);

    } else {
      fprintf(stderr, "%s: unknown option '%s'.\n", argv[0], argv[arg]);
      err++;
    }

    arg++;
  }

  if (ovlName == NULL)
    err++;
  if (gkpName == NULL)
    err++;
  if (fileList.size() == 0)
    err++;
  if (fileLimit > sysconf(_SC_OPEN_MAX) - 16)
    err++;
  if (maxMemory < MEMORY_OVERHEAD)
    err++;

  if (err) {
    fprintf(stderr, "usage: %s -O asm.ovlStore -G asm.gkpStore [opts] [-L fileList | *.ovb.gz]\n", argv[0]);
    fprintf(stderr, "  -O asm.ovlStore       path to store to create\n");
    fprintf(stderr, "  -G asm.gkpStore       path to gkpStore for this assembly\n");
fprintf(stderr, "\n");
    fprintf(stderr, "  -L fileList           read input filenames from 'fileList'\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "  -F f                  use up to 'f' files for store creation\n");
    fprintf(stderr, "  -M g                  use up to 'g' gigabytes memory for sorting overlaps\n");
    fprintf(stderr, "                          default 4; g-0.25 gb is available for sorting overlaps\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "  -e e                  filter overlaps above e fraction error\n");
    fprintf(stderr, "  -l l                  filter overlaps below l bases overlap length (needs gkpStore to get read lengths!)\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "Non-building options:\n");
    fprintf(stderr, "  -evalues              input files are evalue updates from overlap error adjustment\n");
    fprintf(stderr, "  -config out.dat       don't build a store, just dump a binary partitioning file for ovStoreBucketizer\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "Sizes and Limits:\n");
    fprintf(stderr, "  ovOverlap             " F_S32 " words of " F_S32 " bits each.\n", (int32)ovOverlapNWORDS, (int32)ovOverlapWORDSZ);
    fprintf(stderr, "  ovOverlapSortSize     " F_S32 " bits\n", (int32)ovOverlapSortSize * 8);
    fprintf(stderr, "  SC_CHILD_MAX          " F_S32 " processes\n", (int32)sysconf(_SC_CHILD_MAX));
    fprintf(stderr, "  SC_OPEN_MAX           " F_S32 " files\n", (int32)sysconf(_SC_OPEN_MAX));
    fprintf(stderr, "\n");

    if (ovlName == NULL)
      fprintf(stderr, "ERROR: No overlap store (-O) supplied.\n");
    if (gkpName == NULL)
      fprintf(stderr, "ERROR: No gatekeeper store (-G) supplied.\n");
    if (fileList.size() == 0)
      fprintf(stderr, "ERROR: No input overlap files (-L or last on the command line) supplied.\n");
    if (fileLimit > sysconf(_SC_OPEN_MAX) - 16)
      fprintf(stderr, "ERROR: Too many jobs (-F); only " F_SIZE_T " supported on this architecture.\n", sysconf(_SC_OPEN_MAX) - 16);
    if (maxMemory < MEMORY_OVERHEAD)
      fprintf(stderr, "ERROR: Memory (-M) must be at least %.3f GB to account for overhead.\n", MEMORY_OVERHEAD / 1024.0 / 1024.0 / 1024.0);

    exit(1);
  }

  //  If only updating evalues, do it and quit.
if (eValues)
    addEvalues(ovlName, fileList), exit(0);

  //  Open reads, figure out a partitioning scheme.

  gkStore  *gkp         = gkStore::gkStore_open(gkpName);
  uint32    maxIID      = gkp->gkStore_getNumReads() + 1;
  uint32   *iidToBucket = computeIIDperBucket(fileLimit, minMemory, maxMemory, maxIID, fileList);
  uint32    maxFiles    = sysconf(_SC_OPEN_MAX);

  if (iidToBucket[maxIID-1] > maxFiles - 8) {
    fprintf(stderr, "ERROR:\n");
    fprintf(stderr, "ERROR: Operating system limit of " F_U32 " open files.  The current -F/-M settings\n", maxFiles);
    fprintf(stderr, "ERROR: will need to create " F_U32 " files to construct the store.\n", iidToBucket[maxIID-1]);
    fprintf(stderr, "ERROR:\n");
    exit(1);
  }

  //  But if only asked to report the configuration, do it and quit.

  if (configOut)
    reportConfiguration(configOut, maxIID, iidToBucket), gkp->gkStore_close(), exit(0);

  //  Read the gkStore to determine which fragments we care about.

  ovStoreFilter  *filter = new ovStoreFilter(gkp, maxError);

  //

  fprintf(stderr, "\n");
  fprintf(stderr, "-- BUCKETIZING --\n");
  fprintf(stderr, "\n");

  //  And load reads into the store!  We used to create the store before filtering, so it could fail
  //  quicker, but the filter should be much faster with the mmap()'d gkpStore in canu.

  ovStoreWriter  *store       = new ovStoreWriter(ovlName, gkp);
  uint32          dumpFileMax = iidToBucket[maxIID-1] + 1;
  ovFile        **dumpFile    = new ovFile * [dumpFileMax];
  uint64         *dumpLength  = new uint64   [dumpFileMax];

  memset(dumpFile,   0, sizeof(ovFile *) * dumpFileMax);
  memset(dumpLength, 0, sizeof(uint64)   * dumpFileMax);

  for (uint32 i=0; i<fileList.size(); i++) {
    ovFile     *inputFile = new ovFile(gkp, fileList[i], ovFileFull);
    ovOverlap   foverlap(gkp);
    ovOverlap   roverlap(gkp);

    while (inputFile->readOverlap(&foverlap)) {
      filter->filterOverlap(foverlap, roverlap);  //  The filter copies f into r

      //  Check that overlap IDs are valid.

#warning not checking overlap IDs for validity

      //  If all are skipped, don't bother writing the overlap.
if ((foverlap.dat.ovl.forUTG == true) || (foverlap.dat.ovl.forOBT == true) || (foverlap.dat.ovl.forDUP == true)) writeToDumpFile(gkp, &foverlap, dumpFile, dumpLength, iidToBucket, ovlName); if ((roverlap.dat.ovl.forUTG == true) || (roverlap.dat.ovl.forOBT == true) || (roverlap.dat.ovl.forDUP == true)) writeToDumpFile(gkp, &roverlap, dumpFile, dumpLength, iidToBucket, ovlName); } delete inputFile; } for (uint32 i=0; isavedDedupe() > 0) { fprintf(stderr, "-- Saved " F_U64 " dedupe overlaps\n", filter->savedDedupe()); fprintf(stderr, "-- Discarded " F_U64 " don't care " F_U64 " different library " F_U64 " obviously not duplicates\n", filter->filteredNoDedupe(), filter->filteredNotDupe(), filter->filteredDiffLib()); } if (filter->savedTrimming() > 0) { fprintf(stderr, "-- Saved " F_U64 " trimming overlaps\n", filter->savedTrimming()); fprintf(stderr, "-- Discarded " F_U64 " don't care " F_U64 " too similar " F_U64 " too short\n", filter->filteredNoTrim(), filter->filteredBadTrim(), filter->filteredShortTrim()); } if (filter->savedUnitigging() > 0) { fprintf(stderr, "-- Saved " F_U64 " unitigging overlaps\n", filter->savedUnitigging()); } if (filter->filteredErate() > 0) fprintf(stderr, "-- Discarded " F_U64 " low quality, more than %.4f fraction error\n", filter->filteredErate(), maxError); delete filter; // // Read each bucket, sort it, and dump it to the store // fprintf(stderr, "\n"); fprintf(stderr, "-- SORTING --\n"); fprintf(stderr, "\n"); uint64 dumpLengthMax = 0; for (uint32 i=0; ireadOverlap(overlapsort + numOvl)) { // Quick sanity check on IIDs. 
if ((overlapsort[numOvl].a_iid == 0) ||
          (overlapsort[numOvl].b_iid == 0) ||
          (overlapsort[numOvl].a_iid >= maxIID) ||
          (overlapsort[numOvl].b_iid >= maxIID)) {
        char ovlstr[256];

        fprintf(stderr, "Overlap has IDs out of range (maxIID " F_U32 "), possibly corrupt input data.\n", maxIID);
        fprintf(stderr, "  Aid " F_U32 " Bid " F_U32 "\n", overlapsort[numOvl].a_iid, overlapsort[numOvl].b_iid);
        exit(1);
      }

      numOvl++;
    }

    delete bof;

    assert(numOvl == dumpLength[i]);
    assert(numOvl <= dumpLengthMax);

    //  There's no real advantage to saving this file until after we write it out.  If we crash
    //  anywhere during the build, we are forced to restart from scratch.  I'll argue that removing
    //  it early helps us to not crash from running out of disk space.

    unlink(name);

    fprintf(stderr, "- Sorting\n");

#ifdef _GLIBCXX_PARALLEL
    //  If we have the parallel STL, don't use it!  Sort is not inplace!
    __gnu_sequential::sort(overlapsort, overlapsort + dumpLength[i]);
#else
    sort(overlapsort, overlapsort + dumpLength[i]);
#endif

    fprintf(stderr, "- Writing\n");

    for (uint64 x=0; x<dumpLength[i]; x++)
      store->writeOverlap(overlapsort + x);
  }

  fprintf(stderr, "\n");
  fprintf(stderr, "-- FINISHING --\n");
  fprintf(stderr, "\n");

  delete    store;
  delete [] overlapsort;

  gkp->gkStore_close();

  //  And we have a store.

  exit(0);
}
canu-1.6/src/stores/ovStoreBuild.mk000066400000000000000000000007311314437614700174010ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := ovStoreBuild SOURCES := ovStoreBuild.C SRC_INCDIRS := ..
../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/stores/ovStoreDump.C000066400000000000000000000633431314437614700170320ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_OVS/overlapStore.C * src/AS_OVS/overlapStore.c * * Modifications by: * * Brian P. Walenz from 2007-MAR-08 to 2013-AUG-01 * are Copyright 2007-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2008-DEC-16 to 2009-AUG-14 * are Copyright 2008-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Gregory Sims from 2012-FEB-01 to 2012-MAR-14 * are Copyright 2012 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-AUG-22 to 2015-JUN-25 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-21 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-MAR-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_global.H" #include "AS_UTL_decodeRange.H" #include "splitToWords.H" #include "gkStore.H" #include "ovStore.H" enum dumpOp { OP_NONE = 1, OP_DUMP = 2, OP_DUMP_PICTURE = 3 }; enum dumpFlags { NO_5p = 1, NO_3p = 2, NO_CONTAINED = 4, NO_CONTAINS = 8, NO_CONTAINED_READS = 16, NO_SUSPICIOUS_READS = 32, NO_SINGLETON_READS = 64, WITH_ERATE = 128, WITH_LENGTH = 256, ONE_SIDED = 512 }; struct readStatus { uint64 best5id : 29; uint64 best53p : 1; // Unwieldy - best edge from my 5' is to the 3' of 'best5id'. uint64 best3id : 29; uint64 best33p : 1; uint64 unused : 1; uint64 isSingleton : 1; uint64 isContained : 1; uint64 isSuspicious : 1; }; class bogartStatus { public: bogartStatus(const char *prefix, uint32 nReads); ~bogartStatus() { delete [] _status; }; uint32 getBest5id(uint32 id) { return((_status) ? (_status[id].best5id) : 0); }; bool getBest53p(uint32 id) { return((_status) ? (_status[id].best53p) : 0); }; uint32 getBest3id(uint32 id) { return((_status) ? (_status[id].best3id) : false); }; bool getBest33p(uint32 id) { return((_status) ? (_status[id].best33p) : false); }; bool getSingleton(uint32 id) { return((_status) ? (_status[id].isSingleton) : false); }; bool getContained(uint32 id) { return((_status) ? (_status[id].isContained) : false); }; bool getSuspicious(uint32 id) { return((_status) ? 
(_status[id].isSuspicious) : false); }; private: readStatus *_status; }; bogartStatus::bogartStatus(const char *prefix, uint32 nReads) { char N[FILENAME_MAX]; splitToWords W; _status = NULL; if (prefix == NULL) return; errno = 0; snprintf(N, FILENAME_MAX, "%s.edges", prefix); FILE *E = fopen(N, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", N, strerror(errno)), exit(1); snprintf(N, FILENAME_MAX, "%s.edges.suspicious", prefix); FILE *S = fopen(N, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", N, strerror(errno)), exit(1); snprintf(N, FILENAME_MAX, "%s.singletons", prefix); FILE *G = fopen(N, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", N, strerror(errno)), exit(1); _status = new readStatus [nReads+1]; memset(_status, 0, sizeof(readStatus) * (nReads+1)); fgets(N, FILENAME_MAX, E); while (!feof(E)) { W.split(N); uint32 id = W(0); _status[id].best5id = W(2); _status[id].best53p = (W[3][0] == '3'); _status[id].best3id = W(4); _status[id].best33p = (W[5][0] == '3'); _status[id].isSingleton = false; _status[id].isContained = ((W.numWords() > 10) && (W[10][0] == 'c')); _status[id].isSuspicious = false; fgets(N, FILENAME_MAX, E); } fclose(E); fgets(N, FILENAME_MAX, S); while (!feof(S)) { W.split(N); uint32 id = W(0); _status[id].best5id = W(2); _status[id].best53p = (W[3][0] == '3'); _status[id].best3id = W(4); _status[id].best33p = (W[5][0] == '3'); _status[id].isSingleton = false; _status[id].isContained = ((W.numWords() > 10) && (W[10][0] == 'c')); _status[id].isSuspicious = true; fgets(N, FILENAME_MAX, S); } fclose(S); fgets(N, FILENAME_MAX, G); while (!feof(G)) { W.split(N); uint32 id = W(0); _status[id].best5id = 0; _status[id].best53p = 0; _status[id].best3id = 0; _status[id].best33p = 0; _status[id].isSingleton = true; _status[id].isContained = false; _status[id].isSuspicious = false; fgets(N, FILENAME_MAX, G); } fclose(G); } // // Also accept a single ovStoreFile (output from 
overlapper) and dump. // // // Then need some way of loading ascii overlaps into a store, or converting ascii overlaps to // binary and use the normal store build. The normal store build also needs to take sorted // overlaps and just rewrite as a store. // void dumpStore(ovStore *ovlStore, gkStore *gkpStore, bool asBinary, bool asCounts, bool asErateLen, double dumpERate, uint32 dumpLength, uint32 dumpType, uint32 bgnID, uint32 endID, uint32 qryID, ovOverlapDisplayType type, bool beVerbose, char *bestPrefix) { ovOverlap overlap(gkpStore); uint64 evalue = AS_OVS_encodeEvalue(dumpERate); char ovlString[1024]; uint32 ovlTooHighError = 0; uint32 ovlNot5p = 0; uint32 ovlNot3p = 0; uint32 ovlNotContainer = 0; uint32 ovlNotContainee = 0; uint32 ovlNotUnique = 0; uint32 ovlDumped = 0; uint32 obtTooHighError = 0; uint32 obtDumped = 0; uint32 merDumped = 0; uint32 *counts = NULL; ovStoreHistogram *hist = NULL; // Set the range of the reads to dump early so that we can reset it later. ovlStore->setRange(bgnID, endID); // If we're dumping counts, and there are modifiers, we need to scan all overlaps if ((asCounts) && (dumpType != 0)) { counts = new uint32 [endID - bgnID + 1]; for (uint32 ii=bgnID; ii<=endID; ii++) counts[ii - bgnID] = 0; } // If we're dumping counts, and no modifiers, we can just ask the store for the counts // and set the range to null. if ((asCounts) && (dumpType == 0)) { counts = ovlStore->numOverlapsPerFrag(bgnID, endID); ovlStore->setRange(1, 0); } // If we're dumping the erate-vs-length histogram, and no modifiers, grab it from the store and // set the range to null. Otherwise, allocate a new one. if ((asErateLen) && (dumpType == 0)) { hist = ovlStore->getHistogram(); ovlStore->setRange(1, 0); } if ((asErateLen) && (dumpType > 0)) { hist = new ovStoreHistogram(gkpStore, ovFileNormalWrite); } // Length filtering is expensive to compute, need to load both reads to get their length. 
// //if ((dumpType & WITH_LENGTH) && (dumpLength < overlapLength(overlap))) // continue; while (ovlStore->readOverlap(&overlap) == TRUE) { if ((qryID != 0) && (qryID != overlap.b_iid)) continue; if ((dumpType & WITH_ERATE) && (overlap.evalue() > evalue)) { ovlTooHighError++; continue; } int32 ahang = overlap.a_hang(); int32 bhang = overlap.b_hang(); if ((dumpType & NO_5p) && (ahang < 0) && (bhang < 0)) { ovlNot5p++; continue; } if ((dumpType & NO_3p) && (ahang > 0) && (bhang > 0)) { ovlNot3p++; continue; } if ((dumpType & NO_CONTAINS) && (ahang >= 0) && (bhang <= 0)) { ovlNotContainer++; continue; } if ((dumpType & NO_CONTAINED) && (ahang <= 0) && (bhang >= 0)) { ovlNotContainee++; continue; } if ((dumpType & ONE_SIDED) && (overlap.a_iid >= overlap.b_iid)) { ovlNotUnique++; continue; } ovlDumped++; // The toString() method is quite slow, all from snprintf(). // Without both the puts() and AtoString(), a dump ran in 3 seconds. // With both, 138 seconds. // Without the puts(), 127 seconds. 
if (asCounts) counts[overlap.a_iid - bgnID]++; else if (asErateLen) hist->addOverlap(&overlap); else if (asBinary) AS_UTL_safeWrite(stdout, &overlap, "dumpStore", sizeof(ovOverlap), 1); else fputs(overlap.toString(ovlString, type, true), stdout); } if (asCounts) { for (uint32 ii=bgnID; ii<=endID; ii++) fprintf(stdout, "%u\t%u\n", ii, counts[ii - bgnID]); } if (asErateLen) { hist->dumpEvalueLength(stdout); } delete [] counts; delete hist; if (beVerbose) { fprintf(stderr, "ovlTooHighError %u\n", ovlTooHighError); fprintf(stderr, "ovlNot5p %u\n", ovlNot5p); fprintf(stderr, "ovlNot3p %u\n", ovlNot3p); fprintf(stderr, "ovlNotContainer %u\n", ovlNotContainer); fprintf(stderr, "ovlNotContainee %u\n", ovlNotContainee); fprintf(stderr, "ovlDumped %u\n", ovlDumped); fprintf(stderr, "obtTooHighError %u\n", obtTooHighError); fprintf(stderr, "obtDumped %u\n", obtDumped); fprintf(stderr, "merDumped %u\n", merDumped); } } int sortOBT(const void *a, const void *b) { ovOverlap const *A = (ovOverlap const *)a; ovOverlap const *B = (ovOverlap const *)b; if (A->a_bgn() < B->a_bgn()) return(-1); if (A->a_bgn() > B->a_bgn()) return(1); // For overlaps off the 5' end, put the thinnest ones first. Sadly, we don't know // gkpStore here, and can't actually get the end coordinate. 
//
  //if (A->b_bgn() > B->b_bgn())  return(-1);
  //if (A->b_bgn() < B->b_bgn())  return(1);

  if (A->a_end() < B->a_end())  return(-1);
  if (A->a_end() > B->a_end())  return(1);

  return(0);
}



void
dumpPicture(ovOverlap     *overlaps,
            uint64         novl,
            gkStore       *gkpStore,
            uint32         qryID,
            bogartStatus  *bogart) {
  char     ovl[256] = {0};
  uint32   MHS      = 9;  //  Max Hang Size, amount of padding for "+### "

  uint32   Aid      = qryID;
  gkRead  *A        = gkpStore->gkStore_getRead(Aid);
  uint32   frgLenA  = A->gkRead_sequenceLength();

  for (int32 i=0; i<256; i++)
    ovl[i] = ' ';

  for (int32 i=0; i<100; i++)
    ovl[i + MHS] = '-';

  ovl[ 99 + MHS] = '>';
  ovl[100 + MHS] = 0;

  fprintf(stdout, "A %7d:%-7d A %9d %7d:%-7d %7d %s %s%s\n",
          0, frgLenA,
          Aid,
          0, frgLenA,
          frgLenA,
          ovl,
          bogart->getContained(Aid)  ? "contained"  : "",
          bogart->getSuspicious(Aid) ? "suspicious" : "");

  qsort(overlaps, novl, sizeof(ovOverlap), sortOBT);

  //  Build ascii representations for each overlapping read.

  for (uint32 o=0; o<novl; o++) {
    uint32   Bid      = overlaps[o].b_iid;
    gkRead  *B        = gkpStore->gkStore_getRead(Bid);
    uint32   frgLenB  = B->gkRead_sequenceLength();

    //  Find bgn/end points on each read.  If the overlap is reverse complement,
    //  the B coords are flipped so that bgn > end.

    uint32   ovlBgnA  = overlaps[o].a_bgn();
    uint32   ovlEndA  = overlaps[o].a_end();
    uint32   ovlBgnB  = overlaps[o].b_bgn();
    uint32   ovlEndB  = overlaps[o].b_end();

    assert(ovlBgnA < ovlEndA);  //  The A coordinates are always forward

    if (overlaps[o].flipped() == false)
      assert(ovlBgnB < ovlEndB);  //  Forward overlaps are forward
    else
      assert(ovlEndB < ovlBgnB);  //  Flipped overlaps are reversed

    //  For the A read, find the points in our string representation where the overlap ends.

    uint32 ovlStrBgn = (int32)floor(ovlBgnA * 100.0 / frgLenA + MHS);
    uint32 ovlStrEnd = (int32)ceil (ovlEndA * 100.0 / frgLenA + MHS);

    //  Fill the string representation with spaces, then fill the string with dashes where the read
    //  is, add an arrow, and terminate the string.

    for (int32 i=0; i<256; i++)
      ovl[i] = ' ';

    //  Decide how to draw this overlap.
    //    For best edges, use '='.
// For contained, use '-', alternating with spaces. // For suspicious, use '*', alternating with dashes. // For edges, use '-', solid. bool isBest = (((bogart->getBest5id(Aid) == Bid) && (overlaps[o].overlapAEndIs5prime() == true)) || ((bogart->getBest3id(Aid) == Bid) && (overlaps[o].overlapAEndIs3prime() == true))); bool isCont = (bogart->getContained(Bid)); bool isSusp = (bogart->getSuspicious(Bid)); // This bit of confusion makes sure that the alternating overlap lines (especially '- - - -') // end with a dash. bool oddEven = (overlaps[o].flipped() == false) ? (false) : (((ovlStrEnd - ovlStrBgn) % 2) == false); if (isCont == true) { for (uint32 i=ovlStrBgn; i 0) { char str[256]; int32 len; snprintf(str, 256, "+%d", ovlBgnHang); len = strlen(str); for (int32 i=0; i 0) { snprintf(ovl + ovlStrEnd, 256 - ovlStrEnd, " +%d", ovlEndHang); } // Set flags for best edge and singleton/contained/suspicious. Left in for when I get annoyed with the different lines. char olapClass[4] = { 0, ' ', 0, 0 }; #if 0 if ((bogart->getBest5id(Aid) == Bid) && (overlaps[o].overlapAEndIs5prime() == true)) { olapClass[0] = ' '; olapClass[2] = 'B'; } if ((bogart->getBest3id(Aid) == Bid) && (overlaps[o].overlapAEndIs3prime() == true)) { olapClass[0] = ' '; olapClass[2] = 'B'; } if (olapClass[2] == 'B') for (uint32 ii=0; ovl[ii]; ii++) if (ovl[ii] == '-') ovl[ii] = '='; if (bogart->getSingleton(Bid)) { olapClass[0] = ' '; olapClass[1] = 'S'; } if (bogart->getContained(Bid)) { olapClass[0] = ' '; olapClass[1] = 'C'; } if (bogart->getSuspicious(Bid)) { olapClass[0] = ' '; olapClass[1] = '!'; } #endif // Report! 
fprintf(stdout, "A %7d:%-7d B %9d %7d:%-7d %7d %5.2f%% %s%s\n", ovlBgnA, ovlEndA, Bid, min(ovlBgnB, ovlEndB), max(ovlBgnB, ovlEndB), frgLenB, overlaps[o].erate() * 100.0, ovl, olapClass); } } void dumpPicture(ovStore *ovlStore, gkStore *gkpStore, double dumpERate, uint32 dumpLength, uint32 dumpType, uint32 qryID, char *bestPrefix) { //fprintf(stderr, "DUMPING PICTURE for ID " F_U32 " in store %s (gkp %s)\n", // qryID, ovlName, gkpName); uint32 Aid = qryID; gkRead *A = gkpStore->gkStore_getRead(Aid); uint32 frgLenA = A->gkRead_sequenceLength(); ovlStore->setRange(Aid, Aid); uint64 novl = 0; ovOverlap overlap(gkpStore); ovOverlap *overlaps = ovOverlap::allocateOverlaps(gkpStore, ovlStore->numOverlapsInRange()); uint64 evalue = AS_OVS_encodeEvalue(dumpERate); // Load bogart status, if supplied. bogartStatus *bogart = new bogartStatus(bestPrefix, gkpStore->gkStore_getNumReads()); // Load all the overlaps so we can sort by the A begin position. while (ovlStore->readOverlap(&overlap) == TRUE) { // Filter out garbage overlaps. if (overlap.evalue() > evalue) continue; // Filter out 5' overlaps. if ((dumpType & NO_5p) && (overlap.a_hang() < 0) && (overlap.b_hang() < 0)) continue; // Filter out 3' overlaps. if ((dumpType & NO_3p) && (overlap.a_hang() > 0) && (overlap.b_hang() > 0)) continue; // Filter out contained overlaps (B-read is contained) if ((dumpType & NO_CONTAINS) && (overlap.a_hang() >= 0) && (overlap.b_hang() <= 0)) continue; // Filter out container overlaps (A-read is contained) if ((dumpType & NO_CONTAINED) && (overlap.a_hang() <= 0) && (overlap.b_hang() >= 0)) continue; // Filter out short overlaps. if ((overlap.b_end() - overlap.b_bgn() < dumpLength) || (overlap.a_end() - overlap.a_bgn() < dumpLength)) continue; // If bogart data is supplied, filter out contained or suspicious overlaps. 
if ((dumpType & NO_CONTAINED_READS) && (bogart->getContained(overlap.b_iid))) continue; if ((dumpType & NO_SUSPICIOUS_READS) && (bogart->getSuspicious(overlap.b_iid))) continue; if ((dumpType & NO_SINGLETON_READS) && (bogart->getSingleton(overlap.b_iid))) continue; overlaps[novl++] = overlap; } if (novl == 0) fprintf(stderr, "no overlaps to show.\n"); else dumpPicture(overlaps, novl, gkpStore, Aid, bogart); delete [] overlaps; } int main(int argc, char **argv) { uint32 operation = OP_NONE; char *gkpName = NULL; char *ovlName = NULL; bool asBinary = false; bool asCounts = false; bool asErateLen = false; double dumpERate = 1.0; uint32 dumpLength = 0; uint32 dumpType = 0; char *erateFile = NULL; uint32 bgnID = 1; uint32 endID = UINT32_MAX; uint32 qryID = 0; bool beVerbose = false; char *bestPrefix = NULL; ovOverlapDisplayType type = ovOverlapAsCoords; argc = AS_configure(argc, argv); int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) gkpName = argv[++arg]; else if (strcmp(argv[arg], "-O") == 0) ovlName = argv[++arg]; // Standard bulk dump of overlaps else if (strcmp(argv[arg], "-d") == 0) { operation = OP_DUMP; if ((arg+1 < argc) && (argv[arg+1][0] != '-')) AS_UTL_decodeRange(argv[++arg], bgnID, endID); } // Dump as a picture else if (strcmp(argv[arg], "-p") == 0) { operation = OP_DUMP_PICTURE; bgnID = atoi(argv[++arg]); endID = bgnID; qryID = bgnID; } // Query if the overlap for the next two integers exists else if (strcmp(argv[arg], "-q") == 0) { operation = OP_DUMP; bgnID = atoi(argv[++arg]); endID = bgnID; qryID = atoi(argv[++arg]); } // Format of the dump else if (strcmp(argv[arg], "-coords") == 0) type = ovOverlapAsCoords; else if (strcmp(argv[arg], "-hangs") == 0) type = ovOverlapAsHangs; else if (strcmp(argv[arg], "-raw") == 0) type = ovOverlapAsRaw; else if (strcmp(argv[arg], "-paf") == 0) type = ovOverlapAsPaf; else if (strcmp(argv[arg], "-binary") == 0) asBinary = true; else if (strcmp(argv[arg], "-counts") == 0) asCounts = true; 
else if (strcmp(argv[arg], "-eratelen") == 0) asErateLen = true; // standard bulk dump options else if (strcmp(argv[arg], "-E") == 0) { dumpERate = atof(argv[++arg]); dumpType |= WITH_ERATE; } else if (strcmp(argv[arg], "-L") == 0) { dumpLength = atoi(argv[++arg]); dumpType |= WITH_LENGTH; } else if (strcmp(argv[arg], "-d5") == 0) dumpType |= NO_5p; else if (strcmp(argv[arg], "-d3") == 0) dumpType |= NO_3p; else if (strcmp(argv[arg], "-dC") == 0) dumpType |= NO_CONTAINS; else if (strcmp(argv[arg], "-dc") == 0) dumpType |= NO_CONTAINED; else if (strcmp(argv[arg], "-v") == 0) beVerbose = true; else if (strcmp(argv[arg], "-unique") == 0) dumpType |= ONE_SIDED; else if (strcmp(argv[arg], "-best") == 0) bestPrefix = argv[++arg]; else if (strcmp(argv[arg], "-noc") == 0) dumpType |= NO_CONTAINED_READS; else if (strcmp(argv[arg], "-nos") == 0) dumpType |= NO_SUSPICIOUS_READS; else if (strcmp(argv[arg], "-nosi") == 0) dumpType |= NO_SINGLETON_READS; else { fprintf(stderr, "%s: unknown option '%s'.\n", argv[0], argv[arg]); err++; } arg++; } if (operation == OP_NONE) err++; if (gkpName == NULL) err++; if (ovlName == NULL) err++; if (err) { fprintf(stderr, "usage: %s -G gkpStore -O ovlStore ...\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, "There are three modes of operation:\n"); fprintf(stderr, " -d [a[-b]] dump overlaps for reads a to b, inclusive\n"); fprintf(stderr, " -q a b report the a,b overlap, if it exists.\n"); fprintf(stderr, " -p a dump a picture of overlaps to fragment 'a'.\n"); fprintf(stderr, "\n"); fprintf(stderr, " FORMAT (for -d)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -coords dump overlap showing coordinates in the reads (default)\n"); fprintf(stderr, " -hangs dump overlap showing dovetail hangs unaligned\n"); fprintf(stderr, " -raw dump overlap showing its raw native format (four hangs)\n"); fprintf(stderr, " -paf dump overlaps in miniasm/minimap format\n"); fprintf(stderr, " -binary dump overlap as raw binary data\n"); fprintf(stderr, " 
-counts dump the number of overlaps per read\n");
    fprintf(stderr, " -eratelen dump a heatmap of error-rate vs overlap-length\n");
    fprintf(stderr, "\n");
    fprintf(stderr, " MODIFIERS (for -d and -p)\n");
    fprintf(stderr, "\n");
    fprintf(stderr, " -E erate Dump only overlaps <= erate fraction error.\n");
    fprintf(stderr, " -L length Dump only overlaps that are larger than L bases (only for -p picture mode).\n");
    fprintf(stderr, " -d5 Dump only overlaps off the 5' end of the A frag.\n");
    fprintf(stderr, " -d3 Dump only overlaps off the 3' end of the A frag.\n");
    fprintf(stderr, " -dC Dump only overlaps that are contained in the A frag (B contained in A).\n");
    fprintf(stderr, " -dc Dump only overlaps that are containing the A frag (A contained in B).\n");
    fprintf(stderr, " -v Report statistics (to stderr) on some dumps (-d).\n");
    fprintf(stderr, " -unique Report only overlaps where A id is < B id, do not report both A to B and B to A overlap\n");
    fprintf(stderr, "\n");
    fprintf(stderr, " -best prefix Annotate picture with status from bogart outputs prefix.edges, prefix.singletons, prefix.edges.suspicious\n");
    fprintf(stderr, " -noc With -best data, don't show overlaps to contained reads.\n");
    fprintf(stderr, " -nos With -best data, don't show overlaps to suspicious reads.\n");
    fprintf(stderr, " -nosi With -best data, don't show overlaps to singleton reads.\n");
    fprintf(stderr, "\n");

    if (operation == OP_NONE)
      fprintf(stderr, "ERROR: no operation (-d, -q or -p) supplied.\n");
    if (gkpName == NULL)
      fprintf(stderr, "ERROR: no input gkpStore (-G) supplied.\n");
    if (ovlName == NULL)
      fprintf(stderr, "ERROR: no input ovlStore (-O) supplied.\n");

    exit(1);
  }

  gkStore  *gkpStore = gkStore::gkStore_open(gkpName);
  ovStore  *ovlStore = new ovStore(ovlName, gkpStore);

  if (endID > gkpStore->gkStore_getNumReads())
    endID = gkpStore->gkStore_getNumReads();

  if (endID < bgnID)
    fprintf(stderr, "ERROR: invalid bgn/end range bgn=%u end=%u; only %u reads in the store\n",
            bgnID, endID, gkpStore->gkStore_getNumReads()), exit(1);

  switch (operation) {
    case OP_DUMP:
      dumpStore(ovlStore,
gkpStore, asBinary, asCounts, asErateLen, dumpERate, dumpLength, dumpType, bgnID, endID, qryID, type, beVerbose, bestPrefix); break; case OP_DUMP_PICTURE: for (qryID=bgnID; qryID <= endID; qryID++) dumpPicture(ovlStore, gkpStore, dumpERate, dumpLength, dumpType, qryID, bestPrefix); break; default: break; } delete ovlStore; gkpStore->gkStore_close(); exit(0); } canu-1.6/src/stores/ovStoreDump.mk000066400000000000000000000007271314437614700172540ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := ovStoreDump SOURCES := ovStoreDump.C SRC_INCDIRS := .. ../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/stores/ovStoreFile.C000066400000000000000000000243401314437614700167760ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_OVS/AS_OVS_overlapFile.C * src/AS_OVS/AS_OVS_overlapFile.c * * Modifications by: * * Brian P. Walenz from 2007-FEB-28 to 2013-SEP-22 * are Copyright 2007-2009,2011-2013 J. 
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren on 2011-MAR-31 * are Copyright 2011 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Gregory Sims on 2012-FEB-01 * are Copyright 2012 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-09 to 2015-JUN-16 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "ovStore.H" #ifdef SNAPPY #include "snappy.h" #endif // The histogram associated with this is written to files with any suffices stripped off. ovFile::ovFile(gkStore *gkp, const char *name, ovFileType type, uint32 bufferSize) { _gkp = gkp; _histogram = new ovStoreHistogram(_gkp, type); // We write two sizes of overlaps. The 'normal' format doesn't contain the a_iid, while the // 'full' format does. The buffer size must hold an integer number of overlaps, otherwise the // reader will read partial overlaps and fail. Choose a buffer size that can handle both. uint32 lcm = ((sizeof(uint32) * 1 + sizeof(ovOverlapDAT)) * (sizeof(uint32) * 2 + sizeof(ovOverlapDAT))); if (bufferSize < 16 * 1024) bufferSize = 16 * 1024; _bufferLen = 0; _bufferPos = (bufferSize / (lcm * sizeof(uint32))) * lcm; // Forces reload on next read _bufferMax = (bufferSize / (lcm * sizeof(uint32))) * lcm; _buffer = new uint32 [_bufferMax]; #ifdef SNAPPY _snappyLen = 0; _snappyBuffer = NULL; #endif assert(_bufferMax % ((sizeof(uint32) * 1) + (sizeof(ovOverlapDAT))) == 0); assert(_bufferMax % ((sizeof(uint32) * 2) + (sizeof(ovOverlapDAT))) == 0); // Create the input/output buffers and files. 
_isOutput = false; _isSeekable = false; _isNormal = (type == ovFileNormal) || (type == ovFileNormalWrite); #ifdef SNAPPY _useSnappy = false; #endif _reader = NULL; _writer = NULL; // Open store files for reading. These generally cannot be compressed, but we pretend they can be. if (type == ovFileNormal) { _reader = new compressedFileReader(name); _file = _reader->file(); _isSeekable = (_reader->isCompressed() == false); } // Open dump files for reading. These certainly can be compressed. else if (type == ovFileFull) { _reader = new compressedFileReader(name); _file = _reader->file(); _isSeekable = (_reader->isCompressed() == false); #ifdef SNAPPY _useSnappy = true; #endif } // Open a store file for writing? else if (type == ovFileNormalWrite) { _writer = new compressedFileWriter(name); _file = _writer->file(); _isOutput = true; } // Else, open a dump file for writing. This catches two cases, one with counts and one without counts. else { _writer = new compressedFileWriter(name); _file = _writer->file(); _isOutput = true; #ifdef SNAPPY _useSnappy = true; #endif } AS_UTL_findBaseFileName(_prefix, name); } ovFile::~ovFile() { writeBuffer(true); delete _reader; delete _writer; delete [] _buffer; #ifdef SNAPPY delete [] _snappyBuffer; #endif _histogram->saveData(_prefix); delete _histogram; } void ovFile::writeBuffer(bool force) { if (_isOutput == false) // Needed because it's called in the destructor. return; if ((force == false) && (_bufferLen < _bufferMax)) return; if (_bufferLen == 0) return; // If compressing, compress the block then write compressed length and the block. 
#ifdef SNAPPY
  if (_useSnappy == true) {
    size_t bl = snappy::MaxCompressedLength(_bufferLen * sizeof(uint32));

    if (_snappyLen < bl) {
      delete [] _snappyBuffer;
      _snappyLen    = bl;
      _snappyBuffer = new char [_snappyLen];
    }

    snappy::RawCompress((const char *)_buffer, _bufferLen * sizeof(uint32), _snappyBuffer, &bl);

    AS_UTL_safeWrite(_file, &bl,           "ovFile::writeBuffer::bl", sizeof(size_t), 1);
    AS_UTL_safeWrite(_file, _snappyBuffer, "ovFile::writeBuffer::sb", sizeof(char),   bl);
  }

  // Otherwise, just dump the block
  else
#endif
    AS_UTL_safeWrite(_file, _buffer, "ovFile::writeBuffer", sizeof(uint32), _bufferLen);

  // Buffer written.  Clear it.
  _bufferLen = 0;
}

void
ovFile::writeOverlap(ovOverlap *overlap) {
  assert(_isOutput == true);

  writeBuffer();

  _histogram->addOverlap(overlap);

  if (_isNormal == false)
    _buffer[_bufferLen++] = overlap->a_iid;

  _buffer[_bufferLen++] = overlap->b_iid;

#if (ovOverlapWORDSZ == 32)
  for (uint32 ii=0; ii<ovOverlapNWORDS; ii++)
    _buffer[_bufferLen++] = overlap->dat.dat[ii];
#endif

#if (ovOverlapWORDSZ == 64)
  for (uint32 ii=0; ii<ovOverlapNWORDS; ii++) {
    _buffer[_bufferLen++] = (overlap->dat.dat[ii] >> 32) & 0xffffffff;
    _buffer[_bufferLen++] = (overlap->dat.dat[ii])       & 0xffffffff;
  }
#endif

  assert(_bufferLen <= _bufferMax);
}

void
ovFile::writeOverlaps(ovOverlap *overlaps, uint64 overlapsLen) {
  uint64 nWritten = 0;

  assert(_isOutput == true);

  // Add all overlaps to the buffer.

  while (nWritten < overlapsLen) {
    writeBuffer();

    _histogram->addOverlap(overlaps + nWritten);

    if (_isNormal == false)
      _buffer[_bufferLen++] = overlaps[nWritten].a_iid;

    _buffer[_bufferLen++] = overlaps[nWritten].b_iid;

#if (ovOverlapWORDSZ == 32)
    for (uint32 ii=0; ii<ovOverlapNWORDS; ii++)
      _buffer[_bufferLen++] = overlaps[nWritten].dat.dat[ii];
#endif

#if (ovOverlapWORDSZ == 64)
    for (uint32 ii=0; ii<ovOverlapNWORDS; ii++) {
      _buffer[_bufferLen++] = (overlaps[nWritten].dat.dat[ii] >> 32) & 0xffffffff;
      _buffer[_bufferLen++] = (overlaps[nWritten].dat.dat[ii])       & 0xffffffff;
    }
#endif

    nWritten++;
  }

  assert(_bufferLen <= _bufferMax);
}

void
ovFile::readBuffer(void) {
  if (_bufferPos < _bufferLen)
    return;

  // Need to load a new buffer.  Everyone resets bufferPos to the start.
  _bufferPos = 0;

  // If compressed, we need to decode the block.
#ifdef SNAPPY if (_useSnappy == true) { size_t cl = 0; size_t clc = AS_UTL_safeRead(_file, &cl, "ovFile::readBuffer::cl", sizeof(size_t), 1); if (_snappyLen < cl) { delete [] _snappyBuffer; _snappyLen = cl; _snappyBuffer = new char [cl]; } size_t sbc = AS_UTL_safeRead(_file, _snappyBuffer, "ovFile::readBuffer::sb", sizeof(char), cl); if (sbc != cl) fprintf(stderr, "ERROR: short read on file '%s': read " F_SIZE_T " bytes, expected " F_SIZE_T ".\n", _prefix, sbc, cl), exit(1); size_t ol = 0; snappy::GetUncompressedLength(_snappyBuffer, cl, &ol); snappy::RawUncompress(_snappyBuffer, cl, (char *)_buffer); _bufferLen = ol / sizeof(uint32); } // But if loading from 'normal' files, just load. Easy peasy. else #endif _bufferLen = AS_UTL_safeRead(_file, _buffer, "ovFile::readBuffer", sizeof(uint32), _bufferMax); } bool ovFile::readOverlap(ovOverlap *overlap) { assert(_isOutput == false); readBuffer(); if (_bufferLen == 0) return(false); assert(_bufferPos < _bufferLen); if (_isNormal == FALSE) overlap->a_iid = _buffer[_bufferPos++]; overlap->b_iid = _buffer[_bufferPos++]; #if (ovOverlapWORDSZ == 32) for (uint32 ii=0; iidat.dat[ii] = _buffer[_bufferPos++]; #endif #if (ovOverlapWORDSZ == 64) for (uint32 ii=0; iidat.dat[ii] = _buffer[_bufferPos++]; overlap->dat.dat[ii] <<= 32; overlap->dat.dat[ii] |= _buffer[_bufferPos++]; } #endif assert(_bufferPos <= _bufferLen); return(true); } uint64 ovFile::readOverlaps(ovOverlap *overlaps, uint64 overlapsLen) { uint64 nLoaded = 0; assert(_isOutput == false); while (nLoaded < overlapsLen) { readBuffer(); if (_bufferLen == 0) return(nLoaded); assert(_bufferPos < _bufferLen); if (_isNormal == FALSE) overlaps[nLoaded].a_iid = _buffer[_bufferPos++]; overlaps[nLoaded].b_iid = _buffer[_bufferPos++]; #if (ovOverlapWORDSZ == 32) for (uint32 ii=0; iiadd(_histogram); delete _histogram; _histogram = new ovStoreHistogram; } canu-1.6/src/stores/ovStoreFile.H000066400000000000000000000075441314437614700170120ustar00rootroot00000000000000 
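The ovFile constructor above sizes its I/O buffer so it always holds a whole number of overlap records in both on-disk formats: 'normal' records (one ID word plus the overlap data) and 'full' records (two ID words plus the overlap data). It uses the product of the two record sizes as a common multiple and rounds the requested buffer size down to a multiple of it, so a reader can never load a partial record. A minimal standalone sketch of that sizing rule follows; the function name and the toy record sizes in the test are illustrative, not canu's actual layout.

```cpp
#include <cstdint>

// Round a requested buffer size (in 32-bit words) down to a capacity
// that is a whole multiple of BOTH record sizes, so neither format
// ever straddles a buffer boundary.  Mirrors the idea in the ovFile
// constructor; this is a sketch, not the canu implementation.
uint32_t wholeRecordCapacity(uint32_t requestedWords,
                             uint32_t normalWords,   // words per 'normal' record
                             uint32_t fullWords) {   // words per 'full' record
  uint32_t cm = normalWords * fullWords;   // a common multiple of both sizes

  if (requestedWords < cm)                 // never smaller than one chunk
    requestedWords = cm;

  return (requestedWords / cm) * cm;       // round down to a multiple
}
```

Using the product instead of the true least common multiple wastes nothing important here; it just makes the chunk a bit larger while keeping the divisibility guarantee trivially true.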
/****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/stores/ovStore.H * * Modifications by: * * Brian P. Walenz beginning on 2016-OCT-24 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef AS_OVSTOREFILE_H #define AS_OVSTOREFILE_H #include "AS_global.H" #include "gkStore.H" #include "ovOverlap.H" class ovStoreHistogram; // The default, no flags, is to open for normal overlaps, read only. Normal overlaps mean they // have only the B id, i.e., they are in a fully built store. // // Output of overlapper (input to store building) should be ovFileFullWrite. The specialized // ovFileFullWriteNoCounts is used internally by store creation. 
// enum ovFileType { ovFileNormal = 0, // Reading of b_id overlaps (aka store files) ovFileNormalWrite = 1, // Writing of b_id overlaps ovFileFull = 2, // Reading of a_id+b_id overlaps (aka dump files) ovFileFullWrite = 3, // Writing of a_id+b_id overlaps ovFileFullWriteNoCounts = 4 // Writing of a_id+b_id overlaps, omitting the counts of olaps per read }; class ovFile { public: ovFile(gkStore *gkpName, const char *name, ovFileType type = ovFileNormal, uint32 bufferSize = 1 * 1024 * 1024); ~ovFile(); void writeBuffer(bool force=false); void writeOverlap(ovOverlap *overlap); void writeOverlaps(ovOverlap *overlaps, uint64 overlapLen); void readBuffer(void); bool readOverlap(ovOverlap *overlap); uint64 readOverlaps(ovOverlap *overlaps, uint64 overlapMax); void seekOverlap(off_t overlap); // The size of an overlap record is 1 or 2 IDs + the size of a word times the number of words. uint64 recordSize(void) { return(sizeof(uint32) * ((_isNormal) ? 1 : 2) + sizeof(ovOverlapWORD) * ovOverlapNWORDS); }; // For use in conversion, force snappy compression. By default, it is ENABLED, and we cannot // read older ovb files. 
#ifdef SNAPPY void enableSnappy(bool enabled) { _useSnappy = enabled; }; #endif // Move the stats in our histogram to the one supplied, and remove our data void transferHistogram(ovStoreHistogram *copy); private: gkStore *_gkp; ovStoreHistogram *_histogram; uint32 _bufferLen; // length of valid data in the buffer uint32 _bufferPos; // position the read is at in the buffer uint32 _bufferMax; // allocated size of the buffer uint32 *_buffer; #ifdef SNAPPY size_t _snappyLen; char *_snappyBuffer; #endif bool _isOutput; // if true, we can writeOverlap() bool _isSeekable; // if true, we can seekOverlap() bool _isNormal; // if true, 3 words per overlap, else 4 #ifdef SNAPPY bool _useSnappy; // if true, compress with snappy before writing #endif compressedFileReader *_reader; compressedFileWriter *_writer; char _prefix[FILENAME_MAX]; FILE *_file; }; #endif // AS_OVSTOREFILE_H canu-1.6/src/stores/ovStoreFilter.C000066400000000000000000000161401314437614700173430ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/stores/ovStore.C * * Modifications by: * * Brian P. Walenz beginning on 2016-OCT-28 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "ovStore.H" #define OBT_FAR5PRIME (29) #define OBT_MIN_LENGTH (75) ovStoreFilter::ovStoreFilter(gkStore *gkp_, double maxErate) { gkp = gkp_; maxID = gkp->gkStore_getNumReads() + 1; maxEvalue = AS_OVS_encodeEvalue(maxErate); resetCounters(); skipReadOBT = new char [maxID]; skipReadDUP = new char [maxID]; uint32 numSkipOBT = 0; uint32 numSkipDUP = 0; for (uint64 iid=0; iidgkStore_getRead(iid)->gkRead_libraryID(); gkLibrary *L = gkp->gkStore_getLibrary(Lid); skipReadOBT[iid] = false; skipReadDUP[iid] = false; if ((L->gkLibrary_removeDuplicateReads() == false) && (L->gkLibrary_finalTrim() == GK_FINALTRIM_NONE) && (L->gkLibrary_removeSpurReads() == false) && (L->gkLibrary_removeChimericReads() == false) && (L->gkLibrary_checkForSubReads() == false)) { numSkipOBT++; skipReadOBT[iid] = true; } if (L->gkLibrary_removeDuplicateReads() == false) { numSkipDUP++; skipReadDUP[iid] = true; } } if (numSkipOBT > 0) fprintf(stderr, "- Marked " F_U32 " reads to skip trimming.\n", numSkipOBT); if (numSkipDUP > 0) fprintf(stderr, "- Marked " F_U32 " reads to skip deduplication.\n", numSkipDUP); } ovStoreFilter::~ovStoreFilter() { delete [] skipReadOBT; delete [] skipReadDUP; } // Are the 5' end points very different? If the overlap is flipped, then, yes, they are. static bool isOverlapDifferent(ovOverlap &ol) { bool isDiff = true; if (ol.flipped() == false) { if (ol.a_bgn() > ol.b_bgn()) isDiff = ((ol.a_bgn() - ol.b_bgn()) > OBT_FAR5PRIME) ? (true) : (false); else isDiff = ((ol.b_bgn() - ol.a_bgn()) > OBT_FAR5PRIME) ? (true) : (false); } return(isDiff); } // Is the overlap long? static bool isOverlapLong(ovOverlap &ol) { int32 ab = ol.a_bgn(); int32 ae = ol.a_end(); int32 bb = ol.b_bgn(); int32 be = ol.b_end(); int32 Alength = ae - ab; int32 Blength = be - bb; if (be < bb) Blength = bb - be; return(((Alength > OBT_MIN_LENGTH) && (Blength > OBT_MIN_LENGTH)) ? 
(true) : (false)); } void ovStoreFilter::filterOverlap(ovOverlap &foverlap, ovOverlap &roverlap) { // Quick sanity check on IIDs. if ((foverlap.a_iid == 0) || (foverlap.b_iid == 0) || (foverlap.a_iid >= maxID) || (foverlap.b_iid >= maxID)) { char ovlstr[256]; fprintf(stderr, "Overlap has IDs out of range (maxID " F_U32 "), possibly corrupt input data.\n", maxID); fprintf(stderr, " coords -- %s\n", foverlap.toString(ovlstr, ovOverlapAsCoords, false)); fprintf(stderr, " hangs -- %s\n", foverlap.toString(ovlstr, ovOverlapAsHangs, false)); exit(1); } // Make the reverse overlap (important, AFTER resetting the erate-based 'for' flags). roverlap.swapIDs(foverlap); // Ignore high error overlaps if ((foverlap.evalue() > maxEvalue)) { foverlap.dat.ovl.forUTG = false; foverlap.dat.ovl.forOBT = false; foverlap.dat.ovl.forDUP = false; roverlap.dat.ovl.forUTG = false; roverlap.dat.ovl.forOBT = false; roverlap.dat.ovl.forDUP = false; skipERATE++; skipERATE++; } // Don't OBT if not requested. if ((foverlap.dat.ovl.forOBT == false) && (skipReadOBT[foverlap.a_iid] == true)) { foverlap.dat.ovl.forOBT = false; skipOBT++; } if ((roverlap.dat.ovl.forOBT == false) && (skipReadOBT[roverlap.a_iid] == true)) { roverlap.dat.ovl.forOBT = false; skipOBT++; } // If either overlap is good for either obt or dup, compute if it is different and long. These // are the same for both foverlap and roverlap. bool isDiff = isOverlapDifferent(foverlap); bool isLong = isOverlapLong(foverlap); // Remove the bad-for-OBT overlaps. if ((isDiff == false) && (foverlap.dat.ovl.forOBT == true)) { foverlap.dat.ovl.forOBT = false; skipOBTbad++; } if ((isDiff == false) && (roverlap.dat.ovl.forOBT == true)) { roverlap.dat.ovl.forOBT = false; skipOBTbad++; } // Remove the too-short-for-OBT overlaps. 
if ((isLong == false) && (foverlap.dat.ovl.forOBT == true)) { foverlap.dat.ovl.forOBT = false; skipOBTshort++; } if ((isLong == false) && (roverlap.dat.ovl.forOBT == true)) { roverlap.dat.ovl.forOBT = false; skipOBTshort++; } // Don't dedupe if not requested. if ((foverlap.dat.ovl.forDUP == true) && (skipReadDUP[foverlap.a_iid] == true)) { foverlap.dat.ovl.forDUP = false; skipDUP++; } if ((roverlap.dat.ovl.forDUP == true) && (skipReadDUP[roverlap.b_iid] == true)) { roverlap.dat.ovl.forDUP = false; skipDUP++; } // Remove the bad-for-DUP overlaps. #if 0 // Nah, do this in dedupe, since parameters can change. if ((isDiff == true) && (foverlap.dat.ovl.forDUP == true)) { foverlap.dat.ovl.forDUP = false; skipDUPdiff++; } if ((isDiff == true) && (roverlap.dat.ovl.forDUP == true)) { roverlap.dat.ovl.forDUP = false; skipDUPdiff++; } #endif // Can't have duplicates between libraries. if (((foverlap.dat.ovl.forDUP == true) || (roverlap.dat.ovl.forDUP == true)) && (gkp->gkStore_getRead(foverlap.a_iid)->gkRead_libraryID() != gkp->gkStore_getRead(foverlap.b_iid)->gkRead_libraryID())) { if ((foverlap.dat.ovl.forDUP == true)) { foverlap.dat.ovl.forDUP = false; skipDUPlib++; } if ((roverlap.dat.ovl.forDUP == true)) { roverlap.dat.ovl.forDUP = false; skipDUPlib++; } } // All done with the filtering, record some counts. 
if (foverlap.dat.ovl.forUTG == true) saveUTG++; if (foverlap.dat.ovl.forOBT == true) saveOBT++; if (foverlap.dat.ovl.forDUP == true) saveDUP++; if (roverlap.dat.ovl.forUTG == true) saveUTG++; if (roverlap.dat.ovl.forOBT == true) saveOBT++; if (roverlap.dat.ovl.forDUP == true) saveDUP++; } void ovStoreFilter::resetCounters(void) { saveUTG = 0; saveOBT = 0; saveDUP = 0; skipERATE = 0; skipOBT = 0; skipOBTbad = 0; skipOBTshort = 0; skipDUP = 0; skipDUPdiff = 0; skipDUPlib = 0; } canu-1.6/src/stores/ovStoreFilter.H000066400000000000000000000016761314437614700173600ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/stores/ovStore.H * * Modifications by: * * Brian P. Walenz beginning on 2016-OCT-28 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ canu-1.6/src/stores/ovStoreHistogram.C000066400000000000000000000341071314437614700200560ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2016-OCT-25 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "ovStoreHistogram.H" ovStoreHistogram::ovStoreHistogram() { _gkp = NULL; _maxOlength = 0; _maxEvalue = 0; _epb = 0; _bpb = 0; _opelLen = 0; _opel = NULL; _oprLen = 0; _oprMax = 0; _opr = NULL; } ovStoreHistogram::ovStoreHistogram(char *path) { _gkp = NULL; _maxOlength = 0; _maxEvalue = 0; _epb = 0; _bpb = 0; _opelLen = 0; _opel = NULL; _oprLen = 0; _oprMax = 0; _opr = NULL; loadData(path); } ovStoreHistogram::ovStoreHistogram(gkStore *gkp, ovFileType type) { _gkp = gkp; _maxOlength = 0; _maxEvalue = 0; _epb = 1; // Evalues per bucket _bpb = 250; // Bases per bucket _opelLen = 0; _opel = NULL; _oprLen = 0; _oprMax = 0; _opr = NULL; // When writing full overlaps out of an overlapper (ovFileFullWrite) we want // to keep track of the number of overlaps per read. We could pre-allocate // the array based on the size of gkpStore, but if we don't have that, it's // easy enough to grow the array. // // _opr is notably skipped if ovFileFullWriteNoCounts is used. That symbol // isn't actually used anywhere except in this comment (and when some ovFile // is created) so we mention it here for grep. if (type == ovFileFullWrite) { _oprLen = 0; _oprMax = (_gkp == NULL) ? 
(256 * 1024) : (_gkp->gkStore_getNumReads() + 1); _opr = new uint32 [_oprMax]; memset(_opr, 0, sizeof(uint32) * _oprMax); } // When writing store overlaps (ovFileNormalWrite) we want to keep track of // how many overlaps for each evalue X length. // // A gkpStore is required here so we can allocate the correct amount of // space and compute the length of an overlap. // // The histogram always allocates one pointer for each eValue (there's only 4096 of them), // but defers allocating the vector until needed. if (type == ovFileNormalWrite) { if (_gkp == NULL) fprintf(stderr, "ovStoreHistogram()-- ERROR: I need a valid gkpStore.\n"), exit(1); for (uint32 ii=1; ii<_gkp->gkStore_getNumReads(); ii++) if (_opelLen < _gkp->gkStore_getRead(ii)->gkRead_sequenceLength()) _opelLen = _gkp->gkStore_getRead(ii)->gkRead_sequenceLength(); _opelLen = _opelLen * 1.40 / _bpb + 1; // the overlap could have 40% insertions. _opel = new uint32 * [AS_MAX_EVALUE + 1]; memset(_opel, 0, sizeof(uint32 *) * (AS_MAX_EVALUE + 1)); } } ovStoreHistogram::~ovStoreHistogram() { if (_opel) for (uint32 ii=0; iia_iid, overlap->b_iid); if (_oprMax < maxID) resizeArray(_opr, _oprLen, _oprMax, maxID + maxID/2, resizeArray_copyData | resizeArray_clearNew); if (_oprLen < maxID + 1) _oprLen = maxID + 1; _opr[overlap->a_iid]++; _opr[overlap->b_iid]++; } if (_opel) { uint32 ev = overlap->evalue(); uint32 len = (_gkp->gkStore_getRead(overlap->a_iid)->gkRead_sequenceLength() - overlap->dat.ovl.ahg5 - overlap->dat.ovl.ahg3 + _gkp->gkStore_getRead(overlap->b_iid)->gkRead_sequenceLength() - overlap->dat.ovl.bhg5 - overlap->dat.ovl.bhg3) / 2; ev /= _epb; len /= _bpb; if (_opel[ev] == NULL) { _opel[ev] = new uint32 [_opelLen]; memset(_opel[ev], 0, sizeof(uint32) * _opelLen); } int32 alen = _gkp->gkStore_getRead(overlap->a_iid)->gkRead_sequenceLength(); int32 blen = _gkp->gkStore_getRead(overlap->b_iid)->gkRead_sequenceLength(); if (len < _opelLen) { //fprintf(stderr, "overlap %8u (len %6d) %8u (len %6d) hangs %6" 
F_U64P " %6d %6" F_U64P " - %6" F_U64P " %6d %6" F_U64P " flip " F_U64 "\n", // overlap->a_iid, alen, // overlap->b_iid, blen, // overlap->dat.ovl.ahg5, (int32)alen - (int32)overlap->dat.ovl.ahg5 - (int32)overlap->dat.ovl.ahg3, overlap->dat.ovl.ahg3, // overlap->dat.ovl.bhg5, (int32)blen - (int32)overlap->dat.ovl.bhg5 - (int32)overlap->dat.ovl.bhg3, overlap->dat.ovl.bhg3, // overlap->dat.ovl.flipped); _opel[ev][len]++; } else { fprintf(stderr, "overlap %8u (len %6d) %8u (len %6d) hangs %6" F_U64P " %6d %6" F_U64P " - %6" F_U64P " %6d %6" F_U64P " flip " F_U64 " -- BOGUS\n", overlap->a_iid, alen, overlap->b_iid, blen, overlap->dat.ovl.ahg5, (int32)alen - (int32)overlap->dat.ovl.ahg5 - (int32)overlap->dat.ovl.ahg3, overlap->dat.ovl.ahg3, overlap->dat.ovl.bhg5, (int32)blen - (int32)overlap->dat.ovl.bhg5 - (int32)overlap->dat.ovl.bhg3, overlap->dat.ovl.bhg3, overlap->dat.ovl.flipped); } } } // Build an output file name from a prefix and a suffix based // on if the prefix is a directory or a file. If a directory, // the new name will be a file in the directory, otherwise, // it will be an extension to the origianl name. void createDataName(char *name, char *prefix, char *suffix) { if (AS_UTL_fileExists(prefix, true, false)) { snprintf(name, FILENAME_MAX, "%s/%s", prefix, suffix); } else { AS_UTL_findBaseFileName(name, prefix); strcat(name, "."); strcat(name, suffix); } } void ovStoreHistogram::saveData(char *prefix) { char name[FILENAME_MAX]; // If we have overlaps-per-read data, dump it. Just a simple array. if (_opr) { createDataName(name, prefix, "counts"); errno = 0; FILE *F = fopen(name, "w"); if (errno) fprintf(stderr, "failed to open counts file '%s' for writing: %s\n", name, strerror(errno)), exit(1); AS_UTL_safeWrite(F, &_oprLen, "ovStoreHistogram::nr", sizeof(uint32), 1); AS_UTL_safeWrite(F, _opr, "ovStoreHistogram::opr", sizeof(uint32), _oprLen); fclose(F); } // If we have overlaps-per-evalue-length, dump it. 
This is a bit more complicated, as it has // holes in the array. if (_opel) { createDataName(name, prefix, "evalueLen"); errno = 0; FILE *F = fopen(name, "w"); if (errno) fprintf(stderr, "failed to open evalueLen file '%s' for writing: %s\n", name, strerror(errno)), exit(1); uint32 nArr = 0; for (uint32 ii=0; ii 0; ) { // For each saved vector: AS_UTL_safeRead(F, &ev, "ovStoreHistogram::evalue", sizeof(uint32), 1); // Load the evalue it is for AS_UTL_safeRead(F, in, "ovStoreHistogram::evalueLen", sizeof(uint32), _opelLen); // Load the data. if (_opel[ev] == NULL) // More abuse, if needed allocateArray(_opel[ev], _opelLen, resizeArray_clearNew); for (uint32 kk=0; kk<_opelLen; kk++) // Add new data to old data _opel[ev][kk] += in[kk]; } delete [] in; fclose(F); } } void ovStoreHistogram::removeData(char *prefix) { char name[FILENAME_MAX]; createDataName(name, prefix, "counts"); AS_UTL_unlink(name); createDataName(name, prefix, "evalueLen"); AS_UTL_unlink(name); } void ovStoreHistogram::add(ovStoreHistogram *input) { if (input->_opr) { resizeArray(_opr, _oprLen, _oprMax, input->_oprMax, resizeArray_copyData | resizeArray_clearNew); for (uint32 ii=0; ii_oprMax; ii++) _opr[ii] += input->_opr[ii]; _oprLen = max(_oprLen, input->_oprLen); } if (input->_opel) { if (_opel == NULL) { allocateArray(_opel, AS_MAX_EVALUE+1, resizeArray_clearNew); _opelLen = input->_opelLen; _maxOlength = input->_maxOlength; _maxEvalue = input->_maxEvalue; _epb = input->_epb; _bpb = input->_bpb; } if ((_opelLen != input->_opelLen) || (_epb != input->_epb) || (_bpb != input->_bpb)) { fprintf(stderr, "ERROR: can't merge histogram; parameters differ.\n"); fprintf(stderr, "ERROR: opelLen = %7u vs %7u\n", _opelLen, input->_opelLen); fprintf(stderr, "ERROR: opelLen = %7u vs %7u\n", _epb, input->_epb); fprintf(stderr, "ERROR: opelLen = %7u vs %7u\n", _bpb, input->_bpb); exit(1); } _maxOlength = max(_maxOlength, input->_maxOlength); _maxEvalue = max(_maxEvalue, input->_maxEvalue); for (uint32 ev=0; 
ev_opel[ev] == NULL) continue; if (_opel[ev] == NULL) allocateArray(_opel[ev], _opelLen, resizeArray_clearNew); for (uint32 kk=0; kk<_opelLen; kk++) _opel[ev][kk] += input->_opel[ev][kk]; } } } uint64 ovStoreHistogram::getOverlapsPerRead(uint32 *oprOut, uint32 oprOutLen) { uint64 tot = 0; if (oprOutLen < _oprLen) fprintf(stderr, "ERROR: more reads in histogram than available for output? oprOutLen=%u _oprLen=%u\n", oprOutLen, _oprLen), exit(1); for (uint32 ii=0; ii<_oprLen; ii++) { oprOut[ii] += _opr[ii]; tot += _opr[ii]; } return(tot); } void ovStoreHistogram::dumpEvalueLength(FILE *out) { uint32 maxEvalue = 0; uint32 maxLength = 0; // Find the largest Evalue and length with values in the histogram for (uint32 ee=0; ee 0) maxLength = ll; } // Dump those values for (uint32 ee=0; ee<=maxEvalue; ee++) { for (uint32 ll=0; ll<=maxLength; ll++) fprintf(out, "%u\t%.4f\t%u\n", ll * _bpb, AS_OVS_decodeEvalue(ee), (_opel[ee] == NULL) ? 0 : _opel[ee][ll]); fprintf(out, "\n"); } fprintf(stderr, "MAX Evalue %.4f\n", AS_OVS_decodeEvalue(maxEvalue)); fprintf(stderr, "MAX Length %u\n", maxLength * _bpb); } canu-1.6/src/stores/ovStoreHistogram.H000066400000000000000000000077561314437614700200750ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. 
Walenz beginning on 2016-OCT-25 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef AS_OVSTOREHISTOGRAM_H #define AS_OVSTOREHISTOGRAM_H // Automagically gathers statistics on overlaps as they're written: // from overlappers, the number of overlaps per read. // in the store, the number of overlaps per (evalue,overlapLength) #include "AS_global.H" #include "gkStore.H" #include "ovStoreFile.H" class ovStoreHistogram { public: ovStoreHistogram(); // Used when loading data, user must loadData() later ovStoreHistogram(char *path); // Used when loading data, calls loadData() for you ovStoreHistogram(gkStore *gkp, ovFileType type); // Used when writing ovFile ~ovStoreHistogram(); double minErate(void) { return(AS_OVS_decodeEvalue(0)); }; double maxErate(void) { return(AS_OVS_decodeEvalue(_maxEvalue)); }; uint32 minEvalue(void) { return(0); }; uint32 maxEvalue(void) { return(_maxEvalue); }; uint32 numEvalueBuckets(void) { return(AS_MAX_EVALUE + 1); }; uint32 numLengthBuckets(void) { return(_opelLen); }; //uint32 minOverlapLength(void) { return(0); }; //uint32 maxOverlapLength(void) { return(_maxLength * _bpb); }; uint32 evaluePerBucket(void) { return(_epb); }; uint32 basesPerBucket(void) { return(_bpb); }; uint32 numOverlaps(uint32 eb, uint32 lb) { assert(eb < numEvalueBuckets()); assert(lb < numLengthBuckets()); return((_opel[eb] == NULL) ? 0 : _opel[eb][lb]); }; //uint32 numOverlaps(uint32 id); //uint32 numOverlaps(uint32 evalue, uint32 length); // In an ovFile, add a single value to the histogram void addOverlap(ovOverlap *overlap); // In an ovStore, load the histogram saved in a file, and add it to our current data. void saveData(char *prefix); void loadData(char *prefix, uint32 maxIID=UINT32_MAX); // Remove data associated with some prefix. 
static void removeData(char *prefix); // Add in the data from histogram 'input' to this histogram void add(ovStoreHistogram *input); // Copies the number of overlaps per read into oprOut. This array is assumed to sized using // gkpStore to get the number of reads. The data in the histogram can be shorter, but shouldn't // be longer. If so, it will fail and exit. uint64 getOverlapsPerRead(uint32 *oprOut, uint32 oprOutLen); // Returns total overlaps in this histogram // Dump a gnuplot-friendly data file of the evalues-length. void dumpEvalueLength(FILE *out); private: gkStore *_gkp; uint32 _maxOlength; // Max overlap length seen uint32 _maxEvalue; // Max evalue seen uint32 _epb; // Evalues per bucket uint32 _bpb; // Bases per bucket uint32 _opelLen; // Length of the data vector for one evalue uint32 **_opel; // Overlaps per evalue-length uint32 _oprLen; // Length of opr valid data uint32 _oprMax; // Last allocated opr uint32 *_opr; // Overlaps per read }; #endif // AS_OVSTOREHISTOGRAM_H canu-1.6/src/stores/ovStoreIndexer.C000066400000000000000000000123221314437614700175120ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_OVS/overlapStoreIndexer.C * * Modifications by: * * Brian P. Walenz from 2012-APR-02 to 2013-AUG-01 * are Copyright 2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. 
Walenz from 2014-DEC-15 to 2015-SEP-21 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "ovStore.H" int main(int argc, char **argv) { char *storePath = NULL; uint32 fileLimit = 0; // Number of 'slices' from bucketizer bool deleteIntermediates = true; bool doExplicitTest = false; bool doFixes = false; char name[FILENAME_MAX]; argc = AS_configure(argc, argv); int err=0; int arg=1; while (arg < argc) { if (strcmp(argv[arg], "-O") == 0) { storePath = argv[++arg]; } else if (strcmp(argv[arg], "-F") == 0) { fileLimit = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-f") == 0) { doFixes = true; } else if (strcmp(argv[arg], "-t") == 0) { doExplicitTest = true; storePath = argv[++arg]; } else if (strcmp(argv[arg], "-nodelete") == 0) { deleteIntermediates = false; } else { fprintf(stderr, "ERROR: unknown option '%s'\n", argv[arg]); err++; } arg++; } if (storePath == NULL) err++; if ((fileLimit == 0) && (doExplicitTest == false)) err++; if (err) { fprintf(stderr, "usage: %s ...\n", argv[0]); fprintf(stderr, " -O x.ovlStore path to overlap store to build the final index for\n"); fprintf(stderr, " -F s number of slices used in bucketizing/sorting\n"); fprintf(stderr, "\n"); fprintf(stderr, " -t x.ovlStore explicitly test a previously constructed index\n"); fprintf(stderr, " -f when testing, also create a new 'idx.fixed' which might\n"); fprintf(stderr, " resolve rare problems\n"); fprintf(stderr, "\n"); fprintf(stderr, " -nodelete do not remove intermediate files when the index is\n"); fprintf(stderr, " successfully created\n"); fprintf(stderr, "\n"); fprintf(stderr, " DANGER DO NOT USE DO NOT USE DO NOT 
USE DANGER\n"); fprintf(stderr, " DANGER DANGER\n"); fprintf(stderr, " DANGER This command is difficult to run by hand. DANGER\n"); fprintf(stderr, " DANGER Use ovStoreCreate instead. DANGER\n"); fprintf(stderr, " DANGER DANGER\n"); fprintf(stderr, " DANGER DO NOT USE DO NOT USE DO NOT USE DANGER\n"); fprintf(stderr, "\n"); if (storePath == NULL) fprintf(stderr, "ERROR: No overlap store (-O) supplied.\n"); if ((fileLimit == 0) && (doExplicitTest == false)) fprintf(stderr, "ERROR: One of -F (number of slices) or -t (test a store) must be supplied.\n"); exit(1); } // Do the test, and maybe fix things up. //gkStore *gkp = gkStore::gkStore_open(gkpName); ovStoreWriter *writer = new ovStoreWriter(storePath, NULL, fileLimit, 0, 0); if (doExplicitTest == true) { bool passed = writer->testIndex(doFixes); if (passed == true) fprintf(stderr, "Index looks correct.\n"); delete writer; exit(passed == false); } // Check that all segments are present. Every segment should have an info file. writer->checkSortingIsComplete(); // Merge the indices and histogram data. writer->mergeInfoFiles(); writer->mergeHistogram(); // Diagnostics. if (writer->testIndex(false) == false) { fprintf(stderr, "ERROR: index failed tests.\n"); delete writer; exit(1); } // Remove intermediates. For the buckets, we keep going until there are 10 in a row not present. // During testing, on a microbe using 2850 buckets, some buckets were empty. if (deleteIntermediates == false) { fprintf(stderr, "\n"); fprintf(stderr, "Not removing intermediate files. Finished.\n"); exit(0); } fprintf(stderr, "\n"); fprintf(stderr, "Removing intermediate files.\n"); writer->removeAllIntermediateFiles(); fprintf(stderr, "Success!\n"); exit(0); } canu-1.6/src/stores/ovStoreIndexer.mk000066400000000000000000000007351314437614700177440ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. 
ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := ovStoreIndexer SOURCES := ovStoreIndexer.C SRC_INCDIRS := .. ../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/stores/ovStoreSorter.C000066400000000000000000000216701314437614700174000ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_OVS/overlapStoreSorter.C * * Modifications by: * * Brian P. Walenz from 2012-APR-02 to 2013-SEP-09 * are Copyright 2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-15 to 2015-SEP-21 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "ovStore.H" #include <vector> #include <algorithm> using namespace std; // This is the size of the datastructure that we're using to store overlaps for sorting.
// At present, with ovOverlap, it is over-allocating a pointer that we don't need, but // to make a custom structure, we'd need to duplicate a bunch of code or copy data after // loading and before writing. // // Used in both ovStoreSorter.C and ovStoreBuild.C. // #define ovOverlapSortSize (sizeof(ovOverlap)) void makeSentinel(char *storePath, uint32 fileID, bool forceRun) { char name[FILENAME_MAX]; // Check if done. snprintf(name, FILENAME_MAX, "%s/%04d", storePath, fileID); if ((forceRun == false) && (AS_UTL_fileExists(name, FALSE, FALSE))) fprintf(stderr, "Job " F_U32 " is finished (remove '%s' or -force to try again).\n", fileID, name), exit(0); // Check if running. snprintf(name, FILENAME_MAX, "%s/%04d.ovs", storePath, fileID); if ((forceRun == false) && (AS_UTL_fileExists(name, FALSE, FALSE))) fprintf(stderr, "Job " F_U32 " is running (remove '%s' or -force to try again).\n", fileID, name), exit(0); // Not done, not running, so create a sentinel to say we're running. errno = 0; FILE *F = fopen(name, "w"); if (errno) fprintf(stderr, "ERROR: Failed to open '%s' for writing: %s\n", name, strerror(errno)), exit(1); fclose(F); } void removeSentinel(char *storePath, uint32 fileID) { char name[FILENAME_MAX]; snprintf(name, FILENAME_MAX, "%s/%04d.ovs", storePath, fileID); unlink(name); } int main(int argc, char **argv) { char *storePath = NULL; char *gkpName = NULL; uint32 fileLimit = 512; // Number of 'slices' from bucketizer uint32 fileID = 0; // 'slice' that we are going to be sorting uint32 jobIdxMax = 0; // Number of 'buckets' from bucketizer uint64 maxMemory = UINT64_MAX; bool deleteIntermediateEarly = false; bool deleteIntermediateLate = false; bool forceRun = false; char name[FILENAME_MAX]; argc = AS_configure(argc, argv); int err=0; int arg=1; while (arg < argc) { if (strcmp(argv[arg], "-O") == 0) { storePath = argv[++arg]; } else if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-F") == 0) { fileLimit = 
atoi(argv[++arg]); } else if (strcmp(argv[arg], "-job") == 0) { fileID = atoi(argv[++arg]); jobIdxMax = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-M") == 0) { maxMemory = (uint64)ceil(atof(argv[++arg]) * 1024.0 * 1024.0 * 1024.0); } else if (strcmp(argv[arg], "-deleteearly") == 0) { deleteIntermediateEarly = true; } else if (strcmp(argv[arg], "-deletelate") == 0) { deleteIntermediateLate = true; } else if (strcmp(argv[arg], "-force") == 0) { forceRun = true; } else { fprintf(stderr, "ERROR: unknown option '%s'\n", argv[arg]); err++; } arg++; } if (storePath == NULL) err++; if (fileID == 0) err++; if (jobIdxMax == 0) err++; if (err) { fprintf(stderr, "usage: %s ...\n", argv[0]); fprintf(stderr, " -G asm.gkpStore path to gkpStore for this assembly\n"); fprintf(stderr, " -O x.ovlStore path to overlap store to build the final index for\n"); fprintf(stderr, "\n"); fprintf(stderr, " -F s number of slices used in bucketizing/sorting\n"); fprintf(stderr, " -job j m index of this overlap input file, and max number of files\n"); fprintf(stderr, "\n"); fprintf(stderr, " -M m maximum memory to use, in gigabytes\n"); fprintf(stderr, "\n"); fprintf(stderr, " -deleteearly remove intermediates as soon as possible (unsafe)\n"); fprintf(stderr, " -deletelate remove intermediates when outputs exist (safe)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -force force a recompute, even if the output exists\n"); fprintf(stderr, "\n"); fprintf(stderr, " DANGER DO NOT USE DO NOT USE DO NOT USE DANGER\n"); fprintf(stderr, " DANGER DANGER\n"); fprintf(stderr, " DANGER This command is difficult to run by hand. DANGER\n"); fprintf(stderr, " DANGER Use ovStoreCreate instead. 
DANGER\n"); fprintf(stderr, " DANGER DANGER\n"); fprintf(stderr, " DANGER DO NOT USE DO NOT USE DO NOT USE DANGER\n"); fprintf(stderr, "\n"); if (storePath == NULL) fprintf(stderr, "ERROR: No overlap store (-O) supplied.\n"); if (fileID == 0) fprintf(stderr, "ERROR: no slice number (-F) supplied.\n"); if (jobIdxMax == 0) fprintf(stderr, "ERROR: no max job ID (-job) supplied.\n"); exit(1); } // Check if we're running or done (or crashed), then note that we're running. makeSentinel(storePath, fileID, forceRun); // Not done. Let's go! gkStore *gkp = gkStore::gkStore_open(gkpName); ovStoreWriter *writer = new ovStoreWriter(storePath, gkp, fileLimit, fileID, jobIdxMax); // Get the number of overlaps in each bucket slice. fprintf(stderr, "\n"); fprintf(stderr, "Finding overlaps.\n"); uint64 *bucketSizes = new uint64 [jobIdxMax + 1]; uint64 totOvl = writer->loadBucketSizes(bucketSizes); // Fail if we don't have enough memory to process. if (ovOverlapSortSize * totOvl > maxMemory) { fprintf(stderr, "ERROR: Overlaps need %.2f GB memory, but process limited (via -M) to " F_U64 " GB.\n", ovOverlapSortSize * totOvl / 1024.0 / 1024.0 / 1024.0, maxMemory >> 30); removeSentinel(storePath, fileID); exit(1); } // Or report that we can process. fprintf(stderr, "\n"); fprintf(stderr, "Loading %10" F_U64P " overlaps using %.2f GB of requested (-M) " F_U64 " GB memory.\n", totOvl, ovOverlapSortSize * totOvl / 1024.0 / 1024.0 / 1024.0, maxMemory >> 30); // Load all overlaps - we're guaranteed that either 'name.gz' or 'name' exists (we checked when // we loaded bucket sizes) or funny business is happening with our files. ovOverlap *ovls = ovOverlap::allocateOverlaps(gkp, totOvl); uint64 ovlsLen = 0; for (uint32 i=0; i<=jobIdxMax; i++) writer->loadOverlapsFromSlice(i, bucketSizes[i], ovls, ovlsLen); // Check that we found all the overlaps we were expecting. 
if (ovlsLen != totOvl) fprintf(stderr, "ERROR: read " F_U64 " overlaps, expected " F_U64 "\n", ovlsLen, totOvl); assert(ovlsLen == totOvl); // Clean up space if told to. if (deleteIntermediateEarly) writer->removeOverlapSlice(); // Sort the overlaps! Finally! The parallel STL sort is NOT inplace, and blows up our memory. fprintf(stderr, "\n"); fprintf(stderr, "Sorting.\n"); #ifdef _GLIBCXX_PARALLEL __gnu_sequential::sort(ovls, ovls + ovlsLen); #else sort(ovls, ovls + ovlsLen); #endif // Output to the store. fprintf(stderr, "\n"); // Sorting has no output, so this would generate a distracting extra newline fprintf(stderr, "Writing sorted overlaps.\n"); writer->writeOverlaps(ovls, ovlsLen); // Clean up. Delete inputs, remove the sentinel, release memory, etc. delete [] ovls; delete [] bucketSizes; removeSentinel(storePath, fileID); gkp->gkStore_close(); if (deleteIntermediateLate) { fprintf(stderr, "\n"); fprintf(stderr, "Removing bucketized overlaps.\n"); fprintf(stderr, "\n"); writer->removeOverlapSlice(); } // Success! fprintf(stderr, "Success!\n"); return(0); } canu-1.6/src/stores/ovStoreSorter.mk000066400000000000000000000007331314437614700176220ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := ovStoreSorter SOURCES := ovStoreSorter.C SRC_INCDIRS := .. ../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/stores/ovStoreStats.C000066400000000000000000000616421314437614700172230ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2015-OCT-27 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-MAR-31 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "ovStore.H" #include "stddev.H" #include "intervalList.H" #include "speedCounter.H" #define OVL_5 0x01 #define OVL_3 0x02 #define OVL_CONTAINED 0x04 #define OVL_CONTAINER 0x08 #define OVL_PARTIAL 0x10 // Should count unique-contained and repeat-contained separately from unique and repeat // uniq-anchor is also 'plausible chimera' // no-5-prime includes things that entirely cover the read, just no overhang int main(int argc, char **argv) { char *gkpName = NULL; char *ovlName = NULL; char *outPrefix = NULL; uint32 bgnID = 0; uint32 endID = UINT32_MAX; uint32 ovlSelect = 0; double ovlAtMost = AS_OVS_encodeEvalue(1.0); double ovlAtLeast = AS_OVS_encodeEvalue(0.0); double expectedMean = 30.0; double expectedStdDev = 7.0; bool toFile = true; bool beVerbose = false; argc = AS_configure(argc, argv); int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) gkpName = argv[++arg]; else if (strcmp(argv[arg], "-O") == 0) ovlName = argv[++arg]; else if (strcmp(argv[arg], "-o") == 0) outPrefix = argv[++arg]; else if (strcmp(argv[arg], "-C") == 0) { expectedMean = atof(argv[++arg]); expectedStdDev = atof(argv[++arg]); } else if (strcmp(argv[arg], 
"-c") == 0) toFile = false; else if (strcmp(argv[arg], "-v") == 0) beVerbose = true; else if (strcmp(argv[arg], "-b") == 0) bgnID = atoi(argv[++arg]); else if (strcmp(argv[arg], "-e") == 0) endID = atoi(argv[++arg]); else if (strcmp(argv[arg], "-overlap") == 0) { arg++; if (strcmp(argv[arg], "5") == 0) ovlSelect |= OVL_5; else if (strcmp(argv[arg], "3") == 0) ovlSelect |= OVL_3; else if (strcmp(argv[arg], "contained") == 0) ovlSelect |= OVL_CONTAINED; else if (strcmp(argv[arg], "container") == 0) ovlSelect |= OVL_CONTAINER; else if (strcmp(argv[arg], "partial") == 0) ovlSelect |= OVL_PARTIAL; else if (strcmp(argv[arg], "atmost") == 0) ovlAtMost = atof(argv[++arg]); else if (strcmp(argv[arg], "atleast") == 0) ovlAtLeast = atof(argv[++arg]); else { fprintf(stderr, "ERROR: unknown -overlap '%s'\n", argv[arg]); exit(1); } } else { fprintf(stderr, "%s: unknown option '%s'.\n", argv[0], argv[arg]); err++; } arg++; } if (gkpName == NULL) err++; if (ovlName == NULL) err++; if (outPrefix == NULL) err++; if (err) { fprintf(stderr, "usage: %s -G gkpStore -O ovlStore -o outPrefix [-b bgnID] [-e endID] ...\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, "Generates statistics for an overlap store. 
By default all possible classes\n"); fprintf(stderr, "are generated, options can disable specific classes.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -C mean stddev Expect coverage at mean +- stddev\n"); fprintf(stderr, " -c Write stats to stdout, not to a file\n"); fprintf(stderr, " -v Report processing speed to stderr\n"); fprintf(stderr, "\n"); fprintf(stderr, "Outputs:\n"); fprintf(stderr, "\n"); fprintf(stderr, " outPrefix.per-read.log One line per read, giving readID, read length and classification.\n"); fprintf(stderr, " outPrefix.summary The primary statistical output.\n"); fprintf(stderr, "\n"); fprintf(stderr, "Overlap Selection:\n"); fprintf(stderr, " -overlap 5 5' overlaps only\n"); fprintf(stderr, " -overlap 3 3' overlaps only\n"); fprintf(stderr, " -overlap contained contained overlaps only\n"); fprintf(stderr, " -overlap container container overlaps only\n"); fprintf(stderr, " -overlap partial overlap is not valid for assembly\n"); fprintf(stderr, "\n"); fprintf(stderr, " An overlap is classified as exactly one of 5', 3', contained or container.\n"); fprintf(stderr, " By default, all overlaps are selected. Specifying any of these options will\n"); fprintf(stderr, " restrict overlaps to just those classifications. E.g., '-overlap 5 -overlap 3'\n"); fprintf(stderr, " will select dovetail overlaps off either end of the read.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -overlap atmost x at most fraction x error (overlap-erate <= x)\n"); fprintf(stderr, " -overlap atleast x at least fraction x error (x <= overlap-erate)\n"); fprintf(stderr, "\n"); fprintf(stderr, " Overlaps can be further filtered by fraction error. Usually, this will be an\n"); fprintf(stderr, " 'atmost' filtering to use only the higher quality overlaps.\n"); fprintf(stderr, "\n"); fprintf(stderr, " A contained read has at least one container overlap. Container read -> ---------------\n"); fprintf(stderr, " A container read has at least one contained overlap.
Contained overlap -> -----\n"); fprintf(stderr, "\n"); exit(1); } // Set the default to 'all' if nothing set. if (ovlSelect == 0) ovlSelect = 0xff; // Open inputs, find limits. gkStore *gkpStore = gkStore::gkStore_open(gkpName); ovStore *ovlStore = new ovStore(ovlName, gkpStore); if (endID > gkpStore->gkStore_getNumReads()) endID = gkpStore->gkStore_getNumReads(); if (endID < bgnID) fprintf(stderr, "ERROR: invalid bgn/end range bgn=%u end=%u; only %u reads in the store\n", bgnID, endID, gkpStore->gkStore_getNumReads()), exit(1); ovlStore->setRange(bgnID, endID); // Allocate output histograms. histogramStatistics *readNoOlaps = new histogramStatistics; // Bad reads! (read length) histogramStatistics *readHole = new histogramStatistics; histogramStatistics *readHump = new histogramStatistics; histogramStatistics *readNo5 = new histogramStatistics; histogramStatistics *readNo3 = new histogramStatistics; histogramStatistics *olapHole = new histogramStatistics; // Hole size (sum of holes if more than one) histogramStatistics *olapHump = new histogramStatistics; // Hump size (sum of humps if more than one) histogramStatistics *olapNo5 = new histogramStatistics; // 5' uncovered size histogramStatistics *olapNo3 = new histogramStatistics; // 3' uncovered size histogramStatistics *readLowCov = new histogramStatistics; // Good reads! (read length) histogramStatistics *readUnique = new histogramStatistics; histogramStatistics *readRepeatCont = new histogramStatistics; histogramStatistics *readRepeatDove = new histogramStatistics; histogramStatistics *readSpanRepeat = new histogramStatistics; histogramStatistics *readUniqRepeatCont = new histogramStatistics; histogramStatistics *readUniqRepeatDove = new histogramStatistics; histogramStatistics *readUniqAnchor = new histogramStatistics; histogramStatistics *covrLowCov = new histogramStatistics; // Good reads! 
(overlap length) histogramStatistics *covrUnique = new histogramStatistics; histogramStatistics *covrRepeatCont = new histogramStatistics; histogramStatistics *covrRepeatDove = new histogramStatistics; histogramStatistics *covrSpanRepeat = new histogramStatistics; histogramStatistics *covrUniqRepeatCont = new histogramStatistics; histogramStatistics *covrUniqRepeatDove = new histogramStatistics; histogramStatistics *covrUniqAnchor = new histogramStatistics; histogramStatistics *olapLowCov = new histogramStatistics; // Good reads! (overlap length) histogramStatistics *olapUnique = new histogramStatistics; histogramStatistics *olapRepeatCont = new histogramStatistics; histogramStatistics *olapRepeatDove = new histogramStatistics; histogramStatistics *olapSpanRepeat = new histogramStatistics; histogramStatistics *olapUniqRepeatCont = new histogramStatistics; histogramStatistics *olapUniqRepeatDove = new histogramStatistics; histogramStatistics *olapUniqAnchor = new histogramStatistics; // Coverage interval lists, of all overlaps selected. // Open outputs. char N[FILENAME_MAX]; snprintf(N, FILENAME_MAX, "%s.per-read.log", outPrefix); FILE *LOG = fopen(N, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", N, strerror(errno)), exit(1); // Compute! 
uint32 overlapsMax = 1024; uint32 overlapsLen = 0; ovOverlap *overlaps = ovOverlap::allocateOverlaps(gkpStore, overlapsMax); speedCounter C(" %9.0f reads (%6.1f reads/sec)\r", 1, 100, beVerbose); overlapsLen = ovlStore->readOverlaps(overlaps, overlapsMax); while (overlapsLen > 0) { uint32 readID = overlaps[0].a_iid; uint32 readLen = gkpStore->gkStore_getRead(readID)->gkRead_sequenceLength(); intervalList cov; uint32 covID = 0; bool readCoverage5 = false; bool readCoverage3 = false; bool readContained = false; bool readContainer = false; bool readPartial = false; for (uint32 oo=0; oo ovlAtMost) continue; readCoverage5 |= is5prime; // If there is a 5' overlap, the read isn't missing 5' coverage readCoverage3 |= is3prime; readContained |= isContained; // Read is contained in something else readContainer |= isContainer; // Read is a container of something else readPartial |= isPartial; cov.add(overlaps[oo].a_bgn(), overlaps[oo].a_end() - overlaps[oo].a_bgn()); } // If we filtered all the overlaps, just get out of here. Yeah, some code duplication, // but cleaner than sticking an if block around the rest of the loop. if (cov.numberOfIntervals() == 0) { readNoOlaps->add(readLen); overlapsLen = ovlStore->readOverlaps(overlaps, overlapsMax); continue; } // Generate a depth-of-coverage map, then merge intervals intervalList depth(cov); cov.merge(); // Analyze the intervals, save per-read information to the log. uint32 lastInt = cov.numberOfIntervals() - 1; uint32 bgn = cov.lo(0); uint32 end = cov.hi(lastInt); bool contiguous = (lastInt == 0) ?
true : false; bool readFullCoverage = (lastInt == 0) && (bgn == 0) && (end == readLen); bool readMissingMiddle = (lastInt != 0); uint32 holeSize = 0; uint32 no5Size = bgn; uint32 no3Size = readLen - end; for (uint32 ii=1; iiadd(readLen); olapHole->add(holeSize); overlapsLen = ovlStore->readOverlaps(overlaps, overlapsMax); continue; } if ((readCoverage5 == false) && (readCoverage3 == false) && (readContained == false) && (readPartial == false)) { fprintf(LOG, "%u\t%u\t%s\n", readID, readLen, "middle-only"); readHump->add(readLen); olapHump->add(no5Size + no3Size); overlapsLen = ovlStore->readOverlaps(overlaps, overlapsMax); continue; } if ((readCoverage5 == false) && (readContained == false) && (readPartial == false)) { fprintf(LOG, "%u\t%u\t%s\n", readID, readLen, "no-5-prime"); readNo5->add(readLen); olapNo5->add(no5Size); overlapsLen = ovlStore->readOverlaps(overlaps, overlapsMax); continue; } if ((readCoverage3 == false) && (readContained == false) && (readPartial == false)) { fprintf(LOG, "%u\t%u\t%s\n", readID, readLen, "no-3-prime"); readNo3->add(readLen); olapNo3->add(no3Size); overlapsLen = ovlStore->readOverlaps(overlaps, overlapsMax); continue; } // Handle good cases. For partial overlaps, bgn and end are not the extent of the read. if (readPartial == false) { assert(bgn == 0); assert(end == readLen); assert(contiguous == true); assert(readFullCoverage == true); } // Compute mean and std.dev of coverage. From this, we decide if the read is 'unique', // 'repeat' or 'mixed'. If 'mixed', we then need to decide if the read spans a repeat, or // joins unique and repeat. 
double covMean = 0; double covStdDev = 0; for (uint32 ii=0; iiadd(readLen); for (uint32 ii=0; iiadd(depth.depth(ii), depth.hi(ii) - depth.lo(ii)); } if (isUnique) { fprintf(LOG, "%u\t%u\t%s\n", readID, readLen, "unique"); readUnique->add(readLen); for (uint32 ii=0; iiadd(depth.depth(ii), depth.hi(ii) - depth.lo(ii)); } if ((isRepeat) && (readContained == true)) { fprintf(LOG, "%u\t%u\t%s\n", readID, readLen, "contained-repeat"); readRepeatCont->add(readLen); for (uint32 ii=0; iiadd(depth.depth(ii), depth.hi(ii) - depth.lo(ii)); } if ((isRepeat) && (readContained == false)) { fprintf(LOG, "%u\t%u\t%s\n", readID, readLen, "dovetail-repeat"); readRepeatDove->add(readLen); for (uint32 ii=0; iiadd(depth.depth(ii), depth.hi(ii) - depth.lo(ii)); } if (isSpanRepeat) { fprintf(LOG, "%u\t%u\t%s\n", readID, readLen, "span-repeat"); readSpanRepeat->add(readLen); olapSpanRepeat->add(depth.lo(endi) - depth.hi(bgni)); } if ((isUniqRepeat) && (readContained == true)) { fprintf(LOG, "%u\t%u\t%s\n", readID, readLen, "uniq-repeat-cont"); readUniqRepeatCont->add(readLen); } if ((isUniqRepeat) && (readContained == false)) { fprintf(LOG, "%u\t%u\t%s\n", readID, readLen, "uniq-repeat-dove"); readUniqRepeatDove->add(readLen); } if (isUniqAnchor) { fprintf(LOG, "%u\t%u\t%s\n", readID, readLen, "uniq-anchor"); readUniqAnchor->add(readLen); olapUniqAnchor->add(depth.lo(endi) - depth.hi(bgni)); } // Done. Read more data. C.tick(); overlapsLen = ovlStore->readOverlaps(overlaps, overlapsMax); } fclose(LOG); // Done with logging. 
readHole->finalizeData(); olapHole->finalizeData(); readHump->finalizeData(); olapHump->finalizeData(); readNo5->finalizeData(); olapNo5->finalizeData(); readNo3->finalizeData(); olapNo3->finalizeData(); readLowCov->finalizeData(); olapLowCov->finalizeData(); covrLowCov->finalizeData(); readUnique->finalizeData(); olapUnique->finalizeData(); covrUnique->finalizeData(); readRepeatCont->finalizeData(); olapRepeatCont->finalizeData(); covrRepeatCont->finalizeData(); readRepeatDove->finalizeData(); olapRepeatDove->finalizeData(); covrRepeatDove->finalizeData(); readSpanRepeat->finalizeData(); olapSpanRepeat->finalizeData(); readUniqRepeatCont->finalizeData(); olapUniqRepeatCont->finalizeData(); readUniqRepeatDove->finalizeData(); olapUniqRepeatDove->finalizeData(); readUniqAnchor->finalizeData(); olapUniqAnchor->finalizeData(); LOG = stdout; if (toFile == true) { snprintf(N, FILENAME_MAX, "%s.summary", outPrefix); LOG = fopen(N, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", N, strerror(errno)), exit(1); } fprintf(LOG, "category reads %% read length feature size or coverage analysis\n"); fprintf(LOG, "---------------- ------- ------- ---------------------- ------------------------ --------------------\n"); fprintf(LOG, "middle-missing %7" F_U64P " %6.2f %10.2f +- %-8.2f %10.2f +- %-8.2f (bad trimming)\n", readHole->numberOfObjects(), (float)readHole->numberOfObjects()/gkpStore->gkStore_getNumReads()*100, readHole->mean(), readHole->stddev(), olapHole->mean(), olapHole->stddev()); fprintf(LOG, "middle-hump %7" F_U64P " %6.2f %10.2f +- %-8.2f %10.2f +- %-8.2f (bad trimming)\n", readHump->numberOfObjects(), (float)readHump->numberOfObjects()/gkpStore->gkStore_getNumReads()*100, readHump->mean(), readHump->stddev(), olapHump->mean(), olapHump->stddev()); fprintf(LOG, "no-5-prime %7" F_U64P " %6.2f %10.2f +- %-8.2f %10.2f +- %-8.2f (bad trimming)\n", readNo5->numberOfObjects(), (float)readNo5->numberOfObjects()/gkpStore->gkStore_getNumReads()*100, 
readNo5->mean(), readNo5->stddev(), olapNo5->mean(), olapNo5->stddev()); fprintf(LOG, "no-3-prime %7" F_U64P " %6.2f %10.2f +- %-8.2f %10.2f +- %-8.2f (bad trimming)\n", readNo3->numberOfObjects(), (float)readNo3->numberOfObjects()/gkpStore->gkStore_getNumReads()*100, readNo3->mean(), readNo3->stddev(), olapNo3->mean(), olapNo3->stddev()); fprintf(LOG, "\n"); fprintf(LOG, "low-coverage %7" F_U64P " %6.2f %10.2f +- %-8.2f %10.2f +- %-8.2f (easy to assemble, potential for lower quality consensus)\n", readLowCov->numberOfObjects(), (float)readLowCov->numberOfObjects()/gkpStore->gkStore_getNumReads()*100, readLowCov->mean(), readLowCov->stddev(), covrLowCov->mean(), covrLowCov->stddev()); fprintf(LOG, "unique %7" F_U64P " %6.2f %10.2f +- %-8.2f %10.2f +- %-8.2f (easy to assemble, perfect, yay)\n", readUnique->numberOfObjects(), (float)readUnique->numberOfObjects()/gkpStore->gkStore_getNumReads()*100, readUnique->mean(), readUnique->stddev(), covrUnique->mean(), covrUnique->stddev()); fprintf(LOG, "repeat-cont %7" F_U64P " %6.2f %10.2f +- %-8.2f %10.2f +- %-8.2f (potential for consensus errors, no impact on assembly)\n", readRepeatCont->numberOfObjects(), (float)readRepeatCont->numberOfObjects()/gkpStore->gkStore_getNumReads()*100, readRepeatCont->mean(), readRepeatCont->stddev(), covrRepeatCont->mean(), covrRepeatCont->stddev()); fprintf(LOG, "repeat-dove %7" F_U64P " %6.2f %10.2f +- %-8.2f %10.2f +- %-8.2f (hard to assemble, likely won't assemble correctly or even at all)\n", readRepeatDove->numberOfObjects(), (float)readRepeatDove->numberOfObjects()/gkpStore->gkStore_getNumReads()*100, readRepeatDove->mean(), readRepeatDove->stddev(), covrRepeatDove->mean(), covrRepeatDove->stddev()); fprintf(LOG, "\n"); fprintf(LOG, "span-repeat %7" F_U64P " %6.2f %10.2f +- %-8.2f %10.2f +- %-8.2f (read spans a large repeat, usually easy to assemble)\n", readSpanRepeat->numberOfObjects(), (float)readSpanRepeat->numberOfObjects()/gkpStore->gkStore_getNumReads()*100, 
readSpanRepeat->mean(), readSpanRepeat->stddev(), olapSpanRepeat->mean(), olapSpanRepeat->stddev()); fprintf(LOG, "uniq-repeat-cont %7" F_U64P " %6.2f %10.2f +- %-8.2f (should be uniquely placed, low potential for consensus errors, no impact on assembly)\n", readUniqRepeatCont->numberOfObjects(), (float)readUniqRepeatCont->numberOfObjects()/gkpStore->gkStore_getNumReads()*100, readUniqRepeatCont->mean(), readUniqRepeatCont->stddev()); fprintf(LOG, "uniq-repeat-dove %7" F_U64P " %6.2f %10.2f +- %-8.2f (will end contigs, potential to misassemble)\n", readUniqRepeatDove->numberOfObjects(), (float)readUniqRepeatDove->numberOfObjects()/gkpStore->gkStore_getNumReads()*100, readUniqRepeatDove->mean(), readUniqRepeatDove->stddev()); fprintf(LOG, "uniq-anchor %7" F_U64P " %6.2f %10.2f +- %-8.2f %10.2f +- %-8.2f (repeat read, with unique section, probable bad read)\n", readUniqAnchor->numberOfObjects(), (float)readUniqAnchor->numberOfObjects()/gkpStore->gkStore_getNumReads()*100, readUniqAnchor->mean(), readUniqAnchor->stddev(), olapUniqAnchor->mean(), olapUniqAnchor->stddev()); if (toFile == true) fclose(LOG); // Clean up the histograms delete readNoOlaps; delete readHole; delete readHump; delete readNo5; delete readNo3; delete olapHole; delete olapHump; delete olapNo5; delete olapNo3; delete readLowCov; delete readUnique; delete readRepeatCont; delete readRepeatDove; delete readSpanRepeat; delete readUniqRepeatCont; delete readUniqRepeatDove; delete readUniqAnchor; delete covrLowCov; delete covrUnique; delete covrRepeatCont; delete covrRepeatDove; delete covrSpanRepeat; delete covrUniqRepeatCont; delete covrUniqRepeatDove; delete covrUniqAnchor; delete olapLowCov; delete olapUnique; delete olapRepeatCont; delete olapRepeatDove; delete olapSpanRepeat; delete olapUniqRepeatCont; delete olapUniqRepeatDove; delete olapUniqAnchor; delete ovlStore; gkpStore->gkStore_close(); exit(0); } 
canu-1.6/src/stores/ovStoreStats.mk000066400000000000000000000007311314437614700174400ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := ovStoreStats SOURCES := ovStoreStats.C SRC_INCDIRS := .. ../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/stores/ovStoreWriter.C000066400000000000000000000522341314437614700173760ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/stores/ovStore.C * * Modifications by: * * Brian P. Walenz beginning on 2016-OCT-28 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "ovStore.H" void checkAndSaveName(char *storePath, const char *path) { if (path == NULL) fprintf(stderr, "ovStoreWriter::ovStoreWriter()-- ERROR: no name supplied.\n"), exit(1); if ((path[0] == '-') && (path[1] == 0)) fprintf(stderr, "ovStoreWriter::ovStoreWriter()-- ERROR: name cannot be '-' (stdin).\n"), exit(1); memset(storePath, 0, FILENAME_MAX); strncpy(storePath, path, FILENAME_MAX-1); } ovStoreWriter::ovStoreWriter(const char *path, gkStore *gkp) { char name[FILENAME_MAX]; checkAndSaveName(_storePath, path); // Fail if this is a valid ovStore. if (_info.test(_storePath) == true) fprintf(stderr, "ERROR: '%s' is a valid ovStore; cannot create a new one.\n", _storePath), exit(1); // Create the new store AS_UTL_mkdir(_storePath); _info.clear(); _info.save(_storePath); _gkp = gkp; _offtFile = NULL; _offt.clear(); _offm.clear(); _evaluesMap = NULL; _evalues = NULL; _overlapsThisFile = 0; _currentFileIndex = 0; _bof = NULL; // This is used by the sequential store build, so we want to collect stats. _histogram = new ovStoreHistogram(_gkp, ovFileNormalWrite); // Open the index file. snprintf(name, FILENAME_MAX, "%s/index", _storePath); errno = 0; _offtFile = fopen(name, "w"); if (errno) fprintf(stderr, "AS_OVS_createOverlapStore()-- failed to open offset file '%s': %s\n", name, strerror(errno)), exit(1); _overlapsThisFile = 0; _overlapsThisFileMax = 0; // 1024 * 1024 * 1024 / _bof->recordSize(); -- needs a valid _bof, dang. _currentFileIndex = 0; _bof = NULL; _fileLimit = 0; // Used in the parallel store, not here. 
_fileID = 0; _jobIdxMax = 0; } ovStoreWriter::ovStoreWriter(const char *path, gkStore *gkp, uint32 fileLimit, uint32 fileID, uint32 jobIdxMax) { checkAndSaveName(_storePath, path); _gkp = gkp; _offtFile = NULL; _evaluesMap = NULL; _evalues = NULL; _overlapsThisFile = 0; _overlapsThisFileMax = 0; _currentFileIndex = 0; _bof = NULL; _histogram = NULL; _fileLimit = fileLimit; _fileID = fileID; _jobIdxMax = jobIdxMax; }; ovStoreWriter::~ovStoreWriter() { // Write the last index element (don't forget to fill in gaps); // update the info, using the final magic number if (_offt._numOlaps > 0) { for (; _offm._a_iid < _offt._a_iid; _offm._a_iid++) { _offm._fileno = _offt._fileno; _offm._offset = _offt._offset; _offm._numOlaps = 0; AS_UTL_safeWrite(_offtFile, &_offm, "ovStore::~ovStore::offm", sizeof(ovStoreOfft), 1); } AS_UTL_safeWrite(_offtFile, &_offt, "ovStore::~ovStore::offt", sizeof(ovStoreOfft), 1); } _info.save(_storePath, _currentFileIndex); if (_bof) _bof->transferHistogram(_histogram); delete _bof; if (_histogram) _histogram->saveData(_storePath); delete _histogram; fprintf(stderr, "Created ovStore '%s' with " F_U64 " overlaps for reads from " F_U32 " to " F_U32 ".\n", _storePath, _info.numOverlaps(), _info.smallestID(), _info.largestID()); fclose(_offtFile); } void ovStoreWriter::writeOverlap(ovOverlap *overlap) { char name[FILENAME_MAX]; // Make sure overlaps are sorted, failing if not. if (_offt._a_iid > overlap->a_iid) { fprintf(stderr, "LAST: a:" F_U32 "\n", _offt._a_iid); fprintf(stderr, "THIS: a:" F_U32 " b:" F_U32 "\n", overlap->a_iid, overlap->b_iid); } assert(_offt._a_iid <= overlap->a_iid); // If we don't have an output file yet, or the current file is // too big, open a new file. 
if ((_bof) && (_overlapsThisFile >= _overlapsThisFileMax)) { _bof->transferHistogram(_histogram); delete _bof; _bof = NULL; _overlapsThisFile = 0; _overlapsThisFileMax = 0; } if (_bof == NULL) { char name[FILENAME_MAX]; snprintf(name, FILENAME_MAX, "%s/%04d", _storePath, ++_currentFileIndex); _bof = new ovFile(_gkp, name, ovFileNormalWrite); _overlapsThisFile = 0; _overlapsThisFileMax = 1024 * 1024 * 1024 / _bof->recordSize(); } // Put the index to disk, filling any gaps if ((_offt._numOlaps != 0) && (_offt._a_iid != overlap->a_iid)) { while (_offm._a_iid < _offt._a_iid) { _offm._fileno = _offt._fileno; _offm._offset = _offt._offset; _offm._overlapID = _offt._overlapID; // Not needed, but makes life easier AS_UTL_safeWrite(_offtFile, &_offm, "ovStore::writeOverlap::offset", sizeof(ovStoreOfft), 1); _offm._a_iid++; } _offm._a_iid++; // One more, since this iid is not missing -- we write it next! AS_UTL_safeWrite(_offtFile, &_offt, "AS_OVS_writeOverlapToStore offset", sizeof(ovStoreOfft), 1); _offt._numOlaps = 0; // Reset; this new id has no overlaps yet. } // Update the index if this is the first overlap for this a_iid if (_offt._numOlaps == 0) { _offt._a_iid = overlap->a_iid; _offt._fileno = _currentFileIndex; _offt._offset = _overlapsThisFile; _offt._overlapID = _info.numOverlaps(); } _bof->writeOverlap(overlap); _offt._numOlaps++; _info.addOverlap(overlap->a_iid); _overlapsThisFile++; } // For the parallel sort, write a block of sorted overlaps into a single file, with index and info. 
void ovStoreWriter::writeOverlaps(ovOverlap *ovls, uint64 ovlsLen) { char name[FILENAME_MAX]; uint32 currentFileIndex = _fileID; ovStoreInfo info; info.clear(); ovStoreOfft offt; ovStoreOfft offm; offt._a_iid = offm._a_iid = ovls[0].a_iid; offt._fileno = offm._fileno = _fileID; offt._offset = offm._offset = 0; offt._numOlaps = offm._numOlaps = 0; offt._overlapID = offm._overlapID = 0; // Create the output file snprintf(name, FILENAME_MAX, "%s/%04d", _storePath, _fileID); ovFile *bof = new ovFile(_gkp, name, ovFileNormalWrite); // Create the index file snprintf(name, FILENAME_MAX, "%s/%04d.index", _storePath, _fileID); errno = 0; FILE *offtFile = fopen(name, "w"); if (errno) fprintf(stderr, "ERROR: Failed to open '%s' for writing: %s\n", name, strerror(errno)), exit(1); // Dump the overlaps for (uint64 i=0; i<ovlsLen; i++) { bof->writeOverlap(ovls + i); if (offt._a_iid > ovls[i].a_iid) { fprintf(stderr, "LAST: a:" F_U32 "\n", offt._a_iid); fprintf(stderr, "THIS: a:" F_U32 " b:" F_U32 "\n", ovls[i].a_iid, ovls[i].b_iid); } assert(offt._a_iid <= ovls[i].a_iid); // Put the index to disk, filling any gaps if ((offt._numOlaps != 0) && (offt._a_iid != ovls[i].a_iid)) { while (offm._a_iid < offt._a_iid) { offm._fileno = offt._fileno; offm._offset = offt._offset; offm._overlapID = offt._overlapID; // Not needed, but makes life easier offm._numOlaps = 0; AS_UTL_safeWrite(offtFile, &offm, "AS_OVS_writeOverlapToStore offt", sizeof(ovStoreOfft), 1); offm._a_iid++; } // One more, since this iid is not missing -- we write it next! offm._a_iid++; AS_UTL_safeWrite(offtFile, &offt, "AS_OVS_writeOverlapToStore offt", sizeof(ovStoreOfft), 1); offt._overlapID += offt._numOlaps; // The next block of overlaps starts with this ID offt._numOlaps = 0; // The next block has no overlaps yet.
} // Update the index if this is the first overlap for this a_iid if (offt._numOlaps == 0) { offt._a_iid = ovls[i].a_iid; offt._fileno = currentFileIndex; offt._offset = info.numOverlaps(); } offt._numOlaps++; info.addOverlap(ovls[i].a_iid); } // Close the output file. delete bof; // Write the final (empty) index entries. while (offm._a_iid < offt._a_iid) { offm._fileno = offt._fileno; offm._offset = offt._offset; offm._overlapID = offt._overlapID; // Not needed, but makes life easier offm._numOlaps = 0; AS_UTL_safeWrite(offtFile, &offm, "AS_OVS_writeOverlapToStore offt", sizeof(ovStoreOfft), 1); offm._a_iid++; } // And the final (real) index entry. We could, but don't need to, update overlapID with the // number of overlaps in this block. AS_UTL_safeWrite(offtFile, &offt, "AS_OVS_writeOverlapToStore offt", sizeof(ovStoreOfft), 1); fclose(offtFile); // Write the info, and some stats for the user. info.save(_storePath, _fileID, true); fprintf(stderr, " created '%s/%04d' with " F_U64 " overlaps for reads " F_U32 " to " F_U32 ".\n", _storePath, _fileID, info.numOverlaps(), info.smallestID(), info.largestID()); } // For the parallel sort, but also generally applicable, test that the index is sane. bool ovStoreWriter::testIndex(bool doFixes) { char name[FILENAME_MAX]; FILE *I = NULL; FILE *F = NULL; // Open the input index. snprintf(name, FILENAME_MAX, "%s/index", _storePath); errno = 0; I = fopen(name, "r"); if (errno) fprintf(stderr, "ERROR: Failed to open '%s' for reading: %s\n", name, strerror(errno)), exit(1); // If we're fixing, open the output index. 
if (doFixes) { snprintf(name, FILENAME_MAX, "%s/index.fixed", _storePath); errno = 0; F = fopen(name, "w"); if (errno) fprintf(stderr, "ERROR: Failed to open '%s' for writing: %s\n", name, strerror(errno)), exit(1); } ovStoreOfft O; uint32 curIID = 0; uint32 minIID = UINT32_MAX; uint32 maxIID = 0; uint32 nErrs = 0; while (1 == AS_UTL_safeRead(I, &O, "offset", sizeof(ovStoreOfft), 1)) { bool maxIncreases = (maxIID < O._a_iid); bool errorDecreased = ((O._a_iid < curIID)); bool errorGap = ((O._a_iid > 0) && (curIID + 1 != O._a_iid)); if (O._a_iid < minIID) minIID = O._a_iid; if (maxIncreases) maxIID = O._a_iid; if (errorDecreased) fprintf(stderr, "ERROR: index decreased from " F_U32 " to " F_U32 "\n", curIID, O._a_iid), nErrs++; else if (errorGap) fprintf(stderr, "ERROR: gap between " F_U32 " and " F_U32 "\n", curIID, O._a_iid), nErrs++; if ((maxIncreases == true) && (errorGap == false)) { if (doFixes) AS_UTL_safeWrite(F, &O, "offset", sizeof(ovStoreOfft), 1); } else if (O._numOlaps > 0) { fprintf(stderr, "ERROR: lost overlaps a_iid " F_U32 " fileno " F_U32 " offset " F_U32 " numOlaps " F_U32 "\n", O._a_iid, O._fileno, O._offset, O._numOlaps); } curIID = O._a_iid; } fclose(I); if (F) fclose(F); return(nErrs == 0); } // For the parallel sort, merge index and info files into one, clean up the intermediates. void ovStoreWriter::mergeInfoFiles(void) { ovStoreInfo infopiece; ovStoreInfo info; info.clear(); ovStoreOfft offm; offm._a_iid = 0; offm._fileno = 1; offm._offset = 0; offm._numOlaps = 0; offm._overlapID = 0; // Open the new master index output file char name[FILENAME_MAX]; snprintf(name, FILENAME_MAX, "%s/index", _storePath); errno = 0; FILE *idx = fopen(name, "w"); if (errno) fprintf(stderr, "ERROR: Failed to open '%s': %s\n", name, strerror(errno)), exit(1); // Special case, we need an empty index for the zeroth fragment. 
AS_UTL_safeWrite(idx, &offm, "ovStore::mergeInfoFiles::offsetZero", sizeof(ovStoreOfft), 1); // Sanity checking, compare the number of overlaps processed against the overlapID // of each ovStoreOfft. uint64 totalOverlaps = 0; // Process each for (uint32 i=1; i<=_fileLimit; i++) { fprintf(stderr, "Processing piece " F_U32 "\n", i); infopiece.load(_storePath, i, true); if (infopiece.numOverlaps() == 0) { fprintf(stderr, " No overlaps found.\n"); continue; } // Add empty index elements for missing overlaps if (info.largestID() + 1 < infopiece.smallestID()) fprintf(stderr, " Adding empty records for fragments " F_U32 " to " F_U32 "\n", info.largestID() + 1, infopiece.smallestID() - 1); while (info.largestID() + 1 < infopiece.smallestID()) { offm._a_iid = info.largestID() + 1; //offm._fileno = set below, where the recs are written to the master file //offm._offset = set below, where the recs are written to the master file AS_UTL_safeWrite(idx, &offm, "ovStore::mergeInfoFiles::offsets", sizeof(ovStoreOfft), 1); info.addOverlap(offm._a_iid, 0); } // Copy index elements for existing overlaps. While copying, update the supposed position // of any fragments with no overlaps. Without doing this, accessing the store beginning // or ending at such a fragment will fail. { snprintf(name, FILENAME_MAX, "%s/%04d.index", _storePath, i); errno = 0; FILE *F = fopen(name, "r"); if (errno) fprintf(stderr, "ERROR: Failed to open '%s': %s\n", name, strerror(errno)), exit(1); uint32 recsLen = 0; uint32 recsMax = 1024 * 1024; ovStoreOfft *recs = new ovStoreOfft [recsMax]; recsLen = AS_UTL_safeRead(F, recs, "ovStore::mergeInfoFiles::offsetsLoad", sizeof(ovStoreOfft), recsMax); if (recsLen > 0) { if (info.largestID() + 1 != recs[0]._a_iid) fprintf(stderr, "ERROR: '%s' starts with iid " F_U32 ", but store only up to " F_U32 "\n", name, recs[0]._a_iid, info.largestID()); assert(info.largestID() + 1 == recs[0]._a_iid); } while (recsLen > 0) { // Update location of missing reads.
offm._fileno = recs[recsLen-1]._fileno; offm._offset = recs[recsLen-1]._offset; // Update overlapID for each record. for (uint32 rr=0; rr<recsLen; rr++) { recs[rr]._overlapID += info.numOverlaps(); if (recs[rr]._numOlaps > 0) assert(recs[rr]._overlapID == totalOverlaps); totalOverlaps += recs[rr]._numOlaps; } // Write the records, read next batch AS_UTL_safeWrite(idx, recs, "ovStore::mergeInfoFiles::offsetsWrite", sizeof(ovStoreOfft), recsLen); recsLen = AS_UTL_safeRead(F, recs, "ovStore::mergeInfoFiles::offsetsReLoad", sizeof(ovStoreOfft), recsMax); } delete [] recs; fclose(F); } // Update the info block to include the overlaps we just added info.addOverlap(infopiece.smallestID(), 0); info.addOverlap(infopiece.largestID(), infopiece.numOverlaps()); fprintf(stderr, " Now finished with fragments " F_U32 " to " F_U32 " -- " F_U64 " overlaps.\n", info.smallestID(), info.largestID(), info.numOverlaps()); } fclose(idx); // Dump the new store info file info.save(_storePath, _fileLimit); fprintf(stderr, "Created ovStore '%s' with " F_U64 " overlaps for reads from " F_U32 " to " F_U32 ".\n", _storePath, info.numOverlaps(), info.smallestID(), info.largestID()); } void ovStoreWriter::mergeHistogram(void) { char name[FILENAME_MAX]; ovStoreHistogram *histogram = new ovStoreHistogram; for (uint32 i=1; i<=_fileLimit; i++) { snprintf(name, FILENAME_MAX, "%s/%04d", _storePath, i); histogram->loadData(name); } histogram->saveData(_storePath); delete histogram; } uint64 ovStoreWriter::loadBucketSizes(uint64 *bucketSizes) { char namz[FILENAME_MAX]; char name[FILENAME_MAX]; uint64 *sliceSizes = new uint64 [_fileLimit + 1]; // For each overlap job, number of overlaps per bucket uint64 totOvl = 0; for (uint32 i=0; i<=_jobIdxMax; i++) { bucketSizes[i] = 0; snprintf(name, FILENAME_MAX, "%s/bucket%04d/slice%04d", _storePath, i, _fileID); snprintf(namz, FILENAME_MAX, "%s/bucket%04d/slice%04d.gz", _storePath, i, _fileID); // If no file, there are no overlaps. Skip loading the bucketSizes file.
// With snappy compression, we expect the file not to be gzip compressed, but will happily // accept a gzipped file. if ((AS_UTL_fileExists(name, FALSE, FALSE) == false) && (AS_UTL_fileExists(namz, FALSE, FALSE) == false)) continue; snprintf(name, FILENAME_MAX, "%s/bucket%04d/sliceSizes", _storePath, i); errno = 0; FILE *F = fopen(name, "r"); if (errno) fprintf(stderr, "ERROR: Failed to open %s: %s\n", name, strerror(errno)), exit(1); uint64 nr = AS_UTL_safeRead(F, sliceSizes, "sliceSizes", sizeof(uint64), _fileLimit + 1); fclose(F); if (nr != _fileLimit + 1) { fprintf(stderr, "ERROR: short read on '%s'.\n", name); fprintf(stderr, "ERROR: read " F_U64 " sizes instead of " F_U32 ".\n", nr, _fileLimit + 1); } assert(nr == _fileLimit + 1); fprintf(stderr, " found %10" F_U64P " overlaps in '%s'.\n", sliceSizes[_fileID], name); bucketSizes[i] = sliceSizes[_fileID]; totOvl += sliceSizes[_fileID]; } delete [] sliceSizes; return(totOvl); } void ovStoreWriter::loadOverlapsFromSlice(uint32 slice, uint64 expectedLen, ovOverlap *ovls, uint64& ovlsLen) { char name[FILENAME_MAX]; if (expectedLen == 0) return; snprintf(name, FILENAME_MAX, "%s/bucket%04d/slice%04d", _storePath, slice, _fileID); if (AS_UTL_fileExists(name, FALSE, FALSE) == false) { snprintf(name, FILENAME_MAX, "%s/bucket%04d/slice%04d.gz", _storePath, slice, _fileID); if (AS_UTL_fileExists(name, FALSE, FALSE) == false) fprintf(stderr, "ERROR: " F_U64 " overlaps claim to exist in bucket '%s', but file not found.\n", expectedLen, name); } fprintf(stderr, " loading %10" F_U64P " overlaps from '%s'.\n", expectedLen, name); ovFile *bof = new ovFile(_gkp, name, ovFileFull); uint64 num = 0; while (bof->readOverlap(ovls + ovlsLen)) { ovlsLen++; num++; } if (num != expectedLen) fprintf(stderr, "ERROR: expected " F_U64 " overlaps, found " F_U64 " overlaps.\n", expectedLen, num); assert(num == expectedLen); delete bof; } void ovStoreWriter::removeOverlapSlice(void) { char name[FILENAME_MAX]; for (uint32 i=0; i<=_jobIdxMax; i++) {
snprintf(name, FILENAME_MAX, "%s/bucket%04d/slice%04d.gz", _storePath, i, _fileID); AS_UTL_unlink(name); snprintf(name, FILENAME_MAX, "%s/bucket%04d/slice%04d", _storePath, i, _fileID); AS_UTL_unlink(name); } } void ovStoreWriter::checkSortingIsComplete(void) { char nameD[FILENAME_MAX]; char nameF[FILENAME_MAX]; char nameI[FILENAME_MAX]; uint32 failedJobs = 0; for (uint32 i=1; i<=_fileLimit; i++) { snprintf(nameD, FILENAME_MAX, "%s/%04d", _storePath, i); snprintf(nameF, FILENAME_MAX, "%s/%04d.info", _storePath, i); snprintf(nameI, FILENAME_MAX, "%s/%04d.index", _storePath, i); bool existD = AS_UTL_fileExists(nameD, FALSE, FALSE); bool existF = AS_UTL_fileExists(nameF, FALSE, FALSE); bool existI = AS_UTL_fileExists(nameI, FALSE, FALSE); if (existD && existF && existI) continue; failedJobs++; if (existD == false) fprintf(stderr, "ERROR: Segment " F_U32 " data not present (%s)\n", i, nameD); if (existF == false) fprintf(stderr, "ERROR: Segment " F_U32 " info not present (%s)\n", i, nameF); if (existI == false) fprintf(stderr, "ERROR: Segment " F_U32 " index not present (%s)\n", i, nameI); } if (failedJobs > 0) fprintf(stderr, "ERROR: " F_U32 " segments, out of " F_U32 ", failed.\n", failedJobs, _fileLimit), exit(1); } void ovStoreWriter::removeAllIntermediateFiles(void) { char name[FILENAME_MAX]; // Removing indices and histogram data is easy, because we know how many there are. for (uint32 i=1; i<=_fileLimit; i++) { snprintf(name, FILENAME_MAX, "%s/%04u.index", _storePath, i); AS_UTL_unlink(name); snprintf(name, FILENAME_MAX, "%s/%04u.info", _storePath, i); AS_UTL_unlink(name); snprintf(name, FILENAME_MAX, "%s/%04d", _storePath, i); ovStoreHistogram::removeData(name); } // We don't know how many buckets there are, so we remove until we fail to find ten // buckets in a row.
for (uint32 missing=0, i=1; missing<10; i++, missing++) { snprintf(name, FILENAME_MAX, "%s/bucket%04d", _storePath, i); if (AS_UTL_fileExists(name, false, false) == false) continue; snprintf(name, FILENAME_MAX, "%s/bucket%04d/sliceSizes", _storePath, i); AS_UTL_unlink(name); snprintf(name, FILENAME_MAX, "%s/bucket%04d", _storePath, i); rmdir(name); missing = 0; } } canu-1.6/src/stores/tgStore.C000066400000000000000000000500021314437614700161560ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_CNS/MultiAlignStore.C * * Modifications by: * * Brian P. Walenz from 2009-OCT-05 to 2014-MAR-31 * are Copyright 2009-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren on 2010-FEB-05 * are Copyright 2010 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-22 to 2015-AUG-11 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-29 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_global.H" #include "AS_UTL_fileIO.H" #include "tgStore.H" uint32 MASRmagic = 0x5253414d; // 'MASR', as a big endian integer uint32 MASRversion = 1; #define MAX_VERS 1024 // Linked to 10 bits in the header file. tgStore::tgStore(const char *path_, uint32 version_, tgStoreType type_) { // Handle goofy default parameters. These let us retain the previous behavior (before tgStoreType): // new tgStore("path") - to create a new store // new tgStore("path", v) - to open an existing store // // And still allow // new tgStore("path", v, tgStoreCreate) - create new store, make v the current version if (version_ == 0) { version_ = 1; type_ = tgStoreCreate; } // Initialize the object. _type = type_; _path[FILENAME_MAX-1] = 0; strncpy(_path, path_, FILENAME_MAX-1); _newTigs = false; _currentVersion = version_; _originalVersion = version_; _tigMax = 0; _tigLen = 0; _tigEntry = NULL; _tigCache = NULL; _dataFile = new dataFileT [MAX_VERS]; for (uint32 i=0; i<MAX_VERS; i++) { _dataFile[i].FP = NULL; _dataFile[i].atEOF = false; } /* ... remainder of the constructor, and any code between it and writeTigToDisk(), was lost in extraction ... */ } void tgStore::writeTigToDisk(tgTig *tig, tgStoreEntry *te) { FILE *FP = openDB(te->svID); // The atEOF flag allows us to skip a seek when we're already (supposed) to be at the EOF. This // (hopefully) fixes a problem on one system where the seek() was placing the FP just before EOF // (almost like the last block of data wasn't being flushed), and the tell() would then place the // next tig in the middle of the previous one. // // It also should (greatly) improve performance over NFS, especially during BOG and CNS. Both of // these only write data, so no repositioning of the stream is needed. // if (_dataFile[te->svID].atEOF == false) { AS_UTL_fseek(FP, 0, SEEK_END); _dataFile[te->svID].atEOF = true; } te->flushNeeded = 0; te->fileOffset = AS_UTL_ftell(FP); //fprintf(stderr, "tgStore::writeTigToDisk()-- write tig " F_S32 " in store version " F_U64 " at file position " F_U64 "\n", // tig->_tigID, te->svID, te->fileOffset); tig->saveToStream(FP); } void tgStore::insertTig(tgTig *tig, bool keepInCache) { // Check that the components do not exceed the bound.
// if (tig->_gappedLen > 0) { uint32 len = tig->_gappedLen; uint32 swp = 0; uint32 neg = 0; uint32 pos = 0; for (uint32 i=0; i<tig->_childrenLen; i++) { tgPosition *read = tig->_children + i; if ((read->_max < read->_min)) fprintf(stderr, "tgStore::insertTig()-- ERROR: tig %d read %d at (%d,%d) has swapped min/max coordinates\n", tig->_tigID, read->_objID, read->_min, read->_max), swp++; // Could fix, but we currently just fail. This is an algorithmic problem that should be fixed. if ((read->_min < 0) || (read->_max < 0)) fprintf(stderr, "tgStore::insertTig()-- WARNING: tig %d read %d at (%d,%d) has negative position\n", tig->_tigID, read->_objID, read->_min, read->_max), neg++; if (read->_min < 0) read->_min = 0; if (read->_max < 0) read->_max = 0; if ((read->_min > len) || (read->_max > len)) fprintf(stderr, "tgStore::insertTig()-- WARNING: tig %d read %d at (%d,%d) exceeded multialign length %d\n", tig->_tigID, read->_objID, read->_min, read->_max, len), pos++; if (read->_min > len) read->_min = len; if (read->_max > len) read->_max = len; } #if 0 if (swp + neg + pos > 0) { tig->dumpLayout(stderr); fprintf(stderr, "tgStore::insertTig()-- ERROR: tig %d has invalid layout, exceeds bounds of consensus sequence (length %d) -- neg=%d pos=%d -- swp=%d.\n", tig->_tigID, len, neg, pos, swp); } #endif assert(swp == 0); //assert(neg == 0); //assert(pos == 0); } if (tig->_tigID == UINT32_MAX) { tig->_tigID = _tigLen; _newTigs = true; fprintf(stderr, "tgStore::insertTig()-- Added new tig %d\n", tig->_tigID); } if (_tigMax <= tig->_tigID) { while (_tigMax <= tig->_tigID) _tigMax = (_tigMax == 0) ?
(1024) : (2 * _tigMax); assert(tig->_tigID < _tigMax); tgStoreEntry *nr = new tgStoreEntry [_tigMax]; tgTig **nc = new tgTig * [_tigMax]; memcpy(nr, _tigEntry, sizeof(tgStoreEntry) * _tigLen); memcpy(nc, _tigCache, sizeof(tgTig *) * _tigLen); memset(nr + _tigLen, 0, sizeof(tgStoreEntry) * (_tigMax - _tigLen)); memset(nc + _tigLen, 0, sizeof(tgTig *) * (_tigMax - _tigLen)); for (uint32 xx=_tigLen; xx<_tigMax; xx++) { nr[xx].isDeleted = true; // Deleted until it gets added, otherwise we try to load and fail. nc[xx] = NULL; } delete [] _tigEntry; delete [] _tigCache; _tigEntry = nr; _tigCache = nc; } _tigLen = MAX(_tigLen, tig->_tigID + 1); _tigEntry[tig->_tigID].tigRecord = *tig; _tigEntry[tig->_tigID].unusedFlags = 0; _tigEntry[tig->_tigID].flushNeeded = true; // Mark as needing a flush by default _tigEntry[tig->_tigID].isDeleted = false; // Now really here! _tigEntry[tig->_tigID].svID = _currentVersion; _tigEntry[tig->_tigID].fileOffset = 123456789; // Write to disk RIGHT NOW unless we're keeping it in cache. If it is written, the flushNeeded // flag is cleared. // if ((keepInCache == false) && (_type != tgStoreReadOnly)) writeTigToDisk(tig, _tigEntry + tig->_tigID); // If the cache is different from this tig, delete the cache. Not sure why this happens -- // did we copy a tig, muck with it, and then want to replace the one in the store? // if ((_tigCache[tig->_tigID] != tig) && (_tigCache[tig->_tigID] != NULL)) { delete _tigCache[tig->_tigID]; _tigCache[tig->_tigID] = NULL; } // Cache it if requested, otherwise clear the cache. // _tigCache[tig->_tigID] = (keepInCache) ? 
tig : NULL; } void tgStore::deleteTig(uint32 tigID) { assert(tigID < _tigLen); flushDisk(tigID); assert(_tigEntry[tigID].flushNeeded == 0); assert(_tigEntry[tigID].isDeleted == 0); _tigEntry[tigID].isDeleted = 1; delete _tigCache[tigID]; _tigCache[tigID] = NULL; } tgTig * tgStore::loadTig(uint32 tigID) { bool cantLoad = true; if (_tigLen <= tigID) fprintf(stderr, "tgStore::loadTig()-- WARNING: invalid out-of-range tigID " F_U32 ", only " F_U32 " tigs in store; return NULL.\n", tigID, _tigLen); assert(tigID < _tigLen); //fprintf(stderr, "tgStore::loadTig()-- Loading tig %u (out of %u) from version %u at offset %lu\n", // tigID, _tigLen, // _tigEntry[tigID].svID, // _tigEntry[tigID].fileOffset); // This is...and is not...an error. It does indicate something didn't go according to plan, like // loading a tig that doesn't exist (that should be caught by the above 'tigID < _tigLen' // assert). if (_tigEntry[tigID].isDeleted == true) return(NULL); // This _is_ an error. If a tig is in version zero, it isn't in the store at all. // Someone did something stupid when adding tigs. if (_tigEntry[tigID].svID == 0) return(NULL); // Otherwise, we can load something. if (_tigCache[tigID] == NULL) { FILE *FP = openDB(_tigEntry[tigID].svID); // Since the tig isn't in the cache, it had better NOT be marked as needing to be flushed! assert(_tigEntry[tigID].flushNeeded == false); // Seek to the correct position, and reset the atEOF to indicate we're (with high probability) // not at EOF anymore. if (_dataFile[_tigEntry[tigID].svID].atEOF == true) { fflush(FP); _dataFile[_tigEntry[tigID].svID].atEOF = false; } AS_UTL_fseek(FP, _tigEntry[tigID].fileOffset, SEEK_SET); _tigCache[tigID] = new tgTig; if (_tigCache[tigID]->loadFromStream(FP) == false) fprintf(stderr, "Failed to load tig %u.\n", tigID), exit(1); // ALWAYS assume the incore record is more up to date *_tigCache[tigID] = _tigEntry[tigID].tigRecord; // Since we just loaded, no flush is needed.
_tigEntry[tigID].flushNeeded = 0; } return(_tigCache[tigID]); } void tgStore::unloadTig(uint32 tigID, bool discardChanges) { if (discardChanges) _tigEntry[tigID].flushNeeded = 0; flushDisk(tigID); assert(_tigEntry[tigID].flushNeeded == 0); delete _tigCache[tigID]; _tigCache[tigID] = NULL; } void tgStore::copyTig(uint32 tigID, tgTig *tigcopy) { assert(tigID < _tigLen); // Deleted? Clear it and return. if (_tigEntry[tigID].isDeleted) { tigcopy->clear(); return; } // In the cache? Deep copy it and return. if (_tigCache[tigID]) { *tigcopy = *_tigCache[tigID]; return; } // Otherwise, load from disk. FILE *FP = openDB(_tigEntry[tigID].svID); // Seek to the correct position, and reset the atEOF to indicate we're (with high probability) // not at EOF anymore. if (_dataFile[_tigEntry[tigID].svID].atEOF == true) { fflush(FP); _dataFile[_tigEntry[tigID].svID].atEOF = false; } AS_UTL_fseek(FP, _tigEntry[tigID].fileOffset, SEEK_SET); tigcopy->clear(); if (tigcopy->loadFromStream(FP) == false) fprintf(stderr, "Failed to load tig %u.\n", tigID), exit(1); // ALWAYS assume the incore record is more up to date *tigcopy = _tigEntry[tigID].tigRecord; } void tgStore::flushDisk(uint32 tigID) { if (_tigEntry[tigID].flushNeeded == 0) return; writeTigToDisk(_tigCache[tigID], _tigEntry+tigID); } void tgStore::flushDisk(void) { for (uint32 tigID=0; tigID<_tigLen; tigID++) if ((_tigCache[tigID]) && (_tigEntry[tigID].flushNeeded)) flushDisk(tigID); } void tgStore::flushCache(void) { flushDisk(); for (uint32 i=0; i<_tigLen; i++) if (_tigCache[i]) { delete _tigCache[i]; _tigCache[i] = NULL; } } uint32 tgStore::numTigsInMASRfile(char *name) { uint32 MASRmagicInFile = 0; uint32 MASRversionInFile = 0; uint32 MASRtotalInFile = 0; uint32 indxLen = 0; uint32 masrLen = 0; if (AS_UTL_fileExists(name, false, false) == false) return(0); errno = 0; FILE *F = fopen(name, "r"); if (errno) fprintf(stderr, "tgStore::numTigsInMASRfile()-- Failed to open '%s': %s\n", name, strerror(errno)), exit(1); 
AS_UTL_safeRead(F, &MASRmagicInFile, "MASRmagic", sizeof(uint32), 1); AS_UTL_safeRead(F, &MASRversionInFile, "MASRversion", sizeof(uint32), 1); AS_UTL_safeRead(F, &MASRtotalInFile, "MASRtotal", sizeof(uint32), 1); fclose(F); if (MASRmagicInFile != MASRmagic) { fprintf(stderr, "tgStore::numTigsInMASRfile()-- Failed to open '%s': magic number mismatch; file=0x%08x code=0x%08x\n", name, MASRmagicInFile, MASRmagic); exit(1); } if (MASRversionInFile != MASRversion) { fprintf(stderr, "tgStore::numTigsInMASRfile()-- Failed to open '%s': version number mismatch; file=%d code=%d\n", name, MASRversionInFile, MASRversion); exit(1); } return(MASRtotalInFile); } void tgStore::dumpMASR(tgStoreEntry* &R, uint32& L, uint32 V) { snprintf(_name, FILENAME_MAX, "%s/seqDB.v%03d.tig", _path, V); errno = 0; FILE *F = fopen(_name, "w"); if (errno) fprintf(stderr, "tgStore::dumpMASR()-- Failed to create '%s': %s\n", _name, strerror(errno)), exit(1); AS_UTL_safeWrite(F, &MASRmagic, "MASRmagic", sizeof(uint32), 1); AS_UTL_safeWrite(F, &MASRversion, "MASRversion", sizeof(uint32), 1); AS_UTL_safeWrite(F, &L, "MASRtotal", sizeof(uint32), 1); uint32 indxLen = 0; //fprintf(stderr, "tgStore::dumpMASR()-- Writing '%s' (indxLen=%d masrLen=%d).\n", _name, indxLen, L); // The max isn't written. On load, max is set to length. AS_UTL_safeWrite(F, &indxLen, "MASRindexLen", sizeof(uint32), 1); AS_UTL_safeWrite(F, &L, "MASRlen", sizeof(uint32), 1); AS_UTL_safeWrite(F, R, "MASR", sizeof(tgStoreEntry), L); fclose(F); } void tgStore::loadMASR(tgStoreEntry* &R, uint32& L, uint32& M, uint32 V) { // Allocate space for the data. Search for the latest version, ask it how many tigs are in the // store. // // We don't always need to do this, sometimes we're called to update the data with newer data. 
// if (R == NULL) { for (int32 i=V; i>0; i--) { snprintf(_name, FILENAME_MAX, "%s/seqDB.v%03d.tig", _path, i); L = numTigsInMASRfile(_name); if (L > 0) break; } M = L + 1024; R = new tgStoreEntry [M]; memset(R, 0, sizeof(tgStoreEntry) * M); } snprintf(_name, FILENAME_MAX, "%s/seqDB.v%03d.tig", _path, V); while ((AS_UTL_fileExists(_name) == false) && (V > 0)) { V--; snprintf(_name, FILENAME_MAX, "%s/seqDB.v%03d.tig", _path, V); } if (V == 0) fprintf(stderr, "tgStore::loadMASR()-- Failed to find any tigs in store '%s'.\n", _path), exit(1); errno = 0; FILE *F = fopen(_name, "r"); if (errno) fprintf(stderr, "tgStore::loadMASR()-- Failed to open '%s': %s\n", _name, strerror(errno)), exit(1); uint32 MASRmagicInFile = 0; uint32 MASRversionInFile = 0; uint32 MASRtotalInFile = 0; uint32 indxLen = 0; uint32 masrLen = 0; AS_UTL_safeRead(F, &MASRmagicInFile, "MASRmagic", sizeof(uint32), 1); AS_UTL_safeRead(F, &MASRversionInFile, "MASRversion", sizeof(uint32), 1); AS_UTL_safeRead(F, &MASRtotalInFile, "MASRtotal", sizeof(uint32), 1); AS_UTL_safeRead(F, &indxLen, "MASRindxLen", sizeof(uint32), 1); AS_UTL_safeRead(F, &masrLen, "MASRmasrLen", sizeof(uint32), 1); if (MASRmagicInFile != MASRmagic) { fprintf(stderr, "tgStore::loadMASR()-- Failed to open '%s': magic number mismatch; file=0x%08x code=0x%08x\n", _name, MASRmagicInFile, MASRmagic); exit(1); } if (MASRversionInFile != MASRversion) { fprintf(stderr, "tgStore::loadMASR()-- Failed to open '%s': version number mismatch; file=%d code=%d\n", _name, MASRversionInFile, MASRversion); exit(1); } // Check we're consistent. 
if (L < MASRtotalInFile) fprintf(stderr, "tgStore::loadMASR()-- '%s' has more tigs (" F_U32 ") than expected (" F_U32 ").\n", _name, MASRtotalInFile, L), exit(1); AS_UTL_safeRead(F, R, "MASR", sizeof(tgStoreEntry), masrLen); fclose(F); } FILE * tgStore::openDB(uint32 version) { if (_dataFile[version].FP) return(_dataFile[version].FP); // Load the data snprintf(_name, FILENAME_MAX, "%s/seqDB.v%03d.dat", _path, version); // If version is the _currentVersion, open for writing if allowed. // // "a+" technically writes (always) to the end of file, but this hasn't been tested. errno = 0; if ((_type != tgStoreReadOnly) && (version == _currentVersion)) { _dataFile[version].FP = fopen(_name, "a+"); _dataFile[version].atEOF = false; } else { _dataFile[version].FP = fopen(_name, "r"); _dataFile[version].atEOF = false; } if (errno) fprintf(stderr, "tgStore::openDB()-- Failed to open '%s': %s\n", _name, strerror(errno)), exit(1); return(_dataFile[version].FP); } canu-1.6/src/stores/tgStore.H000066400000000000000000000204421314437614700161700ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_CNS/MultiAlignStore.H * src/AS_CNS/MultiAlignStore.h * * Modifications by: * * Brian P. Walenz from 2009-OCT-05 to 2014-MAR-31 * are Copyright 2009-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. 
Walenz from 2014-DEC-22 to 2015-AUG-11 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-29 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef TGSTORE_H #define TGSTORE_H #include "AS_global.H" #include "tgTig.H" // // The tgStore is a disk-resident (with memory cache) database of tgTig structures. // // There are two basic modes of operation: // open a store for reading version v // open a store for reading version v, and writing to version v+1, erasing v+1 before starting // open a store for reading version v, and writing to version v+1, preserving the contents // open a store for reading version v, and writing to version v, preserving the contents // enum tgStoreType { // writable inplace append tgStoreCreate = 0, // Make a new one, then become tgStoreWrite tgStoreReadOnly = 1, // false * * - open version v for reading; inplace=append=false in the code tgStoreWrite = 2, // true false false - open version v+1 for writing, purge contents of v+1; standard open for writing tgStoreAppend = 3, // true false true - open version v+1 for writing, do not purge contents tgStoreModify = 4, // true true false - open version v for writing, do not purge contents }; class tgStore { public: tgStore(const char *path, uint32 version = 0, tgStoreType type = tgStoreReadOnly); ~tgStore(); // Update to the next version. // void nextVersion(void); // Add or update a MA in the store. If keepInCache, we keep a pointer to the tgTig. THE // STORE NOW OWNS THE OBJECT. // void insertTig(tgTig *ma, bool keepInCache); // delete() removes the tig from the cache, and marks it as deleted in the store. // void deleteTig(uint32 tigID); // load() will load and cache the MA. THE STORE OWNS THIS OBJECT. 
// copy() will load and copy the MA. It will not cache. YOU OWN THIS OBJECT. // tgTig *loadTig(uint32 tigID); void unloadTig(uint32 tigID, bool discardChanges=false); void copyTig(uint32 tigID, tgTig *ma); // Flush to disk any cached MAs. This is called by flushCache(). // void flushDisk(uint32 tigID); void flushDisk(void); // Flush the cache of loaded MAs. Be aware that this is expensive in that the flushed things // usually just get loaded back into core. // void flushCache(uint32 tigID, bool discard=false) { unloadTig(tigID, discard); }; void flushCache(void); uint32 numTigs(void) { return(_tigLen); }; // Accessors to tig data; these do not load the tig from disk. bool isDeleted(uint32 tigID); double getCoverageStat(uint32 tigID); double getMicroHetProb(uint32 tigID); tgTig_class getClass(uint32 tigID); bool getSuggestRepeat(uint32 tigID); bool getSuggestCircular(uint32 tigID); uint32 getNumChildren(uint32 tigID); void setCoverageStat(uint32 tigID, double cs); void setMicroHetProb(uint32 tigID, double mp); void setClass(uint32 tigID, tgTig_class c); void setSuggestRepeat(uint32 tigID, bool enable=true); void setSuggestCircular(uint32 tigID, bool enable=true); uint32 getVersion(uint32 tigID); private: struct tgStoreEntry { tgTigRecord tigRecord; uint64 unusedFlags : 12; // One whole bit for future use. uint64 flushNeeded : 1; // If true, this MAR and associated tig are NOT saved to disk. uint64 isDeleted : 1; // If true, this MAR has been deleted from the assembly. 
uint64 svID : 10; // 10 -> 1024 versions (HARDCODED in tgStore.C) uint64 fileOffset : 40; // 40 -> 1 TB file size; offset in file where MA is stored }; void writeTigToDisk(tgTig *ma, tgStoreEntry *maRecord); uint32 numTigsInMASRfile(char *name); void dumpMASR(tgStoreEntry* &R, uint32& L, uint32 V); void loadMASR(tgStoreEntry* &R, uint32& L, uint32& M, uint32 V); void purgeVersion(uint32 version); void purgeCurrentVersion(void); friend void operationCompress(char *tigName, int tigVers); FILE *openDB(uint32 V); char _path[FILENAME_MAX+1]; // Path to the store. char _name[FILENAME_MAX+1]; // Name of the currently opened file, and other uses. tgStoreType _type; bool _newTigs; // internal flag, set if tigs were added uint32 _originalVersion; // Version we started from (see newTigs in code) uint32 _currentVersion; // Version we are writing to uint32 _tigMax; uint32 _tigLen; tgStoreEntry *_tigEntry; tgTig **_tigCache; struct dataFileT { FILE *FP; bool atEOF; }; dataFileT *_dataFile; // dataFile[version] }; inline bool tgStore::isDeleted(uint32 tigID) { return(_tigEntry[tigID].isDeleted); } inline double tgStore::getCoverageStat(uint32 tigID) { assert(tigID < _tigLen); return(_tigEntry[tigID].tigRecord._coverageStat); } inline double tgStore::getMicroHetProb(uint32 tigID) { assert(tigID < _tigLen); return(_tigEntry[tigID].tigRecord._microhetProb); } inline tgTig_class tgStore::getClass(uint32 tigID) { assert(tigID < _tigLen); return(_tigEntry[tigID].tigRecord._class); } inline bool tgStore::getSuggestRepeat(uint32 tigID) { assert(tigID < _tigLen); return(_tigEntry[tigID].tigRecord._suggestRepeat); } inline bool tgStore::getSuggestCircular(uint32 tigID) { assert(tigID < _tigLen); return(_tigEntry[tigID].tigRecord._suggestCircular); } inline uint32 tgStore::getNumChildren(uint32 tigID) { return(_tigEntry[tigID].tigRecord._childrenLen); } inline void tgStore::setCoverageStat(uint32 tigID, double cs) { assert(tigID < _tigLen); _tigEntry[tigID].tigRecord._coverageStat = cs; if 
(_tigCache[tigID]) _tigCache[tigID]->_coverageStat = cs; } inline void tgStore::setMicroHetProb(uint32 tigID, double mp) { assert(tigID < _tigLen); _tigEntry[tigID].tigRecord._microhetProb = mp; if (_tigCache[tigID]) _tigCache[tigID]->_microhetProb = mp; } inline void tgStore::setClass(uint32 tigID, tgTig_class c) { assert(tigID < _tigLen); _tigEntry[tigID].tigRecord._class = c; if (_tigCache[tigID]) _tigCache[tigID]->_class = c; } inline void tgStore::setSuggestRepeat(uint32 tigID, bool enable) { assert(tigID < _tigLen); _tigEntry[tigID].tigRecord._suggestRepeat = enable; if (_tigCache[tigID]) _tigCache[tigID]->_suggestRepeat = enable; } inline void tgStore::setSuggestCircular(uint32 tigID, bool enable) { assert(tigID < _tigLen); _tigEntry[tigID].tigRecord._suggestCircular = enable; if (_tigCache[tigID]) _tigCache[tigID]->_suggestCircular = enable; } inline uint32 tgStore::getVersion(uint32 tigID) { assert(tigID < _tigLen); return(_tigEntry[tigID].svID); } #endif canu-1.6/src/stores/tgStoreCompress.C000066400000000000000000000112731314437614700177010ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/stores/tgStore.C * * Modifications by: * * Brian P. Walenz beginning on 2016-APR-18 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_global.H" #include "gkStore.H" #include "tgStore.H" void operationCompress(char *tigName, int tigVers) { tgStore *tigStore = new tgStore(tigName, tigVers); uint32 nErrors = 0; uint32 nCompress = 0; // Fail if this isn't the latest version. If we try to compress something that isn't the latest // version, versions after this still point to the uncompressed tigs. // // Function never written - is this still a problem? (18 APR 2018) // Check that we aren't going to pull a tig out of the future and place it in the past. for (uint32 ti=0; ti<tigStore->numTigs(); ti++) { if (tigStore->isDeleted(ti)) continue; if (tigStore->getVersion(ti) > tigVers) { fprintf(stderr, "WARNING: Attempt to move future unitig " F_U32 " from version " F_U32 " to previous version %d.\n", ti, tigStore->getVersion(ti), tigVers); nErrors++; } else if (tigStore->getVersion(ti) < tigVers) { nCompress++; } } if (nErrors > 0) { fprintf(stderr, "Store can't be compressed; probably trying to compress to something that isn't the latest version.\n"); fprintf(stderr, " " F_U32 " tigs failed; " F_U32 " compressible\n", nErrors, nCompress); delete tigStore; exit(1); } // Actually do the moves. if (nCompress > 0) { delete tigStore; tigStore = new tgStore(tigName, tigVers, tgStoreModify); } if (nCompress > 0) { fprintf(stderr, "Compressing " F_U32 " tigs into version %d\n", nCompress, tigVers); for (uint32 ti=0; ti<tigStore->numTigs(); ti++) { if ((ti % 1000000) == 0) fprintf(stderr, "tig %d\n", ti); if (tigStore->isDeleted(ti)) { continue; } if (tigStore->getVersion(ti) == tigVers) continue; tgTig *tig = tigStore->loadTig(ti); if (tig == NULL) continue; tigStore->insertTig(tig, true); tigStore->unloadTig(ti); } } // Clean up the older files. if (nCompress > 0) { for (uint32 version=1; version < (uint32)tigVers; version++) tigStore->purgeVersion(version); } } // And the newer files.
delete tigStore; } int main (int argc, char **argv) { char *gkpName = NULL; char *tigName = NULL; int32 tigVers = -1; argc = AS_configure(argc, argv); int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-T") == 0) { tigName = argv[++arg]; tigVers = atoi(argv[++arg]); } else { fprintf(stderr, "%s: unknown option '%s'\n", argv[0], argv[arg]); err++; } arg++; } if ((err) || (gkpName == NULL) || (tigName == NULL)) { fprintf(stderr, "usage: %s -G <gkpStore> -T <tigStore> <version>\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, " -G <gkpStore> Path to the gatekeeper store\n"); fprintf(stderr, " -T <tigStore> <version> Path to the tigStore and version to add tigs to\n"); fprintf(stderr, "\n"); fprintf(stderr, " Remove store versions before <version>. Data present in versions before <version>\n"); fprintf(stderr, " are copied to version <version>. Files for the earlier versions are removed.\n"); fprintf(stderr, "\n"); fprintf(stderr, " WARNING! This code HAS NOT been tested with canu.\n"); fprintf(stderr, "\n"); if (gkpName == NULL) fprintf(stderr, "ERROR: no gatekeeper store (-G) supplied.\n"); if (tigName == NULL) fprintf(stderr, "ERROR: no tig store (-T) supplied.\n"); exit(1); } operationCompress(tigName, tigVers); exit(0); } canu-1.6/src/stores/tgStoreCompress.mk # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := tgStoreCompress SOURCES := tgStoreCompress.C SRC_INCDIRS := ..
../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/stores/tgStoreCoverageStat.C /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/computeCoverageStat.C * * Modifications by: * * Brian P. Walenz from 2011-DEC-20 to 2013-AUG-01 * are Copyright 2011-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2012-JUL-18 * are Copyright 2012 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2015-AUG-14 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-12 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "tgStore.H" #include <algorithm> using namespace std; // This program will recompute the coverage statistic for all unitigs in the tigStore. // It replaces at least four implementations (AS_CGB, AS_BOG, AS_BAT, AS_CGW). // // Notes from the AS_CGB version: // // Rho is the number of bases in the chunk between the first fragment arrival and the last fragment // arrival.
It is the sum of the fragment overhangs in the chunk. For intuitive purposes you can // think of it as the length of the chunk minus the length of the last fragment (if that last isn't // contained). Thus a singleton chunk has a rho equal to zero. // // A singleton chunk provides no information as to its local fragment arrival rate. We need at // least two closely spaced fragments that are randomly sampled from the chunk to get a local // estimate of the fragment arrival rate. // // The local arrival rate of fragments in the chunk is: // arrival_rate_local = (nfrag_randomly_sampled_in_chunk - 1) / rho // // The arrival distance of fragments in the chunk is the reciprocal of the last formula: // arrival_distance_local = rho / (nfrag_randomly_sampled_in_chunk - 1) // // Note a problem with this formula is that a singleton chunk has a coverage discriminator // statistic of 0/0. // // The formula for the coverage discriminator statistic for the chunk is: // (arrival_rate_global / arrival_rate_local - ln(2)) * (nfrag_randomly_sampled_in_chunk - 1) // // The division by zero singularity cancels out to give the formula: // (arrival_rate_global * rho - ln(2) * (nfrag_randomly_sampled_in_chunk - 1) // // The coverage discriminator statistic should be positive for single coverage, negative for // multiple coverage, and near zero for indecisive. // // ADJUST_FOR_PARTIAL_EXCESS: The standard statistic gives log likelihood ratio of expected depth // vs twice expected depth; but when enough fragments are present, we can actually test whether // depth exceeds expected even fractionally; in deeply sequenced datasets (e.g. bacterial genomes), // this has been observed for repetitive segments. 
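The rho definition in the notes above can be checked with a small standalone sketch. This is not canu's `tgTig`/`tgPosition` API; the read layout is modeled as illustrative (begin, end) coordinate pairs:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// rho, per the AS_CGB notes: the span between the first and last read
// arrival, averaged over the forward and reverse orientations of the tig.
// A singleton tig therefore has rho == 0.
double computeRhoSketch(const std::vector<std::pair<int32_t, int32_t>> &reads) {
  int32_t minBgn = INT32_MAX, maxEnd = INT32_MIN;
  int32_t fwdRho = INT32_MIN, revRho = INT32_MAX;

  for (const auto &r : reads) {
    minBgn = std::min(minBgn, r.first);   // leftmost begin coordinate
    maxEnd = std::max(maxEnd, r.second);  // rightmost end coordinate
    fwdRho = std::max(fwdRho, r.first);   // last arrival, forward orientation
    revRho = std::min(revRho, r.second);  // last arrival, reverse orientation
  }

  return ((fwdRho - minBgn) + (maxEnd - revRho)) / 2.0;
}
```

Three 1 kbp reads starting at 0, 500, and 1000 give rho = 1000; a single read gives rho = 0, matching the singleton case described above.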
// double ln2 = 0.69314718055994530941723212145818; double globalArrivalRate = 0.0; bool *isNonRandom = NULL; uint32 *readLength = NULL; bool leniant = false; // No frags -> 1 // One frag -> 1 double computeRho(tgTig *tig) { int32 minBgn = INT32_MAX; int32 maxEnd = INT32_MIN; int32 fwdRho = INT32_MIN; int32 revRho = INT32_MAX; // We compute the two rho's using the first definition above - distance between the first // and last fragment arrival. This changes based on the orientation of the unitig, so we // return the average of those two. for (uint32 i=0; i<tig->numberOfChildren(); i++) { tgPosition *pos = tig->getChild(i); minBgn = MIN(minBgn, pos->min()); maxEnd = MAX(maxEnd, pos->max()); fwdRho = MAX(fwdRho, pos->min()); // largest begin coord revRho = MIN(revRho, pos->max()); // smallest end coord } if ((leniant == false) && (minBgn != 0)) { fprintf(stderr, "tig %d doesn't begin at zero. Layout:\n", tig->tigID()); tig->dumpLayout(stderr); } if (leniant == false) assert(minBgn == 0); fwdRho = fwdRho - minBgn; revRho = maxEnd - revRho; assert(fwdRho >= 0); assert(revRho >= 0); // AS_CGB is using the begin of the last fragment as rho return((fwdRho + revRho) / 2.0); } uint32 numRandomFragments(tgTig *tig) { uint32 numRand = 0; for (uint32 ii=0; ii<tig->numberOfChildren(); ii++) if (isNonRandom[tig->getChild(ii)->ident()] == false) numRand++; return(numRand); } double getGlobalArrivalRate(tgStore *tigStore, FILE *outSTA, uint64 genomeSize, bool useN50) { double globalRate = 0; double recalRate = 0; double sumRho = 0; int32 arLen = 0; double *ar = NULL; uint32 NF; uint64 totalRandom = 0; uint64 totalNF = 0; int32 BIG_SPAN = 10000; int32 big_spans_in_unitigs = 0; // formerly arMax // Go through all the unitigs to sum rho and unitig arrival frags uint32 *allRho = new uint32 [tigStore->numTigs()]; for (uint32 i=0; i<tigStore->numTigs(); i++) { tgTig *tig = tigStore->loadTig(i); allRho[i] = 0; if (tig == NULL) continue; double rho = computeRho(tig); int32 numRandom = numRandomFragments(tig);
tigStore->unloadTig(i); sumRho += rho; big_spans_in_unitigs += (int32) (rho / BIG_SPAN); // Keep integral portion of fraction. totalRandom += numRandom; totalNF += (numRandom == 0) ? (0) : (numRandom - 1); allRho[i] = rho; } // Here is a rough estimate of arrival rate. // Use (number frags)/(unitig span) unless unitig span is zero; then use (reads)/(genome). // Here, (number frags) includes only random reads and omits the last read of each unitig. // Here, (sumRho) is total unitig span omitting last read of each unitig. if (genomeSize > 0) { globalRate = totalRandom / (double)genomeSize; } else { if (sumRho > 0) globalRate = totalNF / sumRho; } fprintf(outSTA, "BASED ON ALL UNITIGS:\n"); fprintf(outSTA, "sumRho: %.0f\n", sumRho); fprintf(outSTA, "totalRandomFrags: " F_U64 "\n", totalRandom); fprintf(outSTA, "Supplied genome size " F_U64 "\n", genomeSize); fprintf(outSTA, "Computed genome size: %.2f\n", totalRandom / globalRate); fprintf(outSTA, "Calculated Global Arrival rate: %f\n", globalRate); // Stop here and return the rough estimate under some circumstances. // *) If user suppled a genome size, we are done. // *) No unitigs. if (genomeSize > 0 || tigStore->numTigs()==0) { delete [] allRho; return(globalRate); } // Calculate rho N50 double rhoN50 = 0; if (useN50) { uint32 growUntil = sumRho / 2; // half is 50%, needed for N50 uint64 growRho = 0; sort (allRho, allRho+tigStore->numTigs()); for (uint32 i=tigStore->numTigs(); i>0; i--) { // from largest to smallest unitig... rhoN50 = allRho[i-1]; growRho += rhoN50; if (growRho >= growUntil) break; // break when sum of rho > 50% } } delete [] allRho; // Try for a better estimate based on just unitigs larger than N50. 
if (useN50) { double keepRho = 0; double keepNF = 0; for (uint32 i=0; i<tigStore->numTigs(); i++) { tgTig *tig = tigStore->loadTig(i); if (tig == NULL) continue; double rho = computeRho(tig); if (rho < rhoN50) continue; // keep only rho from unitigs > N50 int32 numRandom = numRandomFragments(tig); keepNF += (numRandom == 0) ? (0) : (numRandom - 1); keepRho += rho; tigStore->unloadTig(i); } fprintf(outSTA, "BASED ON UNITIGS > N50:\n"); fprintf(outSTA, "rho N50: %.0f\n", rhoN50); if (keepRho > 1) { // the cutoff 1 is arbitrary but larger than 0.0f globalRate = keepNF / keepRho; fprintf(outSTA, "sumRho: %.0f\n", keepRho); fprintf(outSTA, "totalRandomFrags: %.0f\n", keepNF); fprintf(outSTA, "Computed genome size: %.2f\n", totalRandom / globalRate); fprintf(outSTA, "Calculated Global Arrival rate: %f\n", globalRate); return (globalRate); } else { fprintf(outSTA, "It did not work to re-estimate using the N50 method.\n"); } } // Recompute based on just big unitigs. Big is 10Kbp. double BIG_THRESHOLD = 0.5; int32 big_spans_in_rho = (int32) (sumRho / BIG_SPAN); fprintf(outSTA, "Size of big spans is %d\n", BIG_SPAN); fprintf(outSTA, "Number of big spans in unitigs is %d\n", big_spans_in_unitigs); fprintf(outSTA, "Number of big spans in sum-of-rho is %d\n", big_spans_in_rho); fprintf(outSTA, "Ratio required for re-estimate is %f\n", BIG_THRESHOLD); if ( ((double)big_spans_in_unitigs / big_spans_in_rho) <= BIG_THRESHOLD) { fprintf(outSTA, "Too few big spans to re-estimate using the big spans method.\n"); return(globalRate); } // The test above is a rewrite of the former version, where arMax=big_spans_in_unitigs...
// if (arMax <= sumRho / 20000) ar = new double [big_spans_in_unitigs]; for (uint32 i=0; i<tigStore->numTigs(); i++) { tgTig *tig = tigStore->loadTig(i); if (tig == NULL) continue; double rho = computeRho(tig); if (rho <= BIG_SPAN) continue; int32 numRandom = numRandomFragments(tig); double localArrivalRate = numRandom / rho; uint32 rhoDiv10k = rho / BIG_SPAN; assert(0 < rhoDiv10k); for (uint32 aa=0; aa<rhoDiv10k; aa++) ar[arLen++] = localArrivalRate; sort(ar, ar + arLen); // Find the first jump between adjacent sorted rates. double maxDiff = 0.0; uint32 maxDiffIdx = 0; for (int32 idx=1; idx<arLen; idx++) { double ard = ar[idx] - ar[idx - 1]; if (ard > maxDiff) { maxDiff = ard; maxDiffIdx = idx - 1; break; } } double recalRate = 0; recalRate = ar[arLen * 19 / 20]; recalRate = MIN(recalRate, ar[arLen * 1 / 10] * 2.0); recalRate = MIN(recalRate, ar[arLen * 1 / 2] * 1.25); recalRate = MIN(recalRate, ar[maxDiffIdx]); globalRate = MAX(globalRate, recalRate); tigStore->unloadTig(i); } delete [] ar; fprintf(outSTA, "BASED ON BIG SPANS IN UNITIGS:\n"); fprintf(outSTA, "Computed genome size: %.2f (reestimated)\n", totalRandom / globalRate); fprintf(outSTA, "Calculated Global Arrival rate: %f (reestimated)\n", globalRate); return(globalRate); } int main(int argc, char **argv) { char *gkpName = NULL; char *tigName = NULL; int32 tigVers = -1; int64 genomeSize = 0; char *outPrefix = NULL; FILE *outLOG = NULL; FILE *outSTA = NULL; uint32 bgnID = 0; uint32 endID = 0; bool doUpdate = true; bool use_N50 = true; argc = AS_configure(argc, argv); int err = 0; int arg = 1; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-T") == 0) { tigName = argv[++arg]; tigVers = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-s") == 0) { genomeSize = atol(argv[++arg]); } else if (strcmp(argv[arg], "-o") == 0) { outPrefix = argv[++arg]; } else if (strcmp(argv[arg], "-n") == 0) { doUpdate = false; } else if (strcmp(argv[arg], "-u") == 0) { use_N50 = false; } else if (strcmp(argv[arg], "-L") == 0) { leniant = true; } else { err++; } arg++; } if (gkpName == NULL) err++; if (tigName == NULL) err++; if (outPrefix == NULL) err++; if (err) { fprintf(stderr, "usage: %s -G gkpStore -T tigStore version -o
output-prefix [-s genomeSize] ...\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, " -G Mandatory, path G to a gkpStore directory.\n"); fprintf(stderr, " -T Mandatory, path T to a tigStore, and version V.\n"); fprintf(stderr, " -o Mandatory, prefix for output files.\n"); fprintf(stderr, " -s Optional, assume genome size S.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -n Do not update the tigStore (default = do update).\n"); fprintf(stderr, " -u Do not estimate based on N50 (default = use N50).\n"); fprintf(stderr, "\n"); fprintf(stderr, " -L Be leniant; don't require reads start at position zero.\n"); fprintf(stderr, "\n"); if (gkpName == NULL) fprintf(stderr, "No gatekeeper store (-G option) supplied.\n"); if (tigName == NULL) fprintf(stderr, "No input tigStore (-T option) supplied.\n"); if (outPrefix == NULL) fprintf(stderr, "No output prefix (-o option) supplied.\n"); exit(1); } // Open output files first, so we can fail before getting too far along. { char outName[FILENAME_MAX]; errno = 0; snprintf(outName, FILENAME_MAX, "%s.log", outPrefix); outLOG = fopen(outName, "w"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", outName, strerror(errno)), exit(1); snprintf(outName, FILENAME_MAX, "%s.stats", outPrefix); outSTA = fopen(outName, "w"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", outName, strerror(errno)), exit(1); } // // Load fragment data // fprintf(stderr, "Opening gkpStore '%s'\n", gkpName); gkStore *gkpStore = gkStore::gkStore_open(gkpName, gkStore_readOnly); fprintf(stderr, "Reading read lengths and randomness for %u reads.\n", gkpStore->gkStore_getNumReads()); isNonRandom = new bool [gkpStore->gkStore_getNumReads() + 1]; readLength = new uint32 [gkpStore->gkStore_getNumReads() + 1]; for (uint32 ii=0; ii<=gkpStore->gkStore_getNumReads(); ii++) { gkRead *read = gkpStore->gkStore_getRead(ii); gkLibrary *libr = gkpStore->gkStore_getLibrary(read->gkRead_libraryID()); isNonRandom[ii] = libr->gkLibrary_isNonRandom(); readLength[ii] = 
read->gkRead_sequenceLength(); } fprintf(stderr, "Closing gkpStore.\n"); gkpStore->gkStore_close(); gkpStore = NULL; // // Open tigs. Kind of important to do this. // fprintf(stderr, "Opening tigStore '%s'\n", tigName); tgStore *tigStore = new tgStore(tigName, tigVers, tgStoreModify); if (endID == 0) endID = tigStore->numTigs(); // // Compute global arrival rate. This ain't cheap. // fprintf(stderr, "Computing global arrival rate.\n"); double globalRate = getGlobalArrivalRate(tigStore, outSTA, genomeSize, use_N50); // // Compute coverage stat for each unitig, populate histograms, write logging. // // Three histograms were made, one for length, coverage stat and arrival distance. The // histograms included // // columns: sum, cumulative_sum, cumulative_fraction, min, average, max // rows: reads, rs reads, nr reads, bases, rho, arrival, discriminator // // Most of those are not defined or null (nr reads? currently zero), and the histograms // in general aren't useful anymore. They were used to help decide if the genome size // was incorrect, causing too many non-unique unitigs. // // They were removed 13 Aug 2015. 
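The loop below evaluates, for each tig, the arrival distance and the coverage statistic defined in the notes at the top of this file. A minimal standalone sketch of those two formulas follows; the function names and sample numbers are illustrative, not canu's API:

```cpp
#include <cmath>

// covStat = rho * globalRate - ln(2) * (nRandom - 1): positive favors
// unique (single-copy) sequence, negative favors a collapsed repeat.
double covStatSketch(double globalRate, double rho, int nRandom) {
  return rho * globalRate - std::log(2.0) * (nRandom - 1);
}

// Arrival distance: average spacing between random read starts, rho / (n - 1).
double arrDistSketch(double rho, int nRandom) {
  return (nRandom > 1) ? rho / (nRandom - 1) : 0.0;
}
```

At a global rate of one read per 500 bp (0.002), a 10 kbp rho holding 21 random reads matches expectation and scores positive (about +6.1); 41 reads over the same span scores negative (about -7.7), flagging a likely repeat.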
fprintf(stderr, "Computing coverage stat for tigs %u-%u.\n", bgnID, endID-1); fprintf(outLOG, "# tigID rho covStat arrDist\n"); for (uint32 i=bgnID; i<endID; i++) { tgTig *tig = tigStore->loadTig(i); if (tig == NULL) continue; int32 tigLength = tig->length(true); int32 numFrags = tig->numberOfChildren(); int32 numRandom = numRandomFragments(tig); double rho = computeRho(tig); double covStat = 0.0; double arrDist = 0.0; if (numRandom > 1) arrDist = rho / (numRandom - 1); if ((numRandom > 0) && (globalRate > 0.0)) covStat = (rho * globalRate) - (ln2 * (numRandom - 1)); fprintf(outLOG, "%10u %10.2f %10.2f %10.2f\n", tig->tigID(), rho, covStat, arrDist); #undef ADJUST_FOR_PARTIAL_EXCESS #ifdef ADJUST_FOR_PARTIAL_EXCESS // Straight from AS_CGB_all.h, not refactored if(rho>0&&global_fragment_arrival_rate>0.f){ double lambda = global_fragment_arrival_rate * rho; double zscore = ((number_of_randomly_sampled_fragments_in_chunk -1)-lambda) / sqrt(lambda); double p = .5 - erf(zscore/sqrt2)*.5; if(coverage_statistic>5 && p < .001){ fprintf(outSTA, "Standard unitigger a-stat is %f, but only %e chance of this great an excess of fragments: obs = %d, expect = %g rho = " F_S64 " Will reset a-stat to 1.5\n", coverage_statistic,p, number_of_randomly_sampled_fragments_in_chunk-1, lambda,rho); covStat = 1.5; } } #endif if (doUpdate) tigStore->setCoverageStat(tig->tigID(), covStat); tigStore->unloadTig(tig->tigID()); } fclose(outLOG); delete [] isNonRandom; delete [] readLength; delete tigStore; exit(0); } canu-1.6/src/stores/tgStoreCoverageStat.mk # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := tgStoreCoverageStat SOURCES := tgStoreCoverageStat.C SRC_INCDIRS := ..
../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/stores/tgStoreDump.C000066400000000000000000001166571314437614700170270ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_CNS/tigStore.C * * Modifications by: * * Brian P. Walenz from 2009-OCT-05 to 2014-MAR-31 * are Copyright 2009-2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Michael Schatz from 2010-FEB-27 to 2010-MAY-17 * are Copyright 2010 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Sergey Koren on 2014-APR-13 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz from 2014-OCT-09 to 2015-AUG-14 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-12 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2016-FEB-22 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_global.H" #include "gkStore.H" #include "tgStore.H" #include "AS_UTL_decodeRange.H" #include "intervalList.H" #include "tgTigSizeAnalysis.H" #undef DEBUG_IGNORE #define DUMP_UNSET 0 #define DUMP_STATUS 1 #define DUMP_TIGS 2 #define DUMP_CONSENSUS 3 #define DUMP_LAYOUT 4 #define DUMP_MULTIALIGN 5 #define DUMP_SIZES 6 #define DUMP_COVERAGE 7 #define DUMP_DEPTH_HISTOGRAM 8 #define DUMP_THIN_OVERLAP 9 #define DUMP_OVERLAP_HISTOGRAM 10 // positions can be // layout (if no consensusExists() // gapped // ungapped // // // MISSING // masking and splitting consensus based on (low) coverage // coverage // single unitig coverage plot and dump of low coverage regions // all unitig histogram of coverage // reporting gc content (need to do this after consensus, while sequence is still in memory, then save in the tig itself) class tgFilter { public: tgFilter() { tigIDbgn = 0; tigIDend = UINT32_MAX; dumpAllClasses = true; dumpUnassembled = false; dumpBubbles = false; dumpContigs = false; minNreads = 0; maxNreads = UINT32_MAX; minLength = 0; maxLength = UINT32_MAX; minCoverage = 0.0; maxCoverage = DBL_MAX; minGoodCov = 0.0; maxGoodCov = DBL_MAX; IL = NULL; ID = NULL; }; ~tgFilter() { delete IL; delete ID; }; bool ignore(tgTig *tig, bool useGapped) { #ifdef DEBUG_IGNORE bool iI = ignoreID(tig); bool iN = ignoreNreads(tig); bool iL = ignoreLength(tig, useGapped); bool iC = ignoreCoverage(tig, useGapped); bool iS = ignoreClass(tig); fprintf(stderr, "ignore()-- tig %u - ignore id %s Nreads %s length %s coverage %s class %s\n", tig->tigID(), (iI) ? "true" : "false", (iN) ? "true" : "false", (iL) ? "true" : "false", (iC) ? "true" : "false", (iS) ? 
"true" : "false"); #endif return(ignoreID(tig) || ignoreNreads(tig) || ignoreLength(tig, useGapped) || ignoreCoverage(tig, useGapped) || ignoreClass(tig)); }; bool ignoreID(tgTig *tig) { return((tig->tigID() < tigIDbgn) || (tigIDend < tig->tigID())); }; bool ignoreClass(tgTig *tig) { if ((dumpAllClasses == true) || ((tig->_class == tgTig_unassembled) && (dumpUnassembled == true)) || ((tig->_class == tgTig_bubble) && (dumpBubbles == true)) || ((tig->_class == tgTig_contig) && (dumpContigs == true))) return(false); return(true); }; bool ignoreNreads(tgTig *tig) { return((tig->numberOfChildren() < minNreads) || (maxNreads < tig->numberOfChildren())); }; bool ignoreLength(tgTig *tig, bool useGapped) { uint32 length = tig->length(useGapped); return((length < minLength) || (maxLength < length)); }; bool ignoreCoverage(tgTig *tig, bool useGapped) { if ((minCoverage == 0) && (maxCoverage == UINT32_MAX)) return(false); if (tig->consensusExists() == false) useGapped = true; delete IL; IL = new intervalList; for (uint32 i=0; inumberOfChildren(); i++) { tgPosition *pos = tig->getChild(i); int32 bgn = (useGapped) ? pos->min() : tig->mapGappedToUngapped(pos->min()); int32 end = (useGapped) ? 
pos->max() : tig->mapGappedToUngapped(pos->max()); IL->add(bgn, end - bgn); } delete ID; ID = new intervalList(*IL); uint32 goodCov = 0; uint32 badCov = 0; for (uint32 ii=0; iinumberOfIntervals(); ii++) if ((minCoverage <= ID->depth(ii)) && (ID->depth(ii) <= maxCoverage)) goodCov += ID->hi(ii) - ID->lo(ii); else badCov += ID->hi(ii) - ID->lo(ii); double fracGood = (double)(goodCov) / (goodCov + badCov); return((fracGood < minGoodCov) || (maxGoodCov < fracGood)); }; void maskConsensus(char *cns) { for (uint32 ii=0; iinumberOfIntervals(); ii++) { if ((minCoverage <= ID->depth(ii)) && (ID->depth(ii) <= maxCoverage)) for (uint32 pp=ID->lo(ii); pphi(ii); pp++) cns[pp] = '_'; } }; uint32 tigIDbgn; uint32 tigIDend; bool dumpAllClasses; bool dumpUnassembled; bool dumpBubbles; bool dumpContigs; uint32 minNreads; uint32 maxNreads; uint32 minLength; uint32 maxLength; double minCoverage; double maxCoverage; double minGoodCov; double maxGoodCov; intervalList *IL; intervalList *ID; }; void dumpStatus(gkStore *UNUSED(gkpStore), tgStore *tigStore) { fprintf(stderr, "%u\n", tigStore->numTigs()); } void dumpTig(FILE *out, tgTig *tig, bool useGapped) { fprintf(out, F_U32"\t" F_U32 "\t%s\t%.2f\t%.2f\t%s\t%s\t%s\t" F_U32 "\n", tig->tigID(), tig->length(useGapped), tig->coordinateType(useGapped), tig->_coverageStat, tig->computeCoverage(useGapped), toString(tig->_class), tig->_suggestRepeat ? "yes" : "no", tig->_suggestCircular ? "yes" : "no", tig->numberOfChildren()); } void dumpRead(FILE *out, tgTig *tig, tgPosition *read, bool useGapped) { fprintf(out, F_U32"\t" F_U32 "\t%s\t" F_U32 "\t" F_U32 "\n", read->ident(), tig->tigID(), tig->coordinateType(useGapped), (useGapped) ? read->bgn() : tig->mapGappedToUngapped(read->bgn()), (useGapped) ? 
read->end() : tig->mapGappedToUngapped(read->end())); } void dumpTigs(gkStore *UNUSED(gkpStore), tgStore *tigStore, tgFilter &filter, bool useGapped) { fprintf(stdout, "#tigID\ttigLen\tcoordType\tcovStat\tcoverage\ttigClass\tsugRept\tsugCirc\tnumChildren\n"); for (uint32 ti=0; tinumTigs(); ti++) { if (tigStore->isDeleted(ti)) continue; tgTig *tig = tigStore->loadTig(ti); if (tig->consensusExists() == false) useGapped = true; if (filter.ignore(tig, useGapped) == true) { tigStore->unloadTig(ti); continue; } dumpTig(stdout, tig, useGapped); tigStore->unloadTig(ti); } } void dumpConsensus(gkStore *UNUSED(gkpStore), tgStore *tigStore, tgFilter &filter, bool useGapped, bool useReverse, char cnsFormat) { for (uint32 ti=0; tinumTigs(); ti++) { if (tigStore->isDeleted(ti)) continue; tgTig *tig = tigStore->loadTig(ti); if (tig->consensusExists() == false) { //fprintf(stderr, "dumpConsensus()-- tig %u has no consensus sequence.\n", ti); tigStore->unloadTig(ti); continue; } if (filter.ignore(tig, useGapped) == true) { tigStore->unloadTig(ti); continue; } if (useReverse) tig->reverseComplement(); switch (cnsFormat) { case 'A': tig->dumpFASTA(stdout, useGapped); break; case 'Q': tig->dumpFASTQ(stdout, useGapped); break; default: break; } tigStore->unloadTig(ti); } } void dumpLayout(gkStore *UNUSED(gkpStore), tgStore *tigStore, tgFilter &filter, bool useGapped, char *outPrefix) { FILE *tigs = NULL; // Length and flags of tigs, same as dumpTigs() FILE *reads = NULL; // Length and flags of reads, mapping of read to tig FILE *layout = stdout; // Standard layout file if (outPrefix) { char T[FILENAME_MAX]; int32 Terr = 0; char R[FILENAME_MAX]; int32 Rerr = 0; char L[FILENAME_MAX]; int32 Lerr = 0; snprintf(T, FILENAME_MAX, "%s.layout.tigInfo", outPrefix); snprintf(R, FILENAME_MAX, "%s.layout.readToTig", outPrefix); snprintf(L, FILENAME_MAX, "%s.layout", outPrefix); errno = 0; tigs = fopen(T, "w"); Terr = errno; reads = fopen(R, "w"); Rerr = errno; layout = fopen(L, "w"); Lerr = errno; 
if (Terr) fprintf(stderr, "Failed to open '%s': %s\n", T, strerror(Terr)); if (Rerr) fprintf(stderr, "Failed to open '%s': %s\n", R, strerror(Rerr)); if (Lerr) fprintf(stderr, "Failed to open '%s': %s\n", L, strerror(Lerr)); if (Terr + Rerr + Lerr > 0) exit(1); fprintf(tigs, "#tigID\ttigLen\tcoordType\tcovStat\tcoverage\ttigClass\tsugRept\tsugCirc\tnumChildren\n"); fprintf(reads, "#readID\ttigID\tcoordType\tbgn\tend\n"); } for (uint32 ti=0; tinumTigs(); ti++) { if (tigStore->isDeleted(ti)) continue; tgTig *tig = tigStore->loadTig(ti); if (tig->consensusExists() == false) useGapped = true; if (filter.ignore(tig, useGapped) == true) { tigStore->unloadTig(ti); continue; } if (tigs) dumpTig(tigs, tig, useGapped); if (reads) for (uint32 ci=0; cinumberOfChildren(); ci++) dumpRead(reads, tig, tig->getChild(ci), useGapped); if (layout) tig->dumpLayout(layout); tigStore->unloadTig(ti); } if (outPrefix) { fclose(tigs); fclose(reads); fclose(layout); } } void dumpMultialign(gkStore *gkpStore, tgStore *tigStore, tgFilter &filter, bool maWithQV, bool maWithDots, uint32 maDisplayWidth, uint32 maDisplaySpacing) { for (uint32 ti=0; tinumTigs(); ti++) { if (tigStore->isDeleted(ti)) continue; tgTig *tig = tigStore->loadTig(ti); if (filter.ignore(tig, true) == true) { tigStore->unloadTig(ti); continue; } tig->display(stdout, gkpStore, maDisplayWidth, maDisplaySpacing, maWithQV, maWithDots); tigStore->unloadTig(ti); } } void dumpSizes(gkStore *UNUSED(gkpStore), tgStore *tigStore, tgFilter &filter, bool useGapped, uint64 genomeSize) { tgTigSizeAnalysis *siz = new tgTigSizeAnalysis(genomeSize); for (uint32 ti=0; tinumTigs(); ti++) { if (tigStore->isDeleted(ti)) continue; tgTig *tig = tigStore->loadTig(ti); if (tig->consensusExists() == false) useGapped = true; if (filter.ignore(tig, useGapped) == true) { tigStore->unloadTig(ti); continue; } siz->evaluateTig(tig, useGapped); tigStore->unloadTig(ti); } siz->finalize(); siz->printSummary(stdout); delete siz; } void plotDepthHistogram(char 
*N, uint64 *cov, uint32 covMax) { FILE *F; // Find the smallest and largest values with counts. uint32 minii = 0; uint32 maxii = covMax - 1; while ((minii < covMax) && (cov[minii] == 0)) minii++; while ((minii < maxii) && (cov[maxii] == 0)) maxii--; // Extend a little bit, to give a little context (and to get zero). minii = (minii > 10) ? minii-10 : 0; maxii = maxii+10; // Dump the values from min to max errno = 0; F = fopen(N, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", N, strerror(errno)); for (uint32 ii=minii; ii<=maxii; ii++) for (uint32 xx=0; xx /dev/null 2>&1", "w"); if (F) { fprintf(F, "set terminal 'png'\n"); fprintf(F, "set output '%s.png'\n", N); fprintf(F, "set xlabel 'binwidth=%u'\n", boxsize); //fprintf(F, "set ylabel 'number of bases'\n"); // A histogram, assuming the data is sorted values fprintf(F, "binwidth=%u\n", boxsize); fprintf(F, "set boxwidth binwidth\n"); fprintf(F, "bin(x,width) = width*floor(x/width) + binwidth/2.0\n"); fprintf(F, "plot [%u:%u] '%s' using (bin($1,binwidth)):(1.0) smooth freq with boxes title '%s'\n", minii, maxii, N, N); // A histogram, assuming the data is a histogram. It's ugly though. //fprintf(F, "plot [%u:%u] [] '%s' using 1:2 with lines title '%s', \\\n", // Used to report 'tigs %u-%u' for the range // minii, maxii, N, N); pclose(F); } // Dump the values again, this time as a real histogram. 
errno = 0; F = fopen(N, "w"); if (errno) fprintf(stderr, "Failed to open '%s' for writing: %s\n", N, strerror(errno)); for (uint32 ii=minii; ii<=maxii; ii++) fprintf(F, F_U32"\t" F_U64 "\n", ii, cov[ii]); fclose(F); } void dumpDepthHistogram(gkStore *UNUSED(gkpStore), tgStore *tigStore, tgFilter &filter, bool useGapped, bool single, char *outPrefix) { char N[FILENAME_MAX]; intervalList IL; int32 covMax = 1048576; uint64 *cov = new uint64 [covMax]; memset(cov, 0, sizeof(uint64) * covMax); for (uint32 ti=0; tinumTigs(); ti++) { if (tigStore->isDeleted(ti)) continue; tgTig *tig = tigStore->loadTig(ti); if (tig->consensusExists() == false) useGapped = true; if (filter.ignore(tig, useGapped) == true) { tigStore->unloadTig(ti); continue; } // Save all the read intervals to the list. IL.clear(); for (uint32 ci=0; cinumberOfChildren(); ci++) { tgPosition *read = tig->getChild(ci); uint32 bgn = (useGapped) ? read->min() : tig->mapGappedToUngapped(read->min()); uint32 end = (useGapped) ? read->max() : tig->mapGappedToUngapped(read->max()); IL.add(bgn, end - bgn); } // Convert to depths. intervalList ID(IL); // Add the depths to the histogram. for (uint32 ii=0; iitigID()); plotDepthHistogram(N, cov, covMax); memset(cov, 0, sizeof(uint64) * covMax); // Slight optimization if we do this in plotDepthHistogram of just the set values. } // Repeat. 
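plotDepthHistogram() above is fed per-depth base counts that dumpDepthHistogram() accumulates from read intervals. A standalone sketch of that interval-to-depth conversion (the role canu's intervalList class plays internally; the function and event representation here are illustrative, not canu's API):

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

//  Sweep-line conversion of half-open read intervals [bgn, end) into a
//  depth -> number-of-bases histogram.  Minimal sketch only.
std::map<uint32_t, uint64_t>
depthHistogram(std::vector<std::pair<uint32_t, uint32_t>> reads) {
  std::vector<std::pair<uint32_t, int32_t>> ev;   //  (position, +1 or -1)

  for (auto &r : reads) {
    ev.push_back({r.first,  +1});                 //  read starts here
    ev.push_back({r.second, -1});                 //  read ends here
  }

  std::sort(ev.begin(), ev.end());                //  ends sort before starts at a tie

  std::map<uint32_t, uint64_t> hist;
  int32_t  depth = 0;
  uint32_t last  = 0;

  for (auto &e : ev) {
    if (depth > 0 && e.first > last)              //  credit the span since the last event
      hist[depth] += e.first - last;
    depth += e.second;
    last   = e.first;
  }

  return hist;
}
```

For two reads [0,10) and [5,15), the sketch reports 10 bases at depth 1 and 5 bases at depth 2.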
tigStore->unloadTig(ti); } if (single == false) { snprintf(N, FILENAME_MAX, "%s.depthHistogram", outPrefix); plotDepthHistogram(N, cov, covMax); } delete [] cov; } void dumpCoverage(gkStore *UNUSED(gkpStore), tgStore *tigStore, tgFilter &filter, bool useGapped, char *outPrefix) { uint32 covMax = 1024; uint64 *cov = new uint64 [covMax]; for (uint32 ti=0; tinumTigs(); ti++) { if (tigStore->isDeleted(ti)) continue; tgTig *tig = tigStore->loadTig(ti); uint32 tigLen = tig->length(useGapped); if (tig->consensusExists() == false) useGapped = true; if (filter.ignore(tig, true) == true) { tigStore->unloadTig(ti); continue; } if (tigLen == 0) { tigStore->unloadTig(ti); continue; } // Do something. intervalList allL; for (uint32 ci=0; cinumberOfChildren(); ci++) { tgPosition *read = tig->getChild(ci); uint32 bgn = (useGapped) ? read->min() : tig->mapGappedToUngapped(read->min()); uint32 end = (useGapped) ? read->max() : tig->mapGappedToUngapped(read->max()); allL.add(bgn, end - bgn); } intervalList ID(allL); uint32 maxDepth = 0; double aveDepth = 0; double sdeDepth = 0; #if 0 // Report regions that have abnormally low or abnormally high coverage intervalList minL; intervalList maxL; for (uint32 ii=0; iitigID(), ID.lo(ii), ID.hi(ii), tigLen, ID.depth(ii)); minL.add(ID.lo(ii), ID.hi(ii) - ID.lo(ii) + 1); } if (maxCoverage <= ID.depth(ii)) { fprintf(stderr, "tig %d high coverage interval %ld %ld max %u coverage %u\n", tig->tigID(), ID.lo(ii), ID.hi(ii), tigLen, ID.depth(ii)); maxL.add(ID.lo(ii), ID.hi(ii) - ID.lo(ii) + 1); } } #endif // Compute max and average depth, and save the depth in a histogram. 
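The max/average/std.dev computation that follows can be sketched in isolation. Assumptions of this sketch (not guarantees about canu): intervals are (lo, hi, depth) with hi inclusive, as in the surrounding code, and the variance divides by the tig length rather than N-1.

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

struct DepthStats { double mean; double stddev; };

//  Length-weighted mean and standard deviation of read depth across a tig,
//  mirroring the aveDepth/sdeDepth accumulation in dumpCoverage().
DepthStats
depthStats(const std::vector<std::array<uint32_t, 3>> &iv, uint32_t tigLen) {
  double mean = 0.0;

  for (const auto &i : iv)
    mean += (double)(i[1] - i[0] + 1) * i[2];     //  bases-at-this-depth contribution
  mean /= tigLen;

  double var = 0.0;

  for (const auto &i : iv)
    var += (double)(i[1] - i[0] + 1) * (i[2] - mean) * (i[2] - mean);
  var /= tigLen;                                  //  divide-by-N is an assumption here

  return { mean, std::sqrt(var) };
}
```

For a 20 bp tig with bases 0-9 at depth 2 and bases 10-19 at depth 4, this gives mean 3.0 and standard deviation 1.0.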
#warning replace this with genericStatistics for (uint32 ii=0; ii maxDepth) maxDepth = ID.depth(ii); aveDepth += (ID.hi(ii) - ID.lo(ii) + 1) * ID.depth(ii); while (covMax <= ID.depth(ii)) resizeArray(cov, covMax, covMax, covMax * 2); cov[ID.depth(ii)] += ID.hi(ii) - ID.lo(ii) + 1; } aveDepth /= tigLen; // Now the std.dev for (uint32 ii=0; ii 0) && (maxL.numberOfIntervals() > 0)) fprintf(stderr, "tig %d has %u intervals, %u regions below %u coverage and %u regions at or above %u coverage\n", tig->tigID(), allL.numberOfIntervals(), minL.numberOfIntervals(), minCoverage, maxL.numberOfIntervals(), maxCoverage); else if (minL.numberOfIntervals() > 0) fprintf(stderr, "tig %d has %u intervals, %u regions below %u coverage\n", tig->tigID(), allL.numberOfIntervals(), minL.numberOfIntervals(), minCoverage); else if (maxL.numberOfIntervals() > 0) fprintf(stderr, "tig %d has %u intervals, %u regions at or above %u coverage\n", tig->tigID(), allL.numberOfIntervals(), maxL.numberOfIntervals(), maxCoverage); else fprintf(stderr, "tig %d has %u intervals\n", tig->tigID(), allL.numberOfIntervals()); #endif // Plot the depth for each tig if (outPrefix) { char outName[FILENAME_MAX]; snprintf(outName, FILENAME_MAX, "%s.tig%08u.depth", outPrefix, tig->tigID()); FILE *outFile = fopen(outName, "w"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", outName, strerror(errno)), exit(1); for (uint32 ii=0; ii /dev/null 2>&1", "w"); if (gnuPlot) { fprintf(gnuPlot, "set terminal 'png'\n"); fprintf(gnuPlot, "set output '%s.tig%08u.png'\n", outPrefix, tig->tigID()); fprintf(gnuPlot, "set xlabel 'position'\n"); fprintf(gnuPlot, "set ylabel 'coverage'\n"); fprintf(gnuPlot, "set terminal 'png'\n"); fprintf(gnuPlot, "plot '%s.tig%08u.depth' using 1:2 with lines title 'tig %u length %u', \\\n", outPrefix, tig->tigID(), tig->tigID(), tigLen); fprintf(gnuPlot, " %f title 'mean %.2f +- %.2f', \\\n", aveDepth, aveDepth, sdeDepth); fprintf(gnuPlot, " %f title '' lt 0 lc 2, \\\n", aveDepth - 
sdeDepth); fprintf(gnuPlot, " %f title '' lt 0 lc 2\n", aveDepth + sdeDepth); pclose(gnuPlot); } } // Did something. tigStore->unloadTig(ti); } delete [] cov; } void dumpThinOverlap(gkStore *UNUSED(gkpStore), tgStore *tigStore, tgFilter &filter, bool useGapped, uint32 minOverlap) { fprintf(stderr, "reporting overlaps of at most %u bases\n", minOverlap); for (uint32 ti=0; tinumTigs(); ti++) { if (tigStore->isDeleted(ti)) continue; tgTig *tig = tigStore->loadTig(ti); if (tig->consensusExists() == false) useGapped = true; if (filter.ignore(tig, true) == true) { tigStore->unloadTig(ti); continue; } // Do something. intervalList allL; intervalList ovlL; intervalList badL; for (uint32 ri=0; rinumberOfChildren(); ri++) { tgPosition *read = tig->getChild(ri); uint32 bgn = (useGapped) ? read->min() : tig->mapGappedToUngapped(read->min()); uint32 end = (useGapped) ? read->max() : tig->mapGappedToUngapped(read->max()); allL.add(bgn, end - bgn); ovlL.add(bgn, end - bgn); } allL.merge(); // Merge, requiring zero overlap (adjacent is OK) between pieces ovlL.merge(minOverlap); // Merge, requiring minOverlap overlap between pieces // If there is more than one interval, make a list of the regions where we have thin overlaps. if (ovlL.numberOfIntervals() > 1) // Vertical space between tig reports fprintf(stderr, "\n"); for (uint32 ii=1; iitigID(), ovlL.lo(ii), ovlL.hi(ii-1)); badL.add(ovlL.lo(ii), ovlL.hi(ii-1) - ovlL.lo(ii)); } // Then report any reads that intersect that region. for (uint32 ri=0; rinumberOfChildren(); ri++) { tgPosition *read = tig->getChild(ri); uint32 bgn = (useGapped) ? read->min() : tig->mapGappedToUngapped(read->min()); uint32 end = (useGapped) ? read->max() : tig->mapGappedToUngapped(read->max()); bool report = false; for (uint32 oo=0; ootigID(), read->ident(), (useGapped) ? read->min() : tig->mapGappedToUngapped(read->min()), (useGapped) ? 
read->max() : tig->mapGappedToUngapped(read->max())); } if ((allL.numberOfIntervals() != 1) || (ovlL.numberOfIntervals() != 1)) fprintf(stderr, "tig %d %s length %u has %u interval%s and %u interval%s after enforcing minimum overlap of %u\n", tig->tigID(), tig->coordinateType(useGapped), tig->length(), allL.numberOfIntervals(), (allL.numberOfIntervals() == 1) ? "" : "s", ovlL.numberOfIntervals(), (ovlL.numberOfIntervals() == 1) ? "" : "s", minOverlap); // There, did something. tigStore->unloadTig(ti); } } void dumpOverlapHistogram(gkStore *UNUSED(gkpStore), tgStore *tigStore, tgFilter &filter, bool useGapped, char *outPrefix) { uint32 histMax = AS_MAX_READLEN; uint64 *hist = new uint64 [histMax]; memset(hist, 0, sizeof(uint64) * histMax); for (uint32 ti=0; tinumTigs(); ti++) { if (tigStore->isDeleted(ti)) continue; tgTig *tig = tigStore->loadTig(ti); int32 tn = tig->numberOfChildren(); if (tig->consensusExists() == false) useGapped = true; if (filter.ignore(tig, true) == true) { tigStore->unloadTig(ti); continue; } // Do something. For each read, compute the thickest overlap off of each end. // First, decide on positions for each read. Store in an array for easier use later. uint32 *bgn = new uint32 [tn]; uint32 *end = new uint32 [tn]; for (uint32 ri=0; rigetChild(ri); bgn[ri] = (useGapped) ? read->min() : tig->mapGappedToUngapped(read->min()); end[ri] = (useGapped) ? read->max() : tig->mapGappedToUngapped(read->max()); } // Scan these, marking contained reads. for (uint32 ri=0; ri bgn[ri] for (int32 ii=ri-1; ii>0; ii--) { if (bgn[ii] == UINT32_MAX) continue; if (end[ii] < bgn[ri]) // Read doesn't overlap, no more reads will. 
break; if (thickest5 < end[ii] - bgn[ri]) thickest5 = end[ii] - bgn[ri]; } // Off the 3' end, expect bgn[ii] < end[ri] and bgn[ii] > bgn[ri] for (int32 ii=ri+1; ii 0) { assert(thickest5 < histMax); hist[thickest5]++; } if (thickest3 > 0) { assert(thickest3 < histMax); hist[thickest3]++; } } delete [] bgn; delete [] end; // There, did something. tigStore->unloadTig(ti); } // All computed. Dump the data and plot. char N[FILENAME_MAX]; snprintf(N, FILENAME_MAX, "%s.thickestOverlapHistogram", outPrefix); plotDepthHistogram(N, hist, histMax); // Cleanup and Bye! delete [] hist; } int main (int argc, char **argv) { char *gkpName = NULL; char *tigName = NULL; int tigVers = -1; // Tig Selection tgFilter filter; // Dump options uint32 dumpType = DUMP_UNSET; bool useGapped = false; bool useReverse = false; char cnsFormat = 'A'; // Or 'Q' for FASTQ bool maWithQV = false; bool maWithDots = true; uint32 maDisplayWidth = 100; uint32 maDisplaySpacing = 3; uint64 genomeSize = 0; char *outPrefix = NULL; bool single = false; uint32 minOverlap = 0; argc = AS_configure(argc, argv); int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-T") == 0) { tigName = argv[++arg]; tigVers = atoi(argv[++arg]); } else if ((strcmp(argv[arg], "-tig") == 0) || (strcmp(argv[arg], "-t") == 0) || // Deprecated! (strcmp(argv[arg], "-u") == 0)) { // Deprecated too! 
AS_UTL_decodeRange(argv[++arg], filter.tigIDbgn, filter.tigIDend); } else if (strcmp(argv[arg], "-unassembled") == 0) { filter.dumpAllClasses = false; filter.dumpUnassembled = true; } else if (strcmp(argv[arg], "-bubbles") == 0) { filter.dumpAllClasses = false; filter.dumpBubbles = true; } else if (strcmp(argv[arg], "-contigs") == 0) { filter.dumpAllClasses = false; filter.dumpContigs = true; } else if (strcmp(argv[arg], "-nreads") == 0) { filter.minNreads = atoi(argv[++arg]); filter.maxNreads = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-length") == 0) { filter.minLength = atoi(argv[++arg]); filter.maxLength = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-coverage") == 0) { if ((arg == argc-1) || (argv[arg+1][0] == '-')) { dumpType = DUMP_COVERAGE; } else if (arg + 4 < argc) { filter.minCoverage = atof(argv[++arg]); filter.maxCoverage = atof(argv[++arg]); filter.minGoodCov = atof(argv[++arg]); filter.maxGoodCov = atof(argv[++arg]); } else { fprintf(stderr, "ERROR: -coverage needs four values.\n"); err++; } } // Dump types. else if (strcmp(argv[arg], "-status") == 0) dumpType = DUMP_STATUS; else if (strcmp(argv[arg], "-tigs") == 0) dumpType = DUMP_TIGS; else if (strcmp(argv[arg], "-consensus") == 0) dumpType = DUMP_CONSENSUS; else if (strcmp(argv[arg], "-layout") == 0) dumpType = DUMP_LAYOUT; else if (strcmp(argv[arg], "-multialign") == 0) dumpType = DUMP_MULTIALIGN; else if (strcmp(argv[arg], "-sizes") == 0) dumpType = DUMP_SIZES; else if (strcmp(argv[arg], "-coverage") == 0) // NOTE! Actually handled above. dumpType = DUMP_COVERAGE; else if (strcmp(argv[arg], "-depth") == 0) dumpType = DUMP_DEPTH_HISTOGRAM; else if (strcmp(argv[arg], "-overlap") == 0) dumpType = DUMP_THIN_OVERLAP; else if (strcmp(argv[arg], "-overlaphistogram") == 0) dumpType = DUMP_OVERLAP_HISTOGRAM; // Options. 
else if (strcmp(argv[arg], "-gapped") == 0) useGapped = true; else if (strcmp(argv[arg], "-reverse") == 0) useReverse = true; else if (strcmp(argv[arg], "-fasta") == 0) cnsFormat = 'A'; else if (strcmp(argv[arg], "-fastq") == 0) cnsFormat = 'Q'; else if (strcmp(argv[arg], "-w") == 0) maDisplayWidth = atoi(argv[++arg]); else if (strcmp(argv[arg], "-s") == 0) maDisplaySpacing = genomeSize = atol(argv[++arg]); else if (strcmp(argv[arg], "-o") == 0) outPrefix = argv[++arg]; else if (strcmp(argv[arg], "-single") == 0) single = true; else if (strcmp(argv[arg], "-thin") == 0) minOverlap = atoi(argv[++arg]); // Errors. else { fprintf(stderr, "%s: Unknown option '%s'\n", argv[0], argv[arg]); err++; } arg++; } if (gkpName == NULL) err++; if (tigName == NULL) err++; if ((outPrefix == NULL) && (dumpType == DUMP_COVERAGE)) err++; if (dumpType == DUMP_UNSET) err++; if (err) { fprintf(stderr, "usage: %s -G -T [opts]\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, "STORE SELECTION (mandatory)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -G path to the gatekeeper store\n"); fprintf(stderr, " -T path to the tigStore, version, to use\n"); fprintf(stderr, "\n"); fprintf(stderr, "TIG SELECTION - if nothing specified, all tigs are reported\n"); fprintf(stderr, " - all ranges are inclusive.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -tig A[-B] only dump tigs between ids A and B\n"); fprintf(stderr, " -unassembled only dump tigs that are 'unassembled'\n"); fprintf(stderr, " -bubbles only dump tigs that are 'bubbles'\n"); fprintf(stderr, " -contigs only dump tigs that are 'contigs'\n"); fprintf(stderr, " -nreads min max only dump tigs with between min and max reads\n"); fprintf(stderr, " -length min max only dump tigs with length between 'min' and 'max' bases\n"); fprintf(stderr, " -coverage c C g G only dump tigs with between fraction g and G at coverage between c and C\n"); fprintf(stderr, " example: -coverage 10 inf 0.5 1.0 would report tigs where half of the\n"); 
fprintf(stderr, " bases are at 10+ times coverage.\n"); fprintf(stderr, "\n"); fprintf(stderr, "DUMP TYPE - all dumps, except status, report on tigs selected as above\n"); fprintf(stderr, "\n"); fprintf(stderr, " -status the number of tigs in the store\n"); fprintf(stderr, "\n"); fprintf(stderr, " -tigs a list of tigs, and some information about them\n"); fprintf(stderr, "\n"); fprintf(stderr, " -consensus [opts] the consensus sequence, with options:\n"); fprintf(stderr, " -gapped report the gapped (multialignment) consensus sequence\n"); fprintf(stderr, " -reverse reverse complement the sequence\n"); fprintf(stderr, " -fasta report sequences in FASTA format (the default)\n"); fprintf(stderr, " -fastq report sequences in FASTQ format\n"); fprintf(stderr, "\n"); fprintf(stderr, " -layout [opts] the layout of reads in each tig. if '-o' is supplied, three files are created.\n"); fprintf(stderr, " -gapped report the gapped (multialignment) positions\n"); fprintf(stderr, " -o name write data to 'name.*' files in the current directory\n"); fprintf(stderr, " name.layout - layout of reads\n"); fprintf(stderr, " name.layout.readToTig - read to tig position\n"); fprintf(stderr, " name.layout.tigInfo - metadata for each tig\n"); fprintf(stderr, "\n"); fprintf(stderr, " -multialign [opts] the full multialignment, output is to stdout\n"); fprintf(stderr, " -w width width of the page\n"); fprintf(stderr, " -s spacing spacing between reads on the same line\n"); fprintf(stderr, "\n"); fprintf(stderr, " -sizes [opts] size statistics\n"); fprintf(stderr, " -s genomesize denominator to use for n50 computation\n"); fprintf(stderr, "\n"); fprintf(stderr, " -coverage [opts] read coverage plots, one plot per tig\n"); fprintf(stderr, " -o outputPrefix write plots to 'outputPrefix.*' in the current directory\n"); fprintf(stderr, "\n"); fprintf(stderr, " -depth [opts] a histogram of depths\n"); fprintf(stderr, " -single one histogram per tig\n"); fprintf(stderr, "\n"); fprintf(stderr, " 
-overlap read overlaps\n"); fprintf(stderr, " -thin overlap report regions where the (thickest) read overlap is less than 'overlap' bases\n"); fprintf(stderr, "\n"); fprintf(stderr, " -overlaphistogram a histogram of the thickest overlaps used\n"); fprintf(stderr, " -o outputPrefix write plots to 'outputPrefix.*' in the current directory\n"); fprintf(stderr, "\n"); fprintf(stderr, "\n"); #if 0 fprintf(stderr, " -compress Move tigs from earlier versions into the specified version. This removes\n"); fprintf(stderr, " historical versions of unitigs/contigs, and can save tremendous storage space,\n"); fprintf(stderr, " but makes it impossible to back up the assembly past the specified versions\n"); #endif if (gkpName == NULL) err++; if (tigName == NULL) err++; if ((outPrefix == NULL) && (dumpType == DUMP_COVERAGE)) err++; if (dumpType == DUMP_UNSET) err++; exit(1); } // Open stores. gkStore *gkpStore = gkStore::gkStore_open(gkpName); tgStore *tigStore = new tgStore(tigName, tigVers); // Check that the tig ID range is valid, and fix it if possible. uint32 nTigs = tigStore->numTigs(); if (filter.tigIDend == UINT32_MAX) filter.tigIDend = nTigs-1; if ((nTigs > 0) && (nTigs <= filter.tigIDend)) { fprintf(stderr, "WARNING: adjusting tig ID range from " F_U32 "-" F_U32 " to " F_U32 "-" F_U32 " as there are only " F_U32 " tigs in the store.\n", filter.tigIDbgn, filter.tigIDend, filter.tigIDbgn, nTigs-1, nTigs); filter.tigIDend = nTigs - 1; } if (filter.tigIDend < filter.tigIDbgn) { fprintf(stderr, "WARNING: adjusting inverted tig ID range -t " F_U32 "-" F_U32 "\n", filter.tigIDbgn, filter.tigIDend); uint32 x = filter.tigIDend; filter.tigIDend = filter.tigIDbgn; filter.tigIDbgn = x; } if ((nTigs > 0) && (nTigs <= filter.tigIDbgn)) fprintf(stderr, "ERROR: only " F_U32 " tigs in the store (IDs 0-" F_U32 " inclusive); can't dump requested range -t " F_U32 "-" F_U32 "\n", nTigs, nTigs-1, filter.tigIDbgn, filter.tigIDend), exit(1); // Call the dump routine. 
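The tig ID range fixup just above (clamp an open-ended `-tig A[-B]` range to the store size, then swap an inverted range) can be shown in isolation. The `Range` struct and function name are illustrative only; the real code also prints warnings and rejects a range entirely past the store.

```cpp
#include <algorithm>
#include <cstdint>

struct Range { uint32_t bgn; uint32_t end; };

//  Clamp the inclusive range to valid tig IDs (0 .. nTigs-1) and
//  repair an inverted range such as -tig 20-5.
Range
fixRange(Range r, uint32_t nTigs) {
  if (nTigs > 0 && r.end >= nTigs)
    r.end = nTigs - 1;

  if (r.end < r.bgn)
    std::swap(r.bgn, r.end);

  return r;
}
```

An open-ended range like `-tig 5` with end UINT32_MAX in a 10-tig store becomes 5-9; `-tig 20-5` becomes 5-20.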
switch (dumpType) {
  case DUMP_STATUS:
    dumpStatus(gkpStore, tigStore);
    break;
  case DUMP_TIGS:
    dumpTigs(gkpStore, tigStore, filter, useGapped);
    break;
  case DUMP_CONSENSUS:
    dumpConsensus(gkpStore, tigStore, filter, useGapped, useReverse, cnsFormat);
    break;
  case DUMP_LAYOUT:
    dumpLayout(gkpStore, tigStore, filter, useGapped, outPrefix);
    break;
  case DUMP_MULTIALIGN:
    dumpMultialign(gkpStore, tigStore, filter, maWithQV, maWithDots, maDisplayWidth, maDisplaySpacing);
    break;
  case DUMP_SIZES:
    dumpSizes(gkpStore, tigStore, filter, useGapped, genomeSize);
    break;
  case DUMP_COVERAGE:
    dumpCoverage(gkpStore, tigStore, filter, useGapped, outPrefix);
    break;
  case DUMP_DEPTH_HISTOGRAM:
    dumpDepthHistogram(gkpStore, tigStore, filter, useGapped, single, outPrefix);
    break;
  case DUMP_THIN_OVERLAP:
    dumpThinOverlap(gkpStore, tigStore, filter, useGapped, minOverlap);
    break;
  case DUMP_OVERLAP_HISTOGRAM:
    dumpOverlapHistogram(gkpStore, tigStore, filter, useGapped, outPrefix);
    break;
  default:
    break;
}

//  Clean up.

delete tigStore;

gkpStore->gkStore_close();

//  Bye.

exit(0);
}

canu-1.6/src/stores/tgStoreDump.mk

#  If 'make' isn't run from the root directory, we need to set these to
#  point to the upper level build directory.

ifeq "$(strip ${BUILD_DIR})" ""
  BUILD_DIR    := ../$(OSTYPE)-$(MACHINETYPE)/obj
endif
ifeq "$(strip ${TARGET_DIR})" ""
  TARGET_DIR   := ../$(OSTYPE)-$(MACHINETYPE)/bin
endif

TARGET       := tgStoreDump
SOURCES      := tgStoreDump.C \
                tgTigSizeAnalysis.C

SRC_INCDIRS  := .. ../AS_UTL

TGT_LDFLAGS  := -L${TARGET_DIR}
TGT_LDLIBS   := -lcanu
TGT_PREREQS  := libcanu.a

SUBMAKEFILES :=

canu-1.6/src/stores/tgStoreFilter.C

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_BAT/markRepeatUnique.C * * Modifications by: * * Brian P. Walenz from 2014-MAR-31 to 2014-APR-15 * are Copyright 2014 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren on 2014-APR-14 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz from 2014-OCT-09 to 2015-AUG-14 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-30 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "tgStore.H" #include "intervalList.H" #include using namespace std; // Stats on repeat labeling of input unitigs. 
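tgStoreFilter classifies unitigs by the coverage statistic ("a-stat") that tgStoreCoverageStat.C computes as covStat = rho * globalRate - ln(2) * (numRandom - 1); per the -j option below, a unitig whose a-stat falls below cgbUniqueCutoff (default 10.0) is not treated as unique. A standalone sketch of that formula and threshold (function names are illustrative):

```cpp
#include <cmath>
#include <cstdint>

//  Gene Myers' coverage statistic: rho is the tig's effective length,
//  globalRate the genome-wide read arrival rate (reads per base), and
//  numRandom the count of randomly sampled reads in the tig.
double
coverageStat(double rho, double globalRate, int32_t numRandom) {
  if (numRandom <= 0 || globalRate <= 0.0)
    return 0.0;                                   //  same guard as tgStoreCoverageStat.C

  return rho * globalRate - std::log(2.0) * (numRandom - 1);
}

//  Below the cutoff (-j, default 10.0) a unitig is not considered unique.
bool
isUnique(double covStat, double cgbUniqueCutoff = 10.0) {
  return covStat >= cgbUniqueCutoff;
}
```

A 10 kb tig with arrival rate 0.005 and 30 random reads scores about 29.9 (unique); the same tig with 100 reads scores well below zero (repeat).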
// class ruLabelStat { public: ruLabelStat() { num = 0; len = 0; }; void operator+=(uint64 len_) { num++; len += len_; }; uint32 num; uint64 len; }; ruLabelStat repeat_LowReads; ruLabelStat repeat_LowCovStat; ruLabelStat repeat_Short; ruLabelStat repeat_SingleSpan; ruLabelStat repeat_LowCov; ruLabelStat repeat_IsSingleton; ruLabelStat repeat_IsUnique; ruLabelStat repeat_IsRepeat; intervalList * computeCoverage(tgTig *tig) { intervalList IL; for (uint32 ii=0; iinumberOfChildren(); ii++) { tgPosition *pos = tig->getChild(ii); IL.add(pos->min(), pos->max() - pos->min()); } return(new intervalList(IL)); } int main(int argc, char **argv) { char *gkpName = NULL; char *tigName = NULL; int32 tigVers = -1; int64 genomeSize = 0; char outName[FILENAME_MAX]; char *outPrefix = NULL; FILE *outLOG = NULL; FILE *outSTA = NULL; // From the original description of these values (CGB_UNIQUE_CUTOFF): // A threshold value for Gene's coverage statistic. Values above this value have never been // known to associated with unitigs with fragments that are not contiguous in the genome. double cgbUniqueCutoff = 10.0; double cgbDefinitelyUniqueCutoff = 10.0; double singleReadMaxCoverage = 1.0; // Reads covering more than this will demote the unitig uint32 lowCovDepth = 2; double lowCovFractionAllowed = 1.0; uint32 tooLong = UINT32_MAX; uint32 tooShort = 1000; uint32 minReads = 2; double minPopulous = 0; uint32 bgnID = 0; uint32 endID = 0; uint32 maxID = 0; gkStore *gkpStore = NULL; tgStore *tigStore = NULL; tgStoreType tigMode = tgStoreModify; argc = AS_configure(argc, argv); fprintf(stderr, "this is obsolete. 
do not use.\n"); exit(1); int err = 0; int arg = 1; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-T") == 0) { tigName = argv[++arg]; tigVers = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-o") == 0) { outPrefix = argv[++arg]; } else if (strcmp(argv[arg], "-n") == 0) { tigMode = tgStoreReadOnly; } else if (strcmp(argv[arg], "-j") == 0) { cgbUniqueCutoff = atof(argv[++arg]); } else if (strcmp(argv[arg], "-k") == 0) { cgbDefinitelyUniqueCutoff = atof(argv[++arg]); } else if (strcmp(argv[arg], "-span") == 0) { singleReadMaxCoverage = atof(argv[++arg]); // If a single read spans more than this fraction of unitig, it is demoted } else if (strcmp(argv[arg], "-lowcov") == 0) { lowCovDepth = atoi(argv[++arg]); // Coverage below this is too low lowCovFractionAllowed = atof(argv[++arg]); // If unitig has more than this fraction low coverage, it is demoted } else if (strcmp(argv[arg], "-reads") == 0) { arg++; minReads = atoi(argv[arg]); // If unitig has fewer than this number of reads it is demoted minPopulous = atof(argv[arg]); if (minPopulous > 1.0) minPopulous = 0; } else if (strcmp(argv[arg], "-long") == 0) { tooLong = atoi(argv[++arg]); // Unitigs longer than this cannot be demoted } else if (strcmp(argv[arg], "-short") == 0) { tooShort = atoi(argv[++arg]); // Unitigs shorter than this are demoted } else { err++; } arg++; } if (gkpName == NULL) err++; if (tigName == NULL) err++; if (err) { fprintf(stderr, "usage: %s -g gkpStore -t tigStore version\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, " -G Mandatory, path G to a gkpStore directory.\n"); fprintf(stderr, " -T Mandatory, path T to a tigStore, and version V.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -j J Unitig is not unique if astat is below J (cgbUniqueCutoff)\n"); fprintf(stderr, " -k K (unused) (cgbDefinitelyUniqueCutoff)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -span F Unitig is not unique if a single read spans more than 
fraction F (default 1.0) of unitig\n");
    fprintf(stderr, "  -lowcov D F  Unitig is not unique if fraction F (default 1.0) of unitig is below read depth D (default 2)\n");
    fprintf(stderr, "  -reads R     Unitig is not unique if unitig has fewer than R (default 2) reads\n");
    fprintf(stderr, "               If R is fractional, the least populous unitigs containing fraction R of reads are marked as repeat\n");
    fprintf(stderr, "               Example: unitigs with 9 or fewer reads contain 10%% of the reads.  -reads 0.10 would mark these as repeat.\n");
    fprintf(stderr, "  -long L      Unitig is unique if unitig is at least L (default unlimited) bases long\n");
    fprintf(stderr, "  -short S     Unitig is not unique if unitig is shorter than S (default 1000) bases long\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "  -o           Prefix for output files.\n");
    fprintf(stderr, "  -n           Do not update the tigStore.\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "Algorithm:  The first rule to trigger will mark the unitig.\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "  1)  A unitig with a single read is NOT unique.\n");
    fprintf(stderr, "  2)  A unitig with fewer than R (-reads) reads is NOT unique.\n");
    fprintf(stderr, "  3)  A unitig with a single read spanning fraction F (-span) of the unitig is NOT unique.\n");
    fprintf(stderr, "  4)  A unitig longer than L (-long) bases IS unique.\n");
    fprintf(stderr, "  5)  A unitig with astat less than J (-j) is NOT unique.\n");
    fprintf(stderr, "  6)  A unitig with fraction F below coverage D (-lowcov) is NOT unique.\n");
    fprintf(stderr, "  7)  A unitig shorter than S (-short) bases long is NOT unique.\n");
    fprintf(stderr, "  8)  Otherwise, the unitig IS unique.\n");

    if (gkpName == NULL)
      fprintf(stderr, "No gatekeeper store (-G option) supplied.\n");
    if (tigName == NULL)
      fprintf(stderr, "No input tigStore (-T option) supplied.\n");
    if (outPrefix == NULL)
      fprintf(stderr, "No output prefix (-o option) supplied.\n");

    exit(1);
  }

  gkpStore = gkStore::gkStore_open(gkpName, gkStore_readOnly);
  tigStore = new tgStore(tigName,
tigVers, tgStoreReadOnly); if (endID == 0) endID = tigStore->numTigs(); maxID = tigStore->numTigs(); errno = 0; snprintf(outName, FILENAME_MAX, "%s.log", outPrefix); outLOG = fopen(outName, "w"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", outName, strerror(errno)), exit(1); fprintf(outLOG, "tigID\trho\tcovStat\tarrDist\n"); snprintf(outName, FILENAME_MAX, "%s.stats", outPrefix); outSTA = fopen(outName, "w"); if (errno) fprintf(stderr, "Failed to open '%s': %s\n", outName, strerror(errno)), exit(1); fprintf(stderr, "Command Line options:\n"); fprintf(stderr, " singleReadMaxCoverage %f\n", singleReadMaxCoverage); fprintf(stderr, " lowCoverage %u coverage %f fraction\n", lowCovDepth, lowCovFractionAllowed); if (minPopulous > 0) fprintf(stderr, " minReads recomputed based on least populous fraction %.4f\n", minPopulous); else fprintf(stderr, " minReads %u\n", minReads); fprintf(stderr, " tooLong %u\n", tooLong); fprintf(stderr, " tooShort %u\n", tooShort); // // Load fragment data // fprintf(stderr, "Loading fragment data.\n"); double globalRate = 0; bool *isNonRandom = new bool [gkpStore->gkStore_getNumReads() + 1]; uint32 *fragLength = new uint32 [gkpStore->gkStore_getNumReads() + 1]; for (uint32 fi=1; fi <= gkpStore->gkStore_getNumReads(); fi++) { gkRead *read = gkpStore->gkStore_getRead(fi); gkLibrary *libr = gkpStore->gkStore_getLibrary(read->gkRead_libraryID()); isNonRandom[fi] = libr->gkLibrary_isNonRandom(); fragLength[fi] = read->gkRead_sequenceLength(); } // // Compute: // global coverage histogram // global single-read fraction covered - number of bases in unitigs with specific fraction covered // global number of reads per unitig // // This is using GAPPED lengths, because they're faster, and we don't need the actual ungapped positions. 
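The single-read span statistic listed above is computed per unitig as the largest fraction covered by any one read, quantized into 1000 bins so it can be histogrammed with integer keys (the expression mirrors `1000 * (pos->max() - pos->min()) / tigLen` in the loop below; the helper name here is invented for illustration):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

//  Largest single-read span of a unitig, as a bin in [0, 1000].
//  binValue / 1000.0 is the spanned fraction of the unitig.
uint32_t
maxSpanBin(const std::vector<std::pair<uint32_t, uint32_t>> &reads, uint32_t tigLen) {
  uint32_t  covMax = 0;

  for (auto &r : reads) {
    uint32_t cov = 1000 * (r.second - r.first) / tigLen;   //  Integer truncation, as in the tool.

    if (covMax < cov)
      covMax = cov;
  }

  return(covMax);
}
```

A unitig of length 1000 with reads at `[0,500)` and `[250,1000)` gets bin 750, i.e. a 0.75 span fraction, which `-span 0.5` would flag as a spanning read.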
fprintf(stderr, "Generating statistics.\n"); uint32 covHistogramMax = 1048576; uint32 *covHistogram = new uint32 [covHistogramMax]; uint32 **utgCovHistogram = new uint32 * [maxID]; uint32 *utgCovData = new uint32 [maxID * lowCovDepth]; memset(covHistogram, 0, sizeof(uint32) * covHistogramMax); memset(utgCovHistogram, 0, sizeof(uint32) * maxID); memset(utgCovData, 0, sizeof(uint32) * maxID * lowCovDepth); uint32 *singleReadCoverageHistogram = new uint32 [1001]; double *singleReadCoverage = new double [maxID]; memset(singleReadCoverageHistogram, 0, sizeof(uint32) * 1001); memset(singleReadCoverage, 0, sizeof(double) * maxID); uint32 numReadsPerUnitigMax = maxID + 1; uint32 *numReadsPerUnitig = new uint32 [numReadsPerUnitigMax]; memset(numReadsPerUnitig, 0, sizeof(uint32) * numReadsPerUnitigMax); for (uint32 uu=0; uuloadTig(uu); uint32 tigLen = tig->length(true); if (tig == NULL) continue; utgCovHistogram[tig->tigID()] = utgCovData + tig->tigID() * lowCovDepth; if (tig->numberOfChildren() == 1) continue; // Global coverage histogram. intervalList *ID = computeCoverage(tig); for (uint32 ii=0; iinumberOfIntervals(); ii++) { if (ID->depth(ii) < lowCovDepth) utgCovHistogram[tig->tigID()][ID->depth(ii)] += ID->hi(ii) - ID->lo(ii) + 1; if (ID->depth(ii) < covHistogramMax) covHistogram[ID->depth(ii)] += ID->hi(ii) - ID->lo(ii) + 1; } delete ID; ID = NULL; // Single read max fraction covered. uint32 covMax = 0; uint32 cov; for (uint32 ff=0; ffnumberOfChildren(); ff++) { tgPosition *pos = tig->getChild(ff); cov = 1000 * (pos->max() - pos->min()) / tigLen; if (covMax < cov) covMax = cov; } singleReadCoverageHistogram[covMax]++; singleReadCoverage[tig->tigID()] = covMax / 1000.0; // Number of reads per unitig numReadsPerUnitig[uu] = tig->numberOfChildren(); //fprintf(stderr, "unitig %u covMax %f\n", tig->tigID(), covMax / 1000.0); tigStore->unloadTig(uu); } // // Analyze our collected data, decide on some thresholds. 
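The fractional `-reads` mode analyzed below picks a concrete read-count threshold from the data: the smallest R such that unitigs with fewer than R reads together hold the requested fraction of all reads. A self-contained sketch of that cumulative scan (the function name and exact tie-breaking are assumptions, not canu's code):

```cpp
#include <cstdint>
#include <vector>

//  Given the read count of each unitig, find the smallest threshold R such
//  that unitigs with fewer than R reads contain at least `fraction` of all reads.
uint32_t
minReadsForFraction(const std::vector<uint32_t> &readsPerUnitig, double fraction) {
  uint32_t  maxReads = 0;
  uint64_t  total    = 0;

  for (uint32_t n : readsPerUnitig) {
    if (maxReads < n)
      maxReads = n;
    total += n;
  }

  //  totReads[n] = number of reads sitting in unitigs with exactly n reads.
  std::vector<uint64_t>  totReads(maxReads + 1, 0);

  for (uint32_t n : readsPerUnitig)
    totReads[n] += n;

  //  Walk the least populous unitigs first, accumulating reads.
  uint64_t  sum = 0;

  for (uint32_t n = 0; n <= maxReads; n++) {
    sum += totReads[n];

    if ((double)sum / total >= fraction)
      return(n + 1);   //  Unitigs with fewer than n+1 reads hold the fraction.
  }

  return(maxReads + 1);
}
```

With unitigs of 1, 1, 2, and 6 reads (10 total), `-reads 0.2` resolves to a threshold of 2 (the two singletons hold 20% of the reads), and 0.4 resolves to 3.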
// fprintf(stderr, "Analyzing statistics.\n"); if ((minPopulous > 0.0) && (minPopulous < 1.0)) { uint32 maxReadsPerUnitig = 0; for (uint32 uu=0; uu 0) fprintf(outSTA, "minReads %u with fraction %.04f of reads\n", hist[xx], vals[xx]); delete [] totReadsPerNumReads; } // // Apply the thresholds to unitigs. The first half of these are the historical CGW rules. // fprintf(stderr, "Processing unitigs %u to %u.\n", bgnID, endID); for (uint32 uu=bgnID; uuloadTig(uu); if (tig == NULL) { fprintf(outLOG, "unitig %d not present\n", uu); continue; } // This uses UNGAPPED lengths, because they make more sense to humans. uint32 tigLen = tig->length(false); uint32 lowCovBases = 0; for (uint32 ll=0; lltigID()][ll]; bool isUnique = true; bool isSingleton = false; if (tig->numberOfChildren() == 1) { fprintf(outLOG, "unitig %d not unique -- singleton\n", tig->tigID()); isUnique = false; isSingleton = true; } else if (tig->numberOfChildren() < minReads) { fprintf(outLOG, "unitig %d not unique -- %u reads, need at least %d\n", tig->tigID(), tig->numberOfChildren(), minReads); repeat_LowReads += tigLen; isUnique = false; } else if (singleReadCoverage[tig->tigID()] > singleReadMaxCoverage) { fprintf(outLOG, "unitig %d not unique -- single read spans fraction %f of unitig (>= %f)\n", tig->tigID(), singleReadCoverage[tig->tigID()], singleReadMaxCoverage); repeat_SingleSpan += tigLen; isUnique = false; } else if (tigLen >= tooLong) { fprintf(outLOG, "unitig %d IS unique -- too long to be repeat, %u > allowed %u\n", tig->tigID(), tigLen, tooLong); isUnique = true; } else if (tigStore->getCoverageStat(tig->tigID()) < cgbUniqueCutoff) { fprintf(outLOG, "unitig %d not unique -- coverage stat %f, needs to be at least %f\n", tig->tigID(), tigStore->getCoverageStat(tig->tigID()), cgbUniqueCutoff); repeat_LowCovStat += tigLen; isUnique = false; } else if ((double)lowCovBases / tigLen > lowCovFractionAllowed) { fprintf(outLOG, "unitig %d not unique -- too many low coverage bases, %u out of %u bases, 
fraction %f > allowed %f\n", tig->tigID(), lowCovBases, tigLen, (double)lowCovBases / tigLen, lowCovFractionAllowed); repeat_LowCov += tigLen; isUnique = false; } // This was an attempt to not blindly call all short unitigs as non-unique. It didn't work so // well in initial limited testing. The threshold is arbitrary; older versions used // cgbDefinitelyUniqueCutoff. If used, be sure to disable the real check after this! #if 0 else if ((tigStore->getCoverageStat(tig->tigID()) < cgbUniqueCutoff * 10) && (tigLen < CGW_MIN_DISCRIMINATOR_UNIQUE_LENGTH)) { fprintf(outLOG, "unitig %d not unique -- length %d too short, need to be at least %d AND coverage stat %d must be larger than %d\n", tig->tigID(), tigLen, CGW_MIN_DISCRIMINATOR_UNIQUE_LENGTH, tigStore->getCoverageStat(tig->tigID()), cgbUniqueCutoff * 10); repeat_Short += tigLen; isUnique = false; } #endif else if (tigLen < tooShort) { fprintf(outLOG, "unitig %d not unique -- length %d too short, need to be at least %d\n", tig->tigID(), tigLen, tooShort); repeat_Short += tigLen; isUnique = false; } else { fprintf(outLOG, "unitig %d not repeat -- no test failed\n", tig->tigID()); } // // Allow flag to override the rules and force it to be unique or repeat. AKA, toggling. 
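The eight rules listed in the usage text are applied as a first-match-wins cascade, which is exactly the shape of the `if / else if` chain above. A compact sketch of the same decision order — thresholds mirror the option defaults, and the function and label names are illustrative only:

```cpp
#include <cstdint>
#include <string>

//  First-rule-wins classification, mirroring the documented rule order.
std::string
classifyUnitig(uint32_t nReads,  uint32_t tigLen,
               double   maxSpan, double   astat,  double lowCovFrac,
               uint32_t minReads  = 2,
               double   spanMax   = 1.0,
               uint32_t tooLong   = UINT32_MAX,
               double   astatMin  = 10.0,
               double   lowCovMax = 1.0,
               uint32_t tooShort  = 1000) {
  if (nReads == 1)             return("singleton");       //  Rule 1
  if (nReads < minReads)       return("too few reads");   //  Rule 2
  if (maxSpan > spanMax)       return("spanning read");   //  Rule 3
  if (tigLen >= tooLong)       return("unique");          //  Rule 4
  if (astat < astatMin)        return("low cov stat");    //  Rule 5
  if (lowCovFrac > lowCovMax)  return("low coverage");    //  Rule 6
  if (tigLen < tooShort)       return("too short");       //  Rule 7
  return("unique");                                       //  Rule 8
}
```

Note the asymmetry: rule 4 can rescue a long unitig before the astat and coverage tests run, so a long unitig with a poor astat is still called unique.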
//

    if (isUnique) {
      repeat_IsUnique += tigLen;

#warning NOT setSuggestUnique
      //tigStore->setSuggestUnique(tig->tigID());
      tigStore->setSuggestRepeat(tig->tigID(), false);
    }

    else if (isSingleton) {
      repeat_IsSingleton += tigLen;

#warning NOT setSuggestUnique
      //tigStore->setSuggestUnique(tig->tigID(), false);
      tigStore->setSuggestRepeat(tig->tigID());
    }

    else {
      repeat_IsRepeat += tigLen;

#warning NOT setSuggestUnique
      //tigStore->setSuggestUnique(tig->tigID(), false);
      tigStore->setSuggestRepeat(tig->tigID());
    }
  }

  fprintf(outSTA, "classification      number of unitigs    total length\n");
  fprintf(outSTA, "  unique:           %17" F_U32P "  %14" F_U64P "\n", repeat_IsUnique.num,    repeat_IsUnique.len);
  fprintf(outSTA, "  singleton:        %17" F_U32P "  %14" F_U64P "\n", repeat_IsSingleton.num, repeat_IsSingleton.len);
  fprintf(outSTA, "  repeat:           %17" F_U32P "  %14" F_U64P "\n", repeat_IsRepeat.num,    repeat_IsRepeat.len);
  fprintf(outSTA, "    too few reads:  %17" F_U32P "  %14" F_U64P "\n", repeat_LowReads.num,    repeat_LowReads.len);
  fprintf(outSTA, "    low cov stat:   %17" F_U32P "  %14" F_U64P "\n", repeat_LowCovStat.num,  repeat_LowCovStat.len);
  fprintf(outSTA, "    too short:      %17" F_U32P "  %14" F_U64P "\n", repeat_Short.num,       repeat_Short.len);
  fprintf(outSTA, "    spanning read:  %17" F_U32P "  %14" F_U64P "\n", repeat_SingleSpan.num,  repeat_SingleSpan.len);
  fprintf(outSTA, "    low coverage:   %17" F_U32P "  %14" F_U64P "\n", repeat_LowCov.num,      repeat_LowCov.len);

  fclose(outLOG);
  fclose(outSTA);

  delete [] isNonRandom;
  delete [] fragLength;

  gkpStore->gkStore_close();
  delete tigStore;

  exit(0);
}
canu-1.6/src/stores/tgStoreFilter.mk

#  If 'make' isn't run from the root directory, we need to set these to
#  point to the upper level build directory.
ifeq "$(strip ${BUILD_DIR})" ""
  BUILD_DIR    := ../$(OSTYPE)-$(MACHINETYPE)/obj
endif
ifeq "$(strip ${TARGET_DIR})" ""
  TARGET_DIR   := ../$(OSTYPE)-$(MACHINETYPE)/bin
endif

TARGET   := tgStoreFilter
SOURCES  := tgStoreFilter.C

SRC_INCDIRS  := .. ../AS_UTL

TGT_LDFLAGS := -L${TARGET_DIR}
TGT_LDLIBS  := -lcanu
TGT_PREREQS := libcanu.a

SUBMAKEFILES :=
canu-1.6/src/stores/tgStoreLoad.C

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz from 2015-AUG-07 to 2015-AUG-14
 *      are Copyright 2015 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *    Brian P. Walenz beginning on 2015-DEC-07
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
*/ #include "AS_global.H" #include "gkStore.H" #include "tgStore.H" void operationBuild(char *buildName, char *tigName, uint32 tigVers) { errno = 0; FILE *F = fopen(buildName, "r"); if (errno) fprintf(stderr, "Failed to open '%s' for reading: %s\n", buildName, strerror(errno)), exit(1); if (AS_UTL_fileExists(tigName, TRUE, TRUE)) { fprintf(stderr, "ERROR: '%s' exists, and I will not clobber an existing store.\n", tigName); exit(1); } tgStore *tigStore = new tgStore(tigName); tgTig *tig = new tgTig(); for (int32 v=1; vnextVersion(); while (tig->loadLayout(F) == true) { if (tig->numberOfChildren() == 0) continue; // The log isn't correct. For new tigs (all of these are) we don't know the // id until after it is added. Further, if these come with id's already set, // they can't be added to a new store -- they don't exist. #if 0 fprintf(stderr, "INSERTING tig %d (%d children) (originally ID %d)\n", tig->tigID(), tig->numberOfChildren(), oID); #endif tigStore->insertTig(tig, false); } fclose(F); delete tig; delete tigStore; } int main (int argc, char **argv) { char *gkpName = NULL; char *tigName = NULL; int32 tigVers = -1; vector tigInputs; char *tigInputsFile = NULL; tgStoreType tigType = tgStoreModify; argc = AS_configure(argc, argv); vector err; int arg = 1; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-T") == 0) { tigName = argv[++arg]; tigVers = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-L") == 0) { tigInputsFile = argv[++arg]; AS_UTL_loadFileList(tigInputsFile, tigInputs); } else if (strcmp(argv[arg], "-n") == 0) { tigType = tgStoreReadOnly; } else if (AS_UTL_fileExists(argv[arg])) { tigInputs.push_back(argv[arg]); } else { char *s = new char [1024]; snprintf(s, 1024, "ERROR: Unknown option '%s'.\n", argv[arg]); err.push_back(s); } arg++; } if (gkpName == NULL) err.push_back("ERROR: no gatekeeper store (-G) supplied.\n"); if (tigName == NULL) err.push_back("ERROR: no tig store (-T) 
supplied.\n"); if ((tigInputs.size() == 0) && (tigInputsFile == NULL)) err.push_back("ERROR: no input tigs supplied on command line and no -L file supplied.\n"); if (err.size() > 0) { fprintf(stderr, "usage: %s -G -T [input.cns]\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, " -G Path to the gatekeeper store\n"); fprintf(stderr, " -T Path to the tigStore and version to add tigs to\n"); fprintf(stderr, "\n"); fprintf(stderr, " -L Load the tig(s) from files listed in 'file-of-files'\n"); fprintf(stderr, " (WARNING: program will succeed if this file is empty)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -n Don't replace, just report what would have happened\n"); fprintf(stderr, "\n"); fprintf(stderr, " The primary operation is to replace tigs in the store with ones in a set of input files.\n"); fprintf(stderr, " The input files can be either supplied directly on the command line or listed in\n"); fprintf(stderr, " a text file (-L).\n"); fprintf(stderr, "\n"); fprintf(stderr, " A new store is created if one doesn't exist, otherwise, whatever tigs are there are\n"); fprintf(stderr, " replaced with those in the -R file. If version 'v' doesn't exist, it is created.\n"); fprintf(stderr, "\n"); fprintf(stderr, " Even if -n is supplied, a new store is created if one doesn't exist.\n"); fprintf(stderr, "\n"); fprintf(stderr, " To add a new tig, give it a tig id of -1. New tigs must be added to the latest version.\n"); fprintf(stderr, " To delete a tig, remove all children, and set the number of them to zero.\n"); fprintf(stderr, "\n"); for (uint32 ii=0; iinextVersion(); delete tigStore; } gkStore *gkpStore = gkStore::gkStore_open(gkpName); tgStore *tigStore = new tgStore(tigName, tigVers, tigType); tgTig *tig = new tgTig; for (uint32 ff=0; ffloadFromStreamOrLayout(TI) == true) { // Handle insertion. if (tig->numberOfChildren() > 0) { //fprintf(stderr, "INSERTING tig %d\n", tig->tigID()); tigStore->insertTig(tig, false); continue; } // Deleted already? 
if (tigStore->isDeleted(tig->tigID()) == true) { //fprintf(stderr, "DELETING tig %d -- ALREADY DELETED\n", tig->tigID()); continue; } // Really delete it then. //fprintf(stderr, "DELETING tig %d\n", tig->tigID()); tigStore->deleteTig(tig->tigID()); } fclose(TI); fprintf(stderr, "Reading layouts from '%s' completed.\n", tigInputs[ff]); } delete tig; delete tigStore; gkpStore->gkStore_close(); exit(0); } canu-1.6/src/stores/tgStoreLoad.mk000066400000000000000000000007271314437614700172140ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := tgStoreLoad SOURCES := tgStoreLoad.C SRC_INCDIRS := .. ../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/stores/tgTig.C000066400000000000000000000511731314437614700156170ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2014-DEC-22 to 2015-AUG-11 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. 
Walenz beginning on 2015-OCT-30 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2015-NOV-25 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "tgTig.H" #include "AS_UTL_fileIO.H" #include "AS_UTL_fasta.C" #include "AS_UTL_reverseComplement.H" #include "splitToWords.H" #include "intervalList.H" tgPosition::tgPosition() { _objID = UINT32_MAX; _isRead = true; // Bogus values. _isUnitig = true; _isContig = true; _isReverse = false; _spare = 0; _anchor = UINT32_MAX; _ahang = INT32_MAX; _bhang = INT32_MAX; _askip = 0; _bskip = 0; _min = INT32_MIN; _max = INT32_MAX; _deltaOffset = UINT32_MAX; _deltaLen = 0; } tgTigRecord::tgTigRecord() { _tigID = UINT32_MAX; _coverageStat = 0.0; _microhetProb = 0.0; _class = tgTig_noclass; _suggestRepeat = false; _suggestCircular = false; _spare = 0; _layoutLen = 0; _gappedLen = 0; _childrenLen = 0; _childDeltasLen = 0; } tgTig::tgTig() { _tigID = UINT32_MAX; _coverageStat = 0; _microhetProb = 0; _utgcns_verboseLevel = 0; _class = tgTig_noclass; _suggestRepeat = 0; _suggestCircular = 0; _spare = 0; _layoutLen = 0; _gappedBases = NULL; _gappedQuals = NULL; _gappedLen = 0; _gappedMax = 0; _ungappedBases = NULL; _ungappedQuals = NULL; _ungappedLen = 0; _ungappedMax = 0; _gappedToUngapped = NULL; _children = NULL; _childrenLen = 0; _childrenMax = 0; _childDeltas = NULL; _childDeltasLen = 0; _childDeltasMax = 0; } tgTig::~tgTig() { delete [] _gappedBases; delete [] _gappedQuals; delete [] _ungappedBases; delete [] _ungappedQuals; delete [] _gappedToUngapped; delete [] _children; delete [] _childDeltas; } // Copy data from an in-core tgTig to an on-disk tgTigRecord. 
tgTigRecord & tgTigRecord::operator=(tgTig & tg) { _tigID = tg._tigID; _coverageStat = tg._coverageStat; _microhetProb = tg._microhetProb; _class = tg._class; _suggestRepeat = tg._suggestRepeat; _suggestCircular = tg._suggestCircular; _spare = tg._spare; _layoutLen = tg._layoutLen; _gappedLen = tg._gappedLen; _childrenLen = tg._childrenLen; _childDeltasLen = tg._childDeltasLen; return(*this); } // Copy data from an on-disk tgTigRecord to an in-core tgTig. tgTig & tgTig::operator=(tgTigRecord & tr) { _tigID = tr._tigID; _coverageStat = tr._coverageStat; _microhetProb = tr._microhetProb; _class = tr._class; _suggestRepeat = tr._suggestRepeat; _suggestCircular = tr._suggestCircular; _spare = tr._spare; _layoutLen = tr._layoutLen; _gappedLen = tr._gappedLen; _childrenLen = tr._childrenLen; _childDeltasLen = tr._childDeltasLen; return(*this); } // Deep copy the tig. tgTig & tgTig::operator=(tgTig & tg) { _tigID = tg._tigID; _coverageStat = tg._coverageStat; _microhetProb = tg._microhetProb; _class = tg._class; _suggestRepeat = tg._suggestRepeat; _suggestCircular = tg._suggestCircular; _spare = tg._spare; _layoutLen = tg._layoutLen; _gappedLen = tg._gappedLen; duplicateArray(_gappedBases, _gappedLen, _gappedMax, tg._gappedBases, tg._gappedLen, tg._gappedMax); duplicateArray(_gappedQuals, _gappedLen, _gappedMax, tg._gappedQuals, tg._gappedLen, tg._gappedMax, true); if (_gappedLen > 0) { assert(_gappedMax > _gappedLen); _gappedBases[_gappedLen] = 0; _gappedQuals[_gappedLen] = 0; } _ungappedLen = tg._ungappedLen; duplicateArray(_ungappedBases, _ungappedLen, _ungappedMax, tg._ungappedBases, tg._ungappedLen, tg._ungappedMax); duplicateArray(_ungappedQuals, _ungappedLen, _ungappedMax, tg._ungappedQuals, tg._ungappedLen, tg._ungappedMax, true); if (_ungappedLen > 0) { assert(_ungappedMax > _ungappedLen); _ungappedBases[_ungappedLen] = 0; _ungappedQuals[_ungappedLen] = 0; } duplicateArray(_gappedToUngapped, _gappedLen, _gappedMax, tg._gappedToUngapped, tg._gappedLen, 
tg._gappedMax, true); _childrenLen = tg._childrenLen; duplicateArray(_children, _childrenLen, _childrenMax, tg._children, tg._childrenLen, tg._childrenMax); _childDeltasLen = tg._childDeltasLen; duplicateArray(_childDeltas, _childDeltasLen, _childDeltasMax, tg._childDeltas, tg._childDeltasLen, tg._childDeltasMax); return(*this); } double tgTig::computeCoverage(bool useGapped) { intervalList allL; for (uint32 ci=0; cimin() : mapGappedToUngapped(read->min()); uint32 end = (useGapped) ? read->max() : mapGappedToUngapped(read->max()); allL.add(bgn, end - bgn); } intervalList ID(allL); double aveDepth = 0; for (uint32 ii=0; ii 0) // Already computed. Return what is here. return; if (_gappedLen == 0) { // No gapped sequence to convert to ungapped. fprintf(stderr, "tgTig::buildUngapped()-- WARNING: tried to build ungapped sequence for tigID %u before consensus exists.\n", tigID()); return; } // Allocate more space, if needed. We'll need no more than gappedMax. We need to stash away the // max size so we can call two resizeArray() functions. uint64 ugMax = _ungappedMax; resizeArrayPair(_ungappedBases, _ungappedQuals, 0, _ungappedMax, _gappedMax, resizeArray_doNothing); resizeArray(_gappedToUngapped, 0, ugMax, _gappedMax, resizeArray_doNothing); // gappedLen doesn't include the terminating null, but gappedMax does. // See abMultiAlign.C, among other places. if (_gappedLen >= _gappedMax) fprintf(stderr, "ERROR: gappedLen = %u >= gappedMax = %u\n", _gappedLen+1, _gappedMax); assert(_gappedLen < _gappedMax); // Copy all but the gaps. _ungappedLen = 0; for (uint32 gp=0; gp<_gappedLen; gp++) { _gappedToUngapped[gp] = _ungappedLen; if (_gappedBases[gp] == '-') continue; _ungappedBases[_ungappedLen] = _gappedBases[gp]; _ungappedQuals[_ungappedLen] = _gappedQuals[gp]; _ungappedLen++; } assert(_ungappedLen < _ungappedMax); // Terminate it. Lots of work just for printf...and getting rid of gaps. 
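The copy loop above — keep every non-gap base, and record for each gapped position the ungapped position it maps to — can be sketched standalone (the helper name and `std::string` types are for illustration; `buildUngapped()` works on raw char arrays):

```cpp
#include <cstdint>
#include <string>
#include <vector>

//  Strip '-' gaps from a consensus sequence while building the
//  gapped-to-ungapped coordinate map used to translate read positions.
std::string
removeGaps(const std::string &gapped, std::vector<uint32_t> &gappedToUngapped) {
  std::string  ungapped;

  gappedToUngapped.resize(gapped.size() + 1);

  for (uint32_t gp = 0; gp < gapped.size(); gp++) {
    gappedToUngapped[gp] = ungapped.size();   //  Where this gapped position lands.

    if (gapped[gp] != '-')
      ungapped += gapped[gp];
  }

  gappedToUngapped[gapped.size()] = ungapped.size();   //  One-past-the-end entry.

  return(ungapped);
}
```

Gap positions map to the next real base's ungapped coordinate, so a read interval expressed in gapped coordinates can be translated endpoint by endpoint.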
_gappedToUngapped[_gappedLen] = _ungappedLen; _ungappedBases[_ungappedLen] = 0; _ungappedQuals[_ungappedLen] = 0; } // Clears the data but doesn't release memory. The only way to do that is to delete it. void tgTig::clear(void) { _tigID = UINT32_MAX; _coverageStat = 0; _microhetProb = 0; _class = tgTig_noclass; _suggestRepeat = 0; _suggestCircular = 0; _spare = 0; _layoutLen = 0; _gappedLen = 0; _ungappedLen = 0; _childrenLen = 0; _childDeltasLen = 0; } bool tgTig::loadFromStreamOrLayout(FILE *F) { // Decide if the file contains an ASCII layout or a binary stream. It's probably rather fragile, // testing if the first byte is 't' (from 'tig') or 'T' (from 'TIGR'). int ch = getc(F); ungetc(ch, F); if (ch == 't') return(loadLayout(F)); else if (ch == 'T') return(loadFromStream(F)); else return(false); } void tgTig::saveToStream(FILE *F) { tgTigRecord tr = *this; char tag[4] = {'T', 'I', 'G', 'R', }; // That's tigRecord, not TIGR AS_UTL_safeWrite(F, tag, "tgTig::saveToStream::tigr", sizeof(char), 4); AS_UTL_safeWrite(F, &tr, "tgTig::saveToStream::tr", sizeof(tgTigRecord), 1); // We could save the null byte too, but don't. It's explicitly added during the load. if (_gappedLen > 0) { AS_UTL_safeWrite(F, _gappedBases, "tgTig::saveToStream::gappedBases", sizeof(char), _gappedLen); AS_UTL_safeWrite(F, _gappedQuals, "tgTig::saveToStream::gappedQuals", sizeof(char), _gappedLen); } if (_childrenLen > 0) AS_UTL_safeWrite(F, _children, "tgTig::saveToStream::children", sizeof(tgPosition), _childrenLen); if (_childDeltasLen > 0) AS_UTL_safeWrite(F, _childDeltas, "tgTig::saveToStream::childDeltas", sizeof(int32), _childDeltasLen); } bool tgTig::loadFromStream(FILE *F) { char tag[4]; clear(); // Read the tgTigRecord from disk and copy it into our tgTig. 
tgTigRecord tr; if (4 != AS_UTL_safeRead(F, tag, "tgTig::saveToStream::tigr", sizeof(char), 4)) { fprintf(stderr, "tgTig::loadFromStream()-- failed to read four byte code: %s\n", strerror(errno)); return(false); } if ((tag[0] != 'T') || (tag[1] != 'I') || (tag[2] != 'G') || (tag[3] != 'R')) { fprintf(stderr, "tgTig::loadFromStream()-- not at a tigRecord, got bytes '%c%c%c%c' (0x%02x%02x%02x%02x).\n", tag[0], tag[1], tag[2], tag[3], tag[0], tag[1], tag[2], tag[3]); return(false); } if (0 == AS_UTL_safeRead(F, &tr, "tgTig::loadFromStream::tr", sizeof(tgTigRecord), 1)) { fprintf(stderr, "tgTig::loadFromStream()-- failed to read tgTigRecord: %s\n", strerror(errno)); return(false); } *this = tr; // Allocate space for bases/quals and load them. Be sure to terminate them, too. resizeArrayPair(_gappedBases, _gappedQuals, 0, _gappedMax, _gappedLen + 1, resizeArray_doNothing); if (_gappedLen > 0) { AS_UTL_safeRead(F, _gappedBases, "tgTig::loadFromStream::gappedBases", sizeof(char), _gappedLen); AS_UTL_safeRead(F, _gappedQuals, "tgTig::loadFromStream::gappedQuals", sizeof(char), _gappedLen); _gappedBases[_gappedLen] = 0; _gappedQuals[_gappedLen] = 0; } // Allocate space for reads and alignments, and load them. resizeArray(_children, 0, _childrenMax, _childrenLen, resizeArray_doNothing); resizeArray(_childDeltas, 0, _childDeltasMax, _childDeltasLen, resizeArray_doNothing); if (_childrenLen > 0) AS_UTL_safeRead(F, _children, "tgTig::savetoStream::children", sizeof(tgPosition), _childrenLen); if (_childDeltasLen > 0) AS_UTL_safeRead(F, _childDeltas, "tgTig::loadFromStream::childDeltas", sizeof(int32), _childDeltasLen); // Return success. 
return(true); }; void tgTig::dumpLayout(FILE *F) { char deltaString[128] = {0}; char trimString[128] = {0}; if (_gappedLen > 0) assert(_gappedLen == _layoutLen); fprintf(F, "tig " F_U32 "\n", _tigID); fprintf(F, "len %d\n", _layoutLen); // Adjust QV's to Sanger encoding for (uint32 ii=0; ii<_gappedLen; ii++) _gappedQuals[ii] += '!'; // Dump the sequence and quality if (_gappedLen == 0) { fputs("cns\n", F); fputs("qlt\n", F); } else { fputs("cns ", F); fputs(_gappedBases, F); fputs("\n", F); // strings are null terminated now, but expected to be long. fputs("qlt ", F); fputs(_gappedQuals, F); fputs("\n", F); } // Adjust QV's back to no encoding for (uint32 ii=0; ii<_gappedLen; ii++) _gappedQuals[ii] -= '!'; // Properties. fprintf(F, "coverageStat %f\n", _coverageStat); fprintf(F, "microhetProb %f\n", _microhetProb); fprintf(F, "class %s\n", toString(_class)); fprintf(F, "suggestRepeat %c\n", _suggestRepeat ? 'T' : 'F'); fprintf(F, "suggestCircular %c\n", _suggestCircular ? 'T' : 'F'); fprintf(F, "numChildren " F_U32 "\n", _childrenLen); // And the reads. 
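The QV adjustment in `dumpLayout()` above is a plain Phred offset shift: quality values are held in memory as small integers, and adding `'!'` (ASCII 33) turns them into printable Sanger/FASTQ characters for the text dump, with the subtraction afterwards undoing it. A sketch of the encoding half (helper name is illustrative):

```cpp
#include <string>

//  Shift raw 0-based quality values into printable Sanger encoding
//  by adding the '!' (ASCII 33) offset.
std::string
encodeSanger(const std::string &rawQV) {
  std::string  enc(rawQV);

  for (char &q : enc)
    q += '!';

  return(enc);
}
```

Raw values 0, 7, and 40 become `'!'`, `'('`, and `'I'` respectively, the familiar FASTQ quality range.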
for (uint32 i=0; i<_childrenLen; i++) { tgPosition *imp = _children + i; trimString[0] = 0; deltaString[0] = 0; if (imp->_askip + imp->_bskip > 0) snprintf(trimString, 128, " trim %6u %6u", imp->_askip, imp->_bskip); if (imp->_deltaLen > 0) snprintf(deltaString, 128, " delta %5u at %u", imp->_deltaLen, imp->_deltaOffset); if (imp->_isRead) fprintf(F, "read %9" F_U32P " anchor %9" F_U32P " hang %7" F_S32P " %7" F_S32P " position %9" F_U32P " %9" F_U32P "%s%s\n", imp->ident(), imp->anchor(), imp->aHang(), imp->bHang(), imp->bgn(), imp->end(), trimString, deltaString); if (imp->_isUnitig) fprintf(F, "unitig %9" F_U32P " anchor %9" F_U32P " hang %7" F_S32P " %7" F_S32P " position %9" F_U32P " %9" F_U32P "%s%s\n", imp->ident(), imp->anchor(), imp->aHang(), imp->bHang(), imp->bgn(), imp->end(), trimString, deltaString); if (imp->_isContig) fprintf(F, "contig %9" F_U32P " anchor %9" F_U32P " hang %7" F_S32P " %7" F_S32P " position %9" F_U32P " %9" F_U32P "%s%s\n", imp->ident(), imp->anchor(), imp->aHang(), imp->bHang(), imp->bgn(), imp->end(), trimString, deltaString); } fprintf(F, "tigend\n"); } bool tgTig::loadLayout(FILE *F) { uint64 LINEnum = 0; uint32 LINElen = 0; uint32 LINEmax = 1 * 1024 * 1024; char *LINE = new char [LINEmax]; uint32 nChildren = 0; clear(); fgets(LINE, LINEmax, F); LINEnum++; if (feof(F)) { delete [] LINE; return(false); } while (!feof(F)) { splitToWords W(LINE); if ((W.numWords() == 0) || (W[0][0] == '#') || (W[0][0] == '!')) { // Comment, ignore. 
} else if (strcmp(W[0], "tig") == 0) { _tigID = strtouint32(W[1]); } else if (strcmp(W[0], "len") == 0) { _layoutLen = strtouint32(W[1]); resizeArray(LINE, LINElen, LINEmax, _layoutLen + 1, resizeArray_doNothing); } else if (((strcmp(W[0], "cns") == 0) || (strcmp(W[0], "qlt") == 0)) && (W.numWords() == 1)) { _gappedLen = 0; } else if (((strcmp(W[0], "cns") == 0) || (strcmp(W[0], "qlt") == 0)) && (W.numWords() == 2)) { _gappedLen = strlen(W[1]); _layoutLen = _gappedLen; // Must be enforced, probably should be an explicit error. resizeArrayPair(_gappedBases, _gappedQuals, 0, _gappedMax, _gappedLen+1, resizeArray_doNothing); if (W[0][0] == 'c') memcpy(_gappedBases, W[1], sizeof(char) * (_gappedLen + 1)); // W[1] is null terminated, and we just copy it in else memcpy(_gappedQuals, W[1], sizeof(char) * (_gappedLen + 1)); } else if (strcmp(W[0], "coverageStat") == 0) { _coverageStat = strtodouble(W[1]); } else if (strcmp(W[0], "microhetProb") == 0) { _microhetProb = strtodouble(W[1]); } else if (strcmp(W[0], "class") == 0) { if (strcmp(W[1], "unassembled") == 0) _class = tgTig_unassembled; else if (strcmp(W[1], "bubble") == 0) _class = tgTig_bubble; else if (strcmp(W[1], "contig") == 0) _class = tgTig_contig; else fprintf(stderr, "tgTig::loadLayout()-- '%s' line " F_U64 " invalid: '%s'\n", W[0], LINEnum, LINE), exit(1); } else if (strcmp(W[0], "suggestRepeat") == 0) { _suggestRepeat = strtouint32(W[1]); } else if (strcmp(W[0], "suggestCircular") == 0) { _suggestCircular = strtouint32(W[1]); } else if (strcmp(W[0], "numChildren") == 0) { //_numChildren = strtouint32(W[1]); //resizeArray(_children, 0, _childrenMax, _childrenLen, resizeArray_doNothing); } else if ((strcmp(W[0], "read") == 0) || (strcmp(W[0], "unitig") == 0) || (strcmp(W[0], "contig") == 0)) { if (W.numWords() < 10) fprintf(stderr, "tgTig::loadLayout()-- '%s' line " F_U64 " invalid: '%s'\n", W[0], LINEnum, LINE), exit(1); if (nChildren >= _childrenLen) { resizeArray(_children, _childrenLen, _childrenMax, 
_childrenLen + 1, resizeArray_copyData); _childrenLen++; } _children[nChildren]._objID = strtouint32(W[1]); _children[nChildren]._isRead = (strcmp(W[0], "read") == 0); _children[nChildren]._isUnitig = (strcmp(W[0], "unitig") == 0); _children[nChildren]._isContig = (strcmp(W[0], "contig") == 0); _children[nChildren]._isReverse = false; _children[nChildren]._spare = 0; _children[nChildren]._anchor = strtouint32(W[3]); _children[nChildren]._ahang = strtouint32(W[5]); _children[nChildren]._bhang = strtouint32(W[6]); _children[nChildren]._askip = 0; _children[nChildren]._bskip = 0; _children[nChildren]._min = strtouint32(W[8]); _children[nChildren]._max = strtouint32(W[9]); _children[nChildren]._deltaOffset = 0; _children[nChildren]._deltaLen = 0; if (_children[nChildren]._max < _children[nChildren]._min) { _children[nChildren]._min = strtouint32(W[9]); _children[nChildren]._max = strtouint32(W[8]); _children[nChildren]._isReverse = true; } for (uint32 pos=10; (pos < W.numWords()); pos++) { if (strcmp(W[pos], "delta") == 0) { _children[nChildren]._deltaLen = strtouint32(W[++pos]); pos++; // "at" _children[nChildren]._deltaOffset = strtouint32(W[++pos]); } if (strcmp(W[pos], "trim") == 0) { _children[nChildren]._askip = strtouint32(W[++pos]); _children[nChildren]._bskip = strtouint32(W[++pos]); } pos++; } nChildren++; } else if (strcmp(W[0], "tigend") == 0) { // All done, get out of the reading loop. break; } else { // LINE is probably munged by splitToWords. fprintf(stderr, "tgTig::loadLayout()-- unknown line '%s'\n", LINE); } fgets(LINE, LINEmax, F); LINEnum++; } delete [] LINE; return(true); } void tgTig::reverseComplement(void) { // Primary data is in _gapped and _children. ::reverseComplement(_gappedBases, _gappedQuals, _gappedLen); // Remove _ungapped and _gappedToUngapped, let it be rebuilt if needed. 
delete [] _ungappedBases; _ungappedBases = NULL; delete [] _ungappedQuals; _ungappedQuals = NULL; _ungappedLen = 0; _ungappedMax = 0; delete [] _gappedToUngapped; _gappedToUngapped = NULL; // _anchor, and the hangs, are now invalid. for (uint32 ii=0; ii<_childrenLen; ii++) { int32 bgn = _gappedLen - _children[ii].bgn(); int32 end = _gappedLen - _children[ii].end(); _children[ii].set(_children[ii].ident(), 0, 0, 0, bgn, end); } // _childDeltas are also invalid. } void tgTig::dumpFASTA(FILE *F, bool useGapped) { AS_UTL_writeFastA(F, bases(useGapped), length(useGapped), 100, ">tig%08u len=" F_U32 " reads=" F_U32 " covStat=%.2f gappedBases=%s class=%s suggestRepeat=%s suggestCircular=%s\n", tigID(), length(useGapped), numberOfChildren(), _coverageStat, (useGapped) ? "yes" : "no", toString(_class), _suggestRepeat ? "yes" : "no", _suggestCircular ? "yes" : "no"); } void tgTig::dumpFASTQ(FILE *F, bool useGapped) { AS_UTL_writeFastQ(F, bases(useGapped), length(useGapped), quals(useGapped), length(useGapped), "@tig%08u len=" F_U32 " reads=" F_U32 " covStat=%.2f gappedBases=%s class=%s suggestRepeat=%s suggestCircular=%s\n", tigID(), length(useGapped), numberOfChildren(), _coverageStat, (useGapped) ? "yes" : "no", toString(_class), _suggestRepeat ? "yes" : "no", _suggestCircular ? "yes" : "no"); } canu-1.6/src/stores/tgTig.H000066400000000000000000000311441314437614700156200ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * Modifications by: * * Brian P. Walenz from 2014-DEC-22 to 2015-AUG-14 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-30 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef TG_TIG_H #define TG_TIG_H #include "AS_global.H" #include "gkStore.H" typedef enum { tgTig_noclass = 0x00, // Could use a tgTig_read for corrections tgTig_unassembled = 0x01, tgTig_bubble = 0x02, tgTig_contig = 0x03 } tgTig_class; static const char * toString(tgTig_class c) { switch (c) { case tgTig_noclass: return("unsetc"); break; case tgTig_unassembled: return("unassm"); break; case tgTig_bubble: return("bubble"); break; case tgTig_contig: return("contig"); break; } return("undefined-class"); } static tgTig_class toClass(char const *c) { if (strcmp(c, "unsetc") == 0) return(tgTig_noclass); if (strcmp(c, "unassm") == 0) return(tgTig_unassembled); if (strcmp(c, "bubble") == 0) return(tgTig_bubble); if (strcmp(c, "contig") == 0) return(tgTig_contig); fprintf(stderr, "WARNING: tiTig_class toClass('%s') is not a valid class.\n", c); return(tgTig_noclass); } // Info about the placement of an object in a tig. For unitigs, this // will be just reads. For contigs, this could be unitigs and reads. // // These coordinates are ALWAYS gapped coordinates. 
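Each child line in the text layout written by tgTig::dumpLayout() and parsed by tgTig::loadLayout() (shown earlier) corresponds to one tgPosition record, declared just below. A small hypothetical layout, with invented IDs and coordinates but a line structure consistent with that parser, looks like this (the second read is reversed, since its begin coordinate is larger than its end):

```
tig 4
len 8
cns ACGT-ACG
qlt !!!!!!!!
coverageStat 1.000000
microhetProb 1.000000
class contig
suggestRepeat 0
suggestCircular 0
numChildren 2
read        12 anchor         0 hang       0       0 position         0         8
read        13 anchor        12 hang       2       3 position         8         2 delta     1 at 4
tigend
```

Note that the `read`/`unitig`/`contig` lines must have at least ten whitespace-separated words, and the optional `trim` and `delta` annotations are appended after the position pair, exactly as loadLayout() scans them.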
// class tgPosition { public: tgPosition(); // Accessors uint32 ident(void) const { return(_objID); }; bool isRead(void) const { return(_isRead == true); }; bool isUnitig(void) const { return(_isUnitig == true); }; bool isContig(void) const { return(_isContig == true); }; uint32 anchor(void) { return(_anchor); }; int32 aHang(void) { return(_ahang); }; int32 bHang(void) { return(_bhang); }; bool isForward(void) const { return(_isReverse == false); }; bool isReverse(void) const { return(_isReverse == true); }; // Position in the parent, both oriented (bgn/end) and unoriented (min/max). int32 bgn(void) const { return((_isReverse == false) ? _min : _max); }; int32 end(void) const { return((_isReverse == false) ? _max : _min); }; int32 min(void) const { return(_min); }; int32 max(void) const { return(_max); }; // Amount of this object to ignore; e.g., trim from the ends. int32 askip(void) const { return(_askip); }; int32 bskip(void) const { return(_bskip); }; // Delta values for the alignment to the parent. uint32 deltaOffset(void) { return(_deltaOffset); }; uint32 deltaLength(void) { return(_deltaLen); }; // Set just the anchor and hangs, leave positions alone. void setAnchor(uint32 anchor, int32 ahang, int32 bhang) { _anchor = anchor; _ahang = ahang; _bhang = bhang; }; // Set everything. This is to be used by unitigger. void set(uint32 id, uint32 anchor, int32 ahang, int32 bhang, int32 bgn, int32 end) { _objID = id; _isRead = true; _isUnitig = false; _isContig = false; _anchor = anchor; _ahang = ahang; _bhang = bhang; _askip = 0; _bskip = 0; if (bgn < end) { _min = bgn; _max = end; _isReverse = false; } else { _min = end; _max = bgn; _isReverse = true; } _deltaOffset = 0; _deltaLen = 0; }; // Set the coords, ignoring orientation. 
void setMinMax(int32 min, int32 max) { _min = min; _max = max; }; // Operators bool operator<(tgPosition const &that) const { int32 al = min(); int32 bl = that.min(); if (al != bl) return(al < bl); if (that._anchor == _objID) // I'm his anchor, I must be before it. return(true); if (_anchor == that._objID) // My anchor is the other tig; I must return(false); // be after it. int32 ah = max(); int32 bh = that.max(); return(ah > bh); } private: public: uint32 _objID; // ID of this object uint32 _isRead : 1; // A full length read alignment uint32 _isUnitig : 1; // uint32 _isContig : 1; // uint32 _isReverse : 1; // Child is oriented forward relative to parent, used during consensus. uint32 _spare : 28; uint32 _anchor; // ID of the like-type object we align to int32 _ahang; // Placement relative to anchor object int32 _bhang; // int32 _askip; // Amount of sequence to not align on each end int32 _bskip; // // Must be signed, utgcns can push reads negative. //int32 _bgn; // Coords in the parent object we are part of //int32 _end; // (for a read, this will be the position in the unitig) // Must be signed, utgcns can push reads negative. int32 _min; int32 _max; uint32 _deltaOffset; // Our delta alignment to the parent object consensus uint32 _deltaLen; }; class tgTig; // Early declaration, for use in tgTigRecord operator= // On-disk tig, same as tgTig without the pointers class tgTigRecord { public: tgTigRecord(); tgTigRecord(tgTig &tg) { *this = tg; }; // to support tgTigRecord tr = tgtig tgTigRecord &operator=(tgTig & tg); private: public: uint32 _tigID; double _coverageStat; double _microhetProb; // placeholder for some future multialignment based uniqueness probability tgTig_class _class : 2; // Output classification: unassembled, bubble, contig uint32 _suggestRepeat : 1; // Bogart made this from detected repeat. uint32 _suggestCircular : 1; // Bogart found overlaps making a circle. 
uint32 _spare : 32 - 2 - 2; uint32 _layoutLen; // Max coord uint32 _gappedLen; // Gapped consensus uint32 _childrenLen; uint32 _childDeltasLen; }; // Former MultiAlignT // In core data class tgTig { public: tgTig(); // All data unallocated, lengths set to zero ~tgTig(); // Releases memory // Accessors uint32 tigID(void) { return(_tigID); }; char const *coordinateType(bool useGapped=true) { return( (consensusExists() == false) ? ("layout") : ((useGapped == true) ? "gapped" : "ungapped") ); }; uint32 length(bool useGapped=true) { return( (consensusExists() == false) ? (layoutLength()) : ((useGapped == true) ? gappedLength() : ungappedLength()) ); }; bool consensusExists(void) { return(_gappedLen > 0); }; char *bases(bool useGapped=true) { return( (useGapped == true) ? gappedBases() : ungappedBases() ); }; char *quals(bool useGapped=true) { return( (useGapped == true) ? gappedQuals() : ungappedQuals() ); }; double computeCoverage(bool useGapped=true); private: uint32 layoutLength(void) { return(_layoutLen); }; uint32 gappedLength(void) { return(_gappedLen); }; char *gappedBases(void) { return(_gappedBases); }; char *gappedQuals(void) { return(_gappedQuals); }; void buildUngapped(void); uint32 ungappedLength(void) { buildUngapped(); return(_ungappedLen); }; char *ungappedBases(void) { buildUngapped(); return(_ungappedBases); }; char *ungappedQuals(void) { buildUngapped(); return(_ungappedQuals); }; public: // This function needs to be hidden. Coordinates in reads need to be transparent. uint32 mapGappedToUngapped(uint32 p) { if (consensusExists() == false) return(p); buildUngapped(); return(_gappedToUngapped[p]); }; uint32 numberOfChildren(void) { return(_childrenLen); }; tgPosition *getChild(uint32 c) { assert(c < _childrenLen); return(_children + c); }; tgPosition *addChild(void) { return(_children + _childrenLen++); }; // Operators void clear(void); // Clears data but doesn't release memory. 
tgTig &operator=(tgTigRecord & tr); tgTig &operator=(tgTig &tg); bool loadFromStreamOrLayout(FILE *F); void saveToStream(FILE *F); bool loadFromStream(FILE *F); void dumpLayout(FILE *F); bool loadLayout(FILE *F); void reverseComplement(void); // Does NOT update childDeltas void dumpFASTA(FILE *F, bool useGapped); void dumpFASTQ(FILE *F, bool useGapped); // There are two multiAlign displays; this one, and one in abMultiAlign. void display(FILE *F, gkStore *gkp, uint32 displayWidth = 100, // Width of display uint32 displaySpacing = 3, // Space between reads on the same line bool withQV = false, bool withDots = false); private: public: uint32 _tigID; // ID in the store, or UINT32_MAX if not set double coverageStat(void) { return(_coverageStat); }; //double microhetProb(void) { return(_microhetProb); }; double _coverageStat; double _microhetProb; // Placeholder. // A variety of flags to suggest what type of unitig this is tgTig_class _class : 2; // Output classification: unassembled, bubble, contig uint32 _suggestRepeat : 1; // Bogart made this from detected repeat. uint32 _suggestCircular : 1; // Bogart found overlaps making a circle. uint32 _spare : 32 - 2 - 2; uint32 _layoutLen; // The max coord in the layout. Same as gappedLen if it exists. char *_gappedBases; // Gapped consensus - used by the multialignment. NUL terminated. char *_gappedQuals; uint32 _gappedLen; // Doesn't include the NUL. uint32 _gappedMax; // Does include the NUL. char *_ungappedBases; // Ungapped consensus - not used by the assember, only output. char *_ungappedQuals; uint32 _ungappedLen; uint32 _ungappedMax; uint32 *_gappedToUngapped; // Map a gapped position to an ungapped posision, only output. 
tgPosition *_children; // positions of objects that make up this tig uint32 _childrenLen; uint32 _childrenMax; int32 *_childDeltas; // deltas for all objects in the _children list uint32 _childDeltasLen; uint32 _childDeltasMax; // Flags for computing consensus/multialignments uint32 _utgcns_verboseLevel; }; #endif canu-1.6/src/stores/tgTigDisplay.C000066400000000000000000000040271314437614700171410ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-JAN-26 to 2015-JUL-01 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-07 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "AS_global.H" #include "gkStore.H" #include "tgStore.H" int main(int argc, char **argv) { tgTig tig; char *gkpName = NULL; char *tigFileName = NULL; argc = AS_configure(argc, argv); int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-t") == 0) { tigFileName = argv[++arg]; } else { err++; } arg++; } if (gkpName == NULL) err++; if (tigFileName == NULL) err++; if (err) { fprintf(stderr, "usage: %s -G gkpStore -t tigFile\n", argv[0]); exit(1); } gkStore *gkpStore = gkStore::gkStore_open(gkpName); FILE *F = fopen(tigFileName, "r"); tig.loadFromStream(F); fclose(F); uint32 displayWidth = 250; uint32 displaySpacing = 10; bool withQV = false; bool withDots = true; tig.display(stdout, gkpStore, displayWidth, displaySpacing, withQV, withDots); exit(0); } canu-1.6/src/stores/tgTigDisplay.mk000066400000000000000000000007311314437614700173640ustar00rootroot00000000000000 # If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := tgTigDisplay SOURCES := tgTigDisplay.C SRC_INCDIRS := .. ../AS_UTL TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES := canu-1.6/src/stores/tgTigMultiAlignDisplay.C000066400000000000000000000300271314437614700211260ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. 
* * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_CNS/MultiAlignPrint.C * src/AS_CNS/MultiAlignPrint.c * * Modifications by: * * Brian P. Walenz from 2007-SEP-05 to 2013-AUG-01 * are Copyright 2007-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-NOV-21 to 2015-JUN-03 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-03 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "AS_global.H" #include "gkStore.H" #include "tgStore.H" #include "AS_UTL_reverseComplement.H" #include class LaneNode { public: LaneNode() { read = NULL; readLen = 0; bases = NULL; quals = NULL; delta = NULL; next = NULL; }; ~LaneNode() { delete [] bases; delete [] quals; }; tgPosition *read; int32 readLen; char *bases; // Allocated bases char *quals; // Allocated quals int32 *delta; // Pointer to tgTig _childDeltas LaneNode *next; }; class Lane { public: Lane() { first = NULL; last = NULL; lastcol = 0; }; ~Lane() { LaneNode *node = first; LaneNode *next = NULL; while (node) { next = node->next; delete node; node = next; } }; // Add a node to this lane, if possible. 
bool addNode(LaneNode *node, uint32 displaySpacing) { int32 leftpos = node->read->min(); if ((lastcol > 0) && (leftpos < lastcol + displaySpacing)) return(false); assert(node->next == NULL); if (first == NULL) { first = node; last = node; } else { last->next = node; last = node; } lastcol = leftpos + node->read->deltaLength() + node->readLen; return(true); }; LaneNode *first; LaneNode *last; int32 lastcol; }; void tgTig::display(FILE *F, gkStore *gkp, uint32 displayWidth, uint32 displaySpacing, bool withQV, bool withDots) { if (gappedLength() == 0) { fprintf(F, "No MultiAlignment to print for tig %d -- no consensus sequence present.\n", tigID()); return; } fprintf(stderr, "tgTig::display()-- display tig %d with %d children\n", tigID(), _childrenLen); fprintf(stderr, "tgTig::display()-- width %u spacing %u\n", displayWidth, displaySpacing); // // Convert the children to a list of lines to print // Lane *lane = NULL; Lane *lanes = new Lane [_childrenLen]; int32 lanesLen = 0; int32 lanesPos = 0; // Sort the fragments by leftmost position within tig std::sort(_children, _children + _childrenLen); // Load into lanes. gkReadData *readData = new gkReadData; for (int32 i=0; i<_childrenLen; i++) { gkRead *read = gkp->gkStore_getRead(_children[i].ident()); // Too many reads in this code. 
gkp->gkStore_loadReadData(read, readData); LaneNode *node = new LaneNode(); node->read = _children + i; node->readLen = read->gkRead_sequenceLength(); node->bases = new char [node->readLen + 1]; node->quals = new char [node->readLen + 1]; node->delta = _childDeltas + node->read->deltaOffset(); memcpy(node->bases, readData->gkReadData_getSequence(), sizeof(char) * node->readLen); memcpy(node->quals, readData->gkReadData_getQualities(), sizeof(char) * node->readLen); node->bases[node->readLen] = 0; node->quals[node->readLen] = 0; for (uint32 ii=0; iireadLen; ii++) // Adjust QVs for display node->quals[ii] += '!'; if (node->read->isReverse()) ::reverseComplement(node->bases, node->quals, node->readLen); //fprintf(stderr, "NODE READ %u at %d %d tigLength %d with %u deltas at offset %u\n", // i, node->read->bgn(), node->read->end(), gappedLength(), node->read->deltaLength(), node->read->deltaOffset()); // Try to add this new node to the lanes. The last iteration will always succeed, adding the // node to a fresh empty lane. for (lanesPos=0; lanesPos <= lanesLen; lanesPos++) if (lanes[lanesPos].addNode(node, displaySpacing)) break; assert(lanesPos <= lanesLen); // If it is added to the last one, increment our cnsLen. if (lanesPos == lanesLen) lanesLen++; } delete readData; // Allocate space. char **multia = new char * [2 * lanesLen]; for (int32 i=0; i<2*lanesLen; i++) { multia[i] = new char [gappedLength() + displayWidth + 1]; memset(multia[i], ' ', gappedLength() + displayWidth); multia[i][gappedLength() + displayWidth] = 0; } int32 **idarray = new int32 * [lanesLen]; int32 **oriarray = new int32 * [lanesLen]; // Not sure where the +1 comes from. The original was using GetNumchars(ma->consensus) which (I // think) included the terminating NUL, while gappedLength() does not. for (int32 i=0; inext) { int32 firstcol = (node->read->bgn() < node->read->end()) ? node->read->bgn() : node->read->end(); int32 lastcol = (node->read->bgn() < node->read->end()) ? 
node->read->end() : node->read->bgn(); int32 orient = (node->read->bgn() < node->read->end()) ? 1 : -1; if (lastcol > gappedLength()) fprintf(stderr, "lastcol too big: %d vs %d\n", lastcol, gappedLength()); assert(firstcol <= gappedLength()); assert(lastcol <= gappedLength()); // Set ID and orientation //fprintf(stderr, "READ %d minmax %d %d bgnend %d %d orient %d\n", // node->read->ident(), node->read->min(), node->read->max(), node->read->bgn(), node->read->end(), orient); for (int32 col=firstcol; colread->ident(); oriarray[i][col] = orient; } // Set bases int32 col = firstcol; int32 cols = 0; for (int32 j=0; jread->deltaLength(); j++) { int32 seglen = node->delta[j] - ((j > 0) ? node->delta[j-1] : 0); if (cols + seglen >= node->readLen) fprintf(stderr, "ERROR: Clear ranges not correct.\n"); assert(cols + seglen < node->readLen); //fprintf(stderr, "copy to col=%d from cols=%d seglen=%d -- max %d -- delta %d\n", // col, cols, seglen, gappedLength() + displayWidth, node->delta[j]); memcpy(srow + col, node->bases + cols, seglen); memcpy(qrow + col, node->quals + cols, seglen); col += seglen; srow[col] = '-'; qrow[col] = '-'; col++; cols += seglen; } memcpy(srow + col, node->bases + cols, node->readLen - cols); memcpy(qrow + col, node->quals + cols, node->readLen - cols); } } // Cleanup delete [] lanes; // // // fprintf(F, "<<< begin Contig %d >>>", tigID()); uint32 lruler = displayWidth + 200; char *gruler = new char [lruler]; char *uruler = new char [lruler]; int32 ungapped = 1; int32 tick = 1; // The original used 'length = strlen(consensus)' which does NOT include the terminating NUL. for (uint32 window=0; window < gappedLength(); ) { uint32 row_id = 0; uint32 orient = 0; uint32 rowlen = (window + displayWidth < gappedLength()) ? 
displayWidth : gappedLength() - window; fprintf(F, "\n"); fprintf(F, "\n"); fprintf(F, "<<< tig %d, gapped length: %d >>>\n", tigID(), gappedLength()); { memset(gruler, 0, displayWidth + 200); memset(uruler, 0, displayWidth + 200); for (uint32 rowind=0; rowind= 0) && (gruler[i] == ' '); i--) gruler[i] = 0; for (int32 i=displayWidth-1; (i >= 0) && (uruler[i] == ' '); i--) uruler[i] = 0; fprintf(F, "%s\n", gruler); fprintf(F, "%s\n", uruler); } { char save = _gappedBases[window + rowlen]; _gappedBases[window + rowlen] = 0; fprintf(F, "%s cns (iid) type\n", _gappedBases + window); _gappedBases[window + rowlen] = save; } { for (uint32 ii=0; ii gappedLength()) break; if (multia[2*i][window+j] == _gappedBases[window+j]) { if (withDots) { multia[2*i] [window+j] = '.'; multia[2*i+1][window+j] = ' '; } else { multia[2*i][window+j] = tolower(multia[2*i][window+j]); } } if (multia[2*i][window+j] != ' ') nonBlank++; if (idarray[i][window + j] > 0) { row_id = idarray[i][window + j]; orient = oriarray[i][window + j]; } } if (nonBlank == 0) continue; // Figure out the ID and orientation for this block { char save = multia[2*i][window + displayWidth]; multia[2*i][window + displayWidth] = 0; fprintf(F, "%s %c (%d)\n", multia[2*i] + window, (orient>0)?'>':'<', row_id); multia[2*i][window + displayWidth] = save; } if (withQV) { char save = multia[2*i+1][window + displayWidth]; multia[2*i+1][window + displayWidth] = 0; fprintf(F, "%s\n", multia[2*i+1] + window); multia[2*i+1][window + displayWidth] = save; } } window += displayWidth; } fprintf(F, "\n<<< end Contig %d >>>\n", tigID()); delete [] uruler; delete [] gruler; for (uint32 i=0; i < 2 * lanesLen; i++) delete [] multia[i]; delete [] multia; for (uint32 i=0; i < lanesLen; i++) { delete [] idarray[i]; delete [] oriarray[i]; } delete [] idarray; delete [] oriarray; //fprintf(stderr, "Return.\n"); } canu-1.6/src/stores/tgTigSizeAnalysis.C000066400000000000000000000077251314437614700201620ustar00rootroot00000000000000 
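The lane-packing step at the top of tgTig::display() can be sketched in isolation. The following is a simplified, self-contained version under assumed names (ReadSpan and packLanes are illustrative, not part of the canu API), and it tracks each read's max coordinate rather than the delta-adjusted last column the real code uses: reads are sorted by leftmost position, then each read goes into the first lane whose previous occupant ends at least `spacing` columns earlier, opening a new lane only when no existing lane fits.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Minimal stand-in for tgPosition: only the fields the packing needs.
struct ReadSpan {
  uint32_t id;
  int32_t  bgn;   // min coordinate in the tig
  int32_t  end;   // max coordinate in the tig
};

// Greedy lane assignment, as in tgTig::display(): first-fit after sorting
// by leftmost position.
std::vector<std::vector<ReadSpan>>
packLanes(std::vector<ReadSpan> reads, int32_t spacing) {
  std::sort(reads.begin(), reads.end(),
            [](ReadSpan const &a, ReadSpan const &b) { return a.bgn < b.bgn; });

  std::vector<std::vector<ReadSpan>> lanes;
  std::vector<int32_t>               lastcol;   // rightmost used column per lane

  for (ReadSpan const &r : reads) {
    size_t l = 0;
    while ((l < lanes.size()) && (r.bgn < lastcol[l] + spacing))
      l++;                                      // lane too crowded, try the next

    if (l == lanes.size()) {                    // no lane fits; open a new one
      lanes.push_back({});
      lastcol.push_back(0);
    }

    lanes[l].push_back(r);
    lastcol[l] = r.end;
  }

  return lanes;
}
```

The first-fit scan always terminates by falling into a fresh empty lane, which mirrors the comment in the original ("The last iteration will always succeed, adding the node to a fresh empty lane").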
/****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_CNS/MultiAlignSizeAnalysis.C * * Modifications by: * * Brian P. Walenz from 2012-MAR-26 to 2013-OCT-24 * are Copyright 2012-2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz on 2014-DEC-22 * are Copyright 2014 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-30 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "tgTigSizeAnalysis.H" #include #include #include #include #include using namespace std; tgTigSizeAnalysis::tgTigSizeAnalysis(uint64 genomeSize_) { genomeSize = genomeSize_; } tgTigSizeAnalysis::~tgTigSizeAnalysis() { } void tgTigSizeAnalysis::evaluateTig(tgTig *tig, bool useGapped) { // Try to get the ungapped length. // But revert to the gapped length if that doesn't exist. This should // only occur for pre-consensus unitigs. 
uint32 length = tig->length(useGapped); if (tig->_suggestRepeat) lenSuggestRepeat.push_back(length); if (tig->_suggestCircular) lenSuggestCircular.push_back(length); switch (tig->_class) { case tgTig_unassembled: lenUnassembled.push_back(length); break; case tgTig_bubble: lenBubble.push_back(length); break; case tgTig_contig: lenContig.push_back(length); break; default: break; } } void tgTigSizeAnalysis::finalize(void) { sort(lenSuggestRepeat.begin(), lenSuggestRepeat.end(), greater()); sort(lenSuggestCircular.begin(), lenSuggestCircular.end(), greater()); sort(lenUnassembled.begin(), lenUnassembled.end(), greater()); sort(lenBubble.begin(), lenBubble.end(), greater()); sort(lenContig.begin(), lenContig.end(), greater()); } void tgTigSizeAnalysis::printSummary(FILE *out, char *description, vector &data) { uint64 cnt = data.size(); uint64 sum = 0; uint64 tot = 0; uint64 nnn = 10; uint64 siz = 0; // Duplicates AS_BAT_Instrumentation.C reportN50(). if (cnt == 0) return; for (uint64 i=0; i 0) siz = genomeSize; else siz = tot; for (uint64 i=0; i using namespace std; class tgTigSizeAnalysis { public: tgTigSizeAnalysis(uint64 genomeSize); ~tgTigSizeAnalysis(); void evaluateTig(tgTig *tig, bool useGapped=true); void finalize(void); void printSummary(FILE *out, char *description, vector &data); void printSummary(FILE *out); private: uint64 genomeSize; vector lenSuggestRepeat; vector lenSuggestCircular; vector lenUnassembled; vector lenBubble; vector lenContig; }; #endif // TGTIGSIZEANALYSIS canu-1.6/src/utgcns/000077500000000000000000000000001314437614700144125ustar00rootroot00000000000000canu-1.6/src/utgcns/libNDalign/000077500000000000000000000000001314437614700164155ustar00rootroot00000000000000canu-1.6/src/utgcns/libNDalign/Binomial_Bound.C000066400000000000000000000053671314437614700214150ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles 
whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2015-OCT-12 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "Binomial_Bound.H" #include "gkStore.H" #undef COMPUTE_IN_LOG_SPACE // Determined by EDIT_DIST_PROB_BOUND #define NORMAL_DISTRIB_THOLD 3.62 // Probability limit to "band" edit-distance calculation // Determines NORMAL_DISTRIB_THOLD #define EDIT_DIST_PROB_BOUND 1e-4 // Return the smallest n >= Start s.t. 
// prob [>= e errors in n binomial trials (p = error prob)] > EDIT_DIST_PROB_BOUND // int Binomial_Bound(int e, double p, int Start) { double Normal_Z, Mu_Power, Factorial, Poisson_Coeff; double q, Sum, P_Power, Q_Power, X; int k, n, Bin_Coeff, Ct; q = 1.0 - p; if (Start < e) Start = e; for (n = Start; n < AS_MAX_READLEN; n ++) { if (n <= 35) { Sum = 0.0; Bin_Coeff = 1; Ct = 0; P_Power = 1.0; Q_Power = pow (q, n); for (k = 0; k < e && 1.0 - Sum > EDIT_DIST_PROB_BOUND; k ++) { X = Bin_Coeff * P_Power * Q_Power; Sum += X; Bin_Coeff *= n - Ct; Bin_Coeff /= ++ Ct; P_Power *= p; Q_Power /= q; } if (1.0 - Sum > EDIT_DIST_PROB_BOUND) return(n); } else { Normal_Z = (e - 0.5 - n * p) / sqrt (n * p * q); if (Normal_Z <= NORMAL_DISTRIB_THOLD) return n; #ifndef COMPUTE_IN_LOG_SPACE Sum = 0.0; Mu_Power = 1.0; Factorial = 1.0; Poisson_Coeff = exp (- n * p); for (k = 0; k < e; k ++) { Sum += Mu_Power * Poisson_Coeff / Factorial; Mu_Power *= n * p; Factorial *= k + 1; } #else Sum = 0.0; Mu_Power = 0.0; Factorial = 0.0; Poisson_Coeff = - n * p; for (k = 0; k < e; k ++) { Sum += exp(Mu_Power + Poisson_Coeff - Factorial); Mu_Power += log(n * p); Factorial = lgamma(k + 1); } #endif if (1.0 - Sum > EDIT_DIST_PROB_BOUND) return(n); } } return(AS_MAX_READLEN); } canu-1.6/src/utgcns/libNDalign/Binomial_Bound.H000066400000000000000000000015621314437614700214130ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * Modifications by: * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef BINOMIAL_BOUND_H #define BINOMIAL_BOUND_H #include "AS_global.H" int Binomial_Bound(int e, double p, int Start); #endif canu-1.6/src/utgcns/libNDalign/NDalgorithm-allocateMoreSpace.C000066400000000000000000000057511314437614700243220ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/utgcns/libNDalign/prefixEditDistance-allocateMoreSpace.C * * Modifications by: * * Brian P. Walenz from 2015-JUL-20 to 2015-AUG-05 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "NDalgorithm.H" // Allocate another block of 64mb for edits // Needs to be at least: // 52,432 to handle 40% error at 64k overlap // 104,860 to handle 80% error at 64k overlap // 209,718 to handle 40% error at 256k overlap // 419,434 to handle 80% error at 256k overlap // 3,355,446 to handle 40% error at 4m overlap // 6,710,890 to handle 80% error at 4m overlap // Bigger means we can assign more than one Edit_Array[] in one allocation. 
uint32 EDIT_SPACE_SIZE = 1 * 1024 * 1024; bool NDalgorithm::allocateMoreEditSpace(void) { // Determine the last allocated block, and the last assigned block int32 b = 0; // Last edit array assigned int32 e = 0; // Last edit array assigned more space int32 a = 0; // Last allocated block while (Edit_Array_Lazy[b] != NULL) b++; while (Edit_Space_Lazy[a] != NULL) a++; // Fill in the edit space array. Well, not quite yet. First, decide the minimum size. // // Element [0] can access from [-2] to [2] = 5 elements. // Element [1] can access from [-3] to [3] = 7 elements. // // Element [e] can access from [-2-e] to [2+e] = 5 + e * 2 elements // // So, our offset for this new block needs to put [e][0] at offset... int32 Offset = 2 + b; int32 Del = 6 + b * 2; int32 Size = EDIT_SPACE_SIZE; while (Size < Offset + Del) Size *= 2; // Allocate another block Edit_Space_Lazy[a] = new pedEdit [Size]; // And, now, fill in the edit space array. e = b; while ((Offset + Del < Size) && (e < Edit_Space_Max)) { Edit_Array_Lazy[e++] = Edit_Space_Lazy[a] + Offset; Offset += Del; Del += 2; } if (e == b) { fprintf(stderr, "Allocate_More_Edit_Space()-- ERROR: couldn't allocate enough space for even one more entry! e=%d\n", e); return(false); } assert(e != b); return(true); //fprintf(stderr, "WorkArea %d allocates space %d of size %d for array %d through %d\n", thread_id, a, Size, b, e-1); } canu-1.6/src/utgcns/libNDalign/NDalgorithm-extend.C000066400000000000000000000134751314437614700222300ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. 
* * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/utgcns/libNDalign/prefixEditDistance-extend.C * * Modifications by: * * Brian P. Walenz from 2015-JUL-20 to 2015-JUL-28 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-13 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "NDalgorithm.H" #undef DEBUG // See how far the exact match in Match extends. The match // refers to strings S and T with lengths S_Len and T_Len , // respectively. Set S_Lo , S_Hi , T_Lo and T_Hi to the // regions within S and T to which the match extends. // Return the type of overlap: NONE if doesn't extend to // the end of either fragment; LEFT_BRANCH_PT // or RIGHT_BRANCH_PT if match extends to the end of just one fragment; // or DOVETAIL if it extends to the end of both fragments, i.e., // it is a complete overlap. // Set Errors to the number of errors in the alignment if it is // a DOVETAIL overlap. 
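The four outcomes described above combine the results of the leftward and rightward extensions. A standalone sketch of that classification (the Sketch* names, and which side is labeled the branch, are assumptions for illustration, not the exact logic of Extend_Alignment):

```cpp
// Sketch: classify an overlap from whether the leftward and rightward
// extensions each reached the end of a fragment.  Reaching neither end is a
// branch point on both sides; reaching both ends is a complete (dovetail)
// overlap.
enum SketchOverlapType {
  sketchBothBranch,   // neither extension reached a fragment end
  sketchLeftBranch,   // the leftward extension stopped at a branch point
  sketchRightBranch,  // the rightward extension stopped at a branch point
  sketchDovetail      // both extensions reached an end: complete overlap
};

SketchOverlapType
classifyOverlap(bool lMatchToEnd, bool rMatchToEnd) {
  if (lMatchToEnd && rMatchToEnd)    return sketchDovetail;
  if (!lMatchToEnd && !rMatchToEnd)  return sketchBothBranch;
  return lMatchToEnd ? sketchRightBranch : sketchLeftBranch;
}
```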
pedOverlapType NDalgorithm::Extend_Alignment(Match_Node_t *Match, char *S, int32 S_Len, char *T, int32 T_Len, int32 &S_Lo, int32 &S_Hi, int32 &T_Lo, int32 &T_Hi) { int32 Leftover = 0; bool rMatchToEnd = true; bool lMatchToEnd = true; int32 S_Left_Begin = Match->Start - 1; int32 S_Right_Begin = Match->Start + Match->Len; int32 S_Right_Len = S_Len - S_Right_Begin; int32 T_Left_Begin = Match->Offset - 1; int32 T_Right_Begin = Match->Offset + Match->Len; int32 T_Right_Len = T_Len - T_Right_Begin; int32 Total_Olap = (min(Match->Start, Match->Offset) + Match->Len + min(S_Right_Len, T_Right_Len)); #ifdef DEBUG fprintf(stderr, "NDalgorithm::Extend_Alignment()-- S: %d-%d and %d-%d T: %d-%d and %d-%d\n", 0, S_Left_Begin, S_Right_Begin, S_Right_Begin + S_Right_Len, 0, T_Left_Begin, T_Right_Begin, T_Right_Begin + T_Right_Len); #endif Right_Score = 0; Right_Delta_Len = 0; Left_Score = 0; Left_Delta_Len = 0; bool invertLeftDeltas = false; bool invertRightDeltas = false; if ((S_Right_Len == 0) || (T_Right_Len == 0)) { S_Hi = 0; T_Hi = 0; rMatchToEnd = true; } else if (S_Right_Len <= T_Right_Len) { #ifdef DEBUG fprintf(stderr, "NDalgorithm::Extend_Alignment()-- FORWARD S T\n"); #endif forward(S + S_Right_Begin, S_Right_Len, T + T_Right_Begin, T_Right_Len, S_Hi, T_Hi, rMatchToEnd); for (int32 i=0; i 0) { if (Right_Delta[0] > 0) Left_Delta[Left_Delta_Len++] = -(Right_Delta[0] + Leftover + Match->Len); else Left_Delta[Left_Delta_Len++] = -(Right_Delta[0] - Leftover - Match->Len); } // WHY?! Does this mean the invesion on the forward() calls is backwards? // But note interaction with the if test just above here! for (int32 i=1; i0; k--) { assert(Edit_Array_Lazy[k] != NULL); // Analyze cells at errors = k-1 for the maximum -- no analysis needed, since we stored this cell as fromd. 
int32 from = Edit_Array_Lazy[k][d].fromd; int32 lasts = Edit_Array_Lazy[k][d].score; //Edit_Array_Lazy[k-1][from].display(k-1, from); //fprintf(stderr, "NDalgorithm::Set_Right_Delta()-- k-1=%5d d=%5d from=%5d lastr=%5d - r=%5d s=%5d d=%5d\n", // k-1, d, from, lastr, // Edit_Array_Lazy[k-1][from].row, Edit_Array_Lazy[k-1][from].score, Edit_Array_Lazy[k-1][from].fromd); if (from == d - 1) { Delta_Stack[Right_Delta_Len++] = Edit_Array_Lazy[k-1][d-1].row - lastr - 1; d--; lastr = Edit_Array_Lazy[k-1][from].row; } else if (from == d + 1) { Delta_Stack[Right_Delta_Len++] = lastr - Edit_Array_Lazy[k-1][d+1].row; d++; lastr = Edit_Array_Lazy[k-1][from].row; } else { //fprintf(stderr, "LeftDelta: mismatch at %d max=%d last=%d\n", maxr - lastr - 1, maxr, lastr); } } Delta_Stack[Right_Delta_Len++] = lastr + 1; for (int32 k=0, i=Right_Delta_Len-1; i>0; i--) Right_Delta[k++] = abs(Delta_Stack[i]) * Sign(Delta_Stack[i-1]); Right_Delta_Len--; } // Return the minimum number of changes (inserts, deletes, replacements) // needed to match string A[0 .. (m-1)] with a prefix of string // T[0 .. (n-1)] if it's not more than Error_Limit . // If no match, return the number of errors for the best match // up to a branch point. // Put delta description of alignment in Right_Delta and set // Right_Delta_Len to the number of entries there if it's a complete // match. // Set A_End and T_End to the rightmost positions where the // alignment ended in A and T , respectively. // Set Match_To_End true if the match extended to the end // of at least one string; otherwise, set it false to indicate // a branch point. void NDalgorithm::forward(char *A, int32 Alen, char *T, int32 Tlen, int32 &A_End, int32 &T_End, bool &Match_To_End) { assert (Alen <= Tlen); int32 Best_d = 0; int32 Best_e = 0; int32 Best_row = 0; int32 Best_score = 0; int32 Row = 0; int32 Dst = 0; int32 Err = 0; int32 Sco = 0; int32 fromd = 0; // Skip ahead over matches. The original used to also skip if either sequence was N. 
while ((Row < Alen) && (isMatch(A[Row], T[Row]))) { Sco += matchScore(A[Row], T[Row]); Row++; } if (Edit_Array_Lazy[0] == NULL) allocateMoreEditSpace(); Edit_Array_Lazy[0][0].row = Row; Edit_Array_Lazy[0][0].dist = Dst; Edit_Array_Lazy[0][0].errs = 0; Edit_Array_Lazy[0][0].score = Sco; Edit_Array_Lazy[0][0].fromd = INT32_MAX; // Exact match? if (Row == Alen) { A_End = Alen; T_End = Alen; Match_To_End = true; Right_Score = Sco; Right_Delta_Len = 0; return; } int32 Left = 0; int32 Right = 0; int32 Max_Score = PEDMINSCORE; int32 Max_Score_Len = 0; int32 Max_Score_Best_d = 0; int32 Max_Score_Best_e = 0; for (int32 ei=1; ei <= Edit_Space_Max; ei++) { if (Edit_Array_Lazy[ei] == NULL) if (allocateMoreEditSpace() == false) { // FAIL return; } Left = MAX (Left - 1, -ei); Right = MIN (Right + 1, ei); //fprintf(stderr, "FORWARD ei=%d Left=%d Right=%d\n", ei, Left, Right); Edit_Array_Lazy[ei-1][Left - 1].init(); Edit_Array_Lazy[ei-1][Left ].init(); // Of note, [0][0] on the first iteration is not reset here. Edit_Array_Lazy[ei-1][Right ].init(); Edit_Array_Lazy[ei-1][Right + 1].init(); for (int32 d = Left; d <= Right; d++) { // A mismatch. { int32 aPos = (1 + Edit_Array_Lazy[ei-1][d].row) - 1; // -1 because we need to compare the base we are at, int32 tPos = (1 + Edit_Array_Lazy[ei-1][d].row) + d - 1; // not the base we will be at after the mismatch Row = 1 + Edit_Array_Lazy[ei-1][d].row; Dst = Edit_Array_Lazy[ei-1][d].dist + 1; Err = Edit_Array_Lazy[ei-1][d].errs + 1; fromd = d; // If positive, we have a pointer into valid sequence. If not, this mismatch // doesn't make sense, and the row/score are set to bogus values. if ((aPos >= 0) && (tPos >= 0)) { assert (aPos <= Alen); assert( tPos <= Tlen); assert(A[aPos] != T[tPos]); Sco = Edit_Array_Lazy[ei-1][d].score + mismatchScore(A[aPos], T[tPos]); } else { Sco = PEDMINSCORE; } } // Insert a gap in A. Check the other sequence to see if this is a zero-cost gap. 
Note // agreement with future value of Row and what is used in isMatch() below. { int32 tPos = 0 + Edit_Array_Lazy[ei-1][d-1].row + d; //assert(tPos >= 0); //assert(tPos < Tlen); if ((tPos >= 0) && (tPos <= Tlen)) { int32 gapCost = isFreeGap( T[tPos] ) ? PEDFREEGAP : PEDGAP; //if (gapCost == 0) // fprintf(stderr, "NDalgorithm::forward()-- free A gap for aPos=%d tPos=%d t=%c/%d\n", tPos - d, tPos, T[tPos], T[tPos]); if (Edit_Array_Lazy[ei-1][d-1].score + gapCost > Sco) { Row = Edit_Array_Lazy[ei-1][d-1].row; Dst = Edit_Array_Lazy[ei-1][d-1].dist + (gapCost == PEDFREEGAP) ? 0 : 0; Err = Edit_Array_Lazy[ei-1][d-1].errs + (gapCost == PEDFREEGAP) ? 0 : 0; Sco = Edit_Array_Lazy[ei-1][d-1].score + gapCost; fromd = d-1; } } } // Insert a gap in T. // Testcase test-st-ts shows this works. { int32 aPos = 1 + Edit_Array_Lazy[ei-1][d+1].row; //assert(aPos >= 0); //assert(aPos < Tlen); if ((aPos >= 0) && (aPos <= Alen)) { int32 gapCost = isFreeGap( A[aPos] ) ? 0 : PEDGAP; //if (gapCost == 0) // fprintf(stderr, "NDalgorithm::forward()-- free T gap for aPos=%d tPos=%d a=%c/%d\n", aPos, aPos + d, A[aPos], A[aPos]); if (Edit_Array_Lazy[ei-1][d+1].score + gapCost > Sco) { Row = 1 + Edit_Array_Lazy[ei-1][d+1].row; Dst = Edit_Array_Lazy[ei-1][d+1].dist + (gapCost == PEDFREEGAP) ? 0 : 1; Err = Edit_Array_Lazy[ei-1][d+1].errs + (gapCost == PEDFREEGAP) ? 0 : 1; Sco = Edit_Array_Lazy[ei-1][d+1].score + gapCost; fromd = d+1; } } } // If A or B is N, that isn't a mismatch. // If A is lowercase and T is uppercase, it's a match. 
// If A is lowercase and T doesn't match, ignore the cost of the gap in B while ((Row < Alen) && (Row + d < Tlen) && (isMatch(A[Row], T[Row + d]))) { Sco += matchScore(A[Row], T[Row + d]); Row += 1; Dst += 1; Err += 0; } Edit_Array_Lazy[ei][d].row = Row; Edit_Array_Lazy[ei][d].dist = Dst; Edit_Array_Lazy[ei][d].errs = Err; Edit_Array_Lazy[ei][d].score = Sco; Edit_Array_Lazy[ei][d].fromd = fromd; //fprintf(stderr, "SET ei=%d d=%d -- row=%d dist=%d errs=%d score=%d fromd=%d\n", ei, d, Row, Dst, Err, Sco, fromd); if (Row == Alen || Row + d == Tlen) { A_End = Row; // One past last align position T_End = Row + d; Set_Right_Delta(ei, d); Match_To_End = true; return; //return(ei); } } // Over all diagonals. // Reset the band // // The .dist used to be .row. while ((Left <= Right) && (Left < 0) && (Edit_Array_Lazy[ei][Left].dist < Edit_Match_Limit[ Edit_Array_Lazy[ei][Left].errs ])) Left++; if (Left >= 0) while ((Left <= Right) && (Edit_Array_Lazy[ei][Left].dist + Left < Edit_Match_Limit[ Edit_Array_Lazy[ei][Left].errs ])) Left++; if (Left > Right) break; while ((Right > 0) && (Edit_Array_Lazy[ei][Right].dist + Right < Edit_Match_Limit[ Edit_Array_Lazy[ei][Right].errs ])) Right--; if (Right <= 0) while (Edit_Array_Lazy[ei][Right].dist < Edit_Match_Limit[ Edit_Array_Lazy[ei][Right].errs ]) Right--; assert (Left <= Right); for (int32 d = Left; d <= Right; d++) if (Edit_Array_Lazy[ei][d].score > Best_score) { Best_d = d; Best_e = ei; Best_row = Edit_Array_Lazy[ei][d].row; Best_score = Edit_Array_Lazy[ei][d].score; } if (Best_score > Max_Score) { Max_Score_Best_d = Best_d; Max_Score_Best_e = Best_e; Max_Score = Best_score; Max_Score_Len = Best_row; } } // Over all possible number of errors //fprintf(stderr, "NDalgorithm::forward()- iterated over all errors, return best found\n"); A_End = Max_Score_Len; T_End = Max_Score_Len + Max_Score_Best_d; Set_Right_Delta(Max_Score_Best_e, Max_Score_Best_d); Match_To_End = false; return; //return(Max_Score_Best_e); } 
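forward() above is a scoring variant of the classic greedy-diagonal (Ukkonen/Myers O(ND)-style) edit distance: for each error count and diagonal, keep the furthest row reached, slide over matches for free, and derive each row at error e from the three neighbouring diagonals at error e-1. A minimal self-contained sketch of that scheme, computing plain unit-cost edit distance — no scoring, banding, or free gaps, so editDistanceDiag is illustrative rather than canu's algorithm:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Sketch: diagonal-based edit distance.  prev[off + d] holds the furthest
// row in A reached on diagonal d (column in B = row + d) using exactly e-1
// edits; each round derives the rows reachable with e edits and slides down
// matching characters for free.
int editDistanceDiag(const std::string &A, const std::string &B) {
  const int n    = (int)A.size();
  const int m    = (int)B.size();
  const int maxe = n + m;
  const int off  = maxe + 1;                    // index of diagonal 0

  std::vector<int> prev(2 * maxe + 3, -2), cur(2 * maxe + 3, -2);

  int row = 0;                                  // e == 0: slide on diagonal 0
  while (row < n && row < m && A[row] == B[row])
    row++;
  if (row == n && row == m)
    return 0;
  prev[off] = row;

  for (int e = 1; e <= maxe; e++) {
    for (int d = -e; d <= e; d++) {
      int r = -2;
      if (prev[off + d] >= 0 && prev[off + d] < n && prev[off + d] + d < m)
        r = std::max(r, prev[off + d] + 1);     // substitution
      if (prev[off + d - 1] >= 0 && prev[off + d - 1] + d <= m)
        r = std::max(r, prev[off + d - 1]);     // gap in A: consume only B
      if (prev[off + d + 1] >= 0 && prev[off + d + 1] < n)
        r = std::max(r, prev[off + d + 1] + 1); // gap in B: consume only A
      if (r < 0)
        continue;
      while (r < n && r + d < m && A[r] == B[r + d])
        r++;                                    // free slide over matches
      if (r == n && r + d == m)
        return e;                               // both strings consumed
      cur[off + d] = r;
    }
    std::swap(prev, cur);
    std::fill(cur.begin(), cur.end(), -2);
  }
  return maxe;
}
```

forward() adds to this skeleton a score per cell, lazy allocation of each error row, band trimming via Edit_Match_Limit, and the lowercase free-gap rules described above.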
canu-1.6/src/utgcns/libNDalign/NDalgorithm-reverse.C000066400000000000000000000261671314437614700224160ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/utgcns/libNDalign/prefixEditDistance-reverse.C * * Modifications by: * * Brian P. Walenz from 2015-JUL-20 to 2015-AUG-05 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-13 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "NDalgorithm.H" // Put the delta encoding of the alignment represented in Edit_Array // starting at row e (which is the number of errors) and column d // (which is the diagonal) and working back to the start, into // Left_Delta. Set Left_Delta_Len to the number of // delta entries and set leftover to the number of // characters that match after the last Left_Delta entry. // Don't allow the first delta to be an indel if it can be // converted to a substitution by adjusting t_end which // is where the alignment ends in the T string, which has length // t_len . 
// void NDalgorithm::Set_Left_Delta(int32 e, int32 d, int32 &leftover) { Left_Score = Edit_Array_Lazy[e][d].score; Left_Delta_Len = 0; int32 lastr = Edit_Array_Lazy[e][d].row; //fprintf(stderr, "NDalgorithm::Set_Left_Delta()-- e =%5d d=%5d lastr=%5d fromd=%5d\n", // e, d, lastr); for (int32 k=e; k>0; k--) { assert(Edit_Array_Lazy[k] != NULL); int32 from = Edit_Array_Lazy[k][d].fromd; int32 lasts = Edit_Array_Lazy[k][d].score; //Edit_Array_Lazy[k-1][from].display(k-1, from); //fprintf(stderr, "NDalgorithm::Set_Left_Delta()-- k-1=%5d d=%5d from=%5d lastr=%5d - r=%5d s=%5d d=%5d\n", // k-1, d, from, lastr, // Edit_Array_Lazy[k-1][from].row, Edit_Array_Lazy[k-1][from].score, Edit_Array_Lazy[k-1][from].fromd); if (from == d - 1) { Left_Delta[Left_Delta_Len++] = Edit_Array_Lazy[k-1][d-1].row - lastr - 1; d--; lastr = Edit_Array_Lazy[k-1][from].row; } else if (from == d + 1) { Left_Delta[Left_Delta_Len++] = lastr - Edit_Array_Lazy[k-1][d+1].row; d++; lastr = Edit_Array_Lazy[k-1][from].row; } else { //fprintf(stderr, "LeftDelta: mismatch at %d max=%d last=%d\n", maxr - lastr - 1, maxr, lastr); } } leftover = lastr; } // Return the minimum number of changes (inserts, deletes, replacements) // needed to match string A[0 .. (1-m)] right-to-left with a prefix of string // T[0 .. (1-n)] right-to-left if it's not more than Error_Limit . // If no match, return the number of errors for the best match // up to a branch point. // Put delta description of alignment in Left_Delta and set // Left_Delta_Len to the number of entries there. // Set A_End and T_End to the leftmost positions where the // alignment ended in A and T , respectively. // If the alignment succeeds set Leftover to the number of // characters that match after the last Left_Delta entry; // otherwise, set Leftover to zero. // Set Match_To_End true if the match extended to the end // of at least one string; otherwise, set it false to indicate // a branch point. 
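reverse() below walks both sequences right-to-left through non-positive offsets from the pointers it is given, so A[0], A[-1], A[-2], ... step back through the string. A tiny sketch of that indexing convention (reverseMatchLen is a hypothetical helper, not canu code):

```cpp
#include <cstdint>

// Sketch: count exact matches extending leftward from a[0] and t[0], where
// 'a' and 't' point at the LAST character of each region and alen/tlen give
// the number of characters available to the left, inclusive of a[0]/t[0].
int32_t reverseMatchLen(const char *a, int32_t alen,
                        const char *t, int32_t tlen) {
  int32_t row = 0;
  while (row < alen && row < tlen && a[-row] == t[-row])
    row++;                       // a[-row] walks right-to-left
  return row;
}
```

Called with pointers to the ends of the two regions, this mirrors the initial "skip ahead over matches" loop in reverse(), which likewise tests A[-Row] against T[-Row].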
void NDalgorithm::reverse(char *A, int32 Alen, // first sequence and length char *T, int32 Tlen, // second sequence and length int32 &A_End, int32 &T_End, int32 &Leftover, // <- novel bool &Match_To_End) { assert (Alen <= Tlen); int32 Best_d = 0; int32 Best_e = 0; int32 Best_row = 0; int32 Best_score = 0; int32 Row = 0; int32 Dst = 0; int32 Err = 0; int32 Sco = 0; int32 fromd = 0; // Skip ahead over matches. The original used to also skip if either sequence was N. while ((Row < Alen) && (isMatch(A[-Row], T[-Row]))) { Sco += matchScore(A[-Row], T[-Row]); Row++; } if (Edit_Array_Lazy[0] == NULL) allocateMoreEditSpace(); Edit_Array_Lazy[0][0].row = Row; Edit_Array_Lazy[0][0].dist = Dst; Edit_Array_Lazy[0][0].errs = 0; Edit_Array_Lazy[0][0].score = Sco; Edit_Array_Lazy[0][0].fromd = INT32_MAX; // Exact match? if (Row == Alen) { A_End = -Alen; T_End = -Alen; Leftover = Alen; Match_To_End = true; Left_Score = Sco; Left_Delta_Len = 0; return; } int32 Left = 0; int32 Right = 0; int32 Max_Score = PEDMINSCORE; int32 Max_Score_Len = 0; int32 Max_Score_Best_d = 0; int32 Max_Score_Best_e = 0; for (int32 ei=1; ei <= Edit_Space_Max; ei++) { if (Edit_Array_Lazy[ei] == NULL) if (allocateMoreEditSpace() == false) { // FAIL return; } Left = MAX (Left - 1, -ei); Right = MIN (Right + 1, ei); //fprintf(stderr, "REVERSE ei=%d Left=%d Right=%d\n", ei, Left, Right); Edit_Array_Lazy[ei-1][Left - 1].init(); Edit_Array_Lazy[ei-1][Left ].init(); // Of note, [0][0] on the first iteration is not reset here. Edit_Array_Lazy[ei-1][Right ].init(); Edit_Array_Lazy[ei-1][Right + 1].init(); for (int32 d = Left; d <= Right; d++) { // A mismatch. 
{ int32 aPos = -(1 + Edit_Array_Lazy[ei-1][d].row) + 1; // +1 because we need to compare the base we are at, int32 tPos = -(1 + Edit_Array_Lazy[ei-1][d].row) - d + 1; // not the base we will be at after the mismatch Row = 1 + Edit_Array_Lazy[ei-1][d].row; Dst = Edit_Array_Lazy[ei-1][d].dist + 1; Err = Edit_Array_Lazy[ei-1][d].errs + 1; fromd = d; //fprintf(stderr, "aPos %d tPos %d\n", aPos, tPos); // If negative, we have a pointer into valid sequence. If not, this mismatch // doesn't make sense, and the row/score are set to bogus values. if ((aPos <= 0) && (tPos <= 0)) { assert(-aPos <= Alen); assert(-tPos <= Tlen); assert(A[aPos] != T[tPos]); Sco = Edit_Array_Lazy[ei-1][d].score + mismatchScore(A[aPos], T[tPos]); } else { Sco = PEDMINSCORE; } } // Insert a gap in A. Check the other sequence to see if this is a zero-cost gap. Note // agreement with future value of Row and what is used in isMatch() below. // Testcase test-st-ts shows this works. { int32 tPos = -(0 + Edit_Array_Lazy[ei-1][d-1].row) - d; //assert( tPos <= 0); Not true at the lower end; we'll just skip the cell since it's invalid //assert(-tPos < Tlen); if ((tPos <= 0) && (-tPos < Tlen)) { int32 gapCost = isFreeGap( T[tPos] ) ? PEDFREEGAP : PEDGAP; //if (gapCost == 0) // fprintf(stderr, "NDalgorithm::reverse()-- free A gap for aPos=%d tPos=%d t=%c/%d\n", tPos + d, tPos, T[tPos], T[tPos]); if (Edit_Array_Lazy[ei-1][d-1].score + gapCost > Sco) { Row = Edit_Array_Lazy[ei-1][d-1].row; Dst = Edit_Array_Lazy[ei-1][d-1].dist + (gapCost == PEDFREEGAP) ? 0 : 0; Err = Edit_Array_Lazy[ei-1][d-1].errs + (gapCost == PEDFREEGAP) ? 0 : 0; Sco = Edit_Array_Lazy[ei-1][d-1].score + gapCost; fromd = d-1; } } } // Insert a gap in T. { int32 aPos = -(1 + Edit_Array_Lazy[ei-1][d+1].row); //assert( aPos <= 0); //assert(-aPos < Tlen); if ((aPos <= 0) && (-aPos < Alen)) { int32 gapCost = isFreeGap( A[aPos] ) ? 
0 : PEDGAP; //if (gapCost == 0) // fprintf(stderr, "NDalgorithm::reverse()-- free T gap for aPos=%d tPos=%d t=%c/%d\n", aPos, aPos - d, A[aPos], A[aPos]); if (Edit_Array_Lazy[ei-1][d+1].score + gapCost > Sco) { Row = 1 + Edit_Array_Lazy[ei-1][d+1].row; Dst = Edit_Array_Lazy[ei-1][d+1].dist + (gapCost == PEDFREEGAP) ? 0 : 1; Err = Edit_Array_Lazy[ei-1][d+1].errs + (gapCost == PEDFREEGAP) ? 0 : 1; Sco = Edit_Array_Lazy[ei-1][d+1].score + gapCost; fromd = d+1; } } } // If A or B is N, that isn't a mismatch. // If A is lowercase and T is uppercase, it's a match. // If A is lowercase and T doesn't match, ignore the cost of the gap in B while ((Row < Alen) && (Row + d < Tlen) && (isMatch(A[-Row], T[-Row - d]))) { Sco += matchScore(A[-Row], T[-Row - d]); Row += 1; Dst += 1; Err += 0; } Edit_Array_Lazy[ei][d].row = Row; Edit_Array_Lazy[ei][d].dist = Dst; Edit_Array_Lazy[ei][d].errs = Err; Edit_Array_Lazy[ei][d].score = Sco; Edit_Array_Lazy[ei][d].fromd = fromd; if (Row == Alen || Row + d == Tlen) { A_End = - Row; // One past last align position T_End = - Row - d; Set_Left_Delta(ei, d, Leftover); Match_To_End = true; return; //return(ei); } } // Over all diagonals. // Reset the band // // The .dist used to be .row. 
while ((Left <= Right) && (Left < 0) && (Edit_Array_Lazy[ei][Left].dist < Edit_Match_Limit[ Edit_Array_Lazy[ei][Left].errs ])) Left++; if (Left >= 0) while ((Left <= Right) && (Edit_Array_Lazy[ei][Left].dist + Left < Edit_Match_Limit[ Edit_Array_Lazy[ei][Left].errs ])) Left++; if (Left > Right) break; while ((Right > 0) && (Edit_Array_Lazy[ei][Right].dist + Right < Edit_Match_Limit[ Edit_Array_Lazy[ei][Right].errs ])) Right--; if (Right <= 0) while (Edit_Array_Lazy[ei][Right].dist < Edit_Match_Limit[ Edit_Array_Lazy[ei][Right].errs ]) Right--; assert (Left <= Right); for (int32 d = Left; d <= Right; d++) if (Edit_Array_Lazy[ei][d].score > Best_score) { Best_d = d; Best_e = ei; Best_row = Edit_Array_Lazy[ei][d].row; Best_score = Edit_Array_Lazy[ei][d].score; } if (Best_score > Max_Score) { Max_Score_Best_d = Best_d; Max_Score_Best_e = Best_e; Max_Score = Best_score; Max_Score_Len = Best_row; } } // Over all possible number of errors //fprintf(stderr, "NDalgorithm::reverse()-- iterated over all errors, return best found\n"); A_End = - Max_Score_Len; T_End = - Max_Score_Len - Max_Score_Best_d; Set_Left_Delta(Max_Score_Best_e, Max_Score_Best_d, Leftover); Match_To_End = false; return; //return(Max_Score_Best_e); } canu-1.6/src/utgcns/libNDalign/NDalgorithm.C000066400000000000000000000127331314437614700207370ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * This file is derived from: * * src/utgcns/libNDalign/prefixEditDistance.C * * Modifications by: * * Brian P. Walenz from 2015-JUL-20 to 2015-AUG-05 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-FEB-25 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "NDalgorithm.H" #include "Binomial_Bound.H" const char * toString(pedAlignType at) { const char *ret = NULL; switch (at) { case pedOverlap: ret = "full-overlap"; break; case pedLocal: ret = "local-overlap"; break; case pedGlobal: ret = "global-alignment"; break; default: assert(0); break; } return(ret); } const char * toString(pedOverlapType ot) { const char *ret = NULL; switch (ot) { case pedBothBranch: ret = "both-branch"; break; case pedLeftBranch: ret = "left-branch"; break; case pedRightBranch: ret = "right-branch"; break; case pedDovetail: ret = "dovetail-overlap"; break; default: assert(0); break; } return(ret); } NDalgorithm::NDalgorithm(pedAlignType alignType_, double maxErate_) { alignType = alignType_; maxErate = maxErate_; ERRORS_FOR_FREE = 1; MIN_BRANCH_END_DIST = 20; MIN_BRANCH_TAIL_SLOPE = ((maxErate > 0.06) ? 1.0 : 0.20); Left_Delta = new int32 [AS_MAX_READLEN]; Right_Delta = new int32 [AS_MAX_READLEN]; Delta_Stack = new int32 [AS_MAX_READLEN]; allocated = 3 * AS_MAX_READLEN * sizeof(int32); Edit_Space_Max = AS_MAX_READLEN; //(alignType == pedGlobal) ? 
(AS_MAX_READLEN) : (1 + (int32)ceil(maxErate * AS_MAX_READLEN));

  Edit_Space_Lazy = new pedEdit * [Edit_Space_Max];
  Edit_Array_Lazy = new pedEdit * [Edit_Space_Max];

  memset(Edit_Space_Lazy, 0, sizeof(pedEdit *) * Edit_Space_Max);
  memset(Edit_Array_Lazy, 0, sizeof(pedEdit *) * Edit_Space_Max);

  allocated += Edit_Space_Max * sizeof (pedEdit *);
  allocated += Edit_Space_Max * sizeof (pedEdit *);

  int32 dataIndex = (int)ceil(maxErate * 100) - 1;

  if ((dataIndex < 0) || (50 <= dataIndex))
    fprintf(stderr, "NDalgorithm()-- Invalid maxErate=%f -> dataIndex=%d\n", maxErate, dataIndex);

  assert(0 <= dataIndex);
  assert(dataIndex < 50);

#if 0
  //  Use the precomputed values.
  {
    Edit_Match_Limit_Allocation = NULL;
    Edit_Match_Limit            = Edit_Match_Limit_Data[dataIndex];

    fprintf(stderr, "NDalgorithm()-- Set Edit_Match_Limit to %p; dataIndex=%d 6 = %p\n",
            Edit_Match_Limit, dataIndex, Edit_Match_Limit_0600);
  }
#else
  //  Compute values on the fly.
  {
    int32 MAX_ERRORS = (1 + (int32)ceil(maxErate * AS_MAX_READLEN));

    Edit_Match_Limit_Allocation = new int32 [MAX_ERRORS + 1];

    for (int32 e=0; e<= ERRORS_FOR_FREE; e++)
      Edit_Match_Limit_Allocation[e] = 0;

    int Start = 1;

    //  For each error count, the shortest alignment length at which that
    //  many errors is statistically plausible, from Binomial_Bound().
    for (int32 e=ERRORS_FOR_FREE + 1; e < MAX_ERRORS; e++) {
      Start = Binomial_Bound(e - ERRORS_FOR_FREE, maxErate, Start);
      Edit_Match_Limit_Allocation[e] = Start - 1;

      assert(Edit_Match_Limit_Allocation[e] >= Edit_Match_Limit_Allocation[e-1]);
    }

    Edit_Match_Limit = Edit_Match_Limit_Allocation;
  }
#endif

  for (int32 i=0; i <= AS_MAX_READLEN; i++) {
    //Error_Bound[i] = (int32) (i * maxErate + 0.0000000000001);
    Error_Bound[i] = (int32)ceil(i * maxErate);
  }

  //  Value to add for a match in finding branch points.
  //
  //  ALH: Note that maxErate also affects what overlaps get found
  //
  //  ALH: Scoring seems to be unusual: given an alignment of length l with k mismatches, the score
  //  seems to be computed as l + k * error value and NOT (l-k)*match+k*error
  //
  //  I.e. letting x := DEFAULT_BRANCH_MATCH_VAL, the max mismatch fraction p to give a non-negative
  //  score would be p = x/(1-x); conversely, to compute x for a goal p, we have x = p/(1+p).  E.g.
// // for p=0.06, x = .06 / (1.06) = .0566038 // for p=0.35, x = .35 / (1.35) = .259259 // for p=0.2, x = .20 / (1.20) = .166667 // for p=0.15, x = .15 / (1.15) = .130435 // // Value was for 6% vs 35% error discrimination. // // Converting to integers didn't make it faster. // // Corresponding error value is this value minus 1.0 Branch_Match_Value = maxErate / (1 + maxErate); for (uint32 ii=0; ii<256; ii++) tolower[ii] = ::tolower(ii); }; NDalgorithm::~NDalgorithm() { delete [] Left_Delta; delete [] Right_Delta; delete [] Delta_Stack; for (uint32 i=0; i 0) - ((a) < 0) ) // Scores used in the dynamic programming. // #define PEDMATCH 2 // Match from UPPERCASE to UPPERCASE #define PEDGAPMATCH 1 // Match from LOWERCASE to UPPERCASE #define PEDMISMATCH -3 // Mismatch #define PEDFREEGAP 1 // Gap from LOWERCASE #define PEDGAP -5 // Gap // PEDMINSCORE wants to be INT32_MIN, except that it will underflow, and wrap around to the highest // score possible, on the first insert. I think it suffices to set this just slightly higher than // the minimum, since it should never be picked as a maximum, and thus never see more than one // insert. Just to be safe, we give it lots of space. // #define PEDMINSCORE (INT32_MIN - PEDGAP * 32768) enum pedAlignType { pedOverlap = 0, // Former normal overlap pedLocal = 1, // Former partial overlap pedGlobal = 2 // New forced global overlap }; enum pedOverlapType { pedBothBranch = 0, pedLeftBranch = 1, pedRightBranch = 2, pedDovetail = 3 }; const char *toString(pedAlignType at); const char *toString(pedOverlapType ot); // the input to Extend_Alignment. 
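The x and p relations in the branch-point comment above form a simple inverse pair and are easy to check numerically (branchMatchValue and maxMismatchFraction are hypothetical names for the two directions; x corresponds to Branch_Match_Value):

```cpp
// Sketch: the two directions of the branch-point scoring relation quoted
// above.  Given a target mismatch fraction p, the match value is
// x = p / (1 + p); given a match value x, the largest mismatch fraction
// yielding a non-negative score is p = x / (1 - x).
double branchMatchValue(double p)     { return p / (1 + p); }
double maxMismatchFraction(double x)  { return x / (1 - x); }
```

The two functions are inverses, which is exactly the consistency the comment's worked examples (p=0.06 giving x=.0566038, and so on) rely on.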
class Match_Node_t { public: int32 Offset; // To start of exact match in hash-table frag int32 Len; // Of exact match int32 Start; // Of exact match in current (new) frag int32 Next; // Subscript of next match in list }; class pedEdit { public: int32 row; // Position in the A sequence; was previously the score too int32 dist; // The number of non-free bases in the alignment, used to decide band int32 errs; // The number of non-free errors in the alignment, used to decide band int32 score; // The dynamic programming score, used to decide best align int32 fromd; // For backtracking, where we came from void init(void) { row = -2; dist = -2; errs = 0; score = PEDMINSCORE; fromd = INT32_MAX; }; void display(int32 e, int32 d) { fprintf(stderr, "e=%d d=%d - ptr %p - ", e, d, this); fprintf(stderr, "row=%d ", row); fprintf(stderr, "dist=%d ", dist); fprintf(stderr, "errs=%d ", errs); fprintf(stderr, "score=%d ", score); fprintf(stderr, "fromd=%d\n", fromd); }; }; class NDalgorithm { public: NDalgorithm(pedAlignType alignType_, double maxErate_); ~NDalgorithm(); private: bool allocateMoreEditSpace(void); void Set_Right_Delta(int32 e, int32 d); void forward(char *A, int32 m, char *T, int32 n, int32 &A_End, int32 &T_End, bool &Match_To_End); void Set_Left_Delta(int32 e, int32 d, int32 &leftover); void reverse(char *A, int32 m, char *T, int32 n, int32 &A_End, int32 &T_End, int32 &Leftover, // <- novel bool &Match_To_End); // Returns true if letter 'a' from sequence A matches letter 't' from sequence T. // Wanted to allow lowercase as free matches, but the O(ND) algorithm doesn't support that. // bool isMatch(char a, char t) { return(a == t); }; // Returns the score of a match. Pretty basic, was written to support 'a' == 'A' matches, but // the O(ND) algorithm cannot support that. 
// int32 matchScore(char a, char t) { if (isMatch(a, t)) return(PEDMATCH); fprintf(stderr, "ERROR: a=%c/%d does not match t=%c/%d\n", a, a, t, t); assert(isMatch(a, t) == true); return(0); }; // Returns the score of a mismatch. As with gaps, we can get free mismatches to lower case letters, // which should prevent alignments such as: // // A: GtTTgaTGACCTGG (the scoring should be 2 mismatches then 3 gaps) // B: GTTTGA---CCTGG // // Favoring instead // // A: GtTTgaTGACCTGG (the scoring should be 2 free gaps, 1 gap and 2 matches) // B: GTTT---GACCTGG // int32 mismatchScore(char a, char t) { if ((a == tolower[a]) || (t == tolower[t])) return(PEDGAPMATCH); return(PEDMISMATCH); } // Returns true if the gap we'd be adding to T is zero cost: // lower case 'a' is a don't case position in sequence A // bool isFreeGap(char a) { if (a == 0) return(false); return(a == tolower[a]); } public: pedOverlapType Extend_Alignment(Match_Node_t *Match, char *S, int32 S_Len, char *T, int32 T_Len, int32 &S_Lo, int32 &S_Hi, int32 &T_Lo, int32 &T_Hi); int32 score(void) { return(Left_Score + Right_Score); }; public: // The three below were global #defines, two depended on the error rate which is now local. 
// The number of errors that are ignored in setting probability bound for terminating alignment // extensions in edit distance calculations uint32 ERRORS_FOR_FREE; // Branch points must be at least this many bases from the end of the fragment to be reported uint32 MIN_BRANCH_END_DIST; // Branch point tails must fall off from the max by at least this rate double MIN_BRANCH_TAIL_SLOPE; pedAlignType alignType; double maxErate; uint64 allocated; int32 Left_Score; int32 Left_Delta_Len; int32 *Left_Delta; int32 Right_Score; int32 Right_Delta_Len; int32 *Right_Delta; int32 *Delta_Stack; int32 Edit_Space_Max; pedEdit **Edit_Space_Lazy; // Array of pointers, if set, it was a new'd allocation pedEdit **Edit_Array_Lazy; // Array of pointers, some are not new'd allocations // This array [e] is the minimum value of Edit_Array[e][d] // to be worth pursuing in edit-distance computations between reads const int32 *Edit_Match_Limit; int32 *Edit_Match_Limit_Allocation; // The maximum number of errors allowed in a match between reads of length i, // which is i * AS_OVL_ERROR_RATE. int32 Error_Bound[AS_MAX_READLEN + 1]; // Scores of matches and mismatches in alignments. Alignment ends at maximum score. double Branch_Match_Value; char tolower[256]; }; #endif // NDALGORITHM_H canu-1.6/src/utgcns/libNDalign/NDalign.C000066400000000000000000000727201314437614700200450ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * This file is derived from: * * src/overlapInCore/overlapAlign.C * * Modifications by: * * Brian P. Walenz from 2015-JUN-17 to 2015-AUG-25 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-13 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "NDalign.H" #include "kMer.H" #include "merStream.H" #include "stddev.H" #include "Display_Alignment.H" #undef DEBUG_ALGORITHM // Some details. #undef DEBUG_HITS // Lots of details (chainHits()) #undef SEED_NON_OVERLAPPING // Allow mismatches in seeds #define SEED_OVERLAPPING #define DISPLAY_WIDTH 250 NDalign::NDalign(pedAlignType alignType, double maxErate, int32 merSize) { _alignType = alignType; _maxErate = maxErate; _merSizeInitial = merSize; _aID = UINT32_MAX; _aStr = NULL; _aLen = 0; _aLoOrig = 0; _aHiOrig = 0; _bID = UINT32_MAX; _bStr = NULL; _bLen = 0; _bLoOrig = 0; _bHiOrig = 0; _bFlipped = false; _bRevMax = 0; _bRev = NULL; _editDist = new NDalgorithm(_alignType, _maxErate); _minDiag = 0; _maxDiag = 0; _merSize = 0; _topDisplay = NULL; _botDisplay = NULL; _resDisplay = NULL; // Initialize Constants for (uint32 ii=0; ii<256; ii++) { acgtToBit[ii] = 0x00; acgtToVal[ii] = 0x03; } // Do NOT include lowercase letters here. These are 'don't care' matches // in consensus, and are usually matches we want to be skipping. // // ATGtTTgaTGACC vs ATGtTTgaTGACC // ATG-TT--TGACC ATGTTT---GACC // // While, yes, the second form is simpler (and would be correct had the last T been // optional) the first is what consensus is expecting to find. 
#if 0 acgtToBit['a'] = acgtToBit['A'] = 0x00; // Bit encoding of ACGT acgtToBit['c'] = acgtToBit['C'] = 0x01; acgtToBit['g'] = acgtToBit['G'] = 0x02; acgtToBit['t'] = acgtToBit['T'] = 0x03; acgtToVal['a'] = acgtToVal['A'] = 0x00; // Word is valid if zero acgtToVal['c'] = acgtToVal['C'] = 0x00; acgtToVal['g'] = acgtToVal['G'] = 0x00; acgtToVal['t'] = acgtToVal['T'] = 0x00; #else acgtToBit['A'] = 0x00; // Bit encoding of ACGT acgtToBit['C'] = 0x01; acgtToBit['G'] = 0x02; acgtToBit['T'] = 0x03; acgtToVal['A'] = 0x00; // Word is valid if zero acgtToVal['C'] = 0x00; acgtToVal['G'] = 0x00; acgtToVal['T'] = 0x00; #endif merMask[0] = 0x0000000000000000llu; merMask[1] = 0x0000000000000003llu; for (uint32 ii=2; ii<33; ii++) merMask[ii] = (merMask[ii-1] << 2) | 0x03; assert(merMask[ 6] == 0x0000000000000fffllu); assert(merMask[17] == 0x00000003ffffffffllu); assert(merMask[26] == 0x000fffffffffffffllu); assert(merMask[32] == 0xffffffffffffffffllu); } NDalign::~NDalign() { delete _editDist; delete [] _bRev; delete [] _topDisplay; delete [] _botDisplay; delete [] _resDisplay; } void NDalign::initialize(uint32 aID, char *aStr, int32 aLen, int32 aLo, int32 aHi, uint32 bID, char *bStr, int32 bLen, int32 bLo, int32 bHi, bool bFlipped) { #ifdef DEBUG_ALGORITHM fprintf(stderr, "NDalign::initialize()-- A %u %d-%d B %u %d-%d\n", aID, aLo, aHi, bID, bLo, bHi); #endif _aID = aID; _aStr = aStr; _aLen = aLen; _aLoOrig = aLo; _aHiOrig = aHi; _bID = bID; _bStr = bStr; _bLen = bLen; _bLoOrig = bLo; _bHiOrig = bHi; _bFlipped = bFlipped; if (_bFlipped == true) { if (_bRevMax < _bLen) { delete [] _bRev; _bRevMax = _bLen + 1000; _bRev = new char [_bRevMax]; } memcpy(_bRev, bStr, sizeof(char) * (_bLen + 1)); reverseComplementSequence(_bRev, _bLen); _bStr = _bRev; _bLoOrig = _bLen - bLo; // Now correct for the reverse complemented sequence _bHiOrig = _bLen - bHi; } assert(_aLoOrig < _aHiOrig); assert(_bLoOrig < _bHiOrig); //_editDist doesn't need to be cleared. 
_minDiag = 0; _maxDiag = 0; _merSize = 0; _aMap.clear(); _bMap.clear(); _rawhits.clear(); _hits.clear(); _hitr = UINT32_MAX; _bestResult.clear(); } // Set the min/max diagonal we will accept seeds for. It's just the min/max diagonal for the // two endpoints extended by half the erate times the align length. // // Returns false if there is no chance of an overlap - the computed alignment length is too small. // // The adjustments change the sequence start and end position (bgn + bgnAdj, end - endAdj), // and are intended to compensate for supplying sequences larger than absolutely necessary // (for when the exact boundaries aren't known). bool NDalign::findMinMaxDiagonal(int32 minLength, uint32 AbgnAdj, uint32 AendAdj, uint32 BbgnAdj, uint32 BendAdj) { _minDiag = 0; _maxDiag = 0; int32 aALen = (_aHiOrig - AendAdj) - (_aLoOrig + AbgnAdj); int32 bALen = (_bHiOrig - BendAdj) - (_bLoOrig + BbgnAdj); int32 alignLen = (aALen < bALen) ? bALen : aALen; if (alignLen < minLength) return(false); int32 bgnDiag = (_aLoOrig + AbgnAdj) - (_bLoOrig + BbgnAdj); int32 endDiag = (_aHiOrig - AendAdj) - (_bHiOrig - BendAdj); if (bgnDiag < endDiag) { _minDiag = bgnDiag - _maxErate * alignLen / 2; _maxDiag = endDiag + _maxErate * alignLen / 2; } else { _minDiag = endDiag - _maxErate * alignLen / 2; _maxDiag = bgnDiag + _maxErate * alignLen / 2; } // For very very short overlaps (mhap kindly reports 4 bp overlaps) reset the min/max // to something a little more permissive.
//if (_minDiag > -5) _minDiag = -5; //if (_maxDiag < 5) _maxDiag = 5; if (_minDiag > _maxDiag) fprintf(stderr, "NDalign::findMinMaxDiagonal()-- ERROR: _minDiag=%d >= _maxDiag=%d\n", _minDiag, _maxDiag); assert(_minDiag <= _maxDiag); #ifdef DEBUG_ALGORITHM fprintf(stderr, "NDalign::findMinMaxDiagonal()-- A: _aLoOrig %d + AbgnAdj %d -- _aHiOrig %d - AendAdj %d\n", _aLoOrig, AbgnAdj, _aHiOrig, AendAdj); fprintf(stderr, "NDalign::findMinMaxDiagonal()-- B: _bLoOrig %d + BbgnAdj %d -- _bHiOrig %d - BendAdj %d\n", _bLoOrig, BbgnAdj, _bHiOrig, BendAdj); fprintf(stderr, "NDalign::findMinMaxDiagonal()-- min %d max %d -- span %d -- alignLen %d\n", _minDiag, _maxDiag, _maxDiag-_minDiag, alignLen); #endif return(true); } void NDalign::fastFindMersA(bool dupIgnore) { uint64 mer = 0x0000000000000000llu; uint64 val = 0xffffffffffffffffllu; uint64 mask = merMask[_merSize]; // Restrict the mers we seed with to those in the overlap (plus a little wiggle room). If this // isn't done, overlaps in repeats are sometimes lost. In the A read, mers will drop out (e.g., // if there is a repeat at both the 5' and 3' ends, and we have an overlap on one end only). In the B // read, the mers can now be on the wrong diagonal. int32 bgn = _aLoOrig - _merSize - _merSize; int32 end = _aHiOrig + _merSize; if (bgn < 0) bgn = 0; // Create mers. Since 'val' was initialized as invalid until the first _merSize things // are pushed on, no special case is needed to load the mer. It costs us two extra &'s // and the test for saving the valid mer while we initialize. for (int32 seqpos=bgn; (seqpos < end) && (_aStr[seqpos] != 0); seqpos++) { mer <<= 2; val <<= 2; mer |= acgtToBit[_aStr[seqpos]]; val |= acgtToVal[_aStr[seqpos]]; mer &= merMask[_merSize]; val &= merMask[_merSize]; if (val != 0x0000000000000000) // Not a valid mer. continue; // +1 - consider a 1-mer. The first time through we have a valid mer, but seqpos == 0. // To get an aMap position of zero (the true position) we need to add one.
if (_aMap.find(mer) != _aMap.end()) { if (dupIgnore == true) _aMap[mer] = INT32_MAX; // Duplicate mer, now ignored! } else { _aMap[mer] = seqpos + 1 - _merSize; } } //fprintf(stderr, "Found %u hits in A at mersize %u dupIgnore %u t %u %u\n", _aMap.size(), _merSize, dupIgnore, t[0], t[1]); } void NDalign::fastFindMersB(bool dupIgnore) { uint64 mer = 0x0000000000000000llu; uint64 val = 0xffffffffffffffffllu; uint64 mask = merMask[_merSize]; // Like the A read, we limit to mers in the overlap region. int32 bgn = _bLoOrig - _merSize - _merSize; int32 end = _bHiOrig + _merSize; if (bgn < 0) bgn = 0; // Create mers. Since 'val' was initialized as invalid until the first _merSize things // are pushed on, no special case is needed to load the mer. It costs us two extra &'s // and the test for saving the valid mer while we initialize. for (int32 seqpos=bgn; (seqpos < end) && (_bStr[seqpos] != 0); seqpos++) { mer <<= 2; val <<= 2; mer |= acgtToBit[_bStr[seqpos]]; val |= acgtToVal[_bStr[seqpos]]; mer &= merMask[_merSize]; val &= merMask[_merSize]; if (val != 0x0000000000000000) // Not a valid mer. continue; if (_aMap.find(mer) == _aMap.end()) // Not in the A sequence, don't care. continue; int32 apos = _aMap[mer]; int32 bpos = seqpos + 1 - _merSize; if (apos == INT32_MAX) // Exists too many times in aSeq, don't care. continue; if ((apos - bpos < _minDiag) || (apos - bpos > _maxDiag)) // Too different. continue; if (_bMap.find(mer) != _bMap.end()) { if (dupIgnore == true) _bMap[mer] = INT32_MAX; // Duplicate mer, now ignored! } else { _bMap[mer] = bpos; } } //fprintf(stderr, "Found %u hits in B at mersize %u dupIgnore %u t %u %u %u %u %u\n", _bMap.size(), _merSize, dupIgnore, t[0], t[1], t[2], t[3], t[4]); } // Find seeds - hash the kmer and position from the first read, then lookup // each kmer in the second read. For unique hits, save the diagonal. Then what? // If the diagonal is too far from the expected diagonal (based on the overlap), // ignore the seed. 
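The rolling mer construction in fastFindMersA()/fastFindMersB() above uses 2-bit base codes, a validity word shifted in lockstep with the mer, and a 2k-bit mask. The same scheme can be sketched stand-alone; this is an illustrative reimplementation, not canu's code, and the `MerScanner` name and `scan()` interface are invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative re-creation of the rolling 2-bit mer encoding: A=00, C=01,
// G=10, T=11. A parallel 'invalid' word receives 0 for ACGT and 3 for any
// other byte, so the window holds a usable mer exactly when 'invalid' == 0.
struct MerScanner {
  uint64_t bit[256];   // 2-bit code per base
  uint64_t val[256];   // 0 if the base is valid, 3 otherwise
  uint64_t mask;       // keeps only the low 2*k bits

  explicit MerScanner(uint32_t k) {
    for (int i = 0; i < 256; i++) { bit[i] = 0; val[i] = 3; }
    bit['A'] = 0; bit['C'] = 1; bit['G'] = 2; bit['T'] = 3;
    val['A'] = val['C'] = val['G'] = val['T'] = 0;
    mask = (k >= 32) ? ~0llu : ((1llu << (2 * k)) - 1);
  }

  // Slide a k-wide window over 's'; report the last fully-valid mer seen.
  // As in the code above, no special case is needed to "load" the first
  // mer: 'invalid' starts all-ones and only clears after k good bases.
  bool scan(const char *s, uint64_t *lastValidMer) const {
    uint64_t mer = 0, invalid = ~0llu;
    bool found = false;
    for (const char *p = s; *p; p++) {
      mer     = ((mer     << 2) | bit[(unsigned char)*p]) & mask;
      invalid = ((invalid << 2) | val[(unsigned char)*p]) & mask;
      if (invalid == 0) { *lastValidMer = mer; found = true; }
    }
    return found;
  }
};
```

For k=3, "ACG" encodes as 0b000110, and an embedded 'N' poisons every window covering it, mirroring the acgtToVal bookkeeping above.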
// // If dupIgnore == true, ignore duplicates. Otherwise, use the first occurrence // // Returns false if no mers found. // bool NDalign::findSeeds(bool dupIgnore) { _merSize = _merSizeInitial; // Find mers in A findMersAgain: fastFindMersA(dupIgnore); if (_aMap.size() == 0) { _aMap.clear(); _bMap.clear(); _merSize--; if ((_merSize < 8) && (dupIgnore == true)) { _merSize = _merSizeInitial + 2; dupIgnore = false; } if (_merSize >= 8) goto findMersAgain; } // Find mers in B fastFindMersB(dupIgnore); if (_bMap.size() == 0) { _aMap.clear(); _bMap.clear(); _merSize--; if ((_merSize < 8) && (dupIgnore == true)) { _merSize = _merSizeInitial + 2; dupIgnore = false; } if (_merSize >= 8) goto findMersAgain; } // Still zero? Didn't find any unique seeds anywhere. if (_bMap.size() == 0) { #ifdef DEBUG_ALGORITHM fprintf(stderr, "NDalign::findSeeds()-- No seeds found.\n"); #endif return(false); } #ifdef DEBUG_ALGORITHM fprintf(stderr, "NDalign::findSeeds()-- Found %u seeds.\n", _bMap.size()); #endif return(true); } // For unique mers in B, if the mer is also unique in A, add a hit. bool NDalign::findHits(void) { for (map<uint64, int32>::iterator bit=_bMap.begin(); bit != _bMap.end(); bit++) { uint64 kmer = bit->first; int32 bpos = bit->second; if (bpos == INT32_MAX) // Exists too many times in bSeq, don't care about it. continue; int32 apos = _aMap[kmer]; assert(apos != INT32_MAX); // Should never get a bMap if the aMap isn't set if ((apos - bpos < _minDiag) || (apos - bpos > _maxDiag)) fprintf(stderr, "NDalign::findHits()-- kmer " F_X64 " apos - bpos = %d _minDiag = %d _maxDiag = %d\n", kmer, apos-bpos, _minDiag, _maxDiag); assert(apos - bpos >= _minDiag); // ...these too. assert(apos - bpos <= _maxDiag); _rawhits.push_back(exactMatch(apos, bpos, _merSize)); } #ifdef DEBUG_ALGORITHM fprintf(stderr, "NDalign::findHits()-- Found %u hits.\n", _rawhits.size()); #endif return(true); } bool NDalign::chainHits(void) { // Sort by aPos (actually by length, then by aPos, but length is constant here).
sort(_rawhits.begin(), _rawhits.end()); #ifdef DEBUG_HITS for (uint32 rr=0; rr<_rawhits.size(); rr++) if (_rawhits[rr].aBgn - _rawhits[rr].bBgn == 2289) fprintf(stderr, "NDalign::chainHits()-- HIT: %d - %d diag %d\n", _rawhits[rr].aBgn, _rawhits[rr].bBgn, _rawhits[rr].aBgn - _rawhits[rr].bBgn); #endif // Chain the _hits. This chains hits that are on the same diagonal and contiguous. _hits.push_back(_rawhits[0]); for (uint32 rr=1; rr<_rawhits.size(); rr++) { uint32 hh = _hits.size() - 1; bool merge = false; assert(_rawhits[rr-1].aBgn < _rawhits[rr].aBgn); #ifdef SEED_NON_OVERLAPPING // Allows non-overlapping (mismatch only) hits - but breaks seeding of alignment // --------- end of next hit ----------- ----- end of existing hit ----- int32 da = _rawhits[rr].aBgn + _rawhits[rr].tLen - _hits[hh].aBgn - _hits[hh].tLen; int32 db = _rawhits[rr].bBgn + _rawhits[rr].tLen - _hits[hh].bBgn - _hits[hh].tLen; assert(da > 0); if (da == db) merge = true; #endif #ifdef SEED_OVERLAPPING // Requires overlapping hits - full block of identity // - start of next - ------- end of existing ------- int32 da = _rawhits[rr].aBgn - _hits[hh].aBgn - _hits[hh].tLen; int32 db = _rawhits[rr].bBgn - _hits[hh].bBgn - _hits[hh].tLen; if ((da < 0) && (da == db)) merge = true; #endif if (merge) { //fprintf(stderr, "NDalign::chainHits()-- MERGE HIT: %d - %d diag %d da=%d db=%d\n", _rawhits[rr].aBgn, _rawhits[rr].bBgn, _rawhits[rr].aBgn - _rawhits[rr].bBgn, da, db); _hits[hh].tLen = _rawhits[rr].aBgn + _rawhits[rr].tLen - _hits[hh].aBgn; } else { //fprintf(stderr, "NDalign::chainHits()-- NEW HIT: %d - %d diag %d da=%d db=%d\n", _rawhits[rr].aBgn, _rawhits[rr].bBgn, _rawhits[rr].aBgn - _rawhits[rr].bBgn, da, db); _hits.push_back(_rawhits[rr]); } } // Sort by longest sort(_hits.begin(), _hits.end()); #ifdef DEBUG_HITS for (uint32 hh=0; hh<_hits.size(); hh++) { fprintf(stderr, "NDalign::chainHits()-- hit %02u %5d-%5d diag %d len %3u\n", hh, _hits[hh].aBgn, _hits[hh].bBgn, _hits[hh].aBgn - _hits[hh].bBgn, 
_hits[hh].tLen); } #endif #ifdef DEBUG_ALGORITHM fprintf(stderr, "NDalign::chainHits()-- Found %u chains of hits.\n", _hits.size()); #endif return(true); } bool NDalign::makeNullHit(void) { _hits.push_back(exactMatch(0, 0, 0)); return(true); } bool NDalign::processHits(void) { // If the first time here, set the hit iterator to zero, otherwise move to the next one. // And then return if there are no more hits to iterate over. if (_hitr == UINT32_MAX) _hitr = 0; else _hitr++; if (_hitr >= _hits.size()) return(false); // While hits, process them. // // If a good hit is found, return, leaving hitr as is. The next time we enter this function, // we'll increment hitr and process the next hit. If no good hit is found, we iterate the loop // until a good one is found, or we run out of hits. for (; _hitr < _hits.size(); _hitr++) { Match_Node_t match; match.Start = _hits[_hitr].aBgn; // Begin position in a match.Offset = _hits[_hitr].bBgn; // Begin position in b match.Len = _hits[_hitr].tLen; // tLen can include mismatches if alternate scoring is used! match.Next = 0; // Not used here #ifdef SEED_NON_OVERLAPPING match.Offset = _merSize; // Really should track this in the hits, oh well. 
#endif #ifdef DEBUG_ALGORITHM fprintf(stderr, "NDalign::processHits()--\n"); fprintf(stderr, "NDalign::processHits()-- Extend_Alignment Astart %d Bstart %d length %d\n", match.Start, match.Offset, match.Len); fprintf(stderr, "NDalign::processHits()--\n"); fprintf(stderr, ">A len=%u\n", _aLen); fwrite(_aStr, sizeof(char), _aLen, stderr); fprintf(stderr, "\n"); fprintf(stderr, ">B len=%u\n", _bLen); fwrite(_bStr, sizeof(char), _bLen, stderr); fprintf(stderr, "\n"); #endif int32 aLo=0, aHi=0; int32 bLo=0, bHi=0; char origScore[1024]; pedOverlapType olapType = _editDist->Extend_Alignment(&match, // Initial exact match, relative to start of string _aStr, _aLen, _bStr, _bLen, aLo, aHi, // Output: Regions which the match extends bLo, bHi); aHi++; // Add one to the end point because Extend_Alignment returns the base-based coordinate. bHi++; // Is this a better overlap than what we have? Save it and update statistics. if (((score() < _editDist->score())) || ((score() <= _editDist->score()) && (length() > ((aHi - aLo) + (bHi - bLo) + _editDist->Left_Delta_Len) / 2))) { #ifdef DEBUG_ALGORITHM fprintf(stderr, "NDalign::processHits()--\n"); fprintf(stderr, "NDalign::processHits()-- %s\n", (length() > 0) ? 
"Save better alignment" : "First alignment"); fprintf(stderr, "NDalign::processHits()--\n"); sprintf(origScore, "NDalign::processHits()-- OLD length %u erate %f score %u (%d-%d %d-%d)\n", length(), erate(), score(), abgn(), aend(), bbgn(), bend()); #endif _bestResult.save(aLo, aHi, bLo, bHi, _editDist->score(), olapType, _editDist->Left_Delta_Len, _editDist->Left_Delta); display("NDalign::processHits()-- ", false); _bestResult.setErate(1.0 - (double)(_matches + _gapmatches) / (length() - _freegaps)); #ifdef DEBUG_ALGORITHM fprintf(stderr, "%sNDalign::processHits()-- NEW length %u erate %f score %u (%d-%d %d-%d)\n", origScore, length(), erate(), score(), abgn(), aend(), bbgn(), bend()); #endif } else { olapType = pedBothBranch; #ifdef DEBUG_ALGORITHM fprintf(stderr, "NDalign::processHits()-- DON'T save alignment - OLD length %u erate %f score %u (%d-%d %d-%d) ", length(), erate(), score(), abgn(), aend(), bbgn(), bend()); fprintf(stderr, "NDalign::processHits()-- NEW length %u score %u coords %u-%u %u-%u\n", ((aHi - aLo) + (bHi - bLo) + _editDist->Left_Delta_Len) / 2, _editDist->score(), aLo, aHi, bLo, bHi); #endif } // If a dovetail, we're done. Let the client figure out if the quality is good. if (olapType == pedDovetail) return(true); } // Over all seeds. // No more seeds to align. Did we save an alignment? return(score() > 0); } #define ABS(x) (((x) < 0) ? -(x) : (x)) // A simple scan for blocks of gaps. The delta values here are relative to the last gap, // so a block of too much gap will have N delta values with small values. 
// bool NDalign::scanDeltaForBadness(bool verbose, bool showAlign) { double ema = 0.0; double alpha = 0.001; // Smaller will increase the averaging length int32 badBlocks = 0; for (uint32 ii=0; _resDisplay[ii]; ii++) { if (_resDisplay[ii] == '^') ema = computeExponentialMovingAverage(alpha, ema, 1.0); else ema = computeExponentialMovingAverage(alpha, ema, 0.0); if (ema > 0.25) badBlocks++; } if ((verbose == true) && (badBlocks > 0)) { fprintf(stderr, "NDalign::scanDeltaForBadness()-- Potential bad alignment: found %d bad blocks (alpha %f)\n", badBlocks, alpha); if (showAlign == true) display("NDalign::scanDeltaForBadness()-- ", true); } return(badBlocks > 0); } void NDalign::realignForward(bool displayAlgorithm, bool displayAlign) { Match_Node_t match; match.Start = abgn(); // Begin position in a match.Offset = bbgn(); // Begin position in b match.Len = 0; match.Next = 0; // Not used here int32 aLo=0, aHi=0; int32 bLo=0, bHi=0; char origScore[1024]; if (displayAlgorithm) fprintf(stderr, "NDalign::realignForward()--\n"); pedOverlapType olapType = _editDist->Extend_Alignment(&match, // Initial exact match, relative to start of string _aStr, _aLen, _bStr, _bLen, aLo, aHi, // Output: Regions which the match extends bLo, bHi); aHi++; // Add one to the end point because Extend_Alignment returns the base-based coordinate. bHi++; // Is this a better overlap than what we have?
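scanDeltaForBadness() above smooths a per-column 0/1 error indicator and counts columns where the smoothed value exceeds 0.25. A minimal sketch, assuming computeExponentialMovingAverage() (defined elsewhere, in stddev.H) is the conventional update ema' = alpha*x + (1-alpha)*ema; `countBadColumns` is a name invented for this example.

```cpp
#include <cassert>
#include <string>

// Assumed form of computeExponentialMovingAverage(): conventional EMA update.
static double emaUpdate(double alpha, double ema, double x) {
  return alpha * x + (1.0 - alpha) * ema;
}

// Mirror of the badBlocks loop above: '^' marks an error column in the
// result display. With alpha = 0.001, roughly 288 consecutive errors are
// needed before the average climbs past 0.25 (1 - 0.999^n > 0.25 at n = 288),
// so only long runs of errors register as bad blocks.
static int countBadColumns(const std::string &res, double alpha, double threshold) {
  double ema = 0.0;
  int bad = 0;
  for (char c : res) {
    ema = emaUpdate(alpha, ema, (c == '^') ? 1.0 : 0.0);
    if (ema > threshold)
      bad++;
  }
  return bad;
}
```

A short burst of mismatches therefore never trips the check, while a sustained gappy region does, which matches the "blocks of gaps" intent stated in the comment.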
if (((score() < _editDist->score())) || ((score() <= _editDist->score()) && (length() > ((aHi - aLo) + (bHi - bLo) + _editDist->Left_Delta_Len) / 2))) { if (displayAlgorithm) { fprintf(stderr, "NDalign::realignForward()--\n"); fprintf(stderr, "NDalign::realignForward()-- Save better alignment\n"); fprintf(stderr, "NDalign::realignForward()--\n"); sprintf(origScore, "NDalign::realignForward()-- OLD length %u erate %f score %u (%d-%d %d-%d)\n", length(), erate(), score(), abgn(), aend(), bbgn(), bend()); } _bestResult.save(aLo, aHi, bLo, bHi, _editDist->score(), olapType, _editDist->Left_Delta_Len, _editDist->Left_Delta); display("NDalign::realignForward()-- ", displayAlign); _bestResult.setErate(1.0 - (double)(_matches + _gapmatches) / (length() - _freegaps)); if (displayAlgorithm) { fprintf(stderr, "%sNDalign::realignForward()-- NEW length %u erate %f score %u (%d-%d %d-%d)\n", origScore, length(), erate(), score(), abgn(), aend(), bbgn(), bend()); } } else if (displayAlgorithm) { fprintf(stderr, "NDalign::realignForward()-- Alignment no better - OLD length %u erate %f score %u (%d-%d %d-%d)\n", length(), erate(), score(), abgn(), aend(), bbgn(), bend()); fprintf(stderr, "NDalign::realignForward()-- Alignment no better - NEW length %u erate %f score %u (%d-%d %d-%d)\n", ((aHi - aLo) + (bHi - bLo) + _editDist->Left_Delta_Len) / 2, 0.0, _editDist->score(), aLo, aHi, bLo, bHi); //display("NDalign::realignForward(NB)--", aLo, aHi, bLo, bHi, _editDist->Left_Delta, _editDist->Left_Delta_Len, true, false); } } void NDalign::realignBackward(bool displayAlgorithm, bool displayAlign) { Match_Node_t match; match.Start = aend(); // Begin position in a match.Offset = bend(); // Begin position in b match.Len = 0; match.Next = 0; // Not used here int32 aLo=0, aHi=0; int32 bLo=0, bHi=0; char origScore[1024]; if (displayAlgorithm) fprintf(stderr, "NDalign::realignBackward()--\n"); pedOverlapType olapType = _editDist->Extend_Alignment(&match, // Initial exact match, relative to 
start of string _aStr, _aLen, _bStr, _bLen, aLo, aHi, // Output: Regions which the match extends bLo, bHi); aHi++; // Add one to the end point because Extend_Alignment returns the base-based coordinate. bHi++; // Is this a better overlap than what we have? if (((score() < _editDist->score())) || ((score() <= _editDist->score()) && (length() > ((aHi - aLo) + (bHi - bLo) + _editDist->Left_Delta_Len) / 2))) { if (displayAlign) { fprintf(stderr, "NDalign::realignBackward()--\n"); fprintf(stderr, "NDalign::realignBackward()-- Save better alignment\n"); fprintf(stderr, "NDalign::realignBackward()--\n"); sprintf(origScore, "NDalign::realignBackward()-- OLD length %u erate %f score %u (%d-%d %d-%d)\n", length(), erate(), score(), abgn(), aend(), bbgn(), bend()); } _bestResult.save(aLo, aHi, bLo, bHi, _editDist->score(), olapType, _editDist->Left_Delta_Len, _editDist->Left_Delta); display("NDalign::realignBackward()-- ", displayAlign); _bestResult.setErate(1.0 - (double)(_matches + _gapmatches) / (length() - _freegaps)); if (displayAlign) { fprintf(stderr, "%sNDalign::realignBackward()-- NEW length %u erate %f score %u (%d-%d %d-%d)\n", origScore, length(), erate(), score(), abgn(), aend(), bbgn(), bend()); } } else if (displayAlign) { fprintf(stderr, "NDalign::realignBackward()-- Alignment no better - OLD length %u erate %f score %u (%d-%d %d-%d)\n", length(), erate(), score(), abgn(), aend(), bbgn(), bend()); fprintf(stderr, "NDalign::realignBackward()-- Alignment no better - NEW length %u erate %f score %u (%d-%d %d-%d)\n", ((aHi - aLo) + (bHi - bLo) + _editDist->Left_Delta_Len) / 2, 0.0, _editDist->score(), aLo, aHi, bLo, bHi); //display("NDalign::realignBackward(NB)--", aLo, aHi, bLo, bHi, _editDist->Left_Delta, _editDist->Left_Delta_Len, true, false); } } void NDalign::display(char *prefix, int32 aLo, int32 aHi, int32 bLo, int32 bHi, int32 *delta, int32 deltaLen, bool displayIt, bool saveStats) { uint32 matches = 0; uint32 errors = 0; uint32 gapmatches = 0; uint32 
freegaps = 0; if (_topDisplay == NULL) { _topDisplay = new char [2 * AS_MAX_READLEN + 1]; _botDisplay = new char [2 * AS_MAX_READLEN + 1]; _resDisplay = new char [2 * AS_MAX_READLEN + 1]; } char *a = astr() + aLo; int32 a_len = aHi - aLo; char *b = bstr() + bLo; int32 b_len = bHi - bLo; int32 top_len = 0; { int32 i = 0; int32 j = 0; for (int32 k = 0; k < deltaLen; k++) { for (int32 m = 1; m < abs (delta[k]); m++) { _topDisplay[top_len++] = a[i++]; j++; } if (delta[k] < 0) { _topDisplay[top_len++] = '-'; j++; } else { _topDisplay[top_len++] = a[i++]; } } while (i < a_len && j < b_len) { _topDisplay[top_len++] = a[i++]; j++; } _topDisplay[top_len] = 0; } int32 bot_len = 0; { int32 i = 0; int32 j = 0; for (int32 k = 0; k < deltaLen; k++) { for (int32 m = 1; m < abs (delta[k]); m++) { _botDisplay[bot_len++] = b[j++]; i++; } if (delta[k] > 0) { _botDisplay[bot_len++] = '-'; i++; } else { _botDisplay[bot_len++] = b[j++]; } } while (j < b_len && i < a_len) { _botDisplay[bot_len++] = b[j++]; i++; } _botDisplay[bot_len] = 0; } // Compute stats, build the result display. { for (int32 i=0; (i < top_len) || (i < bot_len); i++) { char T = _topDisplay[i]; char B = _botDisplay[i]; char t = tolower(T); char b = tolower(B); // A minor-allele match if the lower case matches, but upper case doesn't if ((t == b) && (T != B)) { _resDisplay[i] = '\''; gapmatches++; } // A free-gap if either is lowercase else if (((t == T) && (b == '-')) || // T is lowercase, B is a gap ((b == B) && (t == '-'))) { // B is lowercase, T is a gap _resDisplay[i] = '-'; freegaps++; } // A match if the originals match else if ((t == b) && (T == B)) { _resDisplay[i] = ' '; matches++; } // Otherwise, an error else { _resDisplay[i] = '^'; errors++; } } _resDisplay[ max(top_len, bot_len) ] = 0; } // Really display it? 
if (displayIt == true) { for (int32 i=0; (i < top_len) || (i < bot_len); i += DISPLAY_WIDTH) { fprintf(stderr, "%s%d\n", prefix, i); fprintf(stderr, "%sA: ", prefix); for (int32 j=0; (j < DISPLAY_WIDTH) && (i+j < top_len); j++) putc(_topDisplay[i+j], stderr); fprintf(stderr, "\n"); fprintf(stderr, "%sB: ", prefix); for (int32 j=0; (j < DISPLAY_WIDTH) && (i+j < bot_len); j++) putc(_botDisplay[i+j], stderr); fprintf(stderr, "\n"); fprintf(stderr, "%s ", prefix); for (int32 j=0; (j #include #include using namespace std; class NDalignResult { public: NDalignResult() { clear(); _deltaMax = 0; _delta = NULL; }; ~NDalignResult() { delete [] _delta; }; // Accessors double score(void) { return(_olapScore); }; int32 deltaLen(void) { return(_deltaLen); }; int32 *delta(void) { return(_delta); }; // Setters void clear(void) { _aLo = 0; _aHi = 0; _bLo = 0; _bHi = 0; _olapLen = 0; _olapScore = DBL_MIN; _olapErate = 0; _olapType = pedBothBranch; _deltaLen = 0; }; void save(int32 aLo, int32 aHi, int32 bLo, int32 bHi, int32 score, pedOverlapType type, int32 dl, int32 *d) { _aLo = aLo; _aHi = aHi; _bLo = bLo; _bHi = bHi; _olapLen = ((aHi - aLo) + (bHi - bLo) + dl) / 2; _olapScore = score; _olapErate = DBL_MAX; _olapType = type; _deltaLen = dl; if (_deltaMax < _deltaLen) { delete [] _delta; _deltaMax = _deltaLen + 4096; _delta = new int32 [_deltaMax]; } memcpy(_delta, d, sizeof(int32) * _deltaLen); }; void setErate(double olapErate) { _olapErate = olapErate; }; int32 _aLo; int32 _aHi; int32 _bLo; int32 _bHi; int32 _olapLen; double _olapScore; double _olapErate; pedOverlapType _olapType; int32 _deltaMax; int32 _deltaLen; int32 *_delta; }; class NDalign { public: NDalign(pedAlignType alignType, double maxErate, int32 merSize); ~NDalign(); void initialize(uint32 aID, char *aStr, int32 aLen, int32 aLo, int32 aHi, uint32 bID, char *bStr, int32 bLen, int32 bLo, int32 bHi, bool bFlipped); // Algorithm bool findMinMaxDiagonal(int32 minLength, uint32 AbgnAdj=0, uint32 AendAdj=0, uint32 
BbgnAdj=0, uint32 BendAdj=0); bool findSeeds(bool dupIgnore); bool findHits(void); bool chainHits(void); bool makeNullHit(void); bool processHits(void); // Diagnostic bool scanDeltaForBadness(bool verbose=false, bool showAlign=false); void realignForward(bool verbose=false, bool displayAlign=false); void realignBackward(bool verbose=false, bool displayAlign=false); void display(char *prefix, int32 aLo, int32 aHi, int32 bLo, int32 bHi, int32 *delta, int32 deltaLen, bool displayIt = true, bool saveStats = false); void display(char *prefix, bool displayIt = true); // Result reporting char *astr(void) { return(_aStr); }; char *bstr(void) { return(_bStr); }; int32 abgn(void) { return(_bestResult._aLo); }; int32 aend(void) { return(_bestResult._aHi); }; int32 bbgn(void) { return((_bFlipped == false) ? (_bestResult._bLo) : (_bestResult._bLo)); }; int32 bend(void) { return((_bFlipped == false) ? (_bestResult._bHi) : (_bestResult._bHi)); }; int32 ahg5(void) { return( _bestResult._aLo); }; int32 ahg3(void) { return(_aLen - _bestResult._aHi); }; int32 bhg5(void) { return((_bFlipped == false) ? ( _bestResult._bLo) : (_bLen - _bestResult._bLo)); }; int32 bhg3(void) { return((_bFlipped == false) ? 
(_bLen - _bestResult._bHi) : ( _bestResult._bHi)); }; int32 length(void) { return(_bestResult._olapLen); }; double erate(void) { return(_bestResult._olapErate); }; int32 score(void) { return(_bestResult._olapScore); }; pedOverlapType type(void) { return(_bestResult._olapType); }; int32 deltaLen(void) { return(_bestResult.deltaLen()); }; int32 *delta(void) { return(_bestResult.delta()); }; private: class exactMatch { public: exactMatch(int32 a, int32 b, int32 l) { aBgn = a; bBgn = b; tLen = l; }; int32 aBgn; // Signed to allow for easy compute of diagonal int32 bBgn; int32 tLen; bool operator<(exactMatch const that) const { if (tLen == that.tLen) return(aBgn < that.aBgn); return(tLen > that.tLen); }; }; // Parameters of the alignment pedAlignType _alignType; double _maxErate; int32 _merSizeInitial; // From the parameters. uint32 _aID; char *_aStr; int32 _aLen; int32 _aLoOrig; int32 _aHiOrig; uint32 _bID; char *_bStr; int32 _bLen; int32 _bLoOrig; int32 _bHiOrig; // If flipped, we need to save the reverse complement somewhere bool _bFlipped; uint32 _bRevMax; char *_bRev; // Computed stuff, for each alignment. NDalgorithm *_editDist; int32 _minDiag; int32 _maxDiag; int32 _merSize; map<uint64, int32> _aMap; // Signed, to allow for easy compute of diagonal map<uint64, int32> _bMap; vector<exactMatch> _rawhits; vector<exactMatch> _hits; uint32 _hitr; // The result. NDalignResult _bestResult; // For displaying and stats.
char *_topDisplay; char *_botDisplay; char *_resDisplay; uint32 _matches; uint32 _errors; uint32 _gapmatches; uint32 _freegaps; // Constants uint64 acgtToBit[256]; uint64 acgtToVal[256]; uint64 merMask[33]; void fastFindMersA(bool dupIgnore); void fastFindMersB(bool dupIgnore); }; #endif // NDALIGN_H canu-1.6/src/utgcns/libNNalign/000077500000000000000000000000001314437614700164275ustar00rootroot00000000000000canu-1.6/src/utgcns/libNNalign/NNalgorithm.C000066400000000000000000000273231314437614700207640ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/alignment/AS_ALN_bruteforcedp.C * src/alignment/brute-force-dp.C * * Modifications by: * * Brian P. Walenz from 2014-DEC-23 to 2015-AUG-10 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "aligners.H" #include "gkStore.H" #include "AS_UTL_reverseComplement.H" #undef DEBUG #define STOP 0 // IMPORTANT; default is STOP #define MATCH 1 #define GAPA 2 #define GAPB 3 // 3, -2, -2 original // 5, -4, -3 suggested // 3, -6, -4 got around one problem, but made many more // #define MATCHSCORE 3 #define GAPSCORE -6 #define MISMATCHSCORE -4 #define SLOP 10 // Scores in the dynamic programming matrix are unsigned ints, currently 30 bits wide. // // The smallest score is 0. // The zero score is 2^29 (512 million) // The largest score is 2^30-1 (1 billion). // // Edges that we don't want to end up at are initialized to DP_NEGT (2^28, 256 million) which will // take an awful lot of alignment to make it decent, or to underflow. // // Edges that we want to end up at are set to DP_ZERO (2^29, 512 million). #define DP_NEGT (1 << 28) // A very negative value #define DP_ZERO (1 << 29) // Zero class dpActions { public: dpActions() { actions = new uint64 * [2 * AS_MAX_READLEN]; memset(actions, 0, 2 * AS_MAX_READLEN * sizeof(uint64 *)); }; ~dpActions() { for (uint32 i=0; i<2 * AS_MAX_READLEN; i++) if (actions[i] != NULL) delete [] actions[i]; delete [] actions; }; // Pointer to array of packed bits. 
uint64 **actions; uint32 *actionLength; void allocate(uint32 i) { if (actions[i] != NULL) return; uint32 len = 2 * AS_MAX_READLEN * 2 / 64 + 1; actions[i] = new uint64 [len]; memset(actions[i], 0, len * sizeof(uint64)); // IMPORTANT, init to STOP }; uint64 get(uint32 i, uint32 j) { allocate(i); uint32 addr = ((j >> 5) & 0x07ffffff); uint32 bits = ((j ) & 0x0000001f) << 1; uint64 v = (actions[i][addr] >> bits) & 0x00000003llu; //fprintf(stderr, "GET %u %u %lu (%u %u)\n", i, j, v, addr, bits); return(v); }; void set(uint32 i, uint32 j, uint64 v) { allocate(i); assert(v < 4); uint32 addr = ((j >> 5) & 0x07ffffff); uint32 bits = ((j ) & 0x0000001f) << 1; //fprintf(stderr, "SET %u %u to %lu (%u %u)\n", i, j, v, addr, bits); actions[i][addr] &= ~(0x00000003llu << bits); actions[i][addr] |= (v << bits); }; }; // // ahang, bhang represent any sequence to EXCLUDE from the alignment. There is a little bit of slop // in this exclusion. // // Setting both to the length of the sequence will try to find an alignment over all bases. The // highest scoring alignment is returned, which is likely not an alignment that uses all bases -- // the only requirement (assuming endToEnd is set) is that the alignment reaches the end of one // sequence. // void alignLinker(char *alignA, char *alignB, char *stringA, char *stringB, alignLinker_s *a, int endToEnd, int allowNs, int ahang, int bhang) { int32 lenA = strlen(stringA); int32 lenB = strlen(stringB); memset(a, 0, sizeof(alignLinker_s)); dpActions ACT; if ((lenA > AS_MAX_READLEN) || (lenB > AS_MAX_READLEN)) { fprintf(stderr, "alignLinker()-- Reads too long.
%d or %d > %d\n", lenA, lenB, AS_MAX_READLEN); return; } uint32 *lastCol = new uint32 [AS_MAX_READLEN * 2]; uint32 *thisCol = new uint32 [AS_MAX_READLEN * 2]; uint32 *iFinal = new uint32 [AS_MAX_READLEN * 2]; uint32 *jFinal = new uint32 [AS_MAX_READLEN * 2]; //memset(lastCol, 0, sizeof(uint32) * AS_MAX_READLEN * 2); //memset(thisCol, 0, sizeof(uint32) * AS_MAX_READLEN * 2); //memset(iFinal, 0, sizeof(uint32) * AS_MAX_READLEN * 2); //memset(jFinal, 0, sizeof(uint32) * AS_MAX_READLEN * 2); // Definition of the box we want to do dynamic programming in. int32 ibgn = 1; int32 iend = lenA; int32 jbgn = 1; int32 jend = lenB; // Set the edges. for (int32 i=ibgn-1; i<=iend+1; i++) { iFinal[i] = DP_ZERO; } for (int32 j=jbgn-1; j<=jend+1; j++) { lastCol[j] = DP_ZERO; thisCol[j] = DP_ZERO; jFinal[j] = DP_ZERO; } // Catch an invalid use case if ((endToEnd == true) && ((ahang != 0) || (bhang != 0))) { fprintf(stderr, "Invalid code path. I shouldn't be here. Please report to atg@jcvi.org, thanks.\n"); assert(0); } int32 scoreMax = 0; int32 endI=0, curI=0; int32 endJ=0, curJ=0; #ifdef DEBUG fprintf(stderr, "%d,%d - %d,%d -- ahang,bhang %d,%d alen,blen %d,%d\n", ibgn, jbgn, iend, jend, ahang, bhang, lenA, lenB); #endif for (int32 i=ibgn; i<=iend; i++) { for (int32 j=jbgn; j<=jend; j++) { // Pick the max of these // i-1 -> lastCol // i -> thisCol uint32 ul = lastCol[j-1] + ((stringA[i-1] == stringB[j-1]) ? MATCHSCORE : MISMATCHSCORE); uint32 lf = lastCol[j ] + GAPSCORE; // Gap in B uint32 up = thisCol[j-1] + GAPSCORE; // Gap in A // When computing unitig consensus, the N letter is used to indicate a gap. This might be a // match, or it might be a mismatch. We just don't know. Don't count it as anything. This // should be extended to allow ambiguitity codes -- but that then needs base calling support. 
      //
      if (allowNs)
        if ((stringA[i-1] == 'N') || (stringA[i-1] == 'n') ||
            (stringB[j-1] == 'N') || (stringB[j-1] == 'n')) {
          ul = lastCol[j-1] + MATCHSCORE;
        }

      // For unitig consensus, if the consensus sequence (stringA) is lowercase, count it as a
      // match, otherwise ignore the gap it induces in stringB.
      //
#warning stop complaining that this is ambiguous, please.
      if (stringA[i-1] == 'a')
        if (stringB[j-1] == 'A')
          ul = lastCol[j-1] + MATCHSCORE;
        else {
          ul = lastCol[j-1];
          lf = lastCol[j];
        }

      if (stringA[i-1] == 'c')
        if (stringB[j-1] == 'C')
          ul = lastCol[j-1] + MATCHSCORE;
        else {
          ul = lastCol[j-1];
          lf = lastCol[j];
        }

      if (stringA[i-1] == 'g')
        if (stringB[j-1] == 'G')
          ul = lastCol[j-1] + MATCHSCORE;
        else {
          ul = lastCol[j-1];
          lf = lastCol[j];
        }

      if (stringA[i-1] == 't')
        if (stringB[j-1] == 'T')
          ul = lastCol[j-1] + MATCHSCORE;
        else {
          ul = lastCol[j-1];
          lf = lastCol[j];
        }

      if (endToEnd) {
        // Set score to the smallest value possible; we will then ALWAYS pick an action below.
        //
        thisCol[j] = 0;
        ACT.set(i, j, MATCH);
        assert(ACT.get(i, j) == MATCH);

      } else {
        // (i,j) is the beginning of a subsequence, our default behavior.  If we're looking for
        // local alignments, make the alignment stop here.
        //
        thisCol[j] = DP_ZERO;
        ACT.set(i, j, STOP);
        assert(ACT.get(i, j) == STOP);
      }

      if (thisCol[j] < ul) {
        thisCol[j] = ul;
        ACT.set(i, j, MATCH);
        assert(ACT.get(i, j) == MATCH);
      }

      if (thisCol[j] < lf) {
        thisCol[j] = lf;
        ACT.set(i, j, GAPB);
        assert(ACT.get(i, j) == GAPB);
      }

      if (thisCol[j] < up) {
        thisCol[j] = up;
        ACT.set(i, j, GAPA);
        assert(ACT.get(i, j) == GAPA);
      }

      if (scoreMax < thisCol[j]) {
        scoreMax = thisCol[j];
        endI = curI = i;
        endJ = curJ = j;
      }

      if (j == jend)
        jFinal[i] = thisCol[j];

      if (i == iend)
        iFinal[j] = thisCol[j];  // Not really needed (just a copy of thisCol), but simplifies later code
    }  // Over j

    // Swap the columns

    uint32 *tc = thisCol;
    thisCol = lastCol;
    lastCol = tc;
  }  // Over i

  // If we're not looking for local alignments, scan the end points for the best value.  If we are
  // looking for local alignments, we've already found and remembered the best end point.
  //
  //   lastCol - [iend][]
  //   jFinal  - [][jend]
  //
  if (endToEnd) {
    //fprintf(stderr, "ENDTOEND curI %u curJ %u\n", curI, curJ);

    scoreMax = 0;
    endI = curI = 0;
    endJ = curJ = 0;

    for (int32 i=ibgn, j=jend; i<=iend; i++) {
      if (scoreMax < jFinal[i]) {
        scoreMax = jFinal[i];
        endI = curI = i;
        endJ = curJ = j;
        //fprintf(stderr, "IscoreMax = %d at %d,%d\n", scoreMax - DP_ZERO, i, j);
      }
    }

    for (int32 j=jbgn, i=iend; j<=jend; j++) {
      if (scoreMax < iFinal[j]) {
        scoreMax = iFinal[j];
        endI = curI = i;
        endJ = curJ = j;
        //fprintf(stderr, "JscoreMax = %d at %d,%d\n", scoreMax - DP_ZERO, i, j);
      }
    }

    // Not sure what this is for.
    //M[endI][endJ].score = 0;
  }

  //fprintf(stderr, "FINAL curI %u curJ %u\n", curI, curJ);

  int32 alignLen  = 0;
  int32 nGapA     = 0;
  int32 nGapB     = 0;
  int32 nMatch    = 0;
  int32 nMismatch = 0;
  int32 terminate = 0;

  while (terminate == 0) {
    uint32 act = ACT.get(curI, curJ);

    //fprintf(stderr, "ACTION %2u curI %u curJ %u\n", act, curI, curJ);

    switch (act) {
      case STOP:
        terminate = 1;
        break;

      case MATCH:
        alignA[alignLen] = stringA[curI-1];
        alignB[alignLen] = stringB[curJ-1];

        if (alignA[alignLen] == alignB[alignLen]) {
          alignA[alignLen] = tolower(alignA[alignLen]);
          alignB[alignLen] = tolower(alignB[alignLen]);
          nMatch++;
        } else {
          alignA[alignLen] = toupper(alignA[alignLen]);
          alignB[alignLen] = toupper(alignB[alignLen]);
          nMismatch++;
        }

        curI--;
        curJ--;
        alignLen++;
        break;

      case GAPA:
        alignA[alignLen] = '-';
        alignB[alignLen] = stringB[curJ-1];
        curJ--;
        alignLen++;
        nGapA++;
        break;

      case GAPB:
        alignA[alignLen] = stringA[curI-1];
        alignB[alignLen] = '-';
        curI--;
        alignLen++;
        nGapB++;
        break;
    }
  }

  alignA[alignLen] = 0;
  alignB[alignLen] = 0;

  reverse(alignA, alignB, alignLen);

  a->matches    = nMatch;
  a->alignLen   = alignLen;
  a->begI       = curI;
  a->begJ       = curJ;
  a->endI       = endI;
  a->endJ       = endJ;
  a->lenA       = lenA;
  a->lenB       = lenB;
  a->pIdentity  = (double)(nMatch) / (double)(nGapA + nGapB + nMatch + nMismatch);
a->pCoverageA = (double)(a->endI - a->begI) / (double)(lenA); a->pCoverageB = (double)(a->endJ - a->begJ) / (double)(lenB); delete [] lastCol; delete [] thisCol; delete [] iFinal; delete [] jFinal; } canu-1.6/src/utgcns/libNNalign/NNalign.C000066400000000000000000000144441314437614700200700ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/alignment/AS_ALN_forcns.C * src/alignment/alignment-drivers.C * * Modifications by: * * Brian P. Walenz from 2014-DEC-23 to 2015-AUG-10 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "aligners.H" #include "gkStore.H" // For AS_MAX_READLEN #include "AS_UTL_reverseComplement.H" class Optimal_Overlap_Data { public: char h_alignA[AS_MAX_READLEN + AS_MAX_READLEN + 2]; char h_alignB[AS_MAX_READLEN + AS_MAX_READLEN + 2]; int h_trace[AS_MAX_READLEN + AS_MAX_READLEN + 2]; ALNoverlap o; }; ALNoverlap * Optimal_Overlap_AS_forCNS(char *a, char *b, int begUNUSED, int endUNUSED, int ahang, int bhang, int opposite, double erate, double thresh, int minlen, CompareOptions what) { static Optimal_Overlap_Data *ood = NULL; if (ood == NULL) ood = new Optimal_Overlap_Data; memset(ood->h_alignA, 0, sizeof(char) * (AS_MAX_READLEN + AS_MAX_READLEN + 2)); memset(ood->h_alignB, 0, sizeof(char) * (AS_MAX_READLEN + AS_MAX_READLEN + 2)); memset(ood->h_trace, 0, sizeof(int) * (AS_MAX_READLEN + AS_MAX_READLEN + 2)); alignLinker_s al; //if (VERBOSE_MULTIALIGN_OUTPUT >= 3) // fprintf(stderr, "Optimal_Overlap_AS_forCNS()-- Begins\n"); #if 0 if (erate > AS_MAX_ERROR_RATE) { //fprintf(stderr, "Optimal_Overlap_AS_forCNS()-- erate=%f >= AS_MAX_ERROR_RATE=%f, reset to max\n", erate, (double)AS_MAX_ERROR_RATE); erate = AS_MAX_ERROR_RATE; } assert((0.0 <= erate) && (erate <= AS_MAX_ERROR_RATE)); #endif if (opposite) reverseComplementSequence(b, strlen(b)); //if (VERBOSE_MULTIALIGN_OUTPUT >= 3) { // fprintf(stderr, "ALIGN %s\n", a); // fprintf(stderr, "ALIGN %s\n", b); //} alignLinker(ood->h_alignA, ood->h_alignB, a, b, &al, true, // Looking for global end-to-end alignments false, // Count matches to N as matches ahang, bhang); if (al.alignLen == 0) { return NULL; } //if (VERBOSE_MULTIALIGN_OUTPUT >= 3) { // fprintf(stderr, "ALIGN %d %d-%d %d-%d opposite=%d\n", al.alignLen, al.begI, al.endI, al.begJ, al.endJ, opposite); // fprintf(stderr, "ALIGN '%s'\n", ood->h_alignA); // fprintf(stderr, "ALIGN '%s'\n", ood->h_alignB); //} if (opposite) { reverseComplementSequence(b, strlen(b)); reverseComplementSequence(ood->h_alignA, al.alignLen); reverseComplementSequence(ood->h_alignB, 
al.alignLen);

    int x = al.begJ;

    al.begJ = al.lenB - al.endJ;
    al.endJ = al.lenB - x;
  }

  // We don't expect partial overlaps here.  At least one fragment
  // must have an alignment to the very start.
  //
  // ECR depends on this return value; it is allowed to fail
  // when building a new unitig multialign.  For example:
  //
  //   <-----------------------
  //                     ------>
  //
  // When ECR tries to extend the second fragment, it checks that
  // the extended fragment overlaps the next contig.  It does not
  // check that the extended bits agree with the first fragment,
  // leaving that up to "does the unitig rebuild".
  //
  if ((al.begJ != 0) && (al.begI != 0))
    return(NULL);

  ood->o.begpos = (al.begI > 0)           ?  (al.begI)           : -(al.begJ);
  ood->o.endpos = (al.lenB - al.endJ > 0) ?  (al.lenB - al.endJ) : -(al.lenA - al.endI);
  ood->o.length = al.alignLen;
  ood->o.diffs  = 0;
  ood->o.comp   = opposite;
  ood->o.trace  = ood->h_trace;

  {
    int x  = 0;
    int tp = 0;
    int ap = al.begI;
    int bp = al.begJ;

    for (x=0; x<al.alignLen; x++) {
      if (ood->h_alignA[x] == '-') {
        ood->h_trace[tp++] = -(ap + 1);
        ap--;
      }

      if (ood->h_alignB[x] == '-') {
        ood->h_trace[tp++] = bp + 1;
        bp--;
      }

      // Count the differences.
      //
      // But allow N's and lowercase as matches.  If either letter is N, then the other letter is
      // NOT N (if both letters were N, both would be lowercase n, representing a match).  This
      // just subtracts out the diff we added in above.
      //
      bool diff   = false;
      bool ignore = false;

      if (toupper(ood->h_alignA[x]) != toupper(ood->h_alignB[x]))
        diff = true;

      if ((ood->h_alignA[x] == 'N') || (ood->h_alignA[x] == 'n') ||
          (ood->h_alignB[x] == 'N') || (ood->h_alignB[x] == 'n'))
        ignore = true;

      if (islower(ood->h_alignA[x]) && (ood->h_alignB[x] == '-'))
        ignore = true;

      if ((diff == true) && (ignore == false))
        ood->o.diffs++;

      bp++;
      ap++;
    }

    ood->h_trace[tp] = 0;

    //if (VERBOSE_MULTIALIGN_OUTPUT >= 4) {
    //  fprintf(stderr, "trace");
    //  for (x=0; x<tp; x++)
    //    fprintf(stderr, " %d", ood->h_trace[x]);
    //  fprintf(stderr, "\n");
    //  fprintf(stderr, "A: %4d-%4d %4d %s\n", al.begI, al.endI, al.lenA, ood->h_alignA);
    //  fprintf(stderr, "B: %4d-%4d %4d %s\n", al.begJ, al.endJ, al.lenB, ood->h_alignB);
    //}
  }

  //if (VERBOSE_MULTIALIGN_OUTPUT >= 3) {
  //  fprintf(stderr, "ERATE: diffs=%d / length=%d = %f\n", ood->o.diffs, ood->o.length, (double)ood->o.diffs / ood->o.length);
  //  fprintf(stderr, "Optimal_Overlap_AS_forCNS()-- Ends\n");
  //}

  if ((double)ood->o.diffs / ood->o.length <= erate)
    return(&ood->o);

  return(NULL);
}
canu-1.6/src/utgcns/libNNalign/NNalign.H000066400000000000000000000034261314437614700200730ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  This file is derived from:
 *
 *    src/alignment/AS_ALN_aligners.H
 *    src/alignment/aligners.H
 *
 *  Modifications by:
 *
 *    Brian P.
Walenz from 2014-DEC-23 to 2015-AUG-10 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2016-JAN-11 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef ALIGNERS_H #define ALIGNERS_H #include "AS_global.H" // // BPW's brute-force dynamic programming // typedef struct { int32 matches; int32 alignLen; int32 begI, begJ; int32 endI, endJ; int32 lenA, lenB; double pIdentity; double pCoverageA; double pCoverageB; } alignLinker_s; void alignLinker(char *alignA, char *alignB, char *stringA, char *stringB, alignLinker_s *a, int endToEnd, int allowNs, int ahang, int bhang); #endif canu-1.6/src/utgcns/libcns/000077500000000000000000000000001314437614700156645ustar00rootroot00000000000000canu-1.6/src/utgcns/libcns/NOTES000066400000000000000000000011521314437614700164760ustar00rootroot00000000000000 BeadLinks (prev/this/next) M=max #0 <--> 0/0/0 <--> 0/0/0 <--> #1 <\ M/1/1 <--> 1/1/1 <--> #2 <\\> 1/2/M /> 3/2/2 <--> #3 \> 2/3/2 0; pp++) { _bases[pp] = inv[ seq[ii] ]; _quals[pp] = qlt[ii]; } _bases[_length] = 0; // NUL terminate the strings so we can use them in aligners. _quals[_length] = 0; // Not actually a string, the 0 is just another QV=0 entry. }; void abAbacus::addRead(gkStore *gkpStore, uint32 readID, uint32 askip, uint32 bskip, bool complemented, map *inPackageRead, map *inPackageReadData) { // Grab the read. If there is no package, load the read from the store. Otherwise, load the // read from the package. This REQUIRES that the package be in-sync with the unitig. We fail // otherwise. Hey, it's used for debugging only... 
gkRead *read = NULL; gkReadData *readData = NULL; if (inPackageRead == NULL) { read = gkpStore->gkStore_getRead(readID); readData = new gkReadData; gkpStore->gkStore_loadReadData(read, readData); } else { read = (*inPackageRead)[readID]; readData = (*inPackageReadData)[readID]; } assert(read != NULL); assert(readData != NULL); // Grab seq/qlt from the read, offset to the proper begin and length. uint32 seqLen = read->gkRead_sequenceLength() - askip - bskip; char *seq = readData->gkReadData_getSequence() + ((complemented == false) ? askip : bskip); char *qlt = readData->gkReadData_getQualities() + ((complemented == false) ? askip : bskip); // Tell abacus about it. We could pre-allocate _sequences (in the constructor) but this is // relatively painless and makes life easier outside here. increaseArray(_sequences, _sequencesLen, _sequencesMax, 1); _sequences[_sequencesLen++] = new abSequence(readID, seqLen, seq, qlt, complemented); delete readData; } canu-1.6/src/utgcns/libcns/abAbacus-appendBases.C000066400000000000000000000066371314437614700217300ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz beginning on 2015-NOV-23 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "abAbacus.H" // Just appends bases to the frankenstein sequence. // Sets read abacus pointers assuming gapless alignments. void abAbacus::appendBases(uint32 bid, uint32 bgn, uint32 end) { uint32 alen = numberOfColumns(); // Set up the b read abSequence *bseq = getSequence(bid); uint32 blen = bseq->length(); // Find the first and last columns the read 'aligns' to. If the read isn't contained, the last // column is the last column in the frankenstein. We'll append new columns after this one. // // Space-based, so the end element is the (n-1)th element. // abColumn *fc = getColumn(bgn); abColumn *lc = (end <= alen) ? getColumn(end-1) : getLastColumn(); uint16 fl = UINT16_MAX; uint16 ll = UINT16_MAX; // To get positions in output, we need to have both the first and last beads // placed in the multialign. The first bead is always aligned, but the last bead // is aligned only if it is contained. fl = fc->alignBead(UINT16_MAX, bseq->getBase(0), bseq->getQual(0)); if (end <= alen) ll = lc->alignBead(UINT16_MAX, bseq->getBase(blen-1), bseq->getQual(blen-1)); // If not contained, push on bases, and update the consensus base. This is all _very_ rough. // The unitig-supplied coordinates aren't guaranteed to contain 'blen' bases. We make the // obvious guess as to where the frankenstein ends in the read, and push on bases from there till // the end. // // The obvious guess is to add exactly enough bases to make the new frankenstein end at // the expected length. // // alen // v blen // -------------------- v // -------------------------- // ^ ^ ^ // bgn bpos end !! end-bgn != blen !! // else for (uint32 bpos=blen - (end - alen); bposinsertAtEnd(lc, UINT16_MAX, bseq->getBase(bpos), bseq->getQual(bpos)); lc = nc; //baseCallMajority(lc); } // Now set the first/last links. 
beadID f(fc, fl); beadID l(lc, ll); readTofBead[bid] = f; fbeadToRead[f] = bid; readTolBead[bid] = l; lbeadToRead[l] = bid; // If we did this correctly, then the first/last column indices should agree with the read placement. //assert(bgn == getBead(bfrag->firstBead())->colIdx().get()); //assert(end-1 == getBead(bfrag->lastBead())->colIdx().get()); } canu-1.6/src/utgcns/libcns/abAbacus-applyAlignment.C000066400000000000000000000555051314437614700224650ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_CNS/ApplyAlignment.C * src/AS_CNS/ApplyAlignment.c * src/AS_CNS/MultiAlignment_CNS.c * src/utgcns/libcns/ApplyAlignment.C * * Modifications by: * * Michael Schatz on 2004-SEP-23 * are Copyright 2004 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2005-MAR-22 * are Copyright 2005 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Eli Venter from 2005-MAR-30 to 2008-FEB-13 * are Copyright 2005-2006,2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Gennady Denisov from 2005-MAY-09 to 2008-JUN-06 * are Copyright 2005-2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005-2011,2013 J. 
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Aaron Halpern from 2005-SEP-29 to 2006-OCT-03 * are Copyright 2005-2006 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2008-FEB-27 to 2009-MAY-14 * are Copyright 2008-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-NOV-17 to 2015-JUL-28 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-13 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "abAbacus.H" #undef DEBUG_INFER #undef DEBUG_ABACUS_ALIGN // Primary debug output, shows progress of the algorithm void abColumn::allocateInitialBeads(void) { // Allocate beads. We'll need no more than the max of either the prev or the next. Any read that we // interrupt gets a new gap bead. Any read that has just ended gets nothing. And, +1 for the read // we might be adding to the multialign. uint32 pmax = (_prevColumn != NULL) ? (_prevColumn->depth() + 1) : (4); uint32 nmax = (_nextColumn != NULL) ? (_nextColumn->depth() + 1) : (4); _beadsMax = MAX(pmax, nmax); _beadsLen = 0; _beads = new abBead [_beadsMax]; for (uint32 ii=0; ii<_beadsMax; ii++) // Probably done by the constructor. _beads[ii].clear(); } void abColumn::inferPrevNextBeadPointers(void) { #ifdef DEBUG_INFER fprintf(stderr, "infer between columns %d and %d\n", _prevColumn->position(), _nextColumn->position()); #endif // Scan the next/prev lists, adding a gap bead for any read that is spanned. // // prev/this/next prev/this/next // 0 0 0 <-----> 0 0 0 // A spanned read. 
// 1 1 MAX /--> 2 1 1 // The first read ended // 2 2 1 <-/ /-> 3 2 2 // Now columns are different // 3 3 3 <--/ assert(_prevColumn != NULL); assert(_nextColumn != NULL); // At startup, the links in this column are all invalid, and the base hasn't been added. assert(_prevColumn->_beadsLen <= _prevColumn->_beadsMax); assert( _beadsLen == 0); assert(_nextColumn->_beadsLen <= _nextColumn->_beadsMax); uint32 pmax = (_prevColumn == NULL) ? 0 : _prevColumn->_beadsLen; uint32 nmax = (_nextColumn == NULL) ? 0 : _nextColumn->_beadsLen; uint32 ppos = 0; // Position in the prev's nextOffset list uint32 tpos = 0; // Position in our offset list uint32 npos = 0; // Position in the next's prevOffset list // Add gaps and link into any existing reads. while (1) { // Find reads to work on. while ((ppos < pmax) && // If the previous position is less than the depth, and (_prevColumn->_beads[ppos].nextOffset() == UINT16_MAX)) // the previous pointer is not valid, ppos++; // move ahead to the next read. while ((npos < nmax) && (_nextColumn->_beads[npos].prevOffset() == UINT16_MAX)) npos++; // If either column ran out of entries, we're done. No more reads will span this new column. if ((ppos >= _prevColumn->depth()) || (npos >= _nextColumn->depth())) break; // The columns must agree. uint16 no = _prevColumn->_beads[ppos].nextOffset(); uint16 po = _nextColumn->_beads[npos].prevOffset(); if ((no != npos) || (po != ppos)) { fprintf(stderr, "ERROR: link mismatch.\n"); fprintf(stderr, "ERROR: prev %d %d -- next %d %d\n", _prevColumn->_beads[ppos].prevOffset(), _prevColumn->_beads[ppos].nextOffset(), _nextColumn->_beads[npos].prevOffset(), _nextColumn->_beads[npos].nextOffset()); } assert(no == npos); // The 'next-bead' that the prev is pointing to must be npos. assert(po == ppos); // The 'prev-bead' that the next is pointing to must be ppos. // Reset the prev/next offsets to point to us. 
    _prevColumn->_beads[ppos]._nextOffset = tpos;
    _nextColumn->_beads[npos]._prevOffset = tpos;

    // Set our offsets to point to them.

    _beads[tpos]._prevOffset = po;
    _beads[tpos]._nextOffset = no;

    _beadsLen = tpos + 1;  // Needed for showLink()

#ifdef DEBUG_INFER
    fprintf(stderr, "inferPrevNextBeadPointers()\n");
    showLinks(ppos, tpos, npos);
#endif

    // And move forward to the next read in all columns.

    ppos++;  assert(ppos <= _prevColumn->_beadsLen);
    tpos++;  assert(tpos <= _beadsMax);
    npos++;  assert(npos <= _nextColumn->_beadsLen);
  }

  // Set the new length of the bead list.

  assert(_beadsLen == tpos);

  _beadsLen = tpos;

#ifdef DEBUG_INFER
  fprintf(stderr, "inferPrevNextBeadPointers() final\n");
  showLinks();
#endif
}


// Add a column, seeded with some base/qual assignment.  The assignment cannot be a gap (because
// all other reads in this column are set to gap).
//
// Returns the bead index for the read that was added.
//
// Can insert the new column before or after some specified column.  In both cases, we need to
// know the read index for the read in that column that we're adding a base to.
//
// Common use cases are to:
//   add a new column for a base aligned to a gap in the multialign
//     - all existing reads get a gap
//     - this read is assigned a base
//
//   add a new column before/after the multialign
//     - this read is the only base in this column
//
// In both cases, we need to have and return linking information to the existing read.
//
// One odd ball case occurs when the read isn't in the multialign at all; when we're adding
// the very first base.  This is handled by setting the link to max.
//
// Insert a new empty column, seeded with only the base/qual supplied, to the start of the multialign.
// This grows the multialign to add new bases to the start: // [original-multialign] // 1[original-multialign] // 12[original-multialign] // 123[original-multialign] // 1234[original-multialign] // uint16 abColumn::insertAtBegin(abColumn *first, uint16 prevLink, char base, uint8 qual) { // The base CAN NOT be a gap - the new column would then be entirely a gap column, with no base. assert(base != '-'); assert(base != 0); _columnPosition = INT32_MAX; _call = base; _qual = qual; _prevColumn = first->_prevColumn; _nextColumn = first; first->_prevColumn = this; if (_prevColumn) _prevColumn->_nextColumn = this; allocateInitialBeads(); _beads[0]._unused = 0; _beads[0]._isRead = 1; _beads[0]._isUnitig = 0; _beads[0]._base = base; _beads[0]._qual = qual; _beads[0]._prevOffset = prevLink; _beads[0]._nextOffset = UINT16_MAX; // No next offset for the next base. _beadsLen++; assert(_beadsLen <= _beadsMax); if (_prevColumn) _prevColumn->_beads[prevLink]._nextOffset = 0; #ifdef BASECOUNT baseCountIncr(base); #endif // With only the one read, we don't need to do any base calling here. #ifdef DEBUG_ABACUS_ALIGN fprintf(stderr, "insertAtBegin()-- column prev=%p next=%p link prev=%d this=%d next=%d\n", _prevColumn, this, _nextColumn, _beads[0]._prevOffset, _beads[0]._nextOffset); #endif return(0); } // Insert a new empty column, seeded with only the base/qual supplied, to the end of the multialign. // The growth is simpler than insertAtBegin() because the end is always NULL. // [original-multialign] // [original-multialign]7 // [original-multialign]78 // [original-multialign]789 // uint16 abColumn::insertAtEnd(abColumn *prev, uint16 prevLink, char base, uint8 qual) { assert(base != '-'); // The base CAN NOT be a gap - the new column would then be entirely a gap column, with no base. 
assert(base != 0); _columnPosition = INT32_MAX; _call = base; _qual = qual; _prevColumn = prev; _nextColumn = NULL; if (prev == NULL) assert(prevLink == UINT16_MAX); if (prev != NULL) assert(prevLink != UINT16_MAX); if (prev) prev->_nextColumn = this; allocateInitialBeads(); _beads[0]._unused = 0; _beads[0]._isRead = 1; _beads[0]._isUnitig = 0; _beads[0]._base = base; _beads[0]._qual = qual; _beads[0]._prevOffset = prevLink; _beads[0]._nextOffset = UINT16_MAX; _beadsLen++; assert(_beadsLen <= _beadsMax); if (prev) prev->_beads[prevLink]._nextOffset = 0; #ifdef BASECOUNT baseCountIncr(base); #endif // With only the one read, we don't need to do any base calling here. #ifdef DEBUG_ABACUS_ALIGN fprintf(stderr, "insertAtEnd()-- column prev=%p this=%p next=%p link prev=%d next=%d\n", _prevColumn, this, _nextColumn, _beads[0]._prevOffset, _beads[0]._nextOffset); #endif return(0); } // Insert a column in the middle of the multialign, after some column. uint16 abColumn::insertAfter(abColumn *prev, // Add new column after 'prev' uint16 prevLink, // The bead for this read in 'prev' is at 'prevLink'. char base, uint8 qual) { // The base CAN NOT be a gap - the new column would then be entirely a gap column, with no base. assert(base != '-'); assert(base != 0); // Make sure that we're actually in the middle of consensus. assert(prev != NULL); assert(prev->_nextColumn != NULL); // Link in the new column (us!) _prevColumn = prev; _nextColumn = prev->_nextColumn; _prevColumn->_nextColumn = this; _nextColumn->_prevColumn = this; // Allocate space for beads in this column (based on _prevColumn and _nextColumn) allocateInitialBeads(); // Add gaps for the existing reads. This is quite complicated, so stashed away in a closet where we won't see it. inferPrevNextBeadPointers(); // Set up the new consensus base. _columnPosition = INT32_MAX; _call = base; _qual = qual; // Then add the base for this read. 
allocateInitialBeads() guarantees space for at least one // more bead, infer() filled in the rest. uint16 tpos = _beadsLen++; assert(_beadsLen <= _beadsMax); _beads[tpos]._unused = 0; _beads[tpos]._isRead = 1; _beads[tpos]._isUnitig = 0; _beads[tpos]._base = base; _beads[tpos]._qual = qual; _beads[tpos]._prevOffset = prevLink; _beads[tpos]._nextOffset = UINT16_MAX; // Don't forget to update the link to us! (I forgot.) if (prevLink != UINT16_MAX) _prevColumn->_beads[prevLink]._nextOffset = tpos; #ifdef BASECOUNT baseCountIncr(base); #endif baseCall(false); // We need to recall the base, using the majority vote. #ifdef DEBUG_ABACUS_ALIGN fprintf(stderr, "insertAfter()-- column prev=%d this=%p next=%d link %d prev=%d next=%d\n", _prevColumn->position(), this, _nextColumn->position(), tpos, _beads[tpos]._prevOffset, _beads[tpos]._nextOffset); #endif #ifdef DEBUG_INFER showLinks(); #endif #ifdef CHECK_LINKS checkLinks(); #endif return(tpos); } uint16 abColumn::alignBead(uint16 prevIndex, char base, uint8 qual) { // First, make sure the column has enough space for the new read. increaseArray(_beads, _beadsLen, _beadsMax, 1); // Set up the new bead. uint16 tpos = _beadsLen++; assert(_beadsLen <= _beadsMax); _beads[tpos].initialize(base, qual, prevIndex, UINT16_MAX); // Link to the previous if ((_prevColumn) && (prevIndex != UINT16_MAX)) { assert(prevIndex < _prevColumn->_beadsLen); assert(prevIndex < _prevColumn->_beadsMax); _prevColumn->_beads[prevIndex]._nextOffset = tpos; } // increment the base count too! #ifdef BASECOUNT baseCountIncr(base); #endif baseCall(false); // We need to recall the base, using the majority vote. #ifdef CHECK_LINKS checkLinks(); #endif // Return the index of the bead we just added (for future linking) return(tpos); }; // Add sequence 'bid' to the multialign using ahang,bhang for positioning, and trace for the alignment. // // ahang can be negative, zero or positive. // // If negative, it will be equal in magnitude to bhang. 
// // If positive, bhang will be zero, and this is the amount of sequence in frank we should ignore. // So, we set apos to be this positive value. void abAbacus::applyAlignment(uint32 bid, int32 ahang, int32 UNUSED(bhang), int32 *trace, uint32 traceLen) { #ifdef DEBUG_ABACUS_ALIGN fprintf(stderr, "abAbacus::applyAlignment()-- ahang=%d traceLen=%u trace=%d %d %d %d %d ... %d %d %d %d %d\n", ahang, traceLen, (traceLen > 0) ? trace[0] : 0, (traceLen > 1) ? trace[1] : 0, (traceLen > 2) ? trace[2] : 0, (traceLen > 3) ? trace[3] : 0, (traceLen > 4) ? trace[4] : 0, (traceLen > 4) ? trace[traceLen-5] : 0, (traceLen > 3) ? trace[traceLen-4] : 0, (traceLen > 2) ? trace[traceLen-3] : 0, (traceLen > 1) ? trace[traceLen-2] : 0, (traceLen > 0) ? trace[traceLen-1] : 0); #endif // Finish some initialization. If this is the first call, readTofBead (and readTolBead) // will be NULL, and we need to allocate space for them. if (readTofBead == NULL) { readTofBead = new beadID [numberOfSequences()]; readTolBead = new beadID [numberOfSequences()]; } // Figure out where we are in the multialignment. abSequence *bseq = getSequence(bid); int32 alen = _columnsLen; // Actual length of the consensus sequence BEFORE any new bases are added. int32 blen = bseq->length(); // Actual length of the read we're adding. // We used to check that alen and blen were both positive, but we now allow applyAlignment() of // the first read (to an empty multialignment) to initialize the structure. int32 apos = MAX(ahang, 0); // if apos == alen, we'd just be pasting on new sequence. int32 bpos = 0; // if bpos == blen...we're pasting on one base? assert(apos <= alen); assert(0 <= bpos); // We tried letting bpos be set to non-zero, to ignore bases at the start of the read, assert(bpos <= blen); // but it doesn't work. abColumn *ncolumn = _columns[apos]; // The next empty column (where we will add the next aligned base). abColumn *pcolumn = NULL; // The previous column (that we just added a base to). 
uint16 plink = UINT16_MAX; // The read index of the read we just added to the previous column. beadID fBead; beadID lBead; // Negative ahang? Push these things onto the start of the multialign. // // We should fail if we get a negative ahang, but there is a column already before the first one // (which implies we aligned to something not full-length, possibly because we trimmed // frankenstein wrong).....but we don't even check. for (; bpos < -ahang; bpos++) { abColumn *newcol = new abColumn; plink = newcol->insertAtBegin(ncolumn, plink, bseq->getBase(bpos), bseq->getQual(bpos)); fBead.setF(newcol, plink); lBead.setL(newcol, plink); } if (ncolumn) pcolumn = ncolumn->_prevColumn; // Skip any positive traces at the start. These are template sequence (A-read) aligned to gaps // in the B-read, but we don't care because there isn't any B-read aligned yet. // // We probably should check that we didn't just process negative ahang above. for (; ((traceLen > 0) && (*trace != 0) && (*trace == 1)); trace++) { ncolumn = ncolumn->_nextColumn; apos++; } // Similarly, remove negative traces at the end. These are read sequence (B-read) aligned to gaps // in the A-read (template). The A-read extent should be reduced by one for each gap. while ((traceLen > 0) && (trace[traceLen-1] > blen)) { trace[--traceLen] = 0; } // Process the trace. while ((traceLen > 0) && (*trace != 0)) { // Gap is in afrag. Align ( - *trace - apos ) positions, then insert a new column, before // ncolumn, to accommodate the insert in B. ncolumn remains unchanged; it's still the column // that we want to place the next aligned base. 
if ( *trace < 0 ) { while (apos < (- *trace - 1)) { #ifdef DEBUG_ABACUS_ALIGN fprintf(stderr, "applyAlignment()-- align base %6d/%6d '%c' to column %7d\n", bpos, blen, bseq->getBase(bpos), ncolumn->position()); #endif plink = ncolumn->alignBead(plink, bseq->getBase(bpos), bseq->getQual(bpos)); fBead.setF(ncolumn, plink); lBead.setL(ncolumn, plink); pcolumn = ncolumn; // ...updating the previous column ncolumn = ncolumn->next(); // bpos++; // Move to the next base in the read. apos++; // Move to the next column } assert(apos < alen); assert(bpos < blen); // Three types of insert: // before the multialign (done above in the initialization) - an entirely new column, no gaps to add // after the multialign (done at the end) - same, entirely new column // insertion to the multialign - in the middle, complicated // // should it be insertAfter() to add a column before the exiting one and // appendColumn() to add a column after? appendColumm() is a special case of // tacking on new sequence to the multialign. // Add a new column for this insertion. abColumn *newcol = new abColumn; #ifdef DEBUG_ABACUS_ALIGN fprintf(stderr, "applyAlignment()-- align base %6d/%6d '%c' to after column %7d (new column)\n", bpos, blen, bseq->getBase(bpos), ncolumn->position()); #endif plink = newcol->insertAfter(pcolumn, plink, bseq->getBase(bpos), bseq->getQual(bpos)); fBead.setF(newcol, plink); lBead.setL(newcol, plink); pcolumn = newcol; bpos++; // Move to the next base in the read. } // Gap is in bfrag. Align ( *trace - bpos ) positions, then insert a gap bead into the read, // and align it to the existing column. 
if (*trace > 0) { while ( bpos < (*trace - 1) ) { #ifdef DEBUG_ABACUS_ALIGN fprintf(stderr, "applyAlignment()-- align base %6d/%6d '%c' to column %7d\n", bpos, blen, bseq->getBase(bpos), ncolumn->position()); #endif plink = ncolumn->alignBead(plink, bseq->getBase(bpos), bseq->getQual(bpos)); fBead.setF(ncolumn, plink); lBead.setL(ncolumn, plink); pcolumn = ncolumn; // ...updating the previous column ncolumn = ncolumn->next(); // bpos++; // Move to the next base in the read. apos++; // Move to the next column } assert(apos < alen); assert(bpos < blen); #ifdef DEBUG_ABACUS_ALIGN fprintf(stderr, "applyAlignment()-- align base %6d/%6d '-' to column %7d (gap in read)\n", bpos, blen, ncolumn->position()); #endif plink = ncolumn->alignBead(plink, '-', 0); fBead.setF(ncolumn, plink); lBead.setL(ncolumn, plink); pcolumn = ncolumn; ncolumn = ncolumn->next(); apos++; } trace++; } // Remaining alignment contains no indels, just slap in the bases. Note that when there is // no consensus sequence (this is the first read added) this loop does nothing; alen=apos=0. for (int32 rem = MIN(blen - bpos, alen - apos); rem > 0; rem--) { #ifdef DEBUG_ABACUS_ALIGN fprintf(stderr, "applyAlignment()-- align base %6d/%6d '%c' to column %7d (end of read)\n", bpos, blen, bseq->getBase(bpos), ncolumn->position()); #endif plink = ncolumn->alignBead(plink, bseq->getBase(bpos), bseq->getQual(bpos)); fBead.setF(ncolumn, plink); lBead.setL(ncolumn, plink); pcolumn = ncolumn; ncolumn = ncolumn->next(); apos++; bpos++; } // Finally, append any new (unaligned) sequence from the read. This handles the special case when there // are no existing columns (pcolumn == NULL). for (int32 rem=blen-bpos; rem > 0; rem--) { assert(ncolumn == NULL); // Can't be a column after where we're tring to append to! 
abColumn *newcol = new abColumn; #ifdef DEBUG_ABACUS_ALIGN fprintf(stderr, "applyAlignment()-- align base %6d/%6d '%c' to extend consensus\n", bpos, blen, bseq->getBase(bpos)); #endif plink = newcol->insertAtEnd(pcolumn, plink, bseq->getBase(bpos), bseq->getQual(bpos)); fBead.setF(newcol, plink); lBead.setL(newcol, plink); pcolumn = newcol; bpos++; #ifdef CHECK_LINKS newcol->checkLinks(); #endif } // Insert the first and last beads into our tracking maps. assert(fBead.column->_beads[fBead.link].prevOffset() == UINT16_MAX); assert(lBead.column->_beads[lBead.link].nextOffset() == UINT16_MAX); fbeadToRead[fBead] = bid; readTofBead[bid] = fBead; lbeadToRead[lBead] = bid; readTolBead[bid] = lBead; // Update the firstColumn in the abAbacus if it isn't set. updateColumns() will // reset it if the actual first column has changed here. if (_firstColumn == NULL) _firstColumn = fBead.column; // Finally, recall bases (not needed; done inline when bases are aligned) and refresh the column/cnsBases/cnsQuals lists. //recallBases(false); refreshColumns(); } canu-1.6/src/utgcns/libcns/abAbacus-baseCall.C000066400000000000000000000324211314437614700211770ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * This file is derived from: * * src/AS_CNS/BaseCall.C * src/AS_CNS/BaseCall.c * src/AS_CNS/MultiAlignment_CNS.c * src/utgcns/libcns/BaseCall.C * * Modifications by: * * Michael Schatz on 2004-SEP-23 * are Copyright 2004 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2005-MAR-22 * are Copyright 2005 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Eli Venter from 2005-MAR-30 to 2008-FEB-13 * are Copyright 2005-2006,2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Gennady Denisov from 2005-MAY-09 to 2008-JUN-06 * are Copyright 2005-2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Aaron Halpern from 2005-SEP-29 to 2006-OCT-03 * are Copyright 2005-2006 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2008-FEB-27 to 2009-MAY-14 * are Copyright 2008-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-NOV-17 to 2015-JUL-28 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-NOV-23 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/

#include "abAbacus.H"

#include <cfloat>
#include <cmath>
#include <vector>

using namespace std;


void
abColumn::baseCallMajority(void) {
  uint32 bsSum[CNS_NUM_SYMBOLS] = {0};  //  Number of times we've seen this base
  uint32 qvSum[CNS_NUM_SYMBOLS] = {0};  //  Sum of their QVs

  for (uint32 ii=0; ii<_beadsLen; ii++) {
    char    base = _beads[ii].base();
    uint8   qual = _beads[ii].qual();
    uint32  bidx = baseToIndex[base];

    if (bidx >= CNS_NUM_SYMBOLS)
      fprintf(stderr, "abColumn::baseCallMajority()-- For column %u, link %u, base '%c' (%d) is invalid.\n",
              position(), ii, _beads[ii].base(), _beads[ii].base());
    assert(bidx < CNS_NUM_SYMBOLS);

    bsSum[bidx] += 1;
    qvSum[bidx] += qual;
  }

  //  Find the best, and second best, ignoring ties.

  uint32 bestIdx = 0;
  uint32 nextIdx = 1;

  for (uint32 i=1; i<CNS_NUM_SYMBOLS; i++) {
    if      (((bsSum[i] >  bsSum[bestIdx])) ||
             ((bsSum[i] >= bsSum[bestIdx]) && (qvSum[i] > qvSum[bestIdx]))) {
      nextIdx = bestIdx;
      bestIdx = i;
    }

    else if (((bsSum[i] >  bsSum[nextIdx])) ||
             ((bsSum[i] >= bsSum[nextIdx]) && (qvSum[i] > qvSum[nextIdx]))) {
      nextIdx = i;
    }
  }

  //  Original version set QV to zero.

  _call = indexToBase[bestIdx];
  _qual = 0;

  //  If the best is a gap, use the lowercase second best - the alignment will treat this specially.

  if (_call == '-')
    _call = indexToBase[nextIdx] - 'A' + 'a';
}


uint32
baseToTauIndex(char base) {

  switch (base) {
    case '-':  return(0);  break;
    case 'A':  return(1);  break;
    case 'C':  return(2);  break;
    case 'G':  return(3);  break;
    case 'T':  return(4);  break;
    default:
      fprintf(stderr, "baseToTauIndex()-- invalid base '%c' %d\n", base, base);
      assert(0);
      break;
  }

  return(0);
}


void
abColumn::baseCallQuality(void) {
  char  consensusBase = '-';
  char  consensusQual = 0;

  //  The original versions classified reads as 'best allele', 'other allele' or 'guide allele'.
  //  Other allele was set if we were targeting a specific set of reads here (e.g., we had already
  //  clustered reads into haplotypes).
Guide allele was used if the read was not really a read // (originally, these were reads that came from outside the project, but eventually, I think it // came to also mean the read was a unitig surrogate). All this was stripped out in early // December 2015. The pieces removed all mirrored what is done for the bReads. vector bReads; uint32 bBaseCount[CNS_NUM_SYMBOLS] = {0}; uint32 bQVSum[CNS_NUM_SYMBOLS] = {0}; // Best allele uint32 frag_cov = 0; // Scan a column of aligned bases. Sort into three groups: // - those corresponding to the reads of the best allele, // - those corresponding to the reads of the other allele and // - those corresponding to non-read fragments (aka guides) for (uint32 ii=0; ii<_beadsLen; ii++) { char base = _beads[ii].base(); uint8 qual = _beads[ii].qual(); uint32 bidx = baseToIndex[base]; // If we encode bases properly, this will go away. if (bidx >= CNS_NUM_SYMBOLS) fprintf(stderr, "abColumn::baseCallQuality()-- For column %u, link %u, base '%c' (%d) is invalid.\n", position(), ii, _beads[ii].base(), _beads[ii].base()); assert(bidx < CNS_NUM_SYMBOLS); bBaseCount[bidx] += 1; // Could have saved to 'best', 'other' or 'guide' here. bQVSum[bidx] += qual; bReads.push_back(ii); } double cw[5] = { 0.0, 0.0, 0.0, 0.0, 0.0 }; // "consensus weight" for a given base double tau[5] = { 1.0, 1.0, 1.0, 1.0, 1.0 }; // Compute tau based on real reads. for (uint32 cind=0; cind < bReads.size(); cind++) { char base = _beads[ bReads[cind] ].base(); uint8 qual = _beads[ bReads[cind] ].qual(); if (qual == 0) qual += 5; tau[0] += (base == '-') ? PROB[qual] : EPROB[qual]; tau[1] += (base == 'A') ? PROB[qual] : EPROB[qual]; tau[2] += (base == 'C') ? PROB[qual] : EPROB[qual]; tau[3] += (base == 'G') ? PROB[qual] : EPROB[qual]; tau[4] += (base == 'T') ? 
PROB[qual] : EPROB[qual]; //fprintf(stderr, "TAU7[%2d] %f %f %f %f %f qv %d\n", cind, tau[0], tau[1], tau[2], tau[3], tau[4], qual); consensusQual = qual; } //fprintf(stderr, "TAU8 %f %f %f %f %f\n", tau[0], tau[1], tau[2], tau[3], tau[4]); // Occasionally we get a single read of coverage, and the base is an N, which we ignored above. // This is probably of historical interest any more (it happened with 454 reads) but is left in // because its a cheap and working and safe. if (bReads.size() == 0) { // + oReads.size() + gReads.size() _call = 'N'; _qual = 0; return; } // Do we need to scale before normalizing? Anything out of bounds? We'll try to scale the small // values up to DBL_MIN without making the large values larger than DBL_MAX. If we end up with // some values still too small, oh well. We have a winner (the large value) anyway! // // Note, in log-space, min value (1e-309) is around -711.5, and max value (1e309) is around 711.5. //fprintf(stderr, "tau: %f %f %f %f %f\n", tau[0], tau[1], tau[2], tau[3], tau[4]); double minTau = DBL_MAX; double maxTau = -DBL_MAX; minTau = MIN(minTau, tau[0]); minTau = MIN(minTau, tau[1]); minTau = MIN(minTau, tau[2]); minTau = MIN(minTau, tau[3]); minTau = MIN(minTau, tau[4]); maxTau = MAX(maxTau, tau[0]); maxTau = MAX(maxTau, tau[1]); maxTau = MAX(maxTau, tau[2]); maxTau = MAX(maxTau, tau[3]); maxTau = MAX(maxTau, tau[4]); // Now that we know the min and max values, shift them as far positive as possible. Ideally, // this will just add an offset to bring the smallest value up to the minimum representable. 
double minValue = log(DBL_MIN) + DBL_EPSILON; double maxValue = log(DBL_MAX) - DBL_EPSILON; double scaleValue = 0.0; if (minTau < minValue) scaleValue = minValue - minTau; if (maxTau + scaleValue > maxValue) scaleValue = maxValue - maxTau; tau[0] += scaleValue; tau[1] += scaleValue; tau[2] += scaleValue; tau[3] += scaleValue; tau[4] += scaleValue; // It could however overflow the max (which is the value we care about), so any values still too // low are thresholded. if (tau[0] < minValue) tau[0] = minValue; if (tau[1] < minValue) tau[1] = minValue; if (tau[2] < minValue) tau[2] = minValue; if (tau[3] < minValue) tau[3] = minValue; if (tau[4] < minValue) tau[4] = minValue; //fprintf(stderr, "TAU9 %f %f %f %f %f value %f/%f tau %f/%f scale %f\n", // tau[0], tau[1], tau[2], tau[3], tau[4], minValue, maxValue, minTau, maxTau, scaleValue); // I give up. I can't make the following asserts true when the scale value is reset to not // exceed maxValue. I'm off, somewhere, by 1e-13 (EPSILON is 2e-16, nowhere near). My test case // takes 18 minutes to get here, and I don't really want to slap in a bunch of logging. So, // thresholding it is. 
if (tau[0] > maxValue) tau[0] = maxValue; if (tau[1] > maxValue) tau[1] = maxValue; if (tau[2] > maxValue) tau[2] = maxValue; if (tau[3] > maxValue) tau[3] = maxValue; if (tau[4] > maxValue) tau[4] = maxValue; assert(tau[0] <= maxValue); assert(tau[1] <= maxValue); assert(tau[2] <= maxValue); assert(tau[3] <= maxValue); assert(tau[4] <= maxValue); tau[0] = exp(tau[0]); tau[1] = exp(tau[1]); tau[2] = exp(tau[2]); tau[3] = exp(tau[3]); tau[4] = exp(tau[4]); //fprintf(stderr, "TAU10 %f %f %f %f %f\n", tau[0], tau[1], tau[2], tau[3], tau[4]); assert(tau[0] >= 0.0); assert(tau[1] >= 0.0); assert(tau[2] >= 0.0); assert(tau[3] >= 0.0); assert(tau[4] >= 0.0); cw[0] = tau[0] * 0.2; cw[1] = tau[1] * 0.2; cw[2] = tau[2] * 0.2; cw[3] = tau[3] * 0.2; cw[4] = tau[4] * 0.2; double normalize = cw[0] + cw[1] + cw[2] + cw[3] + cw[4]; double normalizeS = normalize; if (normalize > 0.0) { normalize = 1.0 / normalize; cw[0] *= normalize; cw[1] *= normalize; cw[2] *= normalize; cw[3] *= normalize; cw[4] *= normalize; } if ((cw[0] < 0.0) || (cw[1] < 0.0) || (cw[2] < 0.0) || (cw[3] < 0.0) || (cw[4] < 0.0)) fprintf(stderr, "ERROR: cw[0-4] invalid: %f %f %f %f %f - tau %f %f %f %f %f - normalize %.60e %.60e\n", cw[0], cw[1], cw[2], cw[3], cw[4], tau[0], tau[1], tau[2], tau[3], tau[4], normalize, normalizeS); if (!(cw[0] >= 0.0) || !(cw[1] >= 0.0) || !(cw[2] >= 0.0) || !(cw[3] >= 0.0) || !(cw[4] >= 0.0)) fprintf(stderr, "ERROR: cw[0-4] invalid: %f %f %f %f %f - tau %f %f %f %f %f - normalize %.60e %.60e\n", cw[0], cw[1], cw[2], cw[3], cw[4], tau[0], tau[1], tau[2], tau[3], tau[4], normalize, normalizeS); assert(cw[0] >= 0.0); assert(cw[1] >= 0.0); assert(cw[2] >= 0.0); assert(cw[3] >= 0.0); assert(cw[4] >= 0.0); double cwMax = DBL_MIN; uint32 cwIdx = 0; // Default is a gap. 
if (cw[0] > cwMax) { cwIdx = 0; cwMax = cw[0]; } // '-' if (cw[1] > cwMax) { cwIdx = 1; cwMax = cw[1]; } // 'A' if (cw[2] > cwMax) { cwIdx = 2; cwMax = cw[2]; } // 'C' if (cw[3] > cwMax) { cwIdx = 3; cwMax = cw[3]; } // 'G' if (cw[4] > cwMax) { cwIdx = 4; cwMax = cw[4]; } // 'T' // If cwMax = 0 then consensus is a gap. Otherwise, we deterministically set it to A C G T. #if 0 fprintf(stderr, "prob('%c') %f %c\n", indexToBase[0], cw[0], (cw[0] == cwMax) ? '*' : ' '); fprintf(stderr, "prob('%c') %f %c\n", indexToBase[1], cw[1], (cw[1] == cwMax) ? '*' : ' '); fprintf(stderr, "prob('%c') %f %c\n", indexToBase[2], cw[2], (cw[2] == cwMax) ? '*' : ' '); fprintf(stderr, "prob('%c') %f %c\n", indexToBase[3], cw[3], (cw[3] == cwMax) ? '*' : ' '); fprintf(stderr, "prob('%c') %f %c\n", indexToBase[4], cw[4], (cw[4] == cwMax) ? '*' : ' '); #endif consensusBase = indexToBase[cwIdx]; // Compute the QV. // If cwMax is big, we've max'd out the QV and set it to the maximum. // if (cwMax >= 1.0 - DBL_EPSILON) { consensusQual = CNS_MAX_QV; } // Otherwise compute the QV. If there is more than one read, or we used the surrogate, // we can compute from cwMax. If only one read, use its qv. // else { int32 qual = consensusQual; if ((bReads.size() > 1) /*|| (used_surrogate == true)*/) { double dqv = -10.0 * log10(1.0 - cwMax); qual = DBL_TO_INT(dqv); if (dqv - qual >= 0.50) qual++; } qual = MIN(CNS_MAX_QV, qual); qual = MAX(CNS_MIN_QV, qual); consensusQual = qual; } // If no target allele, or this is the target allele, set the base. Since there // is (currently) always no target allele, we always set the base. 
//fprintf(stderr, "SET %u to %c qv %c\n", position, consensusBase, consensusQV); _call = consensusBase; _qual = consensusQual; } char abColumn::baseCall(bool highQuality) { if (highQuality) baseCallQuality(); else baseCallMajority(); return(_call); } canu-1.6/src/utgcns/libcns/abAbacus-mergeRefine.C000066400000000000000000000246121314437614700217240ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_CNS/MergeRefine.C * src/AS_CNS/MergeRefine.c * src/AS_CNS/MultiAlignment_CNS.c * src/utgcns/libcns/MergeRefine.C * * Modifications by: * * Michael Schatz on 2004-SEP-23 * are Copyright 2004 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2005-MAR-22 * are Copyright 2005 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Eli Venter from 2005-MAR-30 to 2008-FEB-13 * are Copyright 2005-2006,2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Gennady Denisov from 2005-MAY-09 to 2008-JUN-06 * are Copyright 2005-2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005-2011,2013 J. 
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Aaron Halpern from 2005-SEP-29 to 2006-OCT-03 * are Copyright 2005-2006 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2008-FEB-27 to 2009-MAY-14 * are Copyright 2008-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-NOV-17 to 2015-MAR-06 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-14 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "abAbacus.H" // Extends the read represented by column/beadLink into this column. uint16 abColumn::extendRead(abColumn *column, uint16 beadLink) { increaseArray(_beads, _beadsLen, _beadsMax, 1); uint32 link = _beadsLen++; _beads[link]._unused = column->_beads[beadLink]._unused; _beads[link]._isRead = column->_beads[beadLink]._isRead; _beads[link]._isUnitig = column->_beads[beadLink]._isUnitig; _beads[link]._base = '-'; _beads[link]._qual = 0; if (column->_beads[beadLink]._prevOffset == UINT16_MAX) { assert(column->_beads[beadLink]._nextOffset != UINT16_MAX); assert(_nextColumn == column); _beads[link]._prevOffset = UINT16_MAX; _beads[link]._nextOffset = beadLink; column->_beads[beadLink]._prevOffset = link; } else { assert(column->_beads[beadLink]._prevOffset != UINT16_MAX); assert(_prevColumn == column); _beads[link]._prevOffset = beadLink; _beads[link]._nextOffset = UINT16_MAX; column->_beads[beadLink]._nextOffset = link; } return(link); } // Merges the next column into this column, if possible. Possible if no read has // an actual base in both columns. 
bool abColumn::mergeWithNext(abAbacus *abacus, bool highQuality) { abColumn *lcolumn = this; abColumn *rcolumn = next(); abColumn *ncolumn = next()->next(); // The column after rcolumn. assert(lcolumn != NULL); assert(rcolumn != NULL); assert(lcolumn->next() == rcolumn); assert(rcolumn->prev() == lcolumn); #if 0 lcolumn->checkLinks(); rcolumn->checkLinks(); #endif // If both columns have a non-gap (for a single read), we cannot merge. for (uint32 ii=0; ii_beadsLen; ii++) { uint32 jj = lcolumn->_beads[ii].nextOffset(); if ((jj < UINT16_MAX) && (lcolumn->_beads[ii].base() != '-') && (rcolumn->_beads[jj].base() != '-')) return(false); } #if 0 fprintf(stderr, "MERGE columns %d %p <- %d %p\n", lcolumn->position(), lcolumn, rcolumn->position(), rcolumn); fprintf(stderr, "rcolumn links\n"); rcolumn->showLinks(); lcolumn->checkLinks(); rcolumn->checkLinks(); ncolumn->checkLinks(); #endif // OK to merge. Merge all the bases from the right column to the current column. We already // checked that whenever the right column has a base, the left column has a gap, so just march // down the right column and move those bases over! for (uint16 rr=0; rr_beadsLen; rr++) { uint16 ll = rcolumn->_beads[rr].prevOffset(); // Ignore the gaps. if (rcolumn->_beads[rr].base() == '-') continue; // Oh, great. We just found the end of a read. We need to link in the gap (in lcolumn) // before we can swap. Correction: we need to ADD a gap (in lcolumn) before we can swap. if (ll == UINT16_MAX) { //fprintf(stderr, "EXTEND READ at rr=%d\n", rr); ll = lcolumn->extendRead(rcolumn, rr); } // The simple case: just swap the contents. #if 0 fprintf(stderr, "mergeWithNext()-- swap beads lcolumn %d %c and rcolumn %d %c\n", ll, lcolumn->_beads[ll].base(), rr, rcolumn->_beads[rr].base()); #endif #ifdef BASECOUNT lcolumn->baseCountDecr(lcolumn->_beads[ll].base()); rcolumn->baseCountDecr(rcolumn->_beads[rr].base()); // We don't really care about rcolumn. 
#endif

    swap(lcolumn->_beads[ll], rcolumn->_beads[rr]);

#ifdef BASECOUNT
    lcolumn->baseCountIncr(lcolumn->_beads[ll].base());
    rcolumn->baseCountIncr(rcolumn->_beads[rr].base());
#endif

    //  While we're here, update the bead-to-read maps.

    beadID  oldb(rcolumn, rr);
    beadID  newb(lcolumn, ll);

    map<beadID,uint32>::iterator  fit = abacus->fbeadToRead.find(oldb);   //  Does old bead exist
    map<beadID,uint32>::iterator  lit = abacus->lbeadToRead.find(oldb);   //  in either map?

    if (fit != abacus->fbeadToRead.end()) {
      uint32 rid = fit->second;

      //fprintf(stderr, "mergeWithNext()-- move fbeadToRead from %p/%d to %p/%d for read %d\n",
      //        rcolumn, rr, lcolumn, ll, rid);

      abacus->fbeadToRead.erase(fit);    //  Remove the old bead to read pointer
      abacus->fbeadToRead[newb] = rid;   //  Add a new bead to read pointer
      abacus->readTofBead[rid]  = newb;  //  Update the read to bead pointer
    }

    if (lit != abacus->lbeadToRead.end()) {
      uint32 rid = lit->second;

      //fprintf(stderr, "mergeWithNext()-- move lbeadToRead from %p/%d to %p/%d for read %d\n",
      //        rcolumn, rr, lcolumn, ll, rid);

      abacus->lbeadToRead.erase(lit);
      abacus->lbeadToRead[newb] = rid;
      abacus->readTolBead[rid]  = newb;
    }
  }

  //  The rcolumn should now be full of gaps.  (We could just test that baseCount('-') == depth().)

  for (uint32 rr=0; rr<rcolumn->_beadsLen; rr++)
    assert(rcolumn->_beads[rr].base() == '-');

#if 0
  lcolumn->checkLinks();
  rcolumn->checkLinks();
  ncolumn->checkLinks();
#endif

  //  To make checkLinks() work, we need to unlink rcolumn from the column list right now.

  if (rcolumn->_prevColumn)
    rcolumn->_prevColumn->_nextColumn = rcolumn->_nextColumn;

  if (rcolumn->_nextColumn)
    rcolumn->_nextColumn->_prevColumn = rcolumn->_prevColumn;

  assert(ncolumn == rcolumn->next());
  assert(ncolumn == next());

  //  Before the rcolumn can be removed, we need to unlink it from the bead link list.  If there is
  //  a column after rcolumn, we need to move rcolumn's link pointers to lcolumn (prev) and ncolumn
  //  (next).
  //
  //  The actual example (,'s indicate no bases because the read ended):
  //
  //      1234     Column 2 is merged into column 1, and then we delete column 2.
  //    1 -aa-aaa
  //    2 gaaT,,,  Column 1 read 4 has _beads position 3.  (next = 1)
  //    3 gaaT,,,  Column 2 read 4 has _beads position 1.  (prev = 3, next = 1)
  //    4 -aa-aaa  Column 3 read 4 has _beads position 1.  (prev = 1)
  //    5 -aa-aaa  When column 2 is deleted, we're left with a busted link back from col 3 read 4; it should be 3.
  //    6 gaa,,,,

  if (ncolumn != NULL) {
    for (uint32 rr=0; rr<rcolumn->_beadsLen; rr++) {
      uint16  bl = rcolumn->_beads[rr].prevOffset();  //  back link from deleted column to lcolumn -> set as ncolumn's back link (known as ll above)
      uint16  fl = rcolumn->_beads[rr].nextOffset();  //  forw link from deleted column to ncolumn -> set as lcolumn's forw link

      if (bl != UINT16_MAX)   lcolumn->_beads[bl]._nextOffset = fl;
      if (fl != UINT16_MAX)   ncolumn->_beads[fl]._prevOffset = bl;
    }
  }

  lcolumn->checkLinks();
  ncolumn->checkLinks();

  //  Now, finally, we're done.  Remove the old column, recall the base, and do a final check.

  //fprintf(stderr, "mergeWithNext()-- Remove rcolumn %d %p\n", rcolumn->position(), rcolumn);

  delete rcolumn;

  baseCall(highQuality);

  return(true);
}


//  Simple sweep through the MultiAlignment columns, looking for columns to merge and removing
//  null columns.
//
//  Loop over all columns.  If we do not merge, advance to the next column; otherwise, stay here
//  and merge to the now different next column (mergeWithNext() removes the column that gets
//  merged into the current column).
//
//  Note that _firstColumn is never removed.  The second column could be merged into the first,
//  and the second one then removed.
void
abAbacus::mergeColumns(bool highQuality) {

  assert(_firstColumn != NULL);

  abColumn  *column          = _firstColumn;
  bool       somethingMerged = false;

  assert(column->prev() == NULL);

#if 0
  fprintf(stderr, "mergeColumns()--\n");
  display(stderr);
#endif

  //  If we merge, update the base call, and stay here to try another merge of the now different
  //  next column.  Otherwise, we didn't merge anything, so advance to the next column.

  while (column->next()) {
    if (column->mergeWithNext(this, highQuality) == true)
      somethingMerged = true;
    else
      column = column->next();
  }

  //  If any merges were performed, refresh.  This updates the column list.

  if (somethingMerged)
    refreshColumns();
}

canu-1.6/src/utgcns/libcns/abAbacus-refine.C

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  This file is derived from:
 *
 *    src/AS_CNS/AbacusRefine.C
 *    src/AS_CNS/AbacusRefine.c
 *    src/AS_CNS/MultiAlignment_CNS.c
 *    src/utgcns/libcns/AbacusRefine.C
 *
 *  Modifications by:
 *
 *    Michael Schatz on 2004-SEP-23
 *      are Copyright 2004 The Institute for Genomics Research, and
 *      are subject to the GNU General Public License version 2
 *
 *    Jason Miller on 2005-MAR-22
 *      are Copyright 2005 The Institute for Genomics Research, and
 *      are subject to the GNU General Public License version 2
 *
 *    Eli Venter from 2005-MAR-30 to 2008-FEB-13
 *      are Copyright 2005-2006,2008 J.
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Gennady Denisov from 2005-MAY-09 to 2008-JUN-06 * are Copyright 2005-2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Aaron Halpern from 2005-SEP-29 to 2006-OCT-03 * are Copyright 2005-2006 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2008-FEB-27 to 2010-JUN-09 * are Copyright 2008-2010 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2011-SEP-21 * are Copyright 2011 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-DEC-30 to 2015-AUG-14 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-10 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/ #include "abAbacus.H" #define MAX_WINDOW_FOR_ABACUS_REFINE 100 enum ShiftStatus { LEFT_SHIFT = (int) 'L', // Left Shifted (76) RIGHT_SHIFT = (int) 'R', // Right Shifted (82) UNSHIFTED = (int) 'U', // Unshifted (85) }; class abAbacusWork { public: abAbacusWork() { abacus_indices = NULL; start_column = NULL; end_column = NULL; rows = 0; columns = 0; window_width = 0; shift = UNSHIFTED; beads = NULL; calls = NULL; }; abAbacusWork(abAbacus *abacus, abColumn *from, abColumn *end); ~abAbacusWork() { delete [] beads; delete [] calls; }; void reset(void) { for (uint32 ii=0; iistart_column = start_column; clone->end_column = end_column; clone->window_width = window_width; clone->rows = rows; clone->columns = columns; clone->shift = shift; clone->beads = new char [rows * (columns+2)]; clone->calls = new char [columns]; memcpy(clone->beads, beads, (rows * (columns+2)) * sizeof(char)); memcpy(clone->calls, calls, (columns) * sizeof(char)); clone->abacus_indices = abacus_indices; return(clone); }; void show(void) { char form[32]; sprintf(form, "'%%%d.%ds'\n", columns, columns); fprintf(stderr, "start_column 0x%16p %d\n", start_column, start_column->position()); fprintf(stderr, "end_column 0x%16p %d\n", end_column, end_column->position()); fprintf(stderr, "rows %d columns %d\n", rows, columns); fprintf(stderr, "window_width %d\n", window_width); fprintf(stderr, "shift %c\n", shift); // Shows the 'n' border around each row. 
for (int32 i=0; inumberOfSequences()]; start_column = bgn; end_column = end; rows = 0; // rows is incremented and set below window_width = end->position() - bgn->position() + 1; columns = 3 * window_width; shift = UNSHIFTED; for (uint32 ii=0; iinumberOfSequences(); ii++) { abacus_indices[ii] = UINT32_MAX; // Needs to use abAbacus readTofBead and readTolBead; see unitigConsensus::refreshPositions #if 0 if (abacus->getSequence(ii)->lastColumn()->position() < bgn->position()) continue; if (abacus->getSequence(ii)->firstColumn()->position() > end->position()) continue; #endif abacus_indices[ii] = rows++; } // Now we can allocate beads and calls. beads = new char [rows * (columns + 2)]; // two extra gap columns, plus "null" borders calls = new char [columns]; // Clear every base. for (uint32 ii=0; iinumberOfSequences(); ss++) { if (abacus_indices[ss] == UINT32_MAX) continue; // Read is in the range. We need to figure out a beadLink for some column, preferably close, // so that we can start iterating over the beads. abColumn *bColumn = bgn; uint16 bLink = UINT16_MAX; // Find the earliest column with a link map. The data for this is NOT IMPLEMENTED. #if 0 while (bColumn->beadReadIDsExist() == false) bColumn = bColumn->prev(); // Find the link, move back to the starting column for (uint32 ll=0; lldepth(); ll++) if (ss == bColumn->beadReadID(ll)) { bLink = ll; break; } #endif assert(bLink != UINT16_MAX); while (bColumn != bgn) { bLink = bColumn->bead(bLink)->nextOffset(); bColumn = bColumn->next(); } // Now just add bases into abacus. uint32 columnIndex = 0; while (1) { setBase(abacus_indices[ss], window_width + columnIndex, bColumn->bead(bLink)->base()); if (bColumn == end) break; bLink = bColumn->bead(bLink)->nextOffset(); bColumn = bColumn->next(); } } // Clear the border columns. This is done last, so that we can initialize beads with no read // coverage to 'n' (from the first initialization), and then clear the borders to '-' (here). 
for (uint32 i=0; i 0) && (cc < columns - 1) && ((getBase(rr, cc-1) == 'n') || (getBase(rr, cc+1) == 'n'))) b = 'n'; counts[b]++; } // Pick the majority. char best = 0; for (uint32 ii=0; ii<256; ii++) if (counts[best] < counts[ii]) best = ii; // Pick a call, and score the non-gap columns. if (counts['-'] + counts['n'] == rows) { calls[cc] = 'n'; } else { cols++; calls[cc] = best; score += rows - counts[best] - counts['n']; } } return(score); } uint32 abAbacusWork::affineScoreAbacus(void) { // This simply counts the number of opened gaps, to be used in tie breaker // of edit scores. uint32 score=0; uint32 start_column; uint32 end_column; if (shift == LEFT_SHIFT) { start_column = 0; end_column = columns/3; } else if (shift == RIGHT_SHIFT) { start_column = 2*columns/3; end_column = columns; } else { // shift == UNSHIFTED start_column = columns/3; end_column = 2*columns/3; } // Size of a gap does not matter, their number in a row does - GD for (uint32 i=0; i0;j--) { int32 null_column = 1; for (int32 i=0; i= 0 && num_gaps > num_ns) { columns_merged++; for (int32 i=0;i 0 */ { for (int32 j=last_non_null-1; j>first_non_null; j--) { int32 num_gaps=0, num_ns=0; mergeok = 1; curr_column_good = -1; for (int32 i=0;i= 0 && num_gaps > num_ns) { columns_merged++; for (int32 i=0;ifirst_non_null; k--) { prev = getBase(i,k-1); setBase(i, k, prev); } setBase(i, first_non_null, '-'); } // Return to the next column to see if it can be merged again j++; } } } return(columns_merged); } // lcols is the number of non-null columns in result uint32 abAbacusWork::leftShift(int32 &lcols) { //fprintf(stderr, "leftShift()--\n"); reset(); //show(); for (int32 j=window_width; j<2*window_width; j++) { for (int32 i=0; iwindow_width-1;j--) { for (int32 i=0; ij;pcol--) { char call = calls[pcol]; if (call != 'n' && call != c && c != 'n' ) continue; if (call == 'n') calls[pcol] = c; if (calls[pcol] == c || c == 'n' ) { setBase(i,j,'-'); setBase(i,pcol,c); break; } } if (getBase(i,j) != '-' ) calls[j] = c; 
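The scoreAbacus() logic above calls each column by simple majority vote and charges one mismatch for every base that disagrees with the winner, exempting the 'n' placeholders. That per-column vote can be sketched standalone (scoreColumn is a hypothetical helper over a plain char array, not canu's API):

```cpp
#include <cassert>
#include <cstdint>

//  Score one abacus column by majority vote.  Every character that is not
//  the winning character, and is not an 'n' placeholder, counts as a
//  mismatch.  'rows' plays the role of abAbacusWork::rows; the winning
//  character is returned through 'call' like calls[cc] in scoreAbacus().
static uint32_t
scoreColumn(const char *column, uint32_t rows, char &call) {
  uint32_t counts[256] = {0};

  for (uint32_t r = 0; r < rows; r++)
    counts[(unsigned char)column[r]]++;

  //  Pick the majority character, exactly as scoreAbacus() does.
  unsigned char best = 0;
  for (uint32_t c = 0; c < 256; c++)
    if (counts[best] < counts[c])
      best = (unsigned char)c;

  call = (char)best;

  //  Mismatches: everything except the winner and the 'n' placeholders.
  return rows - counts[best] - counts[(unsigned char)'n'];
}
```

With five reads "AAAT-" the majority is 'A' and the column scores two mismatches (the 'T' and the '-').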
} } } //fprintf(stderr, "rightShift()-- PREMERGE\n"); //show(); merge(1); //fprintf(stderr, "rightShift()-- POSTMERGE\n"); //show(); shift = RIGHT_SHIFT; return(scoreAbacus(rcols)); } #if 0 // Relationship must be one of: // // a) end gap moving left: // // X > A > B > C > ... > - becomes X - A B C ... // ^________________/ // // b) non-gap moving left across only gap characters // (more efficient special case, since first gap and last // character can just be exchanged) // // X > - > - > - > ... > A becomes X A - - - ... // ^________________/ static void leftEndShiftBead(abAbacus *abacus, abBeadID bid, abBeadID eid) { abBead *shift = abacus->getBead(eid); abBeadID aid = abacus->getBead(bid)->prevID(); assert(shift != NULL); if (abacus->getBase(shift->baseIdx()) != '-' ) { // assume first and internal characters are gaps abacus->lateralExchangeBead(bid, eid); } else { while (shift->prevID() != aid ) { abacus->lateralExchangeBead(shift->prevID(), shift->ident()); } } } // Relationship must be one of: // // a) end gap moving right: // // - > A > B > ... > C > X becomes A B ... C - X // \_________________^ // // b) non-gap moving right across only gap characters // (more efficient special case, since first gap and last // character can just be exchanged) // // A > - > - > ... > - > X becomes - - - ... 
A X // \________________^ static void rightEndShiftBead(abAbacus *abacus, abBeadID bid, abBeadID eid) { abBead *shift = abacus->getBead(bid); abBeadID aid = abacus->getBead(eid)->nextID(); assert(shift != NULL); if (abacus->getBase(shift->baseIdx()) != '-' ) { // assume last and internal characters are gaps abacus->lateralExchangeBead(bid, eid); } else { while (shift->nextID() != aid ) { abacus->lateralExchangeBead(shift->ident(), shift->nextID()); } } } void abAbacusWork::applyAbacus(abAbacus *abacus) { beadID bid; // ALWAYS the id of bead beadID eid; // ALWAYS the id of exch //fprintf(stderr, "applyAbacus()-- shift=%c start=%d width=%d\n", shift, start_column.get(), window_width); //show(); if (shift == LEFT_SHIFT) { abColumn *column = start_column; for (uint32 columnCount=0; columnCount < window_width; ) { bid.column = column; bid.link = 0; //fprintf(stderr, "0; bid=%d eid=%d\n", bid.get(), eid.get()); // Update all beads in a given column while (bid.link != UINT16_MAXisValid()) { bead = abacus->getBead(bid); char a_entry = getBase(abacus_indices[bead->seqIdx().get()] - 1, columnCount); //fprintf(stderr, "a_entry=%c bead=%c\n", a_entry, abacus->getBase(bead->baseIdx())); if (a_entry == 'n') { eid = bead->upID(); exch = abacus->getBead(eid); //fprintf(stderr, "1; bid=%d eid=%d\n", bid.get(), eid.get()); abacus->unalignTrailingGapBeads(bid); } else if (a_entry != abacus->getBase(bead->baseIdx())) { // Look for matching bead in frag and exchange eid = bead->ident(); exch = abacus->getBead(eid); //fprintf(stderr, "2; bid=%d eid=%d\n", bid.get(), eid.get()); if (NULL == exch) { eid = abacus->appendGapBead(bead->ident()); bead = abacus->getBead(bid); abacus->alignBeadToColumn(abacus->getColumn(bead->colIdx())->nextID(),eid, "ApplyAbacus(1)"); exch = abacus->getBead(eid); } //fprintf(stderr, "3; bid=%d eid=%d\n", bid.get(), eid.get()); while (a_entry != abacus->getBase(exch->baseIdx())) { abBeadID eidp = exch->nextID(); if (exch->nextID().isInvalid()) { eidp = 
abacus->appendGapBead(exch->ident()); bead = abacus->getBead(bid); exch = abacus->getBead(eid); abacus->alignBeadToColumn(abacus->getColumn(exch->colIdx())->nextID(),eidp, "ApplyAbacus(2)"); //fprintf(stderr, "4; bid=%d eid=%d\n", bid.get(), eid.get()); } else if (exch->colIdx() == end_column) { eidp = abacus->appendGapBead(exch->ident()); bead = abacus->getBead(bid); exch = abacus->getBead(eid); //fprintf(stderr, "5; bid=%d eid=%d\n", bid.get(), eid.get()); abColumn *cid = column; abacus->appendColumn(exch->colIdx(),eidp); column = abacus->getColumn(cid); } //fprintf(stderr, "6; bid=%d eid=%d b col/frg=%d/%d e_col/frg=%d/%d\n", // bid.get(), eid.get(), // bead->colIdx().get(), bead->seqIdx().get(), // exch->colIdx().get(), exch->seqIdx().get()); eid = eidp; exch = abacus->getBead(eid); } //fprintf(stderr,"LeftShifting bead %d (%c) with bead %d (%c).\n", // bid.get(), abacus->getBase(bead->baseIdx()), // eid.get(), abacus->getBase(exch->baseIdx())); leftEndShiftBead(abacus, bid, eid); } else { // no exchange necessary; eid = bid; exch = bead; //fprintf(stderr, "7; bid=%d eid=%d\n", bid.get(), eid.get()); } bid = exch->downID(); bead = NULL; //fprintf(stderr,"New bid is %d (%c), from %d down\n", // bid.get(), (bid.isValid()) ? abacus->getBase(abacus->getBead(bid)->baseIdx()) : 'n', // eid.get()); } // End of update; call base now. 
base = abacus->baseCall(column->ident(), true); column = column->next(); columnCount++; } } if (shift == RIGHT_SHIFT) { abColumn *column = end_column; for (uint32 columnCount=0; columnCount < window_width; ) { bid = abacus->getBead(column->callID())->downID(); while (bid.isValid()) { bead = abacus->getBead(bid); a_entry = getBase(abacus_indices[bead->seqIdx().get()] - 1, columns - columnCount - 1); if (a_entry == 'n') { eid = bead->upID(); exch = abacus->getBead(eid); abacus->unalignTrailingGapBeads(bid); } else if (a_entry != abacus->getBase(bead->baseIdx())) { // Look for matching bead in frag and exchange eid = bead->ident(); exch = abacus->getBead(eid); if (NULL == exch) { eid = abacus->prependGapBead(bead->ident()); bead = abacus->getBead(bid); exch = abacus->getBead(eid); abacus->alignBeadToColumn(abacus->getColumn(bead->colIdx())->prevID(),eid, "ApplyAbacus(3)"); } while (a_entry != abacus->getBase(exch->baseIdx())) { abBeadID eidp = exch->prevID(); if (exch->prevID().isInvalid()) { eidp = abacus->prependGapBead(exch->ident()); bead = abacus->getBead(bid); exch = abacus->getBead(eid); abacus->alignBeadToColumn(abacus->getColumn(exch->colIdx())->prevID(),eidp, "ApplyAbacus(4)"); } else if (exch->colIdx() == start_column) { eidp = abacus->appendGapBead(exch->prevID()); bead = abacus->getBead(bid); exch = abacus->getBead(eid); abColumn *cid = column; abacus->appendColumn(abacus->getColumn(exch->colIdx())->prevID(), eidp); column = abacus->getColumn(cid); } eid = eidp; exch = abacus->getBead(eid); } //fprintf(stderr,"RightShifting bead %d (%c) with bead %d (%c).\n", // eid.get(), abacus->getBase(exch->baseIdx()), // bid.get(), abacus->getBase(bead->baseIdx())); rightEndShiftBead(abacus, eid, bid); } else { eid = bid; exch = bead; // no exchange necessary; } bid = exch->downID(); bead = NULL; //fprintf(stderr,"New bid is %d (%c), from %d down\n", // bid.get(), (bid.isValid()) ? 
//  abacus->getBase(abacus->getBead(bid)->baseIdx()) : 'n',
//  eid.get());
      }

      base = abacus->baseCall(column->ident(), true);

      column = abacus->getColumn(column->prevID());
      columnCount++;
    }
  }
}

#endif


void
abAbacusWork::applyAbacus(abAbacus *abacus) {
}


//
//  In this case, we just look for a string of gaps in the consensus sequence
//
static
int32
IdentifyWindow_Smooth(abColumn *&bgnCol, abColumn *&terCol) {
  terCol = bgnCol->next();

  //  Consensus not a gap, nothing to do.

  if (bgnCol->baseCall() != '-')
    return(0);

  int32 winLen = 1;

  //  Consensus is a gap.  Expand it to the maximum gap.

  while ((terCol->baseCall() == '-') && (terCol->next() != NULL)) {
    terCol = terCol->next();
    winLen++;
  }

#ifdef DEBUG_IDENTIFY_WINDOW
  fprintf(stderr, "identifyWindow()-- gap at %d to %d winLen=%d (return)\n",
          bgnCol->position(), terCol->position(), winLen);
#endif

  return(winLen);
}


//
//  Here, we're looking for a string of the same character
//
static
int32
IdentifyWindow_Poly_X(abColumn *&bgnCol, abColumn *&terCol) {
  terCol = bgnCol->next();

  int32 gapCount = bgnCol->baseCount('-');
  int32 winLen   = 1;
  char  poly     = bgnCol->baseCall();

  if (poly == '-')
    return(0);

  while ((terCol->baseCall() == poly) ||
         (terCol->baseCall() == '-')) {
    if (terCol->next() == NULL)
      break;

    gapCount += terCol->baseCount('-');
    winLen   += 1;
    terCol    = terCol->next();
  }

  if (winLen <= 2)
    return(0);

  //  capture trailing gap-called columns

  while (terCol->baseCall() == '-') {
    if (terCol->majorityBase(true) != poly)
      break;
    if (terCol->next() == NULL)
      break;

    gapCount += terCol->baseCount('-');
    winLen   += 1;
    terCol    = terCol->next();
  }

  //  now that a poly run with trailing gaps is established, look for leading gaps

  abColumn *preCol = bgnCol->prev();

  while (preCol != NULL) {
    if ((preCol->baseCall() != '-') && (preCol->baseCall() != poly))
      break;

    bgnCol    = preCol;
    gapCount += preCol->baseCount('-');
    winLen   += 1;
    preCol    = preCol->prev();
  }

  //fprintf(stderr,"POLYX candidate (%c) at column %d ter %d , width %d, gapcount %d\n",
  //        poly,
//        bgnCol->position(), terCol.get(), winLen, gapCount);

  if ((bgnCol->prev() != NULL) && (winLen > 2) && (gapCount > 0))
    return(winLen);

  return(0);
}


//
//  In this case, we look for a string of mismatches, indicating a poor alignment region
//  which might benefit from Abacus refinement
//
//  heuristics:
//
//  > stable border on either side of window of width: STABWIDTH
//  > fewer than STABMISMATCH in stable border
//
//                                  _                  __                   ___
//      SSSSS SSSSS      SSSSS .SSSS+       SSSSS  .SSSS+
//      SSSSS SSSSS      SSSSS .SSSS+       SSSSS  .SSSS+
//      SSSSS SSSSS  =>  SSSSS .SSSS+  =>   SSSSS  .SSSS+
//      SSSSS SSSSS      SSSSS .SSSS+       SSSSS  .SSSS+
//      SSSSS_SSSSS      SSSSS_.SSSS+       SSSSS__.SSSS+
//          |                |                  |
//          |\_____________  |  _______________ | ____ growing 'gappy' window
//          |
//       bgnCol
//

#define STABWIDTH 6

static
int32
IdentifyWindow_Indel(abColumn *&bgnCol, abColumn *&terCol) {
  terCol = bgnCol->next();

  int32 cum_mm     = bgnCol->mismatchCount();
  int32 stab_mm    = 0;
  int32 stab_gaps  = 0;
  int32 stab_width = 0;
  int32 stab_bases = 0;
  int32 winLen     = 0;

  if ((cum_mm == 0) || (bgnCol->baseCount('-') == 0))
    return(0);

  abColumn *stableEnd = terCol;

  //  Compute the number of mismatches, gaps and bases in the next STABWIDTH columns.

  while ((stableEnd->next() != NULL) && (stab_width < STABWIDTH)) {
    stab_mm    += stableEnd->mismatchCount();
    stab_gaps  += stableEnd->baseCount('-');
    stab_bases += stableEnd->depth();

    stableEnd = stableEnd->next();
    stab_width++;
  }

  //  If no bases (how?) return an empty window.

  if (stab_bases == 0)
    return(0);

  //  While the number of mismatches is high, and the number of gaps is also high, shift the stable
  //  region ahead by one column.  Subtract out the values for the column we shift out (on the left)
  //  and add in the values for the column we shift in (on the right).
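The stable-border bookkeeping described above is a pair of running sums: as the window grows, the column leaving the stable region is subtracted and the column entering it is added, and the border stays "unstable" while mismatches exceed 2% or gaps exceed 25% of its bases (the thresholds used just below). A minimal sketch with hypothetical StableStats/slideStable/borderUnstable names, not canu's API:

```cpp
#include <cassert>
#include <cstdint>

//  Running statistics for the STABWIDTH-column stable border.
struct StableStats {
  int32_t mm    = 0;   //  mismatches inside the border (stab_mm)
  int32_t gaps  = 0;   //  gap characters inside the border (stab_gaps)
  int32_t bases = 0;   //  total bases inside the border (stab_bases)
};

//  Advance the border one column to the right: the 'out' column leaves on
//  the left, the 'in' column enters on the right.  O(1) per step, instead
//  of recounting all STABWIDTH columns.
static void
slideStable(StableStats &s,
            int32_t outMM, int32_t outGaps, int32_t outBases,
            int32_t inMM,  int32_t inGaps,  int32_t inBases) {
  s.mm    += inMM    - outMM;
  s.gaps  += inGaps  - outGaps;
  s.bases += inBases - outBases;
}

//  Same predicate as the while-loop condition below: keep growing the
//  window while the border looks noisy.
static bool
borderUnstable(const StableStats &s) {
  return (s.mm   > 0.02 * s.bases) ||   //  CNS_SEQUENCING_ERROR_EST
         (s.gaps > 0.25 * s.bases);
}
```

Sliding in a clean column and out a noisy one lowers the running mismatch and gap totals without touching the base total when depths match.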
while ((stab_mm > 0.02 * stab_bases) || // CNS_SEQUENCING_ERROR_EST (stab_gaps > 0.25 * stab_bases)) { int32 mm = terCol->mismatchCount(); int32 gp = terCol->baseCount('-'); int32 bps = terCol->depth(); // move terCol ahead if (stableEnd->next() != NULL) { stab_mm += stableEnd->mismatchCount(); stab_bases += stableEnd->depth(); stab_gaps += stableEnd->baseCount('-'); stableEnd = stableEnd->next(); stab_mm -= mm; stab_gaps -= gp; stab_bases -= bps; cum_mm += mm; terCol = terCol->next(); winLen++; } else { break; } } if (winLen > 1) return(winLen); return(0); } int32 abAbacus::refineWindow(abColumn *bgnCol, abColumn *terCol) { abAbacusWork *orig_abacus = new abAbacusWork(this, bgnCol, terCol); int32 orig_columns = 0; int32 orig_mm_score = orig_abacus->scoreAbacus(orig_columns); abAbacusWork *left_abacus = orig_abacus->clone(); int32 left_columns = 0; abAbacusWork *right_abacus = orig_abacus->clone(); int32 right_columns = 0; int32 left_mm_score = left_abacus->leftShift(left_columns); int32 right_mm_score = right_abacus->rightShift(right_columns); // determine best score and apply abacus to real columns int32 orig_gap_score = orig_abacus->affineScoreAbacus(); int32 left_gap_score = left_abacus->affineScoreAbacus(); int32 right_gap_score = right_abacus->affineScoreAbacus(); abAbacusWork *best_abacus = orig_abacus; int32 best_columns = orig_columns; int32 best_gap_score = orig_gap_score; int32 best_mm_score = orig_mm_score; int32 orig_total_score = orig_mm_score + orig_columns + orig_gap_score; int32 left_total_score = left_mm_score + left_columns + left_gap_score; int32 right_total_score = right_mm_score + right_columns + right_gap_score; int32 best_total_score = orig_total_score; int32 score_reduction = 0; // Use the total score to refine the abacus if (left_total_score < orig_total_score || right_total_score < orig_total_score ) { if (left_total_score <= right_total_score ) { score_reduction += orig_total_score - left_total_score; //fprintf(stderr,"\nTry to apply 
LEFT abacus:\n"); //ShowAbacus(left_abacus); best_abacus = left_abacus; best_mm_score = left_mm_score; best_columns = left_columns; best_gap_score = left_gap_score; best_total_score = left_total_score; } else { score_reduction += orig_total_score - right_total_score; //fprintf(stderr,"\nTry to apply RIGHT abacus:\n"); //ShowAbacus(right_abacus); best_abacus = right_abacus; best_mm_score = right_mm_score; best_columns = right_columns; best_gap_score = right_gap_score; best_total_score = right_total_score; } } best_abacus->applyAbacus(this); delete orig_abacus; delete left_abacus; delete right_abacus; return(score_reduction); } //********************************************************************************* // Abacus Refinement: // AbacusRefine contains the logic for sweeping through the multialignment, // and identifying candidate windows for refinement. // Each window is cast into an abacus, which is left and right shifted. // The best resulting base arrangement (if different from input) is then // applied to window of the MultiAlignment //********************************************************************************* // AbacusRefine // // ctgcns - from=0 to=-1 level=abAbacus_Smooth // - from=0 to=-1 level=abAbacus_Poly_X // - from=0 to=-1 level=abAbacus_Indel // utgcns - same // // from,to are C-style. 
Used to be INCLUSIVE, but never used anyway // int32 abAbacus::refine(abAbacusRefineLevel level, uint32 bgn, uint32 end) { #warning SKIPPING ALL REFINEMENTS return(0); if (end > numberOfColumns()) end = numberOfColumns(); abColumn *bgnCol = _columns[bgn]; abColumn *endCol = _columns[end - 1]; abColumn *terCol = NULL; int32 score_reduction = 0; while (bgnCol != endCol) { int32 window_width = 0; switch (level) { case abAbacus_Smooth: window_width = IdentifyWindow_Smooth(bgnCol, terCol); break; case abAbacus_Poly_X: window_width = IdentifyWindow_Poly_X(bgnCol, terCol); break; case abAbacus_Indel: window_width = IdentifyWindow_Indel(bgnCol, terCol); break; default: break; } // If the window is too big, there's likely a polymorphism that won't respond well to abacus, // so skip it. // if ((window_width > 0) && (window_width < MAX_WINDOW_FOR_ABACUS_REFINE)) { // Insert a gap column for maneuvering room if the window starts in the first. This used to // be logged, and BPW can't remember EVER seeing the message. #if 0 if (bgnCol->prev() == NULL) { abBeadID firstbeadID = abacus->getBead(bgnCol->callID() )->downID(); abBeadID newbeadID = abacus->appendGapBead(abacus->getBead(firstbeadID)->ident()); fprintf(stderr, "abMultiAlign::refine()-- Adding gap bead "F_U32" after first bead "F_U32" to add abacus room for abutting left of multialignment\n", newbeadID.get(), firstbeadID.get()); abacus->appendColumn(abacus->getBead(firstbeadID)->colIdx(), newbeadID); } #endif // Actually do the refinements. score_reduction += refineWindow(bgnCol, terCol); } // Move to the column after the window we just examined. 
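The refine() loop above repeats one simple pattern: ask an IdentifyWindow_* routine for a window width, refine the window only when it is non-empty and under MAX_WINDOW_FOR_ABACUS_REFINE (oversized windows are assumed to be real polymorphism, not alignment noise), then advance past it. A schematic of that driver over plain column indices instead of abColumn pointers; scanForWindows and its callback are hypothetical stand-ins:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

const int32_t MAX_WINDOW = 100;   //  stands in for MAX_WINDOW_FOR_ABACUS_REFINE

//  Scan columns [0, nColumns).  'identify' returns the width of a candidate
//  window starting at a column, or 0 if there is none -- the role played by
//  IdentifyWindow_Smooth/Poly_X/Indel.  Accepted window starts are recorded.
static int32_t
scanForWindows(int32_t nColumns,
               const std::function<int32_t(int32_t)> &identify,
               std::vector<int32_t> &accepted) {
  int32_t refined = 0;

  for (int32_t bgn = 0; bgn < nColumns; ) {
    int32_t w = identify(bgn);

    if ((w > 0) && (w < MAX_WINDOW)) {   //  same guard as refine()
      accepted.push_back(bgn);           //  would call refineWindow() here
      refined++;
    }

    bgn += (w > 0) ? w : 1;              //  move past the window just examined
  }

  return refined;
}
```

A window of width 200 is identified but skipped by the size guard, while a width-5 window is accepted and refined.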
bgnCol = terCol; } // WITH quality=1 make_v_list=1, all the rest defaults refreshColumns(); recallBases(true); return(score_reduction); } canu-1.6/src/utgcns/libcns/abAbacus-refreshMultiAlign.C000066400000000000000000000117031314437614700231150ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_CNS/MultiAlignment_CNS.c * src/AS_CNS/RefreshMANode.C * src/AS_CNS/RefreshMANode.c * src/utgcns/libcns/RefreshMANode.C * * Modifications by: * * Michael Schatz on 2004-SEP-23 * are Copyright 2004 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2005-MAR-22 * are Copyright 2005 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Eli Venter from 2005-MAR-30 to 2008-FEB-13 * are Copyright 2005-2006,2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Gennady Denisov from 2005-MAY-09 to 2008-JUN-06 * are Copyright 2005-2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Aaron Halpern from 2005-SEP-29 to 2006-OCT-03 * are Copyright 2005-2006 J. 
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2008-FEB-27 to 2009-SEP-25 * are Copyright 2008-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-NOV-17 to 2015-MAR-06 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-14 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "abAbacus.H" // (Optionally) recall consensus in all columns. // Rebuild the list of columns in the multialign. // Rebuild the column to position map. void abAbacus::refreshColumns(void) { //fprintf(stderr, "abAbacus::refreshColumns()--\n"); // Given that _firstColumn is a valid column, walk to the start of the column list. while (_firstColumn->_prevColumn != NULL) _firstColumn = _firstColumn->_prevColumn; // Number the columns, so we can make sure the _columns array has enough space. Probably not // needed to be done first, but avoids having the resize call in the next loop. uint32 cn = 0; for (abColumn *column = _firstColumn; column; column = column->next()) column->_columnPosition = cn++; // Position of the column in the gapped consensus. // Fake out resizeArray so it will work on three arrays. uint32 cm = _columnsMax; resizeArray(_columns, 0, cm, cn+1, resizeArray_doNothing); cm = _columnsMax; resizeArray(_cnsBases, 0, cm, cn+1, resizeArray_doNothing); cm = _columnsMax; resizeArray(_cnsQuals, 0, cm, cn+1, resizeArray_doNothing); _columnsMax = cm; // Build the list of columns and update consensus and quals while we're there. 
_columnsLen = 0; for (abColumn *column = _firstColumn; column; column = column->next()) { _columns [_columnsLen] = column; _cnsBases[_columnsLen] = column->baseCall(); _cnsQuals[_columnsLen] = column->baseQual(); _columnsLen++; } _cnsBases[_columnsLen] = 0; _cnsQuals[_columnsLen] = 0; // Not actually zero terminated. //for (abColumn *column = _firstColumn; column; column = column->next()) // fprintf(stderr, "refreshColumns()-- column %p is at position %d\n", // column, column->position()); } void abAbacus::recallBases(bool highQuality) { //fprintf(stderr, "abAbacus::recallBases()-- highQuality=%d\n", highQuality); // Given that _firstColumn is a valid column, walk to the start of the column list. // We could use _columns[] instead. while (_firstColumn->_prevColumn != NULL) _firstColumn = _firstColumn->_prevColumn; // Number the columns, so we can make sure the _columns array has enough space. Probably not // needed to be done first, but avoids having the resize call in the next loop. for (abColumn *column = _firstColumn; column; column = column->next()) column->baseCall(highQuality); // After calling bases, we need to refresh to copy the bases from each column into // _cnsBases and _cnsQuals. refreshColumns(); } canu-1.6/src/utgcns/libcns/abAbacus.C000066400000000000000000000104641314437614700174760ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * This file is derived from: * * src/AS_CNS/MultiAlignment_CNS.C * src/AS_CNS/MultiAlignment_CNS.c * src/utgcns/libcns/MultiAlignment_CNS.C * * Modifications by: * * Michael Schatz on 2004-SEP-23 * are Copyright 2004 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2005-MAR-22 * are Copyright 2005 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Eli Venter from 2005-MAR-30 to 2008-FEB-13 * are Copyright 2005-2006,2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Gennady Denisov from 2005-MAY-09 to 2008-JUN-06 * are Copyright 2005-2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Aaron Halpern from 2005-SEP-29 to 2006-OCT-03 * are Copyright 2005-2006 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2008-FEB-27 to 2009-MAY-14 * are Copyright 2008-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-NOV-17 to 2015-JUL-08 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-14 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "abAbacus.H" // Shouldn't be global, but some things -- like abBaseCount -- need it. 
bool DATAINITIALIZED = false; double EPROB[CNS_MAX_QV - CNS_MIN_QV + 1] = { 0 }; double PROB [CNS_MAX_QV - CNS_MIN_QV + 1] = { 0 }; uint32 baseToIndex[256] = { 0 }; char indexToBase[CNS_NUM_SYMBOLS] = { 0 }; void abAbacus::initializeGlobals(void) { for (int32 i=0; i<256; i++) baseToIndex[i] = UINT32_MAX; indexToBase[ 0] = '-'; indexToBase[ 1] = 'A'; indexToBase[ 2] = 'C'; indexToBase[ 3] = 'G'; indexToBase[ 4] = 'T'; indexToBase[ 5] = 'N'; #if 0 indexToBase[ 6] = 'a'; // -A indexToBase[ 7] = 'c'; // -C indexToBase[ 8] = 'g'; // -G indexToBase[ 9] = 't'; // -T indexToBase[10] = 'M'; // AC indexToBase[11] = 'R'; // AG indexToBase[12] = 'W'; // AT indexToBase[13] = 'S'; // CG indexToBase[14] = 'Y'; // CT indexToBase[15] = 'K'; // GT indexToBase[16] = 'm'; // -AC indexToBase[17] = 'r'; // -AG indexToBase[18] = 'w'; // -AT indexToBase[19] = 's'; // -CG indexToBase[20] = 'y'; // -CT indexToBase[21] = 'k'; // -GT indexToBase[22] = 'V'; // ACG indexToBase[23] = 'H'; // ACT indexToBase[24] = 'D'; // AGT indexToBase[25] = 'B'; // CGT indexToBase[26] = 'v'; // -ACG indexToBase[27] = 'h'; // -ACT indexToBase[28] = 'd'; // -AGT indexToBase[29] = 'b'; // -CGT indexToBase[30] = 'X'; // ACGT indexToBase[31] = 'x'; // -ACGT #endif for (int32 i=0; ibead(link)->base()); }; uint16 prevLink(void) { return(column->bead(link)->prevOffset()); }; uint16 nextLink(void) { return(column->bead(link)->nextOffset()); }; abColumn *prevColumn(void) { return(column->prev()); }; abColumn *nextColumn(void) { return(column->next()); }; abColumn *column; uint16 link; }; class abAbacus { public: abAbacus() { _sequencesLen = 0; _sequencesMax = 65536; _sequences = new abSequence * [_sequencesMax]; memset(_sequences, 0, sizeof(abSequence *) * _sequencesMax); _columnsLen = 0; _columnsMax = 1024 * 1024; _columns = new abColumn * [_columnsMax]; memset(_columns, 0, sizeof(abColumn *) * _columnsMax); _cnsBases = new char [_columnsMax]; _cnsQuals = new uint8 [_columnsMax]; memset(_cnsBases, 0, sizeof(char) * 
_columnsMax); memset(_cnsQuals, 0, sizeof(uint8) * _columnsMax); _firstColumn = NULL; readTofBead = NULL; readTolBead = NULL; if (DATAINITIALIZED == false) initializeGlobals(); }; ~abAbacus() { for (uint32 ss=0; ss<_sequencesLen; ss++) delete _sequences[ss]; for (abColumn *del = _firstColumn; (del = _firstColumn); ) { _firstColumn = _firstColumn->next(); delete del; } delete [] _sequences; delete [] _columns; delete [] _cnsBases; delete [] _cnsQuals; delete [] readTofBead; delete [] readTolBead; }; private: void initializeGlobals(void); public: char *bases(void) { return(_cnsBases); }; uint8 *quals(void) { return(_cnsQuals); }; // Adds gkpStore read 'readID' to the abacus; former AppendFragToLocalStore void addRead(gkStore *gkpStore, uint32 readID, uint32 askip, uint32 bskip, bool complemented, map *inPackageRead, map *inPackageReadData); public: void refreshColumns(void); void recallBases(bool highQuality = false); void appendBases(uint32 bid, uint32 bgn, uint32 end); void applyAlignment(uint32 bid, int32 ahang, int32 bhang, int32 *trace, uint32 traceLen); public: uint32 numberOfSequences(void) { return(_sequencesLen); }; abSequence *getSequence(uint32 sid) { assert(sid < _sequencesLen); return(_sequences[sid]); }; private: uint32 _sequencesLen; uint32 _sequencesMax; abSequence **_sequences; // The consensus sequence (actually, the multialign structure itself) Bases in columns get // shifted, but columns themselves do not move. However, we can insert columns. This array is // an in order list of the columns in the multialignment. public: abColumn *getFirstColumn(void) { return(_columns[0]); }; abColumn *getLastColumn(void) { return(_columns[_columnsLen-1]); }; uint32 numberOfColumns(void) { return(_columnsLen); }; abColumn *getColumn(uint32 cid) { assert(cid < _columnsLen); return(_columns[cid]); }; // Returns column position for position 'pos' in sequence 'seqIdx'. 
If the position is not contained // in the sequence, the remaining bases are assumed to be ungapped. uint32 getColumn(uint32 seqIdx, uint32 pos) { beadID bid = readTofBead[seqIdx]; uint32 cur = 0; while ((cur < pos) && (bid.nextLink() != UINT16_MAX)) { if (bid.base() != '-') // Count it only if it is a real base in the read. cur++; bid.link = bid.nextLink(); bid.column = bid.nextColumn(); } assert(bid.column != NULL); return(bid.column->position() + pos - cur); }; public: uint32 _columnsLen; uint32 _columnsMax; abColumn **_columns; // Sized in refreshColumns() char *_cnsBases; uint8 *_cnsQuals; abColumn *_firstColumn; public: // These maps are used to populate abSequence's first and last column pointers. // readToBead is the one we use for this. // beadToRead is used to ...? beadID *readTofBead; // Allocated once, after all reads are beadID *readTolBead; // added to us. map fbeadToRead; map lbeadToRead; // This is the former abMultiAlign private: //abBeadID unalignBeadFromColumn(abBeadID bid); //void unalignTrailingGapBeads(abBeadID bid); //void checkColumnBaseCounts(); //void lateralExchangeBead(abBeadID lid, abBeadID rid); public: void mergeColumns(bool highQuality); void getConsensus(tgTig *tig); uint32 getSequenceDeltas(uint32 sid, int32 *deltas); void getPositions(tgTig *tig); int32 refineWindow(abColumn *bgnCol_column, abColumn *terCol); int32 refine(abAbacusRefineLevel level, uint32 from = 0, uint32 to = UINT32_MAX); // There are two multiAlign displays; this one, and one in tgTig. void display(FILE *F); }; #endif canu-1.6/src/utgcns/libcns/abBead.H000066400000000000000000000103471314437614700171400ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. 
* * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_CNS/MultiAlignment_CNS.h * src/AS_CNS/MultiAlignment_CNS_private.H * src/AS_CNS/MultiAlignment_CNS_private.h * src/utgcns/libcns/MultiAlignment_CNS_private.H * * Modifications by: * * Gennady Denisov from 2005-MAY-23 to 2007-OCT-25 * are Copyright 2005-2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUL-08 to 2013-AUG-01 * are Copyright 2005-2009,2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Eli Venter from 2006-FEB-13 to 2008-FEB-13 * are Copyright 2006,2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2008-JAN-28 to 2009-SEP-25 * are Copyright 2008-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2011-SEP-21 * are Copyright 2011 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-NOV-17 to 2015-JAN-27 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-18 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
 */

#ifndef ABBEAD_H
#define ABBEAD_H

class abBead {
public:
  abBead() {
    clear();
  };
  ~abBead() {
  };

  void clear(void) {
    _unused     = 0;
    _isRead     = 0;
    _isUnitig   = 0;
    _base       = '-';
    _qual       = 0;
    _prevOffset = UINT16_MAX;
    _nextOffset = UINT16_MAX;
  };

  void initialize(char base, uint8 qual, uint16 prevOff, uint16 nextOff) {
    _base       = base;
    _qual       = qual;
    _prevOffset = prevOff;
    _nextOffset = nextOff;
  };

public:   //  ACCESSORS
  char    base(void)       { return(_base); };        //  Set only at init, or maybe
  char    qual(void)       { return(_qual); };        //  reset with swap()

  uint16  prevOffset(void) { return(_prevOffset); };  //  Generally set only at init,
  uint16  nextOffset(void) { return(_nextOffset); };  //  columns are added

public:   //  METHODS

private:
  uint16  _unused:1;
  uint16  _isRead:1;    //  If set, base is from an actual read, use it for consensus.
  uint16  _isUnitig:1;  //  If set, base is from a unitig, don't use it for consensus (unless needed).
  uint16  _base:7;      //  Base at this position.  (eventually will be encoded to 3 bits)
  uint16  _qual:6;      //  Quality at this position.
  uint16  _prevOffset;        //  Position in the array of beads for the previous column
  uint16  unused_thisOffset;  //  Position in the array of beads for this column
  uint16  _nextOffset;        //  Position in the array of beads for the next column

  friend class abAbacus;
  friend class abColumn;   //  inferPrevNextBeadPointers(), mergeWithNext()

  friend void swap(abBead &a, abBead &b);
};


inline
void
swap(abBead &a, abBead &b) {
  abBead c;

  c._isRead = b._isRead;  c._isUnitig = b._isUnitig;  c._base = b._base;  c._qual = b._qual;
  b._isRead = a._isRead;  b._isUnitig = a._isUnitig;  b._base = a._base;  b._qual = a._qual;
  a._isRead = c._isRead;  a._isUnitig = c._isUnitig;  a._base = c._base;  a._qual = c._qual;
}

#endif  //  ABBEAD_H

canu-1.6/src/utgcns/libcns/abColumn.C

/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  This file is derived from:
 *
 *    src/AS_CNS/MultiAlignment_CNS.C
 *    src/AS_CNS/MultiAlignment_CNS.c
 *    src/utgcns/libcns/MultiAlignment_CNS.C
 *
 *  Modifications by:
 *
 *    Michael Schatz on 2004-SEP-23
 *      are Copyright 2004 The Institute for Genomics Research, and
 *      are subject to the GNU General Public License version 2
 *
 *    Jason Miller on 2005-MAR-22
 *      are Copyright 2005 The Institute for Genomics Research, and
 *      are subject to the GNU General Public License version 2
 *
 *    Eli Venter from 2005-MAR-30 to 2008-FEB-13
 *      are Copyright 2005-2006,2008 J.
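abBead packs its flags, base, and quality into a single 16-bit bitfield (1+1+1+7+6 bits), so a bead stays small even in deep columns. A hedged sketch of the same packing done with explicit shifts and masks (illustrative layout, not canu's actual bitfield):

```cpp
#include <cassert>
#include <cstdint>

// Pack isRead(1) | isUnitig(1) | base(7) | qual(6) into 16 bits, mirroring
// the abBead bitfield above (one bit is left unused).
uint16_t packBead(bool isRead, bool isUnitig, char base, uint8_t qual) {
  return (uint16_t)(((isRead   ? 1 : 0) << 14) |
                    ((isUnitig ? 1 : 0) << 13) |
                    ((base & 0x7f)      <<  6) |
                     (qual & 0x3f));
}

char    beadBase(uint16_t b) { return (char)((b >> 6) & 0x7f); }
uint8_t beadQual(uint16_t b) { return (uint8_t)(b & 0x3f); }
```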
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Gennady Denisov from 2005-MAY-09 to 2008-JUN-06 * are Copyright 2005-2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Aaron Halpern from 2005-SEP-29 to 2006-OCT-03 * are Copyright 2005-2006 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2008-FEB-27 to 2009-MAY-14 * are Copyright 2008-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-NOV-17 to 2015-JAN-21 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-18 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "abAbacus.H" canu-1.6/src/utgcns/libcns/abColumn.H000066400000000000000000000221321314437614700175350ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * This file is derived from: * * src/AS_CNS/MultiAlignment_CNS.h * src/AS_CNS/MultiAlignment_CNS_private.H * src/AS_CNS/MultiAlignment_CNS_private.h * src/utgcns/libcns/MultiAlignment_CNS_private.H * * Modifications by: * * Gennady Denisov from 2005-MAY-23 to 2007-OCT-25 * are Copyright 2005-2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUL-08 to 2013-AUG-01 * are Copyright 2005-2009,2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Eli Venter from 2006-FEB-13 to 2008-FEB-13 * are Copyright 2006,2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2008-JAN-28 to 2009-SEP-25 * are Copyright 2008-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2011-SEP-21 * are Copyright 2011 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-NOV-17 to 2015-JUL-01 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-18 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef ABCOLUMN_H #define ABCOLUMN_H #undef BASECOUNT // Used in abAbacus-refine.C, which is disabled. 
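When BASECOUNT is enabled (it is used only by the disabled abAbacus-refine.C), abColumn tallies per-base counts and calls the column's majority base. A standalone sketch of that majority call; the alphabet order and tie-breaking here are illustrative, not canu's:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative consensus alphabet: gap first, then the four bases.
const char indexToBase[5] = { '-', 'A', 'C', 'G', 'T' };

// Return the most frequent symbol in a column, optionally skipping the
// gap symbol, as the BASECOUNT majorityBase() does.
char majorityBase(const uint32_t counts[5], bool ignoreGap) {
  uint32_t iMax = ignoreGap ? 1 : 0;   // start past '-' if gaps are ignored
  for (uint32_t ii = iMax + 1; ii < 5; ii++)
    if (counts[ii] > counts[iMax])
      iMax = ii;
  return indexToBase[iMax];
}
```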
#include "abBead.H" class abAbacus; class abColumn { public: abColumn() { _columnPosition = INT32_MAX; _call = '-'; _qual = 0; _prevColumn = NULL; _nextColumn = NULL; _beadsMax = 0; _beadsLen = 0; _beads = NULL; #if 0 _beadReadIDs = NULL; #endif #ifdef BASECOUNT for (int32 ii=0; ii= _baseCounts[iMax]) iMax = i; return(indexToBase[iMax]); }; uint32 majorityCount(bool ignoreGap) { return(_baseCounts[majorityBase(ignoreGap)]); }; uint32 mismatchCount(void) { return(depth() - majorityCount(false)); } #else uint32 baseCount(char ) { return(0); }; // Just a stub so abAbacus-refine.C will compile char majorityBase(bool ) { return(0); }; uint32 mismatchCount(void) { return(0); }; #endif char baseCall(void) { return(_call); }; char baseQual(void) { return(_qual); }; abColumn *prev(void) { return(_prevColumn); }; abColumn *next(void) { return(_nextColumn); }; private: void allocateInitialBeads(void); void inferPrevNextBeadPointers(void); public: uint16 insertAtBegin(abColumn *first, uint16 prevLink, char base, uint8 qual); uint16 insertAtEnd (abColumn *prev, uint16 prevLink, char base, uint8 qual); uint16 insertAfter (abColumn *prev, uint16 prevLink, char base, uint8 qual); uint16 alignBead(uint16 prevIndex, char base, uint8 qual); uint16 extendRead(abColumn *column, uint16 beadLink); bool mergeWithNext(abAbacus *abacus, bool highQuality); private: void baseCallMajority(void); void baseCallQuality(void); public: char baseCall(bool highQuality); public: abColumn *mergeNext(void); private: int32 _columnPosition; // coordinate of the column in the multialign; needs to be signed for compat with tgTig bgn/end char _call; // The base call for this column. uint8 _qual; // The quality of that base call. // 16 bytes of pointers. // Alternate schemes: // two 4 byte offsets into an arary of pointers, but that's also 16 bytes per column. // two 4 byte offsets into an array of objects, but we'd then need to realloc sometime. 
// two 4 byte offsets into a heap, but that adds a lot of complication. // abColumn *_prevColumn; abColumn *_nextColumn; public: abBead *bead(uint32 ii) { return(_beads + ii); }; private: uint16 _beadsMax; // Number of beads allocated uint16 _beadsLen; // Depth; number of reads that span this column abBead *_beads; // If allocated, the read idx (NOT gkpID) for each bead in the column. This will // be used to (efficiently) map arbitrary columns back to their reads when refining abacus // layouts. The data is never populated (needs to be done in applyAlignment()). // #if 0 public: bool beadReadIDsExist(void) { return(_beadReadIDs != NULL); }; uint32 beadReadID(uint32 xx) { return(_beadReadIDs[xx]); }; private: uint32 *_beadReadIDs; #endif #ifdef BASECOUNT uint16 _baseCounts[CNS_NUM_SYMBOLS]; #endif public: void checkLinks(void) { uint32 nErrors = 0; for (uint32 tt=0; tt<_beadsLen; tt++) { uint16 pl = _beads[tt].prevOffset(); uint16 nl = _beads[tt].nextOffset(); if ((prev() != NULL) && (pl != UINT16_MAX) && (prev()->_beads[pl].nextOffset() != tt)) fprintf(stderr, "ERROR: prev bead pl=%d nextOffset=%d != tt=%d\n", pl, prev()->_beads[pl].nextOffset(), tt), nErrors++; if ((prev() == NULL) && (pl != UINT16_MAX)) fprintf(stderr, "no prev column, yet defined link for column %p bead %d\n", this, tt), nErrors++; if ((next() != NULL) && (nl != UINT16_MAX) && (next()->_beads[nl].prevOffset() != tt)) fprintf(stderr, "ERROR: next bead pl=%d prevOffset=%d != tt=%d\n", nl, next()->_beads[nl].prevOffset(), tt), nErrors++; if ((next() == NULL) && (nl != UINT16_MAX)) fprintf(stderr, "no next column, yet defined link for column %p bead %d\n", this, tt), nErrors++; } if (nErrors == 0) return; fprintf(stderr, "column %d %p has %d error in links\n", position(), this, nErrors); showLinks(); assert(nErrors == 0); }; void showLinks(uint16 ppos=UINT16_MAX, uint16 tpos=UINT16_MAX, uint16 npos=UINT16_MAX) { uint16 pmax = (_prevColumn == NULL) ? 
0 : _prevColumn->_beadsLen; uint16 tmax = _beadsLen; uint16 nmax = (_nextColumn == NULL) ? 0 : _nextColumn->_beadsLen; uint16 lim = max(max(pmax, tmax), nmax); fprintf(stderr, "\n"); for (uint32 xx=0; xx_beads[xx].prevOffset() : UINT16_MAX, xx, (xx < pmax) ? _prevColumn->_beads[xx].nextOffset() : UINT16_MAX, (xx == ppos) ? '*' : ' ', (xx < tmax) ? _beads[xx].prevOffset() : UINT16_MAX, xx, (xx < tmax) ? _beads[xx].nextOffset() : UINT16_MAX, (xx == tpos) ? '*' : ' ', (xx < nmax) ? _nextColumn->_beads[xx].prevOffset() : UINT16_MAX, xx, (xx < nmax) ? _nextColumn->_beads[xx].nextOffset() : UINT16_MAX, (xx == npos) ? '*' : ' '); } }; friend class abAbacus; // friend bool mergeColumns(abColumn *lcolumn, abColumn *rcolumn); }; #endif // ABCOLUMN_H canu-1.6/src/utgcns/libcns/abMultiAlign.C000066400000000000000000000247071314437614700203520ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_CNS/MultiAlignment_CNS.C * src/AS_CNS/MultiAlignment_CNS.c * src/utgcns/libcns/MultiAlignment_CNS.C * * Modifications by: * * Michael Schatz on 2004-SEP-23 * are Copyright 2004 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2005-MAR-22 * are Copyright 2005 The Institute for Genomics Research, and * are subject to the GNU General Public License version 2 * * Eli Venter from 2005-MAR-30 to 2008-FEB-13 * are Copyright 2005-2006,2008 J. 
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Gennady Denisov from 2005-MAY-09 to 2008-JUN-06 * are Copyright 2005-2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUN-16 to 2013-AUG-01 * are Copyright 2005-2011,2013 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Aaron Halpern from 2005-SEP-29 to 2006-OCT-03 * are Copyright 2005-2006 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2008-FEB-27 to 2009-MAY-14 * are Copyright 2008-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-NOV-17 to 2015-AUG-11 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-01 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "abAbacus.H" void abAbacus::getConsensus(tgTig *tig) { // Resize the bases/quals storage to include a NUL byte. resizeArrayPair(tig->_gappedBases, tig->_gappedQuals, 0, tig->_gappedMax, _columnsLen + 1, resizeArray_doNothing); // Copy in the bases. memcpy(tig->_gappedBases, _cnsBases, sizeof(char) * _columnsLen); memcpy(tig->_gappedQuals, _cnsQuals, sizeof(uint8) * _columnsLen); // Terminate the strings. tig->_gappedBases[_columnsLen] = 0; tig->_gappedQuals[_columnsLen] = 0; // And set the length. tig->_gappedLen = _columnsLen; } // Crap. How can I figure out where a read starts in the multialign?? // Something needs to store the first column and index for the first bead....but // the column could be merged and removed. // // Put it in the column? 
A list of: // uint32 readId - the read id that starts here // uint16 link - the bead for that read // Adds one pointer to each column, most pointers are null. If the pointer is set, and we move bases, // we need to update. On a 32m unitig (pretty big) this adds 256mb + 1mb (100k reads at 8 bytes each) // // Put it in the read? If we move the first bead in the read, we know we need to update something, // but we have no link back to the read from the bead. // // It's so sparse - can we just hash it? Need to store: // readID -> column/link of first bead // column/link -> readID // The column probably needs to be a pointer (not the columnid) since the ID changes frequently. // Moving beads needs to update this structure, but any structure we make has the same problem. // index < length eliminates any endgaps from the delta list KAR, 09/19/02 uint32 abAbacus::getSequenceDeltas(uint32 sid, int32 *deltas) { abSequence *seq = getSequence(sid); beadID bid = readTofBead[sid]; abColumn *column = bid.column; uint16 link = bid.link; uint32 dl = 0; uint32 bp = 0; while (link != UINT16_MAX) { assert(column != NULL); char base = column->_beads[link].base(); if (base == '-') { if (deltas) deltas[dl] = bp; dl++; } else bp++; link = column->_beads[link].nextOffset(); column = column->next(); } if (deltas) deltas[dl] = 0; dl++; return(dl); } void abAbacus::getPositions(tgTig *tig) { uint32 nd = 0; for (uint32 si=0; si_childDeltas, tig->_childDeltasLen, tig->_childDeltasMax, nd, resizeArray_doNothing); tig->_childDeltasLen = 0; int32 maxPos = 0; for (uint32 si=0; sigetChild(si); // Assumes one-to-one map of seqs to children; assert(seq->gkpIdent() == child->ident()); beadID fBead = readTofBead[si]; beadID lBead = readTolBead[si]; if (fBead.column == NULL) { fprintf(stderr, "WARNING: read %u not in multialignment; position set to 0,0.\n", seq->gkpIdent()); child->setMinMax(0, 0); continue; } assert(fBead.column != NULL); assert(lBead.column != NULL); // Positions are zero-based and 
inclusive. The end position gets one added to it to make it true space-based. int32 min = fBead.column->position(); int32 max = lBead.column->position() + 1; if (maxPos < min) maxPos = min; if (maxPos < max) maxPos = max; if (seq->isRead() == true) { child->setMinMax(min, max); // The child remembers its orientation, and sets min/max appropriately. // Grab the deltas for real now, and set the deltaLen to one less than the number returned // (we don't care about the terminating zero!) child->_deltaOffset = tig->_childDeltasLen; child->_deltaLen = getSequenceDeltas(si, tig->_childDeltas + tig->_childDeltasLen); tig->_childDeltasLen += child->_deltaLen; child->_deltaLen--; //fprintf(stderr, "getPositions()-- si %u deltaLen %u offset %u\n", // si.get(), child->_deltaLen, child->_deltaOffset); } } // Hopefully, maxPos and _gappedLen agree. If not....? Note that positions are space-based if (maxPos != tig->_gappedLen) fprintf(stderr, "WARNING: maxPos=%d differs from gappedLen=%d. Huh?\n", maxPos, tig->_gappedLen); tig->_layoutLen = tig->_gappedLen; } // If called from unitigConsensus, this needs a rebuild() first. void abAbacus::display(FILE *F) { int32 pageWidth = 250; // For display, offset the quals to Sanger spec. for (uint32 ii=0; iigkpIdent(); fit[i].column = NULL; fit[i].link = UINT16_MAX; type[i] = (seq->isRead() == true) ? 
'R' : '?'; bgn[i] = bgnColumn->position(); end[i] = endColumn->position(); } } fprintf(F,"\n==================== abAbacus::display ====================\n"); for (uint32 window_start=0; window_start < numberOfColumns(); ) { fprintf(F,"\n"); fprintf(F,"%d - %d (length %d)\n", window_start, window_start + pageWidth, numberOfColumns()); fprintf(F,"%-*.*s <<< consensus\n", pageWidth, pageWidth, _cnsBases + window_start); fprintf(F,"%-*.*s <<< quality\n\n", pageWidth, pageWidth, _cnsQuals + window_start); for (uint32 i=0; iposition() < wi) { fit[i].link = fit[i].column->_beads[fit[i].link].nextOffset(); fit[i].column = fit[i].column->next(); } } char pc = fit[i].column->_beads[fit[i].link].base(); if (pc == _cnsBases[wi]) pc = tolower(pc); else pc = toupper(pc); fprintf(F, "%c", pc); fit[i].link = fit[i].column->_beads[fit[i].link].nextOffset(); fit[i].column = fit[i].column->next(); } // Spaces after the read else if (end[i] < wi) { fprintf(F, ","); } // Sequence isn't involved in this window. else { fprintf(stderr, "bgn=%d end=%d wi=%d\n", bgn[i], end[i], wi); assert(0); } // If at the end of the window, print the id of the object we just printed. // if (wi == window_start + pageWidth - 1) fprintf(F," <<< %d (%c)\n", fid[i], type[i]); } } window_start += pageWidth; } // Unoffset the quals back to integers. for (uint32 ii=0; ii using namespace std; // Define this. Use the faster aligner from overlapper. If not defined, // a full O(n^2) DP is computed. 
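abAbacus::display() above prints the consensus in fixed-width windows (pageWidth columns per block, 250 in the code). The pagination pattern, sketched standalone with an illustrative helper:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Print a consensus string in fixed-width windows, as display() does.
// Returns the number of windows emitted.  The "%-*.*s" format pads and
// truncates each window to exactly pageWidth characters.
uint32_t displayWindows(FILE *F, const char *cns, uint32_t pageWidth) {
  uint32_t len      = (uint32_t)strlen(cns);
  uint32_t nWindows = 0;
  for (uint32_t bgn = 0; bgn < len; bgn += pageWidth) {
    fprintf(F, "%u - %u (length %u)\n", bgn, bgn + pageWidth, len);
    fprintf(F, "%-*.*s <<< consensus\n", pageWidth, pageWidth, cns + bgn);
    nWindows++;
  }
  return nWindows;
}
```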
// #undef WITH_NDALIGN #define WITH_NDALIGN unitigConsensus::unitigConsensus(gkStore *gkpStore_, double errorRate_, double errorRateMax_, uint32 minOverlap_) { gkpStore = gkpStore_; tig = NULL; numfrags = 0; trace = NULL; abacus = NULL; utgpos = NULL; cnspos = NULL; tiid = 0; piid = -1; minOverlap = minOverlap_; errorRate = errorRate_; errorRateMax = errorRateMax_; oaPartial = NULL; oaFull = NULL; } unitigConsensus::~unitigConsensus() { delete [] trace; delete abacus; delete [] utgpos; delete [] cnspos; delete oaPartial; delete oaFull; } void unitigConsensus::reportStartingWork(void) { if (showProgress()) fprintf(stderr, "unitigConsensus()-- processing read %u/%u id %d pos %d,%d anchor %d,%d,%d -- length %u\n", tiid+1, numfrags, utgpos[tiid].ident(), utgpos[tiid].min(), utgpos[tiid].max(), utgpos[tiid].anchor(), utgpos[tiid].aHang(), utgpos[tiid].bHang(), abacus->numberOfColumns()); if (showPlacementBefore()) for (int32 x=0; x<=tiid; x++) fprintf(stderr, "unitigConsensus()-- mid %10d utgpos %7d,%7d cnspos %7d,%7d anchor %10d,%6d,%6d\n", utgpos[x].ident(), utgpos[x].min(), utgpos[x].max(), cnspos[x].min(), cnspos[x].max(), utgpos[x].anchor(), utgpos[x].aHang(), utgpos[x].bHang()); } void unitigConsensus::reportFailure(void) { fprintf(stderr, "unitigConsensus()-- failed to align fragment %d in unitig %d.\n", utgpos[tiid].ident(), tig->tigID()); } void unitigConsensus::reportSuccess() { //fprintf(stderr, "unitigConsensus()-- fragment %d aligned in unitig %d.\n", // utgpos[tiid].ident(), tig->tigID()); } // Dump the unitig and reads to a single file. We should also probably save any parameters, // but then it's not clear what to do with them. // bool unitigConsensus::savePackage(FILE *outPackageFile, tgTig *tig) { // Saving the tig is easy, just use the standard dump. tig->saveToStream(outPackageFile); // Saving the reads is also easy, but it's a non-standard dump. 
  for (uint32 ii=0; ii<tig->numberOfChildren(); ii++)
    gkpStore->gkStore_saveReadToStream(outPackageFile, tig->getChild(ii)->ident());

  return(true);
}



bool
unitigConsensus::generate(tgTig                     *tig_,
                          map<uint32, gkRead *>     *inPackageRead_,
                          map<uint32, gkReadData *> *inPackageReadData_) {
  tig      = tig_;
  numfrags = tig->numberOfChildren();

  if (initialize(inPackageRead_, inPackageReadData_) == FALSE) {
    fprintf(stderr, "generate()-- Failed to initialize for tig %u with %u children\n",
            tig->tigID(), tig->numberOfChildren());
    goto returnFailure;
  }

  while (moreFragments()) {
    reportStartingWork();

    //  First attempt, all default parameters

    if (computePositionFromAnchor()    && alignFragment())  goto applyAlignment;
    if (computePositionFromLayout()    && alignFragment())  goto applyAlignment;
    if (computePositionFromAlignment() && alignFragment())  goto applyAlignment;

    //  Second attempt, default parameters after recomputing consensus sequence.

    if (showAlgorithm())
      fprintf(stderr, "generate()-- recompute full consensus\n");

    recomputeConsensus(showMultiAlignments());

    if (computePositionFromAnchor()    && alignFragment())  goto applyAlignment;
    if (computePositionFromLayout()    && alignFragment())  goto applyAlignment;
    if (computePositionFromAlignment() && alignFragment())  goto applyAlignment;

    //  Third attempt, use whatever aligns.  (alignFragment(true) forced it to align,
    //  but that's breaking the consensus with garbage alignments)

    if (computePositionFromAlignment() && alignFragment(true))  goto applyAlignment;

    //  Nope, failed to align.

    reportFailure();
    continue;

  applyAlignment:
    setErrorRate(errorRate);
    setMinOverlap(minOverlap);

    reportSuccess();

    abacus->applyAlignment(tiid, traceABgn, traceBBgn, trace, traceLen);

    refreshPositions();
  }

  generateConsensus(tig);

  return(true);

returnFailure:
  fprintf(stderr, "generate()-- unitig %d FAILED.\n", tig->tigID());

  //  tgTig should have no changes.
return(false); } char * generateTemplateStitch(abAbacus *abacus, tgPosition *utgpos, uint32 numfrags, double errorRate, bool verbose) { int32 minOlap = 500; // Initialize, copy the first read. uint32 rid = 0; abSequence *seq = abacus->getSequence(rid); char *fragment = seq->getBases(); uint32 readLen = seq->length(); uint32 tigmax = AS_MAX_READLEN; // Must be at least AS_MAX_READLEN, else resizeArray() could fail uint32 tiglen = 0; char *tigseq = NULL; allocateArray(tigseq, tigmax, resizeArray_clearNew); if (verbose) { fprintf(stderr, "\n"); fprintf(stderr, "generateTemplateStitch()-- COPY READ read #%d %d (len=%d to %d-%d)\n", 0, utgpos[0].ident(), readLen, utgpos[0].min(), utgpos[0].max()); } for (uint32 ii=0; iigetSequence(rid); fragment = seq->getBases(); readLen = seq->length(); int32 readBgn; int32 readEnd; EdlibAlignResult result; bool aligned = false; double templateSize = 0.80; double extensionSize = 0.20; int32 olapLen = ePos - utgpos[nr].min(); // The expected size of the overlap int32 templateLen = 0; int32 extensionLen = 0; alignAgain: templateLen = (int32)ceil(olapLen * templateSize); // Extract 80% of the expected overlap size extensionLen = (int32)ceil(olapLen * extensionSize); // Extend read by 20% of the expected overlap size readBgn = 0; readEnd = olapLen + extensionLen; if (readEnd > readLen) readEnd = readLen; if (verbose) { fprintf(stderr, "\n"); fprintf(stderr, "generateTemplateStitch()-- ALIGN template %d-%d (len=%d) to read #%d %d %d-%d (len=%d actual=%d at %d-%d) expecting olap of %d\n", tiglen - templateLen, tiglen, templateLen, nr, utgpos[nr].ident(), readBgn, readEnd, readEnd - readBgn, readLen, utgpos[nr].min(), utgpos[nr].max(), olapLen); } result = edlibAlign(tigseq + tiglen - templateLen, templateLen, fragment, readEnd - readBgn, edlibNewAlignConfig(olapLen * errorRate, EDLIB_MODE_HW, EDLIB_TASK_PATH)); // We're expecting the template to align inside the read. 
// // v- always the end // TEMPLATE --------------------------[---------------] // READ [------------------------------]--------- // always the start -^ // // If we don't find an alignment at all, we move the template start point to the right (making // the template smaller) and also move the read end point to the right (making the read // bigger). bool tryAgain = false; bool noResult = (result.numLocations == 0); bool gotResult = (result.numLocations > 0); bool hitTheStart = (gotResult) && (result.startLocations[0] == 0); bool hitTheEnd = (gotResult) && (result.endLocations[0] + 1 == readEnd - readBgn); bool moreToExtend = (readEnd < readLen); // HOWEVER, if we get a result and it's near perfect, declare success even if we hit the start. // These are simple repeats that will align with any overlap. The one BPW debugged was 99+% A. if ((gotResult == true) && (hitTheStart == true) && ((double)result.editDistance / result.alignmentLength < 0.1)) { hitTheStart = false; } // NOTE that if we hit the end with the same conditions, we should try again, unless there // isn't anything left. In that case, we don't extend the template. if ((gotResult == true) && (hitTheEnd == true) && (moreToExtend == false) && ((double)result.editDistance / result.alignmentLength < 0.1)) { hitTheEnd = false; } // Now, report what happened, and maybe try again. if (verbose) if (noResult) fprintf(stderr, "generateTemplateStitch()-- FAILED to align - no result\n"); else fprintf(stderr, "generateTemplateStitch()-- FOUND alignment at %d-%d editDist %d alignLen %d %.f%%\n", result.startLocations[0], result.endLocations[0]+1, result.editDistance, result.alignmentLength, (double)result.editDistance / result.alignmentLength); if ((noResult) || (hitTheStart)) { if (verbose) fprintf(stderr, "generateTemplateStitch()-- FAILED to align - %s - decrease template size by 10%%\n", (noResult == true) ? 
"no result" : "hit the start"); tryAgain = true; templateSize -= 0.10; } if ((noResult) || (hitTheEnd && moreToExtend)) { if (verbose) fprintf(stderr, "generateTemplateStitch()-- FAILED to align - %s - increase read size by 10%%\n", (noResult == true) ? "no result" : "hit the end"); tryAgain = true; extensionSize += 0.10; } if (tryAgain) { edlibFreeAlignResult(result); goto alignAgain; } readBgn = result.startLocations[0]; // Expected to be zero readEnd = result.endLocations[0] + 1; // Where we need to start copying the read edlibFreeAlignResult(result); if (verbose) fprintf(stderr, "generateTemplateStitch()-- Aligned template %d-%d to read %u %d-%d; copy read %d-%d to template.\n", tiglen - templateLen, tiglen, nr, readBgn, readEnd, readEnd, readLen); increaseArray(tigseq, tiglen, tigmax, tiglen + readLen - readEnd + 1); for (uint32 ii=readEnd; ii= 100000) && ((pd < -50.0) || (pd > 50.0))) fprintf(stderr, "generateTemplateStitch()-- significant size difference, stopping.\n"); assert((tiglen < 100000) || ((-50.0 <= pd) && (pd <= 50.0))); return(tigseq); } bool alignEdLib(dagAlignment &aln, tgPosition &utgpos, char *fragment, uint32 fragmentLength, char *tigseq, uint32 tiglen, double lengthScale, double errorRate, bool normalize, bool verbose) { EdlibAlignResult align; int32 padding = (int32)ceil(fragmentLength * 0.10); double bandErrRate = errorRate / 2; bool aligned = false; double alignedErrRate = 0.0; // Decide on where to align this read. // But, the utgpos positions are largely bogus, especially at the end of the tig. utgcns (the // original) used to track positions of previously placed reads, find an overlap beterrn this // read and the last read, and use that info to find the coordinates for the new read. That was // very complicated. Here, we just linearly scale. 
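generateTemplateStitch() sizes each alignment attempt from the expected overlap: 80% of it as template and 20% as read extension, then shrinks the template fraction and grows the extension fraction by 10% on each failed retry. The sizing arithmetic, sketched with the fractions from the code above (the per-attempt form is an illustrative restatement of the retry loop):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Template/extension lengths for 0-based retry 'attempt': start at
// 80%/20% of the expected overlap, then shift 10% per retry from the
// template to the extension, as generateTemplateStitch() does.
int32_t templateLenFor(int32_t olapLen, uint32_t attempt) {
  return (int32_t)ceil(olapLen * (0.80 - 0.10 * attempt));
}
int32_t extensionLenFor(int32_t olapLen, uint32_t attempt) {
  return (int32_t)ceil(olapLen * (0.20 + 0.10 * attempt));
}
```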
int32 tigbgn = max((int32)0, (int32)floor(lengthScale * utgpos.min() - padding)); int32 tigend = min((int32)tiglen, (int32)floor(lengthScale * utgpos.max() + padding)); if (verbose) fprintf(stderr, "alignEdLib()-- align read %7u eRate %.4f at %9d-%-9d", utgpos.ident(), bandErrRate, tigbgn, tigend); // This occurs if we don't lengthScale the positions. if (tigend < tigbgn) fprintf(stderr, "alignEdLib()-- ERROR: tigbgn %d > tigend %d - tiglen %d utgpos %d-%d padding %d\n", tigbgn, tigend, tiglen, utgpos.min(), utgpos.max(), padding); assert(tigend > tigbgn); // Align! If there is an alignment, compute error rate and declare success if acceptable. align = edlibAlign(fragment, fragmentLength, tigseq + tigbgn, tigend - tigbgn, edlibNewAlignConfig(bandErrRate * fragmentLength, EDLIB_MODE_HW, EDLIB_TASK_PATH)); if (align.alignmentLength > 0) { alignedErrRate = (double)align.editDistance / align.alignmentLength; aligned = (alignedErrRate <= errorRate); if (verbose) fprintf(stderr, " - ALIGNED %.4f at %9d-%-9d\n", alignedErrRate, tigbgn + align.startLocations[0], tigbgn + align.endLocations[0]+1); } else { if (verbose) fprintf(stderr, "\n"); } for (uint32 ii=0; ((ii < 4) && (aligned == false)); ii++) { tigbgn = max((int32)0, tigbgn - 2 * padding); tigend = min((int32)tiglen, tigend + 2 * padding); bandErrRate += errorRate / 2; edlibFreeAlignResult(align); if (verbose) fprintf(stderr, "alignEdLib()-- eRate %.4f at %9d-%-9d", bandErrRate, tigbgn, tigend); align = edlibAlign(fragment, strlen(fragment), tigseq + tigbgn, tigend - tigbgn, edlibNewAlignConfig(bandErrRate * fragmentLength, EDLIB_MODE_HW, EDLIB_TASK_PATH)); if (align.alignmentLength > 0) { alignedErrRate = (double)align.editDistance / align.alignmentLength; aligned = (alignedErrRate <= errorRate); if (verbose) fprintf(stderr, " - ALIGNED %.4f at %9d-%-9d\n", alignedErrRate, tigbgn + align.startLocations[0], tigbgn + align.endLocations[0]+1); } else { if (verbose) fprintf(stderr, "\n"); } } if (aligned == false) { 
    edlibFreeAlignResult(align);
    return(false);
  }

  char *tgtaln = new char [align.alignmentLength+1];
  char *qryaln = new char [align.alignmentLength+1];

  memset(tgtaln, 0, sizeof(char) * (align.alignmentLength+1));
  memset(qryaln, 0, sizeof(char) * (align.alignmentLength+1));

  edlibAlignmentToStrings(align.alignment,           //  Alignment
                          align.alignmentLength,     //    and length
                          align.startLocations[0],   //  tgtStart
                          align.endLocations[0]+1,   //  tgtEnd
                          0,                         //  qryStart
                          fragmentLength,            //  qryEnd
                          tigseq + tigbgn,           //  tgt sequence
                          fragment,                  //  qry sequence
                          tgtaln,                    //  output tgt alignment string
                          qryaln);                   //  output qry alignment string

  //  Populate the output.  AlnGraphBoost does not handle mismatch alignments, at all, so convert
  //  them to a pair of indel.

  uint32 nMatch = 0;

  for (uint32 ii=0; ii<align.alignmentLength; ii++)    //  Count the mismatches; each one becomes
    if (align.alignment[ii] == EDLIB_EDOP_MISMATCH)    //  an extra column in the output strings.
      nMatch++;

  aln.start  = tigbgn + align.startLocations[0] + 1;   //  dagAlignment positions are 1-based.
  aln.end    = tigbgn + align.endLocations[0] + 1;
  aln.qstr   = new char [align.alignmentLength + nMatch + 1];
  aln.tstr   = new char [align.alignmentLength + nMatch + 1];

  for (uint32 ii=0, jj=0; ii<align.alignmentLength; ii++) {
    char  tc = tgtaln[ii];
    char  qc = qryaln[ii];

    if ((tc == qc) || (tc == '-') || (qc == '-')) {    //  Match or indel, copy as is.
      aln.tstr[jj] = tc;   aln.qstr[jj] = qc;   jj++;
    } else {                                           //  Mismatch, convert to a pair of indel.
      aln.tstr[jj] = '-';  aln.qstr[jj] = qc;   jj++;
      aln.tstr[jj] = tc;   aln.qstr[jj] = '-';  jj++;
    }

    aln.length = jj;
  }

  aln.qstr[aln.length] = 0;
  aln.tstr[aln.length] = 0;

  delete [] tgtaln;
  delete [] qryaln;

  edlibFreeAlignResult(align);

  if (normalize)
    aln.normalizeGaps();

  if (aln.end > tiglen)
    fprintf(stderr, "ERROR: alignment from %d to %d, but tiglen is only %d\n", aln.start, aln.end, tiglen);
  assert(aln.end <= tiglen);

  return(true);
}

void
realignReads() {
#ifdef REALIGN
  //  update positions, this requires remapping but this time to the final consensus, turned off for now
  uint32 minPos = cns.size();
  uint32 maxPos = 0;

#pragma omp parallel for schedule(dynamic)
  for (uint32 i=0; i<numfrags; i++) {
    abSequence  *seq           = abacus->getSequence(i);
    uint32       bandTolerance = (int32)round((double)(seq->length() * errorRate)) * 2;
    uint32       maxExtend     = (int32)round((double)seq->length() * 0.01) + 1;
    int32        padding       = bandTolerance;
    uint32       start         = max((int32)0, (int32)utgpos[i].min() - padding);
    uint32       end           = min((int32)cns.size(), (int32)utgpos[i].max() + padding);

    EdlibAlignResult align = edlibAlign(seq->getBases(), seq->length()-1,
                                        cns.c_str()+start, end-start+1,
                                        edlibNewAlignConfig(bandTolerance, EDLIB_MODE_HW, EDLIB_TASK_LOC));

    if (align.numLocations > 0) {
      cnspos[i].setMinMax(align.startLocations[0]+start, align.endLocations[0]+start+1);

      //  when we are very close to end extend
      if (cnspos[i].max() < cns.size() &&
          cns.size() - cnspos[i].max() <= maxExtend &&
          (align.editDistance + cns.size() - cnspos[i].max()) < bandTolerance) {
        cnspos[i].setMinMax(cnspos[i].min(), cns.size());
      }

#pragma omp critical (trackMin)
      if (cnspos[i].min() < minPos)   minPos = cnspos[i].min();
#pragma omp critical (trackMax)
      if (cnspos[i].max() > maxPos)   maxPos = cnspos[i].max();
    } else {
    }

    edlibFreeAlignResult(align);
  }

  memcpy(tig->getChild(0), cnspos, sizeof(tgPosition) * numfrags);

  //  trim consensus if needed
  if (maxPos < cns.size())
    cns = cns.substr(0, maxPos);

  assert(minPos == 0);
  assert(maxPos == cns.size());
#endif
}

bool
unitigConsensus::generatePBDAG(char                        aligner,
                               bool                        normalize,
                               tgTig                      *tig_,
                               map<uint32, gkRead *>      *inPackageRead_,
                               map<uint32, gkReadData *>  *inPackageReadData_) {
  bool verbose = (tig_->_utgcns_verboseLevel > 1);

  tig      = tig_;
  numfrags = tig->numberOfChildren();

  if (initialize(inPackageRead_, inPackageReadData_) == FALSE) {
    fprintf(stderr, "generatePBDAG()-- Failed to initialize for tig %u with %u children\n",
            tig->tigID(), tig->numberOfChildren());
    return(false);
  }

  //  Build a quick consensus to align to.

  char   *tigseq = generateTemplateStitch(abacus, utgpos, numfrags, errorRate, tig->_utgcns_verboseLevel);
  uint32  tiglen = strlen(tigseq);

  fprintf(stderr, "Generated template of length %d\n", tiglen);

  //  Compute alignments of each sequence in parallel

  fprintf(stderr, "Aligning reads.\n");

  dagAlignment *aligns = new dagAlignment [numfrags];
  uint32        pass   = 0;
  uint32        fail   = 0;

#pragma omp parallel for schedule(dynamic)
  for (uint32 ii=0; ii<numfrags; ii++) {
    abSequence  *seq     = abacus->getSequence(ii);
    bool         aligned = false;

    assert(aligner == 'E');   //  Maybe later we'll have more than one aligner again.

    aligned = alignEdLib(aligns[ii],
                         utgpos[ii],
                         seq->getBases(), seq->length(),
                         tigseq, tiglen,
                         (double)tiglen / tig->_layoutLen,
                         errorRate,
                         normalize,
                         verbose);

    if (aligned == false) {
      if (verbose)
        fprintf(stderr, "generatePBDAG()-- read %7u FAILED\n", utgpos[ii].ident());

      fail++;
      continue;
    }

    pass++;
  }

  fprintf(stderr, "Finished aligning reads. %d failed, %d passed.\n", fail, pass);

  //  Construct the graph from the alignments.  This is not thread safe.
  fprintf(stderr, "Constructing graph\n");

  AlnGraphBoost ag(string(tigseq, tiglen));

  for (uint32 ii=0; ii<numfrags; ii++) {
    if (aligns[ii].length == 0)   //  Skip reads that failed to align.
      continue;

    ag.addAln(aligns[ii]);

    aligns[ii].clear();
  }

  delete [] aligns;

  //  Merge the nodes and call consensus.

  ag.mergeNodes();

  std::string cns = ag.consensus(1);

  delete [] tigseq;

  //  Save consensus

  resizeArrayPair(tig->_gappedBases, tig->_gappedQuals, 0, tig->_gappedMax, (uint32) cns.length() + 1, resizeArray_doNothing);

  std::string::size_type len = 0;

  for (len=0; len<cns.length(); len++) {
    tig->_gappedBases[len] = cns[len];
    tig->_gappedQuals[len] = CNS_MIN_QV;
  }

  //  Terminate the string.

  tig->_gappedBases[len] = 0;
  tig->_gappedQuals[len] = 0;
  tig->_gappedLen        = len;
  tig->_layoutLen        = len;

  assert(len < tig->_gappedMax);

  return(true);
}

bool
unitigConsensus::generateQuick(tgTig                      *tig_,
                               map<uint32, gkRead *>      *inPackageRead_,
                               map<uint32, gkReadData *>  *inPackageReadData_) {
  tig      = tig_;
  numfrags = tig->numberOfChildren();

  if (initialize(inPackageRead_, inPackageReadData_) == FALSE) {
    fprintf(stderr, "generatePBDAG()-- Failed to initialize for tig %u with %u children\n",
            tig->tigID(), tig->numberOfChildren());
    return(false);
  }

  //  Quick is just the template sequence, so one and done!

  char   *tigseq = generateTemplateStitch(abacus, utgpos, numfrags, errorRate, tig->_utgcns_verboseLevel);
  uint32  tiglen = strlen(tigseq);

  //  Save consensus

  resizeArrayPair(tig->_gappedBases, tig->_gappedQuals, 0, tig->_gappedMax, tiglen + 1, resizeArray_doNothing);

  for (uint32 ii=0; ii<tiglen; ii++) {
    tig->_gappedBases[ii] = tigseq[ii];
    tig->_gappedQuals[ii] = CNS_MIN_QV;
  }

  //  Terminate the string.

  tig->_gappedBases[tiglen] = 0;
  tig->_gappedQuals[tiglen] = 0;
  tig->_gappedLen           = tiglen;
  tig->_layoutLen           = tiglen;

  delete [] tigseq;

  return(true);
}

bool
unitigConsensus::generateSingleton(tgTig                      *tig_,
                                   map<uint32, gkRead *>      *inPackageRead_,
                                   map<uint32, gkReadData *>  *inPackageReadData_) {
  tig      = tig_;
  numfrags = tig->numberOfChildren();

  assert(numfrags == 1);

  if (initialize(inPackageRead_, inPackageReadData_) == FALSE) {
    fprintf(stderr, "generatePBDAG()-- Failed to initialize for tig %u with %u children\n",
            tig->tigID(), tig->numberOfChildren());
    return(false);
  }

  //  Copy the single read to the tig sequence.
  abSequence  *seq      = abacus->getSequence(0);
  char        *fragment = seq->getBases();
  uint32       readLen  = seq->length();

  resizeArrayPair(tig->_gappedBases, tig->_gappedQuals, 0, tig->_gappedMax, readLen + 1, resizeArray_doNothing);

  for (uint32 ii=0; ii<readLen; ii++) {
    tig->_gappedBases[ii] = fragment[ii];
    tig->_gappedQuals[ii] = CNS_MIN_QV;
  }

  //  Terminate the string.

  tig->_gappedBases[readLen] = 0;
  tig->_gappedQuals[readLen] = 0;
  tig->_gappedLen            = readLen;
  tig->_layoutLen            = readLen;

  return(true);
}

int
unitigConsensus::initialize(map<uint32, gkRead *>      *inPackageRead,
                            map<uint32, gkReadData *>  *inPackageReadData) {

  int32 num_columns = 0;
  //int32 num_bases   = 0;

  if (numfrags == 0) {
    fprintf(stderr, "utgCns::initialize()-- unitig has no children.\n");
    return(false);
  }

  utgpos = new tgPosition [numfrags];
  cnspos = new tgPosition [numfrags];

  memcpy(utgpos, tig->getChild(0), sizeof(tgPosition) * numfrags);
  memcpy(cnspos, tig->getChild(0), sizeof(tgPosition) * numfrags);

  traceLen  = 0;
  trace     = new int32 [2 * AS_MAX_READLEN];

  traceABgn = 0;
  traceBBgn = 0;

  memset(trace, 0, sizeof(int32) * 2 * AS_MAX_READLEN);

  abacus = new abAbacus();

  //  Clear the cnspos position.  We use this to show it's been placed by consensus.
  //  Guess the number of columns we'll end up with.
  //  Initialize abacus with the reads.

  for (int32 i=0; i<numfrags; i++) {
    cnspos[i].setMinMax(0, 0);

    num_columns  = (utgpos[i].min() > num_columns) ? utgpos[i].min() : num_columns;
    num_columns  = (utgpos[i].max() > num_columns) ? utgpos[i].max() : num_columns;

    abacus->addRead(gkpStore,
                    utgpos[i].ident(),
                    utgpos[i]._askip, utgpos[i]._bskip,
                    utgpos[i].isReverse(),
                    inPackageRead,
                    inPackageReadData);
  }

  //  Check for duplicate reads

  {
    set<uint32>  dupFrag;

    for (uint32 i=0; i<numfrags; i++) {
      if (utgpos[i].isRead() == false) {
        fprintf(stderr, "unitigConsensus()-- Unitig %d FAILED.  Child %d is not a read.\n",
                tig->tigID(), utgpos[i].ident());
        return(false);
      }

      if (dupFrag.find(utgpos[i].ident()) != dupFrag.end()) {
        fprintf(stderr, "unitigConsensus()-- Unitig %d FAILED.  Child %d is a duplicate.\n",
                tig->tigID(), utgpos[i].ident());
        return(false);
      }

      dupFrag.insert(utgpos[i].ident());
    }
  }

  //  Initialize with the first read.

  abacus->applyAlignment(0, 0, 0, NULL, 0);

  //  And set the placement of the first read.
cnspos[0].setMinMax(0, abacus->numberOfColumns()); return(true); } int unitigConsensus::computePositionFromAnchor(void) { assert(piid == -1); uint32 anchor = utgpos[tiid].anchor(); if (anchor == 0) // No anchor?! Damn. goto computePositionFromAnchorFail; for (piid = tiid-1; piid >= 0; piid--) { abSequence *aseq = abacus->getSequence(piid); if (anchor != aseq->gkpIdent()) // Not the anchor. continue; if ((cnspos[piid].min() == 0) && (cnspos[piid].max() == 0)) // Is the anchor, but that isn't placed. goto computePositionFromAnchorFail; if ((utgpos[piid].max() < utgpos[tiid].min()) || (utgpos[tiid].max() < utgpos[piid].min())) { // Is the anchor, and anchor is placed, but the anchor doesn't agree with the placement. if (showPlacement()) fprintf(stderr, "computePositionFromAnchor()-- anchor %d at utg %d,%d doesn't agree with my utg %d,%d. FAIL\n", anchor, utgpos[piid].min(), utgpos[piid].max(), utgpos[tiid].min(), utgpos[tiid].max()); goto computePositionFromAnchorFail; } // Scale the hangs by the change in the anchor size between bogart and consensus. #if 0 double anchorScale = (double)(cnspos[piid].max() - cnspos[piid].min()) / (double)(utgpos[piid].max() - utgpos[piid].min()); if (showPlacement()) fprintf(stderr, "computePositionFromAnchor()-- frag %u in anchor %u -- hangs %d,%d -- scale %f -- final hangs %.0f,%.0f\n", utgpos[tiid].ident(), utgpos[piid].ident(), utgpos[tiid].aHang(), utgpos[tiid].bHang(), anchorScale, utgpos[tiid].aHang() * anchorScale, utgpos[tiid].bHang() * anchorScale); cnspos[tiid].setMinMax(cnspos[piid].min() + utgpos[tiid].aHang() * anchorScale, cnspos[piid].max() + utgpos[tiid].bHang() * anchorScale); // Hmmm, but if we shrank the read too much, add back in some of the length. We want to end up // with the read scaled by anchorScale, and centered on the hangs. 
int32 fragmentLength = utgpos[tiid].max() - utgpos[tiid].min(); if ((cnspos[tiid].min() >= cnspos[tiid].max()) || (cnspos[tiid].max() - cnspos[tiid].min() < 0.75 * fragmentLength)) { int32 center = (cnspos[tiid].min() + cnspos[tiid].max()) / 2; if (showPlacement()) { fprintf(stderr, "computePositionFromAnchor()-- frag %u in anchor %u -- too short. reposition around center %d with adjusted length %.0f\n", utgpos[tiid].ident(), utgpos[piid].ident(), center, fragmentLength * anchorScale); } cnspos[tiid].setMinMax(center - fragmentLength * anchorScale / 2, center + fragmentLength * anchorScale / 2); // We seem immune to having a negative position. We only use this to pull out a region from // the partial consensus to align to. // //if (cnspos[tiid].min() < 0) { // cnspos[tiid].min() = 0; // cnspos[tiid].max() = fragmentLength * anchorScale; //} } #else assert(0 <= utgpos[tiid].aHang()); uint32 bgn = abacus->getColumn(piid, cnspos[piid].min() + utgpos[tiid].aHang() - cnspos[piid].min()); uint32 end = abacus->getColumn(piid, cnspos[piid].max() + utgpos[tiid].bHang() - cnspos[piid].min()); cnspos[tiid].setMinMax(bgn, end); #endif assert(cnspos[tiid].min() < cnspos[tiid].max()); if (showPlacement()) fprintf(stderr, "computePositionFromAnchor()-- anchor %d at %d,%d --> beg,end %d,%d (tigLen %d)\n", anchor, cnspos[piid].min(), cnspos[piid].max(), cnspos[tiid].min(), cnspos[tiid].max(), abacus->numberOfColumns()); return(true); } computePositionFromAnchorFail: cnspos[tiid].setMinMax(0, 0); piid = -1; return(false); } int unitigConsensus::computePositionFromLayout(void) { int32 thickestLen = 0; assert(piid == -1); // Find the thickest qiid overlap to any cnspos fragment for (int32 qiid = tiid-1; qiid >= 0; qiid--) { if ((utgpos[tiid].min() < utgpos[qiid].max()) && (utgpos[tiid].max() > utgpos[qiid].min()) && ((cnspos[qiid].min() != 0) || (cnspos[qiid].max() != 0))) { cnspos[tiid].setMinMax(cnspos[qiid].min() + utgpos[tiid].min() - utgpos[qiid].min(), cnspos[qiid].max() + 
                              utgpos[tiid].max() - utgpos[qiid].max());

      //  This assert triggers.  It results in 'ooo' below being negative, and we
      //  discard this overlap anyway.
      //
      //assert(cnspos[tiid].min() < cnspos[tiid].max());

      int32 ooo = MIN(cnspos[tiid].max(), abacus->numberOfColumns()) - cnspos[tiid].min();

#if 1
      if (showPlacement())
        fprintf(stderr, "computePositionFromLayout()-- layout %d at utg %d,%d cns %d,%d --> utg %d,%d cns %d,%d -- overlap %d\n",
                utgpos[qiid].ident(),
                utgpos[qiid].min(), utgpos[qiid].max(),
                cnspos[qiid].min(), cnspos[qiid].max(),
                utgpos[tiid].min(), utgpos[tiid].max(),
                cnspos[tiid].min(), cnspos[tiid].max(),
                ooo);
#endif

      //  Occasionally we see an overlap in the original placement (utgpos overlap) but after
      //  adjusting our fragment to the consensus position, we no longer have an overlap.  This
      //  seems to be caused by a bad original placement.
      //
      //  Example:
      //  utgpos[a] = 13480,14239    cnspos[a] = 13622,14279
      //  utgpos[b] = 14180,15062
      //
      //  Our placement is 200bp different at the start, but close at the end.  When we compute the
      //  new start placement, it starts after the end of the A read -- the utgpos say the B read
      //  starts 700bp after the A read, which is position 13622 + 700 = 14322....50bp after A ends.

      if ((cnspos[tiid].min() < abacus->numberOfColumns()) &&
          (thickestLen < ooo)) {
        thickestLen = ooo;

        assert(cnspos[tiid].min() < cnspos[tiid].max());  //  But we'll still assert cnspos is ordered correctly.

        int32 ovl   = ooo;
        int32 ahang = cnspos[tiid].min();
        int32 bhang = cnspos[tiid].max() - abacus->numberOfColumns();

        piid = qiid;
      }
    }
  }

  //  If we have a VALID thickest placement, use that (recompute the placement that is likely
  //  overwritten -- ahang, bhang and piid are still correct).
if (thickestLen >= minOverlap) { assert(piid != -1); cnspos[tiid].setMinMax(cnspos[piid].min() + utgpos[tiid].min() - utgpos[piid].min(), cnspos[piid].max() + utgpos[tiid].max() - utgpos[piid].max()); assert(cnspos[tiid].min() < cnspos[tiid].max()); if (showPlacement()) fprintf(stderr, "computePositionFromLayout()-- layout %d at %d,%d --> beg,end %d,%d (tigLen %d)\n", utgpos[piid].ident(), cnspos[piid].min(), cnspos[piid].max(), cnspos[tiid].min(), cnspos[tiid].max(), abacus->numberOfColumns()); return(true); } cnspos[tiid].setMinMax(0, 0); piid = -1; return(false); } // Occasionally we get a fragment that just refuses to go in the correct spot. Search for the // correct placement in all of consensus, update ahang,bhang and retry. // // We don't expect to have big negative ahangs, and so we don't allow them. To unlimit this, use // "-fragmentLen" instead of the arbitrary cutoff below. int unitigConsensus::computePositionFromAlignment(void) { assert(piid == -1); int32 minlen = minOverlap; int32 ahanglimit = -10; abSequence *seq = abacus->getSequence(tiid); char *fragment = seq->getBases(); int32 fragmentLen = seq->length(); bool foundAlign = false; // // Try NDalign. // if (foundAlign == false) { if (oaPartial == NULL) oaPartial = new NDalign(pedLocal, errorRate, 17); // partial allowed! oaPartial->initialize(0, abacus->bases(), abacus->numberOfColumns(), 0, abacus->numberOfColumns(), 1, fragment, fragmentLen, 0, fragmentLen, false); if ((oaPartial->findMinMaxDiagonal(minOverlap) == true) && (oaPartial->findSeeds(false) == true) && (oaPartial->findHits() == true) && (oaPartial->chainHits() == true) && (oaPartial->processHits() == true)) { cnspos[tiid].setMinMax(oaPartial->abgn(), oaPartial->aend()); //fprintf(stderr, "computePositionFromAlignment()-- cnspos[%3d] mid %d %d,%d (from NDalign)\n", tiid, utgpos[tiid].ident(), cnspos[tiid].min(), cnspos[tiid].max()); foundAlign = true; } } // // Fail. 
// if (foundAlign == false) { cnspos[tiid].setMinMax(0, 0); piid = -1; if (showAlgorithm()) fprintf(stderr, "computePositionFromAlignment()-- Returns fail (no alignment).\n"); return(false); } // From the overlap and existing placements, find the thickest overlap, to set the piid and // hangs, then reset the original placement based on that anchors original placement. // // To work with fixFailures(), we need to scan the entire fragment list. This isn't so bad, // really, since before we were scanning (on average) half of it. assert(cnspos[tiid].min() < cnspos[tiid].max()); int32 thickestLen = 0; for (int32 qiid = numfrags-1; qiid >= 0; qiid--) { if ((tiid != qiid) && (cnspos[tiid].min() < cnspos[qiid].max()) && (cnspos[tiid].max() > cnspos[qiid].min())) { int32 ooo = (MIN(cnspos[tiid].max(), cnspos[qiid].max()) - MAX(cnspos[tiid].min(), cnspos[qiid].min())); if (thickestLen < ooo) { thickestLen = ooo; int32 ovl = ooo; int32 ahang = cnspos[tiid].min(); int32 bhang = cnspos[tiid].max() - abacus->numberOfColumns(); piid = qiid; } } } // No thickest? Dang. if (thickestLen == 0) { cnspos[tiid].setMinMax(0, 0); piid = -1; if (showAlgorithm()) fprintf(stderr, "computePositionFromAlignment()-- Returns fail (no thickest).\n"); return(false); } // Success, yay! assert(piid != -1); if (showPlacement()) fprintf(stderr, "computePositionFromAlignment()-- layout %d at %d,%d --> beg,end %d,%d (tigLen %d)\n", utgpos[piid].ident(), cnspos[piid].min(), cnspos[piid].max(), cnspos[tiid].min(), cnspos[tiid].max(), abacus->numberOfColumns()); return(true); } void unitigConsensus::generateConsensus(tgTig *tig) { abacus->recallBases(true); // Do one last base call, using the full works. 
abacus->refine(abAbacus_Smooth); abacus->mergeColumns(true); abacus->refine(abAbacus_Poly_X); abacus->mergeColumns(true); abacus->refine(abAbacus_Indel); abacus->mergeColumns(true); abacus->recallBases(true); // The bases are possibly all recalled, depending on the above refinements keeping things consistent. //abacus->refreshColumns(); // Definitely needed, this copies base calls into _cnsBases and _cnsQuals. // Copy the consensus and positions into the tig. abacus->getConsensus(tig); abacus->getPositions(tig); // While we have fragments in memory, compute the microhet probability. Ideally, this would be // done in CGW when loading unitigs (the only place the probability is used) but the code wants // to load sequence and quality for every fragment, and that's too expensive. } // Update the position of each fragment in the consensus sequence. // Update the anchor/hang of the fragment we just placed. void unitigConsensus::refreshPositions(void) { for (int32 i=0; i<=tiid; i++) { if ((cnspos[i].min() == 0) && (cnspos[i].max() == 0)) // Uh oh, not placed originally. continue; abColumn *fcol = abacus->readTofBead[i].column; abColumn *lcol = abacus->readTolBead[i].column; cnspos[i].setMinMax(fcol->position(), lcol->position() + 1); assert(cnspos[i].min() >= 0); assert(cnspos[i].max() > cnspos[i].min()); } if (piid >= 0) utgpos[tiid].setAnchor(utgpos[piid].ident(), cnspos[tiid].min() - cnspos[piid].min(), cnspos[tiid].max() - cnspos[piid].max()); piid = -1; } // Run abacus to rebuild the consensus sequence. VERY expensive. void unitigConsensus::recomputeConsensus(bool display) { //abacus->recallBases(false); // Needed? We should be up to date. abacus->refine(abAbacus_Smooth); abacus->mergeColumns(false); abacus->refine(abAbacus_Poly_X); abacus->mergeColumns(false); abacus->refine(abAbacus_Indel); abacus->mergeColumns(false); abacus->recallBases(false); // Possibly not needed. If this is removed, the following refresh is definitely needed. 
  //abacus->refreshColumns();   //  Definitely needed, this copies base calls into _cnsBases and _cnsQuals.

  refreshPositions();

  if (display)
    abacus->display(stderr);
}

//  This stub lets alignFragment() cleanup and return on alignment failures.  The original
//  implementation did the same thing with a goto to the end of the function.  Opening up the if
//  statements exposed variable declarations that prevented the goto from compiling.
//
bool
unitigConsensus::alignFragmentFailure(void) {
  cnspos[tiid].setMinMax(0, 0);
  piid = -1;

  if (showAlgorithm())
    fprintf(stderr, "alignFragment()-- No alignment found.\n");

  return(false);
}

//  Generates an alignment of the current read to the partial consensus.
//  The primary output is a trace stored in the object data.
bool
unitigConsensus::alignFragment(bool forceAlignment) {

  assert((cnspos[tiid].min() != 0) || (cnspos[tiid].max() != 0));
  assert(piid != -1);
  assert(cnspos[tiid].min() < cnspos[tiid].max());

  abSequence  *bSEQ    = abacus->getSequence(tiid);
  char        *fragSeq = bSEQ->getBases();
  int32        fragLen = bSEQ->length();

  //  Decide on how much to align.  Pick too little of consensus, and we leave some of the read
  //  unaligned.  Pick too much, and the read aligns poorly.
  //
  //  endTrim  is trimmed from the 3' of the read.  This is the stuff we don't expect to align to consensus.
  //  bgnExtra is trimmed from the 5' of consensus.  Same idea as endTrim.
  //  endExtra is trimmed from the 3' of consensus.  Only for contained reads.
  //
  //  These values are adjusted later, in trimStep increments, based on the alignments returned.
  //
  //  Of the two choices, making the two Extra's small at the start is probably safer.  That case is
  //  easy to detect, and easy to fix.
  //
  //  The expectedAlignLen is almost always an underestimate.  Any gaps inserted will make the real
  //  alignment length longer.  This used to be multiplied by the error rate.

  int32 expectedAlignLen = cnspos[tiid].max() - cnspos[tiid].min();

  //  If the read is contained, the full read is aligned.
// Otherwise, an extra 1/32 of the align length is added for padding. int32 fragBgn = 0; int32 fragEnd = (cnspos[tiid].max() < abacus->numberOfColumns()) ? (fragLen) : (33 * expectedAlignLen / 32); if (fragEnd > fragLen) fragEnd = fragLen; // Given the usual case of using an actual overlap to a read in the multialign to find the region // to align to, we expect that region to be nearly perfect. Thus, we shouldn't need to extend it // much. If anything, we'll need to extend the 3' end of the read. int32 bgnExtra = 10; // Start with a small 'extra' allowance, easy to make bigger. int32 endExtra = 10; // int32 trimStep = max(10, expectedAlignLen / 50); // Step by 10 bases or do at most 50 steps. // Find an alignment! bool allowedToTrim = true; assert(abacus->bases()[abacus->numberOfColumns()] == 0); // Consensus must be NUL terminated assert(fragSeq[fragLen] == 0); // The read must be NUL terminated alignFragmentAgain: // Truncate consensus and the read to prevent false alignments. if (cnspos[tiid].max() + endExtra > abacus->numberOfColumns()) endExtra = abacus->numberOfColumns() - cnspos[tiid].max(); int32 cnsBgn = MAX(0, cnspos[tiid].min() - bgnExtra); // Start position in consensus int32 cnsEnd = cnspos[tiid].max() + endExtra; // Truncation of consensus int32 cnsEndBase = abacus->bases()[cnsEnd]; // Saved base (if not truncated, it's the NUL byte at the end) char *aseq = abacus->bases() + cnsBgn; char *bseq = fragSeq; char fragEndBase = bseq[fragEnd]; // Saved base abacus->bases()[cnsEnd] = 0; // Do the truncations. bseq[fragEnd] = 0; // Report! if (showAlgorithm()) fprintf(stderr, "alignFragment()-- Allow bgnExtra=%d and endExtra=%d (cnsBgn=%d cnsEnd=%d cnsLen=%d) (fragBgn=0 fragEnd=%d fragLen=%d)\n", bgnExtra, endExtra, cnsBgn, cnsEnd, abacus->numberOfColumns(), fragEnd, fragLen); // Create new aligner object. 'Global' in this case just means to not stop early, not a true global alignment. 
if (oaFull == NULL) oaFull = new NDalign(pedGlobal, errorRate, 17); oaFull->initialize(0, aseq, cnsEnd - cnsBgn, 0, cnsEnd - cnsBgn, 1, bseq, fragEnd - fragBgn, 0, fragEnd - fragBgn, false); // Generate a null hit, then align it and then realign, from both endpoints, and save the better // of the two. if ((oaFull->makeNullHit() == true) && (oaFull->processHits() == true)) { if (showAlignments()) oaFull->display("utgCns::alignFragment()--", true); oaFull->realignBackward(showAlgorithm(), showAlignments()); oaFull->realignForward (showAlgorithm(), showAlignments()); } // Restore the bases we removed to end the strings early. if (cnsEndBase) abacus->bases()[cnsEnd] = cnsEndBase; if (fragEndBase) bseq[fragEnd] = fragEndBase; // If no alignment, bail. if (oaFull->length() == 0) return(alignFragmentFailure()); // // Check quality and fail if it sucks. // bool isBad = oaFull->scanDeltaForBadness(showAlgorithm(), showAlignments()); // Check for bad (under) trimming of input sequences. // // If the alignment is bad, and we hit the start of the consensus sequence (or the end of // same), chances are good that the aligner returned a (higher scoring) global alignment instead // of a (lower scoring) local alignment. Trim off some of the extension and try again. if ((allowedToTrim == true) && (isBad == true) && (oaFull->ahg5() == 0) && (bgnExtra > 0)) { int32 adj = (bgnExtra < trimStep) ? 0 : bgnExtra - trimStep; if (showAlgorithm()) fprintf(stderr, "utgCns::alignFragment()-- alignment is bad, hit the trimmed start of consensus, decrease bgnExtra from %u to %u\n", bgnExtra, adj); bgnExtra = adj; goto alignFragmentAgain; } if ((allowedToTrim == true) && (isBad == true) && (oaFull->ahg3() == 0) && (endExtra > 0)) { int32 adj = (endExtra < trimStep) ? 
0 : endExtra - trimStep; if (showAlgorithm()) fprintf(stderr, "utgCns::alignFragment()-- alignment is bad, hit the trimmed end of consensus, decrease endExtra from %u to %u\n", endExtra, adj); endExtra = adj; goto alignFragmentAgain; } // Check for bad (over) trimming of input sequences. Bad if: // we don't hit the start of the read, and we chopped the start of consensus. // we don't hit the end of the read, and we chopped the end of consensus. // we do hit the end of the read, and we chopped the end of the read. // (we don't chop the start of the read, so the fourth possible case never happens) allowedToTrim = false; // No longer allowed to reduce bgnExtra or endExtra. We'd hit infinite loops otherwise. if ((oaFull->bhg5() > 0) && (cnsBgn > 0)) { int32 adj = bgnExtra + 2 * oaFull->bhg5(); if (showAlgorithm()) fprintf(stderr, "utgCns::alignFragment()-- hit the trimmed start of consensus, increase bgnExtra from %u to %u\n", bgnExtra, adj); bgnExtra = adj; goto alignFragmentAgain; } if ((oaFull->bhg3() > 0) && (cnsEnd < abacus->numberOfColumns())) { int32 adj = endExtra + 2 * oaFull->bhg3(); if (showAlgorithm()) fprintf(stderr, "utgCns::alignFragment()-- hit the trimmed end of consensus, increase endExtra from %u to %u\n", endExtra, adj); endExtra = adj; goto alignFragmentAgain; } if ((oaFull->bhg3() == 0) && (fragEnd < fragLen)) { int32 adj = (fragEnd + trimStep < fragLen) ? fragEnd + trimStep : fragLen; if (showAlgorithm()) fprintf(stderr, "utgCns::alignFragment()-- hit the trimmed end of the read, increase fragEnd from %d to %d\n", fragEnd, adj); fragEnd = adj; goto alignFragmentAgain; } // If we get here, and it's still bad, well, not much we can do. 
if ((forceAlignment == false) && (isBad == true)) { if (showAlgorithm()) fprintf(stderr, "utgCns::alignFragment()-- alignment bad after realigning\n"); return(alignFragmentFailure()); } if ((forceAlignment == false) && (oaFull->erate() > errorRate)) { if (showAlgorithm()) { fprintf(stderr, "utgCns::alignFragment()-- alignment is low quality: %f > %f\n", oaFull->erate(), errorRate); oaFull->display("utgCns::alignFragment()-- ", true); } return(alignFragmentFailure()); } // Otherwise, its a good alignment. Process the trace to the 'consensus-format' and return true. // // Set the begin points of the trace. We probably need at least one of abgn and bbgn to be // zero. If both are nonzero, then we have a branch in the alignment. // // If traceABgn is negative, we insert gaps into A before it starts, but I'm not sure how that works. // // Overlap encoding: // Add N matches or mismatches. If negative, insert a base in the first sequence. If // positive, delete a base in the first sequence. // // Consensus encoding: // If negative, align (-trace - apos) bases, then add a gap in A. // If positive, align ( trace - bpos) bases, then add a gap in B. 
  //
  if (oaFull->abgn() > 0)   assert(oaFull->bbgn() == 0);  //  read aligned fully if consensus isn't
  if (oaFull->bbgn() > 0)   assert(oaFull->abgn() == 0);  //  read extends past the begin, consensus aligned fully
  if (oaFull->bbgn() > 0)   assert(cnsBgn == 0);          //  read extends past the begin, consensus not trimmed at begin

  traceABgn = cnsBgn + oaFull->abgn() - oaFull->bbgn();
  traceBBgn =          oaFull->bbgn();

  int32  apos = oaFull->abgn();
  int32  bpos = oaFull->bbgn();

  traceLen = 0;

  for (uint32 ii=0; ii<oaFull->deltaLen(); ii++, traceLen++) {
    if (oaFull->delta()[ii] < 0) {
      apos += -oaFull->delta()[ii] - 1;
      bpos += -oaFull->delta()[ii];

      trace[traceLen] = -apos - cnsBgn - 1;
    } else {
      apos += oaFull->delta()[ii];
      bpos += oaFull->delta()[ii] - 1;  //  critical

      trace[traceLen] = bpos + 1;
    }
  }

  trace[traceLen] = 0;

  return(true);
}
canu-1.6/src/utgcns/libcns/unitigConsensus.H000066400000000000000000000141551314437614700212030ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * This file is derived from: * * src/AS_CNS/MultiAlignment_CNS.h * src/AS_CNS/MultiAlignment_CNS_private.H * src/AS_CNS/MultiAlignment_CNS_private.h * src/utgcns/libcns/MultiAlignment_CNS_private.H * * Modifications by: * * Gennady Denisov from 2005-MAY-23 to 2007-OCT-25 * are Copyright 2005-2007 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2005-JUL-08 to 2013-AUG-01 * are Copyright 2005-2009,2011,2013 J.
Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Eli Venter from 2006-FEB-13 to 2008-FEB-13 * are Copyright 2006,2008 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Sergey Koren from 2008-JAN-28 to 2009-SEP-25 * are Copyright 2008-2009 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Jason Miller on 2011-SEP-21 * are Copyright 2011 J. Craig Venter Institute, and * are subject to the GNU General Public License version 2 * * Brian P. Walenz from 2014-NOV-17 to 2015-AUG-05 * are Copyright 2014-2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-OCT-09 * are a 'United States Government Work', and * are released in the public domain * * Sergey Koren beginning on 2015-DEC-28 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. 
*/

#ifndef UNITIGCONSENSUS_H
#define UNITIGCONSENSUS_H

#include "AS_global.H"
#include "tgStore.H"
#include "abAbacus.H"

class ALNoverlap;
class NDalign;

class unitigConsensus {
public:
  unitigConsensus(gkStore  *gkpStore_,
                  double    errorRate_,
                  double    errorRateMax_,
                  uint32    minOverlap_);
  ~unitigConsensus();

  bool   savePackage(FILE *outPackageFile, tgTig *tig);

  bool   generate(tgTig *tig,
                  map<uint32, gkRead *>     *inPackageRead     = NULL,
                  map<uint32, gkReadData *> *inPackageReadData = NULL);
  bool   generatePBDAG(char aligner, bool normalize, tgTig *tig,
                       map<uint32, gkRead *>     *inPackageRead     = NULL,
                       map<uint32, gkReadData *> *inPackageReadData = NULL);
  bool   generateQuick(tgTig *tig,
                       map<uint32, gkRead *>     *inPackageRead     = NULL,
                       map<uint32, gkReadData *> *inPackageReadData = NULL);
  bool   generateSingleton(tgTig *tig,
                           map<uint32, gkRead *>     *inPackageRead     = NULL,
                           map<uint32, gkReadData *> *inPackageReadData = NULL);

  int32  initialize(map<uint32, gkRead *> *inPackageRead, map<uint32, gkReadData *> *inPackageReadData);

  void   setErrorRate(double errorRate_)    { errorRate  = errorRate_;  };
  void   setMinOverlap(uint32 minOverlap_)  { minOverlap = minOverlap_; };

  bool   showProgress(void)         { return(tig->_utgcns_verboseLevel >= 1); };  //  -V           displays which reads are processing
  bool   showAlgorithm(void)        { return(tig->_utgcns_verboseLevel >= 2); };  //  -V -V        displays some details on the algorithm
  bool   showPlacementBefore(void)  { return(tig->_utgcns_verboseLevel >= 3); };  //  -V -V -V     displays placement info before each read
  bool   showPlacement(void)        { return(tig->_utgcns_verboseLevel >= 3); };  //
  bool   showAlignments(void)       { return(tig->_utgcns_verboseLevel >= 4); };  //  -V -V -V -V  displays aligns and multialigns
  bool   showMultiAlignments(void)  { return(tig->_utgcns_verboseLevel >= 4); };  //

  void   reportStartingWork(void);
  void   reportFailure(void);
  void   reportSuccess(void);

  int32  moreFragments(void)  { tiid++;  return (tiid < numfrags); };

  int32  computePositionFromAnchor(void);
  int32  computePositionFromLayout(void);
  int32  computePositionFromAlignment(void);

  void   recomputeConsensus(bool display);
  void   refreshPositions(void);

  bool   rejectAlignment(bool allowBhang, bool allowAhang, ALNoverlap *O);

  bool
alignFragmentFailure(void); bool alignFragment(bool forceAlignment=false); void generateConsensus(tgTig *tig); private: gkStore *gkpStore; tgTig *tig; uint32 numfrags; // == tig->numberOfChildren() uint32 traceLen; uint32 traceMax; int32 *trace; int32 traceABgn; // Start of the aligned regions, discovered in alignFragment(), int32 traceBBgn; // used in applyAlignment(). abAbacus *abacus; // The two positions below are storing the low/high coords for the read. // They do not encode the orientation in the coordinates. // tgPosition *utgpos; // Original unitigger location. tgPosition *cnspos; // Actual location in frankenstein. int32 tiid; // This frag IID int32 piid; // Anchor frag IID - if -1, not valid uint32 minOverlap; double errorRate; double errorRateMax; NDalign *oaPartial; NDalign *oaFull; }; #endif canu-1.6/src/utgcns/libpbutgcns/000077500000000000000000000000001314437614700167265ustar00rootroot00000000000000canu-1.6/src/utgcns/libpbutgcns/.gitignore000066400000000000000000000003031314437614700207120ustar00rootroot00000000000000# Compiled Object files and Bindaries *.slo *.lo *.o pbutgcns # Compiled Dynamic libraries *.so *.dylib # Compiled Static libraries *.lai *.la *.a # dependency files *.d # vim tmpfiles *.swp canu-1.6/src/utgcns/libpbutgcns/Alignment.H000066400000000000000000000036131314437614700207600ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. 
* * Modifications by: * * Sergey Koren beginning on 2015-DEC-28 * are a 'United States Government Work', and * are released in the public domain * * Brian P. Walenz beginning on 2016-JAN-04 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #ifndef __GCON_ALIGNMENT_HPP__ #define __GCON_ALIGNMENT_HPP__ /// /// Super-simple alignment representation. Represents an alignment between two /// PacBio reads, one of which we're trying to correct. The read to correct /// may be either the target or the query, depending on how the alignment was /// done. /// #include #include class dagAlignment { public: dagAlignment() { start = 0; end = 0; length = 0; qstr = NULL; tstr = NULL; }; ~dagAlignment() { delete [] qstr; delete [] tstr; }; void clear(void) { delete [] qstr; delete [] tstr; start = 0; end = 0; length = 0; qstr = NULL; tstr = NULL; }; void normalizeGaps(void); uint32_t start; // 1-based! uint32_t end; uint32_t length; char *qstr; char *tstr; }; #endif // __GCON_ALIGNMENT_HPP__ canu-1.6/src/utgcns/libpbutgcns/AlnGraphBoost.C000066400000000000000000000425751314437614700215520ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Sergey Koren beginning on 2015-DEC-28 * are a 'United States Government Work', and * are released in the public domain * * Brian P. 
Walenz beginning on 2017-MAY-09 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ // Copyright (c) 2011-2015, Pacific Biosciences of California, Inc. // // All rights reserved. // // Redistribution and use in source and binary forms, with or without // modification, are permitted (subject to the limitations in the // disclaimer below) provided that the following conditions are met: // // * Redistributions of source code must retain the above copyright // notice, this list of conditions and the following disclaimer. // // * Redistributions in binary form must reproduce the above // copyright notice, this list of conditions and the following // disclaimer in the documentation and/or other materials provided // with the distribution. // // * Neither the name of Pacific Biosciences nor the names of its // contributors may be used to endorse or promote products derived // from this software without specific prior written permission. // // NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE // GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC // BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED // WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES // OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE // DISCLAIMED. 
IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS // CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT // LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF // USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND // ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, // OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT // OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF // SUCH DAMAGE. #include #include #include #include #include #include #include #include #include #include #include "Alignment.H" #include "AlnGraphBoost.H" AlnGraphBoost::AlnGraphBoost(const std::string& backbone) { // initialize the graph structure with the backbone length + enter/exit // vertex size_t blen = backbone.length(); _g = G(blen+1); for (size_t i = 0; i < blen+1; i++) boost::add_edge(i, i+1, _g); VtxIter curr, last; boost::tie(curr, last) = boost::vertices(_g); _enterVtx = *curr++; _g[_enterVtx].base = '^'; _g[_enterVtx].backbone = true; for (size_t i = 0; i < blen; i++, ++curr) { VtxDesc v = *curr; _g[v].backbone = true; _g[v].weight = 1; _g[v].base = backbone[i]; _bbMap[v] = v; } _exitVtx = *curr; _g[_exitVtx].base = '$'; _g[_exitVtx].backbone = true; } AlnGraphBoost::AlnGraphBoost(const size_t blen) { _g = G(blen+1); for (size_t i = 0; i < blen+1; i++) boost::add_edge(i, i+1, _g); VtxIter curr, last; boost::tie(curr, last) = boost::vertices(_g); _enterVtx = *curr++; _g[_enterVtx].base = '^'; _g[_enterVtx].backbone = true; for (size_t i = 0; i < blen; i++, ++curr) { VtxDesc v = *curr; _g[v].backbone = true; _g[v].weight = 1; _g[v].deleted = false; _g[v].base = 'N'; _bbMap[v] = v; } _exitVtx = *curr; _g[_exitVtx].base = '$'; _g[_exitVtx].backbone = true; } void AlnGraphBoost::addAln(dagAlignment& aln) { IndexMap index = boost::get(boost::vertex_index, _g); // tracks the position on the backbone uint32_t bbPos = aln.start; VtxDesc 
prevVtx = _enterVtx; for (size_t i = 0; i < aln.length; i++) { char queryBase = aln.qstr[i], targetBase = aln.tstr[i]; VtxDesc currVtx = index[bbPos]; // match if (queryBase == targetBase) { _g[_bbMap[currVtx]].coverage++; // NOTE: for empty backbones _g[_bbMap[currVtx]].base = targetBase; _g[currVtx].weight++; addEdge(prevVtx, currVtx); bbPos++; prevVtx = currVtx; // query deletion } else if (queryBase == '-' && targetBase != '-') { _g[_bbMap[currVtx]].coverage++; // NOTE: for empty backbones _g[_bbMap[currVtx]].base = targetBase; bbPos++; // query insertion } else if (queryBase != '-' && targetBase == '-') { // create new node and edge VtxDesc newVtx = boost::add_vertex(_g); _g[newVtx].base = queryBase; _g[newVtx].weight++; _g[newVtx].backbone = false; _g[newVtx].deleted = false; _bbMap[newVtx] = bbPos; addEdge(prevVtx, newVtx); prevVtx = newVtx; } } addEdge(prevVtx, _exitVtx); } void AlnGraphBoost::addEdge(VtxDesc u, VtxDesc v) { // Check if edge exists with prev node. If it does, increment edge counter, // otherwise add a new edge. InEdgeIter ii, ie; bool edgeExists = false; for (boost::tie(ii, ie) = boost::in_edges(v, _g); ii != ie; ++ii) { EdgeDesc e = *ii; if (boost::source(e , _g) == u) { // increment edge count _g[e].count++; edgeExists = true; } } if (! 
edgeExists) { // add new edge std::pair p = boost::add_edge(u, v, _g); _g[p.first].count++; } } void AlnGraphBoost::mergeNodes() { std::queue seedNodes; seedNodes.push(_enterVtx); while(true) { if (seedNodes.size() == 0) break; VtxDesc u = seedNodes.front(); seedNodes.pop(); mergeInNodes(u); mergeOutNodes(u); OutEdgeIter oi, oe; for (boost::tie(oi, oe) = boost::out_edges(u, _g); oi != oe; ++oi) { EdgeDesc e = *oi; _g[e].visited = true; VtxDesc v = boost::target(e, _g); InEdgeIter ii, ie; int notVisited = 0; for (boost::tie(ii, ie) = boost::in_edges(v, _g); ii != ie; ++ii) { if (_g[*ii].visited == false) notVisited++; } // move onto the boost::target node after we visit all incoming edges for // the boost::target node if (notVisited == 0) seedNodes.push(v); } } } void AlnGraphBoost::mergeInNodes(VtxDesc n) { std::map > nodeGroups; InEdgeIter ii, ie; // Group neighboring nodes by base for(boost::tie(ii, ie) = boost::in_edges(n, _g); ii != ie; ++ii) { VtxDesc inNode = boost::source(*ii, _g); if (out_degree(inNode, _g) == 1) { nodeGroups[_g[inNode].base].push_back(inNode); } } // iterate over node groups, merge an accumulate information for(std::map >::iterator kvp = nodeGroups.begin(); kvp != nodeGroups.end(); ++kvp) { std::vector nodes = (*kvp).second; if (nodes.size() <= 1) continue; std::vector::const_iterator ni = nodes.begin(); VtxDesc an = *ni++; OutEdgeIter anoi, anoe; boost::tie(anoi, anoe) = boost::out_edges(an, _g); // Accumulate out edge information for (; ni != nodes.end(); ++ni) { OutEdgeIter oi, oe; boost::tie(oi, oe) = boost::out_edges(*ni, _g); _g[*anoi].count += _g[*oi].count; _g[an].weight += _g[*ni].weight; } // Accumulate in edge information, merges nodes ni = nodes.begin(); ++ni; for (; ni != nodes.end(); ++ni) { InEdgeIter ii, ie; VtxDesc n = *ni; for (boost::tie(ii, ie) = boost::in_edges(n, _g); ii != ie; ++ii) { VtxDesc n1 = boost::source(*ii, _g); EdgeDesc e; bool exists; boost::tie(e, exists) = edge(n1, an, _g); if (exists) { _g[e].count += 
_g[*ii].count; } else { std::pair p = boost::add_edge(n1, an, _g); _g[p.first].count = _g[*ii].count; _g[p.first].visited = _g[*ii].visited; } } markForReaper(n); } mergeInNodes(an); } } void AlnGraphBoost::mergeOutNodes(VtxDesc n) { std::map > nodeGroups; OutEdgeIter oi, oe; for(boost::tie(oi, oe) = boost::out_edges(n, _g); oi != oe; ++oi) { VtxDesc outNode = boost::target(*oi, _g); if (in_degree(outNode, _g) == 1) { nodeGroups[_g[outNode].base].push_back(outNode); } } for(std::map >::iterator kvp = nodeGroups.begin(); kvp != nodeGroups.end(); ++kvp) { std::vector nodes = (*kvp).second; if (nodes.size() <= 1) continue; std::vector::const_iterator ni = nodes.begin(); VtxDesc an = *ni++; InEdgeIter anii, anie; boost::tie(anii, anie) = boost::in_edges(an, _g); // Accumulate inner edge information for (; ni != nodes.end(); ++ni) { InEdgeIter ii, ie; boost::tie(ii, ie) = boost::in_edges(*ni, _g); _g[*anii].count += _g[*ii].count; _g[an].weight += _g[*ni].weight; } // Accumulate and merge outer edge information ni = nodes.begin(); ++ni; for (; ni != nodes.end(); ++ni) { OutEdgeIter oi, oe; VtxDesc n = *ni; for (boost::tie(oi, oe) = boost::out_edges(n, _g); oi != oe; ++oi) { VtxDesc n2 = boost::target(*oi, _g); EdgeDesc e; bool exists; boost::tie(e, exists) = edge(an, n2, _g); if (exists) { _g[e].count += _g[*oi].count; } else { std::pair p = boost::add_edge(an, n2, _g); _g[p.first].count = _g[*oi].count; _g[p.first].visited = _g[*oi].visited; } } markForReaper(n); } } } void AlnGraphBoost::markForReaper(VtxDesc n) { _g[n].deleted = true; clear_vertex(n, _g); _reaperBag.push_back(n); } void AlnGraphBoost::reapNodes() { int reapCount = 0; std::sort(_reaperBag.begin(), _reaperBag.end()); std::vector::iterator curr = _reaperBag.begin(); for (; curr != _reaperBag.end(); ++curr) { assert(_g[*curr].backbone==false); remove_vertex(*curr-reapCount++, _g); } } const std::string AlnGraphBoost::consensus(int minWeight) { // get the best scoring path std::vector path = bestPath(); 
// consensus sequence std::string cns; // track the longest consensus path meeting minimum weight int offs = 0, bestOffs = 0, length = 0, idx = 0; bool metWeight = false; std::vector::iterator curr = path.begin(); for (; curr != path.end(); ++curr) { AlnNode n = *curr; if (n.base == _g[_enterVtx].base || n.base == _g[_exitVtx].base) continue; cns += n.base; // initial beginning of minimum weight section if (!metWeight && n.weight >= minWeight) { offs = idx; metWeight = true; } else if (metWeight && n.weight < minWeight) { // concluded minimum weight section, update if longest seen so far if ((idx - offs) > length) { bestOffs = offs; length = idx - offs; } metWeight = false; } idx++; } // include end of sequence if (metWeight && (idx - offs) > length) { bestOffs = offs; length = idx - offs; } return cns.substr(bestOffs, length); } void AlnGraphBoost::consensus(std::vector& seqs, int minWeight, size_t minLen) { seqs.clear(); // get the best scoring path std::vector path = bestPath(); // consensus sequence std::string cns; // track the longest consensus path meeting minimum weight int offs = 0, idx = 0; bool metWeight = false; std::vector::iterator curr = path.begin(); for (; curr != path.end(); ++curr) { AlnNode n = *curr; if (n.base == _g[_enterVtx].base || n.base == _g[_exitVtx].base) continue; cns += n.base; // initial beginning of minimum weight section if (!metWeight && n.weight >= minWeight) { offs = idx; metWeight = true; } else if (metWeight && n.weight < minWeight) { // concluded minimum weight section, add sequence to supplied vector metWeight = false; CnsResult result; result.range[0] = offs; result.range[1] = idx; size_t length = idx - offs; result.seq = cns.substr(offs, length); if (length >= minLen) seqs.push_back(result); } idx++; } // include end of sequence if (metWeight) { size_t length = idx - offs; CnsResult result; result.range[0] = offs; result.range[1] = idx; result.seq = cns.substr(offs, length); if (length >= minLen) seqs.push_back(result); } 
} const std::vector AlnGraphBoost::bestPath() { EdgeIter ei, ee; for (boost::tie(ei, ee) = edges(_g); ei != ee; ++ei) _g[*ei].visited = false; std::map bestNodeScoreEdge; std::map nodeScore; std::queue seedNodes; // start at the end and make our way backwards seedNodes.push(_exitVtx); nodeScore[_exitVtx] = 0.0f; while (true) { if (seedNodes.size() == 0) break; VtxDesc n = seedNodes.front(); seedNodes.pop(); bool bestEdgeFound = false; float bestScore = -FLT_MAX; EdgeDesc bestEdgeD = boost::initialized_value; OutEdgeIter oi, oe; for(boost::tie(oi, oe) = boost::out_edges(n, _g); oi != oe; ++oi) { EdgeDesc outEdgeD = *oi; VtxDesc outNodeD = boost::target(outEdgeD, _g); AlnNode outNode = _g[outNodeD]; float newScore, score = nodeScore[outNodeD]; if (outNode.backbone && outNode.weight == 1) { newScore = score - 10.0f; } else { AlnNode bbNode = _g[_bbMap[outNodeD]]; newScore = _g[outEdgeD].count - bbNode.coverage*0.5f + score; } if (newScore > bestScore) { bestScore = newScore; bestEdgeD = outEdgeD; bestEdgeFound = true; } } if (bestEdgeFound) { nodeScore[n]= bestScore; bestNodeScoreEdge[n] = bestEdgeD; } InEdgeIter ii, ie; for (boost::tie(ii, ie) = boost::in_edges(n, _g); ii != ie; ++ii) { EdgeDesc inEdge = *ii; _g[inEdge].visited = true; VtxDesc inNode = boost::source(inEdge, _g); int notVisited = 0; OutEdgeIter oi, oe; for (boost::tie(oi, oe) = boost::out_edges(inNode, _g); oi != oe; ++oi) { if (_g[*oi].visited == false) notVisited++; } // move onto the target node after we visit all incoming edges for // the target node if (notVisited == 0) seedNodes.push(inNode); } } // construct the final best path VtxDesc prev = _enterVtx, next; std::vector bpath; while (true) { bpath.push_back(_g[prev]); if (bestNodeScoreEdge.count(prev) == 0) { break; } else { EdgeDesc bestOutEdge = bestNodeScoreEdge[prev]; _g[prev].bestOutEdge = bestOutEdge; next = boost::target(bestOutEdge, _g); _g[next].bestInEdge = bestOutEdge; prev = next; } } return bpath; } bool 
AlnGraphBoost::danglingNodes() { VtxIter curr, last; boost::tie(curr, last) = boost::vertices(_g); bool found = false; for (;curr != last; ++curr) { if (_g[*curr].deleted) continue; if (_g[*curr].base == _g[_enterVtx].base || _g[*curr].base == _g[_exitVtx].base) continue; int indeg = out_degree(*curr, _g); int outdeg = in_degree(*curr, _g); if (outdeg > 0 && indeg > 0) continue; found = true; } return found; } AlnGraphBoost::~AlnGraphBoost(){} canu-1.6/src/utgcns/libpbutgcns/AlnGraphBoost.H000066400000000000000000000200561314437614700215450ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Sergey Koren beginning on 2015-DEC-28 * are a 'United States Government Work', and * are released in the public domain * * Brian P. Walenz beginning on 2017-MAY-09 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ // Copyright (c) 2011-2015, Pacific Biosciences of California, Inc. // // All rights reserved. // // Redistribution and use in source and binary forms, with or without // modification, are permitted (subject to the limitations in the // disclaimer below) provided that the following conditions are met: // // * Redistributions of source code must retain the above copyright // notice, this list of conditions and the following disclaimer. 
// // * Redistributions in binary form must reproduce the above // copyright notice, this list of conditions and the following // disclaimer in the documentation and/or other materials provided // with the distribution. // // * Neither the name of Pacific Biosciences nor the names of its // contributors may be used to endorse or promote products derived // from this software without specific prior written permission. // // NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE // GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC // BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED // WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES // OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE // DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS // CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT // LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF // USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND // ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, // OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT // OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF // SUCH DAMAGE. #ifndef __GCON_ALNGRAPHBOOST_HPP__ #define __GCON_ALNGRAPHBOOST_HPP__ #include /// Alignment graph representation and consensus caller. Based on the original /// Python implementation, pbdagcon. This class is modelled after its /// aligngraph.py component, which accumulates alignment information into a /// partial-order graph and then calls consensus. Used to error-correct pacbio /// on pacbio reads. /// /// Implemented using the boost graph library. // forward declaration //struct Alignment; // this allows me to forward-declare properties with graph descriptors as // members types typedef boost::adjacency_list graphTraits; /// Graph vertex property. 
An alignment node, which represents one base position /// in the alignment graph. struct AlnNode { char base; ///< DNA base: [ACTG] int coverage; ///< Number of reads that align to this position, but do not ///< necessarily match int weight; ///< Number of reads that align to this node *with the same base*, but not ///< necessarily represented in the target. bool backbone; ///< Is this node based on the reference bool deleted; ///< Marked for removal as part of the merging process graphTraits::edge_descriptor bestInEdge; ///< Best scoring in edge graphTraits::edge_descriptor bestOutEdge; ///< Best scoring out edge AlnNode() { base = 'N'; coverage = 0; weight = 0; backbone = false; deleted = false; } }; /// Graph edge property. Represents an edge between alignment nodes. struct AlnEdge { int count; ///< Number of times this edge was confirmed by an alignment bool visited; ///< Tracks a visit during algorithm processing AlnEdge() { count = 0; visited = false; } }; // Boost-related typedefs // XXX: listS, listS? typedef boost::adjacency_list G; typedef boost::graph_traits<G>::vertex_descriptor VtxDesc; typedef boost::graph_traits<G>::vertex_iterator VtxIter; typedef boost::graph_traits<G>::edge_descriptor EdgeDesc; typedef boost::graph_traits<G>::edge_iterator EdgeIter; typedef boost::graph_traits<G>::in_edge_iterator InEdgeIter; typedef boost::graph_traits<G>::out_edge_iterator OutEdgeIter; typedef boost::property_map<G, boost::vertex_index_t>::type IndexMap; /// /// Simple consensus interface datastructure /// struct CnsResult { int range[2]; ///< Range on the target std::string seq; ///< Consensus fragment }; /// /// Core alignments-into-consensus algorithm, implemented using the boost graph /// library. Takes a set of alignments to a reference and builds a higher /// accuracy (~ 99.9%) consensus sequence from it. Designed for use in the HGAP /// pipeline as a long read error correction step. /// class AlnGraphBoost { public: /// Constructor. Initialize graph based on the given sequence.
Graph is /// annotated with the bases from the backbone. /// \param backbone the reference sequence. AlnGraphBoost(const std::string& backbone); /// Constructor. Initialize graph to a given backbone length. Base /// information is filled in as alignments are added. /// \param blen length of the reference sequence. AlnGraphBoost(const size_t blen); /// Add alignment to the graph. /// \param aln an alignment record (see Alignment.H) void addAln(dagAlignment& aln); /// Adds a new or increments an existing edge between two aligned bases. /// \param u the 'from' vertex descriptor /// \param v the 'to' vertex descriptor void addEdge(VtxDesc u, VtxDesc v); /// Collapses degenerate nodes (vertices). Must be called before /// consensus(). Calls mergeInNodes() followed by mergeOutNodes(). void mergeNodes(); /// Recursive merge of 'in' nodes. /// \param n the base node to merge around. void mergeInNodes(VtxDesc n); /// Non-recursive merge of 'out' nodes. /// \param n the base node to merge around. void mergeOutNodes(VtxDesc n); /// Marks a given node for removal from the graph. Does not modify the graph. /// \param n the node to remove. void markForReaper(VtxDesc n); /// Removes the set of nodes that have been marked. Modifies the graph. /// Prohibitively expensive when using vecS as the vertex container. void reapNodes(); /// Generates the consensus from the graph. Must be called after /// mergeNodes(). Returns the longest contiguous consensus sequence where /// each base meets the minimum weight requirement. /// \param minWeight sets the minimum weight for each base in the consensus; /// default = 0 const std::string consensus(int minWeight=0); /// Generates all consensus sequences from a target that meet the minimum /// weight requirement. void consensus(std::vector<CnsResult>& seqs, int minWeight=0, size_t minLength=500); /// Locates the optimal path through the graph. Called by consensus(). const std::vector<AlnNode> bestPath(); /// Locates nodes that are missing either in or out edges.
bool danglingNodes(); /// Destructor. virtual ~AlnGraphBoost(); private: G _g; VtxDesc _enterVtx; VtxDesc _exitVtx; std::map _bbMap; std::vector _reaperBag; }; #endif // __GCON_ALNGRAPHBOOST_HPP__ canu-1.6/src/utgcns/libpbutgcns/LICENSE000066400000000000000000000035521314437614700177400ustar00rootroot00000000000000#################################################################################$$ # Copyright (c) 2011-2015, Pacific Biosciences of California, Inc. # # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. 
IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. #################################################################################$$ canu-1.6/src/utgcns/libpbutgcns/README.md000066400000000000000000000027321314437614700202110ustar00rootroot00000000000000pbutgcns ======== DAG-based consensus caller for Celera Assembler's unitigs. It is currently tested only using PacBio corrected reads, though in principle it could be used for uncorrected PacBio reads as well. It currently uses a simple Python adapter to interface with the tigStore. Eventually, this is expected to be replaced with a more direct call to lower-level API calls. Building -------- > make BLASR= > ./pbutgcns -h ### Support libraries needed * [pblibblasr](https://github.com/PacificBiosciences/pblibblasr) BLASR library * [boost](http://www.boost.org/) Popular C++ utility library (1.46 or 1.47) * [log4cpp](http://log4cpp.sourceforge.net/) Logging library (1.0 or 1.1) Running ------- The input is an assembly generated by Celera Assembler, typically run only through the unitigger stage. However, if you want to compare it to the existing consensus caller, the assembler can be run to completion; either will work. A utility shell script *pbutgcns_wf.sh* can be used to execute the consensus workflow for a given file containing a list of unitig IDs. Make sure that you have CA's tigStore and gatekeeper utilities, tigStore-adapter.py, and pbutgcns in your path.
# tmp: temporary directory to generate consensus into # cap: CA-prefix used in celera assembler # nproc: number of threads to use while building consensus # cns: the output file name > tmp=/tmp cap=asm utg=utg.lst nproc=4 cns=utg_consensus.fa pbutgcns_wf.sh canu-1.6/src/utgcns/stashContains.C000066400000000000000000000161711314437614700173450ustar00rootroot00000000000000 /****************************************************************************** * * This file is part of canu, a software program that assembles whole-genome * sequencing reads into contigs. * * This software is based on: * 'Celera Assembler' (http://wgs-assembler.sourceforge.net) * the 'kmer package' (http://kmer.sourceforge.net) * both originally distributed by Applera Corporation under the GNU General * Public License, version 2. * * Canu branched from Celera Assembler at its revision 4587. * Canu branched from the kmer project at its revision 1994. * * Modifications by: * * Brian P. Walenz from 2015-APR-09 to 2015-MAY-09 * are Copyright 2015 Battelle National Biodefense Institute, and * are subject to the BSD 3-Clause License * * Brian P. Walenz beginning on 2015-DEC-01 * are a 'United States Government Work', and * are released in the public domain * * File 'README.licenses' in the root directory of this distribution contains * full conditions and disclaimers for each license. */ #include "stashContains.H" // Replace the children list in tig with one that has fewer contains. The original // list is returned. 
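// The first stage of stashContains() below -- sort the children by position,
// keep any read that extends the growing layout, and flag the rest as
// contained -- can be sketched standalone. In the following sketch, Interval
// and classifyDovetails are illustrative names for this example only, not
// canu types.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Interval { int lo, hi; };   // simplified stand-in for tgPosition

// Classify reads as backbone/dovetail (true) or contained (false).
// After sorting by start coordinate, a read is contained exactly when it
// does not extend the highest coordinate seen so far.
std::vector<bool> classifyDovetails(std::vector<Interval> reads) {
  if (reads.empty())
    return std::vector<bool>();

  std::sort(reads.begin(), reads.end(),
            [](const Interval &a, const Interval &b) { return a.lo < b.lo; });

  std::vector<bool> isBack(reads.size(), false);

  isBack[0] = true;                // the first read always seeds the backbone
  int hiEnd = reads[0].hi;

  for (size_t i = 1; i < reads.size(); i++) {
    if (reads[i].hi > hiEnd) {     // extends the layout: dovetail
      isBack[i] = true;
      hiEnd     = reads[i].hi;
    }                              // else contained: candidate for stashing
  }

  return isBack;
}
```

// For example, reads at [0,100), [10,50), and [60,150): the middle read never
// extends the layout and is contained; the other two form the backbone.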
savedChildren * stashContains(tgTig *tig, double maxCov, bool beVerbose) { if (tig->numberOfChildren() == 1) return(NULL); // Stats we report int32 nOrig = tig->numberOfChildren(); int32 nBack = 0; int32 nCont = 0; int32 nSave = 0; int64 nBase = 0; int64 nBaseDove = 0; int64 nBaseCont = 0; int64 nBaseSave = 0; // Save the original children savedChildren *saved = new savedChildren(tig); bool *isBack = new bool [nOrig]; // True, we save the child for processing readLength *posLen = new readLength [nOrig]; // Sorting by length of child // Sort the original children by position. std::sort(saved->children, saved->children + saved->childrenLen); // The first read is always saved int32 loEnd = saved->children[0].min(); int32 hiEnd = saved->children[0].max(); isBack[0] = 1; nBack = 1; posLen[0].idx = 0; posLen[0].len = hiEnd - loEnd; nBaseDove += posLen[0].len; nBase += posLen[0].len; // For the other reads, save it if it extends the backbone sequence. for (uint32 fi=1; fichildren[fi].min(); int32 hi = saved->children[fi].max(); posLen[fi].idx = fi; posLen[fi].len = hi - lo; nBase += posLen[fi].len; if (hi <= hiEnd) { isBack[fi] = false; nCont++; nBaseCont += posLen[fi].len; } else { isBack[fi] = true; nBack++; nBaseDove += posLen[fi].len; } hiEnd = MAX(hi, hiEnd); } // Entertain the user with some statistics double totlCov = (double)nBase / hiEnd; saved->numContains = nCont; saved->covContain = (double)nBaseCont / hiEnd; saved->percContain = 100.0 * nBaseCont / nBase;; saved->numDovetails = nBack; saved->covDovetail = (double)nBaseDove / hiEnd; saved->percDovetail = 100.0 * nBaseDove / nBase;; if (beVerbose) saved->reportDetected(stderr, tig->tigID()); // If the tig has more coverage than allowed, throw out some of the contained reads. 
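// The coverage cap that follows can be illustrated on its own: contained
// reads are taken longest-first and kept only while the coverage of kept
// reads (plus the dovetail reads) stays below maxCov. This is a simplified
// sketch with made-up names (pickContains, ReadLen fields), not canu's
// interface.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct ReadLen { unsigned idx; int len; };  // simplified readLength

// Return the indices of contained reads to keep for consensus: longest
// first, while (kept bases + dovetail bases) / layout length < maxCov.
std::vector<unsigned> pickContains(std::vector<ReadLen> contained,
                                   long doveBases, int layoutLen,
                                   double maxCov) {
  std::sort(contained.begin(), contained.end(),
            [](const ReadLen &a, const ReadLen &b) { return a.len > b.len; });

  std::vector<unsigned> keep;
  long saved = 0;

  for (const ReadLen &r : contained)
    if ((double)(saved + doveBases) / layoutLen < maxCov) {
      keep.push_back(r.idx);
      saved += r.len;
    }

  return keep;
}
```

// For example, with a 100 bp layout, 100 bp of dovetail reads (1x), and
// maxCov = 2.0, contained reads of length 90, 60, and 30 are considered in
// that order: the 90 bp and 60 bp reads are kept, the 30 bp read is dropped
// because keeping it would require coverage to already be past the cap.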
if ((totlCov >= maxCov) && (maxCov > 0)) { std::sort(posLen, posLen + nOrig, greater()); // Sort by length, larger first nBaseSave = 0.0; for (uint32 ii=0; ii < nOrig; ii++) { if (ii > 0) assert(posLen[ii-1].len >= posLen[ii].len); if (isBack[posLen[ii].idx]) // Already a backbone read. continue; if ((double)(nBaseSave + nBaseDove) / hiEnd < maxCov) { isBack[posLen[ii].idx] = true; // Save it. nSave++; nBaseSave += posLen[ii].len; } } saved->numContainsRemoved = nOrig - nBack - nSave; saved->covContainsRemoved = (double)(nBaseCont - nBaseSave) / hiEnd; saved->numContainsSaved = nSave; saved->covContainsSaved = (double)nBaseSave / hiEnd; if (beVerbose) saved->reportRemoved(stderr, tig->tigID()); // For all the reads we saved, copy them to a new children list in the tig tig->_childrenLen = 0; tig->_childrenMax = nBack + nSave; tig->_children = new tgPosition [tig->_childrenMax]; // The original is in savedChildren now for (uint32 fi=0; fichildren[fi].ident(), saved->children[fi].bgn(), children[fi].end()); tig->_children[tig->_childrenLen++] = saved->children[fi]; } } // Else, the tig coverage is acceptable and we do no filtering. else { delete saved; saved = NULL; } delete [] isBack; delete [] posLen; return(saved); } // Restores the f_list, and updates the position of non-contained reads. // void unstashContains(tgTig *tig, savedChildren *saved) { if (saved == NULL) return; // For fragments not involved in the consensus computation, we'll scale their position linearly // from the old max to the new max. // // We probably should do an alignment to the consensus sequence to find the true location, but // that's (a) expensive and (b) likely overkill for these unitigs. 
  uint32  oldMax = 0;
  uint32  newMax = 0;
  double  sf     = 1.0;

  for (uint32 fi=0, ci=0; fi<saved->childrenLen; fi++)
    if (oldMax < saved->children[fi].max())
      oldMax = saved->children[fi].max();

  for (uint32 fi=0, ci=0; fi<tig->numberOfChildren(); fi++)
    if (newMax < tig->getChild(fi)->max())
      newMax = tig->getChild(fi)->max();

  if (oldMax > 0)
    sf = (double)newMax / oldMax;

  //  First, we need a map from the child id to the location in the current tig

  map<uint32, tgPosition *>  idmap;

  for (uint32 ci=0; ci < tig->numberOfChildren(); ci++)
    idmap[tig->getChild(ci)->ident()] = tig->getChild(ci);

  //  Now, over all the reads in the original saved fragment list, update the position.  Either from
  //  the computed result, or by extrapolating.

  for (uint32 fi=0; fi<saved->childrenLen; fi++) {
    uint32  iid = saved->children[fi].ident();

    //  Does the ID exist in the new positions?  Copy the new position to the original list.
    if (idmap.find(iid) != idmap.end()) {
      saved->children[fi] = *idmap[iid];
      idmap.erase(iid);
    }

    //  Otherwise, fudge the positions.
    else {
      int32  nmin = sf * saved->children[fi].min();
      int32  nmax = sf * saved->children[fi].max();

      if (nmin > newMax)  nmin = newMax;
      if (nmax > newMax)  nmax = newMax;

      saved->children[fi].setMinMax(nmin, nmax);
    }
  }

  if (idmap.empty() == false)
    fprintf(stderr, "Failed to unstash the contained reads.  Still have " F_SIZE_T " reads unplaced.\n",
            idmap.size());
  assert(idmap.empty() == true);

  //  Throw out the reduced list, and restore the original.

  delete [] tig->_children;

  tig->_childrenLen = saved->childrenLen;
  tig->_childrenMax = saved->childrenMax;
  tig->_children    = saved->children;
}
canu-1.6/src/utgcns/stashContains.H000066400000000000000000000063011314437614700173440ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  Modifications by:
 *
 *    Brian P. Walenz from 2015-APR-09 to 2015-MAY-09
 *      are Copyright 2015 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *    Brian P. Walenz beginning on 2016-JAN-06
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#ifndef STASH_CONTAINS_H
#define STASH_CONTAINS_H

#include "AS_global.H"
#include "gkStore.H"
#include "tgStore.H"

#include <map>
#include <algorithm>

class readLength {
public:
  uint32   idx;
  int32    len;

  bool operator<(const readLength &that) const { return(len < that.len); };
  bool operator>(const readLength &that) const { return(len > that.len); };
};

class savedChildren {
public:
  savedChildren(tgTig *tig) {
    childrenLen = tig->_childrenLen;
    childrenMax = tig->_childrenMax;
    children    = tig->_children;

    numContains        = 0;
    covContain         = 0.0;
    percContain        = 0.0;

    numContainsRemoved = 0;
    covContainsRemoved = 0.0;

    numContainsSaved   = 0;
    covContainsSaved   = 0.0;

    numDovetails       = 0;
    covDovetail        = 0.0;
    percDovetail       = 0.0;
  };

  void   reportDetected(FILE *out, uint32 id) {
    fprintf(out, "  unitig %d detected " F_S32 " contains (%.2fx, %.2f%%) " F_S32 " dovetail (%.2fx, %.2f%%)\n",
            id,
            numContains,  covContain,  percContain,
            numDovetails, covDovetail, percDovetail);
  };

  void   reportRemoved(FILE *out, uint32 id) {
    fprintf(out, "  unitig %d removing " F_S32 " (%.2fx) contained reads; processing only " F_S32 " contained (%.2fx) and " F_S32 " dovetail (%.2fx) reads\n",
            id,
            numContainsRemoved, covContainsRemoved,
            numContainsSaved,   covContainsSaved,
            numDovetails,       covDovetail);
  };

  //  The saved children proper.
  uint32       childrenLen;
  uint32       childrenMax;
  tgPosition  *children;

  //  Stats on the filtering, for logging by the caller
  uint32       numContains;
  double       covContain;
  double       percContain;

  uint32       numContainsRemoved;
  double       covContainsRemoved;

  uint32       numContainsSaved;
  double       covContainsSaved;

  uint32       numDovetails;
  double       covDovetail;
  double       percDovetail;
};


savedChildren *stashContains(tgTig *tig, double maxCov, bool beVerbose = false);

void           unstashContains(tgTig *tig, savedChildren *saved);

#endif  //  STASH_CONTAINS_H
canu-1.6/src/utgcns/utgcns.C000066400000000000000000000565541314437614700160340ustar00rootroot00000000000000
/******************************************************************************
 *
 *  This file is part of canu, a software program that assembles whole-genome
 *  sequencing reads into contigs.
 *
 *  This software is based on:
 *    'Celera Assembler' (http://wgs-assembler.sourceforge.net)
 *    the 'kmer package' (http://kmer.sourceforge.net)
 *  both originally distributed by Applera Corporation under the GNU General
 *  Public License, version 2.
 *
 *  Canu branched from Celera Assembler at its revision 4587.
 *  Canu branched from the kmer project at its revision 1994.
 *
 *  This file is derived from:
 *
 *    src/AS_CNS/utgcns.C
 *
 *  Modifications by:
 *
 *    Brian P. Walenz from 2009-OCT-05 to 2014-MAR-31
 *      are Copyright 2009-2014 J. Craig Venter Institute, and
 *      are subject to the GNU General Public License version 2
 *
 *    Brian P. Walenz from 2014-DEC-30 to 2015-AUG-07
 *      are Copyright 2014-2015 Battelle National Biodefense Institute, and
 *      are subject to the BSD 3-Clause License
 *
 *    Brian P.
 Walenz beginning on 2015-OCT-09
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *    Sergey Koren beginning on 2015-DEC-28
 *      are a 'United States Government Work', and
 *      are released in the public domain
 *
 *  File 'README.licenses' in the root directory of this distribution contains
 *  full conditions and disclaimers for each license.
 */

#include "AS_global.H"
#include "gkStore.H"
#include "tgStore.H"
#include "AS_UTL_decodeRange.H"
#include "stashContains.H"
#include "unitigConsensus.H"

#ifndef BROKEN_CLANG_OpenMP
#include <omp.h>
#endif

#include <map>
#include <algorithm>

int
main (int argc, char **argv) {
  char    *gkpName        = NULL;
  char    *tigName        = NULL;
  uint32   tigVers        = UINT32_MAX;
  uint32   tigPart        = UINT32_MAX;

  char    *tigFileName    = NULL;

  uint32   utgBgn         = UINT32_MAX;
  uint32   utgEnd         = UINT32_MAX;

  char    *outResultsName = NULL;
  char    *outLayoutsName = NULL;
  char    *outSeqNameA    = NULL;
  char    *outSeqNameQ    = NULL;
  char    *outPackageName = NULL;

  FILE    *outResultsFile = NULL;
  FILE    *outLayoutsFile = NULL;
  FILE    *outSeqFileA    = NULL;
  FILE    *outSeqFileQ    = NULL;
  FILE    *outPackageFile = NULL;

  char    *inPackageName  = NULL;

  char     algorithm      = 'P';
  char     aligner        = 'E';
  bool     normalize      = false;   //  Not used, left for future use.
uint32 numThreads = 0; bool forceCompute = false; double errorRate = 0.12; double errorRateMax = 0.40; uint32 minOverlap = 40; int32 numFailures = 0; bool showResult = false; double maxCov = 0.0; uint32 maxLen = UINT32_MAX; bool onlyUnassem = false; bool onlyBubble = false; bool onlyContig = false; bool noSingleton = false; uint32 verbosity = 0; argc = AS_configure(argc, argv); int arg=1; int err=0; while (arg < argc) { if (strcmp(argv[arg], "-G") == 0) { gkpName = argv[++arg]; } else if (strcmp(argv[arg], "-T") == 0) { tigName = argv[++arg]; tigVers = atoi(argv[++arg]); tigPart = atoi(argv[++arg]); if (argv[arg][0] == '.') tigPart = UINT32_MAX; if (tigVers == 0) fprintf(stderr, "invalid tigStore version (-T store version partition) '-t %s %s %s'.\n", argv[arg-2], argv[arg-1], argv[arg]), exit(1); if (tigPart == 0) fprintf(stderr, "invalid tigStore partition (-T store version partition) '-t %s %s %s'.\n", argv[arg-2], argv[arg-1], argv[arg]), exit(1); } else if ((strcmp(argv[arg], "-u") == 0) || (strcmp(argv[arg], "-tig") == 0)) { AS_UTL_decodeRange(argv[++arg], utgBgn, utgEnd); } else if (strcmp(argv[arg], "-t") == 0) { tigFileName = argv[++arg]; } else if (strcmp(argv[arg], "-O") == 0) { outResultsName = argv[++arg]; } else if (strcmp(argv[arg], "-L") == 0) { outLayoutsName = argv[++arg]; } else if (strcmp(argv[arg], "-A") == 0) { outSeqNameA = argv[++arg]; } else if (strcmp(argv[arg], "-Q") == 0) { outSeqNameQ = argv[++arg]; } else if (strcmp(argv[arg], "-quick") == 0) { algorithm = 'Q'; } else if (strcmp(argv[arg], "-pbdagcon") == 0) { algorithm = 'P'; } else if (strcmp(argv[arg], "-utgcns") == 0) { algorithm = 'U'; } else if (strcmp(argv[arg], "-edlib") == 0) { aligner = 'E'; } else if (strcmp(argv[arg], "-normalize") == 0) { normalize = true; } else if (strcmp(argv[arg], "-nonormalize") == 0) { normalize = false; } else if (strcmp(argv[arg], "-threads") == 0) { numThreads = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-p") == 0) { inPackageName = 
argv[++arg]; } else if (strcmp(argv[arg], "-P") == 0) { outPackageName = argv[++arg]; } else if (strcmp(argv[arg], "-e") == 0) { errorRate = atof(argv[++arg]); } else if (strcmp(argv[arg], "-em") == 0) { errorRateMax = atof(argv[++arg]); } else if (strcmp(argv[arg], "-l") == 0) { minOverlap = atoi(argv[++arg]); } else if (strcmp(argv[arg], "-f") == 0) { forceCompute = true; } else if (strcmp(argv[arg], "-v") == 0) { showResult = true; } else if (strcmp(argv[arg], "-V") == 0) { verbosity++; } else if (strcmp(argv[arg], "-maxcoverage") == 0) { maxCov = atof(argv[++arg]); } else if (strcmp(argv[arg], "-maxlength") == 0) { maxLen = atof(argv[++arg]); } else if (strcmp(argv[arg], "-onlyunassem") == 0) { onlyUnassem = true; } else if (strcmp(argv[arg], "-onlybubble") == 0) { onlyBubble = true; } else if (strcmp(argv[arg], "-onlycontig") == 0) { onlyContig = true; } else if (strcmp(argv[arg], "-nosingleton") == 0) { noSingleton = true; } else { fprintf(stderr, "%s: Unknown option '%s'\n", argv[0], argv[arg]); err++; } arg++; } if ((gkpName == NULL) && (tigName != NULL)) { gkpName = new char [FILENAME_MAX]; snprintf(gkpName, FILENAME_MAX, "%s/partitionedReads.gkpStore", tigName); } if ((gkpName == NULL) && (inPackageName == NULL)) err++; if ((tigFileName == NULL) && (tigName == NULL) && (inPackageName == NULL)) err++; if ((algorithm != 'Q') && (algorithm != 'P') && (algorithm != 'U')) err++; if (err) { fprintf(stderr, "usage: %s [opts]\n", argv[0]); fprintf(stderr, "\n"); fprintf(stderr, " INPUT\n"); fprintf(stderr, " -G g Load reads from gkStore 'g'\n"); fprintf(stderr, " -T t v p Load tig from tgStore 't', version 'v', partition 'p'.\n"); fprintf(stderr, " Expects reads will be in gkStore partition 'p' as well\n"); fprintf(stderr, " Use p='.' 
to specify no partition\n"); fprintf(stderr, " -t file Test the computation of the tig layout in 'file'\n"); fprintf(stderr, " 'file' can be from:\n"); fprintf(stderr, " 'tgStoreDump -d layout' (human readable layout format)\n"); fprintf(stderr, " 'utgcns -L' (human readable layout format)\n"); fprintf(stderr, " 'utgcns -O' (binary multialignment format)\n"); fprintf(stderr, "\n"); fprintf(stderr, " -p package Load tig and reads from 'package' created with -P. This\n"); fprintf(stderr, " is usually used by developers.\n"); fprintf(stderr, "\n"); fprintf(stderr, "\n"); fprintf(stderr, " ALGORITHM\n"); fprintf(stderr, " -quick No alignments, just paste read sequence into the tig positions.\n"); fprintf(stderr, " This is very fast, but the consensus sequence is formed from a mosaic\n"); fprintf(stderr, " of read sequences, and there can be large indel. This is useful for\n"); fprintf(stderr, " checking intermediate assembly structure by mapping to reference, or\n"); fprintf(stderr, " possibly for use as input to a polishing step.\n"); fprintf(stderr, " -pbdagcon Use pbdagcon (https://github.com/PacificBiosciences/pbdagcon).\n"); fprintf(stderr, " This is fast and robust. It is the default algorithm. It does not\n"); fprintf(stderr, " generate a final multialignment output (the -v option will not show\n"); fprintf(stderr, " anything useful).\n"); fprintf(stderr, " -utgcns Use utgcns (the original Celera Assembler consensus algorithm)\n"); fprintf(stderr, " This isn't as fast, isn't as robust, but does generate a final multialign\n"); fprintf(stderr, " output.\n"); fprintf(stderr, "\n"); fprintf(stderr, "\n"); fprintf(stderr, " ALIGNER\n"); fprintf(stderr, " -edlib Myers' O(ND) algorithm from Edlib (https://github.com/Martinsos/edlib).\n"); fprintf(stderr, " This is the default (and, yes, there is no non-default aligner).\n"); //fprintf(stderr, "\n"); //fprintf(stderr, " -normalize Shift gaps to one side. 
Probably not useful anymore.\n"); fprintf(stderr, "\n"); fprintf(stderr, "\n"); fprintf(stderr, " OUTPUT\n"); fprintf(stderr, " -O results Write computed tigs to binary output file 'results'\n"); fprintf(stderr, " -L layouts Write computed tigs to layout output file 'layouts'\n"); fprintf(stderr, " -A fasta Write computed tigs to fasta output file 'fasta'\n"); fprintf(stderr, " -Q fastq Write computed tigs to fastq output file 'fastq'\n"); fprintf(stderr, "\n"); fprintf(stderr, " -P package Create a copy of the inputs needed to compute the tigs. This\n"); fprintf(stderr, " file can then be sent to the developers for debugging. The tig(s)\n"); fprintf(stderr, " are not processed and no other outputs are created. Ideally,\n"); fprintf(stderr, " only one tig is selected (-u, below).\n"); fprintf(stderr, "\n"); fprintf(stderr, "\n"); fprintf(stderr, " TIG SELECTION (if -T input is used)\n"); fprintf(stderr, " -tig b Compute only tig ID 'b' (must be in the correct partition!)\n"); fprintf(stderr, " -tig b-e Compute only tigs from ID 'b' to ID 'e'\n"); fprintf(stderr, " -u Alias for -tig\n"); fprintf(stderr, " -f Recompute tigs that already have a multialignment\n"); fprintf(stderr, " -maxlength l Do not compute consensus for tigs longer than l bases.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -onlyunassem Only compute consensus for unassembled tigs.\n"); fprintf(stderr, " -onlybubble Only compute consensus for bubble tigs (there are no bubbles).\n"); fprintf(stderr, " -onlycontig Only compute consensus for real unitigs/contigs.\n"); fprintf(stderr, "\n"); fprintf(stderr, " -nosingleton Do not compute consensus for singleton (single-read) tigs.\n"); fprintf(stderr, "\n"); fprintf(stderr, " PARAMETERS\n"); fprintf(stderr, " -e e Expect alignments at up to fraction e error\n"); fprintf(stderr, " -em m Don't ever allow alignments more than fraction m error\n"); fprintf(stderr, " -l l Expect alignments of at least l bases\n"); fprintf(stderr, " -maxcoverage c Use 
non-contained reads and the longest contained reads, up to\n"); fprintf(stderr, " C coverage, for consensus generation. The default is 0, and will\n"); fprintf(stderr, " use all reads.\n"); fprintf(stderr, " -threads t Use 't' compute threads; default 1.\n"); fprintf(stderr, "\n"); fprintf(stderr, " LOGGING\n"); fprintf(stderr, " -v Show multialigns.\n"); fprintf(stderr, " -V Enable debugging option 'verbosemultialign'.\n"); fprintf(stderr, "\n"); if ((gkpName == NULL) && (inPackageName == NULL)) fprintf(stderr, "ERROR: No gkpStore (-G) and no package (-p) supplied.\n"); if ((tigFileName == NULL) && (tigName == NULL) && (inPackageName == NULL)) fprintf(stderr, "ERROR: No tigStore (-T) OR no test tig (-t) OR no package (-p) supplied.\n"); if ((algorithm != 'Q') && (algorithm != 'P') && (algorithm != 'U')) fprintf(stderr, "ERROR: Invalid algorithm '%c' specified; must be one of -quick, -pbdagcon, -utgcns.\n", algorithm); exit(1); } errno = 0; // Open output files. If we're creating a package, the usual output files are not opened. 
if (outPackageName) outPackageFile = fopen(outPackageName, "w"); if (errno) fprintf(stderr, "Failed to open output package file '%s': %s\n", outPackageName, strerror(errno)), exit(1); if ((outResultsName) && (outPackageName == NULL)) outResultsFile = fopen(outResultsName, "w"); if (errno) fprintf(stderr, "Failed to open output results file '%s': %s\n", outResultsName, strerror(errno)), exit(1); if ((outLayoutsName) && (outPackageName == NULL)) outLayoutsFile = fopen(outLayoutsName, "w"); if (errno) fprintf(stderr, "Failed to open output layout file '%s': %s\n", outLayoutsName, strerror(errno)), exit(1); if ((outSeqNameA) && (outPackageName == NULL)) outSeqFileA = fopen(outSeqNameA, "w"); if (errno) fprintf(stderr, "Failed to open output FASTA file '%s': %s\n", outSeqNameA, strerror(errno)), exit(1); if ((outSeqNameQ) && (outPackageName == NULL)) outSeqFileQ = fopen(outSeqNameQ, "w"); if (errno) fprintf(stderr, "Failed to open output FASTQ file '%s': %s\n", outSeqNameQ, strerror(errno)), exit(1); if (numThreads > 0) { omp_set_num_threads(numThreads); fprintf(stderr, "number of threads = %d (command line)\n", numThreads); fprintf(stderr, "\n"); } else { fprintf(stderr, "number of threads = %d (OpenMP default)\n", omp_get_max_threads()); fprintf(stderr, "\n"); } // Open gatekeeper for read only, and load the partitioned data if tigPart > 0. 
  gkStore                    *gkpStore          = NULL;
  tgStore                    *tigStore          = NULL;
  FILE                       *tigFile           = NULL;
  FILE                       *inPackageFile     = NULL;
  map<uint32, gkRead *>      *inPackageRead     = NULL;
  map<uint32, gkReadData *>  *inPackageReadData = NULL;

  if (gkpName) {
    fprintf(stderr, "-- Opening gkpStore '%s' partition %u.\n", gkpName, tigPart);
    gkpStore = gkStore::gkStore_open(gkpName, gkStore_readOnly, tigPart);
  }

  if (tigName) {
    fprintf(stderr, "-- Opening tigStore '%s' version %u.\n", tigName, tigVers);
    tigStore = new tgStore(tigName, tigVers);
  }

  if (tigFileName) {
    fprintf(stderr, "-- Opening tigFile '%s'.\n", tigFileName);

    errno = 0;
    tigFile = fopen(tigFileName, "r");
    if (errno)
      fprintf(stderr, "Failed to open input tig file '%s': %s\n", tigFileName, strerror(errno)), exit(1);
  }

  if (inPackageName) {
    fprintf(stderr, "-- Opening package file '%s'.\n", inPackageName);

    errno = 0;
    inPackageFile = fopen(inPackageName, "r");
    if (errno)
      fprintf(stderr, "Failed to open input package file '%s': %s\n", inPackageName, strerror(errno)), exit(1);
  }

  //  Report some sizes.

  fprintf(stderr, "sizeof(abBead)     " F_SIZE_T "\n", sizeof(abBead));
  fprintf(stderr, "sizeof(abColumn)   " F_SIZE_T "\n", sizeof(abColumn));
  fprintf(stderr, "sizeof(abAbacus)   " F_SIZE_T "\n", sizeof(abAbacus));
  fprintf(stderr, "sizeof(abSequence) " F_SIZE_T "\n", sizeof(abSequence));

  //  Decide on what to compute.  Either all tigs, or a single tig, or a special case test.

  uint32  b = 0;
  uint32  e = UINT32_MAX;

  if (tigStore) {
    if (utgEnd > tigStore->numTigs() - 1)
      utgEnd = tigStore->numTigs() - 1;

    if (utgBgn != UINT32_MAX) {
      b = utgBgn;
      e = utgEnd;
    } else {
      b = 0;
      e = utgEnd;
    }

    fprintf(stderr, "-- Computing consensus for b=" F_U32 " to e=" F_U32 " with errorRate %0.4f (max %0.4f) and minimum overlap " F_U32 "\n",
            b, e, errorRate, errorRateMax, minOverlap);
  } else {
    fprintf(stderr, "-- Computing consensus with errorRate %0.4f (max %0.4f) and minimum overlap " F_U32 "\n",
            errorRate, errorRateMax, minOverlap);
  }

  fprintf(stderr, "\n");

  //  I don't like this loop control.
  for (uint32 ti=b; (e == UINT32_MAX) || (ti <= e); ti++) {
    tgTig  *tig = NULL;

    //  If a tigStore, load the tig.  The tig is the owner; it cannot be deleted by us.
    if (tigStore) {
      tig = tigStore->loadTig(ti);
    }

    //  If a tigFile, create a new tig and load it.  Obviously, we own it.
    if (tigFile) {
      tig = new tgTig();

      if (tig->loadFromStreamOrLayout(tigFile) == false) {
        delete tig;
        break;
      }
    }

    //  If a package, create a new tig and load it.  Obviously, we own it.  If the tig loads,
    //  populate the read and readData maps with data from the package.
    if (inPackageFile) {
      tig = new tgTig();

      if (tig->loadFromStreamOrLayout(inPackageFile) == false) {
        delete tig;
        break;
      }

      inPackageRead     = new map<uint32, gkRead *>;
      inPackageReadData = new map<uint32, gkReadData *>;

      for (int32 ii=0; ii<tig->numberOfChildren(); ii++) {
        uint32       readID = tig->getChild(ii)->ident();

        gkRead      *read = (*inPackageRead)    [readID] = new gkRead;
        gkReadData  *data = (*inPackageReadData)[readID] = new gkReadData;

        gkStore::gkStore_loadReadFromStream(inPackageFile, read, data);

        if (read->gkRead_readID() != readID)
          fprintf(stderr, "ERROR: package not in sync with tig.  package readID = %u  tig readID = %u\n",
                  read->gkRead_readID(), readID);
        assert(read->gkRead_readID() == readID);
      }
    }

    //  No tig loaded, keep going.
    if (tig == NULL)
      continue;

    //  More 'not liking' - set the verbosity level for logging.
    tig->_utgcns_verboseLevel = verbosity;

    //  Are we partitioned?  Is this tig in our partition?
    if (tigPart != UINT32_MAX) {
      uint32  missingReads = 0;

      for (uint32 ii=0; ii<tig->numberOfChildren(); ii++)
        if (gkpStore->gkStore_getReadInPartition(tig->getChild(ii)->ident()) == NULL)
          missingReads++;

      if (missingReads) {
        //fprintf(stderr, "SKIP tig %u with %u reads found only %u reads in partition, skipped\n",
        //        tig->tigID(), tig->numberOfChildren(), tig->numberOfChildren() - missingReads);
        continue;
      }
    }

    //  Skip stuff we want to skip.
    if (tig->length(true) > maxLen)
      continue;

    if ((onlyUnassem == true) && (tig->_class != tgTig_unassembled))
      continue;
    if ((onlyContig  == true) && (tig->_class != tgTig_contig))
      continue;
    if ((onlyBubble  == true) && (tig->_class != tgTig_bubble))
      continue;

    if ((noSingleton == true) && (tig->numberOfChildren() == 1))
      continue;

    if (tig->numberOfChildren() == 0)
      continue;

    //  Process the tig.  Remove deep coverage, create a consensus object, process it, and report
    //  the results before we add it to the store.

    bool  exists = tig->consensusExists();

    if (tig->numberOfChildren() > 1)
      fprintf(stderr, "Working on tig %d of length %d (%d children)%s%s\n",
              tig->tigID(), tig->length(true), tig->numberOfChildren(),
              ((exists == true) && (forceCompute == false)) ? " - already computed" : "",
              ((exists == true) && (forceCompute == true))  ? " - already computed, recomputing" : "");

    unitigConsensus  *utgcns       = new unitigConsensus(gkpStore, errorRate, errorRateMax, minOverlap);
    savedChildren    *origChildren = NULL;
    bool              success      = exists;

    //  Save the tig in the package?
    //
    //  The original idea was to dump the tig and all the reads, then load the tig and process as normal.
    //  Sadly, stashContains() rearranges the order of the reads even if it doesn't remove any.  The rearranged
    //  tig couldn't be saved (otherwise it would be rearranged again).  So, we were in the position of
    //  needing to save the original tig and the rearranged reads.  Impossible.
    //
    //  Instead, we save the original tig and original reads -- including any that get stashed -- then
    //  load them all back into a map for use in consensus proper.  It's a bit of a pain, and could
    //  have way more reads saved than necessary.

    if (outPackageFile) {
      utgcns->savePackage(outPackageFile, tig);
      fprintf(stderr, "  Packaged tig %u into '%s'\n", tig->tigID(), outPackageName);
    }

    //  Compute consensus if it doesn't exist, or if we're forcing a recompute.  But only if we
    //  didn't just package it.
if ((outPackageFile == NULL) && ((exists == false) || (forceCompute == true))) { origChildren = stashContains(tig, maxCov, true); if (tig->numberOfChildren() == 1) { success = utgcns->generateSingleton(tig, inPackageRead, inPackageReadData); } else if (algorithm == 'Q') { success = utgcns->generateQuick(tig, inPackageRead, inPackageReadData); } else if (algorithm == 'P') { success = utgcns->generatePBDAG(aligner, normalize, tig, inPackageRead, inPackageReadData); } else if (algorithm == 'U') { success = utgcns->generate(tig, inPackageRead, inPackageReadData); } else { fprintf(stderr, "Invalid algorithm. How'd you do this?\n"); assert(0); } } // If it was successful (or existed already), output. Success is always false if the tig // was packaged, regardless of if it existed already. if (success == true) { if ((showResult) && (gkpStore)) // No gkpStore if we're from a package. Dang. tig->display(stdout, gkpStore, 200, 3); unstashContains(tig, origChildren); if (outResultsFile) tig->saveToStream(outResultsFile); if (outLayoutsFile) tig->dumpLayout(outLayoutsFile); if (outSeqFileA) tig->dumpFASTA(outSeqFileA, true); if (outSeqFileQ) tig->dumpFASTQ(outSeqFileQ, true); } // Report failures. if ((success == false) && (outPackageFile == NULL)) { fprintf(stderr, "unitigConsensus()-- tig %d failed.\n", tig->tigID()); numFailures++; } // Clean up, unloading or deleting the tig. delete utgcns; // No real reason to keep this until here. delete origChildren; // Need to keep it until after we display() above. 
if (tigStore) tigStore->unloadTig(tig->tigID(), true); // Tell the store we're done with it if (tigFile) delete tig; } finish: delete tigStore; gkpStore->gkStore_close(); if (tigFile) fclose(tigFile); if (outResultsFile) fclose(outResultsFile); if (outLayoutsFile) fclose(outLayoutsFile); if (outPackageFile) fclose(outPackageFile); if (inPackageFile) fclose(inPackageFile); if (numFailures) { fprintf(stderr, "WARNING: Total number of tig failures = %d\n", numFailures); fprintf(stderr, "\n"); fprintf(stderr, "Consensus did NOT finish successfully.\n"); } else { fprintf(stderr, "Consensus finished successfully. Bye.\n"); } return(numFailures != 0); } canu-1.6/src/utgcns/utgcns.mk000066400000000000000000000010171314437614700162450ustar00rootroot00000000000000# If 'make' isn't run from the root directory, we need to set these to # point to the upper level build directory. ifeq "$(strip ${BUILD_DIR})" "" BUILD_DIR := ../$(OSTYPE)-$(MACHINETYPE)/obj endif ifeq "$(strip ${TARGET_DIR})" "" TARGET_DIR := ../$(OSTYPE)-$(MACHINETYPE)/bin endif TARGET := utgcns SOURCES := utgcns.C stashContains.C SRC_INCDIRS := .. ../AS_UTL ../stores libcns libpbutgcns libNDFalcon libboost TGT_LDFLAGS := -L${TARGET_DIR} TGT_LDLIBS := -lcanu TGT_PREREQS := libcanu.a SUBMAKEFILES :=
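
The read-filtering policy that `stashContains()` implements (keep every span-extending dovetail read, then add contained reads longest-first until the kept bases reach `maxCov` times the tig span) can be illustrated with a small self-contained sketch.  The types and function below (`Read`, `filterByCoverage`) are hypothetical stand-ins, not canu code; canu's real implementation works on `tgPosition` children and also tracks the statistics reported in `savedChildren`.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

//  A toy read: an interval [lo, hi) of positions on the tig.
struct Read { int32_t lo, hi; };

//  Return the indices (in position-sorted order) of the reads kept.
std::vector<size_t>
filterByCoverage(std::vector<Read> reads, double maxCov) {
  if (reads.empty())
    return {};

  std::sort(reads.begin(), reads.end(),
            [](const Read &a, const Read &b) { return a.lo < b.lo; });

  std::vector<bool> keep(reads.size(), false);

  int32_t  hiEnd     = reads[0].hi;              //  First read always starts the backbone.
  int64_t  doveBases = reads[0].hi - reads[0].lo;
  keep[0] = true;

  for (size_t i = 1; i < reads.size(); i++) {
    if (reads[i].hi > hiEnd) {                   //  Extends the backbone: a dovetail read.
      keep[i]    = true;
      doveBases += reads[i].hi - reads[i].lo;
    }
    hiEnd = std::max(hiEnd, reads[i].hi);
  }

  //  Visit contained reads longest-first.
  std::vector<size_t> order(reads.size());
  for (size_t i = 0; i < order.size(); i++)
    order[i] = i;
  std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
    return (reads[a].hi - reads[a].lo) > (reads[b].hi - reads[b].lo);
  });

  int64_t savedBases = 0;
  for (size_t i : order) {
    if (keep[i])
      continue;                                  //  Already a backbone read.
    if ((double)(savedBases + doveBases) / hiEnd < maxCov) {
      keep[i]     = true;                        //  Save this contained read.
      savedBases += reads[i].hi - reads[i].lo;
    }
  }

  std::vector<size_t> kept;
  for (size_t i = 0; i < reads.size(); i++)
    if (keep[i])
      kept.push_back(i);
  return kept;
}
```

With four reads spanning 0-150, a generous cap keeps everything, while a tight cap (e.g. 1.5x) keeps both dovetails and only the longest contained read, mirroring what `unstashContains()` later has to undo by restoring the original child list.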