centrifuge-f39767eb57d8e175029c/.gitignore
*~
*.dSYM
.DS_Store
*-debug
*-s
*-l
centrifuge.xcodeproj/project.xcworkspace
centrifuge.xcodeproj/xcuserdata
centrifuge.xcodeproj/xcshareddata
centrifuge-build-bin
centrifuge-buildc
centrifuge-class
centrifuge-inspect-bin
centrifuge-f39767eb57d8e175029c/AUTHORS
Ben Langmead wrote Bowtie 2, which is based partially on
Bowtie. Bowtie was written by Ben Langmead and Cole Trapnell.
Bowtie & Bowtie 2: http://bowtie-bio.sf.net
A DLL from the pthreads for Win32 library is distributed with the Win32 version
of Bowtie 2. The pthreads for Win32 library and the GnuWin32 package have many
contributors (see their respective web sites).
pthreads for Win32: http://sourceware.org/pthreads-win32
GnuWin32: http://gnuwin32.sf.net
The ForkManager.pm perl module is used in Bowtie 2's random testing framework,
and is included as scripts/sim/contrib/ForkManager.pm. ForkManager.pm is
written by dLux (Szabo, Balazs), with contributions by others. See the perldoc
in ForkManager.pm for the complete list.
The file ls.h includes an implementation of the Larsson-Sadakane suffix sorting
algorithm. The implementation is by N. Jesper Larsson and was adapted somewhat
for use in Bowtie 2.
TinyThreads is a portable thread implementation with a fairly compatible subset
of C++11 thread management classes written by Marcus Geelnard. For more info
check http://tinythreadpp.bitsnbites.eu/
Various users have kindly supplied patches, bug reports and feature requests
over the years. Many, many thanks go to them.
September 2011
centrifuge-f39767eb57d8e175029c/LICENSE
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The GNU General Public License is a free, copyleft license for
software and other kinds of works.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users. We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors. You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights. Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received. You must make sure that they, too, receive
or can get the source code. And you must show them these terms so they
know their rights.
Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.
For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software. For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.
Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so. This is fundamentally incompatible with the aim of
protecting users' freedom to change the software. The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable. Therefore, we
have designed this version of the GPL to prohibit the practice for those
products. If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.
Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary. To prevent this, the GPL assures that
patents cannot be used to render the program non-free.
The precise terms and conditions for copying, distribution and
modification follow.
TERMS AND CONDITIONS
0. Definitions.
"This License" refers to version 3 of the GNU General Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.
A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.
12. No Surrender of Others' Freedom.
If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.
13. Use with the GNU Affero General Public License.
Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
combined work, and to convey the resulting work. The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.
14. Revised Versions of this License.
The Free Software Foundation may publish revised and/or new versions of
the GNU General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the GNU General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of the
GNU General Public License, you may choose any version ever published
by the Free Software Foundation.
If the Program specifies that a proxy can decide which future
versions of the GNU General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
17. Interpretation of Sections 15 and 16.
If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Also add information on how to contact you by electronic and paper mail.
If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:
<program> Copyright (C) <year> <name of author>
This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, your program's commands
might be different; for a GUI interface, you would use an "about box".
You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
<http://www.gnu.org/licenses/>.
The GNU General Public License does not permit incorporating your program
into proprietary programs. If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License. But first, please read
<http://www.gnu.org/philosophy/why-not-lgpl.html>.
centrifuge-f39767eb57d8e175029c/MANUAL
Introduction
============
What is Centrifuge?
-----------------
[Centrifuge] is a novel microbial classification engine that enables
rapid, accurate, and sensitive labeling of reads and quantification of
species on desktop computers. The system uses a novel indexing scheme
based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini
(FM) index, optimized specifically for the metagenomic classification
problem. Centrifuge requires a relatively small index (5.8 GB for all
complete bacterial and viral genomes plus the human genome) and
classifies sequences at a very high speed, allowing it to process the
millions of reads from a typical high-throughput DNA sequencing run
within a few minutes. Together these advances enable timely and
accurate analysis of large metagenomics data sets on conventional
desktop computers.
[Centrifuge]: http://www.ccb.jhu.edu/software/centrifuge
[Burrows-Wheeler Transform]: http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
[FM Index]: http://en.wikipedia.org/wiki/FM-index
[GPLv3 license]: http://www.gnu.org/licenses/gpl-3.0.html
Obtaining Centrifuge
==================
Download Centrifuge and binaries from the Releases section on the right side.
Binaries are available for Intel architectures (`x86_64`) running Linux, and Mac OS X.
Building from source
--------------------
Building Centrifuge from source requires a GNU-like environment with GCC, GNU Make
and other basics. It should be possible to build Centrifuge on most vanilla Linux
installations or on a Mac installation with [Xcode] installed. Centrifuge can
also be built on Windows using [Cygwin] or [MinGW] (MinGW recommended). For a
MinGW build, the choice of compiler is important, since it determines whether
32-bit or 64-bit code can be compiled. If you need to generate both 32-bit and
64-bit binaries on the same machine, a multilib MinGW has to be properly
installed. [MSYS], the [zlib] library, and, depending on the architecture, the
[pthreads] library are also required. We recommend a 64-bit build, since it has
clear advantages for real research problems. To simplify the MinGW setup, it may
be worth investigating popular MinGW personal builds, since these come prepared
with most of the needed toolchain.
First, download the [source package] from the Releases section on the right side.
Unzip the file, change to the unzipped directory, and build the
Centrifuge tools by running GNU `make` (usually with the command `make`, but
sometimes with `gmake`) with no arguments. If building with MinGW, run `make`
from the MSYS environment.
Centrifuge uses multithreading to speed up execution on SMP architectures where
possible. On POSIX platforms (Linux, Mac OS, etc.) it needs the pthread library.
Although it is possible to use the pthread library on a non-POSIX platform like
Windows, for performance reasons Centrifuge will use Windows native
multithreading when it can.
For the support of SRA data access in Centrifuge, please download and install the [NCBI-NGS] toolkit.
When running `make`, specify additional variables as follows:
`make USE_SRA=1 NCBI_NGS_DIR=/path/to/NCBI-NGS-directory NCBI_VDB_DIR=/path/to/NCBI-NGS-directory`,
where `NCBI_NGS_DIR` and `NCBI_VDB_DIR` will be used in Makefile for -I and -L compilation options.
For example, $(NCBI_NGS_DIR)/include and $(NCBI_NGS_DIR)/lib64 will be used.
[Cygwin]: http://www.cygwin.com/
[MinGW]: http://www.mingw.org/
[MSYS]: http://www.mingw.org/wiki/msys
[zlib]: http://cygwin.com/packages/mingw-zlib/
[pthreads]: http://sourceware.org/pthreads-win32/
[GnuWin32]: http://gnuwin32.sf.net/packages/coreutils.htm
[Xcode]: http://developer.apple.com/xcode/
[Github site]: https://github.com/infphilo/centrifuge
[NCBI-NGS]: https://github.com/ncbi/ngs/wiki/Downloads
Running Centrifuge
=============
Adding to PATH
--------------
By adding your new Centrifuge directory to your [PATH environment variable], you
ensure that whenever you run `centrifuge`, `centrifuge-build`, `centrifuge-download` or `centrifuge-inspect`
from the command line, you will get the version you just installed without
having to specify the entire path. This is recommended for most users. To do
this, follow your operating system's instructions for adding the directory to
your [PATH].
If you would like to install Centrifuge by copying the Centrifuge executable files
to an existing directory in your [PATH], make sure that you copy all the
executables, including `centrifuge`, `centrifuge-class`, `centrifuge-build`, `centrifuge-build-bin`, `centrifuge-download`, `centrifuge-inspect`,
and `centrifuge-inspect-bin`. Furthermore, you need the programs
in the scripts/ folder if you opt for genome compression in the database construction.
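For example, on Linux or Mac OS X with a bash-like shell, this might look as follows (a minimal sketch; `/path/to/centrifuge` is a placeholder for your actual install directory):

    # make the centrifuge executables visible in the current shell session
    export PATH=$PATH:/path/to/centrifuge
    # add the same line to your shell startup file (e.g. ~/.bashrc) to make it permanent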
[PATH environment variable]: http://en.wikipedia.org/wiki/PATH_(variable)
[PATH]: http://en.wikipedia.org/wiki/PATH_(variable)
Before running Centrifuge
-----------------
Classification is considerably different from alignment in that classification is performed on a large set of genomes, as opposed to just one reference genome as in alignment. Currently, an enormous number of complete genomes are available in GenBank (e.g. >4,000 bacterial genomes, >10,000 viral genomes, …). These genomes are organized in a taxonomic tree where each genome is located at the bottom of the tree, at the strain or subspecies level. On the taxonomic tree, genomes have ancestors usually situated at the species level, and those ancestors also have ancestors at the genus level, and so on up through the family, order, class, phylum, and kingdom levels, to the root.
Given the gigantic number of genomes available, which continues to expand at a rapid rate, and the taxonomic tree, which continues to evolve with new advances in research, we have designed Centrifuge to be flexible and general enough to reflect this huge database. We provide several standard indexes that will meet most users' needs (see the side panel - Indexes). Our indexes include not only the raw genome sequences, but also genome names/sizes and the taxonomic tree. This enables users to perform additional analyses on Centrifuge's classification output without needing to download extra database sources. It also eliminates potential discrepancies between the indexes we provide and the databases users might otherwise download. We plan to provide a couple of additional standard indexes in the near future, and to update the indexes on a regular basis.
We encourage first-time users to take a look at and follow a `small example` that illustrates how to build an index, how to run Centrifuge using the index, how to interpret the classification results, and how to extract additional genomic information from the index. Those who choose to build customized indexes should take a close look at the following description.
Database download and index building
-----------------
Centrifuge indexes can be built from arbitrary sequences. Usually an ensemble of
genomes is used, such as all complete archaeal, bacterial, and viral genomes in the
RefSeq database, or all sequences in the BLAST nt database. The sections below show
how to download these sequences and build an index from them.
To map sequence identifiers to taxonomy IDs, and taxonomy IDs to names and
their parents, three files are necessary in addition to the sequence files:
- taxonomy tree: typically nodes.dmp from the NCBI taxonomy dump. Links taxonomy IDs to their parents
- names file: typically names.dmp from the NCBI taxonomy dump. Links taxonomy IDs to their scientific name
- a tab-separated sequence ID to taxonomy ID mapping
When using the provided scripts to download the genomes, these files are automatically downloaded or generated.
When using custom taxonomy or sequence files, please refer to the [Custom database] section below to learn more about their format.
### Building index on all complete bacterial and viral genomes
Use `centrifuge-download` to download genomes from NCBI. The following two commands download
the NCBI taxonomy to `taxonomy/` in the current directory, and all complete archaeal,
bacterial and viral genomes to `library/`. Low-complexity regions in the genomes are masked after
download (parameter `-m`) using blast+'s `dustmasker`. `centrifuge-download` outputs tab-separated
sequence ID to taxonomy ID mappings to standard out, which are required by `centrifuge-build`.

    centrifuge-download -o taxonomy taxonomy
    centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map
To build the index, first concatenate all downloaded sequences into a single file, and then
run `centrifuge-build`:

    cat library/*/*.fna > input-sequences.fna

    ## build centrifuge index with 4 threads
    centrifuge-build -p 4 --conversion-table seqid2taxid.map \
                     --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
                     input-sequences.fna abv
After building the index, all but the `*.[123].cf` index files may be removed;
i.e. the files in the `library/` and `taxonomy/` directories are no longer needed.
If you also want to include the human and/or mouse genome, add their sequences to
the `library/` folder before building the index, using one of the commands in the
next section.
### Adding human or mouse genome to the index
The human and mouse genomes can also be downloaded using `centrifuge-download`. They are in the
domain "vertebrate_mammalian" (argument `-d`), are assembled at the chromosome level (argument `-a`)
and categorized as reference genomes by RefSeq (`-c`). The argument `-t` takes a comma-separated
list of taxonomy IDs - e.g. `9606` for human and `10090` for mouse:

    # download mouse and human reference genomes
    centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 9606,10090 -c 'reference genome' >> seqid2taxid.map

    # only human
    centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 9606 -c 'reference genome' >> seqid2taxid.map

    # only mouse
    centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 10090 -c 'reference genome' >> seqid2taxid.map
### nt database
NCBI BLAST's nt database contains all spliced non-redundant coding
sequences from multiple databases, inferred from genomic
sequences. Traditionally used with BLAST, a download of the FASTA
file is provided on the NCBI homepage. Building an index with any database
requires the user to create a sequence ID to taxonomy ID map, which
can be generated from a GI taxid dump:

    wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
    gunzip nt.gz && mv -v nt nt.fa

    # Get mapping file
    wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
    gunzip -c gi_taxid_nucl.dmp.gz | sed 's/^/gi|/' > gi_taxid_nucl.map

    # build index using 16 cores and a small bucket size, which will require less memory
    centrifuge-build -p 16 --bmax 1342177280 --conversion-table gi_taxid_nucl.map \
                     --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
                     nt.fa nt
### Custom database
TODO: Add toy example for nodes.dmp, names.dmp and seqid2taxid.map
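Until that example is added, a minimal sketch of the expected formats may help (a hypothetical single-species taxonomy; the NCBI `.dmp` files separate fields with `<tab>|<tab>`, while the mapping file is plain tab-separated):

    # nodes.dmp -- taxID | parent taxID | rank (remaining fields omitted; lineage collapsed for brevity)
    1	|	1	|	no rank	|
    2	|	1	|	superkingdom	|
    562	|	2	|	species	|

    # names.dmp -- taxID | name | unique name | name class
    1	|	root	|		|	scientific name	|
    2	|	Bacteria	|	Bacteria <prokaryotes>	|	scientific name	|
    562	|	Escherichia coli	|		|	scientific name	|

    # seqid2taxid.map -- sequence ID <tab> taxonomy ID
    NC_000913.3	562

With these files and a FASTA file (say, `ecoli.fa`) containing a sequence named `NC_000913.3`, the index would be built as shown earlier:

    centrifuge-build --conversion-table seqid2taxid.map --taxonomy-tree nodes.dmp \
                     --name-table names.dmp ecoli.fa ecoli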
### Centrifuge classification output
The following example shows classification assignments for a read. The assignment output has 8 columns.
readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
1_1 gi|4 9646 4225 0 80 80 1
The first column is the read ID from a raw sequencing read (e.g., 1_1 in the example).
The second column is the sequence ID of the genomic sequence, where the read is classified (e.g., gi|4).
The third column is the taxonomic ID of the genomic sequence in the second column (e.g., 9646).
The fourth column is the score for the classification, which is the weighted sum of hits (e.g., 4225).
The fifth column is the score for the next best classification (e.g., 0).
The sixth column is the hit length: an approximate number of base pairs of the read that match the genomic sequence (e.g., 80).
The seventh column is the query length: the length of the read or, for a pair, the combined length of both mates (e.g., 80).
The eighth column is the number of classifications for this read, indicating how many assignments were made (e.g., 1).
### Centrifuge summary output (the default filename is centrifuge_report.tsv)
The following example shows a classification summary for each genome or taxonomic unit. The summary output has 7 columns.
name taxID taxRank genomeSize numReads numUniqueReads abundance
Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis 36870 leaf 703004 5981 5964 0.0152317
The first column is the name of a genome, or the name corresponding to a taxonomic ID (the second column) at a rank higher than the strain (e.g., Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis).
The second column is the taxonomic ID (e.g., 36870).
The third column is the taxonomic rank (e.g., leaf).
The fourth column is the length of the genome sequence (e.g., 703004).
The fifth column is the number of reads classified to this genomic sequence including multi-classified reads (e.g., 5981).
The sixth column is the number of reads uniquely classified to this genomic sequence (e.g., 5964).
The seventh column is the proportion of this genome normalized by its genomic length (e.g., 0.0152317).
As the GenBank database is incomplete (i.e., many more genomes remain to be identified and added), and reads have sequencing errors, classification programs including Centrifuge often report many false assignments. In order to perform more conservative analyses, users may want to discard assignments whose matching length (the 6th column in the output of Centrifuge) is 40% of the read length or lower. It may also be helpful to use the score (4th column) to filter out some assignments. Our future research plans include developing methods that estimate confidence scores for assignments.
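As a sketch of such a post-filter (assuming the default 8-column assignment output shown above was written to a hypothetical file `classification.tsv`; both thresholds are illustrative, not recommendations):

    # keep the header plus assignments where hitLength/queryLength > 0.40 and score >= 300
    awk -F'\t' 'NR==1 || ($6/$7 > 0.40 && $4 >= 300)' classification.tsv > classification.filtered.tsv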
### Kraken-style report
`centrifuge-kreport` can be used to make a Kraken-style report from the Centrifuge output including taxonomy information:
`centrifuge-kreport -x <index name> <centrifuge output files>`
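For example (a sketch assuming the `abv` index built above and a hypothetical Centrifuge output file `classification.tsv`):

    centrifuge-kreport -x abv classification.tsv > classification.kreport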
Inspecting the Centrifuge index
-----------------------
The index can be inspected with `centrifuge-inspect`. To extract raw sequences:

    centrifuge-inspect <centrifuge index>
Extract the sequence ID to taxonomy ID conversion table from the index:

    centrifuge-inspect --conversion-table <centrifuge index>
Extract the taxonomy tree from the index:

    centrifuge-inspect --taxonomy-tree <centrifuge index>
Extract the lengths of the sequences from the index (each row has two columns: taxonomic ID and length):

    centrifuge-inspect --size-table <centrifuge index>
Extract the names from the index (each row has two columns: taxonomic ID and name):

    centrifuge-inspect --name-table <centrifuge index>
Wrapper
-------
The `centrifuge`, `centrifuge-build` and `centrifuge-inspect` executables are actually
wrapper scripts that call binary programs as appropriate. Also, the `centrifuge` wrapper
provides some key functionality, like the ability to handle compressed inputs,
and the functionality for `--un`, `--al` and related options.
It is recommended that you always run the centrifuge wrappers and not run the
binaries directly.
Performance tuning
------------------
1. If your computer has multiple processors/cores, use `-p NTHREADS`
The `-p` option causes Centrifuge to launch a specified number of parallel
search threads. Each thread runs on a different processor/core and all
threads find alignments in parallel, increasing alignment throughput by
approximately a multiple of the number of threads (though in practice,
speedup is somewhat worse than linear).
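For example, to classify with eight threads (a sketch reusing the `abv` index from above; the read file name is a placeholder):

    centrifuge -p 8 -x abv -U reads.fq -S classification.tsv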
Command Line
------------
### Usage

    centrifuge [options]* -x <centrifuge-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [--report-file <report file> -S <classification output file>]
### Main arguments
-x <centrifuge-idx>
The basename of the index for the reference genomes. The basename is the name of
any of the index files up to but not including the final `.1.cf` / etc.
`centrifuge` looks for the specified index first in the current directory,
then in the directory specified in the `CENTRIFUGE_INDEXES` environment variable.
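For example (a sketch; `/data/cf-indexes` is a hypothetical directory holding `abv.1.cf`, `abv.2.cf`, and `abv.3.cf`):

    export CENTRIFUGE_INDEXES=/data/cf-indexes
    centrifuge -x abv -U reads.fq -S classification.tsv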
-1 <m1>
Comma-separated list of files containing mate 1s (filename usually includes
`_1`), e.g. `-1 flyA_1.fq,flyB_1.fq`. Sequences specified with this option must
correspond file-for-file and read-for-read with those specified in `<m2>`. Reads
may be a mix of different lengths. If `-` is specified, `centrifuge` will read the
mate 1s from the "standard in" or "stdin" filehandle.
-2 <m2>
Comma-separated list of files containing mate 2s (filename usually includes
`_2`), e.g. `-2 flyA_2.fq,flyB_2.fq`. Sequences specified with this option must
correspond file-for-file and read-for-read with those specified in `<m1>`. Reads
may be a mix of different lengths. If `-` is specified, `centrifuge` will read the
mate 2s from the "standard in" or "stdin" filehandle.
-U <r>
Comma-separated list of files containing unpaired reads to be aligned, e.g.
`lane1.fq,lane2.fq,lane3.fq,lane4.fq`. Reads may be a mix of different lengths.
If `-` is specified, `centrifuge` gets the reads from the "standard in" or "stdin"
filehandle.
--sra-acc <SRA accession number>
Comma-separated list of SRA accession numbers, e.g. `--sra-acc SRR353653,SRR353654`.
Information about read types is available at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?sp=runinfo&acc=sra-acc&retmode=xml,
where `sra-acc` is the SRA accession number. If you run Centrifuge on a computer cluster, it is recommended to disable SRA-related caching (see the instructions at [SRA-MANUAL]).
[SRA-MANUAL]: https://github.com/ncbi/sra-tools/wiki/Toolkit-Configuration
-S <filename>
File to write classification results to. By default, assignments are written to the
"standard out" or "stdout" filehandle (i.e. the console).
--report-file <filename>
File to write a classification summary to (default: centrifuge_report.tsv).
### Options
#### Input options
-q
Reads (specified with `<m1>`, `<m2>`, `<r>`) are FASTQ files. FASTQ files
usually have extension `.fq` or `.fastq`. FASTQ is the default format. See
also: `--solexa-quals` and `--int-quals`.
--qseq
Reads (specified with `<m1>`, `<m2>`, `<r>`) are QSEQ files. QSEQ files usually
end in `_qseq.txt`. See also: `--solexa-quals` and `--int-quals`.
-f
Reads (specified with `<m1>`, `<m2>`, `<r>`) are FASTA files. FASTA files
usually have extension `.fa`, `.fasta`, `.mfa`, `.fna` or similar. FASTA files
do not have a way of specifying quality values, so when `-f` is set, the result
is as if `--ignore-quals` is also set.
-r
Reads (specified with `<m1>`, `<m2>`, `<r>`) are files with one input sequence
per line, without any other information (no read names, no qualities). When
`-r` is set, the result is as if `--ignore-quals` is also set.
-c
The read sequences are given on the command line. I.e. `<m1>`, `<m2>` and
`<r>` are comma-separated lists of reads rather than lists of read files.
There is no way to specify read names or qualities, so `-c` also implies
`--ignore-quals`.
-s/--skip <int>
Skip (i.e. do not align) the first `<int>` reads or pairs in the input.
-u/--qupto <int>
Align the first `<int>` reads or read pairs from the input (after the
`-s`/`--skip` reads or pairs have been skipped), then stop. Default: no limit.
-5/--trim5 <int>
Trim `<int>` bases from 5' (left) end of each read before alignment (default: 0).
-3/--trim3 <int>
Trim `<int>` bases from 3' (right) end of each read before alignment (default: 0).
--phred33
Input qualities are ASCII chars equal to the [Phred quality] plus 33. This is
also called the "Phred+33" encoding, which is used by the very latest Illumina
pipelines.
[Phred quality]: http://en.wikipedia.org/wiki/Phred_quality_score
--phred64
Input qualities are ASCII chars equal to the [Phred quality] plus 64. This is
also called the "Phred+64" encoding.
--solexa-quals
Convert input qualities from [Solexa][Phred quality] (which can be negative) to
[Phred][Phred quality] (which can't). This scheme was used in older Illumina GA
Pipeline versions (prior to 1.3). Default: off.
--int-quals
Quality values are represented in the read input file as space-separated ASCII
integers, e.g., `40 40 30 40`..., rather than ASCII characters, e.g., `II?I`....
Integers are treated as being on the [Phred quality] scale unless
`--solexa-quals` is also specified. Default: off.
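As an illustration of the input options above, the following hypothetical command reads FASTA input, skips the first 100 reads, classifies at most the next 1,000, and trims 5 bases from each end before alignment (index and file names are placeholders):

    centrifuge -f -x abv -U reads.fa -s 100 -u 1000 -5 5 -3 5 -S classification.txt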
#### Classification
--min-hitlen
Minimum length of partial hits, which must be greater than 15 (default: 22).
-k
Searches for at most `` distinct, primary assignments for each read or pair.
Primary assignments are those whose assignment score is equal to or higher than that of any other assignment.
If there are more primary assignments than this value,
the search will merge some of the assignments into a higher taxonomic rank.
The assignment score for a paired-end assignment equals the sum of the assignment scores of the individual mates.
Default: 5.
--host-taxids
A comma-separated list of taxonomic IDs that will be preferred in the classification procedure.
The descendants of these IDs will also be preferred. If some of a read's assignments correspond to
these taxonomic IDs, only those assignments will be reported.
--exclude-taxids
A comma-separated list of taxonomic IDs that will be excluded from the classification procedure.
The descendants of these IDs will also be excluded.
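A sketch combining the classification options (index and file names are placeholders; 9606 and 10090 are the human and mouse taxonomy IDs used elsewhere in this manual):

    centrifuge -x abv -U reads.fq -k 1 --min-hitlen 30 --host-taxids 9606 --exclude-taxids 10090 -S classification.txt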
#### Alignment options
--n-ceil
Sets a function governing the maximum number of ambiguous characters (usually
`N`s and/or `.`s) allowed in a read as a function of read length. For instance,
specifying `-L,0,0.15` sets the N-ceiling function `f` to `f(x) = 0 + 0.15 * x`,
where x is the read length. See also: [setting function options]. Reads
exceeding this ceiling are [filtered out]. Default: `L,0,0.15`.
--ignore-quals
When calculating a mismatch penalty, always consider the quality value at the
mismatched position to be the highest possible, regardless of the actual value.
I.e. input is treated as though all quality values are high. This is also the
default behavior when the input doesn't specify quality values (e.g. in `-f`,
`-r`, or `-c` modes).
--nofw/--norc
If `--nofw` is specified, `centrifuge` will not attempt to align unpaired reads to
the forward (Watson) reference strand. If `--norc` is specified, `centrifuge` will
not attempt to align unpaired reads against the reverse-complement (Crick)
reference strand. In paired-end mode, `--nofw` and `--norc` pertain to the
fragments; i.e. specifying `--nofw` causes `centrifuge` to explore only those
paired-end configurations corresponding to fragments from the reverse-complement
(Crick) strand. Default: both strands enabled.
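For instance, a hypothetical run that considers only the reverse-complement strand and tightens the ambiguous-character ceiling might look like this (file names are placeholders):

    centrifuge -x abv -U reads.fq --nofw --n-ceil L,0,0.05 -S classification.txt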
#### Paired-end options
--fr/--rf/--ff
The upstream/downstream mate orientations for a valid paired-end alignment
against the forward reference strand. E.g., if `--fr` is specified and there is
a candidate paired-end alignment where mate 1 appears upstream of the reverse
complement of mate 2 and the fragment length constraints (`-I` and `-X`) are
met, that alignment is valid. Also, if mate 2 appears upstream of the reverse
complement of mate 1 and all other constraints are met, that too is valid.
`--rf` likewise requires that an upstream mate 1 be reverse-complemented and a
downstream mate 2 be forward-oriented. `--ff` requires both an upstream mate 1
and a downstream mate 2 to be forward-oriented. Default: `--fr` (appropriate
for Illumina's Paired-end Sequencing Assay).
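For example, a sketch of a run requiring forward/forward mate orientation (file names are placeholders):

    centrifuge -x abv -1 mates_1.fq -2 mates_2.fq --ff -S classification.txt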
#### Output options
-t/--time
Print the wall-clock time required to load the index files and align the reads.
This is printed to the "standard error" ("stderr") filehandle. Default: off.
--un
--un-gz
--un-bz2
Write unpaired reads that fail to align to file at ``. These reads
correspond to the SAM records with the FLAGS `0x4` bit set and neither the
`0x40` nor `0x80` bits set. If `--un-gz` is specified, output will be gzip
compressed. If `--un-bz2` is specified, output will be bzip2 compressed. Reads
written in this way will appear exactly as they did in the input file, without
any modification (same sequence, same name, same quality string, same quality
encoding). Reads will not necessarily appear in the same order as they did in
the input.
--al
--al-gz
--al-bz2
Write unpaired reads that align at least once to file at ``. These reads
correspond to the SAM records with the FLAGS `0x4`, `0x40`, and `0x80` bits
unset. If `--al-gz` is specified, output will be gzip compressed. If `--al-bz2`
is specified, output will be bzip2 compressed. Reads written in this way will
appear exactly as they did in the input file, without any modification (same
sequence, same name, same quality string, same quality encoding). Reads will
not necessarily appear in the same order as they did in the input.
--un-conc
--un-conc-gz
--un-conc-bz2
Write paired-end reads that fail to align concordantly to file(s) at ``.
These reads correspond to the SAM records with the FLAGS `0x4` bit set and
either the `0x40` or `0x80` bit set (depending on whether it's mate #1 or #2).
`.1` and `.2` strings are added to the filename to distinguish which file
contains mate #1 and mate #2. If a percent symbol, `%`, is used in ``,
the percent symbol is replaced with `1` or `2` to make the per-mate filenames.
Otherwise, `.1` or `.2` are added before the final dot in `` to make the
per-mate filenames. Reads written in this way will appear exactly as they did
in the input files, without any modification (same sequence, same name, same
quality string, same quality encoding). Reads will not necessarily appear in
the same order as they did in the inputs.
--al-conc
--al-conc-gz
--al-conc-bz2
Write paired-end reads that align concordantly at least once to file(s) at
``. These reads correspond to the SAM records with the FLAGS `0x4` bit
unset and either the `0x40` or `0x80` bit set (depending on whether it's mate #1
or #2). `.1` and `.2` strings are added to the filename to distinguish which
file contains mate #1 and mate #2. If a percent symbol, `%`, is used in
``, the percent symbol is replaced with `1` or `2` to make the per-mate
filenames. Otherwise, `.1` or `.2` are added before the final dot in `` to
make the per-mate filenames. Reads written in this way will appear exactly as
they did in the input files, without any modification (same sequence, same name,
same quality string, same quality encoding). Reads will not necessarily appear
in the same order as they did in the inputs.
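As a sketch of these options, the following hypothetical command writes pairs that fail to align concordantly to `un.1.fq.gz` and `un.2.fq.gz`, using the `%` placeholder described above (all file names are illustrative):

    centrifuge -x abv -1 reads_1.fq -2 reads_2.fq --un-conc-gz un.%.fq.gz -S classification.txt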
--quiet
Print nothing besides alignments and serious errors.
--met-file
Write `centrifuge` metrics to file ``. Having alignment metrics can be useful
for debugging certain problems, especially performance issues. See also:
`--met`. Default: metrics disabled.
--met-stderr
Write `centrifuge` metrics to the "standard error" ("stderr") filehandle. This is
not mutually exclusive with `--met-file`. Having alignment metrics can be
useful for debugging certain problems, especially performance issues. See also:
`--met`. Default: metrics disabled.
--met
Write a new `centrifuge` metrics record every `` seconds. Only matters if
either `--met-stderr` or `--met-file` are specified. Default: 1.
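For example, to print a metrics record to stderr every 10 seconds during a hypothetical run (file names are placeholders):

    centrifuge -x abv -U reads.fq --met-stderr --met 10 -S classification.txt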
#### Performance options
-o/--offrate
Override the offrate of the index with ``. If `` is greater
than the offrate used to build the index, then some row markings are
discarded when the index is read into memory. This reduces the memory
footprint of the aligner but requires more time to calculate text
offsets. `` must be greater than the value used to build the
index.
-p/--threads NTHREADS
Launch `NTHREADS` parallel search threads (default: 1). Threads will run on
separate processors/cores and synchronize when parsing reads and outputting
alignments. Searching for alignments is highly parallel, and speedup is close
to linear. Increasing `-p` increases Centrifuge's memory footprint. E.g. when
aligning to a human genome index, increasing `-p` from 1 to 8 increases the
memory footprint by a few hundred megabytes. This option is only available if
`centrifuge` is linked with the `pthreads` library (i.e. if `BOWTIE_PTHREADS=0` is
not specified at build time).
--reorder
Guarantees that output records are printed in an order corresponding to the
order of the reads in the original input file, even when `-p` is set greater
than 1. Specifying `--reorder` and setting `-p` greater than 1 causes Centrifuge
to run somewhat slower and use somewhat more memory than if `--reorder` were
not specified. Has no effect if `-p` is set to 1, since output order will
naturally correspond to input order in that case.
--mm
Use memory-mapped I/O to load the index, rather than typical file I/O.
Memory-mapping allows many concurrent `centrifuge` processes on the same computer to
share the same memory image of the index (i.e. you pay the memory overhead just
once). This facilitates memory-efficient parallelization of `centrifuge` in
situations where using `-p` is not possible or not preferable.
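A sketch combining the performance options above (thread count and file names are illustrative only):

    centrifuge -x abv -U reads.fq -p 8 --reorder --mm -S classification.txt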
#### Other options
--qc-filter
Filter out reads for which the QSEQ filter field is non-zero. Only has an
effect when read format is `--qseq`. Default: off.
--seed
Use `` as the seed for pseudo-random number generator. Default: 0.
--non-deterministic
Normally, Centrifuge re-initializes its pseudo-random generator for each read. It
seeds the generator with a number derived from (a) the read name, (b) the
nucleotide sequence, (c) the quality sequence, (d) the value of the `--seed`
option. This means that if two reads are identical (same name, same
nucleotides, same qualities) Centrifuge will find and report the same classification(s)
for both, even if there was ambiguity. When `--non-deterministic` is specified,
Centrifuge re-initializes its pseudo-random generator for each read using the
current time. This means that Centrifuge will not necessarily report the same
classification for two identical reads. This is counter-intuitive for some users,
but might be more appropriate in situations where the input consists of many
identical reads.
--version
Print version information and quit.
-h/--help
Print usage information and quit.
The `centrifuge-build` indexer
===========================
`centrifuge-build` builds a Centrifuge index from a set of DNA sequences.
`centrifuge-build` outputs a set of 3 files with suffixes `.1.cf`, `.2.cf`, and
`.3.cf`. These files together
constitute the index: they are all that is needed to align reads to that
reference. The original sequence FASTA files are no longer used by Centrifuge
once the index is built.
Use of Karkkainen's [blockwise algorithm] allows `centrifuge-build` to trade off
between running time and memory usage. `centrifuge-build` has two options
governing how it makes this trade: `--bmax`/`--bmaxdivn`,
and `--dcv`. By default, `centrifuge-build` will automatically search for the
settings that yield the best running time without exhausting memory. This
behavior can be disabled using the `-a`/`--noauto` option.
The indexer provides options pertaining to the "shape" of the index, e.g.
`--offrate` governs the fraction of [Burrows-Wheeler]
rows that are "marked" (i.e., the density of the suffix-array sample; see the
original [FM Index] paper for details). All of these options are potentially
profitable trade-offs depending on the application. They have been set to
defaults that are reasonable for most cases according to our experiments. See
[Performance tuning] for details.
The Centrifuge index is based on the [FM Index] of Ferragina and Manzini, which in
turn is based on the [Burrows-Wheeler] transform. The algorithm used to build
the index is based on the [blockwise algorithm] of Karkkainen.
[Blockwise algorithm]: http://portal.acm.org/citation.cfm?id=1314852
[Burrows-Wheeler]: http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
Command Line
------------
Usage:
centrifuge-build [options]* --conversion-table --taxonomy-tree --name-table
### Main arguments
A comma-separated list of FASTA files containing the reference sequences to be
aligned to, or, if `-c` is specified, the sequences
themselves. E.g., `` might be `chr1.fa,chr2.fa,chrX.fa,chrY.fa`,
or, if `-c` is specified, this might be
`GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA`.
The basename of the index files to write. By default, `centrifuge-build` writes
files named `NAME.1.cf`, `NAME.2.cf`, and `NAME.3.cf`, where `NAME` is ``.
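For example, a minimal build invocation might look like the following sketch (the FASTA and taxonomy file names are placeholders; their formats are described under the options below):

    centrifuge-build --conversion-table seqid2taxid.map --taxonomy-tree nodes.dmp --name-table names.dmp ref1.fa,ref2.fa my_index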
### Options
-f
The reference input files (specified as ``) are FASTA files
(usually having extension `.fa`, `.mfa`, `.fna` or similar).
-c
The reference sequences are given on the command line. I.e. `` is
a comma-separated list of sequences rather than a list of FASTA files.
-a/--noauto
Disable the default behavior whereby `centrifuge-build` automatically selects
values for the `--bmax`, `--dcv` and `--packed` parameters according to
available memory. Instead, user may specify values for those parameters. If
memory is exhausted during indexing, an error message will be printed; it is up
to the user to try new parameters.
-p/--threads
Launch `NTHREADS` parallel search threads (default: 1).
--conversion-table
List of UIDs (unique ID) and corresponding taxonomic IDs.
--taxonomy-tree
Taxonomic tree (e.g. nodes.dmp).
--name-table
Name table (e.g. names.dmp).
--size-table
List of taxonomic IDs and lengths of the sequences belonging to the same taxonomic IDs.
--bmax
The maximum number of suffixes allowed in a block. Allowing more suffixes per
block makes indexing faster, but increases peak memory usage. Setting this
option overrides any previous setting for `--bmax`, or `--bmaxdivn`.
Default (in terms of the `--bmaxdivn` parameter) is `--bmaxdivn` 4. This is
configured automatically by default; use `-a`/`--noauto` to configure manually.
--bmaxdivn
The maximum number of suffixes allowed in a block, expressed as a fraction of
the length of the reference. Setting this option overrides any previous setting
for `--bmax`, or `--bmaxdivn`. Default: `--bmaxdivn` 4. This is
configured automatically by default; use `-a`/`--noauto` to configure manually.
--dcv
Use `` as the period for the difference-cover sample. A larger period
yields less memory overhead, but may make suffix sorting slower, especially if
repeats are present. Must be a power of 2 no greater than 4096. Default: 1024.
This is configured automatically by default; use `-a`/`--noauto` to configure
manually.
--nodc
Disable use of the difference-cover sample. Suffix sorting becomes
quadratic-time in the worst case (where the worst case is an extremely
repetitive reference). Default: off.
-o/--offrate
To map alignments back to positions on the reference sequences, it's necessary
to annotate ("mark") some or all of the [Burrows-Wheeler] rows with their
corresponding location on the genome.
`-o`/`--offrate` governs how many rows get marked:
the indexer will mark every 2^`` rows. Marking more rows makes
reference-position lookups faster, but requires more memory to hold the
annotations at runtime. The default is 4 (every 16th row is marked; for human
genome, annotations occupy about 680 megabytes).
-t/--ftabchars
The ftab is the lookup table used to calculate an initial [Burrows-Wheeler]
range with respect to the first `` characters of the query. A larger
`` yields a larger lookup table but faster query times. The ftab has size
4^(``+1) bytes. The default setting is 10 (ftab is 4MB).
--seed
Use `` as the seed for pseudo-random number generator.
--kmer-count
Use `` as kmer-size for counting the distinct number of k-mers in the input sequences.
-q/--quiet
`centrifuge-build` is verbose by default. With this option `centrifuge-build` will
print only error messages.
-h/--help
Print usage information and quit.
--version
Print version information and quit.
The `centrifuge-inspect` index inspector
=====================================
`centrifuge-inspect` extracts information from a Centrifuge index about what kind of
index it is and what reference sequences were used to build it. When run without
any options, the tool will output a FASTA file containing the sequences of the
original references (with all non-`A`/`C`/`G`/`T` characters converted to `N`s).
It can also be used to extract just the reference sequence names using the
`-n`/`--names` option or a more verbose summary using the `-s`/`--summary`
option.
Command Line
------------
Usage:
centrifuge-inspect [options]*
### Main arguments
The basename of the index to be inspected. The basename is name of any of the
index files but with the `.X.cf` suffix omitted.
`centrifuge-inspect` first looks in the current directory for the index files, then
in the directory specified in the `CENTRIFUGE_INDEXES` environment variable.
### Options
-a/--across
When printing FASTA output, output a newline character every `` bases
(default: 60).
-n/--names
Print reference sequence names, one per line, and quit.
-s/--summary
Print a summary that includes information about index settings, as well as the
names and lengths of the input sequences. The summary has this format:
Colorspace <0 or 1>
SA-Sample 1 in
FTab-Chars
Sequence-1
Sequence-2
...
Sequence-N
Fields are separated by tabs. Colorspace is always set to 0 for Centrifuge.
--conversion-table
Print a list of UIDs (unique ID) and corresponding taxonomic IDs.
--taxonomy-tree
Print taxonomic tree.
--name-table
Print name table.
--size-table
Print a list of taxonomic IDs and lengths of the sequences belonging to the same taxonomic IDs.
-v/--verbose
Print verbose output (for debugging).
--version
Print version information and quit.
-h/--help
Print usage information and quit.
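For example, assuming an index with basename `test` (such as the one built in the next section), the summary and the reference sequence names can be printed with:

    centrifuge-inspect -s test
    centrifuge-inspect -n test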
Getting started with Centrifuge
===================================================
Centrifuge comes with some example files to get you started. The example files
are not scientifically significant; these files will simply let you start running Centrifuge and
downstream tools right away.
First follow the manual instructions to [obtain Centrifuge]. Set the `CENTRIFUGE_HOME`
environment variable to point to the new Centrifuge directory containing the
`centrifuge`, `centrifuge-build` and `centrifuge-inspect` binaries. This is important,
as the `CENTRIFUGE_HOME` variable is used in the commands below to refer to that
directory.
Indexing a reference genome
---------------------------
To create an index for two small sequences included with Centrifuge, create a new temporary directory (it doesn't matter where), change into that directory, and run:
$CENTRIFUGE_HOME/centrifuge-build --conversion-table $CENTRIFUGE_HOME/example/reference/gi_to_tid.dmp --taxonomy-tree $CENTRIFUGE_HOME/example/reference/nodes.dmp --name-table $CENTRIFUGE_HOME/example/reference/names.dmp $CENTRIFUGE_HOME/example/reference/test.fa test
The command should print many lines of output then quit. When the command
completes, the current directory will contain three new files that all start with
`test` and end with `.1.cf`, `.2.cf`, `.3.cf`. These files constitute the index - you're done!
You can use `centrifuge-build` to create an index for a set of FASTA files obtained
from any source, including sites such as [UCSC], [NCBI], and [Ensembl]. When
indexing multiple FASTA files, specify all the files using commas to separate
file names. For more details on how to create an index with `centrifuge-build`,
see the [manual section on index building]. You may also want to bypass this
process by obtaining a pre-built index.
[UCSC]: http://genome.ucsc.edu/cgi-bin/hgGateway
[NCBI]: http://www.ncbi.nlm.nih.gov/sites/genome
[Ensembl]: http://www.ensembl.org/
Classifying example reads
----------------------
Stay in the directory created in the previous step, which now contains the
`test` index files. Next, run:
$CENTRIFUGE_HOME/centrifuge -f -x test $CENTRIFUGE_HOME/example/reads/input.fa
This runs the Centrifuge classifier, which classifies a set of unpaired reads against
the genomes using the index generated in the previous step.
The classification results are reported to stdout, and a
short classification summary is written to centrifuge_report.tsv.
You will see something like this:
readID seqID taxID score 2ndBestScore hitLength numMatches
C_1 gi|7 9913 4225 4225 80 2
C_1 gi|4 9646 4225 4225 80 2
C_2 gi|4 9646 4225 4225 80 2
C_2 gi|7 9913 4225 4225 80 2
C_3 gi|7 9913 4225 4225 80 2
C_3 gi|4 9646 4225 4225 80 2
C_4 gi|4 9646 4225 4225 80 2
C_4 gi|7 9913 4225 4225 80 2
1_1 gi|4 9646 4225 0 80 1
1_2 gi|4 9646 4225 0 80 1
2_1 gi|7 9913 4225 0 80 1
2_2 gi|7 9913 4225 0 80 1
2_3 gi|7 9913 4225 0 80 1
2_4 gi|7 9913 4225 0 80 1
2_5 gi|7 9913 4225 0 80 1
2_6 gi|7 9913 4225 0 80 1
centrifuge-f39767eb57d8e175029c/MANUAL.markdown 0000664 0000000 0000000 00000145366 13021605047 0020531 0 ustar 00root root 0000000 0000000
Introduction
============
What is Centrifuge?
-----------------
[Centrifuge] is a novel microbial classification engine that enables
rapid, accurate, and sensitive labeling of reads and quantification of
species on desktop computers. The system uses a novel indexing scheme
based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini
(FM) index, optimized specifically for the metagenomic classification
problem. Centrifuge requires a relatively small index (5.8 GB for all
complete bacterial and viral genomes plus the human genome) and
classifies sequences at a very high speed, allowing it to process the
millions of reads from a typical high-throughput DNA sequencing run
within a few minutes. Together these advances enable timely and
accurate analysis of large metagenomics data sets on conventional
desktop computers.
[Centrifuge]: http://www.ccb.jhu.edu/software/centrifuge
[Burrows-Wheeler Transform]: http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
[FM Index]: http://en.wikipedia.org/wiki/FM-index
[GPLv3 license]: http://www.gnu.org/licenses/gpl-3.0.html
Obtaining Centrifuge
==================
Download Centrifuge and binaries from the Releases section on the right side.
Binaries are available for Intel architectures (`x86_64`) running Linux, and Mac OS X.
Building from source
--------------------
Building Centrifuge from source requires a GNU-like environment with GCC, GNU Make
and other basics. It should be possible to build Centrifuge on most vanilla Linux
installations or on a Mac installation with [Xcode] installed. Centrifuge can
also be built on Windows using [Cygwin] or [MinGW] (MinGW recommended). For a
MinGW build, the choice of compiler is important, since it determines whether
32-bit or 64-bit code can be compiled. If both 32-bit and 64-bit builds are
needed on the same machine, a multilib MinGW has to be properly installed.
[MSYS], the [zlib] library, and, depending on the architecture, the [pthreads]
library are also required. We recommend a 64-bit build, since it has clear
advantages for real-life research problems. To simplify the MinGW setup, it may
be worth investigating popular MinGW personal builds, since these come already
prepared with most of the needed toolchains.
First, download the [source package] from the Releases section on the right side.
Unzip the file, change to the unzipped directory, and build the
Centrifuge tools by running GNU `make` (usually with the command `make`, but
sometimes with `gmake`) with no arguments. If building with MinGW, run `make`
from the MSYS environment.
Centrifuge uses a multithreaded software model to speed up execution on SMP
architectures where possible. On POSIX platforms (Linux, Mac OS, etc.) it needs
the pthread library. Although the pthread library can also be used on non-POSIX
platforms like Windows, for performance reasons Centrifuge will try to use
Windows native multithreading if possible.
For the support of SRA data access in Centrifuge, please download and install the [NCBI-NGS] toolkit.
When running `make`, specify additional variables as follows:
`make USE_SRA=1 NCBI_NGS_DIR=/path/to/NCBI-NGS-directory NCBI_VDB_DIR=/path/to/NCBI-NGS-directory`,
where `NCBI_NGS_DIR` and `NCBI_VDB_DIR` will be used in Makefile for -I and -L compilation options.
For example, $(NCBI_NGS_DIR)/include and $(NCBI_NGS_DIR)/lib64 will be used.
[Cygwin]: http://www.cygwin.com/
[MinGW]: http://www.mingw.org/
[MSYS]: http://www.mingw.org/wiki/msys
[zlib]: http://cygwin.com/packages/mingw-zlib/
[pthreads]: http://sourceware.org/pthreads-win32/
[GnuWin32]: http://gnuwin32.sf.net/packages/coreutils.htm
[Xcode]: http://developer.apple.com/xcode/
[Github site]: https://github.com/infphilo/centrifuge
[NCBI-NGS]: https://github.com/ncbi/ngs/wiki/Downloads
Running Centrifuge
=============
Adding to PATH
--------------
By adding your new Centrifuge directory to your [PATH environment variable], you
ensure that whenever you run `centrifuge`, `centrifuge-build`, `centrifuge-download` or `centrifuge-inspect`
from the command line, you will get the version you just installed without
having to specify the entire path. This is recommended for most users. To do
this, follow your operating system's instructions for adding the directory to
your [PATH].
If you would like to install Centrifuge by copying the Centrifuge executable files
to an existing directory in your [PATH], make sure that you copy all the
executables, including `centrifuge`, `centrifuge-class`, `centrifuge-build`, `centrifuge-build-bin`, `centrifuge-download`, `centrifuge-inspect`
and `centrifuge-inspect-bin`. Furthermore, you need the programs
in the `scripts/` folder if you opt for genome compression in the database construction.
[PATH environment variable]: http://en.wikipedia.org/wiki/PATH_(variable)
[PATH]: http://en.wikipedia.org/wiki/PATH_(variable)
Before running Centrifuge
-----------------
Classification is considerably different from alignment in that classification is performed on a large set of genomes, as opposed to just one reference genome as in alignment. Currently, an enormous number of complete genomes are available at GenBank (e.g. >4,000 bacterial genomes, >10,000 viral genomes, …). These genomes are organized in a taxonomic tree where each genome is located at the bottom of the tree, at the strain or subspecies level. On the taxonomic tree, genomes have ancestors usually situated at the species level, and those ancestors also have ancestors at the genus level, and so on up through the family, order, class, phylum, and kingdom levels, finally reaching the root level.
Given the gigantic number of genomes available, which continues to expand at a rapid rate, and the development of the taxonomic tree, which continues to evolve with new advancements in research, we have designed Centrifuge to be flexible and general enough to reflect this huge database. We provide several standard indexes that will meet most users’ needs (see the side panel - Indexes). In our approach, indexes not only include raw genome sequences, but also genome names/sizes and taxonomic trees. This enables users to perform additional analyses on Centrifuge’s classification output without the need to download extra database sources. It also eliminates the potential issue of discrepancies between the indexes we provide and the databases users may otherwise download. We plan to provide a couple of additional standard indexes in the near future, and to update the indexes on a regular basis.
We encourage first time users to take a look at and follow a [`small example`] that illustrates how to build an index, how to run Centrifuge using the index, how to interpret the classification results, and how to extract additional genomic information from the index. For those who choose to build customized indexes, please take a close look at the following description.
Database download and index building
-----------------
Centrifuge indexes can be built with arbitrary sequences. Standard choices are
all of the complete bacterial and viral genomes, or the sequences that are part
of the BLAST nt database. The sections below show how to download all of the
complete archaeal, bacterial, and viral genomes from RefSeq, and how to build an
index from them.
To map sequence identifiers to taxonomy IDs, and taxonomy IDs to names and
their parents, three files are necessary in addition to the sequence files:
- taxonomy tree: typically nodes.dmp from the NCBI taxonomy dump. Links taxonomy IDs to their parents
- names file: typically names.dmp from the NCBI taxonomy dump. Links taxonomy IDs to their scientific name
- a tab-separated sequence ID to taxonomy ID mapping
When using the provided scripts to download the genomes, these files are automatically downloaded or generated.
When using a custom taxonomy or sequence files, please refer to the section `TODO` to learn more about their format.
### Building index on all complete bacterial and viral genomes
Use `centrifuge-download` to download genomes from NCBI. The following two commands download
the NCBI taxonomy to `taxonomy/` in the current directory, and all complete archaeal,
bacterial and viral genomes to `library/`. Low-complexity regions in the genomes are masked after
download (parameter `-m`) using blast+'s `dustmasker`. `centrifuge-download` outputs tab-separated
sequence ID to taxonomy ID mappings to standard out, which are required by `centrifuge-build`.
centrifuge-download -o taxonomy taxonomy
centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map
To build the index, first concatenate all downloaded sequences into a single file, and then
run `centrifuge-build`:
cat library/*/*.fna > input-sequences.fna
## build centrifuge index with 4 threads
centrifuge-build -p 4 --conversion-table seqid2taxid.map \
--taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
input-sequences.fna abv
After building the index, all files except the `*.[123].cf` index files may be
removed; i.e. the files in the `library/` and `taxonomy/` directories are no
longer needed. If you also want to include the human and/or the mouse genome,
add their sequences to the library folder before building the index, using one
of the commands in the next section.
### Adding human or mouse genome to the index
The human and mouse genomes can also be downloaded using `centrifuge-download`. They are in the
domain "vertebrate_mammalian" (argument `-d`), are assembled at the chromosome level (argument `-a`)
and categorized as reference genomes by RefSeq (`-c`). The argument `-t` takes a comma-separated
list of taxonomy IDs - e.g. `9606` for human and `10090` for mouse:
# download mouse and human reference genomes
centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 9606,10090 -c 'reference genome' >> seqid2taxid.map
# only human
centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 9606 -c 'reference genome' >> seqid2taxid.map
# only mouse
centrifuge-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 10090 -c 'reference genome' >> seqid2taxid.map
### nt database
NCBI BLAST's nt database contains all spliced non-redundant coding
sequences from multiple databases, inferred from genomic
sequences. Traditionally used with BLAST, a download of the FASTA file is
provided on the NCBI homepage. Building an index with any database
requires the user to create a sequence ID to taxonomy ID map, which
can be generated from a GI taxid dump:
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
gunzip nt.gz && mv -v nt nt.fa
# Get mapping file
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
gunzip -c gi_taxid_nucl.dmp.gz | sed 's/^/gi|/' > gi_taxid_nucl.map
# build index using 16 cores and a small bucket size, which will require less memory
centrifuge-build -p 16 --bmax 1342177280 --conversion-table gi_taxid_nucl.map \
--taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
nt.fa nt
### Custom database
TODO: Add toy example for nodes.dmp, names.dmp and seqid2taxid.map
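In the meantime, here is a minimal illustrative sketch of the three files (all IDs and names below are made up; see the NCBI taxonomy dump for the authoritative formats). A `seqid2taxid.map` is tab-separated, with one sequence ID and one taxonomy ID per line:

    seq1	11
    seq2	12

A matching toy `nodes.dmp` (taxonomy ID, parent taxonomy ID, rank; fields separated by `\t|\t` as in the NCBI dump):

    1	|	1	|	no rank	|
    10	|	1	|	genus	|
    11	|	10	|	species	|
    12	|	10	|	species	|

And a toy `names.dmp` (taxonomy ID, name, unique name, name class):

    1	|	root	|		|	scientific name	|
    10	|	Examplea	|		|	scientific name	|
    11	|	Examplea alpha	|		|	scientific name	|
    12	|	Examplea beta	|		|	scientific name	|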
### Centrifuge classification output
The following example shows classification assignments for a read. The assignment output has 8 columns.
readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
1_1 gi|4 9646 4225 0 80 80 1
The first column is the read ID from a raw sequencing read (e.g., 1_1 in the example).
The second column is the sequence ID of the genomic sequence, where the read is classified (e.g., gi|4).
The third column is the taxonomic ID of the genomic sequence in the second column (e.g., 9646).
The fourth column is the score for the classification, which is the weighted sum of hits (e.g., 4225).
The fifth column is the score for the next best classification (e.g., 0).
The sixth column is the approximate number of base pairs of the read that match the genomic sequence (e.g., 80).
The seventh column is the length of the read, or the combined length of the mate pairs (e.g., 80).
The eighth column is the number of classifications for this read, indicating how many assignments were made (e.g., 1).
### Centrifuge summary output (the default filename is centrifuge_report.tsv)
The following example shows a classification summary for each genome or taxonomic unit. The assignment output has 7 columns.
name taxID taxRank genomeSize numReads numUniqueReads abundance
Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis 36870 leaf 703004 5981 5964 0.0152317
The first column is the name of a genome, or the name corresponding to a taxonomic ID (the second column) at a rank higher than the strain (e.g., Wigglesworthia glossinidia endosymbiont of Glossina brevipalpis).
The second column is the taxonomic ID (e.g., 36870).
The third column is the taxonomic rank (e.g., leaf).
The fourth column is the length of the genome sequence (e.g., 703004).
The fifth column is the number of reads classified to this genomic sequence including multi-classified reads (e.g., 5981).
The sixth column is the number of reads uniquely classified to this genomic sequence (e.g., 5964).
The seventh column is the proportion of this genome normalized by its genomic length (e.g., 0.0152317).
As the GenBank database is incomplete (i.e., many more genomes remain to be identified and added), and reads have sequencing errors, classification programs including Centrifuge often report many false assignments. In order to perform more conservative analyses, users may want to discard assignments for reads whose matching length (6th column in the output of Centrifuge) is 40% of the read length or lower. It may also be helpful to use the score (4th column) for filtering out some assignments. Our future research plans include developing methods that estimate confidence scores for assignments.
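For example, one hypothetical post-processing step along these lines keeps only assignments whose matching length exceeds 40% of the query length, using the hitLength and queryLength columns (6 and 7) of the classification output (file names are placeholders):

    awk -F'\t' 'NR==1 || $6/$7 > 0.4' classification.txt > classification.filtered.txt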
### Kraken-style report
`centrifuge-kreport` can be used to make a Kraken-style report from the Centrifuge output including taxonomy information:
`centrifuge-kreport -x `
Inspecting the Centrifuge index
-----------------------
The index can be inspected with `centrifuge-inspect`. To extract raw sequences:
centrifuge-inspect
Extract the sequence ID to taxonomy ID conversion table from the index
centrifuge-inspect --conversion-table
Extract the taxonomy tree from the index:
centrifuge-inspect --taxonomy-tree
Extract the lengths of the sequences from the index (each row has two columns: taxonomic ID and length):
centrifuge-inspect --size-table
Extract the names from the index (each row has two columns: taxonomic ID and name):
centrifuge-inspect --name-table
Wrapper
-------
The `centrifuge`, `centrifuge-build` and `centrifuge-inspect` executables are actually
wrapper scripts that call binary programs as appropriate. Also, the `centrifuge` wrapper
provides some key functionality, like the ability to handle compressed inputs,
and the functionality for [`--un`], [`--al`] and related options.
It is recommended that you always run the centrifuge wrappers and not run the
binaries directly.
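For instance, because the wrapper handles compressed inputs, gzipped reads can be passed directly, as in this sketch (file names are placeholders):

    centrifuge -x abv -U reads.fq.gz -S classification.txt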
Performance tuning
------------------
1. If your computer has multiple processors/cores, use `-p NTHREADS`
The [`-p`] option causes Centrifuge to launch a specified number of parallel
search threads. Each thread runs on a different processor/core and all
threads find alignments in parallel, increasing alignment throughput by
approximately a multiple of the number of threads (though in practice,
speedup is somewhat worse than linear).
Command Line
------------
### Usage
centrifuge [options]* -x {-1 -2 | -U | --sra-acc } [--report-file -S ]
### Main arguments
[`-x`]: #centrifuge-options-x
-x
|
The basename of the index for the reference genomes. The basename is the name of
any of the index files up to but not including the final `.1.cf` / etc.
`centrifuge` looks for the specified index first in the current directory,
then in the directory specified in the `CENTRIFUGE_INDEXES` environment variable.
|
[`-1`]: #centrifuge-options-1
-1
|
Comma-separated list of files containing mate 1s (filename usually includes
`_1`), e.g. `-1 flyA_1.fq,flyB_1.fq`. Sequences specified with this option must
correspond file-for-file and read-for-read with those specified in ``. Reads
may be a mix of different lengths. If `-` is specified, `centrifuge` will read the
mate 1s from the "standard in" or "stdin" filehandle.
|
[`-2`]: #centrifuge-options-2
-2
|
Comma-separated list of files containing mate 2s (filename usually includes
`_2`), e.g. `-2 flyA_2.fq,flyB_2.fq`. Sequences specified with this option must
correspond file-for-file and read-for-read with those specified in ``. Reads
may be a mix of different lengths. If `-` is specified, `centrifuge` will read the
mate 2s from the "standard in" or "stdin" filehandle.
|
[`-U`]: #centrifuge-options-U
-U
|
Comma-separated list of files containing unpaired reads to be aligned, e.g.
`lane1.fq,lane2.fq,lane3.fq,lane4.fq`. Reads may be a mix of different lengths.
If `-` is specified, `centrifuge` gets the reads from the "standard in" or "stdin"
filehandle.
|
[`--sra-acc`]: #centrifuge-options-sra-acc
--sra-acc
|
Comma-separated list of SRA accession numbers, e.g. `--sra-acc SRR353653,SRR353654`.
Information about read types is available at http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?sp=runinfo&acc=sra-acc&retmode=xml,
where `sra-acc` is an SRA accession number. If users run Centrifuge on a computer cluster, it is recommended to disable SRA-related caching (see the instructions at [SRA-MANUAL]).
[SRA-MANUAL]: https://github.com/ncbi/sra-tools/wiki/Toolkit-Configuration
|
[`-S`]: #centrifuge-options-S
-S
|
File to write classification results to. By default, assignments are written to the
"standard out" or "stdout" filehandle (i.e. the console).
|
[`--report-file`]: #centrifuge-options-report-file
--report-file
|
File to write a classification summary to (default: centrifuge_report.tsv).
|
### Options
#### Input options
[`-q`]: #centrifuge-options-q
-q
|
Reads (specified with ``, ``, ``) are FASTQ files. FASTQ files
usually have extension `.fq` or `.fastq`. FASTQ is the default format. See
also: [`--solexa-quals`] and [`--int-quals`].
|
[`--qseq`]: #centrifuge-options-qseq
--qseq
|
Reads (specified with ``, ``, ``) are QSEQ files. QSEQ files usually
end in `_qseq.txt`. See also: [`--solexa-quals`] and [`--int-quals`].
|
[`-f`]: #centrifuge-options-f
-f
|
Reads (specified with ``, ``, ``) are FASTA files. FASTA files
usually have extension `.fa`, `.fasta`, `.mfa`, `.fna` or similar. FASTA files
do not have a way of specifying quality values, so when `-f` is set, the result
is as if `--ignore-quals` is also set.
|
[`-r`]: #centrifuge-options-r
-r
|
Reads (specified with ``, ``, ``) are files with one input sequence
per line, without any other information (no read names, no qualities). When
`-r` is set, the result is as if `--ignore-quals` is also set.
|
[`-c`]: #centrifuge-options-c
-c
|
The read sequences are given on command line. I.e. ``, `` and
`` are comma-separated lists of reads rather than lists of read files.
There is no way to specify read names or qualities, so `-c` also implies
`--ignore-quals`.
|
[`-s`/`--skip`]: #centrifuge-options-s
[`-s`]: #centrifuge-options-s
-s/--skip
|
Skip (i.e. do not align) the first `` reads or pairs in the input.
|
[`-u`/`--qupto`]: #centrifuge-options-u
[`-u`]: #centrifuge-options-u
-u/--qupto
|
Align the first `` reads or read pairs from the input (after the
[`-s`/`--skip`] reads or pairs have been skipped), then stop. Default: no limit.
|
[`-5`/`--trim5`]: #centrifuge-options-5
[`-5`]: #centrifuge-options-5
-5/--trim5
|
Trim `` bases from 5' (left) end of each read before alignment (default: 0).
|
[`-3`/`--trim3`]: #centrifuge-options-3
[`-3`]: #centrifuge-options-3
-3/--trim3
|
Trim `` bases from 3' (right) end of each read before alignment (default:
0).
|
[`--phred33`]: #centrifuge-options-phred33-quals
--phred33
|
Input qualities are ASCII chars equal to the [Phred quality] plus 33. This is
also called the "Phred+33" encoding, which is used by the very latest Illumina
pipelines.
[Phred quality]: http://en.wikipedia.org/wiki/Phred_quality_score
|
[`--phred64`]: #centrifuge-options-phred64-quals
--phred64
|
Input qualities are ASCII chars equal to the [Phred quality] plus 64. This is
also called the "Phred+64" encoding.
|
[`--solexa-quals`]: #centrifuge-options-solexa-quals
--solexa-quals
|
Convert input qualities from [Solexa][Phred quality] (which can be negative) to
[Phred][Phred quality] (which can't). This scheme was used in older Illumina GA
Pipeline versions (prior to 1.3). Default: off.
|
[`--int-quals`]: #centrifuge-options-int-quals
--int-quals
|
Quality values are represented in the read input file as space-separated ASCII
integers, e.g., `40 40 30 40`..., rather than ASCII characters, e.g., `II?I`....
Integers are treated as being on the [Phred quality] scale unless
[`--solexa-quals`] is also specified. Default: off.
|
#### Classification
[`--min-hitlen`]: #centrifuge-options-min-hitlen
--min-hitlen
|
Minimum length of partial hits, which must be greater than 15 (default: 22).
|
[`-k`]: #centrifuge-options-k
-k
|
Searches for at most `` distinct, primary assignments for each read or pair.
Primary assignments are those whose assignment score is equal to or higher than that of any other assignment.
If there are more primary assignments than this value,
the search will merge some of the assignments into a higher taxonomic rank.
The assignment score for a paired-end assignment equals the sum of the assignment scores of the individual mates.
Default: 5.
|
[`--host-taxids`]: #centrifuge-options-host-taxids
--host-taxids
|
A comma-separated list of taxonomic IDs that will be preferred in the classification procedure.
The descendants of these IDs will also be preferred. If some of a read's assignments correspond to
these taxonomic IDs, only those assignments will be reported.
|
[`--exclude-taxids`]: #centrifuge-options-exclude-taxids
--exclude-taxids
|
A comma-separated list of taxonomic IDs that will be excluded from the classification procedure.
The descendants of these IDs will also be excluded.
|
#### Output options
[`-t`/`--time`]: #centrifuge-options-t
[`-t`]: #centrifuge-options-t
-t/--time
|
Print the wall-clock time required to load the index files and align the reads.
This is printed to the "standard error" ("stderr") filehandle. Default: off.
|
[`--quiet`]: #centrifuge-options-quiet
--quiet
|
Print nothing besides alignments and serious errors.
|
[`--met-file`]: #centrifuge-options-met-file
--met-file
|
Write `centrifuge` metrics to file ``. Having alignment metrics can be useful
for debugging certain problems, especially performance issues. See also:
[`--met`]. Default: metrics disabled.
|
[`--met-stderr`]: #centrifuge-options-met-stderr
--met-stderr
|
Write `centrifuge` metrics to the "standard error" ("stderr") filehandle. This is
not mutually exclusive with [`--met-file`]. Having alignment metrics can be
useful for debugging certain problems, especially performance issues. See also:
[`--met`]. Default: metrics disabled.
|
[`--met`]: #centrifuge-options-met
--met
|
Write a new `centrifuge` metrics record every `` seconds. Only matters if
either [`--met-stderr`] or [`--met-file`] are specified. Default: 1.
|
#### Performance options
[`-o`/`--offrate`]: #centrifuge-options-o
[`-o`]: #centrifuge-options-o
[`--offrate`]: #centrifuge-options-o
-o/--offrate
|
Override the offrate of the index with ``. If `` is greater
than the offrate used to build the index, then some row markings are
discarded when the index is read into memory. This reduces the memory
footprint of the aligner but requires more time to calculate text
offsets. `` must be greater than the value used to build the
index.
|
[`-p`/`--threads`]: #centrifuge-options-p
[`-p`]: #centrifuge-options-p
-p/--threads NTHREADS
|
Launch `NTHREADS` parallel search threads (default: 1). Threads will run on
separate processors/cores and synchronize when parsing reads and outputting
alignments. Searching for alignments is highly parallel, and speedup is close
to linear. Increasing `-p` increases Centrifuge's memory footprint. E.g. when
aligning to a human genome index, increasing `-p` from 1 to 8 increases the
memory footprint by a few hundred megabytes. This option is only available if
`centrifuge` is linked with the `pthreads` library (i.e. if `BOWTIE_PTHREADS=0` is
not specified at build time).
|
[`--reorder`]: #centrifuge-options-reorder
--reorder
|
Guarantees that output records are printed in an order corresponding to the
order of the reads in the original input file, even when [`-p`] is set greater
than 1. Specifying `--reorder` and setting [`-p`] greater than 1 causes Centrifuge
to run somewhat slower and use somewhat more memory than if `--reorder` were
not specified. Has no effect if [`-p`] is set to 1, since output order will
naturally correspond to input order in that case.
|
[`--mm`]: #centrifuge-options-mm
--mm
|
Use memory-mapped I/O to load the index, rather than typical file I/O.
Memory-mapping allows many concurrent `centrifuge` processes on the same computer to
share the same memory image of the index (i.e. you pay the memory overhead just
once). This facilitates memory-efficient parallelization of `centrifuge` in
situations where using [`-p`] is not possible or not preferable.
|
#### Other options
[`--qc-filter`]: #centrifuge-options-qc-filter
--qc-filter
|
Filter out reads for which the QSEQ filter field is non-zero. Only has an
effect when read format is [`--qseq`]. Default: off.
|
[`--seed`]: #centrifuge-options-seed
--seed
|
Use `` as the seed for pseudo-random number generator. Default: 0.
|
[`--non-deterministic`]: #centrifuge-options-non-deterministic
--non-deterministic
|
Normally, Centrifuge re-initializes its pseudo-random generator for each read. It
seeds the generator with a number derived from (a) the read name, (b) the
nucleotide sequence, (c) the quality sequence, (d) the value of the [`--seed`]
option. This means that if two reads are identical (same name, same
nucleotides, same qualities) Centrifuge will find and report the same classification(s)
for both, even if there was ambiguity. When `--non-deterministic` is specified,
Centrifuge re-initializes its pseudo-random generator for each read using the
current time. This means that Centrifuge will not necessarily report the same
classification for two identical reads. This is counter-intuitive for some users,
but might be more appropriate in situations where the input consists of many
identical reads.
|
[`--version`]: #centrifuge-options-version
--version
|
Print version information and quit.
|
-h/--help
|
Print usage information and quit.
|
The `centrifuge-build` indexer
===========================
`centrifuge-build` builds a Centrifuge index from a set of DNA sequences.
`centrifuge-build` outputs a set of 3 files with suffixes `.1.cf`, `.2.cf`, and
`.3.cf`. These files together
constitute the index: they are all that is needed to align reads to that
reference. The original sequence FASTA files are no longer used by Centrifuge
once the index is built.
Use of Karkkainen's [blockwise algorithm] allows `centrifuge-build` to trade off
between running time and memory usage. `centrifuge-build` has two options
governing how it makes this trade: [`--bmax`]/[`--bmaxdivn`],
and [`--dcv`]. By default, `centrifuge-build` will automatically search for the
settings that yield the best running time without exhausting memory. This
behavior can be disabled using the [`-a`/`--noauto`] option.
The indexer provides options pertaining to the "shape" of the index, e.g.
[`--offrate`](#centrifuge-build-options-o) governs the fraction of [Burrows-Wheeler]
rows that are "marked" (i.e., the density of the suffix-array sample; see the
original [FM Index] paper for details). All of these options are potentially
profitable trade-offs depending on the application. They have been set to
defaults that are reasonable for most cases according to our experiments. See
[Performance tuning] for details.
The Centrifuge index is based on the [FM Index] of Ferragina and Manzini, which in
turn is based on the [Burrows-Wheeler] transform. The algorithm used to build
the index is based on the [blockwise algorithm] of Karkkainen.
[Blockwise algorithm]: http://portal.acm.org/citation.cfm?id=1314852
[Burrows-Wheeler]: http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
[Performance tuning]: #performance-tuning
Command Line
------------
Usage:
centrifuge-build [options]* --conversion-table --taxonomy-tree --name-table
### Main arguments
|
A comma-separated list of FASTA files containing the reference sequences to be
aligned to, or, if [`-c`](#centrifuge-build-options-c) is specified, the sequences
themselves. E.g., `` might be `chr1.fa,chr2.fa,chrX.fa,chrY.fa`,
or, if [`-c`](#centrifuge-build-options-c) is specified, this might be
`GGTCATCCT,ACGGGTCGT,CCGTTCTATGCGGCTTA`.
|
|
The basename of the index files to write. By default, `centrifuge-build` writes
files named `NAME.1.cf`, `NAME.2.cf`, and `NAME.3.cf`, where `NAME` is ``.
|
### Options
-f
|
The reference input files (specified as ``) are FASTA files
(usually having extension `.fa`, `.mfa`, `.fna` or similar).
|
-c
|
The reference sequences are given on the command line. I.e. `` is
a comma-separated list of sequences rather than a list of FASTA files.
|
[`-a`/`--noauto`]: #centrifuge-build-options-a
-a/--noauto
|
Disable the default behavior whereby `centrifuge-build` automatically selects
values for the [`--bmax`], [`--dcv`] and [`--packed`] parameters according to
available memory. Instead, user may specify values for those parameters. If
memory is exhausted during indexing, an error message will be printed; it is up
to the user to try new parameters.
|
[`-p`]: #centrifuge-build-options-p
-p/--threads
|
Launch `NTHREADS` parallel search threads (default: 1).
|
[`--conversion-table`]: #centrifuge-build-options-conversion-table
--conversion-table
|
List of UIDs (unique ID) and corresponding taxonomic IDs.
|
[`--taxonomy-tree`]: #centrifuge-build-options-taxonomy-tree
--taxonomy-tree
|
Taxonomic tree (e.g. nodes.dmp).
|
[`--name-table`]: #centrifuge-build-options-name-table
--name-table
|
Name table (e.g. names.dmp).
|
[`--size-table`]: #centrifuge-build-options-size-table
--size-table
|
List of taxonomic IDs and lengths of the sequences belonging to the same taxonomic IDs.
|
[`--bmax`]: #centrifuge-build-options-bmax
--bmax
|
The maximum number of suffixes allowed in a block. Allowing more suffixes per
block makes indexing faster, but increases peak memory usage. Setting this
option overrides any previous setting for [`--bmax`], or [`--bmaxdivn`].
Default (in terms of the [`--bmaxdivn`] parameter) is [`--bmaxdivn`] 4. This is
configured automatically by default; use [`-a`/`--noauto`] to configure manually.
|
[`--bmaxdivn`]: #centrifuge-build-options-bmaxdivn
--bmaxdivn
|
The maximum number of suffixes allowed in a block, expressed as a fraction of
the length of the reference. Setting this option overrides any previous setting
for [`--bmax`], or [`--bmaxdivn`]. Default: [`--bmaxdivn`] 4. This is
configured automatically by default; use [`-a`/`--noauto`] to configure manually.
|
[`--dcv`]: #centrifuge-build-options-dcv
--dcv
|
Use `` as the period for the difference-cover sample. A larger period
yields less memory overhead, but may make suffix sorting slower, especially if
repeats are present. Must be a power of 2 no greater than 4096. Default: 1024.
This is configured automatically by default; use [`-a`/`--noauto`] to configure
manually.
|
[`--nodc`]: #centrifuge-build-options-nodc
--nodc
|
Disable use of the difference-cover sample. Suffix sorting becomes
quadratic-time in the worst case (where the worst case is an extremely
repetitive reference). Default: off.
|
-o/--offrate
|
To map alignments back to positions on the reference sequences, it's necessary
to annotate ("mark") some or all of the [Burrows-Wheeler] rows with their
corresponding location on the genome.
[`-o`/`--offrate`](#centrifuge-build-options-o) governs how many rows get marked:
the indexer will mark every 2^`` rows. Marking more rows makes
reference-position lookups faster, but requires more memory to hold the
annotations at runtime. The default is 4 (every 16th row is marked; for human
genome, annotations occupy about 680 megabytes).
|
-t/--ftabchars
|
The ftab is the lookup table used to calculate an initial [Burrows-Wheeler]
range with respect to the first `` characters of the query. A larger
`` yields a larger lookup table but faster query times. The ftab has size
4^(``+1) bytes. The default setting is 10 (ftab is 4MB).
|
--seed
|
Use `` as the seed for pseudo-random number generator.
|
--kmer-count
|
Use `` as kmer-size for counting the distinct number of k-mers in the input sequences.
|
-q/--quiet
|
`centrifuge-build` is verbose by default. With this option `centrifuge-build` will
print only error messages.
|
-h/--help
|
Print usage information and quit.
|
--version
|
Print version information and quit.
|
The `centrifuge-inspect` index inspector
=====================================
`centrifuge-inspect` extracts information from a Centrifuge index about what kind of
index it is and what reference sequences were used to build it. When run without
any options, the tool will output a FASTA file containing the sequences of the
original references (with all non-`A`/`C`/`G`/`T` characters converted to `N`s).
It can also be used to extract just the reference sequence names using the
[`-n`/`--names`] option or a more verbose summary using the [`-s`/`--summary`]
option.
Command Line
------------
Usage:
centrifuge-inspect [options]*
### Main arguments
|
The basename of the index to be inspected. The basename is name of any of the
index files but with the `.X.cf` suffix omitted.
`centrifuge-inspect` first looks in the current directory for the index files, then
in the directory specified in the `CENTRIFUGE_INDEXES` environment variable.
|
### Options
-a/--across
|
When printing FASTA output, output a newline character every `` bases
(default: 60).
|
[`-n`/`--names`]: #centrifuge-inspect-options-n
-n/--names
|
Print reference sequence names, one per line, and quit.
|
[`-s`/`--summary`]: #centrifuge-inspect-options-s
-s/--summary
|
Print a summary that includes information about index settings, as well as the
names and lengths of the input sequences. The summary has this format:
Colorspace <0 or 1>
SA-Sample 1 in
FTab-Chars
Sequence-1
Sequence-2
...
Sequence-N
Fields are separated by tabs. Colorspace is always set to 0 for Centrifuge.
|
[`--conversion-table`]: #centrifuge-inspect-options-conversion-table
--conversion-table
|
Print a list of UIDs (unique ID) and corresponding taxonomic IDs.
|
[`--taxonomy-tree`]: #centrifuge-inspect-options-taxonomy-tree
--taxonomy-tree
|
Print taxonomic tree.
|
[`--name-table`]: #centrifuge-inspect-options-name-table
--name-table
|
Print name table.
|
[`--size-table`]: #centrifuge-inspect-options-size-table
--size-table
|
Print a list of taxonomic IDs and lengths of the sequences belonging to the same taxonomic IDs.
|
-v/--verbose
|
Print verbose output (for debugging).
|
--version
|
Print version information and quit.
|
-h/--help
|
Print usage information and quit.
|
[`small example`]: #centrifuge-example
Getting started with Centrifuge
===================================================
Centrifuge comes with some example files to get you started. The example files
are not scientifically significant; these files will simply let you start running Centrifuge and
downstream tools right away.
First follow the manual instructions to [obtain Centrifuge]. Set the `CENTRIFUGE_HOME`
environment variable to point to the new Centrifuge directory containing the
`centrifuge`, `centrifuge-build` and `centrifuge-inspect` binaries. This is important,
as the `CENTRIFUGE_HOME` variable is used in the commands below to refer to that
directory.
[obtain Centrifuge]: #obtaining-centrifuge
Indexing a reference genome
---------------------------
To create an index for two small sequences included with Centrifuge, create a new temporary directory (it doesn't matter where), change into that directory, and run:
$CENTRIFUGE_HOME/centrifuge-build --conversion-table $CENTRIFUGE_HOME/example/reference/gi_to_tid.dmp --taxonomy-tree $CENTRIFUGE_HOME/example/reference/nodes.dmp --name-table $CENTRIFUGE_HOME/example/reference/names.dmp $CENTRIFUGE_HOME/example/reference/test.fa test
The command should print many lines of output and then quit. When the command
completes, the current directory will contain three new files that all start with
`test` and end with `.1.cf`, `.2.cf`, and `.3.cf`. These files constitute the index - you're done!
You can use `centrifuge-build` to create an index for a set of FASTA files obtained
from any source, including sites such as [UCSC], [NCBI], and [Ensembl]. When
indexing multiple FASTA files, specify all the files in a single comma-separated
list (see the example below). For more details on how to create an index with
`centrifuge-build`, see the [manual section on index building]. You may also want
to bypass this process by [using a pre-built index].
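For instance, a hypothetical build over two FASTA files (the file names and
index name here are placeholders, not files shipped with Centrifuge) might look
like:
$CENTRIFUGE_HOME/centrifuge-build --conversion-table gi_to_tid.dmp --taxonomy-tree nodes.dmp --name-table names.dmp chr1.fa,chr2.fa myindex
Note that the list `chr1.fa,chr2.fa` contains no spaces around the comma.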
[UCSC]: http://genome.ucsc.edu/cgi-bin/hgGateway
[NCBI]: http://www.ncbi.nlm.nih.gov/sites/genome
[Ensembl]: http://www.ensembl.org/
[manual section on index building]: #the-centrifuge-build-indexer
[using a pre-built index]: #using-a-pre-built-index
Classifying example reads
-------------------------
Stay in the directory created in the previous step, which now contains the
`test` index files. Next, run:
$CENTRIFUGE_HOME/centrifuge -f -x test $CENTRIFUGE_HOME/example/reads/input.fa
This runs the Centrifuge classifier, which classifies a set of unpaired reads
against the genomes using the index generated in the previous step.
The classification results are reported to stdout, and a
short classification summary is written to centrifuge-species_report.tsv.
You will see something like this:
readID seqID taxID score 2ndBestScore hitLength numMatches
C_1 gi|7 9913 4225 4225 80 2
C_1 gi|4 9646 4225 4225 80 2
C_2 gi|4 9646 4225 4225 80 2
C_2 gi|7 9913 4225 4225 80 2
C_3 gi|7 9913 4225 4225 80 2
C_3 gi|4 9646 4225 4225 80 2
C_4 gi|4 9646 4225 4225 80 2
C_4 gi|7 9913 4225 4225 80 2
1_1 gi|4 9646 4225 0 80 1
1_2 gi|4 9646 4225 0 80 1
2_1 gi|7 9913 4225 0 80 1
2_2 gi|7 9913 4225 0 80 1
2_3 gi|7 9913 4225 0 80 1
2_4 gi|7 9913 4225 0 80 1
2_5 gi|7 9913 4225 0 80 1
2_6 gi|7 9913 4225 0 80 1
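The `numMatches` column reports how many taxa the read matched equally well;
reads `C_1` through `C_4` above each matched two taxa, while the remaining
reads were classified uniquely. As a small, hypothetical post-processing sketch
(not part of Centrifuge itself), assuming the tab-separated per-read output
above was saved to a file named `classification.tsv`, the uniquely classified
reads can be selected by filtering on that column:
awk -F'\t' '$7 == 1' classification.tsv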
centrifuge-f39767eb57d8e175029c/Makefile 0000664 0000000 0000000 00000027777 13021605047 0017455 0 ustar 00root root 0000000 0000000 #
# Copyright 2014, Daehwan Kim
#
# This file is part of Centrifuge, which is copied and modified from Makefile in the Bowtie2 package.
#
# Centrifuge is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Centrifuge is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Centrifuge. If not, see <http://www.gnu.org/licenses/>.
#
#
# Makefile for centrifuge-bin, centrifuge-build, centrifuge-inspect
#
INC =
GCC_PREFIX = $(shell dirname `which gcc`)
GCC_SUFFIX =
CC = $(GCC_PREFIX)/gcc$(GCC_SUFFIX)
CPP = $(GCC_PREFIX)/g++$(GCC_SUFFIX)
CXX = $(CPP) #-fdiagnostics-color=always
HEADERS = $(wildcard *.h)
BOWTIE_MM = 1
BOWTIE_SHARED_MEM = 0
# Detect Cygwin or MinGW
WINDOWS = 0
CYGWIN = 0
MINGW = 0
ifneq (,$(findstring CYGWIN,$(shell uname)))
WINDOWS = 1
CYGWIN = 1
# POSIX memory-mapped files not currently supported on Windows
BOWTIE_MM = 0
BOWTIE_SHARED_MEM = 0
else
ifneq (,$(findstring MINGW,$(shell uname)))
WINDOWS = 1
MINGW = 1
# POSIX memory-mapped files not currently supported on Windows
BOWTIE_MM = 0
BOWTIE_SHARED_MEM = 0
endif
endif
MACOS = 0
ifneq (,$(findstring Darwin,$(shell uname)))
MACOS = 1
endif
POPCNT_CAPABILITY ?= 1
ifeq (1, $(POPCNT_CAPABILITY))
EXTRA_FLAGS += -DPOPCNT_CAPABILITY
INC += -I third_party
endif
MM_DEF =
ifeq (1,$(BOWTIE_MM))
MM_DEF = -DBOWTIE_MM
endif
SHMEM_DEF =
ifeq (1,$(BOWTIE_SHARED_MEM))
SHMEM_DEF = -DBOWTIE_SHARED_MEM
endif
PTHREAD_PKG =
PTHREAD_LIB =
ifeq (1,$(MINGW))
PTHREAD_LIB =
else
PTHREAD_LIB = -lpthread
endif
SEARCH_LIBS =
BUILD_LIBS =
INSPECT_LIBS =
ifeq (1,$(MINGW))
BUILD_LIBS =
INSPECT_LIBS =
endif
USE_SRA = 0
SRA_DEF =
SRA_LIB =
SEARCH_INC =
ifeq (1,$(USE_SRA))
SRA_DEF = -DUSE_SRA
SRA_LIB = -lncbi-ngs-c++-static -lngs-c++-static -lncbi-vdb-static -ldl
SEARCH_INC += -I$(NCBI_NGS_DIR)/include -I$(NCBI_VDB_DIR)/include
SEARCH_LIBS += -L$(NCBI_NGS_DIR)/lib64 -L$(NCBI_VDB_DIR)/lib64
endif
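# Illustrative note (added commentary): SRA support is off by default. A
# hypothetical build with it enabled would set the SDK locations on the make
# command line, e.g.
#   make USE_SRA=1 NCBI_NGS_DIR=/opt/ngs NCBI_VDB_DIR=/opt/ncbi-vdb
# where the two paths are placeholders for local NGS and NCBI-VDB installs.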
LIBS = $(PTHREAD_LIB)
SHARED_CPPS = ccnt_lut.cpp ref_read.cpp alphabet.cpp shmem.cpp \
edit.cpp bt2_idx.cpp \
reference.cpp ds.cpp limit.cpp \
random_source.cpp tinythread.cpp
SEARCH_CPPS = qual.cpp pat.cpp \
read_qseq.cpp ref_coord.cpp mask.cpp \
pe.cpp aligner_seed_policy.cpp \
scoring.cpp presets.cpp \
simple_func.cpp random_util.cpp outq.cpp
BUILD_CPPS = diff_sample.cpp
CENTRIFUGE_CPPS_MAIN = $(SEARCH_CPPS) centrifuge_main.cpp
CENTRIFUGE_BUILD_CPPS_MAIN = $(BUILD_CPPS) centrifuge_build_main.cpp
CENTRIFUGE_COMPRESS_CPPS_MAIN = $(BUILD_CPPS) \
aligner_seed.cpp \
aligner_sw.cpp \
aligner_cache.cpp \
dp_framer.cpp \
aligner_bt.cpp sse_util.cpp \
aligner_swsse.cpp \
aligner_swsse_loc_i16.cpp \
aligner_swsse_ee_i16.cpp \
aligner_swsse_loc_u8.cpp \
aligner_swsse_ee_u8.cpp \
scoring.cpp \
mask.cpp \
qual.cpp
CENTRIFUGE_REPORT_CPPS_MAIN=$(BUILD_CPPS)
SEARCH_FRAGMENTS = $(wildcard search_*_phase*.c)
VERSION = $(shell cat VERSION)
GIT_VERSION = $(VERSION)
#GIT_VERSION = $(shell command -v git 2>&1 > /dev/null && git describe --long --tags --dirty --always --abbrev=10 || cat VERSION)
# Convert BITS=?? to a -m flag
BITS=32
ifeq (x86_64,$(shell uname -m))
BITS=64
endif
# msys will always be 32 bit so look at the cpu arch instead.
ifneq (,$(findstring AMD64,$(PROCESSOR_ARCHITEW6432)))
ifeq (1,$(MINGW))
BITS=64
endif
endif
BITS_FLAG =
ifeq (32,$(BITS))
BITS_FLAG = -m32
endif
ifeq (64,$(BITS))
BITS_FLAG = -m64
endif
SSE_FLAG=-msse2
DEBUG_FLAGS = -O0 -g3 $(BITS_FLAG) $(SSE_FLAG)
DEBUG_DEFS = -DCOMPILER_OPTIONS="\"$(DEBUG_FLAGS) $(EXTRA_FLAGS)\""
RELEASE_FLAGS = -O3 $(BITS_FLAG) $(SSE_FLAG) -funroll-loops -g3
RELEASE_DEFS = -DCOMPILER_OPTIONS="\"$(RELEASE_FLAGS) $(EXTRA_FLAGS)\""
NOASSERT_FLAGS = -DNDEBUG
FILE_FLAGS = -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE
CFLAGS =
#CFLAGS = -fdiagnostics-color=always
ifeq (1,$(USE_SRA))
ifeq (1, $(MACOS))
DEBUG_FLAGS += -mmacosx-version-min=10.6
RELEASE_FLAGS += -mmacosx-version-min=10.6
endif
endif
CENTRIFUGE_BIN_LIST = centrifuge-build-bin \
centrifuge-class \
centrifuge-inspect-bin
CENTRIFUGE_BIN_LIST_AUX = centrifuge-build-bin-debug \
centrifuge-class-debug \
centrifuge-inspect-bin-debug
CENTRIFUGE_SCRIPT_LIST = centrifuge \
centrifuge-build \
centrifuge-inspect \
centrifuge-download \
$(wildcard centrifuge-*.pl)
GENERAL_LIST = $(wildcard scripts/*.sh) \
$(wildcard scripts/*.pl) \
$(wildcard *.py) \
$(wildcard *.pl) \
doc/manual.inc.html \
doc/README \
doc/style.css \
$(wildcard example/index/*.cf) \
$(wildcard example/reads/*.fa) \
$(wildcard example/reference/*) \
indices/Makefile \
$(PTHREAD_PKG) \
$(CENTRIFUGE_SCRIPT_LIST) \
AUTHORS \
LICENSE \
NEWS \
MANUAL \
MANUAL.markdown \
TUTORIAL \
VERSION
ifeq (1,$(WINDOWS))
CENTRIFUGE_BIN_LIST := $(CENTRIFUGE_BIN_LIST) centrifuge.bat centrifuge-build.bat centrifuge-inspect.bat
endif
# This is helpful on Windows under MinGW/MSYS, where Make might go for
# the Windows FIND tool instead.
FIND=$(shell which find)
SRC_PKG_LIST = $(wildcard *.h) \
$(wildcard *.hh) \
$(wildcard *.c) \
$(wildcard *.cpp) \
$(wildcard third_party/*.h) \
$(wildcard third_party/*.cpp) \
doc/strip_markdown.pl \
Makefile \
$(GENERAL_LIST)
BIN_PKG_LIST = $(GENERAL_LIST)
.PHONY: all allall both both-debug
all: $(CENTRIFUGE_BIN_LIST)
allall: $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_BIN_LIST_AUX)
both: centrifuge-class centrifuge-build-bin
both-debug: centrifuge-class-debug centrifuge-build-bin-debug
DEFS=-fno-strict-aliasing \
-DCENTRIFUGE_VERSION="\"$(GIT_VERSION)\"" \
-DBUILD_HOST="\"`hostname`\"" \
-DBUILD_TIME="\"`date`\"" \
-DCOMPILER_VERSION="\"`$(CXX) -v 2>&1 | tail -1`\"" \
$(FILE_FLAGS) \
$(CFLAGS) \
$(PREF_DEF) \
$(MM_DEF) \
$(SHMEM_DEF)
#
# centrifuge targets
#
centrifuge-class: centrifuge.cpp $(SEARCH_CPPS) $(SHARED_CPPS) $(HEADERS) $(SEARCH_FRAGMENTS)
$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \
$(DEFS) $(SRA_DEF) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX $(NOASSERT_FLAGS) -Wall \
$(INC) $(SEARCH_INC) \
-o $@ $< \
$(SHARED_CPPS) $(CENTRIFUGE_CPPS_MAIN) \
$(LIBS) $(SRA_LIB) $(SEARCH_LIBS)
centrifuge-class-debug: centrifuge.cpp $(SEARCH_CPPS) $(SHARED_CPPS) $(HEADERS) $(SEARCH_FRAGMENTS)
$(CXX) $(DEBUG_FLAGS) $(DEBUG_DEFS) $(EXTRA_FLAGS) \
$(DEFS) $(SRA_DEF) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \
$(INC) $(SRA_LIB) $(SEARCH_INC) \
-o $@ $< \
$(SHARED_CPPS) $(CENTRIFUGE_CPPS_MAIN) \
$(LIBS) $(SRA_LIB) $(SEARCH_LIBS)
centrifuge-build-bin: centrifuge_build.cpp $(SHARED_CPPS) $(HEADERS)
$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX $(NOASSERT_FLAGS) -Wall \
$(INC) \
-o $@ $< \
$(SHARED_CPPS) $(CENTRIFUGE_BUILD_CPPS_MAIN) \
$(LIBS) $(BUILD_LIBS)
centrifuge-build-bin-debug: centrifuge_build.cpp $(SHARED_CPPS) $(HEADERS)
$(CXX) $(DEBUG_FLAGS) $(DEBUG_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \
$(INC) \
-o $@ $< \
$(SHARED_CPPS) $(CENTRIFUGE_BUILD_CPPS_MAIN) \
$(LIBS) $(BUILD_LIBS)
centrifuge-compress-bin: centrifuge_compress.cpp $(SHARED_CPPS) $(CENTRIFUGE_COMPRESS_CPPS_MAIN) $(HEADERS)
$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX $(NOASSERT_FLAGS) -Wall \
$(INC) \
-o $@ $< \
$(SHARED_CPPS) $(CENTRIFUGE_COMPRESS_CPPS_MAIN) \
$(LIBS) $(BUILD_LIBS)
centrifuge-compress-bin-debug: centrifuge_compress.cpp $(SHARED_CPPS) $(CENTRIFUGE_COMPRESS_CPPS_MAIN) $(HEADERS)
$(CXX) $(DEBUG_FLAGS) $(DEBUG_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \
$(INC) \
-o $@ $< \
$(SHARED_CPPS) $(CENTRIFUGE_COMPRESS_CPPS_MAIN) \
$(LIBS) $(BUILD_LIBS)
centrifuge-report-bin: centrifuge_report.cpp $(SHARED_CPPS) $(CENTRIFUGE_REPORT_CPPS_MAIN) $(HEADERS)
$(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX $(NOASSERT_FLAGS) -Wall \
$(INC) \
-o $@ $< \
$(SHARED_CPPS) $(CENTRIFUGE_REPORT_CPPS_MAIN) \
$(LIBS) $(BUILD_LIBS)
centrifuge-report-bin-debug: centrifuge_report.cpp $(SHARED_CPPS) $(CENTRIFUGE_REPORT_CPPS_MAIN) $(HEADERS)
$(CXX) $(DEBUG_FLAGS) $(DEBUG_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \
$(INC) \
-o $@ $< \
$(SHARED_CPPS) $(CENTRIFUGE_REPORT_CPPS_MAIN) \
$(LIBS) $(BUILD_LIBS)
#centrifuge-RemoveN: centrifuge-RemoveN.cpp
# $(CXX) $(RELEASE_FLAGS) $(RELEASE_DEFS) $(EXTRA_FLAGS) \
# $(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX $(NOASSERT_FLAGS) -Wall \
# $(INC) \
# -o $@ $<
#
# centrifuge-inspect targets
#
centrifuge-inspect-bin: centrifuge_inspect.cpp $(HEADERS) $(SHARED_CPPS)
$(CXX) $(RELEASE_FLAGS) \
$(RELEASE_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \
$(INC) -I . \
-o $@ $< \
$(SHARED_CPPS) \
$(LIBS) $(INSPECT_LIBS)
centrifuge-inspect-bin-debug: centrifuge_inspect.cpp $(HEADERS) $(SHARED_CPPS)
$(CXX) $(DEBUG_FLAGS) \
$(DEBUG_DEFS) $(EXTRA_FLAGS) \
$(DEFS) -DCENTRIFUGE -DBOWTIE2 -DBOWTIE_64BIT_INDEX -Wall \
$(INC) -I . \
-o $@ $< \
$(SHARED_CPPS) \
$(LIBS) $(INSPECT_LIBS)
centrifuge: ;
centrifuge.bat:
echo "@echo off" > centrifuge.bat
echo "perl %~dp0/centrifuge %*" >> centrifuge.bat
centrifuge-build.bat:
echo "@echo off" > centrifuge-build.bat
echo "python %~dp0/centrifuge-build %*" >> centrifuge-build.bat
centrifuge-inspect.bat:
echo "@echo off" > centrifuge-inspect.bat
echo "python %~dp0/centrifuge-inspect %*" >> centrifuge-inspect.bat
.PHONY: centrifuge-src
centrifuge-src: $(SRC_PKG_LIST)
mkdir .src.tmp
mkdir .src.tmp/centrifuge-$(VERSION)
zip tmp.zip $(SRC_PKG_LIST)
mv tmp.zip .src.tmp/centrifuge-$(VERSION)
cd .src.tmp/centrifuge-$(VERSION) ; unzip tmp.zip ; rm -f tmp.zip
cd .src.tmp ; zip -r centrifuge-$(VERSION)-source.zip centrifuge-$(VERSION)
cp .src.tmp/centrifuge-$(VERSION)-source.zip .
rm -rf .src.tmp
.PHONY: centrifuge-bin
centrifuge-bin: $(BIN_PKG_LIST) $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_BIN_LIST_AUX)
rm -rf .bin.tmp
mkdir .bin.tmp
mkdir .bin.tmp/centrifuge-$(VERSION)
if [ -f centrifuge.exe ] ; then \
zip tmp.zip $(BIN_PKG_LIST) $(addsuffix .exe,$(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_BIN_LIST_AUX)) ; \
else \
zip tmp.zip $(BIN_PKG_LIST) $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_BIN_LIST_AUX) ; \
fi
mv tmp.zip .bin.tmp/centrifuge-$(VERSION)
cd .bin.tmp/centrifuge-$(VERSION) ; unzip tmp.zip ; rm -f tmp.zip
cd .bin.tmp ; zip -r centrifuge-$(VERSION)-$(BITS).zip centrifuge-$(VERSION)
cp .bin.tmp/centrifuge-$(VERSION)-$(BITS).zip .
rm -rf .bin.tmp
.PHONY: doc
doc: doc/manual.inc.html MANUAL
doc/manual.inc.html: MANUAL.markdown
pandoc -T "Centrifuge Manual" -o $@ \
--from markdown --to HTML --toc $^
perl -i -ne \
'$$w=0 if m|^</body>|; $$w=1 if m|^<body>|; print if $$w;' $@
MANUAL: MANUAL.markdown
perl doc/strip_markdown.pl < $^ > $@
prefix=/usr/local
.PHONY: install
install: all
mkdir -p $(prefix)/bin
mkdir -p $(prefix)/share/centrifuge/indices
install -m 0644 indices/Makefile $(prefix)/share/centrifuge/indices
install -d -m 0755 $(prefix)/share/centrifuge/doc
install -m 0644 doc/* $(prefix)/share/centrifuge/doc
for file in $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_SCRIPT_LIST); do \
install -m 0755 $$file $(prefix)/bin ; \
done
.PHONY: uninstall
uninstall: all
for file in $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_SCRIPT_LIST); do \
rm -v $(prefix)/bin/$$file ; \
done
rm -rv $(prefix)/share/centrifuge
.PHONY: clean
clean:
rm -f $(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_BIN_LIST_AUX) \
$(addsuffix .exe,$(CENTRIFUGE_BIN_LIST) $(CENTRIFUGE_BIN_LIST_AUX)) \
centrifuge-src.zip centrifuge-bin.zip
rm -f core.* .tmp.head
rm -rf *.dSYM
push-doc: doc/manual.inc.html
scp doc/*.*html igm1:/data1/igm3/www/ccb.jhu.edu/html/software/centrifuge/
centrifuge-f39767eb57d8e175029c/NEWS 0000664 0000000 0000000 00000000037 13021605047 0016470 0 ustar 00root root 0000000 0000000 Centrifuge NEWS
===============
centrifuge-f39767eb57d8e175029c/README.md 0000664 0000000 0000000 00000003617 13021605047 0017257 0 ustar 00root root 0000000 0000000 # Centrifuge
Classifier for metagenomic sequences
[Centrifuge] is a novel microbial classification engine that enables
rapid, accurate and sensitive labeling of reads and quantification of
species on desktop computers. The system uses a novel indexing scheme
based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini
(FM) index, optimized specifically for the metagenomic classification
problem. Centrifuge requires a relatively small index (4.7 GB for all
complete bacterial and viral genomes plus the human genome) and
classifies sequences at very high speed, allowing it to process the
millions of reads from a typical high-throughput DNA sequencing run
within a few minutes. Together these advances enable timely and
accurate analysis of large metagenomics data sets on conventional
desktop computers.
The Centrifuge homepage is http://www.ccb.jhu.edu/software/centrifuge
The Centrifuge preprint is available at http://biorxiv.org/content/early/2016/05/25/054965.abstract
The Centrifuge poster is available at http://www.ccb.jhu.edu/people/infphilo/data/Centrifuge-poster.pdf
For more details on installing and running Centrifuge, see the MANUAL file.
## Quick guide
### Installation from source
git clone https://github.com/infphilo/centrifuge
cd centrifuge
make
sudo make install prefix=/usr/local
### Building indexes
We provide several indexes on the Centrifuge homepage at http://www.ccb.jhu.edu/software/centrifuge.
Centrifuge needs sequence and taxonomy files, as well as a sequence-ID-to-taxonomy-ID mapping.
See the MANUAL file for details. We provide a Makefile that simplifies the building of several
standard and custom indices:
cd indices
make p+h+v # bacterial, human, and viral genomes [~12G]
make p_compressed # bacterial genomes compressed at the species level [~4.2G]
make p_compressed+h+v # combination of the two above [~8G]
centrifuge-f39767eb57d8e175029c/TUTORIAL 0000664 0000000 0000000 00000000370 13021605047 0017157 0 ustar 00root root 0000000 0000000 See the section toward the end of MANUAL entitled "Getting started with Bowtie 2: Lambda
phage example". Or, for tutorial for latest Bowtie 2 version, visit:
http://bowtie-bio.sf.net/bowtie2/manual.shtml#getting-started-with-bowtie-2-lambda-phage-example
centrifuge-f39767eb57d8e175029c/VERSION 0000664 0000000 0000000 00000000013 13021605047 0017033 0 ustar 00root root 0000000 0000000 1.0.3-beta
centrifuge-f39767eb57d8e175029c/aligner_bt.cpp 0000664 0000000 0000000 00000151265 13021605047 0020615 0 ustar 00root root 0000000 0000000 /*
* Copyright 2011, Ben Langmead
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#include "aligner_bt.h"
#include "mask.h"
using namespace std;
#define CHECK_ROW_COL(rowc, colc) \
if(rowc >= 0 && colc >= 0) { \
if(!sawcell_[colc].insert(rowc)) { \
/* was already in there */ \
abort = true; \
return; \
} \
assert(local || prob_.cper_->debugCell(rowc, colc, hefc)); \
}
/**
* Fill in a triangle of the DP table and backtrace from the given cell to
* a cell in the previous checkpoint, or to the terminal cell.
*/
void BtBranchTracer::triangleFill(
int64_t rw, // row of cell to backtrace from
int64_t cl, // column of cell to backtrace from
int hef, // cell to backtrace from is H (0), E (1), or F (2)
TAlScore targ, // score of cell to backtrace from
TAlScore targ_final, // score of alignment we're looking for
RandomSource& rnd, // pseudo-random generator
int64_t& row_new, // out: row we ended up in after backtrace
int64_t& col_new, // out: column we ended up in after backtrace
int& hef_new, // out: H/E/F after backtrace
TAlScore& targ_new, // out: score up to cell we ended up in
bool& done, // out: finished tracing out an alignment?
bool& abort) // out: aborted b/c cell was seen before?
{
assert_geq(rw, 0);
assert_geq(cl, 0);
assert_range(0, 2, hef);
assert_lt(rw, (int64_t)prob_.qrylen_);
assert_lt(cl, (int64_t)prob_.reflen_);
assert(prob_.usecp_ && prob_.fill_);
int64_t row = rw, col = cl;
const int64_t colmin = 0;
const int64_t rowmin = 0;
const int64_t colmax = prob_.reflen_ - 1;
const int64_t rowmax = prob_.qrylen_ - 1;
assert_leq(prob_.reflen_, (TRefOff)sawcell_.size());
assert_leq(col, (int64_t)prob_.cper_->hicol());
assert_geq(col, (int64_t)prob_.cper_->locol());
assert_geq(prob_.cper_->per(), 2);
size_t mod = (row + col) & prob_.cper_->lomask();
assert_lt(mod, prob_.cper_->per());
// Allocate room for diags
size_t depth = mod+1;
assert_leq(depth, prob_.cper_->per());
size_t breadth = depth;
tri_.resize(depth);
// Allocate room for each diag
for(size_t i = 0; i < depth; i++) {
tri_[i].resize(breadth - i);
}
bool upperleft = false;
size_t off = (row + col) >> prob_.cper_->perpow2();
if(off == 0) {
upperleft = true;
} else {
off--;
}
const TAlScore sc_rdo = prob_.sc_->readGapOpen();
const TAlScore sc_rde = prob_.sc_->readGapExtend();
const TAlScore sc_rfo = prob_.sc_->refGapOpen();
const TAlScore sc_rfe = prob_.sc_->refGapExtend();
const bool local = !prob_.sc_->monotone;
int64_t row_lo = row - (int64_t)mod;
const CpQuad *prev2 = NULL, *prev1 = NULL;
if(!upperleft) {
// Read-only pointer to cells in diagonal -2. Start one row above the
// target row.
prev2 = prob_.cper_->qdiag1sPtr() + (off * prob_.cper_->nrow() + row_lo - 1);
// Read-only pointer to cells in diagonal -1. Start one row above the
// target row
prev1 = prob_.cper_->qdiag2sPtr() + (off * prob_.cper_->nrow() + row_lo - 1);
#ifndef NDEBUG
if(row >= (int64_t)mod) {
size_t rowc = row - mod, colc = col;
if(rowc > 0 && prob_.cper_->isCheckpointed(rowc-1, colc)) {
TAlScore al = prev1[0].sc[0];
if(al == MIN_I16) al = MIN_I64;
assert_eq(prob_.cper_->scoreTriangle(rowc-1, colc, 0), al);
}
if(rowc > 0 && colc > 0 && prob_.cper_->isCheckpointed(rowc-1, colc-1)) {
TAlScore al = prev2[0].sc[0];
if(al == MIN_I16) al = MIN_I64;
assert_eq(prob_.cper_->scoreTriangle(rowc-1, colc-1, 0), al);
}
}
#endif
}
// Pointer to cells in current diagonal
// For each diagonal we need to fill in
for(size_t i = 0; i < depth; i++) {
CpQuad * cur = tri_[i].ptr();
CpQuad * curc = cur;
size_t doff = mod - i; // # diagonals we are away from target diag
//assert_geq(row, (int64_t)doff);
int64_t rowc = row - doff;
int64_t colc = col;
size_t neval = 0; // # cells evaluated in this diag
ASSERT_ONLY(const CpQuad *last = NULL);
// Fill this diagonal from upper right to lower left
for(size_t j = 0; j < breadth; j++) {
if(rowc >= rowmin && rowc <= rowmax &&
colc >= colmin && colc <= colmax)
{
neval++;
int64_t fromend = prob_.qrylen_ - rowc - 1;
bool allowGaps = fromend >= prob_.sc_->gapbar && rowc >= prob_.sc_->gapbar;
// Fill this cell
// Some things we might want to calculate about this cell up front:
// 1. How many matches are possible from this cell to the cell in
// row, col, in case this allows us to prune
// Get character from read
int qc = prob_.qry_[rowc];
// Get quality value from read
int qq = prob_.qual_[rowc];
assert_geq(qq, 33);
// Get character from reference
int rc = prob_.ref_[colc];
assert_range(0, 16, rc);
int16_t sc_diag = prob_.sc_->score(qc, rc, qq - 33);
int16_t sc_h_up = MIN_I16;
int16_t sc_f_up = MIN_I16;
int16_t sc_h_lf = MIN_I16;
int16_t sc_e_lf = MIN_I16;
if(allowGaps) {
if(rowc > 0) {
assert(local || prev1[j+0].sc[2] < 0);
if(prev1[j+0].sc[0] > MIN_I16) {
sc_h_up = prev1[j+0].sc[0] - sc_rfo;
if(local) sc_h_up = max(sc_h_up, 0);
}
if(prev1[j+0].sc[2] > MIN_I16) {
sc_f_up = prev1[j+0].sc[2] - sc_rfe;
if(local) sc_f_up = max(sc_f_up, 0);
}
#ifndef NDEBUG
TAlScore hup = prev1[j+0].sc[0];
TAlScore fup = prev1[j+0].sc[2];
if(hup == MIN_I16) hup = MIN_I64;
if(fup == MIN_I16) fup = MIN_I64;
if(local) {
hup = max(hup, 0);
fup = max(fup, 0);
}
if(prob_.cper_->isCheckpointed(rowc-1, colc)) {
assert_eq(hup, prob_.cper_->scoreTriangle(rowc-1, colc, 0));
assert_eq(fup, prob_.cper_->scoreTriangle(rowc-1, colc, 2));
}
#endif
}
if(colc > 0) {
assert(local || prev1[j+1].sc[1] < 0);
if(prev1[j+1].sc[0] > MIN_I16) {
sc_h_lf = prev1[j+1].sc[0] - sc_rdo;
if(local) sc_h_lf = max(sc_h_lf, 0);
}
if(prev1[j+1].sc[1] > MIN_I16) {
sc_e_lf = prev1[j+1].sc[1] - sc_rde;
if(local) sc_e_lf = max(sc_e_lf, 0);
}
#ifndef NDEBUG
TAlScore hlf = prev1[j+1].sc[0];
TAlScore elf = prev1[j+1].sc[1];
if(hlf == MIN_I16) hlf = MIN_I64;
if(elf == MIN_I16) elf = MIN_I64;
if(local) {
hlf = max(hlf, 0);
elf = max(elf, 0);
}
if(prob_.cper_->isCheckpointed(rowc, colc-1)) {
assert_eq(hlf, prob_.cper_->scoreTriangle(rowc, colc-1, 0));
assert_eq(elf, prob_.cper_->scoreTriangle(rowc, colc-1, 1));
}
#endif
}
}
assert(rowc <= 1 || colc <= 0 || prev2 != NULL);
int16_t sc_h_dg = ((rowc > 0 && colc > 0) ? prev2[j+0].sc[0] : 0);
if(colc == 0 && rowc > 0 && !local) {
sc_h_dg = MIN_I16;
}
if(sc_h_dg > MIN_I16) {
sc_h_dg += sc_diag;
}
if(local) sc_h_dg = max(sc_h_dg, 0);
// cerr << sc_diag << " " << sc_h_dg << " " << sc_h_up << " " << sc_f_up << " " << sc_h_lf << " " << sc_e_lf << endl;
int mask = 0;
// Calculate best ways into H, E, F cells starting with H.
// Mask bits:
// H: 1=diag, 2=hhoriz, 4=ehoriz, 8=hvert, 16=fvert
// E: 32=hhoriz, 64=ehoriz
// F: 128=hvert, 256=fvert
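// (Illustrative note, added commentary: e.g. a mask of 9 = 1|8 records
// that the best H score is reachable both via the diagonal move and via
// a move from the H cell above; the backtrace later picks one of them.)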
int16_t sc_best = sc_h_dg;
if(sc_h_dg > MIN_I64) {
mask = 1;
}
if(colc > 0 && sc_h_lf >= sc_best && sc_h_lf > MIN_I64) {
if(sc_h_lf > sc_best) mask = 0;
mask |= 2;
sc_best = sc_h_lf;
}
if(colc > 0 && sc_e_lf >= sc_best && sc_e_lf > MIN_I64) {
if(sc_e_lf > sc_best) mask = 0;
mask |= 4;
sc_best = sc_e_lf;
}
if(rowc > 0 && sc_h_up >= sc_best && sc_h_up > MIN_I64) {
if(sc_h_up > sc_best) mask = 0;
mask |= 8;
sc_best = sc_h_up;
}
if(rowc > 0 && sc_f_up >= sc_best && sc_f_up > MIN_I64) {
if(sc_f_up > sc_best) mask = 0;
mask |= 16;
sc_best = sc_f_up;
}
// Calculate best way into E cell
int16_t sc_e_best = sc_h_lf;
if(colc > 0) {
if(sc_h_lf >= sc_e_lf && sc_h_lf > MIN_I64) {
if(sc_h_lf == sc_e_lf) {
mask |= 64;
}
mask |= 32;
} else if(sc_e_lf > MIN_I64) {
sc_e_best = sc_e_lf;
mask |= 64;
}
}
if(sc_e_best > sc_best) {
sc_best = sc_e_best;
mask &= ~31; // don't go diagonal
}
// Calculate best way into F cell
int16_t sc_f_best = sc_h_up;
if(rowc > 0) {
if(sc_h_up >= sc_f_up && sc_h_up > MIN_I64) {
if(sc_h_up == sc_f_up) {
mask |= 256;
}
mask |= 128;
} else if(sc_f_up > MIN_I64) {
sc_f_best = sc_f_up;
mask |= 256;
}
}
if(sc_f_best > sc_best) {
sc_best = sc_f_best;
mask &= ~127; // don't go horizontal or diagonal
}
// Install results in cur
assert(!prob_.sc_->monotone || sc_best <= 0);
assert(!prob_.sc_->monotone || sc_e_best <= 0);
assert(!prob_.sc_->monotone || sc_f_best <= 0);
curc->sc[0] = sc_best;
assert( local || sc_e_best < 0);
assert( local || sc_f_best < 0);
assert(!local || sc_e_best >= 0 || sc_e_best == MIN_I16);
assert(!local || sc_f_best >= 0 || sc_f_best == MIN_I16);
curc->sc[1] = sc_e_best;
curc->sc[2] = sc_f_best;
curc->sc[3] = mask;
// cerr << curc->sc[0] << " " << curc->sc[1] << " " << curc->sc[2] << " " << curc->sc[3] << endl;
ASSERT_ONLY(last = curc);
#ifndef NDEBUG
if(prob_.cper_->isCheckpointed(rowc, colc)) {
if(local) {
sc_e_best = max(sc_e_best, 0);
sc_f_best = max(sc_f_best, 0);
}
TAlScore sc_best64 = sc_best; if(sc_best == MIN_I16) sc_best64 = MIN_I64;
TAlScore sc_e_best64 = sc_e_best; if(sc_e_best == MIN_I16) sc_e_best64 = MIN_I64;
TAlScore sc_f_best64 = sc_f_best; if(sc_f_best == MIN_I16) sc_f_best64 = MIN_I64;
assert_eq(prob_.cper_->scoreTriangle(rowc, colc, 0), sc_best64);
assert_eq(prob_.cper_->scoreTriangle(rowc, colc, 1), sc_e_best64);
assert_eq(prob_.cper_->scoreTriangle(rowc, colc, 2), sc_f_best64);
}
#endif
}
// Update row, col
assert_lt(rowc, (int64_t)prob_.qrylen_);
rowc++;
colc--;
curc++;
} // for(size_t j = 0; j < breadth; j++)
if(i == depth-1) {
// Final iteration
assert(last != NULL);
assert_eq(1, neval);
assert_neq(0, last->sc[3]);
assert_eq(targ, last->sc[hef]);
} else {
breadth--;
prev2 = prev1 + 1;
prev1 = cur;
}
} // for(size_t i = 0; i < depth; i++)
//
// Now backtrack through the triangle. Abort as soon as we enter a cell
// that was visited by a previous backtrace.
//
int64_t rowc = row, colc = col;
size_t curid;
int hefc = hef;
if(bs_.empty()) {
// Start an initial branch
CHECK_ROW_COL(rowc, colc);
curid = bs_.alloc();
assert_eq(0, curid);
Edit e;
bs_[curid].init(
prob_,
0, // parent ID
0, // penalty
0, // score_en
rowc, // row
colc, // col
e, // edit
0, // hef
true, // I am the root
false); // don't try to extend with exact matches
bs_[curid].len_ = 0;
} else {
curid = bs_.size()-1;
}
size_t idx_orig = (row + col) >> prob_.cper_->perpow2();
while(true) {
// What depth are we?
size_t mod = (rowc + colc) & prob_.cper_->lomask();
assert_lt(mod, prob_.cper_->per());
CpQuad * cur = tri_[mod].ptr();
int64_t row_off = rowc - row_lo - mod;
assert(!local || cur[row_off].sc[0] > 0);
assert_geq(row_off, 0);
int mask = cur[row_off].sc[3];
assert_gt(mask, 0);
int sel = -1;
// Select what type of move to make, which depends on whether we're
// currently in H, E, F:
if(hefc == 0) {
if( (mask & 1) != 0) {
// diagonal
sel = 0;
} else if((mask & 8) != 0) {
// up to H
sel = 3;
} else if((mask & 16) != 0) {
// up to F
sel = 4;
} else if((mask & 2) != 0) {
// left to H
sel = 1;
} else if((mask & 4) != 0) {
// left to E
sel = 2;
}
} else if(hefc == 1) {
if( (mask & 32) != 0) {
// left to H
sel = 5;
} else if((mask & 64) != 0) {
// left to E
sel = 6;
}
} else {
assert_eq(2, hefc);
if( (mask & 128) != 0) {
// up to H
sel = 7;
} else if((mask & 256) != 0) {
// up to F
sel = 8;
}
}
assert_geq(sel, 0);
// Get character from read
int qc = prob_.qry_[rowc], qq = prob_.qual_[rowc];
// Get character from reference
int rc = prob_.ref_[colc];
assert_range(0, 16, rc);
// Now that we know what type of move to make, make it, updating our
// row and column and updating the branch.
if(sel == 0) {
assert_geq(rowc, 0);
assert_geq(colc, 0);
TAlScore scd = prob_.sc_->score(qc, rc, qq - 33);
if((rc & (1 << qc)) == 0) {
// Mismatch
size_t id = curid;
// Check if the previous branch was the initial (bottommost)
// branch with no matches. If so, the mismatch should be added
// to the initial branch, instead of starting a new branch.
bool empty = (bs_[curid].len_ == 0 && curid == 0);
if(!empty) {
id = bs_.alloc();
}
Edit e((int)rowc, mask2dna[rc], "ACGTN"[qc], EDIT_TYPE_MM);
assert_lt(scd, 0);
TAlScore score_en = bs_[curid].score_st_ + scd;
bs_[id].init(
prob_,
curid, // parent ID
-scd, // penalty
score_en, // score_en
rowc, // row
colc, // col
e, // edit
hefc, // hef
empty, // root?
false); // don't try to extend with exact matches
//assert(!local || bs_[id].score_st_ >= 0);
curid = id;
} else {
// Match
bs_[curid].score_st_ += prob_.sc_->match();
bs_[curid].len_++;
assert_leq((int64_t)bs_[curid].len_, bs_[curid].row_ + 1);
}
rowc--;
colc--;
assert(local || bs_[curid].score_st_ >= targ_final);
hefc = 0;
} else if((sel >= 1 && sel <= 2) || (sel >= 5 && sel <= 6)) {
assert_gt(colc, 0);
// Read gap
size_t id = bs_.alloc();
Edit e((int)rowc+1, mask2dna[rc], '-', EDIT_TYPE_READ_GAP);
TAlScore gapp = prob_.sc_->readGapOpen();
if(bs_[curid].len_ == 0 && bs_[curid].e_.inited() && bs_[curid].e_.isReadGap()) {
gapp = prob_.sc_->readGapExtend();
}
TAlScore score_en = bs_[curid].score_st_ - gapp;
bs_[id].init(
prob_,
curid, // parent ID
gapp, // penalty
score_en, // score_en
rowc, // row
colc-1, // col
e, // edit
hefc, // hef
false, // root?
false); // don't try to extend with exact matches
colc--;
curid = id;
assert( local || bs_[curid].score_st_ >= targ_final);
//assert(!local || bs_[curid].score_st_ >= 0);
if(sel == 1 || sel == 5) {
hefc = 0;
} else {
hefc = 1;
}
} else {
assert_gt(rowc, 0);
// Reference gap
size_t id = bs_.alloc();
Edit e((int)rowc, '-', "ACGTN"[qc], EDIT_TYPE_REF_GAP);
TAlScore gapp = prob_.sc_->refGapOpen();
if(bs_[curid].len_ == 0 && bs_[curid].e_.inited() && bs_[curid].e_.isRefGap()) {
gapp = prob_.sc_->refGapExtend();
}
TAlScore score_en = bs_[curid].score_st_ - gapp;
bs_[id].init(
prob_,
curid, // parent ID
gapp, // penalty
score_en, // score_en
rowc-1, // row
colc, // col
e, // edit
hefc, // hef
false, // root?
false); // don't try to extend with exact matches
rowc--;
curid = id;
//assert(!local || bs_[curid].score_st_ >= 0);
if(sel == 3 || sel == 7) {
hefc = 0;
} else {
hefc = 2;
}
}
CHECK_ROW_COL(rowc, colc);
size_t mod_new = (rowc + colc) & prob_.cper_->lomask();
size_t idx = (rowc + colc) >> prob_.cper_->perpow2();
assert_lt(mod_new, prob_.cper_->per());
int64_t row_off_new = rowc - row_lo - mod_new;
CpQuad * cur_new = NULL;
if(colc >= 0 && rowc >= 0 && idx == idx_orig) {
cur_new = tri_[mod_new].ptr();
}
bool hit_new_tri = (idx < idx_orig && colc >= 0 && rowc >= 0);
// Check whether we made it to the top row or to a cell with score 0
if(colc < 0 || rowc < 0 ||
(cur_new != NULL && (local && cur_new[row_off_new].sc[0] == 0)))
{
done = true;
assert(bs_[curid].isSolution(prob_));
addSolution(curid);
#ifndef NDEBUG
// A check to see if any two adjacent branches in the backtrace
// overlap. If they do, the whole alignment will be filtered out
// in trySolution(...)
size_t cur = curid;
if(!bs_[cur].root_) {
size_t next = bs_[cur].parentId_;
while(!bs_[next].root_) {
assert_neq(cur, next);
if(bs_[next].len_ != 0 || bs_[cur].len_ == 0) {
assert(!bs_[cur].overlap(prob_, bs_[next]));
}
cur = next;
next = bs_[cur].parentId_;
}
}
#endif
return;
}
if(hit_new_tri) {
assert(rowc < 0 || colc < 0 || prob_.cper_->isCheckpointed(rowc, colc));
row_new = rowc; col_new = colc;
hef_new = hefc;
done = false;
if(rowc < 0 || colc < 0) {
assert(local);
targ_new = 0;
} else {
targ_new = prob_.cper_->scoreTriangle(rowc, colc, hefc);
}
if(local && targ_new == 0) {
done = true;
assert(bs_[curid].isSolution(prob_));
addSolution(curid);
}
assert((row_new >= 0 && col_new >= 0) || done);
return;
}
}
assert(false);
}
#ifndef NDEBUG
#define DEBUG_CHECK(ss, row, col, hef) { \
if(prob_.cper_->debug() && row >= 0 && col >= 0) { \
TAlScore s = ss; \
if(s == MIN_I16) s = MIN_I64; \
if(local && s < 0) s = 0; \
TAlScore deb = prob_.cper_->debugCell(row, col, hef); \
if(local && deb < 0) deb = 0; \
assert_eq(s, deb); \
} \
}
#else
#define DEBUG_CHECK(ss, row, col, hef)
#endif
/**
* Fill in a square of the DP table and backtrace from the given cell to
* a cell in the previous checkpoint, or to the terminal cell.
*/
void BtBranchTracer::squareFill(
int64_t rw, // row of cell to backtrace from
int64_t cl, // column of cell to backtrace from
int hef, // cell to backtrace from is H (0), E (1), or F (2)
TAlScore targ, // score of cell to backtrace from
TAlScore targ_final, // score of alignment we're looking for
RandomSource& rnd, // pseudo-random generator
int64_t& row_new, // out: row we ended up in after backtrace
int64_t& col_new, // out: column we ended up in after backtrace
int& hef_new, // out: H/E/F after backtrace
TAlScore& targ_new, // out: score up to cell we ended up in
bool& done, // out: finished tracing out an alignment?
bool& abort) // out: aborted b/c cell was seen before?
{
assert_geq(rw, 0);
assert_geq(cl, 0);
assert_range(0, 2, hef);
assert_lt(rw, (int64_t)prob_.qrylen_);
assert_lt(cl, (int64_t)prob_.reflen_);
assert(prob_.usecp_ && prob_.fill_);
const bool is8_ = prob_.cper_->is8_;
int64_t row = rw, col = cl;
assert_leq(prob_.reflen_, (TRefOff)sawcell_.size());
assert_leq(col, (int64_t)prob_.cper_->hicol());
assert_geq(col, (int64_t)prob_.cper_->locol());
assert_geq(prob_.cper_->per(), 2);
size_t xmod = col & prob_.cper_->lomask();
size_t ymod = row & prob_.cper_->lomask();
size_t xdiv = col >> prob_.cper_->perpow2();
size_t ydiv = row >> prob_.cper_->perpow2();
size_t sq_ncol = xmod+1, sq_nrow = ymod+1;
sq_.resize(sq_ncol * sq_nrow);
bool upper = ydiv == 0;
bool left = xdiv == 0;
const TAlScore sc_rdo = prob_.sc_->readGapOpen();
const TAlScore sc_rde = prob_.sc_->readGapExtend();
const TAlScore sc_rfo = prob_.sc_->refGapOpen();
const TAlScore sc_rfe = prob_.sc_->refGapExtend();
const bool local = !prob_.sc_->monotone;
const CpQuad *qup = NULL;
const __m128i *qlf = NULL;
size_t per = prob_.cper_->per_;
ASSERT_ONLY(size_t nrow = prob_.cper_->nrow());
size_t ncol = prob_.cper_->ncol();
assert_eq(prob_.qrylen_, nrow);
assert_eq(prob_.reflen_, (TRefOff)ncol);
size_t niter = prob_.cper_->niter_;
if(!upper) {
qup = prob_.cper_->qrows_.ptr() + (ncol * (ydiv-1)) + xdiv * per;
}
if(!left) {
// Set up the column pointers to point to the first __m128i word in the
// relevant column
size_t off = (niter << 2) * (xdiv-1);
qlf = prob_.cper_->qcols_.ptr() + off;
}
size_t xedge = xdiv * per; // absolute offset of leftmost cell in square
size_t yedge = ydiv * per; // absolute offset of topmost cell in square
size_t xi = xedge, yi = yedge; // iterators for columns, rows
size_t ii = 0; // iterator into packed square
// Iterate over rows, then over columns
size_t m128mod = yi % prob_.cper_->niter_;
size_t m128div = yi / prob_.cper_->niter_;
int16_t sc_h_dg_lastrow = MIN_I16;
for(size_t i = 0; i <= ymod; i++, yi++) {
assert_lt(yi, nrow);
xi = xedge;
// Handling for first column is done outside the loop
size_t fromend = prob_.qrylen_ - yi - 1;
bool allowGaps = fromend >= (size_t)prob_.sc_->gapbar && yi >= (size_t)prob_.sc_->gapbar;
// Get character, quality from read
int qc = prob_.qry_[yi], qq = prob_.qual_[yi];
assert_geq(qq, 33);
int16_t sc_h_lf_last = MIN_I16;
int16_t sc_e_lf_last = MIN_I16;
for(size_t j = 0; j <= xmod; j++, xi++) {
assert_lt(xi, ncol);
// Get character from reference
int rc = prob_.ref_[xi];
assert_range(0, 16, rc);
int16_t sc_diag = prob_.sc_->score(qc, rc, qq - 33);
int16_t sc_h_up = MIN_I16, sc_f_up = MIN_I16,
sc_h_lf = MIN_I16, sc_e_lf = MIN_I16,
sc_h_dg = MIN_I16;
int16_t sc_h_up_c = MIN_I16, sc_f_up_c = MIN_I16,
sc_h_lf_c = MIN_I16, sc_e_lf_c = MIN_I16,
sc_h_dg_c = MIN_I16;
if(yi == 0) {
// If I'm in the first row or column, set it to 0
sc_h_dg = 0;
} else if(xi == 0) {
// Do nothing; leave it at min
if(local) {
sc_h_dg = 0;
}
} else if(i == 0 && j == 0) {
// Otherwise, if I'm in the upper-left square corner, I can get
// it from the checkpoint
sc_h_dg = qup[-1].sc[0];
} else if(j == 0) {
// Otherwise, if I'm in the leftmost cell of this row, I can
// get it from sc_h_lf in first column of previous row
sc_h_dg = sc_h_dg_lastrow;
} else {
// Otherwise, I can get it from qup
sc_h_dg = qup[j-1].sc[0];
}
if(yi > 0 && xi > 0) DEBUG_CHECK(sc_h_dg, yi-1, xi-1, 2);
// If we're in the leftmost column, calculate sc_h_lf regardless of
// allowGaps.
if(j == 0 && xi > 0) {
// Get values for left neighbors from the checkpoint
if(is8_) {
size_t vecoff = (m128mod << 6) + m128div;
sc_e_lf = ((uint8_t*)(qlf + 0))[vecoff];
sc_h_lf = ((uint8_t*)(qlf + 2))[vecoff];
if(local) {
// No adjustment
} else {
if(sc_h_lf == 0) sc_h_lf = MIN_I16;
else sc_h_lf -= 0xff;
if(sc_e_lf == 0) sc_e_lf = MIN_I16;
else sc_e_lf -= 0xff;
}
} else {
size_t vecoff = (m128mod << 5) + m128div;
sc_e_lf = ((int16_t*)(qlf + 0))[vecoff];
sc_h_lf = ((int16_t*)(qlf + 2))[vecoff];
if(local) {
sc_h_lf += 0x8000; assert_geq(sc_h_lf, 0);
sc_e_lf += 0x8000; assert_geq(sc_e_lf, 0);
} else {
if(sc_h_lf != MIN_I16) sc_h_lf -= 0x7fff;
if(sc_e_lf != MIN_I16) sc_e_lf -= 0x7fff;
}
}
DEBUG_CHECK(sc_e_lf, yi, xi-1, 0);
DEBUG_CHECK(sc_h_lf, yi, xi-1, 2);
sc_h_dg_lastrow = sc_h_lf;
}
if(allowGaps) {
if(j == 0 /* at left edge */ && xi > 0 /* not extreme */) {
sc_h_lf_c = sc_h_lf;
sc_e_lf_c = sc_e_lf;
if(sc_h_lf_c != MIN_I16) sc_h_lf_c -= sc_rdo;
if(sc_e_lf_c != MIN_I16) sc_e_lf_c -= sc_rde;
assert_leq(sc_h_lf_c, prob_.cper_->perf_);
assert_leq(sc_e_lf_c, prob_.cper_->perf_);
} else if(xi > 0) {
// Get values for left neighbors from the previous iteration
if(sc_h_lf_last != MIN_I16) {
sc_h_lf = sc_h_lf_last;
sc_h_lf_c = sc_h_lf - sc_rdo;
}
if(sc_e_lf_last != MIN_I16) {
sc_e_lf = sc_e_lf_last;
sc_e_lf_c = sc_e_lf - sc_rde;
}
}
if(yi > 0 /* not extreme */) {
// Get column values
assert(qup != NULL);
assert(local || qup[j].sc[2] < 0);
if(qup[j].sc[0] > MIN_I16) {
DEBUG_CHECK(qup[j].sc[0], yi-1, xi, 2);
sc_h_up = qup[j].sc[0];
sc_h_up_c = sc_h_up - sc_rfo;
}
if(qup[j].sc[2] > MIN_I16) {
DEBUG_CHECK(qup[j].sc[2], yi-1, xi, 1);
sc_f_up = qup[j].sc[2];
sc_f_up_c = sc_f_up - sc_rfe;
}
}
if(local) {
sc_h_up_c = max(sc_h_up_c, 0);
sc_f_up_c = max(sc_f_up_c, 0);
sc_h_lf_c = max(sc_h_lf_c, 0);
sc_e_lf_c = max(sc_e_lf_c, 0);
}
}
if(sc_h_dg > MIN_I16) {
sc_h_dg_c = sc_h_dg + sc_diag;
}
if(local) sc_h_dg_c = max(sc_h_dg_c, 0);
int mask = 0;
// Calculate best ways into H, E, F cells starting with H.
// Mask bits:
// H: 1=diag, 2=hhoriz, 4=ehoriz, 8=hvert, 16=fvert
// E: 32=hhoriz, 64=ehoriz
// F: 128=hvert, 256=fvert
int16_t sc_best = sc_h_dg_c;
if(sc_h_dg_c > MIN_I64) {
mask = 1;
}
if(xi > 0 && sc_h_lf_c >= sc_best && sc_h_lf_c > MIN_I64) {
if(sc_h_lf_c > sc_best) mask = 0;
mask |= 2;
sc_best = sc_h_lf_c;
}
if(xi > 0 && sc_e_lf_c >= sc_best && sc_e_lf_c > MIN_I64) {
if(sc_e_lf_c > sc_best) mask = 0;
mask |= 4;
sc_best = sc_e_lf_c;
}
if(yi > 0 && sc_h_up_c >= sc_best && sc_h_up_c > MIN_I64) {
if(sc_h_up_c > sc_best) mask = 0;
mask |= 8;
sc_best = sc_h_up_c;
}
if(yi > 0 && sc_f_up_c >= sc_best && sc_f_up_c > MIN_I64) {
if(sc_f_up_c > sc_best) mask = 0;
mask |= 16;
sc_best = sc_f_up_c;
}
// Calculate best way into E cell
int16_t sc_e_best = sc_h_lf_c;
if(xi > 0) {
if(sc_h_lf_c >= sc_e_lf_c && sc_h_lf_c > MIN_I64) {
if(sc_h_lf_c == sc_e_lf_c) {
mask |= 64;
}
mask |= 32;
} else if(sc_e_lf_c > MIN_I64) {
sc_e_best = sc_e_lf_c;
mask |= 64;
}
}
if(sc_e_best > sc_best) {
sc_best = sc_e_best;
mask &= ~31; // don't go diagonal
}
// Calculate best way into F cell
int16_t sc_f_best = sc_h_up_c;
if(yi > 0) {
if(sc_h_up_c >= sc_f_up_c && sc_h_up_c > MIN_I64) {
if(sc_h_up_c == sc_f_up_c) {
mask |= 256;
}
mask |= 128;
} else if(sc_f_up_c > MIN_I64) {
sc_f_best = sc_f_up_c;
mask |= 256;
}
}
if(sc_f_best > sc_best) {
sc_best = sc_f_best;
mask &= ~127; // don't go horizontal or diagonal
}
// Install results in cur
assert( local || sc_best <= 0);
sq_[ii+j].sc[0] = sc_best;
assert( local || sc_e_best < 0);
assert( local || sc_f_best < 0);
assert(!local || sc_e_best >= 0 || sc_e_best == MIN_I16);
assert(!local || sc_f_best >= 0 || sc_f_best == MIN_I16);
sq_[ii+j].sc[1] = sc_e_best;
sq_[ii+j].sc[2] = sc_f_best;
sq_[ii+j].sc[3] = mask;
DEBUG_CHECK(sq_[ii+j].sc[0], yi, xi, 2); // H
DEBUG_CHECK(sq_[ii+j].sc[1], yi, xi, 0); // E
DEBUG_CHECK(sq_[ii+j].sc[2], yi, xi, 1); // F
// Update sc_h_lf_last, sc_e_lf_last
sc_h_lf_last = sc_best;
sc_e_lf_last = sc_e_best;
}
// Update m128mod, m128div
m128mod++;
if(m128mod == prob_.cper_->niter_) {
m128mod = 0;
m128div++;
}
// update qup
ii += sq_ncol;
// dimensions of sq_
qup = sq_.ptr() + sq_ncol * i;
}
assert_eq(targ, sq_[ymod * sq_ncol + xmod].sc[hef]);
//
// Now backtrack through the square. Abort as soon as we enter a cell
// that was visited by a previous backtrace.
//
int64_t rowc = row, colc = col;
size_t curid;
int hefc = hef;
if(bs_.empty()) {
// Start an initial branch
CHECK_ROW_COL(rowc, colc);
curid = bs_.alloc();
assert_eq(0, curid);
Edit e;
bs_[curid].init(
prob_,
0, // parent ID
0, // penalty
0, // score_en
rowc, // row
colc, // col
e, // edit
0, // hef
true, // root?
false); // don't try to extend with exact matches
bs_[curid].len_ = 0;
} else {
curid = bs_.size()-1;
}
size_t ymodTimesNcol = ymod * sq_ncol;
while(true) {
// What depth are we?
assert_eq(ymodTimesNcol, ymod * sq_ncol);
CpQuad * cur = sq_.ptr() + ymodTimesNcol + xmod;
int mask = cur->sc[3];
assert_gt(mask, 0);
int sel = -1;
// Select what type of move to make, which depends on whether we're
// currently in H, E, F:
if(hefc == 0) {
if( (mask & 1) != 0) {
// diagonal
sel = 0;
} else if((mask & 8) != 0) {
// up to H
sel = 3;
} else if((mask & 16) != 0) {
// up to F
sel = 4;
} else if((mask & 2) != 0) {
// left to H
sel = 1;
} else if((mask & 4) != 0) {
// left to E
sel = 2;
}
} else if(hefc == 1) {
if( (mask & 32) != 0) {
// left to H
sel = 5;
} else if((mask & 64) != 0) {
// left to E
sel = 6;
}
} else {
assert_eq(2, hefc);
if( (mask & 128) != 0) {
// up to H
sel = 7;
} else if((mask & 256) != 0) {
// up to F
sel = 8;
}
}
assert_geq(sel, 0);
// Get character from read
int qc = prob_.qry_[rowc], qq = prob_.qual_[rowc];
// Get character from reference
int rc = prob_.ref_[colc];
assert_range(0, 16, rc);
bool xexit = false, yexit = false;
// Now that we know what type of move to make, make it, updating our
// row and column and updating the branch.
if(sel == 0) {
assert_geq(rowc, 0);
assert_geq(colc, 0);
TAlScore scd = prob_.sc_->score(qc, rc, qq - 33);
if((rc & (1 << qc)) == 0) {
// Mismatch
size_t id = curid;
// Check if the previous branch was the initial (bottommost)
// branch with no matches. If so, the mismatch should be added
// to the initial branch, instead of starting a new branch.
bool empty = (bs_[curid].len_ == 0 && curid == 0);
if(!empty) {
id = bs_.alloc();
}
Edit e((int)rowc, mask2dna[rc], "ACGTN"[qc], EDIT_TYPE_MM);
assert_lt(scd, 0);
TAlScore score_en = bs_[curid].score_st_ + scd;
bs_[id].init(
prob_,
curid, // parent ID
-scd, // penalty
score_en, // score_en
rowc, // row
colc, // col
e, // edit
hefc, // hef
empty, // root?
false); // don't try to extend with exact matches
curid = id;
//assert(!local || bs_[curid].score_st_ >= 0);
} else {
// Match
bs_[curid].score_st_ += prob_.sc_->match();
bs_[curid].len_++;
assert_leq((int64_t)bs_[curid].len_, bs_[curid].row_ + 1);
}
if(xmod == 0) xexit = true;
if(ymod == 0) yexit = true;
rowc--; ymod--; ymodTimesNcol -= sq_ncol;
colc--; xmod--;
assert(local || bs_[curid].score_st_ >= targ_final);
hefc = 0;
} else if((sel >= 1 && sel <= 2) || (sel >= 5 && sel <= 6)) {
assert_gt(colc, 0);
// Read gap
size_t id = bs_.alloc();
Edit e((int)rowc+1, mask2dna[rc], '-', EDIT_TYPE_READ_GAP);
TAlScore gapp = prob_.sc_->readGapOpen();
if(bs_[curid].len_ == 0 && bs_[curid].e_.inited() && bs_[curid].e_.isReadGap()) {
gapp = prob_.sc_->readGapExtend();
}
//assert(!local || bs_[curid].score_st_ >= gapp);
TAlScore score_en = bs_[curid].score_st_ - gapp;
bs_[id].init(
prob_,
curid, // parent ID
gapp, // penalty
score_en, // score_en
rowc, // row
colc-1, // col
e, // edit
hefc, // hef
false, // root?
false); // don't try to extend with exact matches
if(xmod == 0) xexit = true;
colc--; xmod--;
curid = id;
assert( local || bs_[curid].score_st_ >= targ_final);
//assert(!local || bs_[curid].score_st_ >= 0);
if(sel == 1 || sel == 5) {
hefc = 0;
} else {
hefc = 1;
}
} else {
assert_gt(rowc, 0);
// Reference gap
size_t id = bs_.alloc();
Edit e((int)rowc, '-', "ACGTN"[qc], EDIT_TYPE_REF_GAP);
TAlScore gapp = prob_.sc_->refGapOpen();
if(bs_[curid].len_ == 0 && bs_[curid].e_.inited() && bs_[curid].e_.isRefGap()) {
gapp = prob_.sc_->refGapExtend();
}
//assert(!local || bs_[curid].score_st_ >= gapp);
TAlScore score_en = bs_[curid].score_st_ - gapp;
bs_[id].init(
prob_,
curid, // parent ID
gapp, // penalty
score_en, // score_en
rowc-1, // row
colc, // col
e, // edit
hefc, // hef
false, // root?
false); // don't try to extend with exact matches
if(ymod == 0) yexit = true;
rowc--; ymod--; ymodTimesNcol -= sq_ncol;
curid = id;
assert( local || bs_[curid].score_st_ >= targ_final);
//assert(!local || bs_[curid].score_st_ >= 0);
if(sel == 3 || sel == 7) {
hefc = 0;
} else {
hefc = 2;
}
}
CHECK_ROW_COL(rowc, colc);
CpQuad * cur_new = NULL;
if(!xexit && !yexit) {
cur_new = sq_.ptr() + ymodTimesNcol + xmod;
}
// Check whether we made it to the top row or to a cell with score 0
if(colc < 0 || rowc < 0 ||
(cur_new != NULL && local && cur_new->sc[0] == 0))
{
done = true;
assert(bs_[curid].isSolution(prob_));
addSolution(curid);
#ifndef NDEBUG
// A check to see if any two adjacent branches in the backtrace
// overlap. If they do, the whole alignment will be filtered out
// in trySolution(...)
size_t cur = curid;
if(!bs_[cur].root_) {
size_t next = bs_[cur].parentId_;
while(!bs_[next].root_) {
assert_neq(cur, next);
if(bs_[next].len_ != 0 || bs_[cur].len_ == 0) {
assert(!bs_[cur].overlap(prob_, bs_[next]));
}
cur = next;
next = bs_[cur].parentId_;
}
}
#endif
return;
}
assert(!xexit || hefc == 0 || hefc == 1);
assert(!yexit || hefc == 0 || hefc == 2);
if(xexit || yexit) {
//assert(rowc < 0 || colc < 0 || prob_.cper_->isCheckpointed(rowc, colc));
row_new = rowc; col_new = colc;
hef_new = hefc;
done = false;
if(rowc < 0 || colc < 0) {
assert(local);
targ_new = 0;
} else {
// TODO: Don't use scoreSquare
targ_new = prob_.cper_->scoreSquare(rowc, colc, hefc);
assert(local || targ_new >= targ);
assert(local || targ_new >= targ_final);
}
if(local && targ_new == 0) {
assert_eq(0, hefc);
done = true;
assert(bs_[curid].isSolution(prob_));
addSolution(curid);
}
assert((row_new >= 0 && col_new >= 0) || done);
return;
}
}
assert(false);
}
/**
* Caller gives us score_en, row and col. We figure out score_st and len_
* by comparing characters from the strings.
*
* If this branch comes after a mismatch, (row, col) describe the cell that the
* mismatch occurs in. len_ is initially set to 1, and the next cell we test
* is the next cell up and to the left (row-1, col-1).
*
* If this branch comes after a read gap, (row, col) describe the leftmost cell
* involved in the gap. len_ is initially set to 0, and the next cell we test
* is the current cell (row, col).
*
* If this branch comes after a reference gap, (row, col) describe the upper
* cell involved in the gap. len_ is initially set to 0, and the next cell we
* test is the current cell (row, col).
*/
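/*
 * Illustrative note (added commentary, not from the original source): for a
 * mismatch recorded at (row=5, col=7), len_ starts at 1 and, when extend is
 * true, the loop below walks through (4,6), (3,5), ... for as long as the
 * read and reference characters keep matching.
 */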
void BtBranch::init(
const BtBranchProblem& prob,
size_t parentId,
TAlScore penalty,
TAlScore score_en,
int64_t row,
int64_t col,
Edit e,
int hef,
bool root,
bool extend)
{
score_en_ = score_en;
penalty_ = penalty;
score_st_ = score_en_;
row_ = row;
col_ = col;
parentId_ = parentId;
e_ = e;
root_ = root;
assert(!root_ || parentId == 0);
assert_lt(row, (int64_t)prob.qrylen_);
assert_lt(col, (int64_t)prob.reflen_);
// First match to check is diagonally above and to the left of the cell
// where the edit occurs
int64_t rowc = row;
int64_t colc = col;
len_ = 0;
if(e.inited() && e.isMismatch()) {
rowc--; colc--;
len_ = 1;
}
int64_t match = prob.sc_->match();
bool cp = prob.usecp_;
size_t iters = 0;
curtailed_ = false;
if(extend) {
while(rowc >= 0 && colc >= 0) {
int rfm = prob.ref_[colc];
assert_range(0, 16, rfm);
int rdc = prob.qry_[rowc];
bool matches = (rfm & (1 << rdc)) != 0;
if(!matches) {
// What's the mismatch penalty?
break;
}
// Get score from checkpointer
score_st_ += match;
if(cp && rowc - 1 >= 0 && colc - 1 >= 0 &&
prob.cper_->isCheckpointed(rowc - 1, colc - 1))
{
// Possibly prune
int16_t cpsc;
cpsc = prob.cper_->scoreTriangle(rowc - 1, colc - 1, hef);
if(cpsc + score_st_ < prob.targ_) {
curtailed_ = true;
break;
}
}
iters++;
rowc--; colc--;
}
}
assert_geq(rowc, -1);
assert_geq(colc, -1);
len_ = (int64_t)row - rowc;
assert_leq((int64_t)len_, row_+1);
assert_leq((int64_t)len_, col_+1);
assert_leq((int64_t)score_st_, (int64_t)prob.qrylen_ * match);
}
/**
* Given a potential branch to add to the queue, see if we can follow the
* branch a little further first. If it's still valid, or if we reach a
* choice between valid outgoing paths, go ahead and add it to the queue.
*/
void BtBranchTracer::examineBranch(
int64_t row,
int64_t col,
const Edit& e,
TAlScore pen, // penalty associated with edit
TAlScore sc,
size_t parentId)
{
size_t id = bs_.alloc();
bs_[id].init(prob_, parentId, pen, sc, row, col, e, 0, false, true);
if(bs_[id].isSolution(prob_)) {
assert(bs_[id].isValid(prob_));
addSolution(id);
} else {
// Check if this branch is legit
if(bs_[id].isValid(prob_)) {
add(id);
} else {
bs_.pop();
}
}
}
/**
* Take all possible ways of leaving the given branch and add them to the
* branch queue.
*/
void BtBranchTracer::addOffshoots(size_t bid) {
BtBranch& b = bs_[bid];
TAlScore sc = b.score_en_;
int64_t match = prob_.sc_->match();
int64_t scoreFloor = prob_.sc_->monotone ? MIN_I64 : 0;
bool cp = prob_.usecp_; // Are there are any checkpoints?
ASSERT_ONLY(TAlScore perfectScore = prob_.sc_->perfectScore(prob_.qrylen_));
assert_leq(prob_.targ_, perfectScore);
// For each cell in the branch
for(size_t i = 0 ; i < b.len_; i++) {
assert_leq((int64_t)i, b.row_+1);
assert_leq((int64_t)i, b.col_+1);
int64_t row = b.row_ - i, col = b.col_ - i;
int64_t bonusLeft = (row + 1) * match;
int64_t fromend = prob_.qrylen_ - row - 1;
bool allowGaps = fromend >= prob_.sc_->gapbar && row >= prob_.sc_->gapbar;
if(allowGaps && row >= 0 && col >= 0) {
if(col > 0) {
// Try a read gap - it's either an extension or an open
bool extend = b.e_.inited() && b.e_.isReadGap() && i == 0;
TAlScore rdgapPen = extend ?
prob_.sc_->readGapExtend() : prob_.sc_->readGapOpen();
bool prune = false;
assert_gt(rdgapPen, 0);
if(cp && prob_.cper_->isCheckpointed(row, col - 1)) {
// Possibly prune
int16_t cpsc = (int16_t)prob_.cper_->scoreTriangle(row, col - 1, 0);
assert_leq(cpsc, perfectScore);
assert_geq(prob_.sc_->readGapOpen(), prob_.sc_->readGapExtend());
TAlScore bonus = prob_.sc_->readGapOpen() - prob_.sc_->readGapExtend();
assert_geq(bonus, 0);
if(cpsc + bonus + sc - rdgapPen < prob_.targ_) {
prune = true;
}
}
if(prune) {
if(extend) { nrdexPrune_++; } else { nrdopPrune_++; }
} else if(sc - rdgapPen >= scoreFloor && sc - rdgapPen + bonusLeft >= prob_.targ_) {
// Yes, we can introduce a read gap here
Edit e((int)row + 1, mask2dna[(int)prob_.ref_[col]], '-', EDIT_TYPE_READ_GAP);
assert(e.isReadGap());
examineBranch(row, col - 1, e, rdgapPen, sc - rdgapPen, bid);
if(extend) { nrdex_++; } else { nrdop_++; }
}
}
if(row > 0) {
// Try a reference gap - it's either an extension or an open
bool extend = b.e_.inited() && b.e_.isRefGap() && i == 0;
TAlScore rfgapPen = (b.e_.inited() && b.e_.isRefGap()) ?
prob_.sc_->refGapExtend() : prob_.sc_->refGapOpen();
bool prune = false;
assert_gt(rfgapPen, 0);
if(cp && prob_.cper_->isCheckpointed(row - 1, col)) {
// Possibly prune
int16_t cpsc = (int16_t)prob_.cper_->scoreTriangle(row - 1, col, 0);
assert_leq(cpsc, perfectScore);
assert_geq(prob_.sc_->refGapOpen(), prob_.sc_->refGapExtend());
TAlScore bonus = prob_.sc_->refGapOpen() - prob_.sc_->refGapExtend();
assert_geq(bonus, 0);
if(cpsc + bonus + sc - rfgapPen < prob_.targ_) {
prune = true;
}
}
if(prune) {
if(extend) { nrfexPrune_++; } else { nrfopPrune_++; }
} else if(sc - rfgapPen >= scoreFloor && sc - rfgapPen + bonusLeft >= prob_.targ_) {
// Yes, we can introduce a ref gap here
Edit e((int)row, '-', "ACGTN"[(int)prob_.qry_[row]], EDIT_TYPE_REF_GAP);
assert(e.isRefGap());
examineBranch(row - 1, col, e, rfgapPen, sc - rfgapPen, bid);
if(extend) { nrfex_++; } else { nrfop_++; }
}
}
}
// If we're at the top of the branch but not yet at the top of
// the DP table, a mismatch branch is also possible.
if(i == b.len_ && !b.curtailed_ && row >= 0 && col >= 0) {
int rfm = prob_.ref_[col];
assert_lt(row, (int64_t)prob_.qrylen_);
int rdc = prob_.qry_[row];
int rdq = prob_.qual_[row];
int scdiff = prob_.sc_->score(rdc, rfm, rdq - 33);
assert_lt(scdiff, 0); // at end of branch, so can't match
bool prune = false;
if(cp && row > 0 && col > 0 && prob_.cper_->isCheckpointed(row - 1, col - 1)) {
// Possibly prune
int16_t cpsc = prob_.cper_->scoreTriangle(row - 1, col - 1, 0);
assert_leq(cpsc, perfectScore);
assert_leq(cpsc + scdiff + sc, perfectScore);
if(cpsc + scdiff + sc < prob_.targ_) {
prune = true;
}
}
if(prune) {
nmm_++;
} else {
// Yes, we can introduce a mismatch here
if(sc + scdiff >= scoreFloor && sc + scdiff + bonusLeft >= prob_.targ_) {
Edit e((int)row, mask2dna[rfm], "ACGTN"[rdc], EDIT_TYPE_MM);
bool nmm = (mask2dna[rfm] == 'N' || rdc > 4);
assert_neq(e.chr, e.qchr);
assert_lt(scdiff, 0);
examineBranch(row - 1, col - 1, e, -scdiff, sc + scdiff, bid);
if(nmm) { nnmm_++; } else { nmm_++; }
}
}
}
sc += match;
}
}
/**
* Sort unsorted branches, merge them with master sorted list.
*/
void BtBranchTracer::flushUnsorted() {
if(unsorted_.empty()) {
return;
}
unsorted_.sort();
unsorted_.reverse();
#ifndef NDEBUG
for(size_t i = 1; i < unsorted_.size(); i++) {
assert_leq(bs_[unsorted_[i].second].score_st_, bs_[unsorted_[i-1].second].score_st_);
}
#endif
EList<size_t> *src2 = sortedSel_ ? &sorted1_ : &sorted2_;
EList<size_t> *dest = sortedSel_ ? &sorted2_ : &sorted1_;
// Merge src1 and src2 into dest
dest->clear();
size_t cur1 = 0, cur2 = cur_;
while(cur1 < unsorted_.size() || cur2 < src2->size()) {
// Take from 1 or 2 next?
bool take1 = true;
if(cur1 == unsorted_.size()) {
take1 = false;
} else if(cur2 == src2->size()) {
take1 = true;
} else {
assert_neq(unsorted_[cur1].second, (*src2)[cur2]);
take1 = bs_[unsorted_[cur1].second] < bs_[(*src2)[cur2]];
}
if(take1) {
dest->push_back(unsorted_[cur1++].second); // Take from list 1
} else {
dest->push_back((*src2)[cur2++]); // Take from list 2
}
}
assert_eq(cur1, unsorted_.size());
assert_eq(cur2, src2->size());
sortedSel_ = !sortedSel_;
cur_ = 0;
unsorted_.clear();
}
/**
* Try all the solutions accumulated so far. Solutions might be rejected
* if they, for instance, overlap a previous solution, have too many Ns,
* fail to overlap a core diagonal, etc.
*/
bool BtBranchTracer::trySolutions(
bool lookForOlap,
SwResult& res,
size_t& off,
size_t& nrej,
RandomSource& rnd,
bool& success)
{
if(solutions_.size() > 0) {
for(size_t i = 0; i < solutions_.size(); i++) {
int ret = trySolution(solutions_[i], lookForOlap, res, off, nrej, rnd);
if(ret == BT_FOUND) {
success = true;
return true; // there were solutions and one was good
}
}
solutions_.clear();
success = false;
return true; // there were solutions but none were good
}
return false; // there were no solutions to check
}
/**
* Given the id of a branch that completes a successful backtrace, turn the
* chain of branches into a full alignment and check it against the rejection
* criteria described above.
*/
int BtBranchTracer::trySolution(
size_t id,
bool lookForOlap,
SwResult& res,
size_t& off,
size_t& nrej,
RandomSource& rnd)
{
#if 0
AlnScore score;
BtBranch *br = &bs_[id];
// 'br' corresponds to the leftmost edit in a right-to-left
// chain of edits.
EList<Edit>& ned = res.alres.ned();
const BtBranch *cur = br, *prev = NULL;
size_t ns = 0, nrefns = 0;
size_t ngap = 0;
while(true) {
if(cur->e_.inited()) {
if(cur->e_.isMismatch()) {
if(cur->e_.qchr == 'N' || cur->e_.chr == 'N') {
ns++;
}
} else if(cur->e_.isGap()) {
ngap++;
}
if(cur->e_.chr == 'N') {
nrefns++;
}
ned.push_back(cur->e_);
}
if(cur->root_) {
break;
}
cur = &bs_[cur->parentId_];
}
if(ns > prob_.nceil_) {
// Alignment has too many Ns in it!
res.reset();
assert(res.alres.ned().empty());
nrej++;
return BT_REJECTED_N;
}
// Update 'seenPaths_'
cur = br;
bool rejSeen = false; // set =true if we overlap prev path
bool rejCore = true; // set =true if we don't touch core diag
while(true) {
// Consider row, col, len, then do something
int64_t row = cur->row_, col = cur->col_;
assert_lt(row, (int64_t)prob_.qrylen_);
size_t fromend = prob_.qrylen_ - row - 1;
size_t diag = fromend + col;
// Calculate the diagonal within the *trimmed* rectangle,
// i.e. the rectangle we dealt with in align, gather and
// backtrack.
int64_t diagi = col - row;
// Now adjust to the diagonal within the *untrimmed*
// rectangle by adding on the amount trimmed from the left.
diagi += prob_.rect_->triml;
assert_lt(diag, seenPaths_.size());
// Does it overlap a core diagonal?
if(diagi >= 0) {
size_t diag = (size_t)diagi;
if(diag >= prob_.rect_->corel &&
diag <= prob_.rect_->corer)
{
// Yes it does - it's OK
rejCore = false;
}
}
if(lookForOlap) {
int64_t newlo, newhi;
if(cur->len_ == 0) {
if(prev != NULL && prev->len_ > 0) {
// If there's a gap at the base of a non-0 length branch, the
// gap will appear to overlap the branch if we give it length 1.
newhi = newlo = 0;
} else {
// Read or ref gap with no matches coming off of it
newlo = row;
newhi = row + 1;
}
} else {
// Diagonal with matches
newlo = row - (cur->len_ - 1);
newhi = row + 1;
}
assert_geq(newlo, 0);
assert_geq(newhi, 0);
// Does the diagonal cover cells?
if(newhi > newlo) {
// Check whether there is any overlap with previously traversed
// cells
bool added = false;
const size_t sz = seenPaths_[diag].size();
for(size_t i = 0; i < sz; i++) {
// Does the new interval overlap this already-seen
// interval? Also of interest: does it abut this
// already-seen interval? If so, we should merge them.
size_t lo = seenPaths_[diag][i].first;
size_t hi = seenPaths_[diag][i].second;
assert_lt(lo, hi);
size_t lo_sm = newlo, hi_sm = newhi;
if(hi - lo < hi_sm - lo_sm) {
swap(lo, lo_sm);
swap(hi, hi_sm);
}
if((lo <= lo_sm && hi > lo_sm) ||
(lo < hi_sm && hi >= hi_sm))
{
// One or both of the shorter interval's end points
// are contained in the longer interval - so they
// overlap.
rejSeen = true;
// Merge them into one longer interval
seenPaths_[diag][i].first = min(lo, lo_sm);
seenPaths_[diag][i].second = max(hi, hi_sm);
#ifndef NDEBUG
for(int64_t ii = seenPaths_[diag][i].first;
ii < (int64_t)seenPaths_[diag][i].second;
ii++)
{
//cerr << "trySolution rejected (" << ii << ", " << (ii + col - row) << ")" << endl;
}
#endif
added = true;
break;
} else if(hi == lo_sm || lo == hi_sm) {
// Merge them into one longer interval
seenPaths_[diag][i].first = min(lo, lo_sm);
seenPaths_[diag][i].second = max(hi, hi_sm);
#ifndef NDEBUG
for(int64_t ii = seenPaths_[diag][i].first;
ii < (int64_t)seenPaths_[diag][i].second;
ii++)
{
//cerr << "trySolution rejected (" << ii << ", " << (ii + col - row) << ")" << endl;
}
#endif
added = true;
// Keep going in case it overlaps one of the other
// intervals
}
}
if(!added) {
seenPaths_[diag].push_back(make_pair(newlo, newhi));
}
}
}
// After the merging that may have occurred above, it's no
// longer guaranteed that all the overlapping intervals in
// the list have been merged. That's OK though. We'll
// still get correct answers to overlap queries.
if(cur->root_) {
assert_eq(0, cur->parentId_);
break;
}
prev = cur;
cur = &bs_[cur->parentId_];
} // while(cur->e_.inited())
if(rejSeen) {
res.reset();
assert(res.alres.ned().empty());
nrej++;
return BT_NOT_FOUND;
}
if(rejCore) {
res.reset();
assert(res.alres.ned().empty());
nrej++;
return BT_REJECTED_CORE_DIAG;
}
off = br->leftmostCol();
score.score_ = prob_.targ_;
score.ns_ = ns;
score.gaps_ = ngap;
res.alres.setScore(score);
res.alres.setRefNs(nrefns);
size_t trimBeg = br->uppermostRow();
size_t trimEnd = prob_.qrylen_ - prob_.row_ - 1;
assert_leq(trimBeg, prob_.qrylen_);
assert_leq(trimEnd, prob_.qrylen_);
TRefOff refoff = off + prob_.refoff_ + prob_.rect_->refl;
res.alres.setShape(
prob_.refid_, // ref id
refoff, // 0-based ref offset
prob_.treflen(), // ref length
prob_.fw_, // aligned to Watson?
prob_.qrylen_, // read length
true, // pretrim soft?
0, // pretrim 5' end
0, // pretrim 3' end
true, // alignment trim soft?
prob_.fw_ ? trimBeg : trimEnd, // alignment trim 5' end
prob_.fw_ ? trimEnd : trimBeg); // alignment trim 3' end
#endif
return BT_FOUND;
}
/**
* Get the next valid alignment given a backtrace problem. Return false
* if there is no valid solution. Use a backtracking search to find the
* solution. This can be very slow.
*/
bool BtBranchTracer::nextAlignmentBacktrace(
size_t maxiter,
SwResult& res,
size_t& off,
size_t& nrej,
size_t& niter,
RandomSource& rnd)
{
assert(!empty() || !emptySolution());
assert(prob_.inited());
// There's a subtle case where we might fail while backtracing in
// local-alignment mode. The basic fact to remember is that when we're
// backtracing from the highest-scoring cell in the table, we're guaranteed
// to be able to backtrace without ever dipping below 0. But if we're
// backtracing from a cell other than the highest-scoring cell in the
// table, we might dip below 0. Dipping below 0 implies that there's a
// shorter local alignment with a better score, in which case it's
// perfectly fair for us to abandon any path that dips below the floor, and
// this might result in the queue becoming empty before we finish.
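// Illustrative numbers (not from any particular dataset): with match = +2
// and mismatch = -6, a backtrace from a cell scoring 4 might pass through
// a point where the cumulative score is -2; the remaining suffix of that
// path scores 6 on its own, so a shorter local alignment with a better
// score exists and the dipping path can safely be abandoned.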
bool result = false;
niter = 0;
while(!empty()) {
if(trySolutions(true, res, off, nrej, rnd, result)) {
return result;
}
if(niter++ >= maxiter) {
break;
}
size_t brid = best(rnd); // put best branch in 'br'
assert(!seen_.contains(brid));
ASSERT_ONLY(seen_.insert(brid));
#if 0
BtBranch *br = &bs_[brid];
cerr << brid
<< ": targ:" << prob_.targ_
<< ", sc:" << br->score_st_
<< ", row:" << br->uppermostRow()
<< ", nmm:" << nmm_
<< ", nnmm:" << nnmm_
<< ", nrdop:" << nrdop_
<< ", nrfop:" << nrfop_
<< ", nrdex:" << nrdex_
<< ", nrfex:" << nrfex_
<< ", nrdop_pr: " << nrdopPrune_
<< ", nrfop_pr: " << nrfopPrune_
<< ", nrdex_pr: " << nrdexPrune_
<< ", nrfex_pr: " << nrfexPrune_
<< endl;
#endif
addOffshoots(brid);
}
if(trySolutions(true, res, off, nrej, rnd, result)) {
return result;
}
return false;
}
/**
* Get the next valid alignment given a backtrace problem. Return false
* if there is no valid solution. Use a triangle-fill backtrace to find
* the solution. This is usually fast (it's O(m + n)).
*/
bool BtBranchTracer::nextAlignmentFill(
size_t maxiter,
SwResult& res,
size_t& off,
size_t& nrej,
size_t& niter,
RandomSource& rnd)
{
assert(prob_.inited());
assert(!emptySolution());
bool result = false;
if(trySolutions(false, res, off, nrej, rnd, result)) {
return result;
}
return false;
}
/**
* Get the next valid alignment given the backtrace problem. Return false
* if there is no valid solution, e.g., if the search space or the
* iteration budget is exhausted.
*/
bool BtBranchTracer::nextAlignment(
size_t maxiter,
SwResult& res,
size_t& off,
size_t& nrej,
size_t& niter,
RandomSource& rnd)
{
if(prob_.fill_) {
return nextAlignmentFill(
maxiter,
res,
off,
nrej,
niter,
rnd);
} else {
return nextAlignmentBacktrace(
maxiter,
res,
off,
nrej,
niter,
rnd);
}
}
#ifdef MAIN_ALIGNER_BT
#include <iostream>
int main(int argc, char **argv) {
size_t off = 0;
RandomSource rnd(77);
BtBranchTracer tr;
Scoring sc = Scoring::base1();
SwResult res;
// Set the problem up via initRef/initBt, matching the declarations in
// aligner_bt.h. The NULL rectangle/checkpointer and the iteration cap
// are placeholders for this standalone test.
tr.initRef(
"ACGTACGT", // in: read sequence
"IIIIIIII", // in: quality sequence
8, // in: read sequence length
"ACGTACGT", // in: reference sequence
8, // in: in-rectangle reference sequence length
8, // in: total reference sequence length
0, // in: reference id
0, // in: reference offset
true, // in: orientation
NULL, // in: DP rectangle (placeholder)
NULL, // in: checkpointer (placeholder)
sc, // in: scoring scheme
0); // in: N ceiling
tr.initBt(
8, // in: alignment score
7, // in: start in this row
7, // in: start in this column
false, // in: use mini-filling?
false, // in: use checkpointing?
false, // in: triangle-shaped mini-fills?
rnd); // in: random gen, to choose among equal paths
size_t nrej = 0, niter = 0;
tr.nextAlignment(
100, // in: maximum backtrace iterations (placeholder)
res,
off,
nrej,
niter,
rnd);
}
#endif /*def MAIN_ALIGNER_BT*/
centrifuge-f39767eb57d8e175029c/aligner_bt.h 0000664 0000000 0000000 00000074322 13021605047 0020260 0 ustar 00root root 0000000 0000000 /*
* Copyright 2011, Ben Langmead
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALIGNER_BT_H_
#define ALIGNER_BT_H_
#include <utility>
#include <stdint.h>
#include "aligner_sw_common.h"
#include "aligner_result.h"
#include "scoring.h"
#include "edit.h"
#include "limit.h"
#include "dp_framer.h"
#include "sse_util.h"
/* Say we've filled in a DP matrix in a cost-only manner, not saving the scores
* for each of the cells. At the end, we obtain a list of candidate cells and
* we'd like to backtrace from them. The per-cell scores are gone, but we have
* to re-create the correct path somehow. Hopefully we can do this without
* recreating most or all of the score matrix, since this takes too much memory.
*
* Approach 1: Naively refill the matrix.
*
* Just refill the matrix, perhaps backwards starting from the backtrace cell.
* Since this involves recreating all or most of the score matrix, this is not
* a good approach.
*
* Approach 2: Naive backtracking.
*
* Conduct a search through the space of possible backtraces, rooted at the
* candidate cell. To speed things along, we can prioritize paths that have a
* high score and that align more characters from the read.
*
* The approach is simple, but it's neither fast nor memory-efficient in
* general.
*
* Approach 3: Refilling with checkpoints.
*
* Refill the matrix "backwards" starting from the candidate cell, but use
* checkpoints to ensure that only a series of relatively small triangles or
* rectangles need to be refilled. The checkpoints must include elements from
* the H, E and F matrices; not just H. After each refill, we backtrace
* through the refilled area, then discard/reuse the fill memory. I call each
* such fill/backtrace a mini-fill/backtrace.
*
* If there's only one path to be found, then this is O(m+n). But what if
* there are many? And what if we would like to avoid paths that overlap in
* one or more cells? There are two ways we can make this more efficient:
*
* 1. Remember the re-calculated E/F/H values and try to retrieve them
* 2. Keep a record of cells that have already been traversed
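*
* For (2), a minimal sketch of the per-diagonal interval test (simplified
* from what trySolution() does with seenPaths_; the names here are just
* for illustration):
*
*   // one list of (lo, hi) half-open row intervals per diagonal
*   bool overlapsSeen(const EList<std::pair<size_t, size_t> >& ivals,
*                     size_t lo, size_t hi)
*   {
*       for(size_t i = 0; i < ivals.size(); i++) {
*           // half-open intervals intersect iff each starts before the
*           // other ends
*           if(lo < ivals[i].second && hi > ivals[i].first) {
*               return true;
*           }
*       }
*       return false;
*   }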
*
* Legend:
*
* 1: Candidate cell
* 2: Final cell from first mini-fill/backtrace
* 3: Final cell from second mini-fill/backtrace (third not shown)
* +: Checkpointed cell
* *: Cell filled from first or second mini-fill/backtrace
* -: Unfilled cell
*
* ---++--------++--------++----
* --++--------++*-------++-----
* -++--(etc)-++**------++------
* ++--------+3***-----++-------
* +--------++****----++--------
* --------++*****---++--------+
* -------++******--++--------++
* ------++*******-++*-------++-
* -----++********++**------++--
* ----++********2+***-----++---
* ---++--------++****----++----
* --++--------++*****---++-----
* -++--------++*****1--++------
* ++--------++--------++-------
*
* Approach 4: Backtracking with checkpoints.
*
* Conduct a search through the space of possible backtraces, rooted at the
* candidate cell. Use "checkpoints" to prune. That is, when a backtrace
* moves through a cell with a checkpointed score, consider the score
* accumulated so far and the cell's saved score; abort if those two scores
* add to something less than a valid score. Note we're only checkpointing H
* in this case (possibly; see "subtle point"), not E or F.
*
* Subtle point: checkpoint scores are a result of moving forward through
* the matrix whereas backtracking scores result from moving backward. This
* matters becuase the two paths that meet up at a cell might have both
* factored in a gap open penalty for the same gap, in which case we will
* underestimate the overall score and prune a good path. Here are two ideas
* for how to resolve this:
*
* Idea 1: when we combine the forward and backward scores to find an overall
* score, and our backtrack procedure *just* made a horizontal or vertical
* move, add in a "bonus" equal to the gap open penalty of the appropriate
* type (read gap open for horizontal, ref gap open for vertical). This might
* overcompensate, since the bonus is added even when the best forward path
* into the cell does not continue the same gap.
*
* Idea 2: keep the E and F values for the checkpoints around, in addition to
* the H values. When it comes time to combine the score from the forward
* and backward paths, we consider the last move we made in the backward
* backtrace. If it's a read gap (horizontal move), then we calculate the
* overall score as:
*
* max(Score-backward + H-forward, Score-backward + E-forward + read-open)
*
* If it's a reference gap (vertical move), then we calculate the overall
* score as:
*
* max(Score-backward + H-forward, Score-backward + F-forward + ref-open)
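*
* A sketch of that combination rule in code (illustrative only; the
* variable names below are hypothetical, not fields of any class in this
* file):
*
*   TAlScore combined;
*   if(lastMoveWasReadGap) {          // horizontal move
*       combined = max(scBackward + hForward,
*                      scBackward + eForward + readGapOpen);
*   } else if(lastMoveWasRefGap) {    // vertical move
*       combined = max(scBackward + hForward,
*                      scBackward + fForward + refGapOpen);
*   } else {
*       combined = scBackward + hForward;
*   }
*   bool prune = (combined < target); // abort the backtrack if true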
*
* What does it mean to abort a backtrack? If we're starting a new branch
* and there is a checkpoint in the bottommost cell of the branch, and the
* overall score is less than the target, then we can simply ignore the
* branch. If the checkpoint occurs in the middle of a string of matches, we
* need to curtail the branch such that it doesn't include the checkpointed
* cell and we won't ever try to enter the checkpointed cell, e.g., on a
* mismatch.
*
* Approaches 3 and 4 seem reasonable, and could be combined. For simplicity,
* we implement only approach 4 for now.
*
* Checkpoint information is propagated from the fill process to the backtracer
* via a Checkpointer object.
*/
enum {
BT_NOT_FOUND = 1, // could not obtain the backtrace because it
// overlapped a previous solution
BT_FOUND, // obtained a valid backtrace
BT_REJECTED_N, // backtrace rejected because it had too many Ns
BT_REJECTED_CORE_DIAG // backtrace rejected because it failed to overlap a
// core diagonal
};
/**
* Parameters for a matrix of potential backtrace problems to solve.
* Encapsulates information about:
*
* The problem given a particular reference substring:
*
* - The query string (nucleotides and qualities)
* - The reference substring (incl. orientation, offset into overall sequence)
* - Checkpoints (i.e. values of matrix cells)
* - Scoring scheme and other thresholds
*
* The problem given a particular reference substring AND a particular row and
* column from which to backtrace:
*
* - The row and column
* - The target score
*/
class BtBranchProblem {
public:
/**
* Create new uninitialized problem.
*/
BtBranchProblem() { reset(); }
/**
* Initialize a new problem.
*/
void initRef(
const char *qry, // query string (along rows)
const char *qual, // query quality string (along rows)
size_t qrylen, // query string (along rows) length
const char *ref, // reference string (along columns)
TRefOff reflen, // in-rectangle reference string length
TRefOff treflen,// total reference string length
TRefId refid, // reference id
TRefOff refoff, // reference offset
bool fw, // orientation of problem
const DPRect* rect, // dynamic programming rectangle filled out
const Checkpointer* cper, // checkpointer
const Scoring *sc, // scoring scheme
size_t nceil) // max # Ns allowed in alignment
{
qry_ = qry;
qual_ = qual;
qrylen_ = qrylen;
ref_ = ref;
reflen_ = reflen;
treflen_ = treflen;
refid_ = refid;
refoff_ = refoff;
fw_ = fw;
rect_ = rect;
cper_ = cper;
sc_ = sc;
nceil_ = nceil;
}
/**
* Initialize a new problem.
*/
void initBt(
size_t row, // row
size_t col, // column
bool fill, // use a filling rather than a backtracking strategy
bool usecp, // use checkpoints to short-circuit while backtracking
TAlScore targ) // target score
{
row_ = row;
col_ = col;
targ_ = targ;
fill_ = fill;
usecp_ = usecp;
if(fill) {
assert(usecp_);
}
}
/**
* Reset to uninitialized state.
*/
void reset() {
qry_ = qual_ = ref_ = NULL;
cper_ = NULL;
rect_ = NULL;
sc_ = NULL;
qrylen_ = reflen_ = treflen_ = refid_ = refoff_ = row_ = col_ = targ_ = nceil_ = 0;
fill_ = fw_ = usecp_ = false;
}
/**
* Return true iff the BtBranchProblem has been initialized.
*/
bool inited() const {
return qry_ != NULL;
}
#ifndef NDEBUG
/**
* Sanity-check the problem.
*/
bool repOk() const {
assert_gt(qrylen_, 0);
assert_gt(reflen_, 0);
assert_gt(treflen_, 0);
assert_lt(row_, qrylen_);
assert_lt((TRefOff)col_, reflen_);
return true;
}
#endif
size_t reflen() const { return reflen_; }
size_t treflen() const { return treflen_; }
protected:
const char *qry_; // query string (along rows)
const char *qual_; // query quality string (along rows)
size_t qrylen_; // query string (along rows) length
const char *ref_; // reference string (along columns)
TRefOff reflen_; // in-rectangle reference string length
TRefOff treflen_;// total reference string length
TRefId refid_; // reference id
TRefOff refoff_; // reference offset
bool fw_; // orientation of problem
const DPRect* rect_; // dynamic programming rectangle filled out
size_t row_; // starting row
size_t col_; // starting column
TAlScore targ_; // target score
const Checkpointer *cper_; // checkpointer
bool fill_; // use mini-fills
bool usecp_; // use checkpointing?
const Scoring *sc_; // scoring scheme
size_t nceil_; // max # Ns allowed in alignment
friend class BtBranch;
friend class BtBranchQ;
friend class BtBranchTracer;
};
/**
* Encapsulates a "branch" which is a diagonal of cells (possibly of length 0)
* in the matrix where all the cells are matches. These stretches are linked
* together by edits to form a full backtrace path through the matrix. Lengths
* are measured w/r/t to the number of rows traversed by the path, so a branch
* that represents a read gap extension could have length = 0.
*
* At the end of the day, the full backtrace path is represented as a list of
* BtBranch's where each BtBranch represents a stretch of matching cells (and
* up to one mismatching cell at its bottom extreme) ending in an edit (or in
* the bottommost row, in which case the edit is uninitialized). Each
* BtBranch's row and col fields indicate the bottommost cell involved in the
* diagonal stretch of matches, and the len_ field indicates the length of the
* stretch of matches. Note that the edits themselves also correspond to
* movement through the matrix.
*
* A related issue is how we record which cells have been visited so that we
* never report a pair of paths both traversing the same (row, col) of the
* overall DP matrix. This gets a little tricky because we have to take into
* account the cells covered by *edits* in addition to the cells covered by the
* stretches of matches. For instance: imagine a mismatch. That takes up a
* cell of the DP matrix, but it may or may not be preceded by a string of
* matches. It's hard to imagine how to represent this unless we let the
* mismatch "count toward" the len_ of the branch and let (row, col) refer to
* the cell where the mismatch occurs.
*
* We need BtBranches to "live forever" so that we can make some BtBranches
* parents of others using parent pointers. For this reason, BtBranch's are
* stored in an EFactory object in the BtBranchTracer class.
*/
class BtBranch {
public:
BtBranch() { reset(); }
BtBranch(
const BtBranchProblem& prob,
size_t parentId,
TAlScore penalty,
TAlScore score_en,
int64_t row,
int64_t col,
Edit e,
int hef,
bool root,
bool extend)
{
init(prob, parentId, penalty, score_en, row, col, e, hef, root, extend);
}
/**
* Reset to uninitialized state.
*/
void reset() {
parentId_ = 0;
score_st_ = score_en_ = len_ = row_ = col_ = 0;
curtailed_ = false;
e_.reset();
}
/**
* Caller gives us score_en, row and col. We figure out score_st and len_
* by comparing characters from the strings.
*/
void init(
const BtBranchProblem& prob,
size_t parentId,
TAlScore penalty,
TAlScore score_en,
int64_t row,
int64_t col,
Edit e,
int hef,
bool root,
bool extend);
/**
* Return true iff this branch ends in a solution to the backtrace problem.
*/
bool isSolution(const BtBranchProblem& prob) const {
const bool end2end = prob.sc_->monotone;
return score_st_ == prob.targ_ && (!end2end || endsInFirstRow());
}
/**
* Return true iff this branch could potentially lead to a valid alignment.
*/
bool isValid(const BtBranchProblem& prob) const {
int64_t scoreFloor = prob.sc_->monotone ? MIN_I64 : 0;
if(score_st_ < scoreFloor) {
// Dipped below the score floor
return false;
}
if(isSolution(prob)) {
// It's a solution, so it's also valid
return true;
}
if((int64_t)len_ > row_) {
// Went all the way to the top row
//assert_leq(score_st_, prob.targ_);
return score_st_ == prob.targ_;
} else {
int64_t match = prob.sc_->match();
int64_t bonusLeft = (row_ + 1 - len_) * match;
return score_st_ + bonusLeft >= prob.targ_;
}
}
/**
* Return true iff this branch overlaps with the given branch.
*/
bool overlap(const BtBranchProblem& prob, const BtBranch& bt) const {
// Calculate this branch's diagonal
assert_lt(row_, (int64_t)prob.qrylen_);
size_t fromend = prob.qrylen_ - row_ - 1;
size_t diag = fromend + col_;
int64_t lo = 0, hi = row_ + 1;
if(len_ == 0) {
lo = row_;
} else {
lo = row_ - (len_ - 1);
}
// Calculate other branch's diagonal
assert_lt(bt.row_, (int64_t)prob.qrylen_);
size_t ofromend = prob.qrylen_ - bt.row_ - 1;
size_t odiag = ofromend + bt.col_;
if(diag != odiag) {
return false;
}
int64_t olo = 0, ohi = bt.row_ + 1;
if(bt.len_ == 0) {
olo = bt.row_;
} else {
olo = bt.row_ - (bt.len_ - 1);
}
int64_t losm = olo, hism = ohi;
if(hi - lo < ohi - olo) {
swap(lo, losm);
swap(hi, hism);
}
if((lo <= losm && hi > losm) || (lo < hism && hi >= hism)) {
return true;
}
return false;
}
/**
* Return true iff this branch is higher priority than the branch 'o'.
*/
bool operator<(const BtBranch& o) const {
// Prioritize uppermost above score
if(uppermostRow() != o.uppermostRow()) {
return uppermostRow() < o.uppermostRow();
}
if(score_st_ != o.score_st_) return score_st_ > o.score_st_;
if(row_ != o.row_) return row_ < o.row_;
if(col_ != o.col_) return col_ > o.col_;
if(parentId_ != o.parentId_) return parentId_ > o.parentId_;
assert(false);
return false;
}
/**
* Return true iff the topmost cell involved in this branch is in the top
* row.
*/
bool endsInFirstRow() const {
assert_leq((int64_t)len_, row_ + 1);
return (int64_t)len_ == row_+1;
}
/**
* Return the uppermost row covered by this branch.
*/
size_t uppermostRow() const {
assert_geq(row_ + 1, (int64_t)len_);
return row_ + 1 - (int64_t)len_;
}
/**
* Return the leftmost column covered by this branch.
*/
size_t leftmostCol() const {
assert_geq(col_ + 1, (int64_t)len_);
return col_ + 1 - (int64_t)len_;
}
#ifndef NDEBUG
/**
* Sanity-check this BtBranch.
*/
bool repOk() const {
assert(root_ || e_.inited());
assert_gt(len_, 0);
assert_geq(col_ + 1, (int64_t)len_);
assert_geq(row_ + 1, (int64_t)len_);
return true;
}
#endif
protected:
// ID of the parent branch.
size_t parentId_;
// Penalty associated with the edit at the bottom of this branch (0 if
// there is no edit)
TAlScore penalty_;
// Score at the beginning of the branch
TAlScore score_st_;
// Score at the end of the branch (taking the edit into account)
TAlScore score_en_;
// Length of the branch. That is, the total number of diagonal cells
// involved in all the matches and in the edit (if any). Should always be
// > 0.
size_t len_;
// The row of the final (bottommost) cell in the branch. This might be the
// bottommost match if the branch has no associated edit. Otherwise, it's
// the cell occupied by the edit.
int64_t row_;
// The column of the final (bottommost) cell in the branch.
int64_t col_;
// The edit at the bottom of the branch. If this is the bottommost branch
// in the alignment and it does not end in an edit, then this remains
// uninitialized.
Edit e_;
// True iff this is the bottommost branch in the alignment. We can't just
// use row_ to tell us this because local alignments don't necessarily end
// in the last row.
bool root_;
bool curtailed_; // true -> pruned at a checkpoint where we otherwise
// would have had a match
friend class BtBranchQ;
friend class BtBranchTracer;
};
/**
* Instantiate and solve best-first branch-based backtraces.
*/
class BtBranchTracer {
public:
explicit BtBranchTracer() :
prob_(), bs_(), seenPaths_(DP_CAT), sawcell_(DP_CAT), doTri_() { }
/**
* Add a branch to the queue.
*/
void add(size_t id) {
assert(!bs_[id].isSolution(prob_));
unsorted_.push_back(make_pair(bs_[id].score_st_, id));
}
/**
* Add a branch to the list of solutions.
*/
void addSolution(size_t id) {
assert(bs_[id].isSolution(prob_));
solutions_.push_back(id);
}
/**
* Given a potential branch to add to the queue, see if we can follow the
* branch a little further first. If it's still valid, or if we reach a
* choice between valid outgoing paths, go ahead and add it to the queue.
*/
void examineBranch(
int64_t row,
int64_t col,
const Edit& e,
TAlScore pen,
TAlScore sc,
size_t parentId);
/**
* Take all possible ways of leaving the given branch and add them to the
* branch queue.
*/
void addOffshoots(size_t bid);
/**
* Get the best branch and remove it from the priority queue.
*/
size_t best(RandomSource& rnd) {
assert(!empty());
flushUnsorted();
assert_gt(sortedSel_ ? sorted1_.size() : sorted2_.size(), cur_);
// Perhaps shuffle everyone who's tied for first?
size_t id = sortedSel_ ? sorted1_[cur_] : sorted2_[cur_];
cur_++;
return id;
}
/**
* Return true iff there are no branches left to try.
*/
bool empty() const {
return size() == 0;
}
/**
* Return the size, i.e. the total number of branches contained.
*/
size_t size() const {
return unsorted_.size() +
(sortedSel_ ? sorted1_.size() : sorted2_.size()) - cur_;
}
/**
* Return true iff there are no solutions left to try.
*/
bool emptySolution() const {
return sizeSolution() == 0;
}
/**
* Return the size of the solution set so far.
*/
size_t sizeSolution() const {
return solutions_.size();
}
/**
* Sort unsorted branches, merge them with master sorted list.
*/
void flushUnsorted();
#ifndef NDEBUG
/**
* Sanity-check the queue.
*/
bool repOk() const {
assert_lt(cur_, (sortedSel_ ? sorted1_.size() : sorted2_.size()));
return true;
}
#endif
/**
* Initialize the tracer with respect to a new read. This involves
* resetting all the state relating to the set of cells already visited
*/
void initRef(
const char* rd, // in: read sequence
const char* qu, // in: quality sequence
size_t rdlen, // in: read sequence length
const char* rf, // in: reference sequence
size_t rflen, // in: in-rectangle reference sequence length
TRefOff trflen, // in: total reference sequence length
TRefId refid, // in: reference id
TRefOff refoff, // in: reference offset
bool fw, // in: orientation
const DPRect *rect, // in: DP rectangle
const Checkpointer *cper, // in: checkpointer
const Scoring& sc, // in: scoring scheme
size_t nceil) // in: N ceiling
{
prob_.initRef(rd, qu, rdlen, rf, rflen, trflen, refid, refoff, fw, rect, cper, &sc, nceil);
const size_t ndiag = rflen + rdlen - 1;
seenPaths_.resize(ndiag);
for(size_t i = 0; i < ndiag; i++) {
seenPaths_[i].clear();
}
// clear each of the per-column sets
if(sawcell_.size() < rflen) {
size_t isz = sawcell_.size();
sawcell_.resize(rflen);
for(size_t i = isz; i < rflen; i++) {
sawcell_[i].setCat(DP_CAT);
}
}
for(size_t i = 0; i < rflen; i++) {
sawcell_[i].setCat(DP_CAT);
sawcell_[i].clear(); // clear the set
}
}
/**
* Initialize with a new backtrace.
*/
void initBt(
TAlScore escore, // in: alignment score
size_t row, // in: start in this row
size_t col, // in: start in this column
bool fill, // in: use mini-filling?
bool usecp, // in: use checkpointing?
bool doTri, // in: triangle-shaped mini-fills?
RandomSource& rnd) // in: random gen, to choose among equal paths
{
prob_.initBt(row, col, fill, usecp, escore);
Edit e; e.reset();
unsorted_.clear();
solutions_.clear();
sorted1_.clear();
sorted2_.clear();
cur_ = 0;
nmm_ = 0; // number of mismatches attempted
nnmm_ = 0; // number of mismatches involving N attempted
nrdop_ = 0; // number of read gap opens attempted
nrfop_ = 0; // number of ref gap opens attempted
nrdex_ = 0; // number of read gap extensions attempted
nrfex_ = 0; // number of ref gap extensions attempted
nmmPrune_ = 0; // number of mismatches attempted
nnmmPrune_ = 0; // number of mismatches involving N attempted
nrdopPrune_ = 0; // number of read gap opens attempted
nrfopPrune_ = 0; // number of ref gap opens attempted
nrdexPrune_ = 0; // number of read gap extensions attempted
nrfexPrune_ = 0; // number of ref gap extensions attempted
row_ = row;
col_ = col;
doTri_ = doTri;
bs_.clear();
if(!prob_.fill_) {
size_t id = bs_.alloc();
bs_[id].init(
prob_,
0, // parent id
0, // penalty
0, // starting score
row, // row
col, // column
e,
0,
true, // this is the root
true); // this should be extended with exact matches
if(bs_[id].isSolution(prob_)) {
addSolution(id);
} else {
add(id);
}
} else {
int64_t row = row_, col = col_;
TAlScore targsc = prob_.targ_;
int hef = 0;
bool done = false, abort = false;
size_t depth = 0;
while(!done && !abort) {
// Accumulate edits as we go. We can do this by adding
// BtBranches to the bs_ structure. Each step of the backtrace
// either involves an edit (thereby starting a new branch) or
// extends the previous branch by one more position.
//
// Note: if the BtBranches are in line, then trySolution can be
// used to populate the SwResult and check for various
// situations where we might reject the alignment (i.e. due to
// a cell having been visited previously).
if(doTri_) {
triangleFill(
row, // row of cell to backtrace from
col, // column of cell to backtrace from
hef, // cell to bt from: H (0), E (1), or F (2)
targsc, // score of cell to backtrace from
prob_.targ_, // score of alignment we're looking for
rnd, // pseudo-random generator
row, // out: row we ended up in after bt
col, // out: column we ended up in after bt
hef, // out: H/E/F after backtrace
targsc, // out: score up to cell we ended up in
done, // out: finished tracing out an alignment?
abort); // out: aborted b/c cell was seen before?
} else {
squareFill(
row, // row of cell to backtrace from
col, // column of cell to backtrace from
hef, // cell to bt from: H (0), E (1), or F (2)
targsc, // score of cell to backtrace from
prob_.targ_, // score of alignment we're looking for
rnd, // pseudo-random generator
row, // out: row we ended up in after bt
col, // out: column we ended up in after bt
hef, // out: H/E/F after backtrace
targsc, // out: score up to cell we ended up in
done, // out: finished tracing out an alignment?
abort); // out: aborted b/c cell was seen before?
}
if(depth >= ndep_.size()) {
ndep_.resize(depth+1);
ndep_[depth] = 1;
} else {
ndep_[depth]++;
}
depth++;
assert((row >= 0 && col >= 0) || done);
}
}
ASSERT_ONLY(seen_.clear());
}
/**
* Get the next valid alignment given the backtrace problem. Return false
* if there is no valid solution, e.g., if the search space or the
* iteration budget is exhausted.
*/
bool nextAlignment(
size_t maxiter,
SwResult& res,
size_t& off,
size_t& nrej,
size_t& niter,
RandomSource& rnd);
/**
* Return true iff this tracer has been initialized
*/
bool inited() const {
return prob_.inited();
}
/**
* Return true iff the mini-fills are triangle-shaped.
*/
bool doTri() const { return doTri_; }
/**
* Fill in a triangle of the DP table and backtrace from the given cell to
* a cell in the previous checkpoint, or to the terminal cell.
*/
void triangleFill(
int64_t rw, // row of cell to backtrace from
int64_t cl, // column of cell to backtrace from
int hef, // cell to backtrace from is H (0), E (1), or F (2)
TAlScore targ, // score of cell to backtrace from
TAlScore targ_final, // score of alignment we're looking for
RandomSource& rnd, // pseudo-random generator
int64_t& row_new, // out: row we ended up in after backtrace
int64_t& col_new, // out: column we ended up in after backtrace
int& hef_new, // out: H/E/F after backtrace
TAlScore& targ_new, // out: score up to cell we ended up in
bool& done, // out: finished tracing out an alignment?
bool& abort); // out: aborted b/c cell was seen before?
/**
* Fill in a square of the DP table and backtrace from the given cell to
* a cell in the previous checkpoint, or to the terminal cell.
*/
void squareFill(
int64_t rw, // row of cell to backtrace from
int64_t cl, // column of cell to backtrace from
int hef, // cell to backtrace from is H (0), E (1), or F (2)
TAlScore targ, // score of cell to backtrace from
TAlScore targ_final, // score of alignment we're looking for
RandomSource& rnd, // pseudo-random generator
int64_t& row_new, // out: row we ended up in after backtrace
int64_t& col_new, // out: column we ended up in after backtrace
int& hef_new, // out: H/E/F after backtrace
TAlScore& targ_new, // out: score up to cell we ended up in
bool& done, // out: finished tracing out an alignment?
bool& abort); // out: aborted b/c cell was seen before?
protected:
/**
* Get the next valid alignment given a backtrace problem. Return false
* if there is no valid solution. Use a backtracking search to find the
* solution. This can be very slow.
*/
bool nextAlignmentBacktrace(
size_t maxiter,
SwResult& res,
size_t& off,
size_t& nrej,
size_t& niter,
RandomSource& rnd);
/**
* Get the next valid alignment given a backtrace problem. Return false
* if there is no valid solution. Use a triangle-fill backtrace to find
* the solution. This is usually fast (it's O(m + n)).
*/
bool nextAlignmentFill(
size_t maxiter,
SwResult& res,
size_t& off,
size_t& nrej,
size_t& niter,
RandomSource& rnd);
/**
* Try all the solutions accumulated so far. Solutions might be rejected
* if they, for instance, overlap a previous solution, have too many Ns,
* fail to overlap a core diagonal, etc.
*/
bool trySolutions(
bool lookForOlap,
SwResult& res,
size_t& off,
size_t& nrej,
RandomSource& rnd,
bool& success);
/**
* See if a given solution branch works as a solution (i.e. doesn't overlap
* another one, have too many Ns, fail to overlap a core diagonal, etc.)
*/
int trySolution(
size_t id,
bool lookForOlap,
SwResult& res,
size_t& off,
size_t& nrej,
RandomSource& rnd);
BtBranchProblem prob_; // problem configuration
EFactory<BtBranch> bs_; // global BtBranch factory
// already reported alignments going through these diagonal segments
ELList<std::pair<size_t, size_t> > seenPaths_;
ELSet<size_t> sawcell_; // cells already backtraced through
EList<std::pair<TAlScore, size_t> > unsorted_; // unsorted list of as-yet-unflushed BtBranches
EList<size_t> sorted1_; // list of BtBranch ids, sorted by score
EList<size_t> sorted2_; // list of BtBranch ids, sorted by score
EList<size_t> solutions_; // list of solution branch ids
bool sortedSel_; // true -> 1, false -> 2
size_t cur_; // cursor into sorted list to start from
size_t nmm_; // number of mismatches attempted
size_t nnmm_; // number of mismatches involving N attempted
size_t nrdop_; // number of read gap opens attempted
size_t nrfop_; // number of ref gap opens attempted
size_t nrdex_; // number of read gap extensions attempted
size_t nrfex_; // number of ref gap extensions attempted
size_t nmmPrune_; //
size_t nnmmPrune_; //
size_t nrdopPrune_; //
size_t nrfopPrune_; //
size_t nrdexPrune_; //
size_t nrfexPrune_; //
size_t row_; // row
size_t col_; // column
bool doTri_; // true -> fill in triangles; false -> squares
EList<CpQuad> sq_; // square to fill when doing mini-fills
ELList<CpQuad> tri_; // triangle to fill when doing mini-fills
EList<size_t> ndep_; // # triangles mini-filled at various depths
#ifndef NDEBUG
ESet<size_t> seen_; // seen branch ids; should never see same twice
#endif
};
#endif /*ndef ALIGNER_BT_H_*/
centrifuge-f39767eb57d8e175029c/aligner_cache.cpp 0000664 0000000 0000000 00000010213 13021605047 0021236 0 ustar 00root root 0000000 0000000 /*
* Copyright 2011, Ben Langmead
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#include "aligner_cache.h"
#include "tinythread.h"
#ifdef ALIGNER_CACHE_MAIN
#include <iostream>
#include <getopt.h>
#include <string.h>
#include "random_source.h"
using namespace std;
enum {
ARG_TESTS = 256
};
static const char *short_opts = "vCt";
static struct option long_opts[] = {
{(char*)"verbose", no_argument, 0, 'v'},
{(char*)"tests", no_argument, 0, ARG_TESTS},
{0, 0, 0, 0} // terminator required by getopt_long
};
static void printUsage(ostream& os) {
os << "Usage: sawhi-cache [options]*" << endl;
os << "Options:" << endl;
os << " --tests run unit tests" << endl;
os << " -v/--verbose talkative mode" << endl;
}
int gVerbose = 0;
static void add(
RedBlack<QKey, QVal<uint32_t> >& t,
Pool& p,
const char *dna)
{
QKey qk;
ASSERT_ONLY(BTDnaString tmp);
qk.init(BTDnaString(dna, true) ASSERT_ONLY(, tmp));
t.add(p, qk, NULL);
}
/**
* Small tests for the AlignmentCache.
*/
static void aligner_cache_tests() {
RedBlack<QKey, QVal<uint32_t> > rb(1024);
Pool p(64 * 1024, 1024, CA_CAT);
// Small test
add(rb, p, "ACGTCGATCGT");
add(rb, p, "ACATCGATCGT");
add(rb, p, "ACGACGATCGT");
add(rb, p, "ACGTAGATCGT");
add(rb, p, "ACGTCAATCGT");
add(rb, p, "ACGTCGCTCGT");
add(rb, p, "ACGTCGAACGT");
assert_eq(7, rb.size());
rb.clear();
p.clear();
// Another small test
add(rb, p, "ACGTCGATCGT");
add(rb, p, "CCGTCGATCGT");
add(rb, p, "TCGTCGATCGT");
add(rb, p, "GCGTCGATCGT");
add(rb, p, "AAGTCGATCGT");
assert_eq(5, rb.size());
rb.clear();
p.clear();
// Regression test (attempt to make it smaller)
add(rb, p, "CCTA");
add(rb, p, "AGAA");
add(rb, p, "TCTA");
add(rb, p, "GATC");
add(rb, p, "CTGC");
add(rb, p, "TTGC");
add(rb, p, "GCCG");
add(rb, p, "GGAT");
rb.clear();
p.clear();
// Regression test
add(rb, p, "CCTA");
add(rb, p, "AGAA");
add(rb, p, "TCTA");
add(rb, p, "GATC");
add(rb, p, "CTGC");
add(rb, p, "CATC");
add(rb, p, "CAAA");
add(rb, p, "CTAT");
add(rb, p, "CTCA");
add(rb, p, "TTGC");
add(rb, p, "GCCG");
add(rb, p, "GGAT");
assert_eq(12, rb.size());
rb.clear();
p.clear();
// Larger random test
EList<BTDnaString> strs;
char buf[5];
for(int i = 0; i < 4; i++) {
for(int j = 0; j < 4; j++) {
for(int k = 0; k < 4; k++) {
for(int m = 0; m < 4; m++) {
buf[0] = "ACGT"[i];
buf[1] = "ACGT"[j];
buf[2] = "ACGT"[k];
buf[3] = "ACGT"[m];
buf[4] = '\0';
strs.push_back(BTDnaString(buf, true));
}
}
}
}
// Add all of the 4-mers in several different random orders
RandomSource rand;
for(uint32_t runs = 0; runs < 100; runs++) {
rb.clear();
p.clear();
assert_eq(0, rb.size());
rand.init(runs);
EList<bool> used;
used.resize(256);
for(int i = 0; i < 256; i++) used[i] = false;
for(int i = 0; i < 256; i++) {
int r = rand.nextU32() % (256-i);
int unused = 0;
bool added = false;
for(int j = 0; j < 256; j++) {
if(!used[j] && unused == r) {
used[j] = true;
QKey qk;
ASSERT_ONLY(BTDnaString tmp);
qk.init(strs[j] ASSERT_ONLY(, tmp));
rb.add(p, qk, NULL);
added = true;
break;
}
if(!used[j]) unused++;
}
assert(added);
}
}
}
/**
* A way of feeding simple tests to the seed alignment infrastructure.
*/
int main(int argc, char **argv) {
int option_index = 0;
int next_option;
do {
next_option = getopt_long(argc, argv, short_opts, long_opts, &option_index);
switch (next_option) {
case 'v': gVerbose = true; break;
case ARG_TESTS: aligner_cache_tests(); return 0;
case -1: break;
default: {
cerr << "Unknown option: " << (char)next_option << endl;
printUsage(cerr);
exit(1);
}
}
} while(next_option != -1);
}
#endif
centrifuge-f39767eb57d8e175029c/aligner_cache.h 0000664 0000000 0000000 00000064470 13021605047 0020721 0 ustar 00root root 0000000 0000000 /*
* Copyright 2011, Ben Langmead
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALIGNER_CACHE_H_
#define ALIGNER_CACHE_H_
/**
* CACHEING
*
* By caching the results of some alignment sub-problems, we hope to
* enable a "fast path" for read alignment whereby answers are mostly
* looked up rather than calculated from scratch. This is particularly
* effective when the input is sorted or otherwise grouped in a way
* that brings together reads with (at least some) seed sequences in
* common.
*
* But the cache is also where results are held, regardless of whether
* the results are maintained & re-used across reads.
*
* The cache consists of two linked portions:
*
* 1. A multimap from seed strings (i.e. read substrings) to reference strings
* that are within some edit distance (roughly speaking). This is the "seed
* multimap".
*
* Key: Read substring (2-bit-per-base encoded + length)
* Value: Set of reference substrings (i.e. keys into the suffix
* array multimap).
*
* 2. A multimap from reference strings to the corresponding elements of the
* suffix array. Elements are filled in with reference-offset info as it's
* calculated. This is the "suffix array multimap"
*
* Key: Reference substring (2-bit-per-base encoded + length)
* Value: (a) top from BWT, (b) length of range, (c) offset of first
* range element in the salist
*
* For both multimaps, we use a combo Red-Black tree and EList. The payload in
* the Red-Black tree nodes points to a range in the EList.
*/
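/*
 * A minimal usage sketch of the two-level lookup described above, under
 * assumed names ('cache' is an AlignmentCache<uint32_t> and 'qv' a
 * QVal<uint32_t> previously recorded for a seed; neither is defined
 * here): expanding a cached QVal into concrete SA ranges touches both
 * multimaps via queryQval():
 *
 *   EList<SATuple<uint32_t>, 16> satups; // out: one per reference substring
 *   uint32_t nrange = 0, nelt = 0;
 *   cache.queryQval(qv, satups, nrange, nelt);
 *   // satups[i].topf/.offs now give the BWT range and resolved offsets
 */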
#include <iostream>
#include "ds.h"
#include "read.h"
#include "threading.h"
#include "mem_ids.h"
#include "simple_func.h"
#include "btypes.h"
#define CACHE_PAGE_SZ (16 * 1024)
typedef PListSlice<uint32_t, CACHE_PAGE_SZ> TSlice;
/**
* Key for the query multimap: the read substring and its length.
*/
struct QKey {
/**
* Initialize invalid QKey.
*/
QKey() { reset(); }
/**
* Initialize QKey from DNA string.
*/
QKey(const BTDnaString& s ASSERT_ONLY(, BTDnaString& tmp)) {
init(s ASSERT_ONLY(, tmp));
}
/**
* Initialize QKey from DNA string. Rightmost character is placed in the
* least significant bitpair.
*/
bool init(
const BTDnaString& s
ASSERT_ONLY(, BTDnaString& tmp))
{
seq = 0;
len = (uint32_t)s.length();
ASSERT_ONLY(tmp.clear());
if(len > 32) {
len = 0xffffffff;
return false; // wasn't cacheable
} else {
// Rightmost char of 's' goes in the least significant bitpair
for(size_t i = 0; i < 32 && i < s.length(); i++) {
int c = (int)s.get(i);
assert_range(0, 4, c);
if(c == 4) {
len = 0xffffffff;
return false;
}
seq = (seq << 2) | s.get(i);
}
ASSERT_ONLY(toString(tmp));
assert(sstr_eq(tmp, s));
assert_leq(len, 32);
return true; // was cacheable
}
}
/**
* Convert this key to a DNA string.
*/
void toString(BTDnaString& s) {
s.resize(len);
uint64_t sq = seq;
for(int i = (len)-1; i >= 0; i--) {
s.set((uint32_t)(sq & 3), i);
sq >>= 2;
}
}
/**
* Return true iff the read substring is cacheable.
*/
bool cacheable() const { return len != 0xffffffff; }
/**
* Reset to uninitialized state.
*/
void reset() { seq = 0; len = 0xffffffff; }
/**
* True -> my key is less than the given key.
*/
bool operator<(const QKey& o) const {
return seq < o.seq || (seq == o.seq && len < o.len);
}
/**
* True -> my key is greater than the given key.
*/
bool operator>(const QKey& o) const {
return !(*this < o || *this == o);
}
/**
* True -> my key is equal to the given key.
*/
bool operator==(const QKey& o) const {
return seq == o.seq && len == o.len;
}
/**
* True -> my key is not equal to the given key.
*/
bool operator!=(const QKey& o) const {
return !(*this == o);
}
#ifndef NDEBUG
/**
* Check that this is a valid, initialized QKey.
*/
bool repOk() const {
return len != 0xffffffff;
}
#endif
uint64_t seq; // sequence
uint32_t len; // length of sequence
};
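/*
 * Worked example of the QKey packing above (values computed by hand, not
 * taken from a test): for s = "ACGT", bases map to the 2-bit codes A=0,
 * C=1, G=2, T=3 and are shifted in left to right, so the rightmost base
 * lands in the least significant bitpair:
 *
 *   seq = (((0 << 2 | 1) << 2 | 2) << 2) | 3 = 27 = 0x1b, len = 4
 *
 * toString() reverses this, peeling bitpairs off the least significant
 * end.
 */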
template <typename index_t>
class AlignmentCache;
/**
* Payload for the query multimap: a range of elements in the reference
* string list.
*/
template <typename index_t>
class QVal {
public:
QVal() { reset(); }
/**
* Return the offset of the first reference substring in the qlist.
*/
index_t offset() const { return i_; }
/**
* Return the number of reference substrings associated with a read
* substring.
*/
index_t numRanges() const {
assert(valid());
return rangen_;
}
/**
* Return the number of elements associated with all associated
* reference substrings.
*/
index_t numElts() const {
assert(valid());
return eltn_;
}
/**
* Return true iff the read substring is not associated with any
* reference substrings.
*/
bool empty() const {
assert(valid());
return numRanges() == 0;
}
/**
* Return true iff the QVal is valid.
*/
bool valid() const { return rangen_ != (index_t)OFF_MASK; }
/**
* Reset to invalid state.
*/
void reset() {
i_ = 0; rangen_ = eltn_ = (index_t)OFF_MASK;
}
/**
* Initialize Qval.
*/
void init(index_t i, index_t ranges, index_t elts) {
i_ = i; rangen_ = ranges; eltn_ = elts;
}
/**
* Tally another range with given number of elements.
*/
void addRange(index_t numElts) {
rangen_++;
eltn_ += numElts;
}
#ifndef NDEBUG
/**
* Check that this QVal is internally consistent and consistent
* with the contents of the given cache.
*/
bool repOk(const AlignmentCache<index_t>& ac) const;
#endif
protected:
index_t i_; // idx of first elt in qlist
index_t rangen_; // # ranges (= # associated reference substrings)
index_t eltn_; // # elements (total)
};
/**
* Key for the suffix array multimap: the reference substring and its
* length. Same as QKey so I typedef it.
*/
typedef QKey SAKey;
/**
* Payload for the suffix array multimap: (a) the top element of the
* range in BWT, (b) the offset of the first elt in the salist, (c)
* length of the range.
*/
template <typename index_t>
struct SAVal {
SAVal() : topf(), topb(), i(), len(OFF_MASK) { }
/**
* Return true iff the SAVal is valid.
*/
bool valid() { return len != (index_t)OFF_MASK; }
#ifndef NDEBUG
/**
* Check that this SAVal is internally consistent and consistent
* with the contents of the given cache.
*/
bool repOk(const AlignmentCache<index_t>& ac) const;
#endif
/**
* Initialize the SAVal.
*/
void init(
index_t tf,
index_t tb,
index_t ii,
index_t ln)
{
topf = tf;
topb = tb;
i = ii;
len = ln;
}
index_t topf; // top in BWT
index_t topb; // top in BWT'
index_t i; // idx of first elt in salist
index_t len; // length of range
};
/**
* One data structure that encapsulates all of the cached information
* associated with a particular reference substring. This is useful
* for summarizing what info should be added to the cache for a partial
* alignment.
*/
template <typename index_t>
class SATuple {
public:
SATuple() { reset(); };
SATuple(SAKey k, index_t tf, index_t tb, TSlice o) {
init(k, tf, tb, o);
}
void init(SAKey k, index_t tf, index_t tb, TSlice o) {
key = k; topf = tf; topb = tb; offs = o;
}
/**
* Initialize this SATuple from a subrange of the SATuple 'src'.
*/
void init(const SATuple& src, index_t first, index_t last) {
assert_neq((index_t)OFF_MASK, src.topb);
key = src.key;
topf = (index_t)(src.topf + first);
topb = (index_t)OFF_MASK; // unknown!
offs.init(src.offs, first, last);
}
#ifndef NDEBUG
/**
* Check that this SATuple is internally consistent and that its
* PListSlice is consistent with its backing PList.
*/
bool repOk() const {
assert(offs.repOk());
return true;
}
#endif
/**
* Function for ordering SATuples. This is used when prioritizing which to
* explore first when extending seed hits into full alignments. Smaller
* ranges get higher priority and we use 'top' to break ties, though any
* way of breaking a tie would be fine.
*/
bool operator<(const SATuple& o) const {
if(offs.size() < o.offs.size()) {
return true;
}
if(offs.size() > o.offs.size()) {
return false;
}
return topf < o.topf;
}
bool operator>(const SATuple& o) const {
if(offs.size() < o.offs.size()) {
return false;
}
if(offs.size() > o.offs.size()) {
return true;
}
return topf > o.topf;
}
bool operator==(const SATuple& o) const {
return key == o.key && topf == o.topf && topb == o.topb && offs == o.offs;
}
void reset() { topf = topb = (index_t)OFF_MASK; offs.reset(); }
/**
* Set the length to be at most the original length.
*/
void setLength(index_t nlen) {
assert_leq(nlen, offs.size());
offs.setLength(nlen);
}
/**
* Return the number of times this reference substring occurs in the
* reference, which is also the size of the 'offs' TSlice.
*/
index_t size() const { return (index_t)offs.size(); }
// bot/length of SA range equals offs.size()
SAKey key; // sequence key
index_t topf; // top in BWT index
index_t topb; // top in BWT' index
TSlice offs; // offsets
};
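/*
 * Sketch of how the ordering above is typically used (illustrative;
 * 'satups' is a hypothetical list populated via queryQval()): sorting
 * ascending puts the smallest, highest-priority BW ranges first, so the
 * cheapest seed hits get extended into full alignments first:
 *
 *   EList<SATuple<uint32_t>, 16> satups;
 *   // ... fill via AlignmentCache<uint32_t>::queryQval() ...
 *   satups.sort(); // smallest offs.size() first; topf breaks ties
 */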
/**
* Encapsulate the data structures and routines that constitute a
* particular cache, i.e., a particular stratum of the cache system,
* which might comprise many strata.
*
* Each thread has a "current-read" AlignmentCache which is used to
* build and store subproblem results as alignment is performed. When
* we're finished with a read, we might copy the cached results for
* that read (and perhaps a bundle of other recently-aligned reads) to
* a higher-level "across-read" cache. Higher-level caches may or may
* not be shared among threads.
*
* A cache consists chiefly of two multimaps, each implemented as a
* Red-Black tree map backed by an EList. A 'version' counter is
* incremented every time the cache is cleared.
*/
template <typename index_t>
class AlignmentCache {
typedef RedBlackNode<QKey, QVal<index_t> > QNode;
typedef RedBlackNode<SAKey, SAVal<index_t> > SANode;
typedef PList<SAKey, CACHE_PAGE_SZ> TQList;
typedef PList<index_t, CACHE_PAGE_SZ> TSAList;
public:
AlignmentCache(
uint64_t bytes,
bool shared) :
pool_(bytes, CACHE_PAGE_SZ, CA_CAT),
qmap_(CACHE_PAGE_SZ, CA_CAT),
qlist_(CA_CAT),
samap_(CACHE_PAGE_SZ, CA_CAT),
salist_(CA_CAT),
shared_(shared),
mutex_m(),
version_(0)
{
}
/**
* Given a QVal, populate the given EList of SATuples with records
* describing all of the cached information about the QVal's
* reference substrings.
*/
template <int S>
void queryQval(
const QVal<index_t>& qv,
EList<SATuple<index_t>, S>& satups,
index_t& nrange,
index_t& nelt,
bool getLock = true)
{
ThreadSafe ts(lockPtr(), shared_ && getLock);
assert(qv.repOk(*this));
const index_t refi = qv.offset();
const index_t reff = refi + qv.numRanges();
// For each reference sequence sufficiently similar to the
// query sequence in the QKey...
for(index_t i = refi; i < reff; i++) {
// Get corresponding SAKey, containing similar reference
// sequence & length
SAKey sak = qlist_.get(i);
// Shouldn't have identical keys in qlist_
assert(i == refi || qlist_.get(i) != qlist_.get(i-1));
// Get corresponding SANode
SANode *n = samap_.lookup(sak);
assert(n != NULL);
const SAVal<index_t>& sav = n->payload;
assert(sav.repOk(*this));
if(sav.len > 0) {
nrange++;
satups.expand();
satups.back().init(sak, sav.topf, sav.topb, TSlice(salist_, sav.i, sav.len));
nelt += sav.len;
#ifndef NDEBUG
// Shouldn't add consecutive identical entries to satups
if(i > refi) {
const SATuple<index_t> b1 = satups.back();
const SATuple<index_t> b2 = satups[satups.size()-2];
assert(b1.key != b2.key || b1.topf != b2.topf || b1.offs != b2.offs);
}
#endif
}
}
}
/**
* Return true iff the cache has no entries in it.
*/
bool empty() const {
bool ret = qmap_.empty();
assert(!ret || qlist_.empty());
assert(!ret || samap_.empty());
assert(!ret || salist_.empty());
return ret;
}
/**
* Add a new query key ('qk'), usually a 2-bit encoded substring of
* the read) as the key in a new Red-Black node in the qmap and
* return a pointer to the node's QVal.
*
* The expectation is that the caller is about to set about finding
* associated reference substrings, and that there will be future
* calls to addOnTheFly to add associations to reference substrings
* found.
*/
QVal<index_t>* add(
const QKey& qk,
bool *added,
bool getLock = true)
{
ThreadSafe ts(lockPtr(), shared_ && getLock);
assert(qk.cacheable());
QNode *n = qmap_.add(pool(), qk, added);
return (n != NULL ? &n->payload : NULL);
}
/**
* Add a new association between a read sequence ('seq') and a
* reference sequence (held by the key 'sak').
*/
bool addOnTheFly(
QVal<index_t>& qv, // qval that points to the range of reference substrings
const SAKey& sak, // the key holding the reference substring
index_t topf, // top range elt in BWT index
index_t botf, // bottom range elt in BWT index
index_t topb, // top range elt in BWT' index
index_t botb, // bottom range elt in BWT' index
bool getLock = true);
/**
* Clear the cache, i.e. turn it over. All HitGens referring to
* ranges in this cache will become invalid and the corresponding
* reads will have to be re-aligned.
*/
void clear(bool getLock = true) {
ThreadSafe ts(lockPtr(), shared_ && getLock);
pool_.clear();
qmap_.clear();
qlist_.clear();
samap_.clear();
salist_.clear();
version_++;
}
/**
* Return the number of keys in the query multimap.
*/
index_t qNumKeys() const { return (index_t)qmap_.size(); }
/**
* Return the number of keys in the suffix array multimap.
*/
index_t saNumKeys() const { return (index_t)samap_.size(); }
/**
* Return the number of elements in the reference substring list.
*/
index_t qSize() const { return (index_t)qlist_.size(); }
/**
* Return the number of elements in the SA range list.
*/
index_t saSize() const { return (index_t)salist_.size(); }
/**
* Return the pool.
*/
Pool& pool() { return pool_; }
/**
* Return the lock object.
*/
MUTEX_T& lock() {
return mutex_m;
}
/**
* Return a const pointer to the lock object. This allows us to
* write const member functions that grab the lock.
*/
MUTEX_T* lockPtr() const {
return const_cast<MUTEX_T*>(&mutex_m);
}
/**
* Return true iff this cache is shared among threads.
*/
bool shared() const { return shared_; }
/**
* Return the current "version" of the cache, i.e. the total number
* of times it has turned over since its creation.
*/
uint32_t version() const { return version_; }
protected:
Pool pool_; // dispenses memory pages
RedBlack<QKey, QVal<index_t> > qmap_; // map from query substrings to reference substrings
TQList qlist_; // list of reference substrings
RedBlack<SAKey, SAVal<index_t> > samap_; // map from reference substrings to SA ranges
TSAList salist_; // list of SA ranges
bool shared_; // true -> this cache is global
MUTEX_T mutex_m; // mutex used for synchronization in case the cache is shared.
uint32_t version_; // cache version
};
/**
* Interface used to query and update a pair of caches: one thread-
* local and unsynchronized, another shared and synchronized. One or
* both can be NULL.
*/
template <typename index_t>
class AlignmentCacheIface {
public:
AlignmentCacheIface(
AlignmentCache<index_t> *current,
AlignmentCache<index_t> *local,
AlignmentCache<index_t> *shared) :
qk_(),
qv_(NULL),
cacheable_(false),
rangen_(0),
eltsn_(0),
current_(current),
local_(local),
shared_(shared)
{
assert(current_ != NULL);
}
#if 0
/**
* Query the relevant set of caches, looking for a QVal to go with
* the provided QKey. If the QVal is found in a cache other than
* the current-read cache, it is copied into the current-read cache
* first and the QVal pointer for the current-read cache is
* returned. This function never returns a pointer from any cache
* other than the current-read cache. If the QVal could not be
* found in any cache OR if the QVal was found in a cache other
* than the current-read cache but could not be copied into the
* current-read cache, NULL is returned.
*/
QVal<index_t>* queryCopy(const QKey& qk, bool getLock = true) {
assert(qk.cacheable());
AlignmentCache<index_t>* caches[3] = { current_, local_, shared_ };
for(int i = 0; i < 3; i++) {
if(caches[i] == NULL) continue;
QVal<index_t>* qv = caches[i]->query(qk, getLock);
if(qv != NULL) {
if(i == 0) return qv;
if(!current_->copy(qk, *qv, *caches[i], getLock)) {
// Exhausted memory in the current cache while
// attempting to copy in the qk
return NULL;
}
QVal<index_t>* curqv = current_->query(qk, getLock);
assert(curqv != NULL);
return curqv;
}
}
return NULL;
}
/**
* Query the relevant set of caches, looking for a QVal to go with
* the provided QKey. If a QVal is found and which is non-NULL,
* *which is set to 0 if the qval was found in the current-read
* cache, 1 if it was found in the local across-read cache, and 2
* if it was found in the shared across-read cache.
*/
inline QVal<index_t>* query(
const QKey& qk,
AlignmentCache<index_t>** which,
bool getLock = true)
{
assert(qk.cacheable());
AlignmentCache<index_t>* caches[3] = { current_, local_, shared_ };
for(int i = 0; i < 3; i++) {
if(caches[i] == NULL) continue;
QVal<index_t>* qv = caches[i]->query(qk, getLock);
if(qv != NULL) {
if(which != NULL) *which = caches[i];
return qv;
}
}
return NULL;
}
#endif
/**
* This function is called whenever we start to align a new read or
* read substring. We make a key for it and store the key in qk_.
* If the sequence is uncacheable, we don't actually add it to the
* map but the corresponding reference substrings are still added
* to the qlist_.
*
* Returns:
* -1 if out of memory
* 0 otherwise, i.e. the key must still be searched for (the
* cache-hit shortcut, which returned 1, is disabled below)
*/
int beginAlign(
const BTDnaString& seq,
const BTString& qual,
QVal<index_t>& qv, // out: filled in if we find it in the cache
bool getLock = true)
{
assert(repOk());
qk_.init(seq ASSERT_ONLY(, tmpdnastr_));
//if(qk_.cacheable() && (qv_ = current_->query(qk_, getLock)) != NULL) {
// // qv_ holds the answer
// assert(qv_->valid());
// qv = *qv_;
// resetRead();
// return 1; // found in cache
//} else
if(qk_.cacheable()) {
// Make a QNode for this key and possibly add the QNode to the
// Red-Black map; but if 'seq' isn't cacheable, just create the
// QNode (without adding it to the map).
qv_ = current_->add(qk_, &cacheable_, getLock);
} else {
qv_ = &qvbuf_;
}
if(qv_ == NULL) {
resetRead();
return -1; // out of pool memory
}
qv_->reset();
return 0; // Need to search for it
}
ASSERT_ONLY(BTDnaString tmpdnastr_);
/**
* Called when the caller is finished aligning a read (and so is finished
* adding associated reference strings). Returns a copy of the
* final QVal object and resets the alignment state of the
* current-read cache.
*
* Also, if the alignment is cacheable, it commits it to the next
* cache up in the cache hierarchy.
*/
QVal<index_t> finishAlign(bool getLock = true) {
if(!qv_->valid()) {
qv_->init(0, 0, 0);
}
// Copy this pointer because we're about to reset the qv_ field
// to NULL
QVal<index_t>* qv = qv_;
// Commit the contents of the current-read cache to the next
// cache up in the hierarchy.
// If qk is cacheable, then it must be in the cache
#if 0
if(qk_.cacheable()) {
AlignmentCache<index_t>* caches[3] = { current_, local_, shared_ };
ASSERT_ONLY(AlignmentCache<index_t>* which);
ASSERT_ONLY(QVal<index_t>* qv2 = query(qk_, &which, true));
assert(qv2 == qv);
assert(which == current_);
for(int i = 1; i < 3; i++) {
if(caches[i] != NULL) {
// Copy this key/value pair to the higher
// level cache and, if its memory is exhausted,
// clear the cache and try again.
caches[i]->clearCopy(qk_, *qv_, *current_, getLock);
break;
}
}
}
#endif
// Reset the state in this iface in preparation for the next
// alignment.
resetRead();
assert(repOk());
return *qv;
}
/**
* A call to this member indicates that the caller has finished
* with the last read (if any) and is ready to work on the next.
* This gives the cache a chance to reset some of its state if
* necessary.
*/
void nextRead() {
current_->clear();
resetRead();
assert(!aligning());
}
/**
* Return true iff we're in the middle of aligning a sequence.
*/
bool aligning() const {
return qv_ != NULL;
}
/**
* Clears the current-read, local and shared caches.
*/
void clear() {
if(current_ != NULL) current_->clear();
if(local_ != NULL) local_->clear();
if(shared_ != NULL) shared_->clear();
}
/**
* Add an alignment to the running list of alignments being
* compiled for the current read in the current-read cache.
*/
bool addOnTheFly(
const BTDnaString& rfseq, // reference sequence close to read seq
index_t topf, // top in BWT index
index_t botf, // bot in BWT index
index_t topb, // top in BWT' index
index_t botb, // bot in BWT' index
bool getLock = true) // true -> lock is not held by caller
{
assert(aligning());
assert(repOk());
ASSERT_ONLY(BTDnaString tmp);
SAKey<index_t> sak(rfseq ASSERT_ONLY(, tmp));
//assert(sak.cacheable());
if(current_->addOnTheFly((*qv_), sak, topf, botf, topb, botb, getLock)) {
rangen_++;
eltsn_ += (botf-topf);
return true;
}
return false;
}
/**
* Given a QVal, populate the given EList of SATuples with records
* describing all of the cached information about the QVal's
* reference substrings.
*/
template <int S>
void queryQval(
const QVal<index_t>& qv,
EList<SATuple<index_t>, S>& satups,
index_t& nrange,
index_t& nelt,
bool getLock = true)
{
current_->queryQval(qv, satups, nrange, nelt, getLock);
}
/**
* Return a pointer to the current-read cache object.
*/
const AlignmentCache<index_t>* currentCache() const { return current_; }
index_t curNumRanges() const { return rangen_; }
index_t curNumElts() const { return eltsn_; }
#ifndef NDEBUG
/**
* Check that AlignmentCacheIface is internally consistent.
*/
bool repOk() const {
assert(current_ != NULL);
assert_geq(eltsn_, rangen_);
if(qv_ == NULL) {
assert_eq(0, rangen_);
assert_eq(0, eltsn_);
}
return true;
}
#endif
/**
* Return the alignment cache for the current read.
*/
const AlignmentCache<index_t>& current() {
return *current_;
}
protected:
/**
* Reset fields encoding info about the in-process read.
*/
void resetRead() {
cacheable_ = false;
rangen_ = eltsn_ = 0;
qv_ = NULL;
}
QKey<index_t> qk_; // key representation for current read substring
QVal<index_t> *qv_; // pointer to value representation for current read substring
QVal<index_t> qvbuf_; // buffer for when key is uncacheable but we need a qv
bool cacheable_; // true iff the read substring currently being aligned is cacheable
index_t rangen_; // number of ranges since last alignment job began
index_t eltsn_; // number of elements since last alignment job began
AlignmentCache<index_t> *current_; // cache dedicated to the current read
AlignmentCache<index_t> *local_; // local, unsynchronized cache
AlignmentCache<index_t> *shared_; // shared, synchronized cache
};
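/*
 * Editor's sketch (not from the original source): the intended calling
 * sequence for AlignmentCacheIface while aligning one read substring, per
 * the member docs above. 'iface', 'seq', 'qual', 'rfseq' and the BWT
 * coordinates are placeholder names.
 *
 *   iface.nextRead();                    // clear the current-read cache
 *   QVal<index_t> qv;
 *   int ret = iface.beginAlign(seq, qual, qv);
 *   if(ret == 0) {
 *       // conduct the search; for each BW range found:
 *       iface.addOnTheFly(rfseq, topf, botf, topb, botb);
 *       qv = iface.finishAlign();        // grab result, reset per-read state
 *   } else if(ret == -1) {
 *       // cache pool exhausted; handle out-of-memory
 *   }
 */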
#ifndef NDEBUG
/**
* Check that this QVal is internally consistent and consistent
* with the contents of the given cache.
*/
template <typename index_t>
bool QVal<index_t>::repOk(const AlignmentCache<index_t>& ac) const {
if(rangen_ > 0) {
assert_lt(i_, ac.qSize());
assert_leq(i_ + rangen_, ac.qSize());
}
assert_geq(eltn_, rangen_);
return true;
}
#endif
#ifndef NDEBUG
/**
* Check that this SAVal is internally consistent and consistent
* with the contents of the given cache.
*/
template <typename index_t>
bool SAVal<index_t>::repOk(const AlignmentCache<index_t>& ac) const {
assert(len == 0 || i < ac.saSize());
assert_leq(i + len, ac.saSize());
return true;
}
#endif
/**
* Add a new association between a read substring (represented by 'qv')
* and a reference substring (represented by 'sak').
*/
template <typename index_t>
bool AlignmentCache<index_t>::addOnTheFly(
QVal<index_t>& qv, // qval that points to the range of reference substrings
const SAKey<index_t>& sak, // the key holding the reference substring
index_t topf, // top range elt in BWT index
index_t botf, // bottom range elt in BWT index
index_t topb, // top range elt in BWT' index
index_t botb, // bottom range elt in BWT' index
bool getLock)
{
ThreadSafe ts(lockPtr(), shared_ && getLock);
bool added = true;
// If this is the first reference sequence we're associating with
// the query sequence, initialize the QVal.
if(!qv.valid()) {
qv.init((index_t)qlist_.size(), 0, 0);
}
qv.addRange(botf-topf); // update tally for # ranges and # elts
if(!qlist_.add(pool(), sak)) {
return false; // Exhausted pool memory
}
#ifndef NDEBUG
for(index_t i = qv.offset(); i < qlist_.size(); i++) {
if(i > qv.offset()) {
assert(qlist_.get(i) != qlist_.get(i-1));
}
}
#endif
assert_eq(qv.offset() + qv.numRanges(), qlist_.size());
SANode *s = samap_.add(pool(), sak, &added);
if(s == NULL) {
return false; // Exhausted pool memory
}
assert(s->key.repOk());
if(added) {
s->payload.i = (index_t)salist_.size();
s->payload.len = botf - topf;
s->payload.topf = topf;
s->payload.topb = topb;
for(size_t j = 0; j < (botf-topf); j++) {
if(!salist_.add(pool(), (index_t)0xffffffff)) {
// Change the payload's len field
s->payload.len = (uint32_t)j;
return false; // Exhausted pool memory
}
}
assert(s->payload.repOk(*this));
}
// All pool allocations for this entry have succeeded
return true;
}
#endif /*ALIGNER_CACHE_H_*/
centrifuge-f39767eb57d8e175029c/aligner_metrics.h 0000664 0000000 0000000 00000025762 13021605047 0021325 0 ustar 00root root 0000000 0000000 /*
* Copyright 2011, Ben Langmead
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALIGNER_METRICS_H_
#define ALIGNER_METRICS_H_
#include <math.h>
#include <iostream>
#include "alphabet.h"
#include "timer.h"
#include "sstring.h"
using namespace std;
/**
* Borrowed from http://www.johndcook.com/standard_deviation.html,
* which in turn is borrowed from Knuth.
*/
class RunningStat {
public:
RunningStat() : m_n(0), m_tot(0.0) { }
void clear() {
m_n = 0;
m_tot = 0.0;
}
void push(float x) {
m_n++;
m_tot += x;
// See Knuth TAOCP vol 2, 3rd edition, page 232
if (m_n == 1) {
m_oldM = m_newM = x;
m_oldS = 0.0;
} else {
m_newM = m_oldM + (x - m_oldM)/m_n;
m_newS = m_oldS + (x - m_oldM)*(x - m_newM);
// set up for next iteration
m_oldM = m_newM;
m_oldS = m_newS;
}
}
int num() const {
return m_n;
}
double tot() const {
return m_tot;
}
double mean() const {
return (m_n > 0) ? m_newM : 0.0;
}
double variance() const {
return ( (m_n > 1) ? m_newS/(m_n - 1) : 0.0 );
}
double stddev() const {
return sqrt(variance());
}
private:
int m_n;
double m_tot;
double m_oldM, m_newM, m_oldS, m_newS;
};
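// Editor's sketch (not from the original source): exercising RunningStat.
// For samples 3, 5 and 7, the mean is 5 and the sample variance is
// ((3-5)^2 + (5-5)^2 + (7-5)^2) / (3-1) = 4, so stddev() returns 2. The
// push() method maintains these incrementally (the Knuth/Welford
// recurrence cited above), so no sample buffer is kept.
static inline void runningStatExample() {
	RunningStat rs;
	rs.push(3.0f);
	rs.push(5.0f);
	rs.push(7.0f);
	double m = rs.mean();     // == 5.0
	double sd = rs.stddev();  // == 2.0
	(void)m; (void)sd;        // silence unused-variable warnings
}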
/**
* Encapsulates a set of metrics that we would like an aligner to keep
* track of, so that we can possibly use it to diagnose performance
* issues.
*/
class AlignerMetrics {
public:
AlignerMetrics() :
curBacktracks_(0),
curBwtOps_(0),
first_(true),
curIsLowEntropy_(false),
curIsHomoPoly_(false),
curHadRanges_(false),
curNumNs_(0),
reads_(0),
homoReads_(0),
lowEntReads_(0),
hiEntReads_(0),
alignedReads_(0),
unalignedReads_(0),
threeOrMoreNReads_(0),
lessThanThreeNRreads_(0),
bwtOpsPerRead_(),
backtracksPerRead_(),
bwtOpsPerHomoRead_(),
backtracksPerHomoRead_(),
bwtOpsPerLoEntRead_(),
backtracksPerLoEntRead_(),
bwtOpsPerHiEntRead_(),
backtracksPerHiEntRead_(),
bwtOpsPerAlignedRead_(),
backtracksPerAlignedRead_(),
bwtOpsPerUnalignedRead_(),
backtracksPerUnalignedRead_(),
bwtOpsPer0nRead_(),
backtracksPer0nRead_(),
bwtOpsPer1nRead_(),
backtracksPer1nRead_(),
bwtOpsPer2nRead_(),
backtracksPer2nRead_(),
bwtOpsPer3orMoreNRead_(),
backtracksPer3orMoreNRead_(),
timer_(cout, "", false)
{ }
void printSummary() {
if(!first_) {
finishRead();
}
cout << "AlignerMetrics:" << endl;
cout << " # Reads: " << reads_ << endl;
float hopct = (reads_ > 0) ? (((float)homoReads_)/((float)reads_)) : (0.0f);
hopct *= 100.0f;
cout << " % homo-polymeric: " << (hopct) << endl;
float lopct = (reads_ > 0) ? ((float)lowEntReads_/(float)(reads_)) : (0.0f);
lopct *= 100.0f;
cout << " % low-entropy: " << (lopct) << endl;
float unpct = (reads_ > 0) ? ((float)unalignedReads_/(float)(reads_)) : (0.0f);
unpct *= 100.0f;
cout << " % unaligned: " << (unpct) << endl;
float npct = (reads_ > 0) ? ((float)threeOrMoreNReads_/(float)(reads_)) : (0.0f);
npct *= 100.0f;
cout << " % with 3 or more Ns: " << (npct) << endl;
cout << endl;
cout << " Total BWT ops: avg: " << bwtOpsPerRead_.mean() << ", stddev: " << bwtOpsPerRead_.stddev() << endl;
cout << " Total Backtracks: avg: " << backtracksPerRead_.mean() << ", stddev: " << backtracksPerRead_.stddev() << endl;
time_t elapsed = timer_.elapsed();
cout << " BWT ops per second: " << (bwtOpsPerRead_.tot()/elapsed) << endl;
cout << " Backtracks per second: " << (backtracksPerRead_.tot()/elapsed) << endl;
cout << endl;
cout << " Homo-poly:" << endl;
cout << " BWT ops: avg: " << bwtOpsPerHomoRead_.mean() << ", stddev: " << bwtOpsPerHomoRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPerHomoRead_.mean() << ", stddev: " << backtracksPerHomoRead_.stddev() << endl;
cout << " Low-entropy:" << endl;
cout << " BWT ops: avg: " << bwtOpsPerLoEntRead_.mean() << ", stddev: " << bwtOpsPerLoEntRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPerLoEntRead_.mean() << ", stddev: " << backtracksPerLoEntRead_.stddev() << endl;
cout << " High-entropy:" << endl;
cout << " BWT ops: avg: " << bwtOpsPerHiEntRead_.mean() << ", stddev: " << bwtOpsPerHiEntRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPerHiEntRead_.mean() << ", stddev: " << backtracksPerHiEntRead_.stddev() << endl;
cout << endl;
cout << " Unaligned:" << endl;
cout << " BWT ops: avg: " << bwtOpsPerUnalignedRead_.mean() << ", stddev: " << bwtOpsPerUnalignedRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPerUnalignedRead_.mean() << ", stddev: " << backtracksPerUnalignedRead_.stddev() << endl;
cout << " Aligned:" << endl;
cout << " BWT ops: avg: " << bwtOpsPerAlignedRead_.mean() << ", stddev: " << bwtOpsPerAlignedRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPerAlignedRead_.mean() << ", stddev: " << backtracksPerAlignedRead_.stddev() << endl;
cout << endl;
cout << " 0 Ns:" << endl;
cout << " BWT ops: avg: " << bwtOpsPer0nRead_.mean() << ", stddev: " << bwtOpsPer0nRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPer0nRead_.mean() << ", stddev: " << backtracksPer0nRead_.stddev() << endl;
cout << " 1 N:" << endl;
cout << " BWT ops: avg: " << bwtOpsPer1nRead_.mean() << ", stddev: " << bwtOpsPer1nRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPer1nRead_.mean() << ", stddev: " << backtracksPer1nRead_.stddev() << endl;
cout << " 2 Ns:" << endl;
cout << " BWT ops: avg: " << bwtOpsPer2nRead_.mean() << ", stddev: " << bwtOpsPer2nRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPer2nRead_.mean() << ", stddev: " << backtracksPer2nRead_.stddev() << endl;
cout << " >2 Ns:" << endl;
cout << " BWT ops: avg: " << bwtOpsPer3orMoreNRead_.mean() << ", stddev: " << bwtOpsPer3orMoreNRead_.stddev() << endl;
cout << " Backtracks: avg: " << backtracksPer3orMoreNRead_.mean() << ", stddev: " << backtracksPer3orMoreNRead_.stddev() << endl;
cout << endl;
}
/**
* Set up to collect metrics for the next read, first committing the
* statistics for the previous read (if any).
*/
void nextRead(const BTDnaString& read) {
if(!first_) {
finishRead();
}
first_ = false;
//float ent = entropyDna5(read);
float ent = 0.0f;
curIsLowEntropy_ = (ent < 0.75f);
curIsHomoPoly_ = (ent < 0.001f);
curHadRanges_ = false;
curBwtOps_ = 0;
curBacktracks_ = 0;
// Count Ns
curNumNs_ = 0;
const size_t len = read.length();
for(size_t i = 0; i < len; i++) {
if((int)read[i] == 4) curNumNs_++;
}
}
/**
* Note that the current read has had at least one range reported.
*/
void setReadHasRange() {
curHadRanges_ = true;
}
/**
* Commit the running statistics for this read to the overall and
* per-category tallies.
*/
void finishRead() {
reads_++;
if(curIsHomoPoly_) homoReads_++;
else if(curIsLowEntropy_) lowEntReads_++;
else hiEntReads_++;
if(curHadRanges_) alignedReads_++;
else unalignedReads_++;
bwtOpsPerRead_.push((float)curBwtOps_);
backtracksPerRead_.push((float)curBacktracks_);
// Drill down by entropy
if(curIsHomoPoly_) {
bwtOpsPerHomoRead_.push((float)curBwtOps_);
backtracksPerHomoRead_.push((float)curBacktracks_);
} else if(curIsLowEntropy_) {
bwtOpsPerLoEntRead_.push((float)curBwtOps_);
backtracksPerLoEntRead_.push((float)curBacktracks_);
} else {
bwtOpsPerHiEntRead_.push((float)curBwtOps_);
backtracksPerHiEntRead_.push((float)curBacktracks_);
}
// Drill down by whether it aligned
if(curHadRanges_) {
bwtOpsPerAlignedRead_.push((float)curBwtOps_);
backtracksPerAlignedRead_.push((float)curBacktracks_);
} else {
bwtOpsPerUnalignedRead_.push((float)curBwtOps_);
backtracksPerUnalignedRead_.push((float)curBacktracks_);
}
if(curNumNs_ == 0) {
lessThanThreeNRreads_++;
bwtOpsPer0nRead_.push((float)curBwtOps_);
backtracksPer0nRead_.push((float)curBacktracks_);
} else if(curNumNs_ == 1) {
lessThanThreeNRreads_++;
bwtOpsPer1nRead_.push((float)curBwtOps_);
backtracksPer1nRead_.push((float)curBacktracks_);
} else if(curNumNs_ == 2) {
lessThanThreeNRreads_++;
bwtOpsPer2nRead_.push((float)curBwtOps_);
backtracksPer2nRead_.push((float)curBacktracks_);
} else {
threeOrMoreNReads_++;
bwtOpsPer3orMoreNRead_.push((float)curBwtOps_);
backtracksPer3orMoreNRead_.push((float)curBacktracks_);
}
}
// Running-total of the number of backtracks and BWT ops for the
// current read
uint32_t curBacktracks_;
uint32_t curBwtOps_;
protected:
bool first_;
// true iff the current read is low entropy
bool curIsLowEntropy_;
// true if current read is all 1 char (or very close)
bool curIsHomoPoly_;
// true iff the current read has had one or more ranges reported
bool curHadRanges_;
// number of Ns in current read
int curNumNs_;
// # reads
uint32_t reads_;
// # homo-poly reads
uint32_t homoReads_;
// # low-entropy reads
uint32_t lowEntReads_;
// # high-entropy reads
uint32_t hiEntReads_;
// # reads with alignments
uint32_t alignedReads_;
// # reads without alignments
uint32_t unalignedReads_;
// # reads with 3 or more Ns
uint32_t threeOrMoreNReads_;
// # reads with < 3 Ns
uint32_t lessThanThreeNRreads_;
// Distribution of BWT operations per read
RunningStat bwtOpsPerRead_;
RunningStat backtracksPerRead_;
// Distribution of BWT operations per homo-poly read
RunningStat bwtOpsPerHomoRead_;
RunningStat backtracksPerHomoRead_;
// Distribution of BWT operations per low-entropy read
RunningStat bwtOpsPerLoEntRead_;
RunningStat backtracksPerLoEntRead_;
// Distribution of BWT operations per high-entropy read
RunningStat bwtOpsPerHiEntRead_;
RunningStat backtracksPerHiEntRead_;
// Distribution of BWT operations per read that "aligned" (i.e. for
// which a range was arrived at; the range may not necessarily have
// led to an alignment)
RunningStat bwtOpsPerAlignedRead_;
RunningStat backtracksPerAlignedRead_;
// Distribution of BWT operations per read that didn't align
RunningStat bwtOpsPerUnalignedRead_;
RunningStat backtracksPerUnalignedRead_;
// Distribution of BWT operations/backtracks per read with no Ns
RunningStat bwtOpsPer0nRead_;
RunningStat backtracksPer0nRead_;
// Distribution of BWT operations/backtracks per read with one N
RunningStat bwtOpsPer1nRead_;
RunningStat backtracksPer1nRead_;
// Distribution of BWT operations/backtracks per read with two Ns
RunningStat bwtOpsPer2nRead_;
RunningStat backtracksPer2nRead_;
// Distribution of BWT operations/backtracks per read with three or
// more Ns
RunningStat bwtOpsPer3orMoreNRead_;
RunningStat backtracksPer3orMoreNRead_;
Timer timer_;
};
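/*
 * Editor's sketch (not from the original source): the per-read protocol
 * implied by the members above. curBwtOps_ and curBacktracks_ are public
 * so the aligner can bump them cheaply during the search.
 *
 *   AlignerMetrics metrics;
 *   // for each read:
 *   metrics.nextRead(read);          // folds the previous read into totals
 *   metrics.curBwtOps_++;            // ... during the search ...
 *   metrics.setReadHasRange();       // if a range was found
 *   // after the last read:
 *   metrics.printSummary();          // also commits the final read
 */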
#endif /* ALIGNER_METRICS_H_ */
centrifuge-f39767eb57d8e175029c/aligner_result.h 0000664 0000000 0000000 00000026572 13021605047 0021175 0 ustar 00root root 0000000 0000000 /*
* Copyright 2011, Ben Langmead
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALIGNER_RESULT_H_
#define ALIGNER_RESULT_H_
#include <utility>
#include <limits>
#include <string>
#include "mem_ids.h"
#include "ref_coord.h"
#include "read.h"
#include "filebuf.h"
#include "ds.h"
#include "edit.h"
#include "limit.h"
typedef int64_t TAlScore;
#define VALID_AL_SCORE(x) ((x).score_ > MIN_I64)
#define VALID_SCORE(x) ((x) > MIN_I64)
#define INVALIDATE_SCORE(x) ((x) = MIN_I64)
/**
* A generic score object for an alignment. Used for accounting during
* SW and elsewhere. Encapsulates the score, the number of N positions
* and the number of gaps in the alignment.
*
* The scale for 'score' is such that a perfect alignment score is 0
* and a score with non-zero penalty is less than 0. So differences
* between scores work as expected, but interpreting an individual
* score (larger is better) as a penalty (smaller is better) requires
* taking the absolute value.
*/
class AlnScore {
public:
/**
* Gapped scores are invalid until proven valid.
*/
inline AlnScore() {
reset();
invalidate();
assert(!valid());
}
/**
* Construct a score from the given value.
*/
inline AlnScore(TAlScore score) {
score_ = score;
}
/**
* Reset the score.
*/
void reset() {
score_ = 0;
}
/**
* Return an invalid AlnScore.
*/
inline static AlnScore INVALID() {
AlnScore s;
s.invalidate();
assert(!s.valid());
return s;
}
/**
* Return true iff this score has a valid value.
*/
inline bool valid() const {
return score_ != MIN_I64;
}
/**
* Make this score invalid (and therefore <= all other scores).
*/
inline void invalidate() {
score_ = MIN_I64;
assert(!valid());
}
/**
* Scores are equal iff they're bitwise equal.
*/
inline bool operator==(const AlnScore& o) const {
// Profiling shows cache misses on following line
return VALID_AL_SCORE(*this) && VALID_AL_SCORE(o) && score_ == o.score_;
}
/**
* Return true iff the two scores are unequal.
*/
inline bool operator!=(const AlnScore& o) const {
return !(*this == o);
}
/**
* Return true iff this score is >= score o.
*/
inline bool operator>=(const AlnScore& o) const {
if(!VALID_AL_SCORE(o)) {
if(!VALID_AL_SCORE(*this)) {
// both invalid
return false;
} else {
// I'm valid, other is invalid
return true;
}
} else if(!VALID_AL_SCORE(*this)) {
// I'm invalid, other is valid
return false;
}
return score_ >= o.score_;
}
/**
* Return true iff this score is < score o.
*/
inline bool operator<(const AlnScore& o) const {
return !operator>=(o);
}
/**
* Return true iff this score is <= score o.
*/
inline bool operator<=(const AlnScore& o) const {
return operator<(o) || operator==(o);
}
/**
* Return true iff this score is > score o.
*/
inline bool operator>(const AlnScore& o) const {
return !operator<=(o);
}
TAlScore score() const { return score_; }
// Score accumulated so far (penalties are subtracted starting at 0)
TAlScore score_;
};
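// Editor's sketch (not from the original source): consequences of the
// comparison operators above. operator== is false unless both operands
// are valid, and operator< sorts an invalid score below every score,
// even another invalid one.
static inline void alnScoreOrderingExample() {
	AlnScore valid(-3);            // a real (penalized) score; 0 is perfect
	AlnScore invalid;              // default-constructed scores are invalid
	bool a = (valid >= invalid);   // true
	bool b = (invalid == invalid); // false: neither operand is valid
	bool c = (invalid < invalid);  // true: invalid sorts below everything
	(void)a; (void)b; (void)c;     // silence unused-variable warnings
}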
static inline ostream& operator<<(ostream& os, const AlnScore& o) {
os << o.score();
return os;
}
// Forward declaration
class BitPairReference;
/**
* Encapsulates an alignment result. The result comprises:
*
* 1. All the nucleotide edits for both mates ('ned').
* 2. All "edits" where an ambiguous reference char is resolved to an
* unambiguous char ('aed').
* 3. The score for the alignment, including summary information about the
* number of gaps and Ns involved.
* 4. The reference id, strand, and 0-based offset of the leftmost character
* involved in the alignment.
* 5. Information about trimming prior to alignment and whether it was hard or
* soft.
* 6. Information about trimming during alignment and whether it was hard or
* soft. Local-alignment trimming is usually soft when aligning nucleotide
* reads.
*
* Note that the AlnRes, together with the Read and an AlnSetSumm (*and* the
* opposite mate's AlnRes and Read in the case of a paired-end alignment),
* should contain enough information to print an entire alignment record.
*
* TRIMMING
*
* Accounting for trimming is tricky. Trimming affects:
*
* 1. The values of the trim* and pretrim* fields.
* 2. The offsets of the Edits in the ELists.
* 3. The read extent, if the trimming is soft.
* 4. The read extent and the read sequence and length, if trimming is hard.
*
* Handling 1. is not too difficult. 2., 3., and 4. are handled in setShape().
*/
class AlnRes {
public:
AlnRes()
{
reset();
}
AlnRes(const AlnRes& other)
{
score_ = other.score_;
max_score_ = other.max_score_;
uid_ = other.uid_;
tid_ = other.tid_;
taxRank_ = other.taxRank_;
summedHitLen_ = other.summedHitLen_;
readPositions_ = other.readPositions_;
isFw_ = other.isFw_;
}
AlnRes& operator=(const AlnRes& other) {
if(this == &other) return *this;
score_ = other.score_;
max_score_ = other.max_score_;
uid_ = other.uid_;
tid_ = other.tid_;
taxRank_ = other.taxRank_;
summedHitLen_ = other.summedHitLen_;
readPositions_ = other.readPositions_;
isFw_ = other.isFw_;
return *this;
}
~AlnRes() {}
/**
* Clear all contents.
*/
void reset() {
score_ = 0;
max_score_ = 0;
uid_ = "";
tid_ = 0;
taxRank_ = RANK_UNKNOWN;
summedHitLen_ = 0.0;
readPositions_.clear();
}
/**
* Set alignment score for this alignment.
*/
void setScore(TAlScore score) {
score_ = score;
}
TAlScore score() const { return score_; }
TAlScore max_score() const { return max_score_; }
string uid() const { return uid_; }
uint64_t taxID() const { return tid_; }
uint8_t taxRank() const { return taxRank_; }
double summedHitLen() const { return summedHitLen_; }
const EList<pair<uint32_t, uint32_t> >& readPositionsPtr() const { return readPositions_; }
const pair<uint32_t, uint32_t> readPositions(size_t i) const { return readPositions_[i]; }
size_t nReadPositions() const { return readPositions_.size(); }
bool isFw() const { return isFw_; }
/**
* Print the sequence for the read that aligned using A, C, G and
* T. This will simply print the read sequence (or its reverse
* complement).
*/
void printSeq(
const Read& rd,
const BTDnaString* dns,
BTString& o) const
{
assert(!rd.patFw.empty());
ASSERT_ONLY(size_t written = 0);
// Print decoded nucleotides
assert(dns != NULL);
size_t len = dns->length();
size_t st = 0;
size_t en = len;
for(size_t i = st; i < en; i++) {
int c = dns->get(i);
assert_range(0, 3, c);
o.append("ACGT"[c]);
ASSERT_ONLY(written++);
}
}
/**
* Print the quality string for the read that aligned. This will
* simply print the read qualities (or their reverse).
*/
void printQuals(
const Read& rd,
const BTString* dqs,
BTString& o) const
{
assert(dqs != NULL);
size_t len = dqs->length();
// Print decoded qualities from upstream to downstream Watson
for(size_t i = 1; i < len-1; i++) {
o.append(dqs->get(i));
}
}
/**
* Initialize new AlnRes.
*/
void init(
TAlScore score, // alignment score
TAlScore max_score,
const string& uniqueID,
uint64_t taxID,
uint8_t taxRank,
double summedHitLen,
const EList<pair<uint32_t, uint32_t> >& readPositions,
bool isFw)
{
score_ = score;
max_score_ = max_score;
uid_ = uniqueID;
tid_ = taxID;
taxRank_ = taxRank;
summedHitLen_ = summedHitLen;
readPositions_ = readPositions;
isFw_ = isFw;
}
protected:
TAlScore score_; // alignment score
TAlScore max_score_;
string uid_;
uint64_t tid_;
uint8_t taxRank_;
double summedHitLen_; // sum of the length of all partial hits, divided by the number of genome matches
bool isFw_;
EList<pair<uint32_t, uint32_t> > readPositions_;
};
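/*
 * Editor's sketch (not from the original source): populating an AlnRes for
 * a classification hit. All values here are placeholders; the
 * (read position, hit length) pairs back the summedHitLen tally.
 *
 *   AlnRes res;
 *   EList<pair<uint32_t, uint32_t> > pos;
 *   pos.push_back(make_pair(0, 22));  // hit of length 22 at read offset 0
 *   res.init(-6, 0, "seq123", 9606, RANK_UNKNOWN, 22.0, pos, true);
 *   // res.score(), res.taxID(), res.summedHitLen() etc. feed the output
 */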
typedef uint64_t TNumAlns;
/**
* Encapsulates a concise summary of a set of alignment results for a
* given pair or mate. Referring to the fields of this object should
* provide enough information to print output records for the read.
*/
class AlnSetSumm {
public:
AlnSetSumm() { reset(); }
/**
* Given an unpaired read (in either rd1 or rd2) or a read pair
* (mate 1 in rd1, mate 2 in rd2).
*/
explicit AlnSetSumm(
const Read* rd1,
const Read* rd2,
const EList<AlnRes>* rs)
{
init(rd1, rd2, rs);
}
explicit AlnSetSumm(
AlnScore best,
AlnScore secbest)
{
init(best, secbest);
}
/**
* Set to uninitialized state.
*/
void reset() {
best_.invalidate();
secbest_.invalidate();
}
/**
* Given all the paired and unpaired results involving mates #1 and #2,
* calculate best and second-best scores for both mates. These are
* used for future MAPQ calculations.
*/
void init(
const Read* rd1,
const Read* rd2,
const EList<AlnRes>* rs)
{
assert(rd1 != NULL || rd2 != NULL);
assert(rs != NULL);
AlnScore best, secbest;
size_t szs = 0;
best.invalidate(); secbest.invalidate();
szs = rs->size();
//assert_gt(szs[j], 0);
for(size_t i = 0; i < rs->size(); i++) {
AlnScore sc = (*rs)[i].score();
if(sc > best) {
secbest = best;
best = sc;
assert(VALID_AL_SCORE(best));
} else if(sc > secbest) {
secbest = sc;
assert(VALID_AL_SCORE(best));
assert(VALID_AL_SCORE(secbest));
}
}
if(szs > 0) {
init(best, secbest);
} else {
reset();
}
}
/**
* Initialize given fields. See constructor for how fields are set.
*/
void init(
AlnScore best,
AlnScore secbest)
{
best_ = best;
secbest_ = secbest;
assert(repOk());
}
/**
* Return true iff there is no valid best alignment
*/
bool empty() const {
assert(repOk());
return !VALID_AL_SCORE(best_);
}
#ifndef NDEBUG
/**
* Check that the summary is internally consistent.
*/
bool repOk() const {
return true;
}
#endif
AlnScore best() const { return best_; }
AlnScore secbest() const { return secbest_; }
protected:
AlnScore best_; // best full-alignment score found for this read
AlnScore secbest_; // second-best
};
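/*
 * Editor's note (not from the original source): a worked example of init()
 * above. If 'rs' holds alignments scoring -5, -2 and -9, the single pass
 * leaves best_ == -2 and secbest_ == -5; if 'rs' is empty, reset() leaves
 * both scores invalid and empty() returns true.
 */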
#endif
centrifuge-f39767eb57d8e175029c/aligner_seed.cpp 0000664 0000000 0000000 00000040265 13021605047 0021125 0 ustar 00root root 0000000 0000000 /*
* Copyright 2011, Ben Langmead
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#include "aligner_cache.h"
#include "aligner_seed.h"
#include "search_globals.h"
#include "bt2_idx.h"
using namespace std;
/**
* Construct a constraint with no edits of any kind allowed.
*/
Constraint Constraint::exact() {
Constraint c;
c.edits = c.mms = c.ins = c.dels = c.penalty = 0;
return c;
}
/**
* Construct a constraint where the only constraint is a total
* penalty constraint.
*/
Constraint Constraint::penaltyBased(int pen) {
Constraint c;
c.penalty = pen;
return c;
}
/**
* Construct a constraint where the only constraint is a total
* penalty constraint related to the length of the read.
*/
Constraint Constraint::penaltyFuncBased(const SimpleFunc& f) {
Constraint c;
c.penFunc = f;
return c;
}
/**
* Construct a constraint where the only constraint is on the
* number of mismatches.
*/
Constraint Constraint::mmBased(int mms) {
Constraint c;
c.mms = mms;
c.edits = c.dels = c.ins = 0;
return c;
}
/**
* Construct a constraint where the only constraint is on the
* total number of edits.
*/
Constraint Constraint::editBased(int edits) {
Constraint c;
c.edits = edits;
c.dels = c.ins = c.mms = 0;
return c;
}
//
// Some static methods for constructing some standard SeedPolicies
//
/**
* Given a read, depth and orientation, extract a seed data structure
* from the read and fill in the steps & zones arrays. The Seed
* contains the sequence and quality values.
*/
bool
Seed::instantiate(
const Read& read,
const BTDnaString& seq, // seed read sequence
const BTString& qual, // seed quality sequence
const Scoring& pens,
int depth,
int seedoffidx,
int seedtypeidx,
bool fw,
InstantiatedSeed& is) const
{
assert(overall != NULL);
int seedlen = len;
if((int)read.length() < seedlen) {
// Shrink seed length to fit read if necessary
seedlen = (int)read.length();
}
assert_gt(seedlen, 0);
is.steps.resize(seedlen);
is.zones.resize(seedlen);
// Fill in 'steps' and 'zones'
//
// The 'steps' list indicates which read character should be
// incorporated at each step of the search process. Often we will
// simply proceed from one end to the other, in which case the
// 'steps' list is ascending or descending. In some cases (e.g.
// the 2mm case), we might want to switch directions at least once
// during the search, in which case 'steps' will jump in the
// middle. When an element of the 'steps' list is negative, this
// indicates that the corresponding position is visited while stepping
// leftward (from the 3' toward the 5' end); in either case the 0-based
// read offset of the position is abs(step)-1.
//
// The 'zones' list indicates which zone constraint is active at
// each step. Each element of the 'zones' list is a pair; the
// first pair element indicates the applicable zone when
// considering either mismatch or delete (ref gap) events, while
// the second pair element indicates the applicable zone when
// considering insertion (read gap) events. When either pair
// element is a negative number, that indicates that we are about
// to leave the zone for good, at which point we may need to
// evaluate whether we have reached the zone's budget.
//
switch(type) {
case SEED_TYPE_EXACT: {
for(int k = 0; k < seedlen; k++) {
is.steps[k] = -(seedlen - k);
// Zone 0 all the way
is.zones[k].first = is.zones[k].second = 0;
}
break;
}
case SEED_TYPE_LEFT_TO_RIGHT: {
for(int k = 0; k < seedlen; k++) {
is.steps[k] = k+1;
// Zone 0 from 0 up to ceil(len/2), then 1
is.zones[k].first = is.zones[k].second = ((k < (seedlen+1)/2) ? 0 : 1);
}
// Zone 1 ends at the RHS
is.zones[seedlen-1].first = is.zones[seedlen-1].second = -1;
break;
}
case SEED_TYPE_RIGHT_TO_LEFT: {
for(int k = 0; k < seedlen; k++) {
is.steps[k] = -(seedlen - k);
// Zone 0 from 0 up to floor(len/2), then 1
is.zones[k].first = ((k < seedlen/2) ? 0 : 1);
// Inserts: Zone 0 from 0 up to ceil(len/2)-1, then 1
is.zones[k].second = ((k < (seedlen+1)/2+1) ? 0 : 1);
}
is.zones[seedlen-1].first = is.zones[seedlen-1].second = -1;
break;
}
case SEED_TYPE_INSIDE_OUT: {
// Zone 0 from ceil(N/4) up to N-floor(N/4)
int step = 0;
for(int k = (seedlen+3)/4; k < seedlen - (seedlen/4); k++) {
is.zones[step].first = is.zones[step].second = 0;
is.steps[step++] = k+1;
}
// Zone 1 from N-floor(N/4) up
for(int k = seedlen - (seedlen/4); k < seedlen; k++) {
is.zones[step].first = is.zones[step].second = 1;
is.steps[step++] = k+1;
}
// No Zone 1 if seedlen is short (like 2)
//assert_eq(1, is.zones[step-1].first);
is.zones[step-1].first = is.zones[step-1].second = -1;
// Zone 2 from ((seedlen+3)/4)-1 down to 0
for(int k = ((seedlen+3)/4)-1; k >= 0; k--) {
is.zones[step].first = is.zones[step].second = 2;
is.steps[step++] = -(k+1);
}
assert_eq(2, is.zones[step-1].first);
is.zones[step-1].first = is.zones[step-1].second = -2;
assert_eq(seedlen, step);
break;
}
default:
throw 1;
}
// Instantiate constraints
for(int i = 0; i < 3; i++) {
is.cons[i] = zones[i];
is.cons[i].instantiate(read.length());
}
is.overall = *overall;
is.overall.instantiate(read.length());
// Take a sweep through the seed sequence. Consider where the Ns
// occur and how zones are laid out. Calculate the maximum number
// of positions we can jump over initially (e.g. with the ftab) and
// perhaps set this function's return value to false, indicating
// that the arrangements of Ns prevents the seed from aligning.
bool streak = true;
is.maxjump = 0;
bool ret = true;
bool ltr = (is.steps[0] > 0); // true -> left-to-right
for(size_t i = 0; i < is.steps.size(); i++) {
assert_neq(0, is.steps[i]);
int off = is.steps[i];
off = abs(off)-1;
Constraint& cons = is.cons[abs(is.zones[i].first)];
int c = seq[off]; assert_range(0, 4, c);
int q = qual[off];
if(ltr != (is.steps[i] > 0) || // changed direction
is.zones[i].first != 0 || // changed zone
is.zones[i].second != 0) // changed zone
{
streak = false;
}
if(c == 4) {
// Induced mismatch
if(cons.canN(q, pens)) {
cons.chargeN(q, pens);
} else {
// Seed disqualified due to arrangement of Ns
return false;
}
}
if(streak) is.maxjump++;
}
is.seedoff = depth;
is.seedoffidx = seedoffidx;
is.fw = fw;
is.s = *this;
return ret;
}
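/*
 * Editor's worked example (not from the original source): for a
 * SEED_TYPE_LEFT_TO_RIGHT seed of length 4, the loop above yields
 *
 *   steps = [ 1, 2, 3, 4 ]                  // offsets 0..3, 5' to 3'
 *   zones = [ (0,0), (0,0), (1,1), (-1,-1) ]
 *
 * i.e. the first (4+1)/2 = 2 positions are charged to zone 0 (exact),
 * the rest to zone 1, and the trailing -1 closes zone 1 out at the last
 * position so its budget can be checked there.
 */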
/**
* Return a set consisting of 1 seed encapsulating an exact matching
* strategy.
*/
void
Seed::zeroMmSeeds(int ln, EList<Seed>& pols, Constraint& oall) {
oall.init();
// Seed policy 1: left-to-right search
pols.expand();
pols.back().len = ln;
pols.back().type = SEED_TYPE_EXACT;
pols.back().zones[0] = Constraint::exact();
pols.back().zones[1] = Constraint::exact();
pols.back().zones[2] = Constraint::exact(); // not used
pols.back().overall = &oall;
}
/**
* Return a set of 2 seeds encapsulating a half-and-half 1mm strategy.
*/
void
Seed::oneMmSeeds(int ln, EList<Seed>& pols, Constraint& oall) {
oall.init();
// Seed policy 1: left-to-right search
pols.expand();
pols.back().len = ln;
pols.back().type = SEED_TYPE_LEFT_TO_RIGHT;
pols.back().zones[0] = Constraint::exact();
pols.back().zones[1] = Constraint::mmBased(1);
pols.back().zones[2] = Constraint::exact(); // not used
pols.back().overall = &oall;
// Seed policy 2: right-to-left search
pols.expand();
pols.back().len = ln;
pols.back().type = SEED_TYPE_RIGHT_TO_LEFT;
pols.back().zones[0] = Constraint::exact();
pols.back().zones[1] = Constraint::mmBased(1);
pols.back().zones[1].mmsCeil = 0;
pols.back().zones[2] = Constraint::exact(); // not used
pols.back().overall = &oall;
}
/**
* Return a set of 3 seeds encapsulating search roots for:
*
* 1. Starting from the left-hand side and searching toward the
* right-hand side allowing 2 mismatches in the right half.
* 2. Starting from the right-hand side and searching toward the
* left-hand side allowing 2 mismatches in the left half.
* 3. Starting (effectively) from the center and searching out toward
* both the left and right-hand sides, allowing one mismatch on
* either side.
*
* This is not exhaustive. There are 2 mismatch cases missed; if you
* imagine the seed as divided into four successive quarters A, B, C
* and D, the cases we miss are when mismatches occur in A and C or B
* and D.
*/
void
Seed::twoMmSeeds(int ln, EList<Seed>& pols, Constraint& oall) {
oall.init();
// Seed policy 1: left-to-right search
pols.expand();
pols.back().len = ln;
pols.back().type = SEED_TYPE_LEFT_TO_RIGHT;
pols.back().zones[0] = Constraint::exact();
pols.back().zones[1] = Constraint::mmBased(2);
pols.back().zones[2] = Constraint::exact(); // not used
pols.back().overall = &oall;
// Seed policy 2: right-to-left search
pols.expand();
pols.back().len = ln;
pols.back().type = SEED_TYPE_RIGHT_TO_LEFT;
pols.back().zones[0] = Constraint::exact();
pols.back().zones[1] = Constraint::mmBased(2);
pols.back().zones[1].mmsCeil = 1; // Must have used at least 1 mismatch
pols.back().zones[2] = Constraint::exact(); // not used
pols.back().overall = &oall;
// Seed policy 3: inside-out search
pols.expand();
pols.back().len = ln;
pols.back().type = SEED_TYPE_INSIDE_OUT;
pols.back().zones[0] = Constraint::exact();
pols.back().zones[1] = Constraint::mmBased(1);
pols.back().zones[1].mmsCeil = 0; // Must have used at least 1 mismatch
pols.back().zones[2] = Constraint::mmBased(1);
pols.back().zones[2].mmsCeil = 0; // Must have used at least 1 mismatch
pols.back().overall = &oall;
}
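/*
 * Editor's sketch (not from the original source): these builders are
 * normally reached through Seed::mmSeeds (aligner_seed.h), e.g.
 *
 *   EList<Seed> pols;
 *   Constraint oall;
 *   Seed::mmSeeds(2, 22, pols, oall);  // 22 bp seeds, up to 2 mismatches
 *
 * which leaves the three policies built above in 'pols', all sharing
 * 'oall' as the overall constraint. As noted above, mismatch pairs
 * falling in quarters A+C or B+D are the cases this trio misses.
 */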
/**
* Types of actions that can be taken by the SeedAligner.
*/
enum {
SA_ACTION_TYPE_RESET = 1,
SA_ACTION_TYPE_SEARCH_SEED, // 2
SA_ACTION_TYPE_FTAB, // 3
SA_ACTION_TYPE_FCHR, // 4
SA_ACTION_TYPE_MATCH, // 5
SA_ACTION_TYPE_EDIT // 6
};
#define MIN(x, y) ((x < y) ? x : y)
#ifdef ALIGNER_SEED_MAIN
#include <getopt.h>
#include <string>
/**
* Parse an int out of the given argument string; on a parse failure,
* print the given error message and exit with an error.
*/
static int parseInt(const char *errmsg, const char *arg) {
long l;
char *endPtr = NULL;
l = strtol(arg, &endPtr, 10);
// strtol never leaves endPtr NULL; require at least one parsed digit
if (endPtr != arg) {
return (int32_t)l;
}
cerr << errmsg << endl;
throw 1;
return -1;
}
enum {
ARG_NOFW = 256,
ARG_NORC,
ARG_MM,
ARG_SHMEM,
ARG_TESTS,
ARG_RANDOM_TESTS,
ARG_SEED
};
static const char *short_opts = "vCt";
static struct option long_opts[] = {
{(char*)"verbose", no_argument, 0, 'v'},
{(char*)"color", no_argument, 0, 'C'},
{(char*)"timing", no_argument, 0, 't'},
{(char*)"nofw", no_argument, 0, ARG_NOFW},
{(char*)"norc", no_argument, 0, ARG_NORC},
{(char*)"mm", no_argument, 0, ARG_MM},
{(char*)"shmem", no_argument, 0, ARG_SHMEM},
{(char*)"tests", no_argument, 0, ARG_TESTS},
{(char*)"random", required_argument, 0, ARG_RANDOM_TESTS},
{(char*)"seed", required_argument, 0, ARG_SEED},
};
static void printUsage(ostream& os) {
os << "Usage: ac [options]* <index> <patterns>" << endl;
os << "Options:" << endl;
os << " --mm memory-mapped mode" << endl;
os << " --shmem shared memory mode" << endl;
os << " --nofw don't align forward-oriented read" << endl;
os << " --norc don't align reverse-complemented read" << endl;
os << " -t/--timing show timing information" << endl;
os << " -C/--color colorspace mode" << endl;
os << " -v/--verbose talkative mode" << endl;
}
bool gNorc = false;
bool gNofw = false;
bool gColor = false;
int gVerbose = 0;
int gGapBarrier = 1;
bool gColorExEnds = true;
int gSnpPhred = 30;
bool gReportOverhangs = true;
extern void aligner_seed_tests();
extern void aligner_random_seed_tests(
int num_tests,
uint32_t qslo,
uint32_t qshi,
bool color,
uint32_t seed);
/**
* A way of feeding simple tests to the seed alignment infrastructure.
*/
int main(int argc, char **argv) {
bool useMm = false;
bool useShmem = false;
bool mmSweep = false;
bool noRefNames = false;
bool sanity = false;
bool timing = false;
int option_index = 0;
int seed = 777;
int next_option;
do {
next_option = getopt_long(
argc, argv, short_opts, long_opts, &option_index);
switch (next_option) {
case 'v': gVerbose = true; break;
case 'C': gColor = true; break;
case 't': timing = true; break;
case ARG_NOFW: gNofw = true; break;
case ARG_NORC: gNorc = true; break;
case ARG_MM: useMm = true; break;
case ARG_SHMEM: useShmem = true; break;
case ARG_SEED: seed = parseInt("", optarg); break;
case ARG_TESTS: {
aligner_seed_tests();
aligner_random_seed_tests(
100, // num references
100, // queries per reference lo
400, // queries per reference hi
false, // true -> generate colorspace reference/reads
18); // pseudo-random seed
return 0;
}
case ARG_RANDOM_TESTS: {
seed = parseInt("", optarg);
aligner_random_seed_tests(
100, // num references
100, // queries per reference lo
400, // queries per reference hi
false, // true -> generate colorspace reference/reads
seed); // pseudo-random seed
return 0;
}
case -1: break;
default: {
cerr << "Unknown option: " << (char)next_option << endl;
printUsage(cerr);
exit(1);
}
}
} while(next_option != -1);
char *reffn;
if(optind >= argc) {
cerr << "No reference; quitting..." << endl;
return 1;
}
reffn = argv[optind++];
if(optind >= argc) {
cerr << "No reads; quitting..." << endl;
return 1;
}
string ebwtBase(reffn);
BitPairReference ref(
ebwtBase, // base path
gColor, // whether we expect it to be colorspace
sanity, // whether to sanity-check reference as it's loaded
NULL, // fasta files to sanity check reference against
NULL, // another way of specifying original sequences
false, // true -> infiles (2 args ago) contains raw seqs
useMm, // use memory mapping to load index?
useShmem, // use shared memory (not memory mapping)
mmSweep, // touch all the pages after memory-mapping the index
gVerbose, // verbose
gVerbose); // verbose but just for startup messages
Timer *t = new Timer(cerr, "Time loading fw index: ", timing);
Ebwt ebwtFw(
ebwtBase,
gColor, // index is colorspace
0, // don't need entireReverse for fw index
true, // index is for the forward direction
-1, // offrate (irrelevant)
useMm, // whether to use memory-mapped files
useShmem, // whether to use shared memory
mmSweep, // sweep memory-mapped files
!noRefNames, // load names?
false, // load SA sample?
true, // load ftab?
true, // load rstarts?
NULL, // reference map, or NULL if none is needed
gVerbose, // whether to be talkative
gVerbose, // talkative during initialization
false, // handle memory exceptions, don't pass them up
sanity);
delete t;
t = new Timer(cerr, "Time loading bw index: ", timing);
Ebwt ebwtBw(
ebwtBase + ".rev",
gColor, // index is colorspace
1, // need entireReverse
false, // index is for the backward direction
-1, // offrate (irrelevant)
useMm, // whether to use memory-mapped files
useShmem, // whether to use shared memory
mmSweep, // sweep memory-mapped files
!noRefNames, // load names?
false, // load SA sample?
true, // load ftab?
false, // load rstarts?
NULL, // reference map, or NULL if none is needed
gVerbose, // whether to be talkative
gVerbose, // talkative during initialization
false, // handle memory exceptions, don't pass them up
sanity);
delete t;
for(int i = optind; i < argc; i++) {
}
}
#endif
centrifuge-f39767eb57d8e175029c/aligner_seed.h 0000664 0000000 0000000 00000244216 13021605047 0020574 0 ustar 00root root 0000000 0000000 /*
* Copyright 2011, Ben Langmead
*
* This file is part of Bowtie 2.
*
* Bowtie 2 is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Bowtie 2 is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Bowtie 2. If not, see <http://www.gnu.org/licenses/>.
*/
#ifndef ALIGNER_SEED_H_
#define ALIGNER_SEED_H_
#include <iostream>
#include <utility>
#include <limits>
#include "qual.h"
#include "ds.h"
#include "sstring.h"
#include "alphabet.h"
#include "edit.h"
#include "read.h"
// Threading is necessary to synchronize the classes that dump
// intermediate alignment results to files. Otherwise, all data herein
// is constant and shared, or per-thread.
#include "threading.h"
#include "aligner_result.h"
#include "aligner_cache.h"
#include "scoring.h"
#include "mem_ids.h"
#include "simple_func.h"
#include "btypes.h"
/**
* A constraint to apply to an alignment zone, or to an overall
* alignment.
*
* The constraint can put both caps and ceilings on the number and
* types of edits allowed.
*/
struct Constraint {
Constraint() { init(); }
/**
* Initialize Constraint to be fully permissive.
*/
void init() {
edits = mms = ins = dels = penalty = editsCeil = mmsCeil =
insCeil = delsCeil = penaltyCeil = MAX_I;
penFunc.reset();
instantiated = false;
}
/**
* Return true iff penalities and constraints prevent us from
* adding any edits.
*/
bool mustMatch() {
assert(instantiated);
return (mms == 0 && edits == 0) ||
penalty == 0 ||
(mms == 0 && dels == 0 && ins == 0);
}
/**
* Return true iff a mismatch of the given quality is permitted.
*/
bool canMismatch(int q, const Scoring& cm) {
assert(instantiated);
return (mms > 0 || edits > 0) &&
penalty >= cm.mm(q);
}
/**
* Return true iff an N mismatch of the given quality is permitted.
*/
bool canN(int q, const Scoring& cm) {
assert(instantiated);
return (mms > 0 || edits > 0) &&
penalty >= cm.n(q);
}
/**
* Return true iff a mismatch of *any* quality (even qual=1) is
* permitted.
*/
bool canMismatch() {
assert(instantiated);
return (mms > 0 || edits > 0) && penalty > 0;
}
/**
* Return true iff an N mismatch of *any* quality (even qual=1) is
* permitted.
*/
bool canN() {
assert(instantiated);
return (mms > 0 || edits > 0);
}
/**
* Return true iff a deletion of the given extension (0=open, 1=1st
* extension, etc) is permitted.
*/
bool canDelete(int ex, const Scoring& cm) {
assert(instantiated);
return (dels > 0 && edits > 0) &&
penalty >= cm.del(ex);
}
/**
* Return true iff a deletion of any extension is permitted.
*/
bool canDelete() {
assert(instantiated);
return (dels > 0 || edits > 0) &&
penalty > 0;
}
/**
* Return true iff an insertion of the given extension (0=open,
* 1=1st extension, etc) is permitted.
*/
bool canInsert(int ex, const Scoring& cm) {
assert(instantiated);
return (ins > 0 || edits > 0) &&
penalty >= cm.ins(ex);
}
/**
* Return true iff an insertion of any extension is permitted.
*/
bool canInsert() {
assert(instantiated);
return (ins > 0 || edits > 0) &&
penalty > 0;
}
/**
* Return true iff a gap of any extension is permitted
*/
bool canGap() {
assert(instantiated);
return ((ins > 0 || dels > 0) || edits > 0) && penalty > 0;
}
/**
* Charge a mismatch of the given quality.
*/
void chargeMismatch(int q, const Scoring& cm) {
assert(instantiated);
if(mms == 0) { assert_gt(edits, 0); edits--; }
else mms--;
penalty -= cm.mm(q);
assert_geq(mms, 0);
assert_geq(edits, 0);
assert_geq(penalty, 0);
}
/**
* Charge an N mismatch of the given quality.
*/
void chargeN(int q, const Scoring& cm) {
assert(instantiated);
if(mms == 0) { assert_gt(edits, 0); edits--; }
else mms--;
penalty -= cm.n(q);
assert_geq(mms, 0);
assert_geq(edits, 0);
assert_geq(penalty, 0);
}
/**
* Charge a deletion of the given extension.
*/
void chargeDelete(int ex, const Scoring& cm) {
assert(instantiated);
dels--;
edits--;
penalty -= cm.del(ex);
assert_geq(dels, 0);
assert_geq(edits, 0);
assert_geq(penalty, 0);
}
/**
* Charge an insertion of the given extension.
*/
void chargeInsert(int ex, const Scoring& cm) {
assert(instantiated);
ins--;
edits--;
penalty -= cm.ins(ex);
assert_geq(ins, 0);
assert_geq(edits, 0);
assert_geq(penalty, 0);
}
/**
* Once the constrained area is completely explored, call this
* function to check whether there were *at least* as many
* dissimilarities as required by the constraint. Bounds like this
* are helpful to resolve instances where two search roots would
* otherwise overlap in what alignments they can find.
*/
bool acceptable() {
assert(instantiated);
return edits <= editsCeil &&
mms <= mmsCeil &&
ins <= insCeil &&
dels <= delsCeil &&
penalty <= penaltyCeil;
}
/**
* Instantiate a constraint w/r/t the read length and the constant
* and linear coefficients for the penalty function.
*/
static int instantiate(size_t rdlen, const SimpleFunc& func) {
return func.f((double)rdlen);
}
/**
* Instantiate this constraint w/r/t the read length.
*/
void instantiate(size_t rdlen) {
assert(!instantiated);
if(penFunc.initialized()) {
penalty = Constraint::instantiate(rdlen, penFunc);
}
instantiated = true;
}
int edits; // # edits permitted
int mms; // # mismatches permitted
int ins; // # insertions permitted
int dels; // # deletions permitted
int penalty; // penalty total permitted
int editsCeil; // <= this many edits can be left at the end
int mmsCeil; // <= this many mismatches can be left at the end
int insCeil; // <= this many inserts can be left at the end
int delsCeil; // <= this many deletions can be left at the end
int penaltyCeil;// <= this much leftover penalty can be left at the end
SimpleFunc penFunc;// penalty function; function of read len
bool instantiated; // whether constraint is instantiated w/r/t read len
//
// Some static methods for constructing some standard Constraints
//
/**
* Construct a constraint with no edits of any kind allowed.
*/
static Constraint exact();
/**
* Construct a constraint where the only constraint is a total
* penalty constraint.
*/
static Constraint penaltyBased(int pen);
/**
* Construct a constraint where the only constraint is a total
* penalty constraint related to the length of the read.
*/
static Constraint penaltyFuncBased(const SimpleFunc& func);
/**
* Construct a constraint where the only constraint is on the
* number of mismatches.
*/
static Constraint mmBased(int mms);
/**
* Construct a constraint where the only constraint is on the
* total number of edits.
*/
static Constraint editBased(int edits);
};
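/*
 * Editor's sketch (not from the original source): the typical life cycle
 * of a Constraint, with 'sc' standing in for a configured Scoring object
 * and 'q' for a Phred quality.
 *
 *   Constraint c = Constraint::mmBased(1); // 1 mismatch, no gaps
 *   c.instantiate(readLen);                // resolve penFunc if present
 *   if(c.canMismatch(q, sc)) {
 *       c.chargeMismatch(q, sc);           // spend part of the budget
 *   }
 *   bool ok = c.acceptable();              // ceiling check at zone close
 */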
/**
* We divide seed search strategies into three categories:
*
* 1. A left-to-right search where the left half of the read is
* constrained to match exactly and the right half is subject to
* some looser constraint (e.g. 1mm or 2mm)
* 2. Same as 1, but going right to left with the exact matching half
* on the right.
* 3. Inside-out search where the center half of the read is
* constrained to match exactly, and the extreme quarters of the
* read are subject to a looser constraint.
*/
enum {
SEED_TYPE_EXACT = 1,
SEED_TYPE_LEFT_TO_RIGHT,
SEED_TYPE_RIGHT_TO_LEFT,
SEED_TYPE_INSIDE_OUT
};
struct InstantiatedSeed;
/**
* Policy dictating how to size and arrange seeds along the length of
* the read, and what constraints to force on the zones of the seed.
* We assume that seeds are plopped down at regular intervals from the
* 5' to 3' ends, with the first seed flush to the 5' end.
*
* If the read is shorter than a single seed, one seed is used and it
* is shrunk to accommodate the read.
*/
struct Seed {
int len; // length of a seed
int type; // dictates anchor portion, direction of search
Constraint *overall; // for the overall alignment
Seed() { init(0, 0, NULL); }
/**
* Construct and initialize this seed with given length and type.
*/
Seed(int ln, int ty, Constraint* oc) {
init(ln, ty, oc);
}
/**
* Initialize this seed with given length and type.
*/
void init(int ln, int ty, Constraint* oc) {
len = ln;
type = ty;
overall = oc;
}
// If the seed is split into halves, we just use zones[0] and
// zones[1]; 0 is the near half and 1 is the far half. If the seed
// is split into thirds (i.e. inside-out) then 0 is the center, 1
// is the far portion on the left, and 2 is the far portion on the
// right.
Constraint zones[3];
/**
* Once the constrained seed is completely explored, call this
* function to check whether there were *at least* as many
* dissimilarities as required by all constraints. Bounds like this
* are helpful to resolve instances where two search roots would
* otherwise overlap in what alignments they can find.
*/
bool acceptable() {
assert(overall != NULL);
return zones[0].acceptable() &&
zones[1].acceptable() &&
zones[2].acceptable() &&
overall->acceptable();
}
/**
* Given a read, depth and orientation, extract a seed data structure
* from the read and fill in the steps & zones arrays. The Seed
* contains the sequence and quality values.
*/
bool instantiate(
const Read& read,
const BTDnaString& seq, // already-extracted seed sequence
const BTString& qual, // already-extracted seed quality sequence
const Scoring& pens,
int depth,
int seedoffidx,
int seedtypeidx,
bool fw,
InstantiatedSeed& si) const;
/**
* Return a list of Seed objects encapsulating the policy for the
* given number of mismatches.
*/
static void mmSeeds(
int mms,
int ln,
EList<Seed>& pols,
Constraint& oall)
{
if(mms == 0) {
zeroMmSeeds(ln, pols, oall);
} else if(mms == 1) {
oneMmSeeds(ln, pols, oall);
} else if(mms == 2) {
twoMmSeeds(ln, pols, oall);
} else throw 1;
}
static void zeroMmSeeds(int ln, EList<Seed>&, Constraint&);
static void oneMmSeeds (int ln, EList<Seed>&, Constraint&);
static void twoMmSeeds (int ln, EList<Seed>&, Constraint&);
};
/**
* An instantiated seed is a seed (perhaps modified to fit the read)
* plus all data needed to conduct a search of the seed.
*/
struct InstantiatedSeed {
InstantiatedSeed() : steps(AL_CAT), zones(AL_CAT) { }
// Steps map. There are as many steps as there are positions in
// the seed. The map is a helpful abstraction because we sometimes
// visit seed positions in an irregular order (e.g. inside-out
// search).
EList<int> steps;
// Zones map. For each step, records what constraint to charge an
// edit to. The first entry in each pair gives the constraint for
// non-insert edits and the second entry in each pair gives the
// constraint for insert edits. If the value stored is negative,
// this indicates that the zone is "closed out" after this
// position, so zone acceptability should be checked.
EList<pair<int, int> > zones;
// Nucleotide sequence covering the seed, extracted from read
BTDnaString *seq;
// Quality sequence covering the seed, extracted from read
BTString *qual;
// Initial constraints governing zones 0, 1, 2. We precalculate
// the effect of Ns on these.
Constraint cons[3];
// Overall constraint, tailored to the read length.
Constraint overall;
// Maximum number of positions that the aligner may advance before
// its first step. This lets the aligner know whether it can use
// the ftab or not.
int maxjump;
// Offset of seed from 5' end of read
int seedoff;
// Id for seed offset; ids are such that the smallest index is the
// closest to the 5' end and consecutive ids are adjacent (i.e.
// there are no intervening offsets with seeds)
int seedoffidx;
// Type of seed (left-to-right, etc)
int seedtypeidx;
// Seed comes from forward-oriented read?
bool fw;
// Filtered out due to the pattern of Ns present. If true, this
// seed should be ignored by searchAllSeeds().
bool nfiltered;
// Seed this was instantiated from
Seed s;
#ifndef NDEBUG
/**
* Check that InstantiatedSeed is internally consistent.
*/
bool repOk() const {
return true;
}
#endif
};
/**
* Simple struct for holding an end-to-end alignment for the read with at most
* 2 edits.
*/
template <typename index_t>
struct EEHit {
EEHit() { reset(); }
void reset() {
top = bot = 0;
fw = false;
e1.reset();
e2.reset();
score = MIN_I64;
}
void init(
index_t top_,
index_t bot_,
const Edit* e1_,
const Edit* e2_,
bool fw_,
int64_t score_)
{
top = top_; bot = bot_;
if(e1_ != NULL) {
e1 = *e1_;
} else {
e1.reset();
}
if(e2_ != NULL) {
e2 = *e2_;
} else {
e2.reset();
}
fw = fw_;
score = score_;
}
/**
* Return number of mismatches in the alignment.
*/
int mms() const {
if (e2.inited()) return 2;
else if(e1.inited()) return 1;
else return 0;
}
/**
* Return the number of Ns involved in the alignment.
*/
int ns() const {
int ns = 0;
if(e1.inited() && e1.hasN()) {
ns++;
if(e2.inited() && e2.hasN()) {
ns++;
}
}
return ns;
}
/**
* Return the number of reference Ns involved in the alignment.
*/
int refns() const {
int ns = 0;
if(e1.inited() && e1.chr == 'N') {
ns++;
if(e2.inited() && e2.chr == 'N') {
ns++;
}
}
return ns;
}
/**
* Return true iff there is no hit.
*/
bool empty() const {
return bot <= top;
}
/**
* Higher score = higher priority.
*/
bool operator<(const EEHit& o) const {
return score > o.score;
}
/**
* Return the size of the alignment's SA range.
*/
index_t size() const { return bot - top; }
#ifndef NDEBUG
/**
* Check that hit is sane w/r/t read.
*/
bool repOk(const Read& rd) const {
assert_gt(bot, top);
if(e1.inited()) {
assert_lt(e1.pos, rd.length());
if(e2.inited()) {
assert_lt(e2.pos, rd.length());
}
}
return true;
}
#endif
index_t top;
index_t bot;
Edit e1;
Edit e2;
bool fw;
int64_t score;
};
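/*
 * Editor's sketch (not from the original source): recording a 1-mismatch
 * end-to-end hit, with 'e' standing for an Edit describing the mismatch
 * and [top, bot) the BW range of the hit.
 *
 *   EEHit<uint32_t> hit;
 *   hit.init(top, bot, &e, NULL, true, -6);
 *   // hit.mms() == 1, hit.size() == bot - top; operator< orders hits
 *   // so that higher-scoring ones sort first.
 */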
/**
* Data structure for holding all of the seed hits associated with a read. All
* the seed hits for a given read are encapsulated in a single QVal object. A
* QVal refers to a range of values in the qlist, where each qlist value is a
* BW range and a slot to hold the hit's suffix array offset. QVals are kept
* in two lists (hitsFw_ and hitsRc_), one for seeds on the forward read strand,
* one for seeds on the reverse read strand. The list is indexed by read
* offset index (e.g. 0=closest-to-5', 1=second-closest, etc).
*
* An assumption behind this data structure is that all the seeds are found
* first, then downstream analyses try to extend them. In between finding the
* seed hits and extending them, the sort() member function is called, which
* ranks QVals according to the order they should be extended. Right now the
* policy is that QVals with fewer elements (hits) should be tried first.
*/
template <typename index_t>
class SeedResults {
public:
SeedResults() :
seqFw_(AL_CAT),
seqRc_(AL_CAT),
qualFw_(AL_CAT),
qualRc_(AL_CAT),
hitsFw_(AL_CAT),
hitsRc_(AL_CAT),
isFw_(AL_CAT),
isRc_(AL_CAT),
sortedFw_(AL_CAT),
sortedRc_(AL_CAT),
offIdx2off_(AL_CAT),
rankOffs_(AL_CAT),
rankFws_(AL_CAT),
mm1Hit_(AL_CAT)
{
clear();
}
/**
* Set the current read.
*/
void nextRead(const Read& read) {
read_ = &read;
}
/**
* Set the appropriate element of either hitsFw_ or hitsRc_ to the given
* QVal. A QVal encapsulates all the BW ranges for reference substrings
* that are within some distance of the seed string.
*/
void add(
const QVal<index_t>& qv, // range of ranges in cache
const AlignmentCache<index_t>& ac, // cache
index_t seedIdx, // seed index (from 5' end)
bool seedFw) // whether seed is from forward read
{
assert(qv.repOk(ac));
assert(repOk(&ac));
assert_lt(seedIdx, hitsFw_.size());
assert_gt(numOffs_, 0); // if this fails, probably failed to call reset
if(qv.empty()) return;
if(seedFw) {
assert(!hitsFw_[seedIdx].valid());
hitsFw_[seedIdx] = qv;
numEltsFw_ += qv.numElts();
numRangesFw_ += qv.numRanges();
if(qv.numRanges() > 0) nonzFw_++;
} else {
assert(!hitsRc_[seedIdx].valid());
hitsRc_[seedIdx] = qv;
numEltsRc_ += qv.numElts();
numRangesRc_ += qv.numRanges();
if(qv.numRanges() > 0) nonzRc_++;
}
numElts_ += qv.numElts();
numRanges_ += qv.numRanges();
if(qv.numRanges() > 0) {
nonzTot_++;
}
assert(repOk(&ac));
}
/**
* Clear buffered seed hits and state. Set the number of seed
* offsets and the read.
*/
void reset(
const Read& read,
const EList<index_t>& offIdx2off,
size_t numOffs)
{
assert_gt(numOffs, 0);
clearSeeds();
numOffs_ = numOffs;
seqFw_.resize(numOffs_);
seqRc_.resize(numOffs_);
qualFw_.resize(numOffs_);
qualRc_.resize(numOffs_);
hitsFw_.resize(numOffs_);
hitsRc_.resize(numOffs_);
isFw_.resize(numOffs_);
isRc_.resize(numOffs_);
sortedFw_.resize(numOffs_);
sortedRc_.resize(numOffs_);
offIdx2off_ = offIdx2off;
for(size_t i = 0; i < numOffs_; i++) {
sortedFw_[i] = sortedRc_[i] = false;
hitsFw_[i].reset();
hitsRc_[i].reset();
isFw_[i].clear();
isRc_[i].clear();
}
read_ = &read;
sorted_ = false;
}
/**
* Clear seed-hit state.
*/
void clearSeeds() {
sortedFw_.clear();
sortedRc_.clear();
rankOffs_.clear();
rankFws_.clear();
offIdx2off_.clear();
hitsFw_.clear();
hitsRc_.clear();
isFw_.clear();
isRc_.clear();
seqFw_.clear();
seqRc_.clear();
nonzTot_ = 0;
nonzFw_ = 0;
nonzRc_ = 0;
numOffs_ = 0;
numRanges_ = 0;
numElts_ = 0;
numRangesFw_ = 0;
numEltsFw_ = 0;
numRangesRc_ = 0;
numEltsRc_ = 0;
}
/**
* Clear seed-hit state and end-to-end alignment state.
*/
void clear() {
clearSeeds();
read_ = NULL;
exactFwHit_.reset();
exactRcHit_.reset();
mm1Hit_.clear();
mm1Sorted_ = false;
mm1Elt_ = 0;
assert(empty());
}
/**
* Return average number of hits per seed.
*/
float averageHitsPerSeed() const {
return (float)numElts_ / (float)nonzTot_;
}
/**
* Return median of all the non-zero per-seed # hits
*/
float medianHitsPerSeed() const {
EList<size_t>& median = const_cast<EList<size_t>&>(tmpMedian_);
median.clear();
for(size_t i = 0; i < numOffs_; i++) {
if(hitsFw_[i].valid() && hitsFw_[i].numElts() > 0) {
median.push_back(hitsFw_[i].numElts());
}
if(hitsRc_[i].valid() && hitsRc_[i].numElts() > 0) {
median.push_back(hitsRc_[i].numElts());
}
}
if(tmpMedian_.empty()) {
return 0.0f;
}
median.sort();
float med1 = (float)median[median.size() >> 1];
float med2 = med1;
if((median.size() & 1) == 0) {
med2 = (float)median[(median.size() >> 1) - 1];
}
return (med1 + med2) * 0.5f;
}
/**
* Return a number that's meant to quantify how hopeful we are that this
* set of seed hits will lead to good alignments.
*/
double uniquenessFactor() const {
double result = 0.0;
for(size_t i = 0; i < numOffs_; i++) {
if(hitsFw_[i].valid()) {
size_t nelt = hitsFw_[i].numElts();
result += (1.0 / (double)(nelt * nelt));
}
if(hitsRc_[i].valid()) {
size_t nelt = hitsRc_[i].numElts();
result += (1.0 / (double)(nelt * nelt));
}
}
return result;
}
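// Worked example (hypothetical counts): two seeds that each hit the
// reference exactly once contribute 1/(1*1) + 1/(1*1) = 2.0 to the
// factor, while a seed with 10 hits adds only 1/(10*10) = 0.01, so
// near-unique seeds dominate the uniqueness factor.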
/**
* Return the number of ranges being held.
*/
index_t numRanges() const { return numRanges_; }
/**
* Return the number of elements being held.
*/
index_t numElts() const { return numElts_; }
/**
* Return the number of ranges being held for seeds on the forward
* read strand.
*/
index_t numRangesFw() const { return numRangesFw_; }
/**
* Return the number of elements being held for seeds on the
* forward read strand.
*/
index_t numEltsFw() const { return numEltsFw_; }
/**
* Return the number of ranges being held for seeds on the
* reverse-complement read strand.
*/
index_t numRangesRc() const { return numRangesRc_; }
/**
* Return the number of elements being held for seeds on the
* reverse-complement read strand.
*/
index_t numEltsRc() const { return numEltsRc_; }
/**
* Given an offset index, return the offset that has that index.
*/
index_t idx2off(size_t off) const {
return offIdx2off_[off];
}
/**
* Return true iff there are 0 hits being held.
*/
bool empty() const { return numRanges() == 0; }
/**
* Get the QVal representing all the reference hits for the given
* orientation and seed offset index.
*/
const QVal<index_t>& hitsAtOffIdx(bool fw, size_t seedoffidx) const {
assert_lt(seedoffidx, numOffs_);
assert(repOk(NULL));
return fw ? hitsFw_[seedoffidx] : hitsRc_[seedoffidx];
}
/**
* Get the Instantiated seeds for the given orientation and offset.
*/
EList<InstantiatedSeed>& instantiatedSeeds(bool fw, size_t seedoffidx) {
assert_lt(seedoffidx, numOffs_);
assert(repOk(NULL));
return fw ? isFw_[seedoffidx] : isRc_[seedoffidx];
}
/**
* Return the number of different seed offsets possible.
*/
index_t numOffs() const { return numOffs_; }
/**
* Return the read from which seeds were extracted, aligned.
*/
const Read& read() const { return *read_; }
#ifndef NDEBUG
/**
* Check that this SeedResults is internally consistent.
*/
bool repOk(
const AlignmentCache<index_t>* ac,
bool requireInited = false) const
{
if(requireInited) {
assert(read_ != NULL);
}
if(numOffs_ > 0) {
assert_eq(numOffs_, hitsFw_.size());
assert_eq(numOffs_, hitsRc_.size());
assert_leq(numRanges_, numElts_);
assert_leq(nonzTot_, numRanges_);
size_t nonzs = 0;
for(int fw = 0; fw <= 1; fw++) {
const EList<QVal<index_t> >& rrs = (fw ? hitsFw_ : hitsRc_);
for(size_t i = 0; i < numOffs_; i++) {
if(rrs[i].valid()) {
if(rrs[i].numRanges() > 0) nonzs++;
if(ac != NULL) {
assert(rrs[i].repOk(*ac));
}
}
}
}
assert_eq(nonzs, nonzTot_);
assert(!sorted_ || nonzTot_ == rankFws_.size());
assert(!sorted_ || nonzTot_ == rankOffs_.size());
}
return true;
}
#endif
/**
* Populate rankOffs_ and rankFws_ with the list of QVals that need to be
* examined for this SeedResults, in order. The order is ascending by
* number of elements, so QVals with fewer elements (i.e. seed sequences
* that are more unique) will be tried first and QVals with more elements
* (i.e. seed sequences that are more repetitive) will be tried later.
*/
void rankSeedHits(RandomSource& rnd) {
while(rankOffs_.size() < nonzTot_) {
index_t minsz = (index_t)0xffffffff;
index_t minidx = 0;
bool minfw = true;
// Rank seed-hit positions in ascending order by number of elements
// in all BW ranges
bool rb = rnd.nextBool();
assert(rb == 0 || rb == 1);
for(int fwi = 0; fwi <= 1; fwi++) {
bool fw = (fwi == (rb ? 1 : 0));
EList<QVal<index_t> >& rrs = (fw ? hitsFw_ : hitsRc_);
EList<bool>& sorted = (fw ? sortedFw_ : sortedRc_);
index_t i = (rnd.nextU32() % (index_t)numOffs_);
for(index_t ii = 0; ii < numOffs_; ii++) {
if(rrs[i].valid() && // valid QVal
rrs[i].numElts() > 0 && // non-empty
!sorted[i] && // not already sorted
rrs[i].numElts() < minsz) // least elts so far?
{
minsz = rrs[i].numElts();
minidx = i;
minfw = fw;
}
if((++i) == numOffs_) {
i = 0;
}
}
}
assert_neq((index_t)0xffffffff, minsz);
if(minfw) {
sortedFw_[minidx] = true;
} else {
sortedRc_[minidx] = true;
}
rankOffs_.push_back(minidx);
rankFws_.push_back(minfw);
}
assert_eq(rankOffs_.size(), rankFws_.size());
sorted_ = true;
}
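// Worked example (hypothetical counts): suppose numOffs_ == 3 and the
// per-offset element counts are fw = {3, 0, 7} and rc = {2, 5, 0}.
// Setting aside the randomized starting points used for tie-breaking,
// rankSeedHits() emits ranks in ascending order of size:
// rank 0 -> (rc, offset 0, 2 elts)
// rank 1 -> (fw, offset 0, 3 elts)
// rank 2 -> (rc, offset 1, 5 elts)
// rank 3 -> (fw, offset 2, 7 elts)
// so the most unique seed sequences are extended first.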
/**
* Return the number of orientation/offsets into the read that have
* at least one seed hit.
*/
size_t nonzeroOffsets() const {
assert(!sorted_ || nonzTot_ == rankFws_.size());
assert(!sorted_ || nonzTot_ == rankOffs_.size());
return nonzTot_;
}
/**
* Return true iff all seeds hit for forward read.
*/
bool allFwSeedsHit() const {
return nonzFw_ == numOffs();
}
/**
* Return true iff all seeds hit for revcomp read.
*/
bool allRcSeedsHit() const {
return nonzRc_ == numOffs();
}
/**
* Return the minimum number of edits that an end-to-end alignment of the
* fw read could have. Uses knowledge of how many seeds have exact hits
* and how the seeds overlap.
*/
index_t fewestEditsEE(bool fw, int seedlen, int per) const {
assert_gt(seedlen, 0);
assert_gt(per, 0);
index_t nonz = fw ? nonzFw_ : nonzRc_;
if(nonz < numOffs()) {
int maxdepth = (seedlen + per - 1) / per;
int missing = (int)(numOffs() - nonz);
return (missing + maxdepth - 1) / maxdepth;
} else {
// Exact hit is possible (not guaranteed)
return 0;
}
}
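// Worked example (hypothetical numbers): with seedlen == 20 and
// per == 10, one edit can spoil at most maxdepth == (20+10-1)/10 == 2
// overlapping seeds. If 3 seeds failed to hit (missing == 3), any
// end-to-end alignment needs at least (3+2-1)/2 == 2 edits.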
/**
* Return the number of offsets into the forward read that have at
* least one seed hit.
*/
index_t nonzeroOffsetsFw() const {
return nonzFw_;
}
/**
* Return the number of offsets into the reverse-complement read
* that have at least one seed hit.
*/
index_t nonzeroOffsetsRc() const {
return nonzRc_;
}
/**
* Return a QVal of seed hits of the given rank 'r'. 'offidx' gets the id
* of the offset from 5' from which it was extracted (0 for the 5'-most
* offset, 1 for the next closest to 5', etc). 'off' gets the offset from
* the 5' end. 'fw' gets true iff the seed was extracted from the forward
* read.
*/
const QVal<index_t>& hitsByRank(
index_t r, // in
index_t& offidx, // out
index_t& off, // out
bool& fw, // out
index_t& seedlen) // out
{
assert(sorted_);
assert_lt(r, nonzTot_);
if(rankFws_[r]) {
fw = true;
offidx = rankOffs_[r];
assert_lt(offidx, offIdx2off_.size());
off = offIdx2off_[offidx];
seedlen = (index_t)seqFw_[rankOffs_[r]].length();
return hitsFw_[rankOffs_[r]];
} else {
fw = false;
offidx = rankOffs_[r];
assert_lt(offidx, offIdx2off_.size());
off = offIdx2off_[offidx];
seedlen = (index_t)seqRc_[rankOffs_[r]].length();
return hitsRc_[rankOffs_[r]];
}
}
/**
* Return the seed sequence for the seed hit of the given rank.
*/
const BTDnaString& seqByRank(index_t r) {
assert(sorted_);
assert_lt(r, nonzTot_);
return rankFws_[r] ? seqFw_[rankOffs_[r]] : seqRc_[rankOffs_[r]];
}
/**
* Return the quality string for the seed hit of the given rank.
*/
const BTString& qualByRank(index_t r) {
assert(sorted_);
assert_lt(r, nonzTot_);
return rankFws_[r] ? qualFw_[rankOffs_[r]] : qualRc_[rankOffs_[r]];
}
/**
* Return the list of extracted seed sequences for seeds on either
* the forward or reverse strand.
*/
EList<BTDnaString>& seqs(bool fw) { return fw ? seqFw_ : seqRc_; }
/**
* Return the list of extracted quality sequences for seeds on
* either the forward or reverse strand.
*/
EList<BTString>& quals(bool fw) { return fw ? qualFw_ : qualRc_; }
/**
* Return exact end-to-end alignment of fw read.
*/
EEHit<index_t> exactFwEEHit() const { return exactFwHit_; }
/**
* Return exact end-to-end alignment of rc read.
*/
EEHit<index_t> exactRcEEHit() const { return exactRcHit_; }
/**
* Return const ref to list of 1-mismatch end-to-end alignments.
*/
const EList<EEHit<index_t> >& mm1EEHits() const { return mm1Hit_; }
/**
* Sort the end-to-end 1-mismatch alignments, prioritizing by score (higher
* score = higher priority).
*/
void sort1mmEe(RandomSource& rnd) {
assert(!mm1Sorted_);
mm1Hit_.sort();
size_t streak = 0;
for(size_t i = 1; i < mm1Hit_.size(); i++) {
if(mm1Hit_[i].score == mm1Hit_[i-1].score) {
if(streak == 0) { streak = 1; }
streak++;
} else {
if(streak > 1) {
assert_geq(i, streak);
mm1Hit_.shufflePortion(i-streak, streak, rnd);
}
streak = 0;
}
}
if(streak > 1) {
mm1Hit_.shufflePortion(mm1Hit_.size() - streak, streak, rnd);
}
mm1Sorted_ = true;
}
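// Example of the tie-breaking above (hypothetical scores): if the
// sorted scores are {-6, -6, -6, -11, -11, -17}, the streak of three
// -6 hits is shuffled among itself, then the two -11 hits, so
// equally-scored alignments are visited in random order.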
/**
* Add an end-to-end 1-mismatch alignment.
*/
void add1mmEe(
index_t top,
index_t bot,
const Edit* e1,
const Edit* e2,
bool fw,
int64_t score)
{
mm1Hit_.expand();
mm1Hit_.back().init(top, bot, e1, e2, fw, score);
mm1Elt_ += (bot - top);
}
/**
* Add an end-to-end exact alignment.
*/
void addExactEeFw(
index_t top,
index_t bot,
const Edit* e1,
const Edit* e2,
bool fw,
int64_t score)
{
exactFwHit_.init(top, bot, e1, e2, fw, score);
}
/**
* Add an end-to-end exact alignment.
*/
void addExactEeRc(
index_t top,
index_t bot,
const Edit* e1,
const Edit* e2,
bool fw,
int64_t score)
{
exactRcHit_.init(top, bot, e1, e2, fw, score);
}
/**
* Clear out the end-to-end exact alignments.
*/
void clearExactE2eHits() {
exactFwHit_.reset();
exactRcHit_.reset();
}
/**
* Clear out the end-to-end 1-mismatch alignments.
*/
void clear1mmE2eHits() {
mm1Hit_.clear(); // 1-mismatch end-to-end hits
mm1Elt_ = 0; // number of 1-mismatch hit rows
mm1Sorted_ = false; // true iff we've sorted the mm1Hit_ list
}
/**
* Return the number of distinct exact and 1-mismatch end-to-end hits
* found.
*/
index_t numE2eHits() const {
return (index_t)(exactFwHit_.size() + exactRcHit_.size() + mm1Elt_);
}
/**
* Return the number of distinct exact end-to-end hits found.
*/
index_t numExactE2eHits() const {
return (index_t)(exactFwHit_.size() + exactRcHit_.size());
}
/**
* Return the number of distinct 1-mismatch end-to-end hits found.
*/
index_t num1mmE2eHits() const {
return mm1Elt_;
}
/**
* Return the length of the read that yielded the seed hits.
*/
index_t readLength() const {
assert(read_ != NULL);
return read_->length();
}
protected:
// As seed hits and edits are added they're sorted into these
// containers
EList<BTDnaString> seqFw_; // seqs for seeds from forward read
EList<BTDnaString> seqRc_; // seqs for seeds from revcomp read
EList<BTString> qualFw_; // quals for seeds from forward read
EList<BTString> qualRc_; // quals for seeds from revcomp read
EList<QVal<index_t> > hitsFw_; // hits for forward read
EList<QVal<index_t> > hitsRc_; // hits for revcomp read
EList<EList<InstantiatedSeed> > isFw_; // instantiated seeds for forward read
EList<EList<InstantiatedSeed> > isRc_; // instantiated seeds for revcomp read
EList<bool> sortedFw_; // true iff fw QVal was sorted/ranked
EList<bool> sortedRc_; // true iff rc QVal was sorted/ranked
index_t nonzTot_; // # offsets with non-zero size
index_t nonzFw_; // # offsets into fw read with non-0 size
index_t nonzRc_; // # offsets into rc read with non-0 size
index_t numRanges_; // # ranges added
index_t numElts_; // # elements added
index_t numRangesFw_; // # ranges added for fw seeds
index_t numEltsFw_; // # elements added for fw seeds
index_t numRangesRc_; // # ranges added for rc seeds
index_t numEltsRc_; // # elements added for rc seeds
EList<index_t> offIdx2off_; // map from offset indexes to offsets from 5' end
// When the sort routine is called, the seed hits collected so far
// are sorted into another set of containers that allow easy access
// to hits from the lowest-ranked offset (the one with the fewest
// BW elements) to the greatest-ranked offset. Offsets with 0 hits
// are ignored.
EList<index_t> rankOffs_; // sorted offsets of seeds to try
EList<bool> rankFws_; // sorted orientations assoc. with rankOffs_
bool sorted_; // true if sort() called since last reset
// These fields set once per read
index_t numOffs_; // # different seed offsets possible
const Read* read_; // read from which seeds were extracted
EEHit<index_t> exactFwHit_; // end-to-end exact hit for fw read
EEHit<index_t> exactRcHit_; // end-to-end exact hit for rc read
EList<EEHit<index_t> > mm1Hit_; // 1-mismatch end-to-end hits
index_t mm1Elt_; // number of 1-mismatch hit rows
bool mm1Sorted_; // true iff we've sorted the mm1Hit_ list
EList<size_t> tmpMedian_; // temporary storage for calculating median
};
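#if 0
// Minimal usage sketch for SeedResults (illustrative only; 'read',
// 'offIdx2off', 'nseeds', 'cache', 'rnd' and the QVals are assumed to
// come from the surrounding aligner, as in instantiateSeeds() and
// searchAllSeeds() below).
SeedResults<uint32_t> sr;
sr.reset(read, offIdx2off, nseeds); // size the per-offset containers
// ... searchAllSeeds() then calls sr.add(qv, cache.current(), i, fw) ...
sr.rankSeedHits(rnd); // fewest-elements-first extension order
for(uint32_t r = 0; r < sr.nonzeroOffsets(); r++) {
uint32_t offidx, off, seedlen;
bool fw;
const QVal<uint32_t>& qv = sr.hitsByRank(r, offidx, off, fw, seedlen);
// ... extend the BW ranges referred to by qv ...
}
#endif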
// Forward decl
template <typename index_t> class Ebwt;
template <typename index_t> struct SideLocus;
/**
* Encapsulates a summary of what the searchAllSeeds aligner did.
*/
struct SeedSearchMetrics {
SeedSearchMetrics() : mutex_m() {
reset();
}
/**
* Merge this metrics object with the given object, i.e., sum each
* category. This is the only safe way to update a
* SeedSearchMetrics object shared by multiple threads.
*/
void merge(const SeedSearchMetrics& m, bool getLock = false) {
ThreadSafe ts(&mutex_m, getLock);
seedsearch += m.seedsearch;
possearch += m.possearch;
intrahit += m.intrahit;
interhit += m.interhit;
filteredseed += m.filteredseed;
ooms += m.ooms;
bwops += m.bwops;
bweds += m.bweds;
bestmin0 += m.bestmin0;
bestmin1 += m.bestmin1;
bestmin2 += m.bestmin2;
}
/**
* Set all counters to 0.
*/
void reset() {
seedsearch =
possearch =
intrahit =
interhit =
filteredseed =
ooms =
bwops =
bweds =
bestmin0 =
bestmin1 =
bestmin2 = 0;
}
uint64_t seedsearch; // # times we executed strategy in InstantiatedSeed
uint64_t possearch; // # offsets where aligner executed >= 1 strategy
uint64_t intrahit; // # offsets where current-read cache gave answer
uint64_t interhit; // # offsets where across-read cache gave answer
uint64_t filteredseed; // # seed instantiations skipped due to Ns
uint64_t ooms; // out-of-memory errors
uint64_t bwops; // Burrows-Wheeler operations
uint64_t bweds; // Burrows-Wheeler edits
uint64_t bestmin0; // # times the best min # edits was 0
uint64_t bestmin1; // # times the best min # edits was 1
uint64_t bestmin2; // # times the best min # edits was 2
MUTEX_T mutex_m;
};
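#if 0
// Illustrative threading sketch (not from the original source): each
// worker accumulates counts in a private SeedSearchMetrics and
// periodically folds them into a shared object under the lock.
SeedSearchMetrics sharedMet; // shared across threads
SeedSearchMetrics localMet; // private to one worker
// ... worker updates localMet.seedsearch, localMet.bwops, etc. ...
sharedMet.merge(localMet, true); // getLock = true: serialize the update
localMet.reset();
#endif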
/**
* Given an index and a seeding scheme, searches for seed hits.
*/
template <typename index_t>
class SeedAligner {
public:
/**
* Initialize with index.
*/
SeedAligner() : edits_(AL_CAT), offIdx2off_(AL_CAT) { }
/**
* Given a read and a few coordinates that describe a substring of the
* read (or its reverse complement), fill in 'seq' and 'qual' objects
* with the seed sequence and qualities.
*/
void instantiateSeq(
const Read& read, // input read
BTDnaString& seq, // output sequence
BTString& qual, // output qualities
int len, // seed length
int depth, // seed's 0-based offset from 5' end
bool fw) const; // seed's orientation
/**
* Iterate through the seeds that cover the read and initiate a
* search for each seed.
*/
std::pair<int, int> instantiateSeeds(
const EList<Seed>& seeds, // search seeds
index_t off, // offset into read to start extracting
int per, // interval between seeds
const Read& read, // read to align
const Scoring& pens, // scoring scheme
bool nofw, // don't align forward read
bool norc, // don't align revcomp read
AlignmentCacheIface<index_t>& cache, // holds some seed hits from previous reads
SeedResults<index_t>& sr, // holds all the seed hits
SeedSearchMetrics& met); // metrics
/**
* Iterate through the seeds that cover the read and initiate a
* search for each seed.
*/
void searchAllSeeds(
const EList<Seed>& seeds, // search seeds
const Ebwt<index_t>* ebwtFw, // BWT index
const Ebwt<index_t>* ebwtBw, // BWT' index
const Read& read, // read to align
const Scoring& pens, // scoring scheme
AlignmentCacheIface<index_t>& cache, // local seed alignment cache
SeedResults<index_t>& hits, // holds all the seed hits
SeedSearchMetrics& met, // metrics
PerReadMetrics& prm); // per-read metrics
/**
* Sanity-check a partial alignment produced during oneMmSearch.
*/
bool sanityPartial(
const Ebwt<index_t>* ebwtFw, // BWT index
const Ebwt<index_t>* ebwtBw, // BWT' index
const BTDnaString& seq,
index_t dep,
index_t len,
bool do1mm,
index_t topfw,
index_t botfw,
index_t topbw,
index_t botbw);
/**
* Do an exact-matching sweep to establish a lower bound on the number of edits
* and to find exact alignments.
*/
size_t exactSweep(
const Ebwt<index_t>& ebwt, // BWT index
const Read& read, // read to align
const Scoring& sc, // scoring scheme
bool nofw, // don't align forward read
bool norc, // don't align revcomp read
size_t mineMax, // don't care about edit bounds > this
size_t& mineFw, // minimum # edits for forward read
size_t& mineRc, // minimum # edits for revcomp read
bool repex, // report 0mm hits?
SeedResults<index_t>& hits, // holds all the seed hits (and exact hit)
SeedSearchMetrics& met); // metrics
/**
* Search for end-to-end alignments with up to 1 mismatch.
*/
bool oneMmSearch(
const Ebwt<index_t>* ebwtFw, // BWT index
const Ebwt<index_t>* ebwtBw, // BWT' index
const Read& read, // read to align
const Scoring& sc, // scoring
int64_t minsc, // minimum score
bool nofw, // don't align forward read
bool norc, // don't align revcomp read
bool local, // 1mm hits must be legal local alignments
bool repex, // report 0mm hits?
bool rep1mm, // report 1mm hits?
SeedResults<index_t>& hits, // holds all the seed hits (and exact hit)
SeedSearchMetrics& met); // metrics
protected:
/**
* Report a seed hit found by searchSeedBi(), but first try to extend it out in
* either direction as far as possible without hitting any edits. This will
* allow us to prioritize the seed hits better later on. Call reportHit() when
* we're done, which actually adds the hit to the cache. Returns result from
* calling reportHit().
*/
bool extendAndReportHit(
index_t topf, // top in BWT
index_t botf, // bot in BWT
index_t topb, // top in BWT'
index_t botb, // bot in BWT'
index_t len, // length of hit
DoublyLinkedList<Edit> *prevEdit); // previous edit
/**
* Report a seed hit found by searchSeedBi() by adding it to the cache. Return
* false if the hit could not be reported because of, e.g., cache exhaustion.
*/
bool reportHit(
index_t topf, // top in BWT
index_t botf, // bot in BWT
index_t topb, // top in BWT'
index_t botb, // bot in BWT'
index_t len, // length of hit
DoublyLinkedList<Edit> *prevEdit); // previous edit
/**
* Given an instantiated seed (in s_ and other fields), search for it in the index.
*/
bool searchSeedBi();
/**
* Main, recursive implementation of the seed search.
*/
bool searchSeedBi(
int step, // depth into steps_[] array
int depth, // recursion depth
index_t topf, // top in BWT
index_t botf, // bot in BWT
index_t topb, // top in BWT'
index_t botb, // bot in BWT'
SideLocus<index_t> tloc, // locus for top (perhaps uninitialized)
SideLocus<index_t> bloc, // locus for bot (perhaps uninitialized)
Constraint c0, // constraints to enforce in seed zone 0
Constraint c1, // constraints to enforce in seed zone 1
Constraint c2, // constraints to enforce in seed zone 2
Constraint overall, // overall constraints
DoublyLinkedList<Edit> *prevEdit); // previous edit
/**
* Get tloc and bloc ready for the next step.
*/
inline void nextLocsBi(
SideLocus<index_t>& tloc, // top locus
SideLocus<index_t>& bloc, // bot locus
index_t topf, // top in BWT
index_t botf, // bot in BWT
index_t topb, // top in BWT'
index_t botb, // bot in BWT'
int step); // step to get ready for
// Following are set in searchAllSeeds then used by searchSeed()
// and other protected members.
const Ebwt<index_t>* ebwtFw_; // forward index (BWT)
const Ebwt<index_t>* ebwtBw_; // backward/mirror index (BWT')
const Scoring* sc_; // scoring scheme
const InstantiatedSeed* s_; // current instantiated seed
const Read* read_; // read whose seeds are currently being aligned
// The following are set just before a call to searchSeedBi()
const BTDnaString* seq_; // sequence of current seed
const BTString* qual_; // quality string for current seed
index_t off_; // offset of seed currently being searched
bool fw_; // orientation of seed currently being searched
EList<Edit> edits_; // temporary place to sort edits
AlignmentCacheIface<index_t> *ca_; // local alignment cache for seed alignments
EList<index_t> offIdx2off_; // offset idx to read offset map, set up by instantiateSeeds()
uint64_t bwops_; // Burrows-Wheeler operations
uint64_t bwedits_; // Burrows-Wheeler edits
BTDnaString tmprfdnastr_; // used in reportHit
ASSERT_ONLY(ESet<BTDnaString> hits_); // Ref hits so far for seed being aligned
BTDnaString tmpdnastr_;
};
#define INIT_LOCS(top, bot, tloc, bloc, e) { \
if(bot - top == 1) { \
tloc.initFromRow(top, (e).eh(), (e).ebwt()); \
bloc.invalidate(); \
} else { \
SideLocus<index_t>::initFromTopBot(top, bot, (e).eh(), (e).ebwt(), tloc, bloc); \
assert(bloc.valid()); \
} \
}
#define SANITY_CHECK_4TUP(t, b, tp, bp) { \
ASSERT_ONLY(index_t tot = (b[0]-t[0])+(b[1]-t[1])+(b[2]-t[2])+(b[3]-t[3])); \
ASSERT_ONLY(index_t totp = (bp[0]-tp[0])+(bp[1]-tp[1])+(bp[2]-tp[2])+(bp[3]-tp[3])); \
assert_eq(tot, totp); \
}
/**
* Given a read and a few coordinates that describe a substring of the read (or
* its reverse complement), fill in 'seq' and 'qual' objects with the seed
* sequence and qualities.
*
* The seq field is filled with the sequence as it would align to the Watson
* reference strand. I.e. if fw is false, then the sequence that appears in
* 'seq' is the reverse complement of the raw read substring.
*/
template <typename index_t>
void SeedAligner<index_t>::instantiateSeq(
const Read& read, // input read
BTDnaString& seq, // output sequence
BTString& qual, // output qualities
int len, // seed length
int depth, // seed's 0-based offset from 5' end
bool fw) const // seed's orientation
{
// Fill in 'seq' and 'qual'
int seedlen = len;
if((int)read.length() < seedlen) seedlen = (int)read.length();
seq.resize(len);
qual.resize(len);
// If fw is false, we take characters starting at the 3' end of the
// reverse complement of the read.
for(int i = 0; i < len; i++) {
seq.set(read.patFw.windowGetDna(i, fw, read.color, depth, len), i);
qual.set(read.qual.windowGet(i, fw, depth, len), i);
}
}
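// Worked example (hypothetical read): for read 5'-ACCTGAAG-3' with
// depth == 2, len == 4 and fw == true, 'seq' receives "CTGA" and
// 'qual' the matching quality window. With fw == false the window is
// taken from the reverse complement instead, so 'seq' always reads in
// Watson-strand orientation.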
/**
* We assume that all seeds are the same length.
*
* For each seed, instantiate the seed, retracting if necessary.
*/
template <typename index_t>
pair<int, int> SeedAligner<index_t>::instantiateSeeds(
const EList<Seed>& seeds, // search seeds
index_t off, // offset into read to start extracting
int per, // interval between seeds
const Read& read, // read to align
const Scoring& pens, // scoring scheme
bool nofw, // don't align forward read
bool norc, // don't align revcomp read
AlignmentCacheIface<index_t>& cache, // holds some seed hits from previous reads
SeedResults<index_t>& sr, // holds all the seed hits
SeedSearchMetrics& met) // metrics
{
assert(!seeds.empty());
assert_gt(read.length(), 0);
// Check whether read has too many Ns
offIdx2off_.clear();
int len = seeds[0].len; // assume they're all the same length
#ifndef NDEBUG
for(size_t i = 1; i < seeds.size(); i++) {
assert_eq(len, seeds[i].len);
}
#endif
// Calc # seeds within read interval
int nseeds = 1;
if((int)read.length() - (int)off > len) {
nseeds += ((int)read.length() - (int)off - len) / per;
}
for(int i = 0; i < nseeds; i++) {
offIdx2off_.push_back(per * i + (int)off);
}
pair<int, int> ret;
ret.first = 0; // # seeds that require alignment
ret.second = 0; // # seeds that hit in cache with non-empty results
sr.reset(read, offIdx2off_, nseeds);
assert(sr.repOk(&cache.current(), true)); // require that SeedResults be initialized
// For each orientation (fw, then rc)
for(int fwi = 0; fwi < 2; fwi++) {
bool fw = (fwi == 0);
if((fw && nofw) || (!fw && norc)) {
// Skip this orientation b/c user specified --nofw or --norc
continue;
}
// For each seed position
for(int i = 0; i < nseeds; i++) {
int depth = i * per + (int)off;
int seedlen = seeds[0].len;
// Extract the seed sequence at this offset. If fw == true, the
// window of len characters starts at 'depth' from the 5' end of
// the read; if fw == false, it is taken from the reverse
// complement (see instantiateSeq()).
instantiateSeq(
read,
sr.seqs(fw)[i],
sr.quals(fw)[i],
std::min((int)seedlen, (int)read.length()),
depth,
fw);
//QKey qk(sr.seqs(fw)[i] ASSERT_ONLY(, tmpdnastr_));
// For each search strategy
EList<InstantiatedSeed>& iss = sr.instantiatedSeeds(fw, i);
for(int j = 0; j < (int)seeds.size(); j++) {
iss.expand();
assert_eq(seedlen, seeds[j].len);
InstantiatedSeed* is = &iss.back();
if(seeds[j].instantiate(
read,
sr.seqs(fw)[i],
sr.quals(fw)[i],
pens,
depth,
i,
j,
fw,
*is))
{
// Can we fill this seed hit in from the cache?
ret.first++;
} else {
// Seed may fail to instantiate if there are Ns
// that prevent it from matching
met.filteredseed++;
iss.pop_back();
}
}
}
}
return ret;
}
/**
* We assume that all seeds are the same length.
*
* For each seed:
*
* 1. Instantiate all seeds, retracting them if necessary.
* 2. Calculate zone boundaries for each seed
*/
template <typename index_t>
void SeedAligner<index_t>::searchAllSeeds(
const EList<Seed>& seeds, // search seeds
const Ebwt<index_t>* ebwtFw, // BWT index
const Ebwt<index_t>* ebwtBw, // BWT' index
const Read& read, // read to align
const Scoring& pens, // scoring scheme
AlignmentCacheIface<index_t>& cache, // local cache for seed alignments
SeedResults<index_t>& sr, // holds all the seed hits
SeedSearchMetrics& met, // metrics
PerReadMetrics& prm) // per-read metrics
{
assert(!seeds.empty());
assert(ebwtFw != NULL);
assert(ebwtFw->isInMemory());
assert(sr.repOk(&cache.current()));
ebwtFw_ = ebwtFw;
ebwtBw_ = ebwtBw;
sc_ = &pens;
read_ = &read;
ca_ = &cache;
bwops_ = bwedits_ = 0;
uint64_t possearches = 0, seedsearches = 0, intrahits = 0, interhits = 0, ooms = 0;
// For each instantiated seed
for(int i = 0; i < (int)sr.numOffs(); i++) {
size_t off = sr.idx2off(i);
for(int fwi = 0; fwi < 2; fwi++) {
bool fw = (fwi == 0);
assert(sr.repOk(&cache.current()));
EList<InstantiatedSeed>& iss = sr.instantiatedSeeds(fw, i);
if(iss.empty()) {
// Cache hit in an across-read cache
continue;
}
QVal<index_t> qv;
seq_ = &sr.seqs(fw)[i]; // seed sequence
qual_ = &sr.quals(fw)[i]; // seed qualities
off_ = off; // seed offset (from 5')
fw_ = fw; // seed orientation
// Tell the cache that we've started aligning, so the cache can
// expect a series of on-the-fly updates
int ret = cache.beginAlign(*seq_, *qual_, qv);
ASSERT_ONLY(hits_.clear());
if(ret == -1) {
// Out of memory when we tried to add key to map
ooms++;
continue;
}
bool abort = false;
if(ret == 0) {
// Not already in cache
assert(cache.aligning());
possearches++;
for(size_t j = 0; j < iss.size(); j++) {
// Set seq_ and qual_ appropriately, using the seed sequences
// and qualities already installed in SeedResults
assert_eq(fw, iss[j].fw);
assert_eq(i, (int)iss[j].seedoffidx);
s_ = &iss[j];
// Do the search with respect to seq_, qual_ and s_.
if(!searchSeedBi()) {
// Memory exhausted during search
ooms++;
abort = true;
break;
}
seedsearches++;
assert(cache.aligning());
}
if(!abort) {
qv = cache.finishAlign();
}
} else {
// Already in cache
assert_eq(1, ret);
assert(qv.valid());
intrahits++;
}
assert(abort || !cache.aligning());
if(qv.valid()) {
sr.add(
qv, // range of ranges in cache
cache.current(), // cache
i, // seed index (from 5' end)
fw); // whether seed is from forward read
}
}
}
prm.nSeedRanges = sr.numRanges();
prm.nSeedElts = sr.numElts();
prm.nSeedRangesFw = sr.numRangesFw();
prm.nSeedRangesRc = sr.numRangesRc();
prm.nSeedEltsFw = sr.numEltsFw();
prm.nSeedEltsRc = sr.numEltsRc();
prm.seedMedian = (uint64_t)(sr.medianHitsPerSeed() + 0.5);
prm.seedMean = (uint64_t)sr.averageHitsPerSeed();
prm.nSdFmops += bwops_;
met.seedsearch += seedsearches;
met.possearch += possearches;
met.intrahit += intrahits;
met.interhit += interhits;
met.ooms += ooms;
met.bwops += bwops_;
met.bweds += bwedits_;
}
template <typename index_t>
bool SeedAligner<index_t>::sanityPartial(
const Ebwt<index_t>* ebwtFw, // BWT index
const Ebwt<index_t>* ebwtBw, // BWT' index
const BTDnaString& seq,
index_t dep,
index_t len,
bool do1mm,
index_t topfw,
index_t botfw,
index_t topbw,
index_t botbw)
{
tmpdnastr_.clear();
for(size_t i = dep; i < len; i++) {
tmpdnastr_.append(seq[i]);
}
index_t top_fw = 0, bot_fw = 0;
ebwtFw->contains(tmpdnastr_, &top_fw, &bot_fw);
assert_eq(top_fw, topfw);
assert_eq(bot_fw, botfw);
if(do1mm && ebwtBw != NULL) {
tmpdnastr_.reverse();
index_t top_bw = 0, bot_bw = 0;
ebwtBw->contains(tmpdnastr_, &top_bw, &bot_bw);
assert_eq(top_bw, topbw);
assert_eq(bot_bw, botbw);
}
return true;
}
/**
* Sweep right-to-left and left-to-right using exact matching. Remember all
* the SA ranges encountered along the way. Report exact matches if there are
* any. Calculate a lower bound on the number of edits in an end-to-end
* alignment.
*/
template <typename index_t>
size_t SeedAligner<index_t>::exactSweep(
const Ebwt<index_t>& ebwt, // BWT index
const Read& read, // read to align
const Scoring& sc, // scoring scheme
bool nofw, // don't align forward read
bool norc, // don't align revcomp read
size_t mineMax, // don't care about edit bounds > this
size_t& mineFw, // minimum # edits for forward read
size_t& mineRc, // minimum # edits for revcomp read
bool repex, // report 0mm hits?
SeedResults<index_t>& hits, // holds all the seed hits (and exact hit)
SeedSearchMetrics& met) // metrics
{
assert_gt(mineMax, 0);
index_t top = 0, bot = 0;
SideLocus<index_t> tloc, bloc;
const size_t len = read.length();
size_t nelt = 0;
for(int fwi = 0; fwi < 2; fwi++) {
bool fw = (fwi == 0);
if( fw && nofw) continue;
if(!fw && norc) continue;
const BTDnaString& seq = fw ? read.patFw : read.patRc;
assert(!seq.empty());
int ftabLen = ebwt.eh().ftabChars();
size_t dep = 0;
size_t nedit = 0;
bool done = false;
while(dep < len && !done) {
top = bot = 0;
size_t left = len - dep;
assert_gt(left, 0);
bool doFtab = ftabLen > 1 && left >= (size_t)ftabLen;
if(doFtab) {
// Does N interfere with use of Ftab?
for(size_t i = 0; i < (size_t)ftabLen; i++) {
int c = seq[len-dep-1-i];
if(c > 3) {
doFtab = false;
break;
}
}
}
if(doFtab) {
// Use ftab
ebwt.ftabLoHi(seq, len - dep - ftabLen, false, top, bot);
dep += (size_t)ftabLen;
} else {
// Use fchr
int c = seq[len-dep-1];
if(c < 4) {
top = ebwt.fchr()[c];
bot = ebwt.fchr()[c+1];
}
dep++;
}
if(bot <= top) {
nedit++;
if(nedit >= mineMax) {
if(fw) { mineFw = nedit; } else { mineRc = nedit; }
break;
}
continue;
}
INIT_LOCS(top, bot, tloc, bloc, ebwt);
// Keep going
while(dep < len) {
int c = seq[len-dep-1];
if(c > 3) {
top = bot = 0;
} else {
if(bloc.valid()) {
bwops_ += 2;
top = ebwt.mapLF(tloc, c);
bot = ebwt.mapLF(bloc, c);
} else {
bwops_++;
top = ebwt.mapLF1(top, tloc, c);
if(top == (index_t)OFF_MASK) {
top = bot = 0;
} else {
bot = top+1;
}
}
}
if(bot <= top) {
nedit++;
if(nedit >= mineMax) {
if(fw) { mineFw = nedit; } else { mineRc = nedit; }
done = true;
}
break;
}
INIT_LOCS(top, bot, tloc, bloc, ebwt);
dep++;
}
if(done) {
break;
}
if(dep == len) {
// Set the minimum # edits
if(fw) { mineFw = nedit; } else { mineRc = nedit; }
// Done
if(nedit == 0 && bot > top) {
if(repex) {
// This is an exact hit
int64_t score = len * sc.match();
if(fw) {
hits.addExactEeFw(top, bot, NULL, NULL, fw, score);
assert(ebwt.contains(seq, NULL, NULL));
} else {
hits.addExactEeRc(top, bot, NULL, NULL, fw, score);
assert(ebwt.contains(seq, NULL, NULL));
}
}
nelt += (bot - top);
}
break;
}
dep++;
}
}
return nelt;
}
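// Worked example of the lower bound (hypothetical read): sweeping a
// read right-to-left, suppose the SA range empties twice before the
// 5' end is reached, i.e. the read decomposes into 3 maximal stretches
// that each match the reference but cannot be joined exactly. Any
// end-to-end alignment must then contain at least 2 edits, so mineFw
// (or mineRc) is set to 2 for that orientation.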
/**
* Search for end-to-end alignments with up to 1 mismatch. Return true iff
* at least one alignment is found.
*/
template <typename index_t>
bool SeedAligner<index_t>::oneMmSearch(
const Ebwt<index_t>* ebwtFw, // BWT index
const Ebwt<index_t>* ebwtBw, // BWT' index
const Read& read, // read to align
const Scoring& sc, // scoring
int64_t minsc, // minimum score
bool nofw, // don't align forward read
bool norc, // don't align revcomp read
bool local, // 1mm hits must be legal local alignments
bool repex, // report 0mm hits?
bool rep1mm, // report 1mm hits?
SeedResults<index_t>& hits, // holds all the seed hits (and exact hit)
SeedSearchMetrics& met) // metrics
{
assert(!rep1mm || ebwtBw != NULL);
const size_t len = read.length();
int nceil = sc.nCeil.f((double)len);
size_t ns = read.ns();
if(ns > 1) {
// Can't align this with <= 1 mismatches
return false;
} else if(ns == 1 && !rep1mm) {
// Can't align this with 0 mismatches
return false;
}
assert_geq(len, 2);
assert(!rep1mm || ebwtBw->eh().ftabChars() == ebwtFw->eh().ftabChars());
#ifndef NDEBUG
if(ebwtBw != NULL) {
for(int i = 0; i < 4; i++) {
assert_eq(ebwtBw->fchr()[i], ebwtFw->fchr()[i]);
}
}
#endif
size_t halfFw = len >> 1;
size_t halfBw = len >> 1;
if((len & 1) != 0) {
halfBw++;
}
assert_geq(halfFw, 1);
assert_geq(halfBw, 1);
SideLocus<index_t> tloc, bloc;
index_t t[4], b[4]; // dest BW ranges for BWT
t[0] = t[1] = t[2] = t[3] = 0;
b[0] = b[1] = b[2] = b[3] = 0;
index_t tp[4], bp[4]; // dest BW ranges for BWT'
tp[0] = tp[1] = tp[2] = tp[3] = 0;
bp[0] = bp[1] = bp[2] = bp[3] = 0;
index_t top = 0, bot = 0, topp = 0, botp = 0;
// Align fw read / rc read
bool results = false;
for(int fwi = 0; fwi < 2; fwi++) {
bool fw = (fwi == 0);
if( fw && nofw) continue;
if(!fw && norc) continue;
// Align going right-to-left, left-to-right
int lim = rep1mm ? 2 : 1;
for(int ebwtfwi = 0; ebwtfwi < lim; ebwtfwi++) {
bool ebwtfw = (ebwtfwi == 0);
const Ebwt* ebwt = (ebwtfw ? ebwtFw : ebwtBw);
const Ebwt* ebwtp = (ebwtfw ? ebwtBw : ebwtFw);
assert(rep1mm || ebwt->fw());
const BTDnaString& seq =
(fw ? (ebwtfw ? read.patFw : read.patFwRev) :
(ebwtfw ? read.patRc : read.patRcRev));
assert(!seq.empty());
const BTString& qual =
(fw ? (ebwtfw ? read.qual : read.qualRev) :
(ebwtfw ? read.qualRev : read.qual));
int ftabLen = ebwt->eh().ftabChars();
size_t nea = ebwtfw ? halfFw : halfBw;
// Check if there's an N in the near portion
bool skip = false;
for(size_t dep = 0; dep < nea; dep++) {
if(seq[len-dep-1] > 3) {
skip = true;
break;
}
}
if(skip) {
continue;
}
size_t dep = 0;
// Align near half
if(ftabLen > 1 && (size_t)ftabLen <= nea) {
// Use ftab to jump partway into near half
bool rev = !ebwtfw;
ebwt->ftabLoHi(seq, len - ftabLen, rev, top, bot);
if(rep1mm) {
ebwtp->ftabLoHi(seq, len - ftabLen, rev, topp, botp);
assert_eq(bot - top, botp - topp);
}
if(bot - top == 0) {
continue;
}
int c = seq[len - ftabLen];
t[c] = top; b[c] = bot;
tp[c] = topp; bp[c] = botp;
dep = ftabLen;
// tloc/bloc are initialized by INIT_LOCS below
} else {
// Use fchr to jump in by 1 pos
int c = seq[len-1];
assert_range(0, 3, c);
top = topp = tp[c] = ebwt->fchr()[c];
bot = botp = bp[c] = ebwt->fchr()[c+1];
if(bot - top == 0) {
continue;
}
dep = 1;
// tloc/bloc are initialized by INIT_LOCS below
}
INIT_LOCS(top, bot, tloc, bloc, *ebwt);
assert(sanityPartial(ebwt, ebwtp, seq, len-dep, len, rep1mm, top, bot, topp, botp));
bool do_continue = false;
for(; dep < nea; dep++) {
assert_lt(dep, len);
int rdc = seq[len - dep - 1];
tp[0] = tp[1] = tp[2] = tp[3] = topp;
bp[0] = bp[1] = bp[2] = bp[3] = botp;
if(bloc.valid()) {
bwops_++;
t[0] = t[1] = t[2] = t[3] = b[0] = b[1] = b[2] = b[3] = 0;
ebwt->mapBiLFEx(tloc, bloc, t, b, tp, bp);
SANITY_CHECK_4TUP(t, b, tp, bp);
top = t[rdc]; bot = b[rdc];
if(bot <= top) {
do_continue = true;
break;
}
topp = tp[rdc]; botp = bp[rdc];
assert(!rep1mm || bot - top == botp - topp);
} else {
assert_eq(bot, top+1);
assert(!rep1mm || botp == topp+1);
bwops_++;
top = ebwt->mapLF1(top, tloc, rdc);
if(top == (index_t)OFF_MASK) {
do_continue = true;
break;
}
bot = top + 1;
t[rdc] = top; b[rdc] = bot;
tp[rdc] = topp; bp[rdc] = botp;
assert(!rep1mm || b[rdc] - t[rdc] == bp[rdc] - tp[rdc]);
// topp/botp stay the same
}
INIT_LOCS(top, bot, tloc, bloc, *ebwt);
assert(sanityPartial(ebwt, ebwtp, seq, len - dep - 1, len, rep1mm, top, bot, topp, botp));
}
if(do_continue) {
continue;
}
// Align far half
for(; dep < len; dep++) {
int rdc = seq[len-dep-1];
int quc = qual[len-dep-1];
if(rdc > 3 && nceil == 0) {
break;
}
tp[0] = tp[1] = tp[2] = tp[3] = topp;
bp[0] = bp[1] = bp[2] = bp[3] = botp;
int clo = 0, chi = 3;
bool match = true;
if(bloc.valid()) {
bwops_++;
t[0] = t[1] = t[2] = t[3] = b[0] = b[1] = b[2] = b[3] = 0;
ebwt->mapBiLFEx(tloc, bloc, t, b, tp, bp);
SANITY_CHECK_4TUP(t, b, tp, bp);
match = rdc < 4;
top = t[rdc]; bot = b[rdc];
topp = tp[rdc]; botp = bp[rdc];
} else {
assert_eq(bot, top+1);
assert(!rep1mm || botp == topp+1);
bwops_++;
clo = ebwt->mapLF1(top, tloc);
match = (clo == rdc);
assert_range(-1, 3, clo);
if(clo < 0) {
break; // Hit the $
} else {
t[clo] = top;
b[clo] = bot = top + 1;
}
bp[clo] = botp;
tp[clo] = topp;
assert(!rep1mm || bot - top == botp - topp);
assert(!rep1mm || b[clo] - t[clo] == bp[clo] - tp[clo]);
chi = clo;
}
//assert(sanityPartial(ebwt, ebwtp, seq, len - dep - 1, len, rep1mm, top, bot, topp, botp));
if(rep1mm && (ns == 0 || rdc > 3)) {
for(int j = clo; j <= chi; j++) {
if(j == rdc || b[j] == t[j]) {
// Either matches read or isn't a possibility
continue;
}
// Potential mismatch - next, try
size_t depm = dep + 1;
index_t topm = t[j], botm = b[j];
index_t topmp = tp[j], botmp = bp[j];
assert_eq(botm - topm, botmp - topmp);
index_t tm[4], bm[4]; // dest BW ranges for BWT
tm[0] = t[0]; tm[1] = t[1];
tm[2] = t[2]; tm[3] = t[3];
bm[0] = b[0]; bm[1] = b[1];
bm[2] = b[2]; bm[3] = b[3];
index_t tmp[4], bmp[4]; // dest BW ranges for BWT'
tmp[0] = tp[0]; tmp[1] = tp[1];
tmp[2] = tp[2]; tmp[3] = tp[3];
bmp[0] = bp[0]; bmp[1] = bp[1];
bmp[2] = bp[2]; bmp[3] = bp[3];
SideLocus<index_t> tlocm, blocm;
INIT_LOCS(topm, botm, tlocm, blocm, *ebwt);
for(; depm < len; depm++) {
int rdcm = seq[len - depm - 1];
tmp[0] = tmp[1] = tmp[2] = tmp[3] = topmp;
bmp[0] = bmp[1] = bmp[2] = bmp[3] = botmp;
if(blocm.valid()) {
bwops_++;
tm[0] = tm[1] = tm[2] = tm[3] =
bm[0] = bm[1] = bm[2] = bm[3] = 0;
ebwt->mapBiLFEx(tlocm, blocm, tm, bm, tmp, bmp);
SANITY_CHECK_4TUP(tm, bm, tmp, bmp);
topm = tm[rdcm]; botm = bm[rdcm];
topmp = tmp[rdcm]; botmp = bmp[rdcm];
if(botm <= topm) {
break;
}
} else {
assert_eq(botm, topm+1);
assert_eq(botmp, topmp+1);
bwops_++;
topm = ebwt->mapLF1(topm, tlocm, rdcm);
if(topm == (index_t)OFF_MASK) {
break;
}
botm = topm + 1;
// topp/botp stay the same
}
INIT_LOCS(topm, botm, tlocm, blocm, *ebwt);
}
if(depm == len) {
// Success; this is a 1MM hit
size_t off5p = dep; // offset from 5' end of read
size_t offstr = dep; // offset into patFw/patRc
if(fw == ebwtfw) {
off5p = len - off5p - 1;
}
if(!ebwtfw) {
offstr = len - offstr - 1;
}
Edit e((uint32_t)off5p, j, rdc, EDIT_TYPE_MM, false);
results = true;
int64_t score = (len - 1) * sc.match();
// In --local mode, need to double-check that
// end-to-end alignment doesn't violate local
// alignment principles. Specifically, it
// shouldn't dip to or below 0 anywhere in the middle.
int pen = sc.score(rdc, (int)(1 << j), quc - 33);
score += pen;
bool valid = true;
if(local) {
int64_t locscore_fw = 0, locscore_bw = 0;
for(size_t i = 0; i < len; i++) {
if(i == dep) {
if(locscore_fw + pen <= 0) {
valid = false;
break;
}
locscore_fw += pen;
} else {
locscore_fw += sc.match();
}
if(len-i-1 == dep) {
if(locscore_bw + pen <= 0) {
valid = false;
break;
}
locscore_bw += pen;
} else {
locscore_bw += sc.match();
}
}
}
if(valid) {
valid = score >= minsc;
}
if(valid) {
#ifndef NDEBUG
BTDnaString& rf = tmprfdnastr_;
rf.clear();
edits_.clear();
edits_.push_back(e);
if(!fw) Edit::invertPoss(edits_, len, false);
Edit::toRef(fw ? read.patFw : read.patRc, edits_, rf);
if(!fw) Edit::invertPoss(edits_, len, false);
assert_eq(len, rf.length());
for(size_t i = 0; i < len; i++) {
assert_lt((int)rf[i], 4);
}
ASSERT_ONLY(index_t toptmp = 0);
ASSERT_ONLY(index_t bottmp = 0);
assert(ebwtFw->contains(rf, &toptmp, &bottmp));
#endif
index_t toprep = ebwtfw ? topm : topmp;
index_t botrep = ebwtfw ? botm : botmp;
assert_eq(toprep, toptmp);
assert_eq(botrep, bottmp);
hits.add1mmEe(toprep, botrep, &e, NULL, fw, score);
}
}
}
}
if(bot > top && match) {
assert_lt(rdc, 4);
if(dep == len-1) {
// Success; this is an exact hit
if(ebwtfw && repex) {
if(fw) {
results = true;
int64_t score = len * sc.match();
hits.addExactEeFw(
ebwtfw ? top : topp,
ebwtfw ? bot : botp,
NULL, NULL, fw, score);
assert(ebwtFw->contains(seq, NULL, NULL));
} else {
results = true;
int64_t score = len * sc.match();
hits.addExactEeRc(
ebwtfw ? top : topp,
ebwtfw ? bot : botp,
NULL, NULL, fw, score);
assert(ebwtFw->contains(seq, NULL, NULL));
}
}
break; // End of far loop
} else {
INIT_LOCS(top, bot, tloc, bloc, *ebwt);
assert(sanityPartial(ebwt, ebwtp, seq, len - dep - 1, len, rep1mm, top, bot, topp, botp));
}
} else {
break; // End of far loop
}
} // for(; dep < len; dep++)
} // for(int ebwtfw = 0; ebwtfw < 2; ebwtfw++)
} // for(int fw = 0; fw < 2; fw++)
return results;
}
/**
* Wrapper for the initial invocation of searchSeedBi().
*/
template <typename index_t>
bool SeedAligner<index_t>::searchSeedBi() {
return searchSeedBi(
0, 0,
0, 0, 0, 0,
SideLocus<index_t>(), SideLocus<index_t>(),
s_->cons[0], s_->cons[1], s_->cons[2], s_->overall,
NULL);
}
/**
* Get tloc and bloc ready for the next step.
*/
template <typename index_t>
inline void SeedAligner<index_t>::nextLocsBi(
SideLocus<index_t>& tloc, // top locus
SideLocus<index_t>& bloc, // bot locus
index_t topf, // top in BWT
index_t botf, // bot in BWT
index_t topb, // top in BWT'
index_t botb, // bot in BWT'
int step // step to get ready for
#if 0
, const SABWOffTrack* prevOt, // previous tracker
SABWOffTrack& ot // current tracker
#endif
)
{
assert_gt(botf, 0);
assert(ebwtBw_ == NULL || botb > 0);
assert_geq(step, 0); // next step can't be first one
assert(ebwtBw_ == NULL || botf-topf == botb-topb);
if(step == (int)s_->steps.size()) return; // no more steps!
// Which direction are we going in next?
if(s_->steps[step] > 0) {
// Left to right; use BWT'
if(botb - topb == 1) {
// Already down to 1 row; just init top locus
tloc.initFromRow(topb, ebwtBw_->eh(), ebwtBw_->ebwt());
bloc.invalidate();
} else {
SideLocus<index_t>::initFromTopBot(
topb, botb, ebwtBw_->eh(), ebwtBw_->ebwt(), tloc, bloc);
assert(bloc.valid());
}
} else {
// Right to left; use BWT
if(botf - topf == 1) {
// Already down to 1 row; just init top locus
tloc.initFromRow(topf, ebwtFw_->eh(), ebwtFw_->ebwt());
bloc.invalidate();
} else {
SideLocus<index_t>::initFromTopBot(
topf, botf, ebwtFw_->eh(), ebwtFw_->ebwt(), tloc, bloc);
assert(bloc.valid());
}
}
// Check if we should update the tracker with this refinement
#if 0
if(botf-topf <= BW_OFF_TRACK_CEIL) {
if(ot.size() == 0 && prevOt != NULL && prevOt->size() > 0) {
// Inherit state from the predecessor
ot = *prevOt;
}
bool ltr = s_->steps[step-1] > 0;
int adj = abs(s_->steps[step-1])-1;
const Ebwt<index_t>* ebwt = ltr ? ebwtBw_ : ebwtFw_;
ot.update(
ltr ? topb : topf, // top
ltr ? botb : botf, // bot
adj, // adj (to be subtracted from offset)
ebwt->offs(), // offs array
ebwt->eh().offRate(), // offrate (sample = every 1 << offrate elts)
NULL // dead
);
assert_gt(ot.size(), 0);
}
#endif
assert(botf - topf == 1 || bloc.valid());
assert(botf - topf > 1 || !bloc.valid());
}
/**
* Report a seed hit found by searchSeedBi(), but first try to extend it out in
* either direction as far as possible without hitting any edits. This will
* allow us to prioritize the seed hits better later on. Call reportHit() when
* we're done, which actually adds the hit to the cache. Returns result from
* calling reportHit().
*/
template <typename index_t>
bool SeedAligner<index_t>::extendAndReportHit(
index_t topf, // top in BWT
index_t botf, // bot in BWT
index_t topb, // top in BWT'
index_t botb, // bot in BWT'
index_t len, // length of hit
DoublyLinkedList<Edit> *prevEdit) // previous edit
{
index_t nlex = 0, nrex = 0;
index_t t[4], b[4];
index_t tp[4], bp[4];
SideLocus<index_t> tloc, bloc;
if(off_ > 0) {
const Ebwt<index_t> *ebwt = ebwtFw_;
assert(ebwt != NULL);
// Extend left using forward index
const BTDnaString& seq = fw_ ? read_->patFw : read_->patRc;
// See what we get by extending
index_t top = topf, bot = botf;
t[0] = t[1] = t[2] = t[3] = 0;
b[0] = b[1] = b[2] = b[3] = 0;
tp[0] = tp[1] = tp[2] = tp[3] = topb;
bp[0] = bp[1] = bp[2] = bp[3] = botb;
INIT_LOCS(top, bot, tloc, bloc, *ebwt);
for(size_t ii = off_; ii > 0; ii--) {
size_t i = ii-1;
// Get char from read
int rdc = seq.get(i);
// See what we get by extending
if(bloc.valid()) {
bwops_++;
t[0] = t[1] = t[2] = t[3] =
b[0] = b[1] = b[2] = b[3] = 0;
ebwt->mapBiLFEx(tloc, bloc, t, b, tp, bp);
SANITY_CHECK_4TUP(t, b, tp, bp);
int nonz = -1;
bool abort = false;
for(int j = 0; j < 4; j++) {
if(b[j] > t[j]) {
if(nonz >= 0) {
abort = true;
break;
}
nonz = j;
top = t[j]; bot = b[j];
}
}
if(abort || nonz != rdc) {
break;
}
} else {
assert_eq(bot, top+1);
bwops_++;
int c = ebwt->mapLF1(top, tloc);
if(c != rdc) {
break;
}
bot = top + 1;
}
if(++nlex == 255) {
break;
}
INIT_LOCS(top, bot, tloc, bloc, *ebwt);
}
}
size_t rdlen = read_->length();
size_t nright = rdlen - off_ - len;
if(nright > 0 && ebwtBw_ != NULL) {
const Ebwt *ebwt = ebwtBw_;
assert(ebwt != NULL);
// Extend right using backward index
const BTDnaString& seq = fw_ ? read_->patFw : read_->patRc;
// See what we get by extending
index_t top = topb, bot = botb;
t[0] = t[1] = t[2] = t[3] = 0;
b[0] = b[1] = b[2] = b[3] = 0;
tp[0] = tp[1] = tp[2] = tp[3] = topb;
bp[0] = bp[1] = bp[2] = bp[3] = botb;
INIT_LOCS(top, bot, tloc, bloc, *ebwt);
for(size_t i = off_ + len; i < rdlen; i++) {
// Get char from read
int rdc = seq.get(i);
// See what we get by extending
if(bloc.valid()) {
bwops_++;
t[0] = t[1] = t[2] = t[3] =
b[0] = b[1] = b[2] = b[3] = 0;
ebwt->mapBiLFEx(tloc, bloc, t, b, tp, bp);
SANITY_CHECK_4TUP(t, b, tp, bp);
int nonz = -1;
bool abort = false;
for(int j = 0; j < 4; j++) {
if(b[j] > t[j]) {
if(nonz >= 0) {
abort = true;
break;
}
nonz = j;
top = t[j]; bot = b[j];
}
}
if(abort || nonz != rdc) {
break;
}
} else {
assert_eq(bot, top+1);
bwops_++;
int c = ebwt->mapLF1(top, tloc);
if(c != rdc) {
break;
}
bot = top + 1;
}
if(++nrex == 255) {
break;
}
INIT_LOCS(top, bot, tloc, bloc, *ebwt);
}
}
assert_lt(nlex, rdlen);
assert_leq(nlex, off_);
assert_lt(nrex, rdlen);
return reportHit(topf, botf, topb, botb, len, prevEdit);
}
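// Illustrative note (hypothetical numbers): for a 16 bp seed hit at
// off_ == 10 in a 100 bp read, the loops above can extend up to 10 bp
// to the left and 74 bp to the right, stopping at the first extension
// that mismatches or becomes ambiguous, or once an extension reaches
// 255 bp.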
/**
* Report a seed hit found by searchSeedBi() by adding it to the cache. Return
* false if the hit could not be reported because of, e.g., cache exhaustion.
*/
template <typename index_t>