Installation
================================================================================

The simplest way to install the latest stable release of Change-O is via pip::

    > pip3 install changeo --user

The current development build can be installed using pip and git in a similar
fashion::

    > pip3 install git+https://bitbucket.org/kleinstein/changeo@master --user

If you currently have a development version installed, you will likely need to
add the arguments ``--upgrade --no-deps --force-reinstall`` to the pip3
command.

Requirements
--------------------------------------------------------------------------------

The minimum dependencies for installation are:

+ Python 3.4.0
+ setuptools 2.0
+ NumPy 1.8
+ SciPy 0.14
+ pandas 0.24
+ Biopython 1.77
+ presto 0.7.0
+ airr 1.3.1

Some tools wrap external applications that are not required for installation.
Those tools require minimum versions of:

+ AlignRecords requires MUSCLE 3.8
+ ConvertDb-genbank requires tbl2asn
+ AssignGenes requires IgBLAST 1.6, but version 1.11 or higher is strongly
  recommended.
+ BuildTrees requires IgPhyML 1.0.5

Linux
--------------------------------------------------------------------------------

1. The simplest way to install all Python dependencies is to install the full
   SciPy stack following its installation instructions, then install
   Biopython according to its installation instructions.

2. Install presto 0.7.0 or greater.

3. Download the Change-O bundle and run::

       > pip3 install changeo-x.y.z.tar.gz --user

Mac OS X
--------------------------------------------------------------------------------

1. Install Xcode, available from the Apple store or developer downloads.

2. Older versions of Mac OS X will require you to install XQuartz 2.7.5,
   available from the XQuartz project.

3. Install Homebrew following its installation and post-installation
   instructions.

4. Install Python 3.4.0+ and set the path to the python3 executable::

       > brew install python3
       > echo 'export PATH=/usr/local/bin:$PATH' >> ~/.profile

5. Exit and reopen the terminal application so the PATH setting takes effect.

6. You may or may not need to install gfortran (required to build SciPy). Try
   without it first, as it can take an hour to install and is not needed on
   newer releases. If you do need gfortran to install SciPy, you can install
   it using Homebrew::

       > brew install gfortran

   If the above fails, run this instead::

       > brew install --env=std gfortran

7. Install NumPy, SciPy, pandas and Biopython using the Python package
   manager::

       > pip3 install numpy scipy pandas biopython

8. Install presto 0.7.0 or greater.

9. Download the Change-O bundle, open a terminal window, change directories
   to the download folder, and run::

       > pip3 install changeo-x.y.z.tar.gz

Windows
--------------------------------------------------------------------------------

1. Install Python 3.4.0+ from the Python website, selecting both the options
   'pip' and 'Add python.exe to Path'.

2. Install NumPy, SciPy, pandas and Biopython using the packages available
   from the Unofficial Windows binary collection.

3. Install presto 0.7.0 or greater.

4. Download the Change-O bundle, open a Command Prompt, change directories to
   the download folder, and run::

       > pip install changeo-x.y.z.tar.gz

5. For a default installation of Python 3.4, the Change-O scripts will be
   installed into ``C:\Python34\Scripts`` and should be directly executable
   from the Command Prompt. If this is not the case, then follow step 6
   below.

6. Add both the ``C:\Python34`` and ``C:\Python34\Scripts`` directories to
   your ``%Path%``. On both Windows 7 and Windows 10, the ``%Path%`` setting
   is located under Control Panel -> System and Security -> System ->
   Advanced System Settings -> Environment variables -> System variables ->
   Path.

7. If you have trouble with the ``.py`` file associations, try adding
   ``.PY`` to your ``PATHEXT`` environment variable. Also, try opening a
   Command Prompt as Administrator and run::

       > assoc .py=Python.File
       > ftype Python.File="C:\Python34\python.exe" "%1" %*


                    GNU AFFERO GENERAL PUBLIC LICENSE
                       Version 3, 19 November 2007

Copyright (C) 2007 Free Software Foundation, Inc.
Everyone is permitted to copy and distribute verbatim copies of this license
document, but changing it is not allowed.

                                Preamble

The GNU Affero General Public License is a free, copyleft license for
software and other kinds of works, specifically designed to ensure
cooperation with the community in the case of network server software.

The licenses for most software and other practical works are designed to take
away your freedom to share and change the works. By contrast, our General
Public Licenses are intended to guarantee your freedom to share and change
all versions of a program--to make sure it remains free software for all its
users. We, the Free Software Foundation, use the GNU General Public License
for most of our software; it applies also to any other work released this way
by its authors. You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price.
Our General Public Licenses are designed to make sure that you have the
freedom to distribute copies of free software (and charge for them if you
wish), that you receive source code or can get it if you want it, that you
can change the software or use pieces of it in new free programs, and that
you know you can do these things.

Developers that use our General Public Licenses protect your rights with two
steps: (1) assert copyright on the software, and (2) offer you this License
which gives you legal permission to copy, distribute and/or modify the
software.

A secondary benefit of defending all users' freedom is that improvements made
in alternate versions of the program, if they receive widespread use, become
available for other developers to incorporate. Many developers of free
software are heartened and encouraged by the resulting cooperation. However,
in the case of software used on network servers, this result may fail to come
about. The GNU General Public License permits making a modified version and
letting the public access it on a server without ever releasing its source
code to the public.

The GNU Affero General Public License is designed specifically to ensure
that, in such cases, the modified source code becomes available to the
community. It requires the operator of a network server to provide the source
code of the modified version running there to the users of that server.
Therefore, public use of a modified version, on a publicly accessible server,
gives the public access to the source code of the modified version.

An older license, called the Affero General Public License and published by
Affero, was designed to accomplish similar goals. This is a different
license, not a version of the Affero GPL, but Affero has released a new
version of the Affero GPL which permits relicensing under this license.

The precise terms and conditions for copying, distribution and modification
follow.

                       TERMS AND CONDITIONS

0. Definitions.

"This License" refers to version 3 of the GNU Affero General Public License.

"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.

"The Program" refers to any copyrightable work licensed under this License.
Each licensee is addressed as "you". "Licensees" and "recipients" may be
individuals or organizations.

To "modify" a work means to copy from or adapt all or part of the work in a
fashion requiring copyright permission, other than the making of an exact
copy. The resulting work is called a "modified version" of the earlier work
or a work "based on" the earlier work.

A "covered work" means either the unmodified Program or a work based on the
Program.

To "propagate" a work means to do anything with it that, without permission,
would make you directly or secondarily liable for infringement under
applicable copyright law, except executing it on a computer or modifying a
private copy. Propagation includes copying, distribution (with or without
modification), making available to the public, and in some countries other
activities as well.

To "convey" a work means any kind of propagation that enables other parties
to make or receive copies. Mere interaction with a user through a computer
network, with no transfer of a copy, is not conveying.

An interactive user interface displays "Appropriate Legal Notices" to the
extent that it includes a convenient and prominently visible feature that
(1) displays an appropriate copyright notice, and (2) tells the user that
there is no warranty for the work (except to the extent that warranties are
provided), that licensees may convey the work under this License, and how to
view a copy of this License. If the interface presents a list of user
commands or options, such as a menu, a prominent item in the list meets this
criterion.

1. Source Code.

The "source code" for a work means the preferred form of the work for making
modifications to it. "Object code" means any non-source form of a work.

A "Standard Interface" means an interface that either is an official standard
defined by a recognized standards body, or, in the case of interfaces
specified for a particular programming language, one that is widely used
among developers working in that language.

The "System Libraries" of an executable work include anything, other than the
work as a whole, that (a) is included in the normal form of packaging a Major
Component, but which is not part of that Major Component, and (b) serves only
to enable use of the work with that Major Component, or to implement a
Standard Interface for which an implementation is available to the public in
source code form. A "Major Component", in this context, means a major
essential component (kernel, window system, and so on) of the specific
operating system (if any) on which the executable work runs, or a compiler
used to produce the work, or an object code interpreter used to run it.

The "Corresponding Source" for a work in object code form means all the
source code needed to generate, install, and (for an executable work) run the
object code and to modify the work, including scripts to control those
activities. However, it does not include the work's System Libraries, or
general-purpose tools or generally available free programs which are used
unmodified in performing those activities but which are not part of the work.
For example, Corresponding Source includes interface definition files
associated with source files for the work, and the source code for shared
libraries and dynamically linked subprograms that the work is specifically
designed to require, such as by intimate data communication or control flow
between those subprograms and other parts of the work.

The Corresponding Source need not include anything that users can regenerate
automatically from other parts of the Corresponding Source.

The Corresponding Source for a work in source code form is that same work.

2. Basic Permissions.

All rights granted under this License are granted for the term of copyright
on the Program, and are irrevocable provided the stated conditions are met.
This License explicitly affirms your unlimited permission to run the
unmodified Program. The output from running a covered work is covered by this
License only if the output, given its content, constitutes a covered work.
This License acknowledges your rights of fair use or other equivalent, as
provided by copyright law.

You may make, run and propagate covered works that you do not convey, without
conditions so long as your license otherwise remains in force. You may convey
covered works to others for the sole purpose of having them make
modifications exclusively for you, or provide you with facilities for running
those works, provided that you comply with the terms of this License in
conveying all material for which you do not control copyright. Those thus
making or running the covered works for you must do so exclusively on your
behalf, under your direction and control, on terms that prohibit them from
making any copies of your copyrighted material outside their relationship
with you.

Conveying under any other circumstances is permitted solely under the
conditions stated below. Sublicensing is not allowed; section 10 makes it
unnecessary.

3. Protecting Users' Legal Rights From Anti-Circumvention Law.

No covered work shall be deemed part of an effective technological measure
under any applicable law fulfilling obligations under article 11 of the WIPO
copyright treaty adopted on 20 December 1996, or similar laws prohibiting or
restricting circumvention of such measures.

When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention is
effected by exercising rights under this License with respect to the covered
work, and you disclaim any intention to limit operation or modification of
the work as a means of enforcing, against the work's users, your or third
parties' legal rights to forbid circumvention of technological measures.

4. Conveying Verbatim Copies.

You may convey verbatim copies of the Program's source code as you receive
it, in any medium, provided that you conspicuously and appropriately publish
on each copy an appropriate copyright notice; keep intact all notices stating
that this License and any non-permissive terms added in accord with section 7
apply to the code; keep intact all notices of the absence of any warranty;
and give all recipients a copy of this License along with the Program.

You may charge any price or no price for each copy that you convey, and you
may offer support or warranty protection for a fee.

5. Conveying Modified Source Versions.

You may convey a work based on the Program, or the modifications to produce
it from the Program, in the form of source code under the terms of section 4,
provided that you also meet all of these conditions:

  a) The work must carry prominent notices stating that you modified it, and
     giving a relevant date.

  b) The work must carry prominent notices stating that it is released under
     this License and any conditions added under section 7. This requirement
     modifies the requirement in section 4 to "keep intact all notices".

  c) You must license the entire work, as a whole, under this License to
     anyone who comes into possession of a copy. This License will therefore
     apply, along with any applicable section 7 additional terms, to the
     whole of the work, and all its parts, regardless of how they are
     packaged. This License gives no permission to license the work in any
     other way, but it does not invalidate such permission if you have
     separately received it.

  d) If the work has interactive user interfaces, each must display
     Appropriate Legal Notices; however, if the Program has interactive
     interfaces that do not display Appropriate Legal Notices, your work need
     not make them do so.

A compilation of a covered work with other separate and independent works,
which are not by their nature extensions of the covered work, and which are
not combined with it such as to form a larger program, in or on a volume of a
storage or distribution medium, is called an "aggregate" if the compilation
and its resulting copyright are not used to limit the access or legal rights
of the compilation's users beyond what the individual works permit.
Inclusion of a covered work in an aggregate does not cause this License to
apply to the other parts of the aggregate.

6. Conveying Non-Source Forms.

You may convey a covered work in object code form under the terms of sections
4 and 5, provided that you also convey the machine-readable Corresponding
Source under the terms of this License, in one of these ways:

  a) Convey the object code in, or embodied in, a physical product (including
     a physical distribution medium), accompanied by the Corresponding Source
     fixed on a durable physical medium customarily used for software
     interchange.
  b) Convey the object code in, or embodied in, a physical product (including
     a physical distribution medium), accompanied by a written offer, valid
     for at least three years and valid for as long as you offer spare parts
     or customer support for that product model, to give anyone who possesses
     the object code either (1) a copy of the Corresponding Source for all
     the software in the product that is covered by this License, on a
     durable physical medium customarily used for software interchange, for a
     price no more than your reasonable cost of physically performing this
     conveying of source, or (2) access to copy the Corresponding Source from
     a network server at no charge.

  c) Convey individual copies of the object code with a copy of the written
     offer to provide the Corresponding Source. This alternative is allowed
     only occasionally and noncommercially, and only if you received the
     object code with such an offer, in accord with subsection 6b.

  d) Convey the object code by offering access from a designated place
     (gratis or for a charge), and offer equivalent access to the
     Corresponding Source in the same way through the same place at no
     further charge. You need not require recipients to copy the
     Corresponding Source along with the object code. If the place to copy
     the object code is a network server, the Corresponding Source may be on
     a different server (operated by you or a third party) that supports
     equivalent copying facilities, provided you maintain clear directions
     next to the object code saying where to find the Corresponding Source.
     Regardless of what server hosts the Corresponding Source, you remain
     obligated to ensure that it is available for as long as needed to
     satisfy these requirements.

  e) Convey the object code using peer-to-peer transmission, provided you
     inform other peers where the object code and Corresponding Source of the
     work are being offered to the general public at no charge under
     subsection 6d.

A separable portion of the object code, whose source code is excluded from
the Corresponding Source as a System Library, need not be included in
conveying the object code work.

A "User Product" is either (1) a "consumer product", which means any tangible
personal property which is normally used for personal, family, or household
purposes, or (2) anything designed or sold for incorporation into a dwelling.
In determining whether a product is a consumer product, doubtful cases shall
be resolved in favor of coverage. For a particular product received by a
particular user, "normally used" refers to a typical or common use of that
class of product, regardless of the status of the particular user or of the
way in which the particular user actually uses, or expects or is expected to
use, the product. A product is a consumer product regardless of whether the
product has substantial commercial, industrial or non-consumer uses, unless
such uses represent the only significant mode of use of the product.

"Installation Information" for a User Product means any methods, procedures,
authorization keys, or other information required to install and execute
modified versions of a covered work in that User Product from a modified
version of its Corresponding Source. The information must suffice to ensure
that the continued functioning of the modified object code is in no case
prevented or interfered with solely because modification has been made.

If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as part of
a transaction in which the right of possession and use of the User Product is
transferred to the recipient in perpetuity or for a fixed term (regardless of
how the transaction is characterized), the Corresponding Source conveyed
under this section must be accompanied by the Installation Information. But
this requirement does not apply if neither you nor any third party retains
the ability to install modified object code on the User Product (for example,
the work has been installed in ROM).

The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates for
a work that has been modified or installed by the recipient, or for the User
Product in which it has been modified or installed. Access to a network may
be denied when the modification itself materially and adversely affects the
operation of the network or violates the rules and protocols for
communication across the network.

Corresponding Source conveyed, and Installation Information provided, in
accord with this section must be in a format that is publicly documented (and
with an implementation available to the public in source code form), and must
require no special password or key for unpacking, reading or copying.

7. Additional Terms.

"Additional permissions" are terms that supplement the terms of this License
by making exceptions from one or more of its conditions. Additional
permissions that are applicable to the entire Program shall be treated as
though they were included in this License, to the extent that they are valid
under applicable law. If additional permissions apply only to part of the
Program, that part may be used separately under those permissions, but the
entire Program remains governed by this License without regard to the
additional permissions.

When you convey a copy of a covered work, you may at your option remove any
additional permissions from that copy, or from any part of it. (Additional
permissions may be written to require their own removal in certain cases when
you modify the work.) You may place additional permissions on material, added
by you to a covered work, for which you have or can give appropriate
copyright permission.

Notwithstanding any other provision of this License, for material you add to
a covered work, you may (if authorized by the copyright holders of that
material) supplement the terms of this License with terms:

  a) Disclaiming warranty or limiting liability differently from the terms of
     sections 15 and 16 of this License; or

  b) Requiring preservation of specified reasonable legal notices or author
     attributions in that material or in the Appropriate Legal Notices
     displayed by works containing it; or

  c) Prohibiting misrepresentation of the origin of that material, or
     requiring that modified versions of such material be marked in
     reasonable ways as different from the original version; or

  d) Limiting the use for publicity purposes of names of licensors or authors
     of the material; or

  e) Declining to grant rights under trademark law for use of some trade
     names, trademarks, or service marks; or

  f) Requiring indemnification of licensors and authors of that material by
     anyone who conveys the material (or modified versions of it) with
     contractual assumptions of liability to the recipient, for any liability
     that these contractual assumptions directly impose on those licensors
     and authors.

All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is governed
by this License along with a term that is a further restriction, you may
remove that term. If a license document contains a further restriction but
permits relicensing or conveying under this License, you may add to a covered
work material governed by the terms of that license document, provided that
the further restriction does not survive such relicensing or conveying.

If you add terms to a covered work in accord with this section, you must
place, in the relevant source files, a statement of the additional terms that
apply to those files, or a notice indicating where to find the applicable
terms.

Additional terms, permissive or non-permissive, may be stated in the form of
a separately written license, or stated as exceptions; the above requirements
apply either way.

8. Termination.

You may not propagate or modify a covered work except as expressly provided
under this License. Any attempt otherwise to propagate or modify it is void,
and will automatically terminate your rights under this License (including
any patent licenses granted under the third paragraph of section 11).

However, if you cease all violation of this License, then your license from a
particular copyright holder is reinstated (a) provisionally, unless and until
the copyright holder explicitly and finally terminates your license, and (b)
permanently, if the copyright holder fails to notify you of the violation by
some reasonable means prior to 60 days after the cessation.

Moreover, your license from a particular copyright holder is reinstated
permanently if the copyright holder notifies you of the violation by some
reasonable means, this is the first time you have received notice of
violation of this License (for any work) from that copyright holder, and you
cure the violation prior to 30 days after your receipt of the notice.

Termination of your rights under this section does not terminate the licenses
of parties who have received copies or rights from you under this License. If
your rights have been terminated and not permanently reinstated, you do not
qualify to receive new licenses for the same material under section 10.

9. Acceptance Not Required for Having Copies.

You are not required to accept this License in order to receive or run a copy
of the Program. Ancillary propagation of a covered work occurring solely as a
consequence of using peer-to-peer transmission to receive a copy likewise
does not require acceptance. However, nothing other than this License grants
you permission to propagate or modify any covered work. These actions
infringe copyright if you do not accept this License. Therefore, by modifying
or propagating a covered work, you indicate your acceptance of this License
to do so.

10. Automatic Licensing of Downstream Recipients.

Each time you convey a covered work, the recipient automatically receives a
license from the original licensors, to run, modify and propagate that work,
subject to this License. You are not responsible for enforcing compliance by
third parties with this License.

An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered work
results from an entity transaction, each party to that transaction who
receives a copy of the work also receives whatever licenses to the work the
party's predecessor in interest had or could give under the previous
paragraph, plus a right to possession of the Corresponding Source of the work
from the predecessor in interest, if the predecessor has it or can get it
with reasonable efforts.

You may not impose any further restrictions on the exercise of the rights
granted or affirmed under this License. For example, you may not impose a
license fee, royalty, or other charge for exercise of rights granted under
this License, and you may not initiate litigation (including a cross-claim or
counterclaim in a lawsuit) alleging that any patent claim is infringed by
making, using, selling, offering for sale, or importing the Program or any
portion of it.

11. Patents.

A "contributor" is a copyright holder who authorizes use under this License
of the Program or a work on which the Program is based. The work thus
licensed is called the contributor's "contributor version".

A contributor's "essential patent claims" are all patent claims owned or
controlled by the contributor, whether already acquired or hereafter
acquired, that would be infringed by some manner, permitted by this License,
of making, using, or selling its contributor version, but do not include
claims that would be infringed only as a consequence of further modification
of the contributor version. For purposes of this definition, "control"
includes the right to grant patent sublicenses in a manner consistent with
the requirements of this License.

Each contributor grants you a non-exclusive, worldwide, royalty-free patent
license under the contributor's essential patent claims, to make, use, sell,
offer for sale, import and otherwise run, modify and propagate the contents
of its contributor version.

In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent (such
as an express permission to practice a patent or covenant not to sue for
patent infringement). To "grant" such a patent license to a party means to
make such an agreement or commitment not to enforce a patent against the
party.

If you convey a covered work, knowingly relying on a patent license, and the
Corresponding Source of the work is not available for anyone to copy, free of
charge and under the terms of this License, through a publicly available
network server or other readily accessible means, then you must either (1)
cause the Corresponding Source to be so available, or (2) arrange to deprive
yourself of the benefit of the patent license for this particular work, or
(3) arrange, in a manner consistent with the requirements of this License, to
extend the patent license to downstream recipients.
"Knowingly relying" means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid. If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it. A patent license is "discriminatory" if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007. Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law. 12. No Surrender of Others' Freedom. 
If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program. 13. Remote Network Interaction; Use with the GNU General Public License. Notwithstanding any other provision of this License, if you modify the Program, your modified version must prominently offer all users interacting with it remotely through a computer network (if your version supports such interaction) an opportunity to receive the Corresponding Source of your version by providing access to the Corresponding Source from a network server at no charge, through some standard or customary means of facilitating copying of software. This Corresponding Source shall include the Corresponding Source for any work covered by version 3 of the GNU General Public License that is incorporated pursuant to the following paragraph. Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the work with which it is combined will remain governed by version 3 of the GNU General Public License. 14. Revised Versions of this License. The Free Software Foundation may publish revised and/or new versions of the GNU Affero General Public License from time to time. 
Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU Affero General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU Affero General Public License, you may choose any version ever published by the Free Software Foundation. If the Program specifies that a proxy can decide which future versions of the GNU Affero General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program. Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version. 15. Disclaimer of Warranty. THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 16. Limitation of Liability. 
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 17. Interpretation of Sections 15 and 16. If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. Copyright (C) This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. 
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

If your software can interact with users remotely through a computer network, you should also make sure that it provides a way for users to get its source. For example, if your program is a web application, its interface could display a "Source" link that leads users to an archive of the code. There are many ways you could offer source, and different solutions will be better for different programs; see section 13 for the specific requirements.

You should also get your employer (if you work as a programmer) or school, if any, to sign a "copyright disclaimer" for the program, if necessary. For more information on this, and how to apply and follow the GNU AGPL, see <https://www.gnu.org/licenses/>.

changeo-1.3.0/MANIFEST.in

include requirements.txt
include README.rst
include INSTALL.rst
include NEWS.rst
include LICENSE

changeo-1.3.0/NEWS.rst

Release Notes
===============================================================================

Version 1.3.0: December 11, 2022
-------------------------------------------------------------------------------

+ Various updates to internals and error messages.

AssignGenes:

+ Added support for ``.fastq`` files. If a ``.fastq`` file is input, then a corresponding ``.fasta`` file will be created in the output directory.
+ Added support for C region alignment calls provided by IgBLAST v1.18+.

MakeDb:

+ Added support for C region alignment calls provided by IgBLAST v1.18+.

Version 1.2.0: October 29, 2021
-------------------------------------------------------------------------------

+ Updated dependencies to presto >= v0.7.0.

AssignGenes:

+ Fixed reporting of IgBLAST output counts when specifying ``--format airr``.

BuildTrees:

+ Added support for specifying fixed omega and hotness parameters at the commandline.

CreateGermlines:

+ Will now use the first allele in the reference database when duplicate allele names are provided. This only appears to affect mouse BCR light chains and TCR alleles in the IMGT database, where the same allele name differs by strain.

MakeDb:

+ Added support for changes in how IMGT/HighV-QUEST v1.8.4 handles special characters in sequence identifiers.
+ Fixed the ``imgt`` subcommand incorrectly allowing execution without specifying the IMGT/HighV-QUEST output file at the commandline.

ParseDb:

+ Added reporting of output file sizes to the console log of the ``split`` subcommand.

Version 1.1.0: June 21, 2021
-------------------------------------------------------------------------------

+ Fixed gene parsing for IMGT temporary designation nomenclature.
+ Updated dependencies to biopython >= v1.77, airr >= v1.3.1, PyYAML >= 5.1.

MakeDb:

+ Added the ``--imgt-id-len`` argument to accommodate changes introduced in how IMGT/HighV-QUEST truncates sequence identifiers as of v1.8.3 (May 7, 2021). The header lines in the fasta files are now truncated to 49 characters. In IMGT/HighV-QUEST versions older than v1.8.3, they were truncated to 50 characters. The ``--imgt-id-len`` default value is 49. Users should specify ``--imgt-id-len 50`` to analyze IMGT results generated with IMGT/HighV-QUEST versions older than v1.8.3.
+ Added the ``--infer-junction`` argument to ``MakeDb igblast`` to enable inference of the junction sequence when not reported by IgBLAST.
Should be used with data from IgBLAST v1.6.0 or older, before IgBLAST added IMGT-CDR3 inference.

Version 1.0.2: January 18, 2021
-------------------------------------------------------------------------------

AlignRecords:

+ Fixed a bug that caused the program to exit when encountering missing sequence data. It will now fail the row or group with missing data and continue.

MakeDb:

+ Added support for IgBLAST v1.17.0.

ParseDb:

+ Added a relevant error message when an input field is missing from the data.

Version 1.0.1: October 13, 2020
-------------------------------------------------------------------------------

+ Updated to support Biopython v1.78.
+ Increased the biopython dependency to v1.71.
+ Increased the presto dependency to 0.6.2.

Version 1.0.0: May 6, 2020
-------------------------------------------------------------------------------

+ The default output in all tools is now the AIRR Rearrangement standard (``--format airr``). Support for the legacy Change-O data standard is still provided through the ``--format changeo`` argument to the tools.
+ License changed to AGPL-3.

AssignGenes:

+ Added the ``igblast-aa`` subcommand to run igblastp on amino acid input.

BuildTrees:

+ Adjusted ``RECORDS`` to indicate all sequences in the input file. ``INITIAL_FILTER`` now shows the sequence count after initial ``min_seq`` filtering.
+ Added an option to skip codon masking: ``--nmask``.
+ Mask ``:``, ``,``, ``)``, and ``(`` in IDs and metadata with ``-``.
+ Can obtain the germline from ``GERMLINE_IMGT`` if ``GERMLINE_IMGT_D_MASK`` is not specified.
+ Can reconstruct intermediate sequences with IgPhyML using ``--asr``.

ConvertDb:

+ Fixed a bug in the ``airr`` subcommand that caused the ``junction_length`` field to be deleted from the output.
+ Fixed a bug in the ``genbank`` subcommand that caused the junction CDS to be missing from the ASN output.

CreateGermlines:

+ Added the ``--cf`` argument to allow specification of the clone field.
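The identifier truncation behind the v1.1.0 ``--imgt-id-len`` argument can be sketched in a few lines. This is an illustrative standalone function, not part of the Change-O API; the function name and example header are assumptions.

```python
# Sketch of IMGT/HighV-QUEST sequence-identifier truncation (v1.1.0 note):
# headers are cut at 49 characters as of IMGT v1.8.3, and at 50 before that.
# truncate_imgt_id() is a hypothetical helper for illustration only.

def truncate_imgt_id(seq_id, id_len=49):
    """Mimic IMGT/HighV-QUEST fasta header truncation."""
    return seq_id[:id_len]

full_id = "Sample1|CONSCOUNT=42|" + "A" * 60   # hypothetical long header
print(len(truncate_imgt_id(full_id)))              # 49 characters survive
print(len(truncate_imgt_id(full_id, id_len=50)))   # pre-v1.8.3 behavior
```

Matching parsed IMGT records back to the original reads then requires comparing identifiers truncated to the same length, which is why the argument must agree with the IMGT version used.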
MakeDb:

+ Added the ``igblast-aa`` subcommand to parse the output of igblastp.
+ Changed the log entry ``FUNCTIONAL`` to ``PRODUCTIVE`` and removed the ``IMGT_PASS`` log entry in favor of an informative ``ERROR`` entry when sequences fail the junction region validation.
+ Added the ``--regions`` argument to the ``igblast`` and ``igblast-aa`` subcommands to allow specification of the IMGT CDR/FWR region boundaries. Currently, the supported specifications are ``default`` (human, mouse) and ``rhesus-igl``.

Version 0.4.6: July 19, 2019
-------------------------------------------------------------------------------

BuildTrees:

+ Added the capability of running IgPhyML on the output data (``--igphyml``) and support for passing IgPhyML arguments through BuildTrees.
+ Added the ``--clean`` argument to force deletion of all intermediate files after IgPhyML execution.
+ Added the ``--format`` argument to allow specification of either the Change-O standard (``changeo``) or the AIRR Rearrangement standard (``airr``) for input and output.

CreateGermlines:

+ Fixed a bug causing incorrect reporting of the germline format in the console log.

ConvertDb:

+ Removed the requirement for the ``NP1_LENGTH`` and ``NP2_LENGTH`` fields from the genbank subcommand.

DefineClones:

+ Fixed a biopython warning arising when applying ``--model aa`` to junction sequences that are not a multiple of three. The junction will now be padded with an appropriate number of Ns (usually resulting in a translation to X).

MakeDb:

+ Added the ``--10x`` argument to all subcommands to support merging of Cell Ranger annotation data, such as UMI count and C-region assignment, with the output of the supported alignment tools.
+ Added inference of the receptor locus from the alignment data to all subcommands, which is output in the ``LOCUS`` field.
+ Combined the extended field arguments of all subcommands (``--scores``, ``--regions``, ``--cdr3``, and ``--junction``) into a single ``--extended`` argument.
+ Removed parsing of the old IgBLAST v1.5 CDR3 fields (``CDR3_IGBLAST``, ``CDR3_IGBLAST_AA``).

Version 0.4.5: January 9, 2019
-------------------------------------------------------------------------------

+ Slightly changed the version number display in the commandline help.

BuildTrees:

+ Fixed a bug that caused a malformed lineages.tsv output file.

CreateGermlines:

+ Fixed a bug in the CreateGermlines log output causing incorrect missing D gene or J gene error messages.

DefineClones:

+ Fixed a bug that caused a missing junction column to cluster sequences together.

MakeDb:

+ Fixed a bug that caused failed germline reconstructions to be recorded as ``None``, rather than an empty string, in the ``GERMLINE_IMGT`` column.

Version 0.4.4: October 27, 2018
-------------------------------------------------------------------------------

+ Fixed a bug causing the values of ``_start`` fields to be off by one from the v1.2 AIRR Schema requirement when specifying ``--format airr``.

Version 0.4.3: October 19, 2018
-------------------------------------------------------------------------------

+ Updated the airr library requirement to v1.2.1 to fix empty V(D)J start coordinate values when specifying ``--format airr`` to tools.
+ Changed the pRESTO dependency to v0.5.10.

BuildTrees:

+ New tool.
+ Converts tab-delimited database files into input for `IgPhyML `_.

CreateGermlines:

+ Now verifies that all files/folders passed to the ``-r`` argument exist.

Version 0.4.2: September 6, 2018
-------------------------------------------------------------------------------

+ Updated support for the AIRR Rearrangement schema to v1.2 and added the associated airr library dependency.

AssignGenes:

+ New tool.
+ Provides a simple IgBLAST wrapper as the ``igblast`` subcommand.

ConvertDb:

+ The ``genbank`` subcommand will perform a check for some of the required columns in the input file and exit if they are not found.
+ Changed the behavior of the ``-y`` argument in the ``genbank`` subcommand.
This argument now applies to sample features only, but allows for the inclusion of any BioSample attribute.

CreateGermlines:

+ Will now perform a naive verification that the reference sequences provided to the ``-r`` argument are IMGT-gapped. A warning will be issued to standard error if the reference sequences fail the check.
+ Will perform a check for some of the required columns in the input file and exit if they are not found.

MakeDb:

+ Changed the output of ``SEQUENCE_VDJ`` from the igblast subcommand to retain insertions in the query sequence rather than delete them, as is done in the ``SEQUENCE_IMGT`` field.
+ Will now perform a naive verification that the reference sequences provided to the ``-r`` argument are IMGT-gapped. A warning will be issued to standard error if the reference sequences fail the check.

Version 0.4.1: July 16, 2018
-------------------------------------------------------------------------------

+ Fixed an installation incompatibility with pip 10.
+ Fixed a duplicate newline issue on Windows.
+ All tools will no longer create empty pass or fail files if there are no records meeting the appropriate criteria for output.
+ Most tools now allow explicit specification of the output file name via the optional ``-o`` argument.
+ Added support for the AIRR standard TSV via the ``--format airr`` argument to all relevant tools.
+ Replaced the V, D and J ``BTOP`` columns with ``CIGAR`` columns in the data standard.
+ Numerous API changes and internal structural changes to commandline tools.

AlignRecords:

+ Fixed a bug arising when space characters are present in the sequence identifiers.

ConvertDb:

+ New tool.
+ Includes the airr and changeo subcommands to convert between AIRR and Change-O formatted TSV files.
+ The genbank subcommand creates MiAIRR compliant files for submission to GenBank/TLS.
+ Contains the baseline and fasta subcommands previously in ParseDb.

CreateGermlines:

+ Changed the character used to pad clonal consensus sequences from ``.`` to ``N``.
+ Changed tie resolution in the clonal consensus from a random V/J gene to alphabetical by sequence identifier.
+ Added the ``--df`` and ``--jf`` arguments for specifying the D and J fields, respectively.
+ Added an initial sorting step when specifying ``--cloned``, so that clonally ordered input is no longer required.

DefineClones:

+ Removed the chen2010 and ademokun2011 subcommands and made the previous bygroup subcommand the default behavior.
+ Renamed the ``--f`` argument to ``--gf`` for consistency with other tools.
+ Added the arguments ``--vf`` and ``--jf`` to allow specification of the V and J call fields, respectively.

MakeDb:

+ Renamed the ``--noparse`` argument to ``--asis-id``.
+ Added the ``--asis-calls`` argument to the igblast subcommand to allow use with non-standard gene names.
+ Added the ``GERMLINE_IMGT`` column to the default output.
+ Changed junction inference in the igblast subcommand to use IgBLAST's CDR3 assignment for IgBLAST versions greater than or equal to 1.7.0.
+ Added a verification that the ``SEQUENCE_IMGT`` and ``JUNCTION`` fields are in agreement for records to pass.
+ Changed the behavior of the igblast subcommand's translation of the junction sequence to truncate junctions that are not multiples of 3, rather than pad to a multiple of 3 (removes the trailing X character).
+ The igblast subcommand will now fail records missing the required optional fields ``subject seq``, ``query seq`` and ``BTOP``, rather than abort.
+ Fixed a bug causing parsing of IgBLAST <= 1.4 output to fail.

ParseDb:

+ Added the merge subcommand, which will combine TSV files.
+ All field arguments are now case sensitive to provide support for both the Change-O and AIRR data standards.

Version 0.3.12: February 16, 2018
-------------------------------------------------------------------------------

MakeDb:

+ Fixed a bug wherein specifying multiple simultaneous inputs would cause duplication of parsed pRESTO fields to appear in the second and higher output files.
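The v0.4.1 change to junction translation (truncate to a multiple of 3 instead of padding, which removes the trailing X) reduces to a one-line trim before translation. This is an illustrative sketch; the function name is an assumption and not Change-O's internal API.

```python
# Sketch of the v0.4.1 junction-translation behavior: trailing nucleotides
# that do not complete a codon are dropped before translation, instead of
# padding the sequence (which previously yielded a trailing X residue).
# truncate_to_codons() is a hypothetical helper for illustration only.

def truncate_to_codons(junction):
    """Trim trailing nucleotides so the length is a multiple of 3."""
    return junction[:len(junction) - len(junction) % 3]

print(truncate_to_codons("TGTGCCAGA"))    # 9 nt: already codon-aligned
print(truncate_to_codons("TGTGCCAGAGT"))  # 11 nt: trailing 2 nt dropped
```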
Version 0.3.11: February 6, 2018
-------------------------------------------------------------------------------

MakeDb:

+ Fixed junction inference for the igblast subcommand when the J region is truncated.

Version 0.3.10: February 6, 2018
-------------------------------------------------------------------------------

+ Fixed incorrect progress bars resulting from files containing empty lines.

DefineClones:

+ Fixed several bugs in the chen2010 and ademokun2011 methods that caused them to either fail or incorrectly cluster all sequences into a single clone.
+ Added an informative message for out of memory errors in the chen2010 and ademokun2011 methods.

Version 0.3.9: October 17, 2017
-------------------------------------------------------------------------------

DefineClones:

+ Fixed a bug causing DefineClones to fail when all sequences are removed from a group due to missing characters.

Version 0.3.8: October 5, 2017
-------------------------------------------------------------------------------

AlignRecords:

+ Resurrected AlignRecords, which performs multiple alignment of sequence fields.
+ Added the new subcommands ``across`` (multiple aligns within columns), ``within`` (multiple aligns columns within each row), and ``block`` (multiple aligns across both columns and rows).

CreateGermlines:

+ Fixed a bug causing CreateGermlines to incorrectly fail records when using the argument ``--vf V_CALL_GENOTYPED``.

DefineClones:

+ Added the ``--maxmiss`` argument to the bygroup subcommand of DefineClones, which sets exclusion criteria for junction sequences with ambiguous and missing characters. By default, bygroup will now fail all sequences with any missing characters in the junction (``--maxmiss 0``).

Version 0.3.7: June 30, 2017
-------------------------------------------------------------------------------

MakeDb:

+ Fixed an incompatibility with IgBLAST v1.7.0.
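The v0.3.8 ``--maxmiss`` filter described above is a simple count-and-threshold rule. The sketch below is illustrative only; the function name and the exact set of characters counted as ambiguous or missing are assumptions, not Change-O internals.

```python
# Sketch of the v0.3.8 ``--maxmiss`` exclusion rule: a junction passes only
# if it contains at most max_miss ambiguous/missing characters.
# pass_junction() and the miss_chars set are assumptions for illustration.

def pass_junction(junction, max_miss=0, miss_chars=frozenset("N.-")):
    """Return True if the junction has at most max_miss ambiguous or missing characters."""
    return sum(1 for c in junction.upper() if c in miss_chars) <= max_miss

print(pass_junction("TGTGCCAGA"))              # clean junction passes
print(pass_junction("TGTGNCAGA"))              # fails under default --maxmiss 0
print(pass_junction("TGTGNCAGA", max_miss=1))  # passes with a looser threshold
```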
CreateGermlines:

+ Fixed an error that occurs when using ``--cloned`` with an input file containing duplicate values in ``SEQUENCE_ID``, which caused some records to be discarded.

Version 0.3.6: June 13, 2017
-------------------------------------------------------------------------------

+ Fixed an overflow error on Windows that caused tools to fatally exit.
+ All tools will now print detailed help if no arguments are provided.

Version 0.3.5: May 12, 2017
-------------------------------------------------------------------------------

+ Fixed a bug wherein ``.tsv`` was not being recognized as a valid extension.

MakeDb:

+ Added the ``--cdr3`` argument to the igblast subcommand to extract the CDR3 nucleotide and amino acid sequence defined by IgBLAST.
+ Updated the IMGT/HighV-QUEST parser to handle recent column name changes.
+ Fixed a bug in the igblast parser wherein some sequence identifiers were not being processed correctly.

DefineClones:

+ Changed the way ``X`` characters are handled in the amino acid Hamming distance model to count as a match against any character.

Version 0.3.4: February 14, 2017
-------------------------------------------------------------------------------

+ License changed to Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

CreateGermlines:

+ Added the ``GERMLINE_V_CALL``, ``GERMLINE_D_CALL`` and ``GERMLINE_J_CALL`` columns to the output when the ``--cloned`` argument is specified. These columns contain the consensus annotations when clonal groups contain ambiguous gene assignments.
+ Fixed the error message for an invalid repo (``-r``) argument.

DefineClones:

+ Deprecated the ``m1n`` and ``hs1f`` distance models, renamed them to ``m1n_compat`` and ``hs1f_compat``, and replaced them with ``hh_s1f`` and ``mk_rs1nf``, respectively.
+ Renamed the ``hs5f`` distance model to ``hh_s5f``.
+ Added the mouse specific distance model ``mk_rs5nf`` from Cui et al, 2016.

MakeDb:

+ Added compatibility for IgBLAST v1.6.
+ Added the ``--partial`` flag, which tells MakeDb to pass incomplete alignment results.
+ Added missing console log entries for the ihmm subcommand.
+ The IMGT/HighV-QUEST, IgBLAST and iHMMune-Align parsers have been cleaned up, better documented and moved into the iterable classes ``changeo.Parsers.IMGTReader``, ``changeo.Parsers.IgBLASTReader``, and ``changeo.Parsers.IHMMuneReader``, respectively.
+ Corrected the behavior of the ``D_FRAME`` annotation from the ``--junction`` argument to the imgt subcommand such that it now reports no value when no value is reported by IMGT, rather than reporting the reading frame as 0 in these cases.
+ Fixed parsing of the ``IN_FRAME``, ``STOP``, ``D_SEQ_START`` and ``D_SEQ_LENGTH`` fields from iHMMune-Align output.
+ Removed extraneous score fields from each parser.
+ Fixed the error message for an invalid repo (``-r``) argument.

Version 0.3.3: August 8, 2016
-------------------------------------------------------------------------------

+ Increased ``csv.field_size_limit`` in changeo.IO, ParseDb and DefineClones to be able to handle files with a larger number of UMIs in one field.
+ Renamed the fields ``N1_LENGTH`` to ``NP1_LENGTH`` and ``N2_LENGTH`` to ``NP2_LENGTH``.

CreateGermlines:

+ Added differentiation of the N and P regions in the ``REGION`` log field if the N/P region info is present in the input file (e.g., from the ``--junction`` argument to MakeDb-imgt). If the additional N/P region columns are not present, then both N and P regions will be denoted by N, as in previous versions.
+ Added the option 'regions' to the ``-g`` argument to add the ``GERMLINE_REGIONS`` field to the output, which represents the germline positions as V, D, J, N and P characters. This is equivalent to the ``REGION`` log entry.

DefineClones:

+ Significantly improved the performance of the ``--act set`` grouping method in the bygroup subcommand.
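The v0.3.5 change above to ``X`` handling in the amino acid Hamming distance model can be sketched as a wildcard-aware distance. This standalone function is for illustration only and is not the Change-O implementation.

```python
# Sketch of the v0.3.5 amino acid Hamming distance behavior: ``X`` counts
# as a match against any character, so positions containing X never
# contribute to the distance. aa_hamming() is an illustrative helper.

def aa_hamming(seq1, seq2, wildcard="X"):
    """Hamming distance over equal-length sequences, treating X as a match."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be the same length")
    return sum(1 for a, b in zip(seq1, seq2)
               if a != b and wildcard not in (a, b))

print(aa_hamming("CARDY", "CARDY"))  # 0: identical
print(aa_hamming("CARDY", "CARDF"))  # 1: one substitution
print(aa_hamming("CAXDY", "CARDF"))  # 1: X matches R, only Y/F counted
```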
MakeDb:

+ Fixed a bug producing ``D_SEQ_START`` and ``J_SEQ_START`` relative to ``SEQUENCE_VDJ`` when they should be relative to ``SEQUENCE_INPUT``.
+ Added the argument ``--junction`` to the imgt subcommand to parse additional junction information fields, including N/P region lengths and the D-segment reading frame. This provides the following additional output fields: ``D_FRAME``, ``N1_LENGTH``, ``N2_LENGTH``, ``P3V_LENGTH``, ``P5D_LENGTH``, ``P3D_LENGTH``, ``P5J_LENGTH``.
+ The fields ``N1_LENGTH`` and ``N2_LENGTH`` have been renamed to accommodate adding additional output from IMGT under the ``--junction`` flag. The new names are ``NP1_LENGTH`` and ``NP2_LENGTH``.
+ Fixed a bug that caused the ``IN_FRAME``, ``MUTATED_INVARIANT`` and ``STOP`` fields to be parsed incorrectly from IMGT data.
+ Output from iHMMune-Align can now be parsed via the ``ihmm`` subcommand. Note, there is insufficient information returned by iHMMune-Align to reliably reconstruct germline sequences from the output using CreateGermlines.

ParseDb:

+ Renamed the clip subcommand to baseline.

Version 0.3.2: March 8, 2016
-------------------------------------------------------------------------------

+ Fixed a bug with installation on Windows due to old file paths lingering in changeo.egg-info/SOURCES.txt.
+ Updated the license from CC BY-NC-SA 3.0 to CC BY-NC-SA 4.0.

CreateGermlines:

+ Fixed a bug producing incorrect values in the ``SEQUENCE`` field of the log file.

MakeDb:

+ Updated the igblast subcommand to correctly parse records with indels. Now igblast must be run with the argument ``-outfmt "7 std qseq sseq btop"``.
+ Changed the names of the FWR and CDR output columns added with ``--regions`` to include the ``_IMGT`` suffix.
+ Added ``V_BTOP`` and ``J_BTOP`` output when the ``--scores`` flag is specified to the igblast subcommand.
Version 0.3.1: December 18, 2015
-------------------------------------------------------------------------------

MakeDb:

+ Fixed a bug wherein the imgt subcommand was not properly recognizing an extracted folder as input to the ``-i`` argument.

Version 0.3.0: December 4, 2015
-------------------------------------------------------------------------------

+ Conversion to a proper Python package which uses pip and setuptools for installation.
+ The package now requires Python 3.4. Python 2.7 is no longer supported.
+ The required dependency versions have been bumped to numpy 1.9, scipy 0.14, pandas 0.16 and biopython 1.65.

DbCore:

+ Divided DbCore functionality into the separate modules: Defaults, Distance, IO, Multiprocessing and Receptor.

IgCore:

+ Removed IgCore in favor of a dependency on pRESTO >= 0.5.0.

AnalyzeAa:

+ This tool was removed. This functionality has been migrated to the alakazam R package.

DefineClones:

+ Added the ``--sf`` flag to specify the sequence field used to calculate distance between sequences.
+ Fixed a bug wherein sequences with missing data in grouping columns were being assigned into a single group and clustered. Sequences with missing grouping variables will now be failed.
+ Fixed a bug where sequences with "None" junctions were grouped together.

GapRecords:

+ This tool was removed in favor of adding IMGT gapping support to the igblast subcommand of MakeDb.

MakeDb:

+ Updated the IgBLAST parser to create an IMGT gapped sequence and infer the junction region as defined by IMGT.
+ Added the ``--regions`` flag, which adds extra columns containing FWR and CDR regions as defined by IMGT.
+ Added support to the imgt subcommand for the new IMGT/HighV-QUEST compression scheme (.txz files).

Version 0.2.5: August 25, 2015
-------------------------------------------------------------------------------

CreateGermlines:

+ Removed the default ``-r`` repository and added informative error messages when invalid germline repositories are provided.
+ Updated the ``-r`` flag to take a list of folders and/or fasta files with germlines.

Version 0.2.4: August 19, 2015
-------------------------------------------------------------------------------

MakeDb:

+ Fixed a bug wherein N1 and N2 region indexing was off by one nucleotide for the igblast subcommand (leading to incorrect SEQUENCE_VDJ values).

ParseDb:

+ Fixed a bug wherein specifying the ``-f`` argument to the index subcommand would cause an error.

Version 0.2.3: July 22, 2015
-------------------------------------------------------------------------------

DefineClones:

+ Fixed a typo in the default normalization setting of the bygroup subcommand, which was being interpreted as 'none' rather than 'len'.
+ Changed the 'hs5f' model of the bygroup subcommand to be centered -log10 of the targeting probability.
+ Added the ``--sym`` argument to the bygroup subcommand, which determines how asymmetric distances are handled.

Version 0.2.2: July 8, 2015
-------------------------------------------------------------------------------

CreateGermlines:

+ Germline creation now works for IgBLAST output parsed with MakeDb. The argument ``--sf SEQUENCE_VDJ`` must be provided to generate germlines from IgBLAST output. The same reference database used for the IgBLAST alignment must be specified with the ``-r`` flag.
+ Fixed a bug with determination of N1 and N2 region positions.

MakeDb:

+ Combined the ``-z`` and ``-f`` flags of the imgt subcommand into a single flag, ``-i``, which autodetects the input type.
+ Added the requirement that IgBLAST input be generated using the ``-outfmt "7 std qseq"`` argument to igblastn.
+ Modified the SEQUENCE_VDJ output from the IgBLAST parser to include gaps inserted during alignment.
+ Added a correction for IgBLAST alignments where V/D, D/J or V/J segments are assigned overlapping positions.
+ Corrected the N1_LENGTH and N2_LENGTH calculation from IgBLAST output.
+ Added the ``--scores`` flag, which adds extra columns containing alignment scores from IMGT and IgBLAST output.

Version 0.2.1: June 18, 2015
-------------------------------------------------------------------------------

DefineClones:

+ Removed the mouse 3-mer model, 'm3n'.

Version 0.2.0: June 17, 2015
-------------------------------------------------------------------------------

+ Initial public prerelease.
+ Output files were added to the usage documentation of all scripts.
+ General code cleanup.

DbCore:

+ Updated loading of database files to convert column names to uppercase.

AnalyzeAa:

+ Fixed a bug where junctions less than one codon long would lead to a division by zero error.
+ Added the ``--failed`` flag to create a database with records that fail analysis.
+ Added the ``--sf`` flag to specify the sequence field to be analyzed.

CreateGermlines:

+ Fixed a bug where germline sequences could not be created for light chains.

DefineClones:

+ Added a human 1-mer model, 'hs1f', which uses the substitution rates from Yaari et al, 2013.
+ Changed the default model to 'hs1f' and the default normalization to length for the bygroup subcommand.
+ Added the ``--link`` argument, which allows for specification of single, complete, or average linkage during clonal clustering (default single).

GapRecords:

+ Fixed a bug wherein non-standard sequence fields could not be aligned.

MakeDb:

+ Fixed a bug where the allele 'TRGVA*01' was not recognized as a valid allele.

ParseDb:

+ Added the rename subcommand to ParseDb, which renames fields.

Version 0.2.0.beta-2015-05-31: May 31, 2015
-------------------------------------------------------------------------------

+ Minor changes to a few output file names and log field entries.

ParseDb:

+ Added the index subcommand to ParseDb, which adds a numeric index field.

Version 0.2.0.beta-2015-05-05: May 05, 2015
-------------------------------------------------------------------------------

Prerelease for review.
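The ``--link`` argument introduced in v0.2.0 selects the linkage used when grouping junction distances into clones. A minimal single-linkage sketch using union-find is shown below: any pair within the distance cutoff ends up in the same cluster. This is purely illustrative, with assumed names, and is not the DefineClones implementation.

```python
# Illustrative single-linkage clustering with a distance cutoff, the default
# linkage behavior described in the v0.2.0 notes. single_linkage() and
# hamming() are hypothetical helpers, not Change-O internals.

def single_linkage(items, dist, cutoff):
    """Cluster items so that any pair with dist(a, b) <= cutoff is merged."""
    parent = list(range(len(items)))

    def find(i):
        # Find the cluster root with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union every pair of items within the cutoff.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if dist(items[i], items[j]) <= cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(items[i])
    return list(clusters.values())

def hamming(a, b):
    """Hamming distance on equal-length junction strings."""
    return sum(x != y for x, y in zip(a, b))

print(single_linkage(["AAAA", "AAAT", "CCCC"], hamming, cutoff=1))
```

Complete linkage would instead require *all* pairwise distances within a cluster to fall under the cutoff, which is why the choice of linkage changes clone membership at the margins.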
changeo-1.3.0/PKG-INFO

Metadata-Version: 2.1
Name: changeo
Version: 1.3.0
Summary: A bioinformatics toolkit for processing high-throughput lymphocyte receptor sequencing data.
Home-page: http://changeo.readthedocs.io
Download-URL: https://bitbucket.org/kleinstein/changeo/downloads
Author: Namita Gupta, Jason Anthony Vander Heiden
Author-email: immcantation@googlegroups.com
License: GNU Affero General Public License 3 (AGPL-3)
Keywords: bioinformatics,sequencing,immunology,adaptive immunity,immunoglobulin,AIRR-seq,Rep-Seq,B cell repertoire analysis,adaptive immune receptor repertoires
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.4
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
License-File: LICENSE

.. image:: https://img.shields.io/pypi/dm/changeo
    :target: https://pypi.org/project/changeo
.. image:: https://img.shields.io/static/v1?label=AIRR-C%20sw-tools%20v1&message=compliant&color=008AFF&labelColor=000000&style=plastic
    :target: https://docs.airr-community.org/en/stable/swtools/airr_swtools_standard.html

Change-O - Repertoire clonal assignment toolkit
================================================================================

Change-O is a collection of tools for processing the output of V(D)J alignment tools, assigning clonal clusters to immunoglobulin (Ig) sequences, and reconstructing germline sequences. Dramatic improvements in high-throughput sequencing technologies now enable large-scale characterization of Ig repertoires, defined as the collection of trans-membrane antigen-receptor proteins located on the surface of B cells and T cells.
Change-O is a suite of utilities to facilitate advanced analysis of Ig and TCR
sequences following germline segment assignment. Change-O handles output from
IMGT/HighV-QUEST and IgBLAST, and provides a wide variety of clustering methods
for assigning clonal groups to Ig sequences. Record sorting, grouping, and
various database manipulation operations are also included.

changeo-1.3.0/README.rst

.. image:: https://img.shields.io/pypi/dm/changeo
    :target: https://pypi.org/project/changeo
.. image:: https://img.shields.io/static/v1?label=AIRR-C%20sw-tools%20v1&message=compliant&color=008AFF&labelColor=000000&style=plastic
    :target: https://docs.airr-community.org/en/stable/swtools/airr_swtools_standard.html

Change-O - Repertoire clonal assignment toolkit
================================================================================

Change-O is a collection of tools for processing the output of V(D)J alignment
tools, assigning clonal clusters to immunoglobulin (Ig) sequences, and
reconstructing germline sequences.

Dramatic improvements in high-throughput sequencing technologies now enable
large-scale characterization of Ig repertoires, defined as the collection of
trans-membrane antigen-receptor proteins located on the surface of B cells
and T cells.

Change-O is a suite of utilities to facilitate advanced analysis of Ig and TCR
sequences following germline segment assignment. Change-O handles output from
IMGT/HighV-QUEST and IgBLAST, and provides a wide variety of clustering methods
for assigning clonal groups to Ig sequences. Record sorting, grouping, and
various database manipulation operations are also included.
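The README describes assigning clonal groups to Ig sequences. A common precursor to any such clustering is partitioning records by gene-level V call, J call, and junction length, so that only sequences with compatible rearrangements are compared. The sketch below illustrates that partitioning step only; the AIRR-style field names and example records are assumptions for illustration, not Change-O's own code.

```python
from collections import defaultdict

# Illustrative records; field names mirror AIRR-style columns but are assumptions here.
records = [
    {"sequence_id": "seq1", "v_call": "IGHV1-2*02",  "j_call": "IGHJ4*02", "junction": "TGTGCAAGA"},
    {"sequence_id": "seq2", "v_call": "IGHV1-2*04",  "j_call": "IGHJ4*02", "junction": "TGTGCAAGG"},
    {"sequence_id": "seq3", "v_call": "IGHV3-23*01", "j_call": "IGHJ6*02", "junction": "TGTGCGAAAGCA"},
]

def gene(call):
    """Collapse an allele call (e.g. 'IGHV1-2*02') to its gene ('IGHV1-2')."""
    return call.split("*")[0]

# Partition by (V gene, J gene, junction length); clustering then runs per group.
groups = defaultdict(list)
for rec in records:
    key = (gene(rec["v_call"]), gene(rec["j_call"]), len(rec["junction"]))
    groups[key].append(rec["sequence_id"])
# seq1 and seq2 share gene-level V/J calls and junction length; seq3 stands alone.
```

Grouping at the gene rather than allele level is the more permissive choice: distinct alleles of the same gene fall into the same candidate clone group.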
changeo-1.3.0/bin/AlignRecords.py

#!/usr/bin/env python3
"""
Multiple aligns sequence fields
"""

# Info
__author__ = 'Jason Anthony Vander Heiden'
from changeo import __version__, __date__

# Imports
import os
import shutil
from argparse import ArgumentParser
from collections import OrderedDict
from itertools import chain
from textwrap import dedent
from Bio.SeqRecord import SeqRecord

# Presto and changeo imports
from presto.Defaults import default_out_args, default_muscle_exec
from presto.Applications import runMuscle
from presto.IO import printLog, printError, printWarning
from presto.Multiprocessing import manageProcesses
from changeo.Commandline import CommonHelpFormatter, checkArgs, getCommonArgParser, parseCommonArgs
from changeo.IO import getDbFields, getFormatOperators
from changeo.Multiprocessing import DbResult, feedDbQueue, processDbQueue, collectDbQueue


# TODO: maybe not bothering with 'set' is best. can just work off field identity
def groupRecords(records, fields=None, calls=['v', 'j'], mode='gene', action='first'):
    """
    Groups Receptor objects based on gene or annotation

    Arguments:
      records : an iterator of Receptor objects to group.
      fields : gene field to group by.
      calls : allele calls to use for grouping. one or more of ('v', 'd', 'j').
      mode : specificity of alignment call to use for allele call fields. one of ('allele', 'gene').
      action : only 'first' is currently supported.
    Returns:
      dictionary of grouped records
    """
    # Define functions for grouping keys
    if mode == 'allele' and fields is None:
        def _get_key(rec, calls, action):
            return tuple(rec.getAlleleCalls(calls, action))
    elif mode == 'gene' and fields is None:
        def _get_key(rec, calls, action):
            return tuple(rec.getGeneCalls(calls, action))
    elif mode == 'allele' and fields is not None:
        def _get_key(rec, calls, action):
            vdj = rec.getAlleleCalls(calls, action)
            ann = [rec.getChangeo(k) for k in fields]
            return tuple(chain(vdj, ann))
    elif mode == 'gene' and fields is not None:
        def _get_key(rec, calls, action):
            vdj = rec.getGeneCalls(calls, action)
            ann = [rec.getChangeo(k) for k in fields]
            return tuple(chain(vdj, ann))

    rec_index = {}
    for rec in records:
        key = _get_key(rec, calls, action)
        # Assign grouped records to individual keys and all failed to a single key
        if all([k is not None for k in key]):
            rec_index.setdefault(key, []).append(rec)
        else:
            rec_index.setdefault(None, []).append(rec)

    return rec_index


def alignBlocks(data, field_map, muscle_exec=default_muscle_exec):
    """
    Multiple aligns blocks of sequence fields together

    Arguments:
      data : DbData object with Receptor objects to process.
      field_map : a dictionary of {input sequence : output sequence} field names to multiple align.
      muscle_exec : the MUSCLE executable.

    Returns:
      changeo.Multiprocessing.DbResult : object containing Receptor objects with multiple aligned sequence fields.
""" # Define sequence fields seq_fields = list(field_map.keys()) # Function to validate record def _pass(rec): if all([len(rec.getField(f)) > 0 for f in seq_fields]): return True else: return False # Define return object result = DbResult(data.id, data.data) result.results = data.data result.valid = True # Fail invalid groups if result.id is None or not all([_pass(x) for x in data.data]): result.log = None result.valid = False return result # Run muscle and map results seq_list = [SeqRecord(r.getSeq(f), id='%s_%s' % (r.sequence_id.replace(' ', '_'), f)) for f in seq_fields \ for r in data.data] seq_aln = runMuscle(seq_list, aligner_exec=muscle_exec) if seq_aln is not None: aln_map = {x.id: i for i, x in enumerate(seq_aln)} for i, r in enumerate(result.results, start=1): for f in seq_fields: idx = aln_map['%s_%s' % (r.sequence_id.replace(' ', '_'), f)] seq = str(seq_aln[idx].seq) r.annotations[field_map[f]] = seq result.log['%s-%s' % (f, r.sequence_id)] = seq else: result.valid = False #for r in result.results: print r.annotations return result def alignAcross(data, field_map, muscle_exec=default_muscle_exec): """ Multiple aligns sequence fields column wise Arguments: data : DbData object with Receptor objects to process. field_map : a dictionary of {input sequence : output sequence) field names to multiple align. muscle_exec : the MUSCLE executable. Returns: changeo.Multiprocessing.DbResult : object containing Receptor objects with multiple aligned sequence fields. 
""" # Define sequence fields seq_fields = list(field_map.keys()) # Function to validate record def _pass(rec): if all([len(rec.getField(f)) > 0 for f in seq_fields]): return True else: return False # Define return object result = DbResult(data.id, data.data) result.results = data.data result.valid = True # Fail invalid groups if result.id is None or not all([_pass(x) for x in data.data]): result.log = None result.valid = False return result seq_fields = list(field_map.keys()) for f in seq_fields: seq_list = [SeqRecord(r.getSeq(f), id=r.sequence_id.replace(' ', '_')) for r in data.data] seq_aln = runMuscle(seq_list, aligner_exec=muscle_exec) if seq_aln is not None: aln_map = {x.id: i for i, x in enumerate(seq_aln)} for i, r in enumerate(result.results, start=1): idx = aln_map[r.sequence_id.replace(' ', '_')] seq = str(seq_aln[idx].seq) r.annotations[field_map[f]] = seq result.log['%s-%s' % (f, r.sequence_id)] = seq else: result.valid = False #for r in result.results: print r.annotations return result def alignWithin(data, field_map, muscle_exec=default_muscle_exec): """ Multiple aligns sequence fields within a row Arguments: data : DbData object with Receptor objects to process. field_map : a dictionary of {input sequence : output sequence) field names to multiple align. muscle_exec : the MUSCLE executable. Returns: changeo.Multiprocessing.DbResult : object containing Receptor objects with multiple aligned sequence fields. 
""" # Define sequence fields seq_fields = list(field_map.keys()) # Function to validate record def _pass(rec): if all([len(rec.getField(f)) > 0 for f in seq_fields]): return True else: return False # Define return object result = DbResult(data.id, data.data) result.results = data.data result.valid = True # Fail invalid groups if result.id is None or not _pass(data.data): result.log = None result.valid = False return result record = data.data seq_list = [SeqRecord(record.getSeq(f), id=f) for f in seq_fields] seq_aln = runMuscle(seq_list, aligner_exec=muscle_exec) if seq_aln is not None: aln_map = {x.id: i for i, x in enumerate(seq_aln)} for f in seq_fields: idx = aln_map[f] seq = str(seq_aln[idx].seq) record.annotations[field_map[f]] = seq result.log[f] = seq else: result.valid = False return result def alignRecords(db_file, seq_fields, group_func, align_func, group_args={}, align_args={}, format='changeo', out_file=None, out_args=default_out_args, nproc=None, queue_size=None): """ Performs a multiple alignment on sets of sequences Arguments: db_file : filename of the input database. seq_fields : the sequence fields to multiple align. group_func : function to use to group records. align_func : function to use to multiple align sequence groups. group_args : dictionary of arguments to pass to group_func. align_args : dictionary of arguments to pass to align_func. format : output format. One of 'changeo' or 'airr'. out_file : output file name. Automatically generated from the input file if None. out_args : common output argument dictionary from parseCommonArgs. nproc : the number of processQueue processes. if None defaults to the number of CPUs. queue_size : maximum size of the argument queue. if None defaults to 2*nproc. Returns: dict : names of the 'pass' and 'fail' output files. 
""" # Define subcommand label dictionary cmd_dict = {alignAcross: 'across', alignWithin: 'within', alignBlocks: 'block'} # Print parameter info log = OrderedDict() log['START'] = 'AlignRecords' log['COMMAND'] = cmd_dict.get(align_func, align_func.__name__) log['FILE'] = os.path.basename(db_file) log['SEQ_FIELDS'] = ','.join(seq_fields) if 'group_fields' in group_args: log['GROUP_FIELDS'] = ','.join(group_args['group_fields']) if 'mode' in group_args: log['MODE'] = group_args['mode'] if 'action' in group_args: log['ACTION'] = group_args['action'] log['NPROC'] = nproc printLog(log) # Define format operators try: reader, writer, schema = getFormatOperators(format) except ValueError: printError('Invalid format %s.' % format) # Define feeder function and arguments if 'group_fields' in group_args and group_args['group_fields'] is not None: group_args['group_fields'] = [schema.toReceptor(f) for f in group_args['group_fields']] feed_func = feedDbQueue feed_args = {'db_file': db_file, 'reader': reader, 'group_func': group_func, 'group_args': group_args} # Define worker function and arguments field_map = OrderedDict([(schema.toReceptor(f), '%s_align' % f) for f in seq_fields]) align_args['field_map'] = field_map work_func = processDbQueue work_args = {'process_func': align_func, 'process_args': align_args} # Define collector function and arguments out_fields = getDbFields(db_file, add=list(field_map.values()), reader=reader) out_args['out_type'] = schema.out_type collect_func = collectDbQueue collect_args = {'db_file': db_file, 'label': 'align', 'fields': out_fields, 'writer': writer, 'out_file': out_file, 'out_args': out_args} # Call process manager result = manageProcesses(feed_func, work_func, collect_func, feed_args, work_args, collect_args, nproc, queue_size) # Print log result['log']['END'] = 'AlignRecords' printLog(result['log']) output = {k: v for k, v in result.items() if k in ('pass', 'fail')} return output def getArgParser(): """ Defines the ArgumentParser 
Arguments: None Returns: an ArgumentParser object """ # Define output file names and header fields fields = dedent( ''' output files: align-pass database with multiple aligned sequences. align-fail database with records failing alignment. required fields: sequence_id, v_call, j_call user specified sequence fields to align. output fields: _align ''') # Define ArgumentParser parser = ArgumentParser(description=__doc__, epilog=fields, formatter_class=CommonHelpFormatter, add_help=False) group_help = parser.add_argument_group('help') group_help.add_argument('--version', action='version', version='%(prog)s:' + ' %s %s' %(__version__, __date__)) group_help.add_argument('-h', '--help', action='help', help='show this help message and exit') subparsers = parser.add_subparsers(title='subcommands', dest='command', metavar='', help='alignment method') # TODO: This is a temporary fix for Python issue 9253 subparsers.required = True # Parent parser parser_parent = getCommonArgParser(format=True, multiproc=True) # Argument parser for column-wise alignment across records parser_across = subparsers.add_parser('across', parents=[parser_parent], formatter_class=CommonHelpFormatter, add_help=False, help='''Multiple aligns sequence columns within groups and across rows using MUSCLE.''') group_across = parser_across.add_argument_group('alignment arguments') group_across.add_argument('--sf', nargs='+', action='store', dest='seq_fields', required=True, help='The sequence fields to multiple align within each group.') group_across.add_argument('--gf', nargs='+', action='store', dest='group_fields', default=None, help='Additional (not allele call) fields to use for grouping.') group_across.add_argument('--calls', nargs='+', action='store', dest='calls', choices=('v', 'd', 'j'), default=['v', 'j'], help='Segment calls (allele assignments) to use for grouping.') group_across.add_argument('--mode', action='store', dest='mode', choices=('allele', 'gene'), default='gene', help='''Specifies 
whether to use the V(D)J allele or gene when an allele call field (--calls) is specified.''') group_across.add_argument('--act', action='store', dest='action', default='first', choices=('first', ), help='''Specifies how to handle multiple values within default allele call fields. Currently, only "first" is supported.''') group_across.add_argument('--exec', action='store', dest='muscle_exec', default=default_muscle_exec, help='The location of the MUSCLE executable') parser_across.set_defaults(group_func=groupRecords, align_func=alignAcross) # Argument parser for alignment of fields within records parser_within = subparsers.add_parser('within', parents=[parser_parent], formatter_class=CommonHelpFormatter, add_help=False, help='Multiple aligns sequence fields within rows using MUSCLE') group_within = parser_within.add_argument_group('alignment arguments') group_within.add_argument('--sf', nargs='+', action='store', dest='seq_fields', required=True, help='The sequence fields to multiple align within each record.') group_within.add_argument('--exec', action='store', dest='muscle_exec', default=default_muscle_exec, help='The location of the MUSCLE executable') parser_within.set_defaults(group_func=None, align_func=alignWithin) # Argument parser for column-wise alignment across records parser_block = subparsers.add_parser('block', parents=[parser_parent], formatter_class=CommonHelpFormatter, add_help=False, help='''Multiple aligns sequence groups across both columns and rows using MUSCLE.''') group_block = parser_block.add_argument_group('alignment arguments') group_block.add_argument('--sf', nargs='+', action='store', dest='seq_fields', required=True, help='The sequence fields to multiple align within each group.') group_block.add_argument('--gf', nargs='+', action='store', dest='group_fields', default=None, help='Additional (not allele call) fields to use for grouping.') group_block.add_argument('--calls', nargs='+', action='store', dest='calls', choices=('v', 'd', 
'j'), default=['v', 'j'], help='Segment calls (allele assignments) to use for grouping.') group_block.add_argument('--mode', action='store', dest='mode', choices=('allele', 'gene'), default='gene', help='''Specifies whether to use the V(D)J allele or gene when an allele call field (--calls) is specified.''') group_block.add_argument('--act', action='store', dest='action', default='first', choices=('first', ), help='''Specifies how to handle multiple values within default allele call fields. Currently, only "first" is supported.''') group_block.add_argument('--exec', action='store', dest='muscle_exec', default=default_muscle_exec, help='The location of the MUSCLE executable') parser_block.set_defaults(group_func=groupRecords, align_func=alignBlocks) return parser if __name__ == '__main__': """ Parses command line arguments and calls main function """ # Parse arguments parser = getArgParser() checkArgs(parser) args = parser.parse_args() args_dict = parseCommonArgs(args) # Check if a valid MUSCLE executable was specified for muscle mode if not shutil.which(args.muscle_exec): parser.error('%s does not exist or is not executable.' 
                     % args.muscle_exec)

    # Define align_args
    args_dict['align_args'] = {'muscle_exec': args_dict['muscle_exec']}
    del args_dict['muscle_exec']

    # Define group_args
    if args_dict['group_func'] is groupRecords:
        args_dict['group_args'] = {'fields': args_dict['group_fields'],
                                   'calls': args_dict['calls'],
                                   'mode': args_dict['mode'],
                                   'action': args_dict['action']}
        del args_dict['group_fields']
        del args_dict['calls']
        del args_dict['mode']
        del args_dict['action']

    # Clean arguments dictionary
    del args_dict['command']
    del args_dict['db_files']
    if 'out_files' in args_dict:
        del args_dict['out_files']

    # Call main function for each input file
    for i, f in enumerate(args.__dict__['db_files']):
        args_dict['db_file'] = f
        args_dict['out_file'] = args.__dict__['out_files'][i] \
            if args.__dict__['out_files'] else None
        alignRecords(**args_dict)

changeo-1.3.0/bin/AssignGenes.py

#!/usr/bin/env python3
"""
Assign V(D)J gene annotations
"""

# Info
__author__ = 'Jason Anthony Vander Heiden'
from changeo import __version__, __date__

# Imports
import os
import shutil
from argparse import ArgumentParser
from collections import OrderedDict
from pkg_resources import parse_version
from textwrap import dedent
from time import time
import re
import Bio
from Bio import SeqIO

# Presto imports
from presto.IO import printLog, printMessage, printError, printWarning
from changeo.Defaults import default_igblastn_exec, default_igblastp_exec, default_out_args
from changeo.Applications import runIgBLASTN, runIgBLASTP, getIgBLASTVersion
from changeo.Commandline import CommonHelpFormatter, checkArgs, getCommonArgParser, parseCommonArgs
from changeo.IO import getOutputName

# Defaults
choices_format = ('blast', 'airr')
choices_loci = ('ig', 'tr')
choices_organism = ('human', 'mouse', 'rabbit', 'rat', 'rhesus_monkey')
default_format = 'blast'
default_loci = 'ig'
default_organism = 'human' default_igdata = '~/share/igblast' def assignIgBLAST(seq_file, amino_acid=False, igdata=default_igdata, loci='ig', organism='human', vdb=None, ddb=None, jdb=None, cdb=None, format=default_format, igblast_exec=default_igblastn_exec, out_file=None, out_args=default_out_args, nproc=None): """ Performs clustering on sets of sequences Arguments: seq_file (str): the sample sequence file name. amino_acid : if True then run igblastp. igblastn is assumed if False. igdata (str): path to the IgBLAST database directory (IGDATA environment). loci (str): receptor type; one of 'ig' or 'tr'. organism (str): species name. vdb (str): name of a custom V reference in the database folder to use. ddb (str): name of a custom D reference in the database folder to use. jdb (str): name of a custom J reference in the database folder to use. cdb (str): name of a custom C reference in the database folder to use. format (str): output format. One of 'blast' or 'airr'. exec (str): the path to the igblastn executable. out_file (str): output file name. Automatically generated from the input file if None. out_args (dict): common output argument dictionary from parseCommonArgs. nproc (int): the number of processQueue processes; if None defaults to the number of CPUs. Returns: str: the output file name """ # Check format argument try: out_type = {'blast': 'fmt7', 'airr': 'tsv'}[format] except KeyError: printError('Invalid output format %s.' % format) # Get IgBLAST version version = getIgBLASTVersion(exec=igblast_exec) if parse_version(version) < parse_version('1.6'): printError('IgBLAST version is %s and 1.6 or higher is required.' % version) if format == 'airr' and parse_version(version) < parse_version('1.9'): printError('IgBLAST version is %s and 1.9 or higher is required for AIRR format support.' 
% version) # Print parameter info log = OrderedDict() log['START'] = 'AssignGenes' log['COMMAND'] = 'igblast-aa' if amino_acid else 'igblast' log['VERSION'] = version log['FILE'] = os.path.basename(seq_file) log['ORGANISM'] = organism log['LOCI'] = loci log['NPROC'] = nproc printLog(log) # Open output writer if out_file is None: out_file = getOutputName(seq_file, out_label='igblast', out_dir=out_args['out_dir'], out_name=out_args['out_name'], out_type=out_type) # convert to FASTA if needed infile = open(seq_file, 'r') test = infile.read()[0] if test == "@": printMessage("Running conversion from FASTQ to FASTA") fasta_out_dir, filename = os.path.split(out_file) out_fasta_file = os.path.split(seq_file)[1] out_fasta_file = os.path.join(fasta_out_dir,'%s.fasta' % os.path.splitext(out_fasta_file)[0]) with open(out_fasta_file, "w") as out_handle: records = SeqIO.parse(seq_file, 'fastq') if parse_version(Bio.__version__) >= parse_version('1.71'): # Biopython >= v1.71 SeqIO.write(records, out_handle, format='fasta-2line') else: # Biopython < v1.71 writer = SeqIO.FastaIO.FastaWriter(out_handle, wrap=None) writer.write_file(records) seq_file = out_fasta_file # Run IgBLAST clustering start_time = time() printMessage('Running IgBLAST', start_time=start_time, width=25) if not amino_acid: console_out = runIgBLASTN(seq_file, igdata, loci=loci, organism=organism, vdb=vdb, ddb=ddb, jdb=jdb, cdb=cdb, output=out_file, format=format, threads=nproc, exec=igblast_exec) else: console_out = runIgBLASTP(seq_file, igdata, loci=loci, organism=organism, vdb=vdb, output=out_file, threads=nproc, exec=igblast_exec) printMessage('Done', start_time=start_time, end=True, width=25) # Get number of processed sequences if (format == 'blast'): with open(out_file, 'rb') as f: f.seek(-2, os.SEEK_END) while f.read(1) != b'\n': f.seek(-2, os.SEEK_CUR) pass_info = f.readline().decode() num_seqs_match = re.search('(# BLAST processed )(\d+)( .*)', pass_info) num_sequences = num_seqs_match.group(2) else: f = 
open(out_file, 'rb') lines = 0 buf_size = 1024 * 1024 read_f = f.raw.read buf = read_f(buf_size) while buf: lines += buf.count(b'\n') buf = read_f(buf_size) num_sequences = lines - 1 # Print log log = OrderedDict() log['PASS'] = num_sequences log['OUTPUT'] = os.path.basename(out_file) log['END'] = 'AssignGenes' printLog(log) return out_file def getArgParser(): """ Defines the ArgumentParser Arguments: None Returns: an ArgumentParser object """ # Define output file names and header fields fields = dedent( ''' output files: igblast Reference alignment results from IgBLAST. ''') # Define ArgumentParser parser = ArgumentParser(description=__doc__, epilog=fields, formatter_class=CommonHelpFormatter, add_help=False) group_help = parser.add_argument_group('help') group_help.add_argument('--version', action='version', version='%(prog)s:' + ' %s %s' %(__version__, __date__)) group_help.add_argument('-h', '--help', action='help', help='show this help message and exit') subparsers = parser.add_subparsers(title='subcommands', dest='command', metavar='', help='Assignment operation') # TODO: This is a temporary fix for Python issue 9253 subparsers.required = True # Parent parser parent_parser = getCommonArgParser(db_in=False, log=False, failed=False, format=False, multiproc=True) # Subparser to run igblastn parser_igblast = subparsers.add_parser('igblast', parents=[parent_parser], formatter_class=CommonHelpFormatter, add_help=False, help='Executes igblastn.', description='Executes igblastn.') group_igblast = parser_igblast.add_argument_group('alignment arguments') group_igblast.add_argument('-s', nargs='+', action='store', dest='seq_files', required=True, help='A list of FASTA files containing sequences to process.') group_igblast.add_argument('-b', action='store', dest='igdata', required=True, help='IgBLAST database directory (IGDATA).') group_igblast.add_argument('--organism', action='store', dest='organism', default=default_organism, choices=choices_organism, help='Organism 
name.') group_igblast.add_argument('--loci', action='store', dest='loci', default=default_loci, choices=choices_loci, help='The receptor type.') group_igblast.add_argument('--vdb', action='store', dest='vdb', default=None, help='''Name of the custom V reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt___v will be used.''') group_igblast.add_argument('--ddb', action='store', dest='ddb', default=None, help='''Name of the custom D reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt___d will be used.''') group_igblast.add_argument('--jdb', action='store', dest='jdb', default=None, help='''Name of the custom J reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt___j will be used.''') group_igblast.add_argument('--cdb', action='store', dest='cdb', default=None, help='''Name of the custom C reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt___c will be used. Note, this argument will be ignored for IgBLAST versions below 1.18.0.''') group_igblast.add_argument('--format', action='store', dest='format', default=default_format, choices=choices_format, help='''Specify the output format. The "blast" will result in the IgBLAST "-outfmt 7 std qseq sseq btop" output format. 
Specifying "airr" will output the AIRR TSV format provided by the IgBLAST argument "-outfmt 19".''') group_igblast.add_argument('--exec', action='store', dest='igblast_exec', default=default_igblastn_exec, help='Path to the igblastn executable.') parser_igblast.set_defaults(func=assignIgBLAST, amino_acid=False) # Subparser to run igblastp parser_igblast_aa = subparsers.add_parser('igblast-aa', parents=[parent_parser], formatter_class=CommonHelpFormatter, add_help=False, help='Executes igblastp.', description='Executes igblastp.') group_igblast_aa = parser_igblast_aa.add_argument_group('alignment arguments') group_igblast_aa.add_argument('-s', nargs='+', action='store', dest='seq_files', required=True, help='A list of FASTA files containing sequences to process.') group_igblast_aa.add_argument('-b', action='store', dest='igdata', required=True, help='IgBLAST database directory (IGDATA).') group_igblast_aa.add_argument('--organism', action='store', dest='organism', default=default_organism, choices=choices_organism, help='Organism name.') group_igblast_aa.add_argument('--loci', action='store', dest='loci', default=default_loci, choices=choices_loci, help='The receptor type.') group_igblast_aa.add_argument('--vdb', action='store', dest='vdb', default=None, help='''Name of the custom V reference in the IgBLAST database folder. 
                                  If not specified, then a default database name with the
                                  form imgt_aa___v will be used.''')
    group_igblast_aa.add_argument('--exec', action='store', dest='igblast_exec',
                                  default=default_igblastp_exec,
                                  help='Path to the igblastp executable.')
    parser_igblast_aa.set_defaults(func=assignIgBLAST, amino_acid=True, ddb=None, jdb=None, format='blast')

    return parser


if __name__ == '__main__':
    """
    Parses command line arguments and calls main function
    """
    # Parse arguments
    parser = getArgParser()
    checkArgs(parser)
    args = parser.parse_args()
    args_dict = parseCommonArgs(args)

    # Check if a valid IgBLAST executable was specified
    if not shutil.which(args_dict['igblast_exec']):
        parser.error('%s executable not found' % args_dict['igblast_exec'])

    # Clean arguments dictionary
    del args_dict['seq_files']
    if 'out_files' in args_dict:
        del args_dict['out_files']
    del args_dict['func']
    del args_dict['command']

    # Call main function for each input file
    for i, f in enumerate(args.__dict__['seq_files']):
        args_dict['seq_file'] = f
        args_dict['out_file'] = args.__dict__['out_files'][i] \
            if args.__dict__['out_files'] else None
        args.func(**args_dict)

changeo-1.3.0/bin/BuildTrees.py

#!/usr/bin/env python3
"""
Converts TSV files into IgPhyML input files
"""

# Info
__author__ = "Kenneth Hoehn"
from changeo import __version__, __date__

# Imports
import os
import random
import subprocess
import multiprocessing as mp
from argparse import ArgumentParser
from collections import OrderedDict
from textwrap import dedent
from time import time
from Bio.Seq import Seq
from functools import partial

# Presto and changeo imports
from presto.Defaults import default_out_args
from presto.IO import printLog, printMessage, printWarning, printError, printDebug
from changeo.Defaults import default_format
from changeo.IO import splitName, getDbFields,
    getFormatOperators, getOutputHandle, getOutputName
from changeo.Alignment import RegionDefinition
from changeo.Commandline import CommonHelpFormatter, checkArgs, getCommonArgParser, parseCommonArgs


def correctMidCodonStart(scodons, qi, debug):
    """
    Find and mask split codons

    Arguments:
      scodons (list): list of codons in IMGT sequence.
      qi (str) : input sequence.
      debug (bool) : print debugging statements.

    Returns:
      tuple: (modified input sequence, modified starting position of IMGT sequence in input sequence).
    """
    spos = 0
    for i in range(0, len(scodons)):
        printDebug("%s %s" % (scodons[i], qi[0:3]), debug)
        if scodons[i] != "...":
            if scodons[i][0:2] == "..":
                scodons[i] = "NN" + scodons[i][2]
                # Sometimes IMGT will just cut off the first letter if it is a non-match, at which
                # point we'll just want to mask the first codon in the IMGT seq. Other times it will
                # be legitimately absent from the query, at which point we have to shift the frame.
                # This attempts to correct for this by looking at the next codon over in the alignment.
                if scodons[i][2:3] != qi[2:3] or scodons[i + 1] != qi[3:6]:
                    qi = "NN" + qi
                spos = i
                break
            elif scodons[i][0] == ".":
                scodons[i] = "N" + scodons[i][1:3]
                if scodons[i][1:3] != qi[1:3] or scodons[i+1] != qi[3:6]:
                    qi = "N" + qi
                spos = i
                break
            else:
                spos = i
                break

    return qi, spos


def checkFrameShifts(receptor, oqpos, ospos, log, debug):
    """
    Checks whether a frameshift occurred in a sequence

    Arguments:
      receptor (changeo.Receptor.Receptor): Receptor object.
      oqpos (int) : position of interest in input sequence.
      ospos (int) : position of interest in IMGT sequence.
      log (dict) : log of information for each sequence.
      debug (bool) : print debugging statements.
""" frameshifts = 0 for ins in range(1, 3): ros = receptor.sequence_input ris = receptor.sequence_imgt psite = receptor.v_seq_start - 1 + oqpos*3 pisite = ospos * 3 if (psite + 3 + ins) < len(ros) and (pisite + 3) < len(ris): #cut out 1 or 2 nucleotides downstream of offending codon receptor.sequence_input = ros[0:(psite + 3)] + ros[(psite + 3 + ins):] receptor.sequence_imgt = ris[0:(pisite + 3)] + ris[(pisite + 3):] # Debug sequence modifications printDebug(ros, debug) printDebug(receptor.sequence_input, debug) printDebug(ris, debug) printDebug(receptor.sequence_imgt, debug) printDebug("RUNNING %d" % ins, debug) mout = maskSplitCodons(receptor, recursive=True) if mout[1]["PASS"]: #if debug: receptor.sequence_input = ros receptor.sequence_imgt = ris frameshifts += 1 printDebug("FRAMESHIFT of length %d!" % ins, debug) log["FAIL"] = "SINGLE FRAME-SHIFTING INSERTION" break else: receptor.sequence_input = ros receptor.sequence_imgt = ris return frameshifts def findAndMask(receptor, scodons, qcodons, spos, s_end, qpos, log, debug, recursive=False): """ Find and mask split codons Arguments: receptor (changeo.Receptor.Receptor): Receptor object. scodons (list): list of codons in IMGT sequence qcodons (list): list of codons in input sequence spos (int): starting position of IMGT sequence in input sequence s_end (int): end of IMGT sequence qpos (int): starting position of input sequence in IMGT sequence log (dict): log of information for each sequence debug (bool): print debugging statements? recursive (bool): was this function called recursively? """ frameshifts = 0 while spos < s_end and qpos < len(qcodons): if debug: print(scodons[spos] + "\t" + qcodons[qpos]) if scodons[spos] == "..." 
                and qcodons[qpos] != "...":  # if IMGT gap, move forward in IMGT
            spos += 1
        elif scodons[spos] == qcodons[qpos]:  # if both are the same, move both forward
            spos += 1
            qpos += 1
        elif qcodons[qpos] == "N":  # possible that SEQ-IMGT ends on a bunch of Ns
            qpos += 1
            spos += 1
        else:
            # If not the same, mask IMGT at that site and scan forward until a codon
            # that matches the next site is found
            if debug:
                print("checking %s at position %d %d" % (scodons[spos], spos, qpos))
            ospos = spos
            oqpos = qpos
            spos += 1
            qpos += 1
            while spos < s_end and scodons[spos] == "...":  # possible next codon is just a gap
                spos += 1
            while qpos < len(qcodons) and spos < s_end and scodons[spos] != qcodons[qpos]:
                printDebug("Checking " + scodons[spos] + "\t" + qcodons[qpos], debug)
                qpos += 1
            if qcodons[qpos - 1] == scodons[ospos]:  # if the codon in the previous position equals the original codon, it was preserved
                qpos -= 1
                spos = ospos
                printDebug("But codon was apparently preserved", debug)
                if "IN-FRAME" in log:
                    log["IN-FRAME"] = log["IN-FRAME"] + "," + str(spos)
                else:
                    log["IN-FRAME"] = str(spos)
            elif qpos >= len(qcodons) and spos < s_end:
                printDebug("FAILING MATCH", debug)
                log["PASS"] = False  # if no match for the adjacent codon was found, something's up
                log["FAIL"] = "FAILED_MATCH_QSTRING:" + str(spos)
                # Figure out whether this was due to a frameshift by repeating this
                # method with an edited input sequence
                if not recursive:
                    frameshifts += checkFrameShifts(receptor, oqpos, ospos, log, debug)
            elif spos >= s_end or qcodons[qpos] != scodons[spos]:
                scodons[ospos] = "NNN"
                if spos >= s_end:
                    printDebug("Masked %s at position %d, at end of subject sequence" % (scodons[ospos], ospos), debug)
                    if "END-MASKED" in log:
                        log["END-MASKED"] = log["END-MASKED"] + "," + str(spos)
                    else:
                        log["END-MASKED"] = str(spos)
                else:
                    printDebug("Masked %s at position %d, but couldn't find upstream match" % (scodons[ospos], ospos), debug)
                    log["PASS"] = False
                    log["FAIL"] = "FAILED_MATCH:" + str(spos)
            elif qcodons[qpos] == scodons[spos]:
                printDebug("Masked %s at position %d" % (scodons[ospos], ospos), debug)
                scodons[ospos] = "NNN"
                if "MASKED" in log:
                    log["MASKED"] = log["MASKED"] + "," + str(spos)
                else:
                    log["MASKED"] = str(spos)
            else:
                log["PASS"] = False
                log["FAIL"] = "UNKNOWN"


def maskSplitCodons(receptor, recursive=False, mask=True):
    """
    Identify and mask codons split by IMGT gaps

    Arguments:
      receptor (changeo.Receptor.Receptor): Receptor object.
      recursive (bool) : was this method part of a recursive call?
      mask (bool) : mask split codons for use with IgPhyML?

    Returns:
      str: modified IMGT gapped sequence.
      log: dict of sequence information
    """
    debug = False
    qi = receptor.sequence_input
    si = receptor.sequence_imgt
    log = OrderedDict()
    log["ID"] = receptor.sequence_id
    log["CLONE"] = receptor.clone
    log["PASS"] = True
    if debug:
        print(receptor.sequence_id)

    # Adjust starting position of query sequence
    qi = qi[(receptor.v_seq_start - 1):]

    # Tally where --- gaps are in the IMGT sequence and remove them for now
    gaps = []
    ndotgaps = []
    nsi = ""
    for i in range(0, len(si)):
        if si[i] == "-":
            gaps.append(1)
            ndotgaps.append(1)
        else:
            gaps.append(0)
            nsi = nsi + si[i]
            if si[i] != ".":
                ndotgaps.append(0)

    # Find any gaps not divisible by three
    curgap = 0
    for i in ndotgaps:
        if i == 1:
            curgap += 1
        elif i == 0 and curgap != 0:
            if curgap % 3 != 0:
                printDebug("Frame-shifting gap detected! Refusing to include sequence.", debug)
                log["PASS"] = False
                log["FAIL"] = "FRAME-SHIFTING DELETION"
                log["SEQ_IN"] = receptor.sequence_input
                log["SEQ_IMGT"] = receptor.sequence_imgt
                log["SEQ_MASKED"] = receptor.sequence_imgt
                return receptor.sequence_imgt, log
            else:
                curgap = 0
    si = nsi

    scodons = [si[i:i + 3] for i in range(0, len(si), 3)]

    # Deal with the fact that it's possible to start mid-codon
    qi, spos = correctMidCodonStart(scodons, qi, debug)
    qcodons = [qi[i:i + 3] for i in range(0, len(qi), 3)]

    frameshifts = 0
    s_end = 0  # adjust for the fact that IMGT sequences can end on gaps
    for i in range(spos, len(scodons)):
        if scodons[i] != "..." and len(scodons[i]) == 3 and scodons[i] != "NNN":
            s_end = i

    printDebug("%i:%i:%s" % (s_end, len(scodons), scodons[s_end]), debug)
    s_end += 1
    qpos = 0

    if mask:
        findAndMask(receptor, scodons, qcodons, spos, s_end, qpos, log, debug, recursive)

    if not log["PASS"] and not recursive:
        log["FRAMESHIFTS"] = frameshifts

    if len(scodons[-1]) != 3:
        if scodons[-1] == ".." or scodons[-1] == ".":
            scodons[-1] = "..."
        else:
            scodons[-1] = "NNN"
            if "END-MASKED" in log:
                log["END-MASKED"] = log["END-MASKED"] + "," + str(len(scodons))
            else:
                log["END-MASKED"] = str(spos)

    concatenated_seq = Seq("")
    for i in scodons:
        concatenated_seq += i

    # Add --- gaps back into the IMGT sequence
    ncon_seq = ""
    counter = 0
    for i in gaps:
        if i == 1:
            ncon_seq = ncon_seq + "."
        elif i == 0:
            ncon_seq = ncon_seq + concatenated_seq[counter]
            counter += 1
    ncon_seq = ncon_seq + concatenated_seq[counter:]
    concatenated_seq = ncon_seq

    log["SEQ_IN"] = receptor.sequence_input
    log["SEQ_IMGT"] = receptor.sequence_imgt
    log["SEQ_MASKED"] = concatenated_seq

    return concatenated_seq, log


def unAmbigDist(seq1, seq2, fbreak=False):
    """
    Calculate the distance between two sequences counting only A, T, C, and G characters

    Arguments:
      seq1 (str): sequence 1.
      seq2 (str): sequence 2.
      fbreak (bool): break after first difference found?

    Returns:
      int: number of ACGT differences.
    """
    if len(seq1) != len(seq2):
        printError("Sequences are not the same length! %s %s" % (seq1, seq2))

    dist = 0
    for i in range(0, len(seq1)):
        if seq1[i] != "N" and seq1[i] != "-" and seq1[i] != ".":
            if seq2[i] != "N" and seq2[i] != "-" and seq2[i] != ".":
                if seq1[i] != seq2[i]:
                    dist += 1
                    if fbreak:
                        break

    return dist


def deduplicate(useqs, receptors, log=None, meta_data=None, delim=":"):
    """
    Collapses identical sequences

    Arguments:
      useqs (dict): unique sequences within a clone. Maps sequence to index in Receptor list.
      receptors (dict): receptors within a clone (index is value in useqs dict).
      log (collections.OrderedDict): log of sequence errors.
      meta_data (str): field to append to sequence IDs. Splits identical sequences with different meta_data.
      delim (str): delimiter to use when appending meta_data.

    Returns:
      dict: deduplicated sequences mapped to receptor indices within a clone.
    """
    keys = list(useqs.keys())
    join = {}  # id -> sequence id to join with (least ambiguous chars)
    joinseqs = {}  # id -> useq to join with (least ambiguous chars)
    ambigchar = {}  # sequence id -> number of ATCG nucleotides
    for i in range(0, len(keys) - 1):
        for j in range(i + 1, len(keys)):
            ki = keys[i]
            kj = keys[j]
            if meta_data is None:
                ski = keys[i]
                skj = keys[j]
            else:
                ski, cid = keys[i].split(delim)
                skj, cid = keys[j].split(delim)
            ri = receptors[useqs[ki]]
            rj = receptors[useqs[kj]]
            dist = unAmbigDist(ski, skj, True)
            m_match = True
            if meta_data is not None:
                matches = 0
                for m in meta_data:
                    if ri.getField(m) == rj.getField(m) and m != "DUPCOUNT":
                        matches += 1
                m_match = (matches == len(meta_data))
            if dist == 0 and m_match:
                ncounti = ki.count("A") + ki.count("T") + ki.count("G") + ki.count("C")
                ncountj = kj.count("A") + kj.count("T") + kj.count("G") + kj.count("C")
                ambigchar[useqs[ki]] = ncounti
                ambigchar[useqs[kj]] = ncountj
                # This algorithm depends on the fact that all sequences are compared pairwise,
                # and all are zero distance from the sequence they will be collapsed to.
                if ncountj > ncounti:
                    nci = 0
                    if useqs[ki] in join:
                        nci = ambigchar[join[useqs[ki]]]
                    if nci < ncountj:
                        join[useqs[ki]] = useqs[kj]
                        joinseqs[ki] = kj
                else:
                    ncj = 0
                    if useqs[kj] in join:
                        ncj = ambigchar[join[useqs[kj]]]
                    if ncj < ncounti:
                        join[useqs[kj]] = useqs[ki]
                        joinseqs[kj] = ki

    # Loop through list of joined sequences and collapse
    keys = list(useqs.keys())
    for k in keys:
        if useqs[k] in join:
            rfrom = receptors[useqs[k]]
            rto = receptors[join[useqs[k]]]
            rto.dupcount += rfrom.dupcount
            if log is not None:
                log[rfrom.sequence_id]["PASS"] = False
                log[rfrom.sequence_id]["DUPLICATE"] = True
                log[rfrom.sequence_id]["COLLAPSETO"] = joinseqs[k]
                log[rfrom.sequence_id]["COLLAPSEFROM"] = k
                log[rfrom.sequence_id]["FAIL"] = "Collapsed with " + rto.sequence_id
            del useqs[k]

    return useqs


def hasPTC(sequence):
    """
    Determines whether a premature termination codon (PTC) exists in a sequence

    Arguments:
      sequence (str): IMGT gapped sequence in frame 1.
    Returns:
      int: position of the first PTC if found; negative if no PTCs.
    """
    ptcs = ("TAA", "TGA", "TAG", "TRA", "TRG", "TAR", "TGR", "TRR")
    for i in range(0, len(sequence), 3):
        if sequence[i:(i + 3)] in ptcs:
            return i

    return -1


def rmCDR3(sequences, clones):
    """
    Remove CDR3 from all sequences and germline of a clone

    Arguments:
      sequences (list): list of sequences in clones.
      clones (list): list of Receptor objects.
    """
    for i in range(0, len(sequences)):
        imgtar = clones[i].getField("imgtpartlabels")
        germline = clones[i].getField("germline_imgt_d_mask")
        nseq = []
        nimgtar = []
        ngermline = []
        ncdr3 = 0
        for j in range(0, len(imgtar)):
            if imgtar[j] != 108:
                nseq.append(sequences[i][j])
                if j < len(germline):
                    ngermline.append(germline[j])
                nimgtar.append(imgtar[j])
            else:
                ncdr3 += 1
        clones[i].setField("imgtpartlabels", nimgtar)
        clones[i].setField("germline_imgt_d_mask", "".join(ngermline))
        sequences[i] = "".join(nseq)


def characterizePartitionErrors(sequences, clones, meta_data):
    """
    Characterize potential mismatches between IMGT labels within a clone

    Arguments:
      sequences (list): list of sequences in clones.
      clones (list): list of Receptor objects.
      meta_data (str): field to append to sequence IDs. Splits identical sequences with different meta_data.

    Returns:
      tuple: tuple of length four containing a list of IMGT positions for the first sequence in clones,
             the germline sequence of the first receptor in clones, the length of the first sequence
             in clones, and the number of sequences in clones.
    """
    sites = len(sequences[0])
    nseqs = len(sequences)
    imgtar = clones[0].getField("imgtpartlabels")
    germline = clones[0].getField("germline_imgt_d_mask")
    if germline == "":
        germline = clones[0].getField("germline_imgt")

    correctseqs = False
    for seqi in range(0, len(sequences)):
        i = sequences[seqi]
        if len(i) != sites or len(clones[seqi].getField("imgtpartlabels")) != len(imgtar):
            correctseqs = True

    if correctseqs:
        maxlen = sites
        maximgt = len(imgtar)
        for j in range(0, len(sequences)):
            if len(sequences[j]) > maxlen:
                maxlen = len(sequences[j])
            if len(clones[j].getField("imgtpartlabels")) > maximgt:
                imgtar = clones[j].getField("imgtpartlabels")
                maximgt = len(imgtar)
        sites = maxlen
        for j in range(0, len(sequences)):
            cimgt = clones[j].getField("imgtpartlabels")
            seqdiff = maxlen - len(sequences[j])
            imgtdiff = len(imgtar) - len(cimgt)
            sequences[j] = sequences[j] + "N"*(seqdiff)
            last = cimgt[-1]
            cimgt.extend([last]*(imgtdiff))
            clones[j].setField("imgtpartlabels", cimgt)

    if meta_data is not None:
        meta_data_ar = meta_data[0].split(",")
    for c in clones:
        if meta_data is not None:
            c.setField(meta_data[0], c.getField(meta_data_ar[0]))
            for m in range(1, len(meta_data_ar)):
                st = c.getField(meta_data[0]) + ":" + c.getField(meta_data_ar[m])
                c.setField(meta_data[0], st)
        if len(c.getField("imgtpartlabels")) != len(imgtar):
            printError("IMGT assignments are not the same within clone %d!\n" % c.clone, False)
            printError(c.getField("imgtpartlabels"), False)
            printError("%s\n%d\n" % (imgtar, j), False)
            for j in range(0, len(sequences)):
                printError("%s\n%s\n" % (sequences[j], clones[j].getField("imgtpartlabels")), False)
            printError("ChangeO file needs to be corrected")
        for j in range(0, len(imgtar)):
            if c.getField("imgtpartlabels")[j] != imgtar[j]:
                printError("IMGT assignments are not the same within clone %d!\n" % c.clone, False)
                printError(c.getField("imgtpartlabels"), False)
                printError("%s\n%d\n" % (imgtar, j))

    # Resolve germline if there are differences, e.g.
    # if reconstruction was done before clonal clustering
    resolveglines = False
    for c in clones:
        ngermline = c.getField("germline_imgt_d_mask")
        if ngermline == "":
            ngermline = c.getField("germline_imgt")
        if ngermline != germline:
            resolveglines = True

    if resolveglines:
        printError("%s %s" % ("Predicted germlines are not the same among sequences in the same clone.",
                              "Be sure to cluster sequences into clones first and then predict germlines using --cloned"))

    if sites > (len(germline)):
        seqdiff = sites - len(germline)
        germline = germline + "N" * (seqdiff)

    if sites % 3 != 0:
        printError("number of sites must be divisible by 3! len: %d, clone: %s , id: %s, seq: %s" %
                   (len(sequences[0]), clones[0].clone, clones[0].sequence_id, sequences[0]))

    return imgtar, germline, sites, nseqs


def outputSeqPartFiles(out_dir, useqs_f, meta_data, clones, collapse, nseqs,
                       delim, newgerm, conseqs, duplicate, imgt):
    """
    Create intermediate sequence alignment and partition files for IgPhyML output

    Arguments:
      out_dir (str): directory for sequence files.
      useqs_f (dict): unique sequences mapped to ids.
      meta_data (str): field to append to sequence IDs. Splits identical sequences with different meta_data.
      clones (list) : list of receptor objects.
      collapse (bool) : deduplicate sequences.
      nseqs (int): number of sequences.
      delim (str) : delimiter for extracting metadata from ID.
      newgerm (str) : modified germline of clonal lineage.
      conseqs (list) : consensus sequences.
      duplicate (bool) : duplicate sequence if only one in a clone.
      imgt (list) : IMGT numbering of clonal positions.
    """
    # Bootstrap these data if desired
    lg = len(newgerm)
    sites = range(0, lg)

    transtable = clones[0].sequence_id.maketrans(" ", "_")
    outfile = os.path.join(out_dir, "%s.fasta" % clones[0].clone)
    with open(outfile, "w") as clonef:
        if collapse:
            for seq_f, num in useqs_f.items():
                seq = seq_f
                cid = ""
                if meta_data is not None:
                    seq, cid = seq_f.split(delim)
                    cid = delim + cid.replace(":", "_")
                sid = clones[num].sequence_id.translate(transtable) + cid
                clonef.write(">%s\n%s\n" % (sid.replace(":", "-"), seq.replace(".", "-")))
                if len(useqs_f) == 1 and duplicate:
                    if meta_data is not None:
                        if meta_data[0] == "DUPCOUNT":
                            cid = delim + "0"
                    sid = clones[num].sequence_id.translate(transtable) + "_1" + cid
                    clonef.write(">%s\n%s\n" % (sid.replace(":", "-"), seq.replace(".", "-")))
        else:
            for j in range(0, nseqs):
                cid = ""
                if meta_data is not None:
                    meta_data_list = []
                    for m in meta_data:
                        meta_data_list.append(clones[j].getField(m).replace(":", "_"))
                    cid = delim + str(delim.join(meta_data_list))
                sid = clones[j].sequence_id.translate(transtable) + cid
                clonef.write(">%s\n%s\n" % (sid.replace(":", "-"), conseqs[j].replace(".", "-")))
                if nseqs == 1 and duplicate:
                    if meta_data is not None:
                        if meta_data[0] == "DUPCOUNT":
                            cid = delim + "0"
                    sid = clones[j].sequence_id.translate(transtable) + "_1" + cid
                    clonef.write(">%s\n%s\n" % (sid.replace(":", "-"), conseqs[j].replace(".", "-")))

        germ_id = ["GERM"]
        if meta_data is not None:
            for i in range(1, len(meta_data)):
                germ_id.append("GERM")
        clonef.write(">%s_%s\n" % (clones[0].clone, "_".join(germ_id)))
        for i in range(0, len(newgerm)):
            clonef.write("%s" % newgerm[i].replace(".", "-"))
        clonef.write("\n")

    # Output partition file
    partfile = os.path.join(out_dir, "%s.part.txt" % clones[0].clone)
    with open(partfile, "w") as partf:
        partf.write("%d %d\n" % (2, len(newgerm)))
        partf.write("FWR:IMGT\n")
        partf.write("CDR:IMGT\n")
        partf.write("%s\n" % (clones[0].v_call.split("*")[0]))
        partf.write("%s\n" % (clones[0].j_call.split("*")[0]))
        partf.write(",".join(map(str,
                                 imgt)))
        partf.write("\n")


def outputIgPhyML(clones, sequences, meta_data=None, collapse=False, ncdr3=False, logs=None,
                  fail_writer=None, out_dir=None, min_seq=1):
    """
    Create intermediate sequence alignment and partition files for IgPhyML output

    Arguments:
      clones (list): receptor objects within the same clone.
      sequences (list): sequences within the same clone (share indexes with clones parameter).
      meta_data (str): field to append to sequence IDs. Splits identical sequences with different meta_data.
      collapse (bool): if True, collapse identical sequences.
      ncdr3 (bool): if True, remove CDR3.
      logs (dict): contains log information for each sequence.
      out_dir (str): directory for output files.
      fail_writer (changeo.IO.TSVWriter): failed sequences writer object.
      min_seq (int): minimum number of data sequences to include.

    Returns:
      int: number of clones.
    """
    s = ""
    delim = "_"
    duplicate = True  # duplicate sequences in clones with only 1 sequence?
    imgtar, germline, sites, nseqs = characterizePartitionErrors(sequences, clones, meta_data)

    tallies = []
    for i in range(0, sites, 3):
        tally = 0
        for j in range(0, nseqs):
            if sequences[j][i:(i + 3)] != "...":
                tally += 1
        tallies.append(tally)

    newseqs = []  # remove gap-only sites from observed data
    newgerm = []
    imgt = []
    for j in range(0, nseqs):
        for i in range(0, sites, 3):
            if i == 0:
                newseqs.append([])
            if tallies[i//3] > 0:
                newseqs[j].append(sequences[j][i:(i + 3)])
    lcodon = ""
    for i in range(0, sites, 3):
        if tallies[i//3] > 0:
            newgerm.append(germline[i:(i + 3)])
            lcodon = germline[i:(i + 3)]
            imgt.append(imgtar[i])

    if len(lcodon) == 2:
        newgerm[-1] = newgerm[-1] + "N"
    elif len(lcodon) == 1:
        newgerm[-1] = newgerm[-1] + "NN"

    if ncdr3:
        ngerm = []
        nimgt = []
        for i in range(0, len(newseqs)):
            nseq = []
            ncdr3 = 0
            for j in range(0, len(imgt)):
                if imgt[j] != 108:
                    nseq.append(newseqs[i][j])
                    if i == 0:
                        ngerm.append(newgerm[j])
                        nimgt.append(imgt[j])
                else:
                    ncdr3 += 1
            newseqs[i] = nseq
        newgerm = ngerm
        imgt = nimgt

    useqs_f = \
        OrderedDict()
    conseqs = []
    for j in range(0, nseqs):
        conseq = "".join([str(seq_rec) for seq_rec in newseqs[j]])
        if meta_data is not None:
            meta_data_list = []
            for m in range(0, len(meta_data)):
                if isinstance(clones[j].getField(meta_data[m]), str):
                    clones[j].setField(meta_data[m], clones[j].getField(meta_data[m]).replace("_", ""))
                meta_data_list.append(str(clones[j].getField(meta_data[m])))
            conseq_f = "".join([str(seq_rec) for seq_rec in newseqs[j]]) + delim + ":".join(meta_data_list)
        else:
            conseq_f = conseq
        if conseq_f in useqs_f and collapse:
            clones[useqs_f[conseq_f]].dupcount += clones[j].dupcount
            logs[clones[j].sequence_id]["PASS"] = False
            logs[clones[j].sequence_id]["FAIL"] = "Duplication of " + clones[useqs_f[conseq_f]].sequence_id
            logs[clones[j].sequence_id]["DUPLICATE"] = True
            if fail_writer is not None:
                fail_writer.writeReceptor(clones[j])
        else:
            useqs_f[conseq_f] = j
        conseqs.append(conseq)

    if collapse:
        useqs_f = deduplicate(useqs_f, clones, logs, meta_data, delim)

    if collapse and len(useqs_f) < min_seq:
        for seq_f, num in useqs_f.items():
            logs[clones[num].sequence_id]["FAIL"] = "Clone too small: " + str(len(useqs_f))
            logs[clones[num].sequence_id]["PASS"] = False
        return -len(useqs_f)
    elif not collapse and len(conseqs) < min_seq:
        for j in range(0, nseqs):
            logs[clones[j].sequence_id]["FAIL"] = "Clone too small: " + str(len(conseqs))
            logs[clones[j].sequence_id]["PASS"] = False
        return -len(conseqs)

    # Output fasta file of masked, concatenated sequences
    outputSeqPartFiles(out_dir, useqs_f, meta_data, clones, collapse, nseqs,
                       delim, newgerm, conseqs, duplicate, imgt)

    if collapse:
        return len(useqs_f)
    else:
        return nseqs


def maskCodonsLoop(r, clones, cloneseqs, logs, fails, out_args, fail_writer, mask=True):
    """
    Masks codons split by alignment to IMGT reference

    Arguments:
      r (changeo.Receptor.Receptor): receptor object for a particular sequence.
      clones (list): list of receptors.
      cloneseqs (list): list of masked clone sequences.
      logs (dict): contains log information for each sequence.
      fails (dict): counts of various sequence processing failures.
      out_args (dict): arguments for output preferences.
      fail_writer (changeo.IO.TSVWriter): failed sequences writer object.

    Returns:
      int: 0 if an error occurs or masking fails, 1 if masking succeeds.
    """
    if r.clone is None:
        printError("Cannot export datasets until sequences are clustered into clones.")
    if r.dupcount is None:
        r.dupcount = 1
    fails["rec_count"] += 1
    ptcs = hasPTC(r.sequence_imgt)
    gptcs = hasPTC(r.getField("germline_imgt_d_mask"))
    if gptcs >= 0:
        log = OrderedDict()
        log["ID"] = r.sequence_id
        log["CLONE"] = r.clone
        log["SEQ_IN"] = r.sequence_input
        log["SEQ_IMGT"] = r.sequence_imgt
        logs[r.sequence_id] = log
        logs[r.sequence_id]["PASS"] = False
        logs[r.sequence_id]["FAIL"] = "Germline PTC"
        fails["seq_fail"] += 1
        fails["germlineptc"] += 1
        return 0

    if r.functional and ptcs < 0:
        # If IMGT regions are provided, record their positions
        rd = RegionDefinition(r.junction_length, amino_acid=False)
        regions = rd.getRegions(r.sequence_imgt)
        if regions["cdr3_imgt"] != "" and regions["cdr3_imgt"] is not None:
            simgt = regions["fwr1_imgt"] + regions["cdr1_imgt"] + regions["fwr2_imgt"] + regions["cdr2_imgt"] + \
                    regions["fwr3_imgt"] + regions["cdr3_imgt"] + regions["fwr4_imgt"]
            if len(simgt) < len(r.sequence_imgt):
                r.fwr4_imgt = r.fwr4_imgt + ("." * (len(r.sequence_imgt) - len(simgt)))
                simgt = regions["fwr1_imgt"] + regions["cdr1_imgt"] + regions["fwr2_imgt"] + \
                        regions["cdr2_imgt"] + regions["fwr3_imgt"] + regions["cdr3_imgt"] + regions["fwr4_imgt"]
            imgtpartlabels = [13]*len(regions["fwr1_imgt"]) + [30]*len(regions["cdr1_imgt"]) + [45]*len(regions["fwr2_imgt"]) + \
                             [60]*len(regions["cdr2_imgt"]) + [80]*len(regions["fwr3_imgt"]) + [108]*len(regions["cdr3_imgt"]) + \
                             [120]*len(regions["fwr4_imgt"])
            r.setField("imgtpartlabels", imgtpartlabels)
            if len(r.getField("imgtpartlabels")) != \
                    len(r.sequence_imgt) or simgt != r.sequence_imgt:
                log = OrderedDict()
                log["ID"] = r.sequence_id
                log["CLONE"] = r.clone
                log["SEQ_IN"] = r.sequence_input
                log["SEQ_IMGT"] = r.sequence_imgt
                logs[r.sequence_id] = log
                logs[r.sequence_id]["PASS"] = False
                logs[r.sequence_id]["FAIL"] = "FWR/CDR error"
                logs[r.sequence_id]["FWRCDRSEQ"] = simgt
                fails["seq_fail"] += 1
                fails["region_fail"] += 1
                return 0
        elif regions["fwr3_imgt"] != "" and regions["fwr3_imgt"] is not None:
            simgt = regions["fwr1_imgt"] + regions["cdr1_imgt"] + regions["fwr2_imgt"] + regions["cdr2_imgt"] + \
                    regions["fwr3_imgt"]
            nseq = r.sequence_imgt[len(simgt):len(r.sequence_imgt)]
            if len(simgt) < len(r.sequence_imgt):
                simgt = regions["fwr1_imgt"] + regions["cdr1_imgt"] + regions["fwr2_imgt"] + \
                        regions["cdr2_imgt"] + regions["fwr3_imgt"] + nseq
            imgtpartlabels = [13] * len(regions["fwr1_imgt"]) + [30] * len(regions["cdr1_imgt"]) + \
                             [45] * len(regions["fwr2_imgt"]) + [60] * len(regions["cdr2_imgt"]) + \
                             [80] * len(regions["fwr3_imgt"]) + [108] * int(len(nseq))
            r.setField("imgtpartlabels", imgtpartlabels)
            if len(r.getField("imgtpartlabels")) != len(r.sequence_imgt) or simgt != r.sequence_imgt:
                log = OrderedDict()
                log["ID"] = r.sequence_id
                log["CLONE"] = r.clone
                log["SEQ_IN"] = r.sequence_input
                log["SEQ_IMGT"] = r.sequence_imgt
                logs[r.sequence_id] = log
                logs[r.sequence_id]["PASS"] = False
                logs[r.sequence_id]["FAIL"] = "FWR/CDR error"
                logs[r.sequence_id]["FWRCDRSEQ"] = simgt
                fails["seq_fail"] += 1
                fails["region_fail"] += 1
                return 0
        else:
            # IMGT FWR/CDR sequence columns not detected; cannot run the CDR/FWR
            # partitioned model on these data
            imgtpartlabels = [0] * len(r.sequence_imgt)
            r.setField("imgtpartlabels", imgtpartlabels)

        mout = maskSplitCodons(r, mask=mask)
        mask_seq = mout[0]
        ptcs = hasPTC(mask_seq)
        if ptcs >= 0:
            printWarning("Masked sequence suddenly has a PTC.. "
                         "%s\n" % r.sequence_id)
            mout[1]["PASS"] = False
            mout[1]["FAIL"] = "PTC_ADDED_FROM_MASKING"
        logs[mout[1]["ID"]] = mout[1]
        if mout[1]["PASS"]:
            if r.clone in clones:
                clones[r.clone].append(r)
                cloneseqs[r.clone].append(mask_seq)
            else:
                clones[r.clone] = [r]
                cloneseqs[r.clone] = [mask_seq]
            return 1
        else:
            if out_args["failed"]:
                fail_writer.writeReceptor(r)
            fails["seq_fail"] += 1
            fails["failreads"] += r.dupcount
            if mout[1]["FAIL"] == "FRAME-SHIFTING DELETION":
                fails["del_fail"] += 1
            elif mout[1]["FAIL"] == "SINGLE FRAME-SHIFTING INSERTION":
                fails["in_fail"] += 1
            else:
                fails["other_fail"] += 1
    else:
        log = OrderedDict()
        log["ID"] = r.sequence_id
        log["CLONE"] = r.clone
        log["PASS"] = False
        log["FAIL"] = "NONFUNCTIONAL/PTC"
        log["SEQ_IN"] = r.sequence_input
        logs[r.sequence_id] = log
        if out_args["failed"]:
            fail_writer.writeReceptor(r)
        fails["seq_fail"] += 1
        fails["nf_fail"] += 1

    return 0


# Run IgPhyML on outputted data
def runIgPhyML(outfile, igphyml_out, clone_dir, nproc=1, optimization="lr", omega="e,e",
               kappa="e", motifs="FCH", hotness="e,e,e,e,e,e", oformat="tab",
               nohlp=False, asr=-1, clean="none"):
    """
    Run IgPhyML on outputted data

    Arguments:
      outfile (str): output file name.
      igphyml_out (str): IgPhyML output file name.
      clone_dir (str): directory containing clone sequence and partition files.
      nproc (int): number of threads to parallelize IgPhyML across.
      optimization (str): optimize combination of topology (t), branch lengths (l), and parameters (r) in IgPhyML.
      omega (str): omega optimization in IgPhyML (--omega).
      kappa (str): kappa optimization in IgPhyML (-t).
      motifs (str): motifs to use in IgPhyML (--motifs).
      hotness (str): motif hotness in IgPhyML (--hotness).
      oformat (str): output format for IgPhyML (tab or txt).
      nohlp (bool): if True, only estimate GY94 trees and parameters.
      asr: cutoff passed to IgPhyML --ASRc; negative to disable.
      clean (str): delete intermediate files?
        (none, all)
    """
    osplit = outfile.split(".")
    outrep = ".".join(osplit[0:(len(osplit) - 1)]) + "_gy.tsv"
    gyout = outfile + "_igphyml_stats_gy.txt"

    gy_args = ["igphyml", "--repfile", outfile, "-m", "GY", "--run_id", "gy", "--outrep", outrep,
               "--threads", str(nproc), "--outname", gyout]

    hlp_args = ["igphyml", "--repfile", outrep, "-m", "HLP", "--run_id", "hlp", "--threads", str(nproc),
                "-o", optimization, "--omega", omega, "-t", kappa, "--motifs", motifs,
                "--hotness", hotness, "--oformat", oformat, "--outname", igphyml_out]

    if asr >= 0:
        hlp_args.append("--ASRc")
        hlp_args.append(str(asr))

    log = OrderedDict()
    log["START"] = "IgPhyML GY94 tree estimation"
    printLog(log)

    try:  # check for igphyml executable
        subprocess.check_output(["igphyml"])
    except Exception:
        printError("igphyml executable not found")

    try:  # get GY94 starting topologies
        p = subprocess.check_output(gy_args)
    except subprocess.CalledProcessError as e:
        print(" ".join(gy_args))
        print("error>", e.output, "<")
        printError("GY94 tree building in IgPhyML failed")

    log = OrderedDict()
    log["START"] = "IgPhyML HLP analysis"
    log["OPTIMIZE"] = optimization
    log["TS/TV"] = kappa
    log["wFWR,wCDR"] = omega
    log["MOTIFS"] = motifs
    log["HOTNESS"] = hotness
    log["NPROC"] = nproc
    printLog(log)

    if not nohlp:
        try:  # estimate HLP parameters/trees
            p = subprocess.check_output(hlp_args)
        except subprocess.CalledProcessError as e:
            print(" ".join(hlp_args))
            print("error>", e.output, "<")
            printError("HLP tree building failed")

        log = OrderedDict()
        log["OUTPUT"] = igphyml_out
        if oformat == "tab":
            igf = open(igphyml_out)
            names = igf.readline().split("\t")
            vals = igf.readline().split("\t")
            for i in range(3, len(names) - 1):
                log[names[i]] = round(float(vals[i]), 2)
        printLog(log)

    if clean != "none":
        log = OrderedDict()
        log["START"] = "CLEANING"
        log["SCOPE"] = clean
        printLog(log)
        todelete = open(outrep)
        for line in todelete:
            line = line.rstrip("\n")
            line = line.rstrip("\r")
            lsplit = line.split("\t")
            if len(lsplit) == 4:
                os.remove(lsplit[0])
                os.remove(lsplit[1])
                os.remove(lsplit[3])
        todelete.close()
        os.remove(outrep)
        os.remove(outfile)
        os.remove(gyout)
        cilog = outrep + "_igphyml_CIlog.txt_hlp"
        if os.path.isfile(cilog):
            os.remove(cilog)
        if oformat == "tab":
            os.rmdir(clone_dir)
        else:
            printWarning("Using --clean all with --oformat txt will not delete all tree file results.\n"
                         "You'll have to do that yourself.")

    log = OrderedDict()
    log["END"] = "IgPhyML analysis"
    printLog(log)


# Note: Collapse can give misleading dupcount information if some sequences have
# ambiguous characters at polymorphic sites
def buildTrees(db_file, meta_data=None, target_clones=None, collapse=False, ncdr3=False, nmask=False,
               sample_depth=-1, min_seq=1, append=None, igphyml=False, nproc=1, optimization="lr",
               omega="e,e", kappa="e", motifs="FCH", hotness="e,e,e,e,e,e", oformat="tab",
               clean="none", nohlp=False, asr=-1, format=default_format, out_args=default_out_args):
    """
    Masks codons split by alignment to IMGT reference, then produces input files for IgPhyML

    Arguments:
      db_file (str): input tab-delimited database file.
      meta_data (str): field to append to sequence IDs. Splits identical sequences with different meta_data.
      target_clones (str): list of clone IDs to analyze.
      collapse (bool): if True, collapse identical sequences.
      ncdr3 (bool): if True, remove all CDR3s.
      nmask (bool): if True, do not attempt to mask split codons.
      sample_depth (int): depth of subsampling before deduplication.
      min_seq (int): minimum number of sequences per clone.
      append (str): column name to append to sequence_id.
      igphyml (bool): if True, run IgPhyML on outputted data.
      nproc (int): number of threads to parallelize IgPhyML across.
      optimization (str): optimize combination of topology (t), branch lengths (l), and parameters (r) in IgPhyML.
      omega (str): omega optimization in IgPhyML (--omega).
      kappa (str): kappa optimization in IgPhyML (-t).
      motifs (str): motifs to use in IgPhyML (--motifs).
      hotness (str): motif hotness in IgPhyML (--hotness).
      oformat (str): output format for IgPhyML (tab or txt).
      clean (str): delete intermediate files? (none, all)
      nohlp (bool): if True, only estimate GY94 trees and parameters.
      format (str): input and output format.
      out_args (dict): arguments for output preferences.

    Returns:
      dict: dictionary of output pass and fail files.
    """
    # Print parameter info
    log = OrderedDict()
    log["START"] = "BuildTrees"
    log["FILE"] = os.path.basename(db_file)
    log["COLLAPSE"] = collapse
    printLog(log)

    # Open output files
    out_label = "lineages"

    pass_handle = getOutputHandle(db_file,
                                  out_label=out_label,
                                  out_dir=out_args["out_dir"],
                                  out_name=out_args["out_name"],
                                  out_type="tsv")

    igphyml_out = None
    if igphyml:
        igphyml_out = getOutputName(db_file, out_label="igphyml-pass",
                                    out_dir=out_args["out_dir"],
                                    out_name=out_args["out_name"],
                                    out_type=oformat)

    dir_name, __ = os.path.split(pass_handle.name)

    if out_args["out_name"] is None:
        __, clone_name, __ = splitName(db_file)
    else:
        clone_name = out_args["out_name"]
    if dir_name is None:
        clone_dir = clone_name
    else:
        clone_dir = os.path.join(dir_name, clone_name)
    if not os.path.exists(clone_dir):
        os.makedirs(clone_dir)

    # Format options
    try:
        reader, writer, __ = getFormatOperators(format)
    except ValueError:
        printError("Invalid format %s."
                   % format)
    out_fields = getDbFields(db_file, reader=reader)

    # Open input file
    handle = open(db_file, "r")
    records = reader(handle)

    fail_handle, fail_writer = None, None
    if out_args["failed"]:
        fail_handle = getOutputHandle(db_file,
                                      out_label="lineages-fail",
                                      out_dir=out_args["out_dir"],
                                      out_name=out_args["out_name"],
                                      out_type=out_args["out_type"])
        fail_writer = writer(fail_handle, fields=out_fields)

    cloneseqs = {}
    clones = {}
    logs = OrderedDict()
    fails = {"rec_count": 0, "seq_fail": 0, "nf_fail": 0, "del_fail": 0, "in_fail": 0,
             "minseq_fail": 0, "other_fail": 0, "region_fail": 0, "germlineptc": 0,
             "fdcount": 0, "totalreads": 0, "passreads": 0, "failreads": 0}

    # Mask codons split by indels
    start_time = time()
    printMessage("Correcting frames and indels of sequences", start_time=start_time, width=50)

    # Subsampling loop
    init_clone_sizes = {}
    big_enough = []
    all_records = []
    found_no_funct = False
    for r in records:
        if r.functional is None:
            r.functional = True
            if found_no_funct is False:
                printWarning("FUNCTIONAL column not found.")
                found_no_funct = True
        all_records.append(r)
        if r.clone in init_clone_sizes:
            init_clone_sizes[r.clone] += 1
        else:
            init_clone_sizes[r.clone] = 1

    for r in all_records:
        if target_clones is None or r.clone in target_clones:
            if init_clone_sizes[r.clone] >= min_seq:
                big_enough.append(r)

    fails["totalreads"] = len(all_records)
    if len(big_enough) == 0:
        printError("\n\nNo sequences found that match specified criteria.", 1)

    if sample_depth > 0:
        random.shuffle(big_enough)

    total = 0
    for r in big_enough:
        if r.functional is None:
            r.functional = True
            if found_no_funct is False:
                printWarning("FUNCTIONAL column not found.")
                found_no_funct = True

        # Remove commas, colons, and parentheses from the sequence ID
        r.sequence_id = r.sequence_id.replace(",", "-")
        r.sequence_id = r.sequence_id.replace(":", "-")
        r.sequence_id = \
r.sequence_id.replace(")","-") #remove parenthesis from sequence ID r.sequence_id = r.sequence_id.replace("(","-") #remove parenthesis from sequence ID if(meta_data is not None): for m in range(0,len(meta_data)): md = r.getField(meta_data[m]) md = md.replace(",","-") #remove commas from metadata md = md.replace(":","-") #remove colons from metadata md = md.replace(",","-") #remove commas from metadata md = md.replace(")","-") #remove parenthesis from metadata md = md.replace("(","-") #remove parenthesis from metadata r.setField(meta_data[m],md) if append is not None: if append is not None: for m in append: r.sequence_id = r.sequence_id + "_" + r.getField(m) total += maskCodonsLoop(r, clones, cloneseqs, logs, fails, out_args, fail_writer, mask = not nmask) if total == sample_depth: break # Start processing clones clonesizes = {} pass_count, nclones = 0, 0 printMessage("Processing clones", start_time=start_time, width=50) for k in clones.keys(): if len(clones[str(k)]) < min_seq: for j in range(0, len(clones[str(k)])): logs[clones[str(k)][j].sequence_id]["FAIL"] = "Clone too small: " + str(len(cloneseqs[str(k)])) logs[clones[str(k)][j].sequence_id]["PASS"] = False clonesizes[str(k)] = -len(cloneseqs[str(k)]) else: clonesizes[str(k)] = outputIgPhyML(clones[str(k)], cloneseqs[str(k)], meta_data=meta_data, collapse=collapse, ncdr3=ncdr3, logs=logs, fail_writer=fail_writer, out_dir=clone_dir, min_seq=min_seq) #If clone is too small, size is returned as a negative if clonesizes[str(k)] > 0: nclones += 1 pass_count += clonesizes[str(k)] else: fails["seq_fail"] -= clonesizes[str(k)] fails["minseq_fail"] -= clonesizes[str(k)] fail_count = fails["rec_count"] - pass_count # End clone processing printMessage("Done", start_time=start_time, end=True, width=50) log_handle = None if out_args["log_file"] is not None: log_handle = open(out_args["log_file"], "w") for j in logs.keys(): printLog(logs[j], handle=log_handle) pass_handle.write(str(nclones)+"\n") for key in 
sorted(clonesizes, key=clonesizes.get, reverse=True): #print(key + "\t" + str(clonesizes[key])) outfile = os.path.join(clone_dir, "%s.fasta" % key) partfile = os.path.join(clone_dir, "%s.part.txt" % key) if clonesizes[key] > 0: germ_id = ["GERM"] if meta_data is not None: for i in range(1, len(meta_data)): germ_id.append("GERM") pass_handle.write("%s\t%s\t%s_%s\t%s\n" % (outfile, "N", key,"_".join(germ_id), partfile)) handle.close() output = {"pass": None, "fail": None} if pass_handle is not None: output["pass"] = pass_handle.name pass_handle.close() if fail_handle is not None: output["fail"] = fail_handle.name fail_handle.close() if log_handle is not None: log_handle.close() #printProgress(rec_count, rec_count, 0.05, start_time) log = OrderedDict() log["OUTPUT"] = os.path.basename(pass_handle.name) if pass_handle is not None else None log["RECORDS"] = fails["totalreads"] log["INITIAL_FILTER"] = fails["rec_count"] log["PASS"] = pass_count log["FAIL"] = fail_count log["NONFUNCTIONAL"] = fails["nf_fail"] log["FRAMESHIFT_DEL"] = fails["del_fail"] log["FRAMESHIFT_INS"] = fails["in_fail"] log["CLONETOOSMALL"] = fails["minseq_fail"] log["CDRFWR_ERROR"] = fails["region_fail"] log["GERMLINE_PTC"] = fails["germlineptc"] log["OTHER_FAIL"] = fails["other_fail"] if collapse: log["DUPLICATE"] = fail_count - fails["seq_fail"] log["END"] = "BuildTrees" printLog(log) #Run IgPhyML on outputted data? if igphyml: runIgPhyML(pass_handle.name, igphyml_out=igphyml_out, clone_dir=clone_dir, nproc=nproc, optimization=optimization, omega=omega, kappa=kappa, motifs=motifs, hotness=hotness, oformat=oformat, nohlp=nohlp,clean=clean,asr=asr) return output def getArgParser(): """ Defines the ArgumentParser Returns: argparse.ArgumentParser: argument parsers. """ # Define input and output field help message fields = dedent( """ output files: folder containing fasta and partition files for each clone. lineages successfully processed records. lineages-fail database records failed processing. 
                 igphyml-pass
                     parameter estimates and lineage trees from running IgPhyML, if specified.

             required fields:
                 sequence_id, sequence, sequence_alignment,
                 germline_alignment_d_mask or germline_alignment,
                 v_call, j_call, clone_id, v_sequence_start
             """)

    # Parent parser
    parser_parent = getCommonArgParser(out_file=False, log=True, format=True)

    # Define argument parser
    parser = ArgumentParser(description=__doc__, epilog=fields,
                            parents=[parser_parent],
                            formatter_class=CommonHelpFormatter, add_help=False)

    group = parser.add_argument_group("sequence processing arguments")
    group.add_argument("--collapse", action="store_true", dest="collapse",
                       help="""If specified, collapse identical sequences before exporting to fasta.""")
    group.add_argument("--ncdr3", action="store_true", dest="ncdr3",
                       help="""If specified, remove CDR3 from all sequences.""")
    group.add_argument("--nmask", action="store_true", dest="nmask",
                       help="""If specified, do not attempt to mask split codons.""")
    group.add_argument("--md", nargs="+", action="store", dest="meta_data",
                       help="""List of fields containing metadata to include in output
                            fasta file sequence headers.""")
    group.add_argument("--clones", nargs="+", action="store", dest="target_clones",
                       help="""List of clone IDs to output, if specified.""")
    group.add_argument("--minseq", action="store", dest="min_seq", type=int, default=1,
                       help="""Minimum number of data sequences. Any clones with fewer
                            than the specified number of sequences will be excluded.""")
    group.add_argument("--sample", action="store", dest="sample_depth", type=int, default=-1,
                       help="""Depth of reads to be subsampled (before deduplication).""")
    group.add_argument("--append", nargs="+", action="store", dest="append",
                       help="""List of columns to append to sequence ID to ensure uniqueness.""")

    igphyml_group = parser.add_argument_group("IgPhyML arguments (see igphyml -h for details)")
    igphyml_group.add_argument("--igphyml", action="store_true", dest="igphyml",
                               help="""Run IgPhyML on output?""")
    igphyml_group.add_argument("--nproc", action="store", dest="nproc", type=int, default=1,
                               help="""Number of threads to parallelize IgPhyML across.""")
    igphyml_group.add_argument("--clean", action="store", choices=("none", "all"),
                               dest="clean", type=str, default="none",
                               help="""Delete intermediate files?
                                    none: leave all intermediate files;
                                    all: delete all intermediate files.""")
    igphyml_group.add_argument("--optimize", action="store", dest="optimization",
                               type=str, default="lr",
                               choices=("n", "r", "l", "lr", "tl", "tlr"),
                               help="""Optimize combination of topology (t), branch lengths (l),
                                    and parameters (r), or nothing (n), for IgPhyML.""")
    igphyml_group.add_argument("--omega", action="store", dest="omega", type=str, default="e,e",
                               help="""Omega parameters to estimate for FWR,CDR respectively:
                                    e = estimate, ce = estimate + confidence interval,
                                    or numeric value.""")
    igphyml_group.add_argument("-t", action="store", dest="kappa", type=str, default="e",
                               help="""Kappa parameters to estimate:
                                    e = estimate, ce = estimate + confidence interval,
                                    or numeric value.""")
    igphyml_group.add_argument("--motifs", action="store", dest="motifs", type=str,
                               default="WRC_2:0,GYW_0:1,WA_1:2,TW_0:3,SYC_2:4,GRS_0:5",
                               help="""Which motifs to estimate mutability for.""")
    igphyml_group.add_argument("--hotness", action="store", dest="hotness", type=str,
                               default="e,e,e,e,e,e",
                               help="""Mutability parameters to estimate:
                                    e = estimate, ce = estimate + confidence interval,
                                    or numeric value.""")
    igphyml_group.add_argument("--oformat", action="store", dest="oformat", type=str,
                               default="tab", choices=("tab", "txt"),
                               help="""IgPhyML output format.""")
    igphyml_group.add_argument("--nohlp", action="store_true", dest="nohlp",
                               help="""Don't run HLP model?""")
    igphyml_group.add_argument("--asr", action="store", dest="asr", type=float, default=-1,
                               help="""Ancestral sequence reconstruction interval (0-1).""")

    return parser


if __name__ == "__main__":
    """
    Parses command line arguments and calls main
    """
    # Parse command line arguments
    parser = getArgParser()
    checkArgs(parser)
    args = parser.parse_args()
    args_dict = parseCommonArgs(args)
    del args_dict["db_files"]

    # Call main for each input file
    for f in args.__dict__["db_files"]:
        args_dict["db_file"] = f
        buildTrees(**args_dict)

changeo-1.3.0/bin/ConvertDb.py

#!/usr/bin/env python3
"""
Parses tab delimited database files
"""

# Info
__author__ = 'Jason Anthony Vander Heiden'
from changeo import __version__, __date__

# Imports
import csv
import os
import re
import shutil
from argparse import ArgumentParser
from collections import OrderedDict
from itertools import chain
from textwrap import dedent
from time import time
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Presto and changeo imports
from presto.Annotation import flattenAnnotation
from presto.IO import printLog, printMessage, printProgress, printError, printWarning
from changeo.Applications import default_tbl2asn_exec, runASN
from changeo.Defaults import default_id_field, default_seq_field, default_germ_field, \
                             default_csv_size, default_format, default_out_args
from changeo.Commandline import CommonHelpFormatter, checkArgs, getCommonArgParser, parseCommonArgs
from changeo.Gene import getCGene
from \
changeo.IO import countDbFile, getFormatOperators, getOutputHandle, AIRRReader, AIRRWriter, \
    ChangeoReader, ChangeoWriter, TSVReader, ReceptorData, readGermlines, \
    checkFields, yamlDict
from changeo.Receptor import AIRRSchema, ChangeoSchema

# System settings
csv.field_size_limit(default_csv_size)

# Defaults
default_db_xref = 'IMGT/GENE-DB'
default_molecule = 'mRNA'
default_product = 'immunoglobulin heavy chain'
default_allele_delim = '*'


def buildSeqRecord(db_record, id_field, seq_field, meta_fields=None):
    """
    Parses a database record into a SeqRecord

    Arguments:
      db_record : a dictionary containing a database record.
      id_field : the field containing identifiers.
      seq_field : the field containing sequences.
      meta_fields : a list of fields to add to sequence annotations.

    Returns:
      Bio.SeqRecord.SeqRecord: record.
    """
    # Return None if ID or sequence fields are empty
    if not db_record[id_field] or not db_record[seq_field]:
        return None

    # Create description string
    desc_dict = OrderedDict([('ID', db_record[id_field])])
    if meta_fields is not None:
        desc_dict.update([(f, db_record[f]) for f in meta_fields if f in db_record])
    desc_str = flattenAnnotation(desc_dict)

    # Create SeqRecord
    seq_record = SeqRecord(Seq(db_record[seq_field]), id=desc_str,
                           name=desc_str, description='')

    return seq_record


def convertToAIRR(db_file, format=default_format, out_file=None, out_args=default_out_args):
    """
    Converts a Change-O formatted file into an AIRR formatted file

    Arguments:
      db_file : the database file name.
      format : input format.
      out_file : output file name. Automatically generated from the input file if None.
      out_args : common output argument dictionary from parseCommonArgs.

    Returns:
      str : output file name.
    """
    log = OrderedDict()
    log['START'] = 'ConvertDb'
    log['COMMAND'] = 'airr'
    log['FILE'] = os.path.basename(db_file)
    printLog(log)

    # Define format operators
    try:
        reader, __, schema = getFormatOperators(format)
    except ValueError:
        printError('Invalid format %s.'
                   % format)

    # Open input
    db_handle = open(db_file, 'rt')
    db_iter = reader(db_handle)

    # Set output fields replacing length with end fields
    in_fields = [schema.toReceptor(f) for f in db_iter.fields]
    out_fields = []
    for f in in_fields:
        if f in ReceptorData.length_fields and ReceptorData.length_fields[f][0] in in_fields:
            out_fields.append(ReceptorData.length_fields[f][1])
        out_fields.append(f)
    out_fields = list(OrderedDict.fromkeys(out_fields))
    out_fields = [AIRRSchema.fromReceptor(f) for f in out_fields]

    # Open output writer
    if out_file is not None:
        pass_handle = open(out_file, 'w')
    else:
        pass_handle = getOutputHandle(db_file, out_label='airr',
                                      out_dir=out_args['out_dir'],
                                      out_name=out_args['out_name'],
                                      out_type=AIRRSchema.out_type)
    pass_writer = AIRRWriter(pass_handle, fields=out_fields)

    # Count records
    result_count = countDbFile(db_file)

    # Iterate over records
    start_time = time()
    rec_count = 0
    for rec in db_iter:
        # Print progress for previous iteration
        printProgress(rec_count, result_count, 0.05, start_time=start_time)
        rec_count += 1
        # Write records
        pass_writer.writeReceptor(rec)

    # Print counts
    printProgress(rec_count, result_count, 0.05, start_time=start_time)
    log = OrderedDict()
    log['OUTPUT'] = os.path.basename(pass_handle.name)
    log['RECORDS'] = rec_count
    log['END'] = 'ConvertDb'
    printLog(log)

    # Close file handles
    pass_handle.close()
    db_handle.close()

    return pass_handle.name


def convertToChangeo(db_file, out_file=None, out_args=default_out_args):
    """
    Converts an AIRR formatted file into a Change-O formatted file

    Arguments:
      db_file: the database file name.
      out_file : output file name. Automatically generated from the input file if None.
      out_args : common output argument dictionary from parseCommonArgs.

    Returns:
      str : output file name.
    """
    log = OrderedDict()
    log['START'] = 'ConvertDb'
    log['COMMAND'] = 'changeo'
    log['FILE'] = os.path.basename(db_file)
    printLog(log)

    # Open input
    db_handle = open(db_file, 'rt')
    db_iter = AIRRReader(db_handle)

    # Set output fields replacing length with end fields
    in_fields = [AIRRSchema.toReceptor(f) for f in db_iter.fields]
    out_fields = []
    for f in in_fields:
        out_fields.append(f)
        if f in ReceptorData.end_fields and ReceptorData.end_fields[f][0] in in_fields:
            out_fields.append(ReceptorData.end_fields[f][1])
    out_fields = list(OrderedDict.fromkeys(out_fields))
    out_fields = [ChangeoSchema.fromReceptor(f) for f in out_fields]

    # Open output writer
    if out_file is not None:
        pass_handle = open(out_file, 'w')
    else:
        pass_handle = getOutputHandle(db_file, out_label='changeo',
                                      out_dir=out_args['out_dir'],
                                      out_name=out_args['out_name'],
                                      out_type=ChangeoSchema.out_type)
    pass_writer = ChangeoWriter(pass_handle, fields=out_fields)

    # Count records
    result_count = countDbFile(db_file)

    # Iterate over records
    start_time = time()
    rec_count = 0
    for rec in db_iter:
        # Print progress for previous iteration
        printProgress(rec_count, result_count, 0.05, start_time=start_time)
        rec_count += 1
        # Write records
        pass_writer.writeReceptor(rec)

    # Print counts
    printProgress(rec_count, result_count, 0.05, start_time=start_time)
    log = OrderedDict()
    log['OUTPUT'] = os.path.basename(pass_handle.name)
    log['RECORDS'] = rec_count
    log['END'] = 'ConvertDb'
    printLog(log)

    # Close file handles
    pass_handle.close()
    db_handle.close()

    return pass_handle.name


# TODO: SHOULD ALLOW FOR UNSORTED CLUSTER COLUMN
# TODO: SHOULD ALLOW FOR GROUPING FIELDS
def convertToBaseline(db_file, id_field=default_id_field, seq_field=default_seq_field,
                      germ_field=default_germ_field, cluster_field=None,
                      meta_fields=None, out_file=None, out_args=default_out_args):
    """
    Builds fasta files from database records

    Arguments:
      db_file : the database file name.
      id_field : the field containing identifiers.
      seq_field : the field containing sample sequences.
      germ_field : the field containing germline sequences.
      cluster_field : the field containing clonal groupings;
                      if None write the germline for each record.
      meta_fields : a list of fields to add to sequence annotations.
      out_file : output file name. Automatically generated from the input file if None.
      out_args : common output argument dictionary from parseCommonArgs.

    Returns:
      str : output file name.
    """
    log = OrderedDict()
    log['START'] = 'ConvertDb'
    log['COMMAND'] = 'fasta'
    log['FILE'] = os.path.basename(db_file)
    log['ID_FIELD'] = id_field
    log['SEQ_FIELD'] = seq_field
    log['GERM_FIELD'] = germ_field
    log['CLUSTER_FIELD'] = cluster_field
    if meta_fields is not None:
        log['META_FIELDS'] = ','.join(meta_fields)
    printLog(log)

    # Open input
    db_handle = open(db_file, 'rt')
    db_iter = TSVReader(db_handle)
    result_count = countDbFile(db_file)

    # Open output
    if out_file is not None:
        pass_handle = open(out_file, 'w')
    else:
        pass_handle = getOutputHandle(db_file, out_label='sequences',
                                      out_dir=out_args['out_dir'],
                                      out_name=out_args['out_name'],
                                      out_type='clip')

    # Iterate over records
    start_time = time()
    rec_count, germ_count, pass_count, fail_count = 0, 0, 0, 0
    cluster_last = None
    for rec in db_iter:
        # Print progress for previous iteration
        printProgress(rec_count, result_count, 0.05, start_time=start_time)
        rec_count += 1

        # Update cluster ID
        cluster = rec.get(cluster_field, None)

        # Get germline SeqRecord when needed
        if cluster_field is None:
            germ = buildSeqRecord(rec, id_field, germ_field, meta_fields)
            germ.id = '>' + germ.id
        elif cluster != cluster_last:
            germ = buildSeqRecord(rec, cluster_field, germ_field)
            germ.id = '>' + germ.id
        else:
            germ = None

        # Get read SeqRecord
        seq = buildSeqRecord(rec, id_field, seq_field, meta_fields)

        # Write germline
        if germ is not None:
            germ_count += 1
            SeqIO.write(germ, pass_handle, 'fasta')

        # Write sequences
        if seq is not None:
            pass_count += 1
            SeqIO.write(seq, pass_handle, 'fasta')
        else:
            fail_count += 1

        # Set last cluster ID
        cluster_last = cluster

    # Print counts
    printProgress(rec_count, result_count, 0.05, start_time=start_time)
    log = OrderedDict()
    log['OUTPUT'] = os.path.basename(pass_handle.name)
    log['RECORDS'] = rec_count
    log['GERMLINES'] = germ_count
    log['PASS'] = pass_count
    log['FAIL'] = fail_count
    log['END'] = 'ConvertDb'
    printLog(log)

    # Close file handles
    pass_handle.close()
    db_handle.close()

    return pass_handle.name


def convertToFasta(db_file, id_field=default_id_field, seq_field=default_seq_field,
                   meta_fields=None, out_file=None, out_args=default_out_args):
    """
    Builds fasta files from database records

    Arguments:
      db_file : the database file name.
      id_field : the field containing identifiers.
      seq_field : the field containing sequences.
      meta_fields : a list of fields to add to sequence annotations.
      out_file : output file name. Automatically generated from the input file if None.
      out_args : common output argument dictionary from parseCommonArgs.

    Returns:
      str : output file name.
    """
    log = OrderedDict()
    log['START'] = 'ConvertDb'
    log['COMMAND'] = 'fasta'
    log['FILE'] = os.path.basename(db_file)
    log['ID_FIELD'] = id_field
    log['SEQ_FIELD'] = seq_field
    if meta_fields is not None:
        log['META_FIELDS'] = ','.join(meta_fields)
    printLog(log)

    # Open input
    out_type = 'fasta'
    db_handle = open(db_file, 'rt')
    db_iter = TSVReader(db_handle)
    result_count = countDbFile(db_file)

    # Open output
    if out_file is not None:
        pass_handle = open(out_file, 'w')
    else:
        pass_handle = getOutputHandle(db_file, out_label='sequences',
                                      out_dir=out_args['out_dir'],
                                      out_name=out_args['out_name'],
                                      out_type=out_type)

    # Iterate over records
    start_time = time()
    rec_count, pass_count, fail_count = 0, 0, 0
    for rec in db_iter:
        # Print progress for previous iteration
        printProgress(rec_count, result_count, 0.05, start_time=start_time)
        rec_count += 1

        # Get SeqRecord
        seq = buildSeqRecord(rec, id_field, seq_field, meta_fields)

        # Write sequences
        if seq is not None:
            pass_count += 1
            SeqIO.write(seq, pass_handle, out_type)
        else:
            fail_count += 1

    # Print counts
    printProgress(rec_count, result_count, 0.05, start_time=start_time)
    log = OrderedDict()
    log['OUTPUT'] = os.path.basename(pass_handle.name)
    log['RECORDS'] = rec_count
    log['PASS'] = pass_count
    log['FAIL'] = fail_count
    log['END'] = 'ConvertDb'
    printLog(log)

    # Close file handles
    pass_handle.close()
    db_handle.close()

    return pass_handle.name


def makeGenbankFeatures(record, start=None, end=None, product=default_product,
                        inference=None, db_xref=None, c_field=None, allow_stop=False,
                        asis_calls=False, allele_delim=default_allele_delim):
    """
    Creates a feature table for GenBank submissions

    Arguments:
      record : Receptor record.
      start : start position of the modified sequence in the input sequence.
              Used for feature position offsets.
      end : end position of the modified sequence in the input sequence.
            Used for feature position offsets.
      product : Product (protein) name.
      inference : Reference alignment tool.
      db_xref : Reference database name.
      c_field : column containing the C region gene call.
      allow_stop : if True retain records with junctions having stop codons.
      asis_calls : if True do not parse gene calls for IMGT nomenclature.
      allele_delim : delimiter separating the gene name from the allele number
                     when asis_calls=True.

    Returns:
      dict : dictionary defining GenBank features where the key is a tuple
             (start, end, feature key) and values are a list of tuples
             containing (qualifier key, qualifier value).
""" # .tbl file format # Line 1, Column 1: Start location of feature # Line 1, Column 2: Stop location of feature # Line 1, Column 3: Feature key # Line 2, Column 4: Qualifier key # Line 2, Column 5: Qualifier value # Get genes and alleles c_gene = None if not asis_calls: # V gene v_gene = record.getVGene() v_allele = record.getVAlleleNumber() # D gene d_gene = record.getDGene() d_allele = record.getDAlleleNumber() # J gene j_gene = record.getJGene() j_allele = record.getJAlleleNumber() # C region if c_field is not None: c_gene = getCGene(record.getField(c_field), action='first') else: # V gene v_split = iter(record.v_call.rsplit(allele_delim, maxsplit=1)) v_gene = next(v_split, None) v_allele = next(v_split, None) # D gene d_split = iter(record.d_call.rsplit(allele_delim, maxsplit=1)) d_gene = next(d_split, None) d_allele = next(d_split, None) # J gene j_split = iter(record.j_call.rsplit(allele_delim, maxsplit=1)) j_gene = next(j_split, None) j_allele = next(j_split, None) # C region if c_field is not None: c_gene = record.getField(c_field) # Fail if V or J is missing if v_gene is None or j_gene is None: return None # Set position offsets if required start_trim = 0 if start is None else start end_trim = 0 if end is None else len(record.sequence_input) - end source_len = len(record.sequence_input) - end_trim # Define return object result = OrderedDict() # C_region # gene # db_xref # inference c_region_start = record.j_seq_end + 1 - start_trim c_region_length = len(record.sequence_input[(c_region_start + start_trim - 1):]) - end_trim if c_region_length > 0: if c_gene is not None: c_region = [('gene', c_gene)] if db_xref is not None: c_region.append(('db_xref', '%s:%s' % (db_xref, c_gene))) else: c_region = [] # Assign C_region feature c_region_end = c_region_start + c_region_length - 1 result[(c_region_start, '>%i' % c_region_end, 'C_region')] = c_region # Preserve J segment end position j_end = record.j_seq_end # Check for range error if c_region_end > source_len: 
return None else: # Trim J segment end position j_end = record.j_seq_end + c_region_length # V_region variable_start = max(record.v_seq_start - start_trim, 1) variable_end = j_end - start_trim result[(variable_start, variable_end, 'V_region')] = [] # Check for range error if variable_end > source_len: return None # Product feature result[(variable_start, variable_end, 'misc_feature')] = [('note', '%s variable region' % product)] # V_segment # gene (gene name) # allele (allele only, without gene name, don't use if ambiguous) # db_xref (database link) # inference (reference alignment tool) v_segment = [('gene', v_gene)] if v_allele is not None: v_segment.append(('allele', v_allele)) if db_xref is not None: v_segment.append(('db_xref', '%s:%s' % (db_xref, v_gene))) if inference is not None: v_segment.append(('inference', 'COORDINATES:alignment:%s' % inference)) result[(variable_start, record.v_seq_end - start_trim, 'V_segment')] = v_segment # D_segment # gene # allele # db_xref # inference if d_gene: d_segment = [('gene', d_gene)] if d_allele is not None: d_segment.append(('allele', d_allele)) if db_xref is not None: d_segment.append(('db_xref', '%s:%s' % (db_xref, d_gene))) if inference is not None: d_segment.append(('inference', 'COORDINATES:alignment:%s' % inference)) result[(record.d_seq_start - start_trim, record.d_seq_end - start_trim, 'D_segment')] = d_segment # J_segment # gene # allele # db_xref # inference j_segment = [('gene', j_gene)] if j_allele is not None: j_segment.append(('allele', j_allele)) if db_xref is not None: j_segment.append(('db_xref', '%s:%s' % (db_xref, j_gene))) if inference is not None: j_segment.append(('inference', 'COORDINATES:alignment:%s' % inference)) result[(record.j_seq_start - start_trim, j_end - start_trim, 'J_segment')] = j_segment # CDS # codon_start (must indicate codon offset) # function = JUNCTION # inference # print(record.v_germ_end_imgt, record.v_seq_end, record.v_germ_end_imgt - 310) # print(record.junction_start, 
record.junction_end, record.junction_length) if record.junction_start is not None and record.junction_end is not None: # Define junction boundaries junction_start = record.junction_start - start_trim junction_end = record.junction_end - start_trim # CDS record cds_start = '<%i' % junction_start cds_end = '>%i' % junction_end cds_record = [('function', 'JUNCTION')] if inference is not None: cds_record.append(('inference', 'COORDINATES:protein motif:%s' % inference)) # Check for valid translation junction_seq = record.sequence_input[(junction_start - 1):junction_end] if len(junction_seq) % 3 > 0: junction_seq = junction_seq + 'N' * (3 - len(junction_seq) % 3) junction_aa = Seq(junction_seq).translate() # Return invalid record upon junction stop codon if '*' in junction_aa and not allow_stop: return None elif '*' in junction_aa: cds_record.append(('note', '%s junction region' % product)) result[(cds_start, cds_end, 'misc_feature')] = cds_record else: cds_record.append(('product', '%s junction region' % product)) cds_record.append(('codon_start', 1)) result[(cds_start, cds_end, 'CDS')] = cds_record return result def makeGenbankSequence(record, name=None, label=None, count_field=None, index_field=None, molecule=default_molecule, features=None): """ Creates a sequence for GenBank submissions Arguments: record : Receptor record. name : sequence identifier for the output sequence. If None, use the original sequence identifier. label : a string to use as a label for the ID. if None do not add a field label. count_field : field name to populate the AIRR_READ_COUNT note. index_field : field name to populate the AIRR_CELL_INDEX note. molecule : source molecule (eg, "mRNA", "genomic DNA") features : dictionary of sample features (BioSample attributes) to add to the description of each record. 
    Returns:
      dict: dictionary with {'record': SeqRecord,
                             'start': start position in raw sequence,
                             'end': end position in raw sequence}
    """
    # Replace gaps with N
    seq = record.sequence_input
    seq = seq.replace('-', 'N').replace('.', 'N')

    # Strip leading and trailing Ns
    head_match = re.search('^N+', seq)
    tail_match = re.search('N+$', seq)
    seq_start = head_match.end() if head_match else 0
    seq_end = tail_match.start() if tail_match else len(seq)

    # Define ID
    if name is None:
        name = record.sequence_id.split(' ')[0]
    if label is not None:
        name = '%s=%s' % (label, name)
    if features is not None:
        sample_desc = ' '.join(['[%s=%s]' % (k, v) for k, v in features.items()])
        name = '%s %s' % (name, sample_desc)
    name = '%s [moltype=%s] [keyword=TLS; Targeted Locus Study; AIRR; MiAIRR:1.0]' % (name, molecule)

    # Notes
    note_dict = OrderedDict()
    if count_field is not None:
        note_dict['AIRR_READ_COUNT'] = record.getField(count_field)
    if index_field is not None:
        note_dict['AIRR_CELL_INDEX'] = record.getField(index_field)
    if note_dict:
        note = '; '.join(['%s:%s' % (k, v) for k, v in note_dict.items()])
        name = '%s [note=%s]' % (name, note)

    # Return SeqRecord and positions
    record = SeqRecord(Seq(seq[seq_start:seq_end]), id=name, name=name, description='')
    result = {'record': record, 'start': seq_start, 'end': seq_end}

    return result


def convertToGenbank(db_file, inference=None, db_xref=None, molecule=default_molecule,
                     product=default_product, features=None, c_field=None, label=None,
                     count_field=None, index_field=None, allow_stop=False,
                     asis_id=False, asis_calls=False, allele_delim=default_allele_delim,
                     build_asn=False, asn_template=None, tbl2asn_exec=default_tbl2asn_exec,
                     format=default_format, out_file=None, out_args=default_out_args):
    """
    Builds GenBank submission fasta and table files

    Arguments:
      db_file : the database file name.
      inference : reference alignment tool.
      db_xref : reference database link.
      molecule : source molecule (eg, "mRNA", "genomic DNA")
      product : Product (protein) name.
      features : dictionary of sample features (BioSample attributes)
                 to add to the description of each record.
      c_field : column containing the C region gene call.
      label : a string to use as a label for the ID. If None do not add a field label.
      count_field : field name to populate the AIRR_READ_COUNT note.
      index_field : field name to populate the AIRR_CELL_INDEX note.
      allow_stop : if True retain records with junctions having stop codons.
      asis_id : if True use the original sequence ID for the output IDs.
      asis_calls : if True do not parse gene calls for IMGT nomenclature.
      allele_delim : delimiter separating the gene name from the allele number
                     when asis_calls=True.
      build_asn : if True run tbl2asn on the generated .tbl and .fsa files.
      asn_template : template file (.sbt) to pass to tbl2asn.
      tbl2asn_exec : name of or path to the tbl2asn executable.
      format : input and output format.
      out_file : output file name without extension.
                 Automatically generated from the input file if None.
      out_args : common output argument dictionary from parseCommonArgs.

    Returns:
      tuple : the output (feature table, fasta) file names.
    """
    log = OrderedDict()
    log['START'] = 'ConvertDb'
    log['COMMAND'] = 'genbank'
    log['FILE'] = os.path.basename(db_file)
    printLog(log)

    # Define format operators
    try:
        reader, __, schema = getFormatOperators(format)
    except ValueError:
        printError('Invalid format %s.'
                   % format)

    # Open input
    db_handle = open(db_file, 'rt')
    db_iter = reader(db_handle)

    # Check for required columns
    try:
        required = ['sequence_input',
                    'v_call', 'd_call', 'j_call',
                    'v_seq_start', 'd_seq_start', 'j_seq_start']
        checkFields(required, db_iter.fields, schema=schema)
    except LookupError as e:
        printError(e)

    # Open output
    if out_file is not None:
        out_name, __ = os.path.splitext(out_file)
        fsa_handle = open('%s.fsa' % out_name, 'w')
        tbl_handle = open('%s.tbl' % out_name, 'w')
    else:
        fsa_handle = getOutputHandle(db_file, out_label='genbank',
                                     out_dir=out_args['out_dir'],
                                     out_name=out_args['out_name'],
                                     out_type='fsa')
        tbl_handle = getOutputHandle(db_file, out_label='genbank',
                                     out_dir=out_args['out_dir'],
                                     out_name=out_args['out_name'],
                                     out_type='tbl')

    # Count records
    result_count = countDbFile(db_file)

    # Define writer
    writer = csv.writer(tbl_handle, delimiter='\t', quoting=csv.QUOTE_NONE)

    # Iterate over records
    start_time = time()
    rec_count, pass_count, fail_count = 0, 0, 0
    for rec in db_iter:
        # Print progress for previous iteration
        printProgress(rec_count, result_count, 0.05, start_time=start_time)
        rec_count += 1

        # Extract table dictionary
        name = None if asis_id else rec_count
        seq = makeGenbankSequence(rec, name=name, label=label, count_field=count_field,
                                  index_field=index_field, molecule=molecule,
                                  features=features)
        tbl = makeGenbankFeatures(rec, start=seq['start'], end=seq['end'],
                                  product=product, db_xref=db_xref,
                                  inference=inference, c_field=c_field,
                                  allow_stop=allow_stop, asis_calls=asis_calls,
                                  allele_delim=allele_delim)

        if tbl is not None:
            pass_count += 1
            # Write table
            writer.writerow(['>Features', seq['record'].id])
            for feature, qualifiers in tbl.items():
                writer.writerow(feature)
                if qualifiers:
                    for x in qualifiers:
                        writer.writerow(list(chain(['', '', ''], x)))

            # Write sequence
            SeqIO.write(seq['record'], fsa_handle, 'fasta')
        else:
            fail_count += 1

    # Final progress bar
    printProgress(rec_count, result_count, 0.05, start_time=start_time)

    # Run tbl2asn
    if build_asn:
        start_time = time()
        printMessage('Running tbl2asn', start_time=start_time, width=25)
        result = runASN(fsa_handle.name, template=asn_template, exec=tbl2asn_exec)
        printMessage('Done', start_time=start_time, end=True, width=25)

    # Print ending console log
    log = OrderedDict()
    log['OUTPUT_TBL'] = os.path.basename(tbl_handle.name)
    log['OUTPUT_FSA'] = os.path.basename(fsa_handle.name)
    log['RECORDS'] = rec_count
    log['PASS'] = pass_count
    log['FAIL'] = fail_count
    log['END'] = 'ConvertDb'
    printLog(log)

    # Close file handles
    tbl_handle.close()
    fsa_handle.close()
    db_handle.close()

    return (tbl_handle.name, fsa_handle.name)


def getArgParser():
    """
    Defines the ArgumentParser

    Arguments:
      None

    Returns:
      an ArgumentParser object
    """
    # Define input and output field help message
    fields = dedent(
             '''
             output files:
                 airr
                     AIRR formatted database files.
                 changeo
                     Change-O formatted database files.
                 sequences
                     FASTA formatted sequences output from the subcommands fasta and clip.
                 genbank
                     feature tables and fasta files containing MiAIRR compliant input for tbl2asn.

             required fields:
                 sequence_id, sequence, sequence_alignment, junction, v_call, d_call, j_call,
                 v_germline_start, v_germline_end, v_sequence_start, v_sequence_end,
                 d_sequence_start, d_sequence_end, j_sequence_start, j_sequence_end

             optional fields:
                 germline_alignment, c_call, clone_id
             ''')

    # Define ArgumentParser
    parser = ArgumentParser(description=__doc__, epilog=fields,
                            formatter_class=CommonHelpFormatter, add_help=False)
    group_help = parser.add_argument_group('help')
    group_help.add_argument('--version', action='version',
                            version='%(prog)s:' + ' %s %s' % (__version__, __date__))
    group_help.add_argument('-h', '--help', action='help',
                            help='show this help message and exit')
    subparsers = parser.add_subparsers(title='subcommands', dest='command', metavar='',
                                       help='Database operation')
    # TODO: This is a temporary fix for Python issue 9253
    subparsers.required = True

    # Define parent parsers
    default_parent = getCommonArgParser(failed=False, log=False, format=False)
    format_parent = getCommonArgParser(failed=False, log=False)

    # Subparser to convert changeo to AIRR files
    parser_airr = subparsers.add_parser('airr', parents=[default_parent],
                                        formatter_class=CommonHelpFormatter,
                                        add_help=False,
                                        help='Converts input to an AIRR TSV file.',
                                        description='Converts input to an AIRR TSV file.')
    parser_airr.set_defaults(func=convertToAIRR)

    # Subparser to convert AIRR to changeo files
    parser_changeo = subparsers.add_parser('changeo', parents=[default_parent],
                                           formatter_class=CommonHelpFormatter,
                                           add_help=False,
                                           help='Converts input into a Change-O TSV file.',
                                           description='Converts input into a Change-O TSV file.')
    parser_changeo.set_defaults(func=convertToChangeo)

    # Subparser to convert database entries to sequence file
    parser_fasta = subparsers.add_parser('fasta', parents=[default_parent],
                                         formatter_class=CommonHelpFormatter,
                                         add_help=False,
                                         help='Creates a fasta file from database records.',
                                         description='Creates a fasta file from database records.')
    group_fasta = parser_fasta.add_argument_group('conversion arguments')
    group_fasta.add_argument('--if', action='store', dest='id_field',
                             default=default_id_field,
                             help='The name of the field containing identifiers')
    group_fasta.add_argument('--sf', action='store', dest='seq_field',
                             default=default_seq_field,
                             help='The name of the field containing sequences')
    group_fasta.add_argument('--mf', nargs='+', action='store', dest='meta_fields',
                             help='List of annotation fields to add to the sequence description')
    parser_fasta.set_defaults(func=convertToFasta)

    # Subparser to convert database entries to clip-fasta file
    parser_baseln = subparsers.add_parser('baseline', parents=[default_parent],
                                          formatter_class=CommonHelpFormatter,
                                          add_help=False,
                                          description='Creates a BASELINe fasta file from database records.',
                                          help='''Creates a specially formatted fasta file from database records
                                               for input into the BASELINe website. The format groups clonally
                                               related sequences sequentially, with the germline sequence
                                               preceding each clone and denoted by headers starting with ">>".''')
    group_baseln = parser_baseln.add_argument_group('conversion arguments')
    group_baseln.add_argument('--if', action='store', dest='id_field',
                              default=default_id_field,
                              help='The name of the field containing identifiers')
    group_baseln.add_argument('--sf', action='store', dest='seq_field',
                              default=default_seq_field,
                              help='The name of the field containing reads')
    group_baseln.add_argument('--gf', action='store', dest='germ_field',
                              default=default_germ_field,
                              help='The name of the field containing germline sequences')
    group_baseln.add_argument('--cf', action='store', dest='cluster_field', default=None,
                              help='The name of the field containing sorted clone IDs')
    group_baseln.add_argument('--mf', nargs='+', action='store', dest='meta_fields',
                              help='List of annotation fields to add to the sequence description')
    parser_baseln.set_defaults(func=convertToBaseline)

    # Subparser to convert database entries to a GenBank fasta and
feature table file parser_gb = subparsers.add_parser('genbank', parents=[format_parent], formatter_class=CommonHelpFormatter, add_help=False, help='Creates files for GenBank/TLS submissions.', description='Creates files for GenBank/TLS submissions.') # Genbank source information arguments group_gb_src = parser_gb.add_argument_group('source information arguments') group_gb_src.add_argument('--mol', action='store', dest='molecule', default=default_molecule, help='''The source molecule type. Usually one of "mRNA" or "genomic DNA".''') group_gb_src.add_argument('--product', action='store', dest='product', default=default_product, help='''The product name, such as "immunoglobulin heavy chain".''') group_gb_src.add_argument('--db', action='store', dest='db_xref', default=None, help='''Name of the reference database used for alignment. Usually "IMGT/GENE-DB".''') group_gb_src.add_argument('--inf', action='store', dest='inference', default=None, help='''Name and version of the inference tool used for reference alignment in the form tool:version.''') # Genbank sample information arguments group_gb_sam = parser_gb.add_argument_group('sample information arguments') group_gb_sam.add_argument('--organism', action='store', dest='organism', default=None, help='The scientific name of the organism.') group_gb_sam.add_argument('--sex', action='store', dest='sex', default=None, help='''If specified, adds the given sex annotation to the fasta headers.''') group_gb_sam.add_argument('--isolate', action='store', dest='isolate', default=None, help='''If specified, adds the given isolate annotation (sample label) to the fasta headers.''') group_gb_sam.add_argument('--tissue', action='store', dest='tissue', default=None, help='''If specified, adds the given tissue-type annotation to the fasta headers.''') group_gb_sam.add_argument('--cell-type', action='store', dest='cell_type', default=None, help='''If specified, adds the given cell-type annotation to the fasta headers.''') 
group_gb_sam.add_argument('-y', action='store', dest='yaml_config', default=None, help='''A yaml file specifying sample features (BioSample attributes) in the form \'variable: value\'. If specified, any features provided in the yaml file will override those provided at the commandline. Note, this config file applies to sample features only and cannot be used for required source features such as the --product or --mol argument.''') # General genbank conversion arguments group_gb_cvt = parser_gb.add_argument_group('conversion arguments') group_gb_cvt.add_argument('--label', action='store', dest='label', default=None, help='''If specified, add a field name to the sequence identifier. Sequence identifiers will be output in the form