changeo-0.4.6/INSTALL.rst

Installation
================================================================================

The simplest way to install the latest stable release of Change-O is via pip::

    > pip3 install changeo --user

The current development build can be installed using pip and mercurial in
similar fashion::

    > pip3 install hg+https://bitbucket.org/kleinstein/changeo#default --user

If you currently have a development version installed, then you will likely
need to add the arguments ``--upgrade --no-deps --force-reinstall`` to the
pip3 command.

Requirements
--------------------------------------------------------------------------------

+ `Python 3.4.0 `__
+ `setuptools 2.0 `__
+ `NumPy 1.8 `__
+ `SciPy 0.14 `__
+ `pandas 0.15 `__
+ `Biopython 1.65 `__
+ `presto 0.5.10 `__
+ `airr 1.2.1 `__
+ AlignRecords requires `MUSCLE 3.8 `__
+ ConvertDb-genbank requires `tbl2asn `__
+ AssignGenes requires `IgBLAST 1.6 `__, but version 1.11 or higher is
  recommended.
+ BuildTrees requires `IgPhyML 1.0.5 `_

Linux
--------------------------------------------------------------------------------

1. The simplest way to install all Python dependencies is to install the
   full SciPy stack using the `instructions `__, then install Biopython
   according to its `instructions `__.

2. Install `presto 0.5.0 `__ or greater.

3. Download the Change-O bundle and run::

   > pip3 install changeo-x.y.z.tar.gz --user

Mac OS X
--------------------------------------------------------------------------------

1. Install Xcode. Available from the Apple store or
   `developer downloads `__.

2. Older versions of Mac OS X will require you to install XQuartz 2.7.5.
   Available from the `XQuartz project `__.

3. Install Homebrew following the installation and post-installation
   `instructions `__.

4. Install Python 3.4.0+ and set the path to the python3 executable::

   > brew install python3
   > echo 'export PATH=/usr/local/bin:$PATH' >> ~/.profile

5. Exit and reopen the terminal application so the PATH setting takes effect.

6. You may, or may not, need to install gfortran (required for SciPy). Try
   without it first, as this step can take an hour and is not needed on newer
   releases. If you do need gfortran to install SciPy, you can install it
   using Homebrew::

   > brew install gfortran

   If the above fails, run this instead::

   > brew install --env=std gfortran

7. Install NumPy, SciPy, pandas and Biopython using the Python package
   manager::

   > pip3 install numpy scipy pandas biopython

8. Install `presto 0.5.0 `__ or greater.

9. Download the Change-O bundle, open a terminal window, change directories
   to the download folder, and run::

   > pip3 install changeo-x.y.z.tar.gz

Windows
--------------------------------------------------------------------------------

1. Install Python 3.4.0+ from `Python `__, selecting both the options 'pip'
   and 'Add python.exe to Path'.

2. Install NumPy, SciPy, pandas and Biopython using the packages available
   from the `Unofficial Windows binary `__ collection.

3. Install `presto 0.5.0 `__ or greater.

4. Download the Change-O bundle, open a Command Prompt, change directories
   to the download folder, and run::

   > pip install changeo-x.y.z.tar.gz

5. For a default installation of Python 3.4, the Change-O scripts will be
   installed into ``C:\Python34\Scripts`` and should be directly executable
   from the Command Prompt.
   If this is not the case, then follow step 6 below.

6. Add both the ``C:\Python34`` and ``C:\Python34\Scripts`` directories to
   your ``%Path%``. On Windows 7 the ``%Path%`` setting is located under
   Control Panel -> System and Security -> System -> Advanced System
   Settings -> Environment variables -> System variables -> Path.

7. If you have trouble with the ``.py`` file associations, try adding
   ``.PY`` to your ``PATHEXT`` environment variable. Also, try opening a
   Command Prompt as Administrator and running::

   > assoc .py=Python.File
   > ftype Python.File="C:\Python34\python.exe" "%1" %*

changeo-0.4.6/LICENSE

Attribution-ShareAlike 4.0 International

=======================================================================

Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of Creative
Commons public licenses does not create a lawyer-client or other
relationship. Creative Commons makes its licenses and related information
available on an "as-is" basis. Creative Commons gives no warranties
regarding its licenses, any material licensed under their terms and
conditions, or any related information. Creative Commons disclaims all
liability for damages resulting from their use to the fullest extent
possible.

Using Creative Commons Public Licenses

Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share original
works of authorship and other material subject to copyright and certain
other rights specified in the public license below. The following
considerations are for informational purposes only, are not exhaustive, and
do not form part of our licenses.

     Considerations for licensors: Our public licenses are intended for
     use by those authorized to give the public permission to use
     material in ways otherwise restricted by copyright and certain
     other rights. Our licenses are irrevocable. Licensors should read
     and understand the terms and conditions of the license they choose
     before applying it. Licensors should also secure all rights
     necessary before applying our licenses so that the public can reuse
     the material as expected. Licensors should clearly mark any
     material not subject to the license. This includes other CC-
     licensed material, or material used under an exception or
     limitation to copyright. More considerations for licensors:
     wiki.creativecommons.org/Considerations_for_licensors

     Considerations for the public: By using one of our public licenses,
     a licensor grants the public permission to use the licensed
     material under specified terms and conditions. If the licensor's
     permission is not necessary for any reason--for example, because of
     any applicable exception or limitation to copyright--then that use
     is not regulated by the license. Our licenses grant only
     permissions under copyright and certain other rights that a
     licensor has authority to grant. Use of the licensed material may
     still be restricted for other reasons, including because others
     have copyright or other rights in the material. A licensor may make
     special requests, such as asking that all changes be marked or
     described. Although not required by our licenses, you are
     encouraged to respect those requests where reasonable.
More_considerations for the public: wiki.creativecommons.org/Considerations_for_licensees ======================================================================= Creative Commons Attribution-ShareAlike 4.0 International Public License By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution-ShareAlike 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions. Section 1 -- Definitions. a. Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image. b. Adapter's License means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License. c. BY-SA Compatible License means a license listed at creativecommons.org/compatiblelicenses, approved by Creative Commons as essentially the equivalent of this Public License. d. Copyright and Similar Rights means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights. e. Effective Technological Measures means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements. f. Exceptions and Limitations means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material. g. License Elements means the license attributes listed in the name of a Creative Commons Public License. The License Elements of this Public License are Attribution and ShareAlike. h. Licensed Material means the artistic or literary work, database, or other material to which the Licensor applied this Public License. i. Licensed Rights means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license. j. Licensor means the individual(s) or entity(ies) granting rights under this Public License. k. 
Share means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them. l. Sui Generis Database Rights means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world. m. You means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning. Section 2 -- Scope. a. License grant. 1. Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to: a. reproduce and Share the Licensed Material, in whole or in part; and b. produce, reproduce, and Share Adapted Material. 2. Exceptions and Limitations. For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions. 3. Term. The term of this Public License is specified in Section 6(a). 4. Media and formats; technical modifications allowed. The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a) (4) never produces Adapted Material. 5. Downstream recipients. a. Offer from the Licensor -- Licensed Material. Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License. b. Additional offer from the Licensor -- Adapted Material. Every recipient of Adapted Material from You automatically receives an offer from the Licensor to exercise the Licensed Rights in the Adapted Material under the conditions of the Adapter's License You apply. c. No downstream restrictions. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material. 6. No endorsement. Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i). b. Other rights. 1. 
Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise. 2. Patent and trademark rights are not licensed under this Public License. 3. To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties. Section 3 -- License Conditions. Your exercise of the Licensed Rights is expressly made subject to the following conditions. a. Attribution. 1. If You Share the Licensed Material (including in modified form), You must: a. retain the following if it is supplied by the Licensor with the Licensed Material: i. identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated); ii. a copyright notice; iii. a notice that refers to this Public License; iv. a notice that refers to the disclaimer of warranties; v. a URI or hyperlink to the Licensed Material to the extent reasonably practicable; b. indicate if You modified the Licensed Material and retain an indication of any previous modifications; and c. indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License. 2. You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information. 3. If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable. b. ShareAlike. In addition to the conditions in Section 3(a), if You Share Adapted Material You produce, the following conditions also apply. 1. The Adapter's License You apply must be a Creative Commons license with the same License Elements, this version or later, or a BY-SA Compatible License. 2. You must include the text of, or the URI or hyperlink to, the Adapter's License You apply. You may satisfy this condition in any reasonable manner based on the medium, means, and context in which You Share Adapted Material. 3. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, Adapted Material that restrict exercise of the rights granted under the Adapter's License You apply. Section 4 -- Sui Generis Database Rights. Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material: a. for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database; b. 
if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material, including for purposes of Section 3(b); and c. You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database. For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights. Section 5 -- Disclaimer of Warranties and Limitation of Liability. a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. c. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability. Section 6 -- Term and Termination. a. This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically. b. Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates: 1. automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or 2. upon express reinstatement by the Licensor. For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License. c. For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License. d. Sections 1, 5, 6, 7, and 8 survive termination of this Public License. Section 7 -- Other Terms and Conditions. a. The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed. b. Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License. Section 8 -- Interpretation. a. 
   For the avoidance of doubt, this Public License does not, and
   shall not be interpreted to, reduce, limit, restrict, or impose
   conditions on any use of the Licensed Material that could lawfully
   be made without permission under this Public License.

  b. To the extent possible, if any provision of this Public License is
     deemed unenforceable, it shall be automatically reformed to the
     minimum extent necessary to make it enforceable. If the provision
     cannot be reformed, it shall be severed from this Public License
     without affecting the enforceability of the remaining terms and
     conditions.

  c. No term or condition of this Public License will be waived and no
     failure to comply consented to unless expressly agreed to by the
     Licensor.

  d. Nothing in this Public License constitutes or may be interpreted
     as a limitation upon, or waiver of, any privileges and immunities
     that apply to the Licensor or You, including from the legal
     processes of any jurisdiction or authority.

=======================================================================

Creative Commons is not a party to its public licenses. Notwithstanding,
Creative Commons may elect to apply one of its public licenses to material
it publishes and in those instances will be considered the "Licensor." The
text of the Creative Commons public licenses is dedicated to the public
domain under the CC0 Public Domain Dedication. Except for the limited
purpose of indicating that material is shared under a Creative Commons
public license or as otherwise permitted by the Creative Commons policies
published at creativecommons.org/policies, Creative Commons does not
authorize the use of the trademark "Creative Commons" or any other
trademark or logo of Creative Commons without its prior written consent
including, without limitation, in connection with any unauthorized
modifications to any of its public licenses or any other arrangements,
understandings, or agreements concerning use of licensed material. For the
avoidance of doubt, this paragraph does not form part of the public
licenses.

Creative Commons may be contacted at creativecommons.org.

changeo-0.4.6/MANIFEST.in

include requirements.txt
include README.rst
include INSTALL.rst
include NEWS.rst
include LICENSE

changeo-0.4.6/NEWS.rst

Release Notes
===============================================================================

Version 0.4.6: July 19, 2019
-------------------------------------------------------------------------------

BuildTrees:

+ Added the capability to run IgPhyML on output data (``--igphyml``) and
  support for passing IgPhyML arguments through BuildTrees.
+ Added the ``--clean`` argument to force deletion of all intermediate files
  after IgPhyML execution.
+ Added the ``--format`` argument to allow specification of input and output
  in either the Change-O standard (``changeo``) or AIRR Rearrangement
  standard (``airr``).

CreateGermlines:

+ Fixed a bug causing incorrect reporting of the germline format in the
  console log.

ConvertDb:

+ Removed the requirement for the ``NP1_LENGTH`` and ``NP2_LENGTH`` fields
  from the genbank subcommand.

DefineClones:

+ Fixed a biopython warning arising when applying ``--model aa`` to junction
  sequences that are not a multiple of three. The junction will now be
  padded with an appropriate number of Ns (usually resulting in a
  translation to X).
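As a hedged illustration of that padding behavior, here is a minimal sketch
using Biopython (the helper name is invented; this is not the exact
DefineClones implementation)::

    from Bio.Seq import Seq

    def pad_junction(junction):
        # Pad with Ns to a multiple of three so translation succeeds;
        # the trailing ambiguous codon translates to X.
        remainder = len(junction) % 3
        if remainder > 0:
            junction += 'N' * (3 - remainder)
        return str(Seq(junction).translate())

    print(pad_junction('TGTGCGAGAGATAGTT'))  # 16 nt -> 'CARDSX'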
MakeDb:

+ Added the ``--10x`` argument to all subcommands to support merging of Cell
  Ranger annotation data, such as UMI count and C-region assignment, with
  the output of the supported alignment tools.
+ Added inference of the receptor locus from the alignment data to all
  subcommands, which is output in the ``LOCUS`` field.
+ Combined the extended field arguments of all subcommands (``--scores``,
  ``--regions``, ``--cdr3``, and ``--junction``) into a single
  ``--extended`` argument.
+ Removed parsing of old IgBLAST v1.5 CDR3 fields (``CDR3_IGBLAST``,
  ``CDR3_IGBLAST_AA``).

Version 0.4.5: January 9, 2019
-------------------------------------------------------------------------------

+ Slightly changed the version number display in the commandline help.

BuildTrees:

+ Fixed a bug that caused a malformed lineages.tsv output file.

CreateGermlines:

+ Fixed a bug in the CreateGermlines log output causing incorrect missing
  D gene or J gene error messages.

DefineClones:

+ Fixed a bug that caused a missing junction column to cluster sequences
  together.

MakeDb:

+ Fixed a bug that caused failed germline reconstructions to be recorded as
  ``None``, rather than an empty string, in the ``GERMLINE_IMGT`` column.

Version 0.4.4: October 27, 2018
-------------------------------------------------------------------------------

+ Fixed a bug causing the values of ``_start`` fields to be off by one from
  the v1.2 AIRR Schema requirement when specifying ``--format airr``.

Version 0.4.3: October 19, 2018
-------------------------------------------------------------------------------

+ Updated the airr library requirement to v1.2.1 to fix empty V(D)J start
  coordinate values when specifying ``--format airr`` to tools.
+ Changed the pRESTO dependency to v0.5.10.

BuildTrees:

+ New tool.
+ Converts tab-delimited database files into input for `IgPhyML `_.

CreateGermlines:

+ Now verifies that all files/folders passed to the ``-r`` argument exist.

Version 0.4.2: September 6, 2018
-------------------------------------------------------------------------------

+ Updated support for the AIRR Rearrangement schema to v1.2 and added the
  associated airr library dependency.

AssignGenes:

+ New tool.
+ Provides a simple IgBLAST wrapper as the ``igblast`` subcommand.

ConvertDb:

+ The ``genbank`` subcommand will perform a check for some of the required
  columns in the input file and exit if they are not found.
+ Changed the behavior of the ``-y`` argument in the ``genbank`` subcommand.
  This argument is now restricted to sample features only, but allows for
  the inclusion of any BioSample attribute.

CreateGermlines:

+ Will now perform a naive verification that the reference sequences
  provided to the ``-r`` argument are IMGT-gapped. A warning will be issued
  to standard error if the reference sequences fail the check.
+ Will perform a check for some of the required columns in the input file
  and exit if they are not found.

MakeDb:

+ Changed the output of ``SEQUENCE_VDJ`` from the igblast subcommand to
  retain insertions in the query sequence rather than delete them as is done
  in the ``SEQUENCE_IMGT`` field.
+ Will now perform a naive verification that the reference sequences
  provided to the ``-r`` argument are IMGT-gapped. A warning will be issued
  to standard error if the reference sequences fail the check.

Version 0.4.1: July 16, 2018
-------------------------------------------------------------------------------

+ Fixed installation incompatibility with pip 10.
+ Fixed a duplicate newline issue on Windows.
+ All tools will no longer create empty pass or fail files if there are no
  records meeting the appropriate criteria for output.
+ Most tools now allow explicit specification of the output file name via
  the optional ``-o`` argument.
+ Added support for the AIRR standard TSV via the ``--format airr`` argument
  to all relevant tools.
+ Replaced the V, D and J ``BTOP`` columns with ``CIGAR`` columns in the
  data standard.
+ Numerous API changes and internal structural changes to commandline tools.

AlignRecords:

+ Fixed a bug arising when space characters are present in the sequence
  identifiers.

ConvertDb:

+ New tool.
+ Includes the airr and changeo subcommands to convert between AIRR and
  Change-O formatted TSV files.
+ The genbank subcommand creates MiAIRR compliant files for submission to
  GenBank/TLS.
+ Contains the baseline and fasta subcommands previously in ParseDb.

CreateGermlines:

+ Changed the character used to pad clonal consensus sequences from ``.``
  to ``N``.
+ Changed tie resolution in the clonal consensus from a random V/J gene to
  alphabetical by sequence identifier.
+ Added the ``--df`` and ``--jf`` arguments for specifying the D and J
  fields, respectively.
+ Added an initial sorting step when specifying ``--cloned`` so that
  clonally ordered input is no longer required.

DefineClones:

+ Removed the chen2010 and ademokun2011 subcommands and made the previous
  bygroup subcommand the default behavior.
+ Renamed the ``--f`` argument to ``--gf`` for consistency with other tools.
+ Added the arguments ``--vf`` and ``--jf`` to allow specification of the V
  and J call fields, respectively.

MakeDb:

+ Renamed the ``--noparse`` argument to ``--asis-id``.
+ Added the ``--asis-calls`` argument to the igblast subcommand to allow use
  with non-standard gene names.
+ Added the ``GERMLINE_IMGT`` column to the default output.
+ Changed junction inference in the igblast subcommand to use IgBLAST's CDR3
  assignment for IgBLAST versions greater than or equal to 1.7.0.
+ Added a verification that the ``SEQUENCE_IMGT`` and ``JUNCTION`` fields
  are in agreement for records to pass.
+ Changed the behavior of the igblast subcommand's translation of the
  junction sequence to truncate junctions that are not multiples of 3,
  rather than pad to a multiple of 3 (removes the trailing X character).
+ The igblast subcommand will now fail records missing the required optional
  fields ``subject seq``, ``query seq`` and ``BTOP``, rather than abort.
+ Fixed a bug causing parsing of IgBLAST <= 1.4 output to fail.

ParseDb:

+ Added the merge subcommand, which will combine TSV files.
+ All field arguments are now case sensitive to provide support for both the
  Change-O and AIRR data standards.

Version 0.3.12: February 16, 2018
-------------------------------------------------------------------------------

MakeDb:

+ Fixed a bug wherein specifying multiple simultaneous inputs would cause
  duplicated parsed pRESTO fields to appear in the second and higher output
  files.

Version 0.3.11: February 6, 2018
-------------------------------------------------------------------------------

MakeDb:

+ Fixed junction inference for the igblast subcommand when the J region is
  truncated.

Version 0.3.10: February 6, 2018
-------------------------------------------------------------------------------

Fixed incorrect progress bars resulting from files containing empty lines.

DefineClones:

+ Fixed several bugs in the chen2010 and ademokun2011 methods that caused
  them to either fail or incorrectly cluster all sequences into a single
  clone.
+ Added an informative message for the out of memory error in the chen2010
  and ademokun2011 methods.
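The DefineClones clustering referenced throughout these notes reduces
junctions to pairwise distances. As a hedged sketch of a length-normalized
Hamming distance with the X-as-match rule noted under Version 0.3.5 below
(illustrative only; not the packaged distance models)::

    def hamming_aa(x, y):
        # Amino acid Hamming distance where X matches any character,
        # normalized by junction length (the 'len' normalization).
        assert len(x) == len(y)
        mismatches = sum(a != b and 'X' not in (a, b) for a, b in zip(x, y))
        return mismatches / len(x)

    hamming_aa('CARDX', 'CARGY')  # -> 0.2; only the D/G position counts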
Version 0.3.9: October 17, 2017
-------------------------------------------------------------------------------

DefineClones:

+ Fixed a bug causing DefineClones to fail when all sequences are removed
  from a group due to missing characters.

Version 0.3.8: October 5, 2017
-------------------------------------------------------------------------------

AlignRecords:

+ Resurrected AlignRecords, which performs multiple alignment of sequence
  fields.
+ Added the new subcommands ``across`` (multiple aligns within columns),
  ``within`` (multiple aligns columns within each row), and ``block``
  (multiple aligns across both columns and rows).

CreateGermlines:

+ Fixed a bug causing CreateGermlines to incorrectly fail records when using
  the argument ``--vf V_CALL_GENOTYPED``.

DefineClones:

+ Added the ``--maxmiss`` argument to the bygroup subcommand of
  DefineClones, which sets exclusion criteria for junction sequences with
  ambiguous and missing characters. By default, bygroup will now fail all
  sequences with any missing characters in the junction (``--maxmiss 0``).

Version 0.3.7: June 30, 2017
-------------------------------------------------------------------------------

MakeDb:

+ Fixed an incompatibility with IgBLAST v1.7.0.

CreateGermlines:

+ Fixed an error that occurs when using ``--cloned`` with an input file
  containing duplicate values in ``SEQUENCE_ID`` that caused some records to
  be discarded.

Version 0.3.6: June 13, 2017
-------------------------------------------------------------------------------

+ Fixed an overflow error on Windows that caused tools to fatally exit.
+ All tools will now print detailed help if no arguments are provided.

Version 0.3.5: May 12, 2017
-------------------------------------------------------------------------------

Fixed a bug wherein ``.tsv`` was not being recognized as a valid extension.

MakeDb:

+ Added the ``--cdr3`` argument to the igblast subcommand to extract the
  CDR3 nucleotide and amino acid sequence defined by IgBLAST.
+ Updated the IMGT/HighV-QUEST parser to handle recent column name changes.
+ Fixed a bug in the igblast parser wherein some sequence identifiers were
  not being processed correctly.

DefineClones:

+ Changed the way ``X`` characters are handled in the amino acid Hamming
  distance model to count as a match against any character.

Version 0.3.4: February 14, 2017
-------------------------------------------------------------------------------

License changed to Creative Commons Attribution-ShareAlike 4.0 International
(CC BY-SA 4.0).

CreateGermlines:

+ Added ``GERMLINE_V_CALL``, ``GERMLINE_D_CALL`` and ``GERMLINE_J_CALL``
  columns to the output when the ``--cloned`` argument is specified. These
  columns contain the consensus annotations when clonal groups contain
  ambiguous gene assignments.
+ Fixed the error message for an invalid repo (``-r``) argument.

DefineClones:

+ Deprecated the ``m1n`` and ``hs1f`` distance models, renamed them to
  ``m1n_compat`` and ``hs1f_compat``, and replaced them with ``mk_rs1nf``
  and ``hh_s1f``, respectively.
+ Renamed the ``hs5f`` distance model to ``hh_s5f``.
+ Added the mouse specific distance model ``mk_rs5nf`` from Cui et al, 2016.

MakeDb:

+ Added compatibility for IgBLAST v1.6.
+ Added the ``--partial`` flag, which tells MakeDb to pass incomplete
  alignment results.
+ Added missing console log entries for the ihmm subcommand.
+ The IMGT/HighV-QUEST, IgBLAST and iHMMune-Align parsers have been cleaned
  up, better documented and moved into the iterable classes
  ``changeo.Parsers.IMGTReader``, ``changeo.Parsers.IgBLASTReader``, and
  ``changeo.Parsers.IHMMuneReader``, respectively.
+ Corrected the behavior of the ``D_FRAME`` annotation from the
  ``--junction`` argument to the imgt subcommand such that it now reports no
  value when no value is reported by IMGT, rather than reporting the reading
  frame as 0 in these cases.
+ Fixed parsing of the ``IN_FRAME``, ``STOP``, ``D_SEQ_START`` and
  ``D_SEQ_LENGTH`` fields from iHMMune-Align output.
+ Removed extraneous score fields from each parser.
+ Fixed the error message for an invalid repo (``-r``) argument.

Version 0.3.3: August 8, 2016
-------------------------------------------------------------------------------

Increased ``csv.field_size_limit`` in changeo.IO, ParseDb and DefineClones
to be able to handle files with a larger number of UMIs in one field.

Renamed the fields ``N1_LENGTH`` to ``NP1_LENGTH`` and ``N2_LENGTH`` to
``NP2_LENGTH``.

CreateGermlines:

+ Added differentiation of the N and P regions in the ``REGION`` log field
  if the N/P region info is present in the input file (eg, from the
  ``--junction`` argument to MakeDb-imgt). If the additional N/P region
  columns are not present, then both N and P regions will be denoted by N,
  as in previous versions.
+ Added the option 'regions' to the ``-g`` argument to add the
  ``GERMLINE_REGIONS`` field to the output, which represents the germline
  positions as V, D, J, N and P characters. This is equivalent to the
  ``REGION`` log entry.

DefineClones:

+ Significantly improved the performance of the ``--act set`` grouping
  method in the bygroup subcommand.

MakeDb:

+ Fixed a bug producing ``D_SEQ_START`` and ``J_SEQ_START`` relative to
  ``SEQUENCE_VDJ`` when they should be relative to ``SEQUENCE_INPUT``.
+ Added the argument ``--junction`` to the imgt subcommand to parse
  additional junction information fields, including N/P region lengths and
  the D-segment reading frame. This provides the following additional output
  fields: ``D_FRAME``, ``N1_LENGTH``, ``N2_LENGTH``, ``P3V_LENGTH``,
  ``P5D_LENGTH``, ``P3D_LENGTH``, ``P5J_LENGTH``.
+ The fields ``N1_LENGTH`` and ``N2_LENGTH`` have been renamed to
  accommodate adding additional output from IMGT under the ``--junction``
  flag. The new names are ``NP1_LENGTH`` and ``NP2_LENGTH``.
+ Fixed a bug that caused the ``IN_FRAME``, ``MUTATED_INVARIANT`` and
  ``STOP`` fields to be parsed incorrectly from IMGT data.
+ Output from iHMMuneAlign can now be parsed via the ``ihmm`` subcommand.
  Note, there is insufficient information returned by iHMMuneAlign to
  reliably reconstruct germline sequences from the output using
  CreateGermlines.

ParseDb:

+ Renamed the clip subcommand to baseline.

Version 0.3.2: March 8, 2016
-------------------------------------------------------------------------------

Fixed a bug with installation on Windows due to old file paths lingering in
changeo.egg-info/SOURCES.txt.

Updated the license from CC BY-NC-SA 3.0 to CC BY-NC-SA 4.0.

CreateGermlines:

+ Fixed a bug producing incorrect values in the ``SEQUENCE`` field of the
  log file.

MakeDb:

+ Updated the igblast subcommand to correctly parse records with indels. Now
  igblast must be run with the argument ``outfmt "7 std qseq sseq btop"``.
+ Changed the names of the FWR and CDR output columns added with
  ``--regions`` to ``_IMGT``.
+ Added ``V_BTOP`` and ``J_BTOP`` output when the ``--scores`` flag is
  specified to the igblast subcommand.
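As a hypothetical igblastn invocation matching that required output format
(the database and auxiliary file paths are illustrative, not fixtures from
this package)::

    > igblastn -query reads.fasta \
        -germline_db_V database/imgt_human_ig_v \
        -germline_db_D database/imgt_human_ig_d \
        -germline_db_J database/imgt_human_ig_j \
        -auxiliary_data optional_file/human_gl.aux \
        -domain_system imgt -outfmt '7 std qseq sseq btop' \
        -out reads.fmt7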
Version 0.3.1: December 18, 2015
-------------------------------------------------------------------------------

MakeDb:

+ Fixed a bug wherein the imgt subcommand was not properly recognizing an
  extracted folder as input to the ``-i`` argument.

Version 0.3.0: December 4, 2015
-------------------------------------------------------------------------------

Conversion to a proper Python package which uses pip and setuptools for
installation.

The package now requires Python 3.4. Python 2.7 is no longer supported.

The required dependency versions have been bumped to numpy 1.9, scipy 0.14,
pandas 0.16 and biopython 1.65.

DbCore:

+ Divided DbCore functionality into the separate modules: Defaults,
  Distance, IO, Multiprocessing and Receptor.

IgCore:

+ Removed IgCore in favor of a dependency on pRESTO >= 0.5.0.

AnalyzeAa:

+ This tool was removed. Its functionality has been migrated to the alakazam
  R package.

DefineClones:

+ Added the ``--sf`` flag to specify the sequence field used to calculate
  the distance between sequences.
+ Fixed a bug wherein sequences with missing data in the grouping columns
  were being assigned to a single group and clustered. Sequences with
  missing grouping variables will now be failed.
+ Fixed a bug where sequences with "None" junctions were grouped together.

GapRecords:

+ This tool was removed in favor of adding IMGT gapping support to the
  igblast subcommand of MakeDb.

MakeDb:

+ Updated the IgBLAST parser to create an IMGT gapped sequence and infer the
  junction region as defined by IMGT.
+ Added the ``--regions`` flag, which adds extra columns containing the FWR
  and CDR regions as defined by IMGT.
+ Added support to the imgt subcommand for the new IMGT/HighV-QUEST
  compression scheme (.txz files).

Version 0.2.5: August 25, 2015
-------------------------------------------------------------------------------

CreateGermlines:

+ Removed the default ``-r`` repository and added informative error messages
  when invalid germline repositories are provided.
+ Updated the ``-r`` flag to take a list of folders and/or fasta files with
  germlines.

Version 0.2.4: August 19, 2015
-------------------------------------------------------------------------------

MakeDb:

+ Fixed a bug wherein N1 and N2 region indexing was off by one nucleotide
  for the igblast subcommand (leading to incorrect SEQUENCE_VDJ values).

ParseDb:

+ Fixed a bug wherein specifying the ``-f`` argument to the index subcommand
  would cause an error.

Version 0.2.3: July 22, 2015
-------------------------------------------------------------------------------

DefineClones:

+ Fixed a typo in the default normalization setting of the bygroup
  subcommand, which was being interpreted as 'none' rather than 'len'.
+ Changed the 'hs5f' model of the bygroup subcommand to be the centered
  -log10 of the targeting probability.
+ Added the ``--sym`` argument to the bygroup subcommand, which determines
  how asymmetric distances are handled.

Version 0.2.2: July 8, 2015
-------------------------------------------------------------------------------

CreateGermlines:

+ Germline creation now works for IgBLAST output parsed with MakeDb. The
  argument ``--sf SEQUENCE_VDJ`` must be provided to generate germlines from
  IgBLAST output. The same reference database used for the IgBLAST alignment
  must be specified with the ``-r`` flag.
+ Fixed a bug with determination of the N1 and N2 region positions.

MakeDb:

+ Combined the ``-z`` and ``-f`` flags of the imgt subcommand into a single
  flag, ``-i``, which autodetects the input type.
+ Added the requirement that IgBLAST input be generated using the
  ``-outfmt "7 std qseq"`` argument to igblastn.
+ Modified the SEQUENCE_VDJ output from the IgBLAST parser to include gaps
  inserted during alignment.
+ Added a correction for IgBLAST alignments where V/D, D/J or V/J segments
  are assigned overlapping positions.
+ Corrected the N1_LENGTH and N2_LENGTH calculation from IgBLAST output.
+ Added the ``--scores`` flag, which adds extra columns containing alignment
  scores from IMGT and IgBLAST output.

Version 0.2.1: June 18, 2015
-------------------------------------------------------------------------------

DefineClones:

+ Removed the mouse 3-mer model, 'm3n'.

Version 0.2.0: June 17, 2015
-------------------------------------------------------------------------------

Initial public prerelease.

Output files were added to the usage documentation of all scripts.

General code cleanup.

DbCore:

+ Updated loading of database files to convert column names to uppercase.

AnalyzeAa:

+ Fixed a bug where junctions less than one codon long would lead to a
  division by zero error.
+ Added the ``--failed`` flag to create a database with records that fail
  analysis.
+ Added the ``--sf`` flag to specify the sequence field to be analyzed.

CreateGermlines:

+ Fixed a bug where germline sequences could not be created for light
  chains.

DefineClones:

+ Added a human 1-mer model, 'hs1f', which uses the substitution rates from
  Yaari et al, 2013.
+ Changed the default model to 'hs1f' and the default normalization to
  length for the bygroup subcommand.
+ Added the ``--link`` argument, which allows for specification of single,
  complete, or average linkage during clonal clustering (default single).

GapRecords:

+ Fixed a bug wherein non-standard sequence fields could not be aligned.

MakeDb:

+ Fixed a bug where the allele 'TRGVA*01' was not recognized as a valid
  allele.

ParseDb:

+ Added the rename subcommand to ParseDb, which renames fields.

Version 0.2.0.beta-2015-05-31: May 31, 2015
-------------------------------------------------------------------------------

Minor changes to a few output file names and log field entries.

ParseDb:

+ Added the index subcommand to ParseDb, which adds a numeric index field.

Version 0.2.0.beta-2015-05-05: May 05, 2015
-------------------------------------------------------------------------------

Prerelease for review.

changeo-0.4.6/PKG-INFO

Metadata-Version: 1.1
Name: changeo
Version: 0.4.6
Summary: A bioinformatics toolkit for processing high-throughput lymphocyte receptor sequencing data.
Home-page: http://changeo.readthedocs.io
Author: Namita Gupta, Jason Anthony Vander Heiden
Author-email: immcantation@googlegroups.com
License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Download-URL: https://bitbucket.org/kleinstein/changeo/downloads
Description: Change-O - Repertoire clonal assignment toolkit
        ================================================================================

        Change-O is a collection of tools for processing the output of V(D)J
        alignment tools, assigning clonal clusters to immunoglobulin (Ig)
        sequences, and reconstructing germline sequences.

        Dramatic improvements in high-throughput sequencing technologies now
        enable large-scale characterization of Ig repertoires, defined as the
        collection of trans-membrane antigen-receptor proteins located on the
        surface of B cells and T cells. Change-O is a suite of utilities to
        facilitate advanced analysis of Ig and TCR sequences following
        germline segment assignment.
Change-O handles output from IMGT/HighV-QUEST and IgBLAST, and provides a wide variety of clustering methods for assigning clonal groups to Ig sequences. Record sorting, grouping, and various database manipulation operations are also included. Keywords: bioinformatics,sequencing,immunology,adaptive immunity,immunoglobulin,AIRR-seq,Rep-Seq,B cell repertoire analysis,adaptive immune receptor repertoires Platform: UNKNOWN Classifier: Development Status :: 4 - Beta Classifier: Environment :: Console Classifier: Intended Audience :: Science/Research Classifier: Natural Language :: English Classifier: Operating System :: OS Independent Classifier: Programming Language :: Python :: 3.4 Classifier: Topic :: Scientific/Engineering :: Bio-Informatics changeo-0.4.6/README.rst0000755000076600000240000000166313402556426015400 0ustar vandej27staff00000000000000Change-O - Repertoire clonal assignment toolkit ================================================================================ Change-O is a collection of tools for processing the output of V(D)J alignment tools, assigning clonal clusters to immunoglobulin (Ig) sequences, and reconstructing germline sequences. Dramatic improvements in high-throughput sequencing technologies now enable large-scale characterization of Ig repertoires, defined as the collection of trans-membrane antigen-receptor proteins located on the surface of B cells and T cells. Change-O is a suite of utilities to facilitate advanced analysis of Ig and TCR sequences following germline segment assignment. Change-O handles output from IMGT/HighV-QUEST and IgBLAST, and provides a wide variety of clustering methods for assigning clonal groups to Ig sequences. Record sorting, grouping, and various database manipulation operations are also included. changeo-0.4.6/bin/0000755000076600000240000000000013514432110014433 5ustar vandej27staff00000000000000changeo-0.4.6/bin/AlignRecords.py0000755000076600000240000004404113444271360017401 0ustar vandej27staff00000000000000#!/usr/bin/env python3 """ Multiple aligns sequence fields """ # Info __author__ = 'Jason Anthony Vander Heiden' from changeo import __version__, __date__ # Imports import os import shutil from argparse import ArgumentParser from collections import OrderedDict from itertools import chain from textwrap import dedent from Bio.SeqRecord import SeqRecord # Presto and changeo import from presto.Defaults import default_out_args, default_muscle_exec from presto.Applications import runMuscle from presto.IO import printLog, printError, printWarning from presto.Multiprocessing import manageProcesses from changeo.Commandline import CommonHelpFormatter, checkArgs, getCommonArgParser, parseCommonArgs from changeo.IO import getDbFields, getFormatOperators from changeo.Multiprocessing import DbResult, feedDbQueue, processDbQueue, collectDbQueue # TODO: maybe not bothering with 'set' is best. can just work off field identity def groupRecords(records, fields=None, calls=['v', 'j'], mode='gene', action='first'): """ Groups Receptor objects based on gene or annotation Arguments: records : an iterator of Receptor objects to group. fields : gene field to group by. calls : allele calls to use for grouping. one or more of ('v', 'd', 'j'). mode : specificity of alignment call to use for allele call fields. one of ('allele', 'gene'). action : only 'first' is currently supported. 
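    Example:
        A hypothetical sketch of the grouping keys (record contents and gene
        names are illustrative, not fixtures from this package)::

            groups = groupRecords(records, calls=['v', 'j'], mode='gene')
            # e.g. {('IGHV1-2', 'IGHJ4'): [rec1, rec2], None: [rec_no_call]}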
Returns: dictionary of grouped records """ # Define functions for grouping keys if mode == 'allele' and fields is None: def _get_key(rec, calls, action): return tuple(rec.getAlleleCalls(calls, action)) elif mode == 'gene' and fields is None: def _get_key(rec, calls, action): return tuple(rec.getGeneCalls(calls, action)) elif mode == 'allele' and fields is not None: def _get_key(rec, calls, action): vdj = rec.getAlleleCalls(calls, action) ann = [rec.getChangeo(k) for k in fields] return tuple(chain(vdj, ann)) elif mode == 'gene' and fields is not None: def _get_key(rec, calls, action): vdj = rec.getGeneCalls(calls, action) ann = [rec.getChangeo(k) for k in fields] return tuple(chain(vdj, ann)) rec_index = {} for rec in records: key = _get_key(rec, calls, action) # Assigned grouped records to individual keys and all failed to a single key if all([k is not None for k in key]): rec_index.setdefault(key, []).append(rec) else: rec_index.setdefault(None, []).append(rec) return rec_index def alignBlocks(data, field_map, muscle_exec=default_muscle_exec): """ Multiple aligns blocks of sequence fields together Arguments: data : DbData object with Receptor objects to process. field_map : a dictionary of {input sequence : output sequence) field names to multiple align. muscle_exec : the MUSCLE executable. Returns: changeo.Multiprocessing.DbResult : object containing Receptor objects with multiple aligned sequence fields. """ # Define return object result = DbResult(data.id, data.data) result.results = data.data result.valid = True # Fail invalid groups if result.id is None: result.log = None result.valid = False return result seq_fields = list(field_map.keys()) seq_list = [SeqRecord(r.getSeq(f), id='%s_%s' % (r.sequence_id.replace(' ', '_'), f)) for f in seq_fields \ for r in data.data] seq_aln = runMuscle(seq_list, aligner_exec=muscle_exec) if seq_aln is not None: aln_map = {x.id: i for i, x in enumerate(seq_aln)} for i, r in enumerate(result.results, start=1): for f in seq_fields: idx = aln_map['%s_%s' % (r.sequence_id.replace(' ', '_'), f)] seq = str(seq_aln[idx].seq) r.annotations[field_map[f]] = seq result.log['%s-%s' % (f, r.sequence_id)] = seq else: result.valid = False #for r in result.results: print r.annotations return result def alignAcross(data, field_map, muscle_exec=default_muscle_exec): """ Multiple aligns sequence fields column wise Arguments: data : DbData object with Receptor objects to process. field_map : a dictionary of {input sequence : output sequence) field names to multiple align. muscle_exec : the MUSCLE executable. Returns: changeo.Multiprocessing.DbResult : object containing Receptor objects with multiple aligned sequence fields. 
""" # Define return object result = DbResult(data.id, data.data) result.results = data.data result.valid = True # Fail invalid groups if result.id is None: result.log = None result.valid = False return result seq_fields = list(field_map.keys()) for f in seq_fields: seq_list = [SeqRecord(r.getSeq(f), id=r.sequence_id.replace(' ', '_')) for r in data.data] seq_aln = runMuscle(seq_list, aligner_exec=muscle_exec) if seq_aln is not None: aln_map = {x.id: i for i, x in enumerate(seq_aln)} for i, r in enumerate(result.results, start=1): idx = aln_map[r.sequence_id.replace(' ', '_')] seq = str(seq_aln[idx].seq) r.annotations[field_map[f]] = seq result.log['%s-%s' % (f, r.sequence_id)] = seq else: result.valid = False #for r in result.results: print r.annotations return result def alignWithin(data, field_map, muscle_exec=default_muscle_exec): """ Multiple aligns sequence fields within a row Arguments: data : DbData object with Receptor objects to process. field_map : a dictionary of {input sequence : output sequence) field names to multiple align. muscle_exec : the MUSCLE executable. Returns: changeo.Multiprocessing.DbResult : object containing Receptor objects with multiple aligned sequence fields. """ # Define return object result = DbResult(data.id, data.data) result.results = data.data result.valid = True # Fail invalid groups if result.id is None: result.log = None result.valid = False return result record = data.data seq_fields = list(field_map.keys()) seq_list = [SeqRecord(record.getSeq(f), id=f) for f in seq_fields] seq_aln = runMuscle(seq_list, aligner_exec=muscle_exec) if seq_aln is not None: aln_map = {x.id: i for i, x in enumerate(seq_aln)} for f in seq_fields: idx = aln_map[f] seq = str(seq_aln[idx].seq) record.annotations[field_map[f]] = seq result.log[f] = seq else: result.valid = False return result def alignRecords(db_file, seq_fields, group_func, align_func, group_args={}, align_args={}, format='changeo', out_file=None, out_args=default_out_args, nproc=None, queue_size=None): """ Performs a multiple alignment on sets of sequences Arguments: db_file : filename of the input database. seq_fields : the sequence fields to multiple align. group_func : function to use to group records. align_func : function to use to multiple align sequence groups. group_args : dictionary of arguments to pass to group_func. align_args : dictionary of arguments to pass to align_func. format : output format. One of 'changeo' or 'airr'. out_file : output file name. Automatically generated from the input file if None. out_args : common output argument dictionary from parseCommonArgs. nproc : the number of processQueue processes. if None defaults to the number of CPUs. queue_size : maximum size of the argument queue. if None defaults to 2*nproc. Returns: dict : names of the 'pass' and 'fail' output files. 
""" # Define subcommand label dictionary cmd_dict = {alignAcross: 'across', alignWithin: 'within', alignBlocks: 'block'} # Print parameter info log = OrderedDict() log['START'] = 'AlignRecords' log['COMMAND'] = cmd_dict.get(align_func, align_func.__name__) log['FILE'] = os.path.basename(db_file) log['SEQ_FIELDS'] = ','.join(seq_fields) if 'group_fields' in group_args: log['GROUP_FIELDS'] = ','.join(group_args['group_fields']) if 'mode' in group_args: log['MODE'] = group_args['mode'] if 'action' in group_args: log['ACTION'] = group_args['action'] log['NPROC'] = nproc printLog(log) # Define format operators try: reader, writer, schema = getFormatOperators(format) except ValueError: printError('Invalid format %s.' % format) # Define feeder function and arguments if 'group_fields' in group_args and group_args['group_fields'] is not None: group_args['group_fields'] = [schema.toReceptor(f) for f in group_args['group_fields']] feed_func = feedDbQueue feed_args = {'db_file': db_file, 'reader': reader, 'group_func': group_func, 'group_args': group_args} # Define worker function and arguments field_map = OrderedDict([(schema.toReceptor(f), '%s_align' % f) for f in seq_fields]) align_args['field_map'] = field_map work_func = processDbQueue work_args = {'process_func': align_func, 'process_args': align_args} # Define collector function and arguments out_fields = getDbFields(db_file, add=list(field_map.values()), reader=reader) out_args['out_type'] = schema.out_type collect_func = collectDbQueue collect_args = {'db_file': db_file, 'label': 'align', 'fields': out_fields, 'writer': writer, 'out_file': out_file, 'out_args': out_args} # Call process manager result = manageProcesses(feed_func, work_func, collect_func, feed_args, work_args, collect_args, nproc, queue_size) # Print log result['log']['END'] = 'AlignRecords' printLog(result['log']) output = {k: v for k, v in result.items() if k in ('pass', 'fail')} return output def getArgParser(): """ Defines the ArgumentParser Arguments: None Returns: an ArgumentParser object """ # Define output file names and header fields fields = dedent( ''' output files: align-pass database with multiple aligned sequences. align-fail database with records failing alignment. required fields: SEQUENCE_ID, V_CALL, J_CALL user specified sequence fields to align. 
output fields: _ALIGN ''') # Define ArgumentParser parser = ArgumentParser(description=__doc__, epilog=fields, formatter_class=CommonHelpFormatter, add_help=False) group_help = parser.add_argument_group('help') group_help.add_argument('--version', action='version', version='%(prog)s:' + ' %s %s' %(__version__, __date__)) group_help.add_argument('-h', '--help', action='help', help='show this help message and exit') subparsers = parser.add_subparsers(title='subcommands', dest='command', metavar='', help='alignment method') # TODO: This is a temporary fix for Python issue 9253 subparsers.required = True # Parent parser parser_parent = getCommonArgParser(format=True, multiproc=True) # Argument parser for column-wise alignment across records parser_across = subparsers.add_parser('across', parents=[parser_parent], formatter_class=CommonHelpFormatter, add_help=False, help='''Multiple aligns sequence columns within groups and across rows using MUSCLE.''') group_across = parser_across.add_argument_group('alignment arguments') group_across.add_argument('--sf', nargs='+', action='store', dest='seq_fields', required=True, help='The sequence fields to multiple align within each group.') group_across.add_argument('--gf', nargs='+', action='store', dest='group_fields', default=None, help='Additional (not allele call) fields to use for grouping.') group_across.add_argument('--calls', nargs='+', action='store', dest='calls', choices=('v', 'd', 'j'), default=['v', 'j'], help='Segment calls (allele assignments) to use for grouping.') group_across.add_argument('--mode', action='store', dest='mode', choices=('allele', 'gene'), default='gene', help='''Specifies whether to use the V(D)J allele or gene when an allele call field (--calls) is specified.''') group_across.add_argument('--act', action='store', dest='action', default='first', choices=('first', ), help='''Specifies how to handle multiple values within default allele call fields. 
Currently, only "first" is supported.''') group_across.add_argument('--exec', action='store', dest='muscle_exec', default=default_muscle_exec, help='The location of the MUSCLE executable') parser_across.set_defaults(group_func=groupRecords, align_func=alignAcross) # Argument parser for alignment of fields within records parser_within = subparsers.add_parser('within', parents=[parser_parent], formatter_class=CommonHelpFormatter, add_help=False, help='Multiple aligns sequence fields within rows using MUSCLE') group_within = parser_within.add_argument_group('alignment arguments') group_within.add_argument('--sf', nargs='+', action='store', dest='seq_fields', required=True, help='The sequence fields to multiple align within each record.') group_within.add_argument('--exec', action='store', dest='muscle_exec', default=default_muscle_exec, help='The location of the MUSCLE executable') parser_within.set_defaults(group_func=None, align_func=alignWithin) # Argument parser for column-wise alignment across records parser_block = subparsers.add_parser('block', parents=[parser_parent], formatter_class=CommonHelpFormatter, add_help=False, help='''Multiple aligns sequence groups across both columns and rows using MUSCLE.''') group_block = parser_block.add_argument_group('alignment arguments') group_block.add_argument('--sf', nargs='+', action='store', dest='seq_fields', required=True, help='The sequence fields to multiple align within each group.') group_block.add_argument('--gf', nargs='+', action='store', dest='group_fields', default=None, help='Additional (not allele call) fields to use for grouping.') group_block.add_argument('--calls', nargs='+', action='store', dest='calls', choices=('v', 'd', 'j'), default=['v', 'j'], help='Segment calls (allele assignments) to use for grouping.') group_block.add_argument('--mode', action='store', dest='mode', choices=('allele', 'gene'), default='gene', help='''Specifies whether to use the V(D)J allele or gene when an allele call field (--calls) is specified.''') group_block.add_argument('--act', action='store', dest='action', default='first', choices=('first', ), help='''Specifies how to handle multiple values within default allele call fields. Currently, only "first" is supported.''') group_block.add_argument('--exec', action='store', dest='muscle_exec', default=default_muscle_exec, help='The location of the MUSCLE executable') parser_block.set_defaults(group_func=groupRecords, align_func=alignBlocks) return parser if __name__ == '__main__': """ Parses command line arguments and calls main function """ # Parse arguments parser = getArgParser() checkArgs(parser) args = parser.parse_args() args_dict = parseCommonArgs(args) # Check if a valid MUSCLE executable was specified for muscle mode if not shutil.which(args.muscle_exec): parser.error('%s does not exist or is not executable.' 
% args.muscle_exec) # Define align_args args_dict['align_args'] = {'muscle_exec': args_dict['muscle_exec']} del args_dict['muscle_exec'] # Define group_args if args_dict['group_func'] is groupRecords: args_dict['group_args'] = {'fields':args_dict['group_fields'], 'calls':args_dict['calls'], 'mode':args_dict['mode'], 'action':args_dict['action']} del args_dict['group_fields'] del args_dict['calls'] del args_dict['mode'] del args_dict['action'] # Clean arguments dictionary del args_dict['command'] del args_dict['db_files'] if 'out_files' in args_dict: del args_dict['out_files'] # Call main function for each input file for i, f in enumerate(args.__dict__['db_files']): args_dict['db_file'] = f args_dict['out_file'] = args.__dict__['out_files'][i] \ if args.__dict__['out_files'] else None alignRecords(**args_dict) changeo-0.4.6/bin/AssignGenes.py0000755000076600000240000002146313442756024017231 0ustar vandej27staff00000000000000#!/usr/bin/env python3 """ Assign V(D)J gene annotations """ # Info __author__ = 'Jason Anthony Vander Heiden' from changeo import __version__, __date__ # Imports import os import shutil from argparse import ArgumentParser from collections import OrderedDict from pkg_resources import parse_version from textwrap import dedent from time import time # Presto imports from presto.IO import printLog, printMessage, printError, printWarning from changeo.Defaults import default_igblast_exec, default_out_args from changeo.Applications import runIgBLAST, getIgBLASTVersion from changeo.Commandline import CommonHelpFormatter, checkArgs, getCommonArgParser, parseCommonArgs from changeo.IO import getOutputName # Defaults choices_format = ('blast', 'airr') choices_loci = ('ig', 'tr') choices_organism = ('human', 'mouse', 'rabbit', 'rat', 'rhesus_monkey') default_format = 'blast' default_loci = 'ig' default_organism = 'human' default_igdata = '~/share/igblast' def assignIgBLAST(seq_file, igdata=default_igdata, loci='ig', organism='human', vdb=None, ddb=None, jdb=None, format=default_format, igblast_exec=default_igblast_exec, out_file=None, out_args=default_out_args, nproc=None): """ Assigns V(D)J gene annotations to sequences using IgBLAST Arguments: seq_file (str): the sample sequence file name. igdata (str): path to the IgBLAST database directory (IGDATA environment). loci (str): receptor type; one of 'ig' or 'tr'. organism (str): species name. vdb (str): name of a custom V reference in the database folder to use. ddb (str): name of a custom D reference in the database folder to use. jdb (str): name of a custom J reference in the database folder to use. format (str): output format. One of 'blast' or 'airr'. igblast_exec (str): the path to the igblastn executable. out_file (str): output file name. Automatically generated from the input file if None. out_args (dict): common output argument dictionary from parseCommonArgs. nproc (int): the number of processQueue processes; if None defaults to the number of CPUs. Returns: str: the output file name """ # Check format argument try: out_type = {'blast': 'fmt7', 'airr': 'tsv'}[format] except KeyError: printError('Invalid output format %s.' % format) # Get IgBLAST version version = getIgBLASTVersion(exec=igblast_exec) if parse_version(version) < parse_version('1.6'): printError('IgBLAST version is %s and 1.6 or higher is required.' % version) if format == 'airr' and parse_version(version) < parse_version('1.9'): printError('IgBLAST version is %s and 1.9 or higher is required for AIRR format support.'
% version) # Print parameter info log = OrderedDict() log['START'] = 'AssignGenes' log['COMMAND'] = 'igblast' log['VERSION'] = version log['FILE'] = os.path.basename(seq_file) log['ORGANISM'] = organism log['LOCI'] = loci log['NPROC'] = nproc printLog(log) # Open output writer if out_file is None: out_file = getOutputName(seq_file, out_label='igblast', out_dir=out_args['out_dir'], out_name=out_args['out_name'], out_type=out_type) # Run IgBLAST start_time = time() printMessage('Running IgBLAST', start_time=start_time, width=25) console_out = runIgBLAST(seq_file, igdata, loci=loci, organism=organism, vdb=vdb, ddb=ddb, jdb=jdb, output=out_file, format=format, threads=nproc, exec=igblast_exec) printMessage('Done', start_time=start_time, end=True, width=25) # Print log log = OrderedDict() log['OUTPUT'] = os.path.basename(out_file) log['END'] = 'AssignGenes' printLog(log) return out_file def getArgParser(): """ Defines the ArgumentParser Arguments: None Returns: an ArgumentParser object """ # Define output file names and header fields fields = dedent( ''' output files: igblast Reference alignment results from IgBLAST. ''') # Define ArgumentParser parser = ArgumentParser(description=__doc__, epilog=fields, formatter_class=CommonHelpFormatter, add_help=False) group_help = parser.add_argument_group('help') group_help.add_argument('--version', action='version', version='%(prog)s:' + ' %s %s' %(__version__, __date__)) group_help.add_argument('-h', '--help', action='help', help='show this help message and exit') subparsers = parser.add_subparsers(title='subcommands', dest='command', metavar='', help='Assignment operation') # TODO: This is a temporary fix for Python issue 9253 subparsers.required = True # Parent parser parent_parser = getCommonArgParser(db_in=False, log=False, failed=False, format=False, multiproc=True) # Subparser to run IgBLAST parser_igblast = subparsers.add_parser('igblast', parents=[parent_parser], formatter_class=CommonHelpFormatter, add_help=False, help='Executes IgBLAST.', description='Executes IgBLAST.') group_igblast = parser_igblast.add_argument_group('alignment arguments') group_igblast.add_argument('-s', nargs='+', action='store', dest='seq_files', required=True, help='A list of FASTA files containing sequences to process.') group_igblast.add_argument('-b', action='store', dest='igdata', required=True, help='IgBLAST database directory (IGDATA).') group_igblast.add_argument('--organism', action='store', dest='organism', default=default_organism, choices=choices_organism, help='Organism name.') group_igblast.add_argument('--loci', action='store', dest='loci', default=default_loci, choices=choices_loci, help='The receptor type.') group_igblast.add_argument('--vdb', action='store', dest='vdb', default=None, help='''Name of the custom V reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt_<organism>_<loci>_v will be used.''') group_igblast.add_argument('--ddb', action='store', dest='ddb', default=None, help='''Name of the custom D reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt_<organism>_<loci>_d will be used.''') group_igblast.add_argument('--jdb', action='store', dest='jdb', default=None, help='''Name of the custom J reference in the IgBLAST database folder.
If not specified, then a default database name with the form imgt_<organism>_<loci>_j will be used.''') group_igblast.add_argument('--format', action='store', dest='format', default=default_format, choices=choices_format, help='''Specify the output format. Specifying "blast" will result in the IgBLAST "-outfmt 7 std qseq sseq btop" output format. Specifying "airr" will output the AIRR TSV format provided by the IgBLAST argument "-outfmt 19".''') group_igblast.add_argument('--exec', action='store', dest='igblast_exec', default=default_igblast_exec, help='Path to the igblastn executable.') parser_igblast.set_defaults(func=assignIgBLAST) return parser if __name__ == '__main__': """ Parses command line arguments and calls main function """ # Parse arguments parser = getArgParser() checkArgs(parser) args = parser.parse_args() args_dict = parseCommonArgs(args) # Check if a valid IgBLAST executable was specified if not shutil.which(args_dict['igblast_exec']): parser.error('%s executable not found' % args_dict['igblast_exec']) # Clean arguments dictionary del args_dict['seq_files'] if 'out_files' in args_dict: del args_dict['out_files'] del args_dict['func'] del args_dict['command'] # Call main function for each input file for i, f in enumerate(args.__dict__['seq_files']): args_dict['seq_file'] = f args_dict['out_file'] = args.__dict__['out_files'][i] \ if args.__dict__['out_files'] else None args.func(**args_dict) changeo-0.4.6/bin/BuildTrees.py0000755000076600000240000015032513513131311017056 0ustar vandej27staff00000000000000#!/usr/bin/env python3 """ Converts TSV files into IgPhyML input files """ # Info __author__ = "Kenneth Hoehn" from changeo import __version__, __date__ # Imports import os import random import subprocess import multiprocessing as mp from argparse import ArgumentParser from collections import OrderedDict from textwrap import dedent from time import time from Bio.Seq import Seq from functools import partial # Presto and changeo imports from presto.Defaults import default_out_args from presto.IO import printLog, printMessage, printWarning, printError, printDebug from changeo.Defaults import default_format from changeo.IO import splitName, getDbFields, getFormatOperators, getOutputHandle, getOutputName from changeo.Alignment import getRegions from changeo.Commandline import CommonHelpFormatter, checkArgs, getCommonArgParser, parseCommonArgs def correctMidCodonStart(scodons, qi, debug): """ Corrects the frame of input sequences that begin mid-codon Arguments: scodons (list): list of codons in IMGT sequence. qi (str) : input sequence. debug (bool) : print debugging statements. Returns: tuple: (modified input sequence, modified starting position of IMGT sequence in input sequence). """ spos = 0 for i in range(0, len(scodons)): printDebug("%s %s" % (scodons[i], qi[0:3]), debug) if scodons[i] != "...": if scodons[i][0:2] == "..": scodons[i] = "NN" + scodons[i][2] #sometimes IMGT will just cut off first letter if non-match, at which point we'll just want to mask the #first codon in the IMGT seq, other times it will be legitimately absent from the query, at which point #we have to shift the frame.
This attempts to correct for this by looking at the next codon over in the #alignment if scodons[i][2:3] != qi[2:3] or scodons[i + 1] != qi[3:6]: qi = "NN" + qi spos = i break elif scodons[i][0] == ".": scodons[i] = "N" + scodons[i][1:3] if scodons[i][1:3] != qi[1:3] or scodons[i+1] != qi[3:6]: qi = "N" + qi spos = i break else: spos = i break return qi, spos def checkFrameShifts(receptor, oqpos, ospos, log, debug): """ Checks whether a frameshift occured in a sequence Arguments: receptor (changeo.Receptor.Receptor): Receptor object. oqpos (int) : position of interest in input sequence. ospos (int) : position of interest in IMGT sequence. log (dict) : log of information for each sequence. debug (bool) : print debugging statements. """ frameshifts = 0 for ins in range(1, 3): ros = receptor.sequence_input ris = receptor.sequence_imgt psite = receptor.v_seq_start - 1 + oqpos*3 pisite = ospos * 3 if (psite + 3 + ins) < len(ros) and (pisite + 3) < len(ris): #cut out 1 or 2 nucleotides downstream of offending codon receptor.sequence_input = ros[0:(psite + 3)] + ros[(psite + 3 + ins):] receptor.sequence_imgt = ris[0:(pisite + 3)] + ris[(pisite + 3):] # Debug sequence modifications printDebug(ros, debug) printDebug(receptor.sequence_input, debug) printDebug(ris, debug) printDebug(receptor.sequence_imgt, debug) printDebug("RUNNING %d" % ins, debug) mout = maskSplitCodons(receptor, recursive=True) if mout[1]["PASS"]: #if debug: receptor.sequence_input = ros receptor.sequence_imgt = ris frameshifts += 1 printDebug("FRAMESHIFT of length %d!" % ins, debug) log["FAIL"] = "SINGLE FRAME-SHIFTING INSERTION" break else: receptor.sequence_input = ros receptor.sequence_imgt = ris return frameshifts def findAndMask(receptor, scodons, qcodons, spos, s_end, qpos, log, debug, recursive=False): """ Find and mask split codons Arguments: receptor (changeo.Receptor.Receptor): Receptor object. scodons (list): list of codons in IMGT sequence qcodons (list): list of codons in input sequence spos (int): starting position of IMGT sequence in input sequence s_end (int): end of IMGT sequence qpos (int): starting position of input sequence in IMGT sequence log (dict): log of information for each sequence debug (bool): print debugging statements? recursive (bool): was this function called recursively? """ frameshifts = 0 while spos < s_end and qpos < len(qcodons): if debug: print(scodons[spos] + "\t" + qcodons[qpos]) if scodons[spos] == "..." 
and qcodons[qpos] != "...": #if IMGT gap, move forward in imgt spos += 1 elif scodons[spos] == qcodons[qpos]: # if both are the same, move both forward spos += 1 qpos += 1 elif qcodons[qpos] == "N": # possible that SEQ-IMGT ends on a bunch of Ns qpos += 1 spos += 1 else: # if not the same, mask IMGT at that site and scan forward until you find a codon that matches next site if debug: print("checking %s at position %d %d" % (scodons[spos], spos, qpos)) ospos=spos oqpos=qpos spos += 1 qpos += 1 while spos < s_end and scodons[spos] == "...": #possible next codon is just a gap spos += 1 while qpos < len(qcodons) and spos < s_end and scodons[spos] != qcodons[qpos]: printDebug("Checking " + scodons[spos]+ "\t" + qcodons[qpos], debug) qpos += 1 if qcodons[qpos-1] == scodons[ospos]: #if codon in previous position is equal to original codon, it was preserved qpos -= 1 spos = ospos printDebug("But codon was apparently preserved", debug) if "IN-FRAME" in log: log["IN-FRAME"] = log["IN-FRAME"] + "," + str(spos) else: log["IN-FRAME"] = str(spos) elif qpos >= len(qcodons) and spos < s_end: printDebug("FAILING MATCH", debug) log["PASS"] = False #if no match for the adjacent codon was found, something"s up. log["FAIL"] = "FAILED_MATCH_QSTRING:"+str(spos) #figure out if this was due to a frame-shift by repeating this method but with an edited input sequence if not recursive: frameshifts += checkFrameShifts(receptor, oqpos, ospos, log, debug) elif spos >= s_end or qcodons[qpos] != scodons[spos]: scodons[ospos] = "NNN" if spos >= s_end: printDebug("Masked %s at position %d, at end of subject sequence" % (scodons[ospos], ospos), debug) if "END-MASKED" in log: log["END-MASKED"] = log["END-MASKED"] + "," + str(spos) else: log["END-MASKED"] = str(spos) else: printDebug("Masked %s at position %d, but couldn't find upstream match" % (scodons[ospos], ospos), debug) log["PASS"]=False log["FAIL"]="FAILED_MATCH:"+str(spos) elif qcodons[qpos] == scodons[spos]: printDebug("Masked %s at position %d" % (scodons[ospos], ospos), debug) scodons[ospos] = "NNN" if "MASKED" in log: log["MASKED"] = log["MASKED"] + "," + str(spos) else: log["MASKED"] = str(spos) else: log["PASS"] = False log["FAIL"] = "UNKNOWN" def maskSplitCodons(receptor, recursive=False, mask=True): """ Identify junction region by IMGT definition. Arguments: receptor (changeo.Receptor.Receptor): Receptor object. recursive (bool) : was this method part of a recursive call? mask (bool) : mask split codons for use with igphyml? Returns: str: modified IMGT gapped sequence. log: dict of sequence information """ debug = False qi = receptor.sequence_input si = receptor.sequence_imgt log = OrderedDict() log["ID"]=receptor.sequence_id log["CLONE"]=receptor.clone log["PASS"] = True if debug: print(receptor.sequence_id) # adjust starting position of query sequence qi = qi[(receptor.v_seq_start - 1):] #tally where --- gaps are in IMGT sequence and remove them for now gaps = [] nsi = "" for i in range(0,len(si)): if si[i] == "-": gaps.append(1) else: gaps.append(0) nsi = nsi + si[i] #find any gaps not divisible by three curgap = 0 for i in gaps: if i == 1: curgap += 1 elif i == 0 and curgap != 0: if curgap % 3 != 0 : printDebug("Frame-shifting gap detected! 
Refusing to include sequence.", debug) log["PASS"] = False log["FAIL"] = "FRAME-SHIFTING DELETION" log["SEQ_IN"] = receptor.sequence_input log["SEQ_IMGT"] = receptor.sequence_imgt log["SEQ_MASKED"] = receptor.sequence_imgt return receptor.sequence_imgt, log else: curgap = 0 si = nsi scodons = [si[i:i + 3] for i in range(0, len(si), 3)] # deal with the fact that it's possible to start mid-codon qi,spos = correctMidCodonStart(scodons, qi, debug) qcodons = [qi[i:i + 3] for i in range(0, len(qi), 3)] frameshifts = 0 s_end = 0 #adjust for the fact that IMGT sequences can end on gaps for i in range(spos, len(scodons)): if scodons[i] != "..." and len(scodons[i]) == 3 and scodons[i] != "NNN": s_end = i printDebug("%i:%i:%s" % (s_end, len(scodons), scodons[s_end]), debug) s_end += 1 qpos = 0 if mask: findAndMask(receptor, scodons, qcodons, spos, s_end, qpos, log, debug, recursive) if not log["PASS"] and not recursive: log["FRAMESHIFTS"] = frameshifts if len(scodons[-1]) != 3: if scodons[-1] == ".." or scodons[-1] == ".": scodons[-1] = "..." else: scodons[-1] = "NNN" if "END-MASKED" in log: log["END-MASKED"] = log["END-MASKED"] + "," + str(len(scodons)) else: log["END-MASKED"] = str(spos) concatenated_seq = Seq("") for i in scodons: concatenated_seq += i # add --- gaps back to IMGT sequence ncon_seq = "" counter = 0 for i in gaps: #print(str(i) + ":" + ncon_seq) if i == 1: ncon_seq = ncon_seq + "." elif i == 0: ncon_seq = ncon_seq + concatenated_seq[counter] counter += 1 ncon_seq = ncon_seq + concatenated_seq[counter:] concatenated_seq = ncon_seq log["SEQ_IN"] = receptor.sequence_input log["SEQ_IMGT"] = receptor.sequence_imgt log["SEQ_MASKED"] = concatenated_seq return concatenated_seq, log def unAmbigDist(seq1, seq2, fbreak=False): """ Calculate the distance between two sequences counting only A,T,C,Gs Arguments: seq1 (str): sequence 1 seq2 (str): sequence 2 fbreak (bool): break after first difference found? Returns: int: number of ACGT differences. """ if len(seq1) != len(seq2): printError("Sequences are not the same length! %s %s" % (seq1, seq2)) dist = 0 for i in range(0,len(seq1)): if seq1[i] != "N" and seq1[i] != "-" and seq1[i] != ".": if seq2[i] != "N" and seq2[i] != "-" and seq2[i] != ".": if seq1[i] != seq2[i]: dist += 1 if fbreak: break return dist def deduplicate(useqs, receptors, log=None, meta_data=None, delim=":"): """ Collapses identical sequences Argument: useqs (dict): unique sequences within a clone. maps sequence to index in Receptor list. receptors (dict): receptors within a clone (index is value in useqs dict). log (collections.OrderedDict): log of sequence errors. meta_data (str): Field to append to sequence IDs. Splits identical sequences with different meta_data. meta_data (str): Field to append to sequence IDs. Splits identical sequences with different meta_data. delim (str): delimited to use when appending meta_data. Returns: list: deduplicated receptors within a clone. 
""" keys = list(useqs.keys()) join = {} # id -> sequence id to join with (least ambiguous chars) joinseqs = {} # id -> useq to join with (least ambiguous chars) ambigchar = {} #sequence id -> number ATCG nucleotides for i in range(0,len(keys)-1): for j in range(i+1,len(keys)): ki = keys[i] kj = keys[j] if meta_data is None: ski = keys[i] skj = keys[j] else: ski, cid = keys[i].split(delim) skj, cid = keys[j].split(delim) ri = receptors[useqs[ki]] rj = receptors[useqs[kj]] dist = unAmbigDist(ski, skj, True) m_match = True if meta_data is not None: matches = 0 for m in meta_data: if ri.getField(m) == rj.getField(m) and m != "DUPCOUNT": matches += 1 m_match = (matches == len(meta_data)) if dist == 0 and m_match: ncounti = ki.count("A") + ki.count("T") + ki.count("G") + ki.count("C") ncountj = kj.count("A") + kj.count("T") + kj.count("G") + kj.count("C") ambigchar[useqs[ki]] = ncounti ambigchar[useqs[kj]] = ncountj # this algorithm depends on the fact that all sequences are compared pairwise, and all are zero # distance from the sequence they will be collapse to. if ncountj > ncounti: nci = 0 if useqs[ki] in join: nci = ambigchar[join[useqs[ki]]] if nci < ncountj: join[useqs[ki]] = useqs[kj] joinseqs[ki] = kj else: ncj = 0 if useqs[kj] in join: ncj = ambigchar[join[useqs[kj]]] if ncj < ncounti: join[useqs[kj]] = useqs[ki] joinseqs[kj] = ki # loop through list of joined sequences and collapse keys = list(useqs.keys()) for k in keys: if useqs[k] in join: rfrom = receptors[useqs[k]] rto = receptors[join[useqs[k]]] rto.dupcount += rfrom.dupcount if log is not None: log[rfrom.sequence_id]["PASS"] = False log[rfrom.sequence_id]["DUPLICATE"] = True log[rfrom.sequence_id]["COLLAPSETO"] = joinseqs[k] log[rfrom.sequence_id]["COLLAPSEFROM"] = k log[rfrom.sequence_id]["FAIL"] = "Collapsed with " + rto.sequence_id del useqs[k] return useqs def hasPTC(sequence): """ Determines whether a PTC exits in a sequence Arguments: sequence (str): IMGT gapped sequence in frame 1. Returns: int: negative if not PTCs, position of PTC if found. """ ptcs = ("TAA", "TGA", "TAG", "TRA", "TRG", "TAR", "TGR", "TRR") for i in range(0, len(sequence), 3): if sequence[i:(i+3)] in ptcs: return i return -1 def rmCDR3(sequences, clones): """ Remove CDR3 from all sequences and germline of a clone Arguments: sequences (list): list of sequences in clones. clones (list): list of Receptor objects. """ for i in range(0,len(sequences)): imgtar = clones[i].getField("imgtpartlabels") germline = clones[i].getField("germline_imgt_d_mask") nseq = [] nimgtar = [] ngermline = [] ncdr3 = 0 #print("imgtarlen: " + str(len(imgtar))) #print("seqlen: " + str(len(sequences[i]))) #print("germline: " + str(len(germline))) #if len(germline) < len(sequences[i]): # print("\n" + str(clones[i].sequence_id)) # print("\n " + str((sequences[i])) ) # print("\n" + str((germline))) for j in range(0,len(imgtar)): if imgtar[j] != 108: nseq.append(sequences[i][j]) if j < len(germline): ngermline.append(germline[j]) nimgtar.append(imgtar[j]) else: ncdr3 += 1 clones[i].setField("imgtpartlabels",nimgtar) clones[i].setField("germline_imgt_d_mask", "".join(ngermline)) sequences[i] = "".join(nseq) #print("Length: " + str(ncdr3)) def characterizePartitionErrors(sequences, clones, meta_data): """ Characterize potential mismatches between IMGT labels within a clone Arguments: sequences (list): list of sequences in clones. clones (list): list of Receptor objects. meta_data (str): Field to append to sequence IDs. Splits identical sequences with different meta_data. 
Returns: tuple: tuple of length four containing a list of IMGT positions for first sequence in clones, the germline sequence of the first receptor in clones, the length of the first sequence in clones, and the number of sequences in clones. """ sites = len(sequences[0]) nseqs = len(sequences) imgtar = clones[0].getField("imgtpartlabels") germline = clones[0].getField("germline_imgt_d_mask") correctseqs = False for seqi in range(0, len(sequences)): i = sequences[seqi] if len(i) != sites or len(clones[seqi].getField("imgtpartlabels")) != len(imgtar): correctseqs = True if correctseqs: maxlen = sites maximgt = len(imgtar) for j in range(0,len(sequences)): if len(sequences[j]) > maxlen: maxlen = len(sequences[j]) if len(clones[j].getField("imgtpartlabels")) > maximgt: imgtar = clones[j].getField("imgtpartlabels") maximgt = len(imgtar) sites = maxlen for j in range(0,len(sequences)): cimgt = clones[j].getField("imgtpartlabels") seqdiff = maxlen - len(sequences[j]) imgtdiff = len(imgtar)-len(cimgt) sequences[j] = sequences[j] + "N"*(seqdiff) last = cimgt[-1] cimgt.extend([last]*(imgtdiff)) clones[j].setField("imgtpartlabels",cimgt) if meta_data is not None: meta_data_ar = meta_data[0].split(",") for c in clones: if meta_data is not None: c.setField(meta_data[0],c.getField(meta_data_ar[0])) for m in range(1,len(meta_data_ar)): st = c.getField(meta_data[0])+":"+c.getField(meta_data_ar[m]) c.setField(meta_data[0],st) if len(c.getField("imgtpartlabels")) != len(imgtar): printError("IMGT assignments are not the same within clone %d!\n" % c.clone,False) printError(c.getField("imgtpartlabels"),False) printError("%s\n%d\n" % (imgtar,j),False) for j in range(0, len(sequences)): printError("%s\n%s\n" % (sequences[j],clones[j].getField("imgtpartlabels")),False) printError("ChangeO file needs to be corrected") for j in range(0,len(imgtar)): if c.getField("imgtpartlabels")[j] != imgtar[j]: printError("IMGT assignments are not the same within clone %d!\n" % c.clone, False) printError(c.getField("imgtpartlabels"), False) printError("%s\n%d\n" % (imgtar, j)) #Resolve germline if there are differences, e.g. if reconstruction was done before clonal clustering resolveglines = False for c in clones: if c.getField("germline_imgt_d_mask") != germline: resolveglines = True if resolveglines: printError("%s %s" % ("Predicted germlines are not the same among sequences in the same clone.", "Be sure to cluster sequences into clones first and then predict germlines using --cloned")) if sites > (len(germline)): seqdiff = sites - len(germline) germline = germline + "N" * (seqdiff) if sites % 3 != 0: printError("number of sites must be divisible by 3! len: %d, clone: %s , id: %s, seq: %s" %(len(sequences[0]),\ clones[0].clone,clones[0].sequence_id,sequences[0])) return imgtar, germline, sites, nseqs def outputSeqPartFiles(out_dir, useqs_f, meta_data, clones, collapse, nseqs, delim, newgerm, conseqs, duplicate, imgt): """ Create intermediate sequence alignment and partition files for IgPhyML output Arguments: out_dir (str): directory for sequence files. useqs_f (dict): unique sequences mapped to ids. meta_data (str): Field to append to sequence IDs. Splits identical sequences with different meta_data. clones (list) : list of receptor objects. collapse (bool) : deduplicate sequences. nseqs (int): number of sequences. delim (str) : delimiter for extracting metadata from ID. newgerm (str) : modified germline of clonal lineage. conseqs (list) : consensus sequences. duplicate (bool) : duplicate sequence if only one in a clone. 
imgt (list) : IMGT numbering of clonal positions . """ # bootstrap these data if desired lg = len(newgerm) sites = range(0, lg) transtable = clones[0].sequence_id.maketrans(" ", "_") outfile = os.path.join(out_dir, "%s.fasta" % clones[0].clone) with open(outfile, "w") as clonef: if collapse: for seq_f, num in useqs_f.items(): seq = seq_f cid = "" if meta_data is not None: seq, cid = seq_f.split(delim) cid = delim + cid.replace(":", "_") sid = clones[num].sequence_id.translate(transtable) + cid clonef.write(">%s\n%s\n" % (sid.replace(":","-"), seq.replace(".", "-"))) if len(useqs_f) == 1 and duplicate: if meta_data is not None: if meta_data[0] == "DUPCOUNT": cid = delim + "0" sid = clones[num].sequence_id.translate(transtable) + "_1" + cid clonef.write(">%s\n%s\n" % (sid.replace(":","-"), seq.replace(".", "-"))) else: for j in range(0, nseqs): cid = "" if meta_data is not None: meta_data_list = [] for m in meta_data: meta_data_list.append(clones[j].getField(m).replace(":", "_")) cid = delim + str(delim.join(meta_data_list)) sid = clones[j].sequence_id.translate(transtable) + cid clonef.write(">%s\n%s\n" % (sid.replace(":","-"), conseqs[j].replace(".", "-"))) if nseqs == 1 and duplicate: if meta_data is not None: if meta_data[0] == "DUPCOUNT": cid = delim + "0" sid = clones[j].sequence_id.translate(transtable)+"_1" + cid clonef.write(">%s\n%s\n" % (sid.replace(":","-"), conseqs[j].replace(".", "-"))) germ_id = ["GERM"] if meta_data is not None: for i in range(1,len(meta_data)): germ_id.append("GERM") clonef.write(">%s_%s\n" % (clones[0].clone,"_".join(germ_id))) for i in range(0, len(newgerm)): clonef.write("%s" % newgerm[i].replace(".","-")) clonef.write("\n") #output partition file partfile = os.path.join(out_dir, "%s.part.txt" % clones[0].clone) with open(partfile, "w") as partf: partf.write("%d %d\n" % (2, len(newgerm))) partf.write("FWR:IMGT\n") partf.write("CDR:IMGT\n") partf.write("%s\n" % (clones[0].v_call.split("*")[0])) partf.write("%s\n" % (clones[0].j_call.split("*")[0])) partf.write(",".join(map(str, imgt))) partf.write("\n") def outputIgPhyML(clones, sequences, meta_data=None, collapse=False, ncdr3=False, logs=None, fail_writer=None, out_dir=None, min_seq=1): """ Create intermediate sequence alignment and partition files for IgPhyML output Arguments: clones (list): receptor objects within the same clone. sequences (list): sequences within the same clone (share indexes with clones parameter). meta_data (str): Field to append to sequence IDs. Splits identical sequences with different meta_data collapse (bool): if True collapse identical sequences. ncdr3 (bool): if True remove CDR3 logs (dict): contains log information for each sequence out_dir (str): directory for output files. fail_writer (changeo.IO.TSVWriter): failed sequences writer object. min_seq (int): minimum number of data sequences to include. Returns: int: number of clones. """ s = "" delim = "_" duplicate = True # duplicate sequences in clones with only 1 sequence? 
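# characterizePartitionErrors pads sequences and IMGT partition labels to a common length within
# the clone, verifies that all members share the same germline and label assignments, and returns
# the clone-wide IMGT labels, germline, alignment length (sites), and sequence count used below.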
imgtar, germline, sites, nseqs = characterizePartitionErrors(sequences, clones, meta_data) tallies = [] for i in range(0, sites, 3): tally = 0 for j in range(0, nseqs): if sequences[j][i:(i + 3)] != "...": tally += 1 tallies.append(tally) newseqs = [] # remove gap only sites from observed data newgerm = [] imgt = [] for j in range(0, nseqs): for i in range(0, sites, 3): if i == 0: newseqs.append([]) if tallies[i//3] > 0: newseqs[j].append(sequences[j][i:(i+3)]) lcodon = "" for i in range(0, sites, 3): if tallies[i//3] > 0: newgerm.append(germline[i:(i+3)]) lcodon=germline[i:(i+3)] imgt.append(imgtar[i]) if len(lcodon) == 2: newgerm[-1] = newgerm[-1] + "N" elif len(lcodon) == 1: newgerm[-1] = newgerm[-1] + "NN" if ncdr3: ngerm = [] nimgt = [] for i in range(0, len(newseqs)): nseq = [] ncdr3 = 0 for j in range(0, len(imgt)): if imgt[j] != 108: nseq.append(newseqs[i][j]) if i == 0: ngerm.append(newgerm[j]) nimgt.append(imgt[j]) else: ncdr3 += 1 newseqs[i] = nseq newgerm = ngerm imgt = nimgt #print("Length: " + str(ncdr3)) useqs_f = OrderedDict() conseqs = [] for j in range(0, nseqs): conseq = "".join([str(seq_rec) for seq_rec in newseqs[j]]) if meta_data is not None: meta_data_list = [] for m in range(0,len(meta_data)): if isinstance(clones[j].getField(meta_data[m]), str): clones[j].setField(meta_data[m],clones[j].getField(meta_data[m]).replace("_", "")) meta_data_list.append(str(clones[j].getField(meta_data[m]))) conseq_f = "".join([str(seq_rec) for seq_rec in newseqs[j]])+delim+":".join(meta_data_list) else: conseq_f = conseq if conseq_f in useqs_f and collapse: clones[useqs_f[conseq_f]].dupcount += clones[j].dupcount logs[clones[j].sequence_id]["PASS"] = False logs[clones[j].sequence_id]["FAIL"] = "Duplication of " + clones[useqs_f[conseq_f]].sequence_id logs[clones[j].sequence_id]["DUPLICATE"]=True if fail_writer is not None: fail_writer.writeReceptor(clones[j]) else: useqs_f[conseq_f] = j conseqs.append(conseq) if collapse: useqs_f = deduplicate(useqs_f, clones, logs, meta_data, delim) if collapse and len(useqs_f) < min_seq: for seq_f, num in useqs_f.items(): logs[clones[num].sequence_id]["FAIL"] = "Clone too small: " + str(len(useqs_f)) logs[clones[num].sequence_id]["PASS"] = False return -len(useqs_f) elif not collapse and len(conseqs) < min_seq: for j in range(0, nseqs): logs[clones[j].sequence_id]["FAIL"] = "Clone too small: " + str(len(conseqs)) logs[clones[j].sequence_id]["PASS"] = False return -len(conseqs) # Output fasta file of masked, concatenated sequences outputSeqPartFiles(out_dir, useqs_f, meta_data, clones, collapse, nseqs, delim, newgerm, conseqs, duplicate, imgt) if collapse: return len(useqs_f) else: return nseqs def maskCodonsLoop(r, clones, cloneseqs, logs, fails, out_args, fail_writer): """ Masks codons split by alignment to IMGT reference Arguments: r (changeo.Receptor.Receptor): receptor object for a particular sequence. clones (list): list of receptors. cloneseqs (list): list of masked clone sequences. logs (dict): contains log information for each sequence. fails (dict): counts of various sequence processing failures. out_args (dict): arguments for output preferences. fail_writer (changeo.IO.TSVWriter): failed sequences writer object. Returns: 0: returns 0 if an error occurs or masking fails. 
1: returns 1 if masking succeeds. """ if r.clone is None: printError("Cannot export datasets until sequences are clustered into clones.") if r.dupcount is None: r.dupcount = 1 fails["rec_count"] += 1 fails["totalreads"] += 1 #printProgress(rec_count, rec_count, 0.05, start_time) ptcs = hasPTC(r.sequence_imgt) gptcs = hasPTC(r.getField("germline_imgt_d_mask")) if gptcs >= 0: log = OrderedDict() log["ID"] = r.sequence_id log["CLONE"] = r.clone log["SEQ_IN"] = r.sequence_input log["SEQ_IMGT"] = r.sequence_imgt logs[r.sequence_id] = log logs[r.sequence_id]["PASS"] = False logs[r.sequence_id]["FAIL"] = "Germline PTC" fails["seq_fail"] += 1 fails["germlineptc"] += 1 return 0 if r.functional and ptcs < 0: #If IMGT regions are provided, record their positions regions = getRegions(r.sequence_imgt, r.junction_length) #print(regions["cdr1_imgt"]+regions["fwr4_imgt"]) if regions["cdr3_imgt"] != "" and regions["cdr3_imgt"] is not None: simgt = regions["fwr1_imgt"] + regions["cdr1_imgt"] + regions["fwr2_imgt"] + regions["cdr2_imgt"] + \ regions["fwr3_imgt"] + regions["cdr3_imgt"] + regions["fwr4_imgt"] if len(simgt) < len(r.sequence_imgt): r.fwr4_imgt = r.fwr4_imgt + ("."*(len(r.sequence_imgt) - len(simgt))) simgt = regions["fwr1_imgt"] + regions["cdr1_imgt"] + regions["fwr2_imgt"] + \ regions["cdr2_imgt"] + regions["fwr3_imgt"] + regions["cdr3_imgt"] + regions["fwr4_imgt"] imgtpartlabels = [13]*len(regions["fwr1_imgt"]) + [30]*len(regions["cdr1_imgt"]) + [45]*len(regions["fwr2_imgt"]) + \ [60]*len(regions["cdr2_imgt"]) + [80]*len(regions["fwr3_imgt"]) + [108] * len(regions["cdr3_imgt"]) + \ [120] * len(regions["fwr4_imgt"]) r.setField("imgtpartlabels", imgtpartlabels) if len(r.getField("imgtpartlabels")) != len(r.sequence_imgt) or simgt != r.sequence_imgt: log = OrderedDict() log["ID"] = r.sequence_id log["CLONE"] = r.clone log["SEQ_IN"] = r.sequence_input log["SEQ_IMGT"] = r.sequence_imgt logs[r.sequence_id] = log logs[r.sequence_id]["PASS"] = False logs[r.sequence_id]["FAIL"] = "FWR/CDR error" logs[r.sequence_id]["FWRCDRSEQ"] = simgt fails["seq_fail"] += 1 fails["region_fail"] += 1 return 0 else: #imgt_warn = "\n! IMGT FWR/CDR sequence columns not detected.\n! Cannot run CDR/FWR partitioned model on this data.\n" imgtpartlabels = [0] * len(r.sequence_imgt) r.setField("imgtpartlabels", imgtpartlabels) mout = maskSplitCodons(r) mask_seq = mout[0] ptcs = hasPTC(mask_seq) if ptcs >= 0: printWarning("Masked sequence suddenly has a PTC..
%s\n" % r.sequence_id) mout[1]["PASS"] = False mout[1]["FAIL"] = "PTC_ADDED_FROM_MASKING" logs[mout[1]["ID"]] = mout[1] if mout[1]["PASS"]: #passreads += r.dupcount if r.clone in clones: clones[r.clone].append(r) cloneseqs[r.clone].append(mask_seq) else: clones[r.clone] = [r] cloneseqs[r.clone] = [mask_seq] return 1 else: if out_args["failed"]: fail_writer.writeReceptor(r) fails["seq_fail"] += 1 fails["failreads"] += r.dupcount if mout[1]["FAIL"] == "FRAME-SHIFTING DELETION": fails["del_fail"] += 1 elif mout[1]["FAIL"] == "SINGLE FRAME-SHIFTING INSERTION": fails["in_fail"] += 1 else: fails["other_fail"] += 1 else: log = OrderedDict() log["ID"] = r.sequence_id log["CLONE"] = r.clone log["PASS"] = False log["FAIL"] = "NONFUNCTIONAL/PTC" log["SEQ_IN"] = r.sequence_input logs[r.sequence_id] = log if out_args["failed"]: fail_writer.writeReceptor(r) fails["seq_fail"] += 1 fails["nf_fail"] += 1 return 0 # Run IgPhyML on outputed data def runIgPhyML(outfile, igphyml_out, clone_dir, nproc=1, optimization="lr", omega="e,e", kappa="e", motifs="FCH", hotness="e,e,e,e,e,e",oformat="tab", nohlp=False, clean="none"): """ Run IgPhyML on outputted data Arguments: outfile (str): Output file name. igphymlout (str): igphyml output file nproc (int): Number of threads to parallelize IgPhyML across optimization (str): Optimize combination of topology (t) branch lengths (l) and parameters (r) in IgPhyML. omega (str): omega optimization in IgPhyML (--omega) kappa (str): kappa optimization in IgPhyML (-t) motifs (str): motifs to use in IgPhyML (--motifs) hotness (str): motif in IgPhyML (--hotness) oformat (str): output format for IgPhyML (tab or txt) nohlp (bool): If True, only estimate GY94 trees and parameters clean (str): delete intermediate files? (none, all) """ osplit = outfile.split(".") outrep = ".".join(osplit[0:(len(osplit)-1)]) + "_gy.tsv" gyout = outfile + "_igphyml_stats_gy.txt" gy_args = ["igphyml", "--repfile", outfile, "-m", "GY", "--run_id", "gy", "--outrep", outrep, "--threads", str(nproc),"--outname",gyout] hlp_args = ["igphyml","--repfile", outrep, "-m", "HLP", "--run_id", "hlp", "--threads", str(nproc), "-o", optimization, "--omega", omega, "-t", kappa, "--motifs", motifs, "--hotness", hotness, "--oformat", oformat, "--outname", igphyml_out] log = OrderedDict() log["START"] = "IgPhyML GY94 tree estimation" printLog(log) try: #check for igphyml executable subprocess.check_output(["igphyml"]) except: printError("igphyml not found :-/") try: #get GY94 starting topologies p = subprocess.check_output(gy_args) except subprocess.CalledProcessError as e: print(" ".join(gy_args)) print('error>', e.output, '<') printError("GY94 tree building in IgPhyML failed") log = OrderedDict() log["START"] = "IgPhyML HLP analysis" log["OPTIMIZE"] = optimization log["TS/TV"] = kappa log["wFWR,wCDR"] = omega log["MOTIFS"] = motifs log["HOTNESS"] = hotness log["NPROC"] = nproc printLog(log) if not nohlp: try: #estimate HLP parameters/trees p = subprocess.check_output(hlp_args) except subprocess.CalledProcessError as e: print(" ".join(hlp_args)) print('error>', e.output, '<') printError("HLP tree building failed") log = OrderedDict() log["OUTPUT"] = igphyml_out if oformat == "tab": igf = open(igphyml_out) names = igf.readline().split("\t") vals = igf.readline().split("\t") for i in range(3,len(names)-1): log[names[i]] = round(float(vals[i]),2) printLog(log) if clean != "none": log = OrderedDict() log["START"] = "CLEANING" log["SCOPE"] = clean printLog(log) todelete = open(outrep) for line in todelete: line = 
line.rstrip("\n") line = line.rstrip("\r") lsplit = line.split("\t") if len(lsplit) == 4: os.remove(lsplit[0]) os.remove(lsplit[1]) os.remove(lsplit[3]) todelete.close() os.remove(outrep) os.remove(outfile) os.remove(gyout) cilog = outrep + "_igphyml_CIlog.txt_hlp" if os.path.isfile(cilog): os.remove(cilog) if oformat == "tab": os.rmdir(clone_dir) else: printWarning("Using --clean all with --oformat txt will delete all tree file results.\n" "You'll have to do that yourself.") log = OrderedDict() log["END"] = "IgPhyML analysis" printLog(log) # Note: Collapse can give misleading dupcount information if some sequences have ambiguous characters at polymorphic sites def buildTrees(db_file, meta_data=None, target_clones=None, collapse=False, ncdr3=False, sample_depth=-1, min_seq=1,append=None, igphyml=False, nproc=1, optimization="lr", omega="e,e", kappa="e", motifs="FCH", hotness="e,e,e,e,e,e", oformat="tab", clean="none", nohlp=False, format=default_format, out_args=default_out_args): """ Masks codons split by alignment to IMGT reference, then produces input files for IgPhyML Arguments: db_file (str): input tab-delimited database file. meta_data (str): Field to append to sequence IDs. Splits identical sequences with different meta_data target_clones (str): List of clone IDs to analyze. collapse (bool): if True collapse identical sequences. ncdr3 (bool): if True remove all CDR3s. sample_depth (int): depth of subsampling before deduplication min_seq (int): minimum number of sequences per clone append (str): column name to append to sequence_id igphyml (bool): If True, run IgPhyML on outputted data nproc (int) : Number of threads to parallelize IgPhyML across optimization (str): Optimize combination of topology (t) branch lengths (l) and parameters (r) in IgPhyML. omega (str): omega optimization in IgPhyML (--omega) kappa (str): kappa optimization in IgPhyML (-t) motifs (str): motifs to use in IgPhyML (--motifs) hotness (str): motif in IgPhyML (--hotness) oformat (str): output format for IgPhyML (tab or txt) clean (str): delete intermediate files? (none, all) nohlp (bool): If True, only estimate GY94 trees and parameters format (str): input and output format. out_args (dict): arguments for output preferences. Returns: dict: dictionary of output pass and fail files. """ # Print parameter info log = OrderedDict() log["START"] = "BuildTrees" log["FILE"] = os.path.basename(db_file) log["COLLAPSE"] = collapse printLog(log) # Open output files out_label = "lineages" pass_handle = getOutputHandle(db_file, out_label=out_label, out_dir=out_args["out_dir"], out_name= out_args["out_name"], out_type="tsv") igphyml_out = None if igphyml: igphyml_out = getOutputName(db_file, out_label="igphyml-pass", out_dir=out_args["out_dir"], out_name=out_args["out_name"], out_type=oformat) dir_name, __ = os.path.split(pass_handle.name) if out_args["out_name"] is None: __, clone_name, __ = splitName(db_file) else: clone_name = out_args["out_name"] if dir_name is None: clone_dir = clone_name else: clone_dir = os.path.join(dir_name, clone_name) if not os.path.exists(clone_dir): os.makedirs(clone_dir) # Format options try: reader, writer, __ = getFormatOperators(format) except ValueError: printError("Invalid format %s." 
% format) out_fields = getDbFields(db_file, reader=reader) # open input file handle = open(db_file, "r") records = reader(handle) fail_handle, fail_writer = None, None if out_args["failed"]: fail_handle = getOutputHandle(db_file, out_label="lineages-fail", out_dir=out_args["out_dir"], out_name=out_args["out_name"], out_type=out_args["out_type"]) fail_writer = writer(fail_handle, fields=out_fields) cloneseqs = {} clones = {} logs = OrderedDict() fails = {"rec_count":0, "seq_fail":0, "nf_fail":0, "del_fail":0, "in_fail":0, "minseq_fail":0, "other_fail":0, "region_fail":0, "germlineptc":0, "fdcount":0, "totalreads":0, "passreads":0, "failreads":0} # Mask codons split by indels start_time = time() printMessage("Correcting frames and indels of sequences", start_time=start_time, width=50) #subsampling loop init_clone_sizes = {} big_enough = [] all_records = [] found_no_funct = False for r in records: if r.functional is None: r.functional = True if found_no_funct is False: printWarning("FUNCTIONAL column not found.") found_no_funct = True all_records.append(r) if r.clone in init_clone_sizes: init_clone_sizes[r.clone] += 1 else: init_clone_sizes[r.clone] = 1 for r in all_records: if target_clones is None or r.clone in target_clones: if init_clone_sizes[r.clone] >= min_seq: big_enough.append(r) if len(big_enough) == 0: printError("\n\nNo sequences found that match specified criteria.",1) if sample_depth > 0: random.shuffle(big_enough) total = 0 for r in big_enough: if r.functional is None: r.functional = True if found_no_funct is False: printWarning("FUNCTIONAL column not found.") found_no_funct = True r.sequence_id = r.sequence_id.replace(",","-") #remove commas from sequence ID r.sequence_id = r.sequence_id.replace(":","-") #remove colons from sequence ID if append is not None: for m in append: r.sequence_id = r.sequence_id + "_" + r.getField(m) total += maskCodonsLoop(r, clones, cloneseqs, logs, fails, out_args, fail_writer) if total == sample_depth: break # Start processing clones clonesizes = {} pass_count, nclones = 0, 0 printMessage("Processing clones", start_time=start_time, width=50) for k in clones.keys(): if len(clones[str(k)]) < min_seq: for j in range(0, len(clones[str(k)])): logs[clones[str(k)][j].sequence_id]["FAIL"] = "Clone too small: " + str(len(cloneseqs[str(k)])) logs[clones[str(k)][j].sequence_id]["PASS"] = False clonesizes[str(k)] = -len(cloneseqs[str(k)]) else: clonesizes[str(k)] = outputIgPhyML(clones[str(k)], cloneseqs[str(k)], meta_data=meta_data, collapse=collapse, ncdr3=ncdr3, logs=logs, fail_writer=fail_writer, out_dir=clone_dir, min_seq=min_seq) #If clone is too small, size is returned as a negative if clonesizes[str(k)] > 0: nclones += 1 pass_count += clonesizes[str(k)] else: fails["seq_fail"] -= clonesizes[str(k)] fails["minseq_fail"] -= clonesizes[str(k)] fail_count = fails["rec_count"] - pass_count # End clone processing printMessage("Done", start_time=start_time, end=True, width=50) log_handle = None if out_args["log_file"] is not None: log_handle = open(out_args["log_file"], "w") for j in logs.keys(): printLog(logs[j], handle=log_handle) pass_handle.write(str(nclones)+"\n") for key in sorted(clonesizes, key=clonesizes.get, reverse=True): #print(key + "\t" + str(clonesizes[key])) outfile = os.path.join(clone_dir, "%s.fasta" % key) partfile = os.path.join(clone_dir, "%s.part.txt" % key) if clonesizes[key] > 0: germ_id = ["GERM"] if meta_data is not None: for i in range(1, len(meta_data)): germ_id.append("GERM")
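# Each line of the lineages "repfile" written below is tab-separated: the clone fasta path, a
# constant "N" column, the germline sequence ID (<clone>_GERM...), and the clone partition file.
# runIgPhyML later passes this file to IgPhyML via --repfile (see gy_args in runIgPhyML).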
pass_handle.write("%s\t%s\t%s_%s\t%s\n" % (outfile, "N", key,"_".join(germ_id), partfile)) handle.close() output = {"pass": None, "fail": None} if pass_handle is not None: output["pass"] = pass_handle.name pass_handle.close() if fail_handle is not None: output["fail"] = fail_handle.name fail_handle.close() if log_handle is not None: log_handle.close() #printProgress(rec_count, rec_count, 0.05, start_time) log = OrderedDict() log["OUTPUT"] = os.path.basename(pass_handle.name) if pass_handle is not None else None log["RECORDS"] = fails["rec_count"] log["PASS"] = pass_count log["FAIL"] = fail_count log["NONFUNCTIONAL"] = fails["nf_fail"] log["FRAMESHIFT_DEL"] = fails["del_fail"] log["FRAMESHIFT_INS"] = fails["in_fail"] log["CLONETOOSMALL"] = fails["minseq_fail"] log["CDRFWR_ERROR"] = fails["region_fail"] log["GERMLINE_PTC"] = fails["germlineptc"] log["OTHER_FAIL"] = fails["other_fail"] if collapse: log["DUPLICATE"] = fail_count - fails["seq_fail"] log["END"] = "BuildTrees" printLog(log) #Run IgPhyML on outputted data? if igphyml: runIgPhyML(pass_handle.name, igphyml_out=igphyml_out, clone_dir=clone_dir, nproc=nproc, optimization=optimization, omega=omega, kappa=kappa, motifs=motifs, hotness=hotness, oformat=oformat, nohlp=nohlp,clean=clean) return output def getArgParser(): """ Defines the ArgumentParser Returns: argparse.ArgumentParser: argument parsers. """ # Define input and output field help message fields = dedent( """ output files: folder containing fasta and partition files for each clone. lineages successfully processed records. lineages-fail database records failed processing. igphyml-pass parameter estimates and lineage trees from running IgPhyML, if specified required fields: SEQUENCE_ID, SEQUENCE_INPUT, SEQUENCE_IMGT, GERMLINE_IMGT_D_MASK, V_CALL, J_CALL, CLONE, V_SEQ_START """) # Parent parser parser_parent = getCommonArgParser(out_file=False, log=True, format=True) # Define argument parser parser = ArgumentParser(description=__doc__, epilog=fields, parents=[parser_parent], formatter_class=CommonHelpFormatter, add_help=False) group = parser.add_argument_group("sequence processing arguments") group.add_argument("--collapse", action="store_true", dest="collapse", help="""If specified, collapse identical sequences before exporting to fasta.""") group.add_argument("--ncdr3", action="store_true", dest="ncdr3", help="""If specified, remove CDR3 from all sequences.""") group.add_argument("--md", nargs="+", action="store", dest="meta_data", help="""List of fields to containing metadata to include in output fasta file sequence headers.""") group.add_argument("--clones", nargs="+", action="store", dest="target_clones", help="""List of clone IDs to output, if specified.""") group.add_argument("--minseq", action="store", dest="min_seq", type=int, default=1, help="""Minimum number of data sequences. 
Any clones with fewer than the specified number of sequences will be excluded.""") group.add_argument("--sample", action="store", dest="sample_depth", type=int, default=-1, help="""Depth of reads to be subsampled (before deduplication).""") group.add_argument("--append", nargs="+", action="store", dest="append", help="""List of columns to append to sequence ID to ensure uniqueness.""") igphyml_group = parser.add_argument_group("IgPhyML arguments (see igphyml -h for details)") igphyml_group.add_argument("--igphyml", action="store_true", dest="igphyml", help="""Run IgPhyML on output?""") igphyml_group.add_argument("--nproc", action="store", dest="nproc", type=int, default=1, help="""Number of threads to parallelize IgPhyML across.""") igphyml_group.add_argument("--clean", action="store", choices=("none", "all"), dest="clean", type=str, default="none", help="""Delete intermediate files? none: leave all intermediate files; all: delete all intermediate files.""") igphyml_group.add_argument("--optimize", action="store", dest="optimization", type=str, default="lr", choices=("n","r","l","lr","tl","tlr"), help="""Optimize combination of topology (t) branch lengths (l) and parameters (r), or nothing (n), for IgPhyML.""") igphyml_group.add_argument("--omega", action="store", dest="omega", type=str, default="e,e", choices = ("e", "ce", "e,e", "ce,e", "e,ce", "ce,ce"), help="""Omega parameters to estimate for FWR,CDR respectively: e = estimate, ce = estimate + confidence interval""") igphyml_group.add_argument("-t", action="store", dest="kappa", type=str, default="e", choices=("e", "ce"), help="""Kappa parameters to estimate: e = estimate, ce = estimate + confidence interval""") igphyml_group.add_argument("--motifs", action="store", dest="motifs", type=str, default="WRC_2:0,GYW_0:1,WA_1:2,TW_0:3,SYC_2:4,GRS_0:5", help="""Which motifs to estimate mutability.""") igphyml_group.add_argument("--hotness", action="store", dest="hotness", type=str, default="e,e,e,e,e,e", help="""Mutability parameters to estimate: e = estimate, ce = estimate + confidence interval""") igphyml_group.add_argument("--oformat", action="store", dest="oformat", type=str, default="tab", choices=("tab", "txt"), help="""IgPhyML output format.""") igphyml_group.add_argument("--nohlp", action="store_true", dest="nohlp", help="""Don't run HLP model?""") return parser if __name__ == "__main__": """ Parses command line arguments and calls main """ # Parse command line arguments parser = getArgParser() checkArgs(parser) args = parser.parse_args() args_dict = parseCommonArgs(args) del args_dict["db_files"] # Call main for each input file for f in args.__dict__["db_files"]: args_dict["db_file"] = f buildTrees(**args_dict)changeo-0.4.6/bin/ConvertDb.py0000755000076600000240000014171713453736571016735 0ustar vandej27staff00000000000000#!/usr/bin/env python3 """ Parses tab delimited database files """ # Info __author__ = 'Jason Anthony Vander Heiden' from changeo import __version__, __date__ # Imports import csv import os import re import shutil from argparse import ArgumentParser from collections import OrderedDict from itertools import chain from textwrap import dedent from time import time from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import IUPAC # Presto and changeo imports from presto.Annotation import flattenAnnotation from presto.IO import printLog, printMessage, printProgress, printError, printWarning from changeo.Alignment import gapV from changeo.Applications import 
default_tbl2asn_exec, runASN from changeo.Defaults import default_id_field, default_seq_field, default_germ_field, \ default_csv_size, default_format, default_out_args from changeo.Commandline import CommonHelpFormatter, checkArgs, getCommonArgParser, parseCommonArgs from changeo.Gene import c_gene_regex, parseAllele, buildGermline from changeo.IO import countDbFile, getFormatOperators, getOutputHandle, AIRRReader, AIRRWriter, \ ChangeoReader, ChangeoWriter, TSVReader, ReceptorData, readGermlines, \ checkFields, yamlDict from changeo.Receptor import AIRRSchema, ChangeoSchema # System settings csv.field_size_limit(default_csv_size) # Defaults default_db_xref = 'IMGT/GENE-DB' default_molecule = 'mRNA' default_product = 'immunoglobulin heavy chain' default_allele_delim = '*' def buildSeqRecord(db_record, id_field, seq_field, meta_fields=None): """ Parses a database record into a SeqRecord Arguments: db_record : a dictionary containing a database record. id_field : the field containing identifiers. seq_field : the field containing sequences. meta_fields : a list of fields to add to sequence annotations. Returns: Bio.SeqRecord.SeqRecord: record. """ # Return None if ID or sequence fields are empty if not db_record[id_field] or not db_record[seq_field]: return None # Create description string desc_dict = OrderedDict([('ID', db_record[id_field])]) if meta_fields is not None: desc_dict.update([(f, db_record[f]) for f in meta_fields if f in db_record]) desc_str = flattenAnnotation(desc_dict) # Create SeqRecord seq_record = SeqRecord(Seq(db_record[seq_field], IUPAC.ambiguous_dna), id=desc_str, name=desc_str, description='') return seq_record def correctIMGTFields(receptor, references): """ Add IMGT-gaps to IMGT fields in a Receptor object Arguments: receptor (changeo.Receptor.Receptor): Receptor object to modify. references (dict): dictionary of IMGT-gapped references sequences. Returns: changeo.Receptor.Receptor: modified Receptor with IMGT-gapped fields. """ # Initialize update object imgt_dict = {'sequence_imgt': None, 'v_germ_start_imgt': None, 'v_germ_length_imgt': None, 'germline_imgt': None} try: if not all([receptor.sequence_imgt, receptor.v_germ_start_imgt, receptor.v_germ_length_imgt, receptor.v_call]): raise AttributeError except AttributeError: return None # Update IMGT fields try: gapped = gapV(receptor.sequence_imgt, receptor.v_germ_start_imgt, receptor.v_germ_length_imgt, receptor.v_call, references) except KeyError as e: printWarning(e) return None # Verify IMGT-gapped sequence and junction concur try: check = (receptor.junction == gapped['sequence_imgt'][309:(309 + receptor.junction_length)]) except TypeError: check = False if not check: return None # Rebuild germline sequence __, germlines, __ = buildGermline(receptor, references) if germlines is None: return None else: gapped['germline_imgt'] = germlines['full'] # Update return object imgt_dict.update(gapped) return imgt_dict def insertGaps(db_file, references=None, format=default_format, out_file=None, out_args=default_out_args): """ Inserts IMGT numbering into V fields Arguments: db_file : the database file name. references : folder with germline repertoire files. If None, do not updated alignment columns wtih IMGT gaps. format : input format. out_file : output file name. Automatically generated from the input file if None. out_args : common output argument dictionary from parseCommonArgs. 
Returns: str : output file name """ log = OrderedDict() log['START'] = 'ConvertDb' log['COMMAND'] = 'imgt' log['FILE'] = os.path.basename(db_file) printLog(log) # Define format operators try: reader, writer, schema = getFormatOperators(format) except ValueError: printError('Invalid format %s.' % format) # Open input db_handle = open(db_file, 'rt') db_iter = reader(db_handle) # Check for required columns try: required = ['sequence_imgt', 'v_germ_start_imgt'] checkFields(required, db_iter.fields, schema=schema) except LookupError as e: printError(e) # Load references reference_dict = readGermlines(references) # Check for IMGT-gaps in germlines if all('...' not in x for x in reference_dict.values()): printWarning('Germline reference sequences do not appear to contain IMGT-numbering spacers. Results may be incorrect.') # Open output writer if out_file is not None: pass_handle = open(out_file, 'w') else: pass_handle = getOutputHandle(db_file, out_label='gap', out_dir=out_args['out_dir'], out_name=out_args['out_name'], out_type=schema.out_type) pass_writer = writer(pass_handle, fields=db_iter.fields) # Count records result_count = countDbFile(db_file) # Iterate over records start_time = time() rec_count = pass_count = 0 for rec in db_iter: # Print progress for previous iteration printProgress(rec_count, result_count, 0.05, start_time=start_time) rec_count += 1 # Update IMGT fields imgt_dict = correctIMGTFields(rec, reference_dict) # Write records if imgt_dict is not None: pass_count += 1 rec.setDict(imgt_dict, parse=False) pass_writer.writeReceptor(rec) # Print counts printProgress(rec_count, result_count, 0.05, start_time=start_time) log = OrderedDict() log['OUTPUT'] = os.path.basename(pass_handle.name) log['RECORDS'] = rec_count log['PASS'] = pass_count log['FAIL'] = rec_count - pass_count log['END'] = 'ConvertDb' printLog(log) # Close file handles pass_handle.close() db_handle.close() return pass_handle.name def convertToAIRR(db_file, format=default_format, out_file=None, out_args=default_out_args): """ Converts a Change-O formatted file into an AIRR formatted file Arguments: db_file : the database file name. format : input format. out_file : output file name. Automatically generated from the input file if None. out_args : common output argument dictionary from parseCommonArgs. Returns: str : output file name """ log = OrderedDict() log['START'] = 'ConvertDb' log['COMMAND'] = 'airr' log['FILE'] = os.path.basename(db_file) printLog(log) # Define format operators try: reader, __, schema = getFormatOperators(format) except ValueError: printError('Invalid format %s.' 
               % format)

    # Open input
    db_handle = open(db_file, 'rt')
    db_iter = reader(db_handle)

    # Set output fields, replacing length fields with end fields
    in_fields = [schema.toReceptor(f) for f in db_iter.fields]
    out_fields = [ReceptorData.length_fields[f][1] if f in ReceptorData.length_fields else f \
                  for f in in_fields]
    out_fields = [AIRRSchema.fromReceptor(f) for f in out_fields]

    # Open output writer
    if out_file is not None:
        pass_handle = open(out_file, 'w')
    else:
        pass_handle = getOutputHandle(db_file, out_label='airr', out_dir=out_args['out_dir'],
                                      out_name=out_args['out_name'], out_type=AIRRSchema.out_type)
    pass_writer = AIRRWriter(pass_handle, fields=out_fields)

    # Count records
    result_count = countDbFile(db_file)

    # Iterate over records
    start_time = time()
    rec_count = 0
    for rec in db_iter:
        # Print progress for previous iteration
        printProgress(rec_count, result_count, 0.05, start_time=start_time)
        rec_count += 1
        # Write records
        pass_writer.writeReceptor(rec)

    # Print counts
    printProgress(rec_count, result_count, 0.05, start_time=start_time)
    log = OrderedDict()
    log['OUTPUT'] = os.path.basename(pass_handle.name)
    log['RECORDS'] = rec_count
    log['END'] = 'ConvertDb'
    printLog(log)

    # Close file handles
    pass_handle.close()
    db_handle.close()

    return pass_handle.name


def convertToChangeo(db_file, out_file=None, out_args=default_out_args):
    """
    Converts an AIRR formatted file into a Change-O formatted file

    Arguments:
      db_file: the database file name.
      out_file : output file name. Automatically generated from the input file if None.
      out_args : common output argument dictionary from parseCommonArgs.

    Returns:
      str : output file name.
    """
    log = OrderedDict()
    log['START'] = 'ConvertDb'
    log['COMMAND'] = 'changeo'
    log['FILE'] = os.path.basename(db_file)
    printLog(log)

    # Open input
    db_handle = open(db_file, 'rt')
    db_iter = AIRRReader(db_handle)

    # Set output fields, replacing length fields with end fields
    in_fields = [AIRRSchema.toReceptor(f) for f in db_iter.fields]
    out_fields = [ReceptorData.end_fields[f][1] if f in ReceptorData.end_fields else f \
                  for f in in_fields]
    out_fields = [ChangeoSchema.fromReceptor(f) for f in out_fields]

    # Open output writer
    if out_file is not None:
        pass_handle = open(out_file, 'w')
    else:
        pass_handle = getOutputHandle(db_file, out_label='changeo', out_dir=out_args['out_dir'],
                                      out_name=out_args['out_name'], out_type=ChangeoSchema.out_type)
    pass_writer = ChangeoWriter(pass_handle, fields=out_fields)

    # Count records
    result_count = countDbFile(db_file)

    # Iterate over records
    start_time = time()
    rec_count = 0
    for rec in db_iter:
        # Print progress for previous iteration
        printProgress(rec_count, result_count, 0.05, start_time=start_time)
        rec_count += 1
        # Write records
        pass_writer.writeReceptor(rec)

    # Print counts
    printProgress(rec_count, result_count, 0.05, start_time=start_time)
    log = OrderedDict()
    log['OUTPUT'] = os.path.basename(pass_handle.name)
    log['RECORDS'] = rec_count
    log['END'] = 'ConvertDb'
    printLog(log)

    # Close file handles
    pass_handle.close()
    db_handle.close()

    return pass_handle.name


# TODO: SHOULD ALLOW FOR UNSORTED CLUSTER COLUMN
# TODO: SHOULD ALLOW FOR GROUPING FIELDS
def convertToBaseline(db_file, id_field=default_id_field, seq_field=default_seq_field,
                      germ_field=default_germ_field, cluster_field=None,
                      meta_fields=None, out_file=None, out_args=default_out_args):
    """
    Builds a BASELINe fasta file from database records

    Arguments:
      db_file : the database file name.
      id_field : the field containing identifiers.
      seq_field : the field containing sample sequences.
      germ_field : the field containing germline sequences.
cluster_field : the field containing clonal groupings; if None write the germline for each record. meta_fields : a list of fields to add to sequence annotations. out_file : output file name. Automatically generated from the input file if None. out_args : common output argument dictionary from parseCommonArgs. Returns: str : output file name """ log = OrderedDict() log['START'] = 'ConvertDb' log['COMMAND'] = 'fasta' log['FILE'] = os.path.basename(db_file) log['ID_FIELD'] = id_field log['SEQ_FIELD'] = seq_field log['GERM_FIELD'] = germ_field log['CLUSTER_FIELD'] = cluster_field if meta_fields is not None: log['META_FIELDS'] = ','.join(meta_fields) printLog(log) # Open input db_handle = open(db_file, 'rt') db_iter = TSVReader(db_handle) result_count = countDbFile(db_file) # Open output if out_file is not None: pass_handle = open(out_file, 'w') else: pass_handle = getOutputHandle(db_file, out_label='sequences', out_dir=out_args['out_dir'], out_name=out_args['out_name'], out_type='clip') # Iterate over records start_time = time() rec_count, germ_count, pass_count, fail_count = 0, 0, 0, 0 cluster_last = None for rec in db_iter: # Print progress for previous iteration printProgress(rec_count, result_count, 0.05, start_time=start_time) rec_count += 1 # Update cluster ID cluster = rec.get(cluster_field, None) # Get germline SeqRecord when needed if cluster_field is None: germ = buildSeqRecord(rec, id_field, germ_field, meta_fields) germ.id = '>' + germ.id elif cluster != cluster_last: germ = buildSeqRecord(rec, cluster_field, germ_field) germ.id = '>' + germ.id else: germ = None # Get read SeqRecord seq = buildSeqRecord(rec, id_field, seq_field, meta_fields) # Write germline if germ is not None: germ_count += 1 SeqIO.write(germ, pass_handle, 'fasta') # Write sequences if seq is not None: pass_count += 1 SeqIO.write(seq, pass_handle, 'fasta') else: fail_count += 1 # Set last cluster ID cluster_last = cluster # Print counts printProgress(rec_count, result_count, 0.05, start_time=start_time) log = OrderedDict() log['OUTPUT'] = os.path.basename(pass_handle.name) log['RECORDS'] = rec_count log['GERMLINES'] = germ_count log['PASS'] = pass_count log['FAIL'] = fail_count log['END'] = 'ConvertDb' printLog(log) # Close file handles pass_handle.close() db_handle.close() return pass_handle.name def convertToFasta(db_file, id_field=default_id_field, seq_field=default_seq_field, meta_fields=None, out_file=None, out_args=default_out_args): """ Builds fasta files from database records Arguments: db_file : the database file name. id_field : the field containing identifiers. seq_field : the field containing sequences. meta_fields : a list of fields to add to sequence annotations. out_file : output file name. Automatically generated from the input file if None. out_args : common output argument dictionary from parseCommonArgs. Returns: str : output file name. 
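
    Example:
      A minimal sketch of a direct call (the file and field names are
      hypothetical):

      >>> # convertToFasta('reads_db.tab', id_field='SEQUENCE_ID',
      >>> #                seq_field='SEQUENCE_IMGT', meta_fields=['CLONE'])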
""" log = OrderedDict() log['START'] = 'ConvertDb' log['COMMAND'] = 'fasta' log['FILE'] = os.path.basename(db_file) log['ID_FIELD'] = id_field log['SEQ_FIELD'] = seq_field if meta_fields is not None: log['META_FIELDS'] = ','.join(meta_fields) printLog(log) # Open input out_type = 'fasta' db_handle = open(db_file, 'rt') db_iter = TSVReader(db_handle) result_count = countDbFile(db_file) # Open output if out_file is not None: pass_handle = open(out_file, 'w') else: pass_handle = getOutputHandle(db_file, out_label='sequences', out_dir=out_args['out_dir'], out_name=out_args['out_name'], out_type=out_type) # Iterate over records start_time = time() rec_count, pass_count, fail_count = 0, 0, 0 for rec in db_iter: # Print progress for previous iteration printProgress(rec_count, result_count, 0.05, start_time=start_time) rec_count += 1 # Get SeqRecord seq = buildSeqRecord(rec, id_field, seq_field, meta_fields) # Write sequences if seq is not None: pass_count += 1 SeqIO.write(seq, pass_handle, out_type) else: fail_count += 1 # Print counts printProgress(rec_count, result_count, 0.05, start_time=start_time) log = OrderedDict() log['OUTPUT'] = os.path.basename(pass_handle.name) log['RECORDS'] = rec_count log['PASS'] = pass_count log['FAIL'] = fail_count log['END'] = 'ConvertDb' printLog(log) # Close file handles pass_handle.close() db_handle.close() return pass_handle.name def makeGenbankFeatures(record, start=None, end=None, product=default_product, inference=None, db_xref=None, c_field=None, allow_stop=False, asis_calls=False, allele_delim=default_allele_delim): """ Creates a feature table for GenBank submissions Arguments: record : Receptor record. start : start position of the modified seqeuence in the input sequence. Used for feature position offsets. end : end position of the modified seqeuence in the input sequence. Used for feature position offsets. product : Product (protein) name. inference : Reference alignment tool. db_xref : Reference database name. c_field : column containing the C region gene call. allow_stop : if True retain records with junctions having stop codons. asis_calls : if True do not parse gene calls for IMGT nomenclature. allele_delim : delimiter separating the gene name from the allele number when asis_calls=True. Returns: dict : dictionary defining GenBank features where the key is a tuple (start, end, feature key) and values are a list of tuples contain (qualifier key, qualifier value). 
""" # .tbl file format # Line 1, Column 1: Start location of feature # Line 1, Column 2: Stop location of feature # Line 1, Column 3: Feature key # Line 2, Column 4: Qualifier key # Line 2, Column 5: Qualifier value # Get genes and alleles c_gene = None if not asis_calls: # V gene v_gene = record.getVGene() v_allele = record.getVAlleleNumber() # D gene d_gene = record.getDGene() d_allele = record.getDAlleleNumber() # J gene j_gene = record.getJGene() j_allele = record.getJAlleleNumber() # C region if c_field is not None: c_gene = parseAllele(record.getField(c_field), c_gene_regex, action='first') else: # V gene v_split = iter(record.v_call.rsplit(allele_delim, maxsplit=1)) v_gene = next(v_split, None) v_allele = next(v_split, None) # D gene d_split = iter(record.d_call.rsplit(allele_delim, maxsplit=1)) d_gene = next(d_split, None) d_allele = next(d_split, None) # J gene j_split = iter(record.j_call.rsplit(allele_delim, maxsplit=1)) j_gene = next(j_split, None) j_allele = next(j_split, None) # C region if c_field is not None: c_gene = record.getField(c_field) # Fail if V or J is missing if v_gene is None or j_gene is None: return None # Set position offsets if required start_trim = 0 if start is None else start end_trim = 0 if end is None else len(record.sequence_input) - end source_len = len(record.sequence_input) - end_trim # Define return object result = OrderedDict() # C_region # gene # db_xref # inference c_region_start = record.j_seq_end + 1 - start_trim c_region_length = len(record.sequence_input[(c_region_start + start_trim - 1):]) - end_trim if c_region_length > 0: if c_gene is not None: c_region = [('gene', c_gene)] if db_xref is not None: c_region.append(('db_xref', '%s:%s' % (db_xref, c_gene))) else: c_region = [] # Assign C_region feature c_region_end = c_region_start + c_region_length - 1 result[(c_region_start, '>%i' % c_region_end, 'C_region')] = c_region # Preserve J segment end position j_end = record.j_seq_end # Check for range error if c_region_end > source_len: return None else: # Trim J segment end position j_end = record.j_seq_end + c_region_length # V_region variable_start = max(record.v_seq_start - start_trim, 1) variable_end = j_end - start_trim result[(variable_start, variable_end, 'V_region')] = [] # Check for range error if variable_end > source_len: return None # Product feature result[(variable_start, variable_end, 'misc_feature')] = [('note', '%s variable region' % product)] # V_segment # gene (gene name) # allele (allele only, without gene name, don't use if ambiguous) # db_xref (database link) # inference (reference alignment tool) v_segment = [('gene', v_gene)] if v_allele is not None: v_segment.append(('allele', v_allele)) if db_xref is not None: v_segment.append(('db_xref', '%s:%s' % (db_xref, v_gene))) if inference is not None: v_segment.append(('inference', 'COORDINATES:alignment:%s' % inference)) result[(variable_start, record.v_seq_end - start_trim, 'V_segment')] = v_segment # D_segment # gene # allele # db_xref # inference if d_gene: d_segment = [('gene', d_gene)] if d_allele is not None: d_segment.append(('allele', d_allele)) if db_xref is not None: d_segment.append(('db_xref', '%s:%s' % (db_xref, d_gene))) if inference is not None: d_segment.append(('inference', 'COORDINATES:alignment:%s' % inference)) result[(record.d_seq_start - start_trim, record.d_seq_end - start_trim, 'D_segment')] = d_segment # J_segment # gene # allele # db_xref # inference j_segment = [('gene', j_gene)] if j_allele is not None: j_segment.append(('allele', j_allele)) if 
db_xref is not None: j_segment.append(('db_xref', '%s:%s' % (db_xref, j_gene))) if inference is not None: j_segment.append(('inference', 'COORDINATES:alignment:%s' % inference)) result[(record.j_seq_start - start_trim, j_end - start_trim, 'J_segment')] = j_segment # CDS # codon_start (must indicate codon offset) # function = JUNCTION # inference if record.junction_start is not None and record.junction_end is not None: # Define junction boundaries junction_start = record.junction_start - start_trim junction_end = record.junction_end - start_trim # CDS record cds_start = '<%i' % junction_start cds_end = '>%i' % junction_end cds_record = [('function', 'JUNCTION')] if inference is not None: cds_record.append(('inference', 'COORDINATES:protein motif:%s' % inference)) # Check for valid translation junction_seq = record.sequence_input[(junction_start - 1):junction_end] if len(junction_seq) % 3 > 0: junction_seq = junction_seq + 'N' * (3 - len(junction_seq) % 3) junction_aa = Seq(junction_seq).translate() # Return invalid record upon junction stop codon if '*' in junction_aa and not allow_stop: return None elif '*' in junction_aa: cds_record.append(('note', '%s junction region' % product)) result[(cds_start, cds_end, 'misc_feature')] = cds_record else: cds_record.append(('product', '%s junction region' % product)) cds_record.append(('codon_start', 1)) result[(cds_start, cds_end, 'CDS')] = cds_record return result def makeGenbankSequence(record, name=None, label=None, count_field=None, index_field=None, molecule=default_molecule, features=None): """ Creates a sequence for GenBank submissions Arguments: record : Receptor record. name : sequence identifier for the output sequence. If None, use the original sequence identifier. label : a string to use as a label for the ID. if None do not add a field label. count_field : field name to populate the AIRR_READ_COUNT note. index_field : field name to populate the AIRR_CELL_INDEX note. molecule : source molecule (eg, "mRNA", "genomic DNA") features : dictionary of sample features (BioSample attributes) to add to the description of each record. 
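
      Example of a resulting fasta description line (values are illustrative,
      not from real data):

      >>> # >1 [isolate=Donor1] [moltype=mRNA] [keyword=TLS; Targeted Locus Study; AIRR; MiAIRR:1.0] [note=AIRR_READ_COUNT:42]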
Returns: dict: dictionary with {'record': SeqRecord, 'start': start position in raw sequence, 'end': end position in raw sequence} """ # Replace gaps with N seq = record.sequence_input seq = seq.replace('-', 'N').replace('.', 'N') # Strip leading and trailing Ns head_match = re.search('^N+', seq) tail_match = re.search('N+$', seq) seq_start = head_match.end() if head_match else 0 seq_end = tail_match.start() if tail_match else len(seq) # Define ID if name is None: name = record.sequence_id.split(' ')[0] if label is not None: name = '%s=%s' % (label, name) if features is not None: sample_desc = ' '.join(['[%s=%s]' % (k, v) for k, v in features.items()]) name = '%s %s' % (name, sample_desc) name = '%s [moltype=%s] [keyword=TLS; Targeted Locus Study; AIRR; MiAIRR:1.0]' % (name, molecule) # Notes note_dict = OrderedDict() if count_field is not None: note_dict['AIRR_READ_COUNT'] = record.getField(count_field) if index_field is not None: note_dict['AIRR_CELL_INDEX'] = record.getField(index_field) if note_dict: note = '; '.join(['%s:%s' % (k, v) for k, v in note_dict.items()]) name = '%s [note=%s]' % (name, note) # Return SeqRecord and positions record = SeqRecord(Seq(seq[seq_start:seq_end], IUPAC.ambiguous_dna), id=name, name=name, description='') result = {'record': record, 'start': seq_start, 'end': seq_end} return result def convertToGenbank(db_file, inference=None, db_xref=None, molecule=default_molecule, product=default_product, features=None, c_field=None, label=None, count_field=None, index_field=None, allow_stop=False, asis_id=False, asis_calls=False, allele_delim=default_allele_delim, build_asn=False, asn_template=None, tbl2asn_exec=default_tbl2asn_exec, format=default_format, out_file=None, out_args=default_out_args): """ Builds GenBank submission fasta and table files Arguments: db_file : the database file name. inference : reference alignment tool. db_xref : reference database link. molecule : source molecule (eg, "mRNA", "genomic DNA") product : Product (protein) name. features : dictionary of sample features (BioSample attributes) to add to the description of each record. c_field : column containing the C region gene call. label : a string to use as a label for the ID. if None do not add a field label. count_field : field name to populate the AIRR_READ_COUNT note. index_field : field name to populate the AIRR_CELL_INDEX note. allow_stop : if True retain records with junctions having stop codons. asis_id : if True use the original sequence ID for the output IDs. asis_calls : if True do not parse gene calls for IMGT nomenclature. allele_delim : delimiter separating the gene name from the allele number when asis_calls=True. build_asn : if True run tbl2asn on the generated .tbl and .fsa files. asn_template : template file (.sbt) to pass to tbl2asn. tbl2asn_exec : name of or path to the tbl2asn executable. format : input and output format. out_file : output file name without extension. Automatically generated from the input file if None. out_args : common output argument dictionary from parseCommonArgs. Returns: tuple : the output (feature table, fasta) file names. """ log = OrderedDict() log['START'] = 'ConvertDb' log['COMMAND'] = 'genbank' log['FILE'] = os.path.basename(db_file) printLog(log) # Define format operators try: reader, __, schema = getFormatOperators(format) except ValueError: printError('Invalid format %s.' 
% format) # Open input db_handle = open(db_file, 'rt') db_iter = reader(db_handle) # Check for required columns try: required = ['sequence_input', 'v_call', 'd_call', 'j_call', 'v_seq_start', 'd_seq_start', 'j_seq_start'] checkFields(required, db_iter.fields, schema=schema) except LookupError as e: printError(e) # Open output if out_file is not None: out_name, __ = os.path.splitext(out_file) fsa_handle = open('%s.fsa' % out_name, 'w') tbl_handle = open('%s.tbl' % out_name, 'w') else: fsa_handle = getOutputHandle(db_file, out_label='genbank', out_dir=out_args['out_dir'], out_name=out_args['out_name'], out_type='fsa') tbl_handle = getOutputHandle(db_file, out_label='genbank', out_dir=out_args['out_dir'], out_name=out_args['out_name'], out_type='tbl') # Count records result_count = countDbFile(db_file) # Define writer writer = csv.writer(tbl_handle, delimiter='\t', quoting=csv.QUOTE_NONE) # Iterate over records start_time = time() rec_count, pass_count, fail_count = 0, 0, 0 for rec in db_iter: # Print progress for previous iteration printProgress(rec_count, result_count, 0.05, start_time=start_time) rec_count += 1 # Extract table dictionary name = None if asis_id else rec_count seq = makeGenbankSequence(rec, name=name, label=label, count_field=count_field, index_field=index_field, molecule=molecule, features=features) tbl = makeGenbankFeatures(rec, start=seq['start'], end=seq['end'], product=product, db_xref=db_xref, inference=inference, c_field=c_field, allow_stop=allow_stop, asis_calls=asis_calls, allele_delim=allele_delim) if tbl is not None: pass_count +=1 # Write table writer.writerow(['>Features', seq['record'].id]) for feature, qualifiers in tbl.items(): writer.writerow(feature) if qualifiers: for x in qualifiers: writer.writerow(list(chain(['', '', ''], x))) # Write sequence SeqIO.write(seq['record'], fsa_handle, 'fasta') else: fail_count += 1 # Final progress bar printProgress(rec_count, result_count, 0.05, start_time=start_time) # Run tbl2asn if build_asn: start_time = time() printMessage('Running tbl2asn', start_time=start_time, width=25) result = runASN(fsa_handle.name, template=asn_template, exec=tbl2asn_exec) printMessage('Done', start_time=start_time, end=True, width=25) # Print ending console log log = OrderedDict() log['OUTPUT_TBL'] = os.path.basename(tbl_handle.name) log['OUTPUT_FSA'] = os.path.basename(fsa_handle.name) log['RECORDS'] = rec_count log['PASS'] = pass_count log['FAIL'] = fail_count log['END'] = 'ConvertDb' printLog(log) # Close file handles tbl_handle.close() fsa_handle.close() db_handle.close() return (tbl_handle.name, fsa_handle.name) def getArgParser(): """ Defines the ArgumentParser Arguments: None Returns: an ArgumentParser object """ # Define input and output field help message fields = dedent( ''' output files: airr AIRR formatted database files. changeo Change-O formatted database files. sequences FASTA formatted sequences output from the subcommands fasta and clip. genbank feature tables and fasta files containing MiAIRR compliant input for tbl2asn. 
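             example usage (the file name is illustrative only):
                 ConvertDb.py airr -d reads_db.tab
                 ConvertDb.py genbank -d reads_db.tab --product "immunoglobulin heavy chain" --db "IMGT/GENE-DB"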
required fields: SEQUENCE_ID, SEQUENCE_INPUT, JUNCTION, V_CALL, D_CALL, J_CALL, V_SEQ_START, V_SEQ_LENGTH, D_SEQ_START, D_SEQ_LENGTH, J_SEQ_START, J_SEQ_LENGTH, NP1_LENGTH, NP2_LENGTH SEQUENCE_IMGT, V_GERM_START_IMGT, V_GERM_LENGTH_IMGT optional fields: GERMLINE_IMGT, GERMLINE_IMGT_D_MASK, CLONE, C_CALL ''') # Define ArgumentParser parser = ArgumentParser(description=__doc__, epilog=fields, formatter_class=CommonHelpFormatter, add_help=False) group_help = parser.add_argument_group('help') group_help.add_argument('--version', action='version', version='%(prog)s:' + ' %s %s' %(__version__, __date__)) group_help.add_argument('-h', '--help', action='help', help='show this help message and exit') subparsers = parser.add_subparsers(title='subcommands', dest='command', metavar='', help='Database operation') # TODO: This is a temporary fix for Python issue 9253 subparsers.required = True # Define parent parsers default_parent = getCommonArgParser(failed=False, log=False, format=False) format_parent = getCommonArgParser(failed=False, log=False) # Subparser to convert changeo to AIRR files parser_airr = subparsers.add_parser('airr', parents=[default_parent], formatter_class=CommonHelpFormatter, add_help=False, help='Converts input to an AIRR TSV file.', description='Converts input to an AIRR TSV file.') parser_airr.set_defaults(func=convertToAIRR) # Subparser to convert AIRR to changeo files parser_changeo = subparsers.add_parser('changeo', parents=[default_parent], formatter_class=CommonHelpFormatter, add_help=False, help='Converts input into a Change-O TSV file.', description='Converts input into a Change-O TSV file.') parser_changeo.set_defaults(func=convertToChangeo) # Subparser to insert IMGT-gaps # desc_gap = dedent(''' # Inserts IMGT numbering spacers into the observed sequence # (SEQUENCE_IMGT, sequence_alignment) and rebuilds the germline sequence # (GERMLINE_IMGT, germline_alignment) if present. Also adjusts the values # in the V germline coordinate fields (V_GERM_START_IMGT, V_GERM_LENGTH_IMGT; # v_germline_end, v_germline_start), which are required. 
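# (Illustrative example, not part of the original comment block: if this
# subcommand were enabled, a typical invocation might look like
#   ConvertDb.py gap -d reads_db.tab -r ~/germlines/imgt/human/vdj
# where the database file and germline folder are hypothetical.)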
# ''')
# parser_gap = subparsers.add_parser('gap', parents=[format_parent],
#                                    formatter_class=CommonHelpFormatter, add_help=False,
#                                    help='Inserts IMGT numbering spacers into the V region.',
#                                    description=desc_gap)
# group_gap = parser_gap.add_argument_group('conversion arguments')
# group_gap.add_argument('-r', nargs='+', action='store', dest='references', required=False,
#                        help='''List of folders and/or fasta files containing
#                             IMGT-gapped germline sequences corresponding to the
#                             set of germlines used for the alignment.''')
# parser_gap.set_defaults(func=insertGaps)

    # Subparser to convert database entries to sequence file
    parser_fasta = subparsers.add_parser('fasta', parents=[default_parent],
                                         formatter_class=CommonHelpFormatter, add_help=False,
                                         help='Creates a fasta file from database records.',
                                         description='Creates a fasta file from database records.')
    group_fasta = parser_fasta.add_argument_group('conversion arguments')
    group_fasta.add_argument('--if', action='store', dest='id_field',
                             default=default_id_field,
                             help='The name of the field containing identifiers')
    group_fasta.add_argument('--sf', action='store', dest='seq_field',
                             default=default_seq_field,
                             help='The name of the field containing sequences')
    group_fasta.add_argument('--mf', nargs='+', action='store', dest='meta_fields',
                             help='List of annotation fields to add to the sequence description')
    parser_fasta.set_defaults(func=convertToFasta)

    # Subparser to convert database entries to clip-fasta file
    parser_baseln = subparsers.add_parser('baseline', parents=[default_parent],
                                          formatter_class=CommonHelpFormatter, add_help=False,
                                          description='Creates a BASELINe fasta file from database records.',
                                          help='''Creates a specially formatted fasta file
                                               from database records for input into the BASELINe
                                               website. The format groups clonally related sequences
                                               sequentially, with the germline sequence preceding
                                               each clone and denoted by headers starting with ">>".''')
    group_baseln = parser_baseln.add_argument_group('conversion arguments')
    group_baseln.add_argument('--if', action='store', dest='id_field',
                              default=default_id_field,
                              help='The name of the field containing identifiers')
    group_baseln.add_argument('--sf', action='store', dest='seq_field',
                              default=default_seq_field,
                              help='The name of the field containing reads')
    group_baseln.add_argument('--gf', action='store', dest='germ_field',
                              default=default_germ_field,
                              help='The name of the field containing germline sequences')
    group_baseln.add_argument('--cf', action='store', dest='cluster_field', default=None,
                              help='The name of the field containing sorted clone IDs')
    group_baseln.add_argument('--mf', nargs='+', action='store', dest='meta_fields',
                              help='List of annotation fields to add to the sequence description')
    parser_baseln.set_defaults(func=convertToBaseline)

    # Subparser to convert database entries to a GenBank fasta and feature table file
    parser_gb = subparsers.add_parser('genbank', parents=[format_parent],
                                      formatter_class=CommonHelpFormatter, add_help=False,
                                      help='Creates files for GenBank/TLS submissions.',
                                      description='Creates files for GenBank/TLS submissions.')
    # Genbank source information arguments
    group_gb_src = parser_gb.add_argument_group('source information arguments')
    group_gb_src.add_argument('--mol', action='store', dest='molecule', default=default_molecule,
                              help='''The source molecule type.
Usually one of "mRNA" or "genomic DNA".''') group_gb_src.add_argument('--product', action='store', dest='product', default=default_product, help='''The product name, such as "immunoglobulin heavy chain".''') group_gb_src.add_argument('--db', action='store', dest='db_xref', default=None, help='''Name of the reference database used for alignment. Usually "IMGT/GENE-DB".''') group_gb_src.add_argument('--inf', action='store', dest='inference', default=None, help='''Name and version of the inference tool used for reference alignment in the form tool:version.''') # Genbank sample information arguments group_gb_sam = parser_gb.add_argument_group('sample information arguments') group_gb_sam.add_argument('--organism', action='store', dest='organism', default=None, help='The scientific name of the organism.') group_gb_sam.add_argument('--sex', action='store', dest='sex', default=None, help='''If specified, adds the given sex annotation to the fasta headers.''') group_gb_sam.add_argument('--isolate', action='store', dest='isolate', default=None, help='''If specified, adds the given isolate annotation (sample label) to the fasta headers.''') group_gb_sam.add_argument('--tissue', action='store', dest='tissue', default=None, help='''If specified, adds the given tissue-type annotation to the fasta headers.''') group_gb_sam.add_argument('--cell-type', action='store', dest='cell_type', default=None, help='''If specified, adds the given cell-type annotation to the fasta headers.''') group_gb_sam.add_argument('-y', action='store', dest='yaml_config', default=None, help='''A yaml file specifying sample features (BioSample attributes) in the form \'variable: value\'. If specified, any features provided in the yaml file will override those provided at the commandline. Note, this config file applies to sample features only and cannot be used for required source features such as the --product or --mol argument.''') # General genbank conversion arguments group_gb_cvt = parser_gb.add_argument_group('conversion arguments') group_gb_cvt.add_argument('--label', action='store', dest='label', default=None, help='''If specified, add a field name to the sequence identifier. Sequence identifiers will be output in the form