IgDiscover-0.11/.gitattributes

igdiscover/_version.py export-subst

IgDiscover-0.11/.gitignore

*~
*.kate-swp
.ipynb_checkpoints/
raw/
tmp/
__pycache__/
databases/*/*.log
wiki/
runs/
venv/
build/
*.egg-info/
.tox/
.idea/
_build
bin/
testrun
dist/
igblast/
*.pyc
.pytest_cache/

IgDiscover-0.11/.travis.yml

language: python

cache:
  directories:
    - $HOME/.cache/pip

python:
  - "3.6"

before_install:
  - wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
  - bash miniconda.sh -b -p $HOME/miniconda
  - export PATH="$HOME/miniconda/bin:$PATH"
  - conda config --set always_yes yes --set changeps1 no
  - conda update -q conda
  - conda config --add channels bioconda --add channels conda-forge
  - conda info -a
  - wget https://bitbucket.org/igdiscover/testdata/downloads/igdiscover-testdata-0.5.tar.gz
  - tar xvf igdiscover-testdata-0.5.tar.gz

install:
  - conda env create -n testenv -f environment.yml
  - conda install -n testenv pytest
  - source activate testenv
  - pip install .

script: tests/run.sh

IgDiscover-0.11/CHANGES.rst

=======
Changes
=======

v0.11 (2018-11-27)
------------------

* The IgBLAST cache is now disabled by default. We assume that, in most
  cases, datasets will not be re-run with the exact same parameters, so the
  cache would only fill up the disk. Delete your cache with
  ``rm -r ~/.cache/igdiscover`` to reclaim the space. To enable the cache,
  create a file ``~/.config/igdiscover.conf`` with the contents
  ``use_cache: true``.
* If you choose to enable the cache, results from the PEAR merging step will
  now also be cached. See also the :ref:`caching documentation `.
* Added detection of chimeras to the (pre-)germline filters. Any novel allele
  that can be explained as a chimera of two unmodified reference alleles is
  marked in the ``new_V_germline.tab`` file. This detection is a bit
  sensitive, so the candidate is currently only marked, not discarded.
* Two additional files ``annotated_V_germline.tab`` and
  ``annotated_V_pregermline.tab`` are created in each iteration during the
  germline filtering step. These are identical to the ``candidates.tab``
  file, except that they contain a ``why_filtered`` column that describes why
  a sequence was filtered. See the :ref:`documentation for this feature `.
* A more realistic test dataset (v0.5), now based on human instead of rhesus
  data, was prepared. The :ref:`testing instructions ` have been updated
  accordingly.
* J discovery has been tuned to give fewer truncated sequences.
* Statistics are written to ``stats/stats.json``.
* V SHM distribution plots are created automatically and written to
  ``v-shm-distributions.pdf`` in each iteration folder.
* An ``igdiscover dbdiff`` subcommand was added that can compare two FASTA
  files.

v0.10 (2018-05-11)
------------------

* When computing a consensus sequence, allow some sequences to be truncated
  in the 3' end.
  Many of the discovered novel V alleles were truncated by one nucleotide in
  the 3' end because IgBLAST does not always extend the alignment to the end
  of the V sequence. If these slightly too short V sequences were in the
  majority, their consensus would be a truncated sequence as well. The new
  consensus algorithm allows for this effect at the 3' end and can therefore
  find the full sequence more often than before (a toy sketch of this
  consensus rule follows after this list). Example::

      TACTGTGCGAGAGA (seq 1)
      TACTGTGCGAGAGA (seq 2)
      TACTGTGCGAGAG- (seq 3)
      TACTGTGCGAG--- (seq 4)
      TACTGTGCGAG--- (seq 5)

      TACTGTGCGAGAG  (previous consensus)
      TACTGTGCGAGAGA (new consensus)

* Add a column ``database_changes`` to the ``new_V_germline.tab`` file that
  describes how the novel sequence differs from the database sequence.
  Example: ``93C>T; 114A>G``
* Allow filtering by ``CDR3_shared_ratio`` and do so by default (needs
  documentation)
* Cache the edit distances when computing the distance matrix. This speeds up
  the ``discover`` command slightly.
* ``discover``: Use more than six CPU cores if available
* ``igblast``: Print progress every minute
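The following is a toy illustration of the column-wise majority idea
described above, not the pipeline's actual implementation. It assumes the
sequences are padded to equal length with ``-`` and that a column must still
be covered by at least ``min_cover`` sequences (a made-up parameter) to be
called::

    from collections import Counter

    def consensus_3prime_tolerant(sequences, min_cover=2):
        """Column-wise majority consensus that tolerates gaps at the 3' end."""
        result = []
        for column in zip(*sequences):
            # Gap characters stand for 3'-truncated sequences; ignore them.
            bases = [c for c in column if c != "-"]
            if len(bases) < min_cover:
                break  # too few sequences still cover this column
            result.append(Counter(bases).most_common(1)[0][0])
        return "".join(result)

    sequences = [
        "TACTGTGCGAGAGA",
        "TACTGTGCGAGAGA",
        "TACTGTGCGAGAG-",
        "TACTGTGCGAG---",
        "TACTGTGCGAG---",
    ]
    assert consensus_3prime_tolerant(sequences) == "TACTGTGCGAGAGA"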
v0.9 (2018-03-22)
-----------------

* Implemented allele-ratio filtering for J gene discovery
* J genes are discovered as part of the pipeline (previously, one needed to
  run the ``discoverj`` script manually)
* In each iteration, dendrograms are now created not only for V genes, but
  also for D and J genes. The file names are ``dendrogram_D.pdf`` and
  ``dendrogram_J.pdf``.
* The V dendrograms are now in ``dendrogram_V.pdf`` (no longer
  ``V_dendrogram.pdf``). This puts all the dendrograms together when looking
  at the files in the iteration directory.
* The ``V_usage.tab`` and ``V_usage.pdf`` files are no longer created.
  Instead, ``expressed_V.tab`` and ``expressed_V.pdf`` are created. These
  contain similar information, but an allele-ratio filter is used to filter
  out artifacts.
* Similarly, ``expressed_D.tab`` and ``expressed_J.tab`` and their ``.pdf``
  counterparts are created in each iteration.
* Removed the ``parse`` subcommand (its functionality is in the ``igblast``
  subcommand)
* New CDR3 detection method (only heavy chain sequences): CDR3 start/end
  coordinates are pre-computed using the database V and J sequences. This
  increases the detection rate to 99% (previously less than 90%).
* Removed the ability to check discovered genes for required motifs. This
  has never worked well.
* Add a column ``clonotypes`` to ``candidates.tab`` that tries to count how
  many clonotypes are associated with a single candidate (using only exact
  occurrences). It is intended to replace the ``CDR3s_exact`` column.
* Add an ``exact_ratio`` to the germline filtering options. It checks the
  ratio between the exact V occurrence counts (``exact`` column) of two
  alleles.
* The germline filtering option ``allele_ratio`` was renamed to
  ``clonotypes_ratio``.
* Implement a cache for IgBLAST results. When the same dataset is
  re-analyzed, possibly with different parameters, the cached results are
  used instead of re-running IgBLAST, which saves a lot of time. If the V/D/J
  database or the IgBLAST version has changed, results are not re-used.

v0.8.0 (2017-06-20)
-------------------

* Add a ``barcodes_exact`` column to the candidates table. It gives the
  number of unique barcode sequences used by the sequences in the set of
  exact sequences. Also, add a configuration setting ``barcode_consensus``
  that can turn off consensus taking for barcode groups; it needs to be set
  to ``false`` for ``barcodes_exact`` to work.
* Add a ``Ds_exact`` column to the candidates table.
* Add a ``D_coverage`` configuration option.
* The pre-processing filtering step no longer reads the full table of
  IgBLAST assignments into memory, but filters the table piece by piece (see
  the sketch after this list). Memory usage for this step therefore no
  longer depends on the dataset size and should always stay below 1 GB.
* The functionality of the ``parse`` subcommand has been integrated into the
  ``igblast`` subcommand. This means that ``igdiscover igblast`` now directly
  outputs a result table (``assigned.tab``), which makes it easier to use
  that subcommand directly instead of only via the workflow.
* The ``igblast`` subcommand now always runs ``makeblastdb`` by itself and
  deletes the BLAST database afterwards. This reduces clutter and ensures
  the database is always up to date.
* Remove the ``library_name`` configuration setting. Instead, the
  ``library_name`` is now always the same as the name of the analysis
  directory.
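The piece-by-piece filtering pattern looks roughly like the following. This
is a minimal sketch using pandas; the chunk size, the file names and the
``V_errors == 0`` criterion are illustrative stand-ins, not the pipeline's
real preprocessing filter::

    import pandas as pd

    # Stream the assignment table in fixed-size pieces so that memory usage
    # stays bounded no matter how large the dataset is.
    chunks = pd.read_csv("assigned.tab.gz", sep="\t", chunksize=100000)
    with open("filtered.tab", "w") as out:
        for i, chunk in enumerate(chunks):
            kept = chunk[chunk["V_errors"] == 0]  # stand-in filter criterion
            # Write the column headings only once, with the first chunk.
            kept.to_csv(out, sep="\t", index=False, header=(i == 0))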
v0.7.0 (2017-05-04)
-------------------

* Add an “allele ratio” criterion to the germline filter to further reduce
  the number of false positives. The filter is activated by default and can
  be configured through the ``allele_ratio`` setting in the configuration
  file. :ref:`See the documentation for how it works `.
* Ignore the CDR3-encoding bases whenever comparing two V gene sequences.
* Avoid finding 5'-truncated V genes by extending found hits towards the
  5' end.
* By default, candidate sequences are no longer merged if they are nearly
  identical. That is, the ``differences`` setting within the two germline
  filter configuration sections is now set to zero by default. Previously,
  we believed the merging would remove some false positives, but it turns
  out we also miss true positives. It also seems that with the other changes
  in this version, we no longer get the particular false positives the
  setting was supposed to catch.
* Implement an experimental ``discoverj`` script for J gene discovery. It is
  currently not run automatically as part of ``igdiscover run``. See
  ``igdiscover discoverj --help`` for how to run it manually.
* Add a ``config`` subcommand, which can be used to change the configuration
  file from the command line.
* Add a ``V_CDR3_start`` column to the ``assigned.tab``/``filtered.tab``
  tables. It describes where the CDR3 starts within the V sequence.
* Similarly, add a ``CDR3_start`` column to the ``new_V_germline.tab`` file
  describing where the CDR3 starts within a discovered V sequence. It is
  computed as the most common CDR3 start among the sequences within the
  cluster.
* Rename the ``compose`` subcommand to ``germlinefilter``.
* The ``init`` subcommand automatically fixes certain problems in the input
  database (duplicate sequences, empty records, duplicate sequence names).
  Previously, it would complain and the user had to fix the problems
  themselves.
* Move source code to GitHub
* Set up automatic code testing (continuous integration) via Travis
* Many documentation improvements

v0.6.0 (2016-12-07)
-------------------

* The FASTA files of the input V/D/J gene lists now need to be named
  ``V.fasta``, ``D.fasta`` and ``J.fasta``. The species name is no longer
  part of the file name. This should reduce confusion when working with
  species not supported by IgBLAST.
* The ``species:`` configuration setting can (and should) now be left empty.
  Its only use was that it is passed to IgBLAST, but since IgDiscover
  provides IgBLAST with its own V/D/J sequences anyway, it does not seem to
  make a difference.
* A “cross-mapping” detection has been added, which should reduce the number
  of false positives. :ref:`See the documentation for an explanation `.
* Novel sequences identical to a database sequence no longer get the
  ``_S1234`` suffix.
* No longer trim the initial run of ``G`` nucleotides in sequences (due to
  RACE) by default. This is now a configuration setting.
* Add a ``cdr3_location`` configuration setting: it allows choosing whether
  to use the CDR3 in addition to the barcode when grouping sequences.
* Create a ``groups.tab.gz`` file by default (describing the de-barcoded
  groups)
* The pre-processing filter is now configurable. See the
  ``preprocessing_filter`` section in the configuration file.
* Many improvements to the documentation
* Extended and fixed unit tests. These are now run via a CI system.
* Statistics in JSON format are written to ``stats/stats.json``.
* IgBLAST 1.5.0 output can now be parsed. Parsing is also faster by 25%.
* More helpful warning message when no sequences were discovered in an
  iteration.
* Drop support for Python 3.3.

v0.5 (2016-09-01)
-----------------

* V sequences of the input database are now whitelisted by default. The
  meaning of the ``whitelist`` configuration option has changed: if set to
  ``false``, those sequences are no longer whitelisted. To whitelist
  additional sequences, create a ``whitelist.fasta`` file as before.
* Sequences with stop codons are now filtered out by default.
* Use more stringent germline filtering parameters by default.

v0.4 (2016-08-24)
-----------------

* It is now possible to install and run IgDiscover on OS X. Appropriate
  Conda packages are available on bioconda.
* Add column ``has_stop`` to ``candidates.tab``, which indicates whether the
  candidate sequence contains a stop codon.
* Add a configuration option that makes it possible to disable the 5' motif
  check by setting ``check_motifs: false`` (the ``looks_like_V`` column is
  ignored in this case).
* Make it possible to whitelist known sequences: if a found gene candidate
  appears in that list, the sequence is included in the list of discovered
  sequences even when it would otherwise not pass the filtering criteria. To
  enable this, just add a ``whitelist.fasta`` file to the project directory
  before starting the analysis.
* The criteria for the germline filter and the pre-germline filter are now
  configurable: see the ``germline_filter`` and ``pre_germline_filter``
  sections in the configuration file.
* Different runs of IgDiscover with the same parameters on the same input
  files will now give the same results. See the ``seed`` parameter in the
  configuration, also for how to get non-reproducible results as before.
* Both the germline and the pre-germline filter are now applied in each
  iteration. Instead of the ``new_V_database.fasta`` file, two files named
  ``new_V_germline.fasta`` and ``new_V_pregermline.fasta`` are created.
* The ``compose`` subcommand now outputs a filtered version of the
  ``candidates.tab`` file in addition to a FASTA file. The table contains
  the columns **closest_whitelist**, which is the name of the closest
  whitelist sequence, and **whitelist_diff**, which is the number of
  differences to that whitelist sequence.

v0.3 (2016-08-08)
-----------------

* Optionally, sequences are not renamed in the ``assigned.tab`` file, but
  retain their original name as in the FASTA or FASTQ file. Set
  ``rename: false`` in the configuration file to get this behavior.
* Started an “advanced” section in the manual.
v0.2
----

* IgDiscover can now also detect kappa and lambda light chain V genes
  (VK, VL)

IgDiscover-0.11/CITATION

Corcoran, Martin M. and Phad, Ganesh E. and Bernat, Néstor Vázquez and
Stahl-Hennig, Christiane and Sumida, Noriyuki and Persson, Mats A.A. and
Martin, Marcel and Karlsson Hedestam, Gunilla B.
Production of individualized V gene databases reveals high levels of
immunoglobulin genetic diversity.
Nature Communications 7:13642 (2016)
https://dx.doi.org/10.1038/ncomms13642

IgDiscover-0.11/LICENSE

Copyright (c) 2015-2016 Marcel Martin

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

IgDiscover-0.11/MANIFEST.in

include README.rst CHANGES.rst
include doc/*.rst doc/conf.py doc/Makefile
include tests/*.py
include versioneer.py
include igdiscover/_version.py

IgDiscover-0.11/README.rst

.. image:: https://img.shields.io/pypi/v/igdiscover.svg?branch=master
   :target: https://pypi.python.org/pypi/igdiscover

.. image:: https://travis-ci.org/NBISweden/IgDiscover.svg?branch=master
   :target: https://travis-ci.org/NBISweden/IgDiscover

.. image:: https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg
   :target: http://bioconda.github.io/recipes/igdiscover/README.html

==========
IgDiscover
==========

IgDiscover analyzes antibody repertoires and discovers new V genes from
high-throughput sequencing reads. Heavy chains, kappa and lambda light chains
are supported (to discover VH, VK and VL genes).

IgDiscover is the result of a collaboration between the
`Gunilla Karlsson Hedestam group `_ at the
`Department of Microbiology, Tumor and Cell Biology `_ at
`Karolinska Institutet `_, Sweden, and the
`Bioinformatics Long-Term Support `_ facility at
`Science for Life Laboratory (SciLifeLab) `_, Sweden.

If you use IgDiscover, please cite:

| Corcoran, Martin M. and Phad, Ganesh E. and Bernat, Néstor Vázquez and
  Stahl-Hennig, Christiane and Sumida, Noriyuki and Persson, Mats A.A. and
  Martin, Marcel and Karlsson Hedestam, Gunilla B.
| *Production of individualized V gene databases reveals high levels of
  immunoglobulin genetic diversity.*
| Nature Communications 7:13642 (2016)
| https://dx.doi.org/10.1038/ncomms13642

Links
-----

* `Documentation `_
* `Source code `_
* `Report an issue `_
* `Project page on PyPI (Python package index) `_

|
.. figure:: https://raw.githubusercontent.com/NBISweden/IgDiscover/master/doc/clusterplot.jpeg

|

IgDiscover-0.11/doc/Makefile

all:
	sphinx-build -W -b html -d _build/doctrees . _build/html
	@echo
	@echo "Build finished. The HTML pages are in _build/html."

IgDiscover-0.11/doc/_templates/sidebar.html

<h3>Support</h3>
<p>Join the igdiscover mailing list:</p>

IgDiscover-0.11/doc/advanced.rst

.. _advanced:

Advanced topics
===============

IgDiscover itself does not (yet) come with all imaginable analysis facilities
built into it. However, it creates many files (mostly with tables) that can
be used for custom analysis. For example, all ``.tab`` files (in particular
``assigned.tab.gz`` and ``candidates.tab``) can be opened and inspected in a
spreadsheet application such as LibreOffice. From there, you can do basic
tasks such as sorting from the menu of that application.

Often these facilities are not enough, however, and some basic understanding
of the command line is helpful. Clearly, this is not as convenient as working
in a graphical user interface (GUI), but we do not currently have the
resources to provide one for IgDiscover. To alleviate this somewhat, we
provide here instructions for a few things that you may want to do with the
IgDiscover result files.

Extract all sequences that match any database gene exactly
----------------------------------------------------------

The ``candidates.tab`` file tells you for each discovered sequence how often
an *exact match* of that sequence was found in your input reads. A high
number of exact matches is a good indication that the candidate is actually a
new gene or allele. In order to find the original reads that correspond to
those matches, you can look at the ``filtered.tab.gz`` file and extract all
rows where the ``V_errors`` column is zero.

First, run this on the ``filtered.tab.gz`` file::

    zcat filtered.tab.gz | head -n 1 | tr '\t' '\n' | nl

This will enumerate the columns in the file. Take note of the index of the
``V_errors`` column. In newer pipeline versions, the index is 21. Then
extract all rows of the file where that field is equal to zero::

    zcat filtered.tab.gz | awk -vFS="\t" '$21 == 0 || NR == 1' > exact.tab

If the column wasn’t 21, replace the ``$21`` accordingly. The ``NR == 1``
part ensures that the column headings are also printed. A scripted variant of
this recipe is sketched at the end of this page.

Extra configuration settings
----------------------------

Some configuration settings are not documented in the default
``igdiscover.yaml`` file since they rarely need to be changed.

::

    # Leave empty or choose a species name supported by IgBLAST:
    # human, mouse, rabbit, rat, rhesus_monkey
    # This setting is not used anywhere except that it is passed
    # to IgBLAST using the -organism option. Since we provide IgBLAST
    # with our own gene databases, it seems this has no effect.
    species:

::

    # Which program to use for computing multiple alignments. This is used
    # for computing consensus sequences.
    # Choose 'mafft', 'clustalo', 'muscle' or 'muscle-fast'.
    # 'muscle-fast' runs muscle with parameters "-maxiters 1 -diags".
    #
    #multialign_program: muscle-fast
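The exact-match extraction above can also be scripted without hard-coding the
column index. A minimal sketch using pandas (assuming it is installed);
selecting the ``V_errors`` column by name replaces the ``nl``/``awk`` steps::

    import pandas as pd

    # Read the tab-separated table; gzip compression is inferred from the name.
    table = pd.read_csv("filtered.tab.gz", sep="\t")

    # Keep only the rows whose V sequence matches the database exactly.
    exact = table[table["V_errors"] == 0]

    # Write the result, including the column headings.
    exact.to_csv("exact.tab", sep="\t", index=False)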
IgDiscover-0.11/doc/changes.rst

.. include:: ../CHANGES.rst

IgDiscover-0.11/doc/clusterplot.jpeg

(binary JPEG image file: the cluster plot figure shown in the README;
contents omitted)
IgDiscover-0.11/doc/conf.py

# IgDiscover documentation build configuration file

import sys
import os

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath(os.path.dirname(__file__)))
sys.path.insert(0, os.path.abspath(os.pardir))

# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = ['fulltoc']

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix of source filenames.
source_suffix = '.rst'

# The encoding of source files.
#source_encoding = 'utf-8-sig'

# The master toctree document.
master_doc = 'index'

# General information about the project.
authors = u'Marcel Martin'
project = u'IgDiscover'
copyright = u'2015-2017, ' + authors

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# When generating the documentation, we currently do not require the
# dependencies to be installed. We therefore cannot 'import' anything from
# igdiscover since that may fail. The versioneer module can only be imported
# from the project root (it has extra checks such that even changing sys.path
# will not work), so the following is what we need to do.
import subprocess
version = subprocess.check_output(
    [sys.executable, '-c', 'import versioneer; print(versioneer.get_version())'],
    cwd='..').decode().strip()

# Read The Docs modifies the conf.py script and we therefore get 'dirty'
# version numbers like 0.12+0.g27d0d31.dirty from versioneer.
if version.endswith('.dirty') and os.environ.get('READTHEDOCS') == 'True':
    version, _, rest = version.partition('+')
    if not rest.startswith('0.'):
        version = version + '+' + rest[:-6]

# The full version, including alpha/beta/rc tags.
release = version

suppress_warnings = ['image.nonlocal_uri']

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#language = None

# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['_build']

# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None

# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True

# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True

# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []

# If true, keep warnings as "system message" paragraphs in the built documents.
#keep_warnings = False

# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'alabaster'

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}

# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []

# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None

# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None

# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
html_logo = 'preliminary-logo.jpeg'

# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
#html_extra_path = []

# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'

# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True

# Custom sidebar templates, maps document names to template names.
html_sidebars = {
    '**': [
        'about.html',
        'localtoc.html',
        'sidebar.html',  # see _templates/sidebar.html
        'searchbox.html',
    ]
}

# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}

# If false, no module index is generated.
#html_domain_indices = True

# If false, no index is generated.
#html_use_index = True

# If true, the index is split into individual pages for each letter.
#html_split_index = False

# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True

# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
#html_show_sphinx = True

# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True

# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''

# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None

# Output file base name for HTML help builder.
htmlhelp_basename = 'IgDiscoverdoc'

# -- Options for LaTeX output ---------------------------------------------

latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #'papersize': 'letterpaper',

    # The font size ('10pt', '11pt' or '12pt').
    #'pointsize': '10pt',

    # Additional stuff for the LaTeX preamble.
    #'preamble': '',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
#  author, documentclass [howto, manual, or own class]).
latex_documents = [
    ('index', 'IgDiscover.tex', u'IgDiscover Documentation',
     authors, 'manual'),
]

# The name of an image file (relative to this directory) to place at the top
# of the title page.
#latex_logo = None

# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False

# If true, show page references after internal links.
#latex_show_pagerefs = False

# If true, show URL addresses after external links.
#latex_show_urls = False

# Documents to append as an appendix to all manuals.
#latex_appendices = []

# If false, no module index is generated.
#latex_domain_indices = True

# -- Options for manual page output ---------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    ('index', 'igdiscover', u'IgDiscover Documentation', [authors], 1)
]

# If true, show URL addresses after external links.
#man_show_urls = False

# -- Options for Texinfo output -------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
    ('index', 'IgDiscover', u'IgDiscover Documentation', authors,
     'IgDiscover', 'One line description of project.', 'Miscellaneous'),
]

# Documents to append as an appendix to all manuals.
#texinfo_appendices = []

# If false, no module index is generated.
#texinfo_domain_indices = True

# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'

# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False

IgDiscover-0.11/doc/develop.rst

.. _develop:

Development
===========

* `Source code `_
* `Report an issue `_

.. _developer-install:

Installing the development version
----------------------------------

To use the most recent IgDiscover version from Git, follow these steps.

1. If you haven’t done so, install miniconda. See the first steps of the
   :ref:`regular installation instructions `. Do not install IgDiscover,
   yet!

2. Clone the IgDiscover repository::

       git clone https://github.com/NBISweden/IgDiscover.git

   (Use the git@ URL instead if you are a developer.)
Create a new Conda environment using the ``environment.yml`` file in the repository:: cd IgDiscover conda env create -n igdiscover -f environment.yml You can choose a different environment name by changing the name after the ``-n`` parameter. This may be necessary when you already have a regular (non-developer) IgDiscover installation in an ``igdiscover`` environment that you don’t want to overwrite. 4. Activate the environment:: source activate igdiscover (Or use whichever name you chose above.) 5. Install IgDiscover in “editable” mode:: python3 -m pip install -e . Whenever you want to update the software:: cd IgDiscover git pull It may also be necessary to repeat the ``python3 -m pip install -e .`` step. IgBLAST result cache -------------------- For development, in particular when running tests repeatedly, you should enable the IgBLAST result cache. The cache stores IgBLAST output. If the same dataset is run a second time with the same settings, the result is retrieved from the cache and IgBLAST is not re-run. This saves a lot of time when re-running datasets, but may also fill up the cache directory ``~/.cache/igdiscover/``. Also, in production, datasets are usually not re-run with the same settings, which is why caching is disabled by default. To enable the cache, create a file ``~/.config/igdiscover.conf`` with the following content:: use_cache: true The file is in YAML format, but at the moment, no other settings are supported. Building the documentation -------------------------- Go to the ``doc/`` directory in the repository, then run:: make to build the documentation locally. Open ``_build/html/index.html`` in a browser. The layout is different from the `version shown on Read the Docs `_, but allows you to preview any changes you may have made. Making a release ---------------- We use `versioneer `_ to manage version numbers. It extracts the version number from the most recent tag in Git. Thus, to increment the version number, create a Git tag:: git tag v0.5 The ``v`` prefix is mandatory. Then: * ``tests/run.sh`` * ``python3 setup.py sdist`` * ``twine upload dist/igdiscover-0.10.tar.gz`` * Update bioconda recipe .. _removing-igdiscover: Removing IgDiscover from a Linux system --------------------------------------- If you have been playing around with different installation methods (``pip``, ``conda``, ``git``, ``python3 setup.py install`` etc.) you may have multiple copies of IgDiscover on your system and you will likely run into problems on updates. Here is a list you can follow in order to get rid of the installations as preparation for a clean re-install. *Do not* add ``sudo`` to the commands below if you get permission problems, unless explicitly told to do so! If one of the steps does not work, that is fine, just continue. 1. Delete miniconda: Run the command ``which conda``. The output will be something like ``/home/myusername/miniconda3/bin/conda``. The part before ``bin/conda`` is the miniconda installation directory. Delete that folder. In this case, you would need to delete ``miniconda3`` in ``/home/myusername``. 2. Run ``pip3 uninstall igdiscover``. If this runs successfully and prints some messages about removing files, then *repeat the same command*! Do this until you get a message telling you that the package cannot be uninstalled because it is not installed. 3. Repeat the previous step, but with ``pip3 uninstall sqt``. 4. 
If you have a directory named ``.local`` within your home directory, you may want to rename it: ``mv .local dot-local-backup`` You can also delete it, but there is a small risk that other software (not IgDiscover) uses that directory. The directory is hidden, so a normal ``ls`` will not show it. Use ``ls -la`` while in your home directory to see it. 5. If you have ever used ``sudo`` to install IgDiscover, you may have an installation in ``/usr/local/``. You can try to remove it with ``sudo pip3 uninstall igdiscover``. 6. Delete the cloned Git repository if you have one. This is the directory in which you run ``git pull``. Finally, you can follow the normal installation instructions and then the developer installation instructions. IgDiscover-0.11/doc/faq.rst000066400000000000000000000130241337725263500155540ustar00rootroot00000000000000Questions and Answers ===================== How many sequences are needed to discover germline V gene sequences? -------------------------------------------------------------------- Library sizes of several hundred thousand sequences are required for V gene discovery, with even higher numbers necessary for full database production. For example, IgM library sizes of 750,000 to 1,000,000 sequences for heavy chain databases and 1.5 to 2 million sequences for light chain databases. Can IgDiscover analyze IgG libraries? ------------------------------------- IgDiscover has been developed to identify germline databases from libraries that contain substantial fractions of unswitched antibody sequences. We recommend IgM libraries for heavy chain V gene identification and IgKappa and IgLambda libraries for light chain identification. IgDiscover can identify a proportion of germline sequences in IgG libraries, but the process is much more efficient with IgM libraries, enabling the full set of germline sequences to be discovered. Can IgDiscover analyze a previously sequenced library? ------------------------------------------------------ Yes, IgDiscover accepts paired-end or single-end FASTQ files as well as single-end FASTA files, but the program should be made aware which is being used, see :ref:`input requirements `. Do the positions of the PCR primers make a difference to the output? -------------------------------------------------------------------- Yes. For accurate V gene discovery, all primer sequences must be external to the V gene sequences. For example, forward multiplex amplification primers should be present in the leader sequence or 5' UTR, and reverse amplification primers should be located in the constant region, preferably close to the 5' border of the constant region. Primers that are present in the framework 1 region or J segments are not recommended for library production. What are the advantages of 5'-RACE compared to multiplex PCR for IgDiscover analysis? ------------------------------------------------------------------------------------- Both 5'-RACE and multiplex PCR have their own advantages. 5'-RACE will enable library production from species where the upstream V gene sequence is unknown. The output of the ``upstream`` subcommand in IgDiscover enables the identification of consensus leader and 5'-UTR sequences for each of the identified germline V genes, which can subsequently be used for primer design for either multiplex PCR or for monoclonal antibody amplification sets. Multiplex PCR is recommended for species where the upstream sequences are well characterized. 
Multiplex amplification products are shorter than 5'-RACE products and therefore will be easier to pair and will have fewer length-associated sequence errors. What is meant by 'starting database'? ------------------------------------- The starting database refers to the folder that contains the three FASTA files necessary for the process of iterative V gene discovery to begin. IgDiscover uses the standalone IgBLAST program for comparative assignment of sequences to the starting database. Because IgBLAST requires three files (for example ``V.fasta``, ``D.fasta``, ``J.fasta``), three FASTA files should be included in the database folder for each analysis to proceed. In the case of light chains (that do not contain D segments), a dummy D segment file should be included as IgBLAST will not proceed if it does not see three files in the database folder. It is sufficient to save the following sequence as a FASTA file named ``D.fasta``; it can then function as the dummy ``D.fasta`` file for human light chain analysis:: >D_ummy GGGGGGGGGG How can I use the IMGT database as a starting database? ------------------------------------------------------- Since we do not have permission to distribute IMGT database files with IgDiscover, you need to download them directly from `IMGT `_. See the :ref:`section about obtaining a V/D/J database `. How do I change the parameters of the program? ---------------------------------------------- By editing :ref:`the configuration file `. Where do I find the individualized database produced by IgDiscover? ------------------------------------------------------------------- The final germline database in FASTA format is in your :ref:`analysis directory ` in the subdirectory ``final/database/``. The ``V.fasta`` file contains the new list of V genes. The ``D.fasta`` and ``J.fasta`` files are unchanged from the starting database. A phylogenetic tree of the V sequences can be found in ``final/dendrogram_V.pdf``. For more details of how that database was created, you need to inspect the files created in the last iteration of the discovery process, located in ``iteration-xx``, where ``xx`` is the number of iterations configured in the ``igdiscover.yaml`` configuration file. For example, if three iterations were used, look into ``iteration-03/``. The most interesting files in that folder are likely - the linkage cluster analysis plots in ``iteration-03/clusterplots/``, - the error histograms in ``iteration-03/errorhistograms.pdf``, which contain the windowed cluster analysis figures. - details about the individualized database in ``new_V_germline.tab`` in tab-separated-value format. The ``new_V_germline.fasta`` file is identical to the one in ``final/database/V.fasta``. What does the _S1234 at the end of some gene names mean? -------------------------------------------------------- Please see the :ref:`Section on gene names `.IgDiscover-0.11/doc/fulltoc.py000066400000000000000000000063311337725263500163000ustar00rootroot00000000000000# -*- encoding: utf-8 -*- # # Copyright © 2012 New Dream Network, LLC (DreamHost) # # Author: Doug Hellmann # # Licensed under the Apache License, Version 2.0 (the "License"); you may # not use this file except in compliance with the License. You may obtain # a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, WITHOUT # WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the # License for the specific language governing permissions and limitations # under the License. from sphinx import addnodes def html_page_context(app, pagename, templatename, context, doctree): """Event handler for the html-page-context signal. Modifies the context directly. - Replaces the 'toc' value created by the HTML builder with one that shows all document titles and the local table of contents. - Sets display_toc to True so the table of contents is always displayed, even on empty pages. - Replaces the 'toctree' function with one that uses the entire document structure, ignores the maxdepth argument, and uses only prune and collapse. """ rendered_toc = get_rendered_toctree(app.builder, pagename) context['toc'] = rendered_toc context['display_toc'] = True # force toctree to display if "toctree" not in context: # json builder doesn't use toctree func, so nothing to replace return def make_toctree(collapse=True): return get_rendered_toctree(app.builder, pagename, prune=False, collapse=collapse, ) context['toctree'] = make_toctree def get_rendered_toctree(builder, docname, prune=False, collapse=True): """Build the toctree relative to the named document, with the given parameters, and then return the rendered HTML fragment. """ fulltoc = build_full_toctree(builder, docname, prune=prune, collapse=collapse, ) rendered_toc = builder.render_partial(fulltoc)['fragment'] return rendered_toc def build_full_toctree(builder, docname, prune, collapse): """Return a single toctree starting from docname containing all sub-document doctrees. """ env = builder.env doctree = env.get_doctree(env.config.master_doc) toctrees = [] for toctreenode in doctree.traverse(addnodes.toctree): toctree = env.resolve_toctree(docname, builder, toctreenode, collapse=collapse, prune=prune, ) toctrees.append(toctree) if not toctrees: return None result = toctrees[0] for toctree in toctrees[1:]: if toctree: result.extend(toctree.children) env.resolve_references(result, docname, builder) return result def setup(app): app.connect('html-page-context', html_page_context) IgDiscover-0.11/doc/guide.rst000066400000000000000000001320421337725263500161040ustar00rootroot00000000000000========== User guide ========== Overview ======== IgDiscover works on a single library at a time. It works within an “analysis directory” for the library, which contains all intermediate and result files. To start an analysis, you need: 1. A FASTA or FASTQ file with single-end reads or two FASTQ files with paired-end reads (also, the files must be gzip-compressed) 2. A database of V/D/J genes (three FASTA files named ``V.fasta``, ``D.fasta``, ``J.fasta``) 3. A configuration file that describes the library If you do not have a V/D/J database, yet, you may want to read the section about :ref:`how to obtain V/D/J sequences `. To run an analysis, proceed as follows. .. note:: If you are on macOS, it may be necessary to run ``export SHELL=/bin/bash`` before continuing. 1. Create and initialize the analysis directory. First, pick a name for your analysis. We will use ``myexperiment`` in the following. Then run ``igdiscover init``:: igdiscover init myexperiment A dialog will appear and ask for the file with the *first* (forward) reads. Find your compressed FASTQ file that contains them and select it. Typical file names may be ``Library1_S1_L001_R1_001.fastq.gz`` or ``mylibrary.1.fastq.gz``. You do not need to choose the second read file! It is found automatically. Next, choose the directory with your database. 
The directory must contain the three files ``V.fasta``, ``D.fasta``, ``J.fasta``. These files contain the V, D, J gene sequences, respectively. Even if you have only light chains in your data, a ``D.fasta`` file needs to be provided, just use one with the heavy chain D gene sequences. If you do not want a graphical user interface, use the two command-line parameters ``--db`` and ``--reads1`` to provide this information instead:: igdiscover init --db path/to/my/database/ --reads1 mylibrary.1.fastq.gz myexperiment Again, the second reads file will be found automatically. Use ``--single-reads`` instead of ``--reads1`` if you have single-end reads or a dataset with already merged reads. For ``--single-reads``, a FASTA file (not only FASTQ) is also allowed. In any case, an analysis directory named ``myexperiment`` will have been created. 2. Adjust the configuration file The previous step created a configuration file named ``myexperiment/igdiscover.yaml``, which you may :ref:`need to adjust `. In particular, the number of discovery rounds is set to 3 by default, which takes a long time. Reducing this to 2 or even 1 often works just as well. 3. Run the analysis Change into the newly created analysis directory and run the analysis:: igdiscover run Depending on the size of your library, your computer, and the number of iterations, this will now take from a few hours to a day. See the :ref:`running IgDiscover ` section for more fine-grained control over what to run and how to resume the process if something failed. .. _obtaining-database: Obtaining a V/D/J database ========================== We use the term “database” to refer to three FASTA files that contain the sequences for the V, D and J genes. IMGT provides `sequences for download `_. For discovering new VH genes, for example, you need to get the IGHV, IGHD and IGHJ files of your species. As IgDiscover uses this only as a starting point, using a similar species will also work. When using an IMGT database, it is very important to change the long IMGT sequence headers to short headers as IgBLAST does not accept the long headers. We recommend using the program ``edit_imgt_file.pl``. If you installed IgDiscover from Conda, the script is already installed and you can run it by typing the name. It is also `available on the IgBlast FTP site `_. Run it for all three downloaded files, and then rename the files appropriately to make sure that they are named ``V.fasta``, ``D.fasta`` and ``J.fasta``. You always need a file with D genes even if you analyze light chains. In case you have used IgBLAST previously, note that there is *no need* to run the ``makeblastdb`` tool as IgDiscover will do that for you. .. _input-requirements: Input data requirements ======================= Paired-end or single-end data ----------------------------- IgDiscover can process input data of three different types: * Paired-end reads in gzipped FASTQ format, * Single-end reads in gzipped FASTQ format, * Single-end reads in gzipped FASTA format. IgDiscover was tested mainly on paired-end Illumina MiSeq reads (2x300bp), but it can also handle 454 and Ion Torrent data. 
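If you are unsure whether your library is large enough for germline discovery (see the recommended library sizes in the questions and answers), a quick read count can help before you start. The following is only a convenience check, not part of IgDiscover itself; it assumes a gzip-compressed FASTQ file, in which each read occupies four lines::

    # Count the reads in a compressed FASTQ file (four lines per read)
    echo $(( $(zcat file.1.fastq.gz | wc -l) / 4 ))

For paired-end data, counting one of the two files is sufficient since both contain the same number of reads.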
Depending on the input file type, use a variant of one of the following commands to initialize the analysis directory:: igdiscover init --single-reads=file.fasta.gz --database=my-database-dir/ myexperiment igdiscover init --single-reads=file.fastq.gz --database=my-database-dir/ myexperiment igdiscover init --reads1=file.1.fastq.gz --database=my-database-dir/ myexperiment Read layout ----------- Paired-end reads are first merged and then processed in the same way as single-end reads. Reads that could not be merged are discarded. Single-end reads and merged paired-end reads are expected to follow this structure (from 5' to 3'): * The forward primer sequence. This is optional. * A random barcode (molecular identifier). This is optional. Set the configuration option ``barcode_length_5p`` to 0 if you don’t have random barcodes or if you don’t want the program to use them. * Optionally, a run of G nucleotides. This is an artifact of the RACE protocol (Rapid amplification of cDNA ends). If you have this, set ``race_g`` to ``true`` in the configuration file. * 5' UTR * Leader * Re-arranged V, D and J gene sequences for heavy chains; only V and J for light chains * An optional random barcode. Set the configuration option ``barcode_length_3p`` to the length of this barcode. You can currently not have both a 5' and a 3' barcode. * The reverse primer. This is optional. We use IgBLAST to detect the location of the V, D, J genes through the ``igdiscover igblast`` subcommand. The G nucleotides after the barcode are split off if the configuration specifies ``race_g: true``. The leader sequence is detected by looking for a start codon near 60 bp upstream of the start of the V gene match. .. _configuration: Configuration ============= The ``igdiscover init`` command creates a configuration file ``igdiscover.yaml`` in the analysis directory. To configure your analysis, change that file with a text editor before running the analysis with ``igdiscover run``. The syntax should be mostly self-explanatory. The file is in YAML format, but you will not need to learn that. Just follow the examples given in the file. A few rules that may be good to know are the following: 1. Lines starting with the ``#`` symbol are comments (they are ignored) 2. A configuration option that is meant to be switched on or off will say something like ``stranded: false`` if it is off. Change this to ``stranded: true`` to switch the option on (and vice versa). 3. The primer sequences are given as a list, and must be written in a certain way - one sequence per line, and a ``-`` (dash) in front, like so:: forward_primers: - ACGTACGTACGT - AACCGGTTAACC Even if you have only one primer sequence, you still need to use this syntax. To find out what the configuration options achieve, see the explanations in the configuration file itself. The main parameters that may require adjusting are the following. The ``iterations`` option sets the number of rounds of V gene discovery that will be performed. By default, three iterations are run. Even with a very restricted starting V database (for example with only a single V gene sequence), this is usually sufficient to identify most novel germline sequences. When the starting database is more complete, for example, when analyzing a human IgM library with the current IMGT heavy chain database, a single iteration may be sufficient to produce an individualized database. 
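For example, to run only a single round of discovery, the corresponding line in ``igdiscover.yaml`` would be changed to::

    iterations: 1

(This is just an excerpt; all other settings in the file can be left unchanged.)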
If you do not want to discover any new genes and only want to produce an expression profile, for example, then use ``iterations: 0``. The ``ignore_j`` option should be set to ``true`` when producing a V gene database for a species where J sequences are unknown:: ignore_j: true Setting the parameters ``stranded``, ``forward_primers`` and ``reverse_primers`` to the correct values can be used to remove 5' and 3' primers from the sequences. Doing this is not strictly necessary for IgDiscover. It is simplest if you do not specify any primer sequences. Pregermline and germline filter criteria ---------------------------------------- This provides IgDiscover with stringency requirements for V gene discovery that enable the program to filter out false positives. Usually the ”pregermline filter” can be used in the default mode since all these sequences will be subsequently passed to the higher stringency ”germline filter” where the criteria are set to maximize stringency. Here is how it looks in the configuration file:: pre_germline_filter: unique_cdr3s: 2 # Minimum number of unique CDR3s (within exact matches) unique_js: 2 # Minimum number of unique J genes (within exact matches) check_motifs: false # Check whether 5' end starts with known motif whitelist: true # Add database sequences to the whitelist cluster_size: 0 # Minimum number of sequences assigned to cluster differences: 0 # Merge sequences if they have at most this number of differences allow_stop: true # Whether to allow non-productive sequences containing stop codons cross_mapping_ratio: 0.02 # Threshold for removal of cross-mapping artifacts (set to 0 to disable) allele_ratio: 0.1 # Required minimum ratio between alleles of a single gene # Filtering criteria applied to candidate sequences in the last iteration. # These should be more strict than the pre_germline_filter criteria. # germline_filter: unique_cdr3s: 5 # Minimum number of unique CDR3s (within exact matches) unique_js: 3 # Minimum number of unique J genes (within exact matches) check_motifs: false # Check whether 5' end starts with known motif whitelist: true # Add database sequences to the whitelist cluster_size: 100 # Minimum number of sequences assigned to cluster differences: 0 # Merge sequences if they have at most this number of differences allow_stop: false # Whether to allow non-productive sequences containing stop codons cross_mapping_ratio: 0.02 # Threshold for removal of cross-mapping artifacts (set to 0 to disable) allele_ratio: 0.1 # Required minimum ratio between alleles of a single gene Factors that affect germline discovery include library source (IgM vs IgK, IgL or IgG), library size, sequence error rate and individual genomic factors (for example the number of J segments present in an individual). In general, setting a higher cutoff of ``unique_cdr3s`` and ``unique_js`` will minimize the number of false positives in the output. Example:: unique_cdr3s: 10 # Minimum number of unique CDR3s (within exact matches) unique_js: 4 # Minimum number of unique J genes (within exact matches) If the ``differences`` parameter is set to a value higher than 0, the germline filter inspects clusters of sequences that are closely related (when the edit distance between them is at most ``differences``) and retains only the most common sequence of each cluster. Previously, we believed this would remove some false positives due to accumulated random sequence errors of highly expressed alleles that otherwise would pass the cutoff criteria. 
However, we found out that we miss true positives, in particular if there are two alleles in the sample that differ in only a single nucleotide. We have now implemented other measures to avoid false positives and recommend against setting ``differences`` to anything other than ``0``. Read also about the :ref:`cross mapping `, for which germline filtering corrects, and about the :ref:`germline filters `. .. versionchanged:: The default for the ``differences`` setting was changed from 1 to 0. .. _running: Running IgDiscover ================== Resuming failed runs -------------------- The command ``igdiscover run``, which is used to start the pipeline, can also be used to resume execution if there was an interruption (a transient failure). Reasons for interruptions might be: * Ctrl+C was pressed on the keyboard * A full hard disk * If running on a cluster, the program may have been terminated because it exceeded its allocated time * Too little RAM * Power loss To resume execution after you have fixed the problem, go to the analysis directory and run ``igdiscover run`` again. It will skip the steps that have already finished successfully. This capability comes from the workflow management system `snakemake `_, on which ``igdiscover run`` is based. Snakemake will determine automatically which steps need to be re-run in order to get to a full result and then run only those. Alterations to the configuration file after an interruption are possible, but affect only steps that have not already finished successfully. For example, assume you interrupted a run with Ctrl+C after it is already past the step in which barcodes are removed. Then, even if you change the barcode length in the configuration, the barcode removal step will not be re-run when you resume the pipeline and the previous barcode length is in effect. See also the next section. Changing parameters and re-running parts of the pipeline -------------------------------------------------------- When you experiment with parameters in the ``igdiscover.yaml`` file, such as germline filtering criteria, you do not need to re-run the entire pipeline from the beginning, but can re-use the results that already exist. This can save a lot of processing time, in particular when you avoid re-running IgBLAST in this way. As described in the previous section, ``igdiscover run`` automatically figures out which files need to be re-created if a run was interrupted. Unfortunately, this mechanism is currently not smart enough to also look for changes in the ``igdiscover.yaml`` file. Thus, if the full pipeline has finished successfully, then re-running ``igdiscover run`` will just print the message ``Nothing to be done.`` even after you have changed the configuration file. You will therefore need to know yourself which file you want to regenerate. Then proceed as follows. Note that these steps will remove parts of the existing results, and if you need to keep them, make a copy of your analysis directory first. 1. Change the configuration setting. 2. Delete the file that needs to be re-generated. Assume it is ``filename``. 3. Run ``igdiscover run filename`` to re-create the file. Only that file will be created, not the ones that usually would be created afterwards. 4. Optionally, run ``igdiscover run`` (without a file name this time) to update the remaining files (those that depend on the file that was just updated). For example, assume you want to modify some germline filtering setting and then re-run the pipeline. 
Change the setting in your ``igdiscover.yaml``, then run these commands:: rm iteration-01/new_V_germline.tab igdiscover run iteration-01/new_V_germline.tab The above will have regenerated the ``iteration-01/new_V_germline.tab`` file and also the ``iteration-01/new_V_germline.fasta`` file since they are generated by the same script. If you want to update any other files, then also run :: igdiscover run .. _analysis-directory: The analysis directory ====================== IgDiscover writes all intermediate files, the final V gene database, statistics and plots into the analysis directory that was created with ``igdiscover init``. Inside that directory, there is a ``final/`` subdirectory that contains the analysis results. These are the files and subdirectories that can be found in the analysis directory. Subdirectories are described in detail below. igdiscover.yaml The configuration file. Make sure to adjust this to your needs as described above. reads.1.fastq.gz, reads.2.fastq.gz Symbolic links to the raw paired-end reads. database/ The input V/D/J database (as three FASTA files). The files are a copy of the ones you selected when running ``igdiscover init``. reads/ Processed reads (merged, de-duplicated etc.) iteration-xx/ Iteration-specific analysis directory, where “xx” is a number starting from 01. Each iteration is run in one of these directories. The first iteration (in ``iteration-01``) uses the original input database, which is also found in the ``database/`` directory. The database is updated and then used as input for the next iteration. final/ After the last iteration, IgBLAST is run again on the input sequences, but using the final database (the one created in the very last iteration). This directory contains all the results, such as plots of the repertoire profiles. If you set the number of iterations to 0 in the configuration file, this directory is the only one that is created. .. _final-results: Final results ------------- Final results are found in the ``final/`` subdirectory of the analysis directory. final/database/(V,D,J).fasta These three files represent the final, individualized V/D/J database found by IgDiscover. The D and J files are copies of the original starting database; they are not updated by IgDiscover. final/dendrogram_(V,D,J).pdf These three PDF files contain dendrograms of the V, D and J sequences in the individualized database. final/assigned.tab.gz V/D/J gene assignments and other information for each sequence. The file is created by parsing the IgBLAST output in the ``igblast.txt.gz`` file. This is a table that contains one row for each input sequence. See below for a detailed description of the columns. final/filtered.tab.gz Filtered V/D/J gene assignments. This is the same as the assigned.tab file mentioned above, but with low-quality assignments filtered out. Run ``igdiscover filter --help`` to see the filtering criteria. final/expressed_(V,D,J).tab, final/expressed_(V,D,J).pdf The V, D and J gene expression counts. Some assignments are filtered out to reduce artifacts. In particular, an allele-ratio filter of 10% is applied. For D genes, only those with an E-value of at most 1E-4 and a coverage of at least 70% are counted. See also the help for the ``igdiscover count`` subcommand, which is used to create these files. The ``.tab`` file contains the counts as a table, while the PDF file contains a plot of the same values. These tables also exist in the iteration-specific directories (``iteration-xx``). 
For those, note that the numbers do not include the genes that were discovered in that iteration. For example, ``iteration-01/expressed_V.tab`` shows only expression counts of the V genes in the starting database. final/errorhistograms.pdf A PDF with one page per V gene/allele. Each page shows a histogram of the percentage differences for that gene. final/clusterplots/ This is a directory that contains one PNG file for each discovered gene/allele. Each image shows a clusterplot of all the sequences assigned to that gene. Note that the shown clusterplots are by default restricted to showing only at most 300 sequences, while the actual clustering used by IgDiscover uses 1000 sequences. If you are interested in the results of each iteration, you can inspect the iteration-xx/ directories. They are structured in the same way as the final/ subdirectory, except that the results are based on the intermediate databases of that iteration. They also contain the following additional files. iteration-xx/candidates.tab A table with candidate novel V alleles (or genes). This is a list of sequences found through the *windowing strategy* or *linkage cluster analysis*, as discussed in our paper. See :ref:`the full description of candidates.tab `. iteration-xx/read_names_map.tab For each candidate novel V allele listed in ``candidates.tab``, this file contains one row that lists which sequences went into generating this candidate. Only the exact matches are listed, that is, the number of listed sequence names should be equal to the value in the *exact* column. Each line in this file contains tab-separated values. The first is the name of the candidate, the others are the names of the sequences. Some of these sequences may be consensus sequences if barcode grouping was enabled, so in that case, this will not be a read name. iteration-xx/new_V_germline.fasta, iteration-xx/new_V_pregermline.fasta The discovered list of V genes for this iteration. The file is created from the ``candidates.tab`` file by applying either the germline or pre-germline filter. The file resulting from application of the germline filter is used in the last iteration only. The file resulting from application of the pre-germline filter is used in earlier iterations. iteration-xx/annotated_V_germline.tab, iteration-xx/annotated_V_pregermline.tab A version of the ``candidates.tab`` file that is annotated with extra columns that describe why a candidate was filtered out. See :ref:`the description of this file `. Other files ----------- For completeness, here is a description of the files in the ``reads/`` and ``stats/`` directories. They are created during pre-processing and are not iteration specific. reads/1-limited.1.fastq.gz, reads/1-limited.2.fastq.gz Input read files limited to the first N entries. These are just symbolic links to the input files if the ``limit`` configuration option is not set. reads/2-merged.fastq.gz Reads merged with PEAR or FLASH reads/3-forward-primer-trimmed.fastq.gz Merged reads with 5' primer sequences removed. (This file is automatically removed when it is not needed anymore.) reads/4-trimmed.fastq.gz Merged reads with 5' and 3' primer sequences removed. reads/5-filtered.fasta Merged, primer-trimmed sequences converted to FASTA, and too short sequences removed. (This file is automatically removed when it is not needed anymore.) reads/sequences.fasta.gz Fully pre-processed sequences. That is, filtered sequences without duplicates (using VSEARCH) stats/reads.txt Statistics of pre-processed sequences. 
stats/readlengths.txt, stats/readlengths.pdf Histogram of the lengths of pre-processed sequences (created from ``reads/sequences.fasta``) Format of output files ====================== assigned.tab.gz --------------- This file is a gzip-compressed table with tab-separated values. It is created by the ``igdiscover igblast`` subcommand and is the result of parsing raw output from IgBLAST. It contains a few additional columns that do not come directly from IgBLAST. In particular, the CDR3 sequence is detected, the sequence before the V gene match is split into *UTR* and *leader*, and the RACE-specific run of G nucleotides is also detected. The first row is a header row with column names. Each subsequent row describes the IgBLAST results for a single pre-processed input sequence. Note: This file is typically quite large. LibreOffice can open the file directly (even though it is compressed), but make sure you have enough RAM. Columns: count How many copies of the input sequence this query sequence represents. Copied from the ``;size=3;`` entry in the FASTA header field that is added by ``VSEARCH -derep_fulllength``. V_gene, D_gene, J_gene V/D/J gene match for the query sequence stop whether the sequence contains a stop codon (either “yes” or “no”) productive whether the sequence is productive V_covered, D_covered, J_covered percentage of bases of the reference gene that is covered by the bases of the query sequence V_evalue, D_evalue, J_evalue E-value of V/D/J hit FR1_SHM, CDR1_SHM, FR2_SHM, CDR2_SHM, FR3_SHM, V_SHM, J_SHM rate of somatic hypermutation (actually, an error rate) V_errors, J_errors Absolute number of errors (differences) in the V and J gene match UTR Sequence of the 5' UTR (the part before the V gene match up to, but not including, the start codon) leader Leader sequence (the part between UTR and the V gene match) CDR1_nt, CDR1_aa, CDR2_nt, CDR2_aa, CDR3_nt, CDR3_aa nucleotide and amino acid sequence of CDR1/2/3 V_nt, V_aa Nucleotide and amino acid sequence of V gene match V_CDR3_start Start coordinate of CDR3 within ``V_nt``. Set to zero if no CDR3 was detected. Comparisons involving the V gene ignore those V bases that are part of the CDR3. V_end, VD_junction, D_region, DJ_junction, J_start nucleotide sequences for various match regions name, barcode, race_G, genomic_sequence see the following explanation The UTR, leader, barcode, race_G and genomic_sequence columns are filled in the following way. 1. Split the 5' end barcode from the sequence (if barcode length is zero, this will be empty), put it in the **barcode** column. 2. Remove the initial run of G bases from the remaining sequence, put that in the **race_G** column. 3. The remainder is put into the **genomic_sequence** column. 4. If there is a V gene match, take the sequence *before* it and split it up in the following way. Search for the start codon and write the part before it into the **UTR** column. Write the part starting with the start codon into the **leader** column. filtered.tab.gz --------------- This table is the same as the ``assigned.tab.gz`` table, except that rows containing low-quality matches have been filtered out. Rows fulfilling any of the following criteria are filtered: - The J gene was not assigned - A stop codon was found - The V gene coverage is less than 90% - The J gene coverage is less than 60% - The V gene E-value is greater than 10\ :sup:`-3` .. _candidates_tab: candidates.tab -------------- This table contains the candidates for novel V genes found by the ``discover`` subcommand. 
Like the other files, it is a text file in tab-separated values format, with the first row containing the column headings. It can be opened directly in LibreOffice, for example. Candidates are found by inspecting all the sequences assigned to a database gene, and clustering them in multiple ways. The candidate sequences are found by computing a consensus from each found cluster. Each row describes a single candidate, but possibly multiple clusters. If there are multiple clusters from a single gene that lead to the same consensus sequence, then they get only one row. The *cluster* column lists the source clusters for the given sequence. Duplicate sequences can still occur when two different genes lead to identical consensus sequences. (These duplicated sequences are merged by the germline filters.) Below, we use the term *cluster set* to refer to all the sequences that are in any of the listed clusters. Some clusters lead to ambiguous consensus sequences (those that include ``N`` bases). These have already been filtered out. name The name of the candidate gene. See :ref:`novel gene names `. source The original database gene to which the sequences from this row were originally assigned. All candidates coming from the same source gene are grouped together. chain Chain type: *VH* for heavy, *VK* for light chain kappa, *VL* for light chain lambda cluster From which type of cluster or clusters the consensus was computed. If there are multiple clusters that give rise to the same consensus sequence, they are all listed here, separated by semicolons. A cluster name such as ``2-4`` is for a percentage difference window: Such a cluster consists of all sequences assigned to the source gene that have a percentage difference to it between 2 and 4 percent. A cluster name such as ``cl3`` describes a cluster generated through linkage cluster analysis. The clusters are simply named ``cl1``, ``cl2``, ``cl3`` etc. If any cluster number seems to be missing (such as when cl1 and cl3 occur, but not cl2), then this means that the cluster led to an ambiguous consensus sequence that has been filtered out. Since the ``cl`` clusters are created from a random subsample of the data (in order to keep computation time down), they are never larger than the size of the subsample (currently 1000). The name ``db`` represents a cluster that is identical to the database sequence. If no actual cluster corresponding to the database sequence is found, but the database sequence is expressed, a ``db`` cluster is inserted artificially in order to make sure that the sequence is not lost. The cluster name ``all`` represents the set of all sequences assigned to the source gene. This means that an unambiguous consensus could be computed from all the sequences. Typically, this happens during later iterations when there are no more novel sequences among the sequences assigned to the database gene. cluster_size The number of sequences from which the consensus was computed. Equivalently, the size of the cluster set (all clusters described in this row). Sequences that are in multiple clusters at the same time are counted only once. Js The number of unique J genes associated with the sequences in the cluster set. Consensus sequences are computed only from V gene sequences, but each V gene sequence is part of a full V/D/J sequence. We therefore know for each V sequence which J gene it was found with. This number says how many different J genes were found for all sequences that the consensus in this row was computed from. 
CDR3s The number of unique CDR3 sequences associated with the sequences in the cluster set. See also the description for the *Js* column. This number says how many different CDR3 sequences were found for all sequences that the consensus in this row was computed from. exact The number of exact occurrences of the consensus sequence among all sequences assigned to the source gene, ignoring the 3' junction region. To clarify: While the consensus sequence is computed only from a subset of sequences assigned to a source gene, *all* sequences assigned to the source gene are searched for exact occurrences of that consensus sequence. When comparing sequences, they are first truncated at the 3' end by removing those (typically 8) bases that correspond to the CDR3 region. barcodes_exact How many unique barcode sequences were used by the sequences in the set of exact sequences (described above). Ds_exact How many unique D genes were used by the sequences in the set of exact sequences (described above). Only those D gene assignments are included in this count for which the number of errors is zero, the E-value is at most a given threshold, and for which the number of covered bases is at least a given percentage. Js_exact How many unique J genes were used by the sequences in the set of exact sequences (described above). CDR3s_exact How many unique CDR3 sequences were used by the sequences in the set of exact sequences (described above). clonotypes The estimated number of clonotypes within the set of exact sequences (which is described above). The value is computed by clustering the unique CDR3 sequences associated with all exact occurrences, allowing up to six differences (mismatches, insertions, deletions) and then counting the number of resulting clusters. database_diff The number of differences between the consensus sequence and the sequence of the source gene. (Given as edit distance; that is, insertions, deletions and mismatches each count as one difference.) has_stop Indicates whether the consensus sequence contains a stop codon. looks_like_V Whether the consensus sequence “looks like” a true V gene (1 if yes, 0 if no). Currently, this checks whether the 5' end of the sequence matches a known V gene motif. CDR3_start Where the CDR3 starts within the discovered V gene sequence. This uses the most common CDR3 start location among the sequences from which this consensus is derived. consensus The consensus sequence itself. The ``igdiscover discover`` command can also be run by hand with other parameters, in which case additional columns may appear. N_bases Number of ``N`` bases in the consensus .. _annotated_v_tab: annotated_V_*.tab ----------------- The two files ``annotated_V_germline.tab`` and ``annotated_V_pregermline.tab`` are copies of the ``candidates.tab`` file with two extra columns that show *why* a candidate was filtered in the germline and pre-germline filtering steps. The two columns are: * ``is_filtered`` – This is a number that indicates how many filtering criteria apply to this candidate. * ``why_filtered`` – This is a semicolon-separated list of filtering reasons. The following values can occur in the ``why_filtered`` column: too_low_dbdiff The number of differences between this candidate and the database is lower than the required number. too_many_N_bases The candidate contains too many ``N`` nucleotide wildcard characters. too_low_CDR3s_exact The ``CDR3s_exact`` value for this candidate is lower than required. 
too_high_CDR3_shared_ratio The ``CDR3_shared_ratio`` is higher than the configured threshold. too_low_Js_exact The ``Js_exact`` value is lower than the configured threshold. has_stop The filter configuration disallows stop codons, but this candidate has one and is not whitelisted. too_low_cluster_size The ``cluster_size`` of this candidate is lower than the configured threshold, and the candidate is not whitelisted. is_duplicate A filtering criterion not listed above applies to this candidate. This covers all the filters that need to compare candidates to each other: cross-mapping ratio, clonotype allele ratio, exact ratio, Ds_exact ratio. .. _gene-names: Names for discovered genes -------------------------- Each gene discovered by IgDiscover gets a unique name such as “VH4.11_S1234”. The “VH4.11” is the name of the database gene to which the novel V gene was initially assigned. The number *1234* is derived from the nucleotide sequence of the novel gene. That is, if you discover the same sequence in two different runs of IgDiscover, or just in different iterations, the number will be the same. This may help when manually inspecting results. Be aware that you still need to check the sequence itself since even different sequences can sometimes lead to the same number (a “hash collision”). The ``_S1234`` suffixes do not accumulate. Before IgDiscover adds the suffix in an iteration, it removes the suffix if it already exists. Subcommands =========== The ``igdiscover`` program has multiple subcommands. You should already be familiar with the two commands ``init`` and ``run``. Each subcommand comes with its own help page that shows how to use that subcommand. Run the command with the ``--help`` option to see the help. For example, :: igdiscover run --help shows the help for the ``run`` subcommand. The following additional subcommands may be useful for further analysis. commonv Find common V genes between two different antibody libraries upstream Cluster upstream sequences (UTR and leader) for each gene dendrogram Draw a dendrogram of sequences in a FASTA file. rename Rename sequences in a target FASTA file using a template FASTA file union Compute union of sequences in multiple FASTA files The following subcommands are used internally, and listed here for completeness. filter Filter a table with IgBLAST results count Count and plot V, D, J gene usage group Group sequences by barcode and V/J assignment and print each group’s consensus (unused in IgDiscover) germlinefilter Create new V gene database from V gene candidates using the germline and pre-germline filter criteria. discover Discover candidate new V genes within a single antibody library clusterplot For each V gene, plot a clustermap of the sequences assigned to it errorplot Plot histograms of differences to reference V gene .. _germline-filters: Germline and pre-germline filtering =================================== V gene sequences found by the clustering step of the program (the ``discover`` subcommand) are stored in the ``candidates.tab`` file. The entries are “candidates” because many of these will be PCR or other artifacts and therefore do not represent true novel V genes. The germline and pre-germline filters take care of removing artifacts. The germline filter is the “real” filter and is used only in the last iteration in order to obtain the final gene database. The pre-germline filter is less strict and is used in all the earlier iterations. The germline filters are implemented in the ``igdiscover germlinefilter`` subcommand. 
It performs the following filtering and processing steps: * Discard sequences with ``N`` bases * Discard sequences that come from a consensus over too few source sequences * Discard sequences with too few unique CDR3s (CDR3s_exact column) * Discard sequences with too few unique Js (Js_exact column) * Discard sequences identical to one of the database sequences (if DB given) * Discard sequences that do not match a set of known good motifs * Discard sequences that contain a stop codon (has_stop column) * Discard near-duplicate sequences * Discard cross-mapping artifacts * Discard sequences whose “allele ratio” is too low. If a whitelist of sequences is provided (by default, this is the input V gene database), then the candidates that appear on it * are not checked for the cluster size criterion, * do not need to match a set of known good motifs, * are never considered near-duplicates (but they are checked for cross-mapping and for the allele ratio), * are allowed to contain a stop codon. Whitelisting allows IgDiscover to identify known germline sequences that are expressed at low levels in a library. If enabled with ``whitelist: true`` (the default) in the pregermline and germline filter sections of the configuration file, the sequences present in the starting database are treated as validated germline sequences and will not be discarded due to a too small cluster size, as long as they fulfill the remaining criteria (unique_cdr3s, unique_js etc.). You can see why a candidate was filtered by inspecting the :ref:`annotated_V_*.tab files `. .. _cross-mapping: Cross-mapping artifacts ----------------------- If two very similar sequences appear in the database used by IgBLAST, then sequencing errors may lead to one sequence incorrectly being assigned to the other. This is particularly problematic if one of the sequences is highly expressed while the other is not expressed at all. The unexpressed sequence is even included in the list of V gene candidates because it is in the input database and therefore whitelisted. We call this a “cross-mapping artifact”. The germline filtering step of IgDiscover therefore aims to eliminate cross-mapping artifacts by checking all pairs of sequences for the following: * The two sequences have a distance of 1, * they are both in the database for that particular iteration (only then can cross-mapping occur) * the ratio between the expression levels of the two sequences (using the cluster_size field in the ``candidates.tab`` file) is less than the value ``cross_mapping_ratio`` defined in the configuration file (0.02 by default). If all that is the case, then the sequence with the lower expression is discarded. .. _allele-ratio: Allele-ratio filtering ---------------------- When multiple alleles of the same gene appear in the list of V gene candidates, such as IGHV1-2*02 and IGHV1-2*04, the germline filter computes the ratio of the values in the ``exact`` and the ``clonotypes`` columns between them. If the ratio is under the configured threshold, the candidate with the lower count is discarded. For example, at a threshold of 0.1, a candidate with an ``exact`` count of 30 is discarded if another allele of the same gene has an ``exact`` count of 1000, since 30/1000 = 0.03 is below the threshold. See the ``exact_ratio`` and ``clonotype_ratio`` settings in the ``germline_filter`` and ``pre_germline_filter`` sections of the configuration file. .. versionadded:: 0.7.0 Data from the Sequence Read Archive (SRA) ========================================= To work with datasets from the Sequence Read Archive, you may want to use the tool ``fastq-dump``, which can download the reads in the format required by IgDiscover. 
You just need to know the accession number, such as “SRR2905710”, and then run this command to download the files to the current directory:: fastq-dump --split-files --gzip SRR2905710 The ``--split-files`` option ensures that the paired-end reads are stored in two separate files, one for the forward and one for the reverse read, respectively. (If you do not provide it, you will get an interleaved FASTQ file that currently cannot be read by IgDiscover). The ``--gzip`` option creates compressed output. The command creates two files in the current directory. In the above example, they would be named ``SRR2905710_1.fastq.gz`` and ``SRR2905710_2.fastq.gz``. The program ``fastq-dump`` is part of the SRA toolkit. On Debian-derived Linux distributions, you can typically install it with ``sudo apt-get install sra-toolkit``. On Conda, install it with ``conda install -c bioconda sra-tools``. Does random subsampling influence results? ========================================== Random subsampling does somewhat influence which sequences are found by the cluster analysis, particularly in the beginning. However, it is very likely that all highly expressed sequences are represented in the random sample. Also, due to the database growing with subsequent iterations, the set of sequences assigned to a single database gene becomes smaller and more homogeneous. This makes it increasingly likely that sequences expressed at lower levels also result in a cluster, since they now make up a larger fraction of each subsample. Also, many of the clusters which are captured in one subsample but not in the other are artifacts that are then filtered out anyway by the pre-germline or germline filter. On human data with a nearly complete starting database, the subsampling seems to have no influence at all, as we determined experimentally. We repeated a run of the program four times on the same human dataset, using identical parameters each time except that the subsampling was done in a different way. Although intermediate results differed, all four personalized databases that the program produced were exactly identical. Concordance is lower, though, when the input database is not as complete as the human one. The way in which random subsampling is done is modified by the ``seed`` configuration setting, which is set to 1 by default. If its value is the same for two different runs of the program with otherwise identical settings, the numbers chosen by the random number generator will be the same and therefore subsampling will also be done in an identical way. This makes runs of the program reproducible. In order to test how results differ when subsampling is done in a different way, change the ``seed`` to a different value. Logging the program’s output to a file ====================================== When you report a bug or unusual behavior to us, we might ask you to send us the output of ``igdiscover run``. You can send its output to a file by running the program like this:: igdiscover run >& logfile.txt And here is how to send the logging output to a file *and* also see the output in your terminal at the same time (but you lose the colors):: igdiscover run |& tee logfile.txt .. _caching: Caching of IgBLAST results and of merged reads ============================================== Sometimes you may want to re-analyze a dataset multiple times with different filter settings. To speed this up, IgDiscover can cache the results of two of the most time-consuming steps, read-merging with PEAR and running IgBLAST. 
The cache is disabled by default as it uses a lot of disk space. To enable the cache, create a file named ``~/.config/igdiscover.conf`` with the following contents:: use_cache: true If you do so, a directory named ``~/.cache/igdiscover/`` is created the next time you run IgDiscover and all IgBLAST results as well as merged reads from PEAR are stored there. On subsequent runs, the existing result is used directly without calling the respective program, which speeds up the pipeline considerably. The cache is only used when we are certain that the results will indeed be the same. For example, if the IgBLAST program version or the V/D/J database changes, the cached result is not used. The files in the cache are compressed, but the cache may still get large over time. You can delete the cache with ``rm -r ~/.cache/igdiscover`` to free the space. You should also delete the cache when updating to a newer IgBLAST version as the old results will not be used anymore. Terms ===== Analysis directory The directory that was created with ``igdiscover init``. Separate ones are created for each experiment. When you used ``igdiscover init myexperiment``, the analysis directory would be ``myexperiment/``. Starting database The initial list of V/D/J genes. These are expected to be in FASTA format and are copied into the ``database/`` directory within each analysis directory. IgDiscover-0.11/doc/index.rst000066400000000000000000000002721337725263500161150ustar00rootroot00000000000000.. include:: ../README.rst ======== Contents ======== .. toctree:: :maxdepth: 2 installation manual-installation testing guide faq advanced develop changes IgDiscover-0.11/doc/installation.rst000066400000000000000000000106601337725263500175110ustar00rootroot00000000000000============ Installation ============ IgDiscover is written in Python 3 and is developed on Linux. The tool also runs on macOS, but is not as well tested on that platform. For installation on either system, we recommend that you follow the instructions below, which will first explain how to install the `Conda `__ package manager. IgDiscover is available as a Conda package from `the bioconda channel `__. Using Conda will make the installation easy because all dependencies are also available as Conda packages and can thus be installed automatically along with IgDiscover. There are also :ref:`non-Conda installation instructions ` if you cannot use Conda. .. _install-with-conda: Installing IgDiscover with Conda -------------------------------- 1. Install `Conda `__ by following the `conda installation instructions `_ as appropriate for your system. You will need to choose between a “Miniconda” and “Anaconda” installation. We recommend Miniconda as the download is smaller. If you are in a hurry, these two commands are usually sufficient to install Miniconda on Linux (read the linked document for macOS instructions):: wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh bash Miniconda3-latest-Linux-x86_64.sh When the installer asks you about modifying the ``PATH`` in your ``.bashrc`` file, answer ``yes``. 2. Close the terminal window and open a new one. Then test whether conda is installed correctly by running :: conda --version If you see the conda version number, it worked. 3. Set up Conda so that it can access the `bioconda channel `__. For that, follow `the instructions on the bioconda website `_ or simply run these commands:: conda config --add channels defaults conda config --add channels bioconda conda config --add channels conda-forge 4. 
4. Install IgDiscover with this command:: conda create -n igdiscover igdiscover This will create a new so-called “environment” for IgDiscover (retry if it fails). **Whenever you want to run IgDiscover, you will need to activate the environment with this command**:: source activate igdiscover 5. Make sure you have activated the ``igdiscover`` environment. Then test whether IgDiscover is correctly installed with this command:: igdiscover --version If you see the version number of IgDiscover, it worked! If an error message appears that says "The 'networkx' distribution was not found and is required by snakemake", install networkx manually with:: pip install networkx==2.1 Then retry checking the IgDiscover version. 6. You can now :ref:`run IgDiscover on the test data set ` to familiarize yourself with how it works. .. _troubleshooting: Troubleshooting on Linux ------------------------ If you use ``zsh`` instead of ``bash`` (this applies to Bio-Linux, for example), the ``$PATH`` environment variable will not be set up correctly by the Conda installer. The Miniconda installer adds a line ``export PATH=...`` to the end of your ``/home/your-user-name/.bashrc`` file. Copy that line from the file and add it to the end of the file ``/home/your-user-name/.zshrc`` instead. Alternatively, change your default shell to bash by running ``chsh -s /bin/bash``. If you use conda and see an error that includes something like this:: ImportError: .../.local/lib/python3.5/site-packages/sqt/_helpers.cpython-35m-x86_64-linux-gnu.so: undefined symbol: PyFPE_jbuf or any other error that mentions a ``.local/`` directory, then a previous installation of IgDiscover is interfering with the conda installation. The easiest way to solve this problem is to delete the directory ``.local/`` in your home directory; see also :ref:`how to remove IgDiscover from a Linux system `. Troubleshooting on macOS ------------------------ If you get the error :: ValueError: unknown locale: UTF-8 then follow `these instructions `_. Development version ------------------- To install IgDiscover directly from the most recent source code, :ref:`read the developer installation instructions `. IgDiscover-0.11/doc/manual-installation.rst000066400000000000000000000072301337725263500207630ustar00rootroot00000000000000.. _manual-installation: Manual installation =================== IgDiscover requires quite a few other software tools that are not included in most Linux distributions (or macOS) and that are also not available from the Python packaging index (PyPI) because they are not Python tools. If you do not use the :ref:`recommended simple installation instructions via Conda `, you need to install those non-Python dependencies manually. Regular Python dependencies are automatically pulled in when IgDiscover itself is installed in the last step with the ``pip install`` command. The instructions below are written for Linux and require modifications if you want to try this on macOS. .. note:: We recommend the much simpler :ref:`installation via Conda ` instead of using the instructions in this section. Install non-Python dependencies ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The dependencies are: MUSCLE, IgBLAST, PEAR, and -- optionally -- flash. 1. Install Python 3.5 or newer. It is most likely already installed on your system, but on Debian/Ubuntu, you can get it with :: sudo apt-get install python3 2. Create the directory where binaries will be installed. We assume ``$HOME/.local/bin`` here, but this can be anywhere as long as the directory is in your ``$PATH``.
:: mkdir -p ~/.local/bin Add this line to the end of your ``~/.bashrc`` file:: export PATH=$HOME/.local/bin:$PATH Then either start a new shell or run ``source ~/.bashrc`` to apply the changes. 3. Install MUSCLE. This is available as a package in Ubuntu:: sudo apt-get install muscle If your distribution does not have a 'muscle' package or if you are not allowed to run ``sudo``:: wget -O - http://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86linux64.tar.gz | tar xz mv muscle3.8.31_i86linux64 ~/.local/bin/ 4. Install PEAR:: wget http://sco.h-its.org/exelixis/web/software/pear/files/pear-0.9.6-bin-64.tar.gz tar xvf pear-0.9.6-bin-64.tar.gz mv pear-0.9.6-bin-64/pear-0.9.6-bin-64 ~/.local/bin/pear 5. Install IgBLAST:: wget ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/1.4.0/ncbi-igblast-1.4.0-x64-linux.tar.gz tar xvf ncbi-igblast-1.4.0-x64-linux.tar.gz mv ncbi-igblast-1.4.0/bin/igblast? ~/.local/bin/ IgBLAST requires some data files that must be downloaded separately. The following commands put the files into ``~/.local/igdata``:: mkdir ~/.local/igdata cd ~/.local/igdata wget -r -nH --cut-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/internal_data wget -r -nH --cut-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/database/ wget -r -nH --cut-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/optional_file/ Also, you must set the ``$IGDATA`` environment variable to point to the directory with the data files. Add this line to your ``~/.bashrc``:: export IGDATA=$HOME/.local/igdata Then run ``source ~/.bashrc`` to apply the changes. 6. Optionally, install flash:: wget -O FLASH-1.2.11.tar.gz http://sourceforge.net/projects/flashpage/files/FLASH-1.2.11.tar.gz/download tar xf FLASH-1.2.11.tar.gz cd FLASH-1.2.11 make mv flash ~/.local/bin/ Install IgDiscover ~~~~~~~~~~~~~~~~~~ Install IgDiscover with the Python package manager ``pip``, which will download and install IgDiscover and its dependencies:: pip3 install --user igdiscover This command also installs all remaining Python dependencies. The ``--user`` option instructs ``pip`` to install everything into ``$HOME/.local``. Finally, check the installation with :: igdiscover --version and you should see the version number of IgDiscover. You should now :ref:`run IgDiscover on the test data set `. IgDiscover-0.11/doc/preliminary-logo.jpeg [binary JPEG image data not shown]
IgDiscover-0.11/doc/testing.rst000066400000000000000000000061051337725263500164640ustar00rootroot00000000000000.. _test: ============= Test data set ============= After installing IgDiscover, you should run it once on a small test data set that we provide, both to test your installation and to familiarize yourself with running the program. 1. Download and unpack `the test data set (version 0.5)`_. To do this from the command line, use these commands:: wget https://bitbucket.org/igdiscover/testdata/downloads/igdiscover-testdata-0.5.tar.gz tar xvf igdiscover-testdata-0.5.tar.gz .. _the test data set (version 0.5): https://bitbucket.org/igdiscover/testdata/downloads/igdiscover-testdata-0.5.tar.gz The test data set contains some paired-end reads from the human IgM heavy chain dataset ERR1760498 and a database of IGHV, IGHD and IGHJ sequences based on Ensembl annotations. You should use a database of higher quality for your own experiments. 2. Initialize the IgDiscover pipeline directory:: igdiscover init --db igdiscover-testdata/database/ --reads igdiscover-testdata/reads.1.fastq.gz discovertest Here, ``discovertest`` is the name of the pipeline directory that will be created. Note that only the path to the *first* reads file needs to be given; the second file is found automatically. There may be a couple of messages “Skipping 'x' because it contains the same sequence as 'y'”, which you can ignore. The command will have printed a message telling you that the pipeline directory has been initialized, that you should edit the configuration file, and how to actually run IgDiscover after that.
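At this point, the new pipeline directory should contain roughly the following (a sketch for paired-end FASTQ input):: discovertest/ igdiscover.yaml (the configuration file) database/ (V.fasta, D.fasta, J.fasta copied from the starting database) reads.1.fastq.gz, reads.2.fastq.gz (the input reads)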
3. The generated ``igdiscover.yaml`` configuration file does not actually need to be edited for the test dataset, but you may still want to read through it, as you will need to do so for your own data. You may want to do this while the pipeline is running in the next step. The configuration is in YAML format. When editing the file, just follow the way it is already structured. 4. Run the analysis. To do so, change into the pipeline directory and run this command:: cd discovertest && igdiscover run On this small dataset, running the pipeline should take no more than about 5 minutes. 5. Finally, inspect the results in the ``discovertest/iteration-01`` or ``discovertest/final`` directories. The discovered V genes and extra information are listed in ``discovertest/iteration-01/new_V_germline.tab``. Discovered J genes are in ``discovertest/iteration-01/new_J.tab``. There are also corresponding ``.fasta`` files with the sequences only. See the :ref:`explanation of final result files `. Other test data sets -------------------- ENA project `PRJEB15295 `_ contains the data for our Nature Communications paper from 2016, in particular `ERR1760498 `_, which is the data for the human “H1” sample (multiplex PCR, IgM heavy chain). Data used for testing TCR detection (human, RACE): `SRR2905677 `_ and `SRR2905710 `_. IgDiscover-0.11/environment.yml000066400000000000000000000064051337725263500166020ustar00rootroot00000000000000# abstract dependencies: # python=3.6 seaborn sqt snakemake-minimal cutadapt muscle pear flash igblast=1.7 ruamel.yaml nomkl channels: - bioconda - conda-forge - defaults dependencies: - bcftools=1.9=h4da6232_0 - cutadapt=1.18=py36_0 - flash=1.2.11=ha92aebf_2 - htslib=1.9=hc238db4_4 - igblast=1.7.0=pl5.22.0_0 - libdeflate=1.0=h470a237_0 - muscle=3.8.1551=h2d50403_3 - pear=0.9.6=he4cf2ce_4 - pysam=0.15.1=py36h0380709_0 - samtools=1.9=h8ee4bcc_1 - snakemake-minimal=5.3.0=py_2 - sqt=0.8.0=py36h470a237_2 - xopen=0.3.5=py_0 - appdirs=1.4.3=py_1 - asn1crypto=0.24.0=py36_1003 - attrs=18.2.0=py_0 - blas=1.1=openblas - bz2file=0.98=py_0 - bzip2=1.0.6=h470a237_2 - ca-certificates=2018.10.15=ha4d7672_0 - certifi=2018.10.15=py36_1000 - cffi=1.11.5=py36h5e8e0c9_1 - chardet=3.0.4=py36_1003 - configargparse=0.13.0=py_1 - cryptography=2.3.1=py36hdffb7b8_0 - cryptography-vectors=2.3.1=py36_1000 - curl=7.62.0=h74213dd_0 - cycler=0.10.0=py_1 - dbus=1.13.0=h3a4f0e9_0 - docutils=0.14=py36_1001 - expat=2.2.5=hfc679d8_2 - fontconfig=2.13.1=h65d0f4c_0 - freetype=2.9.1=h6debe1e_4 - gettext=0.19.8.1=h5e8e0c9_1 - gitdb2=2.0.5=py_0 - gitpython=2.1.11=py_0 - glib=2.56.2=h464dc38_1 - gst-plugins-base=1.12.5=hde13a9d_0 - gstreamer=1.12.5=h5856ed1_0 - icu=58.2=hfc679d8_0 - idna=2.7=py36_1002 - jpeg=9c=h470a237_1 - jsonschema=3.0.0a3=py36_1000 - kiwisolver=1.0.1=py36h2d50403_2 - krb5=1.16.2=hbb41f41_0 - libcurl=7.62.0=hbdb9355_0 - libedit=3.1.20170329=haf1bffa_1 - libffi=3.2.1=hfc679d8_5 - libgcc=7.2.0=h69d50b8_2 - libgfortran=3.0.0=1 - libiconv=1.15=h470a237_3 - libpng=1.6.35=ha92aebf_2 - libssh2=1.8.0=h5b517e9_3 - libstdcxx-ng=7.2.0=hdf63c60_3 - libuuid=2.32.1=h470a237_2 - libxcb=1.13=h470a237_2 - libxml2=2.9.8=h422b904_5 - matplotlib=3.0.2=py36h8a2030e_1 - matplotlib-base=3.0.2=py36h20b835b_1 - ncurses=6.1=hfc679d8_1 - numpy=1.15.4=py36_blas_openblashb06ca3d_0 - openblas=0.3.3=ha44fe06_1 - openssl=1.0.2p=h470a237_1 - pandas=0.23.4=py36hf8a1672_0 - patsy=0.5.1=py_0 - pcre=8.41=hfc679d8_3 - perl=5.22.0.1=0 - pigz=2.3.4=0 - pip=18.1=py36_1000 - psutil=5.4.8=py36h470a237_0 - pthread-stubs=0.4=h470a237_1 -
pycparser=2.19=py_0 - pyopenssl=18.0.0=py36_1000 - pyparsing=2.3.0=py_0 - pyqt=5.6.0=py36h8210e8a_7 - pyrsistent=0.14.7=py36h470a237_0 - pysocks=1.6.8=py36_1002 - python=3.6.7=h5001a0f_1 - python-dateutil=2.7.5=py_0 - pytz=2018.7=py_0 - pyyaml=3.13=py36h470a237_1 - qt=5.6.2=hf70d934_9 - ratelimiter=1.2.0=py36_1000 - readline=7.0=haf1bffa_1 - requests=2.20.1=py36_1000 - ruamel.yaml=0.15.80=py36h470a237_0 - scipy=1.1.0=py36_blas_openblashb06ca3d_202 - seaborn=0.9.0=py_0 - setuptools=40.6.2=py36_0 - sip=4.18.1=py36hfc679d8_0 - six=1.11.0=py36_1001 - smmap2=2.0.5=py_0 - sqlite=3.25.3=hb1c47c0_0 - statsmodels=0.9.0=py36h7eb728f_0 - tk=8.6.9=ha92aebf_0 - tornado=5.1.1=py36h470a237_0 - urllib3=1.23=py36_1001 - wheel=0.32.3=py36_0 - wrapt=1.10.11=py36h470a237_1 - xorg-libxau=1.0.8=h470a237_6 - xorg-libxdmcp=1.1.2=h470a237_7 - xz=5.2.4=h470a237_1 - yaml=0.1.7=h470a237_1 - zlib=1.2.11=h470a237_3 - datrie=0.7.1=py36h7b6447c_1 - libgcc-ng=8.2.0=hdf63c60_1 - nomkl=3.0=0 - pip: - snakemake==5.3.0 IgDiscover-0.11/igdiscover/000077500000000000000000000000001337725263500156445ustar00rootroot00000000000000IgDiscover-0.11/igdiscover/Snakefile000066400000000000000000000634221337725263500174770ustar00rootroot00000000000000# kate: syntax Python; space-indent off; tab-width 4; indent-width 4; import shutil import textwrap import json from sqt.dna import reverse_complement from sqt import FastaReader, SequenceReader import igdiscover from igdiscover.utils import relative_symlink from igdiscover.config import Config, GlobalConfig try: config = Config.from_default_path() except FileNotFoundError as e: sys.exit("Pipeline configuration file {!r} not found. Please create it!".format(e.filename)) # Use pigz (parallel gzip) if available GZIP = 'pigz' if shutil.which('pigz') is not None else 'gzip' PREPROCESSED_READS = 'reads/sequences.fasta.gz' if config.debug: # Do not delete intermediate files when debugging temp = lambda x: x # Targets for each iteration ITERATION_TARGETS = [ 'clusterplots/done', 'errorhistograms.pdf', 'v-shm-distributions.pdf', ] + expand(['expressed_{gene}.tab', 'expressed_{gene}.pdf', 'dendrogram_{gene}.pdf'], gene=['V', 'D', 'J']) # Targets for non-final iterations DISCOVERY_TARGETS = [ 'candidates.tab', 'new_V_germline.fasta', 'new_V_pregermline.fasta', ] TARGETS = expand('iteration-{nr:02d}/{path}', nr=range(1, config.iterations+1), path=ITERATION_TARGETS + DISCOVERY_TARGETS) TARGETS += [ 'stats/readlengths.pdf', 'stats/merging-successful', 'stats/trimming-successful', 'stats/stats_nofinal.json' ] if config.iterations >= 1: TARGETS += ['iteration-01/new_J.fasta'] FINAL_TARGETS = expand('final/{path}', path=ITERATION_TARGETS) + ['stats/stats.json'] rule all: input: TARGETS + FINAL_TARGETS message: "IgDiscover finished." 
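# Convenience target: requesting "nofinal" builds all discovery iterations and
# preprocessing targets, but skips the "final/" targets that are produced after
# the last iteration.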
rule nofinal: input: TARGETS if config.limit: rule limit_reads_gz: output: 'reads/1-limited.{nr,([12]\\.|)}{ext,(fasta|fastq)}.gz' input: 'reads.{nr}{ext}.gz' shell: 'sqt fastxmod -w 0 --limit {config.limit} {input} | {GZIP} > {output}' rule limit_reads: output: 'reads/1-limited.{nr,([12]\\.|)}{ext,(fasta|fastq)}.gz' input: 'reads.{nr}{ext}' shell: 'sqt fastxmod -w 0 --limit {config.limit} {input} | {GZIP} > {output}' else: rule symlink_limited: output: fastaq='reads/1-limited.{nr,([12]\\.|)}{ext,(fasta|fastq)}.gz' input: fastaq='reads.{nr}{ext}.gz' resources: time=1 run: relative_symlink(input.fastaq, output.fastaq, force=True) # TODO compressing the input file is an unnecessary step rule gzip_limited: output: fastaq='reads/1-limited.{nr,([12]\\.|)}{ext,(fasta|fastq)}.gz' input: fastaq='reads.{nr}{ext}' shell: '{GZIP} < {input} > {output}' # After the rules above, we either end up with # # 'reads/1-limited.1.fastq.gz' and 'reads/1-limited.2.fastq.gz' # or # 'reads/1-limited.fasta.gz' # or # 'reads/1-limited.fastq.gz' if config.merge_program == 'flash': rule flash_merge: """Use FLASH to merge paired-end reads""" output: fastqgz='reads/2-merged.fastq.gz', log='reads/2-flash.log' input: 'reads/1-limited.1.fastq.gz', 'reads/1-limited.2.fastq.gz' resources: time=60 threads: 8 shell: # -M: maximal overlap (2x300, 420-450bp expected fragment size) "time flash -t {threads} -c -M {config.flash_maximum_overlap} {input} 2> " ">(tee {output.log} >&2) | {GZIP} > {output.fastqgz}" rule parse_flash_stats: input: log='reads/2-flash.log' output: json='stats/reads.json' run: total_ex = re.compile(r'\[FLASH\]\s*Total reads:\s*([0-9]+)') merged_ex = re.compile(r'\[FLASH\]\s*Combined reads:\s*([0-9]+)') with open(input.log) as f: for line in f: match = total_ex.search(line) if match: total = int(match.group(1)) continue match = merged_ex.search(line) if match: merged = int(match.group(1)) break else: sys.exit('Could not parse the FLASH log file') d = OrderedDict({'total': total}) d['merged'] = merged d['merging_was_done'] = True with open(output.json, 'w') as f: json.dump(d, f) elif config.merge_program == 'pear': rule pear_merge: """Use pear to merge paired-end reads""" output: fastq='reads/2-merged.fastq.gz', log='reads/2-pear.log' input: fastq1='reads/1-limited.1.fastq.gz', fastq2='reads/1-limited.2.fastq.gz' log: 'reads/2-pear.log' resources: time=60 threads: 20 shell: "igdiscover merge -j {threads} {input.fastq1} {input.fastq2} {output.fastq} | tee {log}" rule parse_pear_stats: input: log='reads/2-pear.log' output: json='stats/reads.json' run: expression = re.compile(r"Assembled reads \.*: (?P[0-9,]*) / (?P[0-9,]*)") with open(input.log) as f: for line in f: match = expression.search(line) if match: merged = int(match.group('merged').replace(',', '')) total = int(match.group('total').replace(',', '')) break else: sys.exit('Could not parse the PEAR log file') d = OrderedDict({'total': total}) d['merged'] = merged d['merging_was_done'] = True with open(output.json, 'w') as f: json.dump(d, f) else: sys.exit("merge_program {config.merge_program!r} given in configuration file not recognized".format(config=config)) # This rule applies only when the input is single-end or already merged rule symlink_merged: output: fastaq='reads/2-merged.{ext,(fasta|fastq)}.gz' input: fastaq='reads/1-limited.{ext}.gz' run: relative_symlink(input.fastaq, output.fastaq, force=True) # After the rules above, we end up with # # 'reads/2-merged.fasta.gz' # or # 'reads/2-merged.fastq.gz' rule read_stats_single_fasta: """Compute 
statistics if no merging was done (FASTA input)""" output: json='stats/reads.json', input: fastagz='reads/1-limited.fasta.gz' run: total = count_sequences(input.fastagz) d = OrderedDict({'total': total}) d['merged'] = total d['merging_was_done'] = False with open(output.json, 'w') as f: json.dump(d, f) rule read_stats_single_fastq: """Compute statistics if no merging was done (FASTQ input)""" output: json='stats/reads.json', input: fastagz='reads/1-limited.fastq.gz' run: total = count_sequences(input.fastagz) d = OrderedDict({'total': total}) d['merged'] = total d['merging_was_done'] = False with open(output.json, 'w') as f: json.dump(d, f) rule check_merging: """Ensure the merging succeeded""" output: success='stats/merging-successful' input: json='stats/reads.json' run: with open(input.json) as f: d = json.load(f) total = d['total'] merged = d['merged'] if total == 0 or merged / total >= 0.3: with open(output.success, 'w') as f: print('This marker file exists if at least 30% of the input ' 'reads could be merged.', file=f) else: sys.exit('Less than 30% of the input reads could be merged. Please ' 'check whether there is an issue with your input data. To skip ' 'this check, create the file "stats/merging-successful" and ' 're-start "igdiscover run".') rule merged_read_length_histogram: output: txt="stats/merged.readlengths.txt", pdf="stats/merged.readlengths.pdf" input: fastq='reads/2-merged.fastq.gz' shell: """sqt readlenhisto --bins 100 --left {config.minimum_merged_read_length} --title "Lengths of merged reads" --plot {output.pdf} {input} > {output.txt}""" rule read_length_histogram: output: txt="stats/readlengths.txt", pdf="stats/readlengths.pdf" input: fastq=PREPROCESSED_READS shell: """sqt readlenhisto --bins 100 --left {config.minimum_merged_read_length} --title "Lengths of pre-processed reads" --plot {output.pdf} {input} > {output.txt}""" rule reads_stats_fasta: """ TODO implement this """ output: txt="stats/reads.txt" input: merged='reads/1-limited.fasta.gz' shell: "touch {output}" # Remove primer sequences if config.forward_primers: # At least one forward primer is to be removed rule trim_forward_primers: output: fastaq=temp('reads/3-forward-primer-trimmed.{ext,(fasta|fastq)}.gz') input: fastaq='reads/2-merged.{ext}.gz', mergesuccess='stats/merging-successful' resources: time=120 log: 'reads/3-forward-primer-trimmed.{ext}.log' params: fwd_primers=''.join(' -g ^{}'.format(seq) for seq in config.forward_primers), rev_primers=''.join(' -a {}$'.format(reverse_complement(seq)) for seq in config.forward_primers) if not config.stranded else '', shell: "cutadapt --discard-untrimmed" "{params.fwd_primers}" "{params.rev_primers}" " -o {output.fastaq} {input.fastaq} | tee {log}" else: # No trimming, just symlink the file rule dont_trim_forward_primers: output: fastaq='reads/3-forward-primer-trimmed.{ext,(fasta|fastq)}.gz' input: fastaq='reads/2-merged.{ext}.gz', mergesuccess='stats/merging-successful' resources: time=1 run: relative_symlink(input.fastaq, output.fastaq, force=True) if config.reverse_primers: # At least one reverse primer is to be removed rule trim_reverse_primers: output: fastaq='reads/4-trimmed.{ext,(fasta|fastq)}.gz' input: fastaq='reads/3-forward-primer-trimmed.{ext}.gz' resources: time=120 log: 'reads/4-trimmed.{ext}.log' params: # Reverse primers should appear reverse-complemented at the 3' end # of the merged read. 
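# Unless the library is stranded, a merged read may also be reverse-complemented,
# in which case the reverse primer itself appears at the 5' end; the '-g ^...'
# expressions below handle that case.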
fwd_primers=''.join(' -a {}$'.format(reverse_complement(seq)) for seq in config.reverse_primers), rev_primers=''.join(' -g ^{}'.format(seq) for seq in config.reverse_primers) if not config.stranded else '' shell: "cutadapt --discard-untrimmed" "{params.fwd_primers}" "{params.rev_primers}" " -o {output.fastaq} {input.fastaq} | tee {log}" else: # No trimming, just symlink the file rule dont_trim_reverse_primers: output: fastaq='reads/4-trimmed.{ext,(fasta|fastq)}.gz' input: fastaq='reads/3-forward-primer-trimmed.{ext}.gz' resources: time=1 run: relative_symlink(input.fastaq, output.fastaq, force=True) rule trimmed_fasta_stats: output: json='stats/trimmed.json', input: fastagz='reads/4-trimmed.fasta.gz' run: with open(output.json, 'w') as f: json.dump({'trimmed': count_sequences(input.fastagz)}, f) rule trimmed_fastq_stats: output: json='stats/trimmed.json', input: fastqgz='reads/4-trimmed.fastq.gz' run: with open(output.json, 'w') as f: json.dump({'trimmed': count_sequences(input.fastqgz)}, f) rule check_trimming: """Ensure that some reads are left after trimming""" output: success='stats/trimming-successful' input: reads_json='stats/reads.json', trimmed_json='stats/trimmed.json' run: with open(input.reads_json) as f: total = json.load(f)['total'] with open(input.trimmed_json) as f: trimmed = json.load(f)['trimmed'] if total == 0 or trimmed / total >= 0.1: with open(output.success, 'w') as f: print('This marker file exists if at least 10% of input ' 'reads contain the required primer sequences.', file=f) else: print(*textwrap.wrap( 'Less than 10% of the input reads contain the required primer ' 'sequences. Please check whether you have specified the ' 'correct primer sequences in the configuration file. To skip ' 'this check, create the file "stats/trimming-successful" and ' 're-start "igdiscover run".'), sep='\n') sys.exit(1) def group_cdr3_arg(): if not config.cdr3_location: cdr3_arg = '' elif config.cdr3_location == 'detect': cdr3_arg = ' --real-cdr3' else: cdr3_arg = ' --pseudo-cdr3={}:{}'.format(*config.cdr3_location) return cdr3_arg for ext in ('fasta', 'fastq'): if config.barcode_length and config.barcode_consensus: rule: """Group by barcode and CDR3 (also implicitly removes duplicates)""" output: fastagz=PREPROCESSED_READS, pdf="stats/groupsizes.pdf", groups="reads/4-groups.tab.gz", json="stats/groups.json" input: fastaq='reads/4-trimmed.{ext}.gz'.format(ext=ext), success='stats/trimming-successful' log: 'reads/4-sequences.fasta.gz.log' params: race_arg=' --trim-g' if config.race_g else '', cdr3_arg=group_cdr3_arg(), shell: "igdiscover group" "{params.cdr3_arg}{params.race_arg}" " --json={output.json}" " --minimum-length={config.minimum_merged_read_length}" " --groups-output={output.groups}" " --barcode-length={config.barcode_length}" " --plot-sizes={output.pdf}" " {input.fastaq} 2> {log} | {GZIP} > {output.fastagz}" else: rule: """Collapse identical sequences, remove barcodes""" output: fastagz=PREPROCESSED_READS, json="stats/groups.json" input: fastaq='reads/4-trimmed.{ext}.gz'.format(ext=ext), success='stats/trimming-successful', params: barcode_length=' --barcode-length={}'.format(config.barcode_length) if config.barcode_length else '', race_arg = ' --trim-g' if config.race_g else '', shell: "igdiscover dereplicate" "{params.barcode_length}{params.race_arg}" " --json={output.json}" " --minimum-length={config.minimum_merged_read_length}" " {input.fastaq} | {GZIP} > {output.fastagz}" rule copy_d_database: """Copy D gene database into the iteration folder""" output: 
fasta="{base}/database/D.fasta" input: fasta="database/D.fasta" shell: "cp -p {input} {output}" rule vj_database_iteration_1: """Copy original V or J gene database into the iteration 1 folder""" output: fasta="iteration-01/database/{gene,[VJ]}.fasta" input: fasta="database/{gene}.fasta" shell: "cp -p {input} {output}" def ensure_fasta_not_empty(path, gene): with FastaReader(path) as fr: for _ in fr: has_records = True break else: has_records = False if not has_records: print( 'ERROR: No {gene} genes were discovered in this iteration (file ' '{path!r} is empty)! Cannot continue.\n' 'Check whether the starting database is of the correct chain type ' '(heavy, light lambda, light kappa). It needs to match the type ' 'of sequences you analyze.'.format(gene=gene, path=path), file=sys.stderr) sys.exit(1) for i in range(2, config.iterations + 1): rule: output: fasta='iteration-{nr:02d}/database/V.fasta'.format(nr=i) input: fasta='iteration-{nr:02d}/new_V_pregermline.fasta'.format(nr=i-1) run: ensure_fasta_not_empty(input.fasta, 'V') shell("cp -p {input.fasta} {output.fasta}") rule: # Even with multiple iterations, J genes are discovered only once output: fasta='iteration-{nr:02d}/database/J.fasta'.format(nr=i) input: fasta='iteration-01/new_J.fasta' if config.j_discovery['propagate'] else 'database/J.fasta' run: ensure_fasta_not_empty(input.fasta, 'J') shell("cp -p {input.fasta} {output.fasta}") # Rules for last iteration if config.iterations == 0: # Copy over the input database (would be nice to avoid this) rule copy_database: output: fasta='final/database/{gene,[VJ]}.fasta' input: fasta='database/{gene}.fasta' shell: "cp -p {input.fasta} {output.fasta}" else: rule copy_final_v_database: output: fasta='final/database/V.fasta' input: fasta='iteration-{nr:02d}/new_V_germline.fasta'.format(nr=config.iterations) run: ensure_fasta_not_empty(input.fasta, 'V') shell("cp -p {input.fasta} {output.fasta}") rule copy_final_j_database: output: fasta='final/database/J.fasta' input: fasta=('iteration-01/new_J.fasta' if config.j_discovery['propagate'] else 'database/J.fasta') run: ensure_fasta_not_empty(input.fasta, 'J') shell("cp -p {input.fasta} {output.fasta}") rule igdiscover_igblast: output: tabgz="{dir}/assigned.tab.gz", json="{dir}/stats/assigned.json" input: fastagz=PREPROCESSED_READS, db_v="{dir}/database/V.fasta", db_d="{dir}/database/D.fasta", db_j="{dir}/database/J.fasta" params: penalty=' --penalty {}'.format(config.mismatch_penalty) if config.mismatch_penalty is not None else '', database='{dir}/database', species=' --species={}'.format(config.species) if config.species else '', sequence_type=' --sequence-type={}'.format(config.sequence_type), rename=' --rename {path!r}_'.format(path=os.path.basename(os.getcwd())) if config.rename else '' log: '{dir}/igblast.log' threads: 16 shell: "time igdiscover igblast{params.sequence_type}{params.penalty}{params.rename} --threads={threads}" "{params.species} --stats={output.json} {params.database} {input.fastagz} | " "{GZIP} 2>&1 > {output.tabgz} | tee {log} >&2" rule check_parsing: output: success="{dir}/stats/parsing-successful" input: json="{dir}/stats/assigned.json" run: with open(input.json) as f: d = json.load(f) n = d['total'] detected_cdr3s = d['detected_cdr3s'] if n == 0: print('No IgBLAST assignments found, something is wrong.') sys.exit(1) elif detected_cdr3s / n >= 0.1: with open(output.success, 'w') as f: print('This marker file exists if a CDR3 sequence could be ' 'detected for at least 10% of IgBLAST-assigned sequences.', file=f) else: 
print(*textwrap.wrap( 'A CDR3 sequence could be detected in less than 10% of the ' 'IgBLAST-assigned sequences. Possibly there is a problem with ' 'the starting database. To skip this check and continue anyway, ' 'create the file "{}" and re-start "igdiscover run".'.format( output.success)), sep='\n') sys.exit(1) rule igdiscover_filter: output: filtered="{dir}/filtered.tab.gz", json="{dir}/stats/filtered.json" input: assigned="{dir}/assigned.tab.gz", success="{dir}/stats/parsing-successful" run: conf = config.preprocessing_filter criteria = ['--v-coverage={}'.format(conf['v_coverage'])] criteria += ['--j-coverage={}'.format(conf['j_coverage'])] criteria += ['--v-evalue={}'.format(conf['v_evalue'])] criteria = ' '.join(criteria) shell("igdiscover filter --json={output.json} {criteria} {input.assigned} | {GZIP} > {output.filtered}") rule igdiscover_exact: output: exact="{dir}/exact.tab" input: filtered="{dir}/filtered.tab.gz" shell: # extract rows where V_errors == 0 """zcat {input.filtered} |""" """ awk 'NR==1 {{ for(i=1;i<=NF;i++) if ($i == "V_errors") col=i}};NR==1 || $col == 0' > {output}""" rule igdiscover_count: output: plot="{dir}/expressed_{gene,[VDJ]}.pdf", counts="{dir}/expressed_{gene}.tab" input: tab="{dir}/filtered.tab.gz" shell: "igdiscover count --gene={wildcards.gene} " "--allele-ratio=0.2 " "--plot={output.plot} {input.tab} > {output.counts}" rule igdiscover_clusterplot: output: done="{dir}/clusterplots/done" input: tab="{dir}/filtered.tab.gz" params: clusterplots="{dir}/clusterplots/", ignore_j=' --ignore-J' if config.ignore_j else '' shell: "igdiscover clusterplot{params.ignore_j} {input.tab} {params.clusterplots} && touch {output.done}" rule igdiscover_discover: """Discover potential new V gene sequences""" output: tab="{dir}/candidates.tab", read_names="{dir}/read_names_map.tab", input: v_reference="{dir}/database/V.fasta", tab="{dir}/filtered.tab.gz" params: ignore_j=' --ignore-J' if config.ignore_j else '', seed=' --seed={}'.format(config.seed) if config.seed is not None else '', exact_copies=' --exact-copies={}'.format(config.exact_copies) if config.exact_copies is not None else '' threads: 128 shell: "time igdiscover discover -j {threads}{params.seed}{params.ignore_j}{params.exact_copies}" " --d-coverage={config.d_coverage}" " --read-names={output.read_names}" " --subsample={config.subsample} --database={input.v_reference}" " {input.tab} > {output.tab}" def db_whitelist_or_not(wildcards): filterconf = config.pre_germline_filter if wildcards.pre == 'pre' else config.germline_filter if filterconf['whitelist']: # Use original (non-iteration-specific) database as whitelist return 'database/V.fasta' else: return [] def germlinefilter_criteria(wildcards, input): nr = int(wildcards.nr, base=10) conf = config.pre_germline_filter if wildcards.pre == 'pre' else config.germline_filter criteria = [] for path in [input.db_whitelist, input.whitelist]: if path: criteria += ['--whitelist=' + path] if conf['allow_stop']: criteria += ['--allow-stop'] # if conf['allow_chimeras']: # criteria += ['--allow-chimeras'] criteria += ['--unique-CDR3={}'.format(conf['unique_cdr3s'])] criteria += ['--cluster-size={}'.format(conf['cluster_size'])] criteria += ['--unique-J={}'.format(conf['unique_js'])] criteria += ['--cross-mapping-ratio={}'.format(conf['cross_mapping_ratio'])] criteria += ['--clonotype-ratio={}'.format(conf['clonotype_ratio'])] criteria += ['--exact-ratio={}'.format(conf['exact_ratio'])] criteria += ['--cdr3-shared-ratio={}'.format(conf['cdr3_shared_ratio'])] criteria += 
['--unique-D-ratio={}'.format(conf['unique_d_ratio'])] criteria += ['--unique-D-threshold={}'.format(conf['unique_d_threshold'])] return ' '.join(criteria) rule igdiscover_germlinefilter: """Construct a new database out of the discovered sequences""" output: tab='iteration-{nr}/new_V_{pre,(pre|)}germline.tab', fasta='iteration-{nr}/new_V_{pre,(pre|)}germline.fasta', annotated_tab='iteration-{nr}/annotated_V_{pre,(pre|)}germline.tab', input: tab='iteration-{nr}/candidates.tab', db_whitelist=db_whitelist_or_not, whitelist='whitelist.fasta' if os.path.exists('whitelist.fasta') else [], params: criteria=germlinefilter_criteria log: 'iteration-{nr}/new_V_{pre,(pre|)}germline.log' shell: "igdiscover germlinefilter {params.criteria}" " --annotate={output.annotated_tab}" " --fasta={output.fasta} {input.tab} " " 2> >(tee {log} >&2) " " > {output.tab}" rule igdiscover_discover_j: """Discover potential new J gene sequences""" output: tab="iteration-01/new_J.tab", fasta="iteration-01/new_J.fasta", input: j_reference="iteration-01/database/J.fasta", tab="iteration-01/filtered.tab.gz", params: allele_ratio='--allele-ratio={}'.format(config.j_discovery['allele_ratio']), cross_mapping_ratio=' --cross-mapping-ratio={}'.format(config.j_discovery['cross_mapping_ratio']) shell: "time igdiscover discoverj {params.allele_ratio}{params.cross_mapping_ratio} " "--database={input.j_reference} " "--fasta={output.fasta} " "{input.tab} > {output.tab}" rule stats_correlation_V_J: output: pdf="{dir}/correlationVJ.pdf" input: table="{dir}/assigned.tab.gz" run: import matplotlib matplotlib.use('pdf') # sns.heatmap will not work properly with the object-oriented interface, # so use pyplot import matplotlib.pyplot as plt import seaborn as sns import numpy as np import pandas as pd from collections import Counter table = igdiscover.read_table(input.table) fig = plt.figure(figsize=(29.7/2.54, 21/2.54)) counts = np.zeros((21, 11), dtype=np.int64) counter = Counter(zip(table['V_errors'], table['J_errors'])) for (v,j), count in counter.items(): if v is not None and v < counts.shape[0] and j is not None and j < counts.shape[1]: counts[v,j] = count df = pd.DataFrame(counts.T)[::-1] df.index.name = 'J errors' df.columns.name = 'V errors' sns.heatmap(df, annot=True, fmt=',d', cbar=False) fig.suptitle('V errors vs. 
J errors in unfiltered sequences') fig.set_tight_layout(True) fig.savefig(output.pdf) rule plot_errorhistograms: output: multi_pdf='{dir}/errorhistograms.pdf', boxplot_pdf='{dir}/v-shm-distributions.pdf' input: table='{dir}/filtered.tab.gz' params: ignore_j=' --max-j-shm=0' if config.ignore_j else '' shell: 'igdiscover errorplot{params.ignore_j} --multi={output.multi_pdf} --boxplot={output.boxplot_pdf} {input.table}' rule dendrogram: output: pdf='{dir}/dendrogram_{gene}.pdf' input: fasta='{dir}/database/{gene}.fasta' shell: 'igdiscover dendrogram --mark database/{wildcards.gene}.fasta {input.fasta} {output.pdf}' rule version: output: txt='stats/version.txt' run: with open(output.txt, 'w') as f: print('IgDiscover version', igdiscover.__version__, file=f) def get_sequences(path): with SequenceReader(path) as fr: sequences = [record.sequence.upper() for record in fr] return sequences def count_sequences(path): with SequenceReader(path) as fr: n = 0 for _ in fr: n += 1 return n rule json_stats_nofinal: output: json='stats/stats_nofinal.json' input: original_db='database/V.fasta', v_pre_germline=['iteration-{:02d}/new_V_pregermline.fasta'.format(i+1) for i in range(config.iterations)], v_germline=['iteration-{:02d}/new_V_germline.fasta'.format(i+1) for i in range(config.iterations)], filtered_stats=['iteration-{:02d}/stats/filtered.json'.format(i+1) for i in range(config.iterations)], group_stats='stats/groups.json', reads='stats/reads.json', trimmed='stats/trimmed.json' run: d = OrderedDict() d['version'] = igdiscover.__version__ with open(input.reads) as f: rp = json.load(f) rp['raw_reads'] = rp['total'] with open(input.trimmed) as f: rp['after_primer_trimming'] = json.load(f)['trimmed'] with open(input.group_stats) as f: rp['grouping'] = json.load(f) d['read_preprocessing'] = rp prev_sequences = set(get_sequences(input.original_db)) size = len(prev_sequences) iterations = [{'database': {'size': size}}] for i, (pre_germline_path, germline_path, filtered_json_path) in enumerate( zip(input.v_pre_germline, input.v_germline, input.filtered_stats)): pre_germline_sequences = set(get_sequences(pre_germline_path)) germline_sequences = set(get_sequences(germline_path)) gained = len(germline_sequences - prev_sequences) lost = len(prev_sequences - germline_sequences) gained_pre = len(pre_germline_sequences - prev_sequences) lost_pre = len(prev_sequences - pre_germline_sequences) iteration_info = OrderedDict() with open(filtered_json_path) as f: iteration_info['assignment_filtering'] = json.load(f) db_info = OrderedDict() db_info['size'] = len(germline_sequences) db_info['gained'] = gained db_info['lost'] = lost db_info['size_pre'] = len(pre_germline_sequences) db_info['gained_pre'] = gained_pre db_info['lost_pre'] = lost_pre iteration_info['database'] = db_info iterations.append(iteration_info) prev_sequences = pre_germline_sequences d['iterations'] = iterations with open(output.json, 'w') as f: json.dump(d, f, indent=2) print(file=f) rule json_stats: output: json='stats/stats.json' input: stats_nofinal='stats/stats_nofinal.json', final_stats='final/stats/filtered.json', run: with open(input.stats_nofinal) as f: d = json.load(f) with open(input.final_stats) as f: d['assignment_filtering'] = json.load(f) with open(output.json, 'w') as f: json.dump(d, f, indent=2) print(file=f) IgDiscover-0.11/igdiscover/__init__.py000066400000000000000000000001351337725263500177540ustar00rootroot00000000000000from ._version import get_versions
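# The version string is computed by Versioneer from git tag metadata; see _version.py.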
__version__ = get_versions()['version'] del get_versions IgDiscover-0.11/igdiscover/__main__.py000066400000000000000000000067761337725263500177560ustar00rootroot00000000000000#!/usr/bin/env python3 """ IgDiscover computes V/D/J gene usage profiles and discovers novel V genes - Run IgBLAST in parallel (wrapper inspired by igblastwrp). - Parse IgBLAST output into a tab-separated table - Group sequences by barcode - Plot V gene usage - Discover new V genes given more than one dataset """ import sys import logging import importlib from argparse import ArgumentParser, RawDescriptionHelpFormatter import matplotlib as mpl import warnings import resource from . import __version__ __author__ = "Marcel Martin" mpl.use('Agg') warnings.filterwarnings('ignore', 'axes.color_cycle is deprecated and replaced with axes.prop_cycle') warnings.filterwarnings('ignore', 'The `IPython.html` package') # List of all subcommands. A module of the given name must exist and define # add_arguments() and main() functions. Documentation is taken from the first # line of the module’s docstring. COMMANDS = [ 'init', 'run', 'config', 'commonv', 'igblast', 'filter', 'count', 'group', 'dereplicate', #'multidiscover', 'germlinefilter', 'discover', 'discoverj', 'clusterplot', 'errorplot', 'upstream', 'dendrogram', 'rename', 'union', 'dbdiff', 'merge', 'clonotypes', 'clonoquery', 'plotalleles', 'haplotype', ] logger = logging.getLogger(__name__) class HelpfulArgumentParser(ArgumentParser): """An ArgumentParser that prints full help on errors.""" def __init__(self, *args, **kwargs): if 'formatter_class' not in kwargs: kwargs['formatter_class'] = RawDescriptionHelpFormatter super().__init__(*args, **kwargs) def error(self, message): self.print_help(sys.stderr) args = {'prog': self.prog, 'message': message} self.exit(2, '%(prog)s: error: %(message)s\n' % args) def format_duration(seconds): h = int(seconds // 3600) seconds -= h * 3600 m = int(seconds // 60) seconds -= m * 60 return '{:02d}:{:02d}:{:04.1f}'.format(h, m, seconds) def main(arguments=None): logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s') parser = HelpfulArgumentParser(description=__doc__, prog='igdiscover') parser.add_argument('--profile', default=False, action='store_true', help='Save profiling information to igdiscover.prof') parser.add_argument('--version', action='version', version='%(prog)s ' + __version__) show_cpustats = dict() subparsers = parser.add_subparsers() for command_name in COMMANDS: module = importlib.import_module('.' + command_name, 'igdiscover') subparser = subparsers.add_parser(command_name, help=module.__doc__.split('\n')[1], description=module.__doc__) subparser.set_defaults(func=module.main) module.add_arguments(subparser) if hasattr(module, 'do_not_show_cpustats'): show_cpustats[module.main] = False args = parser.parse_args(arguments) if not hasattr(args, 'func'): parser.error('Please provide the name of a subcommand to run') elif args.profile: import cProfile as profile profile.runctx('args.func(args)', globals(), locals(), filename='igdiscover.prof') logger.info('Wrote profiling data to igdiscover.prof') else: args.func(args) if sys.platform == 'linux' and show_cpustats.get(args.func, True): rself = resource.getrusage(resource.RUSAGE_SELF) rchildren = resource.getrusage(resource.RUSAGE_CHILDREN) memory_kb = rself.ru_maxrss + rchildren.ru_maxrss cpu_time = rself.ru_utime + rself.ru_stime + rchildren.ru_utime + rchildren.ru_stime cpu_time_s = format_duration(cpu_time) logger.info('CPU time {}. 
Maximum memory usage {:.3f} GB'.format( cpu_time_s, memory_kb / 1E6)) if __name__ == '__main__': main() IgDiscover-0.11/igdiscover/_version.py000066400000000000000000000406171337725263500200520ustar00rootroot00000000000000 # This file helps to compute a version number in source trees obtained from # git-archive tarball (such as those provided by githubs download-from-tag # feature). Distribution tarballs (built by setup.py sdist) and build # directories (produced by setup.py build) will contain a much shorter file # that just contains the computed version number. # This file is released into the public domain. Generated by # versioneer-0.16 (https://github.com/warner/python-versioneer) """Git implementation of _version.py.""" import errno import os import re import subprocess import sys def get_keywords(): """Get the keywords needed to look up the version information.""" # these strings will be replaced by git during git-archive. # setup.py/versioneer.py will grep for the variable names, so they must # each be defined on a line of their own. _version.py will just call # get_keywords(). git_refnames = " (tag: v0.11)" git_full = "d48994b2bb99dbd0cfe245c8cb57247e84b0c56c" keywords = {"refnames": git_refnames, "full": git_full} return keywords class VersioneerConfig: """Container for Versioneer configuration parameters.""" def get_config(): """Create, populate and return the VersioneerConfig() object.""" # these strings are filled in when 'setup.py versioneer' creates # _version.py cfg = VersioneerConfig() cfg.VCS = "git" cfg.style = "pep440" cfg.tag_prefix = "v" cfg.parentdir_prefix = "igdiscover-" cfg.versionfile_source = "igdiscover/_version.py" cfg.verbose = False return cfg class NotThisMethod(Exception): """Exception raised if a method is not valid for the current scenario.""" LONG_VERSION_PY = {} HANDLERS = {} def register_vcs_handler(vcs, method): # decorator """Decorator to mark a method as the handler for a particular VCS.""" def decorate(f): """Store f in HANDLERS[vcs][method].""" if vcs not in HANDLERS: HANDLERS[vcs] = {} HANDLERS[vcs][method] = f return f return decorate def run_command(commands, args, cwd=None, verbose=False, hide_stderr=False): """Call the given command(s).""" assert isinstance(commands, list) p = None for c in commands: try: dispcmd = str([c] + args) # remember shell=False, so use git.cmd on windows, not just git p = subprocess.Popen([c] + args, cwd=cwd, stdout=subprocess.PIPE, stderr=(subprocess.PIPE if hide_stderr else None)) break except EnvironmentError: e = sys.exc_info()[1] if e.errno == errno.ENOENT: continue if verbose: print("unable to run %s" % dispcmd) print(e) return None else: if verbose: print("unable to find command, tried %s" % (commands,)) return None stdout = p.communicate()[0].strip() if sys.version_info[0] >= 3: stdout = stdout.decode() if p.returncode != 0: if verbose: print("unable to run %s (error)" % dispcmd) return None return stdout def versions_from_parentdir(parentdir_prefix, root, verbose): """Try to determine the version from the parent directory name. Source tarballs conventionally unpack into a directory that includes both the project name and a version string. 
""" dirname = os.path.basename(root) if not dirname.startswith(parentdir_prefix): if verbose: print("guessing rootdir is '%s', but '%s' doesn't start with " "prefix '%s'" % (root, dirname, parentdir_prefix)) raise NotThisMethod("rootdir doesn't start with parentdir_prefix") return {"version": dirname[len(parentdir_prefix):], "full-revisionid": None, "dirty": False, "error": None} @register_vcs_handler("git", "get_keywords") def git_get_keywords(versionfile_abs): """Extract version information from the given file.""" # the code embedded in _version.py can just fetch the value of these # keywords. When used from setup.py, we don't want to import _version.py, # so we do it with a regexp instead. This function is not used from # _version.py. keywords = {} try: f = open(versionfile_abs, "r") for line in f.readlines(): if line.strip().startswith("git_refnames ="): mo = re.search(r'=\s*"(.*)"', line) if mo: keywords["refnames"] = mo.group(1) if line.strip().startswith("git_full ="): mo = re.search(r'=\s*"(.*)"', line) if mo: keywords["full"] = mo.group(1) f.close() except EnvironmentError: pass return keywords @register_vcs_handler("git", "keywords") def git_versions_from_keywords(keywords, tag_prefix, verbose): """Get version information from git keywords.""" if not keywords: raise NotThisMethod("no keywords at all, weird") refnames = keywords["refnames"].strip() if refnames.startswith("$Format"): if verbose: print("keywords are unexpanded, not using") raise NotThisMethod("unexpanded keywords, not a git-archive tarball") refs = set([r.strip() for r in refnames.strip("()").split(",")]) # starting in git-1.8.3, tags are listed as "tag: foo-1.0" instead of # just "foo-1.0". If we see a "tag: " prefix, prefer those. TAG = "tag: " tags = set([r[len(TAG):] for r in refs if r.startswith(TAG)]) if not tags: # Either we're using git < 1.8.3, or there really are no tags. We use # a heuristic: assume all version tags have a digit. The old git %d # expansion behaves like git log --decorate=short and strips out the # refs/heads/ and refs/tags/ prefixes that would let us distinguish # between branches and tags. By ignoring refnames without digits, we # filter out many common branch names like "release" and # "stabilization", as well as "HEAD" and "master". tags = set([r for r in refs if re.search(r'\d', r)]) if verbose: print("discarding '%s', no digits" % ",".join(refs-tags)) if verbose: print("likely tags: %s" % ",".join(sorted(tags))) for ref in sorted(tags): # sorting will prefer e.g. "2.0" over "2.0rc1" if ref.startswith(tag_prefix): r = ref[len(tag_prefix):] if verbose: print("picking %s" % r) return {"version": r, "full-revisionid": keywords["full"].strip(), "dirty": False, "error": None } # no suitable tags, so version is "0+unknown", but full hex is still there if verbose: print("no suitable tags, using unknown + full revision id") return {"version": "0+unknown", "full-revisionid": keywords["full"].strip(), "dirty": False, "error": "no suitable tags"} @register_vcs_handler("git", "pieces_from_vcs") def git_pieces_from_vcs(tag_prefix, root, verbose, run_command=run_command): """Get version from 'git describe' in the root of the source tree. This only gets called if the git-archive 'subst' keywords were *not* expanded, and _version.py hasn't already been rewritten with a short version string, meaning we're inside a checked out source tree. 
""" if not os.path.exists(os.path.join(root, ".git")): if verbose: print("no .git in %s" % root) raise NotThisMethod("no .git directory") GITS = ["git"] if sys.platform == "win32": GITS = ["git.cmd", "git.exe"] # if there is a tag matching tag_prefix, this yields TAG-NUM-gHEX[-dirty] # if there isn't one, this yields HEX[-dirty] (no NUM) describe_out = run_command(GITS, ["describe", "--tags", "--dirty", "--always", "--long", "--match", "%s*" % tag_prefix], cwd=root) # --long was added in git-1.5.5 if describe_out is None: raise NotThisMethod("'git describe' failed") describe_out = describe_out.strip() full_out = run_command(GITS, ["rev-parse", "HEAD"], cwd=root) if full_out is None: raise NotThisMethod("'git rev-parse' failed") full_out = full_out.strip() pieces = {} pieces["long"] = full_out pieces["short"] = full_out[:7] # maybe improved later pieces["error"] = None # parse describe_out. It will be like TAG-NUM-gHEX[-dirty] or HEX[-dirty] # TAG might have hyphens. git_describe = describe_out # look for -dirty suffix dirty = git_describe.endswith("-dirty") pieces["dirty"] = dirty if dirty: git_describe = git_describe[:git_describe.rindex("-dirty")] # now we have TAG-NUM-gHEX or HEX if "-" in git_describe: # TAG-NUM-gHEX mo = re.search(r'^(.+)-(\d+)-g([0-9a-f]+)$', git_describe) if not mo: # unparseable. Maybe git-describe is misbehaving? pieces["error"] = ("unable to parse git-describe output: '%s'" % describe_out) return pieces # tag full_tag = mo.group(1) if not full_tag.startswith(tag_prefix): if verbose: fmt = "tag '%s' doesn't start with prefix '%s'" print(fmt % (full_tag, tag_prefix)) pieces["error"] = ("tag '%s' doesn't start with prefix '%s'" % (full_tag, tag_prefix)) return pieces pieces["closest-tag"] = full_tag[len(tag_prefix):] # distance: number of commits since tag pieces["distance"] = int(mo.group(2)) # commit: short hex revision ID pieces["short"] = mo.group(3) else: # HEX: no tags pieces["closest-tag"] = None count_out = run_command(GITS, ["rev-list", "HEAD", "--count"], cwd=root) pieces["distance"] = int(count_out) # total number of commits return pieces def plus_or_dot(pieces): """Return a + if we don't already have one, else return a .""" if "+" in pieces.get("closest-tag", ""): return "." return "+" def render_pep440(pieces): """Build up version string, with post-release "local version identifier". Our goal: TAG[+DISTANCE.gHEX[.dirty]] . Note that if you get a tagged build and then dirty it, you'll get TAG+0.gHEX.dirty Exceptions: 1: no tags. git_describe was just HEX. 0+untagged.DISTANCE.gHEX[.dirty] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += plus_or_dot(pieces) rendered += "%d.g%s" % (pieces["distance"], pieces["short"]) if pieces["dirty"]: rendered += ".dirty" else: # exception #1 rendered = "0+untagged.%d.g%s" % (pieces["distance"], pieces["short"]) if pieces["dirty"]: rendered += ".dirty" return rendered def render_pep440_pre(pieces): """TAG[.post.devDISTANCE] -- No -dirty. Exceptions: 1: no tags. 0.post.devDISTANCE """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"]: rendered += ".post.dev%d" % pieces["distance"] else: # exception #1 rendered = "0.post.dev%d" % pieces["distance"] return rendered def render_pep440_post(pieces): """TAG[.postDISTANCE[.dev0]+gHEX] . The ".dev0" means dirty. Note that .dev0 sorts backwards (a dirty tree will appear "older" than the corresponding clean one), but you shouldn't be releasing software with -dirty anyways. 
Exceptions: 1: no tags. 0.postDISTANCE[.dev0] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += ".post%d" % pieces["distance"] if pieces["dirty"]: rendered += ".dev0" rendered += plus_or_dot(pieces) rendered += "g%s" % pieces["short"] else: # exception #1 rendered = "0.post%d" % pieces["distance"] if pieces["dirty"]: rendered += ".dev0" rendered += "+g%s" % pieces["short"] return rendered def render_pep440_old(pieces): """TAG[.postDISTANCE[.dev0]] . The ".dev0" means dirty. Eexceptions: 1: no tags. 0.postDISTANCE[.dev0] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += ".post%d" % pieces["distance"] if pieces["dirty"]: rendered += ".dev0" else: # exception #1 rendered = "0.post%d" % pieces["distance"] if pieces["dirty"]: rendered += ".dev0" return rendered def render_git_describe(pieces): """TAG[-DISTANCE-gHEX][-dirty]. Like 'git describe --tags --dirty --always'. Exceptions: 1: no tags. HEX[-dirty] (note: no 'g' prefix) """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"]: rendered += "-%d-g%s" % (pieces["distance"], pieces["short"]) else: # exception #1 rendered = pieces["short"] if pieces["dirty"]: rendered += "-dirty" return rendered def render_git_describe_long(pieces): """TAG-DISTANCE-gHEX[-dirty]. Like 'git describe --tags --dirty --always -long'. The distance/hash is unconditional. Exceptions: 1: no tags. HEX[-dirty] (note: no 'g' prefix) """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] rendered += "-%d-g%s" % (pieces["distance"], pieces["short"]) else: # exception #1 rendered = pieces["short"] if pieces["dirty"]: rendered += "-dirty" return rendered def render(pieces, style): """Render the given version pieces into the requested style.""" if pieces["error"]: return {"version": "unknown", "full-revisionid": pieces.get("long"), "dirty": None, "error": pieces["error"]} if not style or style == "default": style = "pep440" # the default if style == "pep440": rendered = render_pep440(pieces) elif style == "pep440-pre": rendered = render_pep440_pre(pieces) elif style == "pep440-post": rendered = render_pep440_post(pieces) elif style == "pep440-old": rendered = render_pep440_old(pieces) elif style == "git-describe": rendered = render_git_describe(pieces) elif style == "git-describe-long": rendered = render_git_describe_long(pieces) else: raise ValueError("unknown style '%s'" % style) return {"version": rendered, "full-revisionid": pieces["long"], "dirty": pieces["dirty"], "error": None} def get_versions(): """Get version information or return default if unable to do so.""" # I am in _version.py, which lives at ROOT/VERSIONFILE_SOURCE. If we have # __file__, we can work backwards from there to the root. Some # py2exe/bbfreeze/non-CPython implementations don't do __file__, in which # case we can only use expanded keywords. cfg = get_config() verbose = cfg.verbose try: return git_versions_from_keywords(get_keywords(), cfg.tag_prefix, verbose) except NotThisMethod: pass try: root = os.path.realpath(__file__) # versionfile_source is the relative path from the top of the source # tree (where the .git directory might live) to this file. Invert # this to find the root from __file__. 
for i in cfg.versionfile_source.split('/'): root = os.path.dirname(root) except NameError: return {"version": "0+unknown", "full-revisionid": None, "dirty": None, "error": "unable to find root of source tree"} try: pieces = git_pieces_from_vcs(cfg.tag_prefix, root, verbose) return render(pieces, cfg.style) except NotThisMethod: pass try: if cfg.parentdir_prefix: return versions_from_parentdir(cfg.parentdir_prefix, root, verbose) except NotThisMethod: pass return {"version": "0+unknown", "full-revisionid": None, "dirty": None, "error": "unable to compute version"} IgDiscover-0.11/igdiscover/clonoquery.py000066400000000000000000000143071337725263500204230ustar00rootroot00000000000000""" Query a table of assigned sequences by clonotype Two sequences have the same clonotype if - their V and J assignments are the same - the length of their CDR3 is identical - the difference between their CDR3s (in terms of mismatches) is not higher than a given threshold (by default 1) Clonotypes for the query sequences are determined and sequences in the input table that have this clonotype are reported. The table is written to standard output. """ import logging from collections import defaultdict from contextlib import ExitStack import pandas as pd from xopen import xopen from .table import read_table from .utils import slice_arg from .clonotypes import is_similar_with_junction, CLONOTYPE_COLUMNS logger = logging.getLogger(__name__) def add_arguments(parser): arg = parser.add_argument arg('--minimum-count', '-c', metavar='N', default=1, type=int, help='Discard all rows with count less than N. Default: %(default)s') arg('--cdr3-core', default=None, type=slice_arg, metavar='START:END', help='START:END defines the non-junction region of CDR3 ' 'sequences. Use negative numbers for END to count ' 'from the end. Regions before and after are considered to ' 'be junction sequence, and for two CDR3s to be considered ' 'similar, at least one of the junctions must be identical. ' 'Default: no junction region.') arg('--mismatches', default=1, type=float, help='No. of allowed mismatches between CDR3 sequences. ' 'Can also be a fraction between 0 and 1 (such as 0.15), ' 'interpreted relative to the length of the CDR3 (minus the front non-core). ' 'Default: %(default)s') arg('--aa', default=False, action='store_true', help='Count CDR3 mismatches on amino-acid level. Default: Compare nucleotides.') arg('--summary', metavar='FILE', help='Write summary table to FILE') arg('reftable', help='Reference table with parsed and filtered ' 'IgBLAST results (filtered.tab)') arg('querytable', help='Query table with IgBLAST results (assigned.tab or filtered.tab)') def collect(querytable, reftable, mismatches, cdr3_core_slice, cdr3_column): """ Find all queries from the querytable in the reftable. Yield tuples (query_rows, similar_rows) where the query_rows is a list with all the rows that have the same result. similar_rows is a DataFrame whose rows are the ones matching the query. """ # The vjlentype is a "clonotype without CDR3 sequence" (only V, J, CDR3 length) # Determine set of vjlentypes to query query_vjlentypes = defaultdict(list) for row in querytable.itertuples(): vjlentype = (row.V_gene, row.J_gene, len(row.CDR3_nt)) query_vjlentypes[vjlentype].append(row) groupby = ['V_gene', 'J_gene', 'CDR3_length'] for vjlentype, vjlen_group in reftable.groupby(groupby): # (v_gene, j_gene, cdr3_length) = vjlentype if vjlentype not in query_vjlentypes: continue # Collect results for this vjlentype. 
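        # Sketch with made-up values: a query row with V_gene='IGHV1-2*02',
        # J_gene='IGHJ4*02' and a 42 nt CDR3 falls under the vjlentype key
        # ('IGHV1-2*02', 'IGHJ4*02', 42); only reference rows in the same
        # group are compared to it by CDR3 similarity.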
The results dict # maps row indices (into the vjlen_group) to each query_row, # allowing us to group identical results together. results = defaultdict(list) for query_row in query_vjlentypes.pop(vjlentype): cdr3 = getattr(query_row, cdr3_column) # Save indices of the rows that are similar to this query indices = tuple(index for index, r in enumerate(vjlen_group.itertuples()) if is_similar_with_junction(cdr3, getattr(r, cdr3_column), mismatches, cdr3_core_slice)) results[indices].append(query_row) # Yield results, grouping queries that lead to the same result for indices, query_rows in results.items(): if not indices: for query_row in query_rows: yield ([query_row], []) continue similar_group = vjlen_group.iloc[list(indices), :] yield (query_rows, similar_group) # Yield result tuples for all the queries that have not been found for queries in query_vjlentypes.values(): for query_row in queries: yield ([query_row], []) def main(args): usecols = CLONOTYPE_COLUMNS # TODO backwards compatibility if ('FR1_aa_mut' not in pd.read_csv(args.querytable, nrows=0, sep='\t').columns or 'FR1_aa_mut' not in pd.read_csv(args.reftable, nrows=0, sep='\t').columns): usecols = [col for col in usecols if not col.endswith('_aa_mut')] querytable = read_table(args.querytable, usecols=usecols) querytable = querytable[usecols] # reorder columns # Filter empty rows (happens sometimes) querytable = querytable[querytable.V_gene != ''] logger.info('Read query table with %s rows', len(querytable)) reftable = read_table(args.reftable, usecols=usecols) reftable = reftable[usecols] logger.info('Read reference table with %s rows', len(reftable)) if args.minimum_count > 1: reftable = reftable[reftable['count'] >= args.minimum_count] logger.info('After filtering out rows with count < %s, %s rows remain', args.minimum_count, len(reftable)) for tab in querytable, reftable: tab.insert(5, 'CDR3_length', tab['CDR3_nt'].apply(len)) if len(querytable) > len(reftable): logger.warning('The reference table is smaller than the ' 'query table! 
Did you swap query and reference?') cdr3_column = 'CDR3_aa' if args.aa else 'CDR3_nt' summary_columns = ['FR1_SHM', 'CDR1_SHM', 'FR2_SHM', 'CDR2_SHM', 'FR3_SHM', 'V_SHM', 'J_SHM', 'V_aa_mut', 'J_aa_mut'] summary_columns.extend(col for col in usecols if col.endswith('_aa_mut')) with ExitStack() as stack: if args.summary: summary_file = stack.enter_context(xopen(args.summary, 'w')) print('name', 'size', *('avg_' + s for s in summary_columns), sep='\t', file=summary_file) else: summary_file = None # Print header print(*reftable.columns, sep='\t') # Do the actual work for query_rows, result_table in collect(querytable, reftable, args.mismatches, args.cdr3_core, cdr3_column): assert len(query_rows) >= 1 if summary_file: for query_row in query_rows: print(query_row.name, len(result_table), sep='\t', end='', file=summary_file) for col in summary_columns: mean = result_table[col].mean() if len(result_table) > 0 else 0 print('\t{:.2f}'.format(mean), end='', file=summary_file) print(file=summary_file) for query_row in query_rows: print('# Query: {}'.format(query_row.name), '', *(query_row[3:]), sep='\t') if len(result_table) > 0: print(result_table.to_csv(sep='\t', header=False, index=False)) else: print() IgDiscover-0.11/igdiscover/clonotypes.py000066400000000000000000000152731337725263500204250ustar00rootroot00000000000000""" Group assigned sequences by clonotype Two sequences have the same clonotype if - their V and J assignments are the same - the length of their CDR3 is identical - the difference between their CDR3s (in terms of mismatches) is not higher than a given threshold (by default 1) The output is a table with one row per clonotype, written to standard output. Optionally, a full table of all members (sequences belonging to a clonotype) can be created with one row per input sequence, sorted by clonotype, plus an empty line between each group of sequences that have the same clonotype. The tables are by default sorted by clonotype, but can instead be sorted by the group size (number of members of a clonotype). """ import logging from itertools import islice from contextlib import ExitStack from collections import Counter import pandas as pd from xopen import xopen from sqt.align import hamming_distance from .table import read_table from .cluster import hamming_single_linkage from .utils import slice_arg CLONOTYPE_COLUMNS = ['name', 'count', 'V_gene', 'D_gene', 'J_gene', 'CDR3_nt', 'CDR3_aa', 'FR1_SHM', 'CDR1_SHM', 'FR2_SHM', 'CDR2_SHM', 'FR3_SHM', 'FR1_aa_mut', 'CDR1_aa_mut', 'FR2_aa_mut', 'CDR2_aa_mut', 'FR3_aa_mut', 'V_aa_mut', 'J_aa_mut', 'V_errors', 'J_errors', 'V_SHM', 'J_SHM', 'barcode', 'VDJ_nt', 'VDJ_aa'] logger = logging.getLogger(__name__) def add_arguments(parser): arg = parser.add_argument arg('--sort', action='store_true', default=False, help='Sort by group size (largest first). Default: Sort by V/D/J gene names') arg('--limit', metavar='N', type=int, default=None, help='Print out only the first N groups') arg('--cdr3-core', default=None, type=slice_arg, metavar='START:END', help='START:END defines the non-junction region of CDR3 ' 'sequences. Use negative numbers for END to count ' 'from the end. Regions before and after are considered to ' 'be junction sequence, and for two CDR3s to be considered ' 'similar, at least one of the junctions must be identical. ' 'Default: no junction region.') arg('--mismatches', default=1, type=float, help='No. of allowed mismatches between CDR3 sequences. 
' 'Can also be a fraction between 0 and 1 (such as 0.15), ' 'interpreted relative to the length of the CDR3 (minus the front non-core). ' 'Default: %(default)s') arg('--aa', default=False, action='store_true', help='Count CDR3 mismatches on amino-acid level. Default: Compare nucleotides.') arg('--members', metavar='FILE', help='Write member table to FILE') arg('table', help='Table with parsed and filtered IgBLAST results') def is_similar_with_junction(s, t, mismatches, cdr3_core): """ Return whether strings s and t have at most the given number of mismatches *and* have at least one identical junction. """ # TODO see issue #81 if len(s) != len(t): return False if 0 < mismatches < 1: delta = cdr3_core.start if cdr3_core is not None else 0 distance_ok = hamming_distance(s, t) <= (len(s) - delta) * mismatches else: distance_ok = hamming_distance(s, t) <= mismatches if cdr3_core is None: return distance_ok return distance_ok and ( (s[:cdr3_core.start] == t[:cdr3_core.start]) or (s[cdr3_core.stop:] == t[cdr3_core.stop:])) def group_by_cdr3(table, mismatches, cdr3_core, cdr3_column): """ Cluster the rows of the table by Hamming distance between their CDR3 sequences. Yield (index, group) tuples similar to .groupby(). """ # Cluster all unique CDR3s by Hamming distance sequences = list(set(table[cdr3_column])) def linked(s, t): return is_similar_with_junction(s, t, mismatches, cdr3_core) clusters = hamming_single_linkage(sequences, mismatches, linked=linked) # Create dict that maps CDR3 sequences to a numeric cluster id cluster_ids = dict() for cluster_id, cdr3s in enumerate(clusters): for cdr3 in cdr3s: cluster_ids[cdr3] = cluster_id # Assign cluster id to each row table['cluster_id'] = table[cdr3_column].apply(lambda cdr3: cluster_ids[cdr3]) for index, group in table.groupby('cluster_id'): yield group.drop('cluster_id', axis=1) def representative(table): """ Given a table with members of the same clonotype, return a representative as a dict. """ c = Counter() for row in table.itertuples(): c[row.VDJ_nt] += row.count most_common_vdj_nt = c.most_common(1)[0][0] result = table[table['VDJ_nt'] == most_common_vdj_nt].iloc[0] result.at['count'] = table['count'].sum() return result def group_by_clonotype(table, mismatches, sort, cdr3_core, cdr3_column): """ Yield clonotype groups. Each item is a DataFrame with all the members of the clonotype. 
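    Sketch of the grouping (gene names are illustrative): rows are first
    grouped by (V_gene, J_gene, CDR3_length), so all rows assigned
    IGHV1-2*02/IGHJ4*02 with a 42 nt CDR3 form one pool; within each pool,
    group_by_cdr3() then merges CDR3s that are at most `mismatches` apart
    into one clonotype.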
""" logger.info('Computing clonotypes ...') prev_v = None groups = [] for (v_gene, j_gene, cdr3_length), vj_group in table.groupby( ('V_gene', 'J_gene', 'CDR3_length')): if prev_v != v_gene: logger.info('Processing %s', v_gene) prev_v = v_gene cdr3_groups = group_by_cdr3(vj_group.copy(), mismatches=mismatches, cdr3_core=cdr3_core, cdr3_column=cdr3_column) if sort: # When sorting by group size is requested, we need to buffer # results groups.extend(cdr3_groups) else: yield from cdr3_groups if sort: logger.info('Sorting by group size ...') groups.sort(key=len, reverse=True) yield from groups def main(args): logger.info('Reading input table ...') usecols = CLONOTYPE_COLUMNS # TODO backwards compatibility if 'FR1_aa_mut' not in pd.read_csv(args.table, nrows=0, sep='\t').columns: usecols = [col for col in usecols if not col.endswith('_aa_mut')] table = read_table(args.table, usecols=usecols) table = table[usecols] logger.info('Read table with %s rows', len(table)) table.insert(5, 'CDR3_length', table['CDR3_nt'].apply(len)) table = table[table['CDR3_length'] > 0] table = table[table['CDR3_aa'].map(lambda s: '*' not in s)] logger.info('After discarding rows with unusable CDR3, %s remain', len(table)) with ExitStack() as stack: if args.members: members_file = stack.enter_context(xopen(args.members, 'w')) else: members_file = None columns = usecols[:] columns.remove('barcode') columns.remove('count') columns.insert(0, 'count') columns.insert(columns.index('CDR3_nt'), 'CDR3_length') print(*columns, sep='\t') print_header = True n = 0 cdr3_column = 'CDR3_aa' if args.aa else 'CDR3_nt' grouped = group_by_clonotype(table, args.mismatches, args.sort, args.cdr3_core, cdr3_column) for group in islice(grouped, 0, args.limit): if members_file: # We get an intentional empty line between groups since # to_csv() already includes a line break print(group.to_csv(sep='\t', header=print_header, index=False), file=members_file) print_header = False rep = representative(group) print(*[rep[col] for col in columns], sep='\t') n += 1 logger.info('%d clonotypes written', n) IgDiscover-0.11/igdiscover/cluster.py000066400000000000000000000114121337725263500176760ustar00rootroot00000000000000from collections import OrderedDict, defaultdict import pandas as pd from scipy.spatial import distance from scipy.cluster import hierarchy from sqt.align import hamming_distance from .utils import distances from .trie import Trie def all_nodes(root): """Return a list of all nodes of the tree, from left to right (iterative implementation).""" result = [] path = [None] node = root while node is not None: if node.left is not None: path.append(node) node = node.left elif node.right is not None: result.append(node) node = node.right else: result.append(node) node = path.pop() if node is not None: result.append(node) node = node.right return result def inner_nodes(root): """ Return a list of all inner nodes of the tree, from left to right. """ return [node for node in all_nodes(root) if not node.is_leaf()] def collect_ids(root): """ Return a list of ids of all leaves of the given tree """ return [node.id for node in all_nodes(root) if node.is_leaf()] def cluster_sequences(sequences, minsize=5): """ Cluster the given sequences into groups of similar sequences. Return a triple that contains a pandas.DataFrame with the edit distances, the linkage result, and a list that maps sequence ids to their cluster id. If an entry is zero in that list, it means that the sequence is not part of a cluster. 
""" matrix = distances(sequences) linkage = hierarchy.linkage(distance.squareform(matrix), method='average') # Linkage columns are: # 0, 1: merged clusters, 2: distance, 3: number of nodes in cluster inner = inner_nodes(hierarchy.to_tree(linkage)) prev = linkage[:, 2].max() # highest distance clusters = [0] * len(sequences) cl = 1 for n in inner: if n.dist > 0 and prev / n.dist < 0.8 \ and n.left.count >= minsize and n.right.count >= minsize: for id in collect_ids(n.left): # Do not overwrite previously assigned ids if clusters[id] == 0: clusters[id] = cl cl += 1 prev = n.dist # At the end of the above loop, we have not processed the rightmost # subtree. In our experiments, it never contains true novel sequences, # so we omit it. return pd.DataFrame(matrix), linkage, clusters class Graph: """Graph that can find connected components""" def __init__(self, nodes): self._nodes = OrderedDict() for node in nodes: self._nodes[node] = [] def add_edge(self, node1, node2): self._nodes[node1].append(node2) self._nodes[node2].append(node1) def connected_components(self): """Return a list of connected components.""" visited = set() components = [] for node, neighbors in self._nodes.items(): if node in visited: continue # Start a new component to_visit = [node] component = [] while to_visit: n = to_visit.pop() if n in visited: continue visited.add(n) component.append(n) for neighbor in self._nodes[n]: if neighbor not in visited: to_visit.append(neighbor) components.append(component) return components def cluster_by_length(strings): """ Cluster a set of strings by length """ string_lists = defaultdict(list) for s in strings: string_lists[len(s)].append(s) return list(string_lists.values()) def single_linkage(strings, linked): """ Cluster a set of strings. *linked* is a function with two parameters s and t that returns whether *s* and *t* are in the same cluster. >>> single_linkage(['ABC', 'ABD', 'DEFG', 'DEFH'], lambda s, t: len(s) == len(t) and ... hamming_distance(s, t) <= 1) [['ABC', 'ABD'], ['DEFG', 'DEFH']] """ graph = Graph(strings) for i, s in enumerate(strings): for j, t in enumerate(strings[i + 1:]): if linked(s, t): graph.add_edge(s, t) return graph.connected_components() def hamming_single_linkage(strings, mismatches, linked=None): """ Cluster a set of strings by hamming distance. Strings with a distance of at most 'mismatches' will be put into the same cluster. Uses the optimization that strings of different lengths can be clustered separately. Use *linked* to override the function passed to single_linkage. It will only be called for strings of the same length. Return a list of connected components (clusters). 
""" if linked is None: def linked(s, t): return hamming_distance(s, t) <= mismatches components = [] for strings in cluster_by_length(strings): components.extend(single_linkage(strings, linked)) return components def hamming_single_linkage_trie(strings, mismatches): """ Cluster by hamming distance using a trie """ components = [] for strings in cluster_by_length(strings): graph = Graph(strings) trie = Trie() for s in strings: trie.add(s) for s in strings: for neighbor in trie.find_all_similar(s, mismatches): if neighbor != s: graph.add_edge(s, neighbor) components.extend(graph.connected_components()) return components IgDiscover-0.11/igdiscover/clusterplot.py000066400000000000000000000075131337725263500206040ustar00rootroot00000000000000""" Plot a clustermap of all sequences assigned to a gene """ import os.path import logging from .table import read_table from .utils import downsampled, plural_s from .cluster import cluster_sequences logger = logging.getLogger(__name__) def add_arguments(parser): arg = parser.add_argument arg('--minimum-group-size', '-m', metavar='N', type=int, default=200, help='Do not plot if there are less than N sequences for a gene. Default: %(default)s') arg('--gene', '-g', action='append', default=[], help='Plot GENE. Can be given multiple times. Default: Plot all genes.') arg('--type', choices=('V', 'D', 'J'), default='V', help='Gene type. Default: %(default)s') arg('--size', metavar='N', type=int, default=300, help='Show at most N sequences (with a matrix of size N x N). Default: %(default)s') arg('--ignore-J', action='store_true', default=False, help='Include also rows without J assignment or J%%SHM>0.') arg('--dpi', type=int, default=200, help='Resolution of output file. Default: %(default)s') arg('--no-title', dest='title', action='store_false', default=True, help='Do not add a title to the plot') arg('table', help='Table with parsed and filtered IgBLAST results') arg('directory', help='Save clustermaps as PNG into this directory', default=None) def plot_clustermap(sequences, title, plotpath, size=300, dpi=200): """ Plot a clustermap of the given sequences size -- Downsample to this many sequences title -- plot title Return the number of clusters. 
""" import seaborn as sns logger.info('Clustering %d sequences (downsampled to at most %d)', len(sequences), size) sequences = downsampled(sequences, size) df, linkage, clusters = cluster_sequences(sequences) palette = sns.color_palette([(0.15, 0.15, 0.15)]) palette += sns.color_palette('Spectral', n_colors=max(clusters), desat=0.9) row_colors = [palette[cluster_id] for cluster_id in clusters] cm = sns.clustermap(df, row_linkage=linkage, col_linkage=linkage, row_colors=row_colors, linewidths=None, linecolor='none', figsize=(210/25.4, 210/25.4), cmap='Blues', xticklabels=False, yticklabels=False ) if title is not None: cm.fig.suptitle(title) cm.savefig(plotpath, dpi=dpi) # free the memory used by the plot import matplotlib.pyplot as plt plt.close('all') return len(set(clusters)) def main(args): import matplotlib matplotlib.use('agg') import seaborn as sns sns.set() if not os.path.exists(args.directory): os.mkdir(args.directory) gene_col = args.type + '_gene' seq_col = args.type + '_nt' usecols = ['J_SHM', 'V_gene', gene_col, seq_col] table = read_table(args.table, usecols=usecols) # Discard rows with any mutation within J at all logger.info('%s rows read', len(table)) if not args.ignore_J: # Discard rows with any mutation within J at all table = table[table.J_SHM == 0][:] logger.info('%s rows remain after discarding J%%SHM > 0', len(table)) path_sanitizer = {ord(c): None for c in "\\/*?[]=\"'"} genes = frozenset(args.gene) n = 0 too_few = 0 for gene, group in table.groupby(gene_col): if genes and gene not in genes: continue if len(group) < args.minimum_group_size: too_few += 1 continue title = gene if args.title else None path = os.path.join(args.directory, gene.translate(path_sanitizer) + '.png') sequences = list(group[seq_col]) n_clusters = plot_clustermap(sequences, title, path, size=args.size, dpi=args.dpi) n += 1 logger.info('Plotted %r with %d cluster%s', gene, n_clusters, plural_s(n_clusters)) #for i, cons in enumerate(consensus_sequences): #print('>{}_cluster{}\n{}'.format(gene, ('red', 'blue')[i], cons)) #print('number of Ns:', cons.count('N')) #if len(consensus_sequences) >= 2: #print('difference between consensuses:', edit_distance(*consensus_sequences[:2])) logger.info('%s plots created (%s skipped because of too few sequences)', n, too_few) IgDiscover-0.11/igdiscover/commonv.py000066400000000000000000000031031337725263500176710ustar00rootroot00000000000000""" Find common V genes between two different antibody libraries. """ import logging from collections import Counter from .table import read_table logger = logging.getLogger(__name__) def add_arguments(parser): arg = parser.add_argument arg('--minimum-frequency', '-n', type=int, metavar='N', default=None, help='Minimum number of datasets in which sequence must occur (default is no. 
of files divided by two)') arg('table', help='Tables with parsed and filtered IgBLAST results (give at least two)', nargs='+') def main(args): if args.minimum_frequency is None: # args.table is a list of file names minimum_frequency = max((len(args.table) + 1) // 2, 2) else: minimum_frequency = args.minimum_frequency logger.info('Minimum frequency set to %s', minimum_frequency) # Read in tables tables = [] for path in args.table: table = read_table(path) table = table.loc[:,['V_gene', 'V_SHM', 'V_nt', 'name']] tables.append(table) # Count V sequence occurrences counter = Counter() for table in tables: counter.update(set(table.V_nt)) # Find most frequent occurrences and print result print('Frequency', 'Gene', '%SHM', 'Sequence', sep='\t') for sequence, frequency in counter.most_common(): if frequency < minimum_frequency: break names = [] gene = None for table in tables: matching_rows = table[table.V_nt == sequence] if matching_rows.empty: continue names.extend(matching_rows.name) if gene is None: row = matching_rows.iloc[0] gene = row['V_gene'] shm = row['V_SHM'] print(frequency, gene, shm, sequence, *names, sep='\t') IgDiscover-0.11/igdiscover/config.py000066400000000000000000000112071337725263500174640ustar00rootroot00000000000000""" Change configuration file Example: igdiscover config --set iterations 1 """ import os import logging import ruamel.yaml logger = logging.getLogger(__name__) class ConfigurationError(Exception): pass class Config: DEFAULT_PATH = 'igdiscover.yaml' def __init__(self, file): # Set some defaults. self.debug = False self.species = None self.sequence_type = 'Ig' self.merge_program = 'pear' self.flash_maximum_overlap = 300 self.limit = None # or an integer self.multialign_program = 'muscle-fast' self.minimum_merged_read_length = 300 self.mismatch_penalty = None self.barcode_length = 0 self.barcode_consensus = True self.iterations = 1 self.ignore_j = False self.d_coverage = 70 self.subsample = 1000 self.stranded = False self.forward_primers = None self.reverse_primers = None self.rename = False self.race_g = False self.seed = 1 self.exact_copies = None self.preprocessing_filter = dict(v_coverage=90, j_coverage=60, v_evalue=1E-3) self.pre_germline_filter = dict( unique_cdr3s=2, unique_js=2, whitelist=True, cluster_size=0, allow_stop=True, # allow_chimeras=False, cross_mapping_ratio=0.02, clonotype_ratio=0.12, exact_ratio=0.12, cdr3_shared_ratio=0.8, unique_d_ratio=0.3, unique_d_threshold=10 ) self.germline_filter = dict( unique_cdr3s=5, unique_js=3, whitelist=True, cluster_size=100, allow_stop=False, # allow_chimeras=False, cross_mapping_ratio=0.02, clonotype_ratio=0.12, exact_ratio=0.12, cdr3_shared_ratio=0.8, unique_d_ratio=0.3, unique_d_threshold=10 ) self.j_discovery = dict(allele_ratio=0.2, cross_mapping_ratio=0.1, propagate=True) self.cdr3_location = 'detect' self.read_from(file) def read_from(self, file): content = file.read() new_config = self.make_compatible(ruamel.yaml.safe_load(content)) for key in ('preprocessing_filter', 'pre_germline_filter', 'germline_filter', 'j_discovery'): if key in new_config: self.__dict__[key].update(new_config[key]) del new_config[key] self.__dict__.update(new_config) @staticmethod def make_compatible(config): """ Convert old-style configuration to new style. Raise ConfigurationError if configuration is invalid. Return updated config dict. 
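        Illustrative conversions (example values only):
        {'barcode_length_5prime': 12} becomes {'barcode_length': 12};
        {'barcode_length_3prime': 8} becomes {'barcode_length': -8};
        a germline_filter section with allele_ratio: 0.2 gains
        clonotype_ratio: 0.2.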
""" if 'barcode_length' in config: raise ConfigurationError( 'Old-style configuration of barcode length via "barcode_length"' 'is no longer supported.') barcode_length_5prime = config.get('barcode_length_5prime', 0) barcode_length_3prime = config.get('barcode_length_3prime', 0) if barcode_length_5prime > 0 and barcode_length_3prime > 0: raise ConfigurationError( 'barcode_length_5prime and barcode_length_3prime can currently ' 'not both be greater than zero.') if barcode_length_5prime > 0: config['barcode_length'] = barcode_length_5prime elif barcode_length_3prime > 0: config['barcode_length'] = -barcode_length_3prime config.pop('barcode_length_5prime', None) config.pop('barcode_length_3prime', None) if 'seed' in config and config['seed'] is False: config['seed'] = None for key in ('germline_filter', 'pregermline_filter'): if key in config and 'allele_ratio' in config[key]: config[key]['clonotype_ratio'] = config[key]['allele_ratio'] return config @classmethod def from_default_path(cls): with open(cls.DEFAULT_PATH) as f: return Config(file=f) class GlobalConfig: def __init__(self): self.use_cache = False path = os.getenv('XDG_CONFIG_HOME', os.path.expanduser('~/.config')) path = os.path.join(path, 'igdiscover.conf') if os.path.exists(path): with open(path) as f: config = ruamel.yaml.safe_load(f) if config is None: return self.use_cache = config.get('use_cache', False) def add_arguments(parser): arg = parser.add_argument arg('--set', nargs=2, default=[], metavar=('KEY', 'VALUE'), action='append', help='Set KEY to VALUE. Use KEY.SUBKEY[.SUBSUBKEY...] for nested keys.') arg('--file', default=Config.DEFAULT_PATH, help='Configuration file to modify. Default: igdiscover.yaml in current directory.') def main(args): if args.set: with open(args.file) as f: config = ruamel.yaml.load(f, ruamel.yaml.RoundTripLoader) for k, v in args.set: v = ruamel.yaml.safe_load(v) # config[k] = v item = config # allow nested keys keys = k.split('.') for i in keys[:-1]: item = item[i] item[keys[-1]] = v tmpfile = args.file + '.tmp' with open(tmpfile, 'w') as f: print(ruamel.yaml.dump(config, Dumper=ruamel.yaml.RoundTripDumper), end='', file=f) os.rename(tmpfile, args.file) else: with open(args.file) as f: config = ruamel.yaml.safe_load(f) print(ruamel.yaml.dump(config), end='') IgDiscover-0.11/igdiscover/count.py000066400000000000000000000112141337725263500173450ustar00rootroot00000000000000""" Compute expression counts This command takes a table of filtered IgBLAST results (filtered.tab.gz), filters it by various criteria and then counts how often specific genes are named. By default, all genes with non-zero expression are listed, sorted by name. Use --database to change this. """ import sys import logging import pandas as pd from sqt import SequenceReader from .table import read_table from .utils import natural_sort_key from .discoverj import filter_by_allele_ratio, compute_expressions logger = logging.getLogger(__name__) def add_arguments(parser): arg = parser.add_argument arg('--gene', choices=('V', 'D', 'J'), default='V', help='Which gene type: Choose V, D or J. Default: Default: %(default)s') arg('--database', metavar='FASTA', help='Compute expressions for the sequences that are named in the FASTA ' 'file. Only the sequence names in the file are used! 
This is the only ' 'way to also include genes with an expression of zero.') arg('--plot', metavar='FILE', help='Plot expressions to FILE (PDF or PNG)') group = parser.add_argument_group('Input table filtering') group.add_argument('--d-evalue', type=float, default=None, help='Maximal allowed E-value for D gene match. Default: 1E-4 ' 'if --gene=D, no restriction otherwise.') group.add_argument('--d-coverage', '--D-coverage', type=float, default=None, help='Minimum D coverage (in percent). Default: 70 ' 'if --gene=D, no restriction otherwise.') group.add_argument('--d-errors', type=int, default=None, help='Maximum allowed D errors. Default: No limit.') group = parser.add_argument_group('Expression counts table filtering') group.add_argument('--allele-ratio', type=float, metavar='RATIO', default=None, help='Required allele ratio. Works only for genes named "NAME*ALLELE". ' 'Default: Do not check allele ratio.') arg('table', help='Table with parsed and filtered IgBLAST results') def plot_counts(counts, gene_type): """Plot expression counts. Return a Figure object""" from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas from matplotlib.figure import Figure import matplotlib import seaborn as sns import numpy as np sns.set() fig = Figure(figsize=((50 + len(counts) * 5) / 25.4, 210/25.4)) matplotlib.rcParams.update({'font.size': 14}) FigureCanvas(fig) ax = fig.add_subplot(111) ax.set_title('{} gene usage'.format(gene_type)) ax.set_xlabel('{} gene'.format(gene_type)) ax.set_ylabel('Count') ax.set_xticks(np.arange(len(counts)) + 0.5) ax.set_xticklabels(counts.index, rotation='vertical') ax.grid(axis='x') ax.set_xlim((-0.25, len(counts))) ax.bar(np.arange(len(counts)), counts['count']) fig.set_tight_layout(True) return fig def main(args): if args.database: with SequenceReader(args.database) as fr: gene_names = [record.name for record in fr] gene_names.sort(key=natural_sort_key) else: gene_names = None usecols = ['V_gene', 'D_gene', 'J_gene', 'V_errors', 'D_errors', 'J_errors', 'D_covered', 'D_evalue', 'CDR3_nt'] table = read_table(args.table, usecols=usecols) logger.info('Table with %s rows read', len(table)) # Set default filters depending on gene if args.gene == 'D': if args.d_evalue is None: args.d_evalue = 1E-4 if args.d_coverage is None: args.d_coverage = 70 if args.d_evalue is not None: table = table[table.D_evalue <= args.d_evalue] logger.info('%s rows remain after requiring D E-value <= %s', len(table), args.d_evalue) if args.d_coverage is not None: table = table[table.D_covered >= args.d_coverage] logger.info('%s rows remain after requiring D coverage >= %s', len(table), args.d_coverage) if args.d_errors is not None: table = table[table.D_errors <= args.d_errors] logger.info('%s rows remain after requiring D errors <= %s', len(table), args.d_errors) logger.info('Computing expression counts for %s genes', args.gene) counts = compute_expressions(table, args.gene) # Make sure that always all gene names are listed even if no sequences # were assigned. 
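    # Illustrative effect (gene names made up): if the database lists
    # IGHJ1*01 through IGHJ6*01 but no read was assigned IGHJ5*01, the
    # reindex() below still emits an IGHJ5*01 row with count 0.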
if gene_names: counts = counts.reindex(gene_names, fill_value=0) if args.database and args.allele_ratio: logger.error('--database and --allele-ratio cannot be used at the same time.') sys.exit(1) logger.info('Computed expressions for %d genes', len(counts)) if args.allele_ratio is not None: counts = filter_by_allele_ratio(counts, args.allele_ratio) logger.info( 'After filtering by allele ratio, %d genes remain', len(counts)) # logger.info('%d sequences were not assigned a %s gene', unassigned, args.gene) counts.to_csv(sys.stdout, sep='\t') logger.info('Wrote expression count table') if args.plot: plot_counts(counts, args.gene).savefig(args.plot) logger.info("Wrote %s", args.plot) IgDiscover-0.11/igdiscover/dbdiff.py000066400000000000000000000136351337725263500174440ustar00rootroot00000000000000""" Compare two FASTA files based on sequences The order of records in the two files does not matter. Exit code: 2 if duplicate sequences or duplicate record names were found 1 if there are any lost or gained records or sequence differences 0 if the records are identical, but allowing for different record names """ import sys import logging import numpy as np from scipy.optimize import linear_sum_assignment from sqt import FastaReader from sqt.align import hamming_distance logger = logging.getLogger(__name__) do_not_show_cpustats = 1 def add_arguments(parser): arg = parser.add_argument arg('--color', default='auto', choices=('auto', 'never', 'always'), help='Whether to colorize output') arg('a', help='FASTA file with expected sequences') arg('b', help='FASTA file with actual sequences') RED = "\x1b[0;31m" GREEN = "\x1b[0;32m" RESET = "\x1b[0m" def red(s): return RED + s + RESET def green(s): return GREEN + s + RESET def check_duplicate_names(records): names = set() for record in records: if record.name in names: yield record.name names.add(record.name) def check_exact_duplicate_sequences(records): sequences = dict() for record in records: if record.sequence in sequences: yield record.name, sequences[record.sequence] else: sequences[record.sequence] = record.name def compare(a, b): """Return cost of comparing a to b""" l = min(len(a.sequence), len(b.sequence)) length_diff = max(len(a.sequence), len(b.sequence)) - l dist_prefixes = hamming_distance(a.sequence[:l], b.sequence[:l]) dist_suffixes = hamming_distance(a.sequence[-l:], b.sequence[-l:]) return 5 * min(dist_prefixes, dist_suffixes) + length_diff def pair_up_identical(a_records, b_records): identical = [] b_map = {record.sequence: record for record in b_records} a_rest = [] for a in a_records: if a.sequence in b_map: identical.append((a, b_map[a.sequence])) del b_map[a.sequence] else: a_rest.append(a) return identical, a_rest, list(b_map.values()) def pair_up(a_records, b_records, max_cost=20): # Pair up identical sequences first identical, a_records, b_records = pair_up_identical(a_records[:], b_records[:]) # Compare all vs all and fill in a score matrix m = len(a_records) n = len(b_records) cost = np.zeros((m, n), dtype=int) for i, a in enumerate(a_records): for j, b in enumerate(b_records): cost[i, j] = compare(a, b) # Solve minimum weighted bipartite matching assignment = linear_sum_assignment(cost) similar = [] a_similar = set() b_similar = set() for i, j in zip(*assignment): if cost[i, j] <= max_cost: similar.append((a_records[i], b_records[j])) a_similar.add(i) b_similar.add(j) a_only = [a for i, a in enumerate(a_records) if i not in a_similar] b_only = [b for j, b in enumerate(b_records) if j not in b_similar] return a_only, b_only, 
identical, similar


def format_indel(a, b, colored: bool):
    if len(a) > len(b):
        assert len(b) == 0
        s = '{-' + a + '}'
        return red(s) if colored else s
    elif len(b) > len(a):
        assert len(a) == 0
        s = '{+' + b + '}'
        return green(s) if colored else s
    else:
        return ''


def print_similar(a, b, colored: bool):
    l = min(len(a.sequence), len(b.sequence))
    dist_prefixes = hamming_distance(a.sequence[:l], b.sequence[:l])
    dist_suffixes = hamming_distance(a.sequence[-l:], b.sequence[-l:])
    if dist_prefixes <= dist_suffixes:
        a_prefix = ''
        b_prefix = ''
        a_common = a.sequence[:l]
        b_common = b.sequence[:l]
        a_suffix = a.sequence[l:]
        b_suffix = b.sequence[l:]
    else:
        a_prefix = a.sequence[:-l]
        b_prefix = b.sequence[:-l]
        a_common = a.sequence[-l:]
        b_common = b.sequence[-l:]
        a_suffix = ''
        b_suffix = ''

    s = format_indel(a_prefix, b_prefix, colored)
    edits = []
    for i, (ac, bc) in enumerate(zip(a_common, b_common)):
        if ac != bc:
            # Use a separate variable here: assigning to 's' would clobber
            # the prefix part computed above.
            if colored:
                edit = '{' + red(ac) + ' → ' + green(bc) + '}'
            else:
                edit = '{' + ac + ' → ' + bc + '}'
            edits.append(edit)
        else:
            edits.append(ac)
    s += ''.join(edits)
    s += format_indel(a_suffix, b_suffix, colored)
    print('~', a.name, '--', b.name)
    print(s)
    print()


def main(args):
    if args.color == 'auto':
        colored = sys.stdout.isatty()
    elif args.color == 'never':
        colored = False
    else:
        assert args.color == 'always'
        colored = True

    with FastaReader(args.a) as f:
        a_records = list(f)
    with FastaReader(args.b) as f:
        b_records = list(f)

    has_duplicate_names = False
    for records, path in ((a_records, args.a), (b_records, args.b)):
        dups = list(check_duplicate_names(records))
        if dups:
            has_duplicate_names = True
            print('Duplicate record names found in', path)
            for name in dups:
                print('-', name)

    has_duplicate_sequences = False
    # Use 'records' as the loop variable; looping over 'record' while calling
    # the check with 'records' would test the same file twice.
    for records, path in ((a_records, args.a), (b_records, args.b)):
        dups = list(check_exact_duplicate_sequences(records))
        if dups:
            has_duplicate_sequences = True
            print('Duplicate sequences found in', path)
            for name, name_orig in dups:
                print('-', name, 'is identical to earlier record', name_orig)

    only_a, only_b, identical, similar = pair_up(a_records, b_records)
    different_name = [(a, b) for a, b in identical if a.name != b.name]

    # Summary
    print('{} vs {} records. {} lost, {} gained, {} identical, {} different name, {} similar'.format(
        len(a_records), len(b_records), len(only_a), len(only_b),
        len(identical) - len(different_name), len(different_name), len(similar)))

    # Report what has changed
    if only_a:
        print()
        print('## Only in A')
        for record in only_a:
            print('-', record.name)
    if only_b:
        print()
        print('## Only in B')
        for record in only_b:
            print('+', record.name)
    if different_name:
        print()
        print('## Different name (sequence identical)')
        for a, b in different_name:
            print('=', a.name, '--', b.name)
    if similar:
        print()
        print('## Similar')
        for a, b in similar:
            print_similar(a, b, colored)

    if has_duplicate_names or has_duplicate_sequences:
        sys.exit(2)
    if only_a or only_b or similar:
        sys.exit(1)
    # different name is fine for success
    sys.exit(0)
IgDiscover-0.11/igdiscover/dendrogram.py000066400000000000000000000052011337725263500203360ustar00rootroot00000000000000"""
Draw a dendrogram of sequences in a FASTA file.
"""
import logging

import numpy as np
import matplotlib
from scipy.spatial import distance
from scipy.cluster import hierarchy
from sqt import FastaReader

from igdiscover.utils import distances

logger = logging.getLogger(__name__)


def add_arguments(parser):
    arg = parser.add_argument
    arg('--mark', '--db', metavar='FASTA',
        help='Path to a FASTA file with a set of "known" sequences.
Sequences ' 'in the main file that do *not* occur here will be marked with (new). ' 'If not given, no sequences will be marked (use this to compare two ' 'databases).') arg('--method', choices=('single', 'complete', 'weighted', 'average'), default='average', help='Linkage method. Default: "average" (=UPGMA)') arg('fasta', help='Path to input FASTA file') arg('plot', help='Path to output PDF or PNG') class PrefixComparer: def __init__(self, sequences): self._sequences = [ s.upper() for s in sequences ] def __contains__(self, other): for seq in self._sequences: if seq.startswith(other) or other.startswith(seq): return True return False def main(args): with FastaReader(args.fasta) as fr: sequences = list(fr) logger.info('Plotting dendrogram of %s sequences', len(sequences)) if args.mark: with FastaReader(args.mark) as fr: mark = PrefixComparer(record.sequence for record in fr) labels = [] n_new = 0 for record in sequences: if record.sequence not in mark: extra = ' (new)' n_new += 1 else: extra = '' labels.append(record.name + extra) logger.info('%s sequence(s) marked as "new"', n_new) else: labels = [s.name for s in sequences] import seaborn as sns import matplotlib.pyplot as plt sns.set() sns.set_style("white") font_size = 297 / 25.4 * 72 / (len(labels) + 5) font_size = min(16, max(6, font_size)) height = font_size * (len(labels) + 5) / 72 fig = plt.figure(figsize=(210 / 25.4, height)) matplotlib.rcParams.update({'font.size': 4}) ax = fig.gca() sns.despine(ax=ax, top=True, right=True, left=True, bottom=True) sns.set_style('whitegrid') if len(sequences) >= 2: m = distances([s.sequence for s in sequences]) y = distance.squareform(m) mindist = int(y.min()) logger.info('Smallest distance is %s. Found between:', mindist) for i,j in np.argwhere(m == y.min()): if i < j: logger.info('%s and %s', labels[i], labels[j]) l = hierarchy.linkage(y, method=args.method) hierarchy.dendrogram(l, labels=labels, leaf_font_size=font_size, orientation='right', color_threshold=0.95*max(l[:,2])) else: ax.text(0.5, 0.5, 'no sequences', fontsize='xx-large') ax.grid(False) fig.set_tight_layout(True) fig.savefig(args.plot) IgDiscover-0.11/igdiscover/dereplicate.py000066400000000000000000000055161337725263500205060ustar00rootroot00000000000000""" Dereplicate sequences and remove barcodes The difference to the 'group' subcommand is that that one also computes consensus sequences from groups with identical barcode/CDR3. This one does not. The barcode can be in the 5' end or the 3' end of the sequence. Use --trim-g to remove initial runs of G at the 5' end (artifact from RACE protocol). These are removed after the barcode is removed. """ import logging from collections import defaultdict from itertools import islice import json from sqt import SequenceReader logger = logging.getLogger(__name__) def add_arguments(parser): arg = parser.add_argument arg('--limit', default=None, type=int, metavar='N', help='Limit processing to the first N reads') arg('--trim-g', action='store_true', default=False, help="Trim 'G' nucleotides at 5' end") arg('--minimum-length', '-l', type=int, default=0, help='Minimum sequence length') arg('--barcode-length', '-b', type=int, default=0, help="Length of barcode. Positive for 5' barcode, negative for 3' barcode. 
" "Default: %(default)s") arg('--json', metavar="FILE", help="Write statistics to FILE") arg('fastx', metavar='FASTA/FASTQ', help='FASTA or FASTQ file (can be gzip-compressed) with sequences') def main(args): barcode_length = args.barcode_length too_short = 0 n = 0 sequences = defaultdict(list) # maps sequences to a list of Sequence objects containing them with SequenceReader(args.fastx) as f: for record in islice(f, 0, args.limit): n += 1 if len(record) < args.minimum_length: too_short += 1 continue sequences[record.sequence].append(record) n_written = 0 for records in sequences.values(): # If there are multiple records with the same sequence, pick the first record = records[0] if barcode_length >= 0: barcode = record.sequence[:barcode_length] unbarcoded = record[barcode_length:] else: barcode = record.sequence[barcode_length:] unbarcoded = record[:barcode_length] if args.trim_g: # The RACE protocol leads to a run of non-template Gs in the beginning # of the sequence, after the barcode. unbarcoded.sequence = unbarcoded.sequence.lstrip('G') if unbarcoded.qualities: unbarcoded.qualities = unbarcoded.qualities[-len(unbarcoded.sequence):] name = record.name.split(maxsplit=1)[0] if name.endswith(';'): name = name[:-1] if barcode_length: print('>{};barcode={};size={};\n{}'.format(name, barcode, len(records), unbarcoded.sequence)) else: print('>{};size={};\n{}'.format(name, len(records), unbarcoded.sequence)) n_written += 1 logger.info('%s sequences processed', n) logger.info('%s sequences long enough', n - too_short) logger.info('%s dereplicated sequences written', n_written) if args.json: stats = { 'groups_written': n_written, } with open(args.json, 'w') as f: json.dump(stats, f, indent=2) print(file=f) IgDiscover-0.11/igdiscover/discover.py000066400000000000000000000457421337725263500200500ustar00rootroot00000000000000""" Discover candidate new V genes within a single antibody library Existing V sequences are grouped by their V gene assignment, and within each group, consensus sequences are computed. """ import csv import sys import hashlib import logging import random import itertools import multiprocessing from contextlib import ExitStack from collections import namedtuple, Counter import numpy as np import pandas as pd from sqt import SequenceReader from sqt.align import edit_distance from .table import read_table from .cluster import cluster_sequences, single_linkage from .utils import (iterative_consensus, unique_name, downsampled, SerialPool, Merger, has_stop, describe_nt_change, available_cpu_count) logger = logging.getLogger(__name__) MINGROUPSIZE = 5 MINEXPRESSED = 10 MAXIMUM_SUBSAMPLE_SIZE = 1600 def add_arguments(parser): arg = parser.add_argument arg('--threads', '-j', type=int, default=min(4, available_cpu_count()), help='Number of threads. Default: no. of available CPUs, but at most 4') arg('--seed', type=int, default=None, help='Seed value for random numbers for reproducible runs.') arg('--consensus-threshold', '-t', metavar='PERCENT', type=float, default=60, help='Threshold for consensus computation. Default: %(default)s%%') arg('--gene', '-g', action='append', default=[], help='Compute consensus for this gene. Can be given multiple times. ' 'Default: Compute for all genes.') arg('--limit', type=int, default=None, metavar='N', help='Skip remaining genes as soon as at least N candidates were generated. Default: No limit') arg('--left', '-l', type=float, metavar='ERROR-RATE', help='For consensus, include only sequences that have at least this ' 'error rate (in percent). 
Default: %(default)s', default=0) arg('--right', '-r', type=float, metavar='ERROR-RATE', help='For consensus, include only sequences that have at most this ' 'error rate (in percent). Default: %(default)s', default=100) arg('--window-width', '-w', type=float, metavar='PERCENT', help='Compute consensus for all PERCENT-wide windows. Set to 0 to ' 'disable. Default: %(default)s', default=2) arg('--no-cluster', dest='cluster', action='store_false', default=True, help='Do not run linkage cluster analysis.') arg('--cluster-exact', metavar='N', type=int, default=0, help='Treat N exact occurrences of a sequence as a cluster. ' 'Default: Do not cluster exact occurrences') arg('--max-n-bases', type=int, default=0, metavar='MAXN', help='Remove rows that have more than MAXN "N" nucleotides. If >0, an ' 'N_bases column is added. Default: %(default)s') arg('--subsample', metavar='N', type=int, default=1000, help='When clustering, use N randomly chosen sequences. Default: %(default)s') arg('--ignore-J', action='store_true', default=False, help='Include also rows without J assignment or J%%SHM>0.') arg('--exact-copies', metavar='N', type=int, default=1, help='When subsampling, first pick rows whose V gene sequences' 'have at least N exact copies in the input. Default: %(default)s') arg('--d-evalue', metavar='EVALUE', type=float, default=1E-4, help='For Ds_exact, require D matches with an E-value of ' 'at most EVALUE. Default: %(default)s') arg('--d-coverage', '--D-coverage', metavar='COVERAGE', type=float, default=70, help='For Ds_exact, require D matches with a minimum D ' 'coverage of COVERAGE (in percent). Default: %(default)s)') arg('--clonotype-diff', metavar='DIFFERENCES', type=int, default=6, help='When clustering CDR3s to computer the no. of clonotypes, allow DIFFERENCES ' 'between (nucleotide-)sequences. Default: %(default)s') arg('--table-output', '-o', metavar='DIRECTORY', help='Output tables for all analyzed genes to DIRECTORY. ' 'Files will be named .tab.') arg('--database', metavar='FASTA', default=None, help='FASTA file with V genes. If provided, differences between consensus ' 'and database will be computed.') arg('--read-names', metavar='FILE', help='Write names of reads with exact matches used in discovering each candidate ' 'to FILE') arg('table', help='Table with parsed IgBLAST results') Groupinfo = namedtuple('Groupinfo', 'count unique_D unique_J unique_CDR3 shared_CDR3_ratio clonotypes read_names unique_barcodes') SiblingInfo = namedtuple('SiblingInfo', 'sequence requested name group') class SiblingMerger(Merger): """ Merge very similar consensus sequences into single entries. This could be seen as a type of clustering using very specific criteria. Two sequences are merged if one is the prefix of the other, allowing differences where one of the sequences has an 'N' base. """ def merged(self, s, t): chars = [] for c1, c2 in itertools.zip_longest(s.sequence, t.sequence): if c1 is None: c = c2 elif c2 is None: c = c1 elif c1 == 'N': c = c2 elif c2 == 'N': c = c1 elif c1 != c2: return None else: assert c1 == c2 c = c1 chars.append(c) seq = ''.join(chars) requested = s.requested or t.requested name = s.name + ';' + t.name # take union of groups group = pd.concat([s.group, t.group]).groupby(level=0).last() return SiblingInfo(seq, requested, name, group) class Discoverer: """ Discover candidates for novel V genes. 
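    Usage sketch (illustrative): an instance is constructed once with the
    database and the window/filter settings, then called as
    discoverer((gene, group)) for every V gene group; main() below does this
    via a multiprocessing pool. Each call returns a list of Candidate records.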
""" def __init__(self, database, windows, left, right, cluster, cluster_exact, table_output, consensus_threshold, downsample, clonotype_differences, cluster_subsample_size, max_n_bases, exact_copies, d_coverage, d_evalue, seed, cdr3_counts): self.database = database self.windows = windows self.left = left self.right = right self.cluster = cluster self.cluster_exact = cluster_exact self.table_output = table_output self.consensus_threshold = consensus_threshold self.downsample = downsample self.clonotype_differences = clonotype_differences self.cluster_subsample_size = cluster_subsample_size self.max_n_bases = max_n_bases self.exact_copies = exact_copies self.d_coverage = d_coverage self.d_evalue = d_evalue self.seed = seed self.cdr3_counts = cdr3_counts def _sibling_sequence(self, gene, group): """ For a given group, compute a consensus sequence over the V gene sequences in that group. If the found sibling is slightly longer or shorter than the version in the database, adjust it so it corresponds to the database version exactly. """ sequence = iterative_consensus(list(group.V_nt), program='muscle-medium', threshold=self.consensus_threshold/100, maximum_subsample_size=self.downsample) if gene in self.database: database_sequence = self.database[gene] if sequence.startswith(database_sequence) or database_sequence.startswith(sequence): return database_sequence return sequence @staticmethod def _guess_chain(group): """ Return a guess for the chain type of a given group """ return Counter(group.chain).most_common()[0][0] @staticmethod def _guess_cdr3_start(group): """ Return a guess for the CDR3 start within sequences in the given group """ return Counter(group.V_CDR3_start).most_common()[0][0] def count_unique_d(self, table): g = table[(table.D_errors == 0) & (table.D_covered >= self.d_coverage) & (table.D_evalue <= self.d_evalue)] return len(set(s for s in g.D_gene if s)) @staticmethod def count_unique_barcodes(group): return len(set(s for s in group.barcode if s)) def count_clonotypes(self, table): """ Cluster sequences by edit distance and return the number of clusters. The sequences are first group by their J assignment and cluster numbers are computed within these groups separately, then summed up. 
""" distance = self.clonotype_differences def linked(s, t): return edit_distance(s, t, distance) <= distance total = 0 for j_gene, group in table.groupby('J_gene'): sequences = list(set(s for s in group.CDR3_nt if s)) components = single_linkage(sequences, linked) total += len(components) return total def _cluster_siblings(self, gene, table): """Find candidates by clustering sequences assigned to one gene""" if self.exact_copies > 1: # Preferentially pick those sequences (for subsampling) that have # multiple exact copies, then fill up with the others exact_group = table[table.copies >= self.exact_copies] indices = downsampled(list(exact_group.index), self.cluster_subsample_size) if len(indices) < self.cluster_subsample_size: not_exact_group = table[table.copies < self.exact_copies] indices.extend(downsampled(list(not_exact_group.index), self.cluster_subsample_size - len(indices))) else: indices = downsampled(list(table.index), self.cluster_subsample_size) # Ignore CDR3 part of the V sequence for clustering sequences_no_cdr3 = list(table.V_no_CDR3.loc[indices]) df, linkage, clusters = cluster_sequences(sequences_no_cdr3, MINGROUPSIZE) n_clusters = len(set(clusters)) logger.info('%6d %s assignments generated %d cluster%s', len(table), gene, n_clusters, 's' if n_clusters != 1 else '') cluster_indices = [[] for _ in range(max(clusters) + 1)] for i, cluster_id in enumerate(clusters): cluster_indices[cluster_id].append(indices[i]) cl = 0 for ind in cluster_indices: group = table.loc[ind] if len(group) < MINGROUPSIZE: continue sibling = self._sibling_sequence(gene, group) cl += 1 yield SiblingInfo( sequence=sibling, requested=False, name='cl{}'.format(cl), group=group) def _window_siblings(self, gene, table): """ Find candidates by clustering sequences that have a similar number of differences to the reference sequence """ for left, right in self.windows: left, right = float(left), float(right) group = table[(left <= table.V_SHM) & (table.V_SHM < right)] if len(group) < MINGROUPSIZE: continue sibling = self._sibling_sequence(gene, group) if left == int(left): left = int(left) if right == int(right): right = int(right) requested = (left, right) == (self.left, self.right) if (left, right) == (0, 100): name = 'all' else: name = '{}-{}'.format(left, right) yield SiblingInfo(sequence=sibling, requested=requested, name=name, group=group) def _cluster_exact_candidates(self, gene, table): index = 1 for sequence, group in table.groupby('V_nt'): if len(group) >= self.cluster_exact: name = 'ex{}'.format(index) index += 1 yield SiblingInfo(sequence, False, name, group) def _collect_siblings(self, gene, group): """ gene -- gene name group -- pandas.DataFrame of sequences assigned to that gene Yield SiblingInfo objects """ group = group.copy() # the original reference sequence for all the IgBLAST assignments in this group database_sequence = self.database.get(gene, None) database_sequence_found = False candidate_iterators = [self._window_siblings(gene, group)] if self.cluster: candidate_iterators.append(self._cluster_siblings(gene, group)) if self.cluster_exact: candidate_iterators.append(self._cluster_exact_candidates(gene, group)) for sibling in itertools.chain(*candidate_iterators): if sibling.sequence == database_sequence: database_sequence_found = True yield sibling if database_sequence: # If this is a database sequence and there are some exact occurrences of it, # add it to the list of candidates even if it has not been found as a # cluster. 
group_in_window = group[group.V_errors == 0] if len(group_in_window) >= MINEXPRESSED: if not database_sequence_found: logger.info('Database sequence %r seems to be expressed, but is missing from ' 'candidates. Re-adding it.', gene) yield SiblingInfo(database_sequence, False, 'db', group_in_window) def set_random_seed(self, name): """Set random seed depending on gene name and seed given to constructor""" h = hashlib.md5(name.encode()).digest()[:4] n = int.from_bytes(h, byteorder='big') random.seed(n + self.seed) def __call__(self, args): """ Discover new V genes. args is a tuple (gene, group) gene -- name of the gene group -- a pandas DataFrame with the group corresponding to the gene """ gene, group = args self.set_random_seed(gene) siblings = SiblingMerger() for sibling in self._collect_siblings(gene, group): siblings.add(sibling) candidates = [] for sibling_info in siblings: sibling = sibling_info.sequence n_bases = sibling.count('N') if n_bases > self.max_n_bases: logger.debug('Sibling %s has too many N bases', sibling_info.name) continue sibling_no_cdr3 = sibling[:self._guess_cdr3_start(group)] group_exact_v = group[group.V_no_CDR3 == sibling_no_cdr3] groups = ( ('window', sibling_info.group), ('exact', group_exact_v)) del sibling_no_cdr3 other_cdr3_counts = self.cdr3_counts - Counter(s for s in sibling_info.group.CDR3_nt if s) info = dict() for key, g in groups: cdr3_counts = Counter(s for s in g.CDR3_nt if s) unique_cdr3 = len(cdr3_counts) shared_cdr3_ratio = len(other_cdr3_counts & cdr3_counts) / unique_cdr3 if unique_cdr3 > 0 else 0 unique_j = len(set(s for s in g.J_gene if s)) clonotypes = self.count_clonotypes(g) unique_d = self.count_unique_d(g) unique_barcodes = self.count_unique_barcodes(g) count = len(g.index) read_names = list(g.name) info[key] = Groupinfo(count=count, unique_D=unique_d, unique_J=unique_j, unique_CDR3=unique_cdr3, shared_CDR3_ratio=shared_cdr3_ratio, clonotypes=clonotypes, read_names=read_names, unique_barcodes=unique_barcodes) if gene in self.database: database_diff = edit_distance(sibling, self.database[gene]) database_changes = describe_nt_change(self.database[gene], sibling) else: database_diff = None database_changes = None # Build the Candidate sequence_id = gene if database_diff == 0 else unique_name(gene, sibling) chain = self._guess_chain(sibling_info.group) cdr3_start = self._guess_cdr3_start(sibling_info.group) try: ratio = info['exact'].count / info['exact'].unique_CDR3 except ZeroDivisionError: ratio = 0 # Apply some very light filtering on non-database sequences if database_diff > 0 and info['exact'].count < 2: continue candidate = Candidate( name=sequence_id, source=gene, chain=chain, cluster=sibling_info.name, cluster_size=info['window'].count, Js=info['window'].unique_J, CDR3s=info['window'].unique_CDR3, exact=info['exact'].count, barcodes_exact=info['exact'].unique_barcodes, Ds_exact=info['exact'].unique_D, Js_exact=info['exact'].unique_J, CDR3s_exact=info['exact'].unique_CDR3, clonotypes=info['exact'].clonotypes, CDR3_exact_ratio=ratio, CDR3_shared_ratio=info['exact'].shared_CDR3_ratio, N_bases=n_bases, database_diff=database_diff, database_changes=database_changes, has_stop=has_stop(sibling), CDR3_start=cdr3_start, consensus=sibling, read_names=info['exact'].read_names, ) candidates.append(candidate) return candidates class Candidate(namedtuple('_Candidate', [ 'name', 'source', 'chain', 'cluster', 'cluster_size', 'Js', 'CDR3s', 'exact', 'barcodes_exact', 'Ds_exact', 'Js_exact', 'CDR3s_exact', 'clonotypes', 'CDR3_exact_ratio', 
'CDR3_shared_ratio', 'N_bases', 'database_diff', 'database_changes', 'has_stop', 'CDR3_start', 'consensus', 'read_names', ])): def formatted_dict(self): d = self._asdict() d['has_stop'] = int(d['has_stop']) for name in 'CDR3_exact_ratio', 'CDR3_shared_ratio': d[name] = '{:.2f}'.format(d[name]) del d['read_names'] return d def count_prefixes(sequences): """ Count how often each sequence occurs in the given list of sequences. If one sequence is the prefix of another one, they are considered to be 'identical'. Return a dictionary that maps sequence to count. >>> r = count_prefixes(['A', 'BB', 'CD', 'CDE', 'CDEF']) >>> r == {'A': 1, 'BB': 1, 'CD': 3, 'CDE': 3, 'CDEF': 3} True """ sequences = sorted(sequences) sequences.append('GUARD') prev = 'X' start = 0 count = dict() for i, s in enumerate(sequences): if not s.startswith(prev): # end of a run for j in range(start, i): count[sequences[j]] = i - start start = i prev = s return count def main(args): if args.database: with SequenceReader(args.database) as sr: database = {record.name: record.sequence.upper() for record in sr} else: database = dict() if args.seed: seed = args.seed else: seed = random.randrange(10**6) logger.info('Use --seed=%d to reproduce this run', seed) table = read_table(args.table, usecols=('name', 'chain', 'V_gene', 'D_gene', 'J_gene', 'V_nt', 'CDR3_nt', 'barcode', 'V_CDR3_start', 'V_SHM', 'J_SHM', 'D_covered', 'D_evalue', 'V_errors', 'D_errors', 'J_errors')) table['V_no_CDR3'] = [s[:start] if start != 0 else s for s, start in zip(table.V_nt, table.V_CDR3_start)] logger.info('%s rows read', len(table)) if not args.ignore_J: # Discard rows with any mutation within J at all table = table[table.J_SHM == 0][:] logger.info('%s rows remain after discarding J%%SHM > 0', len(table)) if args.exact_copies > 1: multiplicities = count_prefixes(table.V_no_CDR3) table['copies'] = table.V_no_CDR3.map(multiplicities) logger.info('%s rows contain V sequences with at least %s copies', sum(table.copies >= args.exact_copies), args.exact_copies) columns = list(Candidate._fields) if not args.max_n_bases: columns.remove('N_bases') columns.remove('read_names') writer = csv.DictWriter(sys.stdout, fieldnames=columns, delimiter='\t', lineterminator='\n', extrasaction='ignore') writer.writeheader() genes = set(args.gene) if args.window_width: windows = [(start, start + args.window_width) for start in np.arange(0, 20, args.window_width)] logger.info('Using an error rate window of %.1f%% to %.1f%%', args.left, args.right) windows.append((args.left, args.right)) else: windows = [] groups = [] for gene, group in table.groupby('V_gene'): if genes and gene not in genes: continue if len(group) < MINGROUPSIZE: continue groups.append((gene, group)) cdr3_counts = Counter(s for s in table.CDR3_nt if s) logger.info('%s unique CDR3s detected overall', len(cdr3_counts)) discoverer = Discoverer(database, windows, args.left, args.right, args.cluster, args.cluster_exact, args.table_output, args.consensus_threshold, MAXIMUM_SUBSAMPLE_SIZE, clonotype_differences=args.clonotype_diff, cluster_subsample_size=args.subsample, max_n_bases=args.max_n_bases, exact_copies=args.exact_copies, d_coverage=args.d_coverage, d_evalue=args.d_evalue, seed=seed, cdr3_counts=cdr3_counts) Pool = SerialPool if args.threads == 1 else multiprocessing.Pool n_candidates = 0 read_names_file = None with ExitStack() as stack: pool = stack.enter_context(Pool(args.threads)) if args.read_names: read_names_file = stack.enter_context(open(args.read_names, 'w')) for candidates in pool.imap(discoverer, 
groups, chunksize=1): for candidate in candidates: writer.writerow(candidate.formatted_dict()) if read_names_file: print(candidate.name, *candidate.read_names, sep='\t', file=read_names_file) n_candidates += len(candidates) if args.limit is not None and n_candidates >= args.limit: break sys.stdout.flush() logger.info('%s candidate sequences for %s gene(s) generated', n_candidates, len(groups)) IgDiscover-0.11/igdiscover/discoverj.py000066400000000000000000000377011337725263500202160ustar00rootroot00000000000000""" Discover D and J genes The most frequent D/J sequences are considered candidates. For J genes, candidate sequences are merged if they overlap each other. The result table is written to standard output. Use --fasta to also generate FASTA output. """ import logging import pandas as pd from collections import defaultdict from typing import List from sqt import FastaReader from sqt.align import edit_distance from .utils import Merger, merge_overlapping, unique_name, is_same_gene, slice_arg from .table import read_table, fix_columns logger = logging.getLogger(__name__) MINIMUM_CANDIDATE_LENGTH = 5 def add_arguments(parser): arg = parser.add_argument arg('--database', metavar='FASTA', help='FASTA file with reference gene sequences') arg('--merge', default=None, action='store_true', help='Merge overlapping genes. ' 'Default: Enabled for D, disabled for J and V.') arg('--no-merge', dest='merge', action='store_false', help='Do not merge overlapping genes') arg('--gene', default='J', choices=('V', 'D', 'J'), help='Which gene category to discover. Default: %(default)s') arg('--j-coverage', type=float, default=None, metavar='PERCENT', help='Require that the sequence covers at least PERCENT of the J gene. ' 'Default: 90 when --gene=J; 0 otherwise') arg('--allele-ratio', type=float, metavar='RATIO', default=0.2, help='Required allele ratio. Works only for genes named "NAME*ALLELE". Default: %(default)s') arg('--cross-mapping-ratio', type=float, metavar='RATIO', default=0.1, help='Ratio for detection of cross-mapping artifacts. Default: %(default)s') arg('--min-count', metavar='N', type=int, default=None, help='Omit candidates with fewer than N exact occurrences in the input table. ' 'Default: 1 for J; 10 for D; 100 for V') arg('--no-perfect-matches', dest='perfect_matches', default=True, action='store_false', help='Do not filter out sequences for which the V assignment (or J for --gene=V) ' 'has at least one error') # --gene=D options arg('--d-core-length', metavar='L', type=int, default=6, help='Use only D core regions that have at least length L (only ' 'applies when --gene=D). Default: %(default)s') arg('--d-core', type=slice_arg, default=slice(2, -2), help='D core region location (only applies when --gene=D). 
' 'Default: %(default)s') arg('--fasta', help='Write discovered sequences to FASTA file') arg('table', help='Table with parsed and filtered IgBLAST results') class Candidate: __slots__ = ('name', 'sequence', 'exact_occ', 'max_count', 'other_genes', 'db_name', 'db_distance', 'cdr3s', 'missing') def __init__(self, name, sequence, exact_occ=0, max_count=0, cdr3s=None, other_genes=None, db_name=None, db_distance=None): self.name = name self.sequence = sequence self.exact_occ = exact_occ self.max_count = max_count # an upper bound for the exact_occ self.cdr3s = cdr3s if cdr3s is not None else set() self.other_genes = other_genes if other_genes is not None else set() self.db_name = db_name self.db_distance = db_distance self.missing = '' @property def unique_CDR3(self): return len(self.cdr3s) def __repr__(self): return 'Candidate({sequence!r}, exact_occ={exact_occ}, max_count={max_count}, ...)'.format( sequence=self.sequence, exact_occ=self.exact_occ, max_count=self.max_count, ) class OverlappingSequenceMerger(Merger): """ Merge sequences that overlap """ def merged(self, s, t): """ Merge two sequences if they overlap. If they should not be merged, None is returned. """ m = merge_overlapping(s.sequence, t.sequence) if m is not None: return Candidate(s.name, m, max_count=t.max_count + s.max_count) return None class AlleleRatioMerger(Merger): """ Discard sequences with too low allele ratio """ def __init__(self, allele_ratio, cross_mapping_ratio): super().__init__() self._allele_ratio = allele_ratio self._cross_mapping_ratio = cross_mapping_ratio def merged(self, s, t): """ Merge two sequences if they overlap. If they should not be merged, None is returned. """ # TODO copy-and-pasted from germlinefilter # # Check allele ratio. Somewhat similar to cross-mapping, but # this uses sequence names to decide whether two genes can be # alleles of each other and the ratio is between the CDR3s_exact # values if self._allele_ratio and is_same_gene(s.name, t.name): for u, v in [(s, t), (t, s)]: if v.unique_CDR3 == 0: continue ratio = u.unique_CDR3 / v.unique_CDR3 if ratio < self._allele_ratio: # logger.info('Allele ratio %.4f too low for %r compared to %r', # ratio, u.name, v.name) return v if self._cross_mapping_ratio: # When checking for cross mapping, ignore overhanging bases in the 5' end. # Example: # ---ACTACGACTA... # XXX|||||X|||| # ATTACTACTACTA... if len(t.sequence) < len(s.sequence): t, s = s, t # s is now the shorter sequence t_seq = t.sequence[len(t.sequence) - len(s.sequence):] s_seq = s.sequence dist = edit_distance(s_seq, t_seq, 1) if dist > 1: return None total_occ = (s.exact_occ + t.exact_occ) if total_occ == 0: return None for u, v in [(s, t), (t, s)]: ratio = u.exact_occ / total_occ if ratio < self._cross_mapping_ratio: # u is probably a cross-mapping artifact of the higher-expressed v logger.info('%r is a cross-mapping artifact of %r (ratio %.4f)', u.name, v.name, ratio) return v return None def filter_by_allele_ratio(table, allele_ratio): arm = AlleleRatioMerger(allele_ratio, cross_mapping_ratio=None) renamed_counts = table.reset_index().rename(columns={'gene': 'name'}) arm.extend(renamed_counts.itertuples(index=False)) counts = pd.DataFrame(list(arm), columns=renamed_counts.columns) \ .rename(columns={'name': 'gene'}) \ .set_index('gene') return counts def count_occurrences(candidates, table_path, search_columns, other_gene, other_errors, merge, perfect_matches): """ Count how often each candidate sequence occurs in the input table. 
The columns named in search_columns are concatenated and searched. This circumvents inaccurate IgBLAST alignment boundaries. Rows where the column named by other_errors is not zero are ignored. The input table is read in chunks to reduce memory usage. candidates -- list of candidates table_path -- path to input table search_columns -- which columns to search. The contained strings are concatenated and then searched. merge -- If True, stop searching for other candidates in a single row after one candidate has been found. The following attributes of the candidates are updated: - exact_occ - other_genes - cdr3s Return the updated list of candidates. """ candidates_map = {c.sequence: c for c in candidates} search_order = [c.sequence for c in candidates] cols = [other_gene, 'V_errors', 'J_errors', 'CDR3_nt'] + search_columns search_cache = defaultdict(list) # map haystack sequence to list of candidates that occur in it for chunk in pd.read_csv(table_path, usecols=cols, chunksize=10000, sep='\t'): fix_columns(chunk) if perfect_matches: chunk = chunk[chunk[other_errors] == 0] # concatenate search columns if len(chunk) == 0: # TODO that this is needed is possibly a pandas bug continue chunk['haystack'] = chunk.loc[:, search_columns].astype(str).sum(axis=1) chunk['haystack'] = chunk['haystack'].str.replace('(', '').replace(')', '') for row in chunk.itertuples(): if row.haystack not in search_cache: for needle in search_order: if needle in row.haystack: search_cache[row.haystack].append(candidates_map[needle]) for candidate in search_cache[row.haystack]: candidate.exact_occ += 1 # TODO += row.count? candidate.other_genes.add(getattr(row, other_gene)) candidate.cdr3s.add(row.CDR3_nt) if merge: # When overlapping candidates have been merged, # there will be no other pattern that is a # substring of the current search pattern. break return candidates_map.values() def discard_substring_occurrences(candidates): """ Filter a candidate list by discarding candidates whose sequences are substrings of another candidate’s sequence """ # Shorter sequences first candidates = sorted(candidates, key=lambda c: len(c.sequence)) for i, short in enumerate(candidates): for long in candidates[i+1:]: if short.sequence in long.sequence: break else: # no substring occurrence - keep this candidate yield short def sequence_candidates(table, column, minimum_length, core=slice(None, None), min_occ=3): """ Generate candidates by clustering all sequences in a column (V_nt, D_nt or J_nt). At least min_occ occurrences are required. core -- a slice object. If given, the strings in the column are sliced before being clustered. 
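    Illustrative example (made-up data; the real tables are much larger).
    Slicing with ``core`` and counting works like this:

    >>> import pandas as pd
    >>> t = pd.DataFrame({'D_nt': ['AACGTT', 'AACGTT', 'AACGTA', 'CCCCCC']})
    >>> t['D_nt'].str[slice(1, -1)].value_counts().to_dict()
    {'ACGT': 3, 'CCCC': 1}

    With minimum_length=4 and min_occ=3, only 'ACGT' would become a candidate.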
""" for sequence, occ in table[column].str[core].value_counts().items(): if len(sequence) >= minimum_length and occ >= min_occ: yield Candidate(None, sequence, max_count=occ) def count_unique_cdr3(table): return len(set(s for s in table.CDR3_nt if s)) def count_unique_gene(table, gene_type): return len(set(s for s in table[gene_type + '_gene'] if s)) def compute_expressions(table, gene_type): """Compute expression counts of known genes""" assert gene_type in {'V', 'D', 'J'} columns = ('gene', 'count', 'unique_CDR3') for gt in 'V', 'D', 'J': if gene_type != gt: columns += ('unique_' + gt,) gene_column = gene_type + '_gene' rows = [] for gene, group in table.groupby(gene_column): if gene == '': continue unique_cdr3 = count_unique_cdr3(group) row = dict(gene=gene, count=len(group), unique_CDR3=unique_cdr3) for gt in 'V', 'D', 'J': if gene_type != gt: row['unique_' + gt] = count_unique_gene(group, gene_type=gt) rows.append(row) counts = pd.DataFrame(rows, columns=columns).set_index('gene') return counts def make_whitelist(table, database, gene_type: str, allele_ratio=None) -> List[str]: """ Return a list of sequences that represent expressed alleles """ assert gene_type in {'V', 'D', 'J'} # Compute expression counts in the same way the 'igdiscover count' command would do it counts = compute_expressions(table, gene_type) if allele_ratio: counts = filter_by_allele_ratio(counts, allele_ratio) names = list(counts.index) # Construct whitelist from all expressed alleles database = {r.name: r.sequence for r in database} sequences = [database[name] for name in names] return sequences def print_table(candidates, other_gene, missing): columns = ['name', 'exact_occ', other_gene + 's', 'CDR3s', 'database', 'database_diff', 'sequence'] if missing: columns.append('missing') print(*columns, sep='\t') for candidate in candidates: columns = [ candidate.name, candidate.exact_occ, len(set(candidate.other_genes)), candidate.unique_CDR3, candidate.db_name if candidate.db_name is not None else '', candidate.db_distance if candidate.db_distance is not None else -1, candidate.sequence ] if missing: columns.append(candidate.missing) print(*columns, sep='\t') def main(args): if args.database: with FastaReader(args.database) as fr: database = list(fr) logger.info('Read %d sequences from %r', len(database), args.database) else: database = None column = {'V': 'V_nt', 'J': 'J_nt', 'D': 'D_region'}[args.gene] other = 'V' if args.gene in ('D', 'J') else 'J' other_gene = other + '_gene' other_errors = other + '_errors' table = read_table(args.table, usecols=['count', 'V_gene', 'D_gene', 'J_gene', 'V_errors', 'J_errors', 'J_covered', column, 'CDR3_nt']) logger.info('Table with %s rows read', len(table)) if args.j_coverage is None and args.gene == 'J': args.j_coverage = 90 if args.j_coverage: table = table[table['J_covered'] >= args.j_coverage] logger.info('Keeping %s rows that have J_covered >= %s', len(table), args.j_coverage) if args.perfect_matches: table = table[table[other_errors] == 0] logger.info('Keeping %s rows that have no %s mismatches', len(table), other) if args.merge is None: args.merge = args.gene == 'D' if args.min_count is None: args.min_count = {'J': 1, 'D': 10, 'V': 100}[args.gene] # TODO J is fine, but are D and V? 
if args.gene == 'D': candidates = sequence_candidates( table, column, minimum_length=args.d_core_length, core=args.d_core) elif args.gene == 'J': candidates = sequence_candidates( table, column, minimum_length=MINIMUM_CANDIDATE_LENGTH) else: candidates = sequence_candidates( table, column, minimum_length=MINIMUM_CANDIDATE_LENGTH) candidates = list(candidates) logger.info('Collected %s unique %s sequences', len(candidates), args.gene) # Add whitelisted sequences if database: whitelist = make_whitelist(table, database, args.gene, args.allele_ratio) missing_whitelisted = set(whitelist) - set(c.sequence for c in candidates) for sequence in missing_whitelisted: candidates.append(Candidate(None, sequence)) logger.info('Added %d whitelisted sequence%s', len(missing_whitelisted), 's' if len(missing_whitelisted) != 1 else '') candidates = list(discard_substring_occurrences(candidates)) logger.info('Removing candidate sequences that occur within others results in %s candidates', len(candidates)) candidates = [candidate for candidate in candidates if 'N' not in candidate.sequence] logger.info('Removing candidates containing "N" results in %s candidates', len(candidates)) if args.merge: logger.info('Merging overlapping sequences ...') # Merge candidate sequences that overlap. If one candidate is longer than # another, this is typically a sign that IgBLAST has not extended the # alignment long enough. merger = OverlappingSequenceMerger() for candidate in candidates: merger.add(candidate) logger.info('After merging overlapping %s sequences, %s remain', args.gene, len(merger)) candidates = list(merger) del table logger.info('%d candidates', len(candidates)) # Assign names etc. if database: for candidate in candidates: distances = [(edit_distance(db.sequence, candidate.sequence), db) for db in database] candidate.db_distance, closest = min(distances, key=lambda x: x[0]) candidate.db_name = closest.name if candidate.db_distance == 0: candidate.name = closest.name else: # Exact db sequence not found, is there one that contains # this candidate as a substring? 
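    # Illustrative example (hypothetical sequences): if the database D gene
    # is 'GGTATAGTGGGAGCTACTAC' and the candidate core is 'TATAGTGGGAGCTAC',
    # the candidate starts at index 2, so prefix='GG', suffix='TAC', and the
    # 'missing' column becomes 'GG...TAC'.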
for db_record in database: index = db_record.sequence.find(candidate.sequence) if index == -1: continue if args.gene == 'D': start = db_record.sequence.find(candidate.sequence) prefix = db_record.sequence[:start] suffix = db_record.sequence[start + len(candidate.sequence):] candidate.missing = '{}...{}'.format(prefix, suffix) else: # Replace this record with the full-length version candidate.sequence = db_record.sequence candidate.db_distance = 0 candidate.name = db_record.name break else: candidate.name = unique_name(closest.name, candidate.sequence) else: for candidate in candidates: candidate.name = unique_name(args.gene, candidate.sequence) logger.info('Counting occurrences ...') if args.gene == 'D': search_columns = ['VD_junction', 'D_region', 'DJ_junction'] elif args.gene == 'J': search_columns = ['DJ_junction', 'J_nt'] else: search_columns = ['genomic_sequence'] candidates = count_occurrences(candidates, args.table, search_columns, other_gene, other_errors, args.merge, args.perfect_matches) # Filter by allele ratio if args.allele_ratio or args.cross_mapping_ratio: arm = AlleleRatioMerger(args.allele_ratio, args.cross_mapping_ratio) arm.extend(candidates) candidates = list(arm) logger.info('After filtering by allele ratio and/or cross-mapping ratio, %d candidates remain', len(candidates)) candidates = sorted(candidates, key=lambda c: c.name) candidates = [c for c in candidates if c.exact_occ >= args.min_count or c.db_distance == 0] print_table(candidates, other_gene, missing=args.gene == 'D') if args.fasta: with open(args.fasta, 'w') as f: for candidate in sorted(candidates, key=lambda r: r.name): print('>{}\n{}'.format(candidate.name, candidate.sequence), file=f) logger.info('Wrote %d genes', len(candidates)) IgDiscover-0.11/igdiscover/empty.aux000066400000000000000000000000011337725263500175100ustar00rootroot00000000000000 IgDiscover-0.11/igdiscover/errorplot.py000066400000000000000000000073221337725263500202520ustar00rootroot00000000000000""" Plot histograms of differences to reference V gene For each gene, a histogram is plotted that shows how often a sequence was assigned to that gene at a certain percentage difference. """ import sys import logging import numpy as np import seaborn as sns from matplotlib.backends.backend_pdf import FigureCanvasPdf, PdfPages from matplotlib.figure import Figure from .table import read_table logger = logging.getLogger(__name__) def add_arguments(parser): arg = parser.add_argument arg('--minimum-group-size', '-m', metavar='N', default=None, type=int, help="Plot only genes with at least N assigned sequences. " "Default: 0.1%% of assigned sequences or 100, whichever is smaller.") arg('--max-j-shm', metavar='VALUE', type=float, default=None, help='Use only rows with J%%SHM >= VALUE') arg('--multi', metavar='PDF', default=None, help='Plot individual error frequency histograms (for each V gene) to this PDF file') arg('--boxplot', metavar='PDF', default=None, help='Plot a single page with box(en)plots of V SHM for multiple genes') arg('table', metavar='FILTERED.TAB.GZ', help='Table with parsed IgBLAST results') def plot_difference_histogram(group, gene_name, bins=np.arange(20.1)): """ Plot a histogram of percentage differences for a specific gene. 
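    The default ``bins=np.arange(20.1)`` yields 21 bin edges and therefore
    20 unit-width bins covering differences from 0% to 20%:

    >>> import numpy as np
    >>> np.arange(20.1)[:3]
    array([0., 1., 2.])
    >>> len(np.arange(20.1))
    21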
""" exact_matches = group[group.V_SHM == 0] cdr3s_exact = len(set(s for s in exact_matches.CDR3_nt if s)) js_exact = len(set(exact_matches.J_gene)) fig = Figure(figsize=(100/25.4, 60/25.4)) ax = fig.gca() ax.set_xlabel('Percentage difference') ax.set_ylabel('Frequency') fig.suptitle('Gene ' + gene_name, y=1.08, fontsize=16) ax.set_title('{:,} sequences assigned'.format(len(group))) ax.text(0.25, 0.95, '{:,} ({:.1%}) exact matches\n {} unique CDR3\n {} unique J'.format( len(exact_matches), len(exact_matches) / len(group), cdr3s_exact, js_exact), transform=ax.transAxes, fontsize=10, bbox=dict(boxstyle='round', facecolor='white', alpha=0.5), horizontalalignment='left', verticalalignment='top') _ = ax.hist(list(group.V_SHM), bins=bins) return fig def main(args): table = read_table(args.table, usecols=['V_gene', 'J_gene', 'V_SHM', 'J_SHM', 'CDR3_nt']) if not args.multi and not args.boxplot: print('Don’t know what to do', file=sys.stderr) sys.exit(2) # Discard rows with any mutation within J at all logger.info('%s rows read', len(table)) if args.max_j_shm is not None: # Discard rows with too many J mutations table = table[table.J_SHM <= args.max_j_shm][:] logger.info('%s rows remain after keeping only those with J%%SHM <= %s', len(table), args.max_j_shm) if args.minimum_group_size is None: total = len(table) minimum_group_size = min(total // 1000, 100) logger.info('Skipping genes with less than %s assignments', minimum_group_size) else: minimum_group_size = args.minimum_group_size # Genes with high enough assignment count all_genes = table['V_gene'].unique() genes = sorted(table['V_gene'].value_counts().loc[lambda x: x >= minimum_group_size].index) gene_set = set(genes) logger.info('%s out of %s genes have enough assignments', len(genes), len(all_genes)) if args.multi: with PdfPages(args.multi) as pages: for gene, group in table.groupby('V_gene'): if gene not in gene_set: continue fig = plot_difference_histogram(group, gene) FigureCanvasPdf(fig).print_figure(pages, bbox_inches='tight') logger.info('Wrote %r', args.multi) if args.boxplot: aspect = 1 + len(genes) / 32 g = sns.catplot(x='V_gene', y='V_SHM', kind='boxen', order=genes, data=table, height=2 * 2.54, aspect=aspect, color='g') # g.set(ylim=(-.1, None)) g.set(ylabel='% V SHM (nt)') g.set(xlabel='V gene') g.set_xticklabels(rotation=90) g.savefig(args.boxplot) logger.info('Wrote %r', args.boxplot) IgDiscover-0.11/igdiscover/filter.py000066400000000000000000000102101337725263500174750ustar00rootroot00000000000000""" Filter table with parsed IgBLAST results Discard the following rows in the table: - no J assigned - stop codon found - V gene coverage less than 90% - J gene coverage less than 60% - V gene E-value greater than 1E-3 The filtered table is printed to standard output. """ import logging import json import pandas as pd from .table import fix_columns logger = logging.getLogger(__name__) def add_arguments(parser): arg = parser.add_argument arg('--v-coverage', type=float, default=90, metavar='PERCENT', help='Require that the sequence covers at least PERCENT of the V gene. ' 'Default: %(default)s') arg('--v-evalue', type=float, default=1E-3, metavar='EVALUE', help='Require that the E-value for the V gene match is at most EVALUE. ' 'Default: %(default)s') arg('--j-coverage', type=float, default=60, metavar='PERCENT', help='Require that the sequence covers at least PERCENT of the J gene. 
' 'Default: %(default)s') arg('--json', metavar='FILE', help='Write statistics to FILE') arg('table', help='Table with filtered IgBLAST results.') class FilteringStatistics: __slots__ = ('total', 'has_vj_assignment', 'has_no_stop', 'good_v_evalue', 'good_v_coverage', 'good_j_coverage', 'has_cdr3') def __init__(self): self.total = 0 self.has_vj_assignment = 0 self.has_no_stop = 0 self.good_v_evalue = 0 self.good_v_coverage = 0 self.good_j_coverage = 0 self.has_cdr3 = 0 def __iadd__(self, other): for att in self.__slots__: v = getattr(self, att) setattr(self, att, v + getattr(other, att)) return self def asdict(self): d = dict() for att in self.__slots__: d[att] = getattr(self, att) return d def filtered_table(table, v_gene_coverage, # at least j_gene_coverage, # at least v_gene_evalue, # at most ): """ Discard the following rows in the table: - no J assigned - stop codon found - V gene coverage less than v_gene_coverage - J gene coverage less than j_gene_coverage - V gene E-value greater than v_gene_evalue Return the filtered table. """ stats = FilteringStatistics() stats.total = len(table) # Both V and J must be assigned # (Note V_gene and J_gene columns use empty strings instead of NA) filtered = table[(table['V_gene'] != '') & (table['J_gene'] != '')][:] stats.has_vj_assignment = len(filtered) filtered['V_gene'] = pd.Categorical(filtered['V_gene']) # Filter out sequences that have a stop codon filtered = filtered[filtered.stop == 'no'] stats.has_no_stop = len(filtered) # Filter out sequences with a too low V gene hit E-value filtered = filtered[filtered.V_evalue <= v_gene_evalue] stats.good_v_evalue = len(filtered) # Filter out sequences with too low V gene coverage filtered = filtered[filtered.V_covered >= v_gene_coverage] stats.good_v_coverage = len(filtered) # Filter out sequences with too low J gene coverage filtered = filtered[filtered.J_covered >= j_gene_coverage] stats.good_j_coverage = len(filtered) stats.has_cdr3 = sum(filtered['CDR3_nt'] != '') return filtered, stats def main(args): first = True written = 0 stats = FilteringStatistics() for chunk in pd.read_csv(args.table, chunksize=10000, float_precision='high', sep='\t'): fix_columns(chunk) filtered, chunk_stats = filtered_table(chunk, v_gene_coverage=args.v_coverage, j_gene_coverage=args.j_coverage, v_gene_evalue=args.v_evalue) stats += chunk_stats print(filtered.to_csv(sep='\t', index=False, header=first), end='') first = False written += len(filtered) logger.info('%s rows in input table', stats.total) logger.info('%s rows have both V and J assignment', stats.has_vj_assignment) logger.info('%s of those do not have a stop codon', stats.has_no_stop) logger.info('%s of those have an E-value of at most %s', stats.good_v_evalue, args.v_evalue) logger.info('%s of those cover the V gene by at least %s%%', stats.good_v_coverage, args.v_coverage) logger.info('%s of those cover the J gene by at least %s%%', stats.good_j_coverage, args.j_coverage) logger.info('%d rows written', written) logger.info('%s rows have a recognized CDR3 (these are not filtered)', stats.has_cdr3) if args.json: with open(args.json, 'w') as f: json.dump(stats.asdict(), f, indent=2) print(file=f) IgDiscover-0.11/igdiscover/germlinefilter.py000066400000000000000000000447441337725263500212430ustar00rootroot00000000000000""" Filter V gene candidates (germline and pre-germline filter) After candidates for novel V genes have been found with the 'discover' subcommand, this script is used to filter the candidates and make sure that only true germline genes remain 
("germline filter" and "pre-germline filter"). The following filtering and processing steps are performed on each candidate separately: * Discard sequences with N bases * Discard sequences that come from a consensus over too few source sequences (unless whitelisted) * Discard sequences with too few unique CDR3s (CDR3_clusters column) * Discard sequences with too few unique Js (Js_exact column) * Discard sequences identical to one of the database sequences (if DB given) * Discard sequences that contain a stop codon (has_stop column) (unless whitelisted) The following criteria involve comparison of candidates against each other: * Discard sequences that are too similar to another (unless whitelisted) * Discard sequences that are cross-mapping artifacts * Discard sequences that have a too low allele ratio * Discard sequences with too few unique Ds relative to other alleles (Ds_exact column) If you provide a whitelist of sequences, then the candidates that appear on it * are not checked for the cluster size criterion, * are never considered near-duplicates, * are allowed to contain a stop codon. The filtered table is written to standard output. """ import sys import logging from collections import namedtuple import pandas as pd from sqt import FastaReader from sqt.align import edit_distance from .utils import UniqueNamer, is_same_gene, ChimeraFinder logger = logging.getLogger(__name__) def add_arguments(parser): arg = parser.add_argument arg('--cluster-size', type=int, metavar='N', default=0, help='Consensus must represent at least N sequences. ' 'Default: %(default)s') arg('--cross-mapping-ratio', type=float, metavar='RATIO', default=0.02, help='Ratio for detection of cross-mapping artifacts. Default: %(default)s') arg('--clonotype-ratio', '--allele-ratio', type=float, metavar='RATIO', default=0.1, help='Required ratio of "clonotypes" counts between alleles. ' 'Works only for genes named "NAME*ALLELE". Default: %(default)s') arg('--exact-ratio', type=float, metavar='RATIO', default=0.1, help='Required ratio of "exact" counts between alleles. ' 'Works only for genes named "NAME*ALLELE". Default: %(default)s') arg('--cdr3-shared-ratio', type=float, metavar='RATIO', default=1.0, help='Maximum allowed CDR3_shared_ratio. Default: %(default)s') arg('--minimum-db-diff', '-b', type=int, metavar='N', default=0, help='Sequences must have at least N differences to the database ' 'sequence. Default: %(default)s') arg('--maximum-N', '-N', type=int, metavar='COUNT', default=0, help='Sequences must have at most COUNT "N" bases. Default: %(default)s') arg('--unique-CDR3', '--CDR3s', type=int, metavar='N', default=1, help='Sequences must have at least N unique CDR3s within exact sequence matches. ' 'Default: %(default)s') # The default for unique-J is 0 because we might work on data without # any assigned J genes. arg('--unique-J', type=int, metavar='N', default=0, help='Sequences must have at least N unique Js within exact sequence matches. ' 'Default: %(default)s') arg('--unique-D-ratio', type=float, metavar='RATIO', default=None, help='Discard a sequence if another allele of this gene exists ' 'such that the ratio between their Ds_exact is less than RATIO') arg('--unique-D-threshold', type=int, metavar='THRESHOLD', default=10, help='Apply the --unique-D-ratio filter only if the Ds_exact of the other ' 'allele is at least THRESHOLD') arg('--allow-stop', action='store_true', default=False, help='Allow stop codons in sequences (uses the has_stop column).' 
'Default: Do not allow stop codons.') # arg('--allow-chimeras', action='store_true', default=False, # help='Do not filter out chimeric sequences. Default: Discard chimeras') arg('--whitelist', metavar='FASTA', default=[], action='append', help='Sequences that are never discarded or merged with others, ' 'even if criteria for discarding them would apply (except cross-mapping artifact ' 'removal, which is always performed).') arg('--fasta', metavar='FILE', help='Write new database in FASTA format to FILE') arg('--annotate', metavar='FILE', help='Write candidates.tab with filter annotations to FILE') arg('tables', metavar='CANDIDATES.TAB', help='Tables (one or more) created by the "discover" command', nargs='+') Candidate = namedtuple('Candidate', ['sequence', 'name', 'clonotypes', 'exact', 'Ds_exact', 'cluster_size', 'whitelisted', 'is_database', 'cluster_size_is_accurate', 'CDR3_start', 'row']) class CrossMappingFilter: """ Filters a sequence if it is a cross-mapping artifact """ def __init__(self, cross_mapping_ratio: float): """Check for cross-mapping""" self._ratio = cross_mapping_ratio def should_discard(self, a: Candidate, b: Candidate, dist: int, *args): """ Compare two candidates and decide which one should be filtered, if any. :param a: First candidate :param b: Second candidate :param dist: Edit distance between candidates :return: None if both candidates should be kept, a pair (x, reason) otherwise, where x is the candidate to be discarded (x is either a or b) and reason is a string describing why. """ if dist > 1: # Cross-mapping is unlikely if the edit distance is larger than 1 return None if not a.is_database or not b.is_database: # Cross-mapping can only occur if both sequences are in the database return None total_count = (a.cluster_size + b.cluster_size) for u, v in [(a, b), (b, a)]: ratio = u.cluster_size / total_count if u.cluster_size_is_accurate and ratio < self._ratio: # u is probably a cross-mapping artifact of the higher-expressed v return u, f'xmap_of={v.name},xmap_ratio={ratio:.4f}' return None class TooSimilarSequenceFilter: """ Filter out sequences that are too similar to another one """ @staticmethod def should_discard(a: Candidate, b: Candidate, dist: int, *args): if dist > 0: # The sequences are not similar, so keep both return None if a.whitelisted and b.whitelisted: # Both are whitelisted, so keep them return None if a.whitelisted: return b, f'identical_to={a.name},other_whitelisted' if b.whitelisted: return a, f'identical_to={b.name},other_whitelisted' # No sequence is whitelisted if b.clonotypes < a.clonotypes: return b, f'identical_to={a.name},fewer_clonotypes' # FIXME this is missing: # if a.clonotypes < b.clonotypes: ... if len(a.sequence) < len(b.sequence): return a, f'identical_to={b.name},shorter' return b, f'identical_to={a.name}' class SameGeneFilter: def should_discard(self, a: Candidate, b: Candidate, dist: int, same_gene: bool): """ Compare two candidates and decide which one should be filtered, if any. :param a: First candidate :param b: Second candidate :param dist: Edit distance between candidates :return: None if both candidates should be kept, a pair (x, reason) otherwise, where x is the candidate to be discarded (x is either a or b) and reason is a string describing why. 
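        Worked example (illustrative): for two alleles of the same gene with
        ``clonotypes`` counts 15 and 200, ``ClonotypeAlleleRatioFilter`` with
        the default ratio 0.1 computes 15/200 = 0.075 < 0.1 and discards the
        lower-expressed allele.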
""" if not same_gene: # This filter applies only when comparing alleles of the same gene return None for u, v in [(a, b), (b, a)]: result = self.decide(a, b) if result is not None: return result return None def decide(self, a: Candidate, b: Candidate): raise NotImplementedError class ClonotypeAlleleRatioFilter(SameGeneFilter): """ Clonotype allele ratio filter. Somewhat similar to cross-mapping, but this uses sequence names to decide whether two genes can be alleles of each other and the ratio is between the CDR3_clusters values """ def __init__(self, clonotype_ratio: float): self._ratio = clonotype_ratio def decide(self, u: Candidate, v: Candidate): if v.clonotypes == 0: return None ratio = u.clonotypes / v.clonotypes if ratio < self._ratio: # Clonotype allele ratio too low return u, f'clonotype_ratio={ratio:.4f},other={v.name}' return None class ExactRatioFilter(SameGeneFilter): """Exact V sequence occurrence allele ratio""" def __init__(self, exact_ratio: float): self._ratio = exact_ratio def decide(self, u: Candidate, v: Candidate): if v.exact == 0: return None ratio = u.exact / v.exact if ratio < self._ratio: # Allele ratio of exact occurrences too low return u, f'ex_occ_ratio={ratio:.1f},other={v.name}' return None class UniqueDRatioFilter(SameGeneFilter): def __init__(self, unique_d_ratio: float, unique_d_threshold: int): self._unique_d_ratio = unique_d_ratio self._unique_d_threshold = unique_d_threshold def decide(self, u: Candidate, v: Candidate): if v.cluster_size < u.cluster_size or v.Ds_exact < self._unique_d_threshold: # TODO comment return None ratio = u.Ds_exact / v.Ds_exact if ratio < self._unique_d_ratio: # Ds_exact ratio too low return u, f'Ds_exact_ratio={ratio:.1f},other={v.name}' return None class CandidateFilterer: """ Merge sequences that are sufficiently similar into single entries. """ def __init__(self, filters): self._items = [] self._filters = filters def add(self, item): # This method could possibly be made simpler if the graph structure # was made explicit. items = [] for existing_item in self._items: m = self.merged(existing_item, item) if m is None: items.append(existing_item) else: item = m items.append(item) self._items = items def extend(self, iterable): for i in iterable: self.add(i) def __iter__(self): if self._items and hasattr(self._items, 'name'): yield from sorted(self._items, key=lambda x: x.name) else: yield from self._items def __len__(self): return len(self._items) def merged(self, s: Candidate, t: Candidate): """ Given two candidates, decide whether to discard one of them and which one. Return None if both candidates should be kept. Return the candidate to keep otherwise. 
""" if len(s.sequence) > len(t.sequence): s, t = t, s # make s always the shorter sequence # When computing edit distance between the two sequences, ignore the # bases in the 3' end that correspond to the CDR3 s_no_cdr3 = s.sequence[:s.CDR3_start] t_no_cdr3 = t.sequence[:t.CDR3_start] if len(s_no_cdr3) != len(t_no_cdr3): t_prefix = t_no_cdr3[:len(s_no_cdr3)] t_suffix = t_no_cdr3[-len(s_no_cdr3):] dist_prefix = edit_distance(s_no_cdr3, t_prefix, 1) dist_suffix = edit_distance(s_no_cdr3, t_suffix, 1) dist_no_cdr3 = min(dist_prefix, dist_suffix) else: dist_no_cdr3 = edit_distance(s_no_cdr3, t_no_cdr3, 1) same_gene = is_same_gene(s.name, t.name) for filter_ in self._filters: result = filter_.should_discard(s, t, dist_no_cdr3, same_gene) if result is None: # not filtered continue if result[0] is s: return t else: return s # None of the filters decided to discard one of the sequences, # so keep both return None class Whitelist: def __init__(self): self._sequences = dict() def add_fasta(self, path): with FastaReader(path) as fr: for record in fr: self._sequences[record.sequence.upper()] = record.name def closest(self, sequence): """ Search for the whitelist sequence that is closest to the given sequence. Return tuple (distance, name). """ if sequence in self._sequences: return 0, self._sequences[sequence] mindist = len(sequence) distances = [] for seq, name in self._sequences.items(): ed = edit_distance(seq, sequence, maxdiff=mindist) distances.append((ed, name)) if ed == 1: # We know ed does not get smaller because the # 'sequence in whitelist' check # above covers that return ed, name mindist = min(mindist, ed) distance, name = min(distances) return distance, name def __len__(self): return len(self._sequences) def __contains__(self, other): return other in self._sequences def is_chimera(table, whitelist): result = pd.Series('', index=table.index, dtype=object) if whitelist: whitelisted = table[table['whitelist_diff'] == 0] else: whitelisted = table[table['database_diff'] == 0] chimera_finder = ChimeraFinder(list(whitelisted['consensus'])) for row in table[(table['whitelist_diff'] != 0) & (table['database_diff'] != 0)].itertuples(): query = row.consensus chimera_result = chimera_finder.find_exact(query) if chimera_result: (prefix_length, prefix_indices, suffix_indices) = chimera_result suffix_length = len(query) - prefix_length prefix_name = whitelisted.iloc[prefix_indices[0]]['name'] suffix_row = whitelisted.iloc[suffix_indices[0]] suffix_name = suffix_row['name'] suffix_sequence = suffix_row['consensus'] # logger.info('Candidate %s (diffs.: %d) appears to be a chimera of ' # '%s:1..%d and %s:%d..%d', row.name, row.whitelist_diff, prefix_name, prefix_length, # suffix_name, len(suffix_sequence) - suffix_length + 1, len(suffix_sequence)) result.loc[row.Index] = '{}:1..{}+{}:{}..{}'.format(prefix_name, prefix_length, suffix_name, len(suffix_sequence) - suffix_length + 1, len(suffix_sequence)) return result def mark_rows(table, condition, reason): table.loc[condition, 'why_filtered'] += reason + ';' table.loc[condition, 'is_filtered'] += 1 logger.info('Marked %s candidates as %r', sum(condition), reason) def main(args): if args.unique_D_threshold <= 1: sys.exit('--unique-D-threshold must be at least 1') filters = [] if args.cross_mapping_ratio: filters.append(CrossMappingFilter(args.cross_mapping_ratio)) if args.clonotype_ratio: filters.append(ClonotypeAlleleRatioFilter(args.clonotype_ratio)) if args.exact_ratio: filters.append(ExactRatioFilter(args.exact_ratio)) if args.unique_D_ratio or 
args.unique_D_threshold:
        filters.append(UniqueDRatioFilter(args.unique_D_ratio, args.unique_D_threshold))
    filters.append(TooSimilarSequenceFilter())
    merger = CandidateFilterer(filters)
    whitelist = Whitelist()
    for path in args.whitelist:
        whitelist.add_fasta(path)
    logger.info('%d unique sequences in whitelist', len(whitelist))

    # Read in tables
    total = 0
    overall_table = None
    for path in args.tables:
        table = pd.read_csv(path, sep='\t')
        i = list(table.columns).index('consensus')
        # whitelist_diff distinguishes between 0 and !=0 only
        # at this point. Accurate edit distances are computed later.
        whitelist_diff = [(0 if s in whitelist else -1) for s in table['consensus']]
        # TODO rename to is_whitelisted
        table.insert(i, 'whitelist_diff', pd.Series(whitelist_diff, index=table.index, dtype=int))
        table.insert(i + 1, 'closest_whitelist', pd.Series('', index=table.index))
        table.insert(3, 'why_filtered', pd.Series('', index=table.index))
        table.insert(3, 'is_filtered', pd.Series(0, index=table.index))
        mark_rows(table, table.database_diff < args.minimum_db_diff, 'too_low_dbdiff')
        if 'N_bases' in table.columns:
            mark_rows(table, table.N_bases > args.maximum_N, 'too_many_N_bases')
        mark_rows(table, table.CDR3s_exact < args.unique_CDR3, 'too_low_CDR3s_exact')
        mark_rows(table, table.CDR3_shared_ratio > args.cdr3_shared_ratio, 'too_high_CDR3_shared_ratio')
        mark_rows(table, table.Js_exact < args.unique_J, 'too_low_Js_exact')
        if not args.allow_stop:
            mark_rows(table, (table.has_stop != 0) & (table.whitelist_diff != 0), 'has_stop')
        mark_rows(table, (table.cluster_size < args.cluster_size) & (table.whitelist_diff != 0), 'too_low_cluster_size')
        table['database_changes'].fillna('', inplace=True)
        logger.info('Table read from %r contains %s candidate V gene sequences. '
            '%s remain after per-entry filtering', path, len(table), sum(table.is_filtered == 0))
        if args.whitelist:
            logger.info('Of those, %d are protected by the whitelist', sum(table.whitelist_diff == 0))
        total += len(table)
        if overall_table is None:
            overall_table = table
        else:
            # DataFrame.append returns a new DataFrame; the result must be
            # assigned, otherwise the appended table is silently discarded
            overall_table = overall_table.append(table)
        del table
    if len(args.tables) > 1:
        logger.info('Read %s tables with %s entries total. '
            'After per-entry filtering, %s entries remain.',
            len(args.tables), len(overall_table), sum(overall_table.is_filtered == 0))

    def cluster_size_is_accurate(row):
        return bool(set(row.cluster.split(';')) & {'all', 'db'})

    for _, row in overall_table.iterrows():
        if row['is_filtered'] > 0:
            continue
        merger.add(Candidate(
            sequence=row['consensus'],
            name=row['name'],
            clonotypes=row['clonotypes'],
            exact=row['exact'],
            Ds_exact=row['Ds_exact'],
            cluster_size=row['cluster_size'],
            whitelisted=row['whitelist_diff'] == 0,
            is_database=row['database_diff'] == 0,
            cluster_size_is_accurate=cluster_size_is_accurate(row),
            CDR3_start=row.get('CDR3_start', 10000),  # TODO backwards compatibility
            row=row.name,  # row.name is the index of the row. It is not row['name'].
)) # Discard near-duplicates overall_table.loc[overall_table.is_filtered > 0, 'is_duplicate'] = False overall_table.loc[overall_table.is_filtered == 0, 'is_duplicate'] = True for info in merger: overall_table.loc[info.row, 'is_duplicate'] = False mark_rows(overall_table, overall_table.is_duplicate, 'is_duplicate') del overall_table['is_duplicate'] # Name sequences overall_table['name'] = overall_table['name'].apply(UniqueNamer()) overall_table.sort_values(['name'], inplace=True) # Because whitelist_dist() is expensive, this is run when # all of the filtering has already been done if whitelist: for row in overall_table.itertuples(): # TODO skipping this is just a performance optimization if row.is_filtered > 0: continue distance, name = whitelist.closest(overall_table.loc[row[0], 'consensus']) overall_table.loc[row[0], 'closest_whitelist'] = name overall_table.loc[row[0], 'whitelist_diff'] = distance else: overall_table.whitelist_diff.replace(-1, '', inplace=True) i = list(overall_table.columns).index('database_diff') # TODO # chimeras are computed for the full table, not the filtered one. what should be done? overall_table.insert(i, 'chimera', is_chimera(overall_table, whitelist)) # Discard chimeric sequences # if not args.allow_chimeras: # overall_table = overall_table[~is_chimera(overall_table, whitelist)].copy() filtered_table = overall_table[overall_table.is_filtered == 0] del filtered_table['is_filtered'] del filtered_table['why_filtered'] print(filtered_table.to_csv(sep='\t', index=False, float_format='%.2f'), end='') if args.annotate: with open(args.annotate, 'w') as f: print(overall_table.to_csv(sep='\t', index=False, float_format='%.2f'), end='', file=f) if args.fasta: with open(args.fasta, 'w') as f: for _, row in filtered_table.iterrows(): print('>{}\n{}'.format(row['name'], row['consensus']), file=f) logger.info('%d sequences in new database', len(filtered_table)) IgDiscover-0.11/igdiscover/group.py000066400000000000000000000273711337725263500173640ustar00rootroot00000000000000""" Group sequences that share a barcode (molecular identifier, MID) Since the same barcode can sometimes be used by different sequences, the CDR3 sequence can further be used to distinguish sequences. You can choose between using either a 'pseudo CDR3' sequence, which encompasses by default bases 80 to 61 counted from the 3' end. Or you can use the real CDR3 detected with a regular expression. If grouping by CDR3s is enabled, sequences with identical barcode and CDR3 must additionally have a similar length. If the length differs by more than 2 bp, they are put into different groups. The barcode can be in the 5' end or the 3' end of the sequence. Use --trim-g to remove initial runs of G at the 5' end (artifact from RACE protocol). These are removed after the barcode is removed. For all the found groups, one sequence is output to standard output (in FASTA format). Which sequence that is depends on the group size: - If the group consists of a single sequence, that sequence is output - If the group consists of two sequences, one of them is picked randomly - If the group has at least three sequences, a consensus is computed. The consensus is output if it contains no ambiguous bases. Otherwise, also here a random sequence is chosen. """ # NOTES # # - Different lengths of the initial G run cannot be used to distinguish sequences # since they can come from polymerase problems in homopolymers. 
# - There are some indels in homopolymers (probably PCR problem) # - There are also regular sequencing errors in the initial run of G nucleotides. # - Some paired reads aren’t correctly merged into single reads. They end up being # too long. # - When grouping by barcode and pseudo CDR3, sequence lengths vary within groups. # However, this affects only ~1% of sequences, so it is not necessary to compute # a multiple alignment. Just taking the consensus will drown the incorrect # sequences, at least if the group size is large. # - It does not hurt to reduce the minimimum number of sequences per group for # taking a consensus to 2, but it also does not help much (results in 0.5% more # sequences). (The consensus is only successful if both sequences are identical.) # However, since this is also a simple way to deal with exact duplicates, we do # it anyway and can then skip the separate duplicate removal step (VSEARCH). # TODO # - Use pandas.DataFrame import csv import sys import logging from collections import Counter, defaultdict from contextlib import ExitStack from itertools import islice import json from sqt.align import consensus from sqt import SequenceReader from xopen import xopen from .species import find_cdr3 from .cluster import Graph from .utils import slice_arg # minimum number of sequences needed for attempting to compute a consensus MIN_CONSENSUS_SEQUENCES = 3 logger = logging.getLogger(__name__) def add_arguments(parser): arg = parser.add_argument group = parser.add_mutually_exclusive_group() group.add_argument('--real-cdr3', action='store_true', default=False, help='In addition to barcode, group sequences by real CDR3 (detected with regex).') group.add_argument('--pseudo-cdr3', nargs='?', default=None, type=slice_arg, const=slice(-80, -60), metavar='START:END', help='In addition to barcode, group sequences by pseudo CDR3. ' 'If START:END is omitted, use -80:-60.') arg('--groups-output', metavar='FILE', default=None, help='Write tab-separated table with groups to FILE') arg('--plot-sizes', metavar='FILE', default=None, help='Plot group sizes to FILE (.png or .pdf)') arg('--limit', default=None, type=int, metavar='N', help='Limit processing to the first N reads') arg('--trim-g', action='store_true', default=False, help="Trim 'G' nucleotides at 5' end") arg('--minimum-length', '-l', type=int, default=0, help='Minimum sequence length') arg('--barcode-length', '-b', type=int, default=12, help="Length of barcode. Positive for 5' barcode, negative for 3' barcode. Default: %(default)s") arg('--json', metavar="FILE", help="Write statistics to FILE") arg('fastx', metavar='FASTA/FASTQ', help='FASTA or FASTQ file (can be gzip-compressed) with sequences') def hamming_neighbors(s): """Return sequences that are at hamming distance 1 and return also s itself""" for i in range(len(s)): for c in 'ACGT': if s[i] != c: yield s[:i] + c + s[i+1:] yield s def cluster_sequences(records): """ Single-linkage clustering. Two sequences are linked if - their (pseudo-) CDR3 sequences have a hamming distance of at most 1 - and their lengths differs by at most 2. 
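    For example (illustrative; uses ``hamming_neighbors`` defined above,
    which also yields the sequence itself):

    >>> sorted(set(hamming_neighbors('AC')))
    ['AA', 'AC', 'AG', 'AT', 'CC', 'GC', 'TC']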
""" if len(records) == 1: # TODO check if this helps return [records] # Cluster unique CDR3s first cdr3s = set(r.cdr3 for r in records) sorted_cdr3s = sorted(cdr3s) # For reproducibility graph = Graph(sorted_cdr3s) for cdr3 in sorted_cdr3s: for neighbor in hamming_neighbors(cdr3): if neighbor in cdr3s: graph.add_edge(cdr3, neighbor) cdr3_components = graph.connected_components() # Maps CDR3 sequence to list of records of sequence that have that CDR3 cdr3_records = defaultdict(list) for r in records: cdr3_records[r.cdr3].append(r) components = [] for cdr3_component in cdr3_components: component_records = [] for cdr3 in cdr3_component: component_records.extend(cdr3_records[cdr3]) component_records.sort(key=lambda r: len(r.sequence)) component = [] prev_length = None for r in component_records: l = len(r.sequence) if prev_length is not None and l > prev_length + 2: # Start a new component components.append(component) component = [] component.append(r) prev_length = l if component: components.append(component) assert sum(len(component) for component in components) == len(records) assert all(components) # Components must be non-empty return components GROUPS_HEADER = ['barcode', 'cdr3', 'name', 'sequence'] def write_group(csvfile, barcode, sequences, with_cdr3): for sequence in sequences: row = [barcode, sequence.name.split(maxsplit=1)[0], sequence.sequence] if with_cdr3: row[1:1] = [sequence.cdr3] csvfile.writerow(row) csvfile.writerow([]) def collect_barcode_groups( fastx, barcode_length, trim_g, limit, minimum_length, pseudo_cdr3, real_cdr3): """ fastx -- path to FASTA or FASTQ input """ group_by_cdr3 = pseudo_cdr3 or real_cdr3 if group_by_cdr3: cdr3s = set() # Map barcodes to lists of sequences barcodes = defaultdict(list) n = 0 too_short = 0 regex_fail = 0 with SequenceReader(fastx) as f: for record in islice(f, 0, limit): if len(record) < minimum_length: too_short += 1 continue if barcode_length > 0: barcode = record.sequence[:barcode_length] unbarcoded = record[barcode_length:] else: barcode = record.sequence[barcode_length:] unbarcoded = record[:barcode_length] if trim_g: # The RACE protocol leads to a run of non-template Gs in the beginning # of the sequence, after the barcode. 
unbarcoded.sequence = unbarcoded.sequence.lstrip('G') if unbarcoded.qualities: unbarcoded.qualities = unbarcoded.qualities[-len(unbarcoded.sequence):] if real_cdr3: match = find_cdr3(unbarcoded.sequence, chain='VH') if match: cdr3 = unbarcoded.sequence[match[0]:match[1]] else: regex_fail += 1 continue elif pseudo_cdr3: cdr3 = unbarcoded.sequence[pseudo_cdr3] if group_by_cdr3: unbarcoded.cdr3 = cdr3 # TODO slight abuse of Sequence objects cdr3s.add(cdr3) barcodes[barcode].append(unbarcoded) n += 1 logger.info('%s sequences in input', n + too_short + regex_fail) logger.info('%s sequences long enough', n + regex_fail) if real_cdr3: logger.info('Using the real CDR3') logger.info('%s times (%.2f%%), the CDR3 regex matched', n, n / (n + regex_fail) * 100) elif pseudo_cdr3: logger.info('Using the pseudo CDR3') if group_by_cdr3: logger.info('%s unique CDR3s', len(cdr3s)) return barcodes def plot_sizes(sizes, path): from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas from matplotlib.figure import Figure import matplotlib import seaborn as sns sns.set() fig = Figure() matplotlib.rcParams.update({'font.size': 14}) FigureCanvas(fig) ax = fig.add_subplot(111) v, _, _ = ax.hist(sizes, bins=100) ax.set_ylim(0, v[1:].max() * 1.1) ax.set_xlabel('Group size') ax.set_ylabel('Read frequency') ax.set_title('Histogram of group sizes (>1)') ax.grid(axis='x') ax.tick_params(direction='out', top=False, right=False) fig.set_tight_layout(True) fig.savefig(path) logger.info('Plotted group sizes to %r', path) def main(args): if args.barcode_length == 0: sys.exit("Barcode length must be non-zero") group_by_cdr3 = args.pseudo_cdr3 or args.real_cdr3 barcodes = collect_barcode_groups(args.fastx, args.barcode_length, args.trim_g, args.limit, args.minimum_length, args.pseudo_cdr3, args.real_cdr3) logger.info('%s unique barcodes', len(barcodes)) barcode_singletons = sum(1 for seqs in barcodes.values() if len(seqs) == 1) logger.info('%s barcodes used by only a single sequence (singletons)', barcode_singletons) with ExitStack() as stack: if args.groups_output: group_out = csv.writer(stack.enter_context( xopen(args.groups_output, 'w')), delimiter='\t', lineterminator='\n') group_out.writerow(GROUPS_HEADER) else: group_out = None too_few = 0 n_clusters = 0 n_singletons = 0 n_consensus = 0 n_ambiguous = 0 sizes = [] for barcode in sorted(barcodes): sequences = barcodes[barcode] if len(sequences) != len(set(s.name for s in sequences)): logger.error('Duplicate sequence records detected') sys.exit(1) if group_by_cdr3: clusters = cluster_sequences(sequences) # it’s a list of lists else: # TODO it would be useful to do the clustering by length that cluster_sequences() does clusters = [sequences] n_clusters += len(clusters) for cluster in clusters: sizes.append(len(cluster)) if group_out: write_group(group_out, barcode, cluster, with_cdr3=group_by_cdr3) if len(cluster) == 1: n_singletons += 1 if len(cluster) < MIN_CONSENSUS_SEQUENCES: too_few += 1 sequence = cluster[0].sequence name = cluster[0].name if group_by_cdr3: cdr3 = cluster[0].cdr3 else: cons = consensus({s.name: s.sequence for s in cluster}, threshold=0.501) if 'N' in cons: # Pick the first sequence as the output sequence sequence = cluster[0].sequence name = cluster[0].name if group_by_cdr3: cdr3 = cluster[0].cdr3 n_ambiguous += 1 else: sequence = cons n_consensus += 1 if group_by_cdr3: cdr3 = Counter(cl.cdr3 for cl in cluster).most_common(1)[0][0] name = 'consensus{}'.format(n_consensus) name = name.split(maxsplit=1)[0] if name.endswith(';'): name = 
name[:-1] if group_by_cdr3: print('>{};barcode={};cdr3={};size={};\n{}'.format(name, barcode, cdr3, len(cluster), sequence)) else: print('>{};barcode={};size={};\n{}'.format(name, barcode, len(cluster), sequence)) logger.info('%d clusters (%d singletons)', n_clusters, n_singletons) logger.info('%d consensus sequences computed (from groups that had at least %d sequences)', n_consensus + n_ambiguous, MIN_CONSENSUS_SEQUENCES) logger.info('%d of those had no ambiguous bases', n_consensus) if args.groups_output: logger.info('Groups written to %r', args.groups_output) assert sum(sizes) == sum(len(v) for v in barcodes.values()) if args.json: sizes_counter = Counter(sizes) stats = { 'unique_barcodes': len(barcodes), 'barcode_singletons': barcode_singletons, 'groups_written': n_clusters, 'group_size_1': sizes_counter[1], 'group_size_2': sizes_counter[2], 'group_size_3plus': sum(v for k, v in sizes_counter.items() if k >= 3), } with open(args.json, 'w') as f: json.dump(stats, f, indent=2) print(file=f) if args.plot_sizes: plot_sizes(sizes, args.plot_sizes) IgDiscover-0.11/igdiscover/haplotype.py000066400000000000000000000357671337725263500202450ustar00rootroot00000000000000""" Determine haplotypes based on co-occurrences of alleles """ import sys import logging from typing import List, Tuple, Iterator from itertools import product import pandas as pd from argparse import ArgumentParser from sqt import SequenceReader from .table import read_table logger = logging.getLogger(__name__) # The second-most expressed allele of a gene must be expressed at at least this # fraction of the highest-expressed allele in order for the gene to be considered # heterozygous. HETEROZYGOUS_THRESHOLD = 0.1 # EXPRESSED_RATIO = 0.1 def add_arguments(parser: ArgumentParser): arg = parser.add_argument arg('--v-gene', help='V gene to use for haplotyping J. Default: Auto-detected') arg('--d-evalue', type=float, default=1E-4, help='Maximal allowed E-value for D gene match. Default: %(default)s') arg('--d-coverage', '--D-coverage', type=float, default=65, help='Minimum D coverage (in percent). Default: %(default)s%%)') arg('--restrict', metavar='FASTA', help='Restrict analysis to the genes named in the FASTA file. ' 'Only the sequence names are used!') arg('--order', metavar='FASTA', default=None, help='Sort the output according to the order of the records in ' 'the given FASTA file.') arg('--plot', metavar='FILE', default=None, help='Write a haplotype plot to FILE') arg('--structure-plot', metavar='FILE', default=None, help='Write a haplotype structure plot (counts binarized 0 and 1) to FILE') arg('table', help='Table with parsed and filtered IgBLAST results') def expression_counts(table: pd.DataFrame, gene_type: str) -> Iterator[pd.DataFrame]: """ Yield DataFrames for each gene with gene and allele as the row index and columns 'name' and 'count'. For example, when 'name' is VH1-1*01, gene would be 'VH1-1' and allele would be '01'. 
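    The gene/allele split uses ``str.partition``; for example (illustrative):

    >>> names, _, alleles = zip(*[s.partition('*') for s in ['IGHV1-2*02', 'IGHV1-2*04']])
    >>> names
    ('IGHV1-2', 'IGHV1-2')
    >>> alleles
    ('02', '04')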
""" counts = table.groupby(gene_type + '_gene').size() names, _, alleles = zip(*[s.partition('*') for s in counts.index]) expressions = pd.DataFrame( {'gene': names, 'allele': alleles, 'count': counts, 'name': counts.index}, columns=['gene', 'allele', 'name', 'count']).set_index(['gene', 'allele']) del alleles # Example expressions at this point: # # name count # gene allele # IGHV1-18 01 IGHV1-18*01 166 # 03 IGHV1-18*03 1 # IGHV1-2 02 IGHV1-2*02 85 # 04 IGHV1-2*04 16 # IGHV1-24 01 IGHV1-24*01 5 logger.info('Heterozygous %s genes:', gene_type) for _, alleles in expressions.groupby(level='gene'): # Remove alleles that have too low expression relative to the highest-expressed allele max_expression = alleles['count'].max() alleles = alleles[alleles['count'] >= HETEROZYGOUS_THRESHOLD * max_expression] if len(alleles) >= 2: logger.info('%s with alleles %s -- Counts: %s', alleles.index[0][0], ', '.join(alleles['name']), ', '.join(str(x) for x in alleles['count']) ) yield alleles class HeterozygousGene: def __init__(self, name: str, alleles: List[str]): """ name -- name of this gene, such as 'VH4-4' alleles -- list of its alleles, such as ['VH4-4*01', 'VH4-4*02'] """ self.name = name self.alleles = alleles def compute_coexpressions(table: pd.DataFrame, gene_type1: str, gene_type2: str): assert gene_type1 != gene_type2 coexpressions = table.groupby( (gene_type1 + '_gene', gene_type2 + '_gene')).size().to_frame() coexpressions.columns = ['count'] return coexpressions def cooccurrences(coexpressions, het_alleles: Tuple[str, str], target_groups): """ het_alleles -- a pair of alleles of a heterozygous gene, such as ('IGHJ6*02', 'IGHJ6*03'). """ assert len(het_alleles) == 2 haplotype = [] for target_alleles in target_groups: is_expressed_list = [] names = [] counts = [] for target_allele, _ in target_alleles.itertuples(index=False): ex = [] for het_allele in het_alleles: try: e = coexpressions.loc[(het_allele, target_allele), 'count'] except KeyError: e = 0 ex.append(e) ex_total = sum(ex) + 1 # +1 avoids division by zero ratios = [x / ex_total for x in ex] is_expressed = [ratio >= EXPRESSED_RATIO for ratio in ratios] if is_expressed != [False, False]: is_expressed_list.append(is_expressed) names.append(target_allele) counts.append(ex) if len(is_expressed_list) == 1: is_expressed = is_expressed_list[0] if is_expressed == [True, False]: haplotype.append((names[0], '', 'deletion', counts[0])) elif is_expressed == [False, True]: haplotype.append(('', names[0], 'deletion', counts[0])) elif is_expressed == [True, True]: haplotype.append((names[0], names[0], 'homozygous', counts[0])) else: assert False elif is_expressed_list == [[True, False], [False, True]]: haplotype.append((names[0], names[1], 'heterozygous', (counts[0][0], counts[1][1]))) elif is_expressed_list == [[False, True], [True, False]]: haplotype.append((names[1], names[0], 'heterozygous', (counts[0][1], counts[1][0]))) else: type_ = 'unknown' # Somewhat arbitrary criteria for a "duplication": # 1) one heterozygous allele, 2) at least three alleles in total n_true = sum(x.count(True) for x in is_expressed_list) if ([True, False] in is_expressed_list or [False, True] in is_expressed_list) and n_true > 2: type_ = 'duplication' for is_expressed, name, count in zip(is_expressed_list, names, counts): haplotype.append(( name if is_expressed[0] else '', name if is_expressed[1] else '', type_, count, )) return haplotype class HaplotypePair: """Haplotype pair for a single gene type (V/D/J)""" def __init__(self, haplotype, gene_type, het1, het2): 
self.haplotype = haplotype self.gene_type = gene_type self.het1 = het1 self.het2 = het2 def sort(self, order: List[str]) -> None: """ Sort the haplotype order -- list a names of genes in the desired order """ gene_order = {name: i for i, name in enumerate(order)} def keyfunc(hap): name = hap[0] if hap[0] else hap[1] gene, _, allele = name.partition('*') try: allele = int(allele) except ValueError: allele = 999 try: index = gene_order[gene] except KeyError: logger.warning('Gene %s not found in gene order file, placing it at the end', gene) index = 1000000 return index * 1000 + allele self.haplotype = sorted(self.haplotype, key=keyfunc) def switch(self): """Swap the two haplotypes""" self.het2, self.het1 = self.het1, self.het2 haplotype = [] for name1, name2, type_, counts in self.haplotype: assert len(counts) == 2 counts = counts[1], counts[0] haplotype.append((name2, name1, type_, counts)) self.haplotype = haplotype def to_tsv(self, header: bool=True) -> str: lines = [] if header: lines.append('\t'.join(['haplotype1', 'haplotype2', 'type', 'count1', 'count2'])) lines.append( '# {} haplotype from {} and {}'.format(self.gene_type, self.het1, self.het2)) for h1, h2, typ, count in self.haplotype: lines.append('\t'.join([h1, h2, typ, str(count[0]), str(count[1])])) return '\n'.join(lines) + '\n' def plot_haplotypes(blocks: List[HaplotypePair], show_unknown: bool=False, binarize: bool=False): from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas from matplotlib.figure import Figure from matplotlib.patches import Patch colormap = dict(homozygous='cornflowerblue', heterozygous='lightgreen', deletion='gold', duplication='crimson', unknown='gray') if not show_unknown: del colormap['unknown'] labels = [[], []] heights = [[], []] names = [] colors = [] positions = [] pos = 0 for block in blocks: for hap1, hap2, type_, (count1, count2) in block.haplotype: if not show_unknown and type_ == 'unknown': continue if binarize: count1 = 1 if hap1 else 0 count2 = 1 if hap2 else 0 label = hap1 if hap1 else hap2 hap1 = ('*' + hap1.partition('*')[2]) if hap1 else '' hap2 = ('*' + hap2.partition('*')[2]) if hap2 else '' heights[0].append(count1) heights[1].append(count2) labels[0].append(hap1) labels[1].append(hap2) names.append(label.partition('*')[0]) colors.append(colormap[type_]) positions.append(pos) pos += 1 pos += 2 n = len(names) assert len(labels[0]) == len(labels[1]) == len(colors) == len(heights[0]) == len(heights[1]) == n fig = Figure(figsize=(8, 24)) FigureCanvas(fig) axes = fig.subplots(ncols=2) for i in 0, 1: axes[i].barh(y=positions, width=heights[i], color=colors) axes[i].set_yticklabels(labels[i]) axes[i].set_ylim((-0.5, max(positions) + 0.5)) # Add center axis ax_center = axes[0].twinx() ax_center.set_yticks(axes[0].get_yticks()) ax_center.set_ylim(axes[0].get_ylim()) ax_center.set_yticklabels(names) # Synchronize x limits on both axes (has no effect if binarize is True) max_x = max(axes[i].get_xlim()[1] for i in range(2)) for ax in axes: ax.set_xlim(right=max_x) axes[0].invert_xaxis() for ax in axes[0], axes[1], ax_center: ax.invert_yaxis() for spine in ax.spines.values(): spine.set_visible(False) if binarize: ax.set_xticks([]) else: ax.set_axisbelow(True) ax.grid(True, axis='x') ax.set_yticks(positions) ax.tick_params(left=False, right=False, labelsize=12) axes[1].tick_params(labelleft=False, labelright=True) if not binarize: axes[0].spines['right'].set_visible(True) axes[1].spines['left'].set_visible(True) # Legend legend_patches = [Patch(color=col, label=label) for 
label, col in colormap.items()] fig.legend(handles=legend_patches, loc='upper center', ncol=len(legend_patches)) fig.tight_layout() fig.subplots_adjust(top=1 - 2 / len(names)) return fig def read_and_filter(path: str, d_evalue: float, d_coverage: float): usecols = ['V_gene', 'D_gene', 'J_gene', 'V_errors', 'D_errors', 'J_errors', 'D_covered', 'D_evalue'] # Support reading a table without D_errors try: table = read_table(path, usecols=usecols) except ValueError: usecols.remove('D_errors') table = read_table(path, usecols=usecols) logger.info('Table with %s rows read', len(table)) table = table[table.V_errors == 0] logger.info('%s rows remain after requiring V errors = 0', len(table)) table = table[table.J_errors == 0] logger.info('%s rows remain after requiring J errors = 0', len(table)) table = table[table.D_evalue <= d_evalue] logger.info('%s rows remain after requiring D E-value <= %s', len(table), d_evalue) table = table[table.D_covered >= d_coverage] logger.info('%s rows remain after requiring D coverage >= %s', len(table), d_coverage) if 'D_errors' in table.columns: table = table[table.D_errors == 0] logger.info('%s rows remain after requiring D errors = 0', len(table)) return table def main(args): if args.order is not None: with SequenceReader(args.order) as sr: gene_order = [r.name for r in sr] else: gene_order = None table = read_and_filter(args.table, args.d_evalue, args.d_coverage) if args.restrict is not None: with SequenceReader(args.restrict) as sr: restrict_names = set(r.name for r in sr) table = table[table['V_gene'].map(lambda name: name in restrict_names)] logger.info('After restricting to V genes named in %r, %d rows remain', args.restrict, len(table)) if len(table) == 0: logger.error('No rows remain, cannot continue') sys.exit(1) expressions = dict() het_expressions = dict() # these are also sorted, most highly expressed first for gene_type in 'VDJ': ex = list(expression_counts(table, gene_type)) expressions[gene_type] = ex het_ex = [e for e in ex if len(e) == 2] if het_ex: # Pick most highly expressed het_expressions[gene_type] = sorted(het_ex, key=lambda e: e['count'].sum(), reverse=True)[:5] else: # Force at least something to be plotted het_expressions[gene_type] = [None] if args.v_gene: het_ex = [e for e in expressions['V'] if len(e) == 2] for ex in het_ex: if (args.v_gene in ex.index and not ex.loc[args.v_gene].empty) or (args.v_gene in ex['name'].values): het_expressions['V'] = [ex] break else: logger.error('The gene or allele %s was not found in the list of heterozygous V genes. 
' 'It cannot be used with the --v-gene option.', args.v_gene)
        sys.exit(1)

    block_lists = []
    # We want to avoid using a gene classified as 'duplicate' for haplotyping, but the
    # classification is only known after we have done the haplotyping, so we try it
    # until we find a combination that works
    products = list(product(het_expressions['J'], het_expressions['V']))
    for attempt, (het_j, het_v) in enumerate(products):
        best_het_genes = {
            'V': het_v,
            'D': het_expressions['D'][0] if het_expressions['D'] else None,
            'J': het_j,
        }
        for gene_type in 'VDJ':
            bhg = best_het_genes[gene_type]
            text = bhg.index[0][0] if bhg is not None else 'none found'
            logger.info('Heterozygous %s gene to use for haplotyping: %s', gene_type, text)

        # Create HaplotypePair objects ('blocks') for each gene type
        blocks = []
        for target_gene_type, het_gene in (
            ('J', 'V'),
            ('D', 'J'),
            ('V', 'J'),
        ):
            het_alleles = best_het_genes[het_gene]
            if het_alleles is None:
                continue
            coexpressions = compute_coexpressions(table, het_gene, target_gene_type)
            target_groups = expressions[target_gene_type]
            het1, het2 = het_alleles['name']
            haplotype = cooccurrences(coexpressions, (het1, het2), target_groups)
            block = HaplotypePair(haplotype, target_gene_type, het1, het2)
            if gene_order:
                block.sort(gene_order)
            blocks.append(block)

        if het_j is None or het_v is None:
            break
        het_used = set(sum([list(h['name']) for h in best_het_genes.values() if h is not None], []))
        het_is_duplicate = False
        for block in blocks:
            if block.gene_type == 'D':
                continue

            # This nested loop is put in a separate generator so we can easily 'break'
            # out of both loops at the same time.
            def nameiter():
                for name1, name2, type_, _ in block.haplotype:
                    for name in name1, name2:
                        yield name, type_

            for name, type_ in nameiter():
                if name in het_used and type_ != 'heterozygous':
                    het_is_duplicate = True
                    if not args.v_gene:
                        logger.warning('%s not classified as "heterozygous" during haplotyping, '
                            'attempting to use different alleles', name)
                    break
            if het_is_duplicate:
                break
        block_lists.append(blocks)
        if not het_is_duplicate:
            break
    else:
        if not args.v_gene:
            logger.warning('No other alleles remain, using first found solution')
        blocks = block_lists[0]

    # Get the phasing right across blocks (i.e., swap J haplotypes if necessary)
    assert len(blocks) in (0, 1, 3)
    if len(blocks) == 3:
        j_hap, d_hap, v_hap = blocks
        assert j_hap.gene_type == 'J'
        assert d_hap.gene_type == 'D'
        assert v_hap.gene_type == 'V'
        assert d_hap.het1 == v_hap.het1
        assert d_hap.het2 == v_hap.het2
        for name1, name2, _, _ in j_hap.haplotype:
            if (name1, name2) == (v_hap.het2, v_hap.het1):
                j_hap.switch()
                break

    # Print the table
    header = True
    for block in blocks:
        print(block.to_tsv(header=header))
        header = False

    # Create plots if requested
    if args.plot:
        fig = plot_haplotypes(blocks, show_unknown=True)
        fig.savefig(args.plot)
    if args.structure_plot:
        fig = plot_haplotypes(blocks, binarize=True)
        fig.savefig(args.structure_plot)
IgDiscover-0.11/igdiscover/igblast.py000066400000000000000000000332701337725263500176500ustar00rootroot00000000000000"""
Run IgBLAST and output a result table

This is a wrapper for the "igblastn" tool which has a simpler command-line
syntax and can also run IgBLAST in parallel. The results are parsed,
postprocessed and printed as a tab-separated table to standard output.

Postprocessing includes:

- The CDR3 is detected (using pre-computed CDR3 positions on the database
  V and J sequences)
- The leader is detected within the sequence before the found V gene (by
  searching for the start codon).
- If the V sequence hit starts not at base 1 in the reference, it is extended to the left. """ import sys import os import shutil import time import multiprocessing import subprocess from contextlib import ExitStack from io import StringIO from itertools import islice import hashlib import errno import pkg_resources import logging import tempfile import json import gzip from xopen import xopen from sqt import SequenceReader from sqt.dna import nt_to_aa from .utils import SerialPool, available_cpu_count from .parse import TableWriter, IgBlastParser from .species import cdr3_start, cdr3_end from .config import GlobalConfig logger = logging.getLogger(__name__) def add_arguments(parser): arg = parser.add_argument arg('--threads', '-t', '-j', type=int, default=available_cpu_count(), help='Number of threads. Default: no. of available CPUs (%(default)s)') arg('--cache', action='store_true', default=None, help='Use the cache') arg('--no-cache', action='store_false', dest='cache', default=None, help='Do not use the cache') arg('--penalty', type=int, choices=(-1, -2, -3, -4), default=None, help='BLAST mismatch penalty (default: -1)') arg('--species', default=None, help='Tell IgBLAST which species to use. Note that this setting does ' 'not seem to have any effect since we provide our own database to ' 'IgBLAST. Default: Use IgBLAST’s default') arg('--sequence-type', default='Ig', choices=('Ig', 'TCR'), help='Sequence type. Default: %(default)s') arg('--raw', metavar='FILE', help='Write raw IgBLAST output to FILE ' '(add .gz to compress)') arg('--limit', type=int, metavar='N', help='Limit processing to first N records') arg('--rename', default=None, metavar='PREFIX', help='Rename reads to PREFIXseqN (where N is a number starting at 1)') arg('--stats', metavar='FILE', help='Write statistics in JSON format to FILE') arg('database', help='Database directory with V.fasta, D.fasta, J.fasta.') arg('fasta', help='File with original reads') class IgBlastCache: """Cache IgBLAST results in ~/.cache""" binary = 'igblastn' def __init__(self): version_string = subprocess.check_output([IgBlastCache.binary, '-version']) self._hasher = hashlib.md5(version_string) cache_home = os.environ.get('XDG_CACHE_HOME', os.path.expanduser('~/.cache')) self._cachedir = os.path.join(cache_home, 'igdiscover') logger.info('Caching IgBLAST results in %r', self._cachedir) self._lock = multiprocessing.Lock() def _path(self, digest): """Return path to cache file given a digest""" return os.path.join(self._cachedir, digest[:2], digest) + '.txt.gz' def _load(self, digest): try: with gzip.open(self._path(digest), 'rt') as f: return f.read() except FileNotFoundError: return None def _store(self, digest, data): path = self._path(digest) dir = os.path.dirname(path) os.makedirs(dir, exist_ok=True) with gzip.open(path, 'wt') as f: f.write(data) def retrieve(self, variable_arguments, fixed_arguments, blastdb_dir, fasta_str) -> str: hasher = self._hasher.copy() hasher.update(' '.join(fixed_arguments).encode()) for gene in 'V', 'D', 'J': with open(os.path.join(blastdb_dir, gene + '.fasta'), 'rb') as f: hasher.update(f.read()) hasher.update(fasta_str.encode()) digest = hasher.hexdigest() data = self._load(digest) if data is None: with tempfile.TemporaryDirectory() as tmpdir: path = os.path.join(tmpdir, 'igblast.txt') full_arguments = [IgBlastCache.binary] + variable_arguments + fixed_arguments\ + ['-out', path] output = subprocess.check_output(full_arguments, input=fasta_str, universal_newlines=True) assert output == '' with open(path) as f: data = 
f.read() with self._lock: # TODO does this help? self._store(digest, data) return data def run_igblast(sequences, blastdb_dir, species, sequence_type, penalty=None, use_cache=True) -> str: """ Run the igblastn command-line program. sequences -- list of Sequence objects blastdb_dir -- directory that contains BLAST databases. Files in that directory must be databases created by the makeblastdb program and have names V, D, and J. Return IgBLAST’s raw output as a string. """ if sequence_type not in ('Ig', 'TCR'): raise ValueError('sequence_type must be "Ig" or "TCR"') variable_arguments = [] for gene in 'V', 'D', 'J': variable_arguments += ['-germline_db_{gene}'.format(gene=gene), os.path.join(blastdb_dir, '{gene}'.format(gene=gene))] # An empty .aux suppresses a warning from IgBLAST. /dev/null does not work. empty_aux_path = pkg_resources.resource_filename('igdiscover', 'empty.aux') variable_arguments += ['-auxiliary_data', empty_aux_path] arguments = [] if penalty is not None: arguments += ['-penalty', str(penalty)] if species is not None: arguments += ['-organism', species] arguments += [ '-ig_seqtype', sequence_type, '-num_threads', '1', '-domain_system', 'imgt', '-num_alignments_V', '1', '-num_alignments_D', '1', '-num_alignments_J', '1', '-outfmt', '7 sseqid qstart qseq sstart sseq pident slen evalue', '-query', '-', ] fasta_str = ''.join(">{}\n{}\n".format(r.name, r.sequence) for r in sequences) if use_cache: global _igblastcache return _igblastcache.retrieve(variable_arguments, arguments, blastdb_dir, fasta_str) else: # For some reason, it has become unreliable to let IgBLAST 1.10 write its result # to standard output using "-out -". The data becomes corrupt. This does not occur # when calling igblastn in the shell and using the same syntax, only with # subprocess.check_output. As a workaround, we write the output to a temporary file. with tempfile.TemporaryDirectory() as tmpdir: path = os.path.join(tmpdir, 'igblast.txt') output = subprocess.check_output(['igblastn'] + variable_arguments + arguments + ['-out', path], input=fasta_str, universal_newlines=True) assert output == '' with open(path) as f: return f.read() def chunked(iterable, chunksize: int): """ Group the iterable into lists of length chunksize >>> list(chunked('ABCDEFG', 3)) [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']] """ chunk = [] for it in iterable: if len(chunk) == chunksize: yield chunk chunk = [] chunk.append(it) if chunk: yield chunk class Runner: """ This is the target of a multiprocessing pool. The target needs to be pickleable, and because nested functions cannot be pickled, we need this separate class. It runs IgBLAST and parses the output for a list of sequences. """ def __init__(self, blastdb_dir, species, sequence_type, penalty, database, use_cache): self.blastdb_dir = blastdb_dir self.species = species self.sequence_type = sequence_type self.penalty = penalty self.database = database self.use_cache = use_cache def __call__(self, sequences): """ Return tuples (igblast_result, records) where igblast_result is the raw IgBLAST output and records is a list of (Extended-)IgBlastRecord objects (the parsed output). 
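Usage sketch, as done in igblast() further below (chunks come from
		chunked(); the pool may also be a SerialPool when running
		single-threaded):

		    for raw_text, records in pool.imap(runner, chunks, chunksize=1):
		        ...  # raw_text is the IgBLAST output, records the parsed objects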
""" igblast_result = run_igblast(sequences, self.blastdb_dir, self.species, self.sequence_type, self.penalty, self.use_cache) sio = StringIO(igblast_result) parser = IgBlastParser(sequences, sio, self.database) records = list(parser) assert len(records) == len(sequences) return igblast_result, records def makeblastdb(fasta, database_name): with SequenceReader(fasta) as fr: sequences = list(fr) if not sequences: raise ValueError("FASTA file {} is empty".format(fasta)) process_output = subprocess.check_output( ['makeblastdb', '-parse_seqids', '-dbtype', 'nucl', '-in', fasta, '-out', database_name], stderr=subprocess.STDOUT ) shutil.copy(fasta, database_name + '.fasta') if b'Error: ' in process_output: raise subprocess.SubprocessError() class Database: def __init__(self, path, sequence_type): """path -- path to database directory with V.fasta, D.fasta, J.fasta""" self.path = path self.sequence_type = sequence_type with SequenceReader(os.path.join(path, 'V.fasta')) as sr: self._v_records = list(sr) self.v = self._records_to_dict(self._v_records) with SequenceReader(os.path.join(path, 'J.fasta')) as sr: self._j_records = list(sr) self.j = self._records_to_dict(self._j_records) self._cdr3_starts = dict() self._cdr3_ends = dict() for chain in ('heavy', 'kappa', 'lambda', 'alpha', 'beta', 'gamma', 'delta'): self._cdr3_starts[chain] = {name: cdr3_start(s, chain) for name, s in self.v.items()} self._cdr3_ends[chain] = {name: cdr3_end(s, chain) for name, s in self.j.items()} self.v_regions_nt, self.v_regions_aa = self._find_v_regions() @staticmethod def _records_to_dict(records): return {record.name: record.sequence.upper() for record in records} def v_cdr3_start(self, gene, chain): return self._cdr3_starts[chain][gene] def j_cdr3_end(self, gene, chain): return self._cdr3_ends[chain][gene] def _find_v_regions(self): """ Run IgBLAST on the V sequences to determine the nucleotide and amino-acid sequences of the FR1, CDR1, FR2, CDR2 and FR3 regions """ v_regions_nt = dict() v_regions_aa = dict() for record in igblast(self.path, self._v_records, self.sequence_type, threads=1): nt_regions = dict() aa_regions = dict() for region in ('FR1', 'CDR1', 'FR2', 'CDR2', 'FR3'): nt_seq = record.region_sequence(region) if nt_seq is None: break if len(nt_seq) % 3 != 0: logger.warning('Length %s of %s region in %r is not divisible by three; region ' 'info for this gene will not be available', len(nt_seq), region, record.query_name) # not codon-aligned, skip entire record break nt_regions[region] = nt_seq try: aa_seq = nt_to_aa(nt_seq) except ValueError as e: logger.warning('The %s region could not be converted to amino acids: %s', region, str(e)) break if '*' in aa_seq: logger.warning('The %s region in %r contains a stop codon (%r); region info ' 'for this gene will not be available', region, record.query_name, aa_seq) break aa_regions[region] = aa_seq else: v_regions_nt[record.query_name] = nt_regions v_regions_aa[record.query_name] = aa_regions return v_regions_nt, v_regions_aa def igblast(database, sequences, sequence_type, species=None, threads=None, penalty=None, raw_output=None, use_cache=False): """ Run IgBLAST, parse results and yield (Extended-)IgBlastRecord objects. database -- Path to database directory with V./D./J.fasta files *or* a Database object. If it is a path, then only IgBlastRecord objects are returned. sequences -- an iterable of Sequence objects sequence_type -- 'Ig' or 'TCR' threads -- number of threads. If None, then available_cpu_count() is used. 
raw_output -- If not None, raw IgBLAST output is written to this file """ if threads is None: threads = available_cpu_count() if isinstance(database, str): database_dir = database database = None else: database_dir = database.path with ExitStack() as stack: # Create the three BLAST databases in a temporary directory blastdb_dir = stack.enter_context(tempfile.TemporaryDirectory()) for gene in ['V', 'D', 'J']: makeblastdb(os.path.join(database_dir, gene + '.fasta'), os.path.join(blastdb_dir, gene)) chunks = chunked(sequences, chunksize=1000) runner = Runner(blastdb_dir, species=species, sequence_type=sequence_type, penalty=penalty, database=database, use_cache=use_cache) pool = stack.enter_context(multiprocessing.Pool(threads) if threads > 1 else SerialPool()) for igblast_output, igblast_records in pool.imap(runner, chunks, chunksize=1): if raw_output: raw_output.write(igblast_output) yield from igblast_records def main(args): config = GlobalConfig() use_cache = config.use_cache if args.cache is not None: use_cache = args.cache if use_cache: global _igblastcache _igblastcache = IgBlastCache() logger.info('IgBLAST cache enabled') database = Database(args.database, args.sequence_type) detected_cdr3s = 0 writer = TableWriter(sys.stdout) start_time = time.time() last_status_update = 0 with ExitStack() as stack: if args.raw: raw_output = stack.enter_context(xopen(args.raw, 'w')) else: raw_output = None sequences = stack.enter_context(SequenceReader(args.fasta)) sequences = islice(sequences, 0, args.limit) n = 0 # number of records processed so far for record in igblast(database, sequences, sequence_type=args.sequence_type, species=args.species, threads=args.threads, penalty=args.penalty, raw_output=raw_output, use_cache=use_cache): n += 1 if args.rename is not None: record.query_name = "{}seq{}".format(args.rename, n) d = record.asdict() if d['CDR3_aa']: detected_cdr3s += 1 try: writer.write(d) except IOError as e: if e.errno == errno.EPIPE: sys.exit(1) raise if n % 1000 == 0: elapsed = time.time() - start_time if elapsed >= last_status_update + 60: logger.info( 'Processed {:10,d} sequences at {:.3f} ms/sequence'.format(n, elapsed / n * 1E3)) last_status_update = elapsed elapsed = time.time() - start_time logger.info('Processed {:10,d} sequences at {:.1f} ms/sequence'.format(n, elapsed / n * 1E3)) logger.info('%d IgBLAST assignments parsed and written', n) logger.info('CDR3s detected in %.1f%% of all sequences', detected_cdr3s / n * 100) if args.stats: stats = {'total': n, 'detected_cdr3s': detected_cdr3s} with open(args.stats, 'w') as f: json.dump(stats, f) print(file=f) IgDiscover-0.11/igdiscover/igdiscover.yaml000066400000000000000000000165511337725263500206760ustar00rootroot00000000000000## IgDiscover configuration # How many discovery iterations to run. If 0, no updated database is created, # but expression profiles are still computed. Unless working with a highly # incomplete starting database, a single iteration is usually sufficient. # iterations: 1 # Type of sequences: Choose 'Ig' or 'TCR'. # sequence_type: Ig ## Barcoding settings # If you have a random barcode sequence (unique molecular identifier) at the 5' end, # set this to its length. Leave at 0 when you have no 5' barcode. # barcode_length_5prime: 0 # Same as above, but for the 3' end of the sequence. Leave at 0 when you have no 3' barcode. # Currently, you cannot have a barcode in both ends, so at least one of the two settings # must be zero. 
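# Example (hypothetical protocol, for illustration only): for a 16 nt UMI
# at the 5' end and no 3' barcode, you would set:
#
#   barcode_length_5prime: 16
#   barcode_length_3prime: 0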
#
barcode_length_3prime: 0

# When barcoding is enabled, sequences that have identical barcode and CDR3 are
# collapsed into a single consensus sequence.
# If you set this to false, no collapsing and consensus taking is done and
# only the barcode is removed from each sequence.
#
barcode_consensus: true

# When grouping by barcode and CDR3, the CDR3 location is either detected with a
# regular expression or a 'pseudo' CDR3 sequence is used, which is at a
# pre-defined position within the sequence.
#
# Set this configuration option to a region like [-80, -60] to use a pseudo
# CDR3 located at bases 80 to 60 counted from the 3' end. (Use negative numbers to
# count from the 3' end, positive ones to count from the 5' end. The most 5'
# base has index 0.)
#
# Set this to 'detect' (with quotation marks) in order to use CDR3s
# detected by regular expression. This assumes that the input contains
# VH sequences!
#
# Set this to false (no quotation marks) in order to *only* group by barcode, not by CDR3.
#
cdr3_location: 'detect'  # Works only with VH sequences!

# When you use a RACE protocol, the sequences have a run of G nucleotides in the beginning
# which need to be removed when barcodes are used. If you use RACE, set this to true.
# The G nucleotides are assumed to be at the 5' end (but after the barcode if it exists).
#
race_g: false


## Primer-related settings

# If set to true, it is assumed that the forward primer is always at the 5' end
# of the first read and that the reverse primer is always at the 5' end of the
# second read. If it can also be the other way, set this to false.
# This setting has no effect if no primer sequences are defined below.
#
stranded: false

# List of 5' primers
# forward_primers:
# - AGCTACAGAGTCCCAGGTCCA
# - ACAGGYGCCCACTCYSAG
# - TTGCTMTTTTAARAGGTGTCCAGTGTG
# - CTCCCAGATGGGTCCTGTC
# - ACCGTCCYGGGTCTTGTC
# - CTGTTCTCCAAGGGWGTCTSTG
# - CATGGGGTGTCCTGTCACA

# List of 3' primers
# reverse_primers:
# - GCAGGCCTTTTTGGCCNNNNNGCCGATGGGCCCTTGGTGGAGGCTGA  # IgG
# - GCAGGCCTTTTTGGCCNNNNNGGGGCATTCTCACAGGAGACGAGGGGGAAAAG  # IgM

# Work only on this number of reads (for quick test runs). Set to false to
# process all reads.
#
#limit: false

# Filter out merged reads that are shorter than this length.
#
minimum_merged_read_length: 300

# Read merging program. Choose either 'pear' or 'flash'.
# pear merges more reads, but is slower.
#
#merge_program: pear

# Maximum overlap (-M) for the flash read merger.
# If you use pear, this is ignored.
#
flash_maximum_overlap: 300

# Do not mention the original FASTA or FASTQ sequence names in the
# assigned.tab files, but instead use names of the form <prefix>_seqN,
# where N is a running number starting at 1.
# true: yes, rename
# false: no, do not rename
#
rename: false

# Whether debugging is enabled or not. Currently, if this is set to true,
# some large intermediate files that would otherwise be deleted will be
# kept.
#
debug: false

# The "seed value" is an arbitrary number used to get reproducible
# runs. Two runs that use the same software version, the same seed
# and otherwise the same configuration will give identical results.
#
# Set this to false in order to use a different seed in each run.
# The results will then not be exactly reproducible.
#
seed: 1

# The preprocessing filter is always applied directly after running IgBLAST,
# even if no gene discovery is requested.
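# As an example (hypothetical values, not a recommendation), a more
# permissive preprocessing filter for noisy data could look like this:
#
#   preprocessing_filter:
#     v_coverage: 80
#     j_coverage: 50
#     v_evalue: 0.01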
# preprocessing_filter: v_coverage: 90 # Match must cover V gene by at least this percentage j_coverage: 60 # Match must cover J gene by at least this percentage v_evalue: 0.001 # Highest allowed V gene match E-value ## Candidate discovery settings # When discovering new V genes, ignore whether a J gene has been assigned # and also ignore its %SHM. # true: yes, ignore the J # false: do not ignore J assignment, do not ignore its %SHM # ignore_j: false # When clustering sequences to discover new genes, subsample to this number of # sequences. Higher is slower. # subsample: 1000 # When computing the Ds_exact column, consider only D hits that # cover the reference D gene sequence by at least this percentage. # d_coverage: 70 ## V candidate filtering (germline filtering) settings # Filtering criteria applied to candidate sequences in all iterations except the last. # pre_germline_filter: unique_cdr3s: 2 # Minimum number of unique CDR3s (within exact matches) unique_js: 2 # Minimum number of unique J genes (within exact matches) whitelist: true # Add database sequences to the whitelist cluster_size: 0 # Minimum number of sequences assigned to cluster allow_stop: true # Whether to allow non-productive sequences containing stop codons cross_mapping_ratio: 0.02 # Threshold for removal of cross-mapping artifacts (set to 0 to disable) clonotype_ratio: 0.12 # Required minimum ratio of clonotype counts between alleles of the same gene exact_ratio: 0.12 # Required minimum ratio of "exact" counts between alleles of the same gene cdr3_shared_ratio: 0.8 # Maximum allowed CDR3_shared_ratio unique_d_ratio: 0.3 # Minimum Ds_exact ratio between alleles unique_d_threshold: 10 # Check Ds_exact ratio only if highest-expressed allele has at least this Ds_exact count # Filtering criteria applied to candidate sequences in the last iteration. # These should be more strict than the pre_germline_filter criteria. # germline_filter: unique_cdr3s: 5 # Minimum number of unique CDR3s (within exact matches) unique_js: 3 # Minimum number of unique J genes (within exact matches) whitelist: true # Add database sequences to the whitelist cluster_size: 100 # Minimum number of sequences assigned to cluster allow_stop: false # Whether to allow non-productive sequences containing stop codons cross_mapping_ratio: 0.02 # Threshold for removal of cross-mapping artifacts (set to 0 to disable) clonotype_ratio: 0.12 # Required minimum ratio of clonotype counts between alleles of the same gene exact_ratio: 0.12 # Required minimum ratio of "exact" counts between alleles of the same gene cdr3_shared_ratio: 0.8 # Maximum allowed CDR3_shared_ratio unique_d_ratio: 0.3 # Minimum Ds_exact ratio between alleles unique_d_threshold: 10 # Check Ds_exact ratio only if highest-expressed allele has at least this Ds_exact count ## J discovery settings j_discovery: allele_ratio: 0.2 # Required minimum ratio between alleles of a single gene cross_mapping_ratio: 0.1 # Threshold for removal of cross-mapping artifacts. propagate: true # Use J genes discovered in iteration 1 in subsequent ones IgDiscover-0.11/igdiscover/init.py000066400000000000000000000231571337725263500171710ustar00rootroot00000000000000""" Create and initialize a new analysis directory. 
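The new directory is populated with the igdiscover.yaml configuration file,
symlinks to the input reads, and a database/ subdirectory holding copies of
the V.fasta, D.fasta and J.fasta files.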
""" import glob import logging import re import os import os.path import sys import subprocess import pkg_resources from sqt import SequenceReader from .config import Config try: import tkinter as tk from tkinter import messagebox from tkinter import filedialog except ImportError: tk = None from xopen import xopen logger = logging.getLogger(__name__) do_not_show_cpustats = 1 def add_arguments(parser): parser.add_argument('--database', '--db', metavar='PATH', default=None, help='Directory with V.fasta, D.fasta and J.fasta files. If not given, a dialog is shown.') group = parser.add_mutually_exclusive_group() group.add_argument('--single-reads', default=None, metavar='READS', help='File with single-end reads (.fasta.gz or .fastq.gz)') group.add_argument('--reads1', default=None, help='First paired-end read file. The second is found automatically. ' 'Must be a .fastq.gz file. If not given, a dialog is shown.') parser.add_argument('directory', help='New analysis directory to create') def launch(path): if hasattr(os, 'startfile'): os.startfile(path) elif sys.platform == 'linux': subprocess.call(['xdg-open', path]) elif sys.platform == 'darwin': subprocess.call(['open', path]) class TkinterGui: """Show a GUI for selecting reads and the database directory""" def __init__(self): if not tk: # import failed raise ImportError() root = tk.Tk() root.withdraw() def yesno(self, title, question): return tk.messagebox.askyesno(title, question) def database_path(self, initialdir): path = tk.filedialog.askdirectory( title="Choose V/D/J database directory", mustexist=True, initialdir=initialdir) return path def reads1_path(self): path = tk.filedialog.askopenfilename( title="Choose first reads file", filetypes=[ ("Reads", "*.fastq.gz"), ("Any file", "*")]) return path def single_reads_path(self): path = tk.filedialog.askopenfilename( title="Choose single-end reads file", filetypes=[ ("Reads", "*.fasta *.fastq *.fastq.gz *.fasta.gz"), ("Any file", "*")]) return path def error(self, title, message): tk.messagebox.showerror(title, message) """ # Works, but let’s not introduce the PySide dependency for now. def qt_path(): import PySide from PySide.QtGui import QApplication from PySide.QtGui import QMessageBox, QFileDialog # Create the application object app = QApplication([]) path = QFileDialog.getOpenFileName(None, "Open first reads file", '.', "FASTA/FASTQ reads (*.fastq *.fasta *.fastq.gz *.fasta.gz);; Any file (*)") # QMessageBox.information(None, 'Chosen file', path[0]) return path[0] """ def is_1_2(s, t): """ Determine whether s and t are identical except for a single character of which one of them is '1' and the other is '2'. """ differences = 0 one_two = {'1', '2'} for c1, c2 in zip(s, t): if c1 != c2: differences += 1 if differences == 2: return False if {c1, c2} != one_two: return False return differences == 1 def guess_paired_path(path): """ Given the path to a file that contains the sequences for the first read in a pair, return the file that contains the sequences for the second read in a pair. Both files must have identical names, except that the first must have a '1' in its name, and the second must have a '2' at the same position. Return None if no second file was found or if there are too many candidates. >>> guess_paired_path('file.1.fastq.gz') # doctest: +SKIP 'file.2.fastq.gz' # if that file exists """ base, name = os.path.split(path) # All lone 1 digits replaced with '?' name_with_globs = re.sub(r'(?': return 'fasta' else: raise UnknownFileFormatError('Cannot recognize format. 
File starts with neither ">" nor "@"') def try_open(path): try: with open(path) as f: pass except OSError as e: logger.error('Could not open %r: %s', path, e) sys.exit(1) def read_and_repair_fasta(path): """ Read a FASTA file and make sure it is suitable for use with makeblastdb. It repairs the following issues: - If a record is empty, it is skipped - If a record name occurs more than once, the second record name gets a suffix - If a sequence occurs more than once, occurrences after the first are skipped """ with SequenceReader(path) as sr: records = list(sr) names = set() sequences = dict() for r in records: r.sequence = r.sequence.upper() if len(r.sequence) == 0: logger.info("Record %r is empty, skipping it.", r.name) continue name = r.name i = 0 while name in names: i += 1 name = r.name + '_{}'.format(i) if name != r.name: logger.info('Record name %r replaced with %r because it occurs more than once', r.name, name) if r.sequence in sequences: logger.info('Skipping %r because it contains the same sequence as %r', r.name, sequences[r.sequence]) continue sequences[r.sequence] = name names.add(name) r.name = name yield r def main(args): if ' ' in args.directory: logger.error('The name of the analysis directory must not contain spaces') sys.exit(1) if os.path.exists(args.directory): logger.error('The target directory {!r} already exists.'.format(args.directory)) sys.exit(1) # If reads files or database were not given, initialize the GUI if (args.reads1 is None and args.single_reads is None) or args.database is None: try: gui = TkinterGui() except ImportError: # TODO tk.TclError cannot be caught when import of tk fails logger.error('GUI cannot be started. Please provide reads1 file ' 'and database directory on command line.') sys.exit(1) else: gui = None # Find out whether data is paired or single assert not (args.reads1 and args.single_reads) if args.reads1 is None and args.single_reads is None: paired = gui.yesno('Paired end or single-end reads', 'Are your reads paired and need to be merged?\n\n' 'If you answer "Yes", next select the FASTQ files ' 'with the first of your paired-end reads.\n' 'If you answer "No", next select the FASTA or FASTQ ' 'file with single-end reads.') if paired is None: logger.error('Cancelled') sys.exit(2) else: paired = bool(args.reads1) # Assign reads1 and (if paired) also reads2 if paired: if args.reads1 is not None: reads1 = args.reads1 try_open(reads1) else: reads1 = gui.reads1_path() if not reads1: logger.error('Cancelled') sys.exit(2) reads2 = guess_paired_path(reads1) if reads2 is None: logger.error('Could not determine second file of paired-end reads') sys.exit(1) else: if args.single_reads is not None: reads1 = args.single_reads try_open(reads1) else: reads1 = gui.single_reads_path() if not reads1: logger.error('Cancelled') sys.exit(2) if args.database is not None: dbpath = args.database else: # TODO as soon as we distribute our own database files, we can use this: # database_path = pkg_resources.resource_filename('igdiscover', 'databases') databases_path = None dbpath = gui.database_path(databases_path) if not dbpath: logger.error('Cancelled') sys.exit(2) database = dict() for g in ['V', 'D', 'J']: path = os.path.join(dbpath, g + '.fasta') if not os.path.exists(path): logger.error( 'The database directory %r must contain the three files ' 'V.fasta, D.fasta and J.fasta', dbpath) logger.error( 'A dummy D.fasta is necessary even if analyzing light chains (see manual)') sys.exit(2) database[g] = list(read_and_repair_fasta(path)) # Create the directory try: 
os.mkdir(args.directory) except OSError as e: logger.error(e) sys.exit(1) def create_symlink(readspath, dirname, target): gz = '.gz' if readspath.endswith('.gz') else '' if not os.path.isabs(readspath): src = os.path.relpath(readspath, dirname) else: src = readspath os.symlink(src, os.path.join(dirname, target + gz)) if paired: create_symlink(reads1, args.directory, 'reads.1.fastq') create_symlink(reads2, args.directory, 'reads.2.fastq') else: try: target = 'reads.' + file_type(reads1) except UnknownFileFormatError: logger.error('Cannot determine whether reads file is FASTA or FASTQ') sys.exit(1) create_symlink(reads1, args.directory, target) # Write the configuration file configuration = pkg_resources.resource_string('igdiscover', Config.DEFAULT_PATH).decode() with open(os.path.join(args.directory, Config.DEFAULT_PATH), 'w') as f: f.write(configuration) # Create database files database_dir = os.path.join(args.directory, 'database') os.mkdir(database_dir) for gene in ['V', 'D', 'J']: with open(os.path.join(database_dir, gene + '.fasta'), 'w') as db_file: for record in database[gene]: print('>{}\n{}'.format(record.name, record.sequence), file=db_file) if gui is not None: # Only suggest to edit the config file if at least one GUI dialog has been shown if gui.yesno('Directory initialized', 'Do you want to edit the configuration file now?'): launch(os.path.join(args.directory, Config.DEFAULT_PATH)) logger.info('Directory %s initialized.', args.directory) logger.info('Edit %s/%s, then run "cd %s && igdiscover run" to start the analysis', args.directory, Config.DEFAULT_PATH, args.directory) IgDiscover-0.11/igdiscover/merge.py000066400000000000000000000061651337725263500173250ustar00rootroot00000000000000""" Merge paired-end reads (a wrapper around PEAR) This script can also manage a cache of already-merged files. """ import os import sys from pathlib import Path import logging import tempfile import subprocess import hashlib import shutil from igdiscover.config import GlobalConfig logger = logging.getLogger(__name__) def add_arguments(parser): arg = parser.add_argument arg('--threads', '-j', default=None, type=int, help='Number of threads') arg('--no-cache', default=None, action='store_true', help='Disable cache. 
Default: Determined by configuration')
    arg('reads1', help='Forward reads FASTQ file')
    arg('reads2', help='Reverse reads FASTQ file')
    arg('output', help='Output file (compressed FASTQ)')


def compute_hash(path1: str, path2: str):
    """Return an MD5 digest of the PEAR version string and the contents of both input files"""
    result = subprocess.run(['pear', '-h'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    pear_version = None
    for line in result.stdout.split(b'\n'):
        if line.startswith(b'PEAR'):
            pear_version = line
            break
    assert pear_version is not None
    hasher = hashlib.md5(pear_version)
    for path in path1, path2:
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(1048576)
                if not chunk:
                    break
                hasher.update(chunk)
    return hasher.hexdigest()


def run_pear(path1: str, path2: str, output: str, log_output=None, threads: int = None):
    gzip = 'pigz' if shutil.which('pigz') else 'gzip'
    with tempfile.TemporaryDirectory() as tmpdir:
        tmpdir = Path(tmpdir)
        args = ['pear']
        if threads is not None:
            args += ['-j', str(threads)]
        args += [
            '-f', path1,
            '-r', path2,
            '-o', tmpdir / 'merged'
        ]
        if log_output is None:
            subprocess.run(args, check=True)
        else:
            result = subprocess.run(args, check=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
            with open(log_output, 'wb') as f:
                f.write(result.stdout)
        subprocess.run([gzip, tmpdir / 'merged.assembled.fastq'], check=True)
        shutil.move(tmpdir / 'merged.assembled.fastq.gz', output)


def run_pear_cached(path1: str, path2: str, output: str, threads: int = None):
    cache_home = Path(os.environ.get('XDG_CACHE_HOME', os.path.expanduser('~/.cache')))
    cachedir = cache_home / 'igdiscover' / 'mergedreads'
    cachedir.mkdir(parents=True, exist_ok=True)
    md5sum = compute_hash(path1, path2)
    cached_merged = cachedir / (md5sum + '.fastq.gz')
    cached_log = cachedir / (md5sum + '.log')
    if not cached_merged.exists() or not cached_log.exists():
        run_pear(path1, path2, output=cached_merged, log_output=cached_log, threads=threads)
        logger.info('PEAR result copied to cache')
    else:
        logger.info('PEAR result found in cache:')
        logger.info(cached_merged)
        logger.info(cached_log)
    shutil.copy(cached_merged, output)
    with open(cached_log) as f:
        sys.stdout.write(f.read())
    # Update the mtimes of the cache files. This lets us find unused cache entries.
    for path in cached_merged, cached_log:
        try:
            path.touch()
        except PermissionError:
            pass


def main(args):
    use_cache = GlobalConfig().use_cache
    if args.no_cache:
        use_cache = False
    if use_cache:
        logger.info('Cache enabled')
        func = run_pear_cached
    else:
        func = run_pear
    func(path1=args.reads1, path2=args.reads2, output=args.output, threads=args.threads)
IgDiscover-0.11/igdiscover/multidiscover.py000066400000000000000000000037321337725263500211140ustar00rootroot00000000000000"""
Find V gene sister sequences shared by multiple libraries.
"""
import logging
from collections import Counter

import pandas as pd

logger = logging.getLogger(__name__)


def add_arguments(parser):
    arg = parser.add_argument
    arg('--minimum-frequency', '-n', type=int, metavar='N', default=None,
        help='Minimum number of datasets in which a sequence must occur. '
        'Default: half the number of input files, rounded up, but at least 2')
    arg('--minimum-db-diff', '-b', type=int, metavar='DIST', default=1,
        help='Use only sequences that have at least DIST differences to the '
        'database sequence. Default: %(default)s')
    arg('tables', metavar='DISCOVER.TAB',
        help='Table created by the "discover" command (give at least two)', nargs='+')


def main(args):
    if args.minimum_frequency is None:
        # Half the number of files, rounded up, but at least 2
        minimum_frequency = max((len(args.tables) + 1) // 2, 2)
    else:
        minimum_frequency = args.minimum_frequency
    logger.info('Minimum frequency set to %s', minimum_frequency)

    # Read in tables
    tables = []
    for path in args.tables:
        table = pd.read_csv(path, sep='\t')
        table = table[table.database_diff >= args.minimum_db_diff]
        table = table.dropna()
        tables.append(table)
        if len(table) == 0:
            logger.warning('Table read from %r is empty after removing sequences '
                'with database diff < %s.', path, args.minimum_db_diff)

    # Count V sequence occurrences
    counter = Counter()
    for table in tables:
        counter.update(set(table.consensus))

    # Find most frequent occurrences and print result
    print('count', 'gene', 'database_diff', 'sequence', 'names', sep='\t')
    for sequence, frequency in counter.most_common():
        if frequency < minimum_frequency:
            break
        names = []
        gene = None
        for table in tables:
            matching_rows = table[table.consensus == sequence]
            if matching_rows.empty:
                continue
            names.extend(matching_rows.name)
            if gene is None:
                row = matching_rows.iloc[0]
                gene = row.gene
                database_diff = row.database_diff
                # shm = row['V_SHM']
        print(frequency, gene, database_diff, sequence, *names, sep='\t')
IgDiscover-0.11/igdiscover/parse.py000066400000000000000000000561301337725263500173350ustar00rootroot00000000000000"""
Parse IgBLAST output and write out a tab-separated table.

IgBLAST must have been run with
-outfmt "7 sseqid qstart qseq sstart sseq pident slen evalue"

A few extra things are done in addition to parsing:

- The CDR3 is detected (using pre-computed CDR3 positions on the database
  V and J sequences)
- The leader is detected within the sequence before the found V gene (by
  searching for the start codon).
- If the V sequence hit does not start at the first base of the reference,
  it is extended to the left.
"""
import csv
import logging
from collections import namedtuple
import functools

from sqt.dna import reverse_complement
from sqt.align import edit_distance, hamming_distance

from .utils import nt_to_aa
from .species import find_cdr3, CDR3_SEARCH_START

logger = logging.getLogger(__name__)


def none_if_na(s):
    """Return None if s == 'N/A'. Return s otherwise."""
    return None if s == 'N/A' else s


def split_by_section(iterable, section_starts):
    """
    Parse a stream of lines into chunks of sections. When one of the lines
    starts with a string given in section_starts, a new section is started,
    and a tuple (head, lines) is returned where head is the matching line
    and lines contains a list of the lines following the section header, up
    to (but excluding) the next section header.

    Works a bit like str.split(), but on lines.
    """
    lines = None
    header = None
    for line in iterable:
        line = line.strip()
        for start in section_starts:
            if line.startswith(start):
                if header is not None:
                    yield (header, lines)
                header = line
                lines = []
                break
        else:
            if header is None:
                raise ParseError("Expected a line starting with one of {}".format(
                    ', '.join(section_starts)))
            lines.append(line)
    if header is not None:
        yield (header, lines)


# Each alignment summary describes a region in the V region (FR1, CDR1, etc.
up to CDR3) AlignmentSummary = namedtuple('AlignmentSummary', 'start stop length matches mismatches gaps percent_identity') JunctionVDJ = namedtuple('JunctionVDJ', 'v_end vd_junction d_region dj_junction j_start') JunctionVJ = namedtuple('JunctionVJ', 'v_end vj_junction j_start') _Hit = namedtuple('_Hit', [ 'subject_id', # name of database record, such as "VH4.11" 'query_start', 'query_alignment', # aligned part of the query, with '-' for deletions 'subject_start', 'subject_alignment', # aligned reference, with '-' for insertions 'subject_length', # total length of reference, depends only on subject_id 'percent_identity', 'evalue', ]) class Hit(_Hit): # This avoids having a __dict__ attribute, which is necessary for namedtuple # subclasses that need _asdict() to work (http://bugs.python.org/issue24931) __slots__ = () def covered(self): """ Return fraction of bases in the original subject sequence that are covered by this hit. """ return len(self.subject_sequence) / self.subject_length @property def query_end(self): return self.query_start + len(self.query_sequence) @property def subject_end(self): return self.subject_start + len(self.subject_sequence) @property def query_sequence(self): return self.query_alignment.replace('-', '') @property def subject_sequence(self): return self.subject_alignment.replace('-', '') @property def errors(self): return sum(a != b for a, b in zip(self.subject_alignment, self.query_alignment)) def query_position(self, reference_position): """ Given a position on the reference, return the same position but relative to the full query sequence. """ # Iterate over alignment columns ref_pos = self.subject_start query_pos = self.query_start if ref_pos == reference_position: return query_pos for ref_c, query_c in zip(self.subject_alignment, self.query_alignment): if ref_c != '-': ref_pos += 1 if query_c != '-': query_pos += 1 if ref_pos == reference_position: return query_pos return None def parse_header(header): """ Extract size= and barcode= fields from the FASTA/FASTQ header line >>> parse_header("name;size=12;barcode=ACG;") ('name', 12, 'ACG') >>> parse_header("another name;size=200;foo=bar;") ('another name', 200, None) """ fields = header.split(';') query_name = fields[0] size = barcode = None for field in fields[1:]: if field == '': continue if '=' in field: key, value = field.split('=', maxsplit=1) if key == 'size': size = int(value) elif key == 'barcode': barcode = value return query_name, size, barcode class IgBlastRecord: def __init__( self, full_sequence, query_name, alignments, hits, v_gene, d_gene, j_gene, chain, has_stop, in_frame, is_productive, strand, junction ): self.full_sequence = full_sequence self.query_name = query_name self.alignments = alignments self.hits = hits self.v_gene = v_gene self.d_gene = d_gene self.j_gene = j_gene self.chain = chain self.has_stop = has_stop self.in_frame = in_frame self.is_productive = is_productive self.strand = strand self.junction = junction def region_sequence(self, region): """ Return the nucleotide sequence of a named region. Allowed names are: CDR1, CDR2, CDR3, FR1, FR2, FR3. For all regions except CDR3, sequences are extracted from the full read using begin and end coordinates from IgBLAST’s "alignment summary" table. 
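Usage sketch (the record and its contents are hypothetical):

		    cdr1_nt = record.region_sequence('CDR1')  # nucleotide string, or None
		    fr3_nt = record.region_sequence('FR3')    # None if coordinates are missing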
""" alignment = self.alignments.get(region, None) if alignment is None: return None if alignment.start is None or alignment.stop is None: return None return self.full_sequence[alignment.start:alignment.stop] def __repr__(self): return 'IgBlastRecord(query_name={query_name!r}, ' \ 'v_gene={v_gene!r}, d_gene={d_gene!r}, j_gene={j_gene!r}, chain={chain!r}, ...)'.format( **vars(self)) class ExtendedIgBlastRecord(IgBlastRecord): """ This extended record does a few extra things: - The CDR3 is detected by using a regular expression - The leader is detected within the sequence before the found V gene (by searching for the start codon). - If the V sequence hit starts at base 2 in the reference, it is extended one to the left. """ # TODO move computation of cdr3_sequence, vdj_sequence into constructor # TODO maybe make all coordinates relative to full sequence # Order of columns (use with asdict()) columns = [ 'count', 'V_gene', 'D_gene', 'J_gene', 'chain', 'stop', 'V_covered', 'D_covered', 'J_covered', 'V_evalue', 'D_evalue', 'J_evalue', 'FR1_SHM', 'CDR1_SHM', 'FR2_SHM', 'CDR2_SHM', 'FR3_SHM', 'V_SHM', 'J_SHM', 'V_aa_mut', 'J_aa_mut', 'FR1_aa_mut', 'CDR1_aa_mut', 'FR2_aa_mut', 'CDR2_aa_mut', 'FR3_aa_mut', 'V_errors', 'D_errors', 'J_errors', 'UTR', 'leader', 'CDR1_nt', 'CDR1_aa', 'CDR2_nt', 'CDR2_aa', 'CDR3_nt', 'CDR3_aa', 'V_nt', 'V_aa', 'V_end', 'V_CDR3_start', 'VD_junction', 'D_region', 'DJ_junction', 'J_nt', 'VDJ_nt', 'VDJ_aa', 'name', 'barcode', 'genomic_sequence', ] CHAINS = { 'VH': 'heavy', 'VK': 'kappa', 'VL': 'lambda', 'VA': 'alpha', 'VB': 'beta', 'VG': 'gamma', 'VD': 'delta' } def __init__(self, database, **kwargs): super().__init__(**kwargs) self.query_name, self.size, self.barcode = parse_header(self.query_name) self.genomic_sequence = self.full_sequence self._database = database if 'V' in self.hits: self.hits['V'] = self._fixed_v_hit() self.utr, self.leader = self._utr_leader() self.alignments['CDR3'] = self._find_cdr3() @property def vdj_sequence(self): if 'V' not in self.hits or 'J' not in self.hits: return None hit_v = self.hits['V'] hit_j = self.hits['J'] vdj_start = hit_v.query_start vdj_stop = hit_j.query_start + len(hit_j.query_sequence) return self.full_sequence[vdj_start:vdj_stop] @property def v_cdr3_start(self): """Start of CDR3 within V""" if 'V' not in self.hits or self.alignments['CDR3'] is None: return 0 v_start = self.hits['V'].query_start cdr3_start = self.alignments['CDR3'].start return cdr3_start - v_start def _utr_leader(self): """ Split the sequence before the V gene match into UTR and leader by searching for the start codon. """ if 'V' not in self.hits: return None, None before_v = self.full_sequence[:self.hits['V'].query_start] # Search for the start codon for offset in (0, 1, 2): for i in range(66, 42, -3): if before_v[-i + offset : -i + 3 + offset] == 'ATG': return before_v[:-i + offset], before_v[-i + offset:] return None, None # TODO this is unused def _fixed_cdr3_alignment_by_regex(self): """ Return a repaired AlignmentSummary object for the CDR3 region which does not use IgBLAST’s coordinates. IgBLAST does not determine the end of the CDR3 correctly, at least when a custom database is used, Return (start, end) of CDR3 relative to query. The CDR3 is detected using a regular expression. Return None if no CDR3 detected. 
""" if 'V' not in self.hits or 'J' not in self.hits: return None # Search in a window around the V(D)J junction for the CDR3 if 'CDR3' in self.alignments: window_start = self.alignments['CDR3'].start - CDR3_SEARCH_START else: window_start = max(0, self.hits['V'].query_end - CDR3_SEARCH_START) window_end = self.hits['J'].query_end window = self.full_sequence[window_start:window_end] match = find_cdr3(window, self.chain) if not match: return None start = match[0] + window_start end = match[1] + window_start assert start < end return AlignmentSummary(start=start, stop=end, length=None, matches=None, mismatches=None, gaps=None, percent_identity=None) def _find_cdr3(self): """ Return a repaired AlignmentSummary object that describes the CDR3 region. Return None if no CDR3 detected. """ if 'V' not in self.hits or 'J' not in self.hits: return None if self.chain not in self.CHAINS: return None # CDR3 start cdr3_ref_start = self._database.v_cdr3_start(self.hits['V'].subject_id, self.CHAINS[self.chain]) if cdr3_ref_start is None: return None cdr3_query_start = self.hits['V'].query_position(reference_position=cdr3_ref_start) if cdr3_query_start is None: # Alignment is not long enough to cover CDR3 start position; try to rescue it # by assuming that the alignment would continue without indels. hit = self.hits['V'] cdr3_query_start = hit.query_end + (cdr3_ref_start - hit.subject_end) # CDR3 end cdr3_ref_end = self._database.j_cdr3_end(self.hits['J'].subject_id, self.CHAINS[self.chain]) if cdr3_ref_end is None: return None cdr3_query_end = self.hits['J'].query_position(reference_position=cdr3_ref_end) if cdr3_query_end is None: return None return AlignmentSummary(start=cdr3_query_start, stop=cdr3_query_end, length=None, matches=None, mismatches=None, gaps=None, percent_identity=None) def _fixed_v_hit(self): """ Extend the V hit to the left if it does not start at the first nucleotide of the V gene. """ hit = self.hits['V'] d = hit._asdict() while d['subject_start'] > 0 and d['query_start'] > 0: d['query_start'] -= 1 d['subject_start'] -= 1 preceding_query_base = self.full_sequence[d['query_start']] d['query_alignment'] = preceding_query_base + d['query_alignment'] if self._database.v: reference = self._database.v[hit.subject_id] preceding_base = reference[d['subject_start']] else: preceding_base = 'N' d['subject_alignment'] = preceding_base + d['subject_alignment'] return Hit(**d) def asdict(self): """ Return a flattened representation of this record as a dictionary. The dictionary can then be used with e.g. a csv.DictWriter or pandas.DataFrame.from_items. """ nt_regions = dict() aa_regions = dict() for region in ('FR1', 'CDR1', 'FR2', 'CDR2', 'FR3', 'CDR3'): nt_seq = self.region_sequence(region) nt_regions[region] = nt_seq aa_regions[region] = nt_to_aa(nt_seq) if nt_seq else None vdj_nt = self.vdj_sequence vdj_aa = nt_to_aa(vdj_nt) if vdj_nt else None def nt_mutation_rate(region): """Nucleotide-level mutation rate in percent""" if region in self.alignments: rar = self.alignments[region] if rar is None or rar.percent_identity is None: return None return 100. 
- rar.percent_identity else: return None def j_aa_mutation_rate(): if 'J' not in self.hits: return None j_subject_id = self.hits['J'].subject_id if self.chain not in self.CHAINS: return None cdr3_ref_end = self._database.j_cdr3_end(j_subject_id, self.CHAINS[self.chain]) if cdr3_ref_end is None: return None cdr3_query_end = self.hits['J'].query_position(reference_position=cdr3_ref_end) if cdr3_query_end is None: return None query = self.full_sequence[cdr3_query_end:self.hits['J'].query_end] try: query_aa = nt_to_aa(query) except ValueError: return None ref = self._database.j[j_subject_id][cdr3_ref_end:self.hits['J'].subject_end] try: ref_aa = nt_to_aa(ref) except ValueError: return None if not ref_aa: return None return 100. * edit_distance(ref_aa, query_aa) / len(ref_aa) def aa_mutation_rates(): """Amino-acid level mutation rates for all regions and V in percent""" rates = dict() rates['J'] = j_aa_mutation_rate() v_aa_mutations = 0 v_aa_length = 0 for region in ('FR1', 'CDR1', 'FR2', 'CDR2', 'FR3'): if not aa_regions[region]: break try: reference_sequence = self._database.v_regions_aa[self.v_gene][region] except KeyError: break aa_sequence = aa_regions[region] # We previously computed edit distance, but some FR1 alignments are reported # with a frameshift by IgBLAST. By requiring that reference and query FR1 # lengths are identical, we can filter out these cases (and use Hamming distance # to get a bit of a speedup) if len(nt_regions[region]) != len(self._database.v_regions_nt[self.v_gene][region]): break mutations = hamming_distance(reference_sequence, aa_sequence) length = len(reference_sequence) # If the mutation rate is still obviously too high, assume something went # wrong and skip mutation rate assignment if region == 'FR1' and mutations / length >= 0.8: break rates[region] = 100. * mutations / length v_aa_mutations += mutations v_aa_length += length else: rates['V'] = 100. * v_aa_mutations / v_aa_length return rates return dict(FR1=None, CDR1=None, FR2=None, CDR2=None, FR3=None, V=None, J=None) aa_rates = aa_mutation_rates() if 'V' in self.hits: v_nt = self.hits['V'].query_sequence v_aa = nt_to_aa(v_nt) v_shm = 100. - self.hits['V'].percent_identity v_errors = self.hits['V'].errors v_covered = 100. * self.hits['V'].covered() v_evalue = self.hits['V'].evalue else: v_nt = None v_aa = None v_shm = None v_errors = None v_covered = None v_evalue = None if 'D' in self.hits: d_errors = self.hits['D'].errors d_covered = 100. * self.hits['D'].covered() d_evalue = self.hits['D'].evalue else: d_errors = None d_covered = None d_evalue = None if 'J' in self.hits: j_nt = self.hits['J'].query_sequence j_shm = 100. - self.hits['J'].percent_identity j_errors = self.hits['J'].errors j_covered = 100. 
* self.hits['J'].covered() j_evalue = self.hits['J'].evalue else: j_nt = None j_shm = None j_errors = None j_covered = None j_evalue = None v_end = getattr(self.junction, 'v_end', None) vd_junction = getattr(self.junction, 'vd_junction', None) d_region = getattr(self.junction, 'd_region', None) dj_junction = getattr(self.junction, 'dj_junction', None) return dict( count=self.size, V_gene=self.v_gene, D_gene=self.d_gene, J_gene=self.j_gene, chain=self.chain, stop=self.has_stop, V_covered=v_covered, D_covered=d_covered, J_covered=j_covered, V_evalue=v_evalue, D_evalue=d_evalue, J_evalue=j_evalue, FR1_SHM=nt_mutation_rate('FR1'), CDR1_SHM=nt_mutation_rate('CDR1'), FR2_SHM=nt_mutation_rate('FR2'), CDR2_SHM=nt_mutation_rate('CDR2'), FR3_SHM=nt_mutation_rate('FR3'), V_SHM=v_shm, J_SHM=j_shm, V_aa_mut=aa_rates['V'], J_aa_mut=aa_rates['J'], FR1_aa_mut=aa_rates['FR1'], CDR1_aa_mut=aa_rates['CDR1'], FR2_aa_mut=aa_rates['FR2'], CDR2_aa_mut=aa_rates['CDR2'], FR3_aa_mut=aa_rates['FR3'], V_errors=v_errors, D_errors=d_errors, J_errors=j_errors, UTR=self.utr, leader=self.leader, CDR1_nt=nt_regions['CDR1'], CDR1_aa=aa_regions['CDR1'], CDR2_nt=nt_regions['CDR2'], CDR2_aa=aa_regions['CDR2'], CDR3_nt=nt_regions['CDR3'], CDR3_aa=aa_regions['CDR3'], V_nt=v_nt, V_aa=v_aa, V_end=v_end, V_CDR3_start=self.v_cdr3_start, VD_junction=vd_junction, D_region=d_region, DJ_junction=dj_junction, J_nt=j_nt, VDJ_nt=vdj_nt, VDJ_aa=vdj_aa, name=self.query_name, barcode=self.barcode, genomic_sequence=self.genomic_sequence, ) class ParseError(Exception): pass class IgBlastParser: """ Parser for IgBLAST results. Works only when IgBLAST was run with the option -outfmt "7 sseqid qstart qseq sstart sseq pident slen". """ BOOL = {'Yes': True, 'No': False, 'N/A': None} FRAME = {'In-frame': True, 'Out-of-frame': False, 'N/A': None} SECTIONS = frozenset([ '# Query:', '# V-(D)-J rearrangement summary', '# V-(D)-J junction details', '# Alignment summary', '# Hit table', 'Total queries = ', ]) def __init__(self, sequences, igblast_lines, database=None): """ If a database is given, iterating over this object will yield ExtendedIgBlastRecord objects, otherwise 'normal' IgBlastRecord objects """ self._sequences = sequences self._igblast_lines = igblast_lines self._database = database if self._database is None: self._create_record = IgBlastRecord else: self._create_record = functools.partial(ExtendedIgBlastRecord, database=self._database) def __iter__(self): """ Yield (Extended-)IgBlastRecord objects """ zipped = zip(self._sequences, split_by_section(self._igblast_lines, ['# IGBLASTN'])) for fasta_record, (record_header, record_lines) in zipped: # 'IGBLASTN 2.5.1+': IgBLAST 1.6.1 assert record_header in { '# IGBLASTN 2.2.29+', # IgBLAST 1.4.0 '# IGBLASTN 2.3.1+', # IgBLAST 1.5.0 '# IGBLASTN 2.6.1+', # IgBLAST 1.7.0 '# IGBLASTN', # IgBLAST 1.10 } yield self._parse_record(record_lines, fasta_record) def _parse_record(self, record_lines, fasta_record): """ Parse a single IgBLAST record """ hits = dict() # All of the sections are optional, so we need to set default values here. 
query_name = None junction = None v_gene, d_gene, j_gene, chain, has_stop, in_frame, is_productive, strand = [None] * 8 alignments = dict() for section, lines in split_by_section(record_lines, self.SECTIONS): if section.startswith('# Query: '): query_name = section.split(': ')[1] elif section.startswith('# V-(D)-J rearrangement summary'): fields = lines[0].split('\t') if len(fields) == 7: # No D assignment v_gene, j_gene, chain, has_stop, in_frame, is_productive, strand = fields d_gene = None else: v_gene, d_gene, j_gene, chain, has_stop, in_frame, is_productive, strand = fields v_gene = none_if_na(v_gene) d_gene = none_if_na(d_gene) j_gene = none_if_na(j_gene) chain = none_if_na(chain) has_stop = self.BOOL[has_stop] in_frame = self.FRAME[in_frame] is_productive = self.BOOL[is_productive] strand = strand if strand in '+-' else None elif section.startswith('# V-(D)-J junction details'): fields = lines[0].split('\t') if len(fields) == 5: junction = JunctionVDJ( v_end=fields[0], vd_junction=fields[1], d_region=fields[2], dj_junction=fields[3], j_start=fields[4] ) else: junction = JunctionVJ( v_end=fields[0], vj_junction=fields[1], j_start=fields[2]) elif section.startswith('# Alignment summary'): for line in lines: fields = line.split('\t') if len(fields) == 8 and fields[0] != 'Total': summary = self._parse_alignment_summary(fields[1:]) region_name, _, imgt = fields[0].partition('-') assert imgt in ('IMGT', 'IMGT (germline)') alignments[region_name] = summary elif section.startswith('# Hit table'): for line in lines: if not line or line.startswith('#'): continue hit, gene = self._parse_hit(line) assert gene in ('V', 'D', 'J') assert gene not in hits, "Two hits for same gene found" hits[gene] = hit elif section.startswith('Total queries = '): continue assert fasta_record.name == query_name full_sequence = fasta_record.sequence.upper() if strand == '-': full_sequence = reverse_complement(full_sequence) if __debug__: for gene in ('V', 'D', 'J'): if gene not in hits: continue hit = hits[gene] qsequence = hit.query_sequence # IgBLAST removes the trailing semicolon (why, oh why??) qname = query_name[:-1] if query_name.endswith(';') else query_name assert chain in (None, 'VL', 'VH', 'VK', 'NON', 'VA', 'VB', 'VG', 'VD'), chain assert qsequence == full_sequence[hit.query_start:hit.query_start+len(qsequence)] return self._create_record( query_name=query_name, alignments=alignments, v_gene=v_gene, d_gene=d_gene, j_gene=j_gene, chain=chain, has_stop=has_stop, in_frame=in_frame, is_productive=is_productive, strand=strand, hits=hits, full_sequence=full_sequence, junction=junction) def _parse_alignment_summary(self, fields): start, stop, length, matches, mismatches, gaps = (int(v) for v in fields[:6]) percent_identity = float(fields[6]) return AlignmentSummary( start=start - 1, stop=stop, length=length, matches=matches, mismatches=mismatches, gaps=gaps, percent_identity=percent_identity ) def _parse_hit(self, line): """ Parse a line of the "Hit table" section and return a tuple (hit, gene) where hit is a Hit object. 
""" (gene, subject_id, query_start, query_alignment, subject_start, subject_alignment, percent_identity, subject_length, evalue) = line.split('\t') query_start = int(query_start) - 1 subject_start = int(subject_start) - 1 subject_length = int(subject_length) # Length of original subject sequence # Percent identity is calculated by IgBLAST as # 100 - errors / alignment_length and then rounded to two decimal digits percent_identity = float(percent_identity) evalue = float(evalue) hit = Hit(subject_id, query_start, query_alignment, subject_start, subject_alignment, subject_length, percent_identity, evalue) return hit, gene class TableWriter: def __init__(self, file): self._file = file self._writer = csv.DictWriter(file, fieldnames=ExtendedIgBlastRecord.columns, delimiter='\t') self._writer.writeheader() @staticmethod def yesno(v): """ Return "yes", "no" or None for boolean value v, which may also be None. """ if v is None: return None return ["no", "yes"][v] def write(self, d): """ Write the IgBLAST record (must be given as dictionary) to the output file. """ d = d.copy() d['stop'] = self.yesno(d['stop']) for name in ('V_covered', 'D_covered', 'J_covered', 'FR1_SHM', 'CDR1_SHM', 'FR2_SHM', 'CDR2_SHM', 'FR3_SHM', 'V_SHM', 'J_SHM', 'V_aa_mut', 'J_aa_mut', 'FR1_aa_mut', 'CDR1_aa_mut', 'FR2_aa_mut', 'CDR2_aa_mut', 'FR3_aa_mut'): if d[name] is not None: d[name] = '{:.1f}'.format(d[name]) for name in ('V_evalue', 'D_evalue', 'J_evalue'): if d[name] is not None: d[name] = '{:G}'.format(d[name]) self._writer.writerow(d) IgDiscover-0.11/igdiscover/plotalleles.py000066400000000000000000000107331337725263500205420ustar00rootroot00000000000000""" Plot allele usage """ import sys import logging import pandas as pd from sqt import SequenceReader from .table import read_table logger = logging.getLogger(__name__) def add_arguments(parser): arg = parser.add_argument arg('--d-evalue', type=float, default=1E-4, help='Maximal allowed E-value for D gene match. Default: %(default)s') arg('--d-coverage', '--D-coverage', type=float, default=65, help='Minimum D coverage (in percent). Default: %(default)s%%)') arg('--database', metavar='FASTA', help='Restrict plotting to the sequences named in the FASTA file. ' 'Only the sequence names are used!') arg('--order', metavar='FASTA', help='Sort genes according to the order of the records in the FASTA file.') arg('--x', choices=('V', 'D', 'J'), default='V', help='Type of gene on x axis. Default: %(default)s') arg('--gene', choices=('V', 'D', 'J'), default='J', help='Type of gene on y axis. 
IgDiscover-0.11/igdiscover/plotalleles.py000066400000000000000000000107331337725263500205420ustar00rootroot00000000000000
"""
Plot allele usage
"""
import sys
import logging
import pandas as pd
from sqt import SequenceReader

from .table import read_table

logger = logging.getLogger(__name__)


def add_arguments(parser):
    arg = parser.add_argument
    arg('--d-evalue', type=float, default=1E-4,
        help='Maximal allowed E-value for D gene match. Default: %(default)s')
    arg('--d-coverage', '--D-coverage', type=float, default=65,
        help='Minimum D coverage (in percent). Default: %(default)s%%')
    arg('--database', metavar='FASTA',
        help='Restrict plotting to the sequences named in the FASTA file. '
        'Only the sequence names are used!')
    arg('--order', metavar='FASTA',
        help='Sort genes according to the order of the records in the FASTA file.')
    arg('--x', choices=('V', 'D', 'J'), default='V',
        help='Type of gene on x axis. Default: %(default)s')
    arg('--gene', choices=('V', 'D', 'J'), default='J',
        help='Type of gene on y axis. Default: %(default)s')
    arg('alleles', help='List of alleles to plot on y axis, separated by comma')
    arg('table', help='Table with parsed and filtered IgBLAST results')
    arg('plot', help='Path to output PDF or PNG')


def main(args):
    usecols = ['V_gene', 'D_gene', 'J_gene', 'V_errors', 'D_errors', 'J_errors',
        'D_covered', 'D_evalue']
    # Support reading a table without D_errors
    try:
        table = read_table(args.table, usecols=usecols)
    except ValueError:
        usecols.remove('D_errors')
        table = read_table(args.table, usecols=usecols)
    logger.info('Table with %s rows read', len(table))

    if args.x == 'V' or args.gene == 'V':
        table = table[table.V_errors == 0]
        logger.info('%s rows remain after requiring V errors = 0', len(table))
    if args.gene == 'J' or args.x == 'J':
        table = table[table.J_errors == 0]
        logger.info('%s rows remain after requiring J errors = 0', len(table))
    if args.gene == 'D' or args.x == 'D':
        table = table[table.D_evalue <= args.d_evalue]
        logger.info('%s rows remain after requiring D E-value <= %s', len(table), args.d_evalue)
        table = table[table.D_covered >= args.d_coverage]
        logger.info('%s rows remain after requiring D coverage >= %s', len(table), args.d_coverage)
        if 'D_errors' in table.columns:
            table = table[table.D_errors == 0]
            logger.info('%s rows remain after requiring D errors = 0', len(table))

    gene1 = args.x + '_gene'
    gene2 = args.gene + '_gene'
    expression_counts = table.groupby((gene1, gene2)).size().to_frame().reset_index()
    matrix = pd.DataFrame(
        expression_counts.pivot(index=gene1, columns=gene2, values=0).fillna(0), dtype=int)
    # matrix[v_gene, d_gene] gives co-occurrences of v_gene and d_gene
    print('#\n# Expressed genes with counts\n#')
    # The .sum() is along axis=0, that is, the V gene counts are summed up,
    # resulting in counts for each D/J gene
    for g, count in matrix.sum().iteritems():
        print(g, '{:8}'.format(count))

    alleles = args.alleles.split(',')
    for allele in alleles:
        if allele not in matrix.columns:
            logger.error('Allele %s not expressed in this dataset', allele)
            sys.exit(1)
    matrix = matrix.loc[:, alleles]

    if args.database:
        with SequenceReader(args.database) as f:
            x_names = [record.name for record in f if record.name in matrix.index]
        if not x_names:
            logger.error('None of the sequence names in %r were found in the input table',
                args.database)
            sys.exit(1)
        matrix = matrix.loc[x_names, :]

    if args.order:
        with SequenceReader(args.order) as f:
            ordered_names = [r.name.partition('*')[0] for r in f]
        gene_order = {name: index for index, name in enumerate(ordered_names)}

        def orderfunc(full_name):
            name, _, allele = full_name.partition('*')
            allele = int(allele)
            try:
                index = gene_order[name]
            except KeyError:
                logger.warning('Gene name %s not found in %r, placing it at the end',
                    name, args.order)
                index = 1000000
            return index * 1000 + allele

        matrix['V_gene_tmp'] = pd.Series(matrix.index, index=matrix.index).apply(orderfunc)
        matrix.sort_values('V_gene_tmp', inplace=True)
        del matrix['V_gene_tmp']

    print('#\n# Allele-specific expression\n#')
    print(matrix)

    if len(alleles) == 2:
        matrix.loc[:, alleles[1]] *= -1
    # remove all-zero rows
    # allele_expressions = allele_expressions[(allele_expressions > 0.001).any(axis=1)]
    ax = matrix.plot(kind='bar', stacked=True, figsize=(12, 6))
    ax.legend(title=None)
    ax.set_title('Allele-specific expression counts')
    ax.set_xlabel(args.x + ' gene')
    ax.set_ylabel('Count')
    ax.figure.set_tight_layout(True)
    # ax.legend(bbox_to_anchor=(1.15, 0.5))
    ax.figure.savefig(args.plot)
    logger.info('Plotted %r', args.plot)
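# --- Illustrative only (not in the original file): like the misc/ scripts
# further below, this module can be driven directly through argparse; the
# file names and allele names here are made up.
#
#     if __name__ == '__main__':
#         from argparse import ArgumentParser
#         parser = ArgumentParser()
#         add_arguments(parser)
#         main(parser.parse_args(['IGHJ6*02,IGHJ6*03', 'filtered.tab.gz', 'alleles.pdf']))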
IgDiscover-0.11/igdiscover/rename.py000066400000000000000000000104741337725263500174730ustar00rootroot00000000000000
"""
Rename and reorder records in a FASTA file

Sequences can be renamed according to the sequences in a template file.
The sequences in the target file will get the name that they have in the
template file. Sequences are considered to be equivalent if one is a prefix
of the other.

Sequences can also be ordered by name, either alphabetically or by the order
given in a template file. For comparison, only the 'gene part' of the name is
used (before the '*'). For example, for 'VH4-4*01', the name 'VH4-4' is looked
up in the template. Alphabetic order is the default. Use --no-sort to disable
sorting entirely.
"""
import sys
import logging
from sqt import FastaReader

from .utils import natural_sort_key

logger = logging.getLogger(__name__)


def add_arguments(parser):
    arg = parser.add_argument
    arg('--no-sort', dest='sort', action='store_false', default=True,
        help='Do not sort sequences by name')
    arg('--not-found', metavar='TEXT', default=' (not found)',
        help='Append this text to the record name when the sequence was not found '
        'in the template. Default: %(default)r')
    arg('--rename-from', metavar='TEMPLATE',
        help='FASTA template file with correctly named sequences. If a sequence in '
        'the target file is identical to one in the template, it is assigned the '
        'name of the sequence in the template.')
    arg('--order-by', metavar='TEMPLATE',
        help='FASTA template that contains genes in the desired order. '
        'If a name contains a "*" (asterisk), only the preceding part '
        'is used. Thus, VH4-4*01 and VH4-4*02 are equivalent.')
    arg('target', help='FASTA file with to-be renamed sequences')


class PrefixDict:
    """
    A dict that maps strings to values, but where a prefix of a key is enough
    to retrieve the value.
    """
    def __init__(self, items):
        self._items = []
        for k, v in items:
            self.add(k, v)

    def add(self, k, v):
        try:
            self[k]
            raise ValueError('Key {!r} already exists'.format(k))
        except KeyError:
            self._items.append((k, v))

    def __getitem__(self, key):
        found = None
        for seq, value in self._items:
            if seq.startswith(key) or key.startswith(seq):
                if found is not None:
                    # TODO don't use KeyError here
                    raise KeyError('Key {!r} is ambiguous'.format(key))
                found = value
        if found is None:
            raise KeyError(key)
        return found

    def get(self, key, default=None):
        try:
            v = self[key]
        except KeyError:
            return default
        else:
            return v

    def __len__(self):
        return len(self._items)


class GeneMissing(Exception):
    pass


def gene_name(record):
    record_name, _, record_comment = record.name.partition(' ')
    return record_name.partition('*')[0]


def sorted_by_gene(records, gene_order):
    d = {name: i for i, name in enumerate(gene_order)}

    def keyfunc(record):
        gene = gene_name(record)
        try:
            return d[gene]
        except KeyError:
            raise GeneMissing(gene)

    return sorted(records, key=keyfunc)


def main(args):
    if args.rename_from:
        with FastaReader(args.rename_from) as fr:
            template = PrefixDict([])
            for record in fr:
                try:
                    template.add(record.sequence.upper(), record.name)
                except ValueError:
                    logger.error('Sequences in entry %r and %r are duplicate',
                        record.name, template[record.sequence.upper()])
        logger.info('Read %d entries from template', len(template))
    else:
        template = None

    if args.order_by:
        with FastaReader(args.order_by) as fr:
            gene_order = [gene_name(r) for r in fr]
    else:
        gene_order = None

    with FastaReader(args.target) as fr:
        sequences = list(fr)

    # Rename
    renamed = 0
    if template is not None:
        for record in sequences:
            name = template.get(record.sequence.upper())
            if name is None:
                name = record.name + args.not_found
            else:
                renamed += 1
            # Replace record's name, leaving comment intact
            record_name, _, record_comment = record.name.partition(' ')
            if record_comment:
                record.name = name + ' ' + record_comment
            else:
                record.name = name

    # Reorder
    if gene_order:
        try:
            sequences = sorted_by_gene(sequences, gene_order)
        except GeneMissing as e:
            logger.error('Gene "%s" not found in the --order-by template file', e)
            sys.exit(1)
    elif args.sort:
        sequences = sorted(sequences, key=lambda s: natural_sort_key(s.name))

    for record in sequences:
        print('>{}\n{}'.format(record.name, record.sequence))
    logger.info('Wrote %s FASTA records (%d sequences found in template)',
        len(sequences), renamed)
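# --- Illustrative only (not in the original file): a minimal demonstration of
# the prefix-matching lookup that PrefixDict provides; runnable via
# `python -m igdiscover.rename` (the sequences here are made up).

if __name__ == '__main__':
    _d = PrefixDict([('ACGTACGT', 'gene1')])
    assert _d['ACGT'] == 'gene1'        # the query may be a prefix of a stored key ...
    assert _d['ACGTACGTAA'] == 'gene1'  # ... or extend beyond it
    assert _d.get('TTTT') is None       # unrelated sequences are not found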
""" def __init__(self, items): self._items = [] for k, v in items: self.add(k, v) def add(self, k, v): try: self[k] raise ValueError('Key {!r} already exists'.format(k)) except KeyError: self._items.append((k, v)) def __getitem__(self, key): found = None for seq, value in self._items: if seq.startswith(key) or key.startswith(seq): if found is not None: # TODO don't use keyerror here raise KeyError('Key {!r} is ambiguous'.format(key)) found = value if found is None: raise KeyError(key) return found def get(self, key, default=None): try: v = self[key] except KeyError: return default else: return v def __len__(self): return len(self._items) class GeneMissing(Exception): pass def gene_name(record): record_name, _, record_comment = record.name.partition(' ') return record_name.partition('*')[0] def sorted_by_gene(records, gene_order): d = {name: i for i, name in enumerate(gene_order)} def keyfunc(record): gene = gene_name(record) try: return d[gene] except KeyError: raise GeneMissing(gene) return sorted(records, key=keyfunc) def main(args): if args.rename_from: with FastaReader(args.rename_from) as fr: template = PrefixDict([]) for record in fr: try: template.add(record.sequence.upper(), record.name) except ValueError: logger.error('Sequences in entry %r and %r are duplicate', record.name, template[record.sequence.upper()]) logger.info('Read %d entries from template', len(template)) else: template = None if args.order_by: with FastaReader(args.order_by) as fr: gene_order = [gene_name(r) for r in fr] else: gene_order = None with FastaReader(args.target) as fr: sequences = list(fr) # Rename renamed = 0 if template is not None: for record in sequences: name = template.get(record.sequence.upper()) if name is None: name = record.name + args.not_found else: renamed += 1 # Replace record’s name, leaving comment intact record_name, _, record_comment = record.name.partition(' ') if record_comment: record.name = name + ' ' + record_comment else: record.name = name # Reorder if gene_order: try: sequences = sorted_by_gene(sequences, gene_order) except GeneMissing as e: logger.error('Gene "%s" not found in the --order-by template file', e) sys.exit(1) elif args.sort: sequences = sorted(sequences, key=lambda s: natural_sort_key(s.name)) for record in sequences: print('>{}\n{}'.format(record.name, record.sequence)) logger.info('Wrote %s FASTA records (%d sequences found in template)', len(sequences), renamed) IgDiscover-0.11/igdiscover/run.py000066400000000000000000000042161337725263500170250ustar00rootroot00000000000000""" Run IgDiscover Calls Snakemake to produce all the output files. """ import sys import logging import resource import platform import pkg_resources from snakemake import snakemake from .utils import available_cpu_count from . import __version__ from .config import Config logger = logging.getLogger(__name__) def add_arguments(parser): arg = parser.add_argument arg('--dryrun', '-n', default=False, action='store_true', help='Do not execute anything') arg('--cores', '--jobs', '-j', metavar='N', type=int, default=available_cpu_count(), help='Run on at most N CPU cores in parallel. ' 'Default: Use as many cores as available (%(default)s)') arg('--keepgoing', '-k', default=False, action='store_true', help='If one job fails, finish the others.') arg('targets', nargs='*', default=[], help='File(s) to create. If omitted, the full pipeline is run.') def main(args): try: config = Config.from_default_path() except FileNotFoundError as e: sys.exit("Pipeline configuration file {!r} not found. 
Please create it!".format(e.filename)) print('IgDiscover version {} with Python {}. Configuration:'.format(__version__, platform.python_version())) for k, v in sorted(vars(config).items()): # TODO the following line is only necessary for non-YAML configurations if k.startswith('_'): continue print(' ', k, ': ', repr(v), sep='') # snakemake sets up its own logging and this cannot be easily changed # (setting keep_logger=True crashes), so remove our own log handler # for now logger.root.handlers = [] snakefile_path = pkg_resources.resource_filename('igdiscover', 'Snakefile') success = snakemake(snakefile_path, snakemakepath='snakemake', # Needed in snakemake 3.9.0 dryrun=args.dryrun, cores=args.cores, keepgoing=args.keepgoing, printshellcmds=True, targets=args.targets if args.targets else None, ) if sys.platform == 'linux' and not args.dryrun: cputime = resource.getrusage(resource.RUSAGE_SELF).ru_utime cputime += resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime h = int(cputime // 3600) m = (cputime - h * 3600) / 60 print('Total CPU time: {}h {:.2f}m'.format(h, m)) sys.exit(0 if success else 1) IgDiscover-0.11/igdiscover/species.py000066400000000000000000000105321337725263500176520ustar00rootroot00000000000000""" Species-specific code, such as lists of motifs and regular expressions. Some refactoring is needed to make this module actually usable for many species. Right now, it works for - at least - human, rhesus monkey and mouse. """ import re from .utils import nt_to_aa # Regular expressions for CDR3 detection # # The idea comes from D’Angelo et al.: The antibody mining toolbox. # http://dx.doi.org/10.4161/mabs.27105 # The heavy-chain regex was taken directly from there, but the difference # is that we express everything in terms of amino acids, not nucleotides. # This simplifies the expressions and makes them more readable. # _CDR3_REGEX = { # Heavy chain 'VH': re.compile(""" [FY] [FHVWY] C (?P [ADEGIKMNRSTV] .{3,31} ) W[GAV] """, re.VERBOSE), # Light chain, kappa 'VK': re.compile(""" [FSVY] [CFHNVY] [CDFGLSW] (?P .{4,15} ) [FLV][GRV] """, re.VERBOSE), # Light chain, lambda 'VL': re.compile(""" # the negative lookahead assertion ensures that the rightmost start is found [CDY](?![CDY][CFHSY][CFGW])[CFHSY][CFGW] (?P .{4,15} ) [FS]G """, re.VERBOSE) } _CDR3_VH_ALTERNATIVE_REGEX = re.compile(""" C (?P . [RK] .{3,30}) [WF]G.G """, re.VERBOSE) def find_cdr3(sequence, chain): """ Find the CDR3 in the given sequence, assuming it comes from the given chain ('VH', 'VK', 'VL'). If the chain is not one of 'VH', 'VK', 'VL', return None. Return a tuple (start, stop) if found, None otherwise. """ try: regex = _CDR3_REGEX[chain] except KeyError: return None matches = [] for offset in 0, 1, 2: aa = nt_to_aa(sequence[offset:]) match = regex.search(aa) if not match and chain == 'VH': match = _CDR3_VH_ALTERNATIVE_REGEX.search(aa) if match: start, stop = match.span('cdr3') matches.append((start * 3 + offset, stop * 3 + offset)) return min(matches, default=None) # The following code is used for detecting CDR3 start sites within V # reference sequences and CDR3 end sites within J reference sequences. # Matches the start of the CDR3 within the end of a VH sequence _CDR3_START_VH_REGEX = re.compile(""" [FY] [FHVWY] C (?P [ADEGIKMNRSTV*] | $ ) """, re.VERBOSE) _CDR3_START_VH_ALTERNATIVE_REGEX = re.compile(""" C (?P . 
[RK]) """, re.VERBOSE) _CDR3_START_REGEXES = { 'kappa': re.compile('[FSVY][CFHNVY][CDFGLSW]'), 'lambda': re.compile('[CDY](?![CDY][CFHSY][CFGW])[CFHSY][CFGW]'), 'gamma': re.compile('[YFH]C'), # TODO test whether this also works for alpha and beta 'delta': re.compile('[YFH]C'), } def _cdr3_start_heavy(aa): head, tail = aa[:-15], aa[-15:] match = _CDR3_START_VH_REGEX.search(tail) if not match: match = _CDR3_START_VH_ALTERNATIVE_REGEX.search(tail) if not match: return None return len(head) + match.start('cdr3_start') def cdr3_start(nt, chain): """ Find CDR3 start location within a V gene (Ig or TCR) nt -- nucleotide sequence of the gene chain -- one of the following strings: - 'heavy', 'lambda', 'kappa' for Ig genes - 'alpha', 'beta', 'gamma', 'delta' for TCR genes """ aa = nt_to_aa(nt) if chain == 'heavy': start = _cdr3_start_heavy(aa) if start is None: return None return 3 * start if chain in ('kappa', 'lambda', 'gamma', 'delta'): head, tail = aa[:-15], aa[-15:] match = _CDR3_START_REGEXES[chain].search(tail) if match: return 3 * (len(head) + match.end()) else: return None elif chain in ('alpha', 'beta'): head, tail = aa[:-8], aa[-8:] pos = tail.find('C') if pos == -1: return None else: return 3 * (len(head) + pos + 1) # Matches after the end of the CDR3 within a J sequence _CDR3_END_REGEXES = { 'heavy': re.compile('W[GAV]'), 'kappa': re.compile('FG'), 'lambda': re.compile('FG'), 'alpha': re.compile('FG'), 'beta': re.compile('FG'), 'gamma': re.compile('FG'), 'delta': re.compile('FG'), } def cdr3_end(nt, chain): """ Find the position of the CDR3 end within a J sequence nt -- nucleotide sequence of the J gene chain -- one of the following strings: - 'heavy', 'lambda', 'kappa' for Ig genes - 'alpha', 'beta', 'gamma', 'delta' for TCR genes """ regex = _CDR3_END_REGEXES[chain] for frame in 0, 1, 2: aa = nt_to_aa(nt[frame:]) match = regex.search(aa) if match: return match.start() * 3 + frame return None # When searching for the CDR3, start this many bases to the left of the end of # the V match. CDR3_SEARCH_START = 30 IgDiscover-0.11/igdiscover/table.py000066400000000000000000000034301337725263500173050ustar00rootroot00000000000000""" Function for reading the table created by the 'parse' subcommand. """ import logging import pandas as pd logger = logging.getLogger(__name__) # These columns contain string data # convert them to str to avoid a PerformanceWarning # TODO some of these are actually categorical or bool _STRING_COLUMNS = [ 'V_gene', # categorical 'D_gene', # categorical 'J_gene', # categorical 'chain', # categorical 'stop', # bool 'productive', # bool 'UTR', 'leader', 'CDR1_nt', 'CDR1_aa', 'CDR2_nt', 'CDR2_aa', 'CDR3_nt', 'CDR3_aa', 'V_nt', 'V_aa', 'V_end', 'VD_junction', 'D_region', 'DJ_junction', 'J_nt', 'VDJ_nt', 'VDJ_aa', 'name', 'barcode', 'race_G', 'genomic_sequence', ] _INTEGER_COLUMNS = ('V_errors', 'D_errors', 'J_errors', 'V_CDR3_start') def fix_columns(df): """ Changes DataFrame in-place """ # Convert all string columns to str to avoid a PerformanceWarning for col in _STRING_COLUMNS: if col not in df: continue df[col].fillna('', inplace=True) df[col] = df[col].astype('str') # Empty strings have been set to NaN by read_csv. Replacing # by the empty string avoids problems with groupby, which # ignores NaN values. # Columns that have any NaN values in them cannot be converted to # int due to a numpy limitation. 
IgDiscover-0.11/igdiscover/trie.py000066400000000000000000000114631337725263500171660ustar00rootroot00000000000000
"""
This trie can be used to store a set of strings and to retrieve sequences
similar to a query sequence.
"""


class Trie:
    """
    A tree-like datastructure for storing strings. It supports queries similar
    to a set() of strings. This particular implementation also supports
    searching for strings by Hamming distance.
    """
    # Children
    A = None
    C = None
    G = None
    T = None

    # The name attribute is set to the string that is spelled from the root of
    # the tree to this node - but only for leaf nodes.
    name = None

    def __init__(self, iterable=None):
        if iterable is not None:
            for it in iterable:
                self.add(it)

    def add(self, s: str):
        """Add a string to this trie. If it already exists, the trie remains unchanged."""
        self._insert(s, leaf_name=s)

    def _insert(self, s: str, leaf_name: str):
        """Recursive insert, called by add()"""
        if len(s) == 0:
            # This needs to become a leaf node
            assert self.name is None or self.name == leaf_name
            self.name = leaf_name
        else:
            subtrie = getattr(self, s[0])
            if subtrie is None:
                subtrie = Trie()
                setattr(self, s[0], subtrie)
            subtrie._insert(s[1:], leaf_name)

    def __repr__(self):
        parts = []
        for c in list('ACGT'):
            subtrie = getattr(self, c)
            if subtrie is not None:
                parts.append(c + ':' + repr(subtrie))
        if self.name is not None:
            parts.append('Leaf({})'.format(self.name))
        if parts:
            return '{' + ', '.join(parts) + '}'
        else:
            return '*'

    def find_node(self, s: str):
        """
        Search for s. If s is a prefix of any string in the trie, then return
        the node corresponding to that prefix. The node can be an internal or
        a leaf node. A leaf node is a node whose name attribute is not None.

        If the returned node is a leaf node, then s is in the trie. If it is
        an internal node (that is, the name attribute is None), then there
        exists a string in the trie that has s as a proper prefix.

        If no such string (with s as prefix) exists, then None is returned.
        """
        if len(s) == 0:
            return self
        subtrie = getattr(self, s[0])
        if subtrie is None:
            return None
        else:
            return subtrie.find_node(s[1:])

    def __contains__(self, s: str):
        """Return whether s is in the trie"""
        node = self.find_node(s)
        return node is not None and node.name == s

    def count_nodes(self, internal=True):
        """
        Return the number of nodes in this trie. Internal nodes are counted
        only if "internal" is True.
        """
        n = 1 if internal or self.name is not None else 0
        for c in list('ACGT'):
            subtrie = getattr(self, c)
            if subtrie is not None:
                n += subtrie.count_nodes(internal)
        return n

    def __len__(self):
        """Return the number of unique strings in this trie"""
        return self.count_nodes(internal=False)

    def has_similar(self, s, mismatches):
        """
        Return whether a string exists in this trie that has a Hamming
        distance of at most 'mismatches' to s.
        """
        if len(s) == 0:
            return self.name is not None
        # As a runtime heuristic, descend into the mismatches==0 subtrie
        # first if possible.
        subtrie = getattr(self, s[0])
        if subtrie is not None and subtrie.has_similar(s[1:], mismatches):
            return True
        if mismatches == 0:
            return False
        # The above did not work - try all three possible substitutions
        # and recursively check subtries
        for c in list('ACGT'):
            if c == s[0]:
                continue
            subtrie = getattr(self, c)
            if subtrie is None:
                continue
            if subtrie.has_similar(s[1:], mismatches - 1):
                return True
        return False

    def find_all_similar(self, s, mismatches):
        """
        Yield all strings in this trie that have a Hamming distance of at most
        'mismatches' to s.
        """
        # This routine is similar to has_similar, but since we are interested
        # in all similar strings, the optimization of going into the matching
        # subtrie first does not help.
        if len(s) == 0:
            if self.name is not None:
                yield self.name
            return
        # The code below is an optimized version of the following:
        #
        # for c in list('ACGT'):
        #     if c != s[0] and mismatches == 0:
        #         continue
        #     subtrie = getattr(self, c)
        #     if subtrie is not None:
        #         yield from subtrie.find_all_similar(s[1:], mismatches - int(c != s[0]))
        if mismatches > 0:
            s0 = s[0]
            s1 = s[1:]
            subtrie = self.A
            if subtrie is not None:
                yield from subtrie.find_all_similar(s1, mismatches - int(s0 != 'A'))
            subtrie = self.C
            if subtrie is not None:
                yield from subtrie.find_all_similar(s1, mismatches - int(s0 != 'C'))
            subtrie = self.G
            if subtrie is not None:
                yield from subtrie.find_all_similar(s1, mismatches - int(s0 != 'G'))
            subtrie = self.T
            if subtrie is not None:
                yield from subtrie.find_all_similar(s1, mismatches - int(s0 != 'T'))
        else:
            subtrie = self
            for c in s:
                subtrie = getattr(subtrie, c)
                if subtrie is None:
                    break
            else:
                if subtrie.name is not None:
                    yield subtrie.name
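# --- Illustrative only (not in the original file): basic exact and
# approximate queries; runnable via `python -m igdiscover.trie`.

if __name__ == '__main__':
    _trie = Trie(['ACGT', 'ACGA', 'TTTT'])
    assert 'ACGT' in _trie and len(_trie) == 3
    assert 'ACG' not in _trie            # prefixes are not members themselves
    assert _trie.has_similar('ACCT', 1)  # one substitution away from ACGT
    assert sorted(_trie.find_all_similar('ACGT', 1)) == ['ACGA', 'ACGT']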
IgDiscover-0.11/igdiscover/union.py000066400000000000000000000027711337725263500173550ustar00rootroot00000000000000
"""
Compute union of sequences in multiple FASTA files
"""
import logging
from collections import namedtuple
from sqt import FastaReader

from .utils import Merger

logger = logging.getLogger(__name__)


def add_arguments(parser):
    arg = parser.add_argument
    # arg('--max-differences', type=int, metavar='MAXDIFF', default=0,
    #     help='Merge sequences if they have at most MAXDIFF differences. '
    #     ' Default: %(default)s')
    arg('fasta', help='FASTA file', nargs='+')


SequenceInfo = namedtuple('SequenceInfo', 'sequence name')


class SequenceMerger(Merger):
    """
    Merge sequences where one is a prefix of the other into single entries.
    """
    def __init__(self):
        super().__init__()

    def merged(self, s, t):
        """
        Merge two sequences if one is the prefix of the other. If they should
        not be merged, None is returned.

        s and t must have attributes sequence and name.
        """
        s_seq = s.sequence
        t_seq = t.sequence
        # Make both sequences the same length - cheap trick to not penalize
        # end gaps
        s_seq += t_seq[len(s_seq):]
        t_seq += s_seq[len(t_seq):]
        if s_seq == t_seq:
            return s
        return None


def main(args):
    merger = SequenceMerger()
    n_read = 0
    for path in args.fasta:
        n = 0
        for record in FastaReader(path):
            merger.add(SequenceInfo(record.sequence.upper(), record.name))
            n += 1
        n_read += n
        logger.info('Read %s sequences from %s', n, path)
    logger.info('Read %s sequences from %s files', n_read, len(args.fasta))
    for info in merger:
        print('>{}\n{}'.format(info.name, info.sequence))
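# --- Illustrative only (not in the original file): prefix merging in action;
# runnable via `python -m igdiscover.union`.

if __name__ == '__main__':
    _m = SequenceMerger()
    _m.extend([SequenceInfo('ACGTACGT', 'a'), SequenceInfo('ACGT', 'b')])
    assert [info.name for info in _m] == ['a']  # 'b' is a prefix of 'a' and is absorbed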
IgDiscover-0.11/igdiscover/upstream.py000066400000000000000000000106541337725263500200640ustar00rootroot00000000000000
"""
Cluster upstream sequences (UTR and leader) for each gene

For each gene, look at the sequences assigned to it. Take the upstream
sequences and compute a consensus for them. Only those assigned sequences are
taken that have a very low error rate for the V gene match.

Output a FASTA file that contains one consensus sequence for each gene.
"""
import logging
from collections import Counter

from .table import read_table
from .utils import iterative_consensus

logger = logging.getLogger(__name__)

# When computing a UTR consensus, ignore sequences that deviate by more than
# this factor from the median length.
UTR_MEDIAN_DEVIATION = 0.1


def add_arguments(parser):
    arg = parser.add_argument
    arg('--max-V-errors', '--max-error-percentage', '-e', dest='max_v_errors',
        metavar='PERCENT', type=float, default=1,
        help='Allow PERCENT errors in V gene match. Default: %(default)s')
    arg('--max-FR1-errors', dest='max_fr1_errors', metavar='PERCENT', type=float,
        default=None, help='Allow PERCENT errors in FR1 region.')
    arg('--max-CDR1-errors', dest='max_cdr1_errors', metavar='PERCENT', type=float,
        default=None, help='Allow PERCENT errors in CDR1 region.')
    arg('--min-consensus-size', type=int, default=1, metavar='N',
        help='Require at least N sequences for consensus. Default: %(default)s')
    arg('--consensus-threshold', '-t', metavar='PERCENT', type=float, default=75,
        help='Threshold for consensus computation. Default: %(default)s%%')
    arg('--no-ambiguous', '--no-N', default=False, action='store_true',
        help='Discard consensus sequences with ambiguous bases')
    arg('--part', choices=['UTR', 'leader', 'UTR+leader'], default='UTR+leader',
        help='Which part of the sequence before the V '
        'gene match to analyze. Default: %(default)s')
    arg('--debug', default=False, action='store_true',
        help='Enable debugging output')
    arg('table', help='Table with parsed IgBLAST results (assigned.tab.gz or '
        'filtered.tab.gz)')


def main(args):
    if args.debug:
        logging.getLogger().setLevel(logging.DEBUG)
    table = read_table(args.table)
    n_genes = len(set(table['V_gene']))
    logger.info('%s rows read with %d unique gene names', len(table), n_genes)
    for name, column_name, arg in (
            ('V%SHM', 'V_SHM', args.max_v_errors),
            ('FR1%SHM', 'FR1_SHM', args.max_fr1_errors),
            ('CDR1%SHM', 'CDR1_SHM', args.max_cdr1_errors)):
        if arg is not None:
            table = table[getattr(table, column_name) <= arg]
            logger.info('%s rows remain after discarding %s > %s%%', len(table), name, arg)
    table['UTR+leader'] = table['UTR'] + table['leader']
    table = table[table[args.part] != '']
    table['UTR_length'] = [len(s) for s in table['UTR']]

    n_written = 0
    n_consensus_with_n = 0
    for name, group in table.groupby('V_gene'):
        assert len(group) != 0
        if len(group) < args.min_consensus_size:
            logger.info('Gene %s has too few assignments (%s), skipping.', name, len(group))
            continue
        counter = Counter(group['UTR_length'])
        logger.debug('Sequence length/count table: %s',
            ', '.join('{}: {}'.format(l, c) for l, c in counter.most_common()))
        if args.part == 'leader':
            sequences = list(group['leader'])
        else:
            # Since UTR lengths vary a lot, take all sequences that are at
            # least as long as the tenth longest sequence, and compute
            # consensus only from those.
            length_threshold = sorted(group['UTR_length'], reverse=True)[:10][-1]
            sequences = list(group[group['UTR_length'] >= length_threshold][args.part])

        if len(sequences) == 0:
            logger.info('Gene %s has %s assignments, but lengths are too different, skipping.',
                name, len(group))
            continue
        assert len(sequences) > 0
        if len(sequences) == 1:
            cons = sequences[0]
        else:
            # Keep only those sequences whose length is at least 90% of the longest one
            cons = iterative_consensus(sequences, program='muscle-medium',
                threshold=args.consensus_threshold / 100)
        # If the sequence lengths are different, the beginning can be unclear
        cons = cons.lstrip('N')
        logger.info('Gene %s has %s assignments, %s usable (%s unique sequences). '
            'Consensus has %s N bases.', name, len(group), len(sequences),
            len(set(sequences)), cons.count('N'))
        if cons.count('N') > 0:
            n_consensus_with_n += 1
            if args.no_ambiguous:
                continue
        n_written += 1
        print('>{} {}_consensus\n{}'.format(name, args.part, cons))

    in_or_ex = 'excluding' if args.no_ambiguous else 'including'
    logger.info('Wrote a consensus for %s of %s genes (%s %s with ambiguous bases)',
        n_written, n_genes, in_or_ex, n_consensus_with_n)
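# --- Illustrative only (not in the original file): the length threshold used
# above keeps every sequence at least as long as the tenth-longest one.

if __name__ == '__main__':
    _lengths = [30, 80, 75, 60, 90, 85, 70, 65, 95, 100, 55, 50]
    assert sorted(_lengths, reverse=True)[:10][-1] == 55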
IgDiscover-0.11/igdiscover/utils.py000066400000000000000000000322641337725263500173650ustar00rootroot00000000000000
"""
Some utility functions that work on sequences and lists of sequences.
"""
import os
import sys
import re
import random
import hashlib
import resource
from collections import OrderedDict, Counter, defaultdict
from itertools import groupby
from typing import List, Tuple

import numpy as np
from sqt.align import edit_distance, multialign, consensus, globalalign
from sqt.dna import GENETIC_CODE, nt_to_aa as _nt_to_aa
from sqt import SequenceReader
from cutadapt.align import Aligner


def downsampled(population, size):
    """
    Return a random subsample of the population.

    Uses reservoir sampling. See https://en.wikipedia.org/wiki/Reservoir_sampling
    """
    sample = population[:size]
    for index in range(size, len(population)):
        r = random.randint(0, index)
        if r < size:
            sample[r] = population[index]
    return sample


def distances(sequences, band=0.2):
    """
    Compute all pairwise edit distances and return a square matrix.

    Entry [i,j] in the matrix is the edit distance between sequences[i]
    and sequences[j].
    """
    # Pre-compute distances between unique sequences
    unique_sequences = list(set(sequences))
    unique_distances = dict()  # maps (seq1, seq2) tuples to edit distance
    maxdiff = max((int(len(s) * band) for s in sequences), default=0)  # TODO double-check this
    for i, s in enumerate(unique_sequences):
        for j, t in enumerate(unique_sequences):
            if i < j:
                dist = min(maxdiff + 1, edit_distance(s, t, maxdiff=maxdiff))
                unique_distances[(t, s)] = dist
                unique_distances[(s, t)] = dist

    # Fill the result matrix
    m = np.zeros((len(sequences), len(sequences)), dtype=float)
    for i, s in enumerate(sequences):
        for j, t in enumerate(sequences):
            if i < j:
                d = 0 if s == t else unique_distances[(s, t)]
                m[j, i] = m[i, j] = d
    return m


# copied from sqt and modified
def consensus(aligned, threshold=0.7, ambiguous='N', keep_gaps=False):
    """
    Compute a consensus from multialign() output, allowing degraded sequences
    in the 3' end.

    aligned -- a dict mapping names to sequences or a list of sequences
    keep_gaps -- whether the returned sequence contains gaps (-)
    """
    n = len(aligned)
    result = []
    if hasattr(aligned, 'values'):
        sequences = aligned.values()
    else:
        sequences = aligned
    active = int(len(aligned) * 0.05)
    for i, chars in enumerate(reversed(list(zip(*sequences)))):
        counter = Counter(chars)
        active = max(n - counter['-'], active)
        assert counter['-'] >= n - active
        counter['-'] -= n - active
        char, freq = counter.most_common(1)[0]
        if i >= 10:  # TODO hard-coded
            active = n
        if freq / active >= threshold:
            if keep_gaps or char != '-':
                result.append(char)
        else:
            result.append(ambiguous)
    return ''.join(result[::-1])


def iterative_consensus(sequences, program='muscle-medium', threshold=0.6,
        subsample_size=200, maximum_subsample_size=1600):
    """
    Compute a consensus sequence of the given sequences, but do not use all
    sequences if there are many: First, try to compute the consensus from a
    small subsample. If there are 'N' bases, increase the subsample size and
    repeat until either there are no more 'N' bases, all available sequences
    have been used or maximum_subsample_size is reached.
    """
    while True:
        sample = downsampled(sequences, subsample_size)
        aligned = multialign(OrderedDict(enumerate(sample)), program=program)
        cons = consensus(aligned, threshold=threshold).strip('N')
        if 'N' not in cons:
            # This consensus is good enough
            break
        if len(sequences) <= subsample_size:
            # We have already used all the sequences that are available
            break
        subsample_size *= 2
        if subsample_size > maximum_subsample_size:
            break
    return cons


def sequence_hash(s, digits=4):
    """
    For a string, return a 'fingerprint' that looks like 'S1234' (the letter
    'S' is fixed). The idea is that this allows one to quickly see whether two
    sequences are not identical.
    """
    h = int(hashlib.md5(s.encode()).hexdigest()[-4:], base=16)
    return 'S' + str(h % 10**digits).rjust(digits, '0')


def unique_name(name, sequence):
    """
    Create a unique name based on the current name and the sequence.

    The returned name looks like name_S1234. If the current name already
    contains a _S... suffix, it is removed before the new suffix is appended.

    name -- current name
    """
    return '{}_{}'.format(name.rsplit('_S', 1)[0], sequence_hash(sequence))


class SerialPool:
    """
    An alternative to multiprocessing.Pool that runs things in serial for
    easier debugging
    """
    def __init__(self, *args, **kwargs):
        pass

    def __enter__(self):
        return self

    def __exit__(self, *args):
        pass

    def imap(self, func, iterable, chunksize):
        for i in iterable:
            yield func(i)


def natural_sort_key(s, _nsre=re.compile('([0-9]+)')):
    """
    Use this function as sorting key to sort in 'natural' order.

    >>> names = ['file10', 'file1.5', 'file1']
    >>> sorted(names, key=natural_sort_key)
    ['file1', 'file1.5', 'file10']

    Source: http://stackoverflow.com/a/16090640/715090
    """
    return [int(text) if text.isdigit() else text.lower()
        for text in re.split(_nsre, s)]


class UniqueNamer:
    """
    Assign unique names by appending letters to already seen names.
    """
    def __init__(self):
        self._names = set()

    def __call__(self, name):
        ext = 'A'
        new_name = name
        while new_name in self._names:
            if ext == '[':
                raise ValueError('Too many duplicate names')
            new_name = name + ext
            ext = chr(ord(ext) + 1)
        self._names.add(new_name)
        return new_name


class Merger:
    """
    Merge similar items into one. To specify what "similar" means, implement
    the merged() method in a subclass.
    """
    def __init__(self):
        self._items = []

    def add(self, item):
        # This method could possibly be made simpler if the graph structure
        # was made explicit.
        items = []
        for existing_item in self._items:
            m = self.merged(existing_item, item)
            if m is None:
                items.append(existing_item)
            else:
                item = m
        items.append(item)
        self._items = items

    def extend(self, iterable):
        for i in iterable:
            self.add(i)

    def __iter__(self):
        if self._items and hasattr(self._items[0], 'name'):
            yield from sorted(self._items, key=lambda x: x.name)
        else:
            yield from self._items

    def __len__(self):
        return len(self._items)

    def merged(self, existing_item, item):
        """
        If existing_item and item can be merged, this method must return a
        new item that represents a merged version of both. If they cannot be
        merged, it must return None.
        """
        raise NotImplementedError("not implemented")


def relative_symlink(src, dst, force=False):
    """
    Create a symbolic link in any directory.

    force -- if True, then overwrite an existing file/symlink
    """
    if force:
        try:
            os.remove(dst)
        except FileNotFoundError:
            pass
    target = os.path.relpath(os.path.abspath(src), start=os.path.dirname(dst))
    os.symlink(target, dst)


def nt_to_aa(s):
    """Translate a nucleotide sequence to an amino acid sequence"""
    try:
        # try fast version first
        return _nt_to_aa(s)
    except ValueError:
        # failure because there was an unknown nucleotide
        return ''.join(GENETIC_CODE.get(s[i:i + 3], '*') for i in range(0, len(s), 3))


def has_stop(sequence):
    """
    Return a boolean indicating whether the sequence has an internal stop codon.
    An incomplete codon at the end is allowed.

    >>> has_stop('GGG')
    False
    >>> has_stop('TAA')
    True
    >>> has_stop('GGGAC')
    False
    """
    s = sequence[:len(sequence) // 3 * 3]
    return '*' in nt_to_aa(s)


def plural_s(n):
    return 's' if n != 1 else ''


class FastaValidationError(Exception):
    pass


def validate_fasta(path):
    """
    Ensure that the FASTA file is suitable for use with makeblastdb.

    Raise a FastaValidationError if any of the following are true:
    - a record is empty
    - a record name occurs more than once
    - a sequence occurs more than once
    """
    with SequenceReader(path) as sr:
        records = list(sr)

    names = set()
    sequences = dict()
    for r in records:
        if len(r.sequence) == 0:
            raise FastaValidationError("Record {!r} is empty".format(r.name))
        if r.name in names:
            raise FastaValidationError("Record name {!r} occurs more than once".format(r.name))
        s = r.sequence.upper()
        if s in sequences:
            raise FastaValidationError("Records {!r} and {!r} contain the same sequence".format(
                r.name, sequences[s]))
        sequences[s] = r.name
        names.add(r.name)


def find_overlap(s, t, min_overlap=1):
    """
    Detect if s and t overlap.

    Returns:

    None if no overlap was detected.
    0 if s is a prefix of t or t is a prefix of s.
    Positive int gives index where t starts within s.
    Negative int gives -index where s starts within t.

    >>> find_overlap('ABCDE', 'CDE')
    2
    >>> find_overlap('CDE', 'ABCDEFG')
    -2
    >>> find_overlap('ABC', 'X') is None
    True
    """
    aligner = Aligner(s, max_error_rate=0)
    aligner.min_overlap = min_overlap
    result = aligner.locate(t)
    if result is None:
        return None
    s_start, _, t_start, _, _, _ = result
    return s_start - t_start


def merge_overlapping(s, t):
    """
    Return merged sequences or None if they do not overlap.
    The minimum overlap is 50% of the length of the shorter sequence.
    """
    i = find_overlap(s, t, min_overlap=max(1, min(len(s), len(t)) // 2))
    if i is None:
        return None
    if i >= 0:
        # positive: index of t in s
        if i + len(t) < len(s):
            # t is in s
            return s
        return s[:i] + t
    if -i + len(s) < len(t):
        # s is in t
        return t
    return t[:-i] + s


def get_cpu_time():
    """Return CPU time used by process and children"""
    if sys.platform != 'linux':
        return None
    rs = resource.getrusage(resource.RUSAGE_SELF)
    rc = resource.getrusage(resource.RUSAGE_CHILDREN)
    return rs.ru_utime + rs.ru_stime + rc.ru_utime + rc.ru_stime


def slice_arg(s):
    """
    Parse a string that describes a slice with start and end.

    >>> slice_arg('2:-3')
    slice(2, -3, None)
    >>> slice_arg(':-3')
    slice(None, -3, None)
    >>> slice_arg('2:')
    slice(2, None, None)
    """
    start, end = s.split(':')
    start = None if start == '' else int(start)
    end = None if end == '' else int(end)
    return slice(start, end)


def is_same_gene(name1: str, name2: str):
    """
    Compare gene names to find out whether they are alleles of each other.
    Both names must have a '*' in them.
    """
    return '*' in name1 and '*' in name2 and name1.split('*')[0] == name2.split('*')[0]


def describe_nt_change(s: str, t: str):
    """
    Describe changes between two nucleotide sequences

    >>> describe_nt_change('AAA', 'AGA')
    '2A>G'
    >>> describe_nt_change('AAGG', 'AATTGG')
    '2_3insTT'
    >>> describe_nt_change('AATTGG', 'AAGG')
    '3_4delTT'
    >>> describe_nt_change('AATTGGCG', 'AAGGTG')
    '3_4delTT; 7C>T'
    """
    row1, row2, start1, stop1, start2, stop2, errors = \
        globalalign(s.encode('ascii'), t.encode('ascii'), flags=0, match=0)
    row1 = row1.decode('ascii').replace('\0', '-')
    row2 = row2.decode('ascii').replace('\0', '-')
    changes = []

    def grouper(c):
        c1, c2 = c
        if c1 == c2:
            return 'MATCH'
        elif c1 == '-':
            return 'INS'
        elif c2 == '-':
            return 'DEL'
        else:
            return 'SUBST'

    index = 1
    for event, group in groupby(zip(row1, row2), grouper):
        if event == 'MATCH':
            index += len(list(group))
        elif event == 'SUBST':
            # ungroup
            for c1, c2 in group:
                change = '{}{}>{}'.format(index, c1, c2)
                changes.append(change)
                index += 1
        elif event == 'INS':
            inserted = ''.join(c[1] for c in group)
            change = '{}_{}ins{}'.format(index - 1, index, inserted)
            changes.append(change)
        elif event == 'DEL':
            deleted = ''.join(c[0] for c in group)
            change = '{}_{}del{}'.format(index, index + len(deleted) - 1, deleted)
            changes.append(change)
            index += len(deleted)
    return '; '.join(changes)


class ChimeraFinder:
    def __init__(self, sequences: List[str], min_length: int = 10):
        self._sequences = sequences
        self._min_length = min_length
        self._build_index()

    def _build_index(self):
        min_length = self._min_length
        # Create two dictionaries that map all prefixes and suffixes to the
        # indices of all sequences they occur in
        prefixes = defaultdict(list)
        suffixes = defaultdict(list)
        for i, sequence in enumerate(self._sequences):
            if len(sequence) < min_length:
                continue
            for stop in range(min_length, len(sequence) + 1):
                prefix = sequence[:stop]
                prefixes[prefix].append(i)
            for start in range(0, len(sequence) + 1 - min_length):
                suffix = sequence[start:]
                suffixes[suffix].append(i)
        self._prefixes = prefixes
        self._suffixes = suffixes

    def find_exact(self, query: str):
        """
        Find out whether the query string can be explained as a concatenation
        of a prefix of one of the strings plus a suffix of one of the strings
        in the given list of sequences. Both the prefix and the suffix must
        have a length of at least min_length.

        If the answer is yes, a tuple (prefix_length, prefix_indices,
        suffix_indices) is returned, where prefix_length is the length of the
        query prefix, prefix_indices is a list of int indices into the
        sequences of all possible prefix sequences that match, and
        suffix_indices is the same for the suffix. The prefix_length returned
        is the first that yields a result. More are possible.

        If the answer is no, None is returned.
        """
        min_length = self._min_length
        for split_index in range(min_length, len(query) + 1 - min_length):
            prefix = query[:split_index]
            suffix = query[split_index:]
            if prefix in self._prefixes and suffix in self._suffixes:
                return (split_index, self._prefixes[prefix], self._suffixes[suffix])
        return None


def available_cpu_count():
    return len(os.sched_getaffinity(0))
IgDiscover-0.11/misc/000077500000000000000000000000001337725263500144415ustar00rootroot00000000000000
IgDiscover-0.11/misc/Singularity000066400000000000000000000004121337725263500166730ustar00rootroot00000000000000
Bootstrap: docker
From: conda/miniconda3

%post
    conda info
    conda config --add channels conda-forge --add channels bioconda
    conda install -y igdiscover
    conda clean --all

%environment
    export PYTHONNOUSERSITE=x

%runscript
    exec igdiscover "$@"

IgDiscover-0.11/misc/bubbles.py000066400000000000000000000042141337725263500164320ustar00rootroot00000000000000
#!/usr/bin/env python3
"""
Bubble plot
"""
import logging
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from matplotlib.figure import Figure
import matplotlib
import seaborn as sns
import pandas as pd
import numpy as np

# sns.set(style='white', font_scale=1.5, rc={"lines.linewidth": 1})

logger = logging.getLogger(__name__)


def add_arguments(parser):
    arg = parser.add_argument
    arg('--scale', default=1.0, type=float,
        help='scaling factor for bubble size (default %(default)s)')
    arg('pdf', help='PDF output')
    arg('table', help='Input table')


def main(args):
    with open(args.table) as f:
        # Try to auto-detect the separator
        if ';' in f.read():
            kwargs = {'sep': ';'}
        else:
            kwargs = {}
    df = pd.read_table(args.table, index_col=0, **kwargs)
    m = len(df.index)
    n_compartments = len(df.columns)
    logger.info('Table with %s rows read', len(df))
    logger.info('%s compartments', n_compartments)

    df = df.unstack().reset_index()
    df.columns = ['compartment', 'clone', 'size']

    colors = {
        "BLOOD": "#990000",
        "BM": "#0000CC",
        "SPLEEN": "#336600",
        "LN": "#999999",
        "GUT": "#660099",
    }
    fig = Figure(figsize=((n_compartments + 2) * .6, (m + 1) * .4))  # , sharex=True, sharey=True
    FigureCanvas(fig)
    ax = fig.add_subplot(111)
    sns.scatterplot(
        data=df,
        y='clone',
        x='compartment',
        hue='compartment',
        size='size',
        sizes=(0, args.scale * 1000),
        palette=colors.values(),
        ax=ax)
    ax.set_xlabel('Traced lineages')
    ax.set_ylabel('')
    ax.set_ylim(-1, m)
    # ax.xaxis.set_tick_params(rotation=90)
    ax.grid(axis='y', linestyle=':')
    ax.set_axisbelow(True)
    handles, labels = ax.get_legend_handles_labels()
    handles = handles[2 + n_compartments:]
    labels = labels[2 + n_compartments:]
    ax.legend(
        handles, labels,
        bbox_to_anchor=(1.1, 0.55),
        loc=6,
        labelspacing=3,
        borderaxespad=0.,
        handletextpad=1,
        frameon=False)  # , title=''
    fig.savefig(args.pdf, bbox_inches='tight')
    logger.info('File %r written', args.pdf)


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
    from argparse import ArgumentParser
    parser = ArgumentParser()
    add_arguments(parser)
    args = parser.parse_args()
    main(args)
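# --- Illustrative only (not part of the original script): the expected input
# is a matrix of clone sizes with one row per clone and one column per
# compartment, e.g. (tab- or semicolon-separated):
#
#     clone   BLOOD   BM   SPLEEN   LN   GUT
#     clone1     10    2        0    1     5
#     clone2      3    7        4    0     0
#
# Invocation sketch (file names made up):  python bubbles.py bubbles.pdf clones.tab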
IgDiscover-0.11/misc/clonointersect.py000066400000000000000000000100521337725263500200450ustar00rootroot00000000000000
"""
Compute the overlap (common clonotypes and sequences) between two clonotype tables
"""
import logging
from collections import defaultdict
from contextlib import ExitStack

import pandas as pd
from xopen import xopen

from igdiscover.table import read_table
from igdiscover.utils import slice_arg
from igdiscover.clonotypes import is_similar_with_junction, CLONOTYPE_COLUMNS

logger = logging.getLogger(__name__)


def add_arguments(parser):
    arg = parser.add_argument
    # arg('--minimum-count', '-c', metavar='N', default=1, type=int,
    #     help='Discard all rows with count less than N. Default: %(default)s')
    # arg('--cdr3-core', default=None,
    #     type=slice_arg, metavar='START:END',
    #     help='START:END defines the non-junction region of CDR3 '
    #     'sequences. Use negative numbers for END to count '
    #     'from the end. Regions before and after are considered to '
    #     'be junction sequence, and for two CDR3s to be considered '
    #     'similar, at least one of the junctions must be identical. '
    #     'Default: no junction region.')
    # arg('--mismatches', default=1, type=float,
    #     help='No. of allowed mismatches between CDR3 sequences. '
    #     'Can also be a fraction between 0 and 1 (such as 0.15), '
    #     'interpreted relative to the length of the CDR3 (minus the front non-core). '
    #     'Default: %(default)s')
    arg('--aa', default=False, action='store_true',
        help='Count CDR3 mismatches on amino-acid level. Default: Compare nucleotides.')
    arg('tables', nargs='+', help='clonotype tables')


def collect(querytable, reftable, mismatches, cdr3_core_slice, cdr3_column):
    """
    Find all queries from the querytable in the reftable.

    Yield tuples (query_rows, similar_rows) where query_rows is a list of all
    the rows that have the same result and similar_rows is a DataFrame whose
    rows are the ones matching the query.
    """
    # The vjlentype is a "clonotype without CDR3 sequence" (only V, J, CDR3 length)
    # Determine set of vjlentypes to query
    query_vjlentypes = defaultdict(list)
    for row in querytable.itertuples():
        vjlentype = (row.V_gene, row.J_gene, len(row.CDR3_nt))
        query_vjlentypes[vjlentype].append(row)

    groupby = ('V_gene', 'J_gene', 'CDR3_length')
    for vjlentype, vjlen_group in reftable.groupby(groupby):
        # (v_gene, j_gene, cdr3_length) = vjlentype
        if vjlentype not in query_vjlentypes:
            continue

        # Collect results for this vjlentype. The results dict
        # maps row indices (into the vjlen_group) to each query_row,
        # allowing us to group identical results together.
        results = defaultdict(list)
        for query_row in query_vjlentypes.pop(vjlentype):
            cdr3 = getattr(query_row, cdr3_column)
            # Save indices of the rows that are similar to this query
            indices = tuple(index for index, r in enumerate(vjlen_group.itertuples())
                if is_similar_with_junction(cdr3, getattr(r, cdr3_column), mismatches,
                    cdr3_core_slice))
            results[indices].append(query_row)

        # Yield results, grouping queries that lead to the same result
        for indices, query_rows in results.items():
            if not indices:
                for query_row in query_rows:
                    yield ([query_row], [])
                continue
            similar_group = vjlen_group.iloc[list(indices), :]
            yield (query_rows, similar_group)

    # Yield result tuples for all the queries that have not been found
    for queries in query_vjlentypes.values():
        for query_row in queries:
            yield ([query_row], [])


def main(args):
    cdr3_column = 'CDR3_aa' if args.aa else 'CDR3_nt'
    # CDR3_length is unused, but this ensures that we are reading the right type of file
    usecols = ['count', 'V_gene', 'J_gene', 'CDR3_nt', 'CDR3_length', cdr3_column]
    tables = []
    for path in args.tables:
        table = read_table(path, usecols=usecols).set_index(
            ['V_gene', 'J_gene', cdr3_column])
        tables.append(table)
        logger.info('Read table from %r with %s rows', path, len(table))

    common = tables[0].join(tables[1], lsuffix='_left', rsuffix='_right', how='inner')
    print('Common clonotypes:', len(common))
    n = common[['count_left', 'count_right']].min(axis=1).sum()
    print('Common sequences:', n)


if __name__ == '__main__':
    from argparse import ArgumentParser
    parser = ArgumentParser()
    add_arguments(parser)
    args = parser.parse_args()
    main(args)
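# --- Illustrative only (not part of the original script): the overlap
# computation used in main(), demonstrated on two toy tables.

def _demo_overlap():
    index = ['V_gene', 'J_gene', 'CDR3_nt']
    left = pd.DataFrame({'V_gene': ['V1'], 'J_gene': ['J1'],
        'CDR3_nt': ['AAA'], 'count': [5]}).set_index(index)
    right = pd.DataFrame({'V_gene': ['V1'], 'J_gene': ['J1'],
        'CDR3_nt': ['AAA'], 'count': [3]}).set_index(index)
    common = left.join(right, lsuffix='_left', rsuffix='_right', how='inner')
    assert len(common) == 1                                              # one shared clonotype
    assert common[['count_left', 'count_right']].min(axis=1).sum() == 3  # shared sequences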
IgDiscover-0.11/misc/clonoplot.py000066400000000000000000000075641337725263500170330ustar00rootroot00000000000000
"""
Plot SHM distributions for clonotypes that are shared between datasets
"""
import sys
import logging
from itertools import islice
from collections import Counter
from io import StringIO

from matplotlib.backends.backend_pdf import FigureCanvasPdf, PdfPages
import pandas as pd
import seaborn as sns

sns.set(style='white', font_scale=1.5, rc={"lines.linewidth": 1})

logger = logging.getLogger(__name__)

CM = 1 / 2.54


def add_arguments(parser):
    arg = parser.add_argument
    arg('--nt', default=False, action='store_true',
        help='Count CDR3 mismatches on nucleotide level. Default: Compare amino-acids.')
    arg('--limit', default=None, type=int, metavar='N',
        help='Read only the first N clonotypes of each table (for fast testing)')
    arg('--minsize', default=5, type=int, metavar='N',
        help='Require at least N members in each dataset to plot a clonotype')
    arg('--names', metavar='NAME[,NAME]...',
        help='Comma-separated list of dataset names to be used in the legend')
    arg('pdf', help='PDF output')
    arg('tables', nargs='+', help='clonotype member tables')


def read_dataset(path, cdr3_column='CDR3_aa', limit=None, minsize=1):
    usecols = [
        'V_gene', 'J_gene', cdr3_column,
        'FR1_SHM', 'CDR1_SHM', 'FR2_SHM', 'CDR2_SHM', 'FR3_SHM',
        'V_SHM', 'J_SHM',
    ]
    column_names = list(pd.read_csv(path, sep='\t', nrows=0).columns)
    assert 'CDR3_length' in column_names
    logger.info('Reading %s', path)
    with open(path) as f:
        data = f.read()
    first = True
    tables = []
    for i, chunk in enumerate(islice(data.split('\n\n'), limit), 1):
        if not chunk:
            # The file ends with an empty line that needs to be skipped
            continue
        table = pd.read_csv(StringIO(chunk), sep='\t', usecols=usecols,
            header=0 if first else None, names=column_names)
        first = False
        assert len(set(table[cdr3_column])) == 1
        if len(table) >= minsize:
            tables.append(table.set_index(['V_gene', 'J_gene', cdr3_column]).sort_index())
        if i % 10000 == 0:
            logger.info('Read %s clones', i)
    logger.info('Read %s clones', i)
    return pd.concat(tables)


def main(args):
    logger.info('Will plot results to %s', args.pdf)
    cdr3_column = 'CDR3_nt' if args.nt else 'CDR3_aa'
    n_datasets = len(args.tables)
    if args.names:
        names = args.names.split(',')
        if len(names) != n_datasets:
            logger.error('%s dataset names given, but %s datasets provided',
                len(names), n_datasets)
            sys.exit(1)
    else:
        names = list(range(n_datasets))
    datasets = (read_dataset(path, limit=args.limit, minsize=args.minsize)
        for path in args.tables)
    df = pd.concat(datasets, keys=range(n_datasets), names=['dataset_id'])
    logger.info('Read %s tables', n_datasets)
    df.rename(columns=lambda x: x[:-4], inplace=True)  # Removes _SHM suffix
    cols = ['V_gene', 'J_gene', cdr3_column]
    n = 0
    with PdfPages(args.pdf) as pages:
        for (v_gene, j_gene, cdr3), group in df.groupby(level=cols):
            group = group.reset_index(level=cols, drop=True)
            skip = False
            counter = Counter(group.index)
            for dataset_id in range(n_datasets):
                if counter[dataset_id] < args.minsize:
                    skip = True
                    break
            if skip:
                continue
            table = group.stack()
            table.index.set_names('region', level=1, inplace=True)
            table.name = 'SHM'
            table = table.reset_index()
            table = table.assign(Dataset=table['dataset_id'].map(lambda i: names[i]))
            g = sns.factorplot(data=table, x='region', y='SHM', hue='Dataset',
                kind='violin', size=16 * CM, aspect=2)
            dscounts = ' vs '.join(str(counter[i]) for i in range(n_datasets))
            g.fig.suptitle('V: {} – J: {} – CDR3: {} ({})'.format(v_gene, j_gene, cdr3, dscounts))
            g.set_axis_labels('Region')
            g.set_ylabels('%SHM')
            # g.despine()
            FigureCanvasPdf(g.fig).print_figure(pages, bbox_inches='tight')
            n += 1
    logger.info('Plotted %s clonotypes', n)


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
    from argparse import ArgumentParser
    parser = ArgumentParser()
    add_arguments(parser)
    args = parser.parse_args()
    main(args)
Default: Compare amino-acids.') arg('--limit', default=None, type=int, metavar='N', help='Read only the first N clonotypes of each table (for fast testing)') arg('--minsize', default=5, type=int, metavar='N', help='Require at least N members in each dataset to plot a clonotype') arg('--names', metavar='NAME[,NAME]...', help='Comma-separated list of dataset names to be used in the legend') arg('pdf', help='PDF output') arg('tables', nargs='+', help='clonotype member tables') def read_dataset(path, cdr3_column='CDR3_aa', limit=None, minsize=1): usecols = [ 'V_gene', 'J_gene', cdr3_column, 'FR1_SHM', 'CDR1_SHM', 'FR2_SHM', 'CDR2_SHM', 'FR3_SHM', 'V_SHM', 'J_SHM', ] column_names = list(pd.read_csv(path, sep='\t', nrows=0).columns) assert 'CDR3_length' in column_names logger.info('Reading %s', path) with open(path) as f: data = f.read() first = True tables = [] for i, chunk in enumerate(islice(data.split('\n\n'), limit), 1): if not chunk: # The file ends with an empty line that needs to be skipped continue table = pd.read_csv(StringIO(chunk), sep='\t', usecols=usecols, header=0 if first else None, names=column_names) first = False assert len(set(table[cdr3_column])) == 1 if len(table) >= minsize: tables.append(table.set_index(['V_gene', 'J_gene', cdr3_column]).sort_index()) if i % 10000 == 0: logger.info('Read %s clones', i) logger.info('Read %s clones', i) return pd.concat(tables) def main(args): logger.info('Will plot results to %s', args.pdf) cdr3_column = 'CDR3_nt' if args.nt else 'CDR3_aa' n_datasets = len(args.tables) if args.names: names = args.names.split(',') if len(names) != n_datasets: logger.error('%s dataset names given, but %s datasets provided', len(names), n_datasets) sys.exit(1) else: names = list(range(n_datasets)) datasets = (read_dataset(path, limit=args.limit, minsize=args.minsize) for path in args.tables) df = pd.concat(datasets, keys=range(n_datasets), names=['dataset_id']) logger.info('Read %s tables', n_datasets) df.rename(columns=lambda x: x[:-4], inplace=True) # Removes _SHM suffix cols = ['V_gene', 'J_gene', cdr3_column] n = 0 with PdfPages(args.pdf) as pages: for (v_gene, j_gene, cdr3), group in df.groupby(level=cols): group = group.reset_index(level=cols, drop=True) skip = False counter = Counter(group.index) for dataset_id in range(n_datasets): if counter[dataset_id] < args.minsize: skip = True break if skip: continue table = group.stack() table.index.set_names('region', level=1, inplace=True) table.name = 'SHM' table = table.reset_index() table = table.assign(Dataset=table['dataset_id'].map(lambda i: names[i])) g = sns.factorplot(data=table, x='region', y='SHM', hue='Dataset', kind='violin', size=16*CM, aspect=2) dscounts = ' vs '.join(str(counter[i]) for i in range(n_datasets)) g.fig.suptitle('V: {} – J: {} – CDR3: {} ({})'.format(v_gene, j_gene, cdr3, dscounts)) g.set_axis_labels('Region') g.set_ylabels('%SHM') # g.despine() FigureCanvasPdf(g.fig).print_figure(pages, bbox_inches='tight') n += 1 logger.info('Plotted %s clonotypes', n) if __name__ == '__main__': logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s') from argparse import ArgumentParser parser = ArgumentParser() add_arguments(parser) args = parser.parse_args() main(args) IgDiscover-0.11/misc/geneusage.py000077500000000000000000000032051337725263500167610ustar00rootroot00000000000000#!/usr/bin/env python3 """ Compute and plot V/J gene usage from a clonotypes.tab file """ import sys from argparse import ArgumentParser import seaborn import pandas as pd def add_arguments(parser): 
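# Example invocation (hypothetical file names), using the options defined
# below:
#   geneusage.py --gene V --mincount 2 --plot v_usage.pdf clonotypes.tab
# This prints a tab-separated gene/frequency table and writes a bar plot.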
arg = parser.add_argument arg('--mincount', metavar='N', default=1, type=int, help='Filter out clonotypes with less than N members. Default: %(default)s') arg('--gene', default='V', choices=('V', 'J'), help='Which gene to plot. Choose V or J. Default: %(default)s') arg('--plot', metavar='FILE', help='Plot results into FILE (PNG, PDF)') arg('table', help='clonotypes.tab table created with the "clonotypes" subcommand') def main(args): # Read in input table table = pd.read_csv(args.table, sep='\t') # Extract the 'V_gene' or 'J_gene' column counts_column = table[args.gene + '_gene'] # Count how often each value (gene) appears gene_frequencies = counts_column.value_counts() # Keep only those with minimum count gene_frequencies = gene_frequencies[gene_frequencies >= args.mincount] # Sort by gene name gene_frequencies = gene_frequencies.sort_index() # Print out the frequencies print('gene', 'frequency', sep='\t') for gene_name, frequency in gene_frequencies.items(): print(gene_name, frequency, sep='\t') # If requested, create a plot if args.plot: ax = gene_frequencies.plot(kind='bar', figsize=(20, 5)) ax.set_title('Gene expression counts') ax.set_xlabel('{} gene'.format(args.gene)) ax.set_ylabel('Clonotype count') ax.figure.tight_layout() ax.figure.savefig(args.plot) if __name__ == '__main__': parser = ArgumentParser(description=__doc__) add_arguments(parser) args = parser.parse_args() main(args) IgDiscover-0.11/misc/igblast-example-record.txt000066400000000000000000000043751337725263500215450ustar00rootroot00000000000000# IGBLASTN 2.2.29+ # Query: M00559:137:000000000-ARGK3:1:1102:21678:7874;size=83; # Database: iteration-01/database/human_V iteration-01/database/human_D iteration-01/database/human_J # Domain classification requested: imgt # V-(D)-J rearrangement summary for query sequence (Top V gene match, Top D gene match, Top J gene match, Chain type, stop codon, V-J frame, Productive, Strand). Multiple equivalent top matches having the same score and percent identity, if present, are separated by a comma. IGHV6-1*01 IGHD4-23*01 IGHJ6*02 VH No N/A N/A + # V-(D)-J junction details based on top germline gene matches (V end, V-D junction, D region, D-J junction, J start). Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either rearranging gene) are indicated in parentheses (i.e., (TACT)) but are not included under the V, D, or J gene itself AAGAG G CGGTGGTAAC A ACTAC # Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity) FR1-IMGT 25 99 75 75 0 0 100 CDR1-IMGT 100 129 30 30 0 0 100 FR2-IMGT 130 180 51 51 0 0 100 CDR2-IMGT 181 207 27 27 0 0 100 FR3-IMGT 208 321 114 114 0 0 100 CDR3-IMGT (germline) 322 328 7 7 0 0 100 Total N/A N/A 304 304 0 0 100 # Hit table (the first field indicates the chain type of the hit) # Fields: subject id, q. start, query seq, s. 
start, subject seq, % identity, subject length, evalue # 3 hits found V IGHV6-1*01 25 CAGGTACAGCTGCAGCAGTCAGGTCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCAACAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTGGGAAGGACATACTACAGGTCCAAGTGGTATAATGATTATGCAGTATCTGTGAAAAGTCGAATAACCATCAACCCAGACACATCCAAGAACCAGTTCTCCCTGCAGCTGAACTCTGTGACTCCCGAGGACACGGCTGTGTATTACTGTGCAAGAG 1 CAGGTACAGCTGCAGCAGTCAGGTCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCAACAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTGGGAAGGACATACTACAGGTCCAAGTGGTATAATGATTATGCAGTATCTGTGAAAAGTCGAATAACCATCAACCCAGACACATCCAAGAACCAGTTCTCCCTGCAGCTGAACTCTGTGACTCCCGAGGACACGGCTGTGTATTACTGTGCAAGAG 100.00 305 3e-136 D IGHD4-23*01 330 CGGTGGTAAC 7 CGGTGGTAAC 100.00 19 0.22 J IGHJ6*02 341 ACTACGGTATGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTCCTCA 13 ACTACGGTATGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTCCTCA 100.00 62 3e-25 IgDiscover-0.11/misc/mergeoverlapping.py000077500000000000000000000063371337725263500203750ustar00rootroot00000000000000#!/usr/bin/env python3 """ Merge overlapping sequences """ import sys from argparse import ArgumentParser from sqt.io.fasta import FastaReader from cutadapt.align import Aligner class Merger: """ Merge similar items into one. To specify what "similar" means, implement the merged() method in a subclass. """ def __init__(self): self._items = [] def add(self, item): # This method could possibly be made simpler if the graph structure # was made explicit. items = [] for existing_item in self._items: m = self.merged(existing_item, item) if m is None: items.append(existing_item) else: item = m items.append(item) self._items = items def extend(self, iterable): for i in iterable: self.add(i) def __iter__(self): if self._items and hasattr(self._items, 'name'): yield from sorted(self._items, key=lambda x: x.name) else: yield from self._items def __len__(self): return len(self._items) def merged(self, existing_item, item): """ If existing_item and item can be returned, this method must return a new item that represents a merged version of both. If they cannot be merged, it must return None. """ raise NotImplementedError("not implemented") def find_overlap(s, t, min_overlap=1): """ Detect if s and t overlap. Returns: None if no overlap was detected. 0 if s is a prefix of t or t is a prefix of s. Positive int gives index where t starts within s. Negative int gives -index where s starts within t. >>> find_overlap('ABCDE', 'CDE') 2 >>> find_overlap('CDE', 'ABCDEFG') -2 >>> find_overlap('ABC', 'X') is None True """ aligner = Aligner(s, max_error_rate=0) aligner.min_overlap = min_overlap result = aligner.locate(t) if result is None: return None s_start, _, t_start, _, _, _ = result return s_start - t_start def merge_overlapping(s, t): """ Return merged sequences or None if they do not overlap The minimum overlap is 50% of the length of the shorter sequence. """ i = find_overlap(s, t, min_overlap=max(1, min(len(s), len(t)) // 2)) if i is None: return None if i >= 0: # positive: index of t in s if i + len(t) < len(s): # t is in s return s return s[:i] + t if -i + len(s) < len(t): # s is in t return t return t[:-i] + s class SequenceInfo: __slots__ = ('name', 'sequence', 'count') def __init__(self, name, sequence): self.name = name self.sequence = sequence # self.count = count class OverlappingSequenceMerger(Merger): """ Merge sequences that overlap """ def merged(self, s, t): """ Merge two sequences if they overlap. Return None if they should not be merged. 
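Illustrative example (assumed inputs): for SequenceInfo objects holding
the sequences ACGTACGT and GTACGTAA, the six-base overlap GTACGT yields
a merged SequenceInfo with sequence ACGTACGTAA and the two names joined
by ';'.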
""" m = merge_overlapping(s.sequence, t.sequence) if m is not None: # overlap found return SequenceInfo(s.name + ';' + t.name, m) # no overlap found return None def add_arguments(parser): arg = parser.add_argument arg('fasta', help='File with sequences') def main(args): merger = OverlappingSequenceMerger() with FastaReader(args.fasta) as f: for record in f: merger.add(SequenceInfo(record.name, record.sequence)) for info in merger: print('>{}\n{}'.format(info.name, info.sequence)) if __name__ == '__main__': parser = ArgumentParser(description=__doc__) add_arguments(parser) args = parser.parse_args() main(args) IgDiscover-0.11/misc/shmmultitab.py000077500000000000000000000046111337725263500173510ustar00rootroot00000000000000#!/usr/bin/env python3 """ Extract V_SHM values from multiple clonoquery tables """ from collections import OrderedDict import logging import pandas as pd import io import os import sys import re def add_arguments(parser): arg = parser.add_argument arg('--field', default='V_SHM', help='Which column to extract. Default: %(default)s') arg('-n', default=5, type=int, help='Require at least N values in each file for a query to be listed') arg('paths', nargs='+') def read_one_file(path, field_name): with open(path) as f: header = f.readline() sections = re.split(r'\n\t*\n', f.read().strip()) column_names = header.strip().split('\t') queries = dict() for section in sections: sio = io.StringIO(section) n = section.count('# Query: ') if n > 100: print(repr(section[:10000])) query_names = [] for i in range(n): query = sio.readline() assert query.startswith('# Query: '), query query_name = query.split('\t')[0].partition('# Query: ')[2] query_names.append(query_name) table = pd.read_table(sio, header=None, names=column_names, usecols=(field_name,)) # print('Read a table with {} entries'.format(len(table)), file=sys.stderr) for query_name in query_names: queries[query_name] = table[field_name] return queries def main(args): files = OrderedDict((os.path.basename(path), read_one_file(path, args.field)) for path in args.paths) print('read {} files'.format(len(files)), file=sys.stderr) query_names = dict() for queries in files.values(): for query_name in queries.keys(): query_names[query_name] = None query_names = list(query_names) maxlen = 0 for queries in files.values(): maxlen = max(maxlen, max(len(q) for q in queries.values())) assert len(set(query_names)) == len(query_names) df = pd.DataFrame(index=range(maxlen)) for query_name in query_names: long_enough_in_all_files = all(len(files[file].get(query_name, [])) >= args.n for file in files) if not long_enough_in_all_files: print('Query {} has too few values in at least one file'.format(query_name), file=sys.stderr) continue for file in files: column_name = '{file}-{query_name}'.format(file=file, query_name=query_name) df[column_name] = files[file].get(query_name, pd.Series([])) print(df.to_csv(sep='\t', index=False)) if __name__ == '__main__': from argparse import ArgumentParser parser = ArgumentParser() add_arguments(parser) args = parser.parse_args() main(args) IgDiscover-0.11/misc/shmtab.py000077500000000000000000000015011337725263500162710ustar00rootroot00000000000000#!/usr/bin/env python3 """ Extract V_SHM values from a filtered.tab.gz and group them by V gene Create a table that has a column for each V gene. """ import logging import pandas as pd def add_arguments(parser): arg = parser.add_argument arg('--field', default='V_SHM', help='Which column to extract. 
Default: %(default)s') arg('table', help='filtered.tab.gz file') def main(args): table = pd.read_table(args.table, sep='\t', usecols=(args.field, 'V_gene')) df = pd.DataFrame(index=range(table.groupby('V_gene').size().max())) for gene, group in table.groupby('V_gene'): df[gene] = group[args.field].reset_index(drop=True) print(df.to_csv(sep='\t', index=False)) if __name__ == '__main__': from argparse import ArgumentParser parser = ArgumentParser() add_arguments(parser) args = parser.parse_args() main(args) IgDiscover-0.11/misc/splitbybarcode.py000077500000000000000000000027361337725263500200320ustar00rootroot00000000000000#!/usr/bin/env python3 """ Split sequences in a FASTA file by barcode. """ __author__ = "Marcel Martin" import sys import random from sqt import HelpfulArgumentParser, FastaReader def main(): parser = HelpfulArgumentParser(description=__doc__) parser.add_argument('--barcode-length', '-l', metavar='N', type=int, default=12, help="Barcode length (default: %(default)s)") parser.add_argument('--sequential', '-s', action='store_true', default=False, help="Number output files sequentially, do not use BARCODE in the file name.") parser.add_argument("fasta", metavar='FASTA', help="Input FASTA file") parser.add_argument("prefix", metavar='PREFIX', nargs='?', default='split', help='Prefix for output file names. The output files will be named ' '<PREFIX><BARCODE>.fasta or <PREFIX><N>.fasta.') args = parser.parse_args() n = 0 barcodes = dict() n_barcodes = 1 for record in FastaReader(args.fasta): barcode = record.sequence[:args.barcode_length] if barcode in barcodes: index = barcodes[barcode] else: index = n_barcodes barcodes[barcode] = index n_barcodes += 1 if args.sequential: path = args.prefix + str(index) + '.fasta' else: path = args.prefix + barcode + '.fasta' # Open/close files every time to avoid too many open files with open(path, mode='a') as f: f.write('>{}\n{}\n'.format(record.name, record.sequence)) n += 1 print('Wrote {} sequences to {} files.'.format(n, len(barcodes))) if __name__ == '__main__': main() IgDiscover-0.11/misc/sra_download.py000077500000000000000000000014461337725263500174760ustar00rootroot00000000000000#!/usr/bin/env python3 import sys import subprocess def download_sra(sra_id): if not sra_id.startswith('SRR') and not sra_id.startswith('ERR'): raise ValueError('only ids starting with SRR or ERR supported') number = sra_id[3:] if len(number) == 6: url = f'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{sra_id[:6]}/{sra_id}/' elif len(number) == 7: url = f'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{sra_id[:6]}/00{number[6]}/{sra_id}/' else: raise ValueError('number not supported') single_end_url = f'{url}{sra_id}.fastq.gz' paired_end_urls = [f'{url}{sra_id}_{i}.fastq.gz' for i in (1, 2)] result = subprocess.run(['wget', single_end_url]) if result.returncode != 0: for url in paired_end_urls: subprocess.run(['wget', url], check=True) if __name__ == '__main__': download_sra(sys.argv[1]) IgDiscover-0.11/misc/uniq.py000077500000000000000000000004641337725263500157760ustar00rootroot00000000000000#!/usr/bin/env python3 from cutadapt import seqio import sys from collections import defaultdict with seqio.open(sys.argv[1]) as sr: reads = list(sr) d = defaultdict(list) for read in reads: d[read.sequence].append(read.name) for seq, names in d.items(): print('>{}\n{}'.format(','.join(names), seq)) IgDiscover-0.11/setup.cfg000066400000000000000000000003651337725263500153330ustar00rootroot00000000000000[versioneer] VCS = git style = pep440 versionfile_source = igdiscover/_version.py versionfile_build =
igdiscover/_version.py tag_prefix = v parentdir_prefix = igdiscover- [tool:pytest] addopts = --doctest-modules testpaths = igdiscover/ tests/ IgDiscover-0.11/setup.py000066400000000000000000000023671337725263500152300ustar00rootroot00000000000000import sys from setuptools import setup import versioneer if sys.version_info < (3, 5): sys.stdout.write("At least Python 3.5 is required.\n") sys.exit(1) with open('README.rst', encoding='utf-8') as f: long_description = f.read() setup( name = 'igdiscover', version = versioneer.get_version(), cmdclass = versioneer.get_cmdclass(), author = 'Marcel Martin', author_email = 'marcel.martin@scilifelab.se', url = 'https://igdiscover.se/', description = 'Analyze antibody repertoires and discover new V genes', long_description = long_description, license = 'MIT', entry_points = {'console_scripts': ['igdiscover = igdiscover.__main__:main']}, packages = ['igdiscover'], package_data = {'igdiscover': ['igdiscover.yaml', 'Snakefile', 'empty.aux']}, install_requires = [ 'sqt>=0.8.0', 'pandas>=0.16.2', 'numpy', 'matplotlib>=1.5.0', 'seaborn>=0.8.1', 'snakemake>=4.5', 'cutadapt', 'scipy>=0.16.1', 'xopen>=0.3.2', 'ruamel.yaml', ], classifiers = [ "Development Status :: 4 - Beta", "Environment :: Console", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Programming Language :: Python :: 3", "Topic :: Scientific/Engineering :: Bio-Informatics" ] ) IgDiscover-0.11/tests/000077500000000000000000000000001337725263500146505ustar00rootroot00000000000000IgDiscover-0.11/tests/__init__.py000066400000000000000000000000001337725263500167470ustar00rootroot00000000000000IgDiscover-0.11/tests/data/000077500000000000000000000000001337725263500155615ustar00rootroot00000000000000IgDiscover-0.11/tests/data/H1/000077500000000000000000000000001337725263500160315ustar00rootroot00000000000000IgDiscover-0.11/tests/data/H1/README.md000066400000000000000000000002311337725263500173040ustar00rootroot00000000000000These are partial results from ERR1760498. V.fasta -- V gene sequences that actually appear in the `candidates.tab` (not the full V starting database). 
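As a quick sanity check of this test database, a minimal standard-library sketch (the path and layout are assumptions based on this repository):

```python
# Count the V records in the test FASTA; the path is assumed relative
# to the repository root.
names = []
with open("tests/data/H1/V.fasta") as f:
    for line in f:
        if line.startswith(">"):
            names.append(line[1:].strip())
print(len(names), "V sequences; first:", names[0])
```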
IgDiscover-0.11/tests/data/H1/V.fasta000066400000000000000000001124751337725263500172700ustar00rootroot00000000000000>IGHV1-18*01 CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV1-18*03 CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACATGGCCGTGTATTACTGTGCGAGAGA >IGHV1-18*04 CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTACGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV1-2*02 CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCAGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV1-2*04 CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCTGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV1-24*01 CAGGTCCAGCTGGTACAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGTTTCCGGATACACCCTCACTGAATTATCCATGCACTGGGTGCGACAGGCTCCTGGAAAAGGGCTTGAGTGGATGGGAGGTTTTGATCCTGAAGATGGTGAAACAATCTACGCACAGAAGTTCCAGGGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGA >IGHV1-3*01 CAGGTCCAGCTTGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTTTCCTGCAAGGCTTCTGGATACACCTTCACTAGCTATGCTATGCATTGGGTGCGCCAGGCCCCCGGACAAAGGCTTGAGTGGATGGGATGGATCAACGCTGGCAATGGTAACACAAAATATTCACAGAAGTTCCAGGGCAGAGTCACCATTACCAGGGACACATCCGCGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAAGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV1-3*02 CAGGTTCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTTTCCTGCAAGGCTTCTGGATACACCTTCACTAGCTATGCTATGCATTGGGTGCGCCAGGCCCCCGGACAAAGGCTTGAGTGGATGGGATGGAGCAACGCTGGCAATGGTAACACAAAATATTCACAGGAGTTCCAGGGCAGAGTCACCATTACCAGGGACACATCCGCGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACATGGCTGTGTATTACTGTGCGAGAGA >IGHV1-45*02 CAGATGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGACTGGGTCCTCAGTGAAGGTTTCCTGCAAGGCTTCCGGATACACCTTCACCTACCGCTACCTGCACTGGGTGCGACAGGCCCCCGGACAAGCGCTTGAGTGGATGGGATGGATCACACCTTTCAATGGTAACACCAACTACGCACAGAAATTCCAGGACAGAGTCACCATTACCAGGGACAGGTCTATGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACAGCCATGTATTACTGTGCAAGATA >IGHV1-46*01 CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTTTCCTGCAAGGCATCTGGATACACCTTCACCAGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAATAATCAACCCTAGTGGTGGTAGCACAAGCTACGCACAGAAGTTCCAGGGCAGAGTCACCATGACCAGGGACACGTCCACGAGCACAGTCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV1-46*02 CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTTTCCTGCAAGGCATCTGGATACACCTTCAACAGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAATAATCAACCCTAGTGGTGGTAGCACAAGCTACGCACAGAAGTTCCAGGGCAGAGTCACCATGACCAGGGACACGTCCACGAGCACAGTCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV1-46*03 
CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTTTCCTGCAAGGCATCTGGATACACCTTCACCAGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAATAATCAACCCTAGTGGTGGTAGCACAAGCTACGCACAGAAGTTCCAGGGCAGAGTCACCATGACCAGGGACACGTCCACGAGCACAGTCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCTAGAGA >IGHV1-58*01 CAAATGCAGCTGGTGCAGTCTGGGCCTGAGGTGAAGAAGCCTGGGACCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATTCACCTTTACTAGCTCTGCTGTGCAGTGGGTGCGACAGGCTCGTGGACAACGCCTTGAGTGGATAGGATGGATCGTCGTTGGCAGTGGTAACACAAACTACGCACAGAAGTTCCAGGAAAGAGTCACCATTACCAGGGACATGTCCACAAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCCGAGGACACGGCCGTGTATTACTGTGCGGCAGA >IGHV1-69*01 CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGTCCTCGGTGAAGGTCTCCTGCAAGGCTTCTGGAGGCACCTTCAGCAGCTATGCTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAGGGATCATCCCTATCTTTGGTACAGCAAACTACGCACAGAAGTTCCAGGGCAGAGTCACGATTACCGCGGACGAATCCACGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV1-69*06 CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGTCCTCGGTGAAGGTCTCCTGCAAGGCTTCTGGAGGCACCTTCAGCAGCTATGCTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAGGGATCATCCCTATCTTTGGTACAGCAAACTACGCACAGAAGTTCCAGGGCAGAGTCACGATTACCGCGGACAAATCCACGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV1-69*12 CAGGTCCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGTCCTCGGTGAAGGTCTCCTGCAAGGCTTCTGGAGGCACCTTCAGCAGCTATGCTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAGGGATCATCCCTATCTTTGGTACAGCAAACTACGCACAGAAGTTCCAGGGCAGAGTCACGATTACCGCGGACGAATCCACGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV1-69*13 CAGGTCCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGTCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGAGGCACCTTCAGCAGCTATGCTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAGGGATCATCCCTATCTTTGGTACAGCAAACTACGCACAGAAGTTCCAGGGCAGAGTCACGATTACCGCGGACGAATCCACGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV1-69*14 CAGGTCCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGTCCTCGGTGAAGGTCTCCTGCAAGGCTTCTGGAGGCACCTTCAGCAGCTATGCTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAGGGATCATCCCTATCTTTGGTACAGCAAACTACGCACAGAAGTTCCAGGGCAGAGTCACGATTACCGCGGACAAATCCACGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV1-69-2*01 GAGGTCCAGCTGGTACAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCTACAGTGAAAATCTCCTGCAAGGTTTCTGGATACACCTTCACCGACTACTACATGCACTGGGTGCAACAGGCCCCTGGAAAAGGGCTTGAGTGGATGGGACTTGTTGATCCTGAAGATGGTGAAACAATATACGCAGAGAAGTTCCAGGGCAGAGTCACCATAACCGCGGACACGTCTACAGACACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGA >IGHV1-8*01 CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCAGTTATGATATCAACTGGGTGCGACAGGCCACTGGACAAGGGCTTGAGTGGATGGGATGGATGAACCCTAACAGTGGTAACACAGGCTATGCACAGAAGTTCCAGGGCAGAGTCACCATGACCAGGAACACCTCCATAAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGG >IGHV1-8*02 CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCAGCTATGATATCAACTGGGTGCGACAGGCCACTGGACAAGGGCTTGAGTGGATGGGATGGATGAACCCTAACAGTGGTAACACAGGCTATGCACAGAAGTTCCAGGGCAGAGTCACCATGACCAGGAACACCTCCATAAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGG >IGHV2-26*01 CAGGTCACCTTGAAGGAGTCTGGTCCTGTGCTGGTGAAACCCACAGAGACCCTCACGCTGACCTGCACCGTCTCTGGGTTCTCACTCAGCAATGCTAGAATGGGTGTGAGCTGGATCCGTCAGCCCCCAGGGAAGGCCCTGGAGTGGCTTGCACACATTTTTTCGAATGACGAAAAATCCTACAGCACATCTCTGAAGAGCAGGCTCACCATCTCCAAGGACACCTCCAAAAGCCAGGTGGTCCTTACCATGACCAACATGGACCCTGTGGACACAGCCACATATTACTGTGCACGGATAC >IGHV2-5*01 
CAGATCACCTTGAAGGAGTCTGGTCCTACGCTGGTGAAACCCACACAGACCCTCACGCTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGAGTGGGTGTGGGCTGGATCCGTCAGCCCCCAGGAAAGGCCCTGGAGTGGCTTGCACTCATTTATTGGAATGATGATAAGCGCTACAGCCCATCTCTGAAGAGCAGGCTCACCATCACCAAGGACACCTCCAAAAACCAGGTGGTCCTTACAATGACCAACATGGACCCTGTGGACACAGCCACATATTACTGTGCACACAGAC >IGHV2-5*02 CAGATCACCTTGAAGGAGTCTGGTCCTACGCTGGTGAAACCCACACAGACCCTCACGCTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGAGTGGGTGTGGGCTGGATCCGTCAGCCCCCAGGAAAGGCCCTGGAGTGGCTTGCACTCATTTATTGGGATGATGATAAGCGCTACAGCCCATCTCTGAAGAGCAGGCTCACCATCACCAAGGACACCTCCAAAAACCAGGTGGTCCTTACAATGACCAACATGGACCCTGTGGACACAGCCACATATTACTGTGCACACAGAC >IGHV2-5*04 CAGATCACCTTGAAGGAGTCTGGTCCTACGCTGGTGAAACCCACACAGACCCTCACGCTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGAGTGGGTGTGGGCTGGATCCGTCAGCCCCCAGGAAAGGCCCTGGAGTGGCTTGCACTCATTTATTGGAATGATGATAAGCGCTACAGCCCATCTCTGAAGAGCAGGCTCACCATCACCAAGGACACCTCCAAAAACCAGGTGGTCCTTACAATGACCAACATGGACCCTGTGGACACAGGCACATATTACTGTGTAC >IGHV2-70*01 CAGGTCACCTTGAGGGAGTCTGGTCCTGCGCTGGTGAAACCCACACAGACCCTCACACTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGAATGTGTGTGAGCTGGATCCGTCAGCCCCCAGGGAAGGCCCTGGAGTGGCTTGCACTCATTGATTGGGATGATGATAAATACTACAGCACATCTCTGAAGACCAGGCTCACCATCTCCAAGGACACCTCCAAAAACCAGGTGGTCCTTACAATGACCAACATGGACCCTGTGGACACAGCCACGTATTACTGTGCACGGATAC >IGHV2-70*10 CAGGTCACCTTGAAGGAGTCTGGTCCTGCGCTGGTGAAACCCACACAGACCCTCACACTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGAATGCGTGTGAGCTGGATCCGTCAGCCCCCAGGGAAGGCCCTGGAGTGGATTGCACGCATTGATTGGGATGATGATAAATACTACAGCACATCTCTGAAGACCAGGCTCACCATCTCCAAGGACACCTCCAAAAACCAGGTGGTCCTTACAATGACCAACATGGACCCTGTGGACACAGCCACGTATTACTGTGCACGGATAC >IGHV2-70*11 CGGGTCACCTTGAGGGAGTCTGGTCCTGCGCTGGTGAAACCCACACAGACCCTCACACTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGAATGTGTGTGAGCTGGATCCGTCAGCCCCCAGGGAAGGCCCTGGAGTGGCTTGCACGCATTGATTGGGATGATGATAAATACTACAGCACATCTCTGAAGACCAGGCTCACCATCTCCAAGGACACCTCCAAAAACCAGGTGGTCCTTACAATGACCAACATGGACCCTGTGGACACAGCCACGTATTACTGTGCACGGATAC >IGHV2-70*13 CAGGTCACCTTGAGGGAGTCTGGTCCTGCGCTGGTGAAACCCACACAGACCCTCACACTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGAATGTGTGTGAGCTGGATCCGTCAGCCCCCAGGGAAGGCCCTGGAGTGGCTTGCACTCATTGATTGGGATGATGATAAATACTACAGCACATCTCTGAAGACCAGGCTCACCATCTCCAAGGACACCTCCAAAAACCAGGTGGTCCTTACAATGACCAACATGGACCCTGTGGACACAGCCACGTATTATTGTGCACGGATAC >IGHV2-70D*04 CAGGTCACCTTGAAGGAGTCTGGTCCTGCGCTGGTGAAACCCACACAGACCCTCACACTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGAATGCGTGTGAGCTGGATCCGTCAGCCCCCAGGGAAGGCCCTGGAGTGGCTTGCACGCATTGATTGGGATGATGATAAATTCTACAGCACATCTCTGAAGACCAGGCTCACCATCTCCAAGGACACCTCCAAAAACCAGGTGGTCCTTACAATGACCAACATGGACCCTGTGGACACAGCCACGTATTACTGTGCACGGATAC >IGHV2-70D*14 CAGGTCACCTTGAAGGAGTCTGGTCCTGCGCTGGTGAAACCCACACAGACCCTCACACTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGAATGCGTGTGAGCTGGATCCGTCAGCCCCCAGGTAAGGCCCTGGAGTGGCTTGCACGCATTGATTGGGATGATGATAAATTCTACAGCACATCTCTGAAGACCAGGCTCACCATCTCCAAGGACACCTCCAAAAACCAGGTGGTCCTTACAATGACCAACATGGACCCTGTGGACACAGCCACGTATTACTGTGCACGGATAC >IGHV3-11*01 CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCAAGCCTGGAGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTGACTACTACATGAGCTGGATCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTTCATACATTAGTAGTAGTGGTAGTACCATATACTACGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGGGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV3-11*04 CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCAAGCCTGGAGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTGACTACTACATGAGCTGGATCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTTCATACATTAGTAGTAGTGGTAGTACCATATACTACGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGGGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV3-11*06 
CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCAAGCCTGGAGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTGACTACTACATGAGCTGGATCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTTCATACATTAGTAGTAGTAGTAGTTACACAAACTACGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV3-13*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTACGACATGCACTGGGTCCGCCAAGCTACAGGAAAAGGTCTGGAGTGGGTCTCAGCTATTGGTACTGCTGGTGACACATACTATCCAGGCTCCGTGAAGGGCCGATTCACCATCTCCAGAGAAAATGCCAAGAACTCCTTGTATCTTCAAATGAACAGCCTGAGAGCCGGGGACACGGCTGTGTATTACTGTGCAAGAGA >IGHV3-13*05 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTACGACATGCACTGGGTCCGCCAAGCTACAGGAAAAGGTCTGGAGTGGGTCTCAGCTATTGGTACTGCTGGTGACCCATACTATCCAGGCTCCGTGAAGGGCCGATTCACCATCTCCAGAGAAAATGCCAAGAACTCCTTGTATCTTCAAATGAACAGCCTGAGAGCCGGGGACACGGCTGTGTATTACTGTGCAAGAGA >IGHV3-15*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTAAAGCCTGGGGGGTCCCTTAGACTCTCCTGTGCAGCCTCTGGATTCACTTTCAGTAACGCCTGGATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTGGCCGTATTAAAAGCAAAACTGATGGTGGGACAACAGACTACGCTGCACCCGTGAAAGGCAGATTCACCATCTCAAGAGATGATTCAAAAAACACGCTGTATCTGCAAATGAACAGCCTGAAAACCGAGGACACAGCCGTGTATTACTGTACCACAGA >IGHV3-15*02 GAGGTGCAGCTGGTGGAGTCTGGGGGAGCCTTGGTAAAGCCTGGGGGGTCCCTTAGACTCTCCTGTGCAGCCTCTGGATTCACTTTCAGTAACGCCTGGATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTGGCCGTATTAAAAGCAAAACTGATGGTGGGACAACAGACTACGCTGCACCCGTGAAAGGCAGATTCACCATCTCAAGAGATGATTCAAAAAACACGCTGTATCTGCAAATGAACAGCCTGAAAACCGAGGACACAGCCGTGTATTACTGTACCACAGA >IGHV3-15*04 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTAAAGCCTGGGGGGTCCCTTAGACTCTCCTGTGCAGCCTCTGGATTCACTTTCAGTAACGCCTGGATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTGGCCGTATTGAAAGCAAAACTGATGGTGGGACAACAGACTACGCTGCACCCGTGAAAGGCAGATTCACCATCTCAAGAGATGATTCAAAAAACACGCTGTATCTGCAAATGAACAGCCTGAAAACCGAGGACACAGCCGTGTATTACTGTACCACAGA >IGHV3-15*05 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTAAAGCCTGGGGGGTCCCTTAGACTCTCCTGTGCAGCCTCTGGATTCACTTTCAGTAACGCCTGGATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTGGCCGTATTAAAAGCAAAACTGATGGTGGGACAACAGACTACGCTGCACCCGTGAAAGGCAGATTCACCATCTCAAGAGATGATTCAAAAAACACGCTGTATCTGCAAATGAACAGTCTGAAAACCGAGGACACAGCCGTGTATTACTGTACCACAGA >IGHV3-15*06 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTAAAGCCTGGGGGGTCCCTTAGACTCTCCTGTGCAGCCTCTGGATTCACTTTCAGTAACGCCTGGATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCGGCCGTATTAAAAGCAAAACTGATGGTGGGACAACAAACTACGCTGCACCCGTGAAAGGCAGATTCACCATCTCAAGAGATGATTCAAAAAACACGCTGTATCTGCAAATGAACAGCCTGAAAACCGAGGACACAGCCGTGTATTACTGTACCACAGA >IGHV3-15*07 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTAAAGCCTGGGGGGTCCCTTAGACTCTCCTGTGCAGCCTCTGGTTTCACTTTCAGTAACGCCTGGATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCGGCCGTATTAAAAGCAAAACTGATGGTGGGACAACAGACTACGCTGCACCCGTGAAAGGCAGATTCACCATCTCAAGAGATGATTCAAAAAACACGCTGTATCTGCAAATGAACAGCCTGAAAACCGAGGACACAGCCGTGTATTACTGTACCACAGA >IGHV3-20*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGTGTGGTACGGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTGATGATTATGGCATGAGCTGGGTCCGCCAAGCTCCAGGGAAGGGGCTGGAGTGGGTCTCTGGTATTAATTGGAATGGTGGTAGCACAGGTTATGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCCCTGTATCTGCAAATGAACAGTCTGAGAGCCGAGGACACGGCCTTGTATCACTGTGCGAGAGA >IGHV3-21*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCCTGGTCAAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCATCCATTAGTAGTAGTAGTAGTTACATATACTACGCAGACTCAGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV3-21*02 
GAGGTGCAACTGGTGGAGTCTGGGGGAGGCCTGGTCAAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCATCCATTAGTAGTAGTAGTAGTTACATATACTACGCAGACTCAGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV3-21*03 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCCTGGTCAAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCATCCATTAGTAGTAGTAGTAGTTACATATACTACGCAGACTCAGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACAGCTGTGTATTACTGTGCGAGAGA >IGHV3-21*04 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCCTGGTCAAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCATCCATTAGTAGTAGTAGTAGTTACATATACTACGCAGACTCAGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV3-23*01 GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAAGA >IGHV3-23*04 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAAGA >IGHV3-23*05 GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTTATAGCAGTGGTAGTAGCACATACTATGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAA >IGHV3-30*02 CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCGTCTGGATTCACCTTCAGTAGCTATGGCATGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGCTGGAGTGGGTGGCATTTATACGGTATGATGGAAGTAATAAATACTATGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCTGAGGACACGGCTGTGTATTACTGTGCGAAAGA >IGHV3-30*03 CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGAGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATGGCATGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGCTGGAGTGGGTGGCAGTTATATCATATGATGGAAGTAATAAATACTATGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCTGAGGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV3-30*04 CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGAGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATGCTATGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGCTGGAGTGGGTGGCAGTTATATCATATGATGGAAGTAATAAATACTACGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCTGAGGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV3-30*18 CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGAGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATGGCATGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGCTGGAGTGGGTGGCAGTTATATCATATGATGGAAGTAATAAATACTATGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCTGAGGACACGGCTGTGTATTACTGTGCGAAAGA >IGHV3-30-3*01 CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGAGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATGCTATGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGCTGGAGTGGGTGGCAGTTATATCATATGATGGAAGCAATAAATACTACGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCTGAGGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV3-33*01 
CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGAGGTCCCTGAGACTCTCCTGTGCAGCGTCTGGATTCACCTTCAGTAGCTATGGCATGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGCTGGAGTGGGTGGCAGTTATATGGTATGATGGAAGTAATAAATACTATGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV3-33*03 CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGAGGTCCCTGAGACTCTCCTGTGCAGCGTCTGGATTCACCTTCAGTAGCTATGGCATGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGCTGGAGTGGGTGGCAGTTATATGGTATGATGGAAGTAATAAATACTATGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAACTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAAAGA >IGHV3-33*06 CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGAGGTCCCTGAGACTCTCCTGTGCAGCGTCTGGATTCACCTTCAGTAGCTATGGCATGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGCTGGAGTGGGTGGCAGTTATATGGTATGATGGAAGTAATAAATACTATGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAAAGA >IGHV3-43*01 GAAGTGCAGCTGGTGGAGTCTGGGGGAGTCGTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTGATGATTATACCATGCACTGGGTCCGTCAAGCTCCGGGGAAGGGTCTGGAGTGGGTCTCTCTTATTAGTTGGGATGGTGGTAGCACATACTATGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACAGCAAAAACTCCCTGTATCTGCAAATGAACAGTCTGAGAACTGAGGACACCGCCTTGTATTACTGTGCAAAAGATA >IGHV3-43D*01 GAAGTGCAGCTGGTGGAGTCTGGGGGAGTCGTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTGATGATTATGCCATGCACTGGGTCCGTCAAGCTCCGGGGAAGGGTCTGGAGTGGGTCTCTCTTATTAGTTGGGATGGTGGTAGCACCTACTATGCAGACTCTGTGAAGGGTCGATTCACCATCTCCAGAGACAACAGCAAAAACTCCCTGTATCTGCAAATGAACAGTCTGAGAGCTGAGGACACCGCCTTGTATTACTGTGCAAAAGATA >IGHV3-48*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTTCATACATTAGTAGTAGTAGTAGTACCATATACTACGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAATGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV3-48*02 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTTCATACATTAGTAGTAGTAGTAGTACCATATACTACGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAATGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGACGAGGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV3-48*03 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGAGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGTTATGAAATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTTCATACATTAGTAGTAGTGGTAGTACCATATACTACGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTTTATTACTGTGCGAGAGA >IGHV3-48*04 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTTCATACATTAGTAGTAGTAGTAGTACCATATACTACGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV3-49*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCAGGGCGGTCCCTGAGACTCTCCTGTACAGCTTCTGGATTCACCTTTGGTGATTATGCTATGAGCTGGTTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTAGGTTTCATTAGAAGCAAAGCTTATGGTGGGACAACAGAATACACCGCGTCTGTGAAAGGCAGATTCACCATCTCAAGAGATGGTTCCAAAAGCATCGCCTATCTGCAAATGAACAGCCTGAAAACCGAGGACACAGCCGTGTATTACTGTACTAGAGA >IGHV3-49*03 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCAGGGCGGTCCCTGAGACTCTCCTGTACAGCTTCTGGATTCACCTTTGGTGATTATGCTATGAGCTGGTTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTAGGTTTCATTAGAAGCAAAGCTTATGGTGGGACAACAGAATACGCCGCGTCTGTGAAAGGCAGATTCACCATCTCAAGAGATGATTCCAAAAGCATCGCCTATCTGCAAATGAACAGCCTGAAAACCGAGGACACAGCCGTGTATTACTGTACTAGAGA >IGHV3-49*04 
GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCAGGGCGGTCCCTGAGACTCTCCTGTACAGCTTCTGGATTCACCTTTGGTGATTATGCTATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTAGGTTTCATTAGAAGCAAAGCTTATGGTGGGACAACAGAATACGCCGCGTCTGTGAAAGGCAGATTCACCATCTCAAGAGATGATTCCAAAAGCATCGCCTATCTGCAAATGAACAGCCTGAAAACCGAGGACACAGCCGTGTATTACTGTACTAGAGA >IGHV3-49*05 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTAAAGCCAGGGCGGTCCCTGAGACTCTCCTGTACAGCTTCTGGATTCACCTTTGGTGATTATGCTATGAGCTGGTTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTAGGTTTCATTAGAAGCAAAGCTTATGGTGGGACAACAGAATACGCCGCGTCTGTGAAAGGCAGATTCACCATCTCAAGAGATGATTCCAAAAGCATCGCCTATCTGCAAATGAACAGCCTGAAAACCGAGGACACAGCCGTGTATTACTGTACTAGAGA >IGHV3-53*01 GAGGTGCAGCTGGTGGAGTCTGGAGGAGGCTTGATCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGGTTCACCGTCAGTAGCAACTACATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGTTATTTATAGCGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTTCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV3-53*02 GAGGTGCAGCTGGTGGAGACTGGAGGAGGCTTGATCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGGTTCACCGTCAGTAGCAACTACATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGTTATTTATAGCGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTTCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV3-53*04 GAGGTGCAGCTGGTGGAGTCTGGAGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGGTTCACCGTCAGTAGCAACTACATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGTTATTTATAGCGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGACACAATTCCAAGAACACGCTGTATCTTCAAATGAACAGCCTGAGAGCTGAGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV3-64*02 GAGGTGCAGCTGGTGGAGTCTGGGGAAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATGCTATGCACTGGGTCCGCCAGGCTCCAGGGAAGGGACTGGAATATGTTTCAGCTATTAGTAGTAATGGGGGTAGCACATATTATGCAGACTCTGTGAAGGGCAGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTTCAAATGGGCAGCCTGAGAGCTGAGGACATGGCTGTGTATTACTGTGCGAGAGA >IGHV3-66*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCGTCAGTAGCAACTACATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGTTATTTATAGCGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCAGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTTCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV3-66*03 GAGGTGCAGCTGGTGGAGTCTGGAGGAGGCTTGATCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGGTTCACCGTCAGTAGCAACTACATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGTTATTTATAGCTGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTTCAAATGAACAGCCTGAGAGCTGAGGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV3-7*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGTAGCTATTGGATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTGGCCAACATAAAGCAAGATGGAAGTGAGAAATACTATGTGGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV3-7*02 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGTAGCTATTGGATGAGCTGGGTCCGCCAGGCTCCAGGGAAAGGGCTGGAGTGGGTGGCCAACATAAAGCAAGATGGAAGTGAGAAATACTATGTGGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGA >IGHV3-7*03 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGTAGCTATTGGATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTGGCCAACATAAAGCAAGATGGAAGTGAGAAATACTATGTGGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV3-72*01 
GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGAGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTGACCACTACATGGACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTGGCCGTACTAGAAACAAAGCTAACAGTTACACCACAGAATACGCCGCGTCTGTGAAAGGCAGATTCACCATCTCAAGAGATGATTCAAAGAACTCACTGTATCTGCAAATGAACAGCCTGAAAACCGAGGACACGGCCGTGTATTACTGTGCTAGAGA >IGHV3-73*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGGTTCACCTTCAGTGGCTCTGCTATGCACTGGGTCCGCCAGGCTTCCGGGAAAGGGCTGGAGTGGGTTGGCCGTATTAGAAGCAAAGCTAACAGTTACGCGACAGCATATGCTGCGTCGGTGAAAGGCAGGTTCACCATCTCCAGAGATGATTCAAAGAACACGGCGTATCTGCAAATGAACAGCCTGAAAACCGAGGACACGGCCGTGTATTACTGTACTAGACA >IGHV3-73*02 GAGGTGCAGCTGGTGGAGTCCGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGGTTCACCTTCAGTGGCTCTGCTATGCACTGGGTCCGCCAGGCTTCCGGGAAAGGGCTGGAGTGGGTTGGCCGTATTAGAAGCAAAGCTAACAGTTACGCGACAGCATATGCTGCGTCGGTGAAAGGCAGGTTCACCATCTCCAGAGATGATTCAAAGAACACGGCGTATCTGCAAATGAACAGCCTGAAAACCGAGGACACGGCCGTGTATTACTGTACTAGACA >IGHV3-74*01 GAGGTGCAGCTGGTGGAGTCCGGGGGAGGCTTAGTTCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTACTGGATGCACTGGGTCCGCCAAGCTCCAGGGAAGGGGCTGGTGTGGGTCTCACGTATTAATAGTGATGGGAGTAGCACAAGCTACGCGGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACACGCTGTATCTGCAAATGAACAGTCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCAAGAGA >IGHV3-74*02 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTAGTTCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTACTGGATGCACTGGGTCCGCCAAGCTCCAGGGAAGGGGCTGGTGTGGGTCTCACGTATTAATAGTGATGGGAGTAGCACAAGCTACGCGGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACACGCTGTATCTGCAAATGAACAGTCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCAAGA >IGHV3-74*03 GAGGTGCAGCTGGTGGAGTCCGGGGGAGGCTTAGTTCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTACTGGATGCACTGGGTCCGCCAAGCTCCAGGGAAGGGGCTGGTGTGGGTCTCACGTATTAATAGTGATGGGAGTAGCACAACGTACGCGGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACACGCTGTATCTGCAAATGAACAGTCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCAAGAGA >IGHV3-9*01 GAAGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGCAGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTGATGATTATGCCATGCACTGGGTCCGGCAAGCTCCAGGGAAGGGCCTGGAGTGGGTCTCAGGTATTAGTTGGAATAGTGGTAGCATAGGCTATGCGGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCCCTGTATCTGCAAATGAACAGTCTGAGAGCTGAGGACACGGCCTTGTATTACTGTGCAAAAGATA >IGHV3-9*03 GAAGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGCAGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTGATGATTATGCCATGCACTGGGTCCGGCAAGCTCCAGGGAAGGGCCTGGAGTGGGTCTCAGGTATTAGTTGGAATAGTGGTAGCATAGGCTATGCGGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCCCTGTATCTGCAAATGAACAGTCTGAGAGCTGAGGACATGGCCTTGTATTACTGTGCAAAAGATA >IGHV4-28*01 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGACACCCTGTCCCTCACCTGCGCTGTCTCTGGTTACTCCATCAGCAGTAGTAACTGGTGGGGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTACATCTATTATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGTGGACACGGCCGTGTATTACTGTGCGAGAAA >IGHV4-28*03 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGACACCCTGTCCCTCACCTGCGCTGTCTCTGGTTACTCCATCAGCAGTAGTAACTGGTGGGGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTACATCTATTATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGTGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV4-31*01 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCTAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV4-31*03 
CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV4-34*01 CAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGAGG >IGHV4-34*02 CAGGTGCAGCTACAACAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGAGG >IGHV4-34*04 CAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACAACAACCCGTCCCTCAAGAGTCGAGCCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGAGG >IGHV4-34*08 CAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGACCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCG >IGHV4-34*10 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAATCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTACCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGATA >IGHV4-34*11 CAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCGTCAGTGGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGTATATCTATTATAGTGGGAGCACCAACAACAACCCCTCCCTCAAGAGTCGAGCCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAACCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTGCTGTGCGAGAGA >IGHV4-34*12 CAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCATTCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGA >IGHV4-38-2*01 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTCTGGTTACTCCATCAGCAGTGGTTACTACTGGGGCTGGATCCGGCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGAGTATCTATCATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCGTGTATTACTGTGCGAGA >IGHV4-38-2*02 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTTACTCCATCAGCAGTGGTTACTACTGGGGCTGGATCCGGCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGAGTATCTATCATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV4-39*01 CAGCTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTAGTAGTTACTACTGGGGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGAGTATCTATTATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCCGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCTGTGTATTACTGTGCGAGACA >IGHV4-39*02 
CAGCTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTAGTAGTTACTACTGGGGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGAGTATCTATTATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCCGTAGACACGTCCAAGAACCACTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCTGTGTATTACTGTGCGAGAGA >IGHV4-39*06 CGGCTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTAGTAGTTACTACTGGGGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGAGTATCTATTATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCCCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV4-39*07 CAGCTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTAGTAGTTACTACTGGGGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGAGTATCTATTATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV4-4*01 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTCCGGGGACCCTGTCCCTCACCTGCGCTGTCTCTGGTGGCTCCATCAGCAGTAGTAACTGGTGGAGTTGGGTCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCTATCATAGTGGGAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACAAGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTGCTGTGCGAGAGA >IGHV4-4*02 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGGGACCCTGTCCCTCACCTGCGCTGTCTCTGGTGGCTCCATCAGCAGTAGTAACTGGTGGAGTTGGGTCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCTATCATAGTGGGAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACAAGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV4-4*07 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGACTGGAGTGGATTGGGCGTATCTATACCAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV4-59*01 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCTGCGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV4-59*02 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCGTCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCTGCGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV4-59*03 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAATTCTCCCTGAAGCTGAGCTCTGTGACCGCTGCGGACACGGCCGTGTATTACTGTGCG >IGHV4-59*05 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCGCCGGGGAAGGGACTGGAGTGGATTGGGCGTATCTATTATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCCGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCTGTGTATTACTGTGCG >IGHV4-59*07 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGACACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCTGCGGACACGGCCGTGTATTACTGTGCGAGA >IGHV4-59*08 
CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCGTGTATTACTGTGCGAGACA >IGHV4-59*10 CAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGGCTGGAGTGGATTGGGCGTATCTATACCAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGATA >IGHV4-61*01 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCGTCAGCAGTGGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCTGCGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV4-61*02 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGACTGGAGTGGATTGGGCGTATCTATACCAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV4-61*03 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCGTCAGCAGTGGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCACTTCTCCCTGAAGCTGAGCTCTGTGACCGCTGCGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV4-61*05 CAGCTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTAGTAGTTACTACTGGGGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACAAGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGA >IGHV4-61*08 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCGTCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCTGCGGACACGGCCGTGTATTACTGTGCGAGAGA >IGHV5-10-1*03 GAAGTGCAGCTGGTGCAGTCCGGAGCAGAGGTGAAAAAGCCCGGGGAGTCTCTGAGGATCTCCTGTAAGGGTTCTGGATACAGCTTTACCAGCTACTGGATCAGCTGGGTGCGCCAGATGCCCGGGAAAGGCCTGGAGTGGATGGGGAGGATTGATCCTAGTGACTCTTATACCAACTACAGCCCGTCCTTCCAAGGCCACGTCACCATCTCAGCTGACAAGTCCATCAGCACTGCCTACCTGCAGTGGAGCAGCCTGAAGGCCTCGGACACCGCCATGTATTACTGTGCGAGA >IGHV5-51*01 GAGGTGCAGCTGGTGCAGTCTGGAGCAGAGGTGAAAAAGCCCGGGGAGTCTCTGAAGATCTCCTGTAAGGGTTCTGGATACAGCTTTACCAGCTACTGGATCGGCTGGGTGCGCCAGATGCCCGGGAAAGGCCTGGAGTGGATGGGGATCATCTATCCTGGTGACTCTGATACCAGATACAGCCCGTCCTTCCAAGGCCAGGTCACCATCTCAGCCGACAAGTCCATCAGCACCGCCTACCTGCAGTGGAGCAGCCTGAAGGCCTCGGACACCGCCATGTATTACTGTGCGAGACA >IGHV5-51*03 GAGGTGCAGCTGGTGCAGTCTGGAGCAGAGGTGAAAAAGCCGGGGGAGTCTCTGAAGATCTCCTGTAAGGGTTCTGGATACAGCTTTACCAGCTACTGGATCGGCTGGGTGCGCCAGATGCCCGGGAAAGGCCTGGAGTGGATGGGGATCATCTATCCTGGTGACTCTGATACCAGATACAGCCCGTCCTTCCAAGGCCAGGTCACCATCTCAGCCGACAAGTCCATCAGCACCGCCTACCTGCAGTGGAGCAGCCTGAAGGCCTCGGACACCGCCATGTATTACTGTGCGAGA >IGHV6-1*01 CAGGTACAGCTGCAGCAGTCAGGTCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCAACAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTGGGAAGGACATACTACAGGTCCAAGTGGTATAATGATTATGCAGTATCTGTGAAAAGTCGAATAACCATCAACCCAGACACATCCAAGAACCAGTTCTCCCTGCAGCTGAACTCTGTGACTCCCGAGGACACGGCTGTGTATTACTGTGCAAGAGA >IGHV6-1*02 
CAGGTACAGCTGCAGCAGTCAGGTCCGGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCAACAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTGGGAAGGACATACTACAGGTCCAAGTGGTATAATGATTATGCAGTATCTGTGAAAAGTCGAATAACCATCAACCCAGACACATCCAAGAACCAGTTCTCCCTGCAGCTGAACTCTGTGACTCCCGAGGACACGGCTGTGTATTACTGTGCAAGAGA >IGHV7-4-1*01 CAGGTGCAGCTGGTGCAATCTGGGTCTGAGTTGAAGAAGCCTGGGGCCTCAGTGAAGGTTTCCTGCAAGGCTTCTGGATACACCTTCACTAGCTATGCTATGAATTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACACCAACACTGGGAACCCAACGTATGCCCAGGGCTTCACAGGACGGTTTGTCTTCTCCTTGGACACCTCTGTCAGCACGGCATATCTGCAGATCTGCAGCCTAAAGGCTGAGGACACTGCCGTGTATTACTGTGCGAGA
IgDiscover-0.11/tests/data/H1/candidates.tab.gz000066400000000000000000001231641337725263500212460ustar00rootroot00000000000000[binary gzip data omitted]
IgDiscover-0.11/tests/data/H1/expected.tab000066400000000000000000000520021337725263500203210ustar00rootroot00000000000000name source chain cluster cluster_size Js CDR3s exact Js_exact CDR3s_exact CDR3_exact_ratio database_diff has_stop looks_like_V CDR3_start whitelist_diff closest_whitelist consensus IGHV1-18*01 IGHV1-18*01 VH all 38082 12 9605 18063 10 4649 3.9 0 0 1 288 0 IGHV1-18*01 CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA IGHV1-2*02 IGHV1-2*02 VH all 18707 13 4522 8398 10 2133 3.9 0 0 1 288 0 IGHV1-2*02 CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCAGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA IGHV1-2*04 IGHV1-2*04 VH all 3081 9 731 1015 5 277 3.7 0 0 1 288 0 IGHV1-2*04 CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCTGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA IGHV1-24*01 IGHV1-24*01 VH all 1303 7 341 735 7 222 3.3 0 0 1 288 0 IGHV1-24*01 CAGGTCCAGCTGGTACAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGTTTCCGGATACACCCTCACTGAATTATCCATGCACTGGGTGCGACAGGCTCCTGGAAAAGGGCTTGAGTGGATGGGAGGTTTTGATCCTGAAGATGGTGAAACAATCTACGCACAGAAGTTCCAGGGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGA IGHV1-3*01 IGHV1-3*01 VH all 14794 9 3281 6096 7 1431 4.3 0 0 1 288 0 IGHV1-3*01 CAGGTCCAGCTTGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTTTCCTGCAAGGCTTCTGGATACACCTTCACTAGCTATGCTATGCATTGGGTGCGCCAGGCCCCCGGACAAAGGCTTGAGTGGATGGGATGGATCAACGCTGGCAATGGTAACACAAAATATTCACAGAAGTTCCAGGGCAGAGTCACCATTACCAGGGACACATCCGCGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAAGACACGGCTGTGTATTACTGTGCGAGAGA IGHV1-46*01 IGHV1-46*01 VH all 30999 11 7474 14728 9 3665 4.0 0 0 1 288 0 IGHV1-46*01 CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTTTCCTGCAAGGCATCTGGATACACCTTCACCAGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAATAATCAACCCTAGTGGTGGTAGCACAAGCTACGCACAGAAGTTCCAGGGCAGAGTCACCATGACCAGGGACACGTCCACGAGCACAGTCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGA IGHV1-58*01 IGHV1-58*01 VH all 2038 6 512 1023 6 281 3.6 0 0 1 288 0 IGHV1-58*01
CAAATGCAGCTGGTGCAGTCTGGGCCTGAGGTGAAGAAGCCTGGGACCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATTCACCTTTACTAGCTCTGCTGTGCAGTGGGTGCGACAGGCTCGTGGACAACGCCTTGAGTGGATAGGATGGATCGTCGTTGGCAGTGGTAACACAAACTACGCACAGAAGTTCCAGGAAAGAGTCACCATTACCAGGGACATGTCCACAAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCCGAGGACACGGCCGTGTATTACTGTGCGGCAGA IGHV1-69*01 IGHV1-69*01 VH all 41033 12 10647 21253 10 6013 3.5 0 0 1 288 0 IGHV1-69*01 CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGTCCTCGGTGAAGGTCTCCTGCAAGGCTTCTGGAGGCACCTTCAGCAGCTATGCTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAGGGATCATCCCTATCTTTGGTACAGCAAACTACGCACAGAAGTTCCAGGGCAGAGTCACGATTACCGCGGACGAATCCACGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGA IGHV1-69*06 IGHV1-69*06 VH all 13719 11 3718 7062 9 2100 3.4 0 0 1 288 0 IGHV1-69*06 CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGTCCTCGGTGAAGGTCTCCTGCAAGGCTTCTGGAGGCACCTTCAGCAGCTATGCTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGAGGGATCATCCCTATCTTTGGTACAGCAAACTACGCACAGAAGTTCCAGGGCAGAGTCACGATTACCGCGGACAAATCCACGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGA IGHV1-69-2*01 IGHV1-69-2*01 VH all 416 5 115 235 5 70 3.4 0 0 1 288 0 IGHV1-69-2*01 GAGGTCCAGCTGGTACAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCTACAGTGAAAATCTCCTGCAAGGTTTCTGGATACACCTTCACCGACTACTACATGCACTGGGTGCAACAGGCCCCTGGAAAAGGGCTTGAGTGGATGGGACTTGTTGATCCTGAAGATGGTGAAACAATATACGCAGAGAAGTTCCAGGGCAGAGTCACCATAACCGCGGACACGTCTACAGACACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGA IGHV1-8*01 IGHV1-8*01 VH all 12011 11 3186 6200 9 1760 3.5 0 0 1 288 0 IGHV1-8*01 CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCAGTTATGATATCAACTGGGTGCGACAGGCCACTGGACAAGGGCTTGAGTGGATGGGATGGATGAACCCTAACAGTGGTAACACAGGCTATGCACAGAAGTTCCAGGGCAGAGTCACCATGACCAGGAACACCTCCATAAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGG IGHV2-26*01 IGHV2-26*01 VH all 4191 10 968 2112 6 511 4.1 0 0 1 291 0 IGHV2-26*01 CAGGTCACCTTGAAGGAGTCTGGTCCTGTGCTGGTGAAACCCACAGAGACCCTCACGCTGACCTGCACCGTCTCTGGGTTCTCACTCAGCAATGCTAGAATGGGTGTGAGCTGGATCCGTCAGCCCCCAGGGAAGGCCCTGGAGTGGCTTGCACACATTTTTTCGAATGACGAAAAATCCTACAGCACATCTCTGAAGAGCAGGCTCACCATCTCCAAGGACACCTCCAAAAGCCAGGTGGTCCTTACCATGACCAACATGGACCCTGTGGACACAGCCACATATTACTGTGCACGGATAC IGHV2-5*01 IGHV2-5*01 VH all 4418 9 1039 1888 6 478 4.0 0 0 1 291 0 IGHV2-5*01 CAGATCACCTTGAAGGAGTCTGGTCCTACGCTGGTGAAACCCACACAGACCCTCACGCTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGAGTGGGTGTGGGCTGGATCCGTCAGCCCCCAGGAAAGGCCCTGGAGTGGCTTGCACTCATTTATTGGAATGATGATAAGCGCTACAGCCCATCTCTGAAGAGCAGGCTCACCATCACCAAGGACACCTCCAAAAACCAGGTGGTCCTTACAATGACCAACATGGACCCTGTGGACACAGCCACATATTACTGTGCACACAGAC IGHV2-5*02 IGHV2-5*02 VH all 2028 10 493 744 6 205 3.6 0 0 1 291 0 IGHV2-5*02 CAGATCACCTTGAAGGAGTCTGGTCCTACGCTGGTGAAACCCACACAGACCCTCACGCTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGAGTGGGTGTGGGCTGGATCCGTCAGCCCCCAGGAAAGGCCCTGGAGTGGCTTGCACTCATTTATTGGGATGATGATAAGCGCTACAGCCCATCTCTGAAGAGCAGGCTCACCATCACCAAGGACACCTCCAAAAACCAGGTGGTCCTTACAATGACCAACATGGACCCTGTGGACACAGCCACATATTACTGTGCACACAGAC IGHV2-70*01 IGHV2-70*01 VH all 4746 9 1152 2400 7 566 4.2 0 0 1 291 0 IGHV2-70*01 CAGGTCACCTTGAGGGAGTCTGGTCCTGCGCTGGTGAAACCCACACAGACCCTCACACTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGAATGTGTGTGAGCTGGATCCGTCAGCCCCCAGGGAAGGCCCTGGAGTGGCTTGCACTCATTGATTGGGATGATGATAAATACTACAGCACATCTCTGAAGACCAGGCTCACCATCTCCAAGGACACCTCCAAAAACCAGGTGGTCCTTACAATGACCAACATGGACCCTGTGGACACAGCCACGTATTACTGTGCACGGATAC IGHV2-70D*04 IGHV2-70D*04 VH all 2957 8 652 1144 7 246 4.7 0 0 1 291 0 IGHV2-70D*04 
CAGGTCACCTTGAAGGAGTCTGGTCCTGCGCTGGTGAAACCCACACAGACCCTCACACTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGAATGCGTGTGAGCTGGATCCGTCAGCCCCCAGGGAAGGCCCTGGAGTGGCTTGCACGCATTGATTGGGATGATGATAAATTCTACAGCACATCTCTGAAGACCAGGCTCACCATCTCCAAGGACACCTCCAAAAACCAGGTGGTCCTTACAATGACCAACATGGACCCTGTGGACACAGCCACGTATTACTGTGCACGGATAC IGHV3-11*01 IGHV3-11*01 VH all 1787 8 506 822 7 264 3.1 0 0 1 288 0 IGHV3-11*01 CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCAAGCCTGGAGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTGACTACTACATGAGCTGGATCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTTCATACATTAGTAGTAGTGGTAGTACCATATACTACGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGGGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTGTATTACTGTGCGAGAGA IGHV3-11*06 IGHV3-11*06 VH all 221 7 80 92 5 38 2.4 0 0 1 288 0 IGHV3-11*06 CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCAAGCCTGGAGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTGACTACTACATGAGCTGGATCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTTCATACATTAGTAGTAGTAGTAGTTACACAAACTACGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA IGHV3-13*01 IGHV3-13*01 VH all 821 6 199 380 5 97 3.9 0 0 1 285 0 IGHV3-13*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTACGACATGCACTGGGTCCGCCAAGCTACAGGAAAAGGTCTGGAGTGGGTCTCAGCTATTGGTACTGCTGGTGACACATACTATCCAGGCTCCGTGAAGGGCCGATTCACCATCTCCAGAGAAAATGCCAAGAACTCCTTGTATCTTCAAATGAACAGCCTGAGAGCCGGGGACACGGCTGTGTATTACTGTGCAAGAGA IGHV3-13*01_S2321 IGHV3-13*01_S2321 VH all 805 6 192 324 5 86 3.8 0 0 1 285 3 IGHV3-13*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTACGACATGCACTGGGTCCGCCAAGCTACAGGAAAAGGTCTGGAGTGGGTCTCAGCTATTGGTACTGCTGGTGACACATACTATCCAGGCTCCGTGAAGGGCCGATTCACCATCTCCAGAGAAAATGCCAAGAACTCCTTGTATCTTCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTGTATTACTGTGCAAGAG IGHV3-15*01 IGHV3-15*01 VH all 13813 11 2773 5650 9 1217 4.6 0 0 1 294 0 IGHV3-15*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTAAAGCCTGGGGGGTCCCTTAGACTCTCCTGTGCAGCCTCTGGATTCACTTTCAGTAACGCCTGGATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTGGCCGTATTAAAAGCAAAACTGATGGTGGGACAACAGACTACGCTGCACCCGTGAAAGGCAGATTCACCATCTCAAGAGATGATTCAAAAAACACGCTGTATCTGCAAATGAACAGCCTGAAAACCGAGGACACAGCCGTGTATTACTGTACCACAGA IGHV3-15*07 IGHV3-15*07 VH all 16932 11 3269 6926 8 1394 5.0 0 0 1 294 0 IGHV3-15*07 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTAAAGCCTGGGGGGTCCCTTAGACTCTCCTGTGCAGCCTCTGGTTTCACTTTCAGTAACGCCTGGATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCGGCCGTATTAAAAGCAAAACTGATGGTGGGACAACAGACTACGCTGCACCCGTGAAAGGCAGATTCACCATCTCAAGAGATGATTCAAAAAACACGCTGTATCTGCAAATGAACAGCCTGAAAACCGAGGACACAGCCGTGTATTACTGTACCACAGA IGHV3-20*01 IGHV3-20*01 VH all 529 6 170 200 6 61 3.3 0 0 1 288 0 IGHV3-20*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGTGTGGTACGGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTGATGATTATGGCATGAGCTGGGTCCGCCAAGCTCCAGGGAAGGGGCTGGAGTGGGTCTCTGGTATTAATTGGAATGGTGGTAGCACAGGTTATGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCCCTGTATCTGCAAATGAACAGTCTGAGAGCCGAGGACACGGCCTTGTATCACTGTGCGAGAGA IGHV3-20*01_S7413 IGHV3-20*01_S7413 VH all 369 6 114 152 5 50 3.0 0 0 1 288 2 IGHV3-20*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGTGTGGTACGGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTGATGATTATGGCATGAGCTGGGTCCGCCAAGCTCCAGGGAAGGGGCTGGAGTGGGTCTCTGGTATTAATTGGAATGGTGGTAGCACAGGTTATGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCCCTGTATCTGCAAATGAACAGTCTGAGAGCCGAGGACACGGCCTTGTATTACTGTGCGAGAG IGHV3-21*01 IGHV3-21*01 VH all 14645 9 3602 6690 8 1831 3.6 0 0 1 288 0 IGHV3-21*01 
GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCCTGGTCAAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCATCCATTAGTAGTAGTAGTAGTTACATATACTACGCAGACTCAGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA IGHV3-23*01 IGHV3-23*01 VH all 4960 10 1433 1736 6 612 2.8 0 0 1 288 0 IGHV3-23*01 GAGGTGCAGCTGTTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGCAGCTATGCCATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGCTATTAGTGGTAGTGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGGTTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTATATTACTGTGCGAAAGA IGHV3-30*03_S9223 IGHV3-30*03_S9223 VH all 379 3 53 175 3 16 10.9 0 0 1 288 16 IGHV3-30*03 CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGAGGTCCCTGAGACTCTCCTGTGTAGTCTCTGGATTCACCTTCAGTAGTTATGGCATACACTGGGTCCGTCAGGCTCCAGTCAAGGGGCTGGAGTGGGTGGCAGTTATATCACATGATGGAAGTACTAAGTACTATGCAGACTCCGTGAAGGGCCGATTCACCATCTCCCGAGACAATTCCAAGAACACATTGTATCTGCAAATGAACAGCCTGACATTTGAGGACACGGCTGTGTATTACTGTGCGAGGGA IGHV3-30-5*01 IGHV3-30-5*01 VH all 25649 11 5549 12485 9 2841 4.4 0 0 1 288 0 IGHV3-30-5*01 CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGAGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATGGCATGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGCTGGAGTGGGTGGCAGTTATATCATATGATGGAAGTAATAAATACTATGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCTGAGGACACGGCTGTGTATTACTGTGCGAAAGA IGHV3-33*01 IGHV3-33*01 VH all 19630 9 4104 8200 9 1900 4.3 0 0 1 288 0 IGHV3-33*01 CAGGTGCAGCTGGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGAGGTCCCTGAGACTCTCCTGTGCAGCGTCTGGATTCACCTTCAGTAGCTATGGCATGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGCTGGAGTGGGTGGCAGTTATATGGTATGATGGAAGTAATAAATACTATGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA IGHV3-43*01 IGHV3-43*01 VH all 1738 8 467 862 7 249 3.5 0 0 1 288 0 IGHV3-43*01 GAAGTGCAGCTGGTGGAGTCTGGGGGAGTCGTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTGATGATTATACCATGCACTGGGTCCGTCAAGCTCCGGGGAAGGGTCTGGAGTGGGTCTCTCTTATTAGTTGGGATGGTGGTAGCACATACTATGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACAGCAAAAACTCCCTGTATCTGCAAATGAACAGTCTGAGAACTGAGGACACCGCCTTGTATTACTGTGCAAAAGATA IGHV3-43D*01 IGHV3-43D*01 VH all 1228 6 318 553 6 143 3.9 0 0 1 288 0 IGHV3-43D*01 GAAGTGCAGCTGGTGGAGTCTGGGGGAGTCGTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTGATGATTATGCCATGCACTGGGTCCGTCAAGCTCCGGGGAAGGGTCTGGAGTGGGTCTCTCTTATTAGTTGGGATGGTGGTAGCACCTACTATGCAGACTCTGTGAAGGGTCGATTCACCATCTCCAGAGACAACAGCAAAAACTCCCTGTATCTGCAAATGAACAGTCTGAGAGCTGAGGACACCGCCTTGTATTACTGTGCAAAAGATA IGHV3-48*02 IGHV3-48*02 VH all 6851 11 1741 2591 8 763 3.4 0 0 1 288 0 IGHV3-48*02 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTTCATACATTAGTAGTAGTAGTAGTACCATATACTACGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAATGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGACGAGGACACGGCTGTGTATTACTGTGCGAGAGA IGHV3-48*04 IGHV3-48*04 VH all 8560 11 2086 3710 8 1004 3.7 0 0 1 288 0 IGHV3-48*04 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTTCATACATTAGTAGTAGTAGTAGTACCATATACTACGCAGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA IGHV3-49*03 IGHV3-49*03 VH all 3686 8 983 1798 7 482 3.7 0 0 1 294 0 IGHV3-49*03 
GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCAGGGCGGTCCCTGAGACTCTCCTGTACAGCTTCTGGATTCACCTTTGGTGATTATGCTATGAGCTGGTTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTAGGTTTCATTAGAAGCAAAGCTTATGGTGGGACAACAGAATACGCCGCGTCTGTGAAAGGCAGATTCACCATCTCAAGAGATGATTCCAAAAGCATCGCCTATCTGCAAATGAACAGCCTGAAAACCGAGGACACAGCCGTGTATTACTGTACTAGAGA IGHV3-49*05 IGHV3-49*05 VH all 2237 9 584 1149 7 327 3.5 0 0 1 294 0 IGHV3-49*05 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTAAAGCCAGGGCGGTCCCTGAGACTCTCCTGTACAGCTTCTGGATTCACCTTTGGTGATTATGCTATGAGCTGGTTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTAGGTTTCATTAGAAGCAAAGCTTATGGTGGGACAACAGAATACGCCGCGTCTGTGAAAGGCAGATTCACCATCTCAAGAGATGATTCCAAAAGCATCGCCTATCTGCAAATGAACAGCCTGAAAACCGAGGACACAGCCGTGTATTACTGTACTAGAGA IGHV3-53*01 IGHV3-53*01 VH all 16832 10 3372 5862 9 1384 4.2 0 0 1 285 0 IGHV3-53*01 GAGGTGCAGCTGGTGGAGTCTGGAGGAGGCTTGATCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGGTTCACCGTCAGTAGCAACTACATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCAGTTATTTATAGCGGTGGTAGCACATACTACGCAGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTTCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTGTATTACTGTGCGAGAGA IGHV3-64*02 IGHV3-64*02 VH all 133 5 33 41 4 12 3.4 0 0 1 288 0 IGHV3-64*02 GAGGTGCAGCTGGTGGAGTCTGGGGAAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATGCTATGCACTGGGTCCGCCAGGCTCCAGGGAAGGGACTGGAATATGTTTCAGCTATTAGTAGTAATGGGGGTAGCACATATTATGCAGACTCTGTGAAGGGCAGATTCACCATCTCCAGAGACAATTCCAAGAACACGCTGTATCTTCAAATGGGCAGCCTGAGAGCTGAGGACATGGCTGTGTATTACTGTGCGAGAGA IGHV3-7*01 IGHV3-7*01 VH all 9716 11 2189 2560 7 729 3.5 0 0 1 288 0 IGHV3-7*01 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGTAGCTATTGGATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTGGCCAACATAAAGCAAGATGGAAGTGAGAAATACTATGTGGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGAGA IGHV3-7*03 IGHV3-7*03 VH all 6298 9 1498 2114 6 623 3.4 0 0 1 288 0 IGHV3-7*03 GAGGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTAGTAGCTATTGGATGAGCTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTGGCCAACATAAAGCAAGATGGAAGTGAGAAATACTATGTGGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCCGTGTATTACTGTGCGAGAGA IGHV3-73*02 IGHV3-73*02 VH all 6251 10 1185 1736 6 402 4.3 0 0 1 294 0 IGHV3-73*02 GAGGTGCAGCTGGTGGAGTCCGGGGGAGGCTTGGTCCAGCCTGGGGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGGTTCACCTTCAGTGGCTCTGCTATGCACTGGGTCCGCCAGGCTTCCGGGAAAGGGCTGGAGTGGGTTGGCCGTATTAGAAGCAAAGCTAACAGTTACGCGACAGCATATGCTGCGTCGGTGAAAGGCAGGTTCACCATCTCCAGAGATGATTCAAAGAACACGGCGTATCTGCAAATGAACAGCCTGAAAACCGAGGACACGGCCGTGTATTACTGTACTAGACA IGHV3-74*01 IGHV3-74*01 VH all 7208 9 1584 2390 8 586 4.1 0 0 1 288 0 IGHV3-74*01 GAGGTGCAGCTGGTGGAGTCCGGGGGAGGCTTAGTTCAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTACTGGATGCACTGGGTCCGCCAAGCTCCAGGGAAGGGGCTGGTGTGGGTCTCACGTATTAATAGTGATGGGAGTAGCACAAGCTACGCGGACTCCGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACACGCTGTATCTGCAAATGAACAGTCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCAAGAGA IGHV3-9*01 IGHV3-9*01 VH all 805 6 261 373 6 126 3.0 0 0 1 288 0 IGHV3-9*01 GAAGTGCAGCTGGTGGAGTCTGGGGGAGGCTTGGTACAGCCTGGCAGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTTGATGATTATGCCATGCACTGGGTCCGGCAAGCTCCAGGGAAGGGCCTGGAGTGGGTCTCAGGTATTAGTTGGAATAGTGGTAGCATAGGCTATGCGGACTCTGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCCCTGTATCTGCAAATGAACAGTCTGAGAGCTGAGGACACGGCCTTGTATTACTGTGCAAAAGATA IGHV4-28*01 IGHV4-28*01 VH all 55 5 23 24 4 11 2.2 0 0 1 288 0 IGHV4-28*01 
CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGACACCCTGTCCCTCACCTGCGCTGTCTCTGGTTACTCCATCAGCAGTAGTAACTGGTGGGGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTACATCTATTATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGTGGACACGGCCGTGTATTACTGTGCGAGAAA IGHV4-31*03 IGHV4-31*03 VH all 5511 11 1727 2519 8 905 2.8 0 0 1 291 0 IGHV4-31*03 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTGGTTACTACTGGAGCTGGATCCGCCAGCACCCAGGGAAGGGCCTGGAGTGGATTGGGTACATCTATTACAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTTACCATATCAGTAGACACGTCTAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACTGCCGCGGACACGGCCGTGTATTACTGTGCGAGAGA IGHV4-34*01 IGHV4-34*01 VH all 38338 12 11990 18928 10 6988 2.7 0 0 1 285 0 IGHV4-34*01 CAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGAGG IGHV4-38-2*02 IGHV4-38-2*02 VH all 5678 9 1579 2099 7 720 2.9 0 0 1 288 0 IGHV4-38-2*02 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTTACTCCATCAGCAGTGGTTACTACTGGGGCTGGATCCGGCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGAGTATCTATCATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCGTGTATTACTGTGCGAGAGA IGHV4-39*01 IGHV4-39*01 VH all 12250 10 3346 5180 7 1708 3.0 0 0 1 291 0 IGHV4-39*01 CAGCTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTAGTAGTTACTACTGGGGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGAGTATCTATTATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCCGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCTGTGTATTACTGTGCGAGACA IGHV4-39*07 IGHV4-39*07 VH all 19396 11 5780 7910 9 2829 2.8 0 0 1 291 0 IGHV4-39*07 CAGCTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTAGTAGTTACTACTGGGGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGAGTATCTATTATAGTGGGAGCACCTACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGAGA IGHV4-4*02 IGHV4-4*02 VH all 4465 8 1382 2030 7 734 2.8 0 0 1 288 0 IGHV4-4*02 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGGGACCCTGTCCCTCACCTGCGCTGTCTCTGGTGGCTCCATCAGCAGTAGTAACTGGTGGAGTTGGGTCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCTATCATAGTGGGAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACAAGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGAGA IGHV4-4*07 IGHV4-4*07 VH all 5043 10 1391 1886 7 678 2.8 0 0 1 285 0 IGHV4-4*07 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGACTGGAGTGGATTGGGCGTATCTATACCAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATGTCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCCGTGTATTACTGTGCGAGAGA IGHV4-59*01 IGHV4-59*01 VH all 16798 10 5022 7023 9 2549 2.8 0 0 1 285 0 IGHV4-59*01 CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCTGCGGACACGGCCGTGTATTACTGTGCGAGAGA IGHV4-61*01 IGHV4-61*01 VH all 4946 8 1515 1862 8 709 2.6 0 0 1 291 0 IGHV4-61*01 
CAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCGTCAGCAGTGGTAGTTACTACTGGAGCTGGATCCGGCAGCCCCCAGGGAAGGGACTGGAGTGGATTGGGTATATCTATTACAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCTGCGGACACGGCCGTGTATTACTGTGCGAGAGA IGHV5-10-1*03 IGHV5-10-1*03 VH all 4317 7 1046 1720 7 476 3.6 0 0 1 288 0 IGHV5-10-1*03 GAAGTGCAGCTGGTGCAGTCCGGAGCAGAGGTGAAAAAGCCCGGGGAGTCTCTGAGGATCTCCTGTAAGGGTTCTGGATACAGCTTTACCAGCTACTGGATCAGCTGGGTGCGCCAGATGCCCGGGAAAGGCCTGGAGTGGATGGGGAGGATTGATCCTAGTGACTCTTATACCAACTACAGCCCGTCCTTCCAAGGCCACGTCACCATCTCAGCTGACAAGTCCATCAGCACTGCCTACCTGCAGTGGAGCAGCCTGAAGGCCTCGGACACCGCCATGTATTACTGTGCGAGA IGHV5-51*01 IGHV5-51*01 VH all 30679 10 7649 11953 9 3386 3.5 0 0 1 288 0 IGHV5-51*01 GAGGTGCAGCTGGTGCAGTCTGGAGCAGAGGTGAAAAAGCCCGGGGAGTCTCTGAAGATCTCCTGTAAGGGTTCTGGATACAGCTTTACCAGCTACTGGATCGGCTGGGTGCGCCAGATGCCCGGGAAAGGCCTGGAGTGGATGGGGATCATCTATCCTGGTGACTCTGATACCAGATACAGCCCGTCCTTCCAAGGCCAGGTCACCATCTCAGCCGACAAGTCCATCAGCACCGCCTACCTGCAGTGGAGCAGCCTGAAGGCCTCGGACACCGCCATGTATTACTGTGCGAGACA IGHV6-1*01 IGHV6-1*01 VH all 18773 11 3603 6639 8 1316 5.0 0 0 1 297 0 IGHV6-1*01 CAGGTACAGCTGCAGCAGTCAGGTCCAGGACTGGTGAAGCCCTCGCAGACCCTCTCACTCACCTGTGCCATCTCCGGGGACAGTGTCTCTAGCAACAGTGCTGCTTGGAACTGGATCAGGCAGTCCCCATCGAGAGGCCTTGAGTGGCTGGGAAGGACATACTACAGGTCCAAGTGGTATAATGATTATGCAGTATCTGTGAAAAGTCGAATAACCATCAACCCAGACACATCCAAGAACCAGTTCTCCCTGCAGCTGAACTCTGTGACTCCCGAGGACACGGCTGTGTATTACTGTGCAAGAGA IGHV7-4-1*01 IGHV7-4-1*01 VH all 140 5 42 44 3 14 3.1 0 0 1 288 0 IGHV7-4-1*01 CAGGTGCAGCTGGTGCAATCTGGGTCTGAGTTGAAGAAGCCTGGGGCCTCAGTGAAGGTTTCCTGCAAGGCTTCTGGATACACCTTCACTAGCTATGCTATGAATTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACACCAACACTGGGAACCCAACGTATGCCCAGGGCTTCACAGGACGGTTTGTCTTCTCCTTGGACACCTCTGTCAGCACGGCATATCTGCAGATCTGCAGCCTAAAGGCTGAGGACACTGCCGTGTATTACTGTGCGAGA
IgDiscover-0.11/tests/data/H1/test.sh000077500000000000000000000005001337725263500173420ustar00rootroot00000000000000#!/bin/bash
set -euo pipefail
igdiscover germlinefilter --whitelist=V.fasta --max-differences=0 --unique-CDR3=5 --cluster-size=100 --unique-J=3 --cross-mapping-ratio=0.02 --allele-ratio=0.1 candidates.tab.gz > new_V_germline.tab
diff -U 0 <(cut -f1 expected.tab) <(cut -f1 new_V_germline.tab) | grep '^[+-]' | sed 1,2d
IgDiscover-0.11/tests/data/README.md000066400000000000000000000005061337725263500170410ustar00rootroot00000000000000# clusterplot test data The `clusterplot.tab.gz` file was extracted from the SRR5408020 dataset with this command:
( zcat iteration-01/filtered.tab.gz | head -n 1
zcat iteration-01/filtered.tab.gz | awk -vFS='\t' '$19==0' | grep '\bIGHV1-18\*01\b' | head -n 10 ) | cut -f2,19,38 | gzip > clusterplot.tab.gz
IgDiscover-0.11/tests/data/clusterplot.tab.gz000066400000000000000000000004601337725263500212500ustar00rootroot00000000000000[binary gzip data omitted]
IgDiscover-0.11/tests/data/duplicate-name.fasta >record1 ACGT >record2 AACCGGTT >record2 TTGGAACC
IgDiscover-0.11/tests/data/duplicate-sequence.fasta000066400000000000000000000000561337725263500223620ustar00rootroot00000000000000>record1 ACGT >record2 AACCGGTT >record3 acgt
IgDiscover-0.11/tests/data/empty-record.fasta000066400000000000000000000000521337725263500212100ustar00rootroot00000000000000>record1 ACGTACGT >record2 >record3 ACGT
IgDiscover-0.11/tests/data/ungrouped.fasta000066400000000000000000000010011337725263500206050ustar00rootroot00000000000000>group1 AAAAGGAGGCTCCTCTTTGTGGTGGCAGCAGCTACAGGTGC >group2 CCCCCCATGGACACGCTTTGCTCCACGCTCCTGCTGCTGTCCGT >group3 CCCCCCATGGACACGCTTTGCTCCACGCTCCTGCTGCTGACAGT >group4a AACCAACTTCTCACCGAGCAGTTGTCGAA >group4b
AACCAACTTCTGACCGAGCAGTTGTCGAA >group5a TTAACCCGGCTAAGGCAGACGTAGTACGAGCCTACGTGGTT >group5b TTAACCCGGCTAAGGCAGACGTAGTACGAGCCTACGTGGTT >group5c TTAACCCGGCTCCCCCCCCCCCCCCCCCAGCCTACGTGGTT >group6a TTTTGGGCAAGAACATGAAGCACCTGTGGT >group6b TTTTGGGGGCAAGAACATGAAGCACCTGTGGT >group6c TTTTGGGGGGGCAATTTTTTTTTTTTTTTGTGGT
IgDiscover-0.11/tests/results/000077500000000000000000000000001337725263500163515ustar00rootroot00000000000000IgDiscover-0.11/tests/results/grouped-by-barcode-only.fasta000066400000000000000000000005011337725263500240160ustar00rootroot00000000000000>group1;barcode=AAAA;size=1; GGAGGCTCCTCTTTGTGGTGGCAGCAGCTACAGGTGC >group4a;barcode=AACC;size=2; AACTTCTCACCGAGCAGTTGTCGAA >group2;barcode=CCCC;size=2; CCATGGACACGCTTTGCTCCACGCTCCTGCTGCTGTCCGT >consensus1;barcode=TTAA;size=3; CCCGGCTAAGGCAGACGTAGTACGAGCCTACGTGGTT >group6a;barcode=TTTT;size=3; GGGCAAGAACATGAAGCACCTGTGGT
IgDiscover-0.11/tests/results/grouped.fasta000066400000000000000000000006731337725263500210430ustar00rootroot00000000000000>group1;barcode=AAAA;cdr3=GGT;size=1; AGGCTCCTCTTTGTGGTGGCAGCAGCTACAGGTGC >group4a;barcode=AACC;cdr3=TCG;size=2; AACTTCTCACCGAGCAGTTGTCGAA >group3;barcode=CCCC;cdr3=ACA;size=1; CCATGGACACGCTTTGCTCCACGCTCCTGCTGCTGACAGT >group2;barcode=CCCC;cdr3=TCC;size=1; CCATGGACACGCTTTGCTCCACGCTCCTGCTGCTGTCCGT >consensus1;barcode=TTAA;cdr3=TGG;size=3; CCCGGCTAAGGCAGACGTAGTACGAGCCTACGTGGTT >consensus2;barcode=TTTT;cdr3=GTG;size=3; CAAGAACATGAAGCACCTGTGGT
IgDiscover-0.11/tests/results/grouped2.fasta000066400000000000000000000006671337725263500211270ustar00rootroot00000000000000>group3;barcode=CAGT;cdr3=CC;size=1; CCCCCCATGGACACGCTTTGCTCCACGCTCCTGCTGCTGA >group2;barcode=CCGT;cdr3=CC;size=1; CCCCCCATGGACACGCTTTGCTCCACGCTCCTGCTGCTGT >group4a;barcode=CGAA;cdr3=AC;size=2; AACCAACTTCTCACCGAGCAGTTGT >consensus1;barcode=GGTT;cdr3=TA;size=3; TTAACCCGGCTAAGGCAGACGTAGTACGAGCCTACGT >group1;barcode=GTGC;cdr3=AA;size=1; AAAAGGAGGCTCCTCTTTGTGGTGGCAGCAGCTACAG >group6a;barcode=TGGT;cdr3=TT;size=3; TTTTGGGCAAGAACATGAAGCACCTG
IgDiscover-0.11/tests/run.sh000077500000000000000000000026451337725263500160170ustar00rootroot00000000000000#!/bin/bash
# Run this within an activated igdiscover environment
set -euo pipefail
set -x
unset DISPLAY
pytest
rm -rf testrun
mkdir testrun
[[ -L testdata ]] || ln -s igdiscover-testdata testdata
# Test whether specifying primer sequences leads to a SyntaxError
igdiscover init --db=testdata/database --reads=testdata/reads.1.fastq.gz testrun/primers
pushd testrun/primers
igdiscover config \
 --set forward_primers "['CGTGA']" \
 --set reverse_primers "['TTCAC']"
igdiscover run -n stats/reads.json
popd
igdiscover init --db=testdata/database --reads=testdata/reads.1.fastq.gz testrun/paired
pushd testrun/paired
igdiscover config --set barcode_length_3prime 21
igdiscover run nofinal
if [[ -d final/ ]]; then
    echo "ERROR: nofinal failed"
    exit 1
fi
# run final iteration
igdiscover run
igdiscover run iteration-01/exact.tab
popd
# Use the merged file from above as input again
igdiscover init --db=testdata/database --single-reads=testrun/paired/reads/2-merged.fastq.gz testrun/singlefastq
cp -p testrun/paired/igdiscover.yaml testrun/singlefastq/
( cd testrun/singlefastq && igdiscover run stats/reads.json )
# Test FASTA input
sqt fastxmod -w 0 --fasta testrun/paired/reads/2-merged.fastq.gz > testrun/reads.fasta
igdiscover init --db=testdata/database --single-reads=testrun/reads.fasta testrun/singlefasta
cp -p testrun/paired/igdiscover.yaml testrun/singlefasta/
( cd testrun/singlefasta && igdiscover run stats/reads.json )
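The `;key=value;` annotations in the grouped FASTA headers above (`barcode`, `cdr3`, `size`) follow the header convention exercised by tests/test_parse.py further below. A minimal sketch of a parser with the behaviour those tests expect might look like this; the real implementation is `igdiscover.parse.parse_header`, and this illustrative version handles only the `size` and `barcode` attributes, as the tests do:

def parse_header(header):
    # Split a FASTA header such as 'abc;size=17;barcode=ACG' into
    # (name, size, barcode). Missing attributes yield None; unknown or
    # misspelled attributes such as 'soze=19' are ignored.
    # Sketch only, not the actual igdiscover.parse implementation.
    fields = header.split(';')
    name = fields[0]
    size = None
    barcode = None
    for field in fields[1:]:
        if field.startswith('size='):
            size = int(field[len('size='):])
        elif field.startswith('barcode='):
            barcode = field[len('barcode='):]
    return name, size, barcode

Applied to a header from grouped.fasta above, `parse_header('group1;barcode=AAAA;cdr3=GGT;size=1;')` returns `('group1', 1, 'AAAA')`, consistent with the assertions in test_parse.py.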
IgDiscover-0.11/tests/test_cluster.py000066400000000000000000000041671337725263500177520ustar00rootroot00000000000000from igdiscover.cluster import inner_nodes, hamming_single_linkage def inner_nodes_recursive(root): """ Return a list of all inner nodes of the tree, from left to right. """ if root.is_leaf(): return [] return inner_nodes_recursive(root.left) + [root] + inner_nodes_recursive(root.right) def collect_ids_recursive(root): """ Return a list of ids of all leaves of the given tree """ if root.is_leaf(): return [root.id] return collect_ids_recursive(root.left) + collect_ids_recursive(root.right) def test_inner_nodes(): class Node: def __init__(self, value, left, right): self.value = value self.left = left self.right = right def is_leaf(self): return self.left is None and self.right is None def __repr__(self): # return 'Node({!r}, {!r}, {!r})'.format(self.value, self.left, self.right) return 'Node({!r}, ...)'.format(self.value) empty_tree = None assert inner_nodes(empty_tree) == [] leaf = Node(0, None, None) assert inner_nodes(leaf) == [] tree = Node(1, leaf, leaf) values = [ n.value for n in inner_nodes(tree) ] assert values == [1] tree = Node(1, Node(2, Node(3, leaf, leaf), leaf), leaf) values = [ n.value for n in inner_nodes(tree) ] assert values == [3, 2, 1] tree = Node(1, Node(2, Node(3, leaf, Node(4, leaf, leaf)), leaf), leaf) values = [ n.value for n in inner_nodes(tree) ] assert values == [3, 4, 2, 1] tree = Node(1, leaf, Node(2, leaf, Node(3, Node(4, leaf, Node(5, leaf, leaf) ), Node(6, leaf, leaf) ) ) ) values = [ n.value for n in inner_nodes(tree) ] assert values == [1, 2, 4, 5, 3, 6] def test_hamming_single_linkage(): strings = ['', 'AC', 'AG', 'GG', 'CT', 'GGG', 'GGA'] components = hamming_single_linkage(strings, 0) assert all(len(component) == 1 for component in components) assert set(strings) == set(component[0] for component in components) components = hamming_single_linkage(strings, 1) components = set(frozenset(component) for component in components) expected = [[''], ['AC', 'AG', 'GG'], ['CT'], ['GGG', 'GGA']] expected = set(frozenset(component) for component in expected) assert components == expected IgDiscover-0.11/tests/test_commands.py000066400000000000000000000025051337725263500200640ustar00rootroot00000000000000import os import sys from tempfile import TemporaryDirectory import pytest from igdiscover.__main__ import main from .utils import datapath, resultpath, files_equal def run(args, expected): """ Run IgDiscover, redirecting stdout to a temporary file. Then compare the output with the contents of an expected file. 
""" with TemporaryDirectory() as td: outpath = os.path.join(td, 'output') print('Running:', ' '.join(args)) with open(outpath, 'w') as f: old_stdout = sys.stdout sys.stdout = f main(args) sys.stdout = old_stdout assert files_equal(expected, outpath) def test_main(): with pytest.raises(SystemExit) as exc: main(['--version']) assert exc.value.code == 0 def test_group_by_barcode_only(): args = ['group', '-b', '4', datapath('ungrouped.fasta')] run(args, resultpath('grouped-by-barcode-only.fasta')) def test_group_by_pseudo_cdr3(): args = ['group', '-b', '4', '--pseudo-cdr3=-5:-2', '--trim-g', datapath('ungrouped.fasta')] run(args, resultpath('grouped.fasta')) def test_group_by_pseudo_cdr3_barcode_at_end(): args = ['group', '-b', '-4', '--pseudo-cdr3=1:3', datapath('ungrouped.fasta')] run(args, resultpath('grouped2.fasta')) def test_clusterplot(tmpdir): main(['clusterplot', '-m', '10', datapath('clusterplot.tab.gz'), str(tmpdir)]) assert tmpdir.join('IGHV1-1801.png').check() IgDiscover-0.11/tests/test_merger.py000066400000000000000000000103201337725263500175360ustar00rootroot00000000000000""" Test the SiblingMerger class """ import pandas as pd import pytest from igdiscover.discover import SiblingMerger, SiblingInfo from igdiscover.germlinefilter import CandidateFilterer, Candidate, TooSimilarSequenceFilter from igdiscover.utils import UniqueNamer from igdiscover.rename import PrefixDict def test_0(): merger = SiblingMerger() assert list(merger) == [] def test_1(): merger = SiblingMerger() group = pd.DataFrame([1, 2, 3]) info = SiblingInfo('ACCGGT', False, 'name1', group) merger.add(info) assert list(merger) == [info] def test_2(): merger = SiblingMerger() group = pd.DataFrame([1, 2, 3]) merger.add(SiblingInfo('ACCGGT', False, 'name1', group)) merger.add(SiblingInfo('ACCGGT', False, 'name2', group)) sisters = list(merger) assert len(sisters) == 1 assert sisters[0].sequence == 'ACCGGT' assert not sisters[0].requested assert sisters[0].name == 'name1;name2' def test_requested(): merger = SiblingMerger() group = pd.DataFrame([1, 2, 3]) merger.add(SiblingInfo('ACCGGT', True, 'name1', group)) merger.add(SiblingInfo('ACCGGT', False, 'name2', group)) sisters = list(merger) assert sisters[0].requested def test_prefix(): merger = SiblingMerger() group = pd.DataFrame([1, 2, 3]) merger.add(SiblingInfo('ACCGGTAACGT', True, 'name1', group)) merger.add(SiblingInfo('ACCGGT', False, 'name2', group)) info2 = SiblingInfo('TGATACC', False, 'name3', group) merger.add(info2) sisters = list(merger) assert len(sisters) == 2 assert sisters[0].sequence == 'ACCGGTAACGT' assert sisters[0].name == 'name1;name2' assert sisters[1] == info2 def test_with_N(): merger = SiblingMerger() group = pd.DataFrame([1, 2, 3]) merger.add(SiblingInfo('ACCNGTAANGT', True, 'name1', group)) merger.add(SiblingInfo('ANCGGT', False, 'name2', group)) info2 = SiblingInfo('TGATACC', False, 'name3', group) merger.add(info2) sisters = list(merger) assert len(sisters) == 2 assert sisters[0].sequence == 'ACCGGTAANGT' assert sisters[0].name == 'name1;name2' assert sisters[1] == info2 def test_unique_namer(): namer = UniqueNamer() assert namer('Name') == 'Name' assert namer('AnotherName') == 'AnotherName' assert namer('Name') == 'NameA' assert namer('Name') == 'NameB' assert namer('YetAnotherName') == 'YetAnotherName' assert namer('Name') == 'NameC' assert namer('NameC') == 'NameCA' def SI(sequence, name, clonotypes, whitelisted=False): return Candidate( sequence=sequence, name=name, clonotypes=clonotypes, exact=100, Ds_exact=10, cluster_size=100, 
whitelisted=whitelisted, is_database=False, cluster_size_is_accurate=True, CDR3_start=len(sequence), row=None ) def test_candidate_filter_with_cdr3(): filters = [ TooSimilarSequenceFilter(), ] merger = CandidateFilterer(filters) infos = [ SI('ACGTTA', 'Name1', 15), SI('ACGTTAT', 'Name2', 100), # kept because it is longer SI('ACGCCAT', 'Name3', 15), # kept because edit distance > 1 SI('ACGGTAT', 'Name5', 120), # kept because it has more clonotypes ] merger.add(infos[0]); merged = list(merger) assert len(merged) == 1 and merged[0] == infos[0] merger.add(infos[0]); merged = list(merger) assert len(merged) == 1 and merged[0] == infos[0] merger.add(infos[1]); merged = list(merger) assert len(merged) == 1 and merged[0] == infos[1] merger.add(infos[2]); merged = list(merger) assert len(merged) == 2 and merged[0] == infos[1] and merged[1] == infos[2] merger.add(infos[3]); merged = list(merger) assert len(merged) == 3 and merged[0:3] == infos[1:4] def test_candidate_filter_prefix(): merger = CandidateFilterer([TooSimilarSequenceFilter()]) infos = [ SI('AAATAA', 'Name1', 117), SI('AAATAAG', 'Name2', 10), ] merger.add(infos[0]) merger.add(infos[1]) merged = list(merger) assert len(merged) == 1 assert merged[0] == infos[0], (merged, infos) class TestPrefixDict: def setup(self): self.pd = PrefixDict([ ('AAACCT', 7), ('AGAAA', 11), ('AGAAC', 13) ]) def test_ambiguous(self): with pytest.raises(KeyError): self.pd['AGAA'] def test_missing(self): with pytest.raises(KeyError): self.pd['TTAAG'] def test_existing(self): assert self.pd['AAACCT'] == 7 assert self.pd['AAACCT'] == 7 assert self.pd['AAA'] == 7 assert self.pd['AGAAAT'] == 11 assert self.pd['AGAACT'] == 13 IgDiscover-0.11/tests/test_parse.py000066400000000000000000000006731337725263500174010ustar00rootroot00000000000000""" """ from igdiscover.parse import parse_header def test(): assert parse_header('') == ('', None, None) assert parse_header('abc') == ('abc', None, None) assert parse_header('abc;size=17;') == ('abc', 17, None) assert parse_header('abc;barcode=ACG') == ('abc', None, 'ACG') assert parse_header('abc;size=17;barcode=ACG') == ('abc', 17, 'ACG') assert parse_header('abc;size=17;barcode=ACG;soze=19;baarcode=GGG') == ('abc', 17, 'ACG') IgDiscover-0.11/tests/test_species.py000066400000000000000000000061771337725263500177270ustar00rootroot00000000000000# Tests for CDR3 detection from igdiscover.species import find_cdr3 from igdiscover.utils import nt_to_aa def split(s): for line in s.split('\n'): line = line.strip() if line: yield line.split() def assert_cdr3_detection(chain, s): for amino_acids, sequence in split(s): for offset in range(3): target = sequence[offset:] match = find_cdr3(target, chain) assert match is not None assert nt_to_aa(target[match[0]:match[1]]) == amino_acids, (chain, amino_acids, offset) def test_cdr3_detection_heavy(): heavy = """ ARRLHSGSYILFDY CAGGTGACCTTGAAGGAGTCTGGTCCTGCGCTGGTGAAACCCACACAGACCCTCACGCTGACCTGCACCTTCTCTGGGTTCTCACTCAGCACTAGTGGTATGGGTGTGGGCTGGATCCGTCAGCCCTCACGGAAGACCCTGGAGTGGCTTGCACACATTTATTGGAATGATGATAAATACTACAGCACATCGCTGAAGAGCAGGCTCACCATCTCCAAGGACACCTCCAAAAACCAGGTGGTTCTAACAATGACCAACATGGACCCTGTGGACACAGCCACATATTACTGTGCACGGAGACTTCATAGTGGGAGCTACATTCTCTTTGACTACTGGGGCCAGGGAGTCCTGGTCACCGTCTCCTCAGGGAGTGCATCCGCCCCAACCCTTTTCCCCCTCGTCTCCTGTGA ARIKWLRSPGYGYFDF 
CAGGTGACCTTGAAGGAGTCTGGTCCTGCGCTGGTGAGACCCACACAGACCCTCACTCTGACCTGCACCTTCTCTGGGTTCTCAATCAGCACCTCTGGAACAGGTGTGGGCTGGATCCGTCAGCCCCCAGGGAAGGCCCTGGAATGGCTTGCAAGCATTTATTGGACTGATGCTAAATACTATAGCACATCGCAGAAGAGCAGGCTCACCATCTCCAAGGACACCTCCAGAAACCAGGTGATTCTAACAATGACCAACATGGAGCCTGTGGACATAGCCACATATTTCTGTGCACGGATAAAGTGGCTGCGGTCCCCAGGCTATGGATACTTCGATTTCTGGGGCCCTGGCACCCCAATCACCATCTCCTCAGGGAGTGCATCCGCCCCAACCCTTTTCCCCCTCGTCTCCTGTGA ARHGIAAAGTHNWFDP TCAGCCGACAAGTCCATCAGCACCGCCTACCTGCAGTGGAGCAGCCTGAAGGCCTCGGACACCGCCATGGATTACTGTGCGAGACATGGGATAGCAGCAGCTGGTACCCACAACTGGTTCGACCCCTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAGGGAGTGCATCCGCCCCAACCCTTTTCCCCCTCGTCTCCTGTGAGAATTCCCCGTCGGCAGGTTGTT """ assert_cdr3_detection('VH', heavy) def test_cdr3_detection_kappa(): kappa = """ QQYDSSPRT TTCAGTGGCAGTGGAGCAGGGACAGATTTCACTCTCACCATCAGCAGTCTGGAACCTGAGGATGTCGCAACTTACTACTGTCAGCAGTATGATAGCAGCCCCCGGACGTTCGGCGCTGGGACCAAGCTGGAAATCAAACGGAGTGTGCAGAAGCCAACTATCTCCCTCTTCCCTCCATCATCTGAGGAGG QQYSSYPYT GAGCTGGCCTCGGGAGTCCCAGCTCGCTTCAGTGGGAGTGGGTCAGGGACTTCTTTCTCTCTCACAATCAGCAACGTGGAGGCTGAAGATGTTGCAACCTATTACTGTCAGCAGTATAGCAGTTATCCGTACACGTTCGGCGCAGGGACCAAGCTGGAAATCAAACGGAGTGTGCAGAAGCCAACTATCTCCCTCTTCCCTCCATCATCTGAGGAGG LQYDSSPYT ATTCTCACCATCAGCAGCCTGCAGCCTGAAGACTTTGCAACTTACTACTGTCTACAGTATGATAGTTCCCCGTACACGTTCGGCGCAGGGACCAAGCTGGAAATCAAACGGAGTGTGCAGAAGCCAACTATCTCCCTCTTCCCTCCATCATCTGAGGAGG FQYYSGRLT ACAGACTTCACTCTCACCATCAGCAGCCTGCAGCCTGAGGACATTGCAGTTTATTACTGTTTCCAGTATTACAGCGGGAGACTCACGTTCGGAGGAGGGACCCGCTTGGAAATCAAACGGAGTGTGCAGAAGCCAACTATCTCCCTCTTCCCTCCATCATCTGAGGAGG """ assert_cdr3_detection('VK', kappa) def test_cdr3_detection_lambda(): lambda_ = """ LTYHGNSGTFV GGATCCAAAAACCCCTCAGCCAATGCAGGAATTTTGCTCATCTCTGAACTCCAGAATGAGGATGAGGCTGACTATTACTGTCTGACATATCATGGTAATAGTGGTACTTTTGTATTCGGTGGAGGAACCAAGCTGACCGTCCTAGGTCAGCCCAAGTCTGCCCCCACAGTCAGCCTGTTCC QLWDANSTV ATGGCCACACTGACCATCACTGGCGCCCAGGGTGAGGACGAGGCCGACTATTGCTGTCAGTTGTGGGATGCTAACAGTACTGTGTTCGGTGGAGGAACCACGCTGACCGTCCTAGGTCAGCCCAAGTCTGCCCCCACAGTCAGCCTGTTCCCGCCCTCCTC GVGYSGGYV GATCGCTACTTAACCATCTCCAACATCCAGCCTGAAGACGAGGCTGACTATTTCTGTGGTGTGGGTTATAGCGGTGGTTATGTATTCGGTGGAGGAACCAAGTTGACCGTCCTAGGTCAGCCCAAGTCTGCTCCCACAGTCAGCCTGTTCCCGCCCTCCTC """ assert_cdr3_detection('VL', lambda_) IgDiscover-0.11/tests/test_trie.py000066400000000000000000000061711337725263500172310ustar00rootroot00000000000000import random from igdiscover.trie import Trie from igdiscover.group import hamming_neighbors from sqt.align import hamming_distance def random_nt(length): return ''.join(random.choice('ACGT') for _ in range(length)) class TestTrie: LENGTHS = (0, 1, 2, 3, 4, 5, 6, 10, 12, 15, 20) def setup(self): self.strings = set() self.trie = Trie() # Create a set and a Trie both containing the same strings for _ in range(80): for length in self.LENGTHS: s = random_nt(length) self.strings.add(s) self.trie.add(s) def test_empty_string(self): trie = Trie() trie.add('') assert '' in trie assert 'A' not in trie def test_contains(self): for s in self.strings: assert s in self.trie for length in self.LENGTHS: for _ in range(min(100, 4**length)): s = random_nt(length) assert (s in self.strings) == (s in self.trie) def test_len(self): assert len(self.strings) == len(self.trie) def naive_has_similar(self, s, distance): for t in self.strings: if len(t) != len(s): continue if hamming_distance(t, s) <= distance: return True return False def naive_find_all_similar(self, s, distance): for t in self.strings: if len(t) != len(s): continue if hamming_distance(t, s) <= distance: yield t def test_has_similar(self): for s in self.strings: assert 
self.trie.has_similar(s, 0) assert self.trie.has_similar(s, 1) assert self.trie.has_similar(s, 2) for base in self.strings: for modified in hamming_neighbors(base): assert self.trie.has_similar(modified, 1) for errors in range(4): for length in self.LENGTHS: for _ in range(min(100, 4**length)): s = random_nt(length) assert self.naive_has_similar(s, errors) == self.trie.has_similar(s, errors) def test_find_all_similar(self): t = Trie() t.add('ACGT') result = list(t.find_all_similar('ACGG', 1)) assert set(result) == frozenset(('ACGT',)) for s in self.strings: result = list(self.trie.find_all_similar(s, 0)) assert result == [s] for errors in range(1, 4): result = list(self.trie.find_all_similar(s, errors)) assert s in result for base in self.strings: for modified in hamming_neighbors(base): assert base in self.trie.find_all_similar(modified, 1) for errors in range(1, 3): for length in self.LENGTHS: for _ in range(min(100, 4**length)): s = random_nt(length) expected = set(self.naive_find_all_similar(s, errors)) assert expected == set(self.trie.find_all_similar(s, errors)) def main(): import sys n = int(sys.argv[1]) dist = int(sys.argv[2]) strings = [random_nt(13) for _ in range(n)] print('created random sequences') if sys.argv[3] == 'x': def naive_has_similar(t): for s in strings: if hamming_distance(s, t) <= dist: return True return False hs = 0 for s in strings: hs += int(naive_has_similar(s[1:] + s[0])) print('hs:', hs, 'out of', len(strings)) else: trie = Trie() for s in strings: trie.add(s) print('created trie') hs = 0 for s in strings: hs += int(trie.has_similar(s[1:] + s[0], dist)) print('hs:', hs, 'out of', len(strings)) if __name__ == '__main__': main() IgDiscover-0.11/tests/test_utils.py000066400000000000000000000053161337725263500174260ustar00rootroot00000000000000from io import StringIO import pkg_resources import pytest from igdiscover.utils import (has_stop, validate_fasta, FastaValidationError, find_overlap, merge_overlapping, consensus) from igdiscover.config import Config def test_has_stop(): assert has_stop('TAA') assert has_stop('TAG') assert has_stop('TGA') assert has_stop('GGGTGA') assert not has_stop('GGG') assert not has_stop('TAT') assert not has_stop('GGGT') assert not has_stop('GGGTA') assert not has_stop('TATTG') def assert_dicts_equal(expected, actual): assert expected.keys() == actual.keys() for k in expected: if hasattr(expected[k], 'keys'): assert hasattr(actual[k], 'keys') assert_dicts_equal(expected[k], actual[k]) else: assert expected[k] == actual[k], '{}: {} vs {}'.format(k, expected[k], actual[k]) def test_config(): empty_config = Config(file=StringIO('{}')) packaged_config = Config(file=pkg_resources.resource_stream('igdiscover', Config.DEFAULT_PATH)) # force library name to be equal since it is dynamically determined empty_config.library_name = packaged_config.library_name = 'nolib' e = empty_config.__dict__ p = packaged_config.__dict__ assert_dicts_equal(e, p) # assert empty_config == packaged_config def test_validate_empty_record(): with pytest.raises(FastaValidationError): validate_fasta('tests/data/empty-record.fasta') def test_validate_duplicate_name(): with pytest.raises(FastaValidationError): validate_fasta('tests/data/duplicate-name.fasta') def test_validate_duplicate_sequence(): with pytest.raises(FastaValidationError): validate_fasta('tests/data/duplicate-sequence.fasta') def test_find_overlap(): assert find_overlap('', '') is None assert find_overlap('A', '') is None assert find_overlap('ABC', 'X') is None assert find_overlap('X', 'ABC') is None 
IgDiscover-0.11/tests/test_utils.py

from io import StringIO

import pkg_resources
import pytest

from igdiscover.utils import (has_stop, validate_fasta, FastaValidationError,
    find_overlap, merge_overlapping, consensus)
from igdiscover.config import Config


def test_has_stop():
    assert has_stop('TAA')
    assert has_stop('TAG')
    assert has_stop('TGA')
    assert has_stop('GGGTGA')
    assert not has_stop('GGG')
    assert not has_stop('TAT')
    assert not has_stop('GGGT')
    assert not has_stop('GGGTA')
    assert not has_stop('TATTG')


def assert_dicts_equal(expected, actual):
    assert expected.keys() == actual.keys()
    for k in expected:
        if hasattr(expected[k], 'keys'):
            assert hasattr(actual[k], 'keys')
            assert_dicts_equal(expected[k], actual[k])
        else:
            assert expected[k] == actual[k], '{}: {} vs {}'.format(k, expected[k], actual[k])


def test_config():
    empty_config = Config(file=StringIO('{}'))
    packaged_config = Config(file=pkg_resources.resource_stream('igdiscover', Config.DEFAULT_PATH))
    # force library name to be equal since it is dynamically determined
    empty_config.library_name = packaged_config.library_name = 'nolib'
    e = empty_config.__dict__
    p = packaged_config.__dict__
    assert_dicts_equal(e, p)
    # assert empty_config == packaged_config


def test_validate_empty_record():
    with pytest.raises(FastaValidationError):
        validate_fasta('tests/data/empty-record.fasta')


def test_validate_duplicate_name():
    with pytest.raises(FastaValidationError):
        validate_fasta('tests/data/duplicate-name.fasta')


def test_validate_duplicate_sequence():
    with pytest.raises(FastaValidationError):
        validate_fasta('tests/data/duplicate-sequence.fasta')


def test_find_overlap():
    assert find_overlap('', '') is None
    assert find_overlap('A', '') is None
    assert find_overlap('ABC', 'X') is None
    assert find_overlap('X', 'ABC') is None
    assert find_overlap('A', 'A') == 0
    assert find_overlap('ABCD', 'A') == 0
    assert find_overlap('A', 'ABC') == 0
    assert find_overlap('AB', 'BD') == 1
    assert find_overlap('ABCDE', 'CDE') == 2
    assert find_overlap('ABCDEFGH', 'CDE') == 2
    assert find_overlap('CDE', 'XABCDEFG') == -3
    assert find_overlap('EFGHI', 'ABCDEFG') == -4


def test_merge_overlapping():
    assert merge_overlapping('', '') is None  # TODO
    assert merge_overlapping('ABC', 'DEF') is None
    assert merge_overlapping('HELLOW', 'LOWORLD') == 'HELLOWORLD'
    assert merge_overlapping('LOWORLD', 'HELLOW') == 'HELLOWORLD'
    assert merge_overlapping('HELLOWORLD', 'LOWO') == 'HELLOWORLD'
    assert merge_overlapping('LOWO', 'HELLOWORLD') == 'HELLOWORLD'


def test_consensus():
    assert consensus((
        'TATTACTGTGCGAG---',
        'TATTACTGTGCGAGAGA',
        'TATTACTGTGCGAGAGA',
        'TATTACTGTGCGAGAG-',
        'TATTACTGTGCGAGAG-',
        'TATTACTGTGCGAG---',
        'TATTACTGTGCGAGA--',
    )) == 'TATTACTGTGCGAGAGA'
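The test_consensus case above encodes the 3'-truncation-tolerant behavior: trailing gap characters do not outvote the longer sequences. A minimal column-wise sketch of such a consensus (a hypothetical reference, not the library's actual implementation, assuming equal-length, gap-padded input):

from collections import Counter

def simple_consensus(sequences):
    # Column-wise majority vote; '-' gap characters do not take part,
    # so 3'-truncated sequences cannot shorten the consensus.
    result = []
    for column in zip(*sequences):
        counts = Counter(ch for ch in column if ch != '-')
        if not counts:
            break  # every sequence is truncated from here on
        result.append(counts.most_common(1)[0][0])
    return ''.join(result)

# simple_consensus(('TATTACTGTGCGAG---', 'TATTACTGTGCGAGAGA', ...))
# -> 'TATTACTGTGCGAGAGA'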
"git describe" (for checkouts), which knows about recent "tags" and an absolute revision-id * the name of the directory into which the tarball was unpacked * an expanded VCS keyword ($Id$, etc) * a `_version.py` created by some earlier build step For released software, the version identifier is closely related to a VCS tag. Some projects use tag names that include more than just the version string (e.g. "myproject-1.2" instead of just "1.2"), in which case the tool needs to strip the tag prefix to extract the version identifier. For unreleased software (between tags), the version identifier should provide enough information to help developers recreate the same tree, while also giving them an idea of roughly how old the tree is (after version 1.2, before version 1.3). Many VCS systems can report a description that captures this, for example `git describe --tags --dirty --always` reports things like "0.7-1-g574ab98-dirty" to indicate that the checkout is one revision past the 0.7 tag, has a unique revision id of "574ab98", and is "dirty" (it has uncommitted changes. The version identifier is used for multiple purposes: * to allow the module to self-identify its version: `myproject.__version__` * to choose a name and prefix for a 'setup.py sdist' tarball ## Theory of Operation Versioneer works by adding a special `_version.py` file into your source tree, where your `__init__.py` can import it. This `_version.py` knows how to dynamically ask the VCS tool for version information at import time. `_version.py` also contains `$Revision$` markers, and the installation process marks `_version.py` to have this marker rewritten with a tag name during the `git archive` command. As a result, generated tarballs will contain enough information to get the proper version. To allow `setup.py` to compute a version too, a `versioneer.py` is added to the top level of your source tree, next to `setup.py` and the `setup.cfg` that configures it. This overrides several distutils/setuptools commands to compute the version when invoked, and changes `setup.py build` and `setup.py sdist` to replace `_version.py` with a small static file that contains just the generated version data. ## Installation First, decide on values for the following configuration variables: * `VCS`: the version control system you use. Currently accepts "git". * `style`: the style of version string to be produced. See "Styles" below for details. Defaults to "pep440", which looks like `TAG[+DISTANCE.gSHORTHASH[.dirty]]`. * `versionfile_source`: A project-relative pathname into which the generated version strings should be written. This is usually a `_version.py` next to your project's main `__init__.py` file, so it can be imported at runtime. If your project uses `src/myproject/__init__.py`, this should be `src/myproject/_version.py`. This file should be checked in to your VCS as usual: the copy created below by `setup.py setup_versioneer` will include code that parses expanded VCS keywords in generated tarballs. The 'build' and 'sdist' commands will replace it with a copy that has just the calculated version string. This must be set even if your project does not have any modules (and will therefore never import `_version.py`), since "setup.py sdist" -based trees still need somewhere to record the pre-calculated version strings. Anywhere in the source tree should do. 
  If there is a `__init__.py` next to your `_version.py`, the `setup.py
  setup_versioneer` command (described below) will append some
  `__version__`-setting assignments, if they aren't already present.

* `versionfile_build`:

  Like `versionfile_source`, but relative to the build directory instead of
  the source directory. These will differ when your setup.py uses
  'package_dir='. If you have `package_dir={'myproject': 'src/myproject'}`,
  then you will probably have `versionfile_build='myproject/_version.py'` and
  `versionfile_source='src/myproject/_version.py'`.

  If this is set to None, then `setup.py build` will not attempt to rewrite
  any `_version.py` in the built tree. If your project does not have any
  libraries (e.g. if it only builds a script), then you should use
  `versionfile_build = None`. To actually use the computed version string,
  your `setup.py` will need to override `distutils.command.build_scripts`
  with a subclass that explicitly inserts a copy of
  `versioneer.get_version()` into your script file. See
  `test/demoapp-script-only/setup.py` for an example.

* `tag_prefix`: a string, like 'PROJECTNAME-', which appears at the start of
  all VCS tags. If your tags look like 'myproject-1.2.0', then you should use
  tag_prefix='myproject-'. If you use unprefixed tags like '1.2.0', this
  should be an empty string, using either `tag_prefix=` or `tag_prefix=''`.

* `parentdir_prefix`: an optional string, frequently the same as tag_prefix,
  which appears at the start of all unpacked tarball filenames. If your
  tarball unpacks into 'myproject-1.2.0', this should be 'myproject-'. To
  disable this feature, just omit the field from your `setup.cfg`.

This tool provides one script, named `versioneer`. That script has one mode,
"install", which writes a copy of `versioneer.py` into the current directory
and runs `versioneer.py setup` to finish the installation.

To versioneer-enable your project:

* 1: Modify your `setup.cfg`, adding a section named `[versioneer]` and
  populating it with the configuration values you decided earlier (note that
  the option names are not case-sensitive):

  ````
  [versioneer]
  VCS = git
  style = pep440
  versionfile_source = src/myproject/_version.py
  versionfile_build = myproject/_version.py
  tag_prefix =
  parentdir_prefix = myproject-
  ````

* 2: Run `versioneer install`. This will do the following:

  * copy `versioneer.py` into the top of your source tree
  * create `_version.py` in the right place (`versionfile_source`)
  * modify your `__init__.py` (if one exists next to `_version.py`) to
    define `__version__` (by calling a function from `_version.py`)
  * modify your `MANIFEST.in` to include both `versioneer.py` and the
    generated `_version.py` in sdist tarballs

  `versioneer install` will complain about any problems it finds with your
  `setup.py` or `setup.cfg`. Run it multiple times until you have fixed all
  the problems.

* 3: add an `import versioneer` to your setup.py, and add the following
  arguments to the setup() call:

      version=versioneer.get_version(),
      cmdclass=versioneer.get_cmdclass(),

* 4: commit these changes to your VCS. To make sure you won't forget,
  `versioneer install` will mark everything it touched for addition using
  `git add`. Don't forget to add `setup.py` and `setup.cfg` too.

## Post-Installation Usage

Once established, all uses of your tree from a VCS checkout should get the
current version string. All generated tarballs should include an embedded
version string (so users who unpack them will not need a VCS tool installed).
If you distribute your project through PyPI, then the release process should
boil down to two steps:

* 1: git tag 1.0
* 2: python setup.py register sdist upload

If you distribute it through github (i.e. users use github to generate
tarballs with `git archive`), the process is:

* 1: git tag 1.0
* 2: git push; git push --tags

Versioneer will report "0+untagged.NUMCOMMITS.gHASH" until your tree has at
least one tag in its history.

## Version-String Flavors

Code which uses Versioneer can learn about its version string at runtime by
importing `_version` from your main `__init__.py` file and running the
`get_versions()` function. From the "outside" (e.g. in `setup.py`), you can
import the top-level `versioneer.py` and run `get_versions()`.

Both functions return a dictionary with different flavors of version
information:

* `['version']`: A condensed version string, rendered using the selected
  style. This is the most commonly used value for the project's version
  string. The default "pep440" style yields strings like `0.11`,
  `0.11+2.g1076c97`, or `0.11+2.g1076c97.dirty`. See the "Styles" section
  below for alternative styles.

* `['full-revisionid']`: detailed revision identifier. For Git, this is the
  full SHA1 commit id, e.g. "1076c978a8d3cfc70f408fe5974aa6c092c949ac".

* `['dirty']`: a boolean, True if the tree has uncommitted changes. Note that
  this is only accurate if run in a VCS checkout, otherwise it is likely to
  be False or None

* `['error']`: if the version string could not be computed, this will be set
  to a string describing the problem, otherwise it will be None. It may be
  useful to throw an exception in setup.py if this is set, to avoid e.g.
  creating tarballs with a version string of "unknown".

Some variants are more useful than others. Including `full-revisionid` in a
bug report should allow developers to reconstruct the exact code being
tested (or indicate the presence of local changes that should be shared with
the developers). `version` is suitable for display in an "about" box or a CLI
`--version` output: it can be easily compared against release notes and lists
of bugs fixed in various releases.

The installer adds the following text to your `__init__.py` to place a basic
version in `YOURPROJECT.__version__`:

    from ._version import get_versions
    __version__ = get_versions()['version']
    del get_versions

## Styles

The setup.cfg `style=` configuration controls how the VCS information is
rendered into a version string.

The default style, "pep440", produces a PEP440-compliant string, equal to the
un-prefixed tag name for actual releases, and containing an additional "local
version" section with more detail for in-between builds. For Git, this is
TAG[+DISTANCE.gHEX[.dirty]] , using information from `git describe --tags
--dirty --always`. For example "0.11+2.g1076c97.dirty" indicates that the
tree is like the "1076c97" commit but has uncommitted changes (".dirty"), and
that this commit is two revisions ("+2") beyond the "0.11" tag. For released
software (exactly equal to a known tag), the identifier will only contain the
stripped tag, e.g. "0.11".

Other styles are available. See details.md in the Versioneer source tree for
descriptions.

## Debugging

Versioneer tries to avoid fatal errors: if something goes wrong, it will tend
to return a version of "0+unknown". To investigate the problem, run
`setup.py version`, which will run the version-lookup code in a verbose mode,
and will display the full contents of `get_versions()` (including the
`error` string, which may help identify what went wrong).
## Updating Versioneer

To upgrade your project to a new release of Versioneer, do the following:

* install the new Versioneer (`pip install -U versioneer` or equivalent)
* edit `setup.cfg`, if necessary, to include any new configuration settings
  indicated by the release notes
* re-run `versioneer install` in your source tree, to replace
  `SRC/_version.py`
* commit any changed files

### Upgrading to 0.16

Nothing special.

### Upgrading to 0.15

Starting with this version, Versioneer is configured with a `[versioneer]`
section in your `setup.cfg` file. Earlier versions required the `setup.py` to
set attributes on the `versioneer` module immediately after import. The new
version will refuse to run (raising an exception during import) until you
have provided the necessary `setup.cfg` section.

In addition, the Versioneer package provides an executable named
`versioneer`, and the installation process is driven by running `versioneer
install`. In 0.14 and earlier, the executable was named
`versioneer-installer` and was run without an argument.

### Upgrading to 0.14

0.14 changes the format of the version string. 0.13 and earlier used
hyphen-separated strings like "0.11-2-g1076c97-dirty". 0.14 and beyond use a
plus-separated "local version" section, with dot-separated components, like
"0.11+2.g1076c97". PEP440-strict tools did not like the old format, but
should be ok with the new one.

### Upgrading from 0.11 to 0.12

Nothing special.

### Upgrading from 0.10 to 0.11

You must add a `versioneer.VCS = "git"` to your `setup.py` before re-running
`setup.py setup_versioneer`. This will enable the use of additional
version-control systems (SVN, etc) in the future.

## Future Directions

This tool is designed to make it easily extended to other version-control
systems: all VCS-specific components are in separate directories like
src/git/ . The top-level `versioneer.py` script is assembled from these
components by running make-versioneer.py . In the future, make-versioneer.py
will take a VCS name as an argument, and will construct a version of
`versioneer.py` that is specific to the given VCS. It might also take the
configuration arguments that are currently provided manually during
installation by editing setup.py . Alternatively, it might go the other
direction and include code from all supported VCS systems, reducing the
number of intermediate scripts.

## License

To make Versioneer easier to embed, all its code is dedicated to the public
domain. The `_version.py` that it creates is also in the public domain.
Specifically, both are released under the Creative Commons "Public Domain
Dedication" license (CC0-1.0), as described in
https://creativecommons.org/publicdomain/zero/1.0/ .

"""

from __future__ import print_function
try:
    import configparser
except ImportError:
    import ConfigParser as configparser
import errno
import json
import os
import re
import subprocess
import sys


class VersioneerConfig:
    """Container for Versioneer configuration parameters."""
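# The docstring above describes the dict that get_versions() returns.
# For orientation, its shape is roughly the following (example values only,
# not real output):
#
#     {'version': '0.11+2.g1076c97.dirty',
#      'full-revisionid': '1076c978a8d3cfc70f408fe5974aa6c092c949ac',
#      'dirty': True,
#      'error': None}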
""" root = os.path.realpath(os.path.abspath(os.getcwd())) setup_py = os.path.join(root, "setup.py") versioneer_py = os.path.join(root, "versioneer.py") if not (os.path.exists(setup_py) or os.path.exists(versioneer_py)): # allow 'python path/to/setup.py COMMAND' root = os.path.dirname(os.path.realpath(os.path.abspath(sys.argv[0]))) setup_py = os.path.join(root, "setup.py") versioneer_py = os.path.join(root, "versioneer.py") if not (os.path.exists(setup_py) or os.path.exists(versioneer_py)): err = ("Versioneer was unable to run the project root directory. " "Versioneer requires setup.py to be executed from " "its immediate directory (like 'python setup.py COMMAND'), " "or in a way that lets it use sys.argv[0] to find the root " "(like 'python path/to/setup.py COMMAND').") raise VersioneerBadRootError(err) try: # Certain runtime workflows (setup.py install/develop in a setuptools # tree) execute all dependencies in a single python process, so # "versioneer" may be imported multiple times, and python's shared # module-import table will cache the first one. So we can't use # os.path.dirname(__file__), as that will find whichever # versioneer.py was first imported, even in later projects. me = os.path.realpath(os.path.abspath(__file__)) if os.path.splitext(me)[0] != os.path.splitext(versioneer_py)[0]: print("Warning: build in %s is using versioneer.py from %s" % (os.path.dirname(me), versioneer_py)) except NameError: pass return root def get_config_from_root(root): """Read the project setup.cfg file to determine Versioneer config.""" # This might raise EnvironmentError (if setup.cfg is missing), or # configparser.NoSectionError (if it lacks a [versioneer] section), or # configparser.NoOptionError (if it lacks "VCS="). See the docstring at # the top of versioneer.py for instructions on writing your setup.cfg . 
    setup_cfg = os.path.join(root, "setup.cfg")
    parser = configparser.SafeConfigParser()
    with open(setup_cfg, "r") as f:
        parser.readfp(f)
    VCS = parser.get("versioneer", "VCS")  # mandatory

    def get(parser, name):
        if parser.has_option("versioneer", name):
            return parser.get("versioneer", name)
        return None
    cfg = VersioneerConfig()
    cfg.VCS = VCS
    cfg.style = get(parser, "style") or ""
    cfg.versionfile_source = get(parser, "versionfile_source")
    cfg.versionfile_build = get(parser, "versionfile_build")
    cfg.tag_prefix = get(parser, "tag_prefix")
    if cfg.tag_prefix in ("''", '""'):
        cfg.tag_prefix = ""
    cfg.parentdir_prefix = get(parser, "parentdir_prefix")
    cfg.verbose = get(parser, "verbose")
    return cfg


class NotThisMethod(Exception):
    """Exception raised if a method is not valid for the current scenario."""


# these dictionaries contain VCS-specific tools
LONG_VERSION_PY = {}
HANDLERS = {}


def register_vcs_handler(vcs, method):  # decorator
    """Decorator to mark a method as the handler for a particular VCS."""
    def decorate(f):
        """Store f in HANDLERS[vcs][method]."""
        if vcs not in HANDLERS:
            HANDLERS[vcs] = {}
        HANDLERS[vcs][method] = f
        return f
    return decorate


def run_command(commands, args, cwd=None, verbose=False, hide_stderr=False):
    """Call the given command(s)."""
    assert isinstance(commands, list)
    p = None
    for c in commands:
        try:
            dispcmd = str([c] + args)
            # remember shell=False, so use git.cmd on windows, not just git
            p = subprocess.Popen([c] + args, cwd=cwd, stdout=subprocess.PIPE,
                                 stderr=(subprocess.PIPE if hide_stderr
                                         else None))
            break
        except EnvironmentError:
            e = sys.exc_info()[1]
            if e.errno == errno.ENOENT:
                continue
            if verbose:
                print("unable to run %s" % dispcmd)
                print(e)
            return None
    else:
        if verbose:
            print("unable to find command, tried %s" % (commands,))
        return None
    stdout = p.communicate()[0].strip()
    if sys.version_info[0] >= 3:
        stdout = stdout.decode()
    if p.returncode != 0:
        if verbose:
            print("unable to run %s (error)" % dispcmd)
        return None
    return stdout


LONG_VERSION_PY['git'] = '''
# This file helps to compute a version number in source trees obtained from
# git-archive tarball (such as those provided by githubs download-from-tag
# feature). Distribution tarballs (built by setup.py sdist) and build
# directories (produced by setup.py build) will contain a much shorter file
# that just contains the computed version number.

# This file is released into the public domain. Generated by
# versioneer-0.16 (https://github.com/warner/python-versioneer)

"""Git implementation of _version.py."""

import errno
import os
import re
import subprocess
import sys


def get_keywords():
    """Get the keywords needed to look up the version information."""
    # these strings will be replaced by git during git-archive.
    # setup.py/versioneer.py will grep for the variable names, so they must
    # each be defined on a line of their own. _version.py will just call
    # get_keywords().
    git_refnames = "%(DOLLAR)sFormat:%%d%(DOLLAR)s"
    git_full = "%(DOLLAR)sFormat:%%H%(DOLLAR)s"
    keywords = {"refnames": git_refnames, "full": git_full}
    return keywords


class VersioneerConfig:
    """Container for Versioneer configuration parameters."""


def get_config():
    """Create, populate and return the VersioneerConfig() object."""
    # these strings are filled in when 'setup.py versioneer' creates
    # _version.py
    cfg = VersioneerConfig()
    cfg.VCS = "git"
    cfg.style = "%(STYLE)s"
    cfg.tag_prefix = "%(TAG_PREFIX)s"
    cfg.parentdir_prefix = "%(PARENTDIR_PREFIX)s"
    cfg.versionfile_source = "%(VERSIONFILE_SOURCE)s"
    cfg.verbose = False
    return cfg


class NotThisMethod(Exception):
    """Exception raised if a method is not valid for the current scenario."""


LONG_VERSION_PY = {}
HANDLERS = {}


def register_vcs_handler(vcs, method):  # decorator
    """Decorator to mark a method as the handler for a particular VCS."""
    def decorate(f):
        """Store f in HANDLERS[vcs][method]."""
        if vcs not in HANDLERS:
            HANDLERS[vcs] = {}
        HANDLERS[vcs][method] = f
        return f
    return decorate


def run_command(commands, args, cwd=None, verbose=False, hide_stderr=False):
    """Call the given command(s)."""
    assert isinstance(commands, list)
    p = None
    for c in commands:
        try:
            dispcmd = str([c] + args)
            # remember shell=False, so use git.cmd on windows, not just git
            p = subprocess.Popen([c] + args, cwd=cwd, stdout=subprocess.PIPE,
                                 stderr=(subprocess.PIPE if hide_stderr
                                         else None))
            break
        except EnvironmentError:
            e = sys.exc_info()[1]
            if e.errno == errno.ENOENT:
                continue
            if verbose:
                print("unable to run %%s" %% dispcmd)
                print(e)
            return None
    else:
        if verbose:
            print("unable to find command, tried %%s" %% (commands,))
        return None
    stdout = p.communicate()[0].strip()
    if sys.version_info[0] >= 3:
        stdout = stdout.decode()
    if p.returncode != 0:
        if verbose:
            print("unable to run %%s (error)" %% dispcmd)
        return None
    return stdout


def versions_from_parentdir(parentdir_prefix, root, verbose):
    """Try to determine the version from the parent directory name.

    Source tarballs conventionally unpack into a directory that includes both
    the project name and a version string.
    """
    dirname = os.path.basename(root)
    if not dirname.startswith(parentdir_prefix):
        if verbose:
            print("guessing rootdir is '%%s', but '%%s' doesn't start with "
                  "prefix '%%s'" %% (root, dirname, parentdir_prefix))
        raise NotThisMethod("rootdir doesn't start with parentdir_prefix")
    return {"version": dirname[len(parentdir_prefix):],
            "full-revisionid": None,
            "dirty": False, "error": None}


@register_vcs_handler("git", "get_keywords")
def git_get_keywords(versionfile_abs):
    """Extract version information from the given file."""
    # the code embedded in _version.py can just fetch the value of these
    # keywords. When used from setup.py, we don't want to import _version.py,
    # so we do it with a regexp instead. This function is not used from
    # _version.py.
    keywords = {}
    try:
        f = open(versionfile_abs, "r")
        for line in f.readlines():
            if line.strip().startswith("git_refnames ="):
                mo = re.search(r'=\s*"(.*)"', line)
                if mo:
                    keywords["refnames"] = mo.group(1)
            if line.strip().startswith("git_full ="):
                mo = re.search(r'=\s*"(.*)"', line)
                if mo:
                    keywords["full"] = mo.group(1)
        f.close()
    except EnvironmentError:
        pass
    return keywords


@register_vcs_handler("git", "keywords")
def git_versions_from_keywords(keywords, tag_prefix, verbose):
    """Get version information from git keywords."""
    if not keywords:
        raise NotThisMethod("no keywords at all, weird")
    refnames = keywords["refnames"].strip()
    if refnames.startswith("$Format"):
        if verbose:
            print("keywords are unexpanded, not using")
        raise NotThisMethod("unexpanded keywords, not a git-archive tarball")
    refs = set([r.strip() for r in refnames.strip("()").split(",")])
    # starting in git-1.8.3, tags are listed as "tag: foo-1.0" instead of
    # just "foo-1.0". If we see a "tag: " prefix, prefer those.
    TAG = "tag: "
    tags = set([r[len(TAG):] for r in refs if r.startswith(TAG)])
    if not tags:
        # Either we're using git < 1.8.3, or there really are no tags. We use
        # a heuristic: assume all version tags have a digit. The old git %%d
        # expansion behaves like git log --decorate=short and strips out the
        # refs/heads/ and refs/tags/ prefixes that would let us distinguish
        # between branches and tags. By ignoring refnames without digits, we
        # filter out many common branch names like "release" and
        # "stabilization", as well as "HEAD" and "master".
        tags = set([r for r in refs if re.search(r'\d', r)])
        if verbose:
            print("discarding '%%s', no digits" %% ",".join(refs-tags))
    if verbose:
        print("likely tags: %%s" %% ",".join(sorted(tags)))
    for ref in sorted(tags):
        # sorting will prefer e.g. "2.0" over "2.0rc1"
        if ref.startswith(tag_prefix):
            r = ref[len(tag_prefix):]
            if verbose:
                print("picking %%s" %% r)
            return {"version": r,
                    "full-revisionid": keywords["full"].strip(),
                    "dirty": False, "error": None
                    }
    # no suitable tags, so version is "0+unknown", but full hex is still there
    if verbose:
        print("no suitable tags, using unknown + full revision id")
    return {"version": "0+unknown",
            "full-revisionid": keywords["full"].strip(),
            "dirty": False, "error": "no suitable tags"}


@register_vcs_handler("git", "pieces_from_vcs")
def git_pieces_from_vcs(tag_prefix, root, verbose, run_command=run_command):
    """Get version from 'git describe' in the root of the source tree.

    This only gets called if the git-archive 'subst' keywords were *not*
    expanded, and _version.py hasn't already been rewritten with a short
    version string, meaning we're inside a checked out source tree.
    """
    if not os.path.exists(os.path.join(root, ".git")):
        if verbose:
            print("no .git in %%s" %% root)
        raise NotThisMethod("no .git directory")

    GITS = ["git"]
    if sys.platform == "win32":
        GITS = ["git.cmd", "git.exe"]
    # if there is a tag matching tag_prefix, this yields TAG-NUM-gHEX[-dirty]
    # if there isn't one, this yields HEX[-dirty] (no NUM)
    describe_out = run_command(GITS, ["describe", "--tags", "--dirty",
                                      "--always", "--long",
                                      "--match", "%%s*" %% tag_prefix],
                               cwd=root)
    # --long was added in git-1.5.5
    if describe_out is None:
        raise NotThisMethod("'git describe' failed")
    describe_out = describe_out.strip()
    full_out = run_command(GITS, ["rev-parse", "HEAD"], cwd=root)
    if full_out is None:
        raise NotThisMethod("'git rev-parse' failed")
    full_out = full_out.strip()

    pieces = {}
    pieces["long"] = full_out
    pieces["short"] = full_out[:7]  # maybe improved later
    pieces["error"] = None

    # parse describe_out. It will be like TAG-NUM-gHEX[-dirty] or HEX[-dirty]
    # TAG might have hyphens.
    git_describe = describe_out

    # look for -dirty suffix
    dirty = git_describe.endswith("-dirty")
    pieces["dirty"] = dirty
    if dirty:
        git_describe = git_describe[:git_describe.rindex("-dirty")]

    # now we have TAG-NUM-gHEX or HEX

    if "-" in git_describe:
        # TAG-NUM-gHEX
        mo = re.search(r'^(.+)-(\d+)-g([0-9a-f]+)$', git_describe)
        if not mo:
            # unparseable. Maybe git-describe is misbehaving?
            pieces["error"] = ("unable to parse git-describe output: '%%s'"
                               %% describe_out)
            return pieces

        # tag
        full_tag = mo.group(1)
        if not full_tag.startswith(tag_prefix):
            if verbose:
                fmt = "tag '%%s' doesn't start with prefix '%%s'"
                print(fmt %% (full_tag, tag_prefix))
            pieces["error"] = ("tag '%%s' doesn't start with prefix '%%s'"
                               %% (full_tag, tag_prefix))
            return pieces
        pieces["closest-tag"] = full_tag[len(tag_prefix):]

        # distance: number of commits since tag
        pieces["distance"] = int(mo.group(2))

        # commit: short hex revision ID
        pieces["short"] = mo.group(3)

    else:
        # HEX: no tags
        pieces["closest-tag"] = None
        count_out = run_command(GITS, ["rev-list", "HEAD", "--count"],
                                cwd=root)
        pieces["distance"] = int(count_out)  # total number of commits

    return pieces


def plus_or_dot(pieces):
    """Return a + if we don't already have one, else return a ."""
    if "+" in pieces.get("closest-tag", ""):
        return "."
    return "+"


def render_pep440(pieces):
    """Build up version string, with post-release "local version identifier".

    Our goal: TAG[+DISTANCE.gHEX[.dirty]] . Note that if you
    get a tagged build and then dirty it, you'll get TAG+0.gHEX.dirty

    Exceptions:
    1: no tags. git_describe was just HEX. 0+untagged.DISTANCE.gHEX[.dirty]
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"] or pieces["dirty"]:
            rendered += plus_or_dot(pieces)
            rendered += "%%d.g%%s" %% (pieces["distance"], pieces["short"])
            if pieces["dirty"]:
                rendered += ".dirty"
    else:
        # exception #1
        rendered = "0+untagged.%%d.g%%s" %% (pieces["distance"],
                                             pieces["short"])
        if pieces["dirty"]:
            rendered += ".dirty"
    return rendered


def render_pep440_pre(pieces):
    """TAG[.post.devDISTANCE] -- No -dirty.

    Exceptions:
    1: no tags. 0.post.devDISTANCE
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"]:
            rendered += ".post.dev%%d" %% pieces["distance"]
    else:
        # exception #1
        rendered = "0.post.dev%%d" %% pieces["distance"]
    return rendered


def render_pep440_post(pieces):
    """TAG[.postDISTANCE[.dev0]+gHEX] .

    The ".dev0" means dirty. Note that .dev0 sorts backwards
    (a dirty tree will appear "older" than the corresponding clean one),
    but you shouldn't be releasing software with -dirty anyways.

    Exceptions:
    1: no tags. 0.postDISTANCE[.dev0]
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"] or pieces["dirty"]:
            rendered += ".post%%d" %% pieces["distance"]
            if pieces["dirty"]:
                rendered += ".dev0"
            rendered += plus_or_dot(pieces)
            rendered += "g%%s" %% pieces["short"]
    else:
        # exception #1
        rendered = "0.post%%d" %% pieces["distance"]
        if pieces["dirty"]:
            rendered += ".dev0"
        rendered += "+g%%s" %% pieces["short"]
    return rendered


def render_pep440_old(pieces):
    """TAG[.postDISTANCE[.dev0]] .

    The ".dev0" means dirty.

    Exceptions:
    1: no tags. 0.postDISTANCE[.dev0]
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"] or pieces["dirty"]:
            rendered += ".post%%d" %% pieces["distance"]
            if pieces["dirty"]:
                rendered += ".dev0"
    else:
        # exception #1
        rendered = "0.post%%d" %% pieces["distance"]
        if pieces["dirty"]:
            rendered += ".dev0"
    return rendered


def render_git_describe(pieces):
    """TAG[-DISTANCE-gHEX][-dirty].

    Like 'git describe --tags --dirty --always'.

    Exceptions:
    1: no tags. HEX[-dirty]  (note: no 'g' prefix)
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"]:
            rendered += "-%%d-g%%s" %% (pieces["distance"], pieces["short"])
    else:
        # exception #1
        rendered = pieces["short"]
    if pieces["dirty"]:
        rendered += "-dirty"
    return rendered


def render_git_describe_long(pieces):
    """TAG-DISTANCE-gHEX[-dirty].

    Like 'git describe --tags --dirty --always --long'.
    The distance/hash is unconditional.

    Exceptions:
    1: no tags. HEX[-dirty]  (note: no 'g' prefix)
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        rendered += "-%%d-g%%s" %% (pieces["distance"], pieces["short"])
    else:
        # exception #1
        rendered = pieces["short"]
    if pieces["dirty"]:
        rendered += "-dirty"
    return rendered


def render(pieces, style):
    """Render the given version pieces into the requested style."""
    if pieces["error"]:
        return {"version": "unknown",
                "full-revisionid": pieces.get("long"),
                "dirty": None,
                "error": pieces["error"]}

    if not style or style == "default":
        style = "pep440"  # the default

    if style == "pep440":
        rendered = render_pep440(pieces)
    elif style == "pep440-pre":
        rendered = render_pep440_pre(pieces)
    elif style == "pep440-post":
        rendered = render_pep440_post(pieces)
    elif style == "pep440-old":
        rendered = render_pep440_old(pieces)
    elif style == "git-describe":
        rendered = render_git_describe(pieces)
    elif style == "git-describe-long":
        rendered = render_git_describe_long(pieces)
    else:
        raise ValueError("unknown style '%%s'" %% style)

    return {"version": rendered, "full-revisionid": pieces["long"],
            "dirty": pieces["dirty"], "error": None}


def get_versions():
    """Get version information or return default if unable to do so."""
    # I am in _version.py, which lives at ROOT/VERSIONFILE_SOURCE. If we have
    # __file__, we can work backwards from there to the root. Some
    # py2exe/bbfreeze/non-CPython implementations don't do __file__, in which
    # case we can only use expanded keywords.

    cfg = get_config()
    verbose = cfg.verbose

    try:
        return git_versions_from_keywords(get_keywords(), cfg.tag_prefix,
                                          verbose)
    except NotThisMethod:
        pass

    try:
        root = os.path.realpath(__file__)
        # versionfile_source is the relative path from the top of the source
        # tree (where the .git directory might live) to this file. Invert
        # this to find the root from __file__.
        for i in cfg.versionfile_source.split('/'):
            root = os.path.dirname(root)
    except NameError:
        return {"version": "0+unknown", "full-revisionid": None,
                "dirty": None,
                "error": "unable to find root of source tree"}

    try:
        pieces = git_pieces_from_vcs(cfg.tag_prefix, root, verbose)
        return render(pieces, cfg.style)
    except NotThisMethod:
        pass

    try:
        if cfg.parentdir_prefix:
            return versions_from_parentdir(cfg.parentdir_prefix, root,
                                           verbose)
    except NotThisMethod:
        pass

    return {"version": "0+unknown", "full-revisionid": None,
            "dirty": None,
            "error": "unable to compute version"}
'''
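# For illustration: the template above (and the identical module-level
# functions below) turn a "pieces" dict into a version string. A hedged
# sketch of that flow, with made-up values:
#
#     pieces = {"closest-tag": "0.11", "distance": 2, "short": "1076c97",
#               "long": "1076c978a8d3cfc70f408fe5974aa6c092c949ac",
#               "dirty": True, "error": None}
#     render(pieces, "pep440")["version"]  # -> '0.11+2.g1076c97.dirty'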
@register_vcs_handler("git", "get_keywords")
def git_get_keywords(versionfile_abs):
    """Extract version information from the given file."""
    # the code embedded in _version.py can just fetch the value of these
    # keywords. When used from setup.py, we don't want to import _version.py,
    # so we do it with a regexp instead. This function is not used from
    # _version.py.
    keywords = {}
    try:
        f = open(versionfile_abs, "r")
        for line in f.readlines():
            if line.strip().startswith("git_refnames ="):
                mo = re.search(r'=\s*"(.*)"', line)
                if mo:
                    keywords["refnames"] = mo.group(1)
            if line.strip().startswith("git_full ="):
                mo = re.search(r'=\s*"(.*)"', line)
                if mo:
                    keywords["full"] = mo.group(1)
        f.close()
    except EnvironmentError:
        pass
    return keywords


@register_vcs_handler("git", "keywords")
def git_versions_from_keywords(keywords, tag_prefix, verbose):
    """Get version information from git keywords."""
    if not keywords:
        raise NotThisMethod("no keywords at all, weird")
    refnames = keywords["refnames"].strip()
    if refnames.startswith("$Format"):
        if verbose:
            print("keywords are unexpanded, not using")
        raise NotThisMethod("unexpanded keywords, not a git-archive tarball")
    refs = set([r.strip() for r in refnames.strip("()").split(",")])
    # starting in git-1.8.3, tags are listed as "tag: foo-1.0" instead of
    # just "foo-1.0". If we see a "tag: " prefix, prefer those.
    TAG = "tag: "
    tags = set([r[len(TAG):] for r in refs if r.startswith(TAG)])
    if not tags:
        # Either we're using git < 1.8.3, or there really are no tags. We use
        # a heuristic: assume all version tags have a digit. The old git %d
        # expansion behaves like git log --decorate=short and strips out the
        # refs/heads/ and refs/tags/ prefixes that would let us distinguish
        # between branches and tags. By ignoring refnames without digits, we
        # filter out many common branch names like "release" and
        # "stabilization", as well as "HEAD" and "master".
        tags = set([r for r in refs if re.search(r'\d', r)])
        if verbose:
            print("discarding '%s', no digits" % ",".join(refs-tags))
    if verbose:
        print("likely tags: %s" % ",".join(sorted(tags)))
    for ref in sorted(tags):
        # sorting will prefer e.g. "2.0" over "2.0rc1"
        if ref.startswith(tag_prefix):
            r = ref[len(tag_prefix):]
            if verbose:
                print("picking %s" % r)
            return {"version": r,
                    "full-revisionid": keywords["full"].strip(),
                    "dirty": False, "error": None
                    }
    # no suitable tags, so version is "0+unknown", but full hex is still there
    if verbose:
        print("no suitable tags, using unknown + full revision id")
    return {"version": "0+unknown",
            "full-revisionid": keywords["full"].strip(),
            "dirty": False, "error": "no suitable tags"}


@register_vcs_handler("git", "pieces_from_vcs")
def git_pieces_from_vcs(tag_prefix, root, verbose, run_command=run_command):
    """Get version from 'git describe' in the root of the source tree.

    This only gets called if the git-archive 'subst' keywords were *not*
    expanded, and _version.py hasn't already been rewritten with a short
    version string, meaning we're inside a checked out source tree.
    """
""" if not os.path.exists(os.path.join(root, ".git")): if verbose: print("no .git in %s" % root) raise NotThisMethod("no .git directory") GITS = ["git"] if sys.platform == "win32": GITS = ["git.cmd", "git.exe"] # if there is a tag matching tag_prefix, this yields TAG-NUM-gHEX[-dirty] # if there isn't one, this yields HEX[-dirty] (no NUM) describe_out = run_command(GITS, ["describe", "--tags", "--dirty", "--always", "--long", "--match", "%s*" % tag_prefix], cwd=root) # --long was added in git-1.5.5 if describe_out is None: raise NotThisMethod("'git describe' failed") describe_out = describe_out.strip() full_out = run_command(GITS, ["rev-parse", "HEAD"], cwd=root) if full_out is None: raise NotThisMethod("'git rev-parse' failed") full_out = full_out.strip() pieces = {} pieces["long"] = full_out pieces["short"] = full_out[:7] # maybe improved later pieces["error"] = None # parse describe_out. It will be like TAG-NUM-gHEX[-dirty] or HEX[-dirty] # TAG might have hyphens. git_describe = describe_out # look for -dirty suffix dirty = git_describe.endswith("-dirty") pieces["dirty"] = dirty if dirty: git_describe = git_describe[:git_describe.rindex("-dirty")] # now we have TAG-NUM-gHEX or HEX if "-" in git_describe: # TAG-NUM-gHEX mo = re.search(r'^(.+)-(\d+)-g([0-9a-f]+)$', git_describe) if not mo: # unparseable. Maybe git-describe is misbehaving? pieces["error"] = ("unable to parse git-describe output: '%s'" % describe_out) return pieces # tag full_tag = mo.group(1) if not full_tag.startswith(tag_prefix): if verbose: fmt = "tag '%s' doesn't start with prefix '%s'" print(fmt % (full_tag, tag_prefix)) pieces["error"] = ("tag '%s' doesn't start with prefix '%s'" % (full_tag, tag_prefix)) return pieces pieces["closest-tag"] = full_tag[len(tag_prefix):] # distance: number of commits since tag pieces["distance"] = int(mo.group(2)) # commit: short hex revision ID pieces["short"] = mo.group(3) else: # HEX: no tags pieces["closest-tag"] = None count_out = run_command(GITS, ["rev-list", "HEAD", "--count"], cwd=root) pieces["distance"] = int(count_out) # total number of commits return pieces def do_vcs_install(manifest_in, versionfile_source, ipy): """Git-specific installation logic for Versioneer. For Git, this means creating/changing .gitattributes to mark _version.py for export-time keyword substitution. """ GITS = ["git"] if sys.platform == "win32": GITS = ["git.cmd", "git.exe"] files = [manifest_in, versionfile_source] if ipy: files.append(ipy) try: me = __file__ if me.endswith(".pyc") or me.endswith(".pyo"): me = os.path.splitext(me)[0] + ".py" versioneer_file = os.path.relpath(me) except NameError: versioneer_file = "versioneer.py" files.append(versioneer_file) present = False try: f = open(".gitattributes", "r") for line in f.readlines(): if line.strip().startswith(versionfile_source): if "export-subst" in line.strip().split()[1:]: present = True f.close() except EnvironmentError: pass if not present: f = open(".gitattributes", "a+") f.write("%s export-subst\n" % versionfile_source) f.close() files.append(".gitattributes") run_command(GITS, ["add", "--"] + files) def versions_from_parentdir(parentdir_prefix, root, verbose): """Try to determine the version from the parent directory name. Source tarballs conventionally unpack into a directory that includes both the project name and a version string. 
""" dirname = os.path.basename(root) if not dirname.startswith(parentdir_prefix): if verbose: print("guessing rootdir is '%s', but '%s' doesn't start with " "prefix '%s'" % (root, dirname, parentdir_prefix)) raise NotThisMethod("rootdir doesn't start with parentdir_prefix") return {"version": dirname[len(parentdir_prefix):], "full-revisionid": None, "dirty": False, "error": None} SHORT_VERSION_PY = """ # This file was generated by 'versioneer.py' (0.16) from # revision-control system data, or from the parent directory name of an # unpacked source archive. Distribution tarballs contain a pre-generated copy # of this file. import json import sys version_json = ''' %s ''' # END VERSION_JSON def get_versions(): return json.loads(version_json) """ def versions_from_file(filename): """Try to determine the version from _version.py if present.""" try: with open(filename) as f: contents = f.read() except EnvironmentError: raise NotThisMethod("unable to read _version.py") mo = re.search(r"version_json = '''\n(.*)''' # END VERSION_JSON", contents, re.M | re.S) if not mo: raise NotThisMethod("no version_json in _version.py") return json.loads(mo.group(1)) def write_to_version_file(filename, versions): """Write the given version number to the given _version.py file.""" os.unlink(filename) contents = json.dumps(versions, sort_keys=True, indent=1, separators=(",", ": ")) with open(filename, "w") as f: f.write(SHORT_VERSION_PY % contents) print("set %s to '%s'" % (filename, versions["version"])) def plus_or_dot(pieces): """Return a + if we don't already have one, else return a .""" if "+" in pieces.get("closest-tag", ""): return "." return "+" def render_pep440(pieces): """Build up version string, with post-release "local version identifier". Our goal: TAG[+DISTANCE.gHEX[.dirty]] . Note that if you get a tagged build and then dirty it, you'll get TAG+0.gHEX.dirty Exceptions: 1: no tags. git_describe was just HEX. 0+untagged.DISTANCE.gHEX[.dirty] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += plus_or_dot(pieces) rendered += "%d.g%s" % (pieces["distance"], pieces["short"]) if pieces["dirty"]: rendered += ".dirty" else: # exception #1 rendered = "0+untagged.%d.g%s" % (pieces["distance"], pieces["short"]) if pieces["dirty"]: rendered += ".dirty" return rendered def render_pep440_pre(pieces): """TAG[.post.devDISTANCE] -- No -dirty. Exceptions: 1: no tags. 0.post.devDISTANCE """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"]: rendered += ".post.dev%d" % pieces["distance"] else: # exception #1 rendered = "0.post.dev%d" % pieces["distance"] return rendered def render_pep440_post(pieces): """TAG[.postDISTANCE[.dev0]+gHEX] . The ".dev0" means dirty. Note that .dev0 sorts backwards (a dirty tree will appear "older" than the corresponding clean one), but you shouldn't be releasing software with -dirty anyways. Exceptions: 1: no tags. 0.postDISTANCE[.dev0] """ if pieces["closest-tag"]: rendered = pieces["closest-tag"] if pieces["distance"] or pieces["dirty"]: rendered += ".post%d" % pieces["distance"] if pieces["dirty"]: rendered += ".dev0" rendered += plus_or_dot(pieces) rendered += "g%s" % pieces["short"] else: # exception #1 rendered = "0.post%d" % pieces["distance"] if pieces["dirty"]: rendered += ".dev0" rendered += "+g%s" % pieces["short"] return rendered def render_pep440_old(pieces): """TAG[.postDISTANCE[.dev0]] . The ".dev0" means dirty. Eexceptions: 1: no tags. 
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"] or pieces["dirty"]:
            rendered += ".post%d" % pieces["distance"]
            if pieces["dirty"]:
                rendered += ".dev0"
    else:
        # exception #1
        rendered = "0.post%d" % pieces["distance"]
        if pieces["dirty"]:
            rendered += ".dev0"
    return rendered


def render_git_describe(pieces):
    """TAG[-DISTANCE-gHEX][-dirty].

    Like 'git describe --tags --dirty --always'.

    Exceptions:
    1: no tags. HEX[-dirty]  (note: no 'g' prefix)
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        if pieces["distance"]:
            rendered += "-%d-g%s" % (pieces["distance"], pieces["short"])
    else:
        # exception #1
        rendered = pieces["short"]
    if pieces["dirty"]:
        rendered += "-dirty"
    return rendered


def render_git_describe_long(pieces):
    """TAG-DISTANCE-gHEX[-dirty].

    Like 'git describe --tags --dirty --always --long'.
    The distance/hash is unconditional.

    Exceptions:
    1: no tags. HEX[-dirty]  (note: no 'g' prefix)
    """
    if pieces["closest-tag"]:
        rendered = pieces["closest-tag"]
        rendered += "-%d-g%s" % (pieces["distance"], pieces["short"])
    else:
        # exception #1
        rendered = pieces["short"]
    if pieces["dirty"]:
        rendered += "-dirty"
    return rendered


def render(pieces, style):
    """Render the given version pieces into the requested style."""
    if pieces["error"]:
        return {"version": "unknown",
                "full-revisionid": pieces.get("long"),
                "dirty": None,
                "error": pieces["error"]}

    if not style or style == "default":
        style = "pep440"  # the default

    if style == "pep440":
        rendered = render_pep440(pieces)
    elif style == "pep440-pre":
        rendered = render_pep440_pre(pieces)
    elif style == "pep440-post":
        rendered = render_pep440_post(pieces)
    elif style == "pep440-old":
        rendered = render_pep440_old(pieces)
    elif style == "git-describe":
        rendered = render_git_describe(pieces)
    elif style == "git-describe-long":
        rendered = render_git_describe_long(pieces)
    else:
        raise ValueError("unknown style '%s'" % style)

    return {"version": rendered, "full-revisionid": pieces["long"],
            "dirty": pieces["dirty"], "error": None}


class VersioneerBadRootError(Exception):
    """The project root directory is unknown or missing key files."""


def get_versions(verbose=False):
    """Get the project version from whatever source is available.

    Returns dict with two keys: 'version' and 'full'.
    """
    if "versioneer" in sys.modules:
        # see the discussion in cmdclass.py:get_cmdclass()
        del sys.modules["versioneer"]

    root = get_root()
    cfg = get_config_from_root(root)

    assert cfg.VCS is not None, "please set [versioneer]VCS= in setup.cfg"
    handlers = HANDLERS.get(cfg.VCS)
    assert handlers, "unrecognized VCS '%s'" % cfg.VCS
    verbose = verbose or cfg.verbose
    assert cfg.versionfile_source is not None, \
        "please set versioneer.versionfile_source"
    assert cfg.tag_prefix is not None, "please set versioneer.tag_prefix"

    versionfile_abs = os.path.join(root, cfg.versionfile_source)

    # extract version from first of: _version.py, VCS command (e.g. 'git
    # describe'), parentdir. This is meant to work for developers using a
    # source checkout, for users of a tarball created by 'setup.py sdist',
    # and for users of a tarball/zipball created by 'git archive' or github's
    # download-from-tag feature or the equivalent in other VCSes.
    get_keywords_f = handlers.get("get_keywords")
    from_keywords_f = handlers.get("keywords")
    if get_keywords_f and from_keywords_f:
        try:
            keywords = get_keywords_f(versionfile_abs)
            ver = from_keywords_f(keywords, cfg.tag_prefix, verbose)
            if verbose:
                print("got version from expanded keyword %s" % ver)
            return ver
        except NotThisMethod:
            pass

    try:
        ver = versions_from_file(versionfile_abs)
        if verbose:
            print("got version from file %s %s" % (versionfile_abs, ver))
        return ver
    except NotThisMethod:
        pass

    from_vcs_f = handlers.get("pieces_from_vcs")
    if from_vcs_f:
        try:
            pieces = from_vcs_f(cfg.tag_prefix, root, verbose)
            ver = render(pieces, cfg.style)
            if verbose:
                print("got version from VCS %s" % ver)
            return ver
        except NotThisMethod:
            pass

    try:
        if cfg.parentdir_prefix:
            ver = versions_from_parentdir(cfg.parentdir_prefix, root, verbose)
            if verbose:
                print("got version from parentdir %s" % ver)
            return ver
    except NotThisMethod:
        pass

    if verbose:
        print("unable to compute version")

    return {"version": "0+unknown", "full-revisionid": None,
            "dirty": None, "error": "unable to compute version"}


def get_version():
    """Get the short version string for this project."""
    return get_versions()["version"]


def get_cmdclass():
    """Get the custom setuptools/distutils subclasses used by Versioneer."""
    if "versioneer" in sys.modules:
        del sys.modules["versioneer"]
        # this fixes the "python setup.py develop" case (also 'install' and
        # 'easy_install .'), in which subdependencies of the main project are
        # built (using setup.py bdist_egg) in the same python process. Assume
        # a main project A and a dependency B, which use different versions
        # of Versioneer. A's setup.py imports A's Versioneer, leaving it in
        # sys.modules by the time B's setup.py is executed, causing B to run
        # with the wrong versioneer. Setuptools wraps the sub-dep builds in a
        # sandbox that restores sys.modules to its pre-build state, so the
        # parent is protected against the child's "import versioneer". By
        # removing ourselves from sys.modules here, before the child build
        # happens, we protect the child from the parent's versioneer too.
        # Also see https://github.com/warner/python-versioneer/issues/52

    cmds = {}

    # we add "version" to both distutils and setuptools
    from distutils.core import Command

    class cmd_version(Command):
        description = "report generated version string"
        user_options = []
        boolean_options = []

        def initialize_options(self):
            pass

        def finalize_options(self):
            pass

        def run(self):
            vers = get_versions(verbose=True)
            print("Version: %s" % vers["version"])
            print(" full-revisionid: %s" % vers.get("full-revisionid"))
            print(" dirty: %s" % vers.get("dirty"))
            if vers["error"]:
                print(" error: %s" % vers["error"])
    cmds["version"] = cmd_version

    # we override "build_py" in both distutils and setuptools
    #
    # most invocation pathways end up running build_py:
    #  distutils/build -> build_py
    #  distutils/install -> distutils/build ->..
    #  setuptools/bdist_wheel -> distutils/install ->..
    #  setuptools/bdist_egg -> distutils/install_lib -> build_py
    #  setuptools/install -> bdist_egg ->..
    #  setuptools/develop -> ?
    # we override different "build_py" commands for both environments
    if "setuptools" in sys.modules:
        from setuptools.command.build_py import build_py as _build_py
    else:
        from distutils.command.build_py import build_py as _build_py

    class cmd_build_py(_build_py):
        def run(self):
            root = get_root()
            cfg = get_config_from_root(root)
            versions = get_versions()
            _build_py.run(self)
            # now locate _version.py in the new build/ directory and replace
            # it with an updated value
            if cfg.versionfile_build:
                target_versionfile = os.path.join(self.build_lib,
                                                  cfg.versionfile_build)
                print("UPDATING %s" % target_versionfile)
                write_to_version_file(target_versionfile, versions)
    cmds["build_py"] = cmd_build_py

    if "cx_Freeze" in sys.modules:  # cx_freeze enabled?
        from cx_Freeze.dist import build_exe as _build_exe

        class cmd_build_exe(_build_exe):
            def run(self):
                root = get_root()
                cfg = get_config_from_root(root)
                versions = get_versions()
                target_versionfile = cfg.versionfile_source
                print("UPDATING %s" % target_versionfile)
                write_to_version_file(target_versionfile, versions)

                _build_exe.run(self)
                os.unlink(target_versionfile)
                with open(cfg.versionfile_source, "w") as f:
                    LONG = LONG_VERSION_PY[cfg.VCS]
                    f.write(LONG %
                            {"DOLLAR": "$",
                             "STYLE": cfg.style,
                             "TAG_PREFIX": cfg.tag_prefix,
                             "PARENTDIR_PREFIX": cfg.parentdir_prefix,
                             "VERSIONFILE_SOURCE": cfg.versionfile_source,
                             })
        cmds["build_exe"] = cmd_build_exe
        del cmds["build_py"]

    # we override different "sdist" commands for both environments
    if "setuptools" in sys.modules:
        from setuptools.command.sdist import sdist as _sdist
    else:
        from distutils.command.sdist import sdist as _sdist

    class cmd_sdist(_sdist):
        def run(self):
            versions = get_versions()
            self._versioneer_generated_versions = versions
            # unless we update this, the command will keep using the old
            # version
            self.distribution.metadata.version = versions["version"]
            return _sdist.run(self)

        def make_release_tree(self, base_dir, files):
            root = get_root()
            cfg = get_config_from_root(root)
            _sdist.make_release_tree(self, base_dir, files)
            # now locate _version.py in the new base_dir directory
            # (remembering that it may be a hardlink) and replace it with an
            # updated value
            target_versionfile = os.path.join(base_dir,
                                              cfg.versionfile_source)
            print("UPDATING %s" % target_versionfile)
            write_to_version_file(target_versionfile,
                                  self._versioneer_generated_versions)
    cmds["sdist"] = cmd_sdist

    return cmds


CONFIG_ERROR = """
setup.cfg is missing the necessary Versioneer configuration. You need
a section like:

 [versioneer]
 VCS = git
 style = pep440
 versionfile_source = src/myproject/_version.py
 versionfile_build = myproject/_version.py
 tag_prefix =
 parentdir_prefix = myproject-

You will also need to edit your setup.py to use the results:

 import versioneer
 setup(version=versioneer.get_version(),
       cmdclass=versioneer.get_cmdclass(), ...)

Please read the docstring in ./versioneer.py for configuration instructions,
edit setup.cfg, and re-run the installer or 'python versioneer.py setup'.
"""

SAMPLE_CONFIG = """
# See the docstring in versioneer.py for instructions. Note that you must
# re-run 'versioneer.py setup' after changing this section, and commit the
# resulting files.
[versioneer]
#VCS = git
#style = pep440
#versionfile_source =
#versionfile_build =
#tag_prefix =
#parentdir_prefix =

"""

INIT_PY_SNIPPET = """
from ._version import get_versions
__version__ = get_versions()['version']
del get_versions
"""


def do_setup():
    """Main VCS-independent setup function for installing Versioneer."""
    root = get_root()
    try:
        cfg = get_config_from_root(root)
    except (EnvironmentError, configparser.NoSectionError,
            configparser.NoOptionError) as e:
        if isinstance(e, (EnvironmentError, configparser.NoSectionError)):
            print("Adding sample versioneer config to setup.cfg",
                  file=sys.stderr)
            with open(os.path.join(root, "setup.cfg"), "a") as f:
                f.write(SAMPLE_CONFIG)
        print(CONFIG_ERROR, file=sys.stderr)
        return 1

    print(" creating %s" % cfg.versionfile_source)
    with open(cfg.versionfile_source, "w") as f:
        LONG = LONG_VERSION_PY[cfg.VCS]
        f.write(LONG % {"DOLLAR": "$",
                        "STYLE": cfg.style,
                        "TAG_PREFIX": cfg.tag_prefix,
                        "PARENTDIR_PREFIX": cfg.parentdir_prefix,
                        "VERSIONFILE_SOURCE": cfg.versionfile_source,
                        })

    ipy = os.path.join(os.path.dirname(cfg.versionfile_source),
                       "__init__.py")
    if os.path.exists(ipy):
        try:
            with open(ipy, "r") as f:
                old = f.read()
        except EnvironmentError:
            old = ""
        if INIT_PY_SNIPPET not in old:
            print(" appending to %s" % ipy)
            with open(ipy, "a") as f:
                f.write(INIT_PY_SNIPPET)
        else:
            print(" %s unmodified" % ipy)
    else:
        print(" %s doesn't exist, ok" % ipy)
        ipy = None

    # Make sure both the top-level "versioneer.py" and versionfile_source
    # (PKG/_version.py, used by runtime code) are in MANIFEST.in, so
    # they'll be copied into source distributions. Pip won't be able to
    # install the package without this.
    manifest_in = os.path.join(root, "MANIFEST.in")
    simple_includes = set()
    try:
        with open(manifest_in, "r") as f:
            for line in f:
                if line.startswith("include "):
                    for include in line.split()[1:]:
                        simple_includes.add(include)
    except EnvironmentError:
        pass
    # That doesn't cover everything MANIFEST.in can do
    # (http://docs.python.org/2/distutils/sourcedist.html#commands), so
    # it might give some false negatives. Appending redundant 'include'
    # lines is safe, though.
    if "versioneer.py" not in simple_includes:
        print(" appending 'versioneer.py' to MANIFEST.in")
        with open(manifest_in, "a") as f:
            f.write("include versioneer.py\n")
    else:
        print(" 'versioneer.py' already in MANIFEST.in")
    if cfg.versionfile_source not in simple_includes:
        print(" appending versionfile_source ('%s') to MANIFEST.in" %
              cfg.versionfile_source)
        with open(manifest_in, "a") as f:
            f.write("include %s\n" % cfg.versionfile_source)
    else:
        print(" versionfile_source already in MANIFEST.in")

    # Make VCS-specific changes. For git, this means creating/changing
    # .gitattributes to mark _version.py for export-time keyword
    # substitution.
    do_vcs_install(manifest_in, cfg.versionfile_source, ipy)
    return 0


def scan_setup_py():
    """Validate the contents of setup.py against Versioneer's expectations."""
    found = set()
    setters = False
    errors = 0
    with open("setup.py", "r") as f:
        for line in f.readlines():
            if "import versioneer" in line:
                found.add("import")
            if "versioneer.get_cmdclass()" in line:
                found.add("cmdclass")
            if "versioneer.get_version()" in line:
                found.add("get_version")
            if "versioneer.VCS" in line:
                setters = True
            if "versioneer.versionfile_source" in line:
                setters = True
    if len(found) != 3:
        print("")
        print("Your setup.py appears to be missing some important items")
        print("(but I might be wrong). Please make sure it has something")
        print("roughly like the following:")
        print("")
        print(" import versioneer")
        print(" setup( version=versioneer.get_version(),")
        print("        cmdclass=versioneer.get_cmdclass(),  ...)")
        print("")
        errors += 1
    if setters:
        print("You should remove lines like 'versioneer.VCS = ' and")
        print("'versioneer.versionfile_source = ' . This configuration")
        print("now lives in setup.cfg, and should be removed from setup.py")
        print("")
        errors += 1
    return errors


if __name__ == "__main__":
    cmd = sys.argv[1]
    if cmd == "setup":
        errors = do_setup()
        errors += scan_setup_py()
        if errors:
            sys.exit(1)
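For reference, this is the setup.py wiring that the versioneer.py docstring asks for, as a minimal sketch ("myproject" is a placeholder name, not part of this repository):

# setup.py (sketch)
import versioneer
from setuptools import setup

setup(
    name="myproject",                     # hypothetical project name
    version=versioneer.get_version(),     # computed from tags / _version.py
    cmdclass=versioneer.get_cmdclass(),   # wires in the version/build_py/sdist overrides
)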