vsearch-2.21.1/0000755000175000017500000000000014171574117012566 5ustar nileshnileshvsearch-2.21.1/autogen.sh0000755000175000017500000000004714171574117014570 0ustar nileshnilesh#!/bin/sh autoreconf --force --install vsearch-2.21.1/README.md0000644000175000017500000005305514171574117014055 0ustar nileshnilesh[![Build Status](https://travis-ci.com/torognes/vsearch.svg?branch=master)](https://travis-ci.com/torognes/vsearch) # VSEARCH ## Introduction The aim of this project is to create an alternative to the [USEARCH](https://www.drive5.com/usearch/) tool developed by Robert C. Edgar (2010). The new tool should: * have open source code with an appropriate open source license * be free of charge, gratis * have a 64-bit design that handles very large databases and much more than 4GB of memory * be as accurate or more accurate than usearch * be as fast or faster than usearch We have implemented a tool called VSEARCH which supports *de novo* and reference based chimera detection, clustering, full-length and prefix dereplication, rereplication, reverse complementation, masking, all-vs-all pairwise global alignment, exact and global alignment searching, shuffling, subsampling and sorting. It also supports FASTQ file analysis, filtering, conversion and merging of paired-end reads. VSEARCH stands for vectorized search, as the tool takes advantage of parallelism in the form of SIMD vectorization as well as multiple threads to perform accurate alignments at high speed. VSEARCH uses an optimal global aligner (full dynamic programming Needleman-Wunsch), in contrast to USEARCH which by default uses a heuristic seed and extend aligner. This usually results in more accurate alignments and overall improved sensitivity (recall) with VSEARCH, especially for alignments with gaps. [VSEARCH binaries](https://github.com/torognes/vsearch/releases/latest) are provided for GNU/Linux on three 64-bit processor architectures: x86-64, POWER8 (ppc64le) and ARMv8 (aarch64). 
Binaries are also provided for MacOS (version 10.9 Mavericks or later) on Intel (x86-64) and Apple Silicon (ARMv8), as well as Windows (64-bit, version 7 or higher, on x86_64). VSEARCH contains dedicated SIMD code for the three processor architectures (SSE2/SSSE3, AltiVec/VMX/VSX, Neon).

| CPU \ OS      | GNU/Linux     | MacOS  | Windows   |
| ------------- | :-----------: | :----: | :-------: |
| x86_64        |  ✔            |  ✔     |   ✔       |
| ARMv8         |  ✔            |  ✔     |           |
| POWER8        |  ✔            |        |           |

Various packages, plugins and wrappers are also available from other sources - see [below](https://github.com/torognes/vsearch#packages-plugins-and-wrappers).

The source code compiles correctly with `gcc` (versions 4.8.5 to 10.2) and `llvm-clang` (3.8 to 13.0). The source code should also compile on [FreeBSD](https://www.freebsd.org/) and [NetBSD](https://www.netbsd.org/) systems.

VSEARCH can directly read input query and database files that are compressed using gzip and bzip2 (.gz and .bz2) if the zlib and bzip2 libraries are available.

Most of the nucleotide based commands and options in USEARCH version 7 are supported, as well as some in version 8. The same option names as in USEARCH version 7 have been used in order to make VSEARCH an almost drop-in replacement. VSEARCH does not support amino acid sequences or local alignments. These features may be added in the future.

## Getting Help

If you can't find an answer in the [VSEARCH documentation](https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch_manual.pdf), please visit the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) to post a question or start a discussion.

## Example

In the example below, VSEARCH will identify sequences in the file database.fsa that are at least 90% identical on the plus strand to the query sequences in the file queries.fsa and write the results to the file alnout.txt.

`./vsearch --usearch_global queries.fsa --db database.fsa --id 0.9 --alnout alnout.txt`

## Download and install

**Source distribution** To download the source distribution from a [release](https://github.com/torognes/vsearch/releases) and build the executable and the documentation, use the following commands:

```
wget https://github.com/torognes/vsearch/archive/v2.21.1.tar.gz
tar xzf v2.21.1.tar.gz
cd vsearch-2.21.1
./autogen.sh
./configure CFLAGS="-O3" CXXFLAGS="-O3"
make
make install  # as root or sudo make install
```

You may customize the installation directory using the `--prefix=DIR` option to `configure`. If the compression libraries [zlib](https://www.zlib.net) and/or [bzip2](https://www.sourceware.org/bzip2/) are installed on the system, they will be detected automatically and support for compressed files will be included in vsearch. Support for compressed files may be disabled using the `--disable-zlib` and `--disable-bzip2` options to `configure`. A PDF version of the manual will be created from the `vsearch.1` manual file if `ps2pdf` is available, unless disabled using the `--disable-pdfman` option to `configure`. It is recommended to run configure with the options `CFLAGS="-O3"` and `CXXFLAGS="-O3"`. Other options may also be applied to `configure`; please run `configure -h` to see them all. GNU autoconf (version 2.63 or later), automake and the GCC C++ compiler are required to build vsearch. Version 3.82 or later of Make may be required on Linux, while version 3.81 is sufficient on macOS.

The distributed Linux ppc64le and aarch64 binaries and the Windows binary were compiled using the [Mingw-w64](http://mingw-w64.org/) C++ cross-compiler.

**Cloning the repo** Instead of downloading the source distribution as a compressed archive, you could clone the repo and build it as shown below. The options to `configure` as described above are still valid.

```
git clone https://github.com/torognes/vsearch.git
cd vsearch
./autogen.sh
./configure
make
make install  # as root or sudo make install
```

**Binary distribution** Starting with version 1.4.0, binary distribution files containing pre-compiled binaries as well as the documentation will be made available as part of each [release](https://github.com/torognes/vsearch/releases). The executables include support for input files compressed by zlib and bzip2 (with files usually ending in `.gz` or `.bz2`).

Binary distributions are provided for x86-64 systems running GNU/Linux, macOS (version 10.7 or higher) or Windows (64-bit, version 7 or higher), 64-bit ARMv8 (aarch64) systems running GNU/Linux or macOS, as well as POWER8 (ppc64le) systems running GNU/Linux.

Download the appropriate executable for your system using the following commands if you are using a Linux x86_64 system:

```sh
wget https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-linux-x86_64.tar.gz
tar xzf vsearch-2.21.1-linux-x86_64.tar.gz
```

Or these commands if you are using a Linux ppc64le system:

```sh
wget https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-linux-ppc64le.tar.gz
tar xzf vsearch-2.21.1-linux-ppc64le.tar.gz
```

Or these commands if you are using a Linux aarch64 (arm64) system:

```sh
wget https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-linux-aarch64.tar.gz
tar xzf vsearch-2.21.1-linux-aarch64.tar.gz
```

Or these commands if you are using a Mac:

```sh
wget https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-macos-x86_64.tar.gz
tar xzf vsearch-2.21.1-macos-x86_64.tar.gz
```

Or if you are using Windows, download and extract (unzip) the contents of this file:

```
https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch-2.21.1-win-x86_64.zip
```

Linux and Mac: You will now have the binary distribution in a folder called `vsearch-2.21.1-linux-x86_64` or
`vsearch-2.21.1-macos-x86_64` in which you will find three subfolders `bin`, `man` and `doc`. We recommend making a copy or a symbolic link to the vsearch binary `bin/vsearch` in a folder included in your `$PATH`, and a copy or a symbolic link to the vsearch man page `man/vsearch.1` in a folder included in your `$MANPATH`. The PDF version of the manual is available in `doc/vsearch_manual.pdf`. Versions with statically compiled libraries are available for Linux systems. These have "-static" in their name, and could be used on systems that do not have all the necessary libraries installed.

Windows: You will now have the binary distribution in a folder called `vsearch-2.21.1-win-x86_64`. The vsearch executable is called `vsearch.exe`. The manual in PDF format is called `vsearch_manual.pdf`.

**Documentation** The VSEARCH user's manual is available in the `man` folder in the form of a [man page](https://github.com/torognes/vsearch/blob/master/man/vsearch.1). A PDF version ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch_manual.pdf)) will be generated by `make`. To install the man page manually, copy the `vsearch.1` file or create a symbolic link to `vsearch.1` in a folder included in your `$MANPATH`. The manual in both formats is also available with the binary distribution. The manual in PDF form ([vsearch_manual.pdf](https://github.com/torognes/vsearch/releases/download/v2.21.1/vsearch_manual.pdf)) is also attached to the latest [release](https://github.com/torognes/vsearch/releases).

## Packages, plugins, and wrappers

**Conda package** Thanks to the [BioConda](https://bioconda.github.io/) team, there is now a [vsearch package](https://anaconda.org/bioconda/vsearch) in [Conda](https://conda.io/).

**Debian package** Thanks to the [Debian Med](https://www.debian.org/devel/debian-med/) team, there is now a [vsearch](https://packages.debian.org/sid/vsearch) package in [Debian](https://www.debian.org/).

**FreeBSD ports package** Thanks to [Jason Bacon](https://github.com/outpaddling), a [vsearch](https://www.freebsd.org/cgi/ports.cgi?query=vsearch&stype=all) [FreeBSD ports](https://www.freebsd.org/ports/) package is available. Install the binary package with `pkg install vsearch`, or build from source with additional optimizations.

**Galaxy wrapper** Thanks to the work of the [Intergalactic Utilities Commission](https://wiki.galaxyproject.org/IUC) members, vsearch is now part of the [Galaxy ToolShed](https://toolshed.g2.bx.psu.edu/view/iuc/vsearch/).

**Homebrew package** Thanks to [Torsten Seemann](https://github.com/tseemann), a [vsearch package](https://formulae.brew.sh/formula/vsearch) for [Homebrew](http://brew.sh/) has been made.

**Pkgsrc package** Thanks to [Jason Bacon](https://github.com/outpaddling), a vsearch [pkgsrc](https://www.pkgsrc.org) package is available for NetBSD and other UNIX-like systems. Install the binary package with `pkgin install vsearch`, or build from source with additional optimizations.

**QIIME 2 plugin** Thanks to the [QIIME 2](https://github.com/qiime2) team, there is now a plugin called [q2-vsearch](https://github.com/qiime2/q2-vsearch) for [QIIME 2](https://qiime2.org).

## Converting output to a biom file for use in QIIME and other software

With the `from-uc` command in [biom](http://biom-format.org/) 2.1.5 or later, it is possible to convert data in a `.uc` file produced by vsearch into a biom file that can be read by QIIME and other software. It is described [here](https://gist.github.com/gregcaporaso/f3c042e5eb806349fa18).

Please note that vsearch version 2.2.0 and later are able to directly output OTU tables in biom 1.0 format as well as the classic and mothur formats.

## Implementation details and initial assessment

Please see the paper for details:

Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics.
PeerJ 4:e2584 doi: [10.7717/peerj.2584](https://doi.org/10.7717/peerj.2584) ## Dependencies When compiling VSEARCH the header files for the following two optional libraries are required if support for gzip and bzip2 compressed FASTA and FASTQ input files is needed: * libz (zlib library) (zlib.h header file) (optional) * libbz2 (bzip2lib library) (bzlib.h header file) (optional) VSEARCH will automatically check whether these libraries are available and load them dynamically. On Windows these libraries are called zlib1.dll and bz2.dll. Unfortunately, VSEARCH will not work properly with all the different variants of the `zlib1.dll` file on Windows. One that works well is provided by the MinGW-w64 project and is found in the `bin` folder within the [zlib-1.2.5-bin-x64.zip](https://sourceforge.net/projects/mingw-w64/files/External%20binary%20packages%20%28Win64%20hosted%29/Binaries%20%2864-bit%29/zlib-1.2.5-bin-x64.zip) archive available on SourceForge. The MD5 of the `zlib1.dll` file should be `0f67ee0b965d3d29388c238aebcf60bc`. To create the PDF file with the manual the ps2pdf tool is required. It is part of the ghostscript package. ## VSEARCH license and third party licenses The VSEARCH code is dual-licensed either under the GNU General Public License version 3 or under the BSD 2-clause license. Please see LICENSE.txt for details. VSEARCH includes code from several other projects. We thank the authors for making their source code available. VSEARCH includes code from Google's [CityHash project](https://github.com/google/cityhash) by Geoff Pike and Jyrki Alakuijala, providing some excellent hash functions available under a MIT license. VSEARCH includes code derived from Tatusov and Lipman's DUST program that is in the public domain. VSEARCH includes public domain code written by Alexander Peslyak for the MD5 message digest algorithm. VSEARCH includes public domain code written by Steve Reid and others for the SHA1 message digest algorithm. 
The VSEARCH distribution includes code from GNU Autoconf which normally is available under the GNU General Public License, but may be distributed with the special autoconf configure script exception. VSEARCH may include code from the [zlib](https://www.zlib.net) library copyright Jean-loup Gailly and Mark Adler, distributed under the [zlib license](https://www.zlib.net/zlib_license.html). VSEARCH may include code from the [bzip2](https://www.sourceware.org/bzip2/) library copyright Julian R. Seward, distributed under a BSD-style license. ## Code The code is written mostly in C++. File | Description ---|--- **align.cc** | New Needleman-Wunsch global alignment, serial. Only for testing. **align_simd.cc** | SIMD parallel global alignment of 1 query with 8 database sequences **allpairs.cc** | All-vs-all optimal global pairwise alignment (no heuristics) **arch.cc** | Architecture specific code (Mac/Linux) **attributes.cc** | Extraction and printing of attributes in FASTA headers **bitmap.cc** | Implementation of bitmaps **chimera.cc** | Chimera detection **city.cc** | CityHash code **cluster.cc** | Clustering (cluster\_fast and cluster\_smallmem) **cpu.cc** | Code dependent on specific cpu features (e.g. 
ssse3) **cut.cc** | Restriction site cutting **db.cc** | Handles the database file read, access etc **dbhash.cc** | Database hashing for exact searches **dbindex.cc** | Indexes the database by identifying unique kmers in the sequences **derep.cc** | Dereplication **dynlibs.cc** | Dynamic loading of compression libraries **eestats.cc** | Produce statistics for fastq_eestats command **fa2fq.cc** | FASTA to FASTQ conversion **fasta.cc** | FASTA file parser **fastq.cc** | FASTQ file parser **fastqjoin.cc** | FASTQ paired-end reads joining **fastqops.cc** | FASTQ file statistics etc **fastx.cc** | Detection of FASTA and FASTQ files, wrapper for FASTA and FASTQ parsers **filter.cc** | Trimming and filtering of sequences in FASTA and FASTQ files **getseq.cc** | Extraction of sequences based on header labels **kmerhash.cc** | Hash for kmers used by paired-end read merger **linmemalign.cc** | Linear memory global sequence aligner **maps.cc** | Various character mapping arrays **mask.cc** | Masking (DUST) **md5.c** | MD5 message digest **mergepairs.cc** | Paired-end read merging **minheap.cc** | A minheap implementation for the list of top kmer matches **msa.cc** | Simple multiple sequence alignment and consensus sequence computation for clusters **orient.cc** | Orient direction of sequences based on reference database **otutable.cc** | Generate OTU tables in various formats **rerep.cc** | Rereplication **results.cc** | Output results in various formats (alnout, userout, blast6, uc) **search.cc** | Implements search using global alignment **searchcore.cc** | Core search functions for searching, clustering and chimera detection **searchexact.cc** | Exact search functions **sffconvert.cc** | SFF to FASTQ file conversion **sha1.c** | SHA1 message digest **showalign.cc** | Output an alignment in a human-readable way given a CIGAR-string and the sequences **shuffle.cc** | Shuffle sequences **sintax.cc** | Taxonomic classification using Sintax method **sortbylength.cc** | Code for 
sorting by length **sortbysize.cc** | Code for sorting by size (abundance) **subsample.cc** | Subsampling reads from a FASTA file **tax.cc** | Taxonomy information parsing **udb.cc** | UDB database file handling **unique.cc** | Find unique kmers in a sequence **userfields.cc** | Code for parsing the userfields option argument **util.cc** | Various common utility functions **vsearch.cc** | Main program file, general initialization, reads arguments and parses options, writes info. **xstring.h** | Code for a simple string class VSEARCH may be compiled with zlib or bzip2 integration that allows it to read compressed FASTA files. The [zlib](http://www.zlib.net/) and the [bzip2](https://www.sourceware.org/bzip2/) libraries are needed for this. ## Bugs All bug reports are highly appreciated. You may submit a bug report here on GitHub as an [issue](https://github.com/torognes/vsearch/issues), you could post a message on the [VSEARCH Web Forum](https://groups.google.com/forum/#!forum/vsearch-forum) or you could send an email to [torognes@ifi.uio.no](mailto:torognes@ifi.uio.no?subject=bug_in_vsearch). ## Limitations VSEARCH is designed for rather short sequences, and will be slow when sequences are longer than about 5,000 bp. This is because it always performs optimal global alignment on selected sequences. ## The VSEARCH team The main contributors to VSEARCH: * Torbjørn Rognes (Coding, testing, documentation, evaluation) * Frédéric Mahé (Documentation, testing, feature suggestions) * Tomáš Flouri (Coding, testing) * Christopher Quince (Initiator, feature suggestions, evaluation) * Ben Nichols (Evaluation) ## Acknowledgements Special thanks to the following people for patches, suggestions, computer access etc: * Davide Albanese * Colin Brislawn * Jeff Epler * Christopher M. Sullivan * Andreas Tille * Sarah Westcott ## Citing VSEARCH Please cite the following publication if you use VSEARCH: Rognes T, Flouri T, Nichols B, Quince C, Mahé F. 
(2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584. doi: [10.7717/peerj.2584](https://doi.org/10.7717/peerj.2584) Please note that citing any of the underlying algorithms, e.g. UCHIME, may also be appropriate. ## Test datasets Test datasets (found in the separate vsearch-data repository) were obtained from the BioMarks project (Logares et al. 2014), the [TARA OCEANS project](https://oceans.taraexpeditions.org/en/) (Karsenti et al. 2011) and the [Protist Ribosomal Reference Database (PR2)](https://github.com/pr2database/pr2database) (Guillou et al. 2013). ## References * Edgar RC (2010) **Search and clustering orders of magnitude faster than BLAST.** *Bioinformatics*, 26 (19): 2460-2461. doi:[10.1093/bioinformatics/btq461](https://doi.org/10.1093/bioinformatics/btq461) * Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R (2011) **UCHIME improves sensitivity and speed of chimera detection.** *Bioinformatics*, 27 (16): 2194-2200. doi:[10.1093/bioinformatics/btr381](https://doi.org/10.1093/bioinformatics/btr381) * Edgar RC, Flyvbjerg H (2015) **Error filtering, pair assembly and error correction for next-generation sequencing reads.** *Bioinformatics*, 31 (21): 3476-3482. doi:[10.1093/bioinformatics/btv401](https://doi.org/10.1093/bioinformatics/btv401) * Guillou L, Bachar D, Audic S, Bass D, Berney C, Bittner L, Boutte C, Burgaud G, de Vargas C, Decelle J, del Campo J, Dolan J, Dunthorn M, Edvardsen B, Holzmann M, Kooistra W, Lara E, Lebescot N, Logares R, Mahé F, Massana R, Montresor M, Morard R, Not F, Pawlowski J, Probert I, Sauvadet A-L, Siano R, Stoeck T, Vaulot D, Zimmermann P & Christen R (2013) **The Protist Ribosomal Reference database (PR2): a catalog of unicellular eukaryote Small Sub-Unit rRNA sequences with curated taxonomy.** *Nucleic Acids Research*, 41 (D1), D597-D604. 
doi:[10.1093/nar/gks1160](https://doi.org/10.1093/nar/gks1160) * Karsenti E, González Acinas S, Bork P, Bowler C, de Vargas C, Raes J, Sullivan M B, Arendt D, Benzoni F, Claverie J-M, Follows M, Jaillon O, Gorsky G, Hingamp P, Iudicone D, Kandels-Lewis S, Krzic U, Not F, Ogata H, Pesant S, Reynaud E G, Sardet C, Sieracki M E, Speich S, Velayoudon D, Weissenbach J, Wincker P & the Tara Oceans Consortium (2011) **A holistic approach to marine eco-systems biology.** *PLoS Biology*, 9(10), e1001177. doi:[10.1371/journal.pbio.1001177](https://doi.org/10.1371/journal.pbio.1001177) * Logares R, Audic S, Bass D, Bittner L, Boutte C, Christen R, Claverie J-M, Decelle J, Dolan J R, Dunthorn M, Edvardsen B, Gobet A, Kooistra W H C F, Mahé F, Not F, Ogata H, Pawlowski J, Pernice M C, Romac S, Shalchian-Tabrizi K, Simon N, Stoeck T, Santini S, Siano R, Wincker P, Zingone A, Richards T, de Vargas C & Massana R (2014) **The patterning of rare and abundant community assemblages in coastal marine-planktonic microbial eukaryotes.** *Current Biology*, 24(8), 813-821. doi:[10.1016/j.cub.2014.02.050](https://doi.org/10.1016/j.cub.2014.02.050) * Rognes T (2011) **Faster Smith-Waterman database searches by inter-sequence SIMD parallelisation.** *BMC Bioinformatics*, 12: 221. 
doi:[10.1186/1471-2105-12-221](https://doi.org/10.1186/1471-2105-12-221) vsearch-2.21.1/man/0000755000175000017500000000000014171574117013341 5ustar nileshnileshvsearch-2.21.1/man/Makefile.am0000755000175000017500000000106714171574117015404 0ustar nileshnilesh# Makefile for creating PDF manual from man file dist_man_MANS = vsearch.1 doc_DATA = CLEANFILES = if HAVE_MAN_HTML doc_DATA += vsearch_manual.html vsearch_manual.html : vsearch.1 sed -e 's/\\-/-/g' $< | \ iconv -f UTF-8 -t ISO-8859-1 | \ groff -t -m mandoc -m www -Thtml > $@ CLEANFILES += vsearch_manual.html endif if HAVE_PS2PDF doc_DATA += vsearch_manual.pdf vsearch_manual.pdf : vsearch.1 sed -e 's/\\-/-/g' $< | \ iconv -f UTF-8 -t ISO-8859-1 | \ groff -W space -t -m mandoc -T ps -P -pa4 | ps2pdf - $@ CLEANFILES += vsearch_manual.pdf endif vsearch-2.21.1/man/vsearch.10000644000175000017500000055451014171574117015070 0ustar nileshnilesh.\" ============================================================================ .TH vsearch 1 "January 18, 2022" "version 2.21.1" "USER COMMANDS" .\" ============================================================================ .SH NAME vsearch \(em a versatile open-source tool for microbiome analysis, including chimera detection, clustering, dereplication and rereplication, extraction, FASTA/FASTQ/SFF file processing, masking, orienting, pairwise alignment, restriction site cutting, searching, shuffling, sorting, subsampling, and taxonomic classification of amplicon sequences for metagenomics, genomics, and population genetics. 
.\" ============================================================================
.SH SYNOPSIS
.\" left justified, ragged right
.ad l
Chimera detection:
.RS
\fBvsearch\fR (\-\-uchime_denovo | \-\-uchime2_denovo | \-\-uchime3_denovo) \fIfastafile\fR (\-\-chimeras | \-\-nonchimeras | \-\-uchimealns | \-\-uchimeout) \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-uchime_ref \fIfastafile\fR (\-\-chimeras | \-\-nonchimeras | \-\-uchimealns | \-\-uchimeout) \fIoutputfile\fR \-\-db \fIfastafile\fR [\fIoptions\fR]
.PP
.RE
Clustering:
.RS
\fBvsearch\fR (\-\-cluster_fast | \-\-cluster_size | \-\-cluster_smallmem | \-\-cluster_unoise) \fIfastafile\fR (\-\-alnout | \-\-biomout | \-\-blast6out | \-\-centroids | \-\-clusters | \-\-mothur_shared_out | \-\-msaout | \-\-otutabout | \-\-profile | \-\-samout | \-\-uc | \-\-userout) \fIoutputfile\fR \-\-id \fIreal\fR [\fIoptions\fR]
.PP
.RE
Dereplication and rereplication:
.RS
\fBvsearch\fR \-\-fastx_uniques (\fIfastafile\fR | \fIfastqfile\fR) (\-\-fastaout | \-\-fastqout | \-\-tabbedout | \-\-uc) \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR (\-\-derep_fulllength | \-\-derep_id | \-\-derep_prefix) \fIfastafile\fR (\-\-output | \-\-uc) \fIoutputfile\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-rereplicate \fIfastafile\fR \-\-output \fIoutputfile\fR [\fIoptions\fR]
.PP
.RE
Extraction of sequences:
.RS
\fBvsearch\fR \-\-fastx_getseq \fIfastafile\fR (\-\-fastaout | \-\-fastqout | \-\-notmatched | \-\-notmatchedfq) \fIoutputfile\fR \-\-label \fIlabel\fR [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastx_getseqs \fIfastafile\fR (\-\-fastaout | \-\-fastqout | \-\-notmatched | \-\-notmatchedfq) \fIoutputfile\fR (\-\-label \fIlabel\fR | \-\-labels \fIlabelfile\fR | \-\-label_word \fIlabel\fR | \-\-label_words \fIlabelfile\fR) [\fIoptions\fR]
.PP
\fBvsearch\fR \-\-fastx_getsubseq \fIfastafile\fR (\-\-fastaout | \-\-fastqout | \-\-notmatched | \-\-notmatchedfq) \fIoutputfile\fR \-\-label \fIlabel\fR [\-\-subseq_start \fIposition\fR]
[\-\-subseq_end \fIposition\fR] [\fIoptions\fR] .PP .RE FASTA/FASTQ/SFF file processing: .RS \fBvsearch\fR \-\-fasta2fastq \fIfastqfile\fR \-\-fastqout \fIoutputfile\fR [\fIoptions\fR] .PP \fBvsearch\fR \-\-fastq_chars \fIfastqfile\fR [\fIoptions\fR] .PP \fBvsearch\fR \-\-fastq_convert \fIfastqfile\fR \-\-fastqout \fIoutputfile\fR [\fIoptions\fR] .PP \fBvsearch\fR (\-\-fastq_eestats | \-\-fastq_eestats2) \fIfastqfile\fR \-\-output \fIoutputfile\fR [\fIoptions\fR] .PP \fBvsearch\fR \-\-fastq_filter \fIfastqfile\fR [\-\-reverse \fIfastqfile\fR] (\-\-fastaout | \-\-fastaout_discarded | \-\-fastqout | \-\-fastqout_discarded \-\-fastaout_rev | \-\-fastaout_discarded_rev | \-\-fastqout_rev | \-\-fastqout_discarded_rev) \fIoutputfile\fR [\fIoptions\fR] .PP \fBvsearch\fR \-\-fastq_join \fIfastqfile\fR \-\-reverse \fIfastqfile\fR (\-\-fastaout | \-\-fastqout) \fIoutputfile\fR [\fIoptions\fR] .PP \fBvsearch\fR \-\-fastq_mergepairs \fIfastqfile\fR \-\-reverse \fIfastqfile\fR (\-\-fastaout | \-\-fastqout | \-\-fastaout_notmerged_fwd | \-\-fastaout_notmerged_rev | \-\-fastqout_notmerged_fwd | \-\-fastqout_notmerged_rev | \-\-eetabbedout) \fIoutputfile\fR [\fIoptions\fR] .PP \fBvsearch\fR \-\-fastq_stats \fIfastqfile\fR [\-\-log \fIlogfile\fR] [\fIoptions\fR] .PP \fBvsearch\fR \-\-fastx_filter \fIinputfile\fR [\-\-reverse \fIinputfile\fR] (\-\-fastaout | \-\-fastaout_discarded | \-\-fastqout | \-\-fastqout_discarded \-\-fastaout_rev | \-\-fastaout_discarded_rev | \-\-fastqout_rev | \-\-fastqout_discarded_rev) \fIoutputfile\fR [\fIoptions\fR] .PP \fBvsearch\fR \-\-fastx_revcomp \fIinputfile\fR (\-\-fastaout | \-\-fastqout) \fIoutputfile\fR [\fIoptions\fR] .PP \fBvsearch\fR \-\-sff_convert \fIsff-file\fR \-\-fastqout \fIoutputfile\fR [\fIoptions\fR] .PP .RE Masking: .RS \fBvsearch\fR \-\-fastx_mask \fIfastxfile\fR (\-\-fastaout | \-\-fastqout) \fIoutputfile\fR [\fIoptions\fR] .PP \fBvsearch\fR \-\-maskfasta \fIfastafile\fR \-\-output \fIoutputfile\fR [\fIoptions\fR] .PP .RE 
Orienting: .RS \fBvsearch\fR \-\-orient \fIfastxfile\fR \-\-db \fIfastafile\fR (\-\-fastaout | \-\-fastqout | \-\-notmatched | \-\-tabbedout) \fIoutputfile\fR [\fIoptions\fR] .PP .RE Pairwise alignment: .RS \fBvsearch\fR \-\-allpairs_global \fIfastafile\fR (\-\-alnout | \-\-blast6out | \-\-matched | \-\-notmatched | \-\-samout | \-\-uc | \-\-userout) \fIoutputfile\fR (\-\-acceptall | \-\-id \fIreal\fR) [\fIoptions\fR] .PP .RE Restriction site cutting: .RS \fBvsearch\fR \-\-cut \fIfastafile\fR \-\-cut_pattern \fIpattern\fR (\-\-fastaout | \-\-fastaout_rev | \-\-fastaout_discarded | \-\-fastaout_discarded_rev) \fIoutputfile\fR [\fIoptions\fR] .PP .RE Searching: .RS \fBvsearch\fR \-\-search_exact \fIfastafile\fR \-\-db \fIfastafile\fR (\-\-alnout | \-\-biomout | \-\-blast6out | \-\-mothur_shared_out | \-\-otutabout | \-\-samout | \-\-uc | \-\-userout | \-\-lcaout) \fIoutputfile\fR [\fIoptions\fR] .PP \fBvsearch\fR \-\-usearch_global \fIfastafile\fR \-\-db \fIfastafile\fR (\-\-alnout | \-\-biomout | \-\-blast6out | \-\-mothur_shared_out | \-\-otutabout | \-\-samout | \-\-uc | \-\-userout | \-\-lcaout) \fIoutputfile\fR \-\-id \fIreal\fR [\fIoptions\fR] .PP .RE Shuffling and sorting: .RS \fBvsearch\fR (\-\-shuffle | \-\-sortbylength | \-\-sortbysize) \fIfastafile\fR \-\-output \fIoutputfile\fR [\fIoptions\fR] .PP .RE Subsampling: .RS \fBvsearch\fR \-\-fastx_subsample \fIfastafile\fR (\-\-fastaout | \-\-fastqout) \fIoutputfile\fR (\-\-sample_pct \fIreal\fR | \-\-sample_size \fIpositive integer\fR) [\fIoptions\fR] .PP .RE Taxonomic classification: .RS \fBvsearch\fR \-\-sintax \fIfastafile\fR \-\-db \fIfastafile\fR \-\-tabbedout \fIoutputfile\fR [\-\-sintax_cutoff \fIreal\fR] [\fIoptions\fR] .PP .RE UDB database handling: .RS \fBvsearch\fR \-\-makeudb_usearch \fIfastafile\fR \-\-output \fIoutputfile\fR [\fIoptions\fR] .PP \fBvsearch\fR \-\-udb2fasta \fIudbfile\fR \-\-output \fIoutputfile\fR [\fIoptions\fR] .PP \fBvsearch\fR (\-\-udbinfo | \-\-udbstats) \fIudbfile\fR 
[\fIoptions\fR]
.PP
.RE
.\" left and right justified (default)
.ad b
.\" ============================================================================
.SH DESCRIPTION
Environmental or clinical molecular diversity studies generate large volumes of amplicons (e.g., SSU-rRNA sequences) that need to be checked for chimeras, dereplicated, masked, sorted, searched, clustered or compared to reference sequences. The aim of \fBvsearch\fR is to offer an all-in-one open source tool to perform these tasks, using optimized algorithm implementations and harnessing the full potential of modern computers, thus providing fast and accurate data processing.
.PP
Comparing nucleotide sequences is at the core of \fBvsearch\fR. To speed up comparisons, \fBvsearch\fR implements an extremely fast Needleman-Wunsch algorithm, making use of the Streaming SIMD Extensions (SSE2) of post-2003 x86-64 CPUs. If SSE2 instructions are not available, \fBvsearch\fR exits with an error message. On Power8 CPUs it will use AltiVec/VSX/VMX instructions, and on ARMv8 CPUs it will use Neon instructions. Memory usage increases rapidly with sequence length: for example, comparing two sequences of length 1 kb requires 8 MB of memory per thread, and comparing two 10 kb sequences requires 800 MB of memory per thread. For comparisons involving sequences with a length product greater than 25 million (for example two sequences of length 5 kb), \fBvsearch\fR uses a slower alignment method described by Hirschberg (1975) and Myers and Miller (1988), with much smaller memory requirements.
.\" ----------------------------------------------------------------------------
.SS Input
\fBvsearch\fR accepts as input fasta or fastq files containing one or several nucleotidic entries. In fasta files, each entry is made of a header and a sequence.
The header is defined as the string comprised between the initial '>' symbol and the first space, tab or the end of the line, unless the \-\-notrunclabels option is in effect, in which case the entire line is included. The header should contain printable ascii characters (33-126). The program will terminate with a fatal error if there are unprintable ascii characters. A warning will be issued if non-ascii characters (128-255) are encountered.
.PP
If the header matches '>[;]size=\fIinteger\fR;label', '>label;size=\fIinteger\fR;label' or '>label;size=\fIinteger\fR[;]', \fBvsearch\fR interprets \fIinteger\fR as the number of occurrences (or abundance) of the sequence in the study. That abundance information is used or created during chimera detection, clustering, dereplication, sorting and searching.
.PP
The sequence is defined as a string of IUPAC symbols (ACGTURYSWKMDBHVN), starting after the end of the identifier line and ending before the next identifier line, or the file end. \fBvsearch\fR silently ignores ascii characters 9 to 13, and exits with an error message if ascii characters 0 to 8, 14 to 31, '.' or '-' are present. All other ascii or non-ascii characters are stripped and complained about in a warning message.
.PP
In fastq files, each entry is made of a sequence header starting with a symbol '@', a nucleotidic sequence (same rules as for fasta sequences), a quality header starting with a symbol '+' and a string of ASCII characters (offset 33 or 64), each one encoding the quality value of the corresponding position in the nucleotidic sequence.
.PP
\fBvsearch\fR operations are case insensitive, except when soft masking is activated. Masking is automatically applied during chimera detection, clustering, masking, pairwise alignment and searching. Soft masking is specified with the options '\-\-dbmask soft' (for searching and chimera detection with a reference) or '\-\-qmask soft' (for searching, \fIde novo\fR chimera detection, clustering and masking).
When using soft masking, lower case letters indicate masked symbols, while upper case letters indicate regular symbols. Masked symbols are never included in the unique index words used for sequence comparisons; otherwise they are treated as normal symbols. .PP When comparing sequences during chimera detection, dereplication, searching and clustering, T and U are considered identical, regardless of their case. When aligning sequences, identical symbols will receive a positive match score (default +2). If two symbols are not identical, their alignment results in a negative mismatch score (default -4). Aligning a pair of symbols where at least one of them is an ambiguous symbol (BDHKMNRSVWY) will always result in a score of zero. Alignment of two identical ambiguous symbols (for example, R vs R) also receives a score of zero. When computing the amount of similarity by counting matches and mismatches after alignment, ambiguous nucleotide symbols will count as matching to other symbols if they have at least one of the nucleotides (ACGTU) they may represent in common. For example: W will match A and T, but also any of MRVHDN. When showing alignments (for example with the \-\-alnout option) matches involving ambiguous symbols will be shown with a plus character (+) between them while exact matches between non-ambiguous symbols will be shown with a vertical bar character (|). .PP \fBvsearch\fR can read data from standard files and write to standard files, but it can also read from pipes and write to pipes. For example, multiple fasta files can be piped into \fBvsearch\fR for dereplication. To do so, file names can be replaced with: .RS .IP - 2 the symbol '-', representing '/dev/stdin' for input files or '/dev/stdout' for output files, .IP - a named pipe created with the command mkfifo, .IP - a process substitution '<(command)' as input or '>(command)' as output.
.RE .PP \fBvsearch\fR can automatically read compressed gzip or bzip2 files if the appropriate libraries are present during the compilation. \fBvsearch\fR can also read pipes streaming compressed gzip or bzip2 data if the options \-\-gzip_decompress or \-\-bzip2_decompress are selected. When reading from a pipe, the progress indicator is not updated. .\" ---------------------------------------------------------------------------- .SS Options \fBvsearch\fR recognizes a large number of command-line commands and options. For easier navigation, options are grouped below by theme (chimera detection, clustering, dereplication and rereplication, FASTA/FASTQ file processing, masking, pairwise alignment, searching, shuffling, sorting, and subsampling). We start with the general options that apply to all themes. Options start with a double dash (\-\-). A single dash (\-) may also be used, except on NetBSD systems. Option names may be shortened as long as they are not ambiguous (e.g. \-\-derep_f). .RE .PP .\" ---------------------------------------------------------------------------- .TAG help-and-version-commands Help and version commands: .PP .RS .TAG help .TAG h .TP 9 .B \-\-help \-\-h Display help text with brief information about all commands and options. .TAG version .TAG v .TP .B \-\-version \-\-v Output version information and a citation for the VSEARCH publication. Show the status of the support for gzip- and bzip2-compressed input files. .RE .PP .\" ---------------------------------------------------------------------------- .TAG general-options General options: .RS .TAG bzip2_decompress .TP 9 .B \-\-bzip2_decompress When reading from a pipe streaming bzip2-compressed data, decompress the data. This option is not needed when reading from a standard bzip2-compressed file. .TAG fasta_width .TP .BI \-\-fasta_width\~ "positive integer" Fasta files produced by \fBvsearch\fR are wrapped (sequences are written on lines of \fIinteger\fR nucleotides, 80 by default). 
Set the value to zero to eliminate the wrapping. .TAG gzip_decompress .TP .B \-\-gzip_decompress When reading from a pipe streaming gzip-compressed data, decompress the data. This option is not needed when reading from a standard gzip-compressed file. .TAG label_suffix .TP .BI \-\-label_suffix\~ string When writing FASTA or FASTQ files, add the suffix \fIstring\fR to sequence headers. .TAG log .TP .BI \-\-log \0filename Write messages to the specified log file. Information written includes program version, amount of memory available, number of cores and command line options, and if need be, informational messages, warnings and fatal errors. The start and finish times are also recorded as well as the elapsed time and the maximum amount of memory consumed. The different \fBvsearch\fR commands can also write additional information to the log file. .TAG maxseqlength .TP .BI \-\-maxseqlength\~ "positive integer" All \fBvsearch\fR operations discard sequences longer than \fIinteger\fR (50,000 nucleotides by default). .TAG minseqlength .TP .BI \-\-minseqlength\~ "positive integer" All \fBvsearch\fR operations discard sequences shorter than \fIinteger\fR: 1 nucleotide by default for sorting or shuffling, 32 nucleotides for clustering and dereplication as well as the commands \-\-makeudb_usearch, \-\-sintax, and \-\-usearch_global. .TAG no_progress .TP .B \-\-no_progress Do not show the gradually increasing progress indicator. .TAG notrunclabels .TP .B \-\-notrunclabels Do not truncate sequence labels at first space or tab, but use the full header in output files. Turned off by default for all commands except the sintax command. .TAG quiet .TP .B \-\-quiet Suppress all messages to stdout and stderr except for warnings and fatal error messages. .TAG sample .TP .BI \-\-sample\~ string When writing FASTA or FASTQ files, add the given sample identifier \fIstring\fR to sequence headers.
For instance, if the given string is ABC, the text ";sample=ABC" will be added to the header. .TAG threads .TP .BI \-\-threads\~ "positive integer" Number of computation threads to use (1 to 1024). The number of threads should be less than or equal to the number of available CPU cores. The default is to use all available resources and to launch one thread per core. The following commands are multi-threaded: allpairs_global, cluster_fast, cluster_size, cluster_smallmem, cluster_unoise, fastq_mergepairs, fastx_mask, maskfasta, search_exact, sintax, uchime_ref, and usearch_global. Only one thread is used for the other commands. .RE .PP .\" ---------------------------------------------------------------------------- .TAG chimera-detection-options Chimera detection options: .PP .RS Chimera detection is based on a scoring function controlled by five options (\-\-dn, \-\-mindiffs, \-\-mindiv, \-\-minh, \-\-xn). Sequences are first sorted by decreasing abundance, if available, and compared on their \fIplus\fR strand only (case insensitive). .PP Input sequences are masked as specified with the \-\-qmask and \-\-hardmask options. Masking of the database for reference based chimera detection is specified with the \-\-dbmask option. .PP In \fIde novo\fR mode, the input fasta file must contain abundance annotations (i.e. a pattern [;]size=\fIinteger\fR[;] in the fasta header). Input order matters for chimera detection, so we recommend sorting sequences by decreasing abundance (default of \-\-derep_fulllength command). If your sequence set needs to be sorted, please see the \-\-sortbysize command in the sorting section. .PP .TAG abskew .TP 9 .BI \-\-abskew \0real When using \-\-uchime_denovo, the abundance skew is used to distinguish in a three-way alignment which sequence is the chimera and which are the parents. The assumption is that chimeras appear later in the PCR amplification process and are therefore less abundant than their parents.
For \-\-uchime3_denovo the default value is 16.0. For the other commands, the default value is 2.0, which means that the parents should be at least 2 times more abundant than their chimera. Any positive value equal to or greater than 1.0 can be used. .TAG alignwidth .TP .BI \-\-alignwidth\~ "positive integer" When using \-\-uchimealns, set the width of the three-way alignments (80 nucleotides by default). Set to zero to eliminate wrapping. .TAG borderline .TP .BI \-\-borderline \0filename Output borderline chimeric sequences to \fIfilename\fR, in fasta format. Borderline chimeric sequences are sequences that have a high enough score but which are not sufficiently different from their closest parent. .TAG chimeras .TP .BI \-\-chimeras \0filename Output chimeric sequences to \fIfilename\fR, in fasta format. Output order may vary when using multiple threads. .TAG db .TP .BI \-\-db \0filename When using \-\-uchime_ref, detect chimeras using the reference sequences contained in \fIfilename\fR. Reference sequences are assumed to be chimera-free. Chimeras cannot be detected if their parents, or sufficiently close relatives, are not present in the database. The file name must refer to a FASTA file or to a UDB file. If a UDB file is used, it should be created using the \-\-makeudb_usearch command with the \-\-dbmask dust option. .TAG dn .TP .BI \-\-dn\~ "strictly positive real number" Pseudo-count prior on the number of no votes, corresponding to the parameter \fIn\fR in the chimera scoring function (default value is 1.4). Increasing \-\-dn reduces the likelihood of tagging a sequence as a chimera (fewer false positives, but also more false negatives). .TAG fasta_score .TP .B \-\-fasta_score Add the chimera score to the headers in the fasta output files for chimeras, non-chimeras and borderline sequences, using the format ';uchime_denovo=\fIfloat\fR;'. .TAG mindiffs .TP .BI \-\-mindiffs\~ "positive integer" Minimum number of differences per segment (default value is 3).
The parameter is ignored with \-\-uchime2_denovo and \-\-uchime3_denovo. .TAG mindiv .TP .BI \-\-mindiv \0real Minimum divergence from closest parent (default value is 0.8). The parameter is ignored with \-\-uchime2_denovo and \-\-uchime3_denovo. .TAG minh .TP .BI \-\-minh \0real Minimum score (\fIh\fR). Increasing this value tends to reduce the number of false positives and to decrease sensitivity. Default value is 0.28, and values ranging from 0.0 to 1.0 included are accepted. The parameter is ignored with \-\-uchime2_denovo and \-\-uchime3_denovo. .TAG nonchimeras .TP .BI \-\-nonchimeras \0filename Output non-chimeric sequences to \fIfilename\fR, in fasta format. Output order may vary when using multiple threads. .TAG relabel .TP .BI \-\-relabel \0string Relabel sequences using the prefix \fIstring\fR and a ticker (1, 2, 3, etc.) to construct the new headers. Use \-\-sizeout to conserve the abundance annotations. .TAG relabel_keep .TP .B \-\-relabel_keep When relabelling, keep the old identifier in the header after a space. .TAG relabel_md5 .TP .B \-\-relabel_md5 Relabel sequences using the MD5 message digest algorithm applied to each sequence. Former sequence headers are discarded. The sequence is converted to upper case and each 'U' is replaced by a 'T' before computation of the digest. The MD5 digest is a cryptographic hash function designed to minimize the probability that two different inputs give the same output, even for very similar, but non-identical inputs. Still, there is a very small, but non-zero, probability that two different inputs give the same digest (i.e. a collision). MD5 generates a 128-bit (16-byte) digest that is represented by 16 hexadecimal numbers (using 32 symbols among 0123456789abcdef). Use \-\-sizeout to conserve the abundance annotations. .\" The probablity of collision for two sequences is 1/2^128 .TAG relabel_self .TP .B \-\-relabel_self Relabel sequences using each sequence itself as a label. 
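.IP
As an illustration of the digest-based relabelling options (\-\-relabel_md5 and its SHA1 counterpart \-\-relabel_sha1), the following Python sketch normalizes a sequence as described for \-\-relabel_md5 (upper case, each 'U' replaced by 'T') and computes a hexadecimal digest. The function name digest_label is hypothetical and not part of \fBvsearch\fR, and the sketch assumes that the digest is computed over the ASCII bytes of the normalized sequence, which this manual does not state explicitly.
.IP
.nf
```python
import hashlib


def digest_label(sequence, algorithm="md5"):
    """Return a hexadecimal digest usable as a sequence label.

    Hypothetical helper: the sequence is normalized as described for
    --relabel_md5 (upper case, U replaced by T) before hashing.
    """
    normalized = sequence.upper().replace("U", "T")
    # Assumption: the digest is taken over the ASCII bytes of the sequence.
    hasher = hashlib.new(algorithm, normalized.encode("ascii"))
    return hasher.hexdigest()


if __name__ == "__main__":
    # RNA and lower-case spellings of the same sequence share one label.
    print(digest_label("acgu") == digest_label("ACGT"))
    # Digest lengths in hexadecimal symbols for MD5 and SHA1.
    print(len(digest_label("ACGT")), len(digest_label("ACGT", "sha1")))
```
.fi
.IP
Because of the normalization step, two records that differ only in case or in their use of T versus U receive the same label in this sketch.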
.TAG relabel_sha1 .TP .B \-\-relabel_sha1 Relabel sequences using the SHA1 message digest algorithm applied to each sequence. It is similar to the \-\-relabel_md5 option but uses the SHA1 algorithm instead of the MD5 algorithm. SHA1 generates a 160-bit (20-byte) digest that is represented by 20 hexadecimal numbers (40 symbols). The probability of a collision (two non-identical sequences resulting in the same digest) is smaller for the SHA1 algorithm than it is for the MD5 algorithm. .\" The probablity of collision for two sequences is 1/2^160 .TAG self .TP .B \-\-self When using \-\-uchime_ref, ignore a reference sequence when its label matches the label of the query sequence (useful to estimate false-positive rate in reference sequences). .\" I am not sure the statement above is true. .TAG selfid .TP .B \-\-selfid When using \-\-uchime_ref, ignore a reference sequence when its nucleotide sequence is strictly identical to the nucleotidic sequence of the query. .TP .B \-\-sizein In \fIde novo\fR mode, abundance annotations (pattern '[>;]size=\fIinteger\fR[;]') present in sequence headers are taken into account by default (\-\-sizein is always implied). This option is ignored by \-\-uchime_ref. .TP .TAG sizeout .B \-\-sizeout When relabelling, add abundance annotations to fasta headers (using the format ';size=\fIinteger\fR;'). .TAG uchime_denovo .TP .BI \-\-uchime_denovo \0filename Detect chimeras present in the fasta-formatted \fIfilename\fR, without external references (i.e. \fIde novo\fR). Automatically sort the sequences in \fIfilename\fR by decreasing abundance beforehand (see the sorting section for details). Multithreading is not supported. .TAG uchime2_denovo .TP .BI \-\-uchime2_denovo \0filename Detect chimeras present in the fasta-formatted \fIfilename\fR, using the UCHIME2 algorithm. This algorithm is designed for denoised amplicons (see \-\-cluster_unoise). 
Automatically sort the sequences in \fIfilename\fR by decreasing abundance beforehand (see the sorting section for details). Multithreading is not supported. .TAG uchime3_denovo .TP .BI \-\-uchime3_denovo \0filename Detect chimeras present in the fasta-formatted \fIfilename\fR, using the UCHIME2 algorithm. The only difference from \-\-uchime2_denovo is that the default minimum abundance skew (\-\-abskew) is set to 16.0 rather than 2.0. .TAG uchime_ref .TP .BI \-\-uchime_ref \0filename Detect chimeras present in the fasta-formatted \fIfilename\fR by comparing them with reference sequences (option \-\-db). Multithreading is supported. .TAG uchimealns .TP .BI \-\-uchimealns \0filename Write the three-way global alignments (parentA, parentB, chimera) to \fIfilename\fR using a human-readable format. Use \-\-alignwidth to modify alignment length. Output order may vary when using multiple threads. All sequences are converted to upper case before alignment. Lower case letters indicate disagreement in the alignment. .TAG uchimeout .TP .BI \-\-uchimeout \0filename Write chimera detection results to \fIfilename\fR using an 18-field, tab\-separated uchime\-like format. Use \-\-uchimeout5 to use a format compatible with usearch v5 and earlier versions. Row output order may vary when using multiple threads. .RS .RS .nr step 1 1 .IP \n[step]. 4 score: higher score means a more likely chimeric alignment. .IP \n+[step]. Q: query sequence label. .IP \n+[step]. A: parent A sequence label. .IP \n+[step]. B: parent B sequence label. .IP \n+[step]. T: top parent sequence label (i.e. parent most similar to the query). That field is removed when using \-\-uchimeout5. .IP \n+[step]. idQM: percentage of similarity of query (Q) and model (M) constructed as a part of parent A and a part of parent B. .IP \n+[step]. idQA: percentage of similarity of query (Q) and parent A. .IP \n+[step]. idQB: percentage of similarity of query (Q) and parent B. .IP \n+[step].
idAB: percentage of similarity of parent A and parent B. .IP \n+[step]. idQT: percentage of similarity of query (Q) and top parent (T). .IP \n+[step]. LY: yes votes in the left part of the model. .IP \n+[step]. LN: no votes in the left part of the model. .IP \n+[step]. LA: abstain votes in the left part of the model. .IP \n+[step]. RY: yes votes in the right part of the model. .IP \n+[step]. RN: no votes in the right part of the model. .IP \n+[step]. RA: abstain votes in the right part of the model. .IP \n+[step]. div: divergence, defined as (idQM - idQT). .IP \n+[step]. YN: query is chimeric (Y), or not (N), or is a borderline case (?). .RE .RE .TAG uchimeout5 .TP .B \-\-uchimeout5 When using \-\-uchimeout, write chimera detection results using a 17\-field, tab\-separated uchime\-like format (drop the 5th field of \-\-uchimeout), compatible with usearch version 5 and earlier versions. .TP .TAG xn .BI \-\-xn\~ "strictly positive real number" Weight of no votes, corresponding to the parameter \fIbeta\fR in the scoring function (default value is 8.0). Increasing \-\-xn reduces the likelihood of tagging a sequence as a chimera (fewer false positives, but also more false negatives). .TP .TAG xsize .B \-\-xsize Strip abundance information from the headers when writing the output file. .RE .PP .\" ---------------------------------------------------------------------------- .TAG clustering-options Clustering options: .RS .PP \fBvsearch\fR implements a single-pass, greedy centroid-based clustering algorithm, similar to the algorithms implemented in usearch, DNAclust and sumaclust for example. Important parameters are the global clustering threshold (\-\-id) and the pairwise identity definition (\-\-iddef). .PP Input sequences are masked as specified with the \-\-qmask and \-\-hardmask options.
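.PP
The single-pass greedy strategy mentioned above can be illustrated with a short Python sketch. It is not the \fBvsearch\fR implementation: the identity function is a deliberately naive placeholder (position-by-position comparison without any alignment), the names identity and greedy_cluster are hypothetical, and assigning each sequence to the most similar qualifying centroid is an assumption of the sketch. Sequences are processed in input order, and a sequence that fits no existing cluster seeds a new one and becomes its centroid.
.PP
.nf
```python
def identity(a, b):
    """Placeholder identity: fraction of matching positions.

    Assumption: no alignment is performed; positions are compared
    directly and the longer length is used as the denominator.
    """
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)
    return matches / max(len(a), len(b))


def greedy_cluster(sequences, threshold):
    """Single-pass greedy clustering; returns a list of clusters.

    Each cluster is a list whose first element is its centroid.
    """
    clusters = []
    for seq in sequences:
        best, best_id = None, 0.0
        for cluster in clusters:
            score = identity(seq, cluster[0])  # compare with the centroid only
            if score >= threshold and score > best_id:
                best, best_id = cluster, score
        if best is None:
            clusters.append([seq])  # seq seeds a new cluster
        else:
            best.append(seq)
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTTTTTTT"]
    for number, members in enumerate(greedy_cluster(reads, 0.8)):
        print(number, members)
```
.fi
.PP
Since each sequence is only compared with centroids, the result depends on the input order, which is why the clustering commands differ mainly in how they sort their input.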
.TAG biomout .TP 9 .BI \-\-biomout \0filename Generate an OTU table in the biom version 1.0 JSON file format as specified at .URL http://biom-format.org/documentation/format_versions/biom-1.0.html "(link)" . The format describes how to store a sparse matrix containing the abundances of the OTUs in the different samples. This format is much more efficient than the classic and mothur OTU table formats available with the \-\-otutabout and \-\-mothur_shared_out options, respectively, and is recommended at least for large tables. The OTUs are represented by the cluster centroids. Taxonomy information will be included for the OTUs if available. Sample identifiers will be extracted from the headers of all sequences in the input file. If the header contains ';sample=abc123;' or ';barcodelabel=abc123;' or a similar string somewhere, then the given sample identifier (here 'abc123') will be used. The semicolon is not mandatory at the beginning or end of the header. The sample identifier may contain any printable character except semicolons. If no such sample label is found, the identifier in the initial part of the header will be used, but only letters, digits and underscores are allowed. OTU identifiers will be extracted from the headers of the cluster centroid sequences. If the header contains ';otu=def789;' or a similar string somewhere, then the given OTU identifier (here 'def789') will be used. The semicolon is not mandatory at the beginning or end of the header. The OTU identifier may contain any printable character except semicolons. If no such OTU label is found, the identifier in the initial part of the header will be used, and all characters except semicolons are allowed. Alternatively, OTU identifiers can be generated using the relabelling options (\-\-relabel, \-\-relabel_self, \-\-relabel_sha1, or \-\-relabel_md5). Taxonomy information, if present, will also be extracted from the headers of the centroid sequences. 
If the header contains ';tax=Homo_sapiens;' or a similar string somewhere, then the given taxonomy information (here 'Homo_sapiens') will be used. The semicolon is not mandatory at the beginning or end of the header. The taxonomy information may contain any printable character except semicolons. If an OTU table in the biom version 2.1 HDF5 file format is required, the biom utility may be used as described at .URL http://biom-format.org/documentation/biom_conversion.html "(link)" . .TAG centroids .TP .BI \-\-centroids \0filename Output cluster centroid sequences to \fIfilename\fR, in fasta format. The centroid is the sequence that seeded the cluster (i.e. the first sequence of the cluster). .TAG clusterout_id .TP .BI \-\-clusterout_id Add cluster identifier information to the output files when using the \-\-centroids, \-\-consout and \-\-profile options. .TAG clusterout_sort .TP .BI \-\-clusterout_sort Sort some output files by decreasing abundance instead of input order. It applies to the \-\-consout, \-\-msaout, \-\-profile, \-\-centroids, and \-\-uc options. For \-\-uc, the sorting applies only to the centroid information part (the C lines). .TAG cluster_fast .TP .BI \-\-cluster_fast \0filename Cluster the fasta sequences in \fIfilename\fR, after automatically sorting them by decreasing sequence length. .TAG cluster_size .TP .BI \-\-cluster_size \0filename Cluster the fasta sequences in \fIfilename\fR, after automatically sorting them by decreasing sequence abundance. .TAG cluster_smallmem .TP .BI \-\-cluster_smallmem \0filename Cluster the fasta sequences in \fIfilename\fR without automatically modifying their order beforehand. Sequences are expected to be sorted by decreasing sequence length, unless \-\-usersort is used. .TAG cluster_unoise .TP .BI \-\-cluster_unoise \0filename Perform denoising of the fasta sequences in \fIfilename\fR according to the UNOISE version 3 algorithm by Robert Edgar, but without the chimera removal step.
The options \-\-minsize (default 8) and \-\-unoise_alpha (default 2.0) may be specified. Chimera removal (\fIde novo\fR) should be performed afterwards with \-\-uchime3_denovo. .TAG clusters .TP .BI \-\-clusters \0string Output each cluster to a separate fasta file using the prefix \fIstring\fR and a ticker (0, 1, 2, etc.) to construct the path and filenames. .TAG consout .TP .BI \-\-consout \0filename Output cluster consensus sequences to \fIfilename\fR. For each cluster, a multiple alignment is computed, and a consensus sequence is constructed by taking the majority symbol (nucleotide or gap) from each column of the alignment. Columns containing a majority of gaps are skipped, except for terminal gaps. If the \-\-sizein option is specified, sequence abundances will be taken into account. .TAG cons_truncate .TP .B \-\-cons_truncate This command is ignored. A warning is issued. .\" .TP .\" .B \-\-cons_truncate .\" when using the \-\-consout option to build consensus sequences, .\" do not ignore terminal gaps. That option skips terminal columns .\" if they contain a majority of gaps, yielding shorter consensus .\" sequences than when using \-\-consout alone. .TAG id .TP .BI \-\-id \0real Do not add the target to the cluster if the pairwise identity with the centroid is lower than \fIreal\fR (value ranging from 0.0 to 1.0 included). The pairwise identity is defined as the number of (matching columns) / (alignment length - terminal gaps). That definition can be modified by \-\-iddef. .TAG iddef .TP .BI \-\-iddef\~ "0|1|2|3|4" Change the pairwise identity definition used in \-\-id. Values accepted are: .RS .RS .nr step 0 1 .IP \n[step]. 4 CD-HIT definition: (matching columns) / (shortest sequence length). .IP \n+[step]. edit distance: (matching columns) / (alignment length). .IP \n+[step]. edit distance excluding terminal gaps (same as \-\-id). .IP \n+[step]. 
Marine Biological Lab definition counting each gap opening (internal or terminal) as a single mismatch, whether or not the gap was extended: 1.0 - [(mismatches + gap openings)/(longest sequence length)] .IP \n+[step]. BLAST definition, equivalent to \-\-iddef 1 in a context of global pairwise alignment. .RE .RE .TAG minsize .TP .BI \-\-minsize\~ "positive integer" Specify the minimum abundance of sequences for denoising using \-\-cluster_unoise. The default is 8. .TAG msaout .TP .BI \-\-msaout \0filename Output a multiple sequence alignment and a consensus sequence for each cluster to \fIfilename\fR, in fasta format. Be warned that vsearch computes center star multiple sequence alignments using a fast method whose accuracy can decrease significantly when using low pairwise identity thresholds. The consensus sequence is constructed by taking the majority symbol (nucleotide or gap) from each column of the alignment. Columns containing a majority of gaps are skipped, except for terminal gaps. If the \-\-sizein option is specified, sequence abundances will be taken into account when computing the consensus. .TAG mothur_shared_out .TP .BI \-\-mothur_shared_out \0filename Output an OTU table in the mothur 'shared' tab-separated plain text format as described at .URL https://www.mothur.org/wiki/Shared_file (link) . The format describes how a matrix containing the abundances of the OTUs in the different samples is stored. The first line will start with the strings 'label', 'group' and 'numOtus' and is followed by a list of all OTU identifiers. The following lines, one for each sample, start with the string 'vsearch' followed by the sample identifier, the total number of OTUs, and a list of abundances for each OTU in that sample, in the order given on the first line. The OTU and sample identifiers are extracted from the FASTA headers of the sequences. The OTUs are represented by the cluster centroids. See the \-\-biomout option for further details.
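.IP
The layout of the 'shared' table described above can be sketched in Python as follows. The function shared_table and its arguments are hypothetical names used for illustration only, and the sketch assumes that abundances are already available as a mapping from (sample, OTU) pairs to counts; it only mirrors the line structure given in this section and does not reproduce the \fBvsearch\fR output code.
.IP
.nf
```python
TAB = chr(9)  # tab character, used as the field separator


def shared_table(counts, otus, samples):
    """Format abundances as the lines of a mothur-like shared table.

    counts maps (sample, otu) pairs to abundances; missing pairs
    are reported as 0.
    """
    # Header line: three fixed strings followed by all OTU identifiers.
    lines = [TAB.join(["label", "group", "numOtus"] + list(otus))]
    for sample in samples:
        # One line per sample: fixed label, sample id, number of OTUs.
        row = ["vsearch", sample, str(len(otus))]
        row += [str(counts.get((sample, otu), 0)) for otu in otus]
        lines.append(TAB.join(row))
    return lines


if __name__ == "__main__":
    example = {("s1", "otu1"): 5, ("s2", "otu2"): 3}
    for line in shared_table(example, ["otu1", "otu2"], ["s1", "s2"]):
        print(line)
```
.fi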
.TAG otutabout .TP .BI \-\-otutabout \0filename Output an OTU table in the classic tab-separated plain text format as a matrix containing the abundances of the OTUs in the different samples. The first line will start with the string '#OTU ID' and is followed by a tab-separated list of all sample identifiers. The following lines, one for each OTU, start with the OTU identifier and are followed by a tab-separated list of abundances for that OTU in each sample, in the order given on the first line. The OTU and sample identifiers are extracted from the FASTA headers of the sequences. The OTUs are represented by the cluster centroids. An extra column is added to the right of the table if taxonomy information is available for at least one of the OTUs. This column will be labelled 'taxonomy' and each row will then contain the taxonomy information extracted for that OTU. See the \-\-biomout option for further details. .TAG profile .TP .BI \-\-profile \0filename Output a sequence profile to a text file with the frequency of each nucleotide in each position in the multiple alignment for each cluster. There is a FASTA-like header line for each cluster, followed by the profile information in a tab-separated format. The eight columns are: position (0-based), consensus nucleotide, number of As, number of Cs, number of Gs, number of Ts or Us, number of gap symbols, and finally the total number of ambiguous nucleotide symbols (B, D, H, K, M, N, R, S, Y, V or W). All numbers are integers. If the \-\-sizein option is specified, sequence abundances will be taken into account. .TAG qmask .TP .BI \-\-qmask\~ "none|dust|soft" Mask regions in sequences using the \fIdust\fR or the \fIsoft\fR methods, or do not mask (\fInone\fR). Warning, when using \fIsoft\fR masking, clustering becomes case sensitive. The default is to mask using \fIdust\fR. .TAG qsegout .TP .BI \-\-qsegout \0filename Write the aligned part of each query sequence to \fIfilename\fR in FASTA format.
.TAG relabel .TP .BI \-\-relabel \0string Relabel sequence identifiers in the output files produced by \-\-consout, \-\-profile and \-\-centroids options. Please see the description of the same option under Chimera detection for details. .TAG relabel_keep .TP .B \-\-relabel_keep When relabelling, keep the old identifier in the header after a space. .TAG relabel_md5 .TP .B \-\-relabel_md5 Relabel sequence identifiers in the output files produced by \-\-consout, \-\-profile and \-\-centroids options. Please see the description of the same option under Chimera detection for details. .TAG relabel_self .TP .B \-\-relabel_self Relabel sequence identifiers in the output files produced by \-\-consout, \-\-profile and \-\-centroids options. Please see the description of the same option under Chimera detection for details. .TAG relabel_sha1 .TP .B \-\-relabel_sha1 Relabel sequence identifiers in the output files produced by \-\-consout, \-\-profile and \-\-centroids options. Please see the description of the same option under Chimera detection for details. .TAG sizein .TP .B \-\-sizein Take into account the abundance annotations present in the input fasta file (search for the pattern '[>;]size=\fIinteger\fR[;]' in sequence headers). .TAG sizeorder .TP .B \-\-sizeorder When an amplicon is close to 2 or more centroids, both within the distance specified with the \-\-id option, resolve the ambiguity by clustering it with the centroid having the highest abundance, not necessarily the closest one. The option only has effect when the value specified with \-\-maxaccepts is higher than one. The \-\-sizeorder option turns on what is sometimes referred to as abundance-based greedy clustering (AGC), in contrast to the default distance-based greedy clustering (DGC). .TAG sizeout .TP .B \-\-sizeout Add abundance annotations to the output fasta files (add the pattern ';size=\fIinteger\fR;' to sequence headers). 
If \-\-sizein is specified, abundance annotations are reported to output files, and each cluster centroid receives a new abundance value corresponding to the total abundance of the amplicons included in the cluster (\-\-centroids option). If \-\-sizein is not specified, input abundances are set to 1 for amplicons, and to the number of amplicons per cluster for centroids. .TAG strand .TP .BI \-\-strand\~ "plus|both" When comparing sequences with the cluster seed, check the \fIplus\fR strand only (default) or check \fIboth\fR strands. .TAG tsegout .TP .BI \-\-tsegout \0filename Write the aligned part of each target sequence to \fIfilename\fR in FASTA format. .TAG uc .TP .BI \-\-uc \0filename Output clustering results in \fIfilename\fR using a tab-separated uclust-like format with 10 columns and 3 different types of entries (S, H or C). Each fasta sequence in the input file can be either a cluster centroid (S) or a hit (H) assigned to a cluster. Cluster records (C) summarize information (size, centroid label) for each cluster. In the context of clustering, the option \-\-uc_allhits has no effect on the \-\-uc output. Column content varies with the type of entry (S, H or C): .RS .RS .nr step 1 1 .IP \n[step]. 4 Record type: S, H, or C. .IP \n+[step]. Cluster number (zero-based). .IP \n+[step]. Centroid length (S), query length (H), or cluster size (C). .IP \n+[step]. Percentage of similarity with the centroid sequence (H), or set to '*' (S, C). .IP \n+[step]. Match orientation + or - (H), or set to '*' (S, C). .IP \n+[step]. Not used, always set to '*' (S, C) or to zero (H). .IP \n+[step]. Not used, always set to '*' (S, C) or to zero (H). .IP \n+[step]. Set to '*' (S, C) or, for H, compact representation of the pairwise alignment using the CIGAR format (Compact Idiosyncratic Gapped Alignment Report): M (match/mismatch), D (deletion) and I (insertion). The equal sign '=' indicates that the query is identical to the centroid sequence. .IP \n+[step].
Label of the query sequence (H), or of the centroid sequence (S, C). .IP \n+[step]. Label of the centroid sequence (H), or set to '*' (S, C). .RE .RE .TAG unoise_alpha .TP .BI \-\-unoise_alpha\~ real Specify the alpha parameter to the \-\-cluster_unoise command. The default is 2.0. .TAG usersort .TP .B \-\-usersort When using \-\-cluster_smallmem, allow any sequence input order, not just a decreasing length ordering. .TAG xsize .TP .B \-\-xsize Strip abundance information from the headers when writing the output file. .TP .B ... Most searching options as well as score filtering, gap penalties and masking also apply to clustering (see the Searching section for definitions): \-\-alnout, \-\-blast6out, \-\-fastapairs, \-\-matched, \-\-notmatched, \-\-maxaccepts, \-\-maxrejects, \-\-samout, \-\-userout, \-\-userfields .RE .PP .\" ---------------------------------------------------------------------------- .TAG dereplication-and-rereplication-options Dereplication and rereplication options: .PP .RS VSEARCH can dereplicate sequences with the commands \-\-derep_fulllength, \-\-derep_id, \-\-derep_prefix and \-\-fastx_uniques. The \-\-derep_fulllength command is deprecated and is replaced by the new \-\-fastx_uniques command that can also handle FASTQ files in addition to FASTA files. The \-\-derep_fulllength and \-\-fastx_uniques commands require strictly identical sequences of the same length, but ignore upper/lower case and treat T and U as identical symbols. The \-\-derep_id command requires both identical sequences and identical headers/labels. The \-\-derep_prefix command will group sequences with a common prefix and does not require them to be equally long. The \-\-fastx_uniques command can write FASTQ output (specified with \-\-fastqout) or FASTA output (specified with \-\-fastaout) as well as a special tab-separated column text format (with \-\-tabbedout). The other commands can write FASTA output to the file specified with the \-\-output option.
All dereplication commands can write output to a special UCLUST-like file specified with the \-\-uc option. The \-\-rereplicate command can duplicate sequences in the input file according to the abundance of each input sequence. Other valid options are \-\-fastq_ascii, \-\-fastq_asciiout, \-\-fastq_qmax, \-\-fastq_qmaxout, \-\-fastq_qmin, \-\-fastq_qminout, \-\-fastq_qout_max, \-\-maxuniquesize, \-\-minuniquesize, \-\-relabel, \-\-relabel_keep, \-\-relabel_md5, \-\-relabel_self, \-\-relabel_sha1, \-\-sizein, \-\-sizeout, \-\-strand, \-\-topn, and \-\-xsize. .PP .TAG derep_fulllength .TP 9 .BI \-\-derep_fulllength \0filename Merge strictly identical sequences contained in \fIfilename\fR. Identical sequences are defined as having the same length and the same string of nucleotides (case insensitive, T and U are considered the same). See the options \-\-sizein and \-\-sizeout to take into account and compute abundance values. This command does not support multithreading. .TAG derep_id .TP .BI \-\-derep_id \0filename Merge strictly identical sequences contained in \fIfilename\fR, as with the \-\-derep_fulllength command, but the sequence labels (identifiers) on the header line need to be identical too. .TAG derep_prefix .TP .BI \-\-derep_prefix \0filename Merge sequences with identical prefixes contained in \fIfilename\fR. A short sequence identical to an initial segment (prefix) of another sequence is considered a replicate of the longer sequence. If a sequence is identical to the prefix of two or more longer sequences, it is clustered with the shortest of them. If they are equally long, it is clustered with the most abundant. Remaining ties are solved using sequence headers and sequence input order. Sequence comparisons are case insensitive, and T and U are considered identical. This command does not support multithreading. 
.TAG fastaout .TP .BI \-\-fastaout \0filename Write the dereplicated sequences to \fIfilename\fR, in fasta format and sorted by decreasing abundance. Identical sequences receive the header of the first sequence of their group. If \-\-sizeout is used, the number of occurrences (i.e. abundance) of each sequence is indicated at the end of their fasta header using the pattern ';size=\fIinteger\fR;'. This option is only valid for \-\-fastx_uniques. .TAG fastqout .TP .BI \-\-fastqout \0filename Write the dereplicated sequences to \fIfilename\fR, in fastq format and sorted by decreasing abundance. Identical sequences receive the header of the first sequence of their group. If \-\-sizeout is used, the number of occurrences (i.e. abundance) of each sequence is indicated at the end of their fastq header using the pattern ';size=\fIinteger\fR;'. This option is only valid for \-\-fastx_uniques. .TAG fastq_ascii .TP .BI \-\-fastq_ascii\~ "positive integer" Define the ASCII character number used as the basis for the FASTQ quality score. The default is 33, which is used by the Sanger / Illumina 1.8+ FASTQ format (phred+33). The value 64 is used by the Solexa, Illumina 1.3+ and Illumina 1.5+ formats (phred+64). Only 33 and 64 are valid arguments. .TAG fastq_asciiout .TP .BI \-\-fastq_asciiout\~ "positive integer" When using \-\-fastq_convert, \-\-sff_convert or \-\-fasta2fastq, define the ASCII character number used as the basis for the FASTQ quality score when writing FASTQ output files. The default is 33. Only 33 and 64 are valid arguments. .TAG fastq_qmax .TP .BI \-\-fastq_qmax\~ "positive integer" Specify the maximum quality score accepted when reading FASTQ files. The default is 41, which is usual for recent Sanger/Illumina 1.8+ files. .TAG fastq_qmaxout .TP .BI \-\-fastq_qmaxout\~ "positive integer" Specify the maximum quality score used when writing FASTQ files. The default is 41, which is usual for recent Sanger/Illumina 1.8+ files. 
Older formats may use a maximum quality score of 40. .TAG fastq_qmin .TP .BI \-\-fastq_qmin\~ "positive integer" Specify the minimum quality score accepted for FASTQ files. The default is 0, which is usual for recent Sanger/Illumina 1.8+ files. Older formats may use scores between -5 and 2. .TAG fastq_qminout .TP .BI \-\-fastq_qminout\~ "positive integer" Specify the minimum quality score used when writing FASTQ files. The default is 0, which is usual for Sanger/Illumina 1.8+ files. Older versions of the format may use scores between -5 and 2. .TAG fastq_qout_max .TP .BI \-\-fastq_qout_max For \-\-fastx_uniques, indicate that the new quality scores computed when dereplicating FASTQ files should be equal to the maximum (best) of the input quality scores for each position (corresponding to the lowest error probability). The default is to output a quality score corresponding to the average of the error probabilities for each position. .TAG fastx_uniques .TP .BI \-\-fastx_uniques \0filename Merge strictly identical sequences contained in FASTA or FASTQ file \fIfilename\fR. Identical sequences are defined as having the same length and the same string of nucleotides (case insensitive, T and U are considered the same). See the options \-\-sizein and \-\-sizeout to take into account and compute abundance values. This command does not support multithreading. By default, the quality scores in FASTQ output files will correspond to the average error probability of the nucleotides in each position. If the \-\-fastq_qout_max option is given, the quality score will be the highest (best) quality score observed in each position. .TAG maxuniquesize .TP .BI \-\-maxuniquesize\~ "positive integer" Discard sequences with a post-dereplication abundance value greater than \fIinteger\fR. .TAG minuniquesize .TP .BI \-\-minuniquesize\~ "positive integer" Discard sequences with a post-dereplication abundance value smaller than \fIinteger\fR.
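.PP
The two ways of combining quality scores described for \-\-fastx_uniques can be illustrated with a small calculation. The sketch below is not taken from the VSEARCH source code: the function names are hypothetical, the usual Phred relation (error probability = 10^(-Q/10)) is assumed, and the rounding applied when a score is written to a file is not modelled.
.PP
.nf
```python
import math

def phred_to_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Convert an error probability back to a (fractional) Phred score."""
    return -10 * math.log10(p)

def merged_quality(scores, use_max=False):
    """Quality at one position for a group of identical sequences.

    use_max=False averages the error probabilities (the documented default);
    use_max=True keeps the best input score, as with --fastq_qout_max.
    """
    if use_max:
        return max(scores)
    mean_p = sum(phred_to_prob(q) for q in scores) / len(scores)
    return prob_to_phred(mean_p)

# Two identical reads with Q20 and Q40 at the same position.
print(merged_quality([20, 40]))
print(merged_quality([20, 40], use_max=True))
```
.fi
.PP
Averaging is done on error probabilities rather than on the scores themselves, so the result is dominated by the lower-quality observation.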
.TAG output .TP .BI \-\-output \0filename Write the dereplicated sequences to \fIfilename\fR, in fasta format and sorted by decreasing abundance. Identical sequences receive the header of the first sequence of their group. If \-\-sizeout is used, the number of occurrences (i.e. abundance) of each sequence is indicated at the end of their fasta header using the pattern ';size=\fIinteger\fR;'. This option is not allowed for fastx_uniques. .TP .TAG relabel .BI \-\-relabel \0string Please see the description of the same option under Chimera detection for details. .TP .TAG relabel_keep .B \-\-relabel_keep When relabelling, keep the old identifier in the header after a space. .TP .TAG relabel_md5 .B \-\-relabel_md5 Please see the description of the same option under Chimera detection for details. .TP .TAG relabel_self .B \-\-relabel_self Please see the description of the same option under Chimera detection for details. .TP .TAG relabel_sha1 .B \-\-relabel_sha1 Please see the description of the same option under Chimera detection for details. .TP .TAG rereplicate .BI \-\-rereplicate \0filename Duplicate each sequence the number of times indicated by the abundance of each sequence in the specified file (option \-\-sizein is always implied). The sequence labels are identical for the same sequence, unless \-\-relabel, \-\-relabel_self, \-\-relabel_sha1 or \-\-relabel_md5 is used to create unique labels. Output is written to the file specified with the \-\-output option, in FASTA format. The output file does not contain abundance information unless \-\-sizeout is specified, in which case an abundance of 1 is used. .TAG sizein .TP .B \-\-sizein Take into account the abundance annotations present in the input fasta file (search for the pattern '[>;]size=\fIinteger\fR[;]' in sequence headers). That option is active by default when rereplicating. 
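.PP
As an illustration of the abundance annotation pattern quoted above, the following sketch extracts the size value from a FASTA header line. It is only a minimal example: the function name is hypothetical, the regular expression is a literal transcription of the documented pattern, and falling back to an abundance of 1 when no annotation is present is an assumption made for this example.
.PP
.nf
```python
import re

# Literal transcription of the documented pattern '[>;]size=integer[;]'.
SIZE_PATTERN = re.compile("[>;]size=([0-9]+);")

def abundance(header_line, default=1):
    """Return the abundance annotated in a FASTA header line.

    header_line includes the leading '>' character.
    """
    match = SIZE_PATTERN.search(header_line)
    return int(match.group(1)) if match else default

headers = [">s1;size=5;", ">s2;size=12;", ">s3"]
total = sum(abundance(h) for h in headers)
print(total)
```
.fi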
.TAG sizeout .TP .B \-\-sizeout Add abundance annotations to the output fasta file (add the pattern ';size=\fIinteger\fR;' to sequence headers). If \-\-sizein is specified, each unique sequence receives a new abundance value corresponding to its total abundance (sum of the abundances of its occurrences). If \-\-sizein is not specified, input abundances are set to 1, and each unique sequence receives a new abundance value corresponding to its number of occurrences in the input file. .TAG strand .TP .BI \-\-strand\~ "plus|both" When searching for strictly identical sequences, check the \fIplus\fR strand only (default) or check \fIboth\fR strands. .TAG tabbedout .TP .BI \-\-tabbedout \0filename Output clustering info to the specified tab-separated text file with 6 columns and a row for each input sequence. Column 1 contains the original label/header of the sequence. Column 2 contains the label of the output sequence which is equal to the label/header of the first sequence in each cluster, but potentially relabelled. Column 3 contains the cluster number, starting from 0. Column 4 contains the sequence number within each cluster, starting at 0. Column 5 contains the number of sequences in the cluster. Column 6 contains the original label/header of the first sequence in the cluster before any potential relabelling. This option is only valid for the \-\-fastx_uniques command. .TAG topn .TP .BI \-\-topn\~ "positive integer" Output only the top \fIinteger\fR sequences (i.e. the most abundant). .TAG uc .TP .BI \-\-uc \0filename Output full-length or prefix-dereplication results in \fIfilename\fR using a tab-separated uclust-like format with 10 columns and 3 different types of entries (S, H or C). Each fasta sequence in the input file can be either a cluster centroid (S) or a hit (H) assigned to a cluster. Cluster records (C) summarize information (size, centroid label) for each cluster.
In the context of dereplication, the option \-\-uc_allhits has no effect on the \-\-uc output. Column content varies with the type of entry (S, H or C): .RS .RS .nr step 1 1 .IP \n[step]. 4 Record type: S, H, or C. .IP \n+[step]. Cluster number (zero-based). .IP \n+[step]. Sequence length (S, H), or cluster size (C). .IP \n+[step]. Percentage of similarity with the centroid sequence (H), or set to '*' (S, C). .IP \n+[step]. Match orientation + or - (H), or set to '*' (S, C). .IP \n+[step]. Not used, always set to '*' (S, C) or 0 (H). .IP \n+[step]. Not used, always set to '*' (S, C) or 0 (H). .IP \n+[step]. Not used, always set to '*'. .IP \n+[step]. Label of the query sequence (H), or of the centroid sequence (S, C). .IP \n+[step]. Label of the centroid sequence (H), or set to '*' (S, C). .RE .RE .RE .PP .RS .TAG xsize .TP .B \-\-xsize Strip abundance information from the headers when writing the output file. .RE .PP .\" ---------------------------------------------------------------------------- .TAG extraction-options Extraction options: .RS .PP Sequences with headers matching certain criteria can be extracted from FASTA and FASTQ files using the \-\-fastx_getseq, \-\-fastx_getseqs and \-\-fastx_getsubseq commands. .PP The \-\-fastx_getseq command requires the header to match a label specified with the \-\-label option. If the \-\-label_substr_match option is given, the label may be a substring located anywhere in the header, otherwise the entire header must match the label. These matches are not case-sensitive. The headers in the input file are truncated at the first space or tab character unless the \-\-notrunclabels option is given. The matching sequences will be written to the files specified with the \-\-fastaout and \-\-fastqout options, in FASTA and FASTQ format, respectively. Sequences that do not match are written to the files specified with the \-\-notmatched and \-\-notmatchedfq options, respectively. 
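.PP
The matching rules of \-\-fastx_getseq described above can be summarised in a few lines of code. This is a sketch of the documented behaviour, not the actual implementation; the function and parameter names are hypothetical and simply mirror the option names.
.PP
.nf
```python
def truncate_header(header):
    """Cut the header at the first space or tab character."""
    for i, ch in enumerate(header):
        if ch == " " or ch == chr(9):  # chr(9) is the tab character
            return header[:i]
    return header

def label_matches(header, label, substr_match=False, notrunclabels=False):
    """Case-insensitive comparison of a label against a sequence header.

    substr_match mirrors --label_substr_match and
    notrunclabels mirrors --notrunclabels.
    """
    if not notrunclabels:
        header = truncate_header(header)
    h, l = header.lower(), label.lower()
    return (l in h) if substr_match else (h == l)

print(label_matches("Seq1 sample=A", "seq1"))
print(label_matches("Seq1 sample=A", "eq", substr_match=True))
```
.fi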
.PP The \-\-fastx_getsubseq command is similar to the \-\-fastx_getseq command, but will extract a subsequence of the matching sequences. The start position is specified with the \-\-subseq_start option and the end position is specified with the \-\-subseq_end option. The positions are 1-based, meaning that the first symbol of the sequence is at position 1. If the start or end position option is not specified, the default is to start at the first position and end at the last position in the sequence. .PP The \-\-fastx_getseqs command is similar to the \-\-fastx_getseq command but allows more flexibility in specifying the label(s) to be matched. A single label may be specified using the \-\-label option as described above. Alternatively, a file containing a list of labels to be matched may be specified with the \-\-labels option. The file must be a plain text file with one label on each line. The \-\-label_word and \-\-label_words options may be used to specify either a single word or a file containing a list of words, respectively, to be matched. Words are defined as character sequences delimited either by a character that is not alpha-numeric (A-Z, a-z, or 0-9) or by the beginning or end of the header. Word matching is case-sensitive. The \-\-label_field option will limit the matching of words to a certain field in the header. .PP .TAG fastaout .TP 9 .BI \-\-fastaout \0filename Write the extracted sequences in FASTA format to the file with the given name. .TAG fastqout .TP .BI \-\-fastqout \0filename Write the extracted sequences in FASTQ format to the file with the given name. This option is illegal if the input is in FASTA format. .TAG fastx_getseq .TP .BI \-\-fastx_getseq \0filename Extract sequences from the given FASTA or FASTQ file. Specify a label to match using the \-\-label option. Output files are specified with the \-\-fastaout, \-\-fastqout, \-\-notmatched and \-\-notmatchedfq options. 
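.PP
The 1-based coordinates used by \-\-fastx_getsubseq can be illustrated as follows. The sketch assumes that the end position is inclusive, which the text above does not state explicitly; the function name is hypothetical.
.PP
.nf
```python
def get_subseq(sequence, start=None, end=None):
    """Extract positions start..end (1-based, end assumed inclusive).

    A missing start defaults to the first position and a missing end
    defaults to the last position, as described for --subseq_start
    and --subseq_end.
    """
    first = 1 if start is None else start
    last = len(sequence) if end is None else end
    return sequence[first - 1:last]

print(get_subseq("ACGTACGT", start=2, end=4))
```
.fi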
.TAG fastx_getseqs .TP .BI \-\-fastx_getseqs \0filename Extract sequences from the given FASTA or FASTQ file. Specify the label or labels to match using one of the following options: \-\-label, \-\-labels, \-\-label_word, or \-\-label_words. Output files are specified with the \-\-fastaout, \-\-fastqout, \-\-notmatched and \-\-notmatchedfq options. .TAG fastx_getsubseq .TP .BI \-\-fastx_getsubseq \0filename Extract a certain part of some of the sequences in the given FASTA or FASTQ file. Specify labels to match using the \-\-label option. Specify the subsequence range to be extracted with the \-\-subseq_start and \-\-subseq_end options. Output files are specified with the \-\-fastaout, \-\-fastqout, \-\-notmatched and \-\-notmatchedfq options. .TAG label .TP .BI \-\-label \0string Specify the label to match in the sequence header. Unless the \-\-label_substr_match option is given, the label must match the entire header. The comparison is not case-sensitive. .TAG label_field .TP .BI \-\-label_field \0string Specify a field name to be used when matching using the \-\-label_word or \-\-label_words option. The field name is a string like "abc" that must precede the word to be matched with an equals sign (=) in between. The field must be delimited by semicolons or the beginning or end of the header. The following header will match the label 123 in the field abc: "seq1;abc=123". .TAG label_substr_match .TP .BI \-\-label_substr_match The labels specified with the \-\-label or the \-\-labels option may match anywhere in the header if this option is given. Otherwise a label needs to match the entire header. .TAG label_word .TP .BI \-\-label_word \0string Specify a word to match in the sequence header. Words are defined as strings delimited by either the start or end of the header or by any symbol that is not a letter (A-Z, a-z) or digit (0-9). The comparison is case-sensitive. 
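.PP
The definitions of words and fields given for \-\-label_word and \-\-label_field can be sketched in code. The functions below are hypothetical helpers written for illustration only; in particular, the field test assumes that the word makes up the entire value of the field, which is a simplification.
.PP
.nf
```python
import re

def header_words(header):
    """Split a header into words at every non-alphanumeric character."""
    return [w for w in re.split("[^A-Za-z0-9]+", header) if w]

def has_word(header, word):
    """Case-sensitive test for a whole word anywhere in the header."""
    return word in header_words(header)

def has_word_in_field(header, field, word):
    """Test for field=word in a semicolon-delimited part of the header.

    Assumes the word is the complete value of the field.
    """
    return (field + "=" + word) in header.split(";")

header = "seq1;abc=123"
print(header_words(header))
print(has_word_in_field(header, "abc", "123"))
```
.fi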
.TAG label_words .TP .BI \-\-label_words \0filename Specify a file containing words to be matched against the sequence headers. The plain text file must contain one word on each line. Words are defined as strings delimited by either the start or end of the header or by any symbol that is not a letter (A-Z, a-z) or digit (0-9). The comparison is case-sensitive. .TAG labels .TP .BI \-\-labels \0filename Specify a file containing labels to be matched against the sequence headers. The plain text file must contain one label on each line. Unless the \-\-label_substr_match option is given, a label must match the entire header. The comparison is not case-sensitive. .TAG notmatched .TP .BI \-\-notmatched \0filename Write the sequences that were not extracted to the file with the given name, in FASTA format. .TAG notmatchedfq .TP .BI \-\-notmatchedfq \0filename Write the sequences that were not extracted to the file with the given name, in FASTQ format. This option is illegal if the input is in FASTA format. .TAG subseq_end .TP .BI \-\-subseq_end\~ "positive integer" Specify the end position in the sequences when extracting subsequences using the \-\-fastx_getsubseq command. Positions are 1-based, so the sequences start at position 1. The default is to end at the end of the sequence if this option is not specified. .TAG subseq_start .TP .BI \-\-subseq_start\~ "positive integer" Specify the starting position in the sequences when extracting subsequences using the \-\-fastx_getsubseq command. Positions are 1-based, so the sequences start at position 1. The default is to start at the beginning of the sequence (position 1), if this option is not specified. .RE .PP .\" ---------------------------------------------------------------------------- .TAG fasta-fastq-file-processing-options FASTA/FASTQ/SFF file processing options: .RS .PP Analyse, trim, filter, convert, merge, join or reverse complement sequences in FASTA, FASTQ or SFF files.
The \-\-fastq_chars command can be used to analyse FASTQ files to identify the quality encoding and the range of quality score values used. To convert between different FASTQ file variants, use the \-\-fastq_convert command. Statistical analysis of the quality and length of the sequences in a FASTQ file may be performed with the \-\-fastq_stats, \-\-fastq_eestats, and \-\-fastq_eestats2 commands. Sequences may be trimmed, filtered and converted by the \-\-fastq_filter or \-\-fastx_filter commands. The \-\-sff_convert command can be used to convert SFF files to FASTQ, while the \-\-fasta2fastq command will convert a FASTA file to a FASTQ file with fake quality scores. Paired-end reads can be merged using the \-\-fastq_mergepairs command or joined with the \-\-fastq_join command. The \-\-fastx_revcomp command reverse-complements sequences. .PP .TAG eeout .TP 9 .B \-\-eeout When using \-\-fastq_filter, \-\-fastx_filter or \-\-fastq_mergepairs, include the number of expected errors (ee) in the sequence header of FASTQ and FASTA output files. This option is a synonym of the \-\-fastq_eeout option. Use the \-\-xee option to remove this information from headers. .TAG eetabbedout .TP .BI \-\-eetabbedout \0filename When specified with the \-\-fastq_mergepairs command, write statistics with expected errors of each merged read to the given file. The file is a tab separated file with four columns: The number of errors expected in the forward read, the number of expected errors in the reverse read, the number of observed errors in the forward read, and the number of observed errors in the reverse read. The observed number of errors is the number of differences in the overlap region of the merged sequence relative to each of the reads in the pair. .TAG fasta2fastq .TP .BI \-\-fasta2fastq \0filename Add a fake nucleotide quality score to the sequences in the given FASTA file and write them to the FASTQ file specified with the \-\-fastqout option.
The quality score may be adjusted using the \-\-fastq_qmaxout option (default 41). The \-\-fastq_asciiout option may be used to adjust the FASTQ output quality ASCII base character (default 33). .TAG fastaout .TP .BI \-\-fastaout \0filename When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter, write to the given FASTA-formatted file the sequences passing the filter, or the merged sequences. .TAG fastaout_rev .TP .BI \-\-fastaout_rev \0filename When using \-\-fastq_filter, or \-\-fastx_filter, write to the given FASTA-formatted file the reverse reads passing the filter. .TAG fastaout_notmerged_fwd .TP .BI \-\-fastaout_notmerged_fwd \0filename When using \-\-fastq_mergepairs, write forward reads not merged to the specified FASTA file. .TAG fastaout_notmerged_rev .TP .BI \-\-fastaout_notmerged_rev \0filename When using \-\-fastq_mergepairs, write reverse reads not merged to the specified FASTA file. .TAG fastaout_discarded .TP .BI \-\-fastaout_discarded \0filename Write sequences that do not pass the filter of the \-\-fastq_filter or \-\-fastx_filter command to the given FASTA-formatted file. .TAG fastaout_discarded_rev .TP .BI \-\-fastaout_discarded_rev \0filename Write reverse reads that do not pass the filter of the \-\-fastq_filter or \-\-fastx_filter command to the given FASTA-formatted file. .TAG fastq_allowmergestagger .TP .B \-\-fastq_allowmergestagger When using \-\-fastq_mergepairs, allow merging of staggered read pairs. Staggered pairs are pairs where the 3' end of the reverse read has an overhang to the left of the 5' end of the forward read. This situation can occur when a very short fragment is sequenced. The 3' overhang of the reverse read is not included in the merged sequence. The opposite option is the \-\-fastq_nostagger option. The default is to discard staggered pairs. .TAG fastq_ascii .TP .BI \-\-fastq_ascii\~ "positive integer" Define the ASCII character number used as the basis for the FASTQ quality score. 
The default is 33, which is used by the Sanger / Illumina 1.8+ FASTQ format (phred+33). The value 64 is used by the Solexa, Illumina 1.3+ and Illumina 1.5+ formats (phred+64). Only 33 and 64 are valid arguments. .TAG fastq_asciiout .TP .BI \-\-fastq_asciiout\~ "positive integer" When using \-\-fastq_convert, \-\-sff_convert or \-\-fasta2fastq, define the ASCII character number used as the basis for the FASTQ quality score when writing FASTQ output files. The default is 33. Only 33 and 64 are valid arguments. .TAG fastq_chars .TP .BI \-\-fastq_chars \0filename Summarize the composition of sequence and quality strings contained in the input FASTQ file. For each of the four DNA letters, \-\-fastq_chars gives the number of occurrences of the letter, its relative frequency and the length of the longest run of that letter. For each character present in the quality strings, \-\-fastq_chars gives the ASCII value of the character, its relative frequency, and the number of times a \fIk\fR-mer of that character appears at the end of quality strings. The length of the \fIk\fR-mer can be set using \-\-fastq_tail (4 by default). The command \-\-fastq_chars tries to automatically detect the quality encoding (Solexa, Illumina 1.3+, Illumina 1.5+ or Illumina 1.8+/Sanger) by analyzing the range of observed quality score values. In case of success, \-\-fastq_chars suggests values for the \-\-fastq_ascii (33 or 64), \-\-fastq_qmin and \-\-fastq_qmax options to be used with the other commands that require a FASTQ input file. .TAG fastq_convert .TP .BI \-\-fastq_convert \0filename Convert between the different variants of the FASTQ file format. The quality encoding of the input file must be specified with the \-\-fastq_ascii option (either 33 or 64, the default is 33), and the output quality encoding must be specified with the \-\-fastq_asciiout option (default 33). The minimum and maximum output quality scores may be limited using the \-\-fastq_qminout and \-\-fastq_qmaxout options. 
The output file is specified with the \-\-fastqout option. .TAG fastq_eeout .TP .B \-\-fastq_eeout When using \-\-fastq_filter, \-\-fastx_filter or \-\-fastq_mergepairs, include the number of expected errors (ee) in the sequence header of FASTQ and FASTA files. This option is a synonym of the \-\-eeout option. Use the \-\-xee option to remove this information from headers. .TAG fastq_eestats .TP .BI \-\-fastq_eestats \0filename Analyze a FASTQ file and report statistics on the distributions of quality scores, error probabilities and expected accumulated errors. The report, a table of 21 tab-separated columns, is written to the file specified with the \-\-output option. The first column corresponds to the position in the reads (Pos). The second and third columns correspond to the number of reads (Reads) and percentage of reads (PctRecs) that include this position. The remaining columns include information about the distribution of quality scores in this position (Q), error probabilities in this position (Pe), and finally the expected number of accumulated errors from the beginning of the reads and until the current position (EE). For each of the Q, Pe and EE distributions, the following statistics are included: minimum value (Min), lower quartile (Low), median (Med), mean (Mean), upper quartile (Hi), and maximum value (Max). The quality encoding and the range of quality values may be specified with \-\-fastq_ascii \-\-fastq_qmin and \-\-fastq_qmax. .TAG fastq_eestats2 .TP .BI \-\-fastq_eestats2 \0filename Analyze the specified FASTQ file and report statistics on the number of sequences that would be retained at a combination of selected cutoffs for length truncation and maximum expected errors, that could potentially be used as arguments to the \-\-fastq_trunclen and \-\-fastq_maxee options to the \-\-fastq_filter command. The result, a table of two or more columns, is written to the file specified with the \-\-output option. 
There is a line for each length truncation cutoff. The first column on each line contains the selected truncation length, while the following columns contain the number of sequences and, in parentheses, the percentage of sequences that would be retained at the selected EE levels. The truncation length cutoffs may be specified with the \-\-length_cutoffs option, which requires a list of three comma-separated integers indicating the shortest cutoff, the longest cutoff, and the increment between cutoffs. The longest cutoff may be specified with a star (*) which indicates that the limit is equal to the longest sequence in the input file. The default setting is "50,*,50" meaning that truncation lengths of 50, 100, 150 and so on up to the longest sequence length should be used. The maximum expected error (EE) cutoffs may be specified with the \-\-ee_cutoffs option which requires a comma-separated list of floating point numbers as its argument. The default setting is "0.5,1.0,2.0" that indicates that expected error levels of 0.5, 1.0 and 2.0 should be used. .TAG fastq_filter .TP .BI \-\-fastq_filter \0filename Trim and/or filter sequences in the given FASTQ file. Similar to the \-\-fastx_filter command, but works only on FASTQ files. See \-\-fastx_filter for details. .TAG fastq_join .TP .BI \-\-fastq_join\0 filename Join paired-end sequence reads into one sequence and add a gap between them using a padding sequence. The sequences are not merged as with the \-\-fastq_mergepairs command, but simply joined with a gap. The forward reads are specified as the argument to this option and the reverse reads are specified with the \-\-reverse option. The resulting sequences consist of the forward read, the padding sequence and the reverse complement of the reverse read. The padding sequence is specified with the \-\-join_padgap option and the padding quality is specified with the \-\-join_padgapq option.
The default padding sequence string is NNNNNNNN and the default padding quality string is IIIIIIII, corresponding to a base quality score of 40 (a very high quality score with error probability 0.0001). The joined sequences are output to the file(s) specified with the \-\-fastaout or \-\-fastqout options. .TAG fastq_maxdiffs .TP .BI \-\-fastq_maxdiffs\~ "positive integer" When using \-\-fastq_mergepairs, specify the maximum number of non-matching nucleotides allowed in the overlap region. That option has a strong influence on the merging success rate. The default value is 10. .TAG fastq_maxdiffpct .TP .BI \-\-fastq_maxdiffpct\~ real When using \-\-fastq_mergepairs, specify the maximum percentage of non-matching nucleotides allowed in the overlap region. The default value is 100.0%. There are other more sophisticated rules in the merging algorithm that will discard read pairs with a high fraction of mismatches. .TAG fastq_maxee .TP .BI \-\-fastq_maxee\~ real When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter, discard sequences with more than the specified number of expected errors. .TAG fastq_maxee_rate .TP .BI \-\-fastq_maxee_rate\~ real When using \-\-fastq_filter or \-\-fastx_filter, discard sequences with more than the specified number of expected errors per base. .TAG fastq_maxlen .TP .BI \-\-fastq_maxlen\~ "positive integer" When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter, discard sequences with more than the specified number of bases. .TAG fastq_maxmergelen .TP .BI \-\-fastq_maxmergelen\~ "positive integer" When using \-\-fastq_mergepairs, specify the maximum length of the merged sequence. By default there is no limit. .TAG fastq_maxns .TP .BI \-\-fastq_maxns\~ "positive integer" When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter, discard sequences with more than the specified number of N's. 
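.PP
The layout of a sequence produced by \-\-fastq_join, as described above, can be sketched as follows. This is an illustration only: the function names are hypothetical, and reversing the quality string of the reverse read so that it stays aligned with the reverse-complemented bases is an assumption, not something stated in this manual.
.PP
.nf
```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def join_pair(fwd_seq, fwd_qual, rev_seq, rev_qual,
              padgap="NNNNNNNN", padgapq="IIIIIIII"):
    """Join a read pair: forward read, padding, reverse-complemented reverse read.

    The default padding strings are the documented defaults of
    --join_padgap and --join_padgapq.
    """
    seq = fwd_seq + padgap + reverse_complement(rev_seq)
    qual = fwd_qual + padgapq + rev_qual[::-1]  # assumed to be reversed
    return seq, qual

seq, qual = join_pair("ACGT", "IIII", "AAAC", "ABCD")
print(seq)
print(qual)
```
.fi
.PP
With a phred+33 encoding, the padding quality character I has ASCII value 73 and therefore encodes a quality score of 40.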
.TAG fastq_mergepairs .TP .BI \-\-fastq_mergepairs\0 filename Merge paired-end sequence reads into one sequence. The forward reads are specified as the argument to this option and the reverse reads are specified with the \-\-reverse option. The merged sequences are output to the file(s) specified with the \-\-fastaout or \-\-fastqout options. The non-merged reads can be output to the files specified with the \-\-fastaout_notmerged_fwd, \-\-fastaout_notmerged_rev, \-\-fastqout_notmerged_fwd and \-\-fastqout_notmerged_rev options. Statistics may be output to the file specified with the \-\-eetabbedout option. Sequences are truncated as specified with the \-\-fastq_truncqual option to remove low-quality bases in the 3' end. Sequences shorter than specified with \-\-fastq_minlen (after truncation) are discarded (1 by default). Sequences with too many ambiguous bases (N's), as specified with the \-\-fastq_maxns option, are also discarded (no limit by default). Staggered reads are not merged unless the \-\-fastq_allowmergestagger option is specified. The minimum length of the overlap region between the reads may be specified with the \-\-fastq_minovlen option (at least 5, default 10). The overlap region may not include more mismatches than specified with the \-\-fastq_maxdiffs option (10 by default) or a higher percentage of mismatches than specified with the \-\-fastq_maxdiffpct option (100.0% by default), otherwise the read pair is discarded. Additional rules will avoid merging of reads that cannot be aligned reliably and unambiguously. The minimum and maximum length of the merged sequence may be specified with the \-\-fastq_minmergelen and \-\-fastq_maxmergelen options, respectively. The quality value limits for output files may be specified with the \-\-fastq_qminout and \-\-fastq_qmaxout options, but they apply only to the merged region. Other relevant options are: \-\-fastq_ascii, \-\-fastq_maxee, \-\-fastq_nostagger, \-\-fastq_qmax, \-\-fastq_qmin, and \-\-label_suffix.
.TAG fastq_minlen .TP .BI \-\-fastq_minlen\~ "positive integer" When using \-\-fastq_filter, \-\-fastq_mergepairs or \-\-fastx_filter, discard sequences with less than the specified number of bases (default 1). .TAG fastq_minmergelen .TP .BI \-\-fastq_minmergelen\~ "positive integer" When using \-\-fastq_mergepairs, specify the minimum length of the merged sequence. The default is 1. .TAG fastq_minovlen .TP .BI \-\-fastq_minovlen\~ "positive integer" When using \-\-fastq_mergepairs, specify the minimum overlap between the merged reads. The default is 10. Must be at least 5. .TAG fastq_nostagger .TP .B \-\-fastq_nostagger When using \-\-fastq_mergepairs, forbid the merging of staggered read pairs. This is the default behaviour of \-\-fastq_mergepairs. To change that behaviour, see the \-\-fastq_allowmergestagger option. .TAG fastq_qmax .TP .BI \-\-fastq_qmax\~ "positive integer" Specify the maximum quality score accepted when reading FASTQ files. The default is 41, which is usual for recent Sanger/Illumina 1.8+ files. .TAG fastq_qmaxout .TP .BI \-\-fastq_qmaxout\~ "positive integer" When using \-\-fastq_mergepairs, \-\-fastq_convert, \-\-sff_convert or \-\-fasta2fastq, specify the maximum quality score used when writing FASTQ files. For the \-\-fasta2fastq command, the value specified here is the fake quality score used for the FASTQ output file. The default is 41, which is usual for recent Sanger/Illumina 1.8+ files. Older formats may use a maximum quality score of 40. The limit only applies to the merged region when using \-\-fastq_mergepairs. .TAG fastq_qmin .TP .BI \-\-fastq_qmin\~ "positive integer" Specify the minimum quality score accepted for FASTQ files. The default is 0, which is usual for recent Sanger/Illumina 1.8+ files. Older formats may use scores between -5 and 2. 
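.PP
The roles of \-\-fastq_ascii, \-\-fastq_qmin and \-\-fastq_qmax when reading a quality string can be illustrated with the sketch below. The function name is hypothetical, and raising an exception for an out-of-range score is a choice made for this example; how VSEARCH itself reacts to such values is not modelled here.
.PP
.nf
```python
def decode_quality(quality_string, ascii_base=33, qmin=0, qmax=41):
    """Translate FASTQ quality characters into integer quality scores.

    ascii_base, qmin and qmax mirror --fastq_ascii, --fastq_qmin and
    --fastq_qmax, with their documented default values.
    """
    if ascii_base not in (33, 64):
        raise ValueError("only 33 and 64 are valid ASCII bases")
    scores = [ord(ch) - ascii_base for ch in quality_string]
    for q in scores:
        if q < qmin or q > qmax:
            raise ValueError("quality score out of range: " + str(q))
    return scores

print(decode_quality("!5I"))
print(decode_quality("h", ascii_base=64))
```
.fi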
.TAG fastq_qminout .TP .BI \-\-fastq_qminout\~ "positive integer" When using \-\-fastq_mergepairs, \-\-fastq_convert or \-\-sff_convert, specify the minimum quality score used when writing FASTQ files. The default is 0, which is usual for Sanger/Illumina 1.8+ files. Older versions of the format may use scores between -5 and 2. The limit applies only to the merged region when using \-\-fastq_mergepairs. .TAG fastq_stats .TP .BI \-\-fastq_stats \0filename Analyze a FASTQ file and report the number of reads it contains. The quality encoding and the range of quality values may be specified with \-\-fastq_ascii, \-\-fastq_qmin and \-\-fastq_qmax. This command requires the \-\-log option and outputs the following detailed statistics on read length, quality score, length vs. quality distributions, and length / quality filtering: .RS .TP Read length distribution: .RS .nr step 1 1 .IP \n[step]. 4 L: read length. .IP \n+[step]. N: number of reads. .IP \n+[step]. Pct: fraction of reads with this length. .IP \n+[step]. AccPct: fraction of reads with this length or longer. .RE .TP Quality score distribution: .RS .nr step 1 1 .IP \n[step]. 4 ASCII: character encoding the quality score. .IP \n+[step]. Q: Phred quality score. .IP \n+[step]. Pe: probability of error associated with the quality score. .IP \n+[step]. N: number of bases with this quality score. .IP \n+[step]. Pct: fraction of bases with this quality score. .IP \n+[step]. AccPct: fraction of bases with this quality score or higher. .RE .TP Length vs. quality distribution: .RS .nr step 1 1 .IP \n[step]. 4 L: position in reads (starting from position 2). .IP \n+[step]. PctRecs: fraction of reads with at least this length. .IP \n+[step]. AvgQ: average quality score over all reads up to this position. .IP \n+[step]. P(AvgQ): error probability corresponding to AvgQ. .IP \n+[step]. AvgP: average error probability. .IP \n+[step]. AvgEE: average expected error over all reads up to this position.
.IP \n+[step]. Rate: growth rate of AvgEE between this position and position - 1. .IP \n+[step]. RatePct: Rate (as explained above) expressed as a percentage. .RE .TP Effect of expected error and length filtering: .RS The first column indicates read lengths (\fIL\fR). The next four columns indicate the number of reads that would be retained by the \-\-fastq_filter command if the reads were truncated at length \fIL\fR (option \-\-fastq_trunclen \fIL\fR) and filtered to have a maximum expected error of 1.0, 0.5, 0.25 or 0.1 (with the option \-\-fastq_maxee \fIfloat\fR). The last four columns indicate the fraction of reads that would be retained by the \-\-fastq_filter command using the same length and maximum expected error parameters. .RE .TP Effect of minimum quality and length filtering: .RS The first column indicates read lengths (\fILen\fR). The next four columns indicate the fraction of reads that would be retained by the \-\-fastq_filter command if the reads were truncated at length \fILen\fR (option \-\-fastq_trunclen \fILen\fR) or at the first position with a quality \fIQ\fR below 5, 10, 15 or 20 (option \-\-fastq_truncqual \fIQ\fR). .RE .RE .TAG fastq_stripleft .TP .BI \-\-fastq_stripleft\~ "positive integer" When using \-\-fastq_filter or \-\-fastx_filter, strip the specified number of bases from the left end of the reads. .TAG fastq_stripright .TP .BI \-\-fastq_stripright\~ "positive integer" When using \-\-fastq_filter or \-\-fastx_filter, strip the specified number of bases from the right end of the reads. .TAG fastq_tail .TP .BI \-\-fastq_tail\~ "positive integer" When using \-\-fastq_chars, count the number of times a series of characters of length \fIk\fR appears at the end of quality strings. By default, \fIk\fR = 4. .TAG fastq_truncee .TP .BI \-\-fastq_truncee\~ real When using \-\-fastq_filter or \-\-fastx_filter, truncate sequences so that their total expected error is not higher than the specified value.
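.IP
The truncation rule of \-\-fastq_truncee can be illustrated with a short Python sketch. It is not taken from the vsearch source: the function name is hypothetical, and a quality offset of 33 and the usual Phred relation (error probability = 10^(-Q/10)) are assumed.
.nf
```python
def truncate_by_expected_error(seq, qual, max_ee, offset=33):
    """Return the longest prefix whose total expected error is <= max_ee.

    Hypothetical helper for illustration only; 'offset' is the assumed
    ASCII offset of the quality encoding.
    """
    total = 0.0
    keep = 0
    for ch in qual:
        q = ord(ch) - offset        # Phred score of this base
        total += 10 ** (-q / 10)    # add its error probability
        if total > max_ee:
            break                   # adding this base would exceed the limit
        keep += 1
    return seq[:keep], qual[:keep]


# Four bases at Q40 followed by four bases at Q20, limit of 0.025
print(truncate_by_expected_error("ACGTACGT", "IIII5555", 0.025))
```
.fi
.IP
In this example the four Q40 bases contribute 0.0004 in total and each Q20 base adds 0.01, so only the first six bases fit under a limit of 0.025.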
.TAG fastq_trunclen .TP .BI \-\-fastq_trunclen\~ "positive integer" When using \-\-fastq_filter or \-\-fastx_filter, truncate sequences to the specified length. Shorter sequences are discarded. .TAG fastq_trunclen_keep .TP .BI \-\-fastq_trunclen_keep\~ "positive integer" When using \-\-fastq_filter or \-\-fastx_filter, truncate sequences to the specified length. Shorter sequences are not discarded. .TAG fastq_truncqual .TP .BI \-\-fastq_truncqual\~ "positive integer" When using \-\-fastq_filter or \-\-fastx_filter, truncate sequences starting from the first base with the specified base quality score value or lower. .TAG fastqout .TP .BI \-\-fastqout \0filename When using \-\-fastq_filter, \-\-fastq_mergepairs, \-\-fastx_filter or \-\-fasta2fastq, write to the given FASTQ-formatted file the sequences passing the filter, or the merged or converted sequences. .TAG fastqout_rev .TP .BI \-\-fastqout_rev \0filename When using \-\-fastq_filter or \-\-fastx_filter, write to the given FASTQ-formatted file the reverse reads passing the filter. .TAG fastqout_discarded .TP .BI \-\-fastqout_discarded \0filename When using \-\-fastq_filter or \-\-fastx_filter, write sequences that do not pass the filter to the given FASTQ-formatted file. .TAG fastqout_discarded_rev .TP .BI \-\-fastqout_discarded_rev \0filename When using \-\-fastq_filter or \-\-fastx_filter, write reverse reads that do not pass the filter to the given FASTQ-formatted file. .TAG fastqout_notmerged_fwd .TP .BI \-\-fastqout_notmerged_fwd \0filename When using \-\-fastq_mergepairs, write forward reads not merged to the specified FASTQ file. .TAG fastqout_notmerged_rev .TP .BI \-\-fastqout_notmerged_rev \0filename When using \-\-fastq_mergepairs, write reverse reads not merged to the specified FASTQ file. 
.TAG fastx_filter .TP .BI \-\-fastx_filter \0filename Trim and/or filter the sequences in the given FASTA or FASTQ file and output the remaining sequences to the FASTQ file specified with the \-\-fastqout option and/or to the FASTA file specified with the \-\-fastaout option. Discarded sequences are written to the files specified with the \-\-fastaout_discarded and \-\-fastqout_discarded options. The input format (FASTA or FASTQ) is automatically detected. If the input consists of paired sequences, an input file with reverse reads may be specified with the \-\-reverse option, and corresponding output will be written to the files specified with the \-\-fastqout_rev, \-\-fastaout_rev, \-\-fastqout_discarded_rev, and \-\-fastaout_discarded_rev options. Output can not be written to FASTQ files if the input is in FASTA format. The sequences are first trimmed and then filtered based on the remaining bases. Sequences may be trimmed using the options \-\-fastq_stripleft, \-\-fastq_stripright, \-\-fastq_truncee, \-\-fastq_trunclen, \-\-fastq_trunclen_keep and \-\-fastq_truncqual. The sequences may be filtered using the options \-\-fastq_maxee, \-\-fastq_maxee_rate, \-\-fastq_maxlen, \-\-fastq_maxns, \-\-fastq_minlen (default 1), \-\-fastq_trunclen, \-\-maxsize, and \-\-minsize. Sequences not satisfying the requirements are discarded. For pairs of sequences, both sequences in a pair must satisfy the requirements, otherwise both are discarded. If no shortening or filtering options are given, all sequences are written to the output files, possibly after conversion from FASTQ to FASTA format. The \-\-relabel option may be used to relabel the output sequences. The \-\-eeout option may be used to output the expected number of errors in each sequence. After all sequences have been processed, the number of kept and discarded sequences will be shown, as well as how many of the kept sequences were trimmed. 
When the input is in FASTA format, the following options are not accepted because quality scores are not available: \-\-eeout, \-\-fastq_ascii, \-\-fastq_eeout, \-\-fastq_maxee, \-\-fastq_maxee_rate, \-\-fastq_out, \-\-fastq_qmax, \-\-fastq_qmin, \-\-fastq_truncee, \-\-fastq_truncqual, \-\-fastqout_discarded, \-\-fastqout_discarded_rev, \-\-fastqout_rev. .TAG fastx_revcomp .TP .BI \-\-fastx_revcomp \0filename Reverse-complement the sequences in the given FASTA or FASTQ file to a file specified with the \-\-fastaout and/or \-\-fastqout options. If the input file is in FASTA format, the output can not be written back to a FASTQ file due to missing base quality scores. .TAG join_padgap .TP .BI \-\-join_padgap\~ string When running \-\-fastq_join, use the \fIstring\fR as a sequence padding string. The default is NNNNNNNN (8 N's). .TAG join_padgapq .TP .BI \-\-join_padgapq\~ string When running \-\-fastq_join, use the \fIstring\fR as a quality padding string. The default is a string of I's equal in length to the sequence padding string. The letter I corresponds to a base quality score of 40 indicating a very high quality base with error probability of 0.0001. .TAG maxsize .TP .BI \-\-maxsize\~ "positive integer" When using \-\-fastq_filter or \-\-fastx_filter, discard sequences with an abundance higher than the specified value. .TAG minsize .TP .BI \-\-minsize\~ "positive integer" When using \-\-fastq_filter or \-\-fastx_filter, discard sequences with an abundance lower than the specified value. .TAG output .TP .BI \-\-output \0filename When using \-\-fastq_eestats or \-\-fastq_eestats2, write tabulated results to \fIfilename\fR. See \-\-fastq_eestats's and \-\-fastq_eestats2's documentation for a complete description of the table. .TAG relabel_keep .TP .B \-\-relabel_keep When using \-\-relabel, keep the old identifier in the header after a space. 
.TAG relabel .TP .BI \-\-relabel \0string Please see the description of the same option under Chimera detection for details. .TAG relabel_md5 .TP .BI \-\-relabel_md5 Please see the description of the same option under Chimera detection for details. .TAG relabel_self .TP .BI \-\-relabel_self Please see the description of the same option under Chimera detection for details. .TAG relabel_sha1 .TP .BI \-\-relabel_sha1 Please see the description of the same option under Chimera detection for details. .TAG reverse .TP .BI \-\-reverse \0filename When using \-\-fastq_filter, \-\-fastx_filter, \-\-fastq_mergepairs or \-\-fastq_join, specify the FASTQ file containing the reverse reads. .TAG sff_convert .TP .BI \-\-sff_convert \0filename Convert the given SFF file to FASTQ. The FASTQ output file is specified with the \-\-fastqout option. The sequence may be clipped as specified in the SFF file if the option \-\-sff_clip is specified, otherwise no clipping occurs. Bases that would have been clipped are converted to lower case, while the rest is in upper case. The output quality encoding may be specified with the \-\-fastq_asciiout option (default 33). The minimum and maximum output quality scores may be limited using the \-\-fastq_qminout and \-\-fastq_qmaxout options. .TAG sff_clip .TP .BI \-\-sff_clip Specifies that the sequences converted by the \-\-sff_convert command should be clipped in both ends as indicated in the SFF file. By default no clipping is performed. .TAG xsize .TP .B \-\-xsize Strip abundance information from the headers when writing the output file. .TAG xee .TP .B \-\-xee Strip information about expected errors (ee) from the output file headers. This information is added by the \-\-fastq_eeout and \-\-eeout options. .RE .PP .\" ---------------------------------------------------------------------------- .TAG masking-options Masking options: .RS .PP An input sequence can be composed of lower- or uppercase letters.
When soft masking is specified, lower case letters are treated as symbols that should be masked. Otherwise the case of the input sequences is ignored. .PP Masking is performed by the commands for chimera detection (uchime_denovo, uchime_ref), clustering (cluster_fast, cluster_smallmem, cluster_size), masking (maskfasta, fastx_mask), pairwise alignment (allpairs_global) and searching (search_exact, usearch_global). .PP Masking is usually specified with the \-\-qmask option, while the \-\-dbmask option is used for the database sequences specified with the \-\-db option with the \-\-usearch_global, \-\-search_exact and \-\-uchime_ref commands. .PP The argument to the \-\-qmask and \-\-dbmask options may be none, soft or dust. If the argument is none, then no masking is performed. If the argument is soft, the lower case symbols are masked. Finally, if the argument is dust, the sequence is masked using the DUST algorithm by Tatusov and Lipman to mask low-complexity regions. .PP If the \-\-hardmask option is specified, all masked regions are converted to N's, otherwise masked regions are indicated by lower case letters. .PP If any sequence is masked, the masked version of the sequence (with lower case letters or N's) is used in all output files. Otherwise the sequence is unmodified. The exception is the sequences in the output file specified with the \-\-uchimealns option, where the input sequences are converted to upper case first and lower case letters indicate disagreement between the aligned sequences. .PP The \-\-qmask option (or \-\-dbmask for database sequences) may be combined with the \-\-hardmask option. The results of using the none, dust or soft argument to \-\-qmask or \-\-dbmask are presented below, assuming each input sequence contains both lower and uppercase symbols.
.PP Results if the \-\-hardmask option is off (default): .RS .TP 9 .B none: no masking, all symbols used, no change .TP .B dust: masked symbols lowercased, rest uppercased .TP .B soft: lowercase symbols masked, no case changes .RE .PP Results if the \-\-hardmask option is on: .RS .TP 9 .B none: no masking, all symbols used, no change .TP .B dust: masked symbols changed to Ns, rest unchanged .TP .B soft: lowercase symbols masked and changed to Ns .RE .PP When a sequence region is masked, words in the region are not included in the indices used in the heuristic search algorithm. In all other aspects, the region is treated as other regions. .PP Regions in sequences that are hardmasked (with N's) have a zero alignment score and do not contribute to an alignment. .RE .PP .RS .TAG fastaout .TP 9 .BI \-\-fastaout \0filename Write the masked sequences to \fIfilename\fR, in fasta format. Applies only to the \-\-fastx_mask command. .TAG fastqout .TP .BI \-\-fastqout \0filename Write the masked sequences to \fIfilename\fR, in fastq format. Applies only to the \-\-fastx_mask command. .TAG fastx_mask .TP .BI \-\-fastx_mask \0filename Mask regions in sequences contained in the specified fasta or fastq file. The default is to mask using DUST (use \-\-qmask to modify that behavior). The output files are specified with the \-\-fastaout and \-\-fastqout options. The minimum and maximum percentage of unmasked residues may be specified with the \-\-min_unmasked_pct and \-\-max_unmasked_pct options, respectively. .TAG hardmask .TP .B \-\-hardmask Symbols in masked regions are replaced by N's. The default is to replace the masked regions by lower case letters. .TAG maskfasta .TP .BI \-\-maskfasta \0filename Mask regions in sequences contained in the fasta file \fIfilename\fR. The default is to mask using \fIdust\fR (use \-\-qmask to modify that behavior). The output file is specified with the \-\-output option. This command is deprecated, please use \-\-fastx_mask instead.
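.IP
The interaction between the none and soft arguments of \-\-qmask and the \-\-hardmask option, as tabulated above, can be sketched in Python. This is only an illustration with a hypothetical function name; the dust method is deliberately left out because the DUST algorithm itself is not described here.
.nf
```python
def mask_sequence(seq, qmask="soft", hardmask=False):
    """Illustrate the 'none' and 'soft' rows of the masking tables.

    Hypothetical helper, not vsearch code. With 'soft', lower-case
    symbols are the masked ones; hardmask turns them into N.
    """
    if qmask == "none":
        return seq                      # no masking, no change
    if qmask == "soft":
        if not hardmask:
            return seq                  # masked symbols stay lower case
        return "".join("N" if c.islower() else c for c in seq)
    raise ValueError("only 'none' and 'soft' are sketched here")


print(mask_sequence("ACgtACGT", "soft"))                  # case unchanged
print(mask_sequence("ACgtACGT", "soft", hardmask=True))   # lower case -> N
```
.fi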
.TAG max_unmasked_pct .TP .BI \-\-max_unmasked_pct \0real Discard sequences with more than the specified maximum percentage of unmasked residues. Works only with \-\-fastx_mask. .TAG min_unmasked_pct .TP .BI \-\-min_unmasked_pct \0real Discard sequences with less than the specified minimum percentage of unmasked residues. Works only with \-\-fastx_mask. .TAG output .TP .BI \-\-output \0filename Write the masked sequences to \fIfilename\fR, in fasta format. Applies only to the \-\-maskfasta command. .TAG qmask .TP .BI \-\-qmask\~ "none|dust|soft" If the argument is dust, mask regions in sequences using the \fIDUST\fR algorithm that detects simple repeats and low-complexity regions. This is the default. If the argument is soft, mask the lower case letters in the input sequence. If the argument is none, do not mask. .RE .PP .\" ---------------------------------------------------------------------------- .TAG orienting-options Orienting options: .RS .PP The \-\-orient command can be used to orient the sequences in a given file in either the forward or the reverse complementary direction based on a reference database specified with the \-\-db option. The two strands of each input sequence are compared to the reference database using nucleotide words. If one of the strands shares many more words with at least one sequence in the database than the other, that strand is chosen. The correctly oriented sequences may be written to a FASTA file specified with the \-\-fastaout option, and to a FASTQ file specified with the \-\-fastqout option (as long as the input was also in FASTQ format). If the result is uncertain, because the number of matching words is too similar, the original sequence is written to the file specified with the \-\-notmatched option. The results may also be written to a tab-delimited text file specified with the \-\-tabbedout option.
This file will contain the query label, the direction (+, - or ?), the number of matching words on the forward strand, and the number of matching words on the reverse complementary strand. By default, a word length of 12 is used for this command. The word length may be adjusted using the \-\-wordlength option. There have to be at least 4 times as many matches on one strand as on the other for a strand to be selected. In addition to the common options, the following options may also be specified for this command: \-\-dbmask, \-\-qmask, \-\-relabel, \-\-relabel_keep, \-\-relabel_md5, \-\-relabel_self, \-\-relabel_sha1, \-\-sizein, and \-\-sizeout. .PP .TAG db .TP 9 .BI \-\-db \0filename Read the reference database from the given file. It may be in FASTA, FASTQ or UDB format. If a UDB file is used it should have been created with a wordlength of 12. .TAG fastaout .TP .BI \-\-fastaout \0filename Write the correctly oriented sequences to \fIfilename\fR, in fasta format. .TAG fastqout .TP .BI \-\-fastqout \0filename Write the correctly oriented sequences to \fIfilename\fR, in fastq format. .TAG notmatched .TP .BI \-\-notmatched \0filename Write the sequences with undetermined direction to \fIfilename\fR, in the original format. .TAG orient .TP .BI \-\-orient \0filename Orient the sequences in the given file. .TAG tabbedout .TP .BI \-\-tabbedout \0filename Write the results to a tab-delimited text file with the specified filename. This file will contain the query label, the direction (+, - or ?), the number of matching words on the forward strand, and the number of matching words on the reverse complementary strand.
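.IP
The strand decision described for \-\-orient can be sketched in Python as follows. This is a simplified illustration, not the vsearch implementation: the function names are hypothetical, distinct words are counted against the whole database rather than per database sequence, and a short word length is used in the example only to keep it readable (the documented default is 12).
.nf
```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse-complement an upper-case nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def words(seq, k):
    """Set of all distinct words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient(query, db_words, k=12, factor=4):
    """Return (direction, forward matches, reverse matches).

    A strand is chosen only if it has at least 'factor' times as many
    matching words as the other strand; otherwise the result is '?'.
    """
    query = query.upper()
    fwd = len(words(query, k) & db_words)
    rev = len(words(revcomp(query), k) & db_words)
    if fwd > 0 and fwd >= factor * rev:
        return "+", fwd, rev
    if rev > 0 and rev >= factor * fwd:
        return "-", fwd, rev
    return "?", fwd, rev


# Hypothetical toy reference, indexed with words of length 4
reference = "AACCACACAACCAAAC"
db = words(reference, 4)
print(orient(reference, db, k=4))
print(orient(revcomp(reference), db, k=4))
print(orient("GGGGGGGG", db, k=4))
```
.fi
.IP
The three values returned mirror the direction and the two word counts written by \-\-tabbedout.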
.RE .PP .\" ---------------------------------------------------------------------------- .TAG pairwise-alignment-options Pairwise alignment options: .RS .PP The results of the n * (n-1) / 2 pairwise alignments are written to the result files specified with \-\-alnout, \-\-blast6out, \-\-fastapairs, \-\-matched, \-\-notmatched, \-\-qsegout, \-\-samout, \-\-tsegout, \-\-uc or \-\-userout (see Searching section below). Specify either the \-\-acceptall option to output all pairwise alignments, or specify an identity level with \-\-id to discard weak alignments. Most other accept/reject options (see Searching options below) may also be used. Sequences are aligned on their \fIplus\fR strand only. Masking is performed as usual and specified with \-\-qmask and \-\-hardmask. .TAG acceptall .TP 9 .B \-\-acceptall Write the results of all alignments to output files. This option overrides all other accept/reject options (including \-\-id). .TAG allpairs_global .TP .BI \-\-allpairs_global \0filename Perform optimal global pairwise alignments of the fasta sequences contained in \fIfilename\fR. Each sequence is compared to all sequences that come after it in the file, resulting in a total of n * (n-1) / 2 pairwise alignments, where n is the total number of sequences. This command is multi-threaded. .TAG id .TP .BI \-\-id \0real Reject the sequence match if the pairwise identity is lower than \fIreal\fR (value ranging from 0.0 to 1.0 included). .TAG threads .TP .BI \-\-threads\~ "positive integer" Number of computation threads to use (1 to 1024). The number of threads should be less than or equal to the number of available CPU cores. The default is to use all available resources and to launch one thread per logical core. .TAG uc .TP .BI \-\-uc \0filename Output pairwise alignment results in \fIfilename\fR using a tab-separated uclust-like format with 10 columns.
Each sequence is compared to all other sequences, and all hits (\-\-acceptall) or only some hits (\-\-id \fIfloat\fR) are reported, with one pairwise comparison per line: .RS .RS .nr step 1 1 .IP \n[step]. 4 Record type, always set to 'H'. .IP \n+[step]. Ordinal number of the target sequence (based on input order, starting from zero). .IP \n+[step]. Sequence length. .IP \n+[step]. Percentage of similarity with the target sequence. .IP \n+[step]. Match orientation, always set to '+'. .IP \n+[step]. Not used, always set to zero. .IP \n+[step]. Not used, always set to zero. .IP \n+[step]. Compact representation of the pairwise alignment using the CIGAR format (Compact Idiosyncratic Gapped Alignment Report): M (match/mismatch), D (deletion) and I (insertion). The equal sign '=' indicates that the query is identical to the centroid sequence. .IP \n+[step]. Label of the query sequence. .IP \n+[step]. Label of the target sequence. .RE .RE .RE .PP .\" ---------------------------------------------------------------------------- .TAG restriction-site-cutting-options Restriction site cutting options: .RS .PP The input sequences in the file specified with the \-\-cut command are cut into fragments at all restriction sites matching the pattern given with the \-\-cut_pattern option. The fragments on the forward strand are written to the file specified with the \-\-fastaout option and the fragments on the reverse strand are written to the file specified with the \-\-fastaout_rev option. Input sequences that do not match are written to the file specified with the option \-\-fastaout_discarded, and their reverse complements are also written to the file specified with the \-\-fastaout_discarded_rev option. The relabel options (\-\-relabel, \-\-relabel_self, \-\-relabel_keep, \-\-relabel_md5, and \-\-relabel_sha1) may be used to relabel the output sequences. .TAG cut .TP 9 .BI \-\-cut \0filename Specify the input file with sequences in FASTA format.
.TAG cut_pattern .TP .BI \-\-cut_pattern \0string Specify the restriction site cutting pattern and positions. The pattern is a string of lower- or uppercase letters specifying the nucleotides that must match, and may include ambiguous nucleotide symbols. The special characters "^" (circumflex) and "_" (underscore) are used to indicate the cutting position on the forward and reverse strand, respectively. For example, the pattern "G^AATT_C" is the pattern for the EcoRI restriction site. For such palindromic patterns (identical to their reverse complement) the command will output all possible fragments on both strands. For non-palindromic sites, it may be necessary to run the command also on the reverse complemented input sequences. Exactly one cutting site on each strand must be indicated. .TAG fastaout .TP .BI \-\-fastaout \0filename Specify the output file for the resulting fragments on the forward strand. .TAG fastaout_rev .TP .BI \-\-fastaout_rev \0filename Specify the output file for the resulting fragments on the reverse strand. .TAG fastaout_discarded .TP .BI \-\-fastaout_discarded \0filename Specify the output file for the non-matching sequences. .TAG fastaout_discarded_rev .TP .BI \-\-fastaout_discarded_rev \0filename Specify the output file for the non-matching sequences, reverse complemented. .RE .PP .\" ---------------------------------------------------------------------------- .TAG searching-options Searching options: .RS .TAG alnout .TP 9 .BI \-\-alnout \0filename Write pairwise global alignments to \fIfilename\fR using a human-readable format. Use \-\-rowlen to modify alignment length. Output order may vary when using multiple threads. .TAG biomout .TP .BI \-\-biomout \0filename Write search results to an OTU table in the biom version 1.0 file format. The query file contains the samples, while the database file contains the OTUs. Sample and OTU identifiers are extracted from the header of these sequences.
See the \-\-biomout option in the Clustering section for further details. .TAG blast6out .TP .BI \-\-blast6out \0filename Write search results to \fIfilename\fR using a blast-like tab-separated format of twelve fields (listed below), with one line per query-target matching (or lack of matching if \-\-output_no_hits is used). Warning, vsearch uses global pairwise alignments, not blast's seed-and-extend algorithm. Therefore, some common blast output values (alignment start and end, evalue, bit score) are reported differently. Output order may vary when using multiple threads. A similar output can be obtained with \-\-userout \fIfilename\fR and \-\-userfields query+target+id+alnlen+mism+opens+qlo+qhi+tlo+thi+evalue+bits. A complete list and description is available in the section 'Userfields' of this manual. .RS .RS .nr step 1 1 .IP \n[step]. 4 \fIquery\fR: query label. .IP \n+[step]. \fItarget\fR: target (database sequence) label. The field is set to '*' if there is no alignment. .IP \n+[step]. \fIid\fR: percentage of identity (real value ranging from 0.0 to 100.0). The percentage identity is defined as 100 * (matching columns) / (alignment length - terminal gaps). See fields id0 to id4 for other definitions. .IP \n+[step]. \fIalnlen\fR: length of the query-target alignment (number of columns). The field is set to 0 if there is no alignment. .IP \n+[step]. \fImism\fR: number of mismatches in the alignment (zero or positive integer value). .IP \n+[step]. \fIopens\fR: number of columns containing a gap opening (zero or positive integer value). .IP \n+[step]. \fIqlo\fR: first nucleotide of the query aligned with the target. Always equal to 1 if there is an alignment, 0 otherwise (see \fIqilo\fR to ignore initial gaps). .IP \n+[step]. \fIqhi\fR: last nucleotide of the query aligned with the target. Always equal to the length of the pairwise alignment, 0 otherwise (see \fIqihi\fR to ignore terminal gaps). .IP \n+[step].
\fItlo\fR: first nucleotide of the target aligned with the query. Always equal to 1 if there is an alignment, 0 otherwise (see \fItilo\fR to ignore initial gaps). .IP \n+[step]. \fIthi\fR: last nucleotide of the target aligned with the query. Always equal to the length of the pairwise alignment, 0 otherwise (see \fItihi\fR to ignore terminal gaps). .IP \n+[step]. \fIevalue\fR: expectancy-value (not computed for nucleotide alignments). Always set to -1. .IP \n+[step]. \fIbits\fR: bit score (not computed for nucleotide alignments). Always set to 0. .RE .RE .TAG db .TP .BI \-\-db \0filename Compare query sequences (specified with \-\-usearch_global) to the fasta-formatted target sequences contained in \fIfilename\fR, using global pairwise alignment. Alternatively, the name of a preformatted UDB database created using the makeudb_usearch command (see below) may be specified. .TAG dbmask .TP .BI \-\-dbmask\~ "none|dust|soft" Mask regions in the target database sequences using the dust method or the soft method, or do not mask (none). Warning, when using soft masking search commands become case sensitive. The default is to mask using dust. .TAG dbmatched .TP .BI \-\-dbmatched \0filename Write database target sequences matching at least one query sequence to \fIfilename\fR, in fasta format. If the option \-\-sizeout is used, the number of queries that matched each target sequence is indicated using the pattern ";size=\fIinteger\fR;". .TAG dbnotmatched .TP .BI \-\-dbnotmatched \0filename Write database target sequences not matching query sequences to \fIfilename\fR, in fasta format. .TAG fastapairs .TP .BI \-\-fastapairs \0filename Write pairwise alignments of query and target sequences to \fIfilename\fR, in fasta format. .TAG fulldp .TP .B \-\-fulldp Dummy option for compatibility with usearch. To maximize search sensitivity, \fBvsearch\fR uses an 8-way 16-bit SIMD vectorized full dynamic programming algorithm (Needleman-Wunsch), whether or not \-\-fulldp is specified.
.TAG gapext .TP .BI \-\-gapext \0string Set penalties for a gap extension. See \-\-gapopen for a complete description of the penalty declaration system. The default is to initialize the six gap extending penalties using a penalty of 2 for extending internal gaps and a penalty of 1 for extending terminal gaps, in both query and target sequences (i.e. 2I/1E). .TAG gapopen .TP .BI \-\-gapopen \0string Set penalties for a gap opening. A gap opening can occur in six different contexts: in the query (Q) or in the target (T) sequence, at the left (L) or right (R) extremity of the sequence, or inside the sequence (I). Sequence symbols (Q and T) can be combined with location symbols (L, I, and R), and numerical values to declare penalties for all possible contexts: aQL/bQI/cQR/dTL/eTI/fTR, where abcdef are zero or positive integers, and '/' is used as a separator. .br To simplify declarations, the location symbols (L, I, and R) can be combined, the symbol (E) can be used to treat both extremities (L and R) equally, and the symbols Q and T can be omitted to treat query and target sequences equally. For instance, the default is to declare a penalty of 20 for opening internal gaps and a penalty of 2 for opening terminal gaps (left or right), in both query and target sequences (i.e. 20I/2E). If only a numerical value is given, without any sequence or location symbol, then the penalty applies to all gap openings. To forbid gap-opening, an infinite penalty value can be declared with the symbol '*'. To use \fBvsearch\fR as a semi-global aligner, a null-penalty can be applied to the left (L) or right (R) gaps. .br \fBvsearch\fR always initializes the six gap opening penalties using the default parameters (20I/2E). The user is then free to declare only the values he/she wants to modify. The \fIstring\fR is scanned from left to right, accepted symbols are (0123456789/LIREQT*), and later values override previous values. 
.br Please note that \fBvsearch\fR, in contrast to usearch, only allows integer gap penalties. Because the lowest gap penalties are 0.5 by default in usearch, all default scores and gap penalties in \fBvsearch\fR have been doubled to maintain equivalent penalties and to produce identical alignments. .TAG hardmask .TP .B \-\-hardmask Mask sequence regions by replacing them with Ns instead of setting them to lower case as is the default. For more information, please see the Masking section. .TAG id .TP .BI \-\-id \0real Reject the sequence match if the pairwise identity is lower than \fIreal\fR (value ranging from 0.0 to 1.0 included). The search process sorts target sequences by decreasing number of \fIk\fR-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. That efficient pre-filtering also prevents pairwise alignments with weakly matching targets, as there needs to be at least 6 shared \fIk\fR-mers to start the pairwise alignment, and at least one out of every 16 \fIk\fR-mers from the query needs to match the target. Consequently, using values lower than \-\-id 0.5 is not likely to capture more weakly matching targets. The pairwise identity is by default defined as the number of (matching columns) / (alignment length - terminal gaps). That definition can be modified by \-\-iddef. .TAG iddef .TP .BI \-\-iddef\~ "0|1|2|3|4" Change the pairwise identity definition used in \-\-id. Values accepted are: .RS .RS .nr step 0 1 .IP \n[step]. 4 CD-HIT definition: (matching columns) / (shortest sequence length). .IP \n+[step]. edit distance: (matching columns) / (alignment length). .IP \n+[step]. edit distance excluding terminal gaps (default definition for \-\-id). .IP \n+[step]. Marine Biological Lab definition counting each gap opening (internal or terminal) as a single mismatch, whether or not the gap was extended: 1.0 - [(mismatches + gap openings)/(longest sequence length)] .IP \n+[step]. 
BLAST definition, equivalent to \-\-iddef 1 for global pairwise alignments. .RE .PP The option \-\-userfields accepts the fields id0 to id4, in addition to the field id, to report the pairwise identity values corresponding to the different definitions. .RE .TAG idprefix .TP .BI \-\-idprefix\~ "positive integer" Reject the sequence match if the first \fIinteger\fR nucleotides of the target do not match the query. .TAG idsuffix .TP .BI \-\-idsuffix\~ "positive integer" Reject the sequence match if the last \fIinteger\fR nucleotides of the target do not match the query. .TAG lca_cutoff .TP .BI \-\-lca_cutoff \0real Adjust the fraction of matching hits required for the last common ancestor (LCA) output with the \-\-lcaout option during searches. The default value is 1.0, which requires all hits to match at each taxonomic rank for that rank to be included. If a lower cutoff value is used, e.g. 0.95, a small fraction of non-matching hits is allowed while that rank is still reported. The argument to this option must be larger than 0.5, but not larger than 1.0. .TAG lcaout .TP .BI \-\-lcaout \0filename Output last common ancestor (LCA) information about the hits of each query to a text file in a tab-separated format. The first column contains the query id, while the second column contains the taxonomic information. The headers of the sequences in the database must contain taxonomic information in the same format as used with the \-\-sintax command, e.g. "tax=k:Archaea,p:Euryarchaeota,c:Halobacteria". Only the initial parts of the taxonomy that are common to a large fraction of the hits of each query will be output. It is necessary to set the \-\-maxaccepts option to a value different from 1 for this information to be useful. The \-\-top_hits_only option may also be useful. The fraction of matching hits required may be adjusted by the \-\-lca_cutoff option (default 1.0).
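The effect of \-\-lca_cutoff on the \-\-lcaout output can be pictured with the following Python sketch. It only illustrates the idea and is not the vsearch implementation. The function name lca, the list-of-lists input and the choice of "at least the cutoff" as the acceptance test are assumptions.

```python
from collections import Counter


def lca(lineages, cutoff=1.0):
    """Return the shared leading part of the taxonomy of a set of hits.

    lineages: one list of rank strings per hit,
              e.g. ["k:Archaea", "p:Euryarchaeota", "c:Halobacteria"].
    cutoff:   fraction of hits that must agree at a rank (0.5 < cutoff <= 1.0).
    """
    if not 0.5 < cutoff <= 1.0:
        raise ValueError("cutoff must be larger than 0.5 and at most 1.0")
    result = []
    depth = 0
    while lineages:
        # Count every taxonomy prefix that reaches the current depth.
        prefixes = Counter(
            tuple(lineage[: depth + 1])
            for lineage in lineages
            if len(lineage) > depth
        )
        if not prefixes:
            break
        best, count = prefixes.most_common(1)[0]
        # Stop at the first rank where too few hits agree.
        if count / len(lineages) < cutoff:
            break
        result = list(best)
        depth += 1
    return result
```

With 19 hits assigned to one class and 1 hit assigned to another class of the same phylum, the default cutoff of 1.0 stops at the phylum, while a cutoff of 0.95 also reports the majority class.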
.TAG leftjust .TP .B \-\-leftjust Reject the sequence match if the pairwise alignment begins with gaps. .TAG match .TP .BI \-\-match\~ "integer" Score assigned to a match (i.e. identical nucleotides) in the pairwise alignment. The default value is 2. .TAG matched .TP .BI \-\-matched \0filename Write query sequences matching database target sequences to \fIfilename\fR, in fasta format. .TAG maxaccepts .TP .BI \-\-maxaccepts\~ "positive integer" Maximum number of hits to accept before stopping the search. The default value is 1. This option works in tandem with \-\-maxrejects. The search process sorts target sequences by decreasing number of \fIk\fR-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptance criteria, it is accepted as the best hit and the search process stops for that query. If \-\-maxaccepts is set to a higher value, more hits are accepted. If \-\-maxaccepts and \-\-maxrejects are both set to 0, the complete database is searched. .TAG maxdiffs .TP .BI \-\-maxdiffs\~ "positive integer" Reject the sequence match if the alignment contains at least \fIinteger\fR substitutions, insertions or deletions. .TAG maxgaps .TP .BI \-\-maxgaps\~ "positive integer" Reject the sequence match if the alignment contains at least \fIinteger\fR insertions or deletions. .TAG maxhits .TP .BI \-\-maxhits\~ "non-negative integer" Maximum number of hits to show once the search is terminated (hits are sorted by decreasing identity). Unlimited by default or if the argument is zero. This option applies to \-\-alnout, \-\-blast6out, \-\-fastapairs, \-\-samout, \-\-uc, or \-\-userout output files. .TAG maxid .TP .BI \-\-maxid \0real Reject the sequence match if the percentage of identity between the two sequences is greater than \fIreal\fR.
.TAG maxqsize .TP .BI \-\-maxqsize\~ "positive integer" Reject query sequences with an abundance greater than \fIinteger\fR. .TAG maxqt .TP .BI \-\-maxqt \0real Reject if the query/target sequence length ratio is greater than \fIreal\fR. .TAG maxrejects .TP .BI \-\-maxrejects\~ "positive integer" Maximum number of non-matching target sequences to consider before stopping the search. The default value is 32. This option works in tandem with \-\-maxaccepts. The search process sorts target sequences by decreasing number of \fIk\fR-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if none of the first 32 examined target sequences pass the acceptance criteria, the search process stops for that query (no hit). If \-\-maxrejects is set to a higher value, more target sequences are considered. If \-\-maxaccepts and \-\-maxrejects are both set to 0, the complete database is searched. .TAG maxsizeratio .TP .BI \-\-maxsizeratio \0real Reject if the query/target abundance ratio is greater than \fIreal\fR. .TAG maxsl .TP .BI \-\-maxsl \0real Reject if the shorter/longer sequence length ratio is greater than \fIreal\fR. .TAG maxsubs .TP .BI \-\-maxsubs\~ "positive integer" Reject the sequence match if the pairwise alignment contains more than \fIinteger\fR substitutions. .TAG mid .TP .BI \-\-mid \0real Reject the sequence match if the percentage of identity is lower than \fIreal\fR (ignoring all gaps, internal and terminal). .TAG mincols .TP .BI \-\-mincols\~ "positive integer" Reject the sequence match if the alignment length is shorter than \fIinteger\fR. .TAG minqt .TP .BI \-\-minqt \0real Reject if the query/target sequence length ratio is lower than \fIreal\fR. .TAG minsizeratio .TP .BI \-\-minsizeratio \0real Reject if the query/target abundance ratio is lower than \fIreal\fR. .TAG minsl .TP .BI \-\-minsl \0real Reject if the shorter/longer sequence length ratio is lower than \fIreal\fR.
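The interplay between \-\-maxaccepts and \-\-maxrejects described above can be summarised by the following Python sketch of the stopping rule. It is a simplified model, not the vsearch search loop. The names search and is_accepted are hypothetical, the candidates are assumed to be already sorted by decreasing number of shared k-mers, and treating a single option set to 0 as "no limit" for that option is an assumption (the text only states the case where both are 0).

```python
def search(candidates, is_accepted, maxaccepts=1, maxrejects=32):
    """Return the accepted targets for one query.

    candidates:  targets sorted by decreasing number of shared k-mers.
    is_accepted: callable that aligns one target and applies the
                 acceptance criteria (for instance the --id threshold).
    """
    accepted = []
    rejected = 0
    for target in candidates:
        if is_accepted(target):
            accepted.append(target)
            # Enough hits: stop the search for this query.
            if maxaccepts and len(accepted) >= maxaccepts:
                break
        else:
            rejected += 1
            # Too many failures: give up on this query.
            if maxrejects and rejected >= maxrejects:
                break
    return accepted
```

With the default values, a query whose first acceptable target is ranked 40th by shared k-mers is reported as having no hit, because 32 rejections occur first. Raising \-\-maxrejects trades speed for sensitivity.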
.TAG mintsize .TP .BI \-\-mintsize\~ "positive integer" Reject target sequences with an abundance lower than \fIinteger\fR. .TAG minwordmatches .TP .BI \-\-minwordmatches\~ "non-negative integer" Minimum number of word matches required for a sequence to be considered further. Default value is 12 for the default word length 8. For word lengths 3-15, the default minimum word matches are 18, 17, 16, 15, 14, 12, 11, 10, 9, 8, 7, 5 and 3, respectively. If the query sequence has fewer unique words than the number specified, all words in the query must match. If the argument is 0, no word matches are required. .TAG mismatch .TP .BI \-\-mismatch\~ "integer" Score assigned to a mismatch (i.e. different nucleotides) in the pairwise alignment. The default value is -4. .TAG mothur_shared_out .TP .BI \-\-mothur_shared_out \0filename Write search results to an OTU table in the mothur 'shared' tab-separated plain text file format. The query file contains the samples, while the database file contains the OTUs. Sample and OTU identifiers are extracted from the header of these sequences. See the \-\-otutabout option in the Clustering section for further details. .TAG notmatched .TP .BI \-\-notmatched \0filename Write query sequences not matching database target sequences to \fIfilename\fR, in fasta format. .TAG otutabout .TP .BI \-\-otutabout \0filename Write search results to an OTU table in the classic tab-separated plain text format. The query file contains the samples, while the database file contains the OTUs. Sample and OTU identifiers are extracted from the header of these sequences. See the \-\-mothur_shared_out option in the Clustering section for further details. .TAG output_no_hits .TP .B \-\-output_no_hits Write both matching and non-matching queries to \-\-alnout, \-\-blast6out, \-\-samout or \-\-userout output files. Non-matching queries are labelled 'No hits' in \-\-alnout files. .TAG pattern .TP .B \-\-pattern \fIstring\fR This option is ignored. 
It is provided for compatibility with usearch. .TAG qmask .TP .BI \-\-qmask\~ "none|dust|soft" Mask regions in the query sequences using the dust or the soft algorithms, or do not mask (none). Warning: when using soft masking, search commands become case sensitive. The default is to mask using \fIdust\fR. .TAG qsegout .TP .BI \-\-qsegout \0filename Write the aligned part of each query sequence to \fIfilename\fR in FASTA format. .TAG query_cov .TP .BI \-\-query_cov \0real Reject if the fraction of the query aligned to the target sequence is lower than \fIreal\fR. The query coverage is computed as (matches + mismatches) / query sequence length. Internal or terminal gaps are not taken into account. .TAG rightjust .TP .B \-\-rightjust Reject the sequence match if the pairwise alignment ends with gaps. .TAG rowlen .TP .BI \-\-rowlen\~ "positive integer" Width of alignment lines in \-\-alnout output. The default value is 64. Set to 0 to eliminate wrapping. .TAG samheader .TP .B \-\-samheader Include header lines in the SAM file when \-\-samout is specified. The header includes lines starting with @HD, @SQ and @PG, but no @RG lines (see .URL https://github.com/samtools/hts-specs (link) ). By default no header line is written. .TAG samout .TP .BI \-\-samout \0filename Write alignment results to \fIfilename\fR using the SAM format (a tab-separated text file). When using the \-\-samheader option, the SAM file starts with header lines. Each non-header line is a SAM record, which represents either a query-target alignment or the absence of match for a query (output order may vary when using multiple threads). Each record contains 11 mandatory fields and optional fields (see .URL https://github.com/samtools/hts-specs (link) for a complete description of the format): .RS .RS .nr step 1 1 .IP \n[step]. 4 query sequence label. .IP \n+[step]. combination of bitwise flags. Possible values are: 0 (top hit), 4 (no hit), 16 (reverse-complemented hit), 256 (secondary hit, i.e.
all hits except the top hit). .IP \n+[step]. target sequence label. .IP \n+[step]. first position of a target aligned with the query (always 1 for global pairwise alignments, 0 if there is no match). .IP \n+[step]. mapping quality (ignored, always set to '*'). .IP \n+[step]. CIGAR string (set to '*' if there is no match). .IP \n+[step]. name of the target sequence matching with the next read of the query (for mate reads only, ignored and always set to '*'). .IP \n+[step]. position of the primary alignment of the next read of the query (for mate reads only, ignored and always set to 0). .IP \n+[step]. target sequence length (for multi-segment targets, ignored and always set to 0). .IP \n+[step]. query sequence (complete, not only the segment aligned to the target as usearch does). .IP \n+[step]. quality string (ignored, always set to '*'). .RE Optional fields for query-target matches (number and order of fields may vary): .RS .nr step 12 1 .IP \n[step]. 4 AS:i:? alignment score (i.e. percentage of identity). .IP \n+[step]. XN:i:? next best alignment score (always set to 0). .IP \n+[step]. XM:i:? number of mismatches. .IP \n+[step]. XO:i:? number of gap openings (excluding terminal gaps). .IP \n+[step]. XG:i:? number of gap extensions (excluding terminal gaps). .IP \n+[step]. NM:i:? edit distance to the target (sum of XM and XG). .IP \n+[step]. MD:Z:? string for mismatching positions. .IP \n+[step]. YT:Z:UU string representing the alignment type. .RE .RE .TAG search_exact .TP .BI \-\-search_exact \0filename Search for exact full-length matches to the query sequences contained in \fIfilename\fR in the database of target sequences (\-\-db). Only 100% exact matches are reported and this command is much faster than \-\-usearch_global. The \-\-id, \-\-maxaccepts and \-\-maxrejects options are ignored, but the rest of the searching options may be specified. .TAG self .TP .B \-\-self Reject the sequence match if the query and target labels are identical. 
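To make the layout of the \-\-samout records described above more concrete, here is a small Python sketch that splits one record into its mandatory fields and optional tags and decodes the flag bits 4, 16 and 256. The function name parse_sam_record and the sample record in the usage note are made up for illustration and are not actual vsearch output.

```python
def parse_sam_record(line):
    """Split one non-header SAM line into a small dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM record has 11 mandatory fields")
    flag = int(fields[1])
    record = {
        "query": fields[0],
        "no_hit": bool(flag & 4),
        "reverse_complemented": bool(flag & 16),
        "secondary": bool(flag & 256),
        "target": fields[2],
        "cigar": fields[5],
        "sequence": fields[9],
        "tags": {},
    }
    # Optional fields look like NAME:TYPE:VALUE, e.g. XM:i:1 or YT:Z:UU.
    for item in fields[11:]:
        name, kind, value = item.split(":", 2)
        record["tags"][name] = int(value) if kind == "i" else value
    return record
```

A hypothetical record such as "q1<TAB>272<TAB>t7<TAB>1<TAB>*<TAB>10M<TAB>*<TAB>0<TAB>0<TAB>ACGTACGTAC<TAB>*<TAB>AS:i:90" would be decoded as a secondary, reverse-complemented hit (272 = 256 + 16).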
.TAG selfid .TP .B \-\-selfid Reject the sequence match if the query and target sequences are strictly identical. .TAG sizeout .TP .B \-\-sizeout Add abundance annotations to the output of the option \-\-dbmatched (using the pattern ';size=\fIinteger\fR;'), to report the number of queries that matched each target. .TAG strand .TP .BI \-\-strand\~ "plus|both" When searching for similar sequences, check the \fIplus\fR strand only (default) or check \fIboth\fR strands. .TAG target_cov .TP .BI \-\-target_cov \0real Reject the sequence match if the fraction of the target sequence aligned to the query sequence is lower than \fIreal\fR. The target coverage is computed as (matches + mismatches) / target sequence length. Internal or terminal gaps are not taken into account. .TAG top_hits_only .TP .B \-\-top_hits_only Only the top hits with an equally high percentage of identity between the query and database sequence sets are written to the output specified with the options \-\-lcaout, \-\-alnout, \-\-samout, \-\-userout, \-\-blast6out, \-\-uc, \-\-fastapairs, \-\-matched or \-\-notmatched (but not \-\-dbmatched and \-\-dbnotmatched). For each query, the top hit is the one presenting the highest percentage of identity (see the \-\-iddef option to change the way identity is measured). For a given query, if several top hits present exactly the same percentage of identity, the number of hits reported is controlled by the \-\-maxaccepts value (1 by default). .TAG tsegout .TP .BI \-\-tsegout \0filename Write the aligned part of each target sequence to \fIfilename\fR in FASTA format. .TAG uc .TP .BI \-\-uc \0filename Output searching results in \fIfilename\fR using a tab-separated uclust-like format with 10 columns. When using the \-\-search_exact command, the table layout is the same as with the \-\-allpairs_global command. When using the \-\-usearch_global command, the table presents two different types of entries: hit (H) or no hit (N).
Each query sequence is compared to all other sequences, and the best hit (\-\-maxaccepts 1) or several hits (\-\-maxaccepts > 1) are reported (H). Output order may vary when using multiple threads. Column content varies with the type of entry (H or N): .RS .RS .nr step 1 1 .IP \n[step]. 4 Record type: H, or N ('hit' or 'no hit'). .IP \n+[step]. Ordinal number of the target sequence (based on input order, starting from zero). Set to '*' for N. .IP \n+[step]. Sequence length. Set to '*' for N. .IP \n+[step]. Percentage of similarity with the target sequence. Set to '*' for N. .IP \n+[step]. Match orientation, + or -. Set to '.' for N. .IP \n+[step]. Not used, always set to zero for H, or '*' for N. .IP \n+[step]. Not used, always set to zero for H, or '*' for N. .IP \n+[step]. Compact representation of the pairwise alignment using the CIGAR format (Compact Idiosyncratic Gapped Alignment Report): M (match/mismatch), D (deletion) and I (insertion). The equal sign '=' indicates that the query is identical to the centroid sequence. Set to '*' for N. .IP \n+[step]. Label of the query sequence. .IP \n+[step]. Label of the target centroid sequence. Set to '*' for N. .RE .RE .TAG uc_allhits .TP .B \-\-uc_allhits When using the \-\-uc option, show all hits, not just the top hit for each query. .TAG usearch_global .TP .BI \-\-usearch_global \0filename Compare target sequences (\-\-db) to the fasta-formatted query sequences contained in \fIfilename\fR, using global pairwise alignment. .TAG userfields .TP .BI \-\-userfields \0string When using \-\-userout, select and order the fields written to the output file. Fields are separated by '+' (e.g. query+target+id). See the 'Userfields' section for a complete list of fields. .TAG userout .TP .BI \-\-userout \0filename Write user-defined tab-separated output to \fIfilename\fR. Select the fields with the option \-\-userfields. Output order may vary when using multiple threads.
If \-\-userfields is empty or not present, \fIfilename\fR is empty. .TAG weak_id .TP .BI \-\-weak_id \0real Show hits with percentage of identity of at least \fIreal\fR, without terminating the search. A normal search stops as soon as enough hits are found (as defined by \-\-maxaccepts, \-\-maxrejects, and \-\-id). As \-\-weak_id reports weak hits that are not deduced from \-\-maxaccepts, high \-\-id values can be used, hence preserving both speed and sensitivity. Logically, \fIreal\fR must be smaller than the value indicated by \-\-id. .TAG wordlength .TP .BI \-\-wordlength\~ "positive integer" Length of words (i.e. \fIk\fR-mers) for database indexing. The range of possible values goes from 3 to 15, but values near 8 or 9 are generally recommended. Longer words may reduce the sensitivity/recall for weak similarities, but can increase precision. On the other hand, shorter words may increase sensitivity or recall, but may reduce precision. Computation time generally increases with shorter words and decreases with longer words, but it increases again for very long words. Memory requirements for a part of the index increase by a factor of 4 each time word length increases by one nucleotide, and this generally becomes significant for long words (12 or more). The default value is 8. .RE .PP .\" ---------------------------------------------------------------------------- .TAG shuffling-options Shuffling options: .RS Fasta entries in the input file are written in a pseudo-random order. .TAG output .TP 9 .BI \-\-output \0filename Write the shuffled sequences to \fIfilename\fR, in fasta format. .TAG randseed .TP .BI \-\-randseed\~ "positive integer" When shuffling sequence order, use \fIinteger\fR as seed. A given seed always produces the same output order (useful for replicability). Set to 0 to use a pseudo-random seed (default behavior). .TAG relabel .TP .BI \-\-relabel \0string Relabel sequences using the prefix \fIstring\fR and a ticker (1, 2, 3, etc.)
to construct the new headers. Use \-\-sizeout to conserve the abundance annotations. .TAG relabel_keep .TP .B \-\-relabel_keep When relabelling, keep the old identifier in the header after a space. .TAG relabel_md5 .TP .B \-\-relabel_md5 Relabel sequences using the MD5 message digest algorithm applied to each sequence. Former sequence headers are discarded. The sequence is converted to upper case and U is replaced by T before the digest is computed. The MD5 digest is a cryptographic hash function designed to minimize the probability that two different inputs give the same output, even for very similar, but non-identical inputs. Still, there is always a very small, but non-zero probability that two different inputs give the same result. The MD5 digest generates a 128-bit (16-byte) digest that is represented by 16 hexadecimal numbers (using 32 symbols among 0123456789abcdef). Use \-\-sizeout to conserve the abundance annotations. .TAG relabel_self .TP .B \-\-relabel_self Relabel sequences using the sequence itself as the label. .TAG relabel_sha1 .TP .B \-\-relabel_sha1 Relabel sequences using the SHA1 message digest algorithm applied to each sequence. It is similar to the \-\-relabel_md5 option but uses the SHA1 algorithm instead of the MD5 algorithm. The SHA1 digest generates a 160-bit (20-byte) result that is represented by 20 hexadecimal numbers (40 symbols). The probability of a collision (two non-identical sequences having the same digest) is smaller for the SHA1 algorithm than it is for the MD5 algorithm. Use \-\-sizeout to conserve the abundance annotations. .TAG sizeout .TP .B \-\-sizeout When using \-\-relabel, \-\-relabel_self, \-\-relabel_md5 or \-\-relabel_sha1, preserve and report abundance annotations to the output fasta file (using the pattern ';size=\fIinteger\fR;'). .TAG shuffle .TP .BI \-\-shuffle \0filename Pseudo-randomly shuffle the order of sequences contained in \fIfilename\fR.
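The digest-based labels described above for \-\-relabel_md5 and \-\-relabel_sha1 can be reproduced in spirit with Python's standard hashlib module. This is a minimal sketch: the function name digest_label is hypothetical, and applying the same upper-casing and U-to-T replacement before the SHA1 digest is an assumption based on the statement that the two options are similar.

```python
import hashlib


def digest_label(sequence, algorithm="md5"):
    """Return a hexadecimal digest usable as a sequence label.

    The sequence is converted to upper case and U is replaced by T
    before hashing, so DNA and RNA spellings of the same sequence
    receive the same label.
    """
    normalized = sequence.upper().replace("U", "T")
    return hashlib.new(algorithm, normalized.encode("ascii")).hexdigest()
```

The MD5 label is 32 hexadecimal symbols long and the SHA1 label 40, matching the 128-bit and 160-bit digest sizes given above.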
.TAG topn .TP .BI \-\-topn\~ "positive integer" Output only the first \fIinteger\fR sequences after pseudo-random reordering. .TAG xsize .TP .B \-\-xsize Strip abundance information from the headers when writing the output file. .RE .PP .\" ---------------------------------------------------------------------------- .TAG sorting-options Sorting options: .RS Fasta entries are sorted by decreasing abundance (\-\-sortbysize) or sequence length (\-\-sortbylength). To obtain a stable sorting order, ties are sorted by decreasing abundance and label increasing alpha-numerical order (\-\-sortbylength), or just by label increasing alpha-numerical order (\-\-sortbysize). Label sorting assumes that all sequences have unique labels. The same applies to the automatic sorting performed during chimera checking (\-\-uchime_denovo), dereplication (\-\-derep_fulllength), and clustering (\-\-cluster_fast and \-\-cluster_size). .PP .TAG maxsize .TP 9 .BI \-\-maxsize\~ "positive integer" When using \-\-sortbysize, discard sequences with an abundance value greater than \fIinteger\fR. .TAG minsize .TP .BI \-\-minsize\~ "positive integer" When using \-\-sortbysize, discard sequences with an abundance value smaller than \fIinteger\fR. .TAG output .TP .BI \-\-output \0filename Write the sorted sequences to \fIfilename\fR, in fasta format. .TAG relabel .TP .BI \-\-relabel \0string Please see the description of the same option under Chimera detection for details. .TAG relabel_keep .TP .B \-\-relabel_keep When relabelling, keep the old identifier in the header after a space. .TAG relabel_md5 .TP .BI \-\-relabel_md5 Please see the description of the same option under Chimera detection for details. .TAG relabel_self .TP .BI \-\-relabel_self Please see the description of the same option under Chimera detection for details. .TAG relabel_sha1 .TP .BI \-\-relabel_sha1 Please see the description of the same option under Chimera detection for details. 
.TAG sizeout .TP .B \-\-sizeout When using \-\-relabel, report abundance annotations to the output fasta file (using the pattern ';size=\fIinteger\fR;'). .TAG sortbylength .TP .BI \-\-sortbylength \0filename Sort by decreasing length the sequences contained in \fIfilename\fR. See the general options \-\-minseqlength and \-\-maxseqlength to eliminate short and long sequences. .TAG sortbysize .TP .BI \-\-sortbysize \0filename Sort by decreasing abundance the sequences contained in \fIfilename\fR (missing abundance values are assumed to be ';size=1'). See the options \-\-minsize and \-\-maxsize to eliminate rare and dominant sequences. .TAG topn .TP .BI \-\-topn\~ "positive integer" Output only the top \fIinteger\fR sequences (i.e. the longest or the most abundant). .TAG xsize .TP .B \-\-xsize Strip abundance information from the headers when writing the output file. .RE .PP .\" ---------------------------------------------------------------------------- .TAG subsampling-options Subsampling options: .RS Subsampling randomly extracts a certain number or a certain percentage of the sequences in the input file. If the \-\-sizein option is in effect, the abundances of the input sequences are taken into account and the sampling is performed as if the input sequences were rereplicated, subsampled and dereplicated before being written to the output file. The extraction is performed as a random sampling with a uniform distribution among the input sequences and is performed without replacement. The input file is specified with the \-\-fastx_subsample option, the output files are specified with the \-\-fastaout and \-\-fastqout options and the number of sequences to be sampled is specified with the \-\-sample_pct or \-\-sample_size options. The sequences not sampled may be written to files specified with the options \-\-fastaout_discarded and \-\-fastqout_discarded. The \-\-fastq_ascii, \-\-fastq_qmin and \-\-fastq_qmax options are also available.
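The "rereplicate, subsample, dereplicate" behaviour described above for abundance-aware subsampling can be sketched in a few lines of Python. This is a conceptual model only, not how vsearch implements it. The function name subsample and the dictionary input are assumptions, and the seed handling does not mimic the special meaning of 0 in \-\-randseed.

```python
import random
from collections import Counter


def subsample(abundances, sample_size, seed=None):
    """Draw sample_size reads without replacement.

    abundances: {label: abundance}, as read from ';size=integer;'.
    Returns {label: sampled abundance} for the labels drawn at least once.
    """
    # Rereplicate: one pool entry per individual read.
    pool = [label for label, size in abundances.items() for _ in range(size)]
    # Subsample uniformly, without replacement.
    picked = random.Random(seed).sample(pool, sample_size)
    # Dereplicate: count how often each label was drawn.
    return dict(Counter(picked))
```

Sampling as many reads as the input contains returns the input abundances unchanged, and no sequence can end up with a higher abundance than it started with.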
.PP .TAG fastaout .TP 9 .BI \-\-fastaout \0filename Write the sampled sequences to \fIfilename\fR, in fasta format. .TAG fastaout_discarded .TP .BI \-\-fastaout_discarded \0filename Write the sequences not sampled to \fIfilename\fR, in fasta format. .TAG fastq_ascii .TP .BI \-\-fastq_ascii\~ "positive integer" Define the ASCII character number used as the basis for the FASTQ quality score. The default is 33, which is used by the Sanger / Illumina 1.8+ FASTQ format (phred+33). The value 64 is used by the Solexa, Illumina 1.3+ and Illumina 1.5+ formats (phred+64). Only 33 and 64 are valid arguments. .TAG fastq_qmax .TP .BI \-\-fastq_qmax\~ "positive integer" Specify the maximum quality score accepted when reading FASTQ files. The default is 41, which is usual for recent Sanger/Illumina 1.8+ files. .TAG fastq_qmin .TP .BI \-\-fastq_qmin\~ "positive integer" Specify the minimum quality score accepted for FASTQ files. The default is 0, which is usual for recent Sanger/Illumina 1.8+ files. Older formats may use scores between -5 and 2. .TAG fastqout .TP .BI \-\-fastqout \0filename Write the sampled sequences to \fIfilename\fR, in fastq format. Requires input in fastq format. .TAG fastqout_discarded .TP .BI \-\-fastqout_discarded \0filename Write the sequences not sampled to \fIfilename\fR, in fastq format. Requires input in fastq format. .TAG fastx_subsample .TP .BI \-\-fastx_subsample \0filename Perform subsampling from the sequences in the specified input file that is in FASTA or FASTQ format. .TAG randseed .TP .BI \-\-randseed\~ "positive integer" Use \fIinteger\fR as a seed for the pseudo-random generator. A given seed always produces the same output, which is useful for replicability. Set to 0 to use a pseudo-random seed (default behavior). .TAG relabel .TP .BI \-\-relabel \0string Relabel sequences using the prefix \fIstring\fR and a ticker (1, 2, 3, etc.) to construct the new headers. Use \-\-sizeout to conserve the abundance annotations. 
.TAG relabel_keep .TP .B \-\-relabel_keep When relabelling, keep the old identifier in the header after a space. .TAG relabel_md5 .TP .B \-\-relabel_md5 Relabel sequences using the MD5 message digest algorithm applied to each sequence. Former sequence headers are discarded. The sequence is converted to upper case and U is replaced by T before the digest is computed. The MD5 digest is a cryptographic hash function designed to minimize the probability that two different inputs give the same output, even for very similar, but non-identical inputs. Still, there is always a very small, but non-zero probability that two different inputs give the same result. The MD5 digest generates a 128-bit (16-byte) digest that is represented by 16 hexadecimal numbers (using 32 symbols among 0123456789abcdef). Use \-\-sizeout to conserve the abundance annotations. .TAG relabel_self .TP .B \-\-relabel_self Relabel sequences using the sequence itself as the label. .TAG relabel_sha1 .TP .B \-\-relabel_sha1 Relabel sequences using the SHA1 message digest algorithm applied to each sequence. It is similar to the \-\-relabel_md5 option but uses the SHA1 algorithm instead of the MD5 algorithm. The SHA1 digest generates a 160-bit (20-byte) result that is represented by 20 hexadecimal numbers (40 symbols). The probability of a collision (two non-identical sequences having the same digest) is smaller for the SHA1 algorithm than it is for the MD5 algorithm. Use \-\-sizeout to conserve the abundance annotations. .TAG sample_pct .TP .BI \-\-sample_pct\~ "real" Subsample the given percentage of the input sequences. Accepted values range from 0.0 to 100.0. .TAG sample_size .TP .BI \-\-sample_size\~ "positive integer" Extract the given number of sequences. .TAG sizein .TP .B \-\-sizein Take the abundance information of the input file into account, otherwise the abundance of each sequence is considered to be 1. .TAG sizeout .TP .B \-\-sizeout Write abundance information to the output file. 
.TAG xsize .TP .B \-\-xsize Strip abundance information from the headers when writing the output file. .RE .PP .\" ---------------------------------------------------------------------------- .TAG taxonomic-classification-options Taxonomic classification options: .RS The vsearch command \-\-sintax will classify the input sequences according to the Sintax algorithm as described by Robert Edgar (2016) in SINTAX: a simple non-Bayesian taxonomy classifier for 16S and ITS sequences, BioRxiv, 074161. Preprint. doi: 10.1101/074161 .URL https://doi.org/10.1101/074161 (link) .PP The name of the fasta file containing the input sequences to be classified is given as an argument to the \-\-sintax command. The reference sequence database is specified with the \-\-db option. The results are written in a tab delimited text file whose name is specified with the \-\-tabbedout option. The \-\-sintax_cutoff option may be used to set a minimum level of bootstrap support for the taxonomic ranks to be reported. The \-\-randseed option may be included to specify a seed for initialisation of the random number generator used by the algorithm. .PP Multithreading is supported. Databases in UDB files are supported. The strand option may be specified. .PP The reference database must contain taxonomic information in the header of each sequence in the form of a string starting with ";tax=" and followed by a comma-separated list of up to eight taxonomic identifiers. Each taxonomic identifier must start with an indication of the rank by one of the letters d (for domain), k (kingdom), p (phylum), c (class), o (order), f (family), g (genus), or s (species). The letter is followed by a colon (:) and the name of that rank. Commas and semicolons are not allowed in the name of the rank. .PP Example: ">X80725_S000004313;\:tax=d:Bacteria,\:p:Proteobacteria,\:c:Gammaproteobacteria,\:o:Enterobacteriales,\:f:Enterobacteriaceae,\:g:Escherichia/Shigella,\:s:Escherichia_coli".
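The header format described above is simple enough to check with a short script before building a reference database. The following Python sketch extracts the ";tax=" annotation and returns the ranks in order. The function name parse_taxonomy is hypothetical, and stopping the annotation at the next semicolon is an assumption derived from the rule that semicolons are not allowed in names.

```python
RANK_LETTERS = "dkpcofgs"  # domain, kingdom, phylum, class, order, family, genus, species


def parse_taxonomy(header):
    """Return [(rank letter, name), ...] from a ';tax=' annotation."""
    start = header.find(";tax=")
    if start < 0:
        return []
    # The annotation runs until the next semicolon or the end of the header.
    annotation = header[start + len(";tax="):].split(";", 1)[0]
    lineage = []
    for item in annotation.split(","):
        rank, colon, name = item.partition(":")
        if colon != ":" or len(rank) != 1 or rank not in RANK_LETTERS or not name:
            raise ValueError(f"malformed taxonomic identifier: {item!r}")
        lineage.append((rank, name))
    return lineage
```

Applied to the example header above, the function returns seven pairs, from ("d", "Bacteria") to ("s", "Escherichia_coli").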
.PP The option \-\-notrunclabels is turned on by default for this command, allowing spaces in the taxonomic identifiers. .PP .TAG db .TP 9 .BI \-\-db \0filename Read the reference sequences from \fIfilename\fR, in FASTA, FASTQ or UDB format. These sequences need to be annotated with taxonomy. .TAG randseed .TP .BI \-\-randseed\~ "positive integer" Use \fIinteger\fR as seed for the random number generator used in the Sintax algorithm. A given seed always produces the same output order (useful for replicability). Set to 0 to use a pseudo-random seed (default behavior). .TAG sintax_cutoff .TP .BI \-\-sintax_cutoff\~ "real" Specify a minimum level of bootstrap support for the taxonomic ranks that will be included in column 4 of the output file. For instance 0.9, corresponding to 90%. .TAG sintax .TP .BI \-\-sintax \0filename Read the input sequences from \fIfilename\fR, in FASTA or FASTQ format. .TAG tabbedout .TP .BI \-\-tabbedout \0filename Write the results to \fIfilename\fR, in a tab-separated text format. Column 1 contains the query label. Column 2 contains the predicted taxonomy in the same format as for the reference data, with bootstrap support indicated in parentheses after each rank. Column 3 contains the strand. If the \-\-sintax_cutoff option is used, the predicted taxonomy will be repeated in column 4 while omitting the bootstrap values and including only the ranks with support at or above the threshold. .RE .PP .\" ---------------------------------------------------------------------------- .TAG udb-options UDB options: .RS Databases to be used with the \-\-usearch_global command may be prepared from FASTA files and stored to a binary UDB formatted file in order to speed up searching. This may be worthwhile when searching a large database repeatedly. The sequences are indexed and stored in a way that can be quickly loaded into memory. The commands and options below can be used to create and inspect UDB files.
A UDB file may be specified with the \-\-db option instead of a FASTA formatted file with the \-\-usearch_global command. .PP .TAG dbmask .TP 9 .BI \-\-dbmask\~ "none|dust|soft" Specify the sequence masking method used with the \-\-makeudb_usearch command, either none, dust or soft. No masking is performed when none is specified. When dust is specified, the DUST algorithm will be used for masking low complexity regions (short repeats and skewed composition). Lower case letters in the input file will be masked when soft is specified (soft masking). .TAG hardmask .TP .B \-\-hardmask Mask sequences by replacing letters with N for the \-\-makeudb_usearch command. The default is to use lower case letters (soft masking). .TAG makeudb_usearch .TP .BI \-\-makeudb_usearch \0filename Create a UDB database file from the FASTA-formatted sequences in the file with the given \fIfilename\fR. The UDB database is written to the file specified with the \-\-output option. .TAG output .TP .BI \-\-output \0filename Specify the \fIfilename\fR of a UDB or FASTA output file for the \-\-makeudb_usearch or the \-\-udb2fasta command, respectively. .TAG udb2fasta .TP .BI \-\-udb2fasta \0filename Read the UDB database in the file with the given \fIfilename\fR and output the sequences in FASTA format in the file specified by the \-\-output option. .TAG udbinfo .TP .BI \-\-udbinfo \0filename Show information about the UDB database in the file with the given \fIfilename\fR. .TAG udbstats .TP .BI \-\-udbstats \0filename Report statistics about the indexed words in the UDB database in the file with the given \fIfilename\fR. .TAG wordlength .TP .BI \-\-wordlength\~ "positive integer" Specify the length of the words to be used when creating the UDB database index using the \-\-makeudb_usearch command. Valid numbers range from 3 to 15. The default is 8.
.RE .PP .\" ---------------------------------------------------------------------------- .TAG userfields Userfields (fields accepted by the \-\-userfields option): .RS .TP 9 .B aln Print a string of M (match/mismatch, i.e. not a gap), D (delete, i.e. a gap in the query) and I (insert, i.e. a gap in the target) representing the pairwise alignment. Empty field if there is no alignment. .TP .B alnlen Print the length of the query-target alignment (number of columns). The field is set to 0 if there is no alignment. .TP .B bits Bit score (not computed for nucleotide alignments). Always set to 0. .TP .B caln Compact representation of the pairwise alignment using the CIGAR format (Compact Idiosyncratic Gapped Alignment Report): M (match/mismatch), D (deletion) and I (insertion). Empty field if there is no alignment. .TP .B evalue E-value (not computed for nucleotide alignments). Always set to -1. .TP .B exts Number of columns containing a gap extension (zero or positive integer value). .TP .B gaps Number of columns containing a gap (zero or positive integer value). .TP .B id The percentage of identity, according to the identity definition specified by the \-\-iddef option. Equal to id0, id1, id2, id3 or id4 below. By default the same as id2. .TP .B id0 CD-HIT definition of the percentage of identity (real value ranging from 0.0 to 100.0) using the length of the shortest sequence in the pairwise alignment as denominator: 100 * (matching columns) / (shortest sequence length). .TP .B id1 The percentage of identity (real value ranging from 0.0 to 100.0) is defined as the edit distance: 100 * (matching columns) / (alignment length). .TP .B id2 The percentage of identity (real value ranging from 0.0 to 100.0) is defined as the edit distance, excluding terminal gaps. 
.TP .B id3 Marine Biological Lab definition of the percentage of identity (real value ranging from 0.0 to 100.0), counting each gap opening (internal or terminal) as a single mismatch, whether or not the gap was extended, and using the length of the longest sequence in the pairwise alignment as denominator: 100 * (1.0 - [(mismatches + gaps) / (longest sequence length)]). .TP .B id4 BLAST definition of the percentage of identity (real value ranging from 0.0 to 100.0), equivalent to \-\-iddef 1 in a context of global pairwise alignment. The field id4 is always equal to the field id1. .TP .B ids Number of matches in the alignment (zero or positive integer value). .TP .B mism Number of mismatches in the alignment (zero or positive integer value). .TP .B opens Number of columns containing a gap opening (zero or positive integer value). .TP .B pairs Number of columns containing only nucleotides. That value corresponds to the length of the alignment minus the gap-containing columns (zero or positive integer value). .TP .B pctgaps Number of columns containing gaps expressed as a percentage of the alignment length (real value ranging from 0.0 to 100.0). .TP .B pctpv Percentage of positive columns. When working with nucleotide sequences, this is equivalent to the percentage of matches (real value ranging from 0.0 to 100.0). .TP .B pv Number of positive columns. When working with nucleotide sequences, this is equivalent to the number of matches (zero or positive integer value). .TP .B qcov Fraction of the query sequence that is aligned with the target sequence (real value ranging from 0.0 to 100.0). The query coverage is computed as 100.0 * (matches + mismatches) / query sequence length. Internal or terminal gaps are not taken into account. The field is set to 0.0 if there is no alignment. .TP .B qframe Query frame (-3 to +3). That field only concerns coding sequences and is not computed by \fBvsearch\fR. Always set to +0. 
.TP .B qhi Last nucleotide of the query aligned with the target. Always equal to the length of the pairwise alignment, 0 otherwise (see \fIqihi\fR to ignore terminal gaps). .TP .B qihi Last nucleotide of the query aligned with the target (ignoring terminal gaps). Nucleotide numbering starts from 1. The field is set to 0 if there is no alignment. .TP .B qilo First nucleotide of the query aligned with the target (ignoring initial gaps). Nucleotide numbering starts from 1. The field is set to 0 if there is no alignment. .TP .B ql Query sequence length (positive integer value). The field is set to 0 if there is no alignment. .TP .B qlo First nucleotide of the query aligned with the target. Always equal to 1 if there is an alignment, 0 otherwise (see \fIqilo\fR to ignore initial gaps). .TP .B qrow Print the sequence of the query segment as seen in the pairwise alignment (i.e. with gap insertions if need be). Empty field if there is no alignment. .TP .B qs Query segment length. Always equal to query sequence length. .\" The meaning of that field is not clear to us. .TP .B qstrand Query strand orientation (+ or - for nucleotide sequences). Empty field if there is no alignment. .TP .B query Query label. .TP .B raw Raw alignment score (negative, null or positive integer value). The score is the sum of match rewards minus mismatch penalties, gap openings and gap extensions. The field is set to 0 if there is no alignment. .TP .B target Target label. The field is set to '*' if there is no alignment. .TP .B tcov Fraction of the target sequence that is aligned with the query sequence (real value ranging from 0.0 to 100.0). The target coverage is computed as 100.0 * (matches + mismatches) / target sequence length. Internal or terminal gaps are not taken into account. The field is set to 0.0 if there is no alignment. .TP .B tframe Target frame (-3 to +3). That field only concerns coding sequences and is not computed by \fBvsearch\fR. Always set to +0. 
.TP .B thi Last nucleotide of the target aligned with the query. Always equal to the length of the pairwise alignment, 0 otherwise (see \fItihi\fR to ignore terminal gaps). .TP .B tihi Last nucleotide of the target aligned with the query (ignoring terminal gaps). Nucleotide numbering starts from 1. The field is set to 0 if there is no alignment. .TP .B tilo First nucleotide of the target aligned with the query (ignoring initial gaps). Nucleotide numbering starts from 1. The field is set to 0 if there is no alignment. .TP .B tl Target sequence length (positive integer value). The field is set to 0 if there is no alignment. .TP .B tlo First nucleotide of the target aligned with the query. Always equal to 1 if there is an alignment, 0 otherwise (see \fItilo\fR to ignore initial gaps). .TP .B trow Print the sequence of the target segment as seen in the pairwise alignment (i.e. with gap insertions if need be). Empty field if there is no alignment. .TP .B ts Target segment length. Always equal to target sequence length. The field is set to 0 if there is no alignment. .TP .B tstrand Target strand orientation (+ or - for nucleotide sequences). Always set to '+', so reverse strand matches have tstrand '+' and qstrand '\-'. Empty field if there is no alignment. .RE .PP .\" ============================================================================ .SH DELIBERATE CHANGES If you are a usearch user, our objective is to make you feel at home. That's why \fBvsearch\fR was designed to behave like usearch, to some extent. Like any complex software, usearch is not free from quirks and inconsistencies. We decided not to reproduce some of them, and for complete transparency, to document here the deliberate changes we made. .PP During a search with usearch, when using the options \-\-blast6out and \-\-output_no_hits, for queries with no match the number of fields reported is 13, where it should be 12. This is corrected in \fBvsearch\fR. 
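.PP
As an illustration of the twelve-field layout, here is a minimal Python sketch (not part of \fBvsearch\fR) that rejects any \-\-blast6out line whose field count differs from 12. The helper names are hypothetical, and the column order is an assumption: it follows the usual blast-like tabular layout, with columns named after the corresponding userfields.

```python
# Hypothetical helpers, not part of vsearch.
# Assumption: blast6out columns follow the usual blast-like tabular order,
# named here after the userfields described in this manual.
BLAST6_FIELDS = [
    "query", "target", "id", "alnlen", "mism", "opens",
    "qlo", "qhi", "tlo", "thi", "evalue", "bits",
]


def parse_blast6_line(line):
    """Split one tab-separated line and map it to the 12 field names.

    Raises ValueError when the number of fields is not 12, for example
    on a 13-field line such as the ones described above for usearch.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(BLAST6_FIELDS):
        raise ValueError(
            f"expected {len(BLAST6_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(BLAST6_FIELDS, values))


def is_no_hit(record):
    """Return True when the target column holds '*' (query without match)."""
    return record["target"] == "*"
```

Under the same assumptions, queries reported through \-\-output_no_hits can be picked out with is_no_hit(), provided the target column is set to '*' as it is for the target userfield.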
.PP The field raw of the \-\-userfields option is not informative in usearch. This is corrected in \fBvsearch\fR. .PP The fields qlo, qhi, tlo, thi now have counterparts (qilo, qihi, tilo, tihi) reporting alignment coordinates ignoring terminal gaps. .PP In usearch, when using the option \-\-output_no_hits, queries that receive no match are reported in \-\-blast6out file, but not in the alignment output file. This is corrected in \fBvsearch\fR. .PP \fBvsearch\fR introduces a new \-\-cluster_size command that sorts sequences by decreasing abundance before clustering. .PP \fBvsearch\fR reintroduces \-\-iddef alternative pairwise identity definitions that were removed from usearch. .PP \fBvsearch\fR extends the \-\-topn option to sorting commands. .PP \fBvsearch\fR extends the \-\-sizein option to dereplication (\-\-derep_fulllength) and clustering (\-\-cluster_fast). .PP \fBvsearch\fR treats T and U as identical nucleotides during dereplication. .PP \fBvsearch\fR sorting is stabilized by using sequence abundances or sequence labels as secondary or tertiary keys. .PP \fBvsearch\fR by default uses the DUST algorithm for masking low-complexity regions. Masking behavior is also slightly changed to be more consistent. .PP .\" ============================================================================ .SH NOVELTIES \fBvsearch\fR introduces new commands and new options not present in usearch 7. They are described in the 'Options' section of this manual.
Here is a short list: .RS .IP - 2 uchime2_denovo, uchime3_denovo, alignwidth, borderline, fasta_score (chimera checking) .IP - cluster_size, cluster_unoise, clusterout_id, clusterout_sort, profile (clustering) .IP - fasta_width, gzip_decompress, bzip2_decompress (general options) .IP - iddef (clustering, pairwise alignment, searching) .IP - maxuniquesize (dereplication) .IP - relabel_md5, relabel_self and relabel_sha1 (chimera detection, dereplication, FASTQ processing, shuffling, sorting) .IP - shuffle (shuffling) .IP - fastq_eestats, fastq_eestats2, fastq_maxlen, fastq_truncee (FASTQ processing) .IP - fastaout_discarded, fastqout_discarded (subsampling) .IP - rereplicate (dereplication/rereplication) .RE .PP .\" ============================================================================ .SH EXAMPLES .PP Align all sequences in a database with each other and output all pairwise alignments: .PP .RS \fBvsearch\fR \-\-allpairs_global \fIdatabase.fas\fR \-\-alnout \fIresults.aln\fR \-\-acceptall .RE .PP Check for the presence of chimeras (\fIde novo\fR); parents should be at least 1.5 times more abundant than chimeras.
Output non-chimeric sequences in fasta format (no wrapping): .PP .RS \fBvsearch\fR \-\-uchime_denovo \fIqueries.fas\fR \-\-abskew 1.5 \-\-nonchimeras \fIresults.fas\fR \-\-fasta_width 0 .RE .PP Cluster with a 97% similarity threshold, collect cluster centroids, and write cluster descriptions using a uclust-like format: .PP .RS \fBvsearch\fR \-\-cluster_fast \fIqueries.fas\fR \-\-id 0.97 \-\-centroids \fIcentroids.fas\fR \-\-uc \fIclusters.uc\fR .RE .PP Dereplicate the sequences contained in \fIqueries.fas\fR, take into account the abundance information already present, write unwrapped fasta sequences to \fIqueries_unique.fas\fR with the new abundance information, discard all sequences with an abundance of 1: .PP .RS \fBvsearch\fR \-\-derep_fulllength \fIqueries.fas\fR \-\-sizein \-\-fasta_width 0 \-\-sizeout \-\-output \fIqueries_unique.fas\fR \-\-minuniquesize 2 .RE .PP Mask simple repeats and low complexity regions in the input fasta file with the DUST algorithm (masked regions are lowercased), and write the results to the output file: .PP .RS \fBvsearch\fR \-\-maskfasta \fIqueries.fas\fR \-\-qmask dust \-\-output \fIqueries_masked.fas\fR .RE .PP Search queries in a reference database, with an 80%-similarity threshold, take terminal gaps into account when calculating pairwise similarities, output pairwise alignments: .PP .RS \fBvsearch\fR \-\-usearch_global \fIqueries.fas\fR \-\-db \fIreferences.fas\fR \-\-id 0.8 \-\-iddef 1 \-\-alnout \fIresults.aln\fR .RE .PP Search a sequence dataset against itself (ignore self hits), get all matches with at least 60% similarity, and collect results in a blast-like tab-separated format.
Accept an unlimited number of hits (\-\-maxaccepts 0), and compare each query to all other sequences, including unlikely candidates (\-\-maxrejects 0): .PP .RS \fBvsearch\fR \-\-usearch_global \fIqueries.fas\fR \-\-db \fIqueries.fas\fR \-\-self \-\-id 0.6 \-\-blast6out \fIresults.blast6\fR \-\-maxaccepts 0 \-\-maxrejects 0 .RE .PP Shuffle the input fasta file (change the order of sequences) in a repeatable fashion (fixed seed), and write unwrapped fasta sequences to the output file: .PP .RS \fBvsearch\fR \-\-shuffle \fIqueries.fas\fR \-\-output \fIqueries_shuffled.fas\fR \-\-randseed 13 \-\-fasta_width 0 .RE .PP Sort by decreasing abundance the sequences contained in \fIqueries.fas\fR (using the 'size=\fIinteger\fR' information), relabel the sequences while preserving the abundance information (with \-\-sizeout), keep only sequences with an abundance equal to or greater than 2: .PP .RS \fBvsearch\fR \-\-sortbysize \fIqueries.fas\fR \-\-output \fIqueries_sorted.fas\fR \-\-relabel sampleA_ \-\-sizeout \-\-minsize 2 .RE .PP .\" .\" ============================================================================ .SH AUTHORS Implementation by Torbjørn Rognes and Tomás Flouri, documentation by Frédéric Mahé. .PP .\" ============================================================================ .SH CITATION .PP Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. \fIPeerJ\fR 4:e2584 doi: 10.7717/peerj.2584 .URL https://doi.org/10.7717/peerj.2584 (link) .PP .\" ============================================================================ .SH REPORTING BUGS Submit suggestions and bug-reports at .URL https://github.com/torognes/vsearch/issues (link) , send a pull request on .URL https://github.com/torognes/vsearch (link) , or compose a friendly or curmudgeonly e-mail to Torbjørn Rognes .MTO torognes@ifi.uio.no (link) .
.PP .\" ============================================================================ .SH AVAILABILITY Source code and binaries are available at . .PP .\" ============================================================================ .SH COPYRIGHT Copyright (C) 2014-2021, Torbjørn Rognes, Frédéric Mahé and Tomás Flouri .PP All rights reserved. .PP Contact: Torbjørn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway .PP This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. .PP \fBGNU General Public License version 3\fR .PP This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. .PP This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. .PP You should have received a copy of the GNU General Public License along with this program. If not, see .URL http://www.gnu.org/licenses/ (link) . .PP .PP \fBThe BSD 2-Clause License\fR .PP Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: .PP 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. .PP 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 
.PP THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. .PP We would like to thank the authors of the following projects for making their source code available: .RS .IP - 2 \fBvsearch\fR includes code from Google's CityHash project by Geoff Pike and Jyrki Alakuijala, providing some excellent hash functions available under a MIT license. .IP - \fBvsearch\fR includes code derived from Tatusov and Lipman's DUST program that is in the public domain. .IP - \fBvsearch\fR includes public domain code written by Alexander Peslyak for the MD5 message digest algorithm. .IP - \fBvsearch\fR includes public domain code written by Steve Reid and others for the SHA1 message digest algorithm. .IP - \fBvsearch\fR binaries may include code from the zlib library, copyright Jean-Loup Gailly and Mark Adler. .IP - \fBvsearch\fR binaries may include code from the bzip2 library, copyright Julian R. Seward. .RE .PP .\" ============================================================================ .SH SEE ALSO \fBswipe\fR, an extremely fast pairwise local (Smith-Waterman) database search tool by Torbjørn Rognes, available at .URL https://github.com/torognes/swipe "(link)" . 
.PP \fBswarm\fR, a fast and accurate amplicon clustering method by Frédéric Mahé and Torbjørn Rognes, available at .URL https://github.com/torognes/swarm "(link)" . .PP .\" ============================================================================ .SH VERSION HISTORY New features and important modifications of \fBvsearch\fR (short lived or minor bug releases may not be mentioned): .TP .BR v1.0.0\~ "released November 28th, 2014" First public release. .TP .BR v1.0.1\~ "released December 1st, 2014" Bug fixes (sortbysize, semicolon after size annotation in headers) and minor changes (labels as secondary sort key for most sorts, treat T and U as identical for dereplication, only output size in \-\-dbmatched file if \-\-sizeout specified). .TP .BR v1.0.2\~ "released December 6th, 2014" Bug fixes (ssse3/sse4.1 requirement, memory leak). .TP .BR v1.0.3\~ "released December 6th, 2014" Bug fix (now writes help to stdout instead of stderr). .TP .BR v1.0.4\~ "released December 8th, 2014" Added \-\-allpairs_global option. Reduce memory requirements slightly and eliminate memory leaks. .TP .BR v1.0.5\~ "released December 9th, 2014" Fixes a minor bug with \-\-allpairs_global and \-\-acceptall options. .TP .BR v1.0.6\~ "released December 14th, 2014" Fixes a memory allocation bug in chimera detection (\-\-uchime_ref option). .TP .BR v1.0.7\~ "released December 19th, 2014" Fixes a bug in the output from chimera detection with the \-\-uchimeout option. 
.TP .BR v1.0.8\~ "released January 22nd, 2015" Introduces several changes and bug fixes: .RS .IP - 2 a new linear memory aligner for alignment of sequences longer than 5,000 nucleotides, .IP - a new \-\-cluster_size command that sorts sequences by decreasing abundance before clustering, .IP - meaning of userfields qlo, qhi, tlo, thi changed for compatibility with usearch, .IP - new userfields qilo, qihi, tilo, tihi give alignment coordinates ignoring terminal gaps, .IP - in \-\-uc output files, a perfect alignment is indicated with a '=' sign, .IP - the option \-\-cluster_fast now sorts sequences by decreasing length, then by decreasing abundance and finally by sequence identifier, .IP - default \-\-maxseqlength value set to 50,000 nucleotides, .IP - fix for bug in alignment in rare cases, .IP - fix for lack of detection of under- or overflow in SIMD aligner. .RE .TP .BR v1.0.9\~ "released January 22nd, 2015" Fixes a bug in the function sorting sequences by decreasing abundance (\-\-sortbysize). .TP .BR v1.0.10\~ "released January 23rd, 2015" Fixes a bug where the \-\-sizein option was ignored and always treated as on, affecting clustering and dereplication commands. .TP .BR v1.0.11\~ "released February 5th, 2015" Introduces the possibility to output results in SAM format (for clustering, pairwise alignment and searching). .TP .BR v1.0.12\~ "released February 6th, 2015" Temporarily fixes a problem with long headers in FASTA files. .TP .BR v1.0.13\~ "released February 17th, 2015" Fix a memory allocation problem when computing multiple sequence alignments with the \-\-msaout and \-\-consout options, as well as a memory leak. Also increased line buffer for reading FASTA files to 4MB. .TP .BR v1.0.14\~ "released February 17th, 2015" Fix a bug where the multiple alignment and consensus sequence computed after clustering ignored the strand of the sequences. Also decreased size of line buffer for reading FASTA files to 1MB again due to excessive stack memory usage. 
.TP .BR v1.0.15\~ "released February 18th, 2015" Fix bug in calculation of identity metric between sequences when using the MBL definition (\-\-iddef 3). .TP .BR v1.0.16\~ "released February 19th, 2015" Integrated patches from Debian for increased compatibility with various architectures. .TP .BR v1.1.0\~ "released February 20th, 2015" Added the \-\-quiet option to suppress all output to stdout and stderr except for warnings and fatal errors. Added the \-\-log option to write messages to a log file. .TP .BR v1.1.1\~ "released February 20th, 2015" Added info about \-\-log and \-\-quiet options to help text. .TP .BR v1.1.2\~ "released March 18th, 2015" Fix bug with large datasets. Fix format of help info. .TP .BR v1.1.3\~ "released March 18th, 2015" Fix more bugs with large datasets. .TP .BR v1.2.0-1.2.19\~ "released July 6th to September 8th, 2015" Several new commands and options added. Bugs fixed. Documentation updated. .TP .BR v1.3.0\~ "released September 9th, 2015" Changed to autotools build system. .TP .BR v1.3.1\~ "released September 14th, 2015" Several new commands and options. Bug fixes. .TP .BR v1.3.2\~ "released September 15th, 2015" Fixed memory leaks. Added '-h' shortcut for help. Removed extra 'v' in version number. .TP .BR v1.3.3\~ "released September 15th, 2015" Fixed bug in hexadecimal digits of MD5 and SHA1 digests. Added \-\-samheader option. .TP .BR v1.3.4\~ "released September 16th, 2015" Fixed compilation problems with zlib and bzip2lib. .TP .BR v1.3.5\~ "released September 17th, 2015" Minor configuration/makefile changes to compile to native CPU and simplify makefile. .TP .BR v1.4.0\~ "released September 25th, 2015" Added \-\-sizeorder option. .TP .BR v1.4.1\~ "released September 29th, 2015" Inserted public domain MD5 and SHA1 code to eliminate dependency on crypto and openssl libraries and their licensing issues. .TP .BR v1.4.2\~ "released October 2nd, 2015" Dynamic loading of libraries for reading gzip and bzip2 compressed files if available. 
Circumvention of missing gzoffset function in zlib 1.2.3 and earlier. .TP .BR v1.4.3\~ "released October 3rd, 2015" Fix a bug with determining amount of memory on some versions of Apple OS X. .TP .BR v1.4.4\~ "released October 3rd, 2015" Remove debug message. .TP .BR v1.4.5\~ "released October 6th, 2015" Fix memory allocation bug when reading long FASTA sequences. .TP .BR v1.4.6\~ "released October 6th, 2015" Fix subtle bug in SIMD alignment code that reduced accuracy. .TP .BR v1.4.7\~ "released October 7th, 2015" Fixes a problem with searching for or clustering sequences with repeats. In this new version, vsearch looks at all words occurring at least once in the sequences in the initial step. Previously only words occurring exactly once were considered. In addition, vsearch now requires at least 10 words to be shared by the sequences, previously only 6 were required. If the query contains less than 10 words, all words must be present for a match. This change seems to lead to slightly reduced recall, but somewhat increased precision, ending up with slightly improved overall accuracy. .TP .BR v1.5.0\~ "released October 7th, 2015" This version introduces the new option \-\-minwordmatches that allows the user to specify the minimum number of matching unique words before a sequence is considered further. New default values for different word lengths are also set. The minimum word length is increased to 7. .TP .BR v1.6.0\~ "released October 9th, 2015" This version adds the relabeling options (\-\-relabel, \-\-relabel_md5 and \-\-relabel_sha1) to the shuffle command. It also adds the \-\-xsize option to the clustering, dereplication, shuffling and sorting commands. .TP .BR v1.6.1\~ "released October 14th, 2015" Fix bugs and update manual and help text regarding relabelling. Add all relabelling options to the subsampling command. Add the \-\-xsize option to chimera detection, dereplication and fastq filtering commands. Refactoring of code. 
.TP .BR v1.7.0\~ "released October 14th, 2015" Add \-\-relabel_keep option. .TP .BR v1.8.0\~ "released October 19th, 2015" Added \-\-search_exact, \-\-fastx_mask and \-\-fastq_convert commands. Changed most commands to read FASTQ input files as well as FASTA files. Modified \-\-fastx_revcomp and \-\-fastx_subsample to write FASTQ files. .TP .BR v1.8.1\~ "released November 2nd, 2015" Fixes for compatibility with QIIME and older OS X versions. .TP .BR v1.9.0\~ "released November 12th, 2015" Added the \-\-fastq_mergepairs command and associated options. This command has not been tested well yet. Included additional files to avoid dependency of autoconf for compilation. Fixed an error where identifiers in fasta headers were not truncated at tabs, just spaces. Fixed a bug in detection of the file format (FASTA/FASTQ) of a gzip compressed input file. .TP .BR v1.9.1\~ "released November 13th, 2015" Fixed memory leak and a bug in score computation in \-\-fastq_mergepairs, and improved speed. .TP .BR v1.9.2\~ "released November 17th, 2015" Fixed a bug in the computation of some values with \-\-fastq_stats. .TP .BR v1.9.3\~ "released November 19th, 2015" Workaround for missing x86intrin.h with old compilers. .TP .BR v1.9.4\~ "released December 3rd, 2015" Fixed incrementation of counter when relabeling dereplicated sequences. .TP .BR v1.9.5\~ "released December 3rd, 2015" Fixed bug resulting in inferior chimera detection performance. .TP .BR v1.9.6\~ "released January 8th, 2016" Fixed bug in aligned sequences produced with \-\-fastapairs and \-\-userout (qrow, trow) options. .TP .BR v1.9.7\~ "released January 12th, 2016" Masking behavior is changed somewhat to keep the letter case of the input sequences unchanged when no masking is performed. Masking is now performed also during chimera detection. Documentation updated. .TP .BR v1.9.8\~ "released January 22nd, 2016" Fixed bug causing segfault when chimera detection is performed on extremely short sequences.
.TP .BR v1.9.9\~ "released January 22nd, 2016" Adjusted default minimum number of word matches during searches for improved performance. .TP .BR v1.9.10\~ "released January 25th, 2016" Fixed bug related to masking and lower case database sequences. .TP .BR v1.10.0\~ "released February 11th, 2016" Parallelized and improved merging of paired-end reads and adjusted some defaults. Removed progress indicator when stderr is not a terminal. Added \-\-fasta_score option to report chimera scores in FASTA files. Added \-\-rereplicate and \-\-fastq_eestats commands. Fixed typos. Added relabelling to files produced with \-\-consout and \-\-profile options. .TP .BR v1.10.1\~ "released February 23rd, 2016" Fixed a bug affecting the \-\-fastq_mergepairs command causing FASTQ headers to be truncated at first space (despite the bug fix release 1.9.0 of November 12th, 2015). Full headers are now included in the output (no matter if \-\-notrunclabels is in effect or not). .TP .BR v1.10.2\~ "released March 18th, 2016" Fixed a bug causing a segmentation fault when running \-\-usearch_global with an empty query sequence. Also fixed a bug causing imperfect alignments to be reported with an alignment string of '=' in uc output files. Fixed typos in man file. Fixed fasta/fastq processing code regarding presence or absence of compression library header files. .TP .BR v1.11.1\~ "released April 13th, 2016" Added strand information in UC file for \-\-derep_fulllength and \-\-derep_prefix. Added expected errors (ee) to header of FASTA files specified with \-\-fastaout and \-\-fastaout_discarded when \-\-eeout or \-\-fastq_eeout option is in effect for fastq_filter and fastq_mergepairs. The options \-\-eeout and \-\-fastq_eeout are now equivalent. .TP .BR v1.11.2\~ "released June 21st, 2016" Two bugs were fixed. The first issue was related to the \-\-query_cov option that used a different coverage definition than the qcov userfield. 
The coverage is now defined as the fraction of the whole query sequence length that is aligned with matching or mismatching residues in the target. All gaps are ignored. The other issue was related to the consensus sequences produced during clustering when only N's were present in some positions. Previously these would be converted to A's in the consensus. The behaviour is changed so that N's are produced in the consensus, and it should now be more compatible with usearch. .TP .BR v2.0.0\~ "released June 24th, 2016" This major new version supports reading from pipes. Two new options are added: \-\-gzip_decompress and \-\-bzip2_decompress. One of these options must be specified if reading compressed input from a pipe, but is not required when reading from ordinary files. The vsearch header that was previously written to stdout is now written to stderr. This enables piping of results for further processing. The file name '\-' now represents standard input (/dev/stdin) or standard output (/dev/stdout) when reading or writing files, respectively. Code for reading FASTA and FASTQ files has been refactored. .TP .BR v2.0.1\~ "released June 30th, 2016" Avoid segmentation fault when masking very long sequences. .TP .BR v2.0.2\~ "released July 5th, 2016" Avoid warnings when compiling with GCC 6. .TP .BR v2.0.3\~ "released August 2nd, 2016" Fixed bad compiler options resulting in Illegal instruction errors when running precompiled binaries. .TP .BR v2.0.4\~ "released September 1st, 2016" Improved error message for bad FASTQ quality values. Improved manual. .TP .BR v2.0.5\~ "released September 9th, 2016" Add options \-\-fastaout_discarded and \-\-fastqout_discarded to output discarded sequences from subsampling to separate files. Updated manual. .TP .BR v2.1.0\~ "released September 16th, 2016" New command: \-\-fastx_filter. New options: \-\-fastq_maxlen, \-\-fastq_truncee. Allow \-\-minwordmatches down to 3.
.TP .BR v2.1.1\~ "released September 23rd, 2016" Fixed bugs in output to UC-files. Improved help text and manual. .TP .BR v2.1.2\~ "released September 28th, 2016" Fixed incorrect abundance output from fastx_filter and fastq_filter when relabelling. .TP .BR v2.2.0\~ "released October 7th, 2016" Added OTU table generation options \-\-biomout, \-\-mothur_shared_out and \-\-otutabout to the clustering and searching commands. .TP .BR v2.3.0\~ "released October 10th, 2016" Allowed zero-length sequences in FASTA and FASTQ files. Added \-\-fastq_trunclen_keep option. Fixed bug with output of OTU tables to pipes. .TP .BR v2.3.1\~ "released November 16th, 2016" Fixed bug where \-\-minwordmatches 0 was interpreted as the default minimum word matches for the given word length instead of zero. When used in combination with \-\-maxaccepts 0 and \-\-maxrejects 0 it will allow complete bypass of kmer-based heuristics. .TP .BR v2.3.2\~ "released November 18th, 2016" Fixed bug where vsearch reported the ordinal number of the target sequence instead of the cluster number in column 2 on H-lines in the uc output file after clustering. For search and alignment commands both usearch and vsearch report the target sequence number here. .TP .BR v2.3.3\~ "released December 5th, 2016" A minor speed improvement. .TP .BR v2.3.4\~ "released December 9th, 2016" Fixed bug in output of sequence profiles and updated documentation. .TP .BR v2.4.0\~ "released February 8th, 2017" Added support for Linux on Power8 systems (ppc64le) and Windows on x86_64. Improved detection of pipes when reading FASTA and FASTQ files. Corrected option for specifying output from fastq_eestats command in help text. .TP .BR v2.4.1\~ "released March 1st, 2017" Fixed an overflow bug in fastq_stats and fastq_eestats affecting analysis of very large FASTQ files. Fixed maximum memory usage reporting on Windows.
.TP
.BR v2.4.2\~ "released March 10th, 2017"
Default value for fastq_minovlen increased to 16 in accordance with help text and for compatibility with usearch. Minor changes for improved accuracy of paired-end read merging.
.TP
.BR v2.4.3\~ "released April 6th, 2017"
Fixed bug with progress bar for shuffling. Fixed missing N-lines in UC files with usearch_global, search_exact and allpairs_global when the output_no_hits option was not specified.
.TP
.BR v2.4.4\~ "released August 28th, 2017"
Fixed a few minor bugs, improved error messages and updated documentation.
.TP
.BR v2.5.0\~ "released October 5th, 2017"
Support for UDB database files. New commands: fastq_stripright, fastq_eestats2, makeudb_usearch, udb2fasta, udbinfo, and udbstats. New general option: no_progress. New options minsize and maxsize to fastx_filter. Minor bug fixes, error message improvements and documentation updates.
.TP
.BR v2.5.1\~ "released October 25th, 2017"
Fixed bug with bad default value of 1 instead of 32 for minseqlength when using the makeudb_usearch command.
.TP
.BR v2.5.2\~ "released October 30th, 2017"
Fixed bug where '-' as an argument to the fastq_eestats2 option was treated literally instead of as equivalent to stdin.
.TP
.BR v2.6.0\~ "released November 10th, 2017"
Rewritten paired-end reads merger with improved accuracy. Decreased default value for fastq_minovlen option from 16 to 10. The default value for the fastq_maxdiffs option is increased from 5 to 10. There are now other more important restrictions that will avoid merging reads that cannot be reliably aligned.
.TP
.BR v2.6.1\~ "released December 8th, 2017"
Improved parallelisation of paired end reads merging.
.TP
.BR v2.6.2\~ "released December 18th, 2017"
Fixed option xsize that was partially inactive for commands uchime_denovo, uchime_ref, and fastx_filter.
.TP
.BR v2.7.0\~ "released February 13th, 2018"
Added commands cluster_unoise, uchime2_denovo and uchime3_denovo contributed by Davide Albanese based on Robert Edgar's papers. Refactored fasta and fastq print functions as well as code for extraction of abundance and other attributes from the headers.
.TP
.BR v2.7.1\~ "released February 16th, 2018"
Fix several bugs on Windows related to large files, use of "-" as a file name to mean stdin or stdout, alignment errors, missed kmers and corrupted UDB files. Added documentation of UDB-related commands.
.TP
.BR v2.7.2\~ "released April 20th, 2018"
Added the sintax command for taxonomic classification. Fixed a bug with incorrect FASTA headers of consensus sequences after clustering.
.TP
.BR v2.8.0\~ "released April 24th, 2018"
Added the fastq_maxdiffpct option to the fastq_mergepairs command.
.TP
.BR v2.8.1\~ "released June 22nd, 2018"
Fixes for compilation warnings with GCC 8.
.TP
.BR v2.8.2\~ "released August 21st, 2018"
Fix for wrong placement of semicolons in header lines in some cases when using the sizeout or xsize options. Reduced memory requirements for full-length dereplication in cases with many duplicate sequences. Improved wording of fastq_mergepairs report. Updated manual regarding use of sizein and sizeout with dereplication. Changed a compiler option.
.TP
.BR v2.8.3\~ "released August 31st, 2018"
Fix for segmentation fault for \-\-derep_fulllength with \-\-uc.
.TP
.BR v2.8.4\~ "released September 3rd, 2018"
Further reduce memory requirements for dereplication when not using the uc option. Fix output during subsampling when quiet or log options are in effect.
.TP
.BR v2.8.5\~ "released September 26th, 2018"
Fixed a bug in fastq_eestats2 that caused the values for large lengths to be much too high when the input sequences had varying lengths.
.TP
.BR v2.8.6\~ "released October 9th, 2018"
Fixed a bug introduced in version 2.8.2 that caused derep_fulllength to include the full FASTA header in its output instead of stopping at the first space (unless the notrunclabels option is in effect).
.TP
.BR v2.9.0\~ "released October 10th, 2018"
Added the fastq_join command.
.TP
.BR v2.9.1\~ "released October 29th, 2018"
Changed compiler options that select the target cpu and tuning to allow the software to run on any 64-bit x86 system, while tuning for more modern variants. Avoid illegal instruction error on some architectures. Update documentation of rereplicate command.
.TP
.BR v2.10.0\~ "released December 6th, 2018"
Added the sff_convert command to convert SFF files to FASTQ. Added some additional option argument checks. Fixed segmentation fault bug after some fatal errors when a log file was specified.
.TP
.BR v2.10.1\~ "released December 7th, 2018"
Improved sff_convert command. It will now read several variants of the SFF format. It is also able to read from a pipe. Warnings are given if there are minor problems. Error messages have been improved. Minor speed and memory usage improvements.
.TP
.BR v2.10.2\~ "released December 10th, 2018"
Fixed bug in sintax with reversed order of domain and kingdom.
.TP
.BR v2.10.3\~ "released December 19th, 2018"
Ported to Linux on ARMv8 (aarch64). Fixed compilation warning with gcc version 8.1.0 and 8.2.0.
.TP
.BR v2.10.4\~ "released January 4th, 2019"
Fixed serious bug in x86_64 SIMD alignment code introduced in version 2.10.3. Added link to BioConda in README. Fixed bug in fastq_stats with sequence length 1. Fixed use of equals symbol in UC files for identical sequences with cluster_fast.
.TP
.BR v2.11.0\~ "released February 13th, 2019"
Added ability to trim and filter paired-end reads using the reverse option with the fastx_filter and fastq_filter commands. Added \-\-xee option to remove ee attributes from FASTA headers.
Minor invisible improvement to the progress indicator.
.TP
.BR v2.11.1\~ "released February 28th, 2019"
Minor change to the handling of the weak_id and id options when using cluster_unoise.
.TP
.BR v2.12.0\~ "released March 19th, 2019"
Take sequence abundance into account when computing consensus sequences or profiles after clustering. Warn when rereplicating sequences without abundance info. Guess offset 33 in more cases with fastq_chars. Stricter checking of option arguments and option combinations.
.TP
.BR v2.13.0\~ "released April 11th, 2019"
Added the \-\-fastx_getseq, \-\-fastx_getseqs and \-\-fastx_getsubseq commands to extract sequences from a FASTA or FASTQ file based on their labels. Improved handling of ambiguous nucleotide symbols. Corrected behaviour of \-\-uchime_ref command with options \-\-self and \-\-selfid. Strict detection of illegal options for each command.
.TP
.BR v2.13.1\~ "released April 26th, 2019"
Minor changes to the allowed options for each command. All commands now allow the log, quiet and threads options. If more than 1 thread is specified for commands that are not multi-threaded, a warning will be issued. Minor changes to the manual.
.TP
.BR v2.13.2\~ "released April 30th, 2019"
Fixed bug related to improper handling of newlines on Windows. Allowed option strand plus to uchime_ref for compatibility.
.TP
.BR v2.13.3\~ "released April 30th, 2019"
Fixed bug in FASTQ parsing introduced in version 2.13.2.
.TP
.BR v2.13.4\~ "released May 10th, 2019"
Added information about support for gzip- and bzip2-compressed input files to the output of the version command. Adapted source code for compilation on FreeBSD and NetBSD systems.
.TP
.BR v2.13.5\~ "released July 2nd, 2019"
Added cut command to fragment sequences at restriction sites. Silenced output from the fastq_stats command if quiet option was given. Updated manual.
.TP
.BR v2.13.6\~ "released July 2nd, 2019"
Added info about cut command to output of help command.
.TP
.BR v2.13.7\~ "released September 2nd, 2019"
Fixed bug in consensus sequence introduced in version 2.13.0.
.TP
.BR v2.14.0\~ "released September 11th, 2019"
Added relabel_self option. Made fasta_width, sizein, sizeout and relabelling options valid for certain commands.
.TP
.BR v2.14.1\~ "released September 18th, 2019"
Fixed bug with sequences written to file specified with fastaout_rev for commands fastx_filter and fastq_filter.
.TP
.BR v2.14.2\~ "released January 28th, 2020"
Fixed some issues with the cut, fastx_revcomp, fastq_convert, fastq_mergepairs, and makeudb_usearch commands. Updated manual.
.TP
.BR v2.15.0\~ "released June 19th, 2020"
Update manual and documentation. Turn on notrunclabels option for sintax command by default. Change maxhits 0 to mean unlimited hits, like the default. Allow non-ascii characters in headers, with a warning. Sort centroids and uc too when clusterout_sort specified. Add cluster id to centroids output when clusterout_id specified. Improve error messages when parsing FASTQ files. Add missing fastq_qminout option and fix label_suffix option for fastq_mergepairs. Add derep_id command that dereplicates based on both label and sequence. Remove compilation warnings.
.TP
.BR v2.15.1\~ "released October 28th, 2020"
Fix for dereplication when including reverse complement sequences and headers. Make some extra checks when loading compression libraries and add more diagnostic output about them to the output of the version command. Report an error when fastx_filter is used with FASTA input and options that require FASTQ input. Update manual.
.TP
.BR v2.15.2\~ "released January 26th, 2021"
No real functional changes, but some code and compilation changes. Compiles successfully on macOS running on Apple Silicon (ARMv8). Binaries available. Code updated for C++11. Minor adaptations for Windows compatibility, including the use of the C++ standard library for regular expressions. Minor changes for compatibility with Power8.
Switch to C++ header files.
.TP
.BR v2.16.0\~ "released March 22nd, 2021"
This version adds the orient command. It also handles empty input files properly. Documentation has been updated.
.TP
.BR v2.17.0\~ "released March 29th, 2021"
The fastq_mergepairs command has been changed. It now allows merging of sequences with overlaps as short as 5 bp if the \-\-fastq_minovlen option has been adjusted down from the default 10. In addition, far fewer pairs of reads should now be rejected with the reason 'multiple potential alignments' as the algorithm for detecting those has been changed.
.TP
.BR v2.17.1\~ "released June 14th, 2021"
Modernized code. Minor changes to help info.
.TP
.BR v2.18.0\~ "released August 27th, 2021"
Added the fasta2fastq command. Fixed search bug on ppc64le. Fixed bug with removal of size and ee info in uc files. Fixed compilation errors in some cases. Made some general code improvements. Updated manual.
.TP
.BR v2.19.0\~ "released December 21st, 2021"
Added the lcaout and lca_cutoff options to enable the output of last common ancestor (LCA) information about hits when searching. The randseed option was added as a valid option to the sintax command. Code improvements.
.TP
.BR v2.20.0\~ "released January 10th, 2022"
Added the fastx_uniques command and the fastq_qout_max option for dereplication of FASTQ files. Some code cleaning.
.TP
.BR v2.20.1\~ "released January 11th, 2022"
Fixes a bug in fastq_mergepairs that caused an occasional hang at the end when using multiple threads.
.TP
.BR v2.21.0\~ "released January 12th, 2022"
This version adds the sample, qsegout and tsegout options. It enables the use of UDB databases with uchime_ref.
.TP
.BR v2.21.1\~ "released January 18th, 2022"
Fix a problem with dereplication of empty input files. Update Altivec code on ppc64le for improved compiler compatibility (vector->__vector).
.LP .\" ============================================================================ .\" TODO: .\" .\" NOTES .\" visualize and output to pdf .\" man -l vsearch.1 .\" man -t ./doc/vsearch.1 | ps2pdf - > ./doc/vsearch_manual.pdf vsearch-2.21.1/Makefile.am0000644000175000017500000000010514171574117014616 0ustar nileshnileshAUTOMAKE_OPTIONS = foreign SUBDIRS = src man EXTRA_DIST = autogen.sh vsearch-2.21.1/src/0000755000175000017500000000000014171574117013355 5ustar nileshnileshvsearch-2.21.1/src/mergepairs.h0000644000175000017500000000470614171574117015673 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void fastq_mergepairs(); vsearch-2.21.1/src/align_simd.cc0000644000175000017500000015606714171574117016011 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" /* Using 16-bit signed values, from -32768 to +32767. match: positive mismatch: negative gap penalties: positive (open, extend, query/target, left/interior/right) optimal global alignment (NW) maximize score */ #define CHANNELS 8 #define CDEPTH 4 /* Due to memory usage, limit the product of the length of the sequences. If the product of the query length and any target sequence length is above the limit, the alignment will not be computed and a score of SHRT_MAX will be returned as the score. 
If an overflow occurs during alignment computation, a score of SHRT_MAX will also be returned. The limit is set to 5 000 * 5 000 = 25 000 000. This will allocate up to 200 MB per thread. It will align pairs of sequences less than 5000 nt long using the SIMD implementation, larger alignments will be performed with the linear memory aligner. */ #define MAXSEQLENPRODUCT 25000000 static int64_t scorematrix[16][16]; /* The macros below usually operate on 128-bit vectors of 8 signed short 16-bit integers. Additions and subtractions should be saturated. The shift operation should shift left by 2 bytes (one short int) and shift in zeros. The v_mask_gt operation should compare two vectors of signed shorts and return a 16-bit bitmask with pairs of 2 bits set for each element greater in the first than in the second argument. */ #ifdef __PPC__ typedef __vector signed short VECTOR_SHORT; const __vector unsigned char perm_merge_long_low = {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17}; const __vector unsigned char perm_merge_long_high = {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f}; #define v_init(a,b,c,d,e,f,g,h) (const VECTOR_SHORT){a,b,c,d,e,f,g,h} #define v_load(a) vec_ld(0, (VECTOR_SHORT *)(a)) #define v_store(a, b) vec_st((__vector unsigned char)(b), 0, \ (__vector unsigned char *)(a)) #define v_add(a, b) vec_adds((a), (b)) #define v_sub(a, b) vec_subs((a), (b)) #define v_sub_unsigned(a, b) ((VECTOR_SHORT) \ vec_subs((__vector unsigned short) (a), \ (__vector unsigned short) (b))) #define v_max(a, b) vec_max((a), (b)) #define v_min(a, b) vec_min((a), (b)) #define v_dup(a) vec_splat((VECTOR_SHORT){(short)(a), 0, 0, 0, 0, 0, 0, 0}, 0); #define v_zero vec_splat_s16(0) #define v_and(a, b) vec_and((a), (b)) #define v_xor(a, b) vec_xor((a), (b)) #define v_shift_left(a) vec_sld((a), v_zero, 2) #elif defined __aarch64__ typedef int16x8_t VECTOR_SHORT; const uint16x8_t neon_mask = 
{0x0003, 0x000c, 0x0030, 0x00c0, 0x0300, 0x0c00, 0x3000, 0xc000}; #define v_init(a,b,c,d,e,f,g,h) (const VECTOR_SHORT){a,b,c,d,e,f,g,h} #define v_load(a) vld1q_s16((const int16_t *)(a)) #define v_store(a, b) vst1q_s16((int16_t *)(a), (b)) #define v_merge_lo_16(a, b) vzip1q_s16((a),(b)) #define v_merge_hi_16(a, b) vzip2q_s16((a),(b)) #define v_merge_lo_32(a, b) vreinterpretq_s16_s32(vzip1q_s32(vreinterpretq_s32_s16(a), vreinterpretq_s32_s16(b))) #define v_merge_hi_32(a, b) vreinterpretq_s16_s32(vzip2q_s32(vreinterpretq_s32_s16(a), vreinterpretq_s32_s16(b))) #define v_merge_lo_64(a, b) vreinterpretq_s16_s64(vcombine_s64(vget_low_s64(vreinterpretq_s64_s16(a)), vget_low_s64(vreinterpretq_s64_s16(b)))) #define v_merge_hi_64(a, b) vreinterpretq_s16_s64(vcombine_s64(vget_high_s64(vreinterpretq_s64_s16(a)), vget_high_s64(vreinterpretq_s64_s16(b)))) #define v_add(a, b) vqaddq_s16((a), (b)) #define v_sub(a, b) vqsubq_s16((a), (b)) #define v_sub_unsigned(a, b) vreinterpretq_s16_u16(vqsubq_u16(vreinterpretq_u16_s16(a), vreinterpretq_u16_s16(b))) #define v_max(a, b) vmaxq_s16((a), (b)) #define v_min(a, b) vminq_s16((a), (b)) #define v_dup(a) vdupq_n_s16(a) #define v_zero v_dup(0) #define v_and(a, b) vandq_s16((a), (b)) #define v_xor(a, b) veorq_s16((a), (b)) #define v_shift_left(a) vextq_s16((v_zero), (a), 7) #define v_mask_gt(a, b) vaddvq_u16(vandq_u16((vcgtq_s16((a), (b))), neon_mask)) #elif __x86_64__ typedef __m128i VECTOR_SHORT; #define v_init(a,b,c,d,e,f,g,h) _mm_set_epi16(h,g,f,e,d,c,b,a) #define v_load(a) _mm_load_si128((VECTOR_SHORT *)(a)) #define v_store(a, b) _mm_store_si128((VECTOR_SHORT *)(a), (b)) #define v_merge_lo_16(a, b) _mm_unpacklo_epi16((a),(b)) #define v_merge_hi_16(a, b) _mm_unpackhi_epi16((a),(b)) #define v_merge_lo_32(a, b) _mm_unpacklo_epi32((a),(b)) #define v_merge_hi_32(a, b) _mm_unpackhi_epi32((a),(b)) #define v_merge_lo_64(a, b) _mm_unpacklo_epi64((a),(b)) #define v_merge_hi_64(a, b) _mm_unpackhi_epi64((a),(b)) #define v_add(a, b) 
_mm_adds_epi16((a), (b)) #define v_sub(a, b) _mm_subs_epi16((a), (b)) #define v_sub_unsigned(a, b) _mm_subs_epu16((a), (b)) #define v_max(a, b) _mm_max_epi16((a), (b)) #define v_min(a, b) _mm_min_epi16((a), (b)) #define v_dup(a) _mm_set1_epi16(a) #define v_zero v_dup(0) #define v_and(a, b) _mm_and_si128((a), (b)) #define v_xor(a, b) _mm_xor_si128((a), (b)) #define v_shift_left(a) _mm_slli_si128((a), 2) #define v_mask_gt(a, b) _mm_movemask_epi8(_mm_cmpgt_epi16((a), (b))) #else #error Unknown Architecture #endif struct s16info_s { VECTOR_SHORT matrix[32]; VECTOR_SHORT * hearray; VECTOR_SHORT * dprofile; VECTOR_SHORT ** qtable; unsigned short * dir; char * qseq; uint64_t diralloc; char * cigar; char * cigarend; int64_t cigaralloc; int opcount; char op; int qlen; int maxdlen; CELL penalty_gap_open_query_left; CELL penalty_gap_open_target_left; CELL penalty_gap_open_query_interior; CELL penalty_gap_open_target_interior; CELL penalty_gap_open_query_right; CELL penalty_gap_open_target_right; CELL penalty_gap_extension_query_left; CELL penalty_gap_extension_target_left; CELL penalty_gap_extension_query_interior; CELL penalty_gap_extension_target_interior; CELL penalty_gap_extension_query_right; CELL penalty_gap_extension_target_right; }; void _mm_print(VECTOR_SHORT x) { auto * y = (unsigned short*)&x; for (int i=0; i<8; i++) { printf("%s%6d", (i>0?" ":""), y[7-i]); } } void _mm_print2(VECTOR_SHORT x) { auto * y = (signed short*)&x; for (int i=0; i<8; i++) { printf("%s%2d", (i>0?" ":""), y[7-i]); } } void dprofile_dump16(CELL * dprofile) { char * s = sym_nt_4bit; printf("\ndprofile:\n"); for(int i=0; i<16; i++) { printf("%c: ",s[i]); for(int k=0; kH initially (must go up) (4th pri) in DIR[2..3] if E>max(H,F) (must go left) (3rd pri) in DIR[4..5] if new F>H (must extend up) (2nd pri) in DIR[6..7] if new E>H (must extend left) (1st pri) no bits set: go diagonally */ /* On PPC the fifth parameter is a vector for the result in the lower 64 bits. 
On x86_64 the fifth parameter is the address to write the result to. */ #ifdef __PPC__ /* Handle differences between GNU and IBM compilers */ #ifdef __IBMCPP__ #define VECTORBYTEPERMUTE vec_bperm #else #define VECTORBYTEPERMUTE vec_vbpermq #endif /* The VSX vec_bperm instruction puts the 16 selected bits of the first source into bits 48-63 of the destination. */ const __vector unsigned char perm = { 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 0 }; #define ALIGNCORE(H, N, F, V, RES, QR_q, R_q, QR_t, R_t, H_MIN, H_MAX) \ { \ __vector unsigned short W, X, Y, Z; \ __vector unsigned int WX, YZ; \ __vector short VV; \ VV = v_load(&V); \ H = v_add(H, VV); \ W = (__vector unsigned short) VECTORBYTEPERMUTE \ ((__vector unsigned char) vec_cmpgt(F, H), perm); \ H = v_max(H, F); \ X = (__vector unsigned short) VECTORBYTEPERMUTE \ ((__vector unsigned char) vec_cmpgt(E, H), perm); \ H = v_max(H, E); \ H_MIN = v_min(H_MIN, H); \ H_MAX = v_max(H_MAX, H); \ N = H; \ HF = v_sub(H, QR_t); \ F = v_sub(F, R_t); \ Y = (__vector unsigned short) VECTORBYTEPERMUTE \ ((__vector unsigned char) vec_cmpgt(F, HF), perm); \ F = v_max(F, HF); \ HE = v_sub(H, QR_q); \ E = v_sub(E, R_q); \ Z = (__vector unsigned short) VECTORBYTEPERMUTE \ ((__vector unsigned char) vec_cmpgt(E, HE), perm); \ E = v_max(E, HE); \ WX = (__vector unsigned int) vec_mergel(W, X); \ YZ = (__vector unsigned int) vec_mergel(Y, Z); \ RES = (__vector unsigned long long) vec_mergeh(WX, YZ); \ } #else /* x86_64 & aarch64 */ #define ALIGNCORE(H, N, F, V, PATH, QR_q, R_q, QR_t, R_t, H_MIN, H_MAX) \ H = v_add(H, V); \ *(PATH+0) = v_mask_gt(F, H); \ H = v_max(H, F); \ *(PATH+1) = v_mask_gt(E, H); \ H = v_max(H, E); \ H_MIN = v_min(H_MIN, H); \ H_MAX = v_max(H_MAX, H); \ N = H; \ HF = v_sub(H, QR_t); \ F = v_sub(F, R_t); \ *(PATH+2) = v_mask_gt(F, HF); \ F = v_max(F, HF); \ HE = v_sub(H, QR_q); \ E = v_sub(E, R_q); \ *(PATH+3) = v_mask_gt(E, HE); \ E = v_max(E, HE); #endif void aligncolumns_first(VECTOR_SHORT * Sm, 
VECTOR_SHORT * hep, VECTOR_SHORT ** qp, VECTOR_SHORT QR_q_i, VECTOR_SHORT R_q_i, VECTOR_SHORT QR_q_r, VECTOR_SHORT R_q_r, VECTOR_SHORT QR_t_0, VECTOR_SHORT R_t_0, VECTOR_SHORT QR_t_1, VECTOR_SHORT R_t_1, VECTOR_SHORT QR_t_2, VECTOR_SHORT R_t_2, VECTOR_SHORT QR_t_3, VECTOR_SHORT R_t_3, VECTOR_SHORT h0, VECTOR_SHORT h1, VECTOR_SHORT h2, VECTOR_SHORT h3, VECTOR_SHORT f0, VECTOR_SHORT f1, VECTOR_SHORT f2, VECTOR_SHORT f3, VECTOR_SHORT * _h_min, VECTOR_SHORT * _h_max, VECTOR_SHORT Mm, VECTOR_SHORT M_QR_t_left, VECTOR_SHORT M_R_t_left, VECTOR_SHORT M_QR_q_interior, VECTOR_SHORT M_QR_q_right, int64_t ql, unsigned short * dir) { VECTOR_SHORT h4, h5, h6, h7, h8, E, HE, HF; VECTOR_SHORT * vp; VECTOR_SHORT h_min = v_zero; VECTOR_SHORT h_max = v_zero; #ifdef __PPC__ __vector unsigned long long RES1, RES2, RES; #endif int64_t i; f0 = v_sub(f0, QR_t_0); f1 = v_sub(f1, QR_t_1); f2 = v_sub(f2, QR_t_2); f3 = v_sub(f3, QR_t_3); for(i=0; i < ql - 1; i++) { vp = qp[i+0]; h4 = hep[2*i+0]; E = hep[2*i+1]; /* Initialize selected h and e values for next/this round. First zero those cells where a new sequence starts by using an unsigned saturated subtraction of a huge value to set it to zero. Then use signed subtraction to obtain the correct value. 
*/ h4 = v_sub_unsigned(h4, Mm); h4 = v_sub(h4, M_QR_t_left); E = v_sub_unsigned(E, Mm); E = v_sub(E, M_QR_t_left); E = v_sub(E, M_QR_q_interior); M_QR_t_left = v_add(M_QR_t_left, M_R_t_left); #ifdef __PPC__ ALIGNCORE(h0, h5, f0, vp[0], RES1, QR_q_i, R_q_i, QR_t_0, R_t_0, h_min, h_max); ALIGNCORE(h1, h6, f1, vp[1], RES2, QR_q_i, R_q_i, QR_t_1, R_t_1, h_min, h_max); RES = vec_perm(RES1, RES2, perm_merge_long_low); v_store((dir + 16*i + 0), RES); ALIGNCORE(h2, h7, f2, vp[2], RES1, QR_q_i, R_q_i, QR_t_2, R_t_2, h_min, h_max); ALIGNCORE(h3, h8, f3, vp[3], RES2, QR_q_i, R_q_i, QR_t_3, R_t_3, h_min, h_max); RES = vec_perm(RES1, RES2, perm_merge_long_low); v_store((dir + 16*i + 8), RES); #else ALIGNCORE(h0, h5, f0, vp[0], dir+16*i+0, QR_q_i, R_q_i, QR_t_0, R_t_0, h_min, h_max); ALIGNCORE(h1, h6, f1, vp[1], dir+16*i+4, QR_q_i, R_q_i, QR_t_1, R_t_1, h_min, h_max); ALIGNCORE(h2, h7, f2, vp[2], dir+16*i+8, QR_q_i, R_q_i, QR_t_2, R_t_2, h_min, h_max); ALIGNCORE(h3, h8, f3, vp[3], dir+16*i+12, QR_q_i, R_q_i, QR_t_3, R_t_3, h_min, h_max); #endif hep[2*i+0] = h8; hep[2*i+1] = E; h0 = h4; h1 = h5; h2 = h6; h3 = h7; } /* the final round - using query gap penalties for right end */ vp = qp[i+0]; E = hep[2*i+1]; E = v_sub_unsigned(E, Mm); E = v_sub(E, M_QR_t_left); E = v_sub(E, M_QR_q_right); #ifdef __PPC__ ALIGNCORE(h0, h5, f0, vp[0], RES1, QR_q_r, R_q_r, QR_t_0, R_t_0, h_min, h_max); ALIGNCORE(h1, h6, f1, vp[1], RES2, QR_q_r, R_q_r, QR_t_1, R_t_1, h_min, h_max); RES = vec_perm(RES1, RES2, perm_merge_long_low); v_store((dir + 16*i + 0), RES); ALIGNCORE(h2, h7, f2, vp[2], RES1, QR_q_r, R_q_r, QR_t_2, R_t_2, h_min, h_max); ALIGNCORE(h3, h8, f3, vp[3], RES2, QR_q_r, R_q_r, QR_t_3, R_t_3, h_min, h_max); RES = vec_perm(RES1, RES2, perm_merge_long_low); v_store((dir + 16*i + 8), RES); #else ALIGNCORE(h0, h5, f0, vp[0], dir+16*i+ 0, QR_q_r, R_q_r, QR_t_0, R_t_0, h_min, h_max); ALIGNCORE(h1, h6, f1, vp[1], dir+16*i+ 4, QR_q_r, R_q_r, QR_t_1, R_t_1, h_min, h_max); ALIGNCORE(h2, h7, f2, vp[2], 
dir+16*i+ 8, QR_q_r, R_q_r, QR_t_2, R_t_2, h_min, h_max); ALIGNCORE(h3, h8, f3, vp[3], dir+16*i+12, QR_q_r, R_q_r, QR_t_3, R_t_3, h_min, h_max); #endif hep[2*i+0] = h8; hep[2*i+1] = E; Sm[0] = h5; Sm[1] = h6; Sm[2] = h7; Sm[3] = h8; *_h_min = h_min; *_h_max = h_max; } void aligncolumns_rest(VECTOR_SHORT * Sm, VECTOR_SHORT * hep, VECTOR_SHORT ** qp, VECTOR_SHORT QR_q_i, VECTOR_SHORT R_q_i, VECTOR_SHORT QR_q_r, VECTOR_SHORT R_q_r, VECTOR_SHORT QR_t_0, VECTOR_SHORT R_t_0, VECTOR_SHORT QR_t_1, VECTOR_SHORT R_t_1, VECTOR_SHORT QR_t_2, VECTOR_SHORT R_t_2, VECTOR_SHORT QR_t_3, VECTOR_SHORT R_t_3, VECTOR_SHORT h0, VECTOR_SHORT h1, VECTOR_SHORT h2, VECTOR_SHORT h3, VECTOR_SHORT f0, VECTOR_SHORT f1, VECTOR_SHORT f2, VECTOR_SHORT f3, VECTOR_SHORT * _h_min, VECTOR_SHORT * _h_max, int64_t ql, unsigned short * dir) { VECTOR_SHORT h4, h5, h6, h7, h8, E, HE, HF; VECTOR_SHORT * vp; VECTOR_SHORT h_min = v_zero; VECTOR_SHORT h_max = v_zero; #ifdef __PPC__ __vector unsigned long long RES1, RES2, RES; #endif int64_t i; f0 = v_sub(f0, QR_t_0); f1 = v_sub(f1, QR_t_1); f2 = v_sub(f2, QR_t_2); f3 = v_sub(f3, QR_t_3); for(i=0; i < ql - 1; i++) { vp = qp[i+0]; h4 = hep[2*i+0]; E = hep[2*i+1]; #ifdef __PPC__ ALIGNCORE(h0, h5, f0, vp[0], RES1, QR_q_i, R_q_i, QR_t_0, R_t_0, h_min, h_max); ALIGNCORE(h1, h6, f1, vp[1], RES2, QR_q_i, R_q_i, QR_t_1, R_t_1, h_min, h_max); RES = vec_perm(RES1, RES2, perm_merge_long_low); v_store((dir + 16*i + 0), RES); ALIGNCORE(h2, h7, f2, vp[2], RES1, QR_q_i, R_q_i, QR_t_2, R_t_2, h_min, h_max); ALIGNCORE(h3, h8, f3, vp[3], RES2, QR_q_i, R_q_i, QR_t_3, R_t_3, h_min, h_max); RES = vec_perm(RES1, RES2, perm_merge_long_low); v_store((dir + 16*i + 8), RES); #else ALIGNCORE(h0, h5, f0, vp[0], dir+16*i+ 0, QR_q_i, R_q_i, QR_t_0, R_t_0, h_min, h_max); ALIGNCORE(h1, h6, f1, vp[1], dir+16*i+ 4, QR_q_i, R_q_i, QR_t_1, R_t_1, h_min, h_max); ALIGNCORE(h2, h7, f2, vp[2], dir+16*i+ 8, QR_q_i, R_q_i, QR_t_2, R_t_2, h_min, h_max); ALIGNCORE(h3, h8, f3, vp[3], dir+16*i+12, QR_q_i, 
R_q_i, QR_t_3, R_t_3, h_min, h_max); #endif hep[2*i+0] = h8; hep[2*i+1] = E; h0 = h4; h1 = h5; h2 = h6; h3 = h7; } /* the final round - using query gap penalties for right end */ vp = qp[i+0]; E = hep[2*i+1]; #ifdef __PPC__ ALIGNCORE(h0, h5, f0, vp[0], RES1, QR_q_r, R_q_r, QR_t_0, R_t_0, h_min, h_max); ALIGNCORE(h1, h6, f1, vp[1], RES2, QR_q_r, R_q_r, QR_t_1, R_t_1, h_min, h_max); RES = vec_perm(RES1, RES2, perm_merge_long_low); v_store((dir + 16*i + 0), RES); ALIGNCORE(h2, h7, f2, vp[2], RES1, QR_q_r, R_q_r, QR_t_2, R_t_2, h_min, h_max); ALIGNCORE(h3, h8, f3, vp[3], RES2, QR_q_r, R_q_r, QR_t_3, R_t_3, h_min, h_max); RES = vec_perm(RES1, RES2, perm_merge_long_low); v_store((dir + 16*i + 8), RES); #else ALIGNCORE(h0, h5, f0, vp[0], dir+16*i+ 0, QR_q_r, R_q_r, QR_t_0, R_t_0, h_min, h_max); ALIGNCORE(h1, h6, f1, vp[1], dir+16*i+ 4, QR_q_r, R_q_r, QR_t_1, R_t_1, h_min, h_max); ALIGNCORE(h2, h7, f2, vp[2], dir+16*i+ 8, QR_q_r, R_q_r, QR_t_2, R_t_2, h_min, h_max); ALIGNCORE(h3, h8, f3, vp[3], dir+16*i+12, QR_q_r, R_q_r, QR_t_3, R_t_3, h_min, h_max); #endif hep[2*i+0] = h8; hep[2*i+1] = E; Sm[0] = h5; Sm[1] = h6; Sm[2] = h7; Sm[3] = h8; *_h_min = h_min; *_h_max = h_max; } inline void pushop(s16info_s * s, char newop) { if (newop == s->op) { s->opcount++; } else { *--s->cigarend = s->op; if (s->opcount > 1) { char buf[11]; int len = sprintf(buf, "%d", s->opcount); s->cigarend -= len; memcpy(s->cigarend, buf, len); } s->op = newop; s->opcount = 1; } } inline void finishop(s16info_s * s) { if (s->op && s->opcount) { *--s->cigarend = s->op; if (s->opcount > 1) { char buf[11]; int len = sprintf(buf, "%d", s->opcount); s->cigarend -= len; memcpy(s->cigarend, buf, len); } s->op = 0; s->opcount = 0; } } void backtrack16(s16info_s * s, char * dseq, uint64_t dlen, uint64_t offset, uint64_t channel, unsigned short * paligned, unsigned short * pmatches, unsigned short * pmismatches, unsigned short * pgaps) { unsigned short * dirbuffer = s->dir; uint64_t dirbuffersize = s->qlen * 
s->maxdlen * 4;
  uint64_t qlen = s->qlen;
  char * qseq = s->qseq;

  uint64_t maskup      = 3ULL << (2*channel+ 0);
  uint64_t maskleft    = 3ULL << (2*channel+16);
  uint64_t maskextup   = 3ULL << (2*channel+32);
  uint64_t maskextleft = 3ULL << (2*channel+48);

#if 0

  printf("Dumping backtracking array\n");

  for(uint64_t i=0; i<qlen; i++)
    {
      for(uint64_t j=0; j<dlen; j++)
        {
          uint64_t d = *((uint64_t *) (dirbuffer +
                                       (offset + 16*s->qlen*(j/4) +
                                        16*i + 4*(j&3)) % dirbuffersize));
          if (d & maskup)
            {
              if (d & maskleft)
                printf("+");
              else
                printf("^");
            }
          else if (d & maskleft)
            {
              printf("<");
            }
          else
            {
              printf("\\");
            }
        }
      printf("\n");
    }

  printf("Dumping gap extension array\n");

  for(uint64_t i=0; i<qlen; i++)
    {
      for(uint64_t j=0; j<dlen; j++)
        {
          uint64_t d = *((uint64_t *) (dirbuffer +
                                       (offset + 16*s->qlen*(j/4) +
                                        16*i + 4*(j&3)) % dirbuffersize));
          if (d & maskextup)
            {
              if (d & maskextleft)
                printf("+");
              else
                printf("^");
            }
          else if (d & maskextleft)
            {
              printf("<");
            }
          else
            {
              printf("\\");
            }
        }
      printf("\n");
    }

#endif

  unsigned short aligned = 0;
  unsigned short matches = 0;
  unsigned short mismatches = 0;
  unsigned short gaps = 0;

  int64_t i = qlen - 1;
  int64_t j = dlen - 1;

  s->cigarend = s->cigar + s->qlen + s->maxdlen + 1;
  s->op = 0;
  s->opcount = 1;

  while ((i>=0) && (j>=0))
    {
      aligned++;

      uint64_t d = *((uint64_t *) (dirbuffer +
                                   (offset + 16*s->qlen*(j/4) +
                                    16*i + 4*(j&3)) % dirbuffersize));

      if ((s->op == 'I') && (d & maskextleft))
        {
          j--;
          pushop(s, 'I');
        }
      else if ((s->op == 'D') && (d & maskextup))
        {
          i--;
          pushop(s, 'D');
        }
      else if (d & maskleft)
        {
          if (s->op != 'I')
            {
              gaps++;
            }
          j--;
          pushop(s, 'I');
        }
      else if (d & maskup)
        {
          if (s->op != 'D')
            {
              gaps++;
            }
          i--;
          pushop(s, 'D');
        }
      else
        {
          if (chrmap_4bit[(int)(qseq[i])] & chrmap_4bit[(int)(dseq[j])])
            {
              matches++;
            }
          else
            {
              mismatches++;
            }
          i--;
          j--;
          pushop(s, 'M');
        }
    }

  while(i>=0)
    {
      aligned++;
      if (s->op != 'D')
        {
          gaps++;
        }
      i--;
      pushop(s, 'D');
    }

  while(j>=0)
    {
      aligned++;
      if (s->op != 'I')
        {
          gaps++;
        }
      j--;
      pushop(s, 'I');
    }

  finishop(s);

  /* move cigar to beginning of allocated memory area */
  int cigarlen = s->cigar + s->qlen + s->maxdlen - s->cigarend;
  memmove(s->cigar, s->cigarend, cigarlen + 1);

  * paligned = aligned;
  * pmatches = matches;
  * pmismatches = mismatches;
  * pgaps =
gaps; } struct s16info_s * search16_init(CELL score_match, CELL score_mismatch, CELL penalty_gap_open_query_left, CELL penalty_gap_open_target_left, CELL penalty_gap_open_query_interior, CELL penalty_gap_open_target_interior, CELL penalty_gap_open_query_right, CELL penalty_gap_open_target_right, CELL penalty_gap_extension_query_left, CELL penalty_gap_extension_target_left, CELL penalty_gap_extension_query_interior, CELL penalty_gap_extension_target_interior, CELL penalty_gap_extension_query_right, CELL penalty_gap_extension_target_right) { (void) score_match; (void) score_mismatch; /* prepare alloc of qtable, dprofile, hearray, dir */ auto * s = (struct s16info_s *) xmalloc(sizeof(struct s16info_s)); s->dprofile = (VECTOR_SHORT *) xmalloc(2*4*8*16); s->qlen = 0; s->qseq = nullptr; s->maxdlen = 0; s->dir = nullptr; s->diralloc = 0; s->hearray = nullptr; s->qtable = nullptr; s->cigar = nullptr; s->cigarend = nullptr; s->cigaralloc = 0; for(int i=0; i<16; i++) { for(int j=0; j<16; j++) { CELL value; if (ambiguous_4bit[i] || ambiguous_4bit[j]) { value = 0; } else if (i == j) { value = opt_match; } else { value = opt_mismatch; } ((CELL*)(&s->matrix))[16*i+j] = value; scorematrix[i][j] = value; } } s->penalty_gap_open_query_left = penalty_gap_open_query_left; s->penalty_gap_open_query_interior = penalty_gap_open_query_interior; s->penalty_gap_open_query_right = penalty_gap_open_query_right; s->penalty_gap_open_target_left = penalty_gap_open_target_left; s->penalty_gap_open_target_interior = penalty_gap_open_target_interior; s->penalty_gap_open_target_right = penalty_gap_open_target_right; s->penalty_gap_extension_query_left = penalty_gap_extension_query_left; s->penalty_gap_extension_query_interior = penalty_gap_extension_query_interior; s->penalty_gap_extension_query_right = penalty_gap_extension_query_right; s->penalty_gap_extension_target_left = penalty_gap_extension_target_left; s->penalty_gap_extension_target_interior = penalty_gap_extension_target_interior; 
s->penalty_gap_extension_target_right = penalty_gap_extension_target_right; return s; } void search16_exit(s16info_s * s) { /* free mem for dprofile, hearray, dir, qtable */ if (s->dir) { xfree(s->dir); } if (s->hearray) { xfree(s->hearray); } if (s->dprofile) { xfree(s->dprofile); } if (s->qtable) { xfree(s->qtable); } if (s->cigar) { xfree(s->cigar); } xfree(s); } void search16_qprep(s16info_s * s, char * qseq, int qlen) { s->qlen = qlen; s->qseq = qseq; if (s->hearray) { xfree(s->hearray); } s->hearray = (VECTOR_SHORT *) xmalloc(2 * s->qlen * sizeof(VECTOR_SHORT)); memset(s->hearray, 0, 2 * s->qlen * sizeof(VECTOR_SHORT)); if (s->qtable) { xfree(s->qtable); } s->qtable = (VECTOR_SHORT **) xmalloc(s->qlen * sizeof(VECTOR_SHORT*)); for(int i = 0; i < qlen; i++) { s->qtable[i] = s->dprofile + 4 * chrmap_4bit[(int)(qseq[i])]; } } void search16(s16info_s * s, unsigned int sequences, unsigned int * seqnos, CELL * pscores, unsigned short * paligned, unsigned short * pmatches, unsigned short * pmismatches, unsigned short * pgaps, char ** pcigar) { CELL ** q_start = (CELL**) s->qtable; CELL * dprofile = (CELL*) s->dprofile; CELL * hearray = (CELL*) s->hearray; uint64_t qlen = s->qlen; if (qlen == 0) { for (unsigned int cand_id = 0; cand_id < sequences; cand_id++) { unsigned int seqno = seqnos[cand_id]; int64_t length = db_getsequencelen(seqno); paligned[cand_id] = length; pmatches[cand_id] = 0; pmismatches[cand_id] = 0; pgaps[cand_id] = length; if (length == 0) { pscores[cand_id] = 0; } else { pscores[cand_id] = MAX(- s->penalty_gap_open_target_left - length * s->penalty_gap_extension_target_left, - s->penalty_gap_open_target_right - length * s->penalty_gap_extension_target_right); } char * cigar = nullptr; if (length > 0) { int ret = xsprintf(&cigar, "%ldI", length); if ((ret < 2) || !cigar) { fatal("Unable to allocate enough memory."); } } else { cigar = (char *) xmalloc(1); cigar[0] = 0; } pcigar[cand_id] = cigar; } return; } /* find longest target sequence and 
reallocate direction buffer */ uint64_t maxdlen = 0; for(int64_t i = 0; i < sequences; i++) { uint64_t dlen = db_getsequencelen(seqnos[i]); /* skip the very long sequences */ if ((int64_t)(s->qlen) * dlen <= MAXSEQLENPRODUCT) { if (dlen > maxdlen) { maxdlen = dlen; } } } maxdlen = 4 * ((maxdlen + 3) / 4); s->maxdlen = maxdlen; uint64_t dirbuffersize = s->qlen * s->maxdlen * 4; if (dirbuffersize > s->diralloc) { s->diralloc = dirbuffersize; if (s->dir) { xfree(s->dir); } s->dir = (unsigned short*) xmalloc(dirbuffersize * sizeof(unsigned short)); } unsigned short * dirbuffer = s->dir; if (s->qlen + s->maxdlen + 1 > s->cigaralloc) { s->cigaralloc = s->qlen + s->maxdlen + 1; if (s->cigar) { xfree(s->cigar); } s->cigar = (char *) xmalloc(s->cigaralloc); } VECTOR_SHORT M, T0; VECTOR_SHORT M_QR_target_left, M_R_target_left; VECTOR_SHORT M_QR_query_interior; VECTOR_SHORT M_QR_query_right; VECTOR_SHORT R_query_left; VECTOR_SHORT QR_query_interior, R_query_interior; VECTOR_SHORT QR_query_right, R_query_right; VECTOR_SHORT QR_target_left, R_target_left; VECTOR_SHORT QR_target_interior, R_target_interior; VECTOR_SHORT QR_target_right, R_target_right; VECTOR_SHORT QR_target[4], R_target[4]; VECTOR_SHORT *hep, **qp; BYTE * d_begin[CHANNELS]; BYTE * d_end[CHANNELS]; uint64_t d_offset[CHANNELS]; BYTE * d_address[CHANNELS]; uint64_t d_length[CHANNELS]; int64_t seq_id[CHANNELS]; bool overflow[CHANNELS]; VECTOR_SHORT dseqalloc[CDEPTH]; VECTOR_SHORT S[4]; BYTE * dseq = (BYTE*) & dseqalloc; BYTE zero = 0; uint64_t next_id = 0; uint64_t done = 0; T0 = v_init(-1, 0, 0, 0, 0, 0, 0, 0); R_query_left = v_dup(s->penalty_gap_extension_query_left); QR_query_interior = v_dup((s->penalty_gap_open_query_interior + s->penalty_gap_extension_query_interior)); R_query_interior = v_dup(s->penalty_gap_extension_query_interior); QR_query_right = v_dup((s->penalty_gap_open_query_right + s->penalty_gap_extension_query_right)); R_query_right = v_dup(s->penalty_gap_extension_query_right); QR_target_left = 
v_dup((s->penalty_gap_open_target_left + s->penalty_gap_extension_target_left)); R_target_left = v_dup(s->penalty_gap_extension_target_left); QR_target_interior = v_dup((s->penalty_gap_open_target_interior + s->penalty_gap_extension_target_interior)); R_target_interior = v_dup(s->penalty_gap_extension_target_interior); QR_target_right = v_dup((s->penalty_gap_open_target_right + s->penalty_gap_extension_target_right)); R_target_right = v_dup(s->penalty_gap_extension_target_right); hep = (VECTOR_SHORT*) hearray; qp = (VECTOR_SHORT**) q_start; for (int c=0; cpenalty_gap_open_query_left + s->penalty_gap_extension_query_left); gap_penalty_max = MAX(gap_penalty_max, s->penalty_gap_open_query_interior + s->penalty_gap_extension_query_interior); gap_penalty_max = MAX(gap_penalty_max, s->penalty_gap_open_query_right + s->penalty_gap_extension_query_right); gap_penalty_max = MAX(gap_penalty_max, s->penalty_gap_open_target_left + s->penalty_gap_extension_target_left); gap_penalty_max = MAX(gap_penalty_max, s->penalty_gap_open_target_interior + s->penalty_gap_extension_target_interior); gap_penalty_max = MAX(gap_penalty_max, s->penalty_gap_open_target_right + s->penalty_gap_extension_target_right); short score_min = SHRT_MIN + gap_penalty_max; short score_max = SHRT_MAX; for(int i=0; i<4; i++) { S[i] = v_zero; dseqalloc[i] = v_zero; } VECTOR_SHORT H0 = v_zero; VECTOR_SHORT H1 = v_zero; VECTOR_SHORT H2 = v_zero; VECTOR_SHORT H3 = v_zero; VECTOR_SHORT F0 = v_zero; VECTOR_SHORT F1 = v_zero; VECTOR_SHORT F2 = v_zero; VECTOR_SHORT F3 = v_zero; int easy = 0; unsigned short * dir = dirbuffer; while(true) { if (easy) { /* fill all channels with symbols from the database sequences */ for(int c=0; cmatrix, dseq); /* create vectors of gap penalties for target depending on whether any of the database sequences ended in these four columns */ if (easy) { for(unsigned int j=0; j= ((d_length[c]+3) % 4))) { MM = v_xor(MM, TT); } TT = v_shift_left(TT); } QR_target[j] = v_add(QR_target_interior, 
v_and(QR_diff, MM)); R_target[j] = v_add(R_target_interior, v_and(R_diff, MM)); } } VECTOR_SHORT h_min, h_max; aligncolumns_rest(S, hep, qp, QR_query_interior, R_query_interior, QR_query_right, R_query_right, QR_target[0], R_target[0], QR_target[1], R_target[1], QR_target[2], R_target[2], QR_target[3], R_target[3], H0, H1, H2, H3, F0, F1, F2, F3, & h_min, & h_max, qlen, dir); VECTOR_SHORT h_min_vector; VECTOR_SHORT h_max_vector; v_store(& h_min_vector, h_min); v_store(& h_max_vector, h_max); for(int c=0; c= score_max)) { overflow[c] = true; } } } } else { /* One or more sequences ended in the previous block. We have to switch over to a new sequence */ easy = 1; M = v_zero; VECTOR_SHORT T = T0; for (int c=0; c= 0) { /* save score */ char * dbseq = (char*) d_address[c]; int64_t dbseqlen = d_length[c]; int64_t z = (dbseqlen+3) % 4; int64_t score = ((CELL*)S)[z*CHANNELS+c]; if (overflow[c]) { pscores[cand_id] = SHRT_MAX; paligned[cand_id] = 0; pmatches[cand_id] = 0; pmismatches[cand_id] = 0; pgaps[cand_id] = 0; pcigar[cand_id] = xstrdup(""); } else { pscores[cand_id] = score; backtrack16(s, dbseq, dbseqlen, d_offset[c], c, paligned + cand_id, pmatches + cand_id, pmismatches + cand_id, pgaps + cand_id); pcigar[cand_id] = (char *) xmalloc(strlen(s->cigar)+1); strcpy(pcigar[cand_id], s->cigar); } done++; } /* get next sequence of reasonable length */ int64_t length = 0; while ((length == 0) && (next_id < sequences)) { cand_id = next_id++; length = db_getsequencelen(seqnos[cand_id]); if ((length==0) || (s->qlen * length > MAXSEQLENPRODUCT)) { pscores[cand_id] = SHRT_MAX; paligned[cand_id] = 0; pmatches[cand_id] = 0; pmismatches[cand_id] = 0; pgaps[cand_id] = 0; pcigar[cand_id] = xstrdup(""); length = 0; done++; } } if (length > 0) { seq_id[c] = cand_id; char * address = db_getsequence(seqnos[cand_id]); d_address[c] = (BYTE*) address; d_length[c] = length; d_begin[c] = (unsigned char*) address; d_end[c] = (unsigned char*) address + length; d_offset[c] = dir - dirbuffer; 
overflow[c] = false; ((CELL*)&H0)[c] = 0; ((CELL*)&H1)[c] = - s->penalty_gap_open_query_left - 1*s->penalty_gap_extension_query_left; ((CELL*)&H2)[c] = - s->penalty_gap_open_query_left - 2*s->penalty_gap_extension_query_left; ((CELL*)&H3)[c] = - s->penalty_gap_open_query_left - 3*s->penalty_gap_extension_query_left; ((CELL*)&F0)[c] = - s->penalty_gap_open_query_left - 1*s->penalty_gap_extension_query_left; ((CELL*)&F1)[c] = - s->penalty_gap_open_query_left - 2*s->penalty_gap_extension_query_left; ((CELL*)&F2)[c] = - s->penalty_gap_open_query_left - 3*s->penalty_gap_extension_query_left; ((CELL*)&F3)[c] = - s->penalty_gap_open_query_left - 4*s->penalty_gap_extension_query_left; /* fill channel */ for(int j=0; jmatrix, dseq); /* create vectors of gap penalties for target depending on whether any of the database sequences ended in these four columns */ if (easy) { for(unsigned int j=0; j= ((d_length[c]+3) % 4))) { MM = v_xor(MM, TT); } TT = v_shift_left(TT); } QR_target[j] = v_add(QR_target_interior, v_and(QR_diff, MM)); R_target[j] = v_add(R_target_interior, v_and(R_diff, MM)); } } VECTOR_SHORT h_min, h_max; aligncolumns_first(S, hep, qp, QR_query_interior, R_query_interior, QR_query_right, R_query_right, QR_target[0], R_target[0], QR_target[1], R_target[1], QR_target[2], R_target[2], QR_target[3], R_target[3], H0, H1, H2, H3, F0, F1, F2, F3, & h_min, & h_max, M, M_QR_target_left, M_R_target_left, M_QR_query_interior, M_QR_query_right, qlen, dir); VECTOR_SHORT h_min_vector; VECTOR_SHORT h_max_vector; v_store(& h_min_vector, h_min); v_store(& h_max_vector, h_max); for(int c=0; c= score_max)) { overflow[c] = true; } } } } H0 = v_sub(H3, R_query_left); H1 = v_sub(H0, R_query_left); H2 = v_sub(H1, R_query_left); H3 = v_sub(H2, R_query_left); F0 = v_sub(F3, R_query_left); F1 = v_sub(F0, R_query_left); F2 = v_sub(F1, R_query_left); F3 = v_sub(F2, R_query_left); dir += 4 * 4 * s->qlen; if (dir >= dirbuffer + dirbuffersize) { dir -= dirbuffersize; } } } 
vsearch-2.21.1/src/searchcore.cc

/*

  VSEARCH: a versatile open source tool for metagenomics

  Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
  All rights reserved.

  Contact: Torbjorn Rognes, Department of Informatics, University of Oslo,
  PO Box 1080 Blindern, NO-0316 Oslo, Norway

  This software is dual-licensed and available under a choice
  of one of two licenses, either under the terms of the GNU
  General Public License version 3 or the BSD 2-Clause License.


  GNU General Public License version 3

  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program. If not, see .


  The BSD 2-Clause License

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
  notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
  ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
  GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
  INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
  IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
  OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
  IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

*/

#include "vsearch.h"

/* per thread data */

inline int hit_compare_byid_typed(struct hit * x, struct hit * y)
{
  // high id, then low id
  // early target, then late target

  if (x->rejected < y->rejected)
    {
      return -1;
    }
  else if (x->rejected > y->rejected)
    {
      return +1;
    }
  else if (x->rejected == 1)
    {
      return 0;
    }
  else if (x->aligned > y->aligned)
    {
      return -1;
    }
  else if (x->aligned < y->aligned)
    {
      return +1;
    }
  else if (x->aligned == 0)
    {
      return 0;
    }
  else if (x->id > y->id)
    {
      return -1;
    }
  else if (x->id < y->id)
    {
      return +1;
    }
  else if (x->target < y->target)
    {
      return -1;
    }
  else if (x->target > y->target)
    {
      return +1;
    }
  else
    {
      return 0;
    }
}

inline int hit_compare_bysize_typed(struct hit * x, struct hit * y)
{
  // high abundance, then low abundance
  // high id, then low id
  // early target, then late target

  if (x->rejected < y->rejected)
    {
      return -1;
    }
  else if (x->rejected > y->rejected)
    {
      return +1;
    }
  else if (x->rejected == 1)
    {
      return 0;
    }
  else if (x->aligned > y->aligned)
    {
      return -1;
    }
  else if (x->aligned < y->aligned)
    {
      return +1;
    }
  else if (x->aligned == 0)
    {
      return 0;
    }
  else if (db_getabundance(x->target) > db_getabundance(y->target))
    {
      return -1;
    }
  else if (db_getabundance(x->target) < db_getabundance(y->target))
    {
      return +1;
    }
  else if (x->id > y->id)
    {
      return -1;
    }
  else if (x->id < y->id)
    {
      return +1;
    }
  else if (x->target < y->target)
    {
      return -1;
    }
  else if (x->target > y->target)
    {
      return +1;
    }
  else
    {
      return 0;
    }
}

int hit_compare_byid(const void * a, const
void * b)
{
  return hit_compare_byid_typed((struct hit *) a, (struct hit *) b);
}

int hit_compare_bysize(const void * a, const void * b)
{
  return hit_compare_bysize_typed((struct hit *) a, (struct hit *) b);
}

bool search_enough_kmers(struct searchinfo_s * si,
                         unsigned int count)
{
  return (count >= opt_minwordmatches) || (count >= si->kmersamplecount);
}

void search_topscores(struct searchinfo_s * si)
{
  /*
    Count the kmer hits in each database sequence and
    make a sorted list of a given number (th)
    of the database sequences with the highest number of matching kmers.
    These are stored in the min heap array.
  */

  /* count kmer hits in the database sequences */
  int indexed_count = dbindex_getcount();

  /* zero counts */
  memset(si->kmers, 0, indexed_count * sizeof(count_t));

  minheap_empty(si->m);

  for(unsigned int i=0; i<si->kmersamplecount; i++)
    {
      unsigned int kmer = si->kmersample[i];
      unsigned char * bitmap = dbindex_getbitmap(kmer);

      if (bitmap)
        {
#ifdef __x86_64__
          if (ssse3_present)
            {
              increment_counters_from_bitmap_ssse3(si->kmers,
                                                   bitmap,
                                                   indexed_count);
            }
          else
            {
              increment_counters_from_bitmap_sse2(si->kmers,
                                                  bitmap,
                                                  indexed_count);
            }
#else
          increment_counters_from_bitmap(si->kmers, bitmap, indexed_count);
#endif
        }
      else
        {
          unsigned int * list = dbindex_getmatchlist(kmer);
          unsigned int count = dbindex_getmatchcount(kmer);
          for(unsigned int j=0; j < count; j++)
            {
              si->kmers[list[j]]++;
            }
        }
    }

  int minmatches = MIN(opt_minwordmatches, si->kmersamplecount);

  for(int i=0; i < indexed_count; i++)
    {
      count_t count = si->kmers[i];
      if (count >= minmatches)
        {
          unsigned int seqno = dbindex_getmapping(i);
          unsigned int length = db_getsequencelen(seqno);

          elem_t novel;
          novel.count = count;
          novel.seqno = seqno;
          novel.length = length;

          minheap_add(si->m, & novel);
        }
    }

  minheap_sort(si->m);
}

int seqncmp(char * a, char * b, uint64_t n)
{
  for(unsigned int i = 0; i<n; i++)
    {
      /* compare symbols by their 4-bit nucleotide codes */
      int x = chrmap_4bit[(int)(a[i])];
      int y = chrmap_4bit[(int)(b[i])];
      if (x < y)
        {
          return -1;
        }
      else if (x > y)
        {
          return +1;
        }
    }
  return 0;
}

void align_trim(struct hit * hit)
{
  /* trim alignment and fill in info */
  /* assumes that the hit has been aligned
*/ /* info for semi-global alignment (without gaps at ends) */ hit->trim_aln_left = 0; hit->trim_q_left = 0; hit->trim_t_left = 0; hit->trim_aln_right = 0; hit->trim_q_right = 0; hit->trim_t_right = 0; /* left trim alignment */ char * p = hit->nwalignment; char op; int64_t run; if (*p) { run = 1; int scanlength = 0; sscanf(p, "%" PRId64 "%n", &run, &scanlength); op = *(p+scanlength); if (op != 'M') { hit->trim_aln_left = 1 + scanlength; if (op == 'D') { hit->trim_q_left = run; } else { hit->trim_t_left = run; } } } /* right trim alignment */ char * e = hit->nwalignment + strlen(hit->nwalignment); if (e > hit->nwalignment) { p = e - 1; op = *p; if (op != 'M') { while ((p > hit->nwalignment) && (*(p-1) <= '9')) { p--; } run = 1; sscanf(p, "%" PRId64, &run); hit->trim_aln_right = e - p; if (op == 'D') { hit->trim_q_right = run; } else { hit->trim_t_right = run; } } } if (hit->trim_q_left >= hit->nwalignmentlength) { hit->trim_q_right = 0; } if (hit->trim_t_left >= hit->nwalignmentlength) { hit->trim_t_right = 0; } hit->internal_alignmentlength = hit->nwalignmentlength - hit->trim_q_left - hit->trim_t_left - hit->trim_q_right - hit->trim_t_right; hit->internal_indels = hit->nwindels - hit->trim_q_left - hit->trim_t_left - hit->trim_q_right - hit->trim_t_right; hit->internal_gaps = hit->nwgaps - ((hit->trim_q_left + hit->trim_t_left) > 0 ? 1 : 0) - ((hit->trim_q_right + hit->trim_t_right) > 0 ? 1 : 0); /* CD-HIT */ hit->id0 = hit->shortest > 0 ? 100.0 * hit->matches / hit->shortest : 0.0; /* all diffs */ hit->id1 = hit->nwalignmentlength > 0 ? 100.0 * hit->matches / hit->nwalignmentlength : 0.0; /* internal diffs */ hit->id2 = hit->internal_alignmentlength > 0 ? 100.0 * hit->matches / hit->internal_alignmentlength : 0.0; /* Marine Biology Lab */ hit->id3 = MAX(0.0, 100.0 * (1.0 - (1.0 * (hit->mismatches + hit->nwgaps) / hit->longest))); /* BLAST */ hit->id4 = hit->nwalignmentlength > 0 ? 
100.0 * hit->matches / hit->nwalignmentlength : 0.0;

  switch (opt_iddef)
    {
    case 0:
      hit->id = hit->id0;
      break;
    case 1:
      hit->id = hit->id1;
      break;
    case 2:
      hit->id = hit->id2;
      break;
    case 3:
      hit->id = hit->id3;
      break;
    case 4:
      hit->id = hit->id4;
      break;
    }
}

int search_acceptable_unaligned(struct searchinfo_s * si,
                                int target)
{
  /* consider whether a hit satisfies the accept criteria before alignment */

  char * qseq = si->qsequence;
  char * dlabel = db_getheader(target);
  char * dseq = db_getsequence(target);
  int64_t dseqlen = db_getsequencelen(target);
  int64_t tsize = db_getabundance(target);

  if (
      /* maxqsize */
      (si->qsize <= opt_maxqsize)
      &&
      /* mintsize */
      (tsize >= opt_mintsize)
      &&
      /* minsizeratio */
      (si->qsize >= opt_minsizeratio * tsize)
      &&
      /* maxsizeratio */
      (si->qsize <= opt_maxsizeratio * tsize)
      &&
      /* minqt */
      (si->qseqlen >= opt_minqt * dseqlen)
      &&
      /* maxqt */
      (si->qseqlen <= opt_maxqt * dseqlen)
      &&
      /* minsl */
      (si->qseqlen < dseqlen ?
       si->qseqlen >= opt_minsl * dseqlen :
       dseqlen >= opt_minsl * si->qseqlen)
      &&
      /* maxsl */
      (si->qseqlen < dseqlen ?
si->qseqlen <= opt_maxsl * dseqlen : dseqlen <= opt_maxsl * si->qseqlen) && /* idprefix */ ((si->qseqlen >= opt_idprefix) && (dseqlen >= opt_idprefix) && (!seqncmp(qseq, dseq, opt_idprefix))) && /* idsuffix */ ((si->qseqlen >= opt_idsuffix) && (dseqlen >= opt_idsuffix) && (!seqncmp(qseq+si->qseqlen-opt_idsuffix, dseq+dseqlen-opt_idsuffix, opt_idsuffix))) && /* self */ ((!opt_self) || (strcmp(si->query_head, dlabel))) && /* selfid */ ((!opt_selfid) || (si->qseqlen != dseqlen) || (seqncmp(qseq, dseq, si->qseqlen))) ) { /* needs further consideration */ return 1; } else { /* reject */ return 0; } } int search_acceptable_aligned(struct searchinfo_s * si, struct hit * hit) { if (/* weak_id */ (hit->id >= 100.0 * opt_weak_id) && /* maxsubs */ (hit->mismatches <= opt_maxsubs) && /* maxgaps */ (hit->internal_gaps <= opt_maxgaps) && /* mincols */ (hit->internal_alignmentlength >= opt_mincols) && /* leftjust */ ((!opt_leftjust) || (hit->trim_q_left + hit->trim_t_left == 0)) && /* rightjust */ ((!opt_rightjust) || (hit->trim_q_right + hit->trim_t_right == 0)) && /* query_cov */ (hit->matches + hit->mismatches >= opt_query_cov * si->qseqlen) && /* target_cov */ (hit->matches + hit->mismatches >= opt_target_cov * db_getsequencelen(hit->target)) && /* maxid */ (hit->id <= 100.0 * opt_maxid) && /* mid */ (100.0 * hit->matches / (hit->matches + hit->mismatches) >= opt_mid) && /* maxdiffs */ (hit->mismatches + hit->internal_indels <= opt_maxdiffs)) { if (opt_cluster_unoise) { int d = hit->mismatches; double skew = 1.0 * si->qsize / db_getabundance(hit->target); double beta = 1.0 / pow(2, 1.0 * opt_unoise_alpha * d + 1); if (skew <= beta || d == 0) { /* accepted */ hit->accepted = true; hit->weak = false; return 1; } else { /* rejected, but weak hit */ hit->rejected = true; hit->weak = true; return 0; } } else { if (hit->id >= 100.0 * opt_id) { /* accepted */ hit->accepted = true; hit->weak = false; return 1; } else { /* rejected, but weak hit */ hit->rejected = true; hit->weak = 
true; return 0; } } } else { /* rejected */ hit->rejected = true; hit->weak = false; return 0; } } void align_delayed(struct searchinfo_s * si) { /* compute global alignment */ unsigned int target_list[MAXDELAYED]; CELL nwscore_list[MAXDELAYED]; unsigned short nwalignmentlength_list[MAXDELAYED]; unsigned short nwmatches_list[MAXDELAYED]; unsigned short nwmismatches_list[MAXDELAYED]; unsigned short nwgaps_list[MAXDELAYED]; char * nwcigar_list[MAXDELAYED]; int target_count = 0; for(int x = si->finalized; x < si->hit_count; x++) { struct hit * hit = si->hits + x; if (! hit->rejected) { target_list[target_count++] = hit->target; } } if (target_count) { search16(si->s, target_count, target_list, nwscore_list, nwalignmentlength_list, nwmatches_list, nwmismatches_list, nwgaps_list, nwcigar_list); } int i = 0; for(int x = si->finalized; x < si->hit_count; x++) { /* maxrejects or maxaccepts reached - ignore remaining hits */ if ((si->rejects < opt_maxrejects) && (si->accepts < opt_maxaccepts)) { struct hit * hit = si->hits + x; if (hit->rejected) { si->rejects++; } else { int64_t target = hit->target; int64_t nwscore = nwscore_list[i]; char * nwcigar; int64_t nwalignmentlength; int64_t nwmatches; int64_t nwmismatches; int64_t nwgaps; int64_t dseqlen = db_getsequencelen(target); if (nwscore == SHRT_MAX) { /* In case the SIMD aligner cannot align, perform a new alignment with the linear memory aligner */ char * dseq = db_getsequence(target); if (nwcigar_list[i]) { xfree(nwcigar_list[i]); } nwcigar = xstrdup(si->lma->align(si->qsequence, dseq, si->qseqlen, dseqlen)); si->lma->alignstats(nwcigar, si->qsequence, dseq, & nwscore, & nwalignmentlength, & nwmatches, & nwmismatches, & nwgaps); } else { nwalignmentlength = nwalignmentlength_list[i]; nwmatches = nwmatches_list[i]; nwmismatches = nwmismatches_list[i]; nwgaps = nwgaps_list[i]; nwcigar = nwcigar_list[i]; } hit->aligned = true; hit->shortest = MIN(si->qseqlen, dseqlen); hit->longest = MAX(si->qseqlen, dseqlen); 
hit->nwalignment = nwcigar; hit->nwscore = nwscore; hit->nwdiff = nwalignmentlength - nwmatches; hit->nwgaps = nwgaps; hit->nwindels = nwalignmentlength - nwmatches - nwmismatches; hit->nwalignmentlength = nwalignmentlength; hit->nwid = 100.0 * (nwalignmentlength - hit->nwdiff) / nwalignmentlength; hit->matches = nwalignmentlength - hit->nwdiff; hit->mismatches = hit->nwdiff - hit->nwindels; /* trim alignment and compute numbers excluding terminal gaps */ align_trim(hit); /* test accept/reject criteria after alignment */ if (search_acceptable_aligned(si, hit)) { si->accepts++; } else { si->rejects++; } i++; } } } /* free ignored alignments */ while (i < target_count) { xfree(nwcigar_list[i++]); } si->finalized = si->hit_count; } void search_onequery(struct searchinfo_s * si, int seqmask) { si->hit_count = 0; search16_qprep(si->s, si->qsequence, si->qseqlen); si->lma = new LinearMemoryAligner; int64_t * scorematrix = si->lma->scorematrix_create(opt_match, opt_mismatch); si->lma->set_parameters(scorematrix, opt_gap_open_query_left, opt_gap_open_target_left, opt_gap_open_query_interior, opt_gap_open_target_interior, opt_gap_open_query_right, opt_gap_open_target_right, opt_gap_extension_query_left, opt_gap_extension_target_left, opt_gap_extension_query_interior, opt_gap_extension_target_interior, opt_gap_extension_query_right, opt_gap_extension_target_right); /* extract unique kmer samples from query*/ unique_count(si->uh, opt_wordlength, si->qseqlen, si->qsequence, & si->kmersamplecount, & si->kmersample, seqmask); /* find database sequences with the most kmer hits */ search_topscores(si); /* analyse targets with the highest number of kmer hits */ si->accepts = 0; si->rejects = 0; si->finalized = 0; int delayed = 0; int t = 0; while ((si->finalized + delayed < opt_maxaccepts + opt_maxrejects - 1) && (si->rejects < opt_maxrejects) && (si->accepts < opt_maxaccepts) && (!minheap_isempty(si->m))) { elem_t e = minheap_poplast(si->m); struct hit * hit = si->hits + 
si->hit_count; hit->target = e.seqno; hit->count = e.count; hit->strand = si->strand; hit->rejected = false; hit->accepted = false; hit->aligned = false; hit->weak = false; hit->nwalignment = nullptr; /* Test some accept/reject criteria before alignment */ if (search_acceptable_unaligned(si, e.seqno)) { delayed++; } else { hit->rejected = true; } si->hit_count++; if (delayed == MAXDELAYED) { align_delayed(si); delayed = 0; } t++; } if (delayed > 0) { align_delayed(si); } delete si->lma; xfree(scorematrix); } struct hit * search_findbest2_byid(struct searchinfo_s * si_p, struct searchinfo_s * si_m) { struct hit * best = nullptr; for(int i=0; i < si_p->hit_count; i++) { if ((!best) || (hit_compare_byid_typed(si_p->hits + i, best) < 0)) { best = si_p->hits + i; } } if (opt_strand>1) { for(int i=0; i < si_m->hit_count; i++) { if ((!best) || (hit_compare_byid_typed(si_m->hits + i, best) < 0)) { best = si_m->hits + i; } } } if (best && ! best->accepted) { best = nullptr; } return best; } struct hit * search_findbest2_bysize(struct searchinfo_s * si_p, struct searchinfo_s * si_m) { struct hit * best = nullptr; for(int i=0; i < si_p->hit_count; i++) { if ((!best) || (hit_compare_bysize_typed(si_p->hits + i, best) < 0)) { best = si_p->hits + i; } } if (opt_strand>1) { for(int i=0; i < si_m->hit_count; i++) { if ((!best) || (hit_compare_bysize_typed(si_m->hits + i, best) < 0)) { best = si_m->hits + i; } } } if (best && ! best->accepted) { best = nullptr; } return best; } void search_joinhits(struct searchinfo_s * si_p, struct searchinfo_s * si_m, struct hit * * hitsp, int * hit_count) { /* join and sort accepted hits from both strands */ /* remove and unallocate unaccepted hits */ int a = 0; for (int s = 0; s < opt_strand; s++) { struct searchinfo_s * si = s ? 
si_m : si_p;
      for(int i=0; i<si->hit_count; i++)
        {
          if (si->hits[i].accepted)
            {
              a++;
            }
        }
    }

  auto * hits = (struct hit *) xmalloc(a * sizeof(struct hit));

  a = 0;

  for (int s = 0; s < opt_strand; s++)
    {
      struct searchinfo_s * si = s ? si_m : si_p;
      for(int i=0; i<si->hit_count; i++)
        {
          struct hit * h = si->hits + i;
          if (h->accepted)
            {
              hits[a++] = *h;
            }
          else if (h->aligned)
            {
              xfree(h->nwalignment);
            }
        }
    }

  qsort(hits, a, sizeof(struct hit), hit_compare_byid);

  *hitsp = hits;
  *hit_count = a;
}

vsearch-2.21.1/src/cut.cc

/*

  VSEARCH: a versatile open source tool for metagenomics

  Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
  All rights reserved.

  Contact: Torbjorn Rognes, Department of Informatics, University of Oslo,
  PO Box 1080 Blindern, NO-0316 Oslo, Norway

  This software is dual-licensed and available under a choice
  of one of two licenses, either under the terms of the GNU
  General Public License version 3 or the BSD 2-Clause License.


  GNU General Public License version 3

  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program. If not, see .


  The BSD 2-Clause License

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
  notice, this list of conditions and the following disclaimer.

  2.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" static uint64_t fragment_no = 0; static uint64_t fragment_rev_no = 0; static uint64_t fragment_discarded_no = 0; static uint64_t fragment_discarded_rev_no = 0; int cut_one(fastx_handle h, FILE * fp_fastaout, FILE * fp_fastaout_discarded, FILE * fp_fastaout_rev, FILE * fp_fastaout_discarded_rev, char * pattern, int pattern_length, int cut_fwd, int cut_rev) { char * seq = fasta_get_sequence(h); int seq_length = fasta_get_sequence_length(h); /* get reverse complement */ char * rc = (char *) xmalloc(seq_length + 1); reverse_complement(rc, seq, seq_length); int frag_start = 0; int frag_length = seq_length; int matches = 0; int rc_start = seq_length; int rc_length = 0; for(int i = 0; i < seq_length - pattern_length + 1; i++) { bool match = true; for(int j = 0; j < pattern_length; j++) { if ((chrmap_4bit[(unsigned char)(pattern[j])] & chrmap_4bit[(unsigned char)(seq[i+j])]) == 0) { match = false; break; } } if (match) { matches++; frag_length = i + cut_fwd - frag_start; rc_length = rc_start - (seq_length - 
(i + cut_rev)); rc_start -= rc_length; if (frag_length > 0) { if (opt_fastaout) { fasta_print_general(fp_fastaout, nullptr, fasta_get_sequence(h) + frag_start, frag_length, fasta_get_header(h), fasta_get_header_length(h), fasta_get_abundance(h), ++fragment_no, -1.0, -1, -1, nullptr, 0.0); } } if (rc_length > 0) { if (opt_fastaout_rev) { fasta_print_general(fp_fastaout_rev, nullptr, rc + rc_start, rc_length, fasta_get_header(h), fasta_get_header_length(h), fasta_get_abundance(h), ++fragment_rev_no, -1.0, -1, -1, nullptr, 0.0); } } frag_start += frag_length; } } if (matches > 0) { frag_length = seq_length - frag_start; if (frag_length > 0) { if (opt_fastaout) { fasta_print_general(fp_fastaout, nullptr, fasta_get_sequence(h) + frag_start, frag_length, fasta_get_header(h), fasta_get_header_length(h), fasta_get_abundance(h), ++fragment_no, -1.0, -1, -1, nullptr, 0.0); } } rc_length = rc_start; rc_start = 0; if (rc_length > 0) { if (opt_fastaout_rev) { fasta_print_general(fp_fastaout_rev, nullptr, rc + rc_start, rc_length, fasta_get_header(h), fasta_get_header_length(h), fasta_get_abundance(h), ++fragment_rev_no, -1.0, -1, -1, nullptr, 0.0); } } } else { if (opt_fastaout_discarded) { fasta_print_general(fp_fastaout_discarded, nullptr, fasta_get_sequence(h), seq_length, fasta_get_header(h), fasta_get_header_length(h), fasta_get_abundance(h), ++fragment_discarded_no, -1.0, -1, -1, nullptr, 0.0); } if (opt_fastaout_discarded_rev) { fasta_print_general(fp_fastaout_discarded_rev, nullptr, rc, seq_length, fasta_get_header(h), fasta_get_header_length(h), fasta_get_abundance(h), ++fragment_discarded_rev_no, -1.0, -1, -1, nullptr, 0.0); } } xfree(rc); return matches; } void cut() { if ((!opt_fastaout) && (!opt_fastaout_discarded) && (!opt_fastaout_rev) && (!opt_fastaout_discarded_rev)) { fatal("No output files specified"); } fastx_handle h = nullptr; h = fasta_open(opt_cut); if (!h) { fatal("Unrecognized file type (not proper FASTA format)"); } uint64_t filesize = 
fasta_get_size(h); FILE * fp_fastaout = nullptr; FILE * fp_fastaout_discarded = nullptr; FILE * fp_fastaout_rev = nullptr; FILE * fp_fastaout_discarded_rev = nullptr; if (opt_fastaout) { fp_fastaout = fopen_output(opt_fastaout); if (!fp_fastaout) { fatal("Unable to open FASTA output file for writing"); } } if (opt_fastaout_rev) { fp_fastaout_rev = fopen_output(opt_fastaout_rev); if (!fp_fastaout_rev) { fatal("Unable to open FASTA output file for writing"); } } if (opt_fastaout_discarded) { fp_fastaout_discarded = fopen_output(opt_fastaout_discarded); if (!fp_fastaout_discarded) { fatal("Unable to open FASTA output file for writing"); } } if (opt_fastaout_discarded_rev) { fp_fastaout_discarded_rev = fopen_output(opt_fastaout_discarded_rev); if (!fp_fastaout_discarded_rev) { fatal("Unable to open FASTA output file for writing"); } } char * pattern = opt_cut_pattern; if (pattern == nullptr) { fatal("No cut pattern string specified with --cut_pattern"); } int n = strlen(pattern); if (n == 0) { fatal("Empty cut pattern string"); } int cut_fwd = -1; int cut_rev = -1; int j = 0; /* strip the cut site markers from the pattern in place; j counts the remaining symbols */ for (int i = 0; i < n ; i++) { unsigned char x = pattern[i]; if (x == '^') { /* a forward cut site has already been seen */ if (cut_fwd >= 0) { fatal("Multiple cut sites not supported"); } cut_fwd = j; } else if (x == '_') { /* a reverse cut site has already been seen */ if (cut_rev >= 0) { fatal("Multiple cut sites not supported"); } cut_rev = j; } else if (chrmap_4bit[(unsigned int)x]) { pattern[j++] = x; } else { fatal("Illegal character in cut pattern"); } } if (cut_fwd < 0) { fatal("No forward sequence cut site (^) found in pattern"); } if (cut_rev < 0) { fatal("No reverse sequence cut site (_) found in pattern"); } progress_init("Cutting sequences", filesize); int64_t cut = 0; int64_t uncut = 0; int64_t matches = 0; while(fasta_next(h, false, chrmap_no_change)) { /* n - 2 is the pattern length without the two cut site markers */ int64_t m = cut_one(h, fp_fastaout, fp_fastaout_discarded, fp_fastaout_rev, fp_fastaout_discarded_rev, pattern, n - 2, cut_fwd, cut_rev); matches += m; if (m > 0) { cut++; } else { uncut++; } progress_update(fasta_get_position(h)); }
progress_done(); if (! opt_quiet) { fprintf(stderr, "%" PRId64 " sequence(s) cut %" PRId64 " times, %" PRId64 " sequence(s) never cut.\n", cut, matches, uncut); } if (opt_log) { fprintf(fp_log, "%" PRId64 " sequence(s) cut %" PRId64 " times, %" PRId64 " sequence(s) never cut.\n", cut, matches, uncut); } if (opt_fastaout) { fclose(fp_fastaout); } if (opt_fastaout_rev) { fclose(fp_fastaout_rev); } if (opt_fastaout_discarded) { fclose(fp_fastaout_discarded); } if (opt_fastaout_discarded_rev) { fclose(fp_fastaout_discarded_rev); } fasta_close(h); } vsearch-2.21.1/src/msa.cc0000644000175000017500000002717714171574117014462 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. 
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "vsearch.h" /* Compute consensus sequence and msa of clustered sequences */ typedef uint64_t prof_type; #define PROFSIZE 6 static char * aln; static int alnpos; static prof_type * profile; void msa_add(char c, prof_type abundance) { prof_type * p = profile + PROFSIZE * alnpos; switch(toupper(c)) { case 'A': p[0] += abundance; break; case 'C': p[1] += abundance; break; case 'G': p[2] += abundance; break; case 'T': case 'U': p[3] += abundance; break; case 'R': case 'Y': case 'S': case 'W': case 'K': case 'M': case 'B': case 'D': case 'H': case 'V': case 'N': p[4] += abundance; break; case '-': p[5] += abundance; break; } aln[alnpos++] = c; } void msa(FILE * fp_msaout, FILE * fp_consout, FILE * fp_profile, int cluster, int target_count, struct msa_target_s * target_list, int64_t totalabundance) { int centroid_seqno = target_list[0].seqno; int centroid_len = db_getsequencelen(centroid_seqno); /* find max insertions in front of each position in the centroid sequence */ int * maxi = (int *) xmalloc((centroid_len + 1) * sizeof(int)); memset(maxi, 0, (centroid_len + 1) * sizeof(int)); for(int j=1; j maxi[pos]) { maxi[pos] = run; } break; } } } /* find total alignment length */ int alnlen = 0; for(int i=0; i < centroid_len+1; i++) { alnlen += maxi[i]; } alnlen += centroid_len; /* allocate memory for profile (for consensus) and aligned seq */ profile = (prof_type *) xmalloc(PROFSIZE * sizeof(prof_type) * alnlen); memset(profile, 0, PROFSIZE * sizeof(prof_type) * alnlen); aln = (char *) xmalloc(alnlen+1); char * cons = (char *) xmalloc(alnlen+1); /* Find longest target sequence on reverse strand and allocate buffer */ int64_t longest_reversed = 0; for(int i=0; i < target_count; i++) { if (target_list[i].strand) { int64_t len = db_getsequencelen(target_list[i].seqno); if (len > longest_reversed) { longest_reversed = len; } } } char * rc_buffer = nullptr; if (longest_reversed > 0) { rc_buffer = (char*) xmalloc(longest_reversed + 1); } /* blank line before each msa 
*/ if (fp_msaout) { fprintf(fp_msaout, "\n"); } for(int j=0; j= alnlen - right_censored)) { aln[i] = '+'; } else { /* find most common symbol of A, C, G and T */ char best_sym = 0; prof_type best_count = 0; for(int c=0; c<4; c++) { prof_type count = profile[PROFSIZE*i+c]; if (count > best_count) { best_count = count; best_sym = 1 << c; } } /* if no A, C, G, or T, check if there are any N's */ prof_type n_count = profile[PROFSIZE*i+4]; if ((best_count == 0) && (n_count > 0)) { best_count = n_count; best_sym = 15; // N } /* compare to the number of gap symbols */ prof_type gap_count = profile[PROFSIZE*i+5]; if (best_count >= gap_count) { char sym = sym_nt_4bit[(int)best_sym]; aln[i] = sym; cons[conslen++] = sym; } else { aln[i] = '-'; } } } aln[alnlen] = 0; cons[conslen] = 0; if (fp_msaout) { fasta_print(fp_msaout, "consensus", aln, alnlen); } if (fp_consout) { fasta_print_general(fp_consout, "centroid=", cons, conslen, db_getheader(centroid_seqno), db_getheaderlen(centroid_seqno), totalabundance, cluster+1, -1.0, target_count, opt_clusterout_id ? cluster : -1, nullptr, 0.0); } if (fp_profile) { fasta_print_general(fp_profile, "centroid=", nullptr, 0, db_getheader(centroid_seqno), db_getheaderlen(centroid_seqno), totalabundance, cluster+1, -1.0, target_count, opt_clusterout_id ? cluster : -1, nullptr, 0.0); for (int i=0; i, Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. 
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "vsearch.h" #define HASH CityHash64 struct kh_bucket_s { unsigned int kmer; unsigned int pos; /* 1-based position, 0 = empty */ }; struct kh_handle_s { struct kh_bucket_s * hash; unsigned int hash_mask; int size; int alloc; int maxpos; }; struct kh_handle_s * kh_init() { auto * kh = (struct kh_handle_s *) xmalloc(sizeof(struct kh_handle_s)); kh->maxpos = 0; kh->alloc = 256; kh->size = 0; kh->hash_mask = kh->alloc - 1; kh->hash = (struct kh_bucket_s *) xmalloc(kh->alloc * sizeof(struct kh_bucket_s)); return kh; } void kh_exit(struct kh_handle_s * kh) { if (kh->hash) { xfree(kh->hash); } xfree(kh); } inline void kh_insert_kmer(struct kh_handle_s * kh, int k, unsigned int kmer, unsigned int pos) { /* find free bucket in hash */ unsigned int j = HASH((char*)&kmer, (k+3)/4) & kh->hash_mask; while(kh->hash[j].pos) { j = (j + 1) & kh->hash_mask; } kh->hash[j].kmer = kmer; kh->hash[j].pos = pos; } void kh_insert_kmers(struct kh_handle_s * kh, int k, char * seq, int len) { int kmers = 1 << (2 * k); unsigned int kmer_mask = kmers - 1; /* reallocate hash table if necessary */ if (kh->alloc < 2 * len) { while (kh->alloc < 2 * len) { kh->alloc *= 2; } kh->hash = (struct kh_bucket_s *) xrealloc(kh->hash, kh->alloc * sizeof(struct kh_bucket_s)); } kh->size = 1; while(kh->size < 2 * len) { kh->size *= 2; } kh->hash_mask = kh->size - 1; kh->maxpos = len; memset(kh->hash, 0, kh->size * sizeof(struct kh_bucket_s)); unsigned int bad = kmer_mask; unsigned int kmer = 0; char * s = seq; unsigned int * maskmap = chrmap_mask_ambig; for (int pos = 0; pos < len; pos++) { int c = *s++; bad <<= 2ULL; bad |= maskmap[c]; bad &= kmer_mask; kmer <<= 2ULL; kmer |= chrmap_2bit[c]; kmer &= kmer_mask; if (!bad) { /* 1-based pos of start of kmer */ kh_insert_kmer(kh, k, kmer, pos - k + 1 + 1); } } } int kh_find_best_diagonal(struct kh_handle_s * kh, int k, char * seq, int len) { int diag_counts[kh->maxpos]; memset(diag_counts, 0, kh->maxpos * sizeof(int)); int kmers = 1 << (2 * k); unsigned 
int kmer_mask = kmers - 1; unsigned int bad = kmer_mask; unsigned int kmer = 0; char * s = seq + len - 1; unsigned int * maskmap = chrmap_mask_ambig; for (int pos = 0; pos < len; pos++) { int c = *s--; bad <<= 2ULL; bad |= maskmap[c]; bad &= kmer_mask; kmer <<= 2ULL; kmer |= chrmap_2bit[chrmap_complement[c]]; kmer &= kmer_mask; if (!bad) { /* find matching buckets in hash */ unsigned int j = HASH((char*)&kmer, (k+3)/4) & kh->hash_mask; while(kh->hash[j].pos) { if (kh->hash[j].kmer == kmer) { int fpos = kh->hash[j].pos - 1; int diag = fpos - (pos - k + 1); if (diag >= 0) { diag_counts[diag]++; } } j = (j + 1) & kh->hash_mask; } } } int best_diag_count = -1; int best_diag = -1; int good_diags = 0; for(int d = 0; d < kh->maxpos - k + 1; d++) { int diag_len = kh->maxpos - d; int minmatch = MAX(1, diag_len - k + 1 - k * MAX(diag_len / 20, 0)); int c = diag_counts[d]; if (c >= minmatch) { good_diags++; } if (c > best_diag_count) { best_diag_count = c; best_diag = d; } } if (good_diags == 1) { return best_diag; } else { return -1; } } void kh_find_diagonals(struct kh_handle_s * kh, int k, char * seq, int len, int * diags) { memset(diags, 0, (kh->maxpos+len) * sizeof(int)); int kmers = 1 << (2 * k); unsigned int kmer_mask = kmers - 1; unsigned int bad = kmer_mask; unsigned int kmer = 0; char * s = seq + len - 1; for (int pos = 0; pos < len; pos++) { int c = *s--; bad <<= 2ULL; bad |= chrmap_mask_ambig[c]; bad &= kmer_mask; kmer <<= 2ULL; kmer |= chrmap_2bit[chrmap_complement[c]]; kmer &= kmer_mask; if (!bad) { /* find matching buckets in hash */ unsigned int j = HASH((char*)&kmer, (k+3)/4) & kh->hash_mask; while(kh->hash[j].pos) { if (kh->hash[j].kmer == kmer) { int fpos = kh->hash[j].pos - 1; int diag = len + fpos - (pos - k + 1); if (diag >= 0) { diags[diag]++; } } j = (j + 1) & kh->hash_mask; } } } } vsearch-2.21.1/src/fastqops.h0000644000175000017500000000500114171574117015362 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 
2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void fastq_chars(); void fastq_convert(); void fastq_stats(); void fastx_revcomp(); vsearch-2.21.1/src/align.h0000644000175000017500000000666014171574117014630 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. 
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ struct nwinfo_s; struct nwinfo_s * nw_init(); void nw_exit(struct nwinfo_s * nw); void nw_align(char * dseq, char * dend, char * qseq, char * qend, int64_t * score_matrix, int64_t gapopen_q_left, int64_t gapopen_q_interior, int64_t gapopen_q_right, int64_t gapopen_t_left, int64_t gapopen_t_interior, int64_t gapopen_t_right, int64_t gapextend_q_left, int64_t gapextend_q_interior, int64_t gapextend_q_right, int64_t gapextend_t_left, int64_t gapextend_t_interior, int64_t gapextend_t_right, int64_t * nwscore, int64_t * nwdiff, int64_t * nwgaps, int64_t * nwindels, int64_t * nwalignmentlength, char ** nwalignment, int64_t queryno, int64_t dbseqno, struct nwinfo_s * nw); vsearch-2.21.1/src/dbhash.h0000644000175000017500000000603314171574117014761 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . 
The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ struct dbhash_bucket_s { uint64_t hash; uint64_t seqno; }; struct dbhash_search_info_s { char * seq; uint64_t seqlen; uint64_t hash; uint64_t index; }; void dbhash_open(uint64_t maxelements); void dbhash_close(); void dbhash_add(char * seq, uint64_t seqlen, uint64_t seqno); void dbhash_add_one(uint64_t seqno); void dbhash_add_all(); int64_t dbhash_search_first(char * seq, uint64_t seqlen, struct dbhash_search_info_s * info); int64_t dbhash_search_next(struct dbhash_search_info_s * info); void dbhash_search_finish(struct dbhash_search_info_s * info); vsearch-2.21.1/src/allpairs.h0000644000175000017500000000474614171574117015350 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. 
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void allpairs_global(char * cmdline, char * progheader); vsearch-2.21.1/src/vsearch.cc0000644000175000017500000051764514171574117015341 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. 
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "vsearch.h" /* options */ bool opt_bzip2_decompress; bool opt_clusterout_id; bool opt_clusterout_sort; bool opt_eeout; bool opt_fasta_score; bool opt_fastq_allowmergestagger; bool opt_fastq_eeout; bool opt_fastq_nostagger; bool opt_gzip_decompress; bool opt_label_substr_match; bool opt_no_progress; bool opt_fastq_qout_max; bool opt_quiet; bool opt_relabel_keep; bool opt_relabel_md5; bool opt_relabel_self; bool opt_relabel_sha1; bool opt_samheader; bool opt_sff_clip; bool opt_sizeorder; bool opt_xee; bool opt_xsize; char * opt_allpairs_global; char * opt_alnout; char * opt_biomout; char * opt_blast6out; char * opt_borderline; char * opt_centroids; char * opt_chimeras; char * opt_cluster_fast; char * opt_cluster_size; char * opt_cluster_smallmem; char * opt_cluster_unoise; char * opt_clusters; char * opt_consout; char * opt_cut; char * opt_cut_pattern; char * opt_db; char * opt_dbmatched; char * opt_dbnotmatched; char * opt_derep_fulllength; char * opt_derep_id; char * opt_derep_prefix; char * opt_eetabbedout; char * opt_fasta2fastq; char * opt_fastaout; char * opt_fastaout_discarded; char * opt_fastaout_discarded_rev; char * opt_fastaout_notmerged_fwd; char * opt_fastaout_notmerged_rev; char * opt_fastaout_rev; char * opt_fastapairs; char * opt_fastq_chars; char * opt_fastq_convert; char * opt_fastq_eestats; char * opt_fastq_eestats2; char * opt_fastq_filter; char * opt_fastq_join; char * opt_fastq_mergepairs; char * opt_fastq_stats; char * opt_fastqout; char * opt_fastqout_discarded; char * opt_fastqout_discarded_rev; char * opt_fastqout_notmerged_fwd; char * opt_fastqout_notmerged_rev; char * opt_fastqout_rev; char * opt_fastx_filter; char * opt_fastx_getseq; char * opt_fastx_getseqs; char * opt_fastx_getsubseq; char * opt_fastx_mask; char * opt_fastx_revcomp; char * opt_fastx_subsample; char * opt_fastx_uniques; char * opt_join_padgap; char * opt_join_padgapq; char * opt_label; char * opt_labels; char * opt_label_suffix; char * opt_label_word; char * 
opt_label_words; char * opt_label_field; char * opt_lcaout; char * opt_log; char * opt_makeudb_usearch; char * opt_maskfasta; char * opt_matched; char * opt_mothur_shared_out; char * opt_msaout; char * opt_nonchimeras; char * opt_notmatched; char * opt_notmatchedfq; char * opt_orient; char * opt_otutabout; char * opt_output; char * opt_pattern; char * opt_profile; char * opt_qsegout; char * opt_relabel; char * opt_rereplicate; char * opt_reverse; char * opt_samout; char * opt_sample; char * opt_search_exact; char * opt_sff_convert; char * opt_shuffle; char * opt_sintax; char * opt_sortbylength; char * opt_sortbysize; char * opt_tabbedout; char * opt_tsegout; char * opt_udb2fasta; char * opt_udbinfo; char * opt_udbstats; char * opt_uc; char * opt_uchime_denovo; char * opt_uchime2_denovo; char * opt_uchime3_denovo; char * opt_uchime_ref; char * opt_uchimealns; char * opt_uchimeout; char * opt_usearch_global; char * opt_userout; double * opt_ee_cutoffs_values; double opt_abskew; double opt_dn; double opt_fastq_maxdiffpct; double opt_fastq_maxee; double opt_fastq_maxee_rate; double opt_fastq_truncee; double opt_id; double opt_lca_cutoff; double opt_max_unmasked_pct; double opt_maxid; double opt_maxqt; double opt_maxsizeratio; double opt_maxsl; double opt_mid; double opt_min_unmasked_pct; double opt_mindiv; double opt_minh; double opt_minqt; double opt_minsizeratio; double opt_minsl; double opt_query_cov; double opt_sample_pct; double opt_sintax_cutoff; double opt_target_cov; double opt_unoise_alpha; double opt_weak_id; double opt_xn; int opt_acceptall; int opt_alignwidth; int opt_cons_truncate; int opt_ee_cutoffs_count; int opt_gap_extension_query_interior; int opt_gap_extension_query_left; int opt_gap_extension_query_right; int opt_gap_extension_target_interior; int opt_gap_extension_target_left; int opt_gap_extension_target_right; int opt_gap_open_query_interior; int opt_gap_open_query_left; int opt_gap_open_query_right; int opt_gap_open_target_interior; int 
opt_gap_open_target_left; int opt_gap_open_target_right; int opt_help; int opt_length_cutoffs_shortest; int opt_length_cutoffs_longest; int opt_length_cutoffs_increment; int opt_mindiffs; int opt_slots; int opt_uchimeout5; int opt_usersort; int opt_version; int64_t opt_dbmask; int64_t opt_fasta_width; int64_t opt_fastq_ascii; int64_t opt_fastq_asciiout; int64_t opt_fastq_maxdiffs; int64_t opt_fastq_maxlen; int64_t opt_fastq_maxmergelen; int64_t opt_fastq_maxns; int64_t opt_fastq_minlen; int64_t opt_fastq_minmergelen; int64_t opt_fastq_minovlen; int64_t opt_fastq_qmax; int64_t opt_fastq_qmaxout; int64_t opt_fastq_qmin; int64_t opt_fastq_qminout; int64_t opt_fastq_stripleft; int64_t opt_fastq_stripright; int64_t opt_fastq_tail; int64_t opt_fastq_trunclen; int64_t opt_fastq_trunclen_keep; int64_t opt_fastq_truncqual; int64_t opt_fulldp; int64_t opt_hardmask; int64_t opt_iddef; int64_t opt_idprefix; int64_t opt_idsuffix; int64_t opt_leftjust; int64_t opt_match; int64_t opt_maxaccepts; int64_t opt_maxdiffs; int64_t opt_maxgaps; int64_t opt_maxhits; int64_t opt_maxqsize; int64_t opt_maxrejects; int64_t opt_maxseqlength; int64_t opt_maxsize; int64_t opt_maxsubs; int64_t opt_maxuniquesize; int64_t opt_mincols; int64_t opt_minseqlength; int64_t opt_minsize; int64_t opt_mintsize; int64_t opt_minuniquesize; int64_t opt_minwordmatches; int64_t opt_mismatch; int64_t opt_notrunclabels; int64_t opt_output_no_hits; int64_t opt_qmask; int64_t opt_randseed; int64_t opt_rightjust; int64_t opt_rowlen; int64_t opt_sample_size; int64_t opt_self; int64_t opt_selfid; int64_t opt_sizein; int64_t opt_sizeout; int64_t opt_strand; int64_t opt_subseq_start; int64_t opt_subseq_end; int64_t opt_threads; int64_t opt_top_hits_only; int64_t opt_topn; int64_t opt_uc_allhits; int64_t opt_wordlength; /* Other variables */ /* cpu features available */ int64_t altivec_present = 0; int64_t neon_present = 0; int64_t mmx_present = 0; int64_t sse_present = 0; int64_t sse2_present = 0; int64_t sse3_present = 
0;
int64_t ssse3_present = 0;
int64_t sse41_present = 0;
int64_t sse42_present = 0;
int64_t popcnt_present = 0;
int64_t avx_present = 0;
int64_t avx2_present = 0;

static char * progname;
static char progheader[80];
static char * cmdline;
static time_t time_start;
static time_t time_finish;

FILE * fp_log = nullptr;

char * STDIN_NAME = (char*) "/dev/stdin";
char * STDOUT_NAME = (char*) "/dev/stdout";

#ifdef __x86_64__
#define cpuid(f1, f2, a, b, c, d)                                \
  __asm__ __volatile__ ("cpuid"                                  \
                        : "=a" (a), "=b" (b), "=c" (c), "=d" (d) \
                        : "a" (f1), "c" (f2));
#endif

void cpu_features_detect()
{
#ifdef __aarch64__
#ifdef __ARM_NEON
  /* may check /proc/cpuinfo for asimd or neon */
  neon_present = 1;
#else
#error ARM Neon not present
#endif
#elif __PPC__
  altivec_present = 1;
#elif __x86_64__
  unsigned int a, b, c, d;

  /* leaf 0: highest supported standard cpuid leaf */
  cpuid(0, 0, a, b, c, d);
  unsigned int maxlevel = a & 0xff;

  if (maxlevel >= 1)
    {
      /* leaf 1: feature bits in edx (d) and ecx (c) */
      cpuid(1, 0, a, b, c, d);
      mmx_present    = (d >> 23) & 1;
      sse_present    = (d >> 25) & 1;
      sse2_present   = (d >> 26) & 1;
      sse3_present   = (c >>  0) & 1;
      ssse3_present  = (c >>  9) & 1;
      sse41_present  = (c >> 19) & 1;
      sse42_present  = (c >> 20) & 1;
      popcnt_present = (c >> 23) & 1;
      avx_present    = (c >> 28) & 1;

      if (maxlevel >= 7)
        {
          /* leaf 7, subleaf 0: extended feature bits in ebx (b) */
          cpuid(7, 0, a, b, c, d);
          avx2_present = (b >> 5) & 1;
        }
    }
#else
#error Unknown architecture
#endif
}

void cpu_features_show()
{
  fprintf(stderr, "CPU features:");
  if (neon_present)
    {
      fprintf(stderr, " neon");
    }
  if (altivec_present)
    {
      fprintf(stderr, " altivec");
    }
  if (mmx_present)
    {
      fprintf(stderr, " mmx");
    }
  if (sse_present)
    {
      fprintf(stderr, " sse");
    }
  if (sse2_present)
    {
      fprintf(stderr, " sse2");
    }
  if (sse3_present)
    {
      fprintf(stderr, " sse3");
    }
  if (ssse3_present)
    {
      fprintf(stderr, " ssse3");
    }
  if (sse41_present)
    {
      fprintf(stderr, " sse4.1");
    }
  if (sse42_present)
    {
      fprintf(stderr, " sse4.2");
    }
  if (popcnt_present)
    {
      fprintf(stderr, " popcnt");
    }
  if (avx_present)
    {
      fprintf(stderr, " avx");
    }
  if (avx2_present)
    {
      fprintf(stderr, " avx2");
    }
  fprintf(stderr, "\n");
}

void
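args_get_ee_cutoffs(char * arg)
{
  /* get comma-separated list of floating point numbers */
  /* save in ee_cutoffs_count and ee_cutoffs_values */

  int commas = 0;
  for (size_t i=0; i < strlen(arg); i++)
    {
      if (arg[i] == ',')
        {
          commas++;
        }
    }

  opt_ee_cutoffs_count = 0;
  opt_ee_cutoffs_values = (double*) xrealloc(opt_ee_cutoffs_values,
                                             (commas+1) * sizeof(double));

  char * s = arg;
  while(true)
    {
      double val = 0;
      int skip = 0;

      if ((sscanf(s, "%lf%n", &val, &skip) != 1) || (val <= 0.0))
        {
          fatal("Invalid arguments to ee_cutoffs");
        }

      opt_ee_cutoffs_values[opt_ee_cutoffs_count++] = val;

      s += skip;

      if (*s == ',')
        {
          s++;
        }
      else if (*s == 0)
        {
          break;
        }
      else
        {
          fatal("Invalid arguments to ee_cutoffs");
        }
    }
}

void args_get_length_cutoffs(char * arg)
{
  /* get comma-separated list of 3 integers: */
  /* smallest, largest and increment. */
  /* second value may be * indicating no limit */
  /* save in length_cutoffs_{smallest,largest,increment} */

  int skip = 0;
  if (sscanf(arg, "%d,%d,%d%n",
             &opt_length_cutoffs_shortest,
             &opt_length_cutoffs_longest,
             &opt_length_cutoffs_increment,
             &skip) == 3)
    {
      if ((size_t)skip < strlen(arg))
        {
          fatal("Invalid arguments to length_cutoffs");
        }
    }
  else if (sscanf(arg, "%d,*,%d%n",
                  &opt_length_cutoffs_shortest,
                  &opt_length_cutoffs_increment,
                  &skip) == 2)
    {
      if ((size_t)skip < strlen(arg))
        {
          fatal("Invalid arguments to length_cutoffs");
        }
      opt_length_cutoffs_longest = INT_MAX;
    }
  else
    {
      fatal("Invalid arguments to length_cutoffs");
    }

  if ((opt_length_cutoffs_shortest < 1) ||
      (opt_length_cutoffs_shortest > opt_length_cutoffs_longest) ||
      (opt_length_cutoffs_increment < 1))
    {
      fatal("Invalid arguments to length_cutoffs");
    }
}

void args_get_gap_penalty_string(char * arg, int is_open)
{
  /* See http://www.drive5.com/usearch/manual/aln_params.html

     --gapopen *E/10I/1E/2L/3RQ/4RT/1IQ
     --gapext *E/10I/1E/2L/3RQ/4RT/1IQ

     integer or *
     followed by I, E, L, R, Q or T characters
     separated by /
     * means infinitely high (disallow)
     E=end
     I=interior
     L=left
     R=right
     Q=query
     T=target

     E cannot be combined with L or R

     We do not support floating point values. Therefore,
     all default score and penalties are multiplied by 2.

  */

  char *p = arg;

  while (*p)
    {
      int skip = 0;
      int pen = 0;

      if (sscanf(p, "%d%n", &pen, &skip) == 1)
        {
          p += skip;
        }
      else if (*p == '*')
        {
          pen = 1000;
          p++;
        }
      else
        {
          fatal("Invalid gap penalty argument (%s)", p);
        }

      char * q = p;

      int set_E = 0;
      int set_I = 0;
      int set_L = 0;
      int set_R = 0;
      int set_Q = 0;
      int set_T = 0;

      while((*p) && (*p != '/'))
        {
          switch(*p)
            {
            case 'E':
              set_E = 1;
              break;
            case 'I':
              set_I = 1;
              break;
            case 'L':
              set_L = 1;
              break;
            case 'R':
              set_R = 1;
              break;
            case 'Q':
              set_Q = 1;
              break;
            case 'T':
              set_T = 1;
              break;
            default:
              fatal("Invalid char '%.1s' in gap penalty string", p);
              break;
            }
          p++;
        }

      if (*p == '/')
        {
          p++;
        }

      if (set_E && (set_L || set_R))
        {
          fatal("Invalid gap penalty string (E and L or R) '%s'", q);
        }

      if (set_E)
        {
          set_L = 1;
          set_R = 1;
        }

      /* if neither L, I, R nor E is specified, it applies to all */

      if ((!set_L) && (!set_I) && (!set_R))
        {
          set_L = 1;
          set_I = 1;
          set_R = 1;
        }

      /* if neither Q nor T is specified, it applies to both */

      if ((!set_Q) && (!set_T))
        {
          set_Q = 1;
          set_T = 1;
        }

      if (is_open)
        {
          if (set_Q)
            {
              if (set_L)
                {
                  opt_gap_open_query_left = pen;
                }
              if (set_I)
                {
                  opt_gap_open_query_interior = pen;
                }
              if (set_R)
                {
                  opt_gap_open_query_right = pen;
                }
            }
          if (set_T)
            {
              if (set_L)
                {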
                  opt_gap_open_target_left = pen;
                }
              if (set_I)
                {
                  opt_gap_open_target_interior = pen;
                }
              if (set_R)
                {
                  opt_gap_open_target_right = pen;
                }
            }
        }
      else
        {
          if (set_Q)
            {
              if (set_L)
                {
                  opt_gap_extension_query_left = pen;
                }
              if (set_I)
                {
                  opt_gap_extension_query_interior = pen;
                }
              if (set_R)
                {
                  opt_gap_extension_query_right = pen;
                }
            }
          if (set_T)
            {
              if (set_L)
                {
                  opt_gap_extension_target_left = pen;
                }
              if (set_I)
                {
                  opt_gap_extension_target_interior = pen;
                }
              if (set_R)
                {
                  opt_gap_extension_target_right = pen;
                }
            }
        }
    }
}

int64_t args_getlong(char * arg)
{
  int len = 0;
  int64_t temp = 0;
  int ret = sscanf(arg, "%" PRId64 "%n", &temp, &len);
  /* sscanf returns EOF (not 0) on empty input, so require exactly
     one conversion and no trailing characters */
  if ((ret != 1) || (((unsigned int)(len)) < strlen(arg)))
    {
      fatal("Illegal option argument");
    }
  return temp;
}

double args_getdouble(char * arg)
{
  int len = 0;
  double temp = 0;
  int ret = sscanf(arg, "%lf%n", &temp, &len);
  /* sscanf returns EOF (not 0) on empty input, so require exactly
     one conversion and no trailing characters */
  if ((ret != 1) || (((unsigned int)(len)) < strlen(arg)))
    {
      fatal("Illegal option argument");
    }
  return temp;
}

void args_init(int argc, char **argv)
{
  /* Set defaults */

  progname = argv[0];

  opt_abskew = -1.0;
  opt_acceptall = 0;
  opt_alignwidth = 80;
  opt_allpairs_global = nullptr;
  opt_alnout = nullptr;
  opt_blast6out = nullptr;
  opt_biomout = nullptr;
  opt_borderline = nullptr;
  opt_bzip2_decompress = false;
  opt_centroids = nullptr;
  opt_chimeras = nullptr;
  opt_cluster_fast = nullptr;
  opt_cluster_size = nullptr;
  opt_cluster_smallmem = nullptr;
  opt_cluster_unoise = nullptr;
  opt_clusterout_id = false;
  opt_clusterout_sort = false;
  opt_clusters = nullptr;
  opt_cons_truncate = 0;
  opt_consout = nullptr;
  opt_cut = nullptr;
  opt_cut_pattern = nullptr;
  opt_db = nullptr;
  opt_dbmask = MASK_DUST;
  opt_dbmatched = nullptr;
  opt_dbnotmatched = nullptr;
  opt_derep_fulllength = nullptr;
  opt_derep_id = nullptr;
  opt_derep_prefix = nullptr;
  opt_dn = 1.4;
  opt_ee_cutoffs_count = 3;
  opt_ee_cutoffs_values = (double*) xmalloc(opt_ee_cutoffs_count * sizeof(double));
  opt_ee_cutoffs_values[0] = 0.5;
  opt_ee_cutoffs_values[1] = 1.0;
  opt_ee_cutoffs_values[2] = 2.0;
  opt_eeout = false;
  opt_eetabbedout = nullptr;
  opt_fasta2fastq = nullptr;
  opt_fastaout_notmerged_fwd = nullptr;
  opt_fastaout_notmerged_rev = nullptr;
  opt_fasta_score = false;
  opt_fasta_width = 80;
  opt_fastaout = nullptr;
  opt_fastaout_discarded = nullptr;
  opt_fastaout_discarded_rev = nullptr;
  opt_fastaout_rev = nullptr;
  opt_fastapairs = nullptr;
  opt_fastq_allowmergestagger = false;
  opt_fastq_ascii = 33;
  opt_fastq_asciiout = 33;
  opt_fastq_chars = nullptr;
  opt_fastq_convert = nullptr;
  opt_fastq_eeout = false;
  opt_fastq_eestats = nullptr;
  opt_fastq_eestats2 = nullptr;
  opt_fastq_filter = nullptr;
  opt_fastq_join = nullptr;
  opt_fastq_maxdiffpct = 100.0;
  opt_fastq_maxdiffs = 10;
  opt_fastq_maxee = DBL_MAX;
  opt_fastq_maxee_rate = DBL_MAX;
  opt_fastq_maxlen = LONG_MAX;
  opt_fastq_maxmergelen = 1000000;
  opt_fastq_maxns = LONG_MAX;
  opt_fastq_mergepairs = nullptr;
  opt_fastq_minlen = 1;
  opt_fastq_minmergelen = 0;
  opt_fastq_minovlen = 10;
  opt_fastq_nostagger = true;
  opt_fastqout_notmerged_fwd = nullptr;
  opt_fastqout_notmerged_rev = nullptr;
  opt_fastq_qmax = 41;
  opt_fastq_qmaxout = 41;
  opt_fastq_qmin = 0;
  opt_fastq_qminout = 0;
  opt_fastq_qout_max = false;
  opt_fastq_stats = nullptr;
  opt_fastq_stripleft = 0;
  opt_fastq_stripright = 0;
  opt_fastq_tail = 4;
  opt_fastq_truncee = DBL_MAX;
  opt_fastq_trunclen = -1;
  opt_fastq_trunclen_keep = -1;
  opt_fastq_truncqual = LONG_MIN;
  opt_fastqout = nullptr;
  opt_fastqout_discarded = nullptr;
  opt_fastqout_discarded_rev = nullptr;
  opt_fastqout_rev = nullptr;
  opt_fastx_filter = nullptr;
  opt_fastx_mask = nullptr;
  opt_fastx_revcomp = nullptr;
  opt_fastx_subsample = nullptr;
  opt_fulldp = 0;
  opt_gap_extension_query_interior = 2;
  opt_gap_extension_query_left = 1;
  opt_gap_extension_query_right = 1;
  opt_gap_extension_target_interior = 2;
  opt_gap_extension_target_left = 1;
  opt_gap_extension_target_right = 1;
  opt_gap_open_query_interior = 20;
  opt_gap_open_query_left = 2;
  opt_gap_open_query_right = 2;
  opt_gap_open_target_interior = 20;
  opt_gap_open_target_left = 2;
  opt_gap_open_target_right = 2;
  opt_fastx_getseq = nullptr;
  opt_fastx_getseqs = nullptr;
  opt_fastx_getsubseq = nullptr;
  opt_gzip_decompress = false;
  opt_hardmask = 0;
  opt_help = 0;
  opt_id = -1.0;
  opt_iddef = 2;
  opt_idprefix = 0;
  opt_idsuffix = 0;
  opt_join_padgap = nullptr;
  opt_join_padgapq = nullptr;
  opt_label = nullptr;
  opt_label_substr_match = false;
  opt_label_suffix = nullptr;
  opt_labels = nullptr;
  opt_label_field = nullptr;
  opt_label_word = nullptr;
  opt_label_words = nullptr;
  opt_leftjust = 0;
  opt_length_cutoffs_increment = 50;
  opt_length_cutoffs_longest = INT_MAX;
  opt_length_cutoffs_shortest = 50;
  opt_lca_cutoff = 1.0;
  opt_lcaout = nullptr;
  opt_log = nullptr;
  opt_makeudb_usearch = nullptr;
  opt_maskfasta = nullptr;
  opt_match = 2;
  opt_matched = nullptr;
  opt_max_unmasked_pct = 100.0;
  opt_maxaccepts = 1;
  opt_maxdiffs = INT_MAX;
  opt_maxgaps = INT_MAX;
  opt_maxhits = 0;
  opt_maxid = 1.0;
  opt_maxqsize = INT_MAX;
  opt_maxqt = DBL_MAX;
  opt_maxrejects = -1;
  opt_maxseqlength = 50000;
  opt_maxsize = LONG_MAX;
  opt_maxsizeratio = DBL_MAX;
  opt_maxsl = DBL_MAX;
  opt_maxsubs = INT_MAX;
  opt_maxuniquesize = LONG_MAX;
  opt_mid = 0.0;
  opt_min_unmasked_pct = 0.0;
  opt_mincols = 0;
  opt_mindiffs = 3;
  opt_mindiv = 0.8;
  opt_minh = 0.28;
  opt_minqt = 0.0;
  opt_minseqlength = -1;
  opt_minsize = 0;
  opt_minsizeratio = 0.0;
  opt_minsl = 0.0;
  opt_mintsize = 0;
  opt_minuniquesize = 1;
  opt_minwordmatches = -1;
  opt_mismatch = -4;
  opt_mothur_shared_out = nullptr;
  opt_msaout = nullptr;
  opt_no_progress = false;
  opt_nonchimeras = nullptr;
  opt_notmatched = nullptr;
  opt_notmatchedfq = nullptr;
  opt_notrunclabels = 0;
  opt_orient = nullptr;
  opt_otutabout = nullptr;
  opt_output = nullptr;
  opt_output_no_hits = 0;
  opt_pattern = nullptr;
  opt_profile = nullptr;
  opt_qmask = MASK_DUST;
  opt_qsegout = nullptr;
  opt_query_cov = 0.0;
  opt_quiet = false;
  opt_randseed = 0;
  opt_relabel = nullptr;
  opt_relabel_keep = false;
  opt_relabel_md5 = false;
  opt_relabel_self = false;
  opt_relabel_sha1 = false;
  opt_rereplicate = nullptr;
  opt_reverse = nullptr;
  opt_rightjust = 0;
opt_rowlen = 64; opt_samheader = false; opt_samout = nullptr; opt_sample = nullptr; opt_sample_pct = 0; opt_sample_size = 0; opt_search_exact = nullptr; opt_self = 0; opt_selfid = 0; opt_sff_convert = nullptr; opt_sff_clip = false; opt_shuffle = nullptr; opt_sintax = nullptr; opt_sintax_cutoff = 0.0; opt_sizein = 0; opt_sizeorder = false; opt_sizeout = 0; opt_slots = 0; opt_sortbylength = nullptr; opt_sortbysize = nullptr; opt_strand = 1; opt_subseq_start = 1; opt_subseq_end = LONG_MAX; opt_tabbedout = nullptr; opt_target_cov = 0.0; opt_threads = 0; opt_top_hits_only = 0; opt_topn = LONG_MAX; opt_tsegout = nullptr; opt_udb2fasta = nullptr; opt_udbinfo = nullptr; opt_udbstats = nullptr; opt_uc = nullptr; opt_uc_allhits = 0; opt_uchime_denovo = nullptr; opt_uchime2_denovo = nullptr; opt_uchime3_denovo = nullptr; opt_uchime_ref = nullptr; opt_uchimealns = nullptr; opt_uchimeout = nullptr; opt_uchimeout5 = 0; opt_unoise_alpha = 2.0; opt_usearch_global = nullptr; opt_userout = nullptr; opt_usersort = 0; opt_version = 0; opt_weak_id = 10.0; opt_wordlength = 0; opt_xn = 8.0; opt_xsize = false; opt_xee = false; opterr = 1; enum { option_abskew, option_acceptall, option_alignwidth, option_allpairs_global, option_alnout, option_band, option_biomout, option_blast6out, option_borderline, option_bzip2_decompress, option_centroids, option_chimeras, option_cluster_fast, option_cluster_size, option_cluster_smallmem, option_cluster_unoise, option_clusterout_id, option_clusterout_sort, option_clusters, option_cons_truncate, option_consout, option_cut, option_cut_pattern, option_db, option_dbmask, option_dbmatched, option_dbnotmatched, option_derep_fulllength, option_derep_id, option_derep_prefix, option_dn, option_ee_cutoffs, option_eeout, option_eetabbedout, option_fasta2fastq, option_fasta_score, option_fasta_width, option_fastaout, option_fastaout_discarded, option_fastaout_discarded_rev, option_fastaout_notmerged_fwd, option_fastaout_notmerged_rev, option_fastaout_rev, 
option_fastapairs, option_fastq_allowmergestagger, option_fastq_ascii, option_fastq_asciiout, option_fastq_chars, option_fastq_convert, option_fastq_eeout, option_fastq_eestats, option_fastq_eestats2, option_fastq_filter, option_fastq_join, option_fastq_maxdiffpct, option_fastq_maxdiffs, option_fastq_maxee, option_fastq_maxee_rate, option_fastq_maxlen, option_fastq_maxmergelen, option_fastq_maxns, option_fastq_mergepairs, option_fastq_minlen, option_fastq_minmergelen, option_fastq_minovlen, option_fastq_nostagger, option_fastq_qmax, option_fastq_qmaxout, option_fastq_qmin, option_fastq_qminout, option_fastq_qout_max, option_fastq_stats, option_fastq_stripleft, option_fastq_stripright, option_fastq_tail, option_fastq_truncee, option_fastq_trunclen, option_fastq_trunclen_keep, option_fastq_truncqual, option_fastqout, option_fastqout_discarded, option_fastqout_discarded_rev, option_fastqout_notmerged_fwd, option_fastqout_notmerged_rev, option_fastqout_rev, option_fastx_filter, option_fastx_getseq, option_fastx_getseqs, option_fastx_getsubseq, option_fastx_mask, option_fastx_revcomp, option_fastx_subsample, option_fastx_uniques, option_fulldp, option_gapext, option_gapopen, option_gzip_decompress, option_h, option_hardmask, option_help, option_hspw, option_id, option_iddef, option_idprefix, option_idsuffix, option_join_padgap, option_join_padgapq, option_label, option_label_field, option_label_substr_match, option_label_suffix, option_label_word, option_label_words, option_labels, option_lca_cutoff, option_lcaout, option_leftjust, option_length_cutoffs, option_log, option_makeudb_usearch, option_maskfasta, option_match, option_matched, option_max_unmasked_pct, option_maxaccepts, option_maxdiffs, option_maxgaps, option_maxhits, option_maxid, option_maxqsize, option_maxqt, option_maxrejects, option_maxseqlength, option_maxsize, option_maxsizeratio, option_maxsl, option_maxsubs, option_maxuniquesize, option_mid, option_min_unmasked_pct, option_mincols, option_mindiffs, 
option_mindiv, option_minh, option_minhsp, option_minqt, option_minseqlength, option_minsize, option_minsizeratio, option_minsl, option_mintsize, option_minuniquesize, option_minwordmatches, option_mismatch, option_mothur_shared_out, option_msaout, option_no_progress, option_nonchimeras, option_notmatched, option_notmatchedfq, option_notrunclabels, option_orient, option_otutabout, option_output, option_output_no_hits, option_pattern, option_profile, option_qmask, option_qsegout, option_query_cov, option_quiet, option_randseed, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_rereplicate, option_reverse, option_rightjust, option_rowlen, option_samheader, option_samout, option_sample, option_sample_pct, option_sample_size, option_search_exact, option_self, option_selfid, option_sff_clip, option_sff_convert, option_shuffle, option_sintax, option_sintax_cutoff, option_sizein, option_sizeorder, option_sizeout, option_slots, option_sortbylength, option_sortbysize, option_strand, option_subseq_end, option_subseq_start, option_tabbedout, option_target_cov, option_threads, option_top_hits_only, option_topn, option_tsegout, option_uc, option_uc_allhits, option_uchime2_denovo, option_uchime3_denovo, option_uchime_denovo, option_uchime_ref, option_uchimealns, option_uchimeout, option_uchimeout5, option_udb2fasta, option_udbinfo, option_udbstats, option_unoise_alpha, option_usearch_global, option_userfields, option_userout, option_usersort, option_v, option_version, option_weak_id, option_wordlength, option_xdrop_nw, option_xee, option_xn, option_xsize }; static struct option long_options[] = { {"abskew", required_argument, nullptr, 0 }, {"acceptall", no_argument, nullptr, 0 }, {"alignwidth", required_argument, nullptr, 0 }, {"allpairs_global", required_argument, nullptr, 0 }, {"alnout", required_argument, nullptr, 0 }, {"band", required_argument, nullptr, 0 }, {"biomout", required_argument, nullptr, 0 }, {"blast6out", 
required_argument, nullptr, 0 }, {"borderline", required_argument, nullptr, 0 }, {"bzip2_decompress", no_argument, nullptr, 0 }, {"centroids", required_argument, nullptr, 0 }, {"chimeras", required_argument, nullptr, 0 }, {"cluster_fast", required_argument, nullptr, 0 }, {"cluster_size", required_argument, nullptr, 0 }, {"cluster_smallmem", required_argument, nullptr, 0 }, {"cluster_unoise", required_argument, nullptr, 0 }, {"clusterout_id", no_argument, nullptr, 0 }, {"clusterout_sort", no_argument, nullptr, 0 }, {"clusters", required_argument, nullptr, 0 }, {"cons_truncate", no_argument, nullptr, 0 }, {"consout", required_argument, nullptr, 0 }, {"cut", required_argument, nullptr, 0 }, {"cut_pattern", required_argument, nullptr, 0 }, {"db", required_argument, nullptr, 0 }, {"dbmask", required_argument, nullptr, 0 }, {"dbmatched", required_argument, nullptr, 0 }, {"dbnotmatched", required_argument, nullptr, 0 }, {"derep_fulllength", required_argument, nullptr, 0 }, {"derep_id", required_argument, nullptr, 0 }, {"derep_prefix", required_argument, nullptr, 0 }, {"dn", required_argument, nullptr, 0 }, {"ee_cutoffs", required_argument, nullptr, 0 }, {"eeout", no_argument, nullptr, 0 }, {"eetabbedout", required_argument, nullptr, 0 }, {"fasta2fastq", required_argument, nullptr, 0 }, {"fasta_score", no_argument, nullptr, 0 }, {"fasta_width", required_argument, nullptr, 0 }, {"fastaout", required_argument, nullptr, 0 }, {"fastaout_discarded", required_argument, nullptr, 0 }, {"fastaout_discarded_rev",required_argument, nullptr, 0 }, {"fastaout_notmerged_fwd",required_argument, nullptr, 0 }, {"fastaout_notmerged_rev",required_argument, nullptr, 0 }, {"fastaout_rev", required_argument, nullptr, 0 }, {"fastapairs", required_argument, nullptr, 0 }, {"fastq_allowmergestagger", no_argument, nullptr, 0 }, {"fastq_ascii", required_argument, nullptr, 0 }, {"fastq_asciiout", required_argument, nullptr, 0 }, {"fastq_chars", required_argument, nullptr, 0 }, {"fastq_convert", 
required_argument, nullptr, 0 }, {"fastq_eeout", no_argument, nullptr, 0 }, {"fastq_eestats", required_argument, nullptr, 0 }, {"fastq_eestats2", required_argument, nullptr, 0 }, {"fastq_filter", required_argument, nullptr, 0 }, {"fastq_join", required_argument, nullptr, 0 }, {"fastq_maxdiffpct", required_argument, nullptr, 0 }, {"fastq_maxdiffs", required_argument, nullptr, 0 }, {"fastq_maxee", required_argument, nullptr, 0 }, {"fastq_maxee_rate", required_argument, nullptr, 0 }, {"fastq_maxlen", required_argument, nullptr, 0 }, {"fastq_maxmergelen", required_argument, nullptr, 0 }, {"fastq_maxns", required_argument, nullptr, 0 }, {"fastq_mergepairs", required_argument, nullptr, 0 }, {"fastq_minlen", required_argument, nullptr, 0 }, {"fastq_minmergelen", required_argument, nullptr, 0 }, {"fastq_minovlen", required_argument, nullptr, 0 }, {"fastq_nostagger", no_argument, nullptr, 0 }, {"fastq_qmax", required_argument, nullptr, 0 }, {"fastq_qmaxout", required_argument, nullptr, 0 }, {"fastq_qmin", required_argument, nullptr, 0 }, {"fastq_qminout", required_argument, nullptr, 0 }, {"fastq_qout_max", no_argument, nullptr, 0 }, {"fastq_stats", required_argument, nullptr, 0 }, {"fastq_stripleft", required_argument, nullptr, 0 }, {"fastq_stripright", required_argument, nullptr, 0 }, {"fastq_tail", required_argument, nullptr, 0 }, {"fastq_truncee", required_argument, nullptr, 0 }, {"fastq_trunclen", required_argument, nullptr, 0 }, {"fastq_trunclen_keep", required_argument, nullptr, 0 }, {"fastq_truncqual", required_argument, nullptr, 0 }, {"fastqout", required_argument, nullptr, 0 }, {"fastqout_discarded", required_argument, nullptr, 0 }, {"fastqout_discarded_rev",required_argument, nullptr, 0 }, {"fastqout_notmerged_fwd",required_argument, nullptr, 0 }, {"fastqout_notmerged_rev",required_argument, nullptr, 0 }, {"fastqout_rev", required_argument, nullptr, 0 }, {"fastx_filter", required_argument, nullptr, 0 }, {"fastx_getseq", required_argument, nullptr, 0 }, 
{"fastx_getseqs", required_argument, nullptr, 0 }, {"fastx_getsubseq", required_argument, nullptr, 0 }, {"fastx_mask", required_argument, nullptr, 0 }, {"fastx_revcomp", required_argument, nullptr, 0 }, {"fastx_subsample", required_argument, nullptr, 0 }, {"fastx_uniques", required_argument, nullptr, 0 }, {"fulldp", no_argument, nullptr, 0 }, {"gapext", required_argument, nullptr, 0 }, {"gapopen", required_argument, nullptr, 0 }, {"gzip_decompress", no_argument, nullptr, 0 }, {"h", no_argument, nullptr, 0 }, {"hardmask", no_argument, nullptr, 0 }, {"help", no_argument, nullptr, 0 }, {"hspw", required_argument, nullptr, 0 }, {"id", required_argument, nullptr, 0 }, {"iddef", required_argument, nullptr, 0 }, {"idprefix", required_argument, nullptr, 0 }, {"idsuffix", required_argument, nullptr, 0 }, {"join_padgap", required_argument, nullptr, 0 }, {"join_padgapq", required_argument, nullptr, 0 }, {"label", required_argument, nullptr, 0 }, {"label_field", required_argument, nullptr, 0 }, {"label_substr_match", no_argument, nullptr, 0 }, {"label_suffix", required_argument, nullptr, 0 }, {"label_word", required_argument, nullptr, 0 }, {"label_words", required_argument, nullptr, 0 }, {"labels", required_argument, nullptr, 0 }, {"lca_cutoff", required_argument, nullptr, 0 }, {"lcaout", required_argument, nullptr, 0 }, {"leftjust", no_argument, nullptr, 0 }, {"length_cutoffs", required_argument, nullptr, 0 }, {"log", required_argument, nullptr, 0 }, {"makeudb_usearch", required_argument, nullptr, 0 }, {"maskfasta", required_argument, nullptr, 0 }, {"match", required_argument, nullptr, 0 }, {"matched", required_argument, nullptr, 0 }, {"max_unmasked_pct", required_argument, nullptr, 0 }, {"maxaccepts", required_argument, nullptr, 0 }, {"maxdiffs", required_argument, nullptr, 0 }, {"maxgaps", required_argument, nullptr, 0 }, {"maxhits", required_argument, nullptr, 0 }, {"maxid", required_argument, nullptr, 0 }, {"maxqsize", required_argument, nullptr, 0 }, {"maxqt", 
required_argument, nullptr, 0 }, {"maxrejects", required_argument, nullptr, 0 }, {"maxseqlength", required_argument, nullptr, 0 }, {"maxsize", required_argument, nullptr, 0 }, {"maxsizeratio", required_argument, nullptr, 0 }, {"maxsl", required_argument, nullptr, 0 }, {"maxsubs", required_argument, nullptr, 0 }, {"maxuniquesize", required_argument, nullptr, 0 }, {"mid", required_argument, nullptr, 0 }, {"min_unmasked_pct", required_argument, nullptr, 0 }, {"mincols", required_argument, nullptr, 0 }, {"mindiffs", required_argument, nullptr, 0 }, {"mindiv", required_argument, nullptr, 0 }, {"minh", required_argument, nullptr, 0 }, {"minhsp", required_argument, nullptr, 0 }, {"minqt", required_argument, nullptr, 0 }, {"minseqlength", required_argument, nullptr, 0 }, {"minsize", required_argument, nullptr, 0 }, {"minsizeratio", required_argument, nullptr, 0 }, {"minsl", required_argument, nullptr, 0 }, {"mintsize", required_argument, nullptr, 0 }, {"minuniquesize", required_argument, nullptr, 0 }, {"minwordmatches", required_argument, nullptr, 0 }, {"mismatch", required_argument, nullptr, 0 }, {"mothur_shared_out", required_argument, nullptr, 0 }, {"msaout", required_argument, nullptr, 0 }, {"no_progress", no_argument, nullptr, 0 }, {"nonchimeras", required_argument, nullptr, 0 }, {"notmatched", required_argument, nullptr, 0 }, {"notmatchedfq", required_argument, nullptr, 0 }, {"notrunclabels", no_argument, nullptr, 0 }, {"orient", required_argument, nullptr, 0 }, {"otutabout", required_argument, nullptr, 0 }, {"output", required_argument, nullptr, 0 }, {"output_no_hits", no_argument, nullptr, 0 }, {"pattern", required_argument, nullptr, 0 }, {"profile", required_argument, nullptr, 0 }, {"qmask", required_argument, nullptr, 0 }, {"qsegout", required_argument, nullptr, 0 }, {"query_cov", required_argument, nullptr, 0 }, {"quiet", no_argument, nullptr, 0 }, {"randseed", required_argument, nullptr, 0 }, {"relabel", required_argument, nullptr, 0 }, {"relabel_keep", 
no_argument, nullptr, 0 }, {"relabel_md5", no_argument, nullptr, 0 }, {"relabel_self", no_argument, nullptr, 0 }, {"relabel_sha1", no_argument, nullptr, 0 }, {"rereplicate", required_argument, nullptr, 0 }, {"reverse", required_argument, nullptr, 0 }, {"rightjust", no_argument, nullptr, 0 }, {"rowlen", required_argument, nullptr, 0 }, {"samheader", no_argument, nullptr, 0 }, {"samout", required_argument, nullptr, 0 }, {"sample", required_argument, nullptr, 0 }, {"sample_pct", required_argument, nullptr, 0 }, {"sample_size", required_argument, nullptr, 0 }, {"search_exact", required_argument, nullptr, 0 }, {"self", no_argument, nullptr, 0 }, {"selfid", no_argument, nullptr, 0 }, {"sff_clip", no_argument, nullptr, 0 }, {"sff_convert", required_argument, nullptr, 0 }, {"shuffle", required_argument, nullptr, 0 }, {"sintax", required_argument, nullptr, 0 }, {"sintax_cutoff", required_argument, nullptr, 0 }, {"sizein", no_argument, nullptr, 0 }, {"sizeorder", no_argument, nullptr, 0 }, {"sizeout", no_argument, nullptr, 0 }, {"slots", required_argument, nullptr, 0 }, {"sortbylength", required_argument, nullptr, 0 }, {"sortbysize", required_argument, nullptr, 0 }, {"strand", required_argument, nullptr, 0 }, {"subseq_end", required_argument, nullptr, 0 }, {"subseq_start", required_argument, nullptr, 0 }, {"tabbedout", required_argument, nullptr, 0 }, {"target_cov", required_argument, nullptr, 0 }, {"threads", required_argument, nullptr, 0 }, {"top_hits_only", no_argument, nullptr, 0 }, {"topn", required_argument, nullptr, 0 }, {"tsegout", required_argument, nullptr, 0 }, {"uc", required_argument, nullptr, 0 }, {"uc_allhits", no_argument, nullptr, 0 }, {"uchime2_denovo", required_argument, nullptr, 0 }, {"uchime3_denovo", required_argument, nullptr, 0 }, {"uchime_denovo", required_argument, nullptr, 0 }, {"uchime_ref", required_argument, nullptr, 0 }, {"uchimealns", required_argument, nullptr, 0 }, {"uchimeout", required_argument, nullptr, 0 }, {"uchimeout5", no_argument, 
nullptr, 0 }, {"udb2fasta", required_argument, nullptr, 0 }, {"udbinfo", required_argument, nullptr, 0 }, {"udbstats", required_argument, nullptr, 0 }, {"unoise_alpha", required_argument, nullptr, 0 }, {"usearch_global", required_argument, nullptr, 0 }, {"userfields", required_argument, nullptr, 0 }, {"userout", required_argument, nullptr, 0 }, {"usersort", no_argument, nullptr, 0 }, {"v", no_argument, nullptr, 0 }, {"version", no_argument, nullptr, 0 }, {"weak_id", required_argument, nullptr, 0 }, {"wordlength", required_argument, nullptr, 0 }, {"xdrop_nw", required_argument, nullptr, 0 }, {"xee", no_argument, nullptr, 0 }, {"xn", required_argument, nullptr, 0 }, {"xsize", no_argument, nullptr, 0 }, { nullptr, 0, nullptr, 0 } }; const int options_count = (sizeof(long_options) / sizeof(struct option)) - 1; bool options_selected[options_count]; memset(options_selected, 0, sizeof(options_selected)); int options_index = 0; int c; while ((c = getopt_long_only(argc, argv, "", long_options, &options_index)) == 0) { if (options_index < options_count) { options_selected[options_index] = true; } switch(options_index) { case option_help: opt_help = 1; break; case option_version: opt_version = 1; break; case option_alnout: opt_alnout = optarg; break; case option_usearch_global: opt_usearch_global = optarg; break; case option_db: opt_db = optarg; break; case option_id: opt_id = args_getdouble(optarg); break; case option_maxaccepts: opt_maxaccepts = args_getlong(optarg); break; case option_maxrejects: opt_maxrejects = args_getlong(optarg); break; case option_wordlength: opt_wordlength = args_getlong(optarg); break; case option_match: opt_match = args_getlong(optarg); break; case option_mismatch: opt_mismatch = args_getlong(optarg); break; case option_fulldp: opt_fulldp = 1; fprintf(stderr, "WARNING: Option --fulldp is ignored\n"); break; case option_strand: if (strcasecmp(optarg, "plus") == 0) { opt_strand = 1; } else if (strcasecmp(optarg, "both") == 0) { opt_strand = 2; } 
else { fatal("The argument to --strand must be plus or both"); } break; case option_threads: opt_threads = (int64_t) args_getdouble(optarg); break; case option_gapopen: args_get_gap_penalty_string(optarg, 1); break; case option_gapext: args_get_gap_penalty_string(optarg, 0); break; case option_rowlen: opt_rowlen = args_getlong(optarg); break; case option_userfields: if (!parse_userfields_arg(optarg)) { fatal("Unrecognized userfield argument"); } break; case option_userout: opt_userout = optarg; break; case option_self: opt_self = 1; break; case option_blast6out: opt_blast6out = optarg; break; case option_uc: opt_uc = optarg; break; case option_weak_id: opt_weak_id = args_getdouble(optarg); break; case option_uc_allhits: opt_uc_allhits = 1; break; case option_notrunclabels: opt_notrunclabels = 1; break; case option_sortbysize: opt_sortbysize = optarg; break; case option_output: opt_output = optarg; break; case option_minsize: opt_minsize = args_getlong(optarg); if (opt_minsize <= 0) { fatal("The argument to --minsize must be at least 1"); } break; case option_maxsize: opt_maxsize = args_getlong(optarg); break; case option_relabel: opt_relabel = optarg; break; case option_sizeout: opt_sizeout = 1; break; case option_derep_fulllength: opt_derep_fulllength = optarg; break; case option_minseqlength: opt_minseqlength = args_getlong(optarg); if (opt_minseqlength < 0) { fatal("The argument to --minseqlength must not be negative"); } break; case option_minuniquesize: opt_minuniquesize = args_getlong(optarg); break; case option_topn: opt_topn = args_getlong(optarg); break; case option_maxseqlength: opt_maxseqlength = args_getlong(optarg); break; case option_sizein: opt_sizein = 1; break; case option_sortbylength: opt_sortbylength = optarg; break; case option_matched: opt_matched = optarg; break; case option_notmatched: opt_notmatched = optarg; break; case option_dbmatched: opt_dbmatched = optarg; break; case option_dbnotmatched: opt_dbnotmatched = optarg; break; case 
option_fastapairs: opt_fastapairs = optarg; break; case option_output_no_hits: opt_output_no_hits = 1; break; case option_maxhits: opt_maxhits = args_getlong(optarg); break; case option_top_hits_only: opt_top_hits_only = 1; break; case option_fasta_width: opt_fasta_width = args_getlong(optarg); break; case option_query_cov: opt_query_cov = args_getdouble(optarg); break; case option_target_cov: opt_target_cov = args_getdouble(optarg); break; case option_idprefix: opt_idprefix = args_getlong(optarg); break; case option_idsuffix: opt_idsuffix = args_getlong(optarg); break; case option_minqt: opt_minqt = args_getdouble(optarg); break; case option_maxqt: opt_maxqt = args_getdouble(optarg); break; case option_minsl: opt_minsl = args_getdouble(optarg); break; case option_maxsl: opt_maxsl = args_getdouble(optarg); break; case option_leftjust: opt_leftjust = 1; break; case option_rightjust: opt_rightjust = 1; break; case option_selfid: opt_selfid = 1; break; case option_maxid: opt_maxid = args_getdouble(optarg); break; case option_minsizeratio: opt_minsizeratio = args_getdouble(optarg); break; case option_maxsizeratio: opt_maxsizeratio = args_getdouble(optarg); break; case option_maxdiffs: opt_maxdiffs = args_getlong(optarg); break; case option_maxsubs: opt_maxsubs = args_getlong(optarg); break; case option_maxgaps: opt_maxgaps = args_getlong(optarg); break; case option_mincols: opt_mincols = args_getlong(optarg); break; case option_maxqsize: opt_maxqsize = args_getlong(optarg); break; case option_mintsize: opt_mintsize = args_getlong(optarg); break; case option_mid: opt_mid = args_getdouble(optarg); break; case option_shuffle: opt_shuffle = optarg; break; case option_randseed: opt_randseed = args_getlong(optarg); break; case option_maskfasta: opt_maskfasta = optarg; break; case option_hardmask: opt_hardmask = 1; break; case option_qmask: if (strcasecmp(optarg, "none") == 0) { opt_qmask = MASK_NONE; } else if (strcasecmp(optarg, "dust") == 0) { opt_qmask = MASK_DUST; } else 
if (strcasecmp(optarg, "soft") == 0) { opt_qmask = MASK_SOFT; } else { opt_qmask = MASK_ERROR; } break; case option_dbmask: if (strcasecmp(optarg, "none") == 0) { opt_dbmask = MASK_NONE; } else if (strcasecmp(optarg, "dust") == 0) { opt_dbmask = MASK_DUST; } else if (strcasecmp(optarg, "soft") == 0) { opt_dbmask = MASK_SOFT; } else { opt_dbmask = MASK_ERROR; } break; case option_cluster_smallmem: opt_cluster_smallmem = optarg; break; case option_cluster_fast: opt_cluster_fast = optarg; break; case option_centroids: opt_centroids = optarg; break; case option_clusters: opt_clusters = optarg; break; case option_consout: opt_consout = optarg; break; case option_cons_truncate: fprintf(stderr, "WARNING: Option --cons_truncate is ignored\n"); opt_cons_truncate = 1; break; case option_msaout: opt_msaout = optarg; break; case option_usersort: opt_usersort = 1; break; case option_xn: opt_xn = args_getdouble(optarg); break; case option_iddef: opt_iddef = args_getlong(optarg); break; case option_slots: fprintf(stderr, "WARNING: Option --slots is ignored\n"); opt_slots = args_getlong(optarg); break; case option_pattern: fprintf(stderr, "WARNING: Option --pattern is ignored\n"); opt_pattern = optarg; break; case option_maxuniquesize: opt_maxuniquesize = args_getlong(optarg); break; case option_abskew: opt_abskew = args_getdouble(optarg); break; case option_chimeras: opt_chimeras = optarg; break; case option_dn: opt_dn = args_getdouble(optarg); break; case option_mindiffs: opt_mindiffs = args_getlong(optarg); break; case option_mindiv: opt_mindiv = args_getdouble(optarg); break; case option_minh: opt_minh = args_getdouble(optarg); break; case option_nonchimeras: opt_nonchimeras = optarg; break; case option_uchime_denovo: opt_uchime_denovo = optarg; break; case option_uchime_ref: opt_uchime_ref = optarg; break; case option_uchimealns: opt_uchimealns = optarg; break; case option_uchimeout: opt_uchimeout = optarg; break; case option_uchimeout5: opt_uchimeout5 = 1; break; case 
option_alignwidth: opt_alignwidth = args_getlong(optarg); break; case option_allpairs_global: opt_allpairs_global = optarg; break; case option_acceptall: opt_acceptall = 1; break; case option_cluster_size: opt_cluster_size = optarg; break; case option_samout: opt_samout = optarg; break; case option_log: opt_log = optarg; break; case option_quiet: opt_quiet = true; break; case option_fastx_subsample: opt_fastx_subsample = optarg; break; case option_sample_pct: opt_sample_pct = args_getdouble(optarg); break; case option_fastq_chars: opt_fastq_chars = optarg; break; case option_profile: opt_profile = optarg; break; case option_sample_size: opt_sample_size = args_getlong(optarg); break; case option_fastaout: opt_fastaout = optarg; break; case option_xsize: opt_xsize = true; break; case option_clusterout_id: opt_clusterout_id = true; break; case option_clusterout_sort: opt_clusterout_sort = true; break; case option_borderline: opt_borderline = optarg; break; case option_relabel_sha1: opt_relabel_sha1 = true; break; case option_relabel_md5: opt_relabel_md5 = true; break; case option_derep_prefix: opt_derep_prefix = optarg; break; case option_fastq_filter: opt_fastq_filter = optarg; break; case option_fastqout: opt_fastqout = optarg; break; case option_fastaout_discarded: opt_fastaout_discarded = optarg; break; case option_fastqout_discarded: opt_fastqout_discarded = optarg; break; case option_fastq_truncqual: opt_fastq_truncqual = args_getlong(optarg); break; case option_fastq_maxee: opt_fastq_maxee = args_getdouble(optarg); break; case option_fastq_trunclen: opt_fastq_trunclen = args_getlong(optarg); break; case option_fastq_minlen: opt_fastq_minlen = args_getlong(optarg); break; case option_fastq_stripleft: opt_fastq_stripleft = args_getlong(optarg); break; case option_fastq_maxee_rate: opt_fastq_maxee_rate = args_getdouble(optarg); break; case option_fastq_maxns: opt_fastq_maxns = args_getlong(optarg); break; case option_eeout: opt_eeout = true; break; case 
option_fastq_ascii: opt_fastq_ascii = args_getlong(optarg); break; case option_fastq_qmin: opt_fastq_qmin = args_getlong(optarg); break; case option_fastq_qmax: opt_fastq_qmax = args_getlong(optarg); break; case option_fastq_qmaxout: opt_fastq_qmaxout = args_getlong(optarg); break; case option_fastq_stats: opt_fastq_stats = optarg; break; case option_fastq_tail: opt_fastq_tail = args_getlong(optarg); break; case option_fastx_revcomp: opt_fastx_revcomp = optarg; break; case option_label_suffix: opt_label_suffix = optarg; break; case option_h: opt_help = 1; break; case option_samheader: opt_samheader = true; break; case option_sizeorder: opt_sizeorder = true; break; case option_minwordmatches: opt_minwordmatches = args_getlong(optarg); if (opt_minwordmatches < 0) { fatal("The argument to --minwordmatches must not be negative"); } break; case option_v: opt_version = 1; break; case option_relabel_keep: opt_relabel_keep = true; break; case option_search_exact: opt_search_exact = optarg; break; case option_fastx_mask: opt_fastx_mask = optarg; break; case option_min_unmasked_pct: opt_min_unmasked_pct = args_getdouble(optarg); break; case option_max_unmasked_pct: opt_max_unmasked_pct = args_getdouble(optarg); break; case option_fastq_convert: opt_fastq_convert = optarg; break; case option_fastq_asciiout: opt_fastq_asciiout = args_getlong(optarg); break; case option_fastq_qminout: opt_fastq_qminout = args_getlong(optarg); break; case option_fastq_mergepairs: opt_fastq_mergepairs = optarg; break; case option_fastq_eeout: opt_fastq_eeout = true; break; case option_fastqout_notmerged_fwd: opt_fastqout_notmerged_fwd = optarg; break; case option_fastqout_notmerged_rev: opt_fastqout_notmerged_rev = optarg; break; case option_fastq_minovlen: opt_fastq_minovlen = args_getlong(optarg); break; case option_fastq_minmergelen: opt_fastq_minmergelen = args_getlong(optarg); break; case option_fastq_maxmergelen: opt_fastq_maxmergelen = args_getlong(optarg); break; case 
option_fastq_nostagger: opt_fastq_nostagger = true; break;
        case option_fastq_allowmergestagger: opt_fastq_allowmergestagger = true; break;
        case option_fastq_maxdiffs: opt_fastq_maxdiffs = args_getlong(optarg); break;
        case option_fastaout_notmerged_fwd: opt_fastaout_notmerged_fwd = optarg; break;
        case option_fastaout_notmerged_rev: opt_fastaout_notmerged_rev = optarg; break;
        case option_reverse: opt_reverse = optarg; break;
        case option_eetabbedout: opt_eetabbedout = optarg; break;
        case option_fasta_score: opt_fasta_score = true; break;
        case option_fastq_eestats: opt_fastq_eestats = optarg; break;
        case option_rereplicate: opt_rereplicate = optarg; break;
        case option_xdrop_nw: /* xdrop_nw ignored */ fprintf(stderr, "WARNING: Option --xdrop_nw is ignored\n"); break;
        case option_minhsp: /* minhsp ignored */ fprintf(stderr, "WARNING: Option --minhsp is ignored\n"); break;
        case option_band: /* band ignored */ fprintf(stderr, "WARNING: Option --band is ignored\n"); break;
        case option_hspw: /* hspw ignored */ fprintf(stderr, "WARNING: Option --hspw is ignored\n"); break;
        case option_gzip_decompress: opt_gzip_decompress = true; break;
        case option_bzip2_decompress: opt_bzip2_decompress = true; break;
        case option_fastq_maxlen: opt_fastq_maxlen = args_getlong(optarg); break;
        case option_fastq_truncee: opt_fastq_truncee = args_getdouble(optarg); break;
        case option_fastx_filter: opt_fastx_filter = optarg; break;
        case option_otutabout: opt_otutabout = optarg; break;
        case option_mothur_shared_out: opt_mothur_shared_out = optarg; break;
        case option_biomout: opt_biomout = optarg; break;
        case option_fastq_trunclen_keep: opt_fastq_trunclen_keep = args_getlong(optarg); break;
        case option_fastq_stripright: opt_fastq_stripright = args_getlong(optarg); break;
        case option_no_progress: opt_no_progress = true; break;
        case option_fastq_eestats2: opt_fastq_eestats2 = optarg; break;
        case option_ee_cutoffs: args_get_ee_cutoffs(optarg); break;
        case option_length_cutoffs:
args_get_length_cutoffs(optarg); break; case option_makeudb_usearch: opt_makeudb_usearch = optarg; break; case option_udb2fasta: opt_udb2fasta = optarg; break; case option_udbinfo: opt_udbinfo = optarg; break; case option_udbstats: opt_udbstats = optarg; break; case option_cluster_unoise: opt_cluster_unoise = optarg; break; case option_unoise_alpha: opt_unoise_alpha = args_getdouble(optarg); break; case option_uchime2_denovo: opt_uchime2_denovo = optarg; break; case option_uchime3_denovo: opt_uchime3_denovo = optarg; break; case option_sintax: opt_sintax = optarg; break; case option_sintax_cutoff: opt_sintax_cutoff = args_getdouble(optarg); break; case option_tabbedout: opt_tabbedout = optarg; break; case option_fastq_maxdiffpct: opt_fastq_maxdiffpct = args_getdouble(optarg); break; case option_fastq_join: opt_fastq_join = optarg; break; case option_join_padgap: opt_join_padgap = optarg; break; case option_join_padgapq: opt_join_padgapq = optarg; break; case option_sff_convert: opt_sff_convert = optarg; break; case option_sff_clip: opt_sff_clip = true; break; case option_fastaout_rev: opt_fastaout_rev = optarg; break; case option_fastaout_discarded_rev: opt_fastaout_discarded_rev = optarg; break; case option_fastqout_rev: opt_fastqout_rev = optarg; break; case option_fastqout_discarded_rev: opt_fastqout_discarded_rev = optarg; break; case option_xee: opt_xee = true; break; case option_fastx_getseq: opt_fastx_getseq = optarg; break; case option_fastx_getseqs: opt_fastx_getseqs = optarg; break; case option_fastx_getsubseq: opt_fastx_getsubseq = optarg; break; case option_label_substr_match: opt_label_substr_match = true; break; case option_label: opt_label = optarg; break; case option_subseq_start: opt_subseq_start = args_getlong(optarg); break; case option_subseq_end: opt_subseq_end = args_getlong(optarg); break; case option_notmatchedfq: opt_notmatchedfq = optarg; break; case option_label_field: opt_label_field = optarg; break; case option_label_word: 
opt_label_word = optarg; break;
        case option_label_words: opt_label_words = optarg; break;
        case option_labels: opt_labels = optarg; break;
        case option_cut: opt_cut = optarg; break;
        case option_cut_pattern: opt_cut_pattern = optarg; break;
        case option_relabel_self: opt_relabel_self = true; break;
        case option_derep_id: opt_derep_id = optarg; break;
        case option_orient: opt_orient = optarg; break;
        case option_fasta2fastq: opt_fasta2fastq = optarg; break;
        case option_lcaout: opt_lcaout = optarg; break;
        case option_lca_cutoff: opt_lca_cutoff = args_getdouble(optarg); break;
        case option_fastx_uniques: opt_fastx_uniques = optarg; break;
        case option_fastq_qout_max: opt_fastq_qout_max = true; break;
        case option_sample: opt_sample = optarg; break;
        case option_qsegout: opt_qsegout = optarg; break;
        case option_tsegout: opt_tsegout = optarg; break;
        default: fatal("Internal error in option parsing");
        }
    }

  /* Terminate if ambiguous or illegal options have been detected */
  if (c != -1)
    {
      exit(EXIT_FAILURE);
    }

  /* Terminate after reporting any extra non-option arguments */
  if (optind < argc)
    {
      fatal("Unrecognized string on command line (%s)", argv[optind]);
    }

  /* Below is a list of all command names, in alphabetical order. */

  int command_options[] =
    {
      option_allpairs_global,
      option_cluster_fast,
      option_cluster_size,
      option_cluster_smallmem,
      option_cluster_unoise,
      option_cut,
      option_derep_fulllength,
      option_derep_id,
      option_derep_prefix,
      option_fasta2fastq,
      option_fastq_chars,
      option_fastq_convert,
      option_fastq_eestats,
      option_fastq_eestats2,
      option_fastq_filter,
      option_fastq_join,
      option_fastq_mergepairs,
      option_fastq_stats,
      option_fastx_filter,
      option_fastx_getseq,
      option_fastx_getseqs,
      option_fastx_getsubseq,
      option_fastx_mask,
      option_fastx_revcomp,
      option_fastx_subsample,
      option_fastx_uniques,
      option_h,
      option_help,
      option_makeudb_usearch,
      option_maskfasta,
      option_orient,
      option_rereplicate,
      option_search_exact,
      option_sff_convert,
      option_shuffle,
      option_sintax,
      option_sortbylength,
      option_sortbysize,
      option_uchime2_denovo,
      option_uchime3_denovo,
      option_uchime_denovo,
      option_uchime_ref,
      option_udb2fasta,
      option_udbinfo,
      option_udbstats,
      option_usearch_global,
      option_v,
      option_version
    };

  const int commands_count = sizeof(command_options) / sizeof(int);

  /*
    Below is a list of all the options that are valid for each command.
    The first line is the command and the lines below are the valid options.
*/ const int valid_options[][96] = { { option_allpairs_global, option_acceptall, option_alnout, option_band, option_blast6out, option_bzip2_decompress, option_fasta_width, option_fastapairs, option_fulldp, option_gapext, option_gapopen, option_gzip_decompress, option_hardmask, option_hspw, option_id, option_iddef, option_idprefix, option_idsuffix, option_label_suffix, option_leftjust, option_log, option_match, option_matched, option_maxaccepts, option_maxdiffs, option_maxgaps, option_maxhits, option_maxid, option_maxqsize, option_maxqt, option_maxrejects, option_maxseqlength, option_maxsizeratio, option_maxsl, option_maxsubs, option_mid, option_mincols, option_minhsp, option_minqt, option_minseqlength, option_minsizeratio, option_minsl, option_mintsize, option_minwordmatches, option_mismatch, option_no_progress, option_notmatched, option_notrunclabels, option_output_no_hits, option_pattern, option_qmask, option_qsegout, option_query_cov, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_rightjust, option_rowlen, option_samheader, option_samout, option_sample, option_self, option_selfid, option_sizein, option_sizeout, option_slots, option_target_cov, option_threads, option_top_hits_only, option_tsegout, option_uc, option_userfields, option_userout, option_weak_id, option_wordlength, option_xdrop_nw, option_xee, option_xsize, -1 }, { option_cluster_fast, option_alnout, option_band, option_biomout, option_blast6out, option_bzip2_decompress, option_centroids, option_clusterout_id, option_clusterout_sort, option_clusters, option_cons_truncate, option_consout, option_fasta_width, option_fastapairs, option_fulldp, option_gapext, option_gapopen, option_gzip_decompress, option_hardmask, option_hspw, option_id, option_iddef, option_idprefix, option_idsuffix, option_label_suffix, option_leftjust, option_log, option_match, option_matched, option_maxaccepts, option_maxdiffs, option_maxgaps, option_maxhits, 
option_maxid, option_maxqsize, option_maxqt, option_maxrejects, option_maxseqlength, option_maxsizeratio, option_maxsl, option_maxsubs, option_mid, option_mincols, option_minhsp, option_minqt, option_minseqlength, option_minsizeratio, option_minsl, option_mintsize, option_minwordmatches, option_mismatch, option_mothur_shared_out, option_msaout, option_no_progress, option_notmatched, option_notrunclabels, option_otutabout, option_output_no_hits, option_pattern, option_profile, option_qmask, option_qsegout, option_query_cov, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_rightjust, option_rowlen, option_samheader, option_samout, option_sample, option_self, option_selfid, option_sizein, option_sizeorder, option_sizeout, option_slots, option_strand, option_target_cov, option_threads, option_top_hits_only, option_tsegout, option_uc, option_userfields, option_userout, option_weak_id, option_wordlength, option_xdrop_nw, option_xee, option_xsize, -1 }, { option_cluster_size, option_alnout, option_band, option_biomout, option_blast6out, option_bzip2_decompress, option_centroids, option_clusterout_id, option_clusterout_sort, option_clusters, option_cons_truncate, option_consout, option_fasta_width, option_fastapairs, option_fulldp, option_gapext, option_gapopen, option_gzip_decompress, option_hardmask, option_hspw, option_id, option_iddef, option_idprefix, option_idsuffix, option_label_suffix, option_leftjust, option_log, option_match, option_matched, option_maxaccepts, option_maxdiffs, option_maxgaps, option_maxhits, option_maxid, option_maxqsize, option_maxqt, option_maxrejects, option_maxseqlength, option_maxsizeratio, option_maxsl, option_maxsubs, option_mid, option_mincols, option_minhsp, option_minqt, option_minseqlength, option_minsizeratio, option_minsl, option_mintsize, option_minwordmatches, option_mismatch, option_mothur_shared_out, option_msaout, option_no_progress, option_notmatched, 
option_notrunclabels, option_otutabout, option_output_no_hits, option_pattern, option_profile, option_qmask, option_qsegout, option_query_cov, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_rightjust, option_rowlen, option_samheader, option_samout, option_sample, option_self, option_selfid, option_sizein, option_sizeorder, option_sizeout, option_slots, option_strand, option_target_cov, option_threads, option_top_hits_only, option_tsegout, option_uc, option_userfields, option_userout, option_weak_id, option_wordlength, option_xdrop_nw, option_xee, option_xsize, -1 }, { option_cluster_smallmem, option_alnout, option_band, option_biomout, option_blast6out, option_bzip2_decompress, option_centroids, option_clusterout_id, option_clusterout_sort, option_clusters, option_cons_truncate, option_consout, option_fasta_width, option_fastapairs, option_fulldp, option_gapext, option_gapopen, option_gzip_decompress, option_hardmask, option_hspw, option_id, option_iddef, option_idprefix, option_idsuffix, option_label_suffix, option_leftjust, option_log, option_match, option_matched, option_maxaccepts, option_maxdiffs, option_maxgaps, option_maxhits, option_maxid, option_maxqsize, option_maxqt, option_maxrejects, option_maxseqlength, option_maxsizeratio, option_maxsl, option_maxsubs, option_mid, option_mincols, option_minhsp, option_minqt, option_minseqlength, option_minsizeratio, option_minsl, option_mintsize, option_minwordmatches, option_mismatch, option_mothur_shared_out, option_msaout, option_no_progress, option_notmatched, option_notrunclabels, option_otutabout, option_output_no_hits, option_pattern, option_profile, option_qmask, option_qsegout, option_query_cov, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_rightjust, option_rowlen, option_samheader, option_samout, option_sample, option_self, option_selfid, option_sizein, 
option_sizeorder, option_sizeout, option_slots, option_strand, option_target_cov, option_threads, option_top_hits_only, option_tsegout, option_uc, option_userfields, option_userout, option_usersort, option_weak_id, option_wordlength, option_xdrop_nw, option_xee, option_xsize, -1 }, { option_cluster_unoise, option_alnout, option_band, option_biomout, option_blast6out, option_bzip2_decompress, option_centroids, option_clusterout_id, option_clusterout_sort, option_clusters, option_cons_truncate, option_consout, option_fasta_width, option_fastapairs, option_fulldp, option_gapext, option_gapopen, option_gzip_decompress, option_hardmask, option_hspw, option_id, option_iddef, option_idprefix, option_idsuffix, option_label_suffix, option_leftjust, option_log, option_match, option_matched, option_maxaccepts, option_maxdiffs, option_maxgaps, option_maxhits, option_maxid, option_maxqsize, option_maxqt, option_maxrejects, option_maxseqlength, option_maxsizeratio, option_maxsl, option_maxsubs, option_mid, option_mincols, option_minhsp, option_minqt, option_minseqlength, option_minsizeratio, option_minsize, option_minsl, option_mintsize, option_minwordmatches, option_mismatch, option_mothur_shared_out, option_msaout, option_no_progress, option_notmatched, option_notrunclabels, option_otutabout, option_output_no_hits, option_qsegout, option_pattern, option_profile, option_qmask, option_query_cov, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_rightjust, option_rowlen, option_samheader, option_samout, option_sample, option_self, option_selfid, option_sizein, option_sizeorder, option_sizeout, option_slots, option_strand, option_target_cov, option_threads, option_top_hits_only, option_tsegout, option_uc, option_unoise_alpha, option_userfields, option_userout, option_weak_id, option_wordlength, option_xdrop_nw, option_xee, option_xsize, -1 }, { option_cut, option_bzip2_decompress, option_cut_pattern, 
option_fasta_width, option_fastaout, option_fastaout_discarded, option_fastaout_discarded_rev, option_fastaout_rev, option_gzip_decompress, option_label_suffix, option_log, option_no_progress, option_notrunclabels, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_xee, option_xsize, -1 }, { option_derep_fulllength, option_bzip2_decompress, option_fasta_width, option_gzip_decompress, option_log, option_maxseqlength, option_maxuniquesize, option_minseqlength, option_minuniquesize, option_no_progress, option_notrunclabels, option_output, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_strand, option_threads, option_topn, option_uc, option_xee, option_xsize, -1 }, { option_derep_id, option_bzip2_decompress, option_fasta_width, option_gzip_decompress, option_label_suffix, option_log, option_maxseqlength, option_maxuniquesize, option_minseqlength, option_minuniquesize, option_no_progress, option_notrunclabels, option_output, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_strand, option_threads, option_topn, option_uc, option_xee, option_xsize, -1 }, { option_derep_prefix, option_bzip2_decompress, option_fasta_width, option_gzip_decompress, option_label_suffix, option_log, option_maxseqlength, option_maxuniquesize, option_minseqlength, option_minuniquesize, option_no_progress, option_notrunclabels, option_output, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_strand, option_threads, option_topn, option_uc, option_xee, option_xsize, -1 }, { option_fasta2fastq, option_bzip2_decompress, option_fastq_asciiout, 
option_fastq_qmaxout, option_fastqout, option_gzip_decompress, option_label_suffix, option_log, option_no_progress, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_xee, option_xsize, -1 }, { option_fastq_chars, option_bzip2_decompress, option_fastq_tail, option_gzip_decompress, option_log, option_no_progress, option_quiet, option_threads, -1 }, { option_fastq_convert, option_bzip2_decompress, option_fastq_ascii, option_fastq_asciiout, option_fastq_qmax, option_fastq_qmaxout, option_fastq_qmin, option_fastq_qminout, option_fastqout, option_gzip_decompress, option_label_suffix, option_log, option_no_progress, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_xee, option_xsize, -1 }, { option_fastq_eestats, option_bzip2_decompress, option_fastq_ascii, option_fastq_qmax, option_fastq_qmin, option_gzip_decompress, option_log, option_no_progress, option_output, option_quiet, option_threads, -1 }, { option_fastq_eestats2, option_bzip2_decompress, option_ee_cutoffs, option_fastq_ascii, option_fastq_qmax, option_fastq_qmin, option_gzip_decompress, option_length_cutoffs, option_log, option_no_progress, option_output, option_quiet, option_threads, -1 }, { option_fastq_filter, option_bzip2_decompress, option_eeout, option_fasta_width, option_fastaout, option_fastaout_discarded, option_fastaout_discarded_rev, option_fastaout_rev, option_fastq_ascii, option_fastq_eeout, option_fastq_maxee, option_fastq_maxee_rate, option_fastq_maxlen, option_fastq_maxns, option_fastq_minlen, option_fastq_qmax, option_fastq_qmin, option_fastq_stripleft, option_fastq_stripright, option_fastq_truncee, option_fastq_trunclen, option_fastq_trunclen_keep, option_fastq_truncqual, option_fastqout, option_fastqout_discarded, 
option_fastqout_discarded_rev, option_fastqout_rev, option_gzip_decompress, option_label_suffix, option_log, option_maxsize, option_minsize, option_no_progress, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_reverse, option_sample, option_sizein, option_sizeout, option_threads, option_xee, option_xsize, -1 }, { option_fastq_join, option_bzip2_decompress, option_fasta_width, option_fastaout, option_fastq_ascii, option_fastq_qmax, option_fastq_qmin, option_fastqout, option_gzip_decompress, option_join_padgap, option_join_padgapq, option_label_suffix, option_log, option_no_progress, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_reverse, option_sizein, option_sizeout, option_threads, option_xee, option_xsize, -1 }, { option_fastq_mergepairs, option_bzip2_decompress, option_eeout, option_eetabbedout, option_fasta_width, option_fastaout, option_fastaout_notmerged_fwd, option_fastaout_notmerged_rev, option_fastq_allowmergestagger, option_fastq_ascii, option_fastq_eeout, option_fastq_maxdiffpct, option_fastq_maxdiffs, option_fastq_maxee, option_fastq_maxlen, option_fastq_maxmergelen, option_fastq_maxns, option_fastq_minlen, option_fastq_minmergelen, option_fastq_minovlen, option_fastq_nostagger, option_fastq_qmax, option_fastq_qmaxout, option_fastq_qmin, option_fastq_qminout, option_fastq_truncqual, option_fastqout, option_fastqout_notmerged_fwd, option_fastqout_notmerged_rev, option_gzip_decompress, option_label_suffix, option_log, option_no_progress, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_reverse, option_sample, option_sizein, option_sizeout, option_threads, option_xee, option_xsize, -1 }, { option_fastq_stats, option_bzip2_decompress, option_fastq_ascii, option_fastq_qmax, option_fastq_qmin, option_gzip_decompress, option_log, option_no_progress, 
option_output, option_quiet, option_threads, -1 }, { option_fastx_filter, option_bzip2_decompress, option_eeout, option_fasta_width, option_fastaout, option_fastaout_discarded, option_fastaout_discarded_rev, option_fastaout_rev, option_fastq_ascii, option_fastq_eeout, option_fastq_maxee, option_fastq_maxee_rate, option_fastq_maxlen, option_fastq_maxns, option_fastq_minlen, option_fastq_qmax, option_fastq_qmin, option_fastq_stripleft, option_fastq_stripright, option_fastq_truncee, option_fastq_trunclen, option_fastq_trunclen_keep, option_fastq_truncqual, option_fastqout, option_fastqout_discarded, option_fastqout_discarded_rev, option_fastqout_rev, option_gzip_decompress, option_label_suffix, option_log, option_maxsize, option_minsize, option_no_progress, option_notrunclabels, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_reverse, option_sample, option_sizein, option_sizeout, option_threads, option_xee, option_xsize, -1 }, { option_fastx_getseq, option_bzip2_decompress, option_fasta_width, option_fastaout, option_fastq_ascii, option_fastq_qmax, option_fastq_qmin, option_fastqout, option_gzip_decompress, option_label, option_label_substr_match, option_label_suffix, option_log, option_no_progress, option_notmatched, option_notmatchedfq, option_notrunclabels, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_xee, option_xsize, -1 }, { option_fastx_getseqs, option_bzip2_decompress, option_fasta_width, option_fastaout, option_fastq_ascii, option_fastq_qmax, option_fastq_qmin, option_fastqout, option_gzip_decompress, option_label, option_label_field, option_label_substr_match, option_label_suffix, option_label_word, option_label_words, option_labels, option_log, option_no_progress, option_notmatched, option_notmatchedfq, option_notrunclabels, option_quiet, 
option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_xee, option_xsize, -1 }, { option_fastx_getsubseq, option_bzip2_decompress, option_fasta_width, option_fastaout, option_fastq_ascii, option_fastq_qmax, option_fastq_qmin, option_fastqout, option_gzip_decompress, option_label, option_label_substr_match, option_label_suffix, option_log, option_no_progress, option_notmatched, option_notmatchedfq, option_notrunclabels, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_subseq_end, option_subseq_start, option_threads, option_xee, option_xsize, -1 }, { option_fastx_mask, option_bzip2_decompress, option_fasta_width, option_fastaout, option_fastq_ascii, option_fastq_qmax, option_fastq_qmin, option_fastqout, option_gzip_decompress, option_hardmask, option_label_suffix, option_log, option_max_unmasked_pct, option_min_unmasked_pct, option_no_progress, option_notrunclabels, option_qmask, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_xee, option_xsize, -1 }, { option_fastx_revcomp, option_bzip2_decompress, option_fasta_width, option_fastaout, option_fastq_ascii, option_fastq_qmax, option_fastq_qmin, option_fastqout, option_gzip_decompress, option_label_suffix, option_log, option_no_progress, option_notrunclabels, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_xee, option_xsize, -1 }, { option_fastx_subsample, option_bzip2_decompress, option_fasta_width, option_fastaout, option_fastaout_discarded, option_fastq_ascii, option_fastq_qmax, option_fastq_qmin, option_fastqout, 
option_fastqout_discarded, option_gzip_decompress, option_label_suffix, option_log, option_no_progress, option_notrunclabels, option_quiet, option_randseed, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sample_pct, option_sample_size, option_sizein, option_sizeout, option_threads, option_xee, option_xsize, -1 }, { option_fastx_uniques, option_bzip2_decompress, option_fasta_width, option_fastaout, option_fastq_ascii, option_fastq_asciiout, option_fastq_qmax, option_fastq_qmaxout, option_fastq_qmin, option_fastq_qminout, option_fastq_qout_max, option_fastqout, option_gzip_decompress, option_label_suffix, option_log, option_maxseqlength, option_maxuniquesize, option_minseqlength, option_minuniquesize, option_no_progress, option_notrunclabels, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_strand, option_tabbedout, option_threads, option_topn, option_uc, option_xee, option_xsize, -1 }, { option_h, option_log, option_quiet, option_threads, -1 }, { option_help, option_log, option_quiet, option_threads, -1 }, { option_makeudb_usearch, option_bzip2_decompress, option_dbmask, option_gzip_decompress, option_hardmask, option_log, option_minseqlength, option_no_progress, option_notrunclabels, option_output, option_quiet, option_threads, option_wordlength, -1 }, { option_maskfasta, option_bzip2_decompress, option_fasta_width, option_gzip_decompress, option_hardmask, option_label_suffix, option_log, option_max_unmasked_pct, option_maxseqlength, option_min_unmasked_pct, option_minseqlength, option_no_progress, option_notrunclabels, option_output, option_qmask, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_xee, option_xsize, -1 }, { option_orient, 
option_bzip2_decompress, option_db, option_dbmask, option_fasta_width, option_fastaout, option_fastqout, option_gzip_decompress, option_label_suffix, option_log, option_no_progress, option_notmatched, option_notrunclabels, option_qmask, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_tabbedout, option_threads, option_wordlength, option_xee, option_xsize, -1 }, { option_rereplicate, option_bzip2_decompress, option_fasta_width, option_gzip_decompress, option_label_suffix, option_log, option_no_progress, option_notrunclabels, option_output, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_xee, option_xsize, -1 }, { option_search_exact, option_alnout, option_biomout, option_blast6out, option_bzip2_decompress, option_db, option_dbmask, option_dbmatched, option_dbnotmatched, option_fasta_width, option_fastapairs, option_gzip_decompress, option_hardmask, option_label_suffix, option_lca_cutoff, option_lcaout, option_log, option_match, option_matched, option_maxhits, option_maxqsize, option_maxqt, option_maxseqlength, option_maxsizeratio, option_maxsl, option_mincols, option_minqt, option_minseqlength, option_minsizeratio, option_minsl, option_mintsize, option_mismatch, option_mothur_shared_out, option_no_progress, option_notmatched, option_notrunclabels, option_otutabout, option_output_no_hits, option_qmask, option_qsegout, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_rowlen, option_samheader, option_samout, option_sample, option_self, option_sizein, option_sizeout, option_strand, option_threads, option_top_hits_only, option_tsegout, option_uc, option_uc_allhits, option_userfields, option_userout, option_xee, option_xsize, -1 }, { option_sff_convert, 
option_fastq_asciiout, option_fastq_qmaxout, option_fastq_qminout, option_fastqout, option_label_suffix, option_log, option_no_progress, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sff_clip, option_sizeout, option_threads, -1 }, { option_shuffle, option_bzip2_decompress, option_fasta_width, option_fastq_ascii, option_fastq_qmax, option_fastq_qmin, option_gzip_decompress, option_label_suffix, option_log, option_maxseqlength, option_minseqlength, option_no_progress, option_notrunclabels, option_output, option_quiet, option_randseed, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_topn, option_xee, option_xsize, -1 }, { option_sintax, option_bzip2_decompress, option_db, option_dbmask, option_fastq_ascii, option_fastq_qmax, option_fastq_qmin, option_gzip_decompress, option_label_suffix, option_log, option_no_progress, option_notrunclabels, option_quiet, option_randseed, option_sintax_cutoff, option_strand, option_tabbedout, option_threads, option_wordlength, -1 }, { option_sortbylength, option_bzip2_decompress, option_fasta_width, option_fastq_ascii, option_fastq_qmax, option_fastq_qmin, option_gzip_decompress, option_label_suffix, option_log, option_maxseqlength, option_minseqlength, option_no_progress, option_notrunclabels, option_output, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_topn, option_xee, option_xsize, -1 }, { option_sortbysize, option_bzip2_decompress, option_fasta_width, option_fastq_ascii, option_fastq_qmax, option_fastq_qmin, option_gzip_decompress, option_label_suffix, option_log, option_maxseqlength, option_maxsize, option_minseqlength, option_minsize, option_no_progress, option_notrunclabels, option_output, 
option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_topn, option_xee, option_xsize, -1 }, { option_uchime2_denovo, option_abskew, option_alignwidth, option_borderline, option_chimeras, option_dn, option_fasta_score, option_fasta_width, option_gapext, option_gapopen, option_hardmask, option_label_suffix, option_log, option_match, option_mindiffs, option_mindiv, option_minh, option_mismatch, option_no_progress, option_nonchimeras, option_notrunclabels, option_qmask, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_uchimealns, option_uchimeout, option_uchimeout5, option_xee, option_xn, option_xsize, -1 }, { option_uchime3_denovo, option_abskew, option_alignwidth, option_borderline, option_chimeras, option_dn, option_fasta_score, option_fasta_width, option_gapext, option_gapopen, option_hardmask, option_label_suffix, option_log, option_match, option_mindiffs, option_mindiv, option_minh, option_mismatch, option_no_progress, option_nonchimeras, option_notrunclabels, option_qmask, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_uchimealns, option_uchimeout, option_uchimeout5, option_xee, option_xn, option_xsize, -1 }, { option_uchime_denovo, option_abskew, option_alignwidth, option_borderline, option_chimeras, option_dn, option_fasta_score, option_fasta_width, option_gapext, option_gapopen, option_hardmask, option_label_suffix, option_log, option_match, option_mindiffs, option_mindiv, option_minh, option_mismatch, option_no_progress, option_nonchimeras, option_notrunclabels, option_qmask, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, 
option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_uchimealns, option_uchimeout, option_uchimeout5, option_xee, option_xn, option_xsize, -1 }, { option_uchime_ref, option_abskew, option_alignwidth, option_borderline, option_chimeras, option_db, option_dbmask, option_dn, option_fasta_score, option_fasta_width, option_gapext, option_gapopen, option_hardmask, option_label_suffix, option_log, option_match, option_mindiffs, option_mindiv, option_minh, option_mismatch, option_no_progress, option_nonchimeras, option_notrunclabels, option_qmask, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_self, option_selfid, option_sizein, option_sizeout, option_strand, option_threads, option_uchimealns, option_uchimeout, option_uchimeout5, option_xee, option_xn, option_xsize, -1 }, { option_udb2fasta, option_fasta_width, option_label_suffix, option_log, option_no_progress, option_output, option_quiet, option_relabel, option_relabel_keep, option_relabel_md5, option_relabel_self, option_relabel_sha1, option_sample, option_sizein, option_sizeout, option_threads, option_xee, option_xsize, -1 }, { option_udbinfo, option_log, option_quiet, option_threads, -1 }, { option_udbstats, option_log, option_no_progress, option_quiet, option_threads, -1 }, { option_usearch_global, option_alnout, option_band, option_biomout, option_blast6out, option_bzip2_decompress, option_db, option_dbmask, option_dbmatched, option_dbnotmatched, option_fasta_width, option_fastapairs, option_fulldp, option_gapext, option_gapopen, option_gzip_decompress, option_hardmask, option_hspw, option_id, option_iddef, option_idprefix, option_idsuffix, option_label_suffix, option_lca_cutoff, option_lcaout, option_leftjust, option_log, option_match, option_matched, option_maxaccepts, option_maxdiffs, option_maxgaps, option_maxhits, option_maxid, option_maxqsize, option_maxqt, option_maxrejects, 
        option_maxseqlength, option_maxsizeratio, option_maxsl, option_maxsubs,
        option_mid, option_mincols, option_minhsp, option_minqt,
        option_minseqlength, option_minsizeratio, option_minsl, option_mintsize,
        option_minwordmatches, option_mismatch, option_mothur_shared_out,
        option_no_progress, option_notmatched, option_notrunclabels,
        option_otutabout, option_output_no_hits, option_pattern, option_qmask,
        option_qsegout, option_query_cov, option_quiet, option_relabel,
        option_relabel_keep, option_relabel_md5, option_relabel_self,
        option_relabel_sha1, option_rightjust, option_rowlen, option_samheader,
        option_samout, option_sample, option_self, option_selfid, option_sizein,
        option_sizeout, option_slots, option_strand, option_target_cov,
        option_threads, option_top_hits_only, option_tsegout, option_uc,
        option_uc_allhits, option_userfields, option_userout, option_weak_id,
        option_wordlength, option_xdrop_nw, option_xee, option_xsize, -1 },
      { option_v, option_log, option_quiet, option_threads, -1 },
      { option_version, option_log, option_quiet, option_threads, -1 }
    };

  /* check that only one command is specified */

  int commands = 0;
  int k = -1;
  for (int i = 0; i < commands_count; i++) {
    if (options_selected[command_options[i]]) {
      commands++;
      k = i;
    }
  }
  if (commands > 1) {
    fatal("More than one command specified");
  }

  /* check that only valid options are specified */

  int invalid_options = 0;

  if (commands == 0) {
    /* check if any options are specified */
    bool any_options = false;
    for (bool i : options_selected) {
      if (i) {
        any_options = true;
      }
    }
    if (any_options) {
      fprintf(stderr, "WARNING: Options given, but no valid command specified.\n");
    }
  } else {
    for (int i = 0; i < options_count; i++) {
      if (options_selected[i]) {
        /* scan the -1 terminated row of options valid for command k */
        int j = 0;
        bool ok = false;
        while (valid_options[k][j] >= 0) {
          if (valid_options[k][j] == i) {
            ok = true;
            break;
          }
          j++;
        }
        if (! ok) {
          invalid_options++;
          if (invalid_options == 1) {
            fprintf(stderr, "Fatal error: Invalid options to command %s\n", long_options[command_options[k]].name);
            fprintf(stderr, "Invalid option(s):");
          }
          fprintf(stderr, " --%s", long_options[i].name);
        }
      }
    }
    if (invalid_options > 0) {
      fprintf(stderr, "\nThe valid options for the %s command are:", long_options[command_options[k]].name);
      int count = 0;
      /* start at 1 to skip the command itself, stored first in each row */
      for(int j = 1; valid_options[k][j] >= 0; j++) {
        fprintf(stderr, " --%s", long_options[valid_options[k][j]].name);
        count++;
      }
      if (! count) {
        fprintf(stderr, " (none)");
      }
      fprintf(stderr, "\n");
      exit(EXIT_FAILURE);
    }
  }

  /* multi-threaded commands */

  if ((opt_threads < 0) || (opt_threads > 1024)) {
    fatal("The argument to --threads must be in the range 0 (default) to 1024");
  }

  if (opt_allpairs_global || opt_cluster_fast || opt_cluster_size ||
      opt_cluster_smallmem || opt_cluster_unoise || opt_fastq_mergepairs ||
      opt_fastx_mask || opt_maskfasta || opt_search_exact || opt_sintax ||
      opt_uchime_ref || opt_usearch_global) {
    if (opt_threads == 0) {
      opt_threads = arch_get_cores();
    }
  } else {
    if (opt_threads > 1) {
      fprintf(stderr, "WARNING: The %s command does not support multithreading.\nOnly 1 thread used.\n", long_options[command_options[k]].name);
    }
    opt_threads = 1;
  }

  if (opt_cluster_unoise) {
    opt_weak_id = 0.90;
  } else if (opt_weak_id > opt_id) {
    opt_weak_id = opt_id;
  }

  if (opt_maxrejects == -1) {
    if (opt_cluster_fast) {
      opt_maxrejects = 8;
    } else {
      opt_maxrejects = 32;
    }
  }

  if (opt_maxaccepts < 0) {
    fatal("The argument to --maxaccepts must not be negative");
  }

  if (opt_maxrejects < 0) {
    fatal("The argument to --maxrejects must not be negative");
  }

  if (opt_wordlength == 0) {
    /* set default word length */
    if (opt_orient) {
      opt_wordlength = 12;
    } else {
      opt_wordlength = 8;
    }
  }

  if ((opt_wordlength < 3) || (opt_wordlength > 15)) {
    fatal("The argument to --wordlength must be in the range 3 to 15");
  }

  if ((opt_iddef < 0) || (opt_iddef > 4)) {
    fatal("The argument to --iddef must be in the range 0 to 4");
  }

#if 0

  if (opt_match <= 0)
    fatal("The argument to --match must be positive");

  if (opt_mismatch >= 0)
    fatal("The argument to --mismatch must be negative");

#endif

  if (opt_alignwidth < 0) {
    fatal("The argument to --alignwidth must not be negative");
  }

  if (opt_rowlen < 0) {
    fatal("The argument to --rowlen must not be negative");
  }

  if (opt_qmask == MASK_ERROR) {
    fatal("The argument to --qmask must be none, dust or soft");
  }

  if (opt_dbmask == MASK_ERROR) {
    fatal("The argument to --dbmask must be none, dust or soft");
  }

  if ((opt_sample_pct < 0.0) || (opt_sample_pct > 100.0)) {
    fatal("The argument to --sample_pct must be in the range 0.0 to 100.0");
  }

  if (opt_sample_size < 0) {
    fatal("The argument to --sample_size must not be negative");
  }

  if (((opt_relabel ? 1 : 0) + opt_relabel_md5 + opt_relabel_self + opt_relabel_sha1) > 1) {
    fatal("Specify only one of --relabel, --relabel_self, --relabel_sha1, or --relabel_md5");
  }

  if (opt_fastq_tail < 1) {
    fatal("The argument to --fastq_tail must be positive");
  }

  /* range checks need ||: a value cannot be both below 0.0 and above 100.0 */
  if ((opt_min_unmasked_pct < 0.0) || (opt_min_unmasked_pct > 100.0)) {
    fatal("The argument to --min_unmasked_pct must be between 0.0 and 100.0");
  }

  if ((opt_max_unmasked_pct < 0.0) || (opt_max_unmasked_pct > 100.0)) {
    fatal("The argument to --max_unmasked_pct must be between 0.0 and 100.0");
  }

  if (opt_min_unmasked_pct > opt_max_unmasked_pct) {
    fatal("The argument to --min_unmasked_pct cannot be larger than to --max_unmasked_pct");
  }

  if ((opt_fastq_ascii != 33) && (opt_fastq_ascii != 64)) {
    fatal("The argument to --fastq_ascii must be 33 or 64");
  }

  if (opt_fastq_qmin > opt_fastq_qmax) {
    fatal("The argument to --fastq_qmin cannot be larger than to --fastq_qmax");
  }

  if (opt_fastq_ascii + opt_fastq_qmin < 33) {
    fatal("Sum of arguments to --fastq_ascii and --fastq_qmin must be no less than 33");
  }

  if (opt_fastq_ascii + opt_fastq_qmax > 126) {
    fatal("Sum of arguments to --fastq_ascii and --fastq_qmax must be no more than 126");
  }

  if (opt_fastq_qminout > opt_fastq_qmaxout)
{ fatal("The argument to --fastq_qminout cannot be larger than to --fastq_qmaxout"); } if ((opt_fastq_asciiout != 33) && (opt_fastq_asciiout != 64)) { fatal("The argument to --fastq_asciiout must be 33 or 64"); } if (opt_fastq_asciiout + opt_fastq_qminout < 33) { fatal("Sum of arguments to --fastq_asciiout and --fastq_qminout must be no less than 33"); } if (opt_fastq_asciiout + opt_fastq_qmaxout > 126) { fatal("Sum of arguments to --fastq_asciiout and --fastq_qmaxout must be no more than 126"); } if (opt_gzip_decompress && opt_bzip2_decompress) { fatal("Specify either --gzip_decompress or --bzip2_decompress, not both"); } if ((opt_sintax_cutoff < 0.0) || (opt_sintax_cutoff > 1.0)) { fatal("The argument to sintax_cutoff must be in the range 0.0 to 1.0"); } if ((opt_lca_cutoff <= 0.5) || (opt_lca_cutoff > 1.0)) { fatal("The argument to lca_cutoff must be larger than 0.5, but not larger than 1.0"); } if (opt_minuniquesize < 1) { fatal("The argument to minuniquesize must be at least 1"); } if (opt_maxuniquesize < 1) { fatal("The argument to maxuniquesize must be at least 1"); } if (opt_maxsize < 1) { fatal("The argument to maxsize must be at least 1"); } if (opt_maxhits < 0) { fatal("The argument to maxhits cannot be negative"); } /* TODO: check valid range of gap penalties */ /* adapt/adjust parameters */ #if 1 /* Adjust gap open penalty according to convention. The specified gap open penalties include the penalty for a single nucleotide gap: gap penalty = gap open penalty + (gap length - 1) * gap extension penalty The rest of the code assumes the first nucleotide gap penalty is not included in the gap opening penalty. 
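
     Worked example with illustrative values (assumed here, not read from
     this file): for a gap open penalty of 20 and a gap extension penalty
     of 2, a gap of length 3 costs 20 + (3 - 1) * 2 = 24 under the
     convention above. After the subtraction below, the stored open
     penalty is 20 - 2 = 18, and the same gap costs 18 + 3 * 2 = 24,
     assuming every gap position, including the first, is then charged
     the extension penalty.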
  */

  opt_gap_open_query_left -= opt_gap_extension_query_left;
  opt_gap_open_target_left -= opt_gap_extension_target_left;
  opt_gap_open_query_interior -= opt_gap_extension_query_interior;
  opt_gap_open_target_interior -= opt_gap_extension_target_interior;
  opt_gap_open_query_right -= opt_gap_extension_query_right;
  opt_gap_open_target_right -= opt_gap_extension_target_right;

#endif

  /* set default parameters, if not specified */

  if (opt_maxhits == 0) {
    opt_maxhits = LONG_MAX;
  }

  if (opt_minwordmatches < 0) {
    opt_minwordmatches = minwordmatches_defaults[opt_wordlength];
  }

  /* set default opt_minsize depending on command */
  if (opt_minsize == 0) {
    if (opt_cluster_unoise) {
      opt_minsize = 8;
    } else {
      opt_minsize = 1;
    }
  }

  /* set default opt_abskew depending on command */
  if (opt_abskew < 0.0) {
    if (opt_uchime3_denovo) {
      opt_abskew = 16.0;
    } else {
      opt_abskew = 2.0;
    }
  }

  /* set default opt_minseqlength depending on command */

  if (opt_minseqlength < 0) {
    if (opt_cluster_fast || opt_cluster_size || opt_cluster_smallmem ||
        opt_cluster_unoise || opt_derep_fulllength || opt_derep_id ||
        opt_derep_prefix || opt_makeudb_usearch || opt_sintax ||
        opt_usearch_global) {
      opt_minseqlength = 32;
    } else {
      opt_minseqlength = 1;
    }
  }

  if (opt_sintax) {
    opt_notrunclabels = 1;
  }
}

void show_publication()
{
  fprintf(stdout,
          "Rognes T, Flouri T, Nichols B, Quince C, Mahe F (2016)\n"
          "VSEARCH: a versatile open source tool for metagenomics\n"
          "PeerJ 4:e2584 doi: 10.7717/peerj.2584 https://doi.org/10.7717/peerj.2584\n"
          "\n");
}

void cmd_version()
{
  if (!
opt_quiet) { show_publication(); #ifdef HAVE_ZLIB_H printf("Compiled with support for gzip-compressed files,"); if (gz_lib) { printf(" and the library is loaded.\n"); char * (*zlibVersion_p)(); zlibVersion_p = (char * (*)()) arch_dlsym(gz_lib, "zlibVersion"); char * gz_version = (*zlibVersion_p)(); uLong (*zlibCompileFlags_p)(); zlibCompileFlags_p = (uLong (*)()) arch_dlsym(gz_lib, "zlibCompileFlags"); uLong flags = (*zlibCompileFlags_p)(); printf("zlib version %s, compile flags %lx", gz_version, flags); if (flags & 0x0400) { printf(" (ZLIB_WINAPI)"); } printf("\n"); } else { printf(" but the library was not found.\n"); } #else printf("Compiled without support for gzip-compressed files.\n"); #endif #ifdef HAVE_BZLIB_H printf("Compiled with support for bzip2-compressed files,"); if (bz2_lib) { printf(" and the library is loaded.\n"); } else { printf(" but the library was not found.\n"); } #else printf("Compiled without support for bzip2-compressed files.\n"); #endif } } void cmd_help() { /* 0 1 2 3 4 5 6 7 */ /* 01234567890123456789012345678901234567890123456789012345678901234567890123456789 */ if (! 
opt_quiet) { show_publication(); fprintf(stdout, "Usage: %s [OPTIONS]\n", progname); fprintf(stdout, "\n" "General options\n" " --bzip2_decompress decompress input with bzip2 (required if pipe)\n" " --fasta_width INT width of FASTA seq lines, 0 for no wrap (80)\n" " --gzip_decompress decompress input with gzip (required if pipe)\n" " --help | -h display help information\n" " --log FILENAME write messages, timing and memory info to file\n" " --maxseqlength INT maximum sequence length (50000)\n" " --minseqlength INT min seq length (clust/derep/search: 32, other:1)\n" " --no_progress do not show progress indicator\n" " --notrunclabels do not truncate labels at first space\n" " --quiet output just warnings and fatal errors to stderr\n" " --threads INT number of threads to use, zero for all cores (0)\n" " --version | -v display version information\n" "\n" "Chimera detection\n" " --uchime_denovo FILENAME detect chimeras de novo\n" " --uchime2_denovo FILENAME detect chimeras de novo in denoised amplicons\n" " --uchime3_denovo FILENAME detect chimeras de novo in denoised amplicons\n" " --uchime_ref FILENAME detect chimeras using a reference database\n" " Data\n" " --db FILENAME reference database for --uchime_ref\n" " Parameters\n" " --abskew REAL minimum abundance ratio (2.0, 16.0 for uchime3)\n" " --dn REAL 'no' vote pseudo-count (1.4)\n" " --mindiffs INT minimum number of differences in segment (3) *\n" " --mindiv REAL minimum divergence from closest parent (0.8) *\n" " --minh REAL minimum score (0.28) * ignored in uchime2/3\n" " --sizein propagate abundance annotation from input\n" " --self exclude identical labels for --uchime_ref\n" " --selfid exclude identical sequences for --uchime_ref\n" " --xn REAL 'no' vote weight (8.0)\n" " Output\n" " --alignwidth INT width of alignment in uchimealn output (80)\n" " --borderline FILENAME output borderline chimeric sequences to file\n" " --chimeras FILENAME output chimeric sequences to file\n" " --fasta_score include chimera 
score in fasta output\n"
          " --nonchimeras FILENAME output non-chimeric sequences to file\n"
          " --relabel STRING relabel nonchimeras with this prefix string\n"
          " --relabel_keep keep the old label after the new when relabelling\n"
          " --relabel_md5 relabel with md5 digest of normalized sequence\n"
          " --relabel_self relabel with the sequence itself as label\n"
          " --relabel_sha1 relabel with sha1 digest of normalized sequence\n"
          " --sizeout include abundance information when relabelling\n"
          " --uchimealns FILENAME output chimera alignments to file\n"
          " --uchimeout FILENAME output chimera info to tab-separated file\n"
          " --uchimeout5 make output compatible with uchime version 5\n"
          " --xsize strip abundance information in output\n"
          "\n"
          "Clustering\n"
          " --cluster_fast FILENAME cluster sequences after sorting by length\n"
          " --cluster_size FILENAME cluster sequences after sorting by abundance\n"
          " --cluster_smallmem FILENAME cluster already sorted sequences (see --usersort)\n"
          " --cluster_unoise FILENAME denoise Illumina amplicon reads\n"
          " Parameters (most searching options also apply)\n"
          " --cons_truncate do not ignore terminal gaps in MSA for consensus\n"
          " --id REAL reject if identity lower, accepted values: 0-1.0\n"
          " --iddef INT id definition, 0-4=CD-HIT,all,int,MBL,BLAST (2)\n"
          " --qmask none|dust|soft mask seqs with dust, soft or no method (dust)\n"
          " --sizein propagate abundance annotation from input\n"
          " --strand plus|both cluster using plus or both strands (plus)\n"
          " --usersort indicate sequences not pre-sorted by length\n"
          " --minsize INT minimum abundance (unoise only) (8)\n"
          " --unoise_alpha REAL alpha parameter (unoise only) (2.0)\n"
          " Output\n"
          " --biomout FILENAME filename for OTU table output in biom 1.0 format\n"
          " --centroids FILENAME output centroid sequences to FASTA file\n"
          " --clusterout_id add cluster id info to consout and profile files\n"
          " --clusterout_sort order msaout, consout, profile by decr abundance\n"
          " --clusters STRING output each cluster to 
a separate FASTA file\n" " --consout FILENAME output cluster consensus sequences to FASTA file\n" " --mothur_shared_out FN filename for OTU table output in mothur format\n" " --msaout FILENAME output multiple seq. alignments to FASTA file\n" " --otutabout FILENAME filename for OTU table output in classic format\n" " --profile FILENAME output sequence profile of each cluster to file\n" " --relabel STRING relabel centroids with this prefix string\n" " --relabel_keep keep the old label after the new when relabelling\n" " --relabel_md5 relabel with md5 digest of normalized sequence\n" " --relabel_self relabel with the sequence itself as label\n" " --relabel_sha1 relabel with sha1 digest of normalized sequence\n" " --sizeorder sort accepted centroids by abundance, AGC\n" " --sizeout write cluster abundances to centroid file\n" " --uc FILENAME specify filename for UCLUST-like output\n" " --xsize strip abundance information in output\n" "\n" "Convert SFF to FASTQ\n" " --sff_convert FILENAME convert given SFF file to FASTQ format\n" " Parameters\n" " --sff_clip clip ends of sequences as indicated in file (no)\n" " --fastq_asciiout INT FASTQ output quality score ASCII base char (33)\n" " --fastq_qmaxout INT maximum base quality value for FASTQ output (41)\n" " --fastq_qminout INT minimum base quality value for FASTQ output (0)\n" " Output\n" " --fastqout FILENAME output converted sequences to given FASTQ file\n" "\n" "Dereplication and rereplication\n" " --derep_fulllength FILENAME dereplicate sequences in the given FASTA file\n" " --derep_id FILENAME dereplicate using both identifiers and sequences\n" " --derep_prefix FILENAME dereplicate sequences in file based on prefixes\n" " --fastx_uniques FILENAME dereplicate sequences in the FASTA/FASTQ file\n" " --rereplicate FILENAME rereplicate sequences in the given FASTA file\n" " Parameters\n" " --maxuniquesize INT maximum abundance for output from dereplication\n" " --minuniquesize INT minimum abundance for output from 
dereplication\n" " --sizein propagate abundance annotation from input\n" " --strand plus|both dereplicate plus or both strands (plus)\n" " Output\n" " --fastq_ascii INT FASTQ input quality score ASCII base char (33)\n" " --fastq_qmax INT maximum base quality value for FASTQ input (41)\n" " --fastq_qmaxout INT maximum base quality value for FASTQ output (41)\n" " --fastq_qmin INT minimum base quality value for FASTQ input (0)\n" " --fastq_qminout INT minimum base quality value for FASTQ output (0)\n" " --fastaout FILENAME output FASTA file (for fastx_uniques)\n" " --fastqout FILENAME output FASTQ file (for fastx_uniques)\n" " --output FILENAME output FASTA file (not for fastx_uniques)\n" " --relabel STRING relabel with this prefix string\n" " --relabel_keep keep the old label after the new when relabelling\n" " --relabel_md5 relabel with md5 digest of normalized sequence\n" " --relabel_self relabel with the sequence itself as label\n" " --relabel_sha1 relabel with sha1 digest of normalized sequence\n" " --sizeout write abundance annotation to output\n" " --tabbedout FILENAME write cluster info to tsv file for fastx_uniques\n" " --topn INT output only n most abundant sequences after derep\n" " --uc FILENAME filename for UCLUST-like dereplication output\n" " --xsize strip abundance information in derep output\n" "\n" "FASTA to FASTQ conversion\n" " --fasta2fastq FILENAME convert from FASTA to FASTQ, fake quality scores\n" " Parameters\n" " --fastq_asciiout INT FASTQ output quality score ASCII base char (33)\n" " --fastq_qmaxout INT fake quality score for FASTQ output (41)\n" " Output\n" " --fastqout FILENAME FASTQ output filename for converted sequences\n" "\n" "FASTQ format conversion\n" " --fastq_convert FILENAME convert between FASTQ file formats\n" " Parameters\n" " --fastq_ascii INT FASTQ input quality score ASCII base char (33)\n" " --fastq_asciiout INT FASTQ output quality score ASCII base char (33)\n" " --fastq_qmax INT maximum base quality value for FASTQ 
input (41)\n" " --fastq_qmaxout INT maximum base quality value for FASTQ output (41)\n" " --fastq_qmin INT minimum base quality value for FASTQ input (0)\n" " --fastq_qminout INT minimum base quality value for FASTQ output (0)\n" " Output\n" " --fastqout FILENAME FASTQ output filename for converted sequences\n" "\n" "FASTQ format detection and quality analysis\n" " --fastq_chars FILENAME analyse FASTQ file for version and quality range\n" " Parameters\n" " --fastq_tail INT min length of tails to count for fastq_chars (4)\n" "\n" "FASTQ quality statistics\n" " --fastq_stats FILENAME report statistics on FASTQ file\n" " --fastq_eestats FILENAME quality score and expected error statistics\n" " --fastq_eestats2 FILENAME expected error and length cutoff statistics\n" " Parameters\n" " --ee_cutoffs REAL,... fastq_eestats2 expected error cutoffs (0.5,1,2)\n" " --fastq_ascii INT FASTQ input quality score ASCII base char (33)\n" " --fastq_qmax INT maximum base quality value for FASTQ input (41)\n" " --fastq_qmin INT minimum base quality value for FASTQ input (0)\n" " --length_cutoffs INT,INT,INT fastq_eestats2 length (min,max,incr) (50,*,50)\n" " Output\n" " --log FILENAME output file for fastq_stats statistics\n" " --output FILENAME output file for fastq_eestats(2) statistics\n" "\n" "Masking (new)\n" " --fastx_mask FILENAME mask sequences in the given FASTA or FASTQ file\n" " Parameters\n" " --fastq_ascii INT FASTQ input quality score ASCII base char (33)\n" " --fastq_qmax INT maximum base quality value for FASTQ input (41)\n" " --fastq_qmin INT minimum base quality value for FASTQ input (0)\n" " --hardmask mask by replacing with N instead of lower case\n" " --max_unmasked_pct max unmasked %% of sequences to keep (100.0)\n" " --min_unmasked_pct min unmasked %% of sequences to keep (0.0)\n" " --qmask none|dust|soft mask seqs with dust, soft or no method (dust)\n" " Output\n" " --fastaout FILENAME output to specified FASTA file\n" " --fastqout FILENAME output to specified 
FASTQ file\n"
          "\n"
          "Masking (old)\n"
          " --maskfasta FILENAME mask sequences in the given FASTA file\n"
          " Parameters\n"
          " --hardmask mask by replacing with N instead of lower case\n"
          " --qmask none|dust|soft mask seqs with dust, soft or no method (dust)\n"
          " Output\n"
          " --output FILENAME output to specified FASTA file\n"
          "\n"
          "Orient sequences in forward or reverse direction\n"
          " --orient FILENAME orient sequences in given FASTA/FASTQ file\n"
          " Data\n"
          " --db FILENAME database of sequences in correct orientation\n"
          " --dbmask none|dust|soft mask db seqs with dust, soft or no method (dust)\n"
          " --qmask none|dust|soft mask query with dust, soft or no method (dust)\n"
          " --wordlength INT length of words used for matching 3-15 (12)\n"
          " Output\n"
          " --fastaout FILENAME FASTA output filename for oriented sequences\n"
          " --fastqout FILENAME FASTQ output filename for oriented sequences\n"
          " --notmatched FILENAME output filename for undetermined sequences\n"
          " --tabbedout FILENAME output filename for result information\n"
          "\n"
          "Paired-end reads joining\n"
          " --fastq_join FILENAME join paired-end reads into one sequence with gap\n"
          " Data\n"
          " --reverse FILENAME specify FASTQ file with reverse reads\n"
          " --join_padgap STRING sequence string used for padding (NNNNNNNN)\n"
          " --join_padgapq STRING quality string used for padding (IIIIIIII)\n"
          " Output\n"
          " --fastaout FILENAME FASTA output filename for joined sequences\n"
          " --fastqout FILENAME FASTQ output filename for joined sequences\n"
          "\n"
          "Paired-end reads merging\n"
          " --fastq_mergepairs FILENAME merge paired-end reads into one sequence\n"
          " Data\n"
          " --reverse FILENAME specify FASTQ file with reverse reads\n"
          " Parameters\n"
          " --fastq_allowmergestagger allow merging of staggered reads\n"
          " --fastq_ascii INT FASTQ input quality score ASCII base char (33)\n"
          " --fastq_maxdiffpct REAL maximum percentage diff. 
bases in overlap (100.0)\n" " --fastq_maxdiffs INT maximum number of different bases in overlap (10)\n" " --fastq_maxee REAL maximum expected error value for merged sequence\n" " --fastq_maxmergelen maximum length of entire merged sequence\n" " --fastq_maxns INT maximum number of N's\n" " --fastq_minlen INT minimum input read length after truncation (1)\n" " --fastq_minmergelen minimum length of entire merged sequence\n" " --fastq_minovlen minimum length of overlap between reads (10)\n" " --fastq_nostagger disallow merging of staggered reads (default)\n" " --fastq_qmax INT maximum base quality value for FASTQ input (41)\n" " --fastq_qmaxout INT maximum base quality value for FASTQ output (41)\n" " --fastq_qmin INT minimum base quality value for FASTQ input (0)\n" " --fastq_qminout INT minimum base quality value for FASTQ output (0)\n" " --fastq_truncqual INT base quality value for truncation\n" " Output\n" " --eetabbedout FILENAME output error statistics to specified file\n" " --fastaout FILENAME FASTA output filename for merged sequences\n" " --fastaout_notmerged_fwd FN FASTA filename for non-merged forward sequences\n" " --fastaout_notmerged_rev FN FASTA filename for non-merged reverse sequences\n" " --fastq_eeout include expected errors (ee) in FASTQ output\n" " --fastqout FILENAME FASTQ output filename for merged sequences\n" " --fastqout_notmerged_fwd FN FASTQ filename for non-merged forward sequences\n" " --fastqout_notmerged_rev FN FASTQ filename for non-merged reverse sequences\n" " --label_suffix STRING suffix to append to label of merged sequences\n" " --xee remove expected errors (ee) info from output\n" "\n" "Pairwise alignment\n" " --allpairs_global FILENAME perform global alignment of all sequence pairs\n" " Output (most searching options also apply)\n" " --alnout FILENAME filename for human-readable alignment output\n" " --acceptall output all pairwise alignments\n" "\n" "Restriction site cutting\n" " --cut FILENAME filename of FASTA formatted input 
sequences\n" " Parameters\n" " --cut_pattern STRING pattern to match with ^ and _ at cut sites\n" " Output\n" " --fastaout FILENAME FASTA filename for fragments on forward strand\n" " --fastaout_rev FILENAME FASTA filename for fragments on reverse strand\n" " --fastaout_discarded FN FASTA filename for non-matching sequences\n" " --fastaout_discarded_rev FN FASTA filename for non-matching, reverse compl.\n" "\n" "Reverse complementation\n" " --fastx_revcomp FILENAME reverse-complement seqs in FASTA or FASTQ file\n" " Parameters\n" " --fastq_ascii INT FASTQ input quality score ASCII base char (33)\n" " --fastq_qmax INT maximum base quality value for FASTQ input (41)\n" " --fastq_qmin INT minimum base quality value for FASTQ input (0)\n" " Output\n" " --fastaout FILENAME FASTA output filename\n" " --fastqout FILENAME FASTQ output filename\n" " --label_suffix STRING label to append to identifier in the output\n" "\n" "Searching\n" " --search_exact FILENAME filename of queries for exact match search\n" " --usearch_global FILENAME filename of queries for global alignment search\n" " Data\n" " --db FILENAME name of UDB or FASTA database for search\n" " Parameters\n" " --dbmask none|dust|soft mask db with dust, soft or no method (dust)\n" " --fulldp full dynamic programming alignment (always on)\n" " --gapext STRING penalties for gap extension (2I/1E)\n" " --gapopen STRING penalties for gap opening (20I/2E)\n" " --hardmask mask by replacing with N instead of lower case\n" " --id REAL reject if identity lower\n" " --iddef INT id definition, 0-4=CD-HIT,all,int,MBL,BLAST (2)\n" " --idprefix INT reject if first n nucleotides do not match\n" " --idsuffix INT reject if last n nucleotides do not match\n" " --lca_cutoff REAL fraction of matching hits required for LCA (1.0)\n" " --leftjust reject if terminal gaps at alignment left end\n" " --match INT score for match (2)\n" " --maxaccepts INT number of hits to accept and show per strand (1)\n" " --maxdiffs INT reject if more 
substitutions or indels\n" " --maxgaps INT reject if more indels\n" " --maxhits INT maximum number of hits to show (unlimited)\n" " --maxid REAL reject if identity higher\n" " --maxqsize INT reject if query abundance larger\n" " --maxqt REAL reject if query/target length ratio higher\n" " --maxrejects INT number of non-matching hits to consider (32)\n" " --maxsizeratio REAL reject if query/target abundance ratio higher\n" " --maxsl REAL reject if shorter/longer length ratio higher\n" " --maxsubs INT reject if more substitutions\n" " --mid REAL reject if percent identity lower, ignoring gaps\n" " --mincols INT reject if alignment length shorter\n" " --minqt REAL reject if query/target length ratio lower\n" " --minsizeratio REAL reject if query/target abundance ratio lower\n" " --minsl REAL reject if shorter/longer length ratio lower\n" " --mintsize INT reject if target abundance lower\n" " --minwordmatches INT minimum number of word matches required (12)\n" " --mismatch INT score for mismatch (-4)\n" " --pattern STRING option is ignored\n" " --qmask none|dust|soft mask query with dust, soft or no method (dust)\n" " --query_cov REAL reject if fraction of query seq. aligned lower\n" " --rightjust reject if terminal gaps at alignment right end\n" " --sizein propagate abundance annotation from input\n" " --self reject if labels identical\n" " --selfid reject if sequences identical\n" " --slots INT option is ignored\n" " --strand plus|both search plus or both strands (plus)\n" " --target_cov REAL reject if fraction of target seq. 
aligned lower\n" " --weak_id REAL include aligned hits with >= id; continue search\n" " --wordlength INT length of words for database index 3-15 (8)\n" " Output\n" " --alnout FILENAME filename for human-readable alignment output\n" " --biomout FILENAME filename for OTU table output in biom 1.0 format\n" " --blast6out FILENAME filename for blast-like tab-separated output\n" " --dbmatched FILENAME FASTA file for matching database sequences\n" " --dbnotmatched FILENAME FASTA file for non-matching database sequences\n" " --fastapairs FILENAME FASTA file with pairs of query and target\n" " --lcaout FILENAME output LCA of matching sequences to file\n" " --matched FILENAME FASTA file for matching query sequences\n" " --mothur_shared_out FN filename for OTU table output in mothur format\n" " --notmatched FILENAME FASTA file for non-matching query sequences\n" " --otutabout FILENAME filename for OTU table output in classic format\n" " --output_no_hits output non-matching queries to output files\n" " --rowlen INT width of alignment lines in alnout output (64)\n" " --samheader include a header in the SAM output file\n" " --samout FILENAME filename for SAM format output\n" " --sizeout write abundance annotation to dbmatched file\n" " --top_hits_only output only hits with identity equal to the best\n" " --uc FILENAME filename for UCLUST-like output\n" " --uc_allhits show all, not just top hit with uc output\n" " --userfields STRING fields to output in userout file\n" " --userout FILENAME filename for user-defined tab-separated output\n" "\n" "Shuffling and sorting\n" " --shuffle FILENAME shuffle order of sequences in FASTA file randomly\n" " --sortbylength FILENAME sort sequences by length in given FASTA file\n" " --sortbysize FILENAME abundance sort sequences in given FASTA file\n" " Parameters\n" " --maxsize INT maximum abundance for sortbysize\n" " --minsize INT minimum abundance for sortbysize\n" " --randseed INT seed for PRNG, zero to use random data source (0)\n" " 
--sizein propagate abundance annotation from input\n" " Output\n" " --output FILENAME output to specified FASTA file\n" " --relabel STRING relabel sequences with this prefix string\n" " --relabel_keep keep the old label after the new when relabelling\n" " --relabel_md5 relabel with md5 digest of normalized sequence\n" " --relabel_self relabel with the sequence itself as label\n" " --relabel_sha1 relabel with sha1 digest of normalized sequence\n" " --sizeout include abundance information when relabelling\n" " --topn INT output just first n sequences\n" " --xsize strip abundance information in output\n" "\n" "Subsampling\n" " --fastx_subsample FILENAME subsample sequences from given FASTA/FASTQ file\n" " Parameters\n" " --fastq_ascii INT FASTQ input quality score ASCII base char (33)\n" " --fastq_qmax INT maximum base quality value for FASTQ input (41)\n" " --fastq_qmin INT minimum base quality value for FASTQ input (0)\n" " --randseed INT seed for PRNG, zero to use random data source (0)\n" " --sample_pct REAL sampling percentage between 0.0 and 100.0\n" " --sample_size INT sampling size\n" " --sizein consider abundance info from input, do not ignore\n" " Output\n" " --fastaout FILENAME output subsampled sequences to FASTA file\n" " --fastaout_discarded FILE output non-subsampled sequences to FASTA file\n" " --fastqout FILENAME output subsampled sequences to FASTQ file\n" " --fastqout_discarded output non-subsampled sequences to FASTQ file\n" " --relabel STRING relabel sequences with this prefix string\n" " --relabel_keep keep the old label after the new when relabelling\n" " --relabel_md5 relabel with md5 digest of normalized sequence\n" " --relabel_self relabel with the sequence itself as label\n" " --relabel_sha1 relabel with sha1 digest of normalized sequence\n" " --sizeout update abundance information in output\n" " --xsize strip abundance information in output\n" "\n" "Taxonomic classification\n" " --sintax FILENAME classify sequences in given FASTA/FASTQ 
file\n" " Parameters\n" " --db FILENAME taxonomic reference db in given FASTA or UDB file\n" " --sintax_cutoff REAL confidence value cutoff level (0.0)\n" " Output\n" " --tabbedout FILENAME write results to given tab-delimited file\n" "\n" "Trimming and filtering\n" " --fastx_filter FILENAME trim and filter sequences in FASTA/FASTQ file\n" " --fastq_filter FILENAME trim and filter sequences in FASTQ file\n" " --reverse FILENAME FASTQ file with other end of paired-end reads\n" " Parameters\n" " --fastq_ascii INT FASTQ input quality score ASCII base char (33)\n" " --fastq_maxee REAL discard if expected error value is higher\n" " --fastq_maxee_rate REAL discard if expected error rate is higher\n" " --fastq_maxlen INT discard if length of sequence is longer\n" " --fastq_maxns INT discard if number of N's is higher\n" " --fastq_minlen INT discard if length of sequence is shorter\n" " --fastq_qmax INT maximum base quality value for FASTQ input (41)\n" " --fastq_qmin INT minimum base quality value for FASTQ input (0)\n" " --fastq_stripleft INT delete given number of bases from the 5' end\n" " --fastq_stripright INT delete given number of bases from the 3' end\n" " --fastq_truncee REAL truncate to given maximum expected error\n" " --fastq_trunclen INT truncate to given length (discard if shorter)\n" " --fastq_trunclen_keep INT truncate to given length (keep if shorter)\n" " --fastq_truncqual INT truncate to given minimum base quality\n" " --maxsize INT discard if abundance of sequence is above\n" " --minsize INT discard if abundance of sequence is below\n" " Output\n" " --eeout include expected errors in output\n" " --fastaout FN FASTA filename for passed sequences\n" " --fastaout_discarded FN FASTA filename for discarded sequences\n" " --fastaout_discarded_rev FN FASTA filename for discarded reverse sequences\n" " --fastaout_rev FN FASTA filename for passed reverse sequences\n" " --fastqout FN FASTQ filename for passed sequences\n" " --fastqout_discarded FN FASTQ filename 
for discarded sequences\n" " --fastqout_discarded_rev FN FASTQ filename for discarded reverse sequences\n" " --fastqout_rev FN FASTQ filename for passed reverse sequences\n" " --relabel STRING relabel filtered sequences with given prefix\n" " --relabel_keep keep the old label after the new when relabelling\n" " --relabel_md5 relabel filtered sequences with md5 digest\n" " --relabel_self relabel with the sequence itself as label\n" " --relabel_sha1 relabel filtered sequences with sha1 digest\n" " --sizeout include abundance information when relabelling\n" " --xee remove expected errors (ee) info from output\n" " --xsize strip abundance information in output\n" "\n" "UDB files\n" " --makeudb_usearch FILENAME make UDB file from given FASTA file\n" " --udb2fasta FILENAME output FASTA file from given UDB file\n" " --udbinfo FILENAME show information about UDB file\n" " --udbstats FILENAME report statistics about indexed words in UDB file\n" " Parameters\n" " --dbmask none|dust|soft mask db with dust, soft or no method (dust)\n" " --hardmask mask by replacing with N instead of lower case\n" " --wordlength INT length of words for database index 3-15 (8)\n" " Output\n" " --output FILENAME UDB or FASTA output file\n" ); } } void cmd_allpairs_global() { /* check options */ if ((!opt_alnout) && (!opt_userout) && (!opt_uc) && (!opt_blast6out) && (!opt_matched) && (!opt_notmatched) && (!opt_samout) && (!opt_fastapairs)) { fatal("No output files specified"); } if (! 
(opt_acceptall || ((opt_id >= 0.0) && (opt_id <= 1.0)))) { fatal("Specify either --acceptall or --id with an identity from 0.0 to 1.0"); } allpairs_global(cmdline, progheader); } void cmd_usearch_global() { /* check options */ if ((!opt_alnout) && (!opt_userout) && (!opt_uc) && (!opt_blast6out) && (!opt_matched) && (!opt_notmatched) && (!opt_dbmatched) && (!opt_dbnotmatched) && (!opt_samout) && (!opt_otutabout) && (!opt_biomout) && (!opt_mothur_shared_out) && (!opt_fastapairs) && (!opt_lcaout)) { fatal("No output files specified"); } if (!opt_db) { fatal("Database filename not specified with --db"); } if ((opt_id < 0.0) || (opt_id > 1.0)) { fatal("Identity between 0.0 and 1.0 must be specified with --id"); } usearch_global(cmdline, progheader); } void cmd_search_exact() { /* check options */ if ((!opt_alnout) && (!opt_userout) && (!opt_uc) && (!opt_blast6out) && (!opt_matched) && (!opt_notmatched) && (!opt_dbmatched) && (!opt_dbnotmatched) && (!opt_samout) && (!opt_otutabout) && (!opt_biomout) && (!opt_mothur_shared_out) && (!opt_fastapairs) && (!opt_lcaout)) { fatal("No output files specified"); } if (!opt_db) { fatal("Database filename not specified with --db"); } search_exact(cmdline, progheader); } void cmd_subsample() { if ((!opt_fastaout) && (!opt_fastqout)) { fatal("Specify output files for subsampling with --fastaout and/or --fastqout"); } if ((opt_sample_pct > 0) == (opt_sample_size > 0)) { fatal("Specify either --sample_pct or --sample_size"); } subsample(); } void cmd_none() { if (! 
opt_quiet) { fprintf(stderr, "For help, please enter: %s --help | more\n" "For further details, please consult the manual by entering: man vsearch\n" "\n" "Selected command examples:\n" "\n" "vsearch --allpairs_global FILENAME --id 0.5 --alnout FILENAME\n" "vsearch --cluster_size FILENAME --id 0.97 --centroids FILENAME\n" "vsearch --cut FILENAME --cut_pattern G^AATT_C --fastaout FILENAME\n" "vsearch --fastq_chars FILENAME\n" "vsearch --fastq_convert FILENAME --fastqout FILENAME --fastq_ascii 64\n" "vsearch --fastq_eestats FILENAME --output FILENAME\n" "vsearch --fastq_eestats2 FILENAME --output FILENAME\n" "vsearch --fastq_mergepairs FILENAME --reverse FILENAME --fastqout FILENAME\n" "vsearch --fastq_stats FILENAME --log FILENAME\n" "vsearch --fastx_filter FILENAME --fastaout FILENAME --fastq_trunclen 100\n" "vsearch --fastx_getseq FILENAME --label LABEL --fastaout FILENAME\n" "vsearch --fastx_mask FILENAME --fastaout FILENAME\n" "vsearch --fastx_revcomp FILENAME --fastqout FILENAME\n" "vsearch --fastx_subsample FILENAME --fastaout FILENAME --sample_pct 1\n" "vsearch --fastx_uniques FILENAME --output FILENAME\n" "vsearch --makeudb_usearch FILENAME --output FILENAME\n" "vsearch --search_exact FILENAME --db FILENAME --alnout FILENAME\n" "vsearch --sff_convert FILENAME --output FILENAME --sff_clip\n" "vsearch --shuffle FILENAME --output FILENAME\n" "vsearch --sintax FILENAME --db FILENAME --tabbedout FILENAME\n" "vsearch --sortbylength FILENAME --output FILENAME\n" "vsearch --sortbysize FILENAME --output FILENAME\n" "vsearch --uchime_denovo FILENAME --nonchimeras FILENAME\n" "vsearch --uchime_ref FILENAME --db FILENAME --nonchimeras FILENAME\n" "vsearch --usearch_global FILENAME --db FILENAME --id 0.97 --alnout FILENAME\n" "\n" "Other commands: cluster_fast, cluster_smallmem, cluster_unoise, cut,\n" " derep_id, derep_fulllength, derep_prefix, fasta2fastq,\n" " fastq_filter, fastq_join, fastx_getseqs, fastx_getsubseqs,\n" " maskfasta, orient, rereplicate, 
uchime2_denovo,\n" " uchime3_denovo, udb2fasta, udbinfo, udbstats, version\n" "\n", progname); } } void cmd_cluster() { if ((!opt_alnout) && (!opt_userout) && (!opt_uc) && (!opt_blast6out) && (!opt_matched) && (!opt_notmatched) && (!opt_centroids) && (!opt_clusters) && (!opt_consout) && (!opt_msaout) && (!opt_samout) && (!opt_profile) && (!opt_otutabout) && (!opt_biomout) && (!opt_mothur_shared_out)) { fatal("No output files specified"); } if (!opt_cluster_unoise) { if ((opt_id < 0.0) || (opt_id > 1.0)) { fatal("Identity between 0.0 and 1.0 must be specified with --id"); } } if (opt_cluster_fast) { cluster_fast(cmdline, progheader); } else if (opt_cluster_smallmem) { cluster_smallmem(cmdline, progheader); } else if (opt_cluster_size) { cluster_size(cmdline, progheader); } else if (opt_cluster_unoise) { cluster_unoise(cmdline, progheader); } } void cmd_uchime() { if ((!opt_chimeras) && (!opt_nonchimeras) && (!opt_uchimeout) && (!opt_uchimealns)) { fatal("No output files specified"); } if (opt_uchime_ref && ! 
opt_db) { fatal("Database filename not specified with --db"); } if (opt_xn <= 1.0) { fatal("Argument to --xn must be > 1"); } if (opt_dn <= 0.0) { fatal("Argument to --dn must be > 0"); } if ((!opt_uchime2_denovo) && (!opt_uchime3_denovo)) { if (opt_mindiffs <= 0) { fatal("Argument to --mindiffs must be > 0"); } if (opt_mindiv <= 0.0) { fatal("Argument to --mindiv must be > 0"); } if (opt_minh <= 0.0) { fatal("Argument to --minh must be > 0"); } } #if 0 if (opt_abskew <= 1.0) fatal("Argument to --abskew must be > 1"); #endif chimera(); } void cmd_fastq_mergepairs() { if (!opt_reverse) { fatal("No reverse reads file specified with --reverse"); } if ((!opt_fastqout) && (!opt_fastaout) && (!opt_fastqout_notmerged_fwd) && (!opt_fastqout_notmerged_rev) && (!opt_fastaout_notmerged_fwd) && (!opt_fastaout_notmerged_rev) && (!opt_eetabbedout)) { fatal("No output files specified"); } fastq_mergepairs(); } void fillheader() { constexpr double one_gigabyte {1024.0 * 1024.0 * 1024.0}; snprintf(progheader, 80, "%s v%s_%s, %.1fGB RAM, %ld cores", PROG_NAME, PROG_VERSION, PROG_ARCH, arch_get_memtotal() / one_gigabyte, arch_get_cores()); } void getentirecommandline(int argc, char** argv) { int len = 0; for (int i=0; i<argc; i++) { len += strlen(argv[i]); } cmdline = (char*) xmalloc(len+argc); cmdline[0] = 0; for (int i=0; i<argc; i++) { if (i>0) { strcat(cmdline, " "); } strcat(cmdline, argv[i]); } } void show_header() { if (! 
opt_quiet) { fprintf(stderr, "%s\n", progheader); fprintf(stderr, "https://github.com/torognes/vsearch\n"); fprintf(stderr, "\n"); } } int main(int argc, char** argv) { fillheader(); getentirecommandline(argc, argv); cpu_features_detect(); args_init(argc, argv); if (opt_log) { fp_log = fopen_output(opt_log); if (!fp_log) { fatal("Unable to open log file for writing"); } fprintf(fp_log, "%s\n", progheader); fprintf(fp_log, "%s\n", cmdline); char time_string[26]; time_start = time(nullptr); struct tm * tm_start = localtime(& time_start); strftime(time_string, 26, "%c", tm_start); fprintf(fp_log, "Started %s\n", time_string); } random_init(); show_header(); dynlibs_open(); #ifdef __x86_64__ if (!sse2_present) { fatal("Sorry, this program requires a cpu with SSE2."); } #endif if (opt_help) { cmd_help(); } else if (opt_allpairs_global) { cmd_allpairs_global(); } else if (opt_usearch_global) { cmd_usearch_global(); } else if (opt_sortbysize) { sortbysize(); } else if (opt_sortbylength) { sortbylength(); } else if (opt_derep_fulllength) { derep(opt_derep_fulllength, false); } else if (opt_derep_prefix) { derep_prefix(); } else if (opt_derep_id) { derep(opt_derep_id, true); } else if (opt_shuffle) { shuffle(); } else if (opt_fastx_subsample) { cmd_subsample(); } else if (opt_maskfasta) { maskfasta(); } else if (opt_cluster_smallmem || opt_cluster_fast || opt_cluster_size || opt_cluster_unoise) { cmd_cluster(); } else if (opt_uchime_denovo || opt_uchime_ref || opt_uchime2_denovo || opt_uchime3_denovo) { cmd_uchime(); } else if (opt_fastq_chars) { fastq_chars(); } else if (opt_fastq_stats) { fastq_stats(); } else if (opt_fastq_filter) { fastq_filter(); } else if (opt_fastx_filter) { fastx_filter(); } else if (opt_fastx_revcomp) { fastx_revcomp(); } else if (opt_search_exact) { cmd_search_exact(); } else if (opt_fastx_mask) { fastx_mask(); } else if (opt_fastq_convert) { fastq_convert(); } else if (opt_fastq_mergepairs) { cmd_fastq_mergepairs(); } else if (opt_fastq_eestats) 
{ fastq_eestats(); } else if (opt_fastq_eestats2) { fastq_eestats2(); } else if (opt_fastq_join) { fastq_join(); } else if (opt_rereplicate) { rereplicate(); } else if (opt_version) { cmd_version(); } else if (opt_makeudb_usearch) { udb_make(); } else if (opt_udb2fasta) { udb_fasta(); } else if (opt_udbinfo) { udb_info(); } else if (opt_udbstats) { udb_stats(); } else if (opt_sintax) { sintax(); } else if (opt_sff_convert) { sff_convert(); } else if (opt_fastx_getseq) { fastx_getseq(); } else if (opt_fastx_getseqs) { fastx_getseqs(); } else if (opt_fastx_getsubseq) { fastx_getsubseq(); } else if (opt_cut) { cut(); } else if (opt_orient) { orient(); } else if (opt_fasta2fastq) { fasta2fastq(); } else if (opt_fastx_uniques) { derep(opt_fastx_uniques, false); } else { cmd_none(); } if (opt_log) { time_finish = time(nullptr); struct tm * tm_finish = localtime(& time_finish); char time_string[26]; strftime(time_string, 26, "%c", tm_finish); fprintf(fp_log, "\n"); fprintf(fp_log, "Finished %s", time_string); double time_diff = difftime(time_finish, time_start); fprintf(fp_log, "\n"); fprintf(fp_log, "Elapsed time %02.0lf:%02.0lf\n", floor(time_diff / 60.0), floor(time_diff - 60.0 * floor(time_diff / 60.0))); double maxmem = arch_get_memused() / 1048576.0; if (maxmem < 1024.0) { fprintf(fp_log, "Max memory %.1lfMB\n", maxmem); } else { fprintf(fp_log, "Max memory %.1lfGB\n", maxmem/1024.0); } fclose(fp_log); } if (opt_ee_cutoffs_values) { xfree(opt_ee_cutoffs_values); } opt_ee_cutoffs_values = nullptr; xfree(cmdline); dynlibs_close(); } vsearch-2.21.1/src/bitmap.cc0000644000175000017500000000533414171574117015145 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. 
Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" bitmap_t * bitmap_init(unsigned int size) { auto * b = (bitmap_t*) xmalloc(sizeof(bitmap_t)); b->size = size; b->bitmap = (unsigned char *) xmalloc((size+7)/8); return b; } void bitmap_free(bitmap_t* b) { if (b->bitmap) { xfree(b->bitmap); } xfree(b); } vsearch-2.21.1/src/rerep.cc0000644000175000017500000001073214171574117015004 0ustar nileshnilesh /* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . 
The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" void rereplicate() { if (!opt_output) fatal("FASTA output file for rereplicate must be specified with --output"); opt_xsize = true; FILE * fp_output = nullptr; if (opt_output) { fp_output = fopen_output(opt_output); if (!fp_output) { fatal("Unable to open FASTA output file for writing"); } } fastx_handle fh = fasta_open(opt_rereplicate); int64_t filesize = fasta_get_size(fh); progress_init("Rereplicating", filesize); int64_t missing = 0; int64_t i = 0; int64_t n = 0; while (fasta_next(fh, ! 
opt_notrunclabels, chrmap_no_change)) { n++; int64_t abundance = fasta_get_abundance_and_presence(fh); if (abundance == 0) { missing++; abundance = 1; } for(int64_t j=0; j, Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" /* Find the unique kmers or words in a given sequence. Unique is now defined as all different words occurring at least once. Earlier it was defined as those words occurring exactly once, but that caused a problem when searching for sequences with many repeats. */ #define HASH CityHash64 struct bucket_s { unsigned int kmer; unsigned int count; }; struct uhandle_s { struct bucket_s * hash; unsigned int * list; unsigned int hash_mask; int size; int alloc; uint64_t bitmap_size; uint64_t * bitmap; }; struct uhandle_s * unique_init() { auto * uh = (struct uhandle_s *) xmalloc(sizeof(struct uhandle_s)); uh->alloc = 2048; uh->size = 0; uh->hash_mask = uh->alloc - 1; uh->hash = (struct bucket_s *) xmalloc(sizeof(struct bucket_s) * uh->alloc); uh->list = (unsigned int *) xmalloc(sizeof(unsigned int) * uh->alloc); uh->bitmap_size = 0; uh->bitmap = nullptr; return uh; } void unique_exit(struct uhandle_s * uh) { if (uh->bitmap) { xfree(uh->bitmap); } if (uh->hash) { xfree(uh->hash); } if (uh->list) { xfree(uh->list); } xfree(uh); } int unique_compare(const void * a, const void * b) { auto * x = (unsigned int*) a; auto * y = (unsigned int*) b; if (x<y) { return -1; } else if (x>y) { return +1; } else { return 0; } } void unique_count_bitmap(struct uhandle_s * uh, int k, int seqlen, char * seq, unsigned int * listlen, unsigned int * * list, int seqmask) { /* if necessary, reallocate list of unique kmers */ if (uh->alloc < seqlen) { while (uh->alloc < seqlen) { uh->alloc *= 2; } 
uh->list = (unsigned int *) xrealloc(uh->list, sizeof(unsigned int) * uh->alloc); } uint64_t size = 1ULL << (k << 1ULL); /* reallocate bitmap arrays if necessary */ if (uh->bitmap_size < size) { uh->bitmap = (uint64_t *) xrealloc(uh->bitmap, size >> 3ULL); uh->bitmap_size = size; } memset(uh->bitmap, 0, size >> 3ULL); uint64_t bad = 0; uint64_t kmer = 0; uint64_t mask = size - 1ULL; char * s = seq; char * e1 = s + k-1; char * e2 = s + seqlen; if (e2 < e1) { e1 = e2; } unsigned int * maskmap = (seqmask != MASK_NONE) ? chrmap_mask_lower : chrmap_mask_ambig; while (s < e1) { bad <<= 2ULL; bad |= maskmap[(int)(*s)]; kmer <<= 2ULL; kmer |= chrmap_2bit[(int)(*s++)]; } int unique = 0; while (s < e2) { bad <<= 2ULL; bad |= maskmap[(int)(*s)]; bad &= mask; kmer <<= 2ULL; kmer |= chrmap_2bit[(int)(*s++)]; kmer &= mask; if (!bad) { uint64_t x = kmer >> 6ULL; uint64_t y = 1ULL << (kmer & 63ULL); if (!(uh->bitmap[x] & y)) { /* not seen before */ uh->list[unique++] = kmer; uh->bitmap[x] |= y; } } } *listlen = unique; *list = uh->list; } void unique_count_hash(struct uhandle_s * uh, int k, int seqlen, char * seq, unsigned int * listlen, unsigned int * * list, int seqmask) { /* if necessary, reallocate hash table and list of unique kmers */ if (uh->alloc < 2*seqlen) { while (uh->alloc < 2*seqlen) { uh->alloc *= 2; } uh->hash = (struct bucket_s *) xrealloc(uh->hash, sizeof(struct bucket_s) * uh->alloc); uh->list = (unsigned int *) xrealloc(uh->list, sizeof(unsigned int) * uh->alloc); } /* hashtable variant */ uh->size = 1; while (uh->size < 2*seqlen) { uh->size *= 2; } uh->hash_mask = uh->size - 1; memset(uh->hash, 0, sizeof(struct bucket_s) * uh->size); uint64_t bad = 0; uint64_t j; unsigned int kmer = 0; unsigned int mask = (1ULL<<(2ULL*k)) - 1ULL; char * s = seq; char * e1 = s + k-1; char * e2 = s + seqlen; if (e2 < e1) { e1 = e2; } unsigned int * maskmap = (seqmask != MASK_NONE) ? 
    chrmap_mask_lower : chrmap_mask_ambig;

  while (s < e1)
    {
      bad <<= 2ULL;
      bad |= maskmap[(int)(*s)];
      kmer <<= 2ULL;
      kmer |= chrmap_2bit[(int)(*s++)];
    }

  uint64_t unique = 0;

  while (s < e2)
    {
      bad <<= 2ULL;
      bad |= maskmap[(int)(*s)];
      bad &= mask;

      kmer <<= 2ULL;
      kmer |= chrmap_2bit[(int)(*s++)];
      kmer &= mask;

      if (!bad)
        {
          /* find free appropriate bucket in hash */
          j = HASH((char*)&kmer, (k+3)/4) & uh->hash_mask;
          while((uh->hash[j].count) && (uh->hash[j].kmer != kmer))
            {
              j = (j + 1) & uh->hash_mask;
            }

          if (!(uh->hash[j].count))
            {
              /* not seen before */
              uh->list[unique++] = kmer;
              uh->hash[j].kmer = kmer;
              uh->hash[j].count = 1;
            }
        }
    }

  *listlen = unique;
  *list = uh->list;
}

void unique_count(struct uhandle_s * uh,
                  int k,
                  int seqlen,
                  char * seq,
                  unsigned int * listlen,
                  unsigned int * * list,
                  int seqmask)
{
  /* the bitmap variant is used for short words, the hash variant otherwise */
  if (k<10)
    {
      unique_count_bitmap(uh, k, seqlen, seq, listlen, list, seqmask);
    }
  else
    {
      unique_count_hash(uh, k, seqlen, seq, listlen, list, seqmask);
    }
}

int unique_count_shared(struct uhandle_s * uh,
                        int k,
                        int listlen,
                        unsigned int * list)
{
  /* counts how many of the kmers in list are present in the
     (already computed) hash or bitmap */

  int count = 0;
  if (k<10)
    {
      for(int i = 0; i<listlen; i++)
        {
          unsigned int kmer = list[i];
          uint64_t x = kmer >> 6ULL;
          uint64_t y = 1ULL << (kmer & 63ULL);
          if (uh->bitmap[x] & y)
            {
              count++;
            }
        }
    }
  else
    {
      for(int i = 0; i<listlen; i++)
        {
          unsigned int kmer = list[i];
          uint64_t j = HASH((char*)&kmer, (k+3)/4) & uh->hash_mask;
          while((uh->hash[j].count) && (uh->hash[j].kmer != kmer))
            {
              j = (j + 1) & uh->hash_mask;
            }
          if (uh->hash[j].count)
            {
              count++;
            }
        }
    }
  return count;
}

vsearch-2.21.1/src/arch.cc

/*

  VSEARCH: a versatile open source tool for metagenomics

  Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
  All rights reserved.

  Contact: Torbjorn Rognes,
  Department of Informatics, University of Oslo,
  PO Box 1080 Blindern, NO-0316 Oslo, Norway

  This software is dual-licensed and available under a choice
  of one of two licenses, either under the terms of the GNU
  General Public License version 3 or the BSD 2-Clause License.
GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "vsearch.h" const int memalignment = 16; uint64_t arch_get_memused() { #ifdef _WIN32 PROCESS_MEMORY_COUNTERS pmc; GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(PROCESS_MEMORY_COUNTERS)); return pmc.PeakWorkingSetSize; #else struct rusage r_usage; getrusage(RUSAGE_SELF, & r_usage); # ifdef __APPLE__ /* Mac: ru_maxrss gives the size in bytes */ return r_usage.ru_maxrss; # else /* Linux: ru_maxrss gives the size in kilobytes */ return r_usage.ru_maxrss * 1024; # endif #endif } uint64_t arch_get_memtotal() { #ifdef _WIN32 MEMORYSTATUSEX ms; ms.dwLength = sizeof(MEMORYSTATUSEX); GlobalMemoryStatusEx(&ms); return ms.ullTotalPhys; #elif defined(__APPLE__) int mib [] = { CTL_HW, HW_MEMSIZE }; int64_t ram = 0; size_t length = sizeof(ram); if(sysctl(mib, 2, &ram, &length, NULL, 0) == -1) fatal("Cannot determine amount of RAM"); return ram; #elif defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE) int64_t phys_pages = sysconf(_SC_PHYS_PAGES); int64_t pagesize = sysconf(_SC_PAGESIZE); if ((phys_pages == -1) || (pagesize == -1)) { fatal("Cannot determine amount of RAM"); } return pagesize * phys_pages; #else struct sysinfo si; if (sysinfo(&si)) fatal("Cannot determine amount of RAM"); return si.totalram * si.mem_unit; #endif } long arch_get_cores() { #ifdef _WIN32 SYSTEM_INFO si; GetSystemInfo(&si); return si.dwNumberOfProcessors; #else return sysconf(_SC_NPROCESSORS_ONLN); #endif } void arch_get_user_system_time(double * user_time, double * system_time) { *user_time = 0; *system_time = 0; #ifdef _WIN32 HANDLE hProcess = GetCurrentProcess(); FILETIME ftCreation, ftExit, ftKernel, ftUser; ULARGE_INTEGER ul; GetProcessTimes(hProcess, &ftCreation, &ftExit, &ftKernel, &ftUser); ul.u.HighPart = ftUser.dwHighDateTime; ul.u.LowPart = ftUser.dwLowDateTime; *user_time = ul.QuadPart * 100.0e-9; ul.u.HighPart = ftKernel.dwHighDateTime; ul.u.LowPart = ftKernel.dwLowDateTime; *system_time = ul.QuadPart * 100.0e-9; #else struct rusage r_usage; getrusage(RUSAGE_SELF, & 
r_usage); * user_time = r_usage.ru_utime.tv_sec * 1.0 + r_usage.ru_utime.tv_usec * 1.0e-6; * system_time = r_usage.ru_stime.tv_sec * 1.0 + r_usage.ru_stime.tv_usec * 1.0e-6; #endif } void arch_srandom() { /* initialize pseudo-random number generator */ unsigned int seed = opt_randseed; if (seed == 0) { #ifdef _WIN32 srand(GetTickCount()); #else int fd = open("/dev/urandom", O_RDONLY); if (fd < 0) { fatal("Unable to open /dev/urandom"); } if (read(fd, & seed, sizeof(seed)) < 0) { fatal("Unable to read from /dev/urandom"); } close(fd); srandom(seed); #endif } else { #ifdef _WIN32 srand(seed); #else srandom(seed); #endif } } uint64_t arch_random() { #ifdef _WIN32 return rand(); #else return random(); #endif } void * xmalloc(size_t size) { if (size == 0) { size = 1; } void * t = nullptr; #ifdef _WIN32 t = _aligned_malloc(size, memalignment); #else if (posix_memalign(& t, memalignment, size)) { t = nullptr; } #endif if (!t) { fatal("Unable to allocate enough memory."); } return t; } void * xrealloc(void *ptr, size_t size) { if (size == 0) { size = 1; } #ifdef _WIN32 void * t = _aligned_realloc(ptr, size, memalignment); #else void * t = realloc(ptr, size); #endif if (!t) { fatal("Unable to reallocate enough memory."); } return t; } void xfree(void * ptr) { if (ptr) { #ifdef _WIN32 _aligned_free(ptr); #else free(ptr); #endif } else { fatal("Trying to free a null pointer"); } } int xfstat(int fd, xstat_t * buf) { #ifdef _WIN32 return _fstat64(fd, buf); #else return fstat(fd, buf); #endif } int xstat(const char * path, xstat_t * buf) { #ifdef _WIN32 return _stat64(path, buf); #else return stat(path, buf); #endif } uint64_t xlseek(int fd, uint64_t offset, int whence) { #ifdef _WIN32 return _lseeki64(fd, offset, whence); #else return lseek(fd, offset, whence); #endif } uint64_t xftello(FILE * stream) { #ifdef _WIN32 return _ftelli64(stream); #else return ftello(stream); #endif } int xopen_read(const char * path) { #ifdef _WIN32 return _open(path, _O_RDONLY | _O_BINARY); #else 
  return open(path, O_RDONLY);
#endif
}

int xopen_write(const char * path)
{
#ifdef _WIN32
  return _open(path,
               _O_WRONLY | _O_CREAT | _O_TRUNC | _O_BINARY,
               _S_IREAD | _S_IWRITE);
#else
  return open(path,
              O_WRONLY | O_CREAT | O_TRUNC,
              S_IRUSR | S_IWUSR);
#endif
}

const char * xstrcasestr(const char * haystack, const char * needle)
{
#ifdef _WIN32
  return StrStrIA(haystack, needle);
#else
  return strcasestr(haystack, needle);
#endif
}

#ifdef _WIN32
FARPROC arch_dlsym(HMODULE handle, const char * symbol)
#else
void * arch_dlsym(void * handle, const char * symbol)
#endif
{
#ifdef _WIN32
  return GetProcAddress(handle, symbol);
#else
  return dlsym(handle, symbol);
#endif
}

vsearch-2.21.1/src/derep.h

/*

  VSEARCH: a versatile open source tool for metagenomics

  Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
  All rights reserved.

  Contact: Torbjorn Rognes,
  Department of Informatics, University of Oslo,
  PO Box 1080 Blindern, NO-0316 Oslo, Norway

  This software is dual-licensed and available under a choice
  of one of two licenses, either under the terms of the GNU
  General Public License version 3 or the BSD 2-Clause License.


  GNU General Public License version 3

  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program. If not, see .


  The BSD 2-Clause License

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1.
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void derep(char * input_filename, bool use_header); void derep_prefix(); vsearch-2.21.1/src/cpu.h0000644000175000017500000000600014171574117014311 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. 
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/

#ifdef __x86_64__

void increment_counters_from_bitmap_sse2(count_t * counters,
                                         unsigned char * bitmap,
                                         unsigned int totalbits);

void increment_counters_from_bitmap_ssse3(count_t * counters,
                                          unsigned char * bitmap,
                                          unsigned int totalbits);

#else

void increment_counters_from_bitmap(count_t * counters,
                                    unsigned char * bitmap,
                                    unsigned int totalbits);

#endif

vsearch-2.21.1/src/fastx.cc

/*

  VSEARCH: a versatile open source tool for metagenomics

  Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
  All rights reserved.

  Contact: Torbjorn Rognes,
  Department of Informatics, University of Oslo,
  PO Box 1080 Blindern, NO-0316 Oslo, Norway

  This software is dual-licensed and available under a choice
  of one of two licenses, either under the terms of the GNU
  General Public License version 3 or the BSD 2-Clause License.


  GNU General Public License version 3

  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program. If not, see .


  The BSD 2-Clause License

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
  notice, this list of conditions and the following disclaimer.

  2.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" /* file compression and format detector */ /* basic file buffering function for fastq and fastx parsers */ #define FASTX_BUFFER_ALLOC 8192 #ifdef HAVE_BZLIB_H #define BZ_VERBOSE_0 0 #define BZ_VERBOSE_1 1 #define BZ_VERBOSE_2 2 #define BZ_VERBOSE_3 3 #define BZ_VERBOSE_4 4 #define BZ_MORE_MEM 0 /* faster decompression using more memory */ #define BZ_LESS_MEM 1 /* slower decompression but requires less memory */ #endif #define FORMAT_PLAIN 1 #define FORMAT_BZIP 2 #define FORMAT_GZIP 3 static unsigned char MAGIC_GZIP[] = "\x1f\x8b"; static unsigned char MAGIC_BZIP[] = "BZ"; void buffer_init(struct fastx_buffer_s * buffer) { buffer->alloc = FASTX_BUFFER_ALLOC; buffer->data = (char*) xmalloc(buffer->alloc); buffer->data[0] = 0; buffer->length = 0; buffer->position = 0; } void buffer_free(struct fastx_buffer_s * buffer) { if (buffer->data) { xfree(buffer->data); } buffer->data = nullptr; buffer->alloc = 0; buffer->length = 0; buffer->position = 0; } void buffer_makespace(struct fastx_buffer_s * buffer, 
uint64_t x) { /* make sure there is space for x more chars in buffer */ if (buffer->length + x > buffer->alloc) { /* alloc space for x more characters, but round up to nearest block size */ buffer->alloc = ((buffer->length + x + FASTX_BUFFER_ALLOC - 1) / FASTX_BUFFER_ALLOC) * FASTX_BUFFER_ALLOC; buffer->data = (char*) xrealloc(buffer->data, buffer->alloc); } } void buffer_extend(struct fastx_buffer_s * dest_buffer, char * source_buf, uint64_t len) { buffer_makespace(dest_buffer, len+1); memcpy(dest_buffer->data + dest_buffer->length, source_buf, len); dest_buffer->length += len; dest_buffer->data[dest_buffer->length] = 0; } void fastx_filter_header(fastx_handle h, bool truncateatspace) { /* filter and truncate header */ char * p = h->header_buffer.data; char * q = p; while (true) { unsigned char c = *p++; unsigned int m = char_header_action[c]; switch(m) { case 1: /* legal, printable character */ *q++ = c; break; case 2: /* illegal, fatal */ fprintf(stderr, "\n\n" "Fatal error: Illegal character encountered in FASTA/FASTQ header.\n" "Unprintable ASCII character no %d on or right before line %" PRIu64 ".\n", c, h->lineno); if (fp_log) { fprintf(fp_log, "\n\n" "Fatal error: Illegal character encountered in FASTA/FASTQ header.\n" "Unprintable ASCII character no %d on or right before line %" PRIu64 ".\n", c, h->lineno); } exit(EXIT_FAILURE); case 7: /* Non-ASCII but acceptable */ fprintf(stderr, "\n" "WARNING: Non-ASCII character encountered in FASTA/FASTQ header.\n" "Character no %d (0x%2x) on or right before line %" PRIu64 ".\n", c, c, h->lineno); if (fp_log) { fprintf(fp_log, "\n" "WARNING: Non-ASCII character encountered in FASTA/FASTQ header.\n" "Character no %d (0x%2x) on or right before line %" PRIu64 ".\n", c, c, h->lineno); } *q++ = c; break; case 5: case 6: /* tab or space */ /* conditional end of line */ if (truncateatspace) { goto end_of_line; } *q++ = c; break; case 0: /* null */ case 3: /* cr */ case 4: /* lf */ /* end of line */ goto end_of_line; 
default: fatal("Internal error"); break; } } end_of_line: /* add a null character at the end */ *q = 0; h->header_buffer.length = q - h->header_buffer.data; } fastx_handle fastx_open(const char * filename) { auto * h = (fastx_handle) xmalloc(sizeof(struct fastx_s)); h->fp = nullptr; #ifdef HAVE_ZLIB_H h->fp_gz = nullptr; #endif #ifdef HAVE_BZLIB_H h->fp_bz = nullptr; int bzError = 0; #endif h->fp = fopen_input(filename); if (!h->fp) { fatal("Unable to open file for reading (%s)", filename); } /* Get mode and size of original (uncompressed) file */ xstat_t fs; if (xfstat(fileno(h->fp), & fs)) { fatal("Unable to get status for input file (%s)", filename); } h->is_pipe = S_ISFIFO(fs.st_mode); if (h->is_pipe) { h->file_size = 0; } else { h->file_size = fs.st_size; } if (opt_gzip_decompress) { h->format = FORMAT_GZIP; } else if (opt_bzip2_decompress) { h->format = FORMAT_BZIP; } else if (h->is_pipe) { h->format = FORMAT_PLAIN; } else { /* autodetect compression (plain, gzipped or bzipped) */ /* read two characters and compare with magic */ unsigned char magic[2]; h->format = FORMAT_PLAIN; size_t bytes_read = fread(&magic, 1, 2, h->fp); if (bytes_read >= 2) { if (memcmp(magic, MAGIC_GZIP, 2) == 0) { h->format = FORMAT_GZIP; } else if (memcmp(magic, MAGIC_BZIP, 2) == 0) { h->format = FORMAT_BZIP; } } else { /* consider it an empty file or a tiny fasta file, uncompressed */ } /* close and reopen to avoid problems with gzip library */ /* rewind was not enough */ fclose(h->fp); h->fp = fopen_input(filename); if (!h->fp) { fatal("Unable to open file for reading (%s)", filename); } } if (h->format == FORMAT_GZIP) { /* GZIP: Keep original file open, then open as gzipped file as well */ #ifdef HAVE_ZLIB_H if (!gz_lib) { fatal("Files compressed with gzip are not supported"); } if (! (h->fp_gz = (*gzdopen_p)(fileno(h->fp), "rb"))) { // dup? 
fatal("Unable to open gzip compressed file (%s)", filename); } #else fatal("Files compressed with gzip are not supported"); #endif } if (h->format == FORMAT_BZIP) { /* BZIP2: Keep original file open, then open as bzipped file as well */ #ifdef HAVE_BZLIB_H if (!bz2_lib) { fatal("Files compressed with bzip2 are not supported"); } if (! (h->fp_bz = (*BZ2_bzReadOpen_p)(& bzError, h->fp, BZ_VERBOSE_0, BZ_MORE_MEM, nullptr, 0))) { fatal("Unable to open bzip2 compressed file (%s)", filename); } #else fatal("Files compressed with bzip2 are not supported"); #endif } /* init buffers */ h->file_position = 0; buffer_init(& h->file_buffer); /* start filling up file buffer */ uint64_t rest = fastx_file_fill_buffer(h); /* examine first char and see if it starts with > or @ */ int filetype = 0; h->is_empty = true; h->is_fastq = false; if (rest > 0) { h->is_empty = false; char * first = h->file_buffer.data; if (*first == '>') { filetype = 1; } else if (*first == '@') { filetype = 2; h->is_fastq = true; } if (filetype == 0) { /* close files if unrecognized file type */ switch(h->format) { case FORMAT_PLAIN: break; case FORMAT_GZIP: #ifdef HAVE_ZLIB_H (*gzclose_p)(h->fp_gz); h->fp_gz = nullptr; break; #endif case FORMAT_BZIP: #ifdef HAVE_BZLIB_H (*BZ2_bzReadClose_p)(&bzError, h->fp_bz); h->fp_bz = nullptr; break; #endif default: fatal("Internal error"); } fclose(h->fp); h->fp = nullptr; if (rest >= 2) { if (memcmp(first, MAGIC_GZIP, 2) == 0) { fatal("File appears to be gzip compressed. Please use --gzip_decompress"); } if (memcmp(first, MAGIC_BZIP, 2) == 0) { fatal("File appears to be bzip2 compressed. 
Please use --bzip2_decompress"); } } fatal("File type not recognized."); return nullptr; } } /* more initialization */ buffer_init(& h->header_buffer); buffer_init(& h->sequence_buffer); buffer_init(& h->plusline_buffer); buffer_init(& h->quality_buffer); h->stripped_all = 0; for(uint64_t & i : h->stripped) { i = 0; } h->lineno = 1; h->lineno_start = 1; h->seqno = -1; return h; } bool fastx_is_fastq(fastx_handle h) { return h->is_fastq || h->is_empty; } bool fastx_is_empty(fastx_handle h) { return h->is_empty; } void fastx_close(fastx_handle h) { /* Warn about stripped chars */ if (h->stripped_all) { fprintf(stderr, "WARNING: %" PRIu64 " invalid characters stripped from %s file:", h->stripped_all, (h->is_fastq ? "FASTQ" : "FASTA")); for (int i=0; i<256;i++) { if (h->stripped[i]) { fprintf(stderr, " %c(%" PRIu64 ")", i, h->stripped[i]); } } fprintf(stderr, "\n"); fprintf(stderr, "REMINDER: vsearch does not support amino acid sequences\n"); if (opt_log) { fprintf(fp_log, "WARNING: %" PRIu64 " invalid characters stripped from %s file:", h->stripped_all, (h->is_fastq ? 
"FASTQ" : "FASTA")); for (int i=0; i<256;i++) { if (h->stripped[i]) { fprintf(fp_log, " %c(%" PRIu64 ")", i, h->stripped[i]); } } fprintf(fp_log, "\n"); fprintf(fp_log, "REMINDER: vsearch does not support amino acid sequences\n"); } } #ifdef HAVE_BZLIB_H int bz_error; #endif switch(h->format) { case FORMAT_PLAIN: break; case FORMAT_GZIP: #ifdef HAVE_ZLIB_H (*gzclose_p)(h->fp_gz); h->fp_gz = nullptr; break; #endif case FORMAT_BZIP: #ifdef HAVE_BZLIB_H (*BZ2_bzReadClose_p)(&bz_error, h->fp_bz); h->fp_bz = nullptr; break; #endif default: fatal("Internal error"); } fclose(h->fp); h->fp = nullptr; buffer_free(& h->file_buffer); buffer_free(& h->header_buffer); buffer_free(& h->sequence_buffer); buffer_free(& h->plusline_buffer); buffer_free(& h->quality_buffer); h->file_size = 0; h->file_position = 0; h->lineno = 0; h->seqno = -1; xfree(h); h=nullptr; } uint64_t fastx_file_fill_buffer(fastx_handle h) { /* read more data if necessary */ uint64_t rest = h->file_buffer.length - h->file_buffer.position; if (rest > 0) { return rest; } else { uint64_t space = h->file_buffer.alloc - h->file_buffer.length; if (space == 0) { /* back to beginning of buffer */ h->file_buffer.position = 0; h->file_buffer.length = 0; space = h->file_buffer.alloc; } int bytes_read = 0; #ifdef HAVE_BZLIB_H int bzError = 0; #endif switch(h->format) { case FORMAT_PLAIN: bytes_read = fread(h->file_buffer.data + h->file_buffer.position, 1, space, h->fp); break; case FORMAT_GZIP: #ifdef HAVE_ZLIB_H bytes_read = (*gzread_p)(h->fp_gz, h->file_buffer.data + h->file_buffer.position, space); if (bytes_read < 0) { fatal("Unable to read gzip compressed file"); } break; #endif case FORMAT_BZIP: #ifdef HAVE_BZLIB_H bytes_read = (*BZ2_bzRead_p)(& bzError, h->fp_bz, h->file_buffer.data + h->file_buffer.position, space); if ((bytes_read < 0) || ! 
((bzError == BZ_OK) || (bzError == BZ_STREAM_END) || (bzError == BZ_SEQUENCE_ERROR))) { fatal("Unable to read from bzip2 compressed file"); } break; #endif default: fatal("Internal error"); } if (!h->is_pipe) { #ifdef HAVE_ZLIB_H if (h->format == FORMAT_GZIP) { /* Circumvent the missing gzoffset function in zlib 1.2.3 and earlier */ int fd = dup(fileno(h->fp)); h->file_position = xlseek(fd, 0, SEEK_CUR); close(fd); } else #endif { h->file_position = xftello(h->fp); } } h->file_buffer.length += bytes_read; return bytes_read; } } bool fastx_next(fastx_handle h, bool truncateatspace, const unsigned char * char_mapping) { if (h->is_fastq) { return fastq_next(h, truncateatspace, char_mapping); } else { return fasta_next(h, truncateatspace, char_mapping); } } uint64_t fastx_get_position(fastx_handle h) { if (h->is_fastq) { return fastq_get_position(h); } else { return fasta_get_position(h); } } uint64_t fastx_get_size(fastx_handle h) { if (h->is_fastq) { return fastq_get_size(h); } else { return fasta_get_size(h); } } uint64_t fastx_get_lineno(fastx_handle h) { if (h->is_fastq) { return fastq_get_lineno(h); } else { return fasta_get_lineno(h); } } uint64_t fastx_get_seqno(fastx_handle h) { if (h->is_fastq) { return fastq_get_seqno(h); } else { return fasta_get_seqno(h); } } char * fastx_get_header(fastx_handle h) { if (h->is_fastq) { return fastq_get_header(h); } else { return fasta_get_header(h); } } char * fastx_get_sequence(fastx_handle h) { if (h->is_fastq) { return fastq_get_sequence(h); } else { return fasta_get_sequence(h); } } uint64_t fastx_get_header_length(fastx_handle h) { if (h->is_fastq) { return fastq_get_header_length(h); } else { return fasta_get_header_length(h); } } uint64_t fastx_get_sequence_length(fastx_handle h) { if (h->is_fastq) { return fastq_get_sequence_length(h); } else { return fasta_get_sequence_length(h); } } char * fastx_get_quality(fastx_handle h) { if (h->is_fastq) { return fastq_get_quality(h); } else { return nullptr; } } int64_t 
fastx_get_abundance(fastx_handle h) { if (h->is_fastq) { return fastq_get_abundance(h); } else { return fasta_get_abundance(h); } } vsearch-2.21.1/src/align_simd.h0000644000175000017500000000714614171574117015644 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ typedef signed short CELL; typedef unsigned short WORD; typedef unsigned char BYTE; struct s16info_s; struct s16info_s * search16_init(CELL score_match, CELL score_mismatch, CELL penalty_gap_open_query_left, CELL penalty_gap_open_target_left, CELL penalty_gap_open_query_interior, CELL penalty_gap_open_target_interior, CELL penalty_gap_open_query_right, CELL penalty_gap_open_target_right, CELL penalty_gap_extension_query_left, CELL penalty_gap_extension_target_left, CELL penalty_gap_extension_query_interior, CELL penalty_gap_extension_target_interior, CELL penalty_gap_extension_query_right, CELL penalty_gap_extension_target_right); void search16_exit(s16info_s * s); void search16_qprep(s16info_s * s, char * qseq, int qlen); void search16(s16info_s * s, unsigned int sequences, unsigned int * seqnos, CELL * pscores, unsigned short * paligned, unsigned short * pmatches, unsigned short * pmismatches, unsigned short * pgaps, char * * pcigar); vsearch-2.21.1/src/udb.h0000644000175000017500000000521714171574117014305 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. 
Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ bool udb_detect_isudb(const char * filename); void udb_read(const char * filename, bool create_bitmaps, bool parse_abundances); void udb_fasta(); void udb_info(); void udb_make(); void udb_stats(); vsearch-2.21.1/src/fastqjoin.cc0000644000175000017500000001612714171574117015671 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . 
The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "vsearch.h" /* static variables */ FILE * join_fileopenw(char * filename) { FILE * fp = nullptr; fp = fopen_output(filename); if (!fp) { fatal("Unable to open file for writing (%s)", filename); } return fp; } void fastq_join() { FILE * fp_fastqout = nullptr; FILE * fp_fastaout = nullptr; fastx_handle fastq_fwd = nullptr; fastx_handle fastq_rev = nullptr; uint64_t total = 0; /* check input and options */ if (!opt_reverse) { fatal("No reverse reads file specified with --reverse"); } if ((!opt_fastqout) && (!opt_fastaout)) { fatal("No output files specified"); } char * padgap = nullptr; char * padgapq = nullptr; if (opt_join_padgap) { padgap = xstrdup(opt_join_padgap); } else { padgap = xstrdup("NNNNNNNN"); } uint64_t padlen = strlen(padgap); if (opt_join_padgapq) { padgapq = xstrdup(opt_join_padgapq); } else { padgapq = (char *) xmalloc(padlen + 1); for(uint64_t i = 0; i < padlen; i++) { padgapq[i] = 'I'; } padgapq[padlen] = 0; } if (padlen != strlen(padgapq)) { fatal("Strings given by --join_padgap and --join_padgapq differ in length"); } /* open input files */ fastq_fwd = fastq_open(opt_fastq_join); fastq_rev = fastq_open(opt_reverse); /* open output files */ if (opt_fastqout) { fp_fastqout = join_fileopenw(opt_fastqout); } if (opt_fastaout) { fp_fastaout = join_fileopenw(opt_fastaout); } /* main */ uint64_t filesize = fastq_get_size(fastq_fwd); progress_init("Joining reads", filesize); /* do it */ total = 0; uint64_t alloc = 0; uint64_t len = 0; char * seq = nullptr; char * qual = nullptr; while(fastq_next(fastq_fwd, false, chrmap_no_change)) { if (! 
fastq_next(fastq_rev, false, chrmap_no_change)) { fatal("More forward reads than reverse reads"); } uint64_t fwd_seq_length = fastq_get_sequence_length(fastq_fwd); uint64_t rev_seq_length = fastq_get_sequence_length(fastq_rev); /* allocate enough mem */ uint64_t needed = fwd_seq_length + rev_seq_length + padlen + 1; if (alloc < needed) { seq = (char *) xrealloc(seq, needed); qual = (char *) xrealloc(qual, needed); alloc = needed; } /* join them */ strcpy(seq, fastq_get_sequence(fastq_fwd)); strcpy(qual, fastq_get_quality(fastq_fwd)); len = fwd_seq_length; strcpy(seq + len, padgap); strcpy(qual + len, padgapq); len += padlen; /* reverse complement reverse read */ char * rev_seq = fastq_get_sequence(fastq_rev); char * rev_qual = fastq_get_quality(fastq_rev); for(uint64_t i = 0; i < rev_seq_length; i++) { uint64_t rev_pos = rev_seq_length - 1 - i; seq[len] = chrmap_complement[(int)(rev_seq[rev_pos])]; qual[len] = rev_qual[rev_pos]; len++; } seq[len] = 0; qual[len] = 0; /* write output */ if (opt_fastqout) { fastq_print_general(fp_fastqout, seq, len, fastq_get_header(fastq_fwd), fastq_get_header_length(fastq_fwd), qual, 0, total + 1, -1.0); } if (opt_fastaout) { fasta_print_general(fp_fastaout, nullptr, seq, len, fastq_get_header(fastq_fwd), fastq_get_header_length(fastq_fwd), 0, total + 1, -1.0, -1, -1, nullptr, 0); } total++; progress_update(fastq_get_position(fastq_fwd)); } progress_done(); if (fastq_next(fastq_rev, false, chrmap_no_change)) { fatal("More reverse reads than forward reads"); } fprintf(stderr, "%" PRIu64 " pairs joined\n", total); /* clean up */ if (opt_fastaout) { fclose(fp_fastaout); } if (opt_fastqout) { fclose(fp_fastqout); } fastq_close(fastq_rev); fastq_rev = nullptr; fastq_close(fastq_fwd); fastq_fwd = nullptr; if (seq) { xfree(seq); } if (qual) { xfree(qual); } xfree(padgap); xfree(padgapq); } vsearch-2.21.1/src/msa.h0000644000175000017500000000525114171574117014311 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics 
Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ struct msa_target_s { int seqno; char * cigar; int strand; }; void msa(FILE * fp_msaout, FILE * fp_consout, FILE * fp_profile, int cluster, int target_count, struct msa_target_s * target_list, int64_t totalabundance); vsearch-2.21.1/src/searchexact.cc0000644000175000017500000006130614171574117016164 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . 
The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "vsearch.h" static struct searchinfo_s * si_plus; static struct searchinfo_s * si_minus; static pthread_t * pthread; /* global constants/data, no need for synchronization */ static int tophits; /* the maximum number of hits to keep */ static int seqcount; /* number of database sequences */ static pthread_attr_t attr; static fastx_handle query_fasta_h; /* global data protected by mutex */ static pthread_mutex_t mutex_input; static pthread_mutex_t mutex_output; static int qmatches; static int queries; static int * dbmatched; static FILE * fp_samout = nullptr; static FILE * fp_alnout = nullptr; static FILE * fp_userout = nullptr; static FILE * fp_blast6out = nullptr; static FILE * fp_uc = nullptr; static FILE * fp_fastapairs = nullptr; static FILE * fp_matched = nullptr; static FILE * fp_notmatched = nullptr; static FILE * fp_dbmatched = nullptr; static FILE * fp_dbnotmatched = nullptr; static FILE * fp_otutabout = nullptr; static FILE * fp_mothur_shared_out = nullptr; static FILE * fp_biomout = nullptr; static FILE * fp_qsegout = nullptr; static FILE * fp_tsegout = nullptr; static int count_matched = 0; static int count_notmatched = 0; void add_hit(struct searchinfo_s * si, uint64_t seqno) { if (search_acceptable_unaligned(si, seqno)) { struct hit * hp = si->hits + si->hit_count; si->hit_count++; hp->target = seqno; hp->strand = si->strand; hp->count = 0; hp->nwscore = si->qseqlen * opt_match; hp->nwdiff = 0; hp->nwgaps = 0; hp->nwindels = 0; hp->nwalignmentlength = si->qseqlen; hp->nwid = 100.0; hp->matches = si->qseqlen; hp->mismatches = 0; int ret = xsprintf(&hp->nwalignment, "%dM", si->qseqlen); if ((ret == -1) || (!hp->nwalignment)) { fatal("Out of memory"); } hp->internal_alignmentlength = si->qseqlen; hp->internal_gaps = 0; hp->internal_indels = 0; hp->trim_q_left = 0; hp->trim_q_right = 0; hp->trim_t_left = 0; hp->trim_t_right = 0; hp->trim_aln_left = 0; hp->trim_aln_right = 0; hp->id = 100.0; hp->id0 = 100.0; hp->id1 = 100.0; hp->id2 = 100.0; 
hp->id3 = 100.0; hp->id4 = 100.0; hp->shortest = si->qseqlen; hp->longest = si->qseqlen; hp->aligned = true; hp->accepted = false; hp->rejected = false; hp->weak = false; (void) search_acceptable_aligned(si, hp); } } void search_exact_onequery(struct searchinfo_s * si) { dbhash_search_info_s info; char * seq = si->qsequence; uint64_t seqlen = si->qseqlen; char * normalized = (char*) xmalloc(seqlen+1); string_normalize(normalized, seq, seqlen); si->hit_count = 0; int64_t ret = dbhash_search_first(normalized, seqlen, & info); while (ret >= 0) { add_hit(si, ret); ret = dbhash_search_next(&info); } xfree(normalized); } void search_exact_output_results(int hit_count, struct hit * hits, char * query_head, int qseqlen, char * qsequence, char * qsequence_rc, int qsize) { xpthread_mutex_lock(&mutex_output); /* show results */ int64_t toreport = MIN(opt_maxhits, hit_count); if (fp_alnout) { results_show_alnout(fp_alnout, hits, toreport, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_samout) { results_show_samout(fp_samout, hits, toreport, query_head, qsequence, qseqlen, qsequence_rc); } if (toreport) { double top_hit_id = hits[0].id; if (opt_otutabout || opt_mothur_shared_out || opt_biomout) { otutable_add(query_head, db_getheader(hits[0].target), qsize); } for(int t = 0; t < toreport; t++) { struct hit * hp = hits + t; if (opt_top_hits_only && (hp->id < top_hit_id)) { break; } if (fp_fastapairs) { results_show_fastapairs_one(fp_fastapairs, hp, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_qsegout) { results_show_qsegout_one(fp_qsegout, hp, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_tsegout) { results_show_tsegout_one(fp_tsegout, hp, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_uc) { if ((t==0) || opt_uc_allhits) { results_show_uc_one(fp_uc, hp, query_head, qsequence, qseqlen, qsequence_rc, hp->target); } } if (fp_userout) { results_show_userout_one(fp_userout, hp, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_blast6out) { 
results_show_blast6out_one(fp_blast6out, hp, query_head, qsequence, qseqlen, qsequence_rc); } } } else { if (fp_uc) { results_show_uc_one(fp_uc, nullptr, query_head, qsequence, qseqlen, qsequence_rc, 0); } if (opt_output_no_hits) { if (fp_userout) { results_show_userout_one(fp_userout, nullptr, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_blast6out) { results_show_blast6out_one(fp_blast6out, nullptr, query_head, qsequence, qseqlen, qsequence_rc); } } } if (hit_count) { count_matched++; if (opt_matched) { fasta_print_general(fp_matched, nullptr, qsequence, qseqlen, query_head, strlen(query_head), qsize, count_matched, -1.0, -1, -1, nullptr, 0.0); } } else { count_notmatched++; if (opt_notmatched) { fasta_print_general(fp_notmatched, nullptr, qsequence, qseqlen, query_head, strlen(query_head), qsize, count_notmatched, -1.0, -1, -1, nullptr, 0.0); } } /* update matching db sequences */ for (int i=0; i < hit_count; i++) { if (hits[i].accepted) { dbmatched[hits[i].target]++; } } xpthread_mutex_unlock(&mutex_output); } int search_exact_query(int64_t t) { for (int s = 0; s < opt_strand; s++) { struct searchinfo_s * si = s ? si_minus+t : si_plus+t; /* mask query */ if (opt_qmask == MASK_DUST) { dust(si->qsequence, si->qseqlen); } else if ((opt_qmask == MASK_SOFT) && (opt_hardmask)) { hardmask(si->qsequence, si->qseqlen); } /* perform search */ search_exact_onequery(si); } struct hit * hits; int hit_count; search_joinhits(si_plus + t, opt_strand > 1 ? si_minus + t : nullptr, & hits, & hit_count); search_exact_output_results(hit_count, hits, si_plus[t].query_head, si_plus[t].qseqlen, si_plus[t].qsequence, opt_strand > 1 ? 
si_minus[t].qsequence : nullptr, si_plus[t].qsize); /* free memory for alignment strings */ for(int i=0; iquery_head_len = query_head_len; si->qseqlen = qseqlen; si->query_no = query_no; si->qsize = qsize; si->strand = s; /* allocate more memory for header and sequence, if necessary */ if (si->query_head_len + 1 > si->query_head_alloc) { si->query_head_alloc = si->query_head_len + 2001; si->query_head = (char*) xrealloc(si->query_head, (size_t)(si->query_head_alloc)); } if (si->qseqlen + 1 > si->seq_alloc) { si->seq_alloc = si->qseqlen + 2001; si->qsequence = (char*) xrealloc(si->qsequence, (size_t)(si->seq_alloc)); } } /* plus strand: copy header and sequence */ strcpy(si_plus[t].query_head, qhead); strcpy(si_plus[t].qsequence, qseq); /* get progress as amount of input file read */ uint64_t progress = fasta_get_position(query_fasta_h); /* let other threads read input */ xpthread_mutex_unlock(&mutex_input); /* minus strand: copy header and reverse complementary sequence */ if (opt_strand > 1) { strcpy(si_minus[t].query_head, si_plus[t].query_head); reverse_complement(si_minus[t].qsequence, si_plus[t].qsequence, si_plus[t].qseqlen); } int match = search_exact_query(t); /* lock mutex for update of global data and output */ xpthread_mutex_lock(&mutex_output); /* update stats */ queries++; if (match) { qmatches++; } /* show progress */ progress_update(progress); xpthread_mutex_unlock(&mutex_output); } else { xpthread_mutex_unlock(&mutex_input); break; } } } void search_exact_thread_init(struct searchinfo_s * si) { /* thread specific initialiation */ si->uh = nullptr; si->kmers = nullptr; si->m = nullptr; si->hits = (struct hit *) xmalloc (sizeof(struct hit) * (tophits) * opt_strand); si->qsize = 1; si->query_head_alloc = 0; si->query_head = nullptr; si->seq_alloc = 0; si->qsequence = nullptr; si->nw = nullptr; si->s = nullptr; } void search_exact_thread_exit(struct searchinfo_s * si) { /* thread specific clean up */ xfree(si->hits); if (si->query_head) { 
xfree(si->query_head); } if (si->qsequence) { xfree(si->qsequence); } } void * search_exact_thread_worker(void * vp) { auto t = (int64_t) vp; search_exact_thread_run(t); return nullptr; } void search_exact_thread_worker_run() { /* initialize threads, start them, join them and return */ xpthread_attr_init(&attr); xpthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE); /* init and create worker threads, put them into stand-by mode */ for(int t=0; t 1) { si_minus = (struct searchinfo_s *) xmalloc(opt_threads * sizeof(struct searchinfo_s)); } else { si_minus = nullptr; } pthread = (pthread_t *) xmalloc(opt_threads * sizeof(pthread_t)); /* init mutexes for input and output */ xpthread_mutex_init(&mutex_input, nullptr); xpthread_mutex_init(&mutex_output, nullptr); progress_init("Searching", fasta_get_size(query_fasta_h)); search_exact_thread_worker_run(); progress_done(); xpthread_mutex_destroy(&mutex_output); xpthread_mutex_destroy(&mutex_input); xfree(pthread); xfree(si_plus); if (si_minus) { xfree(si_minus); } fasta_close(query_fasta_h); if (!opt_quiet) { fprintf(stderr, "Matching query sequences: %d of %d", qmatches, queries); if (queries > 0) { fprintf(stderr, " (%.2f%%)", 100.0 * qmatches / queries); } fprintf(stderr, "\n"); } if (opt_log) { fprintf(fp_log, "Matching query sequences: %d of %d", qmatches, queries); if (queries > 0) { fprintf(fp_log, " (%.2f%%)", 100.0 * qmatches / queries); } fprintf(fp_log, "\n"); } if (fp_biomout) { otutable_print_biomout(fp_biomout); fclose(fp_biomout); } if (fp_otutabout) { otutable_print_otutabout(fp_otutabout); fclose(fp_otutabout); } if (fp_mothur_shared_out) { otutable_print_mothur_shared_out(fp_mothur_shared_out); fclose(fp_mothur_shared_out); } otutable_done(); int count_dbmatched = 0; int count_dbnotmatched = 0; if (opt_dbmatched || opt_dbnotmatched) { for(int64_t i=0; i, Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a 
choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ void usearch_global(char * cmdline, char * progheader); vsearch-2.21.1/src/fastq.cc0000644000175000017500000003545114171574117015012 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" void fastq_fatal(uint64_t lineno, const char * msg) { char * string; /* lineno is uint64_t, so use PRIu64 rather than %lu (long is 32 bits on some 64-bit platforms) */ if (xsprintf(& string, "Invalid line %" PRIu64 " in FASTQ file: %s", lineno, msg) == -1) { fatal("Out of memory"); } if (string) { fatal(string); xfree(string); } else { fatal("Out of memory"); } } void buffer_filter_extend(fastx_handle h, struct fastx_buffer_s * dest_buffer, char * source_buf, uint64_t len, unsigned int * char_action, const unsigned char * char_mapping, bool * ok, char * illegal_char) { buffer_makespace(dest_buffer, len+1); /* Strip unwanted characters from the string and raise warnings or errors on certain characters.
*/ char * p = source_buf; char * d = dest_buffer->data + dest_buffer->length; char * q = d; * ok = true; for(uint64_t i = 0; i < len; i++) { char c = *p++; char m = char_action[(unsigned char)c]; switch(m) { case 0: /* stripped */ h->stripped_all++; h->stripped[(unsigned char)c]++; break; case 1: /* legal character */ *q++ = char_mapping[(unsigned char)(c)]; break; case 2: /* fatal character */ if (*ok) { * illegal_char = c; } * ok = false; break; case 3: /* silently stripped chars (whitespace) */ break; case 4: /* newline (silently stripped) */ break; } } /* add zero after sequence */ *q = 0; dest_buffer->length += q - d; } fastx_handle fastq_open(const char * filename) { fastx_handle h = fastx_open(filename); if (!fastx_is_fastq(h)) { fatal("FASTQ file expected, FASTA file found (%s)", filename); } return h; } void fastq_close(fastx_handle h) { fastx_close(h); } bool fastq_next(fastx_handle h, bool truncateatspace, const unsigned char * char_mapping) { h->header_buffer.length = 0; h->header_buffer.data[0] = 0; h->sequence_buffer.length = 0; h->sequence_buffer.data[0] = 0; h->plusline_buffer.length = 0; h->plusline_buffer.data[0] = 0; h->quality_buffer.length = 0; h->quality_buffer.data[0] = 0; h->lineno_start = h->lineno; char msg[200]; bool ok = true; char illegal_char = 0; uint64_t rest = fastx_file_fill_buffer(h); /* check end of file */ if (rest == 0) { return false; } /* read header */ /* check initial @ character */ if (h->file_buffer.data[h->file_buffer.position] != '@') { fastq_fatal(h->lineno, "Header line must start with '@' character"); } h->file_buffer.position++; rest--; char * lf = nullptr; while (lf == nullptr) { /* get more data if buffer empty */ rest = fastx_file_fill_buffer(h); if (rest == 0) { fastq_fatal(h->lineno, "Unexpected end of file"); } /* find LF */ lf = (char *) memchr(h->file_buffer.data + h->file_buffer.position, '\n', rest); /* copy to header buffer */ uint64_t len = rest; if (lf) { /* LF found, copy up to and including LF */ len 
= lf - (h->file_buffer.data + h->file_buffer.position) + 1; h->lineno++; } buffer_extend(& h->header_buffer, h->file_buffer.data + h->file_buffer.position, len); h->file_buffer.position += len; rest -= len; } /* read sequence line(s) */ lf = nullptr; while (true) { /* get more data, if necessary */ rest = fastx_file_fill_buffer(h); /* cannot end here */ if (rest == 0) { fastq_fatal(h->lineno, "Unexpected end of file"); } /* end when new line starting with + is seen */ if (lf && (h->file_buffer.data[h->file_buffer.position] == '+')) { break; } /* find LF */ lf = (char *) memchr(h->file_buffer.data + h->file_buffer.position, '\n', rest); /* copy to sequence buffer */ uint64_t len = rest; if (lf) { /* LF found, copy up to and including LF */ len = lf - (h->file_buffer.data + h->file_buffer.position) + 1; h->lineno++; } buffer_filter_extend(h, & h->sequence_buffer, h->file_buffer.data + h->file_buffer.position, len, char_fq_action_seq, char_mapping, & ok, & illegal_char); h->file_buffer.position += len; rest -= len; if (!ok) { if ((illegal_char >= 32) && (illegal_char < 127)) { snprintf(msg, 200, "Illegal sequence character '%c'", illegal_char); } else { snprintf(msg, 200, "Illegal sequence character (unprintable, no %d)", (unsigned char) illegal_char); } fastq_fatal(h->lineno - (lf ? 
1 : 0), msg); } } /* read + line */ /* skip + character */ h->file_buffer.position++; rest--; lf = nullptr; while (lf == nullptr) { /* get more data if buffer empty */ rest = fastx_file_fill_buffer(h); /* cannot end here */ if (rest == 0) { fastq_fatal(h->lineno, "Unexpected end of file"); } /* find LF */ lf = (char *) memchr(h->file_buffer.data + h->file_buffer.position, '\n', rest); /* copy to plusline buffer */ uint64_t len = rest; if (lf) { /* LF found, copy up to and including LF */ len = lf - (h->file_buffer.data + h->file_buffer.position) + 1; h->lineno++; } buffer_extend(& h->plusline_buffer, h->file_buffer.data + h->file_buffer.position, len); h->file_buffer.position += len; rest -= len; } /* check that the plus line is empty or identical to @ line */ bool plusline_invalid = false; if (h->header_buffer.length == h->plusline_buffer.length) { if (memcmp(h->header_buffer.data, h->plusline_buffer.data, h->header_buffer.length)) { plusline_invalid = true; } } else { if ((h->plusline_buffer.length > 2) || ((h->plusline_buffer.length == 2) && (h->plusline_buffer.data[0] != '\r'))) { plusline_invalid = true; } } if (plusline_invalid) { fastq_fatal(h->lineno - (lf ? 
1 : 0), "'+' line must be empty or identical to header"); } /* read quality line(s) */ lf = nullptr; while (true) { /* get more data, if necessary */ rest = fastx_file_fill_buffer(h); /* end if no more data */ if (rest == 0) { break; } /* end if next entry starts : LF + '@' + correct length */ if (lf && (h->file_buffer.data[h->file_buffer.position] == '@') && (h->quality_buffer.length == h->sequence_buffer.length)) { break; } /* find LF */ lf = (char *) memchr(h->file_buffer.data + h->file_buffer.position, '\n', rest); /* copy to quality buffer */ uint64_t len = rest; if (lf) { /* LF found, copy up to and including LF */ len = lf - (h->file_buffer.data + h->file_buffer.position) + 1; h->lineno++; } buffer_filter_extend(h, & h->quality_buffer, h->file_buffer.data + h->file_buffer.position, len, char_fq_action_qual, chrmap_identity, & ok, & illegal_char); h->file_buffer.position += len; rest -= len; /* break if quality line already too long */ if (h->quality_buffer.length > h->sequence_buffer.length) { break; } if (!ok) { if ((illegal_char >= 32) && (illegal_char < 127)) { snprintf(msg, 200, "Illegal quality character '%c'", illegal_char); } else { snprintf(msg, 200, "Illegal quality character (unprintable, no %d)", (unsigned char) illegal_char); } fastq_fatal(h->lineno - (lf ? 1 : 0), msg); } } if (h->sequence_buffer.length != h->quality_buffer.length) { fastq_fatal(h->lineno - (lf ? 
1 : 0), "Sequence and quality lines must be equally long");
    }

  fastx_filter_header(h, truncateatspace);

  h->seqno++;

  return true;
}

char * fastq_get_quality(fastx_handle h)
{
  return h->quality_buffer.data;
}

uint64_t fastq_get_quality_length(fastx_handle h)
{
  return h->quality_buffer.length;
}

uint64_t fastq_get_position(fastx_handle h)
{
  return h->file_position;
}

uint64_t fastq_get_size(fastx_handle h)
{
  return h->file_size;
}

uint64_t fastq_get_lineno(fastx_handle h)
{
  return h->lineno_start;
}

uint64_t fastq_get_seqno(fastx_handle h)
{
  return h->seqno;
}

uint64_t fastq_get_header_length(fastx_handle h)
{
  return h->header_buffer.length;
}

uint64_t fastq_get_sequence_length(fastx_handle h)
{
  return h->sequence_buffer.length;
}

char * fastq_get_header(fastx_handle h)
{
  return h->header_buffer.data;
}

char * fastq_get_sequence(fastx_handle h)
{
  return h->sequence_buffer.data;
}

int64_t fastq_get_abundance(fastx_handle h)
{
  // return 1 if not present
  int64_t size = header_get_size(h->header_buffer.data,
                                 h->header_buffer.length);
  if (size > 0)
    {
      return size;
    }
  else
    {
      return 1;
    }
}

int64_t fastq_get_abundance_and_presence(fastx_handle h)
{
  // return 0 if not present
  return header_get_size(h->header_buffer.data, h->header_buffer.length);
}

inline void fprint_seq_label(FILE * fp, char * seq, int len)
{
  /* normalize first?
*/ fprintf(fp, "%.*s", len, seq); } void fastq_print_general(FILE * fp, char * seq, int len, char * header, int header_len, char * quality, int abundance, int ordinal, double ee) { fprintf(fp, "@"); if (opt_relabel_self) { fprint_seq_label(fp, seq, len); } else if (opt_relabel_sha1) { fprint_seq_digest_sha1(fp, seq, len); } else if (opt_relabel_md5) { fprint_seq_digest_md5(fp, seq, len); } else if (opt_relabel && (ordinal > 0)) { fprintf(fp, "%s%d", opt_relabel, ordinal); } else { bool xsize = opt_xsize || (opt_sizeout && (abundance > 0)); bool xee = opt_xee || ((opt_eeout || opt_fastq_eeout) && (ee >= 0.0)); header_fprint_strip_size_ee(fp, header, header_len, xsize, xee); } if (opt_label_suffix) { fprintf(fp, "%s", opt_label_suffix); } if (opt_sample) { fprintf(fp, ";sample=%s", opt_sample); } if (opt_sizeout && (abundance > 0)) { fprintf(fp, ";size=%u", abundance); } if ((opt_eeout || opt_fastq_eeout) && (ee >= 0.0)) { fprintf(fp, ";ee=%.4lf", ee); } if (opt_relabel_keep && ((opt_relabel && (ordinal > 0)) || opt_relabel_sha1 || opt_relabel_md5 || opt_relabel_self)) { fprintf(fp, " %.*s", header_len, header); } fprintf(fp, "\n%.*s\n+\n%.*s\n", len, seq, len, quality); } void fastq_print(FILE * fp, char * header, char * sequence, char * quality) { int slen = strlen(sequence); int hlen = strlen(header); fastq_print_general(fp, sequence, slen, header, hlen, quality, 0, 0, -1.0); } vsearch-2.21.1/src/unique.h0000644000175000017500000000566014171574117015043 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. 
GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ struct bucket_s; struct uhandle_s; struct uhandle_s * unique_init(); void unique_exit(struct uhandle_s * u); void unique_count(struct uhandle_s * uh, int k, int seqlen, char * seq, unsigned int * listlen, unsigned int * * list, int seqmask); int unique_count_shared(struct uhandle_s * uh, int k, int listlen, unsigned int * list); vsearch-2.21.1/src/linmemalign.cc0000644000175000017500000004704614171574117016173 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" /* Compute the optimal global alignment of two sequences in linear space using the divide and conquer method. These functions are based on the following articles: - Hirschberg (1975) Comm ACM 18:341-343 - Myers & Miller (1988) CABIOS 4:11-17 The method has been adapted for the use of different gap penalties for query/target/left/interior/right gaps. 
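
As an illustration of the position-dependent gap penalties mentioned above, the following sketch computes the cost of a gap as open + length * extension, with separate parameters for gaps at the left end, in the interior and at the right end of a sequence. It is not part of VSEARCH: GapPenalty, GapScheme, GapPlace and gap_cost are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types, for illustration only (not VSEARCH code).
struct GapPenalty
{
  int64_t open;    // charged once per gap
  int64_t extend;  // charged once per gapped position
};

struct GapScheme
{
  GapPenalty left;      // gap at the left end of the sequence
  GapPenalty interior;  // gap inside the sequence
  GapPenalty right;     // gap at the right end of the sequence
};

enum class GapPlace { left, interior, right };

// Cost of one gap of the given length: open + length * extend.
// A gap of length zero costs nothing.
int64_t gap_cost(const GapScheme & s, GapPlace place, int64_t length)
{
  assert(length >= 0);
  if (length == 0)
    {
      return 0;
    }
  const GapPenalty & p =
    (place == GapPlace::left) ? s.left :
    (place == GapPlace::right) ? s.right : s.interior;
  return p.open + length * p.extend;
}
```

With arbitrary example values of open 20 and extension 2 for interior gaps, gap_cost(s, GapPlace::interior, 3) gives 20 + 3 * 2 = 26. The aligner below subtracts such costs from the score, once for gaps in the query and once for gaps in the target.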
scorematrix consists of 16x16 int64_t integers Sequences and alignment matrix: A/a/i/query/q/downwards/vertical/top/bottom B/b/j/target/t/rightwards/horizontal/left/right f corresponds to score ending with gap in A/query EE corresponds to score ending with gap in B/target */ LinearMemoryAligner::LinearMemoryAligner() { scorematrix = nullptr; cigar_alloc = 0; cigar_string = nullptr; vector_alloc = 0; HH = nullptr; EE = nullptr; XX = nullptr; YY = nullptr; } LinearMemoryAligner::~LinearMemoryAligner() { if (cigar_string) { xfree(cigar_string); } if (HH) { xfree(HH); } if (EE) { xfree(EE); } if (XX) { xfree(XX); } if (YY) { xfree(YY); } } int64_t * LinearMemoryAligner::scorematrix_create(int64_t match, int64_t mismatch) { auto * newscorematrix = (int64_t*) xmalloc(16*16*sizeof(int64_t)); for(int i=0; i<16; i++) { for(int j=0; j<16; j++) { int64_t value; if (ambiguous_4bit[i] || ambiguous_4bit[j]) { value = 0; } else if (i == j) { value = match; } else { value = mismatch; } newscorematrix[16*i+j] = value; } } return newscorematrix; } void LinearMemoryAligner::alloc_vectors(size_t x) { if (vector_alloc < x) { vector_alloc = x; if (HH) { xfree(HH); } if (EE) { xfree(EE); } if (XX) { xfree(XX); } if (YY) { xfree(YY); } HH = (int64_t*) xmalloc(vector_alloc * (sizeof(int64_t))); EE = (int64_t*) xmalloc(vector_alloc * (sizeof(int64_t))); XX = (int64_t*) xmalloc(vector_alloc * (sizeof(int64_t))); YY = (int64_t*) xmalloc(vector_alloc * (sizeof(int64_t))); } } void LinearMemoryAligner::cigar_reset() { if (cigar_alloc < 1) { cigar_alloc = 64; cigar_string = (char*) xrealloc(cigar_string, cigar_alloc); } cigar_string[0] = 0; cigar_length = 0; op = 0; op_run = 0; } void LinearMemoryAligner::cigar_flush() { if (op_run > 0) { while (true) { /* try writing string until enough memory has been allocated */ int64_t rest = cigar_alloc - cigar_length; int n; if (op_run > 1) { n = snprintf(cigar_string + cigar_length, rest, "%" PRId64 "%c", op_run, op); } else { n = snprintf(cigar_string + 
cigar_length, rest, "%c", op);
            }

          if (n < 0)
            {
              fatal("snprintf returned a negative number.\n");
            }
          else if (n >= rest)
            {
              cigar_alloc += MAX(n - rest + 1, 64);
              cigar_string = (char*) xrealloc(cigar_string, cigar_alloc);
            }
          else
            {
              cigar_length += n;
              break;
            }
        }
    }
}

void LinearMemoryAligner::cigar_add(char _op, int64_t run)
{
  if (op == _op)
    {
      op_run += run;
    }
  else
    {
      cigar_flush();
      op = _op;
      op_run = run;
    }
}

void LinearMemoryAligner::show_matrix()
{
  for(int i=0; i<16; i++)
    {
      printf("%2d:", i);
      for(int j=0; j<16; j++)
        {
          printf(" %2" PRId64, scorematrix[16*i+j]);
        }
      printf("\n");
    }
}

void LinearMemoryAligner::diff(int64_t a_start,
                               int64_t b_start,
                               int64_t a_len,
                               int64_t b_len,
                               bool gap_b_left,  /* gap open left of b */
                               bool gap_b_right, /* gap open right of b */
                               bool a_left,      /* includes left end of a */
                               bool a_right,     /* includes right end of a */
                               bool b_left,      /* includes left end of b */
                               bool b_right)     /* includes right end of b */
{
  if (b_len == 0)
    {
      /* B, and possibly A, is empty */
      if (a_len > 0)
        {
          // Delete a_len from A
          // AAA
          // ---

          cigar_add('D', a_len);
        }
    }
  else if (a_len == 0)
    {
      /* A is empty, B is not */

      // Insert b_len from B
      // ---
      // BBB

      cigar_add('I', b_len);
    }
  else if (a_len == 1)
    {
      /*
        Convert 1 symbol from A to b_len symbols from B
        b_len >= 1
      */

      int64_t MaxScore;
      int64_t best;
      int64_t Score = 0;

      /* First possibility */

      // Delete 1 from A, Insert b_len from B
      // A----
      // -BBBB

      /* gap penalty for gap in B of length 1 */

      if (! gap_b_left)
        {
          Score -= b_left ? go_t_l : go_t_i;
        }

      Score -= b_left ? ge_t_l : ge_t_i;

      /* gap penalty for gap in A of length b_len */

      Score -= a_right ? go_q_r + b_len * ge_q_r : go_q_i + b_len * ge_q_i;

      MaxScore = Score;
      best = -1;

      /* Second possibility */

      // Insert b_len from B, Delete 1 from A
      // ----A
      // BBBB-

      /* start from zero: scored independently of the first possibility */

      Score = 0;

      /* gap penalty for gap in A of length b_len */

      Score -= a_left ? go_q_l + b_len * ge_q_l : go_q_i + b_len * ge_q_i;

      /* gap penalty for gap in B of length 1 */

      if (! gap_b_right)
        {
          Score -= b_right ?
go_t_r : go_t_i;
        }

      Score -= b_right ? ge_t_r : ge_t_i;

      if (Score > MaxScore)
        {
          MaxScore = Score;
          best = b_len;
        }

      /* Third possibility */

      for (int64_t j = 0; j < b_len; j++)
        {
          // Insert zero or more from B, replace 1, insert rest of B
          // -A--
          // BBBB

          Score = 0;

          if (j > 0)
            {
              Score -= a_left ? go_q_l + j * ge_q_l : go_q_i + j * ge_q_i;
            }

          Score += subst_score(a_start, b_start + j);

          if (j < b_len - 1)
            {
              Score -= a_right ?
                go_q_r + (b_len-1-j) * ge_q_r :
                go_q_i + (b_len-1-j) * ge_q_i;
            }

          if (Score > MaxScore)
            {
              MaxScore = Score;
              best = j;
            }
        }

      if (best == -1)
        {
          cigar_add('D', 1);
          cigar_add('I', b_len);
        }
      else if (best == b_len)
        {
          cigar_add('I', b_len);
          cigar_add('D', 1);
        }
      else
        {
          if (best > 0)
            {
              cigar_add('I', best);
            }
          cigar_add('M', 1);
          if (best < b_len - 1)
            {
              cigar_add('I', b_len - 1 - best);
            }
        }
    }
  else
    {
      /* a_len >= 2, b_len >= 1 */

      int64_t I = a_len / 2;
      int64_t i, j;

      // Compute HH & EE in forward phase
      // Upper part

      /* initialize HH and EE for values corresponding to
         empty seq A vs B of j symbols,
         i.e. a gap of length j in A */

      HH[0] = 0;
      EE[0] = 0;

      for (j = 1; j <= b_len; j++)
        {
          HH[j] = - (a_left ? go_q_l + j * ge_q_l : go_q_i + j * ge_q_i);
          EE[j] = LONG_MIN;
        }

      /* compute matrix */

      for (i = 1; i <= I; i++)
        {
          int64_t p = HH[0];

          int64_t h = - (b_left ?
                         (gap_b_left ? 0 : go_t_l) + i * ge_t_l :
                         (gap_b_left ? 0 : go_t_i) + i * ge_t_i);

          HH[0] = h;
          int64_t f = LONG_MIN;

          for (j = 1; j <= b_len; j++)
            {
              f = MAX(f, h - go_q_i) - ge_q_i;

              if (b_right && (j==b_len))
                {
                  EE[j] = MAX(EE[j], HH[j] - go_t_r) - ge_t_r;
                }
              else
                {
                  EE[j] = MAX(EE[j], HH[j] - go_t_i) - ge_t_i;
                }

              h = p + subst_score(a_start + i - 1, b_start + j - 1);

              if (f > h)
                {
                  h = f;
                }
              if (EE[j] > h)
                {
                  h = EE[j];
                }

              p = HH[j];
              HH[j] = h;
            }
        }

      EE[0] = HH[0];

      // Compute XX & YY in reverse phase
      // Lower part

      /* initialize XX and YY */

      XX[0] = 0;
      YY[0] = 0;

      for (j = 1; j <= b_len; j++)
        {
          XX[j] = - (a_right ?
go_q_r + j * ge_q_r : go_q_i + j * ge_q_i); YY[j] = LONG_MIN; } /* compute matrix */ for (i = 1; i <= a_len - I; i++) { int64_t p = XX[0]; int64_t h = - (b_right ? (gap_b_right ? 0 : go_t_r) + i * ge_t_r : (gap_b_right ? 0 : go_t_i) + i * ge_t_i); XX[0] = h; int64_t f = LONG_MIN; for (j = 1; j <= b_len; j++) { f = MAX(f, h - go_q_i) - ge_q_i; if (b_left && (j==b_len)) { YY[j] = MAX(YY[j], XX[j] - go_t_l) - ge_t_l; } else { YY[j] = MAX(YY[j], XX[j] - go_t_i) - ge_t_i; } h = p + subst_score(a_start + a_len - i, b_start + b_len - j); if (f > h) { h = f; } if (YY[j] > h) { h = YY[j]; } p = XX[j]; XX[j] = h; } } YY[0] = XX[0]; /* find maximum score along division line */ int64_t MaxScore0 = LONG_MIN; int64_t best0 = -1; /* solutions with diagonal at break */ for (j=0; j <= b_len; j++) { int64_t Score = HH[j] + XX[b_len - j]; if (Score > MaxScore0) { MaxScore0 = Score; best0 = j; } } int64_t MaxScore1 = LONG_MIN; int64_t best1 = -1; /* solutions that end with a gap in b from both ends at break */ for (j=0; j <= b_len; j++) { int64_t g; if (b_left && (j==0)) { g = go_t_l; } else if (b_right && (j==b_len)) { g = go_t_r; } else { g = go_t_i; } int64_t Score = EE[j] + YY[b_len - j] + g; if (Score > MaxScore1) { MaxScore1 = Score; best1 = j; } } int64_t P; int64_t best; if (MaxScore0 > MaxScore1) { P = 0; best = best0; } else if (MaxScore1 > MaxScore0) { P = 1; best = best1; } else { if (best0 <= best1) { P = 0; best = best0; } else { P = 1; best = best1; } } /* recursively compute upper left and lower right parts */ if (P == 0) { diff(a_start, b_start, I, best, gap_b_left, false, a_left, false, b_left, b_right && (best == b_len)); diff(a_start + I, b_start + best, a_len - I, b_len - best, false, gap_b_right, false, a_right, b_left && (best == 0), b_right); } else if (P == 1) { diff(a_start, b_start, I - 1, best, gap_b_left, true, a_left, false, b_left, b_right && (best == b_len)); cigar_add('D', 2); diff(a_start + I + 1, b_start + best, a_len - I - 1, b_len - best, true, 
gap_b_right, false, a_right, b_left && (best == 0), b_right); } } } void LinearMemoryAligner::set_parameters(int64_t * _scorematrix, int64_t _gap_open_query_left, int64_t _gap_open_target_left, int64_t _gap_open_query_interior, int64_t _gap_open_target_interior, int64_t _gap_open_query_right, int64_t _gap_open_target_right, int64_t _gap_extension_query_left, int64_t _gap_extension_target_left, int64_t _gap_extension_query_interior, int64_t _gap_extension_target_interior, int64_t _gap_extension_query_right, int64_t _gap_extension_target_right) { scorematrix = _scorematrix; /* a = query/q b = t/target */ go_q_l = _gap_open_query_left; go_t_l = _gap_open_target_left; go_q_i = _gap_open_query_interior; go_t_i = _gap_open_target_interior; go_q_r = _gap_open_query_right; go_t_r = _gap_open_target_right; ge_q_l = _gap_extension_query_left; ge_t_l = _gap_extension_target_left; ge_q_i = _gap_extension_query_interior; ge_t_i = _gap_extension_target_interior; ge_q_r = _gap_extension_query_right; ge_t_r = _gap_extension_target_right; q = _gap_open_query_interior; r = _gap_extension_query_interior; } char * LinearMemoryAligner::align(char * _a_seq, char * _b_seq, int64_t a_len, int64_t b_len) { /* copy parameters */ a_seq = _a_seq; b_seq = _b_seq; /* init cigar operations */ cigar_reset(); /* allocate enough memory for vectors */ alloc_vectors(b_len+1); /* perform alignment */ diff(0, 0, a_len, b_len, false, false, true, true, true, true); /* ensure entire cigar has been written */ cigar_flush(); /* return cigar */ return cigar_string; } void LinearMemoryAligner::alignstats(char * cigar, char * _a_seq, char * _b_seq, int64_t * _nwscore, int64_t * _nwalignmentlength, int64_t * _nwmatches, int64_t * _nwmismatches, int64_t * _nwgaps) { a_seq = _a_seq; b_seq = _b_seq; int64_t nwscore = 0; int64_t nwalignmentlength = 0; int64_t nwmatches = 0; int64_t nwmismatches = 0; int64_t nwgaps = 0; int64_t a_pos = 0; int64_t b_pos = 0; char * p = cigar; int64_t g; while (*p) { int64_t run = 1; 
int scanlength = 0; sscanf(p, "%" PRId64 "%n", &run, &scanlength); p += scanlength; switch (*p++) { case 'M': nwalignmentlength += run; for(int64_t k=0; k, Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" /* legal symbols: *abcdefghiklmnpqrstuvxyz (all except j and o), also upper case fatal symbols: .- fatal: ascii 0-26 except tab (9), newline (10 and 13), vt (11), formfeed (12) stripped: !"#$&'()+,/0123456789:;<=>?@JO[\]^_`jo{|}~ and chrs 9-13, 127 includes both amino acid and nucleotide sequences, adapt to nt only */ char sym_nt_2bit[] = "ACGT"; char sym_nt_4bit[] = "-ACMGRSVTWYHKDBN"; unsigned int char_header_action[256] = { /* FASTA/FASTQ header characters 0 = null 1 = legal, printable ascii 2 = illegal, fatal 3 = cr 4 = lf 5 = tab 6 = space 7 = non-ascii, legal, but warn @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ */ 0, 2, 2, 2, 2, 2, 2, 2, 2, 5, 4, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7 }; unsigned int char_fasta_action[256] = { 
/* How to handle input characters for FASTA 0=stripped, 1=legal, 2=fatal, 3=silently stripped, 4=newline @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; unsigned int char_fq_action_seq[256] = { /* How to handle input characters for FASTQ: All IUPAC characters are valid. CR (^M) silently stripped. LF is newline. 
Rest is fatal 0=stripped, 1=legal, 2=fatal, 3=silently stripped, 4=newline @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, }; unsigned int char_fq_action_qual[256] = { /* Quality characters, any from 33 to 126 is valid. CR (^M) silently stripped. LF is newline. 
Rest is fatal @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 }; unsigned int chrmap_2bit[256] = { /* Map from ascii to 2-bit nucleotide code Aa: 0 Cc: 1 Gg: 2 TtUu: 3 All others: 0 @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; /* New 4 bit ambiguous nucleic acid symbol encoding bit 0 = A bit 1 = C bit 2 = G bit 3 = T - = = 0000 = 0 A = A = 0001 = 1 C = C = 0010 = 2 M = AC = 0011 = 3 G = G = 0100 = 4 R = A G = 0101 = 5 
S = CG = 0110 = 6 V = ACG = 0111 = 7 T = T = 1000 = 8 W = A T = 1001 = 9 Y = C T = 1010 = 10 H = AC T = 1011 = 11 K = GT = 1100 = 12 D = A GT = 1101 = 13 B = CGT = 1110 = 14 N = ACGT = 1111 = 15 */ unsigned int ambiguous_4bit[16] = { /* - A C M G R S V T W Y H K D B N */ 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1 }; unsigned int chrmap_4bit[256] = { /* Map from ascii to 4-bit nucleotide code Aa: 1 Bb: 14 Cc: 2 Dd: 13 Gg: 4 Hh: 11 Kk: 12 Mm: 3 Nn: 15 Rr: 5 Ss: 6 Tt: 8 Uu: 8 Vv: 7 Ww: 9 Yy: 10 Others: 0 @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 14, 2, 13, 0, 0, 4, 11, 0, 0, 12, 0, 3, 15, 0, 0, 0, 5, 6, 8, 8, 7, 9, 0, 10, 0, 0, 0, 0, 0, 0, 0, 1, 14, 2, 13, 0, 0, 4, 11, 0, 0, 12, 0, 3, 15, 0, 0, 0, 5, 6, 8, 8, 7, 9, 0, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; unsigned int chrmap_mask_lower[256] = { /* Should character be masked and not used for search ? Mask everything but A, C, G, T and U. All lower case letters are masked (soft masking). 
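
A table of this kind is indexed directly by the character code. As a rough sketch, an equivalent table could be built and queried as shown here, assuming that 1 means masked and 0 means usable for search. This is not VSEARCH code: make_mask_lower and count_unmasked are hypothetical names used only for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: build a 256-entry soft-masking table in which
// only upper case A, C, G, T and U are left unmasked (0); every other
// character code, including all lower case letters, is masked (1).
std::array<unsigned int, 256> make_mask_lower()
{
  std::array<unsigned int, 256> table;
  table.fill(1);
  const std::string unmasked = "ACGTU";
  for (unsigned char c : unmasked)
    {
      table[c] = 0;
    }
  assert(table['A'] == 0 && table['a'] == 1);
  return table;
}

// Hypothetical helper: count the symbols of seq that are not masked.
// Characters are converted to unsigned char before indexing so that
// codes above 127 cannot produce a negative index.
std::size_t count_unmasked(const std::array<unsigned int, 256> & table,
                           const std::string & seq)
{
  std::size_t n = 0;
  for (unsigned char c : seq)
    {
      if (table[c] == 0)
        {
          n++;
        }
    }
  return n;
}
```

For the sequence ACGTacgtNNU this counts 5 unmasked symbols: the four upper case nucleotides and the final U.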
@ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 }; unsigned int chrmap_mask_ambig[256] = { /* Should character be masked and not used for search ? Mask everything but A, C, G, T and U. Lower case letters are NOT masked. 
@ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 }; const unsigned char chrmap_complement[256] = { /* Map from ascii to ascii, complementary nucleotide @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ */ 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','T','V','G','H','N','N','C','D','N','N','M','N','K','N','N', 'N','N','Y','S','A','A','B','W','N','R','N','N','N','N','N','N', 'N','t','v','g','h','N','N','c','d','N','N','m','N','k','n','N', 'N','N','y','s','a','a','b','w','N','r','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 
'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N' }; const unsigned char chrmap_normalize[256] = { /* Map from ascii to ascii Convert to upper case nucleotide, and replace U by T @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ */ 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','A','B','C','D','N','N','G','H','N','N','K','N','M','N','N', 'N','N','R','S','T','T','V','W','N','Y','N','N','N','N','N','N', 'N','A','B','C','D','N','N','G','H','N','N','K','N','M','N','N', 'N','N','R','S','T','T','V','W','N','Y','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N' }; const unsigned char chrmap_upcase[256] = { /* Map from ascii to ascii Convert to upper case nucleotide @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ */ 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O', 'P','Q','R','S','T','U','V','W','X','Y','Z','N','N','N','N','N', 'N','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O', 'P','Q','R','S','T','U','V','W','X','Y','Z','N','N','N','N','N', 
'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N' }; const unsigned char chrmap_no_change[256] = { /* Map from ascii to ascii - no change @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ */ 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O', 'P','Q','R','S','T','U','V','W','X','Y','Z','N','N','N','N','N', 'N','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o', 'p','q','r','s','t','u','v','w','x','y','z','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N', 'N','N','N','N','N','N','N','N','N','N','N','N','N','N','N','N' }; const unsigned char chrmap_identity[256] = { /* identity map */ 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f, 0x20, 0x21, 0x22, 0x23, 
0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f, 0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, 0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f, 0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f, 0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f, 0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f, 0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9a, 0x9b, 0x9c, 0x9d, 0x9e, 0x9f, 0xa0, 0xa1, 0xa2, 0xa3, 0xa4, 0xa5, 0xa6, 0xa7, 0xa8, 0xa9, 0xaa, 0xab, 0xac, 0xad, 0xae, 0xaf, 0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xb5, 0xb6, 0xb7, 0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbe, 0xbf, 0xc0, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc7, 0xc8, 0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf, 0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6, 0xd7, 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde, 0xdf, 0xe0, 0xe1, 0xe2, 0xe3, 0xe4, 0xe5, 0xe6, 0xe7, 0xe8, 0xe9, 0xea, 0xeb, 0xec, 0xed, 0xee, 0xef, 0xf0, 0xf1, 0xf2, 0xf3, 0xf4, 0xf5, 0xf6, 0xf7, 0xf8, 0xf9, 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff }; vsearch-2.21.1/src/fastqjoin.h0000644000175000017500000000470014171574117015525 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. 
GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ void fastq_join(); vsearch-2.21.1/src/dbhash.cc0000644000175000017500000001432014171574117015115 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" static bitmap_t * dbhash_bitmap; static uint64_t dbhash_size; static unsigned int dbhash_shift; static uint64_t dbhash_mask; static struct dbhash_bucket_s * dbhash_table; int dbhash_seqcmp(char * a, char * b, uint64_t n) { char * p = a; char * q = b; if (n == 0) { return 0; } while ((n-- > 0) && (chrmap_4bit[(int)(*p)] == chrmap_4bit[(int)(*q)])) { if ((n == 0) || (*p == 0) || (*q == 0)) { break; } p++; q++; } return chrmap_4bit[(int)(*p)] - chrmap_4bit[(int)(*q)]; } void dbhash_open(uint64_t maxelements) { /* adjust size of hash table for a fill rate of at most 2/3 */ /* and use a power of 2, so that (hash & dbhash_mask) is a valid index */ dbhash_size = 1; dbhash_shift = 0; while (3 * maxelements > 2 * dbhash_size) { dbhash_size <<= 1; dbhash_shift++; } dbhash_mask = dbhash_size - 1; dbhash_table = (struct dbhash_bucket_s *) xmalloc(sizeof(dbhash_bucket_s) * dbhash_size); memset(dbhash_table, 0, sizeof(dbhash_bucket_s) * dbhash_size); dbhash_bitmap = bitmap_init(dbhash_size); bitmap_reset_all(dbhash_bitmap); } void dbhash_close() { bitmap_free(dbhash_bitmap); dbhash_bitmap = nullptr; xfree(dbhash_table); dbhash_table = nullptr; } int64_t dbhash_search_first(char * seq, uint64_t seqlen, struct dbhash_search_info_s * info) { uint64_t hash = hash_cityhash64(seq, seqlen); info->hash = hash; info->seq = seq; info->seqlen = seqlen; uint64_t index = hash & dbhash_mask; struct dbhash_bucket_s * bp = dbhash_table + index; while (bitmap_get(dbhash_bitmap, index) && ((bp->hash != hash) || (seqlen !=
db_getsequencelen(bp->seqno)) || (dbhash_seqcmp(seq, db_getsequence(bp->seqno), seqlen)))) { index = (index + 1) & dbhash_mask; bp = dbhash_table + index; } info->index = index; if (bitmap_get(dbhash_bitmap, index)) { return bp->seqno; } else { return -1; } } int64_t dbhash_search_next(struct dbhash_search_info_s * info) { uint64_t hash = info->hash; char * seq = info->seq; uint64_t seqlen = info->seqlen; uint64_t index = (info->index + 1) & dbhash_mask; struct dbhash_bucket_s * bp = dbhash_table + index; while (bitmap_get(dbhash_bitmap, index) && ((bp->hash != hash) || (seqlen != db_getsequencelen(bp->seqno)) || (dbhash_seqcmp(seq, db_getsequence(bp->seqno), seqlen)))) { index = (index + 1) & dbhash_mask; bp = dbhash_table + index; } info->index = index; if (bitmap_get(dbhash_bitmap, index)) { return bp->seqno; } else { return -1; } } void dbhash_add(char * seq, uint64_t seqlen, uint64_t seqno) { struct dbhash_search_info_s info; int64_t ret = dbhash_search_first(seq, seqlen, & info); while (ret >= 0) { ret = dbhash_search_next(&info); } bitmap_set(dbhash_bitmap, info.index); struct dbhash_bucket_s * bp = dbhash_table + info.index; bp->hash = info.hash; bp->seqno = seqno; } void dbhash_add_one(uint64_t seqno) { char * seq = db_getsequence(seqno); uint64_t seqlen = db_getsequencelen(seqno); char * normalized = (char*) xmalloc(seqlen+1); string_normalize(normalized, seq, seqlen); dbhash_add(normalized, seqlen, seqno); /* the bucket keeps only the hash and seqno, so release the temporary buffer */ xfree(normalized); } void dbhash_add_all() { progress_init("Hashing database sequences", db_getsequencecount()); char * normalized = (char*) xmalloc(db_getlongestsequence()+1); for(uint64_t seqno=0; seqno < db_getsequencecount(); seqno++) { char * seq = db_getsequence(seqno); uint64_t seqlen = db_getsequencelen(seqno); string_normalize(normalized, seq, seqlen); dbhash_add(normalized, seqlen, seqno); progress_update(seqno+1); } xfree(normalized); progress_done(); } vsearch-2.21.1/src/util.cc0000644000175000017500000002461114171574117014645 0ustar nileshnilesh/* VSEARCH: a
versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" //#define SHOW_RUSAGE static const char * progress_prompt; static uint64_t progress_next; static uint64_t progress_size; static uint64_t progress_pct; static bool progress_show; void progress_init(const char * prompt, uint64_t size) { progress_show = isatty(fileno(stderr)) && (!opt_quiet) && (!opt_no_progress); progress_prompt = prompt; progress_size = size; progress_pct = 0; progress_next = ((progress_pct + 1) * progress_size + 99) / 100; if (! opt_quiet) { fprintf(stderr, "%s", prompt); if (progress_show) { fprintf(stderr, " %d%%", 0); } } } void progress_update(uint64_t progress) { if ((progress >= progress_next) && progress_show) { if (progress_size > 0) { progress_pct = 100 * progress / progress_size; fprintf(stderr, " \r%s %" PRIu64 "%%", progress_prompt, progress_pct); progress_next = ((progress_pct + 1) * progress_size + 99) / 100; } else { fprintf(stderr, " \r%s 0%%", progress_prompt); } } } void progress_done() { if (! 
opt_quiet) { if (progress_show) { fprintf(stderr, " \r%s", progress_prompt); } fprintf(stderr, " %d%%\n", 100); } } void __attribute__((noreturn)) fatal(const char * msg) { fprintf(stderr, "\n\n"); fprintf(stderr, "Fatal error: %s\n", msg); if (fp_log) { fprintf(fp_log, "\n\n"); fprintf(fp_log, "Fatal error: %s\n", msg); } exit(EXIT_FAILURE); } void __attribute__((noreturn)) fatal(const char * format, const char * message) { fprintf(stderr, "\n\nFatal error: "); fprintf(stderr, format, message); fprintf(stderr, "\n"); if (opt_log) { fprintf(fp_log, "\n\nFatal error: "); fprintf(fp_log, format, message); fprintf(fp_log, "\n"); } exit(EXIT_FAILURE); } char * xstrdup(const char *s) { size_t len = strlen(s); char * p = (char*) xmalloc(len+1); return strcpy(p, s); } char * xstrchrnul(char *s, int c) { char * r = strchr(s, c); if (r) { return r; } else { return (char *)s + strlen(s); } } int xsprintf(char * * ret, const char * format, ...) { va_list ap; va_start(ap, format); int len = vsnprintf(nullptr, 0, format, ap); va_end(ap); if (len < 0) { fatal("Error with vsnprintf in xsprintf"); } char * p = (char *) xmalloc(len + 1); va_start(ap, format); len = vsnprintf(p, len + 1, format, ap); va_end(ap); *ret = p; return len; } uint64_t hash_cityhash64(char * s, uint64_t n) { return CityHash64((const char*)s, n); } int64_t getusec() { struct timeval tv; if (gettimeofday(&tv,nullptr) != 0) { return 0; } return tv.tv_sec * 1000000 + tv.tv_usec; } void show_rusage() { #ifdef SHOW_RUSAGE double user_time = 0.0; double system_time = 0.0; arch_get_user_system_time(&user_time, &system_time); double megabytes = arch_get_memused() / 1024.0 / 1024.0; fprintf(stderr, "Time: %.3fs (user) %.3fs (sys) Memory: %.0lfMB\n", user_time, system_time, megabytes); if (opt_log) fprintf(fp_log, "Time: %.3fs (user) %.3fs (sys) Memory: %.0lfMB\n", user_time, system_time, megabytes); #endif } void reverse_complement(char * rc, char * seq, int64_t len) { /* Write the reverse complementary sequence to 
rc. The memory for rc must be long enough for the rc of the sequence (identical to the length of seq + 1. */ for(int64_t i=0; i 0 The random() function returns a random number in the range 0 to 2147483647 (=2^31-1=RAND_MAX), inclusive. We should avoid some of the upper generated numbers to avoid modulo bias. */ int64_t random_max = RAND_MAX; int64_t limit = random_max - (random_max + 1) % n; int64_t r = arch_random(); while (r > limit) { r = arch_random(); } return r % n; } uint64_t random_ulong(uint64_t n) { /* Generate a random integer in the range 0 to n-1, inclusive, n must be > 0 */ uint64_t random_max = ULONG_MAX; uint64_t limit = random_max - (random_max - n + 1) % n; uint64_t r = ((arch_random() << 48) ^ (arch_random() << 32) ^ (arch_random() << 16) ^ (arch_random())); while (r > limit) { r = ((arch_random() << 48) ^ (arch_random() << 32) ^ (arch_random() << 16) ^ (arch_random())); } return r % n; } void string_normalize(char * normalized, char * s, unsigned int len) { /* convert string to upper case and replace U by T */ char * p = s; char * q = normalized; for(unsigned int i=0; i> 4]; hex[2*i+1] = hexdigits[digest[i] & 15]; } hex[2*LEN_DIG_SHA1] = 0; } void get_hex_seq_digest_md5(char * hex, char * seq, int seqlen) { /* Save hexadecimal representation of the MD5 hash of the sequence. The string array digest must be large enough (LEN_HEX_DIG_MD5). First normalize string by uppercasing it and replacing U's with T's. 
*/ char * normalized = (char*) xmalloc(seqlen+1); string_normalize(normalized, seq, seqlen); unsigned char digest[MD5_DIGEST_LENGTH]; MD5(normalized, (size_t) seqlen, digest); xfree(normalized); for(int i=0; i<MD5_DIGEST_LENGTH; i++) { hex[2*i+0] = hexdigits[digest[i] >> 4]; hex[2*i+1] = hexdigits[digest[i] & 15]; } hex[2*MD5_DIGEST_LENGTH] = 0; } void fprint_seq_digest_sha1(FILE * fp, char * seq, int seqlen) { char digest[LEN_HEX_DIG_SHA1]; get_hex_seq_digest_sha1(digest, seq, seqlen); fprintf(fp, "%s", digest); } void fprint_seq_digest_md5(FILE * fp, char * seq, int seqlen) { char digest[LEN_HEX_DIG_MD5]; get_hex_seq_digest_md5(digest, seq, seqlen); fprintf(fp, "%s", digest); } FILE * fopen_input(const char * filename) { /* open the input stream given by filename, but use stdin if name is - */ if (strcmp(filename, "-") == 0) { int fd = dup(STDIN_FILENO); if (fd < 0) { return nullptr; } else { return fdopen(fd, "rb"); } } else { return fopen(filename, "rb"); } } FILE * fopen_output(const char * filename) { /* open the output stream given by filename, but use stdout if name is - */ if (strcmp(filename, "-") == 0) { int fd = dup(STDOUT_FILENO); if (fd < 0) { return nullptr; } else { return fdopen(fd, "w"); } } else { return fopen(filename, "w"); } } vsearch-2.21.1/src/rerep.h0000644000175000017500000000470114171574117014645 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License.
GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ void rereplicate(); vsearch-2.21.1/src/userfields.cc0000644000175000017500000001046314171574117016035 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" static const char * userfields_names[] = { "query", // 0 "target", // 1 "evalue", // 2 "id", // 3 "pctpv", "pctgaps", "pairs", "gaps", "qlo", "qhi", "tlo", "thi", "pv", "ql", "tl", "qs", "ts", "alnlen", "opens", "exts", "raw", "bits", "aln", "caln", "qstrand", "tstrand", "qrow", "trow", "qframe", "tframe", "mism", "ids", "qcov", "tcov", // 33 "id0", "id1", "id2", "id3", "id4", // 38 "qilo", // 39 "qihi", "tilo", "tihi", // 42 nullptr }; int * userfields_requested = nullptr; int userfields_requested_count = 0; int parse_userfields_arg(char * arg) { // Parses the userfields option argument, e.g. query+target+id+alnlen+mism // and returns 1 if it is ok or 0 if not. char * p = arg; char * e = p + strlen(p); // pointer to end of string userfields_requested_count = 1; while(p unrecognized field return 0; // bad argument } int i = (int)(((const char**)u) - userfields_names); userfields_requested[fields++] = i; p = q; if (p == e) { // reached end of argument return 1; } p++; } } vsearch-2.21.1/src/db.cc0000644000175000017500000003335614171574117014263 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. 
Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" #define MEMCHUNK 16777216 static fastx_handle h = nullptr; static bool is_fastq = false; static uint64_t sequences = 0; static uint64_t nucleotides = 0; static uint64_t longest = 0; static uint64_t shortest = 0; static uint64_t longestheader = 0; seqinfo_t * seqindex = nullptr; char * datap = nullptr; void db_setinfo(bool new_is_fastq, uint64_t new_sequences, uint64_t new_nucleotides, uint64_t new_longest, uint64_t new_shortest, uint64_t new_longestheader) { is_fastq = new_is_fastq; sequences = new_sequences; nucleotides = new_nucleotides; longest = new_longest; shortest = new_shortest; longestheader = new_longestheader; } bool db_is_fastq() { return is_fastq; } char * db_getquality(uint64_t seqno) { if (is_fastq) { return datap + seqindex[seqno].qual_p; } else { return nullptr; } } void db_read(const char * filename, int upcase) { h = fastx_open(filename); if (!h) { fatal("Unrecognized file type (not proper FASTA or FASTQ format)"); } is_fastq = fastx_is_fastq(h); int64_t filesize = fastx_get_size(h); char * prompt = nullptr; if (xsprintf(& prompt, "Reading file %s", filename) == -1) { fatal("Out of memory"); } progress_init(prompt, filesize); longest = 0; shortest = LONG_MAX; longestheader = 0; sequences = 0; nucleotides = 0; int64_t discarded_short = 0; int64_t discarded_long = 0; int64_t discarded_unoise = 0; /* allocate space for data */ uint64_t dataalloc = 0; datap = nullptr; uint64_t datalen = 0; /* allocate space for index */ 
size_t seqindex_alloc = 0; seqindex = nullptr; while(fastx_next(h, ! opt_notrunclabels, upcase ? chrmap_upcase : chrmap_no_change)) { size_t headerlength = fastx_get_header_length(h); size_t sequencelength = fastx_get_sequence_length(h); int64_t abundance = fastx_get_abundance(h); if (sequencelength < (size_t)opt_minseqlength) { discarded_short++; } else if (sequencelength > (size_t)opt_maxseqlength) { discarded_long++; } else if (opt_cluster_unoise && (abundance < (int64_t)opt_minsize)) { discarded_unoise++; } else { /* grow space for data, if necessary */ size_t dataalloc_old = dataalloc; size_t needed = datalen + headerlength + 1 + sequencelength + 1; if (is_fastq) { needed += sequencelength + 1; } while (dataalloc < needed) { dataalloc += MEMCHUNK; } if (dataalloc > dataalloc_old) { datap = (char *) xrealloc(datap, dataalloc); } /* store the header */ size_t header_p = datalen; memcpy(datap + header_p, fastx_get_header(h), headerlength + 1); datalen += headerlength + 1; /* store sequence */ size_t sequence_p = datalen; memcpy(datap + sequence_p, fastx_get_sequence(h), sequencelength + 1); datalen += sequencelength + 1; size_t quality_p = datalen; if (is_fastq) { /* store quality */ memcpy(datap+quality_p, fastx_get_quality(h), sequencelength + 1); datalen += sequencelength + 1; } /* grow space for index, if necessary */ size_t seqindex_alloc_old = seqindex_alloc; while ((sequences + 1) * sizeof(seqinfo_t) > seqindex_alloc) { seqindex_alloc += MEMCHUNK; } if (seqindex_alloc > seqindex_alloc_old) { seqindex = (seqinfo_t *) xrealloc(seqindex, seqindex_alloc); } /* update index */ seqinfo_t * seqindex_p = seqindex + sequences; seqindex_p->headerlen = headerlength; seqindex_p->seqlen = sequencelength; seqindex_p->header_p = header_p; seqindex_p->seq_p = sequence_p; seqindex_p->qual_p = quality_p; seqindex_p->size = abundance; /* update statistics */ sequences++; nucleotides += sequencelength; if (sequencelength > longest) { longest = sequencelength; } if 
(sequencelength < shortest) { shortest = sequencelength; } if (headerlength > longestheader) { longestheader = headerlength; } } progress_update(fastx_get_position(h)); } progress_done(); xfree(prompt); fastx_close(h); if (!opt_quiet) { if (sequences > 0) { fprintf(stderr, "%'" PRIu64 " nt in %'" PRIu64 " seqs, " "min %'" PRIu64 ", max %'" PRIu64 ", avg %'.0f\n", db_getnucleotidecount(), db_getsequencecount(), db_getshortestsequence(), db_getlongestsequence(), db_getnucleotidecount() * 1.0 / db_getsequencecount()); } else { fprintf(stderr, "%'" PRIu64 " nt in %'" PRIu64 " seqs\n", db_getnucleotidecount(), db_getsequencecount()); } } if (opt_log) { if (sequences > 0) { fprintf(fp_log, "%'" PRIu64 " nt in %'" PRIu64 " seqs, " "min %'" PRIu64 ", max %'" PRIu64 ", avg %'.0f\n\n", db_getnucleotidecount(), db_getsequencecount(), db_getshortestsequence(), db_getlongestsequence(), db_getnucleotidecount() * 1.0 / db_getsequencecount()); } else { fprintf(fp_log, "%'" PRIu64 " nt in %'" PRIu64 " seqs\n\n", db_getnucleotidecount(), db_getsequencecount()); } } /* Warn about discarded sequences */ if (discarded_short) { fprintf(stderr, "minseqlength %" PRId64 ": %" PRId64 " %s discarded.\n", opt_minseqlength, discarded_short, (discarded_short == 1 ? "sequence" : "sequences")); if (opt_log) { fprintf(fp_log, "minseqlength %" PRId64 ": %" PRId64 " %s discarded.\n\n", opt_minseqlength, discarded_short, (discarded_short == 1 ? "sequence" : "sequences")); } } if (discarded_long) { fprintf(stderr, "maxseqlength %" PRId64 ": %" PRId64 " %s discarded.\n", opt_maxseqlength, discarded_long, (discarded_long == 1 ? "sequence" : "sequences")); if (opt_log) { fprintf(fp_log, "maxseqlength %" PRId64 ": %" PRId64 " %s discarded.\n\n", opt_maxseqlength, discarded_long, (discarded_long == 1 ? "sequence" : "sequences")); } } if (discarded_unoise) { fprintf(stderr, "minsize %" PRId64 ": %" PRId64 " %s discarded.\n", opt_minsize, discarded_unoise, (discarded_unoise == 1 ? 
"sequence" : "sequences")); if (opt_log) { fprintf(fp_log, "minsize %" PRId64 ": %" PRId64 " %s discarded.\n", opt_minsize, discarded_unoise, (discarded_unoise == 1 ? "sequence" : "sequences")); } } show_rusage(); } uint64_t db_getsequencecount() { return sequences; } uint64_t db_getnucleotidecount() { return nucleotides; } uint64_t db_getlongestheader() { return longestheader; } uint64_t db_getlongestsequence() { return longest; } uint64_t db_getshortestsequence() { return shortest; } void db_free() { if (datap) { xfree(datap); } if (seqindex) { xfree(seqindex); } } int compare_bylength(const void * a, const void * b) { auto * x = (seqinfo_t *) a; auto * y = (seqinfo_t *) b; /* longest first, then by abundance, then by label, otherwise keep order */ if (x->seqlen < y->seqlen) { return +1; } else if (x->seqlen > y->seqlen) { return -1; } else { if (x->size < y->size) { return +1; } else if (x->size > y->size) { return -1; } else { int r = strcmp(datap + x->header_p, datap + y->header_p); if (r != 0) { return r; } else { if (x < y) { return -1; } else if (x > y) { return +1; } else { return 0; } } } } } int compare_bylength_shortest_first(const void * a, const void * b) { auto * x = (seqinfo_t *) a; auto * y = (seqinfo_t *) b; /* shortest first, then by abundance, then by label, otherwise keep order */ if (x->seqlen < y->seqlen) { return -1; } else if (x->seqlen > y->seqlen) { return +1; } else { if (x->size < y->size) { return +1; } else if (x->size > y->size) { return -1; } else { int r = strcmp(datap + x->header_p, datap + y->header_p); if (r != 0) { return r; } else { if (x < y) { return -1; } else if (x > y) { return +1; } else { return 0; } } } } } inline int compare_byabundance(const void * a, const void * b) { auto * x = (seqinfo_t *) a; auto * y = (seqinfo_t *) b; /* most abundant first, then by label, otherwise keep order */ if (x->size > y->size) { return -1; } else if (x->size < y->size) { return +1; } else { int r = strcmp(datap + x->header_p, datap + 
y->header_p); if (r != 0) { return r; } else { if (x < y) { return -1; } else if (x > y) { return +1; } else { return 0; } } } } void db_sortbylength() { progress_init("Sorting by length", 100); qsort(seqindex, sequences, sizeof(seqinfo_t), compare_bylength); progress_done(); } void db_sortbylength_shortest_first() { progress_init("Sorting by length", 100); qsort(seqindex, sequences, sizeof(seqinfo_t), compare_bylength_shortest_first); progress_done(); } void db_sortbyabundance() { progress_init("Sorting by abundance", 100); qsort(seqindex, sequences, sizeof(seqinfo_t), compare_byabundance); progress_done(); } vsearch-2.21.1/src/sffconvert.cc0000644000175000017500000004136014171574117016047 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. 
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" uint32_t sff_magic = 0x2e736666; struct sff_header_s { uint32_t magic_number; /* .sff */ uint32_t version; uint64_t index_offset; uint32_t index_length; uint32_t number_of_reads; uint16_t header_length; uint16_t key_length; uint16_t flows_per_read; uint8_t flowgram_format_code; } sff_header; struct sff_read_header_s { uint16_t read_header_length; uint16_t name_length; uint32_t number_of_bases; uint16_t clip_qual_left; uint16_t clip_qual_right; uint16_t clip_adapter_left; uint16_t clip_adapter_right; } read_header; uint64_t fskip(FILE * fp, uint64_t length) { /* read given amount of data from a stream and ignore it */ /* used instead of seeking in order to work with pipes */ #define BLOCKSIZE 4096 char buffer[BLOCKSIZE]; uint64_t skipped = 0; uint64_t rest = length; while (rest > 0) { uint64_t want = ((rest > BLOCKSIZE) ? 
BLOCKSIZE : rest); uint64_t got = fread(buffer, 1, want, fp); skipped += got; rest -= got; if (got < want) { break; } } return skipped; } void sff_convert() { if (! opt_fastqout) { fatal("No output file for sff_convert specified with --fastqout."); } FILE * fp_fastqout = fopen_output(opt_fastqout); if (!fp_fastqout) { fatal("Unable to open FASTQ output file for writing."); } FILE * fp_sff = fopen_input(opt_sff_convert); if (!fp_sff) { fatal("Unable to open SFF input file for reading."); } /* read and check header */ uint64_t filepos = 0; if (fread(&sff_header, 1, 31, fp_sff) < 31) { fatal("Unable to read from SFF file. File may be truncated."); } filepos += 31; sff_header.magic_number = bswap_32(sff_header.magic_number); sff_header.version = bswap_32(sff_header.version); sff_header.index_offset = bswap_64(sff_header.index_offset); sff_header.index_length = bswap_32(sff_header.index_length); sff_header.number_of_reads = bswap_32(sff_header.number_of_reads); sff_header.header_length = bswap_16(sff_header.header_length); sff_header.key_length = bswap_16(sff_header.key_length); sff_header.flows_per_read = bswap_16(sff_header.flows_per_read); if (sff_header.magic_number != sff_magic) { fatal("Invalid SFF file. Incorrect magic number. Must be 0x2e736666 (.sff)."); } if (sff_header.version != 1) { fatal("Invalid SFF file. Incorrect version. Must be 1."); } if (sff_header.flowgram_format_code != 1) { fatal("Invalid SFF file. Incorrect flowgram format code. Must be 1."); } if (sff_header.header_length != 8 * ((31 + sff_header.flows_per_read + sff_header.key_length + 7) / 8)) { fatal("Invalid SFF file. Incorrect header length."); } if (sff_header.key_length != 4) { fatal("Invalid SFF file. Incorrect key length. Must be 4."); } if ((sff_header.index_length > 0) && (sff_header.index_length < 8)) { fatal("Invalid SFF file. Incorrect index size. 
Must be at least 8."); } /* read and check flow chars, key and padding */ if (fskip(fp_sff, sff_header.flows_per_read) < sff_header.flows_per_read) { fatal("Invalid SFF file. Unable to read flow characters. File may be truncated."); } filepos += sff_header.flows_per_read; char * key_sequence = (char *) xmalloc(sff_header.key_length + 1); if (fread(key_sequence, 1, sff_header.key_length, fp_sff) < sff_header.key_length) { fatal("Invalid SFF file. Unable to read key sequence. File may be truncated."); } key_sequence[sff_header.key_length] = 0; filepos += sff_header.key_length; uint32_t padding_length = sff_header.header_length - sff_header.flows_per_read - sff_header.key_length - 31; if (fskip(fp_sff, padding_length) < padding_length) { fatal("Invalid SFF file. Unable to read padding. File may be truncated."); } filepos += padding_length; double totallength = 0.0; uint32_t minimum = UINT_MAX; uint32_t maximum = 0; bool index_done = (sff_header.index_offset == 0) || (sff_header.index_length == 0); bool index_odd = false; char index_kind[9]; uint32_t index_padding = 0; if ((sff_header.index_length & 7) > 0) { index_padding = 8 - (sff_header.index_length & 7); } if (! opt_quiet) { fprintf(stderr, "Number of reads: %d\n", sff_header.number_of_reads); fprintf(stderr, "Flows per read: %d\n", sff_header.flows_per_read); fprintf(stderr, "Key sequence: %s\n", key_sequence); } if (opt_log) { fprintf(fp_log, "Number of reads: %d\n", sff_header.number_of_reads); fprintf(fp_log, "Flows per read: %d\n", sff_header.flows_per_read); fprintf(fp_log, "Key sequence: %s\n", key_sequence); } progress_init("Converting SFF: ", sff_header.number_of_reads); for(uint32_t read_no = 0; read_no < sff_header.number_of_reads; read_no++) { /* check if the index block is here */ if (! index_done) { if (filepos == sff_header.index_offset) { if (fread(index_kind, 1, 8, fp_sff) < 8) { fatal("Invalid SFF file. Unable to read index header. 
File may be truncated."); } filepos += 8; index_kind[8] = 0; uint64_t index_size = sff_header.index_length - 8 + index_padding; if (fskip(fp_sff, index_size) != index_size) { fatal("Invalid SFF file. Unable to read entire index. File may be truncated."); } filepos += index_size; index_done = true; index_odd = true; } } /* read and check each read header */ if (fread(&read_header, 1, 16, fp_sff) < 16) { fatal("Invalid SFF file. Unable to read read header. File may be truncated."); } filepos += 16; read_header.read_header_length = bswap_16(read_header.read_header_length); read_header.name_length = bswap_16(read_header.name_length); read_header.number_of_bases = bswap_32(read_header.number_of_bases); read_header.clip_qual_left = bswap_16(read_header.clip_qual_left); read_header.clip_qual_right = bswap_16(read_header.clip_qual_right); read_header.clip_adapter_left = bswap_16(read_header.clip_adapter_left); read_header.clip_adapter_right = bswap_16(read_header.clip_adapter_right); if (read_header.read_header_length != 8 * ((16 + read_header.name_length + 7) / 8)) { fatal("Invalid SFF file. Incorrect read header length."); } if (read_header.clip_qual_left > read_header.number_of_bases) { fatal("Invalid SFF file. Incorrect clip_qual_left value."); } if (read_header.clip_adapter_left > read_header.number_of_bases) { fatal("Invalid SFF file. Incorrect clip_adapter_left value."); } if (read_header.clip_qual_right > read_header.number_of_bases) { fatal("Invalid SFF file. Incorrect clip_qual_right value."); } if (read_header.clip_adapter_right > read_header.number_of_bases) { fatal("Invalid SFF file. Incorrect clip_adapter_right value."); } char * read_name = (char *) xmalloc(read_header.name_length + 1); if (fread(read_name, 1, read_header.name_length, fp_sff) < read_header.name_length) { fatal("Invalid SFF file. Unable to read read name.
File may be truncated."); } filepos += read_header.name_length; read_name[read_header.name_length] = 0; uint32_t read_header_padding_length = read_header.read_header_length - read_header.name_length - 16; if (fskip(fp_sff, read_header_padding_length) < read_header_padding_length) { fatal("Invalid SFF file. Unable to read read header padding. File may be truncated."); } filepos += read_header_padding_length; /* read and check the flowgram and sequence */ /* each flowgram value is 2 bytes, so the full 2 * flows_per_read bytes must be skipped */ if (fskip(fp_sff, 2 * sff_header.flows_per_read) < 2U * sff_header.flows_per_read) { fatal("Invalid SFF file. Unable to read flowgram values. File may be truncated."); } filepos += 2 * sff_header.flows_per_read; if (fskip(fp_sff, read_header.number_of_bases) < read_header.number_of_bases) { fatal("Invalid SFF file. Unable to read flow indices. File may be truncated."); } filepos += read_header.number_of_bases; char * bases = (char *) xmalloc(read_header.number_of_bases + 1); if (fread(bases, 1, read_header.number_of_bases, fp_sff) < read_header.number_of_bases) { fatal("Invalid SFF file. Unable to read bases. File may be truncated."); } bases[read_header.number_of_bases] = 0; filepos += read_header.number_of_bases; char * qual = (char *) xmalloc(read_header.number_of_bases + 1); if (fread(qual, 1, read_header.number_of_bases, fp_sff) < read_header.number_of_bases) { fatal("Invalid SFF file. Unable to read quality scores.
File may be truncated."); } filepos += read_header.number_of_bases; /* convert quality scores to ascii characters */ for(uint32_t base_no = 0; base_no < read_header.number_of_bases; base_no++) { int q = qual[base_no]; if (q < opt_fastq_qminout) { q = opt_fastq_qminout; } if (q > opt_fastq_qmaxout) { q = opt_fastq_qmaxout; } qual[base_no] = opt_fastq_asciiout + q; } qual[read_header.number_of_bases] = 0; uint32_t read_data_length = (2 * sff_header.flows_per_read + 3 * read_header.number_of_bases); uint32_t read_data_padded_length = 8 * ((read_data_length + 7) / 8); uint32_t read_data_padding_length = read_data_padded_length - read_data_length; if (fskip(fp_sff, read_data_padding_length) < read_data_padding_length) { fatal("Invalid SFF file. Unable to read read data padding. File may be truncated."); } filepos += read_data_padding_length; uint32_t clip_start = 0; clip_start = MAX(1, MAX(read_header.clip_qual_left, read_header.clip_adapter_left)) - 1; uint32_t clip_end = read_header.number_of_bases; clip_end = MIN((read_header.clip_qual_right == 0 ? read_header.number_of_bases : read_header.clip_qual_right), (read_header.clip_adapter_right == 0 ? 
read_header.number_of_bases : read_header.clip_adapter_right)); /* make the clipped bases lowercase and the rest uppercase */ for (uint32_t i = 0; i < read_header.number_of_bases; i++) { if ((i < clip_start) || (i >= clip_end)) { bases[i] = tolower(bases[i]); } else { bases[i] = toupper(bases[i]); } } if (opt_sff_clip) { bases[clip_end] = 0; qual[clip_end] = 0; } else { clip_start = 0; clip_end = read_header.number_of_bases; } uint32_t length = clip_end - clip_start; fastq_print_general(fp_fastqout, bases + clip_start, length, read_name, strlen(read_name), qual + clip_start, 1, read_no + 1, -1.0); xfree(read_name); xfree(bases); xfree(qual); totallength += length; if (length < minimum) { minimum = length; } if (length > maximum) { maximum = length; } progress_update(read_no + 1); } progress_done(); /* check if the index block is here */ if (! index_done) { if (filepos == sff_header.index_offset) { if (fread(index_kind, 1, 8, fp_sff) < 8) { fatal("Invalid SFF file. Unable to read index header. File may be truncated."); } filepos += 8; index_kind[8] = 0; uint64_t index_size = sff_header.index_length - 8; if (fskip(fp_sff, index_size) != index_size) { fatal("Invalid SFF file. Unable to read entire index. File may be truncated."); } filepos += index_size; index_done = true; /* try to skip padding, if any */ if (index_padding > 0) { uint64_t got = fskip(fp_sff, index_padding); if ((got < index_padding) && (got != 0)) { fprintf(stderr, "WARNING: Additional data at end of SFF file ignored\n"); } } } } if (!
index_done) { fprintf(stderr, "WARNING: SFF index missing\n"); if (opt_log) { fprintf(fp_log, "WARNING: SFF index missing\n"); } } if (index_odd) { fprintf(stderr, "WARNING: Index at unusual position in file\n"); if (opt_log) { fprintf(fp_log, "WARNING: Index at unusual position in file\n"); } } /* ignore the rest of file */ /* try reading just another byte */ if (fskip(fp_sff, 1) > 0) { fprintf(stderr, "WARNING: Additional data at end of SFF file ignored\n"); if (opt_log) { fprintf(fp_log, "WARNING: Additional data at end of SFF file ignored\n"); } } fclose(fp_sff); fclose(fp_fastqout); double average = totallength / sff_header.number_of_reads; if (! opt_quiet) { if (sff_header.index_length > 0) { fprintf(stderr, "Index type: %s\n", index_kind); } fprintf(stderr, "\nSFF file read successfully.\n"); if (sff_header.number_of_reads > 0) { fprintf(stderr, "Sequence length: minimum %d, average %.1f, maximum %d\n", minimum, average, maximum); } } if (opt_log) { if (sff_header.index_length > 0) { fprintf(fp_log, "Index type: %s\n", index_kind); } fprintf(fp_log, "\nSFF file read successfully.\n"); if (sff_header.number_of_reads > 0) { fprintf(fp_log, "Sequence length: minimum %d, average %.1f, maximum %d\n", minimum, average, maximum); } } xfree(key_sequence); } vsearch-2.21.1/src/shuffle.cc0000644000175000017500000000714614171574117015330 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. 
GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "vsearch.h" void shuffle() { if (!opt_output) fatal("Output file for shuffling must be specified with --output"); FILE * fp_output = fopen_output(opt_output); if (!fp_output) { fatal("Unable to open shuffle output file for writing"); } db_read(opt_shuffle, 0); show_rusage(); int dbsequencecount = db_getsequencecount(); int * deck = (int*) xmalloc(dbsequencecount * sizeof(int)); for(int i=0; i0; i--) { /* generate a random number j in the range 0 to i, inclusive */ int j = random_int(i+1); /* exchange elements i and j */ int t = deck[i]; deck[i] = deck[j]; deck[j] = t; passed++; progress_update(passed); } progress_done(); show_rusage(); passed = MIN(dbsequencecount, opt_topn); progress_init("Writing output", passed); for(int i=0; i, Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" /* chunk constants */ static const int chunk_size = 500; /* read pairs per chunk */ static const int chunk_factor = 2; /* chunks per thread */ /* scores in bits */ static const int k = 5; static int merge_mindiagcount = 4; static double merge_minscore = 16.0; static const double merge_dropmax = 16.0; static const double merge_mismatchmax = -4.0; /* static variables */ static FILE * fp_fastqout = nullptr; static FILE * fp_fastaout = nullptr; static FILE * fp_fastqout_notmerged_fwd = nullptr; static FILE * fp_fastqout_notmerged_rev = nullptr; static FILE * fp_fastaout_notmerged_fwd = nullptr; static FILE * fp_fastaout_notmerged_rev = nullptr; static FILE * fp_eetabbedout = nullptr; static fastx_handle fastq_fwd; static fastx_handle fastq_rev; static int64_t merged = 0; static int64_t notmerged = 0; static int64_t total = 0; static double sum_read_length = 0.0; static double sum_squared_fragment_length = 0.0; static double sum_fragment_length = 0.0; static pthread_t * pthread; static pthread_attr_t attr; 
static char merge_qual_same[128][128]; static char merge_qual_diff[128][128]; static double match_score[128][128]; static double mism_score[128][128]; static double q2p[128]; static double sum_ee_fwd = 0.0; static double sum_ee_rev = 0.0; static double sum_ee_merged = 0.0; static uint64_t sum_errors_fwd = 0; static uint64_t sum_errors_rev = 0; static uint64_t failed_undefined = 0; static uint64_t failed_minlen = 0; static uint64_t failed_maxlen = 0; static uint64_t failed_maxns = 0; static uint64_t failed_minovlen = 0; static uint64_t failed_maxdiffs = 0; static uint64_t failed_maxdiffpct = 0; static uint64_t failed_staggered = 0; static uint64_t failed_indel = 0; static uint64_t failed_repeat = 0; static uint64_t failed_minmergelen = 0; static uint64_t failed_maxmergelen = 0; static uint64_t failed_maxee = 0; static uint64_t failed_minscore = 0; static uint64_t failed_nokmers = 0; /* reasons for not merging: - undefined - ok - input seq too short (after truncation) - input seq too long - too many Ns in input - overlap too short - too many differences (maxdiffs) - too high percentage of differences (maxdiffpct) - staggered - indels in overlap region - potential repeats in overlap region / multiple overlaps - merged sequence too short - merged sequence too long - expected error too high - alignment score too low, insignificant, potential indel - too few kmers on same diag found */ enum reason_enum { undefined, ok, minlen, maxlen, maxns, minovlen, maxdiffs, maxdiffpct, staggered, indel, repeat, minmergelen, maxmergelen, maxee, minscore, nokmers }; enum state_enum { empty, filled, inprogress, processed }; typedef struct merge_data_s { char * fwd_header; char * rev_header; char * fwd_sequence; char * rev_sequence; char * fwd_quality; char * rev_quality; int64_t header_alloc; int64_t seq_alloc; int64_t fwd_length; int64_t rev_length; int64_t fwd_trunc; int64_t rev_trunc; int64_t pair_no; char * merged_sequence; char * merged_quality; int64_t merged_length; int64_t
merged_seq_alloc; double ee_merged; double ee_fwd; double ee_rev; int64_t fwd_errors; int64_t rev_errors; int64_t offset; bool merged; reason_enum reason; state_enum state; } merge_data_t; typedef struct chunk_s { int size; /* size of merge_data = number of pairs of reads */ state_enum state; /* state of chunk: empty, read, processed */ merge_data_t * merge_data; /* data for merging */ } chunk_t; static chunk_t * chunks; /* pointer to array of chunks */ static int chunk_count; static int chunk_read_next; static int chunk_process_next; static int chunk_write_next; static bool finished_reading = false; static bool finished_all = false; static int pairs_read = 0; static int pairs_written = 0; static pthread_mutex_t mutex_chunks; static pthread_cond_t cond_chunks; FILE * fileopenw(char * filename) { FILE * fp = nullptr; fp = fopen_output(filename); if (!fp) { fatal("Unable to open file for writing (%s)", filename); } return fp; } inline int get_qual(char q) { int qual = q - opt_fastq_ascii; if (qual < opt_fastq_qmin) { fprintf(stderr, "\n\nFatal error: FASTQ quality value (%d) below qmin (%" PRId64 ")\n", qual, opt_fastq_qmin); if (fp_log) { /* write the same message to the log file, not to stderr a second time */ fprintf(fp_log, "\n\nFatal error: FASTQ quality value (%d) below qmin (%" PRId64 ")\n", qual, opt_fastq_qmin); } exit(EXIT_FAILURE); } else if (qual > opt_fastq_qmax) { fprintf(stderr, "\n\nFatal error: FASTQ quality value (%d) above qmax (%" PRId64 ")\n", qual, opt_fastq_qmax); fprintf(stderr, "By default, quality values range from 0 to 41.\n" "To allow higher quality values, " "please use the option --fastq_qmax %d\n", qual); if (fp_log) { fprintf(fp_log, "\n\nFatal error: FASTQ quality value (%d) above qmax (%" PRId64 ")\n", qual, opt_fastq_qmax); fprintf(fp_log, "By default, quality values range from 0 to 41.\n" "To allow higher quality values, " "please use the option --fastq_qmax %d\n", qual); } exit(EXIT_FAILURE); } return qual; } inline double q_to_p(int q) { int x = q - opt_fastq_ascii; if (x < 2) { return 0.75; } else {
return exp10(-x/10.0); } } void precompute_qual() { /* Precompute tables of scores etc */ for (int x = 33; x <= 126; x++) { double px = q_to_p(x); q2p[x] = px; for (int y = 33; y <= 126; y++) { double py = q_to_p(y); double p, q; /* Quality score equations from Edgar & Flyvbjerg (2015) */ /* Match */ p = px * py / 3.0 / (1.0 - px - py + 4.0 * px * py / 3.0); q = round(-10.0 * log10(p)); q = MIN(q, opt_fastq_qmaxout); q = MAX(q, opt_fastq_qminout); merge_qual_same[x][y] = opt_fastq_ascii + q; /* Mismatch, x is highest quality */ p = px * (1.0 - py / 3.0) / (px + py - 4.0 * px * py / 3.0); q = round(-10.0 * log10(p)); q = MIN(q, opt_fastq_qmaxout); q = MAX(q, opt_fastq_qminout); merge_qual_diff[x][y] = opt_fastq_ascii + q; /* observed match, p = probability that they truly are identical, given error probabilities of px and py, resp. */ // Given two initially identical aligned bases, and // the error probabilities px and py, // what is the probability of observing a match (or a mismatch)? p = 1.0 - px - py + px * py * 4.0 / 3.0; match_score[x][y] = log2(p/0.25); // Use a minimum mismatch penalty mism_score[x][y] = MIN(log2((1.0-p)/0.75), merge_mismatchmax); } } } void merge_sym(char * sym, char * qual, char fwd_sym, char rev_sym, char fwd_qual, char rev_qual) { if (rev_sym == 'N') { * sym = fwd_sym; * qual = fwd_qual; } else if (fwd_sym == 'N') { * sym = rev_sym; * qual = rev_qual; } else if (fwd_sym == rev_sym) { /* agreement */ * sym = fwd_sym; * qual = merge_qual_same[(unsigned)fwd_qual][(unsigned)rev_qual]; } else { /* disagreement */ if (fwd_qual > rev_qual) { * sym = fwd_sym; * qual = merge_qual_diff[(unsigned)fwd_qual][(unsigned)rev_qual]; } else { * sym = rev_sym; * qual = merge_qual_diff[(unsigned)rev_qual][(unsigned)fwd_qual]; } } } void keep(merge_data_t * ip) { merged++; sum_fragment_length += ip->merged_length; sum_squared_fragment_length += ip->merged_length * ip->merged_length; sum_ee_merged += ip->ee_merged; sum_ee_fwd += ip->ee_fwd; sum_ee_rev +=
ip->ee_rev; sum_errors_fwd += ip->fwd_errors; sum_errors_rev += ip->rev_errors; if (opt_fastqout) { fastq_print_general(fp_fastqout, ip->merged_sequence, ip->merged_length, ip->fwd_header, strlen(ip->fwd_header), ip->merged_quality, 0, merged, ip->ee_merged); } if (opt_fastaout) { fasta_print_general(fp_fastaout, nullptr, ip->merged_sequence, ip->merged_length, ip->fwd_header, strlen(ip->fwd_header), 0, merged, ip->ee_merged, -1, -1, nullptr, 0.0); } if (opt_eetabbedout) { fprintf(fp_eetabbedout, "%.2lf\t%.2lf\t%" PRId64 "\t%" PRId64 "\n", ip->ee_fwd, ip->ee_rev, ip->fwd_errors, ip->rev_errors); } } void discard(merge_data_t * ip) { switch(ip->reason) { case undefined: failed_undefined++; break; case ok: break; case minlen: failed_minlen++; break; case maxlen: failed_maxlen++; break; case maxns: failed_maxns++; break; case minovlen: failed_minovlen++; break; case maxdiffs: failed_maxdiffs++; break; case maxdiffpct: failed_maxdiffpct++; break; case staggered: failed_staggered++; break; case indel: failed_indel++; break; case repeat: failed_repeat++; break; case minmergelen: failed_minmergelen++; break; case maxmergelen: failed_maxmergelen++; break; case maxee: failed_maxee++; break; case minscore: failed_minscore++; break; case nokmers: failed_nokmers++; break; } notmerged++; if (opt_fastqout_notmerged_fwd) { fastq_print_general(fp_fastqout_notmerged_fwd, ip->fwd_sequence, ip->fwd_length, ip->fwd_header, strlen(ip->fwd_header), ip->fwd_quality, 0, notmerged, -1.0); } if (opt_fastqout_notmerged_rev) { fastq_print_general(fp_fastqout_notmerged_rev, ip->rev_sequence, ip->rev_length, ip->rev_header, strlen(ip->rev_header), ip->rev_quality, 0, notmerged, -1.0); } if (opt_fastaout_notmerged_fwd) { fasta_print_general(fp_fastaout_notmerged_fwd, nullptr, ip->fwd_sequence, ip->fwd_length, ip->fwd_header, strlen(ip->fwd_header), 0, notmerged, -1.0, -1, -1, nullptr, 0.0); } if (opt_fastaout_notmerged_rev) { fasta_print_general(fp_fastaout_notmerged_rev, nullptr, 
                          ip->rev_sequence,
                          ip->rev_length,
                          ip->rev_header,
                          strlen(ip->rev_header),
                          0,
                          notmerged,
                          -1.0,
                          -1, -1,
                          nullptr, 0.0);
    }
}

void merge(merge_data_t * ip)
{
  /* length of 5' overhang of the forward sequence not merged
     with the reverse sequence */

  int64_t fwd_5prime_overhang = ip->fwd_trunc > ip->offset ?
    ip->fwd_trunc - ip->offset : 0;

  ip->ee_merged = 0.0;
  ip->ee_fwd = 0.0;
  ip->ee_rev = 0.0;
  ip->fwd_errors = 0;
  ip->rev_errors = 0;

  char sym, qual;
  char fwd_sym, fwd_qual, rev_sym, rev_qual;
  int64_t fwd_pos, rev_pos, merged_pos;
  double ee;

  merged_pos = 0;

  // 5' overhang in forward sequence

  fwd_pos = 0;

  while(fwd_pos < fwd_5prime_overhang)
    {
      sym = ip->fwd_sequence[fwd_pos];
      qual = ip->fwd_quality[fwd_pos];

      ip->merged_sequence[merged_pos] = sym;
      ip->merged_quality[merged_pos] = qual;

      ee = q2p[(unsigned)qual];
      ip->ee_merged += ee;
      ip->ee_fwd += ee;

      fwd_pos++;
      merged_pos++;
    }

  // Merged region

  int64_t rev_3prime_overhang = ip->offset > ip->fwd_trunc ?
    ip->offset - ip->fwd_trunc : 0;

  rev_pos = ip->rev_trunc - 1 - rev_3prime_overhang;

  while ((fwd_pos < ip->fwd_trunc) && (rev_pos >= 0))
    {
      fwd_sym = ip->fwd_sequence[fwd_pos];
      rev_sym = chrmap_complement[(int)(ip->rev_sequence[rev_pos])];
      fwd_qual = ip->fwd_quality[fwd_pos];
      rev_qual = ip->rev_quality[rev_pos];

      merge_sym(& sym,
                & qual,
                fwd_qual < 2 ? 'N' : fwd_sym,
                rev_qual < 2 ? 'N' : rev_sym,
                fwd_qual,
                rev_qual);

      if (sym != fwd_sym)
        {
          ip->fwd_errors++;
        }
      if (sym != rev_sym)
        {
          ip->rev_errors++;
        }

      ip->merged_sequence[merged_pos] = sym;
      ip->merged_quality[merged_pos] = qual;
      ip->ee_merged += q2p[(unsigned)qual];
      ip->ee_fwd += q2p[(unsigned)fwd_qual];
      ip->ee_rev += q2p[(unsigned)rev_qual];

      fwd_pos++;
      rev_pos--;
      merged_pos++;
    }

  // 5' overhang in reverse sequence

  while (rev_pos >= 0)
    {
      sym = chrmap_complement[(int)(ip->rev_sequence[rev_pos])];
      qual = ip->rev_quality[rev_pos];

      ip->merged_sequence[merged_pos] = sym;
      ip->merged_quality[merged_pos] = qual;
      merged_pos++;

      ee = q2p[(unsigned)qual];
      ip->ee_merged += ee;
      ip->ee_rev += ee;

      rev_pos--;
    }

  int64_t mergelen = merged_pos;
  ip->merged_length = mergelen;

  ip->merged_sequence[mergelen] = 0;
  ip->merged_quality[mergelen] = 0;

  if (ip->ee_merged <= opt_fastq_maxee)
    {
      ip->reason = ok;
      ip->merged = true;
    }
  else
    {
      ip->reason = maxee;
    }
}

int64_t optimize(merge_data_t * ip,
                 kh_handle_s * kmerhash)
{
  /* ungapped alignment in each diagonal */

  int64_t i1 = 1;
  int64_t i2 = ip->fwd_trunc + ip->rev_trunc - 1;

  double best_score = 0.0;
  int64_t best_i = 0;
  int64_t best_diffs = 0;

  int hits = 0;

  int kmers = 0;

  int diags[ip->fwd_trunc + ip->rev_trunc];

  kh_insert_kmers(kmerhash, k, ip->fwd_sequence, ip->fwd_trunc);
  kh_find_diagonals(kmerhash, k, ip->rev_sequence, ip->rev_trunc, diags);

  for(int64_t i = i1; i <= i2; i++)
    {
      int diag = ip->rev_trunc + ip->fwd_trunc - i;
      int diagcount = diags[diag];

      if (diagcount >= merge_mindiagcount)
        {
          kmers = 1;

          /* for each interesting diagonal */

          int64_t fwd_3prime_overhang
            = i > ip->rev_trunc ? i - ip->rev_trunc : 0;
          int64_t rev_3prime_overhang
            = i > ip->fwd_trunc ?
i - ip->fwd_trunc : 0; int64_t overlap = i - fwd_3prime_overhang - rev_3prime_overhang; int64_t fwd_pos_start = ip->fwd_trunc - fwd_3prime_overhang - 1; int64_t rev_pos_start = ip->rev_trunc - rev_3prime_overhang - overlap; int64_t fwd_pos = fwd_pos_start; int64_t rev_pos = rev_pos_start; double score = 0.0; int64_t diffs = 0; double score_high = 0.0; double dropmax = 0.0; for (int64_t j=0; j < overlap; j++) { /* for each pair of bases in the overlap */ char fwd_sym = ip->fwd_sequence[fwd_pos]; char rev_sym = chrmap_complement[(int)(ip->rev_sequence[rev_pos])]; unsigned int fwd_qual = ip->fwd_quality[fwd_pos]; unsigned int rev_qual = ip->rev_quality[rev_pos]; fwd_pos--; rev_pos++; if (fwd_sym == rev_sym) { score += match_score[fwd_qual][rev_qual]; if (score > score_high) { score_high = score; } } else { score += mism_score[fwd_qual][rev_qual]; diffs++; if (score < score_high - dropmax) { dropmax = score_high - score; } } } if (dropmax >= merge_dropmax) { score = 0.0; } if (score >= merge_minscore) { hits++; } if (score > best_score) { best_score = score; best_i = i; best_diffs = diffs; } } } if (hits > 1) { ip->reason = repeat; return 0; } if ((! 
opt_fastq_allowmergestagger) && (best_i > ip->fwd_trunc)) { ip->reason = staggered; return 0; } if (best_diffs > opt_fastq_maxdiffs) { ip->reason = maxdiffs; return 0; } if ((100.0 * best_diffs / best_i) > opt_fastq_maxdiffpct) { ip->reason = maxdiffpct; return 0; } if (kmers == 0) { ip->reason = nokmers; return 0; } if (best_score < merge_minscore) { ip->reason = minscore; return 0; } if (best_i < opt_fastq_minovlen) { ip->reason = minovlen; return 0; } int mergelen = ip->fwd_trunc + ip->rev_trunc - best_i; if (mergelen < opt_fastq_minmergelen) { ip->reason = minmergelen; return 0; } if (mergelen > opt_fastq_maxmergelen) { ip->reason = maxmergelen; return 0; } return best_i; } void process(merge_data_t * ip, struct kh_handle_s * kmerhash) { ip->merged = false; bool skip = false; /* check length */ if ((ip->fwd_length < opt_fastq_minlen) || (ip->rev_length < opt_fastq_minlen)) { ip->reason = minlen; skip = true; } if ((ip->fwd_length > opt_fastq_maxlen) || (ip->rev_length > opt_fastq_maxlen)) { ip->reason = maxlen; skip = true; } /* truncate sequences by quality */ int64_t fwd_trunc = ip->fwd_length; if (!skip) { for (int64_t i = 0; i < ip->fwd_length; i++) { if (get_qual(ip->fwd_quality[i]) <= opt_fastq_truncqual) { fwd_trunc = i; break; } } if (fwd_trunc < opt_fastq_minlen) { ip->reason = minlen; skip = true; } } ip->fwd_trunc = fwd_trunc; int64_t rev_trunc = ip->rev_length; if (!skip) { for (int64_t i = 0; i < ip->rev_length; i++) { if (get_qual(ip->rev_quality[i]) <= opt_fastq_truncqual) { rev_trunc = i; break; } } if (rev_trunc < opt_fastq_minlen) { ip->reason = minlen; skip = true; } } ip->rev_trunc = rev_trunc; /* count n's */ /* replace quality of N's by zero */ if (!skip) { int64_t fwd_ncount = 0; for (int64_t i = 0; i < fwd_trunc; i++) { if (ip->fwd_sequence[i] == 'N') { ip->fwd_quality[i] = opt_fastq_ascii; fwd_ncount++; } } if (fwd_ncount > opt_fastq_maxns) { ip->reason = maxns; skip = true; } } if (!skip) { int64_t rev_ncount = 0; for (int64_t i = 0; i 
< rev_trunc; i++) { if (ip->rev_sequence[i] == 'N') { ip->rev_quality[i] = opt_fastq_ascii; rev_ncount++; } } if (rev_ncount > opt_fastq_maxns) { ip->reason = maxns; skip = true; } } ip->offset = 0; if (!skip) { ip->offset = optimize(ip, kmerhash); } if (ip->offset > 0) { merge(ip); } ip->state = processed; } bool read_pair(merge_data_t * ip) { if (fastq_next(fastq_fwd, false, chrmap_upcase)) { if (! fastq_next(fastq_rev, false, chrmap_upcase)) { fatal("More forward reads than reverse reads"); } /* allocate more memory if necessary */ int64_t fwd_header_len = fastq_get_header_length(fastq_fwd); int64_t rev_header_len = fastq_get_header_length(fastq_rev); int64_t header_needed = MAX(fwd_header_len, rev_header_len) + 1; if (header_needed > ip->header_alloc) { ip->header_alloc = header_needed; ip->fwd_header = (char*) xrealloc(ip->fwd_header, header_needed); ip->rev_header = (char*) xrealloc(ip->rev_header, header_needed); } ip->fwd_length = fastq_get_sequence_length(fastq_fwd); ip->rev_length = fastq_get_sequence_length(fastq_rev); int64_t seq_needed = MAX(ip->fwd_length, ip->rev_length) + 1; sum_read_length += ip->fwd_length + ip->rev_length; if (seq_needed > ip->seq_alloc) { ip->seq_alloc = seq_needed; ip->fwd_sequence = (char*) xrealloc(ip->fwd_sequence, seq_needed); ip->rev_sequence = (char*) xrealloc(ip->rev_sequence, seq_needed); ip->fwd_quality = (char*) xrealloc(ip->fwd_quality, seq_needed); ip->rev_quality = (char*) xrealloc(ip->rev_quality, seq_needed); } int64_t merged_seq_needed = ip->fwd_length + ip->rev_length + 1; if (merged_seq_needed > ip->merged_seq_alloc) { ip->merged_seq_alloc = merged_seq_needed; ip->merged_sequence = (char*) xrealloc(ip->merged_sequence, merged_seq_needed); ip->merged_quality = (char*) xrealloc(ip->merged_quality, merged_seq_needed); } /* make local copies of the seq, header and qual */ strcpy(ip->fwd_header, fastq_get_header(fastq_fwd)); strcpy(ip->rev_header, fastq_get_header(fastq_rev)); strcpy(ip->fwd_sequence, 
fastq_get_sequence(fastq_fwd)); strcpy(ip->rev_sequence, fastq_get_sequence(fastq_rev)); strcpy(ip->fwd_quality, fastq_get_quality(fastq_fwd)); strcpy(ip->rev_quality, fastq_get_quality(fastq_rev)); ip->merged_sequence[0] = 0; ip->merged_quality[0] = 0; ip->merged = false; ip->pair_no = total++; return true; } else { return false; } } void keep_or_discard(merge_data_t * ip) { if (ip->merged) { keep(ip); } else { discard(ip); } } void init_merge_data(merge_data_t * ip) { ip->fwd_header = nullptr; ip->rev_header = nullptr; ip->fwd_sequence = nullptr; ip->rev_sequence = nullptr; ip->fwd_quality = nullptr; ip->rev_quality = nullptr; ip->header_alloc = 0; ip->seq_alloc = 0; ip->fwd_length = 0; ip->rev_length = 0; ip->fwd_trunc = 0; ip->rev_trunc = 0; ip->pair_no = 0; ip->reason = undefined; ip->merged_seq_alloc = 0; ip->merged_sequence = nullptr; ip->merged_quality = nullptr; ip->merged_length = 0; } void free_merge_data(merge_data_t * ip) { if (ip->fwd_header) { xfree(ip->fwd_header); } if (ip->rev_header) { xfree(ip->rev_header); } if (ip->fwd_sequence) { xfree(ip->fwd_sequence); } if (ip->rev_sequence) { xfree(ip->rev_sequence); } if (ip->fwd_quality) { xfree(ip->fwd_quality); } if (ip->rev_quality) { xfree(ip->rev_quality); } if (ip->merged_sequence) { xfree(ip->merged_sequence); } if (ip->merged_quality) { xfree(ip->merged_quality); } } inline void chunk_perform_read() { while((!finished_reading) && (chunks[chunk_read_next].state == empty)) { xpthread_mutex_unlock(&mutex_chunks); progress_update(fastq_get_position(fastq_fwd)); int r = 0; while ((r < chunk_size) && read_pair(chunks[chunk_read_next].merge_data + r)) { r++; } chunks[chunk_read_next].size = r; xpthread_mutex_lock(&mutex_chunks); pairs_read += r; if (r > 0) { chunks[chunk_read_next].state = filled; chunk_read_next = (chunk_read_next + 1) % chunk_count; } if (r < chunk_size) { finished_reading = true; if (pairs_written >= pairs_read) { finished_all = true; } } xpthread_cond_broadcast(&cond_chunks); } } 
inline void chunk_perform_write() { while (chunks[chunk_write_next].state == processed) { xpthread_mutex_unlock(&mutex_chunks); for(int i = 0; i < chunks[chunk_write_next].size; i++) { keep_or_discard(chunks[chunk_write_next].merge_data + i); } xpthread_mutex_lock(&mutex_chunks); pairs_written += chunks[chunk_write_next].size; chunks[chunk_write_next].state = empty; if (finished_reading && (pairs_written >= pairs_read)) { finished_all = true; } chunk_write_next = (chunk_write_next + 1) % chunk_count; xpthread_cond_broadcast(&cond_chunks); } } inline void chunk_perform_process(struct kh_handle_s * kmerhash) { int chunk_current = chunk_process_next; if (chunks[chunk_current].state == filled) { chunks[chunk_current].state = inprogress; chunk_process_next = (chunk_current + 1) % chunk_count; xpthread_cond_broadcast(&cond_chunks); xpthread_mutex_unlock(&mutex_chunks); for(int i=0; iis_empty) { pair_all(); } progress_done(); if (fastq_next(fastq_rev, true, chrmap_upcase)) { fatal("More reverse reads than forward reads"); } fprintf(stderr, "%10" PRIu64 " Pairs\n", total); fprintf(stderr, "%10" PRIu64 " Merged", merged); if (total > 0) { fprintf(stderr, " (%.1lf%%)", 100.0 * merged / total); } fprintf(stderr, "\n"); fprintf(stderr, "%10" PRIu64 " Not merged", notmerged); if (total > 0) { fprintf(stderr, " (%.1lf%%)", 100.0 * notmerged / total); } fprintf(stderr, "\n"); if (notmerged > 0) { fprintf(stderr, "\nPairs that failed merging due to various reasons:\n"); } if (failed_undefined) { fprintf(stderr, "%10" PRIu64 " undefined reason\n", failed_undefined); } if (failed_minlen) { fprintf(stderr, "%10" PRIu64 " reads too short (after truncation)\n", failed_minlen); } if (failed_maxlen) { fprintf(stderr, "%10" PRIu64 " reads too long (after truncation)\n", failed_maxlen); } if (failed_maxns) { fprintf(stderr, "%10" PRIu64 " too many N's\n", failed_maxns); } if (failed_nokmers) { fprintf(stderr, "%10" PRIu64 " too few kmers found on same diagonal\n", failed_nokmers); } if 
(failed_repeat) { fprintf(stderr, "%10" PRIu64 " multiple potential alignments\n", failed_repeat); } if (failed_maxdiffs) { fprintf(stderr, "%10" PRIu64 " too many differences\n", failed_maxdiffs); } if (failed_maxdiffpct) { fprintf(stderr, "%10" PRIu64 " too high percentage of differences\n", failed_maxdiffpct); } if (failed_minscore) { fprintf(stderr, "%10" PRIu64 " alignment score too low, or score drop too high\n", failed_minscore); } if (failed_minovlen) { fprintf(stderr, "%10" PRIu64 " overlap too short\n", failed_minovlen); } if (failed_maxee) { fprintf(stderr, "%10" PRIu64 " expected error too high\n", failed_maxee); } if (failed_minmergelen) { fprintf(stderr, "%10" PRIu64 " merged fragment too short\n", failed_minmergelen); } if (failed_maxmergelen) { fprintf(stderr, "%10" PRIu64 " merged fragment too long\n", failed_maxmergelen); } if (failed_staggered) { fprintf(stderr, "%10" PRIu64 " staggered read pairs\n", failed_staggered); } if (failed_indel) { fprintf(stderr, "%10" PRIu64 " indel errors\n", failed_indel); } fprintf(stderr, "\n"); if (total > 0) { fprintf(stderr, "Statistics of all reads:\n"); double mean_read_length = sum_read_length / (2.0 * pairs_read); fprintf(stderr, "%10.2f Mean read length\n", mean_read_length); } if (merged > 0) { fprintf(stderr, "\n"); fprintf(stderr, "Statistics of merged reads:\n"); double mean = sum_fragment_length / merged; fprintf(stderr, "%10.2f Mean fragment length\n", mean); double stdev = sqrt((sum_squared_fragment_length - 2.0 * mean * sum_fragment_length + mean * mean * merged) / (merged + 0.0)); fprintf(stderr, "%10.2f Standard deviation of fragment length\n", stdev); fprintf(stderr, "%10.2f Mean expected error in forward sequences\n", sum_ee_fwd / merged); fprintf(stderr, "%10.2f Mean expected error in reverse sequences\n", sum_ee_rev / merged); fprintf(stderr, "%10.2f Mean expected error in merged sequences\n", sum_ee_merged / merged); fprintf(stderr, "%10.2f Mean observed errors in merged region of forward 
sequences\n", 1.0 * sum_errors_fwd / merged); fprintf(stderr, "%10.2f Mean observed errors in merged region of reverse sequences\n", 1.0 * sum_errors_rev / merged); fprintf(stderr, "%10.2f Mean observed errors in merged region\n", 1.0 * (sum_errors_fwd + sum_errors_rev) / merged); } /* clean up */ if (opt_eetabbedout) { fclose(fp_eetabbedout); } if (opt_fastaout_notmerged_rev) { fclose(fp_fastaout_notmerged_rev); } if (opt_fastaout_notmerged_fwd) { fclose(fp_fastaout_notmerged_fwd); } if (opt_fastqout_notmerged_rev) { fclose(fp_fastqout_notmerged_rev); } if (opt_fastqout_notmerged_fwd) { fclose(fp_fastqout_notmerged_fwd); } if (opt_fastaout) { fclose(fp_fastaout); } if (opt_fastqout) { fclose(fp_fastqout); } fastq_close(fastq_rev); fastq_rev = nullptr; fastq_close(fastq_fwd); fastq_fwd = nullptr; } vsearch-2.21.1/src/otutable.h0000644000175000017500000000525614171574117015355 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . 
The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void otutable_init(); void otutable_done(); void otutable_add(char * query_header, char * target_header, int64_t abundance); void otutable_print_otutabout(FILE * fp); void otutable_print_mothur_shared_out(FILE * fp); void otutable_print_biomout(FILE * fp); vsearch-2.21.1/src/city.cc0000644000175000017500000004513214171574117014641 0ustar nileshnilesh// Copyright (c) 2011 Google, Inc. 
// // Permission is hereby granted, free of charge, to any person obtaining a copy // of this software and associated documentation files (the "Software"), to deal // in the Software without restriction, including without limitation the rights // to use, copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the Software is // furnished to do so, subject to the following conditions: // // The above copyright notice and this permission notice shall be included in // all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, // FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE // AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER // LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN // THE SOFTWARE. // // CityHash, by Geoff Pike and Jyrki Alakuijala // // This file provides CityHash64() and related functions. // // It's probably possible to create even faster hash functions by // writing a program that systematically explores some of the space of // possible hash functions, by using SIMD instructions, or by // compromising on hash quality. 
#include "config.h"
#include <city.h>

#include <algorithm>
#include <string.h>  // for memcpy and memset

using namespace std;

static uint64 UNALIGNED_LOAD64(const char *p) {
  uint64 result;
  memcpy(&result, p, sizeof(result));
  return result;
}

static uint32 UNALIGNED_LOAD32(const char *p) {
  uint32 result;
  memcpy(&result, p, sizeof(result));
  return result;
}

#ifdef _MSC_VER

#include <stdlib.h>
#define bswap_32(x) _byteswap_ulong(x)
#define bswap_64(x) _byteswap_uint64(x)

#elif defined(__APPLE__)

// Mac OS X / Darwin features
#include <libkern/OSByteOrder.h>
#define bswap_32(x) OSSwapInt32(x)
#define bswap_64(x) OSSwapInt64(x)

#elif defined(__FreeBSD__)

#include <sys/endian.h>
#define bswap_32(x) bswap32(x)
#define bswap_64(x) bswap64(x)

#elif defined(__NetBSD__)

#include <sys/types.h>
#include <machine/bswap.h>
#if defined(__BSWAP_RENAME) && !defined(__bswap_32)
#define bswap_32(x) bswap32(x)
#define bswap_64(x) bswap64(x)
#endif

#else

#include <byteswap.h>

#endif

#ifdef WORDS_BIGENDIAN
#define uint32_in_expected_order(x) (bswap_32(x))
#define uint64_in_expected_order(x) (bswap_64(x))
#else
#define uint32_in_expected_order(x) (x)
#define uint64_in_expected_order(x) (x)
#endif

#if !defined(LIKELY)
#if HAVE_BUILTIN_EXPECT
#define LIKELY(x) (__builtin_expect(!!(x), 1))
#else
#define LIKELY(x) (x)
#endif
#endif

static uint64 Fetch64(const char *p) {
  return uint64_in_expected_order(UNALIGNED_LOAD64(p));
}

static uint32 Fetch32(const char *p) {
  return uint32_in_expected_order(UNALIGNED_LOAD32(p));
}

// Some primes between 2^63 and 2^64 for various uses.
static const uint64 k0 = 0xc3a5c85c97cb3127ULL;
static const uint64 k1 = 0xb492b66fbe98f273ULL;
static const uint64 k2 = 0x9ae16a3b2f90404fULL;

// Magic numbers for 32-bit hashing.  Copied from Murmur3.
static const uint32_t c1 = 0xcc9e2d51;
static const uint32_t c2 = 0x1b873593;

// A 32-bit to 32-bit integer hash copied from Murmur3.
static uint32 fmix(uint32 h) { h ^= h >> 16; h *= 0x85ebca6b; h ^= h >> 13; h *= 0xc2b2ae35; h ^= h >> 16; return h; } static uint32 Rotate32(uint32 val, int shift) { // Avoid shifting by 32: doing so yields an undefined result. return shift == 0 ? val : ((val >> shift) | (val << (32 - shift))); } #undef PERMUTE3 #define PERMUTE3(a, b, c) do { std::swap(a, b); std::swap(a, c); } while (0) static uint32 Mur(uint32 a, uint32 h) { // Helper from Murmur3 for combining two 32-bit values. a *= c1; a = Rotate32(a, 17); a *= c2; h ^= a; h = Rotate32(h, 19); return h * 5 + 0xe6546b64; } static uint32 Hash32Len13to24(const char *s, size_t len) { uint32 a = Fetch32(s - 4 + (len >> 1)); uint32 b = Fetch32(s + 4); uint32 c = Fetch32(s + len - 8); uint32 d = Fetch32(s + (len >> 1)); uint32 e = Fetch32(s); uint32 f = Fetch32(s + len - 4); uint32 h = len; return fmix(Mur(f, Mur(e, Mur(d, Mur(c, Mur(b, Mur(a, h))))))); } static uint32 Hash32Len0to4(const char *s, size_t len) { uint32 b = 0; uint32 c = 9; for (int i = 0; i < len; i++) { signed char v = s[i]; b = b * c1 + v; c ^= b; } return fmix(Mur(b, Mur(len, c))); } static uint32 Hash32Len5to12(const char *s, size_t len) { uint32 a = len, b = len * 5, c = 9, d = b; a += Fetch32(s); b += Fetch32(s + len - 4); c += Fetch32(s + ((len >> 1) & 4)); return fmix(Mur(c, Mur(b, Mur(a, d)))); } uint32 CityHash32(const char *s, size_t len) { if (len <= 24) { return len <= 12 ? (len <= 4 ? 
Hash32Len0to4(s, len) : Hash32Len5to12(s, len)) : Hash32Len13to24(s, len); } // len > 24 uint32 h = len, g = c1 * len, f = g; uint32 a0 = Rotate32(Fetch32(s + len - 4) * c1, 17) * c2; uint32 a1 = Rotate32(Fetch32(s + len - 8) * c1, 17) * c2; uint32 a2 = Rotate32(Fetch32(s + len - 16) * c1, 17) * c2; uint32 a3 = Rotate32(Fetch32(s + len - 12) * c1, 17) * c2; uint32 a4 = Rotate32(Fetch32(s + len - 20) * c1, 17) * c2; h ^= a0; h = Rotate32(h, 19); h = h * 5 + 0xe6546b64; h ^= a2; h = Rotate32(h, 19); h = h * 5 + 0xe6546b64; g ^= a1; g = Rotate32(g, 19); g = g * 5 + 0xe6546b64; g ^= a3; g = Rotate32(g, 19); g = g * 5 + 0xe6546b64; f += a4; f = Rotate32(f, 19); f = f * 5 + 0xe6546b64; size_t iters = (len - 1) / 20; do { uint32 a0 = Rotate32(Fetch32(s) * c1, 17) * c2; uint32 a1 = Fetch32(s + 4); uint32 a2 = Rotate32(Fetch32(s + 8) * c1, 17) * c2; uint32 a3 = Rotate32(Fetch32(s + 12) * c1, 17) * c2; uint32 a4 = Fetch32(s + 16); h ^= a0; h = Rotate32(h, 18); h = h * 5 + 0xe6546b64; f += a1; f = Rotate32(f, 19); f = f * c1; g += a2; g = Rotate32(g, 18); g = g * 5 + 0xe6546b64; h ^= a3 + a1; h = Rotate32(h, 19); h = h * 5 + 0xe6546b64; g ^= a4; g = bswap_32(g) * 5; h += a4 * 5; h = bswap_32(h); f += a0; PERMUTE3(f, h, g); s += 20; } while (--iters != 0); g = Rotate32(g, 11) * c1; g = Rotate32(g, 17) * c1; f = Rotate32(f, 11) * c1; f = Rotate32(f, 17) * c1; h = Rotate32(h + g, 19); h = h * 5 + 0xe6546b64; h = Rotate32(h, 17) * c1; h = Rotate32(h + f, 19); h = h * 5 + 0xe6546b64; h = Rotate32(h, 17) * c1; return h; } // Bitwise right rotate. Normally this will compile to a single // instruction, especially if the shift is a manifest constant. static uint64 Rotate(uint64 val, int shift) { // Avoid shifting by 64: doing so yields an undefined result. return shift == 0 ? 
    val : ((val >> shift) | (val << (64 - shift)));
}

static uint64 ShiftMix(uint64 val) {
  return val ^ (val >> 47);
}

static uint64 HashLen16(uint64 u, uint64 v) {
  return Hash128to64(uint128(u, v));
}

static uint64 HashLen16(uint64 u, uint64 v, uint64 mul) {
  // Murmur-inspired hashing.
  uint64 a = (u ^ v) * mul;
  a ^= (a >> 47);
  uint64 b = (v ^ a) * mul;
  b ^= (b >> 47);
  b *= mul;
  return b;
}

static uint64 HashLen0to16(const char *s, size_t len) {
  if (len >= 8) {
    uint64 mul = k2 + len * 2;
    uint64 a = Fetch64(s) + k2;
    uint64 b = Fetch64(s + len - 8);
    uint64 c = Rotate(b, 37) * mul + a;
    uint64 d = (Rotate(a, 25) + b) * mul;
    return HashLen16(c, d, mul);
  }
  if (len >= 4) {
    uint64 mul = k2 + len * 2;
    uint64 a = Fetch32(s);
    return HashLen16(len + (a << 3), Fetch32(s + len - 4), mul);
  }
  if (len > 0) {
    uint8 a = s[0];
    uint8 b = s[len >> 1];
    uint8 c = s[len - 1];
    uint32 y = static_cast<uint32>(a) + (static_cast<uint32>(b) << 8);
    uint32 z = len + (static_cast<uint32>(c) << 2);
    return ShiftMix(y * k2 ^ z * k0) * k2;
  }
  return k2;
}

// This probably works well for 16-byte strings as well, but it may be overkill
// in that case.
static uint64 HashLen17to32(const char *s, size_t len) {
  uint64 mul = k2 + len * 2;
  uint64 a = Fetch64(s) * k1;
  uint64 b = Fetch64(s + 8);
  uint64 c = Fetch64(s + len - 8) * mul;
  uint64 d = Fetch64(s + len - 16) * k2;
  return HashLen16(Rotate(a + b, 43) + Rotate(c, 30) + d,
                   a + Rotate(b + k2, 18) + c, mul);
}

// Return a 16-byte hash for 48 bytes.  Quick and dirty.
// Callers do best to use "random-looking" values for a and b.
static pair<uint64, uint64> WeakHashLen32WithSeeds(
    uint64 w, uint64 x, uint64 y, uint64 z, uint64 a, uint64 b) {
  a += w;
  b = Rotate(b + a + z, 21);
  uint64 c = a;
  a += x;
  a += y;
  b += Rotate(a, 44);
  return make_pair(a + z, b + c);
}

// Return a 16-byte hash for s[0] ... s[31], a, and b.  Quick and dirty.
static pair<uint64, uint64> WeakHashLen32WithSeeds(
    const char* s, uint64 a, uint64 b) {
  return WeakHashLen32WithSeeds(Fetch64(s),
                                Fetch64(s + 8),
                                Fetch64(s + 16),
                                Fetch64(s + 24),
                                a,
                                b);
}

// Return an 8-byte hash for 33 to 64 bytes.
static uint64 HashLen33to64(const char *s, size_t len) {
  uint64 mul = k2 + len * 2;
  uint64 a = Fetch64(s) * k2;
  uint64 b = Fetch64(s + 8);
  uint64 c = Fetch64(s + len - 24);
  uint64 d = Fetch64(s + len - 32);
  uint64 e = Fetch64(s + 16) * k2;
  uint64 f = Fetch64(s + 24) * 9;
  uint64 g = Fetch64(s + len - 8);
  uint64 h = Fetch64(s + len - 16) * mul;
  uint64 u = Rotate(a + g, 43) + (Rotate(b, 30) + c) * 9;
  uint64 v = ((a + g) ^ d) + f + 1;
  uint64 w = bswap_64((u + v) * mul) + h;
  uint64 x = Rotate(e + f, 42) + c;
  uint64 y = (bswap_64((v + w) * mul) + g) * mul;
  uint64 z = e + f + c;
  a = bswap_64((x + z) * mul + y) + b;
  b = ShiftMix((z + a) * mul + d + h) * mul;
  return b + x;
}

uint64 CityHash64(const char *s, size_t len) {
  if (len <= 32) {
    if (len <= 16) {
      return HashLen0to16(s, len);
    } else {
      return HashLen17to32(s, len);
    }
  } else if (len <= 64) {
    return HashLen33to64(s, len);
  }

  // For strings over 64 bytes we hash the end first, and then as we
  // loop we keep 56 bytes of state: v, w, x, y, and z.
  uint64 x = Fetch64(s + len - 40);
  uint64 y = Fetch64(s + len - 16) + Fetch64(s + len - 56);
  uint64 z = HashLen16(Fetch64(s + len - 48) + len, Fetch64(s + len - 24));
  pair<uint64, uint64> v = WeakHashLen32WithSeeds(s + len - 64, len, z);
  pair<uint64, uint64> w = WeakHashLen32WithSeeds(s + len - 32, y + k1, x);
  x = x * k1 + Fetch64(s);

  // Decrease len to the nearest multiple of 64, and operate on 64-byte chunks.
  len = (len - 1) & ~static_cast<size_t>(63);
  do {
    x = Rotate(x + y + v.first + Fetch64(s + 8), 37) * k1;
    y = Rotate(y + v.second + Fetch64(s + 48), 42) * k1;
    x ^= w.second;
    y += v.first + Fetch64(s + 40);
    z = Rotate(z + w.first, 33) * k1;
    v = WeakHashLen32WithSeeds(s, v.second * k1, x + w.first);
    w = WeakHashLen32WithSeeds(s + 32, z + w.second, y + Fetch64(s + 16));
    std::swap(z, x);
    s += 64;
    len -= 64;
  } while (len != 0);
  return HashLen16(HashLen16(v.first, w.first) + ShiftMix(y) * k1 + z,
                   HashLen16(v.second, w.second) + x);
}

uint64 CityHash64WithSeed(const char *s, size_t len, uint64 seed) {
  return CityHash64WithSeeds(s, len, k2, seed);
}

uint64 CityHash64WithSeeds(const char *s, size_t len,
                           uint64 seed0, uint64 seed1) {
  return HashLen16(CityHash64(s, len) - seed0, seed1);
}

// A subroutine for CityHash128().  Returns a decent 128-bit hash for strings
// of any length representable in signed long.  Based on City and Murmur.
static uint128 CityMurmur(const char *s, size_t len, uint128 seed) {
  uint64 a = Uint128Low64(seed);
  uint64 b = Uint128High64(seed);
  uint64 c = 0;
  uint64 d = 0;
  signed long l = len - 16;
  if (l <= 0) {  // len <= 16
    a = ShiftMix(a * k1) * k1;
    c = b * k1 + HashLen0to16(s, len);
    d = ShiftMix(a + (len >= 8 ? Fetch64(s) : c));
  } else {  // len > 16
    c = HashLen16(Fetch64(s + len - 8) + k1, a);
    d = HashLen16(b + len, c + Fetch64(s + len - 16));
    a += d;
    do {
      a ^= ShiftMix(Fetch64(s) * k1) * k1;
      a *= k1;
      b ^= a;
      c ^= ShiftMix(Fetch64(s + 8) * k1) * k1;
      c *= k1;
      d ^= c;
      s += 16;
      l -= 16;
    } while (l > 0);
  }
  a = HashLen16(a, c);
  b = HashLen16(d, b);
  return uint128(a ^ b, HashLen16(b, a));
}

uint128 CityHash128WithSeed(const char *s, size_t len, uint128 seed) {
  if (len < 128) {
    return CityMurmur(s, len, seed);
  }

  // We expect len >= 128 to be the common case.  Keep 56 bytes of state:
  // v, w, x, y, and z.
  pair<uint64, uint64> v, w;
  uint64 x = Uint128Low64(seed);
  uint64 y = Uint128High64(seed);
  uint64 z = len * k1;
  v.first = Rotate(y ^ k1, 49) * k1 + Fetch64(s);
  v.second = Rotate(v.first, 42) * k1 + Fetch64(s + 8);
  w.first = Rotate(y + z, 35) * k1 + x;
  w.second = Rotate(x + Fetch64(s + 88), 53) * k1;

  // This is the same inner loop as CityHash64(), manually unrolled.
  do {
    x = Rotate(x + y + v.first + Fetch64(s + 8), 37) * k1;
    y = Rotate(y + v.second + Fetch64(s + 48), 42) * k1;
    x ^= w.second;
    y += v.first + Fetch64(s + 40);
    z = Rotate(z + w.first, 33) * k1;
    v = WeakHashLen32WithSeeds(s, v.second * k1, x + w.first);
    w = WeakHashLen32WithSeeds(s + 32, z + w.second, y + Fetch64(s + 16));
    std::swap(z, x);
    s += 64;
    x = Rotate(x + y + v.first + Fetch64(s + 8), 37) * k1;
    y = Rotate(y + v.second + Fetch64(s + 48), 42) * k1;
    x ^= w.second;
    y += v.first + Fetch64(s + 40);
    z = Rotate(z + w.first, 33) * k1;
    v = WeakHashLen32WithSeeds(s, v.second * k1, x + w.first);
    w = WeakHashLen32WithSeeds(s + 32, z + w.second, y + Fetch64(s + 16));
    std::swap(z, x);
    s += 64;
    len -= 128;
  } while (LIKELY(len >= 128));
  x += Rotate(v.first + z, 49) * k0;
  y = y * k0 + Rotate(w.second, 37);
  z = z * k0 + Rotate(w.first, 27);
  w.first *= 9;
  v.first *= k0;
  // If 0 < len < 128, hash up to 4 chunks of 32 bytes each from the end of s.
  for (size_t tail_done = 0; tail_done < len; ) {
    tail_done += 32;
    y = Rotate(x + y, 42) * k0 + v.second;
    w.first += Fetch64(s + len - tail_done + 16);
    x = x * k0 + w.first;
    z += w.second + Fetch64(s + len - tail_done);
    w.second += v.first;
    v = WeakHashLen32WithSeeds(s + len - tail_done, v.first + z, v.second);
    v.first *= k0;
  }
  // At this point our 56 bytes of state should contain more than
  // enough information for a strong 128-bit hash.  We use two
  // different 56-byte-to-8-byte hashes to get a 16-byte final result.
  x = HashLen16(x, v.first);
  y = HashLen16(y + z, w.first);
  return uint128(HashLen16(x + v.second, w.second) + y,
                 HashLen16(x + w.second, y + v.second));
}

uint128 CityHash128(const char *s, size_t len) {
  return len >= 16 ?
      CityHash128WithSeed(s + 16, len - 16,
                          uint128(Fetch64(s), Fetch64(s + 8) + k0)) :
      CityHash128WithSeed(s, len, uint128(k0, k1));
}

#ifdef __SSE4_2__
#include <citycrc.h>
#include <nmmintrin.h>

// Requires len >= 240.
static void CityHashCrc256Long(const char *s, size_t len,
                               uint32 seed, uint64 *result) {
  uint64 a = Fetch64(s + 56) + k0;
  uint64 b = Fetch64(s + 96) + k0;
  uint64 c = result[0] = HashLen16(b, len);
  uint64 d = result[1] = Fetch64(s + 120) * k0 + len;
  uint64 e = Fetch64(s + 184) + seed;
  uint64 f = 0;
  uint64 g = 0;
  uint64 h = c + d;
  uint64 x = seed;
  uint64 y = 0;
  uint64 z = 0;

  // 240 bytes of input per iter.
  size_t iters = len / 240;
  len -= iters * 240;
  do {
#undef CHUNK
#define CHUNK(r)                                \
    PERMUTE3(x, z, y);                          \
    b += Fetch64(s);                            \
    c += Fetch64(s + 8);                        \
    d += Fetch64(s + 16);                       \
    e += Fetch64(s + 24);                       \
    f += Fetch64(s + 32);                       \
    a += b;                                     \
    h += f;                                     \
    b += c;                                     \
    f += d;                                     \
    g += e;                                     \
    e += z;                                     \
    g += x;                                     \
    z = _mm_crc32_u64(z, b + g);                \
    y = _mm_crc32_u64(y, e + h);                \
    x = _mm_crc32_u64(x, f + a);                \
    e = Rotate(e, r);                           \
    c += e;                                     \
    s += 40

    CHUNK(0); PERMUTE3(a, h, c);
    CHUNK(33); PERMUTE3(a, h, f);
    CHUNK(0); PERMUTE3(b, h, f);
    CHUNK(42); PERMUTE3(b, h, d);
    CHUNK(0); PERMUTE3(b, h, e);
    CHUNK(33); PERMUTE3(a, h, e);
  } while (--iters > 0);

  while (len >= 40) {
    CHUNK(29);
    e ^= Rotate(a, 20);
    h += Rotate(b, 30);
    g ^= Rotate(c, 40);
    f += Rotate(d, 34);
    PERMUTE3(c, h, g);
    len -= 40;
  }
  if (len > 0) {
    s = s + len - 40;
    CHUNK(33);
    e ^= Rotate(a, 43);
    h += Rotate(b, 42);
    g ^= Rotate(c, 41);
    f += Rotate(d, 40);
  }
  result[0] ^= h;
  result[1] ^= g;
  g += h;
  a = HashLen16(a, g + z);
  x += y << 32;
  b += x;
  c = HashLen16(c, z) + h;
  d = HashLen16(d, e + result[0]);
  g += e;
  h += HashLen16(x, f);
  e = HashLen16(a, d) + g;
  z = HashLen16(b, c) + a;
  y = HashLen16(g, h) + c;
  result[0] = e + z + y + x;
  a =
    ShiftMix((a + y) * k0) * k0 + b;
  result[1] += a + result[0];
  a = ShiftMix(a * k0) * k0 + c;
  result[2] = a + result[1];
  a = ShiftMix((a + e) * k0) * k0;
  result[3] = a + result[2];
}

// Requires len < 240.
static void CityHashCrc256Short(const char *s, size_t len, uint64 *result) {
  char buf[240];
  memcpy(buf, s, len);
  memset(buf + len, 0, 240 - len);
  CityHashCrc256Long(buf, 240, ~static_cast<uint32>(len), result);
}

void CityHashCrc256(const char *s, size_t len, uint64 *result) {
  if (LIKELY(len >= 240)) {
    CityHashCrc256Long(s, len, 0, result);
  } else {
    CityHashCrc256Short(s, len, result);
  }
}

uint128 CityHashCrc128WithSeed(const char *s, size_t len, uint128 seed) {
  if (len <= 900) {
    return CityHash128WithSeed(s, len, seed);
  } else {
    uint64 result[4];
    CityHashCrc256(s, len, result);
    uint64 u = Uint128High64(seed) + result[0];
    uint64 v = Uint128Low64(seed) + result[1];
    return uint128(HashLen16(u, v + result[2]),
                   HashLen16(Rotate(v, 32), u * k0 + result[3]));
  }
}

uint128 CityHashCrc128(const char *s, size_t len) {
  if (len <= 900) {
    return CityHash128(s, len);
  } else {
    uint64 result[4];
    CityHashCrc256(s, len, result);
    return uint128(result[2], result[3]);
  }
}

#endif
vsearch-2.21.1/src/filter.cc0000644000175000017500000004663714171574117015171 0ustar nileshnilesh/*

  VSEARCH: a versatile open source tool for metagenomics

  Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
  All rights reserved.

  Contact: Torbjorn Rognes ,
  Department of Informatics, University of Oslo,
  PO Box 1080 Blindern, NO-0316 Oslo, Norway

  This software is dual-licensed and available under a choice
  of one of two licenses, either under the terms of the GNU
  General Public License version 3 or the BSD 2-Clause License.


  GNU General Public License version 3

  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "vsearch.h" inline int fastq_get_qual(char q) { int qual = q - opt_fastq_ascii; if (qual < opt_fastq_qmin) { fprintf(stderr, "\n\nFatal error: FASTQ quality value (%d) below qmin (%" PRId64 ")\n", qual, opt_fastq_qmin); if (fp_log) { fprintf(fp_log, "\n\nFatal error: FASTQ quality value (%d) below qmin (%" PRId64 ")\n", qual, opt_fastq_qmin); } exit(EXIT_FAILURE); } else if (qual > opt_fastq_qmax) { fprintf(stderr, "\n\nFatal error: FASTQ quality value (%d) above qmax (%" PRId64 ")\n", qual, opt_fastq_qmax); fprintf(stderr, "By default, quality values range from 0 to 41.\n" "To allow higher quality values, " "please use the option --fastq_qmax %d\n", qual); if (fp_log) { fprintf(fp_log, "\n\nFatal error: FASTQ quality value (%d) above qmax (%" PRId64 ")\n", qual, opt_fastq_qmax); fprintf(fp_log, "By default, quality values range from 0 to 41.\n" "To allow higher quality values, " "please use the option --fastq_qmax %d\n", qual); } exit(EXIT_FAILURE); } return qual; } struct analysis_res { bool discarded; bool truncated; int start; int length; double ee; }; struct analysis_res analyse(fastx_handle h) { struct analysis_res res = { false, false, 0, 0, -1.0 }; res.length = fastx_get_sequence_length(h); int64_t old_length = res.length; /* strip left (5') end */ if (opt_fastq_stripleft < res.length) { res.start += opt_fastq_stripleft; res.length -= opt_fastq_stripleft; } else { res.start = res.length; res.length = 0; } /* strip right (3') end */ if (opt_fastq_stripright < res.length) { res.length -= opt_fastq_stripright; } else { res.length = 0; } /* truncate trailing (3') part */ if (opt_fastq_trunclen >= 0) { if (res.length > opt_fastq_trunclen) { res.length = opt_fastq_trunclen; } } /* truncate trailing (3') part, but keep if short */ if (opt_fastq_trunclen_keep >= 0) { if (res.length > opt_fastq_trunclen_keep) { res.length = opt_fastq_trunclen_keep; } } if (h->is_fastq) { /* truncate by quality and expected errors (ee) */ res.ee = 0.0; char * q =
fastx_get_quality(h) + res.start; for (int64_t i = 0; i < res.length; i++) { int qual = fastq_get_qual(q[i]); double e = exp10(-0.1 * qual); res.ee += e; if ((qual <= opt_fastq_truncqual) || (res.ee > opt_fastq_truncee)) { res.ee -= e; res.length = i; break; } } /* filter by expected errors (ee) */ if (res.ee > opt_fastq_maxee) { res.discarded = true; } if ((res.length > 0) && (res.ee / res.length > opt_fastq_maxee_rate)) { res.discarded = true; } } /* filter by length */ if ((opt_fastq_trunclen >= 0) && (res.length < opt_fastq_trunclen)) { res.discarded = true; } if (res.length < opt_fastq_minlen) { res.discarded = true; } if (res.length > opt_fastq_maxlen) { res.discarded = true; } /* filter by n's */ int64_t ncount = 0; char * p = fastx_get_sequence(h) + res.start; for (int64_t i = 0; i < res.length; i++) { int pc = p[i]; if ((pc == 'N') || (pc == 'n')) { ncount++; } } if (ncount > opt_fastq_maxns) { res.discarded = true; } /* filter by abundance */ int64_t abundance = fastx_get_abundance(h); if (abundance < opt_minsize) { res.discarded = true; } if (abundance > opt_maxsize) { res.discarded = true; } res.truncated = res.length < old_length; return res; } void filter(bool fastq_only, char * filename) { if ((!opt_fastqout) && (!opt_fastaout) && (!opt_fastqout_discarded) && (!opt_fastaout_discarded) && (!opt_fastqout_rev) && (!opt_fastaout_rev) && (!opt_fastqout_discarded_rev) && (!opt_fastaout_discarded_rev)) { fatal("No output files specified"); } fastx_handle h1 = nullptr; fastx_handle h2 = nullptr; h1 = fastx_open(filename); if (!h1) { fatal("Unrecognized file type (not proper FASTA or FASTQ format)"); } if (! 
(h1->is_fastq || h1->is_empty)) { if (fastq_only) { fatal("FASTA input files not allowed with fastq_filter, consider using fastx_filter command instead"); } else if (opt_eeout || (opt_fastq_ascii != 33) || opt_fastq_eeout || (opt_fastq_maxee < DBL_MAX) || (opt_fastq_maxee_rate < DBL_MAX) || opt_fastqout || (opt_fastq_qmax < 41) || (opt_fastq_qmin > 0) || (opt_fastq_truncee < DBL_MAX) || (opt_fastq_truncqual < LONG_MIN) || opt_fastqout_discarded || opt_fastqout_discarded_rev || opt_fastqout_rev) { fatal("The following options are not accepted with the fastx_filter command when the input is a FASTA file, because quality scores are not available: eeout, fastq_ascii, fastq_eeout, fastq_maxee, fastq_maxee_rate, fastqout, fastq_qmax, fastq_qmin, fastq_truncee, fastq_truncqual, fastqout_discarded, fastqout_discarded_rev, fastqout_rev"); } } uint64_t filesize = fastx_get_size(h1); if (opt_reverse) { h2 = fastx_open(opt_reverse); if (!h2) { fatal("Unrecognized file type (not proper FASTA or FASTQ format) for reverse reads"); } if (h1->is_fastq != h2->is_fastq) { fatal("The forward and reverse input sequences must be in the same format, either FASTA or FASTQ"); } if (!
(h2->is_fastq || h2->is_empty)) { if (fastq_only) { fatal("FASTA input files not allowed with fastq_filter, consider using fastx_filter command instead"); } else if (opt_eeout || (opt_fastq_ascii != 33) || opt_fastq_eeout || (opt_fastq_maxee < DBL_MAX) || (opt_fastq_maxee_rate < DBL_MAX) || opt_fastqout || (opt_fastq_qmax < 41) || (opt_fastq_qmin > 0) || (opt_fastq_truncee < DBL_MAX) || (opt_fastq_truncqual < LONG_MIN) || opt_fastqout_discarded || opt_fastqout_discarded_rev || opt_fastqout_rev) { fatal("The following options are not accepted with the fastx_filter command when the input is a FASTA file, because quality scores are not available: eeout, fastq_ascii, fastq_eeout, fastq_maxee, fastq_maxee_rate, fastqout, fastq_qmax, fastq_qmin, fastq_truncee, fastq_truncqual, fastqout_discarded, fastqout_discarded_rev, fastqout_rev"); } } } FILE * fp_fastaout = nullptr; FILE * fp_fastqout = nullptr; FILE * fp_fastaout_discarded = nullptr; FILE * fp_fastqout_discarded = nullptr; FILE * fp_fastaout_rev = nullptr; FILE * fp_fastqout_rev = nullptr; FILE * fp_fastaout_discarded_rev = nullptr; FILE * fp_fastqout_discarded_rev = nullptr; if (opt_fastaout) { fp_fastaout = fopen_output(opt_fastaout); if (!fp_fastaout) { fatal("Unable to open FASTA output file for writing"); } } if (opt_fastqout) { fp_fastqout = fopen_output(opt_fastqout); if (!fp_fastqout) { fatal("Unable to open FASTQ output file for writing"); } } if (opt_fastaout_discarded) { fp_fastaout_discarded = fopen_output(opt_fastaout_discarded); if (!fp_fastaout_discarded) { fatal("Unable to open FASTA output file for writing"); } } if (opt_fastqout_discarded) { fp_fastqout_discarded = fopen_output(opt_fastqout_discarded); if (!fp_fastqout_discarded) { fatal("Unable to open FASTQ output file for writing"); } } if (h2) { if (opt_fastaout_rev) { fp_fastaout_rev = fopen_output(opt_fastaout_rev); if (!fp_fastaout_rev) { fatal("Unable to open FASTA output file for writing"); } } if (opt_fastqout_rev) { fp_fastqout_rev =
fopen_output(opt_fastqout_rev); if (!fp_fastqout_rev) { fatal("Unable to open FASTQ output file for writing"); } } if (opt_fastaout_discarded_rev) { fp_fastaout_discarded_rev = fopen_output(opt_fastaout_discarded_rev); if (!fp_fastaout_discarded_rev) { fatal("Unable to open FASTA output file for writing"); } } if (opt_fastqout_discarded_rev) { fp_fastqout_discarded_rev = fopen_output(opt_fastqout_discarded_rev); if (!fp_fastqout_discarded_rev) { fatal("Unable to open FASTQ output file for writing"); } } } progress_init("Reading input file", filesize); int64_t kept = 0; int64_t discarded = 0; int64_t truncated = 0; while(fastx_next(h1, false, chrmap_no_change)) { if (h2 && ! fastx_next(h2, false, chrmap_no_change)) { fatal("More forward reads than reverse reads"); } struct analysis_res res1 = { false, false, 0, 0, 0.0 } ; struct analysis_res res2 = { false, false, 0, 0, -1.0 } ; res1 = analyse(h1); if (h2) { res2 = analyse(h2); } if (res1.discarded || res2.discarded) { /* discard the sequence(s) */ discarded++; if (opt_fastaout_discarded) { fasta_print_general(fp_fastaout_discarded, nullptr, fastx_get_sequence(h1) + res1.start, res1.length, fastx_get_header(h1), fastx_get_header_length(h1), fastx_get_abundance(h1), discarded, res1.ee, -1, -1, nullptr, 0.0); } if (opt_fastqout_discarded) { fastq_print_general(fp_fastqout_discarded, fastx_get_sequence(h1) + res1.start, res1.length, fastx_get_header(h1), fastx_get_header_length(h1), fastx_get_quality(h1) + res1.start, fastx_get_abundance(h1), discarded, res1.ee); } if (h2) { if (opt_fastaout_discarded_rev) { fasta_print_general(fp_fastaout_discarded_rev, nullptr, fastx_get_sequence(h2) + res2.start, res2.length, fastx_get_header(h2), fastx_get_header_length(h2), fastx_get_abundance(h2), discarded, res2.ee, -1, -1, nullptr, 0.0); } if (opt_fastqout_discarded_rev) { fastq_print_general(fp_fastqout_discarded_rev, fastx_get_sequence(h2) + res2.start, res2.length, fastx_get_header(h2), fastx_get_header_length(h2), 
fastx_get_quality(h2) + res2.start, fastx_get_abundance(h2), discarded, res2.ee); } } } else { /* keep the sequence(s) */ kept++; if (res1.truncated || res2.truncated) { truncated++; } if (opt_fastaout) { fasta_print_general(fp_fastaout, nullptr, fastx_get_sequence(h1) + res1.start, res1.length, fastx_get_header(h1), fastx_get_header_length(h1), fastx_get_abundance(h1), kept, res1.ee, -1, -1, nullptr, 0.0); } if (opt_fastqout) { fastq_print_general(fp_fastqout, fastx_get_sequence(h1) + res1.start, res1.length, fastx_get_header(h1), fastx_get_header_length(h1), fastx_get_quality(h1) + res1.start, fastx_get_abundance(h1), kept, res1.ee); } if (h2) { if (opt_fastaout_rev) { fasta_print_general(fp_fastaout_rev, nullptr, fastx_get_sequence(h2) + res2.start, res2.length, fastx_get_header(h2), fastx_get_header_length(h2), fastx_get_abundance(h2), kept, res2.ee, -1, -1, nullptr, 0.0); } if (opt_fastqout_rev) { fastq_print_general(fp_fastqout_rev, fastx_get_sequence(h2) + res2.start, res2.length, fastx_get_header(h2), fastx_get_header_length(h2), fastx_get_quality(h2) + res2.start, fastx_get_abundance(h2), kept, res2.ee); } } } progress_update(fastx_get_position(h1)); } progress_done(); if (h2 && fastx_next(h2, false, chrmap_no_change)) { fatal("More reverse reads than forward reads"); } if (! 
opt_quiet) { fprintf(stderr, "%" PRId64 " sequences kept (of which %" PRId64 " truncated), %" PRId64 " sequences discarded.\n", kept, truncated, discarded); } if (opt_log) { fprintf(fp_log, "%" PRId64 " sequences kept (of which %" PRId64 " truncated), %" PRId64 " sequences discarded.\n", kept, truncated, discarded); } if (h2) { if (opt_fastaout_rev) { fclose(fp_fastaout_rev); } if (opt_fastqout_rev) { fclose(fp_fastqout_rev); } if (opt_fastaout_discarded_rev) { fclose(fp_fastaout_discarded_rev); } if (opt_fastqout_discarded_rev) { fclose(fp_fastqout_discarded_rev); } fastx_close(h2); } if (opt_fastaout) { fclose(fp_fastaout); } if (opt_fastqout) { fclose(fp_fastqout); } if (opt_fastaout_discarded) { fclose(fp_fastaout_discarded); } if (opt_fastqout_discarded) { fclose(fp_fastqout_discarded); } fastx_close(h1); } void fastq_filter() { filter(true, opt_fastq_filter); } void fastx_filter() { filter(false, opt_fastx_filter); } vsearch-2.21.1/src/sha1.h0000644000175000017500000000104414171574117014361 0ustar nileshnilesh/* public api for steve reid's public domain SHA-1 implementation */ /* this file is in the public domain */ #ifndef __SHA1_H #define __SHA1_H #ifdef __cplusplus extern "C" { #endif typedef struct { uint32_t state[5]; uint32_t count[2]; uint8_t buffer[64]; } SHA1_CTX; #define SHA1_DIGEST_SIZE 20 void SHA1_Init(SHA1_CTX* context); void SHA1_Update(SHA1_CTX* context, const uint8_t* data, const size_t len); void SHA1_Final(SHA1_CTX* context, uint8_t digest[SHA1_DIGEST_SIZE]); #ifdef __cplusplus } #endif #endif /* __SHA1_H */ vsearch-2.21.1/src/sortbylength.cc0000644000175000017500000001224714171574117016416 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. 
Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" static struct sortinfo_length_s { unsigned int length; unsigned int size; unsigned int seqno; } * sortinfo; int sortbylength_compare(const void * a, const void * b) { auto * x = (struct sortinfo_length_s *) a; auto * y = (struct sortinfo_length_s *) b; /* longest first, then most abundant, then by label, otherwise keep order */ if (x->length < y->length) { return +1; } else if (x->length > y->length) { return -1; } else if (x->size < y->size) { return +1; } else if (x->size > y->size) { return -1; } else { int r = strcmp(db_getheader(x->seqno), db_getheader(y->seqno)); if (r != 0) { return r; } else { if (x->seqno < y->seqno) { return -1; } else if (x->seqno > y->seqno) { return +1; } else { return 0; } } } } void sortbylength() { if (!opt_output) fatal("FASTA output file for sortbylength must be specified with --output"); FILE * fp_output = fopen_output(opt_output); if (!fp_output) { fatal("Unable to open sortbylength output file for writing"); } db_read(opt_sortbylength, 0); show_rusage(); int dbsequencecount = db_getsequencecount(); sortinfo = (struct sortinfo_length_s *) xmalloc(dbsequencecount * sizeof(sortinfo_length_s)); int passed = 0; progress_init("Getting lengths", dbsequencecount); for(int i=0; i 0) { if (passed % 2) { median = sortinfo[(passed-1)/2].length; } else { median = (sortinfo[(passed/2)-1].length + sortinfo[passed/2].length) / 2.0; } } if (!opt_quiet) { fprintf(stderr, "Median length: %.0f\n", median); } if 
(opt_log) { fprintf(fp_log, "Median length: %.0f\n", median); } show_rusage(); passed = MIN(passed, opt_topn); progress_init("Writing output", passed); for(int i=0; i, Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ struct seqinfo_s { size_t header_p; size_t seq_p; size_t qual_p; unsigned int headerlen; unsigned int seqlen; unsigned int size; }; typedef struct seqinfo_s seqinfo_t; extern char * datap; extern seqinfo_t * seqindex; inline char * db_getheader(uint64_t seqno) { return datap + seqindex[seqno].header_p; } inline char * db_getsequence(uint64_t seqno) { return datap + seqindex[seqno].seq_p; } inline uint64_t db_getabundance(uint64_t seqno) { return seqindex[seqno].size; } inline uint64_t db_getsequencelen(uint64_t seqno) { return seqindex[seqno].seqlen; } inline uint64_t db_getheaderlen(uint64_t seqno) { return seqindex[seqno].headerlen; } void db_read(const char * filename, int upcase); void db_free(); uint64_t db_getsequencecount(); uint64_t db_getnucleotidecount(); uint64_t db_getlongestheader(); uint64_t db_getlongestsequence(); uint64_t db_getshortestsequence(); /* Note: the sorting functions below must be called after db_read, but before dbindex_prepare */ void db_sortbylength(); void db_sortbylength_shortest_first(); void db_sortbyabundance(); bool db_is_fastq(); char * db_getquality(uint64_t seqno); void db_setinfo(bool new_is_fastq, uint64_t new_sequences, uint64_t new_nucleotides, uint64_t new_longest, uint64_t new_shortest, uint64_t new_longestheader); vsearch-2.21.1/src/cluster.cc0000644000175000017500000014273514171574117015361 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, 
Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" static int tophits; /* the maximum number of hits to keep */ static int seqcount; /* number of database sequences */ typedef struct clusterinfo_s { int seqno; int clusterno; char * cigar; int strand; } clusterinfo_t; static clusterinfo_t * clusterinfo = nullptr; static int clusters = 0; static int count_matched = 0; static int count_notmatched = 0; static int64_t * cluster_abundance; static FILE * fp_centroids = nullptr; static FILE * fp_uc = nullptr; static FILE * fp_alnout = nullptr; static FILE * fp_samout = nullptr; static FILE * fp_userout = nullptr; static FILE * fp_blast6out = nullptr; static FILE * fp_fastapairs = nullptr; static FILE * fp_matched = nullptr; static FILE * fp_notmatched = nullptr; static FILE * fp_otutabout = nullptr; static FILE * fp_mothur_shared_out = nullptr; static FILE * fp_biomout = nullptr; static FILE * fp_qsegout = nullptr; static FILE * fp_tsegout = nullptr; static pthread_attr_t attr; static struct searchinfo_s * si_plus; static struct searchinfo_s * si_minus; typedef struct thread_info_s { pthread_t thread; pthread_mutex_t mutex; pthread_cond_t cond; int work; int query_first; int query_count; } thread_info_t; static thread_info_t * ti; inline int compare_byclusterno(const void * a, const void * b) { auto * x = (clusterinfo_t *) a; auto * y = (clusterinfo_t *) b; if (x->clusterno < y->clusterno) { return -1; } else if (x->clusterno > y->clusterno) { return +1; } else if (x->seqno < y->seqno) { return 
-1; } else if (x->seqno > y->seqno) { return +1; } else { return 0; } } inline int compare_byclusterabundance(const void * a, const void * b) { auto * x = (clusterinfo_t *) a; auto * y = (clusterinfo_t *) b; if (cluster_abundance[x->clusterno] > cluster_abundance[y->clusterno]) { return -1; } else if (cluster_abundance[x->clusterno] < cluster_abundance[y->clusterno]) { return +1; } else if (x->clusterno < y->clusterno) { return -1; } else if (x->clusterno > y->clusterno) { return +1; } else if (x->seqno < y->seqno) { return -1; } else if (x->seqno > y->seqno) { return +1; } else { return 0; } } inline void cluster_query_core(struct searchinfo_s * si) { /* the main core function for clustering */ /* get sequence etc */ int seqno = si->query_no; si->query_head_len = db_getheaderlen(seqno); si->query_head = db_getheader(seqno); si->qsize = db_getabundance(seqno); si->qseqlen = db_getsequencelen(seqno); if (si->strand) { reverse_complement(si->qsequence, db_getsequence(seqno), si->qseqlen); } else { strcpy(si->qsequence, db_getsequence(seqno)); } /* perform search */ search_onequery(si, opt_qmask); } inline void cluster_worker(int64_t t) { /* wrapper for the main threaded core function for clustering */ for (int q = 0; q < ti[t].query_count; q++) { cluster_query_core(si_plus + ti[t].query_first + q); if (opt_strand>1) { cluster_query_core(si_minus + ti[t].query_first + q); } } } void * threads_worker(void * vp) { auto t = (int64_t) vp; thread_info_s * tip = ti + t; xpthread_mutex_lock(&tip->mutex); /* loop until signalled to quit */ while (tip->work >= 0) { /* wait for work available */ if (tip->work == 0) { xpthread_cond_wait(&tip->cond, &tip->mutex); } if (tip->work > 0) { cluster_worker(t); tip->work = 0; xpthread_cond_signal(&tip->cond); } } xpthread_mutex_unlock(&tip->mutex); return nullptr; } void threads_wakeup(int queries) { int threads = queries > opt_threads ? 
opt_threads : queries; int queries_rest = queries; int threads_rest = threads; int query_next = 0; /* tell the threads that there is work to do */ for(int t=0; t < threads; t++) { thread_info_t * tip = ti + t; tip->query_first = query_next; tip->query_count = (queries_rest + threads_rest - 1) / threads_rest; queries_rest -= tip->query_count; query_next += tip->query_count; threads_rest--; xpthread_mutex_lock(&tip->mutex); tip->work = 1; xpthread_cond_signal(&tip->cond); xpthread_mutex_unlock(&tip->mutex); } /* wait for threads to finish their work */ for(int t=0; t < threads; t++) { thread_info_t * tip = ti + t; xpthread_mutex_lock(&tip->mutex); while (tip->work > 0) { xpthread_cond_wait(&tip->cond, &tip->mutex); } xpthread_mutex_unlock(&tip->mutex); } } void threads_init() { xpthread_attr_init(&attr); xpthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE); /* allocate memory for thread info */ ti = (thread_info_t *) xmalloc(opt_threads * sizeof(thread_info_t)); /* init and create worker threads */ for(int t=0; t < opt_threads; t++) { thread_info_t * tip = ti + t; tip->work = 0; xpthread_mutex_init(&tip->mutex, nullptr); xpthread_cond_init(&tip->cond, nullptr); xpthread_create(&tip->thread, &attr, threads_worker, (void*)(int64_t)t); } } void threads_exit() { /* finish and clean up worker threads */ for(int t=0; t < opt_threads; t++) { thread_info_t * tip = ti + t; /* tell worker to quit */ xpthread_mutex_lock(&tip->mutex); tip->work = -1; xpthread_cond_signal(&tip->cond); xpthread_mutex_unlock(&tip->mutex); /* wait for worker to quit */ xpthread_join(tip->thread, nullptr); xpthread_cond_destroy(&tip->cond); xpthread_mutex_destroy(&tip->mutex); } xfree(ti); xpthread_attr_destroy(&attr); } void cluster_query_init(struct searchinfo_s * si) { /* initialisation of data for one thread; run once for each thread */ /* thread specific initialisation */ si->qsize = 1; si->nw = nullptr; si->hit_count = 0; /* allocate memory for sequence */ si->seq_alloc = db_getlongestsequence() + 1; si->qsequence = (char *) xmalloc(si->seq_alloc); si->kmers = (count_t *) xmalloc(seqcount *
sizeof(count_t) + 32); si->hits = (struct hit *) xmalloc(sizeof(struct hit) * tophits); si->uh = unique_init(); si->m = minheap_init(tophits); si->s = search16_init(opt_match, opt_mismatch, opt_gap_open_query_left, opt_gap_open_target_left, opt_gap_open_query_interior, opt_gap_open_target_interior, opt_gap_open_query_right, opt_gap_open_target_right, opt_gap_extension_query_left, opt_gap_extension_target_left, opt_gap_extension_query_interior, opt_gap_extension_target_interior, opt_gap_extension_query_right, opt_gap_extension_target_right); si->nw = nw_init(); } void cluster_query_exit(struct searchinfo_s * si) { /* clean up after thread execution; called once per thread */ search16_exit(si->s); unique_exit(si->uh); minheap_exit(si->m); nw_exit(si->nw); if (si->qsequence) { xfree(si->qsequence); } if (si->hits) { xfree(si->hits); } if (si->kmers) { xfree(si->kmers); } } char * relabel_otu(int clusterno, char * sequence, int seqlen) { char * label = nullptr; if (opt_relabel) { label = (char*) xmalloc(strlen(opt_relabel) + 21); sprintf(label, "%s%d", opt_relabel, clusterno+1); } else if (opt_relabel_self) { label = (char*) xmalloc(seqlen + 1); sprintf(label, "%.*s", seqlen, sequence); } else if (opt_relabel_sha1) { label = (char*) xmalloc(LEN_HEX_DIG_SHA1); get_hex_seq_digest_sha1(label, sequence, seqlen); } else if (opt_relabel_md5) { label = (char*) xmalloc(LEN_HEX_DIG_MD5); get_hex_seq_digest_md5(label, sequence, seqlen); } return label; } void cluster_core_results_hit(struct hit * best, int clusterno, char * query_head, int qseqlen, char * qsequence, char * qsequence_rc, int qsize) { count_matched++; if (opt_otutabout || opt_mothur_shared_out || opt_biomout) { if (opt_relabel || opt_relabel_self || opt_relabel_sha1 || opt_relabel_md5) { char * label = relabel_otu(clusterno, db_getsequence(best->target), db_getsequencelen(best->target)); otutable_add(query_head, label, qsize); xfree(label); } else { otutable_add(query_head, db_getheader(best->target), qsize); } } 
if (fp_uc) { results_show_uc_one(fp_uc, best, query_head, qsequence, qseqlen, qsequence_rc, clusterno); } if (fp_alnout) { results_show_alnout(fp_alnout, best, 1, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_samout) { results_show_samout(fp_samout, best, 1, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_fastapairs) { results_show_fastapairs_one(fp_fastapairs, best, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_qsegout) { results_show_qsegout_one(fp_qsegout, best, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_tsegout) { results_show_tsegout_one(fp_tsegout, best, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_userout) { results_show_userout_one(fp_userout, best, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_blast6out) { results_show_blast6out_one(fp_blast6out, best, query_head, qsequence, qseqlen, qsequence_rc); } if (opt_matched) { fasta_print_general(fp_matched, nullptr, qsequence, qseqlen, query_head, strlen(query_head), qsize, count_matched, -1.0, -1, -1, nullptr, 0.0); } } void cluster_core_results_nohit(int clusterno, char * query_head, int qseqlen, char * qsequence, char * qsequence_rc, int qsize) { count_notmatched++; if (opt_otutabout || opt_mothur_shared_out || opt_biomout) { if (opt_relabel || opt_relabel_self || opt_relabel_sha1 || opt_relabel_md5) { char * label = relabel_otu(clusterno, qsequence, qseqlen); otutable_add(query_head, label, qsize); xfree(label); } else { otutable_add(query_head, query_head, qsize); } } if (opt_uc) { fprintf(fp_uc, "S\t%d\t%d\t*\t*\t*\t*\t*\t", clusters, qseqlen); header_fprint_strip_size_ee(fp_uc, query_head, strlen(query_head), opt_xsize, opt_xee); fprintf(fp_uc, "\t*\n"); } if (opt_output_no_hits) { if (fp_userout) { results_show_userout_one(fp_userout, nullptr, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_blast6out) { results_show_blast6out_one(fp_blast6out, nullptr, query_head, qsequence, qseqlen, qsequence_rc); } } if (opt_notmatched) { 
fasta_print_general(fp_notmatched, nullptr, qsequence, qseqlen, query_head, strlen(query_head), qsize, count_notmatched, -1.0, -1, -1, nullptr, 0.0); } } int compare_kmersample(const void * a, const void * b) { unsigned int x = * (unsigned int *) a; unsigned int y = * (unsigned int *) b; if (x < y) { return -1; } else if (x > y) { return +1; } else { return 0; } } void cluster_core_parallel() { /* create threads and set them in stand-by mode */ threads_init(); const int queries_per_thread = 1; int max_queries = queries_per_thread * opt_threads; /* allocate memory for the search information for each query; and initialize it */ si_plus = (struct searchinfo_s *) xmalloc(max_queries * sizeof(struct searchinfo_s)); if (opt_strand>1) { si_minus = (struct searchinfo_s *) xmalloc(max_queries * sizeof(struct searchinfo_s)); } for(int i = 0; i < max_queries; i++) { cluster_query_init(si_plus+i); si_plus[i].strand = 0; if (opt_strand > 1) { cluster_query_init(si_minus+i); si_minus[i].strand = 1; } } int * extra_list = (int*) xmalloc(max_queries*sizeof(int)); LinearMemoryAligner lma; int64_t * scorematrix = lma.scorematrix_create(opt_match, opt_mismatch); lma.set_parameters(scorematrix, opt_gap_open_query_left, opt_gap_open_target_left, opt_gap_open_query_interior, opt_gap_open_target_interior, opt_gap_open_query_right, opt_gap_open_target_right, opt_gap_extension_query_left, opt_gap_extension_target_left, opt_gap_extension_query_interior, opt_gap_extension_target_interior, opt_gap_extension_query_right, opt_gap_extension_target_right); int aligncount = 0; int lastlength = INT_MAX; int seqno = 0; int64_t sum_nucleotides = 0; progress_init("Clustering", db_getnucleotidecount()); while(seqno < seqcount) { /* prepare work for the threads in sia[i] */ /* read query sequences into the search info (si) for each thread */ int queries = 0; for(int i = 0; i < max_queries; i++) { if (seqno < seqcount) { int length = db_getsequencelen(seqno); #if 1 if (opt_cluster_smallmem && 
(!opt_usersort) && (length > lastlength)) { fatal("Sequences not sorted by length and --usersort not specified."); } #endif lastlength = length; si_plus[i].query_no = seqno; si_plus[i].strand = 0; if (opt_strand > 1) { si_minus[i].query_no = seqno; si_minus[i].strand = 1; } queries++; seqno++; } } /* perform work in threads */ threads_wakeup(queries); /* analyse results */ int extra_count = 0; for(int i=0; i < queries; i++) { struct searchinfo_s * si_p = si_plus + i; struct searchinfo_s * si_m = opt_strand > 1 ? si_minus + i : nullptr; for(int s = 0; s < opt_strand; s++) { struct searchinfo_s * si = s ? si_m : si_p; int added = 0; if (extra_count) { /* Check if there is a hit with one of the non-matching extra sequences just analysed in this round */ for (int j=0; juh, opt_wordlength, sic->kmersamplecount, sic->kmersample); /* check if min number of shared kmers is satisfied */ if (search_enough_kmers(si, shared)) { unsigned int length = sic->qseqlen; /* Go through the list of hits and see if the current match is better than any on the list in terms of more shared kmers (or shorter length if equal no of kmers). Determine insertion point (x). 
*/ int x = si->hit_count; while ((x > 0) && ((si->hits[x-1].count < shared) || ((si->hits[x-1].count == shared) && (db_getsequencelen(si->hits[x-1].target) > length)))) { x--; } if (x < opt_maxaccepts + opt_maxrejects - 1) { /* insert into list at position x */ /* trash bottom element if no more space */ if (si->hit_count >= opt_maxaccepts + opt_maxrejects - 1) { if (si->hits[si->hit_count-1].aligned) { xfree(si->hits[si->hit_count-1].nwalignment); } si->hit_count--; } /* move the rest down */ for(int z = si->hit_count; z > x; z--) { si->hits[z] = si->hits[z-1]; } /* init new hit */ struct hit * hit = si->hits + x; si->hit_count++; hit->target = sic->query_no; hit->strand = si->strand; hit->count = shared; hit->accepted = false; hit->rejected = false; hit->aligned = false; hit->weak = false; hit->nwalignment = nullptr; added++; } } } } /* now go through the hits and determine final status of each */ if (added) { si->rejects = 0; si->accepts = 0; /* set all statuses to undetermined */ for(int t=0; t< si->hit_count; t++) { si->hits[t].accepted = false; si->hits[t].rejected = false; } for(int t = 0; (si->accepts < opt_maxaccepts) && (si->rejects < opt_maxrejects) && (t < si->hit_count); t++) { struct hit * hit = si->hits + t; if (! hit->aligned) { /* Test accept/reject criteria before alignment */ unsigned int target = hit->target; if (search_acceptable_unaligned(si, target)) { aligncount++; /* perform vectorized alignment */ /* but only using 1 sequence ! 
*/ unsigned int nwtarget = target; int64_t nwscore; int64_t nwalignmentlength; int64_t nwmatches; int64_t nwmismatches; int64_t nwgaps; char * nwcigar = nullptr; /* short variants for simd aligner */ CELL snwscore; unsigned short snwalignmentlength; unsigned short snwmatches; unsigned short snwmismatches; unsigned short snwgaps; search16(si->s, 1, & nwtarget, & snwscore, & snwalignmentlength, & snwmatches, & snwmismatches, & snwgaps, & nwcigar); int64_t tseqlen = db_getsequencelen(target); if (snwscore == SHRT_MAX) { /* In case the SIMD aligner cannot align, perform a new alignment with the linear memory aligner */ char * tseq = db_getsequence(target); if (nwcigar) { xfree(nwcigar); } nwcigar = xstrdup(lma.align(si->qsequence, tseq, si->qseqlen, tseqlen)); lma.alignstats(nwcigar, si->qsequence, tseq, & nwscore, & nwalignmentlength, & nwmatches, & nwmismatches, & nwgaps); } else { nwscore = snwscore; nwalignmentlength = snwalignmentlength; nwmatches = snwmatches; nwmismatches = snwmismatches; nwgaps = snwgaps; } int64_t nwdiff = nwalignmentlength - nwmatches; int64_t nwindels = nwdiff - nwmismatches; hit->aligned = true; hit->nwalignment = nwcigar; hit->nwscore = nwscore; hit->nwdiff = nwdiff; hit->nwgaps = nwgaps; hit->nwindels = nwindels; hit->nwalignmentlength = nwalignmentlength; hit->matches = nwmatches; hit->mismatches = nwmismatches; hit->nwid = 100.0 * (nwalignmentlength - hit->nwdiff) / nwalignmentlength; hit->shortest = MIN(si->qseqlen, tseqlen); hit->longest = MAX(si->qseqlen, tseqlen); /* trim alignment and compute numbers excluding terminal gaps */ align_trim(hit); } else { /* rejection without alignment */ hit->rejected = true; si->rejects++; } } if (! 
hit->rejected) { /* test accept/reject criteria after alignment */ if (search_acceptable_aligned(si, hit)) { si->accepts++; } else { si->rejects++; } } } /* delete all undetermined hits */ int new_hit_count = si->hit_count; for(int t=si->hit_count-1; t>=0; t--) { struct hit * hit = si->hits + t; if (!hit->accepted && !hit->rejected) { new_hit_count = t; if (hit->aligned) { xfree(hit->nwalignment); } } } si->hit_count = new_hit_count; } } /* find best hit */ struct hit * best = nullptr; if (opt_sizeorder) { best = search_findbest2_bysize(si_p, si_m); } else { best = search_findbest2_byid(si_p, si_m); } int myseqno = si_p->query_no; if (best) { /* a hit was found, cluster current sequence with hit */ int target = best->target; /* output intermediate results to uc etc */ cluster_core_results_hit(best, clusterinfo[target].clusterno, si_p->query_head, si_p->qseqlen, si_p->qsequence, best->strand ? si_m->qsequence : nullptr, si_p->qsize); /* update cluster info about this sequence */ clusterinfo[myseqno].seqno = myseqno; clusterinfo[myseqno].clusterno = clusterinfo[target].clusterno; clusterinfo[myseqno].cigar = best->nwalignment; clusterinfo[myseqno].strand = best->strand; best->nwalignment = nullptr; } else { /* no hit found; add it to the list of extra sequences that must be considered by the coming queries in this round */ extra_list[extra_count++] = i; /* update cluster info about this sequence */ clusterinfo[myseqno].seqno = myseqno; clusterinfo[myseqno].clusterno = clusters; clusterinfo[myseqno].cigar = nullptr; clusterinfo[myseqno].strand = 0; /* add current sequence to database */ dbindex_addsequence(myseqno, opt_qmask); /* output intermediate results to uc etc */ cluster_core_results_nohit(clusters, si_p->query_head, si_p->qseqlen, si_p->qsequence, nullptr, si_p->qsize); clusters++; } /* free alignments */ for (int s = 0; s < opt_strand; s++) { struct searchinfo_s * si = s ? 
si_m : si_p; for(int j=0; jhit_count; j++) { if (si->hits[j].aligned) { if (si->hits[j].nwalignment) { xfree(si->hits[j].nwalignment); } } } } sum_nucleotides += si_p->qseqlen; } progress_update(sum_nucleotides); } progress_done(); #if 0 if (!opt_quiet) fprintf(stderr, "Extra alignments computed: %d\n", aligncount); #endif /* clean up search info */ for(int i = 0; i < max_queries; i++) { cluster_query_exit(si_plus+i); if (opt_strand > 1) { cluster_query_exit(si_minus+i); } } xfree(extra_list); xfree(si_plus); if (opt_strand>1) { xfree(si_minus); } /* terminate threads and clean up */ threads_exit(); xfree(scorematrix); } void cluster_core_serial() { struct searchinfo_s si_p[1]; struct searchinfo_s si_m[1]; cluster_query_init(si_p); if (opt_strand > 1) { cluster_query_init(si_m); } int lastlength = INT_MAX; progress_init("Clustering", seqcount); for (int seqno=0; seqno lastlength)) { fatal("Sequences not sorted by length and --usersort not specified."); } #endif lastlength = length; si_p->query_no = seqno; si_p->strand = 0; cluster_query_core(si_p); if (opt_strand > 1) { si_m->query_no = seqno; si_m->strand = 1; cluster_query_core(si_m); } struct hit * best = nullptr; if (opt_sizeorder) { best = search_findbest2_bysize(si_p, si_m); } else { best = search_findbest2_byid(si_p, si_m); } if (best) { int target = best->target; cluster_core_results_hit(best, clusterinfo[target].clusterno, si_p->query_head, si_p->qseqlen, si_p->qsequence, best->strand ? 
si_m->qsequence : nullptr, si_p->qsize); clusterinfo[seqno].seqno = seqno; clusterinfo[seqno].clusterno = clusterinfo[target].clusterno; clusterinfo[seqno].cigar = best->nwalignment; clusterinfo[seqno].strand = best->strand; best->nwalignment = nullptr; } else { clusterinfo[seqno].seqno = seqno; clusterinfo[seqno].clusterno = clusters; clusterinfo[seqno].cigar = nullptr; clusterinfo[seqno].strand = 0; dbindex_addsequence(seqno, opt_qmask); cluster_core_results_nohit(clusters, si_p->query_head, si_p->qseqlen, si_p->qsequence, nullptr, si_p->qsize); clusters++; } /* free alignments */ for (int s = 0; s < opt_strand; s++) { struct searchinfo_s * si = s ? si_m : si_p; for(int i=0; ihit_count; i++) { if (si->hits[i].aligned) { if (si->hits[i].nwalignment) { xfree(si->hits[i].nwalignment); } } } } progress_update(seqno); } progress_done(); cluster_query_exit(si_p); if (opt_strand>1) { cluster_query_exit(si_m); } } void cluster(char * dbname, char * cmdline, char * progheader) { if (opt_centroids) { fp_centroids = fopen_output(opt_centroids); if (!fp_centroids) { fatal("Unable to open centroids file for writing"); } } if (opt_uc) { fp_uc = fopen_output(opt_uc); if (!fp_uc) { fatal("Unable to open uc file for writing"); } } if (opt_alnout) { fp_alnout = fopen_output(opt_alnout); if (! fp_alnout) { fatal("Unable to open alignment output file for writing"); } fprintf(fp_alnout, "%s\n", cmdline); fprintf(fp_alnout, "%s\n", progheader); } if (opt_samout) { fp_samout = fopen_output(opt_samout); if (! fp_samout) { fatal("Unable to open SAM output file for writing"); } } if (opt_userout) { fp_userout = fopen_output(opt_userout); if (! fp_userout) { fatal("Unable to open user-defined output file for writing"); } } if (opt_blast6out) { fp_blast6out = fopen_output(opt_blast6out); if (! fp_blast6out) { fatal("Unable to open blast6-like output file for writing"); } } if (opt_fastapairs) { fp_fastapairs = fopen_output(opt_fastapairs); if (! 
fp_fastapairs) { fatal("Unable to open fastapairs output file for writing"); } } if (opt_qsegout) { fp_qsegout = fopen_output(opt_qsegout); if (! fp_qsegout) { fatal("Unable to open qsegout output file for writing"); } } if (opt_tsegout) { fp_tsegout = fopen_output(opt_tsegout); if (! fp_tsegout) { fatal("Unable to open tsegout output file for writing"); } } if (opt_matched) { fp_matched = fopen_output(opt_matched); if (! fp_matched) { fatal("Unable to open matched output file for writing"); } } if (opt_notmatched) { fp_notmatched = fopen_output(opt_notmatched); if (! fp_notmatched) { fatal("Unable to open notmatched output file for writing"); } } if (opt_otutabout) { fp_otutabout = fopen_output(opt_otutabout); if (! fp_otutabout) { fatal("Unable to open OTU table (text format) output file for writing"); } } if (opt_mothur_shared_out) { fp_mothur_shared_out = fopen_output(opt_mothur_shared_out); if (! fp_mothur_shared_out) { fatal("Unable to open OTU table (mothur format) output file for writing"); } } if (opt_biomout) { fp_biomout = fopen_output(opt_biomout); if (! 
fp_biomout) { fatal("Unable to open OTU table (biom 1.0 format) output file for writing"); } } db_read(dbname, 0); otutable_init(); results_show_samheader(fp_samout, cmdline, dbname); if (opt_qmask == MASK_DUST) { dust_all(); } else if ((opt_qmask == MASK_SOFT) && (opt_hardmask)) { hardmask_all(); } show_rusage(); seqcount = db_getsequencecount(); if (opt_cluster_fast) { db_sortbylength(); } else if (opt_cluster_size || opt_cluster_unoise) { db_sortbyabundance(); } dbindex_prepare(1, opt_qmask); /* tophits = the maximum number of hits we need to store */ if ((opt_maxrejects == 0) || (opt_maxrejects > seqcount)) { opt_maxrejects = seqcount; } if ((opt_maxaccepts == 0) || (opt_maxaccepts > seqcount)) { opt_maxaccepts = seqcount; } tophits = opt_maxrejects + opt_maxaccepts + MAXDELAYED; if (tophits > seqcount) { tophits = seqcount; } clusterinfo = (clusterinfo_t *) xmalloc(seqcount * sizeof(clusterinfo_t)); if (opt_log) { uint64_t slots = 1ULL << (opt_wordlength << 1ULL); fprintf(fp_log, "\n"); fprintf(fp_log, " Alphabet nt\n"); fprintf(fp_log, " Word width %" PRId64 "\n", opt_wordlength); fprintf(fp_log, " Word ones %" PRId64 "\n", opt_wordlength); fprintf(fp_log, " Spaced No\n"); fprintf(fp_log, " Hashed No\n"); fprintf(fp_log, " Coded No\n"); fprintf(fp_log, " Stepped No\n"); fprintf(fp_log, " Slots %" PRIu64 " (%.1fk)\n", slots, slots/1000.0); fprintf(fp_log, " DBAccel 100%%\n"); fprintf(fp_log, "\n"); } if (opt_threads == 1) { cluster_core_serial(); } else { cluster_core_parallel(); } /* find size and abundance of each cluster and save stats */ cluster_abundance = (int64_t *) xmalloc(clusters * sizeof(int64_t)); int * cluster_size = (int *) xmalloc(clusters * sizeof(int)); memset(cluster_abundance, 0, clusters * sizeof(int64_t)); memset(cluster_size, 0, clusters * sizeof(int)); for(int i=0; i abundance_max) { abundance_max = abundance; } if (abundance == 1) { singletons++; } int size = cluster_size[z]; if (size > size_max) { size_max = size; } } /* Sort sequences 
in clusters by their abundance or ordinal number */ /* Sequences in same cluster must always come right after each other. */ /* The centroid sequence must be the first in each cluster. */ progress_init("Sorting clusters", clusters); if (opt_clusterout_sort) { qsort(clusterinfo, seqcount, sizeof(clusterinfo_t), compare_byclusterabundance); } else { qsort(clusterinfo, seqcount, sizeof(clusterinfo_t), compare_byclusterno); } progress_done(); progress_init("Writing clusters", seqcount); /* allocate memory for full file name of the clusters files */ FILE * fp_clusters = nullptr; char * fn_clusters = nullptr; if (opt_clusters) { fn_clusters = (char *) xmalloc(strlen(opt_clusters) + 25); } int lastcluster = -1; int ordinal = 0; for(int i=0; i, Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void fasta2fastq(); vsearch-2.21.1/src/sortbysize.cc0000644000175000017500000001200114171574117016073 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "vsearch.h" static struct sortinfo_size_s { unsigned int size; unsigned int seqno; } * sortinfo; int sortbysize_compare(const void * a, const void * b) { auto * x = (struct sortinfo_size_s *) a; auto * y = (struct sortinfo_size_s *) b; /* highest abundance first, then by label, otherwise keep order */ if (x->size < y->size) { return +1; } else if (x->size > y->size) { return -1; } else { int r = strcmp(db_getheader(x->seqno), db_getheader(y->seqno)); if (r != 0) { return r; } else { if (x->seqno < y->seqno) { return -1; } else if (x->seqno > y->seqno) { return +1; } else { return 0; } } } } void sortbysize() { if (!opt_output) fatal("FASTA output file for sortbysize must be specified with --output"); FILE * fp_output = fopen_output(opt_output); if (!fp_output) { fatal("Unable to open sortbysize output file for writing"); } db_read(opt_sortbysize, 0); show_rusage(); int dbsequencecount = db_getsequencecount(); progress_init("Getting sizes", dbsequencecount); sortinfo = (struct sortinfo_size_s*) xmalloc(dbsequencecount * sizeof(sortinfo_size_s)); int passed = 0; for(int i=0; i= opt_minsize) && (size <= opt_maxsize)) { sortinfo[passed].seqno = i; sortinfo[passed].size = (unsigned int) size; passed++; } progress_update(i); } progress_done(); show_rusage(); progress_init("Sorting", 100); qsort(sortinfo, passed, sizeof(sortinfo_size_s), sortbysize_compare); progress_done(); double median = 0.0; if (passed > 0) { if (passed % 2) { median = sortinfo[(passed-1)/2].size; } else { median = (sortinfo[(passed/2)-1].size + sortinfo[passed/2].size) / 2.0; } } if (! 
opt_quiet) { fprintf(stderr, "Median abundance: %.0f\n", median); } if (opt_log) { fprintf(fp_log, "Median abundance: %.0f\n", median); } show_rusage(); passed = MIN(passed, opt_topn); progress_init("Writing output", passed); for(int i=0; i, Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" /* This file contains code dependent on special cpu features. */ /* The file may be compiled several times with different cpu options. */ #ifdef __aarch64__ void increment_counters_from_bitmap(count_t * counters, unsigned char * bitmap, unsigned int totalbits) { const uint8x16_t c1 = { 0x01, 0x01, 0x02, 0x02, 0x04, 0x04, 0x08, 0x08, 0x10, 0x10, 0x20, 0x20, 0x40, 0x40, 0x80, 0x80 }; unsigned short * p = (unsigned short *)(bitmap); int16x8_t * q = (int16x8_t *)(counters); int r = (totalbits + 15) / 16; for(int j=0; j, Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . 
  The BSD 2-Clause License

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
  notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
  COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
  INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
  ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
  POSSIBILITY OF SUCH DAMAGE.

*/

#include "vsearch.h"

bool header_find_attribute(const char * header,
                           int header_length,
                           const char * attribute,
                           int * start,
                           int * end,
                           bool allow_decimal)
{
  /*
    Identify the first occurrence of the pattern (^|;)size=([0-9]+)(;|$)
    in the header string, where "size=" is the specified attribute.
    If allow_decimal is true, a dot (.) is allowed within the digits.
  */

  const char * digit_chars = "0123456789";
  const char * digit_chars_decimal = "0123456789.";

  if ((! header) || (!
      attribute))
    {
      return false;
    }

  int hlen = header_length;
  int alen = strlen(attribute);

  int i = 0;

  while (i < hlen - alen)
    {
      char * r = (char *) strstr(header + i, attribute);

      /* no match */
      if (r == nullptr)
        {
          break;
        }

      i = r - header;

      /* check for ';' in front */
      if ((i > 0) && (header[i-1] != ';'))
        {
          i += alen + 1;
          continue;
        }

      int digits
        = (int) strspn(header + i + alen,
                       (allow_decimal ? digit_chars_decimal : digit_chars));

      /* check for at least one digit */
      if (digits == 0)
        {
          i += alen + 1;
          continue;
        }

      /* check for ';' after */
      if ((i + alen + digits < hlen) && (header[i + alen + digits] != ';'))
        {
          i += alen + digits + 2;
          continue;
        }

      /* ok */
      * start = i;
      * end = i + alen + digits;
      return true;
    }
  return false;
}

int64_t header_get_size(char * header, int header_length)
{
  /* read size/abundance annotation */
  int64_t abundance = 0;
  int start = 0;
  int end = 0;
  if (header_find_attribute(header,
                            header_length,
                            "size=",
                            & start,
                            & end,
                            false))
    {
      int64_t number = atol(header + start + 5);
      if (number > 0)
        {
          abundance = number;
        }
      else
        {
          fatal("Invalid (zero) abundance annotation in FASTA file header");
        }
    }
  return abundance;
}

void header_fprint_strip_size_ee(FILE * fp,
                                 char * header,
                                 int header_length,
                                 bool strip_size,
                                 bool strip_ee)
{
  int attributes = 0;
  int attribute_start[2];
  int attribute_end[2];

  /* look for size attribute */

  int size_start = 0;
  int size_end = 0;
  bool size_found = false;
  if (strip_size)
    {
      size_found = header_find_attribute(header,
                                         header_length,
                                         "size=",
                                         & size_start,
                                         & size_end,
                                         false);
    }
  if (size_found)
    {
      attribute_start[attributes] = size_start;
      attribute_end[attributes] = size_end;
      attributes++;
    }

  /* look for ee attribute */

  int ee_start = 0;
  int ee_end = 0;
  bool ee_found = false;
  if (strip_ee)
    {
      ee_found = header_find_attribute(header,
                                       header_length,
                                       "ee=",
                                       & ee_start,
                                       & ee_end,
                                       true);
    }
  if (ee_found)
    {
      attribute_start[attributes] = ee_start;
      attribute_end[attributes] = ee_end;
      attributes++;
    }

  /* sort */

  if (attributes > 1)
    {
      if
        (attribute_start[0] > attribute_start[1])
        {
          /* swap */
          int s = attribute_start[0];
          int e = attribute_end[0];
          attribute_start[0] = attribute_start[1];
          attribute_end[0] = attribute_end[1];
          attribute_start[1] = s;
          attribute_end[1] = e;
        }
    }

  /* print */

  if (attributes == 0)
    {
      fprintf(fp, "%.*s", header_length, header);
    }
  else
    {
      int prev_end = 0;
      for (int i = 0; i < attributes; i++)
        {
          /* print part of header in front of this attribute */
          if (attribute_start[i] > prev_end + 1)
            {
              fprintf(fp, "%.*s",
                      attribute_start[i] - prev_end - 1,
                      header + prev_end);
            }
          prev_end = attribute_end[i];
        }

      /* print the rest, if any */
      if (header_length > prev_end + 1)
        {
          fprintf(fp, "%.*s",
                  header_length - prev_end,
                  header + prev_end);
        }
    }
}

void header_fprint_strip_size(FILE * fp,
                              char * header,
                              int header_length)
{
  header_fprint_strip_size_ee(fp, header, header_length, true, false);
}
vsearch-2.21.1/src/tax.h0000644000175000017500000000525614171574117014332 0ustar nileshnilesh
/*

  VSEARCH: a versatile open source tool for metagenomics

  Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
  All rights reserved.

  Contact: Torbjorn Rognes ,
  Department of Informatics, University of Oslo,
  PO Box 1080 Blindern, NO-0316 Oslo, Norway

  This software is dual-licensed and available under a choice
  of one of two licenses, either under the terms of the GNU
  General Public License version 3 or the BSD 2-Clause License.


  GNU General Public License version 3

  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program. If not, see .


  The BSD 2-Clause License

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
  notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
  COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
  INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
  ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
  POSSIBILITY OF SUCH DAMAGE.

*/

const int tax_levels = 8;
extern const char * tax_letters;

bool tax_parse(const char * header,
               int header_length,
               int * tax_start,
               int * tax_end);

void tax_split(int seqno, int * level_start, int * level_len);
vsearch-2.21.1/src/city.h0000644000175000017500000001144014171574117014476 0ustar nileshnilesh
// Copyright (c) 2011 Google, Inc.
// // Permission is hereby granted, free of charge, to any person obtaining a copy // of this software and associated documentation files (the "Software"), to deal // in the Software without restriction, including without limitation the rights // to use, copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the Software is // furnished to do so, subject to the following conditions: // // The above copyright notice and this permission notice shall be included in // all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, // FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE // AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER // LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN // THE SOFTWARE. // // CityHash, by Geoff Pike and Jyrki Alakuijala // // http://code.google.com/p/cityhash/ // // This file provides a few functions for hashing strings. All of them are // high-quality functions in the sense that they pass standard tests such // as Austin Appleby's SMHasher. They are also fast. // // For 64-bit x86 code, on short strings, we don't know of anything faster than // CityHash64 that is of comparable quality. We believe our nearest competitor // is Murmur3. For 64-bit x86 code, CityHash64 is an excellent choice for hash // tables and most other hashing (excluding cryptography). // // For 64-bit x86 code, on long strings, the picture is more complicated. // On many recent Intel CPUs, such as Nehalem, Westmere, Sandy Bridge, etc., // CityHashCrc128 appears to be faster than all competitors of comparable // quality. CityHash128 is also good but not quite as fast. 
// We believe our nearest competitor is Bob Jenkins' Spooky.  We don't have
// great data for other 64-bit CPUs, but for long strings we know that Spooky
// is slightly faster than CityHash on some relatively recent AMD x86-64 CPUs,
// for example.  Note that CityHashCrc128 is declared in citycrc.h.
//
// For 32-bit x86 code, we don't know of anything faster than CityHash32 that
// is of comparable quality.  We believe our nearest competitor is Murmur3A.
// (On 64-bit CPUs, it is typically faster to use the other CityHash variants.)
//
// Functions in the CityHash family are not suitable for cryptography.
//
// Please see CityHash's README file for more details on our performance
// measurements and so on.
//
// WARNING: This code has been only lightly tested on big-endian platforms!
// It is known to work well on little-endian platforms that have a small penalty
// for unaligned reads, such as current Intel and AMD moderate-to-high-end CPUs.
// It should work on all 32-bit and 64-bit platforms that allow unaligned reads;
// bug reports are welcome.
//
// By the way, for some hash functions, given strings a and b, the hash
// of a+b is easily derived from the hashes of a and b.  This property
// doesn't hold for any hash functions in this file.

#ifndef CITY_HASH_H_
#define CITY_HASH_H_

#include <stdint.h>
#include <stdlib.h>  // for size_t.
#include <utility>

typedef uint8_t uint8;
typedef uint32_t uint32;
typedef uint64_t uint64;
typedef std::pair<uint64, uint64> uint128;

inline uint64 Uint128Low64(const uint128& x) { return x.first; }
inline uint64 Uint128High64(const uint128& x) { return x.second; }

// Hash function for a byte array.
uint64 CityHash64(const char *buf, size_t len);

// Hash function for a byte array.  For convenience, a 64-bit seed is also
// hashed into the result.
uint64 CityHash64WithSeed(const char *buf, size_t len, uint64 seed);

// Hash function for a byte array.  For convenience, two seeds are also
// hashed into the result.
uint64 CityHash64WithSeeds(const char *buf, size_t len, uint64 seed0, uint64 seed1); // Hash function for a byte array. uint128 CityHash128(const char *s, size_t len); // Hash function for a byte array. For convenience, a 128-bit seed is also // hashed into the result. uint128 CityHash128WithSeed(const char *s, size_t len, uint128 seed); // Hash function for a byte array. Most useful in 32-bit binaries. uint32 CityHash32(const char *buf, size_t len); // Hash 128 input bits down to 64 bits of output. // This is intended to be a reasonably good hash function. inline uint64 Hash128to64(const uint128& x) { // Murmur-inspired hashing. const uint64 kMul = 0x9ddfea08eb382d69ULL; uint64 a = (Uint128Low64(x) ^ Uint128High64(x)) * kMul; a ^= (a >> 47); uint64 b = (Uint128High64(x) ^ a) * kMul; b ^= (b >> 47); b *= kMul; return b; } #endif // CITY_HASH_H_ vsearch-2.21.1/src/linmemalign.h0000644000175000017500000001217714171574117016032 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. 
You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ class LinearMemoryAligner { char op; int64_t op_run; int64_t cigar_alloc; int64_t cigar_length; char * cigar_string; char * a_seq; char * b_seq; int64_t * scorematrix; int64_t q; int64_t r; /* gap penalties for open/extension query/target left/interior/right */ int64_t go_q_l; int64_t go_t_l; int64_t go_q_i; int64_t go_t_i; int64_t go_q_r; int64_t go_t_r; int64_t ge_q_l; int64_t ge_t_l; int64_t ge_q_i; int64_t ge_t_i; int64_t ge_q_r; int64_t ge_t_r; size_t vector_alloc; int64_t * HH; int64_t * EE; int64_t * XX; int64_t * YY; void cigar_reset(); void cigar_flush(); void cigar_add(char _op, int64_t run); inline int64_t subst_score(int64_t x, int64_t y) { /* return substitution score for replacing symbol at position x in a with symbol at position y in b */ return scorematrix[chrmap_4bit[(int)(b_seq[y])] * 16 + chrmap_4bit[(int)(a_seq[x])]]; } void diff(int64_t a_start, int64_t b_start, int64_t a_len, int64_t b_len, bool gap_b_left, /* gap open left of b */ bool gap_b_right, /* gap open right of b */ bool a_left, /* includes left end of a */ bool a_right, /* includes right end of a */ bool b_left, /* includes left end of b */ bool b_right); /* includes right end of b */ void alloc_vectors(size_t N); void show_matrix(); public: LinearMemoryAligner(); ~LinearMemoryAligner(); int64_t * scorematrix_create(int64_t match, int64_t mismatch); void set_parameters(int64_t * _scorematrix, int64_t _gap_open_query_left, int64_t _gap_open_target_left, int64_t _gap_open_query_interior, int64_t _gap_open_target_interior, int64_t _gap_open_query_right, int64_t _gap_open_target_right, int64_t _gap_extension_query_left, int64_t _gap_extension_target_left, int64_t _gap_extension_query_interior, int64_t _gap_extension_target_interior, int64_t _gap_extension_query_right, int64_t _gap_extension_target_right); char * align(char * _a_seq, char * _b_seq, int64_t M, int64_t N); void alignstats(char * cigar, char * a_seq, char * b_seq, int64_t * nwscore, int64_t * nwalignmentlength, int64_t * 
nwmatches, int64_t * nwmismatches, int64_t * nwgaps); }; vsearch-2.21.1/src/results.cc0000644000175000017500000006646714171574117015410 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" void results_show_fastapairs_one(FILE * fp, struct hit * hp, char * query_head, char * qsequence, int64_t qseqlen, char * rc) { /* http://www.drive5.com/usearch/manual/fastapairs.html */ if (hp) { char * qrow = align_getrow(hp->strand ? rc : qsequence, hp->nwalignment, hp->nwalignmentlength, 0); fasta_print_general(fp, nullptr, qrow + hp->trim_q_left + hp->trim_t_left, hp->internal_alignmentlength, query_head, strlen(query_head), 0, 0, -1.0, -1, -1, nullptr, 0.0); xfree(qrow); char * trow = align_getrow(db_getsequence(hp->target), hp->nwalignment, hp->nwalignmentlength, 1); fasta_print_general(fp, nullptr, trow + hp->trim_q_left + hp->trim_t_left, hp->internal_alignmentlength, db_getheader(hp->target), db_getheaderlen(hp->target), 0, 0, -1.0, -1, -1, nullptr, 0.0); xfree(trow); fprintf(fp, "\n"); } } void results_show_qsegout_one(FILE * fp, struct hit * hp, char * query_head, char * qsequence, int64_t qseqlen, char * rc) { if (hp) { char * qseg = (hp->strand ? 
rc : qsequence) + hp->trim_q_left; int qseglen = qseqlen - hp->trim_q_left - hp->trim_q_right; fasta_print_general(fp, nullptr, qseg, qseglen, query_head, strlen(query_head), 0, 0, -1.0, -1, -1, nullptr, 0.0); } } void results_show_tsegout_one(FILE * fp, struct hit * hp, char * query_head, char * qsequence, int64_t qseqlen, char * rc) { if (hp) { char * tseg = db_getsequence(hp->target) + hp->trim_t_left; int tseglen = db_getsequencelen(hp->target) - hp->trim_t_left - hp->trim_t_right; fasta_print_general(fp, nullptr, tseg, tseglen, db_getheader(hp->target), db_getheaderlen(hp->target), 0, 0, -1.0, -1, -1, nullptr, 0.0); } } void results_show_blast6out_one(FILE * fp, struct hit * hp, char * query_head, char * qsequence, int64_t qseqlen, char * rc) { /* http://www.drive5.com/usearch/manual/blast6out.html query label target label percent identity alignment length number of mismatches number of gap opens 1-based position of start in query 1-based position of end in query 1-based position of start in target 1-based position of end in target E-value bit score Note that USEARCH shows 13 fields when there is no hit, but only 12 when there is a hit. Fixed in VSEARCH. */ if (hp) { int qstart, qend; if (hp->strand) { /* minus strand */ qstart = qseqlen; qend = 1; } else { /* plus strand */ qstart = 1; qend = qseqlen; } fprintf(fp, "%s\t%s\t%.1f\t%d\t%d\t%d\t%d\t%d\t%d\t%" PRIu64 "\t%d\t%d\n", query_head, db_getheader(hp->target), hp->id, hp->internal_alignmentlength, hp->mismatches, hp->internal_gaps, qstart, qend, 1, db_getsequencelen(hp->target), -1, 0); } else { fprintf(fp, "%s\t*\t0.0\t0\t0\t0\t0\t0\t0\t0\t-1\t0\n", query_head); } } void results_show_uc_one(FILE * fp, struct hit * hp, char * query_head, char * qsequence, int64_t qseqlen, char * rc, int clusterno) { /* http://www.drive5.com/usearch/manual/ucout.html Columns: H/N cluster no (0-based) (target sequence no) sequence length (query) percent identity strand: + or - 0 0 compressed alignment, e.g. 
9I92M14D, or "=" if perfect alignment query label target label */ if (hp) { bool perfect; if (opt_cluster_fast) { /* cluster_fast */ /* use = for identical sequences ignoring terminal gaps */ perfect = (hp->matches == hp->internal_alignmentlength); } else { /* cluster_size, cluster_smallmem, cluster_unoise */ /* usearch_global, search_exact, allpairs_global */ /* use = for strictly identical sequences */ perfect = (hp->matches == hp->nwalignmentlength); } fprintf(fp, "H\t%d\t%" PRId64 "\t%.1f\t%c\t0\t0\t%s\t", clusterno, qseqlen, hp->id, hp->strand ? '-' : '+', perfect ? "=" : hp->nwalignment); header_fprint_strip_size_ee(fp, query_head, strlen(query_head), opt_xsize, opt_xee); fprintf(fp, "\t"); header_fprint_strip_size_ee(fp, db_getheader(hp->target), db_getheaderlen(hp->target), opt_xsize, opt_xee); fprintf(fp, "\n"); } else { fprintf(fp, "N\t*\t*\t*\t.\t*\t*\t*\t%s\t*\n", query_head); } } void results_show_userout_one(FILE * fp, struct hit * hp, char * query_head, char * qsequence, int64_t qseqlen, char * rc) { /* http://drive5.com/usearch/manual/userout.html qlo, qhi, tlo, thi and raw are given more meaningful values here */ for (int c = 0; c < userfields_requested_count; c++) { if (c) { fprintf(fp, "\t"); } int field = userfields_requested[c]; char * tsequence = nullptr; int64_t tseqlen = 0; char * t_head = nullptr; if (hp) { tsequence = db_getsequence(hp->target); tseqlen = db_getsequencelen(hp->target); t_head = db_getheader(hp->target); } char * qrow; char * trow; switch (field) { case 0: /* query */ fprintf(fp, "%s", query_head); break; case 1: /* target */ fprintf(fp, "%s", hp ? t_head : "*"); break; case 2: /* evalue */ fprintf(fp, "-1"); break; case 3: /* id */ fprintf(fp, "%.1f", hp ? hp->id : 0.0); break; case 4: /* pctpv */ fprintf(fp, "%.1f", (hp && (hp->internal_alignmentlength > 0)) ? 100.0 * hp->matches / hp->internal_alignmentlength : 0.0); break; case 5: /* pctgaps */ fprintf(fp, "%.1f", (hp && (hp->internal_alignmentlength > 0)) ? 
100.0 * hp->internal_indels / hp->internal_alignmentlength : 0.0); break; case 6: /* pairs */ fprintf(fp, "%d", hp ? hp->matches + hp->mismatches : 0); break; case 7: /* gaps */ fprintf(fp, "%d", hp ? hp->internal_indels : 0); break; case 8: /* qlo */ fprintf(fp, "%" PRId64, hp ? (hp->strand ? qseqlen : 1) : 0); break; case 9: /* qhi */ fprintf(fp, "%" PRId64, hp ? (hp->strand ? 1 : qseqlen) : 0); break; case 10: /* tlo */ fprintf(fp, "%d", hp ? 1 : 0); break; case 11: /* thi */ fprintf(fp, "%" PRId64, tseqlen); break; case 12: /* pv */ fprintf(fp, "%d", hp ? hp->matches : 0); break; case 13: /* ql */ fprintf(fp, "%" PRId64, qseqlen); break; case 14: /* tl */ fprintf(fp, "%" PRId64, hp ? tseqlen : 0); break; case 15: /* qs */ fprintf(fp, "%" PRId64, qseqlen); break; case 16: /* ts */ fprintf(fp, "%" PRId64, hp ? tseqlen : 0); break; case 17: /* alnlen */ fprintf(fp, "%d", hp ? hp->internal_alignmentlength : 0); break; case 18: /* opens */ fprintf(fp, "%d", hp ? hp->internal_gaps : 0); break; case 19: /* exts */ fprintf(fp, "%d", hp ? hp->internal_indels - hp->internal_gaps : 0); break; case 20: /* raw */ fprintf(fp, "%d", hp ? hp->nwscore : 0); break; case 21: /* bits */ fprintf(fp, "%d", 0); break; case 22: /* aln */ if (hp) { align_fprint_uncompressed_alignment(fp, hp->nwalignment); } break; case 23: /* caln */ if (hp) { fprintf(fp, "%s", hp->nwalignment); } break; case 24: /* qstrand */ if (hp) { fprintf(fp, "%c", hp->strand ? '-' : '+'); } break; case 25: /* tstrand */ if (hp) { fprintf(fp, "%c", '+'); } break; case 26: /* qrow */ if (hp) { qrow = align_getrow(hp->strand ? 
rc : qsequence, hp->nwalignment, hp->nwalignmentlength, 0); fprintf(fp, "%.*s", (int)(hp->internal_alignmentlength), qrow + hp->trim_q_left + hp->trim_t_left); xfree(qrow); } break; case 27: /* trow */ if (hp) { trow = align_getrow(tsequence, hp->nwalignment, hp->nwalignmentlength, 1); fprintf(fp, "%.*s", (int)(hp->internal_alignmentlength), trow + hp->trim_q_left + hp->trim_t_left); xfree(trow); } break; case 28: /* qframe */ fprintf(fp, "+0"); break; case 29: /* tframe */ fprintf(fp, "+0"); break; case 30: /* mism */ fprintf(fp, "%d", hp ? hp->mismatches : 0); break; case 31: /* ids */ fprintf(fp, "%d", hp ? hp->matches : 0); break; case 32: /* qcov */ fprintf(fp, "%.1f", hp ? 100.0 * (hp->matches + hp->mismatches) / qseqlen : 0.0); break; case 33: /* tcov */ fprintf(fp, "%.1f", hp ? 100.0 * (hp->matches + hp->mismatches) / tseqlen : 0.0); break; case 34: /* id0 */ fprintf(fp, "%.1f", hp ? hp->id0 : 0.0); break; case 35: /* id1 */ fprintf(fp, "%.1f", hp ? hp->id1 : 0.0); break; case 36: /* id2 */ fprintf(fp, "%.1f", hp ? hp->id2 : 0.0); break; case 37: /* id3 */ fprintf(fp, "%.1f", hp ? hp->id3 : 0.0); break; case 38: /* id4 */ fprintf(fp, "%.1f", hp ? hp->id4 : 0.0); break; /* new internal alignment coordinates */ case 39: /* qilo */ fprintf(fp, "%d", hp ? hp->trim_q_left + 1 : 0); break; case 40: /* qihi */ fprintf(fp, "%" PRId64, hp ? qseqlen - hp->trim_q_right : 0); break; case 41: /* tilo */ fprintf(fp, "%d", hp ? hp->trim_t_left + 1 : 0); break; case 42: /* tihi */ fprintf(fp, "%" PRId64, hp ? 
tseqlen - hp->trim_t_right : 0); break; } } fprintf(fp, "\n"); } void results_show_lcaout(FILE * fp, struct hit * hits, int hitcount, char * query_head, char * qsequence, int64_t qseqlen, char * rc) { /* Output last common ancestor (LCA) of the hits, in a similar way to the Sintax command */ int first_level_start[tax_levels]; int first_level_len[tax_levels]; int level_match[tax_levels]; char * first_h = nullptr; fprintf(fp, "%s\t", query_head); if (hitcount > 0) { for (int t = 0; t < hitcount; t++) { int seqno = hits[t].target; if (t == 0) { tax_split(seqno, first_level_start, first_level_len); first_h = db_getheader(seqno); for (int j = 0; j < tax_levels; j++) { level_match[j] = 1; } } else { int level_start[tax_levels]; int level_len[tax_levels]; tax_split(seqno, level_start, level_len); char * h = db_getheader(seqno); for (int j = 0; j < tax_levels; j++) { /* For each taxonomic level */ if ((level_len[j] == first_level_len[j]) && (strncmp(first_h + first_level_start[j], h + level_start[j], level_len[j]) == 0)) { level_match[j]++; } } } } bool comma = false; for (int j = 0; j < tax_levels; j++) { if (1.0 * level_match[j] / hitcount < opt_lca_cutoff) { break; } if (first_level_len[j] > 0) { fprintf(fp, "%s%c:%.*s", (comma ? 
"," : ""), tax_letters[j], first_level_len[j], first_h + first_level_start[j]); comma = true; } } } fprintf(fp, "\n"); } void results_show_alnout(FILE * fp, struct hit * hits, int hitcount, char * query_head, char * qsequence, int64_t qseqlen, char * rc) { /* http://drive5.com/usearch/manual/alnout.html */ if (hitcount) { fprintf(fp, "\n"); fprintf(fp,"Query >%s\n", query_head); fprintf(fp," %%Id TLen Target\n"); double top_hit_id = hits[0].id; for(int t = 0; t < hitcount; t++) { struct hit * hp = hits + t; if (opt_top_hits_only && (hp->id < top_hit_id)) { break; } fprintf(fp,"%3.0f%% %6" PRIu64 " %s\n", hp->id, db_getsequencelen(hp->target), db_getheader(hp->target)); } for(int t = 0; t < hitcount; t++) { struct hit * hp = hits + t; if (opt_top_hits_only && (hp->id < top_hit_id)) { break; } fprintf(fp,"\n"); char * dseq = db_getsequence(hp->target); int64_t dseqlen = db_getsequencelen(hp->target); int qlenlen = snprintf(nullptr, 0, "%" PRId64, qseqlen); int tlenlen = snprintf(nullptr, 0, "%" PRId64, dseqlen); int numwidth = MAX(qlenlen, tlenlen); fprintf(fp," Query %*" PRId64 "nt >%s\n", numwidth, qseqlen, query_head); fprintf(fp,"Target %*" PRId64 "nt >%s\n", numwidth, dseqlen, db_getheader(hp->target)); int rowlen = opt_rowlen == 0 ? qseqlen+dseqlen : opt_rowlen; align_show(fp, qsequence, qseqlen, hp->trim_q_left, "Qry", dseq, dseqlen, hp->trim_t_left, "Tgt", hp->nwalignment + hp->trim_aln_left, strlen(hp->nwalignment) - hp->trim_aln_left - hp->trim_aln_right, numwidth, 3, rowlen, hp->strand); fprintf(fp, "\n%d cols, %d ids (%3.1f%%), %d gaps (%3.1f%%)\n", hp->internal_alignmentlength, hp->matches, hp->id, hp->internal_indels, hp->internal_alignmentlength > 0 ? 100.0 * hp->internal_indels / hp->internal_alignmentlength : 0.0); #if 0 fprintf(fp, "%d kmers, %d score, %d gap opens. %s %s %d %d %d %d %d\n", hp->count, hp->nwscore, hp->nwgaps, hp->accepted ? 
"accepted" : "not accepted", hp->nwalignment, hp->nwalignmentlength, hp->trim_q_left, hp->trim_q_right, hp->trim_t_left, hp->trim_t_right ); #endif } } else if (opt_output_no_hits) { fprintf(fp, "\n"); fprintf(fp,"Query >%s\n", query_head); fprintf(fp,"No hits\n"); } } bool inline nucleotide_equal(char a, char b) { return chrmap_4bit[(int)a] == chrmap_4bit[(int)b]; } void build_sam_strings(char * alignment, char * queryseq, char * targetseq, xstring * cigar, xstring * md) { /* convert cigar to sam format: add "1" to operations without run length flip direction of indels in cigar string build MD-string with substitutions */ cigar->empty(); md->empty(); char * p = alignment; char * e = p + strlen(p); int qpos = 0; int tpos = 0; int matched = 0; bool flag = false; /* 1: MD string ends with a number */ while(p < e) { int run = 1; int scanned = 0; sscanf(p, "%d%n", & run, & scanned); p += scanned; char op = *p++; switch (op) { case 'M': cigar->add_d(run); cigar->add_c('M'); for(int i=0; iadd_d(matched); matched = 0; flag = true; } md->add_c(targetseq[tpos]); flag = false; } qpos++; tpos++; } break; case 'D': cigar->add_d(run); cigar->add_c('I'); qpos += run; break; case 'I': cigar->add_d(run); cigar->add_c('D'); if (!flag) { md->add_d(matched); matched = 0; flag = true; } md->add_c('^'); for(int i=0; iadd_c(targetseq[tpos++]); } flag = false; break; } } if (!flag) { md->add_d(matched); matched = 0; flag = true; } } void results_show_samheader(FILE * fp, char * cmdline, char * dbname) { if (opt_samout && opt_samheader) { fprintf(fp, "@HD\tVN:1.0\tSO:unsorted\tGO:query\n"); for(uint64_t i=0; i 0) { double top_hit_id = hits[0].id; for(int t = 0; t < hitcount; t++) { struct hit * hp = hits + t; if (opt_top_hits_only && (hp->id < top_hit_id)) { break; } /* */ xstring cigar; xstring md; build_sam_strings(hp->nwalignment, hp->strand ? 
rc : qsequence, db_getsequence(hp->target), & cigar, & md); fprintf(fp, "%s\t%u\t%s\t%" PRIu64 "\t%u\t%s\t%s\t%" PRIu64 "\t%" PRIu64 "\t%s\t%s\t" "AS:i:%.0f\tXN:i:%d\tXM:i:%d\tXO:i:%d\t" "XG:i:%d\tNM:i:%d\tMD:Z:%s\tYT:Z:%s\n", query_head, 0x10 * hp->strand | (t>0 ? 0x100 : 0), db_getheader(hp->target), (uint64_t) 1, 255, cigar.get_string(), "*", (uint64_t) 0, (uint64_t) 0, hp->strand ? rc : qsequence, "*", hp->id, 0, hp->mismatches, hp->internal_gaps, hp->internal_indels, hp->mismatches + hp->internal_indels, md.get_string(), "UU"); } } else if (opt_output_no_hits) { fprintf(fp, "%s\t%u\t%s\t%" PRIu64 "\t%u\t%s\t%s\t%" PRIu64 "\t%" PRIu64 "\t%s\t%s\n", query_head, 0x04, "*", (uint64_t) 0, 255, "*", "*", (uint64_t) 0, (uint64_t) 0, qsequence, "*"); } } vsearch-2.21.1/src/tax.cc0000644000175000017500000001165714171574117014472 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . 

  The BSD 2-Clause License

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
  notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
  COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
  INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
  ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
  POSSIBILITY OF SUCH DAMAGE.

*/

#include "vsearch.h"

const char * tax_letters = "dkpcofgs";

bool tax_parse(const char * header,
               int header_length,
               int * tax_start,
               int * tax_end)
{
  /*
    Identify the first occurrence of the pattern (^|;)tax=([^;]*)(;|$)
  */

  if (!
      header)
    {
      return false;
    }

  const char * attribute = "tax=";

  int hlen = header_length;
  int alen = strlen(attribute);

  int i = 0;

  while (i < hlen - alen)
    {
      char * r = (char *) strstr(header + i, attribute);

      /* no match */
      if (r == nullptr)
        {
          break;
        }

      i = r - header;

      /* check for ';' in front */
      if ((i > 0) && (header[i-1] != ';'))
        {
          i += alen + 1;
          continue;
        }

      * tax_start = i;

      /* find end (semicolon or end of header) */
      const char * s = strchr(header+i+alen, ';');
      if (s == nullptr)
        {
          * tax_end = hlen;
        }
      else
        {
          * tax_end = s - header;
        }

      return true;
    }
  return false;
}

void tax_split(int seqno, int * level_start, int * level_len)
{
  /*
    Parse taxonomy string into the following parts
    d domain
    k kingdom
    p phylum
    c class
    o order
    f family
    g genus
    s species
  */

  for (int i = 0; i < tax_levels; i++)
    {
      level_start[i] = 0;
      level_len[i] = 0;
    }

  int tax_start, tax_end;
  char * h = db_getheader(seqno);
  int hlen = db_getheaderlen(seqno);
  if (tax_parse(h, hlen, & tax_start, & tax_end))
    {
      int t = tax_start + 4;

      while (t < tax_end)
        {
          /* Is the next char a recognized tax level letter? */
          const char * r = strchr(tax_letters, tolower(h[t]));
          if (r)
            {
              int level = r - tax_letters;

              /* Is there a colon after it? */
              if (h[t + 1] == ':')
                {
                  level_start[level] = t + 2;

                  /* the level ends at the next comma, unless that comma
                     lies beyond the end of the tax attribute */
                  char * z = strchr(h + t + 2, ',');
                  if (z && (z - h < tax_end))
                    {
                      level_len[level] = z - h - t - 2;
                    }
                  else
                    {
                      level_len[level] = tax_end - t - 2;
                    }
                }
            }

          /* skip past next comma */
          char * x = strchr(h + t, ',');
          if (x)
            {
              t = x - h + 1;
            }
          else
            {
              t = tax_end;
            }
        }
    }
}

vsearch-2.21.1/src/sha1.c

/* Slightly modified for vsearch by Torbjorn Rognes */

/*
SHA-1 in C
By Steve Reid
100% Public Domain

-----------------
Modified 7/98
By James H.
Brown Still 100% Public Domain Corrected a problem which generated improper hash values on 16 bit machines Routine SHA1Update changed from void SHA1Update(SHA1_CTX* context, unsigned char* data, unsigned int len) to void SHA1Update(SHA1_CTX* context, unsigned char* data, unsigned long len) The 'len' parameter was declared an int which works fine on 32 bit machines. However, on 16 bit machines an int is too small for the shifts being done against it. This caused the hash function to generate incorrect values if len was greater than 8191 (8K - 1) due to the 'len << 3' on line 3 of SHA1Update(). Since the file IO in main() reads 16K at a time, any file 8K or larger would be guaranteed to generate the wrong hash (e.g. Test Vector #3, a million "a"s). I also changed the declaration of variables i & j in SHA1Update to unsigned long from unsigned int for the same reason. These changes should make no difference to any 32 bit implementations since an int and a long are the same size in those environments. -- I also corrected a few compiler warnings generated by Borland C. 1. Added #include for exit() prototype 2. Removed unused variable 'j' in SHA1Final 3. Changed exit(0) to return(0) at end of main. ALL changes I made can be located by searching for comments containing 'JHB' ----------------- Modified 8/98 By Steve Reid Still 100% public domain 1- Removed #include and used return() instead of exit() 2- Fixed overwriting of finalcount in SHA1Final() (discovered by Chris Hall) 3- Changed email address from steve@edmweb.com to sreid@sea-to-sky.net ----------------- Modified 4/01 By Saul Kravitz Still 100% PD Modified to run on Compaq Alpha hardware. 

-----------------
Modified 07/2002
By Ralph Giles
Still 100% public domain
modified for use with stdint types, autoconf
code cleanup, removed attribution comments
switched SHA1Final() argument order for consistency
use SHA1_ prefix for public api
move public api to sha1.h
*/

/*
Test Vectors (from FIPS PUB 180-1)
"abc"
  A9993E36 4706816A BA3E2571 7850C26C 9CD0D89D
"abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq"
  84983E44 1C3BD26E BAAE4AA1 F95129E5 E54670F1
A million repetitions of "a"
  34AA973C D4C4DAA4 F61EEB2B DBAD2731 6534016F
*/

/* #define SHA1HANDSOFF */

#ifdef HAVE_CONFIG_H
#include "config.h"
#endif

#include <stdio.h>
#include <string.h>

#include "sha1.h"

void SHA1_Transform(uint32_t state[5], const uint8_t buffer[64]);

#define rol(value, bits) (((value) << (bits)) | ((value) >> (32 - (bits))))

/* blk0() and blk() perform the initial expand. */
/* I got the idea of expanding during the round function from SSLeay */
/* FIXME: can we do this in an endian-proof way? */
#ifdef WORDS_BIGENDIAN
#define blk0(i) block->l[i]
#else
#define blk0(i) (block->l[i] = (rol(block->l[i],24)&0xFF00FF00) \
    |(rol(block->l[i],8)&0x00FF00FF))
#endif
#define blk(i) (block->l[i&15] = rol(block->l[(i+13)&15]^block->l[(i+8)&15] \
    ^block->l[(i+2)&15]^block->l[i&15],1))

/* (R0+R1), R2, R3, R4 are the different operations used in SHA1 */
#define R0(v,w,x,y,z,i) z+=((w&(x^y))^y)+blk0(i)+0x5A827999+rol(v,5);w=rol(w,30);
#define R1(v,w,x,y,z,i) z+=((w&(x^y))^y)+blk(i)+0x5A827999+rol(v,5);w=rol(w,30);
#define R2(v,w,x,y,z,i) z+=(w^x^y)+blk(i)+0x6ED9EBA1+rol(v,5);w=rol(w,30);
#define R3(v,w,x,y,z,i) z+=(((w|x)&y)|(w&x))+blk(i)+0x8F1BBCDC+rol(v,5);w=rol(w,30);
#define R4(v,w,x,y,z,i) z+=(w^x^y)+blk(i)+0xCA62C1D6+rol(v,5);w=rol(w,30);


#ifdef VERBOSE  /* SAK */
void SHAPrintContext(SHA1_CTX *context, char *msg){
  printf("%s (%d,%d) %x %x %x %x %x\n",
         msg,
         context->count[0], context->count[1],
         context->state[0],
         context->state[1],
         context->state[2],
         context->state[3],
         context->state[4]);
}
#endif /*
VERBOSE */ /* Hash a single 512-bit block. This is the core of the algorithm. */ void SHA1_Transform(uint32_t state[5], const uint8_t buffer[64]) { uint32_t a, b, c, d, e; typedef union { uint8_t c[64]; uint32_t l[16]; } CHAR64LONG16; CHAR64LONG16* block; #ifdef SHA1HANDSOFF static uint8_t workspace[64]; block = (CHAR64LONG16*)workspace; memcpy(block, buffer, 64); #else block = (CHAR64LONG16*)buffer; #endif /* Copy context->state[] to working vars */ a = state[0]; b = state[1]; c = state[2]; d = state[3]; e = state[4]; /* 4 rounds of 20 operations each. Loop unrolled. */ R0(a,b,c,d,e, 0); R0(e,a,b,c,d, 1); R0(d,e,a,b,c, 2); R0(c,d,e,a,b, 3); R0(b,c,d,e,a, 4); R0(a,b,c,d,e, 5); R0(e,a,b,c,d, 6); R0(d,e,a,b,c, 7); R0(c,d,e,a,b, 8); R0(b,c,d,e,a, 9); R0(a,b,c,d,e,10); R0(e,a,b,c,d,11); R0(d,e,a,b,c,12); R0(c,d,e,a,b,13); R0(b,c,d,e,a,14); R0(a,b,c,d,e,15); R1(e,a,b,c,d,16); R1(d,e,a,b,c,17); R1(c,d,e,a,b,18); R1(b,c,d,e,a,19); R2(a,b,c,d,e,20); R2(e,a,b,c,d,21); R2(d,e,a,b,c,22); R2(c,d,e,a,b,23); R2(b,c,d,e,a,24); R2(a,b,c,d,e,25); R2(e,a,b,c,d,26); R2(d,e,a,b,c,27); R2(c,d,e,a,b,28); R2(b,c,d,e,a,29); R2(a,b,c,d,e,30); R2(e,a,b,c,d,31); R2(d,e,a,b,c,32); R2(c,d,e,a,b,33); R2(b,c,d,e,a,34); R2(a,b,c,d,e,35); R2(e,a,b,c,d,36); R2(d,e,a,b,c,37); R2(c,d,e,a,b,38); R2(b,c,d,e,a,39); R3(a,b,c,d,e,40); R3(e,a,b,c,d,41); R3(d,e,a,b,c,42); R3(c,d,e,a,b,43); R3(b,c,d,e,a,44); R3(a,b,c,d,e,45); R3(e,a,b,c,d,46); R3(d,e,a,b,c,47); R3(c,d,e,a,b,48); R3(b,c,d,e,a,49); R3(a,b,c,d,e,50); R3(e,a,b,c,d,51); R3(d,e,a,b,c,52); R3(c,d,e,a,b,53); R3(b,c,d,e,a,54); R3(a,b,c,d,e,55); R3(e,a,b,c,d,56); R3(d,e,a,b,c,57); R3(c,d,e,a,b,58); R3(b,c,d,e,a,59); R4(a,b,c,d,e,60); R4(e,a,b,c,d,61); R4(d,e,a,b,c,62); R4(c,d,e,a,b,63); R4(b,c,d,e,a,64); R4(a,b,c,d,e,65); R4(e,a,b,c,d,66); R4(d,e,a,b,c,67); R4(c,d,e,a,b,68); R4(b,c,d,e,a,69); R4(a,b,c,d,e,70); R4(e,a,b,c,d,71); R4(d,e,a,b,c,72); R4(c,d,e,a,b,73); R4(b,c,d,e,a,74); R4(a,b,c,d,e,75); R4(e,a,b,c,d,76); R4(d,e,a,b,c,77); R4(c,d,e,a,b,78); 
R4(b,c,d,e,a,79); /* Add the working vars back into context.state[] */ state[0] += a; state[1] += b; state[2] += c; state[3] += d; state[4] += e; /* Wipe variables */ a = b = c = d = e = 0; } /* SHA1Init - Initialize new context */ void SHA1_Init(SHA1_CTX* context) { /* SHA1 initialization constants */ context->state[0] = 0x67452301; context->state[1] = 0xEFCDAB89; context->state[2] = 0x98BADCFE; context->state[3] = 0x10325476; context->state[4] = 0xC3D2E1F0; context->count[0] = context->count[1] = 0; } /* Run your data through this. */ void SHA1_Update(SHA1_CTX* context, const uint8_t* data, const size_t len) { size_t i, j; #ifdef VERBOSE SHAPrintContext(context, "before"); #endif j = (context->count[0] >> 3) & 63; if ((context->count[0] += len << 3) < (len << 3)) { context->count[1]++; } context->count[1] += (len >> 29); if ((j + len) > 63) { memcpy(&context->buffer[j], data, (i = 64-j)); SHA1_Transform(context->state, context->buffer); for ( ; i + 63 < len; i += 64) { SHA1_Transform(context->state, data + i); } j = 0; } else { i = 0; } memcpy(&context->buffer[j], &data[i], len - i); #ifdef VERBOSE SHAPrintContext(context, "after "); #endif } /* Add padding and return the message digest. */ void SHA1_Final(SHA1_CTX* context, uint8_t digest[SHA1_DIGEST_SIZE]) { uint32_t i; uint8_t finalcount[8]; for (i = 0; i < 8; i++) { finalcount[i] = (unsigned char)((context->count[(i >= 4 ? 
0 : 1)] >> ((3-(i & 3)) * 8) ) & 255); /* Endian independent */ } SHA1_Update(context, (uint8_t *)"\200", 1); while ((context->count[0] & 504) != 448) { SHA1_Update(context, (uint8_t *)"\0", 1); } SHA1_Update(context, finalcount, 8); /* Should cause a SHA1_Transform() */ for (i = 0; i < SHA1_DIGEST_SIZE; i++) { digest[i] = (uint8_t) ((context->state[i>>2] >> ((3-(i & 3)) * 8) ) & 255); } /* Wipe variables */ i = 0; memset(context->buffer, 0, 64); memset(context->state, 0, 20); memset(context->count, 0, 8); memset(finalcount, 0, 8); /* SWR */ #ifdef SHA1HANDSOFF /* make SHA1Transform overwrite its own static vars */ SHA1_Transform(context->state, context->buffer); #endif } /*************************************************************/ #if 0 int main(int argc, char** argv) { int i, j; SHA1_CTX context; unsigned char digest[SHA1_DIGEST_SIZE], buffer[16384]; FILE* file; if (argc > 2) { puts("Public domain SHA-1 implementation - by Steve Reid "); puts("Modified for 16 bit environments 7/98 - by James H. 
Brown "); /* JHB */ puts("Produces the SHA-1 hash of a file, or stdin if no file is specified."); return(0); } if (argc < 2) { file = stdin; } else { if (!(file = fopen(argv[1], "rb"))) { fputs("Unable to open file.", stderr); return(-1); } } SHA1_Init(&context); while (!feof(file)) { /* note: what if ferror(file) */ i = fread(buffer, 1, 16384, file); SHA1_Update(&context, buffer, i); } SHA1_Final(&context, digest); fclose(file); for (i = 0; i < SHA1_DIGEST_SIZE/4; i++) { for (j = 0; j < 4; j++) { printf("%02X", digest[i*4+j]); } putchar(' '); } putchar('\n'); return(0); /* JHB */ } #endif /* self test */ #ifdef TEST static char *test_data[] = { "abc", "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq", "A million repetitions of 'a'"}; static char *test_results[] = { "A9993E36 4706816A BA3E2571 7850C26C 9CD0D89D", "84983E44 1C3BD26E BAAE4AA1 F95129E5 E54670F1", "34AA973C D4C4DAA4 F61EEB2B DBAD2731 6534016F"}; void digest_to_hex(const uint8_t digest[SHA1_DIGEST_SIZE], char *output) { int i,j; char *c = output; for (i = 0; i < SHA1_DIGEST_SIZE/4; i++) { for (j = 0; j < 4; j++) { sprintf(c,"%02X", digest[i*4+j]); c += 2; } sprintf(c, " "); c += 1; } *(c - 1) = '\0'; } int main(int argc, char** argv) { int k; SHA1_CTX context; uint8_t digest[20]; char output[80]; fprintf(stdout, "verifying SHA-1 implementation... 
"); for (k = 0; k < 2; k++){ SHA1_Init(&context); SHA1_Update(&context, (uint8_t*)test_data[k], strlen(test_data[k])); SHA1_Final(&context, digest); digest_to_hex(digest, output); if (strcmp(output, test_results[k])) { fprintf(stdout, "FAIL\n"); fprintf(stderr,"* hash of \"%s\" incorrect:\n", test_data[k]); fprintf(stderr,"\t%s returned\n", output); fprintf(stderr,"\t%s is correct\n", test_results[k]); return (1); } } /* million 'a' vector we feed separately */ SHA1_Init(&context); for (k = 0; k < 1000000; k++) SHA1_Update(&context, (uint8_t*)"a", 1); SHA1_Final(&context, digest); digest_to_hex(digest, output); if (strcmp(output, test_results[2])) { fprintf(stdout, "FAIL\n"); fprintf(stderr,"* hash of \"%s\" incorrect:\n", test_data[2]); fprintf(stderr,"\t%s returned\n", output); fprintf(stderr,"\t%s is correct\n", test_results[2]); return (1); } /* success */ fprintf(stdout, "ok\n"); return(0); } #endif /* TEST */ vsearch-2.21.1/src/align.cc0000644000175000017500000002745414171574117014772 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. 
You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "vsearch.h" struct nwinfo_s { int64_t dir_alloc; int64_t hearray_alloc; char * dir; int64_t * hearray; }; static const char maskup = 1; static const char maskleft = 2; static const char maskextup = 4; static const char maskextleft = 8; inline void pushop(char newop, char ** cigarendp, char * op, int * count) { if (newop == *op) { (*count)++; } else { *--*cigarendp = *op; if (*count > 1) { char buf[25]; int len = sprintf(buf, "%d", *count); *cigarendp -= len; memcpy(*cigarendp, buf, (size_t)len); } *op = newop; *count = 1; } } inline void finishop(char ** cigarendp, char * op, int * count) { if ((*op) && (*count)) { *--*cigarendp = *op; if (*count > 1) { char buf[25]; int len = sprintf(buf, "%d", *count); *cigarendp -= len; memcpy(*cigarendp, buf, (size_t)len); } *op = 0; *count = 0; } } /* Needleman-Wunsch aligner finds a global alignment with maximum score positive score for matches; negative score for mismatches gap penalties are positive, but count negatively alignment priority when backtracking (from lower right corner): 1. left/insert/e (gap in query sequence (qseq)) 2. up/delete/f (gap in database sequence (dseq)) 3. 
align/diag/h (match/mismatch) qseq: the reference/query/upper/vertical/from sequence dseq: the sample/database/lower/horizontal/to sequence default (interior) scores: match: +2 mismatch: -4 gap open: 20 gap extend: 2 input dseq: pointer to start of database sequence dend: pointer after database sequence qseq: pointer to start of query sequence qend: pointer after query sequence score_matrix: 16x16 matrix of int64_t scores for aligning two symbols gapopen: positive number indicating penalty for opening a gap of length zero gapextend: positive number indicating penalty for extending a gap output nwscore: the global alignment score nwdiff: number of non-identical nucleotides in one optimal global alignment nwalignmentlength: the length of one optimal alignment nwalignment: cigar string with one optimal alignment */ struct nwinfo_s * nw_init() { auto * nw = (struct nwinfo_s *) xmalloc(sizeof(struct nwinfo_s)); nw->dir = nullptr; nw->dir_alloc = 0; nw->hearray = nullptr; nw->hearray_alloc = 0; return nw; } void nw_exit(struct nwinfo_s * nw) { if (nw->dir) { xfree(nw->dir); } if (nw->hearray) { xfree(nw->hearray); } xfree(nw); } inline int64_t getscore(int64_t * score_matrix, char a, char b) { return score_matrix[(chrmap_4bit[(int)a]<<4) + chrmap_4bit[(int)b]]; } void nw_align(char * dseq, char * dend, char * qseq, char * qend, int64_t * score_matrix, int64_t gapopen_q_left, int64_t gapopen_q_interior, int64_t gapopen_q_right, int64_t gapopen_t_left, int64_t gapopen_t_interior, int64_t gapopen_t_right, int64_t gapextend_q_left, int64_t gapextend_q_interior, int64_t gapextend_q_right, int64_t gapextend_t_left, int64_t gapextend_t_interior, int64_t gapextend_t_right, int64_t * nwscore, int64_t * nwdiff, int64_t * nwgaps, int64_t * nwindels, int64_t * nwalignmentlength, char ** nwalignment, int64_t queryno, int64_t dbseqno, struct nwinfo_s * nw) { int64_t h, n, e, f, h_e, h_f; int64_t *hep; int64_t qlen = qend - qseq; int64_t dlen = dend - dseq; if (qlen * dlen > 
nw->dir_alloc) { nw->dir_alloc = qlen * dlen; nw->dir = (char *) xrealloc(nw->dir, (size_t)nw->dir_alloc); } int64_t need = 2 * qlen * (int64_t) sizeof(int64_t); if (need > nw->hearray_alloc) { nw->hearray_alloc = need; nw->hearray = (int64_t *) xrealloc(nw->hearray, (size_t)nw->hearray_alloc); } memset(nw->dir, 0, (size_t)(qlen*dlen)); int64_t i, j; for(i=0; i<qlen; i++) { nw->hearray[2*i] = -gapopen_t_left - (i+1) * gapextend_t_left; if (i < qlen-1) { nw->hearray[2*i+1] = - gapopen_t_left - (i+1) * gapextend_t_left - gapopen_q_interior - gapextend_q_interior; } else { nw->hearray[2*i+1] = - gapopen_t_left - (i+1) * gapextend_t_left - gapopen_q_right - gapextend_q_right; } } for(j=0; j<dlen; j++) { hep = nw->hearray; if (j == 0) { h = 0; } else { h = - gapopen_q_left - j * gapextend_q_left; } if (j < dlen-1) { f = - gapopen_q_left - (j+1) * gapextend_q_left - gapopen_t_interior - gapextend_t_interior; } else { f = - gapopen_q_left - (j+1) * gapextend_q_left - gapopen_t_right - gapextend_t_right; } for(i=0; i<qlen; i++) { char * d = nw->dir + qlen*j+i; n = *hep; e = *(hep+1); h += getscore(score_matrix, dseq[j], qseq[i]); if (f > h) { h = f; *d |= maskup; } if (e > h) { h = e; *d |= maskleft; } *hep = h; if (i < qlen-1) { h_e = h - gapopen_q_interior - gapextend_q_interior; e -= gapextend_q_interior; } else { h_e = h - gapopen_q_right - gapextend_q_right; e -= gapextend_q_right; } if (j < dlen-1) { h_f = h - gapopen_t_interior - gapextend_t_interior; f -= gapextend_t_interior; } else { h_f = h - gapopen_t_right - gapextend_t_right; f -= gapextend_t_right; } if (f > h_f) { *d |= maskextup; } else { f = h_f; } if (e > h_e) { *d |= maskextleft; } else { e = h_e; } *(hep+1) = e; h = n; hep += 2; } } int64_t dist = nw->hearray[2*qlen-2]; /* backtrack: count differences and save alignment in cigar string */ int64_t score = 0; int64_t alength = 0; int64_t matches = 0; int64_t gaps = 0; int64_t indels = 0; char * cigar = (char *) xmalloc((size_t)(qlen + dlen + 1)); char * cigarend = cigar+qlen+dlen+1; char op = 0; int count = 0; *(--cigarend) 
= 0; i = qlen; j = dlen; while ((i>0) && (j>0)) { int64_t gapopen_q = (i < qlen) ? gapopen_q_interior : gapopen_q_right; int64_t gapextend_q = (i < qlen) ? gapextend_q_interior : gapextend_q_right; int64_t gapopen_t = (j < dlen) ? gapopen_t_interior : gapopen_t_right; int64_t gapextend_t = (j < dlen) ? gapextend_t_interior : gapextend_t_right; int d = nw->dir[qlen*(j-1)+(i-1)]; alength++; if ((op == 'I') && (d & maskextleft)) { score -= gapextend_q; indels++; j--; pushop('I', &cigarend, &op, &count); } else if ((op == 'D') && (d & maskextup)) { score -= gapextend_t; indels++; i--; pushop('D', &cigarend, &op, &count); } else if (d & maskleft) { score -= gapextend_q; indels++; if (op != 'I') { score -= gapopen_q; gaps++; } j--; pushop('I', &cigarend, &op, &count); } else if (d & maskup) { score -= gapextend_t; indels++; if (op != 'D') { score -= gapopen_t; gaps++; } i--; pushop('D', &cigarend, &op, &count); } else { score += getscore(score_matrix, dseq[j-1], qseq[i-1]); if (chrmap_4bit[(int)(dseq[j-1])] & chrmap_4bit[(int)(qseq[i-1])]) { matches++; } i--; j--; pushop('M', &cigarend, &op, &count); } } while(i>0) { alength++; score -= gapextend_t_left; indels++; if (op != 'D') { score -= gapopen_t_left; gaps++; } i--; pushop('D', &cigarend, &op, &count); } while(j>0) { alength++; score -= gapextend_q_left; indels++; if (op != 'I') { score -= gapopen_q_left; gaps++; } j--; pushop('I', &cigarend, &op, &count); } finishop(&cigarend, &op, &count); /* move and reallocate cigar */ int64_t cigarlength = cigar+qlen+dlen-cigarend; memmove(cigar, cigarend, (size_t)(cigarlength+1)); cigar = (char*) xrealloc(cigar, (size_t)(cigarlength+1)); * nwscore = dist; * nwdiff = alength - matches; * nwalignmentlength = alength; * nwalignment = cigar; * nwgaps = gaps; * nwindels = indels; #if 1 if (score != dist) { fprintf(stderr, "WARNING: Error with query no %" PRId64 " and db sequence no %" PRId64 ":\n", queryno, dbseqno); fprintf(stderr, "Initial and recomputed alignment score 
disagreement: %" PRId64 " %" PRId64 "\n", dist, score); fprintf(stderr, "Alignment: %s\n", cigar); if (opt_log) { fprintf(fp_log, "WARNING: Error with query no %" PRId64 " and db sequence no %" PRId64 ":\n", queryno, dbseqno); fprintf(fp_log, "Initial and recomputed alignment score disagreement: %" PRId64 " %" PRId64 "\n", dist, score); fprintf(fp_log, "Alignment: %s\n", cigar); fprintf(fp_log, "\n"); } } #endif } vsearch-2.21.1/src/results.h0000644000175000017500000001204014171574117015224 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void results_show_alnout(FILE * fp, struct hit * hits, int hitcount, char * query_head, char * qsequence, int64_t qseqlen, char * rc); void results_show_lcaout(FILE * fp, struct hit * hits, int hitcount, char * query_head, char * qsequence, int64_t qseqlen, char * rc); void results_show_blast6out_one(FILE * fp, struct hit * hp, char * query_head, char * qsequence, int64_t qseqlen, char * rc); void results_show_uc_one(FILE * fp, struct hit * hp, char * query_head, char * qsequence, int64_t qseqlen, char * rc, int clusterno); void results_show_userout_one(FILE * fp, struct hit * hp, char * query_head, char * qsequence, int64_t qseqlen, char * rc); void results_show_fastapairs_one(FILE * fp, struct hit * hp, char * query_head, char * qsequence, int64_t qseqlen, char * rc); void results_show_qsegout_one(FILE * fp, struct hit * hp, char * query_head, char * qsequence, int64_t qseqlen, char * rc); void results_show_tsegout_one(FILE * fp, struct hit * hp, char * query_head, char * qsequence, int64_t qseqlen, char * rc); void 
results_show_samheader(FILE * fp, char * cmdline, char * dbname); void results_show_samout(FILE * fp, struct hit * hits, int hitcount, char * query_head, char * qsequence, int64_t qseqlen, char * rc); vsearch-2.21.1/src/udb.cc0000644000175000017500000006445614171574117014455 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" #define BLOCKSIZE (4096 * 4096) static unsigned int udb_dbaccel = 0; typedef struct wordfreq { unsigned int kmer; unsigned int count; } wordfreq_t; int wc_compare(const void * a, const void * b) { auto * x = (wordfreq_t *) a; auto * y = (wordfreq_t *) b; if (x->count < y->count) { return -1; } else if (x->count > y->count) { return +1; } else { if (x->kmer < y->kmer) { return +1; } else if (x->kmer > y->kmer) { return -1; } else { return 0; } } } uint64_t largeread(int fd, void * buf, uint64_t nbyte, uint64_t offset) { /* call pread multiple times and update progress */ uint64_t progress = offset; for(uint64_t i = 0; i < nbyte; i += BLOCKSIZE) { uint64_t res = xlseek(fd, offset + i, SEEK_SET); if (res != offset + i) { fatal("Unable to seek in UDB file or invalid UDB file"); } uint64 rem = MIN(BLOCKSIZE, nbyte - i); uint64_t bytesread = read(fd, ((char*)buf) + i, rem); if (bytesread != rem) { fatal("Unable to read from UDB file or invalid UDB file"); } progress += rem; progress_update(progress); } return nbyte; } uint64_t largewrite(int fd, void * buf, uint64_t nbyte, uint64_t offset) { /* call write multiple times and update progress */ uint64_t progress = offset; for(uint64_t i = 0; i < nbyte; i += 
BLOCKSIZE) { uint64_t res = xlseek(fd, offset + i, SEEK_SET); if (res != offset + i) { fatal("Unable to seek in UDB file or invalid UDB file"); } uint64 rem = MIN(BLOCKSIZE, nbyte - i); uint64_t byteswritten = write(fd, ((char*)buf) + i, rem); if (byteswritten != rem) { fatal("Unable to write to UDB file"); } progress += rem; progress_update(progress); } return nbyte; } auto udb_detect_isudb(const char * filename) -> bool { /* Detect whether the given filename seems to refer to an UDB file. It must be an uncompressed regular file, not a pipe. */ constexpr uint32_t udb_file_signature {0x55444246}; constexpr uint64_t expected_n_bytes {sizeof(uint32_t)}; xstat_t fs; if (xstat(filename, & fs)) { fatal("Unable to get status for input file (%s)", filename); } bool is_pipe = S_ISFIFO(fs.st_mode); if (is_pipe) { return false; } int fd = 0; fd = xopen_read(filename); if (!fd) { fatal("Unable to open input file for reading (%s)", filename); } unsigned int magic = 0; uint64_t bytesread = read(fd, & magic, expected_n_bytes); close(fd); if ((bytesread == expected_n_bytes) && (magic == udb_file_signature)) { return true; } return false; } void udb_info() { /* Read UDB header and show basic info */ unsigned int buffer[50]; int fd_udbinfo = 0; fd_udbinfo = xopen_read(opt_udbinfo); if (! 
fd_udbinfo) { fatal("Unable to open UDB file for reading"); } uint64_t bytesread = read(fd_udbinfo, buffer, 4 * 50); if (bytesread != 4 * 50) { fatal("Unable to read from UDB file or invalid UDB file"); } if ((buffer[0] != 0x55444246) || (buffer[2] != 32) || (buffer[4] < 3) || (buffer[4] > 15) || (buffer[13] == 0) || (buffer[17] != 0x0000746e) || (buffer[49] != 0x55444266)) { fatal("Invalid UDB file"); } if (!opt_quiet) { fprintf(stderr, " Seqs %u\n", buffer[13]); fprintf(stderr, " SeqIx bits %u\n", buffer[2]); fprintf(stderr, " Alpha nt (4)\n"); fprintf(stderr, " Word width %u\n", buffer[4]); fprintf(stderr, " Slots %u\n", buffer[11]); fprintf(stderr, " Dict size %u (%.1fk)\n", (1 << (2 * buffer[4])), (1 << (2 * buffer[4])) * 1.0 / 1000.0); fprintf(stderr, " DBstep %u\n", buffer[5]); fprintf(stderr, " DBAccel %u%%\n", buffer[6]); } if (opt_log) { fprintf(fp_log, " Seqs %u\n", buffer[13]); fprintf(fp_log, " SeqIx bits %u\n", buffer[2]); fprintf(fp_log, " Alpha nt (4)\n"); fprintf(fp_log, " Word width %u\n", buffer[4]); fprintf(fp_log, " Slots %u\n", buffer[11]); fprintf(fp_log, " Dict size %u (%.1fk)\n", (1 << (2 * buffer[4])), (1 << (2 * buffer[4])) * 1.0 / 1000.0); fprintf(fp_log, " DBstep %u\n", buffer[5]); fprintf(fp_log, " DBAccel %u%%\n", buffer[6]); } close(fd_udbinfo); } void udb_read(const char * filename, bool create_bitmaps, bool parse_abundances) { /* read UDB as indexed database */ unsigned int seqcount = 0; unsigned int udb_wordlength = 0; uint64 nucleotides = 0; xstat_t fs; if (xstat(filename, & fs)) { fatal("Unable to get status for input file (%s)", filename); } bool is_pipe = S_ISFIFO(fs.st_mode); if (is_pipe) { fatal("Cannot read UDB file from a pipe"); } /* get file size */ uint64_t filesize = fs.st_size; /* open UDB file */ int fd_udb = 0; fd_udb = xopen_read(filename); if (! 
fd_udb) { fatal("Unable to open UDB file for reading"); } char * prompt = nullptr; if (xsprintf(& prompt, "Reading UDB file %s", filename) == -1) { fatal("Out of memory"); } progress_init(prompt, filesize); /* header */ unsigned int buffer[50]; uint64_t pos = 0; pos += largeread(fd_udb, buffer, 4 * 50, pos); if ((buffer[0] != 0x55444246) || (buffer[2] != 32) || (buffer[4] < 3) || (buffer[4] > 15) || (buffer[13] == 0) || (buffer[17] != 0x0000746e) || (buffer[49] != 0x55444266)) { fatal("Invalid UDB file"); } udb_wordlength = buffer[4]; seqcount = buffer[13]; udb_dbaccel = buffer[6]; if (udb_wordlength != opt_wordlength) { fprintf(stderr, "\nWARNING: Wordlength adjusted to %u as indicated in UDB file\n", udb_wordlength); opt_wordlength = udb_wordlength; } /* word match counts */ kmerhashsize = 1 << (2 * udb_wordlength); kmercount = (unsigned int*) xmalloc(kmerhashsize * sizeof(unsigned int)); kmerhash = (uint64_t *) xmalloc(kmerhashsize * sizeof(uint64_t)); kmerbitmap = (bitmap_t * *) xmalloc(kmerhashsize * sizeof(bitmap_t**)); memset(kmerbitmap, 0, kmerhashsize * sizeof(bitmap_t**)); pos += largeread(fd_udb, kmercount, 4 * kmerhashsize, pos); kmerindexsize = 0; for(uint64_t i = 0; i < kmerhashsize; i++) { kmerhash[i] = kmerindexsize; kmerindexsize += kmercount[i]; } /* signature */ pos += largeread(fd_udb, buffer, 4, pos); if (buffer[0] != 0x55444233) { fatal("Invalid UDB file"); } /* sequence numbers for word matches */ kmerindex = (unsigned int *) xmalloc(kmerindexsize * 4); pos += largeread(fd_udb, kmerindex, 4 * kmerindexsize, pos); /* new header */ pos += largeread(fd_udb, buffer, 4 * 8, pos); if ((buffer[0] != 0x55444234) || (buffer[1] != 0x005e0db3) || (buffer[2] != seqcount) || (buffer[7] != 0x005e0db4)) { fatal("Invalid UDB file"); } nucleotides = (((uint64_t) buffer[4]) << 32) | buffer[3]; uint64_t udb_headerchars = (((uint64_t) buffer[6]) << 32) | buffer[5]; /* header index */ seqindex = (seqinfo_t *) xmalloc(seqcount * sizeof(seqinfo_t)); int * 
header_index = (int *) xmalloc(4 * (seqcount+1)); pos += largeread(fd_udb, header_index, 4 * seqcount, pos); header_index[seqcount] = udb_headerchars; unsigned last = 0; for(unsigned int i = 0; i < seqcount; i++) { unsigned int x = header_index[i]; if ((x < last) || (x >= udb_headerchars)) { fatal("Invalid UDB file"); } seqindex[i].header_p = x; seqindex[i].headerlen = header_index[i+1] - x - 1; seqindex[i].size = 1; last = x; } xfree(header_index); /* headers */ datap = (char *) xmalloc(udb_headerchars + nucleotides + seqcount); pos += largeread(fd_udb, datap, udb_headerchars, pos); uint64_t longestheader = 0; for(unsigned int i = 0; i < seqcount; i++) { if (seqindex[i].headerlen > longestheader) { longestheader = seqindex[i].headerlen; } } /* sequence lengths */ int * sequence_lengths = (int *) xmalloc(4 * seqcount); pos += largeread(fd_udb, sequence_lengths, 4 * seqcount, pos); uint64_t sum = 0; unsigned int shortest = UINT_MAX; unsigned int longest = 0; for(unsigned int i = 0; i < seqcount; i++) { unsigned int x = sequence_lengths[i]; seqindex[i].seq_p = udb_headerchars + sum; seqindex[i].seqlen = x; seqindex[i].qual_p = 0; if (x < shortest) { shortest = x; } if (x > longest) { longest = x; } sum += x; if (sum > nucleotides) { fatal("Invalid UDB file"); } } xfree(sequence_lengths); if (sum != nucleotides) { fatal("Invalid UDB file"); } /* sequences */ pos += largeread(fd_udb, datap + udb_headerchars, nucleotides, pos); if (pos != filesize) { fatal("Incorrect UDB file size"); } /* close UDB file */ close(fd_udb); progress_done(); xfree(prompt); /* move sequences and insert zero at end of each sequence */ progress_init("Reorganizing data in memory", seqcount); for(unsigned int i = seqcount-1; i > 0; i--) { size_t old_p = seqindex[i].seq_p; size_t new_p = seqindex[i].seq_p + i; size_t len = seqindex[i].seqlen; memmove(datap + new_p, datap + old_p, len); *(datap + new_p + len) = 0; seqindex[i].seq_p = new_p; progress_update(seqcount - i); } *(datap + 
seqindex[0].seq_p + seqindex[0].seqlen) = 0; progress_done(); /* Create bitmaps for the most frequent words */ if (create_bitmaps) { progress_init("Creating bitmaps", kmerhashsize); unsigned int bitmap_mincount = seqcount / 8; for(unsigned int i = 0; i < kmerhashsize; i++) { if (kmercount[i] >= bitmap_mincount) { kmerbitmap[i] = bitmap_init(seqcount+127); // pad for xmm bitmap_reset_all(kmerbitmap[i]); for(unsigned j = 0; j < kmercount[i]; j++) { bitmap_set(kmerbitmap[i], kmerindex[kmerhash[i]+j]); } } progress_update(i+1); } progress_done(); } /* get abundances and longest header */ if (parse_abundances) { progress_init("Parsing abundances", seqcount); for(unsigned int i = 0; i < seqcount; i++) { int64_t size = header_get_size(datap + seqindex[i].header_p, seqindex[i].headerlen); if (size > 0) { seqindex[i].size = size; } else { seqindex[i].size = 1; } progress_update(i+1); } progress_done(); } /* set database info */ dbindex_uh = unique_init(); db_setinfo(false, seqcount, nucleotides, longest, shortest, longestheader); /* make mapping from indexno to seqno */ dbindex_map = (unsigned int *) xmalloc(seqcount * sizeof(unsigned int)); dbindex_count = seqcount; for (unsigned int i = 0; i < seqcount; i++) { dbindex_map[i] = i; } /* done */ /* some stats */ if (!opt_quiet) { if (seqcount > 0) { fprintf(stderr, "%'" PRIu64 " nt in %'" PRIu64 " seqs, min %'" PRIu64 ", max %'" PRIu64 ", avg %'.0f\n", db_getnucleotidecount(), db_getsequencecount(), db_getshortestsequence(), db_getlongestsequence(), db_getnucleotidecount() * 1.0 / db_getsequencecount()); } else { fprintf(stderr, "%'" PRIu64 " nt in %'" PRIu64 " seqs\n", db_getnucleotidecount(), db_getsequencecount()); } } if (opt_log) { if (seqcount > 0) { fprintf(fp_log, "%'" PRIu64 " nt in %'" PRIu64 " seqs, min %'" PRIu64 ", max %'" PRIu64 ", avg %'.0f\n\n", db_getnucleotidecount(), db_getsequencecount(), db_getshortestsequence(), db_getlongestsequence(), db_getnucleotidecount() * 1.0 / db_getsequencecount()); } else { 
fprintf(fp_log, "%'" PRIu64 " nt in %'" PRIu64 " seqs\n\n", db_getnucleotidecount(), db_getsequencecount()); } } } void udb_fasta() { if (!opt_output) fatal("FASTA output file must be specified with --output"); /* open FASTA file for writing */ FILE * fp_output = fopen_output(opt_output); if (!fp_output) { fatal("Unable to open FASTA output file for writing"); } /* read UDB file */ udb_read(opt_udb2fasta, false, false); /* dump fasta */ unsigned int seqcount = db_getsequencecount(); progress_init("Writing FASTA file", seqcount); for(unsigned int i = 0; i < seqcount; i++) { fasta_print_db_relabel(fp_output, i, i+1); progress_update(i+1); } progress_done(); fclose(fp_output); dbindex_free(); db_free(); } void udb_stats() { /* show word statistics for an UDB file */ /* read UDB file */ udb_read(opt_udbstats, false, false); /* analyze word counts */ auto * freqtable = (wordfreq_t *) xmalloc (sizeof(wordfreq_t) * kmerhashsize); for(unsigned int i = 0; i < kmerhashsize; i++) { freqtable[i].kmer = i; freqtable[i].count = kmercount[i]; } qsort(freqtable, kmerhashsize, sizeof(wordfreq_t), wc_compare); unsigned int wcmax = freqtable[kmerhashsize-1].count; unsigned int wcmedian = ( freqtable[(kmerhashsize / 2) - 1].count + freqtable[kmerhashsize / 2].count ) / 2; unsigned int seqcount = db_getsequencecount(); uint64_t nt = db_getnucleotidecount(); /* show stats */ if (opt_log) { fprintf(fp_log, " Alphabet nt\n"); fprintf(fp_log, " Word width %" PRIu64 "\n", opt_wordlength); fprintf(fp_log, " Word ones %" PRIu64 "\n", opt_wordlength); fprintf(fp_log, " Spaced No\n"); fprintf(fp_log, " Hashed No\n"); fprintf(fp_log, " Coded No\n"); fprintf(fp_log, " Stepped No\n"); fprintf(fp_log, " Slots %u (%.1fk)\n", kmerhashsize, 1.0 * kmerhashsize / 1000.0); fprintf(fp_log, " DBAccel %u%%\n", udb_dbaccel); fprintf(fp_log, "\n"); fprintf(fp_log, "%10" PRIu64 " DB size (%.1fk)\n", nt, 1.0 * nt / 1000.0); fprintf(fp_log, "%10" PRIu64 " Words\n", kmerindexsize); fprintf(fp_log, "%10u Median 
size\n", wcmedian); fprintf(fp_log, "%10.1f Mean size\n", 1.0 * kmerindexsize / kmerhashsize); fprintf(fp_log, "\n"); fprintf(fp_log, " iWord sWord Cap Size Row\n"); fprintf(fp_log, "---------- ------------ ---------- ---------- ---\n"); for(unsigned int i = 0; i < kmerhashsize; i++) { fprintf(fp_log, "%10u ", freqtable[kmerhashsize-1-i].kmer); fprintf(fp_log, "%.*s", MAX(12 - (int)(opt_wordlength), 0), " "); fprint_kmer(fp_log, opt_wordlength, freqtable[kmerhashsize-1-i].kmer); fprintf(fp_log, " %10u %10u", 0, freqtable[kmerhashsize-1-i].count); fprintf(fp_log, " "); for(unsigned j = 0; j < freqtable[kmerhashsize-1-i].count; j++) { fprintf(fp_log, " %u", kmerindex[kmerhash[freqtable[kmerhashsize-1-i].kmer]+j]); if (j == 7) { break; } } if (freqtable[kmerhashsize-1-i].count > 8) { fprintf(fp_log, "..."); } fprintf(fp_log, "\n"); if (i == 10) { break; } } fprintf(fp_log, "\n\n"); fprintf(fp_log, "Word width %" PRIu64 "\n", opt_wordlength); fprintf(fp_log, "Slots %u\n", kmerhashsize); fprintf(fp_log, "Words %" PRIu64 "\n", kmerindexsize); fprintf(fp_log, "Max size %u (", wcmax); fprint_kmer(fp_log, opt_wordlength, freqtable[kmerhashsize-1].kmer); fprintf(fp_log, ")\n\n"); fprintf(fp_log, " Size lo Size hi Total size Nr. 
Words Pct TotPct\n"); fprintf(fp_log, "---------- ---------- ---------- ---------- ------ ------\n"); unsigned int size_lo = 0; unsigned int size_hi = 0; unsigned int x = 0; double totpct = 0.0; while (size_lo < seqcount) { int count = 0; int size = 0; while((x < kmerhashsize) && (freqtable[x].count <= size_hi)) { count++; size += freqtable[x].count; x++; } double pct = 100.0 * count / kmerhashsize; totpct += pct; if (size_lo < size_hi) { fprintf(fp_log, "%10u", size_lo); } else { fprintf(fp_log, " "); } fprintf(fp_log, " %10u", size_hi); if (size >= 10000) { fprintf(fp_log, " %9.1fk", size * 0.001); } else { fprintf(fp_log, " %10.1f", size * 1.0); } if (count >= 10000) { fprintf(fp_log, " %9.1fk", count * 0.001); } else { fprintf(fp_log, " %10.1f", count * 1.0); } fprintf(fp_log, " %5.1f%% %5.1f%%", pct, totpct); int dots = int (pct / 3.0 + 0.5); if (dots > 0) { fprintf(fp_log, " "); } for (int i = 0; i < dots ; i++) { fprintf(fp_log, "*"); } fprintf(fp_log, "\n"); size_lo = size_hi + 1; if (size_hi > 0) { size_hi *= 2; } else { size_hi = 1; } if (size_hi > seqcount) { size_hi = seqcount; } } fprintf(fp_log, "---------- ---------- ---------- ----------\n"); fprintf(fp_log, " "); if (kmerindexsize >= 10000) { fprintf(fp_log, " %9.1fk", kmerindexsize * 0.001); } else { fprintf(fp_log, " %10.1f", kmerindexsize * 1.0); } if (kmerhashsize >= 10000) { fprintf(fp_log, " %9.1fk", kmerhashsize * 0.001); } else { fprintf(fp_log, " %10.1f", kmerhashsize * 1.0); } fprintf(fp_log, "\n\n"); fprintf(fp_log, "%10" PRIu64 " Upper\n", nt); fprintf(fp_log, "%10u Lower (%.1f%%)\n", 0, 0.0); fprintf(fp_log, "%10" PRIu64 " Total\n", nt); fprintf(fp_log, "%10" PRIu64 " Indexed words\n", kmerindexsize); } xfree(freqtable); dbindex_free(); db_free(); } void udb_make() { if (!opt_output) fatal("UDB output file must be specified with --output"); int fd_output = 0; fd_output = xopen_write(opt_output); if (!fd_output) { fatal("Unable to open output file for writing"); } 
db_read(opt_makeudb_usearch, 1); if (opt_dbmask == MASK_DUST) { dust_all(); } else if ((opt_dbmask == MASK_SOFT) && (opt_hardmask)) { hardmask_all(); } dbindex_prepare(1, opt_dbmask); dbindex_addallsequences(opt_dbmask); unsigned int seqcount = db_getsequencecount(); uint64_t ntcount = db_getnucleotidecount(); uint64_t header_characters = 0; for (unsigned int i=0; i 0) { pos += largewrite(fd_output, kmerindex + kmerhash[i], 4 * kmercount[i], pos); } } } /* New header */ buffer[0] = 0x55444234; /* 4BDU UDB4 */ /* 0x005e0db3 */ buffer[1] = 0x005e0db3; /* number of sequences, uint32 */ buffer[2] = (unsigned int) seqcount; /* total number of nucleotides, uint64 */ buffer[3] = (unsigned int)(ntcount & 0xffffffff); buffer[4] = (unsigned int)(ntcount >> 32); /* total number of header characters, incl zero-terminator, uint64 */ buffer[5] = (unsigned int)(header_characters & 0xffffffff); buffer[6] = (unsigned int)(header_characters >> 32); /* 0x005e0db4 */ buffer[7] = 0x005e0db4; pos += largewrite(fd_output, buffer, 4 * 8, pos); /* indices to headers (uint32) */ unsigned int sum = 0; for (unsigned int i = 0; i < seqcount; i++) { buffer[i] = sum; sum += db_getheaderlen(i) + 1; } pos += largewrite(fd_output, buffer, 4 * seqcount, pos); /* headers (ascii, zero terminated, not padded) */ for (unsigned int i = 0; i < seqcount; i++) { unsigned int len = db_getheaderlen(i); pos += largewrite(fd_output, db_getheader(i), len + 1, pos); } /* sequence lengths (uint32) */ for (unsigned int i = 0; i < seqcount; i++) { buffer[i] = db_getsequencelen(i); } pos += largewrite(fd_output, buffer, 4 * seqcount, pos); /* sequences (ascii, no term, no pad) */ for (unsigned int i = 0; i < seqcount; i++) { unsigned int len = db_getsequencelen(i); pos += largewrite(fd_output, db_getsequence(i), len, pos); } if (close(fd_output) != 0) { fatal("Unable to close UDB file"); } progress_done(); dbindex_free(); db_free(); xfree(buffer); } 
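The 8-word block written by `udb_make()` above splits each 64-bit count into a low and a high 32-bit word and brackets the counts with marker values. The following is a minimal sketch of that packing and its inverse, not the actual vsearch reader. The struct `UdbSeqHeader` and the functions `pack_seq_header` and `unpack_seq_header` are hypothetical names introduced for illustration, and the marker check in the unpacking step is an assumption about how a reader might validate the block.

```cpp
#include <array>
#include <cstdint>

// Hypothetical container for the three counts stored in the 8-word block.
struct UdbSeqHeader {
  uint32_t seqcount;           // number of sequences
  uint64_t ntcount;            // total number of nucleotides
  uint64_t header_characters;  // header bytes, including zero terminators
};

// Marker values copied from the writer code above.
constexpr uint32_t kUdb4Magic   = 0x55444234;
constexpr uint32_t kStartMarker = 0x005e0db3;
constexpr uint32_t kEndMarker   = 0x005e0db4;

// Pack the counts in the same word order that udb_make() uses for buffer[0..7].
std::array<uint32_t, 8> pack_seq_header(const UdbSeqHeader & h)
{
  std::array<uint32_t, 8> w{};
  w[0] = kUdb4Magic;
  w[1] = kStartMarker;
  w[2] = h.seqcount;
  w[3] = static_cast<uint32_t>(h.ntcount & 0xffffffffULL);            // low word
  w[4] = static_cast<uint32_t>(h.ntcount >> 32);                      // high word
  w[5] = static_cast<uint32_t>(h.header_characters & 0xffffffffULL);  // low word
  w[6] = static_cast<uint32_t>(h.header_characters >> 32);            // high word
  w[7] = kEndMarker;
  return w;
}

// Inverse of pack_seq_header(). Returns false if any marker word is wrong
// (assumed validation, not taken from the vsearch source).
bool unpack_seq_header(const std::array<uint32_t, 8> & w, UdbSeqHeader & out)
{
  if ((w[0] != kUdb4Magic) || (w[1] != kStartMarker) || (w[7] != kEndMarker)) {
    return false;
  }
  out.seqcount = w[2];
  out.ntcount = (static_cast<uint64_t>(w[4]) << 32) | w[3];
  out.header_characters = (static_cast<uint64_t>(w[6]) << 32) | w[5];
  return true;
}
```

Storing the low word first matches the `buffer[3]`/`buffer[4]` and `buffer[5]`/`buffer[6]` assignments in `udb_make()`. The sketch operates on an in-memory array only and does not address file I/O or byte order.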
vsearch-2.21.1/src/orient.cc0000644000175000017500000003365114171574117015174 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" unsigned int rc_kmer(unsigned int kmer) { /* reverse complement a kmer where k = opt_wordlength */ unsigned int fwd = kmer; unsigned int rev = 0; for (int i = 0; i < opt_wordlength; i++) { unsigned int x = (fwd & 3U) ^ 3U; fwd = fwd >> 2U; rev = rev << 2U; rev |= x; } return rev; } void orient() { fastx_handle query_h; FILE * fp_fastaout = nullptr; FILE * fp_fastqout = nullptr; FILE * fp_tabbedout = nullptr; FILE * fp_notmatched = nullptr; int queries = 0; int qmatches = 0; int matches_fwd = 0; int matches_rev = 0; int notmatched = 0; /* check arguments */ if (! opt_db) { fatal("Database not specified with --db"); } if (! (opt_fastaout || opt_fastqout || opt_notmatched || opt_tabbedout)) { fatal("Output file not specified with --fastaout, --fastqout, --notmatched or --tabbedout"); } /* prepare reading of queries */ query_h = fastx_open(opt_orient); /* open output files */ if (opt_fastaout) { fp_fastaout = fopen_output(opt_fastaout); if (! fp_fastaout) { fatal("Unable to open fasta output file for writing"); } } if (opt_fastqout) { if (! fastx_is_fastq(query_h)) { fatal("Cannot write FASTQ output with FASTA input"); } fp_fastqout = fopen_output(opt_fastqout); if (! fp_fastqout) { fatal("Unable to open fastq output file for writing"); } } if (opt_notmatched) { fp_notmatched = fopen_output(opt_notmatched); if (! 
fp_notmatched) { fatal("Unable to open notmatched output file for writing"); } } if (opt_tabbedout) { fp_tabbedout = fopen_output(opt_tabbedout); if (! fp_tabbedout) { fatal("Unable to open tabbedout output file for writing"); } } /* check if it may be a UDB file */ bool is_udb = udb_detect_isudb(opt_db); if (is_udb) { udb_read(opt_db, true, true); } else { db_read(opt_db, 0); } if (!is_udb) { if (opt_dbmask == MASK_DUST) { dust_all(); } else if ((opt_dbmask == MASK_SOFT) && (opt_hardmask)) { hardmask_all(); } } if (!is_udb) { dbindex_prepare(1, opt_dbmask); dbindex_addallsequences(opt_dbmask); } uhandle_s * uh_fwd = unique_init(); size_t alloc = 0; char * qseq_rev = nullptr; char * query_qual_rev = nullptr; progress_init("Orienting sequences", fasta_get_size(query_h)); while (fastx_next(query_h, ! opt_notrunclabels, chrmap_no_change)) { char * query_head = fastx_get_header(query_h); int query_head_len = fastx_get_header_length(query_h); char * qseq_fwd = fastx_get_sequence(query_h); int qseqlen = fastx_get_sequence_length(query_h); int qsize = fastx_get_abundance(query_h); char * query_qual_fwd = fastx_get_quality(query_h); /* find kmers in query sequence */ unsigned int kmer_count_fwd; unsigned int * kmer_list_fwd; unique_count(uh_fwd, opt_wordlength, qseqlen, qseq_fwd, & kmer_count_fwd, & kmer_list_fwd, opt_qmask); /* count kmers matching on each strand */ unsigned int count_fwd = 0; unsigned int count_rev = 0; const unsigned int hits_factor = 8; for(unsigned int i = 0; i < kmer_count_fwd; i++) { unsigned int kmer_fwd = kmer_list_fwd[i]; unsigned int kmer_rev = rc_kmer(kmer_fwd); unsigned int hits_fwd = dbindex_getmatchcount(kmer_fwd); unsigned int hits_rev = dbindex_getmatchcount(kmer_rev); /* require more than 8 times as many matches on one strand as on the other */ if (hits_fwd > hits_factor * hits_rev) { count_fwd++; } else if (hits_rev > hits_factor * hits_fwd) { count_rev++; } } /* get progress as amount of input file read */ uint64_t progress = 
fasta_get_position(query_h); /* update stats */ queries++; int strand = 2; unsigned int min_count = 1; unsigned int min_factor = 4; if ((count_fwd >= min_count) && (count_fwd >= min_factor * count_rev)) { /* fwd */ strand = 0; matches_fwd++; qmatches++; if (opt_fastaout) { fasta_print_general(fp_fastaout, nullptr, qseq_fwd, qseqlen, query_head, query_head_len, qsize, qmatches, -1.0, -1, -1, nullptr, 0.0); } if (opt_fastqout) { fastq_print_general(fp_fastqout, qseq_fwd, qseqlen, query_head, query_head_len, query_qual_fwd, qsize, qmatches, -1.0); } } else if ((count_rev >= min_count) && (count_rev >= min_factor * count_fwd)) { /* rev */ strand = 1; matches_rev++; qmatches++; /* alloc more mem if necessary to keep reverse sequence and qual */ if ((size_t)(qseqlen + 1) > alloc) { alloc = qseqlen + 1; qseq_rev = (char*) xrealloc(qseq_rev, alloc); if (fastx_is_fastq(query_h)) { query_qual_rev = (char*) xrealloc(query_qual_rev, alloc); } } /* get reverse complementary sequence */ reverse_complement(qseq_rev, qseq_fwd, qseqlen); if (opt_fastaout) { fasta_print_general(fp_fastaout, nullptr, qseq_rev, qseqlen, query_head, query_head_len, qsize, qmatches, -1.0, -1, -1, nullptr, 0.0); } if (opt_fastqout) { /* reverse quality scores */ if (fastx_is_fastq(query_h)) { for(int i = 0; i < qseqlen; i++) { query_qual_rev[i] = query_qual_fwd[qseqlen-1-i]; } query_qual_rev[qseqlen] = 0; } fastq_print_general(fp_fastqout, qseq_rev, qseqlen, query_head, query_head_len, query_qual_rev, qsize, qmatches, -1.0); } } else { /* undecided */ strand = 2; notmatched++; if (opt_notmatched) { if (fastx_is_fastq(query_h)) { fastq_print_general(fp_notmatched, qseq_fwd, qseqlen, query_head, query_head_len, query_qual_fwd, qsize, notmatched, -1.0); } else { fasta_print_general(fp_notmatched, nullptr, qseq_fwd, qseqlen, query_head, query_head_len, qsize, notmatched, -1.0, -1, -1, nullptr, 0.0); } } } if (opt_tabbedout) { fprintf(fp_tabbedout, "%s\t%c\t%d\t%d\n", query_head, strand == 0 ? 
'+' : (strand == 1 ? '-' : '?'), count_fwd, count_rev); } /* show progress */ progress_update(progress); } progress_done(); /* clean up */ if (qseq_rev) { xfree(qseq_rev); } if (query_qual_rev) { xfree(query_qual_rev); } unique_exit(uh_fwd); dbindex_free(); db_free(); if (opt_tabbedout) { fclose(fp_tabbedout); } if (opt_notmatched) { fclose(fp_notmatched); } if (opt_fastqout) { fclose(fp_fastqout); } if (opt_fastaout) { fclose(fp_fastaout); } fasta_close(query_h); if (!opt_quiet) { fprintf(stderr, "Forward oriented sequences: %d", matches_fwd); if (queries > 0) { fprintf(stderr, " (%.2f%%)", 100.0 * matches_fwd / queries); } fprintf(stderr, "\n"); fprintf(stderr, "Reverse oriented sequences: %d", matches_rev); if (queries > 0) { fprintf(stderr, " (%.2f%%)", 100.0 * matches_rev / queries); } fprintf(stderr, "\n"); fprintf(stderr, "All oriented sequences: %d", qmatches); if (queries > 0) { fprintf(stderr, " (%.2f%%)", 100.0 * qmatches / queries); } fprintf(stderr, "\n"); fprintf(stderr, "Not oriented sequences: %d", notmatched); if (queries > 0) { fprintf(stderr, " (%.2f%%)", 100.0 * notmatched / queries); } fprintf(stderr, "\n"); fprintf(stderr, "Total number of sequences: %d\n", queries); } if (opt_log) { fprintf(fp_log, "Forward oriented sequences: %d", matches_fwd); if (queries > 0) { fprintf(fp_log, " (%.2f%%)", 100.0 * matches_fwd / queries); } fprintf(fp_log, "\n"); fprintf(fp_log, "Reverse oriented sequences: %d", matches_rev); if (queries > 0) { fprintf(fp_log, " (%.2f%%)", 100.0 * matches_rev / queries); } fprintf(fp_log, "\n"); fprintf(fp_log, "All oriented sequences: %d", qmatches); if (queries > 0) { fprintf(fp_log, " (%.2f%%)", 100.0 * qmatches / queries); } fprintf(fp_log, "\n"); fprintf(fp_log, "Not oriented sequences: %d", notmatched); if (queries > 0) { fprintf(fp_log, " (%.2f%%)", 100.0 * notmatched / queries); } fprintf(fp_log, "\n"); fprintf(fp_log, "Total number of sequences: %d\n", queries); } } 
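The orientation logic in `orient()` above rests on two small steps: reverse-complementing a 2-bit-packed k-mer, and comparing per-strand k-mer counts against a minimum count and a ratio. The following is a minimal standalone sketch of both, under stated assumptions: the word length is passed explicitly as `k` instead of being read from `opt_wordlength`, the encoding A=0, C=1, G=2, T=3 is assumed (it is consistent with complementing by XOR with 3, but is not shown in this file), and `encode_kmer`, `rc_kmer_k` and `decide_strand` are hypothetical helper names.

```cpp
#include <string>

// Pack a nucleotide word into 2 bits per base, first base in the highest bits.
// Assumed encoding: A=0, C=1, G=2, T=3.
unsigned int encode_kmer(const std::string & word)
{
  unsigned int kmer = 0;
  for (char c : word) {
    unsigned int code = 0;
    switch (c) {
      case 'C': code = 1; break;
      case 'G': code = 2; break;
      case 'T': code = 3; break;
      default:  code = 0; break;  // 'A' (and anything else, for brevity)
    }
    kmer = (kmer << 2U) | code;
  }
  return kmer;
}

// Same loop as rc_kmer() above, with the word length as a parameter:
// take bases from the low end, complement them, and append them to rev.
unsigned int rc_kmer_k(unsigned int kmer, int k)
{
  unsigned int fwd = kmer;
  unsigned int rev = 0;
  for (int i = 0; i < k; i++) {
    unsigned int x = (fwd & 3U) ^ 3U;  // complement one base
    fwd = fwd >> 2U;
    rev = (rev << 2U) | x;
  }
  return rev;
}

// Strand decision with the thresholds used in orient():
// at least min_count supporting k-mers and at least min_factor times
// as many as on the opposite strand. Returns '+', '-' or '?'.
char decide_strand(unsigned int count_fwd, unsigned int count_rev)
{
  const unsigned int min_count = 1;
  const unsigned int min_factor = 4;
  if ((count_fwd >= min_count) && (count_fwd >= min_factor * count_rev)) {
    return '+';
  }
  if ((count_rev >= min_count) && (count_rev >= min_factor * count_fwd)) {
    return '-';
  }
  return '?';
}
```

For example, under the assumed encoding the word AACG reverse-complements to CGTT, and counts of 3 forward versus 1 reverse stay undecided because 3 is less than 4 times 1.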
vsearch-2.21.1/src/chimera.h0000644000175000017500000000467514171574117015152 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void chimera(); vsearch-2.21.1/src/sintax.h0000644000175000017500000000467514171574117015050 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void sintax(); vsearch-2.21.1/src/getseq.cc0000644000175000017500000003753614171574117015172 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /* Implement fastx_getseq, fastx_getseqs and fastx_getsubseq as described here: https://drive5.com/usearch/manual/cmd_fastx_getseqs.html */ #include "vsearch.h" static int labels_alloc = 0; static int labels_count = 0; static int labels_longest = 0; static char * * labels_data = nullptr; void read_labels_file(char * filename) { FILE * fp_labels = fopen_input(filename); if (! 
fp_labels) { fatal("Unable to open labels file (%s)", filename); } xstat_t fs; if (xfstat(fileno(fp_labels), & fs)) { fatal("Unable to get status for labels file (%s)", filename); } bool is_pipe = S_ISFIFO(fs.st_mode); uint64_t file_size = 0; if (! is_pipe) { file_size = fs.st_size; } progress_init("Reading labels", file_size); while(true) { const int buffer_size = 1024; char buffer[buffer_size]; char * ret = fgets(buffer, buffer_size, fp_labels); if (ret) { int len = strlen(buffer); if ((len > 0) && (buffer[len - 1] == '\n')) { buffer[len - 1] = 0; len--; } if (len > labels_longest) { labels_longest = len; } if (labels_count + 1 > labels_alloc) { labels_alloc += 1024; labels_data = (char * *) realloc(labels_data, labels_alloc * sizeof (char*)); if (! labels_data) { fatal("Unable to allocate memory for labels"); } } labels_data[labels_count++] = strdup(buffer); } else { break; } } fclose(fp_labels); progress_done(); if (labels_longest >= 1023) { if (!opt_quiet) { fprintf(stderr, "WARNING: Labels longer than 1023 characters are not supported\n"); } if (opt_log) { fprintf(fp_log, "WARNING: Labels longer than 1023 characters are not supported\n"); } } } void free_labels() { for(int i=0; i < labels_count; i++) { free(labels_data[i]); } free(labels_data); labels_data = nullptr; } bool test_label_match(fastx_handle h) { char * header = fastx_get_header(h); int hlen = fastx_get_header_length(h); char * field_buffer = nullptr; int field_len = 0; if (opt_label_field) { field_len = strlen(opt_label_field); int field_buffer_size = field_len + 2; if (opt_label_word) { field_buffer_size += strlen(opt_label_word); } else { field_buffer_size += labels_longest; } field_buffer = (char *) xmalloc(field_buffer_size); sprintf(field_buffer, "%s=", opt_label_field); } if (opt_label) { char * needle = opt_label; int wlen = strlen(needle); if (opt_label_substr_match) { return xstrcasestr(header, needle); } else { return (hlen == wlen) && ! 
strcasecmp(header, needle); } } else if (opt_labels) { if (opt_label_substr_match) { for (int i = 0; i < labels_count; i++) { if (xstrcasestr(header, labels_data[i])) { return true; } } } else { for (int i = 0; i < labels_count; i++) { char * needle = labels_data[i]; int wlen = strlen(needle); if ((hlen == wlen) && ! strcasecmp(header, needle)) { return true; } } } } else if (opt_label_word) { char * needle = opt_label_word; if (opt_label_field) { strcpy(field_buffer + field_len + 1, needle); needle = field_buffer; } int wlen = strlen(needle); char * hit = header; while (true) { hit = strstr(hit, needle); if (hit) { if (opt_label_field) { /* check of field */ if (((hit == header) || (*(hit - 1) == ';')) && ((hit + wlen == header + hlen) || (*(hit + wlen) == ';'))) { return true; } } else { /* check of full word */ if (((hit == header) || (!isalnum(*(hit - 1)))) && ((hit + wlen == header + hlen) || (!isalnum(*(hit + wlen))))) { return true; } } hit++; } else { break; } } } else if (opt_label_words) { for (int i = 0; i < labels_count; i++) { char * needle = labels_data[i]; if (opt_label_field) { strcpy(field_buffer + field_len + 1, needle); needle = field_buffer; } int wlen = strlen(needle); char * hit = header; while (true) { hit = strstr(hit, needle); if (hit) { if (opt_label_field) { /* check of field */ if (((hit == header) || (*(hit - 1) == ';')) && ((hit + wlen == header + hlen) || (*(hit + wlen) == ';'))) { return true; } } else { /* check of full word */ if (((hit == header) || (!isalnum(*(hit - 1)))) && ((hit + wlen == header + hlen) || (!isalnum(*(hit + wlen))))) { return true; } } hit++; } else { break; } } } } return false; } void getseq(char * filename) { if ((!opt_fastqout) && (!opt_fastaout) && (!opt_notmatched) && (!opt_notmatchedfq)) { fatal("No output files specified"); } if (opt_fastx_getseq) { if (! opt_label) { fatal("Missing label option"); } } else if (opt_fastx_getsubseq) { if (! 
opt_label) { fatal("Missing label option"); } if ((opt_subseq_start < 1) || (opt_subseq_end < 1)) { fatal("The argument to options subseq_start and subseq_end must be at least 1"); } if (opt_subseq_start > opt_subseq_end) { fatal("The argument to option subseq_start must be less than or equal to subseq_end"); } } else if (opt_fastx_getseqs) { int label_options = 0; if (opt_label) { label_options++; } if (opt_labels) { label_options++; } if (opt_label_word) { label_options++; } if (opt_label_words) { label_options++; } if (label_options != 1) { fatal("Specify one label option (label, labels, label_word or label_words)"); } if (opt_labels) { read_labels_file(opt_labels); } if (opt_label_words) { read_labels_file(opt_label_words); } } fastx_handle h1 = nullptr; h1 = fastx_open(filename); if (!h1) { fatal("Unrecognized file type (not proper FASTA or FASTQ format)"); } if ((opt_fastqout || opt_notmatchedfq) && ! (h1->is_fastq || h1->is_empty)) { fatal("Cannot write FASTQ output from FASTA input"); } uint64_t filesize = fastx_get_size(h1); FILE * fp_fastaout = nullptr; FILE * fp_fastqout = nullptr; FILE * fp_notmatched = nullptr; FILE * fp_notmatchedfq = nullptr; if (opt_fastaout) { fp_fastaout = fopen_output(opt_fastaout); if (!fp_fastaout) { fatal("Unable to open FASTA output file for writing"); } } if (opt_fastqout) { fp_fastqout = fopen_output(opt_fastqout); if (!fp_fastqout) { fatal("Unable to open FASTQ output file for writing"); } } if (opt_notmatched) { fp_notmatched = fopen_output(opt_notmatched); if (!fp_notmatched) { fatal("Unable to open FASTA output file (notmatched) for writing"); } } if (opt_notmatchedfq) { fp_notmatchedfq = fopen_output(opt_notmatchedfq); if (!fp_notmatchedfq) { fatal("Unable to open FASTQ output file (notmatchedfq) for writing"); } } progress_init("Extracting sequences", filesize); int64_t kept = 0; int64_t discarded = 0; while(fastx_next(h1, ! 
opt_notrunclabels, chrmap_no_change)) { bool match = test_label_match(h1); int64_t start = 1; int64_t end = fastx_get_sequence_length(h1); if (opt_fastx_getsubseq) { if (opt_subseq_start > start) { start = opt_subseq_start; } if (opt_subseq_end < end) { end = opt_subseq_end; } } int64_t length = end - start + 1; if (match) { /* keep the sequence(s) */ kept++; if (opt_fastaout) { fasta_print_general(fp_fastaout, nullptr, fastx_get_sequence(h1) + start - 1, length, fastx_get_header(h1), fastx_get_header_length(h1), fastx_get_abundance(h1), kept, -1.0, -1, -1, nullptr, 0.0); } if (opt_fastqout) { fastq_print_general(fp_fastqout, fastx_get_sequence(h1) + start - 1, length, fastx_get_header(h1), fastx_get_header_length(h1), fastx_get_quality(h1) + start - 1, fastx_get_abundance(h1), kept, -1.0); } } else { /* discard the sequence */ discarded++; if (opt_notmatched) { fasta_print_general(fp_notmatched, nullptr, fastx_get_sequence(h1) + start - 1, length, fastx_get_header(h1), fastx_get_header_length(h1), fastx_get_abundance(h1), discarded, -1.0, -1, -1, nullptr, 0.0); } if (opt_notmatchedfq) { fastq_print_general(fp_notmatchedfq, fastx_get_sequence(h1) + start - 1, length, fastx_get_header(h1), fastx_get_header_length(h1), fastx_get_quality(h1) + start - 1, fastx_get_abundance(h1), discarded, -1.0); } } progress_update(fastx_get_position(h1)); } progress_done(); if (! 
opt_quiet) { fprintf(stderr, "%" PRId64 " of %" PRId64 " sequences extracted", kept, kept + discarded); if (kept + discarded > 0) { fprintf(stderr, " (%.1lf%%)", 100.0 * kept / (kept + discarded)); } fprintf(stderr, "\n"); } if (opt_log) { fprintf(fp_log, "%" PRId64 " of %" PRId64 " sequences extracted", kept, kept + discarded); if (kept + discarded > 0) { fprintf(fp_log, " (%.1lf%%)", 100.0 * kept / (kept + discarded)); } fprintf(fp_log, "\n"); } if (opt_fastaout) { fclose(fp_fastaout); } if (opt_fastqout) { fclose(fp_fastqout); } if (opt_notmatched) { fclose(fp_notmatched); } if (opt_notmatchedfq) { fclose(fp_notmatchedfq); } fastx_close(h1); if (opt_labels || opt_label_words) { free_labels(); } } void fastx_getseq() { getseq(opt_fastx_getseq); } void fastx_getseqs() { getseq(opt_fastx_getseqs); } void fastx_getsubseq() { getseq(opt_fastx_getsubseq); } vsearch-2.21.1/src/sintax.cc0000644000175000017500000003667314171574117015211 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. 
You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ /* Implements the Sintax algorithm as described in Robert Edgar's preprint: Robert Edgar (2016) SINTAX: a simple non-Bayesian taxonomy classifier for 16S and ITS sequences BioRxiv, 074161 doi: https://doi.org/10.1101/074161 Further details: https://www.drive5.com/usearch/manual/cmd_sintax.html */ #include "vsearch.h" static struct searchinfo_s * si_plus; static struct searchinfo_s * si_minus; static pthread_t * pthread; /* global constants/data, no need for synchronization */ static int tophits; /* the maximum number of hits to keep */ static int seqcount; /* number of database sequences */ static pthread_attr_t attr; static fastx_handle query_fastx_h; const int subset_size = 32; const int bootstrap_count = 100; /* global data protected by mutex */ static pthread_mutex_t mutex_input; static pthread_mutex_t mutex_output; static FILE * fp_tabbedout; static int queries = 0; static int classified = 0; void sintax_analyse(char * query_head, int strand, int best_seqno, int best_count, int * all_seqno, int count) { int best_level_start[tax_levels]; int best_level_len[tax_levels]; int level_match[tax_levels]; /* check number of successful bootstraps */ if (count >= (bootstrap_count+1) / 2) { char * best_h = db_getheader(best_seqno); tax_split(best_seqno, best_level_start, best_level_len); for (int & j : level_match) { j = 0; } for (int i = 0; i < count; i++) { /* For each bootstrap experiment */ int level_start[tax_levels]; int level_len[tax_levels]; tax_split(all_seqno[i], level_start, level_len); char * h = db_getheader(all_seqno[i]); for (int j = 0; j < tax_levels; j++) { /* For each taxonomic level */ if ((level_len[j] == best_level_len[j]) && (strncmp(best_h + best_level_start[j], h + level_start[j], level_len[j]) == 0)) { level_match[j]++; } } } } /* write to tabbedout file */ xpthread_mutex_lock(&mutex_output); fprintf(fp_tabbedout, "%s\t", query_head); queries++; if (count >= bootstrap_count / 2) { char * best_h = db_getheader(best_seqno); classified++; bool
comma = false; for (int j = 0; j < tax_levels; j++) { if (best_level_len[j] > 0) { fprintf(fp_tabbedout, "%s%c:%.*s(%.2f)", (comma ? "," : ""), tax_letters[j], best_level_len[j], best_h + best_level_start[j], 1.0 * level_match[j] / count); comma = true; } } fprintf(fp_tabbedout, "\t%c", strand ? '-' : '+'); if (opt_sintax_cutoff > 0.0) { fprintf(fp_tabbedout, "\t"); bool comma = false; for (int j = 0; j < tax_levels; j++) { if ((best_level_len[j] > 0) && (1.0 * level_match[j] / count >= opt_sintax_cutoff)) { fprintf(fp_tabbedout, "%s%c:%.*s", (comma ? "," : ""), tax_letters[j], best_level_len[j], best_h + best_level_start[j]); comma = true; } } } } else { if (opt_sintax_cutoff > 0.0) { fprintf(fp_tabbedout, "\t\t\t"); } else { fprintf(fp_tabbedout, "\t\t"); } } #if 0 fprintf(fp_tabbedout, "\t%d\t%d", best_count, count); #endif fprintf(fp_tabbedout, "\n"); xpthread_mutex_unlock(&mutex_output); } void sintax_query(int64_t t) { int all_seqno[2][bootstrap_count]; int best_seqno[2] = {0, 0}; int boot_count[2] = {0, 0}; unsigned int best_count[2] = {0, 0}; int qseqlen = si_plus[t].qseqlen; char * query_head = si_plus[t].query_head; bitmap_t * b = bitmap_init(qseqlen); for (int s = 0; s < opt_strand; s++) { struct searchinfo_s * si = s ? si_minus+t : si_plus+t; /* perform search */ unsigned int kmersamplecount; unsigned int * kmersample; /* find unique kmers */ unique_count(si->uh, opt_wordlength, si->qseqlen, si->qsequence, & kmersamplecount, & kmersample, MASK_NONE); /* perform 100 bootstraps */ if (kmersamplecount >= subset_size) { for (int i = 0; i < bootstrap_count ; i++) { /* subsample 32 kmers */ unsigned int kmersample_subset[subset_size]; int subsamples = 0; bitmap_reset_all(b); for(int j = 0; j < subset_size ; j++) { int64_t x = random_int(kmersamplecount); if (! 
bitmap_get(b, x)) { kmersample_subset[subsamples++] = kmersample[x]; bitmap_set(b, x); } } si->kmersamplecount = subsamples; si->kmersample = kmersample_subset; search_topscores(si); while(!minheap_isempty(si->m)) { elem_t e = minheap_poplast(si->m); all_seqno[s][boot_count[s]++] = e.seqno; if (e.count > best_count[s]) { best_count[s] = e.count; best_seqno[s] = e.seqno; } } } } } int best_strand; if (opt_strand == 1) { best_strand = 0; } else { if (best_count[0] > best_count[1]) { best_strand = 0; } else if (best_count[1] > best_count[0]) { best_strand = 1; } else { if (boot_count[0] >= boot_count[1]) { best_strand = 0; } else { best_strand = 1; } } } sintax_analyse(query_head, best_strand, best_seqno[best_strand], best_count[best_strand], all_seqno[best_strand], boot_count[best_strand]); bitmap_free(b); } void sintax_thread_run(int64_t t) { while (true) { xpthread_mutex_lock(&mutex_input); if (fastx_next(query_fastx_h, ! opt_notrunclabels, chrmap_no_change)) { char * qhead = fastx_get_header(query_fastx_h); int query_head_len = fastx_get_header_length(query_fastx_h); char * qseq = fastx_get_sequence(query_fastx_h); int qseqlen = fastx_get_sequence_length(query_fastx_h); int query_no = fastx_get_seqno(query_fastx_h); int qsize = fastx_get_abundance(query_fastx_h); for (int s = 0; s < opt_strand; s++) { struct searchinfo_s * si = s ? 
si_minus+t : si_plus+t; si->query_head_len = query_head_len; si->qseqlen = qseqlen; si->query_no = query_no; si->qsize = qsize; si->strand = s; /* allocate more memory for header and sequence, if necessary */ if (si->query_head_len + 1 > si->query_head_alloc) { si->query_head_alloc = si->query_head_len + 2001; si->query_head = (char*) xrealloc(si->query_head, (size_t)(si->query_head_alloc)); } if (si->qseqlen + 1 > si->seq_alloc) { si->seq_alloc = si->qseqlen + 2001; si->qsequence = (char*) xrealloc(si->qsequence, (size_t)(si->seq_alloc)); } } /* plus strand: copy header and sequence */ strcpy(si_plus[t].query_head, qhead); strcpy(si_plus[t].qsequence, qseq); /* get progress as amount of input file read */ uint64_t progress = fastx_get_position(query_fastx_h); /* let other threads read input */ xpthread_mutex_unlock(&mutex_input); /* minus strand: copy header and reverse complementary sequence */ if (opt_strand > 1) { strcpy(si_minus[t].query_head, si_plus[t].query_head); reverse_complement(si_minus[t].qsequence, si_plus[t].qsequence, si_plus[t].qseqlen); } sintax_query(t); /* lock mutex for update of global data and output */ xpthread_mutex_lock(&mutex_output); /* show progress */ progress_update(progress); xpthread_mutex_unlock(&mutex_output); } else { xpthread_mutex_unlock(&mutex_input); break; } } } void sintax_thread_init(struct searchinfo_s * si) { /* thread specific initialization */ si->uh = unique_init(); si->kmers = (count_t *) xmalloc(seqcount * sizeof(count_t) + 32); si->m = minheap_init(tophits); si->hits = nullptr; si->qsize = 1; si->query_head_alloc = 0; si->query_head = nullptr; si->seq_alloc = 0; si->qsequence = nullptr; si->nw = nullptr; si->s = nullptr; } void sintax_thread_exit(struct searchinfo_s * si) { /* thread specific clean up */ unique_exit(si->uh); minheap_exit(si->m); xfree(si->kmers); if (si->query_head) { xfree(si->query_head); } if (si->qsequence) { xfree(si->qsequence); } } void * sintax_thread_worker(void * vp) { auto t = (int64_t)
vp; sintax_thread_run(t); return nullptr; } void sintax_thread_worker_run() { /* initialize threads, start them, join them and return */ xpthread_attr_init(&attr); xpthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE); /* init and create worker threads, put them into stand-by mode */ for(int t=0; t 1) { si_minus = (struct searchinfo_s *) xmalloc(opt_threads * sizeof(struct searchinfo_s)); } else { si_minus = nullptr; } pthread = (pthread_t *) xmalloc(opt_threads * sizeof(pthread_t)); /* init mutexes for input and output */ xpthread_mutex_init(&mutex_input, nullptr); xpthread_mutex_init(&mutex_output, nullptr); /* run */ progress_init("Classifying sequences", fastx_get_size(query_fastx_h)); sintax_thread_worker_run(); progress_done(); if (! opt_quiet) { fprintf(stderr, "Classified %d of %d sequences", classified, queries); if (queries > 0) { fprintf(stderr, " (%.2f%%)", 100.0 * classified / queries); } fprintf(stderr, "\n"); } if (opt_log) { fprintf(fp_log, "Classified %d of %d sequences", classified, queries); if (queries > 0) { fprintf(fp_log, " (%.2f%%)", 100.0 * classified / queries); } fprintf(fp_log, "\n"); } /* clean up */ xpthread_mutex_destroy(&mutex_output); xpthread_mutex_destroy(&mutex_input); xfree(pthread); xfree(si_plus); if (si_minus) { xfree(si_minus); } fastx_close(query_fastx_h); fclose(fp_tabbedout); dbindex_free(); db_free(); } vsearch-2.21.1/src/chimera.cc0000644000175000017500000015036614171574117015307 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. 
GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" /* This code implements the method described in this paper: Robert C. Edgar, Brian J. Haas, Jose C. 
Clemente, Christopher Quince and Rob Knight (2011) UCHIME improves sensitivity and speed of chimera detection Bioinformatics, 27, 16, 2194-2200 http://dx.doi.org/10.1093/bioinformatics/btr381 */ /* global constants/data, no need for synchronization */ const int parts = 4; const int few = 4; const int maxcandidates = few * parts; const int rejects = 16; const double chimera_id = 0.55; static int tophits; static pthread_attr_t attr; static pthread_t * pthread; static fastx_handle query_fasta_h; /* mutexes and global data protected by mutex */ static pthread_mutex_t mutex_input; static pthread_mutex_t mutex_output; static unsigned int seqno = 0; static uint64_t progress = 0; static int chimera_count = 0; static int nonchimera_count = 0; static int borderline_count = 0; static int total_count = 0; static int64_t chimera_abundance = 0; static int64_t nonchimera_abundance = 0; static int64_t borderline_abundance = 0; static int64_t total_abundance = 0; static FILE * fp_chimeras = nullptr; static FILE * fp_nonchimeras = nullptr; static FILE * fp_uchimealns = nullptr; static FILE * fp_uchimeout = nullptr; static FILE * fp_borderline = nullptr; /* information for each query sequence to be checked */ struct chimera_info_s { int query_alloc; /* the longest query sequence allocated memory for */ int head_alloc; /* the longest header allocated memory for */ int query_no; char * query_head; int query_head_len; int query_size; char * query_seq; int query_len; struct searchinfo_s si[parts]; unsigned int cand_list[maxcandidates]; int cand_count; struct s16info_s * s; CELL snwscore[maxcandidates]; unsigned short snwalignmentlength[maxcandidates]; unsigned short snwmatches[maxcandidates]; unsigned short snwmismatches[maxcandidates]; unsigned short snwgaps[maxcandidates]; int64_t nwscore[maxcandidates]; int64_t nwalignmentlength[maxcandidates]; int64_t nwmatches[maxcandidates]; int64_t nwmismatches[maxcandidates]; int64_t nwgaps[maxcandidates]; char * nwcigar[maxcandidates]; int 
match_size; int * match; int * smooth; int * maxsmooth; int best_parents[2]; int best_target; char * best_cigar; int * maxi; char * paln[2]; char * qaln; char * diffs; char * votes; char * model; char * ignore; struct hit * all_hits; double best_h; }; static struct chimera_info_s * cia; void realloc_arrays(struct chimera_info_s * ci) { int maxhlen = MAX(ci->query_head_len,1); if (maxhlen > ci->head_alloc) { ci->head_alloc = maxhlen; ci->query_head = (char*) xrealloc(ci->query_head, maxhlen + 1); } /* realloc arrays based on query length */ int maxqlen = MAX(ci->query_len,1); if (maxqlen > ci->query_alloc) { ci->query_alloc = maxqlen; ci->query_seq = (char*) xrealloc(ci->query_seq, maxqlen + 1); for(auto & i : ci->si) { int maxpartlen = (maxqlen + parts - 1) / parts; i.qsequence = (char*) xrealloc(i.qsequence, maxpartlen + 1); } ci->maxi = (int *) xrealloc(ci->maxi, (maxqlen + 1) * sizeof(int)); ci->maxsmooth = (int*) xrealloc(ci->maxsmooth, maxqlen * sizeof(int)); ci->match = (int*) xrealloc(ci->match, maxcandidates * maxqlen * sizeof(int)); ci->smooth = (int*) xrealloc(ci->smooth, maxcandidates * maxqlen * sizeof(int)); int maxalnlen = maxqlen + 2 * db_getlongestsequence(); ci->paln[0] = (char*) xrealloc(ci->paln[0], maxalnlen+1); ci->paln[1] = (char*) xrealloc(ci->paln[1], maxalnlen+1); ci->qaln = (char*) xrealloc(ci->qaln, maxalnlen+1); ci->diffs = (char*) xrealloc(ci->diffs, maxalnlen+1); ci->votes = (char*) xrealloc(ci->votes, maxalnlen+1); ci->model = (char*) xrealloc(ci->model, maxalnlen+1); ci->ignore = (char*) xrealloc(ci->ignore, maxalnlen+1); } } int find_best_parents(struct chimera_info_s * ci) { ci->best_parents[0] = -1; ci->best_parents[1] = -1; /* find the positions with matches for each potential parent */ char * qseq = ci->query_seq; memset(ci->match, 0, ci->cand_count * ci->query_len * sizeof(int)); for(int i=0; i < ci->cand_count; i++) { char * tseq = db_getsequence(ci->cand_list[i]); int qpos = 0; int tpos = 0; char * p = ci->nwcigar[i]; char * 
e = p + strlen(p); while (p < e) { int run = 1; int scanlength = 0; sscanf(p, "%d%n", &run, &scanlength); p += scanlength; char op = *p++; switch (op) { case 'M': for(int k=0; k<run; k++) { if (chrmap_4bit[(int)(qseq[qpos])] & chrmap_4bit[(int)(tseq[tpos])]) { ci->match[i * ci->query_len + qpos] = 1; } qpos++; tpos++; } break; case 'I': tpos += run; break; case 'D': qpos += run; break; } } } /* Compute smoothed identity score in a window for each candidate, */ /* and record max smoothed score for each position among candidates. */ memset(ci->maxsmooth, 0, ci->query_len * sizeof(int)); const int window = 32; for(int i = 0; i < ci->cand_count; i++) { int sum = 0; for(int qpos = 0; qpos < ci->query_len; qpos++) { int z = i * ci->query_len + qpos; sum += ci->match[z]; if (qpos >= window) { sum -= ci->match[z-window]; } if (qpos >= window-1) { ci->smooth[z] = sum; if (ci->smooth[z] > ci->maxsmooth[qpos]) { ci->maxsmooth[qpos] = ci->smooth[z]; } } } } /* find first parent */ int wins[ci->cand_count]; memset(wins, 0, ci->cand_count * sizeof(int)); for(int qpos = window-1; qpos < ci->query_len; qpos++) { if (ci->maxsmooth[qpos] != 0) { for(int i=0; i < ci->cand_count; i++) { int z = i * ci->query_len + qpos; if (ci->smooth[z] == ci->maxsmooth[qpos]) { wins[i]++; } } } } int best1_w = -1; int best1_i = -1; int best2_w = -1; int best2_i = -1; for(int i=0; i < ci->cand_count; i++) { int w = wins[i]; if (w > best1_w) { best1_w = w; best1_i = i; } } if (best1_w >= 0) { /* find second parent */ /* wipe out matches in positions covered by first parent */ for(int qpos = window - 1; qpos < ci->query_len; qpos++) { int z = best1_i * ci->query_len + qpos; if (ci->smooth[z] == ci->maxsmooth[qpos]) { for(int i = qpos + 1 - window; i <= qpos; i++) { for(int j = 0; j < ci->cand_count; j++) { ci->match[j * ci->query_len + i] = 0; } } } } /* recompute smoothed identity over window, and record max smoothed score for each position among remaining candidates */ memset(ci->maxsmooth, 0, ci->query_len * sizeof(int)); for(int i = 0; i < ci->cand_count; i++) { if (i != best1_i) {
int sum = 0; for(int qpos = 0; qpos < ci->query_len; qpos++) { int z = i * ci->query_len + qpos; sum += ci->match[z]; if (qpos >= window) { sum -= ci->match[z-window]; } if (qpos >= window-1) { ci->smooth[z] = sum; if (ci->smooth[z] > ci->maxsmooth[qpos]) { ci->maxsmooth[qpos] = ci->smooth[z]; } } } } } /* find second parent */ memset(wins, 0, ci->cand_count * sizeof(int)); for(int qpos = window-1; qpos < ci->query_len; qpos++) { if (ci->maxsmooth[qpos] != 0) { for(int i=0; i < ci->cand_count; i++) { if (i != best1_i) { int z = i * ci->query_len + qpos; if (ci->smooth[z] == ci->maxsmooth[qpos]) { wins[i]++; } } } } } for(int i=0; i < ci->cand_count; i++) { int w = wins[i]; if (w > best2_w) { best2_w = w; best2_i = i; } } } ci->best_parents[0] = best1_i; ci->best_parents[1] = best2_i; return (best1_w >= 0) && (best2_w >= 0); } int eval_parents(struct chimera_info_s * ci) { int status = 1; /* create msa */ /* find max insertions in front of each position in the query sequence */ memset(ci->maxi, 0, (ci->query_len + 1) * sizeof(int)); for(int best_parent : ci->best_parents) { char * p = ci->nwcigar[best_parent]; char * e = p + strlen(p); int pos = 0; while (p < e) { int run = 1; int scanlength = 0; sscanf(p, "%d%n", &run, &scanlength); p += scanlength; char op = *p++; switch (op) { case 'M': case 'D': pos += run; break; case 'I': if (run > ci->maxi[pos]) { ci->maxi[pos] = run; } break; } } } /* find total alignment length */ int alnlen = 0; for(int i=0; i < ci->query_len+1; i++) { alnlen += ci->maxi[i]; } alnlen += ci->query_len; /* fill in alignment string for query */ char * q = ci->qaln; int qpos = 0; for (int i=0; i < ci->query_len; i++) { for (int j=0; j < ci->maxi[i]; j++) { *q++ = '-'; } *q++ = chrmap_upcase[(int)(ci->query_seq[qpos++])]; } for (int j=0; j < ci->maxi[ci->query_len]; j++) { *q++ = '-'; } *q = 0; /* fill in alignment strings for the 2 parents */ for(int j=0; j<2; j++) { int cand = ci->best_parents[j]; int target_seqno = ci->cand_list[cand]; char 
* target_seq = db_getsequence(target_seqno); int inserted = 0; qpos = 0; int tpos = 0; char * t = ci->paln[j]; char * p = ci->nwcigar[cand]; char * e = p + strlen(p); while (p < e) { int run = 1; int scanlength = 0; sscanf(p, "%d%n", &run, &scanlength); p += scanlength; char op = *p++; if (op == 'I') { for(int x=0; x < ci->maxi[qpos]; x++) { if (x < run) { *t++ = chrmap_upcase[(int)(target_seq[tpos++])]; } else { *t++ = '-'; } } inserted = 1; } else { for(int x=0; x < run; x++) { if (!inserted) { for(int y=0; y < ci->maxi[qpos]; y++) { *t++ = '-'; } } if (op == 'M') { *t++ = chrmap_upcase[(int)(target_seq[tpos++])]; } else { *t++ = '-'; } qpos++; inserted = 0; } } } /* add any gaps at the end */ if (!inserted) { for(int x=0; x < ci->maxi[qpos]; x++) { *t++ = '-'; } } /* end of sequence string */ *t = 0; } memset(ci->ignore, 0, alnlen); for(int i = 0; i < alnlen; i++) { unsigned int qsym = chrmap_4bit[(int)(ci->qaln[i])]; unsigned int p1sym = chrmap_4bit[(int)(ci->paln[0][i])]; unsigned int p2sym = chrmap_4bit[(int)(ci->paln[1][i])]; /* mark positions to ignore in voting */ /* ignore gap positions and those next to the gap */ if ((!qsym) || (!p1sym) || (!p2sym)) { ci->ignore[i] = 1; if (i>0) { ci->ignore[i-1] = 1; } if (i<alnlen-1) { ci->ignore[i+1] = 1; } } /* ignore ambiguous symbols */ if ((ambiguous_4bit[qsym]) || (ambiguous_4bit[p1sym]) || (ambiguous_4bit[p2sym])) { ci->ignore[i] = 1; } /* lower case parent symbols that differ from query */ if (p1sym && (p1sym != qsym)) { ci->paln[0][i] = tolower(ci->paln[0][i]); } if (p2sym && (p2sym != qsym)) { ci->paln[1][i] = tolower(ci->paln[1][i]); } /* compute diffs */ char diff; if (qsym && p1sym && p2sym) { if (p1sym == p2sym) { if (qsym == p1sym) { diff = ' '; } else { diff = 'N'; } } else { if (qsym == p1sym) { diff = 'A'; } else if (qsym == p2sym) { diff = 'B'; } else { diff = '?'; } } } else { diff = ' '; } ci->diffs[i] = diff; } ci->diffs[alnlen] = 0; /* compute score */ int sumA = 0; int sumB = 0; int sumN = 0; for (int i = 0; i
< alnlen; i++) { if (!ci->ignore[i]) { char diff = ci->diffs[i]; if (diff == 'A') { sumA++; } else if (diff == 'B') { sumB++; } else if (diff != ' ') { sumN++; } } } int left_n = 0; int left_a = 0; int left_y = 0; int right_n = sumA; int right_a = sumN; int right_y = sumB; double best_h = -1; int best_i = -1; int best_reverse = 0; int best_left_y = 0; int best_right_y = 0; int best_left_n = 0; int best_right_n = 0; int best_left_a = 0; int best_right_a = 0; for (int i=0; i<alnlen; i++) { if (!ci->ignore[i]) { char diff = ci->diffs[i]; if (diff != ' ') { if (diff == 'A') { left_y++; right_n--; } else if (diff == 'B') { left_n++; right_y--; } else { left_a++; right_a--; } double left_h, right_h, h; if ((left_y > left_n) && (right_y > right_n)) { left_h = left_y / (opt_xn * (left_n + opt_dn) + left_a); right_h = right_y / (opt_xn * (right_n + opt_dn) + right_a); h = left_h * right_h; if (h > best_h) { best_reverse = 0; best_h = h; best_i = i; best_left_n = left_n; best_left_y = left_y; best_left_a = left_a; best_right_n = right_n; best_right_y = right_y; best_right_a = right_a; } } else if ((left_n > left_y) && (right_n > right_y)) { /* swap left/right and yes/no */ left_h = left_n / (opt_xn * (left_y + opt_dn) + left_a); right_h = right_n / (opt_xn * (right_y + opt_dn) + right_a); h = left_h * right_h; if (h > best_h) { best_reverse = 1; best_h = h; best_i = i; best_left_n = left_y; best_left_y = left_n; best_left_a = left_a; best_right_n = right_y; best_right_y = right_n; best_right_a = right_a; } } } } } ci->best_h = best_h > 0 ? best_h : 0.0; if (best_h >= 0.0) { status = 2; /* flip A and B if necessary */ if (best_reverse) { for(int i = 0; i < alnlen; i++) { char diff = ci->diffs[i]; if (diff == 'A') { ci->diffs[i] = 'B'; } else if (diff == 'B') { ci->diffs[i] = 'A'; } } } /* fill in votes and model */ for(int i = 0; i < alnlen; i++) { char m = i <= best_i ?
'A' : 'B'; ci->model[i] = m; char v = ' '; if (!ci->ignore[i]) { char d = ci->diffs[i]; if ((d == 'A') || (d == 'B')) { if (d == m) { v = '+'; } else { v = '!'; } } else if ((d == 'N') || (d == '?')) { v = '0'; } } ci->votes[i] = v; /* lower case diffs for no votes */ if (v == '!') { ci->diffs[i] = tolower(ci->diffs[i]); } } /* fill in crossover region */ for(int i = best_i + 1; i < alnlen; i++) { if ((ci->diffs[i] == ' ') || (ci->diffs[i] == 'A')) { ci->model[i] = 'x'; } else { break; } } ci->votes[alnlen] = 0; ci->model[alnlen] = 0; /* count matches */ int index_a = best_reverse ? 1 : 0; int index_b = best_reverse ? 0 : 1; int match_QA = 0; int match_QB = 0; int match_AB = 0; int match_QM = 0; int cols = 0; for(int i = 0; i < alnlen; i++) { if (! ci->ignore[i]) { cols++; char qsym = chrmap_4bit[(int)(ci->qaln[i])]; char asym = chrmap_4bit[(int)(ci->paln[index_a][i])]; char bsym = chrmap_4bit[(int)(ci->paln[index_b][i])]; char msym = (i <= best_i) ? asym : bsym; if (qsym == asym) { match_QA++; } if (qsym == bsym) { match_QB++; } if (asym == bsym) { match_AB++; } if (qsym == msym) { match_QM++; } } } int seqno_a = ci->cand_list[ci->best_parents[index_a]]; int seqno_b = ci->cand_list[ci->best_parents[index_b]]; double QA = 100.0 * match_QA / cols; double QB = 100.0 * match_QB / cols; double AB = 100.0 * match_AB / cols; double QT = MAX(QA, QB); double QM = 100.0 * match_QM / cols; double divdiff = QM - QT; double divfrac = 100.0 * divdiff / QT; int sumL = best_left_n + best_left_a + best_left_y; int sumR = best_right_n + best_right_a + best_right_y; if (opt_uchime2_denovo || opt_uchime3_denovo) { if ((QM == 100.0) && (QT < 100.0)) { status = 4; } } else if (best_h >= opt_minh) { status = 3; if ((divdiff >= opt_mindiv) && (sumL >= opt_mindiffs) && (sumR >= opt_mindiffs)) { status = 4; } } /* print alignment */ xpthread_mutex_lock(&mutex_output); if (opt_uchimealns && (status == 4)) { fprintf(fp_uchimealns, "\n"); fprintf(fp_uchimealns, 
"----------------------------------------" "--------------------------------\n"); fprintf(fp_uchimealns, "Query (%5d nt) ", ci->query_len); if (opt_xsize) { header_fprint_strip_size(fp_uchimealns, ci->query_head, ci->query_head_len); } else { fprintf(fp_uchimealns, "%s", ci->query_head); } fprintf(fp_uchimealns, "\nParentA (%5" PRIu64 " nt) ", db_getsequencelen(seqno_a)); if (opt_xsize) { header_fprint_strip_size(fp_uchimealns, db_getheader(seqno_a), db_getheaderlen(seqno_a)); } else { fprintf(fp_uchimealns, "%s", db_getheader(seqno_a)); } fprintf(fp_uchimealns, "\nParentB (%5" PRIu64 " nt) ", db_getsequencelen(seqno_b)); if (opt_xsize) { header_fprint_strip_size(fp_uchimealns, db_getheader(seqno_b), db_getheaderlen(seqno_b)); } else { fprintf(fp_uchimealns, "%s", db_getheader(seqno_b)); } fprintf(fp_uchimealns, "\n\n"); int width = opt_alignwidth > 0 ? opt_alignwidth : alnlen; qpos = 0; int p1pos = 0; int p2pos = 0; int rest = alnlen; for(int i = 0; i < alnlen; i += width) { /* count non-gap symbols on current line */ int qnt, p1nt, p2nt; qnt = p1nt = p2nt = 0; int w = MIN(rest,width); for(int j=0; j<w; j++) { if (ci->qaln[i+j] != '-') { qnt++; } if (ci->paln[0][i+j] != '-') { p1nt++; } if (ci->paln[1][i+j] != '-') { p2nt++; } } if (!
best_reverse) { fprintf(fp_uchimealns, "A %5d %.*s %d\n", p1pos+1, w, ci->paln[0]+i, p1pos+p1nt); fprintf(fp_uchimealns, "Q %5d %.*s %d\n", qpos+1, w, ci->qaln+i, qpos+qnt); fprintf(fp_uchimealns, "B %5d %.*s %d\n", p2pos+1, w, ci->paln[1]+i, p2pos+p2nt); } else { fprintf(fp_uchimealns, "A %5d %.*s %d\n", p2pos+1, w, ci->paln[1]+i, p2pos+p2nt); fprintf(fp_uchimealns, "Q %5d %.*s %d\n", qpos+1, w, ci->qaln+i, qpos+qnt); fprintf(fp_uchimealns, "B %5d %.*s %d\n", p1pos+1, w, ci->paln[0]+i, p1pos+p1nt); } fprintf(fp_uchimealns, "Diffs %.*s\n", w, ci->diffs+i); fprintf(fp_uchimealns, "Votes %.*s\n", w, ci->votes+i); fprintf(fp_uchimealns, "Model %.*s\n", w, ci->model+i); fprintf(fp_uchimealns, "\n"); qpos += qnt; p1pos += p1nt; p2pos += p2nt; rest -= width; } fprintf(fp_uchimealns, "Ids. QA %.1f%%, QB %.1f%%, AB %.1f%%, " "QModel %.1f%%, Div. %+.1f%%\n", QA, QB, AB, QM, divfrac); fprintf(fp_uchimealns, "Diffs Left %d: N %d, A %d, Y %d (%.1f%%); " "Right %d: N %d, A %d, Y %d (%.1f%%), Score %.4f\n", sumL, best_left_n, best_left_a, best_left_y, 100.0 * best_left_y / sumL, sumR, best_right_n, best_right_a, best_right_y, 100.0 * best_right_y / sumR, best_h); } if (opt_uchimeout) { fprintf(fp_uchimeout, "%.4f\t", best_h); if (opt_xsize) { header_fprint_strip_size(fp_uchimeout, ci->query_head, ci->query_head_len); fprintf(fp_uchimeout, "\t"); header_fprint_strip_size(fp_uchimeout, db_getheader(seqno_a), db_getheaderlen(seqno_a)); fprintf(fp_uchimeout, "\t"); header_fprint_strip_size(fp_uchimeout, db_getheader(seqno_b), db_getheaderlen(seqno_b)); fprintf(fp_uchimeout, "\t"); } else { fprintf(fp_uchimeout, "%s\t%s\t%s\t", ci->query_head, db_getheader(seqno_a), db_getheader(seqno_b)); } if(! 
opt_uchimeout5) { if (opt_xsize) { if (QA >= QB) { header_fprint_strip_size(fp_uchimeout, db_getheader(seqno_a), db_getheaderlen(seqno_a)); } else { header_fprint_strip_size(fp_uchimeout, db_getheader(seqno_b), db_getheaderlen(seqno_b)); } fprintf(fp_uchimeout, "\t"); } else { if (QA >= QB) { fprintf(fp_uchimeout, "%s\t", db_getheader(seqno_a)); } else { fprintf(fp_uchimeout, "%s\t", db_getheader(seqno_b)); } } } fprintf(fp_uchimeout, "%.1f\t%.1f\t%.1f\t%.1f\t%.1f\t" "%d\t%d\t%d\t%d\t%d\t%d\t%.1f\t%c\n", QM, QA, QB, AB, QT, best_left_y, best_left_n, best_left_a, best_right_y, best_right_n, best_right_a, divdiff, status == 4 ? 'Y' : (status == 2 ? 'N' : '?')); } xpthread_mutex_unlock(&mutex_output); } return status; } /* new chimeric status: 0: no parents, non-chimeric 1: score < 0 (no alignment), non-chimeric 2: score < minh, non-chimeric 3: score >= minh, suspicious -> not available with uchime2_denovo and uchime3_denovo 4: score >= minh && (divdiff >= opt_mindiv) && ..., chimeric */ void query_init(struct searchinfo_s * si) { si->qsequence = nullptr; si->kmers = nullptr; si->hits = (struct hit *) xmalloc(sizeof(struct hit) * tophits); si->kmers = (count_t *) xmalloc(db_getsequencecount() * sizeof(count_t) + 32); si->hit_count = 0; si->uh = unique_init(); si->s = search16_init(opt_match, opt_mismatch, opt_gap_open_query_left, opt_gap_open_target_left, opt_gap_open_query_interior, opt_gap_open_target_interior, opt_gap_open_query_right, opt_gap_open_target_right, opt_gap_extension_query_left, opt_gap_extension_target_left, opt_gap_extension_query_interior, opt_gap_extension_target_interior, opt_gap_extension_query_right, opt_gap_extension_target_right); si->nw = nw_init(); si->m = minheap_init(tophits); } void query_exit(struct searchinfo_s * si) { search16_exit(si->s); unique_exit(si->uh); minheap_exit(si->m); nw_exit(si->nw); if (si->qsequence) { xfree(si->qsequence); } if (si->hits) { xfree(si->hits); } if (si->kmers) { xfree(si->kmers); } } void 
partition_query(struct chimera_info_s * ci) { int rest = ci->query_len; char * p = ci->query_seq; for (int i=0; i<parts; i++) { int len = (rest+(parts-i-1))/(parts-i); struct searchinfo_s * si = ci->si + i; si->query_no = ci->query_no; si->strand = 0; si->qsize = ci->query_size; si->query_head_len = ci->query_head_len; si->query_head = ci->query_head; si->qseqlen = len; strncpy(si->qsequence, p, len); si->qsequence[len] = 0; rest -= len; p += len; } } void chimera_thread_init(struct chimera_info_s * ci) { ci->query_alloc = 0; ci->head_alloc = 0; ci->query_head = nullptr; ci->query_seq = nullptr; ci->maxi = nullptr; ci->maxsmooth = nullptr; ci->match = nullptr; ci->smooth = nullptr; ci->paln[0] = nullptr; ci->paln[1] = nullptr; ci->qaln = nullptr; ci->diffs = nullptr; ci->votes = nullptr; ci->model = nullptr; ci->ignore = nullptr; for(int i = 0; i < parts; i++) { query_init(ci->si + i); } ci->s = search16_init(opt_match, opt_mismatch, opt_gap_open_query_left, opt_gap_open_target_left, opt_gap_open_query_interior, opt_gap_open_target_interior, opt_gap_open_query_right, opt_gap_open_target_right, opt_gap_extension_query_left, opt_gap_extension_target_left, opt_gap_extension_query_interior, opt_gap_extension_target_interior, opt_gap_extension_query_right, opt_gap_extension_target_right); } void chimera_thread_exit(struct chimera_info_s * ci) { search16_exit(ci->s); for(int i = 0; i < parts; i++) { query_exit(ci->si + i); } if (ci->maxsmooth) { xfree(ci->maxsmooth); } if (ci->match) { xfree(ci->match); } if (ci->smooth) { xfree(ci->smooth); } if (ci->diffs) { xfree(ci->diffs); } if (ci->votes) { xfree(ci->votes); } if (ci->model) { xfree(ci->model); } if (ci->ignore) { xfree(ci->ignore); } if (ci->maxi) { xfree(ci->maxi); } if (ci->qaln) { xfree(ci->qaln); } if (ci->paln[0]) { xfree(ci->paln[0]); } if (ci->paln[1]) { xfree(ci->paln[1]); } if (ci->query_seq) { xfree(ci->query_seq); } if (ci->query_head) { xfree(ci->query_head); } } uint64_t chimera_thread_core(struct chimera_info_s * ci) { chimera_thread_init(ci); auto * allhits_list = (struct hit *)
xmalloc(maxcandidates * sizeof(struct hit)); LinearMemoryAligner lma; int64_t * scorematrix = lma.scorematrix_create(opt_match, opt_mismatch); lma.set_parameters(scorematrix, opt_gap_open_query_left, opt_gap_open_target_left, opt_gap_open_query_interior, opt_gap_open_target_interior, opt_gap_open_query_right, opt_gap_open_target_right, opt_gap_extension_query_left, opt_gap_extension_target_left, opt_gap_extension_query_interior, opt_gap_extension_target_interior, opt_gap_extension_query_right, opt_gap_extension_target_right); while(true) { /* get next sequence */ xpthread_mutex_lock(&mutex_input); if (opt_uchime_ref) { if (fasta_next(query_fasta_h, ! opt_notrunclabels, chrmap_no_change)) { ci->query_head_len = fasta_get_header_length(query_fasta_h); ci->query_len = fasta_get_sequence_length(query_fasta_h); ci->query_no = fasta_get_seqno(query_fasta_h); ci->query_size = fasta_get_abundance(query_fasta_h); /* if necessary expand memory for arrays based on query length */ realloc_arrays(ci); /* copy the data locally (query seq, head) */ strcpy(ci->query_head, fasta_get_header(query_fasta_h)); strcpy(ci->query_seq, fasta_get_sequence(query_fasta_h)); } else { xpthread_mutex_unlock(&mutex_input); break; /* end while loop */ } } else { if (seqno < db_getsequencecount()) { ci->query_no = seqno; ci->query_head_len = db_getheaderlen(seqno); ci->query_len = db_getsequencelen(seqno); ci->query_size = db_getabundance(seqno); /* if necessary expand memory for arrays based on query length */ realloc_arrays(ci); strcpy(ci->query_head, db_getheader(seqno)); strcpy(ci->query_seq, db_getsequence(seqno)); } else { xpthread_mutex_unlock(&mutex_input); break; /* end while loop */ } } xpthread_mutex_unlock(&mutex_input); int status = 0; /* partition query */ partition_query(ci); /* perform searches and collect candidate parents */ ci->cand_count = 0; int allhits_count = 0; if (ci->query_len >= parts) { for (int i=0; isi+i, opt_qmask); search_joinhits(ci->si+i, nullptr, & hits, & 
hit_count); for(int j=0; jcand_count; k++) { if (ci->cand_list[k] == target) { break; } } if (k == ci->cand_count) { ci->cand_list[ci->cand_count++] = target; } /* deallocate cigar */ if (allhits_list[i].nwalignment) { xfree(allhits_list[i].nwalignment); } } /* align full query to each candidate */ search16_qprep(ci->s, ci->query_seq, ci->query_len); search16(ci->s, ci->cand_count, ci->cand_list, ci->snwscore, ci->snwalignmentlength, ci->snwmatches, ci->snwmismatches, ci->snwgaps, ci->nwcigar); for(int i=0; i < ci->cand_count; i++) { int64_t target = ci->cand_list[i]; int64_t nwscore = ci->snwscore[i]; char * nwcigar; int64_t nwalignmentlength; int64_t nwmatches; int64_t nwmismatches; int64_t nwgaps; if (nwscore == SHRT_MAX) { /* In case the SIMD aligner cannot align, perform a new alignment with the linear memory aligner */ char * tseq = db_getsequence(target); int64_t tseqlen = db_getsequencelen(target); if (ci->nwcigar[i]) { xfree(ci->nwcigar[i]); } nwcigar = xstrdup(lma.align(ci->query_seq, tseq, ci->query_len, tseqlen)); lma.alignstats(nwcigar, ci->query_seq, tseq, & nwscore, & nwalignmentlength, & nwmatches, & nwmismatches, & nwgaps); ci->nwcigar[i] = nwcigar; ci->nwscore[i] = nwscore; ci->nwalignmentlength[i] = nwalignmentlength; ci->nwmatches[i] = nwmatches; ci->nwmismatches[i] = nwmismatches; ci->nwgaps[i] = nwgaps; } else { ci->nwscore[i] = ci->snwscore[i]; ci->nwalignmentlength[i] = ci->snwalignmentlength[i]; ci->nwmatches[i] = ci->snwmatches[i]; ci->nwmismatches[i] = ci->snwmismatches[i]; ci->nwgaps[i] = ci->snwgaps[i]; } } /* find the best pair of parents, then compute score for them */ if (find_best_parents(ci)) { status = eval_parents(ci); } else { status = 0; } /* output results */ xpthread_mutex_lock(&mutex_output); total_count++; total_abundance += ci->query_size; if (status == 4) { chimera_count++; chimera_abundance += ci->query_size; if (opt_chimeras) { fasta_print_general(fp_chimeras, nullptr, ci->query_seq, ci->query_len, ci->query_head, 
ci->query_head_len, ci->query_size, chimera_count, -1.0, -1, -1, opt_fasta_score ? ( opt_uchime_ref ? "uchime_ref" : "uchime_denovo" ) : nullptr, ci->best_h); } } if (status == 3) { borderline_count++; borderline_abundance += ci->query_size; if (opt_borderline) { fasta_print_general(fp_borderline, nullptr, ci->query_seq, ci->query_len, ci->query_head, ci->query_head_len, ci->query_size, borderline_count, -1.0, -1, -1, opt_fasta_score ? ( opt_uchime_ref ? "uchime_ref" : "uchime_denovo" ) : nullptr, ci->best_h); } } if (status < 3) { nonchimera_count++; nonchimera_abundance += ci->query_size; /* output no parents, no chimeras */ if ((status < 2) && opt_uchimeout) { fprintf(fp_uchimeout, "0.0000\t"); if (opt_xsize) { header_fprint_strip_size(fp_uchimeout, ci->query_head, ci->query_head_len); } else { fprintf(fp_uchimeout, "%s", ci->query_head); } if (opt_uchimeout5) { fprintf(fp_uchimeout, "\t*\t*\t*\t*\t*\t*\t*\t0\t0\t0\t0\t0\t0\t*\tN\n"); } else { fprintf(fp_uchimeout, "\t*\t*\t*\t*\t*\t*\t*\t*\t0\t0\t0\t0\t0\t0\t*\tN\n"); } } /* uchime_denovo: add non-chimeras to db */ if (opt_uchime_denovo || opt_uchime2_denovo || opt_uchime3_denovo) { dbindex_addsequence(seqno, opt_qmask); } if (opt_nonchimeras) { fasta_print_general(fp_nonchimeras, nullptr, ci->query_seq, ci->query_len, ci->query_head, ci->query_head_len, ci->query_size, nonchimera_count, -1.0, -1, -1, opt_fasta_score ? ( opt_uchime_ref ? 
"uchime_ref" : "uchime_denovo" ) : nullptr, ci->best_h); } } for (int i=0; i < ci->cand_count; i++) { if (ci->nwcigar[i]) { xfree(ci->nwcigar[i]); } } if (opt_uchime_ref) { progress = fasta_get_position(query_fasta_h); } else { progress += db_getsequencelen(seqno); } progress_update(progress); seqno++; xpthread_mutex_unlock(&mutex_output); } if (allhits_list) { xfree(allhits_list); } chimera_thread_exit(ci); xfree(scorematrix); return 0; } void * chimera_thread_worker(void * vp) { return (void *) chimera_thread_core(cia + (int64_t) vp); } void chimera_threads_run() { xpthread_attr_init(&attr); xpthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE); /* create worker threads */ for(int64_t t=0; t 0) { fprintf(stderr, "Found %d (%.1f%%) chimeras, " "%d (%.1f%%) non-chimeras,\n" "and %d (%.1f%%) borderline sequences " "in %u unique sequences.\n", chimera_count, 100.0 * chimera_count / total_count, nonchimera_count, 100.0 * nonchimera_count / total_count, borderline_count, 100.0 * borderline_count / total_count, total_count); } else { fprintf(stderr, "Found %d chimeras, " "%d non-chimeras,\n" "and %d borderline sequences " "in %u unique sequences.\n", chimera_count, nonchimera_count, borderline_count, total_count); } if (total_abundance > 0) { fprintf(stderr, "Taking abundance information into account, " "this corresponds to\n" "%" PRId64 " (%.1f%%) chimeras, " "%" PRId64 " (%.1f%%) non-chimeras,\n" "and %" PRId64 " (%.1f%%) borderline sequences " "in %" PRId64 " total sequences.\n", chimera_abundance, 100.0 * chimera_abundance / total_abundance, nonchimera_abundance, 100.0 * nonchimera_abundance / total_abundance, borderline_abundance, 100.0 * borderline_abundance / total_abundance, total_abundance); } else { fprintf(stderr, "Taking abundance information into account, " "this corresponds to\n" "%" PRId64 " chimeras, " "%" PRId64 " non-chimeras,\n" "and %" PRId64 " borderline sequences " "in %" PRId64 " total sequences.\n", chimera_abundance, nonchimera_abundance, 
borderline_abundance, total_abundance); } } if (opt_log) { if (opt_uchime_ref) { fprintf(fp_log, "%s", opt_uchime_ref); } else { fprintf(fp_log, "%s", denovo_dbname); } if (seqno > 0) { fprintf(fp_log, ": %d/%u chimeras (%.1f%%)\n", chimera_count, seqno, 100.0 * chimera_count / seqno); } else { fprintf(fp_log, ": %d/%u chimeras\n", chimera_count, seqno); } } if (opt_uchime_ref) { fasta_close(query_fasta_h); } dbindex_free(); db_free(); xpthread_mutex_destroy(&mutex_output); xpthread_mutex_destroy(&mutex_input); xfree(cia); xfree(pthread); close_chimera_file(fp_borderline); close_chimera_file(fp_uchimeout); close_chimera_file(fp_uchimealns); close_chimera_file(fp_nonchimeras); close_chimera_file(fp_chimeras); show_rusage(); } vsearch-2.21.1/src/getseq.h0000644000175000017500000000476014171574117015025 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . 
The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void fastx_getseq(); void fastx_getseqs(); void fastx_getsubseq(); vsearch-2.21.1/src/showalign.h0000644000175000017500000000603214171574117015522 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. 
GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ char * align_getrow(char * seq, char * cigar, int alignlen, int origin); void align_fprint_uncompressed_alignment(FILE * f, char * cigar); void align_show(FILE * f, char * seq1, int64_t seq1len, int64_t seq1off, const char * seq1name, char * seq2, int64_t seq2len, int64_t seq2off, const char * seq2name, char * cigar, int64_t cigarlen, int numwidth, int namewidth, int alignwidth, int strand); vsearch-2.21.1/src/cluster.h0000644000175000017500000000521314171574117015210 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void cluster_smallmem(char * cmdline, char * progheader); void cluster_fast(char * cmdline, char * progheader); void cluster_size(char * cmdline, char * progheader); void cluster_unoise(char * cmdline, char * progheader); vsearch-2.21.1/src/attributes.h0000644000175000017500000000554514171574117015725 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. 
GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ int64_t header_get_size(char * header, int header_length); void header_fprint_strip_size(FILE * fp, char * header, int header_length); void header_fprint_strip_size_ee(FILE * fp, char * header, int header_length, bool strip_size, bool strip_ee); vsearch-2.21.1/src/sortbysize.h0000644000175000017500000000470014171574117015744 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void sortbysize(); vsearch-2.21.1/src/dynlibs.h0000644000175000017500000000615214171574117015176 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . 
The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #ifdef HAVE_ZLIB_H #ifdef _WIN32 extern HMODULE gz_lib; #else extern void * gz_lib; #endif extern gzFile (*gzdopen_p)(int, const char *); extern int (*gzclose_p)(gzFile); extern int (*gzread_p)(gzFile, void*, unsigned); extern int (*gzgetc_p)(gzFile); extern int (*gzrewind_p)(gzFile); extern int (*gzungetc_p)(int, gzFile); extern const char * (*gzerror_p)(gzFile, int*); #endif #ifdef HAVE_BZLIB_H #ifdef _WIN32 extern HMODULE bz2_lib; #else extern void * bz2_lib; #endif extern BZFILE* (*BZ2_bzReadOpen_p)(int*, FILE*, int, int, void*, int); extern void (*BZ2_bzReadClose_p)(int*, BZFILE*); extern int (*BZ2_bzRead_p)(int*, BZFILE*, void*, int); #endif void dynlibs_open(); void dynlibs_close(); vsearch-2.21.1/src/sortbylength.h0000644000175000017500000000470214171574117016255 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . 
The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void sortbylength(); vsearch-2.21.1/src/md5.c0000644000175000017500000002410614171574117014211 0ustar nileshnilesh/* * This is an OpenSSL-compatible implementation of the RSA Data Security, Inc. * MD5 Message-Digest Algorithm (RFC 1321). * * Homepage: * http://openwall.info/wiki/people/solar/software/public-domain-source-code/md5 * * Author: * Alexander Peslyak, better known as Solar Designer * * This software was written by Alexander Peslyak in 2001. No copyright is * claimed, and the software is hereby placed in the public domain. 
 * In case this attempt to disclaim copyright and place the software in the
 * public domain is deemed null and void, then the software is
 * Copyright (c) 2001 Alexander Peslyak and it is hereby released to the
 * general public under the following terms:
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted.
 *
 * There's ABSOLUTELY NO WARRANTY, express or implied.
 *
 * (This is a heavily cut-down "BSD license".)
 *
 * This differs from Colin Plumb's older public domain implementation in that
 * no exactly 32-bit integer data type is required (any 32-bit or wider
 * unsigned integer data type will do), there's no compile-time endianness
 * configuration, and the function prototypes match OpenSSL's.  No code from
 * Colin Plumb's implementation has been reused; this comment merely compares
 * the properties of the two independent implementations.
 *
 * The primary goals of this implementation are portability and ease of use.
 * It is meant to be fast, but not as fast as possible.  Some known
 * optimizations are not included to reduce source code size and avoid
 * compile-time configuration.
 */

#ifndef HAVE_OPENSSL

#include <string.h>

#include "md5.h"

/*
 * The basic MD5 functions.
 *
 * F and G are optimized compared to their RFC 1321 definitions for
 * architectures that lack an AND-NOT instruction, just like in Colin Plumb's
 * implementation.
 */
#define F(x, y, z)			((z) ^ ((x) & ((y) ^ (z))))
#define G(x, y, z)			((y) ^ ((z) & ((x) ^ (y))))
#define H(x, y, z)			((x) ^ (y) ^ (z))
#define I(x, y, z)			((y) ^ ((x) | ~(z)))

/*
 * The MD5 transformation for all four rounds.
 */
#define STEP(f, a, b, c, d, x, t, s) \
	(a) += f((b), (c), (d)) + (x) + (t); \
	(a) = (((a) << (s)) | (((a) & 0xffffffff) >> (32 - (s)))); \
	(a) += (b);

/*
 * SET reads 4 input bytes in little-endian byte order and stores them
 * in a properly aligned word in host byte order.
 *
 * The check for little-endian architectures that tolerate unaligned
 * memory accesses is just an optimization.
 * Nothing will break if it
 * doesn't work.
 */
#if defined(__i386__) || defined(__x86_64__) || defined(__vax__)
#define SET(n) \
	(*(MD5_u32plus *)&ptr[(n) * 4])
#define GET(n) \
	SET(n)
#else
#define SET(n) \
	(ctx->block[(n)] = \
	(MD5_u32plus)ptr[(n) * 4] | \
	((MD5_u32plus)ptr[(n) * 4 + 1] << 8) | \
	((MD5_u32plus)ptr[(n) * 4 + 2] << 16) | \
	((MD5_u32plus)ptr[(n) * 4 + 3] << 24))
#define GET(n) \
	(ctx->block[(n)])
#endif

/*
 * This processes one or more 64-byte data blocks, but does NOT update
 * the bit counters.  There are no alignment requirements.
 */
static void *body(MD5_CTX *ctx, void *data, unsigned long size)
{
	unsigned char *ptr;
	MD5_u32plus a, b, c, d;
	MD5_u32plus saved_a, saved_b, saved_c, saved_d;

	ptr = data;

	a = ctx->a;
	b = ctx->b;
	c = ctx->c;
	d = ctx->d;

	do {
		saved_a = a;
		saved_b = b;
		saved_c = c;
		saved_d = d;

/* Round 1 */
		STEP(F, a, b, c, d, SET(0), 0xd76aa478, 7)
		STEP(F, d, a, b, c, SET(1), 0xe8c7b756, 12)
		STEP(F, c, d, a, b, SET(2), 0x242070db, 17)
		STEP(F, b, c, d, a, SET(3), 0xc1bdceee, 22)
		STEP(F, a, b, c, d, SET(4), 0xf57c0faf, 7)
		STEP(F, d, a, b, c, SET(5), 0x4787c62a, 12)
		STEP(F, c, d, a, b, SET(6), 0xa8304613, 17)
		STEP(F, b, c, d, a, SET(7), 0xfd469501, 22)
		STEP(F, a, b, c, d, SET(8), 0x698098d8, 7)
		STEP(F, d, a, b, c, SET(9), 0x8b44f7af, 12)
		STEP(F, c, d, a, b, SET(10), 0xffff5bb1, 17)
		STEP(F, b, c, d, a, SET(11), 0x895cd7be, 22)
		STEP(F, a, b, c, d, SET(12), 0x6b901122, 7)
		STEP(F, d, a, b, c, SET(13), 0xfd987193, 12)
		STEP(F, c, d, a, b, SET(14), 0xa679438e, 17)
		STEP(F, b, c, d, a, SET(15), 0x49b40821, 22)

/* Round 2 */
		STEP(G, a, b, c, d, GET(1), 0xf61e2562, 5)
		STEP(G, d, a, b, c, GET(6), 0xc040b340, 9)
		STEP(G, c, d, a, b, GET(11), 0x265e5a51, 14)
		STEP(G, b, c, d, a, GET(0), 0xe9b6c7aa, 20)
		STEP(G, a, b, c, d, GET(5), 0xd62f105d, 5)
		STEP(G, d, a, b, c, GET(10), 0x02441453, 9)
		STEP(G, c, d, a, b, GET(15), 0xd8a1e681, 14)
		STEP(G, b, c, d, a, GET(4), 0xe7d3fbc8, 20)
		STEP(G, a, b, c, d, GET(9), 0x21e1cde6, 5)
		STEP(G, d, a, b, c, GET(14), 0xc33707d6, 9)
		STEP(G, c, d, a, b, GET(3), 0xf4d50d87, 14)
		STEP(G, b, c, d, a, GET(8), 0x455a14ed, 20)
		STEP(G, a, b, c, d, GET(13), 0xa9e3e905, 5)
		STEP(G, d, a, b, c, GET(2), 0xfcefa3f8, 9)
		STEP(G, c, d, a, b, GET(7), 0x676f02d9, 14)
		STEP(G, b, c, d, a, GET(12), 0x8d2a4c8a, 20)

/* Round 3 */
		STEP(H, a, b, c, d, GET(5), 0xfffa3942, 4)
		STEP(H, d, a, b, c, GET(8), 0x8771f681, 11)
		STEP(H, c, d, a, b, GET(11), 0x6d9d6122, 16)
		STEP(H, b, c, d, a, GET(14), 0xfde5380c, 23)
		STEP(H, a, b, c, d, GET(1), 0xa4beea44, 4)
		STEP(H, d, a, b, c, GET(4), 0x4bdecfa9, 11)
		STEP(H, c, d, a, b, GET(7), 0xf6bb4b60, 16)
		STEP(H, b, c, d, a, GET(10), 0xbebfbc70, 23)
		STEP(H, a, b, c, d, GET(13), 0x289b7ec6, 4)
		STEP(H, d, a, b, c, GET(0), 0xeaa127fa, 11)
		STEP(H, c, d, a, b, GET(3), 0xd4ef3085, 16)
		STEP(H, b, c, d, a, GET(6), 0x04881d05, 23)
		STEP(H, a, b, c, d, GET(9), 0xd9d4d039, 4)
		STEP(H, d, a, b, c, GET(12), 0xe6db99e5, 11)
		STEP(H, c, d, a, b, GET(15), 0x1fa27cf8, 16)
		STEP(H, b, c, d, a, GET(2), 0xc4ac5665, 23)

/* Round 4 */
		STEP(I, a, b, c, d, GET(0), 0xf4292244, 6)
		STEP(I, d, a, b, c, GET(7), 0x432aff97, 10)
		STEP(I, c, d, a, b, GET(14), 0xab9423a7, 15)
		STEP(I, b, c, d, a, GET(5), 0xfc93a039, 21)
		STEP(I, a, b, c, d, GET(12), 0x655b59c3, 6)
		STEP(I, d, a, b, c, GET(3), 0x8f0ccc92, 10)
		STEP(I, c, d, a, b, GET(10), 0xffeff47d, 15)
		STEP(I, b, c, d, a, GET(1), 0x85845dd1, 21)
		STEP(I, a, b, c, d, GET(8), 0x6fa87e4f, 6)
		STEP(I, d, a, b, c, GET(15), 0xfe2ce6e0, 10)
		STEP(I, c, d, a, b, GET(6), 0xa3014314, 15)
		STEP(I, b, c, d, a, GET(13), 0x4e0811a1, 21)
		STEP(I, a, b, c, d, GET(4), 0xf7537e82, 6)
		STEP(I, d, a, b, c, GET(11), 0xbd3af235, 10)
		STEP(I, c, d, a, b, GET(2), 0x2ad7d2bb, 15)
		STEP(I, b, c, d, a, GET(9), 0xeb86d391, 21)

		a += saved_a;
		b += saved_b;
		c += saved_c;
		d += saved_d;

		ptr += 64;
	} while (size -= 64);

	ctx->a = a;
	ctx->b = b;
	ctx->c = c;
	ctx->d = d;

	return ptr;
}

void MD5_Init(MD5_CTX *ctx)
{
	ctx->a = 0x67452301;
	ctx->b = 0xefcdab89;
	ctx->c = 0x98badcfe;
	ctx->d = 0x10325476;

	ctx->lo = 0;
	ctx->hi = 0;
}


void MD5_Update(MD5_CTX *ctx, void *data, unsigned long size)
{
  MD5_u32plus saved_lo;
  unsigned long used, free;

  saved_lo = ctx->lo;
  if ((ctx->lo = (saved_lo + size) & 0x1fffffff) < saved_lo) {
    ctx->hi++;
  }
  ctx->hi += size >> 29;

  used = saved_lo & 0x3f;

  if (used) {
    free = 64 - used;

    if (size < free) {
      memcpy(&ctx->buffer[used], data, size);
      return;
    }

    memcpy(&ctx->buffer[used], data, free);
    data = (unsigned char *)data + free;
    size -= free;
    body(ctx, ctx->buffer, 64);
  }

  if (size >= 64) {
    data = body(ctx, data, size & ~(unsigned long)0x3f);
    size &= 0x3f;
  }

  memcpy(ctx->buffer, data, size);
}

void MD5_Final(unsigned char *result, MD5_CTX *ctx)
{
  unsigned long used, free;

  used = ctx->lo & 0x3f;

  ctx->buffer[used++] = 0x80;

  free = 64 - used;

  if (free < 8) {
    memset(&ctx->buffer[used], 0, free);
    body(ctx, ctx->buffer, 64);
    used = 0;
    free = 64;
  }

  memset(&ctx->buffer[used], 0, free - 8);

  ctx->lo <<= 3;
  ctx->buffer[56] = ctx->lo;
  ctx->buffer[57] = ctx->lo >> 8;
  ctx->buffer[58] = ctx->lo >> 16;
  ctx->buffer[59] = ctx->lo >> 24;
  ctx->buffer[60] = ctx->hi;
  ctx->buffer[61] = ctx->hi >> 8;
  ctx->buffer[62] = ctx->hi >> 16;
  ctx->buffer[63] = ctx->hi >> 24;

  body(ctx, ctx->buffer, 64);

  result[0] = ctx->a;
  result[1] = ctx->a >> 8;
  result[2] = ctx->a >> 16;
  result[3] = ctx->a >> 24;
  result[4] = ctx->b;
  result[5] = ctx->b >> 8;
  result[6] = ctx->b >> 16;
  result[7] = ctx->b >> 24;
  result[8] = ctx->c;
  result[9] = ctx->c >> 8;
  result[10] = ctx->c >> 16;
  result[11] = ctx->c >> 24;
  result[12] = ctx->d;
  result[13] = ctx->d >> 8;
  result[14] = ctx->d >> 16;
  result[15] = ctx->d >> 24;

  memset(ctx, 0, sizeof(*ctx));
}

#endif

vsearch-2.21.1/src/showalign.cc

/*

  VSEARCH: a versatile open source tool for metagenomics

  Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
  All rights reserved.

  Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
  Department of Informatics, University of Oslo,
  PO Box 1080 Blindern, NO-0316 Oslo, Norway

  This software is dual-licensed and available under a choice
  of one of two licenses, either under the terms of the GNU
  General Public License version 3 or the BSD 2-Clause License.


  GNU General Public License version 3

  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program. If not, see <http://www.gnu.org/licenses/>.


  The BSD 2-Clause License

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
  notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
  CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
  OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
  BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
  LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
  USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
  DAMAGE.

*/

#include "vsearch.h"

static int64_t line_pos;

static char * q_seq;
static char * d_seq;

static int64_t q_start;
static int64_t d_start;

static int64_t q_pos;
static int64_t d_pos;

static int64_t q_strand;

static int64_t alignlen;

static char * q_line;
static char * a_line;
static char * d_line;

static FILE * out;

static int poswidth = 3;
static int headwidth = 5;

static const char * q_name;
static const char * d_name;

static int64_t q_len;
static int64_t d_len;

inline void putop(char c, int64_t len)
{
  int64_t delta = q_strand ? -1 : +1;

  int64_t count = len;
  while(count) {
    if (line_pos == 0) {
      q_start = q_pos;
      d_start = d_pos;
    }

    char qs;
    char ds;
    unsigned int qs4, ds4;

    switch(c) {
    case 'M':
      qs = q_strand ? chrmap_complement[(int)(q_seq[q_pos])] : q_seq[q_pos];
      ds = d_seq[d_pos];
      q_pos += delta;
      d_pos += 1;
      q_line[line_pos] = qs;

      qs4 = chrmap_4bit[(int)qs];
      ds4 = chrmap_4bit[(int)ds];
      if ((qs4 == ds4) && (! ambiguous_4bit[qs4])) {
        a_line[line_pos] = '|';
      } else if (qs4 & ds4) {
        a_line[line_pos] = '+';
      } else {
        a_line[line_pos] = ' ';
      }

      d_line[line_pos] = ds;
      line_pos++;
      break;

    case 'D':
      qs = q_strand ? chrmap_complement[(int)(q_seq[q_pos])] : q_seq[q_pos];
      q_pos += delta;
      q_line[line_pos] = qs;
      a_line[line_pos] = ' ';
      d_line[line_pos] = '-';
      line_pos++;
      break;

    case 'I':
      ds = d_seq[d_pos];
      d_pos += 1;
      q_line[line_pos] = '-';
      a_line[line_pos] = ' ';
      d_line[line_pos] = ds;
      line_pos++;
      break;
    }

    if ((line_pos == alignlen) || ((c == 0) && (line_pos > 0))) {
      q_line[line_pos] = 0;
      a_line[line_pos] = 0;
      d_line[line_pos] = 0;

      int64_t q1 = q_start + 1;
      if (q1 > q_len) {
        q1 = q_len;
      }

      int64_t q2 = q_strand ? q_pos +2 : q_pos;

      int64_t d1 = d_start + 1;
      if (d1 > d_len) {
        d1 = d_len;
      }

      int64_t d2 = d_pos;

      fprintf(out, "\n");
      fprintf(out, "%*s %*" PRId64 " %c %s %" PRId64 "\n",
              headwidth, q_name, poswidth, q1,
              q_strand ? '-' : '+', q_line, q2);
      fprintf(out, "%*s %*s %s\n",
              headwidth, "", poswidth, "", a_line);
      fprintf(out, "%*s %*" PRId64 " %c %s %" PRId64 "\n",
              headwidth, d_name, poswidth, d1,
              '+', d_line, d2);

      line_pos = 0;
    }
    count--;
  }
}

void align_show(FILE * f,
                char * seq1,
                int64_t seq1len,
                int64_t seq1off,
                const char * seq1name,
                char * seq2,
                int64_t seq2len,
                int64_t seq2off,
                const char * seq2name,
                char * cigar,
                int64_t cigarlen,
                int numwidth,
                int namewidth,
                int alignwidth,
                int strand)
{
  out = f;

  q_seq = seq1;
  q_len = seq1len;
  q_name = seq1name;
  q_strand = strand;

  d_seq = seq2;
  d_len = seq2len;
  d_name = seq2name;

  char * p = cigar;
  char * e = p + cigarlen;

  poswidth = numwidth;
  headwidth = namewidth;
  alignlen = alignwidth;

  q_line = (char*) xmalloc(alignwidth+1);
  a_line = (char*) xmalloc(alignwidth+1);
  d_line = (char*) xmalloc(alignwidth+1);

  q_pos = strand ?
    seq1len - 1 - seq1off : seq1off;
  d_pos = seq2off;

  line_pos = 0;

  while(p < e) {
    int64_t len;
    int n;
    if (!sscanf(p, "%" PRId64 "%n", & len, & n)) {
      n = 0;
      len = 1;
    }
    p += n;
    char op = *p++;
    putop(op, len);
  }

  putop(0, 1);

  xfree(q_line);
  xfree(a_line);
  xfree(d_line);
}

char * align_getrow(char * seq, char * cigar, int alen, int origin)
{
  char * row = (char*) xmalloc(alen+1);
  char * r = row;
  char * p = cigar;
  char * s = seq;

  while(*p) {
    int64_t len;
    int n;
    if (!sscanf(p, "%" PRId64 "%n", & len, & n)) {
      n = 0;
      len = 1;
    }
    p += n;
    char op = *p++;

    if ((op == 'M') ||
        ((op == 'D') && (origin == 0)) ||
        ((op == 'I') && (origin == 1))) {
      strncpy(r, s, len);
      r += len;
      s += len;
    } else {
      /* insert len gap symbols */
      for(int64_t i = 0; i < len; i++) {
        *r++ = '-';
      }
    }
  }

  *r = 0;
  return row;
}

void align_fprint_uncompressed_alignment(FILE * f, char * cigar)
{
  char * p = cigar;
  while(*p) {
    if (*p > '9') {
      fprintf(f, "%c", *p++);
    } else {
      int n = 0;
      char c = 0;
      int x = 0;
      if (sscanf(p, "%d%c%n", &n, &c, &x) == 2) {
        for(int i = 0; i < n; i++) {
          fprintf(f, "%c", c);
        }
        p += x;
      }
      [...]

[...]

/*

  VSEARCH: a versatile open source tool for metagenomics

  Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
  All rights reserved.

  Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
  Department of Informatics, University of Oslo,
  PO Box 1080 Blindern, NO-0316 Oslo, Norway

  This software is dual-licensed and available under a choice
  of one of two licenses, either under the terms of the GNU
  General Public License version 3 or the BSD 2-Clause License.


  GNU General Public License version 3

  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program. If not, see <http://www.gnu.org/licenses/>.


  The BSD 2-Clause License

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
  notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
  COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
  INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
  ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
  POSSIBILITY OF SUCH DAMAGE.

*/

#ifndef MIN
#define MIN(a,b) ((a) < (b) ? (a) : (b))
#endif

#ifndef MAX
#define MAX(a,b) ((a) > (b) ? (a) : (b))
#endif

#ifndef exp10
#define exp10(x) (pow(10.0,(x)))
#endif

#define SHA_DIGEST_LENGTH SHA1_DIGEST_SIZE
constexpr int MD5_DIGEST_LENGTH {16};

#define LEN_DIG_SHA1 SHA_DIGEST_LENGTH

constexpr int LEN_HEX_DIG_MD5 {2 * MD5_DIGEST_LENGTH + 1};
#define LEN_HEX_DIG_SHA1 (2*LEN_DIG_SHA1+1)

void fatal(const char * msg);
void fatal(const char * format, const char * message);
char * xstrdup(const char *s);
char * xstrchrnul(char *s, int c);
int xsprintf(char * * ret, const char * format, ...);
uint64_t hash_cityhash64(char * s, uint64_t n);
int64_t getusec();
void show_rusage();

void progress_init(const char * prompt, uint64_t size);
void progress_update(uint64_t progress);
void progress_done();

void random_init();
int64_t random_int(int64_t n);
uint64_t random_ulong(uint64_t n);

void string_normalize(char * normalized, char * s, unsigned int len);

void reverse_complement(char * rc, char * seq, int64_t len);

void fprint_hex(FILE * fp, unsigned char * data, int len);

void get_hex_seq_digest_sha1(char * hex, char * seq, int seqlen);
void get_hex_seq_digest_md5(char * hex, char * seq, int seqlen);

void fprint_seq_digest_sha1(FILE * fp, char * seq, int seqlen);
void fprint_seq_digest_md5(FILE * fp, char * seq, int seqlen);

FILE * fopen_input(const char * filename);
FILE * fopen_output(const char * filename);

void inline xpthread_attr_init(pthread_attr_t *attr)
{
  if (pthread_attr_init(attr)) {
    fatal("Unable to init thread attributes");
  }
}

void inline xpthread_attr_destroy(pthread_attr_t *attr)
{
  if (pthread_attr_destroy(attr)) {
    fatal("Unable to destroy thread attributes");
  }
}

void inline xpthread_attr_setdetachstate(pthread_attr_t *attr, int detachstate)
{
  if (pthread_attr_setdetachstate(attr, detachstate)) {
    fatal("Unable to set thread attributes detach state");
  }
}

void inline xpthread_create(pthread_t *thread, const pthread_attr_t *attr,
                            void *(*start_routine)(void *), void *arg)
{
  if (pthread_create(thread, attr, start_routine, arg)) {
    fatal("Unable to create thread");
  }
}

void inline xpthread_join(pthread_t thread, void **value_ptr)
{
  if (pthread_join(thread, value_ptr)) {
    fatal("Unable to join thread");
  }
}

void inline xpthread_mutex_init(pthread_mutex_t *mutex,
                                const pthread_mutexattr_t *attr)
{
  if (pthread_mutex_init(mutex, attr)) {
    fatal("Unable to init mutex");
  }
}

void inline xpthread_mutex_destroy(pthread_mutex_t *mutex)
{
  if (pthread_mutex_destroy(mutex)) {
    fatal("Unable to destroy mutex");
  }
}

void inline xpthread_mutex_lock(pthread_mutex_t *mutex)
{
  if (pthread_mutex_lock(mutex)) {
    fatal("Unable to lock mutex");
  }
}

void inline xpthread_mutex_unlock(pthread_mutex_t *mutex)
{
  if (pthread_mutex_unlock(mutex)) {
    fatal("Unable to unlock mutex");
  }
}

void inline xpthread_cond_init(pthread_cond_t *cond,
                               const pthread_condattr_t *attr)
{
  if (pthread_cond_init(cond, attr)) {
    fatal("Unable to init condition variable");
  }
}

void inline xpthread_cond_destroy(pthread_cond_t *cond)
{
  if (pthread_cond_destroy(cond)) {
    fatal("Unable to destroy condition variable");
  }
}

void inline xpthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
{
  if (pthread_cond_wait(cond, mutex)) {
    fatal("Unable to wait on condition variable");
  }
}

void inline xpthread_cond_signal(pthread_cond_t *cond)
{
  if (pthread_cond_signal(cond)) {
    fatal("Unable to signal condition variable");
  }
}

void inline xpthread_cond_broadcast(pthread_cond_t *cond)
{
  if (pthread_cond_broadcast(cond)) {
    fatal("Unable to broadcast condition variable");
  }
}

vsearch-2.21.1/src/fasta.h

/*

  VSEARCH: a versatile open source tool for metagenomics

  Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
  All rights reserved.

  Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
  Department of Informatics, University of Oslo,
  PO Box 1080 Blindern, NO-0316 Oslo, Norway

  This software is dual-licensed and available under a choice
  of one of two licenses, either under the terms of the GNU
  General Public License version 3 or the BSD 2-Clause License.


  GNU General Public License version 3

  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program. If not, see <http://www.gnu.org/licenses/>.


  The BSD 2-Clause License

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
  notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
  CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
  OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
  BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
  LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
  USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
  DAMAGE.

*/

/* fasta input */

void fasta_open_rest(fastx_handle h);
fastx_handle fasta_open(const char * filename);
void fasta_close(fastx_handle h);
bool fasta_next(fastx_handle h,
                bool truncateatspace,
                const unsigned char * char_mapping);
uint64_t fasta_get_position(fastx_handle h);
uint64_t fasta_get_size(fastx_handle h);
uint64_t fasta_get_lineno(fastx_handle h);
uint64_t fasta_get_seqno(fastx_handle h);
char * fasta_get_header(fastx_handle h);
char * fasta_get_sequence(fastx_handle h);
uint64_t fasta_get_header_length(fastx_handle h);
uint64_t fasta_get_sequence_length(fastx_handle h);
int64_t fasta_get_abundance(fastx_handle h);
int64_t fasta_get_abundance_and_presence(fastx_handle h);

/* fasta output */

void fasta_print(FILE * fp, const char * hdr, char * seq, uint64_t len);
void fasta_print_general(FILE * fp,
                         const char * prefix,
                         char * seq,
                         int len,
                         char * header,
                         int header_len,
                         unsigned int abundance,
                         int ordinal,
                         double ee,
                         int clustersize,
                         int clusterid,
                         const char * score_name,
                         double score);
void fasta_print_db(FILE * fp, uint64_t seqno);
void fasta_print_db_relabel(FILE * fp, uint64_t seqno, int ordinal);

vsearch-2.21.1/src/dbindex.h

/*

  VSEARCH: a versatile open source tool for metagenomics

  Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
  All rights reserved.

  Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
  Department of Informatics, University of Oslo,
  PO Box 1080 Blindern, NO-0316 Oslo, Norway

  This software is dual-licensed and available under a choice
  of one of two licenses, either under the terms of the GNU
  General Public License version 3 or the BSD 2-Clause License.


  GNU General Public License version 3

  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program. If not, see <http://www.gnu.org/licenses/>.


  The BSD 2-Clause License

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
  notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
  IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
  CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
  OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
  BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
  LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
  USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
  DAMAGE.

*/

extern unsigned int * kmercount; /* number of matching seqnos for each kmer */
extern uint64_t * kmerhash;      /* index into the list below for each kmer */
extern unsigned int * kmerindex; /* the list of matching seqnos for kmers */
extern bitmap_t * * kmerbitmap;
extern unsigned int * dbindex_map;
extern unsigned int dbindex_count;
extern unsigned int kmerhashsize;
extern uint64_t kmerindexsize;
extern uhandle_s * dbindex_uh;

void fprint_kmer(FILE * f, unsigned int k, uint64_t kmer);

void dbindex_prepare(int use_bitmap, int seqmask);
void dbindex_addallsequences(int seqmask);
void dbindex_addsequence(unsigned int seqno, int seqmask);
void dbindex_free();
void dbindex_udb_write();

inline unsigned char * dbindex_getbitmap(unsigned int kmer)
{
  if (kmerbitmap[kmer]) {
    return kmerbitmap[kmer]->bitmap;
  } else {
    return nullptr;
  }
}

inline unsigned int dbindex_getmatchcount(unsigned int kmer)
{
  return kmercount[kmer];
}

inline unsigned int * dbindex_getmatchlist(unsigned int kmer)
{
  return kmerindex + kmerhash[kmer];
}

inline unsigned int dbindex_getmapping(unsigned int index)
{
  return dbindex_map[index];
}

inline unsigned int dbindex_getcount()
{
  return dbindex_count;
}

vsearch-2.21.1/src/vsearch.h

/*

  VSEARCH: a versatile open source tool for metagenomics

  Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
  All rights reserved.

  Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
  Department of Informatics, University of Oslo,
  PO Box 1080 Blindern, NO-0316 Oslo, Norway

  This software is dual-licensed and available under a choice
  of one of two licenses, either under the terms of the GNU
  General Public License version 3 or the BSD 2-Clause License.


  GNU General Public License version 3

  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program. If not, see <http://www.gnu.org/licenses/>.


  The BSD 2-Clause License

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
  notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #define _GNU_SOURCE 1 #define __STDC_CONSTANT_MACROS 1 #define __STDC_FORMAT_MACROS 1 #define __STDC_LIMIT_MACROS 1 #define __restrict #ifdef HAVE_CONFIG_H #include "config.h" #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* include appropriate regex library */ #ifdef HAVE_REGEX_H #include #else #include #endif #include #include #include #include #include #include #include #define PROG_NAME PACKAGE #define PROG_VERSION PACKAGE_VERSION #ifdef __x86_64__ #define PROG_CPU "x86_64" #include #elif __PPC__ #ifdef __LITTLE_ENDIAN__ #define PROG_CPU "ppc64le" #include #undef bool #else #error Big endian ppc64 CPUs not supported #endif #elif __aarch64__ #define PROG_CPU "aarch64" #include #else #error Unknown architecture (not ppc64le, aarch64 or x86_64) #endif #ifdef _WIN32 #define PROG_OS "win" #include #include #include #define bswap_16(x) _byteswap_ushort(x) #define bswap_32(x) _byteswap_ulong(x) #define bswap_64(x) _byteswap_uint64(x) #elif __APPLE__ #define PROG_OS "macos" #include #include #include #define bswap_16(x) OSSwapInt16(x) #define bswap_32(x) OSSwapInt32(x) #define bswap_64(x) OSSwapInt64(x) #elif __linux__ #define PROG_OS "linux" #include #include #include #elif __FreeBSD__ #define PROG_OS "freebsd" #include #include #include #define bswap_16(x) bswap16(x) #define bswap_32(x) bswap32(x) #define bswap_64(x) bswap64(x) #elif __NetBSD__ #define PROG_OS 
"netbsd" #include #include #include #define bswap_16(x) bswap16(x) #define bswap_32(x) bswap32(x) #define bswap_64(x) bswap64(x) /* Alters behavior, but NetBSD 7 does not have getopt_long_only() */ #define getopt_long_only getopt_long #else #define PROG_OS "unknown" #include #include #include #endif #define PROG_ARCH PROG_OS "_" PROG_CPU #ifdef HAVE_DLFCN_H #include #endif #ifdef HAVE_ZLIB_H #include #endif #ifdef HAVE_BZLIB_H #include #endif #include "city.h" #include "md5.h" #include "sha1.h" #include "arch.h" #include "dynlibs.h" #include "util.h" #include "xstring.h" #include "align_simd.h" #include "maps.h" #include "attributes.h" #include "db.h" #include "align.h" #include "unique.h" #include "bitmap.h" #include "dbindex.h" #include "minheap.h" #include "search.h" #include "linmemalign.h" #include "searchcore.h" #include "showalign.h" #include "userfields.h" #include "results.h" #include "sortbysize.h" #include "sortbylength.h" #include "derep.h" #include "shuffle.h" #include "mask.h" #include "cluster.h" #include "msa.h" #include "chimera.h" #include "cpu.h" #include "allpairs.h" #include "subsample.h" #include "fastx.h" #include "fasta.h" #include "fastq.h" #include "fastqops.h" #include "filter.h" #include "dbhash.h" #include "searchexact.h" #include "mergepairs.h" #include "eestats.h" #include "rerep.h" #include "otutable.h" #include "udb.h" #include "kmerhash.h" #include "tax.h" #include "sintax.h" #include "fastqjoin.h" #include "sffconvert.h" #include "getseq.h" #include "cut.h" #include "orient.h" #include "fa2fq.h" /* options */ extern bool opt_bzip2_decompress; extern bool opt_clusterout_id; extern bool opt_clusterout_sort; extern bool opt_eeout; extern bool opt_fasta_score; extern bool opt_fastq_allowmergestagger; extern bool opt_fastq_eeout; extern bool opt_fastq_nostagger; extern bool opt_gzip_decompress; extern bool opt_label_substr_match; extern bool opt_no_progress; extern bool opt_fastq_qout_max; extern bool opt_quiet; extern bool 
opt_relabel_keep; extern bool opt_relabel_md5; extern bool opt_relabel_self; extern bool opt_relabel_sha1; extern bool opt_samheader; extern bool opt_sff_clip; extern bool opt_sizeorder; extern bool opt_xee; extern bool opt_xsize; extern char * opt_allpairs_global; extern char * opt_alnout; extern char * opt_biomout; extern char * opt_blast6out; extern char * opt_borderline; extern char * opt_centroids; extern char * opt_chimeras; extern char * opt_cluster_fast; extern char * opt_cluster_size; extern char * opt_cluster_smallmem; extern char * opt_cluster_unoise; extern char * opt_clusters; extern char * opt_consout; extern char * opt_cut; extern char * opt_cut_pattern; extern char * opt_db; extern char * opt_dbmatched; extern char * opt_dbnotmatched; extern char * opt_derep_fulllength; extern char * opt_derep_id; extern char * opt_derep_prefix; extern char * opt_eetabbedout; extern char * opt_fasta2fastq; extern char * opt_fastaout; extern char * opt_fastaout_discarded; extern char * opt_fastaout_discarded_rev; extern char * opt_fastaout_notmerged_fwd; extern char * opt_fastaout_notmerged_rev; extern char * opt_fastaout_rev; extern char * opt_fastapairs; extern char * opt_fastq_chars; extern char * opt_fastq_convert; extern char * opt_fastq_eestats2; extern char * opt_fastq_eestats; extern char * opt_fastq_filter; extern char * opt_fastq_join; extern char * opt_fastq_mergepairs; extern char * opt_fastq_stats; extern char * opt_fastqout; extern char * opt_fastqout_discarded; extern char * opt_fastqout_discarded_rev; extern char * opt_fastqout_rev; extern char * opt_fastqout_notmerged_fwd; extern char * opt_fastqout_notmerged_rev; extern char * opt_fastx_filter; extern char * opt_fastx_getseq; extern char * opt_fastx_getseqs; extern char * opt_fastx_getsubseq; extern char * opt_fastx_mask; extern char * opt_fastx_revcomp; extern char * opt_fastx_subsample; extern char * opt_fastx_uniques; extern char * opt_join_padgap; extern char * opt_join_padgapq; extern char * 
opt_label; extern char * opt_label_suffix; extern char * opt_labels; extern char * opt_label_word; extern char * opt_label_words; extern char * opt_label_field; extern char * opt_lcaout; extern char * opt_log; extern char * opt_makeudb_usearch; extern char * opt_maskfasta; extern char * opt_matched; extern char * opt_mothur_shared_out; extern char * opt_msaout; extern char * opt_nonchimeras; extern char * opt_notmatched; extern char * opt_notmatchedfq; extern char * opt_orient; extern char * opt_otutabout; extern char * opt_output; extern char * opt_pattern; extern char * opt_profile; extern char * opt_qsegout; extern char * opt_relabel; extern char * opt_rereplicate; extern char * opt_reverse; extern char * opt_samout; extern char * opt_sample; extern char * opt_search_exact; extern char * opt_sff_convert; extern char * opt_shuffle; extern char * opt_sintax; extern char * opt_sortbylength; extern char * opt_sortbysize; extern char * opt_tabbedout; extern char * opt_tsegout; extern char * opt_uc; extern char * opt_uchime2_denovo; extern char * opt_uchime3_denovo; extern char * opt_uchime_denovo; extern char * opt_uchime_ref; extern char * opt_uchimealns; extern char * opt_uchimeout; extern char * opt_udb2fasta; extern char * opt_udbinfo; extern char * opt_udbstats; extern char * opt_usearch_global; extern char * opt_userout; extern double * opt_ee_cutoffs_values; extern double opt_abskew; extern double opt_dn; extern double opt_fastq_maxdiffpct; extern double opt_fastq_maxee; extern double opt_fastq_maxee_rate; extern double opt_fastq_truncee; extern double opt_id; extern double opt_lca_cutoff; extern double opt_max_unmasked_pct; extern double opt_maxid; extern double opt_maxqt; extern double opt_maxsizeratio; extern double opt_maxsl; extern double opt_mid; extern double opt_min_unmasked_pct; extern double opt_mindiv; extern double opt_minh; extern double opt_minqt; extern double opt_minsizeratio; extern double opt_minsl; extern double opt_query_cov; extern double 
opt_sample_pct; extern double opt_sintax_cutoff; extern double opt_target_cov; extern double opt_unoise_alpha; extern double opt_weak_id; extern double opt_xn; extern int opt_acceptall; extern int opt_alignwidth; extern int opt_cons_truncate; extern int opt_ee_cutoffs_count; extern int opt_gap_extension_query_interior; extern int opt_gap_extension_query_left; extern int opt_gap_extension_query_right; extern int opt_gap_extension_target_interior; extern int opt_gap_extension_target_left; extern int opt_gap_extension_target_right; extern int opt_gap_open_query_interior; extern int opt_gap_open_query_left; extern int opt_gap_open_query_right; extern int opt_gap_open_target_interior; extern int opt_gap_open_target_left; extern int opt_gap_open_target_right; extern int opt_help; extern int opt_length_cutoffs_increment; extern int opt_length_cutoffs_longest; extern int opt_length_cutoffs_shortest; extern int opt_mindiffs; extern int opt_slots; extern int opt_uchimeout5; extern int opt_usersort; extern int opt_version; extern int64_t opt_dbmask; extern int64_t opt_fasta_width; extern int64_t opt_fastq_ascii; extern int64_t opt_fastq_asciiout; extern int64_t opt_fastq_maxdiffs; extern int64_t opt_fastq_maxlen; extern int64_t opt_fastq_maxmergelen; extern int64_t opt_fastq_maxns; extern int64_t opt_fastq_minlen; extern int64_t opt_fastq_minmergelen; extern int64_t opt_fastq_minovlen; extern int64_t opt_fastq_qmax; extern int64_t opt_fastq_qmaxout; extern int64_t opt_fastq_qmin; extern int64_t opt_fastq_qminout; extern int64_t opt_fastq_stripleft; extern int64_t opt_fastq_stripright; extern int64_t opt_fastq_tail; extern int64_t opt_fastq_trunclen; extern int64_t opt_fastq_trunclen_keep; extern int64_t opt_fastq_truncqual; extern int64_t opt_fulldp; extern int64_t opt_hardmask; extern int64_t opt_iddef; extern int64_t opt_idprefix; extern int64_t opt_idsuffix; extern int64_t opt_leftjust; extern int64_t opt_match; extern int64_t opt_maxaccepts; extern int64_t opt_maxdiffs; 
extern int64_t opt_maxgaps; extern int64_t opt_maxhits; extern int64_t opt_maxqsize; extern int64_t opt_maxrejects; extern int64_t opt_maxseqlength; extern int64_t opt_maxsize; extern int64_t opt_maxsubs; extern int64_t opt_maxuniquesize; extern int64_t opt_mincols; extern int64_t opt_minseqlength; extern int64_t opt_minsize; extern int64_t opt_mintsize; extern int64_t opt_minuniquesize; extern int64_t opt_minwordmatches; extern int64_t opt_mismatch; extern int64_t opt_notrunclabels; extern int64_t opt_output_no_hits; extern int64_t opt_qmask; extern int64_t opt_randseed; extern int64_t opt_rightjust; extern int64_t opt_rowlen; extern int64_t opt_sample_size; extern int64_t opt_self; extern int64_t opt_selfid; extern int64_t opt_sizein; extern int64_t opt_sizeout; extern int64_t opt_strand; extern int64_t opt_subseq_start; extern int64_t opt_subseq_end; extern int64_t opt_threads; extern int64_t opt_top_hits_only; extern int64_t opt_topn; extern int64_t opt_uc_allhits; extern int64_t opt_wordlength; extern int64_t altivec_present; extern int64_t mmx_present; extern int64_t sse_present; extern int64_t sse2_present; extern int64_t sse3_present; extern int64_t ssse3_present; extern int64_t sse41_present; extern int64_t sse42_present; extern int64_t popcnt_present; extern int64_t avx_present; extern int64_t avx2_present; extern FILE * fp_log; vsearch-2.21.1/src/derep.cc0000644000175000017500000011440414171574117014767 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. 
GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "vsearch.h" #define HASH hash_cityhash64 struct bucket { uint64_t hash; unsigned int seqno_first; unsigned int seqno_last; unsigned int size; unsigned int count; bool deleted; char * header; char * seq; char * qual; }; int derep_compare_prefix(const void * a, const void * b) { auto * x = (struct bucket *) a; auto * y = (struct bucket *) b; /* highest abundance first, then by label, otherwise keep order */ if (x->deleted > y->deleted) { return +1; } else if (x->deleted < y->deleted) { return -1; } else { if (x->size < y->size) { return +1; } else if (x->size > y->size) { return -1; } else { int r = strcmp(db_getheader(x->seqno_first), db_getheader(y->seqno_first)); if (r != 0) { return r; } else { if (x->seqno_first < y->seqno_first) { return -1; } else if (x->seqno_first > y->seqno_first) { return +1; } else { return 0; } } } } } int derep_compare_full(const void * a, const void * b) { auto * x = (struct bucket *) a; auto * y = (struct bucket *) b; /* highest abundance first, then by label, otherwise keep order */ if (x->deleted > y->deleted) { return +1; } else if (x->deleted < y->deleted) { return -1; } else { if (x->size < y->size) { return +1; } else if (x->size > y->size) { return -1; } else { if (x->size == 0) { return 0; } int r = strcmp(x->header, y->header); if (r != 0) { return r; } else { if (x->seqno_first < y->seqno_first) { return -1; } else if (x->seqno_first > y->seqno_first) { return +1; } else { return 0; } } } } } int seqcmp(char * a, char * b, int n) { char * p = a; char * q = b; if (n <= 0) { return 0; } while ((n-- > 0) && (chrmap_4bit[(int)(*p)] == chrmap_4bit[(int)(*q)])) { if ((n == 0) || (*p == 0) || (*q == 0)) { break; } p++; q++; } return chrmap_4bit[(int)(*p)] - chrmap_4bit[(int)(*q)]; } void rehash(struct bucket * * hashtableref, int64_t alloc_clusters) { /* double the size of the hash table: - allocate the new hash table - rehash all entries from the old to the new table - free the old table - update variables */ struct 
bucket * old_hashtable = * hashtableref; uint64_t old_hashtablesize = 2 * alloc_clusters; uint64_t new_hashtablesize = 2 * old_hashtablesize; uint64_t new_hash_mask = new_hashtablesize - 1; auto * new_hashtable = (struct bucket *) xmalloc(sizeof(bucket) * new_hashtablesize); memset(new_hashtable, 0, sizeof(bucket) * new_hashtablesize); /* rehash all */ for(uint64_t i = 0; i < old_hashtablesize; i++) { struct bucket * old_bp = old_hashtable + i; if (old_bp->size) { uint64_t k = old_bp->hash & new_hash_mask; while (new_hashtable[k].size) { k = (k + 1) & new_hash_mask; } struct bucket * new_bp = new_hashtable + k; * new_bp = * old_bp; } } xfree(old_hashtable); * hashtableref = new_hashtable; } inline double convert_q_to_p(int q) { int x = q - opt_fastq_ascii; if (x < 2) { return 0.75; } else { return exp10(-x/10.0); } } inline int convert_p_to_q(double p) { /* int q = round(-10.0 * log10(p)); */ int q = int(-10.0 * log10(p)); q = MIN(q, opt_fastq_qmaxout); q = MAX(q, opt_fastq_qminout); return opt_fastq_asciiout + q; } void derep(char * input_filename, bool use_header) { /* dereplicate full length sequences, optionally require identical headers */ /* derep_fulllength output options: --output, --uc (only FASTA, deprecated) fastx_uniques output options: --fastaout, --fastqout, --uc, --tabbedout */ show_rusage(); fastx_handle h = fastx_open(input_filename); if (!h) { fatal("Unrecognized input file type (not proper FASTA or FASTQ format)"); } if (!
fastx_is_empty(h)) { if (fastx_is_fastq(h)) { if (!opt_fastx_uniques) fatal("FASTQ input is only allowed with the fastx_uniques command"); } else { if (opt_fastqout) fatal("Cannot write FASTQ output when input file is not in FASTQ format"); if (opt_tabbedout) fatal("Cannot write tab separated output file when input file is not in FASTQ format"); } } FILE * fp_fastaout = nullptr; FILE * fp_fastqout = nullptr; FILE * fp_uc = nullptr; FILE * fp_tabbedout = nullptr; if (opt_fastx_uniques) { if ((!opt_uc) && (!opt_fastaout) && (!opt_fastqout) && (!opt_tabbedout)) fatal("Output file for dereplication with fastx_uniques must be specified with --fastaout, --fastqout, --tabbedout, or --uc"); } else { if ((!opt_output) && (!opt_uc)) fatal("Output file for dereplication must be specified with --output or --uc"); } if (opt_fastx_uniques) { if (opt_fastaout) { fp_fastaout = fopen_output(opt_fastaout); if (!fp_fastaout) { fatal("Unable to open FASTA output file for writing"); } } if (opt_fastqout) { fp_fastqout = fopen_output(opt_fastqout); if (!fp_fastqout) { fatal("Unable to open FASTQ output file for writing"); } } if (opt_tabbedout) { fp_tabbedout = fopen_output(opt_tabbedout); if (!fp_tabbedout) { fatal("Unable to open tab delimited output file for writing"); } } } else { if (opt_output) { fp_fastaout = fopen_output(opt_output); if (!fp_fastaout) { fatal("Unable to open FASTA output file for writing"); } } } if (opt_uc) { fp_uc = fopen_output(opt_uc); if (!fp_uc) { fatal("Unable to open output (uc) file for writing"); } } uint64_t filesize = fastx_get_size(h); /* allocate initial memory for 1024 clusters with sequences of length 1023 */ uint64_t alloc_clusters = 1024; uint64_t alloc_seqs = 1024; int64_t alloc_seqlen = 1023; uint64_t hashtablesize = 2 * alloc_clusters; uint64_t hash_mask = hashtablesize - 1; auto * hashtable = (struct bucket *) xmalloc(sizeof(bucket) * hashtablesize); memset(hashtable, 0, sizeof(bucket) * hashtablesize); show_rusage(); unsigned int * 
nextseqtab = nullptr; const auto terminal = (unsigned int)(-1); char ** headertab = nullptr; char * match_strand = nullptr; bool extra_info = opt_uc || opt_tabbedout; if (extra_info) { /* If the uc or tabbedout option is in effect, we need to keep some extra info. Allocate and init memory for this. */ /* Links to other sequences in cluster */ nextseqtab = (unsigned int*) xmalloc(sizeof(unsigned int) * alloc_seqs); memset(nextseqtab, terminal, sizeof(unsigned int) * alloc_seqs); /* Pointers to the header strings */ headertab = (char **) xmalloc(sizeof(char*) * alloc_seqs); memset(headertab, 0, sizeof(char*) * alloc_seqs); /* Matching strand */ match_strand = (char *) xmalloc(alloc_seqs); memset(match_strand, 0, alloc_seqs); } show_rusage(); char * seq_up = (char*) xmalloc(alloc_seqlen + 1); char * rc_seq_up = (char*) xmalloc(alloc_seqlen + 1); char * prompt = nullptr; if (xsprintf(& prompt, "Dereplicating file %s", input_filename) == -1) { fatal("Out of memory"); } progress_init(prompt, filesize); uint64_t sequencecount = 0; uint64_t nucleotidecount = 0; int64_t shortest = INT64_MAX; int64_t longest = 0; uint64_t discarded_short = 0; uint64_t discarded_long = 0; uint64_t clusters = 0; int64_t sumsize = 0; uint64_t maxsize = 0; double median = 0.0; double average = 0.0; while(fastx_next(h, ! 
opt_notrunclabels, chrmap_no_change)) { int64_t seqlen = fastx_get_sequence_length(h); if (seqlen < opt_minseqlength) { discarded_short++; continue; } if (seqlen > opt_maxseqlength) { discarded_long++; continue; } nucleotidecount += seqlen; if (seqlen > longest) { longest = seqlen; } if (seqlen < shortest) { shortest = seqlen; } /* check allocations */ if (seqlen > alloc_seqlen) { alloc_seqlen = seqlen; seq_up = (char*) xrealloc(seq_up, alloc_seqlen + 1); rc_seq_up = (char*) xrealloc(rc_seq_up, alloc_seqlen + 1); show_rusage(); } if (extra_info && (sequencecount + 1 > alloc_seqs)) { uint64_t new_alloc_seqs = 2 * alloc_seqs; nextseqtab = (unsigned int*) xrealloc(nextseqtab, sizeof(unsigned int) * new_alloc_seqs); memset(nextseqtab + alloc_seqs, terminal, sizeof(unsigned int) * alloc_seqs); headertab = (char**) xrealloc(headertab, sizeof(char*) * new_alloc_seqs); memset(headertab + alloc_seqs, 0, sizeof(char*) * alloc_seqs); match_strand = (char *) xrealloc(match_strand, new_alloc_seqs); memset(match_strand + alloc_seqs, 0, alloc_seqs); alloc_seqs = new_alloc_seqs; show_rusage(); } if (clusters + 1 > alloc_clusters) { uint64_t new_alloc_clusters = 2 * alloc_clusters; rehash(& hashtable, alloc_clusters); alloc_clusters = new_alloc_clusters; hashtablesize = 2 * alloc_clusters; hash_mask = hashtablesize - 1; show_rusage(); } char * seq = fastx_get_sequence(h); char * header = fastx_get_header(h); int64_t headerlen = fastx_get_header_length(h); char * qual = fastx_get_quality(h); // nullptr if FASTA /* normalize sequence: uppercase and replace U by T */ string_normalize(seq_up, seq, seqlen); /* reverse complement if necessary */ if (opt_strand > 1) { reverse_complement(rc_seq_up, seq_up, seqlen); } /* Find free bucket or bucket for identical sequence. Make sure sequences are exactly identical in case of any hash collision. With 64-bit hashes, there is about 50% chance of a collision when the number of sequences is about 5e9. 
*/ uint64_t hash_header; if (use_header) { hash_header = HASH(header, headerlen); } else { hash_header = 0; } uint64_t hash = HASH(seq_up, seqlen) ^ hash_header; uint64_t j = hash & hash_mask; struct bucket * bp = hashtable + j; while ((bp->size) && ((hash != bp->hash) || (seqcmp(seq_up, bp->seq, seqlen)) || (use_header && strcmp(header, bp->header)))) { j = (j+1) & hash_mask; bp = hashtable + j; } if ((opt_strand > 1) && !bp->size) { /* no match on plus strand */ /* check minus strand as well */ uint64_t rc_hash = HASH(rc_seq_up, seqlen) ^ hash_header; uint64_t k = rc_hash & hash_mask; struct bucket * rc_bp = hashtable + k; while ((rc_bp->size) && ((rc_hash != rc_bp->hash) || (seqcmp(rc_seq_up, rc_bp->seq, seqlen)) || (use_header && strcmp(header, rc_bp->header)))) { k = (k+1) & hash_mask; rc_bp = hashtable + k; } if (rc_bp->size) { bp = rc_bp; j = k; if (extra_info) { match_strand[sequencecount] = 1; } } } int abundance = fastx_get_abundance(h); int64_t ab = opt_sizein ? abundance : 1; sumsize += ab; if (bp->size) { /* at least one identical sequence already */ if (extra_info) { unsigned int last = bp->seqno_last; nextseqtab[last] = sequencecount; bp->seqno_last = sequencecount; headertab[sequencecount] = xstrdup(header); } int64_t s1 = bp->size; int64_t s2 = ab; int64_t s3 = s1 + s2; if (opt_fastqout) { /* update quality scores */ for (int i = 0; i < seqlen; i++) { int q1 = bp->qual[i]; int q2 = qual[i]; double p1 = convert_q_to_p(q1); double p2 = convert_q_to_p(q2); double p3; /* how to compute the new quality score?
*/ if (opt_fastq_qout_max) { // fastq_qout_max /* min error prob, highest quality */ p3 = MIN(p1, p2); } else { // fastq_qout_avg /* average, as in USEARCH */ p3 = (p1 * s1 + p2 * s2) / s3; } // fastq_qout_min /* max error prob, lowest quality */ // p3 = MAX(p1, p2); // fastq_qout_first /* keep first */ // p3 = p1; // fastq_qout_last /* keep last */ // p3 = p2; // fastq_qout_ef /* Compute as multiple independent observations Edgar & Flyvbjerg (2015) But what about s1 and s2? */ // p3 = p1 * p2 / 3.0 / (1.0 - p1 - p2 + (4.0 * p1 * p2 / 3.0)); /* always worst quality possible, certain error */ // p3 = 1.0; // always best quality possible, perfect, no errors */ // p3 = 0.0; int q3 = convert_p_to_q(p3); bp->qual[i] = q3; } } bp->size = s3; bp->count++; } else { /* no identical sequences yet */ bp->size = ab; bp->hash = hash; bp->seqno_first = sequencecount; bp->seqno_last = sequencecount; bp->seq = xstrdup(seq); bp->header = xstrdup(header); bp->count = 1; if (qual) bp->qual = xstrdup(qual); else bp->qual = nullptr; clusters++; } if (bp->size > maxsize) { maxsize = bp->size; } sequencecount++; progress_update(fastx_get_position(h)); } progress_done(); xfree(prompt); fastx_close(h); show_rusage(); if (!opt_quiet) { if (sequencecount > 0) { fprintf(stderr, "%'" PRIu64 " nt in %'" PRIu64 " seqs, min %'" PRIu64 ", max %'" PRIu64 ", avg %'.0f\n", nucleotidecount, sequencecount, shortest, longest, nucleotidecount * 1.0 / sequencecount); } else { fprintf(stderr, "%'" PRIu64 " nt in %'" PRIu64 " seqs\n", nucleotidecount, sequencecount); } } if (opt_log) { if (sequencecount > 0) { fprintf(fp_log, "%'" PRIu64 " nt in %'" PRIu64 " seqs, min %'" PRIu64 ", max %'" PRIu64 ", avg %'.0f\n", nucleotidecount, sequencecount, shortest, longest, nucleotidecount * 1.0 / sequencecount); } else { fprintf(fp_log, "%'" PRIu64 " nt in %'" PRIu64 " seqs\n", nucleotidecount, sequencecount); } } if (discarded_short) { fprintf(stderr, "minseqlength %" PRId64 ": %" PRId64 " %s discarded.\n", 
opt_minseqlength, discarded_short, (discarded_short == 1 ? "sequence" : "sequences")); if (opt_log) { fprintf(fp_log, "minseqlength %" PRId64 ": %" PRId64 " %s discarded.\n\n", opt_minseqlength, discarded_short, (discarded_short == 1 ? "sequence" : "sequences")); } } if (discarded_long) { fprintf(stderr, "maxseqlength %" PRId64 ": %" PRId64 " %s discarded.\n", opt_maxseqlength, discarded_long, (discarded_long == 1 ? "sequence" : "sequences")); if (opt_log) { fprintf(fp_log, "maxseqlength %" PRId64 ": %" PRId64 " %s discarded.\n\n", opt_maxseqlength, discarded_long, (discarded_long == 1 ? "sequence" : "sequences")); } } xfree(seq_up); xfree(rc_seq_up); show_rusage(); progress_init("Sorting", 1); qsort(hashtable, hashtablesize, sizeof(struct bucket), derep_compare_full); progress_done(); show_rusage(); if (clusters > 0) { if (clusters % 2) { median = hashtable[(clusters-1)/2].size; } else { median = (hashtable[(clusters/2)-1].size + hashtable[clusters/2].size) / 2.0; } /* compute the average only when there is at least one cluster */ average = 1.0 * sumsize / clusters; } if (clusters < 1) { if (!opt_quiet) { fprintf(stderr, "0 unique sequences\n"); } if (opt_log) { fprintf(fp_log, "0 unique sequences\n\n"); } } else { if (!opt_quiet) { fprintf(stderr, "%" PRId64 " unique sequences, avg cluster %.1lf, median %.0f, max %" PRIu64 "\n", clusters, average, median, maxsize); } if (opt_log) { fprintf(fp_log, "%" PRId64 " unique sequences, avg cluster %.1lf, median %.0f, max %" PRIu64 "\n\n", clusters, average, median, maxsize); } } /* count selected */ uint64_t selected = 0; for (uint64_t i = 0; i < clusters; i++) { struct bucket * bp = hashtable + i; int64_t size = bp->size; if ((size >= opt_minuniquesize) && (size <= opt_maxuniquesize)) { selected++; if (selected == (uint64_t) opt_topn) { break; } } } show_rusage(); /* write output */ if (opt_output || opt_fastaout) { progress_init("Writing FASTA output file", clusters); int64_t relabel_count = 0; for (uint64_t i = 0; i < clusters; i++) { struct bucket * bp = hashtable + i; int64_t size = bp->size; if ((size >= opt_minuniquesize) && (size <= opt_maxuniquesize)) { relabel_count++; fasta_print_general(fp_fastaout, nullptr, bp->seq,
strlen(bp->seq), bp->header, strlen(bp->header), size, relabel_count, -1.0, -1, -1, nullptr, 0.0); if (relabel_count == opt_topn) { break; } } progress_update(i); } progress_done(); fclose(fp_fastaout); } if (opt_fastqout) { progress_init("Writing FASTQ output file", clusters); int64_t relabel_count = 0; for (uint64_t i = 0; i < clusters; i++) { struct bucket * bp = hashtable + i; int64_t size = bp->size; if ((size >= opt_minuniquesize) && (size <= opt_maxuniquesize)) { relabel_count++; fastq_print_general(fp_fastqout, bp->seq, strlen(bp->seq), bp->header, strlen(bp->header), bp->qual, size, relabel_count, -1.0); if (relabel_count == opt_topn) { break; } } progress_update(i); } progress_done(); fclose(fp_fastqout); } show_rusage(); if (opt_uc) { progress_init("Writing uc file, first part", clusters); for (uint64_t i = 0; i < clusters; i++) { struct bucket * bp = hashtable + i; char * hh = bp->header; int64_t len = strlen(bp->seq); fprintf(fp_uc, "S\t%" PRId64 "\t%" PRId64 "\t*\t*\t*\t*\t*\t%s\t*\n", i, len, hh); for (unsigned int next = nextseqtab[bp->seqno_first]; next != terminal; next = nextseqtab[next]) { fprintf(fp_uc, "H\t%" PRId64 "\t%" PRId64 "\t%.1f\t%s\t0\t0\t*\t%s\t%s\n", i, len, 100.0, (match_strand[next] ?
"-" : "+"), headertab[next], hh); } progress_update(i); } progress_done(); progress_init("Writing uc file, second part", clusters); for (uint64_t i=0; isize, bp->header); progress_update(i); } fclose(fp_uc); progress_done(); } if (opt_tabbedout) { progress_init("Writing tab separated file", clusters); for (uint64_t i=0; iheader; if (opt_relabel) fprintf(fp_tabbedout, "%s\t%s%" PRIu64 "\t%" PRIu64 "\t%" PRIu64 "\t%u\t%s\n", hh, opt_relabel, i + 1, i, (uint64_t) 0, bp->count, hh); else fprintf(fp_tabbedout, "%s\t%s\t%" PRIu64 "\t%" PRIu64 "\t%u\t%s\n", hh, hh, i, (uint64_t) 0, bp->count, hh); uint64_t j = 1; for (unsigned int next = nextseqtab[bp->seqno_first]; next != terminal; next = nextseqtab[next]) { if (opt_relabel) fprintf(fp_tabbedout, "%s\t%s%" PRIu64 "\t%" PRIu64 "\t%" PRIu64 "\t%u\t%s\n", headertab[next], opt_relabel, i + 1, i, j, bp->count, hh); else fprintf(fp_tabbedout, "%s\t%s\t%" PRIu64 "\t%" PRIu64 "\t%u\t%s\n", headertab[next], hh, i, j, bp->count, hh); j++; } progress_update(i); } fclose(fp_tabbedout); progress_done(); } show_rusage(); if (selected < clusters) { if (!opt_quiet) { fprintf(stderr, "%" PRId64 " uniques written, %" PRId64 " clusters discarded (%.1f%%)\n", selected, clusters - selected, 100.0 * (clusters - selected) / clusters); } if (opt_log) { fprintf(fp_log, "%" PRId64 " uniques written, %" PRId64 " clusters discarded (%.1f%%)\n\n", selected, clusters - selected, 100.0 * (clusters - selected) / clusters); } } show_rusage(); /* Free all seqs and headers */ for (uint64_t i=0; isize) { xfree(bp->seq); xfree(bp->header); if (bp->qual) xfree(bp->qual); } } if (opt_uc) { for (uint64_t i=0; i 1) { fatal("Option '--strand both' not supported with --derep_prefix"); } if (opt_output) { fp_output = fopen_output(opt_output); if (!fp_output) { fatal("Unable to open output file for writing"); } } if (opt_uc) { fp_uc = fopen_output(opt_uc); if (!fp_uc) { fatal("Unable to open output (uc) file for writing"); } } db_read(opt_derep_prefix, 0); 
db_sortbylength_shortest_first(); show_rusage(); int64_t dbsequencecount = db_getsequencecount(); /* adjust size of hash table for 2/3 fill rate */ int64_t hashtablesize = 1; int hash_shift = 0; while (3 * dbsequencecount > 2 * hashtablesize) { hashtablesize <<= 1; hash_shift++; } int hash_mask = hashtablesize - 1; auto * hashtable = (struct bucket *) xmalloc(sizeof(bucket) * hashtablesize); memset(hashtable, 0, sizeof(bucket) * hashtablesize); int64_t clusters = 0; int64_t sumsize = 0; uint64_t maxsize = 0; double median = 0.0; double average = 0.0; /* alloc and init table of links to other sequences in cluster */ auto * nextseqtab = (unsigned int*) xmalloc(sizeof(unsigned int) * dbsequencecount); const auto terminal = (unsigned int)(-1); memset(nextseqtab, -1, sizeof(unsigned int) * dbsequencecount); char * seq_up = (char*) xmalloc(db_getlongestsequence() + 1); /* make table of hash values of prefixes */ unsigned int len_longest = db_getlongestsequence(); unsigned int len_shortest = db_getshortestsequence(); auto * prefix_hashes = (uint64_t *) xmalloc(sizeof(uint64_t) * (len_longest+1)); progress_init("Dereplicating", dbsequencecount); for(int64_t i=0; isize) && ((bp->deleted) || (bp->hash != hash) || (prefix_len != db_getsequencelen(bp->seqno_first)) || (seqcmp(seq_up, db_getsequence(bp->seqno_first), prefix_len)))) { bp++; if (bp >= hashtable + hashtablesize) { bp = hashtable; } } /* at this point, bp points either to (1) a free empty hash bucket, or (2) a bucket with an exact match. */ uint64_t orig_hash = hash; struct bucket * orig_bp = bp; if (bp->size) { /* exact match */ bp->size += ab; unsigned int last = bp->seqno_last; nextseqtab[last] = i; bp->seqno_last = i; if (bp->size > maxsize) { maxsize = bp->size; } } else { /* look for prefix match */ while((! 
bp->size) && (prefix_len > len_shortest)) { prefix_len--; hash = prefix_hashes[prefix_len]; bp = hashtable + (hash & hash_mask); while ((bp->size) && ((bp->deleted) || (bp->hash != hash) || (prefix_len != db_getsequencelen(bp->seqno_first)) || (seqcmp(seq_up, db_getsequence(bp->seqno_first), prefix_len)))) { bp++; if (bp >= hashtable + hashtablesize) { bp = hashtable; } } } if (bp->size) { /* prefix match */ /* get necessary info, then delete prefix from hash */ unsigned int first = bp->seqno_first; unsigned int last = bp->seqno_last; unsigned int size = bp->size; bp->deleted = true; /* create new hash entry */ bp = orig_bp; bp->size = size + ab; bp->hash = orig_hash; bp->seqno_first = i; nextseqtab[i] = first; bp->seqno_last = last; if (bp->size > maxsize) { maxsize = bp->size; } } else { /* no match */ orig_bp->size = ab; orig_bp->hash = orig_hash; orig_bp->seqno_first = i; orig_bp->seqno_last = i; if (ab > maxsize) { maxsize = ab; } clusters++; } } progress_update(i); } progress_done(); xfree(prefix_hashes); xfree(seq_up); show_rusage(); progress_init("Sorting", 1); qsort(hashtable, hashtablesize, sizeof(bucket), derep_compare_prefix); progress_done(); if (clusters > 0) { if (clusters % 2) { median = hashtable[(clusters-1)/2].size; } else { median = (hashtable[(clusters/2)-1].size + hashtable[clusters/2].size) / 2.0; } /* compute the average only when there is at least one cluster */ average = 1.0 * sumsize / clusters; } if (clusters < 1) { if (!opt_quiet) { fprintf(stderr, "0 unique sequences\n"); } if (opt_log) { fprintf(fp_log, "0 unique sequences\n\n"); } } else { if (!opt_quiet) { fprintf(stderr, "%" PRId64 " unique sequences, avg cluster %.1lf, median %.0f, max %" PRIu64 "\n", clusters, average, median, maxsize); } if (opt_log) { fprintf(fp_log, "%" PRId64 " unique sequences, avg cluster %.1lf, median %.0f, max %" PRIu64 "\n\n", clusters, average, median, maxsize); } } show_rusage(); /* count selected */ int64_t selected = 0; for (int64_t i = 0; i < clusters; i++) { struct bucket * bp = hashtable + i; int64_t size = bp->size; if ((size >= opt_minuniquesize) && (size <= opt_maxuniquesize)) {
selected++; if (selected == opt_topn) { break; } } } /* write output */ if (opt_output) { progress_init("Writing output file", clusters); int64_t relabel_count = 0; for (int64_t i = 0; i < clusters; i++) { struct bucket * bp = hashtable + i; int64_t size = bp->size; if ((size >= opt_minuniquesize) && (size <= opt_maxuniquesize)) { relabel_count++; fasta_print_general(fp_output, nullptr, db_getsequence(bp->seqno_first), db_getsequencelen(bp->seqno_first), db_getheader(bp->seqno_first), db_getheaderlen(bp->seqno_first), size, relabel_count, -1.0, -1, -1, nullptr, 0.0); if (relabel_count == opt_topn) { break; } } progress_update(i); } progress_done(); fclose(fp_output); } show_rusage(); if (opt_uc) { progress_init("Writing uc file, first part", clusters); for (int64_t i = 0; i < clusters; i++) { struct bucket * bp = hashtable + i; char * h = db_getheader(bp->seqno_first); int64_t len = db_getsequencelen(bp->seqno_first); fprintf(fp_uc, "S\t%" PRId64 "\t%" PRId64 "\t*\t*\t*\t*\t*\t%s\t*\n", i, len, h); for (unsigned int next = nextseqtab[bp->seqno_first]; next != terminal; next = nextseqtab[next]) { fprintf(fp_uc, "H\t%" PRId64 "\t%" PRIu64 "\t%.1f\t+\t0\t0\t*\t%s\t%s\n", i, db_getsequencelen(next), 100.0, db_getheader(next), h); } progress_update(i); } progress_done(); show_rusage(); progress_init("Writing uc file, second part", clusters); for (int64_t i = 0; i < clusters; i++) { struct bucket * bp = hashtable + i; fprintf(fp_uc, "C\t%" PRId64 "\t%u\t*\t*\t*\t*\t*\t%s\t*\n", i, bp->size, db_getheader(bp->seqno_first)); progress_update(i); } fclose(fp_uc); progress_done(); show_rusage(); } if (selected < clusters) { if (!opt_quiet) { fprintf(stderr, "%" PRId64 " uniques written, %" PRId64 " clusters discarded (%.1f%%)\n", selected, clusters - selected, 100.0 * (clusters - selected) / clusters); } if (opt_log) { fprintf(fp_log, "%" PRId64 " uniques written, %" PRId64 " clusters discarded (%.1f%%)\n\n", selected, clusters - selected, 100.0 * (clusters - selected) / clusters); } } xfree(nextseqtab); xfree(hashtable); db_free(); } vsearch-2.21.1/src/orient.h0000644000175000017500000000467414171574117015041 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri
All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void orient(); vsearch-2.21.1/src/arch.h0000644000175000017500000000636314171574117014453 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #ifdef _WIN32 typedef struct __stat64 xstat_t; #else typedef struct stat xstat_t; #endif uint64_t arch_get_memused(); uint64_t arch_get_memtotal(); long arch_get_cores(); void arch_get_user_system_time(double * user_time, double * system_time); void arch_srandom(); uint64_t arch_random(); void * xmalloc(size_t size); void * xrealloc(void * ptr, size_t size); void xfree(void * ptr); int xfstat(int fd, xstat_t * buf); int xstat(const char * path, xstat_t * buf); uint64_t xlseek(int fd, uint64_t offset, int whence); uint64_t xftello(FILE * stream); int xopen_read(const char * path); int xopen_write(const char * path); const char * xstrcasestr(const char * haystack, const char * needle); #ifdef _WIN32 FARPROC arch_dlsym(HMODULE handle, const char * symbol); #else void * arch_dlsym(void * handle, const char * symbol); #endif vsearch-2.21.1/src/fastx.h0000644000175000017500000001034414171574117014655 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic 
Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ struct fastx_buffer_s { char * data; uint64_t length; uint64_t alloc; uint64_t position; }; void buffer_init(struct fastx_buffer_s * buffer); void buffer_free(struct fastx_buffer_s * buffer); void buffer_extend(struct fastx_buffer_s * dest_buffer, char * source_buf, uint64_t len); void buffer_makespace(struct fastx_buffer_s * buffer, uint64_t x); struct fastx_s { bool is_pipe; bool is_fastq; bool is_empty; FILE * fp; #ifdef HAVE_ZLIB_H gzFile fp_gz; #endif #ifdef HAVE_BZLIB_H BZFILE * fp_bz; #endif struct fastx_buffer_s file_buffer; struct fastx_buffer_s header_buffer; struct fastx_buffer_s sequence_buffer; struct fastx_buffer_s plusline_buffer; struct fastx_buffer_s quality_buffer; uint64_t file_size; uint64_t file_position; uint64_t lineno; uint64_t lineno_start; int64_t seqno; uint64_t stripped_all; uint64_t stripped[256]; int format; }; typedef struct fastx_s * fastx_handle; /* fastx input */ bool fastx_is_fastq(fastx_handle h); bool fastx_is_empty(fastx_handle h); void fastx_filter_header(fastx_handle h, bool truncateatspace); fastx_handle fastx_open(const char * filename); void fastx_close(fastx_handle h); bool fastx_next(fastx_handle h, bool truncateatspace, const unsigned char * char_mapping); uint64_t fastx_get_position(fastx_handle h); uint64_t fastx_get_size(fastx_handle h); uint64_t fastx_get_lineno(fastx_handle h); uint64_t fastx_get_seqno(fastx_handle h); char * fastx_get_header(fastx_handle h); char * fastx_get_sequence(fastx_handle h); 
uint64_t fastx_get_header_length(fastx_handle h); uint64_t fastx_get_sequence_length(fastx_handle h); char * fastx_get_quality(fastx_handle h); int64_t fastx_get_abundance(fastx_handle h); uint64_t fastx_file_fill_buffer(fastx_handle h); vsearch-2.21.1/src/cut.h0000644000175000017500000000467114171574117014331 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void cut(); vsearch-2.21.1/src/search.cc0000644000175000017500000006373714171574117015151 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . 
The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "vsearch.h" static struct searchinfo_s * si_plus; static struct searchinfo_s * si_minus; static pthread_t * pthread; /* global constants/data, no need for synchronization */ static int tophits; /* the maximum number of hits to keep */ static int seqcount; /* number of database sequences */ static pthread_attr_t attr; static fastx_handle query_fasta_h; /* global data protected by mutex */ static pthread_mutex_t mutex_input; static pthread_mutex_t mutex_output; static int qmatches; static uint64 qmatches_abundance; static int queries; static uint64 queries_abundance; static int * dbmatched; static FILE * fp_samout = nullptr; static FILE * fp_alnout = nullptr; static FILE * fp_userout = nullptr; static FILE * fp_blast6out = nullptr; static FILE * fp_uc = nullptr; static FILE * fp_fastapairs = nullptr; static FILE * fp_matched = nullptr; static FILE * fp_notmatched = nullptr; static FILE * fp_dbmatched = nullptr; static FILE * fp_dbnotmatched = nullptr; static FILE * fp_otutabout = nullptr; static FILE * fp_mothur_shared_out = nullptr; static FILE * fp_biomout = nullptr; static FILE * fp_lcaout = nullptr; static FILE * fp_qsegout = nullptr; static FILE * fp_tsegout = nullptr; static int count_matched = 0; static int count_notmatched = 0; void search_output_results(int hit_count, struct hit * hits, char * query_head, int qseqlen, char * qsequence, char * qsequence_rc, int qsize) { xpthread_mutex_lock(&mutex_output); /* show results */ int64_t toreport = MIN(opt_maxhits, hit_count); if (fp_alnout) { results_show_alnout(fp_alnout, hits, toreport, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_lcaout) { results_show_lcaout(fp_lcaout, hits, toreport, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_samout) { results_show_samout(fp_samout, hits, toreport, query_head, qsequence, qseqlen, qsequence_rc); } if (toreport) { double top_hit_id = hits[0].id; if (opt_otutabout || opt_mothur_shared_out || opt_biomout) { otutable_add(query_head, 
db_getheader(hits[0].target), qsize); } for(int t = 0; t < toreport; t++) { struct hit * hp = hits + t; if (opt_top_hits_only && (hp->id < top_hit_id)) { break; } if (fp_fastapairs) { results_show_fastapairs_one(fp_fastapairs, hp, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_qsegout) { results_show_qsegout_one(fp_qsegout, hp, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_tsegout) { results_show_tsegout_one(fp_tsegout, hp, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_uc) { if ((t==0) || opt_uc_allhits) { results_show_uc_one(fp_uc, hp, query_head, qsequence, qseqlen, qsequence_rc, hp->target); } } if (fp_userout) { results_show_userout_one(fp_userout, hp, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_blast6out) { results_show_blast6out_one(fp_blast6out, hp, query_head, qsequence, qseqlen, qsequence_rc); } } } else { if (fp_uc) { results_show_uc_one(fp_uc, nullptr, query_head, qsequence, qseqlen, qsequence_rc, 0); } if (opt_output_no_hits) { if (fp_userout) { results_show_userout_one(fp_userout, nullptr, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_blast6out) { results_show_blast6out_one(fp_blast6out, nullptr, query_head, qsequence, qseqlen, qsequence_rc); } } } if (hit_count) { count_matched++; if (opt_matched) { fasta_print_general(fp_matched, nullptr, qsequence, qseqlen, query_head, strlen(query_head), qsize, count_matched, -1.0, -1, -1, nullptr, 0.0); } } else { count_notmatched++; if (opt_notmatched) { fasta_print_general(fp_notmatched, nullptr, qsequence, qseqlen, query_head, strlen(query_head), qsize, count_notmatched, -1.0, -1, -1, nullptr, 0.0); } } /* update matching db sequences */ for (int i=0; i < hit_count; i++) { if (hits[i].accepted) { dbmatched[hits[i].target]++; } } xpthread_mutex_unlock(&mutex_output); } int search_query(int64_t t) { for (int s = 0; s < opt_strand; s++) { struct searchinfo_s * si = s ? 
si_minus+t : si_plus+t; /* mask query */ if (opt_qmask == MASK_DUST) { dust(si->qsequence, si->qseqlen); } else if ((opt_qmask == MASK_SOFT) && (opt_hardmask)) { hardmask(si->qsequence, si->qseqlen); } /* perform search */ search_onequery(si, opt_qmask); } struct hit * hits; int hit_count; search_joinhits(si_plus + t, opt_strand > 1 ? si_minus + t : nullptr, & hits, & hit_count); search_output_results(hit_count, hits, si_plus[t].query_head, si_plus[t].qseqlen, si_plus[t].qsequence, opt_strand > 1 ? si_minus[t].qsequence : nullptr, si_plus[t].qsize); /* free memory for alignment strings */ for(int i=0; iquery_head_len = query_head_len; si->qseqlen = qseqlen; si->query_no = query_no; si->qsize = qsize; si->strand = s; /* allocate more memory for header and sequence, if necessary */ if (si->query_head_len + 1 > si->query_head_alloc) { si->query_head_alloc = si->query_head_len + 2001; si->query_head = (char*) xrealloc(si->query_head, (size_t)(si->query_head_alloc)); } if (si->qseqlen + 1 > si->seq_alloc) { si->seq_alloc = si->qseqlen + 2001; si->qsequence = (char*) xrealloc(si->qsequence, (size_t)(si->seq_alloc)); } } /* plus strand: copy header and sequence */ strcpy(si_plus[t].query_head, qhead); strcpy(si_plus[t].qsequence, qseq); /* get progress as amount of input file read */ uint64_t progress = fasta_get_position(query_fasta_h); /* let other threads read input */ xpthread_mutex_unlock(&mutex_input); /* minus strand: copy header and reverse complementary sequence */ if (opt_strand > 1) { strcpy(si_minus[t].query_head, si_plus[t].query_head); reverse_complement(si_minus[t].qsequence, si_plus[t].qsequence, si_plus[t].qseqlen); } int match = search_query(t); /* lock mutex for update of global data and output */ xpthread_mutex_lock(&mutex_output); /* update stats */ queries++; queries_abundance += qsize; if (match) { qmatches++; qmatches_abundance += qsize; } /* show progress */ progress_update(progress); xpthread_mutex_unlock(&mutex_output); } else { 
xpthread_mutex_unlock(&mutex_input); break; } } } void search_thread_init(struct searchinfo_s * si) { /* thread specific initialization */ si->uh = unique_init(); si->kmers = (count_t *) xmalloc(seqcount * sizeof(count_t) + 32); si->m = minheap_init(tophits); si->hits = (struct hit *) xmalloc (sizeof(struct hit) * (tophits) * opt_strand); si->qsize = 1; si->query_head_alloc = 0; si->query_head = nullptr; si->seq_alloc = 0; si->qsequence = nullptr; #ifdef COMPARENONVECTORIZED si->nw = nw_init(); #else si->nw = nullptr; #endif si->s = search16_init(opt_match, opt_mismatch, opt_gap_open_query_left, opt_gap_open_target_left, opt_gap_open_query_interior, opt_gap_open_target_interior, opt_gap_open_query_right, opt_gap_open_target_right, opt_gap_extension_query_left, opt_gap_extension_target_left, opt_gap_extension_query_interior, opt_gap_extension_target_interior, opt_gap_extension_query_right, opt_gap_extension_target_right); } void search_thread_exit(struct searchinfo_s * si) { /* thread specific clean up */ search16_exit(si->s); #ifdef COMPARENONVECTORIZED nw_exit(si->nw); #endif unique_exit(si->uh); xfree(si->hits); minheap_exit(si->m); xfree(si->kmers); if (si->query_head) { xfree(si->query_head); } if (si->qsequence) { xfree(si->qsequence); } } void * search_thread_worker(void * vp) { auto t = (int64_t) vp; search_thread_run(t); return nullptr; } void search_thread_worker_run() { /* initialize threads, start them, join them and return */ xpthread_attr_init(&attr); xpthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE); /* init and create worker threads, put them into stand-by mode */ for(int t=0; t seqcount)) { opt_maxrejects = seqcount; } if ((opt_maxaccepts == 0) || (opt_maxaccepts > seqcount)) { opt_maxaccepts = seqcount; } tophits = opt_maxrejects + opt_maxaccepts + MAXDELAYED; if (tophits > seqcount) { tophits = seqcount; } } void search_done() { /* clean up, global */ dbindex_free(); db_free(); if (opt_lcaout) { fclose(fp_lcaout); } if (opt_matched) {
fclose(fp_matched); } if (opt_notmatched) { fclose(fp_notmatched); } if (opt_fastapairs) { fclose(fp_fastapairs); } if (opt_qsegout) { fclose(fp_qsegout); } if (opt_tsegout) { fclose(fp_tsegout); } if (fp_uc) { fclose(fp_uc); } if (fp_blast6out) { fclose(fp_blast6out); } if (fp_userout) { fclose(fp_userout); } if (fp_alnout) { fclose(fp_alnout); } if (fp_samout) { fclose(fp_samout); } show_rusage(); } void usearch_global(char * cmdline, char * progheader) { search_prep(cmdline, progheader); if (opt_dbmatched) { fp_dbmatched = fopen_output(opt_dbmatched); if (! fp_dbmatched) { fatal("Unable to open dbmatched output file for writing"); } } if (opt_dbnotmatched) { fp_dbnotmatched = fopen_output(opt_dbnotmatched); if (! fp_dbnotmatched) { fatal("Unable to open dbnotmatched output file for writing"); } } /* one int counter per database sequence */ dbmatched = (int*) xmalloc(seqcount * sizeof(int)); memset(dbmatched, 0, seqcount * sizeof(int)); otutable_init(); /* prepare reading of queries */ qmatches = 0; qmatches_abundance = 0; queries = 0; queries_abundance = 0; query_fasta_h = fasta_open(opt_usearch_global); /* allocate memory for thread info */ si_plus = (struct searchinfo_s *) xmalloc(opt_threads * sizeof(struct searchinfo_s)); if (opt_strand > 1) { si_minus = (struct searchinfo_s *) xmalloc(opt_threads * sizeof(struct searchinfo_s)); } else { si_minus = nullptr; } pthread = (pthread_t *) xmalloc(opt_threads * sizeof(pthread_t)); /* init mutexes for input and output */ xpthread_mutex_init(&mutex_input, nullptr); xpthread_mutex_init(&mutex_output, nullptr); progress_init("Searching", fasta_get_size(query_fasta_h)); search_thread_worker_run(); progress_done(); xpthread_mutex_destroy(&mutex_output); xpthread_mutex_destroy(&mutex_input); xfree(pthread); xfree(si_plus); if (si_minus) { xfree(si_minus); } fasta_close(query_fasta_h); if (!opt_quiet) { fprintf(stderr, "Matching unique query sequences: %d of %d", qmatches, queries); if (queries > 0) { fprintf(stderr, " (%.2f%%)", 100.0 * qmatches / queries); }
fprintf(stderr, "\n"); if (opt_sizein) { fprintf(stderr, "Matching total query sequences: %" PRIu64 " of %" PRIu64, qmatches_abundance, queries_abundance); if (queries_abundance > 0) { fprintf(stderr, " (%.2f%%)", 100.0 * qmatches_abundance / queries_abundance); } fprintf(stderr, "\n"); } } if (opt_log) { fprintf(fp_log, "Matching unique query sequences: %d of %d", qmatches, queries); if (queries > 0) { fprintf(fp_log, " (%.2f%%)", 100.0 * qmatches / queries); } fprintf(fp_log, "\n"); if (opt_sizein) { fprintf(fp_log, "Matching total query sequences: %" PRIu64 " of %" PRIu64, qmatches_abundance, queries_abundance); if (queries_abundance > 0) { fprintf(fp_log, " (%.2f%%)", 100.0 * qmatches_abundance / queries_abundance); } fprintf(fp_log, "\n"); } } if (opt_biomout) { otutable_print_biomout(fp_biomout); fclose(fp_biomout); } if (opt_otutabout) { otutable_print_otutabout(fp_otutabout); fclose(fp_otutabout); } if (opt_mothur_shared_out) { otutable_print_mothur_shared_out(fp_mothur_shared_out); fclose(fp_mothur_shared_out); } otutable_done(); int count_dbmatched = 0; int count_dbnotmatched = 0; if (opt_dbmatched || opt_dbnotmatched) { for(int64_t i=0; i, Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. 
You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" fastx_handle fasta_open(const char * filename) { fastx_handle h = fastx_open(filename); if (fastx_is_fastq(h) && ! h->is_empty) { fatal("FASTA file expected, FASTQ file found (%s)", filename); } return h; } void fasta_close(fastx_handle h) { fastx_close(h); } void fasta_filter_sequence(fastx_handle h, unsigned int * char_action, const unsigned char * char_mapping) { /* Strip unwanted characters from the sequence and raise warnings or errors on certain characters. 
*/ char * p = h->sequence_buffer.data; char * q = p; char c; char msg[200]; while ((c = *p++)) { char m = char_action[(unsigned char)c]; switch(m) { case 0: /* stripped */ h->stripped_all++; h->stripped[(unsigned char)c]++; break; case 1: /* legal character */ *q++ = char_mapping[(unsigned char)(c)]; break; case 2: /* fatal character */ if ((c>=32) && (c<127)) { snprintf(msg, 200, "Illegal character '%c' in sequence on line %" PRIu64 " of FASTA file", (unsigned char)c, h->lineno); } else { snprintf(msg, 200, "Illegal unprintable ASCII character no %d in sequence on line %" PRIu64 " of FASTA file", (unsigned char) c, h->lineno); } fatal(msg); break; case 3: /* silently stripped chars (whitespace) */ break; case 4: /* newline (silently stripped) */ h->lineno++; break; } } /* add zero after sequence */ *q = 0; h->sequence_buffer.length = q - h->sequence_buffer.data; } bool fasta_next(fastx_handle h, bool truncateatspace, const unsigned char * char_mapping) { h->lineno_start = h->lineno; h->header_buffer.length = 0; h->header_buffer.data[0] = 0; h->sequence_buffer.length = 0; h->sequence_buffer.data[0] = 0; uint64_t rest = fastx_file_fill_buffer(h); if (rest == 0) { return false; } /* read header */ /* check initial > character */ if (h->file_buffer.data[h->file_buffer.position] != '>') { fprintf(stderr, "Found character %02x\n", (unsigned char)(h->file_buffer.data[h->file_buffer.position])); fatal("Invalid FASTA - header must start with > character"); } h->file_buffer.position++; rest--; char * lf = nullptr; while (lf == nullptr) { /* get more data if buffer empty*/ rest = fastx_file_fill_buffer(h); if (rest == 0) { fatal("Invalid FASTA - header must be terminated with newline"); } /* find LF */ lf = (char *) memchr(h->file_buffer.data + h->file_buffer.position, '\n', rest); /* copy to header buffer */ uint64_t len = rest; if (lf) { /* LF found, copy up to and including LF */ len = lf - (h->file_buffer.data + h->file_buffer.position) + 1; h->lineno++; } 
buffer_extend(& h->header_buffer, h->file_buffer.data + h->file_buffer.position, len); h->file_buffer.position += len; rest -= len; } /* read one or more sequence lines */ while (true) { /* get more data, if necessary */ rest = fastx_file_fill_buffer(h); /* end if no more data */ if (rest == 0) { break; } /* end if new sequence starts */ if (lf && (h->file_buffer.data[h->file_buffer.position] == '>')) { break; } /* find LF */ lf = (char *) memchr(h->file_buffer.data + h->file_buffer.position, '\n', rest); uint64_t len = rest; if (lf) { /* LF found, copy up to and including LF */ len = lf - (h->file_buffer.data + h->file_buffer.position) + 1; } buffer_extend(& h->sequence_buffer, h->file_buffer.data + h->file_buffer.position, len); h->file_buffer.position += len; rest -= len; } h->seqno++; fastx_filter_header(h, truncateatspace); fasta_filter_sequence(h, char_fasta_action, char_mapping); return true; } int64_t fasta_get_abundance(fastx_handle h) { // return 1 if not present int64_t size = header_get_size(h->header_buffer.data, h->header_buffer.length); if (size > 0) { return size; } else { return 1; } } int64_t fasta_get_abundance_and_presence(fastx_handle h) { // return 0 if not present return header_get_size(h->header_buffer.data, h->header_buffer.length); } uint64_t fasta_get_position(fastx_handle h) { return h->file_position; } uint64_t fasta_get_size(fastx_handle h) { return h->file_size; } uint64_t fasta_get_lineno(fastx_handle h) { return h->lineno_start; } uint64_t fasta_get_seqno(fastx_handle h) { return h->seqno; } uint64_t fasta_get_header_length(fastx_handle h) { return h->header_buffer.length; } uint64_t fasta_get_sequence_length(fastx_handle h) { return h->sequence_buffer.length; } char * fasta_get_header(fastx_handle h) { return h->header_buffer.data; } char * fasta_get_sequence(fastx_handle h) { return h->sequence_buffer.data; } /* fasta output */ void fasta_print_sequence(FILE * fp, char * seq, uint64_t len, int width) { /* The actual length of the 
sequence may be longer than "len", but only "len" characters are printed. Specify width of lines - zero (or <1) means linearize (all on one line). */ if (width < 1) { fprintf(fp, "%.*s\n", (int)(len), seq); } else { int64_t rest = len; for(uint64_t i=0; i%s\n", hdr); fasta_print_sequence(fp, seq, len, opt_fasta_width); } inline void fprint_seq_label(FILE * fp, char * seq, int len) { /* normalize first? */ fprintf(fp, "%.*s", len, seq); } void fasta_print_general(FILE * fp, const char * prefix, char * seq, int len, char * header, int header_len, unsigned int abundance, int ordinal, double ee, int clustersize, int clusterid, const char * score_name, double score) { fprintf(fp, ">"); if (prefix) { fprintf(fp, "%s", prefix); } if (opt_relabel_self) { fprint_seq_label(fp, seq, len); } else if (opt_relabel_sha1) { fprint_seq_digest_sha1(fp, seq, len); } else if (opt_relabel_md5) { fprint_seq_digest_md5(fp, seq, len); } else if (opt_relabel && (ordinal > 0)) { fprintf(fp, "%s%d", opt_relabel, ordinal); } else { bool xsize = opt_xsize || (opt_sizeout && (abundance > 0)); bool xee = opt_xee || ((opt_eeout || opt_fastq_eeout) && (ee >= 0.0)); header_fprint_strip_size_ee(fp, header, header_len, xsize, xee); } if (opt_label_suffix) { fprintf(fp, "%s", opt_label_suffix); } if (opt_sample) { fprintf(fp, ";sample=%s", opt_sample); } if (clustersize > 0) { fprintf(fp, ";seqs=%d", clustersize); } if (clusterid >= 0) { fprintf(fp, ";clusterid=%d", clusterid); } if (opt_sizeout && (abundance > 0)) { fprintf(fp, ";size=%u", abundance); } if ((opt_eeout || opt_fastq_eeout) && (ee >= 0.0)) { fprintf(fp, ";ee=%.4lf", ee); } if (score_name) { fprintf(fp, ";%s=%.4lf", score_name, score); } if (opt_relabel_keep && ((opt_relabel && (ordinal > 0)) || opt_relabel_sha1 || opt_relabel_md5 || opt_relabel_self)) { fprintf(fp, " %s", header); } fprintf(fp, "\n"); if (seq) { fasta_print_sequence(fp, seq, len, opt_fasta_width); } } void fasta_print_db_relabel(FILE * fp, uint64_t seqno, int ordinal) { 
fasta_print_general(fp, nullptr, db_getsequence(seqno), db_getsequencelen(seqno), db_getheader(seqno), db_getheaderlen(seqno), db_getabundance(seqno), ordinal, -1.0, -1, -1, nullptr, 0.0); } void fasta_print_db(FILE * fp, uint64_t seqno) { fasta_print_general(fp, nullptr, db_getsequence(seqno), db_getsequencelen(seqno), db_getheader(seqno), db_getheaderlen(seqno), db_getabundance(seqno), 0, -1.0, -1, -1, nullptr, 0.0); } vsearch-2.21.1/src/otutable.cc0000644000175000017500000003200414171574117015502 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" /* Identify sample and otu identifiers in headers, and count abundance of the samples in different OTUs. 
http://www.drive5.com/usearch/manual/upp_labels_sample.html http://www.drive5.com/usearch/manual/upp_labels_otus.html TODO: - add relabel @ */ #ifndef HAVE_REGEX_H const std::regex regex_sample("(^|;)(sample|barcodelabel)=([^;]*)($|;)", std::regex::extended); const std::regex regex_otu("(^|;)otu=([^;]*)($|;)", std::regex::extended); const std::regex regex_tax("(^|;)tax=([^;]*)($|;)", std::regex::extended); #endif typedef std::set string_set_t; typedef std::pair string_pair_t; typedef std::map string_pair_map_t; typedef std::map otu_tax_map_t; typedef std::map string_no_map_t; struct otutable_s { #ifdef HAVE_REGEX_H regex_t regex_sample; regex_t regex_otu; regex_t regex_tax; #endif string_set_t otu_set; string_set_t sample_set; string_pair_map_t sample_otu_count; string_pair_map_t otu_sample_count; otu_tax_map_t otu_tax_map; }; static otutable_s * otutable; void otutable_init() { otutable = new otutable_s; #ifdef HAVE_REGEX_H /* compile regular expression matchers */ if (regcomp(&otutable->regex_sample, "(^|;)(sample|barcodelabel)=([^;]*)($|;)", REG_EXTENDED)) { fatal("Compilation of regular expression for sample annotation failed"); } if (regcomp(&otutable->regex_otu, "(^|;)otu=([^;]*)($|;)", REG_EXTENDED)) { fatal("Compilation of regular expression for otu annotation failed"); } if (regcomp(&otutable->regex_tax, "(^|;)tax=([^;]*)($|;)", REG_EXTENDED)) { fatal("Compilation of regular expression for taxonomy annotation failed"); } #endif } void otutable_done() { #ifdef HAVE_REGEX_H regfree(&otutable->regex_sample); regfree(&otutable->regex_otu); regfree(&otutable->regex_tax); #endif otutable->otu_set.clear(); otutable->sample_set.clear(); otutable->sample_otu_count.clear(); otutable->otu_sample_count.clear(); delete otutable; } void otutable_add(char * query_header, char * target_header, int64_t abundance) { /* read sample annotation in query */ int len_sample; char * start_sample = query_header; #ifdef HAVE_REGEX_H regmatch_t pmatch_sample[5]; if 
(!regexec(&otutable->regex_sample, query_header, 5, pmatch_sample, 0)) { /* match: use the matching sample name */ len_sample = pmatch_sample[3].rm_eo - pmatch_sample[3].rm_so; start_sample += pmatch_sample[3].rm_so; } #else std::cmatch cmatch_sample; if (regex_search(query_header, cmatch_sample, regex_sample)) { len_sample = cmatch_sample.length(3); start_sample += cmatch_sample.position(3); } #endif else { /* no match: use first name in header with A-Za-z0-9_ */ len_sample = strspn(query_header, "ABCDEFGHIJKLMNOPQRSTUVWXYZ" "abcdefghijklmnopqrstuvwxyz" "_" "0123456789"); } char * sample_name = (char *) xmalloc(len_sample+1); strncpy(sample_name, start_sample, len_sample); sample_name[len_sample] = 0; /* read OTU annotation in target */ int len_otu; char * start_otu = target_header; #ifdef HAVE_REGEX_H regmatch_t pmatch_otu[4]; if (!regexec(&otutable->regex_otu, target_header, 4, pmatch_otu, 0)) { /* match: use the matching otu name */ len_otu = pmatch_otu[2].rm_eo - pmatch_otu[2].rm_so; start_otu += pmatch_otu[2].rm_so; } #else std::cmatch cmatch_otu; if (regex_search(target_header, cmatch_otu, regex_otu)) { len_otu = cmatch_otu.length(2); start_otu += cmatch_otu.position(2); } #endif else { /* no match: use first name in header up to ; */ len_otu = strcspn(target_header, ";"); } char * otu_name = (char *) xmalloc(len_otu+1); strncpy(otu_name, start_otu, len_otu); otu_name[len_otu] = 0; /* read tax annotation in target */ #ifdef HAVE_REGEX_H char * start_tax = target_header; regmatch_t pmatch_tax[4]; if (!regexec(&otutable->regex_tax, target_header, 4, pmatch_tax, 0)) { /* match: use the matching tax name */ int len_tax = pmatch_tax[2].rm_eo - pmatch_tax[2].rm_so; start_tax += pmatch_tax[2].rm_so; char * tax_name = (char *) xmalloc(len_tax+1); strncpy(tax_name, start_tax, len_tax); tax_name[len_tax] = 0; otutable->otu_tax_map[otu_name] = tax_name; xfree(tax_name); } #else std::cmatch cmatch_tax; if (regex_search(target_header, cmatch_tax, regex_tax)) { 
otutable->otu_tax_map[otu_name] = cmatch_tax.str(2); } #endif /* store data */ otutable->sample_set.insert(sample_name); otutable->otu_set.insert(otu_name); otutable->sample_otu_count[string_pair_t(sample_name,otu_name)] += abundance; otutable->otu_sample_count[string_pair_t(otu_name,sample_name)] += abundance; xfree(otu_name); xfree(sample_name); } void otutable_print_otutabout(FILE * fp) { int64_t progress = 0; progress_init("Writing OTU table (classic)", otutable->otu_set.size()); fprintf(fp, "#OTU ID"); for (const auto & it_sample : otutable->sample_set) { fprintf(fp, "\t%s", it_sample.c_str()); } if (! otutable->otu_tax_map.empty()) { fprintf(fp, "\ttaxonomy"); } fprintf(fp, "\n"); auto it_map = otutable->otu_sample_count.begin(); for (auto it_otu = otutable->otu_set.begin(); it_otu != otutable->otu_set.end(); ++it_otu) { fprintf(fp, "%s", it_otu->c_str()); for (auto it_sample = otutable->sample_set.begin(); it_sample != otutable->sample_set.end(); ++it_sample) { uint64_t a = 0; if ((it_map != otutable->otu_sample_count.end()) && (it_map->first.first == *it_otu) && (it_map->first.second == *it_sample)) { a = it_map->second; ++it_map; } fprintf(fp, "\t%" PRIu64, a); } if (! 
otutable->otu_tax_map.empty()) { fprintf(fp, "\t"); auto it = otutable->otu_tax_map.find(*it_otu); if (it != otutable->otu_tax_map.end()) { fprintf(fp, "%s", it->second.c_str()); } } fprintf(fp, "\n"); progress_update(++progress); } progress_done(); } void otutable_print_mothur_shared_out(FILE * fp) { int64_t progress = 0; progress_init("Writing OTU table (mothur)", otutable->sample_set.size()); fprintf(fp, "label\tGroup\tnumOtus"); int64_t numotus = 0; for (const auto & it_otu : otutable->otu_set) { const char * otu_name = it_otu.c_str(); fprintf(fp, "\t%s", otu_name); ++numotus; } fprintf(fp, "\n"); auto it_map = otutable->sample_otu_count.begin(); for (auto it_sample = otutable->sample_set.begin(); it_sample != otutable->sample_set.end(); ++it_sample) { fprintf(fp, "vsearch\t%s\t%" PRId64, it_sample->c_str(), numotus); for (auto it_otu = otutable->otu_set.begin(); it_otu != otutable->otu_set.end(); ++it_otu) { uint64_t a = 0; if ((it_map != otutable->sample_otu_count.end()) && (it_map->first.first == *it_sample) && (it_map->first.second == *it_otu)) { a = it_map->second; ++it_map; } fprintf(fp, "\t%" PRIu64, a); } fprintf(fp, "\n"); progress_update(++progress); } progress_done(); } void otutable_print_biomout(FILE * fp) { int64_t progress = 0; progress_init("Writing OTU table (biom 1.0)", otutable->otu_sample_count.size()); int64_t rows = otutable->otu_set.size(); int64_t columns = otutable->sample_set.size(); static time_t time_now = time(nullptr); struct tm * tm_now = localtime(& time_now); char date[50]; strftime(date, 50, "%Y-%m-%dT%H:%M:%S", tm_now); fprintf(fp, "{\n" "\t\"id\":\"%s\",\n" "\t\"format\": \"Biological Observation Matrix 1.0\",\n" "\t\"format_url\": \"http://biom-format.org/documentation/format_versions/biom-1.0.html\",\n" "\t\"type\": \"OTU table\",\n" "\t\"generated_by\": \"%s %s\",\n" "\t\"date\": \"%s\",\n" "\t\"matrix_type\": \"sparse\",\n" "\t\"matrix_element_type\": \"int\",\n" "\t\"shape\": [%" PRId64 ",%" PRId64 "],\n", opt_biomout, 
PROG_NAME, PROG_VERSION, date, rows, columns); string_no_map_t otu_no_map; uint64_t otu_no = 0; fprintf(fp, "\t\"rows\":["); for (auto it_otu = otutable->otu_set.begin(); it_otu != otutable->otu_set.end(); ++it_otu) { if (it_otu != otutable->otu_set.begin()) { fprintf(fp, ","); } const char * otu_name = it_otu->c_str(); fprintf(fp, "\n\t\t{\"id\":\"%s\", \"metadata\":", otu_name); if (otutable->otu_tax_map.empty()) { fprintf(fp, "null"); } else { fprintf(fp, R"({"taxonomy":")"); auto it = otutable->otu_tax_map.find(otu_name); if (it != otutable->otu_tax_map.end()) { fprintf(fp, "%s", it->second.c_str()); } fprintf(fp, "\"}"); } fprintf(fp, "}"); otu_no_map[*it_otu] = otu_no++; } fprintf(fp, "\n"); fprintf(fp, "\t],\n"); string_no_map_t sample_no_map; uint64_t sample_no = 0; fprintf(fp, "\t\"columns\":["); for (auto it_sample = otutable->sample_set.begin(); it_sample != otutable->sample_set.end(); ++it_sample) { if (it_sample != otutable->sample_set.begin()) { fprintf(fp, ","); } fprintf(fp, "\n\t\t{\"id\":\"%s\", \"metadata\":null}", it_sample->c_str()); sample_no_map[*it_sample] = sample_no++; } fprintf(fp, "\n\t],\n"); bool first = true; fprintf(fp, "\t\"data\": ["); for (auto & it_map : otutable->otu_sample_count) { if (!first) { fprintf(fp, ","); } otu_no = otu_no_map[it_map.first.first]; sample_no = sample_no_map[it_map.first.second]; fprintf(fp, "\n\t\t[%" PRIu64 ",%" PRIu64 ",%" PRIu64 "]", otu_no, sample_no, it_map.second); first = false; progress_update(++progress); } fprintf(fp, "\n\t]\n"); fprintf(fp, "}\n"); progress_done(); } vsearch-2.21.1/src/searchexact.h0000644000175000017500000000474314171574117016030 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. 
Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void search_exact(char * cmdline, char * progheader); vsearch-2.21.1/src/shuffle.h0000644000175000017500000000467514171574117015176 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. 
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void shuffle(); vsearch-2.21.1/src/minheap.h0000644000175000017500000000576414171574117015163 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. 
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ typedef struct topscore { unsigned int count; unsigned int seqno; unsigned int length; } elem_t; typedef struct minheap_s { int alloc; int count; elem_t * array; } minheap_t; inline int minheap_isempty(minheap_t * m) { return (m->count == 0); } inline void minheap_empty(minheap_t * m) { m->count = 0; } elem_t minheap_poplast(minheap_t * m); void minheap_sort(minheap_t * m); minheap_t * minheap_init(int size); void minheap_exit(minheap_t * m); void minheap_add(minheap_t * m, elem_t * n); elem_t minheap_pop(minheap_t * m); void minheap_dump(minheap_t * m); vsearch-2.21.1/src/searchcore.h0000644000175000017500000001454714171574117015657 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. 
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ //#define COMPARENONVECTORIZED /* the number of alignments that can be delayed */ #define MAXDELAYED 8 /* Default minimum number of word matches for word lengths 3-15 */ const int minwordmatches_defaults[] = { -1, -1, -1, 18, 17, 16, 15, 14, 12, 11, 10, 9, 8, 7, 5, 3 }; struct hit { int target; int strand; /* candidate info */ unsigned int count; /* number of unique kmers shared with query */ bool accepted; /* is it accepted? */ bool rejected; /* is it rejected? 
*/ bool aligned; /* has this hit been aligned */ bool weak; /* weak hits are aligned with id > weak_id */ /* info about global alignment, including terminal gaps */ int nwscore; /* alignment score */ int nwdiff; /* indels and mismatches in global alignment */ int nwgaps; /* gaps in global alignment */ int nwindels; /* indels in global alignment */ int nwalignmentlength; /* length of global alignment */ double nwid; /* percent identity of global alignment */ char * nwalignment; /* alignment string (cigar) of global alignment */ int matches; int mismatches; /* info about alignment excluding terminal gaps */ int internal_alignmentlength; int internal_gaps; int internal_indels; int trim_q_left; int trim_q_right; int trim_t_left; int trim_t_right; int trim_aln_left; int trim_aln_right; /* more info */ double id; /* identity used for ranking */ double id0, id1, id2, id3, id4; int shortest; /* length of shortest of query and target */ int longest; /* length of longest of query and target */ }; /* type of kmer hit counter element remember possibility of overflow */ typedef unsigned short count_t; struct searchinfo_s { int query_no; /* query number, zero-based */ int strand; /* strand of query being analysed */ int qsize; /* query abundance */ int query_head_len; /* query header length */ int query_head_alloc; /* bytes allocated for the header */ char * query_head; /* query header */ int qseqlen; /* query length */ int seq_alloc; /* bytes allocated for the query sequence */ char * qsequence; /* query sequence */ unsigned int kmersamplecount; /* number of kmer samples from query */ unsigned int * kmersample; /* list of kmers sampled from query */ count_t * kmers; /* list of kmer counts for each db seq */ struct hit * hits; /* list of hits */ int hit_count; /* number of hits in the above list */ struct uhandle_s * uh; /* unique kmer finder instance */ struct s16info_s * s; /* SIMD aligner instance */ struct nwinfo_s * nw; /* NW aligner instance */ LinearMemoryAligner * lma; 
/* Linear memory aligner instance pointer */ int accepts; /* number of accepts */ int rejects; /* number of rejects */ minheap_t * m; /* min heap with the top kmer db seqs */ int finalized; }; void search_topscores(struct searchinfo_s * si); void search_onequery(struct searchinfo_s * si, int seqmask); struct hit * search_findbest2_byid(struct searchinfo_s * si_p, struct searchinfo_s * si_m); struct hit * search_findbest2_bysize(struct searchinfo_s * si_p, struct searchinfo_s * si_m); int search_acceptable_unaligned(struct searchinfo_s * si, int target); int search_acceptable_aligned(struct searchinfo_s * si, struct hit * hit); void align_trim(struct hit * hit); void search_joinhits(struct searchinfo_s * si_p, struct searchinfo_s * si_m, struct hit * * hits, int * hit_count); bool search_enough_kmers(struct searchinfo_s * si, unsigned int count); vsearch-2.21.1/src/mask.cc0000644000175000017500000002566514171574117014635 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. 
You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "vsearch.h" static const int dust_word = 3; static const int dust_level = 20; static const int dust_window = 64; static const int dust_window2 = dust_window / 2; static const int word_count = 1 << (2 * dust_word); static const int bitmask = word_count - 1; int wo(int len, const char *s, int *beg, int *end) { int l1 = len - dust_word + 1 - 5; /* smallest possible region is 8 */ if (l1 < 0) { return 0; } int bestv = 0; int besti = 0; int bestj = 0; int counts[word_count]; int words[dust_window]; int word = 0; for (int j = 0; j < len; j++) { word <<= 2; word |= chrmap_2bit[(int)(s[j])]; words[j] = word & bitmask; } for (int i=0; i < l1; i++) { memset(counts, 0, sizeof(counts)); int sum = 0; for (int j = dust_word-1; j < len-i; j++) { word = words[i+j]; int c = counts[word]; if (c) { sum += c; int v = 10 * sum / j; if (v > bestv) { bestv = v; besti = i; bestj = j; } } counts[word]++; } } *beg = besti; *end = besti + bestj; return bestv; } void dust(char * m, int len) { int a, b; /* make a local copy of the original sequence */ char * s = (char*) xmalloc(len+1); strcpy(s, m); if (! opt_hardmask) { /* convert sequence to upper case unless hardmask in effect */ for(int i=0; i < len; i++) { m[i] = toupper(m[i]); } m[len] = 0; } for (int i=0; i < len; i += dust_window2) { int l = (len > i + dust_window) ?
dust_window : len-i; int v = wo(l, s+i, &a, &b); if (v > dust_level) { if (opt_hardmask) { for(int j=a+i; j<=b+i; j++) { m[j] = 'N'; } } else { for(int j=a+i; j<=b+i; j++) { m[j] = s[j] | 0x20; } } if (b < dust_window2) { i += dust_window2 - b; } } } xfree(s); } static pthread_t * pthread; static pthread_attr_t attr; static pthread_mutex_t mutex; static int nextseq = 0; static int seqcount = 0; void * dust_all_worker(void * vp) { while(true) { xpthread_mutex_lock(&mutex); int seqno = nextseq; if (seqno < seqcount) { nextseq++; progress_update(seqno); xpthread_mutex_unlock(&mutex); dust(db_getsequence(seqno), db_getsequencelen(seqno)); } else { xpthread_mutex_unlock(&mutex); break; } } return nullptr; } void dust_all() { nextseq = 0; seqcount = db_getsequencecount(); progress_init("Masking", seqcount); xpthread_mutex_init(&mutex, nullptr); xpthread_attr_init(&attr); xpthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE); pthread = (pthread_t *) xmalloc(opt_threads * sizeof(pthread_t)); for(int t=0; t opt_max_unmasked_pct) { discarded_more++; } else { kept++; if (opt_fastaout) { fasta_print_general(fp_fastaout, nullptr, seq, len, db_getheader(i), db_getheaderlen(i), db_getabundance(i), kept, -1.0, -1, -1, nullptr, 0.0); } if (opt_fastqout) { fastq_print_general(fp_fastqout, seq, len, db_getheader(i), db_getheaderlen(i), db_getquality(i), db_getabundance(i), kept, -1.0); } } progress_update(i); } progress_done(); if (!opt_quiet) { if (opt_min_unmasked_pct > 0.0) { fprintf(stderr, "%d sequences with less than %.1lf%% unmasked residues discarded\n", discarded_less, opt_min_unmasked_pct); } if (opt_max_unmasked_pct < 100.0) { fprintf(stderr, "%d sequences with more than %.1lf%% unmasked residues discarded\n", discarded_more, opt_max_unmasked_pct); } fprintf(stderr, "%d sequences kept\n", kept); } if (opt_log) { if (opt_min_unmasked_pct > 0.0) { fprintf(fp_log, "%d sequences with less than %.1lf%% unmasked residues discarded\n", discarded_less, 
opt_min_unmasked_pct); } if (opt_max_unmasked_pct < 100.0) { fprintf(fp_log, "%d sequences with more than %.1lf%% unmasked residues discarded\n", discarded_more, opt_max_unmasked_pct); } fprintf(fp_log, "%d sequences kept\n", kept); } show_rusage(); db_free(); if (fp_fastaout) { fclose(fp_fastaout); } if (fp_fastqout) { fclose(fp_fastqout); } } vsearch-2.21.1/src/dynlibs.cc0000644000175000017500000001131614171574117015332 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" #ifdef HAVE_ZLIB_H # ifdef _WIN32 const char gz_libname[] = "zlib1.dll"; HMODULE gz_lib; # else # ifdef __APPLE__ const char gz_libname[] = "libz.dylib"; # else const char gz_libname[] = "libz.so"; # endif void * gz_lib; # endif gzFile ZEXPORT (*gzdopen_p) OF((int, const char *)); int ZEXPORT (*gzclose_p) OF((gzFile)); int ZEXPORT (*gzread_p) OF((gzFile, void *, unsigned)); #endif #ifdef HAVE_BZLIB_H # ifdef _WIN32 const char bz2_libname[] = "libbz2.dll"; HMODULE bz2_lib; # else # ifdef __APPLE__ const char bz2_libname[] = "libbz2.dylib"; # else const char bz2_libname[] = "libbz2.so"; # endif void * bz2_lib; # endif BZFILE* (*BZ2_bzReadOpen_p)(int*, FILE*, int, int, void*, int); void (*BZ2_bzReadClose_p)(int*, BZFILE*); int (*BZ2_bzRead_p)(int*, BZFILE*, void*, int); #endif void dynlibs_open() { #ifdef HAVE_ZLIB_H #ifdef _WIN32 gz_lib = LoadLibraryA(gz_libname); #else gz_lib = dlopen(gz_libname, RTLD_LAZY); #endif if (gz_lib) { gzdopen_p = (gzFile (*)(int, const char*)) arch_dlsym(gz_lib, "gzdopen"); 
gzclose_p = (int (*)(gzFile)) arch_dlsym(gz_lib, "gzclose"); gzread_p = (int (*)(gzFile, void*, unsigned)) arch_dlsym(gz_lib, "gzread"); if (!(gzdopen_p && gzclose_p && gzread_p)) { fatal("Invalid compression library (zlib)"); } } #endif #ifdef HAVE_BZLIB_H #ifdef _WIN32 bz2_lib = LoadLibraryA(bz2_libname); #else bz2_lib = dlopen(bz2_libname, RTLD_LAZY); #endif if (bz2_lib) { BZ2_bzReadOpen_p = (BZFILE* (*)(int*, FILE*, int, int, void*, int)) arch_dlsym(bz2_lib, "BZ2_bzReadOpen"); BZ2_bzReadClose_p = (void (*)(int*, BZFILE*)) arch_dlsym(bz2_lib, "BZ2_bzReadClose"); BZ2_bzRead_p = (int (*)(int*, BZFILE*, void*, int)) arch_dlsym(bz2_lib, "BZ2_bzRead"); if (!(BZ2_bzReadOpen_p && BZ2_bzReadClose_p && BZ2_bzRead_p)) { fatal("Invalid compression library (bz2)"); } } #endif } void dynlibs_close() { #ifdef HAVE_ZLIB_H if (gz_lib) { #ifdef _WIN32 FreeLibrary(gz_lib); #else dlclose(gz_lib); #endif } gz_lib = nullptr; #endif #ifdef HAVE_BZLIB_H if (bz2_lib) { #ifdef _WIN32 FreeLibrary(bz2_lib); #else dlclose(bz2_lib); #endif } bz2_lib = nullptr; #endif } vsearch-2.21.1/src/userfields.h0000644000175000017500000000503614171574117015677 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. 
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ extern int * userfields_requested; extern int userfields_requested_count; int parse_userfields_arg(char * arg); vsearch-2.21.1/src/fastqops.cc0000644000175000017500000006665214171574117015543 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. 
Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */
#include "vsearch.h"
void fastq_chars() { uint64_t sequence_chars[256]; uint64_t quality_chars[256]; uint64_t tail_chars[256]; uint64_t total_chars = 0; int maxrun[256]; for(int c=0; c<256; c++) { sequence_chars[c] = 0; quality_chars[c] = 0; tail_chars[c] = 0; maxrun[c] = 0; } fastx_handle h = fastq_open(opt_fastq_chars); uint64_t filesize = fastq_get_size(h); progress_init("Reading FASTQ file", filesize); uint64_t seq_count = 0; int qmin_n = 255, qmax_n = 0; while(fastq_next(h, false, chrmap_upcase)) { int64_t len = fastq_get_sequence_length(h); char * p = fastq_get_sequence(h); char * q = fastq_get_quality(h); seq_count++; total_chars += len; int run_char = -1; int run = 0; int64_t i = 0; while(i < len) { int pc = *p++; int qc = *q++; sequence_chars[pc]++; quality_chars[qc]++; if ((pc == 'N') || (pc == 'n')) { if (qc < qmin_n) { qmin_n = qc; } if (qc > qmax_n) { qmax_n = qc; } } if (pc == run_char) { run++; if (run > maxrun[run_char]) { maxrun[run_char] = run; } } else { run_char = pc; run = 0; } i++; } if (len >= opt_fastq_tail) { q = fastq_get_quality(h) + len - 1; int tail_char = *q--; int tail_len = 1; while(*q-- == tail_char) { tail_len++; if (tail_len >= opt_fastq_tail) { break; } } if (tail_len >= opt_fastq_tail) { tail_chars[tail_char]++; } } progress_update(fastq_get_position(h)); } progress_done(); fastq_close(h); char qmin = 0; char qmax = 0; for(int c=0; c<=255; c++) { if (quality_chars[c]) { qmin = c; break; } } for(int c=255; c>=0; c--) { if (quality_chars[c]) { qmax = c; break; } } char fastq_ascii, fastq_qmin, fastq_qmax; if ((qmin < 59) || (qmax < 75)) { fastq_ascii = 33; } else { fastq_ascii =
64; } fastq_qmax = qmax - fastq_ascii; fastq_qmin = qmin - fastq_ascii; if (!opt_quiet) { fprintf(stderr, "Read %" PRIu64 " sequences.\n", seq_count); if (seq_count > 0) { fprintf(stderr, "Qmin %d, QMax %d, Range %d\n", qmin, qmax, qmax-qmin+1); fprintf(stderr, "Guess: -fastq_qmin %d -fastq_qmax %d -fastq_ascii %d\n", fastq_qmin, fastq_qmax, fastq_ascii); if (fastq_ascii == 64) { if (qmin < 64) { fprintf(stderr, "Guess: Solexa format (phred+64)\n"); } else if (qmin < 66) { fprintf(stderr, "Guess: Illumina 1.3+ format (phred+64)\n"); } else { fprintf(stderr, "Guess: Illumina 1.5+ format (phred+64)\n"); } } else { if (qmax > 73) { fprintf(stderr, "Guess: Illumina 1.8+ format (phred+33)\n"); } else { fprintf(stderr, "Guess: Original Sanger format (phred+33)\n"); } } fprintf(stderr, "\n"); fprintf(stderr, "Letter N Freq MaxRun\n"); fprintf(stderr, "------ ---------- ------ ------\n"); for(int c=0; c<256; c++) { if (sequence_chars[c] > 0) { fprintf(stderr, " %c %10" PRIu64 " %5.1f%% %6d", c, sequence_chars[c], 100.0 * sequence_chars[c] / total_chars, maxrun[c]); if ((c == 'N') || (c == 'n')) { if (qmin_n < qmax_n) { fprintf(stderr, " Q=%c..%c", qmin_n, qmax_n); } else { fprintf(stderr, " Q=%c", qmin_n); } } fprintf(stderr, "\n"); } } fprintf(stderr, "\n"); fprintf(stderr, "Char ASCII Freq Tails\n"); fprintf(stderr, "---- ----- ------ ----------\n"); for(int c=qmin; c<=qmax; c++) { if (quality_chars[c] > 0) { fprintf(stderr, " '%c' %5d %5.1f%% %10" PRIu64 "\n", c, c, 100.0 * quality_chars[c] / total_chars, tail_chars[c]); } } } } if (opt_log) { fprintf(fp_log, "Read %" PRIu64 " sequences.\n", seq_count); if (seq_count > 0) { fprintf(fp_log, "Qmin %d, QMax %d, Range %d\n", qmin, qmax, qmax-qmin+1); fprintf(fp_log, "Guess: -fastq_qmin %d -fastq_qmax %d -fastq_ascii %d\n", fastq_qmin, fastq_qmax, fastq_ascii); if (fastq_ascii == 64) { if (qmin < 64) { fprintf(fp_log, "Guess: Solexa format (phred+64)\n"); } else if (qmin < 66) { fprintf(fp_log, "Guess: Illumina 1.3+ format 
(phred+64)\n"); } else { fprintf(fp_log, "Guess: Illumina 1.5+ format (phred+64)\n"); } } else { if (qmax > 73) { fprintf(fp_log, "Guess: Illumina 1.8+ format (phred+33)\n"); } else { fprintf(fp_log, "Guess: Original Sanger format (phred+33)\n"); } } fprintf(fp_log, "\n"); fprintf(fp_log, "Letter N Freq MaxRun\n"); fprintf(fp_log, "------ ---------- ------ ------\n"); for(int c=0; c<256; c++) { if (sequence_chars[c] > 0) { fprintf(fp_log, " %c %10" PRIu64 " %5.1f%% %6d", c, sequence_chars[c], 100.0 * sequence_chars[c] / total_chars, maxrun[c]); if ((c == 'N') || (c == 'n')) { if (qmin_n < qmax_n) { fprintf(fp_log, " Q=%c..%c", qmin_n, qmax_n); } else { fprintf(fp_log, " Q=%c", qmin_n); } } fprintf(fp_log, "\n"); } } fprintf(fp_log, "\n"); fprintf(fp_log, "Char ASCII Freq Tails\n"); fprintf(fp_log, "---- ----- ------ ----------\n"); for(int c=qmin; c<=qmax; c++) { if (quality_chars[c] > 0) { fprintf(fp_log, " '%c' %5d %5.1f%% %10" PRIu64 "\n", c, c, 100.0 * quality_chars[c] / total_chars, tail_chars[c]); } } } } } double q2p(double q) { return exp10(- q / 10.0); } void fastq_stats() { fastx_handle h = fastq_open(opt_fastq_stats); uint64_t filesize = fastq_get_size(h); progress_init("Reading FASTQ file", filesize); uint64_t seq_count = 0; uint64_t symbols = 0; int64_t read_length_alloc = 512; auto * read_length_table = (uint64_t*) xmalloc(sizeof(uint64_t) * read_length_alloc); memset(read_length_table, 0, sizeof(uint64_t) * read_length_alloc); auto * qual_length_table = (uint64_t*) xmalloc(sizeof(uint64_t) * read_length_alloc * 256); memset(qual_length_table, 0, sizeof(uint64_t) * read_length_alloc * 256); auto * ee_length_table = (uint64_t *) xmalloc(sizeof(uint64_t) * read_length_alloc * 4); memset(ee_length_table, 0, sizeof(uint64_t) * read_length_alloc * 4); auto * q_length_table = (uint64_t *) xmalloc(sizeof(uint64_t) * read_length_alloc * 4); memset(q_length_table, 0, sizeof(uint64_t) * read_length_alloc * 4); auto * sumee_length_table = (double *) 
xmalloc(sizeof(double) * read_length_alloc); memset(sumee_length_table, 0, sizeof(double) * read_length_alloc); int64_t len_min = LONG_MAX; int64_t len_max = 0; int qmin = +1000; int qmax = -1000; uint64_t quality_chars[256]; for(uint64_t & quality_char : quality_chars) { quality_char = 0; } while(fastq_next(h, false, chrmap_upcase)) { seq_count++; int64_t len = fastq_get_sequence_length(h); char * q = fastq_get_quality(h); /* update length statistics */ if (len+1 > read_length_alloc) { read_length_table = (uint64_t*) xrealloc(read_length_table, sizeof(uint64_t) * (len+1)); memset(read_length_table + read_length_alloc, 0, sizeof(uint64_t) * (len + 1 - read_length_alloc)); qual_length_table = (uint64_t*) xrealloc(qual_length_table, sizeof(uint64_t) * (len+1) * 256); memset(qual_length_table + 256 * read_length_alloc, 0, sizeof(uint64_t) * (len + 1 - read_length_alloc) * 256); ee_length_table = (uint64_t*) xrealloc(ee_length_table, sizeof(uint64_t) * (len+1) * 4); memset(ee_length_table + 4 * read_length_alloc, 0, sizeof(uint64_t) * (len + 1 - read_length_alloc) * 4); q_length_table = (uint64_t*) xrealloc(q_length_table, sizeof(uint64_t) * (len+1) * 4); memset(q_length_table + 4 * read_length_alloc, 0, sizeof(uint64_t) * (len + 1 - read_length_alloc) * 4); sumee_length_table = (double *) xrealloc(sumee_length_table, sizeof(double) * (len+1)); memset(sumee_length_table + read_length_alloc, 0, sizeof(double) * (len + 1 - read_length_alloc)); read_length_alloc = len + 1; } read_length_table[len]++; if (len < len_min) { len_min = len; } if (len > len_max) { len_max = len; } /* update quality statistics */ symbols += len; double ee_limit[4] = { 1.0, 0.5, 0.25, 0.1 }; double ee = 0.0; int qmin_this = 1000; for(int64_t i=0; i < len; i++) { int qc = q[i]; int qual = qc - opt_fastq_ascii; if ((qual < opt_fastq_qmin) || (qual > opt_fastq_qmax)) { char * msg; if (xsprintf(& msg, "FASTQ quality value (%d) out of range (%" PRId64 "-%" PRId64 ").\n" "Please adjust the FASTQ 
quality base character or range with the\n" "--fastq_ascii, --fastq_qmin or --fastq_qmax options. For a complete\n" "diagnosis with suggested values, please run vsearch --fastq_chars file.", qual, opt_fastq_qmin, opt_fastq_qmax) > 0) { fatal(msg); } else { fatal("Out of memory"); } xfree(msg); } quality_chars[qc]++; if (qc < qmin) { qmin = qc; } if (qc > qmax) { qmax = qc; } qual_length_table[256*i + qc]++; ee += q2p(qual); sumee_length_table[i] += ee; for(int z=0; z<4; z++) { if (ee <= ee_limit[z]) { ee_length_table[4*i+z]++; } else { break; } } if (qual < qmin_this) { qmin_this = qual; } for(int z=0; z<4; z++) { if (qmin_this > 5*(z+1)) { q_length_table[4*i+z]++; } else { break; } } } progress_update(fastq_get_position(h)); } progress_done(); /* compute various distributions */ auto * length_dist = (uint64_t*) xmalloc(sizeof(uint64_t) * (len_max+1)); auto * symb_dist = (int64_t*) xmalloc(sizeof(int64_t) * (len_max+1)); auto * rate_dist = (double*) xmalloc(sizeof(double) * (len_max+1)); auto * avgq_dist = (double*) xmalloc(sizeof(double) * (len_max+1)); auto * avgee_dist = (double*) xmalloc(sizeof(double) * (len_max+1)); auto * avgp_dist = (double*) xmalloc(sizeof(double) * (len_max+1)); int64_t length_accum = 0; int64_t symb_accum = 0; for(int64_t i = 0; i <= len_max; i++) { length_accum += read_length_table[i]; length_dist[i] = length_accum; symb_accum += seq_count - length_accum; symb_dist[i] = symb_accum; int64_t q = 0; int64_t x = 0; double e_sum = 0.0; for(int c=qmin; c<=qmax; c++) { int qual = c - opt_fastq_ascii; x += qual_length_table[256*i + c]; q += qual_length_table[256*i + c] * qual; e_sum += qual_length_table[256*i + c] * q2p(qual); } avgq_dist[i] = 1.0 * q / x; avgp_dist[i] = e_sum / x; avgee_dist[i] = sumee_length_table[i] / x; rate_dist[i] = avgee_dist[i] / (i+1); } if (fp_log) { fprintf(fp_log, "\n"); fprintf(fp_log, "Read length distribution\n"); fprintf(fp_log, " L N Pct AccPct\n"); fprintf(fp_log, "------- ---------- ------- -------\n"); 
for(int64_t i = len_max; i >= len_min; i--) { if (read_length_table[i] > 0) { fprintf(fp_log, "%2s%5" PRId64 " %10" PRIu64 " %5.1lf%% %5.1lf%%\n", (i == len_max ? ">=" : " "), i, read_length_table[i], read_length_table[i] * 100.0 / seq_count, 100.0 * (seq_count - length_dist[i-1]) / seq_count); } } fprintf(fp_log, "\n"); fprintf(fp_log, "Q score distribution\n"); fprintf(fp_log, "ASCII Q Pe N Pct AccPct\n"); fprintf(fp_log, "----- --- ------- ---------- ------- -------\n"); int64_t qual_accum = 0; for(int c = qmax ; c >= qmin ; c--) { if (quality_chars[c] > 0) { qual_accum += quality_chars[c]; fprintf(fp_log, " %c %3" PRId64 " %7.5lf %10" PRIu64 " %6.1lf%% %6.1lf%%\n", c, c - opt_fastq_ascii, q2p(c - opt_fastq_ascii), quality_chars[c], 100.0 * quality_chars[c] / symbols, 100.0 * qual_accum / symbols); } } fprintf(fp_log, "\n"); fprintf(fp_log, " L PctRecs AvgQ P(AvgQ) AvgP AvgEE Rate RatePct\n"); fprintf(fp_log, "----- ------- ---- ------- -------- ----- --------- --------\n"); for(int64_t i = 2; i <= len_max; i++) { double PctRecs = 100.0 * (seq_count - length_dist[i-1]) / seq_count; double AvgQ = avgq_dist[i-1]; double AvgP = avgp_dist[i-1]; double AvgEE = avgee_dist[i-1]; double Rate = rate_dist[i-1]; fprintf(fp_log, "%5" PRId64 " %6.1lf%% %4.1lf %7.5lf %8.6lf %5.2lf %9.6lf %7.3lf%%\n", i, PctRecs, AvgQ, q2p(AvgQ), AvgP, AvgEE, Rate, 100.0 * Rate); } fprintf(fp_log, "\n"); fprintf(fp_log, " L 1.0000 0.5000 0.2500 0.1000 1.0000 0.5000 0.2500 0.1000\n"); fprintf(fp_log, "----- ------- ------- ------- ------- ------- ------- ------- -------\n"); for(int64_t i = len_max; i >= 1; i--) { int64_t read_count[4]; double read_percentage[4]; for(int z=0; z<4; z++) { read_count[z] = ee_length_table[4*(i-1)+z]; read_percentage[z] = 100.0 * read_count[z] / seq_count; } if (read_count[0] > 0) { fprintf(fp_log, "%5" PRId64 " %7" PRId64 " %7" PRId64 " %7" PRId64 " %7" PRId64 " " "%6.2lf%% %6.2lf%% %6.2lf%% %6.2lf%%\n", i, read_count[0], read_count[1], read_count[2], 
read_count[3], read_percentage[0], read_percentage[1], read_percentage[2], read_percentage[3]); } } fprintf(fp_log, "\n"); fprintf(fp_log, "Truncate at first Q\n"); fprintf(fp_log, " Len Q=5 Q=10 Q=15 Q=20\n"); fprintf(fp_log, "----- ------ ------ ------ ------\n"); for(int64_t i = len_max; i >= MAX(1, len_max/2); i--) { double read_percentage[4]; for(int z=0; z<4; z++) { read_percentage[z] = 100.0 * q_length_table[4*(i-1)+z] / seq_count; } fprintf(fp_log, "%5" PRId64 " %5.1lf%% %5.1lf%% %5.1lf%% %5.1lf%%\n", i, read_percentage[0], read_percentage[1], read_percentage[2], read_percentage[3]); } fprintf(fp_log, "\n"); fprintf(fp_log, "%10" PRIu64 " Recs (%.1lfM), 0 too long\n", seq_count, seq_count / 1.0e6); if (seq_count > 0) { fprintf(fp_log, "%10.1lf Avg length\n", 1.0 * symbols / seq_count); } fprintf(fp_log, "%9.1lfM Bases\n", symbols / 1.0e6); } xfree(read_length_table); xfree(qual_length_table); xfree(ee_length_table); xfree(q_length_table); xfree(sumee_length_table); xfree(length_dist); xfree(symb_dist); xfree(rate_dist); xfree(avgq_dist); xfree(avgee_dist); xfree(avgp_dist); fastq_close(h); if (!opt_quiet) { fprintf(stderr, "Read %" PRIu64 " sequences.\n", seq_count); } } void fastx_revcomp() { uint64_t buffer_alloc = 512; char * seq_buffer = (char*) xmalloc(buffer_alloc); char * qual_buffer = (char*) xmalloc(buffer_alloc); if ((!opt_fastaout) && (!opt_fastqout)) fatal("No output files specified"); fastx_handle h = fastx_open(opt_fastx_revcomp); if (!h) { fatal("Unrecognized file type (not proper FASTA or FASTQ format)"); } if (opt_fastqout && ! 
(h->is_fastq || h->is_empty)) { fatal("Cannot write FASTQ output with a FASTA input file, lacking quality scores"); } uint64_t filesize = fastx_get_size(h); FILE * fp_fastaout = nullptr; FILE * fp_fastqout = nullptr; if (opt_fastaout) { fp_fastaout = fopen_output(opt_fastaout); if (!fp_fastaout) { fatal("Unable to open FASTA output file for writing"); } } if (opt_fastqout) { fp_fastqout = fopen_output(opt_fastqout); if (!fp_fastqout) { fatal("Unable to open FASTQ output file for writing"); } } if (h->is_fastq) { progress_init("Reading FASTQ file", filesize); } else { progress_init("Reading FASTA file", filesize); } int count = 0; while(fastx_next(h, false, chrmap_no_change)) { count++; /* header */ uint64_t hlen = fastx_get_header_length(h); char * header = fastx_get_header(h); int64_t abundance = fastx_get_abundance(h); /* sequence */ uint64_t length = fastx_get_sequence_length(h); if (length + 1 > buffer_alloc) { buffer_alloc = length + 1; seq_buffer = (char *) xrealloc(seq_buffer, buffer_alloc); qual_buffer = (char *) xrealloc(qual_buffer, buffer_alloc); } char * p = fastx_get_sequence(h); reverse_complement(seq_buffer, p, length); /* quality values */ char * q = fastx_get_quality(h); if (fastx_is_fastq(h)) { /* reverse quality values */ for(uint64_t i=0; i opt_fastq_qmax) { fprintf(stderr, "\nFASTQ quality score (%d) above maximum (%" PRId64 ") in entry no %" PRIu64 " starting on line %" PRIu64 "\n", q, opt_fastq_qmax, fastq_get_seqno(h) + 1, fastq_get_lineno(h)); fatal("FASTQ quality score too high"); } if (q < opt_fastq_qminout) { q = opt_fastq_qminout; } if (q > opt_fastq_qmaxout) { q = opt_fastq_qmaxout; } q += opt_fastq_asciiout; if (q < 33) { q = 33; } if (q > 126) { q = 126; } quality[i] = q; } quality[length] = 0; int hlen = fastq_get_header_length(h); fastq_print_general(fp_fastqout, sequence, length, header, hlen, quality, abundance, j, -1.0); j++; progress_update(fastq_get_position(h)); } progress_done(); fclose(fp_fastqout); fastq_close(h); } 
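Two calculations in fastqops.cc are easy to lose in the surrounding I/O code: the threshold test fastq_chars() uses to guess the quality offset (phred+33 unless the lowest quality character is at least 59 and the highest at least 75), and the running sum of per-base error probabilities that fastq_stats() accumulates through q2p(). The standalone sketch below restates both outside VSEARCH. The names guess_ascii_offset and expected_errors are hypothetical helpers invented for illustration and do not exist in the VSEARCH source. q2p is redefined here with std::pow, on the assumption that exp10() (a GNU extension) may not be available everywhere.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Hypothetical helper (not in VSEARCH): same threshold test that
// fastq_chars() applies to the lowest and highest quality characters seen.
int guess_ascii_offset(int qmin_char, int qmax_char)
{
  return ((qmin_char < 59) || (qmax_char < 75)) ? 33 : 64;
}

// Phred score to error probability, p = 10^(-q/10).
// Written with std::pow instead of exp10() for portability (assumption).
double q2p(double q)
{
  return std::pow(10.0, -q / 10.0);
}

// Hypothetical helper: expected number of errors in one read, i.e. the sum
// of the per-base error probabilities implied by its quality string.
double expected_errors(const std::string & quality, int ascii_offset)
{
  double ee = 0.0;
  for (unsigned char c : quality)
    {
      int q = c - ascii_offset;
      assert(q >= 0); // characters below the offset are invalid input
      ee += q2p(q);
    }
  return ee;
}
```

For example, with an offset of 33 the quality string "5+" encodes Q20 and Q10, so expected_errors("5+", 33) is 0.01 + 0.1, about 0.11. Unlike this sketch, fastq_stats() also checks each score against opt_fastq_qmin and opt_fastq_qmax and aborts with a diagnostic when it is out of range.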
vsearch-2.21.1/src/subsample.h0000644000175000017500000000467714171574117015537 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void subsample(); vsearch-2.21.1/src/mask.h0000644000175000017500000000521414171574117014463 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #define MASK_ERROR -1 #define MASK_NONE 0 #define MASK_DUST 1 #define MASK_SOFT 2 void maskfasta(); void fastx_mask(); void dust(char * m, int len); void hardmask(char * m, int len); void dust_all(); void hardmask_all(); vsearch-2.21.1/src/subsample.cc0000644000175000017500000002103514171574117015660 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. 
GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include "vsearch.h" void subsample() { FILE * fp_fastaout = nullptr; FILE * fp_fastaout_discarded = nullptr; FILE * fp_fastqout = nullptr; FILE * fp_fastqout_discarded = nullptr; if (opt_fastaout) { fp_fastaout = fopen_output(opt_fastaout); if (!fp_fastaout) { fatal("Unable to open FASTA output file for writing"); } } if (opt_fastaout_discarded) { fp_fastaout_discarded = fopen_output(opt_fastaout_discarded); if (!fp_fastaout_discarded) { fatal("Unable to open FASTA output file for writing"); } } if (opt_fastqout) { fp_fastqout = fopen_output(opt_fastqout); if (!fp_fastqout) { fatal("Unable to open FASTQ output file for writing"); } } if (opt_fastqout_discarded) { fp_fastqout_discarded = fopen_output(opt_fastqout_discarded); if (!fp_fastqout_discarded) { fatal("Unable to open FASTQ output file for writing"); } } db_read(opt_fastx_subsample, 0); show_rusage(); if ((fp_fastqout || fp_fastqout_discarded) && ! db_is_fastq()) { fatal("Cannot write FASTQ output with a FASTA input file, lacking quality scores"); } int dbsequencecount = db_getsequencecount(); uint64_t mass_total = 0; if (!opt_sizein) { mass_total = dbsequencecount; } else { for(int i=0; i mass_total) { fatal("Cannot subsample more reads than in the original sample"); } uint64_t x = n; /* number of reads left */ int a = 0; /* amplicon number */ uint64_t r = 0; /* read being checked */ uint64_t m = 0; /* accumulated mass */ uint64_t mass = /* mass of current amplicon */ opt_sizein ? db_getabundance(0) : 1; progress_init("Subsampling", mass_total); while (x > 0) { uint64_t random = random_ulong(mass_total - r); if (random < x) { /* selected read r from amplicon a */ abundance[a]++; x--; } r++; m++; if (m >= mass) { /* next amplicon */ a++; mass = opt_sizein ? 
db_getabundance(a) : 1; m = 0; } progress_update(r); } progress_done(); int samples = 0; int discarded = 0; progress_init("Writing output", dbsequencecount); for(int i=0; i 0) { samples++; if (opt_fastaout) { fasta_print_general(fp_fastaout, nullptr, db_getsequence(i), db_getsequencelen(i), db_getheader(i), db_getheaderlen(i), ab_sub, samples, -1.0, -1, -1, nullptr, 0.0); } if (opt_fastqout) { fastq_print_general(fp_fastqout, db_getsequence(i), db_getsequencelen(i), db_getheader(i), db_getheaderlen(i), db_getquality(i), ab_sub, samples, -1.0); } } if (ab_discarded > 0) { discarded++; if (opt_fastaout_discarded) { fasta_print_general(fp_fastaout_discarded, nullptr, db_getsequence(i), db_getsequencelen(i), db_getheader(i), db_getheaderlen(i), ab_discarded, discarded, -1.0, -1, -1, nullptr, 0.0); } if (opt_fastqout_discarded) { fastq_print_general(fp_fastqout_discarded, db_getsequence(i), db_getsequencelen(i), db_getheader(i), db_getheaderlen(i), db_getquality(i), ab_discarded, discarded, -1.0); } } progress_update(i); } progress_done(); xfree(abundance); if (! opt_quiet) { fprintf(stderr, "Subsampled %" PRIu64 " reads from %d amplicons\n", n, samples); } if (opt_log) { fprintf(fp_log, "Subsampled %" PRIu64 " reads from %d amplicons\n", n, samples); } db_free(); if (opt_fastaout) { fclose(fp_fastaout); } if (opt_fastqout) { fclose(fp_fastqout); } if (opt_fastaout_discarded) { fclose(fp_fastaout_discarded); } if (opt_fastqout_discarded) { fclose(fp_fastqout_discarded); } } vsearch-2.21.1/src/md5.h0000644000175000017500000000277514171574117014226 0ustar nileshnilesh/* Slightly modified for vsearch by Torbjorn Rognes, 29 Sep 2015 */ /* * This is an OpenSSL-compatible implementation of the RSA Data Security, Inc. * MD5 Message-Digest Algorithm (RFC 1321). * * Homepage: * http://openwall.info/wiki/people/solar/software/public-domain-source-code/md5 * * Author: * Alexander Peslyak, better known as Solar Designer * * This software was written by Alexander Peslyak in 2001. 
No copyright is * claimed, and the software is hereby placed in the public domain. * In case this attempt to disclaim copyright and place the software in the * public domain is deemed null and void, then the software is * Copyright (c) 2001 Alexander Peslyak and it is hereby released to the * general public under the following terms: * * Redistribution and use in source and binary forms, with or without * modification, are permitted. * * There's ABSOLUTELY NO WARRANTY, express or implied. * * See md5.c for more information. */ #ifndef __MD5_H #define __MD5_H #ifdef __cplusplus extern "C" { #endif /* Any 32-bit or wider unsigned integer data type will do */ typedef unsigned int MD5_u32plus; typedef struct { MD5_u32plus lo, hi; MD5_u32plus a, b, c, d; unsigned char buffer[64]; MD5_u32plus block[16]; } MD5_CTX; extern void MD5_Init(MD5_CTX *ctx); extern void MD5_Update(MD5_CTX *ctx, void *data, unsigned long size); extern void MD5_Final(unsigned char *result, MD5_CTX *ctx); #ifdef __cplusplus } #endif #endif /* __MD5_H */ vsearch-2.21.1/src/xstring.h0000644000175000017500000000750214171574117015230 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. 
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ static char empty_string[1] = ""; class xstring { char * string; size_t length; size_t alloc; public: xstring() { length = 0; alloc = 0; string = nullptr; } ~xstring() { if (alloc > 0) { xfree(string); } alloc = 0; string = nullptr; length = 0; } void empty() { length = 0; } char * get_string() { if (length > 0) { return string; } else { return empty_string; } } size_t get_length() { return length; } void add_c(char c) { size_t needed = 1; if (length + needed + 1 > alloc) { alloc = length + needed + 1; string = (char*) xrealloc(string, alloc); } string[length] = c; length += 1; string[length] = 0; } void add_d(int d) { int needed = snprintf(nullptr, 0, "%d", d); if (needed < 0) { fatal("snprintf failed"); } if (length + needed + 1 > alloc) { alloc = length + needed + 1; string = (char*) xrealloc(string, alloc); } sprintf(string + length, "%d", d); length += needed; } void add_s(char * s) { size_t needed = strlen(s); if (length + needed + 1 > alloc) { alloc = length + needed + 1; string = (char*) xrealloc(string, alloc); } strcpy(string + length, s); length += needed; } }; vsearch-2.21.1/src/filter.h0000644000175000017500000000472714171574117015025 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. 
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void fastq_filter(); void fastx_filter(); vsearch-2.21.1/src/sffconvert.h0000644000175000017500000000470114171574117015707 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. 
Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void sff_convert(); vsearch-2.21.1/src/dbindex.cc0000644000175000017500000001542714171574117015312 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. 
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#include "vsearch.h"

unsigned int * kmercount;
uint64_t * kmerhash;
unsigned int * kmerindex;
bitmap_t * * kmerbitmap;
unsigned int * dbindex_map;
unsigned int kmerhashsize;
uint64_t kmerindexsize;
unsigned int dbindex_count;

uhandle_s * dbindex_uh;

#define BITMAP_THRESHOLD 8

static unsigned int bitmap_mincount;

void fprint_kmer(FILE * f, unsigned int kk, uint64_t kmer)
{
  uint64_t x = kmer;
  for(unsigned int i=0; i<kk; i++)
    {
      /* print the 2-bit encoded nucleotides, most significant first */
      fprintf(f, "%c", sym_nt_2bit[(x >> (2*(kk-i-1))) & 3]);
    }
}

void dbindex_addsequence(unsigned int seqno, int seqmask)
{
#if 0
  printf("Adding seqno %d as index element no %d\n", seqno, dbindex_count);
#endif

  unsigned int uniquecount;
  unsigned int * uniquelist;
  unique_count(dbindex_uh, opt_wordlength,
               db_getsequencelen(seqno), db_getsequence(seqno),
               & uniquecount, & uniquelist, seqmask);
  dbindex_map[dbindex_count] = seqno;
  for(unsigned int i=0; i= bitmap_mincount)
        {
          kmerbitmap[i] = bitmap_init(seqcount+127); // pad for xmm
          bitmap_reset_all(kmerbitmap[i]);
        }
      else
        {
          sum += kmercount[i];
        }
    }

  kmerindexsize = sum;
  kmerhash[kmerhashsize] = sum;

#if 0
  if
(!opt_quiet) fprintf(stderr, "Unique %ld-mers: %u\n", opt_wordlength, kmerindexsize); #endif /* reset counts */ memset(kmercount, 0, kmerhashsize * sizeof(unsigned int)); /* allocate space for actual data */ kmerindex = (unsigned int *) xmalloc(kmerindexsize * sizeof(unsigned int)); /* allocate space for mapping from indexno to seqno */ dbindex_map = (unsigned int *) xmalloc(seqcount * sizeof(unsigned int)); dbindex_count = 0; show_rusage(); } void dbindex_free() { xfree(kmerhash); xfree(kmerindex); xfree(kmercount); xfree(dbindex_map); for(unsigned int kmer=0; kmer, Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" static pthread_t * pthread; /* global constants/data, no need for synchronization */ static int seqcount; /* number of database sequences */ static pthread_attr_t attr; /* global data protected by mutex */ static pthread_mutex_t mutex_input; static pthread_mutex_t mutex_output; static int qmatches; static int queries; static int64_t progress = 0; static FILE * fp_alnout = nullptr; static FILE * fp_samout = nullptr; static FILE * fp_userout = nullptr; static FILE * fp_blast6out = nullptr; static FILE * fp_uc = nullptr; static FILE * fp_fastapairs = nullptr; static FILE * fp_matched = nullptr; static FILE * fp_notmatched = nullptr; static FILE * fp_qsegout = nullptr; static FILE * fp_tsegout = nullptr; static int count_matched = 0; static int count_notmatched = 0; inline int allpairs_hit_compare_typed(struct hit * x, struct hit * y) { // high id, then low id // early target, then late target if (x->id > y->id) { return -1; } else if (x->id < y->id) { return +1; } else if (x->target < y->target) { return -1; } else if (x->target > y->target) { return +1; } else { return 0; } } int allpairs_hit_compare(const void * a, const void * b) { return allpairs_hit_compare_typed((struct hit *) a, (struct hit *) 
b); } void allpairs_output_results(int hit_count, struct hit * hits, char * query_head, int qseqlen, char * qsequence, char * qsequence_rc) { /* show results */ int64_t toreport = MIN(opt_maxhits, hit_count); if (fp_alnout) { results_show_alnout(fp_alnout, hits, toreport, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_samout) { results_show_samout(fp_samout, hits, toreport, query_head, qsequence, qseqlen, qsequence_rc); } if (toreport) { double top_hit_id = hits[0].id; for(int t = 0; t < toreport; t++) { struct hit * hp = hits + t; if (opt_top_hits_only && (hp->id < top_hit_id)) { break; } if (fp_fastapairs) { results_show_fastapairs_one(fp_fastapairs, hp, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_qsegout) { results_show_qsegout_one(fp_qsegout, hp, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_tsegout) { results_show_tsegout_one(fp_tsegout, hp, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_uc) { if ((t==0) || opt_uc_allhits) { results_show_uc_one(fp_uc, hp, query_head, qsequence, qseqlen, qsequence_rc, hp->target); } } if (fp_userout) { results_show_userout_one(fp_userout, hp, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_blast6out) { results_show_blast6out_one(fp_blast6out, hp, query_head, qsequence, qseqlen, qsequence_rc); } } } else { if (fp_uc) { results_show_uc_one(fp_uc, nullptr, query_head, qsequence, qseqlen, qsequence_rc, 0); } if (opt_output_no_hits) { if (fp_userout) { results_show_userout_one(fp_userout, nullptr, query_head, qsequence, qseqlen, qsequence_rc); } if (fp_blast6out) { results_show_blast6out_one(fp_blast6out, nullptr, query_head, qsequence, qseqlen, qsequence_rc); } } } if (hit_count) { count_matched++; if (opt_matched) { fasta_print_general(fp_matched, nullptr, qsequence, qseqlen, query_head, strlen(query_head), 0, count_matched, -1.0, -1, -1, nullptr, 0.0); } } else { count_notmatched++; if (opt_notmatched) { fasta_print_general(fp_notmatched, nullptr, qsequence, qseqlen, query_head, 
strlen(query_head), 0, count_notmatched, -1.0, -1, -1, nullptr, 0.0); } } } void allpairs_thread_run(int64_t t) { (void) t; struct searchinfo_s sia; struct searchinfo_s * si = & sia; si->strand = 0; si->query_head_alloc = 0; si->seq_alloc = 0; si->kmersamplecount = 0; si->kmers = nullptr; si->m = nullptr; si->finalized = 0; si->hits = (struct hit *) xmalloc(sizeof(struct hit) * seqcount); struct nwinfo_s * nw = nw_init(); si->s = search16_init(opt_match, opt_mismatch, opt_gap_open_query_left, opt_gap_open_target_left, opt_gap_open_query_interior, opt_gap_open_target_interior, opt_gap_open_query_right, opt_gap_open_target_right, opt_gap_extension_query_left, opt_gap_extension_target_left, opt_gap_extension_query_interior, opt_gap_extension_target_interior, opt_gap_extension_query_right, opt_gap_extension_target_right); LinearMemoryAligner lma; int64_t * scorematrix = lma.scorematrix_create(opt_match, opt_mismatch); lma.set_parameters(scorematrix, opt_gap_open_query_left, opt_gap_open_target_left, opt_gap_open_query_interior, opt_gap_open_target_interior, opt_gap_open_query_right, opt_gap_open_target_right, opt_gap_extension_query_left, opt_gap_extension_target_left, opt_gap_extension_query_interior, opt_gap_extension_target_interior, opt_gap_extension_query_right, opt_gap_extension_target_right); /* allocate memory for alignment results */ unsigned int maxhits = seqcount; auto * pseqnos = (unsigned int *) xmalloc(sizeof(unsigned int) * maxhits); CELL * pscores = (CELL*) xmalloc(sizeof(CELL) * maxhits); auto * paligned = (unsigned short*) xmalloc(sizeof(unsigned short) * maxhits); auto * pmatches = (unsigned short*) xmalloc(sizeof(unsigned short) * maxhits); auto * pmismatches = (unsigned short*) xmalloc(sizeof(unsigned short) * maxhits); auto * pgaps = (unsigned short*) xmalloc(sizeof(unsigned short) * maxhits); char** pcigar = (char**) xmalloc(sizeof(char*) * maxhits); auto * finalhits = (struct hit *) xmalloc(sizeof(struct hit) * seqcount); bool cont = true; while 
(cont) { xpthread_mutex_lock(&mutex_input); int query_no = queries; if (query_no < seqcount) { queries++; /* let other threads read input */ xpthread_mutex_unlock(&mutex_input); /* init search info */ si->query_no = query_no; si->qsize = db_getabundance(query_no); si->query_head_len = db_getheaderlen(query_no); si->query_head = db_getheader(query_no); si->qseqlen = db_getsequencelen(query_no); si->qsequence = db_getsequence(query_no); si->rejects = 0; si->accepts = 0; si->hit_count = 0; for(int target = si->query_no + 1; target < seqcount; target++) { if (opt_acceptall || search_acceptable_unaligned(si, target)) { pseqnos[si->hit_count++] = target; } } if (si->hit_count) { /* perform alignments */ search16_qprep(si->s, si->qsequence, si->qseqlen); search16(si->s, si->hit_count, pseqnos, pscores, paligned, pmatches, pmismatches, pgaps, pcigar); /* convert to hit structure */ for (int h = 0; h < si->hit_count; h++) { struct hit * hit = si->hits + h; unsigned int target = pseqnos[h]; int64_t nwscore = pscores[h]; char * nwcigar {nullptr}; int64_t nwalignmentlength {0}; int64_t nwmatches {0}; int64_t nwmismatches {0}; int64_t nwgaps {0}; if (nwscore == SHRT_MAX) { /* In case the SIMD aligner cannot align, perform a new alignment with the linear memory aligner */ char * tseq = db_getsequence(target); int64_t tseqlen = db_getsequencelen(target); if (pcigar[h]) { xfree(pcigar[h]); } nwcigar = xstrdup(lma.align(si->qsequence, tseq, si->qseqlen, tseqlen)); lma.alignstats(nwcigar, si->qsequence, tseq, & nwscore, & nwalignmentlength, & nwmatches, & nwmismatches, & nwgaps); } else { nwcigar = pcigar[h]; nwalignmentlength = paligned[h]; nwmatches = pmatches[h]; nwmismatches = pmismatches[h]; nwgaps = pgaps[h]; } hit->target = target; hit->strand = 0; hit->count = 0; hit->accepted = false; hit->rejected = false; hit->aligned = true; hit->weak = false; hit->nwscore = nwscore; hit->nwdiff = nwalignmentlength - nwmatches; hit->nwgaps = nwgaps; hit->nwindels = nwalignmentlength - 
nwmatches - nwmismatches; hit->nwalignmentlength = nwalignmentlength; hit->nwid = 100.0 * (nwalignmentlength - hit->nwdiff) / nwalignmentlength; hit->nwalignment = nwcigar; hit->matches = nwalignmentlength - hit->nwdiff; hit->mismatches = hit->nwdiff - hit->nwindels; int64_t dseqlen = db_getsequencelen(target); hit->shortest = MIN(si->qseqlen, dseqlen); hit->longest = MAX(si->qseqlen, dseqlen); /* trim alignment, compute numbers excluding terminal gaps */ align_trim(hit); /* test accept/reject criteria after alignment */ if (opt_acceptall || search_acceptable_aligned(si, hit)) { finalhits[si->accepts++] = *hit; } } /* sort hits */ qsort(finalhits, si->accepts, sizeof(struct hit), allpairs_hit_compare); } /* lock mutex for update of global data and output */ xpthread_mutex_lock(&mutex_output); /* output results */ allpairs_output_results(si->accepts, finalhits, si->query_head, si->qseqlen, si->qsequence, nullptr); /* update stats */ if (si->accepts) { qmatches++; } /* show progress */ progress += seqcount - query_no - 1; progress_update(progress); xpthread_mutex_unlock(&mutex_output); /* free memory for alignment strings */ for(int i=0; i < si->hit_count; i++) { if (si->hits[i].aligned) { xfree(si->hits[i].nwalignment); } } } else { /* let other threads read input */ xpthread_mutex_unlock(&mutex_input); cont = false; } } xfree(finalhits); xfree(pcigar); xfree(pgaps); xfree(pmismatches); xfree(pmatches); xfree(paligned); xfree(pscores); xfree(pseqnos); search16_exit(si->s); nw_exit(nw); xfree(scorematrix); xfree(si->hits); } void * allpairs_thread_worker(void * vp) { auto t = (int64_t) vp; allpairs_thread_run(t); return nullptr; } void allpairs_thread_worker_run() { /* initialize threads, start them, join them and return */ xpthread_attr_init(&attr); xpthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE); /* init and create worker threads, put them into stand-by mode */ for(int t=0; t 0) { fprintf(stderr, " (%.2f%%)", 100.0 * qmatches / queries); } 
fprintf(stderr, "\n"); } if (opt_log) { fprintf(fp_log, "Matching query sequences: %d of %d", qmatches, queries); if (queries > 0) { fprintf(fp_log, " (%.2f%%)", 100.0 * qmatches / queries); } fprintf(fp_log, "\n\n"); } xpthread_mutex_destroy(&mutex_output); xpthread_mutex_destroy(&mutex_input); xfree(pthread); /* clean up, global */ db_free(); if (opt_matched) { fclose(fp_matched); } if (opt_notmatched) { fclose(fp_notmatched); } if (opt_fastapairs) { fclose(fp_fastapairs); } if (opt_qsegout) { fclose(fp_qsegout); } if (opt_tsegout) { fclose(fp_tsegout); } if (fp_uc) { fclose(fp_uc); } if (fp_blast6out) { fclose(fp_blast6out); } if (fp_userout) { fclose(fp_userout); } if (fp_alnout) { fclose(fp_alnout); } if (fp_samout) { fclose(fp_samout); } show_rusage(); } vsearch-2.21.1/src/citycrc.h0000644000175000017500000000350614171574117015172 0ustar nileshnilesh// Copyright (c) 2011 Google, Inc. // // Permission is hereby granted, free of charge, to any person obtaining a copy // of this software and associated documentation files (the "Software"), to deal // in the Software without restriction, including without limitation the rights // to use, copy, modify, merge, publish, distribute, sublicense, and/or sell // copies of the Software, and to permit persons to whom the Software is // furnished to do so, subject to the following conditions: // // The above copyright notice and this permission notice shall be included in // all copies or substantial portions of the Software. // // THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR // IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, // FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE // AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER // LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN // THE SOFTWARE. 
//
// CityHash, by Geoff Pike and Jyrki Alakuijala
//
// This file declares the subset of the CityHash functions that require
// _mm_crc32_u64().  See the CityHash README for details.
//
// Functions in the CityHash family are not suitable for cryptography.

#ifndef CITY_HASH_CRC_H_
#define CITY_HASH_CRC_H_

#include <city.h>

// Hash function for a byte array.
uint128 CityHashCrc128(const char *s, size_t len);

// Hash function for a byte array.  For convenience, a 128-bit seed is also
// hashed into the result.
uint128 CityHashCrc128WithSeed(const char *s, size_t len, uint128 seed);

// Hash function for a byte array.  Sets result[0] ... result[3].
void CityHashCrc256(const char *s, size_t len, uint64 *result);

#endif  // CITY_HASH_CRC_H_
vsearch-2.21.1/src/fa2fq.cc0000644000175000017500000001026214171574117014664 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see .
The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/

#include "vsearch.h"

auto fasta2fastq() -> void
{
  /* every base gets the highest quality score allowed in the output */
  const char max_ascii_value { static_cast<char>(opt_fastq_asciiout + opt_fastq_qmaxout) };

  if (opt_fastqout == nullptr) {
    fatal("Output FASTQ file not specified with the --fastqout option");
  }

  fastx_handle h { fasta_open(opt_fasta2fastq) };
  if (h == nullptr) {
    fatal("Unable to open FASTA file for reading");
  }

  std::FILE * fp_fastqout { fopen_output(opt_fastqout) };
  if (fp_fastqout == nullptr) {
    fatal("Unable to open FASTQ output file for writing");
  }

  int count {0};
  size_t alloc {0};
  char * quality {nullptr};

  progress_init("Converting FASTA file to FASTQ", fasta_get_size(h));

  while(fasta_next(h, false, chrmap_no_change))
    {
      /* get sequence length and allocate more mem if necessary */
      const uint64_t length { fastq_get_sequence_length(h) };
      if (alloc < length + 1) {
        alloc = length + 1;
        quality = (char*) xrealloc(quality, alloc);
      }

      /* set quality values */
      for(uint64_t i = 0; i < length; i++) {
        quality[i] = max_ascii_value;
      }
      quality[length] = 0;

      ++count;

      /* write to FASTQ file */
      fastq_print_general(fp_fastqout,
                          fastq_get_sequence(h),
                          length,
                          fasta_get_header(h),
                          fasta_get_header_length(h),
                          quality,
                          fastq_get_abundance(h),
                          count,
                          -1.0);

      progress_update(fasta_get_position(h));
    }

  progress_done();

  /* clean up */
  if (quality != nullptr) {
    xfree(quality);
  }
  fclose(fp_fastqout);
  fasta_close(h);
}
vsearch-2.21.1/src/kmerhash.h0000644000175000017500000000553514171574117015340 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License.
GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/

struct kh_handle_s;

struct kh_handle_s * kh_init();

void kh_exit(struct kh_handle_s * kh);

void kh_insert_kmers(struct kh_handle_s * kh, int k, char * seq, int len);

int kh_find_best_diagonal(struct kh_handle_s * kh, int k, char * seq, int len);

void kh_find_diagonals(struct kh_handle_s * kh, int k, char * seq, int len, int * diags);
vsearch-2.21.1/src/maps.h0000644000175000017500000000613714171574117014475 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

extern char sym_nt_2bit[5];
extern char sym_nt_4bit[17];

extern unsigned int ambiguous_4bit[16];

extern unsigned int char_header_action[256];
extern unsigned int char_fasta_action[256];
extern unsigned int char_fq_action_seq[256];
extern unsigned int char_fq_action_qual[256];

extern unsigned int chrmap_2bit[256];
extern unsigned int chrmap_4bit[256];
extern unsigned int chrmap_mask_ambig[256];
extern unsigned int chrmap_mask_lower[256];

extern const unsigned char chrmap_complement[256];
extern const unsigned char chrmap_normalize[256];
extern const unsigned char chrmap_upcase[256];
extern const unsigned char chrmap_no_change[256];
extern const unsigned char chrmap_identity[256];
vsearch-2.21.1/src/fastq.h0000644000175000017500000000716014171574117014650 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License.
GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ void fastq_open_rest(fastx_handle h); fastx_handle fastq_open(const char * filename); void fastq_close(fastx_handle h); bool fastq_next(fastx_handle h, bool truncateatspace, const unsigned char * char_mapping); uint64_t fastq_get_position(fastx_handle h); uint64_t fastq_get_size(fastx_handle h); uint64_t fastq_get_lineno(fastx_handle h); uint64_t fastq_get_seqno(fastx_handle h); char * fastq_get_header(fastx_handle h); char * fastq_get_sequence(fastx_handle h); char * fastq_get_quality(fastx_handle h); int64_t fastq_get_abundance(fastx_handle h); int64_t fastq_get_abundance_and_presence(fastx_handle h); uint64_t fastq_get_header_length(fastx_handle h); uint64_t fastq_get_sequence_length(fastx_handle h); uint64_t fastq_get_quality_length(fastx_handle h); void fastq_print(FILE * fp, char * header, char * sequence, char * quality); void fastq_print_general(FILE * fp, char * seq, int len, char * header, int header_len, char * quality, int abundance, int ordinal, double ee); vsearch-2.21.1/src/minheap.cc0000644000175000017500000001652514171574117015316 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" /* implement a priority queue with a min heap binary array structure */ /* elements with the lowest count should be at the top (root) */ /* To keep track of the n best potential target sequences, we store them in a min heap. The root element corresponds to the least good target, while the best elements are found at the leaf nodes. This makes it simple to decide whether a new target should be included or not, because it just needs to be compared to the root node.
The list will be fully sorted before use when we want to find the best element and then the second best and so on. */ int elem_smaller(elem_t * a, elem_t * b) { /* return 1 if a is smaller than b, 0 if equal or greater */ if (a->count < b->count) { return 1; } else if (a->count > b->count) { return 0; } else if (a->length > b->length) { return 1; } else if (a->length < b->length) { return 0; } else if (a->seqno > b->seqno) { return 1; } else { return 0; } } int minheap_compare(const void * a, const void * b) { auto * x = (elem_t*) a; auto * y = (elem_t*) b; /* return -1 if a is smaller than b, +1 if greater, otherwise 0 */ /* first: lower count, larger length, higher seqno */ if (x->count < y->count) { return -1; } else if (x->count > y->count) { return +1; } else if (x->length > y->length) { return -1; } else if (x->length < y->length) { return +1; } else if (x->seqno > y->seqno) { return -1; } else if (x->seqno < y->seqno) { return +1; } else { return 0; } } minheap_t * minheap_init(int size) { auto * m = (minheap_t *) xmalloc(sizeof(minheap_t)); m->alloc = size; m->array = (elem_t *) xmalloc(size * sizeof(elem_t)); m->count = 0; return m; } void minheap_exit(minheap_t * m) { xfree(m->array); xfree(m); } static int swaps = 0; void minheap_replaceroot(minheap_t * m, elem_t tmp) { /* remove the element at the root, then swap children up to the root and insert tmp at suitable place */ /* start with root */ int p = 0; int c = 2*p+1; /* while at least one child */ while (c < m->count) { /* if two children: swap with the one with smallest value */ if ((c + 1 < m->count) && (elem_smaller(m->array + c + 1, m->array + c))) { c++; } /* swap parent and child if child has lower value */ if (elem_smaller(m->array + c, &tmp)) { m->array[p] = m->array[c]; swaps++; } else { break; } /* step down */ p = c; c = 2*p+1; } m->array[p] = tmp; } void minheap_add(minheap_t * m, elem_t * n) { if (m->count < m->alloc) { /* space for another item at end; swap upwards */ int i = m->count++;
int p = (i-1)/2; while ((i>0) && elem_smaller(n, m->array+p)) { m->array[i] = m->array[p]; i = p; p = (i-1)/2; swaps++; } m->array[i] = *n; } else if (elem_smaller(m->array, n)) { /* replace the root if new element is larger than root */ minheap_replaceroot(m, *n); } } #if 0 inline int minheap_isempty(minheap_t * m) { return !m->count; } inline void minheap_empty(minheap_t * m) { m->count = 0; } #endif elem_t minheap_pop(minheap_t * m) { /* return top element and restore order */ static elem_t zero = {0,0,0}; if (m->count) { elem_t top = m->array[0]; m->count--; if (m->count) { elem_t tmp = m->array[m->count]; minheap_replaceroot(m, tmp); } return top; } else { return zero; } } void minheap_sort(minheap_t * m) { qsort(m->array, m->count, sizeof(elem_t), minheap_compare); } void minheap_dump(minheap_t * m) { for(int i=0; i < m->count; i++) { printf("%s%u", i>0 ? " " : "", m->array[i].count); } printf("\n"); } elem_t minheap_poplast(minheap_t * m) { /* remove and return the last element of the array */ static elem_t zero = {0,0,0}; if (m->count) { return m->array[--m->count]; } else { return zero; } } void minheap_test() { minheap_t * m = minheap_init(10000000); int samples = 10000000; swaps = 0; for(int i=samples; i>=0; i--) { elem_t x = {(unsigned int)(rand()),0,1}; minheap_add(m, & x); } minheap_sort(m); while(! minheap_isempty(m)) { elem_t x = minheap_poplast(m); printf("%u\n", x.count); } printf("Swaps: %d\n\n", swaps); minheap_exit(m); } vsearch-2.21.1/src/eestats.cc0000644000175000017500000004210714171574117015340 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved.
Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "vsearch.h" inline int fastq_get_qual_eestats(char q) { int qual = q - opt_fastq_ascii; if (qual < opt_fastq_qmin) { fprintf(stderr, "\n\nFatal error: FASTQ quality value (%d) below qmin (%" PRId64 ")\n", qual, opt_fastq_qmin); if (fp_log) { /* write the same message to the log file, not to stderr a second time */ fprintf(fp_log, "\n\nFatal error: FASTQ quality value (%d) below qmin (%" PRId64 ")\n", qual, opt_fastq_qmin); } exit(EXIT_FAILURE); } else if (qual > opt_fastq_qmax) { fprintf(stderr, "\n\nFatal error: FASTQ quality value (%d) above qmax (%" PRId64 ")\n", qual, opt_fastq_qmax); fprintf(stderr, "By default, quality values range from 0 to 41.\n" "To allow higher quality values, " "please use the option --fastq_qmax %d\n", qual); if (fp_log) { fprintf(fp_log, "\n\nFatal error: FASTQ quality value (%d) above qmax (%" PRId64 ")\n", qual, opt_fastq_qmax); fprintf(fp_log, "By default, quality values range from 0 to 41.\n" "To allow higher quality values, " "please use the option --fastq_qmax %d\n", qual); } exit(EXIT_FAILURE); } return qual; } double q2p(int q) { return exp10(- q / 10.0); } int64_t ee_start(int pos, int resolution) { return pos * (resolution * (pos + 1) + 2) / 2; } void fastq_eestats() { if (!opt_output) fatal("Output file for fastq_eestats must be specified with --output"); fastx_handle h = fastq_open(opt_fastq_eestats); uint64_t filesize = fastq_get_size(h); FILE * fp_output = nullptr; if (opt_output) { fp_output = fopen_output(opt_output); if (!fp_output) { fatal("Unable to open output file for
writing"); } } progress_init("Reading FASTQ file", filesize); uint64_t seq_count = 0; uint64_t symbols = 0; int64_t len_alloc = 10; const int resolution = 1000; int max_quality = opt_fastq_qmax - opt_fastq_qmin + 1; int64_t ee_size = ee_start(len_alloc, resolution); auto * read_length_table = (uint64_t*) xmalloc(sizeof(uint64_t) * len_alloc); memset(read_length_table, 0, sizeof(uint64_t) * len_alloc); auto * qual_length_table = (uint64_t*) xmalloc(sizeof(uint64_t) * len_alloc * (max_quality+1)); memset(qual_length_table, 0, sizeof(uint64_t) * len_alloc * (max_quality+1)); auto * ee_length_table = (uint64_t*) xmalloc(sizeof(uint64_t) * ee_size); memset(ee_length_table, 0, sizeof(uint64_t) * ee_size); auto * sum_ee_length_table = (double*) xmalloc(sizeof(double) * len_alloc); memset(sum_ee_length_table, 0, sizeof(double) * len_alloc); auto * sum_pe_length_table = (double*) xmalloc(sizeof(double) * len_alloc); memset(sum_pe_length_table, 0, sizeof(double) * len_alloc); int64_t len_min = LONG_MAX; int64_t len_max = 0; while(fastq_next(h, false, chrmap_upcase)) { seq_count++; int64_t len = fastq_get_sequence_length(h); char * q = fastq_get_quality(h); /* update length statistics */ int64_t new_alloc = len + 1; if (new_alloc > len_alloc) { int64_t new_ee_size = ee_start(new_alloc, resolution); read_length_table = (uint64_t*) xrealloc(read_length_table, sizeof(uint64_t) * new_alloc); memset(read_length_table + len_alloc, 0, sizeof(uint64_t) * (new_alloc - len_alloc)); qual_length_table = (uint64_t*) xrealloc(qual_length_table, sizeof(uint64_t) * new_alloc * (max_quality+1)); memset(qual_length_table + (max_quality+1) * len_alloc, 0, sizeof(uint64_t) * (new_alloc - len_alloc) * (max_quality+1)); ee_length_table = (uint64_t*) xrealloc(ee_length_table, sizeof(uint64_t) * new_ee_size); memset(ee_length_table + ee_size, 0, sizeof(uint64_t) * (new_ee_size - ee_size)); sum_ee_length_table = (double*) xrealloc(sum_ee_length_table, sizeof(double) * new_alloc); 
memset(sum_ee_length_table + len_alloc, 0, sizeof(double) * (new_alloc - len_alloc)); sum_pe_length_table = (double*) xrealloc(sum_pe_length_table, sizeof(double) * new_alloc); memset(sum_pe_length_table + len_alloc, 0, sizeof(double) * (new_alloc - len_alloc)); len_alloc = new_alloc; ee_size = new_ee_size; } if (len < len_min) { len_min = len; } if (len > len_max) { len_max = len; } /* update quality statistics */ symbols += len; double ee = 0.0; for(int64_t i=0; i < len; i++) { read_length_table[i]++; /* quality score */ int qual = fastq_get_qual_eestats(q[i]); if (qual < 0) { qual = 0; } qual_length_table[(max_quality+1)*i + qual]++; /* Pe */ double pe = q2p(qual); sum_pe_length_table[i] += pe; /* expected number of errors */ ee += pe; int64_t e_int = MIN(resolution*(i+1), (int)(resolution * ee)); ee_length_table[ee_start(i, resolution) + e_int]++; sum_ee_length_table[i] += ee; } progress_update(fastq_get_position(h)); } progress_done(); fprintf(fp_output, "Pos\tRecs\tPctRecs\t" "Min_Q\tLow_Q\tMed_Q\tMean_Q\tHi_Q\tMax_Q\t" "Min_Pe\tLow_Pe\tMed_Pe\tMean_Pe\tHi_Pe\tMax_Pe\t" "Min_EE\tLow_EE\tMed_EE\tMean_EE\tHi_EE\tMax_EE\n"); for(int64_t i=0; i 0) { qsum += q * x; n += x; if (min_q<0) { min_q = q; } if ((low_q<0) && (n >= 0.25 * reads)) { low_q = q; } if ((med_q<0) && (n >= 0.50 * reads)) { med_q = q; } if ((hi_q<0) && (n >= 0.75 * reads)) { hi_q = q; } max_q = q; } } double mean_q = 1.0 * qsum / reads; /* pe */ double min_pe = -1.0; double low_pe = -1.0; double med_pe = -1.0; double hi_pe = -1.0; double max_pe = -1.0; double pesum = 0; n = 0; for(int q=max_quality; q>=0; q--) { double x = qual_length_table[(max_quality+1)*i+q]; if (x > 0) { double pe = q2p(q); pesum += pe * x; n += x; if (min_pe<0) { min_pe = pe; } if ((low_pe<0) && (n >= 0.25 * reads)) { low_pe = pe; } if ((med_pe<0) && (n >= 0.50 * reads)) { med_pe = pe; } if ((hi_pe<0) && (n >= 0.75 * reads)) { hi_pe = pe; } max_pe = pe; } } double mean_pe = 1.0 * pesum / reads; /* expected errors */ double 
min_ee = -1.0; double low_ee = -1.0; double med_ee = -1.0; double hi_ee = -1.0; double max_ee = -1.0; int64_t ee_offset = ee_start(i, resolution); int64_t max_errors = resolution * (i+1); n = 0; for(int64_t e=0; e<=max_errors; e++) { int64_t x = ee_length_table[ee_offset + e]; if (x > 0) { n += x; if (min_ee<0) { min_ee = e; } if ((low_ee<0) && (n >= 0.25 * reads)) { low_ee = e; } if ((med_ee<0) && (n >= 0.50 * reads)) { med_ee = e; } if ((hi_ee<0) && (n >= 0.75 * reads)) { hi_ee = e; } max_ee = e; } } double mean_ee = sum_ee_length_table[i] / reads; min_ee = (min_ee + 0.5) / resolution; low_ee = (low_ee + 0.5) / resolution; med_ee = (med_ee + 0.5) / resolution; hi_ee = (hi_ee + 0.5) / resolution; max_ee = (max_ee + 0.5) / resolution; fprintf(fp_output, "%" PRId64 "\t%" PRId64 "\t%.1lf" "\t%.1lf\t%.1lf\t%.1lf\t%.1lf\t%.1lf\t%.1lf" "\t%.2lg\t%.2lg\t%.2lg\t%.2lg\t%.2lg\t%.2lg" "\t%.2lf\t%.2lf\t%.2lf\t%.2lf\t%.2lf\t%.2lf\n", i+1, reads, pctrecs, min_q, low_q, med_q, mean_q, hi_q, max_q, min_pe, low_pe, med_pe, mean_pe, hi_pe, max_pe, min_ee, low_ee, med_ee, mean_ee, hi_ee, max_ee); } xfree(read_length_table); xfree(qual_length_table); xfree(ee_length_table); xfree(sum_ee_length_table); xfree(sum_pe_length_table); fclose(fp_output); fastq_close(h); } void fastq_eestats2() { if (!opt_output) fatal("Output file for fastq_eestats2 must be specified with --output"); fastx_handle h = fastq_open(opt_fastq_eestats2); uint64_t filesize = fastq_get_size(h); FILE * fp_output = nullptr; if (opt_output) { fp_output = fopen_output(opt_output); if (!fp_output) { fatal("Unable to open output file for writing"); } } progress_init("Reading FASTQ file", filesize); uint64_t seq_count = 0; uint64_t symbols = 0; uint64_t longest = 0; int len_steps = 0; uint64_t * count_table = nullptr; while(fastq_next(h, false, chrmap_upcase)) { seq_count++; uint64_t len = fastq_get_sequence_length(h); char * q = fastq_get_quality(h); /* update length statistics */ if (len > longest) { longest = len; int 
new_len_steps = 1 + MAX(0, (MIN(longest, (uint64_t)opt_length_cutoffs_longest) - opt_length_cutoffs_shortest) / opt_length_cutoffs_increment); if (new_len_steps > len_steps) { count_table = (uint64_t *) xrealloc(count_table, sizeof(uint64_t) * new_len_steps * opt_ee_cutoffs_count); memset(count_table + len_steps * opt_ee_cutoffs_count, 0, sizeof(uint64_t) * (new_len_steps - len_steps) * opt_ee_cutoffs_count); len_steps = new_len_steps; } } /* update quality statistics */ symbols += len; double ee = 0.0; for(uint64_t i=0; i < len; i++) { /* quality score */ int qual = fastq_get_qual_eestats(q[i]); if (qual < 0) { qual = 0; } double pe = q2p(qual); ee += pe; for (int x = 0; x < len_steps; x++) { uint64_t len_cutoff = opt_length_cutoffs_shortest + x * opt_length_cutoffs_increment; if (i+1 == len_cutoff) { for (int y = 0; y < opt_ee_cutoffs_count; y++) { if (ee <= opt_ee_cutoffs_values[y]) { count_table[x * opt_ee_cutoffs_count + y]++; } } } } } progress_update(fastq_get_position(h)); } progress_done(); fprintf(fp_output, "%" PRIu64 " reads", seq_count); if (seq_count > 0) { fprintf(fp_output, ", max len %" PRIu64 ", avg %.1f", longest, 1.0 * symbols / seq_count); } fprintf(fp_output, "\n\n"); fprintf(fp_output, "Length"); for (int y = 0; y < opt_ee_cutoffs_count; y++) { fprintf(fp_output, " MaxEE %.2f", opt_ee_cutoffs_values[y]); } fprintf(fp_output, "\n"); fprintf(fp_output, "------"); for (int y = 0; y < opt_ee_cutoffs_count; y++) { fprintf(fp_output, " ----------------"); } fprintf(fp_output, "\n"); for (int x = 0; x < len_steps; x++) { int len_cutoff = opt_length_cutoffs_shortest + x * opt_length_cutoffs_increment; if (len_cutoff > opt_length_cutoffs_longest) { break; } fprintf(fp_output, "%6d", len_cutoff); for (int y = 0; y < opt_ee_cutoffs_count; y++) { fprintf(fp_output, " %8" PRIu64 "(%5.1f%%)", count_table[x * opt_ee_cutoffs_count + y], 100.0 * count_table[x * opt_ee_cutoffs_count + y] / seq_count); } fprintf(fp_output, "\n"); } if (fp_log) { fprintf(fp_log, 
"%" PRIu64 " reads, max len %" PRIu64 ", avg %.1f\n\n", seq_count, longest, 1.0 * symbols / seq_count); fprintf(fp_log, "Length"); for (int y = 0; y < opt_ee_cutoffs_count; y++) { fprintf(fp_log, " MaxEE %.2f", opt_ee_cutoffs_values[y]); } fprintf(fp_log, "\n"); fprintf(fp_log, "------"); for (int y = 0; y < opt_ee_cutoffs_count; y++) { fprintf(fp_log, " ----------------"); } fprintf(fp_log, "\n"); for (int x = 0; x < len_steps; x++) { int len_cutoff = opt_length_cutoffs_shortest + x * opt_length_cutoffs_increment; if (len_cutoff > opt_length_cutoffs_longest) { break; } fprintf(fp_log, "%6d", len_cutoff); for (int y = 0; y < opt_ee_cutoffs_count; y++) { fprintf(fp_log, " %8" PRIu64 "(%5.1f%%)", count_table[x * opt_ee_cutoffs_count + y], 100.0 * count_table[x * opt_ee_cutoffs_count + y] / seq_count); } fprintf(fp_log, "\n"); } } if (count_table) { xfree(count_table); count_table = nullptr; } fclose(fp_output); fastq_close(h); } vsearch-2.21.1/src/bitmap.h0000644000175000017500000000630014171574117015001 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. 
You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ typedef struct bitmap_s { unsigned char * bitmap; /* the actual bitmap */ unsigned int size; /* size in bits */ } bitmap_t; bitmap_t * bitmap_init(unsigned int size); void bitmap_free(bitmap_t* b); inline unsigned char bitmap_get(bitmap_t * b, unsigned int x) { return (b->bitmap[x >> 3] >> (x & 7)) & 1; } inline void bitmap_reset_all(bitmap_t * b) { memset(b->bitmap, 0, (b->size+7)/8); } inline void bitmap_set_all(bitmap_t * b) { memset(b->bitmap, 255, (b->size+7)/8); } inline void bitmap_reset(bitmap_t * b, unsigned int x) { b->bitmap[x >> 3] &= ~ (1 << (x & 7)); } inline void bitmap_set(bitmap_t * b, unsigned int x) { b->bitmap[x >> 3] |= 1 << (x & 7); } inline void bitmap_flip(bitmap_t * b, unsigned int x) { b->bitmap[x >> 3] ^= 1 << (x & 7); } vsearch-2.21.1/src/eestats.h0000644000175000017500000000473214171574117015204 0ustar nileshnilesh/* VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . 
The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ void fastq_eestats(); void fastq_eestats2(); vsearch-2.21.1/.dockerignore0000644000175000017500000000003414171574117015237 0ustar nileshnilesh.git .gitignore .travis.yml vsearch-2.21.1/LICENSE.txt0000644000175000017500000000464414171574117014421 0ustar nileshnilesh VSEARCH: a versatile open source tool for metagenomics Copyright (C) 2014-2021, Torbjorn Rognes, Frederic Mahe and Tomas Flouri All rights reserved. Contact: Torbjorn Rognes , Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway This software is dual-licensed and available under a choice of one of two licenses, either under the terms of the GNU General Public License version 3 or the BSD 2-Clause License. 
GNU General Public License version 3 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . The BSD 2-Clause License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
vsearch-2.21.1/.gitignore0000644000175000017500000000035714171574117014563 0ustar nileshnilesh*.a *.o *.pdf *~ .deps .dirstamp /autom4te.cache /bin /config.h /config.log /config.status /config.guess /config.sub /stamp-h1 Makefile aclocal.m4 config.h.in configure compile depcomp install-sh missing Makefile.in .vscode .DS_Store .Tpo vsearch-2.21.1/Dockerfile0000644000175000017500000000062614171574117014564 0ustar nileshnileshFROM alpine:latest WORKDIR /opt/vsearch COPY . . RUN apk add --no-cache \ libstdc++ zlib-dev bzip2-dev \ autoconf automake make g++ && \ ./autogen.sh && \ ./configure CFLAGS="-O3" CXXFLAGS="-O3" && \ make clean && \ make && \ make install && \ make clean && \ apk del autoconf automake make g++ && \ rm -rf /opt/vsearch ENTRYPOINT ["/usr/local/bin/vsearch"] vsearch-2.21.1/configure.ac0000644000175000017500000000511114171574117015052 0ustar nileshnilesh# -*- Autoconf -*- # Process this file with autoconf to produce a configure script. AC_PREREQ([2.63]) AC_INIT([vsearch], [2.21.1], [torognes@ifi.uio.no], [vsearch], [https://github.com/torognes/vsearch]) AC_CANONICAL_TARGET AM_INIT_AUTOMAKE([subdir-objects]) AC_LANG([C++]) AC_CONFIG_SRCDIR([src/vsearch.cc]) AC_CONFIG_HEADERS([config.h]) AC_SUBST(MACOSX_DEPLOYMENT_TARGET) MACOSX_DEPLOYMENT_TARGET="10.9" # Checks for programs. AC_PROG_CXX AC_PROG_RANLIB AC_PROG_INSTALL # Checks for libraries. AC_CHECK_LIB([pthread], [pthread_create]) AC_CHECK_LIB([dl], [dlopen]) AC_CHECK_LIB([psapi], [GetProcessMemoryInfo]) # Checks for header files. AC_CHECK_HEADERS([getopt.h fcntl.h float.h regex.h ctype.h locale.h limits.h string.h sys/time.h dlfcn.h pthread.h]) # Checks for typedefs, structures, and compiler characteristics. AC_C_INLINE AC_TYPE_SIZE_T AC_TYPE_UINT32_T AC_TYPE_INT64_T AC_TYPE_UINT64_T AC_TYPE_UINT8_T # Checks for library functions. 
AC_FUNC_MALLOC
AC_FUNC_STRTOD
AC_FUNC_ALLOCA
AC_FUNC_REALLOC
AC_CHECK_FUNCS([memmove memcpy posix_memalign gettimeofday localtime memchr memset pow regcomp strcasecmp strchr strcspn sysinfo])

have_bzip2=no
AC_ARG_ENABLE(bzip2, AS_HELP_STRING([--disable-bzip2], [Disable bzip2 support]))
AS_IF([test "x$enable_bzip2" != "xno"], [
  have_bzip2=yes
])
if test "x${have_bzip2}" = "xyes"; then
  AC_CHECK_HEADERS([bzlib.h], [], [have_bzip2=no])
fi

have_zlib=no
AC_ARG_ENABLE(zlib, AS_HELP_STRING([--disable-zlib], [Disable zlib support]))
AS_IF([test "x$enable_zlib" != "xno"], [
  have_zlib=yes
])
if test "x${have_zlib}" = "xyes"; then
  AC_CHECK_HEADERS([zlib.h], [], [have_zlib=no])
fi

have_ps2pdf=no
AC_ARG_ENABLE(pdfman, AS_HELP_STRING([--disable-pdfman], [Disable PDF manual creation]))
AS_IF([test "x$enable_pdfman" != "xno"], [
  have_ps2pdf=yes
  AC_CHECK_PROG(HAVE_PS2PDF, ps2pdf, yes, no)
  if test "x$HAVE_PS2PDF" = "xno"; then
    AC_MSG_WARN([*** ps2pdf is required to build a PDF version of the manual])
    have_ps2pdf=no
  fi
])

have_man_html=no

case $target in
  aarch64*) target_aarch64="yes" ;;
  powerpc64*) target_ppc="yes" ;;
esac

AC_CHECK_HEADERS([windows.h], [AM_CONDITIONAL(TARGET_WIN, true)], [AM_CONDITIONAL(TARGET_WIN, false)])

AM_CONDITIONAL(HAVE_PS2PDF, test "x${have_ps2pdf}" = "xyes")
AM_CONDITIONAL(HAVE_MAN_HTML, test "x${have_man_html}" = "xyes")
AM_CONDITIONAL(TARGET_PPC, test "x${target_ppc}" = "xyes")
AM_CONDITIONAL(TARGET_AARCH64, test "x${target_aarch64}" = "xyes")
AM_PROG_CC_C_O
AC_CONFIG_FILES([Makefile src/Makefile man/Makefile])

AC_OUTPUT
vsearch-2.21.1/LICENSE_GNU_GPL3.txt0000644000175000017500000010451314171574117015713 0ustar nileshnilesh
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007

Copyright (C) 2007 Free Software Foundation, Inc.
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

Preamble

The GNU General Public License is a free, copyleft license for software and other kinds of works.
The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things. To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others. For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it. For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions. 
Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users. Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free. The precise terms and conditions for copying, distribution and modification follow. TERMS AND CONDITIONS 0. Definitions. "This License" refers to version 3 of the GNU General Public License. "Copyright" also means copyright-like laws that apply to other kinds of works, such as semiconductor masks. "The Program" refers to any copyrightable work licensed under this License. Each licensee is addressed as "you". "Licensees" and "recipients" may be individuals or organizations. To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work. A "covered work" means either the unmodified Program or a work based on the Program. 
To "propagate" a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well. To "convey" a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying. An interactive user interface displays "Appropriate Legal Notices" to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion. 1. Source Code. The "source code" for a work means the preferred form of the work for making modifications to it. "Object code" means any non-source form of a work. A "Standard Interface" means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language. The "System Libraries" of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. 
A "Major Component", in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it. The "Corresponding Source" for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work. The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source. The Corresponding Source for a work in source code form is that same work. 2. Basic Permissions. All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law. You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. 
You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you. Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 3. Protecting Users' Legal Rights From Anti-Circumvention Law. No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures. When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures. 4. Conveying Verbatim Copies. You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program. 
You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee. 5. Conveying Modified Source Versions. You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions: a) The work must carry prominent notices stating that you modified it, and giving a relevant date. b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to "keep intact all notices". c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it. d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so. A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an "aggregate" if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate. 6. Conveying Non-Source Forms. 
You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways: a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange. b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge. c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b. d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. 
Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements. e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d. A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work. A "User Product" is either (1) a "consumer product", which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, "normally used" refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product. "Installation Information" for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made. 
If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM). The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network. Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying. 7. Additional Terms. "Additional permissions" are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions. 
When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission. Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms: a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or d) Limiting the use for publicity purposes of names of licensors or authors of the material; or e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors. All other non-permissive additional terms are considered "further restrictions" within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. 
If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying. If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms. Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way. 8. Termination. You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11). However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation. Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice. Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. 
If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10. 9. Acceptance Not Required for Having Copies. You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so. 10. Automatic Licensing of Downstream Recipients. Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License. An "entity transaction" is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts. You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. 
For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it. 11. Patents. A "contributor" is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor's "contributor version". A contributor's "essential patent claims" are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, "control" includes the right to grant patent sublicenses in a manner consistent with the requirements of this License. Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version. In the following three paragraphs, a "patent license" is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To "grant" such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party. 
If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. "Knowingly relying" means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid. If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it. A patent license is "discriminatory" if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. 
You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007. Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law. 12. No Surrender of Others' Freedom. If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program. 13. Use with the GNU Affero General Public License. Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU Affero General Public License into a single combined work, and to convey the resulting work. 
The terms of this License will continue to apply to the part which is the covered work, but the special requirements of the GNU Affero General Public License, section 13, concerning interaction through a network will apply to the combination as such. 14. Revised Versions of this License. The Free Software Foundation may publish revised and/or new versions of the GNU General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU General Public License, you may choose any version ever published by the Free Software Foundation. If the Program specifies that a proxy can decide which future versions of the GNU General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program. Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version. 15. Disclaimer of Warranty. THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. 
SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

16. Limitation of Liability.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

17. Interpretation of Sections 15 and 16.

If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee.

END OF TERMS AND CONDITIONS

How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found.

    Copyright (C)

    This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along with this program. If not, see .

Also add information on how to contact you by electronic and paper mail.

If the program does terminal interaction, make it output a short notice like this when it starts in an interactive mode:

    Copyright (C)
    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, your program's commands might be different; for a GUI interface, you would use an "about box".

You should also get your employer (if you work as a programmer) or school, if any, to sign a "copyright disclaimer" for the program, if necessary. For more information on this, and how to apply and follow the GNU GPL, see .

The GNU General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. But first, please read .