MHAP-2.1.1/000077500000000000000000000000001277502137000122445ustar00rootroot00000000000000MHAP-2.1.1/.gitignore000066400000000000000000000001661277502137000142370ustar00rootroot00000000000000/Utils$Pair.class /Utils$ToProtein.class /Utils$Translate.class /Utils.class /buildMulti.class /bin /target /classes/ MHAP-2.1.1/LICENSE.txt000066400000000000000000000261261277502137000140760ustar00rootroot00000000000000 Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. 
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. 
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. 
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. 
Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. 
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. 
Copyright 2012 Konstantin Berlin

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

MHAP-2.1.1/NOTICE.txt000066400000000000000000000000141277502137000137610ustar00rootroot00000000000000
Nothing here
MHAP-2.1.1/README.md000066400000000000000000000027671277502137000135370ustar00rootroot00000000000000
# MHAP

MinHash alignment process (MHAP, pronounced MAP): locality sensitive hashing to detect overlaps and utilities. This is the development branch; please use the [latest tagged](https://github.com/marbl/MHAP/releases/tag/v2.1.1) release.

## Build

You must have a recent [JDK](http://www.oracle.com/technetwork/java/javase/downloads/index.html "JDK") and [Apache Maven](http://maven.apache.org/ "MAVEN") available. To check out and build, run:

    git clone https://github.com/marbl/MHAP.git
    cd MHAP
    mvn install

For a quick user guide, run:

    cd target
    java -jar mhap-2.1.1.jar

## Docs

For the full documentation, please see http://mhap.readthedocs.io/en/latest/

## Cite

- Berlin K, Koren S, Chin CS, Drake PJ, Landolin JM, Phillippy AM [Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing](http://www.nature.com/nbt/journal/v33/n6/abs/nbt.3238.html "nb"). Nature Biotechnology. (2015).

## License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. MHAP-2.1.1/docs/000077500000000000000000000000001277502137000131745ustar00rootroot00000000000000MHAP-2.1.1/docs/Makefile000066400000000000000000000153131277502137000146370ustar00rootroot00000000000000# Makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD = sphinx-build-2.7 PAPER = BUILDDIR = build # User-friendly check for sphinx-build ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) endif # Internal variables. 
PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source # the i18n builder cannot share the environment and doctrees with the others I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext help: @echo "Please use \`make <target>' where <target> is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " singlehtml to make a single large HTML file" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " htmlhelp to make HTML files and a HTML help project" @echo " qthelp to make HTML files and a qthelp project" @echo " devhelp to make HTML files and a Devhelp project" @echo " epub to make an epub" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" @echo " text to make text files" @echo " man to make manual pages" @echo " texinfo to make Texinfo files" @echo " info to make Texinfo files and run them through makeinfo" @echo " gettext to make PO message catalogs" @echo " changes to make an overview of all changed/added/deprecated items" @echo " xml to make Docutils-native XML files" @echo " pseudoxml to make pseudoxml-XML files for display purposes" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the documentation (if enabled)" clean: rm -rf $(BUILDDIR)/* html: $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished. 
The HTML pages are in $(BUILDDIR)/dirhtml." singlehtml: $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml @echo @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files." htmlhelp: $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp @echo @echo "Build finished; now you can run HTML Help Workshop with the" \ ".hhp project file in $(BUILDDIR)/htmlhelp." qthelp: $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp @echo @echo "Build finished; now you can run "qcollectiongenerator" with the" \ ".qhcp project file in $(BUILDDIR)/qthelp, like this:" @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/MinHashAlignmentProcessMHAP.qhcp" @echo "To view the help file:" @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/MinHashAlignmentProcessMHAP.qhc" devhelp: $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp @echo @echo "Build finished." @echo "To view the help file:" @echo "# mkdir -p $$HOME/.local/share/devhelp/MinHashAlignmentProcessMHAP" @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/MinHashAlignmentProcessMHAP" @echo "# devhelp" epub: $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub @echo @echo "Build finished. The epub file is in $(BUILDDIR)/epub." latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." 
$(MAKE) -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." latexpdfja: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through platex and dvipdfmx..." $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." text: $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text @echo @echo "Build finished. The text files are in $(BUILDDIR)/text." man: $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man @echo @echo "Build finished. The manual pages are in $(BUILDDIR)/man." texinfo: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." @echo "Run \`make' in that directory to run these through makeinfo" \ "(use \`make info' here to do that automatically)." info: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo "Running Texinfo files through makeinfo..." make -C $(BUILDDIR)/texinfo info @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." gettext: $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale @echo @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." xml: $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml @echo @echo "Build finished. The XML files are in $(BUILDDIR)/xml." pseudoxml: $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml @echo @echo "Build finished. 
The pseudo-XML files are in $(BUILDDIR)/pseudoxml." MHAP-2.1.1/docs/make.bat000066400000000000000000000151401277502137000146020ustar00rootroot00000000000000@ECHO OFF REM Command file for Sphinx documentation if "%SPHINXBUILD%" == "" ( set SPHINXBUILD=sphinx-build ) set BUILDDIR=build set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% source set I18NSPHINXOPTS=%SPHINXOPTS% source if NOT "%PAPER%" == "" ( set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS% set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS% ) if "%1" == "" goto help if "%1" == "help" ( :help echo.Please use `make ^<target^>` where ^<target^> is one of echo. html to make standalone HTML files echo. dirhtml to make HTML files named index.html in directories echo. singlehtml to make a single large HTML file echo. pickle to make pickle files echo. json to make JSON files echo. htmlhelp to make HTML files and a HTML help project echo. qthelp to make HTML files and a qthelp project echo. devhelp to make HTML files and a Devhelp project echo. epub to make an epub echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter echo. text to make text files echo. man to make manual pages echo. texinfo to make Texinfo files echo. gettext to make PO message catalogs echo. changes to make an overview over all changed/added/deprecated items echo. xml to make Docutils-native XML files echo. pseudoxml to make pseudoxml-XML files for display purposes echo. linkcheck to check all external links for integrity echo. doctest to run all doctests embedded in the documentation if enabled goto end ) if "%1" == "clean" ( for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i del /q /s %BUILDDIR%\* goto end ) %SPHINXBUILD% 2> nul if errorlevel 9009 ( echo. echo.The 'sphinx-build' command was not found. Make sure you have Sphinx echo.installed, then set the SPHINXBUILD environment variable to point echo.to the full path of the 'sphinx-build' executable. Alternatively you echo.may add the Sphinx directory to PATH. echo. 
echo.If you don't have Sphinx installed, grab it from echo.http://sphinx-doc.org/ exit /b 1 ) if "%1" == "html" ( %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/html. goto end ) if "%1" == "dirhtml" ( %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml. goto end ) if "%1" == "singlehtml" ( %SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml. goto end ) if "%1" == "pickle" ( %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can process the pickle files. goto end ) if "%1" == "json" ( %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can process the JSON files. goto end ) if "%1" == "htmlhelp" ( %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can run HTML Help Workshop with the ^ .hhp project file in %BUILDDIR%/htmlhelp. goto end ) if "%1" == "qthelp" ( %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can run "qcollectiongenerator" with the ^ .qhcp project file in %BUILDDIR%/qthelp, like this: echo.^> qcollectiongenerator %BUILDDIR%\qthelp\MinHashAlignmentProcessMHAP.qhcp echo.To view the help file: echo.^> assistant -collectionFile %BUILDDIR%\qthelp\MinHashAlignmentProcessMHAP.qhc goto end ) if "%1" == "devhelp" ( %SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp if errorlevel 1 exit /b 1 echo. echo.Build finished. goto end ) if "%1" == "epub" ( %SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub if errorlevel 1 exit /b 1 echo. echo.Build finished. 
The epub file is in %BUILDDIR%/epub. goto end ) if "%1" == "latex" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex if errorlevel 1 exit /b 1 echo. echo.Build finished; the LaTeX files are in %BUILDDIR%/latex. goto end ) if "%1" == "latexpdf" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex cd %BUILDDIR%/latex make all-pdf cd %BUILDDIR%/.. echo. echo.Build finished; the PDF files are in %BUILDDIR%/latex. goto end ) if "%1" == "latexpdfja" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex cd %BUILDDIR%/latex make all-pdf-ja cd %BUILDDIR%/.. echo. echo.Build finished; the PDF files are in %BUILDDIR%/latex. goto end ) if "%1" == "text" ( %SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text if errorlevel 1 exit /b 1 echo. echo.Build finished. The text files are in %BUILDDIR%/text. goto end ) if "%1" == "man" ( %SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man if errorlevel 1 exit /b 1 echo. echo.Build finished. The manual pages are in %BUILDDIR%/man. goto end ) if "%1" == "texinfo" ( %SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo if errorlevel 1 exit /b 1 echo. echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo. goto end ) if "%1" == "gettext" ( %SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale if errorlevel 1 exit /b 1 echo. echo.Build finished. The message catalogs are in %BUILDDIR%/locale. goto end ) if "%1" == "changes" ( %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes if errorlevel 1 exit /b 1 echo. echo.The overview file is in %BUILDDIR%/changes. goto end ) if "%1" == "linkcheck" ( %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck if errorlevel 1 exit /b 1 echo. echo.Link check complete; look for any errors in the above output ^ or in %BUILDDIR%/linkcheck/output.txt. goto end ) if "%1" == "doctest" ( %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest if errorlevel 1 exit /b 1 echo. 
echo.Testing of doctests in the sources finished, look at the ^ results in %BUILDDIR%/doctest/output.txt. goto end ) if "%1" == "xml" ( %SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml if errorlevel 1 exit /b 1 echo. echo.Build finished. The XML files are in %BUILDDIR%/xml. goto end ) if "%1" == "pseudoxml" ( %SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml if errorlevel 1 exit /b 1 echo. echo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml. goto end ) :end MHAP-2.1.1/docs/source/000077500000000000000000000000001277502137000144745ustar00rootroot00000000000000MHAP-2.1.1/docs/source/conf.py000066400000000000000000000204561277502137000160020ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # MinHash Alignment Process (MHAP) documentation build configuration file, created by # sphinx-quickstart on Sun Jul 13 18:13:46 2014. # # This file is execfile()d with the current directory set to its # containing dir. # # Note that not all possible configuration values are present in this # autogenerated file. # # All configuration values have a default; values that are commented out # serve to show the default. import sys import os # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. #sys.path.insert(0, os.path.abspath('.')) # -- General configuration ------------------------------------------------ # If your documentation needs a minimal Sphinx version, state it here. #needs_sphinx = '1.0' # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = [ 'sphinx.ext.todo', 'sphinx.ext.mathjax', ] # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] # The suffix of source filenames. 
source_suffix = '.rst' # The encoding of source files. #source_encoding = 'utf-8-sig' # The master toctree document. master_doc = 'index' # General information about the project. project = u'MinHash Alignment Process (MHAP)' copyright = u'2014, Sergey Koren and Konstantin Berlin' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. version = '2.1' # The full version, including alpha/beta/rc tags. release = '2.1' # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. #language = None # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: #today = '' # Else, today_fmt is used as the format for a strftime call. #today_fmt = '%B %d, %Y' # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. exclude_patterns = [] # The reST default role (used for this markup: `text`) to use for all # documents. #default_role = None # If true, '()' will be appended to :func: etc. cross-reference text. #add_function_parentheses = True # If true, the current module name will be prepended to all description # unit titles (such as .. function::). #add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. #show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # A list of ignored prefixes for module index sorting. #modindex_common_prefix = [] # If true, keep warnings as "system message" paragraphs in the built documents. #keep_warnings = False # -- Options for HTML output ---------------------------------------------- # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. 
html_theme = 'default' # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. #html_theme_options = {} # Add any paths that contain custom themes here, relative to this directory. #html_theme_path = [] # The name for this set of Sphinx documents. If None, it defaults to # " v documentation". #html_title = None # A shorter title for the navigation bar. Default is the same as html_title. #html_short_title = None # The name of an image file (relative to this directory) to place at the top # of the sidebar. #html_logo = None # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. #html_favicon = None # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ['_static'] # Add any extra paths that contain custom files (such as robots.txt or # .htaccess) here, relative to this directory. These files are copied # directly to the root of the documentation. #html_extra_path = [] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. #html_last_updated_fmt = '%b %d, %Y' # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. #html_use_smartypants = True # Custom sidebar templates, maps document names to template names. #html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. #html_additional_pages = {} # If false, no module index is generated. #html_domain_indices = True # If false, no index is generated. #html_use_index = True # If true, the index is split into individual pages for each letter. 
#html_split_index = False # If true, links to the reST sources are added to the pages. #html_show_sourcelink = True # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. #html_show_sphinx = True # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. #html_show_copyright = True # If true, an OpenSearch description file will be output, and all pages will # contain a tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. #html_use_opensearch = '' # This is the file name suffix for HTML files (e.g. ".xhtml"). #html_file_suffix = None # Output file base name for HTML help builder. htmlhelp_basename = 'MinHashAlignmentProcessMHAPdoc' # -- Options for LaTeX output --------------------------------------------- latex_elements = { # The paper size ('letterpaper' or 'a4paper'). #'papersize': 'letterpaper', # The font size ('10pt', '11pt' or '12pt'). #'pointsize': '10pt', # Additional stuff for the LaTeX preamble. #'preamble': '', } # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, # author, documentclass [howto, manual, or own class]). latex_documents = [ ('index', 'MinHashAlignmentProcessMHAP.tex', u'MinHash Alignment Process (MHAP) Documentation', u'Sergey Koren and Konstantin Berlin', 'manual'), ] # The name of an image file (relative to this directory) to place at the top of # the title page. #latex_logo = None # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. #latex_use_parts = False # If true, show page references after internal links. #latex_show_pagerefs = False # If true, show URL addresses after external links. #latex_show_urls = False # Documents to append as an appendix to all manuals. #latex_appendices = [] # If false, no module index is generated. 
#latex_domain_indices = True # -- Options for manual page output --------------------------------------- # One entry per manual page. List of tuples # (source start file, name, description, authors, manual section). man_pages = [ ('index', 'minhashalignmentprocessmhap', u'MinHash Alignment Process (MHAP) Documentation', [u'Sergey Koren and Konstantin Berlin'], 1) ] # If true, show URL addresses after external links. #man_show_urls = False # -- Options for Texinfo output ------------------------------------------- # Grouping the document tree into Texinfo files. List of tuples # (source start file, target name, title, author, # dir menu entry, description, category) texinfo_documents = [ ('index', 'MinHashAlignmentProcessMHAP', u'MinHash Alignment Process (MHAP) Documentation', u'Sergey Koren and Konstantin Berlin', 'MinHashAlignmentProcessMHAP', 'One line description of project.', 'Miscellaneous'), ] # Documents to append as an appendix to all manuals. #texinfo_appendices = [] # If false, no module index is generated. #texinfo_domain_indices = True # How to display URL addresses: 'footnote', 'no', or 'inline'. #texinfo_show_urls = 'footnote' # If true, do not generate a @detailmenu in the "Top" node's menu. #texinfo_no_detailmenu = False MHAP-2.1.1/docs/source/contact.rst000066400000000000000000000012261277502137000166620ustar00rootroot00000000000000############ Contact ############ Bugs, feature requests, comments: ================================ If you encounter any problems/bugs, please check the known issues pages:: https://github.com/marbl/MHAP/issues If not, please report the issue either using the contact information below or by submitting a new issue online. 
Please include information on your run::

    1) any output produced by MHAP
    2) sample data, if possible, to reproduce the issue

Who to contact to report bugs, forward complaints, feature requests:

Konstantin Berlin: kberlin@gmail.com
------------------------------------

Sergey Koren: sergek@umd.edu
----------------------------

MHAP-2.1.1/docs/source/index.rst000066400000000000000000000015501277502137000163360ustar00rootroot00000000000000
=============================================================================
MinHash Alignment Process (MHAP): a probabilistic sequence overlap algorithm.
=============================================================================

=================
Overview
=================

MHAP (pronounced MAP) is a reference implementation of a probabilistic sequence overlapping algorithm. It is designed to efficiently detect all overlaps between noisy long-read sequence data. It efficiently estimates Jaccard similarity by compressing sequences to their representative fingerprints composed of min-mers (minimum k-mers).

MHAP is included within the `Canu `_ assembler. Canu can be downloaded `here `_.

Contents:

.. toctree::
   :maxdepth: 2

   installation
   quickstart
   utilities
   contact

MHAP-2.1.1/docs/source/installation.rst000066400000000000000000000053031277502137000177300ustar00rootroot00000000000000
############
Installation
############

Before you start
=================

MHAP requires a recent version of the `JVM `_ (1.8u6+). JDK 1.7 or earlier will not work. If you would like to build the code from source, you need to have the `JDK `_ and the `Maven `_ build system available.

Prerequisites
==============

* java (1.8u6+)
* maven (3.0+)

If you have not already installed the dependencies using maven, you will need an internet connection to do so during maven installation.

Here is a list of currently supported Operating Systems:

1. Mac OSX (10.7 or newer)
2. Linux 64-bit (tested on CentOS, Fedora, RedHat, OpenSUSE and Ubuntu)
3.
Windows (XP or newer)

Installation
======================

Pre-compiled
-----------------

The pre-compiled version is recommended for users who want to run MHAP without doing development.

To download a pre-compiled tar run:

.. code-block:: bash

    $ wget https://github.com/marbl/MHAP/releases/download/v2.1.1/mhap-2.1.1.tar.gz

If ``wget`` is not available, you can use ``curl`` instead:

.. code-block:: bash

    $ curl -L https://github.com/marbl/MHAP/releases/download/v2.1.1/mhap-2.1.1.tar.gz > mhap-2.1.1.tar.gz

Then run

.. code-block:: bash

    $ tar xvzf mhap-2.1.1.tar.gz

Source
-----------------

To build the code from the release:

.. code-block:: bash

    $ wget https://github.com/marbl/MHAP/archive/v2.1.1.zip

If you see a certificate not trusted error, you can add the following option to wget:

.. code-block:: bash

    $ --no-check-certificate

If ``wget`` is not available, you can use ``curl`` instead:

.. code-block:: bash

    $ curl -L https://github.com/marbl/MHAP/archive/v2.1.1.zip > v2.1.1.zip

You can also browse https://github.com/marbl/MHAP/tree/v2.1.1 and click on Downloads.

Once downloaded, extract to unpack:

.. code-block:: bash

    $ unzip v2.1.1.zip

Change to the MHAP directory:

.. code-block:: bash

    $ cd MHAP-2.1.1

Once inside the directory, run:

.. code-block:: bash

    $ mvn install

This will compile the program and create a target/mhap-2.1.1.jar file which you can use to run MHAP. The quick-start instructions assume you are in the target directory when running the program. You can also use the target/mhap-2.1.1.jar file to copy MHAP to a different system or directory.

If you would like to run the `validation utilities `_ you must also download and build the `SSW Library `_. Follow the instructions on the `utilities `_ page.
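
The Java requirement listed under Prerequisites can also be checked from Java itself before running MHAP. The sketch below is not part of MHAP: the class name ``JavaVersionCheck`` and the helper ``majorVersion`` are hypothetical. It reads the standard ``java.specification.version`` system property, which is ``1.8`` on Java 8 and a plain number such as ``11`` on later releases. It does not inspect the update level.

```java
// Hypothetical helper, not shipped with MHAP: reports whether the running JVM is Java 8 or newer.
final class JavaVersionCheck {
    // Convert a specification version such as "1.8" or "11" to its major number (8 or 11).
    static int majorVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        if (parts[0].equals("1") && parts.length > 1) {
            return Integer.parseInt(parts[1]); // legacy "1.x" scheme used up to Java 8
        }
        return Integer.parseInt(parts[0]); // "9", "11", "17", ...
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        int major = majorVersion(spec);
        if (major < 8) {
            System.err.println("Java " + spec + " is too old for MHAP; Java 8 or newer is required.");
            System.exit(1);
        }
        System.out.println("Java " + spec + " meets the Java 8 requirement.");
    }
}
```

Running ``java -version`` gives the same information from the shell, including the update level that this sketch ignores.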
MHAP-2.1.1/docs/source/quickstart.rst000066400000000000000000000173401277502137000174250ustar00rootroot00000000000000
############
Quick Start
############

Running MHAP
-----------------

MHAP prints its command-line documentation when run without parameters. Assuming you have followed the `installation instructions `_, you can run:

.. code-block:: bash

    $ java -jar mhap-2.1.1.jar

MHAP has two main usage modes. The first finds all overlaps between the input sequences. The second only constructs an index which can be subsequently reused.

Finding overlaps
-----------------

.. code-block:: bash

    $ java -Xmx32g -server -jar mhap-2.1.1.jar -s [-q] [-f]

Both the -s and -q options can accept either FastA sequences or binary dat files (generated as described below). The -q option can accept either a file or a directory, in which case all FastA/dat files in the specified directory will be used. By default, only the sequences specified by -s are indexed and the sequences in -q are streamed against the constructed index. Generally, 32GB of RAM is sufficient to index 40K sequences. If you have more sequences, you can partition your data and run MHAP on the partitions. You can also increase the memory MHAP is allowed to use by changing the Xmx parameter to a larger limit.

The optional -f flag provides a file of repetitive k-mers which should be biased against being selected as min-mers. The file is a two-column tab-delimited input specifying the k-mer and the fraction of total k-mers that the k-mer comprises. For example:

.. code-block:: bash

    $ head kmers.ignore
    464
    GGGGGGGGGGGGG 0.0005

means the k-mer GGGGGGGGGGGGG represents 0.05% of the k-mers in the dataset (so if there are 100,000 total k-mers, it occurs 50 times). The first line specifies the total number of k-mer entries in the file.

It is also possible to use the k-mer list as a positive selection as was used in `Carvalho et al. `_. Specify the k-mer list as above and the flag:

..
code-block:: bash

    --supress-noise 2

which will not allow any k-mer that is not in the input file to be a min-mer. The k-mers above --filter-threshold will be ignored as repeats.

.. code-block:: bash

    --supress-noise 1

will downweight any k-mer not in the input file to bias against its selection as a min-mer. The k-mers above --filter-threshold will be downweighted as repeats.

Constructing binary index
-------------------------

.. code-block:: bash

    $ java -Xmx32g -server -jar mhap-2.1.1.jar -p -q [-f]

In this use case, files in the -p directory will be converted to binary sketch files in the -q directory. Subsequent runs using these files (instead of FastA files) will be faster as the sequences no longer need to be sketched, only loaded into memory.

Output
-----------------

MHAP outputs overlaps in a format similar to BLASR's M4 format. Example output::

    [A ID] [B ID] [% error] [# shared min-mers] [0=A fwd, 1=A rc] [A start] [A end] [A length] [0=B fwd, 1=B rc] [B start] [B end] [B length]

An example of output from a small dataset is below::

    155 11 0.164156 206 0 69 1693 1704 0 1208 2831 5871
    155 15 0.157788 163 0 16 1041 1704 1 67 1088 2935
    155 27 0.185483 159 0 455 1678 1704 0 0 1225 1862

In this case sequence 155 overlaps 11, 15, and 27. The error percent is computed from the Jaccard estimate using `mash distance `_.

Options
-----------------

The full list of options is available via command-line help (--help or -h). Below is a list of commonly used options.

Usage 1 (direct execution): java -server -Xmx -jar -s [-q] [-f]

Usage 2 (generate precomputed binaries): java -server -Xmx -jar -p -q [-f]

--filter-threshold, default = 1.0E-5 [double], the cutoff at which the k-mer in the k-mer filter file is considered repetitive. This value for a specific k-mer is specified in the second column in the filter file. If no filter file is provided, this option is ignored.

--help, default = false Displays the help menu.
--max-shift, default = 0.2 [double], region size to the left and right of the estimated overlap, as derived from the median shift and sequence length, where k-mer matches are still considered valid. Second stage filter only.

--min-olap-length, default = 116 [int], The minimum length of the read that is used for overlapping. Used to filter out short reads from FASTA file.

--min-store-length, default = 0 [int], The minimum length of the read that is stored in the box. Used to filter out short reads from FASTA file.

--no-self, default = false Do not compute the overlaps between sequences inside a box. Should be used when the to and from sequences are coming from different files.

--no-tf, default = false Do not perform the tf weighting, in the tf-idf weighting.

--num-hashes, default = 512 [int], number of min-mers to be used in MinHashing.

--num-min-matches, default = 3 [int], minimum number of min-mers that must be shared before computing second stage filter. Any sequences below that value are considered non-overlapping.

--num-threads, default = 8 [int], number of threads to use for computation. Typically set to #cores.

--ordered-kmer-size, default = 12 [int] The size of k-mers used in the ordered second stage filter.

--ordered-sketch-size, default = 1536 [int] The sketch size for second stage filter.

--repeat-idf-scale, default = 3.0 [double] The upper range of the idf (from tf-idf) scale. The full scale will be [1,X], where X is the parameter.

--repeat-weight, default = 0.9 [double] Repeat suppression strength for tf-idf weighting. <0.0 do unweighted MinHash (version 1.0), >=1.0 do only the tf weighting. To perform no idf weighting, do not supply the -f option.

--settings, default = 0 Set all unset parameters for the default settings. Same defaults are applied to Nanopore and Pacbio reads. 0) None, 1) Default, 2) Fast, 3) Sensitive.

--store-full-id, default = false Store full IDs as seen in FASTA file, rather than storing just the sequence position in the file.
Some FASTA files have long IDs, slowing output of results. This option is ignored when using compressed file format.

--supress-noise, default = 0 [int] 0) Does nothing, 1) completely removes any k-mers not specified in the filter file, 2) suppresses k-mers not specified in the filter file, similar to repeats.

--threshold, default = 0.78 [double], the threshold cutoff for the second stage sort-merge filter. This is based on the identity score computed from the Jaccard distance of k-mers (size given by ordered-kmer-size) in the overlapping regions.

--version, default = false Displays the version and build time.

-f, default = "" k-mer filter file used for filtering out highly repetitive k-mers. Must be sorted in descending order of frequency (second column).

-h, default = false Displays the help menu.

-k, default = 16 [int], k-mer size used for MinHashing. The k-mer size for second stage filter is separate, and cannot be modified.

-p, default = "" Usage 2 only. The directory containing FASTA files that should be converted to binary format for storage.

-q, default = "" Usage 1: The FASTA file of reads, or a directory of files, that will be compared to the set of reads in the box (see -s). Usage 2: The output directory for the binary formatted dat files.

-s, default = "" Usage 1 only. The FASTA or binary dat file (see Usage 2) of reads that will be stored in a box, and that all subsequent reads will be compared to.

MHAP-2.1.1/docs/source/utilities.rst000066400000000000000000000100721277502137000172410ustar00rootroot00000000000000
############
Utilities
############

Using MHAP extras
-----------------

In addition to the main overlapping algorithm, MHAP includes several utilities for validating overlaps and simulating data.

Validating overlaps
-------------------

Assuming you have a mapping of sequences to a truth (such as a reference genome) in BLASR's M4 format, you can validate overlaps using MHAP's EstimateROC utility which will compute PPV/Sensitivity/Specificity:

..
code-block:: bash

    $ java -cp mhap-2.1.1.jar edu.umd.marbl.mhap.main.EstimateROC [minimum overlap length to evaluate] [number of random trials] [use dynamic programming] [verbose] [minimum identity of overlap] [maximum difference between expected overlap and reported] [load all overlaps]

The default minimum overlap length is 2000 and default number of trials is 10000. This will estimate sensitivity/specificity to within 1%. The number of trials can be increased at the expense of runtime. Specifying 0 will examine all possible N^2 overlap pairs.

The dynamic programming flag (true/false) will check overlaps not present in the reference mapping by running a Smith-Waterman alignment to identify the overlap specified. This step requires the `SSW Library `_ to be separately compiled and installed:

.. code-block:: bash

    $ wget https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library/archive/master.zip
    $ unzip master.zip && cd Complete-Striped-Smith-Waterman-Library-master/src
    $ make all
    $ cd /full/path/to/mhap/target/lib
    $ ln -s /full/path/to/Complete-Striped-Smith-Waterman-Library-master/src/libsswjni.so

The verbose flag (true/false) enables logging to report true overlaps missing from the result and false-positives where no alignments could be found matching the required thresholds.

The minimum identity of the overlap (0.7 by default) is the lower bound for the sensitivity of an overlapper to evaluate. It is used to select matches to the reference that could be found by the overlapper. It is also used to threshold the minimum identity found by the Smith-Waterman alignment above.

The load all overlaps flag (true/false) will evaluate the specificity and PPV on all overlaps reported by the overlapper if enabled, not only those for good reads (where both reads were mapped to the reference in the truth set).

Simulating Data
-----------------

MHAP includes a tool to simulate sequencing data with random error as well as estimate Jaccard similarity for the simulated data.

..
code-block:: bash $ java -cp mhap-2.1.1.jar edu.umd.marbl.mhap.main.KmerStatSimulator <# sequences> [reference genome] The error rates must be between 0 and 1 and are additive. Specifying 10% insertion, 2% deletion, and 1% substitution will result in sequences with a 13% error rate. If no reference sequence is given, completely random sequences are generated and errors added. Otherwise, random sequences are drawn from the reference and errors added. Errors are added randomly with no bias. .. code-block:: bash $ java -cp mhap-2.1.1.jar edu.umd.marbl.mhap.main.KmerStatSimulator <# trials> [one-sided error] [reference genome] [kmer filter] This usage will output a distribution of Jaccard similarity between a pair of overlapping sequences with the specified error rate (when using the specified k-mer size) and two random sequences of the same length. If no reference sequence is given, completely random sequences are generated and errors added, otherwise sequences are drawn from the reference. When one-sided error is specified (by typing true for the parameter), only one of the two sequences will have error simulated, matching a mapping of a noisy sequence to a reference. If a set of k-mers for filtering is given, they are excluded when computing Jaccard similarity, both between random and overlapping sequences. 
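
The Jaccard similarity of two sequences is the number of distinct k-mers they share divided by the number of distinct k-mers found in either. The following is a minimal sketch of that exact computation on short strings, assuming plain string k-mers and no k-mer filter. It is not the code of KmerStatSimulator, and the names ``KmerJaccardSketch``, ``kmers`` and ``jaccard`` are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: exact Jaccard similarity of k-mer sets, not MHAP's MinHash estimate.
final class KmerJaccardSketch {
    // Collect the distinct k-mers (substrings of length k) of a sequence.
    static Set<String> kmers(String seq, int k) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            out.add(seq.substring(i, i + k));
        }
        return out;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention chosen here for two empty sets
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        int union = a.size() + b.size() - shared.size();
        return (double) shared.size() / union;
    }

    public static void main(String[] args) {
        int k = 4;
        String original = "ACGTACGT";
        String withError = "ACGTTCGT"; // one substitution relative to the original
        System.out.println(jaccard(kmers(original, k), kmers(withError, k)));
    }
}
```

A single substitution changes every k-mer that spans it, which is why Jaccard similarity falls quickly as the error rate or the k-mer size grows.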
MHAP-2.1.1/pom.xml000066400000000000000000000100461277502137000135620ustar00rootroot00000000000000 4.0.0 mhap mhap 2.1.1 MinHash Alignment Process src/main/java **/*.java org.apache.maven.plugins maven-compiler-plugin 3.1 1.8 1.8 org.apache.maven.plugins maven-dependency-plugin package copy-dependencies ${project.build.directory}/lib org.apache.maven.plugins maven-shade-plugin 2.4.2 install shade true edu.umd.marbl.mhap.main.MhapMain *:* META-INF/*.SF META-INF/*.DSA META-INF/*.RSA org.apache.maven.plugins maven-jar-plugin 2.4 true true lib edu.umd.marbl.mhap.main.MhapMain maven-compiler-plugin 3.3 in-project In Project Repo file://${project.basedir}/lib it.unimi.dsi fastutil 7.0.12 org.apache.commons commons-compress 1.11 com.google.guava guava 19.0 com.jaligner jaligner 1.0 com.ssw ssw 1.0 https://github.com/marbl/MHAP MinHash alignment process (MHAP pronounced MAP): locality sensitive hashing to detect overlaps and utilities. MHAP-2.1.1/src/000077500000000000000000000000001277502137000130335ustar00rootroot00000000000000MHAP-2.1.1/src/main/000077500000000000000000000000001277502137000137575ustar00rootroot00000000000000MHAP-2.1.1/src/main/java/000077500000000000000000000000001277502137000147005ustar00rootroot00000000000000MHAP-2.1.1/src/main/java/edu/000077500000000000000000000000001277502137000154555ustar00rootroot00000000000000MHAP-2.1.1/src/main/java/edu/umd/000077500000000000000000000000001277502137000162425ustar00rootroot00000000000000MHAP-2.1.1/src/main/java/edu/umd/marbl/000077500000000000000000000000001277502137000173375ustar00rootroot00000000000000MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/000077500000000000000000000000001277502137000202645ustar00rootroot00000000000000MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/align/000077500000000000000000000000001277502137000213565ustar00rootroot00000000000000MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/align/AlignElement.java000066400000000000000000000030261277502137000245660ustar00rootroot00000000000000/* * MHAP 
package
 *
 * This software is distributed "as is", without any warranty, including
 * any implied warranty of merchantability or fitness for a particular
 * use. The authors assume no responsibility for, and shall not be liable
 * for, any special, indirect, or consequential damages, or any damages
 * whatsoever, arising out of or in connection with the use of this
 * software.
 *
 * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */
package edu.umd.marbl.mhap.align;

public interface AlignElement<S extends AlignElement<S>>
{
	public int length();

	public double similarityScore(S e, int i, int j);

	@Override
	public String toString();

	public String toString(int i);

	public String toString(S match, int i, int j);
}
MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/align/AlignElementDoubleSketch.java000066400000000000000000000110671277502137000270670ustar00rootroot00000000000000/*
 * MHAP package
 *
 * This software is distributed "as is", without any warranty, including
 * any implied warranty of merchantability or fitness for a particular
 * use.
The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
* */ package edu.umd.marbl.mhap.align; import edu.umd.marbl.mhap.impl.OverlapInfo; import edu.umd.marbl.mhap.sketch.Sketch; public final class AlignElementDoubleSketch> implements AlignElement> { private final T[] elements; private final int seqLength; private final int stepSize; public AlignElementDoubleSketch(T[] sketchArray, int stepSize, int seqLength) { this.elements = sketchArray; this.stepSize = stepSize; this.seqLength = seqLength; } public OverlapInfo getOverlapInfo(Aligner> aligner, AlignElementDoubleSketch b) { Alignment> alignment = localAlignOneSkip(aligner, b); int a1 = alignment.getA1()*2; int a2 = alignment.getA2()*2; int b1 = alignment.getB1()*2; int b2 = alignment.getB2()*2; if (alignment.getScore()<0.0) return new OverlapInfo(0.0, 0.0, a1, a2, b1, b2); int offsetStart = similarityOffset(b, alignment.getA1(), alignment.getB1()); int offsetEnd = similarityOffset(b, alignment.getA2(), alignment.getB2()); if (offsetStart>0) a1++; else if (offsetStart<0) b1++; if (offsetEnd>0) a2++; else if (offsetEnd<0) b2++; a1 = a1*this.stepSize; a2 = Math.min(getSequenceLength()-1, (a2*this.stepSize+this.stepSize-1)); b1 = b1*b.stepSize; b2 = Math.min(b.getSequenceLength()-1, (b2*b.stepSize+b.stepSize-1)); double score = alignment.getScore(); //int overlapSize = Math.max(a2-a1, b2-b1); //if (overlapSize<2000) // return new OverlapInfo(0.0, 0.0, 0, 0, 0, 0); //double relOverlapSize = (double)overlapSize/(double)this.stepSize; //score = score/relOverlapSize; return new OverlapInfo(score/100000.0, score, a1, a2, b1, b2); } public int getSequenceLength() { return this.seqLength; } public T getSketch(int index) { return this.elements[index]; } public int getStepSize() { return this.stepSize; } @Override public int length() { int val = this.elements.length/2; if (this.elements.length%2!=0) val++; return val; } public Alignment> localAlignOneSkip(Aligner> aligner, AlignElementDoubleSketch b) { return aligner.localAlignOneSkip(this, b); } @Override public double 
similarityScore(AlignElementDoubleSketch e, int i, int j) { double max = this.elements[2*i].similarity(e.elements[2*j]); if ((2*i+1) e, int i, int j) { double max = this.elements[2*i].similarity(e.elements[2*j]); int diff = 0; if ((2*i+1) match, int i, int j) { return toString(); } @Override public String toString(int i) { return this.elements[i].toString(); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/align/AlignElementSketch.java000066400000000000000000000064471277502137000257420ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
* */ package edu.umd.marbl.mhap.align; import edu.umd.marbl.mhap.impl.OverlapInfo; import edu.umd.marbl.mhap.sketch.Sketch; public final class AlignElementSketch> implements AlignElement> { private final T[] elements; private final int seqLength; private final int stepSize; public AlignElementSketch(T[] sketchArray, int stepSize, int seqLength) { this.elements = sketchArray; this.stepSize = stepSize; this.seqLength = seqLength; } public OverlapInfo getOverlapInfo(Aligner> aligner, AlignElementSketch b) { Alignment> alignment = localAlignOneSkip(aligner, b); int a1 = alignment.getA1(); int a2 = alignment.getA2(); int b1 = alignment.getB1(); int b2 = alignment.getB2(); a1 = alignment.getA1()*this.stepSize; a2 = Math.min(getSequenceLength()-1, alignment.getA2()*this.stepSize+this.stepSize-1); b1 = alignment.getB1()*b.stepSize; b2 = Math.min(b.getSequenceLength()-1, alignment.getB2()*b.stepSize+b.stepSize-1); double score = alignment.getScore(); //int overlapSize = Math.max(a2-a1, b2-b1); //double relOverlapSize = (double)overlapSize/(double)this.stepSize; //score = score/relOverlapSize; return new OverlapInfo(score/100000.0, score, a1, a2, b1, b2); } public int getSequenceLength() { return this.seqLength; } public T getSketch(int index) { return this.elements[index]; } public int getStepSize() { return this.stepSize; } @Override public int length() { return this.elements.length; } public Alignment> localAlignOneSkip(Aligner> aligner, AlignElementSketch b) { return aligner.localAlignOneSkip(this, b); } @Override public double similarityScore(AlignElementSketch e, int i, int j) { return this.elements[i].similarity(e.elements[j]); } @Override public String toString(AlignElementSketch match, int i, int j) { return toString(); } @Override public String toString(int i) { return this.elements[i].toString(); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/align/AlignElementString.java000066400000000000000000000040661277502137000257620ustar00rootroot00000000000000/* * MHAP 
package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
* */ package edu.umd.marbl.mhap.align; public class AlignElementString implements AlignElement { private final double EXACT_MATCH_SCORE = 1.0; private final double MISMATCH_SCORE = -1.0; private final char[] s; public AlignElementString(String s) { this.s = s.toCharArray(); } @Override public int length() { return s.length; } @Override public double similarityScore(AlignElementString e, int i, int j) { if (this.s[i]==e.s[j]) return EXACT_MATCH_SCORE; else return MISMATCH_SCORE; } /* (non-Javadoc) * @see java.lang.Object#toString() */ @Override public String toString() { return new String(s); } @Override public String toString(AlignElementString match, int i, int j) { return ""+s[i]; } @Override public String toString(int i) { return ""+s[i]; } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/align/Aligner.java000066400000000000000000000176631277502137000236170ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. 
You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.align; import java.util.ArrayList; import java.util.Collections; import edu.umd.marbl.mhap.align.Alignment.Operation; public final class Aligner> { private final float gapOpen; private final float gapExtend; private final boolean storePath; private final float scoreOffset; public Aligner(boolean storePath, double gapOpen, double gapExtend, double scoreOffset) { this.gapOpen = (float)gapOpen; this.gapExtend = (float)gapExtend; this.storePath = storePath; this.scoreOffset = (float)scoreOffset; } /* public Alignment localAlignSmithWater(S a, S b) { if (a.length()==0 && b.length()==0) return null; else if (a.length()==0 || b.length()==0) return null; float[][] scores = new float[a.length()+1][b.length()+1]; for (int i=1; i<=a.length(); i++) for (int j=1; j<=b.length(); j++) { float hNext = scores[i-1][j-1]+Math.min(0.0f, (float)a.similarityScore(b, i-1, j-1)); float hDeletion = scores[i-1][j]+this.gapOpen; float hInsertion = scores[i][j-1]+this.gapOpen; //adjustments for end //if (i==a.length()) // hInsertion = hInsertion-this.gapOpen; //if (j==b.length()) // hDeletion = hDeletion-this.gapOpen; float value = Math.max(Math.max(Math.max(0.0f, hNext), hDeletion), hInsertion); scores[i][j] = value; } double bestValue = scores[a.length()-1][b.length()-1]; double score = bestValue/(double)Math.max(a.length(), b.length()); //if (a.length()<500) // System.err.println(edu.umd.marbl.mhap.utils.Utils.toString(scores)); if (storePath) { //figure out the path ArrayList backOperations = new ArrayList<>(a.length()+b.length()); int i = a.length(); int j = 
b.length(); while (i>0 || j>0) { if (i==0) { backOperations.add(Operation.DELETE); j--; } else if (j==0) { backOperations.add(Operation.INSERT); i--; } else if (scores[i-1][j-1]>=scores[i-1][j] && scores[i-1][j-1]>=scores[i][j-1]) { backOperations.add(Operation.MATCH); i--; j--; } else if (scores[i-1][j]>=scores[i-1][j-1]) { backOperations.add(Operation.INSERT); i--; } else { backOperations.add(Operation.DELETE); j--; } } return new Alignment(a, b, score, this.gapOpen, backOperations); } return new Alignment(a, b, score, this.gapOpen, null); } */ public Alignment localAlignSmithWaterGotoh(S a, S b) { float[][] D = new float[a.length()+1][b.length()+1]; float[][] P = new float[a.length()+1][b.length()+1]; float[][] Q = new float[a.length()+1][b.length()+1]; for (int i=1; i<=a.length(); i++) { D[i][0] = 0.0f; P[i][0] = Float.NEGATIVE_INFINITY; Q[i][0] = Float.NEGATIVE_INFINITY; } for (int j=1; j<=b.length(); j++) { D[0][j] = 0.0f; P[0][j] = Float.NEGATIVE_INFINITY; Q[0][j] = Float.NEGATIVE_INFINITY; } float maxValue = 0.0f; int maxI = 0; int maxJ = 0; for (int i=1; i<=a.length(); i++) { for (int j=1; j<=b.length(); j++) { P[i][j] = Math.max(D[i-1][j]+this.gapOpen, P[i-1][j]+this.gapExtend); Q[i][j] = Math.max(D[i][j-1]+this.gapOpen, Q[i][j-1]+this.gapExtend); float score = D[i-1][j-1]+(float)a.similarityScore(b, i-1, j-1)+this.scoreOffset; //compute the actual score D[i][j] = Math.max(score, Math.max(P[i][j], Q[i][j])); if (D[i][j] > maxValue) { maxValue = D[i][j]; maxI = i; maxJ = j; } } } float score = maxValue; int a1 = 0; int a2 = Math.max(0, maxI-1); int b1 = 0; int b2 = Math.max(0, maxJ-1); if (storePath) { //figure out the path ArrayList backOperations = new ArrayList<>(a.length()+b.length()); int i = maxI; int j = maxJ; while (i>0 && j>0) { if ((P[i][j]>=Q[i][j] && P[i][j]==D[i][j]) || j==0) { backOperations.add(Operation.DELETE); i--; } else if (Q[i][j]==D[i][j] || i==0) { backOperations.add(Operation.INSERT); j--; } else { 
backOperations.add(Operation.MATCH); i--; j--; } } a1 = i; b1 = j; while (i > 0) { backOperations.add(Operation.DELETE); i--; } //reverse the direction Collections.reverse(backOperations); return new Alignment(a, b, a1, a2, b1, b2, score, this.gapOpen, backOperations); } return new Alignment(a, b, a1, a2, b1, b2, score, this.gapOpen, null); } public Alignment localAlignOneSkip(S a, S b) { float[][] D = new float[a.length()+1][b.length()+1]; float[][] P = new float[a.length()+1][b.length()+1]; float[][] S = new float[a.length()+1][b.length()+1]; float maxValue = 0.0f; int maxI = 0; int maxJ = 0; for (int i=1; i<=a.length(); i++) { for (int j=1; j<=b.length(); j++) { float sim = (float)a.similarityScore(b, i-1, j-1)+this.scoreOffset; P[i][j] = Math.max(D[i-1][j]+this.gapOpen, D[i][j-1]+this.gapOpen); D[i][j] = S[i-1][j-1]+sim; S[i][j] = Math.max(P[i][j], D[i][j]); if (i==a.length()) S[i][j] = Math.max(S[i][j], S[i][j-1]); if (j==b.length()) S[i][j] = Math.max(S[i][j], S[i-1][j]); if (S[i][j] > maxValue && (i==a.length() || j==b.length())) { maxValue = S[i][j]; maxI = i; maxJ = j; } } } float score = maxValue; int a1 = 0; int a2 = Math.max(0, maxI-1); int b1 = 0; int b2 = Math.max(0, maxJ-1); if (this.storePath) { //figure out the path ArrayList backOperations = new ArrayList<>(a.length()+b.length()); int i = maxI; int j = maxJ; while (i>0 && j>0) { if (S[i][j]==D[i-1][j]+this.gapOpen) { backOperations.add(Operation.DELETE); i--; } else if (S[i][j]==D[i][j-1]+this.gapOpen) { backOperations.add(Operation.INSERT); j--; } else { backOperations.add(Operation.MATCH); i--; j--; } } a1 = i; b1 = j; while (i > 0) { backOperations.add(Operation.DELETE); i--; } while (j > 0) { backOperations.add(Operation.INSERT); j--; } //reverse the direction Collections.reverse(backOperations); return new Alignment(a, b, a1, a2, b1, b2, score, this.gapOpen, backOperations); } else { int i = maxI; int j = maxJ; while (i>0 && j>0) { if (S[i-1][j]>S[i][j-1] && S[i-1][j]>S[i-1][j-1]) { i--; } 
else if (S[i][j-1]>S[i-1][j-1]) { j--; } else { i--; j--; } } a1 = i; b1 = j; return new Alignment(a, b, a1, a2, b1, b2, score, this.gapOpen, null); } } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/align/Alignment.java000066400000000000000000000124771277502137000241520ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
* */ package edu.umd.marbl.mhap.align; import java.util.Iterator; import java.util.List; public final class Alignment> { public enum Operation { MATCH, INSERT, DELETE; } private final double score; //private final double gapOpen; private final List operations; private final S a; private final S b; private int a1; private int a2; private int b1; private int b2; protected Alignment(S a, S b, int a1, int a2, int b1, int b2, double score, double gapOpen, List operations) { this.score = score; this.operations = operations; this.a = a; this.b = b; this.a1 = a1; this.a2 = a2; this.b1 = b1; this.b2 = b2; //this.gapOpen = gapOpen; } public double getOverlapScore(int minMatches) { int i = 0; int j = 0; Iterator iter = this.operations.iterator(); if (!iter.hasNext()) return 0.0; //remove the start Operation o = iter.next(); while (o==Operation.DELETE) { i++; if (iter.hasNext()) o = iter.next(); else return 0.0; } if (i==0) { while (o==Operation.INSERT) { if (iter.hasNext()) o = iter.next(); else return 0.0; } } double score = 0.0; int count = 0; while (true) { if (o == Operation.DELETE) { i++; } else if (o == Operation.INSERT) { //count++; j++; } else { score = score + a.similarityScore(b, i, j); count++; i++; j++; } if (iter.hasNext()) o = iter.next(); else break; } //System.err.println(this.operations); //System.err.println("HI="+count+" "+minMatches); if (count getOperations() { return this.operations; } public double getScore() { return this.score; } public int getA1() { return this.a1; } public int getA2() { return this.a2; } public int getB1() { return this.b1; } public int getB2() { return this.b2; } public String outputAlignmentSelf() { StringBuilder str = new StringBuilder(); int i = 0; int j = 0; int count = 0; while(i=b.length()) aStr = a.toString(i); else aStr = a.toString(b, i, j); str.append(aStr); i++; } else if (o == Operation.INSERT) { String bStr; if (i>=a.length()) bStr = b.toString(j); else bStr = b.toString(a, j, i); for (int space=0; space=a.length()) 
bStr = b.toString(j); else bStr = b.toString(a, j, i); str.append(bStr); j++; } else if (o == Operation.DELETE) { String aStr; if (j>=b.length()) aStr = a.toString(i); else aStr = a.toString(b, i, j); for (int space=0; space findMatches() { // figure out number of cores ExecutorService execSvc = Executors.newFixedThreadPool(this.numThreads); // allocate the storage and get the list of valeus final ArrayList combinedList = new ArrayList(); final ConcurrentLinkedQueue seqList = new ConcurrentLinkedQueue( getStoredForwardSequenceIds()); // for each thread create a task for (int iter = 0; iter < this.numThreads; iter++) { Runnable task = new Runnable() { @Override public void run() { List localMatches = new ArrayList(); // get next sequence SequenceId nextSequence = seqList.poll(); while (nextSequence != null) { SequenceSketch sequenceHashes = getStoredSequenceHash(nextSequence); // only search the forward sequences localMatches.addAll(findMatches(sequenceHashes, true)); // record search AbstractMatchSearch.this.sequencesSearched.getAndIncrement(); // get next sequence nextSequence = seqList.poll(); // output stored results if (nextSequence == null || localMatches.size() >= NUM_ELEMENTS_PER_OUTPUT) { // count the number of matches AbstractMatchSearch.this.matchesProcessed.getAndAdd(localMatches.size()); if (AbstractMatchSearch.this.storeResults) { // combine the results synchronized (combinedList) { combinedList.addAll(localMatches); } } else outputResults(localMatches); localMatches.clear(); } } } }; // enqueue the task execSvc.execute(task); } // shutdown the service execSvc.shutdown(); try { execSvc.awaitTermination(365L, TimeUnit.DAYS); } catch (InterruptedException e) { execSvc.shutdownNow(); throw new MhapRuntimeException("Unable to finish all tasks."); } flushOutput(); return combinedList; } protected abstract List findMatches(SequenceSketch hashes, boolean toSelf); public ArrayList findMatches(final SequenceSketchStreamer data) throws IOException { // figure 
out number of cores ExecutorService execSvc = Executors.newFixedThreadPool(this.numThreads); // allocate the storage and get the list of values final ArrayList<MatchResult> combinedList = new ArrayList<MatchResult>(); // for each thread create a task for (int iter = 0; iter < this.numThreads; iter++) { Runnable task = new Runnable() { @Override public void run() { List<MatchResult> localMatches = new ArrayList<MatchResult>(); try { ReadBuffer buf = new ReadBuffer(); SequenceSketch sequenceHashes = data.dequeue(true, buf); while (sequenceHashes != null) { // only search the forward sequences localMatches.addAll(findMatches(sequenceHashes, false)); // record search AbstractMatchSearch.this.sequencesSearched.getAndIncrement(); // get the sequence hashes sequenceHashes = data.dequeue(true, buf); // output stored results if (sequenceHashes == null || localMatches.size() >= NUM_ELEMENTS_PER_OUTPUT) { // count the number of matches AbstractMatchSearch.this.matchesProcessed.getAndAdd(localMatches.size()); if (AbstractMatchSearch.this.storeResults) { // combine the results synchronized (combinedList) { combinedList.addAll(localMatches); } } else outputResults(localMatches); localMatches.clear(); } } } catch (IOException e) { throw new MhapRuntimeException(e); } } }; // enqueue the task execSvc.execute(task); } // shutdown the service execSvc.shutdown(); try { execSvc.awaitTermination(365L, TimeUnit.DAYS); } catch (InterruptedException e) { execSvc.shutdownNow(); throw new MhapRuntimeException("Unable to finish all tasks.", e); } flushOutput(); return combinedList; } protected void flushOutput() { try { STD_OUT_BUFFER.flush(); } catch (IOException e) { throw new MhapRuntimeException(e); } } public long getMatchesProcessed() { return this.matchesProcessed.get(); } /** * @return the sequencesSearched */ public long getNumberSequencesSearched() { return this.sequencesSearched.get(); } public abstract List<SequenceId> getStoredForwardSequenceIds(); public abstract SequenceSketch getStoredSequenceHash(SequenceId id); protected void 
outputResults(List<MatchResult> matches) { if (this.storeResults || matches.isEmpty()) return; try { synchronized (STD_OUT_BUFFER) { for (MatchResult currResult : matches) { STD_OUT_BUFFER.write(currResult.toString()); STD_OUT_BUFFER.newLine(); } STD_OUT_BUFFER.flush(); } } catch (IOException e) { throw new MhapRuntimeException(e); } } public abstract int size(); }MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/impl/FastaData.java000066400000000000000000000132611277502137000237230ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
* */ package edu.umd.marbl.mhap.impl; import java.io.BufferedReader; import java.io.IOException; import java.util.Locale; import java.util.concurrent.ConcurrentLinkedQueue; import java.util.concurrent.atomic.AtomicLong; import edu.umd.marbl.mhap.utils.Utils; public class FastaData implements Cloneable { private final BufferedReader fileReader; private final long offset; private String lastLine; private AtomicLong numberProcessed; private boolean readFullFile; // length of sequences loaded private final ConcurrentLinkedQueue sequenceList; private static final String[] fastaSuffix = { "fna", "contigs", "contig", "final", "fasta", "fa" }; private FastaData(ConcurrentLinkedQueue seqList) { this.sequenceList = new ConcurrentLinkedQueue(seqList); this.fileReader = null; this.lastLine = null; this.readFullFile = true; this.numberProcessed = new AtomicLong(this.sequenceList.size()); this.offset = 0; } public FastaData(String file, long offset) throws IOException { try { this.fileReader = Utils.getFile(file, fastaSuffix); } catch (Exception e) { throw new MhapRuntimeException(e); } this.offset = offset; this.lastLine = null; this.readFullFile = false; this.numberProcessed = new AtomicLong(0); this.sequenceList = new ConcurrentLinkedQueue(); } /* * (non-Javadoc) * * @see java.lang.Object#clone() */ @Override public synchronized FastaData clone() { // enqueue all the data try { enqueueFullFile(); } catch (IOException e) { throw new MhapRuntimeException(e); } return new FastaData(this.sequenceList); } public Sequence dequeue() throws IOException { Sequence seq; synchronized (this.sequenceList) { if (this.sequenceList.isEmpty()) { enqueueNextSequenceInFile(); } // get the sequence seq = this.sequenceList.poll(); } return seq; } public void enqueueFullFile() throws IOException { while (enqueueNextSequenceInFile()) { } } private boolean enqueueNextSequenceInFile() throws IOException { StringBuilder fastaSeq = new StringBuilder(); String header = null; synchronized 
(this.fileReader) { if (this.readFullFile) return false; // try to read the next line if (this.lastLine == null) { this.lastLine = this.fileReader.readLine(); // there is no next line if (this.lastLine == null) { this.fileReader.close(); this.readFullFile = true; return false; } } // process the header if (!this.lastLine.startsWith(">")) throw new MhapRuntimeException("Next sequence does not start with >. Invalid format."); // process the current header if (SequenceId.STORE_FULL_ID) header = this.lastLine.substring(1).split("[\\s,]+", 2)[0]; //read the first line of the sequence this.lastLine = this.fileReader.readLine(); while (true) { if (this.lastLine!=null && !this.lastLine.startsWith(">")) { // append the last line fastaSeq.append(this.lastLine); this.lastLine = this.fileReader.readLine(); } else if (this.lastLine == null) { this.fileReader.close(); this.readFullFile = true; break; } else break; } } String fastaSeqSring = fastaSeq.toString(); if (!fastaSeqSring.isEmpty()) { long index = this.numberProcessed.incrementAndGet(); //generate sequence id SequenceId id; if (SequenceId.STORE_FULL_ID) id = new SequenceId(index + this.offset, true, header); else id = new SequenceId(index + this.offset); Sequence seq = new Sequence(fastaSeq.toString().toUpperCase(Locale.ENGLISH), id); // enqueue sequence this.sequenceList.add(seq); return true; } else return false; } public int getNumberProcessed() { return this.numberProcessed.intValue(); } public Sequence getSequence(SequenceId id) { if (id.isForward()) { for (Sequence seq : this.sequenceList) if (seq.getId().equals(id)) return seq; } id = id.complimentId(); for (Sequence seq : this.sequenceList) if (seq.getId().equals(id)) return seq.getReverseCompliment(); return null; } public boolean isEmpty() { synchronized (this.fileReader) { return this.sequenceList.isEmpty() && this.readFullFile; } } /* * (non-Javadoc) * * @see java.lang.Object#finalize() */ @Override protected void finalize() throws Throwable { 
super.finalize(); if (this.fileReader != null) this.fileReader.close(); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/impl/MatchResult.java000066400000000000000000000057771277502137000243430ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
* */ package edu.umd.marbl.mhap.impl; public final class MatchResult implements Comparable<MatchResult> { private final SequenceId fromId; private final SequenceId toId; private final int a1; private final int a2; private final int b1; private final int b2; private final double score; private final double rawScore; private final int fromLength; private final int toLength; protected MatchResult(SequenceId fromId, SequenceId toId, OverlapInfo overlap, int fromLength, int toLength) { this.fromId = fromId; this.toId = toId; this.fromLength = fromLength; this.toLength = toLength; this.a1 = getFromId().isForward() ? overlap.a1 : fromLength-overlap.a2-1; this.a2 = getFromId().isForward() ? overlap.a2 : fromLength-overlap.a1-1; this.b1 = getToId().isForward() ? overlap.b1 : toLength-overlap.b2-1; this.b2 = getToId().isForward() ? overlap.b2 : toLength-overlap.b1-1; this.rawScore = overlap.rawScore; if (overlap.score>1.0) this.score = 1.0; else this.score = overlap.score; } /** * @return the fromId */ public SequenceId getFromId() { return this.fromId; } /** * @return the toId */ public SequenceId getToId() { return this.toId; } /** * @return the score */ public double getScore() { return this.score; } @Override public int compareTo(MatchResult o) { return -Double.compare(this.score, o.score); } @Override public String toString() { return String.format("%s %s %.6f %.6f %d %d %d %d %d %d %d %d", getFromId().getHeader(), getToId().getHeader(), 1.0-getScore(), this.rawScore, getFromId().isForward() ? 0 : 1, this.a1, this.a2, this.fromLength, getToId().isForward() ? 0 : 1, this.b1, this.b2, this.toLength); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/impl/MhapRuntimeException.java000066400000000000000000000033371277502137000262060ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. 
The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.impl; public class MhapRuntimeException extends RuntimeException { /** * */ private static final long serialVersionUID = 56387323839744808L; public MhapRuntimeException() { super(); } public MhapRuntimeException(String message, Throwable cause) { super(message, cause); } public MhapRuntimeException(String message) { super(message); } public MhapRuntimeException(Throwable cause) { super(cause); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/impl/MinHashBitSequenceSubSketches.java000066400000000000000000000134371277502137000277230ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. 
The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
* */ package edu.umd.marbl.mhap.impl; import java.io.DataInputStream; import java.io.EOFException; import java.io.IOException; import java.nio.ByteBuffer; import edu.umd.marbl.mhap.align.AlignElementDoubleSketch; import edu.umd.marbl.mhap.align.Aligner; import edu.umd.marbl.mhap.sketch.MinHashBitSketch; import edu.umd.marbl.mhap.sketch.MinHashSketch; import edu.umd.marbl.mhap.sketch.ZeroNGramsFoundException; public final class MinHashBitSequenceSubSketches { private final AlignElementDoubleSketch alignmentSketch; public final static MinHashBitSketch[] computeSequences(String seq, int nGramSize, int stepSize, int numWords) throws ZeroNGramsFoundException { int remainder = seq.length()%stepSize; //get number of sequence int numSequence = (seq.length()-remainder)/stepSize; if (remainder>0) numSequence++; //make sketches out of them int start = 0; MinHashBitSketch[] sequence = new MinHashBitSketch[numSequence]; for (int iter=0; iter=stepSize/2 && remainder>=nGramSize) numSequence++; //make sketches out of them int start = 0; MinHashBitSketch[] sketches = new MinHashBitSketch[numSequence]; for (int iter=0; iter> aligner, MinHashBitSequenceSubSketches b) { return this.alignmentSketch.getOverlapInfo(aligner, b.alignmentSketch); } public final static MinHashBitSequenceSubSketches fromByteStream(DataInputStream input) throws IOException { try { int numSketches = input.readInt(); int numWordsPerSketch = input.readInt(); int stepSize = input.readInt(); int seqLength = input.readInt(); MinHashBitSketch[] sequence = new MinHashBitSketch[numSketches]; for (int iter=0; iter(sketches, stepSize, seqLength); } public MinHashBitSequenceSubSketches(String seq, int kmerSize, int stepSize, int numWords) throws ZeroNGramsFoundException { this.alignmentSketch = new AlignElementDoubleSketch<>(computeSequencesDouble(seq, kmerSize, stepSize, numWords), stepSize, seq.length()); } /* private static int[] union(int[] minHashes1, int[] minHashes2) { int[] newHashes = new int[minHashes1.length]; 
for (int iter=0; iter>> hashes; private final double maxShift; private final AtomicLong minhashSearchTime; private final AtomicLong sortMergeSearchTime; private final int minStoreLength; private final AtomicLong numberElementsProcessed; private final AtomicLong numberSequencesFullyCompared; private final AtomicLong numberSequencesHit; private final AtomicLong numberSequencesMinHashed; private final int numMinMatches; private final Map sequenceVectorsHash; public MinHashSearch(SequenceSketchStreamer data, int numHashes, int numMinMatches, int numThreads, boolean storeResults, int minStoreLength, double maxShift, double acceptScore) throws IOException { super(numThreads, storeResults); this.minStoreLength = minStoreLength; this.numMinMatches = numMinMatches; this.maxShift = maxShift; this.acceptScore = acceptScore; this.numberSequencesHit = new AtomicLong(); this.numberSequencesFullyCompared = new AtomicLong(); this.numberSequencesMinHashed = new AtomicLong(); this.numberElementsProcessed = new AtomicLong(); this.minhashSearchTime = new AtomicLong(); this.sortMergeSearchTime = new AtomicLong(); // enqueue full file, since have to know full size data.enqueueFullFile(false, this.numThreads); //this.sequenceVectorsHash = new HashMap<>(data.getNumberProcessed()); this.sequenceVectorsHash = new Object2ObjectOpenHashMap<>(data.getNumberProcessed()); this.hashes = new ArrayList<>(numHashes); for (int iter = 0; iter < numHashes; iter++) { //Map> map = new HashMap>(data.getNumberProcessed()); Map> map = new Int2ObjectOpenHashMap>(data.getNumberProcessed()); this.hashes.add(map); } addData(data); System.err.println("Stored "+this.sequenceVectorsHash.size()+" sequences in the index."); } @Override public boolean addSequence(SequenceSketch currHash) { int[] currMinHashes = currHash.getMinHashes().getMinHashArray(); if (currMinHashes.length != this.hashes.size()) throw new MhapRuntimeException("Number of MinHashes of the sequence does not match current settings."); // put the 
result into the hashmap synchronized (this.sequenceVectorsHash) { SequenceSketch minHash = this.sequenceVectorsHash.put(currHash.getSequenceId(), currHash); if (minHash != null) { this.sequenceVectorsHash.put(currHash.getSequenceId(), minHash); throw new MhapRuntimeException("Sequence ID already exists in the hash table."); } } // add the hashes int count = 0; SequenceId id = currHash.getSequenceId(); for (Map> hash : this.hashes) { ArrayList currList; final int hashVal = currMinHashes[count]; // get the list synchronized (hash) { currList = hash.computeIfAbsent(hashVal, k-> new ArrayList(2)); } // add the element synchronized (currList) { currList.add(id); } count++; } //increment the counter this.numberSequencesMinHashed.getAndIncrement(); return true; } @Override public List findMatches(SequenceSketch seqHashes, boolean toSelf) { //for performance reasons might need to change long startTime = System.nanoTime(); MinHashSketch minHash = seqHashes.getMinHashes(); if (this.hashes.size() != minHash.numHashes()) throw new MhapRuntimeException("Number of hashes does not match. Stored size " + this.hashes.size() + ", input size " + minHash.numHashes() + "."); Map bestSequenceHit = new Object2ObjectOpenHashMap<>(256); int[] minHashes = minHash.getMinHashArray(); int hashIndex = 0; long additionalProcessed = 0L; for (Map> currHash : this.hashes) { ArrayList currentHashMatchList = currHash.get(minHashes[hashIndex]); // if some matches exist add them if (currentHashMatchList != null) { additionalProcessed += currentHashMatchList.size(); for (SequenceId sequenceId : currentHashMatchList) { bestSequenceHit.compute(sequenceId, (k,v)-> (v==null) ? 
new HitCounter(1) : v.addHit()); } } hashIndex++; } //record the search time long minHashEndTime = System.nanoTime(); this.minhashSearchTime.getAndAdd(minHashEndTime - startTime); //record the procssed statistic this.numberElementsProcessed.getAndAdd(additionalProcessed); this.numberSequencesHit.getAndAdd(bestSequenceHit.size()); // compute the proper counts for all sets and remove below threshold ArrayList matches = new ArrayList(32); for (Entry match : bestSequenceHit.entrySet()) { //get the match id SequenceId matchId = match.getKey(); // do not store matches with smaller ids, unless its coming from a short read if (toSelf && matchId.getHeaderId() == seqHashes.getSequenceId().getHeaderId()) continue; //see if the hit number is high enough if (match.getValue().count >= this.numMinMatches) { SequenceSketch matchedHashes = this.sequenceVectorsHash.get(match.getKey()); if (matchedHashes==null) throw new MhapRuntimeException("Hashes not found for given id."); //never process short to short if (matchedHashes.getSequenceLength() seqHashes.getSequenceId().getHeaderId() && matchedHashes.getSequenceLength()>=this.minStoreLength && seqHashes.getSequenceLength()>=this.minStoreLength) continue; //never do short to long if (toSelf && matchedHashes.getSequenceLength()=this.minStoreLength) continue; //compute the direct hash score OverlapInfo result = seqHashes.getOrderedHashes().getOverlapInfo(matchedHashes.getOrderedHashes(), this.maxShift); boolean accept = result.score >= this.acceptScore; //increment the counter this.numberSequencesFullyCompared.getAndIncrement(); //if score is good add if (accept) { MatchResult currResult = new MatchResult(seqHashes.getSequenceId(), matchId, result, seqHashes.getSequenceLength(), matchedHashes.getSequenceLength()); // add to list matches.add(currResult); } } } //record the search time //TODO not clear why not working. Perhaps everything is too fast? 
long endTime = System.nanoTime(); this.sortMergeSearchTime.getAndAdd(endTime-minHashEndTime); return matches; } public double getMinHashSearchTime() { return this.minhashSearchTime.longValue() * 1.0e-9; } public double getSortMergeTime() { return this.sortMergeSearchTime.longValue() * 1.0e-9; } public long getNumberElementsProcessed() { return this.numberElementsProcessed.get(); } public long getNumberSequenceHashed() { return this.numberSequencesMinHashed.get(); } public long getNumberSequencesFullyCompared() { return this.numberSequencesFullyCompared.get(); } public long getNumberSequencesHit() { return this.numberSequencesHit.get(); } @Override public List<SequenceId> getStoredForwardSequenceIds() { ArrayList<SequenceId> seqIds = new ArrayList<SequenceId>(this.sequenceVectorsHash.size()); for (SequenceSketch hashes : this.sequenceVectorsHash.values()) if (hashes.getSequenceId().isForward()) seqIds.add(hashes.getSequenceId()); return seqIds; } @Override public SequenceSketch getStoredSequenceHash(SequenceId id) { return this.sequenceVectorsHash.get(id); } @Override public int size() { return this.sequenceVectorsHash.size(); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/impl/OverlapInfo.java000066400000000000000000000036611277502137000243220ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. 
* The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.impl; public final class OverlapInfo { public final int a1; public final int a2; public final int b1; public final int b2; public final double rawScore; public final double score; public static OverlapInfo EMPTY = new OverlapInfo(0.0, 0.0, 0, 0, 0, 0); public OverlapInfo(double score, double rawScore, int a1, int a2, int b1, int b2) { this.score = score; this.rawScore = rawScore; this.a1 = a1; this.a2 = a2; this.b1 = b1; this.b2 = b2; } /* (non-Javadoc) * @see java.lang.Object#toString() */ @Override public String toString() { return "[score="+this.score+", a1="+this.a1+" a2="+this.a2+", b1="+this.b1+" b2="+this.b2+"]"; } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/impl/Sequence.java000066400000000000000000000053031277502137000236410ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. 
See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.impl; import edu.umd.marbl.mhap.utils.Utils; public final class Sequence { private final String sequence; private final SequenceId id; public Sequence(int[] sequence, SequenceId id) { this.id = id; StringBuilder s = new StringBuilder(); for (int iter=0; iter"+this.id+"\n"); str.append(this.sequence); return str.toString(); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/impl/SequenceId.java000066400000000000000000000060651277502137000241240ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. 
* The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.impl; import java.io.Serializable; public final class SequenceId implements Serializable { /** * */ private static final long serialVersionUID = 2181572437818064822L; private final long id; private final boolean isFwd; private final String strId; public static boolean STORE_FULL_ID = false; public SequenceId(long id) { this(id, true); } public SequenceId(long id, boolean isFwd) { this.id = id; this.isFwd = isFwd; this.strId = null; } public SequenceId(long id, boolean isFwd, String strId) { this.id = id; this.isFwd = isFwd; this.strId = strId; } public SequenceId createOffset(long offset) { return new SequenceId(this.id+offset, this.isFwd, this.strId); } public SequenceId complimentId() { return new SequenceId(this.id, !this.isFwd, this.strId); } /* (non-Javadoc) * @see java.lang.Object#equals(java.lang.Object) */ @Override public boolean equals(Object obj) { if (this == obj) return true; if (obj == null) return false; if (getClass() != obj.getClass()) return false; SequenceId other = (SequenceId) obj; return (this.id==other.id) && (this.isFwd == other.isFwd); } public boolean isForward() { return this.isFwd; } public long getHeaderId() { return this.id; } public String getHeader() { if (this.strId!=null) return this.strId; return String.valueOf(this.id); } /* (non-Javadoc) * @see java.lang.Object#hashCode() */ @Override public int hashCode() { return this.isFwd? 
(int)this.id : -(int)this.id; } /* (non-Javadoc) * @see java.lang.Object#toString() */ @Override public String toString() { return ""+getHeader()+(this.isFwd ? "(fwd)" : "(rev)"); } public String toStringInt() { return ""+getHeader()+(this.isFwd ? " 1" : " 0"); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketch.java000066400000000000000000000117621277502137000250110ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
* */ package edu.umd.marbl.mhap.impl; import java.io.ByteArrayOutputStream; import java.io.DataInputStream; import java.io.DataOutputStream; import java.io.EOFException; import java.io.IOException; import java.io.Serializable; import edu.umd.marbl.mhap.sketch.FrequencyCounts; import edu.umd.marbl.mhap.sketch.MinHashSketch; import edu.umd.marbl.mhap.sketch.BottomOverlapSketch; import edu.umd.marbl.mhap.sketch.ZeroNGramsFoundException; public final class SequenceSketch implements Serializable { /** * */ private static final long serialVersionUID = -3155689614837922443L; private final SequenceId id; private final MinHashSketch mainHashes; private final BottomOverlapSketch orderedHashes; //private final MinHashBitSequenceSubSketches alignmentSketches; private final int sequenceLength; public final static int BIT_SKETCH_SIZE = 20; public final static int SUBSEQUENCE_SIZE = 50; public final static int BIT_KMER_SIZE = 7; public static SequenceSketch fromByteStream(DataInputStream input, int offset) throws IOException { try { // input. 
// dos.writeBoolean(this.id.isForward()); boolean isFwd = input.readBoolean(); // dos.writeInt(this.id.getHeaderId()); SequenceId id = new SequenceId(input.readLong() + offset, isFwd); //dos.writeInt(this.sequenceLength); int sequenceLength = input.readInt(); // dos.write(this.mainHashes.getAsByteArray()); MinHashSketch mainHashes = MinHashSketch.fromByteStream(input); if (mainHashes == null) throw new MhapRuntimeException("Unexpected data read error."); BottomOverlapSketch orderedHashes = null; orderedHashes = BottomOverlapSketch.fromByteStream(input); if (orderedHashes == null) throw new MhapRuntimeException("Unexpected data read error when reading ordered k-mers."); return new SequenceSketch(id, sequenceLength, mainHashes, orderedHashes); } catch (EOFException e) { return null; } } public SequenceSketch(SequenceId id, int sequenceLength, MinHashSketch mainHashes, BottomOverlapSketch orderedHashes) { this.sequenceLength = sequenceLength; this.id = id; this.mainHashes = mainHashes; this.orderedHashes = orderedHashes; } public SequenceSketch(Sequence seq, int kmerSize, int numHashes, int orderedKmerSize, int orderedSketchSize, FrequencyCounts kmerFilter, double repeatWeight) throws ZeroNGramsFoundException { this.sequenceLength = seq.length(); this.id = seq.getId(); this.mainHashes = new MinHashSketch(seq.getSquenceString(), kmerSize, numHashes, kmerFilter, repeatWeight); this.orderedHashes = new BottomOverlapSketch(seq.getSquenceString(), orderedKmerSize, orderedSketchSize); } public SequenceSketch createOffset(int offset) { return new SequenceSketch(this.id.createOffset(offset), this.sequenceLength, this.mainHashes, this.orderedHashes); } public byte[] getAsByteArray() { byte[] mainHashesBytes = this.mainHashes.getAsByteArray(); byte[] orderedHashesBytes = this.orderedHashes.getAsByteArray(); //get size ByteArrayOutputStream bos = new ByteArrayOutputStream(mainHashesBytes.length+orderedHashesBytes.length); DataOutputStream dos = new DataOutputStream(bos); try { 
dos.writeBoolean(this.id.isForward()); dos.writeLong(this.id.getHeaderId()); dos.writeInt(this.sequenceLength); dos.write(mainHashesBytes); dos.write(orderedHashesBytes); dos.flush(); return bos.toByteArray(); } catch (IOException e) { throw new MhapRuntimeException("Unexpected IO error."); } } public MinHashSketch getMinHashes() { return this.mainHashes; } public BottomOverlapSketch getOrderedHashes() { return this.orderedHashes; } public SequenceId getSequenceId() { return this.id; } public int getSequenceLength() { return this.sequenceLength; } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/impl/SequenceSketchStreamer.java000066400000000000000000000241151277502137000265100ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
* */ package edu.umd.marbl.mhap.impl; import java.io.BufferedInputStream; import java.io.BufferedOutputStream; import java.io.ByteArrayInputStream; import java.io.DataInputStream; import java.io.EOFException; import java.io.FileInputStream; import java.io.FileNotFoundException; import java.io.FileOutputStream; import java.io.IOException; import java.io.OutputStream; import java.nio.ByteBuffer; import java.util.Iterator; import java.util.concurrent.ConcurrentLinkedQueue; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.concurrent.TimeUnit; import java.util.concurrent.atomic.AtomicLong; import edu.umd.marbl.mhap.sketch.FrequencyCounts; import edu.umd.marbl.mhap.sketch.ZeroNGramsFoundException; import edu.umd.marbl.mhap.utils.ReadBuffer; import edu.umd.marbl.mhap.utils.Utils; public class SequenceSketchStreamer { private final DataInputStream buffInput; private final FastaData fastaData; private final FrequencyCounts kmerFilter; private final int kmerSize; private final int minOlapLength; private final AtomicLong numberProcessed; private final int numHashes; private final int offset; private final int orderedKmerSize; private final int orderedSketchSize; private boolean readClosed; private final boolean readingFasta; private final double repeatWeight; private final ConcurrentLinkedQueue sequenceHashList; public SequenceSketchStreamer(String file, int minOlapLength, int offset) throws FileNotFoundException { this.fastaData = null; this.readingFasta = false; this.sequenceHashList = new ConcurrentLinkedQueue(); this.numberProcessed = new AtomicLong(); this.kmerFilter = null; this.repeatWeight = 0; this.minOlapLength = minOlapLength; this.kmerSize = 0; this.numHashes = 0; this.orderedKmerSize = 0; this.orderedSketchSize = 0; this.readClosed = false; this.offset = offset; this.buffInput = new DataInputStream(new BufferedInputStream(new FileInputStream(file), Utils.BUFFER_BYTE_SIZE)); } public SequenceSketchStreamer(String 
file, int minOlapLength, int kmerSize, int numHashes, int orderedKmerSize, int orderedSketchSize, FrequencyCounts kmerFilter, double repeatWeight, int offset) throws IOException { this.fastaData = new FastaData(file, offset); this.readingFasta = true; this.sequenceHashList = new ConcurrentLinkedQueue(); this.numberProcessed = new AtomicLong(); this.repeatWeight = repeatWeight; this.minOlapLength = minOlapLength; this.kmerFilter = kmerFilter; this.kmerSize = kmerSize; this.numHashes = numHashes; this.orderedKmerSize = orderedKmerSize; this.orderedSketchSize = orderedSketchSize; this.buffInput = null; this.readClosed = false; this.offset = offset; } public SequenceSketch dequeue(boolean fwdOnly, ReadBuffer buf) throws IOException { enqueueUntilFound(fwdOnly, buf); return this.sequenceHashList.poll(); } private boolean enqueue(boolean fwdOnly, ReadBuffer buf) throws IOException, ZeroNGramsFoundException { SequenceSketch seqHashes; if (this.readingFasta) { Sequence seq; do { seq = this.fastaData.dequeue(); } while (seq!=null && seq.length() getDataIterator() { return this.sequenceHashList.iterator(); } public int getFastaProcessed() { if (this.fastaData == null) return 0; return this.fastaData.getNumberProcessed(); } public int getNumberProcessed() { return this.numberProcessed.intValue(); } public SequenceSketch getSketch(Sequence seq) throws ZeroNGramsFoundException { // compute the hashes return new SequenceSketch(seq, this.kmerSize, this.numHashes, this.orderedKmerSize, this.orderedSketchSize, this.kmerFilter, this.repeatWeight); } protected void processAddition(SequenceSketch seqHashes) { // increment counter this.numberProcessed.getAndIncrement(); int numProcessed = getNumberProcessed(); if (numProcessed % 5000 == 0) System.err.println("Current # sequences loaded and processed from file: " + numProcessed + "..."); } protected SequenceSketch readFromBinary(ReadBuffer buf, boolean fwdOnly) throws IOException { byte[] byteArray = null; synchronized (this.buffInput) 
		{
			if (this.readClosed)
				return null;

			try
			{
				boolean keepReading = true;
				while (keepReading)
				{
					byte isFwd = this.buffInput.readByte();

					if (!fwdOnly || isFwd == 1)
						keepReading = false;

					// get the size in bytes
					int byteSize = this.buffInput.readInt();

					// allocate the array
					byteArray = buf.getBuffer(byteSize);

					// read exactly that many bytes; a plain read() may return fewer than requested
					this.buffInput.readFully(byteArray, 0, byteSize);
				}
			}
			catch (EOFException e)
			{
				this.buffInput.close();
				this.readClosed = true;

				return null;
			}
		}

		// get as byte array stream
		SequenceSketch seqHashes = SequenceSketch.fromByteStream(new DataInputStream(
				new ByteArrayInputStream(byteArray)), this.offset);

		return seqHashes;
	}

	public void writeToBinary(String file, final boolean fwdOnly, int numThreads) throws IOException
	{
		OutputStream output = null;
		try
		{
			output = new BufferedOutputStream(new FileOutputStream(file), Utils.BUFFER_BYTE_SIZE);
			final OutputStream finalOutput = output;

			// create a fixed pool with the requested number of threads
			ExecutorService execSvc = Executors.newFixedThreadPool(numThreads);

			// for each thread create a task
			for (int iter = 0; iter < numThreads; iter++)
			{
				Runnable task = new Runnable()
				{
					@Override
					public void run()
					{
						SequenceSketch seqHashes;
						ReadBuffer buf = new ReadBuffer();

						try
						{
							seqHashes = dequeue(fwdOnly, buf);
							while (seqHashes != null)
							{
								byte[] byteArray = seqHashes.getAsByteArray();
								int arraySize = byteArray.length;
								byte isFwd = seqHashes.getSequenceId().isForward() ?
(byte) 1 : (byte) 0; // store the size as byte array byte[] byteSize = ByteBuffer.allocate(5).put(isFwd).putInt(arraySize).array(); synchronized (finalOutput) { finalOutput.write(byteSize); finalOutput.write(byteArray); } seqHashes = dequeue(fwdOnly, buf); } } catch (IOException e) { throw new MhapRuntimeException(e); } } }; // enqueue the task execSvc.execute(task); } // shutdown the service execSvc.shutdown(); try { execSvc.awaitTermination(365L, TimeUnit.DAYS); } catch (InterruptedException e) { execSvc.shutdownNow(); throw new MhapRuntimeException("Unable to finish all tasks."); } finalOutput.flush(); } finally { if (output != null) output.close(); } } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/main/000077500000000000000000000000001277502137000212105ustar00rootroot00000000000000MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/main/.gitignore000066400000000000000000000001371277502137000232010ustar00rootroot00000000000000/buildMulti.class /Utils$Pair.class /Utils$ToProtein.class /Utils$Translate.class /Utils.class MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/main/AlignmentTry.java000066400000000000000000000107121277502137000244710ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. 
You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.main; import edu.umd.marbl.mhap.align.AlignElementDoubleSketch; import edu.umd.marbl.mhap.align.AlignElementString; import edu.umd.marbl.mhap.align.Aligner; import edu.umd.marbl.mhap.align.Alignment; import edu.umd.marbl.mhap.impl.MinHashBitSequenceSubSketches; import edu.umd.marbl.mhap.impl.OverlapInfo; import edu.umd.marbl.mhap.sketch.MinHashBitSketch; import edu.umd.marbl.mhap.sketch.ZeroNGramsFoundException; import edu.umd.marbl.mhap.utils.RandomSequenceGenerator; public class AlignmentTry { public static void main(String[] args) throws ZeroNGramsFoundException { String a = "bcdefghij1234567890"; String b = "abcdefghij1234567890"; RandomSequenceGenerator generator = new RandomSequenceGenerator(); a = generator.generateRandomSequence(2000); b = a.substring(800, 1800); a = generator.addPacBioError(a); b = generator.addPacBioError(b); //b = generator.generateRandomSequence(1400); //b = a; Aligner aligner = new Aligner(true, -2.0, -1*Float.MAX_VALUE, 0.0); Alignment alignment = aligner.localAlignSmithWaterGotoh(new AlignElementString(a), new AlignElementString(b)); System.err.println(alignment.getOverlapScore(5)); System.out.println(alignment.outputAlignment()); System.err.println("A1="+alignment.getA1()); System.err.println("B1="+alignment.getB1()); System.err.println("A2="+alignment.getA2()); System.err.println("B2="+alignment.getB2()); MinHashBitSequenceSubSketches m1 = new MinHashBitSequenceSubSketches(a, 7, 200, 20); MinHashBitSequenceSubSketches m2 = new MinHashBitSequenceSubSketches(b, 7, 200, 20); OverlapInfo info = 
m1.getOverlapInfo(new Aligner>(true, 0.00, 0.0, -0.52), m2); System.err.println("Compressed="); System.err.println(info.rawScore); System.err.println(info.a1); System.err.println(info.b1); System.err.println(info.a2); System.err.println(info.b2); OverlapInfo info2 = m2.getOverlapInfo(new Aligner>(true, 0.00, 0.0, -0.52), m1); System.err.println("Swap="); System.err.println(info2.rawScore); System.err.println(info2.a1); System.err.println(info2.b1); System.err.println(info2.a2); System.err.println(info2.b2); System.exit(1); //OrderedNGramHashes hashes1 = new OrderedNGramHashes(a, 10, 1024); //OrderedNGramHashes hashes2 = new OrderedNGramHashes(b, 10, 1024); /* System.err.println("Ordered="); System.err.println(hashes1.getOverlapInfo(hashes2, .2).a1); System.err.println(hashes1.getOverlapInfo(hashes2, .2).b1); System.err.println(hashes1.getOverlapInfo(hashes2, .2).a2); System.err.println(hashes1.getOverlapInfo(hashes2, .2).b2); */ /* SimHash s1 = new SimHash(a, kmerSize, 100); SimHash s2 = new SimHash(b, kmerSize, 100); MinHashSketch h1 = new MinHashSketch(a, kmerSize, 8000); MinHashSketch h2 = new MinHashSketch(b, kmerSize, 8000); MinHashBitSketch hb1 = new MinHashBitSketch(a, kmerSize, 100); MinHashBitSketch hb2 = new MinHashBitSketch(b, kmerSize, 100); System.err.println(s1.jaccard(s2)); System.err.println(h1.jaccard(h2)); System.err.println(hb1.jaccard(hb2)); */ } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/main/EstimateROC.java000077500000000000000000000754331277502137000242110ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. 
* * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.main; import java.io.BufferedReader; import java.io.File; import java.io.FileInputStream; import java.io.IOException; import java.io.InputStreamReader; import java.util.Calendar; import java.util.GregorianCalendar; import java.util.HashMap; import java.util.HashSet; import java.util.Iterator; import java.util.List; import java.util.Random; import java.util.concurrent.ExecutionException; import java.util.concurrent.ForkJoinPool; import java.util.concurrent.atomic.AtomicInteger; import java.util.regex.Matcher; import java.util.regex.Pattern; import java.util.stream.Stream; import edu.umd.marbl.mhap.impl.FastaData; import edu.umd.marbl.mhap.impl.Sequence; import edu.umd.marbl.mhap.utils.IntervalTree; import edu.umd.marbl.mhap.utils.Utils; import jaligner.NeedlemanWunschGotoh; import jaligner.SmithWatermanGotoh; import jaligner.matrix.MatrixLoader; import jaligner.matrix.MatrixLoaderException; import ssw.Aligner; public class EstimateROC { private static final boolean ALIGN_SW = true; private static final boolean ALIGN_JALIGN = false; private static int[][] MATCH_MATRIX = new int[128][128]; 
private static final double MIN_REF_OVERLAP_DIFFERENCE = 0.8; private static double MIN_IDENTITY = 0.70; private static final double REF_IDENTITY_ADJUSTMENT = 0.1; private static double MIN_REF_IDENTITY = MIN_IDENTITY + REF_IDENTITY_ADJUSTMENT; private static double MIN_ALIGNMENT_IDENTITY = MIN_IDENTITY - REF_IDENTITY_ADJUSTMENT; private static double MIN_OVERLAP_DIFFERENCE = 0.30; private static final int DEFAULT_NUM_TRIALS = 10000; private static final int DEFAULT_MIN_OVL = 2000; private static final boolean DEFAULT_DO_DP = false; private static boolean LOAD_ALL = false; private static boolean DEBUG = false; private static class Pair { public int first; public int second; public Pair(int startInRef, int endInRef) { this.first = startInRef; this.second = endInRef; } @SuppressWarnings("unused") public int size() { return (Math.max(this.first, this.second) - Math.min(this.first, this.second) + 1); } } private static class Overlap { public int afirst; public int bfirst; public int asecond; public int bsecond; public boolean isFwd; public String id1; public String id2; public Overlap() { // do nothing } public int getSize() { double first = (double)Math.max(this.asecond, this.afirst) - (double)Math.min(this.asecond, this.afirst); first += (double)Math.max(this.bsecond, this.bfirst) - (double)Math.min(this.bsecond, this.bfirst); return (int)Math.round(first/2); } @Override public String toString() { StringBuilder stringBuilder = new StringBuilder(); stringBuilder.append("Overlap Fwd=" + this.isFwd); stringBuilder.append(" Aid="); stringBuilder.append(this.id1); stringBuilder.append(" ("); stringBuilder.append(this.afirst); stringBuilder.append(", "); stringBuilder.append(this.asecond); stringBuilder.append("), Bid="); stringBuilder.append(this.id2); stringBuilder.append(" ("); stringBuilder.append(this.bfirst); stringBuilder.append(", "); stringBuilder.append(this.bsecond); stringBuilder.append(")"); return stringBuilder.toString(); } } private static Random generator 
= null; public static int seed = 0; private HashMap> clusters = new HashMap>(); private HashMap seqToChr = new HashMap(10000000); private HashMap seqToScore = new HashMap(10000000); private HashMap seqToPosition = new HashMap(10000000); private HashMap seqToName = new HashMap(10000000); private HashMap seqNameToIndex = new HashMap(10000000); private HashMap ovlNames = new HashMap(10000000*10); private HashMap ovlInfo = new HashMap(10000000*10); private HashMap ovlToName = new HashMap(10000000*10); private int minOvlLen = DEFAULT_MIN_OVL; private int numTrials = DEFAULT_NUM_TRIALS; private boolean doDP = false; private long tp = 0; private long fn = 0; private long tn = 0; private long fp = 0; private double ppv = 0; private Sequence[] dataSeq = null; public static void printUsage() { System.err .println("This program uses random sampling to estimate PPV/Sensitivity/Specificity"); System.err.println("The sequences in the fasta file used to generate the truth must be sequentially numbered from 1 to N!"); System.err .println("\t1. A blasr M4 file mapping sequences to a reference (or reference subset)"); System.err .println("\t2. All-vs-all mappings of same sequences in CA ovl format"); System.err .println("\t3. Fasta sequences sequentially numbered from 1 to N."); System.err.println("\t4. Minimum overlap length (default: " + DEFAULT_MIN_OVL); System.err.println("\t5. Number of random trials, 0 means full compute (default : " + DEFAULT_NUM_TRIALS); System.err.println("\t6. Compute DP during PPV true/false"); System.err.println("\t7. 
Debug output true/false"); } public static void main(String[] args) throws Exception { if (args.length < 3) { printUsage(); System.exit(1); } EstimateROC g = null; if (args.length > 5) { g = new EstimateROC(Integer.parseInt(args[3]), Integer.parseInt(args[4]), Boolean.parseBoolean(args[5])); } else if (args.length > 4) { g = new EstimateROC(Integer.parseInt(args[3]), Integer.parseInt(args[4])); } else if (args.length > 3) { g = new EstimateROC(Integer.parseInt(args[3])); } else { g = new EstimateROC(); } if (args.length > 6) { DEBUG = Boolean.parseBoolean(args[6]); } if (args.length > 7) { MIN_IDENTITY = Double.parseDouble(args[7]); MIN_REF_IDENTITY = MIN_IDENTITY + REF_IDENTITY_ADJUSTMENT; MIN_ALIGNMENT_IDENTITY = MIN_IDENTITY - REF_IDENTITY_ADJUSTMENT/2; } if (args.length > 8) { MIN_OVERLAP_DIFFERENCE = Double.parseDouble(args[8]); } if (args.length > 9) { LOAD_ALL = Boolean.parseBoolean(args[9]); } System.err.println("Running, reference: " + args[0] + " matches: " + args[1]); System.err.println("Number trials: " + (g.numTrials == 0 ? 
"all" : g.numTrials)); System.err.println("Minimum ovl: " + g.minOvlLen); System.err.println("Minimum acceptable %" + MIN_IDENTITY); System.err.println("Minimum acceptable shift " + MIN_OVERLAP_DIFFERENCE); System.err.println("Minimum overlap to ref %" + MIN_REF_IDENTITY); System.err.println("Minimum acceptable overlap for dp check %" + MIN_ALIGNMENT_IDENTITY); // load and cluster reference System.err.print("Loading reference..."); long startTime = System.nanoTime(); long totalTime = startTime; g.processReference(args[0]); System.err.println("done " + (System.nanoTime() - startTime) * 1.0e-9 + "s."); // load fasta System.err.print("Loading fasta..."); startTime = System.nanoTime(); g.loadFasta(args[2]); System.err.println("done " + (System.nanoTime() - startTime) * 1.0e-9 + "s."); // load matches System.err.print("Loading matches..."); startTime = System.nanoTime(); g.processOverlaps(args[1]); System.err.println("done " + (System.nanoTime() - startTime) * 1.0e-9 + "s."); if (g.numTrials == 0) { System.err.print("Computing full statistics O(" + g.seqToName.size() + "^2) operations!..."); startTime = System.nanoTime(); g.fullEstimate(); System.err.println("done " + (System.nanoTime() - startTime) * 1.0e-9 + "s."); } else { System.err.print("Computing sensitivity..."); startTime = System.nanoTime(); g.estimateSensitivity(); System.err.println("done " + (System.nanoTime() - startTime) * 1.0e-9 + "s."); // now estimate FP/TN by picking random match and checking reference // mapping System.err.print("Computing specificity..."); startTime = System.nanoTime(); g.estimateSpecificity(); System.err.println("done " + (System.nanoTime() - startTime) * 1.0e-9 + "s."); // last but not least PPV, pick random subset of our matches and see what percentage are true System.err.print("Computing PPV..."); startTime = System.nanoTime(); g.estimatePPV(); System.err.println("done " + (System.nanoTime() - startTime) * 1.0e-9 + "s."); } System.err.println("Total time: " + (System.nanoTime() 
        - totalTime) * 1.0e-9 + "s.");
    System.out.println("Estimated sensitivity:\t"
        + Utils.DECIMAL_FORMAT.format((double) g.tp / (double)(g.tp + g.fn)));
    System.out.println("Estimated specificity:\t"
        + Utils.DECIMAL_FORMAT.format((double) g.tn / (double)(g.fp + g.tn)));
    System.out.println("Estimated PPV:\t " + Utils.DECIMAL_FORMAT.format(g.ppv));
  }

  public EstimateROC() {
    this(DEFAULT_MIN_OVL, DEFAULT_NUM_TRIALS);
  }

  public EstimateROC(int minOvlLen) {
    this(minOvlLen, DEFAULT_NUM_TRIALS);
  }

  public EstimateROC(int minOvlLen, int numTrials) {
    this(minOvlLen, numTrials, DEFAULT_DO_DP);
  }

  @SuppressWarnings("unused")
  public EstimateROC(int minOvlLen, int numTrials, boolean doDP) {
    this.minOvlLen = minOvlLen;
    this.numTrials = numTrials;
    this.doDP = doDP;
    if (false) {
      GregorianCalendar t = new GregorianCalendar();
      int t1 = t.get(Calendar.SECOND);
      int t2 = t.get(Calendar.MINUTE);
      int t3 = t.get(Calendar.HOUR_OF_DAY);
      int t4 = t.get(Calendar.DAY_OF_MONTH);
      int t5 = t.get(Calendar.MONTH);
      int t6 = t.get(Calendar.YEAR);
      seed = t6 + 65 * (t5 + 12 * (t4 + 31 * (t3 + 24 * (t2 + 60 * t1))));
    }

    generator = new Random(seed);

    if (!EstimateROC.ALIGN_JALIGN) {
      try {
        File f = new File(System.getProperty("java.class.path"));
        File dir = f.getAbsoluteFile().getParentFile();
        String path = dir.toString();
        System.err.println("Loaded file from path " + path);
        System.load(path + java.io.File.separator + "lib" + java.io.File.separator + "libsswjni.so");

        // now initialize matrix
        for (int i = 0; i < 128; i++) {
          for (int j = 0; j < 128; j++) {
            if (i == j)
              MATCH_MATRIX[i][j] = 2;
            else
              MATCH_MATRIX[i][j] = -2;
          }
        }
      } catch (Exception | UnsatisfiedLinkError e) {
        // System.load reports a missing or incompatible native library with UnsatisfiedLinkError
        System.err.println("Error: could not load DP library: " + e);
        System.exit(1);
      }
    }
  }

  private static int getSequenceId(String id) {
    return Integer.parseInt(id)-1;
  }

  private static String getOvlName(String id, String id2) {
    return (id.compareTo(id2) <= 0 ?
        id + "_" + id2 : id2 + "_" + id);
  }

  private String pickRandomSequence() {
    int val = generator.nextInt(this.seqToName.size());
    return this.seqToName.get(val);
  }

  private String pickRandomMatch() {
    int val = generator.nextInt(this.ovlToName.size());
    return this.ovlToName.get(val);
  }

  private int getOverlapSize(String id, String id2) {
    String chr = this.seqToChr.get(id);
    String chr2 = this.seqToChr.get(id2);
    Pair p1 = this.seqToPosition.get(id);
    Pair p2 = this.seqToPosition.get(id2);
    if (!chr.equalsIgnoreCase(chr2)) {
      System.err.println("Error: comparing wrong chromosomes between sequences " + id
          + " and sequence " + id2);
      System.exit(1);
    }
    return Utils.getRangeOverlap(p1.first, p1.second, p2.first, p2.second);
  }

  private HashSet<String> getSequenceMatches(String id, int min) {
    String chr = this.seqToChr.get(id);
    Pair p1 = this.seqToPosition.get(id);
    if (chr == null || p1 == null) {
      return null;
    }
    List<Integer> intersect = this.clusters.get(chr).get(p1.first, p1.second);
    HashSet<String> result = new HashSet<String>();

    Iterator<Integer> it = intersect.iterator();
    while (it.hasNext()) {
      String id2 = this.seqToName.get(it.next());
      Pair p2 = this.seqToPosition.get(id2);
      String chr2 = this.seqToChr.get(id2);
      if (!chr.equalsIgnoreCase(chr2)) {
        System.err.println("Error: comparing wrong chromosomes between sequences " + id
            + " and sequence in its cluster " + id2);
        System.exit(1);
      }
      int overlap = Utils.getRangeOverlap(p1.first, p1.second, p2.first, p2.second);
      if (overlap >= min && !id.equalsIgnoreCase(id2)) {
        result.add(id2);
      }
    }
    return result;
  }

  private Overlap getOverlapInfo(String line) {
    Overlap overlap = new Overlap();
    String[] splitLine = line.trim().split("\\s+");

    try {
      // CA format
      if (splitLine.length == 7 || splitLine.length == 6) {
        overlap.id1 = splitLine[0];
        overlap.id2 = splitLine[1];
        @SuppressWarnings("unused")
        double score = Double.parseDouble(splitLine[5]) * 5;
        int aoffset = Integer.parseInt(splitLine[3]);
        int boffset = Integer.parseInt(splitLine[4]);
        overlap.isFwd =
            "N".equalsIgnoreCase(splitLine[2]);
        if (this.dataSeq != null) {
          int alen = this.dataSeq[Integer.parseInt(overlap.id1)-1].length();
          int blen = this.dataSeq[Integer.parseInt(overlap.id2)-1].length();
          overlap.afirst = Math.max(0, aoffset);
          overlap.asecond = Math.min(alen, alen + boffset);
          overlap.bfirst = -1*Math.min(0, aoffset);
          overlap.bsecond = Math.min(blen, blen - boffset);
        }
      // mhap format
      } else if (splitLine.length == 12) {
        overlap.id1 = splitLine[0];
        overlap.id2 = splitLine[1];
        @SuppressWarnings("unused")
        double score = Double.parseDouble(splitLine[2]);
        overlap.isFwd = Integer.parseInt(splitLine[8]) == 0;
        if (this.dataSeq != null) {
          int alen = this.dataSeq[getSequenceId(overlap.id1)].length();
          int blen = this.dataSeq[getSequenceId(overlap.id2)].length();
          overlap.afirst = Integer.parseInt(splitLine[5]);
          overlap.asecond = Integer.parseInt(splitLine[6]);
          overlap.bfirst = Integer.parseInt(splitLine[9]);
          overlap.bsecond = Integer.parseInt(splitLine[10]);
          if (overlap.asecond > alen) {
            overlap.asecond = alen;
          }
          if (overlap.bsecond > blen) {
            overlap.bsecond = blen;
          }
        }
      // blasr format
      } else if (splitLine.length == 13 && !line.contains("[")) {
        overlap.afirst = Integer.parseInt(splitLine[5]);
        overlap.asecond = Integer.parseInt(splitLine[6]);
        overlap.bfirst = Integer.parseInt(splitLine[9]);
        overlap.bsecond = Integer.parseInt(splitLine[10]);
        overlap.isFwd = (Integer.parseInt(splitLine[8]) == 0);
        if (!overlap.isFwd) {
          overlap.bsecond = Integer.parseInt(splitLine[11]) - Integer.parseInt(splitLine[9]);
          overlap.bfirst = Integer.parseInt(splitLine[11]) - Integer.parseInt(splitLine[10]);
        }
        overlap.id1 = splitLine[0];
        if (overlap.id1.indexOf("/") != -1) {
          overlap.id1 = overlap.id1.substring(0, splitLine[0].indexOf("/"));
        }
        if (overlap.id1.indexOf(",") != -1) {
          overlap.id1 = overlap.id1.split(",")[1];
        }
        overlap.id2 = splitLine[1];
        if (overlap.id2.indexOf(",") != -1) {
          overlap.id2 = overlap.id2.split(",")[1];
        }
        if (this.dataSeq != null) {
          int alen =
              this.dataSeq[getSequenceId(overlap.id1)].length();
          int blen = this.dataSeq[getSequenceId(overlap.id2)].length();
          if (overlap.asecond > alen) {
            overlap.asecond = alen;
          }
          if (overlap.bsecond > blen) {
            overlap.bsecond = blen;
          }
        }
      // 1 1,182 n [ 4,746.. 8,108] x [ 0.. 3,896] : < 982 diffs ( 34 trace pts)
      } else if (splitLine.length >= 13 && splitLine.length <= 18) {
        overlap.id1 = splitLine[0].replaceAll(",", "");
        overlap.id2 = splitLine[1].replaceAll(",", "");
        overlap.isFwd = (splitLine[2].equalsIgnoreCase("n"));
        String[] splitTwo = line.split("\\[");
        String aInfo = splitTwo[1].substring(0, splitTwo[1].indexOf("]"));
        String bInfo = splitTwo[2].substring(0, splitTwo[2].indexOf("]"));
        String[] aSplit = aInfo.replaceAll(",", "").split("\\.\\.");
        String[] bSplit = bInfo.replaceAll(",", "").split("\\.\\.");
        overlap.afirst = Integer.parseInt(aSplit[0].trim());
        overlap.asecond = Integer.parseInt(aSplit[1].trim());
        overlap.bfirst = Integer.parseInt(bSplit[0].trim());
        overlap.bsecond = Integer.parseInt(bSplit[1].trim());
        if (!overlap.isFwd) {
          overlap.bsecond = this.dataSeq[getSequenceId(overlap.id2)].length()
              - Integer.parseInt(bSplit[0].trim());
          overlap.bfirst = this.dataSeq[getSequenceId(overlap.id2)].length()
              - Integer.parseInt(bSplit[1].trim());
        }
      }
    } catch (NumberFormatException e) {
      System.err.println("Warning: could not parse input line: " + line + " " + e.getMessage());
    }
    return overlap;
  }

  private void loadFasta(String file) throws IOException {
    FastaData data = new FastaData(file, 0);
    data.enqueueFullFile();
    this.dataSeq = new Sequence[data.getNumberProcessed()];
    int i = 0;
    while (!data.isEmpty()) {
      this.dataSeq[i++] = data.dequeue();
    }
  }

  private void processOverlaps(String file) throws Exception {
    BufferedReader bf = new BufferedReader(new InputStreamReader(
        new FileInputStream(file)));
    String line = null;
    int counter = 0;
    while ((line = bf.readLine()) != null) {
      Overlap ovl = getOverlapInfo(line);
      int ovlLen = ovl.getSize();
      String id = ovl.id1;
      String id2 = ovl.id2;
      if (id
          == null || id2 == null) {
        continue;
      }
      if (id.equalsIgnoreCase(id2)) {
        continue;
      }
      if ((EstimateROC.LOAD_ALL != true)
          && (this.seqToChr.get(id) == null || this.seqToChr.get(id2) == null)) {
        continue;
      }
      String ovlName = getOvlName(id, id2);
      if (this.ovlNames.containsKey(ovlName) && ovlLen < this.ovlNames.get(ovlName)) {
        continue;
      }

      // if we see the same overlap between a pair of sequences again, don't
      // update the counter, just update its length and info
      if (this.ovlNames.containsKey(ovlName)) {
        this.ovlNames.put(ovlName, ovlLen);
        this.ovlInfo.put(ovlName, ovl);
      } else {
        this.ovlNames.put(ovlName, ovlLen);
        this.ovlToName.put(counter, ovlName);
        this.ovlInfo.put(ovlName, ovl);
        counter++;
      }
      if (counter % 100000 == 0) {
        System.err.println("Loaded " + counter);
      }
    }
    System.err.print("Processed " + this.ovlNames.size() + " overlaps");
    if (this.ovlNames.isEmpty()) {
      System.err.println("Error: No sequence matches to reference loaded!");
      System.exit(1);
    }
    bf.close();
  }

  /**
   * We are parsing a file of the format
   * 18903/0_100 ref000001|lambda_NEB3011 -462 96.9697 0 0 99 100 0 2 101 48502 254
   * 21589/0_100 ref000001|lambda_NEB3011 -500 100 0 0 100 100 1 4 104 48502 254
   * 15630/0_100 ref000001|lambda_NEB3011 -478 98 0 0 100 100 0 5 105 48502 254
   **/
  @SuppressWarnings("unused")
  private void processReference(String file) throws Exception {
    BufferedReader bf = new BufferedReader(new InputStreamReader(
        new FileInputStream(file)));
    String line = null;
    int counter = 0;
    while ((line = bf.readLine()) != null) {
      String[] splitLine = line.trim().split("\\s+");
      String id = splitLine[0];
      if (id.indexOf("/") != -1) {
        id = id.substring(0, splitLine[0].indexOf("/"));
      }
      if (id.indexOf(",") != -1) {
        id = id.split(",")[1];
      }
      double idy = Double.parseDouble(splitLine[3]);
      int start = Integer.parseInt(splitLine[5]);
      int end = Integer.parseInt(splitLine[6]);
      int length = Integer.parseInt(splitLine[7]);
      int seqIsFwd = Integer.parseInt(splitLine[4]);
      if (seqIsFwd != 0) {
        System.err.println("Error: malformed line, first
sequences should always be in fwd orientation");
        System.exit(1);
      }
      int startInRef = Integer.parseInt(splitLine[9]);
      int endInRef = Integer.parseInt(splitLine[10]);
      int refLen = Integer.parseInt(splitLine[11]);
      int isRev = Integer.parseInt(splitLine[8]);
      int score = Integer.parseInt(splitLine[2]);
      if (isRev == 1) {
        int tmp = refLen - endInRef;
        endInRef = refLen - startInRef;
        startInRef = tmp;
      }
      if (idy < MIN_REF_IDENTITY*100) {
        continue;
      }
      double diff = ((double)(end - start) / (double)(endInRef-startInRef));
      if (diff < MIN_REF_OVERLAP_DIFFERENCE) {
        continue;
      }
      String chr = splitLine[1];
      if (!this.clusters.containsKey(chr)) {
        this.clusters.put(chr, new IntervalTree());
      }
      if (this.seqToPosition.containsKey(id)) {
        if (score < this.seqToScore.get(id)) {
          // replace
          this.seqToPosition.put(id, new Pair(startInRef, endInRef));
          this.seqToChr.put(id, chr);
          this.seqToScore.put(id, score);
        }
      } else {
        this.seqToPosition.put(id, new Pair(startInRef, endInRef));
        this.seqToChr.put(id, chr);
        this.seqToName.put(counter, id);
        this.seqNameToIndex.put(id, counter);
        this.seqToScore.put(id, score);
        counter++;
      }
    }
    bf.close();
    for (String id : this.seqToPosition.keySet()) {
      String chr = this.seqToChr.get(id);
      if (!this.clusters.containsKey(chr)) {
        this.clusters.put(chr, new IntervalTree());
      }
      Pair p = this.seqToPosition.get(id);
      this.clusters.get(chr).addInterval(p.first, p.second, this.seqNameToIndex.get(id));
    }
    System.err.print("Processed " + this.clusters.size() + " chromosomes, "
        + this.seqToPosition.size() + " sequences matching ref");
    if (this.seqToPosition.isEmpty()) {
      System.err.println("Error: No sequence matches to reference loaded!");
      System.exit(1);
    }
  }

  private boolean overlapExists(String id, String id2) {
    return this.ovlNames.containsKey(getOvlName(id, id2));
  }

  private boolean overlapMatches(String id, String m) {
    int refOverlap = getOverlapSize(id, m);
    Overlap ovl = this.ovlInfo.get(getOvlName(id, m));
    if (ovl == null) {
      return false;
    }
    int diff = Math.abs(ovl.getSize() -
        refOverlap);
    double diffPercent = (double)diff / (double)refOverlap;
    if (DEBUG) {
      System.err.println("Overlap " + ovl + " " + ovl.getSize() + " versus ref " + refOverlap
          + " " + " diff is " + diff + "(" + diffPercent + ")");
    }
    if (diffPercent > MIN_OVERLAP_DIFFERENCE) {
      return false;
    }
    return true;
  }

  private void checkMatches(String id, HashSet<String> matches) {
    for (String m : matches) {
      if (overlapMatches(id, m)) {
        this.tp++;
      } else {
        this.fn++;
        if (DEBUG) {
          System.err.println("Overlap between sequences: " + id + ", " + m + " is missing.");
          System.err.println(">" + id + " reference location " + this.seqToChr.get(id) + " "
              + this.seqToPosition.get(id).first + ", " + this.seqToPosition.get(id).second);
          System.err.println(this.dataSeq[Integer.parseInt(id)-1].getSquenceString());
          System.err.println(">" + m + " reference location " + this.seqToChr.get(m) + " "
              + this.seqToPosition.get(m).first + ", " + this.seqToPosition.get(m).second);
          System.err.println(this.dataSeq[Integer.parseInt(m)-1].getSquenceString());
        }
      }
    }
  }

  private static double getScore(jaligner.Alignment alignment, String qry, String ref) {
    char[] sequence1 = alignment.getSequence1();
    char[] sequence2 = alignment.getSequence2();
    int length = Math.max(sequence1.length, sequence2.length);
    char GAP = '-';
    @SuppressWarnings("unused")
    int errors = 0;
    int matches = 0;
    // iterate over alignment columns 0..length-1
    for (int i = 0; i < length; i++) {
      char c1 = GAP;
      char c2 = GAP;
      if (i < sequence1.length) {
        c1 = sequence1[i];
      }
      if (i < sequence2.length) {
        c2 = sequence2[i];
      }
      if (c1 != c2 || c1 == GAP || c2 == GAP) {
        errors++;
      } else {
        matches++;
      }
    }
    return (matches / (double)length);
  }

  private static double getScore(ssw.Alignment alignment, String qry, String ref) {
    // the result is a cigar string of the format
    // 3M1I9M1I9M1D10M1I13M1I9M1D1M2D45M1D6M1D50
    Pattern cigarPattern = Pattern.compile("[\\d]+[a-zA-Z|=]");
    Matcher matcher = cigarPattern.matcher(alignment.cigar);
    int errors = 0;
    int len = 0;
    int qryPos = alignment.read_begin1;
    int refPos =
        alignment.ref_begin1;
    while (matcher.find()) {
      String cVal = matcher.group();
      int cLen = Integer.parseInt(cVal.substring(0, cVal.length() - 1));
      char cLetter = cVal.toUpperCase().charAt(cVal.length() - 1);
      switch (cLetter) {
        case 'H':
          break;
        case 'S':
        case '=':
          // ignore, not an error
          len+=cLen;
          break;
        case 'M':
          for (int i = 0; i < cLen; i++) {
            if (ref.toUpperCase().charAt(refPos) != qry.toUpperCase().charAt(qryPos)) {
              errors++;
            } else {
              // do nothing
            }
            refPos++;
            qryPos++;
          }
          len+=cLen;
          break;
        case 'I':
          errors += cLen;
          qryPos += cLen;
          len+=cLen;
          break;
        case 'D':
          errors += cLen;
          refPos += cLen;
          len+=cLen;
          break;
        default:
          System.err.println("Error, unknown base " + cLetter);
          System.exit(1);
          break;
      }
    }
    return 1 - (errors / (double) len);
  }

  private boolean computeDP(String id, String id2) {
    if (this.doDP == false) {
      return false;
    }
    Overlap ovl = this.ovlInfo.get(getOvlName(id, id2));
    if (DEBUG)
      System.err.println("Aligning sequence " + ovl.id1 + " to " + ovl.id2 + " " + ovl.bfirst
          + " to " + ovl.bsecond + " and " + ovl.isFwd + " and " + ovl.afirst + " " + ovl.asecond);
    String s1 = this.dataSeq[getSequenceId(ovl.id1)].getSquenceString().substring(ovl.afirst, ovl.asecond);
    String s2 = null;
    if (ovl.isFwd) {
      s2 = this.dataSeq[getSequenceId(ovl.id2)].getSquenceString().substring(ovl.bfirst, ovl.bsecond);
    } else {
      s2 = Utils.rc(this.dataSeq[getSequenceId(ovl.id2)].getSquenceString().substring(ovl.bfirst, ovl.bsecond));
    }
    int ovlLen = Math.min(s1.length(), s2.length());
    double score = 0;
    int length = 0;
    if (ALIGN_JALIGN) {
      jaligner.Sequence js1 = new jaligner.Sequence(s1);
      jaligner.Sequence js2 = new jaligner.Sequence(s2);
      jaligner.Alignment alignment = null;
      try {
        if (ALIGN_SW) {
          alignment = SmithWatermanGotoh.align(js1, js2, MatrixLoader.load("MATCH"), 2f, 1f);
        } else {
          alignment = NeedlemanWunschGotoh.align(js1, js2, MatrixLoader.load("MATCH"), 2f, 1f);
        }
      } catch (MatrixLoaderException e) {
        return false;
      }
      length = alignment.getLength();
      score = getScore(alignment, s1,
          s2);
      if (DEBUG) {
        System.err.println(alignment.getSummary());
        System.err.println("My score: " + score);
        System.err.println(new jaligner.formats.Pair().format(alignment));
      }
    } else {
      ssw.Alignment alignment = Aligner.align(s1.getBytes(), s2.getBytes(), MATCH_MATRIX, 2, 1, true);
      length = Math.max(alignment.read_end1-alignment.read_begin1,
          alignment.ref_end1 - alignment.ref_begin1);
      score = getScore(alignment, s1, s2);
      if (DEBUG) {
        System.err.println(alignment.toString());
        System.err.println(alignment.read_end1 + " " + alignment.read_begin1 + " "
            + alignment.ref_end1 + " " + alignment.ref_begin1 + " " + length);
        System.err.println("My score: " + score);
      }
    }
    return (score > MIN_ALIGNMENT_IDENTITY && length > this.minOvlLen
        && 1-((float)length/ovlLen) < MIN_OVERLAP_DIFFERENCE);
  }

  private void estimateSensitivity() {
    // we estimate TP/FN by randomly picking a sequence, getting its
    // cluster, and checking our matches
    for (int i = 0; i < this.numTrials; i++) {
      String id = null;
      HashSet<String> matches = null;
      while (matches == null || matches.size() == 0) {
        // pick cluster
        id = pickRandomSequence();
        matches = getSequenceMatches(id, this.minOvlLen);
      }
      if (DEBUG) {
        System.err.println("Estimated sensitivity trial #" + i + " " + id + " matches " + matches);
      }
      checkMatches(id, matches);
    }
  }

  private void estimateSpecificity() {
    // we estimate FP/TN by randomly picking two sequences
    for (int i = 0; i < this.numTrials; i++) {
      // pick cluster
      String id = pickRandomSequence();
      String other = pickRandomSequence();
      while (id.equalsIgnoreCase(other)) {
        other = pickRandomSequence();
      }
      HashSet<String> matches = getSequenceMatches(id, 0);
      if (overlapExists(id, other)) {
        if (!matches.contains(other)) {
          this.fp++;
        }
      } else {
        if (!matches.contains(other)) {
          this.tn++;
        }
      }
    }
  }

  private void estimatePPV() throws InterruptedException, ExecutionException {
    AtomicInteger numTP = new AtomicInteger();
    ForkJoinPool forkJoinPool = new ForkJoinPool(Runtime.getRuntime().availableProcessors());
    forkJoinPool.submit(()
        -> Stream.iterate(0, i->i+1).limit(this.numTrials).parallel().forEach(i-> {
          int ovlLen = 0;
          String[] ovl = null;
          String ovlName = null;
          while (ovlLen < this.minOvlLen) {
            // pick an overlap
            ovlName = pickRandomMatch();
            Overlap o = this.ovlInfo.get(ovlName);
            ovlLen = Utils.getRangeOverlap(o.afirst, o.asecond, o.bfirst, o.bsecond);
          }
          if (ovlName == null) {
            System.err.println("Could not find any computed overlaps > " + this.minOvlLen);
            System.exit(1);
          } else {
            ovl = ovlName.split("_");
            String id = ovl[0];
            String id2 = ovl[1];
            HashSet<String> matches = getSequenceMatches(id, 0);
            if (matches != null && matches.contains(id2)) {
              numTP.getAndIncrement();
            } else {
              if (computeDP(id, id2)) {
                numTP.getAndIncrement();
              } else {
                if (DEBUG) {
                  System.err.println("Overlap between sequences: " + id + ", " + id2 + " is not correct.");
                }
              }
            }
          }
        })
    ).get();

    // now our formula for PPV. Estimate percent of our matches which are true
    this.ppv = numTP.doubleValue() / (double)this.numTrials;
  }

  @SuppressWarnings("cast")
  private void fullEstimate() {
    for (int i = 0; i < this.seqToName.size(); i++) {
      String id = this.seqToName.get(i);
      for (int j = i+1; j < this.seqToName.size(); j++) {
        String id2 = this.seqToName.get(j);
        if (id == null || id2 == null) {
          continue;
        }
        HashSet<String> matches = getSequenceMatches(id, 0);
        if (!overlapMatches(id, id2)) {
          if (!matches.contains(id2)) {
            this.tn++;
          } else if (getOverlapSize(id, id2) > this.minOvlLen) {
            this.fn++;
          }
        } else {
          if (matches.contains(id2)) {
            this.tp++;
          } else {
            if (computeDP(id, id2)) {
              this.tp++;
            } else {
              this.fp++;
            }
          }
        }
      }
    }
    this.ppv = (double) this.tp / ((double)this.tp+(double)this.fp);
  }
}

MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/main/GetHistogramStats.java
/*
 * MHAP package
 *
 * This software is distributed "as is", without any warranty, including
 * any implied warranty of merchantability or fitness for a particular
 * use.
The authors assume no responsibility for, and shall not be liable
 * for, any special, indirect, or consequential damages, or any damages
 * whatsoever, arising out of or in connection with the use of this
 * software.
 *
 * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren
 * University Of Maryland
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */
package edu.umd.marbl.mhap.main;

import java.io.BufferedReader;
import java.util.TreeMap;

import edu.umd.marbl.mhap.utils.Utils;

public class GetHistogramStats {
  private static final int NUM_SD = 7;

  private TreeMap<Integer, Long> histogram = new TreeMap<Integer, Long>();
  private double percent = 0.99;
  private double mean = 0;
  private double stdev = 0;
  private long cut = 0;

  public GetHistogramStats(String fileName, double p) {
    try {
      BufferedReader bf = Utils.getFile(fileName, null);
      String line = null;
      while ((line = bf.readLine()) != null) {
        String[] split = line.trim().split("\\s+");
        int val = Integer.parseInt(split[0]);
        long count = Long.parseLong(split[1]);
        this.histogram.put(val, count);
      }
      bf.close();
      this.percent = p;
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public void process() throws NumberFormatException {
    double variance = 0;
    double sum = 0;
    long total = 0;
    for (int val : this.histogram.keySet()) {
      long count = this.histogram.get(val);
      for (long i = 0; i < count; i++) {
        total++;
        double delta = (val - this.mean);
        this.mean += (delta / total);
        variance += delta * (val - this.mean);
        sum += val;
      }
    }
    variance /= total;
    this.stdev = Math.sqrt(variance);
    double runningSum = 0;
    for (int val : this.histogram.keySet()) {
      long count = this.histogram.get(val);
      runningSum += (double) val * count;
      if ((runningSum / sum) > this.percent) {
        this.cut = val;
        break;
      }
    }
  }

  @Override
  public String toString() {
    return Utils.DECIMAL_FORMAT.format(this.mean) + "\t" + Utils.DECIMAL_FORMAT.format(this.stdev)
        + "\t" + "\t" + this.cut + "\t"
        + Utils.DECIMAL_FORMAT.format(this.mean + NUM_SD * this.stdev);
  }

  public static void main(String[] args) throws NumberFormatException {
    GetHistogramStats s = new GetHistogramStats(args[0], Double.parseDouble(args[1]));
    s.process();
    System.out.println(s.toString());
  }
}

MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/main/KmerStatSimulator.java
/*
 * MHAP package
 *
 * This software is
distributed "as is", without any warranty, including
 * any implied warranty of merchantability or fitness for a particular
 * use. The authors assume no responsibility for, and shall not be liable
 * for, any special, indirect, or consequential damages, or any damages
 * whatsoever, arising out of or in connection with the use of this
 * software.
 *
 * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren
 * University Of Maryland
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */
package edu.umd.marbl.mhap.main;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.ListIterator;
import java.util.Random;
import java.util.GregorianCalendar;
import java.util.Calendar;
import java.util.HashSet;
import java.io.BufferedReader;
import java.io.PrintStream;

import edu.umd.marbl.mhap.impl.FastaData;
import edu.umd.marbl.mhap.sketch.BottomSketch;
import edu.umd.marbl.mhap.sketch.MinHashSketch;
import edu.umd.marbl.mhap.sketch.BottomOverlapSketch;
import edu.umd.marbl.mhap.sketch.ZeroNGramsFoundException;
import edu.umd.marbl.mhap.utils.Utils;

public class KmerStatSimulator {
  private boolean verbose = false;
  private int kmer = -1;
  private int overlap = 100;
  private ArrayList<Double> randomJaccard = new ArrayList<Double>();
  private ArrayList<Double> randomMinHash = new ArrayList<Double>();
  private ArrayList<Double> randomMerCounts = new ArrayList<Double>();
  private String reference = null;
  private double requestedLength = 5000;
  private double sharedCount = 0;
  private ArrayList<Double> sharedJaccard = new ArrayList<Double>();
  private ArrayList<Double> sharedMinHash = new ArrayList<Double>();
  private ArrayList<Double> sharedMerCounts = new ArrayList<Double>();
  private HashMap<String, Integer> skipMers = new HashMap<String, Integer>();
  private int totalTrials = 10000;
  private boolean halfError = false;
  private static Random generator = null;
  public static int seed = 0;

  public static void main(String[] args) throws Exception {
    boolean usage1 = true;
    if (args.length >= 5 && args.length <= 6) {
      usage1=false;
    } else if (args.length >= 7) {
      usage1 = true;
    } else {
      printUsage();
      System.exit(1);
    }
    KmerStatSimulator f = new KmerStatSimulator();
    f.totalTrials = Integer.parseInt(args[0]);
    if (usage1) {
      f.requestedLength = Double.parseDouble(args[2]);
      f.kmer = Integer.parseInt(args[1]);
      f.overlap = Integer.parseInt(args[3]);
      if (args.length > 7) {
        f.halfError = Boolean.parseBoolean(args[7]);
      }
      if (args.length > 8) {
        f.reference = args[8];
      }
      if (f.overlap > f.requestedLength) {
        System.err.println("Cannot have overlap > sequence length");
        System.exit(1);
      }
      if (args.length > 9) {
        f.loadSkipMers(args[9]);
      }
      f.simulate(Double.parseDouble(args[4]), Double.parseDouble(args[5]), Double.parseDouble(args[6]));
    } else {
      f.requestedLength = Double.parseDouble(args[1]);
      if (args.length > 5) {
        f.reference = args[5];
      }
      f.simulate(Double.parseDouble(args[2]), Double.parseDouble(args[3]), Double.parseDouble(args[4]));
    }
  }

  public static void printUsage() {
    // positional arguments listed in the order main() reads them
    System.err.println("Example usage: simulateSharedKmers <#trials> <kmer size> <seq length> <overlap length> <insertion rate> <deletion rate> <substitution rate> [only one sequence error] [reference genome] [kmers to ignore]");
    System.err.println("Usage 2: simulateSharedKmers <#trials> <seq length> <insertion rate> <deletion rate> <substitution rate> [reference genome]");
  }

  @SuppressWarnings("unused")
  public KmerStatSimulator() {
    if (false) {
      GregorianCalendar t = new GregorianCalendar();
      int t1 = t.get(Calendar.SECOND);
      int t2 = t.get(Calendar.MINUTE);
      int t3 = t.get(Calendar.HOUR_OF_DAY);
      int t4 = t.get(Calendar.DAY_OF_MONTH);
      int t5 = t.get(Calendar.MONTH);
      int t6 = t.get(Calendar.YEAR);
      seed = t6 + 65 * (t5 + 12 * (t4 + 31 * (t3 + 24 * (t2 + 60 * t1))));
    }
    generator = new Random(seed);
  }

  private void loadSkipMers(String file) throws Exception {
    BufferedReader bf = Utils.getFile(file, null);
    String line = null;
    while ((line = bf.readLine()) != null) {
      String[] split = line.trim().split("\\s+");
      String mer = split[0].trim();
      int count = Integer.parseInt(split[1]);
      this.skipMers.put(mer, count);
    }
    bf.close();
  }

  private String buildRandomSequence(int length) {
    StringBuilder st = new StringBuilder();
    for (int i = 0; i < length; i++) {
      st.append(getRandomBase(null));
    }
    return st.toString();
  }

  public double compareKmers(String first, String second) {
    HashSet<String> firstSeqs = new HashSet<String>(first.length());
    HashSet<String> totalSeqs = new HashSet<String>(first.length()+second.length());
    HashSet<String> shared = new HashSet<String>(first.length());
    for (int i = 0; i <= first.length() - this.kmer; i++) {
      String fmer = first.substring(i, i + this.kmer);
      if (!this.skipMers.containsKey(fmer)) {
        firstSeqs.add(fmer);
      }
      totalSeqs.add(fmer);
    }
    for (int i = 0; i <=
        second.length() - this.kmer; i++) {
      String smer = second.substring(i, i + this.kmer);
      if (firstSeqs.contains(smer)) {
        shared.add(smer);
      } else {
        totalSeqs.add(smer);
      }
    }
    this.sharedCount = shared.size();
    return shared.size() / (double) totalSeqs.size();
  }

  public double compareMinHash(String first, String second) {
    BottomSketch h1 = new BottomSketch(first, this.kmer, 1256);
    BottomSketch h2 = new BottomSketch(second, this.kmer, 1256);
    return h1.jaccard(h2);
  }

  public double compareMinHash2(String first, String second) throws ZeroNGramsFoundException {
    MinHashSketch h1 = new MinHashSketch(first, this.kmer, 1256, null, 1.0);
    MinHashSketch h2 = new MinHashSketch(second, this.kmer, 1256, null, 1.0);
    return h1.jaccard(h2);
  }

  private char getRandomBase(Character toExclude) {
    Character result = null;
    while (result == null) {
      double base = generator.nextDouble();
      if (base < 0.25) {
        result = 'A';
      } else if (base < 0.5) {
        result = 'C';
      } else if (base < 0.75) {
        result = 'G';
      } else {
        result = 'T';
      }
      if (toExclude != null && toExclude.equals(result)) {
        result = null;
      }
    }
    return result;
  }

  @SuppressWarnings("unused")
  private String getSequence(int firstLen, int firstPos, String sequence, double errorRate,
      StringBuilder profile, StringBuilder realErrorStr) {
    return getSequence(firstLen, firstPos, sequence, errorRate, profile, realErrorStr,
        0.792, 0.122, 0.086, true);
  }

  private String getSequence(int seqLength, int firstPos, String sequence, double errorRate,
      StringBuilder profile, StringBuilder realErrorStr, double insertionRate,
      double deletionRate, double substitutionRate, boolean trimRight) {
    StringBuilder firstSeq = new StringBuilder();
    firstSeq.append(sequence.substring(firstPos,
        Math.min(sequence.length(), firstPos + 2 * seqLength)));
    if (firstSeq.length() < 2 * seqLength) {
      firstSeq.append(sequence.substring(
          0,
          Math.min(sequence.length(), (2 * seqLength - firstSeq.length()))));
    }

    // use a linked list for insertions
    LinkedList<Character> modifiedSequence = new LinkedList<>();
    for (char
        a : firstSeq.toString().toCharArray())
      modifiedSequence.add(a);

    // now mutate
    int realError = 0;
    ListIterator<Character> iter = modifiedSequence.listIterator();
    while (iter.hasNext()) {
      char i = iter.next();
      if (generator.nextDouble() < errorRate) {
        double errorType = generator.nextDouble();
        if (errorType < substitutionRate) { // mismatch
          // switch base
          iter.set(getRandomBase(i));
          //firstSeq.setCharAt(i, getRandomBase(firstSeq.charAt(i)));
          //System.err.println("sub");
          realError++;
          i++;
        } else if (errorType < insertionRate + substitutionRate) { // insert
          iter.previous();
          iter.add(getRandomBase(null));
          //firstSeq.insert(i, getRandomBase(null));
          // profile.insert(i+1,"X");
          realError++;
          //i += 2;
        } else { // delete
          iter.remove();
          // firstSeq.setCharAt(i, 'D');
          // profile.setCharAt(i, '-');
          //System.err.println("delete");
          realError++;
        }
      } else {
        //i++;
      }
    }

    firstSeq = new StringBuilder();
    for (char c : modifiedSequence)
      firstSeq.append(c);
    realErrorStr.append((double) realError / seqLength);

    if (trimRight) {
      return firstSeq.substring(0, seqLength).toString();
    }
    return firstSeq.substring(firstSeq.length()-seqLength, firstSeq.length()).toString();
  }

  private void outputStats(ArrayList<Double> values, PrintStream out) {
    double mean = 0.0;
    double variance = 0.0;
    int N = 0;
    for (double d : values) {
      N++;
      mean += d;
    }
    mean = mean/N;
    N = 0;
    for (double d : values) {
      N++;
      variance += (d-mean)*(d-mean);
    }
    variance /= (N-1);
    double stdev = Math.sqrt(variance);
    out.print(mean + "\t" + stdev);
  }

  public void simulate(double insertionRate, double delRate, double subRate) throws Exception {
    double errorRate = insertionRate + delRate + subRate;
    double insertionPercentage = insertionRate / errorRate;
    double deletionPercentage = delRate / errorRate;
    double subPercentage = subRate / errorRate;
    if (errorRate < 0 || errorRate > 1) {
      System.err.println("Error rate must be between 0 and 1");
      System.exit(1);
    }
    System.err.println("Started...");
    String[] sequences = null;
    if (this.reference != null) {
      FastaData data
= new FastaData(this.reference, 0); data.enqueueFullFile(); sequences = new String[data.getNumberProcessed()]; int i = 0; while (!data.isEmpty()) sequences[i++] = data.dequeue().getSquenceString().toUpperCase().replace("N", ""); } System.err.println("Loaded reference"); for (int i = 0; i < this.totalTrials; i++) { if (i % 100 == 0) { System.err.println("Done " + i + "/" + this.totalTrials); } int sequenceLength = (int) this.requestedLength; int firstPos = 0; String sequence = null; int seqID = 0; if (this.reference != null) { sequence = null; while (sequence == null || sequence.length() < 4 * sequenceLength) { // pick a sequence from our reference seqID = generator.nextInt(sequences.length); sequence = sequences[seqID]; } // now pick a position firstPos = generator.nextInt(sequence.length()); } else { sequence = buildRandomSequence(sequenceLength * 4); } // simulate sequence with error StringBuilder firstAdj = new StringBuilder(); StringBuilder errors = new StringBuilder(); String firstSeq = getSequence(sequenceLength, firstPos, sequence, errorRate, firstAdj, errors, insertionPercentage, deletionPercentage, subPercentage, false); if (this.kmer < 0) { // we were only asked to simulate sequences not compare System.out.println(">s" + i + " " + seqID + " " + (firstPos+sequenceLength)); System.out.println(Utils.convertToFasta(firstSeq)); continue; } // compare number of shared kmers out of total to another sequence // from // same position int offset = (int) ((this.requestedLength * 2) - this.overlap); int secondPos = (firstPos + offset) % sequence.length(); String secondSeq = getSequence(sequenceLength, secondPos, sequence, (this.halfError ? 0 : errorRate), firstAdj, errors, (this.halfError ? 0 :insertionPercentage), (this.halfError ? 0 : deletionPercentage), (this.halfError ? 
0 : subPercentage), true); if (this.verbose) { System.err.println("Given seq " + firstPos + " of len " + sequence.length() + " and offset " + secondPos + " due to offset " + offset); System.err.println(">" + seqID + "_" + firstPos + "\n" + firstSeq); System.err.println(">" + seqID + "_" + secondPos + "\n" + secondSeq); } if (firstSeq.length() != secondSeq.length() || firstSeq.length() != this.requestedLength) { System.err.println("Error wrong length first: " + firstSeq.length() + " second: " + secondSeq.length() + " requested " + this.requestedLength); System.exit(1); } this.sharedJaccard.add(compareKmers(firstSeq, secondSeq)); this.sharedMinHash.add(compareMinHash(firstSeq, secondSeq)); this.sharedMerCounts.add(this.sharedCount); // compare number of shared kmers out of total to another sequence // from a // non-overlapping position // get a non-overlapping position if (this.reference != null) { sequence = null; int secondSeqID = 0; while (sequence == null || sequence.length() < 2 * sequenceLength) { secondSeqID = generator.nextInt(sequences.length); sequence = sequences[secondSeqID]; } secondPos = generator.nextInt(sequence.length()); while (seqID == secondSeqID && Utils .getRangeOverlap(firstPos, firstPos + sequenceLength, secondPos, secondPos + sequenceLength) > 0) { secondPos = generator.nextInt(sequence.length()); } // generate error for second sequence secondSeq = getSequence(sequenceLength, secondPos, sequence, (this.halfError ? 0 : errorRate), firstAdj, errors, (this.halfError ? 0 : insertionPercentage), (this.halfError ? 0 : deletionPercentage), (this.halfError ? 
0 : subPercentage), true); } else { secondPos = 0; secondSeq = buildRandomSequence(sequenceLength); } if (firstSeq.length() != secondSeq.length() || firstSeq.length() != this.requestedLength) { System.err.println("Error wrong length " + firstSeq.length()); System.exit(1); } // System.err.println("First: "+firstSeq.length()); // System.err.println("Second: "+secondSeq.length()); this.randomJaccard.add(compareKmers(firstSeq, secondSeq)); this.randomMinHash.add(compareMinHash(firstSeq, secondSeq)); this.randomMerCounts.add(this.sharedCount); } if (this.randomJaccard.size() != this.randomMerCounts.size() || this.sharedJaccard.size() != this.sharedMerCounts.size() || this.sharedJaccard.size() != this.randomJaccard.size()) { System.err.println("Error trial number not consistent!"); } if (this.sharedMerCounts.size() == 0) { return; } for (int i = 0; i < this.totalTrials; i++) { System.out.println(this.sharedMerCounts.get(i) + "\t" + this.sharedJaccard.get(i) + "\t" + this.sharedMinHash.get(i) + "\t" + BottomOverlapSketch.jaccardToIdentity(this.sharedMinHash.get(i), this.kmer) + "\t" + this.randomMerCounts.get(i) + "\t" + this.randomJaccard.get(i) + "\t" + this.randomMinHash.get(i)); } System.out.print("Shared mer counts stats: "); outputStats(this.sharedMerCounts, System.out); System.out.println(); System.out.print("Shared jaccard stats: "); outputStats(this.sharedJaccard, System.out); System.out.println(); System.out.print("Shared MinHash jaccard stats: "); outputStats(this.sharedMinHash, System.out); System.out.println(); System.out.print("Random mer counts stats: "); outputStats(this.randomMerCounts, System.out); System.out.println(); System.out.print("Random jaccard stats: "); outputStats(this.randomJaccard, System.out); System.out.println(); System.out.print("Random MinHash jaccard stats: "); outputStats(this.randomMinHash, System.out); System.out.println(); } } 
MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/main/MhapMain.java000066400000000000000000000547461277502137000235650ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
* */ package edu.umd.marbl.mhap.main; import java.io.BufferedReader; import java.io.File; import java.io.FilenameFilter; import java.io.IOException; import java.util.ArrayList; import java.util.Collections; import java.util.Locale; import edu.umd.marbl.mhap.impl.MhapRuntimeException; import edu.umd.marbl.mhap.impl.MinHashSearch; import edu.umd.marbl.mhap.impl.SequenceId; import edu.umd.marbl.mhap.impl.SequenceSketchStreamer; import edu.umd.marbl.mhap.sketch.FrequencyCounts; import edu.umd.marbl.mhap.utils.ParseOptions; import edu.umd.marbl.mhap.utils.Utils; public final class MhapMain { private final double acceptScore; private final String inFile; private final FrequencyCounts kmerFilter; private final int kmerSize; private final double maxShift; private final int minStoreLength; private final int minOlapLength; private final boolean noSelf; private final int numHashes; private final int numMinMatches; protected final int numThreads; private final int orderedKmerSize; private final int orderedSketchSize; private final String processFile; private final String toFile; private final double repeatWeight; private static final double DEFAULT_OVERLAP_ACCEPT_SCORE = 0.78; private static final double DEFAULT_REPEAT_WEIGHT= 0.9; private static final double DEFAULT_FILTER_CUTOFF = 1.0e-5; private static final int DEFAULT_KMER_SIZE = 16; private static final double DEFAULT_MAX_SHIFT_PERCENT = 0.2; private static final int DEFAULT_MIN_STORE_LENGTH = 0; private static final int DEFAULT_MIN_OVL_LENGTH = DEFAULT_KMER_SIZE+100; private static final int DEFAULT_NUM_MIN_MATCHES = 3; private static final int DEFAULT_NUM_THREADS = Runtime.getRuntime().availableProcessors(); private static final int DEFAULT_NUM_WORDS = 512; private static final int DEFAULT_ORDERED_KMER_SIZE = 12; private static final int DEFAULT_ORDERED_SKETCH_SIZE = 1536; public static void main(String[] args) throws Exception { // set the locale Locale.setDefault(Locale.US); ParseOptions options = new ParseOptions(); 
options.addStartTextLine("MHAP: MinHash Alignment Protocol. A tool for finding overlaps of long-read sequences (such as PacBio or Nanopore) in bioinformatics."); options.addStartTextLine("\tVersion: "+MhapMain.class.getPackage().getImplementationVersion()); options.addStartTextLine("\tUsage 1 (direct execution): java -server -Xmx -jar -s [-q] [-f]"); options.addStartTextLine("\tUsage 2 (generate precomputed binaries): java -server -Xmx -jar -p -q [-f]"); options.addOption("-s", "Usage 1 only. The FASTA or binary dat file (see Usage 2) of reads that will be stored in a box, and that all subsequent reads will be compared to.", ""); options.addOption("-q", "Usage 1: The FASTA file of reads, or a directory of files, that will be compared to the set of reads in the box (see -s). Usage 2: The output directory for the binary formatted dat files.", ""); options.addOption("-p", "Usage 2 only. The directory containing FASTA files that should be converted to binary format for storage.", ""); options.addOption("-f", "k-mer filter file used for filtering out highly repetitive k-mers. Must be sorted in descending order of frequency (second column).", ""); options.addOption("-k", "[int], k-mer size used for MinHashing. The k-mer size for second stage filter is separate, and cannot be modified.", DEFAULT_KMER_SIZE); options.addOption("--num-hashes", "[int], number of min-mers to be used in MinHashing.", DEFAULT_NUM_WORDS); options.addOption("--threshold", "[double], the threshold cutoff for the second stage sort-merge filter. This is based on the identity score computed from the Jaccard distance of k-mers (size given by ordered-kmer-size) in the overlapping regions.", DEFAULT_OVERLAP_ACCEPT_SCORE); options.addOption("--filter-threshold", "[double], the cutoff at which the k-mer in the k-mer filter file is considered repetitive. This value for a specific k-mer is specified in the second column in the filter file. 
If no filter file is provided, this option is ignored.", DEFAULT_FILTER_CUTOFF); options.addOption("--max-shift", "[double], region size to the left and right of the estimated overlap, as derived from the median shift and sequence length, where k-mer matches are still considered valid. Second stage filter only.", DEFAULT_MAX_SHIFT_PERCENT); options.addOption("--num-min-matches", "[int], minimum number of min-mers that must be shared before computing second stage filter. Any sequences below that value are considered non-overlapping.", DEFAULT_NUM_MIN_MATCHES); options.addOption("--num-threads", "[int], number of threads to use for computation. Typically set to #cores.", DEFAULT_NUM_THREADS); options.addOption("--repeat-weight", "[double] Repeat suppression strength for tf-idf weighting. <0.0 do unweighted MinHash (version 1.0), >=1.0 do only the tf weighting. To perform no idf weighting, do not supply the -f option. ", DEFAULT_REPEAT_WEIGHT); options.addOption("--ordered-kmer-size", "[int] The size of k-mers used in the ordered second stage filter.", DEFAULT_ORDERED_KMER_SIZE); options.addOption("--ordered-sketch-size", "[int] The sketch size for second stage filter.", DEFAULT_ORDERED_SKETCH_SIZE); options.addOption("--min-store-length", "[int], The minimum length of the read that is stored in the box. Used to filter out short reads from FASTA file.", DEFAULT_MIN_STORE_LENGTH); options.addOption("--min-olap-length", "[int], The minimum length of the read that is used for overlapping. Used to filter out short reads from FASTA file.", DEFAULT_MIN_OVL_LENGTH); options.addOption("--no-self", "Do not compute the overlaps between sequences inside a box. Should be used when the to and from sequences are coming from different files.", false); options.addOption("--store-full-id", "Store full IDs as seen in FASTA file, rather than storing just the sequence position in the file. Some FASTA files have long IDs, slowing output of results. 
This option is ignored when using compressed file format.", false); options.addOption("--supress-noise", "[int] 0) Does nothing, 1) completely removes any k-mers not specified in the filter file, 2) suppresses k-mers not specified in the filter file, similar to repeats. ", 0); options.addOption("--no-tf", "Do not perform the tf weighting, in the tf-idf weighting.", false); options.addOption("--settings", "Set all unset parameters to the default settings. Same defaults are applied to Nanopore and PacBio reads. 0) None, 1) Default, 2) Fast, 3) Sensitive.", 0); if (!options.process(args)) System.exit(0); if (options.get("--settings").getInteger()<0 || options.get("--settings").getInteger()>3) { System.out.println("Please enter valid --settings flag. See options below:"); System.out.println(options.helpMenuString()); System.exit(1); } if (options.get("--settings").getInteger()==1) //default { if (!options.get("-k").isSet()) options.setOptions("-k", 16); if (!options.get("--num-min-matches").isSet()) options.setOptions("--num-min-matches", 3); if (!options.get("--num-hashes").isSet()) options.setOptions("--num-hashes", 512); if (!options.get("--threshold").isSet()) options.setOptions("--threshold", .78); if (!options.get("--ordered-sketch-size").isSet()) options.setOptions("--ordered-sketch-size", 1536); if (!options.get("--ordered-kmer-size").isSet()) options.setOptions("--ordered-kmer-size", 12); } else if (options.get("--settings").getInteger()==2) //fast { if (!options.get("-k").isSet()) options.setOptions("-k", 16); if (!options.get("--num-min-matches").isSet()) options.setOptions("--num-min-matches", 3); if (!options.get("--num-hashes").isSet()) options.setOptions("--num-hashes", 256); if (!options.get("--threshold").isSet()) options.setOptions("--threshold", .80); if (!options.get("--ordered-sketch-size").isSet()) options.setOptions("--ordered-sketch-size", 1000); if (!options.get("--ordered-kmer-size").isSet()) options.setOptions("--ordered-kmer-size", 14); } 
else if (options.get("--settings").getInteger()==3) //sensitive { if (!options.get("-k").isSet()) options.setOptions("-k", 16); if (!options.get("--num-min-matches").isSet()) options.setOptions("--num-min-matches", 2); if (!options.get("--num-hashes").isSet()) options.setOptions("--num-hashes", 768); if (!options.get("--threshold").isSet()) options.setOptions("--threshold", .73); if (!options.get("--ordered-sketch-size").isSet()) options.setOptions("--ordered-sketch-size", 1536); if (!options.get("--ordered-kmer-size").isSet()) options.setOptions("--ordered-kmer-size", 12); } if (options.get("-s").getString().isEmpty() && options.get("-p").getString().isEmpty()) { System.out.println("Please set the -s or the -p options. See options below:"); System.out.println(options.helpMenuString()); System.exit(1); } if (!options.get("-p").getString().isEmpty() && options.get("-q").getString().isEmpty() ) { System.out.println("Please set the -q option. See options below:"); System.out.println(options.helpMenuString()); System.exit(1); } //check for file existence if (!options.get("-p").getString().isEmpty() && !new File(options.get("-p").getString()).exists()) { System.out.println("Could not find requested file/folder: "+options.get("-p").getString()); System.exit(1); } //check for file existence if (!options.get("-s").getString().isEmpty() && !new File(options.get("-s").getString()).exists()) { System.out.println("Could not find requested file/folder: "+options.get("-s").getString()); System.exit(1); } //check for file existence if (!options.get("-q").getString().isEmpty() && !new File(options.get("-q").getString()).exists()) { System.out.println("Could not find requested file/folder: "+options.get("-q").getString()); System.exit(1); } //check for file existence if (!options.get("-f").getString().isEmpty() && !new File(options.get("-f").getString()).exists()) { System.out.println("Could not find requested file/folder: "+options.get("-f").getString()); System.exit(1); } //check 
range if (options.get("--num-threads").getInteger()<=0) { System.out.println("Number of threads must be positive."); System.exit(1); } //check range if (options.get("-k").getInteger()<=0) { System.out.println("k-mer size must be positive."); System.exit(1); } //check range if (options.get("--num-min-matches").getInteger()<=0) { System.out.println("Minimum number of matches must be positive."); System.exit(1); } //check range if (options.get("--min-store-length").getInteger()<0) { System.out.println("The minimum read length stored must be >=0."); System.exit(1); } //check range if (options.get("--max-shift").getDouble()<-1.0) { System.out.println("The maximum shift must be >=-1."); System.exit(1); } //check range if (options.get("--threshold").getDouble()<0.0 || options.get("--threshold").getDouble()>1.0) { System.out.println("The second stage filter threshold must be 0<=threshold<=1.0."); System.exit(1); } //check range if (options.get("--supress-noise").getInteger()<0 || options.get("--supress-noise").getInteger()>2) { System.out.println("The --supress-noise parameter must be in [0,2]."); System.exit(1); } //check other options //TODO move into the class if (options.get("--store-full-id").getBoolean()) SequenceId.STORE_FULL_ID = true; else SequenceId.STORE_FULL_ID = false; //printing the options used System.err.println("Running with these settings:"); System.err.println(options); // start the main program MhapMain main = new MhapMain(options); //execute main computation code main.computeMain(); } public MhapMain(ParseOptions options) throws IOException { this.processFile = options.get("-p").getString(); this.inFile = options.get("-s").getString(); this.toFile = options.get("-q").getString(); this.noSelf = options.get("--no-self").getBoolean(); this.numThreads = options.get("--num-threads").getInteger(); this.numHashes = options.get("--num-hashes").getInteger(); this.kmerSize = options.get("-k").getInteger(); this.numMinMatches = 
options.get("--num-min-matches").getInteger(); this.minStoreLength = options.get("--min-store-length").getInteger(); this.minOlapLength = options.get("--min-olap-length").getInteger(); this.maxShift = options.get("--max-shift").getDouble(); this.acceptScore = options.get("--threshold").getDouble(); this.repeatWeight = options.get("--repeat-weight").getDouble(); this.orderedKmerSize = options.get("--ordered-kmer-size").getInteger(); this.orderedSketchSize = options.get("--ordered-sketch-size").getInteger(); // read in the kmer filter set String filterFile = options.get("-f").getString(); if (!filterFile.isEmpty()) { long startTime = System.nanoTime(); System.err.println("Reading in filter file " + filterFile + "."); try { double offset = 0.0; if (this.repeatWeight>=0.0 && this.repeatWeight<1.0) offset = this.repeatWeight; double maxFraction = options.get("--filter-threshold").getDouble(); int removeUnique = options.get("--supress-noise").getInteger(); boolean noTf = options.get("--no-tf").getBoolean(); try (BufferedReader bf = Utils.getFile(filterFile, null)) { this.kmerFilter = new FrequencyCounts(bf, maxFraction, offset, removeUnique, noTf, this.numThreads); } } catch (Exception e) { throw new MhapRuntimeException("Could not parse k-mer filter file.", e); } System.err.println("Time (s) to read filter file: " + (System.nanoTime() - startTime) * 1.0e-9); if (this.kmerFilter!=null) System.err.println("Read in k-mer filter for sizes: " + this.kmerFilter.getKmerSizes()); } else { this.kmerFilter = null; } } public void computeMain() throws IOException { long startTotalTime = System.nanoTime(); long startTime = System.nanoTime(); long processTime = System.nanoTime(); //if processing a directory if (this.processFile!=null && !this.processFile.isEmpty()) { System.err.println("Processing FASTA files for binary compression..."); File file = new File(this.processFile); if (!file.exists()) throw new MhapRuntimeException("Process file does not exist."); if (this.toFile==null 
|| this.toFile.isEmpty()) throw new MhapRuntimeException("Target directory must be defined."); File toDirectory = new File(this.toFile); if (!toDirectory.exists() || !toDirectory.isDirectory()) throw new MhapRuntimeException("Target directory doesn't exist."); //allocate directory files ArrayList<File> processFiles = new ArrayList<>(); //if not directory just add the file if (!file.isDirectory()) { processFiles.add(file); } else { //read the directory content File[] fileList = file.listFiles((dir,name) -> { if (!name.startsWith(".")) return true; return false; }); if (fileList!=null) for (File cf : fileList) processFiles.add(cf); } for (File pf : processFiles) { startTime = System.nanoTime(); SequenceSketchStreamer seqStreamer = getSequenceHashStreamer(pf.getAbsolutePath(), 0); String outputString = pf.getName(); int i = outputString.lastIndexOf('.'); if (i>0) outputString = outputString.substring(0, i); //combine with the directory name outputString = toDirectory.getPath()+File.separator+outputString+".dat"; //store the file to disk seqStreamer.writeToBinary(outputString, false, this.numThreads); System.err.println("Processed "+seqStreamer.getNumberProcessed()+" sequences (fwd and rev)."); System.err.println("Read, hashed, and stored file "+pf.getPath()+" to "+outputString+"."); System.err.println("Time (s): " + (System.nanoTime() - startTime)*1.0e-9); } System.err.println("Total time (s): " + (System.nanoTime() - startTotalTime)*1.0e-9); return; } System.err.println("Processing files for storage in reverse index..."); // read and index the kmers int seqNumberProcessed = 0; //create search object SequenceSketchStreamer seqStreamer = getSequenceHashStreamer(this.inFile, seqNumberProcessed); MinHashSearch hashSearch = getMatchSearch(seqStreamer); seqNumberProcessed += seqStreamer.getNumberProcessed()/2; System.err.println("Processed "+seqStreamer.getNumberProcessed()+" unique sequences (fwd and rev)."); System.err.println("Time (s) to read and hash from file: " + 
(System.nanoTime() - processTime)*1.0e-9); long startTotalScoringTime = System.nanoTime(); //System.err.println("Press Enter..."); //System.in.read(); // now that we have the hash constructed, go through all sequences to recompute their min and score their matches if (this.toFile==null || this.toFile.isEmpty()) { startTime = System.nanoTime(); hashSearch.findMatches(); System.err.println("Time (s) to score and output to self: " + (System.nanoTime() - startTime)*1.0e-9); } else { File file = new File(this.toFile); if (!file.exists()) throw new MhapRuntimeException("To-file does not exist."); ArrayList<File> toFiles = new ArrayList<>(); //if not directory just add the file if (!file.isDirectory()) { toFiles.add(file); } else { //read the directory content File[] fileList = file.listFiles(new FilenameFilter() { @Override public boolean accept(File dir, String name) { if (!name.startsWith(".")) return true; return false; } }); if (fileList!=null) for (File cf : fileList) toFiles.add(cf); } //sort the files in alphabetical order Collections.sort(toFiles); //first perform to self startTime = System.nanoTime(); if (!this.noSelf) { hashSearch.findMatches(); System.out.flush(); System.err.println("Time (s) to score and output to self: " + (System.nanoTime() - startTime)*1.0e-9); } //now do all files for (File cf : toFiles) { // read and index the kmers seqStreamer = getSequenceHashStreamer(cf.getAbsolutePath(), seqNumberProcessed); System.err.println("Opened fasta file "+cf.getCanonicalPath()+"."); //match the file startTime = System.nanoTime(); hashSearch.findMatches(seqStreamer); //flush to get the output System.out.flush(); seqNumberProcessed += seqStreamer.getNumberProcessed(); System.err.println("Processed "+seqStreamer.getNumberProcessed()+" to sequences."); System.err.println("Time (s) to score, hash to-file, and output: " + (System.nanoTime() - startTime)*1.0e-9); } } //flush output System.out.flush(); //output time System.err.println("Total scoring time (s): " + (System.nanoTime() - 
startTotalScoringTime)*1.0e-9); System.err.println("Total time (s): " + (System.nanoTime() - startTotalTime)*1.0e-9); //output final stats outputFinalStat(hashSearch); } public MinHashSearch getMatchSearch(SequenceSketchStreamer hashStreamer) throws IOException { return new MinHashSearch(hashStreamer, this.numHashes, this.numMinMatches, this.numThreads, false, this.minStoreLength, this.maxShift, this.acceptScore); } public SequenceSketchStreamer getSequenceHashStreamer(String file, int offset) throws IOException { SequenceSketchStreamer seqStreamer; if (file.endsWith(".dat")) seqStreamer = new SequenceSketchStreamer(file, this.minOlapLength, offset); else seqStreamer = new SequenceSketchStreamer(file, this.minOlapLength, this.kmerSize, this.numHashes, this.orderedKmerSize, this.orderedSketchSize, this.kmerFilter, this.repeatWeight, offset); return seqStreamer; } protected void outputFinalStat(MinHashSearch matchSearch) { System.err.println("MinHash search time (s): " + matchSearch.getMinHashSearchTime()); //System.err.println("Sort-merge search time (s): " + matchSearch.getSortMergeTime()); System.err.println("Total matches found: " + matchSearch.getMatchesProcessed()); System.err.println("Average number of matches per lookup: " + (double) matchSearch.getMatchesProcessed() / (double) matchSearch.getNumberSequencesSearched()); System.err.println("Average number of table elements processed per lookup: " + (double) matchSearch.getNumberElementsProcessed() / (double) (matchSearch.getNumberSequencesSearched())); System.err.println("Average number of table elements processed per match: " + (double) matchSearch.getNumberElementsProcessed() / (double) (matchSearch.getMatchesProcessed())); System.err.println("Average % of hashed sequences hit per lookup: " + (double) matchSearch.getNumberSequencesHit() / (double) (matchSearch.size() * matchSearch.getNumberSequencesSearched()) * 100.0); System.err.println("Average % of hashed sequences hit that are matches: " + (double) 
matchSearch.getMatchesProcessed() / (double) matchSearch.getNumberSequencesHit() * 100.0); System.err.println("Average % of hashed sequences fully compared that are matches: " + (double)matchSearch.getMatchesProcessed()/(double)matchSearch.getNumberSequencesFullyCompared()*100.0); System.err.flush(); } }MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/math/000077500000000000000000000000001277502137000212155ustar00rootroot00000000000000MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/math/BasicMath.java000066400000000000000000000434701277502137000237230ustar00rootroot00000000000000/* * ARMOR package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2012 by Konstantin Berlin * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.math; /** * The Class BasicMath. */ public final class BasicMath { /** The Constant PI. */ public static final double PI = Math.PI; /** The Constant TWOPI. 
*/ public static final double TWOPI = 2.0 * PI; public static double abs(double a) { return Math.abs(a); } public static double[] abs(double[] a) { double[] val = new double[a.length]; for (int iter = 0; iter < a.length; iter++) val[iter] = BasicMath.abs(a[iter]); return val; } /** * Acos. * * @param x * the x * @return the double */ public static double acos(double x) { return Math.acos(x); } /** * Adds the. * * @param a * the a * @param b * the b * @return the double[] */ public final static double[] add(final double[] a, final double b) { double[] val = new double[a.length]; for (int iter = 0; iter < a.length; iter++) val[iter] = a[iter] + b; return val; } /** * Adds the. * * @param a * the a * @param b * the b * @return the double[] */ public final static double[] add(final double[] a, final double[] b) { double[] val = new double[a.length]; for (int iter = 0; iter < a.length; iter++) val[iter] = a[iter] + b[iter]; return val; } /** * Angle. * * @param a * the a * @param b * the b * @return the double */ public final static double angle(final double[] a, final double[] b) { double angle = acos(normalizedDotProduct(a, b)); return angle; } public static double angleAbsolute(double[] a, double[] b) { return Math.min(Math.abs(angle(a, b)), Math.abs(angle(a, BasicMath.mult(b, -1.0)))); } /** * Asin. * * @param x * the x * @return the double */ public final static double asin(final double x) { return Math.asin(x); } public final static double[][] catColumns(final double[][] A, final double[][] B) { if (A.length != B.length) throw new MathRuntimeException("Number of rows must be equal in A and B."); double[][] C = new double[A.length][A[0].length + B[0].length]; for (int row = 0; row < C.length; row++) { for (int column = 0; column < A[row].length; column++) C[row][column] = A[row][column]; for (int column = 0; column < B[row].length; column++) C[row][A[row].length + column] = B[row][column]; } return C; } /** * Closest power of two. 
* * @param a * the a * @return the int */ public final static int closestPowerOfTwo(final int a) { int power = a == 0 ? 0 : 32 - Integer.numberOfLeadingZeros(a - 1); return 1 << power; } /** * Cos. * * @param angle * the angle * @return the double */ public final static double cos(final double angle) { return Math.cos(angle); } /** * Creates the identity matrix. * * @param m * the m * @param n * the n * @return the double[][] */ public final static double[][] createIdentityMatrix(final int m, final int n) { double[][] A = new double[m][n]; for (int iterRow = 0; iterRow < A.length; iterRow++) { for (int iterColumn = 0; iterColumn < A[iterRow].length; iterColumn++) { A[iterRow][iterColumn] = 0.0; if (iterRow == iterColumn) A[iterRow][iterColumn] = 1.0; } } return A; } /** * Cube. * * @param a * the a * @return the double */ public final static double cube(final double a) { return a * a * a; } public final static double det(final double[][] A) { if (A == null || A.length != 3 || A[0].length != 3) throw new MathRuntimeException("Currently can only compute determinant of 3x3 matrix."); double det = A[0][0] * (A[1][1] * A[2][2] - A[2][1] * A[1][2]) - A[1][0] * (A[0][1] * A[2][2] - A[2][1] * A[0][2]) + A[2][0] * (A[0][1] * A[1][2] - A[1][1] * A[0][2]); return det; } /** * Divide. * * @param a * the a * @param b * the b * @return the double[] */ public final static double[] divide(final double[] a, final double b) { double[] val = new double[a.length]; for (int iter = 0; iter < a.length; iter++) val[iter] = a[iter] / b; return val; } /** * Divide. * * @param a * the a * @param b * the b * @return the double[] */ public final static double[] divide(final double[] a, final double[] b) { double[] val = new double[a.length]; for (int iter = 0; iter < a.length; iter++) val[iter] = a[iter] / b[iter]; return val; } /** * Dot product. 
* * @param a * the a * @param b * the b * @return the double */ public final static double dotProduct(final double[] a, final double[] b) { if (a.length != b.length) throw new MathRuntimeException("Vector lengths must be equal."); double val = 0.0; for (int iter = 0; iter < a.length; iter++) val += a[iter] * b[iter]; return val; } /** * Euclidean distance. * * @param x1 * the x1 * @param y1 * the y1 * @param z1 * the z1 * @param x2 * the x2 * @param y2 * the y2 * @param z2 * the z2 * @return the double */ public final static double euclideanDistance(double x1, double y1, double z1, double x2, double y2, double z2) { return sqrt(euclideanDistanceSquared(x1, y1, z1, x2, y2, z2)); } /** * Euclidean distance squared. * * @param x1 * the x1 * @param x2 * the x2 * @return the double */ public final static double euclideanDistanceSquared(final double x1, final double x2) { double xdif = x2 - x1; return xdif * xdif; } /** * Euclidean distance squared. * * @param x1 * the x1 * @param y1 * the y1 * @param x2 * the x2 * @param y2 * the y2 * @return the double */ public final static double euclideanDistanceSquared(double x1, double y1, double x2, double y2) { double xdif = x2 - x1; double ydif = y2 - y1; return (xdif * xdif + ydif * ydif); } /** * Euclidean distance squared. * * @param x1 * the x1 * @param y1 * the y1 * @param z1 * the z1 * @param x2 * the x2 * @param y2 * the y2 * @param z2 * the z2 * @return the double */ public final static double euclideanDistanceSquared(double x1, double y1, double z1, double x2, double y2, double z2) { double xdif = x2 - x1; double ydif = y2 - y1; double zdif = z2 - z1; return (xdif * xdif + ydif * ydif + zdif * zdif); } public static boolean hasNaN(double[] x) { for (double val : x) if (Double.isNaN(val)) return true; return false; } /** * Checks if is identity matrix. 
* * @param A * the a * @return true, if is identity matrix */ public static boolean isIdentityMatrix(double[][] A) { if (A == null) return false; if (A.length != A[0].length) return false; for (int iterRow = 0; iterRow < A.length; iterRow++) for (int iterColumn = 0; iterColumn < A[iterRow].length; iterColumn++) { if (iterRow == iterColumn) { if (A[iterRow][iterColumn] != 1.0) return false; } else if (A[iterRow][iterColumn] != 0.0) return false; } return true; } public static boolean isNonNegative(double[] x) { for (double val : x) if (val < 0) return false; return true; } /* * public static long nearestPow2(long x) { double logX = * Math.log10(x)/Math.log10(2); if (Math.round(logX)<=logX) return x; else * return (int)Math.pow(2, Math.floor(logX+1)); } */ public final static double laplanceProbabilty(double x, double b) { return 1.0/(2.0*b)*Math.exp(-Math.abs(x)/b); } /** * Flattens a matrix into an array in row-major order. * * @param A * the matrix, assumed rectangular * @return the double[] of length rows * columns */ public static double[] matrixToArray(double A[][]) { double[] val = new double[A.length * A[0].length]; for (int iterRow = 0; iterRow < A.length; iterRow++) { for (int iterColumn = 0; iterColumn < A[iterRow].length; iterColumn++) val[iterRow * A[0].length + iterColumn] = A[iterRow][iterColumn]; } return val; } /** * Max. * * @param a * the a * @return the double */ public final static double max(final double[] a) { double val = a[0]; for (double elem : a) val = Math.max(val, elem); return val; } /** * Min. * * @param a * the a * @return the double */ public final static double min(final double[] a) { double val = a[0]; for (double elem : a) val = Math.min(val, elem); return val; } /** * Mult. * * @param a * the a * @param b * the b * @return the double[] */ public final static double[] mult(final double[] a, final double b) { double[] val = new double[a.length]; for (int iter = 0; iter < a.length; iter++) val[iter] = a[iter] * b; return val; } /** * Mult. 
* * @param a * the a * @param b * the b * @return the double[] */ public final static double[] mult(final double[] a, final double[] b) { if (a == null || b == null) throw new MathRuntimeException("Arrays cannot be null."); if (a.length != b.length) throw new MathRuntimeException("Arrays must be of equal length."); double[] val = new double[a.length]; for (int iter = 0; iter < a.length; iter++) val[iter] = a[iter] * b[iter]; return val; } /** * Mult. * * @param A * the a * @param b * the b * @return the double[][] */ public final static double[][] mult(final double[][] A, final double b) { double[][] X = new double[A.length][A[0].length]; for (int iterRow = 0; iterRow < A.length; iterRow++) { for (int iterColumn = 0; iterColumn < A[iterRow].length; iterColumn++) { X[iterRow][iterColumn] = A[iterRow][iterColumn] * b; } } return X; } public final static double[] mult(final double[][] A, final double[] b) { if (A == null || b == null) throw new java.lang.NullPointerException("Values cannot be null."); if (A[0].length != b.length) throw new MathRuntimeException("Matrix dimension [" + A.length + ", " + A[0].length + "] does not match vector length " + b.length + "."); double[] x = new double[A.length]; for (int iterRow = 0; iterRow < A.length; iterRow++) { x[iterRow] = 0.0; for (int iterColumn = 0; iterColumn < A[iterRow].length; iterColumn++) { x[iterRow] += A[iterRow][iterColumn] * b[iterColumn]; } } return x; } public final static double[][] mult(final double[][] A, final double[][] B) { if (A == null || B == null) throw new java.lang.NullPointerException("Matrices cannot be null."); if (A[0].length != B.length) throw new MathRuntimeException("Matrices' dimensions do not match."); double[][] C = new double[A.length][B[0].length]; for (int row = 0; row < A.length; row++) { for (int col = 0; col < B[0].length; col++) { C[row][col] = 0.0; for (int iter = 0; iter < B.length; iter++) C[row][col] += A[row][iter] * B[iter][col]; } } return C; } public final static double[] 
multTranspose(final double[][] A, final double[] x) { double[] value = new double[A[0].length]; for (int iterRow = 0; iterRow < A[0].length; iterRow++) { value[iterRow] = 0.0; for (int iterColumn = 0; iterColumn < A.length; iterColumn++) { value[iterRow] += A[iterColumn][iterRow] * x[iterColumn]; } } return value; } public final static double[][] multTranspose(double[][] A, double[][] B) { if (A == null || B == null) throw new java.lang.NullPointerException("Matrices cannot be null."); if (A.length != B.length) throw new MathRuntimeException("Matrices' dimensions do not match."); double[][] C = new double[A[0].length][B[0].length]; for (int colA = 0; colA < A[0].length; colA++) { for (int colB = 0; colB < B[0].length; colB++) { C[colA][colB] = 0.0; for (int iter = 0; iter < A.length; iter++) C[colA][colB] += A[iter][colA] * B[iter][colB]; } } return C; } /** * Nearest multiple. * * @param n * the n * @param base * the base * @return the int */ public final static int nearestMultiple(int n, int base) { int x = n / base; if (x * base == n) return n; else return x * base + base; } public final static int[] nonZeroIndicies(final double[] x, final double absTolerance) { // count the number of elements int size = 0; for (int iter = 0; iter < x.length; iter++) if (Math.abs(x[iter]) > absTolerance) size++; // record the elements int[] list = new int[size]; size = 0; for (int iter = 0; iter < x.length; iter++) if (Math.abs(x[iter]) > absTolerance) { list[size] = iter; size++; } return list; } public final static double[] nonZeroValues(final double[] x, final double absTolerance) { int[] list = nonZeroIndicies(x, absTolerance); double[] xnew = new double[list.length]; int count = 0; for (int index : list) { xnew[count] = x[index]; count++; } return xnew; } /* * public static double[] legendrePolynomial(int n, double x) throws * ArithmeticException { double P[] = new double[n + 1]; * * P[0] = 1.0; P[1] = x; * * for (int m = 1; m < n - 1; m++) { P[m + 1] = ((2.0 * m + 1.0) * x 
* P[m] * - m * P[m - 1]) / (m + 1.0); } * * return P; } */ /* * static public double[] normalizedlegendrePolynomial(int n, double x) * throws ArithmeticException { double[] P = legendrePolynomial(n, x); * * double norm = BasicMath.sqrt(n + .5); double sign = -1; * * for (int m = 0; m < P.length; m++) { P[m] = sign * P[m] / norm; sign = * -sign; } * * return P; } */ /** * Norm. * * @param a * the a * @return the double */ public static double norm(double[] a) { return BasicMath.sqrt(normSquared(a)); } public final static double normalizedDotProduct(final double[] a, final double[] b) { return dotProduct(a, b) / (norm(a) * norm(b)); } /** * Norm squared. * * @param a * the a * @return the double */ public final static double normSquared(final double[] a) { double r = 0.0; for (double elem : a) r += elem * elem; return r; } /** * Round to nearest. * * @param x * the x * @param n * the n * @return the double */ public final static double roundToNearest(final double x, final int n) { double shift = Math.pow(10, n); return Math.round(x * shift) / shift; } /** * Sin. * * @param angle * the angle * @return the double */ public final static double sin(final double angle) { return Math.sin(angle); } /** * Sinc. * * @param x * the x * @return the double */ public final static double sinc(final double x) { return (x == 0 || (x < 1.0e-8 && x > -1.0e-8)) ? 1.0 : sin(x) / x; } /** * Sqrt. * * @param a * the a * @return the double */ public final static double sqrt(final double a) { return Math.sqrt(a); } /** * Square. * * @param a * the a * @return the double */ public final static double square(final double a) { return a * a; } /** * Square. * * @param a * the a * @return the double[] */ public final static double[] square(final double[] a) { double[] val = new double[a.length]; for (int iter = 0; iter < a.length; iter++) val[iter] = a[iter] * a[iter]; return val; } /** * Subtract. 
* * @param a * the a * @param b * the b * @return the double[] */ public final static double[] subtract(final double[] a, final double b) { double[] val = new double[a.length]; for (int iter = 0; iter < a.length; iter++) val[iter] = a[iter] - b; return val; } /** * Subtract. * * @param a * the a * @param b * the b * @return the double[] */ public final static double[] subtract(final double[] a, final double[] b) { if (a.length != b.length) throw new MathRuntimeException("Vectors must be of same length."); double[] val = new double[a.length]; for (int iter = 0; iter < a.length; iter++) val[iter] = a[iter] - b[iter]; return val; } public final static double sum(final double[] a) { if (a == null) return 0.0; double sum = 0.0; for (double val : a) sum += val; return sum; } public final static double[][] transpose(double[][] A) { if (A == null) return null; double[][] At = new double[A[0].length][A.length]; for (int row = 0; row < A.length; row++) for (int col = 0; col < A[row].length; col++) At[col][row] = A[row][col]; return At; } }MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/math/MathRuntimeException.java000066400000000000000000000041261277502137000261770ustar00rootroot00000000000000/* * ARMOR package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2012 by Konstantin Berlin * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. 
* The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.math; /** * Unchecked exception thrown by the math utilities, for example when * vector or matrix dimensions do not match. */ public class MathRuntimeException extends RuntimeException { /** * */ private static final long serialVersionUID = 6939427297903213601L; /** * Instantiates a new math runtime exception with no detail message. */ public MathRuntimeException() { super(); } /** * Instantiates a new math runtime exception. * * @param arg0 * the detail message */ public MathRuntimeException(String arg0) { super(arg0); } /** * Instantiates a new math runtime exception. * * @param arg0 * the detail message * @param arg1 * the cause */ public MathRuntimeException(String arg0, Throwable arg1) { super(arg0, arg1); } /** * Instantiates a new math runtime exception. * * @param arg0 * the cause */ public MathRuntimeException(Throwable arg0) { super(arg0); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/sketch/000077500000000000000000000000001277502137000215455ustar00rootroot00000000000000MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/sketch/AbstractBitSketch.java000066400000000000000000000063231277502137000257600ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. 
* * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.sketch; public abstract class AbstractBitSketch> implements Sketch, Comparable { protected final long[] bits; /** * */ private static final long serialVersionUID = -3392030412388403092L; protected AbstractBitSketch(long[] bits) { this.bits = bits; } @Override public int compareTo(final T sim) { for (int bitIndex = 0; bitIndex < this.bits.length; bitIndex++) { if (this.bits[bitIndex] < sim.bits[bitIndex]) return -1; if (this.bits[bitIndex] > sim.bits[bitIndex]) return 1; } return 0; } public final boolean getBit(long index) { int arrayIndex = (int)(index/64L); int bitPos = (int)(index%64L); long mask = 0b1L<= 0; bit--) { if ((this.bits[longIndex] & mask) == 0) s.append("0"); else s.append("1"); mask = mask >>> 1; } } return s.toString(); } }MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/sketch/BitVectorIndex.java000066400000000000000000000127451277502137000253120ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. 
The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
* */ package edu.umd.marbl.mhap.sketch; import java.util.ArrayList; import java.util.Collections; import java.util.HashMap; import java.util.HashSet; import java.util.List; import java.util.Map; import edu.umd.marbl.mhap.utils.MersenneTwisterFast; import edu.umd.marbl.mhap.utils.Pair; import edu.umd.marbl.mhap.utils.SortablePair; public final class BitVectorIndex> { private final long bitsUsed[][]; private final ArrayList>>> hashList; private final HashMap indexedWords; private final double minSimilarity; public BitVectorIndex(List> valuePairs, double minSimilarity, double confidence) { this.minSimilarity = minSimilarity; //should go off the valuePairs list int b = 10; //probability of a hit in numIndexes when using b: confidence = 1-(1-minSimilarity^b)^(numIndexes) //solve for b, Step 1: root_numIndexes (1-confidence) = (1-minSimilarity^b) // Step 2: b = log(1-root_numIndexes (1-confidence))/log(minSimilarity) //figure out b int numIndexes = (int)Math.ceil(Math.log(1.0-confidence)/Math.log(1.0-Math.pow(this.minSimilarity, (double)b))); //allocate the memory this.bitsUsed = new long[numIndexes][b]; //now generate random permuations MersenneTwisterFast rand = new MersenneTwisterFast(); //get number of bits long numBits = 1; if (!valuePairs.isEmpty()) numBits = valuePairs.get(0).y.numberOfBits(); //generate the bits for (int index=0; index(numIndexes); for (int iter=0; iter(valuePairs.size())); this.indexedWords = new HashMap<>(valuePairs.size()); //encode all data in parallel valuePairs.parallelStream().forEach(pair-> { //get the lookup positions int[] lookupPositions = lookupPositions(pair.y); int count = 0; for(HashMap>> map : this.hashList) { //get the array list ArrayList> list; synchronized (map) { list = map.computeIfAbsent(lookupPositions[count], key-> new ArrayList<>(1)); } //add the pair to the index synchronized(list) { list.add(pair); } count++; } //add the word synchronized (this.indexedWords) { this.indexedWords.put(pair.x, pair.y); } }); } public int 
getBitsPerHash() { return bitsUsed[0].length; } public Map getIndexedItems() { return Collections.unmodifiableMap(this.indexedWords); } public List> getNeighbors(B sketch, double minSimilarity) { if (minSimilarity> set = new HashSet<>(); int count = 0; for(HashMap>> map : this.hashList) { ArrayList> list = map.get(lookupPositions[count]); if (list==null) continue; //add all the elements set.addAll(list); count++; } ArrayList> returnList = new ArrayList>(); //now do direct compare for (Pair pair : set) { double score = pair.y.similarity(sketch); if (score>=minSimilarity) returnList.add(new SortablePair<>(score, pair.x)); } return returnList; } public int getNumberOfIndexes() { return this.hashList.size(); } public B getSketch(T word) { return this.indexedWords.get(word); } public boolean isEmpty() { return indexedWords.isEmpty(); } private int[] lookupPositions(B bits) { int numIndexes = hashList.size(); int[] returnValues = new int[numIndexes]; for (int index=0; index absMaxShiftInOverlap) continue; // get the edges if (pos1 < leftEdge1) leftEdge1 = pos1; if (pos2 < leftEdge2) leftEdge2 = pos2; if (pos1 > rightEdge1) rightEdge1 = pos1; if (pos2 > rightEdge2) rightEdge2 = pos2; validCount++; } if (validCount < 3) return null; // get edge info uniformly minimum variance unbiased (UMVU) estimators // a = (n*a-b)/(n-1) // b = (n*b-a)/(n-1) int a1 = Math.max(0, (int) Math.round((double)(validCount * leftEdge1 - rightEdge1) / (double) (validCount - 1))); int a2 = Math.min(this.seqLength1, (int) Math.round((double)(validCount * rightEdge1 - leftEdge1) / (double) (validCount - 1))); int b1 = Math.max(0, (int) Math.round((double)(validCount * leftEdge2 - rightEdge2) / (double) (validCount - 1))); int b2 = Math.min(this.seqLength2, (int) Math.round((double)(validCount * rightEdge2 - leftEdge2) / (double) (validCount - 1))); return new EdgeData(a1, a2, b1, b2, validCount); } public int getAbsMaxShift() { performUpdate(); return this.absMaxShiftInOverlap; } public int 
getMedianShift() { performUpdate(); return this.medianShift; } public boolean isEmpty() { return this.count<=0; } public void optimizeShifts() { if (isEmpty()) return; int reducedCount = -1; // copy over only the best values int medianShift = getMedianShift(); for (int iter = 0; iter < this.count; iter++) { if (reducedCount >= 0 && pos1Index[reducedCount] == pos1Index[iter]) { // if better, record it if (Math.abs(posShift[reducedCount] - medianShift) > Math.abs(posShift[iter] - medianShift)) { pos1Index[reducedCount] = pos1Index[iter]; pos2Index[reducedCount] = pos2Index[iter]; posShift[reducedCount] = posShift[iter]; } } else { // add the new data reducedCount++; pos1Index[reducedCount] = pos1Index[iter]; pos2Index[reducedCount] = pos2Index[iter]; posShift[reducedCount] = posShift[iter]; } } this.count = reducedCount + 1; this.needRecompute = true; } private void performUpdate() { if (this.needRecompute) { if (this.count>0) { this.medianShift = Utils.quickSelect(Arrays.copyOf(this.posShift, this.count), this.count / 2, this.count); // get the actual overlap size int leftPosition = Math.max(0, -this.medianShift); int rightPosition = Math.min(this.seqLength1, this.seqLength2 - this.medianShift); int overlapSize = Math.max(10, rightPosition - leftPosition); // compute the max possible allowed shift in kmers this.absMaxShiftInOverlap = Math.min(Math.max(this.seqLength1, this.seqLength2), (int) ((double) overlapSize * maxShiftPercent)); } else { this.medianShift = 0; this.absMaxShiftInOverlap = Math.max(this.seqLength1, this.seqLength2); } } this.needRecompute = false; } public void recordMatch(int pos1, int pos2, int shift) { // adjust array size if needed if (posShift.length <= this.count) { this.posShift = Arrays.copyOf(this.posShift, this.posShift.length * 2); this.pos1Index = Arrays.copyOf(this.pos1Index, this.pos1Index.length * 2); this.pos2Index = Arrays.copyOf(this.pos2Index, this.pos2Index.length * 2); } posShift[this.count] = shift; pos1Index[this.count] = 
pos1; pos2Index[this.count] = pos2; this.count++; this.needRecompute = true; } public void reset() { this.count = 0; this.needRecompute = true; } public int size() { return this.count; } public int valid1Lower() { performUpdate(); int valid = Math.max(0, -getMedianShift() - getAbsMaxShift()); return valid; } public int valid1Upper() { performUpdate(); int valid = Math.min(this.seqLength1, this.seqLength2 - getMedianShift() + getAbsMaxShift()); return valid; } public int valid2Lower() { performUpdate(); int valid = Math.max(0, getMedianShift() - getAbsMaxShift()); return valid; } public int valid2Upper() { performUpdate(); int valid = Math.min(this.seqLength2, this.seqLength1 + getMedianShift() + getAbsMaxShift()); return valid; } } private final int kmerSize; private final int[][] orderedHashes; private final int seqLength; private static double computeKBottomSketchJaccard(int[][] seq1Hashes, int[][] seq2Hashes, int medianShift, int absMaxShiftInOverlap, int a1, int a2, int b1, int b2) { //get k for first string int s1 = 0; int[][] array1 = new int[seq1Hashes.length][]; for (int i=0; i= a1 && pos <= a2) { array1[s1] = seq1Hashes[i]; s1++; } } //get k for second string int s2 = 0; int[][] array2 = new int[seq2Hashes.length][]; for (int j=0; j= b1 && pos <= b2) { array2[s2] = seq2Hashes[j]; s2++; } } //compute k int k = Math.min(s1,s2); //empty has jaccard of 1 if (k==0) return 0; //perform the k-bottom count int i = 0; int j = 0; int intersectCount = 0; int unionCount = 0; while (unionCountarray2[j][0]) j++; else { intersectCount++; i++; j++; } unionCount++; } double score = ((double)intersectCount)/(double)k; return score; } public final static BottomOverlapSketch fromByteStream(DataInputStream input) throws IOException { try { int seqLength = input.readInt(); int kmerSize = input.readInt(); int hashLength = input.readInt(); int[][] orderedHashes = new int[hashLength][2]; for (int iter = 0; iter < hashLength; iter++) { orderedHashes[iter][0] = input.readInt(); 
orderedHashes[iter][1] = input.readInt(); } return new BottomOverlapSketch(seqLength, kmerSize, orderedHashes); } catch (EOFException e) { return null; } } public static double jaccardToIdentity(double score, int kmerSize) { double d = -1.0/(double)kmerSize*Math.log(2.0*score/(1.0+score)); return Math.exp(-d); } private static void recordMatchingKmers( MatchData matchData, int[][] seq1KmerHashes, int[][] seq2KmerHashes, int repeat) { // init the loop storage int hash1; int hash2; int pos1; int pos2; // init the borders int medianShift = matchData.getMedianShift(); int absMaxShift = matchData.getAbsMaxShift(); int valid1Lower = matchData.valid1Lower(); int valid2Lower = matchData.valid2Lower(); int valid1Upper = matchData.valid1Upper(); int valid2Upper = matchData.valid2Upper(); // init counters int i1 = 0; int i2 = 0; //reset the data, redo the shifts matchData.reset(); // perform merge operation to get the shift and the kmer count while (true) { if (i1>=seq1KmerHashes.length) break; if (i2>=seq2KmerHashes.length) break; // get the values in the array hash1 = seq1KmerHashes[i1][0]; pos1 = seq1KmerHashes[i1][1]; hash2 = seq2KmerHashes[i2][0]; pos2 = seq2KmerHashes[i2][1]; if (hash1 < hash2 || pos1 < valid1Lower || pos1 >= valid1Upper) i1++; else if (hash2 < hash1 || pos2 < valid2Lower || pos2 >= valid2Upper) i2++; else { // check if current shift makes sense positionally int currShift = pos2 - pos1; int diffFromExpected = currShift - medianShift; if (diffFromExpected > absMaxShift) i1++; else if (diffFromExpected < -absMaxShift) i2++; else { //record match matchData.recordMatch(pos1, pos2, currShift); // don't rely on repeats in the first iteration if (repeat == 0) i1++; i2++; } } } } private BottomOverlapSketch(int seqLength, int kmerSize, int[][] orderedHashes) { this.seqLength = seqLength; this.orderedHashes = orderedHashes; this.kmerSize = kmerSize; } public BottomOverlapSketch(String seq, int kmerSize, int sketchSize) throws ZeroNGramsFoundException { 
this.kmerSize = kmerSize; this.seqLength = seq.length() - kmerSize + 1; if (this.seqLength<=0) throw new ZeroNGramsFoundException("Sequence length must be greater or equal to n-gram size "+kmerSize+".", seq); // compute just direct hash of sequence int[] hashes = HashUtils.computeSequenceHashes(seq, kmerSize); int[] perm = new int[hashes.length]; //init the array for (int iter = 0; iter < hashes.length; iter++) perm[iter] = iter; //sort the array IntArrays.radixSortIndirect(perm, hashes, true); //sketchSize = (int)Math.round(0.25*(double)this.seqLength); //find the largest storage value int k = Math.min(sketchSize, hashes.length); //allocate the memory this.orderedHashes = new int[k][2]; for (int iter = 0; iter < this.orderedHashes.length; iter++) { int index = perm[iter]; this.orderedHashes[iter][0] = hashes[index]; this.orderedHashes[iter][1] = index; } } public byte[] getAsByteArray() { ByteArrayOutputStream bos = new ByteArrayOutputStream(size() * 2); DataOutputStream dos = new DataOutputStream(bos); try { dos.writeInt(this.seqLength); dos.writeInt(this.kmerSize); dos.writeInt(size()); for (int iter = 0; iter < this.orderedHashes.length; iter++) { dos.writeInt(this.orderedHashes[iter][0]); dos.writeInt(this.orderedHashes[iter][1]); } dos.flush(); return bos.toByteArray(); } catch (IOException e) { throw new SketchRuntimeException("Unexpected IO error.", e); } } public int getHash(int index) { return this.orderedHashes[index][0]; } public OverlapInfo getOverlapInfo(BottomOverlapSketch toSequence, double maxShiftPercent) { if (this.kmerSize!=toSequence.kmerSize) throw new SketchRuntimeException("Sketch k-mer size does not match between the two sequences."); //allocate the memory for the search MatchData matchData = new MatchData(this, toSequence, maxShiftPercent); //get the initial matches recordMatchingKmers(matchData, this.orderedHashes, toSequence.orderedHashes, 0); if (matchData.isEmpty()) return OverlapInfo.EMPTY; //get matches again, but now in a better 
region recordMatchingKmers(matchData, this.orderedHashes, toSequence.orderedHashes, 1); if (matchData.isEmpty()) return OverlapInfo.EMPTY; matchData.optimizeShifts(); if (matchData.isEmpty()) return OverlapInfo.EMPTY; //get the edge data EdgeData edgeData = matchData.computeEdges(); if (edgeData==null) return OverlapInfo.EMPTY; //compute the jaccard score using bottom-k sketching double score = computeKBottomSketchJaccard(this.orderedHashes, toSequence.orderedHashes, matchData.getMedianShift(), matchData.getAbsMaxShift(), edgeData.a1, edgeData.a2, edgeData.b1, edgeData.b2); score = jaccardToIdentity(score, this.kmerSize); double rawScore = (double)edgeData.count; return new OverlapInfo(score, rawScore, edgeData.a1, edgeData.a2, edgeData.b1, edgeData.b2); } public int getSequenceLength() { return this.seqLength; } public int size() { return this.orderedHashes.length; } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/sketch/BottomSketch.java000066400000000000000000000025231277502137000250200ustar00rootroot00000000000000package edu.umd.marbl.mhap.sketch; import it.unimi.dsi.fastutil.ints.IntArrays; public class BottomSketch implements Sketch { private final int[] hashPositions; /** * */ private static final long serialVersionUID = 9035607728472270206L; public BottomSketch(String str, int nGramSize, int k) { int[] hashes = HashUtils.computeSequenceHashes(str, nGramSize); k = Math.min(k, hashes.length); int[] perm = new int[hashes.length]; for (int iter=0; itersh.hashPositions[j]) j++; else { intersectCount++; i++; j++; } unionCount++; } return ((double)intersectCount)/(double)k; } @Override public double similarity(BottomSketch sh) { return jaccard(sh); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/sketch/ClassicCounter.java000066400000000000000000000050101277502137000253250ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. 
The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
* */ package edu.umd.marbl.mhap.sketch; import java.util.HashMap; import java.util.concurrent.atomic.AtomicLong; import java.util.concurrent.atomic.LongAdder; public final class ClassicCounter implements Counter<Object> { private final HashMap<Object, LongAdder> map; private final LongAdder numAdditions; private final AtomicLong maxCount; public ClassicCounter(int size) { this.map = new HashMap<>(size); this.maxCount = new AtomicLong(); this.numAdditions = new LongAdder(); } @Override public long getCount(Object obj) { LongAdder adder = map.get(obj); if (adder==null) return 0; return adder.longValue(); } @Override public void add(Object obj) { add(obj, 1); } @Override public long maxCount() { return this.maxCount.longValue(); } @Override public void add(Object obj, long count) { LongAdder adder = null; synchronized (this.map) { adder = this.map.get(obj); if (adder==null) { adder = new LongAdder(); this.map.put(obj, adder); } } adder.add(count); /* track the largest accumulated count of any object; assumes counts only increase */ this.maxCount.accumulateAndGet(adder.sum(), Math::max); this.numAdditions.add(count); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/sketch/CosineDistanceSketch.java000066400000000000000000000042271277502137000264520ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. 
* The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.sketch; import edu.umd.marbl.mhap.math.BasicMath; public final class CosineDistanceSketch extends AbstractBitSketch { /** * */ private static final long serialVersionUID = -6501138603779963996L; private static long[] getCuts(double[] vector, int numWords, int seed) { long[] bitVector = new long[numWords]; for (int word=0; word0.0) currBitLong = currBitLong | mask; mask = mask<<1; } bitVector[word] = currBitLong; } return bitVector; } public CosineDistanceSketch(double[] vector, int numWords, int seed) { super(getCuts(vector,numWords,seed)); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/sketch/CountMin.java000066400000000000000000000067521277502137000241560ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. 
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */
package edu.umd.marbl.mhap.sketch;

import java.util.concurrent.atomic.LongAdder;

public final class CountMin<T> implements Counter<T>
{
	private final LongAdder[][] countTable;
	private final int depth;
	private final int seed;
	private final LongAdder totalAdded;
	private final int width;

	public CountMin(double eps, double confidence, int seed)
	{
		// 2/w = eps ; w = 2/eps
		// 1/2^depth <= 1-confidence ; depth >= -log2 (1-confidence)

		// estimate the table size
		// this.width = (int) Math.ceil((double)2 / eps);
		// this.depth = (int) Math.ceil(-Math.log(1.0 - confidence) /
		// Math.log(2));
		// this.seed = seed;
		this((int) Math.ceil(-Math.log(1.0 - confidence) / Math.log(2)), (int) Math.ceil((double) 2 / eps), seed);
	}

	public CountMin(int depth, int width, int seed)
	{
		this.depth = depth;
		this.width = width;
		this.seed = seed;
		this.countTable = new LongAdder[depth][width];
		this.totalAdded = new LongAdder();

		// zero all the elements
		for (int iter1 = 0; iter1 < depth; iter1++)
			for (int iter2 = 0; iter2 < width; iter2++)
				this.countTable[iter1][iter2] = new LongAdder();
	}

	@Override
	public void add(T obj)
	{
		add(obj, 1);
	}

	@Override
	public void add(T obj, long increment)
	{
		if (increment <= 0)
			throw new SketchRuntimeException("Positive value expected for increment.");

		// compute the hash
		int[] hashes = HashUtils.computeHashesInt(obj, depth, seed);

		for (int iter = 0; iter < depth; iter++)
		{
			// clear the sign bit so the column index is non-negative
			this.countTable[iter][((hashes[iter] << 1) >>> 1) % width].add(increment);
		}

		//store the total
		this.totalAdded.add(increment);
	}

	@Override
	public long getCount(Object obj)
	{
		// compute the hash
		int[] hashes = HashUtils.computeHashesInt(obj, depth, seed);

		// the estimate is the minimum over all rows
		long mincount = Long.MAX_VALUE;
		for (int iter = 0; iter < depth; iter++)
		{
			long value = this.countTable[iter][((hashes[iter] << 1) >>> 1) % width].longValue();
			if (mincount > value)
				mincount = value;
		}

		return mincount;
	}

	public int getDepth()
	{
		return this.depth;
	}

	public int getWidth()
	{
		return this.width;
	}

	@Override
	public long maxCount()
	{
		throw new SketchRuntimeException("Method not implemented.");
	}

	public long totalAdded()
	{
		return this.totalAdded.longValue();
	}
}
MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/sketch/Counter.java000066400000000000000000000027171277502137000240360ustar00rootroot00000000000000
/*
 * MHAP package
 *
 * This software is distributed "as is", without any warranty, including
 * any implied warranty of merchantability or fitness for a particular
 * use. The authors assume no responsibility for, and shall not be liable
 * for, any special, indirect, or consequential damages, or any damages
 * whatsoever, arising out of or in connection with the use of this
 * software.
 *
 * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */
package edu.umd.marbl.mhap.sketch;

public interface Counter<T> extends Filter
{
	public long getCount(T obj);

	public void add(T obj);

	public long maxCount();

	public void add(T obj, long count);
}
MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/sketch/Filter.java000066400000000000000000000024671277502137000236460ustar00rootroot00000000000000
/*
 * MHAP package
 *
 * This software is distributed "as is", without any warranty, including
 * any implied warranty of merchantability or fitness for a particular
 * use. The authors assume no responsibility for, and shall not be liable
 * for, any special, indirect, or consequential damages, or any damages
 * whatsoever, arising out of or in connection with the use of this
 * software.
 *
 * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
* */ package edu.umd.marbl.mhap.sketch; public interface Filter { } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/sketch/FrequencyCounts.java000066400000000000000000000172551277502137000255570ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
 *
 */
package edu.umd.marbl.mhap.sketch;

import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

import com.google.common.hash.BloomFilter;

import edu.umd.marbl.mhap.impl.MhapRuntimeException;
import it.unimi.dsi.fastutil.ints.IntOpenHashSet;
import it.unimi.dsi.fastutil.longs.Long2DoubleOpenHashMap;

public final class FrequencyCounts
{
	private final double filterCutoff;
	private final Map<Long, Double> fractionCounts;
	private final Set<Integer> kmerSizes;
	private final double maxIdfValue;
	private final double maxValue;
	private final double minIdfValue;
	private final double minValue;
	private final boolean noTf;
	private final double offset;
	private final int removeUnique;
	private final BloomFilter<Long> validMers;

	public static final double REPEAT_SCALE = 3.0;

	public FrequencyCounts(BufferedReader bf, double filterCutoff, double offset, int removeUnique, boolean noTf, int numThreads) throws IOException
	{
		//removeUnique = 0: do nothing extra to k-mers not specified in the file
		//removeUnique = 1: remove k-mers not specified in the file from the sketch
		//removeUnique = 2: suppress k-mers not specified in the file the same as max suppression

		if (removeUnique<0 || removeUnique>2)
			throw new MhapRuntimeException("Unknown removeUnique option "+removeUnique+".");

		if (offset<0.0 || offset>=1.0)
			throw new MhapRuntimeException("Offset can only be between 0 and 1.0.");

		this.kmerSizes = new IntOpenHashSet();
		this.removeUnique = removeUnique;
		this.noTf = noTf;

		// generate hashset
		Long2DoubleOpenHashMap validMap = new Long2DoubleOpenHashMap();
		BloomFilter<Long> validMers;

		//the max value observed in the list
		AtomicReference<Double> maxValue = new AtomicReference<Double>(Double.NEGATIVE_INFINITY);

		//read in the first line to generate the bloom filter
		String line = bf.readLine();
		try
		{
			long size;
			if (line==null)
			{
				System.err.println("Warning, k-mer filter file is empty. Assuming zero entries.");
				size = 1L;
			}
			else
			{
				size = Long.parseLong(line);
				if (size<0L)
					throw new MhapRuntimeException("K-mer filter file size line must have positive long value.");
				else if (size==0L)
				{
					System.err.println("Warning, k-mer filter file has zero elements.");
					size = 1L;
				}
			}

			//if doing nothing extra, no need to store the whole list
			if (removeUnique>0)
				validMers = BloomFilter.create((value, sink) -> sink.putLong(value), size, 1.0e-5);
			else
				validMers = null;
		}
		catch (Exception e)
		{
			throw new MhapRuntimeException("K-mer filter file first line must contain estimated number of k-mers in the file (long).");
		}

		final ThreadPoolExecutor executor = new ThreadPoolExecutor(numThreads, numThreads, 100L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>(10000), new ThreadPoolExecutor.CallerRunsPolicy());

		line = bf.readLine();
		while (line != null)
		{
			String currLine = line;
			executor.submit(() ->
			{
				try
				{
					String[] str = currLine.split("\\s+", 3);

					if (str.length < 1)
						throw new MhapRuntimeException("K-mer filter file must have at least one column [k-mer]. Line="+currLine);

					//store the kmer sizes in the list
					synchronized (this.kmerSizes)
					{
						this.kmerSizes.add(str[0].length());
					}

					long[] hash = HashUtils.computeSequenceHashesLong(str[0], str[0].length(), 0);

					if (str.length >= 2)
					{
						double percent = Double.parseDouble(str[1]);

						// if greater, add to hashset
						if (percent >= filterCutoff)
						{
							maxValue.getAndUpdate(v -> Math.max(v, percent)); //store the max percent

							synchronized (validMap)
							{
								validMap.put(hash[0], percent);
							}
						}
					}

					//store in the bloom filter
					if (removeUnique>0)
						synchronized (validMers)
						{
							validMers.put(hash[0]);
						}
				}
				catch (Exception e)
				{
					System.err.println(e);
				}
			});

			// read the next line
			line = bf.readLine();
		}

		executor.shutdown();
		try
		{
			executor.awaitTermination(5L, TimeUnit.DAYS);
		}
		catch (InterruptedException e)
		{
			executor.shutdownNow();
			throw new RuntimeException("Unable to finish all tasks.");
		}

		//trim the hashtable to the right size
		validMap.trim();

		this.validMers = validMers;
		this.fractionCounts = validMap;
		this.filterCutoff = filterCutoff;
		this.offset = offset;
		this.maxValue = maxValue.get();
		this.minValue = this.filterCutoff;
		this.minIdfValue = idf(this.maxValue);
		this.maxIdfValue = idf(this.minValue);
	}

	public double documentFrequencyRatio(long hash)
	{
		Double val = this.fractionCounts.get(hash);
		if (val == null)
			val = this.minValue;

		return val;
	}

	public double getFilterCutoff()
	{
		return this.filterCutoff;
	}

	public List<Integer> getKmerSizes()
	{
		return new ArrayList<>(this.kmerSizes);
	}

	public double idf(double freq)
	{
		return Math.log(this.maxValue/freq-this.offset);
		//return Math.log1p(this.maxValue/freq);
	}

	public double idf(long hash)
	{
		double freq = documentFrequencyRatio(hash);
		return idf(freq);
	}

	public double inverseDocumentFrequency(long hash)
	{
		return 1.0/documentFrequencyRatio(hash);
	}

	public boolean isPopular(long hash)
	{
		return this.fractionCounts.containsKey(hash);
	}

	public boolean keepKmer(long hash)
	{
		if (this.removeUnique==1)
			return this.validMers.mightContain(hash);

		return true;
	}

	public double maxIdf()
	{
		return this.maxIdfValue;
	}

	public double minIdf()
	{
		return this.minIdfValue;
	}

	public double scaledIdf(long hash)
	{
		return scaledIdf(hash, REPEAT_SCALE);
	}

	public double scaledIdf(long hash, double maxValue)
	{
		if (this.removeUnique==2 && this.validMers!=null && !this.validMers.mightContain(hash))
			return 1.0;

		Double val = this.fractionCounts.get(hash);
		if (val == null)
			return maxValue;

		//get the true value
		double idf = idf(val);

		//scale it to match max
		double scale = (maxIdf()-minIdf())/(double)(maxValue-1.0);

		return 1.0+(idf-minIdf())/scale;
	}

	public double tfWeight(int weight)
	{
		if (this.noTf)
			return 1.0;

		return (double)weight;
	}
}
MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/sketch/HashUtils.java000066400000000000000000000173721277502137000243260ustar00rootroot00000000000000
/*
 * MHAP package
 *
 * This software is distributed "as is", without any warranty, including
 * any implied warranty of merchantability or fitness for a particular
 * use. The authors assume no responsibility for, and shall not be liable
 * for, any special, indirect, or consequential damages, or any damages
 * whatsoever, arising out of or in connection with the use of this
 * software.
 *
 * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.sketch; import java.nio.ByteBuffer; import com.google.common.hash.HashCode; import com.google.common.hash.HashFunction; import com.google.common.hash.Hasher; import com.google.common.hash.Hashing; import edu.umd.marbl.mhap.math.BasicMath; import edu.umd.marbl.mhap.utils.MersenneTwisterFast; public class HashUtils { public static long[] computeHashes(String item, int numWords, int seed) { long[] hashes = new long[numWords]; for (int word = 0; word < numWords; word += 2) { HashFunction hashFunc = Hashing.murmur3_128(seed + word); Hasher hasher = hashFunc.newHasher(); hasher.putUnencodedChars(item); // get the two longs out HashCode hc = hasher.hash(); ByteBuffer bb = ByteBuffer.wrap(hc.asBytes()); hashes[word] = bb.getLong(0); if (word + 1 < numWords) hashes[word + 1] = bb.getLong(8); } return hashes; } public final static int[] computeHashesInt(Object obj, int numWords, int seed) { if (obj instanceof Integer) return computeHashesIntInt((Integer) obj, numWords, seed); if (obj instanceof Long) return computeHashesIntLong((Long) obj, numWords, seed); if (obj instanceof Double) return computeHashesIntDouble((Double) obj, numWords, seed); if (obj instanceof Float) return computeHashesIntFloat((Float) obj, numWords, seed); if (obj instanceof String) return computeHashesIntString((String) obj, numWords, seed); throw new SketchRuntimeException("Cannot hash class type " + obj.getClass().getCanonicalName()); } public final static int[] computeHashesIntDouble(double obj, int numWords, int seed) { int[] hashes = new int[numWords]; HashFunction hf = Hashing.murmur3_32(seed); for (int iter = 0; iter < numWords; iter++) { HashCode hc = hf.newHasher().putDouble(obj).putInt(iter).hash(); hashes[iter] = hc.asInt(); } return hashes; } public final static int[] computeHashesIntFloat(float obj, int numWords, int seed) { int[] hashes = new int[numWords]; 
HashFunction hf = Hashing.murmur3_32(seed); for (int iter = 0; iter < numWords; iter++) { HashCode hc = hf.newHasher().putFloat(obj).putInt(iter).hash(); hashes[iter] = hc.asInt(); } return hashes; } public final static int[] computeHashesIntInt(int obj, int numWords, int seed) { int[] hashes = new int[numWords]; HashFunction hf = Hashing.murmur3_32(seed); for (int iter = 0; iter < numWords; iter++) { HashCode hc = hf.newHasher().putInt(obj).putInt(iter).hash(); hashes[iter] = hc.asInt(); } return hashes; } public final static int[] computeHashesIntLong(long obj, int numWords, int seed) { int[] hashes = new int[numWords]; HashFunction hf = Hashing.murmur3_32(seed); for (int iter = 0; iter < numWords; iter++) { HashCode hc = hf.newHasher().putLong(obj).putInt(iter).hash(); hashes[iter] = hc.asInt(); } return hashes; } public final static int[] computeHashesIntString(String obj, int numWords, int seed) { int[] hashes = new int[numWords]; HashFunction hf = Hashing.murmur3_32(seed); for (int iter = 0; iter < numWords; iter++) { HashCode hc = hf.newHasher().putUnencodedChars(obj).putInt(iter).hash(); hashes[iter] = hc.asInt(); } return hashes; } public final static long[][] computeNGramHashes(final String seq, final int nGramSize, final int numWords, final int seed) { final int numberNGrams = seq.length()-nGramSize+1; if (numberNGrams < 1) throw new SketchRuntimeException("N-gram size bigger than string length."); // get the rabin hashes final long[] rabinHashes = computeSequenceHashesLong(seq, nGramSize, seed); final long[][] hashes = new long[rabinHashes.length][numWords]; // Random rand = new Random(0); for (int iter = 0; iter < rabinHashes.length; iter++) { // rand.setSeed(rabinHashes[iter]); long x = rabinHashes[iter]; for (int word = 0; word < numWords; word++) { // hashes[iter][word] = rand.nextLong(); // XORShift Random Number Generators x ^= (x << 21); x ^= (x >>> 35); x ^= (x << 4); hashes[iter][word] = x; } } return hashes; } public final static long[][] 
computeNGramHashesExact(final String seq, final int nGramSize, final int numWords, final int seed) { HashFunction hf = Hashing.murmur3_128(seed); long[][] hashes = new long[seq.length() - nGramSize + 1][numWords]; for (int iter = 0; iter < hashes.length; iter++) { String subStr = seq.substring(iter, iter + nGramSize); for (int word=0; word { /** * */ private static final long serialVersionUID = -44448450811302477L; private final static long[] getAsBits(int[] minHashes) { int numWords = minHashes.length/64; //now convert them to bits long[] bits = new long[numWords]; //take only the last bit long mask = 0b1; int bitCount = 0; int wordCount = 0; for (int word = 0; word { private final int[] minHashes; /** * */ private static final long serialVersionUID = 8846482698636860862L; private final static int[] computeNgramMinHashesWeighted(String seq, final int nGramSize, final int numHashes, FrequencyCounts kmerFilter, double repeatWeight) throws ZeroNGramsFoundException { final int numberNGrams = seq.length() - nGramSize + 1; if (numberNGrams < 1) throw new ZeroNGramsFoundException("N-gram size bigger than string length.", seq); //if (repeatWeight>=1.0) // throw new SketchRuntimeException("repeatWeight cannot be >=1."); // get the kmer hashes final long[] kmerHashes = HashUtils.computeSequenceHashesLong(seq, nGramSize, 0); //now compute the counts of occurance Long2ObjectLinkedOpenHashMap hitMap = new Long2ObjectLinkedOpenHashMap(kmerHashes.length); for (long kmer : kmerHashes) { //do not add unique kmers to the sketch if (kmerFilter!=null && !kmerFilter.keepKmer(kmer)) continue; HitCounter counter = hitMap.get(kmer); if (counter==null) { counter = new HitCounter(1); hitMap.put(kmer, counter); } else counter.addHit(); } //make sure don't create a zero value if (hitMap.isEmpty()) throw new ZeroNGramsFoundException("Found zero unfiltered n-grams in the string.", seq); //allocate the space int[] hashes = new int[Math.max(1,numHashes)]; long[] best = new long[numHashes]; 
Arrays.fill(best, Long.MAX_VALUE); //go through all the k-mers and find the min values int numberValid = 0; for (Entry kmer : hitMap.entrySet()) { long key = kmer.getKey(); int weight = kmer.getValue().count; //original version of MHAP if (repeatWeight<0.0) { weight = 1; if (kmerFilter!=null && kmerFilter.isPopular(key)) weight = 0; } else if (kmerFilter!=null) { if (repeatWeight>=0.0 && repeatWeight<1.0) { //compute the td part double tf = (double)kmerFilter.tfWeight(weight); //compute the idf part, 1-3 double idf = kmerFilter.scaledIdf(key); //compute td-idf weight = (int)Math.round(tf*idf); if (weight<1) weight = 1; } } //keep the tf weight otherwise if (weight<=0) continue; //increment valid counter numberValid++; //set the initial shift value long x = key; for (int word = 0; word < numHashes; word++) { for (int count = 0; count>> 35); x ^= (x << 4); if (x < best[word]) { best[word] = x; if (word%2==0) hashes[word] = (int)key; else hashes[word] = (int)(key>>>32); } } } } if (numberValid<=0) throw new ZeroNGramsFoundException("Found zero unfiltered n-grams in the string.", seq); //now combine into super shingles /* HashFunction hf = Hashing.murmur3_32(0); int[] superShingles = new int[numHashes]; for (int iter=0; iter { /** * */ private static final long serialVersionUID = -2655482279264410602L; private static final long[] recordHashes(final long[][] hashes, final int numWords) { final int[] counts = new int[numWords * 64]; // perform count for each ngram for (long[] objectHashes : hashes) { for (int wordIndex = 0; wordIndex < numWords; wordIndex++) { final long val = objectHashes[wordIndex]; final int offset = wordIndex * 64; long mask = 0b1; for (int bit = 0; bit < 64; bit++) { // if not different then increase counts if ((val & mask) == 0b0) counts[offset + bit]--; else counts[offset + bit]++; mask = mask << 1; } } } long[] bits = new long[numWords]; for (int wordIndex = 0; wordIndex < numWords; wordIndex++) { final int offset = wordIndex * 64; long val = 
0b0; long mask = 0b1; for (int bit = 0; bit < 64; bit++) { if (counts[offset + bit] > 0) val = val | mask; // adjust the mask mask = mask << 1; } bits[wordIndex] = val; } return bits; } public SimHash(String string, int nGramSize, int numberWords) { super(recordHashes(HashUtils.computeNGramHashesExact(string, nGramSize, numberWords, 0), numberWords)); } public final double jaccard(final SimHash sh) { int count = getIntersectionCount(sh); double sim = (double)count/(double) this.numberOfBits(); double jaccard = (sim- 0.5) * 2.0; return Math.max(0.0, jaccard); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/sketch/Sketch.java000066400000000000000000000026611277502137000236360ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
 *
 */
package edu.umd.marbl.mhap.sketch;

import java.io.Serializable;

public interface Sketch<T extends Sketch<T>> extends Serializable
{
	double similarity(T sh);
}
MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/sketch/SketchRuntimeException.java000066400000000000000000000036051277502137000270600ustar00rootroot00000000000000
/*
 * MHAP package
 *
 * This software is distributed "as is", without any warranty, including
 * any implied warranty of merchantability or fitness for a particular
 * use. The authors assume no responsibility for, and shall not be liable
 * for, any special, indirect, or consequential damages, or any damages
 * whatsoever, arising out of or in connection with the use of this
 * software.
 *
 * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
* */ package edu.umd.marbl.mhap.sketch; public class SketchRuntimeException extends RuntimeException { /** * */ private static final long serialVersionUID = 8422390842382501317L; public SketchRuntimeException() { } public SketchRuntimeException(String message) { super(message); } public SketchRuntimeException(Throwable cause) { super(cause); } public SketchRuntimeException(String message, Throwable cause) { super(message, cause); } public SketchRuntimeException(String message, Throwable cause, boolean enableSuppression, boolean writableStackTrace) { super(message, cause, enableSuppression, writableStackTrace); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/sketch/ZeroNGramsFoundException.java000066400000000000000000000016251277502137000273160ustar00rootroot00000000000000package edu.umd.marbl.mhap.sketch; public class ZeroNGramsFoundException extends Exception { private final String seqString; /** * */ private static final long serialVersionUID = -3655558540692106680L; public ZeroNGramsFoundException(String message, String seqString) { super(message); this.seqString = seqString; } public ZeroNGramsFoundException(String message, Throwable cause, boolean enableSuppression, boolean writableStackTrace, String seqString) { super(message, cause, enableSuppression, writableStackTrace); this.seqString = seqString; } public ZeroNGramsFoundException(String message, Throwable cause, String seqString) { super(message, cause); this.seqString = seqString; } public ZeroNGramsFoundException(Throwable cause, String seqString) { super(cause); this.seqString = seqString; } public String getSequenceString() { return this.seqString; } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/utils/000077500000000000000000000000001277502137000214245ustar00rootroot00000000000000MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/utils/.gitignore000066400000000000000000000001151277502137000234110ustar00rootroot00000000000000/Utils$Pair.class /Utils$ToProtein.class /Utils$Translate.class /Utils.class 
MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/utils/HashCodeUtil.java000066400000000000000000000106461277502137000246120ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.utils; import java.lang.reflect.Array; /** * The Class HashCodeUtil. */ public final class HashCodeUtil { /// PRIVATE /// /** The oD d_ prim e_ number. */ private static final int fODD_PRIME_NUMBER = 37; /** The Constant SEED. */ public static final int SEED = 23; /** * First term. * * @param aSeed * the a seed * @return the int */ private static int firstTerm( int aSeed ){ return fODD_PRIME_NUMBER * aSeed; } /** * Hash. 
* * @param aSeed * the a seed * @param aBoolean * the a boolean * @return the int */ public static int hash( int aSeed, boolean aBoolean ) { return firstTerm( aSeed ) + ( aBoolean ? 1 : 0 ); } /** * Hash. * * @param aSeed * the a seed * @param aChar * the a char * @return the int */ public static int hash( int aSeed, char aChar ) { //System.out.println("char..."); return firstTerm( aSeed ) + aChar; } public final static int hash(int aSeed, char[] charArray, int start, int size) { int hash = 0; for (int iter=0; iter>> 32) ); } /** * Hash. * * @param aSeed * the a seed * @param aObject * the a object * @return the int */ public static int hash( int aSeed , Object aObject ) { int result = aSeed; if ( aObject == null) { result = hash(result, 0); } else if ( ! isArray(aObject) ) { result = hash(result, aObject.hashCode()); } else { int length = Array.getLength(aObject); for ( int idx = 0; idx < length; ++idx ) { Object item = Array.get(aObject, idx); //recursive call! result = hash(result, item); } } return result; } /** * Checks if is array. * * @param aObject * the a object * @return true, if is array */ private static boolean isArray(Object aObject){ return aObject.getClass().isArray(); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/utils/HitCounter.java000066400000000000000000000030671277502137000243610ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. 
 * See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */
package edu.umd.marbl.mhap.utils;

public final class HitCounter {
    public int count;

    public HitCounter() {
        this.count = 0;
    }

    public HitCounter(int count) {
        this.count = count;
    }

    public HitCounter addHit() {
        this.count++;
        return this;
    }

    public void addHits(int counts) {
        this.count += counts;
    }
}
MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/utils/Interval.java000066400000000000000000000033211277502137000240520ustar00rootroot00000000000000
package edu.umd.marbl.mhap.utils;

/**
 * The Interval class maintains an interval with some associated data
 * @author Kevin Dolan
 *
 * @param <Type> The type of data being stored
 */
public class Interval<Type> implements Comparable<Interval<Type>> {

    private long start;
    private long end;
    private Type data;

    public Interval(long start, long end, Type data) {
        this.start = start;
        this.end = end;
        this.data = data;
    }

    public long getStart() {
        return this.start;
    }

    public void setStart(long start) {
        this.start = start;
    }

    public long getEnd() {
        return this.end;
    }

    public void setEnd(long end) {
        this.end = end;
    }

    public Type getData() {
        return this.data;
    }

    public void setData(Type data) {
        this.data = data;
    }

    /**
     * @param time
     * @return true if time lies strictly between start and end (both endpoints excluded)
     */
    public boolean contains(long time) {
        return time < this.end && time > this.start;
    }

    /**
     * @param other
     * @return true if this
     * interval intersects other
     */
    public boolean intersects(Interval<?> other) {
        return other.getEnd() > this.start && other.getStart() < this.end;
    }

    /**
     * Return -1 if this interval's start time is less than the other, 1 if greater
     * In the event of a tie, -1 if this interval's end time is less than the other, 1 if greater, 0 if same
     * @param other
     * @return -1, 0, or 1
     */
    @Override
    public int compareTo(Interval<Type> other) {
        if(this.start < other.getStart())
            return -1;
        else if(this.start > other.getStart())
            return 1;
        else if(this.end < other.getEnd())
            return -1;
        else if(this.end > other.getEnd())
            return 1;
        else
            return 0;
    }
}
MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/utils/IntervalNode.java000066400000000000000000000110721277502137000246620ustar00rootroot00000000000000
package edu.umd.marbl.mhap.utils;

import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.Map.Entry;

/**
 * The Node class contains the interval tree information for one single node
 *
 * @author Kevin Dolan
 */
public class IntervalNode<Type> {

    private SortedMap<Interval<Type>, List<Interval<Type>>> intervals;
    private long center;
    private IntervalNode<Type> leftNode;
    private IntervalNode<Type> rightNode;

    public IntervalNode() {
        this.intervals = new TreeMap<Interval<Type>, List<Interval<Type>>>();
        this.center = 0;
        this.leftNode = null;
        this.rightNode = null;
    }

    public IntervalNode(List<Interval<Type>> intervalList) {

        this.intervals = new TreeMap<Interval<Type>, List<Interval<Type>>>();

        SortedSet<Long> endpoints = new TreeSet<Long>();

        for(Interval<Type> interval: intervalList) {
            endpoints.add(interval.getStart());
            endpoints.add(interval.getEnd());
        }

        long median = getMedian(endpoints);
        this.center = median;

        List<Interval<Type>> left = new ArrayList<Interval<Type>>();
        List<Interval<Type>> right = new ArrayList<Interval<Type>>();

        for(Interval<Type> interval : intervalList) {
            if(interval.getEnd() < median)
                left.add(interval);
            else if(interval.getStart() > median)
                right.add(interval);
            else {
                List<Interval<Type>> posting = this.intervals.get(interval);
                if(posting == null) {
                    posting = new ArrayList<Interval<Type>>();
                    this.intervals.put(interval, posting);
                }
                posting.add(interval);
            }
        }

        if(left.size() > 0)
            this.leftNode = new IntervalNode<Type>(left);
        if(right.size() > 0)
            this.rightNode = new IntervalNode<Type>(right);
    }

    /**
     * Perform a stabbing query on the node
     * @param time the time to query at
     * @return all intervals containing time
     */
    public List<Interval<Type>> stab(long time) {
        List<Interval<Type>> result = new ArrayList<Interval<Type>>();

        for(Entry<Interval<Type>, List<Interval<Type>>> entry : this.intervals.entrySet()) {
            if(entry.getKey().contains(time))
                for(Interval<Type> interval : entry.getValue())
                    result.add(interval);
            else if(entry.getKey().getStart() > time)
                break;
        }

        if(time < this.center && this.leftNode != null)
            result.addAll(this.leftNode.stab(time));
        else if(time > this.center && this.rightNode != null)
            result.addAll(this.rightNode.stab(time));
        return result;
    }

    /**
     * Perform an interval intersection query on the node
     * @param target the interval to intersect
     * @return all intervals that intersect target
     */
    public List<Interval<Type>> query(Interval<?> target) {
        List<Interval<Type>> result = new ArrayList<Interval<Type>>();

        for(Entry<Interval<Type>, List<Interval<Type>>> entry : this.intervals.entrySet()) {
            if(entry.getKey().intersects(target))
                for(Interval<Type> interval : entry.getValue())
                    result.add(interval);
            else if(entry.getKey().getStart() > target.getEnd())
                break;
        }

        if(target.getStart() < this.center && this.leftNode != null)
            result.addAll(this.leftNode.query(target));
        if(target.getEnd() > this.center && this.rightNode != null)
            result.addAll(this.rightNode.query(target));
        return result;
    }

    public long getCenter() {
        return this.center;
    }

    public void setCenter(long center) {
        this.center = center;
    }

    public IntervalNode<Type> getLeft() {
        return this.leftNode;
    }

    public void setLeft(IntervalNode<Type> left) {
        this.leftNode = left;
    }

    public IntervalNode<Type> getRight() {
        return this.rightNode;
    }

    public void setRight(IntervalNode<Type> right) {
        this.rightNode = right;
    }

    /**
     * @param set the set to look on
     * @return the median of the set, not interpolated
     */
    private Long getMedian(SortedSet<Long> set) {
        int i = 0;
        int middle = set.size() / 2;
        for(Long point : set) {
            if(i == middle)
                return point;
            i++;
        }
        return null;
    }

    @Override
    public String toString() {
        StringBuffer sb = new StringBuffer();
        sb.append(this.center + ": ");
        for(Entry<Interval<Type>, List<Interval<Type>>> entry : this.intervals.entrySet()) {
            sb.append("[" + entry.getKey().getStart() + "," + entry.getKey().getEnd() + "]:{");
            for(Interval<Type> interval : entry.getValue()) {
                sb.append("("+interval.getStart()+","+interval.getEnd()+","+interval.getData()+")");
            }
            sb.append("} ");
        }
        return sb.toString();
    }

}
MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/utils/IntervalTree.java000066400000000000000000000115201277502137000246720ustar00rootroot00000000000000
package edu.umd.marbl.mhap.utils;

import java.util.ArrayList;
import java.util.List;

/**
 * An Interval Tree is essentially a map from intervals to objects, which
 * can be queried for all data associated with a particular interval of
 * time
 * @author Kevin Dolan
 *
 * @param <Type> the type of objects to associate
 */
public class IntervalTree<Type> {

    private IntervalNode<Type> head;
    private List<Interval<Type>> intervalList;
    private boolean inSync;
    private int size;

    /**
     * Instantiate a new interval tree with no intervals
     */
    public IntervalTree() {
        this.head = new IntervalNode<Type>();
        this.intervalList = new ArrayList<Interval<Type>>();
        this.inSync = true;
        this.size = 0;
    }

    /**
     * Instantiate and build an interval tree with a preset list of intervals
     * @param intervalList the list of intervals to use
     */
    public IntervalTree(List<Interval<Type>> intervalList) {
        this.head = new IntervalNode<Type>(intervalList);
        this.intervalList = new ArrayList<Interval<Type>>();
        this.intervalList.addAll(intervalList);
        this.inSync = true;
        this.size = intervalList.size();
    }

    /**
     * Perform a stabbing query, returning the associated data
     * Will rebuild the tree if out of sync
     * @param time the time to stab
     * @return the data associated with all intervals that contain time
     */
    public List<Type> get(long time) {
        List<Interval<Type>> intervals = getIntervals(time);
        List<Type> result = new ArrayList<Type>();
        for(Interval<Type> interval : intervals)
            result.add(interval.getData());
        return result;
    }

    /**
     * Perform a stabbing
     * query, returning the interval objects
     * Will rebuild the tree if out of sync
     * @param time the time to stab
     * @return all intervals that contain time
     */
    public List<Interval<Type>> getIntervals(long time) {
        build();
        return this.head.stab(time);
    }

    /**
     * Perform an interval query, returning the associated data
     * Will rebuild the tree if out of sync
     * @param start the start of the interval to check
     * @param end the end of the interval to check
     * @return the data associated with all intervals that intersect target
     */
    public List<Type> get(long start, long end) {
        List<Interval<Type>> intervals = getIntervals(start, end);
        List<Type> result = new ArrayList<Type>();
        for(Interval<Type> interval : intervals)
            result.add(interval.getData());
        return result;
    }

    /**
     * Perform an interval query, returning the interval objects
     * Will rebuild the tree if out of sync
     * @param start the start of the interval to check
     * @param end the end of the interval to check
     * @return all intervals that intersect target
     */
    public List<Interval<Type>> getIntervals(long start, long end) {
        build();
        return this.head.query(new Interval<Type>(start, end, null));
    }

    /**
     * Add an interval object to the interval tree's list
     * Will not rebuild the tree until the next query or call to build
     * @param interval the interval object to add
     */
    public void addInterval(Interval<Type> interval) {
        this.intervalList.add(interval);
        this.inSync = false;
    }

    /**
     * Add an interval object to the interval tree's list
     * Will not rebuild the tree until the next query or call to build
     * @param begin the beginning of the interval
     * @param end the end of the interval
     * @param data the data to associate
     */
    public void addInterval(long begin, long end, Type data) {
        this.intervalList.add(new Interval<Type>(begin, end, data));
        this.inSync = false;
    }

    /**
     * Determine whether this interval tree is currently a reflection of all intervals in the interval list
     * @return true if no changes have been made since the last build
     */
    public boolean inSync() {
        return this.inSync;
    }

    /**
     * Build the interval tree to reflect the list of
     * intervals,
     * Will not run if this is currently in sync
     */
    public void build() {
        if(!this.inSync) {
            this.head = new IntervalNode<Type>(this.intervalList);
            this.inSync = true;
            this.size = this.intervalList.size();
        }
    }

    /**
     * @return the number of entries in the currently built interval tree
     */
    public int currentSize() {
        return this.size;
    }

    /**
     * @return the number of entries in the interval list, equal to currentSize() if inSync()
     */
    public int listSize() {
        return this.intervalList.size();
    }

    @Override
    public String toString() {
        return nodeString(this.head,0);
    }

    private String nodeString(IntervalNode<Type> node, int level) {
        if(node == null)
            return "";

        StringBuffer sb = new StringBuffer();
        for(int i = 0; i < level; i++)
            sb.append("\t");
        sb.append(node + "\n");
        sb.append(nodeString(node.getLeft(), level + 1));
        sb.append(nodeString(node.getRight(), level + 1));
        return sb.toString();
    }
}
MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/utils/LimitedSizeCollection.java000066400000000000000000000103041277502137000265230ustar00rootroot00000000000000
/*
 *
 * This software is distributed "as is", without any warranty, including
 * any implied warranty of merchantability or fitness for a particular
 * use. The authors assume no responsibility for, and shall not be liable
 * for, any special, indirect, or consequential damages, or any damages
 * whatsoever, arising out of or in connection with the use of this
 * software.
 *
 * Copyright (c) 2013 by Konstantin Berlin
 * University Of Maryland
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */
package edu.umd.marbl.mhap.utils;

import java.util.Collection;
import java.util.Comparator;
import java.util.Iterator;
import java.util.PriorityQueue;

public final class LimitedSizeCollection<T extends Comparable<T>> implements Collection<T> {
    public enum Priority {
        MAX_VALUES, MIN_VALUES;
    }

    private T best;
    private int maxSize;
    private final PriorityQueue<T> queue;

    public LimitedSizeCollection(int maxSize) {
        this(maxSize, Priority.MIN_VALUES);
    }

    public LimitedSizeCollection(int maxSize, Priority priority) {
        // initiate with reverse queue
        if (priority == Priority.MIN_VALUES) {
            this.queue = new PriorityQueue<T>(maxSize, new Comparator<T>() {
                @Override
                public final int compare(T s1, T s2) {
                    return s2.compareTo(s1);
                }
            });
        } else {
            this.queue = new PriorityQueue<T>(maxSize, new Comparator<T>() {
                @Override
                public final int compare(T s1, T s2) {
                    return s1.compareTo(s2);
                }
            });
        }

        this.maxSize = maxSize;
        this.best = null;
    }

    @Override
    public boolean add(T o) {
        if (o == null)
            return false;

        if (this.maxSize <= 0)
            return false;

        // if can fit just add
        if (this.queue.size() < this.maxSize) {
            this.queue.add(o);
        } else if (this.queue.comparator().compare(o, this.queue.peek()) > 0) {
            this.queue.add(o);
            this.queue.poll();
        } else
            return false;

        if (this.best == null || this.queue.comparator().compare(o, this.best) > 0) {
            this.best = o;
        }

        return true;
    }

    @Override
    public boolean addAll(Collection<? extends T> c) {
        for (T elem : c)
            add(elem);

        return true;
    }

    @Override
    public void clear() {
        this.best = null;
        this.queue.clear();
    }

    @Override
    public boolean contains(Object o) {
        return this.queue.contains(o);
    }

    @Override
    public boolean containsAll(Collection<?> c) {
        return this.queue.containsAll(c);
    }

    public T getBest() {
        return this.best;
    }

    public Collection<T> getCollection() {
        return this.queue;
    }

    public T getWorst() {
        return this.queue.peek();
    }

    @Override
    public boolean isEmpty() {
        return this.queue.isEmpty();
    }

    public boolean isFull() {
        return this.queue.size() >= this.maxSize;
    }

    @Override
    public Iterator<T> iterator() {
        return this.queue.iterator();
    }

    @Override
    public boolean remove(Object o) {
        throw new UnsupportedOperationException();
    }

    @Override
    public boolean removeAll(Collection<?> c) {
        this.best = null;
        return this.queue.removeAll(c);
    }

    public T removeWorst() {
        return this.queue.poll();
    }

    @Override
    public boolean retainAll(Collection<?> c) {
        throw new UnsupportedOperationException();
    }

    public void setSize(int maxSize) {
        this.maxSize = maxSize;
        while (this.queue.size() > this.maxSize)
            this.queue.poll();
    }

    @Override
    public int size() {
        return this.queue.size();
    }

    @Override
    public Object[] toArray() {
        return this.queue.toArray();
    }

    @Override
    public <Y> Y[] toArray(Y[] a) {
        return this.queue.toArray(a);
    }
}
MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/utils/MersenneTwisterFast.java000066400000000000000000001302151277502137000262450ustar00rootroot00000000000000
package edu.umd.marbl.mhap.utils;

import java.io.*;
import java.util.*;

/**
 *

MersenneTwister and MersenneTwisterFast

*

* Version 20, based on version MT19937(99/10/29) of the Mersenne * Twister algorithm found at The Mersenne Twister * Home Page, with the initialization improved using the new 2002/1/26 * initialization algorithm By Sean Luke, October 2004. * *

* MersenneTwister is a drop-in subclass replacement for * java.util.Random. It is properly synchronized and can be used in a * multithreaded environment. On modern VMs such as HotSpot, it is approximately * 1/3 slower than java.util.Random. * *

* MersenneTwisterFast is not a subclass of java.util.Random. It has the * same public methods as Random does, however, and it is algorithmically * identical to MersenneTwister. MersenneTwisterFast has hard-code inlined all * of its methods directly, and made all of them final (well, the ones of * consequence anyway). Further, these methods are not synchronized, so * the same MersenneTwisterFast instance cannot be shared by multiple threads. * But all this helps MersenneTwisterFast achieve well over twice the speed of * MersenneTwister. java.util.Random is about 1/3 slower than * MersenneTwisterFast. * *
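* Because the fast variant is unsynchronized, one common pattern is to give every thread its own generator. The sketch below shows that pattern with a ThreadLocal. It is an illustration only: the class name PerThreadRngSketch and the seeding scheme are assumptions, and java.util.Random stands in for MersenneTwisterFast so that the snippet compiles on its own.

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: one generator per thread, never shared between threads.
final class PerThreadRngSketch {
    // Assumed seeding scheme: a simple counter so each thread gets a distinct seed.
    private static final AtomicLong NEXT_SEED = new AtomicLong(1L);

    // java.util.Random is a stand-in here; an unsynchronized generator
    // would be constructed in the same place.
    private static final ThreadLocal<Random> LOCAL =
            ThreadLocal.withInitial(() -> new Random(NEXT_SEED.getAndIncrement()));

    // Returns the calling thread's private generator.
    static Random current() {
        return LOCAL.get();
    }

    public static void main(String[] args) {
        // The same thread always sees the same instance.
        System.out.println(current() == current());
    }
}
```

* With this layout no locking is needed on the hot path, at the cost of one generator state (624 ints for the Mersenne Twister) per thread.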

About the Mersenne Twister

*

* This is a Java version of the C-program for MT19937: Integer version. The * MT19937 algorithm was created by Makoto Matsumoto and Takuji Nishimura, who * ask: "When you use this, send an email to: matumoto@math.keio.ac.jp with an * appropriate reference to your work". Indicate that this is a translation of * their algorithm into Java. * *

* Reference. Makoto Matsumoto and Takuji Nishimura, "Mersenne Twister: * A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator", * ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, * January 1998, pp 3--30. * *
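* For readers who want the algorithm without the hand-inlining used in this file, the following is a compact sketch of MT19937 with the 2002 initialization. The class name Mt19937Sketch and the modulo-based twist loop are assumptions made for brevity; the constants, the seeding recurrence, and the tempering steps are the ones that appear in this file.

```java
// Compact, non-inlined MT19937 sketch (hypothetical class, for illustration only).
final class Mt19937Sketch {
    private static final int N = 624;
    private static final int M = 397;
    private static final int MATRIX_A = 0x9908b0df;
    private static final int UPPER_MASK = 0x80000000; // most significant bit
    private static final int LOWER_MASK = 0x7fffffff; // least significant 31 bits

    private final int[] mt = new int[N];
    private int mti;

    Mt19937Sketch(int seed) {
        // 2002 initialization: every word depends on all bits of the previous word.
        mt[0] = seed;
        for (mti = 1; mti < N; mti++) {
            mt[mti] = 1812433253 * (mt[mti - 1] ^ (mt[mti - 1] >>> 30)) + mti;
        }
        // mti == N here, so the first nextInt() call regenerates the state.
    }

    // Regenerates all N words in place.
    private void twist() {
        for (int k = 0; k < N; k++) {
            int y = (mt[k] & UPPER_MASK) | (mt[(k + 1) % N] & LOWER_MASK);
            int v = mt[(k + M) % N] ^ (y >>> 1);
            if ((y & 1) != 0) {
                v ^= MATRIX_A;
            }
            mt[k] = v;
        }
        mti = 0;
    }

    int nextInt() {
        if (mti >= N) {
            twist();
        }
        int y = mt[mti++];
        y ^= y >>> 11;               // tempering shift u
        y ^= (y << 7) & 0x9d2c5680;  // tempering shift s, mask b
        y ^= (y << 15) & 0xefc60000; // tempering shift t, mask c
        y ^= y >>> 18;               // tempering shift l
        return y;
    }

    public static void main(String[] args) {
        Mt19937Sketch rng = new Mt19937Sketch(5489);
        // prints 3499211612, the first output of the reference generator for seed 5489
        System.out.println(Integer.toUnsignedString(rng.nextInt()));
    }
}
```

* The modulo indexing costs a little speed compared with the three-loop form used in the inlined methods, but it keeps the recurrence visible in one place.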

About this Version

* *

* Changes since V19: nextFloat(boolean, boolean) now returns float, not * double. * *

* Changes since V18: Removed old final declarations, which used to * potentially speed up the code, but no longer. * *

* Changes since V17: Removed vestigial references to &= 0xffffffff which * stemmed from the original C code. The C code could not guarantee that ints * were 32 bit, hence the masks. The vestigial references in the Java code were * likely optimized out anyway. * *

* Changes since V16: Added nextDouble(includeZero, includeOne) and * nextFloat(includeZero, includeOne) to allow for half-open, fully-closed, and * fully-open intervals. * *
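* One way such interval flags can be implemented is by rejection on top of a half-open draw. The sketch below is an assumption about the technique, not a copy of this class's method: the name UnitIntervalSketch is hypothetical and java.util.Random stands in as the underlying source.

```java
import java.util.Random;

// Hypothetical helper showing interval selection by rejection sampling.
final class UnitIntervalSketch {
    // Draws from [0,1), (0,1), [0,1] or (0,1] depending on the two flags.
    static double nextDouble(Random rng, boolean includeZero, boolean includeOne) {
        double d;
        do {
            d = rng.nextDouble();                // half-open draw from [0.0, 1.0)
            if (includeOne && rng.nextBoolean()) {
                d += 1.0;                        // half the time, shift into [1.0, 2.0)
            }
        } while (d > 1.0                         // only exactly 1.0 survives the shift
                || (!includeZero && d == 0.0));  // drop 0.0 when zero is excluded
        return d;
    }

    public static void main(String[] args) {
        Random rng = new Random(42L);
        System.out.println(nextDouble(rng, false, true)); // a value in (0.0, 1.0]
    }
}
```

* Shifting half of the draws up by one gives 1.0 the same probability as any other representable value of the half-open draw, which is why the loop rejects instead of clamping.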

* Changes Since V15: Added serialVersionUID to quiet compiler warnings * from Sun's overly verbose compilers as of JDK 1.5. * *

* Changes Since V14: made strictfp, with StrictMath.log and * StrictMath.sqrt in nextGaussian instead of Math.log and Math.sqrt. This is * largely just to be safe, as it presently makes no difference in the speed, * correctness, or results of the algorithm. * *

* Changes Since V13: clone() method CloneNotSupportedException removed. * *

* Changes Since V12: clone() method added. * *

* Changes Since V11: stateEquals(...) method added. MersenneTwisterFast * is equal to other MersenneTwisterFasts with identical state; likewise * MersenneTwister is equal to other MersenneTwister with identical state. This * isn't equals(...) because that requires a contract of immutability to compare * by value. * *

* Changes Since V10: A documentation error suggested that setSeed(int[]) * required an int[] array 624 long. In fact, the array can be any non-zero * length. The new version also checks for this fact. * *

* Changes Since V9: readState(stream) and writeState(stream) provided. * *

* Changes Since V8: setSeed(int) was only using the first 28 bits of the * seed; it should have been 32 bits. For small-number seeds the behavior is * identical. * *

* Changes Since V7: A documentation error in MersenneTwisterFast (but * not MersenneTwister) stated that nextDouble selects uniformly from the * full-open interval [0,1]. It does not. nextDouble's contract is identical * across MersenneTwisterFast, MersenneTwister, and java.util.Random, namely, * selection in the half-open interval [0,1). That is, 1.0 should not be * returned. A similar contract exists in nextFloat. * *
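* The half-open contract follows from how a double is assembled out of two 32-bit outputs. The sketch below isolates the expression that nextBoolean(double) in this file uses for that purpose; the class and method names (UnitDoubleSketch, toUnitDouble) are hypothetical.

```java
// Hypothetical helper isolating the 53-bit double construction.
final class UnitDoubleSketch {
    // Combines the top 26 bits of y and the top 27 bits of z into a 53-bit
    // integer, then scales it by 2^-53. The largest possible numerator is
    // 2^53 - 1, so the result can never reach 1.0.
    static double toUnitDouble(int y, int z) {
        return ((((long) (y >>> 6)) << 27) + (z >>> 5)) / (double) (1L << 53);
    }

    public static void main(String[] args) {
        System.out.println(toUnitDouble(0, 0));         // smallest value: 0.0
        System.out.println(toUnitDouble(-1, -1) < 1.0); // all bits set still stays below 1.0
    }
}
```

* Every value produced this way is an exact multiple of 2^-53, so the largest result is the double immediately below 1.0.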

* Changes Since V6: License has changed from LGPL to BSD. New timing * information to compare against java.util.Random. Recent versions of HotSpot * have helped Random increase in speed to the point where it is faster than * MersenneTwister but slower than MersenneTwisterFast (which should be the * case, as it's a less complex algorithm but is synchronized). * *

* Changes Since V5: New empty constructor made to work the same as * java.util.Random -- namely, it seeds based on the current time in * milliseconds. * *

* Changes Since V4: New initialization algorithms. See * http://www.math.keio.ac.jp/matumoto/MT2002/emt19937ar.html * *

* The MersenneTwister code is based on standard MT19937 C/C++ code by Takuji * Nishimura, with suggestions from Topher Cooper and Marc Rieffel, July 1997. * The code was originally translated into Java by Michael Lecuyer, January * 1999, and the original code is Copyright (c) 1999 by Michael Lecuyer. * *

Java notes

* *

* This implementation implements the bug fixes made in Java 1.2's version of * Random, which means it can be used with earlier versions of Java. See * the JDK 1.2 java.util.Random documentation for further documentation on * the random-number generation contracts made. Additionally, there's an * undocumented bug in the JDK java.util.Random.nextBytes() method, which this * code fixes. * *

* Just like java.util.Random, this generator accepts a long seed but doesn't * use all of it. java.util.Random uses 48 bits. The Mersenne Twister instead * uses 32 bits (int size). So it's best if your seed does not exceed the int * range. * *
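* The effect of the 32-bit limit can be seen in isolation. The sketch below reuses the expression that setSeed(long) applies to its argument and shows that two long seeds differing only in their upper half collapse to the same int. The class name SeedTruncationSketch and the foldSeed helper are assumptions; folding with Long.hashCode is one possible way to let the upper bits contribute, not something this class does.

```java
// Hypothetical helper showing how a long seed is reduced to 32 bits.
final class SeedTruncationSketch {
    // Same expression as in setSeed(long). The int literal 0xffffffff widens
    // to a long with all bits set, so the mask changes nothing and the cast
    // alone keeps the low 32 bits.
    static int effectiveSeed(long seed) {
        return (int) (seed & 0xffffffff);
    }

    // Assumed alternative: XOR the upper half into the lower half first.
    static int foldSeed(long seed) {
        return Long.hashCode(seed);
    }

    public static void main(String[] args) {
        long a = 42L;
        long b = 42L + (1L << 32); // differs from a only above bit 31
        System.out.println(effectiveSeed(a) == effectiveSeed(b)); // prints true
        System.out.println(foldSeed(a) == foldSeed(b));           // prints false
    }
}
```

* Callers that derive seeds from 64-bit values such as timestamps or hashes may therefore want to fold them before seeding.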

* MersenneTwister can be used reliably on JDK version 1.1.5 or above. Earlier * Java versions have serious bugs in java.util.Random; only MersenneTwisterFast * (and not MersenneTwister nor java.util.Random) should be used with them. * *

License

* * Copyright (c) 2003 by Sean Luke.
* Portions copyright (c) 1993 by Michael Lecuyer.
* All rights reserved.
* *

* Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: *

    *
  • Redistributions of source code must retain the above copyright notice, * this list of conditions and the following disclaimer. *
  • Redistributions in binary form must reproduce the above copyright notice, * this list of conditions and the following disclaimer in the documentation * and/or other materials provided with the distribution. *
  • Neither the name of the copyright owners, their employers, nor the names * of its contributors may be used to endorse or promote products derived from * this software without specific prior written permission. *
*

* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNERS OR CONTRIBUTORS BE * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. * * @version 20 */ // Note: this class is hard-inlined in all of its methods. This makes some of // the methods well-nigh unreadable in their complexity. In fact, the Mersenne // Twister is fairly easy code to understand: if you're trying to get a handle // on the code, I strongly suggest looking at MersenneTwister.java first. 
// -- Sean public strictfp class MersenneTwisterFast implements Serializable, Cloneable { // Serialization private static final long serialVersionUID = -8219700664442619525L; // locked // as of // Version // 15 // Period parameters private static final int N = 624; private static final int M = 397; private static final int MATRIX_A = 0x9908b0df; // private static final * // constant vector a private static final int UPPER_MASK = 0x80000000; // most significant w-r // bits private static final int LOWER_MASK = 0x7fffffff; // least significant r // bits // Tempering parameters private static final int TEMPERING_MASK_B = 0x9d2c5680; private static final int TEMPERING_MASK_C = 0xefc60000; private int mt[]; // the array for the state vector private int mti; // mti==N+1 means mt[N] is not initialized private int mag01[]; // a good initial seed (of int size, though stored in a long) // private static final long GOOD_SEED = 4357; private double __nextNextGaussian; private boolean __haveNextNextGaussian; /* * We're overriding all internal data, to my knowledge, so this should be * okay */ @Override public Object clone() { try { MersenneTwisterFast f = (MersenneTwisterFast) (super.clone()); f.mt = (int[]) (mt.clone()); f.mag01 = (int[]) (mag01.clone()); return f; } catch (CloneNotSupportedException e) { throw new InternalError(); } // should never happen } public boolean stateEquals(Object o) { if (o == this) return true; if (o == null || !(o instanceof MersenneTwisterFast)) return false; MersenneTwisterFast other = (MersenneTwisterFast) o; if (mti != other.mti) return false; for (int x = 0; x < mag01.length; x++) if (mag01[x] != other.mag01[x]) return false; for (int x = 0; x < mt.length; x++) if (mt[x] != other.mt[x]) return false; return true; } /** Reads the entire state of the MersenneTwister RNG from the stream */ public void readState(DataInputStream stream) throws IOException { int len = mt.length; for (int x = 0; x < len; x++) mt[x] = stream.readInt(); len = 
mag01.length; for (int x = 0; x < len; x++) mag01[x] = stream.readInt(); mti = stream.readInt(); __nextNextGaussian = stream.readDouble(); __haveNextNextGaussian = stream.readBoolean(); } /** Writes the entire state of the MersenneTwister RNG to the stream */ public void writeState(DataOutputStream stream) throws IOException { int len = mt.length; for (int x = 0; x < len; x++) stream.writeInt(mt[x]); len = mag01.length; for (int x = 0; x < len; x++) stream.writeInt(mag01[x]); stream.writeInt(mti); stream.writeDouble(__nextNextGaussian); stream.writeBoolean(__haveNextNextGaussian); } /** * Constructor using the default seed. */ public MersenneTwisterFast() { this(System.currentTimeMillis()); } /** * Constructor using a given seed. Though you pass this seed in as a long, * it's best to make sure it's actually an integer. * */ public MersenneTwisterFast(long seed) { setSeed(seed); } /** * Constructor using an array of integers as seed. Your array must have a * non-zero length. Only the first 624 integers in the array are used; if * the array is shorter than this then integers are repeatedly used in a * wrap-around fashion. */ public MersenneTwisterFast(int[] array) { setSeed(array); } /** * Initalize the pseudo random number generator. Don't pass in a long that's * bigger than an int (Mersenne Twister only uses the first 32 bits for its * seed). */ synchronized public void setSeed(long seed) { // Due to a bug in java.util.Random clear up to 1.2, we're // doing our own Gaussian variable. __haveNextNextGaussian = false; mt = new int[N]; mag01 = new int[2]; mag01[0] = 0x0; mag01[1] = MATRIX_A; mt[0] = (int) (seed & 0xffffffff); for (mti = 1; mti < N; mti++) { mt[mti] = (1812433253 * (mt[mti - 1] ^ (mt[mti - 1] >>> 30)) + mti); /* See Knuth TAOCP Vol2. 3rd Ed. P.106 for multiplier. */ /* In the previous versions, MSBs of the seed affect */ /* only MSBs of the array mt[]. 
*/ /* 2002/01/09 modified by Makoto Matsumoto */ // mt[mti] &= 0xffffffff; /* for >32 bit machines */ } } /** * Sets the seed of the MersenneTwister using an array of integers. Your * array must have a non-zero length. Only the first 624 integers in the * array are used; if the array is shorter than this then integers are * repeatedly used in a wrap-around fashion. */ synchronized public void setSeed(int[] array) { if (array.length == 0) throw new IllegalArgumentException("Array length must be greater than zero"); int i, j, k; setSeed(19650218); i = 1; j = 0; k = (N > array.length ? N : array.length); for (; k != 0; k--) { mt[i] = (mt[i] ^ ((mt[i - 1] ^ (mt[i - 1] >>> 30)) * 1664525)) + array[j] + j; /* * non * linear */ // mt[i] &= 0xffffffff; /* for WORDSIZE > 32 machines */ i++; j++; if (i >= N) { mt[0] = mt[N - 1]; i = 1; } if (j >= array.length) j = 0; } for (k = N - 1; k != 0; k--) { mt[i] = (mt[i] ^ ((mt[i - 1] ^ (mt[i - 1] >>> 30)) * 1566083941)) - i; /* * non * linear */ // mt[i] &= 0xffffffff; /* for WORDSIZE > 32 machines */ i++; if (i >= N) { mt[0] = mt[N - 1]; i = 1; } } mt[0] = 0x80000000; /* MSB is 1; assuring non-zero initial array */ } public int nextInt() { int y; if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) return y; } 
public short nextShort() { int y; if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) return (short) (y >>> 16); } public char nextChar() { int y; if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) return (char) (y >>> 16); } public boolean nextBoolean() { int y; if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = 
mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) return (boolean) ((y >>> 31) != 0); } /** * This generates a coin flip with a probability probability of * returning true, else returning false. probability must be * between 0.0 and 1.0, inclusive. Not as precise a random real event as * nextBoolean(double), but twice as fast. To explicitly use this, remember * you may need to cast to float first. */ public boolean nextBoolean(float probability) { int y; if (probability < 0.0f || probability > 1.0f) throw new IllegalArgumentException("probability must be between 0.0 and 1.0 inclusive."); if (probability == 0.0f) return false; // fix half-open issues else if (probability == 1.0f) return true; // fix half-open issues if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) return (y >>> 8) / ((float) (1 
<< 24)) < probability; } /** * This generates a coin flip with a probability probability of * returning true, else returning false. probability must be * between 0.0 and 1.0, inclusive. */ public boolean nextBoolean(double probability) { int y; int z; if (probability < 0.0 || probability > 1.0) throw new IllegalArgumentException("probability must be between 0.0 and 1.0 inclusive."); if (probability == 0.0) return false; // fix half-open issues else if (probability == 1.0) return true; // fix half-open issues if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { z = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (z >>> 1) ^ mag01[z & 0x1]; } for (; kk < N - 1; kk++) { z = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (z >>> 1) ^ mag01[z & 0x1]; } z = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (z >>> 1) ^ mag01[z & 0x1]; mti = 0; } z = mt[mti++]; z ^= z >>> 11; // TEMPERING_SHIFT_U(z) z ^= (z << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(z) z ^= (z << 15) & 
TEMPERING_MASK_C; // TEMPERING_SHIFT_T(z) z ^= (z >>> 18); // TEMPERING_SHIFT_L(z) /* derived from nextDouble documentation in jdk 1.2 docs, see top */ return ((((long) (y >>> 6)) << 27) + (z >>> 5)) / (double) (1L << 53) < probability; } public byte nextByte() { int y; if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) return (byte) (y >>> 24); } public void nextBytes(byte[] bytes) { int y; for (int x = 0; x < bytes.length; x++) { if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) bytes[x] = (byte) (y >>> 24); } } public long 
nextLong() { int y; int z; if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { z = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (z >>> 1) ^ mag01[z & 0x1]; } for (; kk < N - 1; kk++) { z = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (z >>> 1) ^ mag01[z & 0x1]; } z = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (z >>> 1) ^ mag01[z & 0x1]; mti = 0; } z = mt[mti++]; z ^= z >>> 11; // TEMPERING_SHIFT_U(z) z ^= (z << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(z) z ^= (z << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(z) z ^= (z >>> 18); // TEMPERING_SHIFT_L(z) return (((long) y) << 32) + (long) z; } /** * Returns a long drawn uniformly from 0 to n-1. Suffice it to say, n must * be > 0, or an IllegalArgumentException is raised. 
*/ public long nextLong(long n) { if (n <= 0) throw new IllegalArgumentException("n must be positive, got: " + n); long bits, val; do { int y; int z; if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { z = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (z >>> 1) ^ mag01[z & 0x1]; } for (; kk < N - 1; kk++) { z = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (z >>> 1) ^ mag01[z & 0x1]; } z = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (z >>> 1) ^ mag01[z & 0x1]; mti = 0; } z = mt[mti++]; z ^= z >>> 11; // TEMPERING_SHIFT_U(z) z ^= (z << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(z) z ^= (z << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(z) z ^= (z >>> 18); // TEMPERING_SHIFT_L(z) bits = (((((long) y) << 32) + (long) z) >>> 1); val = bits % n; } while (bits - val + (n - 1) < 0); return val; } /** * Returns a random double in the half-open range from [0.0,1.0). Thus 0.0 * is a valid result but 1.0 is not. 
*/ public double nextDouble() { int y; int z; if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { z = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (z >>> 1) ^ mag01[z & 0x1]; } for (; kk < N - 1; kk++) { z = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (z >>> 1) ^ mag01[z & 0x1]; } z = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (z >>> 1) ^ mag01[z & 0x1]; mti = 0; } z = mt[mti++]; z ^= z >>> 11; // TEMPERING_SHIFT_U(z) z ^= (z << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(z) z ^= (z << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(z) z ^= (z >>> 18); // TEMPERING_SHIFT_L(z) /* derived from nextDouble documentation in jdk 1.2 docs, see top */ return ((((long) (y >>> 6)) << 27) + (z >>> 5)) / (double) (1L << 53); } /** * Returns a double in the range from 0.0 to 1.0, possibly inclusive of 0.0 * and 1.0 themselves. Thus: * *

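 * <table>
 * <tr><th>Expression</th><th>Interval</th></tr>
 * <tr><td>nextDouble(false, false)</td><td>(0.0, 1.0)</td></tr>
 * <tr><td>nextDouble(true, false)</td><td>[0.0, 1.0)</td></tr>
 * <tr><td>nextDouble(false, true)</td><td>(0.0, 1.0]</td></tr>
 * <tr><td>nextDouble(true, true)</td><td>[0.0, 1.0]</td></tr>
 * </table>
 *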

* This version preserves all possible random values in the double range. */ public double nextDouble(boolean includeZero, boolean includeOne) { double d = 0.0; do { d = nextDouble(); // grab a value, initially from half-open [0.0, // 1.0) if (includeOne && nextBoolean()) d += 1.0; // if includeOne, with 1/2 probability, push to [1.0, // 2.0) } while ((d > 1.0) || // everything above 1.0 is always invalid (!includeZero && d == 0.0)); // if we're not including zero, 0.0 // is invalid return d; } public double nextGaussian() { if (__haveNextNextGaussian) { __haveNextNextGaussian = false; return __nextNextGaussian; } else { double v1, v2, s; do { int y; int z; int a; int b; if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly // faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly // faster for (kk = 0; kk < N - M; kk++) { z = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (z >>> 1) ^ mag01[z & 0x1]; } for (; kk < N - 1; kk++) { z = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (z >>> 1) ^ mag01[z & 0x1]; } z = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ 
(z >>> 1) ^ mag01[z & 0x1]; mti = 0; } z = mt[mti++]; z ^= z >>> 11; // TEMPERING_SHIFT_U(z) z ^= (z << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(z) z ^= (z << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(z) z ^= (z >>> 18); // TEMPERING_SHIFT_L(z) if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly // faster for (kk = 0; kk < N - M; kk++) { a = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (a >>> 1) ^ mag01[a & 0x1]; } for (; kk < N - 1; kk++) { a = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (a >>> 1) ^ mag01[a & 0x1]; } a = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (a >>> 1) ^ mag01[a & 0x1]; mti = 0; } a = mt[mti++]; a ^= a >>> 11; // TEMPERING_SHIFT_U(a) a ^= (a << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(a) a ^= (a << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(a) a ^= (a >>> 18); // TEMPERING_SHIFT_L(a) if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly // faster for (kk = 0; kk < N - M; kk++) { b = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (b >>> 1) ^ mag01[b & 0x1]; } for (; kk < N - 1; kk++) { b = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (b >>> 1) ^ mag01[b & 0x1]; } b = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (b >>> 1) ^ mag01[b & 0x1]; mti = 0; } b = mt[mti++]; b ^= b >>> 11; // TEMPERING_SHIFT_U(b) b ^= (b << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(b) b ^= (b << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(b) b ^= (b >>> 18); // TEMPERING_SHIFT_L(b) /* * derived from nextDouble documentation in jdk 1.2 docs, see * top */ v1 = 2 * (((((long) (y >>> 6)) << 27) + (z >>> 5)) / (double) (1L << 53)) - 1; v2 = 2 * (((((long) (a >>> 
6)) << 27) + (b >>> 5)) / (double) (1L << 53)) - 1; s = v1 * v1 + v2 * v2; } while (s >= 1 || s == 0); double multiplier = StrictMath.sqrt(-2 * StrictMath.log(s) / s); __nextNextGaussian = v2 * multiplier; __haveNextNextGaussian = true; return v1 * multiplier; } } /** * Returns a random float in the half-open range from [0.0f,1.0f). Thus 0.0f * is a valid result but 1.0f is not. */ public float nextFloat() { int y; if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) return (y >>> 8) / ((float) (1 << 24)); } /** * Returns a float in the range from 0.0f to 1.0f, possibly inclusive of * 0.0f and 1.0f themselves. Thus: * *

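 * <table>
 * <tr><th>Expression</th><th>Interval</th></tr>
 * <tr><td>nextFloat(false, false)</td><td>(0.0f, 1.0f)</td></tr>
 * <tr><td>nextFloat(true, false)</td><td>[0.0f, 1.0f)</td></tr>
 * <tr><td>nextFloat(false, true)</td><td>(0.0f, 1.0f]</td></tr>
 * <tr><td>nextFloat(true, true)</td><td>[0.0f, 1.0f]</td></tr>
 * </table>
 *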

* This version preserves all possible random values in the float range. */ public float nextFloat(boolean includeZero, boolean includeOne) { float d = 0.0f; do { d = nextFloat(); // grab a value, initially from half-open [0.0f, // 1.0f) if (includeOne && nextBoolean()) d += 1.0f; // if includeOne, with 1/2 probability, push to // [1.0f, 2.0f) } while ((d > 1.0f) || // everything above 1.0f is always invalid (!includeZero && d == 0.0f)); // if we're not including zero, // 0.0f is invalid return d; } /** * Returns an integer drawn uniformly from 0 to n-1. Suffice it to say, n * must be > 0, or an IllegalArgumentException is raised. */ public int nextInt(int n) { if (n <= 0) throw new IllegalArgumentException("n must be positive, got: " + n); if ((n & -n) == n) // i.e., n is a power of 2 { int y; if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) return (int) ((n * (long) (y >>> 1)) >> 31); } int bits, val; do { int y; if (mti >= N) // generate N words at one time { int kk; final int[] mt = this.mt; // locals are slightly faster final int[] mag01 = this.mag01; // locals are slightly faster for (kk = 0; kk < N - M; kk++) { y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + M] ^ (y >>> 1) ^ mag01[y & 0x1]; } for (; kk < N - 1; kk++) 
{ y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK); mt[kk] = mt[kk + (M - N)] ^ (y >>> 1) ^ mag01[y & 0x1]; } y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK); mt[N - 1] = mt[M - 1] ^ (y >>> 1) ^ mag01[y & 0x1]; mti = 0; } y = mt[mti++]; y ^= y >>> 11; // TEMPERING_SHIFT_U(y) y ^= (y << 7) & TEMPERING_MASK_B; // TEMPERING_SHIFT_S(y) y ^= (y << 15) & TEMPERING_MASK_C; // TEMPERING_SHIFT_T(y) y ^= (y >>> 18); // TEMPERING_SHIFT_L(y) bits = (y >>> 1); val = bits % n; } while (bits - val + (n - 1) < 0); return val; } /** * Tests the code. */ public static void main(String args[]) { int j; MersenneTwisterFast r; // CORRECTNESS TEST // COMPARE WITH // http://www.math.keio.ac.jp/matumoto/CODES/MT2002/mt19937ar.out r = new MersenneTwisterFast(new int[] { 0x123, 0x234, 0x345, 0x456 }); System.out.println("Output of MersenneTwisterFast with new (2002/1/26) seeding mechanism"); for (j = 0; j < 1000; j++) { // first, convert the int from signed to "unsigned" long l = (long) r.nextInt(); if (l < 0) l += 4294967296L; // max int value String s = String.valueOf(l); while (s.length() < 10) s = " " + s; // buffer System.out.print(s + " "); if (j % 5 == 4) System.out.println(); } // SPEED TEST final long SEED = 4357; int xx; long ms; System.out.println("\nTime to test grabbing 100000000 ints"); Random rr = new Random(SEED); xx = 0; ms = System.currentTimeMillis(); for (j = 0; j < 100000000; j++) xx += rr.nextInt(); System.out.println("java.util.Random: " + (System.currentTimeMillis() - ms) + " Ignore this: " + xx); r = new MersenneTwisterFast(SEED); ms = System.currentTimeMillis(); xx = 0; for (j = 0; j < 100000000; j++) xx += r.nextInt(); System.out.println("Mersenne Twister Fast: " + (System.currentTimeMillis() - ms) + " Ignore this: " + xx); // TEST TO COMPARE TYPE CONVERSION BETWEEN // MersenneTwisterFast.java AND MersenneTwister.java System.out.println("\nGrab the first 1000 booleans"); r = new MersenneTwisterFast(SEED); for (j = 0; j < 1000; j++) { 
System.out.print(r.nextBoolean() + " "); if (j % 8 == 7) System.out.println(); } if (!(j % 8 == 7)) System.out.println(); System.out.println("\nGrab 1000 booleans of increasing probability using nextBoolean(double)"); r = new MersenneTwisterFast(SEED); for (j = 0; j < 1000; j++) { System.out.print(r.nextBoolean((double) (j / 999.0)) + " "); if (j % 8 == 7) System.out.println(); } if (!(j % 8 == 7)) System.out.println(); System.out.println("\nGrab 1000 booleans of increasing probability using nextBoolean(float)"); r = new MersenneTwisterFast(SEED); for (j = 0; j < 1000; j++) { System.out.print(r.nextBoolean((float) (j / 999.0f)) + " "); if (j % 8 == 7) System.out.println(); } if (!(j % 8 == 7)) System.out.println(); byte[] bytes = new byte[1000]; System.out.println("\nGrab the first 1000 bytes using nextBytes"); r = new MersenneTwisterFast(SEED); r.nextBytes(bytes); for (j = 0; j < 1000; j++) { System.out.print(bytes[j] + " "); if (j % 16 == 15) System.out.println(); } if (!(j % 16 == 15)) System.out.println(); byte b; System.out.println("\nGrab the first 1000 bytes -- must be same as nextBytes"); r = new MersenneTwisterFast(SEED); for (j = 0; j < 1000; j++) { System.out.print((b = r.nextByte()) + " "); if (b != bytes[j]) System.out.print("BAD "); if (j % 16 == 15) System.out.println(); } if (!(j % 16 == 15)) System.out.println(); System.out.println("\nGrab the first 1000 shorts"); r = new MersenneTwisterFast(SEED); for (j = 0; j < 1000; j++) { System.out.print(r.nextShort() + " "); if (j % 8 == 7) System.out.println(); } if (!(j % 8 == 7)) System.out.println(); System.out.println("\nGrab the first 1000 ints"); r = new MersenneTwisterFast(SEED); for (j = 0; j < 1000; j++) { System.out.print(r.nextInt() + " "); if (j % 4 == 3) System.out.println(); } if (!(j % 4 == 3)) System.out.println(); System.out.println("\nGrab the first 1000 ints of different sizes"); r = new MersenneTwisterFast(SEED); int max = 1; for (j = 0; j < 1000; j++) { System.out.print(r.nextInt(max) + 
" "); max *= 2; if (max <= 0) max = 1; if (j % 4 == 3) System.out.println(); } if (!(j % 4 == 3)) System.out.println(); System.out.println("\nGrab the first 1000 longs"); r = new MersenneTwisterFast(SEED); for (j = 0; j < 1000; j++) { System.out.print(r.nextLong() + " "); if (j % 3 == 2) System.out.println(); } if (!(j % 3 == 2)) System.out.println(); System.out.println("\nGrab the first 1000 longs of different sizes"); r = new MersenneTwisterFast(SEED); long max2 = 1; for (j = 0; j < 1000; j++) { System.out.print(r.nextLong(max2) + " "); max2 *= 2; if (max2 <= 0) max2 = 1; if (j % 4 == 3) System.out.println(); } if (!(j % 4 == 3)) System.out.println(); System.out.println("\nGrab the first 1000 floats"); r = new MersenneTwisterFast(SEED); for (j = 0; j < 1000; j++) { System.out.print(r.nextFloat() + " "); if (j % 4 == 3) System.out.println(); } if (!(j % 4 == 3)) System.out.println(); System.out.println("\nGrab the first 1000 doubles"); r = new MersenneTwisterFast(SEED); for (j = 0; j < 1000; j++) { System.out.print(r.nextDouble() + " "); if (j % 3 == 2) System.out.println(); } if (!(j % 3 == 2)) System.out.println(); System.out.println("\nGrab the first 1000 gaussian doubles"); r = new MersenneTwisterFast(SEED); for (j = 0; j < 1000; j++) { System.out.print(r.nextGaussian() + " "); if (j % 3 == 2) System.out.println(); } if (!(j % 3 == 2)) System.out.println(); } }MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/utils/Pair.java000066400000000000000000000047471277502137000231760ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. 
 *
 * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren
 * University Of Maryland
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */
package edu.umd.marbl.mhap.utils;

import java.io.Serializable;

public class Pair<A, B> implements Serializable
{
	public final A x;
	public final B y;

	/**
	 *
	 */
	private static final long serialVersionUID = -5782450990742961765L;

	public Pair(A x, B y)
	{
		this.x = x;
		this.y = y;
	}

	/* (non-Javadoc)
	 * @see java.lang.Object#equals(java.lang.Object)
	 */
	@Override
	public boolean equals(Object obj)
	{
		if (this == obj)
			return true;
		if (obj == null)
			return false;
		if (getClass() != obj.getClass())
			return false;

		Pair<?, ?> other = (Pair<?, ?>) obj;
		if (this.x == null)
		{
			if (other.x != null)
				return false;
		}
		else if (!this.x.equals(other.x))
			return false;
		if (this.y == null)
		{
			if (other.y != null)
				return false;
		}
		else if (!this.y.equals(other.y))
			return false;

		return true;
	}

	@Override
	public int hashCode()
	{
		return 31 * hashcode(this.x) + hashcode(this.y);
	}

	// todo move this to a helper class.
	private static int hashcode(Object o)
	{
		return o == null ? 0 : o.hashCode();
	}

	/* (non-Javadoc)
	 * @see java.lang.Object#toString()
	 */
	@Override
	public String toString()
	{
		return "[x=" + this.x + ", y=" + this.y + "]";
	}
}
MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/utils/ParseOptions.java000066400000000000000000000213271277502137000247220ustar00rootroot00000000000000
/*
 * MHAP package
 *
 * This software is distributed "as is", without any warranty, including
 * any implied warranty of merchantability or fitness for a particular
 * use. The authors assume no responsibility for, and shall not be liable
 * for, any special, indirect, or consequential damages, or any damages
 * whatsoever, arising out of or in connection with the use of this
 * software.
 *
 * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */
package edu.umd.marbl.mhap.utils;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;

/**
 * The Class ParseOptions.
*/ public class ParseOptions { public class Option { protected final Class objectClass; private final String description; private final String flag; private Object value; private boolean wasSet; private boolean required; @SuppressWarnings("unchecked") private Option(String flag, String description, T defaultValue) { this.flag = flag; this.description = description; this.value = defaultValue; this.wasSet = false; this.required = false; this.objectClass = (Class) defaultValue.getClass(); } private Option(String flag, String description, boolean required, Class objectClass) { this.flag = parseFlag(flag); this.description = description; this.value = null; this.wasSet = false; this.required = required; this.objectClass = objectClass; } private String parseFlag(String flag) { flag = flag.trim(); if (flag.startsWith("-")) flag = flag.substring(1); return flag; } public boolean isBoolean() { return Boolean.class.isAssignableFrom(this.objectClass); } public boolean isDouble() { return Double.class.isAssignableFrom(this.objectClass); } public boolean isInteger() { return Integer.class.isAssignableFrom(this.objectClass); } public boolean isString() { return String.class.isAssignableFrom(this.objectClass); } public boolean getBoolean() { return (Boolean) this.value; } public String getString() { if (this.value == null) return ""; return (String) this.value; } public double getDouble() { return (Double) this.value; } public int getInteger() { return (Integer) this.value; } public void setValue(Object value) { if (!this.objectClass.isInstance(value)) throw new RuntimeException("Incompatable value type for flag \"" + this.flag + "\" of type " + this.objectClass.getName() + ": " + value.getClass().getName()); this.value = value; this.wasSet = true; } public String getFlag() { return this.flag; } public boolean isSet() { return this.wasSet; } public boolean isRequired() { return this.required; } public Class getType() { return this.objectClass; } } private final ArrayList 
startText; private final HashMap> optionsByFlag; private final HashMap> optionsByName; public ParseOptions() { this.startText = new ArrayList<>(); this.optionsByFlag = new HashMap>(); this.optionsByName = new HashMap>(); addOption("-h", "Displays the help menu.", false); addOption("--help", "Displays the help menu.", false); addOption("--version", "Displays the version and build time.", false); } public void addStartTextLine(String text) { this.startText.add(text); } public void addOption(String flag, String description, T defaultValue) { flag = parseFlag(flag); if (this.optionsByFlag.get(flag) != null) return; Option option = new Option(flag, description, defaultValue); this.optionsByFlag.put(flag, option); this.optionsByName.put(flag, option); } public void setOptions(String flag, T defaultValue) { Option option = new Option(flag, getFlag(flag).description, defaultValue); this.optionsByFlag.put(flag, option); this.optionsByName.put(flag, option); } public void addRequiredOption(String flag, String description, Class objectClass) { flag = parseFlag(flag); Option option = new Option(flag, description, true, objectClass); this.optionsByFlag.put(flag, option); this.optionsByName.put(flag, option); } public String parseFlag(String flag) { return flag; } public boolean process(String[] args) { try { parse(args); if (needsHelp()) { System.out.println(helpMenuString()); return false; } else if (needsVersion()) { System.out.println("Version = "+ParseOptions.class.getPackage().getImplementationVersion()); return false; } checkParameters(); } catch (Exception e) { System.out.println(e.getMessage()); System.out.println(helpMenuString()); return false; } return true; } public Option get(String name) throws RuntimeException { Option option = this.optionsByName.get(name); if (option == null) throw new RuntimeException("Invalid option name \"" + name + "\"."); return option; } public Option getFlag(String flag) throws RuntimeException { Option option = 
this.optionsByFlag.get(flag); if (option == null) throw new RuntimeException("Invalid flag \"" + flag + "\"."); return option; } @Override public String toString() { StringBuilder menuString = new StringBuilder(); // sort the list ArrayList list = new ArrayList(this.optionsByFlag.keySet()); Collections.sort(list); for (String key : list) { Option currOption = this.optionsByFlag.get(key); menuString.append("" + currOption.flag + " = "); menuString.append("" + currOption.value); menuString.append("\n"); } return menuString.toString(); } public String helpMenuString() { StringBuilder menuString = new StringBuilder(); for (String str : this.startText) menuString.append(str+"\n"); // sort the list ArrayList list = new ArrayList(this.optionsByFlag.keySet()); Collections.sort(list); for (String key : list) { Option currOption = this.optionsByFlag.get(key); menuString.append("\t\t" + currOption.flag + ", "); if (currOption.isRequired()) menuString.append("*required, "); else { if (currOption.isString()) menuString.append("default = \"" + currOption.value + "\""); else menuString.append("default = " + currOption.value); } menuString.append("\n"); menuString.append("\t\t\t" + currOption.description); menuString.append("\n"); } return menuString.toString(); } public void checkParameters() { for (Option option : this.optionsByFlag.values()) if (option.required && !option.wasSet) throw new RuntimeException("Required option flag \"" + option.flag + "\" was not set."); } public boolean needsHelp() { return get("--help").getBoolean() || get("-h").getBoolean(); } public boolean needsVersion() { return get("--version").getBoolean(); } public void parse(String[] args) throws RuntimeException { for (int iter = 0; iter < args.length; iter++) { String flag = args[iter].trim(); if (!flag.startsWith("-")) throw new RuntimeException("Unknown parameter in command line: " + flag); flag = parseFlag(flag); Option option = getFlag(flag); if (option == null) throw new RuntimeException("Unknown 
flag \"" + flag + "\"."); else if (option.isBoolean()) option.setValue(true); else if (iter + 1 < args.length) { if (option.isDouble()) { option.setValue(Double.valueOf(args[iter + 1])); iter++; } else if (option.isInteger()) { option.setValue(Integer.valueOf(args[iter + 1])); iter++; } else if (option.isString()) { option.setValue(args[iter + 1]); iter++; } else throw new RuntimeException("Cannot parse flag \"" + option.getFlag() + "\" of type " + option.getType().getName() + "."); } else throw new RuntimeException("No value provided for flag \"" + option.getFlag() + "\" of type " + option.getType().getName() + "."); } } }MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/utils/RandomSequenceGenerator.java000066400000000000000000000075171277502137000270610ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2015 by Konstantin Berlin and Sergey Koren * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
* See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.utils; import java.util.LinkedList; import java.util.ListIterator; import edu.umd.marbl.mhap.impl.MhapRuntimeException; public final class RandomSequenceGenerator { private MersenneTwisterFast randGenerator; public RandomSequenceGenerator() { this.randGenerator = new MersenneTwisterFast(); } public RandomSequenceGenerator(int seed) { this.randGenerator = new MersenneTwisterFast(seed); } private final char getRandomBase(Character toExclude) { Character result = null; while (result == null) { double base = this.randGenerator.nextDouble(); if (base < 0.25) { result = 'A'; } else if (base < 0.5) { result = 'C'; } else if (base < 0.75) { result = 'G'; } else { result = 'T'; } if (toExclude != null && toExclude.equals(result)) { result = null; } } return result; } public String generateRandomSequence(int length) { StringBuilder str = new StringBuilder(length); for (int iter=0; iter1.00001) throw new MhapRuntimeException("Error rate must be less than or equal to 1.0."); double errorRate = insertionRate + deletionRate + substitutionRate; // use a linked list for insertions LinkedList modifiedSequence = new LinkedList<>(); for (char a : str.toCharArray()) modifiedSequence.add(a); // now mutate ListIterator iter = modifiedSequence.listIterator(); while (iter.hasNext()) { char i = iter.next(); if (randGenerator.nextDouble() < errorRate) { double errorType = randGenerator.nextDouble(); if (errorType < substitutionRate) { // mismatch // switch base iter.set(getRandomBase(i)); i++; } else if (errorType < insertionRate + substitutionRate) { // insert iter.previous(); iter.add(getRandomBase(null)); } else { // delete iter.remove(); } } else { // i++; } } StringBuilder returnedString = new StringBuilder(modifiedSequence.size()); for (char c : modifiedSequence) returnedString.append(c); return returnedString.toString(); } } 
MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/utils/ReadBuffer.java000066400000000000000000000027751277502137000243070ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * */ package edu.umd.marbl.mhap.utils; public final class ReadBuffer { private byte[] buff = new byte[2]; public final byte[] getBuffer(int size) { if (this.buff.length, AnyType> extends Pair implements Comparable> { /** * */ private static final long serialVersionUID = 2817516347839329908L; /** * Instantiates a new sortable pair. 
* * @param x * the x * @param y * the y */ public SortablePair(ComparableType x, AnyType y) { super(x, y); } /* * (non-Javadoc) * * @see java.lang.Comparable#compareTo(java.lang.Object) */ @Override public final int compareTo(SortablePair pair) { return this.x.compareTo(pair.x); } } MHAP-2.1.1/src/main/java/edu/umd/marbl/mhap/utils/Utils.java000066400000000000000000000346561277502137000234050ustar00rootroot00000000000000/* * MHAP package * * This software is distributed "as is", without any warranty, including * any implied warranty of merchantability or fitness for a particular * use. The authors assume no responsibility for, and shall not be liable * for, any special, indirect, or consequential damages, or any damages * whatsoever, arising out of or in connection with the use of this * software. * * Copyright (c) 2014 by Konstantin Berlin and Sergey Koren * University Of Maryland * * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
* */ package edu.umd.marbl.mhap.utils; import java.text.DecimalFormat; import java.text.NumberFormat; import java.util.ArrayList; import java.util.HashMap; import java.util.Random; import java.io.BufferedInputStream; import java.io.BufferedReader; import java.io.FileInputStream; import java.io.IOException; import java.io.InputStreamReader; import java.io.FileReader; import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream; import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream; public final class Utils { public enum ToProtein { AAA("K"), AAC("N"), AAG("K"), AAT("N"), ACA("T"), ACC("T"), ACG("T"), ACT("T"), AGA("R"), AGC("S"), AGG("R"), AGT( "S"), ATA("I"), ATC("I"), ATG("M"), ATT("I"), CAA("Q"), CAC("H"), CAG("Q"), CAT("H"), CCA("P"), CCC("P"), CCG( "P"), CCT("P"), CGA("R"), CGC("R"), CGG("R"), CGT("R"), CTA("L"), CTC("L"), CTG("L"), CTT("L"), GAA("E"), GAC( "D"), GAG("E"), GAT("D"), GCA("A"), GCC("A"), GCG("A"), GCT("A"), GGA("G"), GGC("G"), GGG("G"), GGT("G"), GTA( "V"), GTC("V"), GTG("V"), GTT("V"), TAA("X"), TAC("Y"), TAG("X"), TAT("Y"), TCA("S"), TCC("S"), TCG("S"), TCT( "S"), TGA("X"), TGC("C"), TGG("W"), TGT("C"), TTA("L"), TTC("F"), TTG("L"), TTT("F"); /* * Ala/A GCU, GCC, GCA, GCG Leu/L UUA, UUG, CUU, CUC, CUA, CUG Arg/R * CGU, CGC, CGA, CGG, AGA, AGG Lys/K AAA, AAG Asn/N AAU, AAC Met/M AUG * Asp/D GAU, GAC Phe/F UUU, UUC Cys/C UGU, UGC Pro/P CCU, CCC, CCA, CCG * Gln/Q CAA, CAG Ser/S UCU, UCC, UCA, UCG, AGU, AGC Glu/E GAA, GAG * Thr/T ACU, ACC, ACA, ACG Gly/G GGU, GGC, GGA, GGG Trp/W UGG His/H * CAU, CAC Tyr/Y UAU, UAC Ile/I AUU, AUC, AUA Val/V GUU, GUC, GUA, GUG * START AUG STOP UAG, UGA, UAA */ private String other; ToProtein(String other) { this.other = other; } public String getProtein() { return this.other; } } public enum Translate { A("T"), B("V"), C("G"), D("H"), G("C"), H("D"), K("M"), M("K"), N("N"), R("Y"), S("S"), T("A"), V("B"), W("W"), Y( "R"); private String other; Translate(String other) { 
this.other = other; } public String getCompliment() { return this.other; } } public static final int BUFFER_BYTE_SIZE = 8388608; // 8MB public static final NumberFormat DECIMAL_FORMAT = new DecimalFormat("############.########"); public static final int FASTA_LINE_LENGTH = 60; public static final int MBYTES = 1048576; public static int checkForEnd(String line, int brackets) { if (line.startsWith("{")) { brackets++; } if (line.startsWith("}")) { brackets--; } if (brackets == 0) { return -1; } return brackets; } // add new line breaks every FASTA_LINE_LENGTH characters public final static String convertToFasta(String supplied) { StringBuffer converted = new StringBuffer(); int i = 0; String[] split = supplied.trim().split("\\s+"); if (split.length > 1) { // process as a qual int size = 0; for (i = 0; i < split.length; i++) { converted.append(split[i]); size += split[i].length(); if (i != (split.length - 1)) { if (size >= FASTA_LINE_LENGTH) { size = 0; converted.append("\n"); } else { converted.append(" "); } } } } else { for (i = 0; (i + FASTA_LINE_LENGTH) < supplied.length(); i += FASTA_LINE_LENGTH) { converted.append(supplied.substring(i, i + FASTA_LINE_LENGTH)); converted.append("\n"); } converted.append(supplied.substring(i, supplied.length())); } return converted.toString(); } public final static int countLetterInRead(String fasta, String letter) { return countLetterInRead(fasta, letter, false); } public final static int countLetterInRead(String fasta, String letter, Boolean caseSensitive) { String ungapped = Utils.getUngappedRead(fasta); int len = ungapped.length(); if (len == 0) { return -1; } int increment = letter.length(); int count = 0; for (int i = 0; i <= ungapped.length() - increment; i += increment) { if (letter.equals(ungapped.substring(i, i + increment)) && caseSensitive) { count++; } if (letter.equalsIgnoreCase(ungapped.substring(i, i + increment)) && !caseSensitive) { count++; } } return count; } public final static int[] errorString(int[] s, 
double readError) { int[] snew = s.clone(); Random generator = new Random(); for (int iter = 0; iter < s.length; iter++) { if (generator.nextDouble() < readError) while (snew[iter] == s[iter]) snew[iter] = generator.nextInt(3); } return snew; } public final static BufferedReader getFile(String fileName, String[] postfix) throws IOException { if (fileName.endsWith("bz2")) { BZip2CompressorInputStream bzIn = new BZip2CompressorInputStream(new BufferedInputStream(new FileInputStream(fileName), BUFFER_BYTE_SIZE)); return new BufferedReader(new InputStreamReader(bzIn)); // open file as a pipe //System.err.println("Running command " + "bzip2 -dc " + new File(fileName).getAbsolutePath() + " |"); //Process p = Runtime.getRuntime().exec("bzip2 -dc " + new File(fileName).getAbsolutePath() + " |"); //bf = new BufferedReader(new InputStreamReader(p.getInputStream()), BUFFER_BYTE_SIZE); //System.err.println(bf.ready()); } else if (fileName.endsWith("gz")) { GzipCompressorInputStream bzIn = new GzipCompressorInputStream(new BufferedInputStream(new FileInputStream(fileName), BUFFER_BYTE_SIZE)); return new BufferedReader(new InputStreamReader(bzIn)); // open file as a pipe //System.err.println("Runnning comand " + "gzip -dc " + new File(fileName).getAbsolutePath() + " |"); //Process p = Runtime.getRuntime().exec("gzip -dc " + new File(fileName).getAbsolutePath() + " |"); //bf = new BufferedReader(new InputStreamReader(p.getInputStream()), BUFFER_BYTE_SIZE); //System.err.println(bf.ready()); } else { if (postfix==null) return new BufferedReader(new FileReader(fileName), BUFFER_BYTE_SIZE); int i = 0; for (i = 0; i < postfix.length; i++) { if (fileName.endsWith(postfix[i])) return new BufferedReader(new FileReader(fileName), BUFFER_BYTE_SIZE); } throw new IOException("Unknown file format of file " + fileName+"."); } } public final static String getID(String line) { String ids[] = line.split(":"); int commaPos = ids[1].indexOf(","); if (commaPos != -1) { return ids[1].substring(1, 
commaPos).trim(); } else { return ids[1]; } } public final static double getLetterPercentInRead(String fasta, String letter) { int ungappedLen = getUngappedRead(fasta).length(); int count = countLetterInRead(fasta, letter); return count / (double) ungappedLen; } public final static int getOvlSize(int readA, int readB, int ahang, int bhang) { if ((ahang <= 0 && bhang >= 0) || (ahang >= 0 && bhang <= 0)) { return -1; } if (ahang < 0) { return readA - Math.abs(bhang); } else { return readA - ahang; } } public final static int getRangeOverlap(int startA, int endA, int startB, int endB) { int minA = Math.min(startA, endA); int minB = Math.min(startB, endB); int maxA = Math.max(startA, endA); int maxB = Math.max(startB, endB); int start = Math.max(minA, minB); int end = Math.min(maxA, maxB); return (end - start + 1); } public final static String getUngappedRead(String fasta) { fasta = fasta.replaceAll("N", ""); fasta = fasta.replaceAll("-", ""); assert (fasta.length() >= 0); return fasta; } public final static String getValue(String line, String key) { if (line.startsWith(key)) { return line.split(":")[1]; } return null; } public final static double hashEfficiency(HashMap> c) { double e = hashEnthropy(c); double log2inv = 1.0 / Math.log(2); double scaling = Math.log(c.size()) * log2inv; return e / scaling; } public final static double hashEnthropy(HashMap> c) { double sum = 0.0; double log2inv = 1.0 / Math.log(2); double[] p = new double[c.size()]; int size = 0; int count = 0; for (ArrayList elem : c.values()) { size += elem.size(); p[count++] = elem.size(); } for (int iter = 0; iter < p.length; iter++) { double val = p[iter] / (double) size; sum -= val * Math.log(val) * log2inv; } return sum; } public final static boolean isAContainedInB(int startA, int endA, int startB, int endB) { int minA = Math.min(startA, endA); int minB = Math.min(startB, endB); int maxA = Math.max(startA, endA); int maxB = Math.max(startB, endB); return (minB < minA && maxB > maxA); } public 
final static Pair linearRegression(int[] a, int[] b, int size) { // take one pass and accumulate the sums in long to avoid int overflow long xy = 0; long x = 0; long y = 0; long x2 = 0; for (int iter = 0; iter < size; iter++) { xy += (long) a[iter] * b[iter]; x += a[iter]; y += b[iter]; x2 += (long) a[iter] * a[iter]; } double Ninv = 1.0 / (double) size; double beta = ((double) xy - Ninv * (double) (x * y)) / ((double) x2 - Ninv * (double) (x * x)); double alpha = Ninv * ((double) y - beta * (double) x); return new Pair(alpha, beta); } public final static double mean(double[] a, int size) { double x = 0.0; for (int iter = 0; iter < size; iter++) x += a[iter]; return x / (double) size; } public final static double mean(int[] a, int size) { int x = 0; for (int iter = 0; iter < size; iter++) x += a[iter]; return x / (double) size; } public final static double pearsonCorr(int[] a, int[] b, int size) { if (size < 2) return 0.0; double meana = mean(a, size); double meanb = mean(b, size); double stda = std(a, size, meana); double stdb = std(b, size, meanb); double r = 0.0; for (int iter = 0; iter < size; iter++) { r += ((double) a[iter] - meana) * ((double) b[iter] - meanb) / (stda * stdb); } return r / (double) (size - 1); } // adapted from // http://blog.teamleadnet.com/2012/07/quick-select-algorithm-find-kth-element.html public final static int quickSelect(int[] array, int k, int length) { if (array == null || length <= k) return Integer.MAX_VALUE; int from = 0; int to = length - 1; // if from == to we reached the kth element while (from < to) { int r = from; int w = to; int mid = array[(r + w) / 2]; // stop when the reader and writer meet while (r < w) { if (array[r] >= mid) { // put the large values at the end int tmp = array[w]; array[w] = array[r]; array[r] = tmp; w--; } else { // the value is smaller than the pivot, skip r++; } } // if we stepped up (r++) we need to step one down if (array[r] > mid) r--; // the r pointer is on the end of the first k elements if (k <= r) { to = r; } else { from = r + 1; } } return 
array[k]; } public final static String rc(String supplied) { StringBuilder st = new StringBuilder(); for (int i = supplied.length() - 1; i >= 0; i--) { char theChar = supplied.charAt(i); if (theChar != '-') { Translate t = Translate.valueOf(Character.toString(theChar).toUpperCase()); st.append(t.getCompliment()); } else { st.append("-"); } } return st.toString(); } public final static double std(double[] a, int size, double mean) { double x = 0.0; for (int iter = 0; iter < size; iter++) { double val = a[iter] - mean; x += val * val; } return Math.sqrt(x / (double) (size - 1)); } public final static double std(int[] a, int size, double mean) { double x = 0.0; for (int iter = 0; iter < size; iter++) { double val = (double) a[iter] - mean; x += val * val; } return Math.sqrt(x / (double) (size - 1)); } public final static String toProtein(String genome, boolean isReversed, int frame) { StringBuilder result = new StringBuilder(); if (isReversed) { genome = rc(genome); } genome = genome.replaceAll("-", ""); for (int i = frame; i < (genome.length() - 3); i += 3) { String codon = genome.substring(i, i + 3); String protein = ToProtein.valueOf(codon).getProtein(); result.append(protein); } return result.toString(); } public static String toString(double[][] A) { StringBuilder s = new StringBuilder(); s.append("["); for (double[] a : A) { if (a != null) { for (int iter = 0; iter < a.length - 1; iter++) s.append("" + a[iter] + ","); if (a.length > 0) s.append("" + a[a.length - 1]); } s.append("\n"); } s.append("]"); return new String(s); } public static String toString(float[][] A) { StringBuilder s = new StringBuilder(); s.append("["); for (float[] a : A) { if (a != null) { for (int iter = 0; iter < a.length - 1; iter++) s.append("" + a[iter] + ","); if (a.length > 0) s.append("" + a[a.length - 1]); } s.append("\n"); } s.append("]"); return new String(s); } public static String toString(long[][] A) { StringBuilder s = new StringBuilder(); s.append("["); for (long[] a : A) { 
if (a != null) { for (int iter = 0; iter < a.length - 1; iter++) s.append("" + a[iter] + ","); if (a.length > 0) s.append("" + a[a.length - 1]); } s.append("\n"); } s.append("]"); return new String(s); } } MHAP-2.1.1/src/main/resources/000077500000000000000000000000001277502137000157715ustar00rootroot00000000000000MHAP-2.1.1/src/main/resources/edu/000077500000000000000000000000001277502137000165465ustar00rootroot00000000000000MHAP-2.1.1/src/main/resources/edu/umd/000077500000000000000000000000001277502137000173335ustar00rootroot00000000000000MHAP-2.1.1/src/main/resources/edu/umd/marbl/000077500000000000000000000000001277502137000204305ustar00rootroot00000000000000MHAP-2.1.1/src/main/resources/edu/umd/marbl/mhap/000077500000000000000000000000001277502137000213555ustar00rootroot00000000000000MHAP-2.1.1/src/main/resources/edu/umd/marbl/mhap/README000066400000000000000000000000071277502137000222320ustar00rootroot00000000000000MHAP-2.1.1/src/test/000077500000000000000000000000001277502137000140125ustar00rootroot00000000000000MHAP-2.1.1/src/test/resources/000077500000000000000000000000001277502137000160245ustar00rootroot00000000000000MHAP-2.1.1/src/test/resources/edu/000077500000000000000000000000001277502137000166015ustar00rootroot00000000000000MHAP-2.1.1/src/test/resources/edu/umd/000077500000000000000000000000001277502137000173665ustar00rootroot00000000000000MHAP-2.1.1/src/test/resources/edu/umd/marbl/000077500000000000000000000000001277502137000204635ustar00rootroot00000000000000MHAP-2.1.1/src/test/resources/edu/umd/marbl/mhap/000077500000000000000000000000001277502137000214105ustar00rootroot00000000000000MHAP-2.1.1/src/test/resources/edu/umd/marbl/mhap/matrix/000077500000000000000000000000001277502137000227145ustar00rootroot00000000000000MHAP-2.1.1/src/test/resources/edu/umd/marbl/mhap/matrix/score_matrix.txt000066400000000000000000000035231277502137000261570ustar00rootroot00000000000000 A R N B D C Q Z E G H I L K M F P S T W Y V X * A 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 
-6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 R -6 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 N -6 -6 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 B -6 -6 -6 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 D -6 -6 -6 -6 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 C -6 -6 -6 -6 -6 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 Q -6 -6 -6 -6 -6 -6 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 Z -6 -6 -6 -6 -6 -6 -6 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 E -6 -6 -6 -6 -6 -6 -6 -6 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 G -6 -6 -6 -6 -6 -6 -6 -6 -6 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 H -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 I -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 L -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 K -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 M -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 5 -6 -6 -6 -6 -6 -6 -6 -6 -6 F -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 5 -6 -6 -6 -6 -6 -6 -6 -6 P -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 5 -6 -6 -6 -6 -6 -6 -6 S -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 5 -6 -6 -6 -6 -6 -6 T -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 5 -6 -6 -6 -6 -6 W -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 5 -6 -6 -6 -6 Y -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 5 -6 -6 -6 V -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 5 -6 -6 X -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 0 -6 * -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 -6 0