megahit-1.2.9/.clang-format

# clang-format -i --style=file
# find . -iname '*.cpp' -o -iname '*.h' -o -iname '*.h.in' | xargs clang-format -i --style=file
BasedOnStyle: Google
Language: Cpp
Cpp11BracedListStyle: true
Standard: Cpp11

megahit-1.2.9/.gitignore

# Compiled Object files
*.slo
*.lo
*.o
*.obj

# Compiled Dynamic libraries
*.so
*.dylib
*.dll

# Compiled Static libraries
*.lai
*.la
*.a
*.lib

# Executables
*.exe
*.out
*.app

# ctags
*.tags
*.tags_sorted_by_file

# Other library

# Sublime Text file
cmake-build-debug
.idea
build
debug
release
venv
clang
sanitizer
cov
tsan

megahit-1.2.9/.travis.yml

dist: xenial
before_install:
  - sudo apt-get update
  - sudo apt-get install -y wget
language: python
python:
  - "2.7"
  - "3.4"
script:
  - mkdir build
  - cd build && cmake .. -DCOVERAGE=ON
  - make -j2 simple_test
  - sudo make install
  - megahit --test
  - megahit --test --kmin-1pass
  - megahit --test --no-hw-accel
after_success:
  - bash <(curl -s https://codecov.io/bash) || echo "Codecov did not collect coverage reports"

megahit-1.2.9/CHANGELOG.md

### 1.2.9 / 2019-10-13

- Fixed a segfault triggered by length-zero sequences
- Fixed a memory-detection problem on some outdated macOS versions
- Fixed an incorrect assertion in unitig graph refreshing
- Added `--verbose` to output the full log to the screen

### 1.2.8 / 2019-08-10

- Added an intermediate `megahit_core_popcnt` for CPUs that have ABM but not BMI2
- Allow starting a new assembly task with `--continue`

### 1.2.7 / 2019-07-28

- Symlinked `megahit_core_no_hw_accel` to `megahit_toolkit` for backward compatibility
- Better logging of memory adjustment during SDBG building
- Attempt to continue SDBG building even if the user-specified memory size is insufficient

### 1.2.6 / 2019-07-13

- Refactored and fixed a bug in the local assembler
- Refactored the `megahit` script
- Obtain the total memory size from `os.sysconf`
- Fixed a segmentation fault on Mac OS with clang 4.0
- Added the `--cleaning-rounds` and `--disconnect-ratio` options for more flexible graph-cleaning control

### 1.2.5-beta / 2019-06-28 PST

- Fixed a bug that caused higher memory usage in seq2sdbg
- Refactored sequence sorters, edge I/O and contig I/O

### 1.2.4-beta / 2019-05-25 PST

- Fixed a few memory leaks
- Use std::vector to replace malloc in the SDBG builders
- Tried to fix a potential problem caused by a benign data race in unitig graph refreshing
- Faster by using phmap and xxh3

### 1.2.3-beta / 2019-05-12 PST

- Refactored sequence readers
- Fixed a bug in SDBG building for large k-mer sizes

### 1.2.2-beta / 2019-04-16 PST

- Automatically detect POPCNT/BMI2 and select the correct megahit_core binary

### 1.2.1-beta / 2019-03-30 PST

- Added the `--no-hw-accel` option for users whose CPUs do not support BMI2/POPCNT
- Added the `--test` option for testing
- Compilable with CMake 2.8 and g++ 4.8

### 1.2.0-beta / 2019-03-24 PST

Heavily refactored the whole project:

- Removed GPU support
- Use cmake
- Use sparsepp to replace IDBA's hash map for better performance in both speed and memory efficiency
- Use the pdep instruction to speed up rank and select
- Rewrote the unitig graph
- Rewrote the iterate-edge component
- Rewrote the SDBG library, except for the builder
- Fixed a bug that could pull too many reads into local assembly

The changes result in a faster and more memory-efficient tool, but have little effect on assembly quality.

### 1.1.4 / 2018-11-01 PST

- Fixed a bug in the mercy-edge stage in 1-pass mode

### 1.1.3 / 2018-03-02 PST

- Fixed a bug in the atomic bit vector that could cause a race condition

### 1.1.2 / 2017-08-01 HKT

- Hotfix for an integer overflow bug

### 1.1.1 / 2016-12-08 HKT

- Added the `-f` option to force overwriting of the output directory
- Added the `--bubble-level` option to control bubble merging; though level 3 (i.e. super bubble) is not mature (the default level is 2)
- Optimized the speed of tip removal

### 1.1-beta / 2016-11-30 HKT

- Added components to better handle high-depth errors
- Added components to merge super bubbles
- Fine-tuned k-mer sizes to support longer reads (150bp)
- In general, it produces longer contigs compared to previous versions

### 1.0.6 / 2016-06-06 HKT

- Fixed a bug that caused edge multiplicity to be ignored entirely. This bug had existed since v1.0.4-beta

### 1.0.5 / 2016-05-17 HKT

- Removed the requirement for CPU\_thread >= 2
- Added the `--tmp-dir` option
- More user-friendly error message on bad paired-end files
- Fixed a bug that could stop an assembly prematurely

### 1.0.4-beta / 2016-02-16 HKT

- Faster indexing of the succinct de Bruijn graph via prefix lookup
- Added `--prune-level` 3 and the `--prune-depth` option for more aggressive pruning
- Tuned `bulk` parameters
- Support reads with length >= 65536 bp

### 1.0.3 / 2015-10-11 HKT

- Hotfix for a 32-bit integer overflow in the number of tip nodes in the SdBG

### 1.0.2 / 2015-08-15 HKT

- Fixed a bug when the number of large multiplicities > INT\_MAX
- Fixed an infinite loop in the local assembler when the number of reads is 0
- Use `mmap` for edge/sdbg IO
- Corrected the rounding of edge multiplicity from float to integer

### 1.0.1 / 2015-07-31 HKT

- Fixed a 32-bit integer overflow in the number of SdBG edges
### 1.0.0-beta / 2015-07-23 HKT

- `--presets` option: preset parameters for different types of assembly
- New CPU sorting design: faster k-mer counting & graph construction
- New unitig graph and edge multiplicity design: more accurate assembly
- Merge bubbles carefully at small *k*: reduces the occurrence of bubble collapsing

### 0.3.3-a / 2015-07-04 HKT

- Hotfix for incorrect max read length when using multiple libraries
- Hotfix for `--input-cmd` idling

### 0.3.3 / 2015-07-04 HKT

- Fixed a segmentation fault when a read is all N
- Fixed continue mode: check continue mode before writing binary reads
- Slightly improved the SdBG traversal functions

### 0.3.2-beta / 2015-06-28 HKT

- Fine-tuned local assembly multi-thread scheduling
- Fine-tuned SdBG builder memory usage

### 0.3.0-beta3 / 2015-06-22 HKT

- Added the `--verbose` option
- Fixed a bug when reading paired-end reads of different lengths
- Print assembly stats at the end of the screen message
- Print read library info in the screen message, and the number of reads in each library to the log file

### 0.3.0-beta2 / 2015-06-20 HKT

- Added the missing file `citycrc.h`

### 0.3.0-beta / 2015-06-18 HKT

New features:

- `--max-read-len` parameter no longer required
- `--memory` option set to 0.9 by default
- Make use of PE information (with local assembly)
- `--prune-level` and `--merge-level` for setting pruning and merging intensity
- `--kmin-1pass` option for assembling ultra-low-coverage datasets in less memory
- Support for bzip2 input files
- Useful tools in `megahit_toolkit`, including contig2fastg for converting contig files into SPAdes-like fastg

### 0.2.1 / 2015-03-18

Bug fixes:

- Fixed incorrect mercy-kmer searching when read length >= 255
- Minor bugs (huge log in some edge cases, etc.)

New features:

- Semi-automatic memory setting: when 0 < "-m" < 1, use that fraction of the machine's memory
- `--out-prefix` option
- `--cpu-only` turned on by default, and `--use-gpu` option to enable GPU
- Python 3 compatibility

### 0.2.0 / 2015-01-30

Bug fixes:

- Fixed "option --num-cpu-threads not recognized"

Enhancements:

- `--mem-flag` option for memory control
- `--continue` option to resume an interrupted run
- Support mixed fasta/fastq input via `kseq.h`

### 0.1.4 / 2015-01-20

Bug fixes:

- Fixed crashes related to OpenMP

### 0.1.3 / 2014-12-01

Enhancements:

- Mac OS X support
- Minor improvement to reduce memory usage of the SdBG builder

Bug fixes:

- Fixed crashes in some edge cases

### 0.1.2 / 2014-10-10

Enhancements:

- Updated the Makefile and Python wrapper to improve compatibility
- Output the exit code when a subprogram exits abnormally
- Improved memory usage of the subprogram `assembler`
- Fixed the issue of minor stat differences caused by loops

Bug fixes:

- Use `get_ref_with_lock()` to ensure the `hash_map` is thread-safe when updating values
- Corrected the computation of edge multiplicity when iterating from small *k* to large *k*
- Fixed a bug that caused segfaults in the subprogram `sdbg_builder`

### 0.1.1 beta / 2014-10-02

Enhancements:

- Added change log
- More detailed README for the input format
- Use `CompactSequence` in `UnitigGraph`
- Removed unused parallel sorting code

Bug fixes:

- Fixed wrong computation of `word_per_read` in `cx1_functions.cpp`
- Fixed a crash in `FastxReader` if the file is empty
- Fixed a floating point error in `assembly_algorithms.cpp`
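The 1.2.0-beta entry above credits BMI2's `pdep` instruction with speeding up rank and select on the succinct de Bruijn graph. The snippet below is a minimal, hedged sketch of that well-known trick (select-within-a-word via `_pdep_u64`), written from general knowledge rather than taken from MEGAHIT's sources; it assumes an x86-64 CPU with BMI2 and GCC/Clang, compiled with `-mbmi2`.

```cpp
#include <cstdint>
#include <cstdio>
#include <immintrin.h>

// Select the (i+1)-th set bit of x (0-based result): PDEP deposits a lone
// bit into the i-th set-bit "slot" of the mask x, and counting trailing
// zeros then recovers that bit's position in constant time.
inline unsigned Select64(uint64_t x, unsigned i) {
  return __builtin_ctzll(_pdep_u64(uint64_t{1} << i, x));
}

int main() {
  const uint64_t x = 0xB4;  // binary 10110100: set bits at positions 2, 4, 5, 7
  for (unsigned i = 0; i < 4; ++i) {
    std::printf("%u ", Select64(x, i));  // prints "2 4 5 7"
  }
  std::printf("\n");
  return 0;
}
```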
megahit-1.2.9/CMakeLists.txt

cmake_minimum_required(VERSION 2.8)
project(megahit)

set(CMAKE_VERBOSE_MAKEFILE ON)

if (CMAKE_VERSION VERSION_LESS "3.1")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11")
else ()
    set(CMAKE_CXX_STANDARD 11)
endif ()

option(COVERAGE "Generate coverage report" OFF)
option(STATIC_BUILD "Build static executable" OFF)
option(SANITIZER "Enable sanitizers" OFF)
option(TSAN "Enable thread sanitizers" OFF)

include_directories(src)

FILE(GLOB_RECURSE ASMBL_SOURCE "src/assembly/*.cpp")
FILE(GLOB_RECURSE LCASM_SOURCE "src/localasm/*.cpp")
FILE(GLOB_RECURSE IDBA_SOURCE "src/idba/*.cpp")
FILE(GLOB_RECURSE SDBG_SOURCE "src/sdbg/*.cpp")
FILE(GLOB_RECURSE CX1_SOURCE "src/sorting/*.cpp")
FILE(GLOB_RECURSE SEQ_SOURCE "src/sequence/*.cpp")
FILE(GLOB_RECURSE TOOLKIT_SOURCE "src/tools/*.cpp")

LIST(APPEND OTHER_SOURCE
        src/main.cpp
        src/main_assemble.cpp
        src/main_buildlib.cpp
        src/main_iterate.cpp
        src/main_local_assemble.cpp
        src/main_sdbg_build.cpp
        src/utils/options_description.cpp
        )

if (STATIC_BUILD)
    set(CMAKE_FIND_LIBRARY_SUFFIXES ".a")
endif (STATIC_BUILD)

find_package(ZLIB REQUIRED)
find_package(OpenMP REQUIRED)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DXXH_INLINE_ALL -ftemplate-depth=3000")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wno-unused-function")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fprefetch-loop-arrays -funroll-loops")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D__XROOT__='\"${CMAKE_SOURCE_DIR}/src\"'")
#set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D__XFILE__='\"$(subst ${CMAKE_SOURCE_DIR}/,,$(abspath $<))\"'")

set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} ${ZLIB_EXE_LINKER_FLAGS} ${OpenMP_EXE_LINKER_FLAGS}")
set(CMAKE_CXX_FLAGS_RELEASE "-O3 -DNDEBUG")
set(CMAKE_CXX_FLAGS_DEBUG "-g -ggdb -O1 -D_LIBCPP_DEBUG -D_GLIBCXX_DEBUG")

if (COVERAGE)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -O0 --coverage")
    set(COV_PY "coverage run")
endif (COVERAGE)

if (SANITIZER)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=address -fsanitize=leak -fsanitize=undefined")
endif (SANITIZER)

if (TSAN)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=thread")
endif (TSAN)

message(STATUS "Build type: ${CMAKE_BUILD_TYPE}: ${CMAKE_CXX_FLAGS}")

add_executable(megahit_core ${OTHER_SOURCE} ${ASMBL_SOURCE} ${IDBA_SOURCE} ${SDBG_SOURCE}
        ${LCASM_SOURCE} ${SEQ_SOURCE} ${CX1_SOURCE} ${TOOLKIT_SOURCE})
add_executable(megahit_core_popcnt ${OTHER_SOURCE} ${ASMBL_SOURCE} ${IDBA_SOURCE} ${SDBG_SOURCE}
        ${LCASM_SOURCE} ${SEQ_SOURCE} ${CX1_SOURCE} ${TOOLKIT_SOURCE})
add_executable(megahit_core_no_hw_accel ${OTHER_SOURCE} ${ASMBL_SOURCE} ${IDBA_SOURCE} ${SDBG_SOURCE}
        ${LCASM_SOURCE} ${SEQ_SOURCE} ${CX1_SOURCE} ${TOOLKIT_SOURCE})

set_target_properties(megahit_core PROPERTIES COMPILE_FLAGS "-mbmi2 -DUSE_BMI2 -mpopcnt")
set_target_properties(megahit_core_popcnt PROPERTIES COMPILE_FLAGS "-mpopcnt")

if (STATIC_BUILD)
    # TODO dirty
    set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -static")
    set_target_properties(megahit_core megahit_core_popcnt megahit_core_no_hw_accel PROPERTIES LINK_SEARCH_START_STATIC ON)
    set_target_properties(megahit_core megahit_core_popcnt megahit_core_no_hw_accel PROPERTIES LINK_SEARCH_END_STATIC ON)
endif (STATIC_BUILD)

target_link_libraries(megahit_core ${ZLIB_LIBRARIES})
target_link_libraries(megahit_core_popcnt ${ZLIB_LIBRARIES})
target_link_libraries(megahit_core_no_hw_accel ${ZLIB_LIBRARIES})

add_custom_target(
        megahit ALL
        COMMAND cp ${CMAKE_SOURCE_DIR}/src/megahit .
)

add_custom_target(
        megahit_toolkit ALL
        COMMAND ${CMAKE_COMMAND} -E create_symlink megahit_core_no_hw_accel megahit_toolkit)

set(TEST_DATA ${CMAKE_SOURCE_DIR}/test_data)

add_custom_target(
        simple_test
        COMMAND ./megahit --test -t 2
        COMMAND MEGAHIT_NUM_MERCY_FACTOR=1.5 ./megahit --test -t 4 --mem-flag 0 --no-hw-accel
        COMMAND ./megahit --test -t 2 --kmin-1pass --prune-level 3 --prune-depth 0
        COMMAND rm -rf test-random && python3 ${TEST_DATA}/generate_random_fasta.py > random.fa && ./megahit -r random.fa --k-list 255 --min-count 1 -o test-random
        COMMAND rm -rf test-fastg && ./megahit --test -t 2 --mem-flag 2 --keep-tmp-files -o test-fastg
        COMMAND rm -rf test-empty && ./megahit -r ${TEST_DATA}/empty.fa -o test-empty
        COMMAND rm -rf test-no-contig && ./megahit -r ${TEST_DATA}/r4.fa -o test-no-contig
        COMMAND ./megahit_toolkit contig2fastg 59 test-fastg/intermediate_contigs/k59.contigs.fa > 59.fastg
        COMMAND ./megahit_toolkit readstat < test-fastg/intermediate_contigs/k59.contigs.fa
)

add_dependencies(megahit megahit_core megahit_core_popcnt megahit_core_no_hw_accel megahit_toolkit)
add_dependencies(simple_test megahit)

install(TARGETS megahit_core megahit_core_popcnt megahit_core_no_hw_accel DESTINATION bin)
install(PROGRAMS ${CMAKE_CURRENT_BINARY_DIR}/megahit ${CMAKE_CURRENT_BINARY_DIR}/megahit_toolkit DESTINATION bin)
install(DIRECTORY test_data DESTINATION share/${PROJECT_NAME})
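The build above produces three core binaries that differ only in their `-mpopcnt`/`-mbmi2` compile flags, and the 1.2.2-beta changelog entry notes that the front end auto-detects which one to run. Below is a hedged sketch of such runtime dispatch; it is illustrative only (the actual selection lives in the Python `megahit` wrapper script, which is not shown in this section), `PickCoreBinary` is a hypothetical name, and it assumes GCC or Clang on x86.

```cpp
#include <cstdio>

// Hypothetical dispatcher mirroring the three build targets above: prefer
// the fully accelerated core, fall back to POPCNT-only, then to the
// portable build. __builtin_cpu_supports queries CPUID at runtime.
static const char *PickCoreBinary() {
  if (__builtin_cpu_supports("bmi2") && __builtin_cpu_supports("popcnt"))
    return "megahit_core";            // compiled with -mbmi2 -DUSE_BMI2 -mpopcnt
  if (__builtin_cpu_supports("popcnt"))
    return "megahit_core_popcnt";     // compiled with -mpopcnt only
  return "megahit_core_no_hw_accel";  // no hardware acceleration assumed
}

int main() {
  std::puts(PickCoreBinary());
  return 0;
}
```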
megahit-1.2.9/Dockerfile

FROM ubuntu:18.04
RUN apt-get update
RUN apt-get install -y g++ make zlib1g-dev gzip bzip2 cmake python --no-install-recommends
COPY . /root/megahit
WORKDIR /root/megahit
RUN rm -rf build
RUN mkdir -p build
WORKDIR build
RUN cmake -DCMAKE_BUILD_TYPE=Release ..
RUN make -j4
RUN make install
RUN megahit --test
RUN megahit --test --kmin-1pass

megahit-1.2.9/LICENSE

GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007

Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

Preamble

The GNU General Public License is a free, copyleft license for software and other kinds of works.

The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others.
For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it. For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions. Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users. Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free. The precise terms and conditions for copying, distribution and modification follow. TERMS AND CONDITIONS 0. Definitions. "This License" refers to version 3 of the GNU General Public License. "Copyright" also means copyright-like laws that apply to other kinds of works, such as semiconductor masks. "The Program" refers to any copyrightable work licensed under this License. Each licensee is addressed as "you". "Licensees" and "recipients" may be individuals or organizations. To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work. A "covered work" means either the unmodified Program or a work based on the Program. To "propagate" a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well. To "convey" a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying. 
An interactive user interface displays "Appropriate Legal Notices" to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion. 1. Source Code. The "source code" for a work means the preferred form of the work for making modifications to it. "Object code" means any non-source form of a work. A "Standard Interface" means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language. The "System Libraries" of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A "Major Component", in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it. The "Corresponding Source" for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work. The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source. The Corresponding Source for a work in source code form is that same work. 2. Basic Permissions. All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law. You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. 
You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you. Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 3. Protecting Users' Legal Rights From Anti-Circumvention Law. No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures. When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures. 4. Conveying Verbatim Copies. You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program. You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee. 5. Conveying Modified Source Versions. You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions: a) The work must carry prominent notices stating that you modified it, and giving a relevant date. b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to "keep intact all notices". c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it. d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so. 
A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an "aggregate" if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate. 6. Conveying Non-Source Forms. You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways: a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange. b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge. c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b. d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements. e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d. A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work. A "User Product" is either (1) a "consumer product", which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. 
For a particular product received by a particular user, "normally used" refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product. "Installation Information" for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made. If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM). The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network. Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying. 7. Additional Terms. "Additional permissions" are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions. When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission. 
Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms: a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or d) Limiting the use for publicity purposes of names of licensors or authors of the material; or e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors. All other non-permissive additional terms are considered "further restrictions" within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying. If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms. Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way. 8. Termination. You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11). However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation. Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice. Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10. 9. 
Acceptance Not Required for Having Copies. You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so. 10. Automatic Licensing of Downstream Recipients. Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License. An "entity transaction" is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts. You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it. 11. Patents. A "contributor" is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor's "contributor version". A contributor's "essential patent claims" are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, "control" includes the right to grant patent sublicenses in a manner consistent with the requirements of this License. Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version. In the following three paragraphs, a "patent license" is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To "grant" such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party. 
If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. "Knowingly relying" means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid. If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it. A patent license is "discriminatory" if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007. Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law. 12. No Surrender of Others' Freedom. If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program. 13. Use with the GNU Affero General Public License. Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU Affero General Public License into a single combined work, and to convey the resulting work. 
The terms of this License will continue to apply to the part which is the covered work, but the special requirements of the GNU Affero General Public License, section 13, concerning interaction through a network will apply to the combination as such. 14. Revised Versions of this License. The Free Software Foundation may publish revised and/or new versions of the GNU General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU General Public License, you may choose any version ever published by the Free Software Foundation. If the Program specifies that a proxy can decide which future versions of the GNU General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program. Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version. 15. Disclaimer of Warranty. THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 16. Limitation of Liability. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 17. Interpretation of Sections 15 and 16. If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. 
It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found.

    {one line to give the program's name and a brief idea of what it does.}
    Copyright (C) {year} {name of author}

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program. If not, see <https://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

If the program does terminal interaction, make it output a short notice like this when it starts in an interactive mode:

    {project} Copyright (C) {year} {fullname}
    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, your program's commands might be different; for a GUI interface, you would use an "about box".

You should also get your employer (if you work as a programmer) or school, if any, to sign a "copyright disclaimer" for the program, if necessary. For more information on this, and how to apply and follow the GNU GPL, see <https://www.gnu.org/licenses/>.

The GNU General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. But first, please read <https://www.gnu.org/licenses/why-not-lgpl.html>.

megahit-1.2.9/README.md

MEGAHIT
=======

[![BioConda Install](https://img.shields.io/conda/dn/bioconda/megahit.svg?style=flat-square&label=BioConda%20install)](https://anaconda.org/bioconda/megahit)
[![Downloads](https://img.shields.io/github/downloads/voutcn/megahit/total?style=flat-square)](https://github.com/voutcn/megahit/releases)
[![Build Status](https://img.shields.io/travis/voutcn/megahit?style=flat-square)](https://travis-ci.org/voutcn/megahit)
[![codecov](https://img.shields.io/codecov/c/github/voutcn/megahit?style=flat-square)](https://codecov.io/gh/voutcn/megahit)

MEGAHIT is an ultra-fast and memory-efficient NGS assembler. It is optimized for metagenomes, but also works well on generic single-genome assembly (small or mammalian size) and single-cell assembly.
Installation
---------------

### Conda

```sh
conda install -c bioconda megahit
```

### Pre-built binaries for x86_64 Linux

```sh
wget https://github.com/voutcn/megahit/releases/download/v1.2.9/MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz
tar zvxf MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz
cd MEGAHIT-1.2.9-Linux-x86_64-static/bin/
./megahit --test  # run on a toy dataset
./megahit -1 MY_PE_READ_1.fq.gz -2 MY_PE_READ_2.fq.gz -o MY_OUTPUT_DIR
```

### Pre-built docker image

```sh
# in the directory with the input reads
docker run -v $(pwd):/workspace -w /workspace --user $(id -u):$(id -g) vout/megahit \
    megahit -1 MY_PE_READ_1.fq.gz -2 MY_PE_READ_2.fq.gz -o MY_OUTPUT_DIR
```

### Building from source

#### Prerequisites

- For building: zlib, cmake >= 2.8, g++ >= 4.8.4
- For running: gzip and bzip2

```sh
git clone https://github.com/voutcn/megahit.git
cd megahit
git submodule update --init
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release  # add -DCMAKE_INSTALL_PREFIX=MY_PREFIX if needed
make -j4
make simple_test  # will test MEGAHIT with a toy dataset
# make install if needed
```

Usage
-----

### Basic usage

```sh
megahit -1 pe_1.fq -2 pe_2.fq -o out  # 1 paired-end library
megahit --12 interleaved.fq -o out  # one paired & interleaved paired-end library
megahit -1 a1.fq,b1.fq,c1.fq -2 a2.fq,b2.fq,c2.fq -r se1.fq,se2.fq -o out  # 3 paired-end libraries + 2 SE libraries
megahit_core contig2fastg 119 out/intermediate_contigs/k119.contigs.fa > k119.fastg  # get FASTG from the intermediate contigs of k=119
```

The final contigs can be found in `final.contigs.fa` in the output directory.

### Advanced usage

- `--kmin-1pass`: use if the sequencing depth is low and building the graph at k_min takes too much memory
- `--presets meta-large`: use if the metagenome is complex (i.e., high biodiversity, e.g. soil metagenomes)
- `--cleaning-rounds 1 --disconnect-ratio 0`: get a less-pruned assembly (usually shorter contigs)
- `--continue -o out`: resume an interrupted job from `out`

To see the full manual, run `megahit` without parameters or with `-h`. Also, our [wiki](https://github.com/voutcn/megahit/wiki) may be helpful.

Publications
------------

- Li, D., Liu, C-M., Luo, R., Sadakane, K., and Lam, T-W., (2015) MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. *Bioinformatics*, doi: 10.1093/bioinformatics/btv033 \[PMID: [25609793](http://www.ncbi.nlm.nih.gov/pubmed/25609793)\].
- Li, D., Luo, R., Liu, C.M., Leung, C.M., Ting, H.F., Sadakane, K., Yamashita, H. and Lam, T.W., 2016. MEGAHIT v1.0: A Fast and Scalable Metagenome Assembler driven by Advanced Methodologies and Community Practices. *Methods*.

License
-------

This project is licensed under the GPLv3 License - see the [LICENSE](LICENSE) file for details
megahit-1.2.9/azure-pipelines.yml

jobs:
  - job: ubuntu_1604
    pool:
      vmImage: 'Ubuntu-16.04'
    strategy:
      matrix:
        python36:
          python.version: '3.6'
          build.type: 'Debug'
          sanitizer: 'ON'
          static: 'OFF'
        Python27:
          python.version: '2.7'
          build.type: 'Release'
          sanitizer: 'OFF'
          static: 'ON'
    steps:
      - task: UsePythonVersion@0
        inputs:
          versionSpec: '$(python.version)'
          addToPath: true
      - script: |
          mkdir build
          cd build
          cmake -DCMAKE_BUILD_TYPE=$(build.type) -DSANITIZER=$(sanitizer) -DSTATIC_BUILD=$(static) ..
          make simple_test -j `nproc`
        displayName: 'build and test'
  - job: macos
    strategy:
      matrix:
        1013:
          image: macos-10.13
        latest:
          image: macos-latest
    pool:
      vmImage: $(image)
    steps:
      - script: |
          brew install cmake gcc@9 zlib bzip2
        displayName: 'install dependencies'
      - script: |
          mkdir build
          cd build
          CC=gcc-9 CXX=g++-9 cmake ..
          make simple_test -j `sysctl -n hw.physicalcpu`
        displayName: 'build and test'
  - job: assembly
    timeoutInMinutes: 0
    strategy:
      matrix:
        codecov:
          build.type: 'Release'
          sanitizer: 'OFF'
          coverage: 'ON'
        sanitize:
          build.type: 'Debug'
          sanitizer: 'ON'
          coverage: 'OFF'
    pool:
      vmImage: 'Ubuntu-16.04'
    steps:
      - script: |
          mkdir build
          cd build
          cmake -DCMAKE_BUILD_TYPE=$(build.type) -DSANITIZER=$(sanitizer) -DCOVERAGE=$(coverage) ..
          make -j `nproc`
          make simple_test
          sudo make install
        displayName: 'build and test'
      - script: |
          curl -o- ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR752/007/SRR7521507/SRR7521507_1.fastq.gz | gzip -cd | head -4000000 | gzip -1 > 1.fq.gz
          curl -o- ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR752/007/SRR7521507/SRR7521507_2.fastq.gz | gzip -cd | head -4000000 | gzip -1 > 2.fq.gz
          megahit --presets meta-large -1 1.fq.gz -2 2.fq.gz -m5e9 --verbose
        displayName: 'assemble'
      - script: |
          if [ $(coverage) = 'ON' ]; then
            wget http://downloads.sourceforge.net/ltp/lcov-1.14.tar.gz
            tar zvxf lcov-1.14.tar.gz
            export PATH=lcov-1.14/bin/:${PATH}
            lcov --capture --directory . --output-file coverage.info
            lcov --remove coverage.info '/usr/*' --output-file coverage.info  # filter system-files
            lcov --remove coverage.info '*xxhash/*' --output-file coverage.info  # filter xxhash-files
            lcov --remove coverage.info '*parallel_hashmap/*' --output-file coverage.info  # filter parallel-hashmap-files
            lcov --remove coverage.info '*pprintpp/*' --output-file coverage.info  # filter pprintpp files
            lcov --list coverage.info  # debug info
            bash <(curl -s https://codecov.io/bash) -f coverage.info -t $(CODECOV_TOKEN) || echo "Codecov did not collect coverage reports"
          fi
        displayName: 'codecov'

megahit-1.2.9/codecov.yml

coverage:
  status:
    patch:
      default:
        target: 0%
    project:
      default:
        target: 0%

megahit-1.2.9/src/assembly/all_algo.h

//
// Created by vout on 11/21/18.
//

#ifndef MEGAHIT_ALL_ALGO_H
#define MEGAHIT_ALL_ALGO_H

#include "bubble_remover.h"
#include "low_depth_remover.h"
#include "sdbg_pruning.h"
#include "tip_remover.h"
#include "weak_link_remover.h"

#endif  // MEGAHIT_ALL_ALGO_H
megahit-1.2.9/src/assembly/bubble_remover.cpp

//
// Created by vout on 11/21/18.
//

#include "bubble_remover.h"

#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

namespace {
// helper function: banded edit-distance similarity of two sequences
double GetSimilarity(const std::string &a, const std::string &b,
                     double min_similarity) {
  int n = a.length();
  int m = b.length();
  int max_indel = std::max(n, m) * (1 - min_similarity);

  if (abs(n - m) > max_indel) {
    return 0;
  }
  if (max_indel < 1) {
    return 0;
  }

  std::vector<int> dp[2];
  for (int i = 0; i < 2; ++i) {
    dp[i].resize(max_indel * 2 + 1, 0);
  }

#define IDX(j, i) ((j) - (i) + max_indel)

  for (int j = 0; j <= max_indel; ++j) {
    dp[0][IDX(j, 0)] = j;
  }

  for (int i = 1; i <= n; ++i) {
    std::fill(dp[i & 1].begin(), dp[i & 1].end(), 0x3f3f3f3f);
    if (i - max_indel <= 0) {
      dp[i & 1][IDX(0, i)] = i;
    }

    for (int j = std::max(i - max_indel, 1); j <= m && j <= i + max_indel; ++j) {
      dp[i & 1][IDX(j, i)] =
          std::min(dp[i & 1][IDX(j, i)],
                   dp[(i ^ 1) & 1][IDX(j - 1, i - 1)] + (a[i - 1] != b[j - 1]));

      if (j > i - max_indel) {
        dp[i & 1][IDX(j, i)] =
            std::min(dp[i & 1][IDX(j, i)], dp[i & 1][IDX(j - 1, i)] + 1);
      }

      if (j < i + max_indel) {
        dp[i & 1][IDX(j, i)] =
            std::min(dp[i & 1][IDX(j, i)], dp[(i ^ 1) & 1][IDX(j, i - 1)] + 1);
      }
    }
  }

  return 1 - dp[n & 1][IDX(m, n)] * 1.0 / std::max(n, m);
#undef IDX
}
}  // namespace
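// Note (added for clarity; not in the original source): GetSimilarity above
// computes a banded Levenshtein distance with two rolling rows, filling only
// cells where |i - j| <= max_indel. Worked example with min_similarity = 0.75:
// for a = "ACGT" and b = "AGGT", the band half-width is
// max_indel = max(4, 4) * (1 - 0.75) = 1; the edit distance found is 1
// (C vs G), so the function returns 1 - 1/4 = 0.75. Pairs whose length
// difference already exceeds max_indel, or where max_indel rounds down to
// zero, are rejected immediately with similarity 0.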
adapter.ReverseComplement()) { num_removed += SearchAndPopBubble(graph, adapter, max_len, checker); } } graph.Refresh(!permanent_rm); return num_removed; } size_t ComplexBubbleRemover::PopBubbles(UnitigGraph &graph, bool permanent_rm) { uint32_t k = graph.k(); double sim = similarity_; uint32_t max_len = lround(merge_level_ * k / sim); if (max_len * (1 - similarity_) < 1) { return 0; } auto checker = [&graph, k, sim](const UnitigGraph::VertexAdapter &a, const UnitigGraph::VertexAdapter &b) -> bool { return (b.GetLength() + k - 1) * sim <= (a.GetLength() + k - 1) && (a.GetLength() + k - 1) * sim <= (b.GetLength() + k - 1) && GetSimilarity(graph.VertexToDNAString(a), graph.VertexToDNAString(b), sim) >= sim; }; return BaseBubbleRemover::PopBubbles(graph, permanent_rm, max_len, checker); }megahit-1.2.9/src/assembly/bubble_remover.h000066400000000000000000000036121355123202700207140ustar00rootroot00000000000000// // Created by vout on 11/21/18. // #ifndef MEGAHIT_BUBBLE_REMOVER_H #define MEGAHIT_BUBBLE_REMOVER_H #include #include "contig_output.h" #include "unitig_graph.h" #include "utils/histgram.h" class BaseBubbleRemover { public: using checker_type = std::function; public: BaseBubbleRemover &SetWriter(ContigWriter *bubble_file) { bubble_file_ = bubble_file; return *this; } BaseBubbleRemover &SetCarefulThreshold(double threshold) { careful_threshold_ = threshold; return *this; } private: ContigWriter *bubble_file_{}; double careful_threshold_{1 + 1e-3}; protected: size_t PopBubbles(UnitigGraph &graph, bool permanent_rm, uint32_t max_len, const checker_type &checker); int SearchAndPopBubble(UnitigGraph &graph, UnitigGraph::VertexAdapter &adapter, uint32_t max_len, const checker_type &checker); }; class NaiveBubbleRemover : public BaseBubbleRemover { public: size_t PopBubbles(UnitigGraph &graph, bool permanent_rm) { return BaseBubbleRemover::PopBubbles(graph, permanent_rm, graph.k() + 2, Check); } private: static constexpr bool Check(const UnitigGraph::VertexAdapter &a, const UnitigGraph::VertexAdapter &b) { return true; } }; class ComplexBubbleRemover : public BaseBubbleRemover { private: int merge_level_{20}; double similarity_{0.95}; public: ComplexBubbleRemover &SetMergeLevel(int merge_level) { merge_level_ = merge_level; return *this; } ComplexBubbleRemover &SetMergeSimilarity(double similarity) { similarity_ = similarity; return *this; } size_t PopBubbles(UnitigGraph &graph, bool permanent_rm); }; #endif // MEGAHIT_BUBBLE_REMOVER_H megahit-1.2.9/src/assembly/contig_output.cpp000066400000000000000000000063771355123202700211730ustar00rootroot00000000000000// // Created by vout on 11/21/18. 
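//
// Worked example (illustrative; k and the parameters are hypothetical) for
// ComplexBubbleRemover::PopBubbles above: with k = 141, merge_level_ = 20 and
// similarity_ = 0.95, the search window is
//
//   uint32_t max_len = lround(20 * 141 / 0.95);  // = 2968 unitig edges
//
// and two branches are merged only if each one's length (plus k - 1) is
// within 95% of the other's and GetSimilarity() reports at least 95%
// sequence identity. If max_len * (1 - similarity_) < 1, not even one indel
// can be tolerated and the pass returns without searching.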
// #include "contig_output.h" #include #include "definitions.h" #include "unitig_graph.h" namespace { inline char Complement(char c) { if (c >= 0 && c < 4) { return 3 - c; } switch (c) { case 'A': return 'T'; case 'C': return 'G'; case 'G': return 'C'; case 'T': return 'A'; default: assert(false); } return 0; } inline void ReverseComplement(std::string &s) { int i, j; for (i = 0, j = s.length() - 1; i < j; ++i, --j) { std::swap(s[i], s[j]); s[i] = Complement(s[i]); s[j] = Complement(s[j]); } if (i == j) { s[i] = Complement(s[i]); } } void FoldPalindrome(std::string &s, unsigned kmer_k, bool is_loop) { if (is_loop) { for (unsigned i = 1; i + kmer_k <= s.length(); ++i) { std::string rc = s.substr(i, kmer_k); ReverseComplement(rc); if (rc == s.substr(i - 1, kmer_k)) { assert(i <= s.length() / 2); s = s.substr(i, s.length() / 2); break; } } } else { int num_edges = s.length() - kmer_k; assert(num_edges % 2 == 1); s.resize((num_edges - 1) / 2 + kmer_k + 1); } } } // namespace void OutputContigs(UnitigGraph &graph, ContigWriter *contig_writer, ContigWriter *final_contig_writer, bool change_only, uint32_t min_standalone) { assert(!(change_only && final_contig_writer != nullptr)); // if output // changed contigs, // must not output // final contigs #pragma omp parallel for for (UnitigGraph::size_type i = 0; i < graph.size(); ++i) { auto adapter = graph.MakeVertexAdapter(i); double multi = change_only ? 1 : std::min(static_cast(kMaxMul), adapter.GetAvgDepth()); std::string ascii_contig = graph.VertexToDNAString(adapter); if (change_only && !adapter.IsChanged()) { continue; } if (adapter.IsLoop()) { int flag = contig_flag::kLoop | contig_flag::kStandalone; auto writer = contig_writer; if (adapter.IsPalindrome()) { FoldPalindrome(ascii_contig, graph.k(), adapter.IsLoop()); flag = contig_flag::kStandalone; } if (final_contig_writer != nullptr) { if (ascii_contig.length() < min_standalone) { continue; } else { writer = final_contig_writer; } } writer->WriteContig(ascii_contig, graph.k(), i, flag, multi); } else { auto out_file = contig_writer; int flag = 0; if (adapter.IsStandalone() || (graph.InDegree(adapter) == 0 && graph.OutDegree(adapter) == 0)) { if (adapter.IsPalindrome()) { FoldPalindrome(ascii_contig, graph.k(), adapter.IsLoop()); } flag = contig_flag::kStandalone; if (final_contig_writer != nullptr) { if (ascii_contig.length() < min_standalone) { continue; } else { out_file = final_contig_writer; } } } out_file->WriteContig(ascii_contig, graph.k(), i, flag, multi); } } }megahit-1.2.9/src/assembly/contig_output.h000066400000000000000000000007011355123202700206210ustar00rootroot00000000000000// // Created by vout on 11/21/18. // #ifndef MEGAHIT_CONTIG_OUTPUT_H #define MEGAHIT_CONTIG_OUTPUT_H #include #include #include #include "sequence/io/contig/contig_writer.h" class UnitigGraph; void OutputContigs(UnitigGraph &graph, ContigWriter *contig_writer, ContigWriter *final_contig_writer, bool change_only, uint32_t min_standalone); #endif // MEGAHIT_CONTIG_OUTPUT_H megahit-1.2.9/src/assembly/contig_stat.h000066400000000000000000000027721355123202700202460ustar00rootroot00000000000000// // Created by vout on 11/21/18. 
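//
// Worked example (illustrative; the sequence is hypothetical) for
// FoldPalindrome above in the non-loop case: the palindrome "ACGCGCGCGT"
// (length 10, equal to its own reverse complement) with kmer_k = 5 spans
// num_edges = 10 - 5 = 5 k-mer edges, which is odd as asserted, and folding
// keeps (5 - 1) / 2 + 5 + 1 = 8 characters:
//
//   std::string s = "ACGCGCGCGT";
//   s.resize((s.length() - 5 - 1) / 2 + 5 + 1);  // s == "ACGCGCGC"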
// #ifndef MEGAHIT_CONTIG_STAT_H #define MEGAHIT_CONTIG_STAT_H #include #include "unitig_graph.h" #include "utils/histgram.h" #include "utils/utils.h" using ContigStat = std::map; inline ContigStat CalcAndPrintStat(UnitigGraph &graph, bool print = true, bool changed_only = false) { uint32_t n_isolated = 0, n_looped = 0; Histgram hist; #pragma omp parallel for reduction(+ : n_looped, n_isolated) for (UnitigGraph::size_type i = 0; i < graph.size(); ++i) { auto adapter = graph.MakeVertexAdapter(i); if (changed_only && !adapter.IsChanged()) { continue; } hist.insert(adapter.GetLength() + graph.k()); n_looped += adapter.IsLoop(); n_isolated += adapter.IsStandalone() || (graph.InDegree(adapter) == 0 && graph.OutDegree(adapter) == 0); } uint64_t total_size = hist.sum(); ContigStat stat = {{"Max", hist.maximum()}, {"Min", hist.minimum()}, {"N50", hist.Nx(0.5 * total_size)}, {"total size", total_size}, {"number contigs", hist.size()}, {"number looped", n_looped}, {"number isolated", n_isolated}}; if (print) { xinfo(""); for (auto &kv : stat) { xinfoc("{s}: {}, ", kv.first.c_str(), kv.second); } xinfoc("{s}", "\n"); } return stat; } #endif // MEGAHIT_CONTIG_STAT_H megahit-1.2.9/src/assembly/low_depth_remover.cpp000066400000000000000000000066321355123202700220060ustar00rootroot00000000000000// // Created by vout on 11/21/18. // #include "low_depth_remover.h" #include "unitig_graph.h" namespace { double LocalDepth(UnitigGraph &graph, UnitigGraph::VertexAdapter &adapter, uint32_t local_width) { double total_depth = 0; uint64_t num_added_edges = 0; for (int strand = 0; strand < 2; ++strand, adapter.ReverseComplement()) { UnitigGraph::VertexAdapter outs[4]; int degree = graph.GetNextAdapters(adapter, outs); for (int i = 0; i < degree; ++i) { if (outs[i].GetLength() <= local_width) { num_added_edges += outs[i].GetLength(); total_depth += outs[i].GetTotalDepth(); } else { num_added_edges += local_width; total_depth += outs[i].GetAvgDepth() * local_width; } } } if (num_added_edges == 0) { return 0; } else { return total_depth / num_added_edges; } } } // namespace bool RemoveLocalLowDepth(UnitigGraph &graph, double min_depth, uint32_t max_len, uint32_t local_width, double local_ratio, bool permanent_rm, uint32_t *num_removed) { bool need_refresh = false; uint32_t removed = 0; std::atomic_bool is_changed{false}; #pragma omp parallel for reduction(+ : removed) reduction(|| : need_refresh) for (UnitigGraph::size_type i = 0; i < graph.size(); ++i) { auto adapter = graph.MakeVertexAdapter(i); if (adapter.IsStandalone() || adapter.GetLength() > max_len) { continue; } int indegree = graph.InDegree(adapter); int outdegree = graph.OutDegree(adapter); if (indegree + outdegree == 0) { continue; } if ((indegree <= 1 && outdegree <= 1) || indegree == 0 || outdegree == 0) { double depth = adapter.GetAvgDepth(); if (is_changed.load(std::memory_order_relaxed) && depth > min_depth) continue; double mean = LocalDepth(graph, adapter, local_width); double threshold = min_depth; if (min_depth < mean * local_ratio) is_changed.store(true, std::memory_order_relaxed); else threshold = mean * local_ratio; if (depth < threshold) { is_changed.store(true, std::memory_order_relaxed); need_refresh = true; bool success = adapter.SetToDelete(); assert(success); removed += success; } } } if (need_refresh) { bool set_changed = !permanent_rm; graph.Refresh(set_changed); } *num_removed = removed; return is_changed; } uint32_t IterateLocalLowDepth(UnitigGraph &graph, double min_depth, uint32_t min_len, uint32_t local_width, double local_ratio, 
bool permanent_rm) {
  uint32_t total_removed = 0;
  while (min_depth < kMaxMul) {
    uint32_t num_removed = 0;
    if (!RemoveLocalLowDepth(graph, min_depth, min_len, local_width,
                             local_ratio, permanent_rm, &num_removed)) {
      break;
    }
    total_removed += num_removed;
    min_depth *= 1.1;
  }
  return total_removed;
}

uint32_t RemoveLowDepth(UnitigGraph &graph, double min_depth) {
  uint32_t num_removed = 0;
#pragma omp parallel for reduction(+ : num_removed)
  for (UnitigGraph::size_type i = 0; i < graph.size(); ++i) {
    auto adapter = graph.MakeVertexAdapter(i);
    if (adapter.GetAvgDepth() < min_depth) {
      bool success = adapter.SetToDelete();
      assert(success);
      num_removed += success;
    }
  }
  graph.Refresh(false);
  return num_removed;
}
megahit-1.2.9/src/assembly/low_depth_remover.h000066400000000000000000000012571355123202700214510ustar00rootroot00000000000000//
// Created by vout on 11/21/18.
//

#ifndef MEGAHIT_LOW_DEPTH_REMOVER_H
#define MEGAHIT_LOW_DEPTH_REMOVER_H

#include <cstdint>

class UnitigGraph;

uint32_t RemoveLowDepth(UnitigGraph &graph, double min_depth);
bool RemoveLocalLowDepth(UnitigGraph &graph, double min_depth, uint32_t max_len,
                         uint32_t local_width, double local_ratio,
                         bool permanent_rm, uint32_t *num_removed);
uint32_t IterateLocalLowDepth(UnitigGraph &graph, double min_depth,
                              uint32_t min_len, uint32_t local_width,
                              double local_ratio, bool permanent_rm = false);

#endif  // MEGAHIT_LOW_DEPTH_REMOVER_H
megahit-1.2.9/src/assembly/sdbg_pruning.cpp000066400000000000000000000113441355123202700207370ustar00rootroot00000000000000/*
 * MEGAHIT
 * Copyright (C) 2014 - 2015 The University of Hong Kong & L3 Bioinformatics
 * Limited
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
*/ /* contact: Dinghua Li */ #include "sdbg_pruning.h" #include #include #include #include #include #include #include "kmlib/kmbitvector.h" #include "utils/histgram.h" #include "utils/utils.h" namespace sdbg_pruning { double InferMinDepth(SDBG &dbg) { Histgram hist; #pragma omp parallel for for (uint64_t i = 0; i < dbg.size(); ++i) { if (dbg.IsValidEdge(i)) { hist.insert(dbg.EdgeMultiplicity(i)); } } double cov = hist.FirstLocalMinimum(); for (int repeat = 1; repeat <= 100; ++repeat) { hist.TrimLow(static_cast(roundf(cov))); unsigned median = hist.median(); double cov1 = sqrt(median); if (fabs(cov - cov1) < 1e-2) { return cov; } cov = cov1; } xwarn("Cannot detect min depth: unconverged"); return 1; } int64_t Trim(SDBG &dbg, int len, AtomicBitVector &ignored) { int64_t number_tips = 0; AtomicBitVector to_remove(dbg.size()); std::vector path; #pragma omp parallel for reduction(+ : number_tips) private(path) for (uint64_t id = 0; id < dbg.size(); ++id) { if (!ignored.at(id) && dbg.EdgeOutdegreeZero(id)) { uint64_t prev = SDBG::kNullID; uint64_t cur = id; bool is_tip = false; path.clear(); path.push_back(id); for (int i = 1; i < len; ++i) { prev = dbg.UniquePrevEdge(cur); if (prev == SDBG::kNullID) { is_tip = dbg.EdgeIndegreeZero(cur); break; } else if (dbg.UniqueNextEdge(prev) == SDBG::kNullID) { is_tip = true; break; } else { path.push_back(prev); cur = prev; } } if (is_tip) { for (unsigned long i : path) { to_remove.set(i); } ++number_tips; ignored.set(id); ignored.set(path.back()); if (prev != SDBG::kNullID) { ignored.unset(prev); } } } } #pragma omp parallel for reduction(+ : number_tips) private(path) for (uint64_t id = 0; id < dbg.size(); ++id) { if (!ignored.at(id) && dbg.EdgeIndegreeZero(id)) { uint64_t next = SDBG::kNullID; uint64_t cur = id; bool is_tip = false; path.clear(); path.push_back(id); for (int i = 1; i < len; ++i) { next = dbg.UniqueNextEdge(cur); if (next == SDBG::kNullID) { is_tip = dbg.EdgeOutdegreeZero(cur); break; } else if (dbg.UniquePrevEdge(next) == SDBG::kNullID) { is_tip = true; break; } else { path.push_back(next); cur = next; } } if (is_tip) { for (unsigned long i : path) { to_remove.set(i); } ++number_tips; ignored.set(id); ignored.set(path.back()); if (next != SDBG::kNullID) { ignored.unset(next); } } } } #pragma omp parallel for for (uint64_t id = 0; id < dbg.size(); ++id) { if (to_remove.at(id)) { dbg.SetInvalidEdge(id); } } return number_tips; } uint64_t RemoveTips(SDBG &dbg, int max_tip_len) { uint64_t number_tips = 0; SimpleTimer timer; AtomicBitVector ignored(dbg.size()); #pragma omp parallel for for (uint64_t id = 0; id < dbg.size(); ++id) { if (!dbg.EdgeIndegreeZero(id) && !dbg.EdgeOutdegreeZero(id)) { ignored.set(id); } } for (int len = 2; len < max_tip_len; len *= 2) { xinfo("Removing tips with length less than {}; ", len); timer.reset(); timer.start(); number_tips += Trim(dbg, len, ignored); timer.stop(); xinfoc("Accumulated tips removed: {}; time elapsed: {.4}\n", number_tips, timer.elapsed()); } xinfo("Removing tips with length less than {}; ", max_tip_len); timer.reset(); timer.start(); number_tips += Trim(dbg, max_tip_len, ignored); timer.stop(); xinfoc("Accumulated tips removed: {}; time elapsed: {.4}\n", number_tips, timer.elapsed()); return number_tips; } } // namespace sdbg_pruningmegahit-1.2.9/src/assembly/sdbg_pruning.h000066400000000000000000000022211355123202700203760ustar00rootroot00000000000000/* * MEGAHIT * Copyright (C) 2014 - 2015 The University of Hong Kong & L3 Bioinformatics * Limited * * This program is free software: you can 
redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 */

/* contact: Dinghua Li */

#ifndef ASSEMBLY_ALGORITHMS_H_
#define ASSEMBLY_ALGORITHMS_H_

#include <string>
#include <vector>

#include "sdbg/sdbg.h"

using std::string;
using std::vector;

namespace sdbg_pruning {
double InferMinDepth(SDBG &dbg);

// tips removal
uint64_t RemoveTips(SDBG &dbg, int max_tip_len);
}  // namespace sdbg_pruning

#endif  // define ASSEMBLY_ALGORITHMS_H_
megahit-1.2.9/src/assembly/tip_remover.cpp000066400000000000000000000030561355123202700206120ustar00rootroot00000000000000//
// Created by vout on 11/21/18.
//

#include "tip_remover.h"
#include "unitig_graph.h"

uint32_t RemoveTips(UnitigGraph &graph, uint32_t max_tip_len) {
  uint32_t num_removed = 0;
  for (uint32_t thre = 2; thre < max_tip_len;
       thre = std::min(thre * 2, max_tip_len)) {
#pragma omp parallel for reduction(+ : num_removed)
    for (UnitigGraph::size_type i = 0; i < graph.size(); ++i) {
      auto adapter = graph.MakeVertexAdapter(i);
      if (adapter.GetLength() >= thre) {
        continue;
      }
      if (adapter.IsStandalone()) {
        bool success = adapter.SetToDelete();
        assert(success);
        num_removed += success;
      } else {
        UnitigGraph::VertexAdapter nexts[4], prevs[4];
        int outd = graph.GetNextAdapters(adapter, nexts);
        int ind = graph.GetPrevAdapters(adapter, prevs);

        if (ind + outd == 0) {
          bool success = adapter.SetToDelete();
          assert(success);
          num_removed += success;
        } else if (outd == 1 && ind == 0) {
          if (nexts[0].GetAvgDepth() > 8 * adapter.GetAvgDepth()) {
            bool success = adapter.SetToDelete();
            assert(success);
            num_removed += success;
          }
        } else if (outd == 0 && ind == 1) {
          if (prevs[0].GetAvgDepth() > 8 * adapter.GetAvgDepth()) {
            bool success = adapter.SetToDelete();
            assert(success);
            num_removed += success;
          }
        }
      }
    }
    graph.Refresh(false);
    if (thre >= max_tip_len) {
      break;
    }
  }
  return num_removed;
}
megahit-1.2.9/src/assembly/tip_remover.h000066400000000000000000000003551355123202700202560ustar00rootroot00000000000000//
// Created by vout on 11/21/18.
//

#ifndef MEGAHIT_TIP_REMOVER_H
#define MEGAHIT_TIP_REMOVER_H

#include <cstdint>

class UnitigGraph;

uint32_t RemoveTips(UnitigGraph &graph, uint32_t max_tip_len);

#endif  // MEGAHIT_TIP_REMOVER_H
megahit-1.2.9/src/assembly/unitig_graph.cpp000066400000000000000000000303221355123202700207330ustar00rootroot00000000000000//
// Created by vout on 11/19/18.
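//
// Illustrative note on the RemoveTips pass above (max_tip_len = 282 is a
// hypothetical value): the length threshold grows geometrically, so the
// rounds run with thre = 2, 4, 8, ..., 256 and the loop exits once the
// doubled threshold reaches max_tip_len; graph.Refresh(false) after each
// round lets tips exposed by the previous round be caught by the next one:
//
//   for (uint32_t thre = 2; thre < 282; thre = std::min(thre * 2, 282u)) {
//     // delete standalone/isolated short vertices, plus tips whose single
//     // neighbor is more than 8x deeper, then refresh the graph
//   }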
// #include "unitig_graph.h" #include #include #include "kmlib/kmbitvector.h" #include "utils/mutex.h" #include "utils/utils.h" UnitigGraph::UnitigGraph(SDBG *sdbg) : sdbg_(sdbg), adapter_impl_(this), sudo_adapter_impl_(this) { id_map_.clear(); vertices_.clear(); SpinLock path_lock; AtomicBitVector locks(sdbg_->size()); size_t count_palindrome = 0; // assemble simple paths #pragma omp parallel for reduction(+ : count_palindrome) for (uint64_t edge_idx = 0; edge_idx < sdbg_->size(); ++edge_idx) { if (sdbg_->IsValidEdge(edge_idx) && sdbg_->NextSimplePathEdge(edge_idx) == SDBG::kNullID && locks.try_lock(edge_idx)) { bool will_be_added = true; uint64_t cur_edge = edge_idx; uint64_t prev_edge; int64_t depth = sdbg_->EdgeMultiplicity(edge_idx); uint32_t length = 1; while ((prev_edge = sdbg_->PrevSimplePathEdge(cur_edge)) != SDBG::kNullID) { cur_edge = prev_edge; if (!locks.try_lock(cur_edge)) { will_be_added = false; break; } depth += sdbg_->EdgeMultiplicity(cur_edge); ++length; } if (!will_be_added) { continue; } uint64_t rc_start = sdbg_->EdgeReverseComplement(edge_idx); uint64_t rc_end; assert(rc_start != SDBG::kNullID); if (!locks.try_lock(rc_start)) { rc_end = sdbg_->EdgeReverseComplement(cur_edge); if (std::max(edge_idx, cur_edge) < std::max(rc_start, rc_end)) { will_be_added = false; } } else { // lock through the rc path uint64_t rc_cur_edge = rc_start; rc_end = rc_cur_edge; bool extend_full = true; while ((rc_cur_edge = sdbg_->NextSimplePathEdge(rc_cur_edge)) != SDBG::kNullID) { rc_end = rc_cur_edge; if (!locks.try_lock(rc_cur_edge)) { extend_full = false; break; } } if (!extend_full) { rc_end = sdbg_->EdgeReverseComplement(cur_edge); assert(rc_end != SDBG::kNullID); } } if (will_be_added) { std::lock_guard lk(path_lock); vertices_.emplace_back(cur_edge, edge_idx, rc_start, rc_end, depth, length); count_palindrome += cur_edge == rc_start; } } } xinfo("Graph size without loops: {}, palindrome: {}\n", vertices_.size(), count_palindrome); // assemble looped paths std::mutex loop_lock; size_t count_loop = 0; #pragma omp parallel for for (size_t edge_idx = 0; edge_idx < sdbg_->size(); ++edge_idx) { if (!locks.at(edge_idx) && sdbg_->IsValidEdge(edge_idx)) { std::lock_guard lk(loop_lock); if (!locks.at(edge_idx)) { uint64_t cur_edge = edge_idx; uint64_t rc_edge = sdbg_->EdgeReverseComplement(edge_idx); uint64_t depth = sdbg_->EdgeMultiplicity(edge_idx); uint32_t length = 0; // whether it is marked before entering the loop bool rc_marked = locks.at(rc_edge); while (!locks.at(cur_edge)) { locks.set(cur_edge); depth += sdbg_->EdgeMultiplicity(cur_edge); ++length; cur_edge = sdbg_->PrevSimplePathEdge(cur_edge); assert(cur_edge != SDBG::kNullID); } assert(cur_edge == edge_idx); if (!rc_marked) { uint64_t start = sdbg_->NextSimplePathEdge(edge_idx); uint64_t end = edge_idx; vertices_.emplace_back(start, end, sdbg_->EdgeReverseComplement(end), sdbg_->EdgeReverseComplement(start), depth, length, true); count_loop += 1; } } } } if (vertices_.size() >= kMaxNumVertices) { xfatal( "Too many vertices in the unitig graph ({} >= {}), " "you may increase the kmer size to remove tons of erroneous kmers.\n", vertices_.size(), kMaxNumVertices); } sdbg_->FreeMultiplicity(); id_map_.reserve(vertices_.size() * 2 - count_palindrome); for (size_type i = 0; i < vertices_.size(); ++i) { VertexAdapter adapter(vertices_[i]); id_map_[adapter.b()] = i; id_map_[adapter.rb()] = i; } assert(vertices_.size() * 2 - count_palindrome >= id_map_.size()); } void UnitigGraph::RefreshDisconnected() { SpinLock mutex; #pragma omp 
parallel for for (size_type i = 0; i < vertices_.size(); ++i) { auto adapter = MakeSudoAdapter(i); if (adapter.IsToDelete() || adapter.IsPalindrome() || adapter.IsLoop()) { continue; } uint8_t to_disconnect = adapter.IsToDisconnect(); adapter.ReverseComplement(); uint8_t rc_to_disconnect = adapter.IsToDisconnect(); adapter.ReverseComplement(); if (!to_disconnect && !rc_to_disconnect) { continue; } if (adapter.GetLength() <= to_disconnect + rc_to_disconnect) { adapter.SetToDelete(); continue; } auto old_start = adapter.b(); auto old_end = adapter.e(); auto old_rc_start = adapter.rb(); auto old_rc_end = adapter.re(); uint64_t new_start, new_end, new_rc_start, new_rc_end; if (to_disconnect) { new_start = sdbg_->NextSimplePathEdge(old_start); new_rc_end = sdbg_->PrevSimplePathEdge(old_rc_end); assert(new_start != SDBG::kNullID && new_rc_end != SDBG::kNullID); sdbg_->SetInvalidEdge(old_start); sdbg_->SetInvalidEdge(old_rc_end); } else { new_start = old_start; new_rc_end = old_rc_end; } if (rc_to_disconnect) { new_rc_start = sdbg_->NextSimplePathEdge(old_rc_start); new_end = sdbg_->PrevSimplePathEdge(old_end); assert(new_rc_start != SDBG::kNullID && new_end != SDBG::kNullID); sdbg_->SetInvalidEdge(old_rc_start); sdbg_->SetInvalidEdge(old_end); } else { new_rc_start = old_rc_start; new_end = old_end; } uint32_t new_length = adapter.GetLength() - to_disconnect - rc_to_disconnect; uint64_t new_total_depth = lround(adapter.GetAvgDepth() * new_length); adapter.SetBeginEnd(new_start, new_end, new_rc_start, new_rc_end); adapter.SetLength(new_length); adapter.SetTotalDepth(new_total_depth); std::lock_guard lk(mutex); if (to_disconnect) { id_map_.erase(old_start); id_map_[new_start] = i; } if (rc_to_disconnect) { id_map_.erase(old_rc_start); id_map_[new_rc_start] = i; } } } void UnitigGraph::Refresh(bool set_changed) { static const uint8_t kDeleted = 0x1; static const uint8_t kVisited = 0x2; RefreshDisconnected(); #pragma omp parallel for for (size_type i = 0; i < vertices_.size(); ++i) { auto adapter = MakeSudoAdapter(i); if (!adapter.IsToDelete()) { continue; } adapter.SetFlag(kDeleted); if (adapter.IsStandalone()) { continue; } for (int strand = 0; strand < 2; ++strand, adapter.ReverseComplement()) { uint64_t cur_edge = adapter.e(); for (size_t j = 1; j < adapter.GetLength(); ++j) { auto prev = sdbg_->UniquePrevEdge(cur_edge); sdbg_->SetInvalidEdge(cur_edge); cur_edge = prev; assert(cur_edge != SDBG::kNullID); } assert(cur_edge == adapter.b()); sdbg_->SetInvalidEdge(cur_edge); if (adapter.IsPalindrome()) { break; } } } AtomicBitVector locks(size()); #pragma omp parallel for for (size_type i = 0; i < vertices_.size(); ++i) { auto adapter = MakeSudoAdapter(i); if (adapter.IsStandalone() || (adapter.GetFlag() & kDeleted)) { continue; } for (int strand = 0; strand < 2; ++strand, adapter.ReverseComplement()) { if (PrevSimplePathAdapter(adapter).IsValid()) { continue; } if (!locks.try_lock(i)) { break; } std::vector linear_path; for (auto cur = NextSimplePathAdapter(adapter); cur.IsValid(); cur = NextSimplePathAdapter(cur)) { linear_path.emplace_back(cur); } if (linear_path.empty()) { adapter.SetFlag(kVisited); break; } size_type back_id = linear_path.back().UnitigId(); if (back_id != i && !locks.try_lock(back_id)) { if (back_id > i) { locks.unlock(i); break; } else { locks.lock(back_id); } } auto new_length = adapter.GetLength(); auto new_total_depth = adapter.GetTotalDepth(); adapter.SetFlag(kVisited); for (auto &v : linear_path) { new_length += v.GetLength(); new_total_depth += v.GetTotalDepth(); if 
(v.canonical_id() != adapter.canonical_id()) v.SetFlag(kDeleted); } auto new_start = adapter.b(); auto new_rc_end = adapter.re(); auto new_rc_start = linear_path.back().rb(); auto new_end = linear_path.back().e(); adapter.SetBeginEnd(new_start, new_end, new_rc_start, new_rc_end); adapter.SetLength(new_length); adapter.SetTotalDepth(new_total_depth); if (set_changed) adapter.SetChanged(); break; } } // looped path std::mutex mutex; #pragma omp parallel for for (size_type i = 0; i < vertices_.size(); ++i) { auto adapter = MakeSudoAdapter(i); if (!adapter.IsStandalone() && !adapter.GetFlag()) { std::lock_guard lk(mutex); if (adapter.GetFlag()) { continue; } uint32_t length = adapter.GetLength(); uint64_t total_depth = adapter.GetTotalDepth(); SudoVertexAdapter next_adapter = adapter; while (true) { next_adapter = NextSimplePathAdapter(next_adapter); assert(next_adapter.IsValid()); if (next_adapter.b() == adapter.b()) { break; } next_adapter.SetFlag(kDeleted); length += next_adapter.GetLength(); total_depth += next_adapter.GetTotalDepth(); } auto new_start = adapter.b(); auto new_end = sdbg_->PrevSimplePathEdge(new_start); auto new_rc_end = adapter.re(); auto new_rc_start = sdbg_->NextSimplePathEdge(new_rc_end); assert(new_start == sdbg_->EdgeReverseComplement(new_rc_end)); assert(new_end == sdbg_->EdgeReverseComplement(new_rc_start)); adapter.SetBeginEnd(new_start, new_end, new_rc_start, new_rc_end); adapter.SetLength(length); adapter.SetTotalDepth(total_depth); adapter.SetLooped(); if (set_changed) adapter.SetChanged(); } } vertices_.resize(std::remove_if(vertices_.begin(), vertices_.end(), [](UnitigGraphVertex &a) { return SudoVertexAdapter(a).GetFlag() & kDeleted; }) - vertices_.begin()); size_type num_changed = 0; #pragma omp parallel for reduction(+ : num_changed) for (size_type i = 0; i < vertices_.size(); ++i) { auto adapter = MakeSudoAdapter(i); assert(adapter.IsStandalone() || adapter.GetFlag()); adapter.SetFlag(0); id_map_.at(adapter.b()) = i; id_map_.at(adapter.rb()) = i; num_changed += adapter.IsChanged(); } } std::string UnitigGraph::VertexToDNAString(VertexAdapter v) { v.ToUniqueFormat(); std::string label; label.reserve(k() + v.GetLength()); uint64_t cur_edge = v.e(); for (unsigned i = 1; i < v.GetLength(); ++i) { int8_t cur_char = sdbg_->GetW(cur_edge); label.push_back("ACGT"[cur_char > 4 ? (cur_char - 5) : (cur_char - 1)]); cur_edge = sdbg_->PrevSimplePathEdge(cur_edge); if (cur_edge == SDBG::kNullID) { xfatal("{}, {}, {}, {}, ({}, {}), {}, {}\n", v.b(), v.e(), v.rb(), v.re(), sdbg_->EdgeReverseComplement(v.e()), sdbg_->EdgeReverseComplement(v.b()), v.GetLength(), i); } } int8_t cur_char = sdbg_->GetW(cur_edge); label.push_back("ACGT"[cur_char > 4 ? (cur_char - 5) : (cur_char - 1)]); if (cur_edge != v.b()) { xfatal("fwd: {}, {}, rev: {}, {}, ({}, {}) length: {}\n", v.b(), v.e(), v.rb(), v.re(), sdbg_->EdgeReverseComplement(v.e()), sdbg_->EdgeReverseComplement(v.b()), v.GetLength()); } uint8_t seq[kMaxK]; sdbg_->GetLabel(v.b(), seq); for (int i = sdbg_->k() - 1; i >= 0; --i) { assert(seq[i] >= 1 && seq[i] <= 4); label.append(1, "ACGT"[seq[i] - 1]); } std::reverse(label.begin(), label.end()); return label; } megahit-1.2.9/src/assembly/unitig_graph.h000066400000000000000000000117271355123202700204100ustar00rootroot00000000000000// // Created by vout on 11/10/18. 
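//
// Illustrative note (not from the original sources) on the label decoding in
// VertexToDNAString above: GetW() yields the SDBG edge label W in 1..8,
// where 5..8 are the flagged duplicates of 1..4 used by the succinct
// representation, so both ranges decode to the same base:
//
//   int8_t w = 6;                                    // flagged variant of 2
//   char base = "ACGT"[w > 4 ? (w - 5) : (w - 1)];   // base == 'C'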
// #ifndef MEGAHIT_UNITIG_GRAPH_H #define MEGAHIT_UNITIG_GRAPH_H #include #include #include "parallel_hashmap/phmap.h" #include "sdbg/sdbg.h" #include "unitig_graph_vertex.h" class UnitigGraph { public: using Vertex = UnitigGraphVertex; using VertexAdapter = UnitigGraphVertex::Adapter; using size_type = VertexAdapter::size_type; static const size_type kMaxNumVertices = std::numeric_limits::max() - 1; static const size_type kNullVertexID = kMaxNumVertices + 1; public: explicit UnitigGraph(SDBG *sdbg); UnitigGraph(const UnitigGraph &) = delete; UnitigGraph(const UnitigGraph &&) = delete; ~UnitigGraph() = default; size_type size() const { return vertices_.size(); } size_t k() const { return sdbg_->k(); } public: void Refresh(bool mark_changed = false); std::string VertexToDNAString(VertexAdapter adapter); public: /* * Function for VertexAdapter obtaining & traversal */ VertexAdapter MakeVertexAdapter(size_type id, int strand = 0) { return adapter_impl_.MakeVertexAdapter(id, strand); } int GetNextAdapters(VertexAdapter &adapter, VertexAdapter *out) { return adapter_impl_.GetNextAdapters(adapter, out); } int GetPrevAdapters(VertexAdapter &adapter, VertexAdapter *out) { return adapter_impl_.GetPrevAdapters(adapter, out); } int OutDegree(VertexAdapter &adapter) { return adapter_impl_.OutDegree(adapter); } int InDegree(VertexAdapter &adapter) { return adapter_impl_.InDegree(adapter); } private: /* * Function for SudoVertexAdapter obtaining & traversal */ using SudoVertexAdapter = UnitigGraphVertex::SudoAdapter; SudoVertexAdapter MakeSudoAdapter(size_type id, int strand = 0) { return sudo_adapter_impl_.MakeVertexAdapter(id, strand); } int GetNextAdapters(SudoVertexAdapter &adapter, SudoVertexAdapter *out) { return sudo_adapter_impl_.GetNextAdapters(adapter, out); } int GetPrevAdapters(SudoVertexAdapter &adapter, SudoVertexAdapter *out) { return sudo_adapter_impl_.GetPrevAdapters(adapter, out); } int OutDegree(SudoVertexAdapter &adapter) { return sudo_adapter_impl_.OutDegree(adapter); } int InDegree(SudoVertexAdapter &adapter) { return sudo_adapter_impl_.InDegree(adapter); } SudoVertexAdapter NextSimplePathAdapter(SudoVertexAdapter &adapter) { return sudo_adapter_impl_.NextSimplePathAdapter(adapter); } SudoVertexAdapter PrevSimplePathAdapter(SudoVertexAdapter &adapter) { return sudo_adapter_impl_.PrevSimplePathAdapter(adapter); } private: /** * A wrapper for operating different types of adapters * @tparam AdapterType type of the vertex adapter */ template class AdapterImpl { public: AdapterImpl(UnitigGraph *graph) : graph_(graph) {} public: AdapterType MakeVertexAdapter(size_type id, int strand = 0) { return {graph_->vertices_[id], strand, id}; } int GetNextAdapters(AdapterType &adapter, AdapterType *out) { uint64_t next_starts[4]; int degree = graph_->sdbg_->OutgoingEdges(adapter.e(), next_starts); if (out) { for (int i = 0; i < degree; ++i) { out[i] = MakeVertexAdapterWithSdbgId(next_starts[i]); } } return degree; } int GetPrevAdapters(AdapterType &adapter, AdapterType *out) { adapter.ReverseComplement(); int degree = GetNextAdapters(adapter, out); if (out) { for (int i = 0; i < degree; ++i) { out[i].ReverseComplement(); } } adapter.ReverseComplement(); return degree; } int OutDegree(AdapterType &adapter) { return GetNextAdapters(adapter, nullptr); } int InDegree(AdapterType &adapter) { adapter.ReverseComplement(); int degree = OutDegree(adapter); adapter.ReverseComplement(); return degree; } AdapterType NextSimplePathAdapter(AdapterType &adapter) { uint64_t next_sdbg_id = 
graph_->sdbg_->NextSimplePathEdge(adapter.e()); if (next_sdbg_id != SDBG::kNullID) { return MakeVertexAdapterWithSdbgId(next_sdbg_id); } else { return AdapterType{}; } } AdapterType PrevSimplePathAdapter(AdapterType &adapter) { adapter.ReverseComplement(); AdapterType ret = NextSimplePathAdapter(adapter); ret.ReverseComplement(); adapter.ReverseComplement(); return ret; } private: AdapterType MakeVertexAdapterWithSdbgId(uint64_t sdbg_id) { uint32_t id = graph_->id_map_.at(sdbg_id); AdapterType adapter(graph_->vertices_[id], 0, id); if (adapter.b() != sdbg_id) { adapter.ReverseComplement(); } return adapter; } private: UnitigGraph *graph_; }; void RefreshDisconnected(); private: SDBG *sdbg_{}; std::deque vertices_; phmap::flat_hash_map id_map_; AdapterImpl adapter_impl_; AdapterImpl sudo_adapter_impl_; }; #endif // MEGAHIT_UNITIG_GRAPH_H megahit-1.2.9/src/assembly/unitig_graph_vertex.h000066400000000000000000000121551355123202700220010ustar00rootroot00000000000000// // Created by vout on 11/20/18. // #ifndef MEGAHIT_UNITIG_GRAPH_VERTEX_H #define MEGAHIT_UNITIG_GRAPH_VERTEX_H #include #include #include #include #include /** * store the metadata of a unitig vertex; the vertex is associate with an SDBG */ class UnitigGraphVertex { public: UnitigGraphVertex() = default; UnitigGraphVertex(uint64_t begin, uint64_t end, uint64_t rbegin, uint64_t rend, uint64_t total_depth, uint32_t length, bool is_loop = false) : strand_info{{begin, end}, {rbegin, rend}}, length(length), total_depth(total_depth), is_looped(is_loop), is_palindrome(begin == rbegin), is_changed(false), flag(0) {} private: struct StrandInfo { StrandInfo(uint64_t begin = 0, uint64_t end = 0) : begin(begin), end(end) {} uint64_t begin : 48; uint64_t end : 48; } __attribute__((packed)); StrandInfo strand_info[2]; uint32_t length; uint64_t total_depth : 52; bool is_looped : 1; bool is_palindrome : 1; bool is_changed : 1; // status that can be modified by adapter during traversal and must be atomic AtomicWrapper flag; // bit 0-4: any flag; bit 5: marked as to // delete; bit 6 & 7: marked as to disconnect static const unsigned kToDeleteBit = 5; static const unsigned kToDisconnectBit = 6; static const uint8_t kFlagMask = (1u << kToDeleteBit) - 1; public: /** * An adapter on unitig vertex to handle graph traversal, modification * and reverse complement issues */ class Adapter { public: using size_type = uint32_t; Adapter() = default; Adapter(UnitigGraphVertex &vertex, int strand = 0, size_type id = static_cast(-1)) : vertex_(&vertex), strand_(strand), id_(id) {} void ReverseComplement() { strand_ ^= 1u; } size_type UnitigId() const { return id_; } bool IsValid() const { return vertex_ != nullptr; } uint32_t GetLength() const { return vertex_->length; } uint64_t GetTotalDepth() const { return vertex_->total_depth; } double GetAvgDepth() const { return static_cast(vertex_->total_depth) / vertex_->length; } bool IsLoop() const { return vertex_->is_looped; } bool IsPalindrome() const { return vertex_->is_palindrome; } bool IsChanged() const { return vertex_->is_changed; } void ToUniqueFormat() { if (canonical_id() != b()) { ReverseComplement(); } } bool IsStandalone() const { return IsLoop(); } bool SetToDelete() { uint8_t mask = 1u << kToDeleteBit; auto old_val = vertex_->flag.v.fetch_or(mask, std::memory_order_relaxed); return !(old_val & mask); } bool SetToDisconnect() { uint8_t mask = 1u << kToDisconnectBit << strand_; auto old_val = vertex_->flag.v.fetch_or(mask, std::memory_order_relaxed); return !(old_val & mask); } uint64_t canonical_id() 
const { return std::min(b(), rb()); }; uint64_t b() const { return StrandInfo().begin; } uint64_t e() const { return StrandInfo().end; } uint64_t rb() const { return StrandInfo(1).begin; } uint64_t re() const { return StrandInfo(1).end; } protected: UnitigGraphVertex::StrandInfo &StrandInfo(uint8_t relative_strand = 0) { return vertex_->strand_info[strand_ ^ relative_strand]; } const UnitigGraphVertex::StrandInfo &StrandInfo( uint8_t relative_strand = 0) const { return vertex_->strand_info[strand_ ^ relative_strand]; } protected: UnitigGraphVertex *vertex_{nullptr}; uint8_t strand_; uint32_t id_; // record the id for quick access }; /** * An adapter with more permission to change the data of the vertex */ class SudoAdapter : public Adapter { public: SudoAdapter() = default; SudoAdapter(UnitigGraphVertex &vertex, int strand = 0, size_type id = static_cast(-1)) : Adapter(vertex, strand, id) {} public: bool IsToDelete() const { return vertex_->flag.v.load(std::memory_order_relaxed) & (1u << kToDeleteBit); } bool IsToDisconnect() const { return vertex_->flag.v.load(std::memory_order_relaxed) & (1u << kToDisconnectBit << strand_); } void SetBeginEnd(uint64_t start, uint64_t end, uint64_t rc_start, uint64_t rc_end) { StrandInfo(0) = {start, end}; StrandInfo(1) = {rc_start, rc_end}; vertex_->is_palindrome = start == rc_start; }; void SetLength(uint32_t length) { vertex_->length = length; } void SetTotalDepth(uint64_t total_depth) { vertex_->total_depth = total_depth; } void SetLooped() { vertex_->is_looped = true; } void SetChanged() { vertex_->is_changed = true; } uint8_t GetFlag() const { return vertex_->flag.v.load(std::memory_order_relaxed) & kFlagMask; } void SetFlag(uint8_t flag) { assert(flag <= kFlagMask); vertex_->flag.v.store(flag, std::memory_order_relaxed); } }; }; static_assert(sizeof(UnitigGraphVertex) <= 40, ""); #endif // MEGAHIT_UNITIG_GRAPH_VERTEX_H megahit-1.2.9/src/assembly/weak_link_remover.cpp000066400000000000000000000021131355123202700217530ustar00rootroot00000000000000// // Created by vout on 11/21/18. // #include "weak_link_remover.h" #include "unitig_graph.h" uint32_t DisconnectWeakLinks(UnitigGraph &graph, double local_ratio = 0.1) { uint32_t num_disconnected = 0; #pragma omp parallel for reduction(+ : num_disconnected) for (UnitigGraph::size_type i = 0; i < graph.size(); ++i) { auto adapter = graph.MakeVertexAdapter(i); if (adapter.IsStandalone() || adapter.IsPalindrome()) { continue; } for (int strand = 0; strand < 2; ++strand, adapter.ReverseComplement()) { UnitigGraph::VertexAdapter nexts[4]; double depths[4]; double total_depth = 0; int degree = graph.GetNextAdapters(adapter, nexts); if (degree <= 1) { continue; } for (int j = 0; j < degree; ++j) { depths[j] = nexts[j].GetAvgDepth(); total_depth += depths[j]; } for (int j = 0; j < degree; ++j) { if (depths[j] <= local_ratio * total_depth) { num_disconnected += nexts[j].SetToDisconnect(); } } } } graph.Refresh(false); return num_disconnected; }megahit-1.2.9/src/assembly/weak_link_remover.h000066400000000000000000000004001355123202700214150ustar00rootroot00000000000000// // Created by vout on 11/21/18. 
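//
// Worked example (illustrative; the depths are hypothetical) for
// DisconnectWeakLinks above with local_ratio = 0.1: a vertex whose three
// outgoing neighbors have average depths 30.0, 2.0 and 1.0 gives
// total_depth = 33.0 and a cutoff of 0.1 * 33.0 = 3.3, so the links into the
// 2.0x and 1.0x neighbors are marked to disconnect while the 30.0x link
// survives:
//
//   double depths[3] = {30.0, 2.0, 1.0};
//   double cutoff = 0.1 * (30.0 + 2.0 + 1.0);   // 3.3
//   // depths[j] <= cutoff -> SetToDisconnect() for j = 1 and j = 2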
// #ifndef MEGAHIT_WEAK_LINK_REMOVER_H #define MEGAHIT_WEAK_LINK_REMOVER_H #include class UnitigGraph; uint32_t DisconnectWeakLinks(UnitigGraph&, double local_ratio); #endif // MEGAHIT_WEAK_LINK_REMOVER_H megahit-1.2.9/src/definitions.h000066400000000000000000000030121355123202700164100ustar00rootroot00000000000000/* * MEGAHIT * Copyright (C) 2014 - 2015 The University of Hong Kong & L3 Bioinformatics * Limited * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program. If not, see . */ /* contact: Dinghua Li */ #ifndef MEGAHIT_DEFINITIONS_H #define MEGAHIT_DEFINITIONS_H #include #ifndef PACKAGE_VERSION #define PACKAGE_VERSION "v1.2.9" #endif #include "sdbg/sdbg_def.h" static const unsigned kBitsPerEdgeWord = 32; static const unsigned kBitsPerEdgeChar = 2; static const unsigned kCharsPerEdgeWord = 16; static const unsigned kEdgeCharMask = 0x3; namespace contig_flag { static const unsigned kStandalone = 0x1; static const unsigned kLoop = 0x2; } // namespace contig_flag static const int kUint32PerKmerMaxK = (kMaxK + 1 + 15) / 16; static const int kUint64PerIdbaKmerMaxK = (kMaxK * 2 + 16 + 63) / 64; #include "sequence/kmer.h" typedef Kmer GenericKmer; #endif // MEGAHIT_DEFINITIONS_H megahit-1.2.9/src/idba/000077500000000000000000000000001355123202700146275ustar00rootroot00000000000000megahit-1.2.9/src/idba/README000066400000000000000000000016501355123202700155110ustar00rootroot00000000000000Files in this directory are integrated from IDBA package (https://github.com/loneknightpy/idba). The files are amended to be single-threaded. Lisence: Copyright (C) 2009 Yu Peng (loneknightpy@gmail.com) Copyright (C) 2015 Dinghua Li (voutcn@gmail.com) This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.megahit-1.2.9/src/idba/bit_edges.h000066400000000000000000000025001355123202700167220ustar00rootroot00000000000000/** * @file bit_edges.h * @brief BitEdges Class. * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-26 */ #ifndef __GRAPH_BIT_EDGES_H_ #define __GRAPH_BIT_EDGES_H_ #include #include "kmlib/kmbit.h" /** * @brief It is compact bit vector used to represent edges in de Bruijn graph * (HashGraph). 
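 *
 * Example (illustrative, not from the original sources): each of the four
 * possible edges occupies one bit, so membership, insertion, removal and
 * counting are all O(1) bit operations.
 * @code
 * BitEdges edges;
 * edges.Add(0);                        // edge labelled A
 * edges.Add(3);                        // edge labelled T
 * assert(edges.size() == 2 && edges[3]);
 * edges.Remove(0);
 * assert(uint8_t(edges) == (1 << 3));
 * @endcode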
*/ class BitEdges { public: BitEdges() : edges_(0) {} BitEdges(const BitEdges &bit_edges) : edges_(bit_edges.edges_) {} explicit BitEdges(uint8_t edges) : edges_(edges) {} ~BitEdges() {} const BitEdges &operator=(const BitEdges &bit_edges) { edges_ = bit_edges.edges_; return *this; } const BitEdges &operator=(uint8_t edges) { edges_ = edges; return *this; } operator uint8_t() const { return edges_; } void Add(int x) { edges_ |= uint8_t(1 << x); } void Remove(int x) { edges_ &= ~uint8_t(1 << x); } void swap(BitEdges &bit_edges) { if (this != &bit_edges) std::swap(edges_, bit_edges.edges_); } bool operator[](int index) const { return edges_ & (1 << index); } int size() const { return kmlib::bit::Popcount(edges_); } bool empty() const { return edges_ == 0; } void clear() { edges_ = 0; } private: uint8_t edges_; }; namespace std { template <> inline void swap(BitEdges &x, BitEdges &y) { x.swap(y); } } // namespace std #endif megahit-1.2.9/src/idba/bit_operation.h000066400000000000000000000011561355123202700176410ustar00rootroot00000000000000/** * @file bit_operation.h * @brief Bit Operations. * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-02 * @modified by Dinghua Li * @date 2014-10-02 */ #ifndef __BASIC_BIT_OPERATION_H_ #define __BASIC_BIT_OPERATION_H_ #include #include "kmlib/kmbit.h" namespace bit_operation { template inline void ReverseComplement(T &value) { value = kmlib::bit::ReverseComplement<2, T>(value); } inline int BitToIndex(uint8_t x) { const static int bit_to_index[] = { 0, 0, 1, 0, 2, 0, 0, 0, 3, }; return bit_to_index[x]; } } // namespace bit_operation #endif megahit-1.2.9/src/idba/contig_builder.h000066400000000000000000000050641355123202700177760ustar00rootroot00000000000000/** * @file contig_builder.h * @brief Contig Build Class which builds contig and related contig info. * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.9 * @date 2011-12-27 */ #ifndef __GRAPH_CONTIG_BUILDER_H_ #define __GRAPH_CONTIG_BUILDER_H_ #include "idba/contig_graph_vertex.h" #include "idba/contig_info.h" #include "idba/hash_graph_vertex.h" #include "idba/sequence.h" /** * @brief It is a builder class for building contigs. 
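 *
 * Example (illustrative sketch; `path` is an assumed
 * std::vector<HashGraphVertexAdaptor> produced by an existing HashGraph
 * traversal, not an API of this header):
 * @code
 * ContigBuilder builder(path.front());   // seed with the first k-mer vertex
 * for (size_t i = 1; i < path.size(); ++i)
 *   builder.Append(path[i]);             // each vertex extends by one base
 * const Sequence &contig = builder.contig();
 * const ContigInfo &info = builder.contig_info();
 * @endcode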
*/ class ContigBuilder { public: ContigBuilder() {} explicit ContigBuilder(HashGraphVertexAdaptor x) { Append(x); } explicit ContigBuilder(ContigGraphVertexAdaptor x) { Append(x, 0); } void Append(HashGraphVertexAdaptor x) { if (contig_.size() == 0) { contig_.Assign(x.kmer()); contig_info_.in_edges_ = x.in_edges(); contig_info_.out_edges_ = x.out_edges(); contig_info_.kmer_size_ = x.kmer().size(); contig_info_.kmer_count_ = x.count(); contig_info_.counts_.resize(1); contig_info_.counts_[0] = x.count(); } else { contig_ += x.kmer()[x.kmer().size() - 1]; contig_info_.out_edges_ = x.out_edges(); contig_info_.kmer_count_ += x.count(); contig_info_.counts_ += x.count(); } } void Append(ContigGraphVertexAdaptor x, int d) { if (contig_.size() == 0) { contig_ = x.contig(); contig_info_.in_edges_ = x.in_edges(); contig_info_.out_edges_ = x.out_edges(); contig_info_.kmer_size_ = x.kmer_size(); contig_info_.kmer_count_ = x.kmer_count(); contig_info_.counts_ = x.counts(); } else { if (d <= 0) { contig_.Append(x.contig(), std::min(-d, (int)x.contig_size())); contig_info_.out_edges_ = x.out_edges(); contig_info_.kmer_count_ += x.kmer_count(); SequenceCount counts = x.counts(); contig_info_.counts_ += counts.substr( std::min(-d - contig_info_.kmer_size_ + 1, (int)counts.size())); } else { contig_.Append(d, 4); contig_.Append(x.contig()); contig_info_.out_edges_ = x.out_edges(); contig_info_.kmer_count_ += x.kmer_count(); contig_info_.counts_.append(d, 0); contig_info_.counts_ += x.counts(); } } } const ContigBuilder &ReverseComplement() { contig_.ReverseComplement(); contig_info_.ReverseComplement(); return *this; } const Sequence &contig() const { return contig_; } const ContigInfo &contig_info() const { return contig_info_; } void clear() { contig_.clear(); contig_info_.clear(); } private: Sequence contig_; ContigInfo contig_info_; }; #endif megahit-1.2.9/src/idba/contig_graph.cpp000066400000000000000000000230121355123202700177750ustar00rootroot00000000000000/** * @file contig_graph.cpp * @brief * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-26 */ #include "contig_graph.h" #include #include #include #include #include #include #include "contig_graph_branch_group.h" #include "sequence.h" using namespace std; void ContigGraph::Initialize(const deque &contigs, const deque &contig_infos) { vertices_.clear(); vertices_.resize(contigs.size()); for (int64_t i = 0; i < (int64_t)contigs.size(); ++i) { vertices_[i].clear(); vertices_[i].set_contig(contigs[i]); vertices_[i].set_contig_info(contig_infos[i]); vertices_[i].set_id(i); } RefreshEdges(); } void ContigGraph::Refresh() { RefreshVertices(); RefreshEdges(); } void ContigGraph::RefreshVertices() { uint64_t index = 0; for (unsigned i = 0; i < vertices_.size(); ++i) { if (!vertices_[i].status().IsDead()) { vertices_[index].swap(vertices_[i]); vertices_[index].set_id(index); ++index; } } vertices_.resize(index); } void ContigGraph::RefreshEdges() { BuildBeginIdbaKmerMap(); uint64_t total_degree = 0; for (int64_t i = 0; i < (int64_t)vertices_.size(); ++i) { for (int strand = 0; strand < 2; ++strand) { ContigGraphVertexAdaptor current(&vertices_[i], strand); for (int x = 0; x < 4; ++x) { if (current.out_edges()[x]) { IdbaKmer kmer = current.end_kmer(kmer_size_); kmer.ShiftAppend(x); if (FindVertexAdaptorByBeginIdbaKmer(kmer).is_null()) current.out_edges().Remove(x); } } total_degree += current.out_edges().size(); } if (vertices_[i].contig().size() == kmer_size_ && vertices_[i].contig().IsPalindrome()) { vertices_[i].in_edges() = 
vertices_[i].out_edges() | vertices_[i].out_edges(); vertices_[i].out_edges() = vertices_[i].in_edges(); } } num_edges_ = total_degree / 2; } void ContigGraph::ClearStatus() { for (int64_t i = 0; i < (int64_t)vertices_.size(); ++i) vertices_[i].status().clear(); } int64_t ContigGraph::Trim(int min_length) { uint64_t old_num_vertices = vertices_.size(); for (int64_t i = 0; i < (int64_t)vertices_.size(); ++i) { if (vertices_[i].contig().size() == kmer_size_ && vertices_[i].contig().IsPalindrome()) continue; if ((vertices_[i].in_edges().empty() || vertices_[i].out_edges().empty()) && vertices_[i].contig().size() < min_length + kmer_size_ - 1 && (vertices_[i].in_edges().size() + vertices_[i].out_edges().size() <= 1)) { vertices_[i].status().SetDeadFlag(); } } Refresh(); MergeSimplePaths(); return old_num_vertices - vertices_.size(); } int64_t ContigGraph::RemoveDeadEnd(int min_length) { uint64_t num_deadend = 0; int l = 1; while (true) { l = min(2 * l, min_length); num_deadend += Trim(l); if (l == min_length) break; } num_deadend += Trim(min_length); return num_deadend; } int64_t ContigGraph::RemoveBubble() { deque candidates; for (int64_t i = 0; i < (int64_t)vertices_.size(); ++i) { for (int strand = 0; strand < 2; ++strand) { ContigGraphVertexAdaptor current(&vertices_[i], strand); if (current.out_edges().size() > 1 && current.contig_size() > kmer_size_) { ContigGraphBranchGroup branch_group(this, current, 4, kmer_size_ + 2); if (branch_group.Search()) { ContigGraphVertexAdaptor begin = branch_group.begin(); ContigGraphVertexAdaptor end = branch_group.end(); begin.ReverseComplement(); end.ReverseComplement(); std::swap(begin, end); ContigGraphBranchGroup rev_branch_group(this, begin, 4, kmer_size_ + 2); if (rev_branch_group.Search() && rev_branch_group.end() == end) { candidates.push_back(current); } } } } } int64_t bubble = 0; for (unsigned i = 0; i < candidates.size(); ++i) { ContigGraphVertexAdaptor current = candidates[i]; if (current.out_edges().size() > 1) { ContigGraphBranchGroup branch_group(this, current, 4, kmer_size_ + 2); if (branch_group.Search()) { ContigGraphVertexAdaptor begin = branch_group.begin(); ContigGraphVertexAdaptor end = branch_group.end(); begin.ReverseComplement(); end.ReverseComplement(); std::swap(begin, end); ContigGraphBranchGroup rev_branch_group(this, begin, 4, kmer_size_ + 2); if (rev_branch_group.Search() && rev_branch_group.end() == end) { branch_group.Merge(); ++bubble; } } } } Refresh(); MergeSimplePaths(); return bubble; } double ContigGraph::IterateCoverage(int min_length, double min_cover, double max_cover, double factor) { min_cover = min(min_cover, max_cover); while (true) { RemoveLowCoverage(min_cover, min_length); min_cover *= factor; if (min_cover >= max_cover) break; } return min_cover; } bool ContigGraph::RemoveLowCoverage(double min_cover, int min_length) { bool is_changed = false; for (int64_t i = 0; i < (int64_t)vertices_.size(); ++i) { ContigGraphVertexAdaptor current(&vertices_[i]); if (current.contig_size() < min_length + kmer_size_ - 1 && ((current.in_edges().size() <= 1 && current.out_edges().size() <= 1) || current.in_edges().size() == 0 || current.out_edges().size() == 0)) { if (current.coverage() < min_cover) { is_changed = true; current.status().SetDeadFlag(); } } } Refresh(); // Trim(min_length); MergeSimplePaths(); return is_changed; } void ContigGraph::MergeSimplePaths() { deque contigs; deque contig_infos; Assemble(contigs, contig_infos); Initialize(contigs, contig_infos); } int64_t ContigGraph::Assemble(deque &contigs, deque 
&contig_infos) { contigs.clear(); contig_infos.clear(); for (int64_t i = 0; i < (int64_t)vertices_.size(); ++i) { if (vertices_[i].contig().size() == kmer_size_ && vertices_[i].contig().IsPalindrome()) { vertices_[i].status().Lock(1); Sequence contig = vertices_[i].contig(); // ContigInfo contig_info(vertices_[i].kmer_count(), // vertices_[i].in_edges(), vertices_[i].out_edges()); ContigInfo contig_info; contig_info.set_kmer_count(vertices_[i].kmer_count()); contig_info.in_edges() = vertices_[i].in_edges(); contig_info.out_edges() = vertices_[i].out_edges(); contigs.push_back(contig); contig_infos.push_back(contig_info); } } for (int64_t i = 0; i < (int64_t)vertices_.size(); ++i) { if (!vertices_[i].status().Lock(0)) continue; ContigGraphPath path; path.Append(ContigGraphVertexAdaptor(&vertices_[i]), 0); Sequence contig; ContigInfo contig_info; for (int strand = 0; strand < 2; ++strand) { while (true) { ContigGraphVertexAdaptor current = path.back(); ContigGraphVertexAdaptor next; if (!GetNextVertexAdaptor(current, next)) break; if (IsPalindromeLoop(path, next)) break; if (IsLoop(path, next)) goto FAIL; if (!next.status().LockPreempt(0)) goto FAIL; path.Append(next, -kmer_size_ + 1); } path.ReverseComplement(); } path.Assemble(contig, contig_info); contigs.push_back(contig); contig_infos.push_back(contig_info); FAIL:; } ClearStatus(); return contigs.size(); } struct SearchNode { ContigGraphVertexAdaptor node; int distance; int label; }; void ContigGraph::BuildBeginIdbaKmerMap() { begin_kmer_map_.clear(); begin_kmer_map_.reserve(vertices_.size() * 2); for (int64_t i = 0; i < (int64_t)vertices_.size(); ++i) { for (int strand = 0; strand < 2; ++strand) { ContigGraphVertexAdaptor current(&vertices_[i], strand); IdbaKmer kmer = current.begin_kmer(kmer_size_); IdbaKmer key = kmer.unique_format(); begin_kmer_map_[key] = i; } } } bool ContigGraph::CycleDetect(ContigGraphVertexAdaptor current, map &status) { if (status[current.id()] == 0) { bool flag = false; status[current.id()] = 1; for (int x = 0; x < 4; ++x) { if (current.out_edges()[x]) { if (CycleDetect(GetNeighbor(current, x), status)) flag = true; } } status[current.id()] = 2; return flag; } else if (status[current.id()] == 1) return true; else return false; } void ContigGraph::TopSortDFS(deque &order, ContigGraphVertexAdaptor current, map &status) { if (status[current.id()] == 0) { status[current.id()] = 1; for (int x = 0; x < 4; ++x) { if (current.out_edges()[x]) TopSortDFS(order, GetNeighbor(current, x), status); } order.push_back(current); } } int ContigGraph::GetDepth(ContigGraphVertexAdaptor current, int depth, int &maximum, int min_length) { if (depth > maximum) maximum = depth; if (maximum >= min_length) return min_length; deque neighbors; GetNeighbors(current, neighbors); for (unsigned i = 0; i < neighbors.size(); ++i) { if (neighbors[i].status().IsDead()) continue; GetDepth(neighbors[i], depth - kmer_size_ + 1 + neighbors[i].contig_size(), maximum, min_length); } return min(maximum, min_length); } megahit-1.2.9/src/idba/contig_graph.h000066400000000000000000000115751355123202700174550ustar00rootroot00000000000000/** * @file contig_graph.h * @brief ContigGraph Class. 
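 *
 * Example (illustrative sketch; the k-mer size and the input/output deques
 * are hypothetical) of the clean-then-assemble cycle implemented below:
 * @code
 * ContigGraph graph(25);
 * graph.Initialize(contigs, contig_infos);   // deques of Sequence/ContigInfo
 * graph.RemoveDeadEnd(75);                   // trim short dead ends
 * graph.RemoveBubble();                      // merge simple bubbles
 * graph.Assemble(out_contigs, out_infos);    // emit merged simple paths
 * @endcode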
* @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-16 */ #ifndef __GRAPH_CONTIG_GRAPH_H_ #define __GRAPH_CONTIG_GRAPH_H_ #include #include #include #include "parallel_hashmap/phmap.h" #include "bit_operation.h" #include "idba/contig_graph_path.h" #include "idba/contig_graph_vertex.h" #include "idba/contig_info.h" #include "idba/hash.h" #include "idba/hash_graph.h" #include "idba/kmer.h" #include "idba/sequence.h" /** * @brief It is compact version de Bruijn graph in which each vertex is a contig * and each edge between contigs means they are connected in de Bruijn graph. */ class ContigGraph { public: using HashMap = phmap::flat_hash_map>; explicit ContigGraph(uint32_t kmer_size = 0) : num_edges_(0), kmer_size_(kmer_size) {} ~ContigGraph() { clear(); } void Initialize(const std::deque &contigs, const std::deque &contig_infos); void Refresh(); void RefreshVertices(); void RefreshEdges(); void AddEdge(ContigGraphVertexAdaptor from, ContigGraphVertexAdaptor to) { from.out_edges().Add(to.contig()[kmer_size_ - 1]); from.ReverseComplement(); to.ReverseComplement(); std::swap(from, to); from.out_edges().Add(to.contig()[kmer_size_ - 1]); } void RemoveEdge(ContigGraphVertexAdaptor current, int x) { current.out_edges().Remove(x); ContigGraphVertexAdaptor next = GetNeighbor(current, x); next.ReverseComplement(); next.out_edges().Remove(3 - current.contig()[0]); } void ClearStatus(); void MergeSimplePaths(); int64_t Trim(int min_length); int64_t RemoveDeadEnd(int min_length); int64_t RemoveBubble(); double IterateCoverage(int min_length, double min_cover, double max_cover, double factor = 1.1); bool RemoveLowCoverage(double min_cover, int min_length); int64_t Assemble(std::deque &contigs, std::deque &contig_infos); ContigGraphVertexAdaptor GetNeighbor(const ContigGraphVertexAdaptor ¤t, int x) { IdbaKmer kmer = current.end_kmer(kmer_size_); kmer.ShiftAppend(x); return FindVertexAdaptorByBeginIdbaKmer(kmer); } void GetNeighbors(const ContigGraphVertexAdaptor ¤t, std::deque &neighbors) { neighbors.clear(); for (int x = 0; x < 4; ++x) { if (current.out_edges()[x]) neighbors.push_back(GetNeighbor(current, x)); } } void swap(ContigGraph &contig_graph) { begin_kmer_map_.swap(contig_graph.begin_kmer_map_); vertices_.swap(contig_graph.vertices_); std::swap(num_edges_, contig_graph.num_edges_); std::swap(kmer_size_, contig_graph.kmer_size_); } uint32_t kmer_size() const { return kmer_size_; } void set_kmer_size(uint32_t kmer_size) { kmer_size_ = kmer_size; } void clear() { num_edges_ = 0; vertices_.clear(); begin_kmer_map_.clear(); in_kmer_count_table_.clear(); } private: ContigGraph(const ContigGraph &); const ContigGraph &operator=(const ContigGraph &); void BuildBeginIdbaKmerMap(); bool GetNextVertexAdaptor(ContigGraphVertexAdaptor ¤t, ContigGraphVertexAdaptor &next) { if (current.out_edges().size() != 1) return false; next = GetNeighbor(current, bit_operation::BitToIndex(current.out_edges())); return next.in_edges().size() == 1 && !(next.contig_size() == kmer_size_ && next.contig().IsPalindrome()); } bool IsLoop(const ContigGraphPath &path, const ContigGraphVertexAdaptor &next) { return path.front().id() == next.id(); } bool IsPalindromeLoop(const ContigGraphPath &path, const ContigGraphVertexAdaptor &next) { return path.back().id() == next.id(); } ContigGraphVertexAdaptor FindVertexAdaptorByBeginIdbaKmer( const IdbaKmer &begin_kmer) { IdbaKmer key = begin_kmer.unique_format(); auto iter = begin_kmer_map_.find(key); if (iter != begin_kmer_map_.end()) { ContigGraphVertexAdaptor 
current(&vertices_[iter->second]); if (current.begin_kmer(kmer_size_) == begin_kmer) return current; current.ReverseComplement(); if (current.begin_kmer(kmer_size_) == begin_kmer) return current; } return ContigGraphVertexAdaptor(); } bool CycleDetect(ContigGraphVertexAdaptor current, std::map &status); void TopSortDFS(std::deque &order, ContigGraphVertexAdaptor current, std::map &status); int GetDepth(ContigGraphVertexAdaptor current, int length, int &maximum, int min_length); HashMap begin_kmer_map_; std::deque vertices_; uint64_t num_edges_; uint32_t kmer_size_; HashMap in_kmer_count_table_; }; #endif megahit-1.2.9/src/idba/contig_graph_branch_group.cpp000066400000000000000000000055251355123202700225370ustar00rootroot00000000000000/** * @file contig_graph_branch_group.cpp * @brief * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.9 * @date 2011-12-27 */ #include "contig_graph_branch_group.h" #include #include #include using namespace std; bool ContigGraphBranchGroup::Search() { int kmer_size = contig_graph_->kmer_size(); branches_.reserve(max_branches_); ContigGraphPath path; path.Append(begin_, 0); branches_.push_back(path); if ((int)begin_.out_edges().size() <= 1 || (int)begin_.out_edges().size() > max_branches_ || (int)begin_.contig_size() == kmer_size) return false; bool is_converge = false; for (int k = 1; k < max_length_; ++k) { int num_branches = branches_.size(); bool is_extend = false; for (int i = 0; i < num_branches; ++i) { if ((int)branches_[i].internal_size(kmer_size) >= max_length_) continue; ContigGraphVertexAdaptor current = branches_[i].back(); if (current.out_edges().size() == 0) return false; bool is_first = true; ContigGraphPath path = branches_[i]; for (int x = 0; x < 4; ++x) { if (current.out_edges()[x]) { ContigGraphVertexAdaptor next = contig_graph_->GetNeighbor(current, x); if (next.status().IsDead()) return false; if (is_first) { branches_[i].Append(next, -kmer_size + 1); is_first = false; } else { if ((int)branches_.size() == max_branches_) return false; path.Append(next, -kmer_size + 1); branches_.push_back(path); path.Pop(); } is_extend = true; } } } end_ = branches_[0].back(); if ((int)end_.contig_size() > kmer_size) { is_converge = true; for (unsigned i = 0; i < branches_.size(); ++i) { if (branches_[i].back() != end_ || (int)branches_[i].internal_size(kmer_size) != max_length_) { is_converge = false; break; } } if (is_converge) break; } if (!is_extend) break; } return is_converge && begin_ != end_; } void ContigGraphBranchGroup::Merge() { unsigned best = 0; for (unsigned i = 1; i < branches_.size(); ++i) { if (branches_[i].kmer_count() > branches_[best].kmer_count()) best = i; } for (unsigned i = 0; i < branches_.size(); ++i) { ContigGraphPath &path = branches_[i]; path.front().out_edges() = 0; path.back().in_edges() = 0; for (unsigned j = 1; j + 1 < path.num_nodes(); ++j) { path[j].in_edges() = 0; path[j].out_edges() = 0; path[j].status().SetDeadFlag(); } } ContigGraphPath &path = branches_[best]; for (unsigned j = 1; j + 1 < path.num_nodes(); ++j) path[j].status().ResetDeadFlag(); for (unsigned j = 0; j + 1 < path.num_nodes(); ++j) contig_graph_->AddEdge(path[j], path[j + 1]); } megahit-1.2.9/src/idba/contig_graph_branch_group.h000066400000000000000000000021521355123202700221750ustar00rootroot00000000000000/** * @file contig_graph_branch_group.h * @brief ContigGraphBranchGroup Class. 
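 *
 * A branch group is a small bubble: up to max_branches paths that leave the
 * same begin vertex and converge again within max_length bases (by default
 * 2 * k + 2). Search() enumerates the branches and checks convergence;
 * Merge() keeps the branch with the largest k-mer count and marks the
 * vertices of the other branches dead. A minimal hypothetical driver,
 * assuming `graph` is a ContigGraph and `v` one of its vertex adaptors:
 *
 *   ContigGraphBranchGroup group(&graph, v);
 *   if (group.Search()) group.Merge();  // collapse the bubble if it is bounded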
* @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.9 * @date 2011-12-27 */ #ifndef __GRAPH_CONTIG_GRAPH_BRANCH_GROUP_H_ #define __GRAPH_CONTIG_GRAPH_BRANCH_GROUP_H_ #include #include "idba/contig_graph.h" #include "idba/contig_graph_path.h" /** * @brief It is used to contain a branch group in ContigGraph. */ class ContigGraphBranchGroup { public: ContigGraphBranchGroup(ContigGraph *graph, ContigGraphVertexAdaptor begin, int max_branches = 2, int max_length = 0) { contig_graph_ = graph; begin_ = begin; max_branches_ = max_branches; max_length_ = max_length; if (max_length_ == 0) max_length_ = 2 * contig_graph_->kmer_size() + 2; } bool Search(); void Merge(); ContigGraphVertexAdaptor begin() { return begin_; } ContigGraphVertexAdaptor end() { return end_; } private: ContigGraph *contig_graph_; ContigGraphVertexAdaptor begin_; ContigGraphVertexAdaptor end_; std::vector branches_; int max_branches_; int max_length_; }; #endif megahit-1.2.9/src/idba/contig_graph_path.h000066400000000000000000000076041355123202700204670ustar00rootroot00000000000000/** * @file contig_graph_path.h * @brief ContigGraph Class. * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-16 */ #ifndef __GRAPH_CONTIG_GRAPH_PATH_H_ #define __GRAPH_CONTIG_GRAPH_PATH_H_ #include #include #include "idba/contig_builder.h" #include "idba/contig_graph_vertex.h" #include "idba/contig_info.h" /** * @brief It is a path of contigs in ContigGraph. */ class ContigGraphPath { public: ContigGraphPath() {} ContigGraphPath(const ContigGraphPath &path) : vertices_(path.vertices_), distances_(path.distances_) {} const ContigGraphPath &operator=(const ContigGraphPath &path) { vertices_ = path.vertices_; distances_ = path.distances_; return *this; } bool operator<(const ContigGraphPath &path) const { for (unsigned i = 0; i < num_nodes() && i < path.num_nodes(); ++i) { if ((*this)[i] != path[i]) return (*this)[i] < path[i]; } return num_nodes() < path.num_nodes(); } ContigGraphVertexAdaptor &operator[](uint32_t index) { return vertices_[index]; } const ContigGraphVertexAdaptor &operator[](uint32_t index) const { return vertices_[index]; } void Append(const ContigGraphVertexAdaptor &vertex, int d) { vertices_.push_back(vertex); if (vertices_.size() > 1) distances_.push_back(d); } void Append(const ContigGraphPath &path, int d) { for (unsigned i = 0; i < path.num_nodes(); ++i) { if (i == 0) Append(path[i], d); else Append(path[i], path.distances()[i - 1]); } } void Pop() { vertices_.pop_back(); if (!distances_.empty()) distances_.pop_back(); } const ContigGraphPath &ReverseComplement() { std::reverse(vertices_.begin(), vertices_.end()); for (unsigned i = 0; i < vertices_.size(); ++i) vertices_[i].ReverseComplement(); std::reverse(distances_.begin(), distances_.end()); return *this; } void Assemble(Sequence &contig, ContigInfo &contig_info) { ContigBuilder contig_builder; if (vertices_.size() > 0) { contig_builder.Append(vertices_[0], 0); for (unsigned i = 1; i < vertices_.size(); ++i) contig_builder.Append(vertices_[i], distances_[i - 1]); } contig = contig_builder.contig(); contig_info = contig_builder.contig_info(); } void swap(ContigGraphPath &path) { if (this != &path) { vertices_.swap(path.vertices_); distances_.swap(path.distances_); } } ContigGraphVertexAdaptor &front() { return vertices_.front(); } const ContigGraphVertexAdaptor &front() const { return vertices_.front(); } ContigGraphVertexAdaptor &back() { return vertices_.back(); } const ContigGraphVertexAdaptor &back() const { return vertices_.back(); } uint64_t 
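  // The path records each vertex plus its distance to the previous one; for
  // adjacent contigs in the de Bruijn graph this distance is -(k - 1), i.e. a
  // k-1 base overlap. A hypothetical sketch of building a two-vertex path and
  // flattening it into one contig (a and b are adaptors, k the k-mer size):
  //
  //   ContigGraphPath path;
  //   path.Append(a, 0);            // distance of the first vertex is unused
  //   path.Append(b, -(int)k + 1);  // b overlaps a by k - 1 bases
  //   Sequence contig;
  //   ContigInfo info;
  //   path.Assemble(contig, info);  // concatenate using the stored distances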
kmer_count() const { uint64_t sum = 0; for (unsigned i = 0; i < vertices_.size(); ++i) sum += vertices_[i].kmer_count(); return sum; } uint32_t size() const { uint32_t size = 0; for (unsigned i = 0; i < vertices_.size(); ++i) size += vertices_[i].contig_size(); for (unsigned i = 0; i < distances_.size(); ++i) size += distances_[i]; return size; } uint32_t internal_size(int kmer_size) const { if (vertices_.size() <= 1) return vertices_.size(); uint32_t size = kmer_size + 1; for (unsigned i = 1; i + 1 < vertices_.size(); ++i) size += vertices_[i].contig_size(); for (unsigned i = 0; i < distances_.size(); ++i) size += distances_[i]; return size; } uint32_t num_nodes() const { return vertices_.size(); } void clear() { vertices_.clear(); distances_.clear(); } std::deque &distances() { return distances_; } const std::deque &distances() const { return distances_; } private: std::deque vertices_; std::deque distances_; }; namespace std { template <> inline void swap(ContigGraphPath &x, ContigGraphPath &y) { x.swap(y); } } // namespace std #endif megahit-1.2.9/src/idba/contig_graph_vertex.h000066400000000000000000000174701355123202700210520ustar00rootroot00000000000000/** * @file contig_graph_vertex.h * @brief ContigGraphVertex Class and ContigGraphVertexAdaptor Class. * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-16 */ #ifndef __GRAPH_CONTIG_GRAPH_VERTEX_H_ #define __GRAPH_CONTIG_GRAPH_VERTEX_H_ #include #include #include #include "idba/bit_edges.h" #include "idba/contig_info.h" #include "idba/kmer.h" #include "idba/sequence.h" #include "idba/vertex_status.h" /** * @brief It is the vertex class used in ContigGraph class. */ class ContigGraphVertex { public: explicit ContigGraphVertex(const Sequence &contig = Sequence(), const ContigInfo &contig_info = ContigInfo()) : contig_(contig), contig_info_(contig_info) {} ContigGraphVertex(const ContigGraphVertex &x) : contig_(x.contig_), id_(x.id_), status_(x.status_), contig_info_(x.contig_info_) {} const ContigGraphVertex &operator=(const ContigGraphVertex &x) { if (this != &x) { contig_ = x.contig_; id_ = x.id_; contig_info_ = x.contig_info_; } return *this; } const Sequence &contig() const { return contig_; } void set_contig(const Sequence &contig) { contig_ = contig; } uint32_t contig_size() const { return contig_.size(); } uint32_t num_kmer() const { return contig_.size() - kmer_size() + 1; } const ContigInfo &contig_info() const { return contig_info_; } void set_contig_info(const ContigInfo &contig_info) { contig_info_ = contig_info; } uint64_t kmer_count() const { return contig_info_.kmer_count(); } void set_kmer_count(uint64_t kmer_count) { contig_info_.set_kmer_count(kmer_count); } uint32_t id() const { return id_; } void set_id(uint32_t id) { id_ = id; } uint32_t kmer_size() const { return contig_info_.kmer_size(); } void set_kmer_size(uint32_t kmer_size) { contig_info_.set_kmer_size(kmer_size); } VertexStatus &status() { return status_; } const VertexStatus &status() const { return status_; } BitEdges &in_edges() { return contig_info_.in_edges(); } const BitEdges &in_edges() const { return contig_info_.in_edges(); } BitEdges &out_edges() { return contig_info_.out_edges(); } const BitEdges &out_edges() const { return contig_info_.out_edges(); } IdbaKmer begin_kmer(int kmer_size) const { return contig_.GetIdbaKmer(0, kmer_size); } IdbaKmer end_kmer(int kmer_size) const { return contig_.GetIdbaKmer(contig_.size() - kmer_size, kmer_size); } double coverage() const { return 1.0 * contig_info_.kmer_count() / (contig_size() - 
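    // coverage = total k-mer count / number of k-mer positions; a contig of
    // length L contains L - k + 1 k-mers. For example, with k = 31 a 130 bp
    // contig whose kmer_count() is 500 has coverage 500 / (130 - 31 + 1) = 5.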
                                        kmer_size() + 1);
  }

  const SequenceCount &counts() const { return contig_info_.counts(); }
  void set_counts(const SequenceCount &counts) {
    contig_info_.set_counts(counts);
  }

  char get_base(uint32_t index) const { return contig_[index]; }

  SequenceCountUnitType get_count(uint32_t index) const {
    return contig_info_.counts()[index];
  }

  void swap(ContigGraphVertex &x) {
    if (this != &x) {
      contig_.swap(x.contig_);
      std::swap(id_, x.id_);
      status_.swap(x.status_);
      contig_info_.swap(x.contig_info_);
    }
  }

  void clear() {
    contig_.clear();
    id_ = 0;
    status_.clear();
    contig_info_.clear();
  }

 private:
  Sequence contig_;
  uint32_t id_;
  VertexStatus status_;
  ContigInfo contig_info_;
};

/**
 * @brief It is an adaptor class used to access ContigGraphVertex. Because a
 * contig and its reverse complement share the same vertex, using the adaptor
 * makes sure that modifications to the vertex are consistent.
 */
class ContigGraphVertexAdaptor {
 public:
  explicit ContigGraphVertexAdaptor(ContigGraphVertex *vertex = NULL,
                                    bool is_reverse = false) {
    vertex_ = vertex;
    is_reverse_ = is_reverse;
  }
  ContigGraphVertexAdaptor(const ContigGraphVertexAdaptor &x) {
    vertex_ = x.vertex_, is_reverse_ = x.is_reverse_;
  }

  const ContigGraphVertexAdaptor &operator=(const ContigGraphVertexAdaptor &x) {
    vertex_ = x.vertex_;
    is_reverse_ = x.is_reverse_;
    return *this;
  }

  bool operator<(const ContigGraphVertexAdaptor &x) const {
    return (vertex_ != x.vertex_) ? (vertex_ < x.vertex_)
                                  : (is_reverse_ < x.is_reverse_);
  }
  bool operator>(const ContigGraphVertexAdaptor &x) const {
    return (vertex_ != x.vertex_) ? (vertex_ > x.vertex_)
                                  : (is_reverse_ > x.is_reverse_);
  }

  bool operator==(const ContigGraphVertexAdaptor &x) const {
    return vertex_ == x.vertex_ && is_reverse_ == x.is_reverse_;
  }
  bool operator!=(const ContigGraphVertexAdaptor &x) const {
    return vertex_ != x.vertex_ || is_reverse_ != x.is_reverse_;
  }

  const ContigGraphVertexAdaptor &ReverseComplement() {
    is_reverse_ = !is_reverse_;
    return *this;
  }

  Sequence contig() const {
    Sequence contig = vertex_->contig();
    return !is_reverse_ ? contig : contig.ReverseComplement();
  }

  uint32_t contig_size() const { return vertex_->contig().size(); }
  uint32_t num_kmer() const { return vertex_->num_kmer(); }

  void set_vertex(ContigGraphVertex *vertex, bool is_reverse) {
    vertex_ = vertex;
    is_reverse_ = is_reverse;
  }

  ContigInfo contig_info() const {
    ContigInfo contig_info = vertex_->contig_info();
    return (!is_reverse_ ? contig_info : contig_info.ReverseComplement());
  }

  uint64_t kmer_size() const { return vertex_->kmer_size(); }
  void set_kmer_size(uint64_t kmer_size) { vertex_->set_kmer_size(kmer_size); }

  uint64_t kmer_count() const { return vertex_->kmer_count(); }
  void set_kmer_count(uint64_t kmer_count) {
    vertex_->set_kmer_count(kmer_count);
  }

  uint32_t id() const { return vertex_->id(); }
  void set_id(uint32_t id) { vertex_->set_id(id); }

  VertexStatus &status() { return vertex_->status(); }
  const VertexStatus &status() const { return vertex_->status(); }

  BitEdges &in_edges() {
    return !is_reverse_ ? vertex_->in_edges() : vertex_->out_edges();
  }
  const BitEdges &in_edges() const {
    return !is_reverse_ ? vertex_->in_edges() : vertex_->out_edges();
  }
  BitEdges &out_edges() {
    return !is_reverse_ ? vertex_->out_edges() : vertex_->in_edges();
  }
  const BitEdges &out_edges() const {
    return !is_reverse_ ?
vertex_->out_edges() : vertex_->in_edges(); } SequenceCount counts() { if (!is_reverse_) return vertex_->counts(); else { SequenceCount counts = vertex_->counts(); std::reverse(counts.begin(), counts.end()); return counts; } } char get_base(uint32_t index) const { return (!is_reverse_) ? vertex_->get_base(index) : 3 - vertex_->get_base(contig_size() - 1 - index); } SequenceCountUnitType get_count(uint32_t index) const { return (!is_reverse_) ? vertex_->get_count(index) : vertex_->get_count(vertex_->counts().size() - 1 - index); } IdbaKmer begin_kmer(int kmer_size) const { return !is_reverse_ ? vertex_->begin_kmer(kmer_size) : vertex_->end_kmer(kmer_size).ReverseComplement(); } IdbaKmer end_kmer(int kmer_size) const { return !is_reverse_ ? vertex_->end_kmer(kmer_size) : vertex_->begin_kmer(kmer_size).ReverseComplement(); } double coverage() const { return vertex_->coverage(); } bool is_reverse() const { return is_reverse_; } void swap(ContigGraphVertexAdaptor &x) { if (this != &x) { std::swap(vertex_, x.vertex_); std::swap(is_reverse_, x.is_reverse_); } } bool is_null() const { return vertex_ == NULL; } void clear() { vertex_->clear(); } private: ContigGraphVertex *vertex_; bool is_reverse_; }; namespace std { template <> inline void swap(ContigGraphVertex &x, ContigGraphVertex &y) { x.swap(y); } template <> inline void swap(ContigGraphVertexAdaptor &x, ContigGraphVertexAdaptor &y) { x.swap(y); } } // namespace std #endif megahit-1.2.9/src/idba/contig_info.h000066400000000000000000000051511355123202700173000ustar00rootroot00000000000000/** * @file contig_info.h * @brief ContigInfo Class. * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-26 */ #ifndef __GRAPH_CONTIG_INFO_H_ #define __GRAPH_CONTIG_INFO_H_ #include "bit_edges.h" #include #include #include #include #include typedef uint32_t SequenceCountUnitType; typedef std::basic_string SequenceCount; class ContigBuilder; /** * @brief It is used to store information of contigs, like k-mer counts, * in-edges, * out-edges, etc. 
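 *
 * The fields are typically filled in one by one when a vertex is exported,
 * e.g. (a sketch mirroring ContigGraph::Assemble):
 *
 *   ContigInfo info;
 *   info.set_kmer_count(vertex.kmer_count());
 *   info.in_edges() = vertex.in_edges();
 *   info.out_edges() = vertex.out_edges();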
*/ class ContigInfo { friend class ContigBuilder; public: ContigInfo() { kmer_count_ = 0; kmer_size_ = 0; } ContigInfo(const ContigInfo &contig_info) { in_edges_ = contig_info.in_edges_; out_edges_ = contig_info.out_edges_; kmer_count_ = contig_info.kmer_count_; kmer_size_ = contig_info.kmer_size_; counts_ = contig_info.counts_; } const ContigInfo &operator=(const ContigInfo &contig_info) { in_edges_ = contig_info.in_edges_; out_edges_ = contig_info.out_edges_; kmer_count_ = contig_info.kmer_count_; kmer_size_ = contig_info.kmer_size_; counts_ = contig_info.counts_; return *this; } const ContigInfo &ReverseComplement() { std::swap(in_edges_, out_edges_); std::reverse(counts_.begin(), counts_.end()); return *this; } BitEdges &in_edges() { return in_edges_; } const BitEdges &in_edges() const { return in_edges_; } BitEdges &out_edges() { return out_edges_; } const BitEdges &out_edges() const { return out_edges_; } uint32_t kmer_size() const { return kmer_size_; } void set_kmer_size(uint32_t kmer_size) { kmer_size_ = kmer_size; } uint32_t kmer_count() const { return kmer_count_; } void set_kmer_count(uint32_t kmer_count) { kmer_count_ = kmer_count; } const SequenceCount &counts() const { return counts_; } void set_counts(const SequenceCount &counts) { counts_ = counts; } void swap(ContigInfo &contig_info) { if (this != &contig_info) { std::swap(in_edges_, contig_info.in_edges_); std::swap(out_edges_, contig_info.out_edges_); std::swap(kmer_size_, contig_info.kmer_size_); std::swap(kmer_count_, contig_info.kmer_count_); counts_.swap(contig_info.counts_); } } void clear() { in_edges_ = 0; out_edges_ = 0; kmer_size_ = 0; kmer_count_ = 0; counts_.clear(); } private: BitEdges in_edges_; BitEdges out_edges_; uint16_t kmer_size_; uint32_t kmer_count_; SequenceCount counts_; }; namespace std { template <> inline void swap(ContigInfo &x, ContigInfo &y) { x.swap(y); } } // namespace std #endif megahit-1.2.9/src/idba/functional.h000066400000000000000000000021331355123202700171410ustar00rootroot00000000000000/** * @file functional.h * @brief Useful Functionals. * @author Yu Peng * @version 1.0.0 * @date 2011-08-24 */ #ifndef __BASIC_FUNCTIONALS_H_ #define __BASIC_FUNCTIONALS_H_ #include /** * @brief It is a basic functional to return the input itself. * * @tparam T */ template struct Identity { const T &operator()(const T &value) const { return value; } }; /** * @brief It is a basic functional to select the first element of a stl pair. * * @tparam Pair */ template struct Select1st { typedef typename Pair::first_type value_type; const value_type &operator()(const Pair &pair) const { return pair.first; } }; /** * @brief It is a basic functional to get the key value from a key-value pair. * * @tparam Key * @tparam Value */ template struct GetKey { typedef const Key &result_type; const Key &operator()(const Value &value) const { return value.key(); } }; template struct SetKey { inline void operator()(Value *value, const Value &new_key) const { *value = new_key; } }; #endif megahit-1.2.9/src/idba/hash.h000066400000000000000000000035231355123202700157260ustar00rootroot00000000000000/** * @file hash.h * @brief Hash Functionals. 
* @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-24 */ #ifndef __BASIC_HASH_H_ #define __BASIC_HASH_H_ #include #include #include "xxhash/xxh3.h" template struct Hash : public std::unary_function { uint64_t operator()(const T &value) const { return value.hash(); } }; template <> struct Hash : public std::unary_function { uint64_t operator()(int8_t value) const { return XXH3_64bits(&value, sizeof(value)); } }; template <> struct Hash : public std::unary_function { uint64_t operator()(uint8_t value) const { return XXH3_64bits(&value, sizeof(value)); } }; template <> struct Hash : public std::unary_function { uint64_t operator()(int16_t value) const { return XXH3_64bits(&value, sizeof(value)); } }; template <> struct Hash : public std::unary_function { uint64_t operator()(uint16_t value) const { return XXH3_64bits(&value, sizeof(value)); } }; template <> struct Hash : public std::unary_function { uint64_t operator()(int32_t value) const { return XXH3_64bits(&value, sizeof(value)); } }; template <> struct Hash : public std::unary_function { uint64_t operator()(uint32_t value) const { return XXH3_64bits(&value, sizeof(value)); } }; template <> struct Hash : public std::unary_function { uint64_t operator()(int64_t value) const { return XXH3_64bits(&value, sizeof(value)); } }; template <> struct Hash : public std::unary_function { uint64_t operator()(uint64_t value) const { return XXH3_64bits(&value, sizeof(value)); } }; #endif megahit-1.2.9/src/idba/hash_graph.cpp000066400000000000000000000063401355123202700174420ustar00rootroot00000000000000/** * @file hash_graph.cpp * @brief * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-05 */ #include "idba/hash_graph.h" #include #include #include #include "bit_operation.h" #include "idba/contig_builder.h" #include "idba/contig_info.h" #include "idba/hash_graph_vertex.h" #include "idba/kmer.h" #include "idba/sequence.h" #include "utils/histgram.h" using namespace std; #include int64_t HashGraph::InsertKmers(const Sequence &seq) { if (seq.size() < kmer_size_) return 0; IdbaKmer kmer(kmer_size_); int length = 0; int64_t num_kmers = 0; for (uint64_t i = 0; i < seq.size(); ++i) { kmer.ShiftAppend(seq[i]); length = (seq[i] < 4) ? length + 1 : 0; if (length < (int)kmer_size_) continue; IdbaKmer key = kmer.unique_format(); HashGraphVertex &vertex = vertex_table_.find_or_insert(HashGraphVertex(key)); vertex.count() += 1; HashGraphVertexAdaptor adaptor(&vertex, kmer != key); if (length > (int)kmer_size_ && seq[i - kmer_size_] < 4) adaptor.in_edges().Add(3 - seq[i - kmer_size_]); if (i + 1 < seq.size() && seq[i + 1] < 4) adaptor.out_edges().Add(seq[i + 1]); ++num_kmers; } return num_kmers; } int64_t HashGraph::InsertUncountKmers(const Sequence &seq) { if (seq.size() < kmer_size_) return 0; IdbaKmer kmer(kmer_size_); int length = 0; int64_t num_kmers = 0; for (uint64_t i = 0; i < seq.size(); ++i) { kmer.ShiftAppend(seq[i]); length = (seq[i] < 4) ? 
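                                  // Bases are encoded 0-3 (A/C/G/T) and 4 for
                                  // N; `length` counts the run of valid bases
                                  // ending at i, so a k-mer is only inserted
                                  // once the window of size k contains no N.
                                  // E.g. with k = 3, ACGNTACG yields k-mers
                                  // from ACG and TACG only: the counter resets
                                  // to 0 at the N.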
length + 1 : 0; if (length < (int)kmer_size_) continue; IdbaKmer key = kmer.unique_format(); HashGraphVertex &vertex = vertex_table_.find_or_insert(HashGraphVertex(key)); HashGraphVertexAdaptor adaptor(&vertex, kmer != key); if (length > (int)kmer_size_ && seq[i - kmer_size_] < 4) adaptor.in_edges().Add(3 - seq[i - kmer_size_]); if (i + 1 < seq.size() && seq[i + 1] < 4) adaptor.out_edges().Add(seq[i + 1]); ++num_kmers; } return num_kmers; } int64_t HashGraph::Assemble(std::deque &contigs, std::deque &contig_infos) { contigs.clear(); contig_infos.clear(); AssembleFunc func(this); vertex_table_.for_each(func); contigs.swap(func.contigs()); contig_infos.swap(func.contig_infos()); ClearStatus(); return contigs.size(); } void HashGraph::AssembleFunc::operator()(HashGraphVertex &vertex) { if (!vertex.status().Lock(0)) return; ContigBuilder contig_builder; contig_builder.Append(HashGraphVertexAdaptor(&vertex)); if (!vertex.kmer().IsPalindrome()) { for (int strand = 0; strand < 2; ++strand) { HashGraphVertexAdaptor current(&vertex, strand); while (true) { HashGraphVertexAdaptor next; if (!hash_graph_->GetNextVertexAdaptor(current, next)) break; if (hash_graph_->IsPalindromeLoop(contig_builder.contig(), next)) break; if (hash_graph_->IsLoop(contig_builder.contig(), next)) return; if (!next.status().LockPreempt(0)) return; contig_builder.Append(next); current = next; } contig_builder.ReverseComplement(); } } contigs_.push_back(contig_builder.contig()); contig_infos_.push_back(contig_builder.contig_info()); } megahit-1.2.9/src/idba/hash_graph.h000066400000000000000000000100041355123202700170770ustar00rootroot00000000000000/** * @file hash_graph.h * @brief HashGraph Class. * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-05 */ #ifndef __GRAPH_HASH_GRAPH_H_ #define __GRAPH_HASH_GRAPH_H_ #include #include #include #include "bit_operation.h" #include "idba/contig_info.h" #include "idba/hash_graph_vertex.h" #include "idba/hash_table.h" #include "idba/kmer.h" #include "idba/sequence.h" #include "utils/histgram.h" class IdbaKmer; class Sequence; /** * @brief It is a hash table based de Bruijn graph implementation. */ class HashGraph { public: typedef HashTableST vertex_table_type; explicit HashGraph(uint32_t kmer_size = 0) { set_kmer_size(kmer_size); } ~HashGraph() {} HashGraphVertexAdaptor FindVertexAdaptor(const IdbaKmer &kmer) { IdbaKmer key = kmer.unique_format(); auto p = vertex_table_.find(key); return ((p != vertex_table_.end()) ? 
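    // Look-ups are canonical: the table is keyed by unique_format(), the
    // smaller of a k-mer and its reverse complement, and `kmer != key` records
    // on which strand the caller asked. Hypothetical sketch, assuming `graph`
    // is a HashGraph and `seq` a Sequence of sufficient length:
    //
    //   IdbaKmer kmer = seq.GetIdbaKmer(0, graph.kmer_size());
    //   HashGraphVertexAdaptor v = graph.FindVertexAdaptor(kmer);
    //   if (!v.is_null()) ++v.count();  // same vertex for both strands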
HashGraphVertexAdaptor(&*p, kmer != key) : HashGraphVertexAdaptor(NULL)); } int64_t InsertKmers(const Sequence &seq); int64_t InsertUncountKmers(const Sequence &seq); void ClearStatus() { ClearStatusFunc func; vertex_table_.for_each(func); } int64_t Assemble(std::deque &contigs, std::deque &contig_infos); void reserve(uint64_t capacity) { vertex_table_.reserve(capacity); } uint32_t kmer_size() const { return kmer_size_; } void set_kmer_size(uint32_t kmer_size) { kmer_size_ = kmer_size; } Histgram coverage_histgram() { CoverageHistgramFunc func; vertex_table_.for_each(func); return func.histgram(); } void swap(HashGraph &hash_graph) { if (this != &hash_graph) { vertex_table_.swap(hash_graph.vertex_table_); std::swap(kmer_size_, hash_graph.kmer_size_); } } uint64_t num_vertices() const { return vertex_table_.size(); } void clear() { vertex_table_.clear(); } private: #if __cplusplus >= 201103L HashGraph(const HashGraph &) = delete; const HashGraph &operator=(const HashGraph &) = delete; #else HashGraph(const HashGraph &); const HashGraph &operator=(const HashGraph &); #endif bool GetNextVertexAdaptor(const HashGraphVertexAdaptor ¤t, HashGraphVertexAdaptor &next) { if (current.out_edges().size() != 1) return false; IdbaKmer kmer = current.kmer(); kmer.ShiftAppend(bit_operation::BitToIndex(current.out_edges())); next = FindVertexAdaptor(kmer); return !kmer.IsPalindrome() && next.in_edges().size() == 1; } bool IsLoop(const Sequence &contig, HashGraphVertexAdaptor &next) { IdbaKmer kmer = next.kmer(); IdbaKmer rev_comp = kmer; rev_comp.ReverseComplement(); return contig.GetIdbaKmer(0, kmer_size_) == kmer; } bool IsPalindromeLoop(const Sequence &contig, HashGraphVertexAdaptor &next) { IdbaKmer kmer = next.kmer(); IdbaKmer rev_comp = kmer; rev_comp.ReverseComplement(); return contig.GetIdbaKmer(contig.size() - kmer_size_, kmer_size_) == rev_comp; } class ClearStatusFunc { public: ClearStatusFunc() {} void operator()(HashGraphVertex &vertex) { vertex.status().clear(); } }; class AssembleFunc { public: AssembleFunc(HashGraph *hash_graph) : hash_graph_(hash_graph) {} ~AssembleFunc() {} void operator()(HashGraphVertex &vertex); std::deque &contigs() { return contigs_; } std::deque &contig_infos() { return contig_infos_; } private: HashGraph *hash_graph_; std::deque contigs_; std::deque contig_infos_; }; class CoverageHistgramFunc { public: void operator()(HashGraphVertex &vertex) { histgram_.insert(vertex.count()); } const Histgram &histgram() { return histgram_; } private: Histgram histgram_; }; vertex_table_type vertex_table_; uint32_t kmer_size_; }; namespace std { inline void swap(HashGraph &x, HashGraph &y) { x.swap(y); } } // namespace std #endif megahit-1.2.9/src/idba/hash_graph_path.h000066400000000000000000000042501355123202700201210ustar00rootroot00000000000000/** * @file hash_graph_path.h * @brief HashGraphPath Class. * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.4 * @date 2011-09-21 */ #ifndef __GRAPH_HASH_GRAPH_PATH_H_ #define __GRAPH_HASH_GRAPH_PATH_H_ #include "idba/hash_graph.h" #include "idba/kmer.h" #include /** * @brief It is a path of k-mers in de Bruijn graph (HashGraph). 
*/ class HashGraphPath { public: HashGraphPath() {} HashGraphPath(const HashGraphPath &path) : vertices_(path.vertices_) {} const HashGraphPath &operator=(const HashGraphPath &path) { vertices_ = path.vertices_; return *this; } HashGraphVertexAdaptor &operator[](uint32_t index) { return vertices_[index]; } const HashGraphVertexAdaptor &operator[](uint32_t index) const { return vertices_[index]; } void Append(const HashGraphVertexAdaptor &vertex) { vertices_.push_back(vertex); } void Pop() { vertices_.pop_back(); } const HashGraphPath &ReverseComplement() { std::reverse(vertices_.begin(), vertices_.end()); for (unsigned i = 0; i < vertices_.size(); ++i) vertices_[i].ReverseComplement(); return *this; } bool IsSimplePath() const { for (unsigned i = 1; i + 1 < vertices_.size(); ++i) { if (vertices_[i].out_edges().size() != 1) return false; if (vertices_[i].in_edges().size() != 1) return false; } return true; } void swap(HashGraphPath &path) { if (this != &path) vertices_.swap(path.vertices_); } uint64_t kmer_count() { uint64_t sum = 0; for (unsigned i = 0; i < vertices_.size(); ++i) sum += vertices_[i].count(); return sum; } HashGraphVertexAdaptor &front() { return vertices_.front(); } const HashGraphVertexAdaptor &front() const { return vertices_.front(); } HashGraphVertexAdaptor &back() { return vertices_.back(); } const HashGraphVertexAdaptor &back() const { return vertices_.back(); } uint32_t size() const { if (vertices_.empty()) return 0; else return vertices_[0].kmer().size() + vertices_.size() - 1; } uint32_t num_nodes() const { return vertices_.size(); } void clear() { vertices_.clear(); } private: std::deque vertices_; }; #endif megahit-1.2.9/src/idba/hash_graph_vertex.h000066400000000000000000000122601355123202700205020ustar00rootroot00000000000000/** * @file hash_graph_vertex.h * @brief HashGraphVertex Class and HashGraphVertexAdaptor Class. * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-05 */ #ifndef __GRAPH_HASH_GRAPH_VERTEX_H_ #define __GRAPH_HASH_GRAPH_VERTEX_H_ #include #include "idba/bit_edges.h" #include "idba/bit_operation.h" #include "idba/kmer.h" #include "idba/vertex_status.h" /** * @brief It is the vertex class used in HashGraph. 
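 * Each vertex stores one canonical k-mer (its unique_format), an occurrence
 * count, a status word used for locking and dead-flagging, and two BitEdges
 * sets (one bit per base) describing the in- and out-neighbours of the
 * stored strand.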
 */
class HashGraphVertex {
 public:
  explicit HashGraphVertex(const IdbaKmer &kmer = IdbaKmer())
      : kmer_(kmer), count_(0) {}
  HashGraphVertex(const HashGraphVertex &x)
      : kmer_(x.kmer_),
        count_(x.count_),
        status_(x.status_),
        in_edges_(x.in_edges_),
        out_edges_(x.out_edges_) {}

  const HashGraphVertex &operator=(const HashGraphVertex &x) {
    kmer_ = x.kmer_;
    count_ = x.count_;
    status_ = x.status_;
    in_edges_ = x.in_edges_;
    out_edges_ = x.out_edges_;
    return *this;
  }

  void FixPalindromeEdges() {
    if (kmer_.IsPalindrome()) out_edges_ = in_edges_ = (in_edges_ | out_edges_);
  }

  const IdbaKmer &key() const { return kmer_; }
  void set_key(const IdbaKmer &key) { kmer_ = key; }

  const IdbaKmer &kmer() const { return kmer_; }
  void set_kmer(const IdbaKmer &kmer) { kmer_ = kmer; }

  int32_t &count() { return count_; }
  const int32_t &count() const { return count_; }

  VertexStatus &status() { return status_; }
  const VertexStatus &status() const { return status_; }

  BitEdges &in_edges() { return in_edges_; }
  const BitEdges &in_edges() const { return in_edges_; }

  BitEdges &out_edges() { return out_edges_; }
  const BitEdges &out_edges() const { return out_edges_; }

  void swap(HashGraphVertex &x) {
    if (this != &x) {
      kmer_.swap(x.kmer_);
      std::swap(count_, x.count_);
      status_.swap(x.status_);
      in_edges_.swap(x.in_edges_);
      out_edges_.swap(x.out_edges_);
    }
  }

  uint32_t kmer_size() const { return kmer_.size(); }

  void clear() {
    in_edges_.clear();
    out_edges_.clear();
    status_.clear();
    count_ = 0;
  }

 private:
  IdbaKmer kmer_;
  int32_t count_;
  VertexStatus status_;
  BitEdges in_edges_;
  BitEdges out_edges_;
};

/**
 * @brief It is an adaptor class used for accessing HashGraphVertex. Because
 * a k-mer and its reverse complement share the same vertex, using the adaptor
 * makes sure the access to the vertex is consistent.
 */
class HashGraphVertexAdaptor {
 public:
  explicit HashGraphVertexAdaptor(HashGraphVertex *vertex = NULL,
                                  bool is_reverse = false) {
    vertex_ = vertex;
    is_reverse_ = is_reverse;
  }
  HashGraphVertexAdaptor(const HashGraphVertexAdaptor &x) {
    vertex_ = x.vertex_;
    is_reverse_ = x.is_reverse_;
  }

  const HashGraphVertexAdaptor &operator=(const HashGraphVertexAdaptor &x) {
    vertex_ = x.vertex_;
    is_reverse_ = x.is_reverse_;
    return *this;
  }

  bool operator<(const HashGraphVertexAdaptor &x) const {
    return (vertex_ != x.vertex_) ? (vertex_ < x.vertex_)
                                  : (is_reverse_ < x.is_reverse_);
  }
  bool operator>(const HashGraphVertexAdaptor &x) const {
    return (vertex_ != x.vertex_) ? (vertex_ > x.vertex_)
                                  : (is_reverse_ > x.is_reverse_);
  }

  bool operator==(const HashGraphVertexAdaptor &x) const {
    return vertex_ == x.vertex_ && is_reverse_ == x.is_reverse_;
  }
  bool operator!=(const HashGraphVertexAdaptor &x) const {
    return vertex_ != x.vertex_ || is_reverse_ != x.is_reverse_;
  }

  const HashGraphVertexAdaptor &ReverseComplement() {
    is_reverse_ = !is_reverse_;
    return *this;
  }

  IdbaKmer kmer() const {
    IdbaKmer kmer = vertex_->kmer();
    return !is_reverse_ ? kmer : kmer.ReverseComplement();
  }

  HashGraphVertex &vertex() { return *vertex_; }
  const HashGraphVertex &vertex() const { return *vertex_; }

  void set_vertex(HashGraphVertex *vertex, bool is_reverse = false) {
    vertex_ = vertex;
    is_reverse_ = is_reverse;
  }

  int32_t &count() { return vertex_->count(); }
  const int32_t &count() const { return vertex_->count(); }

  VertexStatus &status() { return vertex_->status(); }
  const VertexStatus &status() const { return vertex_->status(); }

  BitEdges &in_edges() {
    return !is_reverse_ ? vertex_->in_edges() : vertex_->out_edges();
  }
  const BitEdges &in_edges() const {
    return !is_reverse_ ?
vertex_->in_edges() : vertex_->out_edges(); } BitEdges &out_edges() { return !is_reverse_ ? vertex_->out_edges() : vertex_->in_edges(); } const BitEdges &out_edges() const { return !is_reverse_ ? vertex_->out_edges() : vertex_->in_edges(); } void swap(HashGraphVertexAdaptor &x) { if (this != &x) { std::swap(vertex_, x.vertex_); std::swap(is_reverse_, x.is_reverse_); } } bool is_null() const { return vertex_ == NULL; } uint32_t kmer_size() const { return vertex_->kmer_size(); } void clear() { vertex_->clear(); } private: HashGraphVertex *vertex_; bool is_reverse_; }; namespace std { template <> inline void swap(HashGraphVertex &x, HashGraphVertex &y) { x.swap(y); } template <> inline void swap(HashGraphVertexAdaptor &x, HashGraphVertexAdaptor &y) { x.swap(y); } } // namespace std #endif megahit-1.2.9/src/idba/hash_table.h000066400000000000000000000406651355123202700171050ustar00rootroot00000000000000/** * @file hash_table.h * @brief HashTableST Class. * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-03 */ #ifndef __LIB_IDBA_HASH_TABLE_H_ #define __LIB_IDBA_HASH_TABLE_H_ #include #include #include #include #include #include #include "functional.h" #include "hash.h" #include "idba/pool.h" template struct HashTableSTNode { HashTableSTNode *next; T value; }; template class HashTableST; template class HashTableSTIterator; template class HashTableSTConstIterator; template class HashTableSTIterator { public: typedef Key key_type; typedef Value value_type; typedef value_type *pointer; typedef const value_type *const_pointer; typedef value_type &reference; typedef const value_type &const_reference; typedef HashTableSTNode node_type; typedef HashTableST hash_table_type; typedef std::forward_iterator_tag iterator_category; typedef HashTableSTIterator iterator; HashTableSTIterator(const hash_table_type *owner = NULL, node_type *current = NULL) : owner_(owner), current_(current) {} HashTableSTIterator(const iterator &iter) : owner_(iter.owner_), current_(iter.current_) {} const iterator &operator=(const iterator &iter) { owner_ = iter.owner_; current_ = iter.current_; return *this; } bool operator==(const iterator &iter) const { return current_ == iter.current_; } bool operator!=(const iterator &iter) const { return current_ != iter.current_; } reference operator*() const { return current_->value; } pointer operator->() const { return ¤t_->value; } const iterator &operator++() { increment(); return *this; } iterator operator++(int) { iterator tmp(*this); increment(); return tmp; } private: void increment() { if (current_ != NULL) { if (current_->next) current_ = current_->next; else { uint64_t index = owner_->bucket_index_value(current_->value); current_ = current_->next; while (current_ == NULL && ++index < owner_->bucket_count()) current_ = owner_->buckets_[index]; } } } const hash_table_type *owner_; node_type *current_; }; template class HashTableSTConstIterator { public: typedef Key key_type; typedef Value value_type; typedef value_type *pointer; typedef const value_type *const_pointer; typedef value_type &reference; typedef const value_type &const_reference; typedef HashTableSTNode node_type; typedef HashTableST hash_table_type; typedef std::forward_iterator_tag iterator_category; typedef HashTableSTConstIterator const_iterator; HashTableSTConstIterator(const hash_table_type *owner = NULL, const node_type *current = NULL) : owner_(owner), current_(current) {} HashTableSTConstIterator(const const_iterator &iter) : owner_(iter.owner_), current_(iter.current_) {} const 
const_iterator &operator=(const const_iterator &iter) { owner_ = iter.owner_; current_ = iter.current_; return *this; } bool operator==(const const_iterator &iter) const { return current_ == iter.current_; } bool operator!=(const const_iterator &iter) const { return current_ != iter.current_; } const_reference operator*() const { return current_->value; } const_pointer operator->() const { return ¤t_->value; } const const_iterator &operator++() { increment(); return *this; } const_iterator operator++(int) { const_iterator tmp(*this); increment(); return tmp; } private: void increment() { if (current_ != NULL) { if (current_->next) current_ = current_->next; else { uint64_t index = owner_->bucket_index_value(current_->value); current_ = current_->next; while (current_ == NULL && ++index < owner_->bucket_count()) current_ = owner_->buckets_[index]; } } } const hash_table_type *owner_; const node_type *current_; }; /** * @brief It is parallel hash table. All insertion/delection operations can be * done in parallel. The table size grows automatically, if the number elements * exceed the twice of the number of buckets. * * @tparam Value * @tparam Key * @tparam HashFunc */ template , typename ExtractKey = GetKey, typename EqualKey = std::equal_to> class HashTableST { public: typedef Key key_type; typedef Value value_type; typedef size_t size_type; typedef std::ptrdiff_t difference_type; typedef value_type &reference; typedef const value_type &const_reference; typedef value_type *pointer; typedef const value_type *const_pointer; typedef HashFunc hash_func_type; typedef ExtractKey get_key_func_type; typedef EqualKey key_equal_func_type; typedef HashTableSTNode node_type; typedef HashTableST hash_table_type; typedef HashTableSTIterator iterator; typedef HashTableSTConstIterator const_iterator; typedef PoolST pool_type; friend class HashTableSTIterator; friend class HashTableSTConstIterator; template friend std::ostream &operator<<( std::ostream &os, HashTableST &hash_table); template friend std::istream &operator>>( std::istream &os, HashTableST &hash_table); static const uint64_t kNumBucketLocks = (1 << 12); static const uint64_t kDefaultNumBuckets = (1 << 12); explicit HashTableST( const hash_func_type &hash = hash_func_type(), const get_key_func_type &get_key = get_key_func_type(), const key_equal_func_type &key_equal = key_equal_func_type()) : hash_(hash), get_key_(get_key), key_equal_(key_equal) { size_ = 0; rehash(kDefaultNumBuckets); } HashTableST(const hash_table_type &hash_table) : hash_(hash_table.hash_), get_key_(hash_table.get_key_), key_equal_(hash_table.key_equal_) { size_ = 0; assign(hash_table); } ~HashTableST() { clear(); } const hash_table_type &operator=(const hash_table_type &hash_table) { return assign(hash_table); } const hash_table_type &assign(const hash_table_type &hash_table) { if (this == &hash_table) return *this; clear(); rehash(hash_table.buckets_.size()); for (int64_t i = 0; i < (int64_t)hash_table.buckets_.size(); ++i) { node_type *prev = NULL; for (node_type *node = hash_table.buckets_[i]; node; node = node->next) { node_type *p = pool_.construct(); p->value = node->value; p->next = NULL; if (prev == NULL) buckets_[i] = p; else prev->next = p; prev = p; } } return *this; } iterator begin() { for (unsigned i = 0; i < buckets_.size(); ++i) { if (buckets_[i]) return iterator(this, buckets_[i]); } return iterator(); } const_iterator begin() const { for (unsigned i = 0; i < buckets_.size(); ++i) { if (buckets_[i]) return const_iterator(this, buckets_[i]); } return 
const_iterator(); } iterator end() { return iterator(); } const_iterator end() const { return const_iterator(); } std::pair insert_unique(const value_type &value) { rehash_if_needed(size_); uint64_t hash_value = hash(value); lock_bucket(hash_value); uint64_t index = bucket_index(hash_value); for (node_type *node = buckets_[index]; node; node = node->next) { if (key_equal_(get_key_(node->value), get_key_(value))) { unlock_bucket(hash_value); return std::pair(iterator(this, node), false); } } node_type *p = pool_.construct(); p->value = value; p->next = buckets_[index]; buckets_[index] = p; ++size_; unlock_bucket(hash_value); return std::pair(iterator(this, p), true); } iterator find(const key_type &key) { uint64_t index = bucket_index_key(key); for (node_type *node = buckets_[index]; node; node = node->next) { if (key_equal_(key, get_key_(node->value))) { return iterator(this, node); } } return iterator(); } const_iterator find(const key_type &key) const { uint64_t index = bucket_index_key(key); for (node_type *node = buckets_[index]; node; node = node->next) { if (key_equal_(key, get_key_(node->value))) { return const_iterator(this, node); } } return const_iterator(); } reference find_or_insert(const value_type &value) { rehash_if_needed(size_); uint64_t hash_value = hash(value); uint64_t index = bucket_index(hash_value); for (node_type *node = buckets_[index]; node; node = node->next) { if (key_equal_(get_key_(node->value), get_key_(value))) { return node->value; } } node_type *p = pool_.construct(); p->value = value; p->next = buckets_[index]; buckets_[index] = p; ++size_; return p->value; } size_type remove(const key_type &key) { uint64_t num_removed_nodes = 0; uint64_t hash_value = hash_key(key); lock_bucket(hash_value); uint64_t index = bucket_index(hash_value); node_type *prev = NULL; node_type *node = buckets_[index]; while (node) { if (key_equal_(key, get_key_(node->value))) { if (prev == NULL) buckets_[index] = node->next; else prev->next = node->next; node_type *p = node; node = node->next; pool_.destroy(p); ++num_removed_nodes; } else { prev = node; node = node->next; } } unlock_bucket(hash_value); size_ -= num_removed_nodes; return num_removed_nodes; } template size_type remove_if(const Predicator &predicator) { uint64_t num_removed_nodes = 0; for (int64_t index = 0; index < (int64_t)buckets_.size(); ++index) { lock_bucket(index); node_type *prev = NULL; node_type *node = buckets_[index]; while (node) { if (predicator(node->value)) { if (prev == NULL) buckets_[index] = node->next; else prev->next = node->next; node_type *p = node; node = node->next; pool_.destroy(p); ++num_removed_nodes; } else { prev = node; node = node->next; } } unlock_bucket(index); } size_ -= num_removed_nodes; return num_removed_nodes; } template UnaryProc &for_each(UnaryProc &op) { for (int64_t i = 0; i < (int64_t)buckets_.size(); ++i) { for (node_type *node = buckets_[i]; node; node = node->next) op(node->value); } return op; } template UnaryProc &for_each(UnaryProc &op) const { for (int64_t i = 0; i < (int64_t)buckets_.size(); ++i) { for (node_type *node = buckets_[i]; node; node = node->next) op(node->value); } return op; } uint64_t hash(const value_type &value) const { return hash_(get_key_(value)); } uint64_t bucket_index(uint64_t hash_value) const { return hash_value & (buckets_.size() - 1); } uint64_t hash_key(const key_type &key) const { return hash_(key); } uint64_t bucket_index_value(const value_type &value) const { return hash_(get_key_(value)) & (buckets_.size() - 1); } uint64_t 
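  // The number of buckets is always a power of two (rehash() enforces this),
  // so the modulo in the index functions below reduces to a bit mask:
  // hash & (buckets_.size() - 1). E.g. with 4096 buckets the mask is 0xFFF.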
bucket_index_key(const key_type &key) const { return hash_(key) & (buckets_.size() - 1); } const hash_func_type &hash_func() const { return hash_; } const get_key_func_type &get_key_func() const { return get_key_; } const key_equal_func_type &key_equal_func() const { return key_equal_; } size_type bucket_count() const { return buckets_.size(); } void reserve(size_type capacity) { rehash_if_needed(capacity); } void swap(hash_table_type &hash_table) { if (this != &hash_table) { std::swap(hash_, hash_table.hash_); std::swap(get_key_, hash_table.get_key_); std::swap(key_equal_, hash_table.key_equal_); pool_.swap(hash_table.pool_); buckets_.swap(hash_table.buckets_); std::swap(size_, hash_table.size_); } } size_type size() const { return size_; } bool empty() const { return size_ == 0; } void clear() { size_ = 0; for (int64_t i = 0; i < (int64_t)buckets_.size(); ++i) { node_type *node = buckets_[i]; while (node) { node_type *p = node; node = node->next; pool_.destroy(p); } buckets_[i] = NULL; } pool_.clear(); } private: void lock_bucket(uint64_t hash_value) {} void unlock_bucket(uint64_t hash_value) {} void rehash_if_needed(size_type capacity) { if (capacity > buckets_.size() * 2) { size_type new_num_buckets = buckets_.size(); while (capacity > new_num_buckets * 2) new_num_buckets *= 2; rehash(new_num_buckets); } } void rehash(uint64_t new_num_buckets) { if ((new_num_buckets & (new_num_buckets - 1)) != 0) throw std::logic_error("HashTableST::rehash() invalid number of buckets"); if (new_num_buckets == buckets_.size()) return; std::vector old_buckets(new_num_buckets, NULL); old_buckets.swap(buckets_); if (new_num_buckets > old_buckets.size()) { for (int64_t i = 0; i < (int64_t)old_buckets.size(); ++i) { node_type *node = old_buckets[i]; while (node) { node_type *next = node->next; uint64_t index = bucket_index_value(node->value); node->next = buckets_[index]; buckets_[index] = node; node = next; } } } else { for (int64_t i = 0; i < (int64_t)old_buckets.size(); ++i) { node_type *node = old_buckets[i]; while (node) { uint64_t hash_value = hash(node->value); uint64_t index = bucket_index(hash_value); node_type *next = node->next; node->next = buckets_[index]; buckets_[index] = node; node = next; } } } } hash_func_type hash_; get_key_func_type get_key_; key_equal_func_type key_equal_; PoolST pool_; std::vector buckets_; uint64_t size_; }; template std::istream &operator>>( std::istream &is, HashTableST &hash_table) { hash_table.clear(); Value value; while (is.read((char *)&value, sizeof(Value))) hash_table.insert_unique(value); return is; } template std::ostream &operator<<( std::ostream &os, HashTableST &hash_table) { typename HashTableST::iterator iter; for (iter = hash_table.begin(); iter != hash_table.end(); ++iter) { os.write((char *)&*iter, sizeof(Value)); } return os; } namespace std { template inline void swap(HashTableST &x, HashTableST &y) { x.swap(y); } } // namespace std #endifmegahit-1.2.9/src/idba/kmer.h000066400000000000000000000116301355123202700157370ustar00rootroot00000000000000/** * @file kmer.h * @brief IdbaKmer Class. * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-02 */ #ifndef __BASIC_KMER_H_ #define __BASIC_KMER_H_ #include #include #include #include "xxhash/xxh3.h" #include "bit_operation.h" #include "definitions.h" /** * @brief It represents a k-mer. The value of k is limited by the number of * uint64 words used. The maximum value can be calculated by max_size(). 
*/ class IdbaKmer { public: IdbaKmer() { std::memset(data_, 0, sizeof(uint64_t) * kNumUint64); } IdbaKmer(const IdbaKmer &kmer) { std::memcpy(data_, kmer.data_, sizeof(uint64_t) * kNumUint64); } explicit IdbaKmer(uint32_t size) { std::memset(data_, 0, sizeof(uint64_t) * kNumUint64); resize(size); } ~IdbaKmer() {} const IdbaKmer &operator=(const IdbaKmer &kmer) { std::memcpy(data_, kmer.data_, sizeof(uint64_t) * kNumUint64); return *this; } bool operator<(const IdbaKmer &kmer) const { for (int i = kNumUint64 - 1; i >= 0; --i) { if (data_[i] != kmer.data_[i]) return data_[i] < kmer.data_[i]; } return false; } bool operator>(const IdbaKmer &kmer) const { for (int i = kNumUint64 - 1; i >= 0; --i) { if (data_[i] != kmer.data_[i]) return data_[i] > kmer.data_[i]; } return false; } bool operator==(const IdbaKmer &kmer) const { for (unsigned i = 0; i < kNumUint64; ++i) { if (data_[i] != kmer.data_[i]) return false; } return true; } bool operator!=(const IdbaKmer &kmer) const { for (unsigned i = 0; i < kNumUint64; ++i) { if (data_[i] != kmer.data_[i]) return true; } return false; } const IdbaKmer &ReverseComplement() { uint32_t kmer_size = size(); uint32_t used_words = (kmer_size + 31) >> 5; resize(0); for (unsigned i = 0; i < used_words; ++i) bit_operation::ReverseComplement(data_[i]); for (unsigned i = 0; i < (used_words >> 1); ++i) std::swap(data_[i], data_[used_words - 1 - i]); if ((kmer_size & 31) != 0) { unsigned offset = (32 - (kmer_size & 31)) << 1; for (unsigned i = 0; i + 1 < used_words; ++i) data_[i] = (data_[i] >> offset) | data_[i + 1] << (64 - offset); data_[used_words - 1] >>= offset; } resize(kmer_size); return *this; } void ShiftAppend(uint8_t ch) { ch &= 3; uint32_t kmer_size = size(); uint32_t used_words = (kmer_size + 31) >> 5; resize(0); for (unsigned i = 0; i + 1 < used_words; ++i) data_[i] = (data_[i] >> 2) | (data_[i + 1] << 62); data_[used_words - 1] = (data_[used_words - 1] >> 2) | (uint64_t(ch) << (((kmer_size - 1) & 31) << 1)); resize(kmer_size); } void ShiftPreappend(uint8_t ch) { ch &= 3; uint32_t kmer_size = size(); uint32_t used_words = (kmer_size + 31) >> 5; resize(0); for (int i = used_words - 1; i > 0; --i) data_[i] = (data_[i] << 2) | (data_[i - 1] >> 62); data_[0] = (data_[0] << 2) | ch; if ((kmer_size & 31) != 0) data_[used_words - 1] &= (1ULL << ((kmer_size & 31) << 1)) - 1; resize(kmer_size); } bool IsPalindrome() const { IdbaKmer kmer(*this); return kmer.ReverseComplement() == *this; } uint64_t hash() const { return XXH3_64bits(data_, sizeof(data_)); } IdbaKmer unique_format() const { IdbaKmer rev_comp = *this; rev_comp.ReverseComplement(); return (*this < rev_comp ? 
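    // The canonical ("unique") form is the smaller of a k-mer and its reverse
    // complement, so both strands map to one key. E.g. for AGT the reverse
    // complement is ACT; ACT < AGT, so unique_format() of either is ACT.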
*this : rev_comp); } uint8_t operator[](uint32_t index) const { return (data_[index >> 5] >> ((index & 31) << 1)) & 3; } uint8_t get_base(uint32_t index) const { return (data_[index >> 5] >> ((index & 31) << 1)) & 3; } void set_base(uint32_t index, uint8_t ch) { ch &= 3; unsigned offset = (index & 31) << 1; data_[index >> 5] = (data_[index >> 5] & ~(3ULL << offset)) | (uint64_t(ch) << offset); } void swap(IdbaKmer &kmer) { if (this != &kmer) { for (unsigned i = 0; i < kNumUint64; ++i) std::swap(data_[i], kmer.data_[i]); } } uint32_t size() const { return data_[kNumUint64 - 1] >> (64 - kBitsForSize); } void resize(uint32_t new_size) { data_[kNumUint64 - 1] = ((data_[kNumUint64 - 1] << kBitsForSize) >> kBitsForSize) | (uint64_t(new_size) << (64 - kBitsForSize)); } void clear() { uint32_t kmer_size = size(); memset(data_, 0, sizeof(uint64_t) * kNumUint64); resize(kmer_size); } static uint32_t max_size() { return kMaxSize; } static const uint32_t kNumUint64 = kUint64PerIdbaKmerMaxK; static const uint32_t kBitsForSize = ((kNumUint64 <= 2) ? 6 : ((kNumUint64 <= 8) ? 8 : 16)); static const uint32_t kBitsForIdbaKmer = (kNumUint64 * 64 - kBitsForSize); static const uint32_t kMaxSize = kBitsForIdbaKmer / 2; private: uint64_t data_[kNumUint64]; }; namespace std { template <> inline void swap(IdbaKmer &kmer1, IdbaKmer &kmer2) { kmer1.swap(kmer2); } } // namespace std #endif megahit-1.2.9/src/idba/pool.h000066400000000000000000000065451355123202700157630ustar00rootroot00000000000000/** * @file pool.h * @brief Pool Class. * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-23 */ #ifndef __LIB_IDBA_POOL_H_ #define __LIB_IDBA_POOL_H_ #include #include #include template struct ChunkST { ChunkST(T *address = NULL, uint32_t size = 0) { this->address = address; this->size = size; } T *address; uint32_t size; }; template struct BufferST { BufferST(T *address = NULL, uint32_t size = 0, uint32_t index = 0) { this->address = address; this->size = size; this->index = index; } T *address; uint32_t size; uint32_t index; }; template > class PoolST { public: typedef T value_type; typedef value_type *pointer; typedef const value_type *const_pointer; typedef value_type &reference; typedef const value_type &const_reference; typedef Allocator allocator_type; typedef ChunkST chunk_type; typedef BufferST buffer_type; typedef PoolST pool_type; // static const uint32_t kMaxChunkSTSize = (1 << 12); // static const uint32_t kMinChunkSTSize = (1 << 12); static const uint32_t kMaxChunkSTSize = (1 << 20); static const uint32_t kMinChunkSTSize = (1 << 8); PoolST() { heads_.resize(1, (pointer)0); buffers_.resize(1, buffer_type()); chunk_size_ = kMinChunkSTSize; } ~PoolST() { clear(); } pointer allocate() { int thread_id = 0; if (heads_[thread_id] != NULL) { pointer p = heads_[thread_id]; heads_[thread_id] = *(pointer *)heads_[thread_id]; return p; } else { buffer_type &buffer = buffers_[thread_id]; if (buffer.index == buffer.size) { uint32_t size = chunk_size_; if (chunk_size_ < kMaxChunkSTSize) chunk_size_ <<= 1; pointer p = alloc_.allocate(size); chunks_.push_back(chunk_type(p, size)); buffer.address = p; buffer.size = size; buffer.index = 0; } return buffer.address + buffer.index++; } } void deallocate(pointer p) { int thread_id = 0; *(pointer *)p = heads_[thread_id]; heads_[thread_id] = p; } pointer construct() { pointer p = allocate(); new ((void *)p) value_type(); return p; } pointer construct(const_reference x) { pointer p = allocate(); new ((void *)p) value_type(x); return p; } void destroy(pointer p) { 
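    // Run the destructor in place only: the slot is not pushed onto the free
    // list here (deallocate() does that by threading it into heads_), and the
    // chunk memory itself is released back to the allocator only in clear().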
((value_type *)p)->~value_type(); } void swap(PoolST &pool) { if (this != &pool) { heads_.swap(pool.heads_); buffers_.swap(pool.buffers_); chunks_.swap(pool.chunks_); std::swap(chunk_size_, pool.chunk_size_); std::swap(alloc_, pool.alloc_); } } void clear() { for (unsigned i = 0; i < chunks_.size(); ++i) alloc_.deallocate(chunks_[i].address, chunks_[i].size); chunks_.resize(0); fill(heads_.begin(), heads_.end(), (pointer)0); fill(buffers_.begin(), buffers_.end(), buffer_type()); chunk_size_ = kMinChunkSTSize; } private: PoolST(const pool_type &); const pool_type &operator=(const pool_type &); std::vector heads_; std::vector buffers_; std::deque chunks_; uint32_t chunk_size_; allocator_type alloc_; }; namespace std { template inline void swap(PoolST &x, PoolST &y) { x.swap(y); } } // namespace std #endif megahit-1.2.9/src/idba/sequence.cpp000066400000000000000000000063541355123202700171530ustar00rootroot00000000000000/** * @file sequence.cpp * @brief * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-02 */ #include "idba/sequence.h" #include #include #include #include #include "idba/kmer.h" using namespace std; istream &operator>>(istream &is, Sequence &seq) { getline(is, seq.bases_); if (!is) return is; string line; while (is && (isalnum(is.peek()) || is.peek() == '\n') && getline(is, line)) seq.bases_ += line; is.clear(); seq.Encode(); return is; } ostream &operator<<(ostream &os, const Sequence &seq) { Sequence tmp = seq; tmp.Decode(); return os << tmp.bases_; } const Sequence &Sequence::Assign(const IdbaKmer &kmer) { bases_.resize(kmer.size()); for (unsigned i = 0; i < bases_.size(); ++i) bases_[i] = kmer[i]; return *this; } const Sequence &Sequence::ReverseComplement() { reverse(bases_.begin(), bases_.end()); for (unsigned i = 0; i < bases_.size(); ++i) { switch (bases_[i]) { case 'a': case 'A': bases_[i] = 'T'; break; case 'c': case 'C': bases_[i] = 'G'; break; case 'g': case 'G': bases_[i] = 'C'; break; case 't': case 'T': bases_[i] = 'A'; break; case 0: case 1: case 2: case 3: bases_[i] = 3 - bases_[i]; break; case 'N': case 4: break; default: throw logic_error("reverse complement error: unkown character."); } } return *this; } bool Sequence::IsValid() const { for (unsigned i = 0; i < bases_.size(); ++i) { if (!IsValid(bases_[i])) return false; } return true; } bool Sequence::IsPalindrome() const { unsigned half = (bases_.size() + 1) / 2; for (unsigned i = 0; i < half; ++i) { if (bases_[i] + bases_[bases_.size() - 1 - i] != 3) return false; } return true; } IdbaKmer Sequence::GetIdbaKmer(uint32_t offset, uint32_t kmer_size) const { IdbaKmer kmer(kmer_size); for (unsigned i = 0; i < kmer_size; ++i) kmer.set_base(i, bases_[offset + i]); return kmer; } void Sequence::Encode() { for (unsigned i = 0; i < bases_.size(); ++i) { switch (bases_[i]) { case 'a': case 'A': bases_[i] = 0; break; case 'c': case 'C': bases_[i] = 1; break; case 'g': case 'G': bases_[i] = 2; break; case 't': case 'T': bases_[i] = 3; break; case 'n': case 'N': bases_[i] = 4; break; default: bases_[i] = 4; break; // throw logic_error("encode error: unkown character."); } } } void Sequence::Decode() { for (unsigned i = 0; i < bases_.size(); ++i) { switch (bases_[i]) { case 0: bases_[i] = 'A'; break; case 1: bases_[i] = 'C'; break; case 2: bases_[i] = 'G'; break; case 3: bases_[i] = 'T'; break; case 4: bases_[i] = 'N'; break; default: throw logic_error("decode error: unkown character."); } } } ostream &WriteFasta(ostream &os, const Sequence &seq, const string &comment) { return os << ">" << comment << 
"\n" << seq << "\n"; } megahit-1.2.9/src/idba/sequence.h000066400000000000000000000102201355123202700166030ustar00rootroot00000000000000/** * @file sequence.h * @brief Sequence Class. * @author Yu Peng (ypeng@cs.hku.hk) * @version 1.0.0 * @date 2011-08-02 */ #ifndef __SEQUENCE_SEQUENCE_H_ #define __SEQUENCE_SEQUENCE_H_ #include #include #include #include #include "idba/kmer.h" /** * @brief It represents a DNA sequence ({A, C, G, T, N}) as a digit sequence * ({0, 1, 2, 3, 4}). */ class Sequence { public: friend std::istream &operator>>(std::istream &stream, Sequence &seq); friend std::ostream &operator<<(std::ostream &stream, const Sequence &seq); Sequence() {} Sequence(const Sequence &seq, int offset = 0, size_t length = std::string::npos) { Assign(seq, offset, length); } explicit Sequence(const std::string &seq, int offset = 0, size_t length = std::string::npos) { Assign(seq, offset, length); } Sequence(uint32_t num, uint8_t ch) { Assign(num, ch); } explicit Sequence(const IdbaKmer &kmer) { Assign(kmer); } ~Sequence() {} const Sequence &operator=(const Sequence &seq) { Assign(seq); return *this; } const Sequence &operator=(const std::string &seq) { Assign(seq); return *this; } const Sequence &operator=(const IdbaKmer &kmer) { Assign(kmer); return *this; } const Sequence &operator+=(const Sequence &seq) { Append(seq); return *this; } const Sequence &operator+=(uint8_t ch) { Append(ch); return *this; } bool operator==(const Sequence &seq) const { return bases_ == seq.bases_; } bool operator!=(const Sequence &seq) const { return bases_ != seq.bases_; } bool operator<(const Sequence &seq) const { return bases_ < seq.bases_; } bool operator>(const Sequence &seq) const { return bases_ > seq.bases_; } const Sequence &Assign(const Sequence &seq, int offset = 0, size_t length = std::string::npos) { if (&seq != this) bases_.assign(seq.bases_, offset, length); return *this; } const Sequence &Assign(const std::string &s, int offset = 0, size_t length = std::string::npos) { bases_.assign(s, offset, length); Encode(); return *this; } const Sequence &Assign(uint32_t num, uint8_t ch) { bases_.assign(num, ch); return *this; } const Sequence &Assign(const IdbaKmer &kmer); const Sequence &Append(const Sequence &seq, int offset = 0, size_t length = std::string::npos) { bases_.append(seq.bases_, offset, length); return *this; } const Sequence &Append(const std::string &seq, int offset = 0, size_t length = std::string::npos) { bases_.append(seq, offset, length); return *this; } const Sequence &Append(uint32_t num, uint8_t ch) { bases_.append(num, ch); return *this; } const Sequence &Append(uint8_t ch) { bases_.append(1, ch); return *this; } const Sequence &ReverseComplement(); bool IsValid() const; bool IsPalindrome() const; IdbaKmer GetIdbaKmer(uint32_t offset, uint32_t kmer_size) const; uint8_t &operator[](unsigned index) { return (uint8_t &)bases_[index]; } const uint8_t &operator[](unsigned index) const { return (uint8_t &)bases_[index]; } uint8_t get_base(uint32_t index) const { return (uint8_t)bases_[index]; } void set_base(uint32_t index, uint8_t ch) { bases_[index] = ch; } void swap(Sequence &seq) { if (this != &seq) bases_.swap(seq.bases_); } uint32_t size() const { return bases_.size(); } void resize(int new_size) { bases_.resize(new_size); } bool empty() const { return bases_.size() == 0; } void clear() { bases_.clear(); } std::string str() { Sequence tmp = *this; tmp.Decode(); return tmp.bases_; } protected: void Encode(); void Decode(); bool IsValid(char ch) const { return ch == 'A' || ch == 'C' 
megahit-1.2.9/src/idba/vertex_status.h
/**
 * @file vertex_status.h
 * @brief VertexStatus Class.
 * @author Yu Peng (ypeng@cs.hku.hk)
 * @version 1.0.0
 * @date 2011-08-05
 */

#ifndef __GRAPH_VERTEX_STATUS_H_
#define __GRAPH_VERTEX_STATUS_H_

#include <algorithm>
#include <cstdint>

/**
 * @brief It is a class for storing the status of a vertex. It provides many
 * useful functions to access the status of a vertex.
 */
class VertexStatus {
 public:
  VertexStatus() : status_(0) {}
  VertexStatus(const VertexStatus &vertex_status)
      : status_(vertex_status.status_) {}

  const VertexStatus &operator=(const VertexStatus &vertex_status) {
    status_ = vertex_status.status_;
    return *this;
  }

  void SetUsedFlag() { SetFlag(kVertexStatusFlagUsed); }
  void ResetUsedFlag() { ResetFlag(kVertexStatusFlagUsed); }
  bool IsUsed() const { return GetFlag(kVertexStatusFlagUsed); }

  void SetDeadFlag() { SetFlag(kVertexStatusFlagDead); }
  void ResetDeadFlag() { ResetFlag(kVertexStatusFlagDead); }
  bool IsDead() const { return GetFlag(kVertexStatusFlagDead); }

  int GetLockID() {
    if (status_ & kVertexStatusFlagLock) return status_ & kVertexStatusMaskLock;
    return -1;
  }

  bool Lock(int id) {
    uint16_t old_status = status_;
    if (old_status & kVertexStatusFlagLock) return false;
    status_ =
        (old_status & ~kVertexStatusMaskLock) | kVertexStatusFlagLock | id;
    return true;
  }

  bool LockPreempt(int id) {
    uint16_t old_status = status_;
    int old_id = -1;
    if (old_status & kVertexStatusFlagLock)
      old_id = old_status & kVertexStatusMaskLock;
    if (old_id >= id) return false;
    status_ =
        (old_status & ~kVertexStatusMaskLock) | kVertexStatusFlagLock | id;
    return true;
  }

  void swap(VertexStatus &vertex_status) {
    if (this != &vertex_status) std::swap(status_, vertex_status.status_);
  }

  void clear() { status_ = 0; }

  static const uint16_t kVertexStatusFlagDead = 0x8000U;
  static const uint16_t kVertexStatusFlagUsed = 0x4000U;
  static const uint16_t kVertexStatusFlagLock = 0x2000U;
  static const uint16_t kVertexStatusMaskLock = 0x1FFFU;

 private:
  bool GetFlag(uint16_t flag) const { return status_ & flag; }
  void SetFlag(uint16_t flag) { status_ |= flag; }
  void ResetFlag(uint16_t flag) { status_ &= ~flag; }

  uint16_t status_;
};

namespace std {
template <>
inline void swap(VertexStatus &x, VertexStatus &y) {
  x.swap(y);
}
}  // namespace std

#endif
megahit-1.2.9/src/iterate/
megahit-1.2.9/src/iterate/contig_flank_index.h
//
// Created by vout on 11/23/18.
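//
// Note on the flank encoding (sketch): FlankInfo below packs up to 29
// extension bases (2 bits each) into the 58-bit ext_seq field, and ext_len
// records how many of them are valid. Base j of the extension is recovered as
//   uint8_t base = (info.ext_seq >> (2 * j)) & 3u;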
//

#ifndef MEGAHIT_JUNCTION_INDEX_H
#define MEGAHIT_JUNCTION_INDEX_H

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

#include "parallel_hashmap/phmap.h"
#include "sequence/kmer_plus.h"
#include "sequence/sequence_package.h"
#include "utils/mutex.h"

template <class KmerType>
class ContigFlankIndex {
 public:
  struct FlankInfo {
    uint64_t ext_seq : 58;
    unsigned ext_len : 6;
    float mul;
  } __attribute__((packed));

  using Flank = KmerPlus<KmerType, FlankInfo>;
  using HashSet = phmap::flat_hash_set<Flank, KmerHash>;

 public:
  ContigFlankIndex(unsigned k, unsigned step) : k_(k), step_(step) {}
  size_t size() const { return hash_index_.size(); }

  void FeedBatchContigs(SeqPackage &seq_pkg, const std::vector<float> &mul) {
    SpinLock lock;
#pragma omp parallel for
    for (size_t i = 0; i < seq_pkg.seq_count(); ++i) {
      auto seq_view = seq_pkg.GetSeqView(i);
      size_t seq_len = seq_view.length();
      if (seq_len < k_ + 1) {
        continue;
      }
      for (int strand = 0; strand < 2; ++strand) {
        auto get_jth_char = [&seq_view, strand,
                             seq_len](unsigned j) -> uint8_t {
          uint8_t c = seq_view.base_at(strand == 0 ? j : (seq_len - 1 - j));
          return strand == 0 ? c : 3u ^ c;
        };
        KmerType kmer;
        for (unsigned j = 0; j < k_ + 1; ++j) {
          kmer.ShiftAppend(get_jth_char(j), k_ + 1);
        }
        if (kmer.IsPalindrome(k_ + 1)) {
          continue;
        }
        unsigned ext_len =
            std::min(static_cast<size_t>(step_ - 1), seq_len - (k_ + 1));
        uint64_t ext_seq = 0;
        for (unsigned j = 0; j < ext_len; ++j) {
          ext_seq |= uint64_t(get_jth_char(k_ + 1 + j)) << j * 2;
        }
        {
          std::lock_guard<SpinLock> lk(lock);
          auto res = hash_index_.emplace(kmer, FlankInfo{ext_seq, ext_len});
          if (!res.second) {
            auto old_len = res.first->aux.ext_len;
            auto old_seq = res.first->aux.ext_seq;
            if (old_len < ext_len ||
                (old_len == ext_len && old_seq < ext_seq)) {
              hash_index_.erase(res.first);
              res = hash_index_.emplace(kmer, FlankInfo{ext_seq, ext_len});
              assert(res.second);
            }
          }
        }
        if (seq_len == k_ + 1) {
          break;
        }
      }
    }
  }
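  // How the scan below works: every (k+1)-mer of a read is looked up in the
  // flank index in both strands; a hit marks that position and up to ext_len
  // neighbouring positions as consistent with a contig flank. A run of
  // step_+1 consecutive marked positions then emits one (k+step+1)-mer whose
  // multiplicity is the windowed average of the per-position values, roughly
  //   mul = (prefix_mul[j] - prefix_mul[j - (step_ + 1)]) / (step_ + 1);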
  template <class CollectorType>
  size_t FindNextKmersFromReads(const SeqPackage &seq_pkg,
                                CollectorType *out) const {
    std::vector<bool> kmer_exist;
    std::vector<float> kmer_mul;
    size_t num_aligned_reads = 0;
#pragma omp parallel for reduction(+ : num_aligned_reads) private(kmer_exist, kmer_mul)
    for (unsigned seq_id = 0; seq_id < seq_pkg.seq_count(); ++seq_id) {
      auto seq_view = seq_pkg.GetSeqView(seq_id);
      size_t length = seq_view.length();
      if (length < k_ + step_ + 1) {
        continue;
      }
      bool success = false;
      kmer_exist.clear();
      kmer_exist.resize(length, false);
      kmer_mul.clear();
      kmer_mul.resize(length, 0);
      Flank flank, rflank;
      auto &kmer = flank.kmer;
      auto &rkmer = rflank.kmer;
      for (unsigned j = 0; j < k_ + 1; ++j) {
        kmer.ShiftAppend(seq_view.base_at(j), k_ + 1);
      }
      rkmer = kmer;
      rkmer.ReverseComplement(k_ + 1);
      unsigned cur_pos = 0;
      while (cur_pos + k_ + 1 <= length) {
        unsigned next_pos = cur_pos + 1;
        if (!kmer_exist[cur_pos]) {
          auto iter = hash_index_.find(flank);
          if (iter != hash_index_.end()) {
            kmer_exist[cur_pos] = true;
            uint64_t ext_seq = iter->aux.ext_seq;
            unsigned ext_len = iter->aux.ext_len;
            float mul = iter->aux.mul;
            kmer_mul[cur_pos] = mul;
            for (unsigned j = 0; j < ext_len && cur_pos + k_ + 1 + j < length;
                 ++j, ++next_pos) {
              if (seq_view.base_at(cur_pos + k_ + 1 + j) ==
                  ((ext_seq >> j * 2u) & 3u)) {
                kmer_exist[cur_pos + j + 1] = true;
                kmer_mul[cur_pos + j + 1] = mul;
              } else {
                break;
              }
            }
          }
          if ((iter = hash_index_.find(rflank)) != hash_index_.end()) {
            uint64_t ext_seq = iter->aux.ext_seq;
            unsigned ext_len = iter->aux.ext_len;
            float mul = iter->aux.mul;
            kmer_mul[cur_pos] =
                kmer_exist[cur_pos] ? (kmer_mul[cur_pos] + mul) / 2 : mul;
            kmer_exist[cur_pos] = true;
            for (unsigned j = 0; j < ext_len && cur_pos >= j + 1; ++j) {
              if ((3u ^ seq_view.base_at(cur_pos - 1 - j)) ==
                  ((ext_seq >> j * 2u) & 3u)) {
                kmer_mul[cur_pos - 1 - j] =
                    kmer_exist[cur_pos - 1 - j]
                        ? (kmer_mul[cur_pos - 1 - j] + mul) / 2
                        : mul;
                kmer_exist[cur_pos - 1 - j] = true;
              } else {
                break;
              }
            }
          }
        }
        if (next_pos + k_ + 1 <= length) {
          while (cur_pos < next_pos) {
            ++cur_pos;
            uint8_t c = seq_view.base_at(cur_pos + k_);
            kmer.ShiftAppend(c, k_ + 1);
            rkmer.ShiftPreappend(3u ^ c, k_ + 1);
          }
        } else {
          break;
        }
      }
      for (unsigned j = 1; j + k_ + 1 <= length; ++j) {
        kmer_mul[j] += kmer_mul[j - 1];
      }
      typename CollectorType::kmer_type new_kmer, new_rkmer;
      for (unsigned accumulated_len = 0, j = 0, end_pos = 0; j + k_ < length;
           ++j) {
        accumulated_len = kmer_exist[j] ? accumulated_len + 1 : 0;
        if (accumulated_len >= step_ + 1) {
          unsigned target_end = j + k_ + 1;
          if (end_pos + 8 < target_end) {
            while (end_pos < target_end) {
              auto c = seq_view.base_at(end_pos);
              new_kmer.ShiftAppend(c, k_ + step_ + 1);
              new_rkmer.ShiftPreappend(3u ^ c, k_ + step_ + 1);
              end_pos++;
            }
          } else {
            if (end_pos + k_ + step_ + 1 < target_end) {
              end_pos = target_end - (k_ + step_ + 1);
            }
            while (end_pos < target_end) {
              auto c = seq_view.base_at(end_pos);
              new_kmer.ShiftAppend(c, k_ + step_ + 1);
              end_pos++;
            }
            new_rkmer = new_kmer;
            new_rkmer.ReverseComplement(k_ + step_ + 1);
          }
          float mul = (kmer_mul[j] -
                       (j >= step_ + 1 ? kmer_mul[j - (step_ + 1)] : 0)) /
                      (step_ + 1);
          assert(mul <= kMaxMul + 1);
          out->Insert(new_kmer < new_rkmer ? new_kmer : new_rkmer,
                      static_cast<mul_t>(
                          std::min(kMaxMul, static_cast<int>(mul + 0.5))));
          success = true;
        }
      }
      num_aligned_reads += success;
    }
    return num_aligned_reads;
  }

 private:
  HashSet hash_index_;
  unsigned k_{};
  unsigned step_{};
};

#endif  // MEGAHIT_JUNCTION_INDEX_H
megahit-1.2.9/src/iterate/kmer_collector.h
//
// Created by vout on 11/24/18.
//

#ifndef MEGAHIT_SPANNING_KMER_COLLECTOR_H
#define MEGAHIT_SPANNING_KMER_COLLECTOR_H

#include "parallel_hashmap/phmap.h"
#include "sdbg/sdbg_def.h"
#include "sequence/io/edge/edge_writer.h"
#include "sequence/kmer_plus.h"
#include "utils/mutex.h"

template <class KmerType>
class KmerCollector {
 public:
  using kmer_type = KmerType;
  using kmer_plus = KmerPlus<KmerType, mul_t>;
  using hash_set = phmap::parallel_flat_hash_set<
      kmer_plus, KmerHash,
      phmap::container_internal::hash_default_eq<kmer_plus>,
      phmap::container_internal::Allocator<kmer_plus>, 12, SpinLock>;

  KmerCollector(unsigned k, const std::string &out_prefix)
      : k_(k), output_prefix_(out_prefix) {
    last_shift_ = k_ % 16;
    last_shift_ = (last_shift_ == 0 ?
0 : 16 - last_shift_) * 2; words_per_kmer_ = DivCeiling(k_ * 2 + kBitsPerMul, 32); buffer_.resize(words_per_kmer_); writer_.SetFilePrefix(out_prefix); writer_.SetUnordered(); writer_.SetKmerSize(k_ - 1); writer_.InitFiles(); } void Insert(const KmerType &kmer, mul_t mul) { collection_.insert({kmer, mul}); } const hash_set &collection() const { return collection_; } void FlushToFile() { for (const auto &item : collection_) { WriteToFile(item.kmer, item.aux); } } private: void WriteToFile(const KmerType &kmer, mul_t mul) { std::fill(buffer_.begin(), buffer_.end(), 0); auto ptr = buffer_.begin(); uint32_t w = 0; for (unsigned j = 0; j < k_; ++j) { w = (w << 2) | kmer.GetBase(k_ - 1 - j); if (j % 16 == 15) { *ptr = w; w = 0; ++ptr; } } assert(ptr - buffer_.begin() < words_per_kmer_); *ptr = (w << last_shift_); assert((buffer_.back() & kMaxMul) == 0); buffer_.back() |= mul; writer_.WriteUnordered(buffer_.data()); } private: unsigned k_; std::string output_prefix_; hash_set collection_; EdgeWriter writer_; unsigned last_shift_; unsigned words_per_kmer_; std::vector buffer_; }; #endif // MEGAHIT_SPANNING_KMER_COLLECTOR_H megahit-1.2.9/src/kmlib/000077500000000000000000000000001355123202700150265ustar00rootroot00000000000000megahit-1.2.9/src/kmlib/kmbit.h000066400000000000000000000074441355123202700163160ustar00rootroot00000000000000// // Created by vout on 27/2/2018. // #ifndef KMLIB_BIT_MAGIC_H #define KMLIB_BIT_MAGIC_H #include namespace kmlib { namespace bit { using U = unsigned int; using UL = unsigned long int; using ULL = unsigned long long int; #define KMLIB_CHECK_TYPE(T) \ static_assert(std::is_integral::value, \ "only integral types are supported by popcount"); \ static_assert(sizeof(T) <= sizeof(ULL), \ "size bigger than unsigned long long not supported"); \ static_assert(!std::is_same::value, "bool type not supported") namespace internal { template struct SubSwapMask { private: static const T prev = SubSwapMask::value; public: static const T value = prev ? (prev << MaskSize * 2) | ((T(1) << MaskSize) - 1) : ((T(1) << MaskSize) - 1); }; template struct SubSwapMask { static const T value = T(0); }; template struct SwapMask { static const T value = SubSwapMask::value; }; template struct SwapMaskedBitsHelper { T operator()(T value) { value = ((value & SwapMask::value) << MaskSize) | ((value & ~SwapMask::value) >> MaskSize); return SwapMaskedBitsHelper()(value); } }; template struct SwapMaskedBitsHelper { T operator()(T value) { return ((value & SwapMask::value) << BaseSize) | ((value & ~SwapMask::value) >> BaseSize); } }; template inline T SwapMaskedBits(T value) { return SwapMaskedBitsHelper()(value); }; } // namespace internal /*! * @brief reverse an integer at a resolution base of BaseSize * @details e.g. for BaseSize = 2 and T = uint64_t, this function swap the 1st * base * with the 32th, the 2nd with the 31th and so on. The 1st base the first two * bits, and the last base is the last 2 bits * @tparam BaseSize the base size, i.e. the number of bits per base * @tparam T type of integer * @param value the value to reverse * @return the reversed value */ template inline T Reverse(T value) { KMLIB_CHECK_TYPE(T); static_assert(sizeof(T) * 8 % BaseSize == 0, "Reverse only support base size of power of 2"); return internal::SwapMaskedBits( value); // this will be optimized to bswap } /*! * @brief the same as ~Reverse(value) * @tparam BaseSize the base size, i.e. 
 * the number of bits per base
 * @tparam T the type of integer
 * @param value the value to reverse and then complement
 * @return the reverse-complemented value
 */
template <unsigned BaseSize, typename T>
inline T ReverseComplement(T value) {
  return ~Reverse<BaseSize>(value);
}

/*!
 * the value of floor(log2(x)) for a const integer x
 * @tparam number the number
 */
template <unsigned long long number>
struct FloorLog2 {
  static const int value = 1 + FloorLog2<(number >> 1)>::value;
};

template <>
struct FloorLog2<1> {
  static const int value = 0;
};

template <typename T>
inline unsigned Popcount(T val) {
  KMLIB_CHECK_TYPE(T);
  return sizeof(T) <= sizeof(U)
             ? __builtin_popcount(val)
             : sizeof(T) <= sizeof(UL)
                   ? __builtin_popcountl(val)
                   : sizeof(T) <= sizeof(ULL) ? __builtin_popcountll(val) : 0;
}

}  // namespace bit
}  // namespace kmlib

#endif  // KMLIB_BIT_MAGIC_H
megahit-1.2.9/src/kmlib/kmbitvector.h
//
// Created by vout on 28/2/2018.
//

#ifndef KMLIB_ATOMIC_BIT_VECTOR_H
#define KMLIB_ATOMIC_BIT_VECTOR_H

#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

namespace kmlib {

/*!
 * @brief Atomic bit vector: a class that represents a vector of "bits".
 * @details Update of each bit is thread-safe via set and get.
 * It can also be used as a vector of bit locks via try_lock, lock and unlock
 */
template <typename WordType = uint64_t>
class AtomicBitVector {
 public:
  using word_type = WordType;
  using size_type = typename std::vector<word_type>::size_type;

 public:
  /*!
   * @brief Constructor
   * @param size the size (number of bits) of the bit vector
   */
  explicit AtomicBitVector(size_type size = 0)
      : size_(size), data_array_((size + kBitsPerWord - 1) / kBitsPerWord) {}

  /*!
   * @brief Construct a bit vector from iterators of words
   * @tparam WordIterator iterator to access words
   * @param first the iterator pointing to the first word
   * @param last the iterator pointing to the last word
   */
  template <typename WordIterator>
  explicit AtomicBitVector(WordIterator first, WordIterator last)
      : size_((last - first) * kBitsPerWord), data_array_(first, last) {}

  /*!
   * @brief the move constructor
   */
  AtomicBitVector(AtomicBitVector &&rhs)
      : size_(rhs.size_), data_array_(std::move(rhs.data_array_)) {}

  /*!
   * @brief the move operator
   */
  AtomicBitVector &operator=(AtomicBitVector &&rhs) noexcept {
    size_ = rhs.size_;
    data_array_ = std::move(rhs.data_array_);
    return *this;
  }

  ~AtomicBitVector() = default;

  /*!
   * @return the size of the bit vector
   */
  size_type size() const { return size_; }

  /*!
   * @brief set the i-th bit to 1
   * @param i the index of the bit to be set to 1
   */
  void set(size_type i) {
    word_type mask = word_type(1) << (i % kBitsPerWord);
    data_array_[i / kBitsPerWord].v.fetch_or(mask, std::memory_order_relaxed);
  }

  /*!
   * @brief set the i-th bit to 0
   * @param i the index of the bit to be set to 0
   */
  void unset(size_type i) {
    word_type mask = ~(word_type(1) << (i % kBitsPerWord));
    data_array_[i / kBitsPerWord].v.fetch_and(mask, std::memory_order_relaxed);
  }

  /*!
   * @param i the index of the bit
   * @return value of the i-th bit
   */
  bool at(size_type i) const {
    return !!(data_array_[i / kBitsPerWord].v.load(std::memory_order_relaxed) &
              (word_type(1) << i % kBitsPerWord));
  }

  /*!
   * @param i the index of the bit
   * @return whether the i-th bit has been locked successfully
   */
  bool try_lock(size_type i) {
    word_type mask = word_type(1) << (i % kBitsPerWord);
    word_type old_val = data_array_[i / kBitsPerWord].v.fetch_or(
        mask, std::memory_order_acquire);
    return !(old_val & mask);
  }
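  // Usage sketch (illustrative): treating the vector as an array of bit
  // locks, as the class documentation above suggests:
  //   AtomicBitVector<> locks(n);  // n is a hypothetical number of slots
  //   locks.lock(i);               // spins (yielding periodically) until owned
  //   /* critical section for slot i */
  //   locks.unlock(i);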
  /*!
   * @brief lock the i-th bit
   * @param i the bit to lock
   */
  void lock(size_type i) {
    unsigned retry = 0;
    while (!try_lock(i)) {
      if (++retry > 64) {
        retry = 0;
        std::this_thread::yield();
      }
    }
    assert(at(i));
  }

  /*!
   * @brief unlock the i-th bit
   * @param i the index of the bits
   */
  void unlock(size_type i) {
    auto mask = word_type(1) << (i % kBitsPerWord);
    auto old_val = data_array_[i / kBitsPerWord].v.fetch_and(
        ~mask, std::memory_order_release);
    assert(old_val & mask);
    (void)(old_val);
  }

  /*!
   * @brief reset the size of the bit vector and clear all bits
   * @param size the new size of the bit vector
   */
  void reset(size_type size) {
    if (size == size_) {
      reset();
    } else {
      size_ = size;
      data_array_ = array_type(0);
      data_array_ = array_type((size + kBitsPerWord - 1) / kBitsPerWord, 0);
    }
  }

  void reset() { std::fill(data_array_.begin(), data_array_.end(), 0); }

  /*!
   * @brief swap with another bit vector
   * @param rhs the target to swap
   */
  void swap(AtomicBitVector &rhs) {
    std::swap(size_, rhs.size_);
    std::swap(data_array_, rhs.data_array_);
  }

 private:
  /*!
   * @brief a wrapper for std::atomic. std::atomic does not support copy and
   * move construction, so this wrapper is used to make it suitable for
   * std::vector
   * @tparam T the underlying type of the atomic struct
   */
  template <typename T>
  struct AtomicWrapper {
    std::atomic<T> v;
    AtomicWrapper(T a = T()) : v(a) {}
    AtomicWrapper(const AtomicWrapper &rhs) : v(rhs.v.load()) {}
    AtomicWrapper &operator=(const AtomicWrapper &rhs) {
      v.store(rhs.v.load());
      return *this;
    }
  };

  using array_type = std::vector<AtomicWrapper<word_type>>;

  static const unsigned kBitsPerByte = 8;
  static const unsigned kBitsPerWord = sizeof(word_type) * kBitsPerByte;

  size_type size_;
  array_type data_array_;
  static_assert(sizeof(AtomicWrapper<word_type>) == sizeof(word_type), "");
};

}  // namespace kmlib

using AtomicBitVector = kmlib::AtomicBitVector<>;

#endif  // KMLIB_ATOMIC_BIT_VECTOR_H
megahit-1.2.9/src/kmlib/kmcompactvector.h
//
// Created by vout on 11/4/18.
//

#ifndef KMLIB_COMPACTVECTOR_H
#define KMLIB_COMPACTVECTOR_H

#include <vector>

namespace kmlib {

enum { kBigEndian, kLittleEndian };

template <unsigned BaseSize, typename WordType = unsigned long,
          int Endian = kBigEndian>
class CompactVector {
  static_assert(sizeof(WordType) * 8 % BaseSize == 0,
                "BaseSize must be a power of 2 and no larger than WordType");

 public:
  using word_type = WordType;
  using size_type = typename std::vector<word_type>::size_type;
  static const unsigned kBaseSize = BaseSize;
  static const int kEndian = Endian;
  static const word_type kBaseMask = (word_type{1} << kBaseSize) - 1;
  static const unsigned kBasesPerWord = sizeof(word_type) * 8 / kBaseSize;

 public:
  class Adapter {
   private:
    word_type *word_;
    unsigned bit_offset_;

   public:
    Adapter(word_type *word, unsigned base_offset)
        : word_(word), bit_offset_(bit_shift(base_offset)) {}
    Adapter &operator=(const word_type &val) {
      *word_ &= ~(kBaseMask << bit_offset_);
      *word_ |= val << bit_offset_;
      return *this;
    }
    Adapter &operator=(const Adapter &rhs) {
      *this = word_type(rhs);
      return *this;
    }
    operator word_type() const { return *word_ >> bit_offset_ & kBaseMask; }
  };

 public:
  static size_type size_to_word_count(size_type size) {
    return (size + kBasesPerWord - 1) / kBasesPerWord;
  }
  static unsigned bit_shift(unsigned pos, unsigned len = 1) {
    return Endian == kBigEndian ?
sizeof(word_type) * 8 - (pos + len) * kBaseSize : pos * kBaseSize; } static word_type at(const word_type *v, size_type i) { return v[i / kBasesPerWord] >> bit_shift(i % kBasesPerWord) & kBaseMask; } static word_type sub_word(word_type val, unsigned pos, unsigned len = 1) { if (len == kBasesPerWord) { return val >> bit_shift(pos, len); } else { return val >> bit_shift(pos, len) & ((word_type{1} << len * kBaseSize) - 1); } } public: explicit CompactVector(size_type size = 0) : size_(size), underlying_vec_(size_to_word_count(size)), vec_ptr_(&underlying_vec_) {} explicit CompactVector(std::vector *v, size_type size = 0) : size_(size), underlying_vec_(0), vec_ptr_(v) { resize(size); } ~CompactVector() = default; size_type size() const { return size_; } size_type word_count() const { return vec_ptr_->size(); } void clear() { size_ = 0; vec_ptr_->clear(); } size_type capacity() const { return vec_ptr_->capacity() * kBasesPerWord; } size_type word_capacity() const { return vec_ptr_->capacity(); } const word_type *data() const { return vec_ptr_->data(); } word_type *data() { return vec_ptr_->data(); } word_type operator[](size_type i) const { return at(vec_ptr_->data(), i); } Adapter operator[](size_type i) { return Adapter(vec_ptr_->data() + i / kBasesPerWord, i % kBasesPerWord); } void reserve(size_type size) { vec_ptr_->reserve(size_to_word_count(size)); } void resize(size_type size) { size_type old_size = size_; size_ = size; vec_ptr_->resize(size_to_word_count(size), 0); if (size_ < old_size && size_ > 0 && size_ % kBasesPerWord != 0) { vec_ptr_->back() = sub_word(vec_ptr_->back(), 0, size_ % kBasesPerWord) << bit_shift(0, size_ % kBasesPerWord); } } void push_back(word_type val) { if (size_ % kBasesPerWord == 0) { vec_ptr_->emplace_back(val << bit_shift(0)); } else { vec_ptr_->back() |= val << bit_shift(size_ % kBasesPerWord); } ++size_; } void push_word(word_type val, unsigned pos, unsigned len) { unsigned pos_in_back = size_ % kBasesPerWord; unsigned remaining = kBasesPerWord - pos_in_back; if (pos_in_back == 0) { vec_ptr_->emplace_back(sub_word(val, pos, len) << bit_shift(0, len)); } else if (remaining < len) { vec_ptr_->back() |= sub_word(val, pos, remaining) << bit_shift(pos_in_back, remaining); vec_ptr_->emplace_back(sub_word(val, pos + remaining, len - remaining) << bit_shift(0, len - remaining)); } else { vec_ptr_->back() |= sub_word(val, pos, len) << bit_shift(pos_in_back, len); } size_ += len; } void push_word(word_type val, unsigned pos = 0) { push_word(val, pos, kBasesPerWord - pos); } void pop_back() { (*this)[size_ - 1] = 0; --size_; } private: size_type size_; std::vector underlying_vec_; std::vector *vec_ptr_; }; } // namespace kmlib #endif // KMLIB_COMPACTVECTOR_H megahit-1.2.9/src/kmlib/kmrns.h000066400000000000000000000340561355123202700163410ustar00rootroot00000000000000// // Created by vout on 3/3/2018. 
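//
// Background note (sketch): for a packed string s, rank(c, pos) counts the
// occurrences of character c in s[0..pos], and select(c, r) returns the
// position of the r-th c (0-based). When s[0..pos] contains at least one c,
//   select(c, rank(c, pos) - 1) <= pos
// holds, which is how the pred()/succ() helpers further down are implemented.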
// #ifndef KMLIB_RNS_H #define KMLIB_RNS_H #include #include #include #include #include namespace kmlib { namespace internal { template struct SubPopcountMask { static_assert(Index % BaseSize == 0, ""); static const T value = (SubPopcountMask::value << BaseSize) | 1ULL; }; template struct SubPopcountMask { static const T value = 0; }; template struct PopcountMask { static const T value = SubPopcountMask::value; }; template inline T PackToLowestBit(T value) { if (Index == 0) { return value; } value &= value >> Index; return PackToLowestBit(value); } using U = unsigned int; using UL = unsigned long int; using ULL = unsigned long long int; #define CHECK_TYPE(T) \ static_assert(std::is_integral::value, \ "only integral types are supported by popcount"); \ static_assert(sizeof(T) <= sizeof(ULL), \ "size bigger than unsigned long long not supported"); \ static_assert(!std::is_same::value, "bool type not supported") template inline unsigned Popcount(T val) { CHECK_TYPE(T); return sizeof(T) <= sizeof(U) ? __builtin_popcount(val) : sizeof(T) <= sizeof(UL) ? __builtin_popcountl(val) : sizeof(T) <= sizeof(ULL) ? __builtin_popcountll(val) : 0; } template inline unsigned Ctz(T val) { CHECK_TYPE(T); return sizeof(T) <= sizeof(U) ? __builtin_ctz(val) : sizeof(T) <= sizeof(UL) ? __builtin_ctzl(val) : sizeof(T) <= sizeof(ULL) ? __builtin_ctzll(val) : 0; } template inline Tx Pdep(Tx x, Ty y) { CHECK_TYPE(Tx); return sizeof(Tx) <= sizeof(U) ? _pdep_u32(x, y) : sizeof(Tx) <= sizeof(ULL) ? _pdep_u64(x, y) : 0; } #undef CHECK_TYPE } // namespace internal enum rnsmode { kRankOnly, kRandAndSelect }; template class RankAndSelect { public: static const unsigned kBitsPerByte = 8; static const unsigned kBitsPerWord = sizeof(TWord) * kBitsPerByte; static const unsigned kBitsPerBase = BaseSize; static const unsigned kAlphabetSize = AlphabetSize; static const unsigned kBasesPerL1 = BasePerL1Interval; static const unsigned kBasesPerL2 = BasePerL2Interval; static const unsigned kSelectSampleSize = SelectSampleSize; static const unsigned kL1PerL2 = kBasesPerL2 / kBasesPerL1; static const unsigned kBasesPerWord = kBitsPerWord / kBitsPerBase; using size_type = int64_t; using word_type = TWord; static const size_type kNullID = static_cast(-1); RankAndSelect() { for (unsigned i = kBitsPerBase == 1 ? 1 : 0; i < kAlphabetSize; ++i) { xor_masks_[i] = 0; for (unsigned j = 0; j < kBasesPerWord; ++j) { xor_masks_[i] |= (word_type)i << (kBitsPerBase * j); } xor_masks_[i] = ~xor_masks_[i]; } } ~RankAndSelect() = default; void from_packed_array(const word_type *packed_array, size_type size) { size_type num_l1 = DivCeiling(size, kBasesPerL1) + 1; size_type num_l2 = DivCeiling(size, kBasesPerL2) + 1; for (unsigned c = BaseSize == 1 ? 
1 : 0; c < kAlphabetSize; ++c) { const word_type *cur_word = packed_array; size_type count = 0; l2_occ_[c] = std::vector(num_l2); l1_occ_[c] = std::vector(num_l1); size_type size_rd = size - size % kBasesPerL1; for (size_type i = 0; i < size_rd; i += kBasesPerL1, cur_word += kBasesPerL1 / kBasesPerWord) { if (i % kBasesPerL2 == 0) { l2_occ_[c][i / kBasesPerL2] = count; } l1_occ_[c][i / kBasesPerL1] = count - l2_occ_[c][i / kBasesPerL2]; count += CountCharInWords(c, cur_word, kBasesPerL1 / kBasesPerWord); } for (size_type i = size_rd; i < size; i += kBasesPerWord, ++cur_word) { if (i % kBasesPerL1 == 0) { if (i % kBasesPerL2 == 0) { l2_occ_[c][i / kBasesPerL2] = count; } l1_occ_[c][i / kBasesPerL1] = count - l2_occ_[c][i / kBasesPerL2]; } count += CountCharInWord(c, *cur_word); } l2_occ_[c][num_l2 - 1] = count; l1_occ_[c][num_l1 - 1] = count - l2_occ_[c][(num_l1 - 1) / kL1PerL2]; char_count_[c] = count; if (Mode != rnsmode::kRankOnly) { rank2itv_[c].reserve(DivCeiling(count, kSelectSampleSize) + 1); for (size_type i = 0; i < num_l1; ++i) { while (static_cast(rank2itv_[c].size() * kSelectSampleSize) < OccValue(c, i)) { rank2itv_[c].push_back(i - 1); } } rank2itv_[c].push_back(num_l1 - 1); } } packed_array_ = packed_array; this->size_ = size; } size_type rank(size_type pos) const { static_assert(BaseSize == 1, ""); return InternalRank(1, pos); } size_type rank(uint8_t c, size_type pos) const { static_assert(BaseSize != 1, ""); return InternalRank(c, pos); } size_type select(size_type ranking) const { static_assert(BaseSize == 1, ""); return InternalSelect(1, ranking); } size_type select(uint8_t c, size_type ranking) const { static_assert(BaseSize != 1, ""); return InternalSelect(c, ranking); } size_type pred(uint8_t c, size_type pos) const { // the last c in [0...pos] if (GetBaseAt(pos) == c) { return pos; } return InternalSelect(c, InternalRank(c, pos) - 1); } size_type pred(size_type pos) const { static_assert(BaseSize == 1, ""); return pred(1, pos); } size_type pred(uint8_t c, size_type pos, int step) const { // the last c in [pos-step, pos], return pos-step-1 if not exist size_type end = pos >= step ? 
pos - step : 0; while (pos >= end) { if (GetBaseAt(pos) == c) { return pos; } --pos; } return pos; } size_type succ(uint8_t c, size_type pos) const { // the first c in [pos...ReadLength] if (GetBaseAt(pos) == c) { return pos; } return InternalSelect(c, InternalRank(c, pos - 1)); } size_type succ(size_type pos) const { return succ(1, pos); } size_type succ(uint8_t c, size_type pos, int step) const { // the first c in [pos, pos+step], return pos+step+1 if not exist size_type end = pos + step; if (end >= size_) end = size_; while (pos <= end) { if (GetBaseAt(pos) == c) { return pos; } ++pos; } return pos; } private: unsigned CountCharInWord(uint8_t c, word_type x, word_type mask = word_type(-1)) const { if (BaseSize != 1) { x ^= xor_masks_[c]; x = internal::PackToLowestBit(x); x &= kPopcntMask; } return internal::Popcount(x & mask); } unsigned CountCharInWords(uint8_t c, const word_type *ptr, unsigned n_words) const { unsigned count = 0; for (unsigned i = 0; i < n_words; ++i) { count += CountCharInWord(c, ptr[i]); } return count; } unsigned SelectInWord(uint8_t c, int num_c, word_type x) const { if (BaseSize != 1) { x ^= xor_masks_[c]; x = internal::PackToLowestBit(x); x &= kPopcntMask; } #if defined(__BMI2__) && defined(USE_BMI2) return internal::Ctz(internal::Pdep(word_type(1) << (num_c - 1), x)) / kBitsPerBase; #else unsigned trailing_zeros = 0; while (num_c > 0) { trailing_zeros = internal::Ctz(x); x ^= TWord{1} << trailing_zeros; --num_c; } return trailing_zeros / kBitsPerBase; #endif } void PrefetchOcc(uint8_t c, int64_t i) const { __builtin_prefetch(&l2_occ_[c][i / kL1PerL2], 0); __builtin_prefetch(&l1_occ_[c][i], 0); } size_type OccValue(uint8_t c, size_type i) const { return l2_occ_[c][i / kL1PerL2] + l1_occ_[c][i]; } private: size_type InternalRank(uint8_t c, size_type pos) const { // the number of c's in [0...pos] if (pos >= size_) return kNullID; else if (pos == size_ - 1) return char_count_[c]; ++pos; size_type itv_idx = (pos + kBasesPerL1 / 2 - 1) / kBasesPerL1; size_type sampled_pos = itv_idx * kBasesPerL1; if (sampled_pos >= size_) { sampled_pos -= kBasesPerL1; itv_idx--; } PrefetchOcc(c, itv_idx); if (sampled_pos > pos) { return RankFwd(c, itv_idx, sampled_pos, sampled_pos - pos); } else if (sampled_pos < pos) { return RankBwd(c, itv_idx, sampled_pos, pos - sampled_pos); } else { return OccValue(c, itv_idx); } } size_type InternalSelect(uint8_t c, size_type k) const { static_assert(Mode != rnsmode::kRankOnly, "cannot select on rank only struct"); // return the pos (0-based) of the kth (0-based) c if (k > char_count_[c]) return kNullID; else if (k == char_count_[c]) return size_; // first locate which interval Select(c, k) falls size_type interval_l = rank2itv_[c][k / kSelectSampleSize]; size_type interval_r = rank2itv_[c][DivCeiling(k, kSelectSampleSize)]; PrefetchOcc(c, interval_l); while (interval_r > interval_l) { size_type interval_m = (interval_r + interval_l + 1) / 2; if (OccValue(c, interval_m) > k) { interval_r = interval_m - 1; } else { interval_l = interval_m; } } // refined select __builtin_prefetch(packed_array_ + interval_l * kBasesPerL1 / kBasesPerWord); unsigned remain = k + 1 - OccValue(c, interval_l); unsigned exceed = (interval_l + 1) * kBasesPerL1 >= size_ ? 
kBasesPerL1 : (OccValue(c, interval_l + 1) - (k + 1)); if (remain <= exceed * 2) { return SelectFwd(c, interval_l, remain); } else { return SelectBwd(c, interval_l, exceed); } } size_type RankFwd(uint8_t c, TInterval itv, size_type sampled_pos, unsigned n_bases) const { unsigned n_words = n_bases / kBasesPerWord; const word_type *p = packed_array_ + sampled_pos / kBasesPerWord - n_words - 1; __builtin_prefetch(p); unsigned n_residual = n_bases % kBasesPerWord; unsigned count = 0; if (n_residual != 0) { word_type mask = 1 + ~(1ULL << kBitsPerBase * (kBasesPerWord - n_residual)); count += CountCharInWord(c, p[0], mask); } count += CountCharInWords(c, p + 1, n_words); return OccValue(c, itv) - count; } size_type RankBwd(uint8_t c, TInterval itv, size_type sampled_pos, unsigned n_bases) const { const word_type *p = packed_array_ + sampled_pos / kBasesPerWord; __builtin_prefetch(p); unsigned n_words = n_bases / kBasesPerWord; unsigned n_residual = n_bases % kBasesPerWord; unsigned count = 0; count += CountCharInWords(c, p, n_words); if (n_residual != 0) { word_type mask = (1ULL << kBitsPerBase * n_residual) - 1; count += CountCharInWord(c, p[n_words], mask); } return OccValue(c, itv) + count; } size_type SelectFwd(uint8_t c, TInterval itv, unsigned remain) const { size_type pos = static_cast(itv) * kBasesPerL1; const word_type *begin = packed_array_ + pos / kBasesPerWord; const word_type *p = begin; for (unsigned popcnt; (popcnt = CountCharInWord(c, *p)) < remain; remain -= popcnt, ++p) ; return pos + (p - begin) * kBasesPerWord + SelectInWord(c, remain, *p); } size_type SelectBwd(uint8_t c, TInterval itv_l, unsigned exceed) const { size_type pos = (static_cast(itv_l) + 1) * kBasesPerL1; const word_type *end = packed_array_ + pos / kBasesPerWord - 1; const word_type *p = end; unsigned popcnt; for (; (popcnt = CountCharInWord(c, *p)) <= exceed; exceed -= popcnt, --p) ; return pos - kBasesPerWord * (end - p) - (kBasesPerWord - SelectInWord(c, popcnt - exceed, *p)); } uint8_t GetBaseAt(size_type i) const { return (packed_array_[i / kBasesPerWord] >> (i % kBasesPerWord * kBitsPerBase)) & ((1 << kBitsPerBase) - 1); } template static T1 DivCeiling(T1 x, T2 y) { return (x + y - 1) / y; }; static const word_type kPopcntMask = internal::PopcountMask::value; static const uint64_t kPopcntMask64 = internal::PopcountMask::value; size_type size_{}; size_type char_count_[kAlphabetSize]{}; // main memory for the structure const word_type *packed_array_; // sampled structure for rank and select // two level sampling for rank (occ value) // call the function OccValue(c, i) to get the number of c's // in packed_array_[0...i*kBasesPerL1-1] // sampling for select // rank_interval_lookup_[c][i]=j: the jth interval (0 based) // contains the (i*kSelectSampleSize)th (0 based) c // i.e. OccValue(c, j)<=i*kSelectSampleSize and OccValue(c, // j+1)>i*kSelectSampleSize std::vector rank2itv_[kAlphabetSize]; std::vector l1_occ_[kAlphabetSize]; // level 1 OCC std::vector l2_occ_[kAlphabetSize]; // level 2 OCC word_type xor_masks_[kAlphabetSize]; // e.g. 
if c = 0110(2), popcount_xorers_[c] = 1001 1001 1001 1001...(2), // to make all c's in a word 1111 static_assert((1ull << kBitsPerBase) >= kAlphabetSize, ""); static_assert(kBitsPerWord % kBitsPerBase == 0, ""); static_assert(kBitsPerBase <= 8, ""); static_assert(kBasesPerL2 <= 65536, ""); }; } // namespace kmlib #endif // KMLIB_RNS_H megahit-1.2.9/src/kmlib/kmsort.h000066400000000000000000000132501355123202700165170ustar00rootroot00000000000000// // Created by vout on 7/3/2018. // #ifndef KMLIB_KMSORT_H #define KMLIB_KMSORT_H #include #include namespace kmlib { namespace kmsortconst { static const int kRadixBits = 8; static const int kRadixMask = (1 << kRadixBits) - 1; static const int kNumBins = 1 << kRadixBits; static const int kInsertSortThreshold = 64; } // namespace kmsortconst namespace internal { template inline void insert_sort_core(RandomIt s, RandomIt e, RadixTraits rt) { for (RandomIt i = s + 1; i != e; ++i) { if (rt(*i, *(i - 1))) { RandomIt j; ValueType tmp = *i; *i = *(i - 1); for (j = i - 1; j > s && rt(tmp, *(j - 1)); --j) { *j = *(j - 1); } *j = tmp; } } } template inline int kth_byte(const T &val, RadixTrait rt, int byte_index) { return ByteIndex >= ByteIndexEnd ? rt.kth_byte(val, ByteIndex) : rt.kth_byte(val, byte_index); } template inline void radix_sort_core(RandomIt s, RandomIt e, RadixTrait rt, int byte_index) { RandomIt last_[kmsortconst::kNumBins + 1]; RandomIt *last = last_ + 1; size_t count[kmsortconst::kNumBins] = {0}; for (RandomIt i = s; i < e; ++i) { ++count[kth_byte( *i, rt, byte_index)]; } last_[0] = last_[1] = s; for (int i = 1; i < kmsortconst::kNumBins; ++i) { last[i] = last[i - 1] + count[i - 1]; } for (int i = 0; i < kmsortconst::kNumBins; ++i) { RandomIt end = last[i - 1] + count[i]; if (end == e) { last[i] = e; break; } while (last[i] != end) { ValueType swapper = *last[i]; int tag = kth_byte( swapper, rt, byte_index); if (tag != i) { do { std::swap(swapper, *last[tag]++); } while ( (tag = kth_byte( swapper, rt, byte_index)) != i); *last[i] = swapper; } ++last[i]; } } if (ByteIndex > ByteIndexEnd) { const int kNextIndex = ByteIndex > 0 ? ByteIndex - 1 : 0; for (int i = 0; i < kmsortconst::kNumBins; ++i) { if (count[i] > kmsortconst::kInsertSortThreshold) { radix_sort_core(last[i - 1], last[i], rt, kNextIndex); } else if (count[i] > 1) { insert_sort_core(last[i - 1], last[i], rt); } } } else if (byte_index > 0) { for (int i = 0; i < kmsortconst::kNumBins; ++i) { if (count[i] > kmsortconst::kInsertSortThreshold) { radix_sort_core( last[i - 1], last[i], rt, byte_index - 1); } else if (count[i] > 1) { insert_sort_core(last[i - 1], last[i], rt); } } } } template inline void radix_sort_entry(RandomIt s, RandomIt e, ValueType *, RadixTraits radix_traits) { if (std::distance(s, e) <= 1) { return; } else if (std::distance(s, e) <= kmsortconst::kInsertSortThreshold) { insert_sort_core(s, e, radix_traits); } else { const int kByteIndexEnd = RadixTraits::n_bytes > 8 ? RadixTraits::n_bytes - 8 : 0; radix_sort_core(s, e, radix_traits, RadixTraits::n_bytes - 1); } } } // namespace internal /*! * @brief the default radix trait for kmsort * requires the class T implements operator<(), kth_byte() and n_bytes interface * @tparam T type of the items to be sorted */ template ::value> struct RadixTraits { static const int n_bytes = T::n_bytes; int kth_byte(const T &x, int k) { return x.kth_byte(k); } bool operator()(const T &x, const T &y) const { return x < y; } }; /*! 
* @brief the default radix trait for integers * @tparam T the type of integer */ template struct RadixTraits { static const int n_bytes = sizeof(T); static const T kMSBMask = T(T(-1) < 0) << (sizeof(T) * 8 - 1); int kth_byte(const T &x, int k) { return ((x ^ kMSBMask) >> (kmsortconst::kRadixBits * k)) & kmsortconst::kRadixMask; } bool operator()(const T &x, const T &y) const { return x < y; } }; /*! * @brief sort [s, e) using explicitly defined radix traits * @tparam RandomIt the type of random iterator * @tparam RadixTraits the type of radix traits * @param s the first position * @param e the last position * @param radix_traits explicitly defined radix traits */ template inline void kmsort(RandomIt s, RandomIt e, RadixTraits radix_traits) { typename std::iterator_traits::value_type *_(0); internal::radix_sort_entry(s, e, _, radix_traits); } /*! * @brief sort [s, e) using default RadixTraits * @tparam RandomIt the type of random iterator * @param s the first position * @param e the last position */ template inline void kmsort(RandomIt s, RandomIt e) { using ValueType = typename std::iterator_traits::value_type; kmsort(s, e, RadixTraits()); } } // namespace kmlib #endif // KMLIB_KMSORT_Hmegahit-1.2.9/src/kmlib/test_compactvector.cpp000066400000000000000000000014271355123202700214460ustar00rootroot00000000000000// // Created by vout on 11/4/18. // #include "kmcompactvector.h" #include template void print_v(const kmlib::CompactVector &v) { std::cout << v.size() << ", " << v.capacity() << ": "; for (size_t i = 0; i < v.size(); ++i) { std::cout << v[i] << ' '; } std::cout << std::endl; } int main() { kmlib::CompactVector<2> v; for (size_t i = 0; i < 17; ++i) { v.push_back(i % 4); print_v(v); v[i] = 3 - v[i]; print_v(v); } while (v.size() > 0) { v.pop_back(); print_v(v); } kmlib::CompactVector<4> v4; for (size_t i = 0; i < 17; ++i) { v4.push_back(i % 16); print_v(v4); v4[i] = 15 - v4[i]; print_v(v4); } while (v4.size() > 0) { v4.pop_back(); print_v(v4); } return 0; }megahit-1.2.9/src/kmlib/test_kmbit.cpp000066400000000000000000000020141355123202700176740ustar00rootroot00000000000000// // Created by vout on 11/9/18. // #include #include #include #include "kmbit.h" #include "kmcompactvector.h" int main() { uint32_t a = 24128923; uint64_t b = 131231231123123ULL; auto rev_a1 = kmlib::bit::Reverse<1>(a); auto rev_a = kmlib::bit::Reverse<2>(a); auto rev_b = kmlib::bit::Reverse<4>(b); std::cout << std::bitset<32>(a) << std::endl; std::cout << std::bitset<32>(rev_a1) << std::endl; for (size_t i = 0; i < 32; ++i) { auto b1 = kmlib::CompactVector<1, uint32_t>::at(&a, i); auto b2 = kmlib::CompactVector<1, uint32_t>::at(&rev_a1, 31 - i); assert(b1 == b2); } for (size_t i = 0; i < 16; ++i) { auto b1 = kmlib::CompactVector<2, uint32_t>::at(&a, i); auto b2 = kmlib::CompactVector<2, uint32_t>::at(&rev_a, 15 - i); assert(b1 == b2); } for (size_t i = 0; i < 16; ++i) { auto b1 = kmlib::CompactVector<4, uint64_t>::at(&b, i); auto b2 = kmlib::CompactVector<4, uint64_t>::at(&rev_b, 15 - i); assert(b1 == b2); } return 0; }megahit-1.2.9/src/localasm/000077500000000000000000000000001355123202700155235ustar00rootroot00000000000000megahit-1.2.9/src/localasm/hash_mapper.cpp000066400000000000000000000222601355123202700205200ustar00rootroot00000000000000// // Created by vout on 7/10/19. 
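//
// Note on the seed encoding (sketch): a k-mer hit on a contig is packed as
//   code = (uint64_t(contig_id) << 32) | (contig_offset << 1) | strand;
// and the top bit (1ULL << 63) is set when the same k-mer occurs more than
// once, so that only unique seeds are later used to anchor an alignment.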
// #include "hash_mapper.h" #include "sequence/io/contig/contig_reader.h" #include "utils/utils.h" namespace { inline uint64_t EncodeContigOffset(uint32_t contig_id, uint32_t contig_offset, uint8_t strand) { return (uint64_t(contig_id) << 32) | (contig_offset << 1) | strand; } inline void DecodeContigOffset(uint64_t code, uint32_t &contig_id, uint32_t &contig_offset, uint8_t &strand) { contig_id = code >> 32; contig_offset = (code & 0xFFFFFFFFULL) >> 1; strand = code & 1ULL; } inline uint32_t GetWord(const uint32_t *first_word, uint32_t first_shift, int from, int len, bool strand) { int from_word_idx = (first_shift + from) / 16; int from_word_shift = (first_shift + from) % 16; uint32_t ret = *(first_word + from_word_idx) << from_word_shift * 2; assert(len <= 16); if (16 - from_word_shift < len) { ret |= *(first_word + from_word_idx + 1) >> (16 - from_word_shift) * 2; } if (len < 16) { ret >>= (16 - len) * 2; ret <<= (16 - len) * 2; } if (strand == 1) { ret = kmlib::bit::ReverseComplement<2>(ret); ret <<= (16 - len) * 2; } return ret; } inline int Mismatch(uint32_t x, uint32_t y) { x ^= y; x |= x >> 1; x &= 0x55555555U; return kmlib::bit::Popcount(x); } } // namespace void HashMapper::LoadAndBuild(const std::string &contig_file, int32_t min_len, int32_t seed_kmer_size, int32_t sparsity) { seed_kmer_size_ = seed_kmer_size; ContigReader reader(contig_file); reader.SetMinLen(min_len)->SetDiscardFlag(contig_flag::kLoop); auto sizes = reader.GetNumContigsAndBases(); refseq_.Clear(); refseq_.ReserveSequences(sizes.first); refseq_.ReserveBases(sizes.second); bool contig_reverse = false; reader.ReadAll(&refseq_, contig_reverse); size_t sz = refseq_.seq_count(); size_t n_kmers = 0; #pragma omp parallel for reduction(+ : n_kmers) for (size_t i = 0; i < sz; ++i) { n_kmers += (refseq_.GetSeqView(i).length() - seed_kmer_size + sparsity) / sparsity; } index_.reserve(n_kmers); SpinLock mapper_lock; #pragma omp parallel for for (size_t i = 0; i < sz; ++i) { TKmer key; auto contig_view = refseq_.GetSeqView(i); for (int j = 0, len = contig_view.length(); j + seed_kmer_size <= len; j += sparsity) { auto ptr_and_offset = contig_view.raw_address(j); key.InitFromPtr(ptr_and_offset.first, ptr_and_offset.second, seed_kmer_size); auto kmer = key.unique_format(seed_kmer_size); auto offset = EncodeContigOffset(contig_view.id(), j, key != kmer); std::lock_guard lk(mapper_lock); auto res = index_.emplace(kmer, offset); if (!res.second) { res.first->second |= 1ULL << 63; } } } xinfo("Number of contigs: {}, index size: {}\n", refseq_.seq_count(), index_.size()); } int32_t HashMapper::Match(const SeqPackage::SeqView &seq_view, int query_from, int query_to, size_t contig_id, int ref_from, int ref_to, bool strand) const { auto query_ptr_and_offset = seq_view.raw_address(); const uint32_t *query_first_word = query_ptr_and_offset.first; int query_shift = query_ptr_and_offset.second; auto contig_view = refseq_.GetSeqView(contig_id); auto ref_ptr_and_offset = contig_view.raw_address(); const uint32_t *ref_first_word = ref_ptr_and_offset.first; int ref_shift = ref_ptr_and_offset.second; int match_len = query_to - query_from + 1; int threshold = lround(similarity_ * match_len); for (int i = query_from; i <= query_to; i += 16) { int len = std::min(16, query_to - i + 1); uint32_t qw = GetWord(query_first_word, query_shift, i, len, 0); int ref_i = strand == 0 ? 
ref_from + i - query_from : ref_to - (i + len - 1 - query_from); uint32_t rw = GetWord(ref_first_word, ref_shift, ref_i, len, strand); match_len -= Mismatch(qw, rw); if (match_len < threshold) { return 0; } } return match_len; } MappingRecord HashMapper::TryMap(const SeqPackage::SeqView &seq_view) const { MappingRecord bad_record; bad_record.valid = false; int len = seq_view.length(); if (len < seed_kmer_size_ || len < 50) return bad_record; // too short reads not reliable // small vector optimization static const int kArraySize = 3; std::array mapping_records; int n_mapping_records = 0; std::unique_ptr> v_mapping_records; auto ptr_and_offset = seq_view.raw_address(); TKmer kmer_f(ptr_and_offset.first, ptr_and_offset.second, seed_kmer_size_); TKmer kmer_r = kmer_f; kmer_r.ReverseComplement(seed_kmer_size_); for (int i = seed_kmer_size_ - 1; i < len; ++i) { if (i >= seed_kmer_size_) { uint8_t ch = seq_view.base_at(i); kmer_f.ShiftAppend(ch, seed_kmer_size_); kmer_r.ShiftPreappend(3 - ch, seed_kmer_size_); } uint8_t query_strand = kmer_f.cmp(kmer_r, seed_kmer_size_) <= 0 ? 0 : 1; auto iter = index_.find(query_strand == 0 ? kmer_f : kmer_r); if (iter == index_.end() || (iter->second >> 63) != 0) { continue; } uint32_t contig_id, contig_offset; uint8_t contig_strand; DecodeContigOffset(iter->second, contig_id, contig_offset, contig_strand); auto contig_view = refseq_.GetSeqView(contig_id); assert(contig_offset < contig_view.length()); uint8_t mapping_strand = contig_strand ^ query_strand; int32_t contig_from = mapping_strand == 0 ? contig_offset - (i - seed_kmer_size_ + 1) : contig_offset - (len - 1 - i); int32_t contig_to = mapping_strand == 0 ? contig_offset + seed_kmer_size_ - 1 + len - 1 - i : contig_offset + i; contig_from = std::max(contig_from, 0); contig_to = std::min(static_cast(contig_view.length() - 1), contig_to); if (contig_to - contig_from + 1 < len && contig_to - contig_from + 1 < min_mapped_len_) { continue; // clipped alignment is considered iff its length >= // min_mapped_len_ } int32_t query_from = mapping_strand == 0 ? i - (seed_kmer_size_ - 1) - (contig_offset - contig_from) : i - (contig_to - contig_offset); int32_t query_to = mapping_strand == 0 ? 
i - (seed_kmer_size_ - 1) + (contig_to - contig_offset) : i + (contig_offset - contig_from); assert(query_from >= 0 && static_cast(query_from) < seq_view.length()); assert(query_to >= 0 && static_cast(query_to) < seq_view.length()); auto rec = MappingRecord{contig_id, contig_from, contig_to, static_cast(seq_view.id()), query_from, query_to, 0, mapping_strand, true}; auto end = mapping_records.begin() + n_mapping_records; if (std::find(mapping_records.begin(), end, rec) == end) { if (n_mapping_records < kArraySize) { mapping_records[n_mapping_records] = rec; n_mapping_records++; } else { if (v_mapping_records.get() == nullptr) { v_mapping_records.reset(new std::vector(1, rec)); } else { v_mapping_records->push_back(rec); } } } } if (n_mapping_records == 0) { return bad_record; } MappingRecord *best = &bad_record; int32_t max_match = 0; #define CHECK_BEST_UNIQ(rec) \ do { \ int32_t match_bases = \ Match(seq_view, rec.query_from, rec.query_to, rec.contig_id, \ rec.contig_from, rec.contig_to, rec.strand); \ if (match_bases == max_match) { \ best = &bad_record; \ } else if (match_bases > max_match) { \ max_match = match_bases; \ int32_t mismatch = rec.query_to - rec.query_from + 1 - match_bases; \ rec.mismatch = mismatch; \ best = &rec; \ } \ } while (0) if (v_mapping_records.get() != nullptr) { if (v_mapping_records->size() > 1) { std::sort(v_mapping_records->begin(), v_mapping_records->end()); v_mapping_records->resize( std::unique(v_mapping_records->begin(), v_mapping_records->end()) - v_mapping_records->begin()); } for (auto &rec : *v_mapping_records) { CHECK_BEST_UNIQ(rec); } } for (int i = 0; i < n_mapping_records; ++i) { auto &rec = mapping_records[i]; CHECK_BEST_UNIQ(rec); } #undef CHECK_BEST_UNIQ return *best; } megahit-1.2.9/src/localasm/hash_mapper.h000066400000000000000000000042011355123202700201600ustar00rootroot00000000000000// // Created by vout on 7/10/19. 
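//
// Usage sketch (illustrative; mirrors the calls made from local_assemble.cpp):
//   HashMapper mapper;
//   mapper.LoadAndBuild(contig_file, min_len, /*seed_kmer_size=*/31,
//                       /*sparsity=*/8);  // 8 is a hypothetical value
//   mapper.SetMappingThreshold(/*mapping_len=*/50, /*similarity=*/0.95);
//   MappingRecord rec = mapper.TryMap(read_view);
//   if (rec.valid) { /* use rec.contig_id, rec.contig_from/to, rec.strand */ }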
// #ifndef MEGAHIT_LOCALASM_HASH_MAPPER_H #define MEGAHIT_LOCALASM_HASH_MAPPER_H #include #include "parallel_hashmap/phmap.h" #include "sequence/kmer_plus.h" #include "sequence/sequence_package.h" struct MappingRecord { uint32_t contig_id; int32_t contig_from; int32_t contig_to; uint64_t query_id; int32_t query_from; int32_t query_to; uint32_t mismatch; uint8_t strand; bool valid; bool operator<(const MappingRecord &rhs) const { if (contig_id != rhs.contig_id) return contig_id < rhs.contig_id; if (contig_from != rhs.contig_from) return contig_from < rhs.contig_from; if (contig_to != rhs.contig_to) return contig_to < rhs.contig_to; if (query_id != rhs.query_id) return query_id < rhs.query_id; if (query_from != rhs.query_from) return query_from < rhs.query_from; if (query_to != rhs.query_to) return query_to < rhs.query_to; return strand < rhs.strand; } bool operator==(const MappingRecord &rhs) const { return contig_id == rhs.contig_id && contig_from == rhs.contig_from && contig_to == rhs.contig_to && query_id == rhs.query_id && query_from == rhs.query_from && query_to == rhs.query_to && strand == rhs.strand; }; }; class HashMapper { public: using TKmer = Kmer<2, uint32_t>; using TMapper = phmap::flat_hash_map; void LoadAndBuild(const std::string &contig_file, int32_t min_len, int32_t seed_kmer_size, int32_t sparsity); void SetMappingThreshold(int32_t mapping_len, double similarity) { min_mapped_len_ = mapping_len; similarity_ = similarity; } MappingRecord TryMap(const SeqPackage ::SeqView &seq_view) const; const SeqPackage &refseq() const { return refseq_; } private: int32_t Match(const SeqPackage::SeqView &seq_view, int query_from, int query_to, size_t contig_id, int ref_from, int ref_to, bool strand) const; private: TMapper index_; SeqPackage refseq_; int32_t seed_kmer_size_{31}; int32_t min_mapped_len_{50}; double similarity_{0.95}; }; #endif // MEGAHIT_LOCALASM_HASH_MAPPER_H megahit-1.2.9/src/localasm/local_assemble.cpp000066400000000000000000000262751355123202700212100ustar00rootroot00000000000000#include "local_assemble.h" #include #include #include #include #include #include #include "idba/contig_graph.h" #include "idba/hash_graph.h" #include "idba/sequence.h" #include "kmlib/kmbit.h" #include "hash_mapper.h" #include "mapping_result_collector.h" #include "sequence/io/contig/contig_reader.h" #include "sequence/io/contig/contig_writer.h" #include "sequence/io/sequence_lib.h" #include "utils/histgram.h" #include "utils/utils.h" namespace { static const int kMaxLocalRange = 650; using TInsertSize = std::pair; void LaunchIDBA(const std::deque &reads, const Sequence &contig_end, std::deque &out_contigs, std::deque &out_contig_infos, uint32_t mink, uint32_t maxk, uint32_t step, ContigGraph &contig_graph) { int local_range = contig_end.size(); HashGraph hash_graph; out_contigs.clear(); out_contig_infos.clear(); uint32_t max_read_len = 0; for (auto &read : reads) { max_read_len = std::max(max_read_len, read.size()); } for (uint32_t kmer_size = mink; kmer_size <= std::min(maxk, max_read_len); kmer_size += step) { hash_graph.clear(); hash_graph.set_kmer_size(kmer_size); for (auto &read : reads) { if (read.size() < kmer_size) continue; const Sequence seq(read); hash_graph.InsertKmers(seq); } auto histgram = hash_graph.coverage_histgram(); double mean = histgram.percentile(1 - 1.0 * local_range / hash_graph.num_vertices()); double threshold = mean; hash_graph.InsertKmers(contig_end); for (const auto &out_contig : out_contigs) hash_graph.InsertUncountKmers(out_contig); 
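    // Contigs assembled in the previous k iteration are re-inserted without
    // adding to the k-mer counts, so they survive the coverage filter below;
    // `threshold` estimates the local mean k-mer coverage from the histogram.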
hash_graph.Assemble(out_contigs, out_contig_infos); contig_graph.clear(); contig_graph.set_kmer_size(kmer_size); contig_graph.Initialize(out_contigs, out_contig_infos); contig_graph.RemoveDeadEnd(kmer_size * 2); contig_graph.RemoveBubble(); contig_graph.IterateCoverage(kmer_size * 2, 1, threshold); contig_graph.Assemble(out_contigs, out_contig_infos); if (out_contigs.size() == 1) { break; } } } std::vector EstimateInsertSize( const HashMapper &mapper, const SequenceLibCollection &lib_collection) { std::vector insert_sizes(lib_collection.size()); for (unsigned lib_id = 0; lib_id < lib_collection.size(); ++lib_id) { auto lib = lib_collection.GetLib(lib_id); if (!lib.IsPaired()) { continue; } Histgram insert_hist; const size_t min_hist_size_for_estimation = 1u << 18; size_t processed_reads = 0; while (insert_hist.size() < min_hist_size_for_estimation && processed_reads < lib.seq_count()) { size_t start_read_id = processed_reads; processed_reads = std::min(min_hist_size_for_estimation + start_read_id, lib.seq_count()); #pragma omp parallel for for (size_t i = start_read_id; i < processed_reads; i += 2) { auto seq1 = lib.GetSequenceView(i); auto seq2 = lib.GetSequenceView(i + 1); auto rec1 = mapper.TryMap(seq1); auto rec2 = mapper.TryMap(seq2); if (rec1.valid && rec2.valid) { if (rec1.contig_id == rec2.contig_id && rec1.strand != rec2.strand) { int insert_size; if (rec1.strand == 0) { insert_size = rec2.contig_to + seq2.length() - rec2.query_to - (rec1.contig_from - rec1.query_from); } else { insert_size = rec1.contig_to + seq1.length() - rec1.query_to - (rec2.contig_from - rec2.query_from); } if (insert_size >= (int)seq1.length() && insert_size >= (int)seq2.length()) { insert_hist.insert(insert_size); } } } } } insert_hist.Trim(0.01); insert_sizes[lib_id] = TInsertSize(insert_hist.mean(), insert_hist.sd()); xinfo("Lib {}, insert size: {.2} sd: {.2}\n", lib_id, insert_sizes[lib_id].first, insert_sizes[lib_id].second); } return insert_sizes; } int32_t LocalRange(const SequenceLib &lib, const TInsertSize &insert_size) { int32_t local_range = lib.GetMaxLength() - 1; if (lib.IsPaired() && insert_size.first >= lib.GetMaxLength()) { local_range = std::min(2 * insert_size.first, insert_size.first + 3 * insert_size.second); } if (local_range > kMaxLocalRange) { local_range = kMaxLocalRange; } return local_range; } int32_t GetMaxLocalRange(const SequenceLibCollection &lib_collection, const std::vector &insert_sizes) { int32_t max_local_range = 0; for (unsigned lib_id = 0; lib_id < lib_collection.size(); ++lib_id) { auto &lib = lib_collection.GetLib(lib_id); max_local_range = std::max(max_local_range, LocalRange(lib, insert_sizes[lib_id])); } return max_local_range; } void MapToContigs(const HashMapper &mapper, const SequenceLibCollection &lib_collection, const std::vector &insert_sizes, MappingResultCollector *collector) { for (unsigned lib_id = 0; lib_id < lib_collection.size(); ++lib_id) { auto &lib = lib_collection.GetLib(lib_id); int32_t local_range = LocalRange(lib, insert_sizes[lib_id]); bool is_paired = lib.IsPaired(); size_t num_added = 0, num_mapped = 0; if (is_paired) { #pragma omp parallel for reduction(+ : num_added, num_mapped) for (size_t i = 0; i < lib.seq_count(); i += 2) { auto seq1 = lib.GetSequenceView(i); auto seq2 = lib.GetSequenceView(i + 1); auto rec1 = mapper.TryMap(seq1); auto rec2 = mapper.TryMap(seq2); if (rec1.valid) { ++num_mapped; auto contig_len = mapper.refseq().GetSeqView(rec1.contig_id).length(); num_added += collector->AddSingle(rec1, contig_len, seq1.length(), 
local_range); num_added += collector->AddMate(rec1, rec2, contig_len, seq2.id(), local_range); } if (rec2.valid) { ++num_mapped; auto contig_len = mapper.refseq().GetSeqView(rec2.contig_id).length(); num_added += collector->AddSingle(rec2, contig_len, seq2.length(), local_range); num_added += collector->AddMate(rec2, rec1, contig_len, seq1.id(), local_range); } } } else { #pragma omp parallel reduction(+ : num_added, num_mapped) for (size_t i = 0; i < lib.seq_count(); ++i) { auto seq = lib.GetSequenceView(i); auto rec = mapper.TryMap(seq); if (rec.valid) { ++num_mapped; num_added += collector->AddSingle( rec, mapper.refseq().GetSeqView(rec.contig_id).length(), seq.length(), local_range); } } } xinfo( "Lib {}: total {} reads, aligned {}, added {} reads to local " "assembly\n", lib_id, lib.seq_count(), num_mapped, num_added); } } void AssembleAndOutput(const HashMapper &mapper, const SeqPackage &read_pkg, MappingResultCollector &result_collector, const std::string &output_file, const int32_t local_range, const LocalAsmOption &opt) { const size_t min_num_reads = read_pkg.max_length() > 0 ? local_range / read_pkg.max_length(): 1; xinfo("Minimum number of reads to do local assembly: {}\n", min_num_reads); Sequence seq, contig_end; ContigGraph contig_graph; std::deque reads; std::deque out_contigs; std::deque out_contig_infos; ContigWriter local_contig_writer(output_file); #pragma omp parallel for private(contig_graph, seq, contig_end, reads, \ out_contigs, out_contig_infos) \ schedule(dynamic) for (uint64_t cid = 0; cid < mapper.refseq().seq_count(); ++cid) { auto contig_view = mapper.refseq().GetSeqView(cid); int cl = contig_view.length(); for (uint8_t strand = 0; strand < 2; ++strand) { auto &mapping_rslts = result_collector.GetMappingResults(cid, strand); if (mapping_rslts.size() <= min_num_reads) { continue; } // collect local reads, convert them into Sequence reads.clear(); uint64_t last_mapping_pos = -1; int pos_count = 0; for (const auto &encoded_rslt : mapping_rslts) { uint64_t pos = MappingResultCollector::GetContigAbsPos(encoded_rslt); pos_count = pos == last_mapping_pos ? 
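          // pos_count tracks how many reads share the same encoded contig
          // position; the `pos_count <= 3` guard below keeps at most three
          // reads per position, presumably to cap the influence of
          // duplicated reads (e.g. PCR duplicates) on the local assembly.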
pos_count + 1 : 1; last_mapping_pos = pos; if (pos_count <= 3) { seq.clear(); auto read_view = read_pkg.GetSeqView( MappingResultCollector::GetReadId(encoded_rslt)); for (unsigned ri = 0, rsz = read_view.length(); ri < rsz; ++ri) { seq.Append(read_view.base_at(ri)); } reads.push_back(seq); } } contig_end.clear(); if (strand == 0) { for (int j = 0, e = std::min(local_range, cl); j < e; ++j) { contig_end.Append(contig_view.base_at(j)); } } else { for (int j = std::max(0, cl - local_range); j < cl; ++j) { contig_end.Append(contig_view.base_at(j)); } } out_contigs.clear(); LaunchIDBA(reads, contig_end, out_contigs, out_contig_infos, opt.kmin, opt.kmax, opt.step, contig_graph); for (uint64_t j = 0; j < out_contigs.size(); ++j) { if (out_contigs[j].size() > opt.min_contig_len && out_contigs[j].size() > opt.kmax) { auto str = out_contigs[j].str(); local_contig_writer.WriteLocalContig(str, cid, strand, j); } } } } } } // namespace void RunLocalAssembly(const LocalAsmOption &opt) { SimpleTimer timer; timer.reset(); timer.start(); HashMapper mapper; mapper.LoadAndBuild(opt.contig_file, opt.min_contig_len, opt.seed_kmer, opt.sparsity); mapper.SetMappingThreshold(opt.min_mapping_len, opt.similarity); timer.stop(); xinfo("Hash mapper construction time elapsed: {}\n", timer.elapsed()); timer.reset(); timer.start(); SequenceLibCollection lib_collection; SeqPackage read_pkg; lib_collection.SetPath(opt.lib_file_prefix); lib_collection.Read(&read_pkg); timer.stop(); xinfo("Read lib time elapsed: {}\n", timer.elapsed()); timer.reset(); timer.start(); auto insert_sizes = EstimateInsertSize(mapper, lib_collection); timer.stop(); xinfo("Insert size estimation time elapsed: {}\n", timer.elapsed()); timer.reset(); timer.start(); MappingResultCollector collector(mapper.refseq().seq_count()); MapToContigs(mapper, lib_collection, insert_sizes, &collector); timer.stop(); xinfo("Mapping time elapsed: {}\n", timer.elapsed()); timer.reset(); timer.start(); int32_t max_local_range = GetMaxLocalRange(lib_collection, insert_sizes); AssembleAndOutput(mapper, read_pkg, collector, opt.output_file, max_local_range, opt); timer.stop(); xinfo("Local assembly time elapsed: {}\n", timer.elapsed()); } megahit-1.2.9/src/localasm/local_assemble.h000066400000000000000000000024061355123202700206430ustar00rootroot00000000000000/* * MEGAHIT * Copyright (C) 2014 - 2015 The University of Hong Kong & L3 Bioinformatics * Limited * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program. If not, see . 
*/
/* contact: Dinghua Li */

#ifndef LOCAL_ASSEMBLER_H
#define LOCAL_ASSEMBLER_H

#include <cstdint>
#include <string>

struct LocalAsmOption {
  std::string contig_file;
  std::string lib_file_prefix;

  uint32_t kmin{11};
  uint32_t kmax{41};
  uint32_t step{6};
  uint32_t seed_kmer{31};

  uint32_t min_contig_len{200};
  uint32_t sparsity{8};
  double similarity{0.8};
  uint32_t min_mapping_len{75};

  uint32_t num_threads{0};
  std::string output_file;
};

void RunLocalAssembly(const LocalAsmOption &opt);

#endif
megahit-1.2.9/src/localasm/mapping_result_collector.h000066400000000000000000000064621355123202700230020ustar00rootroot00000000000000
//
// Created by vout on 7/13/19.
//

#ifndef MEGAHIT_LOCALASM_MAPPING_RESULT_H
#define MEGAHIT_LOCALASM_MAPPING_RESULT_H

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

#include "hash_mapper.h"
#include "kmlib/kmbitvector.h"

class MappingResultCollector {
 public:
  explicit MappingResultCollector(size_t n_ref)
      : fwd_mappings_(n_ref), bwd_mappings_(n_ref), locks_(n_ref) {}

  unsigned AddSingle(const MappingRecord &rec, int32_t contig_len,
                     int32_t read_len, int32_t local_range) {
    unsigned ret = 0;
    if (rec.contig_to < local_range && rec.query_from != 0 &&
        rec.query_to == read_len - 1) {
      locks_.lock(rec.contig_id);
      fwd_mappings_[rec.contig_id].push_back(EncodeMappingRead(
          rec.contig_to, 0, rec.mismatch, rec.strand, rec.query_id));
      locks_.unlock(rec.contig_id);
      ret++;
    } else if (rec.contig_from + local_range >= contig_len &&
               rec.query_to < read_len - 1 && rec.query_from == 0) {
      locks_.lock(rec.contig_id);
      bwd_mappings_[rec.contig_id].push_back(
          EncodeMappingRead(contig_len - 1 - rec.contig_from, 0, rec.mismatch,
                            rec.strand, rec.query_id));
      locks_.unlock(rec.contig_id);
      ret++;
    }
    return ret;
  }

  unsigned AddMate(const MappingRecord &rec1, const MappingRecord &rec2,
                   int32_t contig_len, uint64_t mate_id, int32_t local_range) {
    if (rec2.valid && rec2.contig_id == rec1.contig_id) return 0;
    unsigned ret = 0;
    if (rec1.contig_to < local_range && rec1.strand == 1) {
      locks_.lock(rec1.contig_id);
      fwd_mappings_[rec1.contig_id].push_back(EncodeMappingRead(
          rec1.contig_to, 1, rec1.mismatch, rec1.strand, mate_id));
      locks_.unlock(rec1.contig_id);
      ret++;
    } else if (rec1.contig_from + local_range >= contig_len &&
               rec1.strand == 0) {
      locks_.lock(rec1.contig_id);
      bwd_mappings_[rec1.contig_id].push_back(
          EncodeMappingRead(contig_len - 1 - rec1.contig_from, 1,
                            rec1.mismatch, rec1.strand, mate_id));
      locks_.unlock(rec1.contig_id);
      ret++;
    }
    return ret;
  }

  const std::deque<uint64_t> &GetMappingResults(int64_t contig_id,
                                                uint8_t strand) {
    auto &results = strand == 0 ? fwd_mappings_ : bwd_mappings_;
    auto &ret = results[contig_id];
    std::sort(ret.begin(), ret.end());
    return ret;
  }

  static uint64_t GetContigAbsPos(uint64_t encoded) {
    return encoded >> (44u + 1u + 4u);
  }

  static uint64_t GetReadId(uint64_t encoded) {
    return encoded & ((1ull << 44u) - 1);
  }

 private:
  static uint64_t EncodeMappingRead(uint32_t contig_offset, uint8_t is_mate,
                                    uint32_t mismatch, uint8_t strand,
                                    uint64_t read_id) {
    assert(contig_offset < (1u << 14u));  // the offset must fit in 14 bits
    assert(strand <= 1);
    uint64_t ret = contig_offset;
    ret = (ret << 1u) | is_mate;
    ret = (ret << 4u) | (mismatch < 15 ?
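  // Bit layout of the encoded record, from the most significant bit:
  //   contig_offset(14) | is_mate(1) | mismatch(4) | strand(1) | read_id(44)
  // This matches the decoders above: GetContigAbsPos() shifts away the low
  // 44 + 1 + 4 bits (read id, strand and mismatch, keeping the offset plus
  // the mate flag), and GetReadId() masks the low 44 bits. A sketch of a
  // round-trip, assuming the example values below:
  //   uint64_t enc = EncodeMappingRead(5, 0, 1, 0, 7);
  //   // enc == (5ull << 50) | (1ull << 45) | 7
  //   assert(GetReadId(enc) == 7);
  //   assert(GetContigAbsPos(enc) == (5u << 1u));  // offset plus mate flag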
                             mismatch : 15u);
    ret = (ret << 1u) | strand;
    ret = (ret << 44u) | read_id;  // 44 bits for read id
    return ret;
  }

  std::vector<std::deque<uint64_t>> fwd_mappings_;
  std::vector<std::deque<uint64_t>> bwd_mappings_;
  AtomicBitVector locks_;
};

#endif  // MEGAHIT_LOCALASM_MAPPING_RESULT_H
megahit-1.2.9/src/main.cpp000066400000000000000000000077651355123202700153740ustar00rootroot00000000000000
/*
 * MEGAHIT
 * Copyright (C) 2014 - 2015 The University of Hong Kong & L3 Bioinformatics
 * Limited
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
 */
/* contact: Dinghua Li */

#include <cstdio>
#include <cstdlib>
#include <cstring>

#include "definitions.h"
#include "utils/cpu_dispatch.h"
#include "utils/utils.h"

int main_assemble(int argc, char **argv);
int main_local(int argc, char **argv);
int main_iterate(int argc, char **argv);
int main_build_lib(int argc, char **argv);
int main_kmer_count(int argc, char **argv);
int main_read2sdbg(int argc, char **argv);
int main_seq2sdbg(int argc, char **argv);
int main_contig2fastg(int argc, char **argv);
int main_read_stat(int argc, char **argv);
int main_filter_by_len(int argc, char **argv);

void show_help(const char *program_name) {
  pfprintf(
      stderr,
      "Usage: {s} <sub_program> [sub options]\n"
      "    sub-programs:\n"
      "       assemble       assemble from SdBG\n"
      "       local          local assembly\n"
      "       iterate        extract iterative edges\n"
      "       buildlib       build read library\n"
      "       count          kmer counting\n"
      "       read2sdbg      build sdbg from reads\n"
      "       seq2sdbg       build sdbg from megahit contigs + edges\n"
      "       contig2fastg   convert MEGAHIT's k*.contigs.fa to fastg format\n"
      "       readstat       calculate read stats (# of reads, bases, longest, "
      "shortest, average)\n"
      "       filterbylen    filter contigs by length\n"
      "       checkcpu       check whether the run-time CPU supports POPCNT "
      "and BMI2\n"
      "       checkpopcnt    check whether the run-time CPU supports POPCNT\n"
      "       checkbmi2      check whether the run-time CPU supports BMI2\n"
      "       dumpversion    dump version\n"
      "       kmax           the largest k value supported\n",
      program_name);
}

int main(int argc, char **argv) {
  if (argc < 2) {
    show_help(argv[0]);
    exit(1);
  }

  if (strcmp(argv[1], "assemble") == 0) {
    return main_assemble(argc - 1, argv + 1);
  } else if (strcmp(argv[1], "local") == 0) {
    return main_local(argc - 1, argv + 1);
  } else if (strcmp(argv[1], "iterate") == 0) {
    return main_iterate(argc - 1, argv + 1);
  } else if (strcmp(argv[1], "buildlib") == 0) {
    return main_build_lib(argc - 1, argv + 1);
  } else if (strcmp(argv[1], "count") == 0) {
    return main_kmer_count(argc - 1, argv + 1);
  } else if (strcmp(argv[1], "read2sdbg") == 0) {
    return main_read2sdbg(argc - 1, argv + 1);
  } else if (strcmp(argv[1], "seq2sdbg") == 0) {
    return main_seq2sdbg(argc - 1, argv + 1);
  } else if (strcmp(argv[1], "contig2fastg") == 0) {
    return main_contig2fastg(argc - 1, argv + 1);
  } else if (strcmp(argv[1], "readstat") == 0) {
    return main_read_stat(argc - 1, argv + 1);
  } else if (strcmp(argv[1], "filterbylen") == 0) {
    return main_filter_by_len(argc - 1, argv + 1);
  } else if (strcmp(argv[1], "checkcpu") == 0) {
    pprintf("{}\n", HasPopcnt() && HasBmi2());
  } else if (strcmp(argv[1],
"checkpopcnt") == 0) { pprintf("{}\n", HasPopcnt()); } else if (strcmp(argv[1], "checkbmi2") == 0) { pprintf("{}\n", HasBmi2()); } else if (strcmp(argv[1], "dumpversion") == 0) { pprintf("{s}\n", PACKAGE_VERSION); } else if (strcmp(argv[1], "kmax") == 0) { pprintf("{}\n", kMaxK); } else { show_help(argv[0]); return 1; } return 0; } megahit-1.2.9/src/main_assemble.cpp000066400000000000000000000241451355123202700172410ustar00rootroot00000000000000/* * MEGAHIT * Copyright (C) 2014 - 2015 The University of Hong Kong & L3 Bioinformatics * Limited * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program. If not, see . */ /* contact: Dinghua Li */ #include #include #include #include #include #include #include "assembly/all_algo.h" #include "assembly/contig_output.h" #include "assembly/contig_stat.h" #include "utils/histgram.h" #include "utils/options_description.h" #include "utils/utils.h" using std::string; namespace { struct LocalAsmOption { string sdbg_name; string output_prefix{"out"}; int num_cpu_threads{0}; int local_width{1000}; int max_tip_len{-1}; int min_standalone{200}; double min_depth{-1}; bool is_final_round{false}; int bubble_level{2}; int merge_len{20}; double merge_similar{0.98}; int prune_level{2}; double disconnect_ratio{0.1}; double low_local_ratio{0.2}; int cleaning_rounds{5}; bool output_standalone{false}; bool careful_bubble{false}; string contig_file() { return output_prefix + ".contigs.fa"; } string standalone_file() { return output_prefix + ".final.contigs.fa"; } string addi_contig_file() { return output_prefix + ".addi.fa"; } string bubble_file() { return output_prefix + ".bubble_seq.fa"; } } opt; void ParseAsmOption(int argc, char *argv[]) { OptionsDescription desc; desc.AddOption("sdbg_name", "s", opt.sdbg_name, "succinct de Bruijn graph name"); desc.AddOption("output_prefix", "o", opt.output_prefix, "output prefix"); desc.AddOption("num_cpu_threads", "t", opt.num_cpu_threads, "number of cpu threads"); desc.AddOption("max_tip_len", "", opt.max_tip_len, "max length for tips to be removed. 
-1 for 2k"); desc.AddOption( "min_standalone", "", opt.min_standalone, "min length of a standalone contig to output to final.contigs.fa"); desc.AddOption("bubble_level", "", opt.bubble_level, "bubbles level 0-3"); desc.AddOption("merge_len", "", opt.merge_len, "merge complex bubbles of length <= merge_len * k"); desc.AddOption("merge_similar", "", opt.merge_similar, "min similarity of complex bubble merging"); desc.AddOption("prune_level", "", opt.prune_level, "strength of low local depth contig pruning (0-3)"); desc.AddOption("disconnect_ratio", "", opt.disconnect_ratio, "ratio threshold for disconnecting contigs"); desc.AddOption("low_local_ratio", "", opt.low_local_ratio, "ratio to define low depth contigs"); desc.AddOption("cleaning_rounds", "", opt.cleaning_rounds, "number of rounds of graphs cleaning"); desc.AddOption("min_depth", "", opt.min_depth, "if prune_level >= 2, permanently remove low local coverage " "unitigs under this threshold"); desc.AddOption("is_final_round", "", opt.is_final_round, "this is the last iteration"); desc.AddOption("output_standalone", "", opt.output_standalone, "output standalone contigs to *.final.contigs.fa"); desc.AddOption("careful_bubble", "", opt.careful_bubble, "remove bubble carefully"); try { desc.Parse(argc, argv); if (opt.sdbg_name.empty()) { throw std::logic_error("no succinct de Bruijn graph name!"); } } catch (std::exception &e) { std::cerr << e.what() << std::endl; std::cerr << "Usage: " << argv[0] << " -s sdbg_name -o output_prefix" << std::endl; std::cerr << "options:" << std::endl; std::cerr << desc << std::endl; exit(1); } } } // namespace int main_assemble(int argc, char **argv) { AutoMaxRssRecorder recorder; ParseAsmOption(argc, argv); SDBG dbg; SimpleTimer timer; // graph loading timer.reset(); timer.start(); xinfo("Loading succinct de Bruijn graph: {s}", opt.sdbg_name.c_str()); dbg.LoadFromFile(opt.sdbg_name.c_str()); timer.stop(); xinfoc("Done. Time elapsed: {}\n", timer.elapsed()); xinfo("Number of Edges: {}; K value: {}\n", dbg.size(), dbg.k()); // set cpu threads if (opt.num_cpu_threads == 0) { opt.num_cpu_threads = omp_get_max_threads(); } omp_set_num_threads(opt.num_cpu_threads); xinfo("Number of CPU threads: {}\n", opt.num_cpu_threads); // set tip len if (opt.max_tip_len == -1) { opt.max_tip_len = dbg.k() * 2; } // set min depth if (opt.min_depth <= 0) { opt.min_depth = sdbg_pruning::InferMinDepth(dbg); xinfo("min depth set to {.3}\n", opt.min_depth); } // tips removal before building unitig graph if (opt.max_tip_len > 0) { timer.reset(); timer.start(); sdbg_pruning::RemoveTips(dbg, opt.max_tip_len); timer.stop(); xinfo("Tips removal done! 
Time elapsed(sec): {.3}\n", timer.elapsed()); } // construct unitig graph timer.reset(); timer.start(); UnitigGraph graph(&dbg); timer.stop(); xinfo("unitig graph size: {}, time for building: {.3}\n", graph.size(), timer.elapsed()); CalcAndPrintStat(graph); // set up bubble ContigWriter bubble_writer(opt.bubble_file()); NaiveBubbleRemover naiver_bubble_remover; ComplexBubbleRemover complex_bubble_remover; complex_bubble_remover.SetMergeSimilarity(opt.merge_similar) .SetMergeLevel(opt.merge_len); Histgram bubble_hist; if (opt.careful_bubble) { naiver_bubble_remover.SetCarefulThreshold(0.2).SetWriter(&bubble_writer); complex_bubble_remover.SetCarefulThreshold(0.2).SetWriter(&bubble_writer); } // graph cleaning for (int round = 1; round <= opt.cleaning_rounds; ++round) { xinfo("Graph cleaning round {}\n", round); bool changed = false; if (round > 1) { timer.reset(); timer.start(); uint32_t num_tips = RemoveTips(graph, opt.max_tip_len); changed |= num_tips > 0; timer.stop(); xinfo("Tips removed: {}, time: {.3}\n", num_tips, timer.elapsed()); } // remove bubbles if (opt.bubble_level >= 1) { timer.reset(); timer.start(); uint32_t num_bubbles = naiver_bubble_remover.PopBubbles(graph, true); timer.stop(); xinfo("Number of bubbles removed: {}, Time elapsed(sec): {.3}\n", num_bubbles, timer.elapsed()); changed |= num_bubbles > 0; } // remove complex bubbles if (opt.bubble_level >= 2) { timer.reset(); timer.start(); uint32_t num_bubbles = complex_bubble_remover.PopBubbles(graph, true); timer.stop(); xinfo("Number of complex bubbles removed: {}, Time elapsed(sec): {}\n", num_bubbles, timer.elapsed()); changed |= num_bubbles > 0; } // disconnect timer.reset(); timer.start(); uint32_t num_disconnected = DisconnectWeakLinks(graph, opt.disconnect_ratio); timer.stop(); xinfo("Number unitigs disconnected: {}, time: {.3}\n", num_disconnected, timer.elapsed()); changed |= num_disconnected > 0; // excessive pruning uint32_t num_excessive_pruned = 0; if (opt.prune_level >= 3) { timer.reset(); timer.start(); num_excessive_pruned = RemoveLowDepth(graph, opt.min_depth); num_excessive_pruned += naiver_bubble_remover.PopBubbles(graph, true); if (opt.bubble_level >= 2 && opt.merge_len > 0) { num_excessive_pruned += complex_bubble_remover.PopBubbles(graph, true); } timer.stop(); xinfo("Unitigs removed in (more-)excessive pruning: {}, time: {.3}\n", num_excessive_pruned, timer.elapsed()); } else if (opt.prune_level >= 2) { timer.reset(); timer.start(); RemoveLocalLowDepth(graph, opt.min_depth, opt.max_tip_len, opt.local_width, std::min(opt.low_local_ratio, 0.1), true, &num_excessive_pruned); timer.stop(); xinfo("Unitigs removed in excessive pruning: {}, time: {.3}\n", num_excessive_pruned, timer.elapsed()); } if (!changed) break; } ContigStat stat = CalcAndPrintStat(graph); // output contigs ContigWriter contig_writer(opt.contig_file()); ContigWriter standalone_writer(opt.standalone_file()); if (!(opt.is_final_round && opt.prune_level >= 1)) { // otherwise output after local low depth pruning timer.reset(); timer.start(); OutputContigs(graph, &contig_writer, opt.output_standalone ? 
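  // Standalone contigs (unitigs with no incoming or outgoing neighbors) are
  // unlikely to improve at a larger k, so when --output_standalone is set
  // those longer than min_standalone appear to go straight to
  // *.final.contigs.fa; otherwise they stay with the regular contigs for the
  // next iteration.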
&standalone_writer : nullptr, false, opt.min_standalone); timer.stop(); xinfo("Time to output: {}\n", timer.elapsed()); } // remove local low depth & output as contigs if (opt.prune_level >= 1) { ContigWriter addi_contig_writer(opt.addi_contig_file()); timer.reset(); timer.start(); uint32_t num_removed = IterateLocalLowDepth( graph, opt.min_depth, opt.max_tip_len, opt.local_width, opt.low_local_ratio, opt.is_final_round); uint32_t n_bubbles = 0; if (opt.bubble_level >= 2 && opt.merge_len > 0) { complex_bubble_remover.SetWriter(nullptr); n_bubbles = complex_bubble_remover.PopBubbles(graph, false); timer.stop(); } xinfo( "Number of local low depth unitigs removed: {}, complex bubbles " "removed: {}, time: {}\n", num_removed, n_bubbles, timer.elapsed()); CalcAndPrintStat(graph); if (!opt.is_final_round) { OutputContigs(graph, &addi_contig_writer, nullptr, true, 0); } else { OutputContigs(graph, &contig_writer, opt.output_standalone ? &standalone_writer : nullptr, false, opt.min_standalone); } auto stat_changed = CalcAndPrintStat(graph, false, true); } return 0; }megahit-1.2.9/src/main_buildlib.cpp000066400000000000000000000005751355123202700172350ustar00rootroot00000000000000#include "sequence/io/sequence_lib.h" #include "utils/utils.h" void DisplayHelp(const char *program) { pfprintf(stderr, "Usage {s} \n", program); } int main_build_lib(int argc, char **argv) { AutoMaxRssRecorder recorder; if (argc < 3) { DisplayHelp(argv[0]); exit(1); } SequenceLibCollection::Build(argv[1], argv[2]); return 0; }megahit-1.2.9/src/main_iterate.cpp000066400000000000000000000173221355123202700171020ustar00rootroot00000000000000/* * MEGAHIT * Copyright (C) 2014 - 2015 The University of Hong Kong & L3 Bioinformatics * Limited * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program. If not, see . */ /* contact: Dinghua Li */ #include #include #include #include #include #include #include #include #include #include #include #include "definitions.h" #include "iterate/contig_flank_index.h" #include "iterate/kmer_collector.h" #include "sequence/io/async_sequence_reader.h" #include "utils/options_description.h" using std::string; using std::vector; namespace { struct LocalAsmOption { string contig_file; string bubble_file; string read_file; int num_cpu_threads{0}; int kmer_k{0}; int step{0}; string output_prefix; } opt; static void ParseIterOptions(int argc, char *argv[]) { OptionsDescription desc; desc.AddOption("contig_file", "c", opt.contig_file, "(*) contigs file, fasta/fastq format, output by assembler"); desc.AddOption("bubble_file", "b", opt.bubble_file, "(*) bubble file, fasta/fastq format, output by assembler"); desc.AddOption("read_file", "r", opt.read_file, "(*) reads to be aligned. \"-\" for stdin. Can be gzip'ed."); desc.AddOption("num_cpu_threads", "t", opt.num_cpu_threads, "number of cpu threads, at least 2. 0 for auto detect."); desc.AddOption("kmer_k", "k", opt.kmer_k, "(*) current kmer size."); desc.AddOption("step", "s", opt.step, "(*) step for iteration (<= 28). i.e. 
this iteration is from " "kmer_k to (kmer_k + step)"); desc.AddOption("output_prefix", "o", opt.output_prefix, "(*) output_prefix.edges.0 will be created."); try { desc.Parse(argc, argv); if (opt.step + opt.kmer_k >= static_cast( std::max(Kmer<4>::max_size(), GenericKmer::max_size()))) { std::ostringstream os; os << "kmer_k + step must less than " << std::max(Kmer<4>::max_size(), GenericKmer::max_size()); throw std::logic_error(os.str()); } else if (opt.contig_file == "") { throw std::logic_error("No contig file!"); } else if (opt.bubble_file == "") { throw std::logic_error("No bubble file!"); } else if (opt.read_file == "") { throw std::logic_error("No reads file!"); } else if (opt.kmer_k <= 0) { throw std::logic_error("Invalid kmer size!"); } else if (opt.step <= 0 || opt.step > 28 || opt.step % 2 == 1) { throw std::logic_error("Invalid step size!"); } else if (opt.output_prefix == "") { throw std::logic_error("No output prefix!"); } if (opt.num_cpu_threads == 0) { opt.num_cpu_threads = omp_get_max_threads(); } if (opt.num_cpu_threads > 1) { omp_set_num_threads(opt.num_cpu_threads - 1); } else { omp_set_num_threads(1); } } catch (std::exception &e) { std::cerr << e.what() << std::endl; std::cerr << "Usage: " << argv[0] << " [opt]" << std::endl; std::cerr << "opt with (*) are must" << std::endl; std::cerr << "opt:" << std::endl; std::cerr << desc << std::endl; exit(1); } } } // namespace template static bool ReadReadsAndProcessKernel(const LocalAsmOption &opt, const IndexType &index) { if (KmerType::max_size() < static_cast(opt.kmer_k + opt.step + 1)) { return false; } xinfo("Selected kmer type size for next k: {}\n", sizeof(KmerType)); BinaryReader binary_reader(opt.read_file); AsyncSequenceReader reader(&binary_reader); KmerCollector collector(opt.kmer_k + opt.step + 1, opt.output_prefix); int64_t num_aligned_reads = 0; int64_t num_total_reads = 0; while (true) { const auto &read_pkg = reader.Next(); if (read_pkg.seq_count() == 0) { break; } num_aligned_reads += index.FindNextKmersFromReads(read_pkg, &collector); num_total_reads += read_pkg.seq_count(); xinfo("Processed: {}, aligned: {}. Iterative edges: {}\n", num_total_reads, num_aligned_reads, collector.collection().size()); } collector.FlushToFile(); xinfo("Total: {}, aligned: {}. 
Iterative edges: {}\n", num_total_reads, num_aligned_reads, collector.collection().size()); return true; } template static void ReadReadsAndProcess(const LocalAsmOption &opt, const IndexType &index) { if (ReadReadsAndProcessKernel>(opt, index)) return; if (ReadReadsAndProcessKernel>(opt, index)) return; if (ReadReadsAndProcessKernel>(opt, index)) return; if (ReadReadsAndProcessKernel>(opt, index)) return; if (ReadReadsAndProcessKernel>(opt, index)) return; if (ReadReadsAndProcessKernel>(opt, index)) return; if (ReadReadsAndProcessKernel>(opt, index)) return; if (ReadReadsAndProcessKernel>(opt, index)) return; xfatal("k is too large!\n"); } template static void ReadContigsAndBuildIndex(const LocalAsmOption &opt, const std::string &file_name, IndexType *index) { AsyncContigReader reader(file_name); while (true) { auto &pkg = reader.Next(); auto &contig_pkg = pkg.first; auto &mul = pkg.second; if (contig_pkg.seq_count() == 0) { break; } xinfo("Read {} contigs\n", contig_pkg.seq_count()); index->FeedBatchContigs(contig_pkg, mul); xinfo("Number of flank kmers: {}\n", index->size()); } } struct BaseRunner { virtual ~BaseRunner() = default; virtual void Run(const LocalAsmOption &opt) = 0; virtual uint32_t max_k() const = 0; }; template struct Runner : public BaseRunner { ~Runner() override = default; void Run(const LocalAsmOption &opt) override { xinfo("Selected kmer type size for k: {}\n", sizeof(KmerType)); ContigFlankIndex index(opt.kmer_k, opt.step); ReadContigsAndBuildIndex(opt, opt.contig_file, &index); ReadContigsAndBuildIndex(opt, opt.bubble_file, &index); ReadReadsAndProcess(opt, index); } uint32_t max_k() const override { return KmerType::max_size(); } }; int main_iterate(int argc, char *argv[]) { AutoMaxRssRecorder recorder; ParseIterOptions(argc, argv); std::list> runners; runners.emplace_back(new Runner>()); runners.emplace_back(new Runner>()); runners.emplace_back(new Runner>()); runners.emplace_back(new Runner>()); runners.emplace_back(new Runner>()); runners.emplace_back(new Runner>()); runners.emplace_back(new Runner>()); runners.emplace_back(new Runner>()); for (auto &runner : runners) { if (runner->max_k() >= static_cast(opt.kmer_k + 1)) { runner->Run(opt); return 0; } } xfatal("k is too large!\n"); return 1; }megahit-1.2.9/src/main_local_assemble.cpp000066400000000000000000000054611355123202700204130ustar00rootroot00000000000000/* * MEGAHIT * Copyright (C) 2014 - 2015 The University of Hong Kong & L3 Bioinformatics * Limited * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program. If not, see . 
*/ /* contact: Dinghua Li */ #include #include #include #include #include "localasm/local_assemble.h" #include "utils/options_description.h" #include "utils/utils.h" namespace { LocalAsmOption ParseLocalAsmOptions(int argc, char *argv[]) { LocalAsmOption opt; OptionsDescription desc; desc.AddOption("contig_file", "c", opt.contig_file, "contig file"); desc.AddOption("lib_file_prefix", "l", opt.lib_file_prefix, "lib file prefix"); desc.AddOption("kmin", "", opt.kmin, ""); desc.AddOption("kmax", "", opt.kmax, ""); desc.AddOption("step", "", opt.step, ""); desc.AddOption("seed_kmer", "", opt.seed_kmer, "kmer size for seeding alignments"); desc.AddOption("min_contig_len", "", opt.min_contig_len, ""); desc.AddOption("min_mapping_len", "", opt.min_mapping_len, ""); desc.AddOption("sparsity", "", opt.sparsity, "sparsity of hash mapper"); desc.AddOption("similarity", "", opt.similarity, "alignment similarity threshold"); desc.AddOption("num_threads", "t", opt.num_threads, ""); desc.AddOption("output_file", "o", opt.output_file, ""); try { desc.Parse(argc, argv); if (opt.contig_file == "") { throw std::logic_error("no contig file!"); } if (opt.lib_file_prefix == "") { throw std::logic_error("no read file!"); } if (opt.output_file == "") { throw std::logic_error("no output file!"); } if (opt.num_threads == 0) { opt.num_threads = omp_get_max_threads(); } } catch (std::exception &e) { std::cerr << e.what() << std::endl; std::cerr << "Usage: " << argv[0] << " -c contigs.fa -r reads.fq -o out.local_contig.fa" << std::endl; std::cerr << "options:" << std::endl; std::cerr << desc << std::endl; exit(1); } return opt; } } // namespace int main_local(int argc, char **argv) { AutoMaxRssRecorder recorder; auto opt = ParseLocalAsmOptions(argc, argv); omp_set_num_threads(opt.num_threads); RunLocalAssembly(opt); return 0; }megahit-1.2.9/src/main_sdbg_build.cpp000066400000000000000000000160121355123202700175360ustar00rootroot00000000000000/* * MEGAHIT * Copyright (C) 2014 The University of Hong Kong * * This program is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * This program is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with this program. If not, see . */ /* contact: Dinghua Li */ #include #include #include #include #include #include "definitions.h" #include "sorting/kmer_counter.h" #include "sorting/read_to_sdbg.h" #include "sorting/seq_to_sdbg.h" #include "utils/options_description.h" #include "utils/utils.h" int main_kmer_count(int argc, char **argv) { AutoMaxRssRecorder recorder; // parse option OptionsDescription desc; KmerCounterOption opt; desc.AddOption("kmer_k", "k", opt.k, "kmer size"); desc.AddOption("min_kmer_frequency", "m", opt.solid_threshold, "min frequency to output an edge"); desc.AddOption( "host_mem", "", opt.host_mem, "Max memory to be used. 90% of the free memory is recommended."); desc.AddOption("num_cpu_threads", "", opt.n_threads, "number of CPU threads. 
At least 2."); desc.AddOption("read_lib_file", "", opt.read_lib_file, "read library configuration file."); desc.AddOption("output_prefix", "", opt.output_prefix, "output prefix"); desc.AddOption("mem_flag", "", opt.mem_flag, "memory options. 0: minimize memory usage; 1: automatically " "use moderate memory; " "other: use all " "available mem specified by '--host_mem'"); try { desc.Parse(argc, argv); if (opt.read_lib_file.empty()) { throw std::logic_error("No read library configuration file!"); } if (opt.n_threads == 0) { opt.n_threads = omp_get_max_threads(); } if (opt.host_mem == 0) { throw std::logic_error("Please specify the host memory!"); } } catch (std::exception &e) { std::cerr << e.what() << std::endl; std::cerr << "Usage: sdbg_builder count --input_file fastx_file -o out" << std::endl; std::cerr << "Options:" << std::endl; std::cerr << desc << std::endl; exit(1); } KmerCounter runner(opt); runner.Run(); return 0; } int main_read2sdbg(int argc, char **argv) { AutoMaxRssRecorder recorder; // parse option the same as kmer_count OptionsDescription desc; Read2SdbgOption opt; desc.AddOption("kmer_k", "k", opt.k, "kmer size"); desc.AddOption("min_kmer_frequency", "m", opt.solid_threshold, "min frequency to output an edge"); desc.AddOption( "host_mem", "", opt.host_mem, "Max memory to be used. 90% of the free memory is recommended."); desc.AddOption("num_cpu_threads", "", opt.n_threads, "number of CPU threads. At least 2."); desc.AddOption("read_lib_file", "", opt.read_lib_file, "input fast[aq] file, can be gzip'ed. \"-\" for stdin."); desc.AddOption("output_prefix", "", opt.output_prefix, "output prefix"); desc.AddOption("mem_flag", "", opt.mem_flag, "memory options. 0: minimize memory usage; 1: automatically " "use moderate memory; " "other: use all " "available mem specified by '--host_mem'"); desc.AddOption("need_mercy", "", opt.need_mercy, "to add mercy edges."); try { desc.Parse(argc, argv); if (opt.read_lib_file.empty()) { throw std::logic_error("No input file!"); } if (opt.n_threads == 0) { opt.n_threads = omp_get_max_threads(); } if (opt.host_mem == 0) { throw std::logic_error("Please specify the host memory!"); } } catch (std::exception &e) { std::cerr << e.what() << std::endl; std::cerr << "Usage: sdbg_builder read2sdbg --read_lib_file fastx_file -o out" << std::endl; std::cerr << "Options:" << std::endl; std::cerr << desc << std::endl; exit(1); } SeqPkgWithSolidMarker pkg; { // stage 1 Read2SdbgS1 runner(opt, &pkg); if (opt.solid_threshold > 1) { runner.Run(); } else { runner.Initialize(); } } { // stage 2 Read2SdbgS2 runner(opt, &pkg); runner.Run(); } return 0; } int main_seq2sdbg(int argc, char **argv) { AutoMaxRssRecorder recorder; OptionsDescription desc; Seq2SdbgOption opt; desc.AddOption("host_mem", "", opt.host_mem, "memory to be used. No more than 95% of the free memory is " "recommended. 0 for auto detect."); desc.AddOption("kmer_size", "k", opt.k, "kmer size"); desc.AddOption("kmer_from", "", opt.k_from, "previous k"); desc.AddOption("num_cpu_threads", "t", opt.n_threads, "number of CPU threads. 
At least 2."); desc.AddOption("contig", "", opt.contig, "contigs from previous k"); desc.AddOption("bubble", "", opt.bubble_seq, "bubble sequence from previous k"); desc.AddOption("addi_contig", "", opt.addi_contig, "additional contigs from previous k"); desc.AddOption("local_contig", "", opt.local_contig, "local contigs from previous k"); desc.AddOption( "input_prefix", "", opt.input_prefix, "files input_prefix.edges.* output by count module, can be gzip'ed."); desc.AddOption("output_prefix", "o", opt.output_prefix, "output prefix"); desc.AddOption("need_mercy", "", opt.need_mercy, "to add mercy edges. The file input_prefix.cand output by " "count module should exist."); desc.AddOption("mem_flag", "", opt.mem_flag, "memory options. 0: minimize memory usage; 1: automatically " "use moderate memory; " "other: use all " "available mem specified by '--host_mem'"); try { desc.Parse(argc, argv); if (opt.input_prefix.empty() && opt.contig.empty() && opt.addi_contig.empty()) { throw std::logic_error("No input files!"); } if (opt.n_threads == 0) { opt.n_threads = omp_get_max_threads(); } if (opt.k < 9) { throw std::logic_error("kmer size must be >= 9!"); } if (opt.host_mem == 0) { throw std::logic_error("Please specify the host memory!"); } } catch (std::exception &e) { std::cerr << e.what() << std::endl; std::cerr << "Usage: sdbg_builder seq2sdbg -k kmer_size --contig " "contigs.fa [--addi_contig " "add.fa] [--input_prefix input] -o out" << std::endl; std::cerr << "Options:" << std::endl; std::cerr << desc << std::endl; exit(1); } SeqToSdbg runner(opt); runner.Run(); return 0; } megahit-1.2.9/src/megahit000077500000000000000000001131421355123202700152760ustar00rootroot00000000000000#!/usr/bin/env python # ------------------------------------------------------------------------- # MEGAHIT # Copyright (C) 2014 - 2015 The University of Hong Kong & L3 Bioinformatics Limited # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see . 
# ------------------------------------------------------------------------- from __future__ import print_function import getopt import json import logging import multiprocessing import os import shutil import signal import subprocess import sys import tempfile import time import math logger = logging.getLogger(__name__) _usage_message = ''' contact: Dinghua Li Usage: megahit [options] {{-1 -2 | --12 | -r }} [-o ] Input options that can be specified for multiple times (supporting plain text and gz/bz2 extensions) -1 comma-separated list of fasta/q paired-end #1 files, paired with files in -2 comma-separated list of fasta/q paired-end #2 files, paired with files in --12 comma-separated list of interleaved fasta/q paired-end files -r/--read comma-separated list of fasta/q single-end files Optional Arguments: Basic assembly options: --min-count minimum multiplicity for filtering (k_min+1)-mers [2] --k-list comma-separated list of kmer size all must be odd, in the range 15-{0}, increment <= 28) [21,29,39,59,79,99,119,141] Another way to set --k-list (overrides --k-list if one of them set): --k-min minimum kmer size (<= {0}), must be odd number [21] --k-max maximum kmer size (<= {0}), must be odd number [141] --k-step increment of kmer size of each iteration (<= 28), must be even number [12] Advanced assembly options: --no-mercy do not add mercy kmers --bubble-level intensity of bubble merging (0-2), 0 to disable [2] --merge-level merge complex bubbles of length <= l*kmer_size and similarity >= s [20,0.95] --prune-level strength of low depth pruning (0-3) [2] --prune-depth remove unitigs with avg kmer depth less than this value [2] --disconnect-ratio disconnect unitigs if its depth is less than this ratio times the total depth of itself and its siblings [0.1] --low-local-ratio remove unitigs if its depth is less than this ratio times the average depth of the neighborhoods [0.2] --max-tip-len remove tips less than this value [2*k] --cleaning-rounds number of rounds for graph cleanning [5] --no-local disable local assembly --kmin-1pass use 1pass mode to build SdBG of k_min Presets parameters: --presets override a group of parameters; possible values: meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141' meta-large: '--k-min 27 --k-max 127 --k-step 10' (large & complex metagenomes, like soil) Hardware options: -m/--memory max memory in byte to be used in SdBG construction (if set between 0-1, fraction of the machine's total memory) [0.9] --mem-flag SdBG builder memory mode. 0: minimum; 1: moderate; others: use all memory specified by '-m/--memory' [1] -t/--num-cpu-threads number of CPU threads [# of logical processors] --no-hw-accel run MEGAHIT without BMI2 and POPCNT hardware instructions Output options: -o/--out-dir output directory [./megahit_out] --out-prefix output prefix (the contig file will be OUT_DIR/OUT_PREFIX.contigs.fa) --min-contig-len minimum length of contigs to output [200] --keep-tmp-files keep all temporary files --tmp-dir set temp directory Other Arguments: --continue continue a MEGAHIT run from its last available check point. please set the output directory correctly when using this option. 
--test run MEGAHIT on a toy test dataset -h/--help print the usage message -v/--version print version ''' def check_output(cmd): p = subprocess.Popen(cmd, stdout=subprocess.PIPE) out, _ = p.communicate() out = out.rstrip().decode('utf-8') assert p.wait() == 0 return out def abspath(path): return os.path.abspath(os.path.expanduser(path)) def mkdir_if_not_exists(path): if not os.path.exists(path): os.mkdir(path) def remove_if_exists(file_name): if os.path.exists(file_name): os.remove(file_name) class Usage(Exception): def __init__(self, msg): self.msg = msg class EarlyTerminate(Exception): def __init__(self, kmer_size): self.kmer_size = kmer_size class SoftwareInfo(object): script_path = os.path.dirname(os.path.realpath(__file__)) megahit_core = os.path.join(script_path, 'megahit_core') megahit_core_popcnt = os.path.join(script_path, 'megahit_core_popcnt') megahit_core_no_hw_accel = os.path.join(script_path, 'megahit_core_no_hw_accel') @property def megahit_version(self): return 'MEGAHIT %s' % check_output([self.megahit_core, 'dumpversion']) @property def max_k_allowed(self): return int(check_output([self.megahit_core, 'kmax'])) @property def usage_message(self): return _usage_message.format(self.max_k_allowed) class Options: def __init__(self): self.out_dir = '' self.temp_dir = '' self.test_mode = False self.continue_mode = False self.force_overwrite = False self.memory = 0.9 self.min_contig_len = 200 self.k_min = 21 self.k_max = 141 self.k_step = 10 self.k_list = [21, 29, 39, 59, 79, 99, 119, 141] self.auto_k = True self.set_list_by_min_max_step = False self.min_count = 2 self.has_popcnt = True self.hw_accel = True self.max_tip_len = -1 self.no_mercy = False self.no_local = False self.bubble_level = 2 self.merge_len = 20 self.merge_similar = 0.95 self.prune_level = 2 self.prune_depth = 2 self.num_cpu_threads = 0 self.disconnect_ratio = 0.1 self.low_local_ratio = 0.2 self.cleaning_rounds = 5 self.keep_tmp_files = False self.mem_flag = 1 self.out_prefix = '' self.kmin_1pass = False self.pe1 = [] self.pe2 = [] self.pe12 = [] self.se = [] self.presets = '' self.verbose = False @property def log_file_name(self): if self.out_prefix == '': return os.path.join(self.out_dir, 'log') else: return os.path.join(self.out_dir, self.out_prefix + '.log') @property def option_file_name(self): return os.path.join(self.out_dir, 'options.json') @property def contig_dir(self): return os.path.join(self.out_dir, 'intermediate_contigs') @property def read_lib_path(self): return os.path.join(self.temp_dir, 'reads.lib') @property def megahit_core(self): if self.hw_accel: return software_info.megahit_core elif self.has_popcnt: return software_info.megahit_core_popcnt else: return software_info.megahit_core_no_hw_accel @property def host_mem(self): if self.memory <= 0: raise Usage('Please specify a positive number for -m flag.') elif self.memory < 1: try: total_mem = detect_available_mem() return math.floor(total_mem * self.memory) except Exception: raise Usage('Failed to detect available memory size, ' 'please specify memory size in byte via -m option') else: return math.floor(self.memory) def dump(self): with open(self.option_file_name, 'w') as f: json.dump(self.__dict__, f) def load_for_continue(self): with open(self.option_file_name, 'r') as f: self.__dict__.update(json.load(f)) class Checkpoint: def __init__(self): self._current_checkpoint = 0 self._logged_checkpoint = None self._file = None def set_file(self, file): self._file = file def __call__(self, func): def checked_or_call(*args, **kwargs): if 
self._logged_checkpoint is None or self._current_checkpoint > self._logged_checkpoint: func(*args, **kwargs) with open(self._file, 'a') as cpf: print(str(self._current_checkpoint) + '\t' + 'done', file=cpf) else: logger.info('passing check point {0}'.format(self._current_checkpoint)) self._current_checkpoint += 1 return checked_or_call def load_for_continue(self): self._logged_checkpoint = -1 if os.path.exists(self._file): with open(self._file, 'r') as f: for line in f: a = line.strip().split() if len(a) == 2 and a[1] == 'done': self._logged_checkpoint = int(a[0]) software_info = SoftwareInfo() opt = Options() check_point = Checkpoint() def check_bin(): if not os.path.exists(opt.megahit_core): raise Usage('Cannot find megahit_core, please recompile.') def parse_option(argv): try: opts, args = getopt.getopt(argv, 'hm:o:r:t:v1:2:l:f', ['help', 'read=', '12=', 'memory=', 'out-dir=', 'min-contig-len=', 'num-cpu-threads=', 'kmin-1pass', 'k-min=', 'k-max=', 'k-step=', 'k-list=', 'num-cpu-threads=', 'min-count=', 'no-mercy', 'no-local', 'max-tip-len=', 'bubble-level=', 'prune-level=', 'prune-depth=', 'merge-level=', 'disconnect-ratio=', 'low-local-ratio=', 'cleaning-rounds=', 'keep-tmp-files', 'tmp-dir=', 'mem-flag=', 'continue', 'version', 'verbose', 'out-prefix=', 'presets=', 'test', 'no-hw-accel', 'force', # deprecated 'max-read-len=', 'no-low-local', 'cpu-only', 'gpu-mem=', 'use-gpu']) except getopt.error as msg: raise Usage(software_info.megahit_version + '\n' + str(msg)) if len(opts) == 0: raise Usage(software_info.megahit_version + '\n' + software_info.usage_message) for option, value in opts: if option in ('-h', '--help'): print(software_info.megahit_version + '\n' + software_info.usage_message) exit(0) elif option in ('-o', '--out-dir'): opt.out_dir = abspath(value) elif option in ('-m', '--memory'): opt.memory = float(value) elif option == '--min-contig-len': opt.min_contig_len = int(value) elif option in ('-t', '--num-cpu-threads'): opt.num_cpu_threads = int(value) elif option == '--kmin-1pass': opt.kmin_1pass = True elif option == '--k-min': opt.k_min = int(value) opt.set_list_by_min_max_step = True opt.auto_k = False elif option == '--k-max': opt.k_max = int(value) opt.set_list_by_min_max_step = True opt.auto_k = False elif option == '--k-step': opt.k_step = int(value) opt.set_list_by_min_max_step = True opt.auto_k = False elif option == '--k-list': opt.k_list = list(map(int, value.split(','))) opt.k_list.sort() opt.auto_k = False opt.set_list_by_min_max_step = False elif option == '--min-count': opt.min_count = int(value) elif option == '--max-tip-len': opt.max_tip_len = int(value) elif option == '--merge-level': (opt.merge_len, opt.merge_similar) = map(float, value.split(',')) opt.merge_len = int(opt.merge_len) elif option == '--prune-level': opt.prune_level = int(value) elif option == '--prune-depth': opt.prune_depth = float(value) elif option == '--bubble-level': opt.bubble_level = int(value) elif option == '--no-mercy': opt.no_mercy = True elif option == '--no-local': opt.no_local = True elif option == '--disconnect-ratio': opt.disconnect_ratio = float(value) elif option == '--low-local-ratio': opt.low_local_ratio = float(value) elif option == '--cleaning-rounds': opt.cleaning_rounds = int(value) elif option == '--keep-tmp-files': opt.keep_tmp_files = True elif option == '--mem-flag': opt.mem_flag = int(value) elif option in ('-v', '--version'): print(software_info.megahit_version) exit(0) elif option == '--verbose': opt.verbose = True elif option == '--continue': 
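            # --continue relies on -o/--out-dir pointing at the previous run:
            # setup_output_dir() reloads options.json from that directory and
            # replays checkpoints.txt, so every @check_point step already
            # marked 'done' is skipped on the rerun.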
opt.continue_mode = True elif option == '--out-prefix': opt.out_prefix = value elif option == '--tmp-dir': opt.temp_dir = abspath(value) elif option in ('--cpu-only', '-l', '--max-read-len', '--no-low-local', '--use-gpu', '--gpu-mem'): print('option {0} is deprecated!'.format(option), file=sys.stderr) continue # deprecated options, just ignore elif option in ('-r', '--read'): opt.se += [abspath(f) for f in value.split(',')] elif option == '-1': opt.pe1 += [abspath(f) for f in value.split(',')] elif option == '-2': opt.pe2 += [abspath(f) for f in value.split(',')] elif option == '--12': opt.pe12 += [abspath(f) for f in value.split(',')] elif option == '--presets': opt.presets = value elif option in ('-f', '--force'): opt.force_overwrite = True elif option == '--test': opt.test_mode = True elif option == '--no-hw-accel': opt.hw_accel = False opt.has_popcnt = False else: raise Usage('Invalid option {0}'.format(option)) def setup_output_dir(): if not opt.out_dir: if opt.test_mode: opt.out_dir = tempfile.mkdtemp(prefix='megahit_test_') else: opt.out_dir = abspath('./megahit_out') check_point.set_file(os.path.join(opt.out_dir, 'checkpoints.txt')) if opt.continue_mode and not os.path.exists(opt.option_file_name): print('Cannot find {0}, switching to normal mode'.format(opt.option_file_name), file=sys.stderr) opt.continue_mode = False if opt.continue_mode: print('Continue mode activated. Ignore all options except for -o/--out-dir.', file=sys.stderr) opt.load_for_continue() check_point.load_for_continue() else: if not opt.force_overwrite and not opt.test_mode and os.path.exists(opt.out_dir): raise Usage( 'Output directory ' + opt.out_dir + ' already exists, please change the parameter -o to another value to avoid overwriting.') if opt.temp_dir == '': opt.temp_dir = os.path.join(opt.out_dir, 'tmp') else: opt.temp_dir = tempfile.mkdtemp(dir=opt.temp_dir, prefix='megahit_tmp_') mkdir_if_not_exists(opt.out_dir) mkdir_if_not_exists(opt.temp_dir) mkdir_if_not_exists(opt.contig_dir) def setup_logger(): formatter = logging.Formatter('%(asctime)s - %(message)s', '%Y-%m-%d %H:%M:%S') file_handler = logging.FileHandler(opt.log_file_name, 'a') file_handler.setLevel(logging.DEBUG) file_handler.setFormatter(formatter) console = logging.StreamHandler() console.setLevel(logging.INFO) console.setFormatter(formatter) logger.setLevel(logging.DEBUG) logger.addHandler(file_handler) logger.addHandler(console) logger.info(software_info.megahit_version) def check_and_correct_option(): # set mode if opt.memory < 0: raise Usage('-m cannot be less than 0!') if opt.presets != '': opt.auto_k = True if opt.presets == 'meta-sensitive': opt.min_count = 1 opt.k_list = [21, 29, 39, 49, 59, 69, 79, 89, 99, 109, 119, 129, 141] opt.set_list_by_min_max_step = False elif opt.presets == 'meta-large': opt.min_count = 1 opt.k_min = 27 opt.k_max = 127 opt.k_step = 10 opt.set_list_by_min_max_step = True else: raise Usage('Invalid preset: ' + opt.presets) if opt.set_list_by_min_max_step: if opt.k_step % 2 == 1: raise Usage('k-step must be even number!') if opt.k_min > opt.k_max: raise Usage('Error: k_min > k_max!') opt.k_list = list() k = opt.k_min while k < opt.k_max: opt.k_list.append(k) k = k + opt.k_step opt.k_list.append(opt.k_max) if len(opt.k_list) == 0: raise Usage('k list should not be empty!') if opt.k_list[0] < 15 or opt.k_list[-1] > software_info.max_k_allowed: raise Usage('All k\'s should be in range [15, %d]' % software_info.max_k_allowed) for k in opt.k_list: if k % 2 == 0: raise Usage('All k must be odd number!') for i in 
range(1, len(opt.k_list)): if opt.k_list[i] - opt.k_list[i - 1] > 28: raise Usage('--k-step (adjacent k difference) must be <= 28') opt.k_min, opt.k_max = opt.k_list[0], opt.k_list[-1] if opt.k_max < opt.k_min: raise Usage('--k-min should not be larger than --k-max.') if opt.min_count <= 0: raise Usage('--min-count must be greater than 0.') elif opt.min_count == 1: opt.kmin_1pass = True opt.no_mercy = True if opt.prune_level < 0 or opt.prune_level > 3: raise Usage('--prune-level must be in 0-3.') if opt.merge_len < 0: raise Usage('--merge-level: length must be >= 0') if opt.merge_similar < 0 or opt.merge_similar > 1: raise Usage('--merge-level: similarity must be in [0, 1]') if opt.disconnect_ratio < 0 or opt.disconnect_ratio > 0.5: raise Usage('--disconnect-ratio should be in [0, 0.5].') if opt.low_local_ratio <= 0 or opt.low_local_ratio > 0.5: raise Usage('--low-local-ratio should be in (0, 0.5].') if opt.cleaning_rounds <= 0: raise Usage('--cleaning-rounds must be >= 1') if opt.num_cpu_threads > multiprocessing.cpu_count(): logger.warning('Maximum number of available CPU thread is %d.' % multiprocessing.cpu_count()) logger.warning('Number of thread is reset to the %d.' % multiprocessing.cpu_count()) opt.num_cpu_threads = multiprocessing.cpu_count() if opt.num_cpu_threads == 0: opt.num_cpu_threads = multiprocessing.cpu_count() if opt.prune_depth < 0 and opt.prune_level < 3: opt.prune_depth = opt.min_count if opt.bubble_level < 0: logger.warning('Reset bubble level to 0.') opt.bubble_level = 0 if opt.bubble_level > 2: logger.warning('Reset bubble level to 2.') opt.bubble_level = 2 def find_test_data_path(): script_path = software_info.script_path for path in [os.path.join(script_path, '..'), os.path.join(script_path, '../share/megahit')]: test_data_dir = abspath(os.path.join(path, 'test_data')) if os.path.isdir(test_data_dir) and all( f in os.listdir(test_data_dir) for f in ['r1.il.fa.gz', 'r2.il.fa.bz2', 'r3_1.fa', 'r3_2.fa', 'r4.fa']): return test_data_dir raise Usage('Test data not found! 
Script path = {0}'.format(script_path)) def check_reads(): if opt.test_mode: test_data_dir = find_test_data_path() opt.pe12 = [os.path.join(test_data_dir, 'r1.il.fa.gz'), os.path.join(test_data_dir, 'r2.il.fa.bz2')] opt.pe1 = [os.path.join(test_data_dir, 'r3_1.fa')] opt.pe2 = [os.path.join(test_data_dir, 'r3_2.fa')] opt.se = [os.path.join(test_data_dir, 'r4.fa'), os.path.join(test_data_dir, 'loop.fa')] if len(opt.pe1) != len(opt.pe2): raise Usage('Number of paired-end files not match!') for r in opt.pe1 + opt.pe2 + opt.se + opt.pe12: if not os.path.exists(r): raise Usage('Cannot find file ' + r) def detect_available_mem(): try: psize = os.sysconf('SC_PAGE_SIZE') pcount = os.sysconf('SC_PHYS_PAGES') if psize < 0 or pcount < 0: raise SystemError return psize * pcount except ValueError: if sys.platform.find("darwin") != -1: return int(float(os.popen("sysctl hw.memsize").readlines()[0].split()[1])) elif sys.platform.find("linux") != -1: return int(float(os.popen("free").readlines()[1].split()[1]) * 1024) else: raise def cpu_dispatch(): if not opt.hw_accel: logger.info('Using megahit_core without POPCNT and BMI2 support, ' 'because --no-hw-accel option manually specified') else: has_hw_accel = check_output([opt.megahit_core, 'checkcpu']) if has_hw_accel == '1': logger.info('Using megahit_core with POPCNT and BMI2 support') else: opt.hw_accel = False has_popcnt = check_output([opt.megahit_core, 'checkpopcnt']) if has_popcnt == '1': opt.has_popcnt = True logger.info('Using megahit_core with POPCNT support') else: logger.info('Using megahit_core without POPCNT and BMI2 support, ' 'because the features not detected by CPUID ') def graph_prefix(kmer_k): mkdir_if_not_exists(os.path.join(opt.temp_dir, 'k' + str(kmer_k))) return os.path.join(opt.temp_dir, 'k' + str(kmer_k), str(kmer_k)) def contig_prefix(kmer_k): return os.path.join(opt.contig_dir, 'k' + str(kmer_k)) def remove_temp_after_build(kmer_k): for i in range(opt.num_cpu_threads): remove_if_exists(graph_prefix(kmer_k) + '.edges.' + str(i)) for i in range(64): remove_if_exists(graph_prefix(kmer_k) + '.mercy_cand.' + str(i)) for i in range(opt.num_cpu_threads): remove_if_exists(graph_prefix(kmer_k) + '.mercy.' + str(i)) remove_if_exists(graph_prefix(kmer_k) + '.cand') def remove_temp_after_assemble(kmer_k): for extension in ['w', 'last', 'isd', 'dn', 'f', 'mul', 'mul2']: remove_if_exists(graph_prefix(kmer_k) + '.' + extension) for i in range(opt.num_cpu_threads): remove_if_exists(graph_prefix(kmer_k) + '.sdbg.' + str(i)) def inpipe_cmd(file_name): if file_name.endswith('.gz'): return 'gzip -cd ' + file_name elif file_name.endswith('.bz2'): return 'bzip2 -cd ' + file_name else: return '' @check_point def create_library_file(): with open(opt.read_lib_path, 'w') as lib: for i, file_name in enumerate(opt.pe12): print(file_name, file=lib) if inpipe_cmd(file_name) != '': print('interleaved ' + os.path.join(opt.temp_dir, 'inpipe.pe12.' + str(i)), file=lib) else: print('interleaved ' + file_name, file=lib) for i, (file_name1, file_name2) in enumerate(zip(opt.pe1, opt.pe2)): print(','.join([file_name1, file_name2]), file=lib) if inpipe_cmd(file_name1) != '': f1 = os.path.join(opt.temp_dir, 'inpipe.pe1.' + str(i)) else: f1 = file_name1 if inpipe_cmd(file_name2) != '': f2 = os.path.join(opt.temp_dir, 'inpipe.pe2.' + str(i)) else: f2 = file_name2 print('pe ' + f1 + ' ' + f2, file=lib) for i, file_name in enumerate(opt.se): print(file_name, file=lib) if inpipe_cmd(file_name) != '': print('se ' + os.path.join(opt.temp_dir, 'inpipe.se.' 
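            # Each library entry written to reads.lib is two lines: the
            # original path(s), then a parsed line of the form
            #   interleaved <file> | pe <file1> <file2> | se <file>
            # Compressed inputs are substituted with named pipes
            # (inpipe.<type>.<i>) that build_library() later feeds through
            # 'gzip -cd' or 'bzip2 -cd'.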

@check_point
def build_library():
    cmd = [opt.megahit_core, 'buildlib', opt.read_lib_path, opt.read_lib_path]
    fifos = list()
    pipes = list()

    def create_fifo(read_type, num, command):
        fifo_path = os.path.join(opt.temp_dir, 'inpipe.{0}.{1}'.format(read_type, num))
        remove_if_exists(fifo_path)
        os.mkfifo(fifo_path)
        fifos.append(fifo_path)
        p = subprocess.Popen('{0} > {1}'.format(command, fifo_path), shell=True, preexec_fn=os.setsid)
        pipes.append(p)

    try:
        # create inpipe
        for i in range(len(opt.pe12)):
            if inpipe_cmd(opt.pe12[i]) != '':
                create_fifo('pe12', i, inpipe_cmd(opt.pe12[i]))
        for i in range(len(opt.pe1)):
            if inpipe_cmd(opt.pe1[i]) != '':
                create_fifo('pe1', i, inpipe_cmd(opt.pe1[i]))
            if inpipe_cmd(opt.pe2[i]) != '':
                create_fifo('pe2', i, inpipe_cmd(opt.pe2[i]))
        for i in range(len(opt.se)):
            if inpipe_cmd(opt.se[i]) != '':
                create_fifo('se', i, inpipe_cmd(opt.se[i]))

        run_sub_command(cmd, 'Convert reads to binary library', verbose=True)

        for p in pipes:
            pipe_ret = p.wait()
            if pipe_ret != 0:
                logger.error('Error occurred when reading inputs')
                exit(pipe_ret)
    finally:
        for p in pipes:
            if p.poll() is None:
                os.killpg(p.pid, signal.SIGTERM)
        for f in fifos:
            remove_if_exists(f)

def get_max_read_len():
    ret = 0
    with open(opt.read_lib_path + '.lib_info') as read_info:
        for info in read_info.readlines()[2::2]:
            ret = max(ret, int(info.split()[2]))
    return ret


def set_max_k_by_lib():
    if not opt.auto_k or len(opt.k_list) == 1:
        return False
    max_read_len = get_max_read_len()
    new_k_list = [k for k in opt.k_list if k < max_read_len + 20]
    if not new_k_list:
        return False
    else:
        opt.k_list = new_k_list
        opt.k_min = opt.k_list[0]
        opt.k_max = opt.k_list[-1]
        return True


@check_point
def build_first_graph_1pass(option):
    cmd = [opt.megahit_core, 'read2sdbg'] + option
    if not opt.no_mercy:
        cmd.append('--need_mercy')
    run_sub_command(cmd, 'Extracting solid (k+1)-mers and building sdbg for k = %d' % opt.k_min)
    if not opt.keep_tmp_files:
        remove_temp_after_build(opt.k_min)


@check_point
def count_mink(option):
    cmd = [opt.megahit_core, 'count'] + option
    run_sub_command(cmd, 'Extract solid (k+1)-mers for k = %d' % opt.k_min)


def build_first_graph():
    common_option = ['-k', str(opt.k_min),
                     '-m', str(opt.min_count),
                     '--host_mem', str(opt.host_mem),
                     '--mem_flag', str(opt.mem_flag),
                     '--output_prefix', graph_prefix(opt.k_min),
                     '--num_cpu_threads', str(opt.num_cpu_threads),
                     '--read_lib_file', opt.read_lib_path]
    if not opt.kmin_1pass:
        count_mink(common_option)
        build_graph(opt.k_min, 0)
    else:
        build_first_graph_1pass(common_option)


@check_point
def build_graph(kmer_k, kmer_from):
    build_comm_opt = ['--host_mem', str(opt.host_mem),
                      '--mem_flag', str(opt.mem_flag),
                      '--output_prefix', graph_prefix(kmer_k),
                      '--num_cpu_threads', str(opt.num_cpu_threads),
                      '-k', str(kmer_k),
                      '--kmer_from', str(kmer_from)]
    build_cmd = [opt.megahit_core, 'seq2sdbg'] + build_comm_opt

    file_size = 0
    if os.path.exists(graph_prefix(kmer_k) + '.edges.0'):
        build_cmd += ['--input_prefix', graph_prefix(kmer_k)]
        file_size += os.path.getsize(graph_prefix(kmer_k) + '.edges.0')
        tid = 1
        while os.path.exists(graph_prefix(kmer_k) + '.edges.' + str(tid)):
            file_size += os.path.getsize(graph_prefix(kmer_k) + '.edges.' + str(tid))
            tid += 1
    if os.path.exists(contig_prefix(kmer_from) + '.addi.fa'):
        build_cmd += ['--addi_contig', contig_prefix(kmer_from) + '.addi.fa']
        file_size += os.path.getsize(contig_prefix(kmer_from) + '.addi.fa')
    if os.path.exists(contig_prefix(kmer_from) + '.local.fa'):
        build_cmd += ['--local_contig', contig_prefix(kmer_from) + '.local.fa']
        file_size += os.path.getsize(contig_prefix(kmer_from) + '.local.fa')
    if os.path.exists(contig_prefix(kmer_from) + '.contigs.fa'):
        build_cmd += ['--contig', contig_prefix(kmer_from) + '.contigs.fa']
        build_cmd += ['--bubble', contig_prefix(kmer_from) + '.bubble_seq.fa']

    # no edges or contigs to feed into this round: stop at the previous k
    if file_size == 0 and kmer_from != 0:
        raise EarlyTerminate(kmer_from)

    if not opt.no_mercy and kmer_k == opt.k_min:
        build_cmd.append('--need_mercy')

    run_sub_command(build_cmd, 'Build graph for k = %d' % kmer_k)
    if not opt.keep_tmp_files:
        remove_temp_after_build(kmer_k)


@check_point
def iterate(cur_k, step):
    next_k = cur_k + step
    iterate_cmd = [opt.megahit_core, 'iterate',
                   '-c', contig_prefix(cur_k) + '.contigs.fa',
                   '-b', contig_prefix(cur_k) + '.bubble_seq.fa',
                   '-t', str(opt.num_cpu_threads),
                   '-k', str(cur_k),
                   '-s', str(step),
                   '-o', graph_prefix(next_k),
                   '-r', opt.read_lib_path + '.bin']
    run_sub_command(iterate_cmd, 'Extract iterative edges from k = %d to %d' % (cur_k, next_k))


@check_point
def assemble(cur_k):
    min_standalone = max(min(opt.k_max * 3 - 1, int(opt.min_contig_len * 1.5)), opt.min_contig_len)
    if opt.max_tip_len >= 0:
        min_standalone = max(opt.max_tip_len + opt.k_max - 1, opt.min_contig_len)

    assembly_cmd = [opt.megahit_core, 'assemble',
                    '-s', graph_prefix(cur_k),
                    '-o', contig_prefix(cur_k),
                    '-t', str(opt.num_cpu_threads),
                    '--min_standalone', str(min_standalone),
                    '--prune_level', str(opt.prune_level),
                    '--merge_len', str(int(opt.merge_len)),
                    '--merge_similar', str(opt.merge_similar),
                    '--cleaning_rounds', str(opt.cleaning_rounds),
                    '--disconnect_ratio', str(opt.disconnect_ratio),
                    '--low_local_ratio', str(opt.low_local_ratio),
                    '--min_depth', str(opt.prune_depth),
                    '--bubble_level', str(opt.bubble_level)]

    if opt.max_tip_len == -1 and cur_k * 3 - 1 > opt.min_contig_len * 1.5:
        assembly_cmd += ['--max_tip_len', str(max(1, opt.min_contig_len * 1.5 + 1 - cur_k))]
    else:
        assembly_cmd += ['--max_tip_len', str(opt.max_tip_len)]

    if cur_k < opt.k_max:
        assembly_cmd.append('--careful_bubble')
    if cur_k == opt.k_max:
        assembly_cmd.append('--is_final_round')
    if opt.no_local:
        assembly_cmd.append('--output_standalone')

    run_sub_command(assembly_cmd, 'Assemble contigs from SdBG for k = %d' % cur_k)
    if (not opt.keep_tmp_files) and (cur_k != opt.k_max):
        remove_temp_after_assemble(cur_k)


@check_point
def local_assemble(cur_k, kmer_to):
    la_cmd = [opt.megahit_core, 'local',
              '-c', contig_prefix(cur_k) + '.contigs.fa',
              '-l', opt.read_lib_path,
              '-t', str(opt.num_cpu_threads),
              '-o', contig_prefix(cur_k) + '.local.fa',
              '--kmax', str(kmer_to)]
    run_sub_command(la_cmd, 'Local assembly for k = %d' % cur_k)
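
# Illustration only: one round of the multi-k loop that main() below drives,
# assuming the current k is 21 and the next k in the list is 29:
#
#   local_assemble(21, 29)  # locally re-assemble reads mapped to k=21 contigs
#   iterate(21, 8)          # extract (29+1)-mers from reads and k=21 contigs
#   build_graph(29, 21)     # build the succinct de Bruijn graph for k=29
#   assemble(29)            # clean the k=29 graph and emit contigs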

@check_point
def merge_final(final_k):
    logger.info('Merging to output final contigs')
    final_contig_name = os.path.join(opt.out_dir, 'final.contigs.fa')
    if opt.out_prefix != '':
        final_contig_name = os.path.join(opt.out_dir, opt.out_prefix + '.contigs.fa')
    with open(final_contig_name, 'w') as final_contigs:
        merge_cmd = 'cat ' + opt.contig_dir + '/*.final.contigs.fa ' + \
                    contig_prefix(final_k) + '.contigs.fa | ' + \
                    opt.megahit_core + ' filterbylen ' + str(opt.min_contig_len)
        p = subprocess.Popen(merge_cmd, shell=True, stdout=final_contigs, stderr=subprocess.PIPE)
        _, log = p.communicate()
        logger.info(log.rstrip().decode('utf-8'))
        ret_code = p.wait()
        if ret_code != 0:
            logger.error('Error occurred when merging final contigs, please refer to %s for details' % opt.log_file_name)
            logger.error('Exit code %d' % ret_code)
            exit(ret_code)


def run_sub_command(cmd, msg, verbose=False):
    if opt.verbose:
        verbose = True
    logger.info(msg)
    logger.debug('command %s' % ' '.join(cmd))
    p = subprocess.Popen(cmd, stderr=subprocess.PIPE)
    try:
        while True:
            # stderr is read as bytes on Python 3; decode before logging
            line = p.stderr.readline().decode('utf-8', errors='replace').rstrip()
            if not line:
                break
            if verbose:
                logger.info(line)
            else:
                logger.debug(line)
        ret_code = p.wait()
        if ret_code != 0:
            logger.error('Error occurred, please refer to %s for details' % opt.log_file_name)
            logger.error('Command: %s; Exit code %d' % (' '.join(cmd), ret_code))
            exit(ret_code)
    except KeyboardInterrupt:
        p.terminate()
        p.wait()
        exit(signal.SIGINT)


def main(argv=None):
    if argv is None:
        argv = sys.argv

    try:
        start_time = time.time()

        check_bin()
        parse_option(argv[1:])
        setup_output_dir()
        setup_logger()
        check_and_correct_option()
        check_reads()
        cpu_dispatch()
        opt.dump()

        create_library_file()
        build_library()
        if set_max_k_by_lib():
            logger.info('k-max reset to: %d' % opt.k_max)

        logger.info('Start assembly. Number of CPU threads %d' % opt.num_cpu_threads)
        logger.info('k list: %s' % ','.join(map(str, opt.k_list)))
        logger.info('Memory used: %d' % opt.host_mem)

        build_first_graph()
        assemble(opt.k_min)

        cur_k = opt.k_min
        next_k_idx = 0

        try:
            while cur_k < opt.k_max:
                next_k_idx += 1
                next_k = opt.k_list[next_k_idx]
                k_step = next_k - cur_k

                if not opt.no_local:
                    local_assemble(cur_k, next_k)
                iterate(cur_k, k_step)
                build_graph(next_k, cur_k)
                assemble(next_k)

                cur_k = next_k
            merge_final(opt.k_max)
        except EarlyTerminate as et:
            merge_final(et.kmer_size)

        if not opt.keep_tmp_files and os.path.exists(opt.temp_dir):
            shutil.rmtree(opt.temp_dir)

        open(os.path.join(opt.out_dir, 'done'), 'w').close()

        if not opt.keep_tmp_files and opt.test_mode:
            shutil.rmtree(opt.out_dir)

        logger.info('ALL DONE. 
Time elapsed: %f seconds ' % (time.time() - start_time)) except Usage as usg: print(sys.argv[0].split('/')[-1] + ': ' + str(usg.msg), file=sys.stderr) exit(1) if __name__ == '__main__': main() megahit-1.2.9/src/parallel_hashmap/000077500000000000000000000000001355123202700172255ustar00rootroot00000000000000megahit-1.2.9/src/parallel_hashmap/meminfo.h000066400000000000000000000124341355123202700210340ustar00rootroot00000000000000#if !defined(spp_memory_h_guard) #define spp_memory_h_guard #include #include #include #if defined(_WIN32) || defined( __CYGWIN__) #define SPP_WIN #endif #ifdef SPP_WIN #include #include #undef min #undef max #elif defined(__linux__) #include #include #elif defined(__FreeBSD__) #include #include #include #include #include #include #endif namespace spp { uint64_t GetSystemMemory(); uint64_t GetTotalMemoryUsed(); uint64_t GetProcessMemoryUsed(); uint64_t GetPhysicalMemory(); uint64_t GetSystemMemory() { #ifdef SPP_WIN MEMORYSTATUSEX memInfo; memInfo.dwLength = sizeof(MEMORYSTATUSEX); GlobalMemoryStatusEx(&memInfo); return static_cast(memInfo.ullTotalPageFile); #elif defined(__linux__) struct sysinfo memInfo; sysinfo (&memInfo); auto totalVirtualMem = memInfo.totalram; totalVirtualMem += memInfo.totalswap; totalVirtualMem *= memInfo.mem_unit; return static_cast(totalVirtualMem); #elif defined(__FreeBSD__) kvm_t *kd; u_int pageCnt; size_t pageCntLen = sizeof(pageCnt); u_int pageSize; struct kvm_swap kswap; uint64_t totalVirtualMem; pageSize = static_cast(getpagesize()); sysctlbyname("vm.stats.vm.v_page_count", &pageCnt, &pageCntLen, NULL, 0); totalVirtualMem = pageCnt * pageSize; kd = kvm_open(NULL, _PATH_DEVNULL, NULL, O_RDONLY, "kvm_open"); kvm_getswapinfo(kd, &kswap, 1, 0); kvm_close(kd); totalVirtualMem += kswap.ksw_total * pageSize; return totalVirtualMem; #else return 0; #endif } uint64_t GetTotalMemoryUsed() { #ifdef SPP_WIN MEMORYSTATUSEX memInfo; memInfo.dwLength = sizeof(MEMORYSTATUSEX); GlobalMemoryStatusEx(&memInfo); return static_cast(memInfo.ullTotalPageFile - memInfo.ullAvailPageFile); #elif defined(__linux__) struct sysinfo memInfo; sysinfo(&memInfo); auto virtualMemUsed = memInfo.totalram - memInfo.freeram; virtualMemUsed += memInfo.totalswap - memInfo.freeswap; virtualMemUsed *= memInfo.mem_unit; return static_cast(virtualMemUsed); #elif defined(__FreeBSD__) kvm_t *kd; u_int pageSize; u_int pageCnt, freeCnt; size_t pageCntLen = sizeof(pageCnt); size_t freeCntLen = sizeof(freeCnt); struct kvm_swap kswap; uint64_t virtualMemUsed; pageSize = static_cast(getpagesize()); sysctlbyname("vm.stats.vm.v_page_count", &pageCnt, &pageCntLen, NULL, 0); sysctlbyname("vm.stats.vm.v_free_count", &freeCnt, &freeCntLen, NULL, 0); virtualMemUsed = (pageCnt - freeCnt) * pageSize; kd = kvm_open(NULL, _PATH_DEVNULL, NULL, O_RDONLY, "kvm_open"); kvm_getswapinfo(kd, &kswap, 1, 0); kvm_close(kd); virtualMemUsed += kswap.ksw_used * pageSize; return virtualMemUsed; #else return 0; #endif } uint64_t GetProcessMemoryUsed() { #ifdef SPP_WIN PROCESS_MEMORY_COUNTERS_EX pmc; GetProcessMemoryInfo(GetCurrentProcess(), reinterpret_cast(&pmc), sizeof(pmc)); return static_cast(pmc.PrivateUsage); #elif defined(__linux__) auto parseLine = [](char* line)->int { auto i = strlen(line); while(*line < '0' || *line > '9') { line++; } line[i-3] = '\0'; i = atoi(line); return i; }; auto file = fopen("/proc/self/status", "r"); auto result = -1; char line[128]; while(fgets(line, 128, file) != nullptr) { if(strncmp(line, "VmSize:", 7) == 0) { result = parseLine(line); break; } } fclose(file); return 
static_cast(result) * 1024; #elif defined(__FreeBSD__) struct kinfo_proc info; size_t infoLen = sizeof(info); int mib[] = { CTL_KERN, KERN_PROC, KERN_PROC_PID, getpid() }; sysctl(mib, sizeof(mib) / sizeof(*mib), &info, &infoLen, NULL, 0); return static_cast(info.ki_rssize * getpagesize()); #else return 0; #endif } uint64_t GetPhysicalMemory() { #ifdef SPP_WIN MEMORYSTATUSEX memInfo; memInfo.dwLength = sizeof(MEMORYSTATUSEX); GlobalMemoryStatusEx(&memInfo); return static_cast(memInfo.ullTotalPhys); #elif defined(__linux__) struct sysinfo memInfo; sysinfo(&memInfo); auto totalPhysMem = memInfo.totalram; totalPhysMem *= memInfo.mem_unit; return static_cast(totalPhysMem); #elif defined(__FreeBSD__) u_long physMem; size_t physMemLen = sizeof(physMem); int mib[] = { CTL_HW, HW_PHYSMEM }; sysctl(mib, sizeof(mib) / sizeof(*mib), &physMem, &physMemLen, NULL, 0); return physMem; #else return 0; #endif } } #endif // spp_memory_h_guard megahit-1.2.9/src/parallel_hashmap/phmap.h000066400000000000000000005255351355123202700205220ustar00rootroot00000000000000#if !defined(phmap_h_guard_) #define phmap_h_guard_ // --------------------------------------------------------------------------- // Copyright (c) 2019, Gregory Popovitch - greg7mdp@gmail.com // // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. // You may obtain a copy of the License at // // https://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. // // Includes work from abseil-cpp (https://github.com/abseil/abseil-cpp) // with modifications. // // Copyright 2018 The Abseil Authors. // // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. // You may obtain a copy of the License at // // https://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. // --------------------------------------------------------------------------- #include #include #include #include #include #include #include #include #include #include #include #include #include "phmap_utils.h" #include "phmap_base.h" #include "phmap_fwd_decl.h" #if PHMAP_HAVE_STD_STRING_VIEW #include #endif namespace phmap { namespace container_internal { // -------------------------------------------------------------------------- template class probe_seq { public: probe_seq(size_t hash, size_t mask) { assert(((mask + 1) & mask) == 0 && "not a mask"); mask_ = mask; offset_ = hash & mask_; } size_t offset() const { return offset_; } size_t offset(size_t i) const { return (offset_ + i) & mask_; } void next() { index_ += Width; offset_ += index_; offset_ &= mask_; } // 0-based probe index. The i-th probe in the probe sequence. 
size_t index() const { return index_; } private: size_t mask_; size_t offset_; size_t index_ = 0; }; // -------------------------------------------------------------------------- template struct RequireUsableKey { template std::pair< decltype(std::declval()(std::declval())), decltype(std::declval()(std::declval(), std::declval()))>* operator()(const PassedKey&, const Args&...) const; }; // -------------------------------------------------------------------------- template struct IsDecomposable : std::false_type {}; template struct IsDecomposable< phmap::void_t(), std::declval()...))>, Policy, Hash, Eq, Ts...> : std::true_type {}; // TODO(alkis): Switch to std::is_nothrow_swappable when gcc/clang supports it. // -------------------------------------------------------------------------- template constexpr bool IsNoThrowSwappable() { using std::swap; return noexcept(swap(std::declval(), std::declval())); } // -------------------------------------------------------------------------- template int TrailingZeros(T x) { return sizeof(T) == 8 ? base_internal::CountTrailingZerosNonZero64( static_cast(x)) : base_internal::CountTrailingZerosNonZero32( static_cast(x)); } // -------------------------------------------------------------------------- template int LeadingZeros(T x) { return sizeof(T) == 8 ? base_internal::CountLeadingZeros64(static_cast(x)) : base_internal::CountLeadingZeros32(static_cast(x)); } // -------------------------------------------------------------------------- // An abstraction over a bitmask. It provides an easy way to iterate through the // indexes of the set bits of a bitmask. When Shift=0 (platforms with SSE), // this is a true bitmask. On non-SSE, platforms the arithematic used to // emulate the SSE behavior works in bytes (Shift=3) and leaves each bytes as // either 0x00 or 0x80. // // For example: // for (int i : BitMask(0x5)) -> yields 0, 2 // for (int i : BitMask(0x0000000080800000)) -> yields 2, 3 // -------------------------------------------------------------------------- template class BitMask { static_assert(std::is_unsigned::value, ""); static_assert(Shift == 0 || Shift == 3, ""); public: // These are useful for unit tests (gunit). 
using value_type = int; using iterator = BitMask; using const_iterator = BitMask; explicit BitMask(T mask) : mask_(mask) {} BitMask& operator++() { mask_ &= (mask_ - 1); return *this; } explicit operator bool() const { return mask_ != 0; } int operator*() const { return LowestBitSet(); } int LowestBitSet() const { return container_internal::TrailingZeros(mask_) >> Shift; } int HighestBitSet() const { return (sizeof(T) * CHAR_BIT - container_internal::LeadingZeros(mask_) - 1) >> Shift; } BitMask begin() const { return *this; } BitMask end() const { return BitMask(0); } int TrailingZeros() const { return container_internal::TrailingZeros(mask_) >> Shift; } int LeadingZeros() const { constexpr int total_significant_bits = SignificantBits << Shift; constexpr int extra_bits = sizeof(T) * 8 - total_significant_bits; return container_internal::LeadingZeros(mask_ << extra_bits) >> Shift; } private: friend bool operator==(const BitMask& a, const BitMask& b) { return a.mask_ == b.mask_; } friend bool operator!=(const BitMask& a, const BitMask& b) { return a.mask_ != b.mask_; } T mask_; }; // -------------------------------------------------------------------------- using ctrl_t = signed char; using h2_t = uint8_t; // -------------------------------------------------------------------------- // The values here are selected for maximum performance. See the static asserts // below for details. // -------------------------------------------------------------------------- enum Ctrl : ctrl_t { kEmpty = -128, // 0b10000000 kDeleted = -2, // 0b11111110 kSentinel = -1, // 0b11111111 }; static_assert( kEmpty & kDeleted & kSentinel & 0x80, "Special markers need to have the MSB to make checking for them efficient"); static_assert(kEmpty < kSentinel && kDeleted < kSentinel, "kEmpty and kDeleted must be smaller than kSentinel to make the " "SIMD test of IsEmptyOrDeleted() efficient"); static_assert(kSentinel == -1, "kSentinel must be -1 to elide loading it from memory into SIMD " "registers (pcmpeqd xmm, xmm)"); static_assert(kEmpty == -128, "kEmpty must be -128 to make the SIMD check for its " "existence efficient (psignb xmm, xmm)"); static_assert(~kEmpty & ~kDeleted & kSentinel & 0x7F, "kEmpty and kDeleted must share an unset bit that is not shared " "by kSentinel to make the scalar test for MatchEmptyOrDeleted() " "efficient"); static_assert(kDeleted == -2, "kDeleted must be -2 to make the implementation of " "ConvertSpecialToEmptyAndFullToDeleted efficient"); // -------------------------------------------------------------------------- // A single block of empty control bytes for tables without any slots allocated. // This enables removing a branch in the hot path of find(). // -------------------------------------------------------------------------- inline ctrl_t* EmptyGroup() { alignas(16) static constexpr ctrl_t empty_group[] = { kSentinel, kEmpty, kEmpty, kEmpty, kEmpty, kEmpty, kEmpty, kEmpty, kEmpty, kEmpty, kEmpty, kEmpty, kEmpty, kEmpty, kEmpty, kEmpty}; return const_cast(empty_group); } // -------------------------------------------------------------------------- inline size_t HashSeed(const ctrl_t* ctrl) { // The low bits of the pointer have little or no entropy because of // alignment. We shift the pointer to try to use higher entropy bits. A // good number seems to be 12 bits, because that aligns with page size. 
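// For example (hypothetical address, 4 KiB pages): a ctrl block at
// 0x7f2a96c43000 yields the seed 0x7f2a96c43; the discarded low 12 bits are
// page-offset bits that most allocations share, so they add no entropy.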
return reinterpret_cast(ctrl) >> 12; } #ifdef PHMAP_NON_DETERMINISTIC inline size_t H1(size_t hash, const ctrl_t* ctrl) { // use ctrl_ pointer to add entropy to ensure // non-deterministic iteration order. return (hash >> 7) ^ HashSeed(ctrl); } #else inline size_t H1(size_t hash, const ctrl_t* ) { return (hash >> 7); } #endif inline ctrl_t H2(size_t hash) { return hash & 0x7F; } inline bool IsEmpty(ctrl_t c) { return c == kEmpty; } inline bool IsFull(ctrl_t c) { return c >= 0; } inline bool IsDeleted(ctrl_t c) { return c == kDeleted; } inline bool IsEmptyOrDeleted(ctrl_t c) { return c < kSentinel; } #if PHMAP_HAVE_SSE2 // -------------------------------------------------------------------------- // https://github.com/abseil/abseil-cpp/issues/209 // https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87853 // _mm_cmpgt_epi8 is broken under GCC with -funsigned-char // Work around this by using the portable implementation of Group // when using -funsigned-char under GCC. // -------------------------------------------------------------------------- inline __m128i _mm_cmpgt_epi8_fixed(__m128i a, __m128i b) { #if defined(__GNUC__) && !defined(__clang__) #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Woverflow" if (std::is_unsigned::value) { const __m128i mask = _mm_set1_epi8(0x80); const __m128i diff = _mm_subs_epi8(b, a); return _mm_cmpeq_epi8(_mm_and_si128(diff, mask), mask); } #pragma GCC diagnostic pop #endif return _mm_cmpgt_epi8(a, b); } // -------------------------------------------------------------------------- // -------------------------------------------------------------------------- struct GroupSse2Impl { enum { kWidth = 16 }; // the number of slots per group explicit GroupSse2Impl(const ctrl_t* pos) { ctrl = _mm_loadu_si128(reinterpret_cast(pos)); } // Returns a bitmask representing the positions of slots that match hash. // ---------------------------------------------------------------------- BitMask Match(h2_t hash) const { auto match = _mm_set1_epi8(hash); return BitMask( _mm_movemask_epi8(_mm_cmpeq_epi8(match, ctrl))); } // Returns a bitmask representing the positions of empty slots. // ------------------------------------------------------------ BitMask MatchEmpty() const { #if PHMAP_HAVE_SSSE3 // This only works because kEmpty is -128. return BitMask( _mm_movemask_epi8(_mm_sign_epi8(ctrl, ctrl))); #else return Match(kEmpty); #endif } // Returns a bitmask representing the positions of empty or deleted slots. // ----------------------------------------------------------------------- BitMask MatchEmptyOrDeleted() const { auto special = _mm_set1_epi8(kSentinel); return BitMask( _mm_movemask_epi8(_mm_cmpgt_epi8_fixed(special, ctrl))); } // Returns the number of trailing empty or deleted elements in the group. 
// ---------------------------------------------------------------------- uint32_t CountLeadingEmptyOrDeleted() const { auto special = _mm_set1_epi8(kSentinel); return TrailingZeros( _mm_movemask_epi8(_mm_cmpgt_epi8_fixed(special, ctrl)) + 1); } // ---------------------------------------------------------------------- void ConvertSpecialToEmptyAndFullToDeleted(ctrl_t* dst) const { auto msbs = _mm_set1_epi8(static_cast(-128)); auto x126 = _mm_set1_epi8(126); #if PHMAP_HAVE_SSSE3 auto res = _mm_or_si128(_mm_shuffle_epi8(x126, ctrl), msbs); #else auto zero = _mm_setzero_si128(); auto special_mask = _mm_cmpgt_epi8_fixed(zero, ctrl); auto res = _mm_or_si128(msbs, _mm_andnot_si128(special_mask, x126)); #endif _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), res); } __m128i ctrl; }; #endif // PHMAP_HAVE_SSE2 // -------------------------------------------------------------------------- // -------------------------------------------------------------------------- struct GroupPortableImpl { enum { kWidth = 8 }; explicit GroupPortableImpl(const ctrl_t* pos) : ctrl(little_endian::Load64(pos)) {} BitMask Match(h2_t hash) const { // For the technique, see: // http://graphics.stanford.edu/~seander/bithacks.html##ValueInWord // (Determine if a word has a byte equal to n). // // Caveat: there are false positives but: // - they only occur if there is a real match // - they never occur on kEmpty, kDeleted, kSentinel // - they will be handled gracefully by subsequent checks in code // // Example: // v = 0x1716151413121110 // hash = 0x12 // retval = (v - lsbs) & ~v & msbs = 0x0000000080800000 constexpr uint64_t msbs = 0x8080808080808080ULL; constexpr uint64_t lsbs = 0x0101010101010101ULL; auto x = ctrl ^ (lsbs * hash); return BitMask((x - lsbs) & ~x & msbs); } BitMask MatchEmpty() const { constexpr uint64_t msbs = 0x8080808080808080ULL; return BitMask((ctrl & (~ctrl << 6)) & msbs); } BitMask MatchEmptyOrDeleted() const { constexpr uint64_t msbs = 0x8080808080808080ULL; return BitMask((ctrl & (~ctrl << 7)) & msbs); } uint32_t CountLeadingEmptyOrDeleted() const { constexpr uint64_t gaps = 0x00FEFEFEFEFEFEFEULL; return (TrailingZeros(((~ctrl & (ctrl >> 7)) | gaps) + 1) + 7) >> 3; } void ConvertSpecialToEmptyAndFullToDeleted(ctrl_t* dst) const { constexpr uint64_t msbs = 0x8080808080808080ULL; constexpr uint64_t lsbs = 0x0101010101010101ULL; auto x = ctrl & msbs; auto res = (~x + (x >> 7)) & ~lsbs; little_endian::Store64(dst, res); } uint64_t ctrl; }; #if PHMAP_HAVE_SSE2 using Group = GroupSse2Impl; #else using Group = GroupPortableImpl; #endif template class raw_hash_set; inline bool IsValidCapacity(size_t n) { return ((n + 1) & n) == 0 && n > 0; } // -------------------------------------------------------------------------- // PRECONDITION: // IsValidCapacity(capacity) // ctrl[capacity] == kSentinel // ctrl[i] != kSentinel for all i < capacity // Applies mapping for every byte in ctrl: // DELETED -> EMPTY // EMPTY -> EMPTY // FULL -> DELETED // -------------------------------------------------------------------------- inline void ConvertDeletedToEmptyAndFullToDeleted( ctrl_t* ctrl, size_t capacity) { assert(ctrl[capacity] == kSentinel); assert(IsValidCapacity(capacity)); for (ctrl_t* pos = ctrl; pos != ctrl + capacity + 1; pos += Group::kWidth) { Group{pos}.ConvertSpecialToEmptyAndFullToDeleted(pos); } // Copy the cloned ctrl bytes. 
std::memcpy(ctrl + capacity + 1, ctrl, Group::kWidth); ctrl[capacity] = kSentinel; } // -------------------------------------------------------------------------- // Rounds up the capacity to the next power of 2 minus 1, with a minimum of 1. // -------------------------------------------------------------------------- inline size_t NormalizeCapacity(size_t n) { return n ? ~size_t{} >> LeadingZeros(n) : 1; } // -------------------------------------------------------------------------- // We use 7/8th as maximum load factor. // For 16-wide groups, that gives an average of two empty slots per group. // -------------------------------------------------------------------------- inline size_t CapacityToGrowth(size_t capacity) { assert(IsValidCapacity(capacity)); // `capacity*7/8` if (Group::kWidth == 8 && capacity == 7) { // x-x/8 does not work when x==7. return 6; } return capacity - capacity / 8; } // -------------------------------------------------------------------------- // From desired "growth" to a lowerbound of the necessary capacity. // Might not be a valid one and required NormalizeCapacity(). // -------------------------------------------------------------------------- inline size_t GrowthToLowerboundCapacity(size_t growth) { // `growth*8/7` if (Group::kWidth == 8 && growth == 7) { // x+(x-1)/7 does not work when x==7. return 8; } return growth + static_cast((static_cast(growth) - 1) / 7); } namespace hashtable_debug_internal { // If it is a map, call get<0>(). using std::get; template auto GetKey(const typename T::value_type& pair, int) -> decltype(get<0>(pair)) { return get<0>(pair); } // If it is not a map, return the value directly. template const typename T::key_type& GetKey(const typename T::key_type& key, char) { return key; } // -------------------------------------------------------------------------- // Containers should specialize this to provide debug information for that // container. // -------------------------------------------------------------------------- template struct HashtableDebugAccess { // Returns the number of probes required to find `key` in `c`. The "number of // probes" is a concept that can vary by container. Implementations should // return 0 when `key` was found in the minimum number of operations and // should increment the result for each non-trivial operation required to find // `key`. // // The default implementation uses the bucket api from the standard and thus // works for `std::unordered_*` containers. 
// -------------------------------------------------------------------------- static size_t GetNumProbes(const Container& c, const typename Container::key_type& key) { if (!c.bucket_count()) return {}; size_t num_probes = 0; size_t bucket = c.bucket(key); for (auto it = c.begin(bucket), e = c.end(bucket);; ++it, ++num_probes) { if (it == e) return num_probes; if (c.key_eq()(key, GetKey(*it, 0))) return num_probes; } } }; } // namespace hashtable_debug_internal // ---------------------------------------------------------------------------- // I N F O Z S T U B S // ---------------------------------------------------------------------------- struct HashtablezInfo { void PrepareForSampling() {} }; inline void RecordRehashSlow(HashtablezInfo*, size_t ) {} static inline void RecordInsertSlow(HashtablezInfo* , size_t, size_t ) {} static inline void RecordEraseSlow(HashtablezInfo*) {} static inline HashtablezInfo* SampleSlow(int64_t*) { return nullptr; } static inline void UnsampleSlow(HashtablezInfo* ) {} class HashtablezInfoHandle { public: inline void RecordStorageChanged(size_t , size_t ) {} inline void RecordRehash(size_t ) {} inline void RecordInsert(size_t , size_t ) {} inline void RecordErase() {} friend inline void swap(HashtablezInfoHandle& , HashtablezInfoHandle& ) noexcept {} }; static inline HashtablezInfoHandle Sample() { return HashtablezInfoHandle(); } class HashtablezSampler { public: // Returns a global Sampler. static HashtablezSampler& Global() { static HashtablezSampler hzs; return hzs; } HashtablezInfo* Register() { static HashtablezInfo info; return &info; } void Unregister(HashtablezInfo* ) {} using DisposeCallback = void (*)(const HashtablezInfo&); DisposeCallback SetDisposeCallback(DisposeCallback ) { return nullptr; } int64_t Iterate(const std::function& ) { return 0; } }; static inline void SetHashtablezEnabled(bool ) {} static inline void SetHashtablezSampleParameter(int32_t ) {} static inline void SetHashtablezMaxSamples(int32_t ) {} namespace memory_internal { // Constructs T into uninitialized storage pointed by `ptr` using the args // specified in the tuple. // ---------------------------------------------------------------------------- template void ConstructFromTupleImpl(Alloc* alloc, T* ptr, Tuple&& t, phmap::index_sequence) { phmap::allocator_traits::construct( *alloc, ptr, std::get(std::forward(t))...); } template struct WithConstructedImplF { template decltype(std::declval()(std::declval())) operator()( Args&&... args) const { return std::forward(f)(T(std::forward(args)...)); } F&& f; }; template decltype(std::declval()(std::declval())) WithConstructedImpl( Tuple&& t, phmap::index_sequence, F&& f) { return WithConstructedImplF{std::forward(f)}( std::get(std::forward(t))...); } template auto TupleRefImpl(T&& t, phmap::index_sequence) -> decltype(std::forward_as_tuple(std::get(std::forward(t))...)) { return std::forward_as_tuple(std::get(std::forward(t))...); } // Returns a tuple of references to the elements of the input tuple. T must be a // tuple. 
// ---------------------------------------------------------------------------- template auto TupleRef(T&& t) -> decltype( TupleRefImpl(std::forward(t), phmap::make_index_sequence< std::tuple_size::type>::value>())) { return TupleRefImpl( std::forward(t), phmap::make_index_sequence< std::tuple_size::type>::value>()); } template decltype(std::declval()(std::declval(), std::piecewise_construct, std::declval>(), std::declval())) DecomposePairImpl(F&& f, std::pair, V> p) { const auto& key = std::get<0>(p.first); return std::forward(f)(key, std::piecewise_construct, std::move(p.first), std::move(p.second)); } } // namespace memory_internal // Helper functions for asan and msan. // ---------------------------------------------------------------------------- inline void SanitizerPoisonMemoryRegion(const void* m, size_t s) { #ifdef ADDRESS_SANITIZER ASAN_POISON_MEMORY_REGION(m, s); #endif #ifdef MEMORY_SANITIZER __msan_poison(m, s); #endif (void)m; (void)s; } inline void SanitizerUnpoisonMemoryRegion(const void* m, size_t s) { #ifdef ADDRESS_SANITIZER ASAN_UNPOISON_MEMORY_REGION(m, s); #endif #ifdef MEMORY_SANITIZER __msan_unpoison(m, s); #endif (void)m; (void)s; } template inline void SanitizerPoisonObject(const T* object) { SanitizerPoisonMemoryRegion(object, sizeof(T)); } template inline void SanitizerUnpoisonObject(const T* object) { SanitizerUnpoisonMemoryRegion(object, sizeof(T)); } // ---------------------------------------------------------------------------- // Allocates at least n bytes aligned to the specified alignment. // Alignment must be a power of 2. It must be positive. // // Note that many allocators don't honor alignment requirements above certain // threshold (usually either alignof(std::max_align_t) or alignof(void*)). // Allocate() doesn't apply alignment corrections. If the underlying allocator // returns insufficiently alignment pointer, that's what you are going to get. // ---------------------------------------------------------------------------- template void* Allocate(Alloc* alloc, size_t n) { static_assert(Alignment > 0, ""); assert(n && "n must be positive"); struct alignas(Alignment) M {}; using A = typename phmap::allocator_traits::template rebind_alloc; using AT = typename phmap::allocator_traits::template rebind_traits; A mem_alloc(*alloc); void* p = AT::allocate(mem_alloc, (n + sizeof(M) - 1) / sizeof(M)); assert(reinterpret_cast(p) % Alignment == 0 && "allocator does not respect alignment"); return p; } // ---------------------------------------------------------------------------- // The pointer must have been previously obtained by calling // Allocate(alloc, n). // ---------------------------------------------------------------------------- template void Deallocate(Alloc* alloc, void* p, size_t n) { static_assert(Alignment > 0, ""); assert(n && "n must be positive"); struct alignas(Alignment) M {}; using A = typename phmap::allocator_traits::template rebind_alloc; using AT = typename phmap::allocator_traits::template rebind_traits; A mem_alloc(*alloc); AT::deallocate(mem_alloc, static_cast(p), (n + sizeof(M) - 1) / sizeof(M)); } // ---------------------------------------------------------------------------- // R A W _ H A S H _ S E T // ---------------------------------------------------------------------------- // An open-addressing // hashtable with quadratic probing. // // This is a low level hashtable on top of which different interfaces can be // implemented, like flat_hash_set, node_hash_set, string_hash_set, etc. 
// // The table interface is similar to that of std::unordered_set. Notable // differences are that most member functions support heterogeneous keys when // BOTH the hash and eq functions are marked as transparent. They do so by // providing a typedef called `is_transparent`. // // When heterogeneous lookup is enabled, functions that take key_type act as if // they have an overload set like: // // iterator find(const key_type& key); // template // iterator find(const K& key); // // size_type erase(const key_type& key); // template // size_type erase(const K& key); // // std::pair equal_range(const key_type& key); // template // std::pair equal_range(const K& key); // // When heterogeneous lookup is disabled, only the explicit `key_type` overloads // exist. // // find() also supports passing the hash explicitly: // // iterator find(const key_type& key, size_t hash); // template // iterator find(const U& key, size_t hash); // // In addition the pointer to element and iterator stability guarantees are // weaker: all iterators and pointers are invalidated after a new element is // inserted. // // IMPLEMENTATION DETAILS // // The table stores elements inline in a slot array. In addition to the slot // array the table maintains some control state per slot. The extra state is one // byte per slot and stores empty or deleted marks, or alternatively 7 bits from // the hash of an occupied slot. The table is split into logical groups of // slots, like so: // // Group 1 Group 2 Group 3 // +---------------+---------------+---------------+ // | | | | | | | | | | | | | | | | | | | | | | | | | // +---------------+---------------+---------------+ // // On lookup the hash is split into two parts: // - H2: 7 bits (those stored in the control bytes) // - H1: the rest of the bits // The groups are probed using H1. For each group the slots are matched to H2 in // parallel. Because H2 is 7 bits (128 states) and the number of slots per group // is low (8 or 16) in almost all cases a match in H2 is also a lookup hit. // // On insert, once the right group is found (as in lookup), its slots are // filled in order. // // On erase a slot is cleared. In case the group did not have any empty slots // before the erase, the erased slot is marked as deleted. // // Groups without empty slots (but maybe with deleted slots) extend the probe // sequence. The probing algorithm is quadratic. Given N the number of groups, // the probing function for the i'th probe is: // // P(0) = H1 % N // // P(i) = (P(i - 1) + i) % N // // This probing function guarantees that after N probes, all the groups of the // table will be probed exactly once. // ---------------------------------------------------------------------------- template class raw_hash_set { using PolicyTraits = hash_policy_traits; using KeyArgImpl = KeyArg::value && IsTransparent::value>; public: using init_type = typename PolicyTraits::init_type; using key_type = typename PolicyTraits::key_type; // TODO(sbenza): Hide slot_type as it is an implementation detail. Needs user // code fixes! 
using slot_type = typename PolicyTraits::slot_type; using allocator_type = Alloc; using size_type = size_t; using difference_type = ptrdiff_t; using hasher = Hash; using key_equal = Eq; using policy_type = Policy; using value_type = typename PolicyTraits::value_type; using reference = value_type&; using const_reference = const value_type&; using pointer = typename phmap::allocator_traits< allocator_type>::template rebind_traits::pointer; using const_pointer = typename phmap::allocator_traits< allocator_type>::template rebind_traits::const_pointer; // Alias used for heterogeneous lookup functions. // `key_arg` evaluates to `K` when the functors are transparent and to // `key_type` otherwise. It permits template argument deduction on `K` for the // transparent case. template using key_arg = typename KeyArgImpl::template type; private: // Give an early error when key_type is not hashable/eq. auto KeyTypeCanBeHashed(const Hash& h, const key_type& k) -> decltype(h(k)); auto KeyTypeCanBeEq(const Eq& eq, const key_type& k) -> decltype(eq(k, k)); using Layout = phmap::container_internal::Layout; static Layout MakeLayout(size_t capacity) { assert(IsValidCapacity(capacity)); return Layout(capacity + Group::kWidth + 1, capacity); } using AllocTraits = phmap::allocator_traits; using SlotAlloc = typename phmap::allocator_traits< allocator_type>::template rebind_alloc; using SlotAllocTraits = typename phmap::allocator_traits< allocator_type>::template rebind_traits; static_assert(std::is_lvalue_reference::value, "Policy::element() must return a reference"); template struct SameAsElementReference : std::is_same::type>::type, typename std::remove_cv< typename std::remove_reference::type>::type> {}; // An enabler for insert(T&&): T must be convertible to init_type or be the // same as [cv] value_type [ref]. // Note: we separate SameAsElementReference into its own type to avoid using // reference unless we need to. MSVC doesn't seem to like it in some // cases. template using RequiresInsertable = typename std::enable_if< phmap::disjunction, SameAsElementReference>::value, int>::type; // RequiresNotInit is a workaround for gcc prior to 7.1. // See https://godbolt.org/g/Y4xsUh. template using RequiresNotInit = typename std::enable_if::value, int>::type; template using IsDecomposable = IsDecomposable; public: static_assert(std::is_same::value, "Allocators with custom pointer types are not supported"); static_assert(std::is_same::value, "Allocators with custom pointer types are not supported"); class iterator { friend class raw_hash_set; public: using iterator_category = std::forward_iterator_tag; using value_type = typename raw_hash_set::value_type; using reference = phmap::conditional_t; using pointer = phmap::remove_reference_t*; using difference_type = typename raw_hash_set::difference_type; iterator() {} // PRECONDITION: not an end() iterator. reference operator*() const { return PolicyTraits::element(slot_); } // PRECONDITION: not an end() iterator. pointer operator->() const { return &operator*(); } // PRECONDITION: not an end() iterator. iterator& operator++() { ++ctrl_; ++slot_; skip_empty_or_deleted(); return *this; } // PRECONDITION: not an end() iterator. 
iterator operator++(int) { auto tmp = *this; ++*this; return tmp; } friend bool operator==(const iterator& a, const iterator& b) { return a.ctrl_ == b.ctrl_; } friend bool operator!=(const iterator& a, const iterator& b) { return !(a == b); } private: iterator(ctrl_t* ctrl) : ctrl_(ctrl) {} // for end() iterator(ctrl_t* ctrl, slot_type* slot) : ctrl_(ctrl), slot_(slot) {} void skip_empty_or_deleted() { while (IsEmptyOrDeleted(*ctrl_)) { // ctrl is not necessarily aligned to Group::kWidth. It is also likely // to read past the space for ctrl bytes and into slots. This is ok // because ctrl has sizeof() == 1 and slot has sizeof() >= 1 so there // is no way to read outside the combined slot array. uint32_t shift = Group{ctrl_}.CountLeadingEmptyOrDeleted(); ctrl_ += shift; slot_ += shift; } } ctrl_t* ctrl_ = nullptr; // To avoid uninitialized member warnigs, put slot_ in an anonymous union. // The member is not initialized on singleton and end iterators. union { slot_type* slot_; }; }; class const_iterator { friend class raw_hash_set; public: using iterator_category = typename iterator::iterator_category; using value_type = typename raw_hash_set::value_type; using reference = typename raw_hash_set::const_reference; using pointer = typename raw_hash_set::const_pointer; using difference_type = typename raw_hash_set::difference_type; const_iterator() {} // Implicit construction from iterator. const_iterator(iterator i) : inner_(std::move(i)) {} reference operator*() const { return *inner_; } pointer operator->() const { return inner_.operator->(); } const_iterator& operator++() { ++inner_; return *this; } const_iterator operator++(int) { return inner_++; } friend bool operator==(const const_iterator& a, const const_iterator& b) { return a.inner_ == b.inner_; } friend bool operator!=(const const_iterator& a, const const_iterator& b) { return !(a == b); } private: const_iterator(const ctrl_t* ctrl, const slot_type* slot) : inner_(const_cast(ctrl), const_cast(slot)) {} iterator inner_; }; using node_type = node_handle, Alloc>; using insert_return_type = InsertReturnType; raw_hash_set() noexcept( std::is_nothrow_default_constructible::value&& std::is_nothrow_default_constructible::value&& std::is_nothrow_default_constructible::value) {} explicit raw_hash_set(size_t bucket_count, const hasher& hash = hasher(), const key_equal& eq = key_equal(), const allocator_type& alloc = allocator_type()) : ctrl_(EmptyGroup()), settings_(0, hash, eq, alloc) { if (bucket_count) { capacity_ = NormalizeCapacity(bucket_count); reset_growth_left(); initialize_slots(); } } raw_hash_set(size_t bucket_count, const hasher& hash, const allocator_type& alloc) : raw_hash_set(bucket_count, hash, key_equal(), alloc) {} raw_hash_set(size_t bucket_count, const allocator_type& alloc) : raw_hash_set(bucket_count, hasher(), key_equal(), alloc) {} explicit raw_hash_set(const allocator_type& alloc) : raw_hash_set(0, hasher(), key_equal(), alloc) {} template raw_hash_set(InputIter first, InputIter last, size_t bucket_count = 0, const hasher& hash = hasher(), const key_equal& eq = key_equal(), const allocator_type& alloc = allocator_type()) : raw_hash_set(bucket_count, hash, eq, alloc) { insert(first, last); } template raw_hash_set(InputIter first, InputIter last, size_t bucket_count, const hasher& hash, const allocator_type& alloc) : raw_hash_set(first, last, bucket_count, hash, key_equal(), alloc) {} template raw_hash_set(InputIter first, InputIter last, size_t bucket_count, const allocator_type& alloc) : raw_hash_set(first, last, 
bucket_count, hasher(), key_equal(), alloc) {} template raw_hash_set(InputIter first, InputIter last, const allocator_type& alloc) : raw_hash_set(first, last, 0, hasher(), key_equal(), alloc) {} // Instead of accepting std::initializer_list as the first // argument like std::unordered_set does, we have two overloads // that accept std::initializer_list and std::initializer_list. // This is advantageous for performance. // // // Turns {"abc", "def"} into std::initializer_list, then // // copies the strings into the set. // std::unordered_set s = {"abc", "def"}; // // // Turns {"abc", "def"} into std::initializer_list, then // // copies the strings into the set. // phmap::flat_hash_set s = {"abc", "def"}; // // The same trick is used in insert(). // // The enabler is necessary to prevent this constructor from triggering where // the copy constructor is meant to be called. // // phmap::flat_hash_set a, b{a}; // // RequiresNotInit is a workaround for gcc prior to 7.1. template = 0, RequiresInsertable = 0> raw_hash_set(std::initializer_list init, size_t bucket_count = 0, const hasher& hash = hasher(), const key_equal& eq = key_equal(), const allocator_type& alloc = allocator_type()) : raw_hash_set(init.begin(), init.end(), bucket_count, hash, eq, alloc) {} raw_hash_set(std::initializer_list init, size_t bucket_count = 0, const hasher& hash = hasher(), const key_equal& eq = key_equal(), const allocator_type& alloc = allocator_type()) : raw_hash_set(init.begin(), init.end(), bucket_count, hash, eq, alloc) {} template = 0, RequiresInsertable = 0> raw_hash_set(std::initializer_list init, size_t bucket_count, const hasher& hash, const allocator_type& alloc) : raw_hash_set(init, bucket_count, hash, key_equal(), alloc) {} raw_hash_set(std::initializer_list init, size_t bucket_count, const hasher& hash, const allocator_type& alloc) : raw_hash_set(init, bucket_count, hash, key_equal(), alloc) {} template = 0, RequiresInsertable = 0> raw_hash_set(std::initializer_list init, size_t bucket_count, const allocator_type& alloc) : raw_hash_set(init, bucket_count, hasher(), key_equal(), alloc) {} raw_hash_set(std::initializer_list init, size_t bucket_count, const allocator_type& alloc) : raw_hash_set(init, bucket_count, hasher(), key_equal(), alloc) {} template = 0, RequiresInsertable = 0> raw_hash_set(std::initializer_list init, const allocator_type& alloc) : raw_hash_set(init, 0, hasher(), key_equal(), alloc) {} raw_hash_set(std::initializer_list init, const allocator_type& alloc) : raw_hash_set(init, 0, hasher(), key_equal(), alloc) {} raw_hash_set(const raw_hash_set& that) : raw_hash_set(that, AllocTraits::select_on_container_copy_construction( that.alloc_ref())) {} raw_hash_set(const raw_hash_set& that, const allocator_type& a) : raw_hash_set(0, that.hash_ref(), that.eq_ref(), a) { reserve(that.size()); // Because the table is guaranteed to be empty, we can do something faster // than a full `insert`. 
for (const auto& v : that) { const size_t hash = PolicyTraits::apply(HashElement{hash_ref()}, v); auto target = find_first_non_full(hash); set_ctrl(target.offset, H2(hash)); emplace_at(target.offset, v); infoz_.RecordInsert(hash, target.probe_length); } size_ = that.size(); growth_left() -= that.size(); } raw_hash_set(raw_hash_set&& that) noexcept( std::is_nothrow_copy_constructible::value&& std::is_nothrow_copy_constructible::value&& std::is_nothrow_copy_constructible::value) : ctrl_(phmap::exchange(that.ctrl_, EmptyGroup())), slots_(phmap::exchange(that.slots_, nullptr)), size_(phmap::exchange(that.size_, 0)), capacity_(phmap::exchange(that.capacity_, 0)), infoz_(phmap::exchange(that.infoz_, HashtablezInfoHandle())), // Hash, equality and allocator are copied instead of moved because // `that` must be left valid. If Hash is std::function, moving it // would create a nullptr functor that cannot be called. settings_(that.settings_) { // growth_left was copied above, reset the one from `that`. that.growth_left() = 0; } raw_hash_set(raw_hash_set&& that, const allocator_type& a) : ctrl_(EmptyGroup()), slots_(nullptr), size_(0), capacity_(0), settings_(0, that.hash_ref(), that.eq_ref(), a) { if (a == that.alloc_ref()) { std::swap(ctrl_, that.ctrl_); std::swap(slots_, that.slots_); std::swap(size_, that.size_); std::swap(capacity_, that.capacity_); std::swap(growth_left(), that.growth_left()); std::swap(infoz_, that.infoz_); } else { reserve(that.size()); // Note: this will copy elements of dense_set and unordered_set instead of // moving them. This can be fixed if it ever becomes an issue. for (auto& elem : that) insert(std::move(elem)); } } raw_hash_set& operator=(const raw_hash_set& that) { raw_hash_set tmp(that, AllocTraits::propagate_on_container_copy_assignment::value ? that.alloc_ref() : alloc_ref()); swap(tmp); return *this; } raw_hash_set& operator=(raw_hash_set&& that) noexcept( phmap::allocator_traits::is_always_equal::value&& std::is_nothrow_move_assignable::value&& std::is_nothrow_move_assignable::value) { // TODO(sbenza): We should only use the operations from the noexcept clause // to make sure we actually adhere to that contract. return move_assign( std::move(that), typename AllocTraits::propagate_on_container_move_assignment()); } ~raw_hash_set() { destroy_slots(); } iterator begin() { auto it = iterator_at(0); it.skip_empty_or_deleted(); return it; } iterator end() { return {ctrl_ + capacity_}; } const_iterator begin() const { return const_cast(this)->begin(); } const_iterator end() const { return const_cast(this)->end(); } const_iterator cbegin() const { return begin(); } const_iterator cend() const { return end(); } bool empty() const { return !size(); } size_t size() const { return size_; } size_t capacity() const { return capacity_; } size_t max_size() const { return (std::numeric_limits::max)(); } PHMAP_ATTRIBUTE_REINITIALIZES void clear() { // Iterating over this container is O(bucket_count()). When bucket_count() // is much greater than size(), iteration becomes prohibitively expensive. // For clear() it is more important to reuse the allocated array when the // container is small because allocation takes comparatively long time // compared to destruction of the elements of the container. So we pick the // largest bucket_count() threshold for which iteration is still fast and // past that we simply deallocate the array. 
if (capacity_ > 127) { destroy_slots(); } else if (capacity_) { for (size_t i = 0; i != capacity_; ++i) { if (IsFull(ctrl_[i])) { PolicyTraits::destroy(&alloc_ref(), slots_ + i); } } size_ = 0; reset_ctrl(); reset_growth_left(); } assert(empty()); infoz_.RecordStorageChanged(0, capacity_); } // This overload kicks in when the argument is an rvalue of insertable and // decomposable type other than init_type. // // flat_hash_map m; // m.insert(std::make_pair("abc", 42)); template = 0, typename std::enable_if::value, int>::type = 0, T* = nullptr> std::pair insert(T&& value) { return emplace(std::forward(value)); } // This overload kicks in when the argument is a bitfield or an lvalue of // insertable and decomposable type. // // union { int n : 1; }; // flat_hash_set s; // s.insert(n); // // flat_hash_set s; // const char* p = "hello"; // s.insert(p); // // TODO(romanp): Once we stop supporting gcc 5.1 and below, replace // RequiresInsertable with RequiresInsertable. // We are hitting this bug: https://godbolt.org/g/1Vht4f. template < class T, RequiresInsertable = 0, typename std::enable_if::value, int>::type = 0> std::pair insert(const T& value) { return emplace(value); } // This overload kicks in when the argument is an rvalue of init_type. Its // purpose is to handle brace-init-list arguments. // // flat_hash_set s; // s.insert({"abc", 42}); std::pair insert(init_type&& value) { return emplace(std::move(value)); } template = 0, typename std::enable_if::value, int>::type = 0, T* = nullptr> iterator insert(const_iterator, T&& value) { return insert(std::forward(value)).first; } // TODO(romanp): Once we stop supporting gcc 5.1 and below, replace // RequiresInsertable with RequiresInsertable. // We are hitting this bug: https://godbolt.org/g/1Vht4f. template < class T, RequiresInsertable = 0, typename std::enable_if::value, int>::type = 0> iterator insert(const_iterator, const T& value) { return insert(value).first; } iterator insert(const_iterator, init_type&& value) { return insert(std::move(value)).first; } template void insert(InputIt first, InputIt last) { for (; first != last; ++first) insert(*first); } template = 0, RequiresInsertable = 0> void insert(std::initializer_list ilist) { insert(ilist.begin(), ilist.end()); } void insert(std::initializer_list ilist) { insert(ilist.begin(), ilist.end()); } insert_return_type insert(node_type&& node) { if (!node) return {end(), false, node_type()}; const auto& elem = PolicyTraits::element(CommonAccess::GetSlot(node)); auto res = PolicyTraits::apply( InsertSlot{*this, std::move(*CommonAccess::GetSlot(node))}, elem); if (res.second) { CommonAccess::Reset(&node); return {res.first, true, node_type()}; } else { return {res.first, false, std::move(node)}; } } insert_return_type insert(node_type&& node, size_t hash) { if (!node) return {end(), false, node_type()}; const auto& elem = PolicyTraits::element(CommonAccess::GetSlot(node)); auto res = PolicyTraits::apply( InsertSlotWithHash{*this, std::move(*CommonAccess::GetSlot(node)), hash}, elem); if (res.second) { CommonAccess::Reset(&node); return {res.first, true, node_type()}; } else { return {res.first, false, std::move(node)}; } } iterator insert(const_iterator, node_type&& node) { return insert(std::move(node)).first; } // This overload kicks in if we can deduce the key from args. This enables us // to avoid constructing value_type if an entry with the same key already // exists. 
// // For example: // // flat_hash_map m = {{"abc", "def"}}; // // Creates no std::string copies and makes no heap allocations. // m.emplace("abc", "xyz"); template ::value, int>::type = 0> std::pair emplace(Args&&... args) { return PolicyTraits::apply(EmplaceDecomposable{*this}, std::forward(args)...); } // This overload kicks in if we cannot deduce the key from args. It constructs // value_type unconditionally and then either moves it into the table or // destroys. template ::value, int>::type = 0> std::pair emplace(Args&&... args) { typename std::aligned_storage::type raw; slot_type* slot = reinterpret_cast(&raw); PolicyTraits::construct(&alloc_ref(), slot, std::forward(args)...); const auto& elem = PolicyTraits::element(slot); return PolicyTraits::apply(InsertSlot{*this, std::move(*slot)}, elem); } template iterator emplace_hint(const_iterator, Args&&... args) { return emplace(std::forward(args)...).first; } // Extension API: support for lazy emplace. // // Looks up key in the table. If found, returns the iterator to the element. // Otherwise calls f with one argument of type raw_hash_set::constructor. f // MUST call raw_hash_set::constructor with arguments as if a // raw_hash_set::value_type is constructed, otherwise the behavior is // undefined. // // For example: // // std::unordered_set s; // // Makes ArenaStr even if "abc" is in the map. // s.insert(ArenaString(&arena, "abc")); // // flat_hash_set s; // // Makes ArenaStr only if "abc" is not in the map. // s.lazy_emplace("abc", [&](const constructor& ctor) { // ctor(&arena, "abc"); // }); // // WARNING: This API is currently experimental. If there is a way to implement // the same thing with the rest of the API, prefer that. class constructor { friend class raw_hash_set; public: template void operator()(Args&&... args) const { assert(*slot_); PolicyTraits::construct(alloc_, *slot_, std::forward(args)...); *slot_ = nullptr; } private: constructor(allocator_type* a, slot_type** slot) : alloc_(a), slot_(slot) {} allocator_type* alloc_; slot_type** slot_; }; template iterator lazy_emplace(const key_arg& key, F&& f) { auto res = find_or_prepare_insert(key); if (res.second) { slot_type* slot = slots_ + res.first; std::forward(f)(constructor(&alloc_ref(), &slot)); assert(!slot); } return iterator_at(res.first); } template iterator lazy_emplace_with_hash(const key_arg& key, size_t &hash, F&& f) { auto res = find_or_prepare_insert(key, hash); if (res.second) { slot_type* slot = slots_ + res.first; std::forward(f)(constructor(&alloc_ref(), &slot)); assert(!slot); } return iterator_at(res.first); } // Extension API: support for heterogeneous keys. // // std::unordered_set s; // // Turns "abc" into std::string. // s.erase("abc"); // // flat_hash_set s; // // Uses "abc" directly without copying it into std::string. // s.erase("abc"); template size_type erase(const key_arg& key) { auto it = find(key); if (it == end()) return 0; _erase(it); return 1; } iterator erase(const_iterator cit) { return erase(cit.inner_); } // Erases the element pointed to by `it`. Unlike `std::unordered_set::erase`, // this method returns void to reduce algorithmic complexity to O(1). 
In // order to erase while iterating across a map, use the following idiom (which // also works for standard containers): // // for (auto it = m.begin(), end = m.end(); it != end;) { // if () { // m._erase(it++); // } else { // ++it; // } // } void _erase(iterator it) { assert(it != end()); PolicyTraits::destroy(&alloc_ref(), it.slot_); erase_meta_only(it); } void _erase(const_iterator cit) { _erase(cit.inner_); } // This overload is necessary because otherwise erase(const K&) would be // a better match if non-const iterator is passed as an argument. iterator erase(iterator it) { _erase(it++); return it; } iterator erase(const_iterator first, const_iterator last) { while (first != last) { _erase(first++); } return last.inner_; } // Moves elements from `src` into `this`. // If the element already exists in `this`, it is left unmodified in `src`. template void merge(raw_hash_set& src) { // NOLINT assert(this != &src); for (auto it = src.begin(), e = src.end(); it != e; ++it) { if (PolicyTraits::apply(InsertSlot{*this, std::move(*it.slot_)}, PolicyTraits::element(it.slot_)) .second) { src.erase_meta_only(it); } } } template void merge(raw_hash_set&& src) { merge(src); } node_type extract(const_iterator position) { auto node = CommonAccess::Make(alloc_ref(), position.inner_.slot_); erase_meta_only(position); return node; } template < class K = key_type, typename std::enable_if::value, int>::type = 0> node_type extract(const key_arg& key) { auto it = find(key); return it == end() ? node_type() : extract(const_iterator{it}); } void swap(raw_hash_set& that) noexcept( IsNoThrowSwappable() && IsNoThrowSwappable() && (!AllocTraits::propagate_on_container_swap::value || IsNoThrowSwappable())) { using std::swap; swap(ctrl_, that.ctrl_); swap(slots_, that.slots_); swap(size_, that.size_); swap(capacity_, that.capacity_); swap(growth_left(), that.growth_left()); swap(hash_ref(), that.hash_ref()); swap(eq_ref(), that.eq_ref()); swap(infoz_, that.infoz_); if (AllocTraits::propagate_on_container_swap::value) { swap(alloc_ref(), that.alloc_ref()); } else { // If the allocators do not compare equal it is officially undefined // behavior. We choose to do nothing. } } void rehash(size_t n) { if (n == 0 && capacity_ == 0) return; if (n == 0 && size_ == 0) { destroy_slots(); infoz_.RecordStorageChanged(0, 0); return; } // bitor is a faster way of doing `max` here. We will round up to the next // power-of-2-minus-1, so bitor is good enough. auto m = NormalizeCapacity(n | GrowthToLowerboundCapacity(size())); // n == 0 unconditionally rehashes as per the standard. if (n == 0 || m > capacity_) { resize(m); } } void reserve(size_t n) { rehash(GrowthToLowerboundCapacity(n)); } // Extension API: support for heterogeneous keys. // // std::unordered_set s; // // Turns "abc" into std::string. // s.count("abc"); // // ch_set s; // // Uses "abc" directly without copying it into std::string. // s.count("abc"); template size_t count(const key_arg& key) const { return find(key) == end() ? 0 : 1; } // Issues CPU prefetch instructions for the memory needed to find or insert // a key. Like all lookup functions, this support heterogeneous keys. // // NOTE: This is a very low level operation and should not be used without // specific benchmarks indicating its importance. 
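    //
    // Intended pattern (sketch; the benefit is workload-dependent and should
    // be verified with benchmarks):
    //
    //   phmap::flat_hash_map<uint64_t, Record> m;  // Record: any user type
    //   m.prefetch(key);        // issue the prefetch as early as possible
    //   /* ... unrelated work that hides the memory latency ... */
    //   auto it = m.find(key);  // the probed cache lines are likely resident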
void prefetch_hash(size_t hash) const { (void)hash; #if defined(__GNUC__) auto seq = probe(hash); __builtin_prefetch(static_cast(ctrl_ + seq.offset())); __builtin_prefetch(static_cast(slots_ + seq.offset())); #endif // __GNUC__ } template void prefetch(const key_arg& key) const { prefetch_hash(HashElement{hash_ref()}(key)); } // The API of find() has two extensions. // // 1. The hash can be passed by the user. It must be equal to the hash of the // key. // // 2. The type of the key argument doesn't have to be key_type. This is so // called heterogeneous key support. template iterator find(const key_arg& key, size_t hash) { auto seq = probe(hash); while (true) { Group g{ctrl_ + seq.offset()}; for (int i : g.Match(H2(hash))) { if (PHMAP_PREDICT_TRUE(PolicyTraits::apply( EqualElement{key, eq_ref()}, PolicyTraits::element(slots_ + seq.offset(i))))) return iterator_at(seq.offset(i)); } if (PHMAP_PREDICT_TRUE(g.MatchEmpty())) return end(); seq.next(); } } template iterator find(const key_arg& key) { return find(key, HashElement{hash_ref()}(key)); } template const_iterator find(const key_arg& key, size_t hash) const { return const_cast(this)->find(key, hash); } template const_iterator find(const key_arg& key) const { return find(key, HashElement{hash_ref()}(key)); } template bool contains(const key_arg& key) const { return find(key) != end(); } template std::pair equal_range(const key_arg& key) { auto it = find(key); if (it != end()) return {it, std::next(it)}; return {it, it}; } template std::pair equal_range( const key_arg& key) const { auto it = find(key); if (it != end()) return {it, std::next(it)}; return {it, it}; } size_t bucket_count() const { return capacity_; } float load_factor() const { return capacity_ ? static_cast(size()) / capacity_ : 0.0; } float max_load_factor() const { return 1.0f; } void max_load_factor(float) { // Does nothing. } hasher hash_function() const { return hash_ref(); } key_equal key_eq() const { return eq_ref(); } allocator_type get_allocator() const { return alloc_ref(); } friend bool operator==(const raw_hash_set& a, const raw_hash_set& b) { if (a.size() != b.size()) return false; const raw_hash_set* outer = &a; const raw_hash_set* inner = &b; if (outer->capacity() > inner->capacity()) std::swap(outer, inner); for (const value_type& elem : *outer) if (!inner->has_element(elem)) return false; return true; } friend bool operator!=(const raw_hash_set& a, const raw_hash_set& b) { return !(a == b); } friend void swap(raw_hash_set& a, raw_hash_set& b) noexcept(noexcept(a.swap(b))) { a.swap(b); } private: template friend struct phmap::container_internal::hashtable_debug_internal::HashtableDebugAccess; struct FindElement { template const_iterator operator()(const K& key, Args&&...) const { return s.find(key); } const raw_hash_set& s; }; struct HashElement { template size_t operator()(const K& key, Args&&...) const { return phmap_mix()(h(key)); } const hasher& h; }; template struct EqualElement { template bool operator()(const K2& lhs, Args&&...) const { return eq(lhs, rhs); } const K1& rhs; const key_equal& eq; }; template std::pair emplace_decomposable(const K& key, size_t hash, Args&&... args) { auto res = find_or_prepare_insert(key, hash); if (res.second) { emplace_at(res.first, std::forward(args)...); } return {iterator_at(res.first), res.second}; } struct EmplaceDecomposable { template std::pair operator()(const K& key, Args&&... 
args) const { return s.emplace_decomposable(key, typename raw_hash_set::HashElement{s.hash_ref()}(key), std::forward(args)...); } raw_hash_set& s; }; template struct InsertSlot { template std::pair operator()(const K& key, Args&&...) && { auto res = s.find_or_prepare_insert(key); if (res.second) { PolicyTraits::transfer(&s.alloc_ref(), s.slots_ + res.first, &slot); } else if (do_destroy) { PolicyTraits::destroy(&s.alloc_ref(), &slot); } return {s.iterator_at(res.first), res.second}; } raw_hash_set& s; // Constructed slot. Either moved into place or destroyed. slot_type&& slot; }; template struct InsertSlotWithHash { template std::pair operator()(const K& key, Args&&...) && { auto res = s.find_or_prepare_insert(key, hash); if (res.second) { PolicyTraits::transfer(&s.alloc_ref(), s.slots_ + res.first, &slot); } else if (do_destroy) { PolicyTraits::destroy(&s.alloc_ref(), &slot); } return {s.iterator_at(res.first), res.second}; } raw_hash_set& s; // Constructed slot. Either moved into place or destroyed. slot_type&& slot; size_t &hash; }; // "erases" the object from the container, except that it doesn't actually // destroy the object. It only updates all the metadata of the class. // This can be used in conjunction with Policy::transfer to move the object to // another place. void erase_meta_only(const_iterator it) { assert(IsFull(*it.inner_.ctrl_) && "erasing a dangling iterator"); --size_; const size_t index = it.inner_.ctrl_ - ctrl_; const size_t index_before = (index - Group::kWidth) & capacity_; const auto empty_after = Group(it.inner_.ctrl_).MatchEmpty(); const auto empty_before = Group(ctrl_ + index_before).MatchEmpty(); // We count how many consecutive non empties we have to the right and to the // left of `it`. If the sum is >= kWidth then there is at least one probe // window that might have seen a full group. bool was_never_full = empty_before && empty_after && static_cast(empty_after.TrailingZeros() + empty_before.LeadingZeros()) < Group::kWidth; set_ctrl(index, was_never_full ? kEmpty : kDeleted); growth_left() += was_never_full; infoz_.RecordErase(); } void initialize_slots() { assert(capacity_); if (std::is_same>::value && slots_ == nullptr) { infoz_ = Sample(); } auto layout = MakeLayout(capacity_); char* mem = static_cast( Allocate(&alloc_ref(), layout.AllocSize())); ctrl_ = reinterpret_cast(layout.template Pointer<0>(mem)); slots_ = layout.template Pointer<1>(mem); reset_ctrl(); reset_growth_left(); infoz_.RecordStorageChanged(size_, capacity_); } void destroy_slots() { if (!capacity_) return; for (size_t i = 0; i != capacity_; ++i) { if (IsFull(ctrl_[i])) { PolicyTraits::destroy(&alloc_ref(), slots_ + i); } } auto layout = MakeLayout(capacity_); // Unpoison before returning the memory to the allocator. 
SanitizerUnpoisonMemoryRegion(slots_, sizeof(slot_type) * capacity_); Deallocate(&alloc_ref(), ctrl_, layout.AllocSize()); ctrl_ = EmptyGroup(); slots_ = nullptr; size_ = 0; capacity_ = 0; growth_left() = 0; } void resize(size_t new_capacity) { assert(IsValidCapacity(new_capacity)); auto* old_ctrl = ctrl_; auto* old_slots = slots_; const size_t old_capacity = capacity_; capacity_ = new_capacity; initialize_slots(); for (size_t i = 0; i != old_capacity; ++i) { if (IsFull(old_ctrl[i])) { size_t hash = PolicyTraits::apply(HashElement{hash_ref()}, PolicyTraits::element(old_slots + i)); auto target = find_first_non_full(hash); size_t new_i = target.offset; set_ctrl(new_i, H2(hash)); PolicyTraits::transfer(&alloc_ref(), slots_ + new_i, old_slots + i); } } if (old_capacity) { SanitizerUnpoisonMemoryRegion(old_slots, sizeof(slot_type) * old_capacity); auto layout = MakeLayout(old_capacity); Deallocate(&alloc_ref(), old_ctrl, layout.AllocSize()); } } void drop_deletes_without_resize() PHMAP_ATTRIBUTE_NOINLINE { assert(IsValidCapacity(capacity_)); assert(!is_small()); // Algorithm: // - mark all DELETED slots as EMPTY // - mark all FULL slots as DELETED // - for each slot marked as DELETED // hash = Hash(element) // target = find_first_non_full(hash) // if target is in the same group // mark slot as FULL // else if target is EMPTY // transfer element to target // mark slot as EMPTY // mark target as FULL // else if target is DELETED // swap current element with target element // mark target as FULL // repeat procedure for current slot with moved from element (target) ConvertDeletedToEmptyAndFullToDeleted(ctrl_, capacity_); typename std::aligned_storage::type raw; slot_type* slot = reinterpret_cast(&raw); for (size_t i = 0; i != capacity_; ++i) { if (!IsDeleted(ctrl_[i])) continue; size_t hash = PolicyTraits::apply(HashElement{hash_ref()}, PolicyTraits::element(slots_ + i)); auto target = find_first_non_full(hash); size_t new_i = target.offset; // Verify if the old and new i fall within the same group wrt the hash. // If they do, we don't need to move the object as it falls already in the // best probe we can. const auto probe_index = [&](size_t pos) { return ((pos - probe(hash).offset()) & capacity_) / Group::kWidth; }; // Element doesn't move. if (PHMAP_PREDICT_TRUE(probe_index(new_i) == probe_index(i))) { set_ctrl(i, H2(hash)); continue; } if (IsEmpty(ctrl_[new_i])) { // Transfer element to the empty spot. // set_ctrl poisons/unpoisons the slots so we have to call it at the // right time. set_ctrl(new_i, H2(hash)); PolicyTraits::transfer(&alloc_ref(), slots_ + new_i, slots_ + i); set_ctrl(i, kEmpty); } else { assert(IsDeleted(ctrl_[new_i])); set_ctrl(new_i, H2(hash)); // Until we are done rehashing, DELETED marks previously FULL slots. // Swap i and new_i elements. PolicyTraits::transfer(&alloc_ref(), slot, slots_ + i); PolicyTraits::transfer(&alloc_ref(), slots_ + i, slots_ + new_i); PolicyTraits::transfer(&alloc_ref(), slots_ + new_i, slot); --i; // repeat } } reset_growth_left(); } void rehash_and_grow_if_necessary() { if (capacity_ == 0) { resize(1); } else if (size() <= CapacityToGrowth(capacity()) / 2) { // Squash DELETED without growing if there is enough capacity. drop_deletes_without_resize(); } else { // Otherwise grow the container. 
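            // Capacities are always of the form 2^k - 1, so the next valid
            // capacity after doubling is capacity_ * 2 + 1 (1 -> 3 -> 7 ->
            // 15 -> ...). A caller that knows its final size can avoid this
            // geometric growth entirely (sketch, via the public alias):
            //
            //   phmap::flat_hash_map<int, int> m;
            //   m.reserve(1000000);  // one allocation instead of ~20 rehashes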
resize(capacity_ * 2 + 1); } } bool has_element(const value_type& elem, size_t hash) const { auto seq = probe(hash); while (true) { Group g{ctrl_ + seq.offset()}; for (int i : g.Match(H2(hash))) { if (PHMAP_PREDICT_TRUE(PolicyTraits::element(slots_ + seq.offset(i)) == elem)) return true; } if (PHMAP_PREDICT_TRUE(g.MatchEmpty())) return false; seq.next(); assert(seq.index() < capacity_ && "full table!"); } return false; } bool has_element(const value_type& elem) const { size_t hash = PolicyTraits::apply(HashElement{hash_ref()}, elem); return has_element(elem, hash); } // Probes the raw_hash_set with the probe sequence for hash and returns the // pointer to the first empty or deleted slot. // NOTE: this function must work with tables having both kEmpty and kDelete // in one group. Such tables appears during drop_deletes_without_resize. // // This function is very useful when insertions happen and: // - the input is already a set // - there are enough slots // - the element with the hash is not in the table struct FindInfo { size_t offset; size_t probe_length; }; FindInfo find_first_non_full(size_t hash) { auto seq = probe(hash); while (true) { Group g{ctrl_ + seq.offset()}; auto mask = g.MatchEmptyOrDeleted(); if (mask) { return {seq.offset(mask.LowestBitSet()), seq.index()}; } assert(seq.index() < capacity_ && "full table!"); seq.next(); } } // TODO(alkis): Optimize this assuming *this and that don't overlap. raw_hash_set& move_assign(raw_hash_set&& that, std::true_type) { raw_hash_set tmp(std::move(that)); swap(tmp); return *this; } raw_hash_set& move_assign(raw_hash_set&& that, std::false_type) { raw_hash_set tmp(std::move(that), alloc_ref()); swap(tmp); return *this; } protected: template std::pair find_or_prepare_insert(const K& key, size_t hash) { auto seq = probe(hash); while (true) { Group g{ctrl_ + seq.offset()}; for (int i : g.Match(H2(hash))) { if (PHMAP_PREDICT_TRUE(PolicyTraits::apply( EqualElement{key, eq_ref()}, PolicyTraits::element(slots_ + seq.offset(i))))) return {seq.offset(i), false}; } if (PHMAP_PREDICT_TRUE(g.MatchEmpty())) break; seq.next(); } return {prepare_insert(hash), true}; } template std::pair find_or_prepare_insert(const K& key) { return find_or_prepare_insert(key, HashElement{hash_ref()}(key)); } size_t prepare_insert(size_t hash) PHMAP_ATTRIBUTE_NOINLINE { auto target = find_first_non_full(hash); if (PHMAP_PREDICT_FALSE(growth_left() == 0 && !IsDeleted(ctrl_[target.offset]))) { rehash_and_grow_if_necessary(); target = find_first_non_full(hash); } ++size_; growth_left() -= IsEmpty(ctrl_[target.offset]); set_ctrl(target.offset, H2(hash)); infoz_.RecordInsert(hash, target.probe_length); return target.offset; } // Constructs the value in the space pointed by the iterator. This only works // after an unsuccessful find_or_prepare_insert() and before any other // modifications happen in the raw_hash_set. // // PRECONDITION: i is an index returned from find_or_prepare_insert(k), where // k is the key decomposed from `forward(args)...`, and the bool // returned by find_or_prepare_insert(k) was true. // POSTCONDITION: *m.iterator_at(i) == value_type(forward(args)...). template void emplace_at(size_t i, Args&&... 
args) { PolicyTraits::construct(&alloc_ref(), slots_ + i, std::forward(args)...); assert(PolicyTraits::apply(FindElement{*this}, *iterator_at(i)) == iterator_at(i) && "constructed value does not match the lookup key"); } iterator iterator_at(size_t i) { return {ctrl_ + i, slots_ + i}; } const_iterator iterator_at(size_t i) const { return {ctrl_ + i, slots_ + i}; } private: friend struct RawHashSetTestOnlyAccess; probe_seq probe(size_t hash) const { return probe_seq(H1(hash, ctrl_), capacity_); } // Reset all ctrl bytes back to kEmpty, except the sentinel. void reset_ctrl() { std::memset(ctrl_, kEmpty, capacity_ + Group::kWidth); ctrl_[capacity_] = kSentinel; SanitizerPoisonMemoryRegion(slots_, sizeof(slot_type) * capacity_); } void reset_growth_left() { growth_left() = CapacityToGrowth(capacity()) - size_; } // Sets the control byte, and if `i < Group::kWidth`, set the cloned byte at // the end too. void set_ctrl(size_t i, ctrl_t h) { assert(i < capacity_); if (IsFull(h)) { SanitizerUnpoisonObject(slots_ + i); } else { SanitizerPoisonObject(slots_ + i); } ctrl_[i] = h; ctrl_[((i - Group::kWidth) & capacity_) + 1 + ((Group::kWidth - 1) & capacity_)] = h; } size_t& growth_left() { return settings_.template get<0>(); } template class RefSet, class M, class P, class H, class E, class A> friend class parallel_hash_set; template class RefSet, class M, class P, class H, class E, class A> friend class parallel_hash_map; // The representation of the object has two modes: // - small: For capacities < kWidth-1 // - large: For the rest. // // Differences: // - In small mode we are able to use the whole capacity. The extra control // bytes give us at least one "empty" control byte to stop the iteration. // This is important to make 1 a valid capacity. // // - In small mode only the first `capacity()` control bytes after the // sentinel are valid. The rest contain dummy kEmpty values that do not // represent a real slot. This is important to take into account on // find_first_non_full(), where we never try ShouldInsertBackwards() for // small tables. bool is_small() const { return capacity_ < Group::kWidth - 1; } hasher& hash_ref() { return settings_.template get<1>(); } const hasher& hash_ref() const { return settings_.template get<1>(); } key_equal& eq_ref() { return settings_.template get<2>(); } const key_equal& eq_ref() const { return settings_.template get<2>(); } allocator_type& alloc_ref() { return settings_.template get<3>(); } const allocator_type& alloc_ref() const { return settings_.template get<3>(); } // TODO(alkis): Investigate removing some of these fields: // - ctrl/slots can be derived from each other // - size can be moved into the slot array ctrl_t* ctrl_ = EmptyGroup(); // [(capacity + 1) * ctrl_t] slot_type* slots_ = nullptr; // [capacity * slot_type] size_t size_ = 0; // number of full slots size_t capacity_ = 0; // total number of slots HashtablezInfoHandle infoz_; phmap::container_internal::CompressedTuple settings_{0, hasher{}, key_equal{}, allocator_type{}}; }; // -------------------------------------------------------------------------- // -------------------------------------------------------------------------- template class raw_hash_map : public raw_hash_set { // P is Policy. It's passed as a template argument to support maps that have // incomplete types as values, as in unordered_map. // MappedReference<> may be a non-reference type. 
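    // --------------------------------------------------------------------
    // Usage sketch of the map-only API this class adds on top of
    // raw_hash_set (via the public phmap::flat_hash_map alias, assumed to
    // wrap this class):
    //
    //   phmap::flat_hash_map<std::string, int> m;
    //   m.try_emplace("abc", 1);       // no-op if "abc" is already present
    //   m.insert_or_assign("abc", 2);  // overwrites the mapped value
    //   m["abc"] += 1;                 // default-constructs the value if absent
    //   int v = m.at("abc");           // note: calls std::abort() if absent
    // --------------------------------------------------------------------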
template using MappedReference = decltype(P::value( std::addressof(std::declval()))); // MappedConstReference<> may be a non-reference type. template using MappedConstReference = decltype(P::value( std::addressof(std::declval()))); using KeyArgImpl = KeyArg::value && IsTransparent::value>; using Base = raw_hash_set; public: using key_type = typename Policy::key_type; using mapped_type = typename Policy::mapped_type; template using key_arg = typename KeyArgImpl::template type; static_assert(!std::is_reference::value, ""); // TODO(alkis): remove this assertion and verify that reference mapped_type is // supported. static_assert(!std::is_reference::value, ""); using iterator = typename raw_hash_map::raw_hash_set::iterator; using const_iterator = typename raw_hash_map::raw_hash_set::const_iterator; raw_hash_map() {} using Base::raw_hash_set; // use raw_hash_set constructor // The last two template parameters ensure that both arguments are rvalues // (lvalue arguments are handled by the overloads below). This is necessary // for supporting bitfield arguments. // // union { int n : 1; }; // flat_hash_map m; // m.insert_or_assign(n, n); template std::pair insert_or_assign(key_arg&& k, V&& v) { return insert_or_assign_impl(std::forward(k), std::forward(v)); } template std::pair insert_or_assign(key_arg&& k, const V& v) { return insert_or_assign_impl(std::forward(k), v); } template std::pair insert_or_assign(const key_arg& k, V&& v) { return insert_or_assign_impl(k, std::forward(v)); } template std::pair insert_or_assign(const key_arg& k, const V& v) { return insert_or_assign_impl(k, v); } template iterator insert_or_assign(const_iterator, key_arg&& k, V&& v) { return insert_or_assign(std::forward(k), std::forward(v)).first; } template iterator insert_or_assign(const_iterator, key_arg&& k, const V& v) { return insert_or_assign(std::forward(k), v).first; } template iterator insert_or_assign(const_iterator, const key_arg& k, V&& v) { return insert_or_assign(k, std::forward(v)).first; } template iterator insert_or_assign(const_iterator, const key_arg& k, const V& v) { return insert_or_assign(k, v).first; } template ::value, int>::type = 0, K* = nullptr> std::pair try_emplace(key_arg&& k, Args&&... args) { return try_emplace_impl(std::forward(k), std::forward(args)...); } template ::value, int>::type = 0> std::pair try_emplace(const key_arg& k, Args&&... args) { return try_emplace_impl(k, std::forward(args)...); } template iterator try_emplace(const_iterator, key_arg&& k, Args&&... args) { return try_emplace(std::forward(k), std::forward(args)...).first; } template iterator try_emplace(const_iterator, const key_arg& k, Args&&... args) { return try_emplace(k, std::forward(args)...).first; } template MappedReference
<P> at(const key_arg<K>& key) {
    auto it = this->find(key);
    if (it == this->end()) std::abort();
    return Policy::value(&*it);
  }

  template <class K = key_type, class P = Policy>
  MappedConstReference<P>
at(const key_arg<K>& key) const {
    auto it = this->find(key);
    if (it == this->end()) std::abort();
    return Policy::value(&*it);
  }

  template <class K = key_type, class P = Policy, K* = nullptr>
  MappedReference<P>
operator[](key_arg<K>&& key) {
    return Policy::value(&*try_emplace(std::forward<K>(key)).first);
  }

  template <class K = key_type, class P = Policy>
  MappedReference<P>
operator[](const key_arg& key) { return Policy::value(&*try_emplace(key).first); } private: template std::pair insert_or_assign_impl(K&& k, V&& v) { auto res = this->find_or_prepare_insert(k); if (res.second) this->emplace_at(res.first, std::forward(k), std::forward(v)); else Policy::value(&*this->iterator_at(res.first)) = std::forward(v); return {this->iterator_at(res.first), res.second}; } template std::pair try_emplace_impl(K&& k, Args&&... args) { auto res = this->find_or_prepare_insert(k); if (res.second) this->emplace_at(res.first, std::piecewise_construct, std::forward_as_tuple(std::forward(k)), std::forward_as_tuple(std::forward(args)...)); return {this->iterator_at(res.first), res.second}; } }; // ---------------------------------------------------------------------------- // ---------------------------------------------------------------------------- // Returns "random" seed. inline size_t RandomSeed() { #if PHMAP_HAVE_THREAD_LOCAL static thread_local size_t counter = 0; size_t value = ++counter; #else // PHMAP_HAVE_THREAD_LOCAL static std::atomic counter(0); size_t value = counter.fetch_add(1, std::memory_order_relaxed); #endif // PHMAP_HAVE_THREAD_LOCAL return value ^ static_cast(reinterpret_cast(&counter)); } // ---------------------------------------------------------------------------- // ---------------------------------------------------------------------------- template class RefSet, class Mtx_, class Policy, class Hash, class Eq, class Alloc> class parallel_hash_set { using PolicyTraits = hash_policy_traits; using KeyArgImpl = KeyArg::value && IsTransparent::value>; static_assert(N <= 12, "N = 12 means 4096 hash tables!"); constexpr static size_t num_tables = 1 << N; constexpr static size_t mask = num_tables - 1; public: using EmbeddedSet = RefSet; using EmbeddedIterator= typename EmbeddedSet::iterator; using EmbeddedConstIterator= typename EmbeddedSet::const_iterator; using init_type = typename PolicyTraits::init_type; using key_type = typename PolicyTraits::key_type; using slot_type = typename PolicyTraits::slot_type; using allocator_type = Alloc; using size_type = size_t; using difference_type = ptrdiff_t; using hasher = Hash; using key_equal = Eq; using policy_type = Policy; using value_type = typename PolicyTraits::value_type; using reference = value_type&; using const_reference = const value_type&; using pointer = typename phmap::allocator_traits< allocator_type>::template rebind_traits::pointer; using const_pointer = typename phmap::allocator_traits< allocator_type>::template rebind_traits::const_pointer; // Alias used for heterogeneous lookup functions. // `key_arg` evaluates to `K` when the functors are transparent and to // `key_type` otherwise. It permits template argument deduction on `K` for the // transparent case. // -------------------------------------------------------------------- template using key_arg = typename KeyArgImpl::template type; protected: using Lockable = phmap::LockableImpl; // -------------------------------------------------------------------- struct alignas(64) Inner : public Lockable { bool operator==(const Inner& o) const { typename Lockable::SharedLocks l(const_cast(*this), const_cast(o)); return set_ == o.set_; } EmbeddedSet set_; }; private: // Give an early error when key_type is not hashable/eq. 
// -------------------------------------------------------------------- auto KeyTypeCanBeHashed(const Hash& h, const key_type& k) -> decltype(h(k)); auto KeyTypeCanBeEq(const Eq& eq, const key_type& k) -> decltype(eq(k, k)); using AllocTraits = phmap::allocator_traits; static_assert(std::is_lvalue_reference::value, "Policy::element() must return a reference"); template struct SameAsElementReference : std::is_same< typename std::remove_cv::type>::type, typename std::remove_cv::type>::type> {}; // An enabler for insert(T&&): T must be convertible to init_type or be the // same as [cv] value_type [ref]. // Note: we separate SameAsElementReference into its own type to avoid using // reference unless we need to. MSVC doesn't seem to like it in some // cases. // -------------------------------------------------------------------- template using RequiresInsertable = typename std::enable_if< phmap::disjunction, SameAsElementReference>::value, int>::type; // RequiresNotInit is a workaround for gcc prior to 7.1. // See https://godbolt.org/g/Y4xsUh. template using RequiresNotInit = typename std::enable_if::value, int>::type; template using IsDecomposable = IsDecomposable; public: static_assert(std::is_same::value, "Allocators with custom pointer types are not supported"); static_assert(std::is_same::value, "Allocators with custom pointer types are not supported"); // --------------------- i t e r a t o r ------------------------------ class iterator { friend class parallel_hash_set; public: using iterator_category = std::forward_iterator_tag; using value_type = typename parallel_hash_set::value_type; using reference = phmap::conditional_t; using pointer = phmap::remove_reference_t*; using difference_type = typename parallel_hash_set::difference_type; using Inner = typename parallel_hash_set::Inner; using EmbeddedSet = typename parallel_hash_set::EmbeddedSet; using EmbeddedIterator = typename EmbeddedSet::iterator; iterator() {} reference operator*() const { return *it_; } pointer operator->() const { return &operator*(); } iterator& operator++() { assert(inner_); // null inner means we are already at the end ++it_; skip_empty(); return *this; } iterator operator++(int) { assert(inner_); // null inner means we are already at the end auto tmp = *this; ++*this; return tmp; } friend bool operator==(const iterator& a, const iterator& b) { return a.inner_ == b.inner_ && (!a.inner_ || a.it_ == b.it_); } friend bool operator!=(const iterator& a, const iterator& b) { return !(a == b); } private: iterator(Inner *inner, Inner *inner_end, const EmbeddedIterator& it) : inner_(inner), inner_end_(inner_end), it_(it) { // for begin() and end() if (inner) it_end_ = inner->set_.end(); } void skip_empty() { while (it_ == it_end_) { ++inner_; if (inner_ == inner_end_) { inner_ = nullptr; // marks end() break; } else { it_ = inner_->set_.begin(); it_end_ = inner_->set_.end(); } } } Inner *inner_ = nullptr; Inner *inner_end_ = nullptr; EmbeddedIterator it_, it_end_; }; // --------------------- c o n s t i t e r a t o r ----------------- class const_iterator { friend class parallel_hash_set; public: using iterator_category = typename iterator::iterator_category; using value_type = typename parallel_hash_set::value_type; using reference = typename parallel_hash_set::const_reference; using pointer = typename parallel_hash_set::const_pointer; using difference_type = typename parallel_hash_set::difference_type; using Inner = typename parallel_hash_set::Inner; const_iterator() {} // Implicit construction from iterator. 
const_iterator(iterator i) : iter_(std::move(i)) {} reference operator*() const { return *(iter_); } pointer operator->() const { return iter_.operator->(); } const_iterator& operator++() { ++iter_; return *this; } const_iterator operator++(int) { return iter_++; } friend bool operator==(const const_iterator& a, const const_iterator& b) { return a.iter_ == b.iter_; } friend bool operator!=(const const_iterator& a, const const_iterator& b) { return !(a == b); } private: const_iterator(const Inner *inner, const Inner *inner_end, const EmbeddedIterator& it) : iter_(const_cast(inner), const_cast(inner_end), const_cast(it)) {} iterator iter_; }; using node_type = node_handle, Alloc>; using insert_return_type = InsertReturnType; // ------------------------- c o n s t r u c t o r s ------------------ parallel_hash_set() noexcept( std::is_nothrow_default_constructible::value&& std::is_nothrow_default_constructible::value&& std::is_nothrow_default_constructible::value) {} explicit parallel_hash_set(size_t bucket_count, const hasher& hash_param = hasher(), const key_equal& eq = key_equal(), const allocator_type& alloc = allocator_type()) { for (auto& inner : sets_) inner.set_ = EmbeddedSet(bucket_count / N, hash_param, eq, alloc); } parallel_hash_set(size_t bucket_count, const hasher& hash_param, const allocator_type& alloc) : parallel_hash_set(bucket_count, hash_param, key_equal(), alloc) {} parallel_hash_set(size_t bucket_count, const allocator_type& alloc) : parallel_hash_set(bucket_count, hasher(), key_equal(), alloc) {} explicit parallel_hash_set(const allocator_type& alloc) : parallel_hash_set(0, hasher(), key_equal(), alloc) {} template parallel_hash_set(InputIter first, InputIter last, size_t bucket_count = 0, const hasher& hash_param = hasher(), const key_equal& eq = key_equal(), const allocator_type& alloc = allocator_type()) : parallel_hash_set(bucket_count, hash_param, eq, alloc) { insert(first, last); } template parallel_hash_set(InputIter first, InputIter last, size_t bucket_count, const hasher& hash_param, const allocator_type& alloc) : parallel_hash_set(first, last, bucket_count, hash_param, key_equal(), alloc) {} template parallel_hash_set(InputIter first, InputIter last, size_t bucket_count, const allocator_type& alloc) : parallel_hash_set(first, last, bucket_count, hasher(), key_equal(), alloc) {} template parallel_hash_set(InputIter first, InputIter last, const allocator_type& alloc) : parallel_hash_set(first, last, 0, hasher(), key_equal(), alloc) {} // Instead of accepting std::initializer_list as the first // argument like std::unordered_set does, we have two overloads // that accept std::initializer_list and std::initializer_list. // This is advantageous for performance. // // // Turns {"abc", "def"} into std::initializer_list, then copies // // the strings into the set. // std::unordered_set s = {"abc", "def"}; // // // Turns {"abc", "def"} into std::initializer_list, then // // copies the strings into the set. // phmap::flat_hash_set s = {"abc", "def"}; // // The same trick is used in insert(). // // The enabler is necessary to prevent this constructor from triggering where // the copy constructor is meant to be called. // // phmap::flat_hash_set a, b{a}; // // RequiresNotInit is a workaround for gcc prior to 7.1. 
// -------------------------------------------------------------------- template = 0, RequiresInsertable = 0> parallel_hash_set(std::initializer_list init, size_t bucket_count = 0, const hasher& hash_param = hasher(), const key_equal& eq = key_equal(), const allocator_type& alloc = allocator_type()) : parallel_hash_set(init.begin(), init.end(), bucket_count, hash_param, eq, alloc) {} parallel_hash_set(std::initializer_list init, size_t bucket_count = 0, const hasher& hash_param = hasher(), const key_equal& eq = key_equal(), const allocator_type& alloc = allocator_type()) : parallel_hash_set(init.begin(), init.end(), bucket_count, hash_param, eq, alloc) {} template = 0, RequiresInsertable = 0> parallel_hash_set(std::initializer_list init, size_t bucket_count, const hasher& hash_param, const allocator_type& alloc) : parallel_hash_set(init, bucket_count, hash_param, key_equal(), alloc) {} parallel_hash_set(std::initializer_list init, size_t bucket_count, const hasher& hash_param, const allocator_type& alloc) : parallel_hash_set(init, bucket_count, hash_param, key_equal(), alloc) {} template = 0, RequiresInsertable = 0> parallel_hash_set(std::initializer_list init, size_t bucket_count, const allocator_type& alloc) : parallel_hash_set(init, bucket_count, hasher(), key_equal(), alloc) {} parallel_hash_set(std::initializer_list init, size_t bucket_count, const allocator_type& alloc) : parallel_hash_set(init, bucket_count, hasher(), key_equal(), alloc) {} template = 0, RequiresInsertable = 0> parallel_hash_set(std::initializer_list init, const allocator_type& alloc) : parallel_hash_set(init, 0, hasher(), key_equal(), alloc) {} parallel_hash_set(std::initializer_list init, const allocator_type& alloc) : parallel_hash_set(init, 0, hasher(), key_equal(), alloc) {} parallel_hash_set(const parallel_hash_set& that) : parallel_hash_set(that, AllocTraits::select_on_container_copy_construction( that.alloc_ref())) {} parallel_hash_set(const parallel_hash_set& that, const allocator_type& a) : parallel_hash_set(0, that.hash_ref(), that.eq_ref(), a) { for (size_t i=0; i::value&& std::is_nothrow_copy_constructible::value&& std::is_nothrow_copy_constructible::value) : parallel_hash_set(std::move(that), that.alloc_ref()) { } parallel_hash_set(parallel_hash_set&& that, const allocator_type& a) { for (size_t i=0; i::is_always_equal::value && std::is_nothrow_move_assignable::value && std::is_nothrow_move_assignable::value) { for (size_t i=0; i(this)->begin(); } const_iterator end() const { return const_cast(this)->end(); } const_iterator cbegin() const { return begin(); } const_iterator cend() const { return end(); } bool empty() const { return !size(); } size_t size() const { size_t sz = 0; for (const auto& inner : sets_) sz += inner.set_.size(); return sz; } size_t capacity() const { size_t c = 0; for (const auto& inner : sets_) c += inner.set_.capacity(); return c; } size_t max_size() const { return (std::numeric_limits::max)(); } PHMAP_ATTRIBUTE_REINITIALIZES void clear() { for (auto& inner : sets_) inner.set_.clear(); } // This overload kicks in when the argument is an rvalue of insertable and // decomposable type other than init_type. 
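    // --------------------------------------------------------------------
    // Construction sketch (parameter names and order assumed from the public
    // parallel_flat_hash_map alias in this bundled version; the last two
    // parameters pick the submap count 2^N and the per-submap mutex):
    //
    //   using Map = phmap::parallel_flat_hash_map<
    //       std::string, int,
    //       phmap::Hash<std::string>, phmap::EqualTo<std::string>,
    //       std::allocator<std::pair<const std::string, int>>,
    //       4,            // 2^4 = 16 submaps
    //       std::mutex>;  // real mutex => thread-safe insert/find/erase
    //   Map m;
    // --------------------------------------------------------------------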
// // flat_hash_map m; // m.insert(std::make_pair("abc", 42)); // -------------------------------------------------------------------- template = 0, typename std::enable_if::value, int>::type = 0, T* = nullptr> std::pair insert(T&& value) { return emplace(std::forward(value)); } // This overload kicks in when the argument is a bitfield or an lvalue of // insertable and decomposable type. // // union { int n : 1; }; // flat_hash_set s; // s.insert(n); // // flat_hash_set s; // const char* p = "hello"; // s.insert(p); // // TODO(romanp): Once we stop supporting gcc 5.1 and below, replace // RequiresInsertable with RequiresInsertable. // We are hitting this bug: https://godbolt.org/g/1Vht4f. // -------------------------------------------------------------------- template < class T, RequiresInsertable = 0, typename std::enable_if::value, int>::type = 0> std::pair insert(const T& value) { return emplace(value); } // This overload kicks in when the argument is an rvalue of init_type. Its // purpose is to handle brace-init-list arguments. // // flat_hash_set> s; // s.insert({"abc", 42}); // -------------------------------------------------------------------- std::pair insert(init_type&& value) { return emplace(std::move(value)); } template = 0, typename std::enable_if::value, int>::type = 0, T* = nullptr> iterator insert(const_iterator, T&& value) { return insert(std::forward(value)).first; } // TODO(romanp): Once we stop supporting gcc 5.1 and below, replace // RequiresInsertable with RequiresInsertable. // We are hitting this bug: https://godbolt.org/g/1Vht4f. // -------------------------------------------------------------------- template < class T, RequiresInsertable = 0, typename std::enable_if::value, int>::type = 0> iterator insert(const_iterator, const T& value) { return insert(value).first; } iterator insert(const_iterator, init_type&& value) { return insert(std::move(value)).first; } template void insert(InputIt first, InputIt last) { for (; first != last; ++first) insert(*first); } template = 0, RequiresInsertable = 0> void insert(std::initializer_list ilist) { insert(ilist.begin(), ilist.end()); } void insert(std::initializer_list ilist) { insert(ilist.begin(), ilist.end()); } insert_return_type insert(node_type&& node) { if (!node) return {end(), false, node_type()}; auto& key = node.key(); size_t hashval = HashElement{hash_ref()}(key); Inner& inner = sets_[subidx(hashval)]; auto& set = inner.set_; typename Lockable::UniqueLock m(inner); auto res = set.insert(std::move(node), hashval); return { make_iterator(&inner, res.position), res.inserted, res.inserted ? node_type() : std::move(res.node) }; } iterator insert(const_iterator, node_type&& node) { return insert(std::move(node)).first; } struct ReturnKey_ { template Key operator()(Key&& k, const Args&...) const { return std::forward(k); } }; template std::pair emplace_decomposable(const K& key, Args&&... args) { size_t hashval = HashElement{hash_ref()}(key); Inner& inner = sets_[subidx(hashval)]; auto& set = inner.set_; typename Lockable::UniqueLock m(inner); return make_rv(&inner, set.emplace_decomposable(key, hashval, std::forward(args)...)); } struct EmplaceDecomposable { template std::pair operator()(const K& key, Args&&... args) const { return s.emplace_decomposable(key, std::forward(args)...); } parallel_hash_set& s; }; // This overload kicks in if we can deduce the key from args. This enables us // to avoid constructing value_type if an entry with the same key already // exists. 
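    // --------------------------------------------------------------------
    // With a real mutex type (see the construction sketch above), each
    // insert locks only the submap selected by the key's hash, so writers
    // mostly proceed in parallel. Sketch:
    //
    //   Map m;  // the parallel_flat_hash_map with std::mutex from above
    //   std::vector<std::thread> pool;
    //   for (int t = 0; t < 4; ++t)
    //     pool.emplace_back([&m, t] {
    //       for (int i = t; i < 1000; i += 4) m.emplace(std::to_string(i), i);
    //     });
    //   for (auto& th : pool) th.join();
    // --------------------------------------------------------------------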
// // For example: // // flat_hash_map m = {{"abc", "def"}}; // // Creates no std::string copies and makes no heap allocations. // m.emplace("abc", "xyz"); // -------------------------------------------------------------------- template ::value, int>::type = 0> std::pair emplace(Args&&... args) { return PolicyTraits::apply(EmplaceDecomposable{*this}, std::forward(args)...); } // This overload kicks in if we cannot deduce the key from args. It constructs // value_type unconditionally and then either moves it into the table or // destroys. // -------------------------------------------------------------------- template ::value, int>::type = 0> std::pair emplace(Args&&... args) { typename std::aligned_storage::type raw; slot_type* slot = reinterpret_cast(&raw); PolicyTraits::construct(&alloc_ref(), slot, std::forward(args)...); const auto& elem = PolicyTraits::element(slot); size_t hashval = HashElement{hash_ref()}(PolicyTraits::key(slot)); Inner& inner = sets_[subidx(hashval)]; auto& set = inner.set_; typename Lockable::UniqueLock m(inner); typename EmbeddedSet::template InsertSlotWithHash f { inner, std::move(*slot), hashval}; return make_rv(PolicyTraits::apply(f, elem)); } template iterator emplace_hint(const_iterator, Args&&... args) { return emplace(std::forward(args)...).first; } iterator make_iterator(Inner* inner, const EmbeddedIterator it) { if (it == inner->set_.end()) return iterator(); return iterator(inner, &sets_[0] + num_tables, it); } std::pair make_rv(Inner* inner, const std::pair& res) { return {iterator(inner, &sets_[0] + num_tables, res.first), res.second}; } template iterator lazy_emplace(const key_arg& key, F&& f) { auto hashval = HashElement{hash_ref()}(key); Inner& inner = sets_[subidx(hashval)]; auto& set = inner.set_; typename Lockable::UniqueLock m(inner); return make_iterator(&inner, set.lazy_emplace(key, hashval, std::forward(f))); } // Extension API: support for heterogeneous keys. // // std::unordered_set s; // // Turns "abc" into std::string. // s.erase("abc"); // // flat_hash_set s; // // Uses "abc" directly without copying it into std::string. // s.erase("abc"); // -------------------------------------------------------------------- template size_type erase(const key_arg& key) { auto hashval = HashElement{hash_ref()}(key); Inner& inner = sets_[subidx(hashval)]; auto& set = inner.set_; typename Lockable::UpgradeLock m(inner); auto it = set.find(key, hashval); if (it == set.end()) return 0; typename Lockable::UpgradeToUnique unique(m); set._erase(it); return 1; } // -------------------------------------------------------------------- iterator erase(const_iterator cit) { return erase(cit.iter_); } // Erases the element pointed to by `it`. Unlike `std::unordered_set::erase`, // this method returns void to reduce algorithmic complexity to O(1). In // order to erase while iterating across a map, use the following idiom (which // also works for standard containers): // // for (auto it = m.begin(), end = m.end(); it != end;) { // if () { // m._erase(it++); // } else { // ++it; // } // } // -------------------------------------------------------------------- void _erase(iterator it) { assert(it.inner_ != nullptr); it.inner_->set_._erase(it.it_); } void _erase(const_iterator cit) { _erase(cit.iter_); } // This overload is necessary because otherwise erase(const K&) would be // a better match if non-const iterator is passed as an argument. 
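    // --------------------------------------------------------------------
    // lazy_emplace() above holds the submap's unique lock across the whole
    // find-or-construct sequence, so with a real mutex it behaves as an
    // atomic "insert if absent" (sketch):
    //
    //   m.lazy_emplace("abc", [](const auto& ctor) {
    //     ctor("abc", 0);  // runs only when "abc" was not already present
    //   });
    // --------------------------------------------------------------------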
// -------------------------------------------------------------------- iterator erase(iterator it) { _erase(it++); return it; } iterator erase(const_iterator first, const_iterator last) { while (first != last) { _erase(first++); } return last.iter_; } // Moves elements from `src` into `this`. // If the element already exists in `this`, it is left unmodified in `src`. // -------------------------------------------------------------------- template void merge(parallel_hash_set& src) { // NOLINT assert(this != &src); if (this != &src) { for (size_t i=0; i void merge(parallel_hash_set&& src) { merge(src); } node_type extract(const_iterator position) { return position.iter_.inner_->set_.extract(EmbeddedConstIterator(position.iter_.it_)); } template < class K = key_type, typename std::enable_if::value, int>::type = 0> node_type extract(const key_arg& key) { auto it = find(key); return it == end() ? node_type() : extract(const_iterator{it}); } void swap(parallel_hash_set& that) noexcept( IsNoThrowSwappable() && (!AllocTraits::propagate_on_container_swap::value || IsNoThrowSwappable())) { using std::swap; for (size_t i=0; i s; // // Turns "abc" into std::string. // s.count("abc"); // // ch_set s; // // Uses "abc" directly without copying it into std::string. // s.count("abc"); // -------------------------------------------------------------------- template size_t count(const key_arg& key) const { return find(key) == end() ? 0 : 1; } // Issues CPU prefetch instructions for the memory needed to find or insert // a key. Like all lookup functions, this support heterogeneous keys. // // NOTE: This is a very low level operation and should not be used without // specific benchmarks indicating its importance. // -------------------------------------------------------------------- template void prefetch(const key_arg& key) const { (void)key; #if 0 && defined(__GNUC__) size_t hashval = HashElement{hash_ref()}(key); const Inner& inner = sets_[subidx(hashval)]; const auto& set = inner.set_; typename Lockable::UniqueLock m(inner); set.prefetch_hash(hashval); #endif // __GNUC__ } // The API of find() has two extensions. // // 1. The hash can be passed by the user. It must be equal to the hash of the // key. // // 2. The type of the key argument doesn't have to be key_type. This is so // called heterogeneous key support. 
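    //
    // For example (sketch):
    //
    //   Map m;                    // the parallel map from the sketches above
    //   auto it = m.find("abc");  // hashes once, shared-locks one submap
    //   if (it != m.end()) { /* found */ }
    //
    // A caller-supplied hash must equal the table's internal mixed hash for
    // the key, so in practice it is obtained from this class rather than
    // computed by hand.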
// -------------------------------------------------------------------- template iterator find(const key_arg& key, size_t hashval) { Inner& inner = sets_[subidx(hashval)]; auto& set = inner.set_; typename Lockable::SharedLock m(inner); auto it = set.find(key, hashval); return make_iterator(&inner, it); } template iterator find(const key_arg& key) { return find(key, HashElement{hash_ref()}(key)); } template const_iterator find(const key_arg& key, size_t hashval) const { return const_cast(this)->find(key, hashval); } template const_iterator find(const key_arg& key) const { return find(key, HashElement{hash_ref()}(key)); } template bool contains(const key_arg& key) const { return find(key) != end(); } template std::pair equal_range(const key_arg& key) { auto it = find(key); if (it != end()) return {it, std::next(it)}; return {it, it}; } template std::pair equal_range( const key_arg& key) const { auto it = find(key); if (it != end()) return {it, std::next(it)}; return {it, it}; } size_t bucket_count() const { size_t sz = 0; for (const auto& inner : sets_) { typename Lockable::SharedLock m(const_cast(inner)); sz += inner.set_.bucket_count(); } return sz; } float load_factor() const { size_t capacity = bucket_count(); return capacity ? static_cast(static_cast(size()) / capacity) : 0; } float max_load_factor() const { return 1.0f; } void max_load_factor(float) { // Does nothing. } hasher hash_function() const { return hash_ref(); } key_equal key_eq() const { return eq_ref(); } allocator_type get_allocator() const { return alloc_ref(); } friend bool operator==(const parallel_hash_set& a, const parallel_hash_set& b) { return std::equal(a.sets_.begin(), a.sets_.end(), b.sets_.begin()); } friend bool operator!=(const parallel_hash_set& a, const parallel_hash_set& b) { return !(a == b); } friend void swap(parallel_hash_set& a, parallel_hash_set& b) noexcept(noexcept(a.swap(b))) { a.swap(b); } private: template friend struct phmap::container_internal::hashtable_debug_internal::HashtableDebugAccess; struct FindElement { template const_iterator operator()(const K& key, Args&&...) const { return s.find(key); } const parallel_hash_set& s; }; struct HashElement { template size_t operator()(const K& key, Args&&...) const { return phmap_mix()(h(key)); } const hasher& h; }; template struct EqualElement { template bool operator()(const K2& lhs, Args&&...) const { return eq(lhs, rhs); } const K1& rhs; const key_equal& eq; }; // "erases" the object from the container, except that it doesn't actually // destroy the object. It only updates all the metadata of the class. // This can be used in conjunction with Policy::transfer to move the object to // another place. // -------------------------------------------------------------------- void erase_meta_only(const_iterator cit) { auto &it = cit.iter_; assert(it.set_ != nullptr); it.set_.erase_meta_only(const_iterator(it.it_)); } void drop_deletes_without_resize() PHMAP_ATTRIBUTE_NOINLINE { for (auto& inner : sets_) { typename Lockable::UniqueLock m(inner); inner.set_.drop_deletes_without_resize(); } } bool has_element(const value_type& elem) const { size_t hashval = PolicyTraits::apply(HashElement{hash_ref()}, elem); Inner& inner = sets_[subidx(hashval)]; auto& set = inner.set_; typename Lockable::SharedLock m(const_cast(inner)); return set.has_element(elem, hashval); } // TODO(alkis): Optimize this assuming *this and that don't overlap. 
// -------------------------------------------------------------------- parallel_hash_set& move_assign(parallel_hash_set&& that, std::true_type) { parallel_hash_set tmp(std::move(that)); swap(tmp); return *this; } parallel_hash_set& move_assign(parallel_hash_set&& that, std::false_type) { parallel_hash_set tmp(std::move(that), alloc_ref()); swap(tmp); return *this; } protected: template std::tuple find_or_prepare_insert(const K& key, typename Lockable::UniqueLock &mutexlock) { auto hashval = HashElement{hash_ref()}(key); Inner& inner = sets_[subidx(hashval)]; auto& set = inner.set_; mutexlock = std::move(typename Lockable::UniqueLock(inner)); auto p = set.find_or_prepare_insert(key, hashval); // std::pair return std::make_tuple(&inner, p.first, p.second); } iterator iterator_at(Inner *inner, const EmbeddedIterator& it) { return {inner, &sets_[0] + num_tables, it}; } const_iterator iterator_at(Inner *inner, const EmbeddedIterator& it) const { return {inner, &sets_[0] + num_tables, it}; } static size_t subidx(size_t hashval) { return (hashval ^ (hashval >> N)) & mask; } template size_t hash(const K& key) { return HashElement{hash_ref()}(key); } static size_t subcnt() { return num_tables; } private: friend struct RawHashSetTestOnlyAccess; size_t growth_left() { size_t sz = 0; for (const auto& set : sets_) sz += set.growth_left(); return sz; } hasher& hash_ref() { return sets_[0].set_.hash_ref(); } const hasher& hash_ref() const { return sets_[0].set_.hash_ref(); } key_equal& eq_ref() { return sets_[0].set_.eq_ref(); } const key_equal& eq_ref() const { return sets_[0].set_.eq_ref(); } allocator_type& alloc_ref() { return sets_[0].set_.alloc_ref(); } const allocator_type& alloc_ref() const { return sets_[0].set_.alloc_ref(); } std::array sets_; }; // -------------------------------------------------------------------------- // -------------------------------------------------------------------------- template class RefSet, class Mtx_, class Policy, class Hash, class Eq, class Alloc> class parallel_hash_map : public parallel_hash_set { // P is Policy. It's passed as a template argument to support maps that have // incomplete types as values, as in unordered_map. // MappedReference<> may be a non-reference type. template using MappedReference = decltype(P::value( std::addressof(std::declval()))); // MappedConstReference<> may be a non-reference type. template using MappedConstReference = decltype(P::value( std::addressof(std::declval()))); using KeyArgImpl = KeyArg::value && IsTransparent::value>; using Base = typename parallel_hash_map::parallel_hash_set; using Lockable = phmap::LockableImpl; public: using key_type = typename Policy::key_type; using mapped_type = typename Policy::mapped_type; template using key_arg = typename KeyArgImpl::template type; static_assert(!std::is_reference::value, ""); // TODO(alkis): remove this assertion and verify that reference mapped_type is // supported. static_assert(!std::is_reference::value, ""); using iterator = typename parallel_hash_map::parallel_hash_set::iterator; using const_iterator = typename parallel_hash_map::parallel_hash_set::const_iterator; parallel_hash_map() {} #ifdef __INTEL_COMPILER using Base::parallel_hash_set; #else using parallel_hash_map::parallel_hash_set::parallel_hash_set; #endif // The last two template parameters ensure that both arguments are rvalues // (lvalue arguments are handled by the overloads below). This is necessary // for supporting bitfield arguments. 
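    // --------------------------------------------------------------------
    // Worked example of subidx() above with the default N = 4 (mask = 15):
    //
    //   hashval             = 0x12345678
    //   hashval >> 4        = 0x01234567
    //   xor of the two      = 0x1317131F
    //   ... & mask          = 0xF        -> submap 15
    //
    // Folding the next-higher hash bits in before masking spreads keys whose
    // mixed hashes differ only above the low N bits across submaps.
    // --------------------------------------------------------------------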
// // union { int n : 1; }; // flat_hash_map m; // m.insert_or_assign(n, n); template std::pair insert_or_assign(key_arg&& k, V&& v) { return insert_or_assign_impl(std::forward(k), std::forward(v)); } template std::pair insert_or_assign(key_arg&& k, const V& v) { return insert_or_assign_impl(std::forward(k), v); } template std::pair insert_or_assign(const key_arg& k, V&& v) { return insert_or_assign_impl(k, std::forward(v)); } template std::pair insert_or_assign(const key_arg& k, const V& v) { return insert_or_assign_impl(k, v); } template iterator insert_or_assign(const_iterator, key_arg&& k, V&& v) { return insert_or_assign(std::forward(k), std::forward(v)).first; } template iterator insert_or_assign(const_iterator, key_arg&& k, const V& v) { return insert_or_assign(std::forward(k), v).first; } template iterator insert_or_assign(const_iterator, const key_arg& k, V&& v) { return insert_or_assign(k, std::forward(v)).first; } template iterator insert_or_assign(const_iterator, const key_arg& k, const V& v) { return insert_or_assign(k, v).first; } template ::value, int>::type = 0, K* = nullptr> std::pair try_emplace(key_arg&& k, Args&&... args) { return try_emplace_impl(std::forward(k), std::forward(args)...); } template ::value, int>::type = 0> std::pair try_emplace(const key_arg& k, Args&&... args) { return try_emplace_impl(k, std::forward(args)...); } template iterator try_emplace(const_iterator, key_arg&& k, Args&&... args) { return try_emplace(std::forward(k), std::forward(args)...).first; } template iterator try_emplace(const_iterator, const key_arg& k, Args&&... args) { return try_emplace(k, std::forward(args)...).first; } template MappedReference
<P> at(const key_arg<K>& key) {
    auto it = this->find(key);
    if (it == this->end()) std::abort();
    return Policy::value(&*it);
  }

  template <class K = key_type, class P = Policy>
  MappedConstReference<P>
at(const key_arg<K>& key) const {
    auto it = this->find(key);
    if (it == this->end()) std::abort();
    return Policy::value(&*it);
  }

  template <class K = key_type, class P = Policy, K* = nullptr>
  MappedReference<P>
operator[](key_arg<K>&& key) {
    return Policy::value(&*try_emplace(std::forward<K>(key)).first);
  }

  template <class K = key_type, class P = Policy>
  MappedReference<P>
operator[](const key_arg& key) { return Policy::value(&*try_emplace(key).first); } private: template std::pair insert_or_assign_impl(K&& k, V&& v) { typename Lockable::UniqueLock m; auto res = this->find_or_prepare_insert(k, m); typename Base::Inner *inner = std::get<0>(res); if (std::get<2>(res)) inner->set_.emplace_at(std::get<1>(res), std::forward(k), std::forward(v)); else Policy::value(&*inner->set_.iterator_at(std::get<1>(res))) = std::forward(v); return {this->iterator_at(inner, inner->set_.iterator_at(std::get<1>(res))), std::get<2>(res)}; } template std::pair try_emplace_impl(K&& k, Args&&... args) { typename Lockable::UniqueLock m; auto res = this->find_or_prepare_insert(k, m); typename Base::Inner *inner = std::get<0>(res); if (std::get<2>(res)) inner->set_.emplace_at(std::get<1>(res), std::piecewise_construct, std::forward_as_tuple(std::forward(k)), std::forward_as_tuple(std::forward(args)...)); return {this->iterator_at(inner, inner->set_.iterator_at(std::get<1>(res))), std::get<2>(res)}; } }; // Constructs T into uninitialized storage pointed by `ptr` using the args // specified in the tuple. // ---------------------------------------------------------------------------- template void ConstructFromTuple(Alloc* alloc, T* ptr, Tuple&& t) { memory_internal::ConstructFromTupleImpl( alloc, ptr, std::forward(t), phmap::make_index_sequence< std::tuple_size::type>::value>()); } // Constructs T using the args specified in the tuple and calls F with the // constructed value. // ---------------------------------------------------------------------------- template decltype(std::declval()(std::declval())) WithConstructed( Tuple&& t, F&& f) { return memory_internal::WithConstructedImpl( std::forward(t), phmap::make_index_sequence< std::tuple_size::type>::value>(), std::forward(f)); } // ---------------------------------------------------------------------------- // Given arguments of an std::pair's consructor, PairArgs() returns a pair of // tuples with references to the passed arguments. The tuples contain // constructor arguments for the first and the second elements of the pair. // // The following two snippets are equivalent. // // 1. std::pair p(args...); // // 2. auto a = PairArgs(args...); // std::pair p(std::piecewise_construct, // std::move(p.first), std::move(p.second)); // ---------------------------------------------------------------------------- inline std::pair, std::tuple<>> PairArgs() { return {}; } template std::pair, std::tuple> PairArgs(F&& f, S&& s) { return {std::piecewise_construct, std::forward_as_tuple(std::forward(f)), std::forward_as_tuple(std::forward(s))}; } template std::pair, std::tuple> PairArgs( const std::pair& p) { return PairArgs(p.first, p.second); } template std::pair, std::tuple> PairArgs(std::pair&& p) { return PairArgs(std::forward(p.first), std::forward(p.second)); } template auto PairArgs(std::piecewise_construct_t, F&& f, S&& s) -> decltype(std::make_pair(memory_internal::TupleRef(std::forward(f)), memory_internal::TupleRef(std::forward(s)))) { return std::make_pair(memory_internal::TupleRef(std::forward(f)), memory_internal::TupleRef(std::forward(s))); } // A helper function for implementing apply() in map policies. // ---------------------------------------------------------------------------- template auto DecomposePair(F&& f, Args&&... 
args) -> decltype(memory_internal::DecomposePairImpl( std::forward(f), PairArgs(std::forward(args)...))) { return memory_internal::DecomposePairImpl( std::forward(f), PairArgs(std::forward(args)...)); } // A helper function for implementing apply() in set policies. // ---------------------------------------------------------------------------- template decltype(std::declval()(std::declval(), std::declval())) DecomposeValue(F&& f, Arg&& arg) { const auto& key = arg; return std::forward(f)(key, std::forward(arg)); } namespace memory_internal { // ---------------------------------------------------------------------------- // If Pair is a standard-layout type, OffsetOf::kFirst and // OffsetOf::kSecond are equivalent to offsetof(Pair, first) and // offsetof(Pair, second) respectively. Otherwise they are -1. // // The purpose of OffsetOf is to avoid calling offsetof() on non-standard-layout // type, which is non-portable. // ---------------------------------------------------------------------------- template struct OffsetOf { static constexpr size_t kFirst = -1; static constexpr size_t kSecond = -1; }; template struct OffsetOf::type> { static constexpr size_t kFirst = offsetof(Pair, first); static constexpr size_t kSecond = offsetof(Pair, second); }; // ---------------------------------------------------------------------------- template struct IsLayoutCompatible { private: struct Pair { K first; V second; }; // Is P layout-compatible with Pair? template static constexpr bool LayoutCompatible() { return std::is_standard_layout
<P>() && sizeof(P) == sizeof(Pair) && alignof(P) == alignof(Pair) && memory_internal::OffsetOf<P>::kFirst == memory_internal::OffsetOf<Pair>::kFirst && memory_internal::OffsetOf<P>
::kSecond == memory_internal::OffsetOf::kSecond; } public: // Whether pair and pair are layout-compatible. If they are, // then it is safe to store them in a union and read from either. static constexpr bool value = std::is_standard_layout() && std::is_standard_layout() && memory_internal::OffsetOf::kFirst == 0 && LayoutCompatible>() && LayoutCompatible>(); }; } // namespace memory_internal // ---------------------------------------------------------------------------- // The internal storage type for key-value containers like flat_hash_map. // // It is convenient for the value_type of a flat_hash_map to be // pair; the "const K" prevents accidental modification of the key // when dealing with the reference returned from find() and similar methods. // However, this creates other problems; we want to be able to emplace(K, V) // efficiently with move operations, and similarly be able to move a // pair in insert(). // // The solution is this union, which aliases the const and non-const versions // of the pair. This also allows flat_hash_map to work, even though // that has the same efficiency issues with move in emplace() and insert() - // but people do it anyway. // // If kMutableKeys is false, only the value member can be accessed. // // If kMutableKeys is true, key can be accessed through all slots while value // and mutable_value must be accessed only via INITIALIZED slots. Slots are // created and destroyed via mutable_value so that the key can be moved later. // // Accessing one of the union fields while the other is active is safe as // long as they are layout-compatible, which is guaranteed by the definition of // kMutableKeys. For C++11, the relevant section of the standard is // https://timsong-cpp.github.io/cppwp/n3337/class.mem#19 (9.2.19) // ---------------------------------------------------------------------------- template union map_slot_type { map_slot_type() {} ~map_slot_type() = delete; using value_type = std::pair; using mutable_value_type = std::pair; value_type value; mutable_value_type mutable_value; K key; }; // ---------------------------------------------------------------------------- // ---------------------------------------------------------------------------- template struct map_slot_policy { using slot_type = map_slot_type; using value_type = std::pair; using mutable_value_type = std::pair; private: static void emplace(slot_type* slot) { // The construction of union doesn't do anything at runtime but it allows us // to access its members without violating aliasing rules. new (slot) slot_type; } // If pair and pair are layout-compatible, we can accept one // or the other via slot_type. We are also free to access the key via // slot_type::key in this case. using kMutableKeys = memory_internal::IsLayoutCompatible; public: static value_type& element(slot_type* slot) { return slot->value; } static const value_type& element(const slot_type* slot) { return slot->value; } static const K& key(const slot_type* slot) { return kMutableKeys::value ? slot->key : slot->value.first; } template static void construct(Allocator* alloc, slot_type* slot, Args&&... args) { emplace(slot); if (kMutableKeys::value) { phmap::allocator_traits::construct(*alloc, &slot->mutable_value, std::forward(args)...); } else { phmap::allocator_traits::construct(*alloc, &slot->value, std::forward(args)...); } } // Construct this slot by moving from another slot. 
template static void construct(Allocator* alloc, slot_type* slot, slot_type* other) { emplace(slot); if (kMutableKeys::value) { phmap::allocator_traits::construct( *alloc, &slot->mutable_value, std::move(other->mutable_value)); } else { phmap::allocator_traits::construct(*alloc, &slot->value, std::move(other->value)); } } template static void destroy(Allocator* alloc, slot_type* slot) { if (kMutableKeys::value) { phmap::allocator_traits::destroy(*alloc, &slot->mutable_value); } else { phmap::allocator_traits::destroy(*alloc, &slot->value); } } template static void transfer(Allocator* alloc, slot_type* new_slot, slot_type* old_slot) { emplace(new_slot); if (kMutableKeys::value) { phmap::allocator_traits::construct( *alloc, &new_slot->mutable_value, std::move(old_slot->mutable_value)); } else { phmap::allocator_traits::construct(*alloc, &new_slot->value, std::move(old_slot->value)); } destroy(alloc, old_slot); } template static void swap(Allocator* alloc, slot_type* a, slot_type* b) { if (kMutableKeys::value) { using std::swap; swap(a->mutable_value, b->mutable_value); } else { value_type tmp = std::move(a->value); phmap::allocator_traits::destroy(*alloc, &a->value); phmap::allocator_traits::construct(*alloc, &a->value, std::move(b->value)); phmap::allocator_traits::destroy(*alloc, &b->value); phmap::allocator_traits::construct(*alloc, &b->value, std::move(tmp)); } } template static void move(Allocator* alloc, slot_type* src, slot_type* dest) { if (kMutableKeys::value) { dest->mutable_value = std::move(src->mutable_value); } else { phmap::allocator_traits::destroy(*alloc, &dest->value); phmap::allocator_traits::construct(*alloc, &dest->value, std::move(src->value)); } } template static void move(Allocator* alloc, slot_type* first, slot_type* last, slot_type* result) { for (slot_type *src = first, *dest = result; src != last; ++src, ++dest) move(alloc, src, dest); } }; // -------------------------------------------------------------------------- // Policy: a policy defines how to perform different operations on // the slots of the hashtable (see hash_policy_traits.h for the full interface // of policy). // // Hash: a (possibly polymorphic) functor that hashes keys of the hashtable. The // functor should accept a key and return size_t as hash. For best performance // it is important that the hash function provides high entropy across all bits // of the hash. // // Eq: a (possibly polymorphic) functor that compares two keys for equality. It // should accept two (of possibly different type) keys and return a bool: true // if they are equal, false if they are not. If two keys compare equal, then // their hash values as defined by Hash MUST be equal. // // Allocator: an Allocator [https://devdocs.io/cpp/concept/allocator] with which // the storage of the hashtable will be allocated and the elements will be // constructed and destroyed. // -------------------------------------------------------------------------- template struct FlatHashSetPolicy { using slot_type = T; using key_type = T; using init_type = T; using constant_iterators = std::true_type; template static void construct(Allocator* alloc, slot_type* slot, Args&&... 
args) { phmap::allocator_traits::construct(*alloc, slot, std::forward(args)...); } template static void destroy(Allocator* alloc, slot_type* slot) { phmap::allocator_traits::destroy(*alloc, slot); } template static void transfer(Allocator* alloc, slot_type* new_slot, slot_type* old_slot) { construct(alloc, new_slot, std::move(*old_slot)); destroy(alloc, old_slot); } static T& element(slot_type* slot) { return *slot; } template static decltype(phmap::container_internal::DecomposeValue( std::declval(), std::declval()...)) apply(F&& f, Args&&... args) { return phmap::container_internal::DecomposeValue( std::forward(f), std::forward(args)...); } static size_t space_used(const T*) { return 0; } }; // -------------------------------------------------------------------------- // -------------------------------------------------------------------------- template struct FlatHashMapPolicy { using slot_policy = container_internal::map_slot_policy; using slot_type = typename slot_policy::slot_type; using key_type = K; using mapped_type = V; using init_type = std::pair; template static void construct(Allocator* alloc, slot_type* slot, Args&&... args) { slot_policy::construct(alloc, slot, std::forward(args)...); } template static void destroy(Allocator* alloc, slot_type* slot) { slot_policy::destroy(alloc, slot); } template static void transfer(Allocator* alloc, slot_type* new_slot, slot_type* old_slot) { slot_policy::transfer(alloc, new_slot, old_slot); } template static decltype(phmap::container_internal::DecomposePair( std::declval(), std::declval()...)) apply(F&& f, Args&&... args) { return phmap::container_internal::DecomposePair(std::forward(f), std::forward(args)...); } static size_t space_used(const slot_type*) { return 0; } static std::pair& element(slot_type* slot) { return slot->value; } static V& value(std::pair* kv) { return kv->second; } static const V& value(const std::pair* kv) { return kv->second; } }; template struct node_hash_policy { static_assert(std::is_lvalue_reference::value, ""); using slot_type = typename std::remove_cv< typename std::remove_reference::type>::type*; template static void construct(Alloc* alloc, slot_type* slot, Args&&... args) { *slot = Policy::new_element(alloc, std::forward(args)...); } template static void destroy(Alloc* alloc, slot_type* slot) { Policy::delete_element(alloc, *slot); } template static void transfer(Alloc*, slot_type* new_slot, slot_type* old_slot) { *new_slot = *old_slot; } static size_t space_used(const slot_type* slot) { if (slot == nullptr) return Policy::element_space_used(nullptr); return Policy::element_space_used(*slot); } static Reference element(slot_type* slot) { return **slot; } template static auto value(T* elem) -> decltype(P::value(elem)) { return P::value(elem); } template static auto apply(Ts&&... ts) -> decltype(P::apply(std::forward(ts)...)) { return P::apply(std::forward(ts)...); } }; // -------------------------------------------------------------------------- // -------------------------------------------------------------------------- template struct NodeHashSetPolicy : phmap::container_internal::node_hash_policy> { using key_type = T; using init_type = T; using constant_iterators = std::true_type; template static T* new_element(Allocator* alloc, Args&&... 
args) { using ValueAlloc = typename phmap::allocator_traits::template rebind_alloc; ValueAlloc value_alloc(*alloc); T* res = phmap::allocator_traits::allocate(value_alloc, 1); phmap::allocator_traits::construct(value_alloc, res, std::forward(args)...); return res; } template static void delete_element(Allocator* alloc, T* elem) { using ValueAlloc = typename phmap::allocator_traits::template rebind_alloc; ValueAlloc value_alloc(*alloc); phmap::allocator_traits::destroy(value_alloc, elem); phmap::allocator_traits::deallocate(value_alloc, elem, 1); } template static decltype(phmap::container_internal::DecomposeValue( std::declval(), std::declval()...)) apply(F&& f, Args&&... args) { return phmap::container_internal::DecomposeValue( std::forward(f), std::forward(args)...); } static size_t element_space_used(const T*) { return sizeof(T); } }; // -------------------------------------------------------------------------- // -------------------------------------------------------------------------- template class NodeHashMapPolicy : public phmap::container_internal::node_hash_policy< std::pair&, NodeHashMapPolicy> { using value_type = std::pair; public: using key_type = Key; using mapped_type = Value; using init_type = std::pair; template static value_type* new_element(Allocator* alloc, Args&&... args) { using PairAlloc = typename phmap::allocator_traits< Allocator>::template rebind_alloc; PairAlloc pair_alloc(*alloc); value_type* res = phmap::allocator_traits::allocate(pair_alloc, 1); phmap::allocator_traits::construct(pair_alloc, res, std::forward(args)...); return res; } template static void delete_element(Allocator* alloc, value_type* pair) { using PairAlloc = typename phmap::allocator_traits< Allocator>::template rebind_alloc; PairAlloc pair_alloc(*alloc); phmap::allocator_traits::destroy(pair_alloc, pair); phmap::allocator_traits::deallocate(pair_alloc, pair, 1); } template static decltype(phmap::container_internal::DecomposePair( std::declval(), std::declval()...)) apply(F&& f, Args&&... args) { return phmap::container_internal::DecomposePair(std::forward(f), std::forward(args)...); } static size_t element_space_used(const value_type*) { return sizeof(value_type); } static Value& value(value_type* elem) { return elem->second; } static const Value& value(const value_type* elem) { return elem->second; } }; // -------------------------------------------------------------------------- // hash_default // -------------------------------------------------------------------------- #if 0 struct int64_t_hash { using is_transparent = void; size_t operator()(int64_t v) const { return (size_t)v; } }; template <> struct HashEq { using Hash = int64_t_hash; using Eq = std::equal_to; }; #endif #if PHMAP_HAVE_STD_STRING_VIEW struct StringHash { using is_transparent = void; size_t operator()(std::string_view v) const { return phmap::Hash{}(v); } }; // Supports heterogeneous lookup for string-like elements. struct StringHashEq { using Hash = StringHash; struct Eq { using is_transparent = void; bool operator()(std::string_view lhs, std::string_view rhs) const { return lhs == rhs; } }; }; template <> struct HashEq : StringHashEq {}; template <> struct HashEq : StringHashEq {}; #endif // Supports heterogeneous lookup for pointers and smart pointers. 
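// For example (an illustrative sketch; `Widget` is a placeholder type), a
// container keyed by smart pointers can be probed with a raw pointer, since
// the transparent Hash and Eq below hash and compare through ToPtr():
//
//   phmap::flat_hash_set<std::unique_ptr<Widget>> widgets;
//   Widget* raw = ...;
//   bool present = widgets.find(raw) != widgets.end();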
template struct HashEq { struct Hash { using is_transparent = void; template size_t operator()(const U& ptr) const { return phmap::Hash{}(HashEq::ToPtr(ptr)); } }; struct Eq { using is_transparent = void; template bool operator()(const A& a, const B& b) const { return HashEq::ToPtr(a) == HashEq::ToPtr(b); } }; private: static const T* ToPtr(const T* ptr) { return ptr; } template static const T* ToPtr(const std::unique_ptr& ptr) { return ptr.get(); } template static const T* ToPtr(const std::shared_ptr& ptr) { return ptr.get(); } }; template struct HashEq> : HashEq {}; template struct HashEq> : HashEq {}; namespace hashtable_debug_internal { // -------------------------------------------------------------------------- // -------------------------------------------------------------------------- template struct HashtableDebugAccess> { using Traits = typename Set::PolicyTraits; using Slot = typename Traits::slot_type; static size_t GetNumProbes(const Set& set, const typename Set::key_type& key) { size_t num_probes = 0; size_t hash = typename Set::HashElement{set.hash_ref()}(key); auto seq = set.probe(hash); while (true) { container_internal::Group g{set.ctrl_ + seq.offset()}; for (int i : g.Match(container_internal::H2(hash))) { if (Traits::apply( typename Set::template EqualElement{ key, set.eq_ref()}, Traits::element(set.slots_ + seq.offset(i)))) return num_probes; ++num_probes; } if (g.MatchEmpty()) return num_probes; seq.next(); ++num_probes; } } static size_t AllocatedByteSize(const Set& c) { size_t capacity = c.capacity_; if (capacity == 0) return 0; auto layout = Set::MakeLayout(capacity); size_t m = layout.AllocSize(); size_t per_slot = Traits::space_used(static_cast(nullptr)); if (per_slot != ~size_t{}) { m += per_slot * c.size(); } else { for (size_t i = 0; i != capacity; ++i) { if (container_internal::IsFull(c.ctrl_[i])) { m += Traits::space_used(c.slots_ + i); } } } return m; } static size_t LowerBoundAllocatedByteSize(size_t size) { size_t capacity = GrowthToLowerboundCapacity(size); if (capacity == 0) return 0; auto layout = Set::MakeLayout(NormalizeCapacity(capacity)); size_t m = layout.AllocSize(); size_t per_slot = Traits::space_used(static_cast(nullptr)); if (per_slot != ~size_t{}) { m += per_slot * size; } return m; } }; } // namespace hashtable_debug_internal } // namespace container_internal // ----------------------------------------------------------------------------- // phmap::flat_hash_set // ----------------------------------------------------------------------------- // An `phmap::flat_hash_set` is an unordered associative container which has // been optimized for both speed and memory footprint in most common use cases. // Its interface is similar to that of `std::unordered_set` with the // following notable differences: // // * Requires keys that are CopyConstructible // * Supports heterogeneous lookup, through `find()`, `operator[]()` and // `insert()`, provided that the set is provided a compatible heterogeneous // hashing function and equality operator. // * Invalidates any references and pointers to elements within the table after // `rehash()`. // * Contains a `capacity()` member function indicating the number of element // slots (open, deleted, and empty) within the hash set. // * Returns `void` from the `erase(iterator)` overload. 
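//
// Example (a minimal usage sketch; the identifiers are illustrative only):
//
//   phmap::flat_hash_set<std::string> ducks = {"huey", "dewey", "louie"};
//   ducks.insert("donald");
//   bool has_dewey = ducks.contains("dewey");  // heterogeneous lookup via
//                                              // StringHashEq above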
// ----------------------------------------------------------------------------- template // default values in phmap_fwd_decl.h class flat_hash_set : public phmap::container_internal::raw_hash_set< phmap::container_internal::FlatHashSetPolicy, Hash, Eq, Alloc> { using Base = typename flat_hash_set::raw_hash_set; public: flat_hash_set() {} #ifdef __INTEL_COMPILER using Base::raw_hash_set; #else using Base::Base; #endif using Base::begin; using Base::cbegin; using Base::cend; using Base::end; using Base::capacity; using Base::empty; using Base::max_size; using Base::size; using Base::clear; // may shrink - To avoid shrinking `erase(begin(), end())` using Base::erase; using Base::insert; using Base::emplace; using Base::emplace_hint; using Base::extract; using Base::merge; using Base::swap; using Base::rehash; using Base::reserve; using Base::contains; using Base::count; using Base::equal_range; using Base::find; using Base::bucket_count; using Base::load_factor; using Base::max_load_factor; using Base::get_allocator; using Base::hash_function; using Base::key_eq; }; // ----------------------------------------------------------------------------- // phmap::flat_hash_map // ----------------------------------------------------------------------------- // // An `phmap::flat_hash_map` is an unordered associative container which // has been optimized for both speed and memory footprint in most common use // cases. Its interface is similar to that of `std::unordered_map` with // the following notable differences: // // * Requires keys that are CopyConstructible // * Requires values that are MoveConstructible // * Supports heterogeneous lookup, through `find()`, `operator[]()` and // `insert()`, provided that the map is provided a compatible heterogeneous // hashing function and equality operator. // * Invalidates any references and pointers to elements within the table after // `rehash()`. // * Contains a `capacity()` member function indicating the number of element // slots (open, deleted, and empty) within the hash map. // * Returns `void` from the `erase(iterator)` overload. 
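//
// Example (a minimal usage sketch; the identifiers are illustrative only):
//
//   phmap::flat_hash_map<std::string, int> counts;
//   counts["foo"] += 1;             // operator[] default-constructs the value
//   counts.try_emplace("bar", 42);  // no-op if "bar" is already present
//   auto it = counts.find(std::string_view("foo"));  // heterogeneous find,
//                                   // when std::string_view is available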
// ----------------------------------------------------------------------------- template // default values in phmap_fwd_decl.h class flat_hash_map : public phmap::container_internal::raw_hash_map< phmap::container_internal::FlatHashMapPolicy, Hash, Eq, Alloc> { using Base = typename flat_hash_map::raw_hash_map; public: flat_hash_map() {} #ifdef __INTEL_COMPILER using Base::raw_hash_map; #else using Base::Base; #endif using Base::begin; using Base::cbegin; using Base::cend; using Base::end; using Base::capacity; using Base::empty; using Base::max_size; using Base::size; using Base::clear; using Base::erase; using Base::insert; using Base::insert_or_assign; using Base::emplace; using Base::emplace_hint; using Base::try_emplace; using Base::extract; using Base::merge; using Base::swap; using Base::rehash; using Base::reserve; using Base::at; using Base::contains; using Base::count; using Base::equal_range; using Base::find; using Base::operator[]; using Base::bucket_count; using Base::load_factor; using Base::max_load_factor; using Base::get_allocator; using Base::hash_function; using Base::key_eq; }; // ----------------------------------------------------------------------------- // phmap::node_hash_set // ----------------------------------------------------------------------------- // An `phmap::node_hash_set` is an unordered associative container which // has been optimized for both speed and memory footprint in most common use // cases. Its interface is similar to that of `std::unordered_set` with the // following notable differences: // // * Supports heterogeneous lookup, through `find()`, `operator[]()` and // `insert()`, provided that the map is provided a compatible heterogeneous // hashing function and equality operator. // * Contains a `capacity()` member function indicating the number of element // slots (open, deleted, and empty) within the hash set. // * Returns `void` from the `erase(iterator)` overload. // ----------------------------------------------------------------------------- template // default values in phmap_fwd_decl.h class node_hash_set : public phmap::container_internal::raw_hash_set< phmap::container_internal::NodeHashSetPolicy, Hash, Eq, Alloc> { using Base = typename node_hash_set::raw_hash_set; public: node_hash_set() {} #ifdef __INTEL_COMPILER using Base::raw_hash_set; #else using Base::Base; #endif using Base::begin; using Base::cbegin; using Base::cend; using Base::end; using Base::capacity; using Base::empty; using Base::max_size; using Base::size; using Base::clear; using Base::erase; using Base::insert; using Base::emplace; using Base::emplace_hint; using Base::extract; using Base::merge; using Base::swap; using Base::rehash; using Base::reserve; using Base::contains; using Base::count; using Base::equal_range; using Base::find; using Base::bucket_count; using Base::load_factor; using Base::max_load_factor; using Base::get_allocator; using Base::hash_function; using Base::key_eq; typename Base::hasher hash_funct() { return this->hash_function(); } void resize(typename Base::size_type hint) { this->rehash(hint); } }; // ----------------------------------------------------------------------------- // phmap::node_hash_map // ----------------------------------------------------------------------------- // // An `phmap::node_hash_map` is an unordered associative container which // has been optimized for both speed and memory footprint in most common use // cases. 
Its interface is similar to that of `std::unordered_map` with // the following notable differences: // // * Supports heterogeneous lookup, through `find()`, `operator[]()` and // `insert()`, provided that the map is provided a compatible heterogeneous // hashing function and equality operator. // * Contains a `capacity()` member function indicating the number of element // slots (open, deleted, and empty) within the hash map. // * Returns `void` from the `erase(iterator)` overload. // ----------------------------------------------------------------------------- template // default values in phmap_fwd_decl.h class node_hash_map : public phmap::container_internal::raw_hash_map< phmap::container_internal::NodeHashMapPolicy, Hash, Eq, Alloc> { using Base = typename node_hash_map::raw_hash_map; public: node_hash_map() {} #ifdef __INTEL_COMPILER using Base::raw_hash_map; #else using Base::Base; #endif using Base::begin; using Base::cbegin; using Base::cend; using Base::end; using Base::capacity; using Base::empty; using Base::max_size; using Base::size; using Base::clear; using Base::erase; using Base::insert; using Base::insert_or_assign; using Base::emplace; using Base::emplace_hint; using Base::try_emplace; using Base::extract; using Base::merge; using Base::swap; using Base::rehash; using Base::reserve; using Base::at; using Base::contains; using Base::count; using Base::equal_range; using Base::find; using Base::operator[]; using Base::bucket_count; using Base::load_factor; using Base::max_load_factor; using Base::get_allocator; using Base::hash_function; using Base::key_eq; typename Base::hasher hash_funct() { return this->hash_function(); } void resize(typename Base::size_type hint) { this->rehash(hint); } }; // ----------------------------------------------------------------------------- // phmap::parallel_flat_hash_set // ----------------------------------------------------------------------------- template // default values in phmap_fwd_decl.h class parallel_flat_hash_set : public phmap::container_internal::parallel_hash_set< N, phmap::container_internal::raw_hash_set, Mtx_, phmap::container_internal::FlatHashSetPolicy, Hash, Eq, Alloc> { using Base = typename parallel_flat_hash_set::parallel_hash_set; public: parallel_flat_hash_set() {} #ifdef __INTEL_COMPILER using Base::parallel_hash_set; #else using Base::Base; #endif using Base::hash; using Base::subidx; using Base::subcnt; using Base::begin; using Base::cbegin; using Base::cend; using Base::end; using Base::capacity; using Base::empty; using Base::max_size; using Base::size; using Base::clear; using Base::erase; using Base::insert; using Base::emplace; using Base::emplace_hint; using Base::extract; using Base::merge; using Base::swap; using Base::rehash; using Base::reserve; using Base::contains; using Base::count; using Base::equal_range; using Base::find; using Base::bucket_count; using Base::load_factor; using Base::max_load_factor; using Base::get_allocator; using Base::hash_function; using Base::key_eq; }; // ----------------------------------------------------------------------------- // phmap::parallel_flat_hash_map - default values in phmap_fwd_decl.h // ----------------------------------------------------------------------------- template class parallel_flat_hash_map : public phmap::container_internal::parallel_hash_map< N, phmap::container_internal::raw_hash_set, Mtx_, phmap::container_internal::FlatHashMapPolicy, Hash, Eq, Alloc> { using Base = typename parallel_flat_hash_map::parallel_hash_map; public: 
parallel_flat_hash_map() {} #ifdef __INTEL_COMPILER using Base::parallel_hash_map; #else using Base::Base; #endif using Base::hash; using Base::subidx; using Base::subcnt; using Base::begin; using Base::cbegin; using Base::cend; using Base::end; using Base::capacity; using Base::empty; using Base::max_size; using Base::size; using Base::clear; using Base::erase; using Base::insert; using Base::insert_or_assign; using Base::emplace; using Base::emplace_hint; using Base::try_emplace; using Base::extract; using Base::merge; using Base::swap; using Base::rehash; using Base::reserve; using Base::at; using Base::contains; using Base::count; using Base::equal_range; using Base::find; using Base::operator[]; using Base::bucket_count; using Base::load_factor; using Base::max_load_factor; using Base::get_allocator; using Base::hash_function; using Base::key_eq; }; // ----------------------------------------------------------------------------- // phmap::parallel_node_hash_set // ----------------------------------------------------------------------------- template class parallel_node_hash_set : public phmap::container_internal::parallel_hash_set< N, phmap::container_internal::raw_hash_set, Mtx_, phmap::container_internal::NodeHashSetPolicy, Hash, Eq, Alloc> { using Base = typename parallel_node_hash_set::parallel_hash_set; public: parallel_node_hash_set() {} #ifdef __INTEL_COMPILER using Base::parallel_hash_set; #else using Base::Base; #endif using Base::hash; using Base::subidx; using Base::subcnt; using Base::begin; using Base::cbegin; using Base::cend; using Base::end; using Base::capacity; using Base::empty; using Base::max_size; using Base::size; using Base::clear; using Base::erase; using Base::insert; using Base::emplace; using Base::emplace_hint; using Base::extract; using Base::merge; using Base::swap; using Base::rehash; using Base::reserve; using Base::contains; using Base::count; using Base::equal_range; using Base::find; using Base::bucket_count; using Base::load_factor; using Base::max_load_factor; using Base::get_allocator; using Base::hash_function; using Base::key_eq; typename Base::hasher hash_funct() { return this->hash_function(); } void resize(typename Base::size_type hint) { this->rehash(hint); } }; // ----------------------------------------------------------------------------- // phmap::parallel_node_hash_map // ----------------------------------------------------------------------------- template class parallel_node_hash_map : public phmap::container_internal::parallel_hash_map< N, phmap::container_internal::raw_hash_set, Mtx_, phmap::container_internal::NodeHashMapPolicy, Hash, Eq, Alloc> { using Base = typename parallel_node_hash_map::parallel_hash_map; public: parallel_node_hash_map() {} #ifdef __INTEL_COMPILER using Base::parallel_hash_map; #else using Base::Base; #endif using Base::hash; using Base::subidx; using Base::subcnt; using Base::begin; using Base::cbegin; using Base::cend; using Base::end; using Base::capacity; using Base::empty; using Base::max_size; using Base::size; using Base::clear; using Base::erase; using Base::insert; using Base::insert_or_assign; using Base::emplace; using Base::emplace_hint; using Base::try_emplace; using Base::extract; using Base::merge; using Base::swap; using Base::rehash; using Base::reserve; using Base::at; using Base::contains; using Base::count; using Base::equal_range; using Base::find; using Base::operator[]; using Base::bucket_count; using Base::load_factor; using Base::max_load_factor; using Base::get_allocator; using 
Base::hash_function; using Base::key_eq; typename Base::hasher hash_funct() { return this->hash_function(); } void resize(typename Base::size_type hint) { this->rehash(hint); } }; } // namespace phmap #endif // phmap_h_guard_ megahit-1.2.9/src/parallel_hashmap/phmap_base.h000066400000000000000000005224351355123202700215100ustar00rootroot00000000000000#if !defined(phmap_base_h_guard_) #define phmap_base_h_guard_ // --------------------------------------------------------------------------- // Copyright (c) 2019, Gregory Popovitch - greg7mdp@gmail.com // // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. // You may obtain a copy of the License at // // https://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. // // Includes work from abseil-cpp (https://github.com/abseil/abseil-cpp) // with modifications. // // Copyright 2018 The Abseil Authors. // // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. // You may obtain a copy of the License at // // https://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. // --------------------------------------------------------------------------- #include #include #include #include #include #include #include #include #include #include #include #include #include // for std::lock #include "phmap_config.h" namespace phmap { template using Allocator = typename std::allocator; template using Pair = typename std::pair; template struct EqualTo { inline bool operator()(const T& a, const T& b) const { return std::equal_to()(a, b); } }; namespace type_traits_internal { template struct VoidTImpl { using type = void; }; // This trick to retrieve a default alignment is necessary for our // implementation of aligned_storage_t to be consistent with any implementation // of std::aligned_storage. // --------------------------------------------------------------------------- template > struct default_alignment_of_aligned_storage; template struct default_alignment_of_aligned_storage> { static constexpr size_t value = Align; }; // NOTE: The `is_detected` family of templates here differ from the library // fundamentals specification in that for library fundamentals, `Op` is // evaluated as soon as the type `is_detected` undergoes // substitution, regardless of whether or not the `::value` is accessed. That // is inconsistent with all other standard traits and prevents lazy evaluation // in larger contexts (such as if the `is_detected` check is a trailing argument // of a `conjunction`. This implementation opts to instead be lazy in the same // way that the standard traits are (this "defect" of the detection idiom // specifications has been reported). // --------------------------------------------------------------------------- template class Op, class... 
Args> struct is_detected_impl { using type = std::false_type; }; template