KMC-2.3/
KMC-2.3/.gitattributes
# Auto detect text files and perform LF normalization
* text=auto

# Custom for Visual Studio
*.cs diff=csharp
*.sln merge=union
*.csproj merge=union
*.vbproj merge=union
*.fsproj merge=union
*.dbproj merge=union

# Standard to msysgit
*.doc diff=astextplain
*.DOC diff=astextplain
*.docx diff=astextplain
*.DOCX diff=astextplain
*.dot diff=astextplain
*.DOT diff=astextplain
*.pdf diff=astextplain
*.PDF diff=astextplain
*.rtf diff=astextplain
*.RTF diff=astextplain

KMC-2.3/.gitignore
# Windows image file caches
Thumbs.db
ehthumbs.db

# Folder config file
Desktop.ini

# Recycle Bin used on file shares
$RECYCLE.BIN/

# Windows Installer files
*.cab
*.msi
*.msm
*.msp

# =========================
# Operating System Files
# =========================

# OSX
# =========================

.DS_Store
.AppleDouble
.LSOverride

# Icon must end with two \r
Icon

# Thumbnails
._*

# Files that might appear on external disk
.Spotlight-V100
.Trashes

# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk

KMC-2.3/README.md
KMC
=
KMC is a disk-based program for counting k-mers from (possibly gzipped) FASTQ/FASTA files.
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

For accessing k-mers stored in a database produced by KMC there is an API (kmc_api directory). Note that for KMC versions 0.x and 1.x the database format differs from the one produced by KMC version 2.x.
From version 2.2.0 the API is unified for both formats, and all new features and bug fixes are present only in the 2.x branch (the standalone API for older KMC versions is no longer under development, so the new version of the API should be used even for databases produced by older KMC versions).

Installation
=
The following libraries come with KMC in a binary (64-bit compiled for x86 platform) form.
If your system needs other binary formats, you should put the following libraries in kmer_counter/libs:
* asmlib - for fast memcpy operation (http://www.agner.org/optimize/asmlib-instructions.pdf)
* libbzip2 - for support for bzip2-compressed input FASTQ/FASTA files (http://www.bzip.org/)
* zlib - for support for gzip-compressed input FASTQ/FASTA files (http://www.zlib.net/)

Note: asmlib is free only for non-commercial purposes. If needed, you can contact the author of asmlib or compile KMC without asmlib.

If needed, you can also redefine the maximal length of a k-mer, which is 256 in the current version.

Note: KMC is highly optimized and spends only as many bytes per k-mer (rounded up to 8) as necessary, so using large values of MAX_K does not affect the KMC performance for short k-mers.

Some parts of KMC use C++11 features, so you need a compatible C++ compiler, e.g., gcc 4.7 or higher.

After that, you can run make to compile the kmc and kmc_dump applications.
If you want to compile kmc without asmlib, run: make DISABLE_ASMLIB=true

#####Additional information for MAC OS installation

For compilation under MAC OS there is makefile_mac.
Usage: make -f makefile_mac

There might be a need to change the g++ path in makefile_mac.
If needed, we recommend installing g++ with brew (http://brew.sh/).

Note that KMC creates hundreds of temporary files, while the default limit for open files is small on the MAC OS platform.
To increase this number use the following command before running KMC: ulimit -n 2048

Directory structure
=
* bin - main directory of KMC (programs after compilation will be stored here)
* kmer_counter - source code of kmc program
* kmer_counter/libs - compiled binary versions of libraries used by KMC
* kmc_api - C++ source codes implementing API; must be used by any program that wants to process databases produced by kmc
* kmc_dump - source codes of kmc_dump program listing k-mers in databases produced by kmc

Binaries
=
After compilation you will obtain two binaries:
* bin/kmc - the main program for counting k-mer occurrences
* bin/kmc_dump - the program listing k-mers in a database produced by kmc

License
=
* KMC software distributed under GNU GPL 2 licence.
* libbzip2 is open-source (BSD-style license)
* gzip is free, open-source
* asmlib is under the licence GNU GPL 3 or higher

Note: for commercial usage of asmlib follow the instructions in 'License conditions' (http://www.agner.org/optimize/asmlib-instructions.pdf) or compile KMC without asmlib.

In case of doubt, please consult the original documentations.

Warranty
=
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

KMC-2.3/kmc_api/
KMC-2.3/kmc_api/kmc_file.cpp
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot

  Version: 2.3.0
  Date   : 2015-08-21
*/

#include "stdafx.h"
#include "mmer.h"
#include "kmc_file.h"
// NOTE: the names of the two system headers included here were lost in extraction.
// The two below are an assumption: they are the standard headers that declare
// memcpy/strncmp and fill_n, which this file uses.
#include <cstring>
#include <algorithm>


uint64 CKMCFile::part_size = 1 << 25;


// ----------------------------------------------------------------------------------
// Open files *.kmc_pre & *.kmc_suf, read them to RAM, close files.
// The file *.kmc_suf is opened for random access
// IN	: file_name - the name of kmer_counter's output
// RET	: true      - if successful
// ----------------------------------------------------------------------------------
bool CKMCFile::OpenForRA(const std::string &file_name)
{
	uint64 size;
	size_t result;

	if (file_pre || file_suf)
		return false;

	if (!OpenASingleFile(file_name + ".kmc_pre", file_pre, size, (char *)"KMCP"))
		return false;

	ReadParamsFrom_prefix_file_buf(size);

	fclose(file_pre);
	file_pre = NULL;

	if (!OpenASingleFile(file_name + ".kmc_suf", file_suf, size, (char *)"KMCS"))
		return false;

	sufix_file_buf = new uchar[size];
	result = fread(sufix_file_buf, 1, size, file_suf);
	if (result == 0)
		return false;

	fclose(file_suf);
	file_suf = NULL;

	is_opened = opened_for_RA;
	prefix_index = 0;
	sufix_number = 0;
	return true;
}

//----------------------------------------------------------------------------------
// Open files *.kmc_pre & *.kmc_suf, read *.kmc_pre to RAM, close *.kmc_pre
// *.kmc_suf is buffered
// IN	: file_name - the name of kmer_counter's output
// RET	: true      - if successful
//----------------------------------------------------------------------------------
bool CKMCFile::OpenForListing(const std::string &file_name)
{
	uint64 size;
	size_t result;

	if (is_opened)
		return false;

	if (file_pre || file_suf)
		return false;

	if (!OpenASingleFile(file_name + ".kmc_pre", file_pre, size, (char *)"KMCP"))
		return false;

	ReadParamsFrom_prefix_file_buf(size);
	fclose(file_pre);
	file_pre = NULL;

	end_of_file = total_kmers == 0;

	if
	(!OpenASingleFile(file_name + ".kmc_suf", file_suf, size, (char *)"KMCS"))
		return false;

	sufix_file_buf = new uchar[part_size];
	result = fread(sufix_file_buf, 1, part_size, file_suf);
	if (result == 0)
		return false;

	is_opened = opened_for_listing;
	prefix_index = 0;
	sufix_number = 0;
	index_in_partial_buf = 0;
	return true;
}

//----------------------------------------------------------------------------------
CKMCFile::CKMCFile()
{
	file_pre = NULL;
	file_suf = NULL;
	prefix_file_buf = NULL;
	sufix_file_buf = NULL;
	signature_map = NULL;
	is_opened = closed;
	end_of_file = false;
};

//----------------------------------------------------------------------------------
CKMCFile::~CKMCFile()
{
	if (file_pre)
		fclose(file_pre);
	if (file_suf)
		fclose(file_suf);
	if (prefix_file_buf)
		delete[] prefix_file_buf;
	if (sufix_file_buf)
		delete[] sufix_file_buf;
	if (signature_map)
		delete[] signature_map;
};

//----------------------------------------------------------------------------------
// Open a file, recognize its size and check its marker. Auxiliary function.
// IN	: file_name - the name of a file to open
// RET	: true      - if successful
//----------------------------------------------------------------------------------
bool CKMCFile::OpenASingleFile(const std::string &file_name, FILE *&file_handler, uint64 &size, char marker[])
{
	char _marker[4];
	size_t result;

	if ((file_handler = my_fopen(file_name.c_str(), "rb")) == NULL)
		return false;

	my_fseek(file_handler, 0, SEEK_END);
	size = my_ftell(file_handler);					//the size of a whole file

	my_fseek(file_handler, -4, SEEK_CUR);
	result = fread(_marker, 1, 4, file_handler);
	if (result == 0)
		return false;

	size = size - 4;							//the size of the file without the terminal marker
	if (strncmp(marker, _marker, 4) != 0)
	{
		fclose(file_handler);
		file_handler = NULL;
		return false;
	}

	rewind(file_handler);
	result = fread(_marker, 1, 4, file_handler);
	if (result == 0)
		return false;

	size = size - 4;							//the size of the file without initial and terminal markers

	if (strncmp(marker, _marker, 4) != 0)
	{
		fclose(file_handler);
		file_handler = NULL;
		return false;
	}

	return true;
};

//-------------------------------------------------------------------------------------
// Recognize current parameters from kmc_database. Auxiliary function.
// IN	: the size of the file *.kmc_pre, without initial and terminal markers
// RET	: true - if successful
//----------------------------------------------------------------------------------
bool CKMCFile::ReadParamsFrom_prefix_file_buf(uint64 &size)
{
	size_t prev_pos = my_ftell(file_pre);
	my_fseek(file_pre, -12, SEEK_END);
	size_t result;

	result = fread(&kmc_version, sizeof(uint32), 1, file_pre);
	if (kmc_version != 0 && kmc_version != 0x200) //only these versions are supported, 0 = kmc1, 0x200 = kmc2
		return false;
	my_fseek(file_pre, prev_pos, SEEK_SET);

	if (kmc_version == 0x200)
	{
		my_fseek(file_pre, -8, SEEK_END);

		int64 header_offset;
		header_offset = fgetc(file_pre);

		size = size - 4;	//file size without the size of header_offset (and without 2 markers)

		my_fseek(file_pre, (0LL - (header_offset + 8)), SEEK_END);
		result = fread(&kmer_length, 1, sizeof(uint32), file_pre);
		result = fread(&mode, 1, sizeof(uint32), file_pre);
		result = fread(&counter_size, 1, sizeof(uint32), file_pre);
		result = fread(&lut_prefix_length, 1, sizeof(uint32), file_pre);
		result = fread(&signature_len, 1, sizeof(uint32), file_pre);
		result = fread(&min_count, 1, sizeof(uint32), file_pre);
		original_min_count = min_count;
		result = fread(&max_count, 1, sizeof(uint32), file_pre);
		original_max_count = max_count;
		result = fread(&total_kmers, 1, sizeof(uint64), file_pre);
		result = fread(&both_strands, 1, 1, file_pre);
		both_strands = !both_strands;

		signature_map_size = ((1 << (2 * signature_len)) + 1);
		uint64 lut_area_size_in_bytes = size - (signature_map_size * sizeof(uint32)+header_offset + 8);
		single_LUT_size = 1 << (2 * lut_prefix_length);
		uint64 last_data_index = lut_area_size_in_bytes / sizeof(uint64);

		rewind(file_pre);
		my_fseek(file_pre, +4, SEEK_CUR);
		prefix_file_buf_size = (lut_area_size_in_bytes + 8) / sizeof(uint64);		//reads without 4 bytes of a header_offset (and without markers)
		prefix_file_buf = new uint64[prefix_file_buf_size];
		result = fread(prefix_file_buf, 1,
			(size_t)(lut_area_size_in_bytes + 8), file_pre);
		if (result == 0)
			return false;
		prefix_file_buf[last_data_index] = total_kmers + 1;

		signature_map = new uint32[signature_map_size];
		result = fread(signature_map, 1, signature_map_size * sizeof(uint32), file_pre);
		if (result == 0)
			return false;

		sufix_size = (kmer_length - lut_prefix_length) / 4;
		sufix_rec_size = sufix_size + counter_size;

		return true;
	}
	else if (kmc_version == 0)
	{
		prefix_file_buf_size = (size - 4) / sizeof(uint64);		//reads without 4 bytes of a header_offset (and without markers)
		prefix_file_buf = new uint64[prefix_file_buf_size];
		result = fread(prefix_file_buf, 1, (size_t)(size - 4), file_pre);
		if (result == 0)
			return false;

		my_fseek(file_pre, -8, SEEK_END);

		uint64 header_offset;
		header_offset = fgetc(file_pre);

		size = size - 4;

		uint64 header_index = (size - header_offset) / sizeof(uint64);
		uint64 last_data_index = header_index;

		uint64 d = prefix_file_buf[header_index];

		kmer_length = (uint32)d;			//- kmer's length
		mode = d >> 32;						//- mode: 0 or 1

		header_index++;
		counter_size = (uint32)prefix_file_buf[header_index];		//- the size of a counter in bytes;
																	//- for mode 0 counter_size is 1, 2, 3, or 4 (or 5, 6, 7, 8 for small k values)
																	//- for mode = 1 counter_size is 4;
		lut_prefix_length = prefix_file_buf[header_index] >> 32;	//- the number of prefix's symbols cut from kmers;
																	//- (kmer_length - lut_prefix_length) is divisible by 4

		header_index++;
		original_min_count = (uint32)prefix_file_buf[header_index];	//- the minimal number of kmer's appearances
		min_count = original_min_count;
		original_max_count = prefix_file_buf[header_index] >> 32;	//- the maximal number of kmer's appearances
		//max_count = original_max_count;

		header_index++;
		total_kmers = prefix_file_buf[header_index];				//- the total number of kmers

		header_index++;
		both_strands = (prefix_file_buf[header_index] & 0x000000000000000F) == 1;
		both_strands = !both_strands;

		original_max_count += prefix_file_buf[header_index] & 0xFFFFFFFF00000000;
		max_count =
			original_max_count;
		prefix_file_buf[last_data_index] = total_kmers + 1;

		sufix_size = (kmer_length - lut_prefix_length) / 4;
		sufix_rec_size = sufix_size + counter_size;

		return true;
	}
	return false;
}

//------------------------------------------------------------------------------------------
// Check if kmer exists.
// IN : kmer  - kmer
// OUT: count - kmer's counter if kmer exists
// RET: true  - if kmer exists
//------------------------------------------------------------------------------------------
bool CKMCFile::CheckKmer(CKmerAPI &kmer, float &count)
{
	uint32 int_counter;
	if (CheckKmer(kmer, int_counter))
	{
		if (mode == 0)
			count = (float)int_counter;
		else
			memcpy(&count, &int_counter, counter_size);
		return true;
	}
	return false;
}

//------------------------------------------------------------------------------------------
// Check if kmer exists.
// IN : kmer  - kmer
// OUT: count - kmer's counter if kmer exists
// RET: true  - if kmer exists
//------------------------------------------------------------------------------------------
bool CKMCFile::CheckKmer(CKmerAPI &kmer, uint32 &count)
{
	if(is_opened != opened_for_RA)
		return false;
	if(end_of_file)
		return false;

	//recognize a prefix:
	uint64 pattern_prefix_value = kmer.kmer_data[0];

	uint32 pattern_offset = (sizeof(pattern_prefix_value)* 8) - (lut_prefix_length * 2) - (kmer.byte_alignment * 2);
	int64 index_start = 0, index_stop = 0;

	pattern_prefix_value = pattern_prefix_value >> pattern_offset;  //complements with 0
	if (pattern_prefix_value >= prefix_file_buf_size)
		return false;

	if (kmc_version == 0x200)
	{
		uint32 signature = kmer.get_signature(signature_len);
		uint32 bin_start_pos = signature_map[signature];
		bin_start_pos *= single_LUT_size;
		//look into the array with data
		index_start = *(prefix_file_buf + bin_start_pos + pattern_prefix_value);
		index_stop = *(prefix_file_buf + bin_start_pos + pattern_prefix_value + 1) - 1;
	}
	else if (kmc_version == 0)
	{
		//look into the array with data
		index_start =
			prefix_file_buf[pattern_prefix_value];
		index_stop = prefix_file_buf[pattern_prefix_value + 1] - 1;
	}
	uint64 tmp_count;
	bool res = BinarySearch(index_start, index_stop, kmer, tmp_count, pattern_offset);
	count = (uint32)tmp_count;
	return res;
}

//------------------------------------------------------------------------------------------
// Check if kmer exists.
// IN : kmer  - kmer
// OUT: count - kmer's counter if kmer exists
// RET: true  - if kmer exists
//------------------------------------------------------------------------------------------
bool CKMCFile::CheckKmer(CKmerAPI &kmer, uint64 &count)
{
	if (is_opened != opened_for_RA)
		return false;
	if (end_of_file)
		return false;

	//recognize a prefix:
	uint64 pattern_prefix_value = kmer.kmer_data[0];

	uint32 pattern_offset = (sizeof(pattern_prefix_value)* 8) - (lut_prefix_length * 2) - (kmer.byte_alignment * 2);
	int64 index_start = 0, index_stop = 0;

	pattern_prefix_value = pattern_prefix_value >> pattern_offset;  //complements with 0
	if (pattern_prefix_value >= prefix_file_buf_size)
		return false;

	if (kmc_version == 0x200)
	{
		uint32 signature = kmer.get_signature(signature_len);
		uint32 bin_start_pos = signature_map[signature];
		bin_start_pos *= single_LUT_size;
		//look into the array with data
		index_start = *(prefix_file_buf + bin_start_pos + pattern_prefix_value);
		index_stop = *(prefix_file_buf + bin_start_pos + pattern_prefix_value + 1) - 1;
	}
	else if (kmc_version == 0)
	{
		//look into the array with data
		index_start = prefix_file_buf[pattern_prefix_value];
		index_stop = prefix_file_buf[pattern_prefix_value + 1] - 1;
	}
	return BinarySearch(index_start, index_stop, kmer, count, pattern_offset);
}

//-----------------------------------------------------------------------------------------------
// Check if end of file
// RET: true - all kmers are listed
//-----------------------------------------------------------------------------------------------
bool CKMCFile::Eof(void)
{
	return end_of_file;
}

bool
CKMCFile::ReadNextKmer(CKmerAPI &kmer, float &count)
{
	uint32 int_counter;
	if (ReadNextKmer(kmer, int_counter))
	{
		if (mode == 0)
			count = (float)int_counter;
		else
			memcpy(&count, &int_counter, counter_size);
		return true;
	}
	return false;
}

//-----------------------------------------------------------------------------------------------
// Read next kmer
// OUT: kmer  - next kmer
// OUT: count - kmer's counter
// RET: true  - if not EOF
//-----------------------------------------------------------------------------------------------
bool CKMCFile::ReadNextKmer(CKmerAPI &kmer, uint32 &count)
{
	uint64 prefix_mask = (1 << 2 * lut_prefix_length) - 1; //for kmc2 db

	if(is_opened != opened_for_listing)
		return false;
	do
	{
		if(end_of_file)
			return false;

		if(sufix_number == prefix_file_buf[prefix_index + 1])
		{
			prefix_index++;

			while (prefix_file_buf[prefix_index] == prefix_file_buf[prefix_index + 1])
				prefix_index++;
		}

		uint32 off = (sizeof(prefix_index) * 8) - (lut_prefix_length * 2) - kmer.byte_alignment * 2;

		uint64 temp_prefix = (prefix_index & prefix_mask) << off;	// shift prefix towards MSD. "& prefix_mask" necessary for kmc2 db format

		kmer.kmer_data[0] = temp_prefix;			// store prefix in an object CKmerAPI

		for(uint32 i = 1; i < kmer.no_of_rows; i++)
			kmer.kmer_data[i] = 0;

		//read sufix:
		uint32 row_index = 0;
		uint64 suf = 0;

		off = off - 8;

		for(uint32 a = 0; a < sufix_size; a ++)
		{
			if(index_in_partial_buf == part_size)
				Reload_sufix_file_buf();

			suf = sufix_file_buf[index_in_partial_buf++];
			suf = suf << off;
			kmer.kmer_data[row_index] = kmer.kmer_data[row_index] | suf;

			if (off == 0)				//the end of a word in kmer_data
			{
				off = 56;
				row_index++;
			}
			else
				off -=8;
		}

		//read counter:
		if(index_in_partial_buf == part_size)
			Reload_sufix_file_buf();

		count = sufix_file_buf[index_in_partial_buf++];

		for(uint32 b = 1; b < counter_size; b++)
		{
			if(index_in_partial_buf == part_size)
				Reload_sufix_file_buf();

			uint32 aux = 0x000000ff & sufix_file_buf[index_in_partial_buf++];
			aux = aux << 8 * ( b);
			count = aux | count;
		}

		sufix_number++;

		if(sufix_number == total_kmers)
			end_of_file = true;

		if (mode != 0)
		{
			float float_counter;
			memcpy(&float_counter, &count, counter_size);
			if ((float_counter < min_count) || (float_counter > max_count))
				continue;
			else
				break;
		}

	}
	while((count < min_count) || (count > max_count));

	return true;
}

//-----------------------------------------------------------------------------------------------
// Read next kmer
// OUT: kmer  - next kmer
// OUT: count - kmer's counter
// RET: true  - if not EOF
//-----------------------------------------------------------------------------------------------
bool CKMCFile::ReadNextKmer(CKmerAPI &kmer, uint64 &count)
{
	uint64 prefix_mask = (1 << 2 * lut_prefix_length) - 1; //for kmc2 db

	if (is_opened != opened_for_listing)
		return false;
	do
	{
		if (end_of_file)
			return false;

		if (sufix_number == prefix_file_buf[prefix_index + 1])
		{
			prefix_index++;

			while (prefix_file_buf[prefix_index] == prefix_file_buf[prefix_index + 1])
				prefix_index++;
		}

		uint32 off = (sizeof(prefix_index)* 8) - (lut_prefix_length * 2) - kmer.byte_alignment * 2;
		uint64 temp_prefix = (prefix_index & prefix_mask) << off;	// shift prefix towards MSD. "& prefix_mask" necessary for kmc2 db format

		kmer.kmer_data[0] = temp_prefix;			// store prefix in an object CKmerAPI

		for (uint32 i = 1; i < kmer.no_of_rows; i++)
			kmer.kmer_data[i] = 0;

		//read sufix:
		uint32 row_index = 0;
		uint64 suf = 0;

		off = off - 8;

		for (uint32 a = 0; a < sufix_size; a++)
		{
			if (index_in_partial_buf == part_size)
				Reload_sufix_file_buf();

			suf = sufix_file_buf[index_in_partial_buf++];
			suf = suf << off;
			kmer.kmer_data[row_index] = kmer.kmer_data[row_index] | suf;

			if (off == 0)				//the end of a word in kmer_data
			{
				off = 56;
				row_index++;
			}
			else
				off -= 8;
		}

		//read counter:
		if (index_in_partial_buf == part_size)
			Reload_sufix_file_buf();

		count = sufix_file_buf[index_in_partial_buf++];

		for (uint32 b = 1; b < counter_size; b++)
		{
			if (index_in_partial_buf == part_size)
				Reload_sufix_file_buf();

			uint64 aux = 0x000000ff & sufix_file_buf[index_in_partial_buf++];
			aux = aux << 8 * (b);
			count = aux | count;
		}

		sufix_number++;

		if (sufix_number == total_kmers)
			end_of_file = true;

	} while ((count < min_count) || (count > max_count));

	return true;
}

//-------------------------------------------------------------------------------
// Reload a contents of an array "sufix_file_buf" for listing mode. Auxiliary function.
//-------------------------------------------------------------------------------
void CKMCFile::Reload_sufix_file_buf()
{
	fread (sufix_file_buf, 1, (size_t) part_size, file_suf);
	index_in_partial_buf = 0;
};

//-------------------------------------------------------------------------------
// Release memory and close files in case they were opened
// RET: true - if files have been read
//-------------------------------------------------------------------------------
bool CKMCFile::Close()
{
	if(is_opened)
	{
		if(file_pre)
		{
			fclose(file_pre);
			file_pre = NULL;
		}
		if(file_suf)
		{
			fclose(file_suf);
			file_suf = NULL;
		}

		is_opened = closed;
		end_of_file = false;
		delete [] prefix_file_buf;
		prefix_file_buf = NULL;
		delete [] sufix_file_buf;
		sufix_file_buf = NULL;
		delete[] signature_map;
		signature_map = NULL;

		return true;
	}
	else
		return false;
};

//----------------------------------------------------------------------------------
// Set initial values to enable listing kmers from the beginning. Only in listing mode
// RET: true - if a file has been opened for listing
//----------------------------------------------------------------------------------
bool CKMCFile::RestartListing(void)
{
	if(is_opened == opened_for_listing)
	{
		my_fseek ( file_suf , 4 , SEEK_SET );
		fread (sufix_file_buf, 1, (size_t) part_size, file_suf);
		prefix_index = 0;
		sufix_number = 0;
		index_in_partial_buf = 0;

		end_of_file = total_kmers == 0;

		return true;
	}
	return false;
};

//----------------------------------------------------------------------------------------
// Set the minimal value for a counter.
// Kmers with counters below this threshold are ignored
// IN	: x    - minimal value for a counter
// RET	: true - if successful
//----------------------------------------------------------------------------------------
bool CKMCFile::SetMinCount(uint32 x)
{
	if((original_min_count <= x) && (x < max_count))
	{
		min_count = x;
		return true;
	}
	else
		return false;
}

//----------------------------------------------------------------------------------------
// Return a value of min_count. Kmers with counters below this threshold are ignored
// RET	: a value of min_count
//----------------------------------------------------------------------------------------
uint32 CKMCFile::GetMinCount(void)
{
	return min_count;
};

//----------------------------------------------------------------------------------------
// Set the maximal value for a counter. Kmers with counters above this threshold are ignored
// IN	: x    - maximal value for a counter
// RET	: true - if successful
//----------------------------------------------------------------------------------------
bool CKMCFile::SetMaxCount(uint32 x)
{
	if((original_max_count >= x) && (x > min_count))
	{
		max_count = x;
		return true;
	}
	else
		return false;
}

//----------------------------------------------------------------------------------------
// Return a value of max_count.
// Kmers with counters above this threshold are ignored
// RET	: a value of max_count
//----------------------------------------------------------------------------------------
uint64 CKMCFile::GetMaxCount(void)
{
	return max_count;
}

//----------------------------------------------------------------------------------------
// Return true if KMC was run without -b switch
// RET	: a value of both_strands
//----------------------------------------------------------------------------------------
bool CKMCFile::GetBothStrands(void)
{
	return both_strands;
}

//----------------------------------------------------------------------------------------
// Set original (read from *.kmc_pre) values for min_count and max_count
//----------------------------------------------------------------------------------------
void CKMCFile::ResetMinMaxCounts(void)
{
	min_count = original_min_count;
	max_count = original_max_count;
}

//----------------------------------------------------------------------------------------
// Return the length of kmers
// RET	: the length of kmers
//----------------------------------------------------------------------------------------
uint32 CKMCFile::KmerLength(void)
{
	return kmer_length;
}

//----------------------------------------------------------------------------------------
// Check if kmer exists
// IN	: kmer - kmer
// RET	: true if kmer exists
//----------------------------------------------------------------------------------------
bool CKMCFile::IsKmer(CKmerAPI &kmer)
{
	uint32 _count;
	if(CheckKmer(kmer, _count))
		return true;
	else
		return false;
}

//-----------------------------------------------------------------------------------------
// Check the total number of kmers between current min_count and max_count
// RET	: total number of kmers or 0 if a database has not been opened
//-----------------------------------------------------------------------------------------
uint64 CKMCFile::KmerCount(void)
{
	if(is_opened)
		if((min_count ==
			original_min_count) && (max_count == original_max_count))
			return total_kmers;
		else
		{
			uint32 count;
			uint32 int_counter;
			uint64 aux_kmerCount = 0;

			if(is_opened == opened_for_RA)
			{
				uchar *ptr = sufix_file_buf;

				for(uint64 i = 0; i < total_kmers; i++)
				{
					ptr += sufix_size;
					int_counter = *ptr;
					ptr++;

					for(uint32 b = 1; b < counter_size; b ++)
					{
						uint32 aux = 0x000000ff & *(ptr);
						aux = aux << 8 * ( b);
						int_counter = aux | int_counter;
						ptr++;
					}

					if(mode == 0)
						count = int_counter;
					else
						memcpy(&count, &int_counter, counter_size);

					if((count >= min_count) && (count <= max_count))
						aux_kmerCount++;
				}
			}
			else //opened_for_listing
			{
				CKmerAPI kmer(kmer_length);
				float count;
				RestartListing();
				for(uint64 i = 0; i < total_kmers; i++)
				{
					ReadNextKmer(kmer, count);
					if((count >= min_count) && (count <= max_count))
						aux_kmerCount++;
				}
				RestartListing();
			}
			return aux_kmerCount;
		}
	else
		return 0 ;
}

//---------------------------------------------------------------------------------
// Get current parameters from kmer_database
// OUT	: _kmer_length       - the length of kmers
//		  _mode              - mode
//		  _counter_size      - the size of a counter in bytes
//		  _lut_prefix_length - the number of prefix's symbols cut from kmers
//		  _min_count         - the minimal number of kmer's appearances
//		  _max_count         - the maximal number of kmer's appearances
//		  _total_kmers       - the total number of kmers
// RET	: true if kmer_database has been opened
//---------------------------------------------------------------------------------
bool CKMCFile::Info(uint32 &_kmer_length, uint32 &_mode, uint32 &_counter_size, uint32 &_lut_prefix_length, uint32 &_signature_len, uint32 &_min_count, uint64 &_max_count, uint64 &_total_kmers)
{
	if(is_opened)
	{
		_kmer_length = kmer_length;
		_mode = mode;
		_counter_size = counter_size;
		_lut_prefix_length = lut_prefix_length;
		if (kmc_version == 0x200)
			_signature_len = signature_len;
		else
			_signature_len = 0; //for kmc1 there is no signature_len
		_min_count = min_count;
		_max_count = max_count;
		_total_kmers = total_kmers;
		return true;
	}
	return false;
};

// Get current parameters from kmer_database
bool CKMCFile::Info(CKMCFileInfo& info)
{
	if (is_opened)
	{
		info.kmer_length = kmer_length;
		info.mode = mode;
		info.counter_size = counter_size;
		info.lut_prefix_length = lut_prefix_length;
		if (kmc_version == 0x200)
			info.signature_len = signature_len;
		else
			info.signature_len = 0; //for kmc1 there is no signature_len
		info.min_count = min_count;
		info.max_count = max_count;
		info.total_kmers = total_kmers;
		info.both_strands = both_strands;
		return true;
	}
	return false;
}

//---------------------------------------------------------------------------------
// Get counters from read
// OUT	: counters - vector of counters of each k-mer in read (of size read_len - kmer_len + 1), if some k-mer is invalid (i.e. contains 'N') the counter is equal to 0
// IN	: read     -
// RET	: true if success, false if k > read length or some failure
//---------------------------------------------------------------------------------
// The element type of the vector was lost in extraction; uint32 is inferred from
// the count_for_kmer_* helpers, which return uint32.
bool CKMCFile::GetCountersForRead(const std::string& read, std::vector<uint32>& counters)
{
	if (is_opened != opened_for_RA)
		return false;
	if (read.length() < kmer_length)
	{
		counters.clear();
		return false;
	}

	if (kmc_version == 0x200)
	{
		if (both_strands)
			return GetCountersForRead_kmc2_both_strands(read, counters);
		else
			return GetCountersForRead_kmc2(read, counters);
	}
	else if (kmc_version == 0)
	{
		if (both_strands)
			return GetCountersForRead_kmc1_both_strands(read, counters);
		else
			return GetCountersForRead_kmc1(read, counters);
	}
	else
		return false; //never should be here
}

//---------------------------------------------------------------------------------
// Get counters from read
// OUT	: counters - vector of counters of each k-mer in read (of size read_len - kmer_len + 1), if some k-mer is invalid (i.e.
//		  contains 'N') the counter is equal to 0
// IN	: read     -
// RET	: true if success
//---------------------------------------------------------------------------------
// The template arguments in this function were lost in extraction; float and uint32
// are inferred from the memcpy into counters[i] and from the name uint32_v.
bool CKMCFile::GetCountersForRead(const std::string& read, std::vector<float>& counters)
{
	if (is_opened != opened_for_RA)
		return false;
	std::vector<uint32> uint32_v;
	if (GetCountersForRead(read, uint32_v))
	{
		counters.clear();
		counters.resize(uint32_v.size());
		if (mode == 0)
		{
			for (uint32 i = 0; i < uint32_v.size(); ++i)
				counters[i] = static_cast<float>(uint32_v[i]);
		}
		else
		{
			for (uint32 i = 0; i < uint32_v.size(); ++i)
				memcpy(&counters[i], &uint32_v[i], counter_size);
		}

		return true;
	}
	return false;
}

//---------------------------------------------------------------------------------
// Auxiliary function.
//---------------------------------------------------------------------------------
uint32 CKMCFile::count_for_kmer_kmc1(CKmerAPI& kmer)
{
	//recognize a prefix:
	uint64 pattern_prefix_value = kmer.kmer_data[0];

	uint32 pattern_offset = (sizeof(pattern_prefix_value)* 8) - (lut_prefix_length * 2) - (kmer.byte_alignment * 2);

	pattern_prefix_value = pattern_prefix_value >> pattern_offset;  //complements with 0
	if (pattern_prefix_value >= prefix_file_buf_size)
		return 0;	//prefix out of range: the k-mer is not in the database
	//look into the array with data

	int64 index_start = prefix_file_buf[pattern_prefix_value];
	int64 index_stop = prefix_file_buf[pattern_prefix_value + 1] - 1;

	uint64 counter = 0;
	if (BinarySearch(index_start, index_stop, kmer, counter, pattern_offset))
		return (uint32)counter;
	return 0;
}

//---------------------------------------------------------------------------------
// Auxiliary function.
//--------------------------------------------------------------------------------- uint32 CKMCFile::count_for_kmer_kmc2(CKmerAPI& kmer, uint32 bin_start_pos) { //recognize a prefix: uint64 pattern_prefix_value = kmer.kmer_data[0]; uint32 pattern_offset = (sizeof(pattern_prefix_value)* 8) - (lut_prefix_length * 2) - (kmer.byte_alignment * 2); pattern_prefix_value = pattern_prefix_value >> pattern_offset; //complements with 0 if (pattern_prefix_value >= prefix_file_buf_size) return false; //look into the array with data int64 index_start = *(prefix_file_buf + bin_start_pos + pattern_prefix_value); int64 index_stop = *(prefix_file_buf + bin_start_pos + pattern_prefix_value + 1) - 1; uint64 counter = 0; if (BinarySearch(index_start, index_stop, kmer, counter, pattern_offset)) return (uint32)counter; return 0; } //--------------------------------------------------------------------------------- // Auxiliary function. //--------------------------------------------------------------------------------- bool CKMCFile::GetCountersForRead_kmc1_both_strands(const std::string& read, std::vector& counters) { uint32 read_len = static_cast(read.length()); counters.resize(read.length() - kmer_length + 1); std::string transformed_read = read; for (char& c : transformed_read) c = CKmerAPI::num_codes[(uchar)c]; uint32 i = 0; CKmerAPI kmer(kmer_length), kmer_rev(kmer_length); uint32 pos = 0; uint32 rev_pos = kmer_length - 1; uint32 counters_pos = 0; while (i + kmer_length - 1 < read_len) { bool contains_N = false; while (i < read_len && pos < kmer_length) { if (CKmerAPI::num_codes[(uchar)read[i]] < 0) { pos = 0; rev_pos = kmer_length - 1; kmer.clear(); kmer_rev.clear(); ++i; uint32 wrong_kmers = MIN(i - counters_pos, static_cast(counters.size()) - counters_pos); fill_n(counters.begin() + counters_pos, wrong_kmers, 0); counters_pos += wrong_kmers; contains_N = true; break; } else { kmer_rev.insert2bits(rev_pos--, 3 - CKmerAPI::num_codes[(uchar)read[i]]); kmer.insert2bits(pos++, 
CKmerAPI::num_codes[(uchar)read[i++]]); } } if (contains_N) continue; if (pos == kmer_length) { if(kmer < kmer_rev) counters[counters_pos++] = count_for_kmer_kmc1(kmer); else counters[counters_pos++] = count_for_kmer_kmc1(kmer_rev); } else break; while (i < read_len) { if (CKmerAPI::num_codes[(uchar)read[i]] < 0) { pos = 0; break; } kmer_rev.SHR_insert2bits(3 - CKmerAPI::num_codes[(uchar)read[i]]); kmer.SHL_insert2bits(CKmerAPI::num_codes[(uchar)read[i++]]); if(kmer < kmer_rev) counters[counters_pos++] = count_for_kmer_kmc1(kmer); else counters[counters_pos++] = count_for_kmer_kmc1(kmer_rev); } } if (counters_pos < counters.size()) { fill_n(counters.begin() + counters_pos, counters.size() - counters_pos, 0); counters_pos = static_cast(counters.size()); } return true; } //--------------------------------------------------------------------------------- // Auxiliary function. //--------------------------------------------------------------------------------- bool CKMCFile::GetCountersForRead_kmc1(const std::string& read, std::vector& counters) { uint32 read_len = static_cast(read.length()); counters.resize(read.length() - kmer_length + 1); std::string transformed_read = read; for (char& c : transformed_read) c = CKmerAPI::num_codes[(uchar)c]; uint32 i = 0; CKmerAPI kmer(kmer_length); uint32 pos = 0; uint32 counters_pos = 0; while (i + kmer_length - 1 < read_len) { bool contains_N = false; while (i < read_len && pos < kmer_length) { if (CKmerAPI::num_codes[(uchar)read[i]] < 0) { pos = 0; kmer.clear(); ++i; uint32 wrong_kmers = MIN(i - counters_pos, static_cast(counters.size()) - counters_pos); fill_n(counters.begin() + counters_pos, wrong_kmers, 0); counters_pos += wrong_kmers; contains_N = true; break; } else kmer.insert2bits(pos++, CKmerAPI::num_codes[(uchar)read[i++]]); } if (contains_N) continue; if (pos == kmer_length) { counters[counters_pos++] = count_for_kmer_kmc1(kmer); } else break; while (i < read_len) { if (CKmerAPI::num_codes[(uchar)read[i]] < 0) { pos = 
0; break; } kmer.SHL_insert2bits(CKmerAPI::num_codes[(uchar)read[i++]]); counters[counters_pos++] = count_for_kmer_kmc1(kmer); } } if (counters_pos < counters.size()) { fill_n(counters.begin() + counters_pos, counters.size() - counters_pos, 0); counters_pos = static_cast(counters.size()); } return true; } //--------------------------------------------------------------------------------- // Auxiliary function. //--------------------------------------------------------------------------------- void CKMCFile::GetSuperKmers(const std::string& transformed_read, super_kmers_t& super_kmers) { uint32 i = 0; uint32 len = 0; //length of super k-mer uint32 signature_start_pos; CMmer current_signature(signature_len), end_mmer(signature_len); while (i + kmer_length - 1 < transformed_read.length()) { bool contains_N = false; //building first signature after 'N' or at the read beginning for (uint32 j = 0; j < signature_len; ++j, ++i) { if (transformed_read[i] < 0)//'N' { contains_N = true; break; } } //signature must be shorter than k-mer so if signature contains 'N', k-mer will contains it also if (contains_N) { ++i; continue; } len = signature_len; signature_start_pos = i - signature_len; current_signature.insert(transformed_read.c_str() + signature_start_pos); end_mmer.set(current_signature); for (; i < transformed_read.length(); ++i) { if (transformed_read[i] < 0)//'N' { if (len >= kmer_length) { super_kmers.push_back(std::make_tuple(i - len, len, signature_map[current_signature.get()])); } len = 0; ++i; break; } end_mmer.insert(transformed_read[i]); if (end_mmer < current_signature)//signature at the end of current k-mer is lower than current { if (len >= kmer_length) { super_kmers.push_back(std::make_tuple(i - len, len, signature_map[current_signature.get()])); len = kmer_length - 1; } current_signature.set(end_mmer); signature_start_pos = i - signature_len + 1; } else if (end_mmer == current_signature) { current_signature.set(end_mmer); signature_start_pos = i - 
signature_len + 1; } else if (signature_start_pos + kmer_length - 1 < i)//need to find new signature { super_kmers.push_back(std::make_tuple(i - len, len, signature_map[current_signature.get()])); len = kmer_length - 1; //looking for new signature ++signature_start_pos; //building first signature in current k-mer end_mmer.insert(transformed_read.c_str() + signature_start_pos); current_signature.set(end_mmer); for (uint32 j = signature_start_pos + signature_len; j <= i; ++j) { end_mmer.insert(transformed_read[j]); if (end_mmer <= current_signature) { current_signature.set(end_mmer); signature_start_pos = j - signature_len + 1; } } } ++len; } } if (len >= kmer_length)//last one in read { super_kmers.push_back(std::make_tuple(i - len, len, signature_map[current_signature.get()])); } } //--------------------------------------------------------------------------------- // Auxiliary function. //--------------------------------------------------------------------------------- bool CKMCFile::GetCountersForRead_kmc2_both_strands(const std::string& read, std::vector& counters) { counters.resize(read.length() - kmer_length + 1); std::string transformed_read = read; for (char& c : transformed_read) c = CKmerAPI::num_codes[(uchar)c]; super_kmers_t super_kmers; GetSuperKmers(transformed_read, super_kmers); uint32 counters_pos = 0; if (super_kmers.empty()) { fill_n(counters.begin(), counters.size(), 0); return true; } CKmerAPI kmer(kmer_length), rev_kmer(kmer_length); uint32 last_end = 0; //'N' somewhere in first k-mer if (std::get<0>(super_kmers.front()) > 0) { fill_n(counters.begin(), std::get<0>(super_kmers.front()), 0); last_end = std::get<0>(super_kmers.front()); counters_pos = std::get<0>(super_kmers.front()); } for (auto& super_kmer : super_kmers) { //'N's between super k-mers if (last_end < std::get<0>(super_kmer)) { uint32 gap = std::get<0>(super_kmer) -last_end; fill_n(counters.begin() + counters_pos, kmer_length + gap - 1, 0); counters_pos += kmer_length + gap - 1; } 
last_end = std::get<0>(super_kmer) +std::get<1>(super_kmer); kmer.from_binary(transformed_read.c_str() + std::get<0>(super_kmer)); rev_kmer.from_binary_rev(transformed_read.c_str() + std::get<0>(super_kmer)); uint32 bin_start_pos = std::get<2>(super_kmer) * single_LUT_size; if(kmer < rev_kmer) counters[counters_pos++] = count_for_kmer_kmc2(kmer, bin_start_pos); else counters[counters_pos++] = count_for_kmer_kmc2(rev_kmer, bin_start_pos); for (uint32 i = std::get<0>(super_kmer) +kmer_length; i < std::get<0>(super_kmer) +std::get<1>(super_kmer); ++i) { kmer.SHL_insert2bits(transformed_read[i]); rev_kmer.SHR_insert2bits(3 - transformed_read[i]); if(kmer < rev_kmer) counters[counters_pos++] = count_for_kmer_kmc2(kmer, bin_start_pos); else counters[counters_pos++] = count_for_kmer_kmc2(rev_kmer, bin_start_pos); } } //'N's at the end of read if (counters_pos < counters.size()) { fill_n(counters.begin() + counters_pos, counters.size() - counters_pos, 0); counters_pos = static_cast(counters.size()); } return true; } //--------------------------------------------------------------------------------- // Auxiliary function. 
//--------------------------------------------------------------------------------- bool CKMCFile::GetCountersForRead_kmc2(const std::string& read, std::vector& counters) { counters.resize(read.length() - kmer_length + 1); std::string transformed_read = read; for (char& c : transformed_read) c = CKmerAPI::num_codes[(uchar)c]; super_kmers_t super_kmers; GetSuperKmers(transformed_read, super_kmers); uint32 counters_pos = 0; if (super_kmers.empty()) { fill_n(counters.begin(), counters.size(), 0); return true; } CKmerAPI kmer(kmer_length); uint32 last_end = 0; //'N' somewhere in first k-mer if (std::get<0>(super_kmers.front()) > 0) { fill_n(counters.begin(), std::get<0>(super_kmers.front()), 0); last_end = std::get<0>(super_kmers.front()); counters_pos = std::get<0>(super_kmers.front()); } for (auto& super_kmer : super_kmers) { //'N's between super k-mers if (last_end < std::get<0>(super_kmer)) { uint32 gap = std::get<0>(super_kmer) -last_end; fill_n(counters.begin() + counters_pos, kmer_length + gap - 1, 0); counters_pos += kmer_length + gap - 1; } last_end = std::get<0>(super_kmer) + std::get<1>(super_kmer); kmer.from_binary(transformed_read.c_str() + std::get<0>(super_kmer)); uint32 bin_start_pos = std::get<2>(super_kmer) * single_LUT_size; counters[counters_pos++] = count_for_kmer_kmc2(kmer, bin_start_pos); for (uint32 i = std::get<0>(super_kmer) +kmer_length; i < std::get<0>(super_kmer) +std::get<1>(super_kmer); ++i) { kmer.SHL_insert2bits(transformed_read[i]); counters[counters_pos++] = count_for_kmer_kmc2(kmer, bin_start_pos); } } //'N's at the end of read if (counters_pos < counters.size()) { fill_n(counters.begin() + counters_pos, counters.size() - counters_pos, 0); counters_pos = static_cast(counters.size()); } return true; } //--------------------------------------------------------------------------------- // Auxiliary function. 
//--------------------------------------------------------------------------------- bool CKMCFile::BinarySearch(int64 index_start, int64 index_stop, const CKmerAPI& kmer, uint64& counter, uint32 pattern_offset) { if (index_start >= total_kmers) return false; uchar *sufix_byte_ptr = nullptr; uint64 sufix = 0; //sufix_offset is always 56 uint32 sufix_offset = 56; // the offset of a sufix is for shifting the sufix towards MSB, to compare the sufix with a pattern // Bytes of a pattern to search are always shifted towards MSB uint32 row_index = 0; // the number of a current row in an array kmer_data bool found = false; while (index_start <= index_stop) { int64 mid_index = (index_start + index_stop) / 2; sufix_byte_ptr = &sufix_file_buf[mid_index * sufix_rec_size]; uint64 pattern = 0; pattern_offset = (lut_prefix_length + kmer.byte_alignment) * 2; row_index = 0; for (uint32 a = 0; a < sufix_size; a++) //check byte by byte { pattern = kmer.kmer_data[row_index]; pattern = pattern << pattern_offset; pattern = pattern & 0xff00000000000000; sufix = sufix_byte_ptr[a]; sufix = sufix << sufix_offset; if (pattern != sufix) break; pattern_offset += 8; if (pattern_offset == 64) //the end of a word { pattern_offset = 0; row_index++; } } if (pattern == sufix) { found = true; break; } if (sufix < pattern) index_start = mid_index + 1; else index_stop = mid_index - 1; } if (found) { sufix_byte_ptr += sufix_size; counter = *sufix_byte_ptr; for (uint32 b = 1; b < counter_size; b++) { uint64 aux = 0x000000ff & *(sufix_byte_ptr + b); aux = aux << 8 * (b); counter = aux | counter; } if (mode != 0) { float float_counter; memcpy(&float_counter, &counter, counter_size); return (float_counter >= min_count) && (float_counter <= max_count); } return (counter >= min_count) && (counter <= max_count); } return false; } // ***** EOF KMC-2.3/kmc_api/kmc_file.h000066400000000000000000000132251257432033000153170ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 
3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _KMC_FILE_H #define _KMC_FILE_H #include "kmer_defs.h" #include "kmer_api.h" #include <string> #include <vector> struct CKMCFileInfo { uint32 kmer_length; uint32 mode; uint32 counter_size; uint32 lut_prefix_length; uint32 signature_len; uint32 min_count; uint64 max_count; bool both_strands; uint64 total_kmers; }; class CKMCFile { enum open_mode {closed, opened_for_RA, opened_for_listing}; open_mode is_opened; bool end_of_file; FILE *file_pre; FILE *file_suf; uint64* prefix_file_buf; uint64 prefix_file_buf_size; uint64 prefix_index; // The current prefix's index in an array "prefix_file_buf", read from *.kmc_pre uint32 single_LUT_size; // The size of a single LUT (in no. of elements) uint32* signature_map; uint32 signature_map_size; uchar* sufix_file_buf; uint64 sufix_number; // The sufix's number to be listed uint64 index_in_partial_buf; // The current byte's number in an array "sufix_file_buf", for listing mode uint32 kmer_length; uint32 mode; uint32 counter_size; uint32 lut_prefix_length; uint32 signature_len; uint32 min_count; uint64 max_count; uint64 total_kmers; bool both_strands; uint32 kmc_version; uint32 sufix_size; // sufix's size in bytes uint32 sufix_rec_size; // sufix_size + counter_size uint32 original_min_count; uint64 original_max_count; static uint64 part_size; // the size of a block read into sufix_file_buf, in listing mode bool BinarySearch(int64 index_start, int64 index_stop, const CKmerAPI& kmer, uint64& counter, uint32 pattern_offset); // Open a file, recognize its size and check its marker. Auxiliary function. bool OpenASingleFile(const std::string &file_name, FILE *&file_handler, uint64 &size, char marker[]); // Recognize current parameters. Auxiliary function. 
bool ReadParamsFrom_prefix_file_buf(uint64 &size); // Reload the contents of an array "sufix_file_buf" for listing mode. Auxiliary function. void Reload_sufix_file_buf(); // Implementation of GetCountersForRead for kmc1 database format for both strands bool GetCountersForRead_kmc1_both_strands(const std::string& read, std::vector<uint32>& counters); // Implementation of GetCountersForRead for kmc1 database format without choosing canonical k-mer bool GetCountersForRead_kmc1(const std::string& read, std::vector<uint32>& counters); using super_kmers_t = std::vector<std::tuple<uint32, uint32, uint32>>;//start_pos, len, bin_no void GetSuperKmers(const std::string& transformed_read, super_kmers_t& super_kmers); // Implementation of GetCountersForRead for kmc2 database format for both strands bool GetCountersForRead_kmc2_both_strands(const std::string& read, std::vector<uint32>& counters); // Implementation of GetCountersForRead for kmc2 database format bool GetCountersForRead_kmc2(const std::string& read, std::vector<uint32>& counters); public: CKMCFile(); ~CKMCFile(); // Open files *.kmc_pre & *.kmc_suf, read them to RAM, close files. *.kmc_suf is opened for random access bool OpenForRA(const std::string &file_name); // Open files *.kmc_pre & *.kmc_suf, read *.kmc_pre to RAM, *.kmc_suf is buffered bool OpenForListing(const std::string& file_name); // Return next kmer in CKmerAPI &kmer. Return its counter in float &count. Return true if not EOF bool ReadNextKmer(CKmerAPI &kmer, float &count); bool ReadNextKmer(CKmerAPI &kmer, uint64 &count); //for small k-values when counter may be longer than 4 bytes bool ReadNextKmer(CKmerAPI &kmer, uint32 &count); // Release memory and close files in case they were opened bool Close(); // Set the minimal value for a counter. Kmers with counters below this threshold are ignored bool SetMinCount(uint32 x); // Return a value of min_count. Kmers with counters below this threshold are ignored uint32 GetMinCount(void); // Set the maximal value for a counter. 
Kmers with counters above this threshold are ignored bool SetMaxCount(uint32 x); // Return a value of max_count. Kmers with counters above this threshold are ignored uint64 GetMaxCount(void); //Return true if kmc was run without -b switch. bool GetBothStrands(void); // Return the total number of kmers between min_count and max_count uint64 KmerCount(void); // Return the length of kmers uint32 KmerLength(void); // Set initial values to enable listing kmers from the beginning. Only in listing mode bool RestartListing(void); // Return true if all kmers are listed bool Eof(void); // Return true if kmer exists. In this case return kmer's counter in count bool CheckKmer(CKmerAPI &kmer, float &count); bool CheckKmer(CKmerAPI &kmer, uint32 &count); bool CheckKmer(CKmerAPI &kmer, uint64 &count); // Return true if kmer exists bool IsKmer(CKmerAPI &kmer); // Set original (read from *.kmc_pre) values for min_count and max_count void ResetMinMaxCounts(void); // Get current parameters from kmer_database bool Info(uint32 &_kmer_length, uint32 &_mode, uint32 &_counter_size, uint32 &_lut_prefix_length, uint32 &_signature_len, uint32 &_min_count, uint64 &_max_count, uint64 &_total_kmers); // Get current parameters from kmer_database bool Info(CKMCFileInfo& info); // Get counters for all k-mers in read bool GetCountersForRead(const std::string& read, std::vector<uint32>& counters); bool GetCountersForRead(const std::string& read, std::vector<float>& counters); private: uint32 count_for_kmer_kmc1(CKmerAPI& kmer); uint32 count_for_kmer_kmc2(CKmerAPI& kmer, uint32 bin_start_pos); }; #endif // ***** EOF KMC-2.3/kmc_api/kmer_api.cpp000066400000000000000000000043751257432033000156760ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz and Agnieszka Debudaj-Grabysz Version: 2.3.0 Date : 2015-08-21 */ #include "stdafx.h" #include "kmer_api.h" #include #include using namespace std; const char CKmerAPI::char_codes[] = {'A','C', 'G', 'T'}; char CKmerAPI::num_codes[]; CKmerAPI::_si CKmerAPI::_init; uchar CKmerAPI::rev_comp_bytes_LUT[] = { 0xff, 0xbf, 0x7f, 0x3f, 0xef, 0xaf, 0x6f, 0x2f, 0xdf, 0x9f, 0x5f, 0x1f, 0xcf, 0x8f, 0x4f, 0x0f, 0xfb, 0xbb, 0x7b, 0x3b, 0xeb, 0xab, 0x6b, 0x2b, 0xdb, 0x9b, 0x5b, 0x1b, 0xcb, 0x8b, 0x4b, 0x0b, 0xf7, 0xb7, 0x77, 0x37, 0xe7, 0xa7, 0x67, 0x27, 0xd7, 0x97, 0x57, 0x17, 0xc7, 0x87, 0x47, 0x07, 0xf3, 0xb3, 0x73, 0x33, 0xe3, 0xa3, 0x63, 0x23, 0xd3, 0x93, 0x53, 0x13, 0xc3, 0x83, 0x43, 0x03, 0xfe, 0xbe, 0x7e, 0x3e, 0xee, 0xae, 0x6e, 0x2e, 0xde, 0x9e, 0x5e, 0x1e, 0xce, 0x8e, 0x4e, 0x0e, 0xfa, 0xba, 0x7a, 0x3a, 0xea, 0xaa, 0x6a, 0x2a, 0xda, 0x9a, 0x5a, 0x1a, 0xca, 0x8a, 0x4a, 0x0a, 0xf6, 0xb6, 0x76, 0x36, 0xe6, 0xa6, 0x66, 0x26, 0xd6, 0x96, 0x56, 0x16, 0xc6, 0x86, 0x46, 0x06, 0xf2, 0xb2, 0x72, 0x32, 0xe2, 0xa2, 0x62, 0x22, 0xd2, 0x92, 0x52, 0x12, 0xc2, 0x82, 0x42, 0x02, 0xfd, 0xbd, 0x7d, 0x3d, 0xed, 0xad, 0x6d, 0x2d, 0xdd, 0x9d, 0x5d, 0x1d, 0xcd, 0x8d, 0x4d, 0x0d, 0xf9, 0xb9, 0x79, 0x39, 0xe9, 0xa9, 0x69, 0x29, 0xd9, 0x99, 0x59, 0x19, 0xc9, 0x89, 0x49, 0x09, 0xf5, 0xb5, 0x75, 0x35, 0xe5, 0xa5, 0x65, 0x25, 0xd5, 0x95, 0x55, 0x15, 0xc5, 0x85, 0x45, 0x05, 0xf1, 0xb1, 0x71, 0x31, 0xe1, 0xa1, 0x61, 0x21, 0xd1, 0x91, 0x51, 0x11, 0xc1, 0x81, 0x41, 0x01, 0xfc, 0xbc, 0x7c, 0x3c, 0xec, 0xac, 0x6c, 0x2c, 0xdc, 0x9c, 0x5c, 0x1c, 0xcc, 0x8c, 0x4c, 0x0c, 0xf8, 0xb8, 0x78, 0x38, 0xe8, 0xa8, 0x68, 0x28, 0xd8, 0x98, 0x58, 0x18, 0xc8, 0x88, 0x48, 0x08, 0xf4, 0xb4, 0x74, 0x34, 0xe4, 0xa4, 0x64, 0x24, 0xd4, 0x94, 0x54, 0x14, 0xc4, 0x84, 0x44, 0x04, 0xf0, 0xb0, 0x70, 0x30, 0xe0, 0xa0, 0x60, 0x20, 0xd0, 0x90, 0x50, 0x10, 0xc0, 0x80, 0x40, 0x00 }; uint64 CKmerAPI::alignment_mask[] = { 0xFFFFFFFFFFFFFFFFULL, 
0x3FFFFFFFFFFFFFFFULL, 0x0FFFFFFFFFFFFFFFULL, 0x03FFFFFFFFFFFFFFULL, 0x00FFFFFFFFFFFFFFULL }; // ***** EOF KMC-2.3/kmc_api/kmer_api.h000066400000000000000000000436451257432033000153460ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz and Agnieszka Debudaj-Grabysz Version: 2.3.0 Date : 2015-08-21 */ #ifndef _KMER_API_H #define _KMER_API_H #include "kmer_defs.h" #include #include #include #include "mmer.h" class CKMCFile; class CKmerAPI { protected: uint64 *kmer_data; // An array to store kmer's data. On 64 bits 32 symbols can be stored // Data are shifted to let sufix's symbols to start with a border of a byte uint32 kmer_length; // Kmer's length, in symbols uchar byte_alignment; // A number of "empty" symbols placed before prefix to let sufix's symbols to start with a border of a byte uint32 no_of_rows; // A number of 64-bits words allocated for kmer_data friend class CKMCFile; //---------------------------------------------------------------------------------- inline void clear() { memset(kmer_data, 0, sizeof(*kmer_data) * no_of_rows); } //---------------------------------------------------------------------------------- inline void insert2bits(uint32 pos, uchar val) { kmer_data[(pos + byte_alignment) >> 5] += (uint64)val << (62 - (((pos + byte_alignment) & 31) * 2)); } inline uchar extract2bits(uint32 pos) { return (kmer_data[(pos + byte_alignment) >> 5] >> (62 - (((pos + byte_alignment) & 31) * 2))) & 3; } //---------------------------------------------------------------------------------- inline void SHL_insert2bits(uchar val) { kmer_data[0] <<= 2; if (byte_alignment) { uint64 mask = ~(((1ull << 2 * byte_alignment) - 1) << (64 - 2 * byte_alignment)); kmer_data[0] &= mask; } for (uint32 i = 1; i < no_of_rows; ++i) { kmer_data[i - 1] += kmer_data[i] >> 62; kmer_data[i] <<= 2; } kmer_data[no_of_rows - 1] += (uint64)val 
<< (62 - (((kmer_length - 1 + byte_alignment) & 31) * 2)); } //---------------------------------------------------------------------------------- inline void SHR_insert2bits(uchar val) { for (uint32 i = no_of_rows - 1; i > 0; --i) { kmer_data[i] >>= 2; kmer_data[i] += kmer_data[i - 1] << 62; } kmer_data[0] >>= 2; kmer_data[no_of_rows - 1] &= ~((1ull << ((32 - (kmer_length + byte_alignment - (no_of_rows - 1) * 32)) * 2)) - 1);//mask falling of symbol kmer_data[0] += ((uint64)val << 62) >> (byte_alignment * 2); } // ---------------------------------------------------------------------------------- inline void from_binary(const char* kmer) { clear(); for (uint32 i = 0; i < kmer_length; ++i) insert2bits(i, kmer[i]); } // ---------------------------------------------------------------------------------- inline void from_binary_rev(const char* kmer) { clear(); for (uint32 i = 0; i < kmer_length; ++i) insert2bits(i, 3 - kmer[kmer_length - i - 1]); } // ---------------------------------------------------------------------------------- template inline void to_string_impl(RandomAccessIterator iter) { uchar *byte_ptr; uchar c; uchar temp_byte_alignment = byte_alignment; uint32 cur_string_size = 0; for (uint32 row_counter = 0; row_counter < no_of_rows; row_counter++) { byte_ptr = reinterpret_cast(&kmer_data[row_counter]); byte_ptr += 7; // shift a pointer towards a MSB for (uint32 i = 0; (i < kmer_length) && (i < 32); i += 4) // 32 symbols of any "row" in kmer_data { if ((i == 0) && temp_byte_alignment) // check if a byte_alignment placed before a prefix is to be skipped temp_byte_alignment--; else { c = 0xc0 & *byte_ptr; //11000000 c = c >> 6; *(iter + cur_string_size++) = char_codes[c]; if (cur_string_size == kmer_length) break; } if ((i == 0) && temp_byte_alignment) // check if a byte_alignment placed before a prefix is to be skipped temp_byte_alignment--; else { c = 0x30 & *byte_ptr; //00110000 c = c >> 4; *(iter + cur_string_size++) = char_codes[c]; if (cur_string_size == 
kmer_length) break; } if ((i == 0) && temp_byte_alignment) // check if a byte_alignment placed before a prefix is to be skipped temp_byte_alignment--; else { c = 0x0c & *byte_ptr; //00001100 c = c >> 2; *(iter + cur_string_size++) = char_codes[c]; if (cur_string_size == kmer_length) break; } // no need to check byte alignment as its length is at most 3 c = 0x03 & *byte_ptr; //00000011 *(iter + cur_string_size++) = char_codes[c]; if (cur_string_size == kmer_length) break; byte_ptr--; } } } // ---------------------------------------------------------------------------------- template inline bool from_string_impl(const RandomAccessIterator iter, uint32 len) { unsigned char c_char; uchar c_binary; uchar temp_byte_alignment; if (kmer_length != len) { if (kmer_length && kmer_data) delete[] kmer_data; kmer_length = len; if (kmer_length % 4) byte_alignment = 4 - (kmer_length % 4); else byte_alignment = 0; if (kmer_length != 0) { no_of_rows = (((kmer_length + byte_alignment) % 32) ? (kmer_length + byte_alignment) / 32 + 1 : (kmer_length + byte_alignment) / 32); //no_of_rows = (int)ceil((double)(kmer_length + byte_alignment) / 32); kmer_data = new uint64[no_of_rows]; //memset(kmer_data, 0, sizeof(*kmer_data) * no_of_rows); } } memset(kmer_data, 0, sizeof(*kmer_data) * no_of_rows); temp_byte_alignment = byte_alignment; uint32 i = 0; uint32 i_in_string = 0; uchar *byte_ptr; for (uint32 row_index = 0; row_index < no_of_rows; row_index++) { byte_ptr = reinterpret_cast(&kmer_data[row_index]); byte_ptr += 7; // shift a pointer towards a MSB while (i < kmer_length) { if ((i_in_string == 0) && temp_byte_alignment) // check if a byte_alignment placed before a prefix is to be skipped { temp_byte_alignment--; i++; } else { c_char = *(iter + i_in_string); c_binary = num_codes[c_char]; c_binary = c_binary << 6; //11000000 *byte_ptr = *byte_ptr | c_binary; i++; i_in_string++; if (i_in_string == kmer_length) break; } if ((i_in_string == 0) && temp_byte_alignment) // check if a 
byte_alignment placed before a prefix is to be skipped { temp_byte_alignment--; i++; } else { c_char = *(iter + i_in_string); c_binary = num_codes[c_char]; c_binary = c_binary << 4; *byte_ptr = *byte_ptr | c_binary; i++; i_in_string++; if (i_in_string == kmer_length) break; } //!!!if((i == 0) && temp_byte_alignment) //poprawka zg3oszona przez Maaka D3ugosza // check if a byte_alignment placed before a prefix is to be skipped if ((i_in_string == 0) && temp_byte_alignment) // check if a byte_alignment placed before a prefix is to be skipped { temp_byte_alignment--; i++; } else { c_char = *(iter + i_in_string); c_binary = num_codes[c_char]; c_binary = c_binary << 2; *byte_ptr = *byte_ptr | c_binary; i++; i_in_string++; if (i_in_string == kmer_length) break; } c_char = *(iter + i_in_string); c_binary = num_codes[c_char]; *byte_ptr = *byte_ptr | c_binary; i++; i_in_string++; if (i_in_string == kmer_length) break; if (i % 32 == 0) break; //check if a new "row" is to be started byte_ptr--; } }; return true; } public: static const char char_codes[]; static char num_codes[256]; static uchar rev_comp_bytes_LUT[]; static uint64 alignment_mask[]; struct _si { _si() { for (int i = 0; i < 256; i++) num_codes[i] = -1; num_codes['A'] = num_codes['a'] = 0; num_codes['C'] = num_codes['c'] = 1; num_codes['G'] = num_codes['g'] = 2; num_codes['T'] = num_codes['t'] = 3; } } static _init; // ---------------------------------------------------------------------------------- // The constructor creates kmer for the number of symbols equal to length. // The array kmer_data has the size of ceil((length + byte_alignment) / 32)) // IN : length - a number of symbols of a kmer // ---------------------------------------------------------------------------------- inline CKmerAPI(uint32 length = 0) { if(length) { if(length % 4) byte_alignment = 4 - (length % 4); else byte_alignment = 0; no_of_rows = (((length + byte_alignment) % 32) ? 
(length + byte_alignment) / 32 + 1 : (length + byte_alignment) / 32); //no_of_rows = (int)ceil((double)(length + byte_alignment) / 32); kmer_data = new uint64[no_of_rows]; memset(kmer_data, 0, sizeof(*kmer_data) * no_of_rows); } else { kmer_data = NULL; no_of_rows = 0; byte_alignment = 0; } kmer_length = length; }; //----------------------------------------------------------------------- // The destructor //----------------------------------------------------------------------- inline ~CKmerAPI() { if (kmer_data != NULL) delete [] kmer_data; }; //----------------------------------------------------------------------- // The copy constructor //----------------------------------------------------------------------- inline CKmerAPI(const CKmerAPI &kmer) { kmer_length = kmer.kmer_length; byte_alignment = kmer.byte_alignment; no_of_rows = kmer.no_of_rows; kmer_data = new uint64[no_of_rows]; for(uint32 i = 0; i < no_of_rows; i++) kmer_data[i] = kmer.kmer_data[i]; }; //----------------------------------------------------------------------- // The operator = //----------------------------------------------------------------------- inline CKmerAPI& operator=(const CKmerAPI &kmer) { if(kmer.kmer_length != kmer_length) { if(kmer_length && kmer_data) delete [] kmer_data; kmer_length = kmer.kmer_length; byte_alignment = kmer.byte_alignment; no_of_rows = kmer.no_of_rows; kmer_data = new uint64[no_of_rows]; } for(uint32 i = 0; i < no_of_rows; i++) kmer_data[i] = kmer.kmer_data[i]; return *this; }; //----------------------------------------------------------------------- // The operator == //----------------------------------------------------------------------- inline bool operator==(const CKmerAPI &kmer) { if(kmer.kmer_length != kmer_length) return false; for(uint32 i = 0; i < no_of_rows; i++) if(kmer.kmer_data[i] != kmer_data[i]) return false; return true; }; //----------------------------------------------------------------------- // Operator < . 
If arguments differ in length a result is undefined //----------------------------------------------------------------------- inline bool operator<(const CKmerAPI &kmer) { if(kmer.kmer_length != kmer_length) return false; for(uint32 i = 0; i < no_of_rows; i++) if(kmer.kmer_data[i] > kmer_data[i]) return true; else if(kmer.kmer_data[i] < kmer_data[i]) return false; return false; }; //----------------------------------------------------------------------- // Return a symbol of a kmer from an indicated position (numbered from 0). // The symbol is returned as an ASCII character A/C/G/T // IN : pos - a position of a symbol // RET : symbol - a symbol placed on a position pos //----------------------------------------------------------------------- inline char get_asci_symbol(unsigned int pos) { if(pos >= kmer_length) return 0; uint32 current_row = (pos + byte_alignment) / 32; uint32 current_pos = ((pos + byte_alignment) % 32) * 2; uint64 mask = 0xc000000000000000 >> current_pos; uint64 symbol = kmer_data[current_row] & mask; symbol = symbol >> (64 - current_pos - 2); return char_codes[symbol]; }; //----------------------------------------------------------------------- // Return a symbol of a kmer from an indicated position (numbered from 0) // The symbol is returned as a numerical value 0/1/2/3 // IN : pos - a position of a symbol // RET : symbol - a symbol placed on a position pos //----------------------------------------------------------------------- inline uchar get_num_symbol(unsigned int pos) { if (pos >= kmer_length) return 0; uint32 current_row = (pos + byte_alignment) / 32; uint32 current_pos = ((pos + byte_alignment) % 32) * 2; uint64 mask = 0xc000000000000000 >> current_pos; uint64 symbol = kmer_data[current_row] & mask; symbol = symbol >> (64 - current_pos - 2); uchar* byte_ptr = reinterpret_cast<uchar*>(&symbol); return *byte_ptr; }; //----------------------------------------------------------------------- // Convert kmer into string (an alphabet ACGT) // RET : 
//       string kmer
//-----------------------------------------------------------------------
inline std::string to_string()
{
    std::string string_kmer;
    string_kmer.resize(kmer_length);
    to_string_impl(string_kmer.begin());
    return string_kmer;
};

//-----------------------------------------------------------------------
// Convert kmer into string (an alphabet ACGT). The function assumes enough memory was allocated
// OUT : str - string kmer.
//-----------------------------------------------------------------------
inline void to_string(char *str)
{
    to_string_impl(str);
    str[kmer_length] = '\0';
};

inline void to_long(std::vector<uint64>& kmer)
{
    kmer.resize(no_of_rows);
    uint32 offset = 62 - ((kmer_length - 1 + byte_alignment) & 31) * 2;
    if (offset)
    {
        for (int32 i = no_of_rows - 1; i >= 1; --i)
        {
            kmer[i] = kmer_data[i] >> offset;
            kmer[i] += kmer_data[i - 1] << (64 - offset);
        }
        kmer[0] = kmer_data[0] >> offset;
    }
    else
    {
        // offset == 0: the k-mer is already right-aligned, and a shift by 64 bits would be undefined
        for (int32 i = no_of_rows - 1; i >= 0; --i)
            kmer[i] = kmer_data[i];
    }
}

//-----------------------------------------------------------------------
// Convert kmer into string (an alphabet ACGT)
// OUT : str - string kmer
//-----------------------------------------------------------------------
inline void to_string(std::string &str)
{
    str.resize(kmer_length);
    to_string_impl(str.begin());
};

//-----------------------------------------------------------------------
// Convert a string of an alphabet ACGT into a kmer of a CKmerAPI
// IN  : kmer_string - a string of an alphabet ACGT
// RET : true - if successful
//-----------------------------------------------------------------------
inline bool from_string(const char* kmer_string)
{
    uint32 len = 0;
    for (; kmer_string[len] != '\0' ; ++len)
    {
        if (num_codes[(uchar)kmer_string[len]] == -1)
            return false;
    }
    return from_string_impl(kmer_string, len);
}

//-----------------------------------------------------------------------
// Convert a string of an alphabet ACGT into a kmer of a CKmerAPI
// IN  : kmer_string - a string of an alphabet ACGT
// RET : true - if successful
//-----------------------------------------------------------------------
inline bool from_string(const std::string& kmer_string)
{
    for (uint32 ii = 0; ii < kmer_string.size(); ++ii)
    {
        if (num_codes[(uchar)kmer_string[ii]] == -1)
            return false;
    }
    return from_string_impl(kmer_string.begin(), static_cast<uint32>(kmer_string.length()));
}

//-----------------------------------------------------------------------
// Convert k-mer to its reverse complement
//-----------------------------------------------------------------------
inline bool reverse()
{
    if (kmer_data == NULL)
    {
        return false;
    }

    // number of bytes used to store the k-mer in the 0-th row
    const uint32 size_in_byte = ((kmer_length + byte_alignment) / 4) / no_of_rows;
    uchar* byte1;
    uchar* byte2;

    if (no_of_rows == 1)
    {
        *kmer_data <<= 2 * byte_alignment;
        byte1 = reinterpret_cast<uchar*>(kmer_data)+8 - size_in_byte;
        byte2 = reinterpret_cast<uchar*>(kmer_data)+7;

        for (uint32 i_bytes = 0; i_bytes < size_in_byte / 2; ++i_bytes)
        {
            unsigned char temp = rev_comp_bytes_LUT[*byte1];
            *byte1 = rev_comp_bytes_LUT[*byte2];
            *byte2 = temp;

            ++byte1;
            --byte2;
        }

        if (size_in_byte % 2)
        {
            *byte1 = rev_comp_bytes_LUT[*byte1];
        }
    }
    else
    {
        for (uint32 i_rows = no_of_rows - 1; i_rows > 0; --i_rows)
        {
            kmer_data[i_rows] >>= 64 - 8 * size_in_byte - 2 * byte_alignment;

            // more significant row
            uint64 previous = kmer_data[i_rows - 1];
            previous <<= 8 * size_in_byte + 2 * byte_alignment;
            kmer_data[i_rows] |= previous;

            byte1 = reinterpret_cast<uchar*>(kmer_data + i_rows);
            byte2 = reinterpret_cast<uchar*>(kmer_data + i_rows) + 7;

            for (int i_bytes = 0; i_bytes < 4; ++i_bytes)
            {
                unsigned char temp = rev_comp_bytes_LUT[*byte1];
                *byte1 = rev_comp_bytes_LUT[*byte2];
                *byte2 = temp;

                ++byte1;
                --byte2;
            }
        }

        // clear less significant bits
        kmer_data[0] >>= 64 - 8 * size_in_byte - 2 * byte_alignment;
        kmer_data[0] <<= 64 - 8 * size_in_byte;

        byte1 = reinterpret_cast<uchar*>(kmer_data)+8 - size_in_byte;
        byte2 = reinterpret_cast<uchar*>(kmer_data)+7;

        for (uint32 i_bytes = 0; i_bytes < size_in_byte / 2; ++i_bytes)
        {
unsigned char temp = rev_comp_bytes_LUT[*byte1]; *byte1 = rev_comp_bytes_LUT[*byte2]; *byte2 = temp; ++byte1; --byte2; } if (size_in_byte % 2) { *byte1 = rev_comp_bytes_LUT[*byte1]; } for (uint32 i_rows = 0; i_rows < no_of_rows / 2; ++i_rows) { std::swap(kmer_data[i_rows], kmer_data[no_of_rows - i_rows - 1]); } } // clear alignment *kmer_data &= alignment_mask[byte_alignment]; return true; } //----------------------------------------------------------------------- // Counts a signature of an existing kmer // IN : sig_len - the length of a signature // RET : signature value //----------------------------------------------------------------------- uint32 get_signature(uint32 sig_len) { uchar symb; CMmer cur_mmr(sig_len); for(uint32 i = 0; i < sig_len; ++i) { symb = get_num_symbol(i); cur_mmr.insert(symb); } CMmer min_mmr(cur_mmr); for (uint32 i = sig_len; i < kmer_length; ++i) { symb = get_num_symbol(i); cur_mmr.insert(symb); if (cur_mmr < min_mmr) min_mmr = cur_mmr; } return min_mmr.get(); } }; #endif // ***** EOF KMC-2.3/kmc_api/kmer_defs.h000066400000000000000000000017761257432033000155150ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz and Agnieszka Debudaj-Grabysz Version: 2.3.0 Date : 2015-08-21 */ #ifndef _KMER_DEFS_H #define _KMER_DEFS_H #define KMC_VER "2.3.0" #define KMC_DATE "2015-08-21" #define MIN(x,y) ((x) < (y) ? 
(x) : (y)) #ifndef WIN32 #include #include #include #include #include #define _TCHAR char #define _tmain main #define my_fopen fopen #define my_fseek fseek #define my_ftell ftell #include #include #include using namespace std; #else #define my_fopen fopen #define my_fseek _fseeki64 #define my_ftell _ftelli64 #endif //typedef unsigned char uchar; typedef int int32; typedef unsigned int uint32; typedef long long int64; typedef unsigned long long uint64; typedef unsigned char uchar; #endif // ***** EOF KMC-2.3/kmc_api/mmer.cpp000066400000000000000000000015201257432033000150340ustar00rootroot00000000000000#include "stdafx.h" /* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "../kmc_api/mmer.h" uint32 CMmer::norm5[]; uint32 CMmer::norm6[]; uint32 CMmer::norm7[]; uint32 CMmer::norm8[]; CMmer::_si CMmer::_init; //-------------------------------------------------------------------------- CMmer::CMmer(uint32 _len) { switch (_len) { case 5: norm = norm5; break; case 6: norm = norm6; break; case 7: norm = norm7; break; case 8: norm = norm8; break; default: break; } len = _len; mask = (1 << _len * 2) - 1; str = 0; } //-------------------------------------------------------------------------- KMC-2.3/kmc_api/mmer.h000066400000000000000000000101031257432033000144760ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot

Version: 2.3.0
Date   : 2015-08-21
*/

#ifndef _MMER_H
#define _MMER_H
#include "kmer_defs.h"

// *************************************************************************
// *************************************************************************
class CMmer
{
    uint32 str;
    uint32 mask;
    uint32 current_val;
    uint32* norm;
    uint32 len;
    static uint32 norm5[1 << 10];
    static uint32 norm6[1 << 12];
    static uint32 norm7[1 << 14];
    static uint32 norm8[1 << 16];

    // Symbols are encoded on 2 bits each: A = 0, C = 1, G = 2, T = 3
    static bool is_allowed(uint32 mmer, uint32 len)
    {
        if ((mmer & 0x3f) == 0x3f)    // TTT suffix
            return false;
        if ((mmer & 0x3f) == 0x3b)    // TGT suffix
            return false;
        if ((mmer & 0x3c) == 0x3c)    // TT* suffix
            return false;

        for (uint32 j = 0; j < len - 3; ++j)
            if ((mmer & 0xf) == 0)    // AA inside
                return false;
            else
                mmer >>= 2;

        if (mmer == 0)                // AAA prefix
            return false;
        if (mmer == 0x04)             // ACA prefix
            return false;
        if ((mmer & 0xf) == 0)        // *AA prefix
            return false;

        return true;
    }

    friend class CSignatureMapper;

    struct _si
    {
        // Reverse complement of an m-mer packed on 2 bits per symbol
        static uint32 get_rev(uint32 mmer, uint32 len)
        {
            uint32 rev = 0;
            uint32 shift = len*2 - 2;

            for(uint32 i = 0 ; i < len ; ++i)
            {
                rev += (3 - (mmer & 3)) << shift;
                mmer >>= 2;
                shift -= 2;
            }

            return rev;
        }

        static void init_norm(uint32* norm, uint32 len)
        {
            uint32 special = 1 << len * 2;
            for(uint32 i = 0 ; i < special ; ++i)
            {
                uint32 rev = get_rev(i, len);
                uint32 str_val = is_allowed(i, len) ? i : special;
                uint32 rev_val = is_allowed(rev, len) ?
rev : special; norm[i] = MIN(str_val, rev_val); } } _si() { init_norm(norm5, 5); init_norm(norm6, 6); init_norm(norm7, 7); init_norm(norm8, 8); } }static _init; public: CMmer(uint32 _len); inline void insert(uchar symb); inline uint32 get() const; inline bool operator==(const CMmer& x); inline bool operator<(const CMmer& x); inline void clear(); inline bool operator<=(const CMmer& x); inline void set(const CMmer& x); inline void insert(const char* seq); }; //-------------------------------------------------------------------------- inline void CMmer::insert(uchar symb) { str <<= 2; str += symb; str &= mask; current_val = norm[str]; } //-------------------------------------------------------------------------- inline uint32 CMmer::get() const { return current_val; } //-------------------------------------------------------------------------- inline bool CMmer::operator==(const CMmer& x) { return current_val == x.current_val; } //-------------------------------------------------------------------------- inline bool CMmer::operator<(const CMmer& x) { return current_val < x.current_val; } //-------------------------------------------------------------------------- inline void CMmer::clear() { str = 0; } //-------------------------------------------------------------------------- inline bool CMmer::operator<=(const CMmer& x) { return current_val <= x.current_val; } //-------------------------------------------------------------------------- inline void CMmer::set(const CMmer& x) { str = x.str; current_val = x.current_val; } //-------------------------------------------------------------------------- inline void CMmer::insert(const char* seq) { switch (len) { case 5: str = (seq[0] << 8) + (seq[1] << 6) + (seq[2] << 4) + (seq[3] << 2) + (seq[4]); break; case 6: str = (seq[0] << 10) + (seq[1] << 8) + (seq[2] << 6) + (seq[3] << 4) + (seq[4] << 2) + (seq[5]); break; case 7: str = (seq[0] << 12) + (seq[1] << 10) + (seq[2] << 8) + (seq[3] << 6) + (seq[4] << 4 ) + (seq[5] << 2) 
+ (seq[6]); break; case 8: str = (seq[0] << 14) + (seq[1] << 12) + (seq[2] << 10) + (seq[3] << 8) + (seq[4] << 6) + (seq[5] << 4) + (seq[6] << 2) + (seq[7]); break; default: break; } current_val = norm[str]; } #endifKMC-2.3/kmc_api/stdafx.h000066400000000000000000000001251257432033000150320ustar00rootroot00000000000000#include #include #include using namespace std; KMC-2.3/kmc_dump/000077500000000000000000000000001257432033000135665ustar00rootroot00000000000000KMC-2.3/kmc_dump/kmc_dump.cpp000066400000000000000000000076411257432033000161010ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc This file demonstrates the example usage of kmc_api software. It reads kmer_counter's output and prints kmers to an output file. Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "stdafx.h" #include #include "../kmc_api/kmc_file.h" #include "nc_utils.h" void print_info(void); int _tmain(int argc, char* argv[]) { CKMCFile kmer_data_base; int32 i; uint32 min_count_to_set = 0; uint32 max_count_to_set = 0; std::string input_file_name; std::string output_file_name; FILE * out_file; //------------------------------------------------------------ // Parse input parameters //------------------------------------------------------------ if(argc < 3) { print_info(); return EXIT_FAILURE; } for(i = 1; i < argc; ++i) { if(argv[i][0] == '-') { if(strncmp(argv[i], "-ci", 3) == 0) min_count_to_set = atoi(&argv[i][3]); else if(strncmp(argv[i], "-cx", 3) == 0) max_count_to_set = atoi(&argv[i][3]); } else break; } if(argc - i < 2) { print_info(); return EXIT_FAILURE; } input_file_name = std::string(argv[i++]); output_file_name = std::string(argv[i]); if((out_file = fopen (output_file_name.c_str(),"wb")) == NULL) { print_info(); return EXIT_FAILURE; } setvbuf(out_file, NULL ,_IOFBF, 1 << 24); 
    //------------------------------------------------------------------------------
    // Open kmer database for listing and print kmers within min_count and max_count
    //------------------------------------------------------------------------------

    if (!kmer_data_base.OpenForListing(input_file_name))
    {
        print_info();
        return EXIT_FAILURE ;
    }
    else
    {
        uint32 _kmer_length;
        uint32 _mode;
        uint32 _counter_size;
        uint32 _lut_prefix_length;
        uint32 _signature_len;
        uint32 _min_count;
        uint64 _max_count;
        uint64 _total_kmers;

        kmer_data_base.Info(_kmer_length, _mode, _counter_size, _lut_prefix_length, _signature_len, _min_count, _max_count, _total_kmers);

        //std::string str;
        char str[1024];
        uint32 counter_len;

        CKmerAPI kmer_object(_kmer_length);

        if(min_count_to_set)
            if (!(kmer_data_base.SetMinCount(min_count_to_set)))
                return EXIT_FAILURE;
        if(max_count_to_set)
            if (!(kmer_data_base.SetMaxCount(max_count_to_set)))
                return EXIT_FAILURE;

        if (_mode) //quake compatible mode
        {
            float counter;
            while (kmer_data_base.ReadNextKmer(kmer_object, counter))
            {
                kmer_object.to_string(str);
                str[_kmer_length] = '\t';
                counter_len = CNumericConversions::Double2PChar(counter, 6, (uchar*)str + _kmer_length + 1);
                str[_kmer_length + 1 + counter_len] = '\n';
                fwrite(str, 1, _kmer_length + counter_len + 2, out_file);
            }
        }
        else
        {
            uint64 counter;
            while (kmer_data_base.ReadNextKmer(kmer_object, counter))
            {
                kmer_object.to_string(str);
                str[_kmer_length] = '\t';
                counter_len = CNumericConversions::Int2PChar(counter, (uchar*)str + _kmer_length + 1);
                str[_kmer_length + 1 + counter_len] = '\n';
                fwrite(str, 1, _kmer_length + counter_len + 2, out_file);
            }
        }

        fclose(out_file);
        kmer_data_base.Close();
    }

    return EXIT_SUCCESS;
}

// -------------------------------------------------------------------------
// Print execution options
// -------------------------------------------------------------------------
void print_info(void)
{
    std::cout << "KMC dump ver. " << KMC_VER << " (" << KMC_DATE << ")\n";
    std::cout << "\nUsage:\nkmc_dump [options] <kmc_database> <output_file>\n";
    std::cout << "Parameters:\n";
    std::cout << "<kmc_database> - kmer_counter's output\n";
    std::cout << "Options:\n";
    std::cout << "-ci<value> - exclude k-mers occurring less than <value> times\n";
    std::cout << "-cx<value> - exclude k-mers occurring more than <value> times\n";
};

// ***** EOF
KMC-2.3/kmc_dump/kmc_dump.vcxproj000066400000000000000000000200371257432033000170040ustar00rootroot00000000000000
 Debug Win32 Debug x64 Release Win32 Release x64 {8939AD12-23D5-469C-806B-DC3F98F8A514} Win32Proj kmc_dump Application true v120 NotSet Application true v120 Unicode Application false v120 true NotSet Static Application false v120 true NotSet Static true true false false Use Level3 Disabled WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions) Console true Use Level3 Disabled WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions) Console true Level3 Use Full true true WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions) Speed Console true true true Level3 Use Full true true WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions) Speed Console true true true Create Create Create Create
KMC-2.3/kmc_dump/nc_utils.cpp000066400000000000000000000010751257432033000161150ustar00rootroot00000000000000
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  This file demonstrates the example usage of kmc_api software.
  It reads kmer_counter's output and prints kmers to an output file.

  Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot

  Version: 2.3.0
  Date   : 2015-08-21
*/

#include "stdafx.h"
#include "nc_utils.h"

uchar CNumericConversions::digits[100000*5];
int CNumericConversions::powOf10[30];
CNumericConversions::_si CNumericConversions::_init;
KMC-2.3/kmc_dump/nc_utils.h000066400000000000000000000067211257432033000155650ustar00rootroot00000000000000
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc This file demonstrates the example usage of kmc_api software. It reads kmer_counter's output and prints kmers to an output file. Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include #include "../kmc_api/kmer_defs.h" #ifndef _NC_UTILS_H #define _NC_UTILS_H class CNumericConversions { public: static uchar digits[100000*5]; static int powOf10[30]; struct _si { _si() { for(int i = 0; i < 100000; ++i) { int dig = i; digits[i*5+4] = '0' + (dig % 10); dig /= 10; digits[i*5+3] = '0' + (dig % 10); dig /= 10; digits[i*5+2] = '0' + (dig % 10); dig /= 10; digits[i*5+1] = '0' + (dig % 10); dig /= 10; digits[i*5+0] = '0' + dig; } powOf10[0] = 1; for(int i = 1 ; i < 30 ; ++i) { powOf10[i] = powOf10[i-1]*10; } } } static _init; static int NDigits(uint64 val) { if(val >= 10000) return 5; else if(val >= 1000) return 4; else if(val >= 100) return 3; else if(val >= 10) return 2; else return 1; } static int Int2PChar(uint64 val, uchar *str) { if(val >= 1000000000000000ull) { uint64 dig1 = val / 1000000000000000ull; val -= dig1 * 1000000000000000ull; uint64 dig2 = val / 10000000000ull; val -= dig2 * 10000000000ull; uint64 dig3 = val / 100000ull; uint64 dig4 = val - dig3 * 100000ull; int ndig = NDigits(dig1); memcpy(str, digits+dig1*5+(5-ndig), ndig); memcpy(str+ndig, digits+dig2*5, 5); memcpy(str+ndig+5, digits+dig3*5, 5); memcpy(str+ndig+10, digits+dig4*5, 5); return ndig+15; } else if(val >= 10000000000ull) { uint64 dig1 = val / 10000000000ull; val -= dig1 * 10000000000ull; uint64 dig2 = val / 100000ull; uint64 dig3 = val - dig2 * 100000ull; int ndig = NDigits(dig1); memcpy(str, digits+dig1*5+(5-ndig), ndig); memcpy(str+ndig, digits+dig2*5, 5); memcpy(str+ndig+5, digits+dig3*5, 5); return ndig+10; } else if(val >= 100000ull) { uint64 dig1 = val / 100000ull; uint64 dig2 = val - dig1 * 100000ull; int ndig = NDigits(dig1); memcpy(str, digits+dig1*5+(5-ndig), ndig); 
memcpy(str+ndig, digits+dig2*5, 5); return ndig+5; } else { int ndig = NDigits(val); memcpy(str, digits+val*5+(5-ndig), ndig); return ndig; } } static int Double2PChar(double val, int prec, uchar *str) { double corrector = .5 / powOf10[prec]; val += corrector; double ipart; double fractPart = std::modf(val, &ipart); uint32 intPart = (uint32)ipart; uint32 len = Int2PChar(intPart, str); uint32 pos = len; str[pos++] = '.'; for(int i = 0 ; i < prec ; ++i) { fractPart *= 10; str[pos++] = '0' + (uint32)fractPart % 10 ; } return len + prec + 1; } }; #endifKMC-2.3/kmc_dump/stdafx.cpp000066400000000000000000000004371257432033000155670ustar00rootroot00000000000000// stdafx.cpp : source file that includes just the standard includes // kmc_dump.pch will be the pre-compiled header // stdafx.obj will contain the pre-compiled type information #include "stdafx.h" // TODO: reference any additional headers you need in STDAFX.H // and not in this file KMC-2.3/kmc_dump/stdafx.h000066400000000000000000000005041257432033000152270ustar00rootroot00000000000000#ifdef WIN32 // stdafx.h : include file for standard system include files, // or project specific include files that are used frequently, but // are changed infrequently // #pragma once #include "targetver.h" #include #include // TODO: reference additional headers your program requires here #endifKMC-2.3/kmc_dump/targetver.h000066400000000000000000000004621257432033000157440ustar00rootroot00000000000000#pragma once // Including SDKDDKVer.h defines the highest available Windows platform. // If you wish to build your application for a previous Windows platform, include WinSDKVer.h and // set the _WIN32_WINNT macro to the platform you wish to support before including SDKDDKVer.h. 
#include KMC-2.3/kmc_dump_sample/000077500000000000000000000000001257432033000151275ustar00rootroot00000000000000KMC-2.3/kmc_dump_sample/kmc_dump_sample.cpp000066400000000000000000000067731257432033000210100ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc This file demonstrates the example usage of kmc_api software. It reads kmer_counter's output and prints kmers to an output file. Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "stdafx.h" #include #include "../kmc_api/kmc_file.h" void print_info(void); int _tmain(int argc, char* argv[]) { CKMCFile kmer_data_base; int32 i; uint32 min_count_to_set = 0; uint32 max_count_to_set = 0; std::string input_file_name; std::string output_file_name; FILE * out_file; //------------------------------------------------------------ // Parse input parameters //------------------------------------------------------------ if(argc < 3) { print_info(); return EXIT_FAILURE; } for(i = 1; i < argc; ++i) { if(argv[i][0] == '-') { if(strncmp(argv[i], "-ci", 3) == 0) min_count_to_set = atoi(&argv[i][3]); else if(strncmp(argv[i], "-cx", 3) == 0) max_count_to_set = atoi(&argv[i][3]); } else break; } if(argc - i < 2) { print_info(); return EXIT_FAILURE; } input_file_name = std::string(argv[i++]); output_file_name = std::string(argv[i]); if((out_file = fopen (output_file_name.c_str(),"wb")) == NULL) { print_info(); return EXIT_FAILURE; } setvbuf(out_file, NULL ,_IOFBF, 1 << 24); //------------------------------------------------------------------------------ // Open kmer database for listing and print kmers within min_count and max_count //------------------------------------------------------------------------------ if (!kmer_data_base.OpenForListing(input_file_name)) { print_info(); return EXIT_FAILURE ; } else { uint32 _kmer_length; uint32 _mode; uint32 
_counter_size;
        uint32 _lut_prefix_length;
        uint32 _signature_len;
        uint32 _min_count;
        uint64 _max_count;
        uint64 _total_kmers;

        kmer_data_base.Info(_kmer_length, _mode, _counter_size, _lut_prefix_length, _signature_len, _min_count, _max_count, _total_kmers);

        CKmerAPI kmer_object(_kmer_length);

        if(min_count_to_set)
            if (!(kmer_data_base.SetMinCount(min_count_to_set)))
                return EXIT_FAILURE;
        if(max_count_to_set)
            if (!(kmer_data_base.SetMaxCount(max_count_to_set)))
                return EXIT_FAILURE;

        std::string str;
        if (_mode) //quake compatible mode
        {
            float counter;
            while (kmer_data_base.ReadNextKmer(kmer_object, counter))
            {
                kmer_object.to_string(str);
                fprintf(out_file, "%s\t%f\n", str.c_str(), counter);
            }
        }
        else
        {
            uint32 counter;
            while (kmer_data_base.ReadNextKmer(kmer_object, counter))
            {
                kmer_object.to_string(str);
                fprintf(out_file, "%s\t%u\n", str.c_str(), counter);
            }
        }

        fclose(out_file);
        kmer_data_base.Close();
    }

    return EXIT_SUCCESS;
}

// -------------------------------------------------------------------------
// Print execution options
// -------------------------------------------------------------------------
void print_info(void)
{
    std::cout << "KMC dump ver. " << KMC_VER << " (" << KMC_DATE << ")\n";
    std::cout << "\nUsage:\nkmc_dump [options] <kmc_database> <output_file>\n";
    std::cout << "Parameters:\n";
    std::cout << "<kmc_database> - kmer_counter's output\n";
    std::cout << "Options:\n";
    std::cout << "-ci<value> - exclude k-mers occurring less than <value> times\n";
    std::cout << "-cx<value> - exclude k-mers occurring more than <value> times\n";
};

// ***** EOF
KMC-2.3/kmc_dump_sample/kmc_dump_sample.vcxproj000066400000000000000000000177351257432033000217210ustar00rootroot00000000000000
 Debug Win32 Debug x64 Release Win32 Release x64 {17823F37-86DE-4E58-B354-B84DA9EDA6A1} Win32Proj kmc_dump_sample Application true v120 NotSet Application true v120 Unicode Application false v120 true NotSet Static Application false v120 true NotSet Static true true false false Use Level3 Disabled WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions) Console true Use Level3 Disabled WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions) Console true Level3 Use Full true true WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions) Speed Console true true true Level3 Use Full true true WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions) Speed Console true true true Create Create Create Create
KMC-2.3/kmc_dump_sample/stdafx.cpp000066400000000000000000000004461257432033000171280ustar00rootroot00000000000000
// stdafx.cpp : source file that includes just the standard includes
// kmc_dump_sample.pch will be the pre-compiled header
// stdafx.obj will contain the pre-compiled type information

#include "stdafx.h"

// TODO: reference any additional headers you need in STDAFX.H
// and not in this file
KMC-2.3/kmc_dump_sample/stdafx.h000066400000000000000000000005041257432033000165700ustar00rootroot00000000000000
#ifdef WIN32
// stdafx.h : include file for standard system include files,
// or project specific include files that are used frequently, but
// are changed infrequently
//

#pragma once

#include "targetver.h"

#include
#include

// TODO: reference additional headers your program requires here
#endif
KMC-2.3/kmc_dump_sample/targetver.h000066400000000000000000000004621257432033000173050ustar00rootroot00000000000000
#pragma once

// Including SDKDDKVer.h defines the highest available Windows platform.

// If you wish to build your application for a previous Windows platform, include WinSDKVer.h and
// set the _WIN32_WINNT macro to the platform you wish to support before including SDKDDKVer.h.

#include <SDKDDKVer.h>
KMC-2.3/kmc_tools/000077500000000000000000000000001257432033000137615ustar00rootroot00000000000000
KMC-2.3/kmc_tools/asmlib_wrapper.h000066400000000000000000000006321257432033000171420ustar00rootroot00000000000000
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  Authors: Marek Kokot

  Version: 2.3.0
  Date   : 2015-08-21
*/

#ifndef _ASMLIB_WRAPPER_H
#define _ASMLIB_WRAPPER_H

#include "defs.h"

#ifdef DISABLE_ASMLIB
#define A_memcpy memcpy
#define SetMemcpyCacheLimit(X)
#else
#include "libs/asmlib.h"
#endif

#endif
KMC-2.3/kmc_tools/bundle.h000066400000000000000000000124361257432033000154110ustar00rootroot00000000000000
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  Authors: Marek Kokot

  Version: 2.3.0
  Date   : 2015-08-21
*/

#ifndef _BUNDLE_H
#define _BUNDLE_H
#include "defs.h"
#include "kmer.h"

//************************************************************************************************************
// CBundle and CInput are the CORE classes of this application. CInputs are nodes of a binary tree which
// represent operations. Leaves of this tree are KMC database (1 or 2) inputs (sets of k-mers).
// Each node represents an operation like intersection, subtraction, etc. Because this class is abstract,
// calling a virtual method to get each single k-mer may be costly. To prevent this high cost, between tree nodes there
// are instances of CBundle, each of which contains a buffer of k-mers and their counters.
// // The algorithm works as follow (conceptually): // Build a tree with CBundles and CInputs, as a root take a some output writer (kmc database). // Root has its bundle and get from it k-mers, but at the beginning there is nothing in bundle. Each bundle // contains pointer to CInput below in tree. The CBundle is getting k-mers from its CInput. // This is repeated from top of tree to leafs //************************************************************************************************************ //Forward declaration template class CBundle; //************************************************************************************************************ // CInput - Base abstract class representing data source for CBundle class //************************************************************************************************************ template class CInput { public: virtual void NextBundle(CBundle& bundle) = 0; virtual void IgnoreRest() = 0; bool Finished(){ return finished; } virtual ~CInput(){} protected: bool finished = false; }; //************************************************************************************************************ // CBundleData - class containing a buffer of k-mers and its counters. 
//************************************************************************************************************ template class CBundleData { public: CBundleData() : insert_pos(0), get_pos(0), size(BUNDLE_CAPACITY) { kmers = new CKmer[size]; counters = new uint32[size]; } ~CBundleData() { delete[] kmers; delete[] counters; } CBundleData(CBundleData&& rhs): insert_pos(rhs.insert_pos), get_pos(rhs.get_pos), size(rhs.size), kmers(rhs.kmers), counters(rhs.counters) { rhs.counters = nullptr; rhs.kmers = nullptr; rhs.get_pos = rhs.size = rhs.insert_pos = 0; } CBundleData& operator=(CBundleData&& rhs) { if (this != &rhs) { delete[] kmers; delete[] counters; kmers = rhs.kmers; counters = rhs.counters; get_pos = rhs.get_pos; size = rhs.size; insert_pos = rhs.insert_pos; rhs.counters = nullptr; rhs.kmers = nullptr; rhs.get_pos = rhs.size = rhs.insert_pos = 0; } return *this; } CBundleData(const CBundleData&) = delete; CBundle& operator=(const CBundleData&) = delete; CKmer& TopKmer() const { return kmers[get_pos]; } uint32& TopCounter() const { return counters[get_pos]; } bool Full() { return insert_pos >= size; } bool Empty() { return get_pos >= insert_pos; } void Insert(CKmer& kmer, uint32 counter) { kmers[insert_pos] = kmer; counters[insert_pos++] = counter; } void Pop() { ++get_pos; } void Clear() { insert_pos = get_pos = 0; } private: friend class CBundle; uint32 insert_pos, get_pos, size; CKmer* kmers; uint32* counters; }; //************************************************************************************************************ // CBundle - connector between CBundleData and CInput //************************************************************************************************************ template class CBundle { public: CBundle(CInput* input) : input(input) { } CKmer& TopKmer() const { return data.TopKmer(); } uint32& TopCounter() const { return data.TopCounter(); } bool Full() { return data.Full(); } void Insert(CKmer& kmer, uint32 counter) { data.Insert(kmer, 
counter); } void Pop() { data.Pop(); } ~CBundle() { delete input; } bool Empty() { return data.Empty(); } CBundleData& Data() { return data; } inline bool Finished(); void IgnoreRest() { input->IgnoreRest(); } uint32 Size() { return data.insert_pos; } private: CBundleData data; CInput* input; bool finished = false; }; //************************************************************************************************************ template inline bool CBundle::Finished() { if (finished) return true; if (data.get_pos >= data.insert_pos) { if (input->Finished()) { finished = true; return true; } data.get_pos = data.insert_pos = 0; input->NextBundle(*this); if (data.insert_pos == 0)//Because maybe NextBundle did not add anything, which means there is nothing to take { finished = true; return true; } } return false; } #endif // ***** EOFKMC-2.3/kmc_tools/config.h000066400000000000000000000475551257432033000154170ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _CONFIG_H #define _CONFIG_H #include "defs.h" #include #include #include #include "kmc_header.h" #include "percent_progress.h" #include "queues.h" struct CDescBase { std::string file_src; uint32 cutoff_min = 0; //0 means it is not set yet uint32 cutoff_max = 0; //0 means it is not set yet CDescBase(const std::string& file_src) : file_src(file_src) { } CDescBase() = default; }; //************************************************************************************************************ // CInputDesc - description of a single input KMC database. 
//************************************************************************************************************
struct CInputDesc : public CDescBase
{
	uint32 threads = 0; //for kmc2 input
	CInputDesc(const std::string& file_src) : CDescBase(file_src) { }
	CInputDesc() = default;
};

//************************************************************************************************************
// COutputDesc - description of an output KMC database.
//************************************************************************************************************
struct COutputDesc : public CDescBase
{
	uint32 counter_max = 0; //0 means it is not set yet
	COutputDesc(const std::string& file_src) : CDescBase(file_src) { }
	COutputDesc() = default;
};

struct CFilteringParams
{
	enum class file_type { fasta, fastq };
	uint32 n_readers;
	uint32 n_filters;
	int fastq_buffer_size;
	int64 mem_part_pmm_fastq_reader;
	int64 mem_tot_pmm_fastq_reader;
	int64 mem_part_pmm_fastq_filter;
	int64 mem_tot_pmm_fastq_filter;
	uint32 kmer_len;
	uint32 gzip_buffer_size = 64 << 20;
	uint32 bzip2_buffer_size = 64 << 20;
	std::vector<std::string> input_srcs;
	bool use_float_value = false;
	uint32 n_min_kmers = 2;
	uint32 n_max_kmers = 1000000000;
	float f_min_kmers = 0.0f;
	float f_max_kmers = 1.0f;
	file_type input_file_type = file_type::fastq;
	file_type output_file_type = file_type::fastq;
	std::string output_src;
};

struct CDumpParams
{
	bool sorted_output = false;
};

struct CFilteringQueues
{
	CInputFilesQueue *input_files_queue;
	CPartQueue *input_part_queue, *filtered_part_queue;
	CMemoryPool *pmm_fastq_reader;
	CMemoryPool *pmm_fastq_filter;
};

//************************************************************************************************************
// CConfig - configuration of current application run. Singleton class.
//************************************************************************************************************
class CConfig
{
public:
	enum class Mode { UNDEFINED, INTERSECTION, KMERS_SUBTRACT, COUNTERS_SUBTRACT, UNION, COMPLEX, SORT, REDUCE, COMPACT, HISTOGRAM, DUMP, COMPARE, FILTER };
	uint32 avaiable_threads;
	uint32 kmer_len = 0;
	Mode mode = Mode::UNDEFINED;
	bool verbose = false;
	std::vector<CInputDesc> input_desc;
	std::vector<CKMC_header> headers;
	COutputDesc output_desc;

	CFilteringParams filtering_params; //for filter operation only
	CDumpParams dump_params; //for dump operation only

	CPercentProgress percent_progress;

	static CConfig& GetInstance()
	{
		static CConfig config;
		return config;
	}
	CConfig(const CConfig&) = delete;
	CConfig& operator=(const CConfig&) = delete;

	bool Is2ArgOper()
	{
		return mode == Mode::UNION || mode == Mode::KMERS_SUBTRACT || mode == Mode::COUNTERS_SUBTRACT || mode == Mode::INTERSECTION || mode == Mode::COMPARE;
	}
	bool IsComplex()
	{
		return mode == Mode::COMPLEX;
	}
	bool Is1ArgOper()
	{
		return mode == Mode::SORT || mode == Mode::REDUCE || mode == Mode::COMPACT || mode == Mode::HISTOGRAM || mode == Mode::DUMP;
	}

	std::string GetOperationName()
	{
		switch (mode)
		{
		case CConfig::Mode::UNDEFINED:
			return "";
		case CConfig::Mode::INTERSECTION:
			return "intersect";
		case CConfig::Mode::KMERS_SUBTRACT:
			return "kmers_subtract";
		case CConfig::Mode::COUNTERS_SUBTRACT:
			return "counters_subtract";
		case CConfig::Mode::UNION:
			return "union";
		case CConfig::Mode::COMPLEX:
			return "complex";
		case CConfig::Mode::SORT:
			return "sort";
		case CConfig::Mode::REDUCE:
			return "reduce";
		case CConfig::Mode::COMPACT:
			return "compact";
		case CConfig::Mode::HISTOGRAM:
			return "histogram";
		case CConfig::Mode::DUMP:
			return "dump";
		case CConfig::Mode::COMPARE:
			return "compare";
		case CConfig::Mode::FILTER:
			return "filter";
		default:
			return "";
		}
	}
private:
	CConfig() = default;
};

class CUsageDisplayer
{
protected:
	std::string name;
	bool is2ArgOper = false;
	bool is1ArgOper = false;
	CUsageDisplayer(const std::string&
name) :name(name){}
	void Display2ArgGeneral() const
	{
		std::cout << "The '" << name << "' is a two-argument operation. General syntax:\n";
		std::cout << " kmc_tools " << name << " <input1 [input1_params]> <input2 [input2_params]> <output [output_params]>\n";
		std::cout << " input1, input2 - paths to databases generated by KMC\n";
		std::cout << " output         - path to output database\n";
		std::cout << " For each input there are additional parameters:\n";
		std::cout << "  -ci<value> - exclude k-mers occurring less than <value> times\n";
		std::cout << "  -cx<value> - exclude k-mers occurring more than <value> times\n";
		std::cout << " For output there are additional parameters:\n";
		std::cout << "  -ci<value> - exclude k-mers occurring less than <value> times\n";
		std::cout << "  -cx<value> - exclude k-mers occurring more than <value> times\n";
		std::cout << "  -cs<value> - maximal value of a counter\n";
	}
	void Display1ArgGeneral(bool output_params) const
	{
		std::cout << " The '" << name << "' is a one-argument operation. General syntax:\n";
		std::cout << " kmc_tools " << name << " <input> [input_params] <output> " << (output_params ? "[output_params]" : "") << "\n";
		std::cout << " input - path to database generated by KMC\n";
		std::cout << " For input there are additional parameters:\n";
		std::cout << "  -ci<value> - exclude k-mers occurring less than <value> times\n";
		std::cout << "  -cx<value> - exclude k-mers occurring more than <value> times\n";
	}
public:
	virtual void Display() const = 0;
	virtual ~CUsageDisplayer() {}
};

class CGeneralUsageDisplayer : public CUsageDisplayer
{
public:
	CGeneralUsageDisplayer() :CUsageDisplayer("") {}
	void Display() const override
	{
		std::cout << "kmc_tools ver. " << KMC_VER << " (" << KMC_DATE << ")\n";
		std::cout << "Usage:\n kmc_tools [global parameters] <operation> [operation parameters]\n";
		std::cout << "Available operations:\n";
		std::cout << " k-mers sets' operations for 2 KMC's databases:\n";
		std::cout << "  intersect         - intersection of 2 k-mers' sets\n";
		std::cout << "  kmers_subtract    - subtraction of 2 k-mers' sets\n";
		std::cout << "  counters_subtract - counters' subtraction of 2 k-mers' sets\n";
		std::cout << "  union             - union of 2 k-mers' sets\n\n";
		std::cout << " operations for single kmc database:\n";
		std::cout << "  sort      - sorts k-mers from database generated by KMC2.x\n";
		std::cout << "  reduce    - exclude too rare and too frequent k-mers\n";
		std::cout << "  compact   - remove counters (store only k-mers)\n";
		std::cout << "  histogram - histogram of k-mers occurrences\n";
		std::cout << "  dump      - dump k-mers and counters to text file\n";
		std::cout << " more complex operations:\n";
		std::cout << "  complex   - complex operations with a number of input databases\n";
		std::cout << " other operations:\n";
		std::cout << "  filter    - filter out reads with too small number of k-mers\n";
		std::cout << " global parameters:\n";
		std::cout << "  -t<value> - total number of threads (default: no. of CPU cores)\n";
		std::cout << "  -v        - enable verbose mode (shows some information) (default: false)\n";
		std::cout << "  -hp       - hide percentage progress (default: false)\n";
		std::cout << "Example:\n";
		std::cout << "kmc_tools union db1 -ci3 db2 -ci5 -cx300 db1_union_db2 -ci10\n";
		std::cout << "For detailed help on a specific operation, type the operation name without parameters:\n";
		std::cout << "kmc_tools union\n";
	}
};

class CUnionUsageDisplayer : public CUsageDisplayer
{
public:
	CUnionUsageDisplayer() :CUsageDisplayer("union") {}
	void Display() const override
	{
		Display2ArgGeneral();
		std::cout << "The output database will contain each k-mer present in at least one of the input sets. For k-mers present in both inputs the counter in output is equal to the sum of the input counters.\n";
		std::cout << "Example:\n";
		std::cout << "kmc -k28 file1.fastq kmers1 tmp\n";
		std::cout << "kmc -k28 file2.fastq kmers2 tmp\n";
		std::cout << "kmc_tools union kmers1 -ci3 -cx70000 kmers2 kmers1_kmers2_union -cs65536\n";
	}
};

class CIntersectUsageDisplayer : public CUsageDisplayer
{
public:
	CIntersectUsageDisplayer() :CUsageDisplayer("intersect") {}
	void Display() const override
	{
		Display2ArgGeneral();
		std::cout << "The output database will contain only k-mers that are present in both input sets. The counter value in the output database is equal to the lower of the two input counter values.\n";
		std::cout << "Example:\n";
		std::cout << "kmc -k28 file1.fastq kmers1 tmp\n";
		std::cout << "kmc -k28 file2.fastq kmers2 tmp\n";
		std::cout << "kmc_tools intersect kmers1 -ci10 -cx200 kmers2 -ci4 -cx100 kmers1_kmers2_intersect -ci20 -cx150\n";
	}
};

class CCountersSubtractUsageDisplayer : public CUsageDisplayer
{
public:
	CCountersSubtractUsageDisplayer() :CUsageDisplayer("counters_subtract") {}
	void Display() const override
	{
		Display2ArgGeneral();
		std::cout << "The output database will contain only k-mers that are present in the first input set and have counters higher than the corresponding k-mers in the second set. For each k-mer the counter is equal to the difference between the counter in the first set and the counter in the second set.\n";
		std::cout << "Example:\n";
		std::cout << "kmc -k28 file1.fastq kmers1 tmp\n";
		std::cout << "kmc -k28 file2.fastq kmers2 tmp\n";
		std::cout << "kmc_tools counters_subtract kmers1 kmers2 kmers1_kmers2_counters_subtract\n";
	}
};

class CKmersSubtractUsageDisplayer : public CUsageDisplayer
{
public:
	CKmersSubtractUsageDisplayer() :CUsageDisplayer("kmers_subtract") {}
	void Display() const override
	{
		Display2ArgGeneral();
		std::cout << "The output database will contain only k-mers that are present in the first input set but absent in the second one. The counter value is equal to the value from the first input set.\n";
		std::cout << "Example:\n";
		std::cout << "kmc -k28 file1.fastq kmers1 tmp\n";
		std::cout << "kmc -k28 file2.fastq kmers2 tmp\n";
		std::cout << "kmc_tools kmers_subtract kmers1 kmers2 kmers1_kmers2_subtract -cs200\n";
	}
};

class CComplexUsageDisplayer : public CUsageDisplayer
{
public:
	CComplexUsageDisplayer() :CUsageDisplayer("complex") {}
	void Display() const override
	{
		std::cout << "The complex operation allows defining operations for more than 2 input k-mers sets. Command-line syntax:\n";
		std::cout << "kmc_tools complex <operations_definition_file>\n";
		std::cout << " operations_definition_file - path to a file which defines input sets and operations. It is a text file with the following syntax:\n";
		std::cout << " __________________________________________________________________ \n";
		std::cout << "|INPUT:                                                            |\n";
		std::cout << "|<input1>=<input1_db_path> [params]                                |\n";
		std::cout << "|<input2>=<input2_db_path> [params]                                |\n";
		std::cout << "|...                                                               |\n";
		std::cout << "|<inputN>=<inputN_db_path> [params]                                |\n";
		std::cout << "|OUTPUT:                                                           |\n";
		std::cout << "|<out_db_path>=<ref_input><oper><ref_input>[<oper><ref_input>[...] |\n";
		std::cout << "|[OUTPUT_PARAMS:                                                 __|\n";
		std::cout << "|<output_params>]                                               |  /\n";
		std::cout << "|                                                               | / \n";
		std::cout << "|_______________________________________________________________|/  \n";
		std::cout << "input1, input2, ..., inputN - names of inputs used to define the equation\n";
		std::cout << "input1_db_path, input2_db_path, ..., inputN_db_path - paths to k-mers sets\n";
		std::cout << "For each input there are additional parameters which can be set:\n";
		std::cout << "  -ci<value> - exclude k-mers occurring less than <value> times\n";
		std::cout << "  -cx<value> - exclude k-mers occurring more than <value> times\n";
		std::cout << "out_db_path - path to output database\n";
		std::cout << "ref_input - one of input1, input2, ..., inputN\n";
		std::cout << "oper - one of {*,-,~,+}, which refers to {intersect, kmers_subtract, counters_subtract, union}\n";
		std::cout << "Operator * has the highest priority. The other operators have equal priority. The order of operations can be changed with parentheses\n";
		std::cout << "output_params are:\n";
		std::cout << "  -ci<value> - exclude k-mers occurring less than <value> times\n";
		std::cout << "  -cx<value> - exclude k-mers occurring more than <value> times\n";
		std::cout << "  -cs<value> - maximal value of a counter\n";
		std::cout << "Example:\n";
		std::cout << " __________________________________________________________________ \n";
		std::cout << "|INPUT:                                                            |\n";
		std::cout << "|set1 = kmc_o1 -ci5                                                |\n";
		std::cout << "|set2 = kmc_o2                                                     |\n";
		std::cout << "|set3 = kmc_o3 -ci10 -cx100                                      __|\n";
		std::cout << "|OUTPUT:                                                        |  /\n";
		std::cout << "|result = (set3+set1)*set2                                      | / \n";
		std::cout << "|_______________________________________________________________|/  \n";
	}
};

class CSortUsageDisplayer : public CUsageDisplayer
{
public:
	CSortUsageDisplayer() :CUsageDisplayer("sort") {}
	void Display() const override
	{
		Display1ArgGeneral(true);
		std::cout << " For output there are additional parameters:\n";
		std::cout << "  -cs<value> - maximal value of a counter\n";
		std::cout << "Converts database produced by KMC2.x to KMC1.x database format (which contains k-mers in sorted order)\n";
		std::cout << "Example:\n";
		std::cout << "kmc_tools sort wy_kmc2 -ci3 -cx1000 wy_kmc1 -cs255\n";
	}
};

class CReduceUsageDisplayer : public CUsageDisplayer
{
public:
	CReduceUsageDisplayer() :CUsageDisplayer("reduce") {}
	void Display() const override
	{
		Display1ArgGeneral(true);
		std::cout << " For output there are additional parameters:\n";
		std::cout << "  -cs<value> - maximal value of a counter\n";
		std::cout << "Exclude too rare and too frequent k-mers\n";
		std::cout << "Example:\n";
		std::cout << "kmc_tools reduce wy_kmc2 -ci3 -cx1000 wy_kmc1 -cs255\n";
	}
};

class CCompactUsageDisplayer : public CUsageDisplayer
{
public:
	CCompactUsageDisplayer() :CUsageDisplayer("compact") {}
	void Display() const override
	{
		Display1ArgGeneral(false);
		std::cout << "Remove counters of k-mers\n";
		std::cout << "Example:\n";
		std::cout << "kmc_tools compact wy_kmc2 -ci3 -cx1000 wy_kmc1\n";
	}
};
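/*
Illustrative sketch only, not part of kmc_tools. The operator rules printed by
CComplexUsageDisplayer::Display() (operator * binds tightest, the operators + - ~ share one priority,
parentheses override both) can be tried on toy data as below. KmerSet, Apply, Parser and Evaluate are
hypothetical names introduced for this example. Two points are assumptions rather than facts taken
from the sources shown here: operators of equal priority are applied left to right, and ~ keeps the
full counter of a k-mer that is absent from the second set.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Toy stand-in for a KMC database: k-mer text -> counter.
using KmerSet = std::map<std::string, unsigned>;

// Combines two sets with one of the four operators from the help text.
KmerSet Apply(char op, const KmerSet& a, const KmerSet& b)
{
	KmerSet out;
	for (const auto& kv : a)
	{
		auto it = b.find(kv.first);
		const bool in_b = (it != b.end());
		if (op == '*' && in_b) // intersect: lower of the two counters
			out[kv.first] = std::min(kv.second, it->second);
		else if (op == '-' && !in_b) // kmers_subtract: only k-mers absent from b
			out[kv.first] = kv.second;
		else if (op == '~' && (!in_b || kv.second > it->second)) // counters_subtract
			out[kv.first] = kv.second - (in_b ? it->second : 0);
		else if (op == '+') // union: sum of counters
			out[kv.first] = kv.second + (in_b ? it->second : 0);
	}
	if (op == '+')
		for (const auto& kv : b)
			if (!a.count(kv.first))
				out[kv.first] = kv.second;
	return out;
}

// Recursive-descent parser: Expr handles + - ~, Term handles *, Primary handles names and parentheses.
class Parser
{
	const std::string& s;
	const std::map<std::string, KmerSet>& inputs;
	std::size_t pos = 0;

	char Peek()
	{
		while (pos < s.size() && s[pos] == ' ')
			++pos;
		return pos < s.size() ? s[pos] : '\0';
	}
	KmerSet Primary()
	{
		if (Peek() == '(')
		{
			++pos;
			KmerSet inner = Expr();
			if (Peek() != ')')
				throw std::runtime_error("missing ')'");
			++pos;
			return inner;
		}
		std::string name;
		while (pos < s.size() && (std::isalnum(static_cast<unsigned char>(s[pos])) || s[pos] == '_'))
			name += s[pos++];
		return inputs.at(name); // throws std::out_of_range for an unknown input name
	}
	KmerSet Term()
	{
		KmerSet left = Primary();
		while (Peek() == '*')
		{
			++pos;
			left = Apply('*', left, Primary());
		}
		return left;
	}
public:
	Parser(const std::string& s, const std::map<std::string, KmerSet>& inputs) : s(s), inputs(inputs) {}
	KmerSet Expr()
	{
		KmerSet left = Term();
		for (char op = Peek(); op == '+' || op == '-' || op == '~'; op = Peek())
		{
			++pos;
			left = Apply(op, left, Term());
		}
		return left;
	}
};

KmerSet Evaluate(const std::string& expr, const std::map<std::string, KmerSet>& inputs)
{
	Parser parser(expr, inputs);
	return parser.Expr();
}
```

With set1 = {AAA:5, CCC:2}, set2 = {AAA:3, GGG:1} and set3 = {AAA:1, GGG:4}, the expression from the
usage example, (set3+set1)*set2, evaluates to {AAA:3, GGG:1}, while set3+set1*set2 evaluates to
{AAA:4, GGG:4} because * is applied first.
*/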

class CHistogramUsageDisplayer : public CUsageDisplayer
{
public:
	CHistogramUsageDisplayer() :CUsageDisplayer("histogram") {}
	void Display() const override
	{
		Display1ArgGeneral(false);
		std::cout << "Produce histogram of k-mers occurrences\n";
		std::cout << "Example:\n";
		std::cout << "kmc_tools histogram wy_kmc2 -ci3 -cx1000 histo.txt\n";
	}
};

class CDumpUsageDisplayer : public CUsageDisplayer
{
public:
	CDumpUsageDisplayer() :CUsageDisplayer("dump") {}
	void Display() const override
	{
		std::cout << " The '" << name << "' is a one-argument operation. General syntax:\n";
		std::cout << " kmc_tools " << name << " [dump_params] <input> [input_params] <output>\n";
		std::cout << " dump_params:\n";
		std::cout << "  -s - sorted output\n";
		std::cout << " input - path to database generated by KMC\n";
		std::cout << " For input there are additional parameters:\n";
		std::cout << "  -ci<value> - exclude k-mers occurring less than <value> times\n";
		std::cout << "  -cx<value> - exclude k-mers occurring more than <value> times\n";
		std::cout << "Produce text dump of kmc database\n";
		std::cout << "Example:\n";
		std::cout << "kmc_tools dump wy_kmc2 -ci3 -cx1000 dump.txt\n";
	}
};

class CFilterUsageDisplayer : public CUsageDisplayer
{
public:
	CFilterUsageDisplayer() : CUsageDisplayer("filter") {}
	void Display() const override
	{
		std::cout << " The '" << name << "' is a two-argument operation. General syntax:\n";
		std::cout << " kmc_tools " << name << " <kmc_input_db> [kmc_input_db_params] <input_read_set> [input_read_set_params] <output_read_set> [output_read_set_params]\n";
		std::cout << " kmc_input_db    - path to database generated by KMC\n";
		std::cout << " input_read_set  - path to input set of reads\n";
		std::cout << " output_read_set - path to output set of reads\n";
		std::cout << " For k-mers' database there are additional parameters:\n";
		std::cout << "  -ci<value> - exclude k-mers occurring less than <value> times\n";
		std::cout << "  -cx<value> - exclude k-mers occurring more than <value> times\n";
		std::cout << " For input set of reads there are additional parameters:\n";
		std::cout << "  -ci<value> - remove reads containing fewer k-mers than value. It can be an integer or a floating-point number in the range [0.0;1.0]\n";
		std::cout << "  -cx<value> - remove reads containing more k-mers than value. It can be an integer or a floating-point number in the range [0.0;1.0]\n";
		std::cout << "  -f<a/q> - input in FASTA format (-fa), FASTQ format (-fq); default: FASTQ\n";
		std::cout << " For output set of reads there are additional parameters:\n";
		std::cout << "  -f<a/q> - output in FASTA format (-fa), FASTQ format (-fq); default: same as input\n";
		std::cout << "Example:\n";
		std::cout << "kmc_tools filter kmc_db -ci3 input.fastq -ci0.5 -cx1.0 filtered.fastq\n";
		std::cout << "kmc_tools filter kmc_db input.fastq -ci10 -cx100 filtered.fastq\n";
	}
};

class CUsageDisplayerFactory
{
	std::unique_ptr<CUsageDisplayer> desc;
public:
	CUsageDisplayerFactory(CConfig::Mode mode)
	{
		switch (mode)
		{
		case CConfig::Mode::UNDEFINED:
			desc = std::make_unique<CGeneralUsageDisplayer>();
			break;
		case CConfig::Mode::INTERSECTION:
			desc = std::make_unique<CIntersectUsageDisplayer>();
			break;
		case CConfig::Mode::KMERS_SUBTRACT:
			desc = std::make_unique<CKmersSubtractUsageDisplayer>();
			break;
		case CConfig::Mode::COUNTERS_SUBTRACT:
			desc = std::make_unique<CCountersSubtractUsageDisplayer>();
			break;
		case CConfig::Mode::UNION:
			desc = std::make_unique<CUnionUsageDisplayer>();
			break;
		case CConfig::Mode::COMPLEX:
			desc = std::make_unique<CComplexUsageDisplayer>();
			break;
		case CConfig::Mode::SORT:
			desc = std::make_unique<CSortUsageDisplayer>();
			break;
		case CConfig::Mode::REDUCE:
			desc = std::make_unique<CReduceUsageDisplayer>();
			break;
		case
CConfig::Mode::COMPACT:
			desc = std::make_unique<CCompactUsageDisplayer>();
			break;
		case CConfig::Mode::HISTOGRAM:
			desc = std::make_unique<CHistogramUsageDisplayer>();
			break;
		case CConfig::Mode::DUMP:
			desc = std::make_unique<CDumpUsageDisplayer>();
			break;
		case CConfig::Mode::COMPARE:
			desc = std::make_unique<CGeneralUsageDisplayer>();
			break;
		case CConfig::Mode::FILTER:
			desc = std::make_unique<CFilterUsageDisplayer>();
			break;
		default:
			desc = std::make_unique<CGeneralUsageDisplayer>();
			break;
		}
	}
	const CUsageDisplayer& GetUsageDisplayer()
	{
		return *desc;
	}
};

#endif

// ***** EOF
KMC-2.3/kmc_tools/defs.h
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  Authors: Marek Kokot

  Version: 2.3.0
  Date   : 2015-08-21
*/

#ifndef _DEFS_H
#define _DEFS_H

#include
#include

using uint32 = unsigned int;
using uint64 = unsigned long long;
using int32 = int;
using int64 = long long;
using uchar = unsigned char;

#define MIN(x,y) ((x) < (y) ? (x) : (y))
#define MAX(x,y) ((x) > (y) ? (x) : (y))
#define NORM(x, lower, upper) ((x) < (lower) ? (lower) : (x) > (upper) ? (upper) : (x))

#define BYTE_LOG(x) (((x) < (1 << 8)) ? 1 : ((x) < (1 << 16)) ? 2 : ((x) < (1 << 24)) ? 3 : 4)

//#define DISABLE_ASMLIB

//#define ENABLE_DEBUG
//#define ENABLE_LOGGER

#define KMC_VER "2.3.0"
#define KMC_DATE "2015-08-21"

#define DEFAULT_CIRCULAL_QUEUE_CAPACITY (4)

#define SUFIX_WRITE_QUEUE_CAPACITY (10)

#define KMC1_DB_READER_PREFIX_BUFF_BYTES (1 << 24)
#define KMC1_DB_READER_SUFIX_BUFF_BYTES (1 << 24)

#define KMC2_DB_READER_PREFIX_BUFF_BYTES (1 << 24)
#define KMC2_DB_READER_SUFIX_BUFF_BYTES (1 << 24)

#define KMC1_DB_WRITER_PREFIX_BUFF_BYTES (1 << 24)
#define KMC1_DB_WRITER_SUFIX_BUFF_BYTES (1 << 24)

#define HISTOGRAM_MAX_COUNTER_DEFAULT 10000

#define DUMP_BUF_SIZE (1 << 24)

//Increasing this value leads to higher memory consumption, but preliminary observations show no impact on running time, so changing it is not recommended
#define BUNDLE_CAPACITY (1 << 12) //in kmers, for kmers and counters.

//this value has a high impact on memory usage, the maximal memory is = 2 * SINGLE_BIN_BUFF_SIZE_FOR_DB2_READER * number_of_kmc2_input_dbs * number_of_bins_per_in_db
//increasing this value can have a positive performance impact when running on HDD
#define SINGLE_BIN_BUFF_SIZE_FOR_DB2_READER (1 << 21) //if less is needed less will be allocated

//default values
#define CUTOFF_MIN 2
#define CUTOFF_MAX 1000000000
#define COUNTER_MAX 255

#define MAX_K 256
#define KMER_WORDS ((MAX_K + 31) / 32)

#define USE_META_PROG

#ifdef WIN32
#define my_fopen fopen
#define my_fseek _fseeki64
#define my_ftell _ftelli64
#else
#define my_fopen fopen
#define my_fseek fseek
#define my_ftell ftell
#endif

#endif

// ***** EOF
KMC-2.3/kmc_tools/dump_writer.h
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  Authors: Marek Kokot

  Version: 2.3.0
  Date   : 2015-08-21
*/

#ifndef _DUMP_WRITER_H
#define _DUMP_WRITER_H
#include "defs.h"
#include "kmer.h"
#include "nc_utils.h"
#include "config.h"
#include

//wrapper to simplify interface
//For kmc1 input and kmc2 input without -s parameter
template<typename KMCDB, unsigned SIZE, bool SORTED> class CKMCDBForDump
{
	KMCDB kmcdb;
public:
	CKMCDBForDump() :
		kmcdb(CConfig::GetInstance().headers.front(), CConfig::GetInstance().input_desc.front(), CConfig::GetInstance().percent_progress, KMCDBOpenMode::sequential){}
	bool NextKmer(CKmer<SIZE>& kmer, uint32& counter)
	{
		return kmcdb.NextKmerSequential(kmer, counter);
	}
};

//specialization for -s parameter and kmc2 input
template<unsigned SIZE> class CKMCDBForDump<CKMC2DbReader<SIZE>, SIZE, true>
{
	CKMC2DbReader<SIZE>* kmcdb; //owned by bundle, which deletes it in its destructor
	CBundle<SIZE> bundle;
public:
	CKMCDBForDump() :
		kmcdb(new CKMC2DbReader<SIZE>(CConfig::GetInstance().headers.front(), CConfig::GetInstance().input_desc.front(), CConfig::GetInstance().percent_progress, KMCDBOpenMode::sorted)),
		bundle(kmcdb){}
	bool NextKmer(CKmer<SIZE>& kmer, uint32& counter)
	{
		if(!bundle.Finished())
		{
			kmer = bundle.TopKmer();
			counter = bundle.TopCounter();
			bundle.Pop();
			return true;
		}
		return false;
	}
};

template<typename KMCDB, unsigned SIZE> class CDumpWriter
{
	static const uint32 OVERHEAD_SIZE = 1000;

	KMCDB& kmcdb;
	COutputDesc output_desc;
	uint32 kmer_len;
	uint32 kmer_bytes;
	CConfig& config;
	uint32 in_first_byte;
	char* buf;
	uint32 buf_size;
	uint32 buf_pos;

	//lookup table: for each byte value, the 4 symbols it encodes (2 bits per symbol)
	struct DumpOpt
	{
		char* opt_ACGT;
		DumpOpt()
		{
			opt_ACGT = new char[1024];
			char codes[] = { 'A', 'C', 'G', 'T' };
			uint32 pos = 0;
			for (uint32 kmer = 0; kmer < 256; ++kmer)
			{
				opt_ACGT[pos++] = codes[(kmer >> 6) & 3];
				opt_ACGT[pos++] = codes[(kmer >> 4) & 3];
				opt_ACGT[pos++] = codes[(kmer >> 2) & 3];
				opt_ACGT[pos++] = codes[kmer & 3];
			}
		}
		~DumpOpt()
		{
			delete[]opt_ACGT;
		}
	}opt;

	void kmerToStr(CKmer<SIZE>& kmer, char* kmer_str)
	{
		//first byte
		char* base = opt.opt_ACGT + 4 * kmer.get_byte(kmer_bytes - 1) + 4 - in_first_byte;
		for(uint32 i = 0 ; i < in_first_byte ; ++i)
			*kmer_str++ = *base++;
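/*
Illustrative sketch only, not part of KMC. The table-driven decoding performed by kmerToStr can be
tried in isolation as below. DecodeKmer is a hypothetical helper, and its std::vector of bytes
(least significant byte first) is an assumed stand-in for CKmer::get_byte; the 2-bit codes
A=0, C=1, G=2, T=3 follow the codes[] array of DumpOpt. The sketch assumes kmer_len >= 1.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decodes a k-mer packed 2 bits per symbol. bytes[0] is the least significant byte;
// the most significant byte holds kmer_len % 4 symbols (or 4 when that remainder is 0).
std::string DecodeKmer(const std::vector<std::uint8_t>& bytes, std::uint32_t kmer_len)
{
	static const char codes[] = { 'A', 'C', 'G', 'T' };

	// Same layout as DumpOpt::opt_ACGT: 4 characters for each of the 256 byte values.
	std::string table(1024, ' ');
	for (unsigned b = 0; b < 256; ++b)
	{
		table[4 * b + 0] = codes[(b >> 6) & 3];
		table[4 * b + 1] = codes[(b >> 4) & 3];
		table[4 * b + 2] = codes[(b >> 2) & 3];
		table[4 * b + 3] = codes[b & 3];
	}

	const std::uint32_t kmer_bytes = (kmer_len + 3) / 4;
	std::uint32_t in_first_byte = kmer_len % 4;
	if (in_first_byte == 0)
		in_first_byte = 4;

	std::string out;
	// first (most significant) byte: only its last in_first_byte symbols are used
	out.append(table, 4 * bytes[kmer_bytes - 1] + 4 - in_first_byte, in_first_byte);
	// remaining bytes: 4 symbols each, from more significant to less significant
	for (int pos = static_cast<int>(kmer_bytes) - 2; pos >= 0; --pos)
		out.append(table, 4 * bytes[pos], 4);
	return out;
}
```

For example, DecodeKmer({0x6C, 0x00}, 5) yields "ACGTA": the top byte contributes a single 'A' and
the byte 0x6C (binary 01 10 11 00) contributes "CGTA".
*/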
		//rest
		for (int pos = kmer_bytes - 2; pos >= 0; --pos)
		{
			base = opt.opt_ACGT + 4 * kmer.get_byte(pos);
			*kmer_str++ = *base++;
			*kmer_str++ = *base++;
			*kmer_str++ = *base++;
			*kmer_str++ = *base++;
		}
	}
public:
	CDumpWriter(KMCDB& kmcdb) :kmcdb(kmcdb), output_desc(CConfig::GetInstance().output_desc), config(CConfig::GetInstance())
	{
		kmer_len = config.headers.front().kmer_len;
		kmer_bytes = (kmer_len + 3) / 4;
		in_first_byte = kmer_len % 4;
		if (in_first_byte == 0)
			in_first_byte = 4;
	}

	bool Process()
	{
		CKmer<SIZE> kmer;
		uint32 counter;
		uint32 counter_len;
		FILE* file = fopen(output_desc.file_src.c_str(), "wb");
		if (!file)
		{
			std::cout << "Error: cannot open file: " << output_desc.file_src << "\n";
			exit(1);
		}
		buf_pos = 0;
		buf_size = DUMP_BUF_SIZE;
		buf = new char[buf_size];
		//while (kmcdb.NextKmerSequential(kmer, counter))
		while (kmcdb.NextKmer(kmer, counter))
		{
			if (counter >= output_desc.cutoff_min && counter <= output_desc.cutoff_max)
			{
				kmerToStr(kmer, buf + buf_pos);
				buf[buf_pos + kmer_len] = '\t';
				counter_len = CNumericConversions::Int2PChar(counter, (uchar*)(buf + buf_pos + kmer_len + 1));
				buf[buf_pos + kmer_len + 1 + counter_len] = '\n';
				buf_pos += kmer_len + 2 + counter_len;
				if (buf_pos + OVERHEAD_SIZE > buf_size)
				{
					fwrite(buf, 1, buf_pos, file);
					buf_pos = 0;
				}
			}
		}
		//save rest if necessary
		if (buf_pos)
		{
			fwrite(buf, 1, buf_pos, file);
			buf_pos = 0;
		}
		fclose(file);
		delete[] buf;
		return true;
	}
};

#endif
KMC-2.3/kmc_tools/expression_node.h
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  Authors: Marek Kokot

  Version: 2.3.0
  Date   : 2015-08-21
*/

#ifndef _EXPRESSION_NODE_H
#define _EXPRESSION_NODE_H
#include "defs.h"
#include "operations.h"
#include
#include
#include
#include
#include "kmc1_db_reader.h"
#include "kmc2_db_reader.h"

//************************************************************************************************************
// CExpressionNode - Base abstract class representing an expression node. In the first stage of the algorithm
// a binary tree is built from the user input. The node type represents the operation. This tree is used only
// to generate another tree (see CInput and CBundle)
//************************************************************************************************************
template<unsigned SIZE> class CExpressionNode
{
public:
	CExpressionNode() :left(nullptr), right(nullptr)
	{
	}
	CExpressionNode* GetLeftChild() const
	{
		return left;
	}
	CExpressionNode* GetRightChild() const
	{
		return right;
	}

	virtual CBundle<SIZE>* GetExecutionRoot() = 0;

	void AddLeftChild(CExpressionNode* child)
	{
#ifdef ENABLE_DEBUG
		if (left)
		{
			std::cout << "This child node already exists\n";
			exit(1);
		}
#endif
		left = child;
	}
	void AddRightChild(CExpressionNode* child)
	{
#ifdef ENABLE_DEBUG
		if (right)
		{
			std::cout << "This child node already exists\n";
			exit(1);
		}
#endif
		right = child;
	}

#ifdef ENABLE_DEBUG
	virtual void Info() = 0;
	void Display(int adient = 0)
	{
		if (right)
			right->Display(adient + 5);

		for (int i = 0; i < adient; ++i)
			std::cout << " ";
		Info();
		std::cout << "\n";
		if (left)
			left->Display(adient + 5);
	}
#endif

	virtual ~CExpressionNode()
	{
		delete left;
		delete right;
	}

protected:
	CExpressionNode* left, *right;
};

//************************************************************************************************************
// CUnionNode - represents node for union operation
//************************************************************************************************************
template<unsigned SIZE> class
CUnionNode : public CExpressionNode<SIZE>
{
public:
	CBundle<SIZE>* GetExecutionRoot() override
	{
		return new CBundle<SIZE>(new CUnion<SIZE>(this->left->GetExecutionRoot(), this->right->GetExecutionRoot()));
	}
#ifdef ENABLE_DEBUG
	void Info() override
	{
		std::cout << "+";
	}
#endif
};

//************************************************************************************************************
// CKmersSubtractionNode - represents node for subtraction of k-mers (if k-mer exists in both input,
// it is absent in result) operation
//************************************************************************************************************
template<unsigned SIZE> class CKmersSubtractionNode : public CExpressionNode<SIZE>
{
public:
	CBundle<SIZE>* GetExecutionRoot() override
	{
		return new CBundle<SIZE>(new CKmersSubtract<SIZE>(this->left->GetExecutionRoot(), this->right->GetExecutionRoot()));
	}
#ifdef ENABLE_DEBUG
	void Info() override
	{
		std::cout << "-";
	}
#endif
};

template<unsigned SIZE> class CCountersSubtractionNode : public CExpressionNode<SIZE>
{
public:
	CBundle<SIZE>* GetExecutionRoot() override
	{
		return new CBundle<SIZE>(new CCountersSubtract<SIZE>(this->left->GetExecutionRoot(), this->right->GetExecutionRoot()));
	}
#ifdef ENABLE_DEBUG
	void Info() override
	{
		std::cout << "~";
	}
#endif
};

//************************************************************************************************************
// CIntersectionNode - represents node for intersection operation
//************************************************************************************************************
template<unsigned SIZE> class CIntersectionNode : public CExpressionNode<SIZE>
{
public:
	CBundle<SIZE>* GetExecutionRoot() override
	{
		return new CBundle<SIZE>(new CIntersection<SIZE>(this->left->GetExecutionRoot(), this->right->GetExecutionRoot()));
	}
#ifdef ENABLE_DEBUG
	void Info() override
	{
		std::cout << "*";
	}
#endif
};

//************************************************************************************************************
// CInputNode - represents node (leaf) - KMC1 or KMC2 database
//************************************************************************************************************
template<unsigned SIZE> class CInputNode : public CExpressionNode<SIZE>
{
	uint32 desc_pos;
public:
	CInputNode(uint32 desc_pos) : desc_pos(desc_pos)
	{
	}
	CBundle<SIZE>* GetExecutionRoot() override
	{
		CConfig& config = CConfig::GetInstance();
		CInput<SIZE>* db = nullptr;
		if (!config.headers[desc_pos].IsKMC2())
			db = new CKMC1DbReader<SIZE>(config.headers[desc_pos], config.input_desc[desc_pos], CConfig::GetInstance().percent_progress, KMCDBOpenMode::sorted);
		else
			db = new CKMC2DbReader<SIZE>(config.headers[desc_pos], config.input_desc[desc_pos], CConfig::GetInstance().percent_progress, KMCDBOpenMode::sorted);
		return new CBundle<SIZE>(db);
	}

#ifdef ENABLE_DEBUG
	void Info() override
	{
		std::cout << "In: " << CConfig::GetInstance().input_desc[desc_pos].file_src;
	}
#endif
};

#endif

// ***** EOF
KMC-2.3/kmc_tools/fastq_filter.cpp
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "stdafx.h" #include "fastq_filter.h" #include "asmlib_wrapper.h" #include using namespace std; /*****************************************************************************************************************************/ /******************************************************** CONSTRUCTOR ********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ CFastqFilter::CFastqFilter(CFilteringParams& Params, CFilteringQueues& Queues, CKMCFile& kmc_api) : kmc_api(kmc_api) { input_part_queue = Queues.input_part_queue; filtered_part_queue = Queues.filtered_part_queue; pmm_fastq_reader = Queues.pmm_fastq_reader; pmm_fastq_filter = Queues.pmm_fastq_filter; input_file_type = Params.input_file_type; output_file_type = Params.output_file_type; use_float_value = Params.use_float_value; f_max_kmers = Params.f_max_kmers; f_min_kmers = Params.f_min_kmers; n_max_kmers = Params.n_max_kmers; n_min_kmers = Params.n_min_kmers; kmer_len = Params.kmer_len; output_part_size = Params.mem_part_pmm_fastq_reader; } /*****************************************************************************************************************************/ CWFastqFilter::CWFastqFilter(CFilteringParams& Params, CFilteringQueues& Queues, CKMCFile& kmc_api) { ff = make_unique(Params, Queues, kmc_api); } /*****************************************************************************************************************************/ /********************************************************** PUBLIC ***********************************************************/ 
/*****************************************************************************************************************************/ /*****************************************************************************************************************************/ void CFastqFilter::Process() { if (input_file_type == CFilteringParams::file_type::fastq && output_file_type == CFilteringParams::file_type::fastq) ProcessFastqToFastq(); else if (input_file_type == CFilteringParams::file_type::fastq && output_file_type == CFilteringParams::file_type::fasta) ProcessFastqToFasta(); else if (input_file_type == CFilteringParams::file_type::fasta && output_file_type == CFilteringParams::file_type::fasta) ProcessFastaToFasta(); else { cout << "Error: this file type is not supported by filter operation\n"; exit(1); } } /*****************************************************************************************************************************/ void CWFastqFilter::operator()() { ff->Process(); } /*****************************************************************************************************************************/ /********************************************************** PRIVATE **********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ bool CFastqFilter::FilterRead() { uint32 read_len = static_cast(seq_desc.read_end - seq_desc.read_start); read.assign((char*)input_part + seq_desc.read_start, read_len); kmc_api.GetCountersForRead(read, counters); uint32 valid_kmers = 0; for(auto counter : counters) if (counter) ++valid_kmers; if (use_float_value) { uint32 min = static_cast(f_min_kmers * (read_len - kmer_len + 1)); uint32 max = static_cast(f_max_kmers * (read_len - kmer_len + 1)); if (valid_kmers >= min && valid_kmers <= max) 
return true; return false; } else { if (valid_kmers >= n_min_kmers && valid_kmers <= n_max_kmers) return true; return false; } } /*****************************************************************************************************************************/ bool CFastqFilter::NextSeqFasta() { // Title char c; if (input_part_pos >= input_part_size) return false; c = input_part[input_part_pos++]; if (c != '>') return false; seq_desc.read_header_start = input_part_pos - 1; for (; input_part_pos < input_part_size;) { c = input_part[input_part_pos++]; if (c < 32) // newliners break; } seq_desc.read_header_end = input_part_pos - 1; if (input_part_pos >= input_part_size) return false; c = input_part[input_part_pos++]; if (c >= 32) input_part_pos--; else if (input_part_pos >= input_part_size) return false; seq_desc.read_start = input_part_pos; // Sequence for (; input_part_pos < input_part_size;) { c = input_part[input_part_pos++]; if (c < 32) // newliners break; } seq_desc.read_end = input_part_pos - 1; seq_desc.end = input_part_pos; if (input_part_pos >= input_part_size) return true; seq_desc.end++; if (input_part[input_part_pos++] >= 32) { input_part_pos--; seq_desc.end--; } else if (input_part_pos >= input_part_size) return true; return (c == '\n' || c == '\r'); } /*****************************************************************************************************************************/ bool CFastqFilter::NextSeqFastq() { char c; // Title if (input_part_pos >= input_part_size) return false; c = input_part[input_part_pos++]; if (c != '@') return false; seq_desc.read_header_start = input_part_pos - 1; for (; input_part_pos < input_part_size;) { c = input_part[input_part_pos++]; if (c < 32) // newliners break; } seq_desc.read_header_end = input_part_pos - 1; if (input_part_pos >= input_part_size) return false; c = input_part[input_part_pos++]; if (c >= 32) input_part_pos--; else if (input_part_pos >= input_part_size) return false; seq_desc.read_start = input_part_pos; 
// Sequence for (; input_part_pos < input_part_size;) { c = input_part[input_part_pos++]; if (c < 32) // newliners break; } seq_desc.read_end = input_part_pos - 1; if (input_part_pos >= input_part_size) return false; c = input_part[input_part_pos++]; if (c >= 32) input_part_pos--; else if (input_part_pos >= input_part_size) return false; // Plus c = input_part[input_part_pos++]; if (input_part_pos >= input_part_size) return false; if (c != '+') return false; seq_desc.quality_header_start = input_part_pos - 1; for (; input_part_pos < input_part_size;) { c = input_part[input_part_pos++]; if (c < 32) // newliners break; } seq_desc.quality_header_end = input_part_pos - 1; if (input_part_pos >= input_part_size) return false; c = input_part[input_part_pos++]; if (c >= 32) input_part_pos--; else if (input_part_pos >= input_part_size) return false; // Quality seq_desc.quality_start = input_part_pos; input_part_pos += seq_desc.read_end - seq_desc.read_start; if (input_part_pos >= input_part_size) return false; c = input_part[input_part_pos++]; seq_desc.quality_end = input_part_pos - 1; seq_desc.end = input_part_pos; if (input_part_pos >= input_part_size) return true; seq_desc.end++; if (input_part[input_part_pos++] >= 32) { input_part_pos--; seq_desc.end--; } else if (input_part_pos >= input_part_size) return true; return c == '\n' || c == '\r'; } /*****************************************************************************************************************************/ void CFastqFilter::ProcessFastaToFasta() { pmm_fastq_filter->reserve(output_part); output_part_pos = 0; uint64 required_size; while (input_part_queue->pop(input_part, input_part_size)) { input_part_pos = 0; while (NextSeqFasta()) { if (FilterRead()) { required_size = seq_desc.end - seq_desc.read_header_start; if (output_part_pos + required_size > output_part_size) { filtered_part_queue->push(output_part, output_part_pos); pmm_fastq_filter->reserve(output_part); output_part_pos = 0; } A_memcpy(output_part 
+ output_part_pos, input_part + seq_desc.read_header_start, required_size); output_part_pos += required_size; } } pmm_fastq_reader->free(input_part); } filtered_part_queue->push(output_part, output_part_pos); filtered_part_queue->mark_completed(); } /*****************************************************************************************************************************/ void CFastqFilter::ProcessFastqToFastq() { pmm_fastq_filter->reserve(output_part); output_part_pos = 0; uint64 required_size; while (input_part_queue->pop(input_part, input_part_size)) { input_part_pos = 0; while (NextSeqFastq()) { if (FilterRead()) { required_size = seq_desc.quality_header_start - seq_desc.read_header_start + 1 + seq_desc.end - seq_desc.quality_header_end; if (output_part_pos + required_size > output_part_size) { filtered_part_queue->push(output_part, output_part_pos); pmm_fastq_filter->reserve(output_part); output_part_pos = 0; } A_memcpy(output_part + output_part_pos, input_part + seq_desc.read_header_start, seq_desc.quality_header_start - seq_desc.read_header_start + 1); output_part_pos += seq_desc.quality_header_start - seq_desc.read_header_start + 1; A_memcpy(output_part + output_part_pos, input_part + seq_desc.quality_header_end, seq_desc.end - seq_desc.quality_header_end); output_part_pos += seq_desc.end - seq_desc.quality_header_end; } } pmm_fastq_reader->free(input_part); } filtered_part_queue->push(output_part, output_part_pos); filtered_part_queue->mark_completed(); } /*****************************************************************************************************************************/ void CFastqFilter::ProcessFastqToFasta() { pmm_fastq_filter->reserve(output_part); output_part_pos = 0; uint64 required_size; while (input_part_queue->pop(input_part, input_part_size)) { input_part_pos = 0; while (NextSeqFastq()) { if (FilterRead()) { required_size = seq_desc.quality_header_start - seq_desc.read_header_start; if (output_part_pos + required_size > 
output_part_size) { filtered_part_queue->push(output_part, output_part_pos); pmm_fastq_filter->reserve(output_part); output_part_pos = 0; } input_part[seq_desc.read_header_start] = '>'; A_memcpy(output_part + output_part_pos, input_part + seq_desc.read_header_start, seq_desc.quality_header_start - seq_desc.read_header_start); output_part_pos += seq_desc.quality_header_start - seq_desc.read_header_start; } } pmm_fastq_reader->free(input_part); } filtered_part_queue->push(output_part, output_part_pos); filtered_part_queue->mark_completed(); } // ***** EOFKMC-2.3/kmc_tools/fastq_filter.h000066400000000000000000000041121257432033000166130ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _FASTQ_FILTER_H #define _FASTQ_FILTER_H #include "config.h" #include "../kmc_api/kmc_file.h" //************************************************************************************************************ // CFastqFilter - filter of reads //************************************************************************************************************ class CFastqFilter { CPartQueue *input_part_queue, *filtered_part_queue; CMemoryPool *pmm_fastq_reader; CMemoryPool *pmm_fastq_filter; CFilteringParams::file_type input_file_type, output_file_type; CKMCFile& kmc_api; uint64 output_part_size; uchar* input_part; uint64 input_part_size; uint64 input_part_pos; uchar* output_part; uint64 output_part_pos; std::vector counters; std::string read; struct { uint64 read_header_start; uint64 read_header_end; uint64 read_start; uint64 read_end; uint64 quality_header_start; uint64 quality_header_end; uint64 quality_start; uint64 quality_end; uint64 end; }seq_desc; bool use_float_value; float f_max_kmers; float f_min_kmers; uint32 n_max_kmers; uint32 n_min_kmers; uint32 kmer_len; void ProcessFastaToFasta(); void 
ProcessFastqToFasta(); void ProcessFastqToFastq(); bool NextSeqFastq(); bool NextSeqFasta(); bool FilterRead(); public: CFastqFilter(CFilteringParams& Params, CFilteringQueues& Queues, CKMCFile& kmc_api); void Process(); }; //************************************************************************************************************ // CWFastqFilter - wrapper for CFastqFilter class - for multithreading purposes //************************************************************************************************************ class CWFastqFilter { std::unique_ptr ff; public: CWFastqFilter(CFilteringParams& Params, CFilteringQueues& Queues, CKMCFile& kmc_api); void operator()(); }; #endif // ***** EOFKMC-2.3/kmc_tools/fastq_reader.cpp000066400000000000000000000200551257432033000171270ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "stdafx.h" #include #include #include "defs.h" #include "fastq_reader.h" //************************************************************************************************************ // CFastqReader - reader class //************************************************************************************************************ uint64 CFastqReader::OVERHEAD_SIZE = 1 << 16; //---------------------------------------------------------------------------------- // Constructor of FASTA/FASTQ reader // Parameters: // * _mm - pointer to memory monitor (to check the memory limits) CFastqReader::CFastqReader(CMemoryPool *_pmm_fastq, CFilteringParams::file_type _file_type, uint32 _gzip_buffer_size, uint32 _bzip2_buffer_size, int _kmer_len) { pmm_fastq = _pmm_fastq; file_type = _file_type; kmer_len = _kmer_len; // Input file mode (default: uncompressed) mode = m_plain; // Pointers to input files in various formats (uncompressed, gzip-compressed, bzip2-compressed) in = NULL; 
in_gzip = NULL; in_bzip2 = NULL; bzerror = BZ_OK; // Size and pointer for the buffer part_size = 1 << 23; part = NULL; gzip_buffer_size = _gzip_buffer_size; bzip2_buffer_size = _bzip2_buffer_size; } //---------------------------------------------------------------------------------- // Destructor - close the files CFastqReader::~CFastqReader() { if(mode == m_plain) { if(in) fclose(in); } else if(mode == m_gzip) { if(in_gzip) gzclose(in_gzip); } else if(mode == m_bzip2) { if(in) { BZ2_bzReadClose(&bzerror, in_bzip2); fclose(in); } } if(part) pmm_fastq->free(part); } //---------------------------------------------------------------------------------- // Set the name of the file to process bool CFastqReader::SetNames(string _input_file_name) { input_file_name = _input_file_name; // Set mode according to the extension of the file name if(input_file_name.size() > 3 && string(input_file_name.end()-3, input_file_name.end()) == ".gz") mode = m_gzip; else if(input_file_name.size() > 4 && string(input_file_name.end()-4, input_file_name.end()) == ".bz2") mode = m_bzip2; else mode = m_plain; return true; } //---------------------------------------------------------------------------------- // Set part size of the buffer bool CFastqReader::SetPartSize(uint64 _part_size) { if(in || in_gzip || in_bzip2) return false; if(_part_size < (1 << 20) || _part_size > (1 << 30)) return false; part_size = _part_size; return true; } //---------------------------------------------------------------------------------- // Open the file bool CFastqReader::OpenFiles() { if(in || in_gzip || in_bzip2) return false; // Uncompressed file if(mode == m_plain) { if((in = fopen(input_file_name.c_str(), "rb")) == NULL) return false; } // Gzip-compressed file else if(mode == m_gzip) { if((in_gzip = gzopen(input_file_name.c_str(), "rb")) == NULL) return false; gzbuffer(in_gzip, gzip_buffer_size); } // Bzip2-compressed file else if(mode == m_bzip2) { in = fopen(input_file_name.c_str(), "rb"); if(!in) return 
false; setvbuf(in, NULL, _IOFBF, bzip2_buffer_size); if((in_bzip2 = BZ2_bzReadOpen(&bzerror, in, 0, 0, NULL, 0)) == NULL) { fclose(in); return false; } } // Reserve via PMM pmm_fastq->reserve(part); part_filled = 0; return true; } //---------------------------------------------------------------------------------- // Read a part of the file bool CFastqReader::GetPart(uchar *&_part, uint64 &_size) { if(!in && !in_gzip && !in_bzip2) return false; if(IsEof()) return false; uint64 readed; // Read data if(mode == m_plain) readed = fread(part+part_filled, 1, part_size, in); else if(mode == m_gzip) readed = gzread(in_gzip, part+part_filled, (int) part_size); else if(mode == m_bzip2) readed = BZ2_bzRead(&bzerror, in_bzip2, part+part_filled, (int) part_size); else readed = 0; // Never should be here int64 total_filled = part_filled + readed; int64 i; if(part_filled >= OVERHEAD_SIZE) { cout << "Error: Wrong input file!\n"; exit(1); } if(IsEof()) { _part = part; _size = total_filled; part = NULL; return true; } // Look for the end of the last complete record in a buffer if(file_type == CFilteringParams::file_type::fasta) // FASTA files { // Looking for a FASTA record at the end of the area int64 line_start[3]; int32 j; i = total_filled - OVERHEAD_SIZE / 2; for(j = 0; j < 3; ++j) { if(!SkipNextEOL(part, i, total_filled)) break; line_start[j] = i; } _part = part; if(j < 3) _size = 0; else { int k; for(k = 0; k < 2; ++k) if(part[line_start[k]+0] == '>') break; if(k == 2) _size = 0; else _size = line_start[k]; } } else // FASTQ file { // Looking for a FASTQ record at the end of the area int64 line_start[9]; int32 j; i = total_filled - OVERHEAD_SIZE / 2; for(j = 0; j < 9; ++j) { if(!SkipNextEOL(part, i, total_filled)) break; line_start[j] = i; } _part = part; if(j < 9) _size = 0; else { int k; for(k = 0; k < 4; ++k) { if(part[line_start[k]+0] == '@' && part[line_start[k+2]+0] == '+') { if(part[line_start[k+2]+1] == '\n' || part[line_start[k+2]+1] == '\r') break; 
if(line_start[k+1]-line_start[k] == line_start[k+3]-line_start[k+2] && memcmp(part+line_start[k]+1, part+line_start[k+2]+1, line_start[k+3]-line_start[k+2]-1) == 0) break; } } if(k == 4) _size = 0; else _size = line_start[k]; } } // Allocate new memory for the buffer pmm_fastq->reserve(part); copy(_part+_size, _part+total_filled, part); part_filled = total_filled - _size; return true; } //---------------------------------------------------------------------------------- // Skip to next EOL from the current position in a buffer bool CFastqReader::SkipNextEOL(uchar *part, int64 &pos, int64 max_pos) { int64 i; for(i = pos; i < max_pos-2; ++i) if((part[i] == '\n' || part[i] == '\r') && !(part[i+1] == '\n' || part[i+1] == '\r')) break; if(i >= max_pos-2) return false; pos = i+1; return true; } //---------------------------------------------------------------------------------- // Check whether there is an EOF bool CFastqReader::IsEof() { if(mode == m_plain) return feof(in) != 0; else if(mode == m_gzip) return gzeof(in_gzip) != 0; else if(mode == m_bzip2) return bzerror == BZ_STREAM_END; return true; } //************************************************************************************************************ // CWFastqReader - wrapper for multithreading purposes //************************************************************************************************************ CWFastqReader::CWFastqReader(CFilteringParams &Params, CFilteringQueues &Queues) { pmm_fastq = Queues.pmm_fastq_reader; input_files_queue = Queues.input_files_queue; part_size = Params.fastq_buffer_size; part_queue = Queues.input_part_queue; file_type = Params.input_file_type; kmer_len = Params.kmer_len; gzip_buffer_size = Params.gzip_buffer_size; bzip2_buffer_size = Params.bzip2_buffer_size; fqr = nullptr; } //---------------------------------------------------------------------------------- CWFastqReader::~CWFastqReader() { } 
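/*
Illustrative sketch, not part of the KMC sources: SkipNextEOL() above moves `pos` to the first byte that follows the next run of line terminators, and returns false if the scan reaches max_pos - 2 without finding one. The standalone function below mirrors that logic on a std::string; the name skip_next_eol is hypothetical and used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical standalone counterpart of CFastqReader::SkipNextEOL.
// Finds the last character of the next '\n' / '\r' run and sets pos
// to the byte right after it, i.e. the start of the following line.
bool skip_next_eol(const std::string& buf, std::int64_t& pos, std::int64_t max_pos)
{
	std::int64_t i;
	for (i = pos; i < max_pos - 2; ++i)
		if ((buf[i] == '\n' || buf[i] == '\r') && !(buf[i + 1] == '\n' || buf[i + 1] == '\r'))
			break;

	if (i >= max_pos - 2)
		return false;	// no line start found, pos is left unchanged

	pos = i + 1;
	return true;
}
```

Calling it repeatedly walks through the line starts of a buffer, which is how GetPart() collects the candidate record boundaries near the end of a part.
*/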
//---------------------------------------------------------------------------------- void CWFastqReader::operator()() { uchar *part; uint64 part_filled; while(input_files_queue->pop(file_name)) { fqr = new CFastqReader(pmm_fastq, file_type, gzip_buffer_size, bzip2_buffer_size, kmer_len); fqr->SetNames(file_name); fqr->SetPartSize(part_size); if(fqr->OpenFiles()) { // Reading Fastq parts while(fqr->GetPart(part, part_filled)) part_queue->push(part, part_filled); } else cerr << "Error: Cannot open file " << file_name << "\n"; delete fqr; } part_queue->mark_completed(); } // ***** EOF KMC-2.3/kmc_tools/fastq_reader.h000066400000000000000000000042241257432033000165740ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _FASTQ_READER_H #define _FASTQ_READER_H #include "defs.h" #include "queues.h" #include "config.h" #include #include #include "libs/zlib.h" #include "libs/bzlib.h" using namespace std; //************************************************************************************************************ // FASTA/FASTQ reader class //************************************************************************************************************ class CFastqReader { typedef enum {m_plain, m_gzip, m_bzip2} t_mode; CMemoryPool *pmm_fastq; string input_file_name; CFilteringParams::file_type file_type; int kmer_len; t_mode mode; FILE *in; gzFile_s *in_gzip; BZFILE *in_bzip2; int bzerror; uint64 part_size; uchar *part; uint64 part_filled; uint32 gzip_buffer_size; uint32 bzip2_buffer_size; bool SkipNextEOL(uchar *part, int64 &pos, int64 max_pos); bool IsEof(); public: CFastqReader(CMemoryPool *_pmm_fastq, CFilteringParams::file_type _file_type, uint32 _gzip_buffer_size, uint32 _bzip2_buffer_size, int _kmer_len); ~CFastqReader(); static uint64 OVERHEAD_SIZE; bool SetNames(string _input_file_name); 
bool SetPartSize(uint64 _part_size); bool OpenFiles(); bool GetPart(uchar *&_part, uint64 &_size); }; //************************************************************************************************************ // Wrapper for FASTA/FASTQ reader class - for multithreading purposes //************************************************************************************************************ class CWFastqReader { CMemoryPool *pmm_fastq; CFastqReader *fqr; string file_name; uint64 part_size; CInputFilesQueue *input_files_queue; CPartQueue *part_queue; CFilteringParams::file_type file_type; uint32 gzip_buffer_size; uint32 bzip2_buffer_size; int kmer_len; public: CWFastqReader(CFilteringParams &Params, CFilteringQueues &Queues); ~CWFastqReader(); void operator()(); }; #endif // ***** EOFKMC-2.3/kmc_tools/fastq_writer.cpp000066400000000000000000000051061257432033000172010ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "stdafx.h" #include "fastq_writer.h" #include using namespace std; /*****************************************************************************************************************************/ /******************************************************** CONSTRUCTOR ********************************************************/ /*****************************************************************************************************************************/ CFastqWriter::CFastqWriter(CFilteringParams& Params, CFilteringQueues& Queues) { output_src = Params.output_src; filtered_part_queue = Queues.filtered_part_queue; pmm_fastq_filter = Queues.pmm_fastq_filter; } /*****************************************************************************************************************************/ /********************************************************** PUBLIC 
***********************************************************/
/*****************************************************************************************************************************/
void CFastqWriter::Process()
{
	uchar* part;
	uint64 size;
	FILE* f = fopen(output_src.c_str(), "wb");
	if (!f)
	{
		cout << "Error: cannot open file: " << output_src << "\n";
		exit(1);
	}
	// Write filtered parts in the order they arrive and return each buffer to the memory pool
	while (filtered_part_queue->pop(part, size))
	{
		if (fwrite(part, 1, size, f) != size)
		{
			cout << "Error while writing to " << output_src << "\n";
			exit(1);
		}
		pmm_fastq_filter->free(part);
	}
	fclose(f);
}

/*****************************************************************************************************************************/
/******************************************************** CONSTRUCTOR ********************************************************/
/*****************************************************************************************************************************/
CWFastqWriter::CWFastqWriter(CFilteringParams& Params, CFilteringQueues& Queues)
	:writer(Params, Queues)
{
}

/*****************************************************************************************************************************/
/********************************************************** PUBLIC ***********************************************************/
/*****************************************************************************************************************************/
void CWFastqWriter::operator()()
{
	writer.Process();
}

// ***** EOF
KMC-2.3/kmc_tools/fastq_writer.h000066400000000000000000000023421257432033000166450ustar00rootroot00000000000000
/* This file is a part of KMC software distributed under GNU GPL 3 licence.
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _FASTQ_WRITER_H #define _FASTQ_WRITER_H #include "defs.h" #include "config.h" #include //************************************************************************************************************ // CFastqWriter - Writer of fastq/fasta file //************************************************************************************************************ class CFastqWriter { std::string output_src; CPartQueue* filtered_part_queue; CMemoryPool *pmm_fastq_filter; public: CFastqWriter(CFilteringParams& Params, CFilteringQueues& Queues); void Process(); }; //************************************************************************************************************ // CWFastqWriter - wrapper for CFastqWriter class - for multithreading purposes //************************************************************************************************************ class CWFastqWriter { CFastqWriter writer; public: CWFastqWriter(CFilteringParams& Params, CFilteringQueues& Queues); void operator()(); }; #endif KMC-2.3/kmc_tools/histogram_writer.h000066400000000000000000000021661257432033000175300ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _HISTOGRAM_WRITER_H #define _HISTOGRAM_WRITER_H #include "defs.h" #include "config.h" #include #include template class CHistogramWriter { KMCDB& kmcdb; COutputDesc& output_desc; std::vector counters; public: CHistogramWriter(KMCDB& kmcdb) :kmcdb(kmcdb), output_desc(CConfig::GetInstance().output_desc) { } bool Process() { counters.resize(output_desc.cutoff_max + 1); uint32 counter; while (kmcdb.NextCounter(counter)) { if (counter >= output_desc.cutoff_min && counter <= output_desc.cutoff_max) counters[counter]++; } std::ofstream file(output_desc.file_src); if (!file) { std::cout << "Error: cannot open file: " << output_desc.file_src << "\n"; exit(1); } for (uint32 i = output_desc.cutoff_min; i <= output_desc.cutoff_max; ++i) { file << i << "\t" << counters[i] << "\n"; } file.close(); return true; } }; #endifKMC-2.3/kmc_tools/kmc1_db_reader.h000066400000000000000000000245141257432033000167620ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _KMC1_DB_READER_H #define _KMC1_DB_READER_H #include "kmer.h" #include "defs.h" #include "config.h" #include "bundle.h" #include "kmc_header.h" #include "queues.h" #include #include #include enum class KMCDBOpenMode { sequential, sorted, counters_only }; //************************************************************************************************************ // CKMC1DbReader - reader of KMC1 database //************************************************************************************************************ template class CKMC1DbReader : public CInput { public: CKMC1DbReader(const CKMC_header& header, const CInputDesc& desc, CPercentProgress& percent_progress, KMCDBOpenMode open_mode); void NextBundle(CBundle& bundle) override { bool exists = circular_queue->pop(bundle.Data()); percent_progress.UpdateItem(progress_id, bundle.Size()); if (exists) return; percent_progress.Complete(progress_id); this->finished = true; this->sorted_access_thread.join(); delete this->circular_queue; } void IgnoreRest() override { circular_queue->force_finish(); this->finished = true; this->sorted_access_thread.join(); delete this->circular_queue; } ~CKMC1DbReader() { if(prefix_file != nullptr) fclose(prefix_file); if(sufix_file != nullptr) fclose(sufix_file); delete[] prefix_buff; delete[] sufix_buff; } bool NextKmerSequential(CKmer& kmer, uint32& counter) { if (next_kmer_sorted(kmer, counter)) { percent_progress.UpdateItem(progress_id); return true; } percent_progress.Complete(progress_id); return false; } bool NextCounter(uint32& counter); private: static const uint32 PREFIX_BUFF_BYTES = KMC1_DB_READER_PREFIX_BUFF_BYTES; static const uint32 SUFIX_BUFF_BYTES = KMC1_DB_READER_SUFIX_BUFF_BYTES; const CKMC_header& header; const CInputDesc& desc; CPercentProgress& percent_progress; KMCDBOpenMode open_mode; uint32 progress_id; FILE* prefix_file; FILE* 
sufix_file; uint32 record_size; //of sufix, in bytes uint32 current_preffix; uint32 sufix_bytes; uint64* prefix_buff = nullptr; uchar* sufix_buff = nullptr; uint32 prefix_bytes; uint32 kmer_bytes; uint64 prefix_buff_size; uint64 sufix_buff_size; uint64 prefix_buff_pos; uint64 sufix_buff_pos; uint64 prefix_left_to_read; uint64 sufix_left_to_read; std::string prefix_file_name; std::string sufix_file_name; uint64 sufix_number; CCircularQueue* circular_queue = nullptr; //for sorted access only std::thread sorted_access_thread; void reload_pref_buff(); bool reload_suf_buff(); bool next_kmer_sorted(CKmer& kmer, uint32& counter); void open_files(); void allocate_buffers() { sufix_buff = new uchar[sufix_buff_size]; if (open_mode == KMCDBOpenMode::sequential || open_mode == KMCDBOpenMode::sorted) prefix_buff = new uint64[prefix_buff_size]; } }; /*****************************************************************************************************************************/ /******************************************************** CONSTRUCTOR ********************************************************/ /*****************************************************************************************************************************/ template CKMC1DbReader::CKMC1DbReader(const CKMC_header& header, const CInputDesc& desc, CPercentProgress& percent_progress, KMCDBOpenMode open_mode) : header(header), desc(desc), percent_progress(percent_progress), open_mode(open_mode) { progress_id = percent_progress.RegisterItem(header.total_kmers); prefix_file = sufix_file = nullptr; sufix_bytes = (header.kmer_len - header.lut_prefix_len) / 4; record_size = sufix_bytes + header.counter_size; sufix_buff_size = SUFIX_BUFF_BYTES / record_size * record_size; prefix_buff_size = PREFIX_BUFF_BYTES / sizeof(uint64); sufix_left_to_read = header.total_kmers * record_size; if (sufix_left_to_read < sufix_buff_size) sufix_buff_size = sufix_left_to_read; prefix_left_to_read = (1 << header.lut_prefix_len * 2) - 1; if 
(prefix_left_to_read < prefix_buff_size) prefix_buff_size = prefix_left_to_read; prefix_bytes = (header.lut_prefix_len + 3) / 4; kmer_bytes = prefix_bytes + sufix_bytes; open_files(); allocate_buffers(); if(open_mode == KMCDBOpenMode::sequential || open_mode == KMCDBOpenMode::sorted) reload_pref_buff(); reload_suf_buff(); current_preffix = 0; sufix_number = 0; if (open_mode == KMCDBOpenMode::sorted) { circular_queue = new CCircularQueue(DEFAULT_CIRCULAL_QUEUE_CAPACITY); sorted_access_thread = std::thread([this]{ CKmer kmer; uint32 counter; CBundleData bundle_data; while (next_kmer_sorted(kmer, counter)) { bundle_data.Insert(kmer, counter); if (bundle_data.Full()) { if (!this->circular_queue->push(bundle_data)) break; } } if (!bundle_data.Empty()) this->circular_queue->push(bundle_data); this->circular_queue->mark_completed(); }); } } /*****************************************************************************************************************************/ /********************************************************** PUBLIC ***********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ template bool CKMC1DbReader::NextCounter(uint32& counter) { while (true) { if (sufix_number >= header.total_kmers) return false; uchar* record = sufix_buff + sufix_buff_pos + sufix_bytes; counter = 0; for (int32 i = header.counter_size - 1; i >= 0; --i) { counter <<= 8; counter += record[i]; } ++sufix_number; sufix_buff_pos += record_size; if (sufix_buff_pos >= sufix_buff_size) reload_suf_buff(); if (counter >= desc.cutoff_min && counter <= desc.cutoff_max) return true; } } /*****************************************************************************************************************************/ 
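/*
Illustrative sketch, not part of the KMC sources: NextCounter() above rebuilds each counter from counter_size bytes, starting at the highest index and shifting left by 8 bits per byte, so the byte at index 0 ends up least significant. The helper below isolates that loop; the name decode_counter is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: assembles a counter stored least-significant byte first.
// `record` must point at counter_size readable bytes (at most 4 for a 32-bit result).
std::uint32_t decode_counter(const unsigned char* record, std::uint32_t counter_size)
{
	std::uint32_t counter = 0;
	for (std::int32_t i = static_cast<std::int32_t>(counter_size) - 1; i >= 0; --i)
	{
		counter <<= 8;
		counter += record[i];
	}
	return counter;
}
```

For the two bytes 0x01, 0x02 the loop first takes 0x02, shifts it, then adds 0x01, giving 0x0201.
*/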
/********************************************************** PRIVATE **********************************************************/
/*****************************************************************************************************************************/
/*****************************************************************************************************************************/
template<unsigned SIZE> void CKMC1DbReader<SIZE>::reload_pref_buff()
{
	uint64 to_read = MIN(prefix_left_to_read, prefix_buff_size);
	prefix_buff_pos = 0;
	if (to_read == 0)
	{
		prefix_buff[0] = header.total_kmers; //guard
		return;
	}

	if (fread(prefix_buff, sizeof(uint64), to_read, prefix_file) != to_read)
	{
		std::cout << "Error: some error while reading " << prefix_file_name << "\n";
		exit(1);
	}
	prefix_left_to_read -= to_read;
	if (to_read < prefix_buff_size)
	{
		prefix_buff[to_read] = header.total_kmers; //guard
	}
}

/*****************************************************************************************************************************/
template<unsigned SIZE> bool CKMC1DbReader<SIZE>::reload_suf_buff()
{
	uint64 to_read = MIN(sufix_left_to_read, sufix_buff_size);
	if (to_read == 0)
		return false;
	uint64 bytes_read = fread(sufix_buff, 1, to_read, sufix_file);
	if (bytes_read != to_read)
	{
		std::cout << "Error: some error while reading " << sufix_file_name << "\n";
		exit(1);
	}
	sufix_buff_pos = 0;
	sufix_left_to_read -= to_read;
	return true;
}

/*****************************************************************************************************************************/
template<unsigned SIZE> void CKMC1DbReader<SIZE>::open_files()
{
	sufix_file_name = desc.file_src + ".kmc_suf";

	sufix_file = fopen(sufix_file_name.c_str(), "rb");
	if (!sufix_file)
	{
		std::cout << "Error: cannot open file: " << sufix_file_name << "\n";
		exit(1);
	}
	// Disable stdio buffering only after the stream is known to be valid
	setvbuf(sufix_file, NULL, _IONBF, 0);

	char marker[4];
	if (fread(marker, 1, 4, sufix_file) != 4)
	{
		std::cout << "Error: while reading start marker in file: " << sufix_file_name << "\n";
		exit(1);
	}

	if (strncmp(marker, "KMCS", 4) != 0)
	{
		std::cout << "Error: wrong start marker in file: " << sufix_file_name << "\n";
		exit(1);
	}

	my_fseek(sufix_file, -4, SEEK_END);
	if (fread(marker, 1, 4, sufix_file) != 4)
	{
		std::cout << "Error: while reading end marker in file: " << sufix_file_name << "\n";
		exit(1);
	}

	if (strncmp(marker, "KMCS", 4) != 0)
	{
		std::cout << "Error: wrong end marker in file: " << sufix_file_name << "\n";
		exit(1);
	}
	my_fseek(sufix_file, 4, SEEK_SET); //skip KMCS

	if (open_mode == KMCDBOpenMode::sequential || open_mode == KMCDBOpenMode::sorted)
	{
		prefix_file_name = desc.file_src + ".kmc_pre";

		prefix_file = fopen(prefix_file_name.c_str(), "rb");
		if (!prefix_file)
		{
			std::cout << "Error: cannot open file: " << prefix_file_name << "\n";
			exit(1);
		}
		// Disable stdio buffering only after the stream is known to be valid
		setvbuf(prefix_file, NULL, _IONBF, 0);
		my_fseek(prefix_file, 4 + sizeof(uint64), SEEK_SET); //skip KMCP and first value as it must be 0
	}
}

/*****************************************************************************************************************************/
template<unsigned SIZE> bool CKMC1DbReader<SIZE>::next_kmer_sorted(CKmer<SIZE>& kmer, uint32& counter)
{
	while (true)
	{
		if (sufix_number >= header.total_kmers)
			return false;

		// Advance to the prefix that owns the current suffix record
		while (prefix_buff[prefix_buff_pos] <= sufix_number)
		{
			++current_preffix;
			++prefix_buff_pos;
			if (prefix_buff_pos >= prefix_buff_size)
				reload_pref_buff();
		}

		uchar* record = sufix_buff + sufix_buff_pos;
		uint32 pos = kmer_bytes - 1;

		kmer.load(record, sufix_bytes);
		for (int32 i = prefix_bytes - 1; i >= 0; --i)
			kmer.set_byte(pos--, current_preffix >> (i << 3));

		counter = 0;
		for (int32 i = header.counter_size - 1; i >= 0; --i)
		{
			counter <<= 8;
			counter += record[i];
		}

		++sufix_number;
		sufix_buff_pos += record_size;

		if (sufix_buff_pos >= sufix_buff_size)
			reload_suf_buff();

		if (counter >= desc.cutoff_min && counter <= desc.cutoff_max)
			return true;
	}
}

#endif

// ***** EOF
KMC-2.3/kmc_tools/kmc1_db_writer.h000066400000000000000000000262531257432033000170360ustar00rootroot00000000000000
/* This file is a part of KMC software distributed under GNU GPL 3 licence.
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _KMC1_DB_WRITER_H #define _KMC1_DB_WRITER_H #include "defs.h" #include "config.h" #include "queues.h" #include #include //************************************************************************************************************ // CKMC1SufixFileWriter - thread for writing sufixes' parts //************************************************************************************************************ class CKMC1SufixFileWriter { public: CKMC1SufixFileWriter(CSufWriteQueue& input_queue, FILE* kmc_suf) : input_queue(input_queue), kmc_suf(kmc_suf) { } void operator()() { uchar* buf; uint32 size; while (input_queue.pop(buf, size)) { if (fwrite(buf, 1, size, kmc_suf) != size) { std::cout << "Error while writting to kmc_suf file\n"; exit(1); } delete[] buf; } } private: CSufWriteQueue& input_queue; FILE* kmc_suf; }; //************************************************************************************************************ // CKMC1DbWriter - writer of KMC1 database //************************************************************************************************************ template class CKMC1DbWriter { public: CKMC1DbWriter(CBundle* bundle); ~CKMC1DbWriter(); bool Process(); private: static const uint32 PRE_BUFF_SIZE_BYTES = KMC1_DB_WRITER_PREFIX_BUFF_BYTES; static const uint32 SUF_BUFF_SIZE_BYTES = KMC1_DB_WRITER_SUFIX_BUFF_BYTES; CConfig& config; CBundle* bundle; FILE* kmc_pre, *kmc_suf; uint32 lut_prefix_len; uint32 current_prefix; uint32 counter_size; uint32 pre_buff_size; uint32 suf_buff_size; uint64* pre_buff; uchar* suf_buff; uint64 added_kmers; uint32 sufix_rec_bytes; uint32 suf_pos, pre_pos; void store_pre_buf(); void send_suf_buf_to_queue(); void start_writting(); inline void add_kmer(CKmer& kmer, uint32 counter); void finish_writting(); template void write_header_part(T data); void calc_lut_prefix_len(); CCircularQueue 
bundles_queue;
    CSufWriteQueue suf_buf_queue;
};

/*****************************************************************************************************************************/
/******************************************************** CONSTRUCTOR ********************************************************/
/*****************************************************************************************************************************/

/*****************************************************************************************************************************/
template CKMC1DbWriter::CKMC1DbWriter(CBundle* bundle) :
    config(CConfig::GetInstance()),
    bundle(bundle),
    bundles_queue(DEFAULT_CIRCULAL_QUEUE_CAPACITY)
{
    kmc_pre = NULL;
    kmc_suf = NULL;
    pre_buff = NULL;
    suf_buff = NULL;
    std::string kmc_pre_file_name = config.output_desc.file_src + ".kmc_pre";
    std::string kmc_suf_file_name = config.output_desc.file_src + ".kmc_suf";

    kmc_pre = fopen(kmc_pre_file_name.c_str(), "wb");
    if (!kmc_pre)
    {
        std::cout << "Error: cannot open file: " << kmc_pre_file_name << "\n";
        exit(1);
    }
    kmc_suf = fopen(kmc_suf_file_name.c_str(), "wb");
    if (!kmc_suf)
    {
        fclose(kmc_pre);
        std::cout << "Error: cannot open file: " << kmc_suf_file_name << "\n";
        exit(1);
    }

    //Disable stdio buffering only after both files are known to be open (setvbuf on a null FILE* is undefined behavior)
    setvbuf(kmc_pre, NULL, _IONBF, 0);
    setvbuf(kmc_suf, NULL, _IONBF, 0);

    // Calculate LUT size
    calc_lut_prefix_len();

    counter_size = MIN(BYTE_LOG(config.output_desc.counter_max), BYTE_LOG(config.output_desc.cutoff_max));
    sufix_rec_bytes = (config.kmer_len - lut_prefix_len) / 4 + counter_size;
    current_prefix = 0;
    added_kmers = 0;
    pre_buff_size = PRE_BUFF_SIZE_BYTES / sizeof(uint64);
    suf_buff_size = SUF_BUFF_SIZE_BYTES / sufix_rec_bytes;
    suf_pos = pre_pos = 0;

    pre_buff = new uint64[pre_buff_size];
    pre_buff[pre_pos++] = 0;
    suf_buff = new uchar[suf_buff_size * sufix_rec_bytes];

    suf_buf_queue.init(suf_buff_size * sufix_rec_bytes, SUFIX_WRITE_QUEUE_CAPACITY);
}
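/*
Illustration only, not part of the KMC sources. The constructor above sizes each suffix record as
(kmer_len - lut_prefix_len) / 4 + counter_size bytes, and add_kmer() later appends the counter to the record
byte by byte, lowest byte first. The sketch below reproduces just that arithmetic under stated assumptions:
the helper names suffix_record_bytes, store_counter_le and load_counter_le are hypothetical, and fixed-width
standard integer types stand in for KMC's own uint32/uchar typedefs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: bytes taken by one suffix record.
// Every symbol takes 2 bits, so 4 suffix symbols fit in one byte.
inline uint32_t suffix_record_bytes(uint32_t kmer_len, uint32_t lut_prefix_len, uint32_t counter_size)
{
    assert(kmer_len >= lut_prefix_len);
    // The prefix length search in calc_lut_prefix_len() skips lengths that leave a partial byte.
    assert((kmer_len - lut_prefix_len) % 4 == 0);
    return (kmer_len - lut_prefix_len) / 4 + counter_size;
}

// Hypothetical helper: append a counter using counter_size bytes, lowest byte first.
inline void store_counter_le(std::vector<uint8_t>& out, uint32_t counter, uint32_t counter_size)
{
    for (uint32_t i = 0; i < counter_size; ++i)
        out.push_back(static_cast<uint8_t>(counter >> (i << 3)));
}

// Hypothetical helper: read the counter back, starting from the highest stored byte.
inline uint32_t load_counter_le(const uint8_t* rec, uint32_t counter_size)
{
    uint32_t counter = 0;
    for (int32_t i = static_cast<int32_t>(counter_size) - 1; i >= 0; --i)
    {
        counter <<= 8;
        counter += rec[i];
    }
    return counter;
}
```

For example, with kmer_len = 25, lut_prefix_len = 9 and counter_size = 2, the suffix holds 16 symbols in 4 bytes
and the whole record takes 6 bytes.
*/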
/*****************************************************************************************************************************/
/********************************************************** PUBLIC ***********************************************************/
/*****************************************************************************************************************************/

/*****************************************************************************************************************************/
template bool CKMC1DbWriter::Process()
{
    start_writting();

    //Converts bundles to output buffers; each filled suffix buffer is pushed to another queue and written in a separate thread (sufix_writer)
    std::thread preparing_thread([this]{
        CBundleData bundle_data;
        while (bundles_queue.pop(bundle_data))
        {
            while (!bundle_data.Empty())
            {
                add_kmer(bundle_data.TopKmer(), bundle_data.TopCounter());
                bundle_data.Pop();
            }
        }
        suf_buf_queue.push(suf_buff, sufix_rec_bytes * suf_pos);
        suf_buf_queue.mark_completed();
    });

    CKMC1SufixFileWriter sufix_writer(suf_buf_queue, kmc_suf);
    std::thread suf_buf_writing_thread(std::ref(sufix_writer));

#ifdef ENABLE_LOGGER
    CTimer timer;
#endif
    while (!bundle->Finished())
    {
#ifdef ENABLE_LOGGER
        timer.start();
#endif
        bundles_queue.push(bundle->Data());
#ifdef ENABLE_LOGGER
        CLoger::GetLogger().log_operation("adding bundle to the output queue", this, timer.get_time());
#endif
    }
    bundles_queue.mark_completed();

    preparing_thread.join();
    suf_buf_writing_thread.join();

    finish_writting();
    return true;
}

/*****************************************************************************************************************************/
template CKMC1DbWriter::~CKMC1DbWriter()
{
    delete[] suf_buff;
    delete[] pre_buff;
}

/*****************************************************************************************************************************/
/********************************************************** PRIVATE **********************************************************/
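/*
Illustration only, not part of the KMC sources. Process() chains its threads through queues that share one
protocol: the producer calls push() and finally mark_completed(), while the consumer loops on pop() until it
returns false. The sketch below shows one way such a queue could look. CompletableQueue is a hypothetical name;
it is unbounded and is not the actual CCircularQueue or CSufWriteQueue from queues.h, whose internals are not
shown here.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical stand-in for the queues used by Process().
template<typename T>
class CompletableQueue
{
    std::queue<T> items;
    bool completed = false;
    std::mutex mtx;
    std::condition_variable cv;
public:
    void push(T item)
    {
        std::lock_guard<std::mutex> lck(mtx);
        items.push(std::move(item));
        cv.notify_one();
    }

    // Called by the producer after its last push.
    void mark_completed()
    {
        std::lock_guard<std::mutex> lck(mtx);
        completed = true;
        cv.notify_all();
    }

    // Blocks until an item is available or the queue is both completed and drained.
    // Returns false only in the second case, which ends the consumer's loop.
    bool pop(T& item)
    {
        std::unique_lock<std::mutex> lck(mtx);
        cv.wait(lck, [this]{ return !items.empty() || completed; });
        if (items.empty())
            return false;
        item = std::move(items.front());
        items.pop();
        return true;
    }
};
```

Items pushed before mark_completed() are still delivered; pop() reports the end only once the queue is empty.
*/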
/*****************************************************************************************************************************/ /*****************************************************************************************************************************/ template template void CKMC1DbWriter::write_header_part(T data) { for (uint32 i = 0; i < sizeof(T); ++i) { char c = (data >> (i << 3)) & 0xff; if (putc(c, kmc_pre) == EOF) { std::cout << "Error while writing header of kmc1\n"; exit(1); } } } /*****************************************************************************************************************************/ template void CKMC1DbWriter::start_writting() { if (fwrite("KMCP", 1, 4, kmc_pre) != 4) { std::cout << "Error while writting starting KMCP marker"; exit(1); } if (fwrite("KMCS", 1, 4, kmc_suf) != 4) { std::cout << "Error while writting starting KMCS marker"; exit(1); } } /*****************************************************************************************************************************/ template void CKMC1DbWriter::finish_writting() { uint32 max_prefix = (1 << 2 * lut_prefix_len); while (current_prefix < max_prefix - 1) { pre_buff[pre_pos++] = added_kmers; ++current_prefix; if (pre_pos == pre_buff_size) store_pre_buf(); } store_pre_buf(); send_suf_buf_to_queue(); //store header write_header_part(config.kmer_len); write_header_part(config.headers.front().mode); write_header_part(counter_size); write_header_part(lut_prefix_len); write_header_part(config.output_desc.cutoff_min); write_header_part(config.output_desc.cutoff_max); write_header_part(added_kmers); bool both_stands = false; for (auto& input : config.headers) both_stands = both_stands || input.both_strands; //if any input database is in both strands, output is also in both strands write_header_part(!both_stands); for (uint32 i = 0; i < 31; ++i) write_header_part(uchar(0)); write_header_part((uint32)64); if (fwrite("KMCP", 1, 4, kmc_pre) != 4) { std::cout << "Error while writting end 
KMCP marker"; exit(1); } if (fwrite("KMCS", 1, 4, kmc_suf) != 4) { std::cout << "Error while writting end KMCS marker"; exit(1); } fclose(kmc_pre); fclose(kmc_suf); } /*****************************************************************************************************************************/ template void CKMC1DbWriter::add_kmer(CKmer& kmer, uint32 counter) { if (counter < config.output_desc.cutoff_min || counter > config.output_desc.cutoff_max) return; if (counter > config.output_desc.counter_max) counter = config.output_desc.counter_max; uint64 kmer_prefix = kmer.remove_suffix((config.kmer_len - lut_prefix_len) * 2); while (current_prefix < kmer_prefix) { pre_buff[pre_pos++] = added_kmers; ++current_prefix; if (pre_pos == pre_buff_size) store_pre_buf(); } uchar* rec = suf_buff + suf_pos * sufix_rec_bytes; kmer.store(rec, sufix_rec_bytes - counter_size); for (uint32 i = 0; i < counter_size; ++i) *rec++ = counter >> (i << 3); ++suf_pos; if (suf_pos == suf_buff_size) send_suf_buf_to_queue(); ++added_kmers; } /*****************************************************************************************************************************/ template void CKMC1DbWriter::store_pre_buf() { if (fwrite(pre_buff, sizeof(uint64), pre_pos, kmc_pre) != pre_pos) { std::cout << "Error while writting to kmc_pre file\n"; exit(1); } pre_pos = 0; } /*****************************************************************************************************************************/ template void CKMC1DbWriter::send_suf_buf_to_queue() { suf_buf_queue.push(suf_buff, sufix_rec_bytes * suf_pos); suf_pos = 0; } /*****************************************************************************************************************************/ template void CKMC1DbWriter::calc_lut_prefix_len() { std::vector best_lut_prefix_len_inputs(config.headers.size()); for (uint32 i = 0; i < config.headers.size(); ++i) { uint32 best_lut_prefix_len = 0; uint64 best_mem_amount = 1ull << 62; for (lut_prefix_len = 6; 
lut_prefix_len < 16; ++lut_prefix_len)
        {
            uint32 suffix_len = config.headers[i].kmer_len - lut_prefix_len;
            if (suffix_len % 4)
                continue;

            uint64 suf_mem = config.headers[i].total_kmers * suffix_len / 4;
            uint64 lut_mem = (1ull << (2 * lut_prefix_len)) * sizeof(uint64);

            if (suf_mem + lut_mem < best_mem_amount)
            {
                best_lut_prefix_len = lut_prefix_len;
                best_mem_amount = suf_mem + lut_mem;
            }
        }
        best_lut_prefix_len_inputs[i] = best_lut_prefix_len;
    }

    //TODO: for now the LUT prefix length is the largest of the best lengths computed for the input databases
    lut_prefix_len = *std::max_element(best_lut_prefix_len_inputs.begin(), best_lut_prefix_len_inputs.end());
}

#endif

// ***** EOF
KMC-2.3/kmc_tools/kmc2_db_reader.h
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  Authors: Marek Kokot

  Version: 2.3.0
  Date   : 2015-08-21
*/

#ifndef _KMC2_DB_READER_H
#define _KMC2_DB_READER_H

#include "config.h"
#include "bundle.h"
#include "queues.h"
#include
#include
#include
#include
//#include
#include
#include

//Forward declaration
template class CKMC2DbReaderSorted;

template class CBin;

struct CBinBuff //must be moveable
{
    uchar* buf;
    uint32 size;

    CBinBuff() : buf(nullptr), size(0)
    {
    }

    CBinBuff(uchar* buf, uint32 size) : buf(buf), size(size)
    {
    }

#ifdef WIN32
    CBinBuff& operator=(CBinBuff&& rhs) throw()
#else
    CBinBuff& operator=(CBinBuff&& rhs) noexcept
#endif
    {
        if (this != &rhs)
        {
            buf = rhs.buf;
            size = rhs.size;
            rhs.buf = nullptr;
            rhs.size = 0;
        }
        return *this;
    }

#ifdef WIN32
    CBinBuff(CBinBuff&& rhs) throw()
#else
    CBinBuff(CBinBuff&& rhs) noexcept
#endif
    {
        buf = rhs.buf;
        size = rhs.size;
        rhs.buf = nullptr;
        rhs.size = 0;
    }

    CBinBuff(const CBinBuff&) = delete;
    CBinBuff& operator=(const CBinBuff&) = delete;
};

template class CBinBufProvider
{
    std::vector internal_bufs;
    uchar *buf_bins, *buf_internal;
    uint32 bins_left_to_read = 0;
    uint32 max_bin_bytes;
    uint32
rec_size; using desc_t = std::tuple;//current_kmer, last_kmer, is_empty using to_read_t = std::tuple;//bin_id, file_pos, bufer to read, size to read std::vector desc; //std::stack> to_read; std::queue> to_read; mutable std::mutex mtx; std::condition_variable cv_pop; std::condition_variable cv_get_next_to_read; bool forced_to_finish = false; public: void init(std::vector>& bins); void pop(uint32 bin_id, CBinBuff& bin_buf) { std::unique_lock lck(mtx); cv_pop.wait(lck, [this, bin_id]{return !std::get<2>(desc[bin_id]); }); std::swap(bin_buf, internal_bufs[bin_id]); std::get<2>(desc[bin_id]) = true; uint64 kmers_left = std::get<1>(desc[bin_id]) - std::get<0>(desc[bin_id]); if (kmers_left) { uint32 kmers_to_read = (uint32)MIN(kmers_left, max_bin_bytes / rec_size); internal_bufs[bin_id].size = kmers_to_read * rec_size; bool was_empty = to_read.empty(); to_read.push(std::make_tuple(bin_id, 4 + std::get<0>(desc[bin_id]) * rec_size, internal_bufs[bin_id].buf, internal_bufs[bin_id].size)); std::get<0>(desc[bin_id]) += kmers_to_read; if (was_empty) cv_get_next_to_read.notify_all(); } else { --bins_left_to_read; if (!bins_left_to_read) cv_get_next_to_read.notify_all(); } } void notify_bin_filled(uint32 bin_id) { std::lock_guard lck(mtx); std::get<2>(desc[bin_id]) = false; cv_pop.notify_all(); } bool get_next_to_read(uint32& bin_id, uint64& file_pos, uchar* &buf, uint32& size) { std::unique_lock lck(mtx); cv_get_next_to_read.wait(lck, [this]{return !to_read.empty() || !bins_left_to_read || forced_to_finish; }); if (forced_to_finish || (to_read.empty() && !bins_left_to_read)) return false; std::tie(bin_id, file_pos, buf, size) = to_read.front(); to_read.pop(); return true; } void force_to_finish() { std::lock_guard lck(mtx); forced_to_finish = true; cv_get_next_to_read.notify_all(); } ~CBinBufProvider() { delete[] buf_bins; delete[] buf_internal; } }; template class CSufBinReader { CBinBufProvider& bin_provider; FILE* suf_file; public: CSufBinReader(CBinBufProvider& bin_provider, 
FILE* suf_file) : bin_provider(bin_provider), suf_file(suf_file) { } void operator()() { uint32 bin_id; uint64 file_pos; uchar* buf; uint32 size; #ifdef ENABLE_LOGGER CTimer timer; #endif while (bin_provider.get_next_to_read(bin_id, file_pos, buf, size)) { my_fseek(suf_file, file_pos, SEEK_SET); #ifdef ENABLE_LOGGER timer.start(); #endif if (fread(buf, 1, size, suf_file) != size) { std::cout << "Error while reading sufix file\n"; exit(1); } #ifdef ENABLE_LOGGER CLoger::GetLogger().log_operation("fread", this, timer.get_time()); timer.start(); #endif bin_provider.notify_bin_filled(bin_id); } } }; template class CBin { public: CBin(uint32 bin_id, uint64* LUT, CKMC2DbReaderSorted& kmc2_db); bool NextKmer(CKmer& kmer, uint32& counter); uint64 get_kmer_number() { return kmer_number; } uint64 get_kmer_number_end() { return kmer_number_end; } uint32 get_record_size() { return record_size; } void set_bin_buff(CBinBuff&& _bin_buff) { bin_buff = std::move(_bin_buff); pos = bin_buff.size; //force reload } #ifdef WIN32 //Because VS2013 does generate default move ctro here CBin(CBin&& o) throw(): bin_id(o.bin_id), bin_buff(std::move(o.bin_buff)), LUT(o.LUT), pos(o.pos), bin_provider(o.bin_provider), kmc2_db(o.kmc2_db), kmer_number(o.kmer_number), kmer_number_end(o.kmer_number_end), kmer_bytes(o.kmer_bytes), prefix_bytes(o.prefix_bytes), suffix_bytes(o.suffix_bytes), counter_size(o.counter_size), record_size(o.record_size), prefix(o.prefix), max_prefix(o.max_prefix) { } #else //g++ generate here move ctor automatically #endif private: uint32 bin_id; CBinBuff bin_buff; uint64* LUT; uint32 pos = 0; CBinBufProvider& bin_provider; CKMC2DbReaderSorted& kmc2_db; void reload_suf_buf(); uint64 kmer_number, kmer_number_end; uint32 kmer_bytes, prefix_bytes, suffix_bytes, counter_size, record_size; uint64 prefix = 0; uint64 max_prefix; }; template void CBinBufProvider::init(std::vector>& bins) { uint64 start, end; uint64 needed_mem = 0; rec_size = bins.front().get_record_size(); 
max_bin_bytes = SINGLE_BIN_BUFF_SIZE_FOR_DB2_READER / rec_size * rec_size; uint32 mem; internal_bufs.resize(bins.size()); for (uint32 i = 0; i < bins.size(); ++i) { auto& b = bins[i]; start = b.get_kmer_number(); end = b.get_kmer_number_end(); mem = (uint32)MIN((end - start) * rec_size, max_bin_bytes); internal_bufs[i] = CBinBuff(nullptr, mem); desc.push_back(std::make_tuple(start, end, true)); needed_mem += mem; } bins_left_to_read = (uint32)bins.size(); buf_bins = new uchar[needed_mem]; buf_internal = new uchar[needed_mem]; internal_bufs[0].buf = buf_internal; uchar* ptr = buf_bins; bins[0].set_bin_buff(CBinBuff(ptr, internal_bufs[0].size)); for (uint32 i = 1; i < internal_bufs.size(); ++i) { internal_bufs[i].buf = internal_bufs[i - 1].buf + internal_bufs[i - 1].size; ptr += internal_bufs[i - 1].size; bins[i].set_bin_buff(CBinBuff(ptr, internal_bufs[i].size)); } for (uint32 bin_id = 0; bin_id < desc.size(); ++bin_id) { uint64 kmers_left = std::get<1>(desc[bin_id]) - std::get<0>(desc[bin_id]); if (kmers_left) { uint32 kmers_to_read = (uint32)MIN(kmers_left, max_bin_bytes / rec_size); internal_bufs[bin_id].size = kmers_to_read * rec_size; to_read.push(std::make_tuple(bin_id, 4 + std::get<0>(desc[bin_id]) * rec_size, internal_bufs[bin_id].buf, internal_bufs[bin_id].size)); std::get<0>(desc[bin_id]) += kmers_to_read; } else { --bins_left_to_read; } } } //************************************************************************************************************ // CKmerPQ - Priority Queue of k-mers - binary heap. 
K-mers from bins are processed by this priority queue //************************************************************************************************************ template class CKmerPQ { public: CKmerPQ(uint32 _no_of_bins); inline void init_add(CBin* bin); inline bool get_min(CBundleData& bundle_data); inline void reset(); private: inline void update_heap(); using elem_t = std::pair, uint32>;//kmer, desc_id using desc_t = std::pair*, uint32>;//bin, counter std::vector elems; std::vector descs; uint32 pos, desc_pos; }; //************************************************************************************************************ // CMergerParent - Merger of k-mers produced by CMergerChilds //************************************************************************************************************ template class CMergerParent { public: CMergerParent(std::vector*>& input_queues, CCircularQueue& output_queue); void operator()(); private: std::vector> input_bundles; std::vector*>& input_queues; CBundleData output_bundle; CCircularQueue& output_queue; }; //************************************************************************************************************ // CMergerChild - Merger of k-mers from bins //************************************************************************************************************ template class CMergerChild { using bin_iter = typename std::vector>::iterator; public: CMergerChild(bin_iter begin, bin_iter end, CCircularQueue& output_queue); void operator()(); private: std::vector>> bins; CCircularQueue& output_queue; }; //************************************************************************************************************ // CKMC2DbReaderSorted - Produce k-mers in sorted order from KMC2 database //************************************************************************************************************ template class CKMC2DbReaderSorted { public: CKMC2DbReaderSorted(const CKMC_header& header, const CInputDesc& desc); void 
NextBundle(CBundle& bundle, bool& finished); void IgnoreRest(); ~CKMC2DbReaderSorted(); private: //void get_suf_buf_part(uchar* &buf, uint64 start, uint32 size); const CKMC_header& header; const CInputDesc& desc; uint64* LUTS = nullptr; uint32 lut_size = 0; uint32 suffix_bytes; uint32 record_size; FILE* kmc_suf; friend class CBin; std::vector> bins; CBinBufProvider bin_provider; CSufBinReader* suf_bin_reader; std::thread suf_bin_reader_th; uint32 n_child_threads; CMergerParent* parent; std::thread parent_thread; CCircularQueue output_queue; std::vector*> childs_parent_queues; std::vector*> childs; std::vector childs_threads; //mutable std::mutex mtx; }; //************************************************************************************************************ // CKCM2DbReaderSeqCounter_Base - Base class for classes to access k-mers one by one (not sorted) or // for counters only from KMC2 database //************************************************************************************************************ template class CKCM2DbReaderSeqCounter_Base { protected: CKCM2DbReaderSeqCounter_Base(const CKMC_header& header, const CInputDesc& desc); ~CKCM2DbReaderSeqCounter_Base(); void open_files(); bool reload_suf_buff(); static const uint32 PREFIX_BUFF_BYTES = KMC2_DB_READER_PREFIX_BUFF_BYTES; static const uint32 SUFIX_BUFF_BYTES = KMC2_DB_READER_SUFIX_BUFF_BYTES; const CKMC_header& header; const CInputDesc& desc; uint32 sufix_bytes; uint32 record_size; //of sufix, in bytes uint64 sufix_buff_size, sufix_buff_pos, sufix_left_to_read; uint64 prefix_buff_size, prefix_buff_pos, prefix_left_to_read; uint64 sufix_number; uint32 kmer_bytes, prefix_bytes; uchar* sufix_buff = nullptr; FILE* sufix_file; std::string sufix_file_name; }; //************************************************************************************************************ // CKMC2DbReaderSequential - Produce k-mers sequentialy from KMC2 database (they are not sorted!) 
//************************************************************************************************************ template class CKMC2DbReaderSequential : public CKCM2DbReaderSeqCounter_Base { public: CKMC2DbReaderSequential(const CKMC_header& header, const CInputDesc& desc); bool NextKmerSequential(CKmer& kmer, uint32& counter); ~CKMC2DbReaderSequential(); private: void allocate_buffers(); void reload_pref_buff(); uint32 signle_bin_size, map_size, map_size_bytes, no_of_bins; std::string prefix_file_name; FILE* prefix_file; uint64 current_prefix_index; uint64 prefix_mask; uint64* prefix_buff = nullptr; }; //************************************************************************************************************ // CKMC2DbReaderCountersOnly - Produce counters of k-mers from KMC2 database //************************************************************************************************************ template class CKMC2DbReaderCountersOnly : CKCM2DbReaderSeqCounter_Base { public: CKMC2DbReaderCountersOnly(const CKMC_header& header, const CInputDesc& desc); bool NextCounter(uint32& counter); private: void allocate_buffers(); }; //************************************************************************************************************ // CKMC2DbReader - reader of KMC2 //************************************************************************************************************ template class CKMC2DbReader : public CInput { public: CKMC2DbReader(const CKMC_header& header, const CInputDesc& desc, CPercentProgress& percent_progress, KMCDBOpenMode open_mode); void NextBundle(CBundle& bundle) override; void IgnoreRest() override; bool NextKmerSequential(CKmer& kmer, uint32& counter); bool NextCounter(uint32& counter); private: CPercentProgress& percent_progress; uint32 progress_id; std::unique_ptr> db_reader_sorted; std::unique_ptr> db_reader_sequential; std::unique_ptr> db_reader_counters_only; }; 
/*****************************************************************************************************************************/ /**************************************************** CBin IMPLEMENTATION ****************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ /******************************************************** CONSTRUCTOR ********************************************************/ /*****************************************************************************************************************************/ template CBin::CBin(uint32 bin_id, uint64* LUT, CKMC2DbReaderSorted& kmc2_db) : bin_id(bin_id), LUT(LUT), bin_provider(kmc2_db.bin_provider), kmc2_db(kmc2_db), suffix_bytes(kmc2_db.suffix_bytes), counter_size(kmc2_db.header.counter_size), max_prefix(kmc2_db.lut_size - 1) { kmer_number = LUT[0]; kmer_number_end = LUT[kmc2_db.lut_size]; prefix_bytes = (kmc2_db.header.lut_prefix_len + 3) / 4; kmer_bytes = prefix_bytes + suffix_bytes; record_size = suffix_bytes + counter_size; } /*****************************************************************************************************************************/ /********************************************************** PUBLIC ***********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ template bool CBin::NextKmer(CKmer& kmer, uint32& counter) { while (true) { if (kmer_number >= kmer_number_end) 
return false; if (pos >= bin_buff.size) reload_suf_buf(); //skip empty while (LUT[prefix + 1] <= kmer_number) { ++prefix; } uint32 in_kmer_pos = kmer_bytes - 1; uchar* record = bin_buff.buf + pos; kmer.load(record, suffix_bytes); for (int32 i = prefix_bytes - 1; i >= 0; --i) kmer.set_byte(in_kmer_pos--, uchar(prefix >> (i << 3))); counter = 0; for (int32 i = counter_size - 1; i >= 0; --i) { counter <<= 8; counter += record[i]; } ++kmer_number; pos += record_size; if (counter >= kmc2_db.desc.cutoff_min && counter <= kmc2_db.desc.cutoff_max) return true; } return true; } /*****************************************************************************************************************************/ /********************************************************** PRIVATE **********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ template void CBin::reload_suf_buf() { bin_provider.pop(bin_id, bin_buff); pos = 0; } /*****************************************************************************************************************************/ /************************************************** CKmerPQ IMPLEMENTATION ***************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ /******************************************************** CONSTRUCTOR ********************************************************/ /*****************************************************************************************************************************/ template CKmerPQ::CKmerPQ(uint32 _no_of_bins) { 
elems.resize(_no_of_bins + 1); descs.resize(_no_of_bins + 1); pos = 1; desc_pos = 0; } /*****************************************************************************************************************************/ /********************************************************** PUBLIC ***********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ template void CKmerPQ::reset() { pos = 1; desc_pos = 0; } /*****************************************************************************************************************************/ template inline bool CKmerPQ::get_min(CBundleData& bundle_data) { if (pos <= 1) return false; bundle_data.Insert(elems[1].first, descs[elems[1].second].second); update_heap(); return true; } /*****************************************************************************************************************************/ template inline void CKmerPQ::init_add(CBin* bin) { CKmer kmer; uint32 counter; if (bin->NextKmer(kmer, counter)) { descs[desc_pos] = std::make_pair(bin, counter); elems[pos] = std::make_pair(kmer, desc_pos); uint32 child_pos = pos++; while (child_pos > 1 && elems[child_pos].first < elems[child_pos / 2].first) { swap(elems[child_pos], elems[child_pos / 2]); child_pos /= 2; } ++desc_pos; } } /*****************************************************************************************************************************/ /********************************************************** PRIVATE **********************************************************/ /*****************************************************************************************************************************/ 
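/*
Illustration only, not part of the KMC sources. CKmerPQ keeps its binary heap in a vector whose slot 0 is unused,
so the root sits at index 1 and the parent of slot i is slot i / 2. init_add() places the new element in the
first free slot and swaps it upwards while it is smaller than its parent. The sketch below shows the same
sift-up on plain integers. MinHeap1 is a hypothetical name, and CKmerPQ itself stores (k-mer, descriptor id)
pairs rather than ints.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical simplified heap over ints with 1-based indexing.
struct MinHeap1
{
    std::vector<int> elems;   // slot 0 is unused, the root is elems[1]
    uint32_t pos = 1;         // next free slot

    explicit MinHeap1(uint32_t capacity) : elems(capacity + 1) {}

    // Assumes fewer than `capacity` elements are currently stored.
    void add(int value)
    {
        elems[pos] = value;
        uint32_t child = pos++;
        // Sift up: swap with the parent while the new element is smaller.
        while (child > 1 && elems[child] < elems[child / 2])
        {
            std::swap(elems[child], elems[child / 2]);
            child /= 2;
        }
    }

    int min() const { return elems[1]; }
};
```

Leaving slot 0 empty is what makes the parent and child index arithmetic this simple, at the cost of one unused
element.
*/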
/*****************************************************************************************************************************/ template inline void CKmerPQ::update_heap() { uint32 desc_id = elems[1].second; CBin* bin = descs[desc_id].first; CKmer kmer; uint32 counter; if (!bin->NextKmer(kmer, counter)) { kmer.set(elems[--pos].first); desc_id = elems[pos].second; } else descs[desc_id].second = counter; uint32 parent, less; parent = less = 1; while (true) { if (parent * 2 >= pos) break; if (parent * 2 + 1 >= pos) less = parent * 2; else if (elems[parent * 2].first < elems[parent * 2 + 1].first) less = parent * 2; else less = parent * 2 + 1; if (elems[less].first < kmer) { elems[parent] = elems[less]; parent = less; } else break; } elems[parent] = std::make_pair(kmer, desc_id); } /*****************************************************************************************************************************/ /*********************************************** CMergerParent IMPLEMENTATION ************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ /******************************************************** CONSTRUCTOR ********************************************************/ /*****************************************************************************************************************************/ template CMergerParent::CMergerParent(std::vector*>& input_queues, CCircularQueue& output_queue) : input_queues(input_queues), output_queue(output_queue) { input_bundles.resize(input_queues.size()); } /*****************************************************************************************************************************/ /********************************************************** PUBLIC 
***********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ template void CMergerParent::operator()() { //init //for (uint32 i = 0; i < input_queues.size(); ++i) auto q_iter = input_queues.begin(); auto b_iter = input_bundles.begin(); for (; q_iter != input_queues.end();) { if (!(*q_iter)->pop(*b_iter)) { q_iter = input_queues.erase(q_iter); b_iter = input_bundles.erase(b_iter); } else ++q_iter, ++b_iter; } //run uint32 index_of_min = 0; while (input_bundles.size()) { index_of_min = 0; for (uint32 i = 1; i < input_bundles.size(); ++i) { if (input_bundles[i].TopKmer() < input_bundles[index_of_min].TopKmer()) index_of_min = i; } output_bundle.Insert(input_bundles[index_of_min].TopKmer(), input_bundles[index_of_min].TopCounter()); input_bundles[index_of_min].Pop(); if (input_bundles[index_of_min].Empty()) { if (!input_queues[index_of_min]->pop(input_bundles[index_of_min])) { input_queues.erase(input_queues.begin() + index_of_min); input_bundles.erase(input_bundles.begin() + index_of_min); } } if (output_bundle.Full()) { if (!output_queue.push(output_bundle)) break; } } if (!output_bundle.Empty()) output_queue.push(output_bundle); output_queue.mark_completed(); } /*****************************************************************************************************************************/ /************************************************ CMergerChild IMPLEMENTATION ************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ 
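/*
Illustration only, not part of the KMC sources. CMergerParent::operator()() merges already sorted inputs by
scanning the current head of every input for the minimum, emitting it, and dropping an input once it is
exhausted. The sketch below applies the same linear-scan merge to sorted vectors of ints. The name
merge_by_linear_scan is hypothetical, and the real code pulls bundles from queues instead of indexing vectors.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical simplified merge: every inner vector must already be sorted in ascending order.
inline std::vector<int> merge_by_linear_scan(std::vector<std::vector<int>> inputs)
{
    // Init: drop inputs that have nothing to offer.
    for (auto it = inputs.begin(); it != inputs.end();)
    {
        if (it->empty())
            it = inputs.erase(it);
        else
            ++it;
    }
    std::vector<std::size_t> heads(inputs.size(), 0);   // read position in each input

    std::vector<int> out;
    while (!inputs.empty())
    {
        // Linear scan over the current heads to find the smallest one.
        std::size_t index_of_min = 0;
        for (std::size_t i = 1; i < inputs.size(); ++i)
            if (inputs[i][heads[i]] < inputs[index_of_min][heads[index_of_min]])
                index_of_min = i;

        out.push_back(inputs[index_of_min][heads[index_of_min]]);

        // Remove the input once its last element has been consumed.
        if (++heads[index_of_min] == inputs[index_of_min].size())
        {
            inputs.erase(inputs.begin() + index_of_min);
            heads.erase(heads.begin() + index_of_min);
        }
    }
    return out;
}
```

A linear scan costs one comparison per input for every emitted element, which is reasonable when the number of
inputs is small; a heap such as CKmerPQ is the usual choice when there are many inputs.
*/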
/******************************************************** CONSTRUCTOR ********************************************************/ /*****************************************************************************************************************************/ template CMergerChild::CMergerChild(bin_iter begin, bin_iter end, CCircularQueue& output_queue) : bins(begin, end), output_queue(output_queue) { } /*****************************************************************************************************************************/ /********************************************************** PUBLIC ***********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ template void CMergerChild::operator()() { CKmerPQ kmers_pq(static_cast(bins.size())); for (uint32 i = 0; i < bins.size(); ++i) kmers_pq.init_add(&bins[i].get()); CBundleData bundle_data; while (kmers_pq.get_min(bundle_data)) { if (bundle_data.Full()) { if (!output_queue.push(bundle_data)) break; } } if (!bundle_data.Empty()) output_queue.push(bundle_data); output_queue.mark_completed(); } /*****************************************************************************************************************************/ /********************************************* CKMC2DbReaderSorted IMPLEMENTATION ********************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ /******************************************************** CONSTRUCTOR ********************************************************/ 
/*****************************************************************************************************************************/ template CKMC2DbReaderSorted::CKMC2DbReaderSorted(const CKMC_header& header, const CInputDesc& desc) : header(header), desc(desc), output_queue(DEFAULT_CIRCULAL_QUEUE_CAPACITY) { LUTS = nullptr; lut_size = 1 << 2 * header.lut_prefix_len; uint32 lut_recs = (1 << 2 * header.lut_prefix_len) * header.no_of_bins + 1; LUTS = new uint64[lut_recs]; suffix_bytes = (header.kmer_len - header.lut_prefix_len) / 4; record_size = suffix_bytes + header.counter_size; if (!LUTS) { std::cout << "cannot allocate memory for LUTS of KMC2 database\n"; exit(1); } std::string kmc_pre_file_name = desc.file_src + ".kmc_pre"; FILE* kmc_pre = fopen(kmc_pre_file_name.c_str(), "rb"); if (!kmc_pre) { std::cout << "Cannot open kmc2 prefix file to read LUTS"; exit(1); } my_fseek(kmc_pre, 4, SEEK_SET); if (fread(LUTS, sizeof(uint64), lut_recs, kmc_pre) != lut_recs) { std::cout << "Some error occured while reading LUTS from kmc2 prefix file \n"; exit(1); } fclose(kmc_pre); std::string kmc_suf_file_name = desc.file_src + ".kmc_suf"; kmc_suf = fopen(kmc_suf_file_name.c_str(), "rb"); if (!kmc_suf) { std::cout << "Cannot open kmc2 suffix file\n"; exit(1); } setvbuf(kmc_suf, NULL, _IONBF, 0); bins.reserve(header.no_of_bins); for (uint32 i = 0; i < header.no_of_bins; ++i) bins.emplace_back(i, LUTS + i * lut_size, *this); //starting threads bin_provider.init(bins); suf_bin_reader = new CSufBinReader(bin_provider, kmc_suf); suf_bin_reader_th = std::thread(std::ref(*suf_bin_reader)); n_child_threads = desc.threads; childs_parent_queues.reserve(n_child_threads); for (uint32 i = 0; i < n_child_threads; ++i) childs_parent_queues.push_back(new CCircularQueue(DEFAULT_CIRCULAL_QUEUE_CAPACITY)); uint32 bins_per_thread = header.no_of_bins / n_child_threads; for (uint32 i = 0; i < n_child_threads - 1; ++i) { childs.push_back(new CMergerChild(bins.begin() + i * bins_per_thread, bins.begin() + 
(i + 1) * bins_per_thread, *childs_parent_queues[i]));
        childs_threads.push_back(std::thread(std::ref(*childs.back())));
    }

    //last one
    childs.push_back(new CMergerChild<SIZE>(bins.begin() + (n_child_threads - 1) * bins_per_thread, bins.end(), *childs_parent_queues.back()));
    childs_threads.push_back(std::thread(std::ref(*childs.back())));

    parent = new CMergerParent<SIZE>(childs_parent_queues, output_queue);
    parent_thread = std::thread(std::ref(*parent));
}

/*****************************************************************************************************************************/
/********************************************************** PUBLIC ***********************************************************/
/*****************************************************************************************************************************/

/*****************************************************************************************************************************/
template<unsigned SIZE> void CKMC2DbReaderSorted<SIZE>::NextBundle(CBundle<SIZE>& bundle, bool& finished)
{
    if (output_queue.pop(bundle.Data()))
    {
        return;
    }

    //the output queue is completed: join all worker threads and release their resources
    for (auto& child_thread : childs_threads)
        child_thread.join();
    for (auto& child : childs)
        delete child;

    parent_thread.join();
    delete parent;

    for (auto& q : childs_parent_queues)
        delete q;

    suf_bin_reader_th.join();
    delete suf_bin_reader;

    //no more bundles will be produced, so report the end of input to the caller
    finished = true;
}

/*****************************************************************************************************************************/
template<unsigned SIZE> void CKMC2DbReaderSorted<SIZE>::IgnoreRest()
{
    //unblock producers first, then join and release in the same order as in NextBundle
    output_queue.force_finish();

    for (auto& q : childs_parent_queues)
        q->force_finish();

    for (auto& child_thread : childs_threads)
        child_thread.join();
    for (auto& child : childs)
        delete child;

    parent_thread.join();
    delete parent;

    for (auto& q : childs_parent_queues)
        delete q;

    bin_provider.force_to_finish();

    suf_bin_reader_th.join();
    delete suf_bin_reader;
}

/*****************************************************************************************************************************/
template<unsigned SIZE> 
CKMC2DbReaderSorted<SIZE>::~CKMC2DbReaderSorted()
{
    delete[] LUTS;
    fclose(kmc_suf);
}

/*****************************************************************************************************************************/
/********************************************************** PRIVATE **********************************************************/
/*****************************************************************************************************************************/

/*****************************************************************************************************************************/
//template<unsigned SIZE> void CKMC2DbReaderSorted<SIZE>::get_suf_buf_part(uchar* &buf, uint64 start, uint32 size)
//{
//#ifdef ENABLE_LOGGER
//    CTimer timer;
//    timer.start();
//#endif
//    std::unique_lock<std::mutex> lck(mtx);
//#ifdef ENABLE_LOGGER
//    CLoger::GetLogger().log_operation("waiting for lock", this, timer.get_time());
//    timer.start();
//#endif
//    start = 4 + start * record_size;
//    size *= record_size;
//
//
//    my_fseek(kmc_suf, start, SEEK_SET);
//    if (fread(buf, 1, size, kmc_suf) != size)
//    {
//        std::cout << "Error: some error occurred while reading " << desc.file_src << ".kmc_suf file\n";
//        exit(1);
//    }
//#ifdef ENABLE_LOGGER
//    CLoger::GetLogger().log_operation("fread time", this, timer.get_time());
//#endif
//}

/*****************************************************************************************************************************/
/******************************************* CKCM2DbReaderSeqCounter_Base IMPLEMENTATION *************************************/
/*****************************************************************************************************************************/

/*****************************************************************************************************************************/
/******************************************************** CONSTRUCTOR ********************************************************/
/*****************************************************************************************************************************/ template CKCM2DbReaderSeqCounter_Base::CKCM2DbReaderSeqCounter_Base(const CKMC_header& header, const CInputDesc& desc) : header(header), desc(desc) { sufix_bytes = (header.kmer_len - header.lut_prefix_len) / 4; record_size = sufix_bytes + header.counter_size; sufix_buff_size = SUFIX_BUFF_BYTES / record_size * record_size; sufix_left_to_read = header.total_kmers * record_size; if (sufix_left_to_read < sufix_buff_size) sufix_buff_size = sufix_left_to_read; prefix_bytes = (header.lut_prefix_len + 3) / 4; kmer_bytes = prefix_bytes + sufix_bytes; } /*****************************************************************************************************************************/ /********************************************************* PROTECTED**********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ template CKCM2DbReaderSeqCounter_Base::~CKCM2DbReaderSeqCounter_Base() { if (sufix_file) fclose(sufix_file); delete[] sufix_buff; } /*****************************************************************************************************************************/ template void CKCM2DbReaderSeqCounter_Base::open_files() { sufix_file_name = desc.file_src + ".kmc_suf"; sufix_file = fopen(sufix_file_name.c_str(), "rb"); if (!sufix_file) { std::cout << "Error: cannot open file: " << sufix_file_name << "\n"; exit(1); } setvbuf(sufix_file, NULL, _IONBF, 0); char marker[4]; if (fread(marker, 1, 4, sufix_file) != 4) { std::cout << "Error: while reading start marker in file: " << sufix_file_name << "\n"; exit(1); } if (strncmp(marker, "KMCS", 4) != 0) { std::cout << "Error: wrong start marker in file: " << 
sufix_file_name << "\n"; exit(1); } my_fseek(sufix_file, -4, SEEK_END); if (fread(marker, 1, 4, sufix_file) != 4) { std::cout << "Error: while reading end marker in file: " << sufix_file_name << "\n"; exit(1); } if (strncmp(marker, "KMCS", 4) != 0) { std::cout << "Error: wrong end marker in file: " << sufix_file_name << "\n"; exit(1); } my_fseek(sufix_file, 4, SEEK_SET); //skip KMCS } /*****************************************************************************************************************************/ template bool CKCM2DbReaderSeqCounter_Base::reload_suf_buff() { uint64 to_read = MIN(sufix_left_to_read, sufix_buff_size); if (to_read == 0) return false; uint64 readed = fread(sufix_buff, 1, to_read, sufix_file); if (readed != to_read) { std::cout << "Error: some error while reading " << sufix_file_name << "\n"; exit(1); } sufix_buff_pos = 0; sufix_left_to_read -= to_read; return true; } /*****************************************************************************************************************************/ /********************************************* CKMC2DbReaderSequential IMPLEMENTATION ****************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ /******************************************************** CONSTRUCTOR ********************************************************/ /*****************************************************************************************************************************/ template CKMC2DbReaderSequential::CKMC2DbReaderSequential(const CKMC_header& header, const CInputDesc& desc) : CKCM2DbReaderSeqCounter_Base(header, desc) { this->open_files(); prefix_file_name = desc.file_src + ".kmc_pre"; prefix_file = fopen(prefix_file_name.c_str(), "rb"); if (!prefix_file) { std::cout << "Error: cannot 
open file: " << prefix_file_name << "\n"; exit(1); } setvbuf(prefix_file, NULL, _IONBF, 0); my_fseek(prefix_file, 4 + sizeof(uint64), SEEK_SET);//skip KMCP and first value as it must be 0 signle_bin_size = 1 << 2 * header.lut_prefix_len; map_size = (1 << 2 * header.signature_len) + 1; map_size_bytes = map_size * sizeof(uint32); no_of_bins = header.no_of_bins; this->prefix_buff_size = this->PREFIX_BUFF_BYTES / sizeof(uint64); this->sufix_left_to_read = this->header.total_kmers * this->record_size; if (this->sufix_left_to_read < this->sufix_buff_size) this->sufix_buff_size = this->sufix_left_to_read; this->prefix_left_to_read = (1 << this->header.lut_prefix_len * 2) * this->no_of_bins; if (this->prefix_left_to_read < this->prefix_buff_size) this->prefix_buff_size = this->prefix_left_to_read; prefix_mask = (1 << 2 * this->header.lut_prefix_len) - 1; allocate_buffers(); my_fseek(prefix_file, 4 + sizeof(uint64), SEEK_SET); reload_pref_buff(); this->reload_suf_buff(); current_prefix_index = 0; this->sufix_number = 0; } /*****************************************************************************************************************************/ /********************************************************** PUBLIC ***********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ template bool CKMC2DbReaderSequential::NextKmerSequential(CKmer& kmer, uint32& counter) { while (true) { if (this->sufix_number >= this->header.total_kmers) return false; while (this->prefix_buff[this->prefix_buff_pos] <= this->sufix_number) { ++current_prefix_index; ++this->prefix_buff_pos; if (this->prefix_buff_pos >= this->prefix_buff_size) this->reload_pref_buff(); } uchar* record = this->sufix_buff + this->sufix_buff_pos; uint32 pos = 
this->kmer_bytes - 1; uint32 current_prefix = static_cast(current_prefix_index & prefix_mask); kmer.load(record, this->sufix_bytes); for (int32 i = this->prefix_bytes - 1; i >= 0; --i) kmer.set_byte(pos--, current_prefix >> (i << 3)); counter = 0; for (int32 i = this->header.counter_size - 1; i >= 0; --i) { counter <<= 8; counter += record[i]; } ++this->sufix_number; this->sufix_buff_pos += this->record_size; if (this->sufix_buff_pos >= this->sufix_buff_size) this->reload_suf_buff(); if (counter >= this->desc.cutoff_min && counter <= this->desc.cutoff_max) return true; } } /*****************************************************************************************************************************/ template CKMC2DbReaderSequential::~CKMC2DbReaderSequential() { if (prefix_file) fclose(prefix_file); delete[] prefix_buff; } /*****************************************************************************************************************************/ /********************************************************** PRIVATE **********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ template void CKMC2DbReaderSequential::allocate_buffers() { this->sufix_buff = new uchar[this->sufix_buff_size]; this->prefix_buff = new uint64[this->prefix_buff_size]; } /*****************************************************************************************************************************/ template void CKMC2DbReaderSequential::reload_pref_buff() { uint64 to_read = MIN(this->prefix_left_to_read, this->prefix_buff_size); this->prefix_buff_pos = 0; if (fread(prefix_buff, sizeof(uint64), to_read, prefix_file) != to_read) { std::cout << "Error: some error while reading " << prefix_file_name << "\n"; exit(1); } this->prefix_left_to_read -= 
to_read; if (to_read < this->prefix_buff_size) { this->prefix_buff[to_read] = this->header.total_kmers;//guard } } /*****************************************************************************************************************************/ /******************************************** CKMC2DbReaderCountersOnly IMPLEMENTATION ***************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ /******************************************************** CONSTRUCTOR ********************************************************/ /*****************************************************************************************************************************/ template CKMC2DbReaderCountersOnly::CKMC2DbReaderCountersOnly(const CKMC_header& header, const CInputDesc& desc) : CKCM2DbReaderSeqCounter_Base(header, desc) { this->open_files(); allocate_buffers(); this->reload_suf_buff(); this->sufix_number = 0; } /*****************************************************************************************************************************/ /********************************************************** PUBLIC ***********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ template bool CKMC2DbReaderCountersOnly::NextCounter(uint32& counter) { while (true) { if (this->sufix_number >= this->header.total_kmers) return false; uchar* record = this->sufix_buff + this->sufix_buff_pos + this->sufix_bytes; counter = 0; for (int32 i = this->header.counter_size - 1; i >= 0; --i) { counter <<= 8; counter += record[i]; } 
++this->sufix_number; this->sufix_buff_pos += this->record_size; if (this->sufix_buff_pos >= this->sufix_buff_size) this->reload_suf_buff(); if (counter >= this->desc.cutoff_min && counter <= this->desc.cutoff_max) return true; } } /*****************************************************************************************************************************/ /********************************************************** PRIVATE **********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ template void CKMC2DbReaderCountersOnly::allocate_buffers() { this->sufix_buff = new uchar[this->sufix_buff_size]; } /*****************************************************************************************************************************/ /************************************************* CKMC2DbReader IMPLEMENTATION **********************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ /******************************************************** CONSTRUCTOR ********************************************************/ /*****************************************************************************************************************************/ template CKMC2DbReader::CKMC2DbReader(const CKMC_header& header, const CInputDesc& desc, CPercentProgress& percent_progress, KMCDBOpenMode open_mode) : percent_progress(percent_progress) { progress_id = percent_progress.RegisterItem(header.total_kmers); switch (open_mode) { case KMCDBOpenMode::sorted: db_reader_sorted = std::make_unique >(header, desc); break; case 
KMCDBOpenMode::sequential:
        db_reader_sequential = std::make_unique<CKMC2DbReaderSequential<SIZE>>(header, desc);
        break;
    case KMCDBOpenMode::counters_only:
        db_reader_counters_only = std::make_unique<CKMC2DbReaderCountersOnly<SIZE>>(header, desc);
        break;
    default: //should never be here
        std::cout << "Error: unknown open mode\n";
        exit(1);
    }
}

/*****************************************************************************************************************************/
/********************************************************** PUBLIC ***********************************************************/
/*****************************************************************************************************************************/

/*****************************************************************************************************************************/
template<unsigned SIZE> void CKMC2DbReader<SIZE>::NextBundle(CBundle<SIZE>& bundle)
{
#ifdef ENABLE_LOGGER
    CTimer timer;
    timer.start();
#endif
    db_reader_sorted->NextBundle(bundle, this->finished);
    percent_progress.UpdateItem(progress_id, bundle.Size());
    if (this->finished)
    {
        percent_progress.Complete(progress_id);
    }
#ifdef ENABLE_LOGGER
    CLoger::GetLogger().log_operation("fetching bundle from input", this, timer.get_time());
#endif
}

/*****************************************************************************************************************************/
template<unsigned SIZE> void CKMC2DbReader<SIZE>::IgnoreRest()
{
    db_reader_sorted->IgnoreRest();
}

/*****************************************************************************************************************************/
template<unsigned SIZE> bool CKMC2DbReader<SIZE>::NextKmerSequential(CKmer<SIZE>& kmer, uint32& counter)
{
    if (db_reader_sequential->NextKmerSequential(kmer, counter))
    {
        percent_progress.UpdateItem(progress_id);
        return true;
    }
    percent_progress.Complete(progress_id);
    return false;
}

/*****************************************************************************************************************************/
template<unsigned SIZE> bool CKMC2DbReader<SIZE>::NextCounter(uint32& counter)
{
    if 
(db_reader_counters_only->NextCounter(counter)) { percent_progress.UpdateItem(progress_id); return true; } percent_progress.Complete(progress_id); return false; } #endifKMC-2.3/kmc_tools/kmc_header.cpp000066400000000000000000000044601257432033000165530ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "stdafx.h" #include "kmc_header.h" #include /*****************************************************************************************************************************/ /******************************************************** CONSTRUCTOR ********************************************************/ /*****************************************************************************************************************************/ CKMC_header::CKMC_header(std::string file_name) { file_name += ".kmc_pre"; FILE* file = my_fopen(file_name.c_str(), "rb"); if (!file) { std::cout << "Error: Cannot open file " << file_name << "\n"; exit(1); } char marker[4]; if (fread(marker, 1, 4, file) != 4) { std::cout << "Error while reading start marker in " << file_name << "\n"; exit(1); } if (strncmp(marker, "KMCP", 4) != 0) { std::cout << "Error: wrong start marker in " << file_name << "\n"; exit(1); } my_fseek(file, -4, SEEK_END); if (fread(marker, 1, 4, file) != 4) { std::cout << "Error while reading end marker in " << file_name << "\n"; exit(1); } if (strncmp(marker, "KMCP", 4) != 0) { std::cout << "Error: wrong end marker in " << file_name << "\n"; exit(1); } my_fseek(file, 0, SEEK_END); file_size = my_ftell(file); my_fseek(file, -8, SEEK_END); load_uint(file, header_offset); my_fseek(file, -12, SEEK_END); load_uint(file, db_version); my_fseek(file, 0LL - (header_offset + 8), SEEK_END); load_uint(file, kmer_len); load_uint(file, mode); load_uint(file, counter_size); load_uint(file, lut_prefix_len); if 
(IsKMC2())
        load_uint(file, signature_len); //the signature length is stored only in KMC2 databases
    load_uint(file, min_count);
    load_uint(file, max_count);
    load_uint(file, total_kmers);
    uchar both_s_tmp;
    load_uint(file, both_s_tmp);
    //the value read from the file is negated to obtain both_strands
    both_strands = both_s_tmp == 1;
    both_strands = !both_strands;

    fclose(file);

    if (IsKMC2())
    {
        uint32 single_lut_size = (1ull << (2 * lut_prefix_len)) * sizeof(uint64);
        uint32 map_size = ((1 << 2 * signature_len) + 1) * sizeof(uint32);
        no_of_bins = (uint32)((file_size - sizeof(uint64) - 12 - header_offset - map_size) / single_lut_size);
    }
}

// ***** EOF
KMC-2.3/kmc_tools/kmc_header.h000066400000000000000000000023341257432033000162160ustar00rootroot00000000000000
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  Authors: Marek Kokot

  Version: 2.3.0
  Date   : 2015-08-21
*/

#ifndef _KMC_HEADER_H
#define _KMC_HEADER_H
#include "defs.h"
#include <cstdio> //FILE, getc
#include <string> //std::string

//************************************************************************************************************
// CKMC_header - represents header of KMC database.
//************************************************************************************************************
struct CKMC_header
{
public:
    uint32 kmer_len = 0;
    uint32 mode = 0;
    uint32 counter_size = 0;
    uint32 lut_prefix_len = 0;
    uint32 signature_len = 0; //only for kmc2
    uint32 min_count = 0;
    uint32 max_count = 0;
    uint64 total_kmers = 0;
    bool both_strands = true;
    uint32 db_version = 0;
    uint32 header_offset = 0;
    uint64 file_size = 0;

    uint32 no_of_bins = 0; //only for kmc2
    bool IsKMC2()
    {
        return db_version == 0x200;
    }
    CKMC_header(std::string file_name);

private:
    //reads sizeof(T) bytes from file as a little-endian unsigned integer
    template<typename T> void load_uint(FILE* file, T& res)
    {
        res = 0;
        for (uint32 i = 0; i < sizeof(T); ++i)
            res += (T)getc(file) << (i << 3);
    }
};

#endif

// ***** EOF
KMC-2.3/kmc_tools/kmc_tools.cpp000066400000000000000000000217411257432033000164640ustar00rootroot00000000000000
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  Authors: Marek Kokot

  Version: 2.3.0
  Date   : 2015-08-21
*/

#include "stdafx.h"
//standard headers used directly by the code in this file
#include <algorithm>
#include <functional>
#include <iostream>
#include <memory>
#include <string>
#include <thread>
#include <vector>
#include "config.h"
#include "parser.h"
#include "timer.h"
#include "kmc1_db_reader.h"
#include "kmc2_db_reader.h"
#include "kmc1_db_writer.h"
#include "parameters_parser.h"
#include "histogram_writer.h"
#include "dump_writer.h"
#include "fastq_reader.h"
#include "fastq_filter.h"
#include "fastq_writer.h"
#ifdef ENABLE_LOGGER
#include "develop.h"
#endif
using namespace std;

template<unsigned SIZE> class CTools
{
    CParametersParser& parameters_parser;
    CConfig& config;

    bool histo()
    {
        if (!config.headers.front().IsKMC2()) //KMC1
        {
            CKMC1DbReader<SIZE> kmcdb(config.headers.front(), config.input_desc.front(), CConfig::GetInstance().percent_progress, KMCDBOpenMode::counters_only);
            CHistogramWriter<CKMC1DbReader<SIZE>> writer(kmcdb);
            return writer.Process();
        }
        else //KMC2
        {
            CKMC2DbReader<SIZE> kmcdb(config.headers.front(), config.input_desc.front(), CConfig::GetInstance().percent_progress, KMCDBOpenMode::counters_only);
            CHistogramWriter<CKMC2DbReader<SIZE>> writer(kmcdb);
            return writer.Process();
        }
    }

    bool dump()
    {
        if (!config.headers.front().IsKMC2()) //KMC1 - input is sorted
        {
            CKMCDBForDump<CKMC1DbReader<SIZE>, SIZE, true> kmcdb_wrapper;
            CDumpWriter writer(kmcdb_wrapper);
            return writer.Process();
        }
        else //KMC2
        {
            if (config.dump_params.sorted_output)
            {
                CKMCDBForDump<CKMC2DbReader<SIZE>, SIZE, true> kmcdb_wrapper;
                CDumpWriter writer(kmcdb_wrapper);
                return writer.Process();
            }
            else
            {
                CKMCDBForDump<CKMC2DbReader<SIZE>, SIZE, false> kmcdb_wrapper;
                CDumpWriter writer(kmcdb_wrapper);
                return writer.Process();
            }
        }
        return true;
    }

    bool filter()
    {
        CFilteringParams& filtering_params = config.filtering_params;
        CFilteringQueues filtering_queues;

        //set parameters and queues
        int32 avaiable_threads = config.avaiable_threads;
        filtering_params.n_readers = max(1, avaiable_threads / 2);

        bool gz_bz2 = false;
        vector<uint64> file_sizes;
        for (auto& p : filtering_params.input_srcs)
        {
            //compare whole suffixes so that both the 3-character ".gz" and the 4-character ".bz2" can match
            //and names shorter than the suffix are handled safely
            auto has_suffix = [&p](const string& suffix)
            {
                return p.size() >= suffix.size() && p.compare(p.size() - suffix.size(), suffix.size(), suffix) == 0;
            };
            if (has_suffix(".gz") || has_suffix(".bz2"))
            {
                gz_bz2 = true;
            }
            FILE* 
tmp = my_fopen(p.c_str(), "rb"); if (!tmp) { cout << "Cannot open file: " << p.c_str(); exit(1); } my_fseek(tmp, 0, SEEK_END); file_sizes.push_back(my_ftell(tmp)); fclose(tmp); } if (gz_bz2) { sort(file_sizes.begin(), file_sizes.end(), greater()); uint64 file_size_threshold = (uint64)(file_sizes.front() * 0.05); int32 n_allowed_files = 0; for (auto& p : file_sizes) if (p > file_size_threshold) ++n_allowed_files; filtering_params.n_readers = MIN(n_allowed_files, MAX(1, avaiable_threads / 2)); } else filtering_params.n_readers = 1; avaiable_threads -= filtering_params.n_readers; filtering_params.n_filters = max(1, avaiable_threads); filtering_params.fastq_buffer_size = 1 << 25; filtering_params.mem_part_pmm_fastq_reader = filtering_params.fastq_buffer_size + CFastqReader::OVERHEAD_SIZE; filtering_params.mem_tot_pmm_fastq_reader = filtering_params.mem_part_pmm_fastq_reader * (filtering_params.n_readers + 48); filtering_params.mem_part_pmm_fastq_filter = filtering_params.mem_part_pmm_fastq_reader; filtering_params.mem_tot_pmm_fastq_filter = filtering_params.mem_part_pmm_fastq_filter * (filtering_params.n_filters + 48); filtering_queues.input_files_queue = new CInputFilesQueue(filtering_params.input_srcs); filtering_queues.input_part_queue = new CPartQueue(filtering_params.n_readers); filtering_queues.filtered_part_queue = new CPartQueue(filtering_params.n_filters); filtering_queues.pmm_fastq_reader = new CMemoryPool(filtering_params.mem_tot_pmm_fastq_reader, filtering_params.mem_part_pmm_fastq_reader); filtering_queues.pmm_fastq_filter = new CMemoryPool(filtering_params.mem_tot_pmm_fastq_filter, filtering_params.mem_part_pmm_fastq_filter); filtering_params.kmer_len = config.headers.front().kmer_len; vector readers_ths; vector filters_ths; vector> filters; vector> readers; CKMCFile kmc_api; if (!kmc_api.OpenForRA(config.input_desc.front().file_src)) { cout << "Error: cannot open: " << config.input_desc.front().file_src << " by KMC API\n"; exit(1); } 
kmc_api.SetMinCount(config.input_desc.front().cutoff_min); kmc_api.SetMaxCount(config.input_desc.front().cutoff_max); CWFastqWriter writer(filtering_params, filtering_queues); thread writer_th(writer); for (uint32 i = 0; i < filtering_params.n_filters; ++i) { filters.push_back(make_unique(filtering_params, filtering_queues, kmc_api)); filters_ths.emplace_back(ref(*filters.back())); } for (uint32 i = 0; i < filtering_params.n_readers; ++i) { readers.push_back(make_unique(filtering_params, filtering_queues)); readers_ths.emplace_back(ref(*readers.back())); } writer_th.join(); for (auto& thread : filters_ths) thread.join(); filters.clear(); for (auto& thread : readers_ths) thread.join(); readers.clear(); delete filtering_queues.input_part_queue; delete filtering_queues.pmm_fastq_reader; delete filtering_queues.pmm_fastq_filter; delete filtering_queues.input_files_queue; delete filtering_queues.filtered_part_queue; return true; } public: CTools(CParametersParser& parameters_parser) : parameters_parser(parameters_parser), config(CConfig::GetInstance()) { } bool Process() { if (config.mode == CConfig::Mode::FILTER) { return filter(); } if (config.mode == CConfig::Mode::HISTOGRAM) { return histo(); } else if (config.mode == CConfig::Mode::DUMP) { return dump(); } else if (config.mode == CConfig::Mode::COMPARE) { CInput *db1, *db2; if (!config.headers[0].IsKMC2()) db1 = new CKMC1DbReader(config.headers[0], config.input_desc[0], CConfig::GetInstance().percent_progress, KMCDBOpenMode::sorted); else db1 = new CKMC2DbReader(config.headers[0], config.input_desc[0], CConfig::GetInstance().percent_progress, KMCDBOpenMode::sorted); if (!config.headers[1].IsKMC2()) db2 = new CKMC1DbReader(config.headers[1], config.input_desc[1], CConfig::GetInstance().percent_progress, KMCDBOpenMode::sorted); else db2 = new CKMC2DbReader(config.headers[1], config.input_desc[1], CConfig::GetInstance().percent_progress, KMCDBOpenMode::sorted); CBundle input1(db1), input2(db2); CComparer 
comparer(&input1, &input2); bool res = comparer.Equals(); delete db1; delete db2; std::cout << "\n"; if (res) { cout << "DB Equals\n"; exit(0); } else { cout << "DB Differs\n"; exit(1); } } else { CExpressionNode* expression_root = parameters_parser.GetExpressionRoot(); auto t = expression_root->GetExecutionRoot(); delete expression_root; CKMC1DbWriter writer(t); writer.Process(); delete t; return true; } return false; } }; template class CApplication { CApplication* app_1; CTools* tools; bool is_selected; CConfig& config; CParametersParser& parameter_parser; public: CApplication(CParametersParser& parameter_parser) : config(CConfig::GetInstance()), parameter_parser(parameter_parser) { is_selected = config.kmer_len <= (int32)SIZE * 32 && config.kmer_len > ((int32)SIZE - 1) * 32; app_1 = new CApplication(parameter_parser); if (is_selected) { tools = new CTools(parameter_parser); } else { tools = nullptr; } } ~CApplication() { delete app_1; if (is_selected) delete tools; } bool Process() { if (is_selected) return tools->Process(); else return app_1->Process(); } }; template<> class CApplication<1> { CTools<1>* tools; CConfig& config; CParametersParser& parameter_parser; bool is_selected; public: CApplication(CParametersParser& parameter_parser) : config(CConfig::GetInstance()), parameter_parser(parameter_parser) { is_selected = config.kmer_len <= 32; if (is_selected) tools = new CTools<1>(parameter_parser); else tools = nullptr; } ~CApplication<1>() { if (tools) delete tools; } bool Process() { if (is_selected) { return tools->Process(); } return false; } }; int main(int argc, char**argv) { #ifdef ENABLE_LOGGER CTimer timer; timer.start(); #endif CParametersParser params_parser(argc, argv); params_parser.Parse(); if (params_parser.validate_input_dbs()) { params_parser.SetThreads(); CApplication app(params_parser); app.Process(); } #ifdef ENABLE_LOGGER cout << "RUN TIME: " << timer.get_time() <<"ms\n\n"; CLoger::GetLogger().print_stats(); #endif } // ***** 
EOFKMC-2.3/kmc_tools/kmc_tools.vcxproj000066400000000000000000000242321257432033000173730ustar00rootroot00000000000000 Debug Win32 Debug x64 Release Win32 Release x64 {F3B0CC94-9DD0-4642-891C-EA08BDA50260} Win32Proj kmc_tools Application true v120 Unicode Application true v120 Unicode Application false v120 true Unicode Application false v120 true Unicode true true false false Use Level3 Disabled _CRT_SECURE_NO_WARNINGS;WIN32;_DEBUG;_CONSOLE;_LIB;%(PreprocessorDefinitions) true Console true Use Level3 Disabled _SCL_SECURE_NO_WARNINGS;_CRT_SECURE_NO_WARNINGS;WIN32;_DEBUG;_CONSOLE;_LIB;%(PreprocessorDefinitions) true Console true libcmt.lib Level3 Use MaxSpeed true true _CRT_SECURE_NO_WARNINGS;WIN32;NDEBUG;_CONSOLE;_LIB;%(PreprocessorDefinitions) true Console true true true true Level3 Use Full true true _CRT_SECURE_NO_WARNINGS;WIN32;NDEBUG;_CONSOLE;_LIB;%(PreprocessorDefinitions) true Default Speed MultiThreaded Console true true true true Create Create Create Create KMC-2.3/kmc_tools/kmer.h000066400000000000000000000326051257432033000150760ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _KMER_H #define _KMER_H // Important remark: there is no inheritance here to guarantee that all classes defined here are POD according to C++11 #include "defs.h" #include "meta_oper.h" #include // ************************************************************************* // Ckmer class for k > 32 with classic kmer counting template struct CKmer { unsigned long long data[SIZE]; typedef unsigned long long data_t; inline void set(const CKmer &x); inline void mask(const CKmer &x); inline uint32 end_mask(const uint32 mask); inline void set_2bits(const uint64 x, const uint32 p); inline uchar get_2bits(const uint32 p); inline uchar get_byte(const uint32 p); inline void set_byte(const uint32 p, uchar x); inline void set_bits(const uint32 p, const uint32 n, uint64 x); inline void SHL_insert_2bits(const uint64 x); inline void SHR_insert_2bits(const uint64 x, const uint32 p); inline void SHR(const uint32 p); inline void SHL(const uint32 p); inline uint64 remove_suffix(const uint32 n) const; inline void set_n_1(const uint32 n); inline void set_n_01(const uint32 n); inline void store(uchar *&buffer, int32 n); inline void store(uchar *buffer, int32 p, int32 n); inline void load(uchar *&buffer, int32 n); inline bool operator==(const CKmer &x); inline bool operator<(const CKmer &x); inline void clear(void); inline char get_symbol(int p); }; // ********************************************************************* template inline void CKmer::set(const CKmer &x) { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] = x.data[i]; }, uint_()); #else for (uint32 i = 0; i < SIZE; ++i) data[i] = x.data[i]; #endif } // ********************************************************************* template inline void CKmer::mask(const CKmer &x) { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] &= x.data[i]; }, 
uint_()); #else for (uint32 i = 0; i < SIZE; ++i) data[i] &= x.data[i]; #endif } // ********************************************************************* template inline uint32 CKmer::end_mask(const uint32 mask) { return data[0] & mask; } // ********************************************************************* template inline void CKmer::set_2bits(const uint64 x, const uint32 p) { // data[p >> 6] |= x << (p & 63); data[p >> 6] += x << (p & 63); } template inline uchar CKmer::get_2bits(const uint32 p) { return (data[p >> 6] >> (p & 63)) & 3; } // ********************************************************************* template inline void CKmer::SHR_insert_2bits(const uint64 x, const uint32 p) { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] >>= 2; // data[i] |= data[i+1] << (64-2); data[i] += data[i + 1] << (64 - 2); }, uint_()); #else for (uint32 i = 0; i < SIZE - 1; ++i) { data[i] >>= 2; // data[i] |= data[i+1] << (64-2); data[i] += data[i + 1] << (64 - 2); } #endif data[SIZE - 1] >>= 2; // data[p >> 6] |= x << (p & 63); data[p >> 6] += x << (p & 63); } // ********************************************************************* template inline void CKmer::SHR(const uint32 p) { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] >>= 2 * p; // data[i] |= data[i+1] << (64-2*p); data[i] += data[i + 1] << (64 - 2 * p); }, uint_()); #else for (uint32 i = 0; i < SIZE - 1; ++i) { data[i] >>= 2 * p; // data[i] |= data[i+1] << (64-2*p); data[i] += data[i + 1] << (64 - 2 * p); } #endif data[SIZE - 1] >>= 2 * p; } // ********************************************************************* template inline void CKmer::SHL(const uint32 p) { #ifdef USE_META_PROG IterRev([&](const int &i){ data[i + 1] <<= p * 2; // data[i+1] |= data[i] >> (64-p*2); data[i + 1] += data[i] >> (64 - p * 2); }, uint_()); #else for (uint32 i = SIZE - 1; i > 0; --i) { data[i] <<= p * 2; // data[i] |= data[i-1] >> (64-p*2); data[i] += data[i - 1] >> (64 - p * 2); } #endif data[0] <<= p * 2; } // 
********************************************************************* template inline void CKmer::SHL_insert_2bits(const uint64 x) { #ifdef USE_META_PROG IterRev([&](const int &i){ data[i + 1] <<= 2; // data[i+1] |= data[i] >> (64-2); data[i + 1] += data[i] >> (64 - 2); }, uint_()); #else for (uint32 i = SIZE - 1; i > 0; --i) { data[i] <<= 2; // data[i] |= data[i-1] >> (64-2); data[i] += data[i - 1] >> (64 - 2); } #endif data[0] <<= 2; // data[0] |= x; data[0] += x; } // ********************************************************************* template inline uchar CKmer::get_byte(const uint32 p) { return (data[p >> 3] >> ((p << 3) & 63)) & 0xFF; } // ********************************************************************* template inline void CKmer::set_byte(const uint32 p, uchar x) { // data[p >> 3] |= ((uint64) x) << ((p & 7) << 3); data[p >> 3] += ((uint64)x) << ((p & 7) << 3); } // ********************************************************************* template inline void CKmer::set_bits(const uint32 p, const uint32 n, uint64 x) { // data[p >> 6] |= x << (p & 63); data[p >> 6] += x << (p & 63); if ((p >> 6) != ((p + n - 1) >> 6)) // data[(p >> 6) + 1] |= x >> (64 - (p & 63)); data[(p >> 6) + 1] += x >> (64 - (p & 63)); } // ********************************************************************* template inline bool CKmer::operator==(const CKmer &x) { for (uint32 i = 0; i < SIZE; ++i) if (data[i] != x.data[i]) return false; return true; } // ********************************************************************* template inline bool CKmer::operator<(const CKmer &x) { for (int32 i = SIZE - 1; i >= 0; --i) if (data[i] < x.data[i]) return true; else if (data[i] > x.data[i]) return false; return false; } // ********************************************************************* template inline void CKmer::clear(void) { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] = 0; }, uint_()); #else for (uint32 i = 0; i < SIZE; ++i) data[i] = 0; #endif } // 
********************************************************************* template inline uint64 CKmer::remove_suffix(const uint32 n) const { uint32 p = n >> 6; // / 64; uint32 r = n & 63; // % 64; if (p == SIZE - 1) return data[p] >> r; else // return (data[p+1] << (64-r)) | (data[p] >> r); return (data[p + 1] << (64 - r)) + (data[p] >> r); } // ********************************************************************* template inline void CKmer::set_n_1(const uint32 n) { clear(); for (uint32 i = 0; i < (n >> 6); ++i) data[i] = ~((uint64)0); uint32 r = n & 63; if (r) data[n >> 6] = (1ull << r) - 1; } // ********************************************************************* template inline void CKmer::set_n_01(const uint32 n) { clear(); for (uint32 i = 0; i < n; ++i) if (!(i & 1)) // data[i >> 6] |= (1ull << (i & 63)); data[i >> 6] += (1ull << (i & 63)); } // ********************************************************************* template inline void CKmer::store(uchar *&buffer, int32 n) { for (int32 i = n - 1; i >= 0; --i) *buffer++ = get_byte(i); } // ********************************************************************* template inline void CKmer::store(uchar *buffer, int32 p, int32 n) { for (int32 i = n - 1; i >= 0; --i) buffer[p++] = get_byte(i); } // ********************************************************************* template inline void CKmer::load(uchar *&buffer, int32 n) { clear(); for (int32 i = n - 1; i >= 0; --i) set_byte(i, *buffer++); } // ********************************************************************* template inline char CKmer::get_symbol(int p) { uint32 x = (data[p >> 5] >> (2 * (p & 31))) & 0x03; switch (x) { case 0: return 'A'; case 1: return 'C'; case 2: return 'G'; default: return 'T'; } } // ********************************************************************* // ********************************************************************* // ********************************************************************* // 
********************************************************************* // Ckmer class for k <= 32 with classic kmer counting template<> struct CKmer<1> { unsigned long long data; typedef unsigned long long data_t; static uint32 QUALITY_SIZE; void set(const CKmer<1> &x); void mask(const CKmer<1> &x); uint32 end_mask(const uint32 mask); void set_2bits(const uint64 x, const uint32 p); uchar get_2bits(const uint32 p); uchar get_byte(const uint32 p); void set_byte(const uint32 p, uchar x); void set_bits(const uint32 p, const uint32 n, uint64 x); void SHL_insert_2bits(const uint64 x); void SHR_insert_2bits(const uint64 x, const uint32 p); void SHR(const uint32 p); void SHL(const uint32 p); uint64 remove_suffix(const uint32 n) const; void set_n_1(const uint32 n); void set_n_01(const uint32 n); void store(uchar *&buffer, int32 n); void store(uchar *buffer, int32 p, int32 n); void load(uchar *&buffer, int32 n); bool operator==(const CKmer<1> &x); bool operator<(const CKmer<1> &x); void clear(void); inline char get_symbol(int p); }; // ********************************************************************* inline void CKmer<1>::mask(const CKmer<1> &x) { data &= x.data; } // ********************************************************************* inline uint32 CKmer<1>::end_mask(const uint32 mask) { return data & mask; } // ********************************************************************* inline void CKmer<1>::set(const CKmer<1> &x) { data = x.data; } // ********************************************************************* inline void CKmer<1>::set_2bits(const uint64 x, const uint32 p) { // data |= x << p; data += x << p; } inline uchar CKmer<1>::get_2bits(const uint32 p) { return (data >> p) & 3; } // ********************************************************************* inline void CKmer<1>::SHR_insert_2bits(const uint64 x, const uint32 p) { data >>= 2; // data |= x << p; data += x << p; } // ********************************************************************* inline void 
CKmer<1>::SHR(const uint32 p) { data >>= 2 * p; } // ********************************************************************* inline void CKmer<1>::SHL(const uint32 p) { data <<= p * 2; } // ********************************************************************* inline void CKmer<1>::SHL_insert_2bits(const uint64 x) { // data = (data << 2) | x; data = (data << 2) + x; } // ********************************************************************* inline uchar CKmer<1>::get_byte(const uint32 p) { return (data >> (p << 3)) & 0xFF; } // ********************************************************************* inline void CKmer<1>::set_byte(const uint32 p, uchar x) { // data |= ((uint64) x) << (p << 3); data += ((uint64)x) << (p << 3); } // ********************************************************************* inline void CKmer<1>::set_bits(const uint32 p, const uint32 n, uint64 x) { // data |= x << p; data += x << p; } // ********************************************************************* inline bool CKmer<1>::operator==(const CKmer<1> &x) { return data == x.data; } // ********************************************************************* inline bool CKmer<1>::operator<(const CKmer<1> &x) { return data < x.data; } // ********************************************************************* inline void CKmer<1>::clear(void) { data = 0ull; } // ********************************************************************* inline uint64 CKmer<1>::remove_suffix(const uint32 n) const { return data >> n; } // ********************************************************************* inline void CKmer<1>::set_n_1(const uint32 n) { if (n == 64) data = ~(0ull); else data = (1ull << n) - 1; } // ********************************************************************* inline void CKmer<1>::set_n_01(const uint32 n) { data = 0ull; for (uint32 i = 0; i < n; ++i) if (!(i & 1)) data += (1ull << i); } // ********************************************************************* inline void CKmer<1>::store(uchar *&buffer, 
int32 n) { for (int32 i = n - 1; i >= 0; --i) *buffer++ = get_byte(i); } // ********************************************************************* inline void CKmer<1>::store(uchar *buffer, int32 p, int32 n) { for (int32 i = n - 1; i >= 0; --i) buffer[p++] = get_byte(i); } // ********************************************************************* inline void CKmer<1>::load(uchar *&buffer, int32 n) { clear(); for (int32 i = n - 1; i >= 0; --i) set_byte(i, *buffer++); } // ********************************************************************* char CKmer<1>::get_symbol(int p) { uint32 x = (data >> (2 * p)) & 0x03; switch (x) { case 0: return 'A'; case 1: return 'C'; case 2: return 'G'; default: return 'T'; } } #endif // ***** EOF KMC-2.3/kmc_tools/meta_oper.h000066400000000000000000000015521257432033000161100ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _META_OPER_H #define _META_OPER_H //#include template struct uint_{ }; // For loop (forward) template inline void IterFwd(const Lambda &oper, uint_) { IterFwd(oper, uint_()); oper(N); } template inline void IterFwd(const Lambda &oper, uint_<0>) { oper(0); } // For loop (backward) template inline void IterRev(const Lambda &oper, uint_) { oper(N); IterRev(oper, uint_()); } template inline void IterRev(const Lambda &oper, uint_<0>) { oper(0); } #endif // ***** EOF KMC-2.3/kmc_tools/nc_utils.cpp000066400000000000000000000006121257432033000163040ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "stdafx.h" #include "nc_utils.h" uchar CNumericConversions::digits[100000*5]; int CNumericConversions::powOf10[30]; CNumericConversions::_si CNumericConversions::_init;KMC-2.3/kmc_tools/nc_utils.h000066400000000000000000000055541257432033000157630ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include #include "defs.h" #include #ifndef _NC_UTILS_H #define _NC_UTILS_H class CNumericConversions { public: static uchar digits[100000*5]; static int powOf10[30]; struct _si { _si() { for(int i = 0; i < 100000; ++i) { int dig = i; digits[i*5+4] = '0' + (dig % 10); dig /= 10; digits[i*5+3] = '0' + (dig % 10); dig /= 10; digits[i*5+2] = '0' + (dig % 10); dig /= 10; digits[i*5+1] = '0' + (dig % 10); dig /= 10; digits[i*5+0] = '0' + dig; } powOf10[0] = 1; for(int i = 1 ; i < 30 ; ++i) { powOf10[i] = powOf10[i-1]*10; } } } static _init; static int NDigits(uint64 val) { if(val >= 10000) return 5; else if(val >= 1000) return 4; else if(val >= 100) return 3; else if(val >= 10) return 2; else return 1; } static int Int2PChar(uint64 val, uchar *str) { if(val >= 1000000000000000ull) { uint64 dig1 = val / 1000000000000000ull; val -= dig1 * 1000000000000000ull; uint64 dig2 = val / 10000000000ull; val -= dig2 * 10000000000ull; uint64 dig3 = val / 100000ull; uint64 dig4 = val - dig3 * 100000ull; int ndig = NDigits(dig1); memcpy(str, digits+dig1*5+(5-ndig), ndig); memcpy(str+ndig, digits+dig2*5, 5); memcpy(str+ndig+5, digits+dig3*5, 5); memcpy(str+ndig+10, digits+dig4*5, 5); return ndig+15; } else if(val >= 10000000000ull) { uint64 dig1 = val / 10000000000ull; val -= dig1 * 10000000000ull; uint64 dig2 = val / 100000ull; uint64 dig3 = val - dig2 * 100000ull; int ndig = NDigits(dig1); 
memcpy(str, digits+dig1*5+(5-ndig), ndig); memcpy(str+ndig, digits+dig2*5, 5); memcpy(str+ndig+5, digits+dig3*5, 5); return ndig+10; } else if(val >= 100000ull) { uint64 dig1 = val / 100000ull; uint64 dig2 = val - dig1 * 100000ull; int ndig = NDigits(dig1); memcpy(str, digits+dig1*5+(5-ndig), ndig); memcpy(str+ndig, digits+dig2*5, 5); return ndig+5; } else { int ndig = NDigits(val); memcpy(str, digits+val*5+(5-ndig), ndig); return ndig; } } }; #endifKMC-2.3/kmc_tools/operations.h000066400000000000000000000171501257432033000163210ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _OPERATIONS_H #define _OPERATIONS_H #ifdef ENABLE_DEBUG #include "config.h" #endif #include #include "bundle.h" //************************************************************************************************************ // C2ArgOper - abstract class representing 2 argument's operation //************************************************************************************************************ template class C2ArgOper : public CInput { protected: CBundle* input1, *input2; public: C2ArgOper(CBundle* input1, CBundle* input2) : input1(input1), input2(input2) { } void IgnoreRest() override { input1->IgnoreRest(); input2->IgnoreRest(); } ~C2ArgOper() override { delete input1; delete input2; } }; //************************************************************************************************************ // CUnion - implementation of union operation on 2 k-mer's sets. 
//************************************************************************************************************ template class CUnion : public C2ArgOper { public: CUnion(CBundle* input1, CBundle* input2) : C2ArgOper(input1, input2) { } void NextBundle(CBundle& bundle) override { while (!this->input1->Finished() && !this->input2->Finished()) { if (bundle.Full()) { return; } if (this->input1->TopKmer() == this->input2->TopKmer()) { bundle.Insert(this->input1->TopKmer(), this->input1->TopCounter() + this->input2->TopCounter()); this->input1->Pop(); this->input2->Pop(); } else if (this->input1->TopKmer() < this->input2->TopKmer()) { bundle.Insert(this->input1->TopKmer(), this->input1->TopCounter()); this->input1->Pop(); } else { bundle.Insert(this->input2->TopKmer(), this->input2->TopCounter()); this->input2->Pop(); } } CBundle* non_empty_bundle = this->input1->Finished() ? this->input2 : this->input1; while (!non_empty_bundle->Finished()) { if (bundle.Full()) { return; } bundle.Insert(non_empty_bundle->TopKmer(), non_empty_bundle->TopCounter()); non_empty_bundle->Pop(); } this->finished = true; } }; //************************************************************************************************************ // CIntersection - implementation of intersection operation on 2 k-mer's sets. 
//************************************************************************************************************ template class CIntersection : public C2ArgOper { public: CIntersection(CBundle* input1, CBundle* input2) : C2ArgOper(input1, input2) { } void NextBundle(CBundle& bundle) override { while (!this->input1->Finished() && !this->input2->Finished()) { /*this->input1->Top(kmer1, counter1); this->input2->Top(kmer2, counter2);*/ if (this->input1->TopKmer() == this->input2->TopKmer()) { bundle.Insert(this->input1->TopKmer(), MIN(this->input1->TopCounter(), this->input2->TopCounter())); this->input1->Pop(); this->input2->Pop(); if (bundle.Full()) return; } else if (this->input1->TopKmer() < this->input2->TopKmer()) this->input1->Pop(); else this->input2->Pop(); } if (!this->input1->Finished()) this->input1->IgnoreRest(); if (!this->input2->Finished()) this->input2->IgnoreRest(); this->finished = true; } }; //************************************************************************************************************ // CKmersSubtract - implementation of subtraction operation of 2 k-mer's sets. // If a k-mer exists in both inputs, it is absent from the result (counters do not matter). 
//************************************************************************************************************ template class CKmersSubtract : public C2ArgOper { //CKmer kmer1, kmer2; //uint32 counter1, counter2; public: CKmersSubtract(CBundle* input1, CBundle* input2) : C2ArgOper(input1, input2) { } void NextBundle(CBundle& bundle) override { while (!this->input1->Finished() && !this->input2->Finished()) { //this->input1->Top(kmer1, counter1); //this->input2->Top(kmer2, counter2); if (this->input2->TopKmer() < this->input1->TopKmer()) this->input2->Pop(); else if (this->input2->TopKmer() == this->input1->TopKmer()) { this->input1->Pop(); this->input2->Pop(); } else { bundle.Insert(this->input1->TopKmer(), this->input1->TopCounter()); this->input1->Pop(); if (bundle.Full()) return; } } if(!this->input2->Finished()) this->input2->IgnoreRest(); while (!this->input1->Finished()) { if (bundle.Full()) return; bundle.Insert(this->input1->TopKmer(), this->input1->TopCounter()); this->input1->Pop(); } this->finished = true; } }; //************************************************************************************************************ // CCountersSubtract - implementation of subtraction operation of 2 k-mer's sets. // If a k-mer exists in both inputs, their counters are subtracted. 
//************************************************************************************************************ template class CCountersSubtract : public C2ArgOper { //CKmer kmer1, kmer2; //uint32 counter1, counter2; public: CCountersSubtract(CBundle* input1, CBundle* input2) : C2ArgOper(input1, input2) { } void NextBundle(CBundle& bundle) override { while (!this->input1->Finished() && !this->input2->Finished()) { //this->input1->Top(kmer1, counter1); //this->input2->Top(kmer2, counter2); if (this->input2->TopKmer() < this->input1->TopKmer()) this->input2->Pop(); else if (this->input2->TopKmer() == this->input1->TopKmer()) { if (this->input1->TopCounter() > this->input2->TopCounter()) { bundle.Insert(this->input1->TopKmer(), this->input1->TopCounter() - this->input2->TopCounter()); this->input1->Pop(); this->input2->Pop(); if (bundle.Full()) return; } else { this->input1->Pop(); this->input2->Pop(); } } else { bundle.Insert(this->input1->TopKmer(), this->input1->TopCounter()); this->input1->Pop(); if (bundle.Full()) return; } } if (!this->input2->Finished()) this->input2->IgnoreRest(); while (!this->input1->Finished()) { if (bundle.Full()) return; bundle.Insert(this->input1->TopKmer(), this->input1->TopCounter()); this->input1->Pop(); } this->finished = true; } }; template class CComparer { CBundle* input1, *input2; public: CComparer(CBundle* input1, CBundle* input2) : input1(input1), input2(input2) { } bool Equals() { while (!this->input1->Finished() && !this->input2->Finished()) { if (this->input1->TopCounter() != this->input2->TopCounter()) { this->input1->IgnoreRest(); this->input2->IgnoreRest(); return false; } if (!(this->input1->TopKmer() == this->input2->TopKmer())) { this->input1->IgnoreRest(); this->input2->IgnoreRest(); return false; } this->input1->Pop(); this->input2->Pop(); } if (!this->input1->Finished() || !this->input2->Finished()) { std::cout << "one of the inputs is not finished\n"; this->input1->IgnoreRest(); this->input2->IgnoreRest(); return false; 
} return true; } }; #endif // ***** EOFKMC-2.3/kmc_tools/output_parser.h000066400000000000000000000127401257432033000170520ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _OUTPUT_PARSER_H #define _OUTPUT_PARSER_H #include "defs.h" #include #include #include "tokenizer.h" #include "expression_node.h" /*****************************************************************************************************************************/ // This parser validates the grammar below: // expr -> term sum_op // sum_op -> PLUSMINUS term sum_op // sum_op -> TERMINATOR // // term -> argument term_op // term_op -> MUL argument term_op // term_op -> TERMINATOR // argument -> VARIABLE // argument -> OPEN_BRACKET expr CLOSE_BRACKET // This code is based on: https://github.com/mikailsheikh/cogitolearning-examples/tree/master/CogPar /*****************************************************************************************************************************/ template class COutputParser { std::list tokens; const std::map& input; Token curr_token; void nextToken(); CExpressionNode* argument(); CExpressionNode* term_op(CExpressionNode* left); CExpressionNode* term(); CExpressionNode* sum_op(CExpressionNode* left); CExpressionNode* expr(); public: COutputParser(std::list& tokens, const std::map& input) : tokens(tokens), input(input) { curr_token = tokens.front(); } CExpressionNode* Parse(); }; /*****************************************************************************************************************************/ /********************************************************** PUBLIC ***********************************************************/ /*****************************************************************************************************************************/ template CExpressionNode* COutputParser::Parse() 
{ CExpressionNode* res = expr(); if (curr_token.second != TokenType::TERMINATOR) { std::cout << "Error: wrong symbol: " << curr_token.first << "\n"; exit(1); } #ifdef ENABLE_DEBUG std::cout << "\n"; res->Display(); #endif return res; } /*****************************************************************************************************************************/ /********************************************************** PRIVATE **********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ template void COutputParser::nextToken() { tokens.pop_front(); if (tokens.empty()) curr_token.second = TokenType::TERMINATOR; else curr_token = tokens.front(); } /*****************************************************************************************************************************/ template CExpressionNode* COutputParser::argument() { if (curr_token.second == TokenType::VARIABLE) { //check if this variable was defined auto elem = input.find(curr_token.first); if (elem == input.end()) { std::cout << "Error: variable " << curr_token.first << " was not defined\n"; exit(1); } CExpressionNode* res = new CInputNode(elem->second); nextToken(); return res; } else if (curr_token.second == TokenType::PARENTHESIS_OPEN) { nextToken(); CExpressionNode* res = expr(); if (curr_token.second != TokenType::PARENTHESIS_CLOSE) { std::cout << "Error: close parenthesis expected, but " << curr_token.first << " found\n"; exit(1); } nextToken(); return res; } return nullptr; } /*****************************************************************************************************************************/ template CExpressionNode* COutputParser::term_op(CExpressionNode* left) { if (curr_token.second == TokenType::MUL_OPER) { CExpressionNode* res = new 
CIntersectionNode; res->AddLeftChild(left); nextToken(); auto right = argument(); res->AddRightChild(right); return term_op(res); } return left; } template CExpressionNode* COutputParser::term() { auto left = argument(); return term_op(left); } /*****************************************************************************************************************************/ template CExpressionNode* COutputParser::sum_op(CExpressionNode* left) { if (curr_token.second == TokenType::PLUS_OPER || curr_token.second == TokenType::STRICT_MINUS_OPER || curr_token.second == TokenType::COUNTER_MINUS_OPER) { CExpressionNode* res = nullptr; if (curr_token.second == TokenType::PLUS_OPER) res = new CUnionNode; else if (curr_token.second == TokenType::STRICT_MINUS_OPER) res = new CKmersSubtractionNode; else res = new CCountersSubtractionNode; res->AddLeftChild(left); nextToken(); auto right = term(); res->AddRightChild(right); return sum_op(res); } return left; } /*****************************************************************************************************************************/ template CExpressionNode* COutputParser::expr() { auto left = term(); return sum_op(left); } #endif // ***** EOFKMC-2.3/kmc_tools/parameters_parser.cpp000066400000000000000000000327231257432033000202130ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "stdafx.h" #include "parameters_parser.h" #include using namespace std; uint32 CParametersParser::replace_zero(uint32 val, const char* param_name, uint32 value_to_set_if_zero) { if (val == 0) { cout << "Warning: min value for " << param_name << " is " << value_to_set_if_zero << ". 
Your value will be converted to " << value_to_set_if_zero << "\n"; return value_to_set_if_zero; } return val; } void CParametersParser::parse_int_or_float(bool& force_float, bool& force_int, float& float_value, uint32& int_val, const char* param_name) { if (strchr(argv[pos] + 3, '.')) { float_value = (float)atof(argv[pos++] + 3); if (float_value > 1.0f || float_value < 0.0f) { cout << "Error: wrong value for fastq input parameter: " << param_name << "\n"; exit(1); } if (force_int) { cout << "Error: -ci and -cx must both be specified either as real numbers in [0;1] or as integers\n"; exit(1); } force_float = true; config.filtering_params.use_float_value = true; } else { int_val = atoi(argv[pos++] + 3); if (force_float) { cout << "Error: -ci and -cx must both be specified either as real numbers in [0;1] or as integers\n"; exit(1); } force_int = true; config.filtering_params.use_float_value = false; } } void CParametersParser::parse_global_params() { //defaults config.avaiable_threads = thread::hardware_concurrency(); //override defaults if specified for( ; pos < argc && argv[pos][0] == '-' ; ++pos) { if (strncmp(argv[pos], "-t", 2) == 0) { config.avaiable_threads = atoi(argv[pos] + 2); continue; } if (argv[pos][1] == 'v') { config.verbose = true; continue; } if (strncmp(argv[pos], "-hp", 3) == 0) { config.percent_progress.Hide(); continue; } } } void CParametersParser::read_input_fastq_desc() { if (pos >= argc) { cout << "Error: Input fastq file(s) missing\n"; exit(1); } if (strncmp(argv[pos], "-", 1) == 0) { cout << "Error: Input fastq file(s) required, but " << argv[pos] << " found\n"; exit(1); } string input_file_name = argv[pos++]; if (input_file_name[0] != '@') config.filtering_params.input_srcs.push_back(input_file_name); else { ifstream in(input_file_name.c_str() + 1); if (!in.good()) { cout << "Error: No " << input_file_name.c_str() + 1 << " file\n"; exit(1); } string s; while (getline(in, s)) { if (s != "") config.filtering_params.input_srcs.push_back(s); } in.close(); } bool 
force_float = false; bool force_int = false; for (int i = 0; i < 3 && pos < argc; ++i) { if(argv[pos][0] != '-') break; if (strncmp(argv[pos], "-ci", 3) == 0) { parse_int_or_float(force_float, force_int, config.filtering_params.f_min_kmers, config.filtering_params.n_min_kmers, "-ci"); } else if (strncmp(argv[pos], "-cx", 3) == 0) { parse_int_or_float(force_float, force_int, config.filtering_params.f_max_kmers, config.filtering_params.n_max_kmers, "-cx"); } else if (strncmp(argv[pos], "-f", 2) == 0) { switch (argv[pos++][2]) { case 'a': config.filtering_params.input_file_type = CFilteringParams::file_type::fasta; break; case 'q': config.filtering_params.input_file_type = CFilteringParams::file_type::fastq; break; default: cout << "Error: unknown parameter " << argv[pos - 1] << "\n"; exit(1); break; } } } config.filtering_params.output_file_type = config.filtering_params.input_file_type; } void CParametersParser::read_output_fastq_desc() { if (pos >= argc) { cout << "Error: Output fastq source missing\n"; exit(1); } if (strncmp(argv[pos], "-", 1) == 0) { cout << "Error: Output fastq source required, but " << argv[pos] << " found\n"; exit(1); } config.filtering_params.output_src = argv[pos++]; while (pos < argc && argv[pos][0] == '-') { if (strncmp(argv[pos], "-f", 2) == 0) { switch (argv[pos][2]) { case 'q': config.filtering_params.output_file_type = CFilteringParams::file_type::fastq; break; case 'a': config.filtering_params.output_file_type = CFilteringParams::file_type::fasta; break; default: cout << "Error: unknown parameter " << argv[pos] << "\n"; exit(1); break; } if (config.filtering_params.input_file_type == CFilteringParams::file_type::fasta && config.filtering_params.output_file_type == CFilteringParams::file_type::fastq) { cout << "Error: cannot set -fq for output when -fa is set for input\n"; exit(1); } } else { cout << "Error: Unknown parameter: " << argv[pos] << "\n"; exit(1); } ++pos; } } void CParametersParser::read_dump_params() { while (pos < argc && 
argv[pos][0] == '-') { if (strncmp(argv[pos], "-s", 2) == 0) { config.dump_params.sorted_output = true; } else { cout << "Warning: Unknow parameter for dump operation: " << argv[pos] << "\n"; } ++pos; } } void CParametersParser::read_input_desc() { if (pos >= argc) { cout << "Error: Input database source missed\n"; exit(1); } if (strncmp(argv[pos], "-", 1) == 0) { cout << "Error: Input database source required, but " << argv[pos] << "found\n"; exit(1); } CInputDesc desc(argv[pos++]); config.input_desc.push_back(desc); for (int i = 0; i < 2 && pos < argc; ++i) { if (strncmp(argv[pos], "-", 1) != 0) break; if (strncmp(argv[pos], "-ci", 3) == 0) { config.input_desc.back().cutoff_min = replace_zero(atoi(argv[pos++] + 3), "-ci", 1); } else if (strncmp(argv[pos], "-cx", 3) == 0) { config.input_desc.back().cutoff_max = replace_zero(atoi(argv[pos++] + 3), "-cx", 1); } else { cout << "Error: Unknow parameter: " << argv[pos]; exit(1); } } } void CParametersParser::read_output_desc() { if (pos >= argc) { cout << "Error: Output database source missed\n"; exit(1); } if (strncmp(argv[pos], "-", 1) == 0) { cout << "Error: Output database source required, but " << argv[pos] << "found\n"; exit(1); } config.output_desc.file_src = argv[pos++]; for (int i = 0; i < 2 && pos < argc; ++i) { if (strncmp(argv[pos], "-", 1) != 0) break; if (strncmp(argv[pos], "-ci", 3) == 0) { config.output_desc.cutoff_min = replace_zero(atoi(argv[pos++] + 3), "-ci", 1); } else if (strncmp(argv[pos], "-cx", 3) == 0) { config.output_desc.cutoff_max = replace_zero(atoi(argv[pos++] + 3), "-cx", 1); } else if (strncmp(argv[pos], "-cs", 3) == 0) { config.output_desc.counter_max = replace_zero(atoi(argv[pos++] + 3), "-cs", 1); } else { cout << "Error: Unknow parameter: " << argv[pos]; exit(1); } } } void CParametersParser::Usage() { CUsageDisplayerFactory disp(CConfig::GetInstance().mode); disp.GetUsageDisplayer().Display(); } CParametersParser::CParametersParser(int argc, char** argv) :argc(argc), argv(argv), 
config(CConfig::GetInstance()) { pos = 0; if (argc < 2) { Usage(); exit(1); } } void CParametersParser::Parse() { pos = 1; parse_global_params(); if (strcmp(argv[pos], "intersect") == 0) { config.mode = CConfig::Mode::INTERSECTION; } else if (strcmp(argv[pos], "kmers_subtract") == 0) { config.mode = CConfig::Mode::KMERS_SUBTRACT; } else if (strcmp(argv[pos], "counters_subtract") == 0) { config.mode = CConfig::Mode::COUNTERS_SUBTRACT; } else if (strcmp(argv[pos], "union") == 0) { config.mode = CConfig::Mode::UNION; } else if (strcmp(argv[pos], "complex") == 0) { config.mode = CConfig::Mode::COMPLEX; } else if (strcmp(argv[pos], "sort") == 0) { config.mode = CConfig::Mode::SORT; } else if (strcmp(argv[pos], "reduce") == 0) { config.mode = CConfig::Mode::REDUCE; } else if (strcmp(argv[pos], "compact") == 0) { config.mode = CConfig::Mode::COMPACT; } else if (strcmp(argv[pos], "histogram") == 0) { config.mode = CConfig::Mode::HISTOGRAM; } else if (strcmp(argv[pos], "dump") == 0) { config.mode = CConfig::Mode::DUMP; } else if (strcmp(argv[pos], "compare") == 0) { config.mode = CConfig::Mode::COMPARE; } else if (strcmp(argv[pos], "filter") == 0) { config.mode = CConfig::Mode::FILTER; } else { cout << "Error: Unknow mode: " << argv[pos] << "\n"; Usage(); exit(1); } if (argc == 2) { Usage(); exit(1); } pos++; if (config.mode == CConfig::Mode::INTERSECTION || config.mode == CConfig::Mode::KMERS_SUBTRACT || config.mode == CConfig::Mode::UNION || config.mode == CConfig::Mode::COUNTERS_SUBTRACT) { read_input_desc(); //first input read_input_desc(); //second input read_output_desc(); //output } else if (config.mode == CConfig::Mode::FILTER) { read_input_desc(); //kmc db read_input_fastq_desc(); //fastq input read_output_fastq_desc(); } else if (config.mode == CConfig::Mode::COMPLEX) { if (strncmp(argv[2], "-", 1) == 0) { cout << "Error: operations description file expected but " << argv[2] << " found\n"; exit(1); } complex_parser = make_unique(argv[pos]); 
complex_parser->ParseInputs(); } else if (config.mode == CConfig::Mode::DUMP) { read_dump_params(); read_input_desc(); read_output_desc(); } else if (config.mode == CConfig::Mode::SORT || config.mode == CConfig::Mode::HISTOGRAM || config.mode == CConfig::Mode::REDUCE || config.mode == CConfig::Mode::COMPACT) { read_input_desc(); read_output_desc(); if (config.mode == CConfig::Mode::COMPACT) { if (config.output_desc.counter_max) cout << "Warning: -cs can not be specified for compact operation, value specified will be ignored\n"; config.output_desc.counter_max = 1; } } else if (config.mode == CConfig::Mode::COMPARE) { read_input_desc(); read_input_desc(); } } bool CParametersParser::validate_input_dbs() { config.headers.push_back(CKMC_header(config.input_desc.front().file_src)); uint32 kmer_len = config.headers.front().kmer_len; uint32 mode = config.headers.front().mode; if (mode == 1) { cout << "Error: quality counters are not supported in kmc tools\n"; return false; } for (uint32 i = 1; i < config.input_desc.size(); ++i) { config.headers.push_back(CKMC_header(config.input_desc[i].file_src)); CKMC_header& h = config.headers.back(); if (h.mode != mode) { cout << "Error: quality/direct based counters conflict!\n"; return false; } if (h.kmer_len != kmer_len) { cout << "Database " << config.input_desc.front().file_src << " contains " << kmer_len << "-mers, but database " << config.input_desc[i].file_src << " contains " << h.kmer_len << "-mers\n"; return false; } } config.kmer_len = kmer_len; //update cutoff_min and coutoff_max if it was not set with parameters for (uint32 i = 0; i < config.input_desc.size(); ++i) { if (config.input_desc[i].cutoff_min == 0) config.input_desc[i].cutoff_min = config.headers[i].min_count; if (config.input_desc[i].cutoff_max == 0) config.input_desc[i].cutoff_max = config.headers[i].max_count; } //update output description if it was not set with parameters if (config.output_desc.cutoff_min == 0) { uint32 min_cutoff_min = 
config.input_desc.front().cutoff_min; for (uint32 i = 0; i < config.input_desc.size(); ++i) { if (config.input_desc[i].cutoff_min < min_cutoff_min) min_cutoff_min = config.input_desc[i].cutoff_min; } config.output_desc.cutoff_min = min_cutoff_min; if (config.verbose) cout << "-ci was not specified for output. It will be set to " << min_cutoff_min << "\n"; } if (config.output_desc.cutoff_max == 0) { if (config.mode == CConfig::Mode::HISTOGRAM) //for histogram default value differs { config.output_desc.cutoff_max = MIN(config.headers.front().max_count, MIN(HISTOGRAM_MAX_COUNTER_DEFAULT, (uint32)((1ull << (8 * config.headers.front().counter_size)) - 1))); } else { uint32 max_cutoff_max = config.input_desc.front().cutoff_max; for (uint32 i = 0; i < config.input_desc.size(); ++i) { if (config.input_desc[i].cutoff_max > max_cutoff_max) max_cutoff_max = config.input_desc[i].cutoff_max; } config.output_desc.cutoff_max = max_cutoff_max; } if (config.verbose) cout << "-cx was not specified for output. It will be set to " << config.output_desc.cutoff_max << "\n"; } if (config.output_desc.counter_max == 0) { uint32 max_counter_max = config.headers.front().counter_size; for (uint32 i = 0; i < config.headers.size(); ++i) { if (config.headers[i].counter_size> max_counter_max) max_counter_max = config.headers[i].counter_size; } max_counter_max = (uint32)((1ull << (max_counter_max << 3)) - 1); config.output_desc.counter_max = max_counter_max; if (config.verbose) cout << "-cs was not specified for output. 
It will be set to " << max_counter_max << "\n"; } return true; } void CParametersParser::SetThreads() { uint32 threads_left = config.avaiable_threads; //threads distribution: as many as possible for kmc2 database input, 1 thread for main thread which make operations calculation vector> kmc2_desc; if (!config.Is1ArgOper()) threads_left = MAX(1, threads_left - 1); for (uint32 i = 0; i < config.headers.size(); ++i) { if (config.headers[i].IsKMC2()) { kmc2_desc.push_back(ref(config.input_desc[i])); } } if (kmc2_desc.size()) { uint32 per_signle_kmc2_input = MAX(1, (uint32)(threads_left / kmc2_desc.size())); uint32 per_last_kmc2_input = MAX(1, (uint32)((threads_left + kmc2_desc.size() - 1) / kmc2_desc.size())); for (uint32 i = 0; i < kmc2_desc.size() - 1; ++i) kmc2_desc[i].get().threads = per_signle_kmc2_input; kmc2_desc.back().get().threads = per_last_kmc2_input; } } // ***** EOFKMC-2.3/kmc_tools/parameters_parser.h000066400000000000000000000053571257432033000176630ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _PARAMETERS_PARSER_H #define _PARAMETERS_PARSER_H #include "defs.h" #include "parser.h" #include class CParametersParser { std::unique_ptr complex_parser; int argc; char** argv; int pos; CConfig& config; uint32 replace_zero(uint32 val, const char* param_name, uint32 value_to_set_if_zero); void parse_int_or_float(bool& force_float, bool& force_int, float& float_value, uint32& int_val, const char* param_name); void parse_global_params(); void read_input_fastq_desc(); void read_output_fastq_desc(); void read_input_desc(); void read_dump_params(); void read_output_desc(); public: CParametersParser(int argc, char** argv); void Usage(); template CExpressionNode* GetExpressionRoot(); void Parse(); bool validate_input_dbs(); void SetThreads(); }; template CExpressionNode* CParametersParser::GetExpressionRoot() { if (config.mode == CConfig::Mode::INTERSECTION || config.mode == CConfig::Mode::KMERS_SUBTRACT || config.mode == CConfig::Mode::UNION || config.mode == CConfig::Mode::COUNTERS_SUBTRACT) { CExpressionNode* left = new CInputNode(0); CExpressionNode* right = new CInputNode(1); CExpressionNode* expression_root = nullptr; switch (config.mode) { case CConfig::Mode::INTERSECTION: expression_root = new CIntersectionNode; break; case CConfig::Mode::KMERS_SUBTRACT: expression_root = new CKmersSubtractionNode; break; case CConfig::Mode::UNION: expression_root = new CUnionNode; break; case CConfig::Mode::COUNTERS_SUBTRACT: expression_root = new CCountersSubtractionNode; break; default: std::cout << "Error: unknow operation\n"; exit(1); } expression_root->AddLeftChild(left); expression_root->AddRightChild(right); return expression_root; } else if (config.mode == CConfig::Mode::COMPLEX) { auto result = complex_parser->ParseOutput(); return result; } else if (config.mode == CConfig::Mode::SORT) { if (!config.headers.front().IsKMC2()) { std::cout << "This 
database contains sorted k-mers already!"; exit(1); } return new CInputNode(0); } else if (config.mode == CConfig::Mode::REDUCE) { return new CInputNode(0); } else if (config.mode == CConfig::Mode::COMPACT) { return new CInputNode(0); } else //should never be here { std::cout << "Error: unknow operation\n"; #ifdef ENABLE_DEBUG std::cout << __FUNCTION__ << " line: " << __LINE__ << "\n"; #endif exit(1); } } #endif // ***** EOFKMC-2.3/kmc_tools/parser.cpp000066400000000000000000000120151257432033000157600ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "stdafx.h" #include "parser.h" #include "tokenizer.h" #include "output_parser.h" #include "config.h" /*****************************************************************************************************************************/ /******************************************************** CONSTRUCTOR ********************************************************/ /*****************************************************************************************************************************/ CParser::CParser(const std::string& src): config(CConfig::GetInstance()) { line_no = 0; file.open(src); if (!file.is_open()) { std::cout << "Cannot open file: " << src << "\n"; exit(1); } //input_line_pattern = "\\s*(\\w*)\\s*=\\s*(.*)$"; input_line_pattern = "^\\s*([\\w-+]*)\\s*=\\s*(.*)$"; //TODO: consider valid file name empty_line_pattern = "^\\s*$"; } /*****************************************************************************************************************************/ /********************************************************** PUBLIC ***********************************************************/ /*****************************************************************************************************************************/ void CParser::ParseInputs() { 
    std::string line;
    while (true)
    {
        if (!nextLine(line))
        {
            std::cout << "Error: 'INPUT:' missing\n";
            exit(1);
        }
        if (line.find("INPUT:") != std::string::npos)
            break;
    }

    if (!nextLine(line) || line.find("OUTPUT:") != std::string::npos)
    {
        std::cout << "Error: No input was defined\n";
        exit(1);
    }

    while (true)
    {
        parseInputLine(line);
        if (!nextLine(line))
        {
            std::cout << "Error: 'OUTPUT:' missing\n";
            exit(1);
        }
        if (line.find("OUTPUT:") != std::string::npos)
            break;
    }
}

/*****************************************************************************************************************************/
/********************************************************** PRIVATE **********************************************************/
/*****************************************************************************************************************************/

/*****************************************************************************************************************************/
void CParser::parseInputLine(const std::string& line)
{
    std::smatch match;
    if (std::regex_search(line, match, input_line_pattern))
    {
#ifdef ENABLE_DEBUG
        std::cout << "\ninput name: " << match[1];
        std::cout << "\nafter = " << match[2];
#endif
        if (input.find(match[1]) != input.end())
        {
            std::cout << "Error: Name redefinition(" << match[1] << ")" << " line: " << line_no << "\n";
            exit(1);
        }
        else
        {
            std::string file_name;
            std::istringstream stream(match[2]);

            CInputDesc desc;

            if (!(stream >> desc.file_src))
            {
                std::cout << "Error: file name for " << match[1] << " was not specified, line: " << line_no << "\n";
                exit(1);
            }
            std::string tmp;
            while (stream >> tmp)
            {
                if (strncmp(tmp.c_str(), "-ci", 3) == 0)
                {
                    desc.cutoff_min = atoi(tmp.c_str() + 3);
                    continue;
                }
                else if (strncmp(tmp.c_str(), "-cx", 3) == 0)
                {
                    desc.cutoff_max = atoi(tmp.c_str() + 3);
                    continue;
                }
                std::cout << "Error: Unknown parameter " << tmp << " for variable " << match[1] << ", line: " << line_no << "\n";
                exit(1);
            }

            config.input_desc.push_back(desc);
            //map the variable name to the index of its input description
            input[match[1]] = (uint32)(config.input_desc.size() - 1);
        }
    }
    else
    {
        std::cout << "Error: wrong line format, line: " << line_no << "\n";
        exit(1);
    }
}

/*****************************************************************************************************************************/
void CParser::parseOtuputParamsLine()
{
    std::string line;
    if (!nextLine(line))
    {
        std::cout << "Warning: OUTPUT_PARAMS exists, but no parameters are defined\n";
    }
    else
    {
        std::istringstream stream(line);
        std::string tmp;
        while (stream >> tmp)
        {
            if (strncmp(tmp.c_str(), "-ci", 3) == 0)
            {
                config.output_desc.cutoff_min = atoi(tmp.c_str() + 3);
                continue;
            }
            else if (strncmp(tmp.c_str(), "-cx", 3) == 0)
            {
                config.output_desc.cutoff_max = atoi(tmp.c_str() + 3);
                continue;
            }
            else if ((strncmp(tmp.c_str(), "-cs", 3) == 0))
            {
                config.output_desc.counter_max = atoi(tmp.c_str() + 3);
                continue;
            }
            std::cout << "Error: Unknown output parameter " << tmp << ", line: " << line_no << "\n";
            exit(1);
        }
    }
}

/*****************************************************************************************************************************/
//reads the next non-empty line; returns false when the end of the file is reached
bool CParser::nextLine(std::string& line)
{
    while (true)
    {
        if (file.eof())
            return false;
        std::getline(file, line);
        ++line_no;
        std::smatch match;
        if (!std::regex_search(line, match, empty_line_pattern))
            return true;
    }
}

// ***** EOF
KMC-2.3/kmc_tools/parser.h000066400000000000000000000050041257432033000154250ustar00rootroot00000000000000
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _PARSER_H #define _PARSER_H #include "defs.h" #include "expression_node.h" #include "tokenizer.h" #include "output_parser.h" #include #include #include #include #include //************************************************************************************************************ // CParser - parser for complex operations //************************************************************************************************************ class CParser { std::ifstream file; uint32 line_no; std::regex input_line_pattern; std::regex empty_line_pattern; std::map input; void parseInputLine(const std::string& line); template CExpressionNode* parseOutputLine(const std::string& line); void parseOtuputParamsLine(); bool nextLine(std::string& line); CConfig& config; public: CParser(const std::string& src); void ParseInputs(); template CExpressionNode* ParseOutput(); }; //************************************************************************************************************ template CExpressionNode* CParser::ParseOutput() { std::string line; if (!nextLine(line) || line.find("OUTPUT_PARAMS:") != std::string::npos) { std::cout << "Error: None output was defined\n"; exit(1); } auto result = parseOutputLine(line); while (nextLine(line)) { if (line.find("OUTPUT_PARAMS:") != std::string::npos) { parseOtuputParamsLine(); break; } } return result; } //************************************************************************************************************ template CExpressionNode* CParser::parseOutputLine(const std::string& line) { std::smatch match; if (std::regex_search(line, match, input_line_pattern)) { #ifdef ENABLE_DEBUG std::cout << "out file name " << match[1] << "\n"; std::cout << "rest of output " << match[2] << "\n"; std::cout << "Tokenize resf of output\n"; #endif config.output_desc.file_src = match[1]; CTokenizer tokenizer; std::list tokens; 
tokenizer.Tokenize(match[2], tokens); COutputParser out_parser(tokens, input); return out_parser.Parse(); } else { std::cout << "Error: wrong line format, line: " << line_no << "\n"; exit(1); } } #endif // ***** EOFKMC-2.3/kmc_tools/percent_progress.cpp000066400000000000000000000066361257432033000200640ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "stdafx.h" #include "percent_progress.h" #include #include using namespace std; /*****************************************************************************************************************************/ /********************************************************** PUBLIC ***********************************************************/ /*****************************************************************************************************************************/ /*****************************************************************************************************************************/ uint32 CPercentProgress::RegisterItem(const std::string& name, uint64 max_value) { items.emplace_back(name, max_value); return static_cast(items.size() - 1); } /*****************************************************************************************************************************/ uint32 CPercentProgress::RegisterItem(uint64 max_value) { items.emplace_back("in" + std::to_string(items.size() + 1), max_value); display(); return static_cast(items.size() - 1); } /*****************************************************************************************************************************/ void CPercentProgress::UpdateItem(uint32 id) { --items[id].to_next_update; if (!items[id].to_next_update) { items[id].to_next_update = items[id].to_next_update_pattern; UpdateItem(id, items[id].to_next_update_pattern); } } 
/*****************************************************************************************************************************/
void CPercentProgress::UpdateItem(uint32 id, uint32 offset)
{
    items[id].cur_val += offset;
    uint32 prev = items[id].cur_percent;
    if (items[id].max_val)
        items[id].cur_percent = static_cast<uint32>((items[id].cur_val * 100) / items[id].max_val);
    else
        items[id].cur_percent = 100;
    if (prev != items[id].cur_percent)
        display();
}

/*****************************************************************************************************************************/
void CPercentProgress::Complete(uint32 id)
{
    if (items[id].cur_percent != 100)
    {
        items[id].cur_percent = 100;
        display();
    }
}

/*****************************************************************************************************************************/
/********************************************************** PRIVATE **********************************************************/
/*****************************************************************************************************************************/

/*****************************************************************************************************************************/
void CPercentProgress::display()
{
    if (hide_progress)
        return;
    std::cout << "\r";
    for (auto& item : items)
        std::cout << item.name << ": " << item.cur_percent << "% ";
    std::cout.flush();
}

/*****************************************************************************************************************************/
CPercentProgress::CDisplayItem::CDisplayItem(const std::string name, uint64 max_val) : name(name), max_val(max_val)
{
    to_next_update_pattern = (uint32)MAX(1, max_val / 100);
    to_next_update = to_next_update_pattern;
}

// ***** EOF
KMC-2.3/kmc_tools/percent_progress.h000066400000000000000000000023201257432033000175130ustar00rootroot00000000000000
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _PERCENT_PROGRESS_H #define _PERCENT_PROGRESS_H #include "defs.h" #include #include //************************************************************************************************************ // CPercentProgress - class to display progress of reading inputs //************************************************************************************************************ class CPercentProgress { bool hide_progress = false; struct CDisplayItem { std::string name; uint64 cur_val = 0; uint64 max_val; uint32 cur_percent = 0; uint32 to_next_update; uint32 to_next_update_pattern; public: CDisplayItem(const std::string name, uint64 max_val); }; std::vector items; void display(); public: uint32 RegisterItem(const std::string& name, uint64 max_value); uint32 RegisterItem(uint64 max_value); void UpdateItem(uint32 id); void Complete(uint32 id); void UpdateItem(uint32 id, uint32 offset); void Hide(){ hide_progress = true; } }; #endif // ***** EOFKMC-2.3/kmc_tools/queues.h000066400000000000000000000177311257432033000154520ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _QUEUES_H_ #define _QUEUES_H_ #include "defs.h" #include "bundle.h" #include #include #include #include #include class CSufWriteQueue { uint32 buf_size; uint32 max_inside; using elem_t = std::pair; std::list content; mutable std::mutex mtx; uint32 n_writers; std::condition_variable cv_pop, cv_push; public: void init(uint32 _buf_size, uint32 _max_inside) { buf_size = _buf_size; max_inside = _max_inside; n_writers = 1; } void push(uchar* &buf, uint32 size) { std::unique_lock lck(mtx); cv_push.wait(lck, [this]{return content.size() < max_inside; }); bool was_empty = content.empty(); content.push_back(std::make_pair(buf, size)); buf = new uchar[buf_size]; if (was_empty) cv_pop.notify_all(); } bool pop(uchar* &buf, uint32& size) { std::unique_lock lck(mtx); cv_pop.wait(lck, [this]{return !content.empty() || !n_writers; }); if (!n_writers && content.empty()) return false; bool was_full = max_inside == content.size(); buf = content.front().first; size = content.front().second; content.pop_front(); if (was_full) cv_push.notify_all(); return true; } void mark_completed() { std::lock_guard lck(mtx); --n_writers; if (!n_writers) cv_pop.notify_all(); } }; template class CCircularQueue { std::vector> buff; bool full, is_completed; int start, end; mutable std::mutex mtx; std::condition_variable cv_push; std::condition_variable cv_pop; bool forced_to_finish = false; public: CCircularQueue(int size) : buff(size), full(false), is_completed(false), start(0), end(0) { } bool push(CBundleData& bundle_data) { std::unique_lock lck(mtx); cv_push.wait(lck, [this]{return !full || forced_to_finish; }); if (forced_to_finish) { return false; } bool was_empty = start == end; std::swap(buff[end], bundle_data); bundle_data.Clear(); end = (end + 1) % buff.size(); if (end == start) full = true; if (was_empty) cv_pop.notify_all(); return true; } bool pop(CBundleData& 
bundle_data) { std::unique_lock lck(mtx); cv_pop.wait(lck, [this]{ return start != end || full || is_completed || forced_to_finish; }); if (forced_to_finish) return false; if (is_completed && !full && start == end) return false; bool was_full = full; std::swap(buff[start], bundle_data); buff[start].Clear(); start = (start + 1) % buff.size(); full = false; if (was_full) cv_push.notify_all(); return true; } void mark_completed() { std::lock_guard lck(mtx); is_completed = true; cv_pop.notify_all(); } void force_finish() { std::lock_guard lck(mtx); forced_to_finish = true; cv_pop.notify_all(); cv_push.notify_all(); } }; class CInputFilesQueue { typedef std::string elem_t; typedef std::queue> queue_t; queue_t q; mutable std::mutex mtx; // The mutex to synchronise on public: CInputFilesQueue(const std::vector &file_names) { std::unique_lock lck(mtx); for (auto p = file_names.cbegin(); p != file_names.cend(); ++p) q.push(*p); }; bool pop(std::string &file_name) { std::lock_guard lck(mtx); if (q.empty()) return false; file_name = q.front(); q.pop(); return true; } }; class CMemoryPool { int64 total_size; int64 part_size; int64 n_parts_total; int64 n_parts_free; uchar *buffer, *raw_buffer; uint32 *stack; mutable std::mutex mtx; // The mutex to synchronise on std::condition_variable cv; // The condition to wait for public: CMemoryPool(int64 _total_size, int64 _part_size) { raw_buffer = NULL; buffer = NULL; stack = NULL; prepare(_total_size, _part_size); } ~CMemoryPool() { release(); } void prepare(int64 _total_size, int64 _part_size) { release(); n_parts_total = _total_size / _part_size; part_size = (_part_size + 15) / 16 * 16; // to allow mapping pointer to int* n_parts_free = n_parts_total; total_size = n_parts_total * part_size; raw_buffer = new uchar[total_size + 64]; buffer = raw_buffer; while (((uint64)buffer) % 64) buffer++; stack = new uint32[n_parts_total]; for (uint32 i = 0; i < n_parts_total; ++i) stack[i] = i; } void release(void) { if (raw_buffer) delete[] 
raw_buffer;
        raw_buffer = NULL;
        buffer = NULL;

        if (stack)
            delete[] stack;
        stack = NULL;
    }

    // Allocate memory buffer - uchar*
    void reserve(uchar* &part)
    {
        std::unique_lock<std::mutex> lck(mtx);
        cv.wait(lck, [this]{return n_parts_free > 0; });

        part = buffer + stack[--n_parts_free] * part_size;
    }
    // Allocate memory buffer - char*
    void reserve(char* &part)
    {
        std::unique_lock<std::mutex> lck(mtx);
        cv.wait(lck, [this]{return n_parts_free > 0; });

        part = (char*)(buffer + stack[--n_parts_free] * part_size);
    }
    // Allocate memory buffer - uint32*
    void reserve(uint32* &part)
    {
        std::unique_lock<std::mutex> lck(mtx);
        cv.wait(lck, [this]{return n_parts_free > 0; });

        part = (uint32*)(buffer + stack[--n_parts_free] * part_size);
    }
    // Allocate memory buffer - uint64*
    void reserve(uint64* &part)
    {
        std::unique_lock<std::mutex> lck(mtx);
        cv.wait(lck, [this]{return n_parts_free > 0; });

        part = (uint64*)(buffer + stack[--n_parts_free] * part_size);
    }
    // Allocate memory buffer - double*
    void reserve(double* &part)
    {
        std::unique_lock<std::mutex> lck(mtx);
        cv.wait(lck, [this]{return n_parts_free > 0; });

        part = (double*)(buffer + stack[--n_parts_free] * part_size);
    }

    // Deallocate memory buffer - uchar*
    void free(uchar* part)
    {
        std::lock_guard<std::mutex> lck(mtx);

        stack[n_parts_free++] = (uint32)((part - buffer) / part_size);
        cv.notify_all();
    }
    // Deallocate memory buffer - char*
    void free(char* part)
    {
        std::lock_guard<std::mutex> lck(mtx);

        stack[n_parts_free++] = (uint32)(((uchar*)part - buffer) / part_size);
        cv.notify_all();
    }
    // Deallocate memory buffer - uint32*
    void free(uint32* part)
    {
        std::lock_guard<std::mutex> lck(mtx);

        stack[n_parts_free++] = (uint32)((((uchar *)part) - buffer) / part_size);
        cv.notify_all();
    }
    // Deallocate memory buffer - uint64*
    void free(uint64* part)
    {
        std::lock_guard<std::mutex> lck(mtx);

        stack[n_parts_free++] = (uint32)((((uchar *)part) - buffer) / part_size);
        cv.notify_all();
    }
    // Deallocate memory buffer - double*
    void free(double* part)
    {
        std::lock_guard<std::mutex> lck(mtx);

        stack[n_parts_free++] = (uint32)((((uchar *)part) - buffer) / part_size);
        cv.notify_all();
    }
};

class CPartQueue { typedef std::pair elem_t; typedef std::queue> queue_t; queue_t q; bool is_completed; int n_readers; mutable std::mutex mtx; // The mutex to synchronise on std::condition_variable cv_queue_empty; public: CPartQueue(int _n_readers) { std::unique_lock lck(mtx); is_completed = false; n_readers = _n_readers; }; ~CPartQueue() {}; bool empty() { std::lock_guard lck(mtx); return q.empty(); } bool completed() { std::lock_guard lck(mtx); return q.empty() && !n_readers; } void mark_completed() { std::lock_guard lck(mtx); n_readers--; if (!n_readers) cv_queue_empty.notify_all(); } void push(uchar *part, uint64 size) { std::unique_lock lck(mtx); bool was_empty = q.empty(); q.push(std::make_pair(part, size)); if (was_empty) cv_queue_empty.notify_all(); } bool pop(uchar *&part, uint64 &size) { std::unique_lock lck(mtx); cv_queue_empty.wait(lck, [this]{return !this->q.empty() || !this->n_readers; }); if (q.empty()) return false; std::tie(part, size) = q.front(); q.pop(); return true; } }; #endif KMC-2.3/kmc_tools/stdafx.cpp000066400000000000000000000004401257432033000157540ustar00rootroot00000000000000// stdafx.cpp : source file that includes just the standard includes // kmc_tools.pch will be the pre-compiled header // stdafx.obj will contain the pre-compiled type information #include "stdafx.h" // TODO: reference any additional headers you need in STDAFX.H // and not in this file KMC-2.3/kmc_tools/stdafx.h000066400000000000000000000005041257432033000154220ustar00rootroot00000000000000#ifdef WIN32 // stdafx.h : include file for standard system include files, // or project specific include files that are used frequently, but // are changed infrequently // #pragma once #include "targetver.h" #include #include // TODO: reference additional headers your program requires here #endifKMC-2.3/kmc_tools/targetver.h000066400000000000000000000004621257432033000161370ustar00rootroot00000000000000#pragma once // Including SDKDDKVer.h defines the highest available Windows 
platform. // If you wish to build your application for a previous Windows platform, include WinSDKVer.h and // set the _WIN32_WINNT macro to the platform you wish to support before including SDKDDKVer.h. #include KMC-2.3/kmc_tools/timer.h000066400000000000000000000012061257432033000152510ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _TIMER_H #define _TIMER_H #include class CTimer { using time_p = std::chrono::system_clock::time_point; time_p _start, _end; public: void start() { _start = std::chrono::high_resolution_clock::now(); } double get_time() { auto time = std::chrono::high_resolution_clock::now() - _start; return static_cast(std::chrono::duration_cast(time).count()); } }; #endif KMC-2.3/kmc_tools/tokenizer.cpp000066400000000000000000000055271257432033000165100ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  Authors: Marek Kokot

  Version: 2.3.0
  Date   : 2015-08-21
*/

#include "stdafx.h"
#include "tokenizer.h"

/*****************************************************************************************************************************/
/******************************************************** CONSTRUCTOR ********************************************************/
/*****************************************************************************************************************************/

CTokenizer::CTokenizer()
{
    // Patterns are tried in this order; each one is anchored at the start of the remaining input
    token_patterns.resize(7);
    token_patterns[0] = std::make_pair("^(\\()", TokenType::PARENTHESIS_OPEN);
    token_patterns[1] = std::make_pair("^(\\))", TokenType::PARENTHESIS_CLOSE);
    token_patterns[2] = std::make_pair("^(\\-)", TokenType::STRICT_MINUS_OPER);
    token_patterns[3] = std::make_pair("^(\\~)", TokenType::COUNTER_MINUS_OPER);
    token_patterns[4] = std::make_pair("^(\\+)", TokenType::PLUS_OPER);
    token_patterns[5] = std::make_pair("^(\\*)", TokenType::MUL_OPER);
    // \w+ rather than \w*: an empty match consumes no input, so an unrecognised character
    // would make Tokenize() loop forever instead of reporting an error
    token_patterns[6] = std::make_pair("^(\\w+)", TokenType::VARIABLE);
}

/*****************************************************************************************************************************/
/********************************************************** PUBLIC ***********************************************************/
/*****************************************************************************************************************************/

void CTokenizer::Tokenize(const std::string& _expression, std::list<Token>& tokens)
{
    std::string expression = _expression;
    std::smatch match;
    leftTrimString(expression, 0);
    while (!expression.empty())
    {
        bool valid_token = false;
        for (const auto& pattern : token_patterns)
        {
            if (std::regex_search(expression, match, pattern.first))
            {
#ifdef ENABLE_DEBUG
                std::cout << match[1];
#endif
                tokens.push_back(std::make_pair(match[1], pattern.second));
                // Drop the matched token and any whitespace that follows it
                leftTrimString(expression, (int)match[1].length());
                valid_token =
                    true;
                break;
            }
        }
        if (!valid_token)
        {
            std::cout << "Error: wrong output format near : " << expression << "\n";
            exit(1);
        }
    }
}

/*****************************************************************************************************************************/
/********************************************************** PRIVATE **********************************************************/
/*****************************************************************************************************************************/

// Removes the first start_pos characters of str together with any whitespace that follows them
void CTokenizer::leftTrimString(std::string& str, int start_pos)
{
    static const std::string whitespace = " \t\r\n\v\f";
    auto next_pos = str.find_first_not_of(whitespace, start_pos);
    str.erase(0, next_pos);
}

// ***** EOF
KMC-2.3/kmc_tools/tokenizer.h000066400000000000000000000020731257432033000161460ustar00rootroot00000000000000
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  Authors: Marek Kokot

  Version: 2.3.0
  Date   : 2015-08-21
*/

#ifndef _TOKENIZER_H
#define _TOKENIZER_H

#include "defs.h"
#include <vector>
#include <regex>
#include <list>
#include <iostream>

enum class TokenType{ VARIABLE, PLUS_OPER, STRICT_MINUS_OPER, COUNTER_MINUS_OPER, MUL_OPER, PARENTHESIS_OPEN, PARENTHESIS_CLOSE, TERMINATOR };

using Token = std::pair<std::string, TokenType>;

//************************************************************************************************************
// CTokenizer - Tokenizer for k-mers set operations
//************************************************************************************************************
class CTokenizer
{
public:
    CTokenizer();
    void Tokenize(const std::string& _expression, std::list<Token>& tokens);

private:
    std::vector<std::pair<std::regex, TokenType>> token_patterns;
    void leftTrimString(std::string& str, int start_pos);
};

#endif

// ***** EOF
KMC-2.3/kmer_counter.sln000066400000000000000000000114501257432033000152030ustar00rootroot00000000000000

Microsoft Visual Studio Solution File, Format Version 12.00
# Visual Studio 2013
VisualStudioVersion
= 12.0.21005.1 MinimumVisualStudioVersion = 10.0.40219.1 Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "kmer_counter", "kmer_counter\kmer_counter.vcxproj", "{8C8B90DA-28B7-4D82-81F3-C0E7CE52D59F}" EndProject Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "kmc_dump", "kmc_dump\kmc_dump.vcxproj", "{8939AD12-23D5-469C-806B-DC3F98F8A514}" EndProject Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "kmc_dump_sample", "kmc_dump_sample\kmc_dump_sample.vcxproj", "{17823F37-86DE-4E58-B354-B84DA9EDA6A1}" EndProject Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "kmc_tools", "kmc_tools\kmc_tools.vcxproj", "{F3B0CC94-9DD0-4642-891C-EA08BDA50260}" EndProject Global GlobalSection(SolutionConfigurationPlatforms) = preSolution Debug|Mixed Platforms = Debug|Mixed Platforms Debug|Win32 = Debug|Win32 Debug|x64 = Debug|x64 Release|Mixed Platforms = Release|Mixed Platforms Release|Win32 = Release|Win32 Release|x64 = Release|x64 EndGlobalSection GlobalSection(ProjectConfigurationPlatforms) = postSolution {8C8B90DA-28B7-4D82-81F3-C0E7CE52D59F}.Debug|Mixed Platforms.ActiveCfg = Debug|Win32 {8C8B90DA-28B7-4D82-81F3-C0E7CE52D59F}.Debug|Mixed Platforms.Build.0 = Debug|Win32 {8C8B90DA-28B7-4D82-81F3-C0E7CE52D59F}.Debug|Win32.ActiveCfg = Debug|Win32 {8C8B90DA-28B7-4D82-81F3-C0E7CE52D59F}.Debug|Win32.Build.0 = Debug|Win32 {8C8B90DA-28B7-4D82-81F3-C0E7CE52D59F}.Debug|x64.ActiveCfg = Debug|x64 {8C8B90DA-28B7-4D82-81F3-C0E7CE52D59F}.Debug|x64.Build.0 = Debug|x64 {8C8B90DA-28B7-4D82-81F3-C0E7CE52D59F}.Release|Mixed Platforms.ActiveCfg = Release|x64 {8C8B90DA-28B7-4D82-81F3-C0E7CE52D59F}.Release|Mixed Platforms.Build.0 = Release|x64 {8C8B90DA-28B7-4D82-81F3-C0E7CE52D59F}.Release|Win32.ActiveCfg = Release|Win32 {8C8B90DA-28B7-4D82-81F3-C0E7CE52D59F}.Release|Win32.Build.0 = Release|Win32 {8C8B90DA-28B7-4D82-81F3-C0E7CE52D59F}.Release|x64.ActiveCfg = Release|x64 {8C8B90DA-28B7-4D82-81F3-C0E7CE52D59F}.Release|x64.Build.0 = Release|x64 {8939AD12-23D5-469C-806B-DC3F98F8A514}.Debug|Mixed 
Platforms.ActiveCfg = Debug|Win32 {8939AD12-23D5-469C-806B-DC3F98F8A514}.Debug|Mixed Platforms.Build.0 = Debug|Win32 {8939AD12-23D5-469C-806B-DC3F98F8A514}.Debug|Win32.ActiveCfg = Debug|Win32 {8939AD12-23D5-469C-806B-DC3F98F8A514}.Debug|Win32.Build.0 = Debug|Win32 {8939AD12-23D5-469C-806B-DC3F98F8A514}.Debug|x64.ActiveCfg = Debug|Win32 {8939AD12-23D5-469C-806B-DC3F98F8A514}.Release|Mixed Platforms.ActiveCfg = Release|x64 {8939AD12-23D5-469C-806B-DC3F98F8A514}.Release|Mixed Platforms.Build.0 = Release|x64 {8939AD12-23D5-469C-806B-DC3F98F8A514}.Release|Win32.ActiveCfg = Release|Win32 {8939AD12-23D5-469C-806B-DC3F98F8A514}.Release|Win32.Build.0 = Release|Win32 {8939AD12-23D5-469C-806B-DC3F98F8A514}.Release|x64.ActiveCfg = Release|x64 {8939AD12-23D5-469C-806B-DC3F98F8A514}.Release|x64.Build.0 = Release|x64 {17823F37-86DE-4E58-B354-B84DA9EDA6A1}.Debug|Mixed Platforms.ActiveCfg = Debug|Win32 {17823F37-86DE-4E58-B354-B84DA9EDA6A1}.Debug|Mixed Platforms.Build.0 = Debug|Win32 {17823F37-86DE-4E58-B354-B84DA9EDA6A1}.Debug|Win32.ActiveCfg = Debug|Win32 {17823F37-86DE-4E58-B354-B84DA9EDA6A1}.Debug|Win32.Build.0 = Debug|Win32 {17823F37-86DE-4E58-B354-B84DA9EDA6A1}.Debug|x64.ActiveCfg = Debug|Win32 {17823F37-86DE-4E58-B354-B84DA9EDA6A1}.Release|Mixed Platforms.ActiveCfg = Release|x64 {17823F37-86DE-4E58-B354-B84DA9EDA6A1}.Release|Mixed Platforms.Build.0 = Release|x64 {17823F37-86DE-4E58-B354-B84DA9EDA6A1}.Release|Win32.ActiveCfg = Release|Win32 {17823F37-86DE-4E58-B354-B84DA9EDA6A1}.Release|Win32.Build.0 = Release|Win32 {17823F37-86DE-4E58-B354-B84DA9EDA6A1}.Release|x64.ActiveCfg = Release|x64 {F3B0CC94-9DD0-4642-891C-EA08BDA50260}.Debug|Mixed Platforms.ActiveCfg = Debug|Win32 {F3B0CC94-9DD0-4642-891C-EA08BDA50260}.Debug|Mixed Platforms.Build.0 = Debug|Win32 {F3B0CC94-9DD0-4642-891C-EA08BDA50260}.Debug|Win32.ActiveCfg = Debug|Win32 {F3B0CC94-9DD0-4642-891C-EA08BDA50260}.Debug|Win32.Build.0 = Debug|Win32 {F3B0CC94-9DD0-4642-891C-EA08BDA50260}.Debug|x64.ActiveCfg = Debug|x64 
{F3B0CC94-9DD0-4642-891C-EA08BDA50260}.Release|Mixed Platforms.ActiveCfg = Release|Win32 {F3B0CC94-9DD0-4642-891C-EA08BDA50260}.Release|Mixed Platforms.Build.0 = Release|Win32 {F3B0CC94-9DD0-4642-891C-EA08BDA50260}.Release|Win32.ActiveCfg = Release|Win32 {F3B0CC94-9DD0-4642-891C-EA08BDA50260}.Release|Win32.Build.0 = Release|Win32 {F3B0CC94-9DD0-4642-891C-EA08BDA50260}.Release|x64.ActiveCfg = Release|x64 {F3B0CC94-9DD0-4642-891C-EA08BDA50260}.Release|x64.Build.0 = Release|x64 EndGlobalSection GlobalSection(SolutionProperties) = preSolution HideSolutionNode = FALSE EndGlobalSection GlobalSection(ExtensibilityGlobals) = postSolution VisualSVNWorkingCopyRoot = . EndGlobalSection EndGlobal KMC-2.3/kmer_counter/000077500000000000000000000000001257432033000144645ustar00rootroot00000000000000KMC-2.3/kmer_counter/asmlib_wrapper.h000066400000000000000000000007111257432033000176430ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _ASMLIB_WRAPPER_H #define _ASMLIB_WRAPPER_H #include "defs.h" #ifdef DISABLE_ASMLIB #define A_memcpy memcpy #define SetMemcpyCacheLimit(X) #else #include "libs/asmlib.h" #endif #endifKMC-2.3/kmer_counter/bkb_merger.h000066400000000000000000000213701257432033000167370ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot

  Version: 2.3.0
  Date   : 2015-08-21
*/

#ifndef _HBH_MERGER_H
#define _HBH_MERGER_H

#include "bkb_subbin.h"

//************************************************************************************************************
// CBigKmerBinMerger - merges sorted k-mers from a number of sub-bins
//************************************************************************************************************
template <typename KMER_T, unsigned SIZE> class CBigKmerBinMerger
{
    vector<CSubBin<KMER_T, SIZE>*> sub_bins;
    // Current head of each active sub-bin: (k-mer, count, index into sub_bins)
    std::vector<std::tuple<KMER_T, uint32, uint32>> curr_min;
    CDiskLogger* disk_logger;
    uint32 size;
    CBigBinDesc* bbd;
    CBigBinKmerPartQueue* bbkpq;
    CCompletedBinsCollector* sm_cbc;
    uint32 kmer_len;
    uint32 lut_prefix_len;
    int32 cutoff_min, cutoff_max, counter_max;
    CMemoryPool* sm_pmm_merger_suff, *sm_pmm_merger_lut, *sm_pmm_sub_bin_suff, *sm_pmm_sub_bin_lut;
    int64 sm_mem_part_merger_suff, sm_mem_part_merger_lut, sm_mem_part_sub_bin_suff, sm_mem_part_sub_bin_lut;
    uchar *sub_bin_suff_buff, *sub_bin_lut_buff;
public:
    CBigKmerBinMerger(CKMCParams& Params, CKMCQueues& Queues);
    void init(int32 bin_id, uint32 _size);
    bool get_min(KMER_T& kmer, uint32& count);
    void Process();
    ~CBigKmerBinMerger();
};

//----------------------------------------------------------------------------------
template <typename KMER_T, unsigned SIZE> CBigKmerBinMerger<KMER_T, SIZE>::CBigKmerBinMerger(CKMCParams& Params, CKMCQueues& Queues)
{
    disk_logger = Queues.disk_logger;
    bbd = Queues.bbd;
    bbkpq = Queues.bbkpq;
    sm_cbc = Queues.sm_cbc;
    kmer_len = Params.kmer_len;
    lut_prefix_len = Params.lut_prefix_len;

    cutoff_min = Params.cutoff_min;
    cutoff_max = (int32)Params.cutoff_max;
    counter_max = (int32)Params.counter_max;

    sm_pmm_merger_suff = Queues.sm_pmm_merger_suff;
    sm_pmm_merger_lut = Queues.sm_pmm_merger_lut;
    sm_pmm_sub_bin_suff = Queues.sm_pmm_sub_bin_suff;
    sm_pmm_sub_bin_lut = Queues.sm_pmm_sub_bin_lut;

    sm_mem_part_sub_bin_suff = Params.sm_mem_part_sub_bin_suff;
    sm_mem_part_merger_suff =
        Params.sm_mem_part_merger_suff;
    sm_mem_part_merger_lut = Params.sm_mem_part_merger_lut;
    sm_mem_part_sub_bin_lut = Params.sm_mem_part_sub_bin_lut;

    sm_pmm_sub_bin_lut->reserve(sub_bin_lut_buff);
    sm_pmm_sub_bin_suff->reserve(sub_bin_suff_buff);
}

//----------------------------------------------------------------------------------
template <typename KMER_T, unsigned SIZE> CBigKmerBinMerger<KMER_T, SIZE>::~CBigKmerBinMerger()
{
    for (auto p : sub_bins)
        delete p;
    sm_pmm_sub_bin_lut->free(sub_bin_lut_buff);
    sm_pmm_sub_bin_suff->free(sub_bin_suff_buff);
}

//----------------------------------------------------------------------------------
// Prepares the merger for the given bin: opens its _size sub-bins and loads the first k-mer of each
template <typename KMER_T, unsigned SIZE> void CBigKmerBinMerger<KMER_T, SIZE>::init(int32 bin_id, uint32 _size)
{
    size = _size;
    uint32 prev_size = (uint32)sub_bins.size();
    int32 sub_bin_id;
    if (size > prev_size)
    {
        sub_bins.resize(size);
        curr_min.resize(size);
        for (uint32 i = prev_size; i < size; ++i)
        {
            sub_bins[i] = new CSubBin<KMER_T, SIZE>(disk_logger);
        }
    }

    uint32 lut_prefix_len = 0;
    uint32 n_kmers = 0;
    uint64 file_size = 0;
    FILE* file = NULL;
    string name;

    // The shared LUT and suffix buffers are split evenly between the sub-bins
    uint32 per_sub_bin_lut_size = (uint32)(sm_mem_part_sub_bin_lut / size);
    uint32 per_sub_bin_suff_size = (uint32)(sm_mem_part_sub_bin_suff / size);

    for (uint32 i = 0; i < size; ++i)
    {
        bbd->next_sub_bin(bin_id, sub_bin_id, lut_prefix_len, n_kmers, file, name, file_size);
        sub_bins[i]->init(file, file_size, lut_prefix_len, n_kmers, name, kmer_len, sub_bin_lut_buff + i * per_sub_bin_lut_size, per_sub_bin_lut_size, sub_bin_suff_buff + i * per_sub_bin_suff_size, per_sub_bin_suff_size);
        get<2>(curr_min[i]) = i;
        sub_bins[i]->get_min(get<0>(curr_min[i]), get<1>(curr_min[i]));
    }
}

//----------------------------------------------------------------------------------
// Returns the smallest k-mer among the heads of all active sub-bins and advances that sub-bin.
// An exhausted sub-bin is dropped by overwriting its slot with the last active one.
template <typename KMER_T, unsigned SIZE> bool CBigKmerBinMerger<KMER_T, SIZE>::get_min(KMER_T& kmer, uint32& count)
{
    if (!size)
        return false;
    uint32 min = 0;
    for (uint32 i = 1; i < size; ++i)
        if (get<0>(curr_min[i]) < get<0>(curr_min[min]))
            min = i;

    kmer = get<0>(curr_min[min]);
    count = get<1>(curr_min[min]);
    if (sub_bins[get<2>(curr_min[min])]->get_min(get<0>(curr_min[min]),
            get<1>(curr_min[min])))
        ;
    else
        curr_min[min] = curr_min[--size];
    return true;
}

//----------------------------------------------------------------------------------
template <typename KMER_T, unsigned SIZE> void CBigKmerBinMerger<KMER_T, SIZE>::Process()
{
    int32 bin_id;
    uint32 size = 0;
    uint32 counter_size = min(BYTE_LOG(cutoff_max), BYTE_LOG(counter_max));
    uint32 lut_recs = 1 << (2 * lut_prefix_len);
    uint32 kmer_symbols = (kmer_len - lut_prefix_len);
    uint32 kmer_bytes = kmer_symbols / 4;
    uint32 suff_rec_bytes = kmer_bytes + counter_size;
    // Round the suffix buffer down to a whole number of records
    uint64 suff_buff_size = sm_mem_part_merger_suff / suff_rec_bytes * suff_rec_bytes;
    uint64 suff_buff_pos = 0;
    uint64 n_unique, n_cutoff_min, n_cutoff_max, n_total;
    KMER_T kmer, next_kmer;
    kmer.clear();
    next_kmer.clear();
    uint32 count_tmp = 0, count = 0;
    int32 max_in_lut = (int32)(sm_mem_part_merger_lut / sizeof(uint64));
    while (sm_cbc->pop(bin_id))
    {
        bbd->get_n_sub_bins(bin_id, size);
        uchar *raw_lut;
        sm_pmm_merger_lut->reserve(raw_lut);
        uint64 *lut = (uint64*)raw_lut;
        uchar* suff_buff;
        sm_pmm_merger_suff->reserve(suff_buff);
        suff_buff_pos = 0;
        n_unique = n_cutoff_min = n_cutoff_max = n_total = 0;
        fill_n(lut, max_in_lut, 0);
        init(bin_id, size);

        get_min(kmer, count_tmp);
        count = count_tmp;
        uint32 lut_offset = 0;
        uint64 prefix;
        while (get_min(next_kmer, count_tmp))
        {
            if (kmer == next_kmer)
                count += count_tmp;
            else
            {
                ++n_unique;
                n_total += count;
                if (count < (uint32)cutoff_min)
                    n_cutoff_min++;
                else if (count > (uint32)cutoff_max)
                    n_cutoff_max++;
                else
                {
                    if (count > (uint32)counter_max)
                        count = counter_max;

                    //store
                    prefix = kmer.remove_suffix(2 * kmer_symbols);
                    if (prefix >= max_in_lut + lut_offset)
                    {
                        // The prefix falls past the current LUT chunk: flush it and start the next one
                        bbkpq->push(bin_id, NULL, 0, raw_lut, max_in_lut * sizeof(uint64), 0, 0, 0, 0, false);
                        lut_offset += max_in_lut;
                        sm_pmm_merger_lut->reserve(raw_lut);
                        lut = (uint64*)raw_lut;
                        fill_n(lut, max_in_lut, 0);
                    }

                    lut[prefix - lut_offset]++;

                    // Suffix bytes, most significant first, followed by the counter in little-endian order
                    for (int32 j = (int32)kmer_bytes - 1; j >= 0; --j)
                        suff_buff[suff_buff_pos++] = kmer.get_byte(j);
                    for (int32 j = 0; j < (int32)counter_size; ++j)
                        suff_buff[suff_buff_pos++] = (count >> (j * 8)) & 0xFF;

                    if (suff_buff_pos >= suff_buff_size)
                    {
                        bbkpq->push(bin_id, suff_buff, suff_buff_pos, NULL, 0, 0, 0, 0, 0, false);
                        suff_buff_pos = 0;
                        sm_pmm_merger_suff->reserve(suff_buff);
                    }
                }
                count = count_tmp;
                kmer = next_kmer;
            }
        }

        // Flush the last k-mer of the bin
        ++n_unique;
        n_total += count;
        if (count < (uint32)cutoff_min)
            ++n_cutoff_min;
        else if (count > (uint32)cutoff_max)
            ++n_cutoff_max;
        else
        {
            if (count > (uint32)counter_max)
                count = counter_max;

            //store
            // The index is relative to the current LUT chunk, as in the loop above
            lut[kmer.remove_suffix(2 * kmer_symbols) - lut_offset]++;

            for (int32 j = (int32)kmer_bytes - 1; j >= 0; --j)
                suff_buff[suff_buff_pos++] = kmer.get_byte(j);
            for (int32 j = 0; j < (int32)counter_size; ++j)
                suff_buff[suff_buff_pos++] = (count >> (j * 8)) & 0xFF;
        }

        bbkpq->push(bin_id, suff_buff, suff_buff_pos, raw_lut, (lut_recs - lut_offset) * sizeof(uint64), n_unique, n_cutoff_min, n_cutoff_max, n_total, true);
    }
    bbkpq->mark_completed();
}

//************************************************************************************************************
// CWBigKmerBinMerger - wrapper for multithreading purposes
//************************************************************************************************************
template <typename KMER_T, unsigned SIZE> class CWBigKmerBinMerger
{
    CBigKmerBinMerger<KMER_T, SIZE> *merger;
public:
    CWBigKmerBinMerger(CKMCParams& Params, CKMCQueues& Queues);
    ~CWBigKmerBinMerger();
    void operator()();
};

//----------------------------------------------------------------------------------
// Constructor
template <typename KMER_T, unsigned SIZE> CWBigKmerBinMerger<KMER_T, SIZE>::CWBigKmerBinMerger(CKMCParams& Params, CKMCQueues& Queues)
{
    merger = new CBigKmerBinMerger<KMER_T, SIZE>(Params, Queues);
}

//----------------------------------------------------------------------------------
// Destructor
template <typename KMER_T, unsigned SIZE> CWBigKmerBinMerger<KMER_T, SIZE>::~CWBigKmerBinMerger()
{
    delete merger;
}

//----------------------------------------------------------------------------------
// Execution
template <typename KMER_T, unsigned SIZE> void CWBigKmerBinMerger<KMER_T, SIZE>::operator()()
{
    merger->Process();
}

#endif

// ***** EOF
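// Illustrative sketch only, not part of KMC. The merging idea used by CBigKmerBinMerger (repeatedly take
// the smallest head among several sorted sub-bins and sum the counts of equal k-mers) can be shown on a
// much smaller model. Everything named here is hypothetical: Record, SubBin and merge_counts do not exist
// in KMC, strings stand in for KMER_T, and cutoffs, the prefix LUT and the suffix buffers are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one sorted sub-bin: (k-mer, count) records in ascending k-mer order
using Record = std::pair<std::string, uint32_t>;
using SubBin = std::vector<Record>;

// Merges sorted sub-bins into one sorted list, summing the counts of equal k-mers
std::vector<Record> merge_counts(const std::vector<SubBin>& sub_bins)
{
    // heads[i] is the index of the next unread record of sub_bins[i]
    std::vector<std::size_t> heads(sub_bins.size(), 0);
    std::vector<Record> merged;

    for (;;)
    {
        // Linear scan for the sub-bin whose head is smallest; exhausted sub-bins are skipped
        int best = -1;
        for (std::size_t i = 0; i < sub_bins.size(); ++i)
        {
            if (heads[i] == sub_bins[i].size())
                continue;
            if (best < 0 || sub_bins[i][heads[i]].first < sub_bins[best][heads[best]].first)
                best = (int)i;
        }
        if (best < 0)
            break; // every sub-bin is exhausted

        const Record& rec = sub_bins[best][heads[best]++];
        if (!merged.empty() && merged.back().first == rec.first)
            merged.back().second += rec.second; // same k-mer seen in another sub-bin
        else
            merged.push_back(rec);
    }
    return merged;
}
```

// A linear scan over the heads is the simplest choice and costs one comparison per sub-bin for every
// record; a heap would reduce that when the number of sub-bins is large.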
KMC-2.3/kmer_counter/bkb_reader.cpp000066400000000000000000000057741257432033000172650ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "stdafx.h" #include "bkb_reader.h" #include "asmlib_wrapper.h" //************************************************************************************************************ // CBigKmerBinReader //************************************************************************************************************ //---------------------------------------------------------------------------------- CBigKmerBinReader::CBigKmerBinReader(CKMCParams& Params, CKMCQueues& Queues) { tlbq = Queues.tlbq; disk_logger = Queues.disk_logger; bd = Queues.bd; bbpq = Queues.bbpq; sm_pmm_input_file = Queues.sm_pmm_input_file; sm_mem_part_input_file = Params.sm_mem_part_input_file; } //---------------------------------------------------------------------------------- void CBigKmerBinReader::ProcessBigBin() { int32 bin_id; CMemDiskFile *file; string name; uint64 size, n_rec, n_plus_x_recs, in_buffer, end_pos; uint32 buffer_size, kmer_len; uchar *file_buff, *tmp; while (tlbq->get_next(bin_id)) { bd->read(bin_id, file, name, size, n_rec, n_plus_x_recs, buffer_size, kmer_len); cout << "*"; file->Rewind(); end_pos = 0; sm_pmm_input_file->reserve(file_buff); while ( (in_buffer = end_pos + file->Read(file_buff + end_pos, 1, sm_mem_part_input_file - end_pos)) ) { end_pos = 0; for (; end_pos + 1 + (file_buff[end_pos] + kmer_len + 3) / 4 <= in_buffer; end_pos += 1 + (file_buff[end_pos] + kmer_len + 3) / 4); uint64 rest = in_buffer - end_pos; sm_pmm_input_file->reserve(tmp); A_memcpy(tmp, file_buff + end_pos, rest); bbpq->push(bin_id, file_buff, end_pos); file_buff = tmp; end_pos = rest; } sm_pmm_input_file->free(file_buff); file->Close(); 
//Remove file file->Remove(); disk_logger->log_remove(size); } bbpq->mark_completed(); } //---------------------------------------------------------------------------------- CBigKmerBinReader::~CBigKmerBinReader() { } //************************************************************************************************************ // CWBigKmerBinReader - wrapper for multithreading purposes //************************************************************************************************************ //---------------------------------------------------------------------------------- // Constructor CWBigKmerBinReader::CWBigKmerBinReader(CKMCParams& Params, CKMCQueues& Queues) { bkb_reader = new CBigKmerBinReader(Params, Queues); } //---------------------------------------------------------------------------------- // Destructor CWBigKmerBinReader::~CWBigKmerBinReader() { delete bkb_reader; } //---------------------------------------------------------------------------------- // Execution void CWBigKmerBinReader::operator()() { bkb_reader->ProcessBigBin(); } // ***** EOFKMC-2.3/kmer_counter/bkb_reader.h000066400000000000000000000026111257432033000167150ustar00rootroot00000000000000 /* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _BKB_READER_H_ #define _BKB_READER_H_ #include "params.h" //************************************************************************************************************ // CBigKmerBinReader - reader of bins from distribution phase. 
Only in strict memory mode //************************************************************************************************************ class CBigKmerBinReader { CTooLargeBinsQueue * tlbq; CDiskLogger* disk_logger; CBinDesc* bd; CBigBinPartQueue* bbpq; CMemoryPool* sm_pmm_input_file; uint64 sm_mem_part_input_file; public: CBigKmerBinReader(CKMCParams& Params, CKMCQueues& Queues); ~CBigKmerBinReader(); void ProcessBigBin(); }; //************************************************************************************************************ // CWBigKmerBinReader - wrapper for multithreading purposes //************************************************************************************************************ class CWBigKmerBinReader { CBigKmerBinReader* bkb_reader; public: CWBigKmerBinReader(CKMCParams& Params, CKMCQueues& Queues); ~CWBigKmerBinReader(); void operator()(); }; #endif // ***** EOFKMC-2.3/kmer_counter/bkb_sorter.h000066400000000000000000000417751257432033000170070ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _BKB_SORTER_H #define _BKB_SORTER_H #include "radix.h" #include "kxmer_set.h" #include "params.h" //************************************************************************************************************ template class CBigKmerBinSorter_Impl; //************************************************************************************************************ // CBigKmerBinSorter - sorter for part of bin, only in strict memory mode //************************************************************************************************************ template class CBigKmerBinSorter { CBigBinKXmersQueue* bbkq; CBigBinDesc* bbd; CBigBinSortedPartQueue* bbspq; CMemoryPool *pmm_radix_buf, *sm_pmm_expand, *sm_pmm_sorter_suffixes, *sm_pmm_sorter_lut, *sm_pmm_sort; uchar* _raw_kxmers; int64 sm_mem_part_suffixes; CKXmerSet kxmer_set; KMER_T* kxmers; KMER_T* sort_tmp; KMER_T* sorted_kxmers; uint32 *kxmers_counters; uint64 kxmers_size; uint64 kxmers_pos; int32 cutoff_min, cutoff_max; int32 counter_max; uint32 *kxmer_counters; int32 lut_prefix_len; int n_omp_threads; int32 bin_id; uint32 sub_bin_id; uint32 max_x; uint32 kmer_len; bool use_quake; uint64 sum_n_rec, sum_n_plus_x_rec; friend class CBigKmerBinSorter_Impl; void Sort(); public: CBigKmerBinSorter(CKMCParams& Params, CKMCQueues& Queues); ~CBigKmerBinSorter(); void Process(); }; //************************************************************************************************************ // CBigKmerBinSorter_Impl - implementation of k-mer type- and size-dependent functions //************************************************************************************************************ template class CBigKmerBinSorter_Impl { public: static void PostProcessSort(CBigKmerBinSorter& ptr); }; template class CBigKmerBinSorter_Impl, SIZE> { static void 
PostProcessKmers(CBigKmerBinSorter, SIZE>& ptr); static void PostProcessKxmers(CBigKmerBinSorter, SIZE>& ptr); static void PreCompactKxmers(CBigKmerBinSorter, SIZE>& ptr, uint64& compacted_count, uint32* counters); static uint64 FindFirstSymbOccur(CBigKmerBinSorter, SIZE> &ptr, uint64 start_pos, uint64 end_pos, uint32 offset, uchar symb); static void InitKXMerSet(CBigKmerBinSorter, SIZE> &ptr, uint64 start_pos, uint64 end_pos, uint32 offset, uint32 depth); public: static void PostProcessSort(CBigKmerBinSorter, SIZE>& ptr); }; template class CBigKmerBinSorter_Impl, SIZE> { public: static void PostProcessSort(CBigKmerBinSorter, SIZE>& ptr); }; //************************************************************************************************************ // CBigKmerBinSorter //************************************************************************************************************ //---------------------------------------------------------------------------------- template void CBigKmerBinSorter::Process() { int32 curr_bin_id = -1; bin_id = -1; uchar* data = NULL; uint64 size = 0; kxmers_pos = 0; sub_bin_id = 0; while (bbkq->pop(curr_bin_id, data, size)) { if (bin_id == -1) bin_id = curr_bin_id; if (curr_bin_id != bin_id) //new bin { if (kxmers_pos) { Sort(); CBigKmerBinSorter_Impl::PostProcessSort(*this); kxmers_pos = 0; } bin_id = curr_bin_id; sub_bin_id = 0; } if (kxmers_pos + size < kxmers_size) { A_memcpy(kxmers + kxmers_pos, data, size * sizeof(KMER_T)); sm_pmm_expand->free(data); kxmers_pos += size; } else { Sort(); CBigKmerBinSorter_Impl::PostProcessSort(*this); ++sub_bin_id; A_memcpy(kxmers, data, size * sizeof(KMER_T)); sm_pmm_expand->free(data); kxmers_pos = size; } } if (kxmers_pos) { Sort(); CBigKmerBinSorter_Impl::PostProcessSort(*this); } bbspq->mark_completed(); } //---------------------------------------------------------------------------------- template CBigKmerBinSorter::CBigKmerBinSorter(CKMCParams& Params, CKMCQueues& Queues) : 
kxmer_set(Params.kmer_len) { sorted_kxmers = NULL; kxmer_counters = NULL; bbkq = Queues.bbkq; bbspq = Queues.bbspq; pmm_radix_buf = Queues.pmm_radix_buf; sm_pmm_expand = Queues.sm_pmm_expand; sm_pmm_sorter_suffixes = Queues.sm_pmm_sorter_suffixes; sm_pmm_sorter_lut = Queues.sm_pmm_sorter_lut; sm_pmm_sort = Queues.sm_pmm_sort; kxmers_size = Params.sm_mem_part_sort / 2 / sizeof(KMER_T); sm_mem_part_suffixes = Params.sm_mem_part_suffixes; sm_pmm_sort->reserve(_raw_kxmers); kxmers = (KMER_T*)_raw_kxmers; sort_tmp = kxmers + kxmers_size; max_x = Params.max_x; bbd = Queues.bbd; use_quake = Params.use_quake; kmer_len = Params.kmer_len; lut_prefix_len = Params.lut_prefix_len; n_omp_threads = Params.sm_n_omp_threads; sum_n_rec = sum_n_plus_x_rec = 0; cutoff_max = (int32)Params.cutoff_max; cutoff_min = Params.cutoff_min; counter_max = (int32)Params.counter_max; } //---------------------------------------------------------------------------------- template CBigKmerBinSorter::~CBigKmerBinSorter() { sm_pmm_sort->free(_raw_kxmers); } //---------------------------------------------------------------------------------- template void CBigKmerBinSorter::Sort() { uint32 rec_len; uint64 sort_rec = kxmers_pos; if (max_x && !use_quake) { rec_len = (kmer_len + max_x + 1 + 3) / 4; } else { rec_len = (kmer_len + 3) / 4; } sum_n_plus_x_rec += kxmers_pos; if (sizeof(KMER_T) == 8) { uint64 *_buffer_input = (uint64*)kxmers; uint64 *_buffer_tmp = (uint64*)sort_tmp; RadixSort_buffer(pmm_radix_buf, _buffer_input, _buffer_tmp, sort_rec, rec_len, n_omp_threads); if (rec_len % 2) { kxmers_counters = (uint32*)kxmers; sorted_kxmers = (KMER_T*)sort_tmp; } else { kxmers_counters = (uint32*)sort_tmp; sorted_kxmers = (KMER_T*)kxmers; } } else { uint32 *_buffer_input = (uint32*)kxmers; uint32 *_buffer_tmp = (uint32*)sort_tmp; RadixSort_uint8(_buffer_input, _buffer_tmp, sort_rec, sizeof(KMER_T), offsetof(KMER_T, data), SIZE*sizeof(typename KMER_T::data_t), rec_len, n_omp_threads); if (rec_len % 2) { 
kxmers_counters = (uint32*)_buffer_input; sorted_kxmers = (KMER_T*)_buffer_tmp; } else { kxmers_counters = (uint32*)_buffer_tmp; sorted_kxmers = (KMER_T*)_buffer_input; } } } //************************************************************************************************************ // CBigKmerBinSorter_Impl //************************************************************************************************************ //---------------------------------------------------------------------------------- template void CBigKmerBinSorter_Impl, SIZE>::PostProcessKmers(CBigKmerBinSorter, SIZE>& ptr) { uint32 best_lut_prefix_len = 0; uint32 local_lut_prefix_len; uint64 best_mem_amount = 1ull << 62; uint32 counter_size = sizeof(uint32); for (local_lut_prefix_len = 2; local_lut_prefix_len < 13; ++local_lut_prefix_len) { uint32 suffix_len = ptr.kmer_len - local_lut_prefix_len; if (suffix_len % 4) continue; uint64 suf_mem = (suffix_len / 4 + counter_size) * ptr.kxmers_pos; uint64 lut_mem = (1ull << (2 * local_lut_prefix_len)) * sizeof(uint64); if (suf_mem + lut_mem < best_mem_amount) { best_mem_amount = suf_mem + lut_mem; best_lut_prefix_len = local_lut_prefix_len; } } local_lut_prefix_len = best_lut_prefix_len; uint32 kmer_symbols = ptr.kmer_len - local_lut_prefix_len; uint64 kmer_bytes = kmer_symbols / 4; uint32 suffix_rec_bytes = (ptr.kmer_len - local_lut_prefix_len) / 4 + counter_size; uint64 lut_recs = 1ull << 2 * local_lut_prefix_len; uchar* suff_buff; ptr.sm_pmm_sorter_suffixes->reserve(suff_buff); uchar* _raw_lut; ptr.sm_pmm_sorter_lut->reserve(_raw_lut); uint64* lut = (uint64*)_raw_lut; fill_n(lut, lut_recs, 0); uint64 suff_buff_size = ptr.sm_mem_part_suffixes / suffix_rec_bytes * suffix_rec_bytes; uint64 suff_buff_pos = 0; uint32 n_recs = 0; CKmer *act_kmer; uint32 count; uint64 i; act_kmer = &ptr.kxmers[0]; count = 1; for (i = 1; i < ptr.kxmers_pos; ++i) { if (*act_kmer == ptr.kxmers[i]) count++; else { lut[act_kmer->remove_suffix(2 * kmer_symbols)]++; for (int32 j 
= (int32)kmer_bytes - 1; j >= 0; --j) suff_buff[suff_buff_pos++] = act_kmer->get_byte(j); for (int32 j = 0; j < (int32)counter_size; ++j) suff_buff[suff_buff_pos++] = (count >> (j * 8)) & 0xFF; ++n_recs; if (suff_buff_pos >= suff_buff_size) { ptr.bbspq->push(ptr.bin_id, ptr.sub_bin_id, suff_buff, suff_buff_pos, NULL, 0, false); ptr.sm_pmm_sorter_suffixes->reserve(suff_buff); suff_buff_pos = 0; } count = 1; act_kmer = &ptr.kxmers[i]; } } lut[act_kmer->remove_suffix(2 * kmer_symbols)]++; for (int32 j = (int32)kmer_bytes - 1; j >= 0; --j) suff_buff[suff_buff_pos++] = act_kmer->get_byte(j); for (int32 j = 0; j < (int32)counter_size; ++j) suff_buff[suff_buff_pos++] = (count >> (j * 8)) & 0xFF; ++n_recs; ptr.bbspq->push(ptr.bin_id, ptr.sub_bin_id, suff_buff, suff_buff_pos, NULL, 0, false); ptr.bbspq->push(ptr.bin_id, ptr.sub_bin_id, NULL, 0, lut, lut_recs, true); ptr.bbd->push(ptr.bin_id, ptr.sub_bin_id, local_lut_prefix_len, n_recs, NULL, "", 0); } //---------------------------------------------------------------------------------- template void CBigKmerBinSorter_Impl, SIZE>::PreCompactKxmers(CBigKmerBinSorter, SIZE>& ptr, uint64& compacted_count, uint32* counters) { compacted_count = 0; CKmer *act_kmer; act_kmer = &ptr.sorted_kxmers[0]; counters[compacted_count] = 1; for (uint32 i = 1; i < ptr.kxmers_pos; ++i) { if (*act_kmer == ptr.sorted_kxmers[i]) ++counters[compacted_count]; else { ptr.sorted_kxmers[compacted_count++] = *act_kmer; counters[compacted_count] = 1; act_kmer = &ptr.sorted_kxmers[i]; } } ptr.sorted_kxmers[compacted_count++] = *act_kmer; } //---------------------------------------------------------------------------------- //Binary search position of first occurence of symbol 'symb' in [start_pos,end_pos). Offset defines which symbol in k+x-mer is taken. 
template uint64 CBigKmerBinSorter_Impl, SIZE>::FindFirstSymbOccur(CBigKmerBinSorter, SIZE> &ptr, uint64 start_pos, uint64 end_pos, uint32 offset, uchar symb) { uint32 kxmer_offset = (ptr.kmer_len + ptr.max_x - offset) * 2; uint64 middle_pos; uchar middle_symb; while (start_pos < end_pos) { middle_pos = (start_pos + end_pos) / 2; middle_symb = ptr.sorted_kxmers[middle_pos].get_2bits(kxmer_offset); if (middle_symb < symb) start_pos = middle_pos + 1; else end_pos = middle_pos; } return end_pos; } //---------------------------------------------------------------------------------- template void CBigKmerBinSorter_Impl, SIZE>::InitKXMerSet(CBigKmerBinSorter, SIZE> &ptr, uint64 start_pos, uint64 end_pos, uint32 offset, uint32 depth) { if (start_pos == end_pos) return; uint32 shr = ptr.max_x + 1 - offset; ptr.kxmer_set.init_add(start_pos, end_pos, shr); --depth; if (depth > 0) { uint64 pos[5]; pos[0] = start_pos; pos[4] = end_pos; for (uint32 i = 1; i < 4; ++i) pos[i] = FindFirstSymbOccur(ptr, pos[i - 1], end_pos, offset, i); for (uint32 i = 1; i < 5; ++i) InitKXMerSet(ptr, pos[i - 1], pos[i], offset + 1, depth); } } //---------------------------------------------------------------------------------- template void CBigKmerBinSorter_Impl, SIZE>::PostProcessKxmers(CBigKmerBinSorter, SIZE>& ptr) { ptr.kxmer_set.clear(); ptr.kxmer_set.set_buffer(ptr.sorted_kxmers); uint32 best_lut_prefix_len = 0; uint32 local_lut_prefix_len; uint64 best_mem_amount = 1ull << 62; uint32 counter_size = sizeof(uint32); for (local_lut_prefix_len = 2; local_lut_prefix_len < 13; ++local_lut_prefix_len) { uint32 suffix_len = ptr.kmer_len - local_lut_prefix_len; if(suffix_len % 4) continue; uint64 suf_mem = (suffix_len / 4 + counter_size) * ptr.kxmers_pos; uint64 lut_mem = (1ull << (2 * local_lut_prefix_len)) * sizeof(uint64); if (suf_mem + lut_mem < best_mem_amount) { best_mem_amount = suf_mem + lut_mem; best_lut_prefix_len = local_lut_prefix_len; } } local_lut_prefix_len = best_lut_prefix_len; uint32 
kmer_symbols = ptr.kmer_len - local_lut_prefix_len; uint64 kmer_bytes = kmer_symbols / 4; uint32 suffix_rec_bytes = (ptr.kmer_len - local_lut_prefix_len) / 4 + counter_size; uint64 lut_recs = 1ull << 2 * local_lut_prefix_len; uchar* suff_buff; ptr.sm_pmm_sorter_suffixes->reserve(suff_buff); uchar* _raw_lut; ptr.sm_pmm_sorter_lut->reserve(_raw_lut); uint64* lut = (uint64*)_raw_lut; fill_n(lut, lut_recs, 0); uint64 suff_buff_size = ptr.sm_mem_part_suffixes / suffix_rec_bytes * suffix_rec_bytes; uint64 suff_buff_pos = 0; uint32 n_recs = 0; uint64 compacted_count; PreCompactKxmers(ptr, compacted_count, ptr.kxmers_counters); uint64 pos[5]; pos[0] = 0; pos[4] = compacted_count; for(uint32 i = 1 ; i < 4 ; ++i) pos[i] = FindFirstSymbOccur(ptr, pos[i - 1], compacted_count, 0, i); for (uint32 i = 1; i < 5; ++i) InitKXMerSet(ptr, pos[i - 1], pos[i], ptr.max_x + 2 - i, i); uint64 counter_pos = 0; CKmer kmer, next_kmer; kmer.clear(); next_kmer.clear(); CKmer kmer_mask; uint32 count; kmer_mask.set_n_1(ptr.kmer_len * 2); ptr.kxmer_set.get_min(counter_pos, kmer); count = ptr.kxmers_counters[counter_pos]; while (ptr.kxmer_set.get_min(counter_pos, next_kmer)) { if (kmer == next_kmer) count += ptr.kxmers_counters[counter_pos]; else { lut[kmer.remove_suffix(2 * kmer_symbols)]++; for (int32 j = (int32)kmer_bytes - 1; j >= 0; --j) suff_buff[suff_buff_pos++] = kmer.get_byte(j); for (int32 j = 0; j < (int32)counter_size; ++j) suff_buff[suff_buff_pos++] = (count >> (j * 8)) & 0xFF; ++n_recs; if (suff_buff_pos >= suff_buff_size) { ptr.bbspq->push(ptr.bin_id, ptr.sub_bin_id, suff_buff, suff_buff_pos, NULL, 0, false); ptr.sm_pmm_sorter_suffixes->reserve(suff_buff); suff_buff_pos = 0; } count = ptr.kxmers_counters[counter_pos]; kmer = next_kmer; } } lut[kmer.remove_suffix(2 * kmer_symbols)]++; for (int32 j = (int32)kmer_bytes - 1; j >= 0; --j) suff_buff[suff_buff_pos++] = kmer.get_byte(j); for (int32 j = 0; j < (int32)counter_size; ++j) suff_buff[suff_buff_pos++] = (count >> (j * 8)) & 0xFF; 
++n_recs; ptr.bbspq->push(ptr.bin_id, ptr.sub_bin_id, suff_buff, suff_buff_pos, NULL, 0, false); ptr.bbspq->push(ptr.bin_id, ptr.sub_bin_id, NULL, 0, lut, lut_recs, true); ptr.bbd->push(ptr.bin_id, ptr.sub_bin_id, local_lut_prefix_len, n_recs, NULL, "", 0); } //---------------------------------------------------------------------------------- template void CBigKmerBinSorter_Impl, SIZE>::PostProcessSort(CBigKmerBinSorter, SIZE>& ptr) { if (ptr.max_x) PostProcessKxmers(ptr); else PostProcessKmers(ptr); } //---------------------------------------------------------------------------------- template void CBigKmerBinSorter_Impl, SIZE>::PostProcessSort(CBigKmerBinSorter, SIZE>& ptr) { //"Not supported in current release" } //************************************************************************************************************ // CWBigKmerBinSorter - wrapper for multithreading purposes //************************************************************************************************************ template class CWBigKmerBinSorter { CBigKmerBinSorter* bkb_sorter; public: CWBigKmerBinSorter(CKMCParams& Params, CKMCQueues& Queues); ~CWBigKmerBinSorter(); void operator()(); }; //---------------------------------------------------------------------------------- // Constructor template CWBigKmerBinSorter::CWBigKmerBinSorter(CKMCParams& Params, CKMCQueues& Queues) { bkb_sorter = new CBigKmerBinSorter(Params, Queues); } //---------------------------------------------------------------------------------- // Destructor template CWBigKmerBinSorter::~CWBigKmerBinSorter() { delete bkb_sorter; } //---------------------------------------------------------------------------------- // Execution template void CWBigKmerBinSorter::operator()() { bkb_sorter->Process(); } #endif // ***** EOF KMC-2.3/kmer_counter/bkb_subbin.h000066400000000000000000000105551257432033000167430ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _BKB_SUBBIN_H #define _BKB_SUBBIN_H //************************************************************************************************************ // CSubBin - sorted k-mers (part of some bin), used in strict memory mode //************************************************************************************************************ template class CSubBin { CDiskLogger* disk_logger; uchar* raw_lut; uint64* lut; uint32 current_prefix; uchar* suff_buff; uint64 suff_buff_size, max_in_suff_buff, lut_start_pos_in_file; uint32 kmer_len, lut_size, lut_buff_recs, lut_offset, cur_in_suff_buff, left_to_read, n_kmers, suff_buff_pos, in_current_prefix; string name; FILE* file; uint32 suff_rec_len, lut_prefix_len, counter_size, suffix_bytes; uint64 size; void read_next_lut_part(); public: bool get_min(KMER_T& kmer, uint32& count); CSubBin(CDiskLogger* _disk_logger) { lut_size = 0; disk_logger = _disk_logger; } void init(FILE* _file, uint64 _size, uint32 _lut_prefix_len, uint32 _n_kmers, string _name, uint32 _kmer_len, uchar* _lut_buff, uint32 _lut_buff_size, uchar* _suff_buff, uint64 _suff_buff_size); }; //-------------------------------------------------------------------------- template void CSubBin::read_next_lut_part() { uint32 to_read = MIN(lut_size - lut_offset, lut_buff_recs); lut_offset += lut_buff_recs; if (to_read) { uint64 prev_pos = my_ftell(file); my_fseek(file, lut_start_pos_in_file + (lut_offset - lut_buff_recs) * sizeof(uint64), SEEK_SET); if (fread(lut, sizeof(uint64), to_read, file) != to_read) { cout << "Error while reading file : " << name << "\n"; exit(1); } my_fseek(file, prev_pos, SEEK_SET); } } //-------------------------------------------------------------------------- template bool CSubBin::get_min(KMER_T& kmer, uint32& count) { while (true) { if (current_prefix >= lut_offset) { 
read_next_lut_part(); } if (in_current_prefix >= lut[current_prefix + lut_buff_recs - lut_offset]) { ++current_prefix; in_current_prefix = 0; } else { ++in_current_prefix; break; } if (current_prefix >= lut_size) { fclose(file); remove(name.c_str()); disk_logger->log_remove(size); return false; } } uchar *suf_rec = suff_buff + suff_buff_pos * suff_rec_len; uint32 tmp = current_prefix; uint32 pos = suffix_bytes; kmer.load(suf_rec, suffix_bytes); while (tmp) { kmer.set_byte(pos++, (uchar)tmp & 0xFF); tmp >>= 8; } count = 0; for (uint32 i = 0; i < counter_size; ++i) count += (*suf_rec++) << (8 * i); suff_buff_pos++; if (suff_buff_pos >= cur_in_suff_buff) { cur_in_suff_buff = (uint32)fread(suff_buff, 1, MIN(suff_rec_len * max_in_suff_buff, left_to_read), file) / suff_rec_len; suff_buff_pos = 0; left_to_read -= cur_in_suff_buff * suff_rec_len; } return true; } //-------------------------------------------------------------------------- template void CSubBin::init(FILE* _file, uint64 _size, uint32 _lut_prefix_len, uint32 _n_kmers, string _name, uint32 _kmer_len, uchar* _lut_buff, uint32 _lut_buff_size, uchar* _suff_buff, uint64 _suff_buff_size) { size = _size; lut = (uint64*)_lut_buff; lut_buff_recs = _lut_buff_size / sizeof(uint64); suff_buff = _suff_buff; suff_buff_size = _suff_buff_size; lut_offset = 0; lut_prefix_len = _lut_prefix_len; kmer_len = _kmer_len; suffix_bytes = (kmer_len - lut_prefix_len) / 4; file = _file; n_kmers = _n_kmers; name = _name; counter_size = sizeof(uint32); lut_size = (1 << lut_prefix_len * 2); suff_rec_len = (kmer_len - lut_prefix_len) / 4 + counter_size; left_to_read = suff_rec_len * n_kmers; max_in_suff_buff = suff_buff_size / suff_rec_len; lut_start_pos_in_file = n_kmers * suff_rec_len; rewind(file); read_next_lut_part(); cur_in_suff_buff = (uint32)fread(suff_buff, 1, MIN(max_in_suff_buff * suff_rec_len, left_to_read), file) / suff_rec_len; left_to_read -= cur_in_suff_buff * suff_rec_len; current_prefix = 0; in_current_prefix = 0; 
suff_buff_pos = 0; } #endif // ***** EOFKMC-2.3/kmer_counter/bkb_uncompactor.h000066400000000000000000000452351257432033000200160ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _BKB_UNCOMPACTOR_H #define _BKB_UNCOMPACTOR_H #include "params.h" #include "kmer.h" #include "rev_byte.h" //************************************************************************************************************ template class CBigKmerBinUncompactor_Impl; //************************************************************************************************************ // CBigKmerBinUncompactor - Unpacking super k-mers to k+x-mers, only in strict memory mode //************************************************************************************************************ template class CBigKmerBinUncompactor { CBigBinPartQueue* bbpq; CBigBinKXmersQueue* bbkq; CMemoryPool *sm_pmm_expand; uint32 max_x; bool both_strands; uint32 kmer_len; KMER_T* kxmers; int64 sm_mem_part_expand; uint32 kxmers_size; int32 bin_id; uchar* input_data; uint64 input_data_size; friend class CBigKmerBinUncompactor_Impl; public: CBigKmerBinUncompactor(CKMCParams& Params, CKMCQueues& Queues); ~CBigKmerBinUncompactor(); void Uncompact(int32 _bin_id, uchar* _data, uint64 _size); }; //************************************************************************************************************ // CBigKmerBinUncompactor_Impl - implementation of k-mer type- and size-dependent functions //************************************************************************************************************ template class CBigKmerBinUncompactor_Impl { public: static void Uncompact(CBigKmerBinUncompactor& ptr); }; template class CBigKmerBinUncompactor_Impl < CKmer, SIZE > { public: static void GetNextSymb(uchar& 
symb, uchar& byte_shift, uint64& pos, uchar* data_p); static void Uncompact(CBigKmerBinUncompactor, SIZE>& ptr); static void ExpandKxmersBoth(CBigKmerBinUncompactor, SIZE>& ptr); static void ExpandKxmersAll(CBigKmerBinUncompactor, SIZE>& ptr); static void ExpandKmersBoth(CBigKmerBinUncompactor, SIZE>& ptr); static void ExpandKmersAll(CBigKmerBinUncompactor, SIZE>& ptr); }; template class CBigKmerBinUncompactor_Impl < CKmerQuake, SIZE > { public: static void Uncompact(CBigKmerBinUncompactor, SIZE>& ptr); }; //************************************************************************************************************ // CBigKmerBinUncompactor //************************************************************************************************************ //---------------------------------------------------------------------------------- template CBigKmerBinUncompactor::CBigKmerBinUncompactor(CKMCParams& Params, CKMCQueues& Queues) { sm_pmm_expand = Queues.sm_pmm_expand; bbpq = Queues.bbpq; bbkq = Queues.bbkq; kmer_len = Params.kmer_len; max_x = Params.max_x; both_strands = Params.both_strands; sm_mem_part_expand = Params.sm_mem_part_expand; kxmers_size = (uint32)(sm_mem_part_expand / sizeof(KMER_T)); } //---------------------------------------------------------------------------------- template void CBigKmerBinUncompactor::Uncompact(int32 _bin_id, uchar* _data, uint64 _size) { bin_id = _bin_id; input_data = _data; input_data_size = _size; CBigKmerBinUncompactor_Impl::Uncompact(*this); } //---------------------------------------------------------------------------------- template CBigKmerBinUncompactor::~CBigKmerBinUncompactor() { } //************************************************************************************************************ // CBigKmerBinUncompactor_Impl //************************************************************************************************************ //---------------------------------------------------------------------------------- 
template inline void CBigKmerBinUncompactor_Impl, SIZE>::GetNextSymb(uchar& symb, uchar& byte_shift, uint64& pos, uchar* data_p) { symb = (data_p[pos] >> byte_shift) & 3; if (byte_shift == 0) { ++pos; byte_shift = 6; } else byte_shift -= 2; } //---------------------------------------------------------------------------------- template void CBigKmerBinUncompactor_Impl, SIZE>::ExpandKxmersBoth(CBigKmerBinUncompactor, SIZE>& ptr) { uchar* _raw_buffer; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; CKmer kmer, rev_kmer, kmer_mask; CKmer kxmer_mask; bool kmer_lower; uint32 x, additional_symbols; uchar symb; uint32 kmer_bytes = (ptr.kmer_len + 3) / 4; uint32 rev_shift = ptr.kmer_len * 2 - 2; uchar* data_p = ptr.input_data; kmer_mask.set_n_1(ptr.kmer_len * 2); uint32 kmer_shr = SIZE * 32 - ptr.kmer_len; kxmer_mask.set_n_1((ptr.kmer_len + ptr.max_x + 1) * 2); uint64 kxmers_pos = 0; uint64 pos = 0; while (pos < ptr.input_data_size) { kmer.clear(); rev_kmer.clear(); additional_symbols = data_p[pos++]; //build kmer for (uint32 i = 0, kmer_pos = 8 * SIZE - 1, kmer_rev_pos = 0; i < kmer_bytes; ++i, --kmer_pos, ++kmer_rev_pos) { kmer.set_byte(kmer_pos, data_p[pos + i]); rev_kmer.set_byte(kmer_rev_pos, CRev_byte::lut[data_p[pos + i]]); } pos += kmer_bytes; uchar byte_shift = 6 - (ptr.kmer_len % 4) * 2; if (byte_shift != 6) --pos; if (kmer_shr) kmer.SHR(kmer_shr); kmer.mask(kmer_mask); rev_kmer.mask(kmer_mask); kmer_lower = kmer < rev_kmer; x = 0; if (kmer_lower) ptr.kxmers[kxmers_pos].set(kmer); else ptr.kxmers[kxmers_pos].set(rev_kmer); uint32 symbols_left = additional_symbols; while (symbols_left) { GetNextSymb(symb, byte_shift, pos, data_p); kmer.SHL_insert_2bits(symb); kmer.mask(kmer_mask); rev_kmer.SHR_insert_2bits(3 - symb, rev_shift); --symbols_left; if (kmer_lower) { if (kmer < rev_kmer) { ptr.kxmers[kxmers_pos].SHL_insert_2bits(symb); ++x; if (x == ptr.max_x) { if(!symbols_left) break; ptr.kxmers[kxmers_pos++].set_2bits(x, ptr.kmer_len * 2 + 
ptr.max_x * 2); if (kxmers_pos >= ptr.kxmers_size) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); kxmers_pos = 0; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; } x = 0; GetNextSymb(symb, byte_shift, pos, data_p); kmer.SHL_insert_2bits(symb); kmer.mask(kmer_mask); rev_kmer.SHR_insert_2bits(3 - symb, rev_shift); --symbols_left; kmer_lower = kmer < rev_kmer; if (kmer_lower) ptr.kxmers[kxmers_pos].set(kmer); else ptr.kxmers[kxmers_pos].set(rev_kmer); } } else { ptr.kxmers[kxmers_pos++].set_2bits(x, ptr.kmer_len * 2 + ptr.max_x * 2); if (kxmers_pos >= ptr.kxmers_size) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); kxmers_pos = 0; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; } x = 0; kmer_lower = false; ptr.kxmers[kxmers_pos].set(rev_kmer); } } else { if (!(kmer < rev_kmer)) { ptr.kxmers[kxmers_pos].set_2bits(3 - symb, ptr.kmer_len * 2 + x * 2); ++x; if (x == ptr.max_x) { if(!symbols_left) break; ptr.kxmers[kxmers_pos++].set_2bits(x, ptr.kmer_len * 2 + ptr.max_x * 2); if (kxmers_pos >= ptr.kxmers_size) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); kxmers_pos = 0; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; } x = 0; GetNextSymb(symb, byte_shift, pos, data_p); kmer.SHL_insert_2bits(symb); kmer.mask(kmer_mask); rev_kmer.SHR_insert_2bits(3 - symb, rev_shift); --symbols_left; kmer_lower = kmer < rev_kmer; if (kmer_lower) ptr.kxmers[kxmers_pos].set(kmer); else ptr.kxmers[kxmers_pos].set(rev_kmer); } } else { ptr.kxmers[kxmers_pos++].set_2bits(x, ptr.kmer_len * 2 + ptr.max_x * 2); if (kxmers_pos >= ptr.kxmers_size) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); kxmers_pos = 0; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; } x = 0; ptr.kxmers[kxmers_pos].set(kmer); kmer_lower = true; } } } ptr.kxmers[kxmers_pos++].set_2bits(x, ptr.kmer_len * 2 + ptr.max_x * 2); if (kxmers_pos >= ptr.kxmers_size) { 
ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); kxmers_pos = 0; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; } if (byte_shift != 6) ++pos; } if (kxmers_pos) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); } else { ptr.sm_pmm_expand->free(_raw_buffer); } } //---------------------------------------------------------------------------------- template void CBigKmerBinUncompactor_Impl, SIZE>::ExpandKxmersAll(CBigKmerBinUncompactor, SIZE>& ptr) { uchar* _raw_buffer; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; uint64 pos = 0; CKmer kmer_mask, kxmer, kxmer_mask; kxmer_mask.set_n_1((ptr.kmer_len + ptr.max_x) * 2); uchar *data_p = ptr.input_data; kmer_mask.set_n_1(ptr.kmer_len * 2); uint64 kxmers_pos = 0; while (pos < ptr.input_data_size) { kxmer.clear(); uint32 additional_symbols = data_p[pos++]; uchar symb; uint32 kmer_bytes = (ptr.kmer_len + 3) / 4; //building kmer for (uint32 i = 0, kmer_pos = 8 * SIZE - 1; i < kmer_bytes; ++i, --kmer_pos) { kxmer.set_byte(kmer_pos, data_p[pos + i]); } pos += kmer_bytes; uchar byte_shift = 6 - (ptr.kmer_len % 4) * 2; if (byte_shift != 6) --pos; uint32 kmer_shr = SIZE * 32 - ptr.kmer_len; if (kmer_shr) kxmer.SHR(kmer_shr); kxmer.mask(kmer_mask); uint32 tmp = MIN(ptr.max_x, additional_symbols); for (uint32 i = 0; i < tmp; ++i) { GetNextSymb(symb, byte_shift, pos, data_p); kxmer.SHL_insert_2bits(symb); } kxmer.set_2bits(tmp, (ptr.kmer_len + ptr.max_x) * 2); ptr.kxmers[kxmers_pos++].set(kxmer); if (kxmers_pos >= ptr.kxmers_size) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); kxmers_pos = 0; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; } additional_symbols -= tmp; uint32 kxmers_count = additional_symbols / (ptr.max_x + 1); uint32 kxmer_rest = additional_symbols % (ptr.max_x + 1); for (uint32 j = 0; j < kxmers_count; ++j) { for (uint32 i = 0; i < ptr.max_x + 1; ++i) { GetNextSymb(symb, byte_shift, pos, data_p); 
kxmer.SHL_insert_2bits(symb); } kxmer.mask(kxmer_mask); kxmer.set_2bits(ptr.max_x, (ptr.kmer_len + ptr.max_x) * 2); ptr.kxmers[kxmers_pos++].set(kxmer); if (kxmers_pos >= ptr.kxmers_size) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); kxmers_pos = 0; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; } } if (kxmer_rest) { uint32 i = 0; GetNextSymb(symb, byte_shift, pos, data_p); kxmer.SHL_insert_2bits(symb); kxmer.mask(kmer_mask); --kxmer_rest; for (; i < kxmer_rest; ++i) { GetNextSymb(symb, byte_shift, pos, data_p); kxmer.SHL_insert_2bits(symb); } kxmer.set_2bits(kxmer_rest, (ptr.kmer_len + ptr.max_x) * 2); ptr.kxmers[kxmers_pos++].set(kxmer); if (kxmers_pos >= ptr.kxmers_size) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); kxmers_pos = 0; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; } } if (byte_shift != 6) ++pos; } if (kxmers_pos) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); } else { ptr.sm_pmm_expand->free(_raw_buffer); } } //---------------------------------------------------------------------------------- template void CBigKmerBinUncompactor_Impl, SIZE>::ExpandKmersBoth(CBigKmerBinUncompactor, SIZE>& ptr) { uchar* _raw_buffer; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; CKmer kmer, rev_kmer, kmer_can, kmer_mask; uint32 kmer_bytes = (ptr.kmer_len + 3) / 4; uint32 kmer_len_shift = (ptr.kmer_len - 1) * 2; kmer_mask.set_n_1(ptr.kmer_len * 2); uchar *data_p = ptr.input_data; uint64 kxmers_pos = 0; uint64 pos = 0; while (pos < ptr.input_data_size) { kmer.clear(); rev_kmer.clear(); uint32 additional_symbols = data_p[pos++]; uchar symb; //building kmer for (uint32 i = 0, kmer_pos = 8 * SIZE - 1, kmer_rev_pos = 0; i < kmer_bytes; ++i, --kmer_pos, ++kmer_rev_pos) { kmer.set_byte(kmer_pos, data_p[pos + i]); rev_kmer.set_byte(kmer_rev_pos, CRev_byte::lut[data_p[pos + i]]); } pos += kmer_bytes; uchar byte_shift = 6 - (ptr.kmer_len % 4) * 2; 
if (byte_shift != 6) --pos; uint32 kmer_shr = SIZE * 32 - ptr.kmer_len; if (kmer_shr) kmer.SHR(kmer_shr); kmer.mask(kmer_mask); rev_kmer.mask(kmer_mask); kmer_can = kmer < rev_kmer ? kmer : rev_kmer; ptr.kxmers[kxmers_pos++].set(kmer_can); if (kxmers_pos >= ptr.kxmers_size) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); kxmers_pos = 0; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; } for (uint32 i = 0; i < additional_symbols; ++i) { symb = (data_p[pos] >> byte_shift) & 3; if (byte_shift == 0) { ++pos; byte_shift = 6; } else byte_shift -= 2; kmer.SHL_insert_2bits(symb); kmer.mask(kmer_mask); rev_kmer.SHR_insert_2bits(3 - symb, kmer_len_shift); kmer_can = kmer < rev_kmer ? kmer : rev_kmer; ptr.kxmers[kxmers_pos++].set(kmer_can); if (kxmers_pos >= ptr.kxmers_size) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); kxmers_pos = 0; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; } } if (byte_shift != 6) ++pos; } if (kxmers_pos) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); } else { ptr.sm_pmm_expand->free(_raw_buffer); } } //---------------------------------------------------------------------------------- template void CBigKmerBinUncompactor_Impl, SIZE>::ExpandKmersAll(CBigKmerBinUncompactor, SIZE>& ptr) { uchar* _raw_buffer; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; uint64 kxmers_pos = 0; uint64 pos = 0; CKmer kmer; uint32 kmer_bytes = (ptr.kmer_len + 3) / 4; CKmer kmer_mask; kmer_mask.set_n_1(ptr.kmer_len * 2); uchar *data_p = ptr.input_data; while (pos < ptr.input_data_size) { kmer.clear(); uint32 additional_symbols = data_p[pos++]; for (uint32 i = 0, kmer_pos = 8 * SIZE - 1; i < kmer_bytes; ++i, --kmer_pos) { kmer.set_byte(kmer_pos, data_p[pos + i]); } pos += kmer_bytes; uchar byte_shift = 6 - (ptr.kmer_len % 4) * 2; if (byte_shift != 6) --pos; uint32 kmer_shr = SIZE * 32 - ptr.kmer_len; if (kmer_shr) kmer.SHR(kmer_shr); 
kmer.mask(kmer_mask); ptr.kxmers[kxmers_pos++].set(kmer); if (kxmers_pos >= ptr.kxmers_size) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); kxmers_pos = 0; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; } for (uint32 i = 0; i < additional_symbols; ++i) { uchar symb = (data_p[pos] >> byte_shift) & 3; if (byte_shift == 0) { ++pos; byte_shift = 6; } else byte_shift -= 2; kmer.SHL_insert_2bits(symb); kmer.mask(kmer_mask); ptr.kxmers[kxmers_pos++].set(kmer); if (kxmers_pos >= ptr.kxmers_size) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); kxmers_pos = 0; ptr.sm_pmm_expand->reserve(_raw_buffer); ptr.kxmers = (CKmer*)_raw_buffer; } } if (byte_shift != 6) ++pos; } if (kxmers_pos) { ptr.bbkq->push(ptr.bin_id, (uchar*)ptr.kxmers, kxmers_pos); } else { ptr.sm_pmm_expand->free(_raw_buffer); } } //---------------------------------------------------------------------------------- template void CBigKmerBinUncompactor_Impl, SIZE>::Uncompact(CBigKmerBinUncompactor, SIZE>& ptr) { if (ptr.max_x) { if (ptr.both_strands) ExpandKxmersBoth(ptr); else ExpandKxmersAll(ptr); } else { if (ptr.both_strands) ExpandKmersBoth(ptr); else ExpandKmersAll(ptr); } } //---------------------------------------------------------------------------------- template void CBigKmerBinUncompactor_Impl, SIZE>::Uncompact(CBigKmerBinUncompactor, SIZE>& ptr) { //"Not supported in current release" } //************************************************************************************************************ // CWBigKmerBinUncompactor - wrapper for multithreading purposes //************************************************************************************************************ template class CWBigKmerBinUncompactor { CBigKmerBinUncompactor* bkb_uncompactor; CBigBinPartQueue* bbpq; CBigBinKXmersQueue* bbkq; CMemoryPool* sm_pmm_input_file; public: CWBigKmerBinUncompactor(CKMCParams& Params, CKMCQueues& Queues); ~CWBigKmerBinUncompactor(); void operator()(); 
}; //---------------------------------------------------------------------------------- // Constructor template CWBigKmerBinUncompactor::CWBigKmerBinUncompactor(CKMCParams& Params, CKMCQueues& Queues) { bkb_uncompactor = new CBigKmerBinUncompactor(Params, Queues); bbpq = Queues.bbpq; bbkq = Queues.bbkq; sm_pmm_input_file = Queues.sm_pmm_input_file; } //---------------------------------------------------------------------------------- // Destructor template CWBigKmerBinUncompactor::~CWBigKmerBinUncompactor() { delete bkb_uncompactor; } //---------------------------------------------------------------------------------- // Execution template void CWBigKmerBinUncompactor::operator()() { int32 bin_id; uchar* data; uint64 size; while (bbpq->pop(bin_id, data, size)) { bkb_uncompactor->Uncompact(bin_id, data, size); sm_pmm_input_file->free(data); } bbkq->mark_completed(); } #endif // ***** EOFKMC-2.3/kmer_counter/bkb_writer.cpp000066400000000000000000000076261257432033000173350ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "stdafx.h" #include "bkb_writer.h" //************************************************************************************************************ // CBigKmerBinWriter //************************************************************************************************************ //---------------------------------------------------------------------------------- CBigKmerBinWriter::CBigKmerBinWriter(CKMCParams& Params, CKMCQueues& Queues) { disk_logger = Queues.disk_logger; bbspq = Queues.bbspq; sm_pmm_sorter_suffixes = Queues.sm_pmm_sorter_suffixes; sm_pmm_sorter_lut = Queues.sm_pmm_sorter_lut; working_directory = Params.working_directory; bbd = Queues.bbd; sm_cbc = Queues.sm_cbc; } //---------------------------------------------------------------------------------- void CBigKmerBinWriter::Process() { int32 curr_bin_id = -1; uchar* suff_buff = NULL; uint64 suff_buff_size; uint64* lut = NULL; uint64 lut_size = 0; bool last_one_in_sub_bin; bool first_in_sub_bin = true; FILE* file = NULL; string name; uint64 file_size = 0; while (bbspq->pop(bin_id, sub_bin_id, suff_buff, suff_buff_size, lut, lut_size, last_one_in_sub_bin)) { if (curr_bin_id != bin_id) { if (curr_bin_id != -1) sm_cbc->push(curr_bin_id); curr_bin_id = bin_id; } if (first_in_sub_bin) { file_size = 0; name = GetName(); file = fopen(name.c_str(), "wb+"); if (!file) { cout << "Can not open file : " << name; exit(1); } setbuf(file, nullptr); } first_in_sub_bin = false; if (suff_buff_size) { disk_logger->log_write(suff_buff_size); if (fwrite(suff_buff, 1, suff_buff_size, file) != suff_buff_size) { cout << "Error while writing to file : " << name; exit(1); } file_size += suff_buff_size; sm_pmm_sorter_suffixes->free(suff_buff); } if (lut_size) { disk_logger->log_write(lut_size * sizeof(uint64)); if (fwrite(lut, sizeof(uint64), lut_size, 
file) != lut_size) { cout << "Error while writing to file : " << name; exit(1); } file_size += lut_size * sizeof(uint64); sm_pmm_sorter_lut->free((uchar*)lut); } if (last_one_in_sub_bin) { bbd->push(bin_id, sub_bin_id, 0, 0, file, name, file_size); first_in_sub_bin = true; } } if(curr_bin_id != -1) sm_cbc->push(curr_bin_id); sm_cbc->mark_completed(); } //---------------------------------------------------------------------------------- string CBigKmerBinWriter::GetName() { string s_tmp = std::to_string(bin_id); while (s_tmp.length() < 5) s_tmp = string("0") + s_tmp; string s1 = std::to_string(sub_bin_id); while (s1.length() < 3) s1 = string("0") + s1; if (*working_directory.rbegin() != '/' && *working_directory.rbegin() != '\\') working_directory += "/"; return working_directory + "kmc_" + s_tmp + "_" + s1 + "_" + s1 + ".bin"; } //************************************************************************************************************ // CWBigKmerBinWriter - wrapper for multithreading purposes //************************************************************************************************************ //---------------------------------------------------------------------------------- // Constructor CWBigKmerBinWriter::CWBigKmerBinWriter(CKMCParams& Params, CKMCQueues& Queues) { bkb_writer = new CBigKmerBinWriter(Params, Queues); } //---------------------------------------------------------------------------------- // Destructor CWBigKmerBinWriter::~CWBigKmerBinWriter() { delete bkb_writer; } //---------------------------------------------------------------------------------- // Execution void CWBigKmerBinWriter::operator()() { bkb_writer->Process(); } // ***** EOF KMC-2.3/kmer_counter/bkb_writer.h000066400000000000000000000026571257432033000170010ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot

  Version: 2.3.0
  Date : 2015-08-21
*/

#ifndef _BKB_WRITER_H
#define _BKB_WRITER_H
#include "params.h"

//************************************************************************************************************
// CBigKmerBinWriter - Write sub bins to HDD
//************************************************************************************************************
class CBigKmerBinWriter
{
	int32 bin_id, sub_bin_id;
	CBigBinSortedPartQueue* bbspq;
	CCompletedBinsCollector* sm_cbc;
	CDiskLogger* disk_logger;
	CMemoryPool * sm_pmm_sorter_suffixes;
	CMemoryPool * sm_pmm_sorter_lut;
	string working_directory;
	CBigBinDesc* bbd;
	string GetName();
public:
	CBigKmerBinWriter(CKMCParams& Params, CKMCQueues& Queues);
	void Process();
};

//************************************************************************************************************
// CWBigKmerBinWriter - wrapper for multithreading purposes
//************************************************************************************************************
class CWBigKmerBinWriter
{
	CBigKmerBinWriter* bkb_writer;
public:
	CWBigKmerBinWriter(CKMCParams& Params, CKMCQueues& Queues);
	void operator()();
	~CWBigKmerBinWriter();
};

#endif

// ***** EOF
KMC-2.3/kmer_counter/defs.h000066400000000000000000000054241257432033000155630ustar00rootroot00000000000000
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot

  Version: 2.3.0
  Date : 2015-08-21
*/

#ifndef _DEFS_H
#define _DEFS_H

#define KMC_VER "2.3.0"
#define KMC_DATE "2015-08-21"

#define _CRT_SECURE_NO_WARNINGS

#define MIN(x,y) ((x) < (y) ? (x) : (y))
#define MAX(x,y) ((x) > (y) ? (x) : (y))
#define NORM(x, lower, upper) ((x) < (lower) ? (lower) : (x) > (upper) ?
(upper) : (x))

#define uchar unsigned char

#include

//uncomment below line to disable asmlib
//#define DISABLE_ASMLIB

//#define DEBUG_MODE
//#define DEVELOP_MODE

#define USE_META_PROG

#define KMER_X 3

#define STATS_FASTQ_SIZE (1 << 28)

#define EXPAND_BUFFER_RECS (1 << 16)

#define MIN_N_BINS 64
#define MAX_N_BINS 2000

#ifndef MAX_K
#define MAX_K 256
#endif

#define MIN_K 1

#define MIN_MEM 1

// Range of number of FASTQ/FASTA reading threads
#define MIN_SF 1
#define MAX_SF 32

// Range of signature length
#define MIN_SL 5
#define MAX_SL 8

// Range of number of splitting threads
#define MIN_SP 1
#define MAX_SP 64

// Range of number of sorting threads
#define MIN_SO 1
#define MAX_SO 64

// Range of number of sorter threads per sorter in strict memory mode
#define MIN_SMSO 1
#define MAX_SMSO 16

// Range of number of uncompactor threads in strict memory mode
#define MIN_SMUN 1
#define MAX_SMUN 16

// Range of number of merger threads in strict memory mode
#define MIN_SMME 1
#define MAX_SMME 16

// Range of number of threads per single sorting thread
#define MIN_SR 1
#define MAX_SR 16

typedef float count_t;

#define KMER_WORDS ((MAX_K + 31) / 32)

#ifdef _DEBUG
#define A_memcpy memcpy
#define A_memset memset
#endif

#ifdef WIN32
#define my_fopen fopen
#define my_fseek _fseeki64
#define my_ftell _ftelli64
typedef int int32;
typedef unsigned int uint32;
typedef long long int64;
typedef unsigned long long uint64;
#else
#define my_fopen fopen
#define my_fseek fseek
#define my_ftell ftell
#define _TCHAR char
#define _tmain main

typedef int int32;
typedef unsigned int uint32;
typedef long long int64;
typedef unsigned long long uint64;

#include
#include
#include
#include
#include
#include
#include
using namespace std;
using __gnu_cxx::copy_n;

#endif

const int32 MAX_STR_LEN = 32768;
#define ALIGNMENT 0x100

#define BYTE_LOG(x) (((x) < (1 << 8)) ? 1 : ((x) < (1 << 16)) ? 2 : ((x) < (1 << 24)) ? 3 : 4)
#define BYTE_LOG_ULL(x) (((x) < (1ull << 8)) ? 1 : ((x) < (1ull << 16)) ?
2 : ((x) < (1ull << 24)) ? 3 : ((x) < (1ull << 32)) ? 4 : ((x) < (1ull << 40) ? 5 : ((x) < (1ull << 48) ? 6 : ((x) < (1ull << 56)) ? 7 : 8)))

#endif

// ***** EOF
KMC-2.3/kmer_counter/develop.cpp000066400000000000000000000075651257432033000166430ustar00rootroot00000000000000
#include "stdafx.h"
#include "develop.h"
#include "params.h"
#include
using namespace std;

void map_log(uint32 signature_len, uint32 map_size, int32* signature_map)
{
#ifdef MAP_LOG_SRC
    FILE* mapLogFile = fopen(MAP_LOG_SRC, "w");
    char ACGT[10];
    ACGT[signature_len] = '\0';
    char symbols[] = { 'A', 'C', 'G', 'T' };
    if (!mapLogFile)
    {
        cout << "Cannot save map log to file";
        exit(1);
    }
    fprintf(mapLogFile, "SIGNATURE | ACGT | BIN NO\n");
    for (uint32 i = 0; i < map_size; ++i)
    {
        // Decode signature i into its symbols, most significant symbol first
        for (int j = signature_len - 1; j >= 0; --j)
            ACGT[signature_len - j - 1] = symbols[(i >> 2 * j) & 3];

        if (signature_map[i] >= 0)
            fprintf(mapLogFile, "%i\t\t%s\t%i\n", i, ACGT, signature_map[i]);
        else
            fprintf(mapLogFile, "%i\t\t%s\tDISABLED_SIGNATURE\n", i, ACGT);
    }

    fclose(mapLogFile);
#endif
}

void save_bins_stats(CKMCQueues& Queues, CKMCParams& Params, uint32 kmer_size, uint32 quality_size, uint64 n_reads)
{
#ifdef KMERS_PER_BIN_LOG_FILE
    int32 bin_id;
    CMemDiskFile *file;
    string name;
    uint64 n_rec;
    uint64 n_plus_x_recs;
    uint64 n_super_kmers;
    uint64 size;

    Queues.bd->reset_reading();
    FILE* stats_file = fopen(KMERS_PER_BIN_LOG_FILE, "w");
    uint64 sum_size, sum_n_rec, sum_n_plus_x_recs, sum_n_super_kmers;
    sum_size = sum_n_rec = sum_n_plus_x_recs = sum_n_super_kmers = 0;
    if (!stats_file)
    {
        cout << "cannot open file to store kmers per bin: " << KMERS_PER_BIN_LOG_FILE << "\n";
        exit(1);
    }
    fprintf(stats_file, "%12s%12s%12s%12s%12s\n", "bin_id", "n_rec", "n_super_kmers", "size", "sorted mem");
    while ((bin_id = Queues.bd->get_next_bin()) >= 0)
    {
        Queues.bd->read(bin_id, file, name, size, n_rec, n_plus_x_recs, n_super_kmers);

        // Reserve memory necessary to process the current bin at all next stages
        uint64 input_kmer_size;
        int64
kxmer_counter_size;
        uint32 kxmer_symbols;
        if (Params.max_x && !Params.use_quake)
        {
            input_kmer_size = n_plus_x_recs * kmer_size;
            kxmer_counter_size = n_plus_x_recs * sizeof(uint32);
            kxmer_symbols = Params.kmer_len + Params.max_x + 1;
        }
        else
        {
            input_kmer_size = n_rec * kmer_size;
            kxmer_counter_size = 0;
            kxmer_symbols = Params.kmer_len;
        }
        uint64 max_out_recs = (n_rec + 1) / max(Params.cutoff_min, 1);

        // Round a size up to the nearest multiple of ALIGNMENT
        std::function<int64(int64)> round_up_to_alignment = [](int64 x){ return (x + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT; };

        uint64 counter_size = min(BYTE_LOG(Params.cutoff_max), BYTE_LOG(Params.counter_max));
        if (quality_size > counter_size)
            counter_size = quality_size;

        uint32 kmer_symbols = Params.kmer_len - Params.lut_prefix_len;
        uint64 kmer_bytes = kmer_symbols / 4;
        uint64 out_buffer_size = max_out_recs * (kmer_bytes + counter_size);

        uint32 rec_len = (kxmer_symbols + 3) / 4;

        uint64 lut_recs = 1 << (2 * Params.lut_prefix_len);
        uint64 lut_size = lut_recs * sizeof(uint64);

        size = round_up_to_alignment(size);
        input_kmer_size = round_up_to_alignment(input_kmer_size);
        out_buffer_size = round_up_to_alignment(out_buffer_size);
        kxmer_counter_size = round_up_to_alignment(kxmer_counter_size);
        lut_size = round_up_to_alignment(lut_size);

        int64 part1_size;
        int64 part2_size;

        if (rec_len % 2 == 0)
        {
            part1_size = input_kmer_size + kxmer_counter_size;
            part2_size = max(max(size, input_kmer_size), out_buffer_size + lut_size);
        }
        else
        {
            part1_size = max(input_kmer_size + kxmer_counter_size, size);
            part2_size = max(input_kmer_size, out_buffer_size + lut_size);
        }
        int64 req_size = part1_size + part2_size;
        fprintf(stats_file, "%12i%12llu%12llu%12llu%12llu\n", bin_id, n_rec, n_super_kmers, size, (uint64)req_size);
        sum_size += size;
        sum_n_rec += n_rec;
        sum_n_plus_x_recs += n_plus_x_recs;
        sum_n_super_kmers += n_super_kmers;
    }

    fprintf(stats_file, "%12s%12llu%12llu%12llu\n", "SUMMARY", sum_n_rec, sum_n_super_kmers, sum_size);
    fprintf(stats_file, "n_reads: %llu\n", n_reads);

    fclose(stats_file);
    exit(1);
#endif
}
KMC-2.3/kmer_counter/develop.h000066400000000000000000000005731257432033000163000ustar00rootroot00000000000000#ifndef _DEVELOP_H #define _DEVELOP_H #include "defs.h" #define MAP_LOG_SRC "map.log" #define KMERS_PER_BIN_LOG_FILE "kmers_per_bin.log" void map_log(uint32 signature_len, uint32 map_size, int32* signature_map); struct CKMCQueues; struct CKMCParams; void save_bins_stats(CKMCQueues& Queues, CKMCParams& Params, uint32 kmer_size, uint32 quality_size, uint64 n_reads); #endifKMC-2.3/kmer_counter/fastq_reader.cpp000066400000000000000000000263631257432033000176420ustar00rootroot00000000000000#include "stdafx.h" /* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include #include "defs.h" #include "fastq_reader.h" //************************************************************************************************************ // CFastqReader - reader class //************************************************************************************************************ uint64 CFastqReader::OVERHEAD_SIZE = 1 << 16; //---------------------------------------------------------------------------------- // Constructor of FASTA/FASTQ reader // Parameters: // * _mm - pointer to memory monitor (to check the memory limits) CFastqReader::CFastqReader(CMemoryMonitor *_mm, CMemoryPool *_pmm_fastq, input_type _file_type, uint32 _gzip_buffer_size, uint32 _bzip2_buffer_size, int _kmer_len) { mm = _mm; pmm_fastq = _pmm_fastq; file_type = _file_type; kmer_len = _kmer_len; // Input file mode (default: uncompressed) mode = m_plain; // Pointers to input files in various formats (uncompressed, gzip-compressed, bzip2-compressed) in = NULL; in_gzip = NULL; in_bzip2 = NULL; bzerror = BZ_OK; // Size and pointer for the buffer part_size = 1 << 23; part = NULL; gzip_buffer_size = _gzip_buffer_size; 
bzip2_buffer_size = _bzip2_buffer_size; containsNextChromosome = false; } //---------------------------------------------------------------------------------- // Destructor - close the files CFastqReader::~CFastqReader() { if(mode == m_plain) { if(in) fclose(in); } else if(mode == m_gzip) { if(in_gzip) gzclose(in_gzip); } else if(mode == m_bzip2) { if(in) { BZ2_bzReadClose(&bzerror, in_bzip2); fclose(in); } } if(part) pmm_fastq->free(part); } //---------------------------------------------------------------------------------- // Set the name of the file to process bool CFastqReader::SetNames(string _input_file_name) { input_file_name = _input_file_name; // Set mode according to the extension of the file name if(input_file_name.size() > 3 && string(input_file_name.end()-3, input_file_name.end()) == ".gz") mode = m_gzip; else if(input_file_name.size() > 4 && string(input_file_name.end()-4, input_file_name.end()) == ".bz2") mode = m_bzip2; else mode = m_plain; return true; } //---------------------------------------------------------------------------------- // Set part size of the buffer bool CFastqReader::SetPartSize(uint64 _part_size) { if(in || in_gzip || in_bzip2) return false; if(_part_size < (1 << 20) || _part_size > (1 << 30)) return false; part_size = _part_size; return true; } //---------------------------------------------------------------------------------- // Open the file bool CFastqReader::OpenFiles() { if(in || in_gzip || in_bzip2) return false; // Uncompressed file if(mode == m_plain) { if((in = fopen(input_file_name.c_str(), "rb")) == NULL) return false; } // Gzip-compressed file else if(mode == m_gzip) { if((in_gzip = gzopen(input_file_name.c_str(), "rb")) == NULL) return false; gzbuffer(in_gzip, gzip_buffer_size); } // Bzip2-compressed file else if(mode == m_bzip2) { in = fopen(input_file_name.c_str(), "rb"); if(!in) return false; setvbuf(in, NULL, _IOFBF, bzip2_buffer_size); if((in_bzip2 = BZ2_bzReadOpen(&bzerror, in, 0, 0, NULL, 0)) == NULL) { 
fclose(in); return false; } } // Reserve via PMM pmm_fastq->reserve(part); part_filled = 0; return true; } //---------------------------------------------------------------------------------- // Read a part of the file in multi line fasta format bool CFastqReader::GetPartFromMultilneFasta(uchar *&_part, uint64 &_size) { uint64 readed = 0; if(!containsNextChromosome) { if(IsEof()) return false; } if(mode == m_plain) readed = fread(part+part_filled, 1, part_size-part_filled, in); else if(mode == m_gzip) readed = gzread(in_gzip, part+part_filled, (int) (part_size-part_filled)); else if(mode == m_bzip2) readed = BZ2_bzRead(&bzerror, in_bzip2, part+part_filled, (int) (part_size-part_filled)); int64 total_filled = part_filled + readed; int64 last_header_pos = 0; int64 pos = 0; for(int64 i = 0 ; i < total_filled ;++i )//find last '>' and remove EOLs { if(part[i] == '>') { int64 tmp = i; SkipNextEOL(part,i,total_filled); copy(part+tmp, part+i, part+pos); last_header_pos = pos; pos += i - tmp; } if(part[i] != '\n' && part[i] != '\r') { part[pos++] = part[i]; } } _part = part; if(last_header_pos == 0)//data in block belong to one seq { part_filled = kmer_len - 1; _size = pos; pmm_fastq->reserve(part); copy(_part+_size-part_filled, _part+_size, part); containsNextChromosome = false; } else//next seq starts at last_header_pos { _size = last_header_pos; part_filled = pos - last_header_pos; pmm_fastq->reserve(part); copy(_part + last_header_pos, _part + pos, part); containsNextChromosome = true; } return true; } //---------------------------------------------------------------------------------- // Read a part of the file bool CFastqReader::GetPart(uchar *&_part, uint64 &_size) { if(!in && !in_gzip && !in_bzip2) return false; if(file_type == multiline_fasta) return GetPartFromMultilneFasta(_part,_size); if(IsEof()) return false; uint64 readed; // Read data if(mode == m_plain) readed = fread(part+part_filled, 1, part_size, in); else if(mode == m_gzip) readed = gzread(in_gzip, 
part+part_filled, (int) part_size); else if(mode == m_bzip2) readed = BZ2_bzRead(&bzerror, in_bzip2, part+part_filled, (int) part_size); else readed = 0; // Never should be here int64 total_filled = part_filled + readed; int64 i; if(part_filled >= OVERHEAD_SIZE) { cout << "Error: Wrong input file!\n"; exit(1); } if(IsEof()) { _part = part; _size = total_filled; part = NULL; return true; } // Look for the end of the last complete record in a buffer if(file_type == fasta) // FASTA files { // Looking for a FASTA record at the end of the area int64 line_start[3]; int32 j; i = total_filled - OVERHEAD_SIZE / 2; for(j = 0; j < 3; ++j) { if(!SkipNextEOL(part, i, total_filled)) break; line_start[j] = i; } _part = part; if(j < 3) _size = 0; else { int k; for(k = 0; k < 2; ++k) if(part[line_start[k]+0] == '>') break; if(k == 2) _size = 0; else _size = line_start[k]; } } else // FASTQ file { // Looking for a FASTQ record at the end of the area int64 line_start[9]; int32 j; i = total_filled - OVERHEAD_SIZE / 2; for(j = 0; j < 9; ++j) { if(!SkipNextEOL(part, i, total_filled)) break; line_start[j] = i; } _part = part; if(j < 9) _size = 0; else { int k; for(k = 0; k < 4; ++k) { if(part[line_start[k]+0] == '@' && part[line_start[k+2]+0] == '+') { if(part[line_start[k+2]+1] == '\n' || part[line_start[k+2]+1] == '\r') break; if(line_start[k+1]-line_start[k] == line_start[k+3]-line_start[k+2] && memcmp(part+line_start[k]+1, part+line_start[k+2]+1, line_start[k+3]-line_start[k+2]-1) == 0) break; } } if(k == 4) _size = 0; else _size = line_start[k]; } } // Allocate new memory for the buffer pmm_fastq->reserve(part); copy(_part+_size, _part+total_filled, part); part_filled = total_filled - _size; return true; } //---------------------------------------------------------------------------------- // Skip to next EOL from the current position in a buffer bool CFastqReader::SkipNextEOL(uchar *part, int64 &pos, int64 max_pos) { int64 i; for(i = pos; i < max_pos-2; ++i) if((part[i] == '\n' || 
part[i] == '\r') && !(part[i+1] == '\n' || part[i+1] == '\r')) break; if(i >= max_pos-2) return false; pos = i+1; return true; } //---------------------------------------------------------------------------------- // Check whether there is an EOF bool CFastqReader::IsEof() { if(mode == m_plain) return feof(in) != 0; else if(mode == m_gzip) return gzeof(in_gzip) != 0; else if(mode == m_bzip2) return bzerror == BZ_STREAM_END; return true; } //************************************************************************************************************ // CWFastqReader - wrapper for multithreading purposes //************************************************************************************************************ CWFastqReader::CWFastqReader(CKMCParams &Params, CKMCQueues &Queues) { mm = Queues.mm; pmm_fastq = Queues.pmm_fastq; input_files_queue = Queues.input_files_queue; part_size = Params.fastq_buffer_size; part_queue = Queues.part_queue; file_type = Params.file_type; kmer_len = Params.p_k; gzip_buffer_size = Params.gzip_buffer_size; bzip2_buffer_size = Params.bzip2_buffer_size; fqr = NULL; } //---------------------------------------------------------------------------------- CWFastqReader::~CWFastqReader() { } //---------------------------------------------------------------------------------- void CWFastqReader::operator()() { uchar *part; uint64 part_filled; while(input_files_queue->pop(file_name)) { fqr = new CFastqReader(mm, pmm_fastq, file_type, gzip_buffer_size, bzip2_buffer_size, kmer_len); fqr->SetNames(file_name); fqr->SetPartSize(part_size); if(fqr->OpenFiles()) { // Reading Fastq parts while(fqr->GetPart(part, part_filled)) part_queue->push(part, part_filled); } else cerr << "Error: Cannot open file " << file_name << "\n"; delete fqr; } part_queue->mark_completed(); } //************************************************************************************************************ // CWStatsFastqReader - wrapper for multithreading purposes 
//************************************************************************************************************ CWStatsFastqReader::CWStatsFastqReader(CKMCParams &Params, CKMCQueues &Queues) { mm = Queues.mm; pmm_fastq = Queues.pmm_fastq; input_files_queue = Queues.input_files_queue; part_size = Params.fastq_buffer_size; stats_part_queue = Queues.stats_part_queue; file_type = Params.file_type; kmer_len = Params.p_k; gzip_buffer_size = Params.gzip_buffer_size; bzip2_buffer_size = Params.bzip2_buffer_size; fqr = NULL; } //---------------------------------------------------------------------------------- CWStatsFastqReader::~CWStatsFastqReader() { } //---------------------------------------------------------------------------------- void CWStatsFastqReader::operator()() { uchar *part; uint64 part_filled; bool finished = false; while (input_files_queue->pop(file_name) && !finished) { fqr = new CFastqReader(mm, pmm_fastq, file_type, gzip_buffer_size, bzip2_buffer_size, kmer_len); fqr->SetNames(file_name); fqr->SetPartSize(part_size); if (fqr->OpenFiles()) { // Reading Fastq parts while (fqr->GetPart(part, part_filled)) { if (!stats_part_queue->push(part, part_filled)) { finished = true; pmm_fastq->free(part); break; } } } else cerr << "Error: Cannot open file " << file_name << "\n"; delete fqr; } stats_part_queue->mark_completed(); } // ***** EOF KMC-2.3/kmer_counter/fastq_reader.h000066400000000000000000000057751257432033000173130ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _FASTQ_READER_H #define _FASTQ_READER_H #include "defs.h" #include "params.h" #include #include #include "libs/zlib.h" #include "libs/bzlib.h" using namespace std; //************************************************************************************************************ // FASTA/FASTQ reader class //************************************************************************************************************ class CFastqReader { typedef enum {m_plain, m_gzip, m_bzip2} t_mode; CMemoryMonitor *mm; CMemoryPool *pmm_fastq; string input_file_name; input_type file_type; int kmer_len; t_mode mode; FILE *in; gzFile_s *in_gzip; BZFILE *in_bzip2; int bzerror; uint64 part_size; uchar *part; uint64 part_filled; uint32 gzip_buffer_size; uint32 bzip2_buffer_size; bool containsNextChromosome; //for multiline_fasta processing bool SkipNextEOL(uchar *part, int64 &pos, int64 max_pos); bool IsEof(); public: CFastqReader(CMemoryMonitor *_mm, CMemoryPool *_pmm_fastq, input_type _file_type, uint32 _gzip_buffer_size, uint32 _bzip2_buffer_size, int _kmer_len); ~CFastqReader(); static uint64 OVERHEAD_SIZE; bool SetNames(string _input_file_name); bool SetPartSize(uint64 _part_size); bool OpenFiles(); bool GetPartFromMultilneFasta(uchar *&_part, uint64 &_size); bool GetPart(uchar *&_part, uint64 &_size); }; //************************************************************************************************************ // Wrapper for FASTA/FASTQ reader class - for multithreading purposes //************************************************************************************************************ class CWFastqReader { CMemoryMonitor *mm; CMemoryPool *pmm_fastq; CFastqReader *fqr; string file_name; uint64 part_size; CInputFilesQueue *input_files_queue; CPartQueue *part_queue; input_type file_type; uint32 
gzip_buffer_size; uint32 bzip2_buffer_size; int kmer_len; public: CWFastqReader(CKMCParams &Params, CKMCQueues &Queues); ~CWFastqReader(); void operator()(); }; //************************************************************************************************************ // Wrapper for FASTA/FASTQ reader class (stats mode) - for multithreading purposes //************************************************************************************************************ class CWStatsFastqReader { CMemoryMonitor *mm; CMemoryPool *pmm_fastq; CFastqReader *fqr; string file_name; uint64 part_size; CInputFilesQueue *input_files_queue; CStatsPartQueue *stats_part_queue; input_type file_type; uint32 gzip_buffer_size; uint32 bzip2_buffer_size; int kmer_len; public: CWStatsFastqReader(CKMCParams &Params, CKMCQueues &Queues); ~CWStatsFastqReader(); void operator()(); }; #endif // ***** EOF KMC-2.3/kmer_counter/kb_collector.h000066400000000000000000000125111257432033000172770ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _KB_COLLECTOR_H #define _KB_COLLECTOR_H #include "defs.h" #include "params.h" #include "kmer.h" #include "queues.h" #include "radix.h" #include "rev_byte.h" #include #include #include #include #include #include using namespace std; //---------------------------------------------------------------------------------- // Class collecting kmers belonging to a single bin class CKmerBinCollector { enum comparision_state { kmer_smaller, rev_smaller, equals }; uint32 bin_no; CBinPartQueue *bin_part_queue; CBinDesc *bd; uint32 kmer_len; uchar* buffer; uint32 buffer_size; uint32 buffer_pos; CMemoryPool *pmm_bins; uint32 n_recs; uint32 n_plus_x_recs; uint32 n_super_kmers; int lowest_quality; uint32 max_x; uint32 kmer_bytes; bool both_strands; template void update_n_plus_x_recs(char* seq, uint32 n); public: CKmerBinCollector(CKMCQueues& Queues, CKMCParams& Params, uint32 _buffer_size, uint32 _bin_no); void PutExtendedKmer(char* seq, uint32 n); void PutExtendedKmer(char* seq, char* quals, uint32 n);//for quake mode inline void Flush(); }; //---------------------------------------------------------------------------------- //---------------------------------------------------------------------------------- //---------------------------------------------------------------------------------- CKmerBinCollector::CKmerBinCollector(CKMCQueues& Queues, CKMCParams& Params, uint32 _buffer_size, uint32 _bin_no) { bin_part_queue = Queues.bpq; kmer_len = Params.kmer_len; bd = Queues.bd; buffer_size = _buffer_size; pmm_bins = Queues.pmm_bins; lowest_quality = Params.lowest_quality; max_x = Params.max_x; bin_no = _bin_no; n_recs = 0; n_super_kmers = 0; n_plus_x_recs = 0; buffer_pos = 0; pmm_bins->reserve(buffer); both_strands = Params.both_strands; kmer_bytes = (kmer_len + 3) / 4; } 
//--------------------------------------------------------------------------------- void CKmerBinCollector::PutExtendedKmer(char* seq, uint32 n) { uint32 bytes = 1 + (n + 3) / 4; if(buffer_pos + bytes > buffer_size) { //send current buff Flush(); pmm_bins->reserve(buffer); buffer_pos = 0; n_recs = 0; n_super_kmers = 0; n_plus_x_recs = 0; } buffer[buffer_pos++] = n - kmer_len; for(uint32 i = 0, j = 0 ; i < n / 4 ; ++i,j+=4) buffer[buffer_pos++] = (seq[j] << 6) + (seq[j + 1] << 4) + (seq[j + 2] << 2) + seq[j + 3]; switch (n%4) { case 1: buffer[buffer_pos++] = (seq[n-1] << 6); break; case 2: buffer[buffer_pos++] = (seq[n-2] << 6) + (seq[n-1] << 4); break; case 3: buffer[buffer_pos++] = (seq[n-3] << 6) + (seq[n-2] << 4) + (seq[n-1] << 2); break; } ++n_super_kmers; n_recs += n - kmer_len + 1; if (max_x) ///for max_x = 0 k-mers (not k+x-mers) will be sorted { if (!both_strands) n_plus_x_recs += 1 + (n - kmer_len) / (max_x + 1); else { switch (max_x) { case 1: update_n_plus_x_recs<2>(seq, n); break; case 2: update_n_plus_x_recs<3>(seq, n); break; case 3: update_n_plus_x_recs<4>(seq, n); break; } } } } //--------------------------------------------------------------------------------- template void CKmerBinCollector::update_n_plus_x_recs(char* seq, uint32 n) { uchar kmer, rev; uint32 kmer_pos = 4; uint32 rev_pos = kmer_len; uint32 x; kmer = (seq[0] << 6) + (seq[1] << 4) + (seq[2] << 2) + seq[3]; rev = ((3 - seq[kmer_len - 1]) << 6) + ((3 - seq[kmer_len - 2]) << 4) + ((3 - seq[kmer_len - 3]) << 2) + (3 - seq[kmer_len - 4]); x = 0; comparision_state current_state, new_state; if (kmer < rev) current_state = kmer_smaller; else if (rev < kmer) current_state = rev_smaller; else current_state = equals; for (uint32 i = 0; i < n - kmer_len; ++i) { rev >>= 2; rev += (3 - seq[rev_pos++]) << 6; kmer <<= 2; kmer += seq[kmer_pos++]; if (kmer < rev) new_state = kmer_smaller; else if (rev < kmer) new_state = rev_smaller; else new_state = equals; if (new_state == current_state) { if 
(current_state == equals) ++n_plus_x_recs; else ++x; } else { current_state = new_state; n_plus_x_recs += 1 + x / DIVIDE_FACTOR; x = 0; } } n_plus_x_recs += 1 + x / DIVIDE_FACTOR; } //--------------------------------------------------------------------------------- void CKmerBinCollector::PutExtendedKmer(char* seq, char* quals, uint32 n) { uint32 bytes = n + 1; if (buffer_pos + bytes > buffer_size) { Flush(); pmm_bins->reserve(buffer); buffer_pos = 0; n_recs = 0; n_super_kmers = 0; } n_recs += n - kmer_len + 1; ++n_super_kmers; buffer[buffer_pos++] = n - kmer_len; char qual; for (uint32 i = 0; i < n; ++i) { qual = quals[i] - lowest_quality; if (qual > 63) qual = 63; buffer[buffer_pos++] = (seq[i] << 6) + qual; } } //--------------------------------------------------------------------------------- void CKmerBinCollector::Flush() { bin_part_queue->push(bin_no, buffer, buffer_pos, buffer_size); bd->insert(bin_no, NULL, "", buffer_pos, n_recs, n_plus_x_recs, n_super_kmers); } #endif // ***** EOF KMC-2.3/kmer_counter/kb_completer.cpp000066400000000000000000000220251257432033000176370ustar00rootroot00000000000000#include "stdafx.h" /* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include #include #include #include "kb_completer.h" using namespace std; extern uint64 total_reads; //************************************************************************************************************ // CKmerBinCompleter //************************************************************************************************************ //---------------------------------------------------------------------------------- // Assign queues and monitors CKmerBinCompleter::CKmerBinCompleter(CKMCParams &Params, CKMCQueues &Queues) { mm = Queues.mm; file_name = Params.output_file_name; kq = Queues.kq; bd = Queues.bd; s_mapper = Queues.s_mapper; memory_bins = Queues.memory_bins; bbkpq = Queues.bbkpq; use_strict_mem = Params.use_strict_mem; kmer_file_name = file_name + ".kmc_suf"; lut_file_name = file_name + ".kmc_pre"; kmer_len = Params.kmer_len; signature_len = Params.signature_len; cutoff_min = Params.cutoff_min; cutoff_max = (int32)Params.cutoff_max; counter_max = (int32)Params.counter_max; lut_prefix_len = Params.lut_prefix_len; both_strands = Params.both_strands; kmer_t_size = Params.KMER_T_size; use_quake = Params.use_quake; } //---------------------------------------------------------------------------------- CKmerBinCompleter::~CKmerBinCompleter() { } //---------------------------------------------------------------------------------- // Store sorted and compacted bins to the output file (stage first) void CKmerBinCompleter::ProcessBinsFirstStage() { int32 bin_id = 0; uchar *data = NULL; uint64 data_size = 0; uchar *lut = NULL; uint64 lut_size = 0; counter_size = 0; sig_map_size = (1 << (signature_len * 2)) + 1; sig_map = new uint32[sig_map_size]; fill_n(sig_map, sig_map_size, 0); lut_pos = 0; if(use_quake) counter_size = 4; else counter_size = min(BYTE_LOG(cutoff_max), BYTE_LOG(counter_max)); 
// Open output file out_kmer = fopen(kmer_file_name.c_str(), "wb"); if(!out_kmer) { cout << "Error: Cannot create " << kmer_file_name << "\n"; exit(1); return; } out_lut = fopen(lut_file_name.c_str(), "wb"); if(!out_lut) { cout << "Error: Cannot create " << lut_file_name << "\n"; fclose(out_kmer); exit(1); return; } n_recs = 0; _n_unique = _n_cutoff_min = _n_cutoff_max = _n_total = 0; n_unique = n_cutoff_min = n_cutoff_max = n_total = 0; char s_kmc_pre[] = "KMCP"; char s_kmc_suf[] = "KMCS"; // Markers at the beginning fwrite(s_kmc_pre, 1, 4, out_lut); fwrite(s_kmc_suf, 1, 4, out_kmer); // Process priority queue of ready-to-output bins while(!kq->empty()) { // Get the next bin if (!kq->pop(bin_id, data, data_size, lut, lut_size, _n_unique, _n_cutoff_min, _n_cutoff_max, _n_total)) continue; // Decrease memory size allocated by stored bin string name; uint64 n_rec; uint64 n_plus_x_recs; uint64 n_super_kmers; uint64 raw_size; CMemDiskFile *file; bd->read(bin_id, file, name, raw_size, n_rec, n_plus_x_recs, n_super_kmers); uint64 lut_recs = lut_size / sizeof(uint64); // Write bin data to the output file #ifdef WIN32 //fwrite bug https://connect.microsoft.com/VisualStudio/feedback/details/755018/fwrite-hangs-with-large-size-count uint64 write_offset = 0; uint64 left_to_write = data_size; while (left_to_write) { uint64 current_to_write = MIN(left_to_write, (4ull << 30) - 1); fwrite(data + write_offset, 1, current_to_write, out_kmer); write_offset += current_to_write; left_to_write -= current_to_write; } #else fwrite(data, 1, data_size, out_kmer); #endif memory_bins->free(bin_id, CMemoryBins::mba_suffix); uint64 *ulut = (uint64*) lut; for(uint64 i = 0; i < lut_recs; ++i) { uint64 x = ulut[i]; ulut[i] = n_recs; n_recs += x; } fwrite(lut, lut_recs, sizeof(uint64), out_lut); //fwrite(&n_rec, 1, sizeof(uint64), out_lut); memory_bins->free(bin_id, CMemoryBins::mba_lut); n_unique += _n_unique; n_cutoff_min += _n_cutoff_min; n_cutoff_max += _n_cutoff_max; n_total += _n_total; for 
(uint32 i = 0; i < sig_map_size; ++i)
        {
            if (s_mapper->get_bin_id(i) == bin_id)
            {
                sig_map[i] = lut_pos;
            }
        }
        ++lut_pos;
    }
}

//----------------------------------------------------------------------------------
// Store sorted and compacted bins to the output file (stage second)
void CKmerBinCompleter::ProcessBinsSecondStage()
{
    char s_kmc_pre[] = "KMCP";
    char s_kmc_suf[] = "KMCS";
    if (use_strict_mem)
    {
        int32 bin_id;
        uchar *data = NULL;
        uint64 data_size = 0;
        uchar *lut = NULL;
        uint64 lut_size = 0;
        bool last_in_bin = false;
        while (bbkpq->pop(bin_id, data, data_size, lut, lut_size, _n_unique, _n_cutoff_min, _n_cutoff_max, _n_total, last_in_bin))
        {
            if (data_size)
            {
                fwrite(data, 1, data_size, out_kmer);
                sm_pmm_merger_suff->free(data);
            }
            if (lut_size)
            {
                // Convert per-prefix record counts into running offsets
                uint64 lut_recs = lut_size / sizeof(uint64);
                uint64* ulut = (uint64*)lut;
                for (uint64 i = 0; i < lut_recs; ++i)
                {
                    uint64 x = ulut[i];
                    ulut[i] = n_recs;
                    n_recs += x;
                }
                fwrite(lut, lut_recs, sizeof(uint64), out_lut);
                sm_pmm_merger_lut->free(lut);
            }
            if(last_in_bin)
            {
                n_unique += _n_unique;
                n_cutoff_min += _n_cutoff_min;
                n_cutoff_max += _n_cutoff_max;
                n_total += _n_total;
                for (uint32 i = 0; i < sig_map_size; ++i)
                {
                    if (s_mapper->get_bin_id(i) == bin_id)
                    {
                        sig_map[i] = lut_pos;
                    }
                }
                ++lut_pos;
            }
        }
    }

    // Marker at the end
    fwrite(s_kmc_suf, 1, 4, out_kmer);
    fclose(out_kmer);

    fwrite(&n_recs, 1, sizeof(uint64), out_lut);

    //store signature mapping
    fwrite(sig_map, sizeof(uint32), sig_map_size, out_lut);

    // Store header
    uint32 offset = 0;

    store_uint(out_lut, kmer_len, 4); offset += 4;
    store_uint(out_lut, (uint32)use_quake, 4); offset += 4; // mode: 0 (counting), 1 (Quake-compatible counting)
    store_uint(out_lut, counter_size, 4); offset += 4;
    store_uint(out_lut, lut_prefix_len, 4); offset += 4;
    store_uint(out_lut, signature_len, 4); offset += 4;
    store_uint(out_lut, cutoff_min, 4); offset += 4;
    store_uint(out_lut, cutoff_max, 4); offset += 4;
    store_uint(out_lut, n_unique - n_cutoff_min - n_cutoff_max, 8); offset += 8;
    store_uint(out_lut,
both_strands ? 0 : 1, 1); offset++; // Space for future use for (int32 i = 0; i < 27; ++i) { store_uint(out_lut, 0, 1); offset ++; } store_uint(out_lut, 0x200, 4); offset += 4; store_uint(out_lut, offset, 4); // Marker at the end fwrite(s_kmc_pre, 1, 4, out_lut); fclose(out_lut); cout << "\n"; delete[] sig_map; } //---------------------------------------------------------------------------------- // Return statistics void CKmerBinCompleter::GetTotal(uint64 &_n_unique, uint64 &_n_cutoff_min, uint64 &_n_cutoff_max, uint64 &_n_total) { _n_unique = n_unique; _n_cutoff_min = n_cutoff_min; _n_cutoff_max = n_cutoff_max; _n_total = n_total; } //---------------------------------------------------------------------------------- // Store single unsigned integer in LSB fashion bool CKmerBinCompleter::store_uint(FILE *out, uint64 x, uint32 size) { for(uint32 i = 0; i < size; ++i) putc((x >> (i * 8)) & 0xFF, out); return true; } //---------------------------------------------------------------------------------- //Init memory pools for 2nd stage void CKmerBinCompleter::InitStage2(CKMCParams& Params, CKMCQueues& Queues) { sm_pmm_merger_lut = Queues.sm_pmm_merger_lut; sm_pmm_merger_suff = Queues.sm_pmm_merger_suff; } //************************************************************************************************************ // CWKmerBinCompleter //************************************************************************************************************ //---------------------------------------------------------------------------------- // Constructor CWKmerBinCompleter::CWKmerBinCompleter(CKMCParams &Params, CKMCQueues &Queues) { kbc = new CKmerBinCompleter(Params, Queues); } void CWKmerBinCompleter::InitStage2(CKMCParams& Params, CKMCQueues& Queues) { kbc->InitStage2(Params, Queues); } //---------------------------------------------------------------------------------- // Destructor CWKmerBinCompleter::~CWKmerBinCompleter() { delete kbc; } 
//---------------------------------------------------------------------------------- // Execution void CWKmerBinCompleter::operator()(bool first_stage) { if(first_stage) kbc->ProcessBinsFirstStage(); else kbc->ProcessBinsSecondStage(); } //---------------------------------------------------------------------------------- // Return statistics void CWKmerBinCompleter::GetTotal(uint64 &_n_unique, uint64 &_n_cutoff_min, uint64 &_n_cutoff_max, uint64 &_n_total) { if(kbc) kbc->GetTotal(_n_unique, _n_cutoff_min, _n_cutoff_max, _n_total); } // ***** EOF KMC-2.3/kmer_counter/kb_completer.h000066400000000000000000000201711257432033000173040ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _KB_COMPLETER_H #define _KB_COMPLETER_H #include "defs.h" #include "params.h" #include "kmer.h" #include "radix.h" #include #include #include #include #include #include "small_k_buf.h" //************************************************************************************************************ // CKmerBinCompleter - complete the sorted bins and store in a file //************************************************************************************************************ class CKmerBinCompleter { CMemoryMonitor *mm; string file_name, kmer_file_name, lut_file_name; CKmerQueue *kq; CBinDesc *bd; CSignatureMapper *s_mapper; uint32 *sig_map; uint64 _n_unique, _n_cutoff_min, _n_cutoff_max, _n_total; uint64 n_recs; FILE *out_kmer, *out_lut; uint32 lut_pos; uint32 sig_map_size; uint64 counter_size; CMemoryBins *memory_bins; bool use_strict_mem; CBigBinKmerPartQueue* bbkpq; CMemoryPool *sm_pmm_merger_lut, *sm_pmm_merger_suff; uint32 lut_prefix_len; uint64 n_unique, n_cutoff_min, n_cutoff_max, n_total; uint32 kmer_t_size; int32 cutoff_min, cutoff_max; int32 counter_max; 
int32 kmer_len; int32 signature_len; bool use_quake; bool both_strands; bool store_uint(FILE *out, uint64 x, uint32 size); public: CKmerBinCompleter(CKMCParams &Params, CKMCQueues &Queues); ~CKmerBinCompleter(); void ProcessBinsFirstStage(); void ProcessBinsSecondStage(); void GetTotal(uint64 &_n_unique, uint64 &_n_cutoff_min, uint64 &_n_cutoff_max, uint64 &_n_total); void InitStage2(CKMCParams& Params, CKMCQueues& Queues); }; //************************************************************************************************************ // CWKmerBinCompleter - wrapper for multithreading purposes //************************************************************************************************************ class CWKmerBinCompleter { CKmerBinCompleter *kbc; public: CWKmerBinCompleter(CKMCParams &Params, CKMCQueues &Queues); ~CWKmerBinCompleter(); void operator()(bool first_stage); void GetTotal(uint64 &_n_unique, uint64 &_n_cutoff_min, uint64 &_n_cutoff_max, uint64 &_n_total); void InitStage2(CKMCParams& Params, CKMCQueues& Queues); }; //************************************************************************************************************ // SmallKCompleter - completer for small k optimization //************************************************************************************************************ template class CSmallKCompleter { CMemoryPool *pmm_small_k_completer; uint64 n_unique, n_cutoff_min, n_cutoff_max; uint32 lut_prefix_len; int64 cutoff_max, counter_max; int cutoff_min; uint32 kmer_len; int64 mem_tot_small_k_completer; std::string output_file_name; bool both_strands; bool use_quake; bool store_uint(FILE *out, uint64 x, uint32 size); public: CSmallKCompleter(CKMCParams& Params, CKMCQueues& Queues); template bool Complete(CSmallKBuf results); void GetTotal(uint64 &_n_unique, uint64 &_n_cutoff_min, uint64 &_n_cutoff_max); }; template CSmallKCompleter::CSmallKCompleter(CKMCParams& Params, CKMCQueues& Queues) { pmm_small_k_completer = 
Queues.pmm_small_k_completer; n_unique = n_cutoff_min = n_cutoff_max = 0; lut_prefix_len = Params.lut_prefix_len; cutoff_max = Params.cutoff_max; cutoff_min = Params.cutoff_min; counter_max = Params.counter_max; both_strands = Params.both_strands; kmer_len = (uint32)Params.kmer_len; use_quake = Params.use_quake; mem_tot_small_k_completer = Params.mem_tot_small_k_completer; output_file_name = Params.output_file_name; } template bool CSmallKCompleter::store_uint(FILE *out, uint64 x, uint32 size) { for (uint32 i = 0; i < size; ++i) putc((x >> (i * 8)) & 0xFF, out); return true; } template template bool CSmallKCompleter::Complete(CSmallKBuf result) { uchar* raw_buffer; uint64 counter_size = 0; if (use_quake) counter_size = 4; else counter_size = min(BYTE_LOG_ULL((uint64)cutoff_max), BYTE_LOG_ULL((uint64)counter_max)); uint64 kmer_suf_bytes = (kmer_len - lut_prefix_len) / 4; pmm_small_k_completer->reserve(raw_buffer); uint32 lut_recs = (1 << 2 * lut_prefix_len); uint32 lut_buf_recs = (uint32)(MIN(lut_recs * sizeof(uint64), (uint64)mem_tot_small_k_completer / 2) / sizeof(uint64)); uint32 lut_buf_pos = 0; uint32 suf_size = (uint32)(mem_tot_small_k_completer - lut_buf_recs * sizeof(uint64)); uint32 suf_recs = (uint32)(suf_size / (counter_size + kmer_suf_bytes) * (counter_size + kmer_suf_bytes)); uint32 suf_pos = 0; uint64* lut = (uint64*)raw_buffer; uchar* suf = raw_buffer + lut_buf_recs * sizeof(uint64); FILE* suf_file, *pre_file; string pre_file_name = output_file_name + ".kmc_pre"; string suf_file_name = output_file_name + ".kmc_suf"; pre_file = fopen(pre_file_name.c_str(), "wb"); if (!pre_file) { cout << "Error: Cannot create " << pre_file_name << "\n"; exit(1); return false; } suf_file = fopen(suf_file_name.c_str(), "wb"); if (!suf_file) { cout << "Error: Cannot create " << suf_file_name << "\n"; fclose(pre_file); exit(1); return false; } char s_kmc_pre[] = "KMCP"; char s_kmc_suf[] = "KMCS"; // Markers at the beginning fwrite(s_kmc_pre, 1, 4, pre_file); 
fwrite(s_kmc_suf, 1, 4, suf_file); CKmer<1> kmer; uint64 prev_prefix = 0, prefix; lut[lut_buf_pos++] = 0; uint64 kmer_no = 0; for (kmer.data = 0; kmer.data < (1ull << 2 * kmer_len); ++kmer.data) { prefix = kmer.remove_suffix(2 * (kmer_len - lut_prefix_len)); if (prefix != prev_prefix) //new prefix { prev_prefix = prefix; lut[lut_buf_pos++] = kmer_no; if (lut_buf_pos >= lut_buf_recs) { fwrite(lut, sizeof(uint64), lut_buf_pos, pre_file); lut_buf_pos = 0; } } if (result.buf[kmer.data]) //k-mer exists { ++n_unique; if (result.buf[kmer.data] < (uint32)cutoff_min) ++n_cutoff_min; else if (result.buf[kmer.data] > (uint64)cutoff_max) ++n_cutoff_max; else { ++kmer_no; if (result.buf[kmer.data] > (uint64)counter_max) result.buf[kmer.data] = (COUNTER_TYPE)counter_max; for (int32 j = (int32)kmer_suf_bytes - 1; j >= 0; --j) suf[suf_pos++] = kmer.get_byte(j); result.Store(kmer.data, suf, suf_pos, counter_size); if (suf_pos >= suf_recs * (kmer_suf_bytes + counter_size)) { fwrite(suf, 1, suf_pos, suf_file); suf_pos = 0; } } } } fwrite(lut, sizeof(uint64), lut_buf_pos, pre_file); fwrite(suf, 1, suf_pos, suf_file); uint32 offset = 0; store_uint(pre_file, kmer_len, 4); offset += 4; store_uint(pre_file, (uint32)use_quake, 4); offset += 4; // mode: 0 (counting), 1 (Quake-compatible counting) store_uint(pre_file, counter_size, 4); offset += 4; store_uint(pre_file, lut_prefix_len, 4); offset += 4; store_uint(pre_file, cutoff_min, 4); offset += 4; store_uint(pre_file, cutoff_max, 4); offset += 4; store_uint(pre_file, n_unique - n_cutoff_min - n_cutoff_max, 8); offset += 8; store_uint(pre_file, both_strands ? 
0 : 1, 1); offset++; store_uint(pre_file, 0, 1); offset++; store_uint(pre_file, 0, 1); offset++; store_uint(pre_file, 0, 1); offset++; store_uint(pre_file, cutoff_max >> 32, 4); offset += 4; // Space for future use for (int32 i = 0; i < 20; ++i) { store_uint(pre_file, 0, 1); offset++; } store_uint(pre_file, 0x0, 4); //KMC 1.x format offset += 4; store_uint(pre_file, offset, 4); // Markers at the end fwrite(s_kmc_pre, 1, 4, pre_file); fwrite(s_kmc_suf, 1, 4, suf_file); fclose(pre_file); fclose(suf_file); pmm_small_k_completer->free(raw_buffer); return true; } template void CSmallKCompleter::GetTotal(uint64 &_n_unique, uint64 &_n_cutoff_min, uint64 &_n_cutoff_max) { _n_unique = n_unique; _n_cutoff_min = n_cutoff_min; _n_cutoff_max = n_cutoff_max; } #endif // ***** EOF KMC-2.3/kmer_counter/kb_reader.h000066400000000000000000000146611257432033000165630ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _KB_READER_H #define _KB_READER_H #include "defs.h" #include "params.h" #include "kmer.h" #include "s_mapper.h" #include "radix.h" #include #include #include #include #include #include //************************************************************************************************************ // CKmerBinReader - reader of bins from distribution phase //************************************************************************************************************ template class CKmerBinReader { CMemoryMonitor *mm; CSignatureMapper* s_mapper; CBinDesc *bd; CBinQueue *bq; CTooLargeBinsQueue *tlbq; CMemoryBins *memory_bins; CDiskLogger* disk_logger; int32 cutoff_min, cutoff_max; int32 counter_max; int32 kmer_len; int32 lut_prefix_len; uint32 max_x; bool both_strands; bool use_quake; int64 round_up_to_alignment(int64 x) { return (x + ALIGNMENT-1) / 
ALIGNMENT * ALIGNMENT; } public: CKmerBinReader(CKMCParams &Params, CKMCQueues &Queues); ~CKmerBinReader(); void ProcessBins(); }; //---------------------------------------------------------------------------------- // Assign monitors and queues template CKmerBinReader::CKmerBinReader(CKMCParams &Params, CKMCQueues &Queues) { mm = Queues.mm; // dm = Queues.dm; bd = Queues.bd; bq = Queues.bq; tlbq = Queues.tlbq; disk_logger = Queues.disk_logger; memory_bins = Queues.memory_bins; kmer_len = Params.kmer_len; cutoff_min = Params.cutoff_min; cutoff_max = (int32)Params.cutoff_max; counter_max = (int32)Params.counter_max; both_strands = Params.both_strands; use_quake = Params.use_quake; max_x = Params.max_x; s_mapper = Queues.s_mapper; lut_prefix_len = Params.lut_prefix_len; } //---------------------------------------------------------------------------------- template CKmerBinReader::~CKmerBinReader() { } //---------------------------------------------------------------------------------- // Read all bins from temporary HDD template void CKmerBinReader::ProcessBins() { uchar *data; uint64 readed; int32 bin_id; CMemDiskFile *file; string name; uint64 size; uint64 n_rec; uint64 n_plus_x_recs; uint32 buffer_size; uint32 kmer_len; bd->init_random(); while((bin_id = bd->get_next_random_bin()) >= 0) // Get id of the next bin to read { bd->read(bin_id, file, name, size, n_rec, n_plus_x_recs, buffer_size, kmer_len); fflush(stdout); // Reserve memory necessary to process the current bin at all next stages uint64 input_kmer_size; uint64 kxmer_counter_size; uint32 kxmer_symbols; if (max_x && !use_quake) { input_kmer_size = n_plus_x_recs * sizeof(KMER_T); kxmer_counter_size = n_plus_x_recs * sizeof(uint32); kxmer_symbols = kmer_len + max_x + 1; } else { input_kmer_size = n_rec * sizeof(KMER_T); kxmer_counter_size = 0; kxmer_symbols = kmer_len; } uint64 max_out_recs = (n_rec+1) / max(cutoff_min, 1); uint64 counter_size = min(BYTE_LOG(cutoff_max), BYTE_LOG(counter_max)); 
if(KMER_T::QUALITY_SIZE > counter_size) counter_size = KMER_T::QUALITY_SIZE; uint32 kmer_symbols = kmer_len - lut_prefix_len; uint64 kmer_bytes = kmer_symbols / 4; uint64 out_buffer_size = max_out_recs * (kmer_bytes + counter_size); uint32 rec_len = (kxmer_symbols + 3) / 4; uint64 lut_recs = 1 << (2 * lut_prefix_len); uint64 lut_size = lut_recs * sizeof(uint64); if (!memory_bins->init(bin_id, rec_len, round_up_to_alignment(size), round_up_to_alignment(input_kmer_size), round_up_to_alignment(out_buffer_size), round_up_to_alignment(kxmer_counter_size), round_up_to_alignment(lut_size))) { tlbq->insert(bin_id); continue; } #ifdef DEBUG_MODE cout << bin_id << ": " << name << " " << c_disk << " " << size << " " << n_rec << "\n"; #else cout << "*"; #endif // Process the bin if it is not empty if(size > 0) { if (file == NULL) { cout << "Error: Cannot open temporary file: " << name << "\n"; fflush(stdout); exit(1); } else file->Rewind(); memory_bins->reserve(bin_id, data, CMemoryBins::mba_input_file); //readed = fread(data, 1, size, file); readed = file->Read(data, 1, size); if(readed != size) { cout << "Error: Corrupted file: " << name << " " << "Real size : " << readed << " " << "Should be : " << size << "\n"; fflush(stdout); exit(1); } // Push bin data to a queue of bins to process bq->push(bin_id, data, size, n_rec); } else // Push empty bin to process (necessary, since all bin ids must be processed) bq->push(bin_id, NULL, 0, 0); file->Close(); //Remove temporary file #ifndef DEVELOP_MODE file->Remove(); #endif disk_logger->log_remove(size); } bq->mark_completed(); fflush(stdout); } //************************************************************************************************************ // CWKmerBinReader - wrapper for multithreading purposes //************************************************************************************************************ //---------------------------------------------------------------------------------- template class CWKmerBinReader 
{ CKmerBinReader *kbr; public: CWKmerBinReader(CKMCParams &Params, CKMCQueues &Queues); ~CWKmerBinReader(); void operator()(); }; //---------------------------------------------------------------------------------- // Constructor template CWKmerBinReader::CWKmerBinReader(CKMCParams &Params, CKMCQueues &Queues) { kbr = new CKmerBinReader(Params, Queues); } //---------------------------------------------------------------------------------- // Destructor template CWKmerBinReader::~CWKmerBinReader() { delete kbr; } //---------------------------------------------------------------------------------- // Execution template void CWKmerBinReader::operator()() { kbr->ProcessBins(); } #endif // ***** EOF KMC-2.3/kmer_counter/kb_sorter.h000066400000000000000000001067411257432033000166400ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _KB_SORTER_H #define _KB_SORTER_H #define DEBUGG_INFO #include "defs.h" #include "prob_qual.h" #include "params.h" #include "kmer.h" #include "radix.h" #include "s_mapper.h" #include #include #include #include #include #include #include #include "kxmer_set.h" #include "rev_byte.h" //************************************************************************************************************ template class CKmerBinSorter_Impl; //************************************************************************************************************ // CKmerBinSorter - sorting of k-mers in a bin //************************************************************************************************************ template class CKmerBinSorter { mutable mutex expander_mtx; uint64 input_pos; CMemoryMonitor *mm; CBinDesc *bd; CBinQueue *bq; CKmerQueue *kq; CMemoryPool *pmm_prob, *pmm_radix_buf, *pmm_expand; CMemoryBins *memory_bins; CKXmerSet 
kxmer_set; int32 n_bins; int32 bin_id; uchar *data; uint64 size; uint64 n_rec; uint64 n_plus_x_recs; string desc; uint32 buffer_size; uint32 kmer_len; uint32 max_x; uint64 sum_n_rec, sum_n_plus_x_rec; int n_omp_threads; bool both_strands; bool use_quake; CSignatureMapper* s_mapper; uint64 n_unique, n_cutoff_min, n_cutoff_max, n_total; int32 cutoff_min, cutoff_max; int32 lut_prefix_len; int32 counter_max; KMER_T *buffer_input, *buffer_tmp, *buffer; uint32 *kxmer_counters; void Sort(); friend class CKmerBinSorter_Impl; public: static uint32 PROB_BUF_SIZE; CKmerBinSorter(CKMCParams &Params, CKMCQueues &Queues, int thread_no); ~CKmerBinSorter(); void GetDebugStats(uint64& _sum_n_recs, uint64& _sum_n_plus_x_recs) { _sum_n_recs = sum_n_rec; _sum_n_plus_x_recs = sum_n_plus_x_rec; } void set_omp_threads(uint32 _n_omp_threads) { n_omp_threads = _n_omp_threads; } void ProcessBins(); }; template uint32 CKmerBinSorter::PROB_BUF_SIZE = 1 << 14; //************************************************************************************************************ // CKmerBinSorter_Impl - implementation of k-mer type- and size-dependent functions //************************************************************************************************************ template class CKmerBinSorter_Impl { public: static void Compact(CKmerBinSorter &ptr); static void Expand(CKmerBinSorter &ptr, uint64 tmp_size); static void ComapctKXmers(CKmerBinSorter &ptr, uint64& compacted_count); }; template class CKmerBinSorter_Impl, SIZE> { static uint64 FindFirstSymbOccur(CKmerBinSorter, SIZE> &ptr, uint64 start_pos, uint64 end_pos, uint32 offset, uchar symb); static void InitKXMerSet(CKmerBinSorter, SIZE> &ptr, uint64 start_pos, uint64 end_pos, uint32 offset, uint32 depth); static void CompactKxmers(CKmerBinSorter, SIZE> &ptr); static void PreCompactKxmers(CKmerBinSorter, SIZE> &ptr, uint64& compacted_count); static void CompactKmers(CKmerBinSorter, SIZE> &ptr); static void ExpandKxmersAll(CKmerBinSorter, SIZE>& 
ptr, uint64 tmp_size); static void ExpandKxmersBoth(CKmerBinSorter, SIZE>& ptr, uint64 tmp_size); static void ExpandKmersAll(CKmerBinSorter, SIZE>& ptr, uint64 tmp_size); static void ExpandKmersBoth(CKmerBinSorter, SIZE>& ptr, uint64 tmp_size); static void GetNextSymb(uchar& symb, uchar& byte_shift, uint64& pos, uchar* data_p); static void FromChildThread(CKmerBinSorter, SIZE>& ptr, CKmer* thread_buffer, uint64 size); static void ExpandKxmerBothParaller(CKmerBinSorter, SIZE>& ptr, uint64 start_pos, uint64 end_pos); public: static void Compact(CKmerBinSorter, SIZE> &ptr); static void Expand(CKmerBinSorter, SIZE> &ptr, uint64 tmp_size); }; template class CKmerBinSorter_Impl, SIZE> { public: static void Compact(CKmerBinSorter, SIZE> &ptr); static void Expand(CKmerBinSorter, SIZE> &ptr, uint64 tmp_size); }; //************************************************************************************************************ // CKmerBinSorter //************************************************************************************************************ //---------------------------------------------------------------------------------- // Assign queues and monitors template CKmerBinSorter::CKmerBinSorter(CKMCParams &Params, CKMCQueues &Queues, int thread_no) : kxmer_set(Params.kmer_len) { both_strands = Params.both_strands; mm = Queues.mm; n_bins = Params.n_bins; bd = Queues.bd; bq = Queues.bq; kq = Queues.kq; s_mapper = Queues.s_mapper; pmm_radix_buf = Queues.pmm_radix_buf; pmm_prob = Queues.pmm_prob; pmm_expand = Queues.pmm_expand; memory_bins = Queues.memory_bins; cutoff_min = Params.cutoff_min; cutoff_max = (int32)Params.cutoff_max; counter_max = (int32)Params.counter_max; max_x = Params.max_x; use_quake = Params.use_quake; lut_prefix_len = Params.lut_prefix_len; n_omp_threads = Params.n_omp_threads[thread_no]; sum_n_rec = sum_n_plus_x_rec = 0; } //---------------------------------------------------------------------------------- template CKmerBinSorter::~CKmerBinSorter() { 
} //---------------------------------------------------------------------------------- // Process the bins template void CKmerBinSorter::ProcessBins() { uint64 tmp_size; uint64 tmp_n_rec; CMemDiskFile *file; SetMemcpyCacheLimit(8); // Process bins while (!bq->completed()) { // Get the description of the bin data to sort if (!bq->pop(bin_id, data, size, n_rec)) { continue; } // Get bin data bd->read(bin_id, file, desc, tmp_size, tmp_n_rec, n_plus_x_recs, buffer_size, kmer_len); // Uncompact the kmers - append truncated prefixes //Expand(tmp_size); CKmerBinSorter_Impl::Expand(*this, tmp_size); memory_bins->free(bin_id, CMemoryBins::mba_input_file); // Perform sorting of kmers in a bin Sort(); // Compact the same kmers (occurring at neighbour positions now) CKmerBinSorter_Impl::Compact(*this); } // Mark that all the kmers have been processed kq->mark_completed(); } template inline void CKmerBinSorter_Impl, SIZE>::GetNextSymb(uchar& symb, uchar& byte_shift, uint64& pos, uchar* data_p) { symb = (data_p[pos] >> byte_shift) & 3; if (byte_shift == 0) { ++pos; byte_shift = 6; } else byte_shift -= 2; } template void CKmerBinSorter_Impl, SIZE>::ExpandKmersAll(CKmerBinSorter, SIZE>& ptr, uint64 tmp_size) { uint64 pos = 0; ptr.input_pos = 0; CKmer kmer; uint32 kmer_bytes = (ptr.kmer_len + 3) / 4; CKmer kmer_mask; kmer_mask.set_n_1(ptr.kmer_len * 2); uchar *data_p = ptr.data; uchar additional_symbols; uint32 kmer_shr = SIZE * 32 - ptr.kmer_len; while (pos < tmp_size) { kmer.clear(); additional_symbols = data_p[pos++]; for (uint32 i = 0, kmer_pos = 8 * SIZE - 1; i < kmer_bytes; ++i, --kmer_pos) { kmer.set_byte(kmer_pos, data_p[pos + i]); } pos += kmer_bytes; uchar byte_shift = 6 - (ptr.kmer_len % 4) * 2; if (byte_shift != 6) --pos; if (kmer_shr) kmer.SHR(kmer_shr); kmer.mask(kmer_mask); ptr.buffer_input[ptr.input_pos++].set(kmer); for (int i = 0; i < additional_symbols; ++i) { uchar symb = (data_p[pos] >> byte_shift) & 3; if (byte_shift == 0) { ++pos; byte_shift = 6; } else byte_shift -= 2; 
kmer.SHL_insert_2bits(symb); kmer.mask(kmer_mask); ptr.buffer_input[ptr.input_pos++].set(kmer); } if (byte_shift != 6) ++pos; } } template void CKmerBinSorter_Impl, SIZE>::ExpandKmersBoth(CKmerBinSorter, SIZE>& ptr, uint64 tmp_size) { uint64 pos = 0; CKmer kmer; CKmer rev_kmer; CKmer kmer_can; uint32 kmer_bytes = (ptr.kmer_len + 3) / 4; uint32 kmer_len_shift = (ptr.kmer_len - 1) * 2; CKmer kmer_mask; kmer_mask.set_n_1(ptr.kmer_len * 2); uchar *data_p = ptr.data; ptr.input_pos = 0; uint32 kmer_shr = SIZE * 32 - ptr.kmer_len; uchar additional_symbols; uchar symb; while (pos < tmp_size) { kmer.clear(); rev_kmer.clear(); additional_symbols = data_p[pos++]; //building kmer for (uint32 i = 0, kmer_pos = 8 * SIZE - 1, kmer_rev_pos = 0; i < kmer_bytes; ++i, --kmer_pos, ++kmer_rev_pos) { kmer.set_byte(kmer_pos, data_p[pos + i]); rev_kmer.set_byte(kmer_rev_pos, CRev_byte::lut[data_p[pos + i]]); } pos += kmer_bytes; uchar byte_shift = 6 - (ptr.kmer_len % 4) * 2; if (byte_shift != 6) --pos; if (kmer_shr) kmer.SHR(kmer_shr); kmer.mask(kmer_mask); rev_kmer.mask(kmer_mask); kmer_can = kmer < rev_kmer ? kmer : rev_kmer; ptr.buffer_input[ptr.input_pos++].set(kmer_can); for (int i = 0; i < additional_symbols; ++i) { symb = (data_p[pos] >> byte_shift) & 3; if (byte_shift == 0) { ++pos; byte_shift = 6; } else byte_shift -= 2; kmer.SHL_insert_2bits(symb); kmer.mask(kmer_mask); rev_kmer.SHR_insert_2bits(3 - symb, kmer_len_shift); kmer_can = kmer < rev_kmer ? 
kmer : rev_kmer; ptr.buffer_input[ptr.input_pos++].set(kmer_can); } if (byte_shift != 6) ++pos; } } template void CKmerBinSorter_Impl, SIZE>::FromChildThread(CKmerBinSorter, SIZE>& ptr, CKmer* thread_buffer, uint64 size) { lock_guard lcx(ptr.expander_mtx); A_memcpy(ptr.buffer_input + ptr.input_pos, thread_buffer, size * sizeof(CKmer)); ptr.input_pos += size; } template void CKmerBinSorter_Impl, SIZE>::ExpandKxmerBothParaller(CKmerBinSorter, SIZE>& ptr, uint64 start_pos, uint64 end_pos) { uchar* _raw_buffer; ptr.pmm_expand->reserve(_raw_buffer); CKmer* buffer = (CKmer*)_raw_buffer; CKmer kmer, rev_kmer, kmer_mask; CKmer kxmer_mask; bool kmer_lower; //true if kmer is lower than its rev. comp uint32 x, additional_symbols; uchar symb; uint32 kmer_bytes = (ptr.kmer_len + 3) / 4; uint32 rev_shift = ptr.kmer_len * 2 - 2; uchar *data_p = ptr.data; kmer_mask.set_n_1(ptr.kmer_len * 2); uint32 kmer_shr = SIZE * 32 - ptr.kmer_len; kxmer_mask.set_n_1((ptr.kmer_len + ptr.max_x + 1) * 2); uint64 buffer_pos = 0; uint64 pos = start_pos; while (pos < end_pos) { kmer.clear(); rev_kmer.clear(); additional_symbols = data_p[pos++]; //building kmer for (uint32 i = 0, kmer_pos = 8 * SIZE - 1, kmer_rev_pos = 0; i < kmer_bytes; ++i, --kmer_pos, ++kmer_rev_pos) { kmer.set_byte(kmer_pos, data_p[pos + i]); rev_kmer.set_byte(kmer_rev_pos, CRev_byte::lut[data_p[pos + i]]); } pos += kmer_bytes; uchar byte_shift = 6 - (ptr.kmer_len % 4) * 2; if (byte_shift != 6) --pos; if (kmer_shr) kmer.SHR(kmer_shr); kmer.mask(kmer_mask); rev_kmer.mask(kmer_mask); kmer_lower = kmer < rev_kmer; x = 0; if (kmer_lower) buffer[buffer_pos].set(kmer); else buffer[buffer_pos].set(rev_kmer); uint32 symbols_left = additional_symbols; while (symbols_left) { GetNextSymb(symb, byte_shift, pos, data_p); kmer.SHL_insert_2bits(symb); kmer.mask(kmer_mask); rev_kmer.SHR_insert_2bits(3 - symb, rev_shift); --symbols_left; if (kmer_lower) { if (kmer < rev_kmer) { buffer[buffer_pos].SHL_insert_2bits(symb); ++x; if (x == ptr.max_x) { 
if (!symbols_left) break; buffer[buffer_pos++].set_2bits(x, ptr.kmer_len * 2 + ptr.max_x * 2); if (buffer_pos >= EXPAND_BUFFER_RECS) { FromChildThread(ptr, buffer, buffer_pos); buffer_pos = 0; } x = 0; GetNextSymb(symb, byte_shift, pos, data_p); kmer.SHL_insert_2bits(symb); kmer.mask(kmer_mask); rev_kmer.SHR_insert_2bits(3 - symb, rev_shift); --symbols_left; kmer_lower = kmer < rev_kmer; if (kmer_lower) buffer[buffer_pos].set(kmer); else buffer[buffer_pos].set(rev_kmer); } } else { buffer[buffer_pos++].set_2bits(x, ptr.kmer_len * 2 + ptr.max_x * 2); if (buffer_pos >= EXPAND_BUFFER_RECS) { FromChildThread(ptr, buffer, buffer_pos); buffer_pos = 0; } x = 0; kmer_lower = false; buffer[buffer_pos].set(rev_kmer); } } else { if (!(kmer < rev_kmer)) { buffer[buffer_pos].set_2bits(3 - symb, ptr.kmer_len * 2 + x * 2); ++x; if (x == ptr.max_x) { if (!symbols_left) break; buffer[buffer_pos++].set_2bits(x, ptr.kmer_len * 2 + ptr.max_x * 2); if (buffer_pos >= EXPAND_BUFFER_RECS) { FromChildThread(ptr, buffer, buffer_pos); buffer_pos = 0; } x = 0; GetNextSymb(symb, byte_shift, pos, data_p); kmer.SHL_insert_2bits(symb); kmer.mask(kmer_mask); rev_kmer.SHR_insert_2bits(3 - symb, rev_shift); --symbols_left; kmer_lower = kmer < rev_kmer; if (kmer_lower) buffer[buffer_pos].set(kmer); else buffer[buffer_pos].set(rev_kmer); } } else { buffer[buffer_pos++].set_2bits(x, ptr.kmer_len * 2 + ptr.max_x * 2); if (buffer_pos >= EXPAND_BUFFER_RECS) { FromChildThread(ptr, buffer, buffer_pos); buffer_pos = 0; } x = 0; buffer[buffer_pos].set(kmer); kmer_lower = true; } } } buffer[buffer_pos++].set_2bits(x, ptr.kmer_len * 2 + ptr.max_x * 2); if (buffer_pos >= EXPAND_BUFFER_RECS) { FromChildThread(ptr, buffer, buffer_pos); buffer_pos = 0; } if (byte_shift != 6) ++pos; } if (buffer_pos) { FromChildThread(ptr, buffer, buffer_pos); buffer_pos = 0; } ptr.pmm_expand->free(_raw_buffer); } template void CKmerBinSorter_Impl, SIZE>::ExpandKxmersBoth(CKmerBinSorter, SIZE>& ptr, uint64 tmp_size) { ptr.input_pos 
= 0; uint32 threads = ptr.n_omp_threads; uint64 bytes_per_thread = (tmp_size + threads - 1) / threads; uint32 thread_no = 0; vector exp_threads; uint64 start = 0; uint64 pos = 0; for (; pos < tmp_size; pos += 1 + (ptr.data[pos] + ptr.kmer_len + 3) / 4) { if ((thread_no + 1) * bytes_per_thread <= pos) { exp_threads.push_back(thread(ExpandKxmerBothParaller, std::ref(ptr), start, pos)); start = pos; ++thread_no; } } if (start < pos) { exp_threads.push_back(thread(ExpandKxmerBothParaller, std::ref(ptr), start, tmp_size)); } for (auto& p : exp_threads) p.join(); ptr.n_plus_x_recs = ptr.input_pos;// !!!!!!!! } template void CKmerBinSorter_Impl, SIZE>::ExpandKxmersAll(CKmerBinSorter, SIZE>& ptr, uint64 tmp_size) { ptr.input_pos = 0; uint64 pos = 0; CKmer kmer_mask; CKmer kxmer; CKmer kxmer_mask; kxmer_mask.set_n_1((ptr.kmer_len + ptr.max_x) * 2); uchar *data_p = ptr.data; kmer_mask.set_n_1(ptr.kmer_len * 2); while (pos < tmp_size) { kxmer.clear(); uint32 additional_symbols = data_p[pos++]; uchar symb; uint32 kmer_bytes = (ptr.kmer_len + 3) / 4; //building kmer for (uint32 i = 0, kmer_pos = 8 * SIZE - 1; i < kmer_bytes; ++i, --kmer_pos) { kxmer.set_byte(kmer_pos, data_p[pos + i]); } pos += kmer_bytes; uchar byte_shift = 6 - (ptr.kmer_len % 4) * 2; if (byte_shift != 6) --pos; uint32 kmer_shr = SIZE * 32 - ptr.kmer_len; if (kmer_shr) kxmer.SHR(kmer_shr); kxmer.mask(kmer_mask); uint32 tmp = MIN(ptr.max_x, additional_symbols); for (uint32 i = 0; i < tmp; ++i) { GetNextSymb(symb, byte_shift, pos, data_p); kxmer.SHL_insert_2bits(symb); } kxmer.set_2bits(tmp, (ptr.kmer_len + ptr.max_x) * 2); ptr.buffer_input[ptr.input_pos++].set(kxmer); additional_symbols -= tmp; uint32 kxmers_count = additional_symbols / (ptr.max_x + 1); uint32 kxmer_rest = additional_symbols % (ptr.max_x + 1); for (uint32 j = 0; j < kxmers_count; ++j) { for (uint32 i = 0; i < ptr.max_x + 1; ++i) { GetNextSymb(symb, byte_shift, pos, data_p); kxmer.SHL_insert_2bits(symb); } kxmer.mask(kxmer_mask); 
kxmer.set_2bits(ptr.max_x, (ptr.kmer_len + ptr.max_x) * 2); ptr.buffer_input[ptr.input_pos++].set(kxmer); } if (kxmer_rest) { uint32 i = 0; GetNextSymb(symb, byte_shift, pos, data_p); kxmer.SHL_insert_2bits(symb); kxmer.mask(kmer_mask); --kxmer_rest; for (; i < kxmer_rest; ++i) { GetNextSymb(symb, byte_shift, pos, data_p); kxmer.SHL_insert_2bits(symb); } kxmer.set_2bits(kxmer_rest, (ptr.kmer_len + ptr.max_x) * 2); ptr.buffer_input[ptr.input_pos++].set(kxmer); } if (byte_shift != 6) ++pos; } } //---------------------------------------------------------------------------------- // Uncompact the kmers template void CKmerBinSorter_Impl, SIZE>::Expand(CKmerBinSorter, SIZE>& ptr, uint64 tmp_size) { uchar *raw_buffer_input, *raw_buffer_tmp; ptr.memory_bins->reserve(ptr.bin_id, raw_buffer_input, CMemoryBins::mba_input_array); ptr.memory_bins->reserve(ptr.bin_id, raw_buffer_tmp, CMemoryBins::mba_tmp_array); ptr.buffer_input = (CKmer *) raw_buffer_input; ptr.buffer_tmp = (CKmer *) raw_buffer_tmp; if (ptr.max_x) { if (ptr.both_strands) ExpandKxmersBoth(ptr, tmp_size); else ExpandKxmersAll(ptr, tmp_size); } else { if (ptr.both_strands) ExpandKmersBoth(ptr, tmp_size); else ExpandKmersAll(ptr, tmp_size); } } //---- template void CKmerBinSorter_Impl, SIZE>::Expand(CKmerBinSorter, SIZE>& ptr, uint64 tmp_size) { uchar *data_p = ptr.data; uchar *raw_buffer_input, *raw_buffer_tmp; ptr.memory_bins->reserve(ptr.bin_id, raw_buffer_input, CMemoryBins::mba_input_array); ptr.memory_bins->reserve(ptr.bin_id, raw_buffer_tmp, CMemoryBins::mba_tmp_array); ptr.buffer_input = (CKmerQuake *) raw_buffer_input; ptr.buffer_tmp = (CKmerQuake *) raw_buffer_tmp; CKmerQuake current_kmer; CKmerQuake kmer_rev; CKmerQuake kmer_can; kmer_rev.clear(); uint32 kmer_len_shift = (ptr.kmer_len - 1) * 2; CKmerQuake kmer_mask; kmer_mask.set_n_1(ptr.kmer_len * 2); ptr.input_pos = 0; uint64 pos = 0; double *inv_probs; ptr.pmm_prob->reserve(inv_probs); double kmer_prob; uchar qual, symb; uint32 inv_probs_pos; if 
(ptr.both_strands) while (pos < tmp_size) { uchar additional_symbols = data_p[pos++]; inv_probs_pos = 0; kmer_prob = 1.0; for (uint32 i = 0; i < ptr.kmer_len; ++i) { symb = (data_p[pos] >> 6) & 3; qual = data_p[pos++] & 63; inv_probs[inv_probs_pos++] = CProbQual::inv_prob_qual[qual]; current_kmer.SHL_insert_2bits(symb); kmer_rev.SHR_insert_2bits(3 - symb, kmer_len_shift); kmer_prob *= CProbQual::prob_qual[qual]; } current_kmer.mask(kmer_mask); if (kmer_prob >= CProbQual::MIN_PROB_QUAL_VALUE) { kmer_can = current_kmer < kmer_rev ? current_kmer : kmer_rev; kmer_can.quality = (float)kmer_prob; ptr.buffer_input[ptr.input_pos++].set(kmer_can); } for (uint32 i = 0; i < additional_symbols; ++i) { symb = (data_p[pos] >> 6) & 3; qual = data_p[pos++] & 63; current_kmer.SHL_insert_2bits(symb); current_kmer.mask(kmer_mask); kmer_rev.SHR_insert_2bits(3 - symb, kmer_len_shift); kmer_prob *= CProbQual::prob_qual[qual] * inv_probs[inv_probs_pos - ptr.kmer_len]; inv_probs[inv_probs_pos++] = CProbQual::inv_prob_qual[qual]; if (kmer_prob >= CProbQual::MIN_PROB_QUAL_VALUE) { kmer_can = current_kmer < kmer_rev ? 
current_kmer : kmer_rev; kmer_can.quality = (float)kmer_prob; ptr.buffer_input[ptr.input_pos++].set(kmer_can); } } } else while (pos < tmp_size) { uchar additional_symbols = data_p[pos++]; inv_probs_pos = 0; kmer_prob = 1.0; for (uint32 i = 0; i < ptr.kmer_len; ++i) { symb = (data_p[pos] >> 6) & 3; qual = data_p[pos++] & 63; inv_probs[inv_probs_pos++] = CProbQual::inv_prob_qual[qual]; current_kmer.SHL_insert_2bits(symb); kmer_prob *= CProbQual::prob_qual[qual]; } current_kmer.mask(kmer_mask); if (kmer_prob >= CProbQual::MIN_PROB_QUAL_VALUE) { current_kmer.quality = (float)kmer_prob; ptr.buffer_input[ptr.input_pos++].set(current_kmer); } for (uint32 i = 0; i < additional_symbols; ++i) { symb = (data_p[pos] >> 6) & 3; qual = data_p[pos++] & 63; current_kmer.SHL_insert_2bits(symb); current_kmer.mask(kmer_mask); kmer_prob *= CProbQual::prob_qual[qual] * inv_probs[inv_probs_pos - ptr.kmer_len]; inv_probs[inv_probs_pos++] = CProbQual::inv_prob_qual[qual]; if (kmer_prob >= CProbQual::MIN_PROB_QUAL_VALUE) { current_kmer.quality = (float)kmer_prob; ptr.buffer_input[ptr.input_pos++].set(current_kmer); } } } ptr.pmm_prob->free(inv_probs); } //---------------------------------------------------------------------------------- // Sort the kmers template void CKmerBinSorter::Sort() { uint32 rec_len; uint64 sort_rec; if (max_x && !use_quake) { sort_rec = n_plus_x_recs; rec_len = (kmer_len + max_x + 1 + 3) / 4; } else { sort_rec = n_rec; rec_len = (kmer_len + 3) / 4; } sum_n_plus_x_rec += n_plus_x_recs; sum_n_rec += n_rec; if (sizeof(KMER_T) == 8) { uint64 *_buffer_input = (uint64*)buffer_input; uint64 *_buffer_tmp = (uint64*)buffer_tmp; RadixSort_buffer(pmm_radix_buf, _buffer_input, _buffer_tmp, sort_rec, rec_len, n_omp_threads); if (rec_len % 2) buffer = (KMER_T*)_buffer_tmp; else buffer = (KMER_T*)_buffer_input; } else { uint32 *_buffer_input = (uint32*)buffer_input; uint32 *_buffer_tmp = (uint32*)buffer_tmp; RadixSort_uint8(_buffer_input, _buffer_tmp, sort_rec, sizeof(KMER_T), 
offsetof(KMER_T, data), SIZE*sizeof(typename KMER_T::data_t), rec_len, n_omp_threads); if (rec_len % 2) buffer = (KMER_T*)_buffer_tmp; else buffer = (KMER_T*)_buffer_input; } } //---------------------------------------------------------------------------------- //Binary search position of first occurence of symbol 'symb' in [start_pos,end_pos). Offset defines which symbol in k+x-mer is taken. template uint64 CKmerBinSorter_Impl, SIZE>::FindFirstSymbOccur(CKmerBinSorter, SIZE> &ptr, uint64 start_pos, uint64 end_pos, uint32 offset, uchar symb) { uint32 kxmer_offset = (ptr.kmer_len + ptr.max_x - offset) * 2; uint64 middle_pos; uchar middle_symb; while (start_pos < end_pos) { middle_pos = (start_pos + end_pos) / 2; middle_symb = ptr.buffer[middle_pos].get_2bits(kxmer_offset); if (middle_symb < symb) start_pos = middle_pos + 1; else end_pos = middle_pos; } return end_pos; } //---------------------------------------------------------------------------------- template void CKmerBinSorter_Impl, SIZE>::InitKXMerSet(CKmerBinSorter, SIZE> &ptr, uint64 start_pos, uint64 end_pos, uint32 offset, uint32 depth) { if (start_pos == end_pos) return; uint32 shr = ptr.max_x + 1 - offset; ptr.kxmer_set.init_add(start_pos, end_pos, shr); --depth; if (depth > 0) { uint64 pos[5]; pos[0] = start_pos; pos[4] = end_pos; for (uint32 i = 1; i < 4; ++i) pos[i] = FindFirstSymbOccur(ptr, pos[i - 1], end_pos, offset, i); for (uint32 i = 1; i < 5; ++i) InitKXMerSet(ptr, pos[i - 1], pos[i], offset + 1, depth); } } //---------------------------------------------------------------------------------- template void CKmerBinSorter_Impl, SIZE>::PreCompactKxmers(CKmerBinSorter, SIZE> &ptr, uint64& compacted_count) { compacted_count = 0; CKmer *act_kmer; act_kmer = &ptr.buffer[0]; ptr.kxmer_counters[compacted_count] = 1; for (uint32 i = 1; i < ptr.n_plus_x_recs; ++i) { if (*act_kmer == ptr.buffer[i]) ++ptr.kxmer_counters[compacted_count]; else { ptr.buffer[compacted_count++] = *act_kmer; 
ptr.kxmer_counters[compacted_count] = 1; act_kmer = &ptr.buffer[i]; } } ptr.buffer[compacted_count++] = *act_kmer; } //---------------------------------------------------------------------------------- template void CKmerBinSorter_Impl, SIZE>::CompactKxmers(CKmerBinSorter, SIZE> &ptr) { ptr.kxmer_set.clear(); ptr.kxmer_set.set_buffer(ptr.buffer); ptr.n_unique = 0; ptr.n_cutoff_min = 0; ptr.n_cutoff_max = 0; ptr.n_total = 0; uint32 kmer_symbols = ptr.kmer_len - ptr.lut_prefix_len; uint64 kmer_bytes = kmer_symbols / 4; uint64 lut_recs = 1 << (2 * ptr.lut_prefix_len); uint64 lut_size = lut_recs * sizeof(uint64); uchar *out_buffer = NULL; uchar *raw_lut = NULL; ptr.memory_bins->reserve(ptr.bin_id, out_buffer, CMemoryBins::mba_suffix); ptr.memory_bins->reserve(ptr.bin_id, raw_lut, CMemoryBins::mba_lut); uint64 *lut = (uint64*)raw_lut; fill_n(lut, lut_recs, 0); uint64 out_pos = 0; if (ptr.n_plus_x_recs) { uchar* raw_kxmer_counters = NULL; ptr.memory_bins->reserve(ptr.bin_id, raw_kxmer_counters, CMemoryBins::mba_kxmer_counters); ptr.kxmer_counters = (uint32*)raw_kxmer_counters; uint64 compacted_count; PreCompactKxmers(ptr, compacted_count); uint64 pos[5];//pos[symb] is first position where symb occur (at first position of k+x-mer) and pos[symb+1] is first position where symb is not starting symbol of k+x-mer pos[0] = 0; pos[4] = compacted_count; for (uint32 i = 1; i < 4; ++i) pos[i] = FindFirstSymbOccur(ptr, pos[i - 1], compacted_count, 0, i); for (uint32 i = 1; i < 5; ++i) InitKXMerSet(ptr, pos[i - 1], pos[i], ptr.max_x + 2 - i, i); uint64 counter_pos = 0; uint64 counter_size = min(BYTE_LOG(ptr.cutoff_max), BYTE_LOG(ptr.counter_max)); CKmer kmer, next_kmer; kmer.clear(); next_kmer.clear(); CKmer kmer_mask; kmer_mask.set_n_1(ptr.kmer_len * 2); uint32 count; //first ptr.kxmer_set.get_min(counter_pos, kmer); count = ptr.kxmer_counters[counter_pos]; //rest while (ptr.kxmer_set.get_min(counter_pos, next_kmer)) { if (kmer == next_kmer) count += ptr.kxmer_counters[counter_pos]; 
else { ptr.n_total += count; ++ptr.n_unique; if (count < (uint32)ptr.cutoff_min) ptr.n_cutoff_min++; else if (count >(uint32)ptr.cutoff_max) ptr.n_cutoff_max++; else { lut[kmer.remove_suffix(2 * kmer_symbols)]++; if (count > (uint32)ptr.counter_max) count = ptr.counter_max; // Store compacted kmer for (int32 j = (int32)kmer_bytes - 1; j >= 0; --j) out_buffer[out_pos++] = kmer.get_byte(j); for (int32 j = 0; j < (int32)counter_size; ++j) out_buffer[out_pos++] = (count >> (j * 8)) & 0xFF; } count = ptr.kxmer_counters[counter_pos]; kmer = next_kmer; } } //last one ++ptr.n_unique; ptr.n_total += count; if (count < (uint32)ptr.cutoff_min) ptr.n_cutoff_min++; else if (count >(uint32)ptr.cutoff_max) ptr.n_cutoff_max++; else { lut[kmer.remove_suffix(2 * kmer_symbols)]++; if (count > (uint32)ptr.counter_max) count = ptr.counter_max; // Store compacted kmer for (int32 j = (int32)kmer_bytes - 1; j >= 0; --j) out_buffer[out_pos++] = kmer.get_byte(j); for (int32 j = 0; j < (int32)counter_size; ++j) out_buffer[out_pos++] = (count >> (j * 8)) & 0xFF; } ptr.memory_bins->free(ptr.bin_id, CMemoryBins::mba_kxmer_counters); } // Push the sorted and compacted kmer bin to a priority queue in a form ready to be stored to HDD ptr.kq->push(ptr.bin_id, out_buffer, out_pos, raw_lut, lut_size, ptr.n_unique, ptr.n_cutoff_min, ptr.n_cutoff_max, ptr.n_total); if (ptr.buffer_input) { ptr.memory_bins->free(ptr.bin_id, CMemoryBins::mba_input_array); ptr.memory_bins->free(ptr.bin_id, CMemoryBins::mba_tmp_array); } ptr.buffer = NULL; } //---------------------------------------------------------------------------------- template void CKmerBinSorter_Impl, SIZE>::CompactKmers(CKmerBinSorter, SIZE> &ptr) { uint64 i; uint32 kmer_symbols = ptr.kmer_len - ptr.lut_prefix_len; uint64 kmer_bytes = kmer_symbols / 4; uint64 lut_recs = 1 << (2 * (ptr.lut_prefix_len)); uint64 lut_size = lut_recs * sizeof(uint64); uint64 counter_size = min(BYTE_LOG(ptr.cutoff_max), BYTE_LOG(ptr.counter_max)); uchar *out_buffer; 
	uchar *raw_lut;
	ptr.memory_bins->reserve(ptr.bin_id, out_buffer, CMemoryBins::mba_suffix);
	ptr.memory_bins->reserve(ptr.bin_id, raw_lut, CMemoryBins::mba_lut);
	uint64 *lut = (uint64*)raw_lut;
	fill_n(lut, lut_recs, 0);

	uint64 out_pos = 0;
	uint32 count;
	CKmer<SIZE> *act_kmer;

	ptr.n_unique = 0;
	ptr.n_cutoff_min = 0;
	ptr.n_cutoff_max = 0;
	ptr.n_total = 0;

	if (ptr.n_rec)			// non-empty bin
	{
		act_kmer = &ptr.buffer[0];
		count = 1;
		ptr.n_total = ptr.n_rec;
		for (i = 1; i < ptr.n_rec; ++i)
		{
			if (*act_kmer == ptr.buffer[i])
				count++;
			else
			{
				if (count < (uint32)ptr.cutoff_min)
				{
					act_kmer = &ptr.buffer[i];
					ptr.n_cutoff_min++;
					ptr.n_unique++;
					count = 1;
				}
				else if (count > (uint32)ptr.cutoff_max)
				{
					act_kmer = &ptr.buffer[i];
					ptr.n_cutoff_max++;
					ptr.n_unique++;
					count = 1;
				}
				else
				{
					if (count > (uint32)ptr.counter_max)
						count = ptr.counter_max;

					// Store compacted kmer: suffix bytes first, then the counter (least significant byte first)
					for (int32 j = (int32)kmer_bytes - 1; j >= 0; --j)
						out_buffer[out_pos++] = act_kmer->get_byte(j);
					for (int32 j = 0; j < (int32)counter_size; ++j)
						out_buffer[out_pos++] = (count >> (j * 8)) & 0xFF;

					lut[act_kmer->remove_suffix(2 * kmer_symbols)]++;

					act_kmer = &ptr.buffer[i];
					count = 1;
					ptr.n_unique++;
				}
			}
		}

		// Last k-mer of the bin: apply the same cutoffs as inside the loop
		if (count < (uint32)ptr.cutoff_min)
		{
			ptr.n_cutoff_min++;
		}
		else if (count > (uint32)ptr.cutoff_max)
		{
			ptr.n_cutoff_max++;
		}
		else
		{
			if (count > (uint32)ptr.counter_max)
				count = ptr.counter_max;

			for (int32 j = (int32)kmer_bytes - 1; j >= 0; --j)
				out_buffer[out_pos++] = act_kmer->get_byte(j);
			for (int32 j = 0; j < (int32)counter_size; ++j)
				out_buffer[out_pos++] = (count >> (j * 8)) & 0xFF;

			lut[act_kmer->remove_suffix(2 * kmer_symbols)]++;
		}
		ptr.n_unique++;
	}

	// Push the sorted and compacted kmer bin to a priority queue in a form ready to be stored to HDD
	ptr.kq->push(ptr.bin_id, out_buffer, out_pos, raw_lut, lut_size, ptr.n_unique, ptr.n_cutoff_min, ptr.n_cutoff_max, ptr.n_total);

	if (ptr.buffer_input)
	{
		ptr.memory_bins->free(ptr.bin_id, CMemoryBins::mba_input_array);
		ptr.memory_bins->free(ptr.bin_id,
CMemoryBins::mba_tmp_array); } ptr.buffer = NULL; } //---------------------------------------------------------------------------------- template void CKmerBinSorter_Impl, SIZE>::Compact(CKmerBinSorter, SIZE> &ptr) { if (ptr.max_x) CompactKxmers(ptr); else CompactKmers(ptr); } //---------------------------------------------------------------------------------- // Compact the kmers - the same kmers (at neighbour positions now) are compated to a single kmer and counter template void CKmerBinSorter_Impl, SIZE>::Compact(CKmerBinSorter, SIZE> &ptr) { uint64 i; uint32 kmer_symbols = ptr.kmer_len - ptr.lut_prefix_len; uint64 kmer_bytes = kmer_symbols / 4; uint64 lut_recs = 1 << (2 * (ptr.lut_prefix_len)); uint64 lut_size = lut_recs * sizeof(uint64); uchar *out_buffer; uchar *raw_lut; ptr.memory_bins->reserve(ptr.bin_id, out_buffer, CMemoryBins::mba_suffix); ptr.memory_bins->reserve(ptr.bin_id, raw_lut, CMemoryBins::mba_lut); uint64 *lut = (uint64*)raw_lut; fill_n(lut, lut_recs, 0); uint64 out_pos = 0; double count; CKmerQuake *act_kmer; ptr.n_unique = 0; ptr.n_cutoff_min = 0; ptr.n_cutoff_max = 0; ptr.n_total = 0; if(ptr.n_rec) // non-empty bin { act_kmer = &ptr.buffer[0]; count = (double)act_kmer->quality; ptr.n_total = ptr.n_rec; for(i = 1; i < ptr.n_rec; ++i) { if(*act_kmer == ptr.buffer[i]) count += ptr.buffer[i].quality; else { if(count < (double) ptr.cutoff_min) { act_kmer = &ptr.buffer[i]; ++ptr.n_cutoff_min; ++ptr.n_unique; count = act_kmer->quality; } else if(count > (double) ptr.cutoff_max) { act_kmer = &ptr.buffer[i]; ++ptr.n_cutoff_max; ++ptr.n_unique; count = act_kmer->quality; } else { if(count > (double) ptr.counter_max) count = (double) ptr.counter_max; // Store compacted kmer for(int32 j = (int32) kmer_bytes-1; j >= 0; --j) out_buffer[out_pos++] = act_kmer->get_byte(j); uint32 tmp; float f_count = (float) count; memcpy(&tmp, &f_count, 4); for(int32 j = 0; j < 4; ++j) out_buffer[out_pos++] = (tmp >> (j * 8)) & 0xFF; lut[act_kmer->remove_suffix(2 * 
kmer_symbols)]++; act_kmer = &ptr.buffer[i]; count = act_kmer->quality; ++ptr.n_unique; } } } if(count < (double) ptr.cutoff_min) { ++ptr.n_cutoff_min; } else if(count > (double) ptr.cutoff_max) { ++ptr.n_cutoff_max; } else { if(count > (double) ptr.counter_max) count = (double) ptr.counter_max; for(int32 j = (int32) kmer_bytes-1; j >= 0; --j) out_buffer[out_pos++] = act_kmer->get_byte(j); uint32 tmp; float f_count = (float) count; memcpy(&tmp, &f_count, 4); for(int32 j = 0; j < 4; ++j) out_buffer[out_pos++] = (tmp >> (j * 8)) & 0xFF; lut[act_kmer->remove_suffix(2 * kmer_symbols)]++; } ++ptr.n_unique; } //// Push the sorted and compacted kmer bin to a priority queue in a form ready to be stored to HDD ptr.kq->push(ptr.bin_id, out_buffer, out_pos, raw_lut, lut_size, ptr.n_unique, ptr.n_cutoff_min, ptr.n_cutoff_max, ptr.n_total); if(ptr.buffer_input) { ptr.memory_bins->free(ptr.bin_id, CMemoryBins::mba_input_array); ptr.memory_bins->free(ptr.bin_id, CMemoryBins::mba_tmp_array); } ptr.buffer = NULL; } //************************************************************************************************************ // CWKmerBinSorter - wrapper for multithreading purposes //************************************************************************************************************ template class CWKmerBinSorter { CKmerBinSorter *kbs; public: CWKmerBinSorter(CKMCParams &Params, CKMCQueues &Queues, int thread_no); ~CWKmerBinSorter(); void GetDebugStats(uint64& _sum_n_recs, uint64& _sum_n_plus_x_recs) { kbs->GetDebugStats(_sum_n_recs, _sum_n_plus_x_recs); } void operator()(); }; //---------------------------------------------------------------------------------- // Constructor template CWKmerBinSorter::CWKmerBinSorter(CKMCParams &Params, CKMCQueues &Queues, int thread_no) { kbs = new CKmerBinSorter(Params, Queues, thread_no); } //---------------------------------------------------------------------------------- // Destructor template CWKmerBinSorter::~CWKmerBinSorter() { 
delete kbs; } //---------------------------------------------------------------------------------- // Execution template void CWKmerBinSorter::operator()() { kbs->ProcessBins(); } #endif // ***** EOFKMC-2.3/kmer_counter/kb_storer.cpp000066400000000000000000000141061257432033000171640ustar00rootroot00000000000000#include "stdafx.h" /* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include #include #include #include "kb_storer.h" using namespace std; extern uint64 total_reads; //************************************************************************************************************ // CKmerBinStorer - storer for bins //************************************************************************************************************ //---------------------------------------------------------------------------------- // Constructor CKmerBinStorer::CKmerBinStorer(CKMCParams &Params, CKMCQueues &Queues) { pmm_bins = Queues.pmm_bins; mm = Queues.mm; n_bins = Params.n_bins; q_part = Queues.bpq; bd = Queues.bd; working_directory = Params.working_directory; mem_mode = Params.mem_mode; s_mapper = Queues.s_mapper; disk_logger = Queues.disk_logger; files = NULL; buf_sizes = NULL; buffer_size_bytes = 0; max_buf_size = 0; max_buf_size_id = 0; max_mem_buffer = Params.max_mem_storer; max_mem_single_package = Params.max_mem_storer_pkg; tmp_buff = new uchar[max_mem_single_package*2]; buffer = new elem_t*[n_bins]; for(int i = 0; i < n_bins; ++i) buffer[i] = NULL; total_size = 0 ; } //---------------------------------------------------------------------------------- // Destructor CKmerBinStorer::~CKmerBinStorer() { Release(); } //---------------------------------------------------------------------------------- // Write ends of bins and release memory void CKmerBinStorer::Release() { if(!files) 
		return;

	for(int i = 0; i < n_bins; ++i)
		if(buffer[i])
			delete buffer[i];

	delete[] buffer;
	buffer = NULL;
	delete[] files;
	files = NULL;
	delete[] buf_sizes;
	buf_sizes = NULL;
	delete [] tmp_buff;

	cout << "\n";
}

//----------------------------------------------------------------------------------
// Flush all buffered bin parts to temporary files and free the per-bin lists
void CKmerBinStorer::ReleaseBuffer()
{
	for(int i = 0; i < n_bins; ++i)
		if(buffer[i])
			PutBinToTmpFile(i);

	for(int i = n_bins-1; i >= 0; --i)
		if(buffer[i])
		{
			delete buffer[i];
			buffer[i] = NULL;
		}
}

//----------------------------------------------------------------------------------
// Return the name of the temporary file for the bin with the given id (zero-padded to 5 digits)
string CKmerBinStorer::GetName(int n)
{
	string s_tmp = std::to_string(n);
	while(s_tmp.length() < 5)
		s_tmp = string("0") + s_tmp;

	if (*working_directory.rbegin() != '/' && *working_directory.rbegin() != '\\')
		working_directory += "/";
	return working_directory + "kmc_" + s_tmp + ".bin";
}

//----------------------------------------------------------------------------------
// Check whether it is necessary to store some bin to a HDD.
// The largest bin is flushed once the total buffered size or the size of that bin reaches its limit.
void CKmerBinStorer::CheckBuffer()
{
	int32 i;

	if(buffer_size_bytes < max_mem_buffer && max_buf_size < max_mem_single_package)
		return;

	PutBinToTmpFile(max_buf_size_id);

	buf_sizes[max_buf_size_id] = 0;

	// Find the bin that is now the largest
	max_buf_size = buf_sizes[0];
	max_buf_size_id = 0;
	for(i = 1; i < n_bins; ++i)
	{
		if(buf_sizes[i] > max_buf_size)
		{
			max_buf_size = buf_sizes[i];
			max_buf_size_id = i;
		}
	}
}

//----------------------------------------------------------------------------------
// Send bin to temp file
void CKmerBinStorer::PutBinToTmpFile(uint32 n)
{
	if(buf_sizes[n])
	{
		uint64 w;
		uint64 tmp_buff_pos = 0;
		uint32 size;
		uchar* buf;

		// Concatenate all parts of the bin into one contiguous buffer and return them to the pool
		for(auto p = buffer[n]->begin() ; p != buffer[n]->end() ; ++p)
		{
			buf = get<0>(*p);
			size = get<1>(*p);
			A_memcpy(tmp_buff + tmp_buff_pos, buf, size);
			tmp_buff_pos += size;
			pmm_bins->free(buf);
		}

		disk_logger->log_write(tmp_buff_pos);
		w = files[n]->Write(tmp_buff, 1, tmp_buff_pos);
		if(w !=
			tmp_buff_pos)
		{
			cout<<"Error while writing to temporary file " << n;
			exit(1);
		}
		total_size += w;
		buffer_size_bytes -= buf_sizes[n];
	}
	buffer[n]->clear();
}

//----------------------------------------------------------------------------------
// Open temporary files for all bins
bool CKmerBinStorer::OpenFiles()
{
	string f_name;

	files = new CMemDiskFile*[n_bins];
	for (int i = 0 ; i < n_bins ; ++i)
	{
		files[i] = new CMemDiskFile(mem_mode);
	}
	buf_sizes = new uint64[n_bins];

	for(int i = 0; i < n_bins; ++i)
	{
		f_name = GetName(i);
		buf_sizes[i] = 0;

		files[i]->Open(f_name);

		bd->insert(i, files[i], f_name, 0, 0, 0, 0);
	}

	return true;
}

//----------------------------------------------------------------------------------
// Collect bin parts from the queue and flush bins to temporary files when the memory limits are reached
void CKmerBinStorer::ProcessQueue()
{
	// Process the queue
	while(!q_part->completed())
	{
		int32 bin_id;
		uchar *part;
		uint32 true_size;
		uint32 alloc_size;

		if(q_part->pop(bin_id, part, true_size, alloc_size))
		{
			if(!buffer[bin_id])
				buffer[bin_id] = new elem_t;
			buffer[bin_id]->push_back(make_tuple(part, true_size, alloc_size));
			buffer_size_bytes += alloc_size;
			buf_sizes[bin_id] += alloc_size;

			if(buf_sizes[bin_id] > max_buf_size)
			{
				max_buf_size = buf_sizes[bin_id];
				max_buf_size_id = bin_id;
			}

			CheckBuffer();
		}
	}

	// Flush all remaining parts to temporary files
	ReleaseBuffer();
}

//************************************************************************************************************
// CWKmerBinStorer - wrapper
//************************************************************************************************************

//----------------------------------------------------------------------------------
// Constructor
CWKmerBinStorer::CWKmerBinStorer(CKMCParams &Params, CKMCQueues &Queues)
{
	kbs = new CKmerBinStorer(Params, Queues);
	kbs->OpenFiles();
}

//----------------------------------------------------------------------------------
// Destructor
CWKmerBinStorer::~CWKmerBinStorer()
{
	delete kbs;
}
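/*
The buffering policy of CheckBuffer() and ProcessQueue() can be illustrated with a minimal, size-only sketch.
FlushPolicySketch and all of its members are hypothetical names introduced for this illustration; they are not
part of KMC. max_total and max_single stand in for max_mem_buffer and max_mem_single_package, and a "flush"
only records the bin id instead of writing a temporary file. Unlike the storer, which tracks the largest bin
incrementally, the sketch rescans all bins on every call, assuming that simplicity matters more than speed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the storer's bookkeeping: only sizes are tracked, no data is stored.
struct FlushPolicySketch
{
	std::vector<uint64_t> buf_sizes;	// bytes currently buffered per bin
	uint64_t total;						// bytes buffered over all bins
	uint64_t max_total;					// plays the role of max_mem_buffer
	uint64_t max_single;				// plays the role of max_mem_single_package
	std::vector<size_t> flushed;		// ids of flushed bins, in flush order

	FlushPolicySketch(size_t n_bins, uint64_t max_total_, uint64_t max_single_)
		: buf_sizes(n_bins, 0), total(0), max_total(max_total_), max_single(max_single_) {}

	// Index of the bin holding the most buffered bytes (the first one on ties)
	size_t largest() const
	{
		size_t best = 0;
		for (size_t i = 1; i < buf_sizes.size(); ++i)
			if (buf_sizes[i] > buf_sizes[best])
				best = i;
		return best;
	}

	// Buffer a new part; flush the largest bin if either limit is reached
	void add(size_t bin, uint64_t bytes)
	{
		buf_sizes[bin] += bytes;
		total += bytes;

		size_t big = largest();
		if (total < max_total && buf_sizes[big] < max_single)
			return;

		flushed.push_back(big);
		total -= buf_sizes[big];
		buf_sizes[big] = 0;
	}
};
```

With three bins and limits of 100 and 60 bytes, adding 30, 40 and 35 bytes to bins 0, 1 and 2 reaches the total
limit and flushes bin 1, the largest one. A further 30 bytes for bin 0 bring that bin to the per-bin limit, so
bin 0 is flushed next.
*/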
//---------------------------------------------------------------------------------- // Execution void CWKmerBinStorer::operator()() { kbs->ProcessQueue(); } // ***** EOF KMC-2.3/kmer_counter/kb_storer.h000066400000000000000000000041121257432033000166250ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _KB_STORER_H #define _KB_STORER_H #include "defs.h" #include "params.h" #include "kmer.h" #include "radix.h" #include #include #include #include #include #include using namespace std; //************************************************************************************************************ // CKmerBinStorer - storer of bins of k-mers //************************************************************************************************************ class CKmerBinStorer { CMemoryMonitor *mm; uint64 total_size; CMemoryPool *pmm_bins; string working_directory; int n_bins; CBinPartQueue *q_part; CBinDesc *bd; uint64 buffer_size_bytes; uint64 max_mem_buffer; uint64 max_mem_single_package; CSignatureMapper *s_mapper; CDiskLogger *disk_logger; uchar* tmp_buff; CMemDiskFile** files; uint64 *buf_sizes; uint64 max_buf_size; uint32 max_buf_size_id; bool mem_mode; typedef list> elem_t; elem_t** buffer; void Release(); string GetName(int n); void CheckBuffer(); void ReleaseBuffer(); void PutBinToTmpFile(uint32 n); public: void GetTotal(uint64& _total) { _total = total_size; } CKmerBinStorer(CKMCParams &Params, CKMCQueues &Queues); ~CKmerBinStorer(); bool OpenFiles(); void ProcessQueue(); }; //************************************************************************************************************ // CWKmerBinStorer - wrapper for multithreading purposes 
//************************************************************************************************************ class CWKmerBinStorer { CKmerBinStorer *kbs; public: void GetTotal(uint64& _total) { kbs->GetTotal(_total); } CWKmerBinStorer(CKMCParams &Params, CKMCQueues &Queues); ~CWKmerBinStorer(); void operator()(); }; #endif // ***** EOF KMC-2.3/kmer_counter/kmc.h000066400000000000000000001134011257432033000154070ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _KMC_H #define _KMC_H #include "defs.h" #include "params.h" #include "kmer.h" #include #include #include #include #include #include #include "queues.h" #include "timer.h" #include "fastq_reader.h" #include "kb_collector.h" #include "kb_completer.h" #include "kb_reader.h" #include "kb_sorter.h" #include "kb_storer.h" #include "s_mapper.h" #include "splitter.h" #include "asmlib_wrapper.h" #ifdef DEVELOP_MODE #include "develop.h" #endif #include "bkb_reader.h" #include "bkb_uncompactor.h" #include "bkb_sorter.h" #include "bkb_merger.h" #include "bkb_writer.h" using namespace std; template class CSmallKWrapper; template class CKMC { bool initialized; CStopWatch w0, heuristic_time , w1, w2, w3;//w3 - strict memory time // Parameters (input and internal) CKMCParams Params; // Memory monitor and queues CKMCQueues Queues; // Thread groups vector gr0_1, gr0_2; vector gr1_1, gr1_2, gr1_3, gr1_4, gr1_5; // thread groups for 1st stage vector gr2_1, gr2_2, gr2_3; // thread groups for 2nd stage uint64 n_unique, n_cutoff_min, n_cutoff_max, n_total, n_reads, tmp_size, tmp_size_strict_mem, max_disk_usage, n_total_super_kmers; // Threads vector w_stats_fastqs; vector*> w_stats_splitters; vector w_fastqs; vector*> w_splitters; CWKmerBinStorer *w_storer; CWKmerBinReader* w_reader; vector*> w_sorters; 
CWKmerBinCompleter *w_completer; void SetThreads1Stage(); void SetThreads2Stage(vector& sorted_sizes); void SetThreadsStrictMemoryMode(); void AdjustMemoryLimitsStrictMemoryMode(); bool AdjustMemoryLimits(); void AdjustMemoryLimitsStage2(); void ShowSettingsStage1(); void ShowSettingsStage2(); friend class CSmallKWrapper; bool AdjustMemoryLimitsSmallK(); template bool ProcessSmallKOptimization(); public: CKMC(); ~CKMC(); void SetParams(CKMCParams &_Params); bool Process(); void GetStats(double &time1, double &time2, double &time3, uint64 &_n_unique, uint64 &_n_cutoff_min, uint64 &_n_cutoff_max, uint64 &_n_total, uint64 &_n_reads, uint64 &_tmp_size, uint64 &_tmp_size_strict_mem, uint64 &_max_disk_usage, uint64& _n_total_super_kmers); }; template class CSmallKWrapper { public: static bool Process(CKMC& ptr); }; template class CSmallKWrapper { public: static bool Process(CKMC& ptr); }; template bool CSmallKWrapper::Process(CKMC& ptr) { return ptr.template ProcessSmallKOptimization(); } template bool CSmallKWrapper::Process(CKMC& ptr) { if ((uint64)ptr.Params.cutoff_max > ((1ull << 32) - 1)) return ptr.template ProcessSmallKOptimization(); else return ptr.template ProcessSmallKOptimization(); } //---------------------------------------------------------------------------------- template CKMC::CKMC() { // OpenMP support is a must, so do not compile if it is not supported #if !defined(_OPENMP) static_assert(false, "You need to use OpenMP"); #endif initialized = false; Params.kmer_len = 0; Params.n_readers = 1; Params.n_splitters = 1; Params.n_sorters = 1; //Params.n_omp_threads = 1; Queues.s_mapper = NULL; } //---------------------------------------------------------------------------------- template CKMC::~CKMC() { } //---------------------------------------------------------------------------------- // Set params of the k-mer counter template void CKMC::SetParams(CKMCParams &_Params) { Params = _Params; Params.kmer_len = Params.p_k; Params.n_bins = Params.p_n_bins; if 
		(Params.kmer_len % 32 == 0)
		Params.max_x = 0;
	else
		Params.max_x = MIN(31 - (Params.kmer_len % 32), KMER_X);

	Params.verbose = Params.p_verbose;

	// Technical parameters related to temporary files
	Params.signature_len = Params.p_p1;
	Params.bin_part_size = 1 << 16;

	// Thresholds for counters
	Params.cutoff_min = Params.p_ci;
	Params.cutoff_max = Params.p_cx;
	Params.counter_max = Params.p_cs;
	Params.use_quake = Params.p_quake;

	Params.lowest_quality = Params.p_quality;
	Params.both_strands = Params.p_both_strands;
	Params.use_strict_mem = Params.p_strict_mem;
	Params.mem_mode = Params.p_mem_mode;

	// Technical parameters related to no. of threads and memory usage
	if(Params.p_sf && Params.p_sp && Params.p_so && Params.p_sr)
	{
		// All thread counts were given explicitly
		Params.n_readers = NORM(Params.p_sf, 1, 32);
		Params.n_splitters = NORM(Params.p_sp, 1, 32);
		Params.n_sorters = NORM(Params.p_sr, 1, 32);
		//Params.n_omp_threads = NORM(Params.p_so, 1, 32);
		Params.n_omp_threads.assign(Params.n_sorters, NORM(Params.p_so, 1, 32));
	}
	else
	{
		// Adjust the number of threads according to the current hardware
		Params.n_threads = Params.p_t;
		if (!Params.n_threads)
			Params.n_threads = thread::hardware_concurrency();
		SetThreads1Stage();
	}

	//Params.max_mem_size = NORM(((uint64) Params.p_m) << 30, (uint64) MIN_MEM << 30, 1024ull << 30);
	// The memory limit is given in decimal gigabytes (10^9 bytes)
	Params.max_mem_size = NORM(((uint64)Params.p_m) * 1000000000ull, (uint64)MIN_MEM * 1000000000ull, 1024ull * 1000000000ull);

	Params.file_type = Params.p_file_type;

	Params.KMER_T_size = sizeof(KMER_T);

	initialized = true;

	SetMemcpyCacheLimit(8);	// Sets the asmlib's memcpy function to make copy without use of cache memory
}

//----------------------------------------------------------------------------------
template <typename KMER_T, unsigned SIZE> void CKMC<KMER_T, SIZE>::SetThreads1Stage()
{
	if (!Params.p_sf || !Params.p_sp || !Params.p_sr || !Params.p_so)
	{
		int cores = Params.n_threads;
		bool gz_bz2 = false;
		vector<uint64> file_sizes;

		for (auto& p : Params.input_file_names)
		{
			// Compare the actual suffixes: a fixed 3-character extension can never equal ".bz2"
			bool is_gz = p.size() >= 3 && p.compare(p.size() - 3, 3, ".gz") == 0;
			bool is_bz2 = p.size() >= 4 && p.compare(p.size() - 4, 4, ".bz2") == 0;
			if (is_gz || is_bz2)
			{
				gz_bz2 = true;
//break; } FILE* tmp = my_fopen(p.c_str(), "rb"); if (!tmp) { cout << "Cannot open file: " << p.c_str(); exit(1); } my_fseek(tmp, 0, SEEK_END); file_sizes.push_back(my_ftell(tmp)); fclose(tmp); } if (gz_bz2) { sort(file_sizes.begin(), file_sizes.end(), greater()); uint64 file_size_threshold = (uint64)(file_sizes.front() * 0.05); int32 n_allowed_files = 0; for(auto& p : file_sizes) if (p > file_size_threshold) ++n_allowed_files; Params.n_readers = MIN(n_allowed_files, MAX(1, cores / 2)); } else Params.n_readers = 1; Params.n_splitters = MAX(1, cores - Params.n_readers); } } //---------------------------------------------------------------------------------- template void CKMC::SetThreads2Stage(vector& sorted_sizes) { if (!Params.p_sf || !Params.p_sp || !Params.p_sr || !Params.p_so) { if (Params.n_threads == 1) { Params.n_sorters = 1; Params.n_omp_threads.assign(1, 1); } else { int64 _10th_proc_bin_size = MAX(sorted_sizes[int(sorted_sizes.size() * 0.1)], 1); Params.n_sorters = (int)NORM(Params.max_mem_size / _10th_proc_bin_size, 1, Params.n_threads); Params.n_omp_threads.assign(Params.n_sorters, MAX(1, Params.n_threads / Params.n_sorters)); int threads_left = Params.n_threads - Params.n_omp_threads.front() * Params.n_sorters; for (uint32 i = 0; threads_left; --threads_left, ++i) Params.n_omp_threads[i%Params.n_sorters]++; } } } //---------------------------------------------------------------------------------- template void CKMC::SetThreadsStrictMemoryMode() { Params.sm_n_mergers = Params.p_smme; Params.sm_n_uncompactors = Params.p_smun; Params.sm_n_omp_threads = Params.p_smso; if (!Params.sm_n_omp_threads) Params.sm_n_omp_threads = Params.n_threads; if (!Params.sm_n_uncompactors) Params.sm_n_uncompactors = 1; if (!Params.sm_n_mergers) Params.sm_n_mergers = 1; } //---------------------------------------------------------------------------------- template void CKMC::AdjustMemoryLimitsStrictMemoryMode() { int64 m_rest = Params.max_mem_size; 
Params.sm_mem_part_input_file = 1ull << 26; Params.sm_mem_tot_input_file = Params.sm_mem_part_input_file * (Params.sm_n_uncompactors + 1); m_rest -= Params.sm_mem_tot_input_file; Params.sm_mem_part_expand = Params.sm_mem_part_input_file; Params.sm_mem_tot_expand = Params.sm_mem_part_expand * (Params.sm_n_uncompactors + 1); m_rest -= Params.sm_mem_tot_expand; Params.sm_mem_part_suffixes = 1 << 25; Params.sm_mem_tot_suffixes = Params.sm_mem_part_suffixes * 2; m_rest -= Params.sm_mem_tot_suffixes; Params.sm_mem_part_lut = (1 << 2 * 12) * sizeof(uint64); //12 is max lut prefix len for strict memory sub bins Params.sm_mem_tot_lut = Params.sm_mem_part_lut * 2; m_rest -= Params.sm_mem_tot_lut; Params.sm_mem_part_merger_suff = 1 << 24; Params.sm_mem_tot_merger_suff = (Params.sm_n_mergers + 1) * Params.sm_mem_part_merger_suff; m_rest -= Params.sm_mem_tot_merger_suff; Params.sm_mem_part_merger_lut = 1 << 24; Params.sm_mem_tot_merger_lut = (Params.sm_n_mergers + 1) * Params.sm_mem_part_merger_lut; m_rest -= Params.sm_mem_tot_merger_lut; const uint32 PREDICTET_NO_OF_SUBBINS = 3; Params.sm_mem_part_sub_bin_suff = (1ull << 24) * PREDICTET_NO_OF_SUBBINS; Params.sm_mem_tot_sub_bin_suff = Params.sm_mem_part_sub_bin_suff * Params.sm_n_mergers; m_rest -= Params.sm_mem_tot_sub_bin_suff; Params.sm_mem_part_sub_bin_lut = (1ull << 24) * PREDICTET_NO_OF_SUBBINS; Params.sm_mem_tot_sub_bin_lut = Params.sm_mem_part_sub_bin_lut * Params.sm_n_mergers; m_rest -= Params.sm_mem_tot_sub_bin_lut; //cout << "Memory left for sorter: " << (m_rest >> 20) << "MB"<< endl; Params.sm_mem_part_sort = m_rest; Params.sm_mem_tot_sort = Params.sm_mem_part_sort; } //---------------------------------------------------------------------------------- template void CKMC::AdjustMemoryLimitsStage2() { // Memory for 2nd stage // Settings for memory manager of radix internal buffers Params.mem_part_pmm_radix_buf = (256 * BUFFER_WIDTH + ALIGNMENT) * sizeof(uint64); int64 sum_n_omp_threads = 0; for (auto& p : 
Params.n_omp_threads) sum_n_omp_threads += p; //Params.mem_tot_pmm_radix_buf = Params.mem_part_pmm_radix_buf * Params.n_sorters * Params.n_omp_threads; Params.mem_tot_pmm_radix_buf = Params.mem_part_pmm_radix_buf * sum_n_omp_threads; if (Params.use_quake) { Params.mem_part_pmm_prob = (CKmerBinSorter::PROB_BUF_SIZE + 1) * sizeof(double); Params.mem_tot_pmm_prob = Params.n_sorters * Params.mem_part_pmm_prob; } else Params.mem_part_pmm_prob = Params.mem_tot_pmm_prob = 0; if (!Params.use_quake && Params.both_strands) { Params.mem_part_pmm_epxand = EXPAND_BUFFER_RECS * sizeof(KMER_T); Params.mem_tot_pmm_epxand = sum_n_omp_threads * Params.mem_part_pmm_epxand; } else Params.mem_part_pmm_epxand = Params.mem_tot_pmm_epxand = 0; Params.max_mem_stage2 = Params.max_mem_size - Params.mem_tot_pmm_radix_buf - Params.mem_tot_pmm_prob - Params.mem_tot_pmm_epxand; } //---------------------------------------------------------------------------------- // Adjust the memory limits for queues and other large data structures template bool CKMC::AdjustMemoryLimits() { // Memory for splitter internal buffers int64 m_rest = Params.max_mem_size; Params.mem_part_pmm_stats = ((1 << Params.signature_len * 2) + 1) * sizeof(uint32); Params.mem_tot_pmm_stats = (Params.n_splitters + 1 + 1) * Params.mem_part_pmm_stats; //1 merged in main thread, 1 for sorting indices // Settings for memory manager of FASTQ buffers Params.fastq_buffer_size = 32 << 20; do { if(Params.fastq_buffer_size & (Params.fastq_buffer_size-1)) Params.fastq_buffer_size &= Params.fastq_buffer_size - 1; else Params.fastq_buffer_size = Params.fastq_buffer_size / 2 + Params.fastq_buffer_size / 4; Params.mem_part_pmm_fastq = Params.fastq_buffer_size + CFastqReader::OVERHEAD_SIZE; Params.mem_tot_pmm_fastq = Params.mem_part_pmm_fastq * (Params.n_readers + Params.n_splitters + 96); } while(Params.mem_tot_pmm_fastq > m_rest * 0.17); m_rest -= Params.mem_tot_pmm_fastq; // Subtract memory for buffers for decompression of FASTQ files 
while(Params.n_readers * Params.gzip_buffer_size > m_rest / 10) Params.gzip_buffer_size /= 2; m_rest -= Params.n_readers * Params.gzip_buffer_size; // Subtract memory for bin collectors internal buffers m_rest -= Params.n_splitters * Params.bin_part_size * sizeof(KMER_T); // Settings for memory manager of reads Params.mem_part_pmm_reads = (CSplitter::MAX_LINE_SIZE + 1) * sizeof(double); Params.mem_tot_pmm_reads = Params.mem_part_pmm_reads * 2 * Params.n_splitters; m_rest -= Params.mem_tot_pmm_reads; // Max. memory for single package Params.max_mem_storer_pkg = 1ll << 25; Params.mem_part_pmm_bins = Params.bin_part_size; Params.mem_tot_pmm_bins = m_rest; // memory for storer internal buffer if(Params.max_mem_size >= 16ll << 30) Params.max_mem_storer = (int64) (Params.mem_tot_pmm_bins * 0.75); else Params.max_mem_storer = (int64) (Params.mem_tot_pmm_bins * 0.65); if(Params.max_mem_storer < (1ll << 28)) return false; return true; } //---------------------------------------------------------------------------------- // Show the settings of the KMC (in verbose mode only) template void CKMC::ShowSettingsStage1() { if(!Params.verbose) return; cout << "\n********** Used parameters: **********\n"; cout << "No. of input files : " << Params.input_file_names.size() << "\n"; cout << "Output file name : " << Params.output_file_name << "\n"; cout << "No. of working directories : " << 1 << "\n"; cout << "Input format : "; switch (Params.file_type) { case fasta: cout << "FASTA\n"; break; case fastq: cout << "FASTQ\n"; break; case multiline_fasta: cout << "MULTI LINE FASTA\n"; break; } cout << "\n"; cout << "k-mer length : " << Params.kmer_len << "\n"; cout << "Max. k-mer length : " << MAX_K << "\n"; cout << "Signature length : " << Params.signature_len << "\n"; cout << "Min. count threshold : " << Params.cutoff_min << "\n"; cout << "Max. count threshold : " << Params.cutoff_max << "\n"; cout << "Max. 
counter value : " << Params.counter_max << "\n"; cout << "Type of counters : " << (Params.use_quake ? "Quake-compatible\n" : "direct\n"); if(Params.use_quake) cout << "Lowest quality value : " << Params.lowest_quality << "\n"; cout << "Both strands : " << (Params.both_strands ? "true\n" : "false\n"); cout << "RAM only mode : " << (Params.mem_mode ? "true\n" : "false\n"); cout << "\n******* Stage 1 configuration: *******\n"; cout << "\n"; cout << "No. of bins : " << Params.n_bins << "\n"; cout << "Bin part size : " << Params.bin_part_size << "\n"; cout << "Input buffer size : " << Params.fastq_buffer_size << "\n"; cout << "\n"; cout << "No. of readers : " << Params.n_readers << "\n"; cout << "No. of splitters : " << Params.n_splitters << "\n"; cout << "\n"; cout << "Max. mem. size : " << setw(5) << (Params.max_mem_size / 1000000) << "MB\n"; cout << "Max. mem. per storer : " << setw(5) << (Params.max_mem_storer / 1000000) << "MB\n"; cout << "Max. mem. for single package : " << setw(5) << (Params.max_mem_storer_pkg / 1000000) << "MB\n"; cout << "\n"; cout << "Max. mem. for PMM (bin parts): " << setw(5) << (Params.mem_tot_pmm_bins / 1000000) << "MB\n"; cout << "Max. mem. for PMM (FASTQ) : " << setw(5) << (Params.mem_tot_pmm_fastq / 1000000) << "MB\n"; cout << "Max. mem. for PMM (reads) : " << setw(5) << (Params.mem_tot_pmm_reads / 1000000) << "MB\n"; cout << "\n"; } //---------------------------------------------------------------------------------- // Show the settings of the KMC (in verbose mode only) template void CKMC::ShowSettingsStage2() { if (!Params.verbose) return; cout << "\n******* Stage 2 configuration: *******\n"; cout << "No. of sorters : " << Params.n_sorters << "\n"; cout << "No. of sort. threads : "; for (uint32 i = 0; i < Params.n_omp_threads.size() - 1; ++i) cout << Params.n_omp_threads[i] << ", "; cout << Params.n_omp_threads.back() << "\n"; cout << "\n"; cout << "Max. mem. 
for 2nd stage : " << setw(5) << (Params.max_mem_stage2 / 1000000) << "MB\n"; cout << "\n"; } //---------------------------------------------------------------------------------- template bool CKMC::AdjustMemoryLimitsSmallK() { if (Params.kmer_len > 13) return false; uint32 counter_size = 4; //in bytes if ((uint64)Params.cutoff_max > ((1ull << 32) - 1)) counter_size = 8; int tmp_n_splitters = Params.n_splitters; int tmp_n_readers = Params.n_readers; int tmp_fastq_buffer_size = 0; int64 tmp_mem_part_pmm_fastq = 0; int64 tmp_mem_tot_pmm_fastq = 0; int64 tmp_mem_part_pmm_reads = (CSplitter::MAX_LINE_SIZE + 1) * sizeof(double); int64 tmp_mem_tot_pmm_reads = 0; int32 tmp_gzip_buffer_size = Params.gzip_buffer_size; int64 tmp_mem_part_small_k_buf = (1ll << 2 * Params.kmer_len) * counter_size;//no of possible k-mers * counter size int64 tmp_mem_tot_small_k_buf = 0; int64 mim_mem_for_readers = tmp_n_readers * (16 << 20); while (tmp_n_splitters) { tmp_mem_tot_pmm_reads = tmp_mem_part_pmm_reads * 3 * tmp_n_splitters; tmp_mem_tot_small_k_buf = tmp_mem_part_small_k_buf * tmp_n_splitters; if (tmp_mem_tot_pmm_reads + tmp_mem_tot_small_k_buf + mim_mem_for_readers < Params.max_mem_size) break; --tmp_n_splitters; } if (!tmp_n_splitters) return false; int64 left_for_readers = Params.max_mem_size - tmp_mem_tot_pmm_reads - tmp_mem_tot_small_k_buf; int64 max_for_gzip = (int64)(0.66 * left_for_readers); while (tmp_n_readers * tmp_gzip_buffer_size > max_for_gzip) tmp_gzip_buffer_size /= 2; int64 for_pmm_fastq = left_for_readers - tmp_n_readers * tmp_gzip_buffer_size; tmp_fastq_buffer_size = 32 << 20; do { if (tmp_fastq_buffer_size & (tmp_fastq_buffer_size - 1)) tmp_fastq_buffer_size &= tmp_fastq_buffer_size - 1; else tmp_fastq_buffer_size = tmp_fastq_buffer_size / 2 + tmp_fastq_buffer_size / 4; tmp_mem_part_pmm_fastq = tmp_fastq_buffer_size + CFastqReader::OVERHEAD_SIZE; tmp_mem_tot_pmm_fastq = tmp_mem_part_pmm_fastq * (tmp_n_readers + tmp_n_splitters + 96); } while (tmp_mem_tot_pmm_fastq 
> for_pmm_fastq); Params.n_splitters = tmp_n_splitters; Params.n_readers = tmp_n_readers; Params.fastq_buffer_size = tmp_fastq_buffer_size; Params.mem_part_pmm_fastq = tmp_mem_part_pmm_fastq; Params.mem_part_small_k_completer = Params.mem_tot_small_k_completer = Params.mem_tot_pmm_fastq = tmp_mem_tot_pmm_fastq; Params.mem_part_pmm_reads = tmp_mem_part_pmm_reads; Params.mem_tot_pmm_reads = tmp_mem_tot_pmm_reads; Params.gzip_buffer_size = tmp_gzip_buffer_size; Params.mem_part_small_k_buf = tmp_mem_part_small_k_buf; Params.mem_tot_small_k_buf = tmp_mem_tot_small_k_buf; return true; } //---------------------------------------------------------------------------------- template template bool CKMC::ProcessSmallKOptimization() { vector*> w_small_k_splitters; //For small k values only w1.startTimer(); Queues.input_files_queue = new CInputFilesQueue(Params.input_file_names); Queues.part_queue = new CPartQueue(Params.n_readers); Queues.pmm_fastq = new CMemoryPool(Params.mem_tot_pmm_fastq, Params.mem_part_pmm_fastq); Queues.pmm_reads = new CMemoryPool(Params.mem_tot_pmm_reads, Params.mem_part_pmm_reads); Queues.pmm_small_k_buf = new CMemoryPool(Params.mem_tot_small_k_buf, Params.mem_part_small_k_buf); w_small_k_splitters.resize(Params.n_splitters); for (int i = 0; i < Params.n_splitters; ++i) { w_small_k_splitters[i] = new CWSmallKSplitter(Params, Queues); gr1_2.push_back(thread(std::ref(*w_small_k_splitters[i]))); } w_fastqs.resize(Params.n_readers); for (int i = 0; i < Params.n_readers; ++i) { w_fastqs[i] = new CWFastqReader(Params, Queues); gr1_1.push_back(thread(std::ref(*w_fastqs[i]))); } for (auto& t : gr1_1) t.join(); for (auto& t : gr1_2) t.join(); for (auto r : w_fastqs) delete r; vector> results(Params.n_splitters); for (int i = 0; i < Params.n_splitters; ++i) { results[i] = w_small_k_splitters[i]->GetResult(); } w1.stopTimer(); w2.startTimer(); uint64 n_kmers = 0; for (int j = 1; j < Params.n_splitters; ++j) { for (int i = 0; i < (1 << 2 * Params.kmer_len); ++i) 
results[0].buf[i] += results[j].buf[i]; } n_total = 0; for (int j = 0; j < (1 << 2 * Params.kmer_len); ++j) if (results[0].buf[j]) ++n_kmers; uint64 tmp_n_reads; tmp_size = 0; n_reads = 0; n_total_super_kmers = 0; for (auto s : w_small_k_splitters) { s->GetTotal(tmp_n_reads); n_reads += tmp_n_reads; n_total += s->GetTotalKmers(); s->Release(); delete s; } Queues.pmm_fastq->release(); delete Queues.pmm_fastq; uint32 best_lut_prefix_len = 0; uint64 best_mem_amount = 1ull << 62; uint32 counter_size = 0; if (Params.use_quake) counter_size = 4; else counter_size = min(BYTE_LOG(Params.cutoff_max), BYTE_LOG(Params.counter_max)); for (Params.lut_prefix_len = 1; Params.lut_prefix_len < 16; ++Params.lut_prefix_len) { uint32 suffix_len; if (Params.lut_prefix_len > (uint32)Params.kmer_len) suffix_len = 0; else suffix_len = Params.kmer_len - Params.lut_prefix_len; if (suffix_len % 4) continue; uint64 suf_mem = n_kmers * (suffix_len / 4 + counter_size); uint64 lut_mem = (1ull << (2 * Params.lut_prefix_len)) * sizeof(uint64); if (suf_mem + lut_mem < best_mem_amount) { best_lut_prefix_len = Params.lut_prefix_len; best_mem_amount = suf_mem + lut_mem; } } Params.lut_prefix_len = best_lut_prefix_len; Queues.pmm_small_k_completer = new CMemoryPool(Params.mem_tot_small_k_completer, Params.mem_part_small_k_completer); CSmallKCompleter small_k_completer(Params, Queues); small_k_completer.Complete(results[0]); small_k_completer.GetTotal(n_unique, n_cutoff_min, n_cutoff_max); Queues.pmm_reads->release(); Queues.pmm_small_k_buf->release(); Queues.pmm_small_k_completer->release(); delete Queues.pmm_small_k_completer; delete Queues.pmm_reads; delete Queues.pmm_small_k_buf; w2.stopTimer(); cout << "\n"; return true; } //---------------------------------------------------------------------------------- // Run the counter template bool CKMC::Process() { if (!initialized) return false; if (AdjustMemoryLimitsSmallK()) { if (Params.verbose) { cout << "\nSmall k optimization on!\n"; } return 
CSmallKWrapper::Process(*this); } int32 bin_id; CMemDiskFile *file; string name; uint64 size; uint64 n_rec; uint64 n_plus_x_recs; uint64 n_super_kmers; if (!AdjustMemoryLimits()) return false; w1.startTimer(); // Create monitors Queues.mm = new CMemoryMonitor(Params.max_mem_stage2); // Create queues Queues.input_files_queue = new CInputFilesQueue(Params.input_file_names); Queues.part_queue = new CPartQueue(Params.n_readers); Queues.bpq = new CBinPartQueue(Params.n_splitters); Queues.bd = new CBinDesc; Queues.bq = new CBinQueue(1); Queues.stats_part_queue = new CStatsPartQueue(Params.n_readers, STATS_FASTQ_SIZE); // Create memory manager Queues.pmm_bins = new CMemoryPool(Params.mem_tot_pmm_bins, Params.mem_part_pmm_bins); Queues.pmm_fastq = new CMemoryPool(Params.mem_tot_pmm_fastq, Params.mem_part_pmm_fastq); Queues.pmm_reads = new CMemoryPool(Params.mem_tot_pmm_reads, Params.mem_part_pmm_reads); Queues.pmm_stats = new CMemoryPool(Params.mem_tot_pmm_stats, Params.mem_part_pmm_stats); Queues.s_mapper = new CSignatureMapper(Queues.pmm_stats, Params.signature_len, Params.n_bins); Queues.disk_logger = new CDiskLogger; // ***** Stage 0 ***** w0.startTimer(); w_stats_splitters.resize(Params.n_splitters); for (int i = 0; i < Params.n_splitters; ++i) { w_stats_splitters[i] = new CWStatsSplitter(Params, Queues); gr0_2.push_back(thread(std::ref(*w_stats_splitters[i]))); } w_stats_fastqs.resize(Params.n_readers); for (int i = 0; i < Params.n_readers; ++i) { w_stats_fastqs[i] = new CWStatsFastqReader(Params, Queues); gr0_1.push_back(thread(std::ref(*w_stats_fastqs[i]))); } for (auto p = gr0_1.begin(); p != gr0_1.end(); ++p) p->join(); for (auto p = gr0_2.begin(); p != gr0_2.end(); ++p) p->join(); uint32 *stats; Queues.pmm_stats->reserve(stats); fill_n(stats, (1 << Params.signature_len * 2) + 1, 0); for (int i = 0; i < Params.n_readers; ++i) delete w_stats_fastqs[i]; for (int i = 0; i < Params.n_splitters; ++i) { w_stats_splitters[i]->GetStats(stats); delete 
w_stats_splitters[i]; } delete Queues.stats_part_queue; Queues.stats_part_queue = NULL; delete Queues.input_files_queue; Queues.input_files_queue = new CInputFilesQueue(Params.input_file_names); heuristic_time.startTimer(); Queues.s_mapper->Init(stats); heuristic_time.stopTimer(); cout << "\n"; w0.stopTimer(); Queues.pmm_stats->free(stats); Queues.pmm_stats->release(); delete Queues.pmm_stats; Queues.pmm_stats = NULL; // ***** Stage 1 ***** ShowSettingsStage1(); w_splitters.resize(Params.n_splitters); for(int i = 0; i < Params.n_splitters; ++i) { w_splitters[i] = new CWSplitter(Params, Queues); gr1_2.push_back(thread(std::ref(*w_splitters[i]))); } w_storer = new CWKmerBinStorer(Params, Queues); gr1_3.push_back(thread(std::ref(*w_storer))); w_fastqs.resize(Params.n_readers); for(int i = 0; i < Params.n_readers; ++i) { w_fastqs[i] = new CWFastqReader(Params, Queues); gr1_1.push_back(thread(std::ref(*w_fastqs[i]))); } for(auto p = gr1_1.begin(); p != gr1_1.end(); ++p) p->join(); for(auto p = gr1_2.begin(); p != gr1_2.end(); ++p) p->join(); Queues.pmm_fastq->release(); Queues.pmm_reads->release(); delete Queues.pmm_fastq; delete Queues.pmm_reads; for(auto p = gr1_3.begin(); p != gr1_3.end(); ++p) p->join(); n_reads = 0; thread *release_thr_st1_1 = new thread([&]{ for(int i = 0; i < Params.n_readers; ++i) delete w_fastqs[i]; for(int i = 0; i < Params.n_splitters; ++i) { uint64 _n_reads; w_splitters[i]->GetTotal(_n_reads); n_reads += _n_reads; delete w_splitters[i]; } delete w_storer; }); thread *release_thr_st1_2 = new thread([&]{ Queues.pmm_bins->release(); delete Queues.pmm_bins; }); release_thr_st1_1->join(); release_thr_st1_2->join(); delete release_thr_st1_1; delete release_thr_st1_2; w1.stopTimer(); w2.startTimer(); // ***** End of Stage 1 ***** // Adjust RAM for 2nd stage // Calculate LUT size uint32 best_lut_prefix_len = 0; uint64 best_mem_amount = 1ull << 62; for (Params.lut_prefix_len = 2; Params.lut_prefix_len < 16; ++Params.lut_prefix_len) { uint32 
suffix_len = Params.kmer_len - Params.lut_prefix_len; if (suffix_len % 4) continue; uint64 est_suf_mem = n_reads * suffix_len; uint64 lut_mem = Params.n_bins * (1ull << (2 * Params.lut_prefix_len)) * sizeof(uint64); if (est_suf_mem + lut_mem < best_mem_amount) { best_lut_prefix_len = Params.lut_prefix_len; best_mem_amount = est_suf_mem + lut_mem; } } Params.lut_prefix_len = best_lut_prefix_len; #ifdef DEVELOP_MODE save_bins_stats(Queues, Params, sizeof(KMER_T), KMER_T::QUALITY_SIZE, n_reads); #endif Queues.bd->reset_reading(); vector bin_sizes; while((bin_id = Queues.bd->get_next_bin()) >= 0) { Queues.bd->read(bin_id, file, name, size, n_rec, n_plus_x_recs, n_super_kmers); if (Params.max_x) bin_sizes.push_back(n_plus_x_recs * 2 * sizeof(KMER_T)); // estimation of RAM for sorting bins else bin_sizes.push_back(n_rec * 2 * sizeof(KMER_T)); } sort(bin_sizes.begin(), bin_sizes.end(), greater()); SetThreads2Stage(bin_sizes); AdjustMemoryLimitsStage2(); if (Params.use_strict_mem) { SetThreadsStrictMemoryMode(); Queues.tlbq = new CTooLargeBinsQueue; Queues.bbkpq = new CBigBinKmerPartQueue(Params.sm_n_mergers); } else { Queues.tlbq = NULL; Queues.bbkpq = NULL; } Queues.kq = new CKmerQueue(Params.n_bins, Params.n_sorters); int64 stage2_size = 0; for (int i = 0; i < 4 * Params.n_sorters; ++i) stage2_size += bin_sizes[i]; stage2_size = MAX(stage2_size, 16 << 20); Params.max_mem_stage2 = MIN(Params.max_mem_stage2, stage2_size); ShowSettingsStage2(); // ***** Stage 2 ***** Queues.bd->reset_reading(); Queues.pmm_radix_buf = new CMemoryPool(Params.mem_tot_pmm_radix_buf, Params.mem_part_pmm_radix_buf ); if (!Params.use_quake && Params.both_strands) Queues.pmm_expand = new CMemoryPool(Params.mem_tot_pmm_epxand, Params.mem_part_pmm_epxand); else Queues.pmm_expand = NULL; Queues.memory_bins = new CMemoryBins(Params.max_mem_stage2, Params.n_bins, Params.use_strict_mem); if (Params.use_quake) Queues.pmm_prob = new CMemoryPool(Params.mem_tot_pmm_prob, Params.mem_part_pmm_prob); else 
Queues.pmm_prob = NULL; w_reader = new CWKmerBinReader(Params, Queues); gr2_1.push_back(thread(std::ref(*w_reader))); w_sorters.resize(Params.n_sorters); for(int i = 0; i < Params.n_sorters; ++i) { w_sorters[i] = new CWKmerBinSorter(Params, Queues, i); gr2_2.push_back(thread(std::ref(*w_sorters[i]))); } w_completer = new CWKmerBinCompleter(Params, Queues); gr2_3.push_back(thread(std::ref(*w_completer), true)); for(auto p = gr2_1.begin(); p != gr2_1.end(); ++p) p->join(); for(auto p = gr2_2.begin(); p != gr2_2.end(); ++p) p->join(); //Finishing first stage of completer for (auto p = gr2_3.begin(); p != gr2_3.end(); ++p) p->join(); gr2_3.clear(); thread *release_thr_st2_1 = new thread([&]{ delete Queues.mm; if (Queues.pmm_expand) { Queues.pmm_expand->release(); delete Queues.pmm_expand; } //Queues.pmm_radix_buf->release(); Queues.memory_bins->release(); //delete Queues.pmm_radix_buf; delete Queues.memory_bins; }); //process big bins if necessary (only in strict memory limit mode) thread* release_thr_sm = NULL; if (Params.use_strict_mem) { w2.stopTimer(); w3.startTimer(); release_thr_st2_1->join(); //need to be sure that memory_bins is released AdjustMemoryLimitsStrictMemoryMode(); cout << "\n"; Queues.sm_pmm_input_file = new CMemoryPool(Params.sm_mem_tot_input_file, Params.sm_mem_part_input_file); Queues.sm_pmm_expand = new CMemoryPool(Params.sm_mem_tot_expand, Params.sm_mem_part_expand); Queues.sm_pmm_sort = new CMemoryPool(Params.sm_mem_tot_sort, Params.sm_mem_part_sort); Queues.sm_pmm_sorter_suffixes = new CMemoryPool(Params.sm_mem_tot_suffixes, Params.sm_mem_part_suffixes); Queues.sm_pmm_sorter_lut = new CMemoryPool(Params.sm_mem_tot_lut, Params.sm_mem_part_lut); Queues.sm_pmm_merger_lut = new CMemoryPool(Params.sm_mem_tot_merger_lut, Params.sm_mem_part_merger_lut); Queues.sm_pmm_merger_suff = new CMemoryPool(Params.sm_mem_tot_merger_suff, Params.sm_mem_part_merger_suff); Queues.sm_pmm_sub_bin_lut = new CMemoryPool(Params.sm_mem_tot_sub_bin_lut, 
Params.sm_mem_part_sub_bin_lut); Queues.sm_pmm_sub_bin_suff = new CMemoryPool(Params.sm_mem_tot_sub_bin_suff, Params.sm_mem_part_sub_bin_suff); Queues.bbpq = new CBigBinPartQueue(); Queues.bbkq = new CBigBinKXmersQueue(Params.sm_n_uncompactors); Queues.bbd = new CBigBinDesc(); Queues.bbspq = new CBigBinSortedPartQueue(1); Queues.sm_cbc = new CCompletedBinsCollector(1); CWBigKmerBinReader* w_bkb_reader = new CWBigKmerBinReader(Params, Queues); thread bkb_reader(std::ref(*w_bkb_reader)); vector*> w_bkb_uncompactors(Params.sm_n_uncompactors); vector bkb_uncompactors; for (int32 i = 0; i < Params.sm_n_uncompactors; ++i) { w_bkb_uncompactors[i] = new CWBigKmerBinUncompactor(Params, Queues); bkb_uncompactors.push_back(thread(std::ref(*w_bkb_uncompactors[i]))); } CWBigKmerBinSorter* w_bkb_sorter = new CWBigKmerBinSorter(Params, Queues); thread bkb_sorter(std::ref(*w_bkb_sorter)); CWBigKmerBinWriter* w_bkb_writer = new CWBigKmerBinWriter(Params, Queues); thread bkb_writer(std::ref(*w_bkb_writer)); vector*> w_bkb_mergers(Params.sm_n_mergers); vector bkb_mergers; for (int32 i = 0; i < Params.sm_n_mergers; ++i) { w_bkb_mergers[i] = new CWBigKmerBinMerger(Params, Queues); bkb_mergers.push_back(thread(std::ref(*w_bkb_mergers[i]))); } w_completer->InitStage2(Params, Queues); gr2_3.push_back(thread(std::ref(*w_completer), false)); for (auto& m : bkb_mergers) m.join(); bkb_sorter.join(); bkb_writer.join(); for (auto& u : bkb_uncompactors) u.join(); bkb_reader.join(); delete w_bkb_reader; for (auto& u : w_bkb_uncompactors) delete u; delete w_bkb_sorter; delete w_bkb_writer; for (auto& m : w_bkb_mergers) delete m; release_thr_sm = new thread([&]{ delete Queues.bbpq; delete Queues.bbkq; delete Queues.sm_cbc; delete Queues.bbspq; }); } else { gr2_3.push_back(thread(std::ref(*w_completer), false)); } Queues.pmm_radix_buf->release(); delete Queues.pmm_radix_buf; for(auto p = gr2_3.begin(); p != gr2_3.end(); ++p) p->join(); if (Params.use_strict_mem) { 
Queues.sm_pmm_input_file->release(); Queues.sm_pmm_expand->release(); Queues.sm_pmm_sort->release(); Queues.sm_pmm_sorter_suffixes->release(); Queues.sm_pmm_sorter_lut->release(); delete Queues.sm_pmm_input_file; delete Queues.sm_pmm_expand; delete Queues.sm_pmm_sort; delete Queues.sm_pmm_sorter_suffixes; delete Queues.sm_pmm_sorter_lut; Queues.sm_pmm_merger_lut->release(); Queues.sm_pmm_merger_suff->release(); Queues.sm_pmm_sub_bin_lut->release(); Queues.sm_pmm_sub_bin_suff->release(); delete Queues.sm_pmm_merger_lut; delete Queues.sm_pmm_merger_suff; delete Queues.sm_pmm_sub_bin_lut; delete Queues.sm_pmm_sub_bin_suff; } // ***** End of Stage 2 ***** w_completer->GetTotal(n_unique, n_cutoff_min, n_cutoff_max, n_total); uint64 stat_n_plus_x_recs, stat_n_recs, stat_n_recs_tmp, stat_n_plus_x_recs_tmp; stat_n_plus_x_recs = stat_n_recs = stat_n_recs_tmp = stat_n_plus_x_recs_tmp = 0; thread *release_thr_st2_2 = new thread([&]{ delete w_reader; for(int i = 0; i < Params.n_sorters; ++i) { w_sorters[i]->GetDebugStats(stat_n_recs_tmp, stat_n_plus_x_recs_tmp); stat_n_plus_x_recs += stat_n_plus_x_recs_tmp; stat_n_recs += stat_n_recs_tmp; delete w_sorters[i]; } delete w_completer; delete Queues.input_files_queue; delete Queues.bq; delete Queues.part_queue; delete Queues.bpq; delete Queues.kq; delete Queues.tlbq; }); // ***** Getting disk usage statistics ***** tmp_size = 0; n_total_super_kmers = 0; Queues.bd->reset_reading(); while((bin_id = Queues.bd->get_next_bin()) >= 0) { Queues.bd->read(bin_id, file, name, size, n_rec, n_plus_x_recs, n_super_kmers); tmp_size += size; n_total_super_kmers += n_super_kmers; } delete Queues.bd; tmp_size_strict_mem = 0; if (!Params.use_strict_mem) { release_thr_st2_1->join(); } else { release_thr_sm->join(); Queues.bbd->reset_reading(); int32 sub_bin_id = 0; uint32 lut_prefix_len = 0; uint32 n_kmers = 0; uint64 file_size = 0; uint32 size = 0; FILE* file = NULL; while (Queues.bbd->next_bin(bin_id, size)) { while 
(Queues.bbd->next_sub_bin(bin_id, sub_bin_id, lut_prefix_len, n_kmers, file, name, file_size)) { tmp_size_strict_mem += file_size; } } delete Queues.bbd; delete release_thr_sm; } release_thr_st2_2->join(); delete release_thr_st2_1; delete release_thr_st2_2; delete Queues.s_mapper; max_disk_usage = Queues.disk_logger->get_max(); delete Queues.disk_logger; if(!Params.use_strict_mem) w2.stopTimer(); else w3.stopTimer(); return true; } //---------------------------------------------------------------------------------- // Return statistics template void CKMC::GetStats(double &time1, double &time2, double &time3, uint64 &_n_unique, uint64 &_n_cutoff_min, uint64 &_n_cutoff_max, uint64 &_n_total, uint64 &_n_reads, uint64 &_tmp_size, uint64 &_tmp_size_strict_mem, uint64 &_max_disk_usage, uint64& _n_total_super_kmers) { time1 = w1.getElapsedTime(); time2 = w2.getElapsedTime(); time3 = w3.getElapsedTime(); _n_unique = n_unique; _n_cutoff_min = n_cutoff_min; _n_cutoff_max = n_cutoff_max; _n_total = n_total; _n_reads = n_reads; _tmp_size = tmp_size; _tmp_size_strict_mem = tmp_size_strict_mem; _max_disk_usage = max_disk_usage; _n_total_super_kmers = n_total_super_kmers; } #endif // ***** EOF KMC-2.3/kmer_counter/kmer.cpp000066400000000000000000000006361257432033000161330ustar00rootroot00000000000000#include "stdafx.h" /* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "defs.h" #include "kmer.h" uint32 CKmer<1>::QUALITY_SIZE = 0; uint32 CKmerQuake<1>::QUALITY_SIZE = 4; // ***** EOF KMC-2.3/kmer_counter/kmer.h000066400000000000000000000655441257432033000156110ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _KMER_H #define _KMER_H // Important remark: there is no inheritance here to guarantee that all classes defined here are POD according to C++11 #include "meta_oper.h" #include // ************************************************************************* // Ckmer class for k > 32 with classic kmer counting template struct CKmer { unsigned long long data[SIZE]; typedef unsigned long long data_t; static uint32 QUALITY_SIZE; inline void set(const CKmer &x); inline void from_kxmer(const CKmer& x, uint32 _shr, const CKmer& _mask); template inline void to_kxmer(CKmer& x); inline void mask(const CKmer &x); inline uint32 end_mask(const uint32 mask); inline void set_2bits(const uint64 x, const uint32 p); inline uchar get_2bits(const uint32 p); inline uchar get_byte(const uint32 p); inline void set_byte(const uint32 p, uchar x); inline void set_bits(const uint32 p, const uint32 n, uint64 x); inline void SHL_insert_2bits(const uint64 x); inline void SHR_insert_2bits(const uint64 x, const uint32 p); inline void SHR(const uint32 p); inline void SHL(const uint32 p); inline uint64 remove_suffix(const uint32 n) const; inline void set_n_1(const uint32 n); inline void set_n_01(const uint32 n); inline void store(uchar *&buffer, int32 n); inline void store(uchar *buffer, int32 p, int32 n); inline void load(uchar *&buffer, int32 n); inline bool operator==(const CKmer &x); inline bool operator<(const CKmer &x); inline void clear(void); inline char get_symbol(int p); }; template uint32 CKmer::QUALITY_SIZE = 0; // ********************************************************************* template inline void CKmer::set(const CKmer &x) { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] = x.data[i]; }, uint_()); #else for(uint32 i = 0; i < SIZE; ++i) data[i] = x.data[i]; #endif } // 
********************************************************************* template template inline void CKmer::to_kxmer(CKmer& x) { x.data[X_SIZE - 1] = 0; #ifdef USE_META_PROG IterFwd([&](const int &i){ x.data[i] = data[i]; }, uint_()); #else for (uint32 i = 0; i < SIZE; ++i) x.data[i] = data[i]; #endif } // ********************************************************************* template inline void CKmer::from_kxmer(const CKmer& x, uint32 _shr, const CKmer& _mask) { if (_shr) { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] = x.data[i] >> (2 * _shr); data[i] += x.data[i + 1] << (64 - 2 * _shr); }, uint_()); #else for (uint32 i = 0; i < SIZE - 1; ++i) { data[i] = x.data[i] >> (2 * _shr); data[i] += x.data[i+1]<<(64-2*_shr); } #endif data[SIZE - 1] = x.data[SIZE - 1] >> (2 * _shr); } else { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] = x.data[i]; }, uint_()); #else for (uint32 i = 0; i < SIZE; ++i) data[i] = x.data[i]; #endif } mask(_mask); } // ********************************************************************* template inline void CKmer::mask(const CKmer &x) { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] &= x.data[i]; }, uint_()); #else for(uint32 i = 0; i < SIZE; ++i) data[i] &= x.data[i]; #endif } // ********************************************************************* template inline uint32 CKmer::end_mask(const uint32 mask) { return data[0] & mask; } // ********************************************************************* template inline void CKmer::set_2bits(const uint64 x, const uint32 p) { // data[p >> 6] |= x << (p & 63); data[p >> 6] += x << (p & 63); } template inline uchar CKmer::get_2bits(const uint32 p) { return (data[p >> 6] >> (p & 63)) & 3; } // ********************************************************************* template inline void CKmer::SHR_insert_2bits(const uint64 x, const uint32 p) { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] >>= 2; // data[i] |= data[i+1] << (64-2); data[i] += data[i+1] << (64-2); 
}, uint_()); #else for(uint32 i = 0; i < SIZE-1; ++i) { data[i] >>= 2; // data[i] |= data[i+1] << (64-2); data[i] += data[i+1] << (64-2); } #endif data[SIZE-1] >>= 2; // data[p >> 6] |= x << (p & 63); data[p >> 6] += x << (p & 63); } // ********************************************************************* template inline void CKmer::SHR(const uint32 p) { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] >>= 2*p; // data[i] |= data[i+1] << (64-2*p); data[i] += data[i+1] << (64-2*p); }, uint_()); #else for(uint32 i = 0; i < SIZE-1; ++i) { data[i] >>= 2*p; // data[i] |= data[i+1] << (64-2*p); data[i] += data[i+1] << (64-2*p); } #endif data[SIZE-1] >>= 2*p; } // ********************************************************************* template inline void CKmer::SHL(const uint32 p) { #ifdef USE_META_PROG IterRev([&](const int &i){ data[i+1] <<= p*2; // data[i+1] |= data[i] >> (64-p*2); data[i+1] += data[i] >> (64-p*2); }, uint_()); #else for(uint32 i = SIZE-1; i > 0; --i) { data[i] <<= p*2; // data[i] |= data[i-1] >> (64-p*2); data[i] += data[i-1] >> (64-p*2); } #endif data[0] <<= p*2; } // ********************************************************************* template inline void CKmer::SHL_insert_2bits(const uint64 x) { #ifdef USE_META_PROG IterRev([&](const int &i){ data[i+1] <<= 2; // data[i+1] |= data[i] >> (64-2); data[i+1] += data[i] >> (64-2); }, uint_()); #else for(uint32 i = SIZE-1; i > 0; --i) { data[i] <<= 2; // data[i] |= data[i-1] >> (64-2); data[i] += data[i-1] >> (64-2); } #endif data[0] <<= 2; // data[0] |= x; data[0] += x; } // ********************************************************************* template inline uchar CKmer::get_byte(const uint32 p) { return (data[p >> 3] >> ((p << 3) & 63)) & 0xFF; } // ********************************************************************* template inline void CKmer::set_byte(const uint32 p, uchar x) { // data[p >> 3] |= ((uint64) x) << ((p & 7) << 3); data[p >> 3] += ((uint64) x) << ((p & 7) << 3); } // 
// *********************************************************************
template<unsigned SIZE> inline void CKmer<SIZE>::set_bits(const uint32 p, const uint32 n, uint64 x)
{
//	data[p >> 6] |= x << (p & 63);
	data[p >> 6] += x << (p & 63);
	if((p >> 6) != ((p+n-1) >> 6))
//		data[(p >> 6) + 1] |= x >> (64 - (p & 63));
		data[(p >> 6) + 1] += x >> (64 - (p & 63));
}

// *********************************************************************
template<unsigned SIZE> inline bool CKmer<SIZE>::operator==(const CKmer<SIZE> &x)
{
	for(uint32 i = 0; i < SIZE; ++i)
		if(data[i] != x.data[i])
			return false;

	return true;
}

// *********************************************************************
template<unsigned SIZE> inline bool CKmer<SIZE>::operator<(const CKmer<SIZE> &x)
{
	for(int32 i = SIZE-1; i >= 0; --i)
		if(data[i] < x.data[i])
			return true;
		else if(data[i] > x.data[i])
			return false;
	return false;
}

// *********************************************************************
template<unsigned SIZE> inline void CKmer<SIZE>::clear(void)
{
#ifdef USE_META_PROG
	IterFwd([&](const int &i){
		data[i] = 0;
	}, uint_<SIZE-1>());
#else
	for(uint32 i = 0; i < SIZE; ++i)
		data[i] = 0;
#endif
}

// *********************************************************************
template<unsigned SIZE> inline uint64 CKmer<SIZE>::remove_suffix(const uint32 n) const
{
	uint32 p = n >> 6; // / 64;
	uint32 r = n & 63; // % 64;

	if(p == SIZE-1)
		return data[p] >> r;
	else
//		return (data[p+1] << (64-r)) | (data[p] >> r);
		return (data[p+1] << (64-r)) + (data[p] >> r);
}

// *********************************************************************
template<unsigned SIZE> inline void CKmer<SIZE>::set_n_1(const uint32 n)
{
	clear();

	for(uint32 i = 0; i < (n >> 6); ++i)
		data[i] = ~((uint64) 0);

	uint32 r = n & 63;

	if(r)
		data[n >> 6] = (1ull << r) - 1;
}

// *********************************************************************
template<unsigned SIZE> inline void CKmer<SIZE>::set_n_01(const uint32 n)
{
	clear();

	for(uint32 i = 0; i < n; ++i)
		if(!(i & 1))
//			data[i >> 6] |= (1ull << (i & 63));
			data[i >> 6] += (1ull << (i & 63));
}

// *********************************************************************
template inline void CKmer::store(uchar *&buffer, int32 n) { for(int32 i = n-1; i >= 0; --i) *buffer++ = get_byte(i); } // ********************************************************************* template inline void CKmer::store(uchar *buffer, int32 p, int32 n) { for(int32 i = n-1; i >= 0; --i) buffer[p++] = get_byte(i); } // ********************************************************************* template inline void CKmer::load(uchar *&buffer, int32 n) { clear(); for(int32 i = n-1; i >= 0; --i) set_byte(i, *buffer++); } // ********************************************************************* template inline char CKmer::get_symbol(int p) { uint32 x = (data[p >> 5] >> (2*(p & 31))) & 0x03; switch(x) { case 0 : return 'A'; case 1 : return 'C'; case 2 : return 'G'; default: return 'T'; } } // ********************************************************************* // ********************************************************************* // ********************************************************************* // ********************************************************************* // Ckmer class for k <= 32 with classic kmer counting template<> struct CKmer<1> { unsigned long long data; typedef unsigned long long data_t; static uint32 QUALITY_SIZE; void set(const CKmer<1> &x); void from_kxmer(const CKmer<1>& x, uint32 _shr, const CKmer<1>& _mask); template void to_kxmer(CKmer& x); void mask(const CKmer<1> &x); uint32 end_mask(const uint32 mask); void set_2bits(const uint64 x, const uint32 p); uchar get_2bits(const uint32 p); uchar get_byte(const uint32 p); void set_byte(const uint32 p, uchar x); void set_bits(const uint32 p, const uint32 n, uint64 x); void SHL_insert_2bits(const uint64 x); void SHR_insert_2bits(const uint64 x, const uint32 p); void SHR(const uint32 p); void SHL(const uint32 p); uint64 remove_suffix(const uint32 n) const; void set_n_1(const uint32 n); void set_n_01(const uint32 n); void store(uchar *&buffer, int32 n); void store(uchar *buffer, int32 p, int32 
n); void load(uchar *&buffer, int32 n); bool operator==(const CKmer<1> &x); bool operator<(const CKmer<1> &x); void clear(void); inline char get_symbol(int p); }; // ********************************************************************* template inline void CKmer<1>::to_kxmer(CKmer&x) { x.data[X_SIZE - 1] = 0; x.data[0] = data; } // ********************************************************************* template<> inline void CKmer<1>::to_kxmer(CKmer<1>& x) { x.data = data; } // ********************************************************************* inline void CKmer<1>::mask(const CKmer<1> &x) { data &= x.data; } // ********************************************************************* inline uint32 CKmer<1>::end_mask(const uint32 mask) { return data & mask; } // ********************************************************************* inline void CKmer<1>::set(const CKmer<1> &x) { data = x.data; } // ********************************************************************* inline void CKmer<1>::from_kxmer(const CKmer<1>& x, uint32 _shr, const CKmer<1>& _mask) { data = (x.data >> (2 * _shr)) & _mask.data; } // ********************************************************************* inline void CKmer<1>::set_2bits(const uint64 x, const uint32 p) { // data |= x << p; data += x << p; } inline uchar CKmer<1>::get_2bits(const uint32 p) { return (data >> p) & 3; } // ********************************************************************* inline void CKmer<1>::SHR_insert_2bits(const uint64 x, const uint32 p) { data >>= 2; // data |= x << p; data += x << p; } // ********************************************************************* inline void CKmer<1>::SHR(const uint32 p) { data >>= 2*p; } // ********************************************************************* inline void CKmer<1>::SHL(const uint32 p) { data <<= p*2; } // ********************************************************************* inline void CKmer<1>::SHL_insert_2bits(const uint64 x) { // data = (data << 2) | x; data = (data << 
2) + x; } // ********************************************************************* inline uchar CKmer<1>::get_byte(const uint32 p) { return (data >> (p << 3)) & 0xFF; } // ********************************************************************* inline void CKmer<1>::set_byte(const uint32 p, uchar x) { // data |= ((uint64) x) << (p << 3); data += ((uint64) x) << (p << 3); } // ********************************************************************* inline void CKmer<1>::set_bits(const uint32 p, const uint32 n, uint64 x) { // data |= x << p; data += x << p; } // ********************************************************************* inline bool CKmer<1>::operator==(const CKmer<1> &x) { return data == x.data; } // ********************************************************************* inline bool CKmer<1>::operator<(const CKmer<1> &x) { return data < x.data; } // ********************************************************************* inline void CKmer<1>::clear(void) { data = 0ull; } // ********************************************************************* inline uint64 CKmer<1>::remove_suffix(const uint32 n) const { return data >> n; } // ********************************************************************* inline void CKmer<1>::set_n_1(const uint32 n) { if(n == 64) data = ~(0ull); else data = (1ull << n) - 1; } // ********************************************************************* inline void CKmer<1>::set_n_01(const uint32 n) { data = 0ull; for(uint32 i = 0; i < n; ++i) if(!(i & 1)) // data |= (1ull << i); data += (1ull << i); } // ********************************************************************* inline void CKmer<1>::store(uchar *&buffer, int32 n) { for(int32 i = n-1; i >= 0; --i) *buffer++ = get_byte(i); } // ********************************************************************* inline void CKmer<1>::store(uchar *buffer, int32 p, int32 n) { for(int32 i = n-1; i >= 0; --i) buffer[p++] = get_byte(i); } // 
********************************************************************* inline void CKmer<1>::load(uchar *&buffer, int32 n) { clear(); for(int32 i = n-1; i >= 0; --i) set_byte(i, *buffer++); } // ********************************************************************* char CKmer<1>::get_symbol(int p) { uint32 x = (data >> (2*p)) & 0x03; switch(x) { case 0 : return 'A'; case 1 : return 'C'; case 2 : return 'G'; default: return 'T'; } } // ********************************************************************* // ********************************************************************* // ********************************************************************* template struct CKmerQuake { unsigned long long data[SIZE]; float quality; typedef unsigned long long data_t; static uint32 QUALITY_SIZE; inline void set(const CKmerQuake &x); inline void mask(const CKmerQuake &x); inline void set_2bits(const uint64 x, const uint32 p); inline uchar get_byte(const uint32 p); inline void set_byte(const uint32 p, uchar x); inline void set_bits(const uint32 p, const uint32 n, uint64 x); inline void SHL_insert_2bits(const uint64 x); inline void SHR_insert_2bits(const uint64 x, const uint32 p); inline uint64 remove_suffix(const uint32 n); inline void set_n_1(const uint32 n); inline void set_n_01(const uint32 n); inline void store(uchar *&buffer, int32 n); inline void store(uchar *buffer, int32 p, int32 n); inline void load(uchar *&buffer, int32 n); inline bool operator==(const CKmerQuake &x); inline bool operator<(const CKmerQuake &x); inline void clear(void); inline char get_symbol(int p); }; template uint32 CKmerQuake::QUALITY_SIZE = sizeof(float); // ********************************************************************* template void CKmerQuake::set(const CKmerQuake &x) { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] = x.data[i]; }, uint_()); #else for(uint32 i = 0; i < SIZE; ++i) data[i] = x.data[i]; #endif quality = x.quality; } // 
********************************************************************* template void CKmerQuake::mask(const CKmerQuake &x) { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] &= x.data[i]; }, uint_()); #else for(uint32 i = 0; i < SIZE; ++i) data[i] &= x.data[i]; #endif } // ********************************************************************* template void CKmerQuake::set_2bits(const uint64 x, const uint32 p) { // data[p >> 6] |= x << (p & 63); data[p >> 6] += x << (p & 63); } // ********************************************************************* template void CKmerQuake::SHR_insert_2bits(const uint64 x, const uint32 p) { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] >>= 2; // data[i] |= data[i+1] << (64-2); data[i] += data[i+1] << (64-2); }, uint_()); #else for(uint32 i = 0; i < SIZE-1; ++i) { data[i] >>= 2; // data[i] |= data[i+1] << (64-2); data[i] += data[i+1] << (64-2); } #endif data[SIZE-1] >>= 2; // data[p >> 6] |= x << (p & 63); data[p >> 6] += x << (p & 63); } // ********************************************************************* template void CKmerQuake::SHL_insert_2bits(const uint64 x) { #ifdef USE_META_PROG IterRev([&](const int &i){ data[i+1] <<= 2; // data[i+1] |= data[i] >> (64-2); data[i+1] += data[i] >> (64-2); }, uint_()); #else for(uint32 i = SIZE-1; i > 0; --i) { data[i] <<= 2; // data[i] |= data[i-1] >> (64-2); data[i] += data[i-1] >> (64-2); } #endif data[0] <<= 2; // data[0] |= x; data[0] += x; } // ********************************************************************* template uchar CKmerQuake::get_byte(const uint32 p) { return (data[p >> 3] >> ((p << 3) & 63)) & 0xFF; } // ********************************************************************* template void CKmerQuake::set_byte(const uint32 p, uchar x) { // data[p >> 3] |= ((uint64) x) << ((p & 7) << 3); data[p >> 3] += ((uint64) x) << ((p & 7) << 3); } // ********************************************************************* template void CKmerQuake::set_bits(const uint32 p, 
const uint32 n, uint64 x) { // data[p >> 6] |= x << (p & 63); data[p >> 6] += x << (p & 63); if((p >> 6) != ((p+n-1) >> 6)) // data[(p >> 6) + 1] |= x >> (64 - (p & 63)); data[(p >> 6) + 1] += x >> (64 - (p & 63)); } // ********************************************************************* template bool CKmerQuake::operator==(const CKmerQuake &x) { for(uint32 i = 0; i < SIZE; ++i) if(data[i] != x.data[i]) return false; return true; } // ********************************************************************* template bool CKmerQuake::operator<(const CKmerQuake &x) { for(int32 i = SIZE-1; i >= 0; --i) if(data[i] < x.data[i]) return true; else if(data[i] > x.data[i]) return false; return false; } // ********************************************************************* template void CKmerQuake::clear(void) { #ifdef USE_META_PROG IterFwd([&](const int &i){ data[i] = 0; }, uint_()); #else for(uint32 i = 0; i < SIZE; ++i) data[i] = 0; #endif quality = 0.0; } // ********************************************************************* template uint64 CKmerQuake::remove_suffix(const uint32 n) { uint32 p = n >> 6; // / 64; uint32 r = n & 63; // % 64; if(p == SIZE-1) return data[p] >> r; else // return (data[p+1] << (64-r)) | (data[p] >> r); return (data[p+1] << (64-r)) + (data[p] >> r); } // ********************************************************************* template void CKmerQuake::set_n_1(const uint32 n) { clear(); for(uint32 i = 0; i < (n >> 6); ++i) data[i] = ~((uint64) 0); uint32 r = n & 63; if(r) data[n >> 6] = (1ull << r) - 1; quality = 0.0; } // ********************************************************************* template void CKmerQuake::set_n_01(const uint32 n) { clear(); for(uint32 i = 0; i < n; ++i) if(!(i & 1)) // data[i >> 6] |= (1ull << (i & 63)); data[i >> 6] += (1ull << (i & 63)); quality = 0.0; } // ********************************************************************* template void CKmerQuake::store(uchar *&buffer, int32 n) { for(int32 i = n-1; i >= 0; --i) 
*buffer++ = get_byte(i); memcpy(buffer, &quality, sizeof(quality)); buffer += sizeof(quality); } // ********************************************************************* template void CKmerQuake::store(uchar *buffer, int32 p, int32 n) { for(int32 i = n-1; i >= 0; --i) buffer[p++] = get_byte(i); memcpy(buffer+p, &quality, sizeof(quality)); } // ********************************************************************* template void CKmerQuake::load(uchar *&buffer, int32 n) { clear(); for(int32 i = n-1; i >= 0; --i) set_byte(i, *buffer++); memcpy(&quality, buffer, sizeof(quality)); buffer += sizeof(quality); } // ********************************************************************* template char CKmerQuake::get_symbol(int p) { uint32 x = (data[p >> 5] >> (2*(p & 31))) & 0x03; switch(x) { case 0 : return 'A'; case 1 : return 'C'; case 2 : return 'G'; default: return 'T'; } } // ********************************************************************* // ********************************************************************* // ********************************************************************* template<> struct CKmerQuake<1> { unsigned long long data; float quality; typedef unsigned long long data_t; static uint32 QUALITY_SIZE; void set(const CKmerQuake<1> &x); void mask(const CKmerQuake<1> &x); void set_2bits(const uint64 x, const uint32 p); uchar get_byte(const uint32 p); void set_byte(const uint32 p, uchar x); void set_bits(const uint32 p, const uint32 n, uint64 x); void SHL_insert_2bits(const uint64 x); void SHR_insert_2bits(const uint64 x, const uint32 p); uint64 remove_suffix(const uint32 n); void set_n_1(const uint32 n); void set_n_01(const uint32 n); void store(uchar *&buffer, int32 n); void store(uchar *buffer, int32 p, int32 n); void load(uchar *&buffer, int32 n); bool operator==(const CKmerQuake<1> &x); bool operator<(const CKmerQuake<1> &x); void clear(void); inline char get_symbol(int p); }; // ********************************************************************* 
inline void CKmerQuake<1>::set(const CKmerQuake<1> &x) { data = x.data; quality = x.quality; } // ********************************************************************* inline void CKmerQuake<1>::mask(const CKmerQuake<1> &x) { data &= x.data; } // ********************************************************************* inline void CKmerQuake<1>::set_2bits(const uint64 x, const uint32 p) { // data |= x << p; data += x << p; } // ********************************************************************* inline void CKmerQuake<1>::SHR_insert_2bits(const uint64 x, const uint32 p) { data >>= 2; // data |= x << p; data += x << p; } // ********************************************************************* inline void CKmerQuake<1>::SHL_insert_2bits(const uint64 x) { // data = (data << 2) | x; data = (data << 2) + x; } // ********************************************************************* inline uchar CKmerQuake<1>::get_byte(const uint32 p) { return (data >> (p << 3)) & 0xFF; } // ********************************************************************* inline void CKmerQuake<1>::set_byte(const uint32 p, uchar x) { // data |= ((uint64) x) << (p << 3); data += ((uint64) x) << (p << 3); } // ********************************************************************* inline void CKmerQuake<1>::set_bits(const uint32 p, const uint32 n, uint64 x) { // data |= x << p; data += x << p; } // ********************************************************************* inline bool CKmerQuake<1>::operator==(const CKmerQuake<1> &x) { return data == x.data; } // ********************************************************************* inline bool CKmerQuake<1>::operator<(const CKmerQuake<1> &x) { return data < x.data; } // ********************************************************************* inline void CKmerQuake<1>::clear(void) { data = 0; quality = 0.0; } // ********************************************************************* inline uint64 CKmerQuake<1>::remove_suffix(const uint32 n) { return data >> n; } // 
********************************************************************* inline void CKmerQuake<1>::set_n_1(const uint32 n) { if(n == 64) data = ~(0ull); else data = (1ull << n) - 1; quality = 0.0; } // ********************************************************************* inline void CKmerQuake<1>::set_n_01(const uint32 n) { data = 0ull; for(uint32 i = 0; i < n; ++i) if(!(i & 1)) // data |= (1ull << i); data += (1ull << i); quality = 0.0; } // ********************************************************************* inline void CKmerQuake<1>::store(uchar *&buffer, int32 n) { for(int32 i = n-1; i >= 0; --i) *buffer++ = get_byte(i); memcpy(buffer, &quality, sizeof(quality)); buffer += sizeof(quality); } // ********************************************************************* inline void CKmerQuake<1>::store(uchar *buffer, int32 p, int32 n) { for(int32 i = n-1; i >= 0; --i) buffer[p++] = get_byte(i); memcpy(buffer+p, &quality, sizeof(quality)); } // ********************************************************************* inline void CKmerQuake<1>::load(uchar *&buffer, int32 n) { clear(); for(int32 i = n-1; i >= 0; --i) set_byte(i, *buffer++); memcpy(&quality, buffer, sizeof(quality)); buffer += sizeof(quality); } // ********************************************************************* char CKmerQuake<1>::get_symbol(int p) { uint32 x = (data >> (2*p)) & 0x03; switch(x) { case 0 : return 'A'; case 1 : return 'C'; case 2 : return 'G'; default: return 'T'; } } #endif // ***** EOF KMC-2.3/kmer_counter/kmer_counter.cpp000066400000000000000000000342741257432033000176770ustar00rootroot00000000000000#include "stdafx.h" /* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot

Version: 2.3.0
Date   : 2015-08-21
*/

#include
#include
#include
#include
#include
#include "timer.h"
#include "kmc.h"
#include "meta_oper.h"

using namespace std;

uint64 total_reads, total_fastq_size;

void usage();
bool parse_parameters(int argc, char *argv[]);

CKMCParams Params;

//----------------------------------------------------------------------------------
// Application class
// Template parameters:
// * KMER_TPL - k-mer class
// * SIZE - maximal size of the k-mer (divided by 32)
// * QUAKE_MODE - whether Quake-compatible counting is used
// The primary template handles k in ((SIZE-1)*32, SIZE*32] and delegates
// smaller k to CApplication<KMER_TPL, SIZE - 1, QUAKE_MODE>.
template<template<unsigned X> class KMER_TPL, unsigned SIZE, bool QUAKE_MODE> class CApplication
{
    CApplication<KMER_TPL, SIZE - 1, QUAKE_MODE> *app_1;
    CKMC<KMER_TPL<SIZE>, SIZE, QUAKE_MODE> *kmc;
    int p_k;
    bool is_selected;

public:
    CApplication(CKMCParams &Params)
    {
        p_k = Params.p_k;
        is_selected = p_k <= (int32) SIZE * 32 && p_k > ((int32) SIZE-1)*32;

        app_1 = new CApplication<KMER_TPL, SIZE - 1, QUAKE_MODE>(Params);
        if(is_selected)
        {
            kmc = new CKMC<KMER_TPL<SIZE>, SIZE, QUAKE_MODE>;
            kmc->SetParams(Params);
        }
        else
        {
            kmc = NULL;
        }
    };

    ~CApplication()
    {
        delete app_1;
        if (kmc)
            delete kmc;
    }

    void GetStats(double &time1, double &time2, double &time3, uint64 &_n_unique, uint64 &_n_cutoff_min, uint64 &_n_cutoff_max, uint64 &_n_total, uint64 &_n_reads, uint64 &_tmp_size, uint64 &_tmp_size_strict_mem, uint64 &_max_disk_usage, uint64& _n_total_super_kmers)
    {
        if (is_selected)
        {
            kmc->GetStats(time1, time2, time3, _n_unique, _n_cutoff_min, _n_cutoff_max, _n_total, _n_reads, _tmp_size, _tmp_size_strict_mem, _max_disk_usage, _n_total_super_kmers);
        }
        else
            app_1->GetStats(time1, time2, time3, _n_unique, _n_cutoff_min, _n_cutoff_max, _n_total, _n_reads, _tmp_size, _tmp_size_strict_mem, _max_disk_usage, _n_total_super_kmers);
    }

    bool Process()
    {
        if (is_selected)
        {
            return kmc->Process();
        }
        else
            return app_1->Process();
    }
};

//----------------------------------------------------------------------------------
// Specialization of the application class for the SIZE=1
// (terminates the recursion; handles k <= 32)
template<template<unsigned X> class KMER_TPL, bool QUAKE_MODE> class CApplication<KMER_TPL, 1, QUAKE_MODE>
{
    CKMC<KMER_TPL<1>, 1, QUAKE_MODE> *kmc;
    int p_k;
    bool is_selected;

public:
    CApplication(CKMCParams &Params)
    {
        is_selected = Params.p_k <= 32;
        if(is_selected)
        {
            kmc = new CKMC<KMER_TPL<1>, 1, QUAKE_MODE>;
            kmc->SetParams(Params);
        }
        else
        {
            kmc = NULL;
        }
    };

    ~CApplication()
    {
        if(kmc)
            delete kmc;
    };

    void GetStats(double &time1, double &time2, double &time3, uint64 &_n_unique, uint64 &_n_cutoff_min, uint64 &_n_cutoff_max, uint64 &_n_total, uint64 &_n_reads, uint64 &_tmp_size, uint64 &_tmp_size_strict_mem, uint64 &_max_disk_usage, uint64& _n_total_super_kmers)
    {
        if (is_selected)
        {
            if(kmc)
                kmc->GetStats(time1, time2, time3, _n_unique, _n_cutoff_min, _n_cutoff_max, _n_total, _n_reads, _tmp_size, _tmp_size_strict_mem, _max_disk_usage, _n_total_super_kmers);
        }
    }

    bool Process()
    {
        if (is_selected)
        {
            return kmc->Process();
        }
        return false;
    }
};

//----------------------------------------------------------------------------------
// Show execution options of the software
void usage()
{
    cout << "K-Mer Counter (KMC) ver. 
" << KMC_VER << " (" << KMC_DATE << ")\n";
    cout << "Usage:\n kmc [options] <input_file_name> <output_file_name> <working_directory>\n";
    cout << " kmc [options] <@input_file_names> <output_file_name> <working_directory>\n";
    cout << "Parameters:\n";
    cout << " input_file_name - single file in FASTQ format (gzipped or not)\n";
    cout << " @input_file_names - file name with list of input files in FASTQ format (gzipped or not)\n";
    cout << "Options:\n";
    cout << " -v - verbose mode (shows all parameter settings); default: false\n";
    cout << " -k<len> - k-mer length (k from " << MIN_K << " to " << MAX_K << "; default: 25)\n";
    cout << " -m<size> - max amount of RAM in GB (from 1 to 1024); default: 12\n";
    cout << " -sm - use strict memory mode (memory limit from -m switch will not be exceeded)\n";
    cout << " -p<par> - signature length (5, 6, 7, 8); default: 7\n";
    cout << " -f<a/q/m> - input in FASTA format (-fa), FASTQ format (-fq) or multi-line FASTA (-fm); default: FASTQ\n";
    cout << " -q[value] - use Quake's compatible counting with [value] representing lowest quality (default: 33)\n";
    cout << " -ci<value> - exclude k-mers occurring less than <value> times (default: 2)\n";
    cout << " -cs<value> - maximal value of a counter (default: 255)\n";
    cout << " -cx<value> - exclude k-mers occurring more than <value> times (default: 1e9)\n";
    cout << " -b - turn off transformation of k-mers into canonical form\n";
    cout << " -r - turn on RAM-only mode \n";
    cout << " -n<value> - number of bins \n";
    cout << " -t<value> - total number of threads (default: no. 
of CPU cores)\n";
    cout << " -sf<value> - number of FASTQ reading threads\n";
    cout << " -sp<value> - number of splitting threads\n";
    cout << " -sr<value> - number of sorter threads\n";
    cout << " -so<value> - number of threads per single sorter\n";
    cout << "Example:\n";
    cout << "kmc -k27 -m24 NA19238.fastq NA.res \\data\\kmc_tmp_dir\\\n";
    cout << "kmc -k27 -q -m24 @files.lst NA.res \\data\\kmc_tmp_dir\\\n";
}

//----------------------------------------------------------------------------------
// Parse the parameters
// Options come first (each starts with '-'); they are followed by exactly three
// positional arguments: input file (or @list), output file, working directory.
bool parse_parameters(int argc, char *argv[])
{
    int i;
    int tmp;

    if(argc < 4)
        return false;

    for(i = 1 ; i < argc; ++i)
    {
        if(argv[i][0] != '-')
            break;
        // Number of threads
        if(strncmp(argv[i], "-t", 2) == 0)
            Params.p_t = atoi(&argv[i][2]);
//      else
        // k-mer length
        if(strncmp(argv[i], "-k", 2) == 0)
        {
            tmp = atoi(&argv[i][2]);
            if(tmp < MIN_K || tmp > MAX_K)
            {
                cout << "Wrong parameter: k must be from range <" << MIN_K << "," << MAX_K << ">\n";
                return false;
            }
            else
                Params.p_k = tmp;
        }
        // Memory limit
        else if(strncmp(argv[i], "-m", 2) == 0)
        {
            tmp = atoi(&argv[i][2]);
            if(tmp < MIN_MEM)
            {
                cout << "Wrong parameter: memory limit must be at least " << MIN_MEM << "GB\n";
                return false;
            }
            else
                Params.p_m = tmp;
        }
        // Minimum counter threshold
        else if(strncmp(argv[i], "-ci", 3) == 0)
            Params.p_ci = atoi(&argv[i][3]);
        // Maximum counter threshold
        else if(strncmp(argv[i], "-cx", 3) == 0)
            Params.p_cx = atoll(&argv[i][3]);
        // Maximal counter value
        else if(strncmp(argv[i], "-cs", 3) == 0)
            Params.p_cs = atoll(&argv[i][3]);
        // Quake mode
        else if(strncmp(argv[i], "-q", 2) == 0)
        {
            Params.p_quake = true;
            if(strlen(argv[i]) > 2)
                Params.p_quality = atoi(argv[i]+2);
        }
        // Set p1 (signature length)
        else if (strncmp(argv[i], "-p", 2) == 0)
        {
            tmp = atoi(&argv[i][2]);
            if (tmp < MIN_SL || tmp > MAX_SL)
            {
                cout << "Wrong parameter: p must be from range <" << MIN_SL << "," << MAX_SL << ">\n";
                return false;
            }
            else
                Params.p_p1 = tmp;
        }
        // FASTA input files
        else if(strncmp(argv[i], "-fa", 3) == 0)
            Params.p_file_type = fasta;
        // FASTQ input files
        else if(strncmp(argv[i], "-fq", 3) == 0)
            Params.p_file_type = fastq;
        // Multi-line FASTA input files
        else if(strncmp(argv[i], "-fm", 3) == 0)
            Params.p_file_type = multiline_fasta;
        // Verbose mode
        else if (strncmp(argv[i], "-v", 2) == 0)
            Params.p_verbose = true;
        // Strict memory mode (exact match only, so -smso/-smun/-smme are not caught here)
        else if (strncmp(argv[i], "-sm", 3) == 0 && strlen(argv[i]) == 3)
            Params.p_strict_mem = true;
        // RAM-only mode
        else if (strncmp(argv[i], "-r", 2) == 0)
            Params.p_mem_mode = true;
        // Do not transform k-mers into canonical form
        else if(strncmp(argv[i], "-b", 2) == 0)
            Params.p_both_strands = false;
        // Number of reading threads
        else if(strncmp(argv[i], "-sf", 3) == 0)
        {
            tmp = atoi(&argv[i][3]);
            if(tmp < MIN_SF || tmp > MAX_SF)
            {
                cout << "Wrong parameter: number of reading threads must be in range <" << MIN_SF << "," << MAX_SF << ">\n";
                return false;
            }
            else
                Params.p_sf = tmp;
        }
        // Number of splitting threads
        else if(strncmp(argv[i], "-sp", 3) == 0)
        {
            tmp = atoi(&argv[i][3]);
            if(tmp < MIN_SP || tmp > MAX_SP)
            {
                cout << "Wrong parameter: number of splitting threads must be in range <" << MIN_SP << "," << MAX_SP << ">\n";
                return false;
            }
            else
                Params.p_sp = tmp;
        }
        // Number of threads per single sorter (see usage(): -so)
        else if(strncmp(argv[i], "-so", 3) == 0)
        {
            tmp = atoi(&argv[i][3]);
            if(tmp < MIN_SO || tmp > MAX_SO)
            {
                cout << "Wrong parameter: number of threads per single sorter must be in range <" << MIN_SO << "," << MAX_SO << ">\n";
                return false;
            }
            else
                Params.p_so = tmp;
        }
        // Number of sorter threads (see usage(): -sr)
        else if(strncmp(argv[i], "-sr", 3) == 0)
        {
            tmp = atoi(&argv[i][3]);
            if(tmp < MIN_SR || tmp > MAX_SR)
            {
                cout << "Wrong parameter: number of sorter threads must be in range <" << MIN_SR << "," << MAX_SR << ">\n";
                return false;
            }
            else
                Params.p_sr = tmp;
        }
        // Number of bins
        else if (strncmp(argv[i], "-n", 2) == 0)
        {
            tmp = atoi(&argv[i][2]);
            if (tmp < MIN_N_BINS || tmp > MAX_N_BINS)
            {
                cout << "Wrong parameter: number of bins must be in range <" << MIN_N_BINS << "," << MAX_N_BINS << ">\n";
                return false;
            }
            else
                Params.p_n_bins = tmp;
        }
        // Strict memory mode: number of sorting threads per sorter
        if (strncmp(argv[i], "-smso", 5) == 0)
        {
            tmp = atoi(&argv[i][5]);
            if (tmp < MIN_SMSO || tmp > MAX_SMSO)
            {
                cout << 
"Wrong parameter: number of sorting threads per sorter in strict memory mode must be in range <" << MIN_SMSO << "," << MAX_SMSO << ">\n";
                return false;
            }
            else
                Params.p_smso = tmp;
        }
        // Strict memory mode: number of uncompactor threads
        if (strncmp(argv[i], "-smun", 5) == 0)
        {
            tmp = atoi(&argv[i][5]);
            if (tmp < MIN_SMUN || tmp > MAX_SMUN)
            {
                cout << "Wrong parameter: number of uncompactor threads in strict memory mode must be in range <" << MIN_SMUN << "," << MAX_SMUN << ">\n";
                return false;
            }
            else
                Params.p_smun = tmp;
        }
        // Strict memory mode: number of merger threads
        if (strncmp(argv[i], "-smme", 5) == 0)
        {
            tmp = atoi(&argv[i][5]);
            if (tmp < MIN_SMME || tmp > MAX_SMME)
            {
                cout << "Wrong parameter: number of merger threads in strict memory mode must be in range <" << MIN_SMME << "," << MAX_SMME << ">\n";
                return false;
            }
            else
                Params.p_smme = tmp;
        }
    }

    // Three positional arguments must remain after the options
    if(argc - i < 3)
        return false;

    string input_file_name = string(argv[i++]);
    Params.output_file_name = string(argv[i++]);
    Params.working_directory = string(argv[i++]);

    Params.input_file_names.clear();
    if(input_file_name[0] != '@')
        Params.input_file_names.push_back(input_file_name);
    else
    {
        // '@' prefix: the argument names a text file listing one input file per line
        ifstream in(input_file_name.c_str()+1);
        if(!in.good())
        {
            cout << "Error: No " << input_file_name.c_str()+1 << " file\n";
            return false;
        }

        string s;
        while(getline(in, s))
            if(s != "")
                Params.input_file_names.push_back(s);

        in.close();
        random_shuffle(Params.input_file_names.begin(), Params.input_file_names.end());
    }

    // Validate and resolve conflicts in parameters
    if (Params.p_strict_mem && Params.p_mem_mode)
    {
        cout << "Error: -sm cannot be used with -r\n";
        return false;
    }

    if (Params.p_strict_mem && Params.p_quake)
    {
        cout << "Warning: -sm is not supported in quake mode. 
-sm has no effect\n"; Params.p_strict_mem = false; } if (Params.p_k > 9) { if ((uint64)Params.p_cx > ((1ull << 32) - 1)) { cout << "Warning: for k > 9 maximum value of -cx is 4294967295\n"; Params.p_cx = 4294967295; } if ((uint64)Params.p_cs > ((1ull << 32) - 1)) { cout << "Warning: for k > 9 maximum value of -cs is 4294967295\n"; Params.p_cs = 4294967295; } } return true; } //---------------------------------------------------------------------------------- // Main function int _tmain(int argc, _TCHAR* argv[]) { CStopWatch w0, w1; double time1, time2, time3; uint64 n_unique, n_cutoff_min, n_cutoff_max, n_total, n_reads, tmp_size, tmp_size_strict_mem, max_disk_usage, n_total_super_kmers; omp_set_num_threads(1); #ifdef WIN32 _setmaxstdio(2040); #endif if(!parse_parameters(argc, argv)) { usage(); return 0; } if(Params.p_quake) { CApplication *app = new CApplication(Params); if(!app->Process()) { cout << "Not enough memory or some other error\n"; delete app; return 0; } app->GetStats(time1, time2, time3, n_unique, n_cutoff_min, n_cutoff_max, n_total, n_reads, tmp_size, tmp_size_strict_mem, max_disk_usage, n_total_super_kmers); delete app; } else { CApplication *app = new CApplication(Params); if(!app->Process()) { cout << "Not enough memory or some other error\n"; delete app; return 0; } app->GetStats(time1, time2, time3, n_unique, n_cutoff_min, n_cutoff_max, n_total, n_reads, tmp_size, tmp_size_strict_mem, max_disk_usage, n_total_super_kmers); delete app; } cout << "1st stage: " << time1 << "s\n"; cout << "2nd stage: " << time2 << "s\n"; if (Params.p_strict_mem) cout << "3rd stage: " << time3 << "s\n"; if (Params.p_strict_mem) cout << "Total : " << (time1 + time2 + time3) << "s\n"; else cout << "Total : " << (time1+time2) << "s\n"; if (Params.p_strict_mem) { cout << "Tmp size : " << tmp_size / 1000000 << "MB\n"; cout << "Tmp size strict memory : " << tmp_size_strict_mem / 1000000 << "MB\n"; //cout << "Tmp total: " << (tmp_size + tmp_size_strict_mem) / 1000000 << 
"MB\n"; cout << "Tmp total: " << max_disk_usage / 1000000 << "MB\n"; } else cout << "Tmp size : " << tmp_size / 1000000 << "MB\n"; cout << "\nStats:\n"; cout << " No. of k-mers below min. threshold : " << setw(12) << n_cutoff_min << "\n"; cout << " No. of k-mers above max. threshold : " << setw(12) << n_cutoff_max << "\n"; cout << " No. of unique k-mers : " << setw(12) << n_unique << "\n"; cout << " No. of unique counted k-mers : " << setw(12) << n_unique-n_cutoff_min-n_cutoff_max << "\n"; cout << " Total no. of k-mers : " << setw(12) << n_total << "\n"; if(Params.p_file_type != multiline_fasta) cout << " Total no. of reads : " << setw(12) << n_reads << "\n"; else cout << " Total no. of sequences : " << setw(12) << n_reads << "\n"; cout << " Total no. of super-k-mers : " << setw(12) << n_total_super_kmers << "\n"; return 0; } // ***** EOF KMC-2.3/kmer_counter/kmer_counter.vcxproj000066400000000000000000000267061257432033000206110ustar00rootroot00000000000000 Debug Win32 Debug x64 Release Win32 Release x64 {8C8B90DA-28B7-4D82-81F3-C0E7CE52D59F} Win32Proj kmer_counter Application true NotSet v120 Application true NotSet Static v120 Application false true NotSet v120 Application false true NotSet Static v120 true true false false Use Level3 Disabled WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions) true true %(AdditionalIncludeDirectories) ProgramDatabase Console true %(AdditionalLibraryDirectories) Use Level3 Disabled WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions) %(AdditionalIncludeDirectories) true true MultiThreadedDebugDLL /D "_VARIADIC_MAX=10" /bigobj %(AdditionalOptions) Console true %(AdditionalLibraryDirectories) libcmt.lib Level3 Use MaxSpeed true true WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions) %(AdditionalIncludeDirectories) Console true true true %(AdditionalLibraryDirectories) Level3 Use Full true true WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions) true Speed true %(AdditionalIncludeDirectories) Default NotSet /D "_VARIADIC_MAX=10" Console false 
true true %(AdditionalLibraryDirectories) Create Create Create Create KMC-2.3/kmer_counter/kxmer_set.h000066400000000000000000000044511257432033000166420ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _KXMER_SET_ #define _KXMER_SET_ #include "defs.h" #include using namespace std; #define KXMER_SET_SIZE 1024 template class CKXmerSet { typedef tuple elem_desc_t; //start_pos, end_pos, shr typedef pair heap_elem_t; //kxmer val, desc_id elem_desc_t data_desc[KXMER_SET_SIZE]; heap_elem_t data[KXMER_SET_SIZE]; uint32 pos; uint32 desc_pos; KMER_T mask; KMER_T* buffer; inline void update_heap() { uint32 desc_id = data[1].second; KMER_T kmer; if (++get<0>(data_desc[desc_id]) < get<1>(data_desc[desc_id])) { kmer.from_kxmer(buffer[get<0>(data_desc[desc_id])], get<2>(data_desc[desc_id]), mask); } else { kmer.set(data[--pos].first); desc_id = data[pos].second; } uint32 parent, less; parent = less = 1; while (true) { if (parent * 2 >= pos) break; if (parent * 2 + 1 >= pos) less = parent * 2; else if (data[parent * 2].first < data[parent * 2 + 1].first) less = parent * 2; else less = parent * 2 + 1; if (data[less].first < kmer) { data[parent] = data[less]; parent = less; } else break; } data[parent] = make_pair(kmer, desc_id); } public: CKXmerSet(uint32 kmer_len) { pos = 1; mask.set_n_1(kmer_len * 2); desc_pos = 0; } inline void init_add(uint64 start_pos, uint64 end_pos, uint32 shr) { data_desc[desc_pos] = make_tuple(start_pos, end_pos, shr); data[pos].first.from_kxmer(buffer[start_pos], shr, mask); data[pos].second = desc_pos; uint32 child_pos = pos++; while (child_pos > 1 && data[child_pos].first < data[child_pos / 2].first) { swap(data[child_pos], data[child_pos / 2]); child_pos /= 2; } ++desc_pos; } inline void set_buffer(KMER_T* _buffer) { buffer 
= _buffer; } inline void clear() { pos = 1; desc_pos = 0; } inline bool get_min(uint64& _pos, KMER_T& kmer) { if (pos <= 1) return false; kmer = data[1].first; _pos = get<0>(data_desc[data[1].second]); update_heap(); return true; } }; #endifKMC-2.3/kmer_counter/mem_disk_file.cpp000066400000000000000000000044121257432033000177600ustar00rootroot00000000000000#include "stdafx.h" /* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "mem_disk_file.h" #include "asmlib_wrapper.h" #include using namespace std; //---------------------------------------------------------------------------------- // Constructor CMemDiskFile::CMemDiskFile(bool _memory_mode) { memory_mode = _memory_mode; file = NULL; } //---------------------------------------------------------------------------------- void CMemDiskFile::Open(const string& f_name) { if(memory_mode) { } else { file = fopen(f_name.c_str(), "wb+"); if (!file) { cout << "Error: Cannot open temporary file " << f_name << "\n"; exit(1); } setbuf(file, nullptr); } name = f_name; } //---------------------------------------------------------------------------------- void CMemDiskFile::Rewind() { if(memory_mode) { } else { rewind(file); } } //---------------------------------------------------------------------------------- int CMemDiskFile::Close() { if(memory_mode) { for(auto& p : container) { delete[] p.first; } container.clear(); return 0; } else { return fclose(file); } } //---------------------------------------------------------------------------------- void CMemDiskFile::Remove() { if (!memory_mode) remove(name.c_str()); } //---------------------------------------------------------------------------------- size_t CMemDiskFile::Read(uchar * ptr, size_t size, size_t count) { if(memory_mode) { uint64 pos = 0; for(auto& p : container) { 
A_memcpy(ptr + pos, p.first, p.second); pos += p.second; delete[] p.first; } container.clear(); return pos; } else { return fread(ptr, size, count, file); } } //---------------------------------------------------------------------------------- size_t CMemDiskFile::Write(const uchar * ptr, size_t size, size_t count) { if(memory_mode) { uchar *buf = new uchar[size * count]; A_memcpy(buf, ptr, size * count); container.push_back(make_pair(buf, size * count)); return size * count; } else { return fwrite(ptr, size, count, file); } } KMC-2.3/kmer_counter/mem_disk_file.h000066400000000000000000000021331257432033000174230ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _MEM_DISK_FILE_H #define _MEM_DISK_FILE_H #include "defs.h" #include #include #include using namespace std; //************************************************************************************************************ // CMemDiskFile - wrapper for FILE* or memory equivalent //************************************************************************************************************ class CMemDiskFile { bool memory_mode; FILE* file; typedef pair elem_t;//buf,size typedef vector container_t; container_t container; string name; public: CMemDiskFile(bool _memory_mode); void Open(const string& f_name); void Rewind(); int Close(); size_t Read(uchar * ptr, size_t size, size_t count); size_t Write(const uchar * ptr, size_t size, size_t count); void Remove(); }; #endif KMC-2.3/kmer_counter/meta_oper.h000066400000000000000000000015641257432033000166160ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _META_OPER_H #define _META_OPER_H //#include template <size_t N> struct uint_{ }; // For loop (forward) template <size_t N, typename Lambda> inline void IterFwd(const Lambda &oper, uint_<N>) { IterFwd(oper, uint_<N-1>()); oper(N); } template <typename Lambda> inline void IterFwd(const Lambda &oper, uint_<0>) { oper(0); } // For loop (backward) template <size_t N, typename Lambda> inline void IterRev(const Lambda &oper, uint_<N>) { oper(N); IterRev(oper, uint_<N-1>()); } template <typename Lambda> inline void IterRev(const Lambda &oper, uint_<0>) { oper(0); } #endif // ***** EOF KMC-2.3/kmer_counter/mmer.cpp000066400000000000000000000015051257432033000161310ustar00rootroot00000000000000#include "stdafx.h" /* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "mmer.h" uint32 CMmer::norm5[]; uint32 CMmer::norm6[]; uint32 CMmer::norm7[]; uint32 CMmer::norm8[]; CMmer::_si CMmer::_init; //-------------------------------------------------------------------------- CMmer::CMmer(uint32 _len) { switch (_len) { case 5: norm = norm5; break; case 6: norm = norm6; break; case 7: norm = norm7; break; case 8: norm = norm8; break; default: break; } len = _len; mask = (1 << _len * 2) - 1; str = 0; } //-------------------------------------------------------------------------- KMC-2.3/kmer_counter/mmer.h000066400000000000000000000100761257432033000156010ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _MMER_H #define _MMER_H #include "defs.h" // ************************************************************************* // ************************************************************************* class CMmer { uint32 str; uint32 mask; uint32 current_val; uint32* norm; uint32 len; static uint32 norm5[1 << 10]; static uint32 norm6[1 << 12]; static uint32 norm7[1 << 14]; static uint32 norm8[1 << 16]; static bool is_allowed(uint32 mmer, uint32 len) { if ((mmer & 0x3f) == 0x3f) // TTT suffix return false; if ((mmer & 0x3f) == 0x3b) // TGT suffix return false; if ((mmer & 0x3c) == 0x3c) // TT* suffix return false; for (uint32 j = 0; j < len - 3; ++j) if ((mmer & 0xf) == 0) // AA inside return false; else mmer >>= 2; if (mmer == 0) // AAA prefix return false; if (mmer == 0x04) // ACA prefix return false; if ((mmer & 0xf) == 0) // *AA prefix return false; return true; } friend class CSignatureMapper; struct _si { static uint32 get_rev(uint32 mmer, uint32 len) { uint32 rev = 0; uint32 shift = len*2 - 2; for(uint32 i = 0 ; i < len ; ++i) { rev += (3 - (mmer & 3)) << shift; mmer >>= 2; shift -= 2; } return rev; } static void init_norm(uint32* norm, uint32 len) { uint32 special = 1 << len * 2; for(uint32 i = 0 ; i < special ; ++i) { uint32 rev = get_rev(i, len); uint32 str_val = is_allowed(i, len) ? i : special; uint32 rev_val = is_allowed(rev, len) ? 
rev : special; norm[i] = MIN(str_val, rev_val); } } _si() { init_norm(norm5, 5); init_norm(norm6, 6); init_norm(norm7, 7); init_norm(norm8, 8); } }static _init; public: CMmer(uint32 _len); inline void insert(uchar symb); inline uint32 get() const; inline bool operator==(const CMmer& x); inline bool operator<(const CMmer& x); inline void clear(); inline bool operator<=(const CMmer& x); inline void set(const CMmer& x); inline void insert(const char* seq); }; //-------------------------------------------------------------------------- inline void CMmer::insert(uchar symb) { str <<= 2; str += symb; str &= mask; current_val = norm[str]; } //-------------------------------------------------------------------------- inline uint32 CMmer::get() const { return current_val; } //-------------------------------------------------------------------------- inline bool CMmer::operator==(const CMmer& x) { return current_val == x.current_val; } //-------------------------------------------------------------------------- inline bool CMmer::operator<(const CMmer& x) { return current_val < x.current_val; } //-------------------------------------------------------------------------- inline void CMmer::clear() { str = 0; } //-------------------------------------------------------------------------- inline bool CMmer::operator<=(const CMmer& x) { return current_val <= x.current_val; } //-------------------------------------------------------------------------- inline void CMmer::set(const CMmer& x) { str = x.str; current_val = x.current_val; } //-------------------------------------------------------------------------- inline void CMmer::insert(const char* seq) { switch (len) { case 5: str = (seq[0] << 8) + (seq[1] << 6) + (seq[2] << 4) + (seq[3] << 2) + (seq[4]); break; case 6: str = (seq[0] << 10) + (seq[1] << 8) + (seq[2] << 6) + (seq[3] << 4) + (seq[4] << 2) + (seq[5]); break; case 7: str = (seq[0] << 12) + (seq[1] << 10) + (seq[2] << 8) + (seq[3] << 6) + (seq[4] << 4 ) + (seq[5] << 2) 
+ (seq[6]); break; case 8: str = (seq[0] << 14) + (seq[1] << 12) + (seq[2] << 10) + (seq[3] << 8) + (seq[4] << 6) + (seq[5] << 4) + (seq[6] << 2) + (seq[7]); break; default: break; } current_val = norm[str]; } #endifKMC-2.3/kmer_counter/params.h000066400000000000000000000135471257432033000161320ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _PARAMS_H #define _PARAMS_H #include "defs.h" #include "queues.h" #include "s_mapper.h" #include <vector> #include <string> typedef enum {fasta, fastq, multiline_fasta} input_type; using namespace std; // Structure for passing KMC parameters struct CKMCParams { // Input parameters int p_m; // max. total RAM usage int p_k; // k-mer length int p_t; // no. of threads int p_sf; // no. of reading threads int p_sp; // no. of splitting threads int p_so; // no. of OpenMP threads for sorting int p_sr; // no. of sorting threads int p_ci; // do not count k-mers occurring less than p_ci times int64 p_cx; // do not count k-mers occurring more than p_cx times int64 p_cs; // maximal counter value bool p_quake; // use Quake-compatible counting bool p_strict_mem; // use strict memory limit mode bool p_mem_mode; // use RAM instead of disk int p_quality; // lowest quality input_type p_file_type; // input in FASTA format bool p_verbose; // verbose mode bool p_both_strands; // compute canonical k-mer representation int p_p1; // signature length int p_n_bins; // no. of bins int p_smso; // no. of OpenMP threads for sorting in strict memory mode int p_smun; // no. of uncompacting threads in strict memory mode int p_smme; // no. 
of merging threads in strict memory mode // File names vector input_file_names; string output_file_name; string working_directory; input_type file_type; uint32 lut_prefix_len; uint32 KMER_T_size; // Memory sizes int64 max_mem_size; // maximum amount of memory to be used in GBs; default: 30GB int64 max_mem_storer; // maximum amount of memory for internal buffers of KmerStorer int64 max_mem_stage2; // maximum amount of memory in stage 2 int64 max_mem_storer_pkg; // maximum amount of memory for single package int64 mem_tot_pmm_bins; // maximal amount of memory per pool memory manager (PMM) of bin parts int64 mem_part_pmm_bins; // maximal amount of memory per single part of memory maintained by PMM of bin parts int64 mem_tot_pmm_fastq; int64 mem_part_pmm_fastq; int64 mem_part_pmm_reads; int64 mem_tot_pmm_reads; int64 mem_part_pmm_radix_buf; int64 mem_tot_pmm_radix_buf; int64 mem_part_pmm_prob; int64 mem_tot_pmm_prob; int64 mem_part_pmm_cnts_sort; int64 mem_tot_pmm_stats; int64 mem_part_pmm_stats; int64 mem_tot_pmm_epxand; int64 mem_part_pmm_epxand; int64 mem_part_small_k_buf; int64 mem_tot_small_k_buf; int64 mem_part_small_k_completer; int64 mem_tot_small_k_completer; bool verbose; int kmer_len; // kmer length int signature_len; int cutoff_min; // exclude k-mers occurring less than times int64 cutoff_max; // exclude k-mers occurring more than times int64 counter_max; // maximal counter value bool use_quake; // use Quake's counting based on qualities bool use_strict_mem; // use strict memory limit mode int lowest_quality; // lowest quality value bool both_strands; // find canonical representation of each k-mer bool mem_mode; // use RAM instead of disk int n_bins; // number of bins; int bin_part_size; // size of a bin part; fixed: 2^15 int fastq_buffer_size; // size of FASTQ file buffer; fixed: 2^23 int n_threads; // number of cores int n_readers; // number of FASTQ readers; default: 1 int n_splitters; // number of splitters; default: 1 int n_sorters; // number of 
sorters; default: 1 vector n_omp_threads;// number of OMP threads per sorters uint32 max_x; //k+x-mers will be counted uint32 gzip_buffer_size; uint32 bzip2_buffer_size; //params for strict memory mode int sm_n_uncompactors; int sm_n_omp_threads; int sm_n_mergers; int64 sm_mem_part_input_file; int64 sm_mem_tot_input_file; int64 sm_mem_part_expand; int64 sm_mem_tot_expand; int64 sm_mem_part_sort; int64 sm_mem_tot_sort; int64 sm_mem_part_suffixes; int64 sm_mem_tot_suffixes; int64 sm_mem_part_lut; int64 sm_mem_tot_lut; int64 sm_mem_part_sub_bin_lut; int64 sm_mem_tot_sub_bin_lut; int64 sm_mem_part_sub_bin_suff; int64 sm_mem_tot_sub_bin_suff; int64 sm_mem_part_merger_lut; int64 sm_mem_tot_merger_lut; int64 sm_mem_part_merger_suff; int64 sm_mem_tot_merger_suff; CKMCParams() { p_m = 12; p_k = 25; p_t = 0; p_sf = 0; p_sp = 0; p_so = 0; p_sr = 0; p_smme = p_smso = p_smun = 0; p_ci = 2; p_cx = 1000000000; p_cs = 255; p_quake = false; p_strict_mem = false; p_mem_mode = false; p_quality = 33; p_file_type = fastq; p_verbose = false; p_both_strands = true; p_p1 = 7; p_n_bins = 512; gzip_buffer_size = 64 << 20; bzip2_buffer_size = 64 << 20; } }; // Structure for passing KMC queues and monitors to threads struct CKMCQueues { //Signature mapper CSignatureMapper* s_mapper; // Memory monitors CMemoryMonitor *mm; // Queues CInputFilesQueue *input_files_queue; CPartQueue *part_queue; CStatsPartQueue* stats_part_queue; CBinPartQueue *bpq; CBinDesc *bd; CBinQueue *bq; CKmerQueue *kq; CMemoryPool *pmm_bins, *pmm_fastq, *pmm_reads, *pmm_radix_buf, *pmm_prob, *pmm_stats, *pmm_expand; CMemoryBins *memory_bins; CMemoryPool* pmm_small_k_buf, *pmm_small_k_completer; CDiskLogger* disk_logger; //for strict memory mode CTooLargeBinsQueue* tlbq; CBigBinPartQueue* bbpq; CBigBinKXmersQueue* bbkq; CBigBinDesc* bbd; CBigBinKmerPartQueue* bbkpq; CBigBinSortedPartQueue* bbspq; CKMCQueues() {} CMemoryPool* sm_pmm_input_file, *sm_pmm_expand, *sm_pmm_sort, *sm_pmm_sorter_suffixes, *sm_pmm_sorter_lut, 
*sm_pmm_sub_bin_lut, *sm_pmm_sub_bin_suff, *sm_pmm_merger_lut, *sm_pmm_merger_suff; CCompletedBinsCollector* sm_cbc; }; #endif // ***** EOF KMC-2.3/kmer_counter/prob_qual.cpp000066400000000000000000000103241257432033000171540ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "stdafx.h" #include "prob_qual.h" // K-mers with probability less than MIN_PROB_QUAL_VALUE will not be counted double CProbQual::MIN_PROB_QUAL_VALUE = 0.0000; double CProbQual::prob_qual[94] = { 0.2500000000000000, 0.2500000000000000, 0.3690426555198070, 0.4988127663727280, 0.6018928294465030, 0.6837722339831620, 0.7488113568490420, 0.8004737685031120, 0.8415106807538890, 0.8741074588205830, 0.9000000000000000, 0.9205671765275720, 0.9369042655519810, 0.9498812766372730, 0.9601892829446500, 0.9683772233983160, 0.9748811356849040, 0.9800473768503110, 0.9841510680753890, 0.9874107458820580, 0.9900000000000000, 0.9920567176527570, 0.9936904265551980, 0.9949881276637270, 0.9960189282944650, 0.9968377223398320, 0.9974881135684900, 0.9980047376850310, 0.9984151068075390, 0.9987410745882060, 0.9990000000000000, 0.9992056717652760, 0.9993690426555200, 0.9994988127663730, 0.9996018928294460, 0.9996837722339830, 0.9997488113568490, 0.9998004737685030, 0.9998415106807540, 0.9998741074588210, 0.9999000000000000, 0.9999205671765280, 0.9999369042655520, 0.9999498812766370, 0.9999601892829450, 0.9999683772233980, 0.9999748811356850, 0.9999800473768500, 0.9999841510680750, 0.9999874107458820, 0.9999900000000000, 0.9999920567176530, 0.9999936904265550, 0.9999949881276640, 0.9999960189282940, 0.9999968377223400, 0.9999974881135680, 0.9999980047376850, 0.9999984151068080, 0.9999987410745880, 0.9999990000000000, 0.9999992056717650, 0.9999993690426560, 0.9999994988127660, 
0.9999996018928290, 0.9999996837722340, 0.9999997488113570, 0.9999998004737680, 0.9999998415106810, 0.9999998741074590, 0.9999999000000000, 0.9999999205671770, 0.9999999369042660, 0.9999999498812770, 0.9999999601892830, 0.9999999683772230, 0.9999999748811360, 0.9999999800473770, 0.9999999841510680, 0.9999999874107460, 0.9999999900000000, 0.9999999920567180, 0.9999999936904270, 0.9999999949881280, 0.9999999960189280, 0.9999999968377220, 0.9999999974881140, 0.9999999980047380, 0.9999999984151070, 0.9999999987410750, 0.9999999990000000, 0.9999999992056720, 0.9999999993690430, 0.9999999994988130 }; double CProbQual::inv_prob_qual[94] = { 4.0000000000000000, 4.0000000000000000, 2.7097138638119600, 2.0047602375372500, 1.6614253419825500, 1.4624752955742600, 1.3354498310601800, 1.2492601748462100, 1.1883390465158700, 1.1440241012807300, 1.1111111111111100, 1.0862868300084900, 1.0673449110735400, 1.0527631448218000, 1.0414613220148200, 1.0326554320337200, 1.0257660789563300, 1.0203588353185700, 1.0161041657513100, 1.0127497641386300, 1.0101010101010100, 1.0080068832818700, 1.0063496369454600, 1.0050371177272600, 1.0039969839853900, 1.0031723093832600, 1.0025182118938000, 1.0019992513458400, 1.0015874090662800, 1.0012605123027600, 1.0010010010010000, 1.0007949596936500, 1.0006313557030000, 1.0005014385482300, 1.0003982657229900, 1.0003163277976500, 1.0002512517547400, 1.0001995660501600, 1.0001585144420900, 1.0001259083921100, 1.0001000100010000, 1.0000794391335500, 1.0000630997157700, 1.0000501212353700, 1.0000398123020100, 1.0000316237766300, 1.0000251194952900, 1.0000199530212600, 1.0000158491831200, 1.0000125894126100, 1.0000100001000000, 1.0000079433454400, 1.0000063096132600, 1.0000050118974600, 1.0000039810875500, 1.0000031622876600, 1.0000025118927400, 1.0000019952663000, 1.0000015848957000, 1.0000012589270000, 1.0000010000010000, 1.0000007943288700, 1.0000006309577400, 1.0000005011874800, 1.0000003981073300, 1.0000003162278700, 1.0000002511887100, 
1.0000001995262700, 1.0000001584893400, 1.0000001258925600, 1.0000001000000100, 1.0000000794328300, 1.0000000630957400, 1.0000000501187300, 1.0000000398107200, 1.0000000316227800, 1.0000000251188600, 1.0000000199526200, 1.0000000158489300, 1.0000000125892500, 1.0000000100000000, 1.0000000079432800, 1.0000000063095700, 1.0000000050118700, 1.0000000039810700, 1.0000000031622800, 1.0000000025118900, 1.0000000019952600, 1.0000000015848900, 1.0000000012589300, 1.0000000010000000, 1.0000000007943300, 1.0000000006309600, 1.0000000005011900 }; KMC-2.3/kmer_counter/prob_qual.h000066400000000000000000000006441257432033000166250ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _PROB_QUAL_H #define _PROB_QUAL_H struct CProbQual { static double prob_qual[94]; static double inv_prob_qual[94]; static double MIN_PROB_QUAL_VALUE; }; #endifKMC-2.3/kmer_counter/queues.h000066400000000000000000001000711257432033000161430ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _QUEUES_H #define _QUEUES_H #include "defs.h" #include #include #include #include #include #include #include #include "mem_disk_file.h" using namespace std; #include #include #include using std::thread; //************************************************************************************************************ class CInputFilesQueue { typedef string elem_t; typedef queue> queue_t; queue_t q; bool is_completed; mutable mutex mtx; // The mutex to synchronise on public: CInputFilesQueue(const vector &file_names) { unique_lock lck(mtx); for(vector::const_iterator p = file_names.begin(); p != file_names.end(); ++p) q.push(*p); is_completed = false; }; ~CInputFilesQueue() {}; bool empty() { lock_guard lck(mtx); return q.empty(); } bool completed() { lock_guard lck(mtx); return q.empty() && is_completed; } void mark_completed() { lock_guard lck(mtx); is_completed = true; } bool pop(string &file_name) { lock_guard lck(mtx); if(q.empty()) return false; file_name = q.front(); q.pop(); return true; } }; //************************************************************************************************************ class CPartQueue { typedef pair elem_t; typedef queue> queue_t; queue_t q; bool is_completed; int n_readers; mutable mutex mtx; // The mutex to synchronise on condition_variable cv_queue_empty; public: CPartQueue(int _n_readers) { unique_lock lck(mtx); is_completed = false; n_readers = _n_readers; }; ~CPartQueue() {}; bool empty() { lock_guard lck(mtx); return q.empty(); } bool completed() { lock_guard lck(mtx); return q.empty() && !n_readers; } void mark_completed() { lock_guard lck(mtx); n_readers--; if(!n_readers) cv_queue_empty.notify_all(); } void push(uchar *part, uint64 size) { unique_lock lck(mtx); bool was_empty = q.empty(); q.push(make_pair(part, size)); if(was_empty) 
cv_queue_empty.notify_all(); } bool pop(uchar *&part, uint64 &size) { unique_lock lck(mtx); cv_queue_empty.wait(lck, [this]{return !this->q.empty() || !this->n_readers;}); if(q.empty()) return false; part = q.front().first; size = q.front().second; q.pop(); return true; } }; //************************************************************************************************************ class CStatsPartQueue { typedef pair elem_t; typedef queue> queue_t; queue_t q; mutable mutex mtx; condition_variable cv_queue_empty; int n_readers; int64 bytes_to_read; public: CStatsPartQueue(int _n_readers, int64 _bytes_to_read) { unique_lock lck(mtx); n_readers = _n_readers; bytes_to_read = _bytes_to_read; } ~CStatsPartQueue() {}; void mark_completed() { lock_guard lck(mtx); n_readers--; if (!n_readers) cv_queue_empty.notify_all(); } bool completed() { lock_guard lck(mtx); return q.empty() && !n_readers; } bool push(uchar *part, uint64 size) { unique_lock lck(mtx); if (bytes_to_read <= 0) return false; bool was_empty = q.empty(); q.push(make_pair(part, size)); bytes_to_read -= size; if (was_empty) cv_queue_empty.notify_one(); return true; } bool pop(uchar *&part, uint64 &size) { unique_lock lck(mtx); cv_queue_empty.wait(lck, [this]{return !this->q.empty() || !this->n_readers; }); if (q.empty()) return false; part = q.front().first; size = q.front().second; q.pop(); return true; } }; //************************************************************************************************************ class CBinPartQueue { typedef tuple elem_t; typedef queue> queue_t; queue_t q; int n_writers; bool is_completed; mutable mutex mtx; // The mutex to synchronise on condition_variable cv_queue_empty; public: CBinPartQueue(int _n_writers) { lock_guard lck(mtx); n_writers = _n_writers; is_completed = false; } ~CBinPartQueue() {} bool empty() { lock_guard lck(mtx); return q.empty(); } bool completed() { lock_guard lck(mtx); return q.empty() && !n_writers; } void mark_completed() { lock_guard 
lck(mtx); n_writers--; if(!n_writers) cv_queue_empty.notify_all(); } void push(int32 bin_id, uchar *part, uint32 true_size, uint32 alloc_size) { unique_lock lck(mtx); bool was_empty = q.empty(); q.push(std::make_tuple(bin_id, part, true_size, alloc_size)); if(was_empty) cv_queue_empty.notify_all(); } bool pop(int32 &bin_id, uchar *&part, uint32 &true_size, uint32 &alloc_size) { unique_lock lck(mtx); cv_queue_empty.wait(lck, [this]{return !q.empty() || !n_writers;}); if(q.empty()) return false; bin_id = get<0>(q.front()); part = get<1>(q.front()); true_size = get<2>(q.front()); alloc_size = get<3>(q.front()); q.pop(); return true; } }; //************************************************************************************************************ class CBinDesc { typedef tuple desc_t; typedef map map_t; map_t m; int32 bin_id; vector random_bins; mutable mutex mtx; public: CBinDesc() { lock_guard lck(mtx); bin_id = -1; } ~CBinDesc() {} void reset_reading() { lock_guard lck(mtx); bin_id = -1; } bool empty() { lock_guard lck(mtx); return m.empty(); } void init_random() { lock_guard lck(mtx); vector> bin_sizes; for (auto& p : m) bin_sizes.push_back(make_pair(p.first, get<2>(p.second))); sort(bin_sizes.begin(), bin_sizes.end(), [](const pair& l, const pair& r){ return l.second > r.second; }); uint32 no_sort_start = uint32(0.6 * bin_sizes.size()); uint32 no_sort_end = uint32(0.8 * bin_sizes.size()); for (uint32 i = 0; i < no_sort_start; ++i) random_bins.push_back(bin_sizes[i].first); for (uint32 i = no_sort_end; i < bin_sizes.size(); ++i) random_bins.push_back(bin_sizes[i].first); random_shuffle(random_bins.begin(), random_bins.end()); for (uint32 i = no_sort_start; i < no_sort_end; ++i) random_bins.push_back(bin_sizes[i].first); } int32 get_next_random_bin() { lock_guard lck(mtx); if (bin_id == -1) bin_id = 0; else ++bin_id; if (bin_id >= (int32)m.size()) return -1000; return random_bins[bin_id]; } int32 get_next_bin() { lock_guard lck(mtx); map_t::iterator p; if(bin_id == 
-1) p = m.begin(); else { p = m.find(bin_id); if(p != m.end()) ++p; } if(p == m.end()) bin_id = -1000; else bin_id = p->first; return bin_id; } void insert(int32 bin_id, CMemDiskFile *file, string desc, int64 size, uint64 n_rec, uint64 n_plus_x_recs, uint64 n_super_kmers, uint32 buffer_size = 0, uint32 kmer_len = 0) { lock_guard lck(mtx); map_t::iterator p = m.find(bin_id); if(p != m.end()) { if(desc != "") { get<0>(m[bin_id]) = desc; get<5>(m[bin_id]) = file; } get<1>(m[bin_id]) += size; get<2>(m[bin_id]) += n_rec; get<6>(m[bin_id]) += n_plus_x_recs; get<7>(m[bin_id]) += n_super_kmers; if(buffer_size) { get<3>(m[bin_id]) = buffer_size; get<4>(m[bin_id]) = kmer_len; } } else m[bin_id] = std::make_tuple(desc, size, n_rec, buffer_size, kmer_len, file, n_plus_x_recs, n_super_kmers); } void read(int32 bin_id, CMemDiskFile *&file, string &desc, uint64 &size, uint64 &n_rec, uint64 &n_plus_x_recs, uint32 &buffer_size, uint32 &kmer_len) { lock_guard lck(mtx); desc = get<0>(m[bin_id]); file = get<5>(m[bin_id]); size = (uint64) get<1>(m[bin_id]); n_rec = get<2>(m[bin_id]); buffer_size = get<3>(m[bin_id]); kmer_len = get<4>(m[bin_id]); n_plus_x_recs = get<6>(m[bin_id]); } void read(int32 bin_id, CMemDiskFile *&file, string &desc, uint64 &size, uint64 &n_rec, uint64 &n_plus_x_recs, uint64 &n_super_kmers) { lock_guard lck(mtx); desc = get<0>(m[bin_id]); file = get<5>(m[bin_id]); size = (uint64) get<1>(m[bin_id]); n_rec = get<2>(m[bin_id]); n_plus_x_recs = get<6>(m[bin_id]); n_super_kmers = get<7>(m[bin_id]); } }; //************************************************************************************************************ class CBinQueue { typedef tuple elem_t; typedef queue> queue_t; queue_t q; int n_writers; mutable mutex mtx; // The mutex to synchronise on condition_variable cv_queue_empty; public: CBinQueue(int _n_writers) { lock_guard lck(mtx); n_writers = _n_writers; } ~CBinQueue() {} bool empty() { lock_guard lck(mtx); return q.empty(); } bool completed() { lock_guard 
lck(mtx); return q.empty() && !n_writers; } void mark_completed() { lock_guard lck(mtx); n_writers--; if(n_writers == 0) cv_queue_empty.notify_all(); } void push(int32 bin_id, uchar *part, uint64 size, uint64 n_rec) { lock_guard lck(mtx); bool was_empty = q.empty(); q.push(std::make_tuple(bin_id, part, size, n_rec)); if(was_empty) cv_queue_empty.notify_all(); } bool pop(int32 &bin_id, uchar *&part, uint64 &size, uint64 &n_rec) { unique_lock lck(mtx); cv_queue_empty.wait(lck, [this]{return !q.empty() || !n_writers;}); if(q.empty()) return false; bin_id = get<0>(q.front()); part = get<1>(q.front()); size = get<2>(q.front()); n_rec = get<3>(q.front()); q.pop(); return true; } }; //************************************************************************************************************ class CKmerQueue { typedef tuple data_t; typedef list list_t; int n_writers; mutable mutex mtx; // The mutex to synchronise on condition_variable cv_queue_empty; list_t l; int32 n_bins; public: CKmerQueue(int32 _n_bins, int _n_writers) { lock_guard lck(mtx); n_bins = _n_bins; n_writers = _n_writers; } ~CKmerQueue() { } bool empty() { lock_guard lck(mtx); return l.empty() && !n_writers; } void mark_completed() { lock_guard lck(mtx); n_writers--; if (!n_writers) cv_queue_empty.notify_all(); } void push(int32 bin_id, uchar *data, uint64 data_size, uchar *lut, uint64 lut_size, uint64 n_unique, uint64 n_cutoff_min, uint64 n_cutoff_max, uint64 n_total) { lock_guard lck(mtx); l.push_back(std::make_tuple(bin_id, data, data_size, lut, lut_size, n_unique, n_cutoff_min, n_cutoff_max, n_total)); cv_queue_empty.notify_all(); } bool pop(int32 &bin_id, uchar *&data, uint64 &data_size, uchar *&lut, uint64 &lut_size, uint64 &n_unique, uint64 &n_cutoff_min, uint64 &n_cutoff_max, uint64 &n_total) { unique_lock lck(mtx); cv_queue_empty.wait(lck, [this]{return !l.empty() || !n_writers; }); if (l.empty()) return false; bin_id = get<0>(l.front()); data = get<1>(l.front()); data_size = get<2>(l.front()); lut 
= get<3>(l.front()); lut_size = get<4>(l.front()); n_unique = get<5>(l.front()); n_cutoff_min = get<6>(l.front()); n_cutoff_max = get<7>(l.front()); n_total = get<8>(l.front()); l.pop_front(); if (l.empty()) cv_queue_empty.notify_all(); return true; } }; //************************************************************************************************************ class CMemoryMonitor { uint64 max_memory; uint64 memory_in_use; mutable mutex mtx; // The mutex to synchronise on condition_variable cv_memory_full; // The condition to wait for public: CMemoryMonitor(uint64 _max_memory) { lock_guard lck(mtx); max_memory = _max_memory; memory_in_use = 0; } ~CMemoryMonitor() { } void increase(uint64 n) { unique_lock lck(mtx); cv_memory_full.wait(lck, [this, n]{return memory_in_use + n <= max_memory;}); memory_in_use += n; } void force_increase(uint64 n) { unique_lock lck(mtx); cv_memory_full.wait(lck, [this, n]{return memory_in_use + n <= max_memory || memory_in_use == 0;}); memory_in_use += n; } void decrease(uint64 n) { lock_guard lck(mtx); memory_in_use -= n; cv_memory_full.notify_all(); } void info(uint64 &_max_memory, uint64 &_memory_in_use) { lock_guard lck(mtx); _max_memory = max_memory; _memory_in_use = memory_in_use; } }; //************************************************************************************************************ class CMemoryPool { int64 total_size; int64 part_size; int64 n_parts_total; int64 n_parts_free; uchar *buffer, *raw_buffer; uint32 *stack; mutable mutex mtx; // The mutex to synchronise on condition_variable cv; // The condition to wait for public: CMemoryPool(int64 _total_size, int64 _part_size) { raw_buffer = NULL; buffer = NULL; stack = NULL; prepare(_total_size, _part_size); } ~CMemoryPool() { release(); } void prepare(int64 _total_size, int64 _part_size) { release(); n_parts_total = _total_size / _part_size; part_size = (_part_size + 15) / 16 * 16; // to allow mapping pointer to int* n_parts_free = n_parts_total; total_size = 
n_parts_total * part_size; raw_buffer = new uchar[total_size+64]; buffer = raw_buffer; while(((uint64) buffer) % 64) buffer++; stack = new uint32[n_parts_total]; for(uint32 i = 0; i < n_parts_total; ++i) stack[i] = i; } void release(void) { if(raw_buffer) delete[] raw_buffer; raw_buffer = NULL; buffer = NULL; if(stack) delete[] stack; stack = NULL; } // Allocate memory buffer - uchar* void reserve(uchar* &part) { unique_lock lck(mtx); cv.wait(lck, [this]{return n_parts_free > 0;}); part = buffer + stack[--n_parts_free]*part_size; } // Allocate memory buffer - char* void reserve(char* &part) { unique_lock lck(mtx); cv.wait(lck, [this]{return n_parts_free > 0;}); part = (char*) (buffer + stack[--n_parts_free]*part_size); } // Allocate memory buffer - uint32* void reserve(uint32* &part) { unique_lock lck(mtx); cv.wait(lck, [this]{return n_parts_free > 0;}); part = (uint32*) (buffer + stack[--n_parts_free]*part_size); } // Allocate memory buffer - uint64* void reserve(uint64* &part) { unique_lock lck(mtx); cv.wait(lck, [this]{return n_parts_free > 0;}); part = (uint64*) (buffer + stack[--n_parts_free]*part_size); } // Allocate memory buffer - double* void reserve(double* &part) { unique_lock lck(mtx); cv.wait(lck, [this]{return n_parts_free > 0;}); part = (double*) (buffer + stack[--n_parts_free]*part_size); } // Allocate memory buffer - float* void reserve(float* &part) { unique_lock lck(mtx); cv.wait(lck, [this]{return n_parts_free > 0; }); part = (float*)(buffer + stack[--n_parts_free] * part_size); } // Deallocate memory buffer - uchar* void free(uchar* part) { lock_guard lck(mtx); stack[n_parts_free++] = (uint32) ((part - buffer) / part_size); cv.notify_all(); } // Deallocate memory buffer - char* void free(char* part) { lock_guard lck(mtx); stack[n_parts_free++] = (uint32) (((uchar*) part - buffer) / part_size); cv.notify_all(); } // Deallocate memory buffer - uint32* void free(uint32* part) { lock_guard lck(mtx); stack[n_parts_free++] = (uint32) ((((uchar *) 
part) - buffer) / part_size); cv.notify_all(); } // Deallocate memory buffer - uint64* void free(uint64* part) { lock_guard lck(mtx); stack[n_parts_free++] = (uint32) ((((uchar *) part) - buffer) / part_size); cv.notify_all(); } // Deallocate memory buffer - double* void free(double* part) { lock_guard lck(mtx); stack[n_parts_free++] = (uint32) ((((uchar *) part) - buffer) / part_size); cv.notify_all(); } // Deallocate memory buffer - float* void free(float* part) { lock_guard lck(mtx); stack[n_parts_free++] = (uint32)((((uchar *)part) - buffer) / part_size); cv.notify_all(); } }; class CMemoryBins { int64 total_size; int64 free_size; uint32 n_bins; bool use_strict_mem; typedef std::tuple bin_ptrs_t; public: typedef enum{ mba_input_file, mba_input_array, mba_tmp_array, mba_suffix, mba_kxmer_counters, mba_lut } mba_t; private: uchar *buffer, *raw_buffer; bin_ptrs_t *bin_ptrs; list> list_reserved; list> list_insert_order; mutable mutex mtx; // The mutex to synchronise on condition_variable cv; // The condition to wait for public: CMemoryBins(int64 _total_size, uint32 _n_bins, bool _use_strict_mem) { raw_buffer = NULL; buffer = NULL; bin_ptrs = NULL; use_strict_mem = _use_strict_mem; prepare(_total_size, _n_bins); } ~CMemoryBins() { release(); } int64 round_up_to_alignment(int64 x) { return (x + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT; } void prepare(int64 _total_size, uint32 _n_bins) { release(); n_bins = _n_bins; bin_ptrs = new bin_ptrs_t[n_bins]; total_size = round_up_to_alignment(_total_size - n_bins * sizeof(bin_ptrs_t)); free_size = total_size; raw_buffer = (uchar*)malloc(total_size + ALIGNMENT); buffer = raw_buffer; while (((uint64)buffer) % ALIGNMENT) buffer++; list_reserved.clear(); list_insert_order.clear(); list_reserved.push_back(make_pair(total_size, 0)); // guard } void release(void) { if (raw_buffer) ::free(raw_buffer); raw_buffer = NULL; buffer = NULL; if (bin_ptrs) delete[] bin_ptrs; bin_ptrs = NULL; } // Prepare memory buffer for bin of given id bool 
init(uint32 bin_id, uint32 sorting_phases, int64 file_size, int64 kxmers_size, int64 out_buffer_size, int64 kxmer_counter_size, int64 lut_size) { unique_lock lck(mtx); int64 part1_size; int64 part2_size; if (sorting_phases % 2 == 0) { part1_size = kxmers_size + kxmer_counter_size; part2_size = max(max(file_size, kxmers_size), out_buffer_size + lut_size); } else { part1_size = max(kxmers_size + kxmer_counter_size, file_size); part2_size = max(kxmers_size, out_buffer_size + lut_size); } int64 req_size = part1_size + part2_size; if (use_strict_mem && req_size > total_size) { return false; } uint64 found_pos; uint64 last_found_pos; // Look for space to insert cv.wait(lck, [&]() -> bool{ found_pos = total_size; if (!list_insert_order.empty()) { last_found_pos = list_insert_order.back().second; for (auto p = list_reserved.begin(); p != list_reserved.end(); ++p) if (p->first == last_found_pos) { uint64 last_end_pos = p->first + p->second; ++p; if (last_end_pos + req_size <= p->first) { found_pos = last_end_pos; return true; } else break; } } uint64 prev_end_pos = 0; for (auto p = list_reserved.begin(); p != list_reserved.end(); ++p) { if (prev_end_pos + req_size <= p->first) { found_pos = prev_end_pos; return true; } prev_end_pos = p->first + p->second; } // Reallocate memory for buffer if necessary if (list_insert_order.empty() && req_size > (int64)list_reserved.back().first) { ::free(raw_buffer); total_size = round_up_to_alignment(req_size); free_size = total_size; raw_buffer = (uchar*)malloc(total_size + ALIGNMENT); buffer = raw_buffer; while (((uint64)buffer) % ALIGNMENT) buffer++; list_reserved.back().first = total_size; found_pos = 0; return true; } return false; }); // Reserve found free space list_insert_order.push_back(make_pair(bin_id, found_pos)); for (auto p = list_reserved.begin(); p != list_reserved.end(); ++p) if (found_pos < p->first) { list_reserved.insert(p, make_pair(found_pos, req_size)); break; } uchar *base_ptr = get<0>(bin_ptrs[bin_id]) = buffer + 
found_pos; if (sorting_phases % 2 == 0) // the result of sorting is in the same place as input { get<1>(bin_ptrs[bin_id]) = base_ptr + part1_size; get<2>(bin_ptrs[bin_id]) = base_ptr; get<3>(bin_ptrs[bin_id]) = base_ptr + part1_size; } else { get<1>(bin_ptrs[bin_id]) = base_ptr; get<2>(bin_ptrs[bin_id]) = base_ptr + part1_size; get<3>(bin_ptrs[bin_id]) = base_ptr; } get<4>(bin_ptrs[bin_id]) = base_ptr + part1_size; // data get<5>(bin_ptrs[bin_id]) = get<4>(bin_ptrs[bin_id]) + out_buffer_size; if (kxmer_counter_size) get<6>(bin_ptrs[bin_id]) = base_ptr + kxmers_size; //kxmers counter else get<6>(bin_ptrs[bin_id]) = NULL; free_size -= req_size; get<7>(bin_ptrs[bin_id]) = req_size; return true; } void reserve(uint32 bin_id, uchar* &part, mba_t t) { unique_lock lck(mtx); if (t == mba_input_file) part = get<1>(bin_ptrs[bin_id]); else if (t == mba_input_array) part = get<2>(bin_ptrs[bin_id]); else if (t == mba_tmp_array) part = get<3>(bin_ptrs[bin_id]); else if (t == mba_suffix) part = get<4>(bin_ptrs[bin_id]); else if (t == mba_lut) part = get<5>(bin_ptrs[bin_id]); else if (t == mba_kxmer_counters) part = get<6>(bin_ptrs[bin_id]); } // Deallocate memory buffer - uchar* void free(uint32 bin_id, mba_t t) { unique_lock lck(mtx); if (t == mba_input_file) get<1>(bin_ptrs[bin_id]) = NULL; else if (t == mba_input_array) get<2>(bin_ptrs[bin_id]) = NULL; else if (t == mba_tmp_array) get<3>(bin_ptrs[bin_id]) = NULL; else if (t == mba_suffix) get<4>(bin_ptrs[bin_id]) = NULL; else if (t == mba_lut) get<5>(bin_ptrs[bin_id]) = NULL; else if (t == mba_kxmer_counters) get<6>(bin_ptrs[bin_id]) = NULL; if (!get<1>(bin_ptrs[bin_id]) && !get<2>(bin_ptrs[bin_id]) && !get<3>(bin_ptrs[bin_id]) && !get<4>(bin_ptrs[bin_id]) && !get<5>(bin_ptrs[bin_id]) && !get<6>(bin_ptrs[bin_id])) { for (auto p = list_reserved.begin(); p != list_reserved.end() && p->second != 0; ++p) { if ((int64)p->first == get<0>(bin_ptrs[bin_id]) - buffer) { list_reserved.erase(p); break; } } for (auto p = 
list_insert_order.begin(); p != list_insert_order.end(); ++p) if (p->first == bin_id) { list_insert_order.erase(p); break; } get<0>(bin_ptrs[bin_id]) = NULL; free_size += get<7>(bin_ptrs[bin_id]); cv.notify_all(); } } }; class CTooLargeBinsQueue { queue> q; uint32 curr; public: CTooLargeBinsQueue() { curr = 0; } ~CTooLargeBinsQueue() { } bool get_next(int32& _bin_id) { if (q.empty()) return false; _bin_id = q.front(); q.pop(); return true; } bool empty() { return q.empty(); } void insert(int32 _bin_id) { q.push(_bin_id); } }; class CBigBinPartQueue { typedef std::tuple data_t; typedef list list_t; list_t l; bool completed; mutable mutex mtx; condition_variable cv_pop; public: void init() { completed = false; } CBigBinPartQueue() { init(); } void push(int32 bin_id, uchar* data, uint64 size) { lock_guard lck(mtx); bool was_empty = l.empty(); l.push_back(std::make_tuple(bin_id, data, size)); if (was_empty) cv_pop.notify_all(); } bool pop(int32& bin_id, uchar* &data, uint64& size) { unique_lock lck(mtx); cv_pop.wait(lck, [this]{return !l.empty() || completed; }); if (completed && l.empty()) return false; bin_id = get<0>(l.front()); data = get<1>(l.front()); size = get<2>(l.front()); l.pop_front(); return true; } void mark_completed() { lock_guard lck(mtx); completed = true; cv_pop.notify_all(); } }; class CBigBinKXmersQueue { typedef std::tuple data_t; typedef list list_t; list_t l; uint32 n_writers; mutable mutex mtx; condition_variable cv_pop; uint32 n_waiters; int32 current_id; condition_variable cv_push; public: CBigBinKXmersQueue(uint32 _n_writers) { n_waiters = 0; current_id = -1; //means queue is not initialized n_writers = _n_writers; } void push(int32 bin_id, uchar* data, uint64 size) { unique_lock lck(mtx); ++n_waiters; if (current_id == -1) current_id = bin_id; cv_push.wait(lck, [this, bin_id]{return bin_id == current_id || n_waiters == n_writers; }); if (n_waiters == n_writers) { current_id = bin_id; cv_push.notify_all(); } --n_waiters; bool was_empty = 
l.empty(); l.push_back(std::make_tuple(bin_id, data, size)); if(was_empty) cv_pop.notify_all(); } bool pop(int32& bin_id, uchar* &data, uint64& size) { unique_lock<mutex> lck(mtx); cv_pop.wait(lck, [this]{return !l.empty() || !n_writers; }); if (l.empty() && !n_writers) return false; bin_id = get<0>(l.front()); data = get<1>(l.front()); size = get<2>(l.front()); l.pop_front(); return true; } void mark_completed() { lock_guard<mutex> lck(mtx); --n_writers; if (!n_writers) cv_pop.notify_all(); cv_push.notify_all(); } ~CBigBinKXmersQueue() { } }; class CBigBinSortedPartQueue { //bin_id, sub_bin_id, suff_buff, suff_buff_size, lut, lut_size, last_one_in_sub_bin typedef std::tuple<int32, int32, uchar*, uint64, uint64*, uint64, bool> data_t; typedef list<data_t> list_t; list_t l; uint32 n_writers; mutable mutex mtx; condition_variable cv_pop; public: CBigBinSortedPartQueue(uint32 _n_writers) { n_writers = _n_writers; } void push(int32 bin_id, int32 sub_bin_id, uchar* suff_buff, uint64 suff_buff_size, uint64* lut, uint64 lut_size, bool last_one_in_sub_bin) { lock_guard<mutex> lck(mtx); bool was_empty = l.empty(); l.push_back(std::make_tuple(bin_id, sub_bin_id, suff_buff, suff_buff_size, lut, lut_size, last_one_in_sub_bin)); if (was_empty) cv_pop.notify_all(); } bool pop(int32& bin_id, int32& sub_bin_id, uchar* &suff_buff, uint64& suff_buff_size, uint64* &lut, uint64 &lut_size, bool &last_one_in_sub_bin) { unique_lock<mutex> lck(mtx); cv_pop.wait(lck, [this]{return !n_writers || !l.empty(); }); if (!n_writers && l.empty()) return false; bin_id = get<0>(l.front()); sub_bin_id = get<1>(l.front()); suff_buff = get<2>(l.front()); suff_buff_size = get<3>(l.front()); lut = get<4>(l.front()); lut_size = get<5>(l.front()); last_one_in_sub_bin = get<6>(l.front()); l.pop_front(); return true; } void mark_completed() { /* pop() reads n_writers under mtx, so it must also be modified under mtx */ lock_guard<mutex> lck(mtx); --n_writers; if (!n_writers) cv_pop.notify_all(); } }; class CBigBinKmerPartQueue { typedef std::tuple<int32, uchar*, uint64, uchar*, uint64, uint64, uint64, uint64, uint64, bool> data_t; typedef list<data_t> list_t; list_t l; uint32 n_writers; mutable mutex mtx; condition_variable cv_pop; condition_variable cv_push; int32 curr_id; bool 
allow_next; public: CBigBinKmerPartQueue(uint32 _n_writers) { n_writers = _n_writers; allow_next = true; } void push(int32 bin_id, uchar* suff_buff, uint64 suff_buff_size, uchar* lut, uint64 lut_size, uint64 n_unique, uint64 n_cutoff_min, uint64 n_cutoff_max, uint64 n_total, bool last_in_bin) { unique_lock lck(mtx); cv_push.wait(lck, [this, bin_id, lut_size]{return curr_id == bin_id || allow_next; }); allow_next = false; if (last_in_bin) { allow_next = true; } curr_id = bin_id; bool was_empty = l.empty(); l.push_back(std::make_tuple(bin_id, suff_buff, suff_buff_size, lut, lut_size, n_unique, n_cutoff_min, n_cutoff_max, n_total, last_in_bin)); if (was_empty) cv_pop.notify_all(); if (allow_next) cv_push.notify_all(); } bool pop(int32& bin_id, uchar* &suff_buff, uint64& suff_buff_size, uchar* &lut, uint64& lut_size, uint64 &n_unique, uint64 &n_cutoff_min, uint64 &n_cutoff_max, uint64 &n_total, bool& last_in_bin) { unique_lock lck(mtx); cv_pop.wait(lck, [this]{return !l.empty() || !n_writers; }); if (!n_writers && l.empty()) return false; bin_id = get<0>(l.front()); suff_buff = get<1>(l.front()); suff_buff_size = get<2>(l.front()); lut = get<3>(l.front()); lut_size = get<4>(l.front()); n_unique = get<5>(l.front()); n_cutoff_min = get<6>(l.front()); n_cutoff_max = get<7>(l.front()); n_total = get<8>(l.front()); last_in_bin = get<9>(l.front()); l.pop_front(); return true; } void mark_completed() { lock_guard lck(mtx); --n_writers; if(!n_writers) cv_pop.notify_all(); } }; class CBigBinDesc { //lut_prefix_len, n_kmers, tmp_file_handle, string file_name, file_size typedef std::tuple elem_t; typedef map>> data_t; mutable mutex mtx; data_t m; int32 curr_id; public: CBigBinDesc() { curr_id = -1; } void push(int32 bin_id, int32 sub_bin_id, uint32 lut_prefix_len, uint32 n_kmers, FILE* file, string desc, uint64 file_size) { lock_guard lck(mtx); auto bin = m.find(bin_id); if (bin == m.end()) { m[bin_id].first = -1; m[bin_id].second[sub_bin_id] = std::make_tuple(lut_prefix_len, 
n_kmers, file, desc, file_size); } else { auto sub_bin = bin->second.second.find(sub_bin_id); if (sub_bin == bin->second.second.end()) { m[bin_id].second[sub_bin_id] = std::make_tuple(lut_prefix_len, n_kmers, file, desc, file_size); } else { if(lut_prefix_len) get<0>(sub_bin->second) = lut_prefix_len; get<1>(sub_bin->second) += n_kmers; if (file) { get<2>(sub_bin->second) = file; get<3>(sub_bin->second) = desc; } get<4>(sub_bin->second) += file_size; } } } bool get_n_sub_bins(int32 bin_id, uint32& size) { lock_guard lck(mtx); auto e = m.find(bin_id); if (e == m.end()) return false; size = (uint32)e->second.second.size(); return true; } bool next_bin(int32& bin_id, uint32& size) { lock_guard lck(mtx); if (m.empty()) return false; if (curr_id == -1) { curr_id = bin_id = m.begin()->first; size = (uint32)m.begin()->second.second.size(); } else { auto e = m.find(curr_id); e++; if (e == m.end()) return false; curr_id = bin_id = e->first; size = (uint32)e->second.second.size(); } return true; } void reset_reading() { lock_guard lck(mtx); curr_id = -1; for (auto& e : m) e.second.first = -1; } bool next_sub_bin(int32 bin_id, int32& sub_bin_id, uint32& lut_prefix_len, uint32& n_kmers, FILE* &file, string& desc, uint64& file_size) { lock_guard lck(mtx); auto& sub_bin = m.find(bin_id)->second; int32 curr_sub_bin_id = sub_bin.first; map::iterator e; if (curr_sub_bin_id == -1) e = sub_bin.second.begin(); else { e = sub_bin.second.find(curr_sub_bin_id); ++e; if (e == sub_bin.second.end()) return false; } sub_bin_id = sub_bin.first = e->first; lut_prefix_len = get<0>(e->second); n_kmers = get<1>(e->second); file = get<2>(e->second); desc = get<3>(e->second); file_size = get<4>(e->second); return true; } }; class CCompletedBinsCollector { list l; mutable mutex mtx; condition_variable cv_pop; uint32 n_writers; public: CCompletedBinsCollector(uint32 _n_writers) { n_writers = _n_writers; } void push(int32 bin_id) { lock_guard lck(mtx); bool was_empty = l.empty(); l.push_back(bin_id); 
if (was_empty) cv_pop.notify_all(); } bool pop(int32& bin_id) { unique_lock<mutex> lck(mtx); cv_pop.wait(lck, [this]{return !n_writers || !l.empty(); }); if (!n_writers && l.empty()) return false; bin_id = l.front(); l.pop_front(); return true; } void mark_completed() { /* pop() reads n_writers under mtx, so it must also be modified under mtx */ lock_guard<mutex> lck(mtx); --n_writers; if (!n_writers) cv_pop.notify_all(); } }; class CDiskLogger { uint64 current; uint64 max; mutable mutex mtx; public: CDiskLogger() { current = max = 0; } void log_write(uint64 _size) { lock_guard<mutex> lck(mtx); current += _size; if (current > max) max = current; } void log_remove(uint64 _size) { lock_guard<mutex> lck(mtx); current -= _size; } uint64 get_max() { lock_guard<mutex> lck(mtx); return max; } uint64 get_current() { lock_guard<mutex> lck(mtx); return current; } }; #endif // ***** EOF KMC-2.3/kmer_counter/radix.cpp000066400000000000000000000217161257432033000163060ustar00rootroot00000000000000#include "stdafx.h" /* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include <omp.h> #include "radix.h" //---------------------------------------------------------------------------------- /*Parallel radix sort. The input data to be sorted are divided evenly among threads. Each thread is responsible for building a local histogram to enable sorting keys according to a given digit. Then a global histogram is created by combining the local ones, and the write offset (location) to which each digit value should be written is computed. 
Finally, threads scatter the data to the appropriate locations.*/ template<typename COUNTER_TYPE> void RadixOMP_uint8(uint32 *SourcePtr, uint32 *DestPtr, const int64 SourceSize, unsigned rec_size, unsigned data_offset, unsigned data_size, const unsigned n_phases, const unsigned n_threads) { /* SourceSize - number of records */ /* rec_size - in bytes */ /* data_offset - in bytes*/ /* data_size - in bytes - not used now */ #ifdef WIN32 __declspec( align( WIN_ALIGNMENT ) ) COUNTER_TYPE ByteCounter[MAX_NUM_THREADS][256]; #else COUNTER_TYPE ByteCounter[MAX_NUM_THREADS][256] __attribute__((aligned(ALIGNMENT))); #endif #ifdef WIN32 __declspec( align( WIN_ALIGNMENT ) ) COUNTER_TYPE globalHisto[256]; #else COUNTER_TYPE globalHisto[256] __attribute__((aligned(ALIGNMENT))); #endif #pragma omp parallel num_threads(n_threads) { int myID = omp_get_thread_num(); uint8_t ByteIndex = 0; long long i; COUNTER_TYPE prevSum; COUNTER_TYPE temp; uint32 n; int private_i; int byteValue; int64 SourceSize_in_bytes = SourceSize * rec_size; uint8_t *char_ptr_tempSource = (uint8_t*)(SourcePtr); uint8_t *char_ptr_tempDest = (uint8_t*)(DestPtr); uint8_t *char_tempPtr; #ifdef WIN32 __declspec( align( WIN_ALIGNMENT ) ) COUNTER_TYPE privateByteCounter[256] = {0}; #else __attribute__((aligned(ALIGNMENT))) COUNTER_TYPE privateByteCounter[256] = {0}; #endif for(uint32 privatePhaseCounter = 0; privatePhaseCounter < n_phases; privatePhaseCounter++) { #pragma omp for private(i) schedule(static) for(i = data_offset; i < SourceSize_in_bytes; i = i + rec_size) { byteValue = *(&char_ptr_tempSource[i] + ByteIndex); ++privateByteCounter[byteValue]; } A_memcpy(&ByteCounter[myID][0], privateByteCounter, sizeof(privateByteCounter)); #pragma omp barrier #pragma omp for schedule(static) for(i = 0; i < 256; ++i) { prevSum = 0; for(n = 0; n < n_threads; n++) { temp = ByteCounter[n][i]; ByteCounter[n][i] = prevSum; prevSum += temp; } globalHisto[i] = prevSum; } #pragma omp single { prevSum = 0; for(i = 0; i < 256; ++i) { temp = 
globalHisto[i]; globalHisto[i] = prevSum; prevSum += temp; } } for (private_i = 0; private_i < 256; private_i++) ByteCounter[myID][private_i] += globalHisto[private_i]; A_memcpy(privateByteCounter, &ByteCounter[myID][0], sizeof(privateByteCounter)); #pragma omp for schedule(static) for(i = data_offset; i < SourceSize_in_bytes; i = i + rec_size) { byteValue = *(&char_ptr_tempSource[i] + ByteIndex); memcpy(&char_ptr_tempDest[privateByteCounter[byteValue] * rec_size], &char_ptr_tempSource[i - data_offset], rec_size); (privateByteCounter[byteValue])++; } #pragma omp barrier char_tempPtr = char_ptr_tempDest; char_ptr_tempDest = char_ptr_tempSource; char_ptr_tempSource = char_tempPtr; ByteIndex++; memset(privateByteCounter, 0, sizeof(privateByteCounter)); } } } //---------------------------------------------------------------------------------- void RadixSort_uint8(uint32 *&data_ptr, uint32 *&tmp_ptr, uint64 size, unsigned rec_size, unsigned data_offset, unsigned data_size, const unsigned n_phases, const unsigned n_threads) { if(size * rec_size >= (1ull << 32)) RadixOMP_uint8<uint64>(data_ptr, tmp_ptr, size, rec_size, data_offset, data_size, n_phases, n_threads); else RadixOMP_uint8<uint32>(data_ptr, tmp_ptr, size, rec_size, data_offset, data_size, n_phases, n_threads); } //---------------------------------------------------------------------------------- /*Parallel radix sort. Parallelization scheme taken from Satish, N., Kim, C., Chhugani, J., Nguyen, A.D., Lee, V.W., Kim, D., Dubey, P. (2010). Fast Sort on CPUs and GPUs. A Case for Bandwidth Oblivious SIMD Sort. Proc. of the 2010 Int. Conf. on Management of data, pp. 351–362. Using software-managed buffers in the writing phase reduces the influence of irregular memory accesses. 
Because the number of cache conflict misses is reduced, better efficiency is reached.*/ template<typename COUNTER_TYPE, typename INT_TYPE> void RadixOMP_buffer(CMemoryPool *pmm_radix_buf, uint64 *Source, uint64 *Dest, const int64 SourceSize, const unsigned n_phases, const unsigned n_threads) { #ifdef WIN32 __declspec( align( WIN_ALIGNMENT ) ) COUNTER_TYPE ByteCounter[MAX_NUM_THREADS][256]; #else COUNTER_TYPE ByteCounter[MAX_NUM_THREADS][256] __attribute__((aligned(ALIGNMENT))); #endif #ifdef WIN32 __declspec( align( WIN_ALIGNMENT ) ) COUNTER_TYPE globalHisto[256]; #else COUNTER_TYPE globalHisto[256] __attribute__((aligned(ALIGNMENT))); #endif #pragma omp parallel num_threads(n_threads) { int myID = omp_get_thread_num(); uint8_t ByteIndex = 0; long long i; COUNTER_TYPE prevSum; COUNTER_TYPE temp; uint32 n; int index_x; int private_i; int byteValue; uint64 *tempSource = Source; uint64 *tempDest = Dest; uint64 *tempPtr; uint64 *raw_Buffer; pmm_radix_buf->reserve(raw_Buffer); uint64 *Buffer = raw_Buffer; while(((unsigned long long) Buffer) % ALIGNMENT) Buffer++; #ifdef WIN32 __declspec( align( WIN_ALIGNMENT ) ) COUNTER_TYPE privateByteCounter[256] = {0}; #else __attribute__((aligned(ALIGNMENT))) COUNTER_TYPE privateByteCounter[256] = {0}; #endif for(uint32 privatePhaseCounter = 0; privatePhaseCounter < n_phases; privatePhaseCounter++) { #pragma omp for private(i) schedule(static) for(i = 0; i < SourceSize; ++i) { byteValue = *(reinterpret_cast<const uint8_t*>(&tempSource[i]) + ByteIndex); ++privateByteCounter[byteValue]; } A_memcpy(&ByteCounter[myID][0], privateByteCounter, sizeof(privateByteCounter)); #pragma omp barrier #pragma omp for schedule(static) for(i = 0; i < 256; ++i) { prevSum = 0; for(n = 0; n < n_threads; n++) { temp = ByteCounter[n][i]; ByteCounter[n][i] = prevSum; prevSum += temp; } globalHisto[i] = prevSum; } #pragma omp single { prevSum = 0; for(i = 0; i < 256; ++i) { temp = globalHisto[i]; globalHisto[i] = prevSum; prevSum += temp; } } for (private_i = 0; private_i < 256; private_i++) ByteCounter[myID][private_i] 
+= globalHisto[private_i]; A_memcpy(privateByteCounter, &ByteCounter[myID][0], sizeof(privateByteCounter)); #pragma omp for schedule(static) for(i = 0; i < SourceSize; ++i) { byteValue = *(reinterpret_cast<const uint8_t*>(&tempSource[i]) + ByteIndex); index_x = privateByteCounter[byteValue] % BUFFER_WIDTH; Buffer[byteValue * BUFFER_WIDTH + index_x] = tempSource[i]; privateByteCounter[byteValue]++; if(index_x == (BUFFER_WIDTH -1)) A_memcpy ( &tempDest[privateByteCounter[byteValue] - (BUFFER_WIDTH)], &Buffer[byteValue * BUFFER_WIDTH], BUFFER_WIDTH *sizeof(uint64) ); } //end_for INT_TYPE elemInBuffer; INT_TYPE index_stop; INT_TYPE index_start; INT_TYPE elemWrittenIntoBuffer; for(private_i = 0; private_i < 256; private_i++) { index_stop = privateByteCounter[private_i] % BUFFER_WIDTH; index_start = ByteCounter[myID][private_i] % BUFFER_WIDTH; elemWrittenIntoBuffer = privateByteCounter[private_i] - ByteCounter[myID][private_i]; if((index_stop - elemWrittenIntoBuffer) <= 0) elemInBuffer = index_stop; else elemInBuffer = index_stop - index_start; if(elemInBuffer != 0) A_memcpy ( &tempDest[privateByteCounter[private_i] - elemInBuffer], &Buffer[private_i * BUFFER_WIDTH + (privateByteCounter[private_i] - elemInBuffer)%BUFFER_WIDTH], (elemInBuffer)*sizeof(uint64) ); } #pragma omp barrier tempPtr = tempDest; tempDest = tempSource; tempSource = tempPtr; ByteIndex++; memset(privateByteCounter, 0, sizeof(privateByteCounter)); } pmm_radix_buf->free(raw_Buffer); } } //---------------------------------------------------------------------------------- void RadixSort_buffer(CMemoryPool *pmm_radix_buf, uint64 *&data, uint64 *&tmp, uint64 size, const unsigned n_phases, const unsigned n_threads) { if(size >= (1ull << 31)) RadixOMP_buffer<uint64, int64>(pmm_radix_buf, data, tmp, size, n_phases, n_threads); else RadixOMP_buffer<uint32, int32>(pmm_radix_buf, data, tmp, size, n_phases, n_threads); } // ***** EOF KMC-2.3/kmer_counter/radix.h000066400000000000000000000021311257432033000157410ustar00rootroot00000000000000/* This file is a 
part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _RADIX_H #define _RADIX_H #include #include #include #include #include #include #include "asmlib_wrapper.h" #include "defs.h" #include "queues.h" #ifdef WIN32 typedef unsigned __int8 uint8_t; #else #include #endif #define MAX_NUM_THREADS 32 #define BUFFER_WIDTH 32 #define ALIGNMENT 0x100 #define WIN_ALIGNMENT 64 #define shift_BUFFER_WIDTH 5 #define BUFFER_WIDTH_MINUS_1 31 #define BUFFER_WIDTH_MUL_sizeof_UINT 256 void RadixSort_uint8(uint32 *&data_ptr, uint32 *&tmp_ptr, uint64 size, unsigned rec_size, unsigned data_offset, unsigned data_size, const unsigned n_phases, const unsigned n_threads); void RadixSort_buffer(CMemoryPool *pmm_radix_buf, uint64 *&data, uint64 *&tmp, uint64 size, const unsigned n_phases, const unsigned n_threads); #endif // ***** EOF KMC-2.3/kmer_counter/rev_byte.cpp000066400000000000000000000005551257432033000170140ustar00rootroot00000000000000#include "stdafx.h" /* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #include "rev_byte.h" uchar CRev_byte::lut[256]; CRev_byte::_si CRev_byte::_init;KMC-2.3/kmer_counter/rev_byte.h000066400000000000000000000011011257432033000164450ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _REV_BYTE_H #define _REV_BYTE_H #include "defs.h" struct CRev_byte { static uchar lut[256]; struct _si { _si() { for (uint32 i = 0; i < 256; ++i) lut[i] = ((3 - (i & 3)) << 6) + ((3 - ((i >> 2) & 3)) << 4) + ((3 - ((i >> 4) & 3)) << 2) + (3 - ((i >> 6) & 3)); } }static _init; }; #endifKMC-2.3/kmer_counter/s_mapper.h000066400000000000000000000064031257432033000164460ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _S_MAPPER_H #define _S_MAPPER_H #include "defs.h" #include "mmer.h" #include "params.h" #ifdef DEVELOP_MODE #include "develop.h" #endif class CSignatureMapper { uint32 map_size; int32* signature_map; uint32 signature_len; uint32 special_signature; CMemoryPool* pmm_stats; uint32 n_bins; class Comp { uint32* signature_occurences; public: Comp(uint32* _signature_occurences) : signature_occurences(_signature_occurences){} bool operator()(int i, int j) { return signature_occurences[i] > signature_occurences[j]; } }; public: void Init(uint32* stats) { uint32 *sorted; pmm_stats->reserve(sorted); for (uint32 i = 0; i < map_size ; ++i) sorted[i] = i; sort(sorted, sorted + map_size, Comp(stats)); list> _stats; for (uint32 i = 0; i < map_size ; ++i) { if (CMmer::is_allowed(sorted[i], signature_len)) _stats.push_back(make_pair(sorted[i], stats[sorted[i]])); } list> group; uint32 bin_no = 0; //counting sum double sum = 0.0; for (auto &i : _stats) { i.second += 1000; sum += i.second; } double mean = sum / n_bins; double max_bin_size = 1.1 * mean; uint32 n = n_bins - 1; //one is needed for disabled signatures uint32 max_bins = n_bins - 1; while (_stats.size() > n) { 
pair& max = _stats.front(); if (max.second > mean) { signature_map[max.first] = bin_no++; sum -= max.second; mean = sum / (max_bins - bin_no); max_bin_size = 1.1 * mean; _stats.pop_front(); --n; } else { //heuristic group.clear(); double tmp_sum = 0.0; uint32 in_current = 0; for (auto it = _stats.begin(); it != _stats.end();) { if (tmp_sum + it->second < max_bin_size) { tmp_sum += it->second; group.push_back(*it); it = _stats.erase(it); ++in_current; } else ++it; } for (auto i = group.begin(); i != group.end(); ++i) { signature_map[i->first] = bin_no; } --n; ++bin_no; sum -= tmp_sum; mean = sum / (max_bins - bin_no); max_bin_size = 1.1 * mean; } } if (_stats.size() > 0) { for (auto i = _stats.begin(); i != _stats.end(); ++i) { signature_map[i->first] = bin_no++; //cout << "rest bin: " << i->second << "\n"; } } signature_map[special_signature] = bin_no; pmm_stats->free(sorted); #ifdef DEVELOP_MODE map_log(signature_len, map_size, signature_map); #endif } CSignatureMapper(CMemoryPool* _pmm_stats, uint32 _signature_len, uint32 _n_bins) { n_bins = _n_bins; pmm_stats = _pmm_stats; signature_len = _signature_len; special_signature = 1 << 2 * signature_len; map_size = (1 << 2 * signature_len) + 1; signature_map = new int32[map_size]; fill_n(signature_map, map_size, -1); } inline int32 get_bin_id(uint32 signature) { return signature_map[signature]; } inline int32 get_max_bin_no() { return signature_map[special_signature]; } ~CSignatureMapper() { delete [] signature_map; } }; #endif KMC-2.3/kmer_counter/small_k_buf.h000066400000000000000000000015661257432033000171230ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. 
The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _SMALL_K_BUF #define _SMALL_K_BUF #include "defs.h" template struct CSmallKBuf { COUNTER_TYPE* buf; void Store(uint64 index, uchar* _buf, uint32& buf_pos, uint64 counter_size) { for (uint64 j = 0; j < counter_size; ++j) _buf[buf_pos++] = (buf[index] >> (j * 8)) & 0xFF; } }; template<> struct CSmallKBuf { float* buf; void Store(uint64 index, uchar* _buf, uint32& buf_pos, uint64 counter_size)//counter_size should be always 4 here { uint32 c; memcpy(&c, &buf[index], 4); for (int32 j = 0; j < 4; ++j) _buf[buf_pos++] = (c >> (j * 8)) & 0xFF; } }; #endifKMC-2.3/kmer_counter/splitter.h000066400000000000000000001034571257432033000165150ustar00rootroot00000000000000/* This file is a part of KMC software distributed under GNU GPL 3 licence. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Marek Kokot Version: 2.3.0 Date : 2015-08-21 */ #ifndef _SPLITTER_H #define _SPLITTER_H #include "defs.h" #include "kmer.h" #include "kb_storer.h" #include "kb_collector.h" #include "prob_qual.h" #include "kb_reader.h" #include "kb_sorter.h" #include "kb_completer.h" #include "queues.h" #include "s_mapper.h" #include "mmer.h" #include #include #include #include "small_k_buf.h" using namespace std; //************************************************************************************************************ //************************************************************************************************************ template class CSplitter_Impl; //************************************************************************************************************ // CSplitter class - splits kmers into bins according to their prefix //************************************************************************************************************ template class 
CSplitter { CMemoryMonitor *mm; uint64 total_kmers = 0; //CExKmer ex_kmer; uchar *part; uint64 part_size, part_pos; CKmerBinCollector **bins; CBinPartQueue *bin_part_queue; CBinDesc *bd; CMemoryPool *pmm_reads; int64 mem_part_pmm_bins; int64 mem_part_pmm_reads; char codes[256]; bool use_quake; input_type file_type; int lowest_quality; bool both_strands; uint32 kmer_len; //uint32 prefix_len; uint32 signature_len; uint32 n_bins; uint64 n_reads;//for multifasta its a sequences counter CSignatureMapper* s_mapper; inline bool GetSeq(char *seq, uint32 &seq_size); inline bool GetSeq(char *seq, char *quals, uint32 &seq_size); friend class CSplitter_Impl; public: inline void CalcStats(uchar* _part, uint64 _part_size, uint32* _stats); static uint32 MAX_LINE_SIZE; CSplitter(CKMCParams &Params, CKMCQueues &Queues); void InitBins(CKMCParams &Params, CKMCQueues &Queues); bool ProcessReads(uchar *_part, uint64 _part_size); template bool ProcessReadsSmallK(uchar *_part, uint64 _part_size, CSmallKBuf& small_k_buf); void Complete(); void GetTotal(uint64 &_n_reads); uint64 GetTotalKmers(); ~CSplitter(); }; template uint32 CSplitter::MAX_LINE_SIZE = 1 << 14; //************************************************************************************************************ // Implementation of ProcessReads and Complete methods for various types and sizes of kmer class //************************************************************************************************************ template class CSplitter_Impl { public: static bool ProcessReads(CSplitter &ptr, uchar *_part, uint64 _part_size); template static bool ProcessReadsSmallK(CSplitter &ptr, uchar *_part, uint64 _part_size, CSmallKBuf& small_k_buf); }; template <> class CSplitter_Impl { public: static bool ProcessReads(CSplitter &ptr, uchar *_part, uint64 _part_size); template static bool ProcessReadsSmallK(CSplitter &ptr, uchar *_part, uint64 _part_size, CSmallKBuf& small_k_buf); }; template <> class CSplitter_Impl { public: static bool 
ProcessReads(CSplitter &ptr, uchar *_part, uint64 _part_size); static bool ProcessReadsSmallK(CSplitter &ptr, uchar *_part, uint64 _part_size, CSmallKBuf& small_k_buf); }; //---------------------------------------------------------------------------------- // Return a single record from FASTA/FASTQ data template bool CSplitter::GetSeq(char *seq, uint32 &seq_size) { uchar c = 0; uint32 pos = 0; if(file_type == fasta) { // Title if(part_pos >= part_size) return false; c = part[part_pos++]; if(c != '>') return false; for(; part_pos < part_size;) { c = part[part_pos++]; if(c < 32) // newliners break; } if(part_pos >= part_size) return false; c = part[part_pos++]; if(c >= 32) part_pos--; else if(part_pos >= part_size) return false; // Sequence for(; part_pos < part_size;) { c = part[part_pos++]; if(c < 32) // newliners break; seq[pos++] = codes[c]; } seq_size = pos; if(part_pos >= part_size) return true; if(part[part_pos++] >= 32) part_pos--; else if(part_pos >= part_size) return true; } else if(file_type == fastq) { // Title if(part_pos >= part_size) return false; c = part[part_pos++]; if(c != '@') return false; for(; part_pos < part_size;) { c = part[part_pos++]; if(c < 32) // newliners break; } if(part_pos >= part_size) return false; c = part[part_pos++]; if(c >= 32) part_pos--; else if(part_pos >= part_size) return false; // Sequence for(; part_pos < part_size;) { c = part[part_pos++]; if(c < 32) // newliners break; seq[pos++] = codes[c]; } if(part_pos >= part_size) return false; c = part[part_pos++]; if(c >= 32) part_pos--; else if(part_pos >= part_size) return false; // Plus c = part[part_pos++]; if(part_pos >= part_size) return false; if(c != '+') return false; for(; part_pos < part_size;) { c = part[part_pos++]; if(c < 32) // newliners break; } if(part_pos >= part_size) return false; c = part[part_pos++]; if(c >= 32) part_pos--; else if(part_pos >= part_size) return false; // Quality part_pos += pos; if(part_pos >= part_size) return false; c = part[part_pos++]; 
seq_size = pos; if(part_pos >= part_size) return true; if(part[part_pos++] >= 32) part_pos--; else if(part_pos >= part_size) return true; } else if(file_type == multiline_fasta) { if(part_pos >= part_size) return false; if(part[part_pos] == '>')//need to ommit header { ++n_reads; for(;part_pos < part_size && part[part_pos] != '\n' && part[part_pos] != '\r';++part_pos);//find EOF ++part_pos; if(part[part_pos] == '\n' || part[part_pos] == '\r') ++part_pos; } for(;part_pos < part_size && pos < mem_part_pmm_reads && part[part_pos] != '>';) { seq[pos++] = codes[part[part_pos++]]; } seq_size = pos; if(part_pos < part_size && part[part_pos] != '>')//need to copy last k-1 kmers { part_pos -= kmer_len - 1; } return true; } return (c == '\n' || c == '\r'); } //---------------------------------------------------------------------------------- // Return a single record with quality codes from FASTA/FASTQ data template bool CSplitter::GetSeq(char *seq, char *quals, uint32 &seq_size) { uchar c; uint32 pos = 0; if(file_type == fasta || file_type == multiline_fasta) { return false; // FASTA file does not store quality values } else { // Title if(part_pos >= part_size) return false; c = part[part_pos++]; if(c != '@') return false; for(; part_pos < part_size;) { c = part[part_pos++]; if(c < 32) // newliners break; } if(part_pos >= part_size) return false; c = part[part_pos++]; if(c >= 32) part_pos--; else if(part_pos >= part_size) return false; // Sequence for(; part_pos < part_size;) { c = part[part_pos++]; if(c < 32) // newliners break; seq[pos++] = codes[c]; } if(part_pos >= part_size) return false; c = part[part_pos++]; if(c >= 32) part_pos--; else if(part_pos >= part_size) return false; // Plus c = part[part_pos++]; if(part_pos >= part_size) return false; if(c != '+') return false; for(; part_pos < part_size;) { c = part[part_pos++]; if(c < 32) // newliners break; } if(part_pos >= part_size) return false; c = part[part_pos++]; if(c >= 32) part_pos--; else if(part_pos >= 
part_size) return false; // Quality copy(part+part_pos, part+part_pos+pos, quals); part_pos += pos; if(part_pos >= part_size) return false; c = part[part_pos++]; seq_size = pos; if(part_pos >= part_size) return true; if(part[part_pos++] >= 32) part_pos--; else if(part_pos >= part_size) return true; } return (c == '\n' || c == '\r'); } template void CSplitter::CalcStats(uchar* _part, uint64 _part_size, uint32* _stats) { part = _part; part_size = _part_size; part_pos = 0; char *seq; uint32 seq_size; pmm_reads->reserve(seq); uint32 signature_start_pos; CMmer current_signature(signature_len), end_mmer(signature_len); uint32 i; uint32 len;//length of extended kmer while (GetSeq(seq, seq_size)) { i = 0; len = 0; while (i + kmer_len - 1 < seq_size) { bool contains_N = false; //building first signature after 'N' or at the read begining for (uint32 j = 0; j < signature_len; ++j, ++i) if (seq[i] < 0)//'N' { contains_N = true; break; } //signature must be shorter than k-mer so if signature contains 'N', k-mer will contains it also if (contains_N) { ++i; continue; } len = signature_len; signature_start_pos = i - signature_len; current_signature.insert(seq + signature_start_pos); end_mmer.set(current_signature); for (; i < seq_size; ++i) { if (seq[i] < 0)//'N' { if (len >= kmer_len) _stats[current_signature.get()] += 1 + len - kmer_len; len = 0; ++i; break; } end_mmer.insert(seq[i]); if (end_mmer < current_signature)//signature at the end of current k-mer is lower than current { if (len >= kmer_len) { _stats[current_signature.get()] += 1 + len - kmer_len; len = kmer_len - 1; } current_signature.set(end_mmer); signature_start_pos = i - signature_len + 1; } else if (end_mmer == current_signature) { current_signature.set(end_mmer); signature_start_pos = i - signature_len + 1; } else if (signature_start_pos + kmer_len - 1 < i)//need to find new signature { _stats[current_signature.get()] += 1 + len - kmer_len; len = kmer_len - 1; //looking for new signature ++signature_start_pos; 
//building first signature in current k-mer end_mmer.insert(seq + signature_start_pos); current_signature.set(end_mmer); for (uint32 j = signature_start_pos + signature_len; j <= i; ++j) { end_mmer.insert(seq[j]); if (end_mmer <= current_signature) { current_signature.set(end_mmer); signature_start_pos = j - signature_len + 1; } } } ++len; } } if (len >= kmer_len)//last one in read _stats[current_signature.get()] += 1 + len - kmer_len; } putchar('*'); fflush(stdout); pmm_reads->free(seq); } //---------------------------------------------------------------------------------- // Assigns queues and monitors template CSplitter::CSplitter(CKMCParams &Params, CKMCQueues &Queues) { mm = Queues.mm; file_type = Params.file_type; use_quake = Params.use_quake; lowest_quality = Params.lowest_quality; both_strands = Params.both_strands; bin_part_queue = Queues.bpq; bd = Queues.bd; pmm_reads = Queues.pmm_reads; kmer_len = Params.kmer_len; signature_len = Params.signature_len; mem_part_pmm_bins = Params.mem_part_pmm_bins; mem_part_pmm_reads = Params.mem_part_pmm_reads; s_mapper = Queues.s_mapper; part = NULL; // Prepare encoding of symbols for(int i = 0; i < 256; ++i) codes[i] = -1; codes['A'] = codes['a'] = 0; codes['C'] = codes['c'] = 1; codes['G'] = codes['g'] = 2; codes['T'] = codes['t'] = 3; n_reads = 0; bins = NULL; } template void CSplitter::InitBins(CKMCParams &Params, CKMCQueues &Queues) { n_bins = Params.n_bins; uint32 buffer_size = Params.bin_part_size; // Create objects for all bin bins = new CKmerBinCollector*[n_bins]; for (uint32 i = 0; i < n_bins; ++i) { bins[i] = new CKmerBinCollector(Queues, Params, buffer_size, i); bd->insert(i, NULL, "", 0, 0, 0, 0, buffer_size, kmer_len); } } //---------------------------------------------------------------------------------- // Release memory template CSplitter::~CSplitter() { if (bins) { for (uint32 i = 0; i < n_bins; ++i) if (bins[i]) delete bins[i]; delete[] bins; } } 
//----------------------------------------------------------------------------------
// Finish the processing of input file
template void CSplitter::Complete()
{
	if (bins)
		for (uint32 i = 0; i < n_bins; ++i)
			if(bins[i])
				bins[i]->Flush();
}

//----------------------------------------------------------------------------------
// Process the reads from the given FASTQ file part in small k optimization mode
template
template bool CSplitter::ProcessReadsSmallK(uchar *_part, uint64 _part_size, CSmallKBuf& small_k_buf)
{
	return CSplitter_Impl::ProcessReadsSmallK(*this, _part, _part_size, small_k_buf);
}

//----------------------------------------------------------------------------------
// Process the reads from the given FASTQ file part
template bool CSplitter::ProcessReads(uchar *_part, uint64 _part_size)
{
	return CSplitter_Impl::ProcessReads(*this, _part, _part_size);
}

//----------------------------------------------------------------------------------
// Return the number of reads processed by splitter
template void CSplitter::GetTotal(uint64 &_n_reads)
{
	_n_reads = n_reads;
}

//----------------------------------------------------------------------------------
// Return the number of kmers processed by splitter (!!!
only for small k optimization) template uint64 CSplitter::GetTotalKmers() { return total_kmers; } //************************************************************************************************************ // Implementation of specific splitter methods for various types and sizes of kmers //************************************************************************************************************ //---------------------------------------------------------------------------------- // Process the reads from the given FASTQ file part in small k optimization mode template bool CSplitter_Impl::ProcessReadsSmallK(CSplitter &ptr, uchar *_part, uint64 _part_size, CSmallKBuf& small_k_buf) { ptr.part = _part; ptr.part_size = _part_size; ptr.part_pos = 0; char *seq; uint32 seq_size; int omit_next_n_kmers; CKmer<1> kmer_str, kmer_rev, kmer_can; uint32 i; CKmer<1> kmer_mask; ptr.pmm_reads->reserve(seq); kmer_mask.set_n_1(2 * ptr.kmer_len); uint32 kmer_len_shift = (ptr.kmer_len - 1) * 2; if (ptr.both_strands) while (ptr.GetSeq(seq, seq_size)) { if (ptr.file_type != multiline_fasta) ptr.n_reads++; // Init k-mer kmer_str.clear(); kmer_rev.clear(); // Process first k-1 symbols of a read uint32 str_pos = kmer_len_shift - 2; uint32 rev_pos = 2; omit_next_n_kmers = 0; for (i = 0; i < ptr.kmer_len - 1; ++i, str_pos -= 2, rev_pos += 2) { if (seq[i] < 0) { seq[i] = 0; omit_next_n_kmers = i + 1; } kmer_str.set_2bits(seq[i], str_pos); kmer_rev.set_2bits(3 - seq[i], rev_pos); } // Process next part of a read for (; i < seq_size; ++i) { if (seq[i] < 0) // N in a read { seq[i] = 0; omit_next_n_kmers = ptr.kmer_len; // Mark how many symbols to ommit to get the next kmer without any N } kmer_str.SHL_insert_2bits(seq[i]); kmer_str.mask(kmer_mask); kmer_rev.SHR_insert_2bits(3 - seq[i], kmer_len_shift); // If necessary ommit next symbols if (omit_next_n_kmers > 0) { omit_next_n_kmers--; continue; } // Find canonical kmer representation kmer_can = (kmer_str < kmer_rev) ? 
kmer_str : kmer_rev; ++small_k_buf.buf[kmer_can.data]; ++ptr.total_kmers; } } else while (ptr.GetSeq(seq, seq_size)) { if (ptr.file_type != multiline_fasta) ptr.n_reads++; // Init k-mer kmer_str.clear(); // Process first k-1 symbols of a read uint32 str_pos = kmer_len_shift - 2; omit_next_n_kmers = 0; for (i = 0; i < ptr.kmer_len - 1; ++i, str_pos -= 2) { if (seq[i] < 0) { seq[i] = 0; omit_next_n_kmers = i + 1; } kmer_str.set_2bits(seq[i], str_pos); } // Process next part of a read for (; i < seq_size; ++i) { if (seq[i] < 0) // N in a read { seq[i] = 0; omit_next_n_kmers = ptr.kmer_len; // Mark how many symbols to ommit to get the next kmer without any N } kmer_str.SHL_insert_2bits(seq[i]); kmer_str.mask(kmer_mask); // If necessary ommit next symbols if (omit_next_n_kmers > 0) { omit_next_n_kmers--; continue; } ++small_k_buf.buf[kmer_str.data]; ++ptr.total_kmers; } } putchar('*'); fflush(stdout); ptr.pmm_reads->free(seq); return true; } //---------------------------------------------------------------------------------- // Process the reads from the given FASTQ file part bool CSplitter_Impl::ProcessReads(CSplitter &ptr, uchar *_part, uint64 _part_size) { ptr.part = _part; ptr.part_size = _part_size; ptr.part_pos = 0; char *seq; uint32 seq_size; ptr.pmm_reads->reserve(seq); uint32 signature_start_pos; CMmer current_signature(ptr.signature_len), end_mmer(ptr.signature_len); uint32 bin_no; uint32 i; uint32 len;//length of extended kmer while (ptr.GetSeq(seq, seq_size)) { if (ptr.file_type != multiline_fasta) ptr.n_reads++; i = 0; len = 0; while (i + ptr.kmer_len - 1 < seq_size) { bool contains_N = false; //building first signature after 'N' or at the read begining for (uint32 j = 0; j < ptr.signature_len; ++j, ++i) if (seq[i] < 0)//'N' { contains_N = true; break; } //signature must be shorter than k-mer so if signature contains 'N', k-mer will contains it also if (contains_N) { ++i; continue; } len = ptr.signature_len; signature_start_pos = i - ptr.signature_len; 
current_signature.insert(seq + signature_start_pos); end_mmer.set(current_signature); for (; i < seq_size; ++i) { if (seq[i] < 0)//'N' { if (len >= ptr.kmer_len) { bin_no = ptr.s_mapper->get_bin_id(current_signature.get()); ptr.bins[bin_no]->PutExtendedKmer(seq + i - len, len); } len = 0; ++i; break; } end_mmer.insert(seq[i]); if (end_mmer < current_signature)//signature at the end of current k-mer is lower than current { if (len >= ptr.kmer_len) { bin_no = ptr.s_mapper->get_bin_id(current_signature.get()); ptr.bins[bin_no]->PutExtendedKmer(seq + i - len, len); len = ptr.kmer_len - 1; } current_signature.set(end_mmer); signature_start_pos = i - ptr.signature_len + 1; } else if (end_mmer == current_signature) { current_signature.set(end_mmer); signature_start_pos = i - ptr.signature_len + 1; } else if (signature_start_pos + ptr.kmer_len - 1 < i)//need to find new signature { bin_no = ptr.s_mapper->get_bin_id(current_signature.get()); ptr.bins[bin_no]->PutExtendedKmer(seq + i - len, len); len = ptr.kmer_len - 1; //looking for new signature ++signature_start_pos; //building first signature in current k-mer end_mmer.insert(seq + signature_start_pos); current_signature.set(end_mmer); for (uint32 j = signature_start_pos + ptr.signature_len; j <= i; ++j) { end_mmer.insert(seq[j]); if (end_mmer <= current_signature) { current_signature.set(end_mmer); signature_start_pos = j - ptr.signature_len + 1; } } } ++len; if (len == ptr.kmer_len + 255) //one byte is used to store counter of additional symbols in extended k-mer { bin_no = ptr.s_mapper->get_bin_id(current_signature.get()); ptr.bins[bin_no]->PutExtendedKmer(seq + i + 1 - len, len); i -= ptr.kmer_len - 2; len = 0; break; } } } if (len >= ptr.kmer_len)//last one in read { bin_no = ptr.s_mapper->get_bin_id(current_signature.get()); ptr.bins[bin_no]->PutExtendedKmer(seq + i - len, len); } } putchar('*'); fflush(stdout); ptr.pmm_reads->free(seq); return true; } bool CSplitter_Impl::ProcessReadsSmallK(CSplitter &ptr, uchar 
*_part, uint64 _part_size, CSmallKBuf& small_k_buf) { ptr.part = _part; ptr.part_size = _part_size; ptr.part_pos = 0; char *seq; char *quals; double *raw_inv_probs; ptr.pmm_reads->reserve(seq); ptr.pmm_reads->reserve(quals); ptr.pmm_reads->reserve(raw_inv_probs); double *inv_probs = raw_inv_probs + 1; inv_probs[-1] = 1.0; // !!! Correct uint32 seq_size; int omit_next_n_kmers; CKmer<1> kmer_str, kmer_rev, kmer_can; double kmer_prob; uint32 i; CKmer<1> kmer_mask; kmer_mask.set_n_1(2 * ptr.kmer_len); uint32 kmer_len_shift = (ptr.kmer_len - 1) * 2; if (ptr.both_strands) while (ptr.GetSeq(seq, quals, seq_size)) { ptr.n_reads++; // Init k-mer kmer_str.clear(); kmer_rev.clear(); // Process first k-1 symbols of a read uint32 str_pos = kmer_len_shift - 2; uint32 rev_pos = 2; omit_next_n_kmers = 0; kmer_prob = 1.0; for (i = 0; i < ptr.kmer_len - 1; ++i, str_pos -= 2, rev_pos += 2) { if (seq[i] < 0) { seq[i] = 0; omit_next_n_kmers = i + 1; } inv_probs[i] = CProbQual::inv_prob_qual[quals[i] - ptr.lowest_quality]; kmer_str.set_2bits(seq[i], str_pos); kmer_rev.set_2bits(3 - seq[i], rev_pos); kmer_prob *= CProbQual::prob_qual[quals[i] - ptr.lowest_quality]; } // Process next part of a read for (; i < seq_size; ++i) { if (seq[i] < 0) // N in a read { seq[i] = 0; omit_next_n_kmers = ptr.kmer_len; // Mark how many symbols to ommit to get the next kmer without any N } inv_probs[i] = CProbQual::inv_prob_qual[quals[i] - ptr.lowest_quality]; kmer_str.SHL_insert_2bits(seq[i]); kmer_str.mask(kmer_mask); kmer_rev.SHR_insert_2bits(3 - seq[i], kmer_len_shift); kmer_prob *= CProbQual::prob_qual[quals[i] - ptr.lowest_quality] * inv_probs[(int)i - (int)ptr.kmer_len]; // If necessary ommit next symbols if (omit_next_n_kmers > 0) { omit_next_n_kmers--; continue; } if (kmer_prob < CProbQual::MIN_PROB_QUAL_VALUE) continue; // Find canonical kmer representation kmer_can = (kmer_str < kmer_rev) ? 
kmer_str : kmer_rev; small_k_buf.buf[kmer_can.data] += static_cast(kmer_prob); ++ptr.total_kmers; } } else while (ptr.GetSeq(seq, quals, seq_size)) { ptr.n_reads++; // Init k-mer kmer_str.clear(); // Process first k-1 symbols of a read uint32 str_pos = kmer_len_shift - 2; omit_next_n_kmers = 0; kmer_prob = 1.0; for (i = 0; i < ptr.kmer_len - 1; ++i, str_pos -= 2) { if (seq[i] < 0) { seq[i] = 0; omit_next_n_kmers = i + 1; } inv_probs[i] = CProbQual::inv_prob_qual[quals[i] - ptr.lowest_quality]; kmer_str.set_2bits(seq[i], str_pos); kmer_prob *= CProbQual::prob_qual[quals[i] - ptr.lowest_quality]; } // Process next part of a read for (; i < seq_size; ++i) { if (seq[i] < 0) // N in a read { seq[i] = 0; omit_next_n_kmers = ptr.kmer_len; // Mark how many symbols to ommit to get the next kmer without any N } inv_probs[i] = CProbQual::inv_prob_qual[quals[i] - ptr.lowest_quality]; kmer_str.SHL_insert_2bits(seq[i]); kmer_str.mask(kmer_mask); kmer_prob *= CProbQual::prob_qual[quals[i] - ptr.lowest_quality] * inv_probs[(int)i - (int)ptr.kmer_len]; // If necessary ommit next symbols if (omit_next_n_kmers > 0) { omit_next_n_kmers--; continue; } if (kmer_prob < CProbQual::MIN_PROB_QUAL_VALUE) continue; small_k_buf.buf[kmer_str.data] += static_cast(kmer_prob); ++ptr.total_kmers; } } putchar('*'); fflush(stdout); ptr.pmm_reads->free(seq); ptr.pmm_reads->free(quals); ptr.pmm_reads->free(raw_inv_probs); return true; } //---------------------------------------------------------------------------------- // Process the reads from the given FASTQ file part bool CSplitter_Impl::ProcessReads(CSplitter &ptr, uchar *_part, uint64 _part_size) { ptr.part = _part; ptr.part_size = _part_size; ptr.part_pos = 0; char *seq; char *quals; ptr.pmm_reads->reserve(seq); ptr.pmm_reads->reserve(quals); uint32 seq_size; uint32 signature_start_pos; CMmer current_signature(ptr.signature_len), end_mmer(ptr.signature_len); uint32 bin_no; uint32 i; uint32 len;//length of extended kmer while (ptr.GetSeq(seq, 
quals, seq_size)) { if (ptr.file_type != multiline_fasta) ptr.n_reads++; i = 0; len = 0; while (i + ptr.kmer_len - 1 < seq_size) { bool contains_N = false; //building first signature after 'N' or at the read begining for (uint32 j = 0; j < ptr.signature_len; ++j, ++i) if (seq[i] < 0)//'N' { contains_N = true; break; } //signature must be shorter than k-mer so if signature contains 'N', k-mer will contains it also if (contains_N) { ++i; continue; } len = ptr.signature_len; signature_start_pos = i - ptr.signature_len; current_signature.insert(seq + signature_start_pos); end_mmer.set(current_signature); for (; i < seq_size; ++i) { if (seq[i] < 0)//'N' { if (len >= ptr.kmer_len) { bin_no = ptr.s_mapper->get_bin_id(current_signature.get()); ptr.bins[bin_no]->PutExtendedKmer(seq + i - len, quals + i - len, len); } len = 0; ++i; break; } end_mmer.insert(seq[i]); if (end_mmer < current_signature)//signature at the end of current k-mer is lower than current { if (len >= ptr.kmer_len) { bin_no = ptr.s_mapper->get_bin_id(current_signature.get()); ptr.bins[bin_no]->PutExtendedKmer(seq + i - len, quals + i - len, len); len = ptr.kmer_len - 1; } current_signature.set(end_mmer); signature_start_pos = i - ptr.signature_len + 1; } else if (end_mmer == current_signature) { current_signature.set(end_mmer); signature_start_pos = i - ptr.signature_len + 1; } else if (signature_start_pos + ptr.kmer_len - 1 < i)//need to find new signature { bin_no = ptr.s_mapper->get_bin_id(current_signature.get()); ptr.bins[bin_no]->PutExtendedKmer(seq + i - len, quals + i - len, len); len = ptr.kmer_len - 1; //looking for new signature ++signature_start_pos; //building first signature in current k-mer end_mmer.insert(seq + signature_start_pos); current_signature.set(end_mmer); for (uint32 j = signature_start_pos + ptr.signature_len; j <= i; ++j) { end_mmer.insert(seq[j]); if (end_mmer <= current_signature) { current_signature.set(end_mmer); signature_start_pos = j - ptr.signature_len + 1; } } } ++len; 
if (len == ptr.kmer_len + 255) //one byte is used to store counter of additional symbols in extended k-mer { bin_no = ptr.s_mapper->get_bin_id(current_signature.get()); ptr.bins[bin_no]->PutExtendedKmer(seq + i + 1 - len, quals + i + 1 - len, len); i -= ptr.kmer_len - 2; len = 0; break; } } } if (len >= ptr.kmer_len)//last one in read { bin_no = ptr.s_mapper->get_bin_id(current_signature.get()); ptr.bins[bin_no]->PutExtendedKmer(seq + i - len, quals + i - len, len); } } putchar('*'); fflush(stdout); ptr.pmm_reads->free(seq); ptr.pmm_reads->free(quals); return true; } //************************************************************************************************************ // CWSplitter class - wrapper for multithreading purposes //************************************************************************************************************ //---------------------------------------------------------------------------------- template class CWSplitter { CPartQueue *pq; CBinPartQueue *bpq; CMemoryPool *pmm_fastq; CSplitter *spl; uint64 n_reads; public: CWSplitter(CKMCParams &Params, CKMCQueues &Queues); ~CWSplitter(); void operator()(); void GetTotal(uint64 &_n_reads); }; //---------------------------------------------------------------------------------- // Constructor template CWSplitter::CWSplitter(CKMCParams &Params, CKMCQueues &Queues) { pq = Queues.part_queue; bpq = Queues.bpq; pmm_fastq = Queues.pmm_fastq; spl = new CSplitter(Params, Queues); spl->InitBins(Params, Queues); } //---------------------------------------------------------------------------------- // Destructor template CWSplitter::~CWSplitter() { } //---------------------------------------------------------------------------------- // Execution template void CWSplitter::operator()() { // Splitting parts while(!pq->completed()) { uchar *part; uint64 size; if(pq->pop(part, size)) { spl->ProcessReads(part, size); pmm_fastq->free(part); } } spl->Complete(); bpq->mark_completed(); 
spl->GetTotal(n_reads); delete spl; spl = NULL; } //---------------------------------------------------------------------------------- // Return statistics template void CWSplitter::GetTotal(uint64 &_n_reads) { if(spl) spl->GetTotal(n_reads); _n_reads = n_reads; } //************************************************************************************************************ // CWStatsSplitter class - wrapper for multithreading purposes //************************************************************************************************************ //---------------------------------------------------------------------------------- template class CWStatsSplitter { CStatsPartQueue *spq; CMemoryPool *pmm_fastq, *pmm_stats; uint32 *stats; CSplitter *spl; uint32 signature_len; public: CWStatsSplitter(CKMCParams &Params, CKMCQueues &Queues); ~CWStatsSplitter(); void operator()(); void GetStats(uint32* _stats); }; //---------------------------------------------------------------------------------- // Constructor template CWStatsSplitter::CWStatsSplitter(CKMCParams &Params, CKMCQueues &Queues) { spq = Queues.stats_part_queue; pmm_fastq = Queues.pmm_fastq; pmm_stats = Queues.pmm_stats; spl = new CSplitter(Params, Queues); signature_len = Params.signature_len; pmm_stats->reserve(stats); fill_n(stats, (1 << signature_len * 2) + 1, 0); } //---------------------------------------------------------------------------------- // Destructor template CWStatsSplitter::~CWStatsSplitter() { pmm_stats->free(stats); } //---------------------------------------------------------------------------------- // Execution template void CWStatsSplitter::operator()() { // Splitting parts while (!spq->completed()) { uchar *part; uint64 size; if (spq->pop(part, size)) { spl->CalcStats(part, size, stats); pmm_fastq->free(part); } } delete spl; spl = NULL; } //---------------------------------------------------------------------------------- template void CWStatsSplitter::GetStats(uint32* _stats) { uint32 
size = (1 << signature_len * 2) + 1; for (uint32 i = 0; i < size; ++i) _stats[i] += stats[i]; } //************************************************************************************************************ // CWSmallKSplitter class - wrapper for multithreading purposes //************************************************************************************************************ //---------------------------------------------------------------------------------- template class CWSmallKSplitter { CPartQueue *pq; CMemoryPool *pmm_fastq, *pmm_small_k; CSmallKBuf small_k_buf; CSplitter *spl; uint64 n_reads; uint64 total_kmers; uint32 kmer_len; public: CWSmallKSplitter(CKMCParams &Params, CKMCQueues &Queues); ~CWSmallKSplitter(); void operator()(); void GetTotal(uint64 &_n_reads); CSmallKBuf GetResult() { return small_k_buf; } uint64 GetTotalKmers() { if (spl) return spl->GetTotalKmers(); return total_kmers; } void Release() { pmm_small_k->free(small_k_buf.buf); } }; //---------------------------------------------------------------------------------- // Constructor template CWSmallKSplitter::CWSmallKSplitter(CKMCParams &Params, CKMCQueues &Queues) { pq = Queues.part_queue; pmm_fastq = Queues.pmm_fastq; pmm_small_k = Queues.pmm_small_k_buf; kmer_len = Params.kmer_len; spl = new CSplitter(Params, Queues); } //---------------------------------------------------------------------------------- // Destructor template CWSmallKSplitter::~CWSmallKSplitter() { } //---------------------------------------------------------------------------------- // Execution template void CWSmallKSplitter::operator()() { pmm_small_k->reserve(small_k_buf.buf); memset(small_k_buf.buf, 0, (1ull << 2 * kmer_len) * sizeof(*small_k_buf.buf)); // Splitting parts while (!pq->completed()) { uchar *part; uint64 size; if (pq->pop(part, size)) { spl->ProcessReadsSmallK(part, size, small_k_buf); pmm_fastq->free(part); } } spl->Complete(); spl->GetTotal(n_reads); total_kmers = spl->GetTotalKmers(); delete spl; 
	spl = NULL;
}

//----------------------------------------------------------------------------------
// Return statistics
template void CWSmallKSplitter::GetTotal(uint64 &_n_reads)
{
	if (spl)
		spl->GetTotal(n_reads);

	_n_reads = n_reads;
}

#endif

// ***** EOF
KMC-2.3/kmer_counter/stdafx.cpp000066400000000000000000000004431257432033000164620ustar00rootroot00000000000000
// stdafx.cpp : source file that includes just the standard includes
// kmer_counter.pch will be the pre-compiled header
// stdafx.obj will contain the pre-compiled type information

#include "stdafx.h"

// TODO: reference any additional headers you need in STDAFX.H
// and not in this file
KMC-2.3/kmer_counter/stdafx.h000066400000000000000000000005041257432033000161250ustar00rootroot00000000000000
#ifdef WIN32
// stdafx.h : include file for standard system include files,
// or project specific include files that are used frequently, but
// are changed infrequently
//

#pragma once

#include "targetver.h"

#include
#include

// TODO: reference additional headers your program requires here
#endif
KMC-2.3/kmer_counter/targetver.h000066400000000000000000000004621257432033000166420ustar00rootroot00000000000000
#pragma once

// Including SDKDDKVer.h defines the highest available Windows platform.

// If you wish to build your application for a previous Windows platform, include WinSDKVer.h and
// set the _WIN32_WINNT macro to the platform you wish to support before including SDKDDKVer.h.

#include
KMC-2.3/kmer_counter/timer.cpp000066400000000000000000000024341257432033000163130ustar00rootroot00000000000000
#include "stdafx.h"
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  The source codes are based on codes written by Dennis and published:
    http://allmybrain.com/2008/06/10/timing-cc-code-on-linux/

  Version: 2.3.0
  Date : 2015-08-21
*/

#ifdef WIN32
#include
#endif

#include // NULL
#include "timer.h"

#ifdef WIN32

double CStopWatch::LIToSecs( LARGE_INTEGER & L)
{
	return ((double)L.QuadPart /(double)frequency.QuadPart);
}

CStopWatch::CStopWatch()
{
	timer.start.QuadPart=0;
	timer.stop.QuadPart=0;
	QueryPerformanceFrequency( &frequency );
}

void CStopWatch::startTimer( )
{
	QueryPerformanceCounter(&timer.start);
}

void CStopWatch::stopTimer( )
{
	QueryPerformanceCounter(&timer.stop);
}

double CStopWatch::getElapsedTime()
{
	LARGE_INTEGER time;
	time.QuadPart = timer.stop.QuadPart - timer.start.QuadPart;
	return LIToSecs( time) ;
}

#else

void CStopWatch::startTimer( )
{
	gettimeofday(&(timer.start),NULL);
}

void CStopWatch::stopTimer( )
{
	gettimeofday(&(timer.stop),NULL);
}

double CStopWatch::getElapsedTime()
{
	timeval res;
	timersub(&(timer.stop),&(timer.start),&res);
	return res.tv_sec + res.tv_usec/1000000.0; // 10^6 uSec per second
}

#endif
KMC-2.3/kmer_counter/timer.h000066400000000000000000000016531257432033000157620ustar00rootroot00000000000000
/*
  This file is a part of KMC software distributed under GNU GPL 3 licence.
  The homepage of the KMC project is http://sun.aei.polsl.pl/kmc

  The source codes are based on codes written by Dennis and published:
    http://allmybrain.com/2008/06/10/timing-cc-code-on-linux/

  Version: 2.3.0
  Date : 2015-08-21
*/

#ifndef _TIMER_H
#define _TIMER_H

#ifdef WIN32
#include

typedef struct
{
	LARGE_INTEGER start;
	LARGE_INTEGER stop;
} stopWatch;

class CStopWatch
{
private:
	stopWatch timer;
	LARGE_INTEGER frequency;
	double LIToSecs( LARGE_INTEGER & L);
public:
	CStopWatch();
	void startTimer( );
	void stopTimer( );
	double getElapsedTime();
};

#else
#include

typedef struct
{
	timeval start;
	timeval stop;
} stopWatch;

class CStopWatch
{
private:
	stopWatch timer;
public:
	CStopWatch() {};
	void startTimer( );
	void stopTimer( );
	double getElapsedTime();
};

#endif

#endif

// ***** EOF
KMC-2.3/kmer_counter/x64/000077500000000000000000000000001257432033000151055ustar00rootroot00000000000000
KMC-2.3/kmer_counter/x64/Release/000077500000000000000000000000001257432033000164655ustar00rootroot00000000000000
KMC-2.3/makefile000066400000000000000000000047611257432033000134770ustar00rootroot00000000000000
all: kmc

KMC_BIN_DIR = bin
KMC_MAIN_DIR = kmer_counter
KMC_API_DIR = kmc_api
KMC_DUMP_DIR = kmc_dump
KMC_TOOLS_DIR = kmc_tools

CC = g++
CFLAGS = -Wall -O3 -m64 -static -fopenmp -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++11
CLINK = -lm -static -fopenmp -O3 -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++11

KMC_TOOLS_CFLAGS = -Wall -O3 -m64 -static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++14
KMC_TOOLS_CLINK = -lm -static -O3 -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++14

DISABLE_ASMLIB = false

KMC_OBJS = \
$(KMC_MAIN_DIR)/kmer_counter.o \
$(KMC_MAIN_DIR)/mmer.o \
$(KMC_MAIN_DIR)/mem_disk_file.o \
$(KMC_MAIN_DIR)/rev_byte.o \
$(KMC_MAIN_DIR)/bkb_writer.o \
$(KMC_MAIN_DIR)/bkb_reader.o \
$(KMC_MAIN_DIR)/fastq_reader.o \
$(KMC_MAIN_DIR)/timer.o \
$(KMC_MAIN_DIR)/radix.o \
$(KMC_MAIN_DIR)/kb_completer.o \
$(KMC_MAIN_DIR)/kb_storer.o \
$(KMC_MAIN_DIR)/kmer.o \
$(KMC_MAIN_DIR)/prob_qual.o

KMC_LIBS = \
$(KMC_MAIN_DIR)/libs/libz.a \
$(KMC_MAIN_DIR)/libs/libbz2.a

KMC_DUMP_OBJS = \
$(KMC_DUMP_DIR)/nc_utils.o \
$(KMC_API_DIR)/mmer.o \
$(KMC_DUMP_DIR)/kmc_dump.o

KMC_API_OBJS = \
$(KMC_API_DIR)/mmer.o \
$(KMC_API_DIR)/kmc_file.o \
$(KMC_API_DIR)/kmer_api.o

KMC_TOOLS_OBJS = \
$(KMC_TOOLS_DIR)/kmc_header.o \
$(KMC_TOOLS_DIR)/kmc_tools.o \
$(KMC_TOOLS_DIR)/nc_utils.o \
$(KMC_TOOLS_DIR)/parameters_parser.o \
$(KMC_TOOLS_DIR)/parser.o \
$(KMC_TOOLS_DIR)/tokenizer.o \
$(KMC_TOOLS_DIR)/fastq_filter.o \
$(KMC_TOOLS_DIR)/fastq_reader.o \
$(KMC_TOOLS_DIR)/fastq_writer.o \
$(KMC_TOOLS_DIR)/percent_progress.o

KMC_TOOLS_LIBS = \
$(KMC_TOOLS_DIR)/libs/libz.a \
$(KMC_TOOLS_DIR)/libs/libbz2.a

ifeq ($(DISABLE_ASMLIB),true)
    CFLAGS += -DDISABLE_ASMLIB
    KMC_TOOLS_CFLAGS += -DDISABLE_ASMLIB
else
    KMC_LIBS += \
    $(KMC_MAIN_DIR)/libs/alibelf64.a
    KMC_TOOLS_LIBS += \
    $(KMC_TOOLS_DIR)/libs/alibelf64.a
endif

$(KMC_OBJS) $(KMC_DUMP_OBJS) $(KMC_API_OBJS): %.o: %.cpp
	$(CC) $(CFLAGS) -c $< -o $@

$(KMC_TOOLS_OBJS): %.o: %.cpp
	$(CC) $(KMC_TOOLS_CFLAGS) -c $< -o $@

kmc: $(KMC_OBJS)
	-mkdir -p $(KMC_BIN_DIR)
	$(CC) $(CLINK) -o $(KMC_BIN_DIR)/$@ $^ $(KMC_LIBS)

kmc_dump: $(KMC_DUMP_OBJS) $(KMC_API_OBJS)
	-mkdir -p $(KMC_BIN_DIR)
	$(CC) $(CLINK) -o $(KMC_BIN_DIR)/$@ $^

kmc_tools: $(KMC_TOOLS_OBJS) $(KMC_API_OBJS)
	-mkdir -p $(KMC_BIN_DIR)
	$(CC) $(KMC_TOOLS_CLINK) -o $(KMC_BIN_DIR)/$@ $^ $(KMC_TOOLS_LIBS)

clean:
	-rm $(KMC_MAIN_DIR)/*.o
	-rm $(KMC_API_DIR)/*.o
	-rm $(KMC_DUMP_DIR)/*.o
	-rm $(KMC_TOOLS_DIR)/*.o
	-rm -rf bin

all: kmc kmc_dump kmc_tools
KMC-2.3/makefile_mac000066400000000000000000000050021257432033000143040ustar00rootroot00000000000000
all: kmc kmc_dump kmc_tools

KMC_BIN_DIR = bin
KMC_MAIN_DIR = kmer_counter
KMC_API_DIR = kmc_api
KMC_DUMP_DIR = kmc_dump
KMC_TOOLS_DIR = kmc_tools

CC = /usr/local/Cellar/gcc49/4.9.2/bin/g++-4.9
CFLAGS = -Wall -O3 -m64 -static-libgcc -static-libstdc++ -fopenmp -pthread -std=c++11
CLINK = -lm -fopenmp -static-libgcc -static-libstdc++ -O3 -pthread -std=c++11

KMC_TOOLS_CFLAGS = -Wall -O3 -m64 -static-libgcc -static-libstdc++ -pthread -std=c++14
KMC_TOOLS_CLINK = -lm -static-libgcc -static-libstdc++ -O3 -pthread -std=c++14

DISABLE_ASMLIB = false

KMC_OBJS = \
$(KMC_MAIN_DIR)/kmer_counter.o \
$(KMC_MAIN_DIR)/mmer.o \
$(KMC_MAIN_DIR)/mem_disk_file.o \
$(KMC_MAIN_DIR)/rev_byte.o \
$(KMC_MAIN_DIR)/bkb_writer.o \
$(KMC_MAIN_DIR)/bkb_reader.o \
$(KMC_MAIN_DIR)/fastq_reader.o \
$(KMC_MAIN_DIR)/timer.o \
$(KMC_MAIN_DIR)/radix.o \
$(KMC_MAIN_DIR)/kb_completer.o \
$(KMC_MAIN_DIR)/kb_storer.o \
$(KMC_MAIN_DIR)/kmer.o \
$(KMC_MAIN_DIR)/prob_qual.o

KMC_LIBS = \
$(KMC_MAIN_DIR)/libs/libz.1.2.5.dylib \
$(KMC_MAIN_DIR)/libs/libbz2.1.0.5.dylib

KMC_DUMP_OBJS = \
$(KMC_DUMP_DIR)/nc_utils.o \
$(KMC_API_DIR)/mmer.o \
$(KMC_DUMP_DIR)/kmc_dump.o

KMC_API_OBJS = \
$(KMC_API_DIR)/mmer.o \
$(KMC_API_DIR)/kmc_file.o \
$(KMC_API_DIR)/kmer_api.o

KMC_TOOLS_OBJS = \
$(KMC_TOOLS_DIR)/kmc_header.o \
$(KMC_TOOLS_DIR)/kmc_tools.o \
$(KMC_TOOLS_DIR)/nc_utils.o \
$(KMC_TOOLS_DIR)/parameters_parser.o \
$(KMC_TOOLS_DIR)/parser.o \
$(KMC_TOOLS_DIR)/tokenizer.o \
$(KMC_TOOLS_DIR)/fastq_filter.o \
$(KMC_TOOLS_DIR)/fastq_reader.o \
$(KMC_TOOLS_DIR)/fastq_writer.o \
$(KMC_TOOLS_DIR)/percent_progress.o

KMC_TOOLS_LIBS = \
$(KMC_TOOLS_DIR)/libs/libz.1.2.5.dylib \
$(KMC_TOOLS_DIR)/libs/libbz2.1.0.5.dylib

ifeq ($(DISABLE_ASMLIB),true)
    CFLAGS += -DDISABLE_ASMLIB
    KMC_TOOLS_CFLAGS += -DDISABLE_ASMLIB
else
    KMC_LIBS += \
    $(KMC_MAIN_DIR)/libs/libamac64.a
    KMC_TOOLS_LIBS += \
    $(KMC_TOOLS_DIR)/libs/libamac64.a
endif

$(KMC_OBJS) $(KMC_DUMP_OBJS) $(KMC_API_OBJS): %.o: %.cpp
	$(CC) $(CFLAGS) -c $< -o $@

$(KMC_TOOLS_OBJS): %.o: %.cpp
	$(CC) $(KMC_TOOLS_CFLAGS) -c $< -o $@

kmc: $(KMC_OBJS)
	-mkdir -p $(KMC_BIN_DIR)
	$(CC) $(CLINK) -o $(KMC_BIN_DIR)/$@ $^ $(KMC_LIBS)

kmc_dump: $(KMC_DUMP_OBJS) $(KMC_API_OBJS)
	-mkdir -p $(KMC_BIN_DIR)
	$(CC) $(CLINK) -o $(KMC_BIN_DIR)/$@ $^

kmc_tools: $(KMC_TOOLS_OBJS) \
$(KMC_API_OBJS) -mkdir -p $(KMC_BIN_DIR) $(CC) $(KMC_TOOLS_CLINK) -o $(KMC_BIN_DIR)/$@ $^ $(KMC_TOOLS_LIBS) clean: -rm $(KMC_MAIN_DIR)/*.o -rm $(KMC_API_DIR)/*.o -rm $(KMC_DUMP_DIR)/*.o -rm $(KMC_TOOLS_DIR)/*.o -rm -rf bin all: kmc kmc_dump kmc_toolsKMC-2.3/readme.txt000066400000000000000000000056101257432033000137670ustar00rootroot00000000000000***** The program ***** KMC is a disk-based programm for counting k-mers from (possibly gzipped) FASTQ/FASTA files. The homepage of the KMC project is http://sun.aei.polsl.pl/kmc ***** Installation ***** The following libraries come with KMC in a binary (64-bit compiled for x86 platform) form. If your system needs other binary formats, you should put the following libraries in kmer_counter/libs: * asmlib - for fast memcpy operation (http://www.agner.org/optimize/asmlib-instructions.pdf) * libbzip2 - for support for bzip2-compressed input FASTQ/FASTA files (http://www.bzip.org/) * zlib - for support for gzip-compressed input FASTQ/FASTA files (http://www.zlib.net/) Note: asmlib is free only for non commercial purposes. If needed, you can contact the author of asmlib or compile KMC without asmlib. If needed, you can also redefine maximal length of k-mer, which is 256 in the current version. Note: KMC is highly optimized and spends only as many bytes for k-mer (rounded up to 8) as necessary, so using large values of MAX_K does not affect the KMC performance for short k-mers. Some parts of KMC use C++11 features, so you need a compatible C++ compiler, e.g., gcc 4.7 or higher. After that, you can run make to compile kmc and kmc_dump applications. 
If you want to compile kmc without asmlib, run:
make DISABLE_ASMLIB=true

***** Directory structure *****

bin - main directory of KMC (programs after compilation will be stored here)
kmer_counter - source code of the kmc program
kmer_counter/libs - compiled binary versions of libraries used by KMC
kmc_api - C++ source code implementing the API; must be used by any program that wants to process databases produced by kmc
kmc_dump - source code of the kmc_dump program, which lists k-mers in databases produced by kmc

***** Binaries *****

After compilation you will obtain two binaries:
* bin/kmc - the main program for counting k-mer occurrences
* bin/kmc_dump - the program listing k-mers in a database produced by kmc

***** License *****

* KMC software is distributed under the GNU GPL 2 licence.
* libbzip2 is open-source (BSD-style license)
* gzip is free, open-source
* asmlib is under the licence GNU GPL 3 or higher

Note: for commercial usage of asmlib follow the instructions in 'License conditions' (http://www.agner.org/optimize/asmlib-instructions.pdf) or compile KMC without asmlib.

In case of doubt, please consult the original documentation.

***** Warranty *****

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.