pax_global_header 0000666 0000000 0000000 00000000064 14573550650 0014524 g ustar 00root root 0000000 0000000 52 comment=42f8b69799570d73f61bc0d5b6562387d7da4132
bio-0.13.3/ 0000775 0000000 0000000 00000000000 14573550650 0012361 5 ustar 00root root 0000000 0000000 bio-0.13.3/.github/ 0000775 0000000 0000000 00000000000 14573550650 0013721 5 ustar 00root root 0000000 0000000 bio-0.13.3/.github/FUNDING.yml 0000664 0000000 0000000 00000000150 14573550650 0015532 0 ustar 00root root 0000000 0000000 # These are supported funding model platforms
# github: shenwei356
custom: http://paypal.me/shenwei356
bio-0.13.3/.gitignore 0000775 0000000 0000000 00000000424 14573550650 0014354 0 ustar 00root root 0000000 0000000 # Compiled Object files, Static and Dynamic libs (Shared Objects)
*.o
*.a
*.so
# Folders
_obj
_test
# Architecture specific extensions/prefixes
*.[568vq]
[568vq].out
*.cgo1.go
*.cgo2.c
_cgo_defun.c
_cgo_gotypes.go
_cgo_export.*
_testmain.go
*.exe
*.directory
.DS_Store
bio-0.13.3/CHANGELOG.md 0000664 0000000 0000000 00000004330 14573550650 0014172 0 ustar 00root root 0000000 0000000 # Changelog
### v0.13.3 - 2024-03-11
- fix dependency.
### v0.13.2 - 2024-03-11
- seq: increase default value of `seq.ComplementSeqLenThreshold` from 1000 to 1000000, which means only parallelize complement computation for sequences longer than 1Mb.
### v0.13.1 - 2024-02-22
- util: fix computation of L50.
### v0.13.0 - 2024-02-19
- seq: remove the global variable: `seq.ValidateWholeSeq`.
### v0.12.1 - 2024-01-03
- seqio/fai: when using the whole FASTA header as the sequence ID, replace possible tabs in FASTA header with spaces.
### v0.12.0 - 2023-12-04
- seqio/fastx.Reader: reuse the reader with an object pool; this requires users to call reader.Close() after use.
The benefit is reduced memory usage when handling a large number of sequences.
### v0.11.0 - 2023-12-03
Do not use this version!
- seqio/fastx.Reader: delete reader.Recycle() to avoid API changes.
### v0.10.0 - 2023-12-03
- seqio/fastx.Reader: reuse the reader with an object pool; this requires users to call reader.Recycle() after use.
### v0.9.3 - 2023-11-11
- seqio/fastx.Reader: fix a nil pointer panic when some files have a size of 0.
### v0.9.2 - 2023-11-10
- seq/alphabet.IsValid: fix panic: close of closed channel
### v0.9.1 - 2023-11-08
- seqio/fastx.Reader: recycle []byte buffer to save memory for reading a large number of sequences.
### v0.9.0 - 2023-06-25
- util/LengthStats: new method to compute N50 for any number
- seq/alphabet: faster with asciiset
### v0.8.4 - 2023-02-14
- seqio/fai: report error for non-fasta files
### v0.8.3 - 2022-12-02
- util.LengthStats: fix computing Q1 and Q3 for one element.
### v0.8.2 - 2022-11-16
- faidx: allow empty lines at the end of sequences
### v0.8.1 - 2022-09-06
- fastx: fix concurrency bug of `record.FormatToWriter()`.
### v0.8.0 - 2022-09-06
- sketches: added an iterator of SimHash.
### v0.7.1 - 2022-04-19
- taxdump: allow reading empty merged.dmp and delnodes.dmp
### v0.6.4 - 2022-03-13
- update xopen version
### v0.6.3 - 2022-03-12
- use new version of xopen, which supports .xz and .zst
### v0.6.2 - 2021-12-01
- taxdump: more robust
### v0.6.1 - 2021-11-16
- taxdump: fix pkg name in errors
### v0.6.0 - 2021-11-08
- move sketches and taxdump packages from unikmer to here
bio-0.13.3/LICENSE 0000775 0000000 0000000 00000002113 14573550650 0013366 0 ustar 00root root 0000000 0000000 Copyright (c) 2013 - 2021 Wei Shen (shenwei356@gmail.com)
The MIT License
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
bio-0.13.3/README.md 0000775 0000000 0000000 00000001266 14573550650 0013650 0 ustar 00root root 0000000 0000000 # bio - a lightweight and high-performance bioinformatics package for Golang
[Go Reference](https://pkg.go.dev/github.com/shenwei356/bio)
[Go Report Card](https://goreportcard.com/report/github.com/shenwei356/bio)
See sub packages for details.
## Install
go get -u github.com/shenwei356/bio
## Support
Please [open an issue](https://github.com/shenwei356/bio/issues) to report bugs,
propose new functions or ask for help.
## License
Copyright (c) 2013-2021, Wei Shen (shenwei356@gmail.com)
[MIT License](https://github.com/shenwei356/bio/blob/master/LICENSE)
bio-0.13.3/benchmark/ 0000775 0000000 0000000 00000000000 14573550650 0014313 5 ustar 00root root 0000000 0000000 bio-0.13.3/benchmark/fastx/ 0000775 0000000 0000000 00000000000 14573550650 0015440 5 ustar 00root root 0000000 0000000 bio-0.13.3/benchmark/fastx/README.md 0000664 0000000 0000000 00000003641 14573550650 0016723 0 ustar 00root root 0000000 0000000
## FASTA/Q reading and writing
***bio/seqio/fastx has a high performance close to the famous C lib
[`kseq.h`](https://github.com/attractivechaos/klib/blob/master/kseq.h).***
To test the performance, three datasets and their gzip-compressed versions are used:
- dataset_A, bacteria genomes, 2.7G
- dataset_B, human genome, 2.9G
- dataset_C, Illumina reads, 2.2G
Summary by [`seqkit`](https://github.com/shenwei356/seqkit):
    file          seq_format  seq_type   num_seqs  min_len       avg_len      max_len
    dataset_A.fa  FASTA       DNA          67,748       56      41,442.5    5,976,145
    dataset_B.fa  FASTA       DNA             194      970  15,978,096.5  248,956,422
    dataset_C.fq  FASTQ       DNA       9,186,045      100           100          100
[`seqtk`](https://github.com/lh3/seqtk/)
(Version [1.3-r119-dirty](https://github.com/lh3/seqtk/commit/f6ea81cc30b9232e244dffa94187114275389132),
using `kseq.h`)
and [`seqkit`](https://github.com/shenwei356/seqkit)
(Version [v2.4.0](https://github.com/shenwei356/seqkit/releases/tag/v2.4.0),
using this package) were used to test.
**Note** that `seqtk` does not support wrapped (fixed line width) output, so `seqkit` uses
`-w 0` to disable output wrapping.
Script [`memusg`](https://github.com/shenwei356/memusg) is used to assess running time
and peak memory usage.
[Commands](https://github.com/shenwei356/bio/blob/master/benchmark/)
Tests were repeated 4 times and average time and memory usage were computed.
Results:
Notes:
- `seqkit` uses 4 threads by default.
- `seqkit_t1` uses 1 thread.
- `seqtk` is single-threaded.
- `seqtk+gzip`: `seqtk` pipes data to the single-threaded `gzip`.
- `seqtk+pigz`: `seqtk` pipes data to the multithreaded `pigz` which uses 4 threads here.
## Run
./run.pl -n 4 run_benchmark_*.sh --outfile benchmark.tsv
# PLOT
./plot.sh
bio-0.13.3/benchmark/fastx/benchmark.tsv 0000775 0000000 0000000 00000010116 14573550650 0020132 0 ustar 00root root 0000000 0000000 test dataset app time mem
Plain text dataset_A.fa seqkit 2.54 28332
Plain text dataset_A.fa seqkit 2.53 28432
Plain text dataset_A.fa seqkit 2.64 29680
Plain text dataset_A.fa seqkit 2.54 27520
Plain text dataset_A.fa seqkit_t1 3.77 31888
Plain text dataset_A.fa seqkit_t1 3.67 31836
Plain text dataset_A.fa seqkit_t1 3.23 29540
Plain text dataset_A.fa seqkit_t1 3.96 27884
Plain text dataset_A.fa seqtk 2.77 7684
Plain text dataset_A.fa seqtk 2.65 7664
Plain text dataset_A.fa seqtk 2.65 7756
Plain text dataset_A.fa seqtk 2.76 7668
Plain text dataset_B.fa seqkit 3.11 1039000
Plain text dataset_B.fa seqkit 3.11 1035532
Plain text dataset_B.fa seqkit 3.22 1034608
Plain text dataset_B.fa seqkit 3.21 1037848
Plain text dataset_B.fa seqkit_t1 3.23 1037696
Plain text dataset_B.fa seqkit_t1 3.34 1035184
Plain text dataset_B.fa seqkit_t1 3.46 1035184
Plain text dataset_B.fa seqkit_t1 3.35 1038660
Plain text dataset_B.fa seqtk 3.11 244940
Plain text dataset_B.fa seqtk 3.11 244988
Plain text dataset_B.fa seqtk 3.12 244948
Plain text dataset_B.fa seqtk 3.20 244944
Plain text dataset_C.fq seqkit 3.22 24560
Plain text dataset_C.fq seqkit 3.28 24496
Plain text dataset_C.fq seqkit 3.28 14176
Plain text dataset_C.fq seqkit 3.23 14296
Plain text dataset_C.fq seqkit_t1 3.79 14192
Plain text dataset_C.fq seqkit_t1 3.58 24628
Plain text dataset_C.fq seqkit_t1 3.68 14524
Plain text dataset_C.fq seqkit_t1 4.15 14288
Plain text dataset_C.fq seqtk 4.15 1028
Plain text dataset_C.fq seqtk 4.15 1088
Plain text dataset_C.fq seqtk 4.15 1020
Plain text dataset_C.fq seqtk 4.26 1028
Gzip compressed dataset_A.fa.gz seqkit 14.27 57764
Gzip compressed dataset_A.fa.gz seqkit 14.38 58748
Gzip compressed dataset_A.fa.gz seqkit 14.45 55004
Gzip compressed dataset_A.fa.gz seqkit 14.49 57792
Gzip compressed dataset_A.fa.gz seqkit_t1 41.86 42856
Gzip compressed dataset_A.fa.gz seqkit_t1 41.43 43884
Gzip compressed dataset_A.fa.gz seqkit_t1 42.50 41936
Gzip compressed dataset_A.fa.gz seqkit_t1 40.89 44064
Gzip compressed dataset_A.fa.gz seqtk+gzip 461.54 7700
Gzip compressed dataset_A.fa.gz seqtk+gzip 460.74 7820
Gzip compressed dataset_A.fa.gz seqtk+gzip 461.59 7748
Gzip compressed dataset_A.fa.gz seqtk+gzip 459.01 7800
Gzip compressed dataset_A.fa.gz seqtk+pigz 118.43 7748
Gzip compressed dataset_A.fa.gz seqtk+pigz 117.57 7664
Gzip compressed dataset_A.fa.gz seqtk+pigz 120.28 7696
Gzip compressed dataset_A.fa.gz seqtk+pigz 119.26 7724
Gzip compressed dataset_B.fa.gz seqkit 25.37 1045592
Gzip compressed dataset_B.fa.gz seqkit 25.05 1044708
Gzip compressed dataset_B.fa.gz seqkit 25.81 1042072
Gzip compressed dataset_B.fa.gz seqkit 25.46 1045288
Gzip compressed dataset_B.fa.gz seqkit_t1 46.53 1043316
Gzip compressed dataset_B.fa.gz seqkit_t1 46.19 1041472
Gzip compressed dataset_B.fa.gz seqkit_t1 46.30 1041008
Gzip compressed dataset_B.fa.gz seqkit_t1 46.80 1042332
Gzip compressed dataset_B.fa.gz seqtk+gzip 457.71 245104
Gzip compressed dataset_B.fa.gz seqtk+gzip 461.24 245072
Gzip compressed dataset_B.fa.gz seqtk+gzip 463.54 245164
Gzip compressed dataset_B.fa.gz seqtk+gzip 467.39 245208
Gzip compressed dataset_B.fa.gz seqtk+pigz 128.74 245028
Gzip compressed dataset_B.fa.gz seqtk+pigz 126.62 245100
Gzip compressed dataset_B.fa.gz seqtk+pigz 128.22 245076
Gzip compressed dataset_B.fa.gz seqtk+pigz 130.17 245100
Gzip compressed dataset_C.fq.gz seqkit 9.05 40948
Gzip compressed dataset_C.fq.gz seqkit 8.87 41016
Gzip compressed dataset_C.fq.gz seqkit 9.23 38484
Gzip compressed dataset_C.fq.gz seqkit 9.05 39380
Gzip compressed dataset_C.fq.gz seqkit_t1 31.18 30100
Gzip compressed dataset_C.fq.gz seqkit_t1 30.59 29512
Gzip compressed dataset_C.fq.gz seqkit_t1 30.50 29052
Gzip compressed dataset_C.fq.gz seqkit_t1 30.66 29424
Gzip compressed dataset_C.fq.gz seqtk+gzip 188.89 1024
Gzip compressed dataset_C.fq.gz seqtk+gzip 190.83 1080
Gzip compressed dataset_C.fq.gz seqtk+gzip 190.95 964
Gzip compressed dataset_C.fq.gz seqtk+gzip 191.15 968
Gzip compressed dataset_C.fq.gz seqtk+pigz 49.13 1080
Gzip compressed dataset_C.fq.gz seqtk+pigz 48.17 968
Gzip compressed dataset_C.fq.gz seqtk+pigz 48.11 1032
Gzip compressed dataset_C.fq.gz seqtk+pigz 49.28 964
bio-0.13.3/benchmark/fastx/benchmark.tsv.png 0000664 0000000 0000000 00000516350 14573550650 0020725 0 ustar 00root root 0000000 0000000 PNG