pax_global_header00006660000000000000000000000064136615256560014531gustar00rootroot0000000000000052 comment=f0e0bfec9d50c86a1fe4b99352982a830a412008 gsort-0.1.4/000077500000000000000000000000001366152565600126715ustar00rootroot00000000000000gsort-0.1.4/.gitignore000066400000000000000000000000101366152565600146500ustar00rootroot00000000000000*.pprof gsort-0.1.4/.travis.yml000066400000000000000000000001671366152565600150060ustar00rootroot00000000000000language: go os: - linux - osx go: - 1.11 - 1.12 - 1.13 script: - go test - ./functional-test.sh gsort-0.1.4/CHANGES.md000066400000000000000000000010611366152565600142610ustar00rootroot000000000000000.1.0 ===== + add --chromosomMappings to re-map chromosome names for example from hg19 to GRCh37 0.0.6 ===== + allow sorting GFF with respect to parent so that of elements with the same start, the parent will always come first. 0.0.5 ===== + fix bug with finding END for svs. 0.0.3 ===== + fix some memory use issues + performance improvements by not splitting strings, just grabbing what we need. 0.0.2 ===== + allow bed files with 'track' and 'browser' headers. + add newline to stderr msg for version. + fix memory use. 0.0.1 ===== + initial release gsort-0.1.4/LICENSE000066400000000000000000000021121366152565600136720ustar00rootroot00000000000000The MIT License (MIT) Copyright (c) 2016 Brent Pedersen - Bioinformatics Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
gsort-0.1.4/README.md000066400000000000000000000057771366152565600141700ustar00rootroot00000000000000
# gsort

[![Build Status](https://travis-ci.org/brentp/gsort.svg?branch=master)](https://travis-ci.org/brentp/gsort)

[Binaries Available Here](https://github.com/brentp/gsort/releases)

`gsort` is a tool to sort genomic files according to a genome file.

For example, for some reason, you may want to sort your VCF to have order: `X,Y,2,1,3,...` and you want to **keep the header** at the top.

As a more likely example, you may want to sort your file to match GATK order (1 ... X, Y, MT) which is not possible with any other sorting tool. With `gsort` one can simply place MT as the last chrom in the .genome file.

Given a genome file (lines of `chrom\tlength`), you can sort a BED/VCF/GTF/... in the order dictated by that file with:

```
gsort --memory 1500 my.vcf.gz crazy.genome | bgzip -c > my.crazy-order.vcf.gz
```

Here, memory use is limited to 1500 megabytes.

We will use this to enforce chromosome ordering in [ggd](https://github.com/gogetdata/ggd).

It will also be useful for getting your files ready for use in **bedtools**.

# GFF parent

In GFF, the `Parent` attribute may refer to a row that would otherwise be sorted after it (based on the end position). However, some programs require that the row referenced in a `Parent` attribute be sorted first. If this is required, use the `--parent` flag introduced in version 0.0.6.

# Performance

gsort can sort the 2 million variants in ESP in 15 seconds.
It takes a few minutes to sort the ~10 million ExAC variants because of the huge INFO strings in that file.

# Usage

`gsort` will error if your genome file has a 'chr' prefix and your file does not (or vice versa).

It will write temporary files to your $TMPDIR (usually /tmp/) as needed to avoid using too much memory.

# TODO

+ Specify a VCF for the genome file and pull order from the @SQ tags
+ Avoid temp file when everything can fit in memory. (more universally, last chunk can always be kept in memory).

# API Documentation

--
    import "github.com/brentp/gsort"

Package gsort is a library for sorting a stream of tab-delimited lines (`[]byte`) from a reader, using the amount of memory requested.

Instead of using a compare function as most sorts do, this accepts a user-defined function with signature: `func(line []byte) []int` where the returned ints are compared in order to determine ordering. For example, if we were sorting on 2 columns, one of months and another of day of the month, the function would replace "Jan" with 1 and "Feb" with 2 for the first column and just return the Atoi of the 2nd column.
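The month/day example above can be sketched as follows. This is a minimal illustration, not code from the package: `monthDay`, `months`, `lessKeys` and `sortLines` are hypothetical names, the `Processor` type is redeclared locally so the snippet is self-contained, and the sort here is done entirely in memory, whereas gsort itself writes sorted chunks to temporary files and merges them.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
	"strconv"
)

// Processor mirrors the documented signature: it maps a line to the
// integer keys that decide where the line sorts.
type Processor func(line []byte) []int

// months is an illustrative lookup table (an assumption, not part of gsort).
var months = map[string]int{"Jan": 1, "Feb": 2, "Mar": 3}

// monthDay is a hypothetical Processor for lines of "month<TAB>day".
// Unknown months and unparsable days fall back to 0.
func monthDay(line []byte) []int {
	toks := bytes.Split(bytes.TrimRight(line, "\r\n"), []byte{'\t'})
	if len(toks) < 2 {
		return []int{0, 0}
	}
	day, err := strconv.Atoi(string(toks[1]))
	if err != nil {
		day = 0
	}
	return []int{months[string(toks[0])], day}
}

// lessKeys compares two key slices element by element.
func lessKeys(a, b []int) bool {
	for k := 0; k < len(a) && k < len(b); k++ {
		if a[k] != b[k] {
			return a[k] < b[k]
		}
	}
	return false
}

// sortLines computes the keys once per line and sorts by them in memory.
func sortLines(lines [][]byte, p Processor) [][]byte {
	type keyed struct {
		line []byte
		keys []int
	}
	rows := make([]keyed, len(lines))
	for i, l := range lines {
		rows[i] = keyed{line: l, keys: p(l)}
	}
	sort.SliceStable(rows, func(i, j int) bool {
		return lessKeys(rows[i].keys, rows[j].keys)
	})
	out := make([][]byte, len(rows))
	for i, r := range rows {
		out[i] = r.line
	}
	return out
}

func main() {
	in := [][]byte{[]byte("Feb\t3\n"), []byte("Jan\t20\n"), []byte("Feb\t1\n")}
	for _, l := range sortLines(in, monthDay) {
		fmt.Print(string(l))
	}
}
```

Computing the keys once per line, rather than inside a comparison callback, is the point of the `[]int` design: parsing happens a single time per line regardless of how many comparisons the sort performs.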
#### func Sort ```go func Sort(rdr io.Reader, wtr io.Writer, preprocess Processor, memMB int) error ``` Sort accepts a tab-delimited io.Reader and writes to wtr using prepocess to determine ordering #### type Processor ```go type Processor func(line []byte) []int ``` Processor is a function that takes a line and return a slice of ints that determine ordering gsort-0.1.4/cmd/000077500000000000000000000000001366152565600134345ustar00rootroot00000000000000gsort-0.1.4/cmd/gsort/000077500000000000000000000000001366152565600145725ustar00rootroot00000000000000gsort-0.1.4/cmd/gsort/gsort.go000066400000000000000000000204731366152565600162650ustar00rootroot00000000000000package main import ( "bufio" "bytes" "fmt" "io" "log" "os" "strconv" "strings" "unsafe" "github.com/alexflint/go-arg" "github.com/brentp/gsort" "github.com/brentp/xopen" ggd_utils "github.com/gogetdata/ggd-utils" ) // DEFAULT_MEM is the number of megabytes of mem to use. var DEFAULT_MEM = 2800 // VERSION is the program version number const VERSION = "0.1.4" var FileCols map[string][]int = map[string][]int{ "BED": []int{0, 1, 2}, "VCF-LIKE": []int{0, 1, -1}, "VCF": []int{0, 1, -1}, "GFF": []int{0, 3, -1, 4}, "GTF": []int{0, 3, -1, 4}, } var CHECK_ORDER = []string{"BED", "GTF", "VCF-LIKE"} var args struct { Path string `arg:"positional,help:a tab-delimited file to sort"` Genome string `arg:"positional,help:a genome file of chromosome sizes and order"` ChromosomeMappings string `arg:"-c,help:a file used to re-map chromosome names for example from hg19 to GRCh37"` Memory int `arg:"-m,help:megabytes of memory to use before writing to temp files."` Parent bool `arg:"-p,help:for gff only. 
given rows with same chrom and start put those with a 'Parent' attribute first"` } func unsafeString(b []byte) string { return *(*string)(unsafe.Pointer(&b)) } // get the start and end of a column given the index func getAt(line []byte, idx int) (int, int) { if idx == 0 { return 0, bytes.IndexRune(line, '\t') } off := 0 for i := 0; i < idx; i++ { off += 1 + bytes.IndexRune(line[off:], '\t') } e := bytes.IndexRune(line[off:], '\t') if e == -1 { e = len(line) for line[e-1] == '\n' || line[e-1] == '\r' { e-- } } else { e = off + e } return off, e } // the last function is used when a column is -1 func sortFnFromCols(cols []int, gf *ggd_utils.GenomeFile, getter endGetter) func([]byte) []int { m := 0 for _, c := range cols { if c > m { m = c } } m += 2 if getter != nil && m < 6 { m = 6 } H := 0 // keep order of header fn := func(line []byte) []int { l := make([]int, len(cols)) s, e := getAt(line, cols[0]) var ok bool if s < 0 || e < 0 { ok = false } else { chrom := string(line[s:e]) // the chromosome name has already been remapped, if necessary. 
l[0], ok = gf.Order[chrom] } if !ok { if line[0] == '#' || hasAnyHeader(string(line)) { H++ return []int{gsort.HEADER_LINE, H} } log.Fatalf("unknown chromosome: %s (known: %v) in line: %s", line[s:e], gf.Order, string(line)) } for k, col := range cols[1:] { i := k + 1 if col == -1 { l[i] = getter(l[i-1], line) } else { s, e := getAt(line, col) subset := line[s:e] v, err := strconv.Atoi(unsafeString(subset)) if err != nil { log.Fatal(err) } l[i] = v } } return l } return fn } var allowedHeaders = []string{"browser", "track"} func hasAnyHeader(line string) bool { for _, a := range allowedHeaders { if strings.HasPrefix(line, a) { return true } } return false } func sniff(rdr *bufio.Reader) (string, *bufio.Reader, error) { lines := make([]string, 0, 200) var ftype string for len(lines) < 50000 { line, err := rdr.ReadString('\n') if len(line) > 0 { lines = append(lines, line) if line[0] == '#' { if strings.HasPrefix(line, "##fileformat=VCF") || strings.HasPrefix(line, "#CHROM\tPOS\tID") { ftype = "VCF" break } else { continue } } else { toks := strings.Split(line, "\t") if len(toks) < 3 { if hasAnyHeader(string(line)) { continue } return "", nil, fmt.Errorf("file has fewer than 3 columns") } for _, t := range CHECK_ORDER { cols := FileCols[t] ok := true last := 0 for _, c := range cols[1:] { if c == -1 { continue } if c >= len(toks) { ok = false break } v, err := strconv.Atoi(strings.TrimRight(toks[c], "\r\n")) if err != nil { ok = false break } // check that 0 <= start col <= end_col if v < last { ok = false break } last = v } if ok { ftype = t break } } if hasAnyHeader(string(line)) { continue } if ftype == "" { return "", nil, fmt.Errorf("unknown file format: %s", string(line)) } break } } if err != nil { return "", nil, err } } nrdr := io.MultiReader(strings.NewReader(strings.Join(lines, "")), rdr) return ftype, bufio.NewReader(nrdr), nil } func find(key []byte, info []byte) (int, int) { l := len(key) if pos := bytes.Index(info, key); pos != -1 { var end int for end 
= pos + l + 1; end < len(info); end++ { if info[end] == ';' { break } } return pos + l, end } return -1, -1 } func getMax(i []byte) (int, error) { if !bytes.Contains(i, []byte(",")) { return strconv.Atoi(unsafeString(i)) } all := bytes.Split(i, []byte{','}) max := -1 for _, b := range all { v, err := strconv.Atoi(unsafeString(b)) if err != nil { return max, err } if v > max { max = v } } return max, nil } type endGetter func(start int, line []byte) int var vcfEndGetter = endGetter(func(start int, line []byte) int { col4s, col4e := getAt(line, 4) col4 := line[col4s:col4e] if bytes.Contains(col4, []byte{'<'}) && (bytes.Contains(col4, []byte(" gsort version %s\n", VERSION) if args.Path == "" || args.Genome == "" { p.Fail("must specify a tab-delimited file and a genome file") } rdr, err := xopen.Ropen(args.Path) if err == io.EOF { log.Println("gsort: empty file") os.Exit(0) } if err != nil { log.Fatal(err) } defer rdr.Close() ftype, brdr, err := sniff(rdr.Reader) if err != nil && err != io.EOF { log.Fatal(err) } gf, err := ggd_utils.ReadGenomeFile(args.Genome, args.ChromosomeMappings) if err != nil && err != io.EOF { log.Fatal(err) } var getter endGetter if ftype == "VCF" || ftype == "VCF-LIKE" { getter = vcfEndGetter } else if args.Parent && (ftype == "GFF" || ftype == "GTF") { seen := make(map[string]int, 20) cnt := 2 getter = endGetter(func(start int, line []byte) int { ix := bytes.Index(line, []byte("\tID=")) if ix == -1 { ix = bytes.Index(line, []byte(";ID=")) } if ix != -1 { ix += 4 ixEnd := bytes.IndexByte(line[ix:], ';') if ixEnd == -1 { seen[string(line[ix:len(line)-1])] = cnt } else { seen[string(line[ix:ix+ixEnd])] = cnt } cnt++ } // want parent lines to come first. so lines containing a parent come last. 
if ix := bytes.Index(line, []byte("Parent=")); ix != -1 { ix += 7 ie := bytes.IndexByte(line[ix:], ';') if ie == -1 { ie = len(line) - 1 - ix } if o, ok := seen[string(line[ix:ix+ie])]; ok { return o } return 1 } return 0 }) } else if ftype == "GFF" || ftype == "GTF" { FileCols[ftype] = []int{0, 3, 4} } sortFn := sortFnFromCols(FileCols[ftype], gf, getter) wtr := bufio.NewWriter(os.Stdout) if err := gsort.Sort(brdr, wtr, sortFn, args.Memory, gf.ReMap); err != nil { log.Fatal("error from gsort.Sort", err) } } gsort-0.1.4/example/000077500000000000000000000000001366152565600143245ustar00rootroot00000000000000gsort-0.1.4/example/123Y.genome000066400000000000000000000000551366152565600161560ustar00rootroot00000000000000#chrom len 1 4556 2 4233 3 1234 Y 222 gsort-0.1.4/example/3Y21.genome000066400000000000000000000000551366152565600161560ustar00rootroot00000000000000#chrom len 3 1234 Y 222 2 4233 1 4556 gsort-0.1.4/example/a.bed000066400000000000000000000001001366152565600152070ustar00rootroot000000000000001 4556 5566 1 7556 8566 1 1 2 2 4233 5555 3 1234 1235 Y 222 333 gsort-0.1.4/example/chra.bed000066400000000000000000000001221366152565600157100ustar00rootroot00000000000000chr1 4556 5566 chr1 7556 8566 chr1 1 2 chr2 4233 5555 chr3 1234 1235 chrY 222 333 gsort-0.1.4/example/remapchr.txt000066400000000000000000000000341366152565600166630ustar00rootroot00000000000000chr1 1 chr2 2 chr3 3 chrY Y gsort-0.1.4/example/track-browser.bed000066400000000000000000000012461366152565600175700ustar00rootroot00000000000000track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On" browser position chr7:127471196-127495720 browser hide all Y 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0 Y 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0 1 127473530 127474697 Pos3 0 + 127473530 127474697 255,0,0 1 127474697 127475864 Pos4 0 + 127474697 127475864 255,0,0 2 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255 2 127477031 127478198 Neg2 0 
- 127477031 127478198 0,0,255 1 127478198 127479365 Neg3 0 - 127478198 127479365 0,0,255 1 127479365 127480532 Pos5 0 + 127479365 127480532 255,0,0 1 127480532 127481699 Neg4 0 - 127480532 127481699 0,0,255 gsort-0.1.4/functional-test.sh000077500000000000000000000073031366152565600163520ustar00rootroot00000000000000#!/bin/bash test -e ssshtest || wget -q https://raw.githubusercontent.com/ryanlayer/ssshtest/master/ssshtest . ssshtest set -o nounset go build -o ./gsort_linux_amd64 cmd/gsort/gsort.go run check_usage ./gsort_linux_amd64 assert_exit_code 255 assert_in_stderr "Usage" run check_funky ./gsort_linux_amd64 example/a.bed example/3Y21.genome assert_exit_code 0 assert_equal "$(cut -f 1 $STDOUT_FILE | perl -pe 's/\n//')" "3Y2111" assert_equal "$(cut -f 2 $STDOUT_FILE | perl -pe 's/\n//')" "12342224233145567556" assert_equal "$(cut -f 3 $STDOUT_FILE | perl -pe 's/\n//')" "12353335555255668566" run check_funky_with_remap ./gsort_linux_amd64 -c example/remapchr.txt example/chra.bed example/3Y21.genome assert_equal "$(cut -f 1 $STDOUT_FILE | perl -pe 's/\n//')" "3Y2111" run check_normal ./gsort_linux_amd64 example/a.bed example/123Y.genome assert_exit_code 0 exp="1 1 2 1 4556 5566 1 7556 8566 2 4233 5555 3 1234 1235 Y 222 333" assert_equal "$(cat $STDOUT_FILE)" "$exp" run check_bed_header ./gsort_linux_amd64 example/track-browser.bed example/123Y.genome assert_exit_code 0 exp="track name=\"ItemRGBDemo\" description=\"Item RGB demonstration\" visibility=2 itemRgb=\"On\" browser position chr7:127471196-127495720 browser hide all 1 127473530 127474697 Pos3 0 + 127473530 127474697 255,0,0 1 127474697 127475864 Pos4 0 + 127474697 127475864 255,0,0 1 127478198 127479365 Neg3 0 - 127478198 127479365 0,0,255 1 127479365 127480532 Pos5 0 + 127479365 127480532 255,0,0 1 127480532 127481699 Neg4 0 - 127480532 127481699 0,0,255 2 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255 2 127477031 127478198 Neg2 0 - 127477031 127478198 0,0,255 Y 127471196 127472363 Pos1 0 + 
127471196 127472363 255,0,0 Y 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0" assert_equal "$(cat $STDOUT_FILE)" "$exp" run check_gff_parent ./gsort_linux_amd64 --parent test.gff test.gff.genome assert_exit_code 0 exp="##gff-version 3 ##sequence-region CHROM1 1 20386 ### ### CHROM1 Cufflinks mRNA 1473 16154 . - . ID=XLOC_228.2;description=228 CHROM1 Cufflinks mRNA 1473 16386 . - . ID=XLOC_228.3 CHROM1 Cufflinks exon 1473 1814 . - . Parent=XLOC_228.2 CHROM1 Cufflinks exon 1473 12024 . - . Parent=XLOC_228.3 CHROM1 Cufflinks exon 11626 12574 . - . Parent=XLOC_228.2 CHROM1 Cufflinks exon 12615 12721 . - . Parent=XLOC_228.3 CHROM1 Cufflinks exon 12695 12721 . - . Parent=XLOC_228.2 CHROM1 Cufflinks exon 13637 13726 . - . Parent=XLOC_228.2 CHROM1 Cufflinks exon 13637 13726 . - . Parent=XLOC_228.3 CHROM1 Cufflinks exon 15329 15408 . - . Parent=XLOC_228.2 CHROM1 Cufflinks exon 15329 16386 . - . Parent=XLOC_228.3 CHROM1 Cufflinks exon 15994 16154 . - . Parent=XLOC_228.2" assert_equal "$(cat $STDOUT_FILE)" "$exp" run check_vcf_like_sort ./gsort_linux_amd64 test.vcf-like.tsv test.vcf-like.genome assert_exit_code 0 exp="#chrom pos ref alt strand gene_symbol prediction class score 1 12623 A C + DDX11L9 benign neutral 0.386 1 12624 T C + DDX11L9 possiblydamaging deleterious 0.89 1 12625 G A + DDX11L9 possiblydamaging deleterious 0.769 1 12626 C G + DDX11L9 benign neutral 0 1 12627 C G + DDX11L9 benign neutral 0 12 78791 A G + DKFZp434K1323 possiblydamaging deleterious 0.713 12 78792 T A + DKFZp434K1323 possiblydamaging deleterious 0.932 12 78793 G A + DKFZp434K1323 possiblydamaging deleterious 0.851 12 78794 A C + DKFZp434K1323 possiblydamaging deleterious 0.895 Y 59356108 G T + WASH1 benign neutral 0.026 Y 59356110 A G + WASH1 benign neutral 0 Y 59356111 G C + WASH1 possiblydamaging deleterious 0.501 Y 59356112 C A + WASH1 benign neutral 0.003 Y 59356113 A G + WASH1 benign neutral 0.004 Y 59356113 A C + WASH1 possiblydamaging deleterious 0.952 Y 59356114 G A + WASH1 
possiblydamaging deleterious 0.736" assert_equal "$(cat $STDOUT_FILE)" "$exp" gsort-0.1.4/gsort.go000066400000000000000000000172521366152565600143650ustar00rootroot00000000000000// Package gsort is a library for sorting a stream of tab-delimited lines ([]bytes) (from a reader) // using the amount of memory requested. // // Instead of using a compare function as most sorts do, this accepts a user-defined // function with signature: `func(line []byte) []int` where the []ints are used to // determine ordering. For example if we were sorting on 2 columns, one of months and another of // day of months, the function would replace "Jan" with 1 and "Feb" with 2 for the first column // and just return the Atoi of the 2nd column. // // Header lines are assumed to start with '#'. To indicate other lines that are header lines, the // user function to Sort() can return `[]int{gsort.HEADER_LINE}`. package gsort import ( "bufio" "bytes" "compress/flate" "fmt" "io/ioutil" "math" "os/signal" "path/filepath" "runtime" "syscall" "time" gzip "github.com/klauspost/compress/gzip" "github.com/pkg/errors" //gzip "github.com/klauspost/pgzip" "container/heap" "io" "log" "os" "sort" ) type chunk struct { lines [][]byte idxs []int // used only in Heap Cols [][]int } func (c chunk) Len() int { return len(c.lines) } func (c chunk) Less(i, j int) bool { for k := 0; k < len(c.Cols[i]); k++ { if c.Cols[j][k] == c.Cols[i][k] { continue } return c.Cols[i][k] < c.Cols[j][k] } return false } func (c *chunk) Swap(i, j int) { if i < len((*c).lines) { (*c).lines[j], (*c).lines[i] = c.lines[i], c.lines[j] } if i < len((*c).Cols) { (*c).Cols[j], (*c).Cols[i] = c.Cols[i], c.Cols[j] } if i < len((*c).idxs) { (*c).idxs[j], (*c).idxs[i] = c.idxs[i], c.idxs[j] } } // for Heap type pair struct { line []byte idx int cols []int } func (c *chunk) Push(i interface{}) { p := i.(pair) (*c).lines = append((*c).lines, p.line) (*c).idxs = append((*c).idxs, p.idx) (*c).Cols = append((*c).Cols, p.cols) } func (c *chunk) 
Pop() interface{} { n := len((*c).lines) if n == 0 { return nil } line := (*c).lines[n-1] (*c).lines = (*c).lines[:n-1] idx := (*c).idxs[n-1] (*c).idxs = (*c).idxs[:n-1] cols := (*c).Cols[n-1] (*c).Cols = (*c).Cols[:n-1] return pair{line, idx, cols} } // Processor is a function that takes a line and return a slice of ints that determine ordering type Processor func(line []byte) []int // Sort accepts a tab-delimited io.Reader and writes to wtr using prepocess to determine ordering func Sort(rdr io.Reader, wtr io.Writer, preprocess Processor, memMB int, chromosomeMappings map[string]string) error { /* f, perr := os.Create("gsort.pprof") if perr != nil { panic(perr) } pprof.StartCPUProfile(f) defer pprof.StopCPUProfile() */ brdr, bwtr := bufio.NewReader(rdr), bufio.NewWriter(wtr) defer bwtr.Flush() if err := writeHeader(bwtr, brdr); err == io.EOF { return nil } else if err != nil { return errors.Wrap(err, "error reading/writing header") } ch := make(chan [][]byte) go readLines(ch, brdr, memMB, chromosomeMappings) fileNames := writeChunks(ch, preprocess) for _, f := range fileNames { defer os.Remove(f) } if len(fileNames) == 1 { return writeOne(fileNames[0], bwtr) } // TODO have special merge for when stuff is already mostly sorted. don't need pri queue. return merge(fileNames, bwtr, preprocess) } func readLines(ch chan [][]byte, rdr *bufio.Reader, memMb int, chromosomeMappings map[string]string) { mem := int(1000000.0 * float64(memMb) * 0.7) lines := make([][]byte, 0, 500000) var line []byte var err error sum := 0 k := 0 for { line, err = rdr.ReadBytes('\n') if err != nil && err != io.EOF { log.Fatal(err) } if len(line) > 0 { if chromosomeMappings != nil { i := bytes.IndexRune(line, '\t') chrom := string(line[0:i]) newChrom, ok := chromosomeMappings[chrom] if !ok { log.Printf("[gsort] WARNING: could not find mapping for chromosome: %s", chrom) } else { line = append([]byte(newChrom), line[i:]...) 
} } lines = append(lines, line) sum += len(line) } if len(line) == 0 || err == io.EOF { np := len(lines) last := lines[np-1] if len(last) == 0 || last[len(last)-1] != '\n' { lines[np-1] = append(last, '\n') } ch <- lines break } if sum >= mem { ch <- lines lines = make([][]byte, 0, 500000) if k == 0 { ch <- make([][]byte, 0, 0) mem /= 3 } k++ sum = 0 } } close(ch) } // indicate that this is a header line, even if it doesn't have '#' prefix const HEADER_LINE = math.MinInt32 func writeHeader(wtr *bufio.Writer, rdr *bufio.Reader) error { for { b, err := rdr.Peek(1) if err != nil { return errors.Wrap(err, "error peaking for header") } if b[0] != '#' { break } line, err := rdr.ReadBytes('\n') if err != nil { return err } wtr.Write(line) } return nil } // fast path where we don't use merge if it all fit in memory. func writeOne(fname string, wtr io.Writer) error { rdr, err := os.Open(fname) if err != nil { return err } defer rdr.Close() gz, err := gzip.NewReader(rdr) if err == io.EOF { return nil } if err != nil { return err } _, err = io.Copy(wtr, gz) return errors.Wrapf(err, "error copying from %s", fname) } func merge(fileNames []string, wtr io.Writer, process Processor) error { start := time.Now() fhs := make([]*bufio.Reader, len(fileNames)) cache := chunk{lines: make([][]byte, len(fileNames)), Cols: make([][]int, len(fileNames)), idxs: make([]int, len(fileNames))} for i, fn := range fileNames { fh, err := os.Open(fn) if err != nil { return errors.Wrap(err, fmt.Sprintf("error opening: %s", fn)) } defer fh.Close() //gz, err := newFastGzReader(fh) gz, err := gzip.NewReader(fh) if err != nil { return errors.Wrap(err, fmt.Sprintf("error reading %s as gzip", fn)) } defer gz.Close() fhs[i] = bufio.NewReader(gz) line, err := fhs[i].ReadBytes('\n') if len(line) > 0 { cache.lines[i] = line cache.Cols[i] = process(line) cache.idxs[i] = i } else if err == io.EOF { continue } else if err != nil { return err } } heap.Init(&cache) for { o := heap.Pop(&cache) if o == nil { break } 
c := o.(pair) // refill from same file line, err := fhs[c.idx].ReadBytes('\n') if err != io.EOF && err != nil { return err } if len(line) != 0 { next := pair{line: line, idx: c.idx, cols: process(line)} heap.Push(&cache, next) } else { os.Remove(fileNames[c.idx]) } wtr.Write(c.line) } log.Printf("time to merge %d files: %.3f", len(fileNames), time.Since(start).Seconds()) return nil } func init() { // make sure we don't leave any temporary files. c := make(chan os.Signal, 1) pid := os.Getpid() signal.Notify(c, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT) go func() { <-c matches, err := filepath.Glob(filepath.Join(os.TempDir(), fmt.Sprintf("gsort.%d.*", pid))) if err != nil { log.Fatal(err) } for _, m := range matches { os.Remove(m) } os.Exit(3) }() } func writeChunks(ch chan [][]byte, process Processor) []string { fileNames := make([]string, 0, 20) pid := os.Getpid() for lines := range ch { if len(lines) == 0 { continue } f, err := ioutil.TempFile("", fmt.Sprintf("gsort.%d.%d.", pid, len(fileNames))) if err != nil { log.Fatal(err) } achunk := chunk{lines: lines, Cols: make([][]int, len(lines))} for i, line := range achunk.lines { achunk.Cols[i] = process(line) } //lines = nil //sort.Stable(&achunk) sort.Sort(&achunk) gz, _ := gzip.NewWriterLevel(f, flate.BestSpeed) wtr := bufio.NewWriterSize(gz, 65536) for i, line := range achunk.lines { wtr.Write(line) achunk.lines[i] = nil lines[i] = nil } wtr.Flush() achunk.Cols, lines = nil, nil achunk.lines = nil gz.Close() f.Close() fileNames = append(fileNames, f.Name()) } runtime.GC() return fileNames } gsort-0.1.4/gsort_test.go000066400000000000000000000040621366152565600154170ustar00rootroot00000000000000package gsort_test import ( "bytes" "log" "math" "strconv" "strings" "testing" "github.com/brentp/gsort" . 
"gopkg.in/check.v1" ) func Test(t *testing.T) { TestingT(t) } type GSortTest struct{} var _ = Suite(&GSortTest{}) func (s *GSortTest) TestSort1(c *C) { data := strings.NewReader(`a 1 b 2 a 3 `) pp := func(line []byte) []int { l := make([]int, 2) toks := bytes.Split(line, []byte{'\t'}) l[0] = int(toks[0][0]) if len(toks) > 1 { v, err := strconv.Atoi(string(toks[1])) if err != nil { l[1] = -1 } else { l[1] = v } } else { l[1] = -1 } return l } b := make([]byte, 0, 20) wtr := bytes.NewBuffer(b) err := gsort.Sort(data, wtr, pp, 22, nil) c.Assert(err, IsNil) c.Assert(wtr.String(), Equals, `a 1 a 3 b 2 `) } func (s *GSortTest) TestSort2(c *C) { // sort by number, then reverse letter data := strings.NewReader(`a 1 b 2 a 3 g 1 `) pp := func(line []byte) []int { l := make([]int, 2) toks := bytes.Split(line, []byte{'\t'}) l[1] = -int(toks[0][0]) if len(toks) > 1 { toks[1] = bytes.TrimSuffix(toks[1], []byte{'\n'}) v, err := strconv.Atoi(string(toks[1])) if err != nil { l[0] = -1 } else { l[0] = v } } else { l[0] = math.MinInt32 } return l } b := make([]byte, 0, 20) wtr := bytes.NewBuffer(b) err := gsort.Sort(data, wtr, pp, 22, nil) c.Assert(err, IsNil) c.Assert(wtr.String(), Equals, `g 1 a 1 b 2 a 3 `) // sort numbers in reverse rev := func(line []byte) []int { l := make([]int, 2) toks := bytes.Split(line, []byte{'\t'}) l[1] = -int(toks[0][0]) if len(toks) > 1 { toks[1] = bytes.TrimSuffix(toks[1], []byte{'\n'}) v, err := strconv.Atoi(string(toks[1])) if err != nil { log.Println(err) l[0] = 1 } else { // NOTE added negative here l[0] = -v } } else { l[0] = math.MaxInt32 } return l } b = make([]byte, 0, 20) wtr = bytes.NewBuffer(b) data = strings.NewReader(`a 1 b 2 a 3 g 1`) err = gsort.Sort(data, wtr, rev, 22, nil) c.Assert(err, IsNil) c.Assert(wtr.String(), Equals, `a 3 b 2 g 1 a 1 `) } gsort-0.1.4/test.gff000066400000000000000000000013661366152565600143420ustar00rootroot00000000000000##gff-version 3 ##sequence-region CHROM1 1 20386 ### CHROM1 Cufflinks mRNA 1473 16154 . - . 
ID=XLOC_228.2;description=228 CHROM1 Cufflinks exon 1473 1814 . - . Parent=XLOC_228.2 CHROM1 Cufflinks exon 11626 12574 . - . Parent=XLOC_228.2 CHROM1 Cufflinks exon 12695 12721 . - . Parent=XLOC_228.2 CHROM1 Cufflinks exon 13637 13726 . - . Parent=XLOC_228.2 CHROM1 Cufflinks exon 15329 15408 . - . Parent=XLOC_228.2 CHROM1 Cufflinks exon 15994 16154 . - . Parent=XLOC_228.2 ### CHROM1 Cufflinks mRNA 1473 16386 . - . ID=XLOC_228.3 CHROM1 Cufflinks exon 1473 12024 . - . Parent=XLOC_228.3 CHROM1 Cufflinks exon 12615 12721 . - . Parent=XLOC_228.3 CHROM1 Cufflinks exon 13637 13726 . - . Parent=XLOC_228.3 CHROM1 Cufflinks exon 15329 16386 . - . Parent=XLOC_228.3 gsort-0.1.4/test.gff.genome000066400000000000000000000000151366152565600156010ustar00rootroot00000000000000CHROM1 80000 gsort-0.1.4/test.vcf-like.genome000066400000000000000000000000621366152565600165410ustar00rootroot00000000000000#chrom length 1 249250621 12 133851895 Y 59373566 gsort-0.1.4/test.vcf-like.tsv000066400000000000000000000016011366152565600161030ustar00rootroot00000000000000#chrom pos ref alt strand gene_symbol prediction class score 1 12623 A C + DDX11L9 benign neutral 0.386 1 12624 T C + DDX11L9 possiblydamaging deleterious 0.89 1 12625 G A + DDX11L9 possiblydamaging deleterious 0.769 Y 59356108 G T + WASH1 benign neutral 0.026 1 12626 C G + DDX11L9 benign neutral 0 12 78793 G A + DKFZp434K1323 possiblydamaging deleterious 0.851 1 12627 C G + DDX11L9 benign neutral 0 Y 59356110 A G + WASH1 benign neutral 0 Y 59356113 A C + WASH1 possiblydamaging deleterious 0.952 Y 59356113 A G + WASH1 benign neutral 0.004 Y 59356114 G A + WASH1 possiblydamaging deleterious 0.736 12 78791 A G + DKFZp434K1323 possiblydamaging deleterious 0.713 12 78792 T A + DKFZp434K1323 possiblydamaging deleterious 0.932 12 78794 A C + DKFZp434K1323 possiblydamaging deleterious 0.895 Y 59356111 G C + WASH1 possiblydamaging deleterious 0.501 Y 59356112 C A + WASH1 benign neutral 0.003